Elasticsearch Troubleshooting Guide: Case Studies in Detail

To simulate failure scenarios, we first set up a local three-node cluster (HTTP ports 9200, 9400, and 9600), then create a test index:
curl --location --request PUT 'http://127.0.0.1:9200/fruit-1?pretty' \
--header 'Content-Type: application/json' \
--data-raw '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "fruit_type": { "type": "keyword" }
    }
  }
}'
Checking the cluster state with the cluster health API shows status green, with 3 primary shards and 2 replicas per primary, for a total of 3 + 3 × 2 = 9 shards.
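The response below comes from the cluster health API; the request (the same one shown again in the yellow-status case later) is:

curl --location --request GET 'http://127.0.0.1:9200/_cluster/health?pretty'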
{"cluster_name": "FantuanTech-Cluster","status": "green","timed_out": false,"number_of_nodes": 3,"number_of_data_nodes": 3,"active_primary_shards": 3,"active_shards": 9,"relocating_shards": 0,"initializing_shards": 0,"unassigned_shards": 0,"delayed_unassigned_shards": 0,"number_of_pending_tasks": 0,"number_of_in_flight_fetch": 0,"task_max_waiting_in_queue_millis": 0,"active_shards_percent_as_number": 100.0}
Use the cat shards API to see how the shards are distributed across the nodes:
curl --location --request GET 'http://127.0.0.1:9200/_cat/shards/fruit-1?v'
The result below shows the three primary shards spread across the three nodes, with each node also holding two replica shards.
index   shard prirep state   docs store ip        node
fruit-1 2     r      STARTED    0  208b 127.0.0.1 FantuanTech-Node-2
fruit-1 2     r      STARTED    0  208b 127.0.0.1 FantuanTech-Node-1
fruit-1 2     p      STARTED    0  208b 127.0.0.1 FantuanTech-Node-3
fruit-1 1     r      STARTED    0  208b 127.0.0.1 FantuanTech-Node-2
fruit-1 1     p      STARTED    0  208b 127.0.0.1 FantuanTech-Node-1
fruit-1 1     r      STARTED    0  208b 127.0.0.1 FantuanTech-Node-3
fruit-1 0     p      STARTED    0  208b 127.0.0.1 FantuanTech-Node-2
fruit-1 0     r      STARTED    0  208b 127.0.0.1 FantuanTech-Node-1
fruit-1 0     r      STARTED    0  208b 127.0.0.1 FantuanTech-Node-3
Below we analyze the troubleshooting approach in detail through common failure cases.
Cluster status is yellow: the cluster can still serve requests, but some replica shards are unassigned
1. Stop the FantuanTech-Node-3 node.
2. First check the cluster state through the cluster health API. The status is yellow and unassigned_shards is 3, which exactly matches the situation: each primary has 2 replicas, and copies of the same shard cannot be placed on the same node, so with only two nodes left each primary has one replica that cannot be assigned.
curl --location --request GET 'http://127.0.0.1:9200/_cluster/health?pretty'

{
  "cluster_name": "FantuanTech-Cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 3,
  "active_shards": 6,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 3,
  "delayed_unassigned_shards": 3,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 66.66666666666666
}
3. Check which specific shards are unassigned.
There are 3 unassigned shards, and the primary shard that used to be on FantuanTech-Node-3 has moved to FantuanTech-Node-2 (its replica there was promoted to primary).
curl --location --request GET 'http://127.0.0.1:9200/_cat/shards/fruit-1?v'

index   shard prirep state      docs store ip        node
fruit-1 2     p      STARTED       0  208b 127.0.0.1 FantuanTech-Node-2
fruit-1 2     r      STARTED       0  208b 127.0.0.1 FantuanTech-Node-1
fruit-1 2     r      UNASSIGNED
fruit-1 1     r      STARTED       0  208b 127.0.0.1 FantuanTech-Node-2
fruit-1 1     p      STARTED       0  208b 127.0.0.1 FantuanTech-Node-1
fruit-1 1     r      UNASSIGNED
fruit-1 0     p      STARTED       0  208b 127.0.0.1 FantuanTech-Node-2
fruit-1 0     r      STARTED       0  208b 127.0.0.1 FantuanTech-Node-1
fruit-1 0     r      UNASSIGNED
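Incidentally, the cat shards API can also report the unassigned reason directly; unassigned.reason is a documented cat shards column:

curl --location --request GET 'http://127.0.0.1:9200/_cat/shards/fruit-1?v&h=index,shard,prirep,state,unassigned.reason'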
4. Check why the shards are unassigned.
curl --location --request GET 'http://127.0.0.1:9200/_cluster/allocation/explain'

{
  "index": "fruit-1",
  "shard": 2,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2021-01-20T07:54:56.545Z",
    "details": "node_left [eiahI52JRLub2Q1k6Ty84A]",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "ixSLP9-ERGaksd8-QX-LQQ",
      "node_name": "FantuanTech-Node-1",
      "transport_address": "127.0.0.1:9300",
      "node_attributes": {
        "ml.machine_memory": "8589934592",
        "xpack.installed": "true",
        "transform.node": "true",
        "ml.max_open_jobs": "20"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[fruit-1][2], node[ixSLP9-ERGaksd8-QX-LQQ], [R], s[STARTED], a[id=qADeEOdST0668Vk2LRgK9A]]"
        }
      ]
    },
    {
      "node_id": "u9yh5cCVTdugLl8DmERASg",
      "node_name": "FantuanTech-Node-2",
      "transport_address": "127.0.0.1:9500",
      "node_attributes": {
        "ml.machine_memory": "8589934592",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "transform.node": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[fruit-1][2], node[u9yh5cCVTdugLl8DmERASg], [P], s[STARTED], a[id=SeUTjeRYSTGvvOnozcw3uQ]]"
        }
      ]
    }
  ]
}
"allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",可以看到这里提示了无法进行分配的原因 , 是因为没有能够允许分配分片的节点 , 查看具体某个节点 , 原因更加明确的是"decider": "same_shard" 。
The reason reported here varies with the actual situation. For example, manually exclude the FantuanTech-Node-3 node by executing the following command, then restart that node:
curl --location --request PUT 'http://127.0.0.1:9200/_cluster/settings' \
--header 'Content-Type: application/json' \
--data-raw '{"transient":{"cluster.routing.allocation.exclude._name":"FantuanTech-Node-3"}}'
Running the explain API again to check the unassigned reason now yields the following hint:
{"decider": "filter","decision": "NO","explanation": "node matches cluster setting [cluster.routing.allocation.exclude] filters [_name:\\"FantuanTech-Node-3\\"]"}
5. Restart FantuanTech-Node-3; once the exclusion filter from step 4 is cleared (a sketch follows), the remaining replica shards are assigned to this node and the cluster status returns to green.
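Clearing a transient exclusion filter works the same way as the index-level reset shown later in this article: set the value back to null. A sketch:

curl --location --request PUT 'http://127.0.0.1:9200/_cluster/settings' \
--header 'Content-Type: application/json' \
--data-raw '{"transient":{"cluster.routing.allocation.exclude._name":null}}'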
Cluster status is red: the cluster cannot serve requests
1. Create a new index, but exclude every node so that its shards cannot be placed anywhere.
curl --location --request PUT 'http://127.0.0.1:9200/fruit-2?pretty' \
--header 'Content-Type: application/json' \
--data-raw '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2,
    "index.routing.allocation.exclude._name": "FantuanTech-Node-1,FantuanTech-Node-2,FantuanTech-Node-3"
  },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "fruit_type": { "type": "keyword" }
    }
  }
}'
Checking the cluster state now, status is red and unassigned_shards is 12 (the 9 shards of fruit-2, plus the 3 fruit-1 replicas blocked by the cluster-level exclusion of FantuanTech-Node-3, which the explain output below shows is in effect).
2. Use the explain API to check why the shards are unassigned.
{"index": "fruit-2","shard": 2,"primary": true,"current_state": "unassigned","unassigned_info": {"reason": "INDEX_CREATED","at": "2021-01-20T13:29:29.949Z","last_allocation_status": "no"},"can_allocate": "no","allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes","node_allocation_decisions": [{"node_id": "u9yh5cCVTdugLl8DmERASg","node_name": "FantuanTech-Node-2","transport_address": "127.0.0.1:9500","node_attributes": {"ml.machine_memory": "8589934592","ml.max_open_jobs": "20","xpack.installed": "true","transform.node": "true"},"node_decision": "no","weight_ranking": 1,"deciders": [{"decider": "filter","decision": "NO","explanation": "node matches index setting [index.routing.allocation.exclude.] filters [_name:\\"FantuanTech-Node-1 OR FantuanTech-Node-2 OR FantuanTech-Node-3\\"]"}]},{"node_id": "ixSLP9-ERGaksd8-QX-LQQ","node_name": "FantuanTech-Node-1","transport_address": "127.0.0.1:9300","node_attributes": {"ml.machine_memory": "8589934592","xpack.installed": "true","transform.node": "true","ml.max_open_jobs": "20"},"node_decision": "no","weight_ranking": 2,"deciders": [{"decider": "filter","decision": "NO","explanation": "node matches index setting [index.routing.allocation.exclude.] filters [_name:\\"FantuanTech-Node-1 OR FantuanTech-Node-2 OR FantuanTech-Node-3\\"]"}]},{"node_id": "eiahI52JRLub2Q1k6Ty84A","node_name": "FantuanTech-Node-3","transport_address": "127.0.0.1:9700","node_attributes": {"ml.machine_memory": "8589934592","ml.max_open_jobs": "20","xpack.installed": "true","transform.node": "true"},"node_decision": "no","weight_ranking": 3,"deciders": [{"decider": "filter","decision": "NO","explanation": "node matches cluster setting [cluster.routing.allocation.exclude] filters [_name:\\"FantuanTech-Node-3\\"]"}]}]}
可以看到"reason": "INDEX_CREATED"表明了是在创建索引的阶段导致的分片无法完成分配,后面的"explanation"则表明了是由于命中了cluster.routing.allocation.exclude的规则,所有的节点都无法分配分片 。
3. Reset the index.routing.allocation.exclude setting.
curl --location --request PUT 'http://127.0.0.1:9200/fruit-2/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{"index.routing.allocation.exclude._name":null}'
4. Check the cluster health again; status has returned to green. (For fruit-2 to place every primary and both of its replicas, the cluster-level exclusion of FantuanTech-Node-3 must also be cleared at this point.)
Startup reports unable to lock JVM Memory: error=12, reason=cannot allocate memory
This happens because bootstrap.memory_lock: true is configured. To resolve it, add the following to /etc/security/limits.conf:
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
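After restarting, you can verify that the lock took effect via the node info API; filter_path is a standard query parameter, and mlockall is reported under each node's process info:

curl --location --request GET 'http://127.0.0.1:9200/_nodes?filter_path=**.mlockall'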
Node CPU usage is high
1. Use the top command to find the process consuming CPU, then print the thread stacks with jstack; for details see the article "CPU使用率高怎么办?收藏这个排查手册吧". A sketch of the workflow follows.
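A minimal sketch of that workflow; <es_pid> and the thread id are placeholders for whatever top reports on your machine:

top -Hp <es_pid>                               # show per-thread CPU usage for the Elasticsearch process
printf '%x\n' <busy_thread_id>                 # convert the decimal thread id to hex
jstack <es_pid> | grep -A 20 'nid=0x<hex_id>'  # locate that thread's stack in the jstack dump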
2. Use the Elasticsearch hot_threads API to fetch the stacks of the busiest threads; by default it returns information on the 3 hottest threads. Once the high-CPU threads are identified, their stack traces point to the root cause. Below is an example:
curl --location --request GET 'http://127.0.0.1:9200/_nodes/hot_threads'

::: {FantuanTech-Node-3}{eiahI52JRLub2Q1k6Ty84A}{bJOLodfwTZ2EdLR_fEvDow}{127.0.0.1}{127.0.0.1:9700}{dilmrt}{ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
   Hot threads at 2021-01-20T15:16:01.328Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {FantuanTech-Node-1}{ixSLP9-ERGaksd8-QX-LQQ}{qYF04NPBQXeLTpIhbQoERw}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
   Hot threads at 2021-01-20T15:16:01.365Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {FantuanTech-Node-2}{u9yh5cCVTdugLl8DmERASg}{tyarJ6jbSqa57KVPsi2uAQ}{127.0.0.1}{127.0.0.1:9500}{dilmrt}{ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}
   Hot threads at 2021-01-20T15:16:01.437Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

    0.2% (842micros out of 500ms) cpu usage by thread 'elasticsearch[FantuanTech-Node-2][scheduler][T#1]'
     unique snapshot
       java.base@15/sun.nio.fs.UnixPath.getName(UnixPath.java:332)
       java.base@15/sun.nio.fs.UnixPath.getName(UnixPath.java:43)
       java.base@15/java.io.FilePermission.containsPath(FilePermission.java:753)
       java.base@15/java.io.FilePermission.impliesIgnoreMask(FilePermission.java:612)
       java.base@15/java.io.FilePermissionCollection.implies(FilePermission.java:1209)
       java.base@15/java.security.Permissions.implies(Permissions.java:178)
       java.base@15/sun.security.provider.PolicyFile.implies(PolicyFile.java:994)
       java.base@15/sun.security.provider.PolicySpiFile.engineImplies(PolicySpiFile.java:75)
       java.base@15/java.security.Policy$PolicyDelegate.implies(Policy.java:796)
       app//org.elasticsearch.bootstrap.ESPolicy.implies(ESPolicy.java:102)
       java.base@15/java.security.ProtectionDomain.implies(ProtectionDomain.java:321)
       java.base@15/java.security.ProtectionDomain.impliesWithAltFilePerm(ProtectionDomain.java:353)
       java.base@15/java.security.AccessControlContext.checkPermission(AccessControlContext.java:450)
       java.base@15/java.security.AccessController.checkPermission(AccessController.java:1036)
       java.base@15/java.lang.SecurityManager.checkPermission(SecurityManager.java:408)
       java.base@15/java.lang.SecurityManager.checkRead(SecurityManager.java:747)
       java.base@15/sun.nio.fs.UnixPath.checkRead(UnixPath.java:810)
       java.base@15/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49)
       java.base@15/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:148)
       java.base@15/java.nio.file.Files.readAttributes(Files.java:1843)
       app//org.elasticsearch.watcher.FileWatcher$FileObserver.checkAndNotify(FileWatcher.java:97)
       app//org.elasticsearch.watcher.FileWatcher$FileObserver.updateChildren(FileWatcher.java:216)
       app//org.elasticsearch.watcher.FileWatcher$FileObserver.checkAndNotify(FileWatcher.java:118)
       app//org.elasticsearch.watcher.FileWatcher.doCheckAndNotify(FileWatcher.java:71)
       app//org.elasticsearch.watcher.AbstractResourceWatcher.checkAndNotify(AbstractResourceWatcher.java:44)
       app//org.elasticsearch.watcher.ResourceWatcherService$ResourceMonitor.run(ResourceWatcherService.java:179)
       app//org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:213)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@15/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
       java.base@15/java.util.concurrent.FutureTask.run(FutureTask.java:264)
       java.base@15/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
       java.base@15/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@15/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15/java.lang.Thread.run(Thread.java:832)
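The hot_threads API also accepts parameters to tune the sampling; threads, interval, and type (cpu, wait, or block) are documented options. For example, to sample the 5 busiest CPU threads over a 1s interval:

curl --location --request GET 'http://127.0.0.1:9200/_nodes/hot_threads?threads=5&interval=1s&type=cpu'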
Node memory usage is high
1. Use the top command sorted by memory usage to find the process occupying the most memory.
2. If Elasticsearch turns out to be the heavy memory consumer, inspect its internal memory structures to determine which one is responsible (a quick heap overview is sketched below).
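Before drilling into individual structures, a per-node heap overview helps; heap.percent, heap.current, and ram.percent are standard cat nodes columns:

curl --location --request GET 'http://127.0.0.1:9200/_cat/nodes?v&h=name,heap.percent,heap.current,ram.percent'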
Check memory held by the bulk queue (note: newer Elasticsearch versions name this thread pool write)
curl --location --request GET 'http://127.0.0.1:9200/_cat/thread_pool/bulk?v'
Multiply the queue size shown in the output by the typical size of a bulk request to estimate how much memory the bulk queue is holding.
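For example, with hypothetical numbers: a queue of 200 pending requests averaging 5 MB each holds roughly 200 × 5 MB = 1 GB of heap.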
Check memory used by segments
The following command shows the total segment memory on each node:
curl --location --request GET 'http://127.0.0.1:9200/_cat/nodes?v&h=segments.memory'

segments.memory
          2.6kb
          2.6kb
          2.6kb
The following shows the memory used by each segment of each shard of an index:
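The original omits the command that produced the output below; a cat segments invocation matching its column headers would be:

curl --location --request GET 'http://127.0.0.1:9200/_cat/segments/fruit-2?v&h=index,shard,segment,size.memory'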
index   shard segment size.memory
fruit-2 0     _0              1364
fruit-2 0     _0              1364
fruit-2 0     _0              1364
fruit-2 2     _0              1364
fruit-2 2     _0              1364
fruit-2 2     _0              1364
Check memory used by the fielddata cache
curl --location --request GET 'http://127.0.0.1:9200/_cat/nodes?v&h=name,fielddata.memory_size'

name               fielddata.memory_size
FantuanTech-Node-3                    0b
FantuanTech-Node-1                    0b
FantuanTech-Node-2                    0b
Check memory used by the shard request cache
curl --location --request GET 'http://127.0.0.1:9200/_cat/nodes?v&h=name,request_cache.memory_size'

name               request_cache.memory_size
FantuanTech-Node-3                        0b
FantuanTech-Node-1                        0b
FantuanTech-Node-2                        0b
Check the node query cache
curl --location --request GET 'http://127.0.0.1:9200/_cat/nodes?v&h=name,query_cache.memory_size'

name               query_cache.memory_size
FantuanTech-Node-3                      0b
FantuanTech-Node-1                      0b
FantuanTech-Node-2                      0b
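If any of these caches turns out to be unexpectedly large, it can be dropped with the clear cache API; query, request, and fielddata are documented parameters of that endpoint (a sketch):

curl --location --request POST 'http://127.0.0.1:9200/fruit-1/_cache/clear?query=true&request=true&fielddata=true'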