2016-04-14 21:27:13,174 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20542ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=20898ms
2016-04-14 21:27:13,174 WARN  [regionserver60020.periodicFlusher] util.Sleeper: We slept 20936ms instead of 100ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2246)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1496)
2016-04-14 21:32:40,173 WARN  [B.DefaultRpcServer.handler=125,queue=5,port=60020] ipc.RpcServer: RpcServer.respondercallId: 7540 service: ClientService methodName: Multi size: 100.2 K connection: 10.160.247.139:56031: output error
2016-04-14 21:32:40,173 WARN  [B.DefaultRpcServer.handler=125,queue=5,port=60020] ipc.RpcServer: B.DefaultRpcServer.handler=125,queue=5,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
(concurrent mode failure): 45876255K->21800674K(46137344K), 10.0625300 secs] 48792749K->21800674K(49283072K), [CMS Perm : 43274K->43274K(262144K)], 10.2083040 secs] [Times: user=12.02 sys=0.00, real=10.20 secs]
2016-04-14 21:22:43,990 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10055ms
GC pool 'ParNew' had collection(s): count=2 time=244ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10062ms
2016-04-14 21:22:44,006 WARN  [ResponseProcessor for block BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226
java.io.IOException: Bad response ERROR for block BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226 from datanode 10.160.173.93:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:732)
2016-04-14 21:22:43,789 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226 received exception java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.160.173.94:50010 remote=/10.160.173.94:30110]
2016-04-14 21:22:43,779 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hz-hbase4.photo.163.org:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.160.173.94:30123 dest: /10.160.173.94:50010
java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.160.173.94:50010 remote=/10.160.173.94:30123]

Clearly, the log shows that the datanode kept waiting for a read request from the client, but the request never arrived before the SocketTimeout fired; at that point the datanode closed the connection, which is why the client saw the "Bad response ERROR ***" exception above. So what does this have to do with Full GC? Simple: the Full GC suspended all of HBase's internal threads, so the read request bound for the datanode was suspended as well. The datanode waited and waited, got nothing, and finally had no choice but to drop the connection.
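To make the timeout mechanics concrete, here is a minimal, self-contained Java sketch (not HBase or HDFS code) of the same pattern: a server waits for data with a socket read timeout while its peer is stalled, gives up, and drops the connection, just as the datanode does when the HBase side is frozen in a Full GC. The port and the timeout/stall durations are made up for the demo.

    import java.io.InputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class StalledPeerDemo {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(0)) {          // plays the datanode
                Thread stalledClient = new Thread(() -> {
                    try (Socket s = new Socket("127.0.0.1", server.getLocalPort())) {
                        Thread.sleep(5_000);                           // plays HBase stuck in a long GC pause: the request is never sent
                    } catch (Exception ignored) { }
                });
                stalledClient.start();
                try (Socket peer = server.accept()) {
                    peer.setSoTimeout(2_000);                          // analogous to the 10000 ms read timeout in the datanode log
                    InputStream in = peer.getInputStream();
                    in.read();                                         // blocks waiting for the request that never arrives
                } catch (SocketTimeoutException e) {
                    System.out.println("gave up waiting, dropping the connection: " + e);
                }
                stalledClient.join();
            }
        }
    }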

2016-04-14 21:32:38,893 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: 1073941669, srvID: DS-22834907-10.160.173.94-50010-1448595406972, success: true
2016-04-14 21:32:38,894 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: 1073941669, srvID: DS-22834907-10.160.173.94-50010-1448595406972, success: true
Technical expert at the NetEase Hangzhou Research Institute. Works on kernel development and operations for NetEase's internal Hadoop & HBase components; specializes in big-data architecture design, performance optimization, and troubleshooting.
Author of the book "HBase原理与实践" (HBase: Principles and Practice).
WeChat public account: 大数据基建.
Email: [email protected]

The post is very carefully written and clearly structured; I've learned a lot from it ~
Let me ask a question here: storing files in HBase with the filename (UUID+TIMESTAMP) as the RowKey and the file content as the Value, is that a sound approach?
The files are generally a few tens of KB each, arriving at roughly 50 files per second, and there may be jobs that extract several days' worth of files back out to disk.

In theory it works, but if, as you describe, you need to read several days' worth of files, you would have to use something like a filter-based scan, which is likely to be fairly inefficient. I'd suggest MongoDB instead, with the timestamp as a secondary index; that would satisfy this kind of read pattern.
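For what it's worth, a minimal sketch (HBase 1.x client API) of the kind of scan being discussed: with UUID+TIMESTAMP rowkeys there is no row range that corresponds to "the last few days", so pulling those files means scanning the whole table and filtering, here by cell write time. The table name "files" and the column "f:data" are invented for the example.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ExtractRecentFiles {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            long now = System.currentTimeMillis();
            long threeDaysAgo = now - 3L * 24 * 3600 * 1000;
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("files"))) {
                Scan scan = new Scan();
                scan.setTimeRange(threeDaysAgo, now);   // filter on cell write time; every region is still visited
                scan.setCaching(100);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        byte[] content = r.getValue(Bytes.toBytes("f"), Bytes.toBytes("data"));
                        // ... write content out to local disk
                    }
                }
            }
        }
    }

If the rowkey instead began with the timestamp (TIMESTAMP+UUID), the same read could be a bounded scan with setStartRow/setStopRow and would only touch the regions covering those days, at the cost of concentrating new writes on the most recent region.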


After porting it into our own version and running stress and performance tests, we decided it was fine and put it into production. There is one bug: https://issues.apache.org/jira/browse/HBASE-16222

I haven't tried the G1 collector; my guess is it only makes sense with 100 GB+ of memory. For now you can mitigate this by tuning the CMS parameters plus moving the blockcache offheap, or by running several RegionServers on a single physical machine (though monitoring that seems awkward) ~
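For reference, this is roughly what the "blockcache offheap" part looks like in an HBase 1.x setup (BucketCache in offheap mode); the sizes below are only illustrative:

    hbase-site.xml:
      <property>
        <name>hbase.bucketcache.ioengine</name>
        <value>offheap</value>
      </property>
      <property>
        <name>hbase.bucketcache.size</name>
        <value>8192</value>   <!-- off-heap BucketCache size in MB; illustrative -->
      </property>

    hbase-env.sh:
      # reserve direct memory for the off-heap cache (illustrative size)
      export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=10g"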

Several RegionServers in my HBase cluster fairly frequently hit a situation where the JVM stops working even though no GC is happening at the time.
The logs look like this:
2017-05-26 18:42:10,607 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 48721ms
No GCs detected

2017-05-26 19:12:10,790 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 49500ms
No GCs detected

Watching it in VisualGC, there really was no GC during that period, yet the whole JVM was stopped: in the panels on the right of VisualGC, the first rows, Compile and ClassLoader, were stopped as well. It's very strange, and I still haven't found the cause.
Any advice would be greatly appreciated, thanks!
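For context on where that "No GCs detected" line comes from: the pause monitor just sleeps a short, fixed interval, measures how much longer the sleep actually took, and checks whether the GC counters moved in the meantime. A simplified, illustrative re-implementation in plain Java (not the actual util.JvmPauseMonitor source):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class PauseMonitorSketch {
        private static final long SLEEP_MS = 500;             // how long we intend to sleep
        private static final long WARN_THRESHOLD_MS = 10_000; // warn if the sleep overran by this much

        public static void main(String[] args) throws InterruptedException {
            while (true) {
                long gcBefore = totalGcCount();
                long start = System.nanoTime();
                Thread.sleep(SLEEP_MS);
                long extraMs = (System.nanoTime() - start) / 1_000_000 - SLEEP_MS;
                long gcAfter = totalGcCount();
                if (extraMs > WARN_THRESHOLD_MS) {
                    String cause = (gcAfter > gcBefore)
                            ? "GC pool collection(s) observed during the pause"
                            : "No GCs detected";
                    System.out.println("Detected pause of approximately " + extraMs + "ms: " + cause);
                }
            }
        }

        private static long totalGcCount() {
            long count = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                count += gc.getCollectionCount();
            }
            return count;
        }
    }

So "No GCs detected" only means the GC beans reported no collections during the pause; the stall can still come from things outside the collector, such as swapping, a frozen host or VM, or long safepoint operations.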

At the moment we treat this kind of long JVM pause the same as a GC pause (I'm not sure of the exact cause myself; if you'd like to dig into it, add me on WeChat, libisthanks, and I can pull you into a group to discuss). How large is the total JVM memory on your RegionServers? Have you enabled the offheap mode of BucketCache?

14336

2. The RegionServer JVM heap is 18G, configured as follows:
HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:SurvivorRatio=4 -server -Xmx18g -Xms18g -Xmn1g -Xss256k -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseParNewGC -XX:MaxTenuringThreshold=10 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60 -XX:-DisableExplicitGC"
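An annotated breakdown of those flags, for readers less familiar with them (the notes are mine, not part of the original comment):

    -XX:SurvivorRatio=4                      # Eden : each Survivor space = 4 : 1
    -server                                  # server-class JIT compiler
    -Xmx18g -Xms18g                          # fixed 18 GB heap (min = max, no resizing)
    -Xmn1g                                   # 1 GB young generation
    -Xss256k                                 # 256 KB stack per thread
    -XX:PermSize=256m -XX:MaxPermSize=256m   # fixed 256 MB PermGen (JDK 7)
    -XX:+UseParNewGC                         # parallel young-generation collector, pairs with CMS
    -XX:MaxTenuringThreshold=10              # objects survive up to 10 young GCs before promotion
    -XX:+CMSParallelRemarkEnabled            # parallelize the CMS remark pause
    -XX:+UseCMSCompactAtFullCollection       # compact the old generation on a full collection
    -XX:+CMSClassUnloadingEnabled            # let CMS unload classes
    -XX:+UseCMSInitiatingOccupancyOnly       # only use the fraction below to trigger CMS
    -XX:CMSInitiatingOccupancyFraction=60    # start a CMS cycle at 60% old-generation occupancy
    -XX:-DisableExplicitGC                   # System.gc() calls are still honored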

3. After some parameter tuning, the duration of the "No GCs detected" JVM pauses has come down somewhat, but they still occur.
Here is one example (2017-06-05 19:36):
2017-06-05 19:36:40,932 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10030ms
No GCs detected

The GC log for that time window is as follows:
2017-06-05T19:36:03.369+0800: 797616.040: Total time for which application threads were stopped: 0.0137030 seconds
2017-06-05T19:36:22.091+0800: 797634.762: [GC2017-06-05T19:36:22.091+0800: 797634.762: [ParNew
Desired survivor size 89456640 bytes, new threshold 10 (max 10)
- age 1: 1033520 bytes, 1033520 total
- age 2: 236264 bytes, 1269784 total
- age 3: 211400 bytes, 1481184 total
- age 4: 260000 bytes, 1741184 total
- age 5: 173448 bytes, 1914632 total
- age 6: 159216 bytes, 2073848 total
- age 7: 173896 bytes, 2247744 total
- age 8: 179192 bytes, 2426936 total
- age 9: 182104 bytes, 2609040 total
- age 10: 191624 bytes, 2800664 total
: 702804K->3678K(873856K), 0.0122400 secs] 17881044K->17182126K(18699648K), 0.0123650 secs] [Times: user=0.21 sys=0.00, real=0.01 secs]
2017-06-05T19:36:22.104+0800: 797634.774: Total time for which application threads were stopped: 0.0134570 seconds
2017-06-05T19:36:23.323+0800: 797635.993: Total time for which application threads were stopped: 0.0010760 seconds
2017-06-05T19:36:23.323+0800: 797635.994: Total time for which application threads were stopped: 0.0002370 seconds
2017-06-05T19:36:40.931+0800: 797653.602: Total time for which application threads were stopped: 10.2617260 seconds
2017-06-05T19:36:40.931+0800: 797653.602: Total time for which application threads were stopped: 0.0002360 seconds
2017-06-05T19:36:40.932+0800: 797653.602: Total time for which application threads were stopped: 0.0002140 seconds
2017-06-05T19:36:40.932+0800: 797653.603: Total time for which application threads were stopped: 0.0002160 seconds
2017-06-05T19:36:40.934+0800: 797653.604: Total time for which application threads were stopped: 0.0003790 seconds
2017-06-05T19:36:40.934+0800: 797653.605: Total time for which application threads were stopped: 0.0002830 seconds
2017-06-05T19:36:40.935+0800: 797653.605: Total time for which application threads were stopped: 0.0001910 seconds
2017-06-05T19:36:40.935+0800: 797653.606: Total time for which application threads were stopped: 0.0001910 seconds
2017-06-05T19:36:40.935+0800: 797653.606: Total time for which application threads were stopped: 0.0001820 seconds
2017-06-05T19:36:40.951+0800: 797653.622: Total time for which application threads were stopped: 0.0003420 seconds
2017-06-05T19:36:40.952+0800: 797653.622: Total time for which application threads were stopped: 0.0001750 seconds
2017-06-05T19:36:40.952+0800: 797653.622: Total time for which application threads were stopped: 0.0001520 seconds
2017-06-05T19:36:40.952+0800: 797653.623: Total time for which application threads were stopped: 0.0001410 seconds
2017-06-05T19:36:45.881+0800: 797658.552: [GC2017-06-05T19:36:45.881+0800: 797658.552: [ParNew
Desired survivor size 89456640 bytes, new threshold 10 (max 10)
- age 1: 1593696 bytes, 1593696 total
- age 2: 244032 bytes, 1837728 total
- age 3: 210744 bytes, 2048472 total
- age 4: 211184 bytes, 2259656 total
- age 5: 193208 bytes, 2452864 total
- age 6: 173448 bytes, 2626312 total
- age 7: 159168 bytes, 2785480 total
- age 8: 173896 bytes, 2959376 total
- age 9: 179168 bytes, 3138544 total
- age 10: 182152 bytes, 3320696 total
: 702799K->4120K(873856K), 0.0133430 secs] 17881247K->17182824K(18699648K), 0.0134700 secs] [Times: user=0.22 sys=0.00, real=0.02 secs]
2017-06-05T19:36:45.895+0800: 797658.565: Total time for which application threads were stopped: 0.0145820 seconds
2017-06-05T19:36:53.935+0800: 797666.606: Total time for which application threads were stopped: 0.0010200 seconds
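A side note on reading this log: the "Total time for which application threads were stopped" lines come from -XX:+PrintGCApplicationStoppedTime, and they count every safepoint stop, not only GC stops. When a long stop like the 10.26 s one above has no matching GC record, HotSpot's safepoint statistics can show what actually triggered it; on JDK 7/8 the relevant flags would be roughly:

    -XX:+PrintGCApplicationStoppedTime   # already in effect here, judging by this log
    -XX:+PrintSafepointStatistics        # print per-safepoint operation and timing details
    -XX:PrintSafepointStatisticsCount=1  # print them for every safepoint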

"From the log content, it looks like HBase called DFSClient to load the block "BP-632656502-10.160.173.93-1448595094942:blk_1073941840_201226" from the datanode, and the datanode returned a failure."
Judging from the logs posted just before and after that passage, it looks like HBase was writing to the datanode rather than reading from it.