Hadoop cluster DataNode fails to start: “DiskChecker$DiskErrorExceptio
2020-11-09 13:15:25 Editor: 小采

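I recently copied a configuration from an online (production) machine to an offline machine and found that the Hadoop DataNode would not start. It kept reporting this exception: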

2014-03-11 10:38:44,238 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1337291857-192.168.2.50-1394505472069 (storage id DS-1593966629-192.168.2.50-50010-1394505524173) service to search002050.sqa.cm4/192.168.2.50:9000
org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid volume failure config value: 1
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:183)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:920)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:882)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:308)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:218)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:660)
    at java.lang.Thread.run(Thread.java:662)

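The cause:
The dfs.datanode.failed.volumes.tolerated parameter had been copied straight from the online configuration, where it is set to 1.
Its documented meaning is: "The number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown." In other words, it is the number of failed disks a DataNode can tolerate.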

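In a Hadoop cluster, disks frequently become read-only or fail outright. At startup a DataNode uses the directories configured under dfs.datanode.data.dir (which store blocks). If some of them are unusable and their number exceeds the value configured above, the DataNode fails to start.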
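The startup rule described above can be sketched as a small standalone program. This is only an illustration: the class name FailedVolumeCount, its methods, and the path /hypothetical/missing/disk are made up for this example, and the usability test (directory exists, is readable and writable) is a simplifying assumption rather than Hadoop's actual DiskChecker logic.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class FailedVolumeCount {
    // Assumption: a data directory counts as usable if it exists as a
    // directory and is both readable and writable. Hadoop's real checks
    // are more involved.
    public static boolean isUsable(File dir) {
        return dir.isDirectory() && dir.canRead() && dir.canWrite();
    }

    // Counts the configured data directories that fail the usability test.
    public static int countFailed(List<File> dataDirs) {
        int failed = 0;
        for (File dir : dataDirs) {
            if (!isUsable(dir)) {
                failed++;
            }
        }
        return failed;
    }

    // Startup is allowed only while failures do not exceed the tolerated count.
    public static boolean canStart(List<File> dataDirs, int failuresTolerated) {
        return countFailed(dataDirs) <= failuresTolerated;
    }

    public static void main(String[] args) {
        // One directory that normally exists, one hypothetical path that does not.
        List<File> dirs = Arrays.asList(
                new File(System.getProperty("java.io.tmpdir")),
                new File("/hypothetical/missing/disk"));
        System.out.println("failed volumes: " + countFailed(dirs));
        System.out.println("can start, tolerating 1: " + canStart(dirs, 1));
        System.out.println("can start, tolerating 0: " + canStart(dirs, 0));
    }
}
```

With one unusable directory out of two, startup is permitted when one failure is tolerated and refused when none is.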

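In the online environment dfs.datanode.data.dir is configured with 10 disks, so setting dfs.datanode.failed.volumes.tolerated to 1 allows one disk to be bad. The offline machine has only one disk, so volFailuresTolerated and volsConfigured are both 1. The code requires volFailuresTolerated to be smaller than volsConfigured, so the configuration is rejected.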
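To make the comparison concrete, here is a minimal sketch that applies the same range condition to the two environments. The class VolumeToleranceCheck and the method isValidTolerance are hypothetical names introduced for this example. Only the condition itself (the tolerated count must be at least 0 and strictly less than the number of configured volumes) comes from the Hadoop source quoted in this post.

```java
public class VolumeToleranceCheck {
    // Mirrors the range condition in the quoted source: the value is valid
    // only when 0 <= volFailuresTolerated < volsConfigured.
    public static boolean isValidTolerance(int volsConfigured, int volFailuresTolerated) {
        return volFailuresTolerated >= 0 && volFailuresTolerated < volsConfigured;
    }

    public static void main(String[] args) {
        // Online: 10 data directories, 1 tolerated failure.
        System.out.println("online  (10 volumes, tolerate 1): " + isValidTolerance(10, 1));
        // Offline: 1 data directory with the copied value of 1.
        System.out.println("offline (1 volume, tolerate 1): " + isValidTolerance(1, 1));
        // Offline: 1 data directory with no tolerated failures.
        System.out.println("offline (1 volume, tolerate 0): " + isValidTolerance(1, 0));
    }
}
```

Under this condition, a machine with a single data directory can only pass with a tolerated value of 0, which suggests that lowering the copied value on the offline machine would let the check succeed.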

For details, see line 182 of FsDatasetImpl.java in the Hadoop source:

    // The number of volumes required for operation is the total number
    // of volumes minus the number of failed volumes we can tolerate.
    final int volFailuresTolerated =
        conf.getInt(DFSConfigKeys.DFS_DATANODE_FAILED_VOLUMES_TOLERATED_KEY,
                    DFSConfigKeys.DFS_DATANODE_FAILED_VOLUMES_TOLERATED_DEFAULT);
    String[] dataDirs = conf.getTrimmedStrings(DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY);
    int volsConfigured = (dataDirs == null) ? 0 : dataDirs.length;
    int volsFailed = volsConfigured - storage.getNumStorageDirs();
    this.validVolsRequired = volsConfigured - volFailuresTolerated;
    if (volFailuresTolerated < 0 || volFailuresTolerated >= volsConfigured) {
      throw new DiskErrorException("Invalid volume failure "
          + " config value: " + volFailuresTolerated);
    }
    if (volsFailed > volFailuresTolerated) {
      throw new DiskErrorException("Too many failed volumes - "
          + "current valid volumes: " + storage.getNumStorageDirs()
          + ", volumes configured: " + volsConfigured
          + ", volumes failed: " + volsFailed
          + ", volume failures tolerated: " + volFailuresTolerated);
    }