Improve Handling of SSD failures

Description

a. Alluxio is configured with an SSD-only tiered store:
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=SSD
alluxio.worker.tieredstore.level0.dirs.path=/hadoop01/alluxio/data/,/hadoop02/alluxio/data/,/hadoop03/alluxio/data/,/hadoop04/alluxio/data/
alluxio.worker.tieredstore.level0.dirs.quota=250GB,250GB,250GB,250GB
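With a single tier, a failed directory cannot be compensated for by another tier, so the worker keeps serving block paths that live on the dead disk. A minimal sketch of a defensive startup check is below; `TierDirCheck` and `usableDirs` are hypothetical names for illustration (this is not Alluxio's actual implementation), showing how the comma-separated `dirs.path` list from the configuration above could be probed and bad directories skipped before they cause read failures.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical startup probe: parse a comma-separated tiered-store
// directory list (same format as alluxio.worker.tieredstore.level0.dirs.path)
// and keep only directories that are actually accessible.
public class TierDirCheck {
    public static List<String> usableDirs(String dirsProperty) {
        List<String> usable = new ArrayList<>();
        for (String dir : dirsProperty.split(",")) {
            File f = new File(dir.trim());
            // A disk returning EIO typically fails these cheap accessibility
            // probes, so the worker could exclude it instead of registering it.
            if (f.isDirectory() && f.canRead() && f.canWrite()) {
                usable.add(dir.trim());
            }
        }
        return usable;
    }
}
```

A worker performing such a check could come up with the three healthy directories instead of failing every read that lands on `/hadoop01`.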

b. One of the disks on the Alluxio worker node is down:
[16:13]:[admadan@lvsdg1dn0079:~]$ ls -la /hadoop01
ls: cannot access /hadoop01: Input/output error

c. The file being accessed exists in both Alluxio and the UFS (it is persisted), but the Alluxio copy is unreadable because its block resides on the failed disk.

d. Tasks (both local and remote reads) consistently fail, leading to job failures:
2018-11-20 16:06:56,152 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:266)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:213)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:333)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:720)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:252)
... 11 more
Caused by: java.io.FileNotFoundException: /hadoop01/alluxio/data/alluxioworker/5598875746306 (Input/output error)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:124)
at alluxio.worker.block.io.LocalFileBlockReader.<init>(LocalFileBlockReader.java:46)
at alluxio.client.block.stream.LocalFilePacketReader$Factory.create(LocalFilePacketReader.java:155)
at alluxio.client.block.stream.BlockInStream.readPacket(BlockInStream.java:379)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:265)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:252)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:242)
at alluxio.client.file.FileInStream.read(FileInStream.java:126)
at alluxio.hadoop.HdfsFileInputStream.read(HdfsFileInputStream.java:100)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:66)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:419)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:255)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:97)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:83)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:71)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:67)
... 16 more
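Since the file is persisted, the stack trace above need not be fatal: the desired behavior is that an I/O error on the local SSD read falls back to the UFS copy instead of failing the task. A minimal sketch of that fallback is below; `FallbackRead` and the two `Callable` readers are placeholders, not Alluxio's real client API.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Sketch of the improvement this ticket asks for: if the local (SSD)
// block read fails with an I/O error, retry against the persisted UFS
// copy instead of propagating the failure to the task.
public class FallbackRead {
    public static byte[] readWithFallback(Callable<byte[]> localRead,
                                          Callable<byte[]> ufsRead) throws Exception {
        try {
            // Normal path: read the block from the worker's SSD.
            return localRead.call();
        } catch (IOException e) {
            // Disk is unreadable (e.g. EIO). The file is persisted,
            // so re-read it from the under file system.
            return ufsRead.call();
        }
    }
}
```

In the scenario above, the `FileNotFoundException` (an `IOException`) from `LocalFileBlockReader` would be caught and the read satisfied from the UFS, keeping the job alive.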

Environment

None

Status

Assignee

Bin Feng

Reporter

Bin Feng

Labels

None

Components

Fix versions

Affects versions

1.8.1

Priority

Major