When I cat a file:
alluxio fs cat /data/mytest/redis.conf > /tmp/redis.conf
The file has one block, stored on two workers, worker1 and worker2.
When worker1 has crashed and I run this command on worker3, the retry policy kicks in and the command succeeds:
2018-08-14 15:08:13,935 WARN FileInStream - Failed to read block 318783881216 from worker WorkerNetAddress...
But if I run the same command on worker1, it throws an exception:
java.io.IOException: syscall:getsockopt(..) failed: Connection refused: xxx
This happens because, when we read locally, BlockInStream.createLocalBlockInStream throws a ConnectException, but only NotFoundException is caught. As a result, the client never gets the chance to select a remote node to read from.
So would it be better to catch ConnectException | NotFoundException e in BlockInStream.create around the call to createLocalBlockInStream()?
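To make the proposal concrete, here is a minimal sketch of the multi-catch fallback. It is not the actual Alluxio code: LocalFallbackSketch, StreamFactory, and the NotFoundException defined here are hypothetical stand-ins, and the streams are reduced to strings so the example runs on its own.

```java
import java.io.IOException;
import java.net.ConnectException;

class LocalFallbackSketch {
  /** Stand-in for Alluxio's NotFoundException (assumption: simplified to an IOException). */
  static class NotFoundException extends IOException {
    NotFoundException(String message) {
      super(message);
    }
  }

  /** Hypothetical factory that opens a stream; a String stands in for the stream. */
  interface StreamFactory {
    String open() throws IOException;
  }

  /**
   * Tries the local (short-circuit) stream first and falls back to the
   * remote stream when the local worker is unreachable or lacks the block.
   */
  static String create(StreamFactory local, StreamFactory remote) throws IOException {
    try {
      return local.open();
    } catch (ConnectException | NotFoundException e) {
      // Both failures mean the local path is unusable, so try the remote path.
      return remote.open();
    }
  }

  public static void main(String[] args) throws IOException {
    String result = create(
        () -> { throw new ConnectException("Connection refused"); },
        () -> "remote stream");
    System.out.println(result);
  }
}
```

Any other IOException from the local path still propagates to the caller, so only the two named failures trigger the fallback.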
Thanks for reporting this. I agree – if the client fails to connect to one worker with the block, it should try again with a different worker.
NotFoundException is handled by creating a netty stream to the worker instead of a local block stream. If the worker is dead, the netty stream will fail as well because it’s still trying to talk to the same worker. To recover from connect exceptions here, we need to retry at a higher level. We need to communicate to FileInStream that the worker couldn’t be reached. Perhaps we could add a new WorkerConnectFailed exception which contains the ID of the worker that couldn’t be contacted. Then in FileInStream we can catch the exception, add the killed worker to the mFailedWorkers list, and retry.
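A rough sketch of that higher-level retry, under stated assumptions: WorkerConnectFailedException, BlockReader, and readWithRetry are hypothetical names, workers are identified by plain strings, and failedWorkers is only a simplified stand-in for FileInStream's mFailedWorkers. It shows the control flow, not the real client classes.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RetryOnConnectFailureSketch {
  /** Hypothetical exception that records which worker could not be reached. */
  static class WorkerConnectFailedException extends IOException {
    private final String mWorker;

    WorkerConnectFailedException(String worker, Throwable cause) {
      super("Failed to connect to worker " + worker, cause);
      mWorker = worker;
    }

    String getWorker() {
      return mWorker;
    }
  }

  /** Hypothetical stand-in for opening and reading a block from one worker. */
  interface BlockReader {
    byte[] read(String worker, long blockId) throws IOException;
  }

  /**
   * Tries each worker that holds the block, skipping known-failed ones.
   * A connect failure marks the worker as failed and moves on to the next.
   */
  static byte[] readWithRetry(List<String> workers, long blockId, BlockReader reader,
      Set<String> failedWorkers) throws IOException {
    IOException last = null;
    for (String worker : workers) {
      if (failedWorkers.contains(worker)) {
        continue;
      }
      try {
        return reader.read(worker, blockId);
      } catch (WorkerConnectFailedException e) {
        failedWorkers.add(e.getWorker());
        last = e;
      }
    }
    throw new IOException("No reachable worker holds block " + blockId, last);
  }

  public static void main(String[] args) throws IOException {
    Set<String> failed = new HashSet<>();
    BlockReader reader = (worker, id) -> {
      if (worker.equals("worker1")) {
        // Simulate the crashed worker.
        throw new WorkerConnectFailedException(worker, new ConnectException("refused"));
      }
      return new byte[] {1, 2, 3};
    };
    byte[] data = readWithRetry(Arrays.asList("worker1", "worker2"), 318783881216L, reader, failed);
    System.out.println(data.length + " bytes read, failed workers: " + failed);
  }
}
```

Because the exception carries the worker's identity, the retry loop can exclude exactly that worker instead of guessing which connection failed.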
Yes. But from what I see in the 1.8.0 code in BlockInStream.class, NotFoundException is handled by creating a local block stream? Also, if we catch a new connect exception in FileInStream, the worker may not be added to the mFailedWorkers list, because the updateStream() call in FileInStream.class is not wrapped in a try/catch (which puzzles me).
So I put my change in ALLUXIO-3293.1.patch, and it works well. Of course, catching a new exception type here may be better.
Would you mind submitting your PR to our GitHub so that the code review can start there?