Retry policy does not take effect when the block and the client are on the same worker

Description

When I cat a file:
alluxio fs cat /data/mytest/redis.conf > /tmp/redis.conf

The file has one block, stored on two workers, worker1 and worker2.
When worker1 has crashed and I run this command on worker3, the retry policy takes effect and the command succeeds:
2018-08-14 15:08:13,935 WARN FileInStream - Failed to read block 318783881216 from worker WorkerNetAddress...

But if I run the same command on worker1, it throws an exception:
java.io.IOException: syscall:getsockopt(..) failed: Connection refused: xxx
at alluxio.cli.fs.command.AbstractFileSystemCommand.runWildCardCmd(AbstractFileSystemCommand.java:79)
at alluxio.cli.fs.command.CatCommand.run(CatCommand.java:86)
at alluxio.cli.AbstractShell.run(AbstractShell.java:100)
at alluxio.cli.fs.FileSystemShell.main(FileSystemShell.java:65)

This happens because, when reading locally, BlockInStream.createLocalBlockInStream throws a ConnectException, but only NotFoundException is caught. As a result, the client fails to select a remote node to read from.

So would it be better to catch ConnectException | NotFoundException e in BlockInStream.create when handling createLocalBlockInStream()?
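
The proposed multi-catch could look roughly like the sketch below. It is not the Alluxio source: LocalReadFallbackSketch, StreamFactory and the NotFoundException class defined here are hypothetical stand-ins, java.net.ConnectException is assumed to be the exception type involved, and the remote path is assumed to be able to reach a worker other than the crashed local one.

```java
import java.io.IOException;
import java.net.ConnectException;

final class LocalReadFallbackSketch {
  /** Hypothetical stand-in for Alluxio's NotFoundException. */
  static class NotFoundException extends IOException {
    NotFoundException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for "open a stream"; returns a label instead of a real stream. */
  interface StreamFactory {
    String open() throws IOException;
  }

  /** Tries the local path first and falls back to the remote path on either exception. */
  static String create(StreamFactory local, StreamFactory remote) throws IOException {
    try {
      return local.open();
    } catch (ConnectException | NotFoundException e) {
      // Local worker is unreachable or does not have the block: fall through to remote.
    }
    return remote.open();
  }

  public static void main(String[] args) throws IOException {
    String source = create(
        () -> { throw new ConnectException("Connection refused"); },
        () -> "remote");
    System.out.println("Block read via: " + source);
  }
}
```

Any other IOException from the local path still propagates to the caller, so only the two named failure modes trigger the fallback.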

Environment

alluxio 1.8.0

Activity

Andrew Audibert
August 15, 2018, 8:23 AM

Thanks for reporting this. I agree – if the client fails to connect to one worker with the block, it should try again with a different worker.

NotFoundException is handled by creating a netty stream to the worker instead of a local block stream. If the worker is dead, the netty stream will fail as well because it’s still trying to talk to the same worker. To recover from connect exceptions here, we need to retry at a higher level. We need to communicate to FileInStream that the worker couldn’t be reached. Perhaps we could add a new WorkerConnectFailed exception which contains the ID of the worker that couldn’t be contacted. Then in FileInStream we can catch the exception, add the killed worker to the mFailedWorkers list, and retry.
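
A minimal sketch of that idea, under stated assumptions: WorkerConnectFailedException, BlockReader and readWithRetry are hypothetical names, workers are identified by plain strings rather than WorkerNetAddress, and the failed-worker set only mimics the role of mFailedWorkers. It is not the actual FileInStream code.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FailedWorkerRetrySketch {
  /** Hypothetical exception that names the worker that could not be contacted. */
  static class WorkerConnectFailedException extends IOException {
    private final String mWorker;

    WorkerConnectFailedException(String worker, Throwable cause) {
      super("Failed to connect to worker " + worker, cause);
      mWorker = worker;
    }

    String getWorker() {
      return mWorker;
    }
  }

  /** Hypothetical stand-in for reading a block from one worker. */
  interface BlockReader {
    byte[] read(String worker) throws IOException;
  }

  /** Retries at the caller level, skipping every worker already known to be unreachable. */
  static byte[] readWithRetry(List<String> workers, BlockReader reader) throws IOException {
    Set<String> failedWorkers = new HashSet<>(); // plays the role of mFailedWorkers
    IOException lastError = null;
    for (int attempt = 0; attempt < workers.size(); attempt++) {
      String candidate = null;
      for (String worker : workers) {
        if (!failedWorkers.contains(worker)) {
          candidate = worker;
          break;
        }
      }
      if (candidate == null) {
        break; // every known location has failed
      }
      try {
        return reader.read(candidate);
      } catch (WorkerConnectFailedException e) {
        failedWorkers.add(e.getWorker()); // remember the dead worker, then retry
        lastError = e;
      }
    }
    throw new IOException("No reachable worker holds the block", lastError);
  }

  public static void main(String[] args) throws IOException {
    BlockReader reader = worker -> {
      if (worker.equals("worker1")) {
        throw new WorkerConnectFailedException(worker, new ConnectException("Connection refused"));
      }
      return new byte[] {42};
    };
    byte[] data = readWithRetry(Arrays.asList("worker1", "worker2"), reader);
    System.out.println("Read " + data.length + " byte(s) from a surviving worker");
  }
}
```

Carrying the worker ID in the exception lets the retry loop exclude the right worker even when the failure surfaces several layers below the code that chose it.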

snodawn
August 15, 2018, 1:41 PM

Yes. But looking at the 1.8.0 code in BlockInStream.class, isn't NotFoundException handled by creating a local block stream? Also, if we catch a new connect exception in FileInStream, the worker may not be added to the mFailedWorkers list, because the updateStream() call in FileInStream.class isn't inside the try/catch (which puzzles me).

So I added my code in ALLUXIO-3293.1.patch, and it works well. Of course, catching a new exception here may be better.

Bin Fan
August 18, 2018, 4:21 AM

Would you mind submitting your PR to our GitHub so the code review can start there?

snodawn
August 21, 2018, 7:58 PM

OK, thanks. I have submitted a PR at https://github.com/Alluxio/alluxio/pull/7785.

Bin Fan
September 11, 2018, 2:50 AM
Fixed

Assignee

snodawn

Reporter

snodawn

Labels

None

Components

Affects versions

Priority

Major