Retry reading S3 files when facing Amazon connection reset issue.


I'm running into an issue with long-running jobs that access files (read-only) from S3 via Alluxio FUSE. Without fail (three to four times so far), the job runs into a connection reset and fails after about 3-4 hours of constant access to thousands of small files stored in an S3 bucket. Attached are the relevant errors from the end of worker.log and fuse.log when the failure occurred (there was nothing related in master.log). Is there any known problem with long-running jobs like this that otherwise work normally for several hours at a time?

This issue is illustrated above. It would be better to retry reading S3 files when encountering a connection reset.




John Landahl
September 24, 2018, 8:43 PM

I originally reported this on the mailing list and wanted to mention that I’ve been unable to use Alluxio for long-running jobs so far because of this problem. I’d be happy to take a look at the relevant code if someone knows the general area where this would be fixed.

September 24, 2018, 8:46 PM

The class that manages reading S3 files is alluxio.underfs.s3a.S3AInputStream.

This file is located at alluxio/underfs/s3a/src/main/java/alluxio/underfs/s3a/S3AInputStream.

You're welcome to contribute!

Thanks a lot!

John Landahl
September 25, 2018, 12:00 AM

Going by my worker.log sample, it appears the exception is being caught at line 365, which is in PacketReader.runInternal(). At first glance I don't see a clear place in the call stack where a retry loop could be placed, especially one that opens a new S3AInputStream. It's also not clear where or how the stream is actually opened. Any hints on where to look?

Bin Fan
October 2, 2018, 11:15 PM

One way to retry: catch the exception in alluxio.worker.block.UnderFileSystemBlockReader#transferTo, reset mUnderFileSystemInputStream to null on that exception, and invoke updateUnderFileSystemInputStream, which will create a new input stream.
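To make the suggested flow concrete, here is a minimal, self-contained sketch of the retry idea: on an IOException (such as a connection reset), drop the cached input stream and reopen it on the next attempt. The names StreamSupplier and RetryingReader are hypothetical stand-ins for the Alluxio classes mentioned above, not actual Alluxio APIs; the real fix would live inside UnderFileSystemBlockReader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RetryingReader {
  // Hypothetical stand-in for whatever opens an S3AInputStream.
  interface StreamSupplier {
    InputStream open() throws IOException;
  }

  private final StreamSupplier mSupplier;
  private final int mMaxRetries;
  private InputStream mStream; // analogous to mUnderFileSystemInputStream

  RetryingReader(StreamSupplier supplier, int maxRetries) {
    mSupplier = supplier;
    mMaxRetries = maxRetries;
  }

  /** Reads one byte, reopening the stream and retrying on IOException. */
  int read() throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt <= mMaxRetries; attempt++) {
      try {
        if (mStream == null) {
          // Analogous to updateUnderFileSystemInputStream creating a new stream.
          mStream = mSupplier.open();
        }
        return mStream.read();
      } catch (IOException e) {
        last = e;
        mStream = null; // reset so the next attempt reopens the stream
      }
    }
    throw last;
  }

  public static void main(String[] args) throws IOException {
    // Simulate a source whose first stream fails with a "connection reset",
    // while the reopened stream succeeds.
    final int[] calls = {0};
    RetryingReader reader = new RetryingReader(() -> {
      calls[0]++;
      if (calls[0] == 1) {
        return new InputStream() {
          @Override
          public int read() throws IOException {
            throw new IOException("Connection reset");
          }
        };
      }
      return new ByteArrayInputStream(new byte[] {42});
    }, 3);
    System.out.println(reader.read()); // first byte, read after one retry
  }
}
```

The key design point is that the stream reference is nulled out inside the catch block, so the reopen happens lazily on the next loop iteration rather than immediately, matching the "reset to null, then recreate" flow described above.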

An alternative approach is to look at alluxio.worker.block.UfsInputStreamManager. There we cache input streams to reuse connections (to avoid creating too many when readers access the same S3 object), but based on this report it seems we may be putting too many requests on one connection.
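As a toy illustration of that second idea (this is not the UfsInputStreamManager implementation; all names here are hypothetical), one could cache a stream per resource for connection reuse, but cap how many reads a single cached stream serves before it is reopened, so no one connection is overloaded:

```java
import java.util.HashMap;
import java.util.Map;

public class StreamCache {
  // Stand-in for a cached UFS input stream / connection.
  static class CachedStream {
    final String resource;
    int uses; // reads served by this stream so far
    CachedStream(String resource) { this.resource = resource; }
  }

  private final Map<String, CachedStream> mCache = new HashMap<>();
  private final int mMaxUsesPerStream;
  private int mOpens; // how many streams were actually opened

  StreamCache(int maxUsesPerStream) {
    mMaxUsesPerStream = maxUsesPerStream;
  }

  /**
   * Returns a cached stream for the resource, reopening it once it has
   * already served mMaxUsesPerStream reads.
   */
  CachedStream acquire(String resource) {
    CachedStream s = mCache.get(resource);
    if (s == null || s.uses >= mMaxUsesPerStream) {
      s = new CachedStream(resource); // "open a new connection"
      mOpens++;
      mCache.put(resource, s);
    }
    s.uses++;
    return s;
  }

  int opens() { return mOpens; }

  public static void main(String[] args) {
    StreamCache cache = new StreamCache(3);
    for (int i = 0; i < 7; i++) {
      cache.acquire("s3://bucket/object");
    }
    // 7 reads capped at 3 per stream: 3 opens instead of 7.
    System.out.println(cache.opens());
  }
}
```

The cap trades a few extra connection opens for spreading load across connections, which is one way to address the "too many requests on one connection" symptom described in the post.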







