I'm running into an issue where long-running jobs that access files (read-only) from S3 via Alluxio FUSE fail partway through. Every time (three or four runs so far), the job hits a connection reset and fails after about 3-4 hours of constant access to thousands of small files stored in an S3 bucket. Attached are the relevant errors from the end of worker.log and fuse.log at the time of the failure (there was nothing related in master.log). Is there any known problem with long-running jobs like this, which otherwise work normally for several hours at a time?
This issue is illustrated here. It would be better to retry reading S3 files when a connection reset is encountered.
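Until a retry exists on the Alluxio side, one possible stopgap is to retry in the job itself. Below is a minimal sketch of that idea, not Alluxio's own fix: it assumes the job reads whole files through the FUSE mount and that the reset surfaces to the application as an `OSError` (for example EIO), which I have not confirmed for this setup. The name `read_with_retry` and its parameters are made up for illustration.

```python
import time


def read_with_retry(path, attempts=5, base_delay=1.0, read_fn=None, sleep=time.sleep):
    """Read a whole file, retrying on I/O errors with exponential backoff.

    Hypothetical helper: `read_fn` and `sleep` are injectable so the retry
    logic can be exercised without a real FUSE mount.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    if read_fn is None:
        def read_fn(p):
            # Default: plain read of the file as seen through the mount point.
            with open(p, "rb") as f:
                return f.read()

    for attempt in range(attempts):
        try:
            return read_fn(path)
        except OSError:
            # ConnectionResetError is a subclass of OSError, so it is covered.
            # Assumption: a reset behind FUSE is reported as some OSError.
            if attempt == attempts - 1:
                raise  # out of retries, surface the original error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * (2 ** attempt))
```

Each small file is reopened and reread from the start on failure, which is cheap for small objects but would be wasteful for large ones. Note that this also retries errors that will never succeed (such as a missing file), so a real job may want to narrow the exception check.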