Hanging encountered trying to persist file to remote HDFS

Description

We have an Alluxio instance which has two different HDFS clusters mounted into it. A specific directory of the local HDFS is mounted at the root, i.e. it is the primary under filesystem (UFS). A remote HDFS instance is mounted under a specific subdirectory, in this case /aristotle
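For reference, a minimal sketch of how the nested mount is created via the Alluxio Java client API. The hostnames and paths are placeholders, not our real cluster names, and the root UFS is configured separately via alluxio.underfs.address in alluxio-site.properties rather than through this call:

import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;

public class MountRemoteHdfs {
  public static void main(String[] args) throws Exception {
    // The root UFS (the local HDFS directory) is configured via
    // alluxio.underfs.address; only the nested mount is created here.
    FileSystem fs = FileSystem.Factory.get();

    // Hypothetical remote HDFS URI; the real namenode address differs.
    fs.mount(new AlluxioURI("/aristotle"),
             new AlluxioURI("hdfs://remote-namenode:8020/data"));
  }
}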

While we can read data from the remote HDFS instance fine, since moving to 1.4.0 we are unable to write data and encounter a client hang when attempting to do so. Checking the logs on the worker that holds the blocks for the files to be persisted, we see the following error in worker.out:

Exception in thread "persist-file-service-3" java.lang.IllegalArgumentException: Wrong FS: hdfs://aristotle-nid00000.us.cray.com:8020/user/rvesse/server.log.alluxio.0x6F144DDB78F069C8.tmp, expected: hdfs://192.168.0.1
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
    at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
    at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
    at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:776)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:577)
    at alluxio.underfs.hdfs.HdfsUnderFileSystem.createDirect(HdfsUnderFileSystem.java:157)
    at alluxio.underfs.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:57)
    at alluxio.underfs.hdfs.HdfsUnderFileSystem.create(HdfsUnderFileSystem.java:145)
    at alluxio.worker.file.FileDataManager.persistFile(FileDataManager.java:239)
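For context on where the message originates: Hadoop's FileSystem.checkPath() throws this IllegalArgumentException whenever a path's scheme/authority does not match the URI the FileSystem instance was opened against. The following standalone sketch (placeholder hostnames, nothing to do with Alluxio's own code) reproduces the same error:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // FileSystem client bound to one HDFS authority (placeholder address).
    FileSystem rootFs = FileSystem.get(URI.create("hdfs://192.168.0.1:8020"), conf);

    // Asking that client to create a file on a different HDFS authority
    // fails checkPath() with "Wrong FS: ..., expected: hdfs://192.168.0.1".
    rootFs.create(new Path("hdfs://remote-namenode:8020/user/example/file.tmp"));
  }
}

That the worker hits this for a file under /aristotle suggests the persist job is resolving the file against the UFS client for the root mount rather than the nested mount.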

This was working fine with 1.3.0. It looks like an HDFS issue, but I don't really understand why it only started happening with 1.4.0. In both installations we use the Hadoop 2.7 build of Alluxio.

Eventually the client will spit out the following error message, but it takes a long time for this to happen:

Timed out waiting for Wait for the file to be persisted

Please note that the line where the error occurs is not protected by a try-catch-finally block, so it is entirely possible that this error is also killing the persistence worker thread.
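As a rough illustration of that concern (this is not the actual FileDataManager code, just a sketch of the failure mode): when a task submitted with execute() throws an unchecked exception, the exception escapes to the thread's uncaught-exception handler, which is exactly the "Exception in thread ..." line seen in worker.out, and the in-flight persist is lost; a guard of the following shape around the persist call would keep the service responsive:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PersistGuardSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService persistService = Executors.newSingleThreadExecutor();

    // Unguarded: the exception escapes the task, is printed as
    // "Exception in thread ..." by the uncaught-exception handler,
    // and the worker thread dies (the pool replaces it, but this
    // persist attempt is silently lost).
    persistService.execute(() -> {
      throw new IllegalArgumentException("Wrong FS: simulated failure");
    });

    // Guarded: the same failure is caught and logged, so the persist
    // service keeps running and can report the error back cleanly.
    persistService.execute(() -> {
      try {
        throw new IllegalArgumentException("Wrong FS: simulated failure");
      } catch (RuntimeException e) {
        System.err.println("Failed to persist file: " + e.getMessage());
      }
    });

    persistService.shutdown();
    persistService.awaitTermination(10, TimeUnit.SECONDS);
  }
}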

Environment

Linux nid00009 3.10.0-327.36.3.el7_3.1-cray_ari_athena_s_cos #1 SMP Mon Jan 9 19:14:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux (CentOS 7 w/ Cray kernel mods)
Hadoop 2.7.1.2.4.0.0-169
Subversion git@github.com:hortonworks/hadoop.git -r 26104d8ac833884c8776473823007f176854f2eb
Compiled by jenkins on 2016-02-10T06:18Z
Compiled with protoc 2.5.0
From source with checksum cf48a4c63aaec76a714c1897e2ba8be6
Alluxio 1.4.0

Assignee

Rob Vesse

Reporter

Rob Vesse

Affects versions

1.4.0

Priority

Critical