Master fails to switch over the second time

Description

Background

I am using Alluxio with HDFS in HA mode. Everything runs in Docker containers.

I have three containers: docker1, docker2, and docker3. docker1 and docker2 each run a NameNode and an AlluxioMaster. All three containers run a DataNode and an AlluxioWorker, and ZooKeeper runs on all three as well.

Steps

  1. Initially, the NameNode on docker1 is active and the Alluxio leader master is also on docker1.

  2. I stopped docker1. The NameNode on docker2 became active and the Alluxio leader master also switched to docker2. No issue here.

  3. I started docker1, and all the workers reconnected successfully. I ran "jps" on docker1 and AlluxioMaster was up. I also checked the AlluxioMaster log on docker1; no exception was thrown.

  4. Now the active NameNode is on docker2 and the Alluxio leader master is also on docker2.
    Important: I stopped docker2. The NameNode on docker1 became active, but Alluxio on docker1 did not start.
    I checked the AlluxioMaster log on docker1. The Alluxio master on docker1 had already become leader, but it was still trying to connect to the NameNode on docker2, as if Alluxio did not know that HDFS had already switched to docker1.

Furthermore: at the beginning, there was a mistake in the HDFS HA settings of my Alluxio configuration, and the first switch failed with exactly the same exception.
After I corrected the HDFS HA configuration, the first switch worked.
--> So my guess is that, during the second switch, the active HDFS NameNode moved to another node and Alluxio did not notice.
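
Since a wrong HDFS HA configuration produced the same exception earlier, it may be worth re-checking that the hdfs-site.xml Alluxio loads still carries the client-side HA keys. The following is a minimal sketch, not part of the original report: the nameservice name "mycluster", the NameNode ids "nn1"/"nn2", and the sample XML are assumptions made for illustration (only the hosts and port 9000 come from the log). It only checks that the standard Hadoop HA client properties are present; it does not prove that failover works.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; "mycluster", "nn1" and "nn2" are illustrative names.
SAMPLE_HDFS_SITE = """<configuration>
  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>docker1:9000</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>docker2:9000</value></property>
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>"""


def load_hadoop_conf(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            conf[name.strip()] = (value or "").strip()
    return conf


def missing_ha_client_keys(conf):
    """Return the HDFS HA client keys that are absent from conf."""
    nameservices = [ns.strip() for ns in conf.get("dfs.nameservices", "").split(",") if ns.strip()]
    if not nameservices:
        return ["dfs.nameservices"]
    missing = []
    for ns in nameservices:
        nn_key = "dfs.ha.namenodes." + ns
        nn_ids = [nn.strip() for nn in conf.get(nn_key, "").split(",") if nn.strip()]
        if not nn_ids:
            missing.append(nn_key)
        # Every listed NameNode id needs its own RPC address.
        for nn in nn_ids:
            rpc_key = "dfs.namenode.rpc-address.%s.%s" % (ns, nn)
            if rpc_key not in conf:
                missing.append(rpc_key)
        # Without a failover proxy provider the client cannot switch NameNodes.
        provider_key = "dfs.client.failover.proxy.provider." + ns
        if provider_key not in conf:
            missing.append(provider_key)
    return missing


if __name__ == "__main__":
    print(missing_ha_client_keys(load_hadoop_conf(SAMPLE_HDFS_SITE)))
```

To use it against a real deployment, read the actual hdfs-site.xml from each Alluxio master container instead of the sample string. An empty result means only that the keys exist; the values still need to match the HDFS cluster.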

Log

2017-10-23 01:42:35,952 INFO RetryInvocationHandler - Exception while invoking ClientNamenodeProtocolTranslatorPB.getListing over docker2/10.240.1.102:9000 after 6 failover attempts. Trying to failover after sleeping for 16646ms.
java.net.NoRouteToHostException: No Route to Host from docker1/10.240.1.101 to docker2:9000 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1485)
at org.apache.hadoop.ipc.Client.call(Client.java:1427)
at org.apache.hadoop.ipc.Client.call(Client.java:1337)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.getListing(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:588)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:398)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:335)
at com.sun.proxy.$Proxy12.getListing(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1681)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1665)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:896)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:111)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:960)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:957)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:957)
at alluxio.underfs.hdfs.HdfsUnderFileSystem.listStatusInternal(HdfsUnderFileSystem.java:502)
at alluxio.underfs.hdfs.HdfsUnderFileSystem.listStatus(HdfsUnderFileSystem.java:293)
at alluxio.underfs.UnderFileSystemWithLogging$18.call(UnderFileSystemWithLogging.java:327)
at alluxio.underfs.UnderFileSystemWithLogging$18.call(UnderFileSystemWithLogging.java:324)
at alluxio.underfs.UnderFileSystemWithLogging.call(UnderFileSystemWithLogging.java:520)
at alluxio.underfs.UnderFileSystemWithLogging.listStatus(UnderFileSystemWithLogging.java:324)
at alluxio.master.journal.ufs.UfsJournalSnapshot.getSnapshot(UfsJournalSnapshot.java:88)
at alluxio.master.journal.ufs.UfsJournalReader.updateInputStream(UfsJournalReader.java:204)
at alluxio.master.journal.ufs.UfsJournalReader.readInternal(UfsJournalReader.java:163)
at alluxio.master.journal.ufs.UfsJournalReader.read(UfsJournalReader.java:132)
at alluxio.master.journal.ufs.UfsJournalCheckpointThread.runInternal(UfsJournalCheckpointThread.java:141)
at alluxio.master.journal.ufs.UfsJournalCheckpointThread.run(UfsJournalCheckpointThread.java:123)
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:681)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:777)
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:409)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1542)
at org.apache.hadoop.ipc.Client.call(Client.java:1373)
... 34 more
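
The excerpt above shows a single retry against docker2. To see whether the HDFS client ever tries docker1, the full master log could be scanned for these RetryInvocationHandler lines and the target hosts counted. This is a rough sketch under the assumption that every retry line has the same shape as the one quoted above; the function name and regular expression are illustrative, not part of Alluxio or Hadoop. If both hosts appear, the client is alternating between NameNodes; if only docker2 appears, that would be consistent with the guess that the switch is not being picked up.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... Exception while invoking <call> over docker2/10.240.1.102:9000 after 6 failover attempts. ...
FAILOVER_RE = re.compile(
    r"Exception while invoking (?P<call>\S+) over "
    r"(?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+) "
    r"after (?P<attempts>\d+) failover attempts"
)


def failover_targets(log_lines):
    """Count how many logged retries targeted each NameNode host."""
    targets = Counter()
    for line in log_lines:
        match = FAILOVER_RE.search(line)
        if match:
            targets[match.group("host")] += 1
    return targets


if __name__ == "__main__":
    import sys

    # Usage: python failover_targets.py /path/to/master.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for host, count in failover_targets(log_file).most_common():
            print(host, count)
```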

Environment

None

Status

Assignee

Unassigned

Reporter

HalfLegend

Labels

Components

Affects versions

1.5.0
1.4.0
1.6.0

Priority

Critical