During a master failover, a NullPointerException can be thrown on the new primary master, causing some operations to fail.
Steps to reproduce (the failure is timing-dependent and does not occur on every run):
1. Start ZooKeeper and at least two masters, say masterA and masterB; assume masterA is the primary.
2. Start caching data into the worker.
3. Shut down masterA; masterB takes over.
4. The worker fails over to masterB.
5. Some operations, e.g. commitBlock, fail over to masterB in parallel with the worker registration.
6. A NullPointerException occurs:
java.lang.NullPointerException
at alluxio.master.block.meta.MasterWorkerInfo.updateUsedBytes(MasterWorkerInfo.java:340)
at alluxio.master.block.DefaultBlockMaster.commitBlock(DefaultBlockMaster.java:537)
at alluxio.master.block.BlockMasterWorkerServiceHandler$2.call(BlockMasterWorkerServiceHandler.java:95)
at alluxio.master.block.BlockMasterWorkerServiceHandler$2.call(BlockMasterWorkerServiceHandler.java:92)
at alluxio.RpcUtils.call(RpcUtils.java:101)
at alluxio.RpcUtils.call(RpcUtils.java:84)
at alluxio.master.block.BlockMasterWorkerServiceHandler.commitBlock(BlockMasterWorkerServiceHandler.java:92)
at alluxio.thrift.BlockMasterWorkerService$Processor$commitBlock.getResult(BlockMasterWorkerService.java:507)
at alluxio.thrift.BlockMasterWorkerService$Processor$commitBlock.getResult(BlockMasterWorkerService.java:491)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:123)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The root cause is that worker registration is split into two steps, getWorkerId() and registerWorker(), and the pair is not atomic.
Here is how the problem happens:
1. The worker fails over to masterB.
2. The worker issues getWorkerId() to masterB.
3. masterB records the worker in mWorkers.
4. Before the worker sends registerWorker() to masterB, a commitBlock() operation arrives.
5. masterB handles the commitBlock(), but some fields of the worker's MasterWorkerInfo are not set yet. Specifically, mUsedBytesOnTiers has no entries because the tier information has not been registered with masterB yet,
so this line throws the NullPointerException:
public void updateUsedBytes(String tierAlias, long usedBytesOnTier) {
  mUsedBytes += usedBytesOnTier - mUsedBytesOnTiers.get(tierAlias);  // <-- NullPointerException is thrown here
  mUsedBytesOnTiers.put(tierAlias, usedBytesOnTier);
}
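For illustration, the ordering problem can be shown outside Alluxio with a minimal, self-contained Java sketch. Everything except the updateUsedBytes() arithmetic is hypothetical: the class names RegisterRaceSketch and WorkerInfoSketch, the register() method, the mIsRegistered flag, and updateUsedBytesChecked() are made up for this example. The sketch also assumes that mUsedBytesOnTiers is an empty map before registration, so get() returns null and the auto-unboxing throws. The guarded variant is only one possible defensive check and is not necessarily the fix that was merged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the master-side worker bookkeeping; not Alluxio code.
class RegisterRaceSketch {

  static class WorkerInfoSketch {
    long mUsedBytes;
    // Assumption: the map exists but is empty until registration completes.
    final Map<String, Long> mUsedBytesOnTiers = new HashMap<>();
    boolean mIsRegistered;

    // Stand-in for the second registration step: tier usage becomes known.
    void register(Map<String, Long> usedBytesOnTiers) {
      mUsedBytesOnTiers.clear();
      mUsedBytesOnTiers.putAll(usedBytesOnTiers);
      mUsedBytes = 0;
      for (long bytes : usedBytesOnTiers.values()) {
        mUsedBytes += bytes;
      }
      mIsRegistered = true;
    }

    // Same arithmetic as the quoted method: get() returns null for an
    // unknown tier, and unboxing that null throws NullPointerException.
    void updateUsedBytes(String tierAlias, long usedBytesOnTier) {
      mUsedBytes += usedBytesOnTier - mUsedBytesOnTiers.get(tierAlias);
      mUsedBytesOnTiers.put(tierAlias, usedBytesOnTier);
    }

    // Hypothetical guard: reject the update with a descriptive error
    // instead of failing with a bare NullPointerException.
    void updateUsedBytesChecked(String tierAlias, long usedBytesOnTier) {
      if (!mIsRegistered) {
        throw new IllegalStateException("Worker has an id but is not registered yet");
      }
      updateUsedBytes(tierAlias, usedBytesOnTier);
    }
  }

  public static void main(String[] args) {
    // Steps 2-3: the worker has an id and is tracked, but has not registered.
    WorkerInfoSketch worker = new WorkerInfoSketch();

    // Step 4-5: a commitBlock-style update arrives before registration.
    try {
      worker.updateUsedBytes("MEM", 100L);
    } catch (NullPointerException e) {
      System.out.println("Unguarded update before registration: NullPointerException");
    }

    try {
      worker.updateUsedBytesChecked("MEM", 100L);
    } catch (IllegalStateException e) {
      System.out.println("Guarded update before registration: " + e.getMessage());
    }

    // Once registration has supplied the tier information, the update works.
    Map<String, Long> tiers = new HashMap<>();
    tiers.put("MEM", 0L);
    worker.register(tiers);
    worker.updateUsedBytesChecked("MEM", 100L);
    System.out.println("Used bytes after registration: " + worker.mUsedBytes);
  }
}
```

With a guard like this, the caller gets an explicit "not registered yet" error that it could retry after registerWorker() completes; making the two registration steps atomic would be another way to close the window.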
I will provide a PR for this issue.
This issue has been addressed by https://github.com/Alluxio/alluxio/pull/7780 (branch-1.8) and https://github.com/Alluxio/alluxio/pull/7812 (1.9-snapshot).