Worker should only be exposed in master after it is registered (registration should be atomic)

Description

During a master failover, a NullPointerException can be hit and some operations fail.
Here are the steps to reproduce (the problem may not reproduce every time):
1. Start ZooKeeper and at least 2 masters, say masterA and masterB; assume masterA is primary
2. Start caching data into a worker
3. Shut down masterA; masterB takes over
4. The worker fails over to masterB
5. Some operation, e.g. commitBlock, fails over to masterB in parallel with workerRegister
6. A NullPointerException occurs:
java.lang.NullPointerException
at alluxio.master.block.meta.MasterWorkerInfo.updateUsedBytes(MasterWorkerInfo.java:340)
at alluxio.master.block.DefaultBlockMaster.commitBlock(DefaultBlockMaster.java:537)
at alluxio.master.block.BlockMasterWorkerServiceHandler$2.call(BlockMasterWorkerServiceHandler.java:95)
at alluxio.master.block.BlockMasterWorkerServiceHandler$2.call(BlockMasterWorkerServiceHandler.java:92)
at alluxio.RpcUtils.call(RpcUtils.java:101)
at alluxio.RpcUtils.call(RpcUtils.java:84)
at alluxio.master.block.BlockMasterWorkerServiceHandler.commitBlock(BlockMasterWorkerServiceHandler.java:92)
at alluxio.thrift.BlockMasterWorkerService$Processor$commitBlock.getResult(BlockMasterWorkerService.java:507)
at alluxio.thrift.BlockMasterWorkerService$Processor$commitBlock.getResult(BlockMasterWorkerService.java:491)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:123)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The root cause is that worker registration is broken into two steps, getWorkerId() and registerWorker(), and is not atomic.
Here is how the problem happens:
1. The worker fails over to masterB
2. The worker issues getWorkerId() to masterB
3. masterB remembers the worker in mWorkers
4. Before the worker sends registerWorker() to masterB, a commitBlock() operation arrives
5. masterB responds to commitBlock(), but some fields of the worker (MasterWorkerInfo) are not set yet. Specifically, mUsedBytesOnTiers has no entries because the tier information has not been registered with masterB,
so this line hits a NullPointerException:
public void updateUsedBytes(String tierAlias, long usedBytesOnTier) {
  mUsedBytes += usedBytesOnTier - mUsedBytesOnTiers.get(tierAlias); // <-- NPE here
  mUsedBytesOnTiers.put(tierAlias, usedBytesOnTier);
}
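
The mechanism can be shown in isolation. The following is a minimal sketch, not Alluxio code: the class WorkerInfoSketch and its register() method are hypothetical and only mirror the two fields named above. It assumes mUsedBytesOnTiers is a Map<String, Long> that stays empty until the tier information arrives, so get() returns null and unboxing that null into a long throws.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for MasterWorkerInfo; only the fields from the report are modeled.
class WorkerInfoSketch {
  private long mUsedBytes;
  // Assumption: empty until the worker has sent its tier information.
  private final Map<String, Long> mUsedBytesOnTiers = new HashMap<>();

  // Hypothetical: models the effect of registerWorker() on this object.
  public void register(Map<String, Long> usedBytesOnTiers) {
    mUsedBytesOnTiers.clear();
    mUsedBytesOnTiers.putAll(usedBytesOnTiers);
    mUsedBytes = 0;
    for (long bytes : usedBytesOnTiers.values()) {
      mUsedBytes += bytes;
    }
  }

  // Same body as the method quoted in the report.
  public void updateUsedBytes(String tierAlias, long usedBytesOnTier) {
    // get() returns null for an unknown tier; unboxing null throws NullPointerException.
    mUsedBytes += usedBytesOnTier - mUsedBytesOnTiers.get(tierAlias);
    mUsedBytesOnTiers.put(tierAlias, usedBytesOnTier);
  }

  public long getUsedBytes() {
    return mUsedBytes;
  }

  public static void main(String[] args) {
    WorkerInfoSketch worker = new WorkerInfoSketch();
    try {
      // Models commitBlock() arriving before registerWorker().
      worker.updateUsedBytes("MEM", 100L);
    } catch (NullPointerException e) {
      System.out.println("NPE before registration");
    }
    Map<String, Long> tiers = new HashMap<>();
    tiers.put("MEM", 0L);
    worker.register(tiers);
    worker.updateUsedBytes("MEM", 100L);
    System.out.println("used bytes after registration: " + worker.getUsedBytes());
  }
}
```

Once register() has populated the map, the same call succeeds, which matches the observation that the failure only shows up in the window between getWorkerId() and registerWorker().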

I will provide a PR for this issue.
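
As an illustration of what "atomic registration" could mean here, the sketch below keeps a worker in a pending map after getWorkerId() and only exposes it to other operations once registerWorker() has filled in its tier information. This is an assumption about one possible approach, not the content of the PR: RegistrySketch, WorkerEntry, the pending and registered maps, and the usedBytesOnTier() accessor are all hypothetical names, and re-registration and locking around multi-field updates are left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical registry; not the Alluxio block master.
class RegistrySketch {
  static final class WorkerEntry {
    final long id;
    final Map<String, Long> usedBytesOnTiers = new ConcurrentHashMap<>();

    WorkerEntry(long id) {
      this.id = id;
    }
  }

  private final AtomicLong nextId = new AtomicLong(1);
  // Workers that have an id but have not sent their tier information yet.
  private final Map<Long, WorkerEntry> pending = new ConcurrentHashMap<>();
  // Workers that are fully registered and visible to other operations.
  private final Map<Long, WorkerEntry> registered = new ConcurrentHashMap<>();

  // Step 1: hand out an id, but keep the worker hidden.
  long getWorkerId() {
    long id = nextId.getAndIncrement();
    pending.put(id, new WorkerEntry(id));
    return id;
  }

  // Step 2: fill in tier information first, then expose the worker.
  void registerWorker(long id, Map<String, Long> usedBytesOnTiers) {
    WorkerEntry entry = pending.remove(id);
    if (entry == null) {
      throw new IllegalStateException("unknown worker id " + id);
    }
    entry.usedBytesOnTiers.putAll(usedBytesOnTiers);
    registered.put(id, entry);
  }

  // Only sees fully registered workers, so it never reads half-initialized state.
  void commitBlock(long id, String tierAlias, long usedBytesOnTier) {
    WorkerEntry entry = registered.get(id);
    if (entry == null) {
      throw new IllegalStateException("worker " + id + " is not registered yet");
    }
    entry.usedBytesOnTiers.put(tierAlias, usedBytesOnTier);
  }

  // Test helper: used bytes recorded for a tier of a registered worker.
  long usedBytesOnTier(long id, String tierAlias) {
    return registered.get(id).usedBytesOnTiers.get(tierAlias);
  }
}
```

With this shape, a commitBlock() that races ahead of registerWorker() gets an explicit "not registered yet" error that the caller can retry, instead of a NullPointerException from partially initialized worker state.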

Environment

None

Assignee

Unassigned

Reporter

Chao Guang Li

Labels

None

Components

Fix versions

Affects versions

Priority

Major