As title description, it happens when the block is in alluxio, the ReadHandler holds a read lock, either BlockReadHandler or ShortCircuitBlockReadHandler can trigger this happen.
This will leads any write behavior, like moveBlock or removeBlock blocked, since Semaphore in ClientRWLock is not enough.
Related pr: https://github.com/Alluxio/alluxio/pull/7296
And please see my comment in this pr: https://github.com/Alluxio/alluxio/pull/5021#issuecomment-401684746
This is the root cause of above.
I have dug quite a long time, and due to lack of netty knowledge, but don't know how to fix...
I discovered this phenomenon by add some trace or debug message.
BTW, any job targets to those locked blocks was finished successfully, but FREE couldn’t perform on those blocks, that’s when i started to bury myself in related codes.
ping
2018-07-16 12:12:56,866 INFO BlockLockManager - Increase: blockId=587202560, reference=1.
2018-07-16 12:12:56,866 INFO BlockLockManager - lockBlock: lockId=457, blockId=587202560, sessionId=6964919805109758215, thread=BlockPacketReaderExecutor-14
2018-07-16 12:12:56,866 DEBUG TieredBlockStore - getBlockReader: sessionId=6964919805109758215, blockId=587202560, lockId=457, thread=BlockPacketReaderExecutor-14
2018-07-16 12:12:56,866 DEBUG TieredBlockStore - accessBlock: sessionId=6964919805109758215, blockId=587202560, thread=BlockPacketReaderExecutor-14// Normally, here should be a LOG about BlockLockManager - unlockBlock….. (when completeRequestis called
2018-07-16 12:42:21,483 DEBUG TieredBlockStore - removeBlock: sessionId=-4, blockId=587202560, location=any dir, any tier, thread=block-removal-service-2
2018-07-16 12:42:21,483 INFO BlockLockManager - Increase: blockId=587202560, reference=2.
Paste parts of logs.
For block 587202560, it was in alluxio already and read by one of the BlockPacketReaderExecutor(in BlockReadHandler
.class), related code:
But it never call completeRequest which means holding a read lock. When removeBlock request comes, it couldn’t delete 587202560 as read lock holding a semaphore. (Job was finished for hours)
Following logs is about ShortCircuitBlockReadHandler
2018-07-16 12:36:55,121 INFO BlockLockManager - Increase: blockId=1128703983616, reference=1.
2018-07-16 12:36:55,121 INFO BlockLockManager - lockBlock: lockId=13096, blockId=1128703983616, sessionId=3914670844548214980, thread=BlockOpenExecutor-3
2018-07-16 12:36:55,121 DEBUG TieredBlockStore - accessBlock: sessionId=3914670844548214980, blockId=1128703983616, thread=BlockOpenExecutor-3
2018-07-16 12:36:55,121 DEBUG TieredBlockStore - getBlockMeta: sessionId=3914670844548214980, blockId=1128703983616, lockId=13096, thread=BlockOpenExecutor-3// Normally, here should be a LOG about BlockLockManager - unlockBlock….. (when handleBlockCloseRequest is called
2018-07-16 12:42:21,900 DEBUG TieredBlockStore - removeBlock: sessionId=-4, blockId=1128703983616, location=any dir, any tier, thread=block-removal-service-1
2018-07-16 12:42:21,900 INFO BlockLockManager - Increase: blockId=1128703983616, reference=2.
Related code in ShortCircuitBlockReadHandler#handleBlockOpenRequest