Channel#writeAndFlush get stuck in AbstractReadHandler

Description

As title description, it happens when the block is in alluxio, the ReadHandler holds a read lock, either BlockReadHandler or ShortCircuitBlockReadHandler can trigger this happen.

This will leads any write behavior, like moveBlock or removeBlock blocked, since Semaphore in ClientRWLock is not enough.
Related pr: https://github.com/Alluxio/alluxio/pull/7296
And please see my comment in this pr: https://github.com/Alluxio/alluxio/pull/5021#issuecomment-401684746

This is the root cause of above.

I have dug quite a long time, and due to lack of netty knowledge, but don't know how to fix...

Environment

None

Activity

Show:
Reid Chan
July 13, 2018, 4:35 AM

I discovered this phenomenon by add some trace or debug message.

Reid Chan
July 13, 2018, 7:25 AM

BTW, any job targets to those locked blocks was finished successfully, but FREE couldn’t perform on those blocks, that’s when i started to bury myself in related codes.

Reid Chan
July 14, 2018, 7:28 AM

ping

Reid Chan
July 16, 2018, 5:02 AM
Edited

2018-07-16 12:12:56,866 INFO BlockLockManager - Increase: blockId=587202560, reference=1.
2018-07-16 12:12:56,866 INFO BlockLockManager - lockBlock: lockId=457, blockId=587202560, sessionId=6964919805109758215, thread=BlockPacketReaderExecutor-14
2018-07-16 12:12:56,866 DEBUG TieredBlockStore - getBlockReader: sessionId=6964919805109758215, blockId=587202560, lockId=457, thread=BlockPacketReaderExecutor-14
2018-07-16 12:12:56,866 DEBUG TieredBlockStore - accessBlock: sessionId=6964919805109758215, blockId=587202560, thread=BlockPacketReaderExecutor-14

// Normally, here should be a LOG about BlockLockManager - unlockBlock….. (when completeRequestis called

2018-07-16 12:42:21,483 DEBUG TieredBlockStore - removeBlock: sessionId=-4, blockId=587202560, location=any dir, any tier, thread=block-removal-service-2
2018-07-16 12:42:21,483 INFO BlockLockManager - Increase: blockId=587202560, reference=2.

Paste parts of logs.

For block 587202560, it was in alluxio already and read by one of the BlockPacketReaderExecutor(in BlockReadHandler

.class), related code:

But it never call completeRequest which means holding a read lock. When removeBlock request comes, it couldn’t delete 587202560 as read lock holding a semaphore. (Job was finished for hours)

Reid Chan
July 16, 2018, 5:19 AM
Edited

Following logs is about ShortCircuitBlockReadHandler

2018-07-16 12:36:55,121 INFO BlockLockManager - Increase: blockId=1128703983616, reference=1.
2018-07-16 12:36:55,121 INFO BlockLockManager - lockBlock: lockId=13096, blockId=1128703983616, sessionId=3914670844548214980, thread=BlockOpenExecutor-3
2018-07-16 12:36:55,121 DEBUG TieredBlockStore - accessBlock: sessionId=3914670844548214980, blockId=1128703983616, thread=BlockOpenExecutor-3
2018-07-16 12:36:55,121 DEBUG TieredBlockStore - getBlockMeta: sessionId=3914670844548214980, blockId=1128703983616, lockId=13096, thread=BlockOpenExecutor-3

// Normally, here should be a LOG about BlockLockManager - unlockBlock….. (when handleBlockCloseRequest is called

2018-07-16 12:42:21,900 DEBUG TieredBlockStore - removeBlock: sessionId=-4, blockId=1128703983616, location=any dir, any tier, thread=block-removal-service-1
2018-07-16 12:42:21,900 INFO BlockLockManager - Increase: blockId=1128703983616, reference=2.

Related code in ShortCircuitBlockReadHandler#handleBlockOpenRequest

Assignee

Reid Chan

Reporter

Reid Chan

Labels

None

Components

Affects versions

Priority

Critical
Configure