BlockAlreadyExistsException in MostAvailableFirstPolicy and RoundRobinPolicy

Description

I tried both RoundRobinPolicy and MostAvailableFirstPolicy to balance the data across the nodes in the cluster. A Spark job reads the current data from Alluxio, computes a summary, and writes the result back to Alluxio as new files, without replacing the old ones.
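
For context, the job boils down to the pattern below. This is only a minimal sketch assuming the Alluxio 1.x Java client API (the real job runs inside Spark); the output path, the summary step, and the class name are made up, while the input path is the file from the log further down.

import alluxio.AlluxioURI;
import alluxio.client.ReadType;
import alluxio.client.file.FileInStream;
import alluxio.client.file.FileOutStream;
import alluxio.client.file.FileSystem;
import alluxio.client.file.options.CreateFileOptions;
import alluxio.client.file.options.OpenFileOptions;
import alluxio.client.file.policy.RoundRobinPolicy;

public class SummaryJobSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.Factory.get();

    // Read existing data; ReadType.CACHE asks the worker to cache each block
    // it serves, which is the step that later fails with WRITE_ERROR.
    AlluxioURI input = new AlluxioURI(
        "/FACT_ADMIN_HOURLY/time=2016-07-14-16/network_id=19762/part-r-00000-ef268b10-fd71-4ef3-8fb3-30f5a63fb9df.snappy.parquet");
    try (FileInStream in = fs.openFile(input,
        OpenFileOptions.defaults().setReadType(ReadType.CACHE))) {
      byte[] buf = new byte[65536];
      while (in.read(buf) != -1) {
        // ... summarize ...
      }
    }

    // Write the summary back as a new file, spreading blocks across workers
    // with RoundRobinPolicy (MostAvailableFirstPolicy is passed the same way).
    AlluxioURI output = new AlluxioURI("/FACT_ADMIN_HOURLY/summary.parquet"); // made-up path
    try (FileOutStream out = fs.createFile(output,
        CreateFileOptions.defaults().setLocationPolicy(new RoundRobinPolicy()))) {
      out.write(new byte[0]); // placeholder for the summarized data
    }
  }
}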

It would also be nice if the warning indicated which file and folder are affected, not just the block ID.

16/07/21 09:19:41 INFO type: open(alluxio://master1:19998/FACT_ADMIN_HOURLY/time=2016-07-14-16/network_id=19762/part-r-00000-ef268b10-fd71-4ef3-8fb3-30f5a63fb9df.snappy.parquet, 65536)
16/07/21 09:19:41 INFO type: Connecting to remote worker @ master4/10.197.0.6:29998
16/07/21 09:19:41 INFO type: Connecting to remote worker @ master4/10.197.0.6:29998
16/07/21 09:19:41 INFO type: Connecting to remote worker @ master3/10.197.0.5:29998
16/07/21 09:19:41 INFO type: Connected to remote machine master4/10.197.0.6:29999
16/07/21 09:19:41 INFO type: Data 169097560064 from remote machine master4/10.197.0.6:29999 received
16/07/21 09:19:41 INFO type: Connected to remote machine master3/10.197.0.5:29999
16/07/21 09:19:41 INFO type: status: SUCCESS from remote machine master3/10.197.0.5:29999 received
16/07/21 09:19:41 INFO type: Connected to remote machine master3/10.197.0.5:29999
16/07/21 09:19:41 INFO type: status: SUCCESS from remote machine master3/10.197.0.5:29999 received
16/07/21 09:19:41 INFO type: Connecting to remote worker @ master3/10.197.0.5:29998
16/07/21 09:19:41 INFO type: Connecting to remote worker @ master3/10.197.0.5:29998
16/07/21 09:19:41 INFO type: Connected to remote machine master3/10.197.0.5:29999
16/07/21 09:19:41 INFO type: Data 169097560064 from remote machine master3/10.197.0.5:29999 received
16/07/21 09:19:41 INFO type: open(alluxio://master1:19998/FACT_ADMIN_HOURLY/time=2016-07-14-16/network_id=19762/part-r-00000-ef268b10-fd71-4ef3-8fb3-30f5a63fb9df.snappy.parquet, 65536)
16/07/21 09:19:41 INFO type: Connecting to remote worker @ master3/10.197.0.5:29998
16/07/21 09:19:41 INFO type: Connecting to remote worker @ master3/10.197.0.5:29998
16/07/21 09:19:41 INFO type: Connecting to remote worker @ master3/10.197.0.5:29998
16/07/21 09:19:41 INFO type: Connected to remote machine master3/10.197.0.5:29999
16/07/21 09:19:41 INFO type: Data 169097560064 from remote machine master3/10.197.0.5:29999 received
16/07/21 09:19:41 INFO type: Connected to remote machine master3/10.197.0.5:29999
16/07/21 09:19:41 INFO type: status: WRITE_ERROR from remote machine master3/10.197.0.5:29999 received
16/07/21 09:19:41 WARN type: The block with ID 169097560064 could not be cached into Alluxio storage.
2016-07-21 02:18:26,204 INFO logger.type (BlockDataServerHandler.java:handleBlockReadRequest) - Preparation for responding to remote block request for: 169097560064 done.
2016-07-21 02:18:26,211 ERROR logger.type (BlockDataServerHandler.java:handleBlockWriteRequest) - Error writing remote block : Temp blockId 169,097,560,064 is not available, because it is already committed
alluxio.exception.BlockAlreadyExistsException: Temp blockId 169,097,560,064 is not available, because it is already committed
        at alluxio.worker.block.TieredBlockStore.checkTempBlockIdAvailable(TieredBlockStore.java:393)
        at alluxio.worker.block.TieredBlockStore.createBlockMetaInternal(TieredBlockStore.java:521)
        at alluxio.worker.block.TieredBlockStore.createBlockMeta(TieredBlockStore.java:184)
        at alluxio.worker.block.BlockWorker.createBlockRemote(BlockWorker.java:336)
        at alluxio.worker.netty.BlockDataServerHandler.handleBlockWriteRequest(BlockDataServerHandler.java:145)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:71)
        at alluxio.worker.netty.DataServerHandler.channelRead0(DataServerHandler.java:40)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:831)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:322)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:254)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)
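
Worth noting: the client log above shows the same file opened twice and block 169097560064 fetched over two separate connections, so the failure looks like two readers racing to cache the same block on master3. Below is a hypothetical sketch of that access pattern (same Alluxio 1.x Java client API assumption as above, with the input path taken from the log; whether it actually reproduces the error depends on timing):

import alluxio.AlluxioURI;
import alluxio.client.ReadType;
import alluxio.client.file.FileInStream;
import alluxio.client.file.FileSystem;
import alluxio.client.file.options.OpenFileOptions;

public class ConcurrentCacheRepro {
  public static void main(String[] args) throws Exception {
    // Two readers of the same file, both with ReadType.CACHE: each read asks
    // the worker to cache the blocks it serves, so both may try to create the
    // same block on the same worker. The loser of the race finds the block
    // already committed, matching the BlockAlreadyExistsException above.
    Runnable reader = () -> {
      try (FileInStream in = FileSystem.Factory.get().openFile(
          new AlluxioURI("/FACT_ADMIN_HOURLY/time=2016-07-14-16/network_id=19762/part-r-00000-ef268b10-fd71-4ef3-8fb3-30f5a63fb9df.snappy.parquet"),
          OpenFileOptions.defaults().setReadType(ReadType.CACHE))) {
        byte[] buf = new byte[65536];
        while (in.read(buf) != -1) {
          // discard the data; we only want the caching side effect
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
    };
    Thread t1 = new Thread(reader);
    Thread t2 = new Thread(reader);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
  }
}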

Environment

CentOS 6.8
Java 8
The cluster has 5 nodes:

  • node1: 10 GB (also the master)

  • nodes 2, 3, 4, 5: 20 GB each

Standalone mode.
Policies: RoundRobinPolicy, MostAvailableFirstPolicy

Status

Assignee

Unassigned

Reporter

BeanL

Labels

Components

Affects versions

1.1.1
1.2.0
1.1.0

Priority

Critical