Master service unavailable when setting TTL for files concurrently

Description

In our environment, we write Alluxio files at a very high concurrency level on several nodes. If no TTL is set on the files, everything works. However, because these files are used as a cache, we set a TTL on each one right after creating it. Then, depending on some random factors, one of two serious failures occurs:

  1. Result 1: the Alluxio master service becomes completely unavailable. In this case we also see very high CPU utilization (up to 500%), and the Browse tab of the web UI is unavailable as well.

  2. Result 2: the master log files (from master.log.1 to master.log.100) are completely filled with a single repeated ERROR message like this:
    2017-04-20 18:04:15,485 ERROR logger.type (FileSystemMaster.java:heartbeat) - Exception trying to clean up InodeFile{id=313012518911, name=20160409000000000-20160414000000000, parentId=49747, creationTimeMs=1492679366374, pinned=false, deleted=true, directory=false, persistenceState=PERSISTED, lastModificationTimeMs=1492679396347, owner=, group=, permission=420, blocks=[312995741696], blockContainerId=18656, blockSizeBytes=33554432, cacheable=true, completed=true, length=66, ttl=1200903} for ttl check: alluxio.exception.FileDoesNotExistException: Inode id 313,012,518,911 does not exist

We checked the TTL-related code and found the cause: TtlBucket stores the inodes that have a TTL in a HashSet, which is not thread-safe. For result 1, the likely reason is that all Thrift threads are stuck in an infinite loop inside TtlBucket#addInode or TtlBucket#removeInode, because the internal HashSet is being modified concurrently. For result 2, the set returned by TtlBucket#getInodes has reached an illegal state, so iterating over it returns the same inode indefinitely.
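
One possible direction for a fix, sketched below, is to back the bucket with a concurrent set. This is only an illustration and not the actual Alluxio code: the Bucket class is a hypothetical stand-in for TtlBucket, plain inode ids replace the real inode objects, and ConcurrentHashMap.newKeySet() assumes Java 8 or later (on an older JDK, Collections.newSetFromMap(new ConcurrentHashMap<>()) would play the same role).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TtlBucketSketch {
  // Hypothetical stand-in for TtlBucket; it stores inode ids instead of inodes.
  static final class Bucket {
    // A concurrent set tolerates simultaneous add/remove/iterate calls,
    // unlike a plain HashSet.
    private final Set<Long> mInodeIds = ConcurrentHashMap.newKeySet();

    boolean addInode(long id) {
      return mInodeIds.add(Long.valueOf(id));
    }

    boolean removeInode(long id) {
      return mInodeIds.remove(Long.valueOf(id));
    }

    Set<Long> getInodes() {
      return mInodeIds;
    }
  }

  // Adds threads * perThread distinct ids from several threads at once
  // and returns the resulting set size.
  static int fill(Bucket bucket, int threads, int perThread) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      final long base = (long) t * perThread;
      pool.submit(() -> {
        for (int i = 0; i < perThread; i++) {
          bucket.addInode(base + i);
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
        throw new IllegalStateException("writers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return bucket.getInodes().size();
  }

  public static void main(String[] args) {
    Bucket bucket = new Bucket();
    System.out.println(fill(bucket, 8, 10000));
  }
}
```

With the concurrent set, the eight writers finish and the bucket reports 80000 entries. Replacing the backing set with a plain HashSet in this sketch can lose entries or hang, although the failure is timing-dependent and may not reproduce on every run.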

Environment

None

Status

Assignee

Yufa Zhou

Reporter

Yufa Zhou

Labels

Components

Affects versions

1.4.0

Priority

Critical