java.lang.IllegalStateException: Block 1174405218 is expected to be 67108864 bytes, but only 0 bytes are available.

Description

The system is a 4-node CentOS 7 HPC cluster with Lustre as the under filesystem. Lustre is a distributed POSIX filesystem, similar to NFS.

We have worked a little with Alluxio since v1.8 and a decent amount with 2.0.0. We recently moved to Alluxio 2.3.0 and started seeing errors when running a basic KMeans with dataset sizes greater than DRAM. If the total data fits in the Alluxio cache, or even in system DRAM, things seem to work, but if it is larger we hit an issue.

I have triggered the error mostly with the spark-bench legacy KMeans from here: https://github.com/CODAIT/spark-bench/tree/legacy with a 2TB data generation process (larger than DRAM and the Alluxio cache settings).

Worker settings are very basic:

We have seen the error with memory.size as high as 100GB.

Also:
ALLUXIO_READ_TYPE is "CACHE"
and
ALLUXIO_WRITE_TYPE is "CACHE_THROUGH"

The error happens after a while, and things like "alluxio fs du -h /" may help trigger it, but we are not 100% sure.

This has been seen with both large and small files (the same 2TB of data split into anywhere from 33 to 16k files).
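To put those file counts in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the 67108864-byte figure in the error message is the Alluxio block size (64 MiB), that "2TB" means 2 TiB, that "16k" means 16384 files, and that the data is split evenly. None of this is confirmed in the ticket, and the helper names are made up for illustration.

```python
BLOCK_SIZE = 67108864          # bytes, from the error message (assumed to be the block size)
DATASET_BYTES = 2 * 1024**4    # assumption: "2TB" read as 2 TiB


def blocks_needed(num_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold num_bytes (ceiling division)."""
    return -(-num_bytes // block_size)


def blocks_per_file(total_bytes: int, num_files: int) -> int:
    """Blocks per file if total_bytes is split evenly across num_files."""
    bytes_per_file = -(-total_bytes // num_files)  # round up to whole bytes
    return blocks_needed(bytes_per_file)


# Total block count for the whole dataset, then per-file counts
# for the two extremes mentioned above (33 files and 16384 files).
print("total blocks:", blocks_needed(DATASET_BYTES))
for num_files in (33, 16384):
    print(num_files, "files ->", blocks_per_file(DATASET_BYTES, num_files), "blocks per file")
```

Under these assumptions the dataset spans tens of thousands of blocks at either extreme, so the failure does not look tied to a particular blocks-per-file layout.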

As noted yesterday on Slack, we moved to Alluxio 2.2.2 and started retesting. We have not seen the error after over 12 hours of testing and will continue to use Alluxio 2.2.2. If we see this same error there we will post it to this ticket, but 2.2.2 looks better so far.

This is the backtrace we see. The Alluxio block number can be any number; it moves around.
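Since the block number changes from run to run, it can help to pull the IDs and byte counts out of the logs and compare them. The following is a minimal Python sketch that parses the exception message quoted in the ticket title. The function and class names are hypothetical, and the pattern is based only on that one message, so other variants of the message may not match.

```python
import re
from typing import NamedTuple, Optional


class BlockError(NamedTuple):
    block_id: int
    expected_bytes: int
    available_bytes: int


# Pattern derived from the single message seen in this ticket.
_PATTERN = re.compile(
    r"Block (\d+) is expected to be (\d+) bytes, "
    r"but only (\d+) bytes are available"
)


def parse_block_error(line: str) -> Optional[BlockError]:
    """Return the parsed fields, or None if the line is not this error."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return BlockError(*(int(group) for group in match.groups()))


sample = (
    "java.lang.IllegalStateException: Block 1174405218 is expected to be "
    "67108864 bytes, but only 0 bytes are available."
)
err = parse_block_error(sample)
print(err)
```

Running this over the Spark executor or Alluxio worker logs and collecting the distinct `block_id` values would show whether the failures cluster on a few blocks or are spread across the dataset.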

Environment

OS: CentOS Linux release 7.7.1908 (Core)
Alluxio: 2.3.0 - open source tar file download
Spark: spark-2.4.6-bin-hadoop2.7 - open source tar file download
Lustre ZFS : 2.12.2

Worker node:
384GB RAM
2x Xeon
10Gb Ethernet
100Gb OPA (not used with Alluxio)

Activity

Keith Mannthey
September 12, 2020, 4:30 AM

https://github.com/Alluxio/alluxio/issues/12102 has been opened to track this issue.

Duplicate

Assignee: Calvin Jia
Reporter: Keith Mannthey
Labels: None
Priority: Major