The system is a 4-node CentOS 7 HPC cluster with Lustre as the under filesystem (UFS); Lustre is a distributed POSIX filesystem, similar to NFS.
We have worked a little with Alluxio since v1.8, a decent amount with 2.0.0, and recently we moved to Alluxio 2.3.0, where we started seeing errors when running a basic KMeans with dataset sizes greater than DRAM. If the total data fits in the Alluxio cache, or even in system DRAM, things seem to work, but once the data is larger we hit this issue.
I have triggered the error mostly with the spark-bench legacy KMeans from here: https://github.com/CODAIT/spark-bench/tree/legacy with a 2TB data generation process (larger than both DRAM and the Alluxio cache settings).
Worker settings are very basic:
We have seen the error with memory.size as high as 100GB.
ALLUXIO_READ_TYPE is "CACHE"
ALLUXIO_WRITE_TYPE is "CACHE_THROUGH"
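For concreteness, these settings would map to roughly the following in `conf/alluxio-site.properties` (property names from the Alluxio 2.x docs; the 100GB value is just the largest we tried, and this is an illustrative sketch, not our full config):

```properties
# Worker ramdisk cache size (we have tested values up to 100GB)
alluxio.worker.memory.size=100GB
# Default client read type: cache blocks on read
alluxio.user.file.readtype.default=CACHE
# Default client write type: write synchronously to Alluxio cache and the UFS
alluxio.user.file.writetype.default=CACHE_THROUGH
```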
The error happens after a while, and commands like `alluxio fs du -h /` may help trigger it, but we are not 100% sure.
This has been seen with both large and small files (the same 2TB of data split into anywhere from 33 to 16k files).
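For scale, a quick back-of-the-envelope sketch of the per-file sizes at the two ends of that split range (assuming the 2TB, read as 2 TiB here, is divided evenly; the numbers are only approximate):

```python
TB = 1024**4          # bytes in a TiB (treating "2TB" as 2 TiB)
total = 2 * TB        # total generated dataset size in bytes

# Per-file size at the two extremes of the split range
few_files = total / 33        # ~62 GiB per file
many_files = total / 16_000   # ~131 MiB per file

print(f"33 files  -> {few_files / 1024**3:.1f} GiB each")
print(f"16k files -> {many_files / 1024**2:.1f} MiB each")
```

So the failure spans both multi-GiB files and files near a typical block size.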
As noted yesterday on Slack, we moved to Alluxio 2.2.2 and started retesting. We have not seen the error after over 12 hours of testing and will continue to use Alluxio 2.2.2. If we see the same error there, we will post it to this ticket, but 2.2.2 looks better so far.
This is the backtrace we see. The Alluxio block number varies from run to run; it can be any number.
OS: CentOS Linux release 7.7.1908 (Core)
Alluxio: 2.3.0 - open-source tar file download
Spark: spark-2.4.6-bin-hadoop2.7 - open-source tar file download
Lustre ZFS : 2.12.2
100Gb OPA interconnect (not used with Alluxio)