It could be valuable to have an R language binding for Alluxio, especially for data scientists. The use case is the following: many data scientists still copy data from a Hadoop data lake onto a single edge node to do their work, because they find working with Hadoop too complicated. It would therefore be interesting to spare them that copy by letting them access the data through Alluxio directly from R on their edge node (Python is already supported).
In data science we need performance on the one hand; on the other hand, most data scientists are not data engineers who are comfortable developing with Hadoop or similar tools. They want to develop with ease, and that's it. I am speaking from what I see at my customers.
The FUSE implementation is too complicated for them; mounting file systems is not their concern. They just want quick, local access to their data, because their R libraries mostly work locally. But in modern data analysis, data increasingly lives in the data lake. So they copy their data from the data lake to their edge node, with all the inconvenience and security issues that involves (GDPR, for instance).
Moreover, the FUSE implementation is less performant: “Due to the conjunct use of FUSE and JNR, the performance of the mounted file system is expected to be worse than what you would see by using the Alluxio Java client directly.”
So it seemed to me that an R client alongside the Python one could be a good idea (and mandatory for my customer if we want to use Alluxio).
Thanks for the detailed description.
If you want to run R programs on Alluxio, I would suggest using the FUSE interface. The performance will be slightly worse than with the native Java client, but should be comparable to or better than that of a Python or R client which uses the Alluxio proxy.
In the use case described, you could mount the Alluxio namespace to a FUSE mount on the edge nodes (one mount per machine, a one-time cost). Users could then access data in Alluxio through the mounted directory. This removes the need for them to use the Hadoop API or copy the data to the edge node.
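As a rough sketch of what this looks like from the user's side: once the namespace is mounted, files are read with ordinary local file I/O and no Alluxio-specific client. The mount point `/mnt/alluxio`, the helper `read_rows`, and the file name in the usage note are assumptions made for illustration only; the actual mount path is whatever the administrator chose when mounting.

```python
import csv
from pathlib import Path

# Hypothetical mount point (an assumption for this sketch); the real path is
# chosen by whoever mounts the Alluxio namespace through FUSE.
ALLUXIO_MOUNT = Path("/mnt/alluxio")


def read_rows(relative_path, mount_point=ALLUXIO_MOUNT):
    """Read a CSV file located under the mount point with plain file I/O.

    Nothing here is Alluxio-specific: the mounted directory is treated
    like any other local directory.
    """
    full_path = Path(mount_point) / relative_path
    with open(full_path, newline="") as handle:
        return list(csv.DictReader(handle))
```

A call such as `read_rows("datasets/sales.csv")` (a made-up file name) would then return one dictionary per row. The same idea should carry over to R, where a plain `read.csv()` on a path under the mount point would be the equivalent.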
Was the FUSE interface able to solve your use case?
I have not tested it yet because I do not yet have the admin rights to do so. On paper it could, even if an R client would be more convenient from the user's point of view because, as I understand it, the user would not need to ask the system administrator to mount the remote FS...
On Wed, Nov 28, 2018 at 20:01, Calvin Jia (JIRA) <email@example.com> wrote:
You’re correct, the FUSE volume would need to be mounted by a user with mount rights.
The challenge with providing a native R client is that we would need to keep it up to date with our Java client. This has traditionally been very difficult and leads to clients behaving differently from one another, which is undesirable.