Support a distributed job management service in Alluxio

Description

This JIRA aims to add a service to Alluxio that can run simple I/O-related jobs in a distributed framework.

Each job is defined in a pthread-like programming model and runs across a set of workers. The computation assigned to a single worker is called a task, and the I/O work is defined inside the task.
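As a rough illustration of the job/task split described above, here is a minimal sketch. The names `Job`, `plan`, `run_task`, `split_round_robin` and the worker IDs are hypothetical and not part of any existing Alluxio API. The task body only counts paths as a stand-in for real I/O.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    """Hypothetical job: a planning step that splits work, plus a per-task body."""
    # Maps the available workers to the slice of work each one receives.
    plan: Callable[[List[str]], Dict[str, List[str]]]
    # The I/O work executed inside a single task on a single worker.
    run_task: Callable[[str, List[str]], int]


def split_round_robin(paths: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Assign paths to workers in round-robin order; idle workers get no task."""
    assignment: Dict[str, List[str]] = {}
    for i, path in enumerate(paths):
        assignment.setdefault(workers[i % len(workers)], []).append(path)
    return assignment


def count_paths(worker: str, paths: List[str]) -> int:
    """Stand-in for real I/O: report how many paths this task handled."""
    return len(paths)


paths = ["/a", "/b", "/c", "/d", "/e"]
job = Job(plan=lambda ws: split_round_robin(paths, ws), run_task=count_paths)

# One task per worker that received work.
tasks = job.plan(["worker-1", "worker-2"])
results = {w: job.run_task(w, assigned) for w, assigned in tasks.items()}
```

In this sketch the planning step runs once for the whole job, while `run_task` runs once per worker, which is the part that resembles a thread body in a pthread-style model.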

A job, once submitted, is queued on the master and then distributed to the worker nodes. When a task completes or fails, its result is returned to the master. A job is complete if all of its tasks complete successfully, and is considered failed if any task fails. A failed job triggers a retry. Each job is expected to be idempotent, so retrying a job introduces no side effects.
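The lifecycle above can be sketched in a single process as follows. This is only an illustration under several assumptions: `master_loop`, `run_tasks`, the status strings, and the `max_attempts` retry cap are hypothetical (the description does not say whether retries are bounded), and tasks are plain callables that raise on failure rather than remote calls to workers.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Task = Callable[[], None]  # hypothetical: a task raises an exception on failure


def run_tasks(tasks: Dict[str, Task]) -> Dict[str, bool]:
    """Run each worker's task and record whether it succeeded."""
    outcome = {}
    for worker, task in tasks.items():
        try:
            task()
            outcome[worker] = True
        except Exception:
            outcome[worker] = False
    return outcome


def master_loop(jobs: List[Tuple[str, Dict[str, Task]]],
                max_attempts: int = 3) -> Dict[str, str]:
    """Drain a FIFO job queue, retrying failed jobs up to max_attempts (assumed cap)."""
    queue = deque((name, tasks, 1) for name, tasks in jobs)
    status = {}
    while queue:
        name, tasks, attempt = queue.popleft()
        if all(run_tasks(tasks).values()):
            status[name] = "COMPLETED"  # every task succeeded
        elif attempt < max_attempts:
            queue.append((name, tasks, attempt + 1))  # retry the whole job
        else:
            status[name] = "FAILED"
    return status


store = set()
calls = {"n": 0}


def persist_a() -> None:
    store.add("/a")  # idempotent: re-adding the same path changes nothing


def flaky_persist_b() -> None:
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("transient failure")
    store.add("/b")


def always_fails() -> None:
    raise IOError("permanent failure")


status = master_loop([
    ("persist", {"worker-1": persist_a, "worker-2": flaky_persist_b}),
    ("broken", {"worker-1": always_fails}),
])
```

Because `persist_a` is idempotent, rerunning it during the retry of the `persist` job leaves `store` unchanged, which is the property that makes whole-job retries safe.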

This new job service in Alluxio will enable the following operations (in the initial implementation):

  • Async persistence (without the current limitation)

  • Replication enforcement, so users can specify the number of copies of a file

  • Distributed move/cp
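To make the replication-enforcement item concrete, one possible way to turn a target copy count into per-file work is sketched below. The function `replication_plan`, its dictionary layout, and the worker-selection order are assumptions made for illustration. They do not describe how Alluxio chooses workers.

```python
from typing import Dict, List, Set


def replication_plan(file_locations: Dict[str, Set[str]],
                     workers: List[str],
                     target: int) -> Dict[str, Dict[str, List[str]]]:
    """For each file, pick workers to copy to or evict from to reach `target` copies."""
    plan: Dict[str, Dict[str, List[str]]] = {}
    for path, holders in sorted(file_locations.items()):
        diff = target - len(holders)
        if diff > 0:
            # Under-replicated: copy to workers that do not hold the file yet.
            candidates = [w for w in workers if w not in holders]
            plan[path] = {"replicate_to": candidates[:diff], "evict_from": []}
        elif diff < 0:
            # Over-replicated: drop the surplus copies.
            plan[path] = {"replicate_to": [], "evict_from": sorted(holders)[:-diff]}
    return plan


plan = replication_plan(
    {"/f1": {"w1"}, "/f2": {"w1", "w2", "w3"}, "/f3": {"w1", "w2"}},
    workers=["w1", "w2", "w3"],
    target=2,
)
```

Files already at the target count, such as `/f3` here, produce no entry, so rerunning the plan after it has been applied yields no further work.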

Environment

None

Status

Assignee

Bin Fan

Reporter

Bin Fan

Labels

Components

Fix versions

Affects versions

master

Priority

Major