train-local

This page explains how to use train-local.

Note: The ABEJA Platform CLI and Docker must be installed before you use train-local. Please refer to here for Platform CLI installation.
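As a quick sanity check, the following sketch reports whether the two prerequisites can be found on the PATH. It assumes the CLI is installed under the command name abeja (as used throughout this page) and Docker under docker; missing_tools is a hypothetical helper name introduced here, not part of either tool.

```shell
# Print the name of every given tool that cannot be found on PATH.
# missing_tools is a hypothetical helper, not an ABEJA or Docker command.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools abeja docker)
if [ -n "$missing" ]; then
  echo "Missing prerequisites:"
  echo "$missing"
else
  echo "abeja and docker are both available"
fi
```

This only confirms that the commands exist; it does not check versions or whether the Docker daemon is running.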

What is TRAIN-LOCAL

train-local lets you run training in your own local machine learning environment (for example, a machine with GPUs) while still using the model management and API serving features of the ABEJA Platform. This is useful when you want to make use of existing resources.

Create `training.yaml`

`training.yaml` is required for local training. Run the following command to create training.yaml.

$ abeja training init

After the command runs, training.yaml is created in the current directory. Refer to the following sample and edit the file as needed.

Please refer to here for a description of the file format.

■training.yaml (sample)

name: train-local-demo
handler: train:handler
image: abeja-inc/all-cpu:18.10
params:
  NUM_EPOCHS: '1'
  C: '1'
  MODEL_FILENAME: model.pkls
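Before creating the job definition, it can help to confirm that the edited file still contains the top-level keys used in the sample. The sketch below is a plain text check with grep, not a YAML parser, and it assumes that name, handler, and image are the keys you want present. check_training_yaml is a hypothetical helper, not an ABEJA command.

```shell
# Check that a file has the top-level keys used in the sample above.
# This is a line-prefix check only; it does not validate YAML syntax.
check_training_yaml() {
  file=$1
  rc=0
  for key in name handler image; do
    if ! grep -q "^${key}:" "$file"; then
      echo "missing key: ${key}"
      rc=1
    fi
  done
  return $rc
}

# Usage (only runs when the file exists in the current directory):
if [ -f training.yaml ]; then
  check_training_yaml training.yaml || echo "training.yaml needs editing"
fi
```

The function prints one line per missing key and returns a non-zero status if any key is absent.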

Create job definition

Run the following command in the directory that contains training.yaml.

$ abeja training create-job-definition

A job definition is created from training.yaml in the organization configured in “~/.abeja/config”.

Create version for job definition

Create the handler function in your training job code. Refer to here for a description of the handler. Once the training job code is ready, create a version for the job definition.

$ abeja training create-version
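The two registration steps above can be chained in a small script that stops at the first failure. This is a sketch under stated assumptions: run, register_training_code, and the DRY_RUN variable are hypothetical names introduced for illustration, and only the abeja subcommands shown on this page are used. How the CLI behaves when the job definition already exists is not covered on this page, so treat the script as a first-time setup aid.

```shell
# Execute a command, or just print it when DRY_RUN=1 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create the job definition, then a version, from the current directory.
register_training_code() {
  if [ ! -f training.yaml ]; then
    echo "training.yaml not found; run 'abeja training init' first" >&2
    return 1
  fi
  run abeja training create-job-definition &&
    run abeja training create-version
}

# Usage: preview with DRY_RUN=1 first, then run for real.
#   DRY_RUN=1 register_training_code
#   register_training_code
```

Previewing with DRY_RUN=1 prints the commands without contacting the platform, which makes it easy to confirm you are in the right directory first.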

Run the job definition in the local environment

$ abeja training train-local --help
Usage: abeja training train-local [OPTIONS]

  Local train commands

Options:
  -o, --organization_id, --organization-id TEXT
                                  Organization ID, organization_id of current
                                  credential organization is used by default
                                  [required]
  --name TEXT                     Training Job Definition Name  [required]
  --version TEXT                  Training Job Definition Version  [required]
  --description TEXT              Training Job description
  -d, --datasets DATASETPARAMSTRING
                                  Datasets name
  -e, --environment ENVIRONMENTSTRING
                                  Environment variables
  -v, --volume VOLUMEPARAMSTRING  Volume driver options, ex) /path/source/on/h
                                  ost:/path/destination/on/container
  --v1                            Specify if you use old custom runtime image
  --runtime TEXT                  Runtime, equivalent to docker run
                                  `--runtime` option
  --config PATH                   Read Configuration from PATH. By default
                                  read from `training.yaml`
  --help                          Show this message and exit.

(When training.yaml is in the directory where you run the command, the values defined in it are overridden by the options you pass on the command line.)

■ Command sample

■ When running with an already defined `training.yaml`

$ abeja training train-local --version 1 --environment NUM_EPOCHS:100 --environment C:3

After running the above command, you can check the running job, its logs, and the training results in the management console. (Use --environment to override the values set in the job definition version.)
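Because --environment can be given more than once, trying several values of one parameter is a matter of a loop around the same command. The following is a minimal sketch: sweep_c and the ABEJA_CMD variable are hypothetical names introduced here, and the values passed for C are arbitrary.

```shell
# Command used to invoke the CLI. Set ABEJA_CMD=echo to preview the
# arguments without running anything (hypothetical convenience variable).
ABEJA_CMD=${ABEJA_CMD:-abeja}

# Run one local training job per value of C, stopping at the first failure.
sweep_c() {
  for c in "$@"; do
    "$ABEJA_CMD" training train-local --version 1 \
      --environment NUM_EPOCHS:100 --environment "C:$c" || return 1
  done
}

# Usage:
#   sweep_c 1 3 10
```

Each run uses the KEY:VALUE form shown in the command sample above; the jobs run one after another, not in parallel.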

■ When using local data

$ abeja training train-local --version 1 --volume `pwd`:/data --environment NUM_EPOCHS:100 --environment C:3
# For example, when the data is in the current directory and you want it to be available at `/data` inside the training job.

Running the above command mounts the current directory into the environment (container) in which the job runs on your local machine.
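To mount a directory other than the current one, a small wrapper can check that it exists and resolve it to an absolute path first. This sketch assumes the host side of --volume behaves like a Docker bind mount, where an absolute path is the safe choice; train_with_data and ABEJA_CMD are hypothetical names, and /data is simply the destination used in the sample above.

```shell
# Command used to invoke the CLI. Set ABEJA_CMD=echo to preview the
# arguments without running anything (hypothetical convenience variable).
ABEJA_CMD=${ABEJA_CMD:-abeja}

# Mount the given host directory at /data and start a local training job.
train_with_data() {
  data_dir=$1
  if [ ! -d "$data_dir" ]; then
    echo "data directory not found: $data_dir" >&2
    return 1
  fi
  # Resolve to an absolute path before passing it to --volume.
  abs_dir=$(cd "$data_dir" && pwd)
  "$ABEJA_CMD" training train-local --version 1 --volume "$abs_dir:/data"
}

# Usage:
#   train_with_data ./my-dataset
```

Keep in mind that the mount is writable from inside the container, so point it at a copy if the job might modify the data.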

Note:

Logs are not sent in real time at present (as of August 2019).
TensorBoard cannot be used with local training.
Directories mounted with the --volume option are not set to read-only.
The command is planned to be renamed from train-local to debug-local.

Updated on 03 Apr 2018