Data acquisition and dataset creation

Introduction

This page explains how to use Jupyter Notebook to acquire training data, upload it to DataLake, and create a dataset.

Step 1

Launch Jupyter Notebook

This tutorial runs in a notebook. To use a notebook, first create a “job definition”.

Create a “Job definition” and start a “Notebook”.

Step 2

Launch Terminal from Jupyter Notebook

Open Terminal from the launched notebook and prepare a place to store the tutorial notebook file.

Open “Notebook”, then open Terminal.

Step 3

Get the tutorial data in Terminal

Run the following command in the launched Terminal to download the tutorial notebook files from GitHub.

$ git clone https://github.com/abeja-inc/Platform_handson.git

You are now ready to store data and create datasets from Jupyter Notebook.

Step 4

Download, decompress, and check the data in Jupyter Notebook

In this step, the data used for training is downloaded, decompressed, and checked.

Select and open the notebook file 01_collect_data_en.ipynb in the tutorial folder bording.

The notebook implements the following operations:

  • Data download: Get the flower images used for training from Google Drive
  • Data decompression: Unzip the compressed file
  • File check: Check the number of files and the folder names
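The decompression and file-check operations can be sketched with the Python standard library alone. This is a minimal illustration, not the code in the tutorial notebook: the function names `extract_archive` and `count_files_per_folder`, the archive name, and the folder layout (one folder per flower class) are assumptions made for the example.

```python
import zipfile
from collections import Counter
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unzip archive_path into dest_dir and return the destination as a Path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
    return dest


def count_files_per_folder(root_dir):
    """Return {folder name: number of files directly inside that folder}."""
    counts = Counter()
    for path in Path(root_dir).rglob("*"):
        if path.is_file():
            counts[path.parent.name] += 1
    return dict(counts)


# Hypothetical usage: the archive name below is an assumption.
# data_dir = extract_archive("flowers.zip", "data")
# print(count_files_per_folder(data_dir))
```

Counting files per folder gives a quick check that every class folder was extracted and that no class is unexpectedly empty.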

Step 5

Create a DataLake channel

Create a channel in DataLake to store the acquired data.
Note: A channel name that is already in use cannot be given to a new DataLake channel.

Step 6

Upload data to DataLake

Upload the data to the DataLake channel. First, set the authentication information required for uploading. Refer to the platform documentation for how to check your authentication information.

Enter the following authentication information into the notebook and execute it.

  • User ID
  • Personal Access Token
  • Organization ID

Enter the ID of the DataLake channel you created and execute the cell. This starts the upload to the DataLake channel. When the progress reaches 100%, open the DataLake channel and confirm that the files are stored correctly.
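The upload loop with a progress percentage might look roughly like the sketch below. It does not call the platform SDK: `upload_one` is a hypothetical callable standing in for whatever upload call the tutorial notebook makes against the DataLake channel, and `upload_all` is a name chosen for this example.

```python
from pathlib import Path


def upload_all(file_paths, upload_one):
    """Call upload_one(path) for every file and print progress as a percentage.

    upload_one is a hypothetical callable that wraps the actual upload
    to the DataLake channel; it is an assumption, not a real SDK function.
    """
    paths = [Path(p) for p in file_paths]
    uploaded = []
    for done, path in enumerate(paths, start=1):
        upload_one(path)
        uploaded.append(path)
        percent = 100 * done // len(paths)
        print(f"{percent:3d}% ({done}/{len(paths)}) {path.name}")
    return uploaded
```

Passing the upload call in as a parameter keeps the loop easy to try out locally, for example with `upload_all(paths, print)`, before any real data is sent.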

Step 7

Create a dataset

Create a dataset.

First, run the notebook to get the information associated with the dataset in JSON format.

Next, create a dataset from the left menu. This tutorial performs classification, so set “Dataset Type” to “Classification”.

Paste the output JSON into the “Property” field and execute “Create Data Set”.

The sample uses the following JSON.

{
  "categories": [
    {
      "category_id": 0,
      "labels": [
        {
          "label": "daisy",
          "label_id": 0
        },
        {
          "label": "dandelion",
          "label_id": 1
        },
        {
          "label": "rose",
          "label_id": 2
        },
        {
          "label": "sunflower",
          "label_id": 3
        },
        {
          "label": "tulip",
          "label_id": 4
        }
      ],
      "name": "flower-classificaiton"
    }
  ]
}
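JSON of this shape can be generated from a list of label names, for example the class folder names in sorted order. The following is a minimal sketch rather than the notebook's own code: `build_properties` is a hypothetical helper, and the category name is copied verbatim from the sample above.

```python
import json


def build_properties(category_name, label_names, category_id=0):
    """Build dataset properties with one category and sequential label IDs."""
    labels = [
        {"label": name, "label_id": index}
        for index, name in enumerate(label_names)
    ]
    return {
        "categories": [
            {
                "category_id": category_id,
                "labels": labels,
                "name": category_name,
            }
        ]
    }


# Label names as they appear in the sample JSON.
flower_labels = ["daisy", "dandelion", "rose", "sunflower", "tulip"]
properties = build_properties("flower-classificaiton", flower_labels)
print(json.dumps(properties, indent=2))
```

Because label IDs are assigned by position, the order of the label names must stay the same whenever the JSON is regenerated.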

Finally, label the data and create the dataset. Look up the ID of the dataset you created earlier and enter it into “dataset_id”.
In this example, the file extension is checked during this step so that files other than images are not added to the dataset.
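The extension check and labeling can be illustrated as follows. This sketch assumes that each image sits in a folder named after its label and that the listed extensions cover the image files; `IMAGE_EXTENSIONS` and `label_images` are hypothetical names, and registering the labeled items under “dataset_id” is left to the notebook.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the actual data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def label_images(file_paths, label_ids):
    """Return (path, label_id) pairs for image files in known label folders.

    Files with a non-image extension, or in a folder that is not a key of
    label_ids, are skipped.
    """
    labeled = []
    for path in map(Path, file_paths):
        if path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # not an image, so it is left out
        label_id = label_ids.get(path.parent.name)
        if label_id is None:
            continue  # folder name does not match any label
        labeled.append((path, label_id))
    return labeled


# Label IDs as defined in the sample JSON.
flower_label_ids = {"daisy": 0, "dandelion": 1, "rose": 2, "sunflower": 3, "tulip": 4}
```

Comparing the extension in lower case means that files such as `1.JPG` are still treated as images.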

Well done. You have performed data acquisition, data upload, and dataset creation from Notebook.

Next, we will explain “Learning / Model creation” using the template function.