Upload your dataset

Introduction

This page explains how to upload publicly available benchmark datasets and previously annotated data to ABEJA Platform.

Prerequisites

Create a Datalake channel and upload data

First, create a Datalake channel.

from abeja.datalake import Client as DatalakeClient
from abeja.datalake.storage_type import StorageType

ABEJA_ORGANIZATION_ID = 'XXXXXXXXXXXXXX'
ABEJA_PLATFORM_USER_ID = 'XXXXXXXXXXXXXX'
ABEJA_PLATFORM_TOKEN = 'XXXXXXXXXXXXXX'

credential = {
    'user_id': ABEJA_PLATFORM_USER_ID,
    'personal_access_token': ABEJA_PLATFORM_TOKEN
}

datalake_client = DatalakeClient(organization_id=ABEJA_ORGANIZATION_ID, credential=credential)

name = 'XXXXXXXXXXXXXXXXXXX'
description = 'XXXXXXXXXXXXXXXXXXXXXXX'

channel = datalake_client.channels.create(name, description, StorageType.DATALAKE.value)

Next, upload the data.

channel = datalake_client.get_channel(channel.channel_id)
file = channel.upload_file('cat.jpg')
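Benchmark datasets usually consist of many files rather than a single image. The following is a minimal sketch that loops over a local directory and calls upload_file, the upload call shown above, once per matching file. The helper name upload_directory, the directory layout, and the '*.jpg' pattern are assumptions made for illustration; they are not part of the ABEJA SDK.

```python
from pathlib import Path


def upload_directory(channel, directory, pattern='*.jpg'):
    """Upload every file in `directory` that matches `pattern`.

    `channel` is any object exposing upload_file(path), such as the
    channel obtained above. Returns a dict that maps each local path
    (as a string) to the object returned by upload_file.
    """
    uploaded = {}
    # Sort the paths so the upload order is reproducible between runs.
    for path in sorted(Path(directory).glob(pattern)):
        uploaded[str(path)] = channel.upload_file(str(path))
    return uploaded
```

Keeping the returned mapping makes it possible to look up the uploaded file for each local image later, when the annotations are registered.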

Create Dataset (Classification)

Configure the dataset as follows. First, define the schema (class information). As an example, consider a two-class dog/cat classification problem. For multi-class classification, add multiple categories to the categories list.

from abeja.datasets import Client as DatasetClient
datasets_client = DatasetClient(organization_id=ABEJA_ORGANIZATION_ID, credential=credential)

labels = [{"label_id": 0, "label": "dog"}, {"label_id": 1, "label": "cat"}]
category = {'labels': labels, 'category_id': 0, 'name': 'cats_dogs'}
props = {"categories": [category]}

dataset = datasets_client.datasets.create(name='XXXXXXXXXXXXX', type='classification', props=props)
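When a benchmark has many classes, writing the labels list by hand is error-prone. As a sketch, the same schema structure can be generated from a list of class names. The helper name build_classification_props is hypothetical, and numbering label_id by position is an assumption; the function only builds the dictionary shown above and does not call the SDK.

```python
def build_classification_props(label_names, category_id=0, category_name='cats_dogs'):
    """Build a `props` dict with a single category.

    Each name in `label_names` gets a label_id equal to its position
    in the list, starting at 0.
    """
    labels = [
        {'label_id': label_id, 'label': name}
        for label_id, name in enumerate(label_names)
    ]
    category = {
        'labels': labels,
        'category_id': category_id,
        'name': category_name,
    }
    return {'categories': [category]}


props = build_classification_props(['dog', 'cat'])
```

The resulting props has the same content as the hand-written version and can be passed to datasets.create in the same way.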

Upload annotation data as follows.

source_data = [
    {
        'data_type': 'image/jpeg',
        'data_uri': 'datalake://{}/{}'.format(channel.channel_id, file.file_id),
    }
]

data = {
    'category_id': 0,
    'label_id': 1
}
attributes = {'classification': [data]}

dataset_item = dataset.dataset_items.create(source_data=source_data, attributes=attributes)
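For a dataset with more than one image, the source_data and attributes pair has to be built once per file. The sketch below wraps the two structures shown above in a small function. The name build_classification_item and its parameters are hypothetical, and the IDs in the usage line are made-up placeholders.

```python
def build_classification_item(channel_id, file_id, label_id,
                              category_id=0, data_type='image/jpeg'):
    """Return (source_data, attributes) for one classified file."""
    source_data = [
        {
            'data_type': data_type,
            # Same URI format as above: datalake://<channel_id>/<file_id>
            'data_uri': 'datalake://{}/{}'.format(channel_id, file_id),
        }
    ]
    attributes = {
        'classification': [
            {'category_id': category_id, 'label_id': label_id}
        ]
    }
    return source_data, attributes


# Placeholder IDs for illustration only.
source_data, attributes = build_classification_item('1234567890123', 'example-file-id', 1)
```

The two return values can be passed to dataset_items.create exactly as in the single-file example.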

Create Dataset (Detection)

For detection, the procedure is as follows. The schema is the same two-class dog/cat schema as before. Change type to detection.

from abeja.datasets import Client as DatasetClient
datasets_client = DatasetClient(organization_id=ABEJA_ORGANIZATION_ID, credential=credential)

labels = [{"label_id": 0, "label": "dog"}, {"label_id": 1, "label": "cat"}]
category = {'labels': labels, 'category_id': 0, 'name': 'cats_dogs'}
props = {"categories": [category]}

dataset = datasets_client.datasets.create(name='XXXXXXXXXXXXX', type='detection', props=props)

Upload annotation data as follows.

source_data = [
    {
        'data_type': 'image/jpeg',
        'data_uri': 'datalake://{}/{}'.format(channel.channel_id, file.file_id),
    }
]

rect = {'xmin': 200, 'ymin': 0, 'xmax': 1000, 'ymax': 900}
det1 = {
    'category_id': 0,
    'label_id': 1,
    'rect': rect
}
attributes = {'detection': [det1]}

dataset_item = dataset.dataset_items.create(source_data=source_data, attributes=attributes)
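Public detection benchmarks do not all store boxes as corner coordinates. COCO annotations, for example, give each box as [x, y, width, height]. The sketch below converts such a box into the xmin/ymin/xmax/ymax form used above. The function name xywh_to_rect is hypothetical, and the sketch assumes both formats use the same pixel coordinate system.

```python
def xywh_to_rect(box):
    """Convert [x, y, width, height] to an xmin/ymin/xmax/ymax dict."""
    x, y, width, height = box
    if width <= 0 or height <= 0:
        # A box with no area usually indicates a broken annotation.
        raise ValueError('box must have positive width and height: {}'.format(box))
    return {'xmin': x, 'ymin': y, 'xmax': x + width, 'ymax': y + height}


# Reproduces the rect used in the example above.
rect = xywh_to_rect([200, 0, 800, 900])
```

Rejecting empty boxes at conversion time surfaces annotation problems before any data is registered.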

Create Dataset (Custom)

In addition to Classification and Detection, free-form annotations can be used. In this case, set type to custom.

from abeja.datasets import Client as DatasetClient
datasets_client = DatasetClient(organization_id=ABEJA_ORGANIZATION_ID, credential=credential)

labels = [{"label_id": 0, "label": "dog"}, {"label_id": 1, "label": "cat"}]
category = {'labels': labels, 'category_id': 0, 'name': 'cats_dogs'}
props = {"categories": [category]}

dataset = datasets_client.datasets.create(name='XXXXXXXXXXXXX', type='custom', props=props)

Upload annotation data as follows.

source_data = [
    {
        'data_type': 'image/jpeg',
        'data_uri': 'datalake://{}/{}'.format(channel.channel_id, file.file_id),
    }
]
d = {
    'category_id': 0,
    'label_id': 1,
    'text': 'nyaan'
}
attributes = {'custom': [d]}
dataset_item = dataset.dataset_items.create(source_data=source_data, attributes=attributes)

Check the data

Finally, retrieve an item from the dataset and check its image and annotation.

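import io

from PIL import Image  # Pillow, used to decode the image bytes
from abeja.datasets import Client as DatasetClient

client = DatasetClient(organization_id=ABEJA_ORGANIZATION_ID, credential=credential)

# Replace the placeholder with the ID of the dataset created above.
dataset = client.get_dataset('XXXXXXXXXXXX')

dataset_list = list(dataset.dataset_items.list(prefetch=False))

# Take the first item and load its image content.
d = dataset_list[0]
file_content = d.source_data[0].get_content()
file_like_object = io.BytesIO(file_content)

img = Image.open(file_like_object)
# The annotation registered for this item.
annotation = d.attributes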
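The annotation holds numeric category_id and label_id values, so it is often easier to read after mapping them back to label names. The following sketch does this with the schema dictionary defined earlier. It assumes the attributes come back in the same shape in which they were uploaded, which this page does not confirm; the function name resolve_labels is hypothetical.

```python
def resolve_labels(attributes, props, annotation_type='classification'):
    """Return the label names referenced by an item's attributes."""
    # Index the schema by (category_id, label_id) for direct lookup.
    lookup = {
        (category['category_id'], label['label_id']): label['label']
        for category in props['categories']
        for label in category['labels']
    }
    return [
        lookup[(entry['category_id'], entry['label_id'])]
        for entry in attributes.get(annotation_type, [])
    ]


# Schema and annotation in the same shape as the examples above.
example_props = {
    'categories': [
        {
            'labels': [{'label_id': 0, 'label': 'dog'}, {'label_id': 1, 'label': 'cat'}],
            'category_id': 0,
            'name': 'cats_dogs',
        }
    ]
}
example_attributes = {'classification': [{'category_id': 0, 'label_id': 1}]}
names = resolve_labels(example_attributes, example_props)
```

For detection or custom datasets, pass annotation_type='detection' or annotation_type='custom' to read the corresponding key.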