Merge Datasets

This topic describes how to merge two datasets into one.

Take the Oxford-IIIT Pet and Dogs vs Cats datasets as examples. Their structures look like this:

Oxford-IIIT Pet/
    test/
        Abyssinian_002.jpg
        ...
    trainval/
        Abyssinian_001.jpg
        ...

Dogs vs Cats/
    test/
        1.jpg
        10.jpg
        ...
    train/
        cat.0.jpg
        cat.1.jpg
        ...

Both datasets contain many pictures of cats and dogs. Merging them produces a more diverse dataset.

Note

Before merging datasets, fork both of the open datasets first.

Create a dataset named mergedDataset.

from tensorbay import GAS

# Please visit `https://gas.graviti.com/tensorbay/developer` to get the AccessKey.
gas = GAS("<YOUR_ACCESSKEY>")
dataset_client = gas.create_dataset("mergedDataset")
dataset_client.create_draft("merge dataset")

Copy all segments in OxfordIIITPet to mergedDataset.

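pet_dataset_client = gas.get_dataset("OxfordIIITPet")
# Copy the source segment "trainval" into a target segment named "train",
# so that it lines up with the "train" segment of Dogs vs Cats.
dataset_client.copy_segment("trainval", target_name="train", source_client=pet_dataset_client)
dataset_client.copy_segment("test", source_client=pet_dataset_client)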

Use the catalog of OxfordIIITPet as the catalog of the merged dataset.

dataset_client.upload_catalog(pet_dataset_client.get_catalog())

Unify the categories of the train segment.

from tensorbay.dataset import Data

segment_client = dataset_client.get_segment("train")
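for remote_data in segment_client.list_data():
    # Build a Data object for the same path and carry over the existing label.
    data = Data(remote_data.path)
    data.label = remote_data.label
    # Keep only the top-level category, e.g. "cat.Abyssinian" becomes "cat".
    data.label.classification.category = data.label.classification.category.split(".")[0]
    segment_client.upload_label(data)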

Note

Categories in OxfordIIITPet have two levels, such as cat.Abyssinian, whereas categories in Dogs vs Cats have only one level, such as cat. The categories therefore need to be unified, for example by renaming cat.Abyssinian to cat.
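
The renaming rule can be tried out without connecting to TensorBay. The following is a minimal sketch in plain Python. The helper name `to_top_level_category` and the sample category strings are illustrative assumptions and are not read from the real catalogs.

```python
def to_top_level_category(category: str) -> str:
    """Return the part of a dotted category name before the first dot."""
    return category.split(".")[0]


# Illustrative category names only (assumed, not read from the real catalogs).
oxford_categories = ["cat.Abyssinian", "dog.beagle"]
dogs_vs_cats_categories = ["cat", "dog"]

# After unification both datasets share the same one-level categories.
unified = sorted(
    {to_top_level_category(c) for c in oxford_categories + dogs_vs_cats_categories}
)
print(unified)
```

A name without a dot is returned unchanged, so applying the rule to categories that are already one-level is harmless.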

Copy data from Dogs vs Cats to mergedDataset.

dogs_vs_cats_client = gas.get_dataset("DogsVsCats")
for name in ["test", "train"]:
    source_segment_client = dogs_vs_cats_client.get_segment(name)
    segment_client = dataset_client.get_segment(name)
    # Copy every data file of the source segment into the segment of the same name.
    segment_client.copy_data(
        source_segment_client.list_data_paths(), source_client=source_segment_client
    )
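
Because both datasets contribute files to the same segments, it may be worth checking for clashing paths before copying. The following is a minimal sketch that works on plain lists of paths rather than on TensorBay clients. The helper `find_path_conflicts` is hypothetical, and the sample paths are taken from the directory listings at the top of this topic. In a real workflow the lists could presumably be built from `list_data_paths()` on each segment client.

```python
from typing import Iterable, List


def find_path_conflicts(
    existing_paths: Iterable[str], incoming_paths: Iterable[str]
) -> List[str]:
    """Return the incoming paths that already exist in the target, sorted."""
    return sorted(set(existing_paths) & set(incoming_paths))


# Paths already in the merged "train" segment (copied from OxfordIIITPet).
train_from_oxford = ["Abyssinian_001.jpg"]
# Paths about to be copied from the "train" segment of Dogs vs Cats.
train_from_dogs_vs_cats = ["cat.0.jpg", "cat.1.jpg"]

conflicts = find_path_conflicts(train_from_oxford, train_from_dogs_vs_cats)
if conflicts:
    print("Conflicting paths:", conflicts)
else:
    print("No conflicting paths; the segments can be merged as they are.")
```

For the sample paths the two naming schemes do not overlap, so the check reports no conflicts.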