Move And Copy

This topic describes four TensorBay dataset operations: copying a segment, moving a segment, copying data, and moving data.

Take the Oxford-IIIT Pet dataset as an example. Its structure looks like:

datasets/
    test/
        Abyssinian_002.jpg
        ...
    trainval/
        Abyssinian_001.jpg
        ...

Note

Before operating on this dataset, fork it first.

Get the dataset client.

from tensorbay import GAS

gas = GAS("<YOUR_ACCESSKEY>")
dataset_client = gas.get_dataset("OxfordIIITPet")
dataset_client.list_segment_names()
# test, trainval

There are currently two segments: test and trainval.

Copy Segment

Copy segment test to test_1.

dataset_client.create_draft("draft-1")
segment_client = dataset_client.copy_segment("test", "test_1")
segment_client.name
# test_1
dataset_client.list_segment_names()
# test, test_1, trainval
dataset_client.commit("copy test segment to test_1 segment")

Move Segment

Move segment test to test_2.

dataset_client.create_draft("draft-2")
segment_client = dataset_client.move_segment("test", "test_2")
segment_client.name
# test_2
dataset_client.list_segment_names()
# test_1, trainval, test_2
dataset_client.commit("move test segment to test_2 segment")

Copy Data

Copy all data whose names start with Abyssinian from both the test_1 and trainval segments to a new abyssinian segment.

dataset_client.create_draft("draft-3")
target_segment_client = dataset_client.create_segment("abyssinian")
for name in ["test_1", "trainval"]:
    segment_client = dataset_client.get_segment(name)
    copy_files = []
    for file_name in segment_client.list_data_paths():
        if file_name.startswith("Abyssinian"):
            copy_files.append(file_name)
    target_segment_client.copy_data(copy_files, source_client=segment_client)

dataset_client.list_segment_names()
# test_1, test_2, trainval, abyssinian
dataset_client.commit("add abyssinian segment")
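
Because a misspelled prefix silently selects nothing, it can help to check the filter locally before running the copy. The following is a minimal sketch that needs no TensorBay connection; `select_by_prefix` is a hypothetical helper, not part of the SDK, and the sample file names are illustrative only.

```python
from typing import Iterable, List


def select_by_prefix(paths: Iterable[str], prefix: str) -> List[str]:
    """Return the paths that start with ``prefix``, preserving their order."""
    return [path for path in paths if path.startswith(prefix)]


# Illustrative file names in the style of the dataset; not real listing output.
sample_paths = ["Abyssinian_002.jpg", "Bengal_010.jpg", "Abyssinian_017.jpg"]
selected = select_by_prefix(sample_paths, "Abyssinian")
print(selected)
```

In a real run, the list returned by the helper would be passed to copy_data in place of copy_files.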

Move Data

Split trainval segment into train and val:

  1. Move 500 randomly selected data from the trainval segment to a new val segment.

  2. Move the remaining trainval segment to train.

import random

dataset_client.create_draft("draft-4")
val_segment_client = dataset_client.create_segment("val")
trainval_segment_client = dataset_client.get_segment("trainval")

# list_data_paths returns a lazy list; getting and deleting data at the same time
# is not supported, so load all paths into a regular list first.
data_paths = list(trainval_segment_client.list_data_paths())

# Generate 500 random numbers.
val_random_numbers = random.sample(range(0, len(data_paths)), 500)

# Get the data path list by random index list.
val_random_paths = [data_paths[index] for index in val_random_numbers]

# Move all data in the val random path list from the trainval segment to the val segment.
val_segment_client.move_data(val_random_paths, source_client=trainval_segment_client)
dataset_client.move_segment("trainval", "train")

dataset_client.list_segment_names()
# train, val, test_1, test_2, abyssinian
dataset_client.commit("split train and val segment")
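
The selection step above can also be tested locally before any data is moved. The following is a sketch under stated assumptions: `split_paths` is a hypothetical helper, not part of the SDK, and the fixed seed is added here only so the split is reproducible. It also rejects a validation size larger than the number of paths, which would otherwise make random.sample raise.

```python
import random
from typing import List, Sequence, Tuple


def split_paths(
    paths: Sequence[str], val_size: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split ``paths`` into (train, val) with ``val_size`` randomly chosen val items."""
    if not 0 <= val_size <= len(paths):
        raise ValueError(f"val_size must be between 0 and {len(paths)}, got {val_size}")
    rng = random.Random(seed)  # Local generator; the global random state is untouched.
    val_indices = set(rng.sample(range(len(paths)), val_size))
    train = [path for index, path in enumerate(paths) if index not in val_indices]
    val = [path for index, path in enumerate(paths) if index in val_indices]
    return train, val


# Placeholder names; a real run would pass list(trainval_segment_client.list_data_paths()).
example_paths = [f"image_{index:04d}.jpg" for index in range(2000)]
train_paths, val_paths = split_paths(example_paths, 500, seed=42)
```

The val list would then be passed to move_data, and the data left in trainval corresponds to the train list.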

Note

When a segment is copied, the storage space of its data is only calculated once.

Note

TensorBay SDK supports three strategies for resolving the conflict when the target segment or data already exists. The strategy can be set as a keyword argument in the functions above.

  • abort (default): abort the process by raising InternalServerError.

  • skip: skip moving or copying segment/data.

  • override: override the whole target segment/data with the source segment/data.
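
The difference between the three strategies can be illustrated with a small local model that treats a dataset as a dictionary of segment names. This is only a sketch of the behavior described in the list, not the SDK implementation: `copy_segment_model` is a hypothetical function, it raises ValueError instead of InternalServerError, and the parameter name `strategy` is the model's own choice, so check the SDK's API reference for the actual keyword.

```python
from typing import Dict, List


def copy_segment_model(
    segments: Dict[str, List[str]], source: str, target: str, strategy: str = "abort"
) -> None:
    """Local model of the three conflict strategies; this is not the SDK itself."""
    if strategy not in ("abort", "skip", "override"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if target in segments:
        if strategy == "abort":
            # The SDK is documented to raise InternalServerError; the model uses ValueError.
            raise ValueError(f"segment {target!r} already exists")
        if strategy == "skip":
            return  # Leave the existing target untouched.
    # No conflict, or "override": the target becomes a copy of the source.
    segments[target] = list(segments[source])


segments = {"test_1": ["a.jpg"], "test_2": ["b.jpg"]}
copy_segment_model(segments, "test_1", "test_2", strategy="skip")
after_skip = list(segments["test_2"])
copy_segment_model(segments, "test_1", "test_2", strategy="override")
after_override = list(segments["test_2"])
```

In this model, skip leaves test_2 as it was, override replaces its contents with those of test_1, and the default abort raises as soon as the target name is already taken.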