Merge Datasets
This topic describes the merge dataset operation.
Take the Oxford-IIIT Pet and Dogs vs Cats datasets as examples. Their structures look like:
Oxford-IIIT Pet/
    test/
        Abyssinian_002.jpg
        ...
    trainval/
        Abyssinian_001.jpg
        ...
Dogs vs Cats/
    test/
        1.jpg
        10.jpg
        ...
    train/
        cat.0.jpg
        cat.1.jpg
        ...
Both datasets contain many pictures of cats and dogs, so merging them yields a more diverse dataset.
Note
Before merging datasets, fork both of the open datasets first.
Create a dataset named mergedDataset.
from tensorbay import GAS
ACCESS_KEY = "Accesskey-*****"
gas = GAS(ACCESS_KEY)
dataset_client = gas.create_dataset("mergedDataset")
dataset_client.create_draft("merge dataset")
Copy all segments in OxfordIIITPet to mergedDataset.
pet_dataset_client = gas.get_dataset("OxfordIIITPet")
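# The source segment name comes first: "trainval" in OxfordIIITPet becomes "train" in mergedDataset.
dataset_client.copy_segment("trainval", target_name="train", source_client=pet_dataset_client)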
dataset_client.copy_segment("test", source_client=pet_dataset_client)
Use the catalog of OxfordIIITPet as the catalog of the merged dataset.
dataset_client.upload_catalog(pet_dataset_client.get_catalog())
Unify the categories of the train segment.
from tensorbay.dataset import Data

segment_client = dataset_client.get_segment("train")
for remote_data in segment_client.list_data():
    data = Data(remote_data.path)
    data.label = remote_data.label
    data.label.classification.category = data.label.classification.category.split(".")[0]
    segment_client.upload_label(data)
Note
The category in OxfordIIITPet uses a two-level format, like cat.Abyssinian, but in Dogs vs Cats it has only one level, like cat.
Thus it is important to unify the categories, for example, by renaming cat.Abyssinian to cat.
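The renaming rule described in this note can be tried locally without a TensorBay connection. The following is a minimal sketch in plain Python; the helper name unify_category is an assumption introduced here for illustration and is not part of the TensorBay SDK. It applies the same split(".")[0] expression used in the step above.

```python
def unify_category(category: str) -> str:
    """Return the top-level part of a dot-separated category name.

    Hypothetical helper for illustration; not a TensorBay API.
    """
    # "cat.Abyssinian" -> ["cat", "Abyssinian"] -> "cat"
    # A one-level name such as "cat" has no dot and is returned unchanged.
    return category.split(".")[0]


# Category names taken from the note above.
for name in ["cat.Abyssinian", "cat"]:
    print(name, "->", unify_category(name))
```

Because one-level names pass through unchanged, applying the rule more than once to the same segment should not alter the result.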
Copy data from Dogs vs Cats to mergedDataset.
pet_dataset_client = gas.get_dataset("DogsVsCats")
for name in ["test", "train"]:
    source_segment_client = pet_dataset_client.get_segment(name)
    segment_client = dataset_client.get_segment(name)
    segment_client.copy_data(
        source_segment_client.list_data_paths(), source_client=source_segment_client
    )
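To see which files end up in which segment, the whole merge can be modeled with plain dictionaries. This is only an illustrative sketch: the dictionaries and the merge_segments function are hypothetical stand-ins for the remote datasets and do not call the TensorBay API. The file names come from the directory listings at the top of this topic, and the sketch assumes the trainval segment of OxfordIIITPet becomes train in mergedDataset.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins: segment name -> file names.
oxford_pet: Dict[str, List[str]] = {
    "trainval": ["Abyssinian_001.jpg"],
    "test": ["Abyssinian_002.jpg"],
}
dogs_vs_cats: Dict[str, List[str]] = {
    "train": ["cat.0.jpg", "cat.1.jpg"],
    "test": ["1.jpg", "10.jpg"],
}


def merge_segments(
    pet: Dict[str, List[str]], dvc: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Model the merge: copy the pet segments, then append Dogs vs Cats data."""
    # Step 1: copy the OxfordIIITPet segments, renaming "trainval" to "train".
    merged = {"train": list(pet["trainval"]), "test": list(pet["test"])}
    # Step 2: append the Dogs vs Cats data to the segment of the same name.
    for name in ["test", "train"]:
        merged[name].extend(dvc[name])
    return merged


merged = merge_segments(oxford_pet, dogs_vs_cats)
for segment_name, files in merged.items():
    print(segment_name, files)
```

The model makes the segment mapping explicit: both source datasets contribute to train and test, and no trainval segment remains in the merged dataset.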