Dogs vs Cats¶
This topic describes how to manage the Dogs vs Cats dataset, which carries Classification labels.
Authorize a Client Instance¶
An accesskey is needed to authenticate identity when using TensorBay.
from tensorbay import GAS
ACCESS_KEY = "Accesskey-*****"
gas = GAS(ACCESS_KEY)
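Keeping the key as a literal in source makes it easy to leak through version control. As a minimal sketch, the key could instead be read from an environment variable. The variable name `TENSORBAY_ACCESS_KEY` and the helper `load_access_key` are hypothetical and are not part of TensorBay.

```python
import os
from typing import Mapping, Optional


def load_access_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the AccessKey stored in a (hypothetical) environment variable."""
    env = os.environ if env is None else env
    # TENSORBAY_ACCESS_KEY is an assumed name, not one defined by TensorBay.
    key = env.get("TENSORBAY_ACCESS_KEY", "").strip()
    if not key:
        raise RuntimeError("Set TENSORBAY_ACCESS_KEY before creating the client.")
    return key
```

The returned string would then be passed to `GAS(...)` in place of the literal key.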
Create Dataset¶
gas.create_dataset("DogsVsCats")
Organize Dataset¶
Organizing the “Dogs vs Cats” dataset into a Dataset instance takes the following steps.
Step 1: Write the Catalog¶
A catalog contains all the label information of a dataset and is typically stored in a JSON file.
{
    "CLASSIFICATION": {
        "categories": [{ "name": "cat" }, { "name": "dog" }]
    }
}
The only annotation type for “Dogs vs Cats” is Classification, and there are 2 category types.
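To make the catalog structure concrete, the sketch below parses the same JSON with the standard library and lists the category names. It does not use TensorBay; the helper `category_names` is a hypothetical name introduced for illustration.

```python
import json

# The catalog shown above, embedded as text so the example is self-contained.
CATALOG_TEXT = """
{
    "CLASSIFICATION": {
        "categories": [{ "name": "cat" }, { "name": "dog" }]
    }
}
"""


def category_names(catalog_text: str) -> list:
    """Return the Classification category names declared in a catalog."""
    catalog = json.loads(catalog_text)
    return [category["name"] for category in catalog["CLASSIFICATION"]["categories"]]


names = category_names(CATALOG_TEXT)
print(names)
```

A check like this can catch a malformed catalog file before it is handed to the dataloader.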
Important
See catalog table for more catalogs with different label types.
Step 2: Write the Dataloader¶
A dataloader is needed to organize the dataset into a Dataset instance.
#!/usr/bin/env python3
#
# Copyright 2021 Graviti. Licensed under MIT License.
#
# pylint: disable=invalid-name
# pylint: disable=missing-module-docstring

import os

from ...dataset import Data, Dataset
from ...label import Classification
from .._utility import glob

DATASET_NAME = "DogsVsCats"
_SEGMENTS = {"train": True, "test": False}


def DogsVsCats(path: str) -> Dataset:
    """Dataloader of the `Dogs vs Cats`_ dataset.

    .. _Dogs vs Cats: https://www.kaggle.com/c/dogs-vs-cats

    The file structure should be like::

        <path>
            train/
                cat.0.jpg
                ...
                dog.0.jpg
                ...
            test/
                1000.jpg
                1001.jpg
                ...

    Arguments:
        path: The root directory of the dataset.

    Returns:
        Loaded :class:`~tensorbay.dataset.dataset.Dataset` instance.

    """
    root_path = os.path.abspath(os.path.expanduser(path))
    dataset = Dataset(DATASET_NAME)
    dataset.load_catalog(os.path.join(os.path.dirname(__file__), "catalog.json"))

    for segment_name, is_labeled in _SEGMENTS.items():
        segment = dataset.create_segment(segment_name)
        image_paths = glob(os.path.join(root_path, segment_name, "*.jpg"))
        for image_path in image_paths:
            data = Data(image_path)
            if is_labeled:
                data.label.classification = Classification(os.path.basename(image_path)[:3])
            segment.append(data)

    return dataset
See Classification annotation for more details.
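The dataloader above takes the first three characters of each file name as the category, which works because both category names happen to be three letters long. As a sketch of a slightly more defensive alternative, the helper below splits on the first dot and checks the result against the catalog names. The helper `category_from_path` is hypothetical and is not part of TensorBay.

```python
import os
from typing import Optional

# Category names as listed in the catalog above.
_CATEGORIES = ("cat", "dog")


def category_from_path(image_path: str) -> Optional[str]:
    """Return "cat" or "dog" for a labeled file name, otherwise None."""
    # "cat.0.jpg" -> "cat"; an unlabeled name such as "1000.jpg" -> "1000".
    prefix = os.path.basename(image_path).split(".", 1)[0]
    return prefix if prefix in _CATEGORIES else None
```

For files in the train segment this yields the same value as `os.path.basename(image_path)[:3]`, while unlabeled test files return `None` instead of an invalid category.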
Note
The Dogs vs Cats dataloader above is already included in TensorBay, so it uses relative imports. When writing a new dataloader, use regular imports instead.
from tensorbay.dataset import Data, Dataset
from tensorbay.label import Classification
The TensorBay SDK already includes a number of dataloaders provided by the community, so importing an available dataloader is an alternative to writing one.
from tensorbay.opendataset import DogsVsCats
dataset = DogsVsCats("path/to/dataset/directory")
Note
Catalogs are loaded automatically by the available dataloaders, so users do not have to write them again.
Important
See dataloader table for more examples of dataloaders with different label types.
Visualize Dataset¶
Optionally, the organized dataset can be visualized by Pharos, which is a TensorBay SDK plug-in. This step can help users to check whether the dataset is correctly organized. Please see Visualization for more details.
Upload Dataset¶
The organized “Dogs vs Cats” dataset can be uploaded to TensorBay for sharing, reuse, etc.
dataset_client = gas.upload_dataset(dataset, jobs=8)
dataset_client.commit("initial commit")
Similar to Git, the commit step after uploading records the changes to the dataset as a version. If needed, modify the dataset and commit again. Please see Version Control for more details.
Read Dataset¶
Now the “Dogs vs Cats” dataset can be read from TensorBay.
dataset = Dataset("DogsVsCats", gas)
In the “Dogs vs Cats” dataset, there are two segments: train and test.
Get the segment names by listing them all.
dataset.keys()
Get a segment by passing the required segment name.
segment = dataset["train"]
In the train segment, there is a sequence of data, which can be obtained by index.
data = segment[0]
Each data in the train segment carries a single Classification annotation, which is accessed as an attribute of its label rather than by index.
category = data.label.classification.category
The “Dogs vs Cats” dataset has only one label type, classification. The value stored in category is one of the names in the “categories” list of catalog.json.
See Classification label format for more details.
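Once single labels can be read this way, a common next step is to tally them for a whole segment. The sketch below assumes only that iterating over a segment yields objects exposing `label.classification.category`; that iteration behavior is an assumption here, and the example runs on stand-in objects rather than a real TensorBay segment. The names `count_categories` and `_fake_data` are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable


def count_categories(segment: Iterable) -> Counter:
    """Count how often each classification category occurs in a segment."""
    return Counter(data.label.classification.category for data in segment)


def _fake_data(category: str) -> SimpleNamespace:
    # Stand-in exposing only data.label.classification.category.
    classification = SimpleNamespace(category=category)
    return SimpleNamespace(label=SimpleNamespace(classification=classification))


fake_segment = [_fake_data("cat"), _fake_data("dog"), _fake_data("cat")]
counts = count_categories(fake_segment)
print(counts["cat"], counts["dog"])  # prints: 2 1
```

This applies to the train segment only, since the dataloader attaches no labels to data in the test segment.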
Delete Dataset¶
gas.delete_dataset("DogsVsCats")