THCHS-30¶
This topic describes how to manage the THCHS-30 dataset, which is a dataset with the Sentence label type.
Authorize a Client Instance¶
An AccessKey is needed to authenticate identity when using TensorBay.
from tensorbay import GAS
ACCESS_KEY = "Accesskey-*****"
gas = GAS(ACCESS_KEY)
Create Dataset¶
gas.create_dataset("THCHS-30")
Organize Dataset¶
It takes the following steps to organize the “THCHS-30” dataset with the Dataset instance.
Step 1: Write the Catalog¶
A catalog contains all label information of one dataset and is typically stored in a json file. However, the catalog of THCHS-30 is too large to read from a json file; instead, the sentence subcatalog is built directly from the raw lexicon file. Check the dataloader below for more details.
Important
See catalog table for more catalogs with different label types.
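To illustrate how the subcatalog is built from the raw file, the pure-Python sketch below parses a few hypothetical lexicon.txt lines the same way the dataloader's _get_subcatalog helper does: skip the header lines, then split each entry into a word followed by its phones. The four-line header and the entry layout are assumptions inferred from the dataloader, and the sample contents are placeholders, not real THCHS-30 data.

```python
from io import StringIO
from itertools import islice

# Hypothetical lexicon.txt contents: a 4-line header followed by
# "word phone phone ..." entries. The layout is inferred from the
# dataloader in Step 2, not from official THCHS-30 documentation.
SAMPLE_LEXICON = (
    "header line 1\n"
    "header line 2\n"
    "header line 3\n"
    "header line 4\n"
    "hello h e l o\n"
    "world w o r l d\n"
)


def parse_lexicon(fp):
    """Skip the 4 header lines, then split each entry into word + phones."""
    return [line.strip().split() for line in islice(fp, 4, None)]


lexicons = parse_lexicon(StringIO(SAMPLE_LEXICON))
```

Each parsed entry is the list of strings that the dataloader passes to SentenceSubcatalog.append_lexicon one line at a time.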
Step 2: Write the Dataloader¶
A dataloader is needed to organize the dataset into a Dataset instance.
#!/usr/bin/env python3
#
# Copyright 2021 Graviti. Licensed under MIT License.
#
# pylint: disable=invalid-name
# pylint: disable=missing-module-docstring

import os
from itertools import islice
from typing import List

from ...dataset import Data, Dataset
from ...label import LabeledSentence, SentenceSubcatalog, Word
from .._utility import glob

DATASET_NAME = "THCHS-30"
_SEGMENT_NAME_LIST = ("train", "dev", "test")


def THCHS30(path: str) -> Dataset:
    """Dataloader of the `THCHS-30`_ dataset.

    .. _THCHS-30: http://166.111.134.19:7777/data/thchs30/README.html

    The file structure should be like::

        <path>
            lm_word/
                lexicon.txt
            data/
                A11_0.wav.trn
                ...
            dev/
                A11_101.wav
                ...
            train/
            test/

    Arguments:
        path: The root directory of the dataset.

    Returns:
        Loaded :class:`~tensorbay.dataset.dataset.Dataset` instance.

    """
    dataset = Dataset(DATASET_NAME)
    dataset.catalog.sentence = _get_subcatalog(os.path.join(path, "lm_word", "lexicon.txt"))
    for segment_name in _SEGMENT_NAME_LIST:
        segment = dataset.create_segment(segment_name)
        for filename in glob(os.path.join(path, segment_name, "*.wav")):
            data = Data(filename)
            label_file = os.path.join(path, "data", os.path.basename(filename) + ".trn")
            data.label.sentence = _get_label(label_file)
            segment.append(data)
    return dataset


def _get_label(label_file: str) -> List[LabeledSentence]:
    with open(label_file, encoding="utf-8") as fp:
        labels = ((Word(text=text) for text in texts.split()) for texts in fp)
        return [LabeledSentence(*labels)]


def _get_subcatalog(lexicon_path: str) -> SentenceSubcatalog:
    subcatalog = SentenceSubcatalog()
    with open(lexicon_path, encoding="utf-8") as fp:
        for line in islice(fp, 4, None):
            subcatalog.append_lexicon(line.strip().split())
    return subcatalog
See Sentence annotation for more details.
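The _get_label helper above relies on each .trn transcription file holding one tier of the annotation per line, and passes the resulting word sequences to LabeledSentence positionally. A minimal pure-Python sketch of that per-line tokenization follows; the sample contents are placeholders, not real THCHS-30 data, and the three-line layout (sentence, spell, phone) is assumed from how the dataloader consumes the file.

```python
# Placeholder .trn contents: line 1 = sentence words, line 2 = spell,
# line 3 = phones. The real files are assumed to follow this layout.
SAMPLE_TRN = (
    "word_a word_b\n"
    "spell_a spell_b\n"
    "ph_a ph_b ph_c\n"
)


def tokenize_trn(text):
    """Split each line of a .trn file into its tokens, one tier per line."""
    return [line.split() for line in text.splitlines()]


tiers = tokenize_trn(SAMPLE_TRN)
# tiers[0], tiers[1] and tiers[2] correspond to the word sequences that
# _get_label wraps into a single LabeledSentence.
```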
Note
Since the THCHS-30 dataloader above is already included in TensorBay, it uses relative imports. However, regular imports should be used when writing a new dataloader.
from tensorbay.dataset import Data, Dataset
from tensorbay.label import LabeledSentence, SentenceSubcatalog, Word
There are already a number of dataloaders in TensorBay SDK provided by the community. Thus, instead of writing one from scratch, importing an available dataloader is also feasible.
from tensorbay.opendataset import THCHS30
dataset = THCHS30("path/to/dataset/directory")
Note
Catalogs are automatically loaded in available dataloaders, so users do not need to write them again.
Important
See dataloader table for dataloaders with different label types.
Visualize Dataset¶
Optionally, the organized dataset can be visualized by Pharos, a TensorBay SDK plug-in. This step helps users check whether the dataset is correctly organized. Please see Visualization for more details.
Upload Dataset¶
The organized “THCHS-30” dataset can be uploaded to TensorBay for sharing, reuse, etc.
dataset_client = gas.upload_dataset(dataset, jobs=8)
dataset_client.commit("initial commit")
Similar to Git, the commit step after uploading records changes to the dataset as a version. If needed, make modifications and commit again. Please see Version Control for more details.
Read Dataset¶
Now the “THCHS-30” dataset can be read from TensorBay.
dataset = Dataset("THCHS-30", gas)
In the “THCHS-30” dataset, there are three Segments: dev, train and test.
Get the segment names by listing them all.
dataset.keys()
Get a segment by passing the required segment name.
segment = dataset["dev"]
In the dev segment, there is a sequence of data, which can be obtained by index.
data = segment[0]
In each data, there is a sequence of Sentence annotations, which can be obtained by index.
labeled_sentence = data.label.sentence[0]
sentence = labeled_sentence.sentence
spell = labeled_sentence.spell
phone = labeled_sentence.phone
There is only one label type in the “THCHS-30” dataset, which is Sentence. It contains sentence, spell and phone information. See the Sentence label format for more details.
Delete Dataset¶
gas.delete_dataset("THCHS-30")