20 Newsgroups¶
This topic describes how to manage the 20 Newsgroups dataset, whose label type is Classification.
Authorize a Client Instance¶
An accesskey is needed to authenticate identity when using TensorBay.
from tensorbay import GAS
ACCESS_KEY = "Accesskey-*****"
gas = GAS(ACCESS_KEY)
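Hardcoding the key in a script makes it easy to leak. As a minimal sketch, assuming the key is kept in an environment variable, the helper below reads it at runtime. The variable name `TENSORBAY_ACCESS_KEY` and the function `load_access_key` are hypothetical and not part of the TensorBay SDK.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name; TensorBay itself does not define it.
ENV_VAR = "TENSORBAY_ACCESS_KEY"


def load_access_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the access key stored in ``ENV_VAR``, or raise if it is unset."""
    source = os.environ if env is None else env
    key = source.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set {ENV_VAR} before creating the GAS client.")
    return key


# Usage (requires the tensorbay package and a real key):
#     from tensorbay import GAS
#     gas = GAS(load_access_key())
```

The optional `env` argument exists only so the helper can be exercised with a plain dictionary instead of the real process environment.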
Create Dataset¶
gas.create_dataset("Newsgroups20")
Organize Dataset¶
Organizing the “20 Newsgroups” dataset into a Dataset instance takes the following steps.
Step 1: Write the Catalog¶
A Catalog contains all label information of one dataset and is typically stored in a JSON file.
{
    "CLASSIFICATION": {
        "categories": [
            { "name": "alt.atheism" },
            { "name": "comp.graphics" },
            { "name": "comp.os.ms-windows.misc" },
            { "name": "comp.sys.ibm.pc.hardware" },
            { "name": "comp.sys.mac.hardware" },
            { "name": "comp.windows.x" },
            { "name": "misc.forsale" },
            { "name": "rec.autos" },
            { "name": "rec.motorcycles" },
            { "name": "rec.sport.baseball" },
            { "name": "rec.sport.hockey" },
            { "name": "sci.crypt" },
            { "name": "sci.electronics" },
            { "name": "sci.med" },
            { "name": "sci.space" },
            { "name": "soc.religion.christian" },
            { "name": "talk.politics.guns" },
            { "name": "talk.politics.mideast" },
            { "name": "talk.politics.misc" },
            { "name": "talk.religion.misc" }
        ]
    }
}
The only annotation type for “20 Newsgroups” is Classification, and there are 20 categories.
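Because the catalog is plain JSON, it can also be generated rather than typed by hand. The sketch below is one possible way to do that, not necessarily how the TensorBay authors produced the file; the helper name `build_catalog` is hypothetical. It builds the structure shown above from the list of category names using only the standard library.

```python
import json
from typing import Dict, Iterable, List

CATEGORY_NAMES = [
    "alt.atheism",
    "comp.graphics",
    "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "comp.windows.x",
    "misc.forsale",
    "rec.autos",
    "rec.motorcycles",
    "rec.sport.baseball",
    "rec.sport.hockey",
    "sci.crypt",
    "sci.electronics",
    "sci.med",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "talk.religion.misc",
]


def build_catalog(names: Iterable[str]) -> Dict[str, Dict[str, List[Dict[str, str]]]]:
    """Build a Classification catalog with one entry per category name."""
    return {"CLASSIFICATION": {"categories": [{"name": name} for name in names]}}


catalog = build_catalog(CATEGORY_NAMES)
# Serialize to text; write this string to catalog.json if desired.
catalog_text = json.dumps(catalog, indent=4)
```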
Important
See catalog table for more catalogs with different label types.
Note
The categories in dataset “20 Newsgroups” have parent-child relationships, and “.” is used to separate the different levels.
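To make the parent-child structure concrete, the following sketch splits a few of the category names on “.” and groups them by their top-level parent. It uses only the standard library, and the helper names are illustrative rather than part of TensorBay.

```python
from collections import defaultdict
from typing import Dict, List


def split_levels(category: str) -> List[str]:
    """Split a dotted category name into its hierarchy levels."""
    return category.split(".")


def group_by_top_level(categories: List[str]) -> Dict[str, List[str]]:
    """Group category names by their first (top-level) component."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for category in categories:
        groups[split_levels(category)[0]].append(category)
    return dict(groups)


sample = ["comp.graphics", "comp.sys.mac.hardware", "rec.autos", "sci.med"]
groups = group_by_top_level(sample)
```

Here `split_levels("comp.sys.mac.hardware")` yields the four levels `comp`, `sys`, `mac` and `hardware`, and `groups` maps `comp` to the two `comp.*` names in the sample.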
Step 2: Write the Dataloader¶
A dataloader is needed to organize the dataset into a Dataset instance.
#!/usr/bin/env python3
#
# Copyright 2021 Graviti. Licensed under MIT License.
#
# pylint: disable=invalid-name
# pylint: disable=missing-module-docstring

import os

from ...dataset import Data, Dataset
from ...label import Classification
from .._utility import glob

DATASET_NAME = "Newsgroups20"
SEGMENT_DESCRIPTION_DICT = {
    "20_newsgroups": "Original 20 Newsgroups data set",
    "20news-bydate-train": (
        "Training set of the second version of 20 Newsgroups, "
        "which is sorted by date and has duplicates and some headers removed"
    ),
    "20news-bydate-test": (
        "Test set of the second version of 20 Newsgroups, "
        "which is sorted by date and has duplicates and some headers removed"
    ),
    "20news-18828": (
        "The third version of 20 Newsgroups, which has duplicates removed "
        "and includes only 'From' and 'Subject' headers"
    ),
}


def Newsgroups20(path: str) -> Dataset:
    """Dataloader of the `20 Newsgroups`_ dataset.

    .. _20 Newsgroups: http://qwone.com/~jason/20Newsgroups/

    The folder structure should be like::

        <path>
            20news-18828/
                alt.atheism/
                    49960
                    51060
                    51119
                    51120
                    ...
                comp.graphics/
                comp.os.ms-windows.misc/
                comp.sys.ibm.pc.hardware/
                comp.sys.mac.hardware/
                comp.windows.x/
                misc.forsale/
                rec.autos/
                rec.motorcycles/
                rec.sport.baseball/
                rec.sport.hockey/
                sci.crypt/
                sci.electronics/
                sci.med/
                sci.space/
                soc.religion.christian/
                talk.politics.guns/
                talk.politics.mideast/
                talk.politics.misc/
                talk.religion.misc/
            20news-bydate-test/
            20news-bydate-train/
            20_newsgroups/

    Arguments:
        path: The root directory of the dataset.

    Returns:
        Loaded :class:`~tensorbay.dataset.dataset.Dataset` instance.

    """
    root_path = os.path.abspath(os.path.expanduser(path))
    dataset = Dataset(DATASET_NAME)
    dataset.load_catalog(os.path.join(os.path.dirname(__file__), "catalog.json"))

    for segment_name, segment_description in SEGMENT_DESCRIPTION_DICT.items():
        segment_path = os.path.join(root_path, segment_name)
        if not os.path.isdir(segment_path):
            continue

        segment = dataset.create_segment(segment_name)
        segment.description = segment_description

        text_paths = glob(os.path.join(segment_path, "*", "*"))
        for text_path in text_paths:
            category = os.path.basename(os.path.dirname(text_path))

            data = Data(
                text_path, target_remote_path=f"{category}/{os.path.basename(text_path)}.txt"
            )
            data.label.classification = Classification(category)
            segment.append(data)

    return dataset
See Classification annotation for more details.
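The dataloader skips any segment directory that is missing under the root path without reporting it. Before running it, it can help to see which of the four expected directories are actually present. The following is a minimal sketch using only the standard library; the function name `find_segments` is hypothetical and not part of TensorBay.

```python
import os
from typing import List, Tuple

# Segment directory names taken from SEGMENT_DESCRIPTION_DICT in the dataloader.
EXPECTED_SEGMENTS = (
    "20_newsgroups",
    "20news-bydate-train",
    "20news-bydate-test",
    "20news-18828",
)


def find_segments(path: str) -> Tuple[List[str], List[str]]:
    """Return (present, missing) segment directory names under ``path``."""
    root_path = os.path.abspath(os.path.expanduser(path))
    present: List[str] = []
    missing: List[str] = []
    for name in EXPECTED_SEGMENTS:
        if os.path.isdir(os.path.join(root_path, name)):
            present.append(name)
        else:
            missing.append(name)
    return present, missing
```

Calling `find_segments("path/to/dataset/directory")` returns two lists, so a script can warn about missing segments before any data is loaded.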
Note
The data files in “20 Newsgroups” have no extensions, so a “txt” extension is added to the remote path of each data file to ensure that the loaded dataset functions well on TensorBay.
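The mapping from a local file to its remote path can be shown in isolation. This sketch mirrors the expression used in the dataloader above as a standalone helper, so it runs without the TensorBay SDK; the name `remote_path_for` is hypothetical.

```python
import os


def remote_path_for(text_path: str) -> str:
    """Return "<category>/<file name>.txt" for a local data file.

    The category is the name of the directory that contains the file.
    """
    category = os.path.basename(os.path.dirname(text_path))
    return f"{category}/{os.path.basename(text_path)}.txt"


local_path = os.path.join("20news-18828", "alt.atheism", "49960")
remote_path = remote_path_for(local_path)  # "alt.atheism/49960.txt"
```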
Note
The 20 Newsgroups dataloader above is already included in TensorBay, so it uses relative imports. When writing a new dataloader, use regular imports instead.
from tensorbay.dataset import Data, Dataset
from tensorbay.label import Classification
The TensorBay SDK already includes a number of dataloaders provided by the community, so an available dataloader can be imported instead of writing a new one.
from tensorbay.opendataset import Newsgroups20
dataset = Newsgroups20("path/to/dataset/directory")
Note
Catalogs are automatically loaded by the available dataloaders, so users do not have to write them again.
Important
See dataloader table for dataloaders with different label types.
Visualize Dataset¶
Optionally, the organized dataset can be visualized by Pharos, which is a TensorBay SDK plug-in. This step can help users to check whether the dataset is correctly organized. Please see Visualization for more details.
Upload Dataset¶
The organized “20 Newsgroups” dataset can be uploaded to TensorBay for sharing, reuse, etc.
dataset_client = gas.upload_dataset(dataset, jobs=8)
dataset_client.commit("initial commit")
Similar to Git, the commit step after uploading records changes to the dataset as a version. If needed, make modifications and commit again. Please see Version Control for more details.
Read Dataset¶
Now the “20 Newsgroups” dataset can be read from TensorBay.
dataset = Dataset("Newsgroups20", gas)
In dataset “20 Newsgroups”, there are four segments: 20news-18828, 20news-bydate-test, 20news-bydate-train and 20_newsgroups.
Get the segment names by listing them all.
dataset.keys()
Get a segment by passing the required segment name.
segment = dataset["20news-18828"]
In the 20news-18828 segment, there is a sequence of data, which can be obtained by index.
data = segment[0]
Each data has a Classification annotation, whose category can be read directly.
category = data.label.classification.category
There is only one label type in the “20 Newsgroups” dataset, which is Classification. The information stored in category is one of the category names in the “categories” list of catalog.json. See this page for more details about the structure of Classification.
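Putting the reading steps together, the sketch below tallies how many data items fall into each category of a segment. It relies only on the `data.label.classification.category` access pattern shown above. The `SimpleNamespace` objects are stand-ins so that the example runs without a TensorBay connection, and `count_categories` and `fake_data` are hypothetical helpers.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def count_categories(segment: Iterable[Any]) -> Counter:
    """Count data items per category in an iterable of data objects."""
    return Counter(data.label.classification.category for data in segment)


def fake_data(category: str) -> SimpleNamespace:
    """Build a stand-in object with the same attribute path as a labeled data item."""
    classification = SimpleNamespace(category=category)
    return SimpleNamespace(label=SimpleNamespace(classification=classification))


# Stand-in for a real segment such as dataset["20news-18828"].
fake_segment = [fake_data("sci.med"), fake_data("sci.space"), fake_data("sci.med")]
counts = count_categories(fake_segment)
```

With a real dataset, passing a segment in place of `fake_segment` should work the same way, assuming segments can be iterated as the index access above suggests.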
Delete Dataset¶
gas.delete_dataset("Newsgroups20")