20 Newsgroups#
This topic describes how to manage the 20 Newsgroups dataset, which is a dataset with Classification label type.
Create Dataset#
gas.create_dataset("Newsgroups20")
Organize Dataset#
Normally, dataloader.py and catalog.json are required to organize the “20 Newsgroups” dataset into the Dataset instance.
In this example, they are stored in the same directory like:
20 Newsgroups/
    catalog.json
    dataloader.py
The following steps organize the “20 Newsgroups” dataset into a Dataset instance.
Step 1: Write the Catalog#
A Catalog contains all label information of one dataset, which is typically stored in a json file like catalog.json.
{
    "CLASSIFICATION": {
        "categories": [
            { "name": "alt.atheism" },
            { "name": "comp.graphics" },
            { "name": "comp.os.ms-windows.misc" },
            { "name": "comp.sys.ibm.pc.hardware" },
            { "name": "comp.sys.mac.hardware" },
            { "name": "comp.windows.x" },
            { "name": "misc.forsale" },
            { "name": "rec.autos" },
            { "name": "rec.motorcycles" },
            { "name": "rec.sport.baseball" },
            { "name": "rec.sport.hockey" },
            { "name": "sci.crypt" },
            { "name": "sci.electronics" },
            { "name": "sci.med" },
            { "name": "sci.space" },
            { "name": "soc.religion.christian" },
            { "name": "talk.politics.guns" },
            { "name": "talk.politics.mideast" },
            { "name": "talk.politics.misc" },
            { "name": "talk.religion.misc" }
        ]
    }
}
The only annotation type for “20 Newsgroups” is Classification, and there are 20 category types.
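Before loading a catalog, it can help to check it with Python's standard json module. The following is a minimal sketch that does not use the TensorBay SDK: category_names is a hypothetical helper, and CATALOG_TEXT is a shortened stand-in for the real catalog.json.

```python
import json

# Shortened stand-in for catalog.json; the real file lists all 20 categories.
CATALOG_TEXT = """
{
    "CLASSIFICATION": {
        "categories": [
            { "name": "alt.atheism" },
            { "name": "comp.graphics" },
            { "name": "sci.space" }
        ]
    }
}
"""


def category_names(catalog: dict) -> list:
    """Return the category names listed under the CLASSIFICATION key."""
    return [item["name"] for item in catalog["CLASSIFICATION"]["categories"]]


names = category_names(json.loads(CATALOG_TEXT))
# Duplicate names would make the labels ambiguous, so check for them.
assert len(names) == len(set(names))
print(names)
```

Run against the full catalog.json, the same helper should return 20 names.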
Note
The categories in the “20 Newsgroups” dataset have parent-child relationships and use “.” to separate the different levels.
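To illustrate the “.” separated levels, the sketch below splits category names with plain string operations. It does not rely on the TensorBay SDK, and the helper names split_levels and group_by_top_level are hypothetical.

```python
from collections import defaultdict


def split_levels(category: str) -> list:
    """Split a category name such as "rec.sport.hockey" into its levels."""
    return category.split(".")


def group_by_top_level(categories) -> dict:
    """Group full category names by their first (top) level."""
    groups = defaultdict(list)
    for category in categories:
        groups[split_levels(category)[0]].append(category)
    return dict(groups)


sample = ["rec.autos", "rec.sport.baseball", "rec.sport.hockey", "sci.med"]
print(split_levels("rec.sport.hockey"))
print(group_by_top_level(sample))
```

Here “rec” is the parent of “rec.autos” and “rec.sport”, and “rec.sport” is in turn the parent of the baseball and hockey categories.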
Passing the path of catalog.json to load_catalog() loads the catalog into the dataset.
Important
See catalog table for more catalogs with different label types.
Step 2: Write the Dataloader#
A dataloader is needed to organize the dataset into a Dataset instance.
#!/usr/bin/env python3
#
# Copyright 2021 Graviti. Licensed under MIT License.
#
# pylint: disable=invalid-name

"""Dataloader of Newsgroups20 dataset."""

import os

from tensorbay.dataset import Data, Dataset
from tensorbay.label import Classification
from tensorbay.opendataset._utility import glob

DATASET_NAME = "Newsgroups20"
SEGMENT_DESCRIPTION_DICT = {
    "20_newsgroups": "Original 20 Newsgroups data set",
    "20news-bydate-train": (
        "Training set of the second version of 20 Newsgroups, "
        "which is sorted by date and has duplicates and some headers removed"
    ),
    "20news-bydate-test": (
        "Test set of the second version of 20 Newsgroups, "
        "which is sorted by date and has duplicates and some headers removed"
    ),
    "20news-18828": (
        "The third version of 20 Newsgroups, which has duplicates removed "
        "and includes only 'From' and 'Subject' headers"
    ),
}


def Newsgroups20(path: str) -> Dataset:
    """`20 Newsgroups <http://qwone.com/~jason/20Newsgroups/>`_ dataset.

    The folder structure should be like::

        <path>
            20news-18828/
                alt.atheism/
                    49960
                    51060
                    51119
                    51120
                    ...
                comp.graphics/
                comp.os.ms-windows.misc/
                comp.sys.ibm.pc.hardware/
                comp.sys.mac.hardware/
                comp.windows.x/
                misc.forsale/
                rec.autos/
                rec.motorcycles/
                rec.sport.baseball/
                rec.sport.hockey/
                sci.crypt/
                sci.electronics/
                sci.med/
                sci.space/
                soc.religion.christian/
                talk.politics.guns/
                talk.politics.mideast/
                talk.politics.misc/
                talk.religion.misc/
            20news-bydate-test/
            20news-bydate-train/
            20_newsgroups/

    Arguments:
        path: The root directory of the dataset.

    Returns:
        Loaded :class:`~tensorbay.dataset.dataset.Dataset` instance.

    """
    root_path = os.path.abspath(os.path.expanduser(path))
    dataset = Dataset(DATASET_NAME)
    dataset.load_catalog(os.path.join(os.path.dirname(__file__), "catalog.json"))

    for segment_name, segment_description in SEGMENT_DESCRIPTION_DICT.items():
        segment_path = os.path.join(root_path, segment_name)
        if not os.path.isdir(segment_path):
            continue

        segment = dataset.create_segment(segment_name)
        segment.description = segment_description

        text_paths = glob(os.path.join(segment_path, "*", "*"))
        for text_path in text_paths:
            category = os.path.basename(os.path.dirname(text_path))

            data = Data(
                text_path, target_remote_path=f"{category}/{os.path.basename(text_path)}.txt"
            )
            data.label.classification = Classification(category)
            segment.append(data)

    return dataset
See Classification annotation for more details.
Note
The data files in “20 Newsgroups” have no extensions, so a “txt” extension is added to the remote path of each data file to ensure the loaded dataset functions well on TensorBay.
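The way the category and the remote path are derived from each local file path can be tried in isolation. The sketch below uses only the standard library: the standard glob module stands in for the SDK's glob helper, remote_path_for is a hypothetical helper, and the files are placeholders created in a temporary directory.

```python
import glob
import os
import tempfile


def remote_path_for(text_path: str) -> str:
    """Build "<category>/<file name>.txt" from a local file path."""
    category = os.path.basename(os.path.dirname(text_path))
    return f"{category}/{os.path.basename(text_path)}.txt"


# Placeholder files imitating the folder layout; "00001" is a made-up name.
SAMPLE_FILES = [
    ("alt.atheism", "49960"),
    ("alt.atheism", "51060"),
    ("comp.graphics", "00001"),
]

with tempfile.TemporaryDirectory() as root:
    segment_path = os.path.join(root, "20news-18828")
    for category, file_name in SAMPLE_FILES:
        os.makedirs(os.path.join(segment_path, category), exist_ok=True)
        with open(os.path.join(segment_path, category, file_name), "w") as fp:
            fp.write("placeholder text\n")

    # "*/*" matches every file directly inside a category directory.
    text_paths = sorted(glob.glob(os.path.join(segment_path, "*", "*")))
    remote_paths = [remote_path_for(path) for path in text_paths]

print(remote_paths)
```

The name of the parent directory doubles as the category label, which is why the dataloader needs no separate annotation file.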
The TensorBay SDK already includes a number of dataloaders provided by the community. Thus, instead of writing a new dataloader, importing an available one is also feasible.
from tensorbay.opendataset import Newsgroups20
dataset = Newsgroups20("<path/to/dataset>")
Note
Catalogs are automatically loaded by the available dataloaders, so users do not have to write them again.
Important
See dataloader table for dataloaders with different label types.
Visualize Dataset#
Optionally, the organized dataset can be visualized by Pharos, which is a TensorBay SDK plug-in. This step can help users to check whether the dataset is correctly organized. Please see Visualization for more details.
Upload Dataset#
The organized “20 Newsgroups” dataset can be uploaded to TensorBay for sharing, reuse, etc.
dataset_client = gas.upload_dataset(dataset, jobs=8)
dataset_client.commit("initial commit")
Similar to Git, the commit step after uploading records the changes to the dataset as a version. If needed, make further modifications and commit again. Please see Version Control for more details.
Read Dataset#
Now the “20 Newsgroups” dataset can be read from TensorBay.
dataset = Dataset("Newsgroups20", gas)
In the “20 Newsgroups” dataset, there are four segments: 20news-18828, 20news-bydate-test, 20news-bydate-train and 20_newsgroups.
Get the segment names by listing them all.
dataset.keys()
Get a segment by passing the required segment name.
segment = dataset["20news-18828"]
In the 20news-18828 segment, there is a sequence of data, which can be obtained by index.
data = segment[0]
Each data has a single Classification annotation, whose category can be read from the label.
category = data.label.classification.category
There is only one label type in the “20 Newsgroups” dataset, which is Classification.
The information stored in category is one of the category names in the “categories” list of catalog.json.
See this page for more details about the structure of Classification.
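Putting the reading steps together, one common check is to count how many data items carry each category. The sketch below assumes only that iterating over a segment yields items exposing data.label.classification.category, as in the snippets above. So that it can run without TensorBay, count_categories is a hypothetical helper and fake_data builds stand-in objects.

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(segment) -> Counter:
    """Count how many data items in a segment carry each category."""
    return Counter(data.label.classification.category for data in segment)


def fake_data(category: str) -> SimpleNamespace:
    """Hypothetical stand-in exposing data.label.classification.category."""
    classification = SimpleNamespace(category=category)
    return SimpleNamespace(label=SimpleNamespace(classification=classification))


# A plain list of stand-ins plays the role of a segment here.
fake_segment = [fake_data("sci.med"), fake_data("sci.space"), fake_data("sci.med")]
counts = count_categories(fake_segment)
print(counts.most_common())
```

With a real segment, such as dataset["20news-18828"], the same function could be applied directly, provided the segment is iterable in the way assumed here.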
Delete Dataset#
gas.delete_dataset("Newsgroups20")