{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 使用Loader和Pipe加载并处理数据集\n",
"\n",
"这一部分是关于如何加载数据集的教程\n",
"\n",
"## Part I: 数据集容器DataBundle\n",
"\n",
"而由于对于同一个任务训练集验证集和测试集会共用同一个词表以及具有相同的目标值所以在fastNLP中我们使用了 DataBundle 来承载同一个任务的多个数据集 DataSet 以及它们的词表 Vocabulary 。下面会有例子介绍 DataBundle 的相关使用。\n",
"\n",
"DataBundle 在fastNLP中主要在各个 Loader 和 Pipe 中被使用。 下面我们先介绍一下 Loader 和 Pipe 。\n",
"\n",
"## Part II: 加载的各种数据集的Loader\n",
"\n",
"在fastNLP中所有的 Loader 都可以通过其文档判断其支持读取的数据格式,以及读取之后返回的 DataSet 的格式, 例如 ChnSentiCorpLoader \n",
"\n",
"- download() 函数:自动将该数据集下载到缓存地址,默认缓存地址为~/.fastNLP/datasets/。由于版权等原因不是所有的Loader都实现了该方法。该方法会返回下载后文件所处的缓存地址。\n",
"\n",
"- _load() 函数:从一个数据文件中读取数据,返回一个 DataSet 。返回的DataSet的格式可从Loader文档判断。\n",
"\n",
"- load() 函数:从文件或者文件夹中读取数据为 DataSet 并将它们组装成 DataBundle。支持接受的参数类型有以下的几种\n",
"\n",
" - None, 将尝试读取自动缓存的数据仅支持提供了自动下载数据的Loader\n",
" - 文件夹路径, 默认将尝试在该文件夹下匹配文件名中含有 train , test , dev 的文件,如果有多个文件含有相同的关键字,将无法通过该方式读取\n",
" - dict, 例如{'train':\"/path/to/tr.conll\", 'dev':\"/to/validate.conll\", \"test\":\"/to/te.conll\"}。"
]
},
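{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of the three parameter types (the folder and file paths below are hypothetical placeholders, not real datasets):\n",
"\n",
"```python\n",
"from fastNLP.io import CWSLoader\n",
"\n",
"loader = CWSLoader(dataset_name='pku')\n",
"# None: read the automatically downloaded cache\n",
"data_bundle = loader.load()\n",
"# folder path: files whose names contain train/test/dev are matched\n",
"data_bundle = loader.load('/path/to/cws_folder')\n",
"# dict: name each split explicitly\n",
"data_bundle = loader.load({'train': '/path/to/tr.conll', 'dev': '/path/to/dev.conll'})\n",
"```\n",
"\n",
"Below we run the None form against the PKU dataset:"
]
},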
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In total 3 datasets:\n",
"\ttest has 1944 instances.\n",
"\ttrain has 17196 instances.\n",
"\tdev has 1858 instances.\n",
"\n"
]
}
],
"source": [
"from fastNLP.io import CWSLoader\n",
"\n",
"loader = CWSLoader(dataset_name='pku')\n",
"data_bundle = loader.load()\n",
"print(data_bundle)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"这里表示一共有3个数据集。其中\n",
"\n",
" 3个数据集的名称分别为train、dev、test分别有17223、1831、1944个instance\n",
"\n",
"也可以取出DataSet并打印DataSet中的具体内容"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------------------------------------------------------------+\n",
"| raw_words |\n",
"+----------------------------------------------------------------+\n",
"| 迈向 充满 希望 的 新 世纪 —— 一九九八年 新年 讲话 ... |\n",
"| 中共中央 总书记 、 国家 主席 江 泽民 |\n",
"+----------------------------------------------------------------+\n"
]
}
],
"source": [
"tr_data = data_bundle.get_dataset('train')\n",
"print(tr_data[:2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part III: 使用Pipe对数据集进行预处理\n",
"\n",
"通过 Loader 可以将文本数据读入,但并不能直接被神经网络使用,还需要进行一定的预处理。\n",
"\n",
"在fastNLP中我们使用 Pipe 的子类作为数据预处理的类, Loader 和 Pipe 一般具备一一对应的关系,该关系可以从其名称判断, 例如 CWSLoader 与 CWSPipe 是一一对应的。一般情况下Pipe处理包含以下的几个过程\n",
"1. 将raw_words或 raw_chars进行tokenize以切分成不同的词或字; \n",
"2. 再建立词或字的 Vocabulary , 并将词或字转换为index; \n",
"3. 将target 列建立词表并将target列转为index;\n",
"\n",
"所有的Pipe都可通过其文档查看该Pipe支持处理的 DataSet 以及返回的 DataBundle 中的Vocabulary的情况; 如 OntoNotesNERPipe\n",
"\n",
"各种数据集的Pipe当中都包含了以下的两个函数:\n",
"\n",
"- process() 函数:对输入的 DataBundle 进行处理, 然后返回处理之后的 DataBundle 。process函数的文档中包含了该Pipe支持处理的DataSet的格式。\n",
"- process_from_file() 函数输入数据集所在文件夹使用对应的Loader读取数据(所以该函数支持的参数类型是由于其对应的Loader的load函数决定的)然后调用相对应的process函数对数据进行预处理。相当于是把Load和process放在一个函数中执行。\n",
"\n",
"接着上面 CWSLoader 的例子,我们展示一下 CWSPipe 的功能:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In total 3 datasets:\n",
"\ttest has 1944 instances.\n",
"\ttrain has 17196 instances.\n",
"\tdev has 1858 instances.\n",
"In total 2 vocabs:\n",
"\tchars has 4777 entries.\n",
"\ttarget has 4 entries.\n",
"\n"
]
}
],
"source": [
"from fastNLP.io import CWSPipe\n",
"\n",
"data_bundle = CWSPipe().process(data_bundle)\n",
"print(data_bundle)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"表示一共有3个数据集和2个词表。其中\n",
"\n",
"- 3个数据集的名称分别为train、dev、test分别有17223、1831、1944个instance\n",
"- 2个词表分别为chars词表与target词表。其中chars词表为句子文本所构建的词表一共有4777个不同的字target词表为目标标签所构建的词表一共有4种标签。\n",
"\n",
"相较于之前CWSLoader读取的DataBundle新增了两个Vocabulary。 我们可以打印一下处理之后的DataSet"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------------+---------------------+---------------------+---------+\n",
"| raw_words | chars | target | seq_len |\n",
"+---------------------+---------------------+---------------------+---------+\n",
"| 迈向 充满 希望... | [1224, 178, 674,... | [0, 1, 0, 1, 0, ... | 29 |\n",
"| 中共中央 总书记... | [11, 212, 11, 33... | [0, 3, 3, 1, 0, ... | 15 |\n",
"+---------------------+---------------------+---------------------+---------+\n"
]
}
],
"source": [
"tr_data = data_bundle.get_dataset('train')\n",
"print(tr_data[:2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"可以看到有两列为int的field: chars和target。这两列的名称同时也是DataBundle中的Vocabulary的名称。可以通过下列的代码获取并查看Vocabulary的 信息"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vocabulary(['B', 'E', 'S', 'M']...)\n"
]
}
],
"source": [
"vocab = data_bundle.get_vocab('target')\n",
"print(vocab)"
]
},
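{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Vocabulary converts between tokens and indices in both directions. A minimal sketch of that conversion API (the exact printed indices depend on how the vocabulary was built):\n",
"\n",
"```python\n",
"vocab = data_bundle.get_vocab('target')\n",
"print(vocab.to_index('B'))  # label string -> index\n",
"print(vocab.to_word(0))     # index -> label string\n",
"print(len(vocab))           # number of entries, 4 here\n",
"```\n",
"\n",
"Finally, as mentioned above, process_from_file() runs load and process in a single call, so the whole pipeline of this part collapses into one line. A minimal sketch (assuming dataset_name is forwarded to the underlying CWSLoader):\n",
"\n",
"```python\n",
"from fastNLP.io import CWSPipe\n",
"\n",
"# roughly equivalent to CWSLoader(dataset_name='pku').load()\n",
"# followed by CWSPipe().process(data_bundle)\n",
"data_bundle = CWSPipe(dataset_name='pku').process_from_file()\n",
"print(data_bundle)\n",
"```"
]
},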
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part IV: fastNLP封装好的Loader和Pipe\n",
"\n",
"fastNLP封装了多种任务/数据集的 Loader 和 Pipe 并提供自动下载功能,具体参见文档 [数据集](https://docs.qq.com/sheet/DVnpkTnF6VW9UeXdh?c=A1A0A0)\n",
"\n",
"## Part V: 不同格式类型的基础Loader\n",
"\n",
"除了上面提到的针对具体任务的Loader我们还提供了CSV格式和JSON格式的Loader\n",
"\n",
"**CSVLoader** 读取CSV类型的数据集文件。例子如下\n",
"\n",
"```python\n",
"from fastNLP.io.loader import CSVLoader\n",
"data_set_loader = CSVLoader(\n",
" headers=('raw_words', 'target'), sep='\\t'\n",
")\n",
"```\n",
"\n",
"表示将CSV文件中每一行的第一项将填入'raw_words' field第二项填入'target' field。其中项之间由'\\t'分割开来\n",
"\n",
"```python\n",
"data_set = data_set_loader._load('path/to/your/file')\n",
"```\n",
"\n",
"文件内容样例如下\n",
"\n",
"```csv\n",
"But it does not leave you with much . 1\n",
"You could hate it for the same reason . 1\n",
"The performances are an absolute joy . 4\n",
"```\n",
"\n",
"读取之后的DataSet具有以下的field\n",
"\n",
"| raw_words | target |\n",
"| --------------------------------------- | ------ |\n",
"| But it does not leave you with much . | 1 |\n",
"| You could hate it for the same reason . | 1 |\n",
"| The performances are an absolute joy . | 4 |\n"
]
},
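{
"cell_type": "markdown",
"metadata": {},
"source": [
"To try this end to end, here is a minimal runnable sketch (the file name sample.tsv and the rows written are illustrative):\n",
"\n",
"```python\n",
"from fastNLP.io.loader import CSVLoader\n",
"\n",
"# write sample rows to a tab-separated file (illustrative)\n",
"with open('sample.tsv', 'w') as f:\n",
"    f.write('But it does not leave you with much .\\t1\\n')\n",
"    f.write('The performances are an absolute joy .\\t4\\n')\n",
"\n",
"loader = CSVLoader(headers=('raw_words', 'target'), sep='\\t')\n",
"data_set = loader._load('sample.tsv')\n",
"print(data_set)\n",
"```"
]
},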
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**JsonLoader** 读取Json类型的数据集文件数据必须按行存储每行是一个包含各类属性的Json对象。例子如下\n",
"\n",
"```python\n",
"from fastNLP.io.loader import JsonLoader\n",
"loader = JsonLoader(\n",
" fields={'sentence1': 'raw_words1', 'sentence2': 'raw_words2', 'gold_label': 'target'}\n",
")\n",
"```\n",
"\n",
"表示将Json对象中'sentence1'、'sentence2'和'gold_label'对应的值赋给'raw_words1'、'raw_words2'、'target'这三个fields\n",
"\n",
"```python\n",
"data_set = loader._load('path/to/your/file')\n",
"```\n",
"\n",
"数据集内容样例如下\n",
"```\n",
"{\"annotator_labels\": [\"neutral\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"neutral\", ... }\n",
"{\"annotator_labels\": [\"contradiction\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"contradiction\", ... }\n",
"{\"annotator_labels\": [\"entailment\"], \"captionID\": \"3416050480.jpg#4\", \"gold_label\": \"entailment\", ... }\n",
"```\n",
"\n",
"读取之后的DataSet具有以下的field\n",
"\n",
"| raw_words0 | raw_words1 | target |\n",
"| ------------------------------------------------------ | ------------------------------------------------- | ------------- |\n",
"| A person on a horse jumps over a broken down airplane. | A person is training his horse for a competition. | neutral |\n",
"| A person on a horse jumps over a broken down airplane. | A person is at a diner, ordering an omelette. | contradiction |\n",
"| A person on a horse jumps over a broken down airplane. | A person is outdoors, on a horse. | entailment |"
]
},
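{
"cell_type": "markdown",
"metadata": {},
"source": [
"A matching runnable sketch for JsonLoader (the file name sample.jsonl and the record written are illustrative):\n",
"\n",
"```python\n",
"import json\n",
"\n",
"from fastNLP.io.loader import JsonLoader\n",
"\n",
"# write one JSON object per line (illustrative)\n",
"record = {'sentence1': 'A person on a horse jumps over a broken down airplane.',\n",
"          'sentence2': 'A person is outdoors, on a horse.',\n",
"          'gold_label': 'entailment'}\n",
"with open('sample.jsonl', 'w') as f:\n",
"    f.write(json.dumps(record) + '\\n')\n",
"\n",
"loader = JsonLoader(\n",
"    fields={'sentence1': 'raw_words1', 'sentence2': 'raw_words2', 'gold_label': 'target'}\n",
")\n",
"data_set = loader._load('sample.jsonl')\n",
"print(data_set)\n",
"```"
]
},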
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python Now",
"language": "python",
"name": "now"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}