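{ "cells": [ { "cell_type": "markdown", "id": "213d538c", "metadata": {}, "source": [ "# T3. Internal structure and basic usage of dataloader\n", "\n", " 1 dataloader in fastNLP\n", " \n", " 1.1 Introduction to dataloader\n", "\n", " 1.2 Creating a dataloader via function\n", "\n", " 2 Extensions of dataloader in fastNLP\n", "\n", " 2.1 The concept and use of collator\n", "\n", " 2.2 Combining with the datasets framework" ] }, { "cell_type": "markdown", "id": "85857115", "metadata": {}, "source": [ "## 1. dataloader in fastNLP\n", "\n", "### 1.1 Introduction to dataloader\n", "\n", "In the development of `fastNLP 0.8`, the most important goal is **to make `fastNLP` compatible with today's mainstream machine learning frameworks**, for example\n", "\n", " **the widely used `pytorch`**, as well as **the Chinese-developed `paddle` and `jittor`**, which broadens the audience and also supports the domestic ecosystem\n", "\n", "Following a divide-and-conquer approach, the compatibility of `fastNLP 0.8` with `pytorch`, `paddle`, and `jittor` can be divided into\n", "\n", " **data preprocessing**, **splitting and padding of `batch`es**, **model training**, and **model evaluation**: **compatibility in four parts**\n", "\n", " For data preprocessing, `tutorial-1` already introduced the use of `dataset` and `vocabulary`\n", "\n", " Combined with `tutorial-0`, we can see that **the data preprocessing stage is essentially framework-independent**\n", "\n", " because under different frameworks the raw data formats that are read in differ little and are easy to convert between\n", "\n", "Only when tensors and models are involved does each framework show its own characteristics: **`tensor` and `nn.Module` in `pytorch`**\n", "\n", " **are called `tensor` and `nn.Layer` in `paddle`**, and **`Var` and `Module` in `jittor`**\n", "\n", " Therefore, **model training and model evaluation** **are the main difficulty for compatibility**, which we cover in detail in `tutorial-5`\n", "\n", " Handling `batch`es, as the transition from the framework-independent part of `fastNLP 0.8` to the framework-dependent part,\n", "\n", " is the job of the `dataloader` module, and it is the focus of this tutorial, `tutorial-3`\n", "\n", "**The responsibilities of the `dataloader` module** can be divided into three parts: **sampling and splitting, zero-padding and alignment, and framework matching**\n", "\n", " First, determine the `batch` size and the sampling method; after splitting, an iterator yields the sequence of `batch`es\n", "\n", " Second, for sequence processing, which is what `fastNLP` mainly targets, align the data within the same `batch`\n", "\n", " Third, **the data format within a `batch` must match the framework**, **while the `batch` structure must stay consistent** (the **parameter matching mechanism**)\n", "\n", " For this, `fastNLP 0.8` provides **`TorchDataLoader`, `PaddleDataLoader`, and `JittorDataLoader`**\n", "\n", " each targeting and matching a different framework, while their parameter names, attributes, and methods remain similar; the first two are roughly as shown in the table below\n", "\n", "|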
\n", "\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/4 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/2 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/2 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n", "| SentenceId | Sentence | Sentiment | input_ids | token_type_ids | attention_mask | target |\n", "+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n", "| 1 | A series of... | negative | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n", "| 4 | A positivel... | neutral | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 2 |\n", "| 3 | Even fans o... | negative | [101, 2130,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... | 1 |\n", "| 5 | A comedy-dr... | positive | [101, 1037,... | [0, 0, 0, 0, 0,... | [1, 1, 1, 1, 1,... 
| 0 |\n", "+------------+----------------+-----------+----------------+--------------------+--------------------+--------+\n" ] } ], "source": [ "import sys\n", "sys.path.append('..')\n", "\n", "import pandas as pd\n", "from functools import partial\n", "from fastNLP.transformers.torch import BertTokenizer\n", "\n", "from fastNLP import DataSet\n", "from fastNLP import Vocabulary\n", "from fastNLP.io import DataBundle\n", "\n", "\n", "class PipeDemo:\n", "    def __init__(self, tokenizer='bert-base-uncased'):\n", "        self.tokenizer = BertTokenizer.from_pretrained(tokenizer)\n", "\n", "    def process_from_file(self, path='./data/test4dataset.tsv'):\n", "        datasets = DataSet.from_pandas(pd.read_csv(path, sep='\\t'))\n", "        # Hold out the test set first, then split the remainder into train/dev,\n", "        # so that the test set does not overlap with train or dev\n", "        train_ds, test_ds = datasets.split(ratio=0.7)\n", "        train_ds, dev_ds = train_ds.split(ratio=0.8)\n", "        data_bundle = DataBundle(datasets={'train': train_ds, 'dev': dev_ds, 'test': test_ds})\n", "\n", "        # Tokenize 'Sentence'; the returned fields are added to each dataset\n", "        encode = partial(self.tokenizer.encode_plus, max_length=100, truncation=True,\n", "                         return_attention_mask=True)\n", "        data_bundle.apply_field_more(encode, field_name='Sentence', progress_bar='tqdm')\n", " \n", "        # Map the 'Sentiment' labels to integer ids stored in 'target'\n", "        target_vocab = Vocabulary(padding=None, unknown=None)\n", "\n", "        target_vocab.from_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment')\n", "        target_vocab.index_dataset(*[ds for _, ds in data_bundle.iter_datasets()], field_name='Sentiment',\n", "                                   new_field_name='target')\n", "\n", "        # Pad 'input_ids' with the tokenizer's pad id; hide raw fields from batches\n", "        data_bundle.set_pad('input_ids', pad_val=self.tokenizer.pad_token_id)\n", "        data_bundle.set_ignore('SentenceId', 'Sentence', 'Sentiment')\n", "        return data_bundle\n", "\n", " \n", "pipe = PipeDemo(tokenizer='bert-base-uncased')\n", "\n", "data_bundle = pipe.process_from_file('./data/test4dataset.tsv')\n", "\n", "print(data_bundle.get_dataset('train'))" ] }, { "cell_type": "markdown", "id": "76e6b8ab", "metadata": {}, "source": [ "### 1.2 Creating a dataloader via function\n", "\n", "In `fastNLP 0.8`, **a more convenient and probably more common way to create a `dataloader` is through the `prepare_xx_dataloader` functions**\n", 
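"\n", " For example, the `prepare_torch_dataloader` function below takes the required arguments, reads the dataset, and produces the corresponding `dataloader`\n", "\n", " Its type is `TorchDataLoader`, which works only with the `pytorch` framework, so the matching `trainer` must be initialized with `driver='torch'`\n", "\n", "We can also see that in `fastNLP 0.8` **a `batch` is represented as a `dict`**, **whose `key`s are the fields of the original dataset**\n", "\n", " **except those hidden via `DataBundle.set_ignore`**, while the `value`s are of type `torch.Tensor`, matching the `pytorch` framework" ] }, { "cell_type": "code", "execution_count": 2, "id": "5fd60e42", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "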