fastNLP/tutorials/fastnlp_tutorial_1.ipynb
2022-05-30 22:48:28 +08:00


{
"cells": [
{
"cell_type": "markdown",
"id": "cdc25fcd",
"metadata": {},
"source": [
"# T1. Basic usage of dataset and vocabulary\n",
"\n",
"  1   Usage and structure of dataset\n",
" \n",
"    1.1   Structure and creation of dataset\n",
"\n",
"    1.2   Data preprocessing with dataset\n",
"\n",
"    1.3   Digression: instance and field\n",
"\n",
"  2   Structure and usage of vocabulary\n",
"\n",
"    2.1   Creation and modification of vocabulary\n",
"\n",
"    2.2   vocabulary and the OOV problem\n",
"\n",
"  3   Using dataset and vocabulary together\n",
" \n",
"    3.1   Loading a dataset from a dataframe\n",
"\n",
"    3.2   Building a vocabulary from a dataset"
]
},
{
"cell_type": "markdown",
"id": "0eb18a22",
"metadata": {},
"source": [
"## 1. Basic usage of dataset\n",
"\n",
"### 1.1 Structure and creation of dataset\n",
"\n",
"In `fastNLP 0.8`, the `DataSet` module represents a dataset; **a `dataset` resembles a table in a relational database** (written in lowercase as `dataset` below) and\n",
"\n",
"  **consists of two elements, `field` columns and `instance` rows**, matching the `field` columns and `record` rows of a `table`.\n",
"\n",
"The `DataSet` module is defined under the `fastNLP.core.dataset` path; after importing it, the simplest\n",
"\n",
"  way to initialize one is to pass a dict of columns **`{'field1': column1, 'field2': column2, ...}`** to the constructor."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a1d69ad2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"from fastNLP import DataSet\n",
"\n",
"data = {'idx': [0, 1, 2], \n",
" 'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"],\n",
" 'words': [['This', 'is', 'an', 'apple', '.'], \n",
" ['I', 'like', 'apples', '.'], \n",
" ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],\n",
" 'num': [5, 4, 7]}\n",
"\n",
"dataset = DataSet(data)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "9260fdc6",
"metadata": {},
"source": [
"&emsp; In a `dataset` instance, `field` names and the strings inside each `instance` may also be in Chinese."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3d72ef00",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------+--------------------+------------------------+------+\n",
"| 序号 | 句子 | 字符 | 长度 |\n",
"+------+--------------------+------------------------+------+\n",
"| 0 | 生活就像海洋, | ['生', '活', '就', ... | 7 |\n",
"| 1 | 只有意志坚强的人, | ['只', '有', '意', ... | 9 |\n",
"| 2 | 才能到达彼岸。 | ['才', '能', '到', ... | 7 |\n",
"+------+--------------------+------------------------+------+\n"
]
}
],
"source": [
"temp = {'序号': [0, 1, 2], \n",
" '句子':[\"生活就像海洋,\", \"只有意志坚强的人,\", \"才能到达彼岸。\"],\n",
" '字符': [['生', '活', '就', '像', '海', '洋', ','], \n",
" ['只', '有', '意', '志', '坚', '强', '的', '人', ','], \n",
" ['才', '能', '到', '达', '彼', '岸', '。']],\n",
" '长度': [7, 9, 7]}\n",
"\n",
"chinese = DataSet(temp)\n",
"print(chinese)"
]
},
{
"cell_type": "markdown",
"id": "202e5490",
"metadata": {},
"source": [
"In a `dataset`, the `drop` method deletes every instance that satisfies a given condition; here a Python `lambda` expression supplies the condition.\n",
"\n",
"&emsp; Note 1: in `drop`, setting the `inplace` parameter to `False` makes the method return the `dataset` with the matching instances removed as a new object."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "09b478f8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2492313174344 2491986424200\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dropped = dataset\n",
"dropped = dropped.drop(lambda ins:ins['num'] < 5, inplace=False)\n",
"print(id(dropped), id(dataset))\n",
"print(dropped)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "aa277674",
"metadata": {},
"source": [
"&emsp; Note 2: **assigning an object with `=` generally passes a reference**, so assigning a `dataset` with `=` copies the reference, not the data.\n",
"\n",
"&emsp; &emsp; As shown below, **`dropped` and `dataset` share the same `id`**, and **deleting from `dropped` modifies `dataset` at the same time**."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "77c8583a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2491986424200 2491986424200\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dropped = dataset\n",
"dropped.drop(lambda ins:ins['num'] < 5)\n",
"print(id(dropped), id(dataset))\n",
"print(dropped)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "a76199dc",
"metadata": {},
"source": [
"In a `dataset`, the `delete_instance` method removes the `instance` at the given index; indices start from 0."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d8824b40",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+--------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+--------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"+-----+--------------------+------------------------+-----+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.delete_instance(2)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "f4fa9f33",
"metadata": {},
"source": [
"In a `dataset`, the `delete_field` method removes the `field` with the given name."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f68ddb40",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+--------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+--------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"+-----+--------------------+------------------------------+\n"
]
}
],
"source": [
"dataset.delete_field('num')\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "b1e9d42c",
"metadata": {},
"source": [
"### 1.2 Data preprocessing with dataset\n",
"\n",
"In the `dataset` module, the `apply`, `apply_field`, `apply_more` and `apply_field_more` functions perform simple data preprocessing:\n",
"\n",
"&emsp; **`apply` and `apply_more` receive whole instances**, while **`apply_field` and `apply_field_more` receive only selected fields**;\n",
"\n",
"&emsp; **`apply` and `apply_field` produce a single output field**, while **`apply_more` and `apply_field_more` produce several**;\n",
"\n",
"&emsp; **`apply` and `apply_field` return a list**, while **`apply_more` and `apply_field_more` return a dict**.\n",
"\n",
"&emsp; &emsp; During preprocessing, the `progress_bar` parameter selects the progress-bar style and `num_proc` enables multiprocessing.\n",
"***\n",
"\n",
"`apply` takes a function `func` and a new field name `new_field_name`; `func` is applied to every\n",
"\n",
"&emsp; `instance` in the `dataset`, and its results are stored in the newly created field `new_field_name`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "72a0b5f9",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+------------------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
"+-----+------------------------------+------------------------------+\n"
]
}
],
"source": [
"from fastNLP import DataSet\n",
"\n",
"data = {'idx': [0, 1, 2], \n",
" 'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"], }\n",
"dataset = DataSet(data)\n",
"dataset.apply(lambda ins: ins['sentence'].split(), new_field_name='words', progress_bar=\"tqdm\") #\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "c10275ee",
"metadata": {},
"source": [
"&emsp; **The function passed to `apply` can be an anonymous function built from a `lambda` expression**, **or a custom named function**."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b1a8631f",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+------------------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
"+-----+------------------------------+------------------------------+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"\n",
"def get_words(instance):\n",
" sentence = instance['sentence']\n",
" words = sentence.split()\n",
" return words\n",
"\n",
"dataset.apply(get_words, new_field_name='words', progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
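{
"cell_type": "markdown",
"id": "numproc01",
"metadata": {},
"source": [
"A minimal sketch of the `num_proc` parameter mentioned above (hypothetical toy data; `num_proc=2` splits the `apply` work across two worker processes, which only pays off on large datasets, and the function should be a named one so it can be pickled):\n",
"\n",
"```python\n",
"from fastNLP import DataSet\n",
"\n",
"def tokenize(instance):\n",
"    # split the raw sentence into a list of tokens\n",
"    return instance['sentence'].split()\n",
"\n",
"ds = DataSet({'sentence': [\"This is an apple .\", \"I like apples .\"]})\n",
"ds.apply(tokenize, new_field_name='words', num_proc=2)\n",
"print(ds)\n",
"```"
]
},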
{
"cell_type": "markdown",
"id": "64abf745",
"metadata": {},
"source": [
"`apply_field` takes, besides the function `func`, a `field_name` and a `new_field_name`; here `func` processes only\n",
"\n",
"&emsp; the content of the field `field_name` of each instance, and its results are stored in the newly created field `new_field_name`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "057c1d2c",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+------------------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
"+-----+------------------------------+------------------------------+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.apply_field(lambda sent:sent.split(), field_name='sentence', new_field_name='words', \n",
" progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "5a9cc8b2",
"metadata": {},
"source": [
"`apply_more` takes only the function `func`, which processes every `instance` in the `dataset`\n",
"\n",
"&emsp; and must return a dict; its `key-value` pairs determine the names and contents of the fields stored in the `dataset`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "51e2f02c",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.apply_more(lambda ins:{'words': ins['sentence'].split(), 'num': len(ins['sentence'].split())}, \n",
" progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "02d2b7ef",
"metadata": {},
"source": [
"`apply_field_more` combines the two: besides `func` it takes a `field_name`, and `func` processes only that field's content\n",
"\n",
"&emsp; and must return a dict; its `key-value` pairs determine the names and contents of the fields stored in the `dataset`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "db4295d5",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.apply_field_more(lambda sent:{'words': sent.split(), 'num': len(sent.split())}, \n",
" field_name='sentence', progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "9c09e592",
"metadata": {},
"source": [
"### 1.3 Digression: instance and field\n",
"\n",
"In `fastNLP 0.8`, the `Instance` module represents a single record of a `dataset`, called an instance.\n",
"\n",
"&emsp; Constructing one is much like constructing a dict, and a list of `Instance` objects with identical keys can also initialize a `dataset`, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "012f537c",
"metadata": {},
"outputs": [],
"source": [
"from fastNLP import DataSet\n",
"from fastNLP import Instance\n",
"\n",
"dataset = DataSet([\n",
" Instance(sentence=\"This is an apple .\",\n",
" words=['This', 'is', 'an', 'apple', '.'],\n",
" num=5),\n",
" Instance(sentence=\"I like apples .\",\n",
" words=['I', 'like', 'apples', '.'],\n",
" num=4),\n",
" Instance(sentence=\"Apples are good for our health .\",\n",
" words=['Apples', 'are', 'good', 'for', 'our', 'health', '.'],\n",
" num=7),\n",
" ])"
]
},
{
"cell_type": "markdown",
"id": "2fafb1ef",
"metadata": {},
"source": [
"&emsp; The `items`, `keys` and `values` methods return an `Instance`'s `item` list, `key` list and `value` list, respectively."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a4c1c10d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dict_items([('sentence', 'This is an apple .'), ('words', ['This', 'is', 'an', 'apple', '.']), ('num', 5)])\n",
"dict_keys(['sentence', 'words', 'num'])\n",
"dict_values(['This is an apple .', ['This', 'is', 'an', 'apple', '.'], 5])\n"
]
}
],
"source": [
"ins = Instance(sentence=\"This is an apple .\", words=['This', 'is', 'an', 'apple', '.'], num=5)\n",
"\n",
"print(ins.items())\n",
"print(ins.keys())\n",
"print(ins.values())"
]
},
{
"cell_type": "markdown",
"id": "b5459a2d",
"metadata": {},
"source": [
"&emsp; The `add_field` method adds a field to an `Instance`: the `field_name` parameter names it and the `field` parameter supplies its value."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "55376402",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+------------------------+-----+-----+\n",
"| sentence | words | num | idx |\n",
"+--------------------+------------------------+-----+-----+\n",
"| This is an apple . | ['This', 'is', 'an'... | 5 | 0 |\n",
"+--------------------+------------------------+-----+-----+\n"
]
}
],
"source": [
"ins.add_field(field_name='idx', field=0)\n",
"print(ins)"
]
},
{
"cell_type": "markdown",
"id": "49caaa9c",
"metadata": {},
"source": [
"In `fastNLP 0.8`, the `FieldArray` module represents each named column of a `dataset` (note: there is no `field` class).\n",
"\n",
"&emsp; The `get_all_fields` method returns the `dataset`'s fields as a dict;\n",
"\n",
"&emsp; the `get_field_names` method returns the list of field names, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "fe15f4c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'sentence': <fastNLP.core.dataset.field.FieldArray at 0x2444977fe88>,\n",
" 'words': <fastNLP.core.dataset.field.FieldArray at 0x2444977ff08>,\n",
" 'num': <fastNLP.core.dataset.field.FieldArray at 0x2444977ff88>}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.get_all_fields()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "5433815c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['num', 'sentence', 'words']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.get_field_names()"
]
},
{
"cell_type": "markdown",
"id": "4964eeed",
"metadata": {},
"source": [
"Other basics of `dataset`: the `in` operator or the `has_field` method tests whether a `dataset` contains a given field;\n",
"\n",
"&emsp; the `rename_field` method renames a field; the `concat` method concatenates two `dataset`s;\n",
"\n",
"&emsp; `len` counts the instances in a `dataset`; all attributes and methods of a `dataset` can be listed with `dir(dataset)`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "25ce5488",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3 False\n",
"6 True\n",
"+------------------------------+------------------------------+--------+\n",
"| sentence | words | length |\n",
"+------------------------------+------------------------------+--------+\n",
"| This is an apple . | ['This', 'is', 'an', 'app... | 5 |\n",
"| I like apples . | ['I', 'like', 'apples', '... | 4 |\n",
"| Apples are good for our h... | ['Apples', 'are', 'good',... | 7 |\n",
"| This is an apple . | ['This', 'is', 'an', 'app... | 5 |\n",
"| I like apples . | ['I', 'like', 'apples', '... | 4 |\n",
"| Apples are good for our h... | ['Apples', 'are', 'good',... | 7 |\n",
"+------------------------------+------------------------------+--------+\n"
]
}
],
"source": [
"print(len(dataset), dataset.has_field('length')) \n",
"if 'num' in dataset:\n",
" dataset.rename_field('num', 'length')\n",
"elif 'length' in dataset:\n",
" dataset.rename_field('length', 'num')\n",
"dataset.concat(dataset)\n",
"print(len(dataset), dataset.has_field('length')) \n",
"print(dataset) "
]
},
{
"cell_type": "markdown",
"id": "e30a6cd7",
"metadata": {},
"source": [
"## 2. Structure and usage of vocabulary\n",
"\n",
"### 2.1 Creation and modification of vocabulary\n",
"\n",
"In `fastNLP 0.8`, the `Vocabulary` module represents a vocabulary; **the core of a `vocabulary` is a mapping from words to indices**.\n",
"\n",
"&emsp; It can be instantiated directly through the constructor; the `word2idx` attribute exposes the dict that implements the mapping.\n",
"\n",
"&emsp; **By default, the `padding` token is `<pad>` with index 0**, **and the `unknown` token is `<unk>` with index 1**.\n",
"\n",
"&emsp; Printing a `vocabulary` shows its word list, in which `padding` and `unknown` are not displayed."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "3515e096",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vocabulary([]...)\n",
"{'<pad>': 0, '<unk>': 1}\n",
"<pad> 0\n",
"<unk> 1\n"
]
}
],
"source": [
"from fastNLP import Vocabulary\n",
"\n",
"vocab = Vocabulary()\n",
"print(vocab)\n",
"print(vocab.word2idx)\n",
"print(vocab.padding, vocab.padding_idx)\n",
"print(vocab.unknown, vocab.unknown_idx)"
]
},
{
"cell_type": "markdown",
"id": "640be126",
"metadata": {},
"source": [
"In a `vocabulary`, the `add_word` method adds a single word and `add_word_lst` adds a batch of words;\n",
"\n",
"&emsp; `len` and the `word_count` attribute give the vocabulary size and the number of times each word was added."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "88c7472a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5 Counter({'生活': 1, '就像': 1, '海洋': 1})\n",
"6 Counter({'生活': 1, '就像': 1, '海洋': 1, '只有': 1})\n",
"6 {'<pad>': 0, '<unk>': 1, '生活': 2, '就像': 3, '海洋': 4, '只有': 5}\n"
]
}
],
"source": [
"vocab.add_word_lst(['生活', '就像', '海洋'])\n",
"print(len(vocab), vocab.word_count)\n",
"vocab.add_word('只有')\n",
"print(len(vocab), vocab.word_count)\n",
"print(len(vocab), vocab.word2idx)"
]
},
{
"cell_type": "markdown",
"id": "f9ec8b28",
"metadata": {},
"source": [
"&emsp; **The `to_word` method returns the word at a given index**, **and the `to_index` method returns the index of a given word**.\n",
"\n",
"&emsp; &emsp; Since indices 0 and 1 are already taken, **newly added words are numbered from 2**; `'生活'`, for example, maps to 2.\n",
"\n",
"&emsp; &emsp; The `has_word` method tests whether a word is in the vocabulary; words that are not get treated as `<unk>`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "3447acde",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<pad> 0\n",
"<unk> 1\n",
"生活 2\n",
"彼岸 1 False\n"
]
}
],
"source": [
"print(vocab.to_word(0), vocab.to_index('<pad>'))\n",
"print(vocab.to_word(1), vocab.to_index('<unk>'))\n",
"print(vocab.to_word(2), vocab.to_index('生活'))\n",
"print('彼岸', vocab.to_index('彼岸'), vocab.has_word('彼岸'))"
]
},
{
"cell_type": "markdown",
"id": "b4e36850",
"metadata": {},
"source": [
"**A `vocabulary` allows the same word to be added repeatedly**, **and the `word_count` attribute shows how many times each word was added**,\n",
"\n",
"&emsp; though `<unk>` and `<pad>` are not counted there; all attributes and methods of a `vocabulary` can be listed with `dir(vocabulary)`.\n",
"\n",
"&emsp; Note: **when words are added with `add_word_lst`**, **existing indices are not dynamically reordered**, **unlike when words are added from a `dataset`**."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "490b101c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"生活 2\n",
"彼岸 12 True\n",
"13 Counter({'人': 4, '生活': 2, '就像': 2, '海洋': 2, '只有': 2, '意志': 1, '坚强的': 1, '才': 1, '能': 1, '到达': 1, '彼岸': 1})\n",
"13 {'<pad>': 0, '<unk>': 1, '生活': 2, '就像': 3, '海洋': 4, '只有': 5, '人': 6, '意志': 7, '坚强的': 8, '才': 9, '能': 10, '到达': 11, '彼岸': 12}\n"
]
}
],
"source": [
"vocab.add_word_lst(['生活', '就像', '海洋', '只有', '意志', '坚强的', '人', '人', '人', '人', '才', '能', '到达', '彼岸'])\n",
"print(vocab.to_word(2), vocab.to_index('生活'))\n",
"print('彼岸', vocab.to_index('彼岸'), vocab.has_word('彼岸'))\n",
"print(len(vocab), vocab.word_count)\n",
"print(len(vocab), vocab.word2idx)"
]
},
{
"cell_type": "markdown",
"id": "23e32a63",
"metadata": {},
"source": [
"### 2.2 vocabulary and the OOV problem\n",
"\n",
"When a `vocabulary` is initialized, `unknown` and `padding` can be set to `None` to disable them.\n",
"\n",
"&emsp; Words are then numbered from 0, and looking up an unseen word raises an error directly: the out-of-vocabulary (OOV) case."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "a99ff909",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'positive': 0, 'negative': 1}\n",
"ValueError: word `neutral` not in vocabulary\n"
]
}
],
"source": [
"vocab = Vocabulary(unknown=None, padding=None)\n",
"\n",
"vocab.add_word_lst(['positive', 'negative'])\n",
"print(vocab.word2idx)\n",
"\n",
"try:\n",
" print(vocab.to_index('neutral'))\n",
"except ValueError:\n",
" print(\"ValueError: word `neutral` not in vocabulary\")"
]
},
{
"cell_type": "markdown",
"id": "618da6bd",
"metadata": {},
"source": [
"&emsp; Correspondingly, if only `unknown` is specified, word indices shift up by one, and every unseen word is mapped to `<unk>`."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "432f74c1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'<unk>': 0, 'positive': 1, 'negative': 2}\n",
"0 <unk>\n"
]
}
],
"source": [
"vocab = Vocabulary(unknown='<unk>', padding=None)\n",
"\n",
"vocab.add_word_lst(['positive', 'negative'])\n",
"print(vocab.word2idx)\n",
"\n",
"print(vocab.to_index('neutral'), vocab.to_word(vocab.to_index('neutral')))"
]
},
{
"cell_type": "markdown",
"id": "b6263f73",
"metadata": {},
"source": [
"## 3. Using dataset and vocabulary together\n",
" \n",
"### 3.1 Loading a dataset from a dataframe\n",
"\n",
"Below, part of the training data of the [Rotten Tomatoes movie review dataset](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews) from assignment 1 of [NLP-beginner](https://github.com/FudanNLP/nlp-beginner) makes up a `test4dataset.tsv` file,\n",
"\n",
"&emsp; used to show how `dataset` and `vocabulary` load and process a dataset. First, read the raw data into a `dataframe` with the `pandas` module."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "3dbd985d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SentenceId</th>\n",
" <th>Sentence</th>\n",
" <th>Sentiment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>A series of escapades demonstrating the adage ...</td>\n",
" <td>negative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>This quiet , introspective and entertaining in...</td>\n",
" <td>positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Even fans of Ismail Merchant 's work , I suspe...</td>\n",
" <td>negative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>A positively thrilling combination of ethnogra...</td>\n",
" <td>neutral</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>A comedy-drama of nearly epic proportions root...</td>\n",
" <td>positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>The Importance of Being Earnest , so thick wit...</td>\n",
" <td>neutral</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SentenceId Sentence Sentiment\n",
"0 1 A series of escapades demonstrating the adage ... negative\n",
"1 2 This quiet , introspective and entertaining in... positive\n",
"2 3 Even fans of Ismail Merchant 's work , I suspe... negative\n",
"3 4 A positively thrilling combination of ethnogra... neutral\n",
"4 5 A comedy-drama of nearly epic proportions root... positive\n",
"5 6 The Importance of Being Earnest , so thick wit... neutral"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv('./data/test4dataset.tsv', sep='\\t')\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "919ab350",
"metadata": {},
"source": [
"Next, fill the dataset with `dataset`'s `from_pandas` method, and tokenize the text with the `apply_more` method."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "4f634586",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/6 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------------------------------+-----------+\n",
"| SentenceId | Sentence | Sentiment |\n",
"+------------+------------------------------+-----------+\n",
"| 1 | ['a', 'series', 'of', 'es... | negative |\n",
"| 2 | ['this', 'quiet', ',', 'i... | positive |\n",
"| 3 | ['even', 'fans', 'of', 'i... | negative |\n",
"| 4 | ['a', 'positively', 'thri... | neutral |\n",
"| 5 | ['a', 'comedy-drama', 'of... | positive |\n",
"| 6 | ['the', 'importance', 'of... | neutral |\n",
"+------------+------------------------------+-----------+\n"
]
}
],
"source": [
"from fastNLP import DataSet\n",
"\n",
"dataset = DataSet()\n",
"dataset = dataset.from_pandas(df)\n",
"dataset.apply_more(lambda ins:{'SentenceId': ins['SentenceId'], \n",
" 'Sentence': ins['Sentence'].lower().split(), 'Sentiment': ins['Sentiment']}, \n",
" progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "5c1ae192",
"metadata": {},
"source": [
"&emsp; If intermediate results need to be saved, `dataset`'s `to_csv` method writes them to a `.csv` or `.tsv` file."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "46722efc",
"metadata": {},
"outputs": [],
"source": [
"dataset.to_csv('./data/test4dataset.csv')"
]
},
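{
"cell_type": "markdown",
"id": "roundtrip1",
"metadata": {},
"source": [
"A minimal, self-contained sketch of the round trip (hypothetical file name `tmp_dataset.csv`; note that list-valued fields come back as plain strings after passing through CSV):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from fastNLP import DataSet\n",
"\n",
"ds = DataSet({'idx': [0, 1], 'sentence': [\"This is an apple .\", \"I like apples .\"]})\n",
"ds.to_csv('tmp_dataset.csv')              # save with the method shown above\n",
"df = pd.read_csv('tmp_dataset.csv')       # read it back with pandas\n",
"restored = DataSet().from_pandas(df)      # and rebuild the dataset\n",
"print(len(restored))\n",
"```"
]
},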
{
"cell_type": "markdown",
"id": "5ba13989",
"metadata": {},
"source": [
"### 3.2 Building a vocabulary from a dataset\n",
"\n",
"Then, initialize a `vocabulary` and use its `from_dataset` method to collect and number every element\n",
"\n",
"&emsp; of the specified field of the `dataset`; if the field holds lists, every element inside those lists is numbered.\n",
"\n",
"&emsp; Note: **when words are added from a `dataset`**, **unlike with `add_word_lst`**, **the more often a word occurs**, **the smaller its index**; see `a` in this example."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "a2de615b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Counter({'a': 9, 'of': 9, ',': 7, 'the': 6, '.': 5, 'is': 3, 'and': 3, 'good': 2, 'for': 2, 'which': 2, 'this': 2, \"'s\": 2, 'series': 1, 'escapades': 1, 'demonstrating': 1, 'adage': 1, 'that': 1, 'what': 1, 'goose': 1, 'also': 1, 'gander': 1, 'some': 1, 'occasionally': 1, 'amuses': 1, 'but': 1, 'none': 1, 'amounts': 1, 'to': 1, 'much': 1, 'story': 1, 'quiet': 1, 'introspective': 1, 'entertaining': 1, 'independent': 1, 'worth': 1, 'seeking': 1, 'even': 1, 'fans': 1, 'ismail': 1, 'merchant': 1, 'work': 1, 'i': 1, 'suspect': 1, 'would': 1, 'have': 1, 'hard': 1, 'time': 1, 'sitting': 1, 'through': 1, 'one': 1, 'positively': 1, 'thrilling': 1, 'combination': 1, 'ethnography': 1, 'all': 1, 'intrigue': 1, 'betrayal': 1, 'deceit': 1, 'murder': 1, 'shakespearean': 1, 'tragedy': 1, 'or': 1, 'juicy': 1, 'soap': 1, 'opera': 1, 'comedy-drama': 1, 'nearly': 1, 'epic': 1, 'proportions': 1, 'rooted': 1, 'in': 1, 'sincere': 1, 'performance': 1, 'by': 1, 'title': 1, 'character': 1, 'undergoing': 1, 'midlife': 1, 'crisis': 1, 'importance': 1, 'being': 1, 'earnest': 1, 'so': 1, 'thick': 1, 'with': 1, 'wit': 1, 'it': 1, 'plays': 1, 'like': 1, 'reading': 1, 'from': 1, 'bartlett': 1, 'familiar': 1, 'quotations': 1}) \n",
"\n",
"{'<pad>': 0, '<unk>': 1, 'a': 2, 'of': 3, ',': 4, 'the': 5, '.': 6, 'is': 7, 'and': 8, 'good': 9, 'for': 10, 'which': 11, 'this': 12, \"'s\": 13, 'series': 14, 'escapades': 15, 'demonstrating': 16, 'adage': 17, 'that': 18, 'what': 19, 'goose': 20, 'also': 21, 'gander': 22, 'some': 23, 'occasionally': 24, 'amuses': 25, 'but': 26, 'none': 27, 'amounts': 28, 'to': 29, 'much': 30, 'story': 31, 'quiet': 32, 'introspective': 33, 'entertaining': 34, 'independent': 35, 'worth': 36, 'seeking': 37, 'even': 38, 'fans': 39, 'ismail': 40, 'merchant': 41, 'work': 42, 'i': 43, 'suspect': 44, 'would': 45, 'have': 46, 'hard': 47, 'time': 48, 'sitting': 49, 'through': 50, 'one': 51, 'positively': 52, 'thrilling': 53, 'combination': 54, 'ethnography': 55, 'all': 56, 'intrigue': 57, 'betrayal': 58, 'deceit': 59, 'murder': 60, 'shakespearean': 61, 'tragedy': 62, 'or': 63, 'juicy': 64, 'soap': 65, 'opera': 66, 'comedy-drama': 67, 'nearly': 68, 'epic': 69, 'proportions': 70, 'rooted': 71, 'in': 72, 'sincere': 73, 'performance': 74, 'by': 75, 'title': 76, 'character': 77, 'undergoing': 78, 'midlife': 79, 'crisis': 80, 'importance': 81, 'being': 82, 'earnest': 83, 'so': 84, 'thick': 85, 'with': 86, 'wit': 87, 'it': 88, 'plays': 89, 'like': 90, 'reading': 91, 'from': 92, 'bartlett': 93, 'familiar': 94, 'quotations': 95} \n",
"\n",
"Vocabulary(['a', 'series', 'of', 'escapades', 'demonstrating']...)\n"
]
}
],
"source": [
"from fastNLP import Vocabulary\n",
"\n",
"vocab = Vocabulary()\n",
"vocab = vocab.from_dataset(dataset, field_name='Sentence')\n",
"print(vocab.word_count, '\\n')\n",
"print(vocab.word2idx, '\\n')\n",
"print(vocab)"
]
},
{
"cell_type": "markdown",
"id": "f0857ccb",
"metadata": {},
"source": [
"Afterwards, **the `index_dataset` method of `vocabulary`** **rewrites the specified field of the `dataset`**, **replacing each element with its index**.\n",
"\n",
"&emsp; This converts the word sequences of the review dataset into index sequences, ready to be turned into embedding sequences next."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "2f9a04b2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------------------------------+-----------+\n",
"| SentenceId | Sentence | Sentiment |\n",
"+------------+------------------------------+-----------+\n",
"| 1 | [2, 14, 3, 15, 16, 5, 17,... | negative |\n",
"| 2 | [12, 32, 4, 33, 8, 34, 35... | positive |\n",
"| 3 | [38, 39, 3, 40, 41, 13, 4... | negative |\n",
"| 4 | [2, 52, 53, 54, 3, 55, 8,... | neutral |\n",
"| 5 | [2, 67, 3, 68, 69, 70, 71... | positive |\n",
"| 6 | [5, 81, 3, 82, 83, 4, 84,... | neutral |\n",
"+------------+------------------------------+-----------+\n"
]
}
],
"source": [
"vocab.index_dataset(dataset, field_name='Sentence')\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "6b26b707",
"metadata": {},
"source": [
"Finally, use the same approach to convert `negative`, `neutral` and `positive` in the `Sentiment` field of the `dataset` into numeric labels."
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "5f5eed18",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'negative': 0, 'positive': 1, 'neutral': 2}\n",
"+------------+------------------------------+-----------+\n",
"| SentenceId | Sentence | Sentiment |\n",
"+------------+------------------------------+-----------+\n",
"| 1 | [2, 14, 3, 15, 16, 5, 17,... | 0 |\n",
"| 2 | [12, 32, 4, 33, 8, 34, 35... | 1 |\n",
"| 3 | [38, 39, 3, 40, 41, 13, 4... | 0 |\n",
"| 4 | [2, 52, 53, 54, 3, 55, 8,... | 2 |\n",
"| 5 | [2, 67, 3, 68, 69, 70, 71... | 1 |\n",
"| 6 | [5, 81, 3, 82, 83, 4, 84,... | 2 |\n",
"+------------+------------------------------+-----------+\n"
]
}
],
"source": [
"target_vocab = Vocabulary(padding=None, unknown=None)\n",
"\n",
"target_vocab.from_dataset(dataset, field_name='Sentiment')\n",
"print(target_vocab.word2idx)\n",
"target_vocab.index_dataset(dataset, field_name='Sentiment')\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "eed7ea64",
"metadata": {},
"source": [
"At the very end, the figure below sums up the main points of this chapter about `dataset` and `vocabulary`, and how the two connect.\n",
"\n",
"<img src=\"./figures/T1-fig-dataset-and-vocabulary.png\" width=\"80%\" height=\"80%\" align=\"center\"></img>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35b4f0f7",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}