fastNLP/tutorials/fastnlp_tutorial_1.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "cdc25fcd",
"metadata": {},
"source": [
"# T1. dataset 和 vocabulary 的基本使用\n",
"\n",
"  1   dataset 的使用与结构\n",
" \n",
"    1.1   dataset 的结构与创建\n",
"\n",
"    1.2   dataset 的数据预处理\n",
"\n",
"    1.3   延伸instance 和 field\n",
"\n",
"  2   vocabulary 的结构与使用\n",
"\n",
"    2.1   vocabulary 的创建与修改\n",
"\n",
"    2.2   vocabulary 与 OOV 问题\n",
"\n",
"  3   dataset 和 vocabulary 的组合使用\n",
" \n",
"    3.1   从 dataframe 中加载 dataset\n",
"\n",
"    3.2   从 dataset 中获取 vocabulary"
]
},
{
"cell_type": "markdown",
"id": "0eb18a22",
"metadata": {},
"source": [
"## 1. dataset 的基本使用\n",
"\n",
"### 1.1 dataset 的结构与创建\n",
"\n",
"在`fastNLP 0.8`中,使用`DataSet`模块表示数据集,**`dataset`类似于关系型数据库中的数据表**(下文统一为小写`dataset`\n",
"\n",
"  **主要包含`field`字段和`instance`实例两个元素**,对应`table`中的`field`字段和`record`记录\n",
"\n",
"在`fastNLP 0.8`中,`DataSet`模块被定义在`fastNLP.core.dataset`路径下,导入该模块后,最简单的\n",
"\n",
"  初始化方法,即将字典形式的表格 **`{'field1': column1, 'field2': column2, ...}`** 传入构造函数"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a1d69ad2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"from fastNLP import DataSet\n",
"\n",
"data = {'idx': [0, 1, 2], \n",
" 'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"],\n",
" 'words': [['This', 'is', 'an', 'apple', '.'], \n",
" ['I', 'like', 'apples', '.'], \n",
" ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],\n",
" 'num': [5, 4, 7]}\n",
"\n",
"dataset = DataSet(data)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "9260fdc6",
"metadata": {},
"source": [
"&emsp; 在`dataset`的实例中,字段`field`的名称和实例`instance`中的字符串也可以中文"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3d72ef00",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------+--------------------+------------------------+------+\n",
"| 序号 | 句子 | 字符 | 长度 |\n",
"+------+--------------------+------------------------+------+\n",
"| 0 | 生活就像海洋, | ['生', '活', '就', ... | 7 |\n",
"| 1 | 只有意志坚强的人, | ['只', '有', '意', ... | 9 |\n",
"| 2 | 才能到达彼岸。 | ['才', '能', '到', ... | 7 |\n",
"+------+--------------------+------------------------+------+\n"
]
}
],
"source": [
"temp = {'序号': [0, 1, 2], \n",
" '句子':[\"生活就像海洋,\", \"只有意志坚强的人,\", \"才能到达彼岸。\"],\n",
" '字符': [['生', '活', '就', '像', '海', '洋', ''], \n",
" ['只', '有', '意', '志', '坚', '强', '的', '人', ''], \n",
" ['才', '能', '到', '达', '彼', '岸', '。']],\n",
" '长度': [7, 9, 7]}\n",
"\n",
"chinese = DataSet(temp)\n",
"print(chinese)"
]
},
{
"cell_type": "markdown",
"id": "202e5490",
"metadata": {},
"source": [
"在`dataset`中,使用`drop`方法可以删除满足条件的实例这里使用了python中的`lambda`表达式\n",
"\n",
"&emsp; 注一:在`drop`方法中,通过设置`inplace`参数将删除对应实例后的`dataset`作为一个新的实例生成"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "09b478f8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2492313174344 2491986424200\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dropped = dataset\n",
"dropped = dropped.drop(lambda ins:ins['num'] < 5, inplace=False)\n",
"print(id(dropped), id(dataset))\n",
"print(dropped)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "aa277674",
"metadata": {},
"source": [
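"&emsp; Note 2: **in Python, assigning an object with `=` generally binds a reference**, so assigning a `dataset` with `=` creates another name for the same object rather than a copy.\n",
"\n",
"&emsp; &emsp; As shown below, **`dropped` and `dataset` have the same `id`**, so **deleting instances from `dropped` also modifies `dataset`**."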
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "77c8583a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2491986424200 2491986424200\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dropped = dataset\n",
"dropped.drop(lambda ins:ins['num'] < 5)\n",
"print(id(dropped), id(dataset))\n",
"print(dropped)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "a76199dc",
"metadata": {},
"source": [
"在`dataset`中,使用`delet_instance`方法可以删除对应序号的`instance`实例序号从0开始"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d8824b40",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+--------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+--------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"+-----+--------------------+------------------------+-----+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.delete_instance(2)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "f4fa9f33",
"metadata": {},
"source": [
"在`dataset`中,使用`delet_field`方法可以删除对应名称的`field`字段"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f68ddb40",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+--------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+--------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"+-----+--------------------+------------------------------+\n"
]
}
],
"source": [
"dataset.delete_field('num')\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "b1e9d42c",
"metadata": {},
"source": [
"### 1.2 dataset 的数据预处理\n",
"\n",
"在`dataset`模块中,`apply`、`apply_field`、`apply_more`和`apply_field_more`函数可以进行简单的数据预处理\n",
"\n",
2022-05-13 11:49:44 +08:00
"&emsp; **`apply`和`apply_more`输入整条实例****`apply_field`和`apply_field_more`仅输入实例的部分字段**\n",
2022-05-03 22:24:27 +08:00
"\n",
2022-05-13 11:49:44 +08:00
"&emsp; **`apply`和`apply_field`仅输出单个字段****`apply_more`和`apply_field_more`则是输出多个字段**\n",
2022-05-03 22:24:27 +08:00
"\n",
"&emsp; **`apply`和`apply_field`返回的是个列表****`apply_more`和`apply_field_more`返回的是个字典**\n",
"\n",
2022-05-14 15:53:14 +08:00
"&emsp; &emsp; 预处理过程中,通过`progress_bar`参数设置显示进度条类型,通过`num_proc`设置多进程\n",
2022-05-03 22:24:27 +08:00
"***\n",
"\n",
"`apply`的参数包括一个函数`func`和一个新字段名`new_field_name`,函数`func`的处理对象是`dataset`模块中\n",
"\n",
"&emsp; 的每个`instance`实例,函数`func`的处理结果存放在`new_field_name`对应的新建字段内"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "72a0b5f9",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+------------------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
"+-----+------------------------------+------------------------------+\n"
]
}
],
"source": [
"from fastNLP import DataSet\n",
"\n",
"data = {'idx': [0, 1, 2], \n",
" 'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"], }\n",
"dataset = DataSet(data)\n",
"dataset.apply(lambda ins: ins['sentence'].split(), new_field_name='words', progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "c10275ee",
"metadata": {},
"source": [
"&emsp; **`apply`使用的函数可以是一个基于`lambda`表达式的匿名函数****也可以是一个自定义的函数**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b1a8631f",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+------------------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
"+-----+------------------------------+------------------------------+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"\n",
"def get_words(instance):\n",
" sentence = instance['sentence']\n",
" words = sentence.split()\n",
" return words\n",
"\n",
"dataset.apply(get_words, new_field_name='words', progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "64abf745",
"metadata": {},
"source": [
"`apply_field`的参数,除了函数`func`外还有`field_name`和`new_field_name`,该函数`func`的处理对象仅\n",
"\n",
"&emsp; 是`dataset`模块中的每个`field_name`对应的字段内容,处理结果存放在`new_field_name`对应的新建字段内"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "057c1d2c",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+------------------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
"+-----+------------------------------+------------------------------+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.apply_field(lambda sent:sent.split(), field_name='sentence', new_field_name='words', \n",
" progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "5a9cc8b2",
"metadata": {},
"source": [
"`apply_more`的参数只有函数`func`,函数`func`的处理对象是`dataset`模块中的每个`instance`实例\n",
"\n",
"&emsp; 要求函数`func`返回一个字典,根据字典的`key-value`确定存储在`dataset`中的字段名称与内容"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "51e2f02c",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.apply_more(lambda ins:{'words': ins['sentence'].split(), 'num': len(ins['sentence'].split())}, \n",
" progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "02d2b7ef",
"metadata": {},
"source": [
"`apply_more`的参数只有函数`func`,函数`func`的处理对象是`dataset`模块中的每个`instance`实例\n",
"\n",
"&emsp; 要求函数`func`返回一个字典,根据字典的`key-value`确定存储在`dataset`中的字段名称与内容"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "db4295d5",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.apply_field_more(lambda sent:{'words': sent.split(), 'num': len(sent.split())}, \n",
" field_name='sentence', progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "9c09e592",
"metadata": {},
"source": [
"### 1.3 延伸instance 和 field\n",
"\n",
"在`fastNLP 0.8`中,使用`Instance`模块表示数据集`dataset`中的每条数据,被称为实例\n",
"\n",
"&emsp; 构造方式类似于构造一个字典,通过键值相同的`Instance`列表,也可以初始化一个`dataset`,代码如下"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "012f537c",
"metadata": {},
"outputs": [],
"source": [
"from fastNLP import DataSet\n",
"from fastNLP import Instance\n",
"\n",
"dataset = DataSet([\n",
" Instance(sentence=\"This is an apple .\",\n",
" words=['This', 'is', 'an', 'apple', '.'],\n",
" num=5),\n",
" Instance(sentence=\"I like apples .\",\n",
" words=['I', 'like', 'apples', '.'],\n",
" num=4),\n",
" Instance(sentence=\"Apples are good for our health .\",\n",
" words=['Apples', 'are', 'good', 'for', 'our', 'health', '.'],\n",
" num=7),\n",
" ])"
]
},
{
"cell_type": "markdown",
"id": "2fafb1ef",
"metadata": {},
"source": [
"&emsp; 通过`items`、`keys`和`values`方法,可以分别获得`dataset`的`item`列表、`key`列表、`value`列表"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a4c1c10d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dict_items([('sentence', 'This is an apple .'), ('words', ['This', 'is', 'an', 'apple', '.']), ('num', 5)])\n",
"dict_keys(['sentence', 'words', 'num'])\n",
"dict_values(['This is an apple .', ['This', 'is', 'an', 'apple', '.'], 5])\n"
]
}
],
"source": [
"ins = Instance(sentence=\"This is an apple .\", words=['This', 'is', 'an', 'apple', '.'], num=5)\n",
"\n",
"print(ins.items())\n",
"print(ins.keys())\n",
"print(ins.values())"
]
},
{
"cell_type": "markdown",
"id": "b5459a2d",
"metadata": {},
"source": [
"&emsp; 通过`add_field`方法,可以在`Instance`实例中,通过参数`field_name`添加字段,通过参数`field`赋值"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "55376402",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+------------------------+-----+-----+\n",
"| sentence | words | num | idx |\n",
"+--------------------+------------------------+-----+-----+\n",
"| This is an apple . | ['This', 'is', 'an'... | 5 | 0 |\n",
"+--------------------+------------------------+-----+-----+\n"
]
}
],
"source": [
"ins.add_field(field_name='idx', field=0)\n",
"print(ins)"
]
},
{
"cell_type": "markdown",
"id": "49caaa9c",
"metadata": {},
"source": [
"在`fastNLP 0.8`中,使用`FieldArray`模块表示数据集`dataset`中的每条字段名(注:没有`field`类)\n",
"\n",
"&emsp; 通过`get_all_fields`方法可以获取`dataset`的字段列表\n",
"\n",
"&emsp; 通过`get_field_names`方法可以获取`dataset`的字段名称列表,代码如下"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "fe15f4c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'sentence': <fastNLP.core.dataset.field.FieldArray at 0x2444977fe88>,\n",
" 'words': <fastNLP.core.dataset.field.FieldArray at 0x2444977ff08>,\n",
" 'num': <fastNLP.core.dataset.field.FieldArray at 0x2444977ff88>}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.get_all_fields()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "5433815c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['num', 'sentence', 'words']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.get_field_names()"
]
},
{
"cell_type": "markdown",
"id": "4964eeed",
"metadata": {},
"source": [
"其他`dataset`的基本使用:通过`in`或者`has_field`方法可以判断`dataset`的是否包含某种字段\n",
"\n",
"&emsp; 通过`rename_field`方法可以更改`dataset`中的字段名称;通过`concat`方法可以实现两个`dataset`中的拼接\n",
"\n",
"&emsp; 通过`len`可以统计`dataset`中的实例数目;`dataset`的全部变量与函数可以通过`dir(dataset)`查询"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "25ce5488",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3 False\n",
"6 True\n",
"+------------------------------+------------------------------+--------+\n",
"| sentence | words | length |\n",
"+------------------------------+------------------------------+--------+\n",
"| This is an apple . | ['This', 'is', 'an', 'app... | 5 |\n",
"| I like apples . | ['I', 'like', 'apples', '... | 4 |\n",
"| Apples are good for our h... | ['Apples', 'are', 'good',... | 7 |\n",
"| This is an apple . | ['This', 'is', 'an', 'app... | 5 |\n",
"| I like apples . | ['I', 'like', 'apples', '... | 4 |\n",
"| Apples are good for our h... | ['Apples', 'are', 'good',... | 7 |\n",
"+------------------------------+------------------------------+--------+\n"
]
}
],
"source": [
"print(len(dataset), dataset.has_field('length')) \n",
"if 'num' in dataset:\n",
" dataset.rename_field('num', 'length')\n",
"elif 'length' in dataset:\n",
" dataset.rename_field('length', 'num')\n",
"dataset.concat(dataset)\n",
"print(len(dataset), dataset.has_field('length')) \n",
"print(dataset) "
]
},
{
"cell_type": "markdown",
"id": "e30a6cd7",
"metadata": {},
"source": [
"## 2. vocabulary 的结构与使用\n",
"\n",
"### 2.1 vocabulary 的创建与修改\n",
"\n",
"在`fastNLP 0.8`中,使用`Vocabulary`模块表示词汇表,**`vocabulary`的核心是从单词到序号的映射**\n",
"\n",
"&emsp; 可以直接通过构造函数实例化,通过查找`word2idx`属性,可以找到`vocabulary`映射对应的字典实现\n",
"\n",
"&emsp; **默认补零`padding`用`<pad>`表示****对应序号为0****未知单词`unknown`用`<unk>`表示****对应序号1**\n",
"\n",
"&emsp; 通过打印`vocabulary`可以看到词汇表中的单词列表,其中,`padding`和`unknown`不会显示"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "3515e096",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vocabulary([]...)\n",
"{'<pad>': 0, '<unk>': 1}\n",
"<pad> 0\n",
"<unk> 1\n"
]
}
],
"source": [
"from fastNLP import Vocabulary\n",
"\n",
"vocab = Vocabulary()\n",
"print(vocab)\n",
"print(vocab.word2idx)\n",
"print(vocab.padding, vocab.padding_idx)\n",
"print(vocab.unknown, vocab.unknown_idx)"
]
},
{
"cell_type": "markdown",
"id": "640be126",
"metadata": {},
"source": [
"在`vocabulary`中,通过`add_word`方法或`add_word_lst`方法,可以单独或批量添加单词\n",
"\n",
"&emsp; 通过`len`或`word_count`属性,可以显示`vocabulary`的单词量和每个单词添加的次数"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "88c7472a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5 Counter({'生活': 1, '就像': 1, '海洋': 1})\n",
"6 Counter({'生活': 1, '就像': 1, '海洋': 1, '只有': 1})\n",
"6 {'<pad>': 0, '<unk>': 1, '生活': 2, '就像': 3, '海洋': 4, '只有': 5}\n"
]
}
],
"source": [
"vocab.add_word_lst(['生活', '就像', '海洋'])\n",
"print(len(vocab), vocab.word_count)\n",
"vocab.add_word('只有')\n",
"print(len(vocab), vocab.word_count)\n",
"print(len(vocab), vocab.word2idx)"
]
},
{
"cell_type": "markdown",
"id": "f9ec8b28",
"metadata": {},
"source": [
"&emsp; **通过`to_word`方法可以找到单词对应的序号****通过`to_index`方法可以找到序号对应的单词**\n",
"\n",
"&emsp; &emsp; 由于序号0和序号1已经被占用所以**新加入的词的序号从2开始计数**,如`'生活'`对应2\n",
"\n",
"&emsp; &emsp; 通过`has_word`方法可以判断单词是否在词汇表中,没有的单词被判做`<unk>`"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "3447acde",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<pad> 0\n",
"<unk> 1\n",
"生活 2\n",
"彼岸 1 False\n"
]
}
],
"source": [
"print(vocab.to_word(0), vocab.to_index('<pad>'))\n",
"print(vocab.to_word(1), vocab.to_index('<unk>'))\n",
"print(vocab.to_word(2), vocab.to_index('生活'))\n",
"print('彼岸', vocab.to_index('彼岸'), vocab.has_word('彼岸'))"
]
},
{
"cell_type": "markdown",
"id": "b4e36850",
"metadata": {},
"source": [
"**`vocabulary`允许反复添加相同单词****可以通过`word_count`方法看到相应单词被添加的次数**\n",
"\n",
2022-05-04 19:10:39 +08:00
"&emsp; 但其中没有`<unk>`和`<pad>``vocabulary`的全部变量与函数可以通过`dir(vocabulary)`查询\n",
"\n",
"&emsp; 注:**使用`add_word_lst`添加单词****单词对应序号不会动态调整****使用`dataset`添加单词的情况不同**"
2022-05-03 22:24:27 +08:00
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "490b101c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"生活 2\n",
"彼岸 12 True\n",
"13 Counter({'人': 4, '生活': 2, '就像': 2, '海洋': 2, '只有': 2, '意志': 1, '坚强的': 1, '才': 1, '能': 1, '到达': 1, '彼岸': 1})\n",
"13 {'<pad>': 0, '<unk>': 1, '生活': 2, '就像': 3, '海洋': 4, '只有': 5, '人': 6, '意志': 7, '坚强的': 8, '才': 9, '能': 10, '到达': 11, '彼岸': 12}\n"
]
}
],
"source": [
"vocab.add_word_lst(['生活', '就像', '海洋', '只有', '意志', '坚强的', '人', '人', '人', '人', '才', '能', '到达', '彼岸'])\n",
"print(vocab.to_word(2), vocab.to_index('生活'))\n",
"print('彼岸', vocab.to_index('彼岸'), vocab.has_word('彼岸'))\n",
"print(len(vocab), vocab.word_count)\n",
"print(len(vocab), vocab.word2idx)"
]
},
{
"cell_type": "markdown",
"id": "23e32a63",
"metadata": {},
"source": [
"### 2.2 vocabulary 与 OOV 问题\n",
"\n",
"在`vocabulary`模块初始化的时候,可以通过指定`unknown`和`padding`为`None`,限制其存在\n",
"\n",
"&emsp; 此时添加单词直接从0开始标号如果遇到未知单词会直接报错即 out of vocabulary"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "a99ff909",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'positive': 0, 'negative': 1}\n",
"ValueError: word `neutral` not in vocabulary\n"
]
}
],
"source": [
"vocab = Vocabulary(unknown=None, padding=None)\n",
"\n",
"vocab.add_word_lst(['positive', 'negative'])\n",
"print(vocab.word2idx)\n",
"\n",
"try:\n",
" print(vocab.to_index('neutral'))\n",
"except ValueError:\n",
" print(\"ValueError: word `neutral` not in vocabulary\")"
]
},
{
"cell_type": "markdown",
"id": "618da6bd",
"metadata": {},
"source": [
"&emsp; 相应的,如果只指定其中的`unknown`,则编号会后移一个,同时遇到未知单词全部当做`<unk>`"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "432f74c1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'<unk>': 0, 'positive': 1, 'negative': 2}\n",
"0 <unk>\n"
]
}
],
"source": [
"vocab = Vocabulary(unknown='<unk>', padding=None)\n",
"\n",
"vocab.add_word_lst(['positive', 'negative'])\n",
"print(vocab.word2idx)\n",
"\n",
"print(vocab.to_index('neutral'), vocab.to_word(vocab.to_index('neutral')))"
]
},
{
"cell_type": "markdown",
"id": "b6263f73",
"metadata": {},
"source": [
"## 3 dataset 和 vocabulary 的组合使用\n",
" \n",
"### 3.1 从 dataframe 中加载 dataset\n",
2022-05-04 19:10:39 +08:00
"\n",
"以下通过 [NLP-beginner](https://github.com/FudanNLP/nlp-beginner) 实践一中 [Rotten Tomatoes 影评数据集](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews) 的部分训练数据组成`test4dataset.tsv`文件\n",
"\n",
"&emsp; 介绍如何使用`dataset`、`vocabulary`简单加载并处理数据集,首先使用`pandas`模块,读取原始数据的`dataframe`"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "3dbd985d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SentenceId</th>\n",
" <th>Sentence</th>\n",
" <th>Sentiment</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>A series of escapades demonstrating the adage ...</td>\n",
" <td>negative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>This quiet , introspective and entertaining in...</td>\n",
" <td>positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Even fans of Ismail Merchant 's work , I suspe...</td>\n",
" <td>negative</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>A positively thrilling combination of ethnogra...</td>\n",
" <td>neutral</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>A comedy-drama of nearly epic proportions root...</td>\n",
" <td>positive</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>The Importance of Being Earnest , so thick wit...</td>\n",
" <td>neutral</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SentenceId Sentence Sentiment\n",
"0 1 A series of escapades demonstrating the adage ... negative\n",
"1 2 This quiet , introspective and entertaining in... positive\n",
"2 3 Even fans of Ismail Merchant 's work , I suspe... negative\n",
"3 4 A positively thrilling combination of ethnogra... neutral\n",
"4 5 A comedy-drama of nearly epic proportions root... positive\n",
"5 6 The Importance of Being Earnest , so thick wit... neutral"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv('./data/test4dataset.tsv', sep='\\t')\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "919ab350",
"metadata": {},
"source": [
"Next, the `from_pandas` method of `dataset` fills the dataset, and the `apply_more` method tokenizes the text (lowercasing it and splitting it on whitespace)."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "4f634586",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Processing: 0%| | 0/6 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------------------------------+-----------+\n",
"| SentenceId | Sentence | Sentiment |\n",
"+------------+------------------------------+-----------+\n",
"| 1 | ['a', 'series', 'of', 'es... | negative |\n",
"| 2 | ['this', 'quiet', ',', 'i... | positive |\n",
"| 3 | ['even', 'fans', 'of', 'i... | negative |\n",
"| 4 | ['a', 'positively', 'thri... | neutral |\n",
"| 5 | ['a', 'comedy-drama', 'of... | positive |\n",
"| 6 | ['the', 'importance', 'of... | neutral |\n",
"+------------+------------------------------+-----------+\n"
]
}
],
"source": [
"from fastNLP import DataSet\n",
"\n",
"dataset = DataSet()\n",
"dataset = dataset.from_pandas(df)\n",
"dataset.apply_more(lambda ins:{'SentenceId': ins['SentenceId'], \n",
" 'Sentence': ins['Sentence'].lower().split(), 'Sentiment': ins['Sentiment']}, \n",
" progress_bar=\"tqdm\")\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "5c1ae192",
"metadata": {},
"source": [
"&emsp; 如果需要保存中间结果,也可以使用`dataset`的`to_csv`方法,生成`.csv`或`.tsv`文件"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "46722efc",
"metadata": {},
"outputs": [],
"source": [
"dataset.to_csv('./data/test4dataset.csv')"
]
},
{
"cell_type": "markdown",
"id": "5ba13989",
"metadata": {},
"source": [
"### 3.2 Building a vocabulary from a dataset\n",
"\n",
"Then initialize a `vocabulary` and use its `from_dataset` method to collect all elements of the specified field of the `dataset`\n",
"\n",
"&emsp; and number them; if the specified field holds lists, the elements of all those lists are numbered.\n",
"\n",
"&emsp; Note: **adding words from a `dataset`** **differs from `add_word_lst`**: **the more often a word is added**, **the smaller its index**, for example `a` in this case."
]
},
{
"cell_type": "code",
2022-05-14 15:53:14 +08:00
"execution_count": 27,
"id": "a2de615b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Counter({'a': 9, 'of': 9, ',': 7, 'the': 6, '.': 5, 'is': 3, 'and': 3, 'good': 2, 'for': 2, 'which': 2, 'this': 2, \"'s\": 2, 'series': 1, 'escapades': 1, 'demonstrating': 1, 'adage': 1, 'that': 1, 'what': 1, 'goose': 1, 'also': 1, 'gander': 1, 'some': 1, 'occasionally': 1, 'amuses': 1, 'but': 1, 'none': 1, 'amounts': 1, 'to': 1, 'much': 1, 'story': 1, 'quiet': 1, 'introspective': 1, 'entertaining': 1, 'independent': 1, 'worth': 1, 'seeking': 1, 'even': 1, 'fans': 1, 'ismail': 1, 'merchant': 1, 'work': 1, 'i': 1, 'suspect': 1, 'would': 1, 'have': 1, 'hard': 1, 'time': 1, 'sitting': 1, 'through': 1, 'one': 1, 'positively': 1, 'thrilling': 1, 'combination': 1, 'ethnography': 1, 'all': 1, 'intrigue': 1, 'betrayal': 1, 'deceit': 1, 'murder': 1, 'shakespearean': 1, 'tragedy': 1, 'or': 1, 'juicy': 1, 'soap': 1, 'opera': 1, 'comedy-drama': 1, 'nearly': 1, 'epic': 1, 'proportions': 1, 'rooted': 1, 'in': 1, 'sincere': 1, 'performance': 1, 'by': 1, 'title': 1, 'character': 1, 'undergoing': 1, 'midlife': 1, 'crisis': 1, 'importance': 1, 'being': 1, 'earnest': 1, 'so': 1, 'thick': 1, 'with': 1, 'wit': 1, 'it': 1, 'plays': 1, 'like': 1, 'reading': 1, 'from': 1, 'bartlett': 1, 'familiar': 1, 'quotations': 1}) \n",
"\n",
"{'<pad>': 0, '<unk>': 1, 'a': 2, 'of': 3, ',': 4, 'the': 5, '.': 6, 'is': 7, 'and': 8, 'good': 9, 'for': 10, 'which': 11, 'this': 12, \"'s\": 13, 'series': 14, 'escapades': 15, 'demonstrating': 16, 'adage': 17, 'that': 18, 'what': 19, 'goose': 20, 'also': 21, 'gander': 22, 'some': 23, 'occasionally': 24, 'amuses': 25, 'but': 26, 'none': 27, 'amounts': 28, 'to': 29, 'much': 30, 'story': 31, 'quiet': 32, 'introspective': 33, 'entertaining': 34, 'independent': 35, 'worth': 36, 'seeking': 37, 'even': 38, 'fans': 39, 'ismail': 40, 'merchant': 41, 'work': 42, 'i': 43, 'suspect': 44, 'would': 45, 'have': 46, 'hard': 47, 'time': 48, 'sitting': 49, 'through': 50, 'one': 51, 'positively': 52, 'thrilling': 53, 'combination': 54, 'ethnography': 55, 'all': 56, 'intrigue': 57, 'betrayal': 58, 'deceit': 59, 'murder': 60, 'shakespearean': 61, 'tragedy': 62, 'or': 63, 'juicy': 64, 'soap': 65, 'opera': 66, 'comedy-drama': 67, 'nearly': 68, 'epic': 69, 'proportions': 70, 'rooted': 71, 'in': 72, 'sincere': 73, 'performance': 74, 'by': 75, 'title': 76, 'character': 77, 'undergoing': 78, 'midlife': 79, 'crisis': 80, 'importance': 81, 'being': 82, 'earnest': 83, 'so': 84, 'thick': 85, 'with': 86, 'wit': 87, 'it': 88, 'plays': 89, 'like': 90, 'reading': 91, 'from': 92, 'bartlett': 93, 'familiar': 94, 'quotations': 95} \n",
"\n",
"Vocabulary(['a', 'series', 'of', 'escapades', 'demonstrating']...)\n"
]
}
],
"source": [
"from fastNLP import Vocabulary\n",
"\n",
"vocab = Vocabulary()\n",
"vocab = vocab.from_dataset(dataset, field_name='Sentence')\n",
"print(vocab.word_count, '\\n')\n",
"print(vocab.word2idx, '\\n')\n",
"print(vocab)"
]
},
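{
"cell_type": "markdown",
"id": "sketch-freq-index",
"metadata": {},
"source": [
"The frequency-based numbering seen in the output above can be sketched in plain Python. This is only an illustration under stated assumptions, not fastNLP's code: the function `build_word2idx` and the toy sentences are hypothetical, and ties between equally frequent words are assumed to keep first-seen order, which matches `a` and `of` in the printed `word2idx`.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def build_word2idx(sentences, padding='<pad>', unknown='<unk>'):\n",
"    # Count every token across all token lists\n",
"    counts = Counter(tok for sent in sentences for tok in sent)\n",
"    # Special entries take the first indices when they are enabled\n",
"    word2idx = {}\n",
"    for special in (padding, unknown):\n",
"        if special is not None:\n",
"            word2idx[special] = len(word2idx)\n",
"    # most_common() sorts by descending count; equal counts keep first-seen order\n",
"    for word, _ in counts.most_common():\n",
"        word2idx[word] = len(word2idx)\n",
"    return word2idx\n",
"\n",
"# Toy data: 'a' occurs 3 times, 'of' twice, the rest once\n",
"sentences = [['a', 'series', 'of', 'a'], ['of', 'a', 'story']]\n",
"word2idx = build_word2idx(sentences)\n",
"print(word2idx)\n",
"```\n",
"\n",
"With this toy input, `a` receives the first index after the two special entries, followed by `of`."
]
},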
{
"cell_type": "markdown",
"id": "f0857ccb",
"metadata": {},
"source": [
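"After that, **use the `index_dataset` method of `vocabulary`** to **rewrite the elements of the specified field in `dataset`**, **replacing each with its index**\n",
"\n",
"&emsp; This converts the word sequences in the movie-review dataset into sequences of word indices, ready to be turned into word-embedding sequences later"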
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "2f9a04b2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------------------------------+-----------+\n",
"| SentenceId | Sentence | Sentiment |\n",
"+------------+------------------------------+-----------+\n",
"| 1 | [2, 14, 3, 15, 16, 5, 17,... | negative |\n",
"| 2 | [12, 32, 4, 33, 8, 34, 35... | positive |\n",
"| 3 | [38, 39, 3, 40, 41, 13, 4... | negative |\n",
"| 4 | [2, 52, 53, 54, 3, 55, 8,... | neutral |\n",
"| 5 | [2, 67, 3, 68, 69, 70, 71... | positive |\n",
"| 6 | [5, 81, 3, 82, 83, 4, 84,... | neutral |\n",
"+------------+------------------------------+-----------+\n"
]
}
],
"source": [
"vocab.index_dataset(dataset, field_name='Sentence')\n",
"print(dataset)"
]
},
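{
"cell_type": "markdown",
"id": "sketch-index-tokens",
"metadata": {},
"source": [
"Replacing tokens with indices, as in the table above, amounts to a dictionary lookup per token. The sketch below is plain Python with a hypothetical helper `index_tokens` and a made-up mapping; it assumes that a token missing from the mapping falls back to the `<unk>` index.\n",
"\n",
"```python\n",
"def index_tokens(tokens, word2idx, unknown='<unk>'):\n",
"    # Out-of-vocabulary tokens fall back to the index of the unknown entry\n",
"    unk_idx = word2idx[unknown]\n",
"    return [word2idx.get(tok, unk_idx) for tok in tokens]\n",
"\n",
"# Made-up mapping for illustration only\n",
"word2idx = {'<pad>': 0, '<unk>': 1, 'a': 2, 'of': 3, 'series': 4}\n",
"indexed = index_tokens(['a', 'series', 'of', 'sequels'], word2idx)\n",
"print(indexed)\n",
"```\n",
"\n",
"Here `sequels` is not in the mapping, so it is mapped to the `<unk>` index."
]
},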
{
"cell_type": "markdown",
"id": "6b26b707",
"metadata": {},
"source": [
"Finally, apply the same approach to convert `negative`, `neutral`, and `positive` in the `Sentiment` field of `dataset` into numeric indices"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "5f5eed18",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'negative': 0, 'positive': 1, 'neutral': 2}\n",
"+------------+------------------------------+-----------+\n",
"| SentenceId | Sentence | Sentiment |\n",
"+------------+------------------------------+-----------+\n",
"| 1 | [2, 14, 3, 15, 16, 5, 17,... | 0 |\n",
"| 2 | [12, 32, 4, 33, 8, 34, 35... | 1 |\n",
"| 3 | [38, 39, 3, 40, 41, 13, 4... | 0 |\n",
"| 4 | [2, 52, 53, 54, 3, 55, 8,... | 2 |\n",
"| 5 | [2, 67, 3, 68, 69, 70, 71... | 1 |\n",
"| 6 | [5, 81, 3, 82, 83, 4, 84,... | 2 |\n",
"+------------+------------------------------+-----------+\n"
]
}
],
"source": [
"# Class labels need neither a padding nor an unknown entry, so disable both\n",
"target_vocab = Vocabulary(padding=None, unknown=None)\n",
"\n",
"target_vocab.from_dataset(dataset, field_name='Sentiment')\n",
"print(target_vocab.word2idx)\n",
"target_vocab.index_dataset(dataset, field_name='Sentiment')\n",
"print(dataset)"
]
},
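{
"cell_type": "markdown",
"id": "sketch-label-index",
"metadata": {},
"source": [
"The label mapping printed above can be reproduced with a short plain-Python sketch. It is an illustration rather than fastNLP's implementation: the list `labels` simply copies the six `Sentiment` values from the table, and ties in frequency are assumed to keep first-seen order.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Sentiment values of the six rows, in table order\n",
"labels = ['negative', 'positive', 'negative', 'neutral', 'positive', 'neutral']\n",
"# No padding or unknown entries, so class indices start at 0\n",
"label2idx = {label: idx for idx, (label, _) in enumerate(Counter(labels).most_common())}\n",
"encoded = [label2idx[label] for label in labels]\n",
"print(label2idx)\n",
"print(encoded)\n",
"```\n",
"\n",
"Starting the class indices at 0 with no reserved entries keeps the number of distinct indices equal to the number of classes."
]
},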
{
"cell_type": "markdown",
"id": "eed7ea64",
"metadata": {},
"source": [
"To wrap up, the figure below summarizes the main points covered in this chapter on `dataset` and `vocabulary`, and how the two relate\n",
"\n",
"<img src=\"./figures/T1-fig-dataset-and-vocabulary.png\" width=\"80%\" height=\"80%\" align=\"center\"></img>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35b4f0f7",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}