{ "cells": [ { "cell_type": "markdown", "id": "cdc25fcd", "metadata": {}, "source": [ "# T1. dataset 和 vocabulary 的基本使用\n", "\n", "  1   dataset 的使用与结构\n", " \n", "    1.1   dataset 的结构与创建\n", "\n", "    1.2   dataset 的数据预处理\n", "\n", "    1.3   延伸:instance 和 field\n", "\n", "  2   vocabulary 的结构与使用\n", "\n", "    2.1   vocabulary 的创建与修改\n", "\n", "    2.2   vocabulary 与 OOV 问题\n", "\n", "  3   dataset 和 vocabulary 的组合使用\n", " \n", "    3.1   从 dataframe 中加载 dataset\n", "\n", "    3.2   从 dataset 中获取 vocabulary" ] }, { "cell_type": "markdown", "id": "0eb18a22", "metadata": {}, "source": [ "## 1. dataset 的基本使用\n", "\n", "### 1.1 dataset 的结构与创建\n", "\n", "在`fastNLP 0.8`中,使用`DataSet`模块表示数据集,**`dataset`类似于关系型数据库中的数据表**(下文统一为小写`dataset`)\n", "\n", "  **主要包含`field`字段和`instance`实例两个元素**,对应`table`中的`field`字段和`record`记录\n", "\n", "在`fastNLP 0.8`中,`DataSet`模块被定义在`fastNLP.core.dataset`路径下,导入该模块后,最简单的\n", "\n", "  初始化方法,即将字典形式的表格 **`{'field1': column1, 'field2': column2, ...}`** 传入构造函数" ] }, { "cell_type": "code", "execution_count": 1, "id": "a1d69ad2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
       "
\n" ], "text/plain": [ "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "+-----+------------------------+------------------------+-----+\n", "| idx | sentence | words | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n", "| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n", "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n", "+-----+------------------------+------------------------+-----+\n" ] } ], "source": [ "from fastNLP import DataSet\n", "\n", "data = {'idx': [0, 1, 2], \n", " 'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"],\n", " 'words': [['This', 'is', 'an', 'apple', '.'], \n", " ['I', 'like', 'apples', '.'], \n", " ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],\n", " 'num': [5, 4, 7]}\n", "\n", "dataset = DataSet(data)\n", "print(dataset)" ] }, { "cell_type": "markdown", "id": "9260fdc6", "metadata": {}, "source": [ "  在`dataset`的实例中,字段`field`的名称和实例`instance`中的字符串也可以中文" ] }, { "cell_type": "code", "execution_count": 2, "id": "3d72ef00", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------+--------------------+------------------------+------+\n", "| 序号 | 句子 | 字符 | 长度 |\n", "+------+--------------------+------------------------+------+\n", "| 0 | 生活就像海洋, | ['生', '活', '就', ... | 7 |\n", "| 1 | 只有意志坚强的人, | ['只', '有', '意', ... | 9 |\n", "| 2 | 才能到达彼岸。 | ['才', '能', '到', ... | 7 |\n", "+------+--------------------+------------------------+------+\n" ] } ], "source": [ "temp = {'序号': [0, 1, 2], \n", " '句子':[\"生活就像海洋,\", \"只有意志坚强的人,\", \"才能到达彼岸。\"],\n", " '字符': [['生', '活', '就', '像', '海', '洋', ','], \n", " ['只', '有', '意', '志', '坚', '强', '的', '人', ','], \n", " ['才', '能', '到', '达', '彼', '岸', '。']],\n", " '长度': [7, 9, 7]}\n", "\n", "chinese = DataSet(temp)\n", "print(chinese)" ] }, { "cell_type": "markdown", "id": "202e5490", "metadata": {}, "source": [ "在`dataset`中,使用`drop`方法可以删除满足条件的实例,这里使用了python中的`lambda`表达式\n", "\n", "  注一:在`drop`方法中,通过设置`inplace`参数将删除对应实例后的`dataset`作为一个新的实例生成" ] }, { "cell_type": "code", "execution_count": 3, "id": "09b478f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2492313174344 2491986424200\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence | words | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n", "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n", "+-----+------------------------+------------------------+-----+\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence | words | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n", "| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n", "| 2 | Apples are good for... | ['Apples', 'are', '... 
{ "cell_type": "code", "execution_count": 3, "id": "09b478f8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2492313174344 2491986424200\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence | words | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n", "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n", "+-----+------------------------+------------------------+-----+\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence | words | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n", "| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n", "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n", "+-----+------------------------+------------------------+-----+\n" ] } ], "source": [ "dropped = dataset\n", "dropped = dropped.drop(lambda ins:ins['num'] < 5, inplace=False)\n", "print(id(dropped), id(dataset))\n", "print(dropped)\n", "print(dataset)" ] }, { "cell_type": "markdown", "id": "aa277674", "metadata": {}, "source": [ " Note 2: **assigning an object with `=` passes a reference**, so `dropped = dataset` shares one object instead of copying it\n", "\n", "   as shown below, **`dropped` and `dataset` have the same `id`**, and **dropping from `dropped` (the default `inplace=True`) modifies `dataset` as well**" ] }, { "cell_type": "code", "execution_count": 4, "id": "77c8583a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2491986424200 2491986424200\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence | words | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n", "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n", "+-----+------------------------+------------------------+-----+\n", "+-----+------------------------+------------------------+-----+\n", "| idx | sentence | words | num |\n", "+-----+------------------------+------------------------+-----+\n", "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n", "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n", "+-----+------------------------+------------------------+-----+\n" ] } ], "source": [ "dropped = dataset\n", "dropped.drop(lambda ins:ins['num'] < 5)\n", "print(id(dropped), id(dataset))\n", "print(dropped)\n", "print(dataset)" ] }, { "cell_type": "markdown", "id": "a76199dc", "metadata": {}, "source": [ "In a `dataset`, the `delete_instance` method removes the `instance` at the given index, counting from 0" ] }, { "cell_type": "code", "execution_count": 5, "id": "d8824b40", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+--------------------+------------------------+-----+\n", "| idx | sentence | words | num |\n", "+-----+--------------------+------------------------+-----+\n", "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n", "| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n", "+-----+--------------------+------------------------+-----+\n" ] } ], "source": [ "dataset = DataSet(data)\n", "dataset.delete_instance(2)\n", "print(dataset)" ] }, { "cell_type": "markdown", "id": "f4fa9f33", "metadata": {}, "source": [ "In a `dataset`, the `delete_field` method removes the `field` with the given name" ] }, { "cell_type": "code", "execution_count": 6, "id": "f68ddb40", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+--------------------+------------------------------+\n", "| idx | sentence | words |\n", "+-----+--------------------+------------------------------+\n", "| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n", "| 1 | I like apples . | ['I', 'like', 'apples', '... |\n", "+-----+--------------------+------------------------------+\n" ] } ], "source": [ "dataset.delete_field('num')\n", "print(dataset)" ] }, { "cell_type": "markdown", "id": "b1e9d42c", "metadata": {}, "source": [ "### 1.2 Data preprocessing with a dataset\n", "\n", "In the `dataset` module, the `apply`, `apply_field`, `apply_more` and `apply_field_more` functions support simple preprocessing\n", "\n", " **`apply` and `apply_more` take a whole instance as input**, while **`apply_field` and `apply_field_more` take only selected fields**\n", "\n", " **`apply` and `apply_field` produce a single field**, whereas **`apply_more` and `apply_field_more` produce several fields**\n", "\n", " **`apply` and `apply_field` collect the returned values into a list**, while **`apply_more` and `apply_field_more` expect a dictionary**\n", "\n", "   during preprocessing, the `progress_bar` parameter selects the progress-bar style and `num_proc` enables multiprocessing\n", "***\n", "\n", "the parameters of `apply` are a function `func` and a new field name `new_field_name`; `func` processes each `instance`\n", "\n", " of the `dataset`, and its results are stored in the new field named `new_field_name`, as sketched below" ] },
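{ "cell_type": "markdown", "id": "7a5d1b3c", "metadata": {}, "source": [ " For instance, the two minimal sketches below rebuild the word-count field with `apply`, then try `apply_field` and `apply_more` on the same data (the field names `num`, `lower` and `first_word` are merely illustrative)" ] }, { "cell_type": "code", "execution_count": null, "id": "8c2e4f6a", "metadata": {}, "outputs": [], "source": [ "from fastNLP import DataSet\n", "\n", "# rebuild the dataset with the fields used in the rest of this section\n", "dataset = DataSet({'sentence': data['sentence'], 'words': data['words']})\n", "\n", "# apply: func receives each instance; its results fill the new field 'num'\n", "dataset.apply(lambda ins: len(ins['words']), new_field_name='num', progress_bar=\"tqdm\")\n", "print(dataset)" ] }, { "cell_type": "code", "execution_count": null, "id": "9d3f5a7b", "metadata": {}, "outputs": [], "source": [ "temp = DataSet({'sentence': data['sentence'], 'words': data['words']})\n", "\n", "# apply_field: func reads only the 'sentence' field and fills the new field 'lower'\n", "temp.apply_field(lambda sent: sent.lower(), field_name='sentence', new_field_name='lower')\n", "\n", "# apply_more: func returns a dict, so several fields are created at once\n", "temp.apply_more(lambda ins: {'num': len(ins['words']), 'first_word': ins['words'][0]})\n", "print(temp)" ] },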
|\n", "+-----+--------------------+------------------------------+\n" ] } ], "source": [ "dataset.delete_field('num')\n", "print(dataset)" ] }, { "cell_type": "markdown", "id": "b1e9d42c", "metadata": {}, "source": [ "### 1.2 dataset 的数据预处理\n", "\n", "在`dataset`模块中,`apply`、`apply_field`、`apply_more`和`apply_field_more`函数可以进行简单的数据预处理\n", "\n", "  **`apply`和`apply_more`输入整条实例**,**`apply_field`和`apply_field_more`仅输入实例的部分字段**\n", "\n", "  **`apply`和`apply_field`仅输出单个字段**,**`apply_more`和`apply_field_more`则是输出多个字段**\n", "\n", "  **`apply`和`apply_field`返回的是个列表**,**`apply_more`和`apply_field_more`返回的是个字典**\n", "\n", "    预处理过程中,通过`progress_bar`参数设置显示进度条类型,通过`num_proc`设置多进程\n", "***\n", "\n", "`apply`的参数包括一个函数`func`和一个新字段名`new_field_name`,函数`func`的处理对象是`dataset`模块中\n", "\n", "  的每个`instance`实例,函数`func`的处理结果存放在`new_field_name`对应的新建字段内" ] }, { "cell_type": "code", "execution_count": 7, "id": "72a0b5f9", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/3 [00:00,\n", " 'words': ,\n", " 'num': }" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.get_all_fields()" ] }, { "cell_type": "code", "execution_count": 16, "id": "5433815c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['num', 'sentence', 'words']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.get_field_names()" ] }, { "cell_type": "markdown", "id": "4964eeed", "metadata": {}, "source": [ "其他`dataset`的基本使用:通过`in`或者`has_field`方法可以判断`dataset`的是否包含某种字段\n", "\n", "  通过`rename_field`方法可以更改`dataset`中的字段名称;通过`concat`方法可以实现两个`dataset`中的拼接\n", "\n", "  通过`len`可以统计`dataset`中的实例数目;`dataset`的全部变量与函数可以通过`dir(dataset)`查询" ] }, { "cell_type": "code", "execution_count": 17, "id": "25ce5488", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3 False\n", "6 True\n", "+------------------------------+------------------------------+--------+\n", "| sentence | words | length |\n", "+------------------------------+------------------------------+--------+\n", "| This is an apple . | ['This', 'is', 'an', 'app... | 5 |\n", "| I like apples . | ['I', 'like', 'apples', '... | 4 |\n", "| Apples are good for our h... | ['Apples', 'are', 'good',... | 7 |\n", "| This is an apple . | ['This', 'is', 'an', 'app... | 5 |\n", "| I like apples . | ['I', 'like', 'apples', '... | 4 |\n", "| Apples are good for our h... | ['Apples', 'are', 'good',... | 7 |\n", "+------------------------------+------------------------------+--------+\n" ] } ], "source": [ "print(len(dataset), dataset.has_field('length')) \n", "if 'num' in dataset:\n", " dataset.rename_field('num', 'length')\n", "elif 'length' in dataset:\n", " dataset.rename_field('length', 'num')\n", "dataset.concat(dataset)\n", "print(len(dataset), dataset.has_field('length')) \n", "print(dataset) " ] }, { "cell_type": "markdown", "id": "e30a6cd7", "metadata": {}, "source": [ "## 2. 
{ "cell_type": "markdown", "id": "e30a6cd7", "metadata": {}, "source": [ "## 2. Structure and usage of vocabulary\n", "\n", "### 2.1 Creating and modifying a vocabulary\n", "\n", "In `fastNLP 0.8`, the `Vocabulary` module represents a vocabulary; **its core is a mapping from words to indices**\n", "\n", " it can be instantiated directly through the constructor, and the `word2idx` attribute exposes the dictionary behind the mapping\n", "\n", " **the default `padding` token is `<pad>`**, **with index 0**; **the `unknown` token is `<unk>`**, **with index 1**\n", "\n", " printing a `vocabulary` shows its word list, in which `padding` and `unknown` do not appear" ] }, { "cell_type": "code", "execution_count": 18, "id": "3515e096", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vocabulary([]...)\n", "{'<pad>': 0, '<unk>': 1}\n", "<pad> 0\n", "<unk> 1\n" ] } ], "source": [ "from fastNLP import Vocabulary\n", "\n", "vocab = Vocabulary()\n", "print(vocab)\n", "print(vocab.word2idx)\n", "print(vocab.padding, vocab.padding_idx)\n", "print(vocab.unknown, vocab.unknown_idx)" ] }, { "cell_type": "markdown", "id": "640be126", "metadata": {}, "source": [ "In a `vocabulary`, the `add_word` and `add_word_lst` methods add words one at a time or in batches\n", "\n", " `len` gives the vocabulary size, and the `word_count` attribute records how many times each word was added" ] }, { "cell_type": "code", "execution_count": 19, "id": "88c7472a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5 Counter({'生活': 1, '就像': 1, '海洋': 1})\n", "6 Counter({'生活': 1, '就像': 1, '海洋': 1, '只有': 1})\n", "6 {'<pad>': 0, '<unk>': 1, '生活': 2, '就像': 3, '海洋': 4, '只有': 5}\n" ] } ], "source": [ "vocab.add_word_lst(['生活', '就像', '海洋'])\n", "print(len(vocab), vocab.word_count)\n", "vocab.add_word('只有')\n", "print(len(vocab), vocab.word_count)\n", "print(len(vocab), vocab.word2idx)" ] }, { "cell_type": "markdown", "id": "f9ec8b28", "metadata": {}, "source": [ " **the `to_word` method returns the word at a given index**, and **the `to_index` method returns the index of a given word**\n", "\n", "   since indices 0 and 1 are taken, **newly added words are numbered from 2**, e.g. `'生活'` maps to 2\n", "\n", "   the `has_word` method tests whether a word is in the vocabulary; absent words are treated as `<unk>`" ] }, { "cell_type": "code", "execution_count": 20, "id": "3447acde", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<pad> 0\n", "<unk> 1\n", "生活 2\n", "彼岸 1 False\n" ] } ], "source": [ "print(vocab.to_word(0), vocab.to_index('<pad>'))\n", "print(vocab.to_word(1), vocab.to_index('<unk>'))\n", "print(vocab.to_word(2), vocab.to_index('生活'))\n", "print('彼岸', vocab.to_index('彼岸'), vocab.has_word('彼岸'))" ] }, { "cell_type": "markdown", "id": "b4e36850", "metadata": {}, "source": [ "**A `vocabulary` allows the same word to be added repeatedly**, and **`word_count` shows how many times each word was added**\n", "\n", " `<pad>` and `<unk>` are not counted there; all attributes and methods of a `vocabulary` can be listed with `dir(vocabulary)`\n", "\n", " Note: **with `add_word_lst`**, **the indices of existing words are not dynamically adjusted**; **adding words from a `dataset` behaves differently**" ] }, { "cell_type": "code", "execution_count": 21, "id": "490b101c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "生活 2\n", "彼岸 12 True\n", "13 Counter({'人': 4, '生活': 2, '就像': 2, '海洋': 2, '只有': 2, '意志': 1, '坚强的': 1, '才': 1, '能': 1, '到达': 1, '彼岸': 1})\n", "13 {'<pad>': 0, '<unk>': 1, '生活': 2, '就像': 3, '海洋': 4, '只有': 5, '人': 6, '意志': 7, '坚强的': 8, '才': 9, '能': 10, '到达': 11, '彼岸': 12}\n" ] } ], "source": [ "vocab.add_word_lst(['生活', '就像', '海洋', '只有', '意志', '坚强的', '人', '人', '人', '人', '才', '能', '到达', '彼岸'])\n", "print(vocab.to_word(2), vocab.to_index('生活'))\n", "print('彼岸', vocab.to_index('彼岸'), vocab.has_word('彼岸'))\n", "print(len(vocab), vocab.word_count)\n", "print(len(vocab), vocab.word2idx)" ] }, { "cell_type": "markdown", "id": "23e32a63", "metadata": {}, "source": [ "### 2.2 vocabulary and the OOV problem\n", "\n", "When a `vocabulary` is initialized, `unknown` and `padding` can each be set to `None` to disable them\n", "\n", " words are then numbered from 0, and an unseen word raises an error directly, i.e. it is out of vocabulary" ] },
"stdout", "output_type": "stream", "text": [ "{'positive': 0, 'negative': 1}\n", "ValueError: word `neutral` not in vocabulary\n" ] } ], "source": [ "vocab = Vocabulary(unknown=None, padding=None)\n", "\n", "vocab.add_word_lst(['positive', 'negative'])\n", "print(vocab.word2idx)\n", "\n", "try:\n", " print(vocab.to_index('neutral'))\n", "except ValueError:\n", " print(\"ValueError: word `neutral` not in vocabulary\")" ] }, { "cell_type": "markdown", "id": "618da6bd", "metadata": {}, "source": [ "  相应的,如果只指定其中的`unknown`,则编号会后移一个,同时遇到未知单词全部当做``" ] }, { "cell_type": "code", "execution_count": 23, "id": "432f74c1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'': 0, 'positive': 1, 'negative': 2}\n", "0 \n" ] } ], "source": [ "vocab = Vocabulary(unknown='', padding=None)\n", "\n", "vocab.add_word_lst(['positive', 'negative'])\n", "print(vocab.word2idx)\n", "\n", "print(vocab.to_index('neutral'), vocab.to_word(vocab.to_index('neutral')))" ] }, { "cell_type": "markdown", "id": "b6263f73", "metadata": {}, "source": [ "## 3 dataset 和 vocabulary 的组合使用\n", " \n", "### 3.1 从 dataframe 中加载 dataset\n", "\n", "以下通过 [NLP-beginner](https://github.com/FudanNLP/nlp-beginner) 实践一中 [Rotten Tomatoes 影评数据集](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews) 的部分训练数据组成`test4dataset.tsv`文件\n", "\n", "  介绍如何使用`dataset`、`vocabulary`简单加载并处理数据集,首先使用`pandas`模块,读取原始数据的`dataframe`" ] }, { "cell_type": "code", "execution_count": 24, "id": "3dbd985d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SentenceIdSentenceSentiment
01A series of escapades demonstrating the adage ...negative
12This quiet , introspective and entertaining in...positive
23Even fans of Ismail Merchant 's work , I suspe...negative
34A positively thrilling combination of ethnogra...neutral
45A comedy-drama of nearly epic proportions root...positive
56The Importance of Being Earnest , so thick wit...neutral
\n", "
" ], "text/plain": [ " SentenceId Sentence Sentiment\n", "0 1 A series of escapades demonstrating the adage ... negative\n", "1 2 This quiet , introspective and entertaining in... positive\n", "2 3 Even fans of Ismail Merchant 's work , I suspe... negative\n", "3 4 A positively thrilling combination of ethnogra... neutral\n", "4 5 A comedy-drama of nearly epic proportions root... positive\n", "5 6 The Importance of Being Earnest , so thick wit... neutral" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('./data/test4dataset.tsv', sep='\\t')\n", "df" ] }, { "cell_type": "markdown", "id": "919ab350", "metadata": {}, "source": [ "接着,通过`dataset`中的`from_pandas`方法填充数据集,并使用`apply_more`方法对文本进行分词操作" ] }, { "cell_type": "code", "execution_count": 25, "id": "4f634586", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/6 [00:00': 0, '': 1, 'a': 2, 'of': 3, ',': 4, 'the': 5, '.': 6, 'is': 7, 'and': 8, 'good': 9, 'for': 10, 'which': 11, 'this': 12, \"'s\": 13, 'series': 14, 'escapades': 15, 'demonstrating': 16, 'adage': 17, 'that': 18, 'what': 19, 'goose': 20, 'also': 21, 'gander': 22, 'some': 23, 'occasionally': 24, 'amuses': 25, 'but': 26, 'none': 27, 'amounts': 28, 'to': 29, 'much': 30, 'story': 31, 'quiet': 32, 'introspective': 33, 'entertaining': 34, 'independent': 35, 'worth': 36, 'seeking': 37, 'even': 38, 'fans': 39, 'ismail': 40, 'merchant': 41, 'work': 42, 'i': 43, 'suspect': 44, 'would': 45, 'have': 46, 'hard': 47, 'time': 48, 'sitting': 49, 'through': 50, 'one': 51, 'positively': 52, 'thrilling': 53, 'combination': 54, 'ethnography': 55, 'all': 56, 'intrigue': 57, 'betrayal': 58, 'deceit': 59, 'murder': 60, 'shakespearean': 61, 'tragedy': 62, 'or': 63, 'juicy': 64, 'soap': 65, 'opera': 66, 'comedy-drama': 67, 'nearly': 68, 'epic': 69, 'proportions': 70, 'rooted': 71, 'in': 72, 'sincere': 73, 'performance': 74, 'by': 75, 'title': 76, 'character': 77, 'undergoing': 78, 'midlife': 79, 'crisis': 80, 'importance': 81, 'being': 82, 'earnest': 83, 'so': 84, 'thick': 85, 'with': 86, 'wit': 87, 'it': 88, 'plays': 89, 'like': 90, 'reading': 91, 'from': 92, 'bartlett': 93, 'familiar': 94, 'quotations': 95} \n", "\n", "Vocabulary(['a', 'series', 'of', 'escapades', 'demonstrating']...)\n" ] } ], "source": [ "from fastNLP import Vocabulary\n", "\n", "vocab = Vocabulary()\n", "vocab = vocab.from_dataset(dataset, field_name='Sentence')\n", "print(vocab.word_count, '\\n')\n", "print(vocab.word2idx, '\\n')\n", "print(vocab)" ] }, { "cell_type": "markdown", "id": "f0857ccb", "metadata": {}, "source": [ "之后,**通过`vocabulary`的`index_dataset`方法**,**调整`dataset`中指定字段的元素**,**使用编号将之代替**\n", "\n", "  使用上述方法,可以将影评数据集中的单词序列转化为词编号序列,为接下来转化为词嵌入序列做准备" ] }, { "cell_type": "code", "execution_count": 28, "id": "2f9a04b2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------+------------------------------+-----------+\n", "| SentenceId | Sentence | Sentiment |\n", "+------------+------------------------------+-----------+\n", "| 1 | [2, 14, 3, 15, 16, 5, 17,... | negative |\n", "| 2 | [12, 32, 4, 33, 8, 34, 35... | positive |\n", "| 3 | [38, 39, 3, 40, 41, 13, 4... | negative |\n", "| 4 | [2, 52, 53, 54, 3, 55, 8,... | neutral |\n", "| 5 | [2, 67, 3, 68, 69, 70, 71... | positive |\n", "| 6 | [5, 81, 3, 82, 83, 4, 84,... 
{ "cell_type": "markdown", "id": "6b26b707", "metadata": {}, "source": [ "Finally, the same approach converts `negative`, `neutral` and `positive` in the `Sentiment` field of the `dataset` into numeric labels" ] }, { "cell_type": "code", "execution_count": 29, "id": "5f5eed18", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'negative': 0, 'positive': 1, 'neutral': 2}\n", "+------------+------------------------------+-----------+\n", "| SentenceId | Sentence | Sentiment |\n", "+------------+------------------------------+-----------+\n", "| 1 | [2, 14, 3, 15, 16, 5, 17,... | 0 |\n", "| 2 | [12, 32, 4, 33, 8, 34, 35... | 1 |\n", "| 3 | [38, 39, 3, 40, 41, 13, 4... | 0 |\n", "| 4 | [2, 52, 53, 54, 3, 55, 8,... | 2 |\n", "| 5 | [2, 67, 3, 68, 69, 70, 71... | 1 |\n", "| 6 | [5, 81, 3, 82, 83, 4, 84,... | 2 |\n", "+------------+------------------------------+-----------+\n" ] } ], "source": [ "target_vocab = Vocabulary(padding=None, unknown=None)\n", "\n", "target_vocab.from_dataset(dataset, field_name='Sentiment')\n", "print(target_vocab.word2idx)\n", "target_vocab.index_dataset(dataset, field_name='Sentiment')\n", "print(dataset)" ] }, { "cell_type": "markdown", "id": "eed7ea64", "metadata": {}, "source": [ "To wrap up: this chapter walked through the structure and preprocessing of `dataset`, the word-to-index mapping of `vocabulary`, and how the two combine to turn a raw corpus into index sequences" ] }, { "cell_type": "code", "execution_count": null, "id": "35b4f0f7", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }