fastNLP/tutorials/tutorial_1.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# 详细指南"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 数据读入"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'raw_sentence': A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . type=str,\n",
"'label': 1 type=str}"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from fastNLP.io import CSVLoader\n",
"\n",
"loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\\t')\n",
"dataset = loader.load(\"./sample_data/tutorial_sample_dataset.csv\")\n",
"dataset[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instance表示一个样本由一个或多个field属性特征组成每个field有名字和值。\n",
"\n",
"在初始化Instance时即可定义它包含的域使用 \"field_name=field_value\"的写法。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'raw_sentence': fake data type=str,\n",
"'label': 0 type=str}"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from fastNLP import Instance\n",
"\n",
"dataset.append(Instance(raw_sentence='fake data', label='0'))\n",
"dataset[-1]"
]
},
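{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instances can also be collected into a brand-new DataSet. Below is a minimal sketch (the toy data here is made up for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fastNLP import DataSet\n",
"\n",
"# a minimal sketch: build a DataSet from scratch out of Instances\n",
"ds = DataSet()\n",
"ds.append(Instance(raw_sentence='hello world', label='0'))\n",
"ds.append(Instance(raw_sentence='fake data', label='1'))\n",
"len(ds)"
]
},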
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 数据处理"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'raw_sentence': A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . type=str,\n",
"'label': 1 type=str,\n",
"'sentence': a series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . type=str,\n",
"'words': [4, 1, 6, 1, 1, 2, 1, 11, 153, 10, 28, 17, 2, 1, 10, 1, 28, 17, 2, 1, 5, 154, 6, 149, 1, 1, 23, 1, 6, 149, 1, 8, 30, 6, 4, 35, 3] type=list,\n",
"'target': 1 type=int}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from fastNLP import Vocabulary\n",
"\n",
"# 将所有字母转为小写, 并所有句子变成单词序列\n",
"dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')\n",
"dataset.apply_field(lambda x: x.split(), field_name='sentence', new_field_name='words')\n",
"\n",
"# 使用Vocabulary类统计单词并将单词序列转化为数字序列\n",
"vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')\n",
"vocab.index_dataset(dataset, field_name='words',new_field_name='words')\n",
"\n",
"# 将label转为整数\n",
"dataset.apply(lambda x: int(x['label']), new_field_name='target')\n",
"dataset[0]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'raw_sentence': A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . type=str,\n",
"'label': 1 type=str,\n",
"'sentence': a series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . type=str,\n",
"'words': [4, 1, 6, 1, 1, 2, 1, 11, 153, 10, 28, 17, 2, 1, 10, 1, 28, 17, 2, 1, 5, 154, 6, 149, 1, 1, 23, 1, 6, 149, 1, 8, 30, 6, 4, 35, 3] type=list,\n",
"'target': 1 type=int,\n",
"'seq_len': 37 type=int}\n"
]
}
],
"source": [
"# 增加长度信息\n",
"dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')\n",
"print(dataset[0])"
]
},
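{
"cell_type": "markdown",
"metadata": {},
"source": [
"The resulting Vocabulary maps words to indices and back. A quick check of the mapping, assuming Vocabulary's to_index/to_word methods:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a quick sketch, assuming Vocabulary exposes to_index() and to_word()\n",
"print(len(vocab))             # vocabulary size\n",
"print(vocab.to_index('the'))  # word -> index\n",
"print(vocab.to_word(4))       # index -> word"
]
},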
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 使用内置模块CNNText\n",
"设置为符合内置模块的名称"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CNNText(\n",
" (embed): Embedding(\n",
" 177, 50\n",
" (dropout): Dropout(p=0.0)\n",
" )\n",
" (conv_pool): ConvMaxpool(\n",
" (convs): ModuleList(\n",
" (0): Conv1d(50, 3, kernel_size=(3,), stride=(1,), padding=(2,))\n",
" (1): Conv1d(50, 4, kernel_size=(4,), stride=(1,), padding=(2,))\n",
" (2): Conv1d(50, 5, kernel_size=(5,), stride=(1,), padding=(2,))\n",
" )\n",
" )\n",
" (dropout): Dropout(p=0.1)\n",
" (fc): Linear(in_features=12, out_features=5, bias=True)\n",
")"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from fastNLP.models import CNNText\n",
"\n",
"model_cnn = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)\n",
"model_cnn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们在使用内置模块的时候,还应该使用应该注意把 field 设定成符合内置模型输入输出的名字。"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"words\n",
"seq_len\n",
"target\n"
]
}
],
"source": [
"from fastNLP import Const\n",
"\n",
"dataset.rename_field('words', Const.INPUT)\n",
"dataset.rename_field('seq_len', Const.INPUT_LEN)\n",
"dataset.rename_field('target', Const.TARGET)\n",
"\n",
"dataset.set_input(Const.INPUT, Const.INPUT_LEN)\n",
"dataset.set_target(Const.TARGET)\n",
"\n",
"print(Const.INPUT)\n",
"print(Const.INPUT_LEN)\n",
"print(Const.TARGET)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 分割训练集/验证集/测试集"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"(64, 7, 7)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_dev_data, test_data = dataset.split(0.1)\n",
"train_data, dev_data = train_dev_data.split(0.1)\n",
"len(train_data), len(dev_data), len(test_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 训练(model_cnn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### loss\n",
"训练模型需要提供一个损失函数\n",
"\n",
"下面提供了一个在分类问题中常用的交叉熵损失。注意它的**初始化参数**。\n",
"\n",
"pred参数对应的是模型的forward返回的dict的一个key的名字这里是\"output\"。\n",
"\n",
"target参数对应的是dataset作为标签的field的名字这里是\"label_seq\"。"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from fastNLP import CrossEntropyLoss\n",
"\n",
"# loss = CrossEntropyLoss()\n",
"# 等价于\n",
"loss = CrossEntropyLoss(pred=Const.OUTPUT, target=Const.TARGET)"
]
},
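{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Const members are just field-name strings, so the naming convention above can be checked directly (a small sanity check; it assumes Const.OUTPUT and Const.TARGET resolve to \"pred\" and \"target\"):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Const members are plain strings naming the standard fields;\n",
"# this check assumes Const.OUTPUT / Const.TARGET resolve to 'pred' / 'target'\n",
"print(Const.OUTPUT)\n",
"print(Const.TARGET)"
]
},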
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Metric\n",
"定义评价指标\n",
"\n",
"这里使用准确率。参数的“命名规则”跟上面类似。\n",
"\n",
"pred参数对应的是模型的predict方法返回的dict的一个key的名字这里是\"predict\"。\n",
"\n",
"target参数对应的是dataset作为标签的field的名字这里是\"label_seq\"。"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from fastNLP import AccuracyMetric\n",
"\n",
"# metrics=AccuracyMetric()\n",
"# 等价于\n",
"metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input fields after batch(if batch size is 2):\n",
"\twords: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 16]) \n",
"\tseq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) \n",
"target fields after batch(if batch size is 2):\n",
"\ttarget: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) \n",
"\n",
"training epochs started 2019-05-12-21-38-34\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=20), HTML(value='')), layout=Layout(display='…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.285714\n",
"\n",
"Evaluation at Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.428571\n",
"\n",
"Evaluation at Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.428571\n",
"\n",
"Evaluation at Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.428571\n",
"\n",
"Evaluation at Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.428571\n",
"\n",
"Evaluation at Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.428571\n",
"\n",
"Evaluation at Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.428571\n",
"\n",
"Evaluation at Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.857143\n",
"\n",
"Evaluation at Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.857143\n",
"\n",
"Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.857143\n",
"\n",
"\n",
"In Epoch:8/Step:16, got best dev performance:AccuracyMetric: acc=0.857143\n",
"Reloaded the best model.\n"
]
},
{
"data": {
"text/plain": [
"{'best_eval': {'AccuracyMetric': {'acc': 0.857143}},\n",
" 'best_epoch': 8,\n",
" 'best_step': 16,\n",
" 'seconds': 0.21}"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from fastNLP import Trainer\n",
"\n",
"trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data, loss=loss, metrics=metrics)\n",
"trainer.train()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 测试(model_cnn)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[tester] \n",
"AccuracyMetric: acc=0.857143\n"
]
},
{
"data": {
"text/plain": [
"{'AccuracyMetric': {'acc': 0.857143}}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from fastNLP import Tester\n",
"\n",
"tester = Tester(test_data, model_cnn, metrics=AccuracyMetric())\n",
"tester.test()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 编写自己的模型\n",
"\n",
"完全支持 pytorch 的模型,与 pytorch 唯一不同的是返回结果是一个字典,字典中至少需要包含 \"pred\" 这个字段"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"\n",
"class LSTMText(nn.Module):\n",
" def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):\n",
" super().__init__()\n",
"\n",
" self.embedding = nn.Embedding(vocab_size, embedding_dim)\n",
" self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True, dropout=dropout)\n",
" self.fc = nn.Linear(hidden_dim * 2, output_dim)\n",
" self.dropout = nn.Dropout(dropout)\n",
"\n",
" def forward(self, words):\n",
" # (input) words : (batch_size, seq_len)\n",
" words = words.permute(1,0)\n",
" # words : (seq_len, batch_size)\n",
"\n",
" embedded = self.dropout(self.embedding(words))\n",
" # embedded : (seq_len, batch_size, embedding_dim)\n",
" output, (hidden, cell) = self.lstm(embedded)\n",
" # output: (seq_len, batch_size, hidden_dim * 2)\n",
" # hidden: (num_layers * 2, batch_size, hidden_dim)\n",
" # cell: (num_layers * 2, batch_size, hidden_dim)\n",
"\n",
" hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)\n",
" hidden = self.dropout(hidden)\n",
" # hidden: (batch_size, hidden_dim * 2)\n",
"\n",
" pred = self.fc(hidden.squeeze(0))\n",
" # result: (batch_size, output_dim)\n",
" return {\"pred\":pred}"
]
},
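{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before handing a custom model to Trainer, it can help to sanity-check forward on a fake batch. A minimal sketch (the batch shape is made up for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a minimal sanity check on a fake batch (shapes chosen arbitrarily)\n",
"fake_words = torch.randint(low=1, high=len(vocab), size=(2, 16))  # (batch_size, seq_len)\n",
"out = LSTMText(len(vocab), 50, 5)(fake_words)\n",
"print(out['pred'].shape)  # expect torch.Size([2, 5])"
]
},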
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input fields after batch(if batch size is 2):\n",
"\twords: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 16]) \n",
"\tseq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) \n",
"target fields after batch(if batch size is 2):\n",
"\ttarget: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) \n",
"\n",
"training epochs started 2019-05-12-21-38-36\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=20), HTML(value='')), layout=Layout(display='…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.571429\n",
"\n",
"Evaluation at Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.571429\n",
"\n",
"Evaluation at Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.571429\n",
"\n",
"Evaluation at Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.571429\n",
"\n",
"Evaluation at Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.714286\n",
"\n",
"Evaluation at Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.857143\n",
"\n",
"Evaluation at Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.857143\n",
"\n",
"Evaluation at Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.857143\n",
"\n",
"Evaluation at Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.857143\n",
"\n",
"Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.857143\n",
"\n",
"\n",
"In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.857143\n",
"Reloaded the best model.\n"
]
},
{
"data": {
"text/plain": [
"{'best_eval': {'AccuracyMetric': {'acc': 0.857143}},\n",
" 'best_epoch': 6,\n",
" 'best_step': 12,\n",
" 'seconds': 2.15}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_lstm = LSTMText(len(vocab),50,5)\n",
"trainer = Trainer(model=model_lstm, train_data=train_data, dev_data=dev_data, loss=loss, metrics=metrics)\n",
"trainer.train()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[tester] \n",
"AccuracyMetric: acc=0.857143\n"
]
},
{
"data": {
"text/plain": [
"{'AccuracyMetric': {'acc': 0.857143}}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tester = Tester(test_data, model_lstm, metrics=AccuracyMetric())\n",
"tester.test()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 使用 Batch编写自己的训练过程"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 0 Avg Loss: 3.11 18ms\n",
"Epoch 1 Avg Loss: 2.88 30ms\n",
"Epoch 2 Avg Loss: 2.69 42ms\n",
"Epoch 3 Avg Loss: 2.47 54ms\n",
"Epoch 4 Avg Loss: 2.38 67ms\n",
"Epoch 5 Avg Loss: 2.10 78ms\n",
"Epoch 6 Avg Loss: 2.06 91ms\n",
"Epoch 7 Avg Loss: 1.92 103ms\n",
"Epoch 8 Avg Loss: 1.91 114ms\n",
"Epoch 9 Avg Loss: 1.76 126ms\n",
"[tester] \n",
"AccuracyMetric: acc=0.571429\n"
]
},
{
"data": {
"text/plain": [
"{'AccuracyMetric': {'acc': 0.571429}}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from fastNLP import BucketSampler\n",
"from fastNLP import Batch\n",
"import torch\n",
"import time\n",
"\n",
"model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)\n",
"\n",
"def train(epoch, data):\n",
" optim = torch.optim.Adam(model.parameters(), lr=0.001)\n",
" lossfunc = torch.nn.CrossEntropyLoss()\n",
" batch_size = 32\n",
"\n",
" # 定义一个Batch传入DataSet规定batch_size和去batch的规则。\n",
" # 顺序Sequential随机Random相似长度组成一个batchBucket\n",
" train_sampler = BucketSampler(batch_size=batch_size, seq_len_field_name='seq_len')\n",
" train_batch = Batch(batch_size=batch_size, dataset=data, sampler=train_sampler)\n",
" \n",
" start_time = time.time()\n",
" for i in range(epoch):\n",
" loss_list = []\n",
" for batch_x, batch_y in train_batch:\n",
" optim.zero_grad()\n",
" output = model(batch_x['words'])\n",
" loss = lossfunc(output['pred'], batch_y['target'])\n",
" loss.backward()\n",
" optim.step()\n",
" loss_list.append(loss.item())\n",
" print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=\" \")\n",
" print('{:d}ms'.format(round((time.time()-start_time)*1000)))\n",
" loss_list.clear()\n",
" \n",
"train(10, train_data)\n",
"tester = Tester(test_data, model, metrics=AccuracyMetric())\n",
"tester.test()"
]
},
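{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same Batch iteration works for evaluation. A sketch of a plain evaluation loop (it assumes SequentialSampler is importable from fastNLP like the other samplers):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fastNLP import SequentialSampler\n",
"\n",
"# a sketch: average cross-entropy loss over a dataset via a plain Batch loop\n",
"lossfunc = torch.nn.CrossEntropyLoss()\n",
"\n",
"def evaluate(data):\n",
"    model.eval()\n",
"    total, n_batches = 0.0, 0\n",
"    with torch.no_grad():\n",
"        for batch_x, batch_y in Batch(batch_size=32, dataset=data, sampler=SequentialSampler()):\n",
"            total += lossfunc(model(batch_x['words'])['pred'], batch_y['target']).item()\n",
"            n_batches += 1\n",
"    model.train()\n",
"    return total / n_batches\n",
"\n",
"evaluate(dev_data)"
]
},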
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 使用 Callback 实现自己想要的效果"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input fields after batch(if batch size is 2):\n",
"\twords: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 16]) \n",
"\tseq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) \n",
"target fields after batch(if batch size is 2):\n",
"\ttarget: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) \n",
"\n",
"training epochs started 2019-05-12-21-38-40\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(IntProgress(value=0, layout=Layout(flex='2'), max=20), HTML(value='')), layout=Layout(display='…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.285714\n",
"\n",
"Sum Time: 51ms\n",
"\n",
"\n",
"Evaluation at Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.285714\n",
"\n",
"Sum Time: 69ms\n",
"\n",
"\n",
"Evaluation at Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.285714\n",
"\n",
"Sum Time: 91ms\n",
"\n",
"\n",
"Evaluation at Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.571429\n",
"\n",
"Sum Time: 107ms\n",
"\n",
"\n",
"Evaluation at Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.571429\n",
"\n",
"Sum Time: 125ms\n",
"\n",
"\n",
"Evaluation at Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.571429\n",
"\n",
"Sum Time: 142ms\n",
"\n",
"\n",
"Evaluation at Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.571429\n",
"\n",
"Sum Time: 158ms\n",
"\n",
"\n",
"Evaluation at Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.571429\n",
"\n",
"Sum Time: 176ms\n",
"\n",
"\n",
"Evaluation at Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.714286\n",
"\n",
"Sum Time: 193ms\n",
"\n",
"\n",
"Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.857143\n",
"\n",
"Sum Time: 212ms\n",
"\n",
"\n",
"\n",
"In Epoch:10/Step:20, got best dev performance:AccuracyMetric: acc=0.857143\n",
"Reloaded the best model.\n"
]
},
{
"data": {
"text/plain": [
"{'best_eval': {'AccuracyMetric': {'acc': 0.857143}},\n",
" 'best_epoch': 10,\n",
" 'best_step': 20,\n",
" 'seconds': 0.2}"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from fastNLP import Callback\n",
"\n",
"start_time = time.time()\n",
"\n",
"class MyCallback(Callback):\n",
" def on_epoch_end(self):\n",
" print('Sum Time: {:d}ms\\n\\n'.format(round((time.time()-start_time)*1000)))\n",
" \n",
"\n",
"model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)\n",
"trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,\n",
" loss=CrossEntropyLoss(), metrics=AccuracyMetric(), callbacks=[MyCallback()])\n",
"trainer.train()"
]
},
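{
"cell_type": "markdown",
"metadata": {},
"source": [
"Callbacks can hook into other points of the training loop as well. A sketch of a per-epoch timer, assuming Callback also provides an on_epoch_begin hook symmetric to on_epoch_end:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a sketch, assuming Callback offers on_epoch_begin symmetric to on_epoch_end\n",
"class EpochTimer(Callback):\n",
"    def on_epoch_begin(self):\n",
"        self._epoch_start = time.time()\n",
"\n",
"    def on_epoch_end(self):\n",
"        print('Epoch Time: {:d}ms'.format(round((time.time() - self._epoch_start) * 1000)))\n",
"\n",
"# usage: pass callbacks=[EpochTimer()] to Trainer as above"
]
},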
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}