Merge pull request #6 from fastnlp/dev0.5.0

Dev0.5.0 pull request
Danqing Wang 2019-07-11 15:08:53 +08:00 committed by GitHub
commit 2610c20c23
94 changed files with 4192 additions and 1624 deletions


@ -6,50 +6,59 @@
![Hex.pm](https://img.shields.io/hexpm/l/plug.svg)
[![Documentation Status](https://readthedocs.org/projects/fastnlp/badge/?version=latest)](http://fastnlp.readthedocs.io/?badge=latest)
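fastNLP is a lightweight NLP toolkit. You can use it to quickly handle sequence labeling ([NER](reproduction/seqence_labelling/ner/), POS tagging, etc.), Chinese word segmentation, text classification, [Matching](reproduction/matching/), coreference resolution, summarization, and similar tasks; you can also use it to build complex network models for research. It has the following features:
fastNLP is a lightweight NLP toolkit. You can use it to quickly handle sequence labeling ([NER](reproduction/seqence_labelling/ner), POS tagging, etc.), Chinese word segmentation, [text classification](reproduction/text_classification), [Matching](reproduction/matching), [coreference resolution](reproduction/coreference_resolution), [summarization](reproduction/Summarization), and similar tasks; you can also use it to build complex network models for research. It has the following features:
- A unified tabular data container that keeps data preprocessing simple and clear, plus built-in DataSet Loaders for many datasets, so you can skip the preprocessing code;
- A variety of training and testing components, such as the Trainer, the Tester, and various evaluation metrics;
- Convenient NLP utilities, such as embedding loading for preprocessing (including EMLo and BERT) and caching of intermediate data;
- Detailed Chinese [documentation](https://fastnlp.readthedocs.io/) and tutorials for reference;
- Convenient NLP utilities, such as embedding loading for preprocessing (including ELMo and BERT) and caching of intermediate data;
- Detailed Chinese [documentation](https://fastnlp.readthedocs.io/) and [tutorials](https://fastnlp.readthedocs.io/zh/latest/user/tutorials.html) for reference;
- Many advanced modules, such as Variational LSTM, Transformer, and CRF;
- Ready-to-use models for sequence labeling, Chinese word segmentation, text classification, Matching, coreference resolution, summarization, and other tasks; [details](reproduction/)
- Ready-to-use models for sequence labeling, Chinese word segmentation, text classification, Matching, coreference resolution, summarization, and other tasks; see the [reproduction](reproduction) section for details;
- A convenient, extensible trainer with many built-in callback functions for experiment logging, exception handling, and more.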
## Installation guide
fastNLP depends on the packages below:
fastNLP depends on the following packages:
+ numpy>=1.14.2
+ torch>=1.0.0
+ tqdm>=4.28.1
+ nltk>=3.4.1
+ requests
+ spacy
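Installing torch may depend on your operating system and CUDA version; see the [PyTorch website](https://pytorch.org/).
Once the dependencies are installed, run the following commands on the command line to complete the installation: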
```shell
pip install fastNLP
python -m spacy download en
```
## Reference resources
## fastNLP tutorials
- [Documentation](https://fastnlp.readthedocs.io/zh/latest/)
- [Source code](https://github.com/fastnlp/fastNLP)
- [1. Preprocessing text with DataSet](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_1_data_preprocess.html)
- [2. Loading datasets with DataSetLoader](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_2_load_dataset.html)
- [3. Converting text to vectors with the Embedding module](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_3_embedding.html)
- [4. Building a text classifier I: fast training and testing with Trainer and Tester](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_4_loss_optimizer.html)
- [5. Building a text classifier II: a custom training loop with DataSetIter](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_5_datasetiter.html)
- [6. Quickly implementing a sequence labeling model](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_6_seq_labeling.html)
- [7. Quickly building custom models with Modules and Models](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_7_modules_models.html)
- [8. Quickly evaluating your model with Metric](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_8_metrics.html)
- [9. Customizing your training process with Callback](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_9_callback.html)
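## Built-in components
Most neural networks used for NLP tasks can be viewed as a combination of three kinds of modules: encoder, aggregator, and decoder.
Most neural networks used for NLP tasks can be viewed as a combination of two kinds of modules: encoder and decoder.
![](./docs/source/figures/text_classification.png)
fastNLP's modules package has many built-in components for these three kinds of modules, helping users quickly build the networks they need. The functions and common components of the three kinds of modules are as follows:
fastNLP's modules package has many built-in components for these two kinds of modules, helping users quickly build the networks they need. The functions and common components of the two kinds of modules are as follows: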
<table>
<tr>
@ -59,29 +68,17 @@ fastNLP 在 modules 模块中内置了三种模块的诸多组件,可以帮助
</tr>
<tr>
<td> encoder </td>
<td> Encodes the input into vectors with representational power </td>
<td> Encodes the input into vectors with representational power </td>
<td> embedding, RNN, CNN, transformer
</tr>
<tr>
<td> aggregator </td>
<td> Aggregates information from multiple vectors </td>
<td> self-attention, max-pooling </td>
</tr>
<tr>
<td> decoder </td>
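<td> Decodes vectors that carry some representation into the required output form </td>
<td> Decodes vectors that carry some representation into the required output form </td>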
<td> MLP, CRF </td>
</tr>
</table>
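## Complete models
fastNLP implements many complete models for different NLP tasks, all of which have been trained and tested.
You can find related information in the following two places:
- [Model introduction](reproduction/)
- [Model source code](fastNLP/models/)
## Project structure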
![](./docs/source/figures/workflow.png)


@ -19,6 +19,9 @@ apidoc:
server:
cd build/html && python -m http.server
dev:
rm -rf build/html && make html && make server
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new

41
docs/README.md Normal file

@ -0,0 +1,41 @@
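# Quick start: writing fastNLP documentation
This tutorial is written for fastNLP documentation writers, which include collaborating developers and documentation maintainers. In most cases you belong to the former group
and only need to understand part of the overall framework.
## Collaborating developers
fastNLP's documentation is generated by [Sphinx](http://sphinx.pocoo.org/), which is based on the
[reStructuredText markup language](http://docutils.sourceforge.net/rst.html), and is built and maintained automatically by [Read the Docs](https://readthedocs.org/).
Developers generally only need to write documentation that follows reStructuredText syntax and submit it through a [PR](https://help.github.com/en/articles/about-pull-requests)
to contribute to the fastNLP documentation.
If you want to build the documentation locally and write longer sections, you need to install Sphinx and the sphinx-rtd-theme theme: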
```bash
fastNLP/docs> pip install sphinx
fastNLP/docs> pip install sphinx-rtd-theme
```
Then run the `make dev` command in this directory. The command only supports Linux and macOS; you should see output like the following:
```bash
fastNLP/docs> make dev
rm -rf build/html && make html && make server
Running Sphinx v1.5.6
making output directory...
......
Build finished. The HTML pages are in build/html.
cd build/html && python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
```
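You can now open http://localhost:8000/ in your browser to view the documentation. If you are working on a remote server, the address is http://{server IP address}:8000/ ,
and you must make sure port 8000 on the server is open. If port 8000 on your computer or remote server is already in use, the program moves on to ports 8001, 8002, and so on.
When you are done, press Control(Ctrl) + C to stop the process.
We list the reStructuredText syntax most often used in the fastNLP documentation [here](./source/user/example.rst) (when viewing it on the web, use Raw mode as well);
reading it is a quick way to get started. Most of fastNLP's documentation is written in the code and extracted by Sphinx,
so you can also refer to this [unfinished article](./source/user/docs_in_code.rst) on the conventions for in-code documentation.
## Documentation maintainers
Documentation maintainers need to understand what every command in the Makefile does, and should know that the current documentation structure
was produced by manually modifying the output of sphinx-apidoc.
Maintainers should further improve the automation of the whole framework and make sure collaborating developers do not break the overall structure of the documentation project.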


@ -1,36 +0,0 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
set SPHINXPROJ=fastNLP
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
:end
popd


@ -1,2 +0,0 @@
# FastNLP Quick Tutorial


@ -24,9 +24,9 @@ copyright = '2018, xpqiu'
author = 'xpqiu'
# The short X.Y version
version = '0.4'
version = '0.4.5'
# The full version, including alpha/beta/rc tags
release = '0.4'
release = '0.4.5'
# -- General configuration ---------------------------------------------------


@ -1,7 +0,0 @@
fastNLP.modules.aggregator.attention
====================================
.. automodule:: fastNLP.modules.aggregator.attention
:members:
:undoc-members:
:show-inheritance:


@ -1,7 +0,0 @@
fastNLP.modules.aggregator.pooling
==================================
.. automodule:: fastNLP.modules.aggregator.pooling
:members:
:undoc-members:
:show-inheritance:


@ -1,17 +0,0 @@
fastNLP.modules.aggregator
==========================
.. automodule:: fastNLP.modules.aggregator
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. toctree::
:titlesonly:
fastNLP.modules.aggregator.attention
fastNLP.modules.aggregator.pooling


@ -12,6 +12,5 @@ fastNLP.modules
.. toctree::
:titlesonly:
fastNLP.modules.aggregator
fastNLP.modules.decoder
fastNLP.modules.encoder


@ -52,11 +52,9 @@ fastNLP 在 :mod:`~fastNLP.models` 模块中内置了如 :class:`~fastNLP.models
.. toctree::
:maxdepth: 1
Installation guide <user/installation>
Quick start <user/quickstart>
Detailed guide <user/tutorial_one>
Research guide <user/with_fitlog>
Comment syntax <user/example>
Installation guide </user/installation>
Quick start </user/quickstart>
Detailed guide </user/tutorials>
API documentation
-----------------


@ -1,6 +1,6 @@
=================
Research guide
=================
============================================
Using fitlog with fastNLP for research
============================================
This article describes how to combine fastNLP and fitlog for research.


@ -0,0 +1,156 @@
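========================================
Data format and preprocessing tutorial
========================================
:class:`~fastNLP.DataSet` is the container fastNLP uses to hold data. You can think of a DataSet as a table:
each row is a sample (called an :mod:`~fastNLP.core.instance` in fastNLP),
and each column is a feature (called a :mod:`~fastNLP.core.field` in fastNLP).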
.. csv-table::
:header: "sentence", "words", "seq_len"
"This is the first instance .", "[This, is, the, first, instance, .]", 6
"Second instance .", "[Second, instance, .]", 3
"Third instance .", "[Third, instance, .]", 3
"...", "[...]", "..."
The table above shows how a sample dataset is stored in a DataSet. Each row is an :class:`~fastNLP.Instance` object, and each column is a :class:`~fastNLP.FieldArray` object.
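To make the row and column view concrete, here is a minimal sketch of a column-oriented table in plain Python. `ToyTable` is a hypothetical stand-in written only for illustration; it is not fastNLP's `DataSet` and makes no claim about how fastNLP stores data internally.

```python
# Hypothetical stand-in for a column-oriented table; NOT fastNLP's DataSet.
class ToyTable:
    def __init__(self, columns):
        # Every column (a "field") must hold the same number of rows.
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("every column must have the same number of rows")
        self.columns = {name: list(values) for name, values in columns.items()}

    def __len__(self):
        # Number of rows ("instances").
        return len(next(iter(self.columns.values()), []))

    def row(self, index):
        # Assemble one row by picking the same position from every column.
        return {name: values[index] for name, values in self.columns.items()}


table = ToyTable({
    "sentence": ["This is the first instance .", "Second instance ."],
    "words": [["This", "is", "the", "first", "instance", "."],
              ["Second", "instance", "."]],
    "seq_len": [6, 3],
})
second_row = table.row(1)
```

Storing data by column is what makes per-field operations (renaming, padding, indexing a whole column) cheap, while a row is simply a slice across all columns.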
------------------------------
Building and deleting datasets
------------------------------
We build a dataset by passing in a dictionary, which is the most basic way to initialize a :class:`~fastNLP.DataSet`:
.. code-block:: python
from fastNLP import DataSet
data = {'sentence':["This is the first instance .", "Second instance .", "Third instance ."],
'words': [['this', 'is', 'the', 'first', 'instance', '.'], ['Second', 'instance', '.'], ['Third', 'instance', '.']],
'seq_len': [6, 3, 3]}
dataset = DataSet(data)
# The value for each key in the dict should be a list, and all the lists should have the same length
We can also use the :func:`~fastNLP.DataSet.append` method to add data to a dataset:
.. code-block:: python
from fastNLP import DataSet
from fastNLP import Instance
dataset = DataSet()
instance = Instance(sentence="This is the first instance",
words=['this', 'is', 'the', 'first', 'instance', '.'],
seq_len=6)
dataset.append(instance)
# You can keep appending, but each appended instance should have exactly the same fields as the earlier ones
In addition, we can build a dataset from a list of :class:`~fastNLP.Instance` objects:
.. code-block:: python
from fastNLP import DataSet
from fastNLP import Instance
dataset = DataSet([
Instance(sentence="This is the first instance",
words=['this', 'is', 'the', 'first', 'instance', '.'],
seq_len=6),
Instance(sentence="Second instance .",
words=['Second', 'instance', '.'],
seq_len=3)
])
After building the dataset, we can iterate over the contents of a :class:`~fastNLP.DataSet` with a `for` loop.
.. code-block:: python
for instance in dataset:
    pass  # do something
fastNLP also provides several ways to delete data: :func:`~fastNLP.DataSet.drop`, :func:`~fastNLP.DataSet.delete_instance`, and :func:`~fastNLP.DataSet.delete_field`.
.. code-block:: python
from fastNLP import DataSet
dataset = DataSet({'a': list(range(-5, 5))})
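# With inplace=False, the result is returned as a new DataSet and dataset itself is left unchanged
dropped_dataset = dataset.drop(lambda ins:ins['a']<0, inplace=False)
# Delete the instances that satisfy the condition from dataset
dataset.drop(lambda ins:ins['a']<0) # dataset now has fewer instances
# Delete the third instance
dataset.delete_instance(2)
# Delete the field named 'a'
dataset.delete_field('a')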
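------------------------------
Simple data preprocessing
------------------------------
Because fastNLP stores data by column, most preprocessing operations work on columns ( :mod:`~fastNLP.core.field` ).
First, we can check whether a :mod:`~fastNLP.core.field` with a given name exists, and rename it.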
.. code-block:: python
# Check whether a field named 'a' exists
dataset.has_field('a') # or ('a' in dataset)
# Rename the field named 'a' to 'b'
dataset.rename_field('a', 'b')
# Length of the DataSet
len(dataset)
Next, we can use :func:`~fastNLP.DataSet.apply` or :func:`~fastNLP.DataSet.apply_field` for data preprocessing.
Both methods take a function that operates on a single :mod:`~fastNLP.core.instance`
and automatically call it on every :mod:`~fastNLP.core.instance` in a :mod:`~fastNLP.core.field`, completing the operation over the whole dataset.
The function can be a lambda or a fully defined function. You can also use the ``new_field_name`` argument to name the :mod:`~fastNLP.core.field` that stores the processed data.
.. code-block:: python
from fastNLP import DataSet
data = {'sentence':["This is the first instance .", "Second instance .", "Third instance ."]}
dataset = DataSet(data)
# Split each sentence into words; see the DataSet.apply() method
dataset.apply(lambda ins: ins['sentence'].split(), new_field_name='words')
# Or use DataSet.apply_field()
dataset.apply_field(lambda sent:sent.split(), field_name='sentence', new_field_name='words')
# Besides lambdas, you can also define a function and pass it in
def get_words(instance):
    sentence = instance['sentence']
    words = sentence.split()
    return words
dataset.apply(get_words, new_field_name='words')
Besides processing datasets by hand, you can also use the various :class:`~fastNLP.io.base_loader.DataSetLoader` classes that fastNLP provides.
For details, see the tutorial :doc:`Loading datasets with DataSetLoader </tutorials/tutorial_2_load_dataset>`
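The difference between the two methods is only in what the callback receives. The sketch below mimics that contract on a plain list of dicts; `toy_apply` and `toy_apply_field` are hypothetical helpers, not fastNLP functions, and the real methods may differ in details such as return values.

```python
# Hypothetical helpers that mimic the callback contract described above.
rows = [
    {"sentence": "This is the first instance ."},
    {"sentence": "Second instance ."},
]


def toy_apply(rows, func, new_field_name):
    # func receives the whole row (the "instance").
    for row in rows:
        row[new_field_name] = func(row)


def toy_apply_field(rows, func, field_name, new_field_name):
    # func receives only the value of one field of the row.
    for row in rows:
        row[new_field_name] = func(row[field_name])


toy_apply(rows, lambda row: row["sentence"].split(), new_field_name="words")
toy_apply_field(rows, len, field_name="words", new_field_name="seq_len")
```

Choosing the field-level variant keeps the callback independent of field names, which is why a bare `len` is enough to compute the sequence length.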
------------------------------
DataSet and padding
------------------------------
In fastNLP, padding is bound to a :mod:`~fastNLP.core.field`, so different fields can be padded in different ways; in English tasks, for example, words and
characters usually need different padding. fastNLP handles this through subclasses of :class:`~fastNLP.Padder`.
By default every field uses :class:`~fastNLP.AutoPadder`.
You can set the Padder as shown below (if the padder is set to None, the field is not padded).
In most cases :class:`~fastNLP.AutoPadder` is all you need.
If neither :class:`~fastNLP.AutoPadder` nor :class:`~fastNLP.EngChar2DPadder` meets your needs,
you can also write your own :class:`~fastNLP.Padder`.
.. code-block:: python
from fastNLP import DataSet
from fastNLP import EngChar2DPadder
import random
dataset = DataSet()
max_chars, max_words, sent_num = 5, 10, 20
contents = [[
[random.randint(1, 27) for _ in range(random.randint(1, max_chars))]
for _ in range(random.randint(1, max_words))
] for _ in range(sent_num)]
# Pass the padder in when adding the field
dataset.add_field('chars', contents, padder=EngChar2DPadder())
# Or set it directly
dataset.set_padder('chars', EngChar2DPadder())
# You can also set the pad value
dataset.set_pad_val('chars', -1)
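As an illustration of why word-level and character-level fields need different padding, here is a minimal plain-Python sketch. `pad_sequences` and `pad_char_batch` are hypothetical helpers; they only show the shape logic and are not the implementation of `AutoPadder` or `EngChar2DPadder`.

```python
# Hypothetical helpers showing 1-D (word ids) and 2-D (char ids per word) padding.
def pad_sequences(sequences, pad_val=0):
    # Pad every sequence to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def pad_char_batch(batch, pad_val=0):
    # batch: list of sentences, each a list of words, each a list of char ids.
    max_words = max(len(sentence) for sentence in batch)
    max_chars = max(len(word) for sentence in batch for word in sentence)
    padded = []
    for sentence in batch:
        # Pad each word to max_chars, then add all-pad words up to max_words.
        words = [list(word) + [pad_val] * (max_chars - len(word)) for word in sentence]
        words += [[pad_val] * max_chars for _ in range(max_words - len(sentence))]
        padded.append(words)
    return padded


word_batch = pad_sequences([[4, 8, 15], [16]], pad_val=0)
char_batch = pad_char_batch([[[1, 2], [3]], [[4, 5, 6]]], pad_val=-1)
```

The character-level case has to pad along two axes at once, which is why a single generic padder is not always enough.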


@ -0,0 +1,193 @@
==============================
Dataset loading tutorial
==============================
This part is a tutorial on how to load datasets.
Contents:
- `Part I: 数据集信息`_
- `Part II: 数据集的使用方式`_
- `Part III: 不同数据类型的DataSetLoader`_
- `Part IV: DataSetLoader举例`_
- `Part V: fastNLP封装好的数据集加载器`_
----------------------------
Part I: 数据集信息
----------------------------
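In fastNLP, we use :class:`~fastNLP.io.base_loader.DataInfo` to store dataset information. The :class:`~fastNLP.io.base_loader.DataInfo`
class holds two important members: `datasets` and `vocabs`.
`datasets` is a dictionary whose `key` is a dataset name (such as `train`, `dev`, and `test`) and whose `value` is a :class:`~fastNLP.DataSet`.
`vocabs` is a dictionary whose `key` is a vocabulary name (for example, :attr:`fastNLP.Const.INPUT` names the vocabulary of the input text, and :attr:`fastNLP.Const.TARGET` names
the vocabulary of the gold labels) and whose `value` is the vocabulary itself (a :class:`~fastNLP.Vocabulary`).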
----------------------------
Part II: 数据集的使用方式
----------------------------
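In fastNLP, :class:`~fastNLP.io.base_loader.DataSetLoader` is the base class for loading datasets.
:class:`~fastNLP.io.base_loader.DataSetLoader` defines the API that every DataSetLoader needs; developers should subclass it to implement their own loaders.
A DataSetLoader for a given dataset should implement at least the following:
- the _load function: reads data from one data file into a :class:`~fastNLP.DataSet`
- the load function (the base-class method can be used): reads data from one or more data files into one or more :class:`~fastNLP.DataSet` objects
- the process function: reads data from one or more data files and processes it into a trainable :class:`~fastNLP.io.DataInfo`
**\*The process function may call the load function or the _load function.**
The :class:`~fastNLP.DataSet` returned by a DataSetLoader's _load or load function contains the text of the dataset, while in the
:class:`~fastNLP.io.DataInfo` returned by process, `datasets` holds content that has already been indexed and can be passed directly to
:class:`~fastNLP.Trainer`.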
--------------------------------------------------------
Part III: 不同数据类型的DataSetLoader
--------------------------------------------------------
:class:`~fastNLP.io.dataset_loader.CSVLoader`
Reads dataset files in CSV format. For example:
.. code-block:: python
data_set_loader = CSVLoader(
headers=('words', 'target'), sep='\t'
)
# Puts the first item of each line of the CSV file into the 'words' field and the second item into the 'target' field.
# Items are separated by '\t'
data_set = data_set_loader._load('path/to/your/file')
Sample dataset content ::
But it does not leave you with much . 1
You could hate it for the same reason . 1
The performances are an absolute joy . 4
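For readers who want to see the header-to-column mapping without fastNLP, the following sketch reads the same kind of file with the Python standard library. It assumes the sentence and label are separated by a tab character, as the `sep='\t'` argument suggests; `read_tsv` is a hypothetical helper, not part of `CSVLoader`.

```python
import csv
import io

# Sample content, assuming a tab between the sentence and its label.
raw = (
    "But it does not leave you with much .\t1\n"
    "You could hate it for the same reason .\t1\n"
    "The performances are an absolute joy .\t4\n"
)


def read_tsv(handle, headers=("words", "target"), sep="\t"):
    # Map the i-th item of every line to the i-th header name.
    reader = csv.reader(handle, delimiter=sep, quoting=csv.QUOTE_NONE)
    return [dict(zip(headers, row)) for row in reader if row]


records = read_tsv(io.StringIO(raw))
```

Note that the labels come back as strings here; converting them to integers is a separate preprocessing step.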
:class:`~fastNLP.io.dataset_loader.JsonLoader`
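Reads dataset files in JSON format. The data must be stored line by line, and each line is a JSON object containing the various attributes. For example: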
.. code-block:: python
data_set_loader = JsonLoader(
fields={'sentence1': 'words1', 'sentence2': 'words2', 'gold_label': 'target'}
)
# Assigns the values of 'sentence1', 'sentence2', and 'gold_label' in each JSON object to the fields 'words1', 'words2', and 'target'
data_set = data_set_loader._load('path/to/your/file')
Sample dataset content ::
{"annotator_labels": ["neutral"], "captionID": "3416050480.jpg#4", "gold_label": "neutral", "pairID": "3416050480.jpg#4r1n", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is training his horse for a competition.", "sentence2_binary_parse": "( ( A person ) ( ( is ( ( training ( his horse ) ) ( for ( a competition ) ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (VP (VBG training) (NP (PRP$ his) (NN horse)) (PP (IN for) (NP (DT a) (NN competition))))) (. .)))"}
{"annotator_labels": ["contradiction"], "captionID": "3416050480.jpg#4", "gold_label": "contradiction", "pairID": "3416050480.jpg#4r1c", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is at a diner, ordering an omelette.", "sentence2_binary_parse": "( ( A person ) ( ( ( ( is ( at ( a diner ) ) ) , ) ( ordering ( an omelette ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (PP (IN at) (NP (DT a) (NN diner))) (, ,) (S (VP (VBG ordering) (NP (DT an) (NN omelette))))) (. .)))"}
{"annotator_labels": ["entailment"], "captionID": "3416050480.jpg#4", "gold_label": "entailment", "pairID": "3416050480.jpg#4r1e", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is outdoors, on a horse.", "sentence2_binary_parse": "( ( A person ) ( ( ( ( is outdoors ) , ) ( on ( a horse ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (ADVP (RB outdoors)) (, ,) (PP (IN on) (NP (DT a) (NN horse)))) (. .)))"}
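The field mapping can be illustrated with the standard `json` module. The sketch below is not `JsonLoader`; `read_json_lines` is a hypothetical helper that keeps only the keys named in `fields` and renames them, using shortened versions of the sample lines above.

```python
import io
import json

# Two shortened sample lines in the one-JSON-object-per-line layout.
raw = (
    '{"gold_label": "neutral", "pairID": "3416050480.jpg#4r1n", '
    '"sentence1": "A person on a horse jumps over a broken down airplane.", '
    '"sentence2": "A person is training his horse for a competition."}\n'
    '{"gold_label": "entailment", "pairID": "3416050480.jpg#4r1e", '
    '"sentence1": "A person on a horse jumps over a broken down airplane.", '
    '"sentence2": "A person is outdoors, on a horse."}\n'
)


def read_json_lines(handle, fields):
    # fields maps a key in the JSON object to the field name to store it under.
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        records.append({new_name: obj[old_name] for old_name, new_name in fields.items()})
    return records


records = read_json_lines(
    io.StringIO(raw),
    fields={"sentence1": "words1", "sentence2": "words2", "gold_label": "target"},
)
```

Keys that are not listed in `fields`, such as `pairID`, are simply dropped.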
------------------------------------------
Part IV: DataSetLoader举例
------------------------------------------
Taking the Matching task as an example:
:class:`~fastNLP.io.data_loader.matching.MatchingLoader`
fastNLP wraps a data loading class for Matching-task datasets: :class:`~fastNLP.io.data_loader.matching.MatchingLoader` .
The MatchingLoader class provides a function that further preprocesses the text of the dataset:
:meth:`~fastNLP.io.data_loader.matching.MatchingLoader.process`
This function has various preprocessing options, such as:
- whether to convert the text to lowercase
- whether sequence length information is needed, and which kind
- whether to use BertTokenizer to obtain WordPiece information for the sequence
- and so on
For details, see :meth:`fastNLP.io.MatchingLoader.process`
:class:`~fastNLP.io.data_loader.matching.SNLILoader`
A DataSetLoader for the SNLI dataset. The SNLI dataset comes from
`SNLI Data Set <https://nlp.stanford.edu/projects/snli/snli_1.0.zip>`_ .
In the :meth:`~fastNLP.io.data_loader.matching.SNLILoader._load` function of :class:`~fastNLP.io.data_loader.matching.SNLILoader`,
the following code reads the dataset from a text file into memory:
.. code-block:: python
def _load(self, path):
    ds = JsonLoader._load(self, path) # The raw SNLI files are in JSON format, so JsonLoader can read them
    parentheses_table = str.maketrans({'(': None, ')': None})
    # Translation table: the SNLI text uses parentheses to mark the tree structure,
    # so we strip those parentheses.
    ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
             new_field_name=Const.INPUTS(0))
    # Apply the translation table to the first sentence and split it into a list of words
    ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
             new_field_name=Const.INPUTS(1))
    # Preprocess the second sentence in the same way
    ds.drop(lambda x: x[Const.TARGET] == '-') # Drop samples whose label is '-'
    return ds
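The parenthesis-stripping step can be tried on its own with the standard library. The sketch below applies the same `str.maketrans` table to one `sentence1_binary_parse`-style string taken from the sample data; it is a standalone illustration and does not touch any fastNLP class.

```python
# Mapping a character to None in str.maketrans deletes it during translate().
parentheses_table = str.maketrans({"(": None, ")": None})

binary_parse = "( ( A person ) ( ( ( ( is outdoors ) , ) ( on ( a horse ) ) ) . ) )"

# Remove the parentheses, trim the ends, and split on runs of whitespace.
tokens = binary_parse.translate(parentheses_table).strip().split()
```

Because `split()` with no argument collapses repeated whitespace, the extra spaces left behind by the removed parentheses do not produce empty tokens.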
------------------------------------------
Part V: fastNLP封装好的数据集加载器
------------------------------------------
The dataset loaders packaged with fastNLP cover many types of tasks:
- `文本分类任务`_
- `序列标注任务`_
- `Matching任务`_
- `指代消解任务`_
- `摘要任务`_
文本分类任务
-------------------
文本分类任务
序列标注任务
-------------------
序列标注任务
Matching任务
-------------------
:class:`~fastNLP.io.data_loader.matching.SNLILoader`
A DataSetLoader for the SNLI dataset. The SNLI dataset comes from
`SNLI Data Set <https://nlp.stanford.edu/projects/snli/snli_1.0.zip>`_ .
:class:`~fastNLP.io.data_loader.matching.MNLILoader`
A DataSetLoader for the MultiNLI dataset. The MultiNLI dataset comes from the `GLUE benchmark <https://gluebenchmark.com/tasks>`_
:class:`~fastNLP.io.data_loader.matching.QNLILoader`
A DataSetLoader for the QNLI dataset. The QNLI dataset comes from the `GLUE benchmark <https://gluebenchmark.com/tasks>`_
:class:`~fastNLP.io.data_loader.matching.RTELoader`
A DataSetLoader for the Recognizing Textual Entailment (RTE) dataset. The RTE dataset comes from the
`GLUE benchmark <https://gluebenchmark.com/tasks>`_
:class:`~fastNLP.io.data_loader.matching.QuoraLoader`
A DataSetLoader for the Quora dataset.
指代消解任务
-------------------
指代消解任务
摘要任务
-------------------
摘要任务


@ -0,0 +1,214 @@
============================================================
Converting text to vectors with the Embedding module
============================================================
This part is a tutorial on using embeddings in fastNLP.
Contents:
- `Part I: embedding介绍`_
- `Part II: 使用随机初始化的embedding`_
- `Part III: 使用预训练的静态embedding`_
- `Part IV: 使用预训练的Contextual Embedding(ELMo & BERT)`_
- `Part V: 使用character-level的embedding`_
- `Part VI: 叠加使用多个embedding`_
---------------------------------------
Part I: embedding介绍
---------------------------------------
Like torch.nn.Embedding, a fastNLP embedding takes an indexed sequence as input and outputs the embedding of that sequence.
fastNLP embeddings include both pretrained embeddings and randomly initialized embeddings.
---------------------------------------
Part II: 使用随机初始化的embedding
---------------------------------------
For randomly initialized embeddings, see :class:`~fastNLP.modules.encoder.embedding.Embedding` .
You can pass in the vocabulary size and the embedding dimension:
.. code-block:: python
embed = Embedding(10000, 50)
You can also pass in an initial parameter matrix:
.. code-block:: python
embed = Embedding(init_embed)
Here init_embed can be a torch.FloatTensor, a torch.nn.Embedding, or a numpy.ndarray.
---------------------------------------
Part III: 使用预训练的静态embedding
---------------------------------------
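Before using a pretrained embedding, you need to build a vocabulary ( :class:`~fastNLP.core.vocabulary.Vocabulary` ) from the dataset, and
pass this vocabulary in as an argument when initializing the pretrained embedding class.
fastNLP provides the :class:`~fastNLP.modules.encoder.embedding.StaticEmbedding` class for this.
With :class:`~fastNLP.modules.encoder.embedding.StaticEmbedding` you can load pretrained static
embeddings. For example: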
.. code-block:: python
embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)
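vocab is the vocabulary built from the dataset, and model_dir_or_name can be either a path or the name of an embedding model:
1. If a path is passed in, fastNLP reads the pretrained weight file from that path and loads the embedding (both glove
and word2vec weight files are supported).
2. If a model name is passed in, fastNLP looks up the embedding model by name. If the model is found in the cache directory, it is
loaded automatically; otherwise it is downloaded automatically. You can customize the cache directory with the ``FASTNLP_CACHE_DIR`` environment variable, for example::
$ FASTNLP_CACHE_DIR=~/fastnlp_cache_dir python your_python_file.py
With this command, fastNLP looks for the model under `~/fastnlp_cache_dir` and, if it is not found, automatically downloads the model into that directory.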
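The path-or-name decision can be sketched as follows. This is only an illustration of the lookup order described above: `resolve_model_location` is a hypothetical function, the default cache directory `~/.fastNLP` is an assumption made for the example, and the download step is deliberately left out.

```python
import os


def resolve_model_location(model_dir_or_name, default_cache="~/.fastNLP"):
    """Hypothetical helper: decide whether the argument is a path or a model name.

    default_cache is an assumed value for illustration only.
    """
    expanded = os.path.expanduser(model_dir_or_name)
    if os.path.exists(expanded):
        # Case 1: an existing path was passed in, so read weights from it.
        return "path", expanded
    # Case 2: treat the argument as a model name and look in the cache directory,
    # which the FASTNLP_CACHE_DIR environment variable can override.
    cache_dir = os.path.expanduser(os.environ.get("FASTNLP_CACHE_DIR", default_cache))
    return "name", os.path.join(cache_dir, model_dir_or_name)
```

For example, with `FASTNLP_CACHE_DIR=~/fastnlp_cache_dir` set, `resolve_model_location('en-glove-6b-50')` would point inside that directory as long as no local path of that name exists.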
The static embedding models currently supported are:
========================== ================================
Model name                 Model
-------------------------- --------------------------------
en                         glove.840B.300d
-------------------------- --------------------------------
en-glove-840d-300          glove.840B.300d
-------------------------- --------------------------------
en-glove-6b-50             glove.6B.50d
-------------------------- --------------------------------
en-word2vec-300            Google word2vec, 300-dim
-------------------------- --------------------------------
en-fasttext                English fasttext, 300-dim
-------------------------- --------------------------------
cn                         Tencent Chinese word vectors, 200-dim
-------------------------- --------------------------------
cn-fasttext                Chinese fasttext, 300-dim
========================== ================================
-----------------------------------------------------------
Part IV: 使用预训练的Contextual Embedding(ELMo & BERT)
-----------------------------------------------------------
fastNLP provides ELMo and BERT embeddings: :class:`~fastNLP.modules.encoder.embedding.ElmoEmbedding`
and :class:`~fastNLP.modules.encoder.embedding.BertEmbedding` .
As with static embeddings, ELMo is used as follows:
.. code-block:: python
embed = ElmoEmbedding(vocab, model_dir_or_name='small', requires_grad=False)
The ElmoEmbedding models currently supported are:
========================== ================================
Model name                 Model
-------------------------- --------------------------------
small                      allennlp ELMo small
-------------------------- --------------------------------
medium                     allennlp ELMo medium
-------------------------- --------------------------------
original                   allennlp ELMo original
-------------------------- --------------------------------
5.5b-original              allennlp ELMo 5.5B original
========================== ================================
BertEmbedding is used as follows:
.. code-block:: python
embed = BertEmbedding(
vocab, model_dir_or_name='en-base-cased', requires_grad=False, layers='4,-2,-1'
)
Here the layers argument specifies which layers' encoded outputs to take.
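A comma-separated string such as `'4,-2,-1'` mixes a positive index with negative, count-from-the-end indices. The sketch below shows one way such a string could be interpreted on a plain list; `select_layers` and the 12 placeholder outputs are hypothetical, and how fastNLP itself numbers BERT layers is not covered here.

```python
def select_layers(layer_outputs, layers="4,-2,-1"):
    # Parse the comma-separated indices; negative values count from the end.
    indices = [int(part) for part in layers.split(",")]
    return [layer_outputs[index] for index in indices]


# Placeholder outputs for a hypothetical 12-layer encoder.
layer_outputs = [f"layer_{i}" for i in range(12)]
selected = select_layers(layer_outputs)
```

Negative indices make it possible to ask for "the last two layers" without knowing how deep the model is.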
The BertEmbedding models currently supported are:
========================== ====================================
Model name                 Model
-------------------------- ------------------------------------
en                         bert-base-cased
-------------------------- ------------------------------------
en-base-uncased            bert-base-uncased
-------------------------- ------------------------------------
en-base-cased              bert-base-cased
-------------------------- ------------------------------------
en-large-uncased           bert-large-uncased
-------------------------- ------------------------------------
en-large-cased             bert-large-cased
-------------------------- ------------------------------------
en-large-cased-wwm         bert-large-cased-whole-word-mask
-------------------------- ------------------------------------
en-large-uncased-wwm       bert-large-uncased-whole-word-mask
-------------------------- ------------------------------------
en-base-cased-mrpc         bert-base-cased-finetuned-mrpc
-------------------------- ------------------------------------
multilingual               bert-base-multilingual-cased
-------------------------- ------------------------------------
multilingual-base-uncased  bert-base-multilingual-uncased
-------------------------- ------------------------------------
multilingual-base-cased    bert-base-multilingual-cased
========================== ====================================
-----------------------------------------------------
Part V: 使用character-level的embedding
-----------------------------------------------------
Besides pretrained embeddings, fastNLP also provides character embeddings: :class:`~fastNLP.modules.encoder.embedding.CNNCharEmbedding`
and :class:`~fastNLP.modules.encoder.embedding.LSTMCharEmbedding` .
CNNCharEmbedding is used as follows:
.. code-block:: python
embed = CNNCharEmbedding(vocab, embed_size=100, char_emb_size=50)
This means that in this CNNCharEmbedding the character embedding dimension is 50 and the returned embedding dimension is 100.
Similarly, LSTMCharEmbedding is used as follows:
.. code-block:: python
embed = LSTMCharEmbedding(vocab, embed_size=100, char_emb_size=50)
This means that in this LSTMCharEmbedding the character embedding dimension is 50 and the returned embedding dimension is 100.
-----------------------------------------------------
Part VI: 叠加使用多个embedding
-----------------------------------------------------
In fastNLP, we use :class:`~fastNLP.modules.encoder.embedding.StackEmbedding` to stack multiple embeddings.
For example:
.. code-block:: python
embed_1 = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)
embed_2 = StaticEmbedding(vocab, model_dir_or_name='en-word2vec-300', requires_grad=True)
stack_embed = StackEmbedding([embed_1, embed_2])
StackEmbedding concatenates the outputs of multiple embeddings; stack_embed in the example above returns embeddings with 350 dimensions.
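The 350 comes from simple concatenation along the feature axis: 50 dimensions from the GloVe vectors plus 300 from word2vec. A minimal sketch with plain lists, where `stack` is a hypothetical helper rather than the StackEmbedding implementation:

```python
def stack(*vectors):
    # Concatenate per-token vectors end to end along the feature axis.
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


glove_vector = [0.0] * 50      # stands in for one 50-dim GloVe vector
word2vec_vector = [0.0] * 300  # stands in for one 300-dim word2vec vector
stacked_vector = stack(glove_vector, word2vec_vector)
```

Any layer that consumes the stacked output therefore has to be sized for the sum of the individual embedding dimensions.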
You can also concatenate a static embedding with a contextual embedding:
.. code-block:: python
elmo_embedding = ElmoEmbedding(vocab, model_dir_or_name='medium', layers='0,1,2', requires_grad=False)
glove_embedding = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)
stack_embed = StackEmbedding([elmo_embedding, glove_embedding])


@ -0,0 +1,266 @@
==============================================================================
Loss and optimizer tutorial: a text classification example
==============================================================================
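We use the same task as in :doc:`/user/quickstart` for a detailed walkthrough: given a piece of review text, predict whether its sentiment is positive (label=1), negative (label=0), or neutral (label=2). We use :class:`~fastNLP.Trainer` and :class:`~fastNLP.Tester` for fast training and testing. Everything before the loss function section is identical to :doc:`/tutorials/tutorial_5_datasetiter`, so you can skip it if you have already read that tutorial.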
--------------------
Data processing
--------------------
Reading the data
We can use :class:`~fastNLP.io.SSTLoader` from fastNLP's :mod:`fastNLP.io` module to read the SST dataset easily (data source: https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip).
The dataset here is an object of fastNLP's :class:`~fastNLP.DataSet` class.
.. code-block:: python
from fastNLP.io import SSTLoader
loader = SSTLoader()
# Here all.txt is the combination of train.txt, dev.txt, and test.txt from the downloaded data
dataset = loader.load("./trainDevTestTrees_PTB/trees/all.txt")
print(dataset[0])
The output is::
{'words': ['It', "'s", 'a', 'lovely', 'film', 'with', 'lovely', 'performances', 'by', 'Buy', 'and', 'Accorsi', '.'] type=list,
'target': positive type=str}
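Besides reading datasets, fastNLP also provides Loader classes for other file types, Loaders for embeddings, and more. See :doc:`/fastNLP.io` .
Processing the data
We use the :meth:`~fastNLP.DataSet.apply` method of the :class:`~fastNLP.DataSet` class to convert the ``target`` :mod:`~fastNLP.core.field` into integers.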
.. code-block:: python
def label_to_int(x):
    if x['target']=="positive":
        return 1
    elif x['target']=="negative":
        return 0
    else:
        return 2
# Convert the labels to integers
dataset.apply(lambda x: label_to_int(x), new_field_name='target')
``words`` and ``target`` are already enough to train :class:`~fastNLP.models.CNNText`, but the documentation of
:class:`~fastNLP.models.CNNText` shows that :meth:`~fastNLP.models.CNNText.forward` also accepts an optional ``seq_len`` argument.
So we use the :meth:`~fastNLP.DataSet.apply_field` method to add a :mod:`~fastNLP.core.field` named ``seq_len``.
.. code-block:: python
# Add length information
dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')
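Note that :meth:`~fastNLP.DataSet.apply_field` is similar to :meth:`~fastNLP.DataSet.apply`,
but the `lambda` function passed to it operates on a single :mod:`~fastNLP.core.field` of an :class:`~fastNLP.Instance`,
whereas the `lambda` function passed to :meth:`~fastNLP.DataSet.apply` operates on the whole :class:`~fastNLP.Instance`.
.. note::
A `lambda` function is an anonymous function, an important feature of Python. ``lambda x: len(x)`` does the same thing as the following function::
def func_lambda(x):
    return len(x)
You can also write more complex functions and pass them to :meth:`~fastNLP.DataSet.apply_field` and :meth:`~fastNLP.DataSet.apply`.
Using Vocabulary
We then use the :class:`~fastNLP.Vocabulary` class to count the words that appear in the data, and use :meth:`~fastNLP.Vocabulary.index_dataset`
to convert the word sequences into number sequences that can be used for training.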
.. code-block:: python
from fastNLP import Vocabulary
# Use the Vocabulary class to count words and convert the word sequences into number sequences
vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
vocab.index_dataset(dataset, field_name='words',new_field_name='words')
print(dataset[0])
The output is::
{'words': [27, 9, 6, 913, 16, 18, 913, 124, 31, 5715, 5, 1, 2] type=list,
'target': 1 type=int,
'seq_len': 13 type=int}
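In the output above, the rare word 'Accorsi' is mapped to index 1, which is consistent with a vocabulary that reserves one index for unknown words and drops words seen fewer than `min_freq` times. The sketch below reproduces that behaviour in plain Python; `build_vocab`, `to_indices`, and the `<pad>`/`<unk>` token names are assumptions made for illustration, not the `Vocabulary` implementation.

```python
from collections import Counter


def build_vocab(sentences, min_freq=1, padding="<pad>", unknown="<unk>"):
    # Reserve index 0 for padding and index 1 for unknown words (assumed layout).
    counts = Counter(word for sentence in sentences for word in sentence)
    word2idx = {padding: 0, unknown: 1}
    for word, freq in counts.most_common():
        if freq >= min_freq:
            word2idx[word] = len(word2idx)
    return word2idx


def to_indices(sentence, word2idx, unknown="<unk>"):
    # Words missing from the vocabulary fall back to the unknown index.
    return [word2idx.get(word, word2idx[unknown]) for word in sentence]


sentences = [
    ["a", "lovely", "film", "."],
    ["a", "lovely", "cast", "."],
    ["a", "film", "."],
]
word2idx = build_vocab(sentences, min_freq=2)
indexed = to_indices(["a", "lovely", "cast", "."], word2idx)
```

Here "cast" appears only once, so with `min_freq=2` it is left out of the vocabulary and indexed as the unknown word.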
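------------------------------
Training with a built-in model
------------------------------
Input and output naming for built-in models
fastNLP ships with several complete neural network models (see :doc:`/fastNLP.models` ); we train with the :class:`~fastNLP.models.CNNText` model.
To use the built-in :class:`~fastNLP.models.CNNText`, we must rename the :mod:`~fastNLP.core.field` entries of the :class:`~fastNLP.DataSet`.
In this example the model inputs (the arguments of the forward method) are ``words`` and ``seq_len``, the prediction output is ``pred``, and the gold label is ``target``.
See :doc:`/fastNLP.core.const` for the naming conventions.
If you would rather not consult the documentation, you can also use the :class:`~fastNLP.Const` class for naming. The code below shows the :meth:`~fastNLP.DataSet.rename_field` method, which renames a
:mod:`~fastNLP.core.field` of a :class:`~fastNLP.DataSet`, and how to use the :class:`~fastNLP.Const` class.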
.. code-block:: python
from fastNLP import Const
dataset.rename_field('words', Const.INPUT)
dataset.rename_field('seq_len', Const.INPUT_LEN)
dataset.rename_field('target', Const.TARGET)
print(Const.INPUT)
print(Const.INPUT_LEN)
print(Const.TARGET)
print(Const.OUTPUT)
The output is::
words
seq_len
target
pred
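After renaming the :mod:`~fastNLP.core.field` entries of the :class:`~fastNLP.DataSet`, we still need to mark the inputs and targets required for training, using the
:meth:`~fastNLP.DataSet.set_input` and :meth:`~fastNLP.DataSet.set_target` functions.
.. code-block:: python
# Use the dataset's set_input and set_target functions to tell the model which fields are inputs and which are labels (target outputs)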
dataset.set_input(Const.INPUT, Const.INPUT_LEN)
dataset.set_target(Const.TARGET)
Splitting the dataset
Besides modifying a :mod:`~fastNLP.core.field`, we can also split a :class:`~fastNLP.DataSet` for training, development, and testing.
The following code shows how to use :meth:`~fastNLP.DataSet.split`:
.. code-block:: python
train_dev_data, test_data = dataset.split(0.1)
train_data, dev_data = train_dev_data.split(0.1)
print(len(train_data), len(dev_data), len(test_data))
The output is::
9603 1067 1185
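Evaluation metric
Training a model requires an evaluation metric. Here we use accuracy. The `naming rules` for its arguments are similar to those above.
The ``pred`` argument is the name of a key in the dict returned by the model's forward method.
The ``target`` argument is the name of the :mod:`~fastNLP.core.field` used as the label in the :class:`~fastNLP.DataSet`.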
.. code-block:: python
from fastNLP import AccuracyMetric
# In this example, metrics=AccuracyMetric() is equivalent to the line below
metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)
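For reference, accuracy for a classifier is the fraction of samples whose highest-scoring class equals the label. The following plain-Python sketch shows that computation; `accuracy` is a hypothetical helper and makes no claim about how `AccuracyMetric` is implemented.

```python
def accuracy(pred_scores, targets):
    # The predicted class is the index of the highest score in each row.
    predictions = [max(range(len(scores)), key=scores.__getitem__) for scores in pred_scores]
    correct = sum(int(pred == target) for pred, target in zip(predictions, targets))
    return correct / len(targets)


batch_scores = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.2, 0.3, 0.5]]
batch_targets = [1, 0, 1]
batch_accuracy = accuracy(batch_scores, batch_targets)
```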
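Loss function
Training a model requires a loss function.
fastNLP provides four losses that can be imported and used directly:
* :class:`~fastNLP.CrossEntropyLoss`: wraps torch.nn.functional.cross_entropy() and returns the cross-entropy loss (suitable for multi-class classification)
* :class:`~fastNLP.BCELoss`: wraps torch.nn.functional.binary_cross_entropy() and returns the binary cross-entropy
* :class:`~fastNLP.L1Loss`: wraps torch.nn.functional.l1_loss() and returns the L1 loss
* :class:`~fastNLP.NLLLoss`: wraps torch.nn.functional.nll_loss() and returns the negative log-likelihood loss
Below is the cross-entropy loss commonly used for classification. Note its **initialization arguments**.
The ``pred`` argument is the name of a key in the dict returned by the model's forward method.
The ``target`` argument is the name of the :mod:`~fastNLP.core.field` used as the label in the :class:`~fastNLP.DataSet`.
Here we use :class:`~fastNLP.Const` to help with naming. If the return value of the forward method in your own model or
the names of the :mod:`~fastNLP.core.field` entries in your dataset differ from this example, set the ``pred`` and ``target`` arguments to match your code.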
.. code-block:: python
from fastNLP import CrossEntropyLoss
# loss = CrossEntropyLoss() 在本例中与下面这行代码等价
loss = CrossEntropyLoss(pred=Const.OUTPUT, target=Const.TARGET)
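For readers who want to see what the cross-entropy loss computes, the following is a small plain-Python sketch of the standard formula (negative log-softmax of the gold class, averaged over the batch). It illustrates the arithmetic only and is not fastNLP's or PyTorch's code; both function names are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the gold class for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def mean_cross_entropy(batch_logits, targets):
    """Average the per-example losses over a batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# With three classes and identical scores, the loss is log(3)
uniform = cross_entropy([0.0, 0.0, 0.0], 1)
# A confident, correct prediction gives a much smaller loss
confident = cross_entropy([0.0, 5.0, 0.0], 1)
batch_loss = mean_cross_entropy([[0.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [1, 1])
```

An untrained three-class classifier therefore starts with a loss near log(3), roughly 1.10.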
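Optimizer
Define the optimizer used to train the model. You can use one of the optimizers wrapped by fastNLP:
* :class:`~fastNLP.SGD`: wraps torch.optim.SGD
* :class:`~fastNLP.Adam`: wraps torch.optim.Adam
You can also use any optimizer from torch.optim directly and pass it in when instantiating :class:`~fastNLP.Trainer`.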
.. code-block:: python
import torch.optim as optim
from fastNLP import Adam
# Define an optimizer with torch.optim (model_cnn must already exist; it is created in the "Quick training" code below)
optimizer_1=optim.RMSprop(model_cnn.parameters(), lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
#使用fastNLP中包装的 Adam 定义优化器
optimizer_2=Adam(lr=4e-3, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, model_params=model_cnn.parameters())
Quick training
We can now import fastNLP's built-in text classification model :class:`~fastNLP.models.CNNText` and train it with :class:`~fastNLP.Trainer`.
Besides using :class:`~fastNLP.Trainer`, we can also write our own training loop with :class:`~fastNLP.DataSetIter`; see :doc:`/tutorials/tutorial_5_datasetiter`.
.. code-block:: python
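from fastNLP import Trainer
from fastNLP.models import CNNText
# Embedding dimension, number of training epochs, and batch size
EMBED_DIM = 100
N_EPOCHS = 10
BATCH_SIZE = 16
# The first argument of CNNText is a tuple (vocabulary size, embedding dimension) used to build the embedding
# Custom values for kernel_nums, kernel_sizes, padding, and dropout can also be passed
model_cnn = CNNText((len(vocab),EMBED_DIM), num_classes=3, padding=2, dropout=0.1)
# If no optimizer is passed to the Trainer, it defaults to torch.optim.Adam with lr=4e-3
# optimizer_1 is used here; you can also try optimizer_2 or another optimizer
# loss is used as the loss function here; you can also try other loss functions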
trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data, loss=loss, metrics=metrics,
optimizer=optimizer_1,n_epochs=N_EPOCHS, batch_size=BATCH_SIZE)
trainer.train()
The training output is as follows::
input fields after batch(if batch size is 2):
words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 40])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
target fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
training epochs started 2019-07-08-15-44-48
Evaluation at Epoch 1/10. Step:601/6010. AccuracyMetric: acc=0.59044
Evaluation at Epoch 2/10. Step:1202/6010. AccuracyMetric: acc=0.599813
Evaluation at Epoch 3/10. Step:1803/6010. AccuracyMetric: acc=0.508903
Evaluation at Epoch 4/10. Step:2404/6010. AccuracyMetric: acc=0.596064
Evaluation at Epoch 5/10. Step:3005/6010. AccuracyMetric: acc=0.47985
Evaluation at Epoch 6/10. Step:3606/6010. AccuracyMetric: acc=0.589503
Evaluation at Epoch 7/10. Step:4207/6010. AccuracyMetric: acc=0.311153
Evaluation at Epoch 8/10. Step:4808/6010. AccuracyMetric: acc=0.549203
Evaluation at Epoch 9/10. Step:5409/6010. AccuracyMetric: acc=0.581068
Evaluation at Epoch 10/10. Step:6010/6010. AccuracyMetric: acc=0.523899
In Epoch:2/Step:1202, got best dev performance:AccuracyMetric: acc=0.599813
Reloaded the best model.
Quick testing
As a counterpart to :class:`~fastNLP.Trainer`, fastNLP also provides :class:`~fastNLP.Tester` for quick testing. It is used as follows:
.. code-block:: python
from fastNLP import Tester
tester = Tester(test_data, model_cnn, metrics=AccuracyMetric())
tester.test()
The test output is as follows::
[tester]
AccuracyMetric: acc=0.565401

View File

@ -0,0 +1,248 @@
==============================================================================
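DataSetIter tutorial: text classification as an example
==============================================================================
We use the same task as in :doc:`/user/quickstart` for a detailed walk-through: given a piece of review text, predict whether its sentiment is positive (label=1), negative (label=0), or neutral (label=2), and write our own training loop with the :class:`~fastNLP.DataSetIter` class. Everything before the section on writing your own training loop is identical to :doc:`/tutorials/tutorial_4_loss_optimizer`, so you can skip it if you have already read that tutorial.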
--------------
数据处理
--------------
Loading the data
We can use the :class:`~fastNLP.io.SSTLoader` class from the :mod:`fastNLP.io` module to read the SST dataset easily (data source: https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip).
Here ``dataset`` is an object of fastNLP's :class:`~fastNLP.DataSet` class.
.. code-block:: python
from fastNLP.io import SSTLoader
loader = SSTLoader()
# all.txt is the concatenation of train.txt, dev.txt, and test.txt from the downloaded data
dataset = loader.load("./trainDevTestTrees_PTB/trees/all.txt")
print(dataset[0])
The output is::
{'words': ['It', "'s", 'a', 'lovely', 'film', 'with', 'lovely', 'performances', 'by', 'Buy', 'and', 'Accorsi', '.'] type=list,
'target': positive type=str}
Besides this loader, fastNLP also provides Loader classes for other file types and for embeddings; see :doc:`/fastNLP.io`.
Processing the data
We use the :meth:`~fastNLP.DataSet.apply` method of :class:`~fastNLP.DataSet` to convert the ``target`` :mod:`~fastNLP.core.field` to an integer.
.. code-block:: python
def label_to_int(x):
    if x['target']=="positive":
        return 1
    elif x['target']=="negative":
        return 0
    else:
        return 2
# Convert the label to an integer
dataset.apply(lambda x: label_to_int(x), new_field_name='target')
``words`` and ``target`` are already enough to train :class:`~fastNLP.models.CNNText`, but its documentation
shows that :meth:`~fastNLP.models.CNNText.forward` also accepts an optional ``seq_len`` argument.
So we use :meth:`~fastNLP.DataSet.apply_field` to add a :mod:`~fastNLP.core.field` named ``seq_len``.
.. code-block:: python
# 增加长度信息
dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')
As you can see, :meth:`~fastNLP.DataSet.apply_field` is similar to :meth:`~fastNLP.DataSet.apply`,
but the function passed to it operates on a single :mod:`~fastNLP.core.field` of an :class:`~fastNLP.Instance`,
whereas the function passed to :meth:`~fastNLP.DataSet.apply` operates on the whole :class:`~fastNLP.Instance`.
.. note::
A `lambda` function is an anonymous function, an important Python feature. ``lambda x: len(x)`` does the same as the following function::
def func_lambda(x):
    return len(x)
You can also pass more complex functions to :meth:`~fastNLP.DataSet.apply_field` and :meth:`~fastNLP.DataSet.apply`.
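The difference between the two methods can be illustrated without fastNLP. In the sketch below, plain dicts stand in for :class:`~fastNLP.Instance` objects, and ``apply`` and ``apply_field`` are hypothetical stand-ins written for this illustration, not the library's implementation.

```python
instances = [
    {"words": ["a", "lovely", "film"], "target": "positive"},
    {"words": ["not", "good"], "target": "negative"},
]

def apply(dataset, func, new_field_name):
    """func receives the whole instance (a dict in this sketch)."""
    for ins in dataset:
        ins[new_field_name] = func(ins)

def apply_field(dataset, func, field_name, new_field_name):
    """func receives only the value stored in one field."""
    for ins in dataset:
        ins[new_field_name] = func(ins[field_name])

# The function sees the full instance and picks the field it needs
apply(instances, lambda ins: 1 if ins["target"] == "positive" else 0, "label")
# The function sees only the content of the 'words' field
apply_field(instances, len, "words", "seq_len")
```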
Using Vocabulary
Next we use the :class:`~fastNLP.Vocabulary` class to count the words in the data, and :meth:`~fastNLP.Vocabulary.index_dataset`
to convert each word sequence into a sequence of integers usable for training.
.. code-block:: python
from fastNLP import Vocabulary
# 使用Vocabulary类统计单词并将单词序列转化为数字序列
vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
vocab.index_dataset(dataset, field_name='words',new_field_name='words')
print(dataset[0])
The output is::
{'words': [27, 9, 6, 913, 16, 18, 913, 124, 31, 5715, 5, 1, 2] type=list,
'target': 1 type=int,
'seq_len': 13 type=int}
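The indexing step can be sketched in plain Python. This is an illustration only, not fastNLP's ``Vocabulary``: ``build_vocab`` and ``index_sentence`` are hypothetical helpers, and the choice of index 0 for padding and index 1 for unknown words is an assumption (it would be consistent with the rare word ``Accorsi`` appearing as 1 in the output above).

```python
from collections import Counter

def build_vocab(sentences, min_freq=2, pad="<pad>", unk="<unk>"):
    """Map every word seen at least min_freq times to an integer id."""
    counts = Counter(w for sent in sentences for w in sent)
    word2idx = {pad: 0, unk: 1}  # assumed reserved ids
    for word, freq in counts.most_common():
        if freq >= min_freq:
            word2idx[word] = len(word2idx)
    return word2idx

def index_sentence(sentence, word2idx, unk="<unk>"):
    """Replace each word by its id; words outside the vocabulary map to unk."""
    return [word2idx.get(w, word2idx[unk]) for w in sentence]

sentences = [["a", "lovely", "film", "."], ["a", "dull", "film", "."]]
toy_vocab = build_vocab(sentences)
ids = index_sentence(["a", "lovely", "film", "."], toy_vocab)
```

With ``min_freq=2``, the words ``lovely`` and ``dull`` occur only once each, so both fall back to the unknown id.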
---------------------
使用内置模型训练
---------------------
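Input and output naming for built-in models
fastNLP ships several complete neural network models (see :doc:`/fastNLP.models`); here we train with :class:`~fastNLP.models.CNNText`.
To use the built-in :class:`~fastNLP.models.CNNText`, we must rename the :mod:`~fastNLP.core.field` entries of the :class:`~fastNLP.DataSet`.
In this example the model inputs (the arguments of the forward method) are ``words`` and ``seq_len``, the prediction is ``pred``, and the gold label is ``target``.
The naming conventions are described in :doc:`/fastNLP.core.const`.
If you prefer not to consult the documentation, you can use the :class:`~fastNLP.Const` class for naming. The code below shows the :meth:`~fastNLP.DataSet.rename_field` method,
which renames a :mod:`~fastNLP.core.field` of a :class:`~fastNLP.DataSet`, together with the :class:`~fastNLP.Const` class.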
.. code-block:: python
from fastNLP import Const
dataset.rename_field('words', Const.INPUT)
dataset.rename_field('seq_len', Const.INPUT_LEN)
dataset.rename_field('target', Const.TARGET)
print(Const.INPUT)
print(Const.INPUT_LEN)
print(Const.TARGET)
print(Const.OUTPUT)
The output is::
words
seq_len
target
pred
After renaming the :mod:`~fastNLP.core.field` entries of the :class:`~fastNLP.DataSet`, we still need to mark which fields are inputs and which are targets,
using :meth:`~fastNLP.DataSet.set_input` and :meth:`~fastNLP.DataSet.set_target`.
.. code-block:: python
# set_input and set_target tell the model which fields of the dataset are inputs and which are labels (target outputs)
dataset.set_input(Const.INPUT, Const.INPUT_LEN)
dataset.set_target(Const.TARGET)
Splitting the dataset
Besides modifying a :mod:`~fastNLP.core.field`, we can also split a :class:`~fastNLP.DataSet` into training, development, and test sets.
The code below shows how to use :meth:`~fastNLP.DataSet.split`.
.. code-block:: python
train_dev_data, test_data = dataset.split(0.1)
train_data, dev_data = train_dev_data.split(0.1)
print(len(train_data), len(dev_data), len(test_data))
The output is::
9603 1067 1185
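Evaluation metric
Training requires an evaluation metric; here we use accuracy. The arguments follow the same naming rules as above.
The ``pred`` argument is the name of a key in the dict returned by the model's forward method.
The ``target`` argument is the name of the :mod:`~fastNLP.core.field` in the :class:`~fastNLP.DataSet` that serves as the label.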
.. code-block:: python
from fastNLP import AccuracyMetric
# metrics=AccuracyMetric() 在本例中与下面这行代码等价
metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)
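------------------------------
Writing your own training loop
------------------------------
If you want to write your own training loop in a PyTorch-like style, you can refer to the code below.
It uses fastNLP's :class:`~fastNLP.DataSetIter` to obtain mini-batches for training,
and passes a :class:`~fastNLP.BucketSampler` to :class:`~fastNLP.DataSetIter` to choose the sampling strategy.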
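DataSetIter
fastNLP's :class:`~fastNLP.DataSetIter` defines how batches are produced and implements several batch-related features. Its initialization arguments are:
* dataset: a :class:`~fastNLP.DataSet` object, the dataset
* batch_size: the size of each batch
* sampler: the :class:`~fastNLP.Sampler` to use; if None, :class:`~fastNLP.RandomSampler` is used. Default: None
* as_numpy: if True, batches are returned as `numpy.array`; otherwise as `torch.Tensor`. Default: False
* prefetch: if True, the next batch is fetched in advance using multiple processes. Default: False
sampler
fastNLP implements the following samplers:
* :class:`~fastNLP.BucketSampler`: randomly draws elements of similar length (initialization arguments: num_buckets, the number of buckets; batch_size, the batch size; seq_len_field_name, the name of the :mod:`~fastNLP.core.field` in the dataset that holds the sequence length)
* SequentialSampler: draws elements in order (no initialization arguments)
* RandomSampler: draws elements in random order (no initialization arguments)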
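The idea behind bucket sampling can be sketched in a few lines of plain Python. This is an illustration of the general technique, not fastNLP's ``BucketSampler``: ``bucket_batches`` is a hypothetical helper, and the way the buckets are formed here (sort by length, cut into equal-sized buckets) is an assumption.

```python
import random

def bucket_batches(seq_lens, batch_size, num_buckets, seed=0):
    """Return lists of indices; every batch is drawn from a single length bucket."""
    rng = random.Random(seed)
    # Sort indices by sequence length so neighbours have similar lengths
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    bucket_size = -(-len(order) // num_buckets)  # ceiling division
    batches = []
    for start in range(0, len(order), bucket_size):
        bucket = order[start:start + bucket_size]
        rng.shuffle(bucket)  # randomize within the bucket
        for b in range(0, len(bucket), batch_size):
            batches.append(bucket[b:b + batch_size])
    rng.shuffle(batches)  # randomize the order of the batches
    return batches

lens = [3, 40, 5, 38, 4, 41, 6, 39]
batches = bucket_batches(lens, batch_size=2, num_buckets=2)
```

Grouping sequences of similar length keeps the amount of padding per batch small, while the two shuffles preserve randomness across epochs.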
The code below passes a BucketSampler to :class:`~fastNLP.DataSetIter` and uses the resulting :class:`~fastNLP.DataSetIter` in a hand-written training loop.
.. code-block:: python
from fastNLP import BucketSampler
from fastNLP import DataSetIter
from fastNLP.models import CNNText
from fastNLP import Tester, AccuracyMetric
import torch
import time
embed_dim = 100
model = CNNText((len(vocab),embed_dim), num_classes=3, padding=2, dropout=0.1)
def train(epoch, data, devdata):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    lossfunc = torch.nn.CrossEntropyLoss()
    batch_size = 32
    # Build the batch iterator: pass in the DataSet, the batch size, and the rule for drawing batches.
    # Options: Sequential (in order), Random, or Bucket (similar lengths grouped into one batch)
    train_sampler = BucketSampler(batch_size=batch_size, seq_len_field_name='seq_len')
    train_batch = DataSetIter(batch_size=batch_size, dataset=data, sampler=train_sampler)
    start_time = time.time()
    print("-"*5+"start training"+"-"*5)
    for i in range(epoch):
        loss_list = []
        for batch_x, batch_y in train_batch:
            optimizer.zero_grad()
            output = model(batch_x['words'])
            loss = lossfunc(output['pred'], batch_y['target'])
            loss.backward()
            optimizer.step()
            loss_list.append(loss.item())
        # With verbose=0, Tester.test() prints nothing and only returns the results; with verbose=1 it also prints them
        # _format_eval_results(res) formats the results returned by test() for printing
        tester_tmp = Tester(devdata, model, metrics=AccuracyMetric(), verbose=0)
        res=tester_tmp.test()
        print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=" ")
        print(tester_tmp._format_eval_results(res),end=" ")
        print('{:d}ms'.format(round((time.time()-start_time)*1000)))
        loss_list.clear()
train(10, train_data, dev_data)
# Quick test with Tester
tester = Tester(test_data, model, metrics=AccuracyMetric())
tester.test()
The output of this code is::
-----start training-----
Epoch 0 Avg Loss: 1.09 AccuracyMetric: acc=0.480787 58989ms
Epoch 1 Avg Loss: 1.00 AccuracyMetric: acc=0.500469 118348ms
Epoch 2 Avg Loss: 0.93 AccuracyMetric: acc=0.536082 176220ms
Epoch 3 Avg Loss: 0.87 AccuracyMetric: acc=0.556701 236032ms
Epoch 4 Avg Loss: 0.78 AccuracyMetric: acc=0.562324 294351ms
Epoch 5 Avg Loss: 0.69 AccuracyMetric: acc=0.58388 353673ms
Epoch 6 Avg Loss: 0.60 AccuracyMetric: acc=0.574508 412106ms
Epoch 7 Avg Loss: 0.51 AccuracyMetric: acc=0.589503 471097ms
Epoch 8 Avg Loss: 0.44 AccuracyMetric: acc=0.581068 529174ms
Epoch 9 Avg Loss: 0.39 AccuracyMetric: acc=0.572634 586216ms
[tester]
AccuracyMetric: acc=0.527426

View File

@ -0,0 +1,114 @@
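==========================
Sequence labeling tutorial
==========================
This part shows how to implement a sequence labeling task with fastNLP. Using fastNLP's components, you can complete sequence labeling quickly and conveniently and achieve strong results.
Before reading this tutorial, you should already be familiar with the basics of fastNLP covered in the earlier tutorials, including the basic data structures, data preprocessing, and the use of embeddings.
We will process the English CoNLL-03 dataset and walk through the whole training process for named entity recognition.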
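Loading the data
===================================
fastNLP can conveniently load many kinds of data. For common datasets, loading methods are already implemented, including one for the CoNLL-03 dataset.
When designing a data loader, you can subclass DataSetLoader and adapt it to load other datasets.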
.. code-block:: python
class Conll2003DataLoader(DataSetLoader):
    def __init__(self, task:str='ner', encoding_type:str='bioes'):
        assert task in ('ner', 'pos', 'chunk')
        index = {'ner':3, 'pos':1, 'chunk':2}[task]
        # ConllLoader is a built-in fastNLP class
        self._loader = ConllLoader(headers=['raw_words', 'target'], indexes=[0, index])
        self._tag_converters = None
        if task in ('ner', 'chunk'):
            # iob2 and iob2bioes normalize the tags to a single scheme
            self._tag_converters = [iob2]
            if encoding_type == 'bioes':
                self._tag_converters.append(iob2bioes)
    def load(self, path: str):
        dataset = self._loader.load(path)
        def convert_tag_schema(tags):
            for converter in self._tag_converters:
                tags = converter(tags)
            return tags
        if self._tag_converters:
            # apply_field runs convert_tag_schema on the target field; an anonymous function would also work
            dataset.apply_field(convert_tag_schema, field_name=Const.TARGET, new_field_name=Const.TARGET)
        return dataset
The loaded data has the following format::
{'raw_words': ['on', 'Friday', ':'] type=list,
'target': ['O', 'O', 'O'] type=list},
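The loader above converts tags from IOB2 to BIOES when ``encoding_type`` is ``'bioes'``. The sketch below shows what such a conversion does on a toy tag sequence. It is a standalone illustration written for this tutorial, not fastNLP's ``iob2bioes``; the function name ``iob2_to_bioes`` is hypothetical, and the input is assumed to be valid IOB2.

```python
def iob2_to_bioes(tags):
    """Convert IOB2 tags (B-X, I-X, O) to BIOES, which adds E-X (end) and S-X (single)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        # The span continues only if the next tag is I- with the same label
        continues = nxt == "I-" + label
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out

converted = iob2_to_bioes(["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"])
```

BIOES marks span ends and single-token spans explicitly, which gives a CRF layer more precise transition constraints than IOB2.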
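Processing the data
----------------------------
Next we process the data further. The data and vocabularies are wrapped in the :class:`~fastNLP.DataInfo` class; ``data`` is an instance of DataInfo.
The model takes both char embeddings and word embeddings as input. In this step we build the vocabulary,
using fastNLP's Vocabulary class.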
.. code-block:: python
word_vocab = Vocabulary(min_freq=2)
word_vocab.from_dataset(data.datasets['train'], field_name=Const.INPUT)
word_vocab.index_dataset(*data.datasets.values(),field_name=Const.INPUT, new_field_name=Const.INPUT)
After processing, the ``data`` object contains:
dataset
vocabs
``dataset`` holds the train and test data, stored as DataSet objects.
``vocabs`` holds the vocabularies for words, raw-words, and target.
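Building the model
--------------------------------
We use a CNN-BiLSTM-CRF model for this task. fastNLP network definitions inherit from PyTorch's :class:`nn.Module` class,
so you can define your own network the PyTorch way. The one thing to watch is naming: fastNLP's standard names are defined in the :class:`~fastNLP.Const` class.
Training the model
First instantiate the model and load the required char embedding and word embedding. See the tutorials for how to load embeddings.
You can also consult :mod:`~fastNLP.modules.encoder.embedding` for the available embedding loading methods.
fastNLP wraps the training procedure in the :class:`~fastNLP.Trainer` class.
Adjust the Trainer arguments to suit the task. A Trainer instance typically needs the training dataset, the model, the optimizer, the loss function, the evaluation metric, the number of epochs, and the batch size.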
.. code-block:: python
#实例化模型
model = CNNBiLSTMCRF(word_embed, char_embed, hidden_size=200, num_layers=1, tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type)
#定义优化器
optimizer = Adam(model.parameters(), lr=0.005)
#定义评估指标
Metrics=SpanFPreRecMetric(tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type)
#实例化trainer
trainer = Trainer(train_data=data.datasets['train'], model=model, optimizer=optimizer, dev_data=data.datasets['test'], batch_size=10, metrics=Metrics,callbacks=callbacks, n_epochs=100)
#开始训练
trainer.train()
The best parameters found during training are saved.
The training output is as follows::
Evaluation on DataSet test:
SpanFPreRecMetric: f=0.727661, pre=0.732293, rec=0.723088
Evaluation at Epoch 1/100. Step:1405/140500. SpanFPreRecMetric: f=0.727661, pre=0.732293, rec=0.723088
Evaluation on DataSet test:
SpanFPreRecMetric: f=0.784307, pre=0.779371, rec=0.789306
Evaluation at Epoch 2/100. Step:2810/140500. SpanFPreRecMetric: f=0.784307, pre=0.779371, rec=0.789306
Evaluation on DataSet test:
SpanFPreRecMetric: f=0.810068, pre=0.811003, rec=0.809136
Evaluation at Epoch 3/100. Step:4215/140500. SpanFPreRecMetric: f=0.810068, pre=0.811003, rec=0.809136
Evaluation on DataSet test:
SpanFPreRecMetric: f=0.829592, pre=0.84153, rec=0.817989
Evaluation at Epoch 4/100. Step:5620/140500. SpanFPreRecMetric: f=0.829592, pre=0.84153, rec=0.817989
Evaluation on DataSet test:
SpanFPreRecMetric: f=0.828789, pre=0.837096, rec=0.820644
Evaluation at Epoch 5/100. Step:7025/140500. SpanFPreRecMetric: f=0.828789, pre=0.837096, rec=0.820644

View File

@ -0,0 +1,205 @@
======================================
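Modules and models tutorial
======================================
:mod:`~fastNLP.modules` and :mod:`~fastNLP.models` are used to build the neural network models needed by fastNLP; they can be used together with the modules in torch.nn.
Below we describe, in three sections, the concrete ways to build a model.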
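-------------------------
Using the built-in models
-------------------------
fastNLP's :mod:`~fastNLP.models` module includes complete models such as :class:`~fastNLP.models.CNNText` and
:class:`~fastNLP.models.SeqLabeling` that can be used directly.
Taking :class:`~fastNLP.models.CNNText` as an example, let us walk through a simple text classification task.
First comes data loading and processing; the code is the same as in the :doc:`quick start </user/quickstart>`.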
.. code-block:: python
from fastNLP.io import CSVLoader
from fastNLP import Vocabulary, CrossEntropyLoss, AccuracyMetric
loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\t')
dataset = loader.load("./sample_data/tutorial_sample_dataset.csv")
dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')
dataset.apply_field(lambda x: x.split(), field_name='sentence', new_field_name='words', is_input=True)
dataset.apply(lambda x: int(x['label']), new_field_name='target', is_target=True)
train_dev_data, test_data = dataset.split(0.1)
train_data, dev_data = train_dev_data.split(0.1)
vocab = Vocabulary(min_freq=2).from_dataset(train_data, field_name='words')
vocab.index_dataset(train_data, dev_data, test_data, field_name='words', new_field_name='words')
Then we import the ``CNNText`` model from :mod:`~fastNLP.models` and train it.
.. code-block:: python
from fastNLP.models import CNNText
from fastNLP import Trainer
model_cnn = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data,
loss=CrossEntropyLoss(), metrics=AccuracyMetric())
trainer.train()
Entering ``model_cnn`` in an IPython session shows the network structure of ``model_cnn``:
.. parsed-literal::
CNNText(
(embed): Embedding(
169, 50
(dropout): Dropout(p=0.0)
)
(conv_pool): ConvMaxpool(
(convs): ModuleList(
(0): Conv1d(50, 3, kernel_size=(3,), stride=(1,), padding=(2,))
(1): Conv1d(50, 4, kernel_size=(4,), stride=(1,), padding=(2,))
(2): Conv1d(50, 5, kernel_size=(5,), stride=(1,), padding=(2,))
)
)
(dropout): Dropout(p=0.1)
(fc): Linear(in_features=12, out_features=5, bias=True)
)
fastNLP's built-in models are listed in the table below; click a name to see the detailed API:
.. csv-table::
:header: Name, Description
:class:`~fastNLP.models.CNNText` , Text classification model using a CNN
:class:`~fastNLP.models.SeqLabeling` , Simple sequence labeling model
:class:`~fastNLP.models.AdvSeqLabel` , Sequence labeling model with a larger network
:class:`~fastNLP.models.ESIM` , Implementation of the ESIM model
:class:`~fastNLP.models.StarTransEnc` , Star-Transformer model with word embedding
:class:`~fastNLP.models.STSeqLabel` , Star-Transformer model for sequence labeling
:class:`~fastNLP.models.STNLICls` , Star-Transformer model for natural language inference (NLI)
:class:`~fastNLP.models.STSeqCls` , Star-Transformer model for classification
:class:`~fastNLP.models.BiaffineParser` , Implementation of the Biaffine dependency parser
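----------------------------
Writing models with torch.nn
----------------------------
fastNLP fully supports models written in PyTorch. Unlike the usual way of writing a PyTorch model, however,
the forward function of a model used with fastNLP must return a dict that contains at least the ``pred`` key.
Below is a text classification model written with PyTorch's torch.nn module; note the tensor shapes annotated in the code.
Because PyTorch's default layout for recurrent layers puts the sequence dimension first, forward has to reorder dimensions several times.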
.. code-block:: python
import torch
import torch.nn as nn
class LSTMText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, words):
        # (input) words : (batch_size, seq_len)
        words = words.permute(1,0)
        # words : (seq_len, batch_size)
        embedded = self.dropout(self.embedding(words))
        # embedded : (seq_len, batch_size, embedding_dim)
        output, (hidden, cell) = self.lstm(embedded)
        # output: (seq_len, batch_size, hidden_dim * 2)
        # hidden: (num_layers * 2, batch_size, hidden_dim)
        # cell: (num_layers * 2, batch_size, hidden_dim)
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        hidden = self.dropout(hidden)
        # hidden: (batch_size, hidden_dim * 2)
        pred = self.fc(hidden.squeeze(0))
        # pred: (batch_size, output_dim)
        return {"pred":pred}
We can likewise inspect this model's network structure in an IPython session:
.. parsed-literal::
LSTMText(
(embedding): Embedding(169, 50)
(lstm): LSTM(50, 64, num_layers=2, dropout=0.5, bidirectional=True)
(fc): Linear(in_features=128, out_features=5, bias=True)
(dropout): Dropout(p=0.5)
)
----------------------------
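Writing models with modules
----------------------------
Below we build the same network from the components in :mod:`fastNLP.modules`. Because fastNLP consistently puts ``batch_size`` in the first dimension,
the code becomes somewhat simpler to write.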
.. code-block:: python
import torch
import torch.nn as nn
from fastNLP.modules import Embedding, LSTM, MLP
class Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = Embedding((vocab_size, embedding_dim))
        self.lstm = LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
        self.mlp = MLP([hidden_dim*2,output_dim], dropout=dropout)
    def forward(self, words):
        embedded = self.embedding(words)
        _,(hidden,_) = self.lstm(embedded)
        pred = self.mlp(torch.cat((hidden[-1],hidden[-2]),dim=1))
        return {"pred":pred}
The network structure of our hand-written model is:
.. parsed-literal::
Model(
(embedding): Embedding(
169, 50
(dropout): Dropout(p=0.0)
)
(lstm): LSTM(
(lstm): LSTM(50, 64, num_layers=2, batch_first=True, bidirectional=True)
)
(mlp): MLP(
(hiddens): ModuleList()
(output): Linear(in_features=128, out_features=5, bias=True)
(dropout): Dropout(p=0.5)
)
)
The modules included in fastNLP are listed in the table below; click a name to see the detailed API:
.. csv-table::
:header: Name, Description
:class:`~fastNLP.modules.ConvolutionCharEncoder` , Character-level convolutional encoder
:class:`~fastNLP.modules.LSTMCharEncoder` , Character-level LSTM-based encoder
:class:`~fastNLP.modules.ConvMaxpool` , Module combining convolution and max-pooling
:class:`~fastNLP.modules.Embedding` , Basic embedding module
:class:`~fastNLP.modules.LSTM` , "LSTM module, a light wrapper around PyTorch's LSTM"
:class:`~fastNLP.modules.StarTransformer` , Encoder part of Star-Transformer
:class:`~fastNLP.modules.TransformerEncoder` , "Transformer encoder module, without the embedding layer"
:class:`~fastNLP.modules.VarRNN` , Variational Dropout RNN module
:class:`~fastNLP.modules.VarLSTM` , Variational Dropout LSTM module
:class:`~fastNLP.modules.VarGRU` , Variational Dropout GRU module
:class:`~fastNLP.modules.MaxPool` , Max-pooling module
:class:`~fastNLP.modules.MaxPoolWithMask` , Max pooling with a mask matrix; positions where the mask is 0 are ignored during max-pooling
:class:`~fastNLP.modules.MultiHeadAttention` , Multi-head attention module
:class:`~fastNLP.modules.MLP` , Simple multi-layer perceptron module
:class:`~fastNLP.modules.ConditionalRandomField` , Conditional random field module
:class:`~fastNLP.modules.viterbi_decode` , "Given a feature matrix and a transition score matrix, computes the best path and its score (used together with :class:`~fastNLP.modules.ConditionalRandomField`)"
:class:`~fastNLP.modules.allowed_transitions` , "Given a mapping from id to label, returns the list of all allowed transitions (used together with :class:`~fastNLP.modules.ConditionalRandomField`)"

View File

@ -0,0 +1,121 @@
=====================
Metric tutorial
=====================
fastNLP provides a variety of :mod:`~fastNLP.core.metrics` for use during training.
As introduced in :doc:`/user/quickstart`, an :class:`~fastNLP.AccuracyMetric` object is passed directly to :class:`~fastNLP.Trainer` for training:
.. code-block:: python
from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric
trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
loss=CrossEntropyLoss(), metrics=AccuracyMetric())
trainer.train()
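Besides :class:`~fastNLP.AccuracyMetric`, :class:`~fastNLP.SpanFPreRecMetric` is another very common metric;
in sequence labeling, for example, F-measure, precision, and recall are usually computed over spans.
fastNLP also implements :class:`~fastNLP.ExtractiveQAMetric`, a metric for extractive QA (such as SQuAD).
Refer to the table below and click the first column to see the detailed documentation of each of the :mod:`~fastNLP.core.metrics`.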
.. csv-table::
:header: Name, Description
:class:`~fastNLP.core.metrics.MetricBase` , Base class that custom metrics must inherit from
:class:`~fastNLP.core.metrics.AccuracyMetric` , Simple accuracy metric
:class:`~fastNLP.core.metrics.SpanFPreRecMetric` , "Metric that computes F-measure, precision, and recall together"
:class:`~fastNLP.core.metrics.ExtractiveQAMetric` , Metric for extractive QA tasks
More :mod:`~fastNLP.core.metrics` are being added to fastNLP.
------------------------------
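Defining your own metrics
------------------------------
To define your own metric class, inherit from fastNLP's :class:`~fastNLP.core.metrics.MetricBase`
and override the ``evaluate`` and ``get_metric`` methods.
evaluate(xxx) receives one batch of data and accumulates the metric statistics for that batch's predictions.
get_metric(xxx) is called once all data has been processed; it computes the final result from the statistics accumulated by evaluate.
Take accuracy for a classification problem as an example. Suppose the dict returned by the model's forward contains the key `pred`, and this key is needed to compute accuracy::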
class Model(nn.Module):
    def __init__(xxx):
        # do something
    def forward(self, xxx):
        # do something
        return {'pred': pred, 'other_keys':xxx} # pred's shape: batch_size x num_classes
Suppose the `label` field of the dataset holds the value to predict and has been set as a target.
The corresponding AccMetric can then be defined as follows. Version 1, for one-off use::
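class AccMetric(MetricBase):
    def __init__(self):
        super().__init__()
        # Define whatever statistics your metric needs
        self.corr_num = 0
        self.total = 0
    def evaluate(self, label, pred): # these names must match the target field in the dataset and the key returned by the model, otherwise the values cannot be found
        # Called once per batch during dev or test; accumulate the metric statistics for the batch here
        pred = pred.argmax(dim=-1) # pred has shape batch_size x num_classes, so take the highest-scoring class
        self.total += label.size(0)
        self.corr_num += label.eq(pred).sum().item()
    def get_metric(self, reset=True): # define how the final metric is computed
        acc = self.corr_num/self.total
        if reset: # reset the counters so the metric can be computed again
            self.corr_num = 0
            self.total = 0
        return {'acc': acc} # must return a dict; each key is a metric name and is shown in the Trainer's progress bar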
Version 2, for when the metric needs to be reused, for example when the target field of the next dataset is called y instead of label, or the model's output key is not pred::
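class AccMetric(MetricBase):
    def __init__(self, label=None, pred=None):
        # Suppose that in another setting the target field is called y and the model's output key is pred_y.
        # Then it is enough to initialize the metric as acc_metric = AccMetric(label='y', pred='pred_y').
        # When initialized as acc_metric = AccMetric(), i.e. label=None and pred=None, fastNLP uses 'label' and 'pred'
        # directly as the keys to look up the corresponding values
        super().__init__()
        self._init_param_map(label=label, pred=pred) # registers label and pred; only the argument names used by evaluate() need to be registered
        # If nothing is registered here, the behavior is the same as version 1
        # Define whatever statistics your metric needs
        self.corr_num = 0
        self.total = 0
    def evaluate(self, label, pred): # these argument names must match the ones registered with self._init_param_map()
        # Called once per batch during dev or test; accumulate the metric statistics for the batch here
        pred = pred.argmax(dim=-1) # pred has shape batch_size x num_classes, so take the highest-scoring class
        self.total += label.size(0)
        self.corr_num += label.eq(pred).sum().item()
    def get_metric(self, reset=True): # define how the final metric is computed
        acc = self.corr_num/self.total
        if reset: # reset the counters so the metric can be computed again
            self.corr_num = 0
            self.total = 0
        return {'acc': acc} # must return a dict; each key is a metric name and is shown in the Trainer's progress bar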
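``MetricBase`` looks up the arguments it needs in the input dicts ``pred_dict`` and ``target_dict``.
``pred_dict`` is the return value of the model's ``forward()`` or ``predict()`` function.
``target_dict`` is the ground truth from the DataSet; a field counts as ground truth when its ``is_target`` is set to True.
``MetricBase`` performs the following checks:
1. Whether self.evaluate has varargs, which is not supported.
2. Whether an argument required by self.evaluate is found in neither ``pred_dict`` nor ``target_dict``.
3. Whether an argument required by self.evaluate is found in both ``pred_dict`` and ``target_dict``.
In addition, before the arguments are passed to self.evaluate, it detects the entries of ``pred_dict`` and ``target_dict`` that are not used;
this check is skipped if self.evaluate accepts kwargs.
self.evaluate computes the metric statistics for one batch and accumulates them. It has no return value.
self.get_metric aggregates the accumulated statistics and returns the result. The return value must be a dict whose keys are metric names and whose values are the metric values.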

View File

@ -0,0 +1,67 @@
==============================================================================
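Callback tutorial
==============================================================================
During training we often use tricks to improve model performance, such as adjusting the learning rate, or want to print information about the training run.
For this, fastNLP provides the Callback class, which inserts code into the Trainer to perform custom operations.
We use the same task as in :doc:`/user/quickstart` for a detailed walk-through.
Given a piece of review text, predict whether its sentiment is positive (label=1), negative (label=0), or neutral (label=2), using :class:`~fastNLP.Trainer` and :class:`~fastNLP.Tester` for quick training and testing.
For data processing and the choice of loss and optimizer, see the other tutorials; here we only add learning rate decay during training.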
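-----------------------------
Building and using a Callback
-----------------------------
Creating a Callback
We can define our own Callback by inheriting from fastNLP's :class:`~fastNLP.Callback` class.
Here we implement a Callback that decays the learning rate linearly.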
.. code-block:: python
import fastNLP
class LRDecay(fastNLP.Callback):
    def __init__(self):
        super(LRDecay, self).__init__()
        self.base_lrs = []
        self.delta = []
    def on_train_begin(self):
        # Initialization; called only once, when training starts
        self.base_lrs = [pg['lr'] for pg in self.optimizer.param_groups]
        self.delta = [float(lr) / self.n_epochs for lr in self.base_lrs]
    def on_epoch_end(self):
        # Update the learning rate at the end of every epoch
        ep = self.epoch
        lrs = [lr - d * ep for lr, d in zip(self.base_lrs, self.delta)]
        self.change_lr(lrs)
    def change_lr(self, lrs):
        for pg, lr in zip(self.optimizer.param_groups, lrs):
            pg['lr'] = lr
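Here, every method of :class:`~fastNLP.Callback` whose name starts with ``on_`` is called at a specific point during training by :class:`~fastNLP.Trainer`.
For example, on_train_begin() is called when training starts, and on_epoch_end() is called at the end of every epoch.
See the documentation for the full list of these methods.
For convenience, attributes of the :class:`~fastNLP.Trainer` can also be accessed from inside a :class:`~fastNLP.Callback`, such as optimizer, epoch, and step, which are the optimizer used in training, the current epoch number, and the current total step count.
See the documentation for the full list of accessible attributes.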
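The schedule produced by such a linear decay can be worked out by hand. The sketch below reproduces the arithmetic of ``on_epoch_end`` in plain Python; ``linear_decay_schedule`` is a hypothetical helper, and it assumes that the epoch counter starts at 1, which is an assumption about the Trainer rather than something stated in this tutorial.

```python
def linear_decay_schedule(base_lr, n_epochs):
    """Learning rate after each epoch when base_lr / n_epochs is subtracted per epoch."""
    delta = base_lr / n_epochs
    # Assumes epochs are numbered 1..n_epochs
    return [base_lr - delta * epoch for epoch in range(1, n_epochs + 1)]

schedule = linear_decay_schedule(base_lr=0.1, n_epochs=5)
for epoch, lr in enumerate(schedule, start=1):
    print("after epoch {}: lr = {:.3f}".format(epoch, lr))
```

Under this assumption the learning rate reaches zero only after the final epoch, so every epoch still trains with a positive rate.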
Using the Callback
Once the :class:`~fastNLP.Callback` is defined, it can be passed to the Trainer's ``callbacks`` argument and used during actual training.
.. code-block:: python
"""
Data preprocessing, model definition, and so on
"""
trainer = fastNLP.Trainer(
model=model, train_data=train_data, dev_data=dev_data,
optimizer=optimizer, metrics=metrics,
batch_size=10, n_epochs=100,
callbacks=[LRDecay()])
trainer.train()

View File

@ -0,0 +1,3 @@
===============
在代码中写文档
===============

View File

@ -20,7 +20,13 @@
小标题4
-------------------
参考 http://docutils.sourceforge.net/docs/user/rst/quickref.html
推荐使用大标题、小标题3和小标题4
官方文档 http://docutils.sourceforge.net/docs/user/rst/quickref.html
`熟悉markdown的同学推荐参考这篇文章 <https://macplay.github.io/posts/cong-markdown-dao-restructuredtext/#id30>`_
\<\>内表示的是链接地址,\<\>外的是显示到外面的文字
常见语法
============
@ -75,6 +81,7 @@ http://docutils.sf.net/ 孤立的网址会自动生成链接
不显示冒号的代码块
.. code-block:: python
:linenos:
:emphasize-lines: 1,3
@ -83,22 +90,67 @@ http://docutils.sf.net/ 孤立的网址会自动生成链接
print("有行号和高亮")
数学块
==========
.. math::
H_2O + Na = NaOH + H_2 \uparrow
复杂表格
==========
各种连接
===========
+------------------------+------------+----------+----------+
| Header row, column 1 | Header 2 | Header 3 | Header 4 |
| (header rows optional) | | | |
+========================+============+==========+==========+
| body row 1, column 1 | column 2 | column 3 | column 4 |
+------------------------+------------+----------+----------+
| body row 2 | Cells may span columns. |
+------------------------+------------+---------------------+
| body row 3 | Cells may | - Table cells |
+------------------------+ span rows. | - contain |
| body row 4 | | - body elements. |
+------------------------+------------+---------------------+
:doc:`/user/with_fitlog`
简易表格
==========
===== ===== ======
Inputs Output
------------ ------
A B A or B
===== ===== ======
False False False
True True True
===== ===== ======
csv 表格
============
.. csv-table::
:header: sentence, target
This is the first instance ., 0
Second instance ., 1
Third instance ., 1
..., ...
[重要]各种链接
===================
各种链接帮助我们连接到fastNLP文档的各个位置
\<\>内表示的是链接地址,\<\>外的是显示到外面的文字
:doc:`根据文件名链接 </user/quickstart>`
:mod:`~fastNLP.core.batch`
:class:`~fastNLP.Batch`
~表示指显示最后一项
~表示显示最后一项
:meth:`fastNLP.DataSet.apply`

View File

@ -7,10 +7,12 @@
fastNLP 依赖如下包::
torch>=0.4.0
numpy
tqdm
nltk
numpy>=1.14.2
torch>=1.0.0
tqdm>=4.28.1
nltk>=3.4.1
requests
spacy
其中torch的安装可能与操作系统及 CUDA 的版本相关,请参见 `PyTorch 官网 <https://pytorch.org/get-started/locally/>`_
在依赖包安装完成的情况,您可以在命令行执行如下指令完成安装
@ -18,3 +20,4 @@ fastNLP 依赖如下包::
.. code:: shell
>>> pip install fastNLP
>>> python -m spacy download en

View File

@ -121,4 +121,4 @@
In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.8
Reloaded the best model.
这份教程只是简单地介绍了使用 fastNLP 工作的流程,具体的细节分析见 :doc:`/user/tutorial_one`
这份教程只是简单地介绍了使用 fastNLP 工作的流程,更多的教程分析见 :doc:`/user/tutorials`

View File

@ -1,371 +0,0 @@
===============
详细指南
===============
我们使用和 :doc:`/user/quickstart` 中一样的任务来进行详细的介绍。给出一段文字预测它的标签是0~4中的哪一个
(数据来源 `kaggle <https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews>`_ )。
--------------
数据处理
--------------
数据读入
我们可以使用 fastNLP :mod:`fastNLP.io` 模块中的 :class:`~fastNLP.io.CSVLoader` 类,轻松地从 csv 文件读取我们的数据。
这里的 dataset 是 fastNLP 中 :class:`~fastNLP.DataSet` 类的对象
.. code-block:: python
from fastNLP.io import CSVLoader
loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\t')
dataset = loader.load("./sample_data/tutorial_sample_dataset.csv")
除了读取数据外fastNLP 还提供了读取其它文件类型的 Loader 类、读取 Embedding的 Loader 等。详见 :doc:`/fastNLP.io`
Instance 和 DataSet
fastNLP 中的 :class:`~fastNLP.DataSet` 类对象类似于二维表格,它的每一列是一个 :mod:`~fastNLP.core.field`
每一行是一个 :mod:`~fastNLP.core.instance` 。我们可以手动向数据集中添加 :class:`~fastNLP.Instance` 类的对象
.. code-block:: python
from fastNLP import Instance
dataset.append(Instance(raw_sentence='fake data', label='0'))
此时的 ``dataset[-1]`` 的值如下,可以看到,数据集中的每个数据包含 ``raw_sentence````label`` 两个
:mod:`~fastNLP.core.field` ,他们的类型都是 ``str`` ::
{'raw_sentence': fake data type=str, 'label': 0 type=str}
field 的修改
我们使用 :class:`~fastNLP.DataSet` 类的 :meth:`~fastNLP.DataSet.apply` 方法将 ``raw_sentence`` 中字母变成小写,并将句子分词。
同时也将 ``label`` :mod:`~fastNLP.core.field` 转化为整数并改名为 ``target``
.. code-block:: python
dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')
dataset.apply_field(lambda x: x.split(), field_name='sentence', new_field_name='words')
dataset.apply(lambda x: int(x['label']), new_field_name='target')
``words`` and ``target`` are already enough to train :class:`~fastNLP.models.CNNText` , but the documentation of
:class:`~fastNLP.models.CNNText` shows that :meth:`~fastNLP.models.CNNText.forward` also accepts an optional ``seq_len`` argument.
So we use the :meth:`~fastNLP.DataSet.apply_field` method again to add a :mod:`~fastNLP.core.field` named ``seq_len`` .
.. code-block:: python
dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')
Note that :meth:`~fastNLP.DataSet.apply_field` is similar to :meth:`~fastNLP.DataSet.apply` ,
but the `lambda` function passed to it operates on a single :mod:`~fastNLP.core.field` of an :class:`~fastNLP.Instance` ,
whereas the `lambda` function passed to :meth:`~fastNLP.DataSet.apply` operates on the whole :class:`~fastNLP.Instance` .
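The difference between the two methods can be illustrated with a small stand-in class. This is only a sketch: ``ToyDataSet`` is a hypothetical name introduced here, it stores rows as plain dicts, and it is not how fastNLP implements :class:`~fastNLP.DataSet` .

```python
# Toy stand-in for a tabular dataset; NOT fastNLP's DataSet implementation.
class ToyDataSet:
    def __init__(self, rows):
        # Each row is a dict mapping field name -> value.
        self.rows = [dict(r) for r in rows]

    def apply(self, func, new_field_name):
        # func receives the whole row (all fields of one instance).
        for row in self.rows:
            row[new_field_name] = func(row)

    def apply_field(self, func, field_name, new_field_name):
        # func receives only the value of one field of the row.
        for row in self.rows:
            row[new_field_name] = func(row[field_name])


toy = ToyDataSet([{"raw_sentence": "A Fine Movie", "label": "3"}])
toy.apply(lambda x: x["raw_sentence"].lower(), new_field_name="sentence")
toy.apply_field(lambda x: x.split(), field_name="sentence", new_field_name="words")
toy.apply_field(len, field_name="words", new_field_name="seq_len")
print(toy.rows[0]["words"], toy.rows[0]["seq_len"])
```

The function given to ``apply`` indexes into the row by field name, while the one given to ``apply_field`` works directly on the field value.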
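.. note::
A `lambda` function is an anonymous function, an important feature of Python. ``lambda x: len(x)`` does the same thing as the function below::
def func_lambda(x):
    return len(x)
You can also write more complex functions and pass them as arguments to :meth:`~fastNLP.DataSet.apply_field` and :meth:`~fastNLP.DataSet.apply` .
Using Vocabulary
Next we use the :class:`~fastNLP.Vocabulary` class to count the words that appear in the data, and use :meth:`~fastNLP.Vocabulary.index_dataset`
to convert the word sequences into the number sequences needed for training.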
.. code-block:: python
from fastNLP import Vocabulary
vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
vocab.index_dataset(dataset, field_name='words',new_field_name='words')
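As a rough illustration of what counting with a minimum frequency and then indexing means, here is a plain-Python sketch. The helper names ``build_vocab`` and ``index_words`` , the ``<pad>`` / ``<unk>`` tokens and their indices 0 and 1 are assumptions made for this example, not a description of the :class:`~fastNLP.Vocabulary` internals.

```python
from collections import Counter


def build_vocab(sentences, min_freq=2, padding="<pad>", unknown="<unk>"):
    """Map every word seen at least min_freq times to an integer index."""
    counts = Counter(w for sentence in sentences for w in sentence)
    word2idx = {padding: 0, unknown: 1}  # assumed reserved entries
    for word, freq in counts.most_common():
        if freq >= min_freq:
            word2idx[word] = len(word2idx)
    return word2idx


def index_words(words, word2idx, unknown="<unk>"):
    """Replace each word by its index; rare or unseen words map to unknown."""
    return [word2idx.get(w, word2idx[unknown]) for w in words]


sentences = [["a", "good", "movie"], ["a", "bad", "movie"]]
word2idx = build_vocab(sentences, min_freq=2)
indexed = [index_words(s, word2idx) for s in sentences]
print(word2idx)
print(indexed)
```

Words below the frequency threshold ("good" and "bad" here) share the index of the unknown token.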
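Splitting the dataset
Besides modifying a :mod:`~fastNLP.core.field` , we can also split a :class:`~fastNLP.DataSet` for training, development and testing.
The code below shows how to use :meth:`~fastNLP.DataSet.split` (in practice it should come after the next two snippets, which rename fields and set the inputs).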
.. code-block:: python
train_dev_data, test_data = dataset.split(0.1)
train_data, dev_data = train_dev_data.split(0.1)
len(train_data), len(dev_data), len(test_data)
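To make the sizes produced by two successive 0.1 splits concrete, here is a plain-Python sketch. The ``split`` helper below is hypothetical: whether fastNLP shuffles, and how it rounds the split point, are assumptions of this example.

```python
import random


def split(rows, ratio, seed=0):
    """Return (first, second), where second holds about ratio of the rows."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)  # assumed: the split is shuffled
    cut = int(len(rows) * ratio)      # assumed: the split point is truncated
    second = [rows[i] for i in idx[:cut]]
    first = [rows[i] for i in idx[cut:]]
    return first, second


rows = list(range(100))
train_dev, test = split(rows, 0.1)
train, dev = split(train_dev, 0.1)
sizes = (len(train), len(dev), len(test))
print(sizes)
```

With 100 rows, the first split holds out 10 rows for testing, and the second holds out 9 of the remaining 90 for development.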
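------------------------------
Training with a Built-in Model
------------------------------
Naming of inputs and outputs for built-in models
fastNLP ships with several complete neural network models (see :doc:`/fastNLP.models` ). We train with the :class:`~fastNLP.models.CNNText` model.
To use the built-in :class:`~fastNLP.models.CNNText` , we must rename the :mod:`~fastNLP.core.field` entries of the :class:`~fastNLP.DataSet` .
In this example the model inputs (the parameters of the forward method) are ``words`` and ``seq_len`` ; the predicted output is ``pred`` ; the gold label is ``target`` .
The naming conventions are described in :doc:`/fastNLP.core.const` .
If you would rather not consult the documentation, you can use the :class:`~fastNLP.Const` class for naming. The code below shows the :meth:`~fastNLP.DataSet.rename_field` method, which renames
a :mod:`~fastNLP.core.field` of a :class:`~fastNLP.DataSet` , and how to use the :class:`~fastNLP.Const` class.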
.. code-block:: python
from fastNLP import Const
dataset.rename_field('words', Const.INPUT)
dataset.rename_field('seq_len', Const.INPUT_LEN)
dataset.rename_field('target', Const.TARGET)
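After renaming the :mod:`~fastNLP.core.field` entries of the :class:`~fastNLP.DataSet` , we still need to set the inputs and targets used for training. This is done with
the two functions :meth:`~fastNLP.DataSet.set_input` and :meth:`~fastNLP.DataSet.set_target` .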
.. code-block:: python
dataset.set_input(Const.INPUT, Const.INPUT_LEN)
dataset.set_target(Const.TARGET)
Quick training
Now we can import fastNLP's built-in text classification model :class:`~fastNLP.models.CNNText` and train it with :class:`~fastNLP.Trainer`
(``loss`` and ``metrics`` are defined in the two code snippets that follow).
.. code-block:: python
from fastNLP.models import CNNText
from fastNLP import Trainer
model_cnn = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data,
loss=loss, metrics=metrics)
trainer.train()
The training output is as follows::
input fields after batch(if batch size is 2):
words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
target fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
training epochs started 2019-05-09-10-59-39
Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.333333
Evaluation at Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.533333
Evaluation at Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.533333
Evaluation at Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.533333
Evaluation at Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.6
Evaluation at Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.8
Evaluation at Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.8
Evaluation at Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.733333
Evaluation at Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.733333
Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.733333
In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.8
Reloaded the best model.
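Loss function
Training a model requires a loss function. Below is the cross-entropy loss commonly used for classification. Note its **initialization parameters** :
The ``pred`` parameter is the name of a key in the dict returned by the model's forward method.
The ``target`` parameter is the name of the :mod:`~fastNLP.core.field` in the :class:`~fastNLP.DataSet` that serves as the label.
Here we use :class:`~fastNLP.Const` to help with naming. If the return value of the forward method in your own model, or
the name of the :mod:`~fastNLP.core.field` in your dataset, differs from this example, set the ``pred`` and ``target`` parameters to the values that match your code.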
.. code-block:: python
from fastNLP import CrossEntropyLoss
# loss = CrossEntropyLoss() is equivalent to the line below in this example
loss = CrossEntropyLoss(pred=Const.OUTPUT, target=Const.TARGET)
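The idea of mapping the ``pred`` and ``target`` parameters onto differently named keys can be sketched in plain Python. The ``resolve_args`` helper and the key names ``output`` and ``label`` are hypothetical; this shows one way such a lookup could work and is not fastNLP's actual mechanism.

```python
def resolve_args(param_map, pred_dict, target_dict):
    """Look up each loss argument under the key configured for it."""
    resolved = {}
    for param, key in param_map.items():
        if key in pred_dict:        # value produced by the model's forward
            resolved[param] = pred_dict[key]
        elif key in target_dict:    # value taken from the dataset fields
            resolved[param] = target_dict[key]
        else:
            raise KeyError("no value found for {!r} (mapped to {!r})".format(param, key))
    return resolved


# A model whose forward returns {"output": ...} and a dataset whose label field is "label".
model_output = {"output": [0.1, 0.7, 0.2]}
batch_targets = {"label": 1}
args = resolve_args({"pred": "output", "target": "label"}, model_output, batch_targets)
print(args)
```

A mapping key that matches neither dict raises an error early, which is the kind of mismatch the naming conventions are meant to prevent.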
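Evaluation metric
Training a model requires an evaluation metric. Here we use accuracy. The parameter naming rules are the same as above:
The ``pred`` parameter is the name of a key in the dict returned by the model's forward method.
The ``target`` parameter is the name of the :mod:`~fastNLP.core.field` in the :class:`~fastNLP.DataSet` that serves as the label.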
.. code-block:: python
from fastNLP import AccuracyMetric
# metrics=AccuracyMetric() is equivalent to the line below in this example
metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)
Quick testing
As a counterpart to :class:`~fastNLP.Trainer` , fastNLP also provides :class:`~fastNLP.Tester` for quick testing. It is used as follows:
.. code-block:: python
from fastNLP import Tester
tester = Tester(test_data, model_cnn, metrics=AccuracyMetric())
tester.test()
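----------------------
Writing Your Own Model
----------------------
Because fastNLP is built on `PyTorch <https://pytorch.org/>`_ , we can write our own neural network models as PyTorch models.
Unlike a standard PyTorch model, the forward method of a fastNLP model returns a dict, which must contain at least the key "pred".
The parameter names of the forward method must match the names set with :meth:`~fastNLP.DataSet.set_input` on the :class:`~fastNLP.DataSet` .
The model is defined as follows: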
.. code-block:: python
import torch
import torch.nn as nn
class LSTMText(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, words):
        # (input) words : (batch_size, seq_len)
        words = words.permute(1,0)
        # words : (seq_len, batch_size)
        embedded = self.dropout(self.embedding(words))
        # embedded : (seq_len, batch_size, embedding_dim)
        output, (hidden, cell) = self.lstm(embedded)
        # output: (seq_len, batch_size, hidden_dim * 2)
        # hidden: (num_layers * 2, batch_size, hidden_dim)
        # cell: (num_layers * 2, batch_size, hidden_dim)
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        hidden = self.dropout(hidden)
        # hidden: (batch_size, hidden_dim * 2)
        pred = self.fc(hidden.squeeze(0))
        # result: (batch_size, output_dim)
        return {"pred":pred}
The model is used in the same way as the built-in :class:`~fastNLP.models.CNNText` model:
.. code-block:: python
model_lstm = LSTMText(len(vocab),50,5)
trainer = Trainer(model=model_lstm, train_data=train_data, dev_data=dev_data,
loss=loss, metrics=metrics)
trainer.train()
tester = Tester(test_data, model_lstm, metrics=AccuracyMetric())
tester.test()
.. todo::
Write the model using :doc:`/fastNLP.modules`
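------------------------------
Writing Your Own Training Loop
------------------------------
If you want to write the training loop yourself, PyTorch style, you can refer to the code below. It uses the :class:`~fastNLP.Batch` class provided by fastNLP
to obtain mini-batches for training, and passes a :class:`~fastNLP.BucketSampler` to :class:`~fastNLP.Batch` to choose the sampling strategy.
The code uses PyTorch's `torch.optim.Adam` optimizer and `torch.nn.CrossEntropyLoss` loss function, and computes the accuracy itself.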
.. code-block:: python
from fastNLP import BucketSampler
from fastNLP import Batch
import torch
import time
model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
def train(epoch, data):
    optim = torch.optim.Adam(model.parameters(), lr=0.001)
    lossfunc = torch.nn.CrossEntropyLoss()
    batch_size = 32
    train_sampler = BucketSampler(batch_size=batch_size, seq_len_field_name='seq_len')
    train_batch = Batch(batch_size=batch_size, dataset=data, sampler=train_sampler)
    start_time = time.time()
    for i in range(epoch):
        loss_list = []
        for batch_x, batch_y in train_batch:
            optim.zero_grad()
            output = model(batch_x['words'])
            loss = lossfunc(output['pred'], batch_y['target'])
            loss.backward()
            optim.step()
            loss_list.append(loss.item())
        print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=" ")
        print('{:d}ms'.format(round((time.time()-start_time)*1000)))
        loss_list.clear()
train(10, train_data)
tester = Tester(test_data, model, metrics=AccuracyMetric())
tester.test()
The output of this code is as follows::
Epoch 0 Avg Loss: 2.76 17ms
Epoch 1 Avg Loss: 2.55 29ms
Epoch 2 Avg Loss: 2.37 41ms
Epoch 3 Avg Loss: 2.30 53ms
Epoch 4 Avg Loss: 2.12 65ms
Epoch 5 Avg Loss: 2.16 76ms
Epoch 6 Avg Loss: 1.88 88ms
Epoch 7 Avg Loss: 1.84 99ms
Epoch 8 Avg Loss: 1.71 111ms
Epoch 9 Avg Loss: 1.62 122ms
[tester]
AccuracyMetric: acc=0.142857
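The training loop above draws its batches through a :class:`~fastNLP.BucketSampler` . The general idea of length-bucketed batching can be sketched in plain Python as follows. The ``bucket_batches`` function and its ``num_buckets`` parameter are hypothetical, and the sketch makes no claim about the exact algorithm fastNLP uses.

```python
import random


def bucket_batches(lengths, batch_size, num_buckets=2, seed=0):
    """Group sample indices so that each batch holds sequences of similar length."""
    rng = random.Random(seed)
    # Sort indices by sequence length, then cut the sorted order into buckets.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    bucket_size = -(-len(order) // num_buckets)  # ceiling division
    buckets = [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]
    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)  # randomness inside a bucket
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    rng.shuffle(batches)     # randomness across batches
    return batches


lengths = [3, 25, 4, 27, 5, 26, 2, 24]
batches = bucket_batches(lengths, batch_size=2)
for batch in batches:
    print([lengths[i] for i in batch])
```

Because short and long sequences never share a batch here, less padding is needed than with purely random batching.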
----------------------------------
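Enhancing Trainer with Callback
----------------------------------
If you do not want to implement a tedious training loop yourself and only want to add some behavior of your own during training (for example, printing the total time from the start of training to the end of the current batch),
you can use the :class:`~fastNLP.Callback` class provided by fastNLP. In the example below, we subclass :class:`~fastNLP.Callback` to implement this.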
.. code-block:: python
from fastNLP import Callback
start_time = time.time()
class MyCallback(Callback):
    def on_epoch_end(self):
        print('Sum Time: {:d}ms\n\n'.format(round((time.time()-start_time)*1000)))
model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
loss=CrossEntropyLoss(), metrics=AccuracyMetric(), callbacks=[MyCallback()])
trainer.train()
The training output is as follows::
input fields after batch(if batch size is 2):
words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 16])
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
target fields after batch(if batch size is 2):
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
training epochs started 2019-05-12-21-38-40
Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.285714
Sum Time: 51ms
…………………………
Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.857143
Sum Time: 212ms
In Epoch:10/Step:20, got best dev performance:AccuracyMetric: acc=0.857143
Reloaded the best model.
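This example only shows how to use the :class:`~fastNLP.Callback` class. Many features needed in practice (such as negative sampling, learning rate decay and early stopping)
are already implemented in fastNLP. You can import and use them directly; see :doc:`/fastNLP.core.callback` for details.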
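The hook mechanism behind such callbacks can be sketched without fastNLP. ``ToyCallback`` , ``TimerCallback`` and ``toy_train`` are hypothetical names used only for this illustration; the real :class:`~fastNLP.Trainer` offers more hooks and does actual training between them.

```python
import time


class ToyCallback:
    """Base class: the training loop calls these hooks at fixed points."""

    def on_epoch_end(self):
        pass


class TimerCallback(ToyCallback):
    """Records the elapsed time in milliseconds at the end of every epoch."""

    def __init__(self):
        self.start = time.time()
        self.elapsed_ms = []

    def on_epoch_end(self):
        self.elapsed_ms.append(round((time.time() - self.start) * 1000))


def toy_train(n_epochs, callbacks):
    for _ in range(n_epochs):
        # ... one epoch of training would run here ...
        for callback in callbacks:
            callback.on_epoch_end()


timer = TimerCallback()
toy_train(3, [timer])
print(timer.elapsed_ms)
```

The training loop knows nothing about timing; it only promises to call ``on_epoch_end`` , which is what lets user code plug in without rewriting the loop.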

View File

@ -0,0 +1,18 @@
==========================
fastNLP Detailed Tutorials
==========================
.. toctree::
:maxdepth: 1
1. Preprocessing text with DataSet </tutorials/tutorial_1_data_preprocess>
2. Loading datasets with DataSetLoader </tutorials/tutorial_2_load_dataset>
3. Turning text into vectors with the Embedding module </tutorials/tutorial_3_embedding>
4. Building a text classifier I: quick training and testing with Trainer and Tester </tutorials/tutorial_4_loss_optimizer>
5. Building a text classifier II: a custom training loop with DataSetIter </tutorials/tutorial_5_datasetiter>
6. Quickly building a sequence labeling model </tutorials/tutorial_6_seq_labeling>
7. Quickly building custom models with Modules and Models </tutorials/tutorial_7_modules_models>
8. Quickly evaluating your model with Metric </tutorials/tutorial_8_metrics>
9. Customizing your training process with Callback </tutorials/tutorial_9_callback>
10. Using fitlog with fastNLP for research </tutorials/tutorial_10_fitlog>

View File

@ -37,7 +37,7 @@ __all__ = [
"AccuracyMetric",
"SpanFPreRecMetric",
"SQuADMetric",
"ExtractiveQAMetric",
"Optimizer",
"SGD",
@ -56,8 +56,9 @@ __all__ = [
"cache_results"
]
__version__ = '0.4.0'
__version__ = '0.4.5'
from .core import *
from . import models
from . import modules
from .io import data_loader

View File

@ -21,7 +21,7 @@ from .dataset import DataSet
from .field import FieldArray, Padder, AutoPadder, EngChar2DPadder
from .instance import Instance
from .losses import LossFunc, CrossEntropyLoss, L1Loss, BCELoss, NLLLoss, LossInForward
from .metrics import AccuracyMetric, SpanFPreRecMetric, SQuADMetric
from .metrics import AccuracyMetric, SpanFPreRecMetric, ExtractiveQAMetric
from .optimizer import Optimizer, SGD, Adam
from .sampler import SequentialSampler, BucketSampler, RandomSampler, Sampler
from .tester import Tester

View File

@ -448,10 +448,10 @@ class FitlogCallback(Callback):
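and writes the evaluation results to fitlog. The results on these datasets are reported according to the best result on dev: if dev achieves its best result at epoch 3,
the results recorded in fitlog for these datasets are those from epoch 3.
:param DataSet,dict(DataSet) data: a DataSet object, which is evaluated with the metrics from the Trainer. To pass multiple
:param ~fastNLP.DataSet,dict(~fastNLP.DataSet) data: a DataSet object, which is evaluated with the metrics from the Trainer. To pass multiple
DataSets, pass them as a dict; each key is passed to fitlog as the name of the corresponding dataset. If tester is not None, data must be passed
as a dict. If only a single DataSet is passed, it is named test.
:param Tester tester: a Tester object, called in on_valid_end. The DataSet in the tester is named `test`
:param ~fastNLP.Tester tester: a Tester object, called in on_valid_end. The DataSet in the tester is named `test`
:param int log_loss_every: record the loss every this many steps (the recorded value is the average loss over those batches). For large datasets, set this
to a larger value, otherwise the log file becomes huge. Defaults to 0, meaning the loss is not recorded.
:param int verbose: whether to print evaluation results to the terminal; 0 means do not print.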
@ -674,7 +674,7 @@ class TensorboardCallback(Callback):
.. warning::
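fastNLP has stopped maintaining this feature. Please wait for the next fastNLP release that is compatible with PyTorch 1.1,
or use fitlog, which works closely with fastNLP; see :doc:`/user/with_fitlog`
or use fitlog, which works closely with fastNLP; see :doc:`/tutorials/tutorial_10_fitlog`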
"""

View File

@ -78,19 +78,7 @@
sent, label = line.strip().split('\t')
dataset.append(Instance(sentence=sent, label=label))
2.2 Indexing; the result is a shallow copy of the DataSet object
Example::
import numpy as np
from fastNLP import DataSet
dataset = DataSet({'a': np.arange(10), 'b': [[_] for _ in range(10)]})
dataset[0] # get one instance with a single index
>>{'a': 0 type=int,'b': [0] type=list} # an instance
dataset[1:3] # get a new DataSet with a slice
>>DataSet({'a': 1 type=int, 'b': [1] type=list}, {'a': 2 type=int, 'b': [2] type=list})
2.3 Processing the contents of a DataSet
2.2 Processing the contents of a DataSet
Example::
@ -108,7 +96,7 @@
return words
dataset.apply(get_words, new_field_name='words')
2.4 Deleting contents of a DataSet
2.3 Deleting contents of a DataSet
Example::
@ -124,14 +112,14 @@
dataset.delete_field('a')
2.5 Iterating over the contents of a DataSet
2.4 Iterating over the contents of a DataSet
Example::
for instance in dataset:
# do something
2.6 Other operations
2.5 Other operations
Example::

View File

@ -6,7 +6,7 @@ __all__ = [
"MetricBase",
"AccuracyMetric",
"SpanFPreRecMetric",
"SQuADMetric"
"ExtractiveQAMetric"
]
import inspect
@ -24,6 +24,7 @@ from .utils import seq_len_to_mask
from .vocabulary import Vocabulary
from abc import abstractmethod
class MetricBase(object):
"""
Base class of all metrics. Every Metric passed to Trainer or Tester must inherit from it and override the evaluate() and get_metric() methods.
@ -735,11 +736,11 @@ def _pred_topk(y_prob, k=1):
return y_pred_topk, y_prob_topk
class SQuADMetric(MetricBase):
class ExtractiveQAMetric(MetricBase):
r"""
Alias: :class:`fastNLP.SQuADMetric` :class:`fastNLP.core.metrics.SQuADMetric`
Alias: :class:`fastNLP.ExtractiveQAMetric` :class:`fastNLP.core.metrics.ExtractiveQAMetric`
Metric for the SQuAD dataset
Metric for extractive QA (such as SQuAD).
:param pred1: mapping of `pred1` in the parameter map; None means the mapping `pred1` -> `pred1`
:param pred2: mapping of `pred2` in the parameter map; None means the mapping `pred2` -> `pred2`
@ -755,7 +756,7 @@ class SQuADMetric(MetricBase):
def __init__(self, pred1=None, pred2=None, target1=None, target2=None,
beta=1, right_open=True, print_predict_stat=False):
super(SQuADMetric, self).__init__()
super(ExtractiveQAMetric, self).__init__()
self._init_param_map(pred1=pred1, pred2=pred2, target1=target1, target2=target2)

View File

@ -91,47 +91,84 @@ class Vocabulary(object):
self.idx2word = None
self.rebuild = True
# Holds words that do not need an entry of their own; see the from_dataset() method for details
self._no_create_word = defaultdict(int)
self._no_create_word = Counter()
@_check_build_status
def update(self, word_lst):
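def update(self, word_lst, no_create_entry=False):
"""Increase, in order, the frequency in the vocabulary of each word in the sequence
:param list word_lst: a list of strings
:param bool no_create_entry: how to handle a word that is not found in the pretrained vocabulary when a pretrained model is loaded with fastNLP.TokenEmbedding.
If True, no separate entry is created for the word and it always points to the unk representation; if False, a separate
entry is created for the word. Words from dev or test are usually added with True, and words from train with False. Two special cases: if a newly
added word has no_create_entry=True but is already in the Vocabulary and was not marked no_create_entry, a separate
vector is still created for it; if no_create_entry is False but the word is already in the Vocabulary and was marked no_create_entry,
the word is then treated as needing its own vector.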
"""
self._add_no_create_entry(word_lst, no_create_entry)
self.word_count.update(word_lst)
@_check_build_status
def add(self, word):
def add(self, word, no_create_entry=False):
"""
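Increase the frequency of a new word in the vocabulary
:param str word: the new word
:param bool no_create_entry: how to handle a word that is not found in the pretrained vocabulary when a pretrained model is loaded with fastNLP.TokenEmbedding.
If True, no separate entry is created for the word and it always points to the unk representation; if False, a separate
entry is created for the word. Words from dev or test are usually added with True, and words from train with False. Two special cases: if a newly
added word has no_create_entry=True but is already in the Vocabulary and was not marked no_create_entry, a separate
vector is still created for it; if no_create_entry is False but the word is already in the Vocabulary and was marked no_create_entry,
the word is then treated as needing its own vector.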
"""
self._add_no_create_entry(word, no_create_entry)
self.word_count[word] += 1
def _add_no_create_entry(self, word, no_create_entry):
"""
Check the _no_create_word setting when a new word is added
:param str, List[str] word:
:param bool no_create_entry:
:return:
"""
if isinstance(word, str):
word = [word]
for w in word:
if no_create_entry and self.word_count.get(w, 0) == self._no_create_word.get(w, 0):
self._no_create_word[w] += 1
elif not no_create_entry and w in self._no_create_word:
self._no_create_word.pop(w)
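The bookkeeping performed by ``_add_no_create_entry`` is easier to follow on a small stand-alone example. The ``ToyVocab`` class below is a hypothetical reduction that copies only the two branches shown above; it omits everything else the real Vocabulary does.

```python
from collections import Counter


class ToyVocab:
    """Keeps only the two counters involved in the no_create_entry logic."""

    def __init__(self):
        self.word_count = Counter()
        self._no_create_word = Counter()

    def add(self, word, no_create_entry=False):
        # Mark the word only if every earlier occurrence was also no_create_entry
        # (both counts are 0 for a word that has never been seen).
        if no_create_entry and self.word_count.get(word, 0) == self._no_create_word.get(word, 0):
            self._no_create_word[word] += 1
        # A regular occurrence removes any earlier no_create_entry mark.
        elif not no_create_entry and word in self._no_create_word:
            self._no_create_word.pop(word)
        self.word_count[word] += 1


v = ToyVocab()
v.add("apple")                        # regular occurrence, e.g. from train
v.add("apple", no_create_entry=True)  # counts differ (1 vs 0), so not marked
v.add("pear", no_create_entry=True)   # never seen before, so marked
v.add("kiwi", no_create_entry=True)   # marked ...
v.add("kiwi")                         # ... then unmarked by a regular occurrence
print(dict(v._no_create_word))
```

Only "pear" ends up without an entry of its own, since it is the only word that was never added as a regular occurrence.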
@_check_build_status
def add_word(self, word):
def add_word(self, word, no_create_entry=False):
"""
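Increase the frequency of a new word in the vocabulary
:param str word: the new word
:param bool no_create_entry: how to handle a word that is not found in the pretrained vocabulary when a pretrained model is loaded with fastNLP.TokenEmbedding.
If True, no separate entry is created for the word and it always points to the unk representation; if False, a separate
entry is created for the word. Words from dev or test are usually added with True, and words from train with False. Two special cases: if a newly
added word has no_create_entry=True but is already in the Vocabulary and was not marked no_create_entry, a separate
vector is still created for it; if no_create_entry is False but the word is already in the Vocabulary and was marked no_create_entry,
the word is then treated as needing its own vector.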
"""
if word in self._no_create_word:
self._no_create_word.pop(word)
self.add(word)
self.add(word, no_create_entry=no_create_entry)
@_check_build_status
def add_word_lst(self, word_lst):
def add_word_lst(self, word_lst, no_create_entry=False):
"""
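Increase, in order, the frequency in the vocabulary of each word in the sequence
:param list[str] word_lst: the sequence of words
:param bool no_create_entry: how to handle a word that is not found in the pretrained vocabulary when a pretrained model is loaded with fastNLP.TokenEmbedding.
If True, no separate entry is created for the word and it always points to the unk representation; if False, a separate
entry is created for the word. Words from dev or test are usually added with True, and words from train with False. Two special cases: if a newly
added word has no_create_entry=True but is already in the Vocabulary and was not marked no_create_entry, a separate
vector is still created for it; if no_create_entry is False but the word is already in the Vocabulary and was marked no_create_entry,
the word is then treated as needing its own vector.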
"""
for word in word_lst:
if word in self._no_create_word:
self._no_create_word.pop(word)
self.update(word_lst)
self.update(word_lst, no_create_entry=no_create_entry)
def build_vocab(self):
"""
@ -283,23 +320,17 @@ class Vocabulary(object):
for fn in field_name:
field = ins[fn]
if isinstance(field, str):
if no_create_entry and field not in self.word_count:
self._no_create_word[field] += 1
self.add_word(field)
self.add_word(field, no_create_entry=no_create_entry)
elif isinstance(field, (list, np.ndarray)):
if not isinstance(field[0], (list, np.ndarray)):
for word in field:
if no_create_entry and word not in self.word_count:
self._no_create_word[word] += 1
self.add_word(word)
self.add_word(word, no_create_entry=no_create_entry)
else:
if isinstance(field[0][0], (list, np.ndarray)):
raise RuntimeError("Only support field with 2 dimensions.")
for words in field:
for word in words:
if no_create_entry and word not in self.word_count:
self._no_create_word[word] += 1
self.add_word(word)
self.add_word(word, no_create_entry=no_create_entry)
for idx, dataset in enumerate(datasets):
if isinstance(dataset, DataSet):

View File

@ -12,22 +12,22 @@
__all__ = [
'EmbedLoader',
'DataInfo',
'DataBundle',
'DataSetLoader',
'CSVLoader',
'JsonLoader',
'ConllLoader',
'PeopleDailyCorpusLoader',
'Conll2003Loader',
'ModelLoader',
'ModelSaver',
'SSTLoader',
'ConllLoader',
'Conll2003Loader',
'MatchingLoader',
'PeopleDailyCorpusLoader',
'SNLILoader',
'SSTLoader',
'SST2Loader',
'MNLILoader',
'QNLILoader',
'QuoraLoader',
@ -35,11 +35,8 @@ __all__ = [
]
from .embed_loader import EmbedLoader
from .base_loader import DataInfo, DataSetLoader
from .dataset_loader import CSVLoader, JsonLoader, ConllLoader, \
PeopleDailyCorpusLoader, Conll2003Loader
from .base_loader import DataBundle, DataSetLoader
from .dataset_loader import CSVLoader, JsonLoader
from .model_io import ModelLoader, ModelSaver
from .data_loader.sst import SSTLoader
from .data_loader.matching import MatchingLoader, SNLILoader, \
MNLILoader, QNLILoader, QuoraLoader, RTELoader
from .data_loader import *

View File

@ -1,6 +1,6 @@
__all__ = [
"BaseLoader",
'DataInfo',
'DataBundle',
'DataSetLoader',
]
@ -109,7 +109,7 @@ def _uncompress(src, dst):
raise ValueError('unsupported file {}'.format(src))
class DataInfo:
class DataBundle:
"""
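Processed data information, including a series of datasets (for example separate training, validation and test sets) together with the vocabularies and word embeddings they use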
@ -201,20 +201,20 @@ class DataSetLoader:
"""
raise NotImplementedError
def process(self, paths: Union[str, Dict[str, str]], **options) -> DataInfo:
def process(self, paths: Union[str, Dict[str, str]], **options) -> DataBundle:
"""
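For a specific task and dataset, read and process the data, returning the processed DataInfo object or a dict
Data is read from files under one or more specified paths. A DataInfo object can contain one or more datasets
When multiple paths are processed, the keys of the dict passed in match the keys of the dict in the returned DataInfo
The returned :class:`DataInfo` object has the following attributes:
The returned :class:`DataBundle` object has the following attributes:
- vocabs: a dict made up of the vocabularies obtained from the datasets, each vocabulary
- datasets: a dict containing a series of :class:`~fastNLP.DataSet` objects, whose field names follow :mod:`~fastNLP.core.const`
:param paths: the paths to read the raw data from
:param options: parameters designed for the particular task and dataset
:return: a DataInfo
:return: a DataBundle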
"""
raise NotImplementedError

View File

@ -4,16 +4,32 @@
These modules are used as follows:
"""
__all__ = [
'SSTLoader',
'ConllLoader',
'Conll2003Loader',
'IMDBLoader',
'MatchingLoader',
'SNLILoader',
'MNLILoader',
'MTL16Loader',
'PeopleDailyCorpusLoader',
'QNLILoader',
'QuoraLoader',
'RTELoader',
'SSTLoader',
'SST2Loader',
'SNLILoader',
'YelpLoader',
]
from .sst import SSTLoader
from .matching import MatchingLoader, SNLILoader, \
MNLILoader, QNLILoader, QuoraLoader, RTELoader
from .conll import ConllLoader, Conll2003Loader
from .imdb import IMDBLoader
from .matching import MatchingLoader
from .mnli import MNLILoader
from .mtl import MTL16Loader
from .people_daily import PeopleDailyCorpusLoader
from .qnli import QNLILoader
from .quora import QuoraLoader
from .rte import RTELoader
from .snli import SNLILoader
from .sst import SSTLoader, SST2Loader
from .yelp import YelpLoader

View File

@ -0,0 +1,73 @@
from ...core.dataset import DataSet
from ...core.instance import Instance
from ..base_loader import DataSetLoader
from ..file_reader import _read_conll
class ConllLoader(DataSetLoader):
"""
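Alias: :class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.data_loader.ConllLoader`
Reads data in CoNLL format. See http://conll.cemantix.org/2012/data.html for the format. Lines starting with "-DOCSTART-" are ignored, because
this marker is used as a document separator in CoNLL 2003
Column numbers start from 0. The content of each column is::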
Column Type
0 Document ID
1 Part number
2 Word number
3 Word itself
4 Part-of-Speech
5 Parse bit
6 Predicate lemma
7 Predicate Frameset ID
8 Word sense
9 Speaker/Author
10 Named Entities
11:N Predicate Arguments
N Coreference
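:param headers: the name of each data column; must be a list or tuple of str. ``header`` corresponds one-to-one with ``indexes``
:param indexes: indices of the data columns to keep, starting from 0. If ``None`` , all columns are kept. Default: ``None``
:param dropna: whether to ignore invalid data. If ``False`` , a ``ValueError`` is raised when invalid data is encountered. Default: ``False``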
"""
def __init__(self, headers, indexes=None, dropna=False):
super(ConllLoader, self).__init__()
if not isinstance(headers, (list, tuple)):
raise TypeError(
'invalid headers: {}, should be list of strings'.format(headers))
self.headers = headers
self.dropna = dropna
if indexes is None:
self.indexes = list(range(len(self.headers)))
else:
if len(indexes) != len(headers):
raise ValueError
self.indexes = indexes
def _load(self, path):
ds = DataSet()
for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna):
ins = {h: data[i] for i, h in enumerate(self.headers)}
ds.append(Instance(**ins))
return ds
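How ``headers`` and ``indexes`` select columns from CoNLL-style text can be sketched in plain Python. ``read_conll_columns`` is a hypothetical helper written for this illustration; it is not the internal ``_read_conll`` function and ignores its error handling.

```python
def read_conll_columns(text, indexes, headers):
    """Collect the selected columns of each blank-line separated sentence."""
    samples, rows = [], []

    def flush():
        # Turn the rows of one sentence into {header: [column values]}.
        if rows:
            columns = list(zip(*rows))
            samples.append({h: list(columns[i]) for h, i in zip(headers, indexes)})
            rows.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            flush()                      # a blank line ends the sentence
        elif line.startswith("-DOCSTART-"):
            continue                     # document separator, ignored
        else:
            rows.append(line.split())
    flush()
    return samples


text = "-DOCSTART- -X- -X- O\n\nEU NNP B-NP B-ORG\nrejects VBZ B-VP O\n\nPeter NNP B-NP B-PER\n"
samples = read_conll_columns(text, indexes=[0, 3], headers=["tokens", "ner"])
print(samples)
```

Each header names exactly one kept column, which is why the loader above rejects ``indexes`` and ``headers`` of different lengths.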
class Conll2003Loader(ConllLoader):
"""
Alias: :class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.dataset_loader.Conll2003Loader`
Reads CoNLL 2003 data
For more information about the dataset, see:
https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data
"""
def __init__(self):
headers = [
'tokens', 'pos', 'chunks', 'ner',
]
super(Conll2003Loader, self).__init__(headers=headers)

View File

@ -0,0 +1,96 @@
from typing import Union, Dict
from ..embed_loader import EmbeddingOption, EmbedLoader
from ..base_loader import DataSetLoader, DataBundle
from ...core.vocabulary import VocabularyOption, Vocabulary
from ...core.dataset import DataSet
from ...core.instance import Instance
from ...core.const import Const
from ..utils import get_tokenizer
class IMDBLoader(DataSetLoader):
"""
Reads the IMDB dataset. The DataSet contains the following fields:
words: list(str), the text to classify
target: str, the label of the text
"""
def __init__(self):
super(IMDBLoader, self).__init__()
self.tokenizer = get_tokenizer()
def _load(self, path):
dataset = DataSet()
with open(path, 'r', encoding="utf-8") as f:
for line in f:
line = line.strip()
if not line:
continue
parts = line.split('\t')
target = parts[0]
words = self.tokenizer(parts[1].lower())
dataset.append(Instance(words=words, target=target))
if len(dataset) == 0:
raise RuntimeError(f"{path} has no valid data.")
return dataset
def process(self,
paths: Union[str, Dict[str, str]],
src_vocab_opt: VocabularyOption = None,
tgt_vocab_opt: VocabularyOption = None,
char_level_op=False):
datasets = {}
info = DataBundle()
for name, path in paths.items():
dataset = self.load(path)
datasets[name] = dataset
def wordtochar(words):
chars = []
for word in words:
word = word.lower()
for char in word:
chars.append(char)
chars.append('')
chars.pop()
return chars
if char_level_op:
for dataset in datasets.values():
dataset.apply_field(wordtochar, field_name="words", new_field_name='chars')
datasets["train"], datasets["dev"] = datasets["train"].split(0.1, shuffle=False)
src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
src_vocab.from_dataset(datasets['train'], field_name='words')
src_vocab.index_dataset(*datasets.values(), field_name='words')
tgt_vocab = Vocabulary(unknown=None, padding=None) \
if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
tgt_vocab.from_dataset(datasets['train'], field_name='target')
tgt_vocab.index_dataset(*datasets.values(), field_name='target')
info.vocabs = {
Const.INPUT: src_vocab,
Const.TARGET: tgt_vocab
}
info.datasets = datasets
for name, dataset in info.datasets.items():
dataset.set_input(Const.INPUT)
dataset.set_target(Const.TARGET)
return info
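The ``wordtochar`` helper used in ``process`` when ``char_level_op`` is set can be tried on its own. The copy below assumes the indentation of the original source (flattened in this view): the empty-string separator is appended once per word, after its characters.

```python
def wordtochar(words):
    """Flatten a list of words into characters, separated by empty strings."""
    chars = []
    for word in words:
        word = word.lower()
        for char in word:
            chars.append(char)
        chars.append('')   # separator after each word (assumed placement)
    chars.pop()            # drop the trailing separator
    return chars


chars = wordtochar(["Hi", "you"])
print(chars)
```

Note that an empty word list would make ``chars.pop()`` raise an ``IndexError`` , so the helper relies on every sample having at least one word.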

View File

@ -1,18 +1,17 @@
import os
from typing import Union, Dict
from typing import Union, Dict, List
from ...core.const import Const
from ...core.vocabulary import Vocabulary
from ..base_loader import DataInfo, DataSetLoader
from ..dataset_loader import JsonLoader, CSVLoader
from ..base_loader import DataBundle, DataSetLoader
from ..file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
from ...modules.encoder._bert import BertTokenizer
class MatchingLoader(DataSetLoader):
"""
Alias: :class:`fastNLP.io.MatchingLoader` :class:`fastNLP.io.dataset_loader.MatchingLoader`
Alias: :class:`fastNLP.io.MatchingLoader` :class:`fastNLP.io.data_loader.MatchingLoader`
Reads datasets for Matching tasks
@ -34,7 +33,8 @@ class MatchingLoader(DataSetLoader):
to_lower=False, seq_len_type: str=None, bert_tokenizer: str=None,
cut_text: int = None, get_index=True, auto_pad_length: int=None,
auto_pad_token: str='<pad>', set_input: Union[list, str, bool]=True,
set_target: Union[list, str, bool] = True, concat: Union[str, list, bool]=None, ) -> DataInfo:
set_target: Union[list, str, bool]=True, concat: Union[str, list, bool]=None,
extra_split: List[str]=None, ) -> DataBundle:
"""
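:param paths: str or Dict[str, str]. If a str, it is the folder containing the dataset or a full file path; if it is a folder,
the dataset names and file names are looked up in self.paths. If a Dict, it maps dataset names (such as train, dev, test)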
@ -57,6 +57,7 @@ class MatchingLoader(DataSetLoader):
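:param concat: whether to concatenate the two sentences. If False, they are not concatenated. If True, a <sep> is inserted between the two sentences.
If a list of length 4 is passed, its items are the markers inserted before the first sentence, after the first sentence, before the second sentence and after the second sentence. If
the string ``bert`` is passed, BERT-style concatenation is used, equivalent to ['[CLS]', '[SEP]', '', '[SEP]'].
:param extra_split: extra separators, that is, characters other than the space that are used for tokenization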
:return:
"""
if isinstance(set_input, str):
@ -79,7 +80,7 @@ class MatchingLoader(DataSetLoader):
else:
path = paths
data_info = DataInfo()
data_info = DataBundle()
for data_name in path.keys():
data_info.datasets[data_name] = self._load(path[data_name])
@ -90,6 +91,24 @@ class MatchingLoader(DataSetLoader):
if Const.TARGET in data_set.get_field_names():
data_set.set_target(Const.TARGET)
if extra_split is not None:
for data_name, data_set in data_info.datasets.items():
data_set.apply(lambda x: ' '.join(x[Const.INPUTS(0)]), new_field_name=Const.INPUTS(0))
data_set.apply(lambda x: ' '.join(x[Const.INPUTS(1)]), new_field_name=Const.INPUTS(1))
for s in extra_split:
data_set.apply(lambda x: x[Const.INPUTS(0)].replace(s, ' ' + s + ' '),
new_field_name=Const.INPUTS(0))
data_set.apply(lambda x: x[Const.INPUTS(1)].replace(s, ' ' + s + ' '),
new_field_name=Const.INPUTS(1))
_filt = lambda x: x
data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(0)].split(' '))),
new_field_name=Const.INPUTS(0), is_input=auto_set_input)
data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(1)].split(' '))),
new_field_name=Const.INPUTS(1), is_input=auto_set_input)
_filt = None
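The effect of ``extra_split`` on one sentence can be reproduced in plain Python. ``split_with_extra`` is a hypothetical helper that follows the same three steps as the block above (join, pad each separator with spaces, split and drop empty strings); the sample sentence is made up.

```python
def split_with_extra(words, extra_split):
    """Re-tokenize words so that each extra separator becomes its own token."""
    text = ' '.join(words)
    for s in extra_split:
        text = text.replace(s, ' ' + s + ' ')
    # Splitting on single spaces leaves empty strings, which are filtered out.
    return [w for w in text.split(' ') if w]


tokens = split_with_extra(["well-known", "idea,", "right?"], ["-", ",", "?"])
print(tokens)
```

Filtering out the empty strings is what ``filter(_filt, ...)`` with the identity function achieves in the loader.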
if to_lower:
for data_name, data_set in data_info.datasets.items():
data_set.apply(lambda x: [w.lower() for w in x[Const.INPUTS(0)]], new_field_name=Const.INPUTS(0),
@ -227,204 +246,3 @@ class MatchingLoader(DataSetLoader):
data_set.set_target(*[target for target in set_target if target in data_set.get_field_names()])
return data_info
class SNLILoader(MatchingLoader, JsonLoader):
"""
Alias: :class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.dataset_loader.SNLILoader`
Reads the SNLI dataset. The resulting DataSet contains the fields::
words1: list(str), the first sentence (premise)
words2: list(str), the second sentence (hypothesis)
target: str, the gold label
Data source: https://nlp.stanford.edu/projects/snli/snli_1.0.zip
"""
def __init__(self, paths: dict=None):
fields = {
'sentence1_binary_parse': Const.INPUTS(0),
'sentence2_binary_parse': Const.INPUTS(1),
'gold_label': Const.TARGET,
}
paths = paths if paths is not None else {
'train': 'snli_1.0_train.jsonl',
'dev': 'snli_1.0_dev.jsonl',
'test': 'snli_1.0_test.jsonl'}
MatchingLoader.__init__(self, paths=paths)
JsonLoader.__init__(self, fields=fields)
def _load(self, path):
ds = JsonLoader._load(self, path)
parentheses_table = str.maketrans({'(': None, ')': None})
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
new_field_name=Const.INPUTS(0))
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
new_field_name=Const.INPUTS(1))
ds.drop(lambda x: x[Const.TARGET] == '-')
return ds
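The translation table in ``_load`` turns a binary parse string into a plain token list. The short sketch below applies the same calls to a made-up parse string.

```python
# Map both parenthesis characters to None, i.e. delete them.
parentheses_table = str.maketrans({'(': None, ')': None})

parse = "( ( A man ) ( is running ) )"  # made-up binary parse
tokens = parse.translate(parentheses_table).strip().split()
print(tokens)
```

Since ``split()`` without arguments collapses runs of whitespace, the spaces left behind by the removed parentheses do not produce empty tokens.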
class RTELoader(MatchingLoader, CSVLoader):
"""
Alias: :class:`fastNLP.io.RTELoader` :class:`fastNLP.io.dataset_loader.RTELoader`
Reads the RTE dataset. The resulting DataSet contains the fields::
words1: list(str), the first sentence (premise)
words2: list(str), the second sentence (hypothesis)
target: str, the gold label
Data source:
"""
def __init__(self, paths: dict=None):
paths = paths if paths is not None else {
'train': 'train.tsv',
'dev': 'dev.tsv',
'test': 'test.tsv' # the test set has no labels
}
MatchingLoader.__init__(self, paths=paths)
self.fields = {
'sentence1': Const.INPUTS(0),
'sentence2': Const.INPUTS(1),
'label': Const.TARGET,
}
CSVLoader.__init__(self, sep='\t')
def _load(self, path):
ds = CSVLoader._load(self, path)
for k, v in self.fields.items():
if k in ds.get_field_names():
ds.rename_field(k, v)
for fields in ds.get_all_fields():
if Const.INPUT in fields:
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
return ds
class QNLILoader(MatchingLoader, CSVLoader):
"""
Alias: :class:`fastNLP.io.QNLILoader` :class:`fastNLP.io.dataset_loader.QNLILoader`
Reads the QNLI dataset. The resulting DataSet contains the fields::
words1: list(str), the first sentence (premise)
words2: list(str), the second sentence (hypothesis)
target: str, the gold label
Data source:
"""
def __init__(self, paths: dict=None):
paths = paths if paths is not None else {
'train': 'train.tsv',
'dev': 'dev.tsv',
'test': 'test.tsv' # the test set has no labels
}
MatchingLoader.__init__(self, paths=paths)
self.fields = {
'question': Const.INPUTS(0),
'sentence': Const.INPUTS(1),
'label': Const.TARGET,
}
CSVLoader.__init__(self, sep='\t')
def _load(self, path):
ds = CSVLoader._load(self, path)
for k, v in self.fields.items():
if k in ds.get_field_names():
ds.rename_field(k, v)
for fields in ds.get_all_fields():
if Const.INPUT in fields:
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
return ds
class MNLILoader(MatchingLoader, CSVLoader):
"""
Alias: :class:`fastNLP.io.MNLILoader` :class:`fastNLP.io.dataset_loader.MNLILoader`
Reads the MNLI dataset. The resulting DataSet contains the fields::
words1: list(str), the first sentence (premise)
words2: list(str), the second sentence (hypothesis)
target: str, the gold label
Data source:
"""
def __init__(self, paths: dict=None):
paths = paths if paths is not None else {
'train': 'train.tsv',
'dev_matched': 'dev_matched.tsv',
'dev_mismatched': 'dev_mismatched.tsv',
'test_matched': 'test_matched.tsv',
'test_mismatched': 'test_mismatched.tsv',
# 'test_0.9_matched': 'multinli_0.9_test_matched_unlabeled.txt',
# 'test_0.9_mismatched': 'multinli_0.9_test_mismatched_unlabeled.txt',
# test_0.9_matched and test_0.9_mismatched come from MNLI version 0.9 (source: Kaggle)
}
MatchingLoader.__init__(self, paths=paths)
CSVLoader.__init__(self, sep='\t')
self.fields = {
'sentence1_binary_parse': Const.INPUTS(0),
'sentence2_binary_parse': Const.INPUTS(1),
'gold_label': Const.TARGET,
}
def _load(self, path):
ds = CSVLoader._load(self, path)
for k, v in self.fields.items():
if k in ds.get_field_names():
ds.rename_field(k, v)
if Const.TARGET in ds.get_field_names():
if ds[0][Const.TARGET] == 'hidden':
ds.delete_field(Const.TARGET)
parentheses_table = str.maketrans({'(': None, ')': None})
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
new_field_name=Const.INPUTS(0))
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
new_field_name=Const.INPUTS(1))
if Const.TARGET in ds.get_field_names():
ds.drop(lambda x: x[Const.TARGET] == '-')
return ds
class QuoraLoader(MatchingLoader, CSVLoader):
"""
Alias: :class:`fastNLP.io.QuoraLoader` :class:`fastNLP.io.dataset_loader.QuoraLoader`
Load the Quora dataset. The resulting DataSet contains the fields::
words1: list(str), the first sentence (premise)
words2: list(str), the second sentence (hypothesis)
target: str, the gold label
Data source:
"""
def __init__(self, paths: dict=None):
paths = paths if paths is not None else {
'train': 'train.tsv',
'dev': 'dev.tsv',
'test': 'test.tsv',
}
MatchingLoader.__init__(self, paths=paths)
CSVLoader.__init__(self, sep='\t', headers=(Const.TARGET, Const.INPUTS(0), Const.INPUTS(1), 'pairID'))
def _load(self, path):
ds = CSVLoader._load(self, path)
return ds

View File

@ -0,0 +1,60 @@
from ...core.const import Const
from .matching import MatchingLoader
from ..dataset_loader import CSVLoader
class MNLILoader(MatchingLoader, CSVLoader):
    """
    Alias: :class:`fastNLP.io.MNLILoader` :class:`fastNLP.io.data_loader.MNLILoader`

    Load the MNLI dataset. The resulting DataSet contains the fields::

        words1: list(str), the first sentence (premise)
        words2: list(str), the second sentence (hypothesis)
        target: str, the gold label

    Data source:
    """

    def __init__(self, paths: dict=None):
        paths = paths if paths is not None else {
            'train': 'train.tsv',
            'dev_matched': 'dev_matched.tsv',
            'dev_mismatched': 'dev_mismatched.tsv',
            'test_matched': 'test_matched.tsv',
            'test_mismatched': 'test_mismatched.tsv',
            # 'test_0.9_matched': 'multinli_0.9_test_matched_unlabeled.txt',
            # 'test_0.9_mismatched': 'multinli_0.9_test_mismatched_unlabeled.txt',
            # test_0.9_matched and test_0.9_mismatched come from MNLI version 0.9 (source: Kaggle)
        }
        MatchingLoader.__init__(self, paths=paths)
        CSVLoader.__init__(self, sep='\t')
        self.fields = {
            'sentence1_binary_parse': Const.INPUTS(0),
            'sentence2_binary_parse': Const.INPUTS(1),
            'gold_label': Const.TARGET,
        }

    def _load(self, path):
        ds = CSVLoader._load(self, path)

        # rename the raw column names to the fastNLP field names
        for k, v in self.fields.items():
            if k in ds.get_field_names():
                ds.rename_field(k, v)

        # unlabeled test files carry the placeholder label 'hidden'
        if Const.TARGET in ds.get_field_names():
            if ds[0][Const.TARGET] == 'hidden':
                ds.delete_field(Const.TARGET)

        # strip the binary-parse parentheses, then split on whitespace
        parentheses_table = str.maketrans({'(': None, ')': None})

        ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
                 new_field_name=Const.INPUTS(0))
        ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
                 new_field_name=Const.INPUTS(1))
        # drop instances whose label is '-'
        if Const.TARGET in ds.get_field_names():
            ds.drop(lambda x: x[Const.TARGET] == '-')
        return ds
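
The parenthesis stripping in `_load` can be tried in isolation. The following is a minimal sketch on a made-up binary-parse string; `strip_binary_parse` is a hypothetical helper name used only for illustration and is not part of fastNLP.

```python
# Hypothetical standalone helper mirroring the translate/strip/split chain in _load.
PARENTHESES_TABLE = str.maketrans({'(': None, ')': None})


def strip_binary_parse(parse: str) -> list:
    """Delete every '(' and ')' and split what remains on whitespace."""
    return parse.translate(PARENTHESES_TABLE).strip().split()


# Made-up input in the style of a `sentence1_binary_parse` column.
tokens = strip_binary_parse("( ( A man ) ( is ( sleeping . ) ) )")
print(tokens)
```

Mapping a character to `None` in the translation table deletes it, so no regular expression is needed for this step.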

View File

@ -0,0 +1,65 @@
from typing import Union, Dict
from ..base_loader import DataBundle
from ..dataset_loader import CSVLoader
from ...core.vocabulary import Vocabulary, VocabularyOption
from ...core.const import Const
from ..utils import check_dataloader_paths
class MTL16Loader(CSVLoader):
    """
    Load the MTL16 dataset. The DataSet contains the following fields:

        words: list(str), the text to classify
        target: str, the label of the text

    Data source: https://pan.baidu.com/s/1c2L6vdA
    """

    def __init__(self):
        super(MTL16Loader, self).__init__(headers=(Const.TARGET, Const.INPUT), sep='\t')

    def _load(self, path):
        dataset = super(MTL16Loader, self)._load(path)
        # lowercase and whitespace-tokenize the text
        dataset.apply(lambda x: x[Const.INPUT].lower().split(), new_field_name=Const.INPUT)
        if len(dataset) == 0:
            raise RuntimeError(f"{path} has no valid data.")

        return dataset

    def process(self,
                paths: Union[str, Dict[str, str]],
                src_vocab_opt: VocabularyOption = None,
                tgt_vocab_opt: VocabularyOption = None,):

        paths = check_dataloader_paths(paths)
        datasets = {}
        info = DataBundle()
        for name, path in paths.items():
            dataset = self.load(path)
            datasets[name] = dataset

        # build each vocabulary from the training set only, then index every split
        src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
        src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT)
        src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT)

        tgt_vocab = Vocabulary(unknown=None, padding=None) \
            if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
        tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET)
        tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET)

        info.vocabs = {
            Const.INPUT: src_vocab,
            Const.TARGET: tgt_vocab
        }

        info.datasets = datasets

        for name, dataset in info.datasets.items():
            dataset.set_input(Const.INPUT)
            dataset.set_target(Const.TARGET)

        return info
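
`process` builds each vocabulary from the training split alone and then indexes every split with it. The sketch below illustrates that pattern in plain Python. It is a stand-in under stated assumptions, not fastNLP's `Vocabulary`: the helper names, the `<pad>`/`<unk>` tokens and the first-seen index order are all choices made for this example.

```python
# Plain-Python stand-in for the vocabulary steps in process(); not the fastNLP API.
def build_vocab(train_sentences, padding='<pad>', unknown='<unk>'):
    """Assign an index to every word seen in the training split, in first-seen order."""
    vocab = {padding: 0, unknown: 1}
    for sentence in train_sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


def index_sentences(sentences, vocab, unknown='<unk>'):
    """Replace words by indices; words missing from the vocabulary map to `unknown`."""
    return [[vocab.get(word, vocab[unknown]) for word in sentence] for sentence in sentences]


# Made-up toy splits.
datasets = {
    'train': [['good', 'movie'], ['bad', 'movie']],
    'test': [['great', 'movie']],
}
vocab = build_vocab(datasets['train'])  # training split only
indexed = {name: index_sentences(split, vocab) for name, split in datasets.items()}
print(indexed)
```

Because `great` never occurs in the training split, it is indexed as the unknown token, which is the reason for not building the vocabulary from the test data.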

View File

@ -0,0 +1,85 @@
from ..base_loader import DataSetLoader
from ...core.dataset import DataSet
from ...core.instance import Instance
from ...core.const import Const
class PeopleDailyCorpusLoader(DataSetLoader):
    """
    Alias: :class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.dataset_loader.PeopleDailyCorpusLoader`

    Load the People's Daily corpus.
    """

    def __init__(self, pos=True, ner=True):
        super(PeopleDailyCorpusLoader, self).__init__()
        self.pos = pos
        self.ner = ner

    def _load(self, data_path):
        with open(data_path, "r", encoding="utf-8") as f:
            sents = f.readlines()
        examples = []
        for sent in sents:
            if len(sent) <= 2:
                continue
            inside_ne = False
            sent_pos_tag = []
            sent_words = []
            sent_ner = []
            # the first whitespace-separated item of each line is skipped
            words = sent.strip().split()[1:]
            for word in words:
                if "[" in word and "]" in word:
                    # single-word entity
                    ner_tag = "U"
                    print(word)
                elif "[" in word:
                    # first word of a bracketed entity
                    inside_ne = True
                    ner_tag = "B"
                    word = word[1:]
                elif "]" in word:
                    # last word of a bracketed entity
                    ner_tag = "L"
                    word = word[:word.index("]")]
                    if inside_ne is True:
                        inside_ne = False
                    else:
                        raise RuntimeError("only ] appears!")
                else:
                    if inside_ne is True:
                        ner_tag = "I"
                    else:
                        ner_tag = "O"
                tmp = word.split("/")
                token, pos = tmp[0], tmp[1]
                sent_ner.append(ner_tag)
                sent_pos_tag.append(pos)
                sent_words.append(token)
            example = [sent_words]
            if self.pos is True:
                example.append(sent_pos_tag)
            if self.ner is True:
                example.append(sent_ner)
            examples.append(example)
        return self.convert(examples)

    def convert(self, data):
        """

        :param data: a built-in Python object
        :return: a :class:`~fastNLP.DataSet` object
        """
        data_set = DataSet()
        for item in data:
            sent_words = item[0]
            if self.pos is True and self.ner is True:
                instance = Instance(
                    words=sent_words, pos_tags=item[1], ner=item[2])
            elif self.pos is True:
                instance = Instance(words=sent_words, pos_tags=item[1])
            elif self.ner is True:
                instance = Instance(words=sent_words, ner=item[1])
            else:
                instance = Instance(words=sent_words)
            data_set.append(instance)
        data_set.apply(lambda ins: len(ins[Const.INPUT]), new_field_name=Const.INPUT_LEN)
        return data_set
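
The bracket handling in `_load` amounts to a B/I/L/U/O tagging scheme over `token/pos` items. The sketch below restates that logic as a standalone function on made-up ASCII input. `tag_bracketed_words` is a hypothetical name, and the sketch deviates from the loader in one place, marked in a comment: it strips the brackets from single-word entities, whereas the loader leaves them in the token.

```python
def tag_bracketed_words(words):
    """Return (tokens, pos_tags, ner_tags) for a list of 'token/pos' items with [ ] spans."""
    tokens, pos_tags, ner_tags = [], [], []
    inside_ne = False
    for word in words:
        if "[" in word and "]" in word:
            ner_tag = "U"  # single-word entity
            # Deviation from the loader: strip the brackets so token and POS stay clean.
            word = word[1:word.index("]")]
        elif "[" in word:
            inside_ne = True
            ner_tag = "B"  # first word of an entity
            word = word[1:]
        elif "]" in word:
            if not inside_ne:
                raise RuntimeError("only ] appears!")
            inside_ne = False
            ner_tag = "L"  # last word of an entity
            word = word[:word.index("]")]
        else:
            ner_tag = "I" if inside_ne else "O"
        token, pos = word.split("/")[:2]
        tokens.append(token)
        pos_tags.append(pos)
        ner_tags.append(ner_tag)
    return tokens, pos_tags, ner_tags


# Made-up example line, already split on whitespace.
example = ["[Bank/n", "of/p", "China/ns]nt", "said/v", "[Beijing/ns]ns"]
tokens, pos_tags, ner_tags = tag_bracketed_words(example)
print(list(zip(tokens, pos_tags, ner_tags)))
```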

View File

@ -0,0 +1,45 @@
from ...core.const import Const
from .matching import MatchingLoader
from ..dataset_loader import CSVLoader
class QNLILoader(MatchingLoader, CSVLoader):
    """
    Alias: :class:`fastNLP.io.QNLILoader` :class:`fastNLP.io.data_loader.QNLILoader`

    Load the QNLI dataset. The resulting DataSet contains the fields::

        words1: list(str), the first sentence (premise)
        words2: list(str), the second sentence (hypothesis)
        target: str, the gold label

    Data source:
    """

    def __init__(self, paths: dict=None):
        paths = paths if paths is not None else {
            'train': 'train.tsv',
            'dev': 'dev.tsv',
            'test': 'test.tsv'  # test set has no labels
        }
        MatchingLoader.__init__(self, paths=paths)
        self.fields = {
            'question': Const.INPUTS(0),
            'sentence': Const.INPUTS(1),
            'label': Const.TARGET,
        }
        CSVLoader.__init__(self, sep='\t')

    def _load(self, path):
        ds = CSVLoader._load(self, path)

        # rename the raw column names to the fastNLP field names
        for k, v in self.fields.items():
            if k in ds.get_field_names():
                ds.rename_field(k, v)
        # whitespace-tokenize every input field
        for fields in ds.get_all_fields():
            if Const.INPUT in fields:
                ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)

        return ds

View File

@ -0,0 +1,32 @@
from ...core.const import Const
from .matching import MatchingLoader
from ..dataset_loader import CSVLoader
class QuoraLoader(MatchingLoader, CSVLoader):
    """
    Alias: :class:`fastNLP.io.QuoraLoader` :class:`fastNLP.io.data_loader.QuoraLoader`

    Load the Quora dataset. The resulting DataSet contains the fields::

        words1: list(str), the first sentence (premise)
        words2: list(str), the second sentence (hypothesis)
        target: str, the gold label

    Data source:
    """

    def __init__(self, paths: dict=None):
        paths = paths if paths is not None else {
            'train': 'train.tsv',
            'dev': 'dev.tsv',
            'test': 'test.tsv',
        }
        MatchingLoader.__init__(self, paths=paths)
        CSVLoader.__init__(self, sep='\t', headers=(Const.TARGET, Const.INPUTS(0), Const.INPUTS(1), 'pairID'))

    def _load(self, path):
        ds = CSVLoader._load(self, path)
        return ds

View File

@ -0,0 +1,45 @@
from ...core.const import Const
from .matching import MatchingLoader
from ..dataset_loader import CSVLoader
class RTELoader(MatchingLoader, CSVLoader):
    """
    Alias: :class:`fastNLP.io.RTELoader` :class:`fastNLP.io.data_loader.RTELoader`

    Load the RTE dataset. The resulting DataSet contains the fields::

        words1: list(str), the first sentence (premise)
        words2: list(str), the second sentence (hypothesis)
        target: str, the gold label

    Data source:
    """

    def __init__(self, paths: dict=None):
        paths = paths if paths is not None else {
            'train': 'train.tsv',
            'dev': 'dev.tsv',
            'test': 'test.tsv'  # test set has no labels
        }
        MatchingLoader.__init__(self, paths=paths)
        self.fields = {
            'sentence1': Const.INPUTS(0),
            'sentence2': Const.INPUTS(1),
            'label': Const.TARGET,
        }
        CSVLoader.__init__(self, sep='\t')

    def _load(self, path):
        ds = CSVLoader._load(self, path)

        # rename the raw column names to the fastNLP field names
        for k, v in self.fields.items():
            if k in ds.get_field_names():
                ds.rename_field(k, v)
        # whitespace-tokenize every input field
        for fields in ds.get_all_fields():
            if Const.INPUT in fields:
                ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)

        return ds

View File

@ -0,0 +1,44 @@
from ...core.const import Const
from .matching import MatchingLoader
from ..dataset_loader import JsonLoader
class SNLILoader(MatchingLoader, JsonLoader):
    """
    Alias: :class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.data_loader.SNLILoader`

    Load the SNLI dataset. The resulting DataSet contains the fields::

        words1: list(str), the first sentence (premise)
        words2: list(str), the second sentence (hypothesis)
        target: str, the gold label

    Data source: https://nlp.stanford.edu/projects/snli/snli_1.0.zip
    """

    def __init__(self, paths: dict=None):
        fields = {
            'sentence1_binary_parse': Const.INPUTS(0),
            'sentence2_binary_parse': Const.INPUTS(1),
            'gold_label': Const.TARGET,
        }
        paths = paths if paths is not None else {
            'train': 'snli_1.0_train.jsonl',
            'dev': 'snli_1.0_dev.jsonl',
            'test': 'snli_1.0_test.jsonl'}
        MatchingLoader.__init__(self, paths=paths)
        JsonLoader.__init__(self, fields=fields)

    def _load(self, path):
        ds = JsonLoader._load(self, path)

        # strip the binary-parse parentheses, then split on whitespace
        parentheses_table = str.maketrans({'(': None, ')': None})

        ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
                 new_field_name=Const.INPUTS(0))
        ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
                 new_field_name=Const.INPUTS(1))
        # drop instances whose label is '-'
        ds.drop(lambda x: x[Const.TARGET] == '-')
        return ds

View File

@ -1,19 +1,19 @@
from typing import Iterable
from typing import Union, Dict
from nltk import Tree
import spacy
from ..base_loader import DataInfo, DataSetLoader
from ..base_loader import DataBundle, DataSetLoader
from ..dataset_loader import CSVLoader
from ...core.vocabulary import VocabularyOption, Vocabulary
from ...core.dataset import DataSet
from ...core.const import Const
from ...core.instance import Instance
from ..utils import check_dataloader_paths, get_tokenizer
class SSTLoader(DataSetLoader):
URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip'
DATA_DIR = 'sst/'
"""
Alias: :class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.dataset_loader.SSTLoader`
Alias: :class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.data_loader.SSTLoader`
Load the SST dataset. The DataSet contains the fields::
@ -26,6 +26,9 @@ class SSTLoader(DataSetLoader):
:param fine_grained: whether to use the SST-5 standard; if ``False``, SST-2 is used. Default: ``False``
"""
URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip'
DATA_DIR = 'sst/'
def __init__(self, subtree=False, fine_grained=False):
self.subtree = subtree
@ -57,8 +60,8 @@ class SSTLoader(DataSetLoader):
def _get_one(self, data, subtree):
tree = Tree.fromstring(data)
if subtree:
return [([x.text for x in self.tokenizer(' '.join(t.leaves()))], t.label()) for t in tree.subtrees() ]
return [([x.text for x in self.tokenizer(' '.join(tree.leaves()))], tree.label())]
return [(self.tokenizer(' '.join(t.leaves())), t.label()) for t in tree.subtrees() ]
return [(self.tokenizer(' '.join(tree.leaves())), tree.label())]
def process(self,
paths, train_subtree=True,
@ -70,7 +73,7 @@ class SSTLoader(DataSetLoader):
tgt_vocab = Vocabulary(unknown=None, padding=None) \
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
info = DataInfo()
info = DataBundle()
origin_subtree = self.subtree
self.subtree = train_subtree
info.datasets['train'] = self._load(paths['train'])
@ -98,3 +101,75 @@ class SSTLoader(DataSetLoader):
return info
class SST2Loader(CSVLoader):
    """
    Data source "SST": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8',
    """

    def __init__(self):
        super(SST2Loader, self).__init__(sep='\t')
        self.tokenizer = get_tokenizer()
        self.field = {'sentence': Const.INPUT, 'label': Const.TARGET}

    def _load(self, path: str) -> DataSet:
        ds = super(SST2Loader, self)._load(path)
        # rename the raw column names to the fastNLP field names
        for k, v in self.field.items():
            if k in ds.get_field_names():
                ds.rename_field(k, v)
        ds.apply(lambda x: self.tokenizer(x[Const.INPUT]), new_field_name=Const.INPUT)
        print("all count:", len(ds))
        return ds

    def process(self,
                paths: Union[str, Dict[str, str]],
                src_vocab_opt: VocabularyOption = None,
                tgt_vocab_opt: VocabularyOption = None,
                char_level_op=False):

        paths = check_dataloader_paths(paths)
        datasets = {}
        info = DataBundle()
        for name, path in paths.items():
            dataset = self.load(path)
            datasets[name] = dataset

        def wordtochar(words):
            # flatten words into lowercase characters, with '' between words
            chars = []
            for word in words:
                word = word.lower()
                for char in word:
                    chars.append(char)
                chars.append('')
            chars.pop()
            return chars

        input_name, target_name = Const.INPUT, Const.TARGET
        info.vocabs={}

        # split into character form
        if char_level_op:
            for dataset in datasets.values():
                dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT)
        # build each vocabulary from the training set only, then index every split
        src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
        src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT)
        src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT)

        tgt_vocab = Vocabulary(unknown=None, padding=None) \
            if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
        tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET)
        tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET)

        info.vocabs = {
            Const.INPUT: src_vocab,
            Const.TARGET: tgt_vocab
        }

        info.datasets = datasets

        for name, dataset in info.datasets.items():
            dataset.set_input(Const.INPUT)
            dataset.set_target(Const.TARGET)

        return info
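
The nested `wordtochar` helper flattens a token list into lowercase characters and places an empty string between consecutive words. A standalone sketch of the same behaviour, under the hypothetical name `word_to_char` and with made-up input:

```python
def word_to_char(words):
    """Flatten words into lowercase characters, with '' marking each word boundary."""
    chars = []
    for word in words:
        chars.extend(word.lower())
        chars.append('')  # boundary marker after every word
    chars.pop()  # drop the trailing marker; raises IndexError for an empty word list
    return chars


print(word_to_char(['Good', 'film']))
```

As in the helper it mirrors, an empty word list is not handled: `pop()` on the empty result raises `IndexError`, so callers may want to filter out empty instances first.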

View File

@ -0,0 +1,127 @@
import csv
from typing import Iterable
from ...core.const import Const
from ...core.dataset import DataSet
from ...core.instance import Instance
from ...core.vocabulary import VocabularyOption, Vocabulary
from ..base_loader import DataBundle, DataSetLoader
from typing import Union, Dict
from ..utils import check_dataloader_paths, get_tokenizer
class YelpLoader(DataSetLoader):
    """
    Load the Yelp_full/Yelp_polarity datasets. The DataSet contains the fields:

        words: list(str), the text to classify
        target: str, the label of the text
        chars: list(str), the un-indexed list of characters

    Datasets: yelp_full/yelp_polarity

    :param fine_grained: whether to keep the five-way labels (the SST-5 convention); if ``False``,
        'very negative' and 'very positive' are merged into 'negative' and 'positive'. Default: ``False``
    :param lower: whether to lowercase the text automatically. Default: ``False``
    """

    def __init__(self, fine_grained=False, lower=False):
        super(YelpLoader, self).__init__()
        tag_v = {'1.0': 'very negative', '2.0': 'negative', '3.0': 'neutral',
                 '4.0': 'positive', '5.0': 'very positive'}
        if not fine_grained:
            tag_v['1.0'] = tag_v['2.0']
            tag_v['5.0'] = tag_v['4.0']
        self.fine_grained = fine_grained
        self.tag_v = tag_v
        self.lower = lower
        self.tokenizer = get_tokenizer()

    def _load(self, path):
        ds = DataSet()
        all_count = 0
        real_count = 0
        # open the file in a context manager so that it is closed after reading
        with open(path, encoding='utf-8') as f:
            csv_reader = csv.reader(f)
            for row in csv_reader:
                all_count += 1
                if len(row) == 2:
                    target = self.tag_v[row[0] + ".0"]
                    words = clean_str(row[1], self.tokenizer, self.lower)
                    if len(words) != 0:
                        ds.append(Instance(words=words, target=target))
                        real_count += 1
        print("all count:", all_count)
        print("real count:", real_count)
        return ds

    def process(self, paths: Union[str, Dict[str, str]],
                train_ds: Iterable[str] = None,
                src_vocab_op: VocabularyOption = None,
                tgt_vocab_op: VocabularyOption = None,
                char_level_op=False):
        paths = check_dataloader_paths(paths)
        info = DataBundle(datasets=self.load(paths))
        src_vocab = Vocabulary() if src_vocab_op is None else Vocabulary(**src_vocab_op)
        tgt_vocab = Vocabulary(unknown=None, padding=None) \
            if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
        _train_ds = [info.datasets[name]
                     for name in train_ds] if train_ds else info.datasets.values()

        def wordtochar(words):
            # flatten words into lowercase characters, with '' between words
            chars = []
            for word in words:
                word = word.lower()
                for char in word:
                    chars.append(char)
                chars.append('')
            chars.pop()
            return chars

        input_name, target_name = Const.INPUT, Const.TARGET
        info.vocabs = {}
        # split into character form
        if char_level_op:
            for dataset in info.datasets.values():
                dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT)
        else:
            src_vocab.from_dataset(*_train_ds, field_name=input_name)
            src_vocab.index_dataset(*info.datasets.values(), field_name=input_name, new_field_name=input_name)
            info.vocabs[input_name] = src_vocab

        tgt_vocab.from_dataset(*_train_ds, field_name=target_name)
        tgt_vocab.index_dataset(
            *info.datasets.values(),
            field_name=target_name, new_field_name=target_name)

        info.vocabs[target_name] = tgt_vocab

        info.datasets['train'], info.datasets['dev'] = info.datasets['train'].split(0.1, shuffle=False)

        for name, dataset in info.datasets.items():
            dataset.set_input(Const.INPUT)
            dataset.set_target(Const.TARGET)

        return info
def clean_str(sentence, tokenizer, char_lower=False):
    """
    Heavily borrowed from GitHub:
    https://github.com/LukeZhuang/Hierarchical-Attention-Network/blob/master/yelp-preprocess.ipynb

    :param sentence: a str
    :return:
    """
    if char_lower:
        sentence = sentence.lower()
    import re
    # matches runs of characters other than digits, ASCII letters, '?', '!' and "'"
    nonalpnum = re.compile('[^0-9a-zA-Z?!\']+')
    words = tokenizer(sentence)
    words_collection = []
    for word in words:
        if word in ['-lrb-', '-rrb-', '<sssss>', '-r', '-l', 'b-']:
            continue
        tt = nonalpnum.split(word)
        t = ''.join(tt)
        if t != '':
            words_collection.append(t)

    return words_collection
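
To see what this filter keeps and drops, here is a condensed standalone restatement of the same steps. It is a sketch under stated assumptions: `clean_tokens` is a hypothetical name, `str.split` stands in for the tokenizer that `get_tokenizer()` returns in the loader, and the input sentence is made up.

```python
import re

# Same character class as in clean_str: keep digits, ASCII letters, '?', '!' and "'".
NON_ALNUM = re.compile(r"[^0-9a-zA-Z?!']+")
SKIPPED = {'-lrb-', '-rrb-', '<sssss>', '-r', '-l', 'b-'}


def clean_tokens(sentence, tokenizer=str.split, lower=False):
    """Tokenize, skip the markup tokens, and strip disallowed characters from each word."""
    if lower:
        sentence = sentence.lower()
    cleaned = []
    for word in tokenizer(sentence):
        if word in SKIPPED:
            continue
        stripped = ''.join(NON_ALNUM.split(word))
        if stripped:
            cleaned.append(stripped)
    return cleaned


result = clean_tokens("Great food, -lrb- really -rrb- !!! <sssss> Don't miss it.", lower=True)
print(result)
```

The skip list only contains lowercase forms, so an uppercase `-LRB-` is not skipped as a whole token unless the text is lowercased first; the character filter then reduces it to `LRB`.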

View File

@ -15,199 +15,13 @@ The dataset_loader module implements many DataSetLoaders, used to read different formats of
__all__ = [
'CSVLoader',
'JsonLoader',
'ConllLoader',
'PeopleDailyCorpusLoader',
'Conll2003Loader',
]
import os
from nltk import Tree
from typing import Union, Dict
from ..core.vocabulary import Vocabulary
from ..core.dataset import DataSet
from ..core.instance import Instance
from .file_reader import _read_csv, _read_json, _read_conll
from .base_loader import DataSetLoader, DataInfo
from ..core.const import Const
from ..modules.encoder._bert import BertTokenizer
class PeopleDailyCorpusLoader(DataSetLoader):
"""
Alias: :class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.dataset_loader.PeopleDailyCorpusLoader`
Load the People's Daily corpus.
"""
def __init__(self, pos=True, ner=True):
super(PeopleDailyCorpusLoader, self).__init__()
self.pos = pos
self.ner = ner
def _load(self, data_path):
with open(data_path, "r", encoding="utf-8") as f:
sents = f.readlines()
examples = []
for sent in sents:
if len(sent) <= 2:
continue
inside_ne = False
sent_pos_tag = []
sent_words = []
sent_ner = []
words = sent.strip().split()[1:]
for word in words:
if "[" in word and "]" in word:
ner_tag = "U"
print(word)
elif "[" in word:
inside_ne = True
ner_tag = "B"
word = word[1:]
elif "]" in word:
ner_tag = "L"
word = word[:word.index("]")]
if inside_ne is True:
inside_ne = False
else:
raise RuntimeError("only ] appears!")
else:
if inside_ne is True:
ner_tag = "I"
else:
ner_tag = "O"
tmp = word.split("/")
token, pos = tmp[0], tmp[1]
sent_ner.append(ner_tag)
sent_pos_tag.append(pos)
sent_words.append(token)
example = [sent_words]
if self.pos is True:
example.append(sent_pos_tag)
if self.ner is True:
example.append(sent_ner)
examples.append(example)
return self.convert(examples)
def convert(self, data):
"""
:param data: a built-in Python object
:return: a :class:`~fastNLP.DataSet` object
"""
data_set = DataSet()
for item in data:
sent_words = item[0]
if self.pos is True and self.ner is True:
instance = Instance(
words=sent_words, pos_tags=item[1], ner=item[2])
elif self.pos is True:
instance = Instance(words=sent_words, pos_tags=item[1])
elif self.ner is True:
instance = Instance(words=sent_words, ner=item[1])
else:
instance = Instance(words=sent_words)
data_set.append(instance)
data_set.apply(lambda ins: len(ins[Const.INPUT]), new_field_name=Const.INPUT_LEN)
return data_set
class ConllLoader(DataSetLoader):
"""
Alias: :class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.dataset_loader.ConllLoader`
Load data in CoNLL format. See http://conll.cemantix.org/2012/data.html for the format. Lines starting with "-DOCSTART-" are ignored, because
this marker is used as the document separator in CoNLL 2003.
Column numbers start from 0. The content of each column is::
Column  Type
0       Document ID
1       Part number
2       Word number
3       Word itself
4       Part-of-Speech
5       Parse bit
6       Predicate lemma
7       Predicate Frameset ID
8       Word sense
9       Speaker/Author
10      Named Entities
11:N    Predicate Arguments
N       Coreference
:param headers: name of each data column; must be a list or tuple of str. ``headers`` corresponds one-to-one with ``indexes``
:param indexes: indices of the columns to keep, starting from 0. If ``None``, all columns are kept. Default: ``None``
:param dropna: whether to ignore invalid data. If ``False``, a ``ValueError`` is raised on invalid data. Default: ``False``
"""
def __init__(self, headers, indexes=None, dropna=False):
super(ConllLoader, self).__init__()
if not isinstance(headers, (list, tuple)):
raise TypeError(
'invalid headers: {}, should be list of strings'.format(headers))
self.headers = headers
self.dropna = dropna
if indexes is None:
self.indexes = list(range(len(self.headers)))
else:
if len(indexes) != len(headers):
raise ValueError
self.indexes = indexes
def _load(self, path):
ds = DataSet()
for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna):
ins = {h: data[i] for i, h in enumerate(self.headers)}
ds.append(Instance(**ins))
return ds
class Conll2003Loader(ConllLoader):
"""
Alias: :class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.dataset_loader.Conll2003Loader`
Load CoNLL-2003 data.
For more information about the dataset, see:
https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data
"""
def __init__(self):
headers = [
'tokens', 'pos', 'chunks', 'ner',
]
super(Conll2003Loader, self).__init__(headers=headers)
def _cut_long_sentence(sent, max_sample_length=200):
"""
Split a sentence longer than max_sample_length into several segments. Cuts happen only at spaces,
so a resulting segment may be longer or shorter than max_sample_length.
:param sent: str.
:param max_sample_length: int.
:return: list of str.
"""
sent_no_space = sent.replace(' ', '')
cutted_sentence = []
if len(sent_no_space) > max_sample_length:
parts = sent.strip().split()
new_line = ''
length = 0
for part in parts:
length += len(part)
new_line += part + ' '
if length > max_sample_length:
new_line = new_line[:-1]
cutted_sentence.append(new_line)
length = 0
new_line = ''
if new_line != '':
cutted_sentence.append(new_line[:-1])
else:
cutted_sentence.append(sent)
return cutted_sentence
from .file_reader import _read_csv, _read_json
from .base_loader import DataSetLoader
class JsonLoader(DataSetLoader):
@ -272,6 +86,36 @@ class CSVLoader(DataSetLoader):
return ds
def _cut_long_sentence(sent, max_sample_length=200):
"""
Split a sentence longer than max_sample_length into several segments. Cuts happen only at spaces,
so a resulting segment may be longer or shorter than max_sample_length.
:param sent: str.
:param max_sample_length: int.
:return: list of str.
"""
sent_no_space = sent.replace(' ', '')
cutted_sentence = []
if len(sent_no_space) > max_sample_length:
parts = sent.strip().split()
new_line = ''
length = 0
for part in parts:
length += len(part)
new_line += part + ' '
if length > max_sample_length:
new_line = new_line[:-1]
cutted_sentence.append(new_line)
length = 0
new_line = ''
if new_line != '':
cutted_sentence.append(new_line[:-1])
else:
cutted_sentence.append(sent)
return cutted_sentence
def _add_seg_tag(data):
"""

View File

@ -17,6 +17,10 @@ PRETRAINED_BERT_MODEL_DIR = {
'en-large-uncased': 'bert-large-uncased-20939f45.zip',
'en-large-cased': 'bert-large-cased-e0cf90fc.zip',
'en-large-cased-wwm': 'bert-large-cased-wwm-a457f118.zip',
'en-large-uncased-wwm': 'bert-large-uncased-wwm-92a50aeb.zip',
'en-base-cased-mrpc': 'bert-base-cased-finetuned-mrpc-c7099855.zip',
'cn': 'bert-base-chinese-29d0a84a.zip',
'cn-base': 'bert-base-chinese-29d0a84a.zip',
@ -68,6 +72,7 @@ def cached_path(url_or_filename: str, cache_dir: Path=None) -> Path:
"unable to parse {} as a URL or as a local path".format(url_or_filename)
)
def get_filepath(filepath):
"""
If filepath contains only one file, return the full path of that file directly.
@ -82,6 +87,7 @@ def get_filepath(filepath):
return filepath
return filepath
def get_defalt_path():
"""
Get the default fastNLP cache path. If FASTNLP_CACHE_PATH is set as an environment variable, its value is used, so that not every user has to download the data.
@ -98,6 +104,7 @@ def get_defalt_path():
fastnlp_cache_dir = os.path.expanduser(os.path.join("~", ".fastNLP"))
return fastnlp_cache_dir
def _get_base_url(name):
# the returned URL must end with '/'
if 'FASTNLP_BASE_URL' in os.environ:
@ -105,6 +112,7 @@ def _get_base_url(name):
return fastnlp_base_url
raise RuntimeError("This function is not available right now.")
def split_filename_suffix(filepath):
"""
Given a filepath, return the corresponding name and suffix.
@ -116,6 +124,7 @@ def split_filename_suffix(filepath):
return filename[:-7], '.tar.gz'
return os.path.splitext(filename)
def get_from_cache(url: str, cache_dir: Path = None) -> Path:
"""
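Try to find the resource identified by url in cache_dir; if it is not found, download it from url and store the result under cache_dir. The cached name is inferred from the url.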
@ -226,6 +235,7 @@ def get_from_cache(url: str, cache_dir: Path = None) -> Path:
return get_filepath(cache_path)
def unzip_file(file: Path, to: Path):
# unpack and write out in CoNLL column-like format
from zipfile import ZipFile
@ -234,13 +244,15 @@ def unzip_file(file: Path, to: Path):
# Extract all the contents of zip file in current directory
zipObj.extractall(to)
def untar_gz_file(file:Path, to:Path):
import tarfile
with tarfile.open(file, 'r:gz') as tar:
tar.extractall(to)
def match_file(dir_name:str, cache_dir:str)->str:
def match_file(dir_name: str, cache_dir: str) -> str:
"""
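A file under cache_dir matches if (1) its name is identical to dir_name, or (2) its name is identical to dir_name apart from the suffix.
If two matches are found, an error is raised. If one match is found, its file name is returned; if none is found, an empty string is returned.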
@ -261,6 +273,7 @@ def match_file(dir_name:str, cache_dir:str)->str:
else:
raise RuntimeError(f"Duplicate matched files:{matched_filenames}, this should be caused by a bug.")
if __name__ == '__main__':
cache_dir = Path('caches')
cache_dir = None

View File

@ -4,149 +4,209 @@ __all__ = [
import torch
import torch.nn as nn
import torch.nn.functional as F
from .base_model import BaseModel
from ..core.const import Const
from ..modules import decoder as Decoder
from ..modules import encoder as Encoder
from ..modules import aggregator as Aggregator
from ..core.utils import seq_len_to_mask
from torch.nn import CrossEntropyLoss
my_inf = 10e12
from fastNLP.models import BaseModel
from fastNLP.modules.encoder.embedding import TokenEmbedding
from fastNLP.modules.encoder.lstm import LSTM
from fastNLP.core.const import Const
from fastNLP.core.utils import seq_len_to_mask
class ESIM(BaseModel):
"""
Alias: :class:`fastNLP.models.ESIM` :class:`fastNLP.models.snli.ESIM`
"""A PyTorch implementation of the ESIM model
For the paper, see https://arxiv.org/pdf/1609.06038.pdf
A PyTorch implementation of the ESIM model
Paper for the ESIM model: Enhanced LSTM for Natural Language Inference (arXiv: 1609.06038)
:param int vocab_size: vocabulary size
:param int embed_dim: word embedding dimension
:param int hidden_size: LSTM hidden size
:param float dropout: dropout rate, default 0
:param int num_classes: number of labels, default 3
:param numpy.array init_embedding: initial word embedding matrix of shape (vocab_size, embed_dim); default None, i.e. the embedding matrix is randomly initialized
:param fastNLP.TokenEmbedding init_embedding: the initialized TokenEmbedding
:param int hidden_size: hidden layer size; defaults to the Embedding dimension
:param int num_labels: number of target label classes; default 3
:param float dropout_rate: dropout rate; default 0.3
:param float dropout_embed: dropout rate applied to the Embedding; default 0.1
"""
def __init__(self, vocab_size, embed_dim, hidden_size, dropout=0.0, num_classes=3, init_embedding=None):
def __init__(self, init_embedding: TokenEmbedding, hidden_size=None, num_labels=3, dropout_rate=0.3,
dropout_embed=0.1):
super(ESIM, self).__init__()
self.vocab_size = vocab_size
self.embed_dim = embed_dim
self.hidden_size = hidden_size
self.dropout = dropout
self.n_labels = num_classes
self.drop = nn.Dropout(self.dropout)
self.embedding = Encoder.Embedding(
(self.vocab_size, self.embed_dim), dropout=self.dropout,
)
self.embedding_layer = nn.Linear(self.embed_dim, self.hidden_size)
self.encoder = Encoder.LSTM(
input_size=self.embed_dim, hidden_size=self.hidden_size, num_layers=1, bias=True,
batch_first=True, bidirectional=True
)
self.bi_attention = Aggregator.BiAttention()
self.mean_pooling = Aggregator.AvgPoolWithMask()
self.max_pooling = Aggregator.MaxPoolWithMask()
self.inference_layer = nn.Linear(self.hidden_size * 4, self.hidden_size)
self.decoder = Encoder.LSTM(
input_size=self.hidden_size, hidden_size=self.hidden_size, num_layers=1, bias=True,
batch_first=True, bidirectional=True
)
self.output = Decoder.MLP([4 * self.hidden_size, self.hidden_size, self.n_labels], 'tanh', dropout=self.dropout)
def forward(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
""" Forward function
:param torch.Tensor words1: [batch size(B), premise seq len(PL)] token representation of the premise
:param torch.Tensor words2: [B, hypothesis seq len(HL)] token representation of the hypothesis
:param torch.LongTensor seq_len1: [B] lengths of the premises
:param torch.LongTensor seq_len2: [B] lengths of the hypotheses
:param torch.LongTensor target: [B] gold target values
:return: dict prediction: [B, n_labels(N)] predictions
"""
premise0 = self.embedding_layer(self.embedding(words1))
hypothesis0 = self.embedding_layer(self.embedding(words2))
if seq_len1 is not None:
seq_len1 = seq_len_to_mask(seq_len1)
else:
seq_len1 = torch.ones(premise0.size(0), premise0.size(1))
seq_len1 = (seq_len1.long()).to(device=premise0.device)
if seq_len2 is not None:
seq_len2 = seq_len_to_mask(seq_len2)
else:
seq_len2 = torch.ones(hypothesis0.size(0), hypothesis0.size(1))
seq_len2 = (seq_len2.long()).to(device=hypothesis0.device)
_BP, _PSL, _HP = premise0.size()
_BH, _HSL, _HH = hypothesis0.size()
_BPL, _PLL = seq_len1.size()
_HPL, _HLL = seq_len2.size()
assert _BP == _BH and _BPL == _HPL and _BP == _BPL
assert _HP == _HH
assert _PSL == _PLL and _HSL == _HLL
B, PL, H = premise0.size()
B, HL, H = hypothesis0.size()
a0 = self.encoder(self.drop(premise0)) # a0: [B, PL, H * 2]
b0 = self.encoder(self.drop(hypothesis0)) # b0: [B, HL, H * 2]
a = torch.mean(a0.view(B, PL, -1, H), dim=2) # a: [B, PL, H]
b = torch.mean(b0.view(B, HL, -1, H), dim=2) # b: [B, HL, H]
ai, bi = self.bi_attention(a, b, seq_len1, seq_len2)
ma = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 4 * H]
mb = torch.cat((b, bi, b - bi, b * bi), dim=2) # mb: [B, HL, 4 * H]
f_ma = self.inference_layer(ma)
f_mb = self.inference_layer(mb)
vat = self.decoder(self.drop(f_ma))
vbt = self.decoder(self.drop(f_mb))
va = torch.mean(vat.view(B, PL, -1, H), dim=2) # va: [B, PL, H]
vb = torch.mean(vbt.view(B, HL, -1, H), dim=2) # vb: [B, HL, H]
va_ave = self.mean_pooling(va, seq_len1, dim=1) # va_ave: [B, H]
va_max, va_arg_max = self.max_pooling(va, seq_len1, dim=1) # va_max: [B, H]
vb_ave = self.mean_pooling(vb, seq_len2, dim=1) # vb_ave: [B, H]
vb_max, vb_arg_max = self.max_pooling(vb, seq_len2, dim=1) # vb_max: [B, H]
v = torch.cat((va_ave, va_max, vb_ave, vb_max), dim=1) # v: [B, 4 * H]
prediction = torch.tanh(self.output(v)) # prediction: [B, N]
if target is not None:
func = nn.CrossEntropyLoss()
loss = func(prediction, target)
return {Const.OUTPUT: prediction, Const.LOSS: loss}
return {Const.OUTPUT: prediction}
def predict(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
""" Predict function
:param torch.Tensor words1: [batch size(B), premise seq len(PL)] token representation of the premise
:param torch.Tensor words2: [B, hypothesis seq len(HL)] token representation of the hypothesis
:param torch.LongTensor seq_len1: [B] lengths of the premises
:param torch.LongTensor seq_len2: [B] lengths of the hypotheses
:param torch.LongTensor target: [B] gold target values
:return: dict prediction: [B, n_labels(N)] predictions
self.embedding = init_embedding
self.dropout_embed = EmbedDropout(p=dropout_embed)
if hidden_size is None:
hidden_size = self.embedding.embed_size
self.rnn = BiRNN(self.embedding.embed_size, hidden_size, dropout_rate=dropout_rate)
# self.rnn = LSTM(self.embedding.embed_size, hidden_size, dropout=dropout_rate, bidirectional=True)
self.interfere = nn.Sequential(nn.Dropout(p=dropout_rate),
nn.Linear(8 * hidden_size, hidden_size),
nn.ReLU())
nn.init.xavier_uniform_(self.interfere[1].weight.data)
self.bi_attention = SoftmaxAttention()
self.rnn_high = BiRNN(self.embedding.embed_size, hidden_size, dropout_rate=dropout_rate)
# self.rnn_high = LSTM(hidden_size, hidden_size, dropout=dropout_rate, bidirectional=True,)
self.classifier = nn.Sequential(nn.Dropout(p=dropout_rate),
nn.Linear(8 * hidden_size, hidden_size),
nn.Tanh(),
nn.Dropout(p=dropout_rate),
nn.Linear(hidden_size, num_labels))
self.dropout_rnn = nn.Dropout(p=dropout_rate)
nn.init.xavier_uniform_(self.classifier[1].weight.data)
nn.init.xavier_uniform_(self.classifier[4].weight.data)
def forward(self, words1, words2, seq_len1, seq_len2, target=None):
"""
prediction = self.forward(words1, words2, seq_len1, seq_len2)[Const.OUTPUT]
return {Const.OUTPUT: torch.argmax(prediction, dim=-1)}
:param words1: [batch, seq_len]
:param words2: [batch, seq_len]
:param seq_len1: [batch]
:param seq_len2: [batch]
:param target:
:return:
"""
mask1 = seq_len_to_mask(seq_len1, words1.size(1))
mask2 = seq_len_to_mask(seq_len2, words2.size(1))
a0 = self.embedding(words1) # B * len * emb_dim
b0 = self.embedding(words2)
a0, b0 = self.dropout_embed(a0), self.dropout_embed(b0)
a = self.rnn(a0, mask1.byte()) # a: [B, PL, 2 * H]
b = self.rnn(b0, mask2.byte())
# a = self.dropout_rnn(self.rnn(a0, seq_len1)[0]) # a: [B, PL, 2 * H]
# b = self.dropout_rnn(self.rnn(b0, seq_len2)[0])
ai, bi = self.bi_attention(a, mask1, b, mask2)
a_ = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 8 * H]
b_ = torch.cat((b, bi, b - bi, b * bi), dim=2)
a_f = self.interfere(a_)
b_f = self.interfere(b_)
a_h = self.rnn_high(a_f, mask1.byte()) # ma: [B, PL, 2 * H]
b_h = self.rnn_high(b_f, mask2.byte())
# a_h = self.dropout_rnn(self.rnn_high(a_f, seq_len1)[0]) # ma: [B, PL, 2 * H]
# b_h = self.dropout_rnn(self.rnn_high(b_f, seq_len2)[0])
a_avg = self.mean_pooling(a_h, mask1, dim=1)
a_max, _ = self.max_pooling(a_h, mask1, dim=1)
b_avg = self.mean_pooling(b_h, mask2, dim=1)
b_max, _ = self.max_pooling(b_h, mask2, dim=1)
out = torch.cat((a_avg, a_max, b_avg, b_max), dim=1) # v: [B, 8 * H]
logits = torch.tanh(self.classifier(out))
if target is not None:
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits, target)
return {Const.LOSS: loss, Const.OUTPUT: logits}
else:
return {Const.OUTPUT: logits}
def predict(self, **kwargs):
pred = self.forward(**kwargs)[Const.OUTPUT].argmax(-1)
return {Const.OUTPUT: pred}
# input [batch_size, len , hidden]
# mask [batch_size, len] (111...00)
@staticmethod
def mean_pooling(input, mask, dim=1):
masks = mask.view(mask.size(0), mask.size(1), -1).float()
return torch.sum(input * masks, dim=dim) / torch.sum(masks, dim=1)
@staticmethod
def max_pooling(input, mask, dim=1):
my_inf = 10e12
masks = mask.view(mask.size(0), mask.size(1), -1)
masks = masks.expand(-1, -1, input.size(2)).float()
return torch.max(input + masks.le(0.5).float() * -my_inf, dim=dim)
class EmbedDropout(nn.Dropout):
def forward(self, sequences_batch):
ones = sequences_batch.data.new_ones(sequences_batch.shape[0], sequences_batch.shape[-1])
dropout_mask = nn.functional.dropout(ones, self.p, self.training, inplace=False)
return dropout_mask.unsqueeze(1) * sequences_batch
class BiRNN(nn.Module):
def __init__(self, input_size, hidden_size, dropout_rate=0.3):
super(BiRNN, self).__init__()
self.dropout_rate = dropout_rate
self.rnn = nn.LSTM(input_size, hidden_size,
num_layers=1,
bidirectional=True,
batch_first=True)
def forward(self, x, x_mask):
# Sort x
lengths = x_mask.data.eq(1).long().sum(1)
_, idx_sort = torch.sort(lengths, dim=0, descending=True)
_, idx_unsort = torch.sort(idx_sort, dim=0)
lengths = list(lengths[idx_sort])
x = x.index_select(0, idx_sort)
# Pack it up
rnn_input = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True)
# Apply dropout to input
if self.dropout_rate > 0:
dropout_input = F.dropout(rnn_input.data, p=self.dropout_rate, training=self.training)
rnn_input = nn.utils.rnn.PackedSequence(dropout_input, rnn_input.batch_sizes)
output = self.rnn(rnn_input)[0]
# Unpack everything
output = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)[0]
output = output.index_select(0, idx_unsort)
if output.size(1) != x_mask.size(1):
padding = torch.zeros(output.size(0),
x_mask.size(1) - output.size(1),
output.size(2)).type(output.data.type())
output = torch.cat([output, padding], 1)
return output
def masked_softmax(tensor, mask):
tensor_shape = tensor.size()
reshaped_tensor = tensor.view(-1, tensor_shape[-1])
# Reshape the mask so it matches the size of the input tensor.
while mask.dim() < tensor.dim():
mask = mask.unsqueeze(1)
mask = mask.expand_as(tensor).contiguous().float()
reshaped_mask = mask.view(-1, mask.size()[-1])
result = F.softmax(reshaped_tensor * reshaped_mask, dim=-1)
result = result * reshaped_mask
# 1e-13 is added to avoid divisions by zero.
result = result / (result.sum(dim=-1, keepdim=True) + 1e-13)
return result.view(*tensor_shape)
def weighted_sum(tensor, weights, mask):
w_sum = weights.bmm(tensor)
while mask.dim() < w_sum.dim():
mask = mask.unsqueeze(1)
mask = mask.transpose(-1, -2)
mask = mask.expand_as(w_sum).contiguous().float()
return w_sum * mask
class SoftmaxAttention(nn.Module):
def forward(self, premise_batch, premise_mask, hypothesis_batch, hypothesis_mask):
similarity_matrix = premise_batch.bmm(hypothesis_batch.transpose(2, 1)
.contiguous())
prem_hyp_attn = masked_softmax(similarity_matrix, hypothesis_mask)
hyp_prem_attn = masked_softmax(similarity_matrix.transpose(1, 2)
.contiguous(),
premise_mask)
attended_premises = weighted_sum(hypothesis_batch,
prem_hyp_attn,
premise_mask)
attended_hypotheses = weighted_sum(premise_batch,
hyp_prem_attn,
hypothesis_mask)
return attended_premises, attended_hypotheses
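
The `masked_softmax` helper above multiplies the scores by the mask, applies a softmax, zeroes the padded positions again, and renormalizes. A minimal dependency-free sketch of that sequence for a single row, using plain Python lists instead of tensors (the name `masked_softmax_row` and the list-based interface are illustrative assumptions, not part of fastNLP):

```python
import math


def masked_softmax_row(scores, mask, eps=1e-13):
    """Softmax over one row of scores, ignoring positions where mask == 0."""
    # Step 1: multiply the scores by the mask (padded scores become 0).
    masked = [s * m for s, m in zip(scores, mask)]
    # Step 2: ordinary softmax (the peak is subtracted for numerical stability).
    peak = max(masked)
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 3: zero the padded positions again.
    probs = [p * m for p, m in zip(probs, mask)]
    # Step 4: renormalize; eps avoids a division by zero for all-padding rows.
    norm = sum(probs) + eps
    return [p / norm for p in probs]


weights = masked_softmax_row([2.0, 1.0, 0.5], [1, 1, 0])
```

The padded third position receives zero weight, and the remaining weights are rescaled so that they sum to (almost exactly) one.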

View File

@ -46,8 +46,8 @@ class StarTransEnc(nn.Module):
super(StarTransEnc, self).__init__()
self.embedding = get_embeddings(init_embed)
emb_dim = self.embedding.embedding_dim
#self.emb_fc = nn.Linear(emb_dim, hidden_size)
self.emb_drop = nn.Dropout(emb_dropout)
self.emb_fc = nn.Linear(emb_dim, hidden_size)
# self.emb_drop = nn.Dropout(emb_dropout)
self.encoder = StarTransformer(hidden_size=hidden_size,
num_layers=num_layers,
num_head=num_head,
@ -65,7 +65,7 @@ class StarTransEnc(nn.Module):
[batch, hidden] global relay node; see the paper for details
"""
x = self.embedding(x)
#x = self.emb_fc(self.emb_drop(x))
x = self.emb_fc(x)
nodes, relay = self.encoder(x, mask)
return nodes, relay

View File

@ -1,11 +1,11 @@
"""
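Most neural networks used for NLP tasks can be viewed as being composed of encoder :mod:`~fastNLP.modules.encoder`
plus aggregator :mod:`~fastNLP.modules.aggregator` and decoder :mod:`~fastNLP.modules.decoder` modules (three kinds in total).
and decoder :mod:`~fastNLP.modules.decoder` modules (two kinds in total).
.. image:: figures/text_classification.png
:mod:`~fastNLP.modules` implements the many module components provided by fastNLP, which help users quickly build the networks they need.
The functions and common components of the three kinds of modules are as follows:
The functions and common components of the two kinds of modules are as follows: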
+-----------------------+-----------------------+-----------------------+
| module type | functionality | example |
@ -13,9 +13,6 @@
| encoder | encode the input into | embedding, RNN, CNN, |
| | expressive vectors | transformer |
+-----------------------+-----------------------+-----------------------+
| aggregator | aggregate information | self-attention, |
| | from multiple vectors | max-pooling |
+-----------------------+-----------------------+-----------------------+
| decoder | decode representation | MLP, CRF |
| | vectors into the | |
| | required output form | |
@ -46,10 +43,8 @@ __all__ = [
"allowed_transitions",
]
from . import aggregator
from . import decoder
from . import encoder
from .aggregator import *
from .decoder import *
from .dropout import TimestepDropout
from .encoder import *

View File

@ -1,14 +0,0 @@
__all__ = [
"MaxPool",
"MaxPoolWithMask",
"AvgPool",
"MultiHeadAttention",
]
from .pooling import MaxPool
from .pooling import MaxPoolWithMask
from .pooling import AvgPool
from .pooling import AvgPoolWithMask
from .attention import MultiHeadAttention

View File

@ -22,7 +22,14 @@ __all__ = [
"VarRNN",
"VarLSTM",
"VarGRU"
"VarGRU",
"MaxPool",
"MaxPoolWithMask",
"AvgPool",
"AvgPoolWithMask",
"MultiHeadAttention",
]
from ._bert import BertModel
from .bert import BertWordPieceEncoder
@ -34,3 +41,6 @@ from .lstm import LSTM
from .star_transformer import StarTransformer
from .transformer import TransformerEncoder
from .variational_rnn import VarRNN, VarLSTM, VarGRU
from .pooling import MaxPool, MaxPoolWithMask, AvgPool, AvgPoolWithMask
from .attention import MultiHeadAttention
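
With this change the pooling classes and `MultiHeadAttention` are exported from `fastNLP.modules.encoder` rather than `fastNLP.modules.aggregator`. Code that has to run against both layouts could look a name up in several candidate modules. The sketch below is one hedged way to do that; `import_first` is a hypothetical helper, not a fastNLP function.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first importable module that defines it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # this layout is not installed, try the next one
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError("{} not found in any of: {}".format(name, ", ".join(module_paths)))


# Intended use with the two layouts shown in this diff (requires fastNLP):
# MaxPool = import_first("MaxPool", ["fastNLP.modules.encoder",
#                                    "fastNLP.modules.aggregator"])
```

Listing the new location first means the old `aggregator` package is only consulted on installations that predate this commit.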

View File

@ -6,14 +6,13 @@ from typing import Optional, Tuple, List, Callable
import os
import h5py
import numpy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import PackedSequence, pad_packed_sequence
from ...core.vocabulary import Vocabulary
import json
import pickle
from ..utils import get_dropout_mask
import codecs
@ -244,13 +243,13 @@ class LstmbiLm(nn.Module):
def __init__(self, config):
super(LstmbiLm, self).__init__()
self.config = config
self.encoder = nn.LSTM(self.config['encoder']['projection_dim'],
self.config['encoder']['dim'],
num_layers=self.config['encoder']['n_layers'],
self.encoder = nn.LSTM(self.config['lstm']['projection_dim'],
self.config['lstm']['dim'],
num_layers=self.config['lstm']['n_layers'],
bidirectional=True,
batch_first=True,
dropout=self.config['dropout'])
self.projection = nn.Linear(self.config['encoder']['dim'], self.config['encoder']['projection_dim'], bias=True)
self.projection = nn.Linear(self.config['lstm']['dim'], self.config['lstm']['projection_dim'], bias=True)
def forward(self, inputs, seq_len):
sort_lens, sort_idx = torch.sort(seq_len, dim=0, descending=True)
@ -260,7 +259,7 @@ class LstmbiLm(nn.Module):
output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=self.batch_first)
_, unsort_idx = torch.sort(sort_idx, dim=0, descending=False)
output = output[unsort_idx]
forward, backward = output.split(self.config['encoder']['dim'], 2)
forward, backward = output.split(self.config['lstm']['dim'], 2)
return torch.cat([self.projection(forward), self.projection(backward)], dim=2)
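
`LstmbiLm.forward` sorts the batch by length before packing and restores the original order afterwards by sorting the sort indices themselves. A small list-based sketch of why that second sort yields the inverse permutation (pure Python; `argsort` is a helper defined here for illustration and stands in for the index output of `torch.sort`):

```python
def argsort(values, descending=False):
    """Indices that would sort `values`."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)


seq_len = [2, 5, 3]
batch = ["a", "b", "c"]  # one item per sequence

sort_idx = argsort(seq_len, descending=True)  # longest sequence first
unsort_idx = argsort(sort_idx)                # inverse of sort_idx

sorted_batch = [batch[i] for i in sort_idx]          # order used for packing
restored = [sorted_batch[i] for i in unsort_idx]     # back to the input order
```

Because `sort_idx` is a permutation of `0..n-1`, sorting it records where each original position ended up, which is exactly the index needed to undo the reordering.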
@ -268,13 +267,13 @@ class ElmobiLm(torch.nn.Module):
def __init__(self, config):
super(ElmobiLm, self).__init__()
self.config = config
input_size = config['encoder']['projection_dim']
hidden_size = config['encoder']['projection_dim']
cell_size = config['encoder']['dim']
num_layers = config['encoder']['n_layers']
memory_cell_clip_value = config['encoder']['cell_clip']
state_projection_clip_value = config['encoder']['proj_clip']
recurrent_dropout_probability = config['dropout']
input_size = config['lstm']['projection_dim']
hidden_size = config['lstm']['projection_dim']
cell_size = config['lstm']['dim']
num_layers = config['lstm']['n_layers']
memory_cell_clip_value = config['lstm']['cell_clip']
state_projection_clip_value = config['lstm']['proj_clip']
recurrent_dropout_probability = 0.0
self.input_size = input_size
self.hidden_size = hidden_size
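
The hunks above switch the lookups from `config['encoder']` to `config['lstm']`, and other hunks in this file switch `config['token_embedder']` to `config['char_cnn']`. If an older configuration differs only by these two top-level keys (an assumption; the diff does not show the full schema), it could be adapted with a helper like the hypothetical `migrate_elmo_config` below. The numbers in the sample dictionary are made up for the example.

```python
import copy

# Old top-level key -> new top-level key, as read by the code in this diff.
RENAMED_KEYS = {"encoder": "lstm", "token_embedder": "char_cnn"}


def migrate_elmo_config(old_config):
    """Return a copy of `old_config` with the renamed top-level sections."""
    new_config = {}
    for key, value in old_config.items():
        new_config[RENAMED_KEYS.get(key, key)] = copy.deepcopy(value)
    return new_config


old = {"encoder": {"dim": 4096, "projection_dim": 512}, "dropout": 0.1}
new = migrate_elmo_config(old)
```

The input dictionary is left untouched, so the same file can still be read by code that expects the old key names.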
@ -409,199 +408,50 @@ class ElmobiLm(torch.nn.Module):
torch.cat(final_memory_states, 0))
return stacked_sequence_outputs, final_state_tuple
def load_weights(self, weight_file: str) -> None:
"""
Load the pre-trained weights from the file.
"""
requires_grad = False
with h5py.File(weight_file, 'r') as fin:
for i_layer, lstms in enumerate(
zip(self.forward_layers, self.backward_layers)
):
for j_direction, lstm in enumerate(lstms):
# lstm is an instance of LSTMCellWithProjection
cell_size = lstm.cell_size
dataset = fin['RNN_%s' % j_direction]['RNN']['MultiRNNCell']['Cell%s' % i_layer
]['LSTMCell']
# tensorflow packs together both W and U matrices into one matrix,
# but pytorch maintains individual matrices. In addition, tensorflow
# packs the gates as input, memory, forget, output but pytorch
# uses input, forget, memory, output. So we need to modify the weights.
tf_weights = numpy.transpose(dataset['W_0'][...])
torch_weights = tf_weights.copy()
# split the W from U matrices
input_size = lstm.input_size
input_weights = torch_weights[:, :input_size]
recurrent_weights = torch_weights[:, input_size:]
tf_input_weights = tf_weights[:, :input_size]
tf_recurrent_weights = tf_weights[:, input_size:]
# handle the different gate order convention
for torch_w, tf_w in [[input_weights, tf_input_weights],
[recurrent_weights, tf_recurrent_weights]]:
torch_w[(1 * cell_size):(2 * cell_size), :] = tf_w[(2 * cell_size):(3 * cell_size), :]
torch_w[(2 * cell_size):(3 * cell_size), :] = tf_w[(1 * cell_size):(2 * cell_size), :]
lstm.input_linearity.weight.data.copy_(torch.FloatTensor(input_weights))
lstm.state_linearity.weight.data.copy_(torch.FloatTensor(recurrent_weights))
lstm.input_linearity.weight.requires_grad = requires_grad
lstm.state_linearity.weight.requires_grad = requires_grad
# the bias weights
tf_bias = dataset['B'][...]
# tensorflow adds 1.0 to forget gate bias instead of modifying the
# parameters...
tf_bias[(2 * cell_size):(3 * cell_size)] += 1
torch_bias = tf_bias.copy()
torch_bias[(1 * cell_size):(2 * cell_size)
] = tf_bias[(2 * cell_size):(3 * cell_size)]
torch_bias[(2 * cell_size):(3 * cell_size)
] = tf_bias[(1 * cell_size):(2 * cell_size)]
lstm.state_linearity.bias.data.copy_(torch.FloatTensor(torch_bias))
lstm.state_linearity.bias.requires_grad = requires_grad
# the projection weights
proj_weights = numpy.transpose(dataset['W_P_0'][...])
lstm.state_projection.weight.data.copy_(torch.FloatTensor(proj_weights))
lstm.state_projection.weight.requires_grad = requires_grad
class LstmTokenEmbedder(nn.Module):
def __init__(self, config, word_emb_layer, char_emb_layer):
super(LstmTokenEmbedder, self).__init__()
self.config = config
self.word_emb_layer = word_emb_layer
self.char_emb_layer = char_emb_layer
self.output_dim = config['encoder']['projection_dim']
emb_dim = 0
if word_emb_layer is not None:
emb_dim += word_emb_layer.n_d
if char_emb_layer is not None:
emb_dim += char_emb_layer.n_d * 2
self.char_lstm = nn.LSTM(char_emb_layer.n_d, char_emb_layer.n_d, num_layers=1, bidirectional=True,
batch_first=True, dropout=config['dropout'])
self.projection = nn.Linear(emb_dim, self.output_dim, bias=True)
def forward(self, words, chars):
embs = []
if self.word_emb_layer is not None:
if hasattr(self, 'words_to_words'):
words = self.words_to_words[words]
word_emb = self.word_emb_layer(words)
embs.append(word_emb)
if self.char_emb_layer is not None:
batch_size, seq_len, _ = chars.shape
chars = chars.view(batch_size * seq_len, -1)
chars_emb = self.char_emb_layer(chars)
# TODO: seq_len should be taken into account here
_, (chars_outputs, __) = self.char_lstm(chars_emb)
chars_outputs = chars_outputs.contiguous().view(-1, self.config['token_embedder']['embedding']['dim'] * 2)
embs.append(chars_outputs)
token_embedding = torch.cat(embs, dim=2)
return self.projection(token_embedding)
class ConvTokenEmbedder(nn.Module):
def __init__(self, config, weight_file, word_emb_layer, char_emb_layer, char_vocab):
def __init__(self, config, weight_file, word_emb_layer, char_emb_layer):
super(ConvTokenEmbedder, self).__init__()
self.weight_file = weight_file
self.word_emb_layer = word_emb_layer
self.char_emb_layer = char_emb_layer
self.output_dim = config['encoder']['projection_dim']
self.output_dim = config['lstm']['projection_dim']
self._options = config
self.requires_grad = False
self._load_weights()
self._char_embedding_weights = char_emb_layer.weight.data
def _load_weights(self):
self._load_cnn_weights()
self._load_highway()
self._load_projection()
char_cnn_options = self._options['char_cnn']
if char_cnn_options['activation'] == 'tanh':
self.activation = torch.tanh
elif char_cnn_options['activation'] == 'relu':
self.activation = torch.nn.functional.relu
else:
raise Exception("Unknown activation")
def _load_cnn_weights(self):
cnn_options = self._options['token_embedder']
filters = cnn_options['filters']
char_embed_dim = cnn_options['embedding']['dim']
if char_emb_layer is not None:
self.char_conv = []
cnn_config = config['char_cnn']
filters = cnn_config['filters']
char_embed_dim = cnn_config['embedding']['dim']
convolutions = []
convolutions = []
for i, (width, num) in enumerate(filters):
conv = torch.nn.Conv1d(
in_channels=char_embed_dim,
out_channels=num,
kernel_size=width,
bias=True
)
# load the weights
with h5py.File(self.weight_file, 'r') as fin:
weight = fin['CNN']['W_cnn_{}'.format(i)][...]
bias = fin['CNN']['b_cnn_{}'.format(i)][...]
for i, (width, num) in enumerate(filters):
conv = torch.nn.Conv1d(
in_channels=char_embed_dim,
out_channels=num,
kernel_size=width,
bias=True
)
convolutions.append(conv)
self.add_module('char_conv_{}'.format(i), conv)
w_reshaped = numpy.transpose(weight.squeeze(axis=0), axes=(2, 1, 0))
if w_reshaped.shape != tuple(conv.weight.data.shape):
raise ValueError("Invalid weight file")
conv.weight.data.copy_(torch.FloatTensor(w_reshaped))
conv.bias.data.copy_(torch.FloatTensor(bias))
self._convolutions = convolutions
conv.weight.requires_grad = self.requires_grad
conv.bias.requires_grad = self.requires_grad
n_filters = sum(f[1] for f in filters)
n_highway = cnn_config['n_highway']
convolutions.append(conv)
self.add_module('char_conv_{}'.format(i), conv)
self._highways = Highway(n_filters, n_highway, activation=torch.nn.functional.relu)
self._convolutions = convolutions
def _load_highway(self):
# the highway layers have same dimensionality as the number of cnn filters
cnn_options = self._options['token_embedder']
filters = cnn_options['filters']
n_filters = sum(f[1] for f in filters)
n_highway = cnn_options['n_highway']
# create the layers, and load the weights
self._highways = Highway(n_filters, n_highway, activation=torch.nn.functional.relu)
for k in range(n_highway):
# The AllenNLP highway is one matrix multiplication with concatenation of
# transform and carry weights.
with h5py.File(self.weight_file, 'r') as fin:
# The weights are transposed due to multiplication order assumptions in tf
# vs pytorch (tf.matmul(X, W) vs pytorch.matmul(W, X))
w_transform = numpy.transpose(fin['CNN_high_{}'.format(k)]['W_transform'][...])
# -1.0 since AllenNLP is g * x + (1 - g) * f(x) but tf is (1 - g) * x + g * f(x)
w_carry = -1.0 * numpy.transpose(fin['CNN_high_{}'.format(k)]['W_carry'][...])
weight = numpy.concatenate([w_transform, w_carry], axis=0)
self._highways._layers[k].weight.data.copy_(torch.FloatTensor(weight))
self._highways._layers[k].weight.requires_grad = self.requires_grad
b_transform = fin['CNN_high_{}'.format(k)]['b_transform'][...]
b_carry = -1.0 * fin['CNN_high_{}'.format(k)]['b_carry'][...]
bias = numpy.concatenate([b_transform, b_carry], axis=0)
self._highways._layers[k].bias.data.copy_(torch.FloatTensor(bias))
self._highways._layers[k].bias.requires_grad = self.requires_grad
def _load_projection(self):
cnn_options = self._options['token_embedder']
filters = cnn_options['filters']
n_filters = sum(f[1] for f in filters)
self._projection = torch.nn.Linear(n_filters, self.output_dim, bias=True)
with h5py.File(self.weight_file, 'r') as fin:
weight = fin['CNN_proj']['W_proj'][...]
bias = fin['CNN_proj']['b_proj'][...]
self._projection.weight.data.copy_(torch.FloatTensor(numpy.transpose(weight)))
self._projection.bias.data.copy_(torch.FloatTensor(bias))
self._projection.weight.requires_grad = self.requires_grad
self._projection.bias.requires_grad = self.requires_grad
self._projection = torch.nn.Linear(n_filters, self.output_dim, bias=True)
def forward(self, words, chars):
"""
@ -616,15 +466,8 @@ class ConvTokenEmbedder(nn.Module):
# self._char_embedding_weights
# )
batch_size, sequence_length, max_char_len = chars.size()
character_embedding = self.char_emb_layer(chars).reshape(batch_size*sequence_length, max_char_len, -1)
character_embedding = self.char_emb_layer(chars).reshape(batch_size * sequence_length, max_char_len, -1)
# run convolutions
cnn_options = self._options['token_embedder']
if cnn_options['activation'] == 'tanh':
activation = torch.tanh
elif cnn_options['activation'] == 'relu':
activation = torch.nn.functional.relu
else:
raise Exception("Unknown activation")
# (batch_size * sequence_length, embed_dim, max_chars_per_token)
character_embedding = torch.transpose(character_embedding, 1, 2)
@ -634,7 +477,7 @@ class ConvTokenEmbedder(nn.Module):
convolved = conv(character_embedding)
# (batch_size * sequence_length, n_filters for this width)
convolved, _ = torch.max(convolved, dim=-1)
convolved = activation(convolved)
convolved = self.activation(convolved)
convs.append(convolved)
# (batch_size * sequence_length, n_filters)
@ -712,8 +555,8 @@ class _ElmoModel(nn.Module):
def __init__(self, model_dir: str, vocab: Vocabulary = None, cache_word_reprs: bool = False):
super(_ElmoModel, self).__init__()
dir = os.walk(model_dir)
self.model_dir = model_dir
dir = os.walk(self.model_dir)
config_file = None
weight_file = None
config_count = 0
@ -723,7 +566,7 @@ class _ElmoModel(nn.Module):
if file_name.__contains__(".json"):
config_file = file_name
config_count += 1
elif file_name.__contains__(".hdf5"):
elif file_name.__contains__(".pkl"):
weight_file = file_name
weight_count += 1
if config_count > 1 or weight_count > 1:
@ -734,7 +577,6 @@ class _ElmoModel(nn.Module):
config = json.load(open(os.path.join(model_dir, config_file), 'r'))
self.weight_file = os.path.join(model_dir, weight_file)
self.config = config
self.requires_grad = False
OOV_TAG = '<oov>'
PAD_TAG = '<pad>'
@ -744,102 +586,84 @@ class _ElmoModel(nn.Module):
EOW_TAG = '<eow>'
# For the model trained with character-based word encoder.
if config['token_embedder']['embedding']['dim'] > 0:
char_lexicon = {}
with codecs.open(os.path.join(model_dir, 'char.dic'), 'r', encoding='utf-8') as fpi:
for line in fpi:
tokens = line.strip().split('\t')
if len(tokens) == 1:
tokens.insert(0, '\u3000')
token, i = tokens
char_lexicon[token] = int(i)
char_lexicon = {}
with codecs.open(os.path.join(model_dir, 'char.dic'), 'r', encoding='utf-8') as fpi:
for line in fpi:
tokens = line.strip().split('\t')
if len(tokens) == 1:
tokens.insert(0, '\u3000')
token, i = tokens
char_lexicon[token] = int(i)
# Run some sanity checks
for special_word in [PAD_TAG, OOV_TAG, BOW_TAG, EOW_TAG]:
assert special_word in char_lexicon, f"{special_word} not found in char.dic."
# Run some sanity checks
for special_word in [PAD_TAG, OOV_TAG, BOW_TAG, EOW_TAG]:
assert special_word in char_lexicon, f"{special_word} not found in char.dic."
# Build char_vocab from vocab
char_vocab = Vocabulary(unknown=OOV_TAG, padding=PAD_TAG)
# <bow> and <eow> must be included
char_vocab.add_word_lst([BOW_TAG, EOW_TAG, BOS_TAG, EOS_TAG])
# Build char_vocab from vocab
char_vocab = Vocabulary(unknown=OOV_TAG, padding=PAD_TAG)
# <bow> and <eow> must be included
char_vocab.add_word_lst([BOW_TAG, EOW_TAG, BOS_TAG, EOS_TAG])
for word, index in vocab:
char_vocab.add_word_lst(list(word))
for word, index in vocab:
char_vocab.add_word_lst(list(word))
self.bos_index, self.eos_index, self._pad_index = len(vocab), len(vocab)+1, vocab.padding_idx
# Adjust according to char_lexicon; one extra slot is reserved for word padding (its char representation is all zeros)
char_emb_layer = nn.Embedding(len(char_vocab)+1, int(config['token_embedder']['embedding']['dim']),
padding_idx=len(char_vocab))
with h5py.File(self.weight_file, 'r') as fin:
char_embed_weights = fin['char_embed'][...]
char_embed_weights = torch.from_numpy(char_embed_weights)
found_char_count = 0
for char, index in char_vocab: # adjust the character embedding
if char in char_lexicon:
index_in_pre = char_lexicon.get(char)
found_char_count += 1
else:
index_in_pre = char_lexicon[OOV_TAG]
char_emb_layer.weight.data[index] = char_embed_weights[index_in_pre]
self.bos_index, self.eos_index, self._pad_index = len(vocab), len(vocab) + 1, vocab.padding_idx
# Adjust according to char_lexicon; one extra slot is reserved for word padding (its char representation is all zeros)
char_emb_layer = nn.Embedding(len(char_vocab) + 1, int(config['char_cnn']['embedding']['dim']),
padding_idx=len(char_vocab))
print(f"{found_char_count} out of {len(char_vocab)} characters were found in pretrained elmo embedding.")
# Build the mapping from words to chars
if config['token_embedder']['name'].lower() == 'cnn':
max_chars = config['token_embedder']['max_characters_per_token']
elif config['token_embedder']['name'].lower() == 'lstm':
max_chars = max(map(lambda x: len(x[0]), vocab)) + 2 # two extra slots for <bow> and <eow>
# Load the pretrained weights; elmo_model here holds the state_dicts of char_cnn and lstm
elmo_model = torch.load(os.path.join(self.model_dir, weight_file), map_location='cpu')
char_embed_weights = elmo_model["char_cnn"]['char_emb_layer.weight']
found_char_count = 0
for char, index in char_vocab: # adjust the character embedding
if char in char_lexicon:
index_in_pre = char_lexicon.get(char)
found_char_count += 1
else:
raise ValueError('Unknown token_embedder: {0}'.format(config['token_embedder']['name']))
index_in_pre = char_lexicon[OOV_TAG]
char_emb_layer.weight.data[index] = char_embed_weights[index_in_pre]
self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab)+2, max_chars),
fill_value=len(char_vocab),
dtype=torch.long),
requires_grad=False)
for word, index in list(iter(vocab)) + [(BOS_TAG, len(vocab)), (EOS_TAG, len(vocab)+1)]:
if len(word) + 2 > max_chars:
word = word[:max_chars - 2]
if index == self._pad_index:
continue
elif word == BOS_TAG or word == EOS_TAG:
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(word)] + [
char_vocab.to_index(EOW_TAG)]
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
else:
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [
char_vocab.to_index(EOW_TAG)]
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
self.words_to_chars_embedding[index] = torch.LongTensor(char_ids)
print(f"{found_char_count} out of {len(char_vocab)} characters were found in pretrained elmo embedding.")
# Build the mapping from words to chars
max_chars = config['char_cnn']['max_characters_per_token']
self.char_vocab = char_vocab
else:
char_emb_layer = None
self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab) + 2, max_chars),
fill_value=len(char_vocab),
dtype=torch.long),
requires_grad=False)
for word, index in list(iter(vocab)) + [(BOS_TAG, len(vocab)), (EOS_TAG, len(vocab) + 1)]:
if len(word) + 2 > max_chars:
word = word[:max_chars - 2]
if index == self._pad_index:
continue
elif word == BOS_TAG or word == EOS_TAG:
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(word)] + [
char_vocab.to_index(EOW_TAG)]
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
else:
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [
char_vocab.to_index(EOW_TAG)]
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
self.words_to_chars_embedding[index] = torch.LongTensor(char_ids)
if config['token_embedder']['name'].lower() == 'cnn':
self.token_embedder = ConvTokenEmbedder(
config, self.weight_file, None, char_emb_layer, self.char_vocab)
elif config['token_embedder']['name'].lower() == 'lstm':
self.token_embedder = LstmTokenEmbedder(
config, None, char_emb_layer)
self.char_vocab = char_vocab
if config['token_embedder']['word_dim'] > 0 \
and vocab._no_create_word_length > 0: # remap so that indices that only come from dev/test point to unk
words_to_words = nn.Parameter(torch.arange(len(vocab) + 2).long(), requires_grad=False)
for word, idx in vocab:
if vocab._is_word_no_create_entry(word):
words_to_words[idx] = vocab.unknown_idx
setattr(self.token_embedder, 'words_to_words', words_to_words)
self.output_dim = config['encoder']['projection_dim']
self.token_embedder = ConvTokenEmbedder(
config, self.weight_file, None, char_emb_layer)
elmo_model["char_cnn"]['char_emb_layer.weight'] = char_emb_layer.weight
self.token_embedder.load_state_dict(elmo_model["char_cnn"])
# Only ELMo is considered for now
if config['encoder']['name'].lower() == 'elmo':
self.encoder = ElmobiLm(config)
elif config['encoder']['name'].lower() == 'lstm':
self.encoder = LstmbiLm(config)
self.output_dim = config['lstm']['projection_dim']
self.encoder.load_weights(self.weight_file)
# lstm encoder
self.encoder = ElmobiLm(config)
self.encoder.load_state_dict(elmo_model["lstm"])
if cache_word_reprs:
if config['token_embedder']['embedding']['dim'] > 0: # only useful when chars are used
if config['char_cnn']['embedding']['dim'] > 0: # only useful when chars are used
print("Start to generate cache word representations.")
batch_size = 320
# bos eos
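
The loop that fills `words_to_chars_embedding` truncates each ordinary word to `max_chars - 2` characters, wraps it in `<bow>` and `<eow>`, and pads the remainder. A list-based sketch of that per-word step follows. The `char_to_index` dictionary and its index values are made up for the example and stand in for `char_vocab.to_index`; the special handling of the `<bos>`/`<eos>` words is left out.

```python
BOW_TAG, EOW_TAG, PAD_TAG, OOV_TAG = "<bow>", "<eow>", "<pad>", "<oov>"


def word_to_char_ids(word, char_to_index, max_chars):
    """Character ids for one word: <bow> + chars + <eow>, padded to max_chars."""
    if len(word) + 2 > max_chars:
        word = word[:max_chars - 2]  # leave room for <bow> and <eow>
    unk = char_to_index[OOV_TAG]
    ids = [char_to_index[BOW_TAG]]
    ids += [char_to_index.get(c, unk) for c in word]
    ids.append(char_to_index[EOW_TAG])
    ids += [char_to_index[PAD_TAG]] * (max_chars - len(ids))
    return ids


# Hypothetical character vocabulary, for illustration only.
char_to_index = {PAD_TAG: 0, OOV_TAG: 1, BOW_TAG: 2, EOW_TAG: 3, "c": 4, "a": 5, "t": 6}
short_ids = word_to_char_ids("cat", char_to_index, max_chars=6)
long_ids = word_to_char_ids("catcat", char_to_index, max_chars=6)
```

Every word ends up with exactly `max_chars` ids, which is what allows the mapping to be stored as one fixed-size integer matrix.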
@ -848,7 +672,7 @@ class _ElmoModel(nn.Module):
int(word_size % batch_size != 0)
self.cached_word_embedding = nn.Embedding(word_size,
config['encoder']['projection_dim'])
config['lstm']['projection_dim'])
with torch.no_grad():
for i in range(num_batches):
words = torch.arange(i * batch_size,
@ -877,6 +701,8 @@ class _ElmoModel(nn.Module):
expanded_words[:, 0].fill_(self.bos_index)
expanded_words[torch.arange(batch_size).to(words), seq_len + 1] = self.eos_index
seq_len = seq_len + 2
zero_tensor = expanded_words.new_zeros(expanded_words.shape)
mask = (expanded_words == zero_tensor).unsqueeze(-1)
if hasattr(self, 'cached_word_embedding'):
token_embedding = self.cached_word_embedding(expanded_words)
else:
@ -886,20 +712,16 @@ class _ElmoModel(nn.Module):
chars = None
token_embedding = self.token_embedder(expanded_words, chars) # batch_size x max_len x embed_dim
if self.config['encoder']['name'] == 'elmo':
encoder_output = self.encoder(token_embedding, seq_len)
if encoder_output.size(2) < max_len + 2:
num_layers, _, output_len, hidden_size = encoder_output.size()
dummy_tensor = encoder_output.new_zeros(num_layers, batch_size,
max_len + 2 - output_len, hidden_size)
encoder_output = torch.cat((encoder_output, dummy_tensor), 2)
sz = encoder_output.size() # 2, batch_size, max_len, hidden_size
token_embedding = torch.cat((token_embedding, token_embedding), dim=2).view(1, sz[1], sz[2], sz[3])
encoder_output = torch.cat((token_embedding, encoder_output), dim=0)
elif self.config['encoder']['name'] == 'lstm':
encoder_output = self.encoder(token_embedding, seq_len)
else:
raise ValueError('Unknown encoder: {0}'.format(self.config['encoder']['name']))
encoder_output = self.encoder(token_embedding, seq_len)
if encoder_output.size(2) < max_len + 2:
num_layers, _, output_len, hidden_size = encoder_output.size()
dummy_tensor = encoder_output.new_zeros(num_layers, batch_size,
max_len + 2 - output_len, hidden_size)
encoder_output = torch.cat((encoder_output, dummy_tensor), 2)
sz = encoder_output.size() # 2, batch_size, max_len, hidden_size
token_embedding = token_embedding.masked_fill(mask, 0)
token_embedding = torch.cat((token_embedding, token_embedding), dim=2).view(1, sz[1], sz[2], sz[3])
encoder_output = torch.cat((token_embedding, encoder_output), dim=0)
# Remove <eos> and <bos>. The removal is not exact, but it should not affect the final result.
encoder_output = encoder_output[:, :, 1:-1]
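
The comment on the final slice says the removal is not exact. The reason is that `<eos>` is written at position `seq_len + 1` of each row, while the slice always drops the last position of the padded batch. A pure-Python sketch with token strings in place of tensors shows the effect (`add_bos_eos` and `trim_first_last` are illustrative helpers, not fastNLP code):

```python
def add_bos_eos(batch, pad="<pad>"):
    """Wrap every sequence in <bos>/<eos> and pad to the longest length + 2."""
    max_len = max(len(seq) for seq in batch)
    rows = []
    for seq in batch:
        row = ["<bos>"] + list(seq) + ["<eos>"]
        row += [pad] * (max_len + 2 - len(row))
        rows.append(row)
    return rows


def trim_first_last(rows):
    """Counterpart of the [:, :, 1:-1] slice on the time dimension."""
    return [row[1:-1] for row in rows]


rows = add_bos_eos([["a", "b", "c"], ["d"]])
trimmed = trim_first_last(rows)
```

For the longest sequence both markers are removed, but the shorter row keeps its `<eos>` and loses a padding slot instead. Those positions lie beyond the real sequence length, which is presumably why the comment expects no effect on the final result.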

View File

@ -8,9 +8,9 @@ import torch
import torch.nn.functional as F
from torch import nn
from ..dropout import TimestepDropout
from fastNLP.modules.dropout import TimestepDropout
from ..utils import initial_parameter
from fastNLP.modules.utils import initial_parameter
class DotAttention(nn.Module):
@ -45,8 +45,7 @@ class DotAttention(nn.Module):
class MultiHeadAttention(nn.Module):
"""
Alias: :class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.aggregator.attention.MultiHeadAttention`
Alias: :class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.encoder.attention.MultiHeadAttention`
:param input_size: int, size of the input dimension, which is also the size of the output dimension
:param key_size: int, dimension of each head

View File

@ -2,35 +2,22 @@
import os
from torch import nn
import torch
from ...io.file_utils import _get_base_url, cached_path
from ...io.file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
from ._bert import _WordPieceBertModel, BertModel
class BertWordPieceEncoder(nn.Module):
"""
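Loads a BERT model. After loading, call the index_dataset method to generate the word_pieces column in the dataset.
:param fastNLP.Vocabulary vocab: the vocabulary
:param str model_dir_or_name: directory containing the model, or the name of the model. Defaults to ``en-base-uncased``
:param str layers: which layers' representations appear in the final result, separated by ','; negative numbers index layers from the end
:param bool requires_grad: whether gradients are required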
"""
def __init__(self, model_dir_or_name:str='en-base-uncased', layers:str='-1',
requires_grad:bool=False):
def __init__(self, model_dir_or_name: str='en-base-uncased', layers: str='-1',
requires_grad: bool=False):
super().__init__()
PRETRAIN_URL = _get_base_url('bert')
PRETRAINED_BERT_MODEL_DIR = {'en': 'bert-base-cased-f89bfe08.zip',
'en-base-uncased': 'bert-base-uncased-3413b23c.zip',
'en-base-cased': 'bert-base-cased-f89bfe08.zip',
'en-large-uncased': 'bert-large-uncased-20939f45.zip',
'en-large-cased': 'bert-large-cased-e0cf90fc.zip',
'cn': 'bert-base-chinese-29d0a84a.zip',
'cn-base': 'bert-base-chinese-29d0a84a.zip',
'multilingual': 'bert-base-multilingual-cased-1bd364ee.zip',
'multilingual-base-uncased': 'bert-base-multilingual-uncased-f8730fe4.zip',
'multilingual-base-cased': 'bert-base-multilingual-cased-1bd364ee.zip',
}
if model_dir_or_name in PRETRAINED_BERT_MODEL_DIR:
model_name = PRETRAINED_BERT_MODEL_DIR[model_dir_or_name]
@ -89,4 +76,4 @@ class BertWordPieceEncoder(nn.Module):
outputs = self.model(word_pieces, token_type_ids)
outputs = torch.cat([*outputs], dim=-1)
return outputs
return outputs

View File

@ -135,7 +135,7 @@ class TokenEmbedding(nn.Module):
:param torch.LongTensor words: batch_size x max_len
:return:
"""
if self.dropout_word > 0 and self.training:
if self.word_dropout > 0 and self.training:
mask = torch.ones_like(words).float() * self.word_dropout
mask = torch.bernoulli(mask).byte() # the larger dropout_word is, the more positions are 1
words = words.masked_fill(mask, self._word_unk_index)
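The word-dropout step in the hunk above replaces each word index with the unknown-word index with probability `word_dropout`. A minimal sketch of that idea follows, written in plain Python so it runs without PyTorch; the function name `word_dropout`, its `rate` and `rng` parameters, and the sample batch are assumptions made for illustration, not part of fastNLP.

```python
import random


def word_dropout(words, rate, unk_index, rng):
    """Replace each word index with unk_index with probability `rate`.

    words: list of sentences, each a list of integer word indices
           (a stand-in for a batch_size x max_len LongTensor).
    rng:   a random.Random instance, passed in so results are repeatable.
    """
    return [
        [unk_index if rng.random() < rate else w for w in sentence]
        for sentence in words
    ]


# Hypothetical batch of two sentences; index 1 plays the role of <unk>.
batch = [[5, 8, 13, 2], [7, 7, 4, 9]]
dropped = word_dropout(batch, rate=0.5, unk_index=1, rng=random.Random(0))
```

As in the diff, this should only be applied in training mode; at evaluation time the indices pass through unchanged.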
@ -174,8 +174,16 @@ class TokenEmbedding(nn.Module):
def embed_size(self) -> int:
return self._embed_size
@property
def embedding_dim(self) -> int:
return self._embed_size
@property
def num_embedding(self) -> int:
"""
This value may be larger than the actual size of the embedding matrix.
:return:
"""
return len(self._word_vocab)
def get_word_vocab(self):
@ -531,11 +539,11 @@ class ElmoEmbedding(ContextualEmbedding):
self.model = _ElmoModel(model_dir, vocab, cache_word_reprs=cache_word_reprs)
if layers=='mix':
self.layer_weights = nn.Parameter(torch.zeros(self.model.config['encoder']['n_layers']+1),
self.layer_weights = nn.Parameter(torch.zeros(self.model.config['lstm']['n_layers']+1),
requires_grad=requires_grad)
self.gamma = nn.Parameter(torch.ones(1), requires_grad=requires_grad)
self._get_outputs = self._get_mixed_outputs
self._embed_size = self.model.config['encoder']['projection_dim'] * 2
self._embed_size = self.model.config['lstm']['projection_dim'] * 2
else:
layers = list(map(int, layers.split(',')))
assert len(layers) > 0, "Must choose one output"
@ -543,7 +551,7 @@ class ElmoEmbedding(ContextualEmbedding):
assert 0 <= layer <= 2, "Layer index should be in range [0, 2]."
self.layers = layers
self._get_outputs = self._get_layer_outputs
self._embed_size = len(self.layers) * self.model.config['encoder']['projection_dim'] * 2
self._embed_size = len(self.layers) * self.model.config['lstm']['projection_dim'] * 2
self.requires_grad = requires_grad
@ -810,7 +818,7 @@ class CNNCharEmbedding(TokenEmbedding):
# positions equal to 1 are masked
chars_masks = chars.eq(self.char_pad_index) # batch_size x max_len x max_word_len; if it is 0, this is a padding position
chars = self.char_embedding(chars) # batch_size x max_len x max_word_len x embed_size
self.dropout(chars)
chars = self.dropout(chars)
reshaped_chars = chars.reshape(batch_size*max_len, max_word_len, -1)
reshaped_chars = reshaped_chars.transpose(1, 2) # B' x E x M
conv_chars = [conv(reshaped_chars).transpose(1, 2).reshape(batch_size, max_len, max_word_len, -1)
@ -962,7 +970,7 @@ class LSTMCharEmbedding(TokenEmbedding):
chars = self.fc(chars)
return self.dropout(words)
return self.dropout(chars)
@property
def requires_grad(self):

View File

@ -1,7 +1,8 @@
__all__ = [
"MaxPool",
"MaxPoolWithMask",
"AvgPool"
"AvgPool",
"AvgPoolWithMask"
]
import torch
import torch.nn as nn
@ -9,7 +10,7 @@ import torch.nn as nn
class MaxPool(nn.Module):
"""
Alias: :class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.aggregator.pooling.MaxPool`
Alias: :class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.encoder.pooling.MaxPool`
Max-pooling module
@ -58,7 +59,7 @@ class MaxPool(nn.Module):
class MaxPoolWithMask(nn.Module):
"""
Alias: :class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.MaxPoolWithMask`
Alias: :class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.encoder.pooling.MaxPoolWithMask`
Max pooling with a mask matrix. Positions where the mask is 0 are ignored during max-pooling.
"""
@ -98,7 +99,7 @@ class KMaxPool(nn.Module):
class AvgPool(nn.Module):
"""
Alias: :class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.aggregator.pooling.AvgPool`
Alias: :class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.encoder.pooling.AvgPool`
Given an input of shape [batch_size, max_len, hidden_size], performs avg pooling over the last dimension. The output has shape [batch_size, hidden_size].
"""
@ -125,7 +126,7 @@ class AvgPool(nn.Module):
class AvgPoolWithMask(nn.Module):
"""
Alias: :class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.AvgPoolWithMask`
Alias: :class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.encoder.pooling.AvgPoolWithMask`
Given an input of shape [batch_size, max_len, hidden_size], performs avg pooling over the last dimension. The output has shape [batch_size, hidden_size].
Only positions where the mask is 1 are considered during pooling.
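The masked average pooling described in the `AvgPoolWithMask` docstring can be illustrated with a small sketch. It uses nested Python lists instead of tensors so it runs without PyTorch, and it averages over the `max_len` axis, which is what the stated output shape [batch_size, hidden_size] implies. The function name `avg_pool_with_mask` and the sample data are hypothetical, and the sketch assumes every sequence has at least one unmasked position.

```python
def avg_pool_with_mask(tensor, mask):
    """Average over the length axis, counting only positions where mask == 1.

    tensor: nested list of shape [batch_size][max_len][hidden_size]
    mask:   nested list of shape [batch_size][max_len], 1 = real token, 0 = padding
    Returns a nested list of shape [batch_size][hidden_size].
    """
    pooled = []
    for seq, seq_mask in zip(tensor, mask):
        hidden_size = len(seq[0])
        n_valid = sum(seq_mask)  # assumed > 0 for every sequence
        sums = [0.0] * hidden_size
        for vec, m in zip(seq, seq_mask):
            if m:
                for i, value in enumerate(vec):
                    sums[i] += value
        pooled.append([s / n_valid for s in sums])
    return pooled


# One sequence of length 3 whose last position is padding.
x = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
m = [[1, 1, 0]]
pooled = avg_pool_with_mask(x, m)  # the padded vector does not affect the mean
```

An unmasked mean over the same input would be pulled toward the padding vector, which is the reason for the masked variant.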

View File

@ -34,8 +34,8 @@ class StarTransformer(nn.Module):
super(StarTransformer, self).__init__()
self.iters = num_layers
self.norm = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(self.iters)])
self.emb_fc = nn.Conv2d(hidden_size, hidden_size, 1)
self.norm = nn.ModuleList([nn.LayerNorm(hidden_size, eps=1e-6) for _ in range(self.iters)])
# self.emb_fc = nn.Conv2d(hidden_size, hidden_size, 1)
self.emb_drop = nn.Dropout(dropout)
self.ring_att = nn.ModuleList(
[_MSA1(hidden_size, nhead=num_head, head_dim=head_dim, dropout=0.0)

View File

@ -3,7 +3,7 @@ __all__ = [
]
from torch import nn
from ..aggregator.attention import MultiHeadAttention
from fastNLP.modules.encoder.attention import MultiHeadAttention
from ..dropout import TimestepDropout

View File

@ -8,7 +8,8 @@ import os
from fastNLP.core.dataset import DataSet
from .utils import load_url
from .processor import ModelProcessor
from fastNLP.io.dataset_loader import _cut_long_sentence, ConllLoader
from fastNLP.io.dataset_loader import _cut_long_sentence
from fastNLP.io.data_loader import ConllLoader
from fastNLP.core.instance import Instance
from ..api.pipeline import Pipeline
from fastNLP.core.metrics import SpanFPreRecMetric

View File

@ -2,14 +2,14 @@
This directory reproduces models implemented in fastNLP, aiming to match the performance reported in the papers.
Reproduced models:
- [Star-Transformer](Star_transformer/)
- [Star-Transformer](Star_transformer)
- [Biaffine](https://github.com/fastnlp/fastNLP/blob/999a14381747068e9e6a7cc370037b320197db00/fastNLP/models/biaffine_parser.py#L239)
- [CNNText](https://github.com/fastnlp/fastNLP/blob/999a14381747068e9e6a7cc370037b320197db00/fastNLP/models/cnn_text_classification.py#L12)
- ...
# Task reproductions
## Text Classification
- still in progress
- [Text Classification reproduction](text_classification)
## Matching (natural language inference / sentence matching)
@ -20,12 +20,12 @@
- [NER](seqence_labelling/ner)
## Coreference resolution
- still in progress
## Coreference Resolution
- [Coreference Resolution reproduction](coreference_resolution)
## Summarization
- still in progress
- [Summarization reproduction](Summarization)
## ...

View File

@ -9,26 +9,3 @@ paper: [Star-Transformer](https://arxiv.org/abs/1902.09113)
|Text Classification|SST|-|51.2|
|Natural Language Inference|SNLI|-|83.76|
## Usage
``` python
# for sequence labeling(ner, pos tagging, etc)
from fastNLP.models.star_transformer import STSeqLabel
model = STSeqLabel(
vocab_size=10000, num_cls=50,
emb_dim=300)
# for sequence classification
from fastNLP.models.star_transformer import STSeqCls
model = STSeqCls(
vocab_size=10000, num_cls=50,
emb_dim=300)
# for natural language inference
from fastNLP.models.star_transformer import STNLICls
model = STNLICls(
vocab_size=10000, num_cls=50,
emb_dim=300)
```

View File

@ -2,8 +2,7 @@ import torch
import json
import os
from fastNLP import Vocabulary
from fastNLP.io.dataset_loader import ConllLoader
from fastNLP.io.data_loader import SSTLoader, SNLILoader
from fastNLP.io.data_loader import ConllLoader, SSTLoader, SNLILoader
from fastNLP.core import Const as C
import numpy as np

View File

@ -10,7 +10,8 @@ from fastNLP.models.star_transformer import STSeqLabel, STSeqCls, STNLICls
from fastNLP.core.const import Const as C
import sys
#sys.path.append('/remote-home/yfshao/workdir/dev_fastnlp/')
pre_dir = '/home/ec2-user/fast_data/'
import os
pre_dir = os.path.join(os.environ['HOME'], 'workdir/datasets/')
g_model_select = {
'pos': STSeqLabel,
@ -19,7 +20,7 @@ g_model_select = {
'nli': STNLICls,
}
g_emb_file_path = {'en': pre_dir + 'glove.840B.300d.txt',
g_emb_file_path = {'en': pre_dir + 'word_vector/glove.840B.300d.txt',
'zh': pre_dir + 'cc.zh.300.vec'}
g_args = None
@ -55,7 +56,7 @@ def get_conll2012_ner():
def get_sst():
path = pre_dir + 'sst'
path = pre_dir + 'SST'
files = ['train.txt', 'dev.txt', 'test.txt']
return load_sst(path, files)
@ -171,10 +172,10 @@ def train():
sampler=FN.BucketSampler(100, g_args.bsz, C.INPUT_LEN),
callbacks=[MyCallback()])
trainer.train()
print(trainer.train())
tester = FN.Tester(data=test_data, model=model, metrics=metric,
batch_size=128, device=device)
tester.test()
print(tester.test())
def test():

View File

@ -2,7 +2,7 @@ import pickle
import numpy as np
from fastNLP.core.vocabulary import Vocabulary
from fastNLP.io.base_loader import DataInfo
from fastNLP.io.base_loader import DataBundle
from fastNLP.io.dataset_loader import JsonLoader
from fastNLP.core.const import Const
@ -66,7 +66,7 @@ class SummarizationLoader(JsonLoader):
:param domain: bool build vocab for publication, use 'X' for unknown
:param tag: bool build vocab for tag, use 'X' for unknown
:param load_vocab: bool build vocab (False) or load vocab (True)
:return: DataInfo
:return: DataBundle
datasets: dict keys correspond to the paths dict
vocabs: dict key: vocab(if "train" in paths), domain(if domain=True), tag(if tag=True)
embeddings: optional
@ -182,7 +182,7 @@ class SummarizationLoader(JsonLoader):
for ds in datasets.values():
vocab_dict["vocab"].index_dataset(ds, field_name=Const.INPUT, new_field_name=Const.INPUT)
return DataInfo(vocabs=vocab_dict, datasets=datasets)
return DataBundle(vocabs=vocab_dict, datasets=datasets)

View File

@ -1,24 +1,36 @@
import unittest
from ..data.dataloader import SummarizationLoader
import sys
sys.path.append('..')
from data.dataloader import SummarizationLoader
vocab_size = 100000
vocab_path = "testdata/vocab"
sent_max_len = 100
doc_max_timesteps = 50
class TestSummarizationLoader(unittest.TestCase):
def test_case1(self):
sum_loader = SummarizationLoader()
paths = {"train":"testdata/train.jsonl", "valid":"testdata/val.jsonl", "test":"testdata/test.jsonl"}
data = sum_loader.process(paths=paths)
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps)
print(data.datasets)
def test_case2(self):
sum_loader = SummarizationLoader()
paths = {"train": "testdata/train.jsonl", "valid": "testdata/val.jsonl", "test": "testdata/test.jsonl"}
data = sum_loader.process(paths=paths, domain=True)
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps, domain=True)
print(data.datasets, data.vocabs)
def test_case3(self):
sum_loader = SummarizationLoader()
paths = {"train": "testdata/train.jsonl", "valid": "testdata/val.jsonl", "test": "testdata/test.jsonl"}
data = sum_loader.process(paths=paths, tag=True)
print(data.datasets, data.vocabs)
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps, tag=True)
print(data.datasets, data.vocabs)

View File

@ -3,7 +3,7 @@ from datetime import timedelta
from fastNLP.io.dataset_loader import JsonLoader
from fastNLP.modules.encoder._bert import BertTokenizer
from fastNLP.io.base_loader import DataInfo
from fastNLP.io.base_loader import DataBundle
from fastNLP.core.const import Const
class BertData(JsonLoader):
@ -110,7 +110,7 @@ class BertData(JsonLoader):
# set padding value
datasets[name].set_pad_val('article', 0)
return DataInfo(datasets=datasets)
return DataBundle(datasets=datasets)
class BertSumLoader(JsonLoader):
@ -154,4 +154,4 @@ class BertSumLoader(JsonLoader):
print('Finished in {}'.format(timedelta(seconds=time()-start)))
return DataInfo(datasets=datasets)
return DataBundle(datasets=datasets)

View File

@ -11,7 +11,7 @@ Coreference resolution是查找文本中指向同一现实实体的所有表达
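Due to copyright restrictions, this document cannot provide the dataset for download; please download it yourself.
The original dataset is in CoNLL format; for details, see the official introduction page provided with the dataset.
The implementation uses the preprocessing method of the paper's author, Lee; for details see [this link](https://github.com/kentonl/e2e-coref/blob/e2e/setup_training.sh).
The implementation uses the preprocessing method of the paper's author, Lee; for details see [this link](https://github.com/kentonl/e2e-coref/blob/e2e/setup_training.sh).
The processed dataset is in JSON format, for example: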
```
{
@ -25,12 +25,12 @@ Coreference resolution是查找文本中指向同一现实实体的所有表达
### Embedding downloads
[turian embedding](https://lil.cs.washington.edu/coref/turian.50d.txt)
[glove embedding]( https://nlp.stanford.edu/data/glove.840B.300d.zip)
[glove embedding](https://nlp.stanford.edu/data/glove.840B.300d.zip)
## Running
```python
```shell
# training
CUDA_VISIBLE_DEVICES=0 python train.py
# testing
@ -39,9 +39,9 @@ CUDA_VISIBLE_DEVICES=0 python valid.py
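## Results
The original paper's authors achieved 67.2% on the test set; AllenNLP's reproduction achieved [63.0%](https://allennlp.org/models).
Note that allenNLP trained without speaker information and without variational dropout, and used only 100 antecedents instead of 250.
Note that AllenNLP trained without speaker information and without variational dropout, and used only 100 antecedents instead of 250.
With the same hyperparameters and configuration as allenNLP, this reproduction achieved an F1 of 63.6%.
With the same hyperparameters and configuration as AllenNLP, this reproduction achieved an F1 of 63.6%.
## Issues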

View File

@ -1,7 +1,7 @@
from fastNLP.io.dataset_loader import JsonLoader,DataSet,Instance
from fastNLP.io.file_reader import _read_json
from fastNLP.core.vocabulary import Vocabulary
from fastNLP.io.base_loader import DataInfo
from fastNLP.io.base_loader import DataBundle
from reproduction.coreference_resolution.model.config import Config
import reproduction.coreference_resolution.model.preprocess as preprocess
@ -26,7 +26,7 @@ class CRLoader(JsonLoader):
return dataset
def process(self, paths, **kwargs):
data_info = DataInfo()
data_info = DataBundle()
for name in ['train', 'test', 'dev']:
data_info.datasets[name] = self.load(paths[name])

View File

@ -1,7 +1,7 @@
from fastNLP.io.base_loader import DataSetLoader, DataInfo
from fastNLP.io.dataset_loader import ConllLoader
from fastNLP.io.base_loader import DataSetLoader, DataBundle
from fastNLP.io.data_loader import ConllLoader
import numpy as np
from itertools import chain
@ -76,7 +76,7 @@ class CTBxJointLoader(DataSetLoader):
gold_label_word_pairs:
"""
paths = check_dataloader_paths(paths)
data = DataInfo()
data = DataBundle()
for name, path in paths.items():
dataset = self.load(path)

View File

@ -2,13 +2,13 @@
Here we use fastNLP to reproduce several well-known models for Matching tasks, aiming to match the performance reported in the papers. The evaluation metric for all of these tasks is accuracy (%).
Reproduced models (ordered by paper publication date):
- CNTN: model code (still in progress)[](); training code (still in progress)[]().
- CNTN: [model code](model/cntn.py); [training code](matching_cntn.py).
Paper: [Convolutional Neural Tensor Network Architecture for Community-based Question Answering](https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11401/10844).
- ESIM: [model code](model/esim.py); [training code](matching_esim.py).
Paper: [Enhanced LSTM for Natural Language Inference](https://arxiv.org/pdf/1609.06038.pdf).
- DIIN: model code (still in progress)[](); training code (still in progress)[]().
Paper: [Natural Language Inference over Interaction Space](https://arxiv.org/pdf/1709.04348.pdf).
- MwAN: model code (still in progress)[](); training code (still in progress)[]().
- MwAN: [model code](model/mwan.py); [training code](matching_mwan.py).
Paper: [Multiway Attention Networks for Modeling Sentence Pairs](https://www.ijcai.org/proceedings/2018/0613.pdf).
- BERT: [model code](model/bert.py); [training code](matching_bert.py).
Paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf).
@ -21,10 +21,10 @@
model name | SNLI | MNLI | RTE | QNLI | Quora
:---: | :---: | :---: | :---: | :---: | :---:
CNTN [](); [paper](https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11401/10844) | 74.53 vs - | 60.84/-(dev) vs - | 57.4(dev) vs - | 62.53(dev) vs - | - |
CNTN [code](model/cntn.py); [paper](https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11401/10844) | 77.79 vs - | 63.29/63.16(dev) vs - | 57.04(dev) vs - | 62.38(dev) vs - | - |
ESIM [code](model/esim.py); [paper](https://arxiv.org/pdf/1609.06038.pdf) | 88.13(glove) vs 88.0(glove)/88.7(elmo) | 77.78/76.49 vs 72.4/72.1* | 59.21(dev) vs - | 76.97(dev) vs - | - |
DIIN [](); [paper](https://arxiv.org/pdf/1709.04348.pdf) | - vs 88.0 | - vs 78.8/77.8 | - | - | - vs 89.06 |
MwAN [](); [paper](https://www.ijcai.org/proceedings/2018/0613.pdf) | 87.9 vs 88.3 | 77.3/76.7(dev) vs 78.5/77.7 | - | 74.6(dev) vs - | 85.6 vs 89.12 |
MwAN [code](model/mwan.py); [paper](https://www.ijcai.org/proceedings/2018/0613.pdf) | 87.9 vs 88.3 | 77.3/76.7(dev) vs 78.5/77.7 | - | 74.6(dev) vs - | 85.6 vs 89.12 |
BERT (BASE version) [code](model/bert.py); [paper](https://arxiv.org/pdf/1810.04805.pdf) | 90.6 vs - | - vs 84.6/83.4| 67.87(dev) vs 66.4 | 90.97(dev) vs 90.5 | - |
*The official MNLI reproduction of the ESIM model reports 72.4/72.1; the original ESIM paper did not report results on the MNLI dataset.
@ -44,7 +44,7 @@ Performance on Test set:
model name | CNTN | ESIM | DIIN | MwAN | BERT-Base | BERT-Large
:---: | :---: | :---: | :---: | :---: | :---: | :---:
__performance__ | - | 88.13 | - | 87.9 | 90.6 | 91.16
__performance__ | 77.79 | 88.13 | - | 87.9 | 90.6 | 91.16
## MNLI
[Link to MNLI main page](https://www.nyu.edu/projects/bowman/multinli/)
@ -60,7 +60,7 @@ Performance on Test set(matched/mismatched):
model name | CNTN | ESIM | DIIN | MwAN | BERT-Base
:---: | :---: | :---: | :---: | :---: | :---: |
__performance__ | - | 77.78/76.49 | - | 77.3/76.7(dev) | - |
__performance__ | 63.29/63.16(dev) | 77.78/76.49 | - | 77.3/76.7(dev) | - |
## RTE
@ -92,7 +92,7 @@ Performance on __Dev__ set:
model name | CNTN | ESIM | DIIN | MwAN | BERT
:---: | :---: | :---: | :---: | :---: | :---:
__performance__ | - | 76.97 | - | 74.6 | -
__performance__ | 62.38 | 76.97 | - | 74.6 | -
## Quora

View File

@ -5,7 +5,7 @@ from typing import Union, Dict
from fastNLP.core.const import Const
from fastNLP.core.vocabulary import Vocabulary
from fastNLP.io.base_loader import DataInfo, DataSetLoader
from fastNLP.io.base_loader import DataBundle, DataSetLoader
from fastNLP.io.dataset_loader import JsonLoader, CSVLoader
from fastNLP.io.file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
from fastNLP.modules.encoder._bert import BertTokenizer
@ -35,7 +35,7 @@ class MatchingLoader(DataSetLoader):
to_lower=False, seq_len_type: str=None, bert_tokenizer: str=None,
cut_text: int = None, get_index=True, auto_pad_length: int=None,
auto_pad_token: str='<pad>', set_input: Union[list, str, bool]=True,
set_target: Union[list, str, bool] = True, concat: Union[str, list, bool]=None, ) -> DataInfo:
set_target: Union[list, str, bool] = True, concat: Union[str, list, bool]=None, ) -> DataBundle:
"""
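:param paths: str or Dict[str, str]. If a str, it is the folder containing the dataset or a full file path; if it is a folder,
the dataset names and file names are looked up in self.paths. If a Dict, it maps dataset names (such as train, dev, test)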
@ -80,7 +80,7 @@ class MatchingLoader(DataSetLoader):
else:
path = paths
data_info = DataInfo()
data_info = DataBundle()
for data_name in path.keys():
data_info.datasets[data_name] = self._load(path[data_name])

View File

@ -0,0 +1,145 @@
import sys
import os
import random
import numpy as np
import torch
from torch.optim import Adadelta, SGD
from torch.optim.lr_scheduler import StepLR
from tqdm import tqdm
from fastNLP import CrossEntropyLoss
from fastNLP import cache_results
from fastNLP.core import Trainer, Tester, Adam, AccuracyMetric, Const
from fastNLP.core.predictor import Predictor
from fastNLP.core.callback import GradientClipCallback, LRScheduler, FitlogCallback
from fastNLP.modules.encoder.embedding import ElmoEmbedding, StaticEmbedding
from fastNLP.io.data_loader import MNLILoader, QNLILoader, QuoraLoader, SNLILoader, RTELoader
from reproduction.matching.model.mwan import MwanModel
import fitlog
fitlog.debug()
import argparse
argument = argparse.ArgumentParser()
argument.add_argument('--task' , choices = ['snli', 'rte', 'qnli', 'mnli'],default = 'snli')
argument.add_argument('--batch-size' , type = int , default = 128)
argument.add_argument('--n-epochs' , type = int , default = 50)
argument.add_argument('--lr' , type = float , default = 1)
argument.add_argument('--testset-name' , type = str , default = 'test')
argument.add_argument('--devset-name' , type = str , default = 'dev')
argument.add_argument('--seed' , type = int , default = 42)
argument.add_argument('--hidden-size' , type = int , default = 150)
argument.add_argument('--dropout' , type = float , default = 0.3)
arg = argument.parse_args()
random.seed(arg.seed)
np.random.seed(arg.seed)
torch.manual_seed(arg.seed)
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
torch.cuda.manual_seed_all(arg.seed)
print (n_gpu)
for k in arg.__dict__:
print(k, arg.__dict__[k], type(arg.__dict__[k]))
# load data set
if arg.task == 'snli':
@cache_results(f'snli_mwan.pkl')
def read_snli():
data_info = SNLILoader().process(
paths='path/to/snli/data', to_lower=True, seq_len_type=None, bert_tokenizer=None,
get_index=True, concat=False, extra_split=['/','%','-'],
)
return data_info
data_info = read_snli()
elif arg.task == 'rte':
@cache_results(f'rte_mwan.pkl')
def read_rte():
data_info = RTELoader().process(
paths='path/to/rte/data', to_lower=True, seq_len_type=None, bert_tokenizer=None,
get_index=True, concat=False, extra_split=['/','%','-'],
)
return data_info
data_info = read_rte()
elif arg.task == 'qnli':
data_info = QNLILoader().process(
paths='path/to/qnli/data', to_lower=True, seq_len_type=None, bert_tokenizer=None,
get_index=True, concat=False , cut_text=512, extra_split=['/','%','-'],
)
elif arg.task == 'mnli':
@cache_results(f'mnli_v0.9_mwan.pkl')
def read_mnli():
data_info = MNLILoader().process(
paths='path/to/mnli/data', to_lower=True, seq_len_type=None, bert_tokenizer=None,
get_index=True, concat=False, extra_split=['/','%','-'],
)
return data_info
data_info = read_mnli()
else:
raise RuntimeError(f'NOT support {arg.task} task yet!')
print(data_info)
print(len(data_info.vocabs['words']))
model = MwanModel(
num_class = len(data_info.vocabs[Const.TARGET]),
EmbLayer = StaticEmbedding(data_info.vocabs[Const.INPUT], requires_grad=False, normalize=False),
ElmoLayer = None,
args_of_imm = {
"input_size" : 300 ,
"hidden_size" : arg.hidden_size ,
"dropout" : arg.dropout ,
"use_allennlp" : False ,
} ,
)
optimizer = Adadelta(lr=arg.lr, params=model.parameters())
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)
callbacks = [
LRScheduler(scheduler),
]
if arg.task in ['snli']:
callbacks.append(FitlogCallback(data_info.datasets[arg.testset_name], verbose=1))
elif arg.task == 'mnli':
callbacks.append(FitlogCallback({'dev_matched': data_info.datasets['dev_matched'],
'dev_mismatched': data_info.datasets['dev_mismatched']},
verbose=1))
trainer = Trainer(
train_data = data_info.datasets['train'],
model = model,
optimizer = optimizer,
num_workers = 0,
batch_size = arg.batch_size,
n_epochs = arg.n_epochs,
print_every = -1,
dev_data = data_info.datasets[arg.devset_name],
metrics = AccuracyMetric(pred = "pred" , target = "target"),
metric_key = 'acc',
device = [i for i in range(torch.cuda.device_count())],
check_code_level = -1,
callbacks = callbacks,
loss = CrossEntropyLoss(pred = "pred" , target = "target")
)
trainer.train(load_best_model=True)
tester = Tester(
data=data_info.datasets[arg.testset_name],
model=model,
metrics=AccuracyMetric(),
batch_size=arg.batch_size,
device=[i for i in range(torch.cuda.device_count())],
)
tester.test()
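The training script above pairs `Adadelta` (default `--lr` of 1) with `StepLR(optimizer, step_size=10, gamma=0.5)`. As a rough illustration of the resulting schedule, the sketch below evaluates the closed form of a step decay in plain Python. It assumes the scheduler is stepped once per epoch; the helper name `step_lr` is made up for this example and is not a PyTorch or fastNLP function.

```python
def step_lr(initial_lr, epoch, step_size=10, gamma=0.5):
    """Learning rate after `epoch` completed epochs under step decay.

    The rate is multiplied by `gamma` once every `step_size` epochs.
    """
    return initial_lr * gamma ** (epoch // step_size)


# Sample a few epochs from the default 50-epoch run with lr = 1.
schedule = [step_lr(1.0, epoch) for epoch in (0, 9, 10, 25, 49)]
# schedule == [1.0, 1.0, 0.5, 0.25, 0.0625]
```

Under these assumptions the learning rate is halved four times over the default 50 epochs, ending at 1/16 of its initial value.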

View File

@ -0,0 +1,455 @@
import torch as tc
import torch.nn as nn
import torch.nn.functional as F
import sys
import os
import math
from fastNLP.core.const import Const
class RNNModel(nn.Module):
def __init__(self, input_size, hidden_size, num_layers, bidrect, dropout):
super(RNNModel, self).__init__()
if num_layers <= 1:
dropout = 0.0
self.rnn = nn.GRU(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers,
batch_first=True, dropout=dropout, bidirectional=bidrect)
self.number = (2 if bidrect else 1) * num_layers
def forward(self, x, mask):
'''
mask: (batch_size, seq_len)
x: (batch_size, seq_len, input_size)
'''
lens = (mask).long().sum(dim=1)
lens, idx_sort = tc.sort(lens, descending=True)
_, idx_unsort = tc.sort(idx_sort)
x = x[idx_sort]
x = nn.utils.rnn.pack_padded_sequence(x, lens, batch_first=True)
self.rnn.flatten_parameters()
y, h = self.rnn(x)
y, lens = nn.utils.rnn.pad_packed_sequence(y, batch_first=True)
h = h.transpose(0,1).contiguous() #make batch size first
y = y[idx_unsort] #(batch_size, seq_len, bid * hid_size)
h = h[idx_unsort] #(batch_size, number, hid_size)
return y, h
class Contexualizer(nn.Module):
def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.3):
super(Contexualizer, self).__init__()
self.rnn = RNNModel(input_size, hidden_size, num_layers, True, dropout)
self.output_size = hidden_size * 2
self.reset_parameters()
def reset_parameters(self):
weights = self.rnn.rnn.all_weights
for w1 in weights:
for w2 in w1:
if len(list(w2.size())) <= 1:
w2.data.fill_(0)
else: nn.init.xavier_normal_(w2.data, gain=1.414)
def forward(self, s, mask):
y = self.rnn(s, mask)[0] # (batch_size, seq_len, 2 * hidden_size)
return y
class ConcatAttention_Param(nn.Module):
def __init__(self, input_size, hidden_size, dropout=0.2):
super(ConcatAttention_Param, self).__init__()
self.ln = nn.Linear(input_size + hidden_size, hidden_size)
self.v = nn.Linear(hidden_size, 1, bias=False)
self.vq = nn.Parameter(tc.rand(hidden_size))
self.drop = nn.Dropout(dropout)
self.output_size = input_size
self.reset_parameters()
def reset_parameters(self):
nn.init.xavier_uniform_(self.v.weight.data)
nn.init.xavier_uniform_(self.ln.weight.data)
self.ln.bias.data.fill_(0)
def forward(self, h, mask):
'''
h: (batch_size, len, input_size)
mask: (batch_size, len)
'''
vq = self.vq.view(1,1,-1).expand(h.size(0), h.size(1), self.vq.size(0))
s = self.v(tc.tanh(self.ln(tc.cat([h,vq],-1)))).squeeze(-1) # (batch_size, len)
s = s - ((mask == 0).float() * 10000)
a = tc.softmax(s, dim=1)
r = a.unsqueeze(-1) * h # (batch_size, len, input_size)
r = tc.sum(r, dim=1) # (batch_size, input_size)
return self.drop(r)
def get_2dmask(mask_hq, mask_hp, siz=None):
if siz is None:
siz = (mask_hq.size(0), mask_hq.size(1), mask_hp.size(1))
mask_mat = 1
if mask_hq is not None:
mask_mat = mask_mat * mask_hq.unsqueeze(2).expand(siz)
if mask_hp is not None:
mask_mat = mask_mat * mask_hp.unsqueeze(1).expand(siz)
return mask_mat
def Attention(hq, hp, mask_hq, mask_hp, my_method):
standard_size = (hq.size(0), hq.size(1), hp.size(1), hq.size(-1))
mask_mat = get_2dmask(mask_hq, mask_hp, standard_size[:-1])
hq_mat = hq.unsqueeze(2).expand(standard_size)
hp_mat = hp.unsqueeze(1).expand(standard_size)
s = my_method(hq_mat, hp_mat) # (batch_size, len_q, len_p)
s = s - ((mask_mat == 0).float() * 10000)
a = tc.softmax(s, dim=1)
q = a.unsqueeze(-1) * hq_mat #(batch_size, len_q, len_p, input_size)
q = tc.sum(q, dim=1) #(batch_size, len_p, input_size)
return q
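The `Attention` helper above masks padded positions by subtracting 10000 from their scores before the softmax, so their weights become negligible. The following sketch shows that trick on a single score vector in plain Python (no PyTorch needed); `masked_softmax` and its `penalty` parameter are names chosen for this illustration only.

```python
import math


def masked_softmax(scores, mask, penalty=10000.0):
    """Softmax over `scores` where positions with mask == 0 get ~zero weight.

    A large constant is subtracted from masked scores instead of removing
    them, mirroring `s - ((mask == 0).float() * 10000)` in the model code.
    """
    shifted = [s - penalty * (m == 0) for s, m in zip(scores, mask)]
    top = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# The third position has the highest raw score but is padding.
weights = masked_softmax([2.0, 1.0, 3.0], [1, 1, 0])
```

The weights still sum to 1, and the masked position contributes essentially nothing even though its raw score was the largest. In the model the same operation is applied along the `len_q` axis of a batched score tensor.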
class ConcatAttention(nn.Module):
def __init__(self, input_size, hidden_size, dropout=0.2, input_size_2=-1):
super(ConcatAttention, self).__init__()
if input_size_2 < 0:
input_size_2 = input_size
self.ln = nn.Linear(input_size + input_size_2, hidden_size)
self.v = nn.Linear(hidden_size, 1, bias=False)
self.drop = nn.Dropout(dropout)
self.output_size = input_size
self.reset_parameters()
def reset_parameters(self):
nn.init.xavier_uniform_(self.v.weight.data)
nn.init.xavier_uniform_(self.ln.weight.data)
self.ln.bias.data.fill_(0)
def my_method(self, hq_mat, hp_mat):
s = tc.cat([hq_mat, hp_mat], dim=-1)
s = self.v(tc.tanh(self.ln(s))).squeeze(-1) #(batch_size, len_q, len_p)
return s
def forward(self, hq, hp, mask_hq=None, mask_hp=None):
'''
hq: (batch_size, len_q, input_size)
mask_hq: (batch_size, len_q)
'''
return self.drop(Attention(hq, hp, mask_hq, mask_hp, self.my_method))
class MinusAttention(nn.Module):
def __init__(self, input_size, hidden_size, dropout=0.2):
super(MinusAttention, self).__init__()
self.ln = nn.Linear(input_size, hidden_size)
self.v = nn.Linear(hidden_size, 1, bias=False)
self.drop = nn.Dropout(dropout)
self.output_size = input_size
self.reset_parameters()
def reset_parameters(self):
nn.init.xavier_uniform_(self.v.weight.data)
nn.init.xavier_uniform_(self.ln.weight.data)
self.ln.bias.data.fill_(0)
def my_method(self, hq_mat, hp_mat):
s = hq_mat - hp_mat
s = self.v(tc.tanh(self.ln(s))).squeeze(-1) #(batch_size, len_q, len_p) s[j,t]
return s
def forward(self, hq, hp, mask_hq=None, mask_hp=None):
return self.drop(Attention(hq, hp, mask_hq, mask_hp, self.my_method))
class DotProductAttention(nn.Module):
def __init__(self, input_size, hidden_size, dropout=0.2):
super(DotProductAttention, self).__init__()
self.ln = nn.Linear(input_size, hidden_size)
self.v = nn.Linear(hidden_size, 1, bias=False)
self.drop = nn.Dropout(dropout)
self.output_size = input_size
self.reset_parameters()
def reset_parameters(self):
nn.init.xavier_uniform_(self.v.weight.data)
nn.init.xavier_uniform_(self.ln.weight.data)
self.ln.bias.data.fill_(0)
def my_method(self, hq_mat, hp_mat):
s = hq_mat * hp_mat
s = self.v(tc.tanh(self.ln(s))).squeeze(-1) #(batch_size, len_q, len_p) s[j,t]
return s
def forward(self, hq, hp, mask_hq=None, mask_hp=None):
return self.drop(Attention(hq, hp, mask_hq, mask_hp, self.my_method))
class BiLinearAttention(nn.Module):
def __init__(self, input_size, hidden_size, dropout=0.2, input_size_2=-1):
super(BiLinearAttention, self).__init__()
input_size_2 = input_size if input_size_2 < 0 else input_size_2
self.ln = nn.Linear(input_size_2, input_size)
self.drop = nn.Dropout(dropout)
self.output_size = input_size
self.reset_parameters()
def reset_parameters(self):
nn.init.xavier_uniform_(self.ln.weight.data)
self.ln.bias.data.fill_(0)
def my_method(self, hq, hp, mask_p):
# (bs, len, input_size)
hp = self.ln(hp)
hp = hp * mask_p.unsqueeze(-1)
s = tc.matmul(hq, hp.transpose(-1,-2))
return s
def forward(self, hq, hp, mask_hq=None, mask_hp=None):
standard_size = (hq.size(0), hq.size(1), hp.size(1), hq.size(-1))
mask_mat = get_2dmask(mask_hq, mask_hp, standard_size[:-1])
s = self.my_method(hq, hp, mask_hp) # (batch_size, len_q, len_p)
s = s - ((mask_mat == 0).float() * 10000)
a = tc.softmax(s, dim=1)
hq_mat = hq.unsqueeze(2).expand(standard_size)
q = a.unsqueeze(-1) * hq_mat #(batch_size, len_q, len_p, input_size)
q = tc.sum(q, dim=1) #(batch_size, len_p, input_size)
return self.drop(q)
class AggAttention(nn.Module):
def __init__(self, input_size, hidden_size, dropout=0.2):
super(AggAttention, self).__init__()
self.ln = nn.Linear(input_size + hidden_size, hidden_size)
self.v = nn.Linear(hidden_size, 1, bias=False)
self.vq = nn.Parameter(tc.rand(hidden_size, 1))
self.drop = nn.Dropout(dropout)
self.output_size = input_size
self.reset_parameters()
def reset_parameters(self):
nn.init.xavier_uniform_(self.vq.data)
nn.init.xavier_uniform_(self.v.weight.data)
nn.init.xavier_uniform_(self.ln.weight.data)
self.ln.bias.data.fill_(0)
self.vq.data = self.vq.data[:,0]
def forward(self, hs, mask):
'''
hs: [(batch_size, len_q, input_size), ...]
mask: (batch_size, len_q)
'''
hs = tc.cat([h.unsqueeze(0) for h in hs], dim=0)# (4, batch_size, len_q, input_size)
vq = self.vq.view(1,1,1,-1).expand(hs.size(0), hs.size(1), hs.size(2), self.vq.size(0))
s = self.v(tc.tanh(self.ln(tc.cat([hs,vq],-1)))).squeeze(-1)# (4, batch_size, len_q)
s = s - ((mask.unsqueeze(0) == 0).float() * 10000)
a = tc.softmax(s, dim=0)
x = a.unsqueeze(-1) * hs
x = tc.sum(x, dim=0)#(batch_size, len_q, input_size)
return self.drop(x)
class Aggragator(nn.Module):
def __init__(self, input_size, hidden_size, dropout=0.3):
super(Aggragator, self).__init__()
now_size = input_size
self.ln = nn.Linear(2 * input_size, 2 * input_size)
now_size = 2 * input_size
self.rnn = Contexualizer(now_size, hidden_size, 2, dropout)
now_size = self.rnn.output_size
self.agg_att = AggAttention(now_size, now_size, dropout)
now_size = self.agg_att.output_size
self.agg_rnn = Contexualizer(now_size, hidden_size, 2, dropout)
self.drop = nn.Dropout(dropout)
self.output_size = self.agg_rnn.output_size
def forward(self, qs, hp, mask):
'''
qs: [ (batch_size, len_p, input_size), ...]
hp: (batch_size, len_p, input_size)
mask is the same as hp's mask
'''
hs = [0 for _ in range(len(qs))]
for i in range(len(qs)):
q = qs[i]
x = tc.cat([q, hp], dim=-1)
g = tc.sigmoid(self.ln(x))
x_star = x * g
h = self.rnn(x_star, mask)
hs[i] = h
x = self.agg_att(hs, mask) #(batch_size, len_p, output_size)
h = self.agg_rnn(x, mask) #(batch_size, len_p, output_size)
return self.drop(h)
class Mwan_Imm(nn.Module):
def __init__(self, input_size, hidden_size, num_class=3, dropout=0.2, use_allennlp=False):
super(Mwan_Imm, self).__init__()
now_size = input_size
self.enc_s1 = Contexualizer(now_size, hidden_size, 2, dropout)
self.enc_s2 = Contexualizer(now_size, hidden_size, 2, dropout)
now_size = self.enc_s1.output_size
self.att_c = ConcatAttention(now_size, hidden_size, dropout)
self.att_b = BiLinearAttention(now_size, hidden_size, dropout)
self.att_d = DotProductAttention(now_size, hidden_size, dropout)
self.att_m = MinusAttention(now_size, hidden_size, dropout)
now_size = self.att_c.output_size
self.agg = Aggragator(now_size, hidden_size, dropout)
now_size = self.enc_s1.output_size
self.pred_1 = ConcatAttention_Param(now_size, hidden_size, dropout)
now_size = self.agg.output_size
self.pred_2 = ConcatAttention(now_size, hidden_size, dropout,
input_size_2=self.pred_1.output_size)
now_size = self.pred_2.output_size
self.ln1 = nn.Linear(now_size, hidden_size)
self.ln2 = nn.Linear(hidden_size, num_class)
self.reset_parameters()
def reset_parameters(self):
nn.init.xavier_uniform_(self.ln1.weight.data)
nn.init.xavier_uniform_(self.ln2.weight.data)
self.ln1.bias.data.fill_(0)
self.ln2.bias.data.fill_(0)
def forward(self, s1, s2, mas_s1, mas_s2):
hq = self.enc_s1(s1, mas_s1) #(batch_size, len_q, output_size)
hp = self.enc_s1(s2, mas_s2)
mas_s1 = mas_s1[:,:hq.size(1)]
mas_s2 = mas_s2[:,:hp.size(1)]
mas_q, mas_p = mas_s1, mas_s2
qc = self.att_c(hq, hp, mas_s1, mas_s2) #(batch_size, len_p, output_size)
qb = self.att_b(hq, hp, mas_s1, mas_s2)
qd = self.att_d(hq, hp, mas_s1, mas_s2)
qm = self.att_m(hq, hp, mas_s1, mas_s2)
ho = self.agg([qc,qb,qd,qm], hp, mas_s2) #(batch_size, len_p, output_size)
rq = self.pred_1(hq, mas_q) #(batch_size, output_size)
rp = self.pred_2(ho, rq.unsqueeze(1), mas_p)#(batch_size, 1, output_size)
rp = rp.squeeze(1) #(batch_size, output_size)
rp = F.relu(self.ln1(rp))
rp = self.ln2(rp)
return rp
class MwanModel(nn.Module):
def __init__(self, num_class, EmbLayer, args_of_imm={}, ElmoLayer=None):
super(MwanModel, self).__init__()
self.emb = EmbLayer
if ElmoLayer is not None:
self.elmo = ElmoLayer
self.elmo_preln = nn.Linear(3 * self.elmo.emb_size, self.elmo.emb_size)
self.elmo_ln = nn.Linear(args_of_imm["input_size"] +
self.elmo.emb_size, args_of_imm["input_size"])
else:
self.elmo = None
self.imm = Mwan_Imm(num_class=num_class, **args_of_imm)
self.drop = nn.Dropout(args_of_imm["dropout"])
def forward(self, words1, words2, str_s1=None, str_s2=None, *pargs, **kwargs):
'''
str_s is only used by ELMo; however, we don't use ELMo
str_s: (batch_size, seq_len, word_len)
'''
s1, s2 = words1, words2
mas_s1 = (s1 != 0).float() # mas: (batch_size, seq_len)
mas_s2 = (s2 != 0).float() # mas: (batch_size, seq_len)
mas_s1.requires_grad = False
mas_s2.requires_grad = False
s1_emb = self.emb(s1)
s2_emb = self.emb(s2)
if self.elmo is not None:
s1_elmo = self.elmo(str_s1)
s2_elmo = self.elmo(str_s2)
s1_elmo = tc.tanh(self.elmo_preln(tc.cat(s1_elmo, dim=-1)))
s2_elmo = tc.tanh(self.elmo_preln(tc.cat(s2_elmo, dim=-1)))
s1_emb = tc.cat([s1_emb, s1_elmo], dim=-1)
s2_emb = tc.cat([s2_emb, s2_elmo], dim=-1)
s1_emb = tc.tanh(self.elmo_ln(s1_emb))
s2_emb = tc.tanh(self.elmo_ln(s2_emb))
s1_emb = self.drop(s1_emb)
s2_emb = self.drop(s2_emb)
y = self.imm(s1_emb, s2_emb, mas_s1, mas_s2)
return {
Const.OUTPUT: y,
}
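The attention modules in this file hide padded positions by subtracting a large constant (10000) from their scores before the softmax. The following is a minimal pure-Python sketch of that trick for a single score vector. `masked_softmax` and its `penalty` parameter are illustrative names and are not part of the model code.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[int], penalty: float = 10000.0) -> List[float]:
    """Softmax over `scores` where positions with mask == 0 get ~0 weight."""
    # Same idea as `s - (mask == 0).float() * 10000` in the model code.
    shifted = [s - penalty * (1 - m) for s, m in zip(scores, mask)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_softmax([1.0, 2.0, 3.0], mask=[1, 1, 0])
print(weights)  # the last (masked) position receives a weight of effectively zero
```

One likely reason for a finite penalty rather than negative infinity is that a fully masked row then still yields finite (uniform) weights instead of NaN.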

View File

@ -1,7 +1,7 @@
from fastNLP.io.embed_loader import EmbeddingOption, EmbedLoader
from fastNLP.core.vocabulary import VocabularyOption
from fastNLP.io.base_loader import DataSetLoader, DataInfo
from fastNLP.io.base_loader import DataSetLoader, DataBundle
from typing import Union, Dict, List, Iterator
from fastNLP import DataSet
from fastNLP import Instance
@ -161,7 +161,7 @@ class SigHanLoader(DataSetLoader):
# We recommend validating paths with check_dataloader_paths
paths = check_dataloader_paths(paths)
datasets = {}
data = DataInfo()
data = DataBundle()
bigram = bigram_vocab_opt is not None
for name, path in paths.items():
dataset = self.load(path, bigram=bigram)

View File

@ -0,0 +1,93 @@
from fastNLP.core.vocabulary import VocabularyOption
from fastNLP.io.base_loader import DataSetLoader, DataBundle
from typing import Union, Dict
from fastNLP import Vocabulary
from fastNLP import Const
from reproduction.utils import check_dataloader_paths
from fastNLP.io import ConllLoader
from reproduction.seqence_labelling.ner.data.utils import iob2bioes, iob2
class Conll2003DataLoader(DataSetLoader):
def __init__(self, task:str='ner', encoding_type:str='bioes'):
"""
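Load an English corpus in CoNLL-2003 format. Information about the dataset is available at
https://www.clips.uantwerpen.be/conll2003/ner/. When task is pos, the target of the returned DataSet is taken
from column 2; when task is chunk, from column 3; when task is ner, from column 4.
All "-DOCSTART- -X- O O" lines are ignored, so the number of instances is smaller than the figures reported in
much of the literature. Because "-DOCSTART- -X- O O" is only a document separator and should not be a prediction
target, lines starting with -DOCSTART- are skipped.
For the ner and chunk tasks the loaded target uses the encoding_type scheme; for the pos task it is the pos column as-is.
:param task: the tagging task to load; one of ner, pos, chunk
:param encoding_type: tag scheme for ner and chunk; 'bioes' converts the IOB2 tags to BIOES, any other value keeps IOB2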
"""
assert task in ('ner', 'pos', 'chunk')
index = {'ner':3, 'pos':1, 'chunk':2}[task]
self._loader = ConllLoader(headers=['raw_words', 'target'], indexes=[0, index])
self._tag_converters = []
if task in ('ner', 'chunk'):
self._tag_converters = [iob2]
if encoding_type == 'bioes':
self._tag_converters.append(iob2bioes)
def load(self, path: str):
dataset = self._loader.load(path)
def convert_tag_schema(tags):
for converter in self._tag_converters:
tags = converter(tags)
return tags
if self._tag_converters:
dataset.apply_field(convert_tag_schema, field_name=Const.TARGET, new_field_name=Const.TARGET)
return dataset
def process(self, paths: Union[str, Dict[str, str]], word_vocab_opt:VocabularyOption=None, lower:bool=False):
"""
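Load and process the data. Lines starting with '-DOCSTART-' are ignored.
:param paths:
:param word_vocab_opt: initialization options for the word vocabulary
:param lower: whether to convert all letters to lowercase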
:return:
"""
# Load the data
paths = check_dataloader_paths(paths)
data = DataBundle()
input_fields = [Const.TARGET, Const.INPUT, Const.INPUT_LEN]
target_fields = [Const.TARGET, Const.INPUT_LEN]
for name, path in paths.items():
dataset = self.load(path)
dataset.apply_field(lambda words: words, field_name='raw_words', new_field_name=Const.INPUT)
if lower:
dataset.words.lower()
data.datasets[name] = dataset
# Build the word vocab
word_vocab = Vocabulary(min_freq=2) if word_vocab_opt is None else Vocabulary(**word_vocab_opt)
word_vocab.from_dataset(data.datasets['train'], field_name=Const.INPUT,
no_create_entry_dataset=[dataset for name, dataset in data.datasets.items() if name!='train'])
word_vocab.index_dataset(*data.datasets.values(), field_name=Const.INPUT, new_field_name=Const.INPUT)
data.vocabs[Const.INPUT] = word_vocab
# cap words
cap_word_vocab = Vocabulary()
cap_word_vocab.from_dataset(data.datasets['train'], field_name='raw_words',
no_create_entry_dataset=[dataset for name, dataset in data.datasets.items() if name!='train'])
cap_word_vocab.index_dataset(*data.datasets.values(), field_name='raw_words', new_field_name='cap_words')
input_fields.append('cap_words')
data.vocabs['cap_words'] = cap_word_vocab
# Build the vocab for target
target_vocab = Vocabulary(unknown=None, padding=None)
target_vocab.from_dataset(*data.datasets.values(), field_name=Const.TARGET)
target_vocab.index_dataset(*data.datasets.values(), field_name=Const.TARGET)
data.vocabs[Const.TARGET] = target_vocab
for name, dataset in data.datasets.items():
dataset.add_seq_len(Const.INPUT, new_field_name=Const.INPUT_LEN)
dataset.set_input(*input_fields)
dataset.set_target(*target_fields)
return data
if __name__ == '__main__':
pass
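The docstring of `Conll2003DataLoader` describes which column each task reads and says that `-DOCSTART-` lines are skipped. A minimal standalone sketch of that behaviour is shown below. It is not fastNLP's `ConllLoader`; `read_conll`, `TASK_COLUMN`, and the sample text are assumptions made up for illustration.

```python
from typing import List, Tuple

# Column index of each task's tag within a whitespace-separated CoNLL-2003 line.
TASK_COLUMN = {"pos": 1, "chunk": 2, "ner": 3}


def read_conll(text: str, task: str = "ner") -> List[Tuple[List[str], List[str]]]:
    """Split CoNLL-style text into (words, tags) pairs, one per sentence."""
    column = TASK_COLUMN[task]
    sentences, words, tags = [], [], []
    for line in text.splitlines():
        parts = line.split()
        # Blank lines end a sentence; -DOCSTART- lines are document separators.
        if not parts or parts[0] == "-DOCSTART-":
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        words.append(parts[0])
        tags.append(parts[column])
    if words:
        sentences.append((words, tags))
    return sentences


SAMPLE = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC

Peter NNP I-NP I-PER
"""

for words, tags in read_conll(SAMPLE, task="ner"):
    print(words, tags)
```

The `I-ORG` and `I-MISC` tags in the sample are in IOB1 style, which is why the loader runs `iob2` over the ner and chunk targets before any further conversion.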

View File

@ -0,0 +1,152 @@
from fastNLP.core.vocabulary import VocabularyOption
from fastNLP.io.base_loader import DataSetLoader, DataBundle
from typing import Union, Dict
from fastNLP import DataSet
from fastNLP import Vocabulary
from fastNLP import Const
from reproduction.utils import check_dataloader_paths
from fastNLP.io import ConllLoader
from reproduction.seqence_labelling.ner.data.utils import iob2bioes, iob2
class OntoNoteNERDataLoader(DataSetLoader):
"""
Loads OntoNotes data that has been converted to CoNLL format. For the conversion procedure, see
https://github.com/yhcc/OntoNotes-5.0-NER
"""
def __init__(self, encoding_type:str='bioes'):
assert encoding_type in ('bioes', 'bio')
self.encoding_type = encoding_type
if encoding_type=='bioes':
self.encoding_method = iob2bioes
else:
self.encoding_method = iob2
def load(self, path:str)->DataSet:
"""
Read data from the given file path. The returned DataSet contains the following fields:
raw_words: List[str]
target: List[str]
:param path:
:return:
"""
dataset = ConllLoader(headers=['raw_words', 'target'], indexes=[3, 10]).load(path)
def convert_to_bio(tags):
bio_tags = []
flag = None
for tag in tags:
label = tag.strip("()*")
if '(' in tag:
bio_label = 'B-' + label
flag = label
elif flag:
bio_label = 'I-' + flag
else:
bio_label = 'O'
if ')' in tag:
flag = None
bio_tags.append(bio_label)
return self.encoding_method(bio_tags)
def convert_word(words):
converted_words = []
for word in words:
word = word.replace('/.', '.') # some sentence-final periods appear as "/."
if not word.startswith('-'):
converted_words.append(word)
continue
# These bracket symbols were escaped in the corpus; map them back
tfrs = {'-LRB-':'(',
'-RRB-': ')',
'-LSB-': '[',
'-RSB-': ']',
'-LCB-': '{',
'-RCB-': '}'
}
if word in tfrs:
converted_words.append(tfrs[word])
else:
converted_words.append(word)
return converted_words
dataset.apply_field(convert_word, field_name='raw_words', new_field_name='raw_words')
dataset.apply_field(convert_to_bio, field_name='target', new_field_name='target')
return dataset
def process(self, paths: Union[str, Dict[str, str]], word_vocab_opt:VocabularyOption=None,
lower:bool=True)->DataBundle:
"""
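Load and process the data. The returned DataBundle contains:
vocabs:
word: Vocabulary
target: Vocabulary
datasets:
train: DataSet
words: List[int], set as input
target: int. The label, set as both input and target
seq_len: int. The sentence length, set as both input and target
raw_words: List[str]
xxx (may vary depending on the paths passed in)
:param paths:
:param word_vocab_opt: initialization options for the word vocabulary
:param lower: whether to lowercase the words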
:return:
"""
paths = check_dataloader_paths(paths)
data = DataBundle()
input_fields = [Const.TARGET, Const.INPUT, Const.INPUT_LEN]
target_fields = [Const.TARGET, Const.INPUT_LEN]
for name, path in paths.items():
dataset = self.load(path)
dataset.apply_field(lambda words: words, field_name='raw_words', new_field_name=Const.INPUT)
if lower:
dataset.words.lower()
data.datasets[name] = dataset
# Build the word vocab
word_vocab = Vocabulary(min_freq=2) if word_vocab_opt is None else Vocabulary(**word_vocab_opt)
word_vocab.from_dataset(data.datasets['train'], field_name=Const.INPUT,
no_create_entry_dataset=[dataset for name, dataset in data.datasets.items() if name!='train'])
word_vocab.index_dataset(*data.datasets.values(), field_name=Const.INPUT, new_field_name=Const.INPUT)
data.vocabs[Const.INPUT] = word_vocab
# cap words
cap_word_vocab = Vocabulary()
cap_word_vocab.from_dataset(*data.datasets.values(), field_name='raw_words')
cap_word_vocab.index_dataset(*data.datasets.values(), field_name='raw_words', new_field_name='cap_words')
input_fields.append('cap_words')
data.vocabs['cap_words'] = cap_word_vocab
# Build the vocab for target
target_vocab = Vocabulary(unknown=None, padding=None)
target_vocab.from_dataset(*data.datasets.values(), field_name=Const.TARGET)
target_vocab.index_dataset(*data.datasets.values(), field_name=Const.TARGET)
data.vocabs[Const.TARGET] = target_vocab
for name, dataset in data.datasets.items():
dataset.add_seq_len(Const.INPUT, new_field_name=Const.INPUT_LEN)
dataset.set_input(*input_fields)
dataset.set_target(*target_fields)
return data
if __name__ == '__main__':
loader = OntoNoteNERDataLoader()
dataset = loader.load('/hdd/fudanNLP/fastNLP/others/data/v4/english/test.txt')
print(dataset.target.value_count())
print(dataset[:4])
"""
train 115812 2200752
development 15680 304684
test 12217 230111
train 92403 1901772
valid 13606 279180
test 10258 204135
"""

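The `convert_to_bio` helper in `OntoNoteNERDataLoader.load` turns the bracketed entity column into BIO tags before the encoding method is applied. A standalone sketch of the same logic follows; `brackets_to_bio` is a hypothetical name and the sample column values are illustrative, not taken from the corpus.

```python
from typing import List, Optional


def brackets_to_bio(tags: List[str]) -> List[str]:
    """Convert bracketed span tags such as "(PERSON*" and "*)" to BIO tags."""
    bio_tags = []
    open_label: Optional[str] = None  # label of the span currently open, if any
    for tag in tags:
        label = tag.strip("()*")
        if "(" in tag:
            bio_label = "B-" + label
            open_label = label
        elif open_label:
            bio_label = "I-" + open_label
        else:
            bio_label = "O"
        if ")" in tag:
            open_label = None  # the span closes on this token
        bio_tags.append(bio_label)
    return bio_tags


print(brackets_to_bio(["(PERSON*", "*)", "*", "(GPE)", "*"]))
# ['B-PERSON', 'I-PERSON', 'O', 'B-GPE', 'O']
```

A token such as `(GPE)` opens and closes a span at once, so it becomes a single `B-` tag and the following token falls back to `O`.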
View File

@ -0,0 +1,49 @@
from typing import List
def iob2(tags:List[str])->List[str]:
"""
Check that the tags are valid IOB tags; IOB1 tags are automatically converted to IOB2.
:param tags: the tags to convert
"""
for i, tag in enumerate(tags):
if tag == "O":
continue
split = tag.split("-")
if len(split) != 2 or split[0] not in ["I", "B"]:
raise TypeError("The encoding schema is not a valid IOB type.")
if split[0] == "B":
continue
elif i == 0 or tags[i - 1] == "O": # conversion IOB1 to IOB2
tags[i] = "B" + tag[1:]
elif tags[i - 1][1:] == tag[1:]:
continue
else: # conversion IOB1 to IOB2
tags[i] = "B" + tag[1:]
return tags
def iob2bioes(tags:List[str])->List[str]:
"""
Convert IOB2 tags to the BIOES encoding
:param tags:
:return:
"""
new_tags = []
for i, tag in enumerate(tags):
if tag == 'O':
new_tags.append(tag)
else:
split = tag.split('-')[0]
if split == 'B':
if i+1!=len(tags) and tags[i+1].split('-')[0] == 'I':
new_tags.append(tag)
else:
new_tags.append(tag.replace('B-', 'S-'))
elif split == 'I':
if i + 1<len(tags) and tags[i+1].split('-')[0] == 'I':
new_tags.append(tag)
else:
new_tags.append(tag.replace('I-', 'E-'))
else:
raise TypeError("Invalid IOB format.")
return new_tags
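To make the two conversions in this file concrete, here is a compact sketch that chains IOB1 to IOB2 to BIOES on one sentence. `to_iob2` and `to_bioes` are simplified re-implementations written for illustration (they work on copies and skip the validity checks); they are not the functions above.

```python
from typing import List


def to_iob2(tags: List[str]) -> List[str]:
    """Return a copy of `tags` in which every chunk starts with a B- tag."""
    out = list(tags)
    for i, tag in enumerate(out):
        if tag == "O" or tag.startswith("B-"):
            continue
        # An I- tag starts a new chunk if it opens the sentence, follows "O",
        # or follows a chunk of a different type.
        if i == 0 or out[i - 1] == "O" or out[i - 1][1:] != tag[1:]:
            out[i] = "B" + tag[1:]
    return out


def to_bioes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to BIOES: single-token chunks become S-, chunk ends become E-."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_is_inside = i + 1 < len(tags) and tags[i + 1].startswith("I-")
        if prefix == "B":
            out.append(tag if next_is_inside else "S-" + label)
        else:
            out.append(tag if next_is_inside else "E-" + label)
    return out


iob1 = ["I-PER", "I-PER", "O", "I-LOC", "I-ORG", "O"]
iob2_tags = to_iob2(iob1)
print(iob2_tags)            # ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-ORG', 'O']
print(to_bioes(iob2_tags))  # ['B-PER', 'E-PER', 'O', 'S-LOC', 'S-ORG', 'O']
```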

View File

@ -106,7 +106,9 @@ class IDCNN(nn.Module):
if self.crf is not None and target is not None:
loss = self.crf(y.transpose(1, 2), t, mask)
else:
t.masked_fill_(mask == 0, -100)
y.masked_fill_((mask == 0)[:,None,:], -100)
# f_mask = mask.float()
# t = f_mask * t + (1-f_mask) * -100
loss = F.cross_entropy(y, t, ignore_index=-100)
return loss
@ -130,13 +132,3 @@ class IDCNN(nn.Module):
C.OUTPUT: pred,
}
def predict(self, words, seq_len, chars=None):
res = self.forward(
words=words,
seq_len=seq_len,
chars=chars,
target=None
)[C.OUTPUT]
return {
C.OUTPUT: res
}

View File

@ -11,9 +11,8 @@ from fastNLP import Const
class CNNBiLSTMCRF(nn.Module):
def __init__(self, embed, char_embed, hidden_size, num_layers, tag_vocab, dropout=0.5, encoding_type='bioes'):
super().__init__()
self.embedding = Embedding(embed, dropout=0.5, dropout_word=0)
self.char_embedding = Embedding(char_embed, dropout=0.5, dropout_word=0.01)
self.embedding = embed
self.char_embedding = char_embed
self.lstm = LSTM(input_size=self.embedding.embedding_dim+self.char_embedding.embedding_dim,
hidden_size=hidden_size//2, num_layers=num_layers,
bidirectional=True, batch_first=True)
@ -33,24 +32,24 @@ class CNNBiLSTMCRF(nn.Module):
if 'crf' in name:
nn.init.zeros_(param)
def _forward(self, words, cap_words, seq_len, target=None):
words = self.embedding(words)
chars = self.char_embedding(cap_words)
words = torch.cat([words, chars], dim=-1)
def _forward(self, words, seq_len, target=None):
word_embeds = self.embedding(words)
char_embeds = self.char_embedding(words)
words = torch.cat((word_embeds, char_embeds), dim=-1)
outputs, _ = self.lstm(words, seq_len)
outputs = self.dropout(outputs)
logits = F.log_softmax(self.fc(outputs), dim=-1)
if target is not None:
loss = self.crf(logits, target, seq_len_to_mask(seq_len))
loss = self.crf(logits, target, seq_len_to_mask(seq_len, max_len=logits.size(1))).mean()
return {Const.LOSS: loss}
else:
pred, _ = self.crf.viterbi_decode(logits, seq_len_to_mask(seq_len))
pred, _ = self.crf.viterbi_decode(logits, seq_len_to_mask(seq_len, max_len=logits.size(1)))
return {Const.OUTPUT: pred}
def forward(self, words, cap_words, seq_len, target):
return self._forward(words, cap_words, seq_len, target)
def forward(self, words, seq_len, target):
return self._forward(words, seq_len, target)
def predict(self, words, cap_words, seq_len):
return self._forward(words, cap_words, seq_len, None)
def predict(self, words, seq_len):
return self._forward(words, seq_len, None)

View File

@ -1,6 +1,7 @@
import sys
sys.path.append('../../..')
from fastNLP.modules.encoder.embedding import CNNCharEmbedding, StaticEmbedding, BertEmbedding, ElmoEmbedding, LSTMCharEmbedding
from fastNLP.modules.encoder.embedding import CNNCharEmbedding, StaticEmbedding, BertEmbedding, ElmoEmbedding, StackEmbedding
from fastNLP.core.vocabulary import VocabularyOption
from reproduction.seqence_labelling.ner.model.lstm_cnn_crf import CNNBiLSTMCRF
@ -12,7 +13,10 @@ from torch.optim import SGD, Adam
from fastNLP import GradientClipCallback
from fastNLP.core.callback import FitlogCallback, LRScheduler
from torch.optim.lr_scheduler import LambdaLR
from reproduction.seqence_labelling.ner.model.swats import SWATS
from fastNLP.core.optimizer import AdamW
# from reproduction.seqence_labelling.ner.model.swats import SWATS
from reproduction.seqence_labelling.chinese_ner.callbacks import SaveModelCallback
from fastNLP import cache_results
import fitlog
fitlog.debug()
@ -20,17 +24,20 @@ fitlog.debug()
from reproduction.seqence_labelling.ner.data.Conll2003Loader import Conll2003DataLoader
encoding_type = 'bioes'
data = Conll2003DataLoader(encoding_type=encoding_type).process('../../../../others/data/conll2003',
word_vocab_opt=VocabularyOption(min_freq=2),
lower=False)
@cache_results('caches/upper_conll2003.pkl')
def load_data():
data = Conll2003DataLoader(encoding_type=encoding_type).process('../../../../others/data/conll2003',
word_vocab_opt=VocabularyOption(min_freq=1),
lower=False)
return data
data = load_data()
print(data)
char_embed = CNNCharEmbedding(vocab=data.vocabs['cap_words'], embed_size=30, char_emb_size=30, filter_nums=[30],
kernel_sizes=[3])
char_embed = CNNCharEmbedding(vocab=data.vocabs['words'], embed_size=30, char_emb_size=30, filter_nums=[30],
kernel_sizes=[3], word_dropout=0.01, dropout=0.5)
# char_embed = LSTMCharEmbedding(vocab=data.vocabs['cap_words'], embed_size=30 ,char_emb_size=30)
word_embed = StaticEmbedding(vocab=data.vocabs[Const.INPUT],
model_dir_or_name='/hdd/fudanNLP/pretrain_vectors/wiki_en_100_50_case_2.txt',
requires_grad=True)
word_embed = StaticEmbedding(vocab=data.vocabs['words'],
model_dir_or_name='/hdd/fudanNLP/pretrain_vectors/glove.6B.100d.txt',
requires_grad=True, lower=True, word_dropout=0.01, dropout=0.5)
word_embed.embedding.weight.data = word_embed.embedding.weight.data/word_embed.embedding.weight.data.std()
# import joblib
@ -46,25 +53,28 @@ word_embed.embedding.weight.data = word_embed.embedding.weight.data/word_embed.e
# for name, dataset in data.datasets.items():
# dataset.apply_field(convert_to_ids, field_name='raw_words', new_field_name=Const.INPUT)
# word_embed = ElmoEmbedding(vocab=data.vocabs['cap_words'],
# model_dir_or_name='/hdd/fudanNLP/fastNLP/others/pretrained_models/elmo_en',
# requires_grad=True)
# elmo_embed = ElmoEmbedding(vocab=data.vocabs['cap_words'],
# model_dir_or_name='.',
# requires_grad=True, layers='mix')
# char_embed = StackEmbedding([elmo_embed, char_embed])
model = CNNBiLSTMCRF(word_embed, char_embed, hidden_size=200, num_layers=1, tag_vocab=data.vocabs[Const.TARGET],
encoding_type=encoding_type)
callbacks = [
GradientClipCallback(clip_type='value', clip_value=5)
, FitlogCallback({'test':data.datasets['test']}, verbose=1)
GradientClipCallback(clip_type='value', clip_value=5),
FitlogCallback({'test':data.datasets['test']}, verbose=1),
# SaveModelCallback('save_models/', top=3, only_param=False, save_on_exception=True)
]
# optimizer = Adam(model.parameters(), lr=0.005)
optimizer = SWATS(model.parameters(), verbose=True)
# optimizer = SGD(model.parameters(), lr=0.008, momentum=0.9)
# scheduler = LRScheduler(LambdaLR(optimizer, lr_lambda=lambda epoch: 1 / (1 + 0.05 * epoch)))
# callbacks.append(scheduler)
# optimizer = Adam(model.parameters(), lr=0.001)
# optimizer = SWATS(model.parameters(), verbose=True)
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = LRScheduler(LambdaLR(optimizer, lr_lambda=lambda epoch: 1 / (1 + 0.05 * epoch)))
callbacks.append(scheduler)
trainer = Trainer(train_data=data.datasets['train'], model=model, optimizer=optimizer, sampler=BucketSampler(),
device=1, dev_data=data.datasets['dev'], batch_size=10,
trainer = Trainer(train_data=data.datasets['train'], model=model, optimizer=optimizer, sampler=BucketSampler(batch_size=20),
device=1, dev_data=data.datasets['dev'], batch_size=20,
metrics=SpanFPreRecMetric(tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type),
callbacks=callbacks, num_workers=1, n_epochs=100)
callbacks=callbacks, num_workers=2, n_epochs=100)
trainer.train()
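The script above wraps `LambdaLR` with the multiplier `1 / (1 + 0.05 * epoch)`. `LambdaLR` scales the optimizer's initial learning rate by that factor, so the schedule can be sketched in plain Python as below, assuming the scheduler is stepped once per epoch. `decayed_lr` is a hypothetical helper, not part of fastNLP or PyTorch.

```python
def decayed_lr(base_lr: float, epoch: int, decay: float = 0.05) -> float:
    """Learning rate after applying the multiplier 1 / (1 + decay * epoch)."""
    return base_lr / (1 + decay * epoch)


# With the SGD learning rate of 0.01 used above, the rate halves by epoch 20.
for epoch in (0, 1, 20, 100):
    print(epoch, decayed_lr(0.01, epoch))
```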

View File

@ -1,4 +1,5 @@
from reproduction.seqence_labelling.ner.data.OntoNoteLoader import OntoNoteNERDataLoader
from reproduction.seqence_labelling.ner.data.Conll2003Loader import Conll2003DataLoader
from fastNLP.core.callback import FitlogCallback, LRScheduler
from fastNLP import GradientClipCallback
from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR
@ -6,11 +7,14 @@ from torch.optim import SGD, Adam
from fastNLP import Const
from fastNLP import RandomSampler, BucketSampler
from fastNLP import SpanFPreRecMetric
from fastNLP import Trainer
from fastNLP import Trainer, Tester
from fastNLP.core.metrics import MetricBase
from reproduction.seqence_labelling.ner.model.dilated_cnn import IDCNN
from fastNLP.core.utils import Option
from fastNLP.modules.encoder.embedding import CNNCharEmbedding, StaticEmbedding
from fastNLP.core.utils import cache_results
from fastNLP.core.vocabulary import VocabularyOption
import fitlog
import sys
import torch.cuda
import os
@ -24,7 +28,6 @@ encoding_type = 'bioes'
def get_path(path):
return os.path.join(os.environ['HOME'], path)
data_path = get_path('workdir/datasets/ontonotes-v4')
ops = Option(
batch_size=128,
@ -33,34 +36,45 @@ ops = Option(
repeats=3,
num_layers=3,
num_filters=400,
use_crf=True,
use_crf=False,
gradient_clip=5,
)
@cache_results('ontonotes-cache')
@cache_results('ontonotes-case-cache')
def load_data():
data = OntoNoteNERDataLoader(encoding_type=encoding_type).process(data_path,
lower=True)
print('loading data')
data = OntoNoteNERDataLoader(encoding_type=encoding_type).process(
paths = get_path('workdir/datasets/ontonotes-v4'),
lower=False,
word_vocab_opt=VocabularyOption(min_freq=0),
)
# data = Conll2003DataLoader(task='ner', encoding_type=encoding_type).process(
# paths=get_path('workdir/datasets/conll03'),
# lower=False, word_vocab_opt=VocabularyOption(min_freq=0)
# )
# char_embed = CNNCharEmbedding(vocab=data.vocabs['cap_words'], embed_size=30, char_emb_size=30, filter_nums=[30],
# kernel_sizes=[3])
print('loading embedding')
word_embed = StaticEmbedding(vocab=data.vocabs[Const.INPUT],
model_dir_or_name='en-glove-840b-300',
requires_grad=True)
return data, [word_embed]
data, embeds = load_data()
print(data)
print(data.datasets['train'][0])
print(list(data.vocabs.keys()))
for ds in data.datasets.values():
ds.rename_field('cap_words', 'chars')
ds.set_input('chars')
# for ds in data.datasets.values():
# ds.rename_field('cap_words', 'chars')
# ds.set_input('chars')
word_embed = embeds[0]
char_embed = CNNCharEmbedding(data.vocabs['cap_words'])
word_embed.embedding.weight.data /= word_embed.embedding.weight.data.std()
# char_embed = CNNCharEmbedding(data.vocabs['cap_words'])
char_embed = None
# for ds in data.datasets:
# ds.rename_field('')
@ -75,14 +89,44 @@ model = IDCNN(init_embed=word_embed,
kernel_size=3,
use_crf=ops.use_crf, use_projection=True,
block_loss=True,
input_dropout=0.33, hidden_dropout=0.2, inner_dropout=0.2)
input_dropout=0.5, hidden_dropout=0.2, inner_dropout=0.2)
print(model)
callbacks = [GradientClipCallback(clip_value=ops.gradient_clip, clip_type='norm'),]
callbacks = [GradientClipCallback(clip_value=ops.gradient_clip, clip_type='value'),]
metrics = []
metrics.append(
SpanFPreRecMetric(
tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type,
pred=Const.OUTPUT, target=Const.TARGET, seq_len=Const.INPUT_LEN,
)
)
class LossMetric(MetricBase):
def __init__(self, loss=None):
super(LossMetric, self).__init__()
self._init_param_map(loss=loss)
self.total_loss = 0.0
self.steps = 0
def evaluate(self, loss):
self.total_loss += float(loss)
self.steps += 1
def get_metric(self, reset=True):
result = {'loss': self.total_loss / (self.steps + 1e-12)}
if reset:
self.total_loss = 0.0
self.steps = 0
return result
metrics.append(
LossMetric(loss=Const.LOSS)
)
optimizer = Adam(model.parameters(), lr=ops.lr, weight_decay=0)
# scheduler = LRScheduler(LambdaLR(optimizer, lr_lambda=lambda epoch: 1 / (1 + 0.05 * epoch)))
scheduler = LRScheduler(LambdaLR(optimizer, lr_lambda=lambda epoch: 1 / (1 + 0.05 * epoch)))
callbacks.append(scheduler)
# callbacks.append(LRScheduler(CosineAnnealingLR(optimizer, 15)))
# optimizer = SWATS(model.parameters(), verbose=True)
# optimizer = Adam(model.parameters(), lr=0.005)
@ -92,8 +136,20 @@ device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
trainer = Trainer(train_data=data.datasets['train'], model=model, optimizer=optimizer,
sampler=BucketSampler(num_buckets=50, batch_size=ops.batch_size),
device=device, dev_data=data.datasets['dev'], batch_size=ops.batch_size,
metrics=SpanFPreRecMetric(
tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type),
metrics=metrics,
check_code_level=-1,
callbacks=callbacks, num_workers=2, n_epochs=ops.num_epochs)
trainer.train()
torch.save(model, 'idcnn.pt')
tester = Tester(
data=data.datasets['test'],
model=model,
metrics=metrics,
batch_size=ops.batch_size,
num_workers=2,
device=device
)
tester.test()
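The `LossMetric` class in this script averages the loss over evaluation steps and optionally resets its state when read. The same accumulate-then-reset pattern, stripped of the fastNLP `MetricBase` plumbing, looks roughly like this; `RunningLoss` is an illustrative stand-in, not a fastNLP class.

```python
class RunningLoss:
    """Accumulates per-step losses and reports their mean."""

    def __init__(self) -> None:
        self.total_loss = 0.0
        self.steps = 0

    def update(self, loss: float) -> None:
        self.total_loss += float(loss)
        self.steps += 1

    def result(self, reset: bool = True) -> float:
        # The small epsilon avoids a division by zero when no step was recorded.
        mean = self.total_loss / (self.steps + 1e-12)
        if reset:
            self.total_loss = 0.0
            self.steps = 0
        return mean


tracker = RunningLoss()
for step_loss in (0.5, 1.5):
    tracker.update(step_loss)
print(tracker.result())  # close to 1.0; the state is reset afterwards
```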

View File

@ -7,9 +7,9 @@ dpcnn: paper link [Deep Pyramid Convolutional Neural Networks for TextCategoriza
HAN: paper link [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf)
LSTM+self_attention: paper link [A Structured Self-attentive Sentence Embedding](<https://arxiv.org/pdf/1703.03130.pdf>)
LSTM+self_attention: paper link [A Structured Self-attentive Sentence Embedding](https://arxiv.org/pdf/1703.03130.pdf)
AWD-LSTM: paper link [Regularizing and Optimizing LSTM Language Models](<https://arxiv.org/pdf/1708.02182.pdf>)
AWD-LSTM: paper link [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)
# Summary of datasets and reproduction results

View File

@ -1,6 +1,6 @@
from fastNLP.io.embed_loader import EmbeddingOption, EmbedLoader
from fastNLP.core.vocabulary import VocabularyOption
from fastNLP.io.base_loader import DataSetLoader, DataInfo
from fastNLP.io.base_loader import DataSetLoader, DataBundle
from typing import Union, Dict, List, Iterator
from fastNLP import DataSet
from fastNLP import Instance
@ -50,7 +50,7 @@ class IMDBLoader(DataSetLoader):
char_level_op=False):
datasets = {}
info = DataInfo()
info = DataBundle()
for name, path in paths.items():
dataset = self.load(path)
datasets[name] = dataset

View File

@ -1,6 +1,6 @@
from fastNLP.io.embed_loader import EmbeddingOption, EmbedLoader
from fastNLP.core.vocabulary import VocabularyOption
from fastNLP.io.base_loader import DataSetLoader, DataInfo
from fastNLP.io.base_loader import DataSetLoader, DataBundle
from typing import Union, Dict, List, Iterator
from fastNLP import DataSet
from fastNLP import Instance
@ -47,7 +47,7 @@ class MTL16Loader(DataSetLoader):
paths = check_dataloader_paths(paths)
datasets = {}
info = DataInfo()
info = DataBundle()
for name, path in paths.items():
dataset = self.load(path)
datasets[name] = dataset

View File

@ -1,6 +1,6 @@
from typing import Iterable
from nltk import Tree
from fastNLP.io.base_loader import DataInfo, DataSetLoader
from fastNLP.io.base_loader import DataBundle, DataSetLoader
from fastNLP.core.vocabulary import VocabularyOption, Vocabulary
from fastNLP import DataSet
from fastNLP import Instance
@ -68,7 +68,7 @@ class SSTLoader(DataSetLoader):
tgt_vocab = Vocabulary(unknown=None, padding=None) \
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
info = DataInfo(datasets=self.load(paths))
info = DataBundle(datasets=self.load(paths))
_train_ds = [info.datasets[name]
for name in train_ds] if train_ds else info.datasets.values()
src_vocab.from_dataset(*_train_ds, field_name=input_name)
@ -134,7 +134,7 @@ class sst2Loader(DataSetLoader):
paths = check_dataloader_paths(paths)
datasets = {}
info = DataInfo()
info = DataBundle()
for name, path in paths.items():
dataset = self.load(path)
datasets[name] = dataset

View File

@ -4,7 +4,7 @@ from typing import Iterable
from fastNLP import DataSet, Instance, Vocabulary
from fastNLP.core.vocabulary import VocabularyOption
from fastNLP.io import JsonLoader
from fastNLP.io.base_loader import DataInfo,DataSetLoader
from fastNLP.io.base_loader import DataBundle,DataSetLoader
from fastNLP.io.embed_loader import EmbeddingOption
from fastNLP.io.file_reader import _read_json
from typing import Union, Dict
@ -134,7 +134,7 @@ class yelpLoader(DataSetLoader):
char_level_op=False):
paths = check_dataloader_paths(paths)
datasets = {}
info = DataInfo(datasets=self.load(paths))
info = DataBundle(datasets=self.load(paths))
src_vocab = Vocabulary() if src_vocab_op is None else Vocabulary(**src_vocab_op)
tgt_vocab = Vocabulary(unknown=None, padding=None) \
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)

View File

@ -11,7 +11,7 @@ from reproduction.text_classification.model.dpcnn import DPCNN
from data.yelpLoader import yelpLoader
from fastNLP.core.sampler import BucketSampler
import torch.nn as nn
from fastNLP.core import LRScheduler
from fastNLP.core import LRScheduler, Callback
from fastNLP.core.const import Const as C
from fastNLP.core.vocabulary import VocabularyOption
from utils.util_init import set_rng_seeds
@ -25,14 +25,14 @@ os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
class Config():
seed = 12345
model_dir_or_name = "dpcnn-yelp-p"
model_dir_or_name = "dpcnn-yelp-f"
embedding_grad = True
train_epoch = 30
batch_size = 100
task = "yelp_p"
task = "yelp_f"
#datadir = 'workdir/datasets/SST'
datadir = 'workdir/datasets/yelp_polarity'
# datadir = 'workdir/datasets/yelp_full'
# datadir = 'workdir/datasets/yelp_polarity'
datadir = 'workdir/datasets/yelp_full'
#datafile = {"train": "train.txt", "dev": "dev.txt", "test": "test.txt"}
datafile = {"train": "train.csv", "test": "test.csv"}
lr = 1e-3
@ -73,6 +73,8 @@ def load_data():
datainfo, embedding = load_data()
embedding.embedding.weight.data /= embedding.embedding.weight.data.std()
print(embedding.embedding.weight.mean(), embedding.embedding.weight.std())
# 2. Or directly reuse a model from fastNLP
@ -92,11 +94,12 @@ optimizer = SGD([param for param in model.parameters() if param.requires_grad ==
lr=ops.lr, momentum=0.9, weight_decay=ops.weight_decay)
callbacks = []
# callbacks.append(LRScheduler(CosineAnnealingLR(optimizer, 5)))
callbacks.append(
LRScheduler(LambdaLR(optimizer, lambda epoch: ops.lr if epoch <
ops.train_epoch * 0.8 else ops.lr * 0.1))
)
callbacks.append(LRScheduler(CosineAnnealingLR(optimizer, 5)))
# callbacks.append(
# LRScheduler(LambdaLR(optimizer, lambda epoch: ops.lr if epoch <
# ops.train_epoch * 0.8 else ops.lr * 0.1))
# )
# callbacks.append(
# FitlogCallback(data=datainfo.datasets, verbose=1)

View File

@ -3,3 +3,4 @@ torch>=1.0.0
tqdm>=4.28.1
nltk>=3.4.1
requests
spacy

View File

@ -88,6 +88,27 @@ class TestAdd(unittest.TestCase):
for i in range(num_samples):
self.assertEqual(True, vocab._is_word_no_create_entry(chr(start_char + i)+chr(start_char + i)))
def test_no_entry(self):
# Build the vocabulary first, then change no_create_entry and check that it is recognized correctly
text = ["FastNLP", "works", "well", "in", "most", "cases", "and", "scales", "well", "in",
"works", "well", "in", "most", "cases", "scales", "well"]
vocab = Vocabulary()
vocab.add_word_lst(text)
self.assertFalse(vocab._is_word_no_create_entry('FastNLP'))
vocab.add_word('FastNLP', no_create_entry=True)
self.assertFalse(vocab._is_word_no_create_entry('FastNLP'))
vocab.add_word('fastnlp', no_create_entry=True)
self.assertTrue(vocab._is_word_no_create_entry('fastnlp'))
vocab.add_word('fastnlp', no_create_entry=False)
self.assertFalse(vocab._is_word_no_create_entry('fastnlp'))
vocab.add_word_lst(['1']*10, no_create_entry=True)
self.assertTrue(vocab._is_word_no_create_entry('1'))
vocab.add_word('1')
self.assertFalse(vocab._is_word_no_create_entry('1'))
class TestIndexing(unittest.TestCase):
def test_len(self):
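
Reviewer note: the assertions in `test_no_entry` imply a simple rule: a word counts as "no create entry" only while every addition of it has used `no_create_entry=True`, and a single normal addition clears the flag for good. The sketch below reproduces that rule with a toy class. `ToyVocab` and its attributes are hypothetical stand-ins written for this note and say nothing about how fastNLP's `Vocabulary` is implemented internally.

```python
from collections import Counter


class ToyVocab:
    """Toy stand-in for illustration only, NOT fastNLP's Vocabulary."""

    def __init__(self):
        self.word_count = Counter()
        self._no_create = set()  # words seen only with no_create_entry=True

    def add_word(self, word, no_create_entry=False):
        if no_create_entry:
            # Flag the word only if it has never been added before.
            if word not in self.word_count:
                self._no_create.add(word)
        else:
            # A normal addition always clears the flag.
            self._no_create.discard(word)
        self.word_count[word] += 1

    def add_word_lst(self, words, no_create_entry=False):
        for word in words:
            self.add_word(word, no_create_entry=no_create_entry)

    def is_no_create_entry(self, word):
        return word in self._no_create


vocab = ToyVocab()
vocab.add_word_lst(["FastNLP", "works", "well"])
vocab.add_word("FastNLP", no_create_entry=True)   # already added normally: stays unflagged
vocab.add_word("fastnlp", no_create_entry=True)   # new word: flagged
print(vocab.is_no_create_entry("FastNLP"), vocab.is_no_create_entry("fastnlp"))
```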
@ -127,6 +148,21 @@ class TestIndexing(unittest.TestCase):
self.assertTrue(word in text)
self.assertTrue(idx < len(vocab))
    def test_rebuild(self):
        # After build, adding new words must not change the order of the existing words
        vocab = Vocabulary()
        text = [str(idx) for idx in range(10)]
        vocab.update(text)
        for i in text:
            self.assertEqual(int(i)+2, vocab.to_index(i))
        indexes = []
        for word, index in vocab:
            indexes.append((word, index))
        vocab.add_word_lst([str(idx) for idx in range(10, 13)])
        for idx, pair in enumerate(indexes):
            self.assertEqual(pair[1], vocab.to_index(pair[0]))
        for i in range(13):
            self.assertEqual(int(i)+2, vocab.to_index(str(i)))
class TestOther(unittest.TestCase):
def test_additional_update(self):
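
Reviewer note: `test_rebuild` checks two things: word `i` maps to index `i + 2`, and indices assigned before a rebuild do not move when more words are added. The `+2` offset is consistent with two reserved entries at the front, assumed here to be padding and unknown tokens. The sketch below shows one way to get append-only, stable indexing. `ToyIndexer` is a hypothetical class written for this note, not the library's implementation.

```python
class ToyIndexer:
    """Toy append-only word-to-index map, for illustration only."""

    def __init__(self, reserved=("<pad>", "<unk>")):
        # Reserved tokens take indices 0 and 1, so real words start at 2.
        self.word2idx = {tok: i for i, tok in enumerate(reserved)}

    def add_word_lst(self, words):
        for word in words:
            # setdefault leaves existing entries untouched, so old indices never move.
            self.word2idx.setdefault(word, len(self.word2idx))

    def to_index(self, word):
        # Unknown words fall back to the index of the "<unk>" token.
        return self.word2idx.get(word, self.word2idx["<unk>"])


indexer = ToyIndexer()
indexer.add_word_lst([str(i) for i in range(10)])
before = dict(indexer.word2idx)

indexer.add_word_lst([str(i) for i in range(10, 13)])
unchanged = all(indexer.to_index(w) == idx for w, idx in before.items())
print(unchanged, indexer.to_index("12"))
```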

@ -1,8 +1,7 @@
import unittest
import os
from fastNLP.io import Conll2003Loader, PeopleDailyCorpusLoader, CSVLoader, JsonLoader
from fastNLP.io.data_loader import SSTLoader, SNLILoader
from reproduction.text_classification.data.yelpLoader import yelpLoader
from fastNLP.io import CSVLoader, JsonLoader
from fastNLP.io.data_loader import SSTLoader, SNLILoader, Conll2003Loader, PeopleDailyCorpusLoader
class TestDatasetLoader(unittest.TestCase):
@ -31,7 +30,7 @@ class TestDatasetLoader(unittest.TestCase):
ds = JsonLoader().load('test/data_for_tests/sample_snli.jsonl')
assert len(ds) == 3
def test_SST(self):
def no_test_SST(self):
train_data = """(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))
(4 (4 (4 (2 The) (4 (3 gorgeously) (3 (2 elaborate) (2 continuation)))) (2 (2 (2 of) (2 ``)) (2 (2 The) (2 (2 (2 Lord) (2 (2 of) (2 (2 the) (2 Rings)))) (2 (2 '') (2 trilogy)))))) (2 (3 (2 (2 is) (2 (2 so) (2 huge))) (2 (2 that) (3 (2 (2 (2 a) (2 column)) (2 (2 of) (2 words))) (2 (2 (2 (2 can) (1 not)) (3 adequately)) (2 (2 describe) (2 (3 (2 (2 co-writer\/director) (2 (2 Peter) (3 (2 Jackson) (2 's)))) (3 (2 expanded) (2 vision))) (2 (2 of) (2 (2 (2 J.R.R.) (2 (2 Tolkien) (2 's))) (2 Middle-earth))))))))) (2 .)))
(3 (3 (2 (2 (2 (2 (2 Singer\/composer) (2 (2 Bryan) (2 Adams))) (2 (2 contributes) (2 (2 (2 a) (2 slew)) (2 (2 of) (2 songs))))) (2 (2 --) (2 (2 (2 (2 a) (2 (2 few) (3 potential))) (2 (2 (2 hits) (2 ,)) (2 (2 (2 a) (2 few)) (1 (1 (2 more) (1 (2 simply) (2 intrusive))) (2 (2 to) (2 (2 the) (2 story))))))) (2 --)))) (2 but)) (3 (4 (2 the) (3 (2 whole) (2 package))) (2 (3 certainly) (3 (2 captures) (2 (1 (2 the) (2 (2 (2 intended) (2 (2 ,) (2 (2 er) (2 ,)))) (3 spirit))) (2 (2 of) (2 (2 the) (2 piece)))))))) (2 .))
@ -65,6 +64,12 @@ class TestDatasetLoader(unittest.TestCase):
    def test_import(self):
        import fastNLP
        from fastNLP.io import SNLILoader
        ds = SNLILoader().process('test/data_for_tests/sample_snli.jsonl', to_lower=True,
                                  get_index=True, seq_len_type='seq_len', extra_split=['-'])
        assert 'train' in ds.datasets
        assert len(ds.datasets) == 1
        assert len(ds.datasets['train']) == 3

        ds = SNLILoader().process('test/data_for_tests/sample_snli.jsonl', to_lower=True,
                                  get_index=True, seq_len_type='seq_len')
        assert 'train' in ds.datasets