mirror of
https://gitee.com/fastnlp/fastNLP.git
synced 2024-12-04 21:28:01 +08:00
commit
2610c20c23
45 README.md
@ -6,50 +6,59 @@
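![Hex.pm](https://img.shields.io/hexpm/l/plug.svg)
[![Documentation Status](https://readthedocs.org/projects/fastnlp/badge/?version=latest)](http://fastnlp.readthedocs.io/?badge=latest)

fastNLP is a lightweight NLP toolkit. You can use it to quickly complete sequence labeling ([NER](reproduction/seqence_labelling/ner/), POS tagging, etc.), Chinese word segmentation, text classification, [Matching](reproduction/matching/), coreference resolution, and summarization tasks, or to build complex network models for research. It has the following features:
fastNLP is a lightweight NLP toolkit. You can use it to quickly complete sequence labeling ([NER](reproduction/seqence_labelling/ner), POS tagging, etc.), Chinese word segmentation, [text classification](reproduction/text_classification), [Matching](reproduction/matching), [coreference resolution](reproduction/coreference_resolution), and [summarization](reproduction/Summarization) tasks, or to build complex network models for research. It has the following features:

- A unified tabular data container that keeps preprocessing simple and clear, plus built-in DataSet Loaders for many datasets that save you from writing preprocessing code;
- A variety of training and testing components, such as the Trainer, the Tester, and various evaluation metrics;
- Handy NLP tools, such as loading of pretrained embeddings (including EMLo and BERT) and caching of intermediate data;
- Detailed Chinese [documentation](https://fastnlp.readthedocs.io/) and tutorials for reference;
- Handy NLP tools, such as loading of pretrained embeddings (including ELMo and BERT) and caching of intermediate data;
- Detailed Chinese [documentation](https://fastnlp.readthedocs.io/) and [tutorials](https://fastnlp.readthedocs.io/zh/latest/user/tutorials.html) for reference;
- Many advanced modules, such as Variational LSTM, Transformer, and CRF;
- Ready-to-use models for sequence labeling, Chinese word segmentation, text classification, Matching, coreference resolution, and summarization; [details](reproduction/)
- Ready-to-use models for sequence labeling, Chinese word segmentation, text classification, Matching, coreference resolution, and summarization; see the [reproduction](reproduction) section for details;
- A convenient and extensible trainer, with many built-in callback functions for experiment logging, exception handling, and more.

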
## Installation

fastNLP depends on the packages below:
fastNLP depends on the following packages:

+ numpy>=1.14.2
+ torch>=1.0.0
+ tqdm>=4.28.1
+ nltk>=3.4.1
+ requests
+ spacy

Installing torch may depend on your operating system and CUDA version; see the [PyTorch website](https://pytorch.org/).
Once the dependencies are installed, run the following commands on the command line to complete the installation:

```shell
pip install fastNLP
python -m spacy download en
```


## References
## fastNLP Tutorials

- [Documentation](https://fastnlp.readthedocs.io/zh/latest/)
- [Source code](https://github.com/fastnlp/fastNLP)
- [1. Preprocessing text with DataSet](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_1_data_preprocess.html)
- [2. Loading datasets with DataSetLoader](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_2_load_dataset.html)
- [3. Converting text to vectors with the Embedding module](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_3_embedding.html)
- [4. Building a text classifier I: quick training and testing with Trainer and Tester](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_4_loss_optimizer.html)
- [5. Building a text classifier II: a custom training loop with DataSetIter](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_5_datasetiter.html)
- [6. Quickly building a sequence labeling model](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_6_seq_labeling.html)
- [7. Quickly building custom models with Modules and Models](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_7_modules_models.html)
- [8. Quickly evaluating your model with Metric](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_8_metrics.html)
- [9. Customizing your training process with Callback](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_9_callback.html)



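## Built-in Components

Most neural networks for NLP tasks can be viewed as a combination of three kinds of modules: encoder, aggregator, and decoder.
Most neural networks for NLP tasks can be viewed as a combination of two kinds of modules: encoder and decoder.


![](./docs/source/figures/text_classification.png)

fastNLP's modules package ships many components for the three kinds of modules, helping users quickly build the networks they need. The functions and common components of the three kinds of modules are as follows:
fastNLP's modules package ships many components for the two kinds of modules, helping users quickly build the networks they need. The functions and common components of the two kinds of modules are as follows:
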
<table>
<tr>
@ -59,29 +68,17 @@ fastNLP's modules package ships many components for the three kinds of modules, helping
</tr>
<tr>
<td> encoder </td>
<td> Encodes the input into  a vector with representational power </td>
<td> Encodes the input into a vector with representational power </td>
<td> embedding, RNN, CNN, transformer
</tr>
<tr>
<td> aggregator </td>
<td> Aggregates information from multiple vectors </td>
<td> self-attention, max-pooling </td>
</tr>
<tr>
<td> decoder </td>
<td> Decodes a vector  that carries some representation into the required  output form </td>
<td> Decodes a vector that carries some representation into the required output form </td>
<td> MLP, CRF </td>
</tr>
</table>


## Complete Models
fastNLP implements many complete models for different NLP tasks, all of which have been trained and tested.

You can find the details in the following two places:
- [Model introductions](reproduction/)
- [Model source code](fastNLP/models/)

## Project Structure

![](./docs/source/figures/workflow.png)
@ -19,6 +19,9 @@ apidoc:
server:
	cd build/html && python -m http.server

dev:
	rm -rf build/html && make html && make server

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
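41 docs/README.md
Normal file
@ -0,0 +1,41 @@
# Quick Start: Writing fastNLP Documentation

This tutorial is written for fastNLP documentation writers, who include collaborating developers and documentation maintainers. In most cases you belong to the former group
and only need to understand part of the overall framework.

## Collaborating Developers

fastNLP's documentation is generated with [Sphinx](http://sphinx.pocoo.org/), which is based on the [reStructuredText markup language](http://docutils.sourceforge.net/rst.html),
and is built and maintained automatically by [Read the Docs](https://readthedocs.org/).
Ordinary developers only need to write documentation that follows reStructuredText syntax and submit it through a [PR](https://help.github.com/en/articles/about-pull-requests)
to contribute to fastNLP's documentation.

If you want to build the documentation locally and write long passages of documentation, you need to install Sphinx and the sphinx-rtd-theme theme:
```bash
fastNLP/docs> pip install sphinx
fastNLP/docs> pip install sphinx-rtd-theme
```
Then run `make dev` in this directory. The command only supports Linux and macOS; you should see output like the following:
```bash
fastNLP/docs> make dev
rm -rf build/html && make html && make server
Running Sphinx v1.5.6
making output directory...
......
Build finished. The HTML pages are in build/html.
cd build/html && python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
```
You can now open http://localhost:8000/ in your browser to view the documentation. If you are working on a remote server, the address is http://{server IP address}:8000/ ,
but you must make sure that port 8000 on the server is open. If port 8000 on your computer or the remote server is already in use, the program moves on to ports 8001, 8002, and so on.
When you are done, press Control (Ctrl) + C to stop the process.

We list the reStructuredText syntax commonly used in fastNLP's documentation [here](./source/user/example.rst) (use Raw mode when viewing it on the web);
reading it is a quick way to get started. Most of fastNLP's documentation is written in the code and extracted by Sphinx,
so you can also read this [unfinished article](./source/user/docs_in_code.rst) to learn the conventions for in-code documentation.

## Documentation Maintainers

Documentation maintainers need to understand every command in the Makefile and to know that the current documentation structure
was obtained by manually modifying the output of sphinx-apidoc's automatic extraction.
Maintainers should further improve the automation of the whole framework and make sure collaborating developers do not break the overall structure of the documentation project.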
|
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
set SPHINXPROJ=fastNLP

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%

:end
popd
@ -1,2 +0,0 @@
# FastNLP Quick Tutorial

@ -24,9 +24,9 @@ copyright = '2018, xpqiu'
author = 'xpqiu'

# The short X.Y version
version = '0.4'
version = '0.4.5'
# The full version, including alpha/beta/rc tags
release = '0.4'
release = '0.4.5'

# -- General configuration ---------------------------------------------------

@ -1,7 +0,0 @@
fastNLP.modules.aggregator.attention
====================================

.. automodule:: fastNLP.modules.aggregator.attention
    :members:
    :undoc-members:
    :show-inheritance:
@ -1,7 +0,0 @@
fastNLP.modules.aggregator.pooling
==================================

.. automodule:: fastNLP.modules.aggregator.pooling
    :members:
    :undoc-members:
    :show-inheritance:
@ -1,17 +0,0 @@
fastNLP.modules.aggregator
==========================

.. automodule:: fastNLP.modules.aggregator
    :members:
    :undoc-members:
    :show-inheritance:

Submodules
----------

.. toctree::
    :titlesonly:

    fastNLP.modules.aggregator.attention
    fastNLP.modules.aggregator.pooling

@ -12,6 +12,5 @@ fastNLP.modules
.. toctree::
    :titlesonly:

    fastNLP.modules.aggregator
    fastNLP.modules.decoder
    fastNLP.modules.encoder
@ -52,11 +52,9 @@ fastNLP's :mod:`~fastNLP.models` module has built-ins such as :class:`~fastNLP.models
.. toctree::
    :maxdepth: 1

    Installation Guide <user/installation>
    Quick Start <user/quickstart>
    Detailed Guide <user/tutorial_one>
    Research Guide <user/with_fitlog>
    Comment Syntax <user/example>
    Installation Guide </user/installation>
    Quick Start </user/quickstart>
    Detailed Guide </user/tutorials>

API Documentation
-----------------

@ -1,6 +1,6 @@
=================
Research Guide
=================
============================================
Using fitlog with fastNLP for Research
============================================

This article describes how to use fastNLP together with fitlog for research.

156 docs/source/tutorials/tutorial_1_data_preprocess.rst
Normal file
@ -0,0 +1,156 @@
========================================
Data Format and Preprocessing Tutorial
========================================

:class:`~fastNLP.DataSet` is the container that holds data in fastNLP. You can think of a DataSet as a table
in which each row is a sample (called an :mod:`~fastNLP.core.instance` in fastNLP)
and each column is a feature (called a :mod:`~fastNLP.core.field` in fastNLP).

.. csv-table::
   :header: "sentence", "words", "seq_len"

   "This is the first instance .", "[This, is, the, first, instance, .]", 6
   "Second instance .", "[Second, instance, .]", 3
   "Third instance .", "[Third, instance, .]", 3
   "...", "[...]", "..."

The table above shows how a DataSet stores some sample data. Each row is an :class:`~fastNLP.Instance` object and each column is a :class:`~fastNLP.FieldArray` object.
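
The row and column view described above can be sketched in plain Python. This is only an illustration of a column-oriented table: `TinyTable` and its `row` method are hypothetical names invented for this sketch and are not part of fastNLP.

```python
from typing import Any, Dict, List


class TinyTable:
    """Hypothetical stand-in for a tabular container; not fastNLP's DataSet."""

    def __init__(self, columns: Dict[str, List[Any]]):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("every column must have the same number of rows")
        # Data is stored by column: one list per field name.
        self.columns = {name: list(values) for name, values in columns.items()}

    def __len__(self) -> int:
        # Number of rows, taken from any column (0 for an empty table).
        return len(next(iter(self.columns.values()), []))

    def row(self, index: int) -> Dict[str, Any]:
        # A row gathers the value at `index` from every column.
        return {name: values[index] for name, values in self.columns.items()}


table = TinyTable({
    "sentence": ["This is the first instance .", "Second instance ."],
    "words": [["This", "is", "the", "first", "instance", "."], ["Second", "instance", "."]],
    "seq_len": [6, 3],
})
```

Storing one list per column is what makes column-wise preprocessing cheap, while `row` rebuilds a single sample on demand.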

--------------------------------
Building and Deleting Datasets
--------------------------------

The most basic way to initialize a :class:`~fastNLP.DataSet` is to pass in a dictionary:

.. code-block:: python

    from fastNLP import DataSet
    data = {'sentence':["This is the first instance .", "Second instance .", "Third instance ."],
            'words': [['this', 'is', 'the', 'first', 'instance', '.'], ['Second', 'instance', '.'], ['Third', 'instance', '.']],
            'seq_len': [6, 3, 3]}
    dataset = DataSet(data)
    # The value for every key of the dict should be a list, and all lists should have the same length

We can also add data to a dataset with the :func:`~fastNLP.DataSet.append` method:

.. code-block:: python

    from fastNLP import DataSet
    from fastNLP import Instance
    dataset = DataSet()
    instance = Instance(sentence="This is the first instance",
                        words=['this', 'is', 'the', 'first', 'instance', '.'],
                        seq_len=6)
    dataset.append(instance)
    # You can keep appending, but every appended instance should have exactly the same fields as the earlier ones

Alternatively, we can build a dataset from a list of :class:`~fastNLP.Instance` objects:

.. code-block:: python

    from fastNLP import DataSet
    from fastNLP import Instance
    dataset = DataSet([
        Instance(sentence="This is the first instance",
                 words=['this', 'is', 'the', 'first', 'instance', '.'],
                 seq_len=6),
        Instance(sentence="Second instance .",
                 words=['Second', 'instance', '.'],
                 seq_len=3)
    ])

Once the dataset has been built, we can iterate over the contents of a :class:`~fastNLP.DataSet` with a `for` loop.

.. code-block:: python

    for instance in dataset:
        pass  # do something

fastNLP also provides several ways to delete data: :func:`~fastNLP.DataSet.drop`, :func:`~fastNLP.DataSet.delete_instance`, and :func:`~fastNLP.DataSet.delete_field`.

.. code-block:: python

    from fastNLP import DataSet
    dataset = DataSet({'a': list(range(-5, 5))})
    # Return the instances that satisfy the condition, placed in a DataSet
    dropped_dataset = dataset.drop(lambda ins:ins['a']<0, inplace=False)
    # Delete the instances that satisfy the condition from dataset
    dataset.drop(lambda ins:ins['a']<0)  # dataset now has fewer instances
    # Delete the third instance
    dataset.delete_instance(2)
    # Delete the field named 'a'
    dataset.delete_field('a')

-----------------------------
Simple Data Preprocessing
-----------------------------

Because fastNLP stores data by column, most preprocessing operations work on columns ( :mod:`~fastNLP.core.field` ).
First, we can check whether a :mod:`~fastNLP.core.field` with a given name exists, and rename it.

.. code-block:: python

    # Check whether a field named 'a' exists
    dataset.has_field('a')  # or ('a' in dataset)
    # Rename the field 'a' to 'b'
    dataset.rename_field('a', 'b')
    # Length of the DataSet
    len(dataset)

Second, we can use :func:`~fastNLP.DataSet.apply` or :func:`~fastNLP.DataSet.apply_field` to preprocess data.
Both methods take a function that operates on a single :mod:`~fastNLP.core.instance`
and automatically call it on every :mod:`~fastNLP.core.instance` in a :mod:`~fastNLP.core.field` to complete the whole operation.
The function can be a lambda or a fully defined function. You can also use the ``new_field_name`` argument to name the :mod:`~fastNLP.core.field` that stores the result.

.. code-block:: python

    from fastNLP import DataSet
    data = {'sentence':["This is the first instance .", "Second instance .", "Third instance ."]}
    dataset = DataSet(data)

    # Split each sentence into words; see the DataSet.apply() method
    dataset.apply(lambda ins: ins['sentence'].split(), new_field_name='words')

    # Or use DataSet.apply_field()
    dataset.apply_field(lambda sent:sent.split(), field_name='sentence', new_field_name='words')

    # Besides lambdas, you can also define a function and pass it in
    def get_words(instance):
        sentence = instance['sentence']
        words = sentence.split()
        return words
    dataset.apply(get_words, new_field_name='words')
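
The difference between the two calls above is what the callback receives: the whole row, or the value of one column. A minimal sketch of that distinction over a plain list of dicts follows; the free functions `apply` and `apply_field` here are hypothetical illustrations, not fastNLP's implementation.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def apply(rows: List[Row], func: Callable[[Row], Any], new_field_name: str) -> None:
    """Call func on each whole row and store the result in a new column."""
    for row in rows:
        row[new_field_name] = func(row)


def apply_field(rows: List[Row], func: Callable[[Any], Any],
                field_name: str, new_field_name: str) -> None:
    """Call func on one column's value only and store the result in a new column."""
    for row in rows:
        row[new_field_name] = func(row[field_name])


rows = [{"sentence": "This is the first instance ."}, {"sentence": "Second instance ."}]
# The callback sees the whole row and picks the column itself.
apply(rows, lambda row: row["sentence"].split(), new_field_name="words")
# The callback sees only the value stored under field_name.
apply_field(rows, len, field_name="words", new_field_name="seq_len")
```

The single-column form lets you reuse ordinary functions such as `len` without wrapping them in a lambda.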

Besides processing a dataset by hand, you can also use the various :class:`~fastNLP.io.base_loader.DataSetLoader` classes provided by fastNLP to process data.
For details, see the tutorial :doc:`Loading datasets with DataSetLoader </tutorials/tutorial_2_load_dataset>`.

-----------------------------
DataSet and Padding
-----------------------------

In fastNLP, padding is bound to a :mod:`~fastNLP.core.field`, so different fields can be padded in different ways. In English tasks, for example, words and
characters usually need different padding. fastNLP handles this through subclasses of :class:`~fastNLP.Padder`.
By default, every field uses :class:`~fastNLP.AutoPadder`.
A Padder can be set as shown below (if the padder is set to None, the field is not padded).
In most cases :class:`~fastNLP.AutoPadder` is all you need.
If neither :class:`~fastNLP.AutoPadder` nor :class:`~fastNLP.EngChar2DPadder` meets your needs,
you can write your own :class:`~fastNLP.Padder`.

.. code-block:: python

    from fastNLP import DataSet
    from fastNLP import EngChar2DPadder
    import random
    dataset = DataSet()
    max_chars, max_words, sent_num = 5, 10, 20
    contents = [[
        [random.randint(1, 27) for _ in range(random.randint(1, max_chars))]
        for _ in range(random.randint(1, max_words))
    ] for _ in range(sent_num)]
    # Pass the padder in when adding the field
    dataset.add_field('chars', contents, padder=EngChar2DPadder())
    # Set the padder directly
    dataset.set_padder('chars', EngChar2DPadder())
    # You can also set the pad value
    dataset.set_pad_val('chars', -1)
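
To make the idea of character-level padding concrete, here is a plain-Python sketch of padding a batch of sentences, each a list of words made of character ids, to a common shape. It only illustrates the concept; `pad_chars` is a hypothetical helper and makes no claim about how `EngChar2DPadder` is implemented.

```python
from typing import List


def pad_chars(batch: List[List[List[int]]], pad_val: int = 0) -> List[List[List[int]]]:
    """Pad sentences (lists of words, each a list of char ids) to one common shape."""
    max_words = max(len(sentence) for sentence in batch)
    max_chars = max(len(word) for sentence in batch for word in sentence)
    padded = []
    for sentence in batch:
        # Pad every word to the longest word in the batch.
        rows = [word + [pad_val] * (max_chars - len(word)) for word in sentence]
        # Missing words become rows filled entirely with the pad value.
        rows += [[pad_val] * max_chars for _ in range(max_words - len(sentence))]
        padded.append(rows)
    return padded


batch = [[[1, 2, 3], [4]], [[5, 6]]]
padded = pad_chars(batch, pad_val=-1)
```

Word-level fields only need the first of the two padding steps, which is why padding is chosen per field.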

193 docs/source/tutorials/tutorial_2_load_dataset.rst
Normal file
@ -0,0 +1,193 @@
============================
Dataset Loading Tutorial
============================

This part is a tutorial on how to load datasets.

Contents:

- `Part I: Dataset information`_
- `Part II: How datasets are used`_
- `Part III: DataSetLoaders for different data types`_
- `Part IV: DataSetLoader examples`_
- `Part V: Dataset loaders packaged with fastNLP`_

------------------------------
Part I: Dataset information
------------------------------

In fastNLP, we use :class:`~fastNLP.io.base_loader.DataInfo` to store dataset information. The :class:`~fastNLP.io.base_loader.DataInfo`
class contains two important members: `datasets` and `vocabs`.

`datasets` is a dictionary whose `key` is a dataset name (such as `train`, `dev`, and `test`) and whose `value` is a :class:`~fastNLP.DataSet`.

`vocabs` is a dictionary whose `key` is a vocabulary name (for example, :attr:`fastNLP.Const.INPUT` names the vocabulary of the input text and :attr:`fastNLP.Const.TARGET` names the vocabulary
of the gold labels) and whose `value` is the vocabulary itself ( :class:`~fastNLP.Vocabulary` ).
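
The two dictionaries can be pictured with a small dataclass. This is a sketch under stated assumptions: `DataInfoSketch`, the row format, and the vocabulary name `"words"` are all hypothetical and chosen only for illustration; they are not fastNLP's `DataInfo` API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataInfoSketch:
    """Hypothetical stand-in for the structure described above; not fastNLP's DataInfo."""
    datasets: Dict[str, List[dict]] = field(default_factory=dict)    # split name -> rows
    vocabs: Dict[str, Dict[str, int]] = field(default_factory=dict)  # vocab name -> word-to-index map


info = DataInfoSketch()
info.datasets["train"] = [{"words": ["a", "good", "film"], "target": "1"}]
info.datasets["dev"] = [{"words": ["a", "bad", "film"], "target": "0"}]

# Build the input vocabulary from the training split only.
words = sorted({w for row in info.datasets["train"] for w in row["words"]})
info.vocabs["words"] = {w: i for i, w in enumerate(words)}
```

Keeping the splits and the vocabularies in one object means a training loop receives everything it needs to index and decode data in a single return value.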

----------------------------------
Part II: How datasets are used
----------------------------------

In fastNLP, :class:`~fastNLP.io.base_loader.DataSetLoader` is the base class for loading datasets.
:class:`~fastNLP.io.base_loader.DataSetLoader` defines the API that every DataSetLoader needs, and developers should subclass it to implement their own loaders.
The DataSetLoader for each dataset should implement at least the following:

- _load function: reads data from one data file into one :class:`~fastNLP.DataSet`
- load function (the base-class method can be used): reads data from one or more data files into one or more :class:`~fastNLP.DataSet` objects
- process function: reads data from one or more data files and processes it into a trainable :class:`~fastNLP.io.DataInfo`

**\*The process function may call the load function or the _load function**

The :class:`~fastNLP.DataSet` returned by a DataSetLoader's _load or load function contains the text of the dataset, whereas in the
:class:`~fastNLP.io.DataInfo` returned by the process function, `datasets` holds data that has already been indexed and can be passed directly to
:class:`~fastNLP.Trainer`.
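
The division of labour between the three functions can be sketched as follows. Everything in this block is hypothetical (`SketchLoader`, `InMemoryLoader`, the `word_ids` field, and the use of -1 for unknown words); it shows one plausible shape for the pattern, not the library's base class.

```python
from typing import Dict, List, Union


class SketchLoader:
    """Hypothetical base class mirroring the _load / load / process split."""

    def _load(self, path: str) -> List[dict]:
        # One file in, one table of raw text rows out.
        raise NotImplementedError

    def load(self, paths: Union[str, Dict[str, str]]):
        # One path gives one table; a dict of paths gives a dict of tables.
        if isinstance(paths, str):
            return self._load(paths)
        return {name: self._load(path) for name, path in paths.items()}

    def process(self, paths: Dict[str, str]) -> dict:
        datasets = self.load(paths)
        vocab: Dict[str, int] = {}
        for row in datasets.get("train", []):
            for word in row["words"]:
                vocab.setdefault(word, len(vocab))
        for rows in datasets.values():
            for row in rows:
                # Unknown words map to -1 in this sketch.
                row["word_ids"] = [vocab.get(w, -1) for w in row["words"]]
        return {"datasets": datasets, "vocabs": {"words": vocab}}


class InMemoryLoader(SketchLoader):
    """Reads from a dict instead of disk so the sketch runs anywhere."""
    FILES = {"train.txt": ["a good film"], "dev.txt": ["a bad film"]}

    def _load(self, path: str) -> List[dict]:
        return [{"words": line.split()} for line in self.FILES[path]]


info = InMemoryLoader().process({"train": "train.txt", "dev": "dev.txt"})
```

A subclass only overrides `_load`; reading several files and indexing the text are inherited.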

--------------------------------------------------------
Part III: DataSetLoaders for different data types
--------------------------------------------------------

:class:`~fastNLP.io.dataset_loader.CSVLoader`
    Reads dataset files in CSV format. For example:

    .. code-block:: python

        data_set_loader = CSVLoader(
            headers=('words', 'target'), sep='\t'
        )
        # The first item of each line in the CSV file goes into the 'words' field and the second into the 'target' field.
        # Adjacent items are separated by '\t'

        data_set = data_set_loader._load('path/to/your/file')

    Sample dataset content ::

        But it does not leave you with much . 1
        You could hate it for the same reason . 1
        The performances are an absolute joy . 4
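
The header-to-column mapping can be reproduced with the standard `csv` module. This is a sketch of the idea only: `load_delimited` is a hypothetical function, and the sample lines are assumed to be tab-separated as the `sep='\t'` argument above suggests.

```python
import csv
import io
from typing import Dict, List, Sequence


def load_delimited(stream, headers: Sequence[str], sep: str = "\t") -> List[Dict[str, str]]:
    """Map the i-th item of every line to the i-th name in headers."""
    reader = csv.reader(stream, delimiter=sep, quoting=csv.QUOTE_NONE)
    rows = []
    for line_no, items in enumerate(reader, start=1):
        if len(items) != len(headers):
            raise ValueError(f"line {line_no}: expected {len(headers)} items, got {len(items)}")
        rows.append(dict(zip(headers, items)))
    return rows


# In-memory stand-in for a tab-separated file.
sample = io.StringIO(
    "But it does not leave you with much .\t1\n"
    "The performances are an absolute joy .\t4\n"
)
rows = load_delimited(sample, headers=("words", "target"))
```

Note that every value is still a string at this point; splitting the sentence into words and converting the label are separate preprocessing steps.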

:class:`~fastNLP.io.dataset_loader.JsonLoader`
    Reads dataset files in JSON format. The data must be stored line by line, with each line a JSON object holding the various attributes. For example:

    .. code-block:: python

        data_set_loader = JsonLoader(
            fields={'sentence1': 'words1', 'sentence2': 'words2', 'gold_label': 'target'}
        )
        # The values of 'sentence1', 'sentence2', and 'gold_label' in each JSON object are assigned to the fields 'words1', 'words2', and 'target'

        data_set = data_set_loader._load('path/to/your/file')

    Sample dataset content ::

{"annotator_labels": ["neutral"], "captionID": "3416050480.jpg#4", "gold_label": "neutral", "pairID": "3416050480.jpg#4r1n", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is training his horse for a competition.", "sentence2_binary_parse": "( ( A person ) ( ( is ( ( training ( his horse ) ) ( for ( a competition ) ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (VP (VBG training) (NP (PRP$ his) (NN horse)) (PP (IN for) (NP (DT a) (NN competition))))) (. .)))"}
{"annotator_labels": ["contradiction"], "captionID": "3416050480.jpg#4", "gold_label": "contradiction", "pairID": "3416050480.jpg#4r1c", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is at a diner, ordering an omelette.", "sentence2_binary_parse": "( ( A person ) ( ( ( ( is ( at ( a diner ) ) ) , ) ( ordering ( an omelette ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (PP (IN at) (NP (DT a) (NN diner))) (, ,) (S (VP (VBG ordering) (NP (DT an) (NN omelette))))) (. .)))"}
{"annotator_labels": ["entailment"], "captionID": "3416050480.jpg#4", "gold_label": "entailment", "pairID": "3416050480.jpg#4r1e", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is outdoors, on a horse.", "sentence2_binary_parse": "( ( A person ) ( ( ( ( is outdoors ) , ) ( on ( a horse ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (ADVP (RB outdoors)) (, ,) (PP (IN on) (NP (DT a) (NN horse)))) (. .)))"}
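
The field renaming described for line-delimited JSON can be sketched with the standard `json` module. `load_json_lines` is a hypothetical helper, and the two records below are shortened, made-up stand-ins for the SNLI lines shown above.

```python
import io
import json
from typing import Dict, List


def load_json_lines(stream, fields: Dict[str, str]) -> List[Dict[str, object]]:
    """Read one JSON object per line and keep only the keys in fields, renamed."""
    rows = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        rows.append({new: obj[old] for old, new in fields.items()})
    return rows


# Abbreviated illustrative records; real SNLI lines carry many more keys.
sample = io.StringIO(
    '{"gold_label": "neutral", "sentence1": "A person on a horse.", "sentence2": "A person trains."}\n'
    '{"gold_label": "entailment", "sentence1": "A person on a horse.", "sentence2": "A person is outdoors."}\n'
)
rows = load_json_lines(sample, {"sentence1": "words1", "sentence2": "words2", "gold_label": "target"})
```

Keys that are not listed in `fields` are simply dropped, which keeps the loaded table small when the raw records carry many unused attributes.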

------------------------------------------
Part IV: DataSetLoader examples
------------------------------------------

Take the Matching task as an example:

:class:`~fastNLP.io.data_loader.matching.MatchingLoader`
    fastNLP provides a data loading class for Matching-task datasets: :class:`~fastNLP.io.data_loader.matching.MatchingLoader` .

    The MatchingLoader class wraps a function that further preprocesses the text in a dataset:
    :meth:`~fastNLP.io.data_loader.matching.MatchingLoader.process`
    This function offers various preprocessing options, such as:

    - whether to convert the text to lowercase
    - whether sequence length information is needed, and which kind
    - whether to use BertTokenizer to obtain WordPiece information for the sequence
    - and so on

    See :meth:`fastNLP.io.MatchingLoader.process` for details.

:class:`~fastNLP.io.data_loader.matching.SNLILoader`
    A DataSetLoader for the SNLI dataset. The SNLI dataset comes from
    `SNLI Data Set <https://nlp.stanford.edu/projects/snli/snli_1.0.zip>`_ .

    In the :meth:`~fastNLP.io.data_loader.matching.SNLILoader._load` function of :class:`~fastNLP.io.data_loader.matching.SNLILoader`,
    the following code reads the dataset from a text file into memory:

    .. code-block:: python

        def _load(self, path):
            ds = JsonLoader._load(self, path)  # The raw SNLI files are in JSON format, so JsonLoader can read them

            parentheses_table = str.maketrans({'(': None, ')': None})
            # Translation table: the SNLI text is delimited by parentheses that form a tree structure,
            # so we remove these parentheses.

            ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
                     new_field_name=Const.INPUTS(0))
            # Apply the translation table to the first sentence and split it into a list of words
            ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
                     new_field_name=Const.INPUTS(1))
            # Preprocess the second sentence in the same way
            ds.drop(lambda x: x[Const.TARGET] == '-')  # Drop samples whose label is '-'
            return ds
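
The parenthesis-stripping step can be tried on its own with nothing but the standard library. The helper name `parse_to_words` is hypothetical; the input is the `sentence2_binary_parse` value from the first SNLI sample above.

```python
# Translation table that deletes both kinds of parentheses.
PARENTHESES_TABLE = str.maketrans({"(": None, ")": None})


def parse_to_words(binary_parse: str) -> list:
    """Strip tree brackets from a binary parse string and split it into tokens."""
    return binary_parse.translate(PARENTHESES_TABLE).strip().split()


parse = "( ( A person ) ( ( is ( ( training ( his horse ) ) ( for ( a competition ) ) ) ) . ) )"
words = parse_to_words(parse)
```

Because `split()` without arguments collapses runs of whitespace, the gaps left by the removed brackets do not produce empty tokens.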

------------------------------------------------
Part V: Dataset loaders packaged with fastNLP
------------------------------------------------

The dataset loaders packaged with fastNLP cover many kinds of tasks:

- `Text classification tasks`_
- `Sequence labeling tasks`_
- `Matching tasks`_
- `Coreference resolution tasks`_
- `Summarization tasks`_


Text classification tasks
-------------------------

Text classification tasks



Sequence labeling tasks
-----------------------

Sequence labeling tasks


Matching tasks
--------------

:class:`~fastNLP.io.data_loader.matching.SNLILoader`
    A DataSetLoader for the SNLI dataset. The SNLI dataset comes from
    `SNLI Data Set <https://nlp.stanford.edu/projects/snli/snli_1.0.zip>`_ .

:class:`~fastNLP.io.data_loader.matching.MNLILoader`
    A DataSetLoader for the MultiNLI dataset. The MultiNLI dataset comes from the `GLUE benchmark <https://gluebenchmark.com/tasks>`_

:class:`~fastNLP.io.data_loader.matching.QNLILoader`
    A DataSetLoader for the QNLI dataset. The QNLI dataset comes from the `GLUE benchmark <https://gluebenchmark.com/tasks>`_

:class:`~fastNLP.io.data_loader.matching.RTELoader`
    A DataSetLoader for the Recognizing Textual Entailment (RTE) dataset. The RTE dataset comes from the
    `GLUE benchmark <https://gluebenchmark.com/tasks>`_

:class:`~fastNLP.io.data_loader.matching.QuoraLoader`
    A DataSetLoader for the Quora dataset.




Coreference resolution tasks
----------------------------

Coreference resolution tasks



Summarization tasks
-------------------

Summarization tasks


214 docs/source/tutorials/tutorial_3_embedding.rst
Normal file
@ -0,0 +1,214 @@
============================================================
Converting Text to Vectors with the Embedding Module
============================================================

This part is a tutorial on using embeddings in fastNLP.

Contents:

- `Part I: Introduction to embeddings`_
- `Part II: Using a randomly initialized embedding`_
- `Part III: Using pretrained static embeddings`_
- `Part IV: Using pretrained contextual embeddings (ELMo & BERT)`_
- `Part V: Using character-level embeddings`_
- `Part VI: Stacking multiple embeddings`_




---------------------------------------
Part I: Introduction to embeddings
---------------------------------------

Like torch.nn.Embedding, a fastNLP embedding takes an indexed sequence as input and outputs the embedding of that sequence.

fastNLP's embeddings include pretrained embeddings and randomly initialized embeddings.


----------------------------------------------------
Part II: Using a randomly initialized embedding
----------------------------------------------------

For randomly initialized embeddings, see :class:`~fastNLP.modules.encoder.embedding.Embedding` .

You can pass in the vocabulary size and the embedding dimension:

.. code-block:: python

    embed = Embedding(10000, 50)

You can also pass in an initial parameter matrix:

.. code-block:: python

    embed = Embedding(init_embed)

Here init_embed can be a torch.FloatTensor, a torch.nn.Embedding, or a numpy.ndarray.


--------------------------------------------------
Part III: Using pretrained static embeddings
--------------------------------------------------

Before using a pretrained embedding, you need to build a vocabulary ( :class:`~fastNLP.core.vocabulary.Vocabulary` ) from the dataset;
this vocabulary must be passed in as an argument when the pretrained embedding class is initialized.

fastNLP provides the :class:`~fastNLP.modules.encoder.embedding.StaticEmbedding` class.
:class:`~fastNLP.modules.encoder.embedding.StaticEmbedding` can load pretrained static
embeddings, for example:

.. code-block:: python

    embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)

vocab is the vocabulary built from the dataset, and model_dir_or_name can be either a path or the name of an embedding model:

1. If a path is given, fastNLP reads the pretrained weight file at that path and loads the embedding (both glove
   and word2vec weight files are supported)

2. If a model name is given, fastNLP looks up the embedding model by name. If the model is found in the cache directory it is
   loaded automatically; otherwise it is downloaded automatically. The cache directory can be customized through the ``FASTNLP_CACHE_DIR`` environment variable, for example::

       $ FASTNLP_CACHE_DIR=~/fastnlp_cache_dir python your_python_file.py

   With this command fastNLP looks for models in `~/fastnlp_cache_dir` and, if a model is not found, downloads it into that directory automatically
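
Reading a cache location from an environment variable usually follows the pattern below. This is a generic sketch: the function `resolve_cache_dir` and the fallback directory `ASSUMED_DEFAULT` are assumptions made for illustration, and the default that fastNLP actually uses is not stated in this tutorial.

```python
import os
from pathlib import Path

# Assumed fallback for this sketch only; the library's real default may differ.
ASSUMED_DEFAULT = "~/.cache_sketch"


def resolve_cache_dir(env=None) -> Path:
    """Return the directory named by FASTNLP_CACHE_DIR, or the assumed fallback."""
    env = os.environ if env is None else env
    raw = env.get("FASTNLP_CACHE_DIR") or ASSUMED_DEFAULT
    return Path(raw).expanduser()


# Passing a dict instead of os.environ keeps the example independent of the real environment.
custom = resolve_cache_dir({"FASTNLP_CACHE_DIR": "/tmp/fastnlp_cache_dir"})
fallback = resolve_cache_dir({})
```

Setting the variable on the command line, as in the example above, affects only that one process, which is convenient for switching cache locations between experiments.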

The static embedding models currently supported are:

========================== ====================================
Model name                 Model
-------------------------- ------------------------------------
en                         glove.840B.300d
-------------------------- ------------------------------------
en-glove-840d-300          glove.840B.300d
-------------------------- ------------------------------------
en-glove-6b-50             glove.6B.50d
-------------------------- ------------------------------------
en-word2vec-300            Google word2vec, 300-d
-------------------------- ------------------------------------
en-fasttext                English fasttext, 300-d
-------------------------- ------------------------------------
cn                         Tencent Chinese word vectors, 200-d
-------------------------- ------------------------------------
cn-fasttext                Chinese fasttext, 300-d
========================== ====================================



------------------------------------------------------------------
Part IV: Using pretrained contextual embeddings (ELMo & BERT)
------------------------------------------------------------------

fastNLP provides ELMo and BERT embeddings: :class:`~fastNLP.modules.encoder.embedding.ElmoEmbedding`
and :class:`~fastNLP.modules.encoder.embedding.BertEmbedding` .

As with static embeddings, ELMo is used as follows:

.. code-block:: python

    embed = ElmoEmbedding(vocab, model_dir_or_name='small', requires_grad=False)

The ElmoEmbedding models currently supported are:

========================== ====================================
Model name                 Model
-------------------------- ------------------------------------
small                      allennlp ELMo small
-------------------------- ------------------------------------
medium                     allennlp ELMo medium
-------------------------- ------------------------------------
original                   allennlp ELMo original
-------------------------- ------------------------------------
5.5b-original              allennlp ELMo 5.5B original
========================== ====================================

BERT embeddings are used as follows:

.. code-block:: python

    embed = BertEmbedding(
        vocab, model_dir_or_name='en-base-cased', requires_grad=False, layers='4,-2,-1'
    )

The layers argument specifies which layers' encoded outputs to take.
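
One way to read a spec such as `'4,-2,-1'` is as a comma-separated list of indices into the stack of encoder outputs. The sketch below makes two assumptions that this tutorial does not confirm: that negative values count back from the last layer, as in Python indexing, and that index 0 is the embedding output of a 12-layer encoder. `select_layers` is a hypothetical helper.

```python
from typing import List, Sequence


def select_layers(layer_outputs: Sequence[str], layers: str) -> List[str]:
    """Pick encoder outputs named by a comma-separated spec such as '4,-2,-1'."""
    indices = [int(part) for part in layers.split(",")]
    # Python-style indexing: negative values count from the last layer.
    return [layer_outputs[i] for i in indices]


# Stand-in outputs for a 12-layer encoder plus the embedding output at index 0.
outputs = [f"layer{i}" for i in range(13)]
picked = select_layers(outputs, "4,-2,-1")
```

Under these assumptions the spec selects the fourth layer together with the last two layers.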
|
||||
|
||||
目前支持的BertEmbedding模型有:
|
||||
|
||||
========================== ====================================
|
||||
模型名称 模型
|
||||
-------------------------- ------------------------------------
|
||||
en bert-base-cased
|
||||
-------------------------- ------------------------------------
|
||||
en-base-uncased bert-base-uncased
|
||||
-------------------------- ------------------------------------
|
||||
en-base-cased bert-base-cased
|
||||
-------------------------- ------------------------------------
|
||||
en-large-uncased bert-large-uncased
|
||||
-------------------------- ------------------------------------
|
||||
en-large-cased bert-large-cased
|
||||
-------------------------- ------------------------------------
|
||||
-------------------------- ------------------------------------
|
||||
en-large-cased-wwm bert-large-cased-whole-word-mask
|
||||
-------------------------- ------------------------------------
|
||||
en-large-uncased-wwm bert-large-uncased-whole-word-mask
|
||||
-------------------------- ------------------------------------
|
||||
en-base-cased-mrpc bert-base-cased-finetuned-mrpc
|
||||
-------------------------- ------------------------------------
|
||||
-------------------------- ------------------------------------
|
||||
multilingual bert-base-multilingual-cased
|
||||
-------------------------- ------------------------------------
|
||||
multilingual-base-uncased bert-base-multilingual-uncased
|
||||
-------------------------- ------------------------------------
|
||||
multilingual-base-cased bert-base-multilingual-cased
|
||||
========================== ====================================
|
||||
|
||||
-----------------------------------------------------
Part V: Using character-level embeddings
-----------------------------------------------------

Besides pretrained embeddings, fastNLP also provides two character-level embeddings: :class:`~fastNLP.modules.encoder.embedding.CNNCharEmbedding` and
:class:`~fastNLP.modules.encoder.embedding.LSTMCharEmbedding`.

CNNCharEmbedding is used as follows:

.. code-block:: python

    embed = CNNCharEmbedding(vocab, embed_size=100, char_emb_size=50)

Here each character is embedded into a 50-dimensional vector, and the embedding returned for each word has 100 dimensions.

LSTMCharEmbedding is used in the same way:

.. code-block:: python

    embed = LSTMCharEmbedding(vocab, embed_size=100, char_emb_size=50)

As with CNNCharEmbedding, the character embedding size is 50 and the returned word embedding has 100 dimensions.
|
||||
|
||||
|
||||
|
||||
-----------------------------------------------------
Part VI: Stacking multiple embeddings
-----------------------------------------------------

In fastNLP, :class:`~fastNLP.modules.encoder.embedding.StackEmbedding` combines several embeddings into one.

For example:

.. code-block:: python

    embed_1 = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)
    embed_2 = StaticEmbedding(vocab, model_dir_or_name='en-word2vec-300', requires_grad=True)

    stack_embed = StackEmbedding([embed_1, embed_2])

StackEmbedding concatenates the outputs of its embeddings. In the example above, ``stack_embed`` returns 350-dimensional vectors (50 + 300).
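The concatenation can be illustrated without fastNLP. The sketch below is only a toy model of the idea: the lookup tables, the helper names ``make_table`` and ``stack_lookup``, and the constant vectors are all made up for illustration and are not part of the fastNLP API.

```python
# Toy lookup tables standing in for two pretrained embeddings (hypothetical data).
DIM_GLOVE = 50
DIM_WORD2VEC = 300

def make_table(words, dim, fill):
    """Build a fake embedding table: every word maps to a constant vector."""
    return {w: [fill] * dim for w in words}

def stack_lookup(word, tables):
    """Concatenate the vectors returned by each table, in order."""
    vector = []
    for table in tables:
        vector.extend(table[word])
    return vector

words = ["lovely", "film"]
glove_like = make_table(words, DIM_GLOVE, 0.1)
word2vec_like = make_table(words, DIM_WORD2VEC, 0.2)

stacked = stack_lookup("film", [glove_like, word2vec_like])
print(len(stacked))  # 50 + 300 = 350
```

The order of the tables fixes the layout of the stacked vector: the first 50 entries come from the first table and the remaining 300 from the second.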
|
||||
|
||||
A static embedding can also be stacked with a contextual embedding:

.. code-block:: python

    elmo_embedding = ElmoEmbedding(vocab, model_dir_or_name='medium', layers='0,1,2', requires_grad=False)
    glove_embedding = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)

    stack_embed = StackEmbedding([elmo_embedding, glove_embedding])
|
266
docs/source/tutorials/tutorial_4_loss_optimizer.rst
Normal file
@ -0,0 +1,266 @@
|
||||
==============================================================================
Loss and optimizer tutorial: text classification as an example
==============================================================================

We use the same task as in :doc:`/user/quickstart` for a detailed walkthrough: given a piece of review text, predict whether its sentiment is positive (label=1), negative (label=0) or neutral (label=2). We use :class:`~fastNLP.Trainer` and :class:`~fastNLP.Tester` for quick training and testing. Everything before the loss function part is identical to :doc:`/tutorials/tutorial_5_datasetiter`, so you can skip it if you have already read it there.
|
||||
|
||||
-----------------
Data processing
-----------------

Loading the data
    We can use the :class:`~fastNLP.io.SSTLoader` class from fastNLP's :mod:`fastNLP.io` module to read the SST dataset (source: https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip).
    The ``dataset`` below is an instance of fastNLP's :class:`~fastNLP.DataSet` class.

    .. code-block:: python

        from fastNLP.io import SSTLoader

        loader = SSTLoader()
        # all.txt is the concatenation of train.txt, dev.txt and test.txt from the downloaded data
        dataset = loader.load("./trainDevTestTrees_PTB/trees/all.txt")
        print(dataset[0])

    The output is::

        {'words': ['It', "'s", 'a', 'lovely', 'film', 'with', 'lovely', 'performances', 'by', 'Buy', 'and', 'Accorsi', '.'] type=list,
        'target': positive type=str}

    Besides this loader, fastNLP provides Loader classes for other file types, loaders for embeddings, and more. See :doc:`/fastNLP.io`.
|
||||
|
||||
|
||||
Data processing
    We use the :meth:`~fastNLP.DataSet.apply` method of :class:`~fastNLP.DataSet` to convert the ``target`` :mod:`~fastNLP.core.field` to integers.

    .. code-block:: python

        def label_to_int(x):
            if x['target']=="positive":
                return 1
            elif x['target']=="negative":
                return 0
            else:
                return 2

        # convert the label to an integer
        dataset.apply(lambda x: label_to_int(x), new_field_name='target')

    ``words`` and ``target`` are already enough to train :class:`~fastNLP.models.CNNText`, but the documentation of
    :class:`~fastNLP.models.CNNText` shows that :meth:`~fastNLP.models.CNNText.forward` also accepts an optional ``seq_len`` argument.
    So we use :meth:`~fastNLP.DataSet.apply_field` to add a :mod:`~fastNLP.core.field` named ``seq_len``.

    .. code-block:: python

        # add length information
        dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')

    Note that :meth:`~fastNLP.DataSet.apply_field` is similar to :meth:`~fastNLP.DataSet.apply`,
    but the function passed to it operates on a single :mod:`~fastNLP.core.field` of an :class:`~fastNLP.Instance`,
    whereas the function passed to :meth:`~fastNLP.DataSet.apply` operates on the whole :class:`~fastNLP.Instance`.

    .. note::
        A `lambda` function is an anonymous function, a standard Python feature. ``lambda x: len(x)`` does the same as the following function::

            def func_lambda(x):
                return len(x)

        You can also pass more complex functions to :meth:`~fastNLP.DataSet.apply_field` and :meth:`~fastNLP.DataSet.apply`.
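The difference between the two methods can be mimicked in plain Python. In the sketch below a list of dicts stands in for a DataSet and each dict for an Instance; the helper functions ``apply`` and ``apply_field`` are simplified stand-ins written for this illustration, not fastNLP's implementation.

```python
# A list of dicts stands in for a DataSet; each dict plays the role of an Instance.
instances = [
    {"words": ["a", "lovely", "film"], "target": "positive"},
    {"words": ["dull"], "target": "negative"},
]

def apply(data, func, new_field_name):
    """func receives the whole instance (all of its fields)."""
    for ins in data:
        ins[new_field_name] = func(ins)

def apply_field(data, func, field_name, new_field_name):
    """func receives only the value of one field."""
    for ins in data:
        ins[new_field_name] = func(ins[field_name])

# apply: the function indexes into the instance itself.
apply(instances, lambda ins: 1 if ins["target"] == "positive" else 0, "label")
# apply_field: the function only ever sees the 'words' list.
apply_field(instances, len, "words", "seq_len")
print(instances[0]["label"], instances[0]["seq_len"])
```

When a function needs one field only, the ``apply_field`` form keeps it independent of the field name, which is why ``len`` can be passed directly.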
|
||||
|
||||
Using Vocabulary
    Next we use the :class:`~fastNLP.Vocabulary` class to count the words that occur in the data, and :meth:`~fastNLP.Vocabulary.index_dataset`
    to convert the word sequences into index sequences that can be used for training.

    .. code-block:: python

        from fastNLP import Vocabulary

        # count words with Vocabulary and convert word sequences to index sequences
        vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
        vocab.index_dataset(dataset, field_name='words',new_field_name='words')
        print(dataset[0])

    The output is::

        {'words': [27, 9, 6, 913, 16, 18, 913, 124, 31, 5715, 5, 1, 2] type=list,
        'target': 1 type=int,
        'seq_len': 13 type=int}
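What ``min_freq`` and indexing do can be sketched with the standard library. This is a simplified illustration, not fastNLP's Vocabulary: the function names are invented here, and placing a padding token at index 0 and an unknown-word token at index 1 is an assumption (it is consistent with the rare word 'Accorsi' being mapped to 1 in the output above).

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens at indices 0 and 1

def build_vocab(sentences, min_freq=2):
    """Keep words that occur at least min_freq times, most frequent first."""
    counts = Counter(w for sent in sentences for w in sent)
    word2idx = {PAD: 0, UNK: 1}
    for word, freq in counts.most_common():
        if freq >= min_freq:
            word2idx[word] = len(word2idx)
    return word2idx

def to_index(sentence, word2idx):
    """Map each word to its index; words outside the vocabulary map to UNK."""
    return [word2idx.get(w, word2idx[UNK]) for w in sentence]

sentences = [["a", "lovely", "film"], ["a", "lovely", "cast"], ["a", "film"]]
vocab = build_vocab(sentences)
print(to_index(["a", "lovely", "Accorsi"], vocab))
```

Words below the frequency threshold ("cast" here) are dropped from the table and therefore share the unknown-word index at lookup time.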
|
||||
|
||||
|
||||
--------------------------------
Training with a built-in model
--------------------------------

Input and output names of built-in models
    fastNLP ships several complete neural network models (see :doc:`/fastNLP.models`). We train with one of them, :class:`~fastNLP.models.CNNText`.
    To use the built-in :class:`~fastNLP.models.CNNText`, the :mod:`~fastNLP.core.field` names in the :class:`~fastNLP.DataSet` must match the names the model expects.
    In this example the model inputs (the parameters of the forward method) are ``words`` and ``seq_len``, the prediction output is ``pred``, and the gold label is ``target``.
    The naming conventions are documented in :doc:`/fastNLP.core.const`.

    If you prefer not to look them up, you can use the :class:`~fastNLP.Const` class for the names. The code below shows the
    :meth:`~fastNLP.DataSet.rename_field` method, which renames a :mod:`~fastNLP.core.field` of a :class:`~fastNLP.DataSet`, together with the use of :class:`~fastNLP.Const`.

    .. code-block:: python

        from fastNLP import Const

        dataset.rename_field('words', Const.INPUT)
        dataset.rename_field('seq_len', Const.INPUT_LEN)
        dataset.rename_field('target', Const.TARGET)

        print(Const.INPUT)
        print(Const.INPUT_LEN)
        print(Const.TARGET)
        print(Const.OUTPUT)

    The output is::

        words
        seq_len
        target
        pred

    After renaming the fields of the :class:`~fastNLP.DataSet`, we still need to mark which fields are inputs and which are targets, using
    :meth:`~fastNLP.DataSet.set_input` and :meth:`~fastNLP.DataSet.set_target`.

    .. code-block:: python

        # set_input and set_target declare which fields of the dataset are inputs and which are labels (target outputs)
        dataset.set_input(Const.INPUT, Const.INPUT_LEN)
        dataset.set_target(Const.TARGET)
|
||||
|
||||
Splitting the dataset
    Besides modifying fields, we can also split a :class:`~fastNLP.DataSet` into training, development and test sets.
    The code below shows how to use :meth:`~fastNLP.DataSet.split`.

    .. code-block:: python

        train_dev_data, test_data = dataset.split(0.1)
        train_data, dev_data = train_dev_data.split(0.1)
        print(len(train_data), len(dev_data), len(test_data))

    The output is::

        9603 1067 1185
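The three sizes follow from applying a 10% split twice. The sketch below reproduces the arithmetic on a list of integers; the ``split`` helper is written for this illustration, and truncating with ``int`` is an assumption that happens to match the printed sizes, not a statement about fastNLP's internals.

```python
import random

def split(items, ratio, seed=0):
    """Shuffle, then return (first part, second part); the second part holds int(len * ratio) items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_second = int(len(shuffled) * ratio)
    return shuffled[n_second:], shuffled[:n_second]

data = list(range(11855))  # 11855 = 9603 + 1067 + 1185, the total implied by the output above
train_dev, test = split(data, 0.1)   # 10% of everything becomes the test set
train, dev = split(train_dev, 0.1)   # 10% of the remainder becomes the dev set
print(len(train), len(dev), len(test))
```

Because the second split is taken from the remainder, the dev set is 10% of 90% of the data, about 9% overall.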
|
||||
|
||||
Evaluation metric
    Training a model requires an evaluation metric. Here we use accuracy. Its parameters follow the same naming rules as above.
    The ``pred`` parameter is the name of a key in the dict returned by the model's forward method.
    The ``target`` parameter is the name of the :mod:`~fastNLP.core.field` in the :class:`~fastNLP.DataSet` that holds the labels.

    .. code-block:: python

        from fastNLP import AccuracyMetric

        # in this example, metrics=AccuracyMetric() is equivalent to the line below
        metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)
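As a rough picture of what an accuracy metric computes from the ``pred`` and ``target`` entries, here is a plain-Python sketch. The ``accuracy`` function and the toy batch are illustrative assumptions; the real AccuracyMetric works on tensors and accumulates over batches.

```python
def accuracy(pred_scores, targets):
    """pred_scores: one list of class scores per example; targets: gold class ids."""
    correct = 0
    for scores, gold in zip(pred_scores, targets):
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax over classes
        correct += int(predicted == gold)
    return correct / len(targets)

# Keys mirror the names used in the tutorial: model output 'pred', label field 'target'.
batch = {
    "pred": [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9], [0.3, 0.2, 0.1]],
    "target": [1, 0, 2, 2],
}
acc = accuracy(batch["pred"], batch["target"])
print(acc)  # 3 of the 4 examples are classified correctly
```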
|
||||
|
||||
Loss function
    Training a model requires a loss function. fastNLP provides four losses that can be imported and used directly:

    * :class:`~fastNLP.CrossEntropyLoss`: wraps torch.nn.functional.cross_entropy() and returns the cross-entropy loss (suitable for multi-class classification)
    * :class:`~fastNLP.BCELoss`: wraps torch.nn.functional.binary_cross_entropy() and returns the binary cross-entropy
    * :class:`~fastNLP.L1Loss`: wraps torch.nn.functional.l1_loss() and returns the L1 loss
    * :class:`~fastNLP.NLLLoss`: wraps torch.nn.functional.nll_loss() and returns the negative log-likelihood loss

    The cross-entropy loss, commonly used for classification, is shown below. Note its **initialization parameters**.
    The ``pred`` parameter is the name of a key in the dict returned by the model's forward method.
    The ``target`` parameter is the name of the :mod:`~fastNLP.core.field` in the :class:`~fastNLP.DataSet` that holds the labels.
    Here we use :class:`~fastNLP.Const` for the names. If the keys returned by your own model's forward method or
    the field names in your dataset differ from this example, set ``pred`` and ``target`` to the values that match your code.

    .. code-block:: python

        from fastNLP import CrossEntropyLoss

        # in this example, loss = CrossEntropyLoss() is equivalent to the line below
        loss = CrossEntropyLoss(pred=Const.OUTPUT, target=Const.TARGET)
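The four losses listed above have simple closed forms. The sketch below writes them out for a single example with the standard library only; the function names are chosen for this illustration and this is not how fastNLP or PyTorch implement them (both operate on batched tensors).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the gold class."""
    return -log_probs[target]

def cross_entropy(logits, target):
    """Cross-entropy on raw scores = NLL applied to log-softmax."""
    return nll_loss(log_softmax(logits), target)

def binary_cross_entropy(prob, target):
    """prob is the predicted probability of the positive class; target is 0 or 1."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

uniform = cross_entropy([0.0, 0.0, 0.0], target=1)
print(round(uniform, 4))  # ln(3): the loss of an uninformed guess over three classes
```

The value ln(3) ≈ 1.0986 is a useful reference point for a three-class task: a loss well above it early in training usually indicates a problem with the inputs or labels.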
|
||||
|
||||
Optimizer
    Next we define the optimizer used during training. You can use one of the optimizers wrapped by fastNLP:

    * :class:`~fastNLP.SGD`: wraps torch.optim.SGD
    * :class:`~fastNLP.Adam`: wraps torch.optim.Adam

    You can also use any torch.optim.Optimizer directly and pass it in when instantiating :class:`~fastNLP.Trainer`.

    .. code-block:: python

        import torch.optim as optim
        from fastNLP import Adam

        # model_cnn is the CNNText instance created in the quick training step below;
        # create the model before running these lines

        # define an optimizer with torch.optim
        optimizer_1=optim.RMSprop(model_cnn.parameters(), lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
        # define an optimizer with the Adam wrapper from fastNLP
        optimizer_2=Adam(lr=4e-3, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, model_params=model_cnn.parameters())
|
||||
|
||||
Quick training
    Now we can import :class:`~fastNLP.models.CNNText`, the text classification model built into fastNLP, and train it with :class:`~fastNLP.Trainer`.
    Instead of :class:`~fastNLP.Trainer`, you can also write your own training loop with :class:`~fastNLP.DataSetIter`; see :doc:`/tutorials/tutorial_5_datasetiter`.

    .. code-block:: python

        from fastNLP import Trainer
        from fastNLP.models import CNNText

        # embedding dimension, number of training epochs and batch size
        EMBED_DIM = 100
        N_EPOCHS = 10
        BATCH_SIZE = 16

        # the first argument of CNNText is a tuple that defines the embedding (vocabulary size, embedding dimension)
        # custom values for kernel_nums, kernel_sizes, padding and dropout can also be passed
        model_cnn = CNNText((len(vocab),EMBED_DIM), num_classes=3, padding=2, dropout=0.1)

        # if no optimizer is passed to the trainer, the default is torch.optim.Adam with lr=4e-3
        # only optimizer_1 is used here; feel free to try optimizer_2 or other optimizers
        # only loss is used as the loss function here; feel free to try other losses
        trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data, loss=loss, metrics=metrics,
        optimizer=optimizer_1,n_epochs=N_EPOCHS, batch_size=BATCH_SIZE)
        trainer.train()
|
||||
|
||||
    The training output is::

        input fields after batch(if batch size is 2):
            words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 40])
            seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
        target fields after batch(if batch size is 2):
            target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])

        training epochs started 2019-07-08-15-44-48
        Evaluation at Epoch 1/10. Step:601/6010. AccuracyMetric: acc=0.59044

        Evaluation at Epoch 2/10. Step:1202/6010. AccuracyMetric: acc=0.599813

        Evaluation at Epoch 3/10. Step:1803/6010. AccuracyMetric: acc=0.508903

        Evaluation at Epoch 4/10. Step:2404/6010. AccuracyMetric: acc=0.596064

        Evaluation at Epoch 5/10. Step:3005/6010. AccuracyMetric: acc=0.47985

        Evaluation at Epoch 6/10. Step:3606/6010. AccuracyMetric: acc=0.589503

        Evaluation at Epoch 7/10. Step:4207/6010. AccuracyMetric: acc=0.311153

        Evaluation at Epoch 8/10. Step:4808/6010. AccuracyMetric: acc=0.549203

        Evaluation at Epoch 9/10. Step:5409/6010. AccuracyMetric: acc=0.581068

        Evaluation at Epoch 10/10. Step:6010/6010. AccuracyMetric: acc=0.523899


        In Epoch:2/Step:1202, got best dev performance:AccuracyMetric: acc=0.599813
        Reloaded the best model.
|
||||
|
||||
Quick test
    As a counterpart to :class:`~fastNLP.Trainer`, fastNLP provides :class:`~fastNLP.Tester` for quick testing. It is used as follows:

    .. code-block:: python

        from fastNLP import Tester

        tester = Tester(test_data, model_cnn, metrics=AccuracyMetric())
        tester.test()

    The test output is::

        [tester]
        AccuracyMetric: acc=0.565401
|
248
docs/source/tutorials/tutorial_5_datasetiter.rst
Normal file
@ -0,0 +1,248 @@
|
||||
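==============================================================================
DataSetIter tutorial: text classification as an example
==============================================================================

We use the same task as in :doc:`/user/quickstart` for a detailed walkthrough: given a piece of review text, predict whether its sentiment is positive (label=1), negative (label=0) or neutral (label=2). This time we use the :class:`~fastNLP.DataSetIter` class to write our own training loop. Everything before the part on writing your own training loop is identical to :doc:`/tutorials/tutorial_4_loss_optimizer`, so you can skip it if you have already read it there.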
|
||||
|
||||
-----------------
Data processing
-----------------

Loading the data
    We can use the :class:`~fastNLP.io.SSTLoader` class from fastNLP's :mod:`fastNLP.io` module to read the SST dataset (source: https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip).
    The ``dataset`` below is an instance of fastNLP's :class:`~fastNLP.DataSet` class.

    .. code-block:: python

        from fastNLP.io import SSTLoader

        loader = SSTLoader()
        # all.txt is the concatenation of train.txt, dev.txt and test.txt from the downloaded data
        dataset = loader.load("./trainDevTestTrees_PTB/trees/all.txt")
        print(dataset[0])

    The output is::

        {'words': ['It', "'s", 'a', 'lovely', 'film', 'with', 'lovely', 'performances', 'by', 'Buy', 'and', 'Accorsi', '.'] type=list,
        'target': positive type=str}

    Besides this loader, fastNLP provides Loader classes for other file types, loaders for embeddings, and more. See :doc:`/fastNLP.io`.

Data processing
    We use the :meth:`~fastNLP.DataSet.apply` method of :class:`~fastNLP.DataSet` to convert the ``target`` :mod:`~fastNLP.core.field` to integers.

    .. code-block:: python

        def label_to_int(x):
            if x['target']=="positive":
                return 1
            elif x['target']=="negative":
                return 0
            else:
                return 2

        # convert the label to an integer
        dataset.apply(lambda x: label_to_int(x), new_field_name='target')

    ``words`` and ``target`` are already enough to train :class:`~fastNLP.models.CNNText`, but the documentation of
    :class:`~fastNLP.models.CNNText` shows that :meth:`~fastNLP.models.CNNText.forward` also accepts an optional ``seq_len`` argument.
    So we use :meth:`~fastNLP.DataSet.apply_field` to add a :mod:`~fastNLP.core.field` named ``seq_len``.

    .. code-block:: python

        # add length information
        dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')

    Note that :meth:`~fastNLP.DataSet.apply_field` is similar to :meth:`~fastNLP.DataSet.apply`,
    but the function passed to it operates on a single :mod:`~fastNLP.core.field` of an :class:`~fastNLP.Instance`,
    whereas the function passed to :meth:`~fastNLP.DataSet.apply` operates on the whole :class:`~fastNLP.Instance`.

    .. note::
        A `lambda` function is an anonymous function, a standard Python feature. ``lambda x: len(x)`` does the same as the following function::

            def func_lambda(x):
                return len(x)

        You can also pass more complex functions to :meth:`~fastNLP.DataSet.apply_field` and :meth:`~fastNLP.DataSet.apply`.
|
||||
|
||||
Using Vocabulary
    Next we use the :class:`~fastNLP.Vocabulary` class to count the words that occur in the data, and :meth:`~fastNLP.Vocabulary.index_dataset`
    to convert the word sequences into index sequences that can be used for training.

    .. code-block:: python

        from fastNLP import Vocabulary

        # count words with Vocabulary and convert word sequences to index sequences
        vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
        vocab.index_dataset(dataset, field_name='words',new_field_name='words')
        print(dataset[0])

    The output is::

        {'words': [27, 9, 6, 913, 16, 18, 913, 124, 31, 5715, 5, 1, 2] type=list,
        'target': 1 type=int,
        'seq_len': 13 type=int}


--------------------------------
Training with a built-in model
--------------------------------

Input and output names of built-in models
    fastNLP ships several complete neural network models (see :doc:`/fastNLP.models`). We train with one of them, :class:`~fastNLP.models.CNNText`.
    To use the built-in :class:`~fastNLP.models.CNNText`, the :mod:`~fastNLP.core.field` names in the :class:`~fastNLP.DataSet` must match the names the model expects.
    In this example the model inputs (the parameters of the forward method) are ``words`` and ``seq_len``, the prediction output is ``pred``, and the gold label is ``target``.
    The naming conventions are documented in :doc:`/fastNLP.core.const`.

    If you prefer not to look them up, you can use the :class:`~fastNLP.Const` class for the names. The code below shows the
    :meth:`~fastNLP.DataSet.rename_field` method, which renames a :mod:`~fastNLP.core.field` of a :class:`~fastNLP.DataSet`, together with the use of :class:`~fastNLP.Const`.

    .. code-block:: python

        from fastNLP import Const

        dataset.rename_field('words', Const.INPUT)
        dataset.rename_field('seq_len', Const.INPUT_LEN)
        dataset.rename_field('target', Const.TARGET)

        print(Const.INPUT)
        print(Const.INPUT_LEN)
        print(Const.TARGET)
        print(Const.OUTPUT)

    The output is::

        words
        seq_len
        target
        pred

    After renaming the fields of the :class:`~fastNLP.DataSet`, we still need to mark which fields are inputs and which are targets, using
    :meth:`~fastNLP.DataSet.set_input` and :meth:`~fastNLP.DataSet.set_target`.

    .. code-block:: python

        # set_input and set_target declare which fields of the dataset are inputs and which are labels (target outputs)
        dataset.set_input(Const.INPUT, Const.INPUT_LEN)
        dataset.set_target(Const.TARGET)
|
||||
|
||||
Splitting the dataset
    Besides modifying fields, we can also split a :class:`~fastNLP.DataSet` into training, development and test sets.
    The code below shows how to use :meth:`~fastNLP.DataSet.split`.

    .. code-block:: python

        train_dev_data, test_data = dataset.split(0.1)
        train_data, dev_data = train_dev_data.split(0.1)
        print(len(train_data), len(dev_data), len(test_data))

    The output is::

        9603 1067 1185

Evaluation metric
    Training a model requires an evaluation metric. Here we use accuracy. Its parameters follow the same naming rules as above.
    The ``pred`` parameter is the name of a key in the dict returned by the model's forward method.
    The ``target`` parameter is the name of the :mod:`~fastNLP.core.field` in the :class:`~fastNLP.DataSet` that holds the labels.

    .. code-block:: python

        from fastNLP import AccuracyMetric

        # in this example, metrics=AccuracyMetric() is equivalent to the line below
        metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)
|
||||
|
||||
|
||||
--------------------------------
Writing your own training loop
--------------------------------
If you prefer to write the training loop yourself, in the style of plain PyTorch, you can follow the code in this section.
It uses the :class:`~fastNLP.DataSetIter` provided by fastNLP to obtain mini-batches for training,
and passes a :class:`~fastNLP.BucketSampler` to :class:`~fastNLP.DataSetIter` to choose the sampling strategy.

DataSetIter
    The :class:`~fastNLP.DataSetIter` class defined by fastNLP builds batches from a dataset and implements the batch-related functionality. Its initialization parameters are:

    * dataset: a :class:`~fastNLP.DataSet` object, the dataset
    * batch_size: the size of each batch
    * sampler: the :class:`~fastNLP.Sampler` to use. If None, :class:`~fastNLP.RandomSampler` is used (Default: None)
    * as_numpy: if True, batches are returned as `numpy.array`, otherwise as `torch.Tensor` (Default: False)
    * prefetch: if True, the next batch is fetched in advance using multiple processes (Default: False)

sampler
    The samplers implemented by fastNLP are:

    * :class:`~fastNLP.BucketSampler`: randomly draws elements of similar length (initialization parameters: num_buckets, the number of buckets; batch_size, the batch size; seq_len_field_name, the name of the :mod:`~fastNLP.core.field` in the dataset that holds the sequence length)
    * SequentialSampler: draws elements in order (no initialization parameters)
    * RandomSampler: draws elements in random order (no initialization parameters)
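To make the idea of bucketing concrete, here is a simplified sketch of length-based batching in plain Python. It is an illustration of the general technique only: the function ``bucket_sample`` and its exact shuffling scheme are assumptions made for this example and do not reproduce fastNLP's BucketSampler.

```python
import random

def bucket_sample(seq_lens, num_buckets, batch_size, seed=0):
    """Return a list of batches (lists of example indices) whose members have similar lengths."""
    rng = random.Random(seed)
    # Sort indices by sequence length, then cut the sorted list into buckets.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    bucket_size = -(-len(order) // num_buckets)  # ceiling division
    buckets = [order[i:i + bucket_size] for i in range(0, len(order), bucket_size)]
    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)  # random order inside a bucket
        batches.extend(bucket[i:i + batch_size] for i in range(0, len(bucket), batch_size))
    rng.shuffle(batches)  # random order of the batches themselves
    return batches

seq_lens = [3, 40, 5, 38, 4, 41, 6, 39]
batches = bucket_sample(seq_lens, num_buckets=2, batch_size=2)
for batch in batches:
    print([seq_lens[i] for i in batch])
```

Batching similar lengths together reduces the amount of padding per batch, while the two shuffles keep the order of examples random from epoch to epoch.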
|
||||
|
||||
The following code passes a BucketSampler to :class:`~fastNLP.DataSetIter` and uses :class:`~fastNLP.DataSetIter` in a hand-written training loop.

.. code-block:: python

    from fastNLP import BucketSampler
    from fastNLP import DataSetIter
    from fastNLP.models import CNNText
    from fastNLP import Tester
    from fastNLP import AccuracyMetric
    import torch
    import time

    embed_dim = 100
    model = CNNText((len(vocab),embed_dim), num_classes=3, padding=2, dropout=0.1)

    def train(epoch, data, devdata):
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
        lossfunc = torch.nn.CrossEntropyLoss()
        batch_size = 32

        # Define the batch iterator: pass in the DataSet, the batch size and the rule for drawing batches.
        # Sequential, Random, or Bucket (examples of similar length form a batch)
        train_sampler = BucketSampler(batch_size=batch_size, seq_len_field_name='seq_len')
        train_batch = DataSetIter(batch_size=batch_size, dataset=data, sampler=train_sampler)

        start_time = time.time()
        print("-"*5+"start training"+"-"*5)
        for i in range(epoch):
            loss_list = []
            for batch_x, batch_y in train_batch:
                optimizer.zero_grad()
                output = model(batch_x['words'])
                loss = lossfunc(output['pred'], batch_y['target'])
                loss.backward()
                optimizer.step()
                loss_list.append(loss.item())

            # With verbose=0, test() prints nothing and returns the evaluation results;
            # with verbose=1 it prints the results and returns them.
            # After calling test(), _format_eval_results(res) formats the results for printing.
            tester_tmp = Tester(devdata, model, metrics=AccuracyMetric(), verbose=0)
            res=tester_tmp.test()

            print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=" ")
            print(tester_tmp._format_eval_results(res),end=" ")
            print('{:d}ms'.format(round((time.time()-start_time)*1000)))
            loss_list.clear()

    train(10, train_data, dev_data)
    # quick test with Tester
    tester = Tester(test_data, model, metrics=AccuracyMetric())
    tester.test()
|
||||
|
||||
The output of this code is::

    -----start training-----
    Epoch 0 Avg Loss: 1.09 AccuracyMetric: acc=0.480787 58989ms
    Epoch 1 Avg Loss: 1.00 AccuracyMetric: acc=0.500469 118348ms
    Epoch 2 Avg Loss: 0.93 AccuracyMetric: acc=0.536082 176220ms
    Epoch 3 Avg Loss: 0.87 AccuracyMetric: acc=0.556701 236032ms
    Epoch 4 Avg Loss: 0.78 AccuracyMetric: acc=0.562324 294351ms
    Epoch 5 Avg Loss: 0.69 AccuracyMetric: acc=0.58388 353673ms
    Epoch 6 Avg Loss: 0.60 AccuracyMetric: acc=0.574508 412106ms
    Epoch 7 Avg Loss: 0.51 AccuracyMetric: acc=0.589503 471097ms
    Epoch 8 Avg Loss: 0.44 AccuracyMetric: acc=0.581068 529174ms
    Epoch 9 Avg Loss: 0.39 AccuracyMetric: acc=0.572634 586216ms
    [tester]
    AccuracyMetric: acc=0.527426
|
||||
|
||||
|
114
docs/source/tutorials/tutorial_6_seq_labeling.rst
Normal file
@ -0,0 +1,114 @@
|
||||
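============================
Sequence labeling tutorial
============================

This part shows how to implement a sequence labeling task with fastNLP. fastNLP's components let you complete such a task quickly and conveniently.
Before reading this tutorial, you should be familiar with the basics of fastNLP covered in the earlier tutorials, including the basic data structures, data preprocessing and embeddings.
We will process the English CoNLL-03 dataset and walk through the full training process for named entity recognition.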
|
||||
|
||||
Loading the data
===================================
fastNLP can load many kinds of data conveniently. For common datasets, including CoNLL-03, loading methods are already implemented.
Data loaders are built on the DataSetLoader base class, so the loader below can be adapted to other datasets.

.. code-block:: python

    # DataSetLoader, ConllLoader, Const, iob2 and iob2bioes are assumed to be imported already

    class Conll2003DataLoader(DataSetLoader):
        def __init__(self, task:str='ner', encoding_type:str='bioes'):
            assert task in ('ner', 'pos', 'chunk')
            index = {'ner':3, 'pos':1, 'chunk':2}[task]
            # ConllLoader is a class built into fastNLP
            self._loader = ConllLoader(headers=['raw_words', 'target'], indexes=[0, index])
            self._tag_converters = None
            if task in ('ner', 'chunk'):
                # iob2 and iob2bioes normalize the tags to one scheme
                self._tag_converters = [iob2]
                if encoding_type == 'bioes':
                    self._tag_converters.append(iob2bioes)

        def load(self, path: str):
            dataset = self._loader.load(path)
            def convert_tag_schema(tags):
                for converter in self._tag_converters:
                    tags = converter(tags)
                return tags
            if self._tag_converters:
                # apply convert_tag_schema with apply_field; an anonymous function works here as well
                dataset.apply_field(convert_tag_schema, field_name=Const.TARGET, new_field_name=Const.TARGET)
            return dataset
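The loader above chains tag converters. As an illustration of what the second step does, the sketch below converts IOB2 tags to the BIOES scheme in plain Python. The function ``iob2_to_bioes`` is a stand-in written for this example under the usual definition of the two schemes; it is not the ``iob2bioes`` helper used in the loader.

```python
def iob2_to_bioes(tags):
    """Convert IOB2 tags (B-X, I-X, O) to BIOES: single-token spans get S-, span-final tokens get E-."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        span_continues = next_tag == "I-" + label
        if prefix == "B":
            converted.append(("B-" if span_continues else "S-") + label)
        else:  # prefix == "I"
            converted.append(("I-" if span_continues else "E-") + label)
    return converted

tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]
print(iob2_to_bioes(tags))
```

BIOES marks span boundaries explicitly, which is the reason the loader and the later metric both take an ``encoding_type`` argument: the tags must be decoded with the same scheme they were encoded in.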
|
||||
|
||||
The loaded data has the following format::

    {'raw_words': ['on', 'Friday', ':'] type=list,
    'target': ['O', 'O', 'O'] type=list},
|
||||
|
||||
|
||||
Data processing
----------------------------
We process the data further and wrap the data and the vocabularies in a :class:`~fastNLP.DataInfo` object. ``data`` is an instance of DataInfo.
The model takes both character embeddings and word embeddings as input. In the data processing step we build the vocabulary,
using fastNLP's Vocabulary class.

.. code-block:: python

    word_vocab = Vocabulary(min_freq=2)
    word_vocab.from_dataset(data.datasets['train'], field_name=Const.INPUT)
    word_vocab.index_dataset(*data.datasets.values(),field_name=Const.INPUT, new_field_name=Const.INPUT)

After processing, the ``data`` object contains::

    dataset
    vocabs

``data.datasets`` holds the train and test data as DataSet objects.
``data.vocabs`` holds the vocabularies for words, raw-words and target.
|
||||
|
||||
Building the model
--------------------------------
We use a CNN-BiLSTM-CRF model for this task. Network definitions in fastNLP inherit from PyTorch's :class:`nn.Module` class,
so you can define your own network the PyTorch way. Pay attention to naming: fastNLP's standard names are defined in the :class:`~fastNLP.Const` class.

Training the model
    First instantiate the model and load the character embedding and word embedding it needs. Loading embeddings is covered in the embedding tutorial;
    see also :mod:`~fastNLP.modules.encoder.embedding` for the available loading methods.
    fastNLP wraps the training process in the :class:`~fastNLP.Trainer` class.
    Adjust the Trainer's parameters to the task at hand. A Trainer instance usually needs the training dataset, the model, the optimizer, the loss function, the evaluation metric, the number of epochs, the batch size and similar parameters.

    .. code-block:: python

        # word_embed, char_embed, encoding_type and callbacks are assumed to be defined beforehand
        # instantiate the model
        model = CNNBiLSTMCRF(word_embed, char_embed, hidden_size=200, num_layers=1, tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type)
        # define the optimizer (the call signature suggests torch.optim.Adam; this is an assumption)
        optimizer = Adam(model.parameters(), lr=0.005)
        # define the evaluation metric
        metrics = SpanFPreRecMetric(tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type)
        # instantiate the trainer
        trainer = Trainer(train_data=data.datasets['train'], model=model, optimizer=optimizer, dev_data=data.datasets['test'], batch_size=10, metrics=metrics, callbacks=callbacks, n_epochs=100)
        # start training
        trainer.train()
|
||||
|
||||
    The best parameters found during training are saved.
    The training output is::

        Evaluation on DataSet test:
        SpanFPreRecMetric: f=0.727661, pre=0.732293, rec=0.723088
        Evaluation at Epoch 1/100. Step:1405/140500. SpanFPreRecMetric: f=0.727661, pre=0.732293, rec=0.723088

        Evaluation on DataSet test:
        SpanFPreRecMetric: f=0.784307, pre=0.779371, rec=0.789306
        Evaluation at Epoch 2/100. Step:2810/140500. SpanFPreRecMetric: f=0.784307, pre=0.779371, rec=0.789306

        Evaluation on DataSet test:
        SpanFPreRecMetric: f=0.810068, pre=0.811003, rec=0.809136
        Evaluation at Epoch 3/100. Step:4215/140500. SpanFPreRecMetric: f=0.810068, pre=0.811003, rec=0.809136

        Evaluation on DataSet test:
        SpanFPreRecMetric: f=0.829592, pre=0.84153, rec=0.817989
        Evaluation at Epoch 4/100. Step:5620/140500. SpanFPreRecMetric: f=0.829592, pre=0.84153, rec=0.817989

        Evaluation on DataSet test:
        SpanFPreRecMetric: f=0.828789, pre=0.837096, rec=0.820644
        Evaluation at Epoch 5/100. Step:7025/140500. SpanFPreRecMetric: f=0.828789, pre=0.837096, rec=0.820644
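The three numbers in each log line are related: ``f`` is the harmonic mean of ``pre`` (precision) and ``rec`` (recall). The short check below, written for this illustration, recomputes the first epoch's value from the logged precision and recall; the helper name ``f1_score`` is not a fastNLP function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as logged for epoch 1.
f_epoch1 = f1_score(0.732293, 0.723088)
print(round(f_epoch1, 4))
```

Because it is a harmonic mean, ``f`` always lies between precision and recall and is pulled toward the smaller of the two.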
|
||||
|
||||
|
205
docs/source/tutorials/tutorial_7_modules_models.rst
Normal file
@ -0,0 +1,205 @@
|
||||
======================================
Modules and models tutorial
======================================

:mod:`~fastNLP.modules` and :mod:`~fastNLP.models` are used to build the neural network models that fastNLP needs. They can be used together with the models in torch.nn.
The three sections below describe how to build models.
|
||||
|
||||
|
||||
---------------------------
Using the built-in models
---------------------------

The :mod:`~fastNLP.models` module of fastNLP contains complete models such as :class:`~fastNLP.models.CNNText` and
:class:`~fastNLP.models.SeqLabeling` that can be used directly.
Taking :class:`~fastNLP.models.CNNText` as an example, let us look at a simple text classification task.

First comes data loading and processing. The code is the same as in the :doc:`quick start </user/quickstart>`.

.. code-block:: python

    from fastNLP.io import CSVLoader
    from fastNLP import Vocabulary, CrossEntropyLoss, AccuracyMetric

    loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\t')
    dataset = loader.load("./sample_data/tutorial_sample_dataset.csv")

    dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')
    dataset.apply_field(lambda x: x.split(), field_name='sentence', new_field_name='words', is_input=True)
    dataset.apply(lambda x: int(x['label']), new_field_name='target', is_target=True)

    train_dev_data, test_data = dataset.split(0.1)
    train_data, dev_data = train_dev_data.split(0.1)

    vocab = Vocabulary(min_freq=2).from_dataset(train_data, field_name='words')
    vocab.index_dataset(train_data, dev_data, test_data, field_name='words', new_field_name='words')
|
||||
|
||||
Then we import the ``CNNText`` model from :mod:`~fastNLP.models` and train it.

.. code-block:: python

    from fastNLP.models import CNNText
    from fastNLP import Trainer

    model_cnn = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)

    trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data,
                      loss=CrossEntropyLoss(), metrics=AccuracyMetric())
    trainer.train()
|
||||
|
||||
Entering ``model_cnn`` in an IPython session shows the network structure of the model.

.. parsed-literal::

    CNNText(
      (embed): Embedding(
        169, 50
        (dropout): Dropout(p=0.0)
      )
      (conv_pool): ConvMaxpool(
        (convs): ModuleList(
          (0): Conv1d(50, 3, kernel_size=(3,), stride=(1,), padding=(2,))
          (1): Conv1d(50, 4, kernel_size=(4,), stride=(1,), padding=(2,))
          (2): Conv1d(50, 5, kernel_size=(5,), stride=(1,), padding=(2,))
        )
      )
      (dropout): Dropout(p=0.1)
      (fc): Linear(in_features=12, out_features=5, bias=True)
    )
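The ``in_features=12`` of the final linear layer can be derived from the three convolutions in the printout. The sketch below does the arithmetic in plain Python; the sequence length of 13 is an arbitrary value picked for illustration, and the helper ``conv1d_out_len`` simply encodes the standard output-length formula for a 1-D convolution with dilation 1.

```python
def conv1d_out_len(seq_len, kernel_size, padding, stride=1):
    """Output length of a 1-D convolution (dilation 1)."""
    return (seq_len + 2 * padding - kernel_size) // stride + 1

kernel_nums = (3, 4, 5)   # output channels of the three Conv1d layers in the printout
kernel_sizes = (3, 4, 5)  # their kernel sizes
padding = 2
seq_len = 13              # arbitrary example length

conv_lens = [conv1d_out_len(seq_len, k, padding) for k in kernel_sizes]
# Max-pooling over time keeps one value per channel, whatever the sequence length,
# so the pooled features of all convolutions concatenate to sum(kernel_nums) values.
fc_in_features = sum(kernel_nums)
print(conv_lens, fc_in_features)
```

The per-convolution output lengths differ, but they disappear after pooling, which is why the classifier input size does not depend on the sentence length.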
|
||||
|
||||
The models built into fastNLP are listed in the table below. Click a name to see the detailed API:

.. csv-table::
   :header: Name, Description

   :class:`~fastNLP.models.CNNText` , text classification model based on a CNN
   :class:`~fastNLP.models.SeqLabeling` , simple sequence labeling model
   :class:`~fastNLP.models.AdvSeqLabel` , sequence labeling model with a larger network
   :class:`~fastNLP.models.ESIM` , implementation of the ESIM model
   :class:`~fastNLP.models.StarTransEnc` , Star-Transformer model with word embedding
   :class:`~fastNLP.models.STSeqLabel` , Star-Transformer model for sequence labeling
   :class:`~fastNLP.models.STNLICls` , Star-Transformer model for natural language inference (NLI)
   :class:`~fastNLP.models.STSeqCls` , Star-Transformer model for classification
   :class:`~fastNLP.models.BiaffineParser` , implementation of the Biaffine dependency parsing network
|
||||
|
||||
-------------------------------
Writing a model with torch.nn
-------------------------------

fastNLP fully supports models written in PyTorch. Unlike the usual way of writing PyTorch models, however,
the forward function of a model used with fastNLP must return a dict that contains at least the key ``pred``.
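The contract can be pictured as name matching: batch fields are handed to ``forward`` by parameter name, and results are read back from the returned dict by key. The sketch below illustrates that idea in plain Python. ``ToyModel`` and ``call_forward`` are hypothetical names invented for this example, and the matching logic is an assumption about the general approach, not fastNLP's actual code.

```python
import inspect

class ToyModel:
    """Stand-in for an nn.Module; only the forward contract matters here."""
    def forward(self, words, seq_len=None):
        # Fake scores: one list of two 'logits' per example.
        logits = [[float(len(w)), 0.0] for w in words]
        return {"pred": logits}

def call_forward(model, batch_x):
    """Pass only the batch fields whose names match forward's parameters."""
    params = inspect.signature(model.forward).parameters
    kwargs = {name: value for name, value in batch_x.items() if name in params}
    return model.forward(**kwargs)

batch_x = {"words": [[4, 8, 15], [16, 23]], "seq_len": [3, 2], "unused_field": ["x", "y"]}
output = call_forward(ToyModel(), batch_x)
print(sorted(output))       # keys of the returned dict
print(len(output["pred"]))  # one prediction per example
```

With this kind of matching, a loss or metric configured with ``pred='pred'`` can find the model output without knowing anything else about the model.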
|
||||
|
||||
Below is a text classification model written with PyTorch's torch.nn module. Note the tensor shapes given in the comments.
PyTorch's default dimension order puts the sequence dimension first, so forward has to rearrange dimensions several times.

.. code-block:: python

    import torch
    import torch.nn as nn

    class LSTMText(nn.Module):
        def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
            super().__init__()

            self.embedding = nn.Embedding(vocab_size, embedding_dim)
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True, dropout=dropout)
            self.fc = nn.Linear(hidden_dim * 2, output_dim)
            self.dropout = nn.Dropout(dropout)

        def forward(self, words):
            # (input) words : (batch_size, seq_len)
            words = words.permute(1,0)
            # words : (seq_len, batch_size)

            embedded = self.dropout(self.embedding(words))
            # embedded : (seq_len, batch_size, embedding_dim)
            output, (hidden, cell) = self.lstm(embedded)
            # output: (seq_len, batch_size, hidden_dim * 2)
            # hidden: (num_layers * 2, batch_size, hidden_dim)
            # cell: (num_layers * 2, batch_size, hidden_dim)

            # concatenate the final forward and backward hidden states of the last layer
            hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
            hidden = self.dropout(hidden)
            # hidden: (batch_size, hidden_dim * 2)

            pred = self.fc(hidden)
            # pred: (batch_size, output_dim)
            return {"pred":pred}
|
||||
|
||||
The network structure of this model can be inspected in IPython in the same way.

.. parsed-literal::

    LSTMText(
      (embedding): Embedding(169, 50)
      (lstm): LSTM(50, 64, num_layers=2, dropout=0.5, bidirectional=True)
      (fc): Linear(in_features=128, out_features=5, bias=True)
      (dropout): Dropout(p=0.5)
    )
|
||||
|
||||
----------------------------
|
||||
Writing Models with modules
|
||||
----------------------------
|
||||
|
||||
Next we build the same network from components in :mod:`fastNLP.modules`. Because fastNLP consistently puts ``batch_size`` in the first dimension,
the code becomes somewhat simpler to write.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    import torch
    import torch.nn as nn
    from fastNLP.modules import Embedding, LSTM, MLP

    class Model(nn.Module):
        def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
            super().__init__()

            self.embedding = Embedding((vocab_size, embedding_dim))
            self.lstm = LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
            self.mlp = MLP([hidden_dim*2,output_dim], dropout=dropout)

        def forward(self, words):
            # words : (batch_size, seq_len), no permute needed
            embedded = self.embedding(words)
            _,(hidden,_) = self.lstm(embedded)
            # concatenate the two directions of the top layer
            pred = self.mlp(torch.cat((hidden[-1],hidden[-2]),dim=1))
            return {"pred":pred}
|
||||
|
||||
The network structure of the model we just wrote is as follows:
|
||||
|
||||
.. parsed-literal::
|
||||
|
||||
Model(
|
||||
(embedding): Embedding(
|
||||
169, 50
|
||||
(dropout): Dropout(p=0.0)
|
||||
)
|
||||
(lstm): LSTM(
|
||||
(lstm): LSTM(50, 64, num_layers=2, batch_first=True, bidirectional=True)
|
||||
)
|
||||
(mlp): MLP(
|
||||
(hiddens): ModuleList()
|
||||
(output): Linear(in_features=128, out_features=5, bias=True)
|
||||
(dropout): Dropout(p=0.5)
|
||||
)
|
||||
)
|
||||
|
||||
The modules included in fastNLP are listed in the table below; click a name to see its detailed API:
|
||||
|
||||
.. csv-table::
   :header: Name, Description

   :class:`~fastNLP.modules.ConvolutionCharEncoder` , Character-level convolutional encoder
   :class:`~fastNLP.modules.LSTMCharEncoder` , Character-level LSTM-based encoder
   :class:`~fastNLP.modules.ConvMaxpool` , Module that combines convolution and max-pooling
   :class:`~fastNLP.modules.Embedding` , Basic embedding module
   :class:`~fastNLP.modules.LSTM` , "LSTM module, a thin wrapper around PyTorch's LSTM"
   :class:`~fastNLP.modules.StarTransformer` , Encoder part of Star-Transformer
   :class:`~fastNLP.modules.TransformerEncoder` , "Transformer encoder module, without the embedding layer"
   :class:`~fastNLP.modules.VarRNN` , Variational Dropout RNN module
   :class:`~fastNLP.modules.VarLSTM` , Variational Dropout LSTM module
   :class:`~fastNLP.modules.VarGRU` , Variational Dropout GRU module
   :class:`~fastNLP.modules.MaxPool` , Max-pooling module
   :class:`~fastNLP.modules.MaxPoolWithMask` , Max-pooling with a mask matrix; positions where the mask is 0 are ignored during pooling
   :class:`~fastNLP.modules.MultiHeadAttention` , Multi-head attention module
   :class:`~fastNLP.modules.MLP` , Simple multi-layer perceptron module
   :class:`~fastNLP.modules.ConditionalRandomField` , Conditional random field module
   :func:`~fastNLP.modules.viterbi_decode` , "Given a feature matrix and a transition score matrix, computes the best path and its score (used together with :class:`~fastNLP.modules.ConditionalRandomField` )"
   :func:`~fastNLP.modules.allowed_transitions` , "Given an id-to-label mapping, returns the list of all allowed transitions (used together with :class:`~fastNLP.modules.ConditionalRandomField` )"
|
121
docs/source/tutorials/tutorial_8_metrics.rst
Normal file
@ -0,0 +1,121 @@
|
||||
=====================
|
||||
Metric Tutorial
|
||||
=====================
|
||||
|
||||
fastNLP provides a variety of metrics for training in :mod:`~fastNLP.core.metrics` .
As shown in :doc:`/user/quickstart` , an :class:`~fastNLP.AccuracyMetric` object is passed directly to :class:`~fastNLP.Trainer` for use during training:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric
|
||||
|
||||
trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
|
||||
loss=CrossEntropyLoss(), metrics=AccuracyMetric())
|
||||
trainer.train()
|
||||
|
||||
Besides :class:`~fastNLP.AccuracyMetric` , :class:`~fastNLP.SpanFPreRecMetric` is another very common metric.
In sequence labeling, for example, F-measure, precision, and recall are usually computed over spans.
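To make the span-level quantities concrete, here is a small pure-Python sketch of precision, recall, and F-beta over sets of spans. It only illustrates the standard formulas and is not the fastNLP implementation; the function name ``f_beta`` and the ``(label, start, end)`` span tuples are assumptions made for this example.

```python
def f_beta(n_correct, n_pred, n_gold, beta=1.0):
    """Return (precision, recall, F-beta) from span counts."""
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# A span counts as correct only if label and both boundaries match exactly.
gold = {("PER", 0, 2), ("LOC", 5, 6), ("ORG", 8, 10)}
pred = {("PER", 0, 2), ("LOC", 4, 6)}

p, r, f = f_beta(len(gold & pred), len(pred), len(gold))
print("precision={:.3f} recall={:.3f} f1={:.3f}".format(p, r, f))
```

Here only the ``PER`` span matches, so one of two predictions is correct and one of three gold spans is found.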
|
||||
|
||||
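fastNLP also implements :class:`~fastNLP.ExtractiveQAMetric` , a metric for extractive QA tasks such as SQuAD.
See the table below; click a name in the first column for the detailed documentation of each metric in :mod:`~fastNLP.core.metrics` .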
|
||||
|
||||
.. csv-table::
   :header: Name, Description

   :class:`~fastNLP.core.metrics.MetricBase` , Base class that custom metrics must inherit from
   :class:`~fastNLP.core.metrics.AccuracyMetric` , Simple accuracy metric
   :class:`~fastNLP.core.metrics.SpanFPreRecMetric` , "Metric that computes F-measure, precision, and recall together"
   :class:`~fastNLP.core.metrics.ExtractiveQAMetric` , Metric for extractive QA tasks
|
||||
|
||||
More metrics are being added to :mod:`~fastNLP.core.metrics` ; stay tuned.
|
||||
|
||||
------------------------------
|
||||
Defining Your Own Metrics
|
||||
------------------------------
|
||||
|
||||
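To define your own metric class, inherit from fastNLP's :class:`~fastNLP.core.metrics.MetricBase`
and override the ``evaluate`` and ``get_metric`` methods.

``evaluate(xxx)`` receives one batch of data and accumulates the statistics for that batch's predictions.

``get_metric(xxx)`` is called once all data has been processed; it computes the final result from the statistics accumulated by ``evaluate``.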
|
||||
|
||||
Take accuracy in a classification problem as an example. Suppose the dict returned by the model's forward contains the key ``pred``, and that this key is needed to compute accuracy::

    class Model(nn.Module):
        def __init__(self, xxx):
            # do something
        def forward(self, xxx):
            # do something
            return {'pred': pred, 'other_keys':xxx} # pred's shape: batch_size x num_classes
|
||||
|
||||
Suppose the ``label`` field of the dataset holds the values to predict and has been set as a target.
The corresponding AccMetric can then be defined as follows. Version 1, for one-off use::

    class AccMetric(MetricBase):
        def __init__(self):
            super().__init__()

            # define whatever statistics your metric needs
            self.corr_num = 0
            self.total = 0

        def evaluate(self, label, pred): # these names must match the target field of the dataset and the key returned by the model; otherwise the values cannot be found
            # called once after every batch during dev or test; implement how the metric accumulates over batches
            self.total += label.size(0)
            # pred has shape batch_size x num_classes, so take the argmax over classes before comparing
            self.corr_num += label.eq(pred.argmax(dim=-1)).sum().item()

        def get_metric(self, reset=True): # define here how the metric is computed
            acc = self.corr_num/self.total
            if reset: # whether to reset the counters for the next evaluation
                self.corr_num = 0
                self.total = 0
            return {'acc': acc} # must return a dict whose key is the metric name; the name is shown in the Trainer's progress bar
|
||||
|
||||
|
||||
Version 2, for when the metric needs to be reused, for example when the target field of the next dataset is called ``y`` instead of ``label``, or the model output is not called ``pred``::

    class AccMetric(MetricBase):
        def __init__(self, label=None, pred=None):
            # Suppose that in another setting the target field is called y and the model returns the key pred_y.
            # Then it is enough to initialize the metric as acc_metric = AccMetric(label='y', pred='pred_y').
            # With acc_metric = AccMetric(), i.e. label=None and pred=None, fastNLP uses 'label' and 'pred'
            # directly as the keys to look up the values.
            super().__init__()
            self._init_param_map(label=label, pred=pred) # registers label and pred; only the parameter names used by evaluate() need to be registered
            # without this registration the behavior is the same as in version 1

            # define whatever statistics your metric needs
            self.corr_num = 0
            self.total = 0

        def evaluate(self, label, pred): # the parameter names here must match those registered with self._init_param_map()
            # called once after every batch during dev or test; implement how the metric accumulates over batches
            self.total += label.size(0)
            # pred has shape batch_size x num_classes, so take the argmax over classes before comparing
            self.corr_num += label.eq(pred.argmax(dim=-1)).sum().item()

        def get_metric(self, reset=True): # define here how the metric is computed
            acc = self.corr_num/self.total
            if reset: # whether to reset the counters for the next evaluation
                self.corr_num = 0
                self.total = 0
            return {'acc': acc} # must return a dict whose key is the metric name; the name is shown in the Trainer's progress bar
|
||||
|
||||
|
||||
``MetricBase`` checks the input dicts ``pred_dict`` and ``target_dict`` .
``pred_dict`` is the return value of the model's ``forward()`` or ``predict()`` function.
``target_dict`` holds the ground truth from the DataSet; a field counts as ground truth when its ``is_target`` is set to True.

``MetricBase`` performs the following checks:

1. Whether self.evaluate has varargs, which are not supported.
2. Whether a parameter required by self.evaluate is in neither ``pred_dict`` nor ``target_dict`` .
3. Whether a parameter required by self.evaluate is in both ``pred_dict`` and ``target_dict`` .

In addition, before the arguments are passed to self.evaluate, ``MetricBase`` detects entries of ``pred_dict`` and ``target_dict`` that are not used.
This detection is skipped if self.evaluate accepts kwargs.

self.evaluate computes the metric for one batch and accumulates it; it has no return value.
self.get_metric aggregates the accumulated statistics and returns the result, which must be a dict mapping each metric name to its value.
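The three checks can be sketched with the standard ``inspect`` module. This is a hypothetical illustration of the idea only, not the code ``MetricBase`` actually runs; the helper name ``check_evaluate_args`` and the choice of ``NameError`` are assumptions made for this example.

```python
import inspect


def check_evaluate_args(evaluate, pred_dict, target_dict):
    """Hypothetical helper: resolve the arguments of `evaluate` from two dicts."""
    spec = inspect.getfullargspec(evaluate)
    # Check 1: *args is not supported.
    if spec.varargs is not None:
        raise NameError("*args is not supported in evaluate()")
    wanted = [name for name in spec.args if name != "self"]
    # Check 2: parameters found in neither dict.
    missing = [n for n in wanted if n not in pred_dict and n not in target_dict]
    # Check 3: parameters found in both dicts (ambiguous source).
    duplicated = [n for n in wanted if n in pred_dict and n in target_dict]
    if missing or duplicated:
        raise NameError("missing={}, duplicated={}".format(missing, duplicated))
    merged = dict(target_dict)
    merged.update(pred_dict)
    return {name: merged[name] for name in wanted}


def evaluate(label, pred):
    return label, pred


args = check_evaluate_args(evaluate, {"pred": [1, 0]}, {"label": [1, 1]})
print(args)
```

Resolving arguments by name is what makes the parameter names of ``evaluate`` significant: a mismatch with the dataset field or the model output key shows up as a missing parameter.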
|
||||
|
67
docs/source/tutorials/tutorial_9_callback.rst
Normal file
@ -0,0 +1,67 @@
|
||||
==============================================================================
|
||||
Callback Tutorial
|
||||
==============================================================================
|
||||
|
||||
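During training we often apply tricks to improve model performance (such as adjusting the learning rate) or want to print information about the run.
The Callback class lets us insert code into the Trainer to perform such custom operations.

We use the same task as in :doc:`/user/quickstart` for a detailed walkthrough.
Given a piece of review text, predict whether its sentiment is positive (label=1), negative (label=0), or neutral (label=2), using :class:`~fastNLP.Trainer` and :class:`~fastNLP.Tester` for quick training and testing.
Data processing and the choice of loss and optimizer are covered in the other tutorials; here we only add learning rate decay to training.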
|
||||
|
||||
----------------------------
Building and Using Callbacks
----------------------------
|
||||
|
||||
Creating a Callback
    We can define our own callback by subclassing fastNLP's :class:`~fastNLP.Callback` .
    Here we implement a callback that decays the learning rate linearly.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    import fastNLP

    class LRDecay(fastNLP.Callback):
        def __init__(self):
            super(LRDecay, self).__init__()
            self.base_lrs = []
            self.delta = []

        def on_train_begin(self):
            # initialization; called only once, when training starts
            self.base_lrs = [pg['lr'] for pg in self.optimizer.param_groups]
            self.delta = [float(lr) / self.n_epochs for lr in self.base_lrs]

        def on_epoch_end(self):
            # update the learning rate at the end of every epoch
            ep = self.epoch
            lrs = [lr - d * ep for lr, d in zip(self.base_lrs, self.delta)]
            self.change_lr(lrs)

        def change_lr(self, lrs):
            for pg, lr in zip(self.optimizer.param_groups, lrs):
                pg['lr'] = lr
|
||||
|
||||
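Here, every method of :class:`~fastNLP.Callback` whose name starts with ``on_`` is called by :class:`~fastNLP.Trainer` at a specific point during training.
For example, on_train_begin() is called when training starts, and on_epoch_end() is called at the end of every epoch.
See the documentation for the full list of these methods.

For convenience, attributes of the :class:`~fastNLP.Trainer` can also be accessed from inside a :class:`~fastNLP.Callback` , such as optimizer, epoch, and step, which are the optimizer used for training, the current epoch number, and the current total step count.
See the documentation for the full list of accessible attributes.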
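The arithmetic behind the linear decay in ``LRDecay`` can be checked without a Trainer. The sketch below reproduces only the formula ``lr - delta * ep``; the function name ``linear_decay_schedule`` is made up for this example, and it assumes the epoch counter seen in ``on_epoch_end`` starts at 1.

```python
def linear_decay_schedule(base_lr, n_epochs):
    """Learning rate set at the end of each epoch, for ep = 1 .. n_epochs."""
    delta = base_lr / n_epochs  # same step size as self.delta in LRDecay
    return [base_lr - delta * ep for ep in range(1, n_epochs + 1)]


schedule = linear_decay_schedule(base_lr=0.1, n_epochs=10)
for ep, lr in enumerate(schedule, start=1):
    print("after epoch {:2d}: lr = {:.3f}".format(ep, lr))
```

Under that assumption the rate reaches zero only after the final epoch, so every epoch still trains with a positive learning rate.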
|
||||
|
||||
Using a Callback
    Once the :class:`~fastNLP.Callback` is defined, pass it to the ``callbacks`` argument of the Trainer to use it during training.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
"""
|
||||
Data preprocessing, model definition, and so on
|
||||
"""
|
||||
|
||||
trainer = fastNLP.Trainer(
|
||||
model=model, train_data=train_data, dev_data=dev_data,
|
||||
optimizer=optimizer, metrics=metrics,
|
||||
batch_size=10, n_epochs=100,
|
||||
callbacks=[LRDecay()])
|
||||
|
||||
trainer.train()
|
3
docs/source/user/docs_in_code.rst
Normal file
@ -0,0 +1,3 @@
|
||||
===============
|
||||
在代码中写文档
|
||||
===============
|
@ -20,7 +20,13 @@
|
||||
小标题4
|
||||
-------------------
|
||||
|
||||
参考 http://docutils.sourceforge.net/docs/user/rst/quickref.html
|
||||
推荐使用大标题、小标题3和小标题4
|
||||
|
||||
官方文档 http://docutils.sourceforge.net/docs/user/rst/quickref.html
|
||||
|
||||
`熟悉markdown的同学推荐参考这篇文章 <https://macplay.github.io/posts/cong-markdown-dao-restructuredtext/#id30>`_
|
||||
|
||||
\<\>内表示的是链接地址,\<\>外的是显示到外面的文字
|
||||
|
||||
常见语法
|
||||
============
|
||||
@ -75,6 +81,7 @@ http://docutils.sf.net/ 孤立的网址会自动生成链接
|
||||
不显示冒号的代码块
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
:linenos:
|
||||
:emphasize-lines: 1,3
|
||||
|
||||
@ -83,22 +90,67 @@ http://docutils.sf.net/ 孤立的网址会自动生成链接
|
||||
print("有行号和高亮")
|
||||
|
||||
数学块
|
||||
==========
|
||||
|
||||
.. math::
|
||||
|
||||
H_2O + Na = NaOH + H_2 \uparrow
|
||||
|
||||
复杂表格
|
||||
==========
|
||||
|
||||
各种连接
|
||||
===========
|
||||
+------------------------+------------+----------+----------+
|
||||
| Header row, column 1 | Header 2 | Header 3 | Header 4 |
|
||||
| (header rows optional) | | | |
|
||||
+========================+============+==========+==========+
|
||||
| body row 1, column 1 | column 2 | column 3 | column 4 |
|
||||
+------------------------+------------+----------+----------+
|
||||
| body row 2 | Cells may span columns. |
|
||||
+------------------------+------------+---------------------+
|
||||
| body row 3 | Cells may | - Table cells |
|
||||
+------------------------+ span rows. | - contain |
|
||||
| body row 4 | | - body elements. |
|
||||
+------------------------+------------+---------------------+
|
||||
|
||||
:doc:`/user/with_fitlog`
|
||||
简易表格
|
||||
==========
|
||||
|
||||
===== ===== ======
|
||||
Inputs Output
|
||||
------------ ------
|
||||
A B A or B
|
||||
===== ===== ======
|
||||
False False False
|
||||
True True True
|
||||
===== ===== ======
|
||||
|
||||
csv 表格
|
||||
============
|
||||
|
||||
.. csv-table::
|
||||
:header: sentence, target
|
||||
|
||||
This is the first instance ., 0
|
||||
Second instance ., 1
|
||||
Third instance ., 1
|
||||
..., ...
|
||||
|
||||
|
||||
|
||||
[重要]各种链接
|
||||
===================
|
||||
|
||||
各种链接帮助我们连接到fastNLP文档的各个位置
|
||||
|
||||
\<\>内表示的是链接地址,\<\>外的是显示到外面的文字
|
||||
|
||||
:doc:`根据文件名链接 </user/quickstart>`
|
||||
|
||||
:mod:`~fastNLP.core.batch`
|
||||
|
||||
:class:`~fastNLP.Batch`
|
||||
|
||||
~表示指显示最后一项
|
||||
~表示只显示最后一项
|
||||
|
||||
:meth:`fastNLP.DataSet.apply`
|
||||
|
||||
|
@ -7,10 +7,12 @@
|
||||
|
||||
fastNLP 依赖如下包::
|
||||
|
||||
torch>=0.4.0
|
||||
numpy
|
||||
tqdm
|
||||
nltk
|
||||
numpy>=1.14.2
|
||||
torch>=1.0.0
|
||||
tqdm>=4.28.1
|
||||
nltk>=3.4.1
|
||||
requests
|
||||
spacy
|
||||
|
||||
其中torch的安装可能与操作系统及 CUDA 的版本相关,请参见 `PyTorch 官网 <https://pytorch.org/get-started/locally/>`_ 。
|
||||
在依赖包安装完成的情况,您可以在命令行执行如下指令完成安装
|
||||
@ -18,3 +20,4 @@ fastNLP 依赖如下包::
|
||||
.. code:: shell
|
||||
|
||||
>>> pip install fastNLP
|
||||
>>> python -m spacy download en
|
||||
|
@ -121,4 +121,4 @@
|
||||
In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.8
|
||||
Reloaded the best model.
|
||||
|
||||
这份教程只是简单地介绍了使用 fastNLP 工作的流程,具体的细节分析见 :doc:`/user/tutorial_one`
|
||||
这份教程只是简单地介绍了使用 fastNLP 工作的流程,更多的教程分析见 :doc:`/user/tutorials`
|
||||
|
@ -1,371 +0,0 @@
|
||||
===============
|
||||
详细指南
|
||||
===============
|
||||
|
||||
我们使用和 :doc:`/user/quickstart` 中一样的任务来进行详细的介绍。给出一段文字,预测它的标签是0~4中的哪一个
|
||||
(数据来源 `kaggle <https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews>`_ )。
|
||||
|
||||
--------------
|
||||
数据处理
|
||||
--------------
|
||||
|
||||
数据读入
|
||||
我们可以使用 fastNLP :mod:`fastNLP.io` 模块中的 :class:`~fastNLP.io.CSVLoader` 类,轻松地从 csv 文件读取我们的数据。
|
||||
这里的 dataset 是 fastNLP 中 :class:`~fastNLP.DataSet` 类的对象
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP.io import CSVLoader
|
||||
|
||||
loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\t')
|
||||
dataset = loader.load("./sample_data/tutorial_sample_dataset.csv")
|
||||
|
||||
除了读取数据外,fastNLP 还提供了读取其它文件类型的 Loader 类、读取 Embedding的 Loader 等。详见 :doc:`/fastNLP.io` 。
|
||||
|
||||
Instance 和 DataSet
|
||||
fastNLP 中的 :class:`~fastNLP.DataSet` 类对象类似于二维表格,它的每一列是一个 :mod:`~fastNLP.core.field`
|
||||
每一行是一个 :mod:`~fastNLP.core.instance` 。我们可以手动向数据集中添加 :class:`~fastNLP.Instance` 类的对象
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP import Instance
|
||||
|
||||
dataset.append(Instance(raw_sentence='fake data', label='0'))
|
||||
|
||||
此时的 ``dataset[-1]`` 的值如下,可以看到,数据集中的每个数据包含 ``raw_sentence`` 和 ``label`` 两个
|
||||
:mod:`~fastNLP.core.field` ,他们的类型都是 ``str`` ::
|
||||
|
||||
{'raw_sentence': fake data type=str, 'label': 0 type=str}
|
||||
|
||||
field 的修改
|
||||
我们使用 :class:`~fastNLP.DataSet` 类的 :meth:`~fastNLP.DataSet.apply` 方法将 ``raw_sentence`` 中字母变成小写,并将句子分词。
|
||||
同时也将 ``label`` :mod:`~fastNLP.core.field` 转化为整数并改名为 ``target``
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')
|
||||
dataset.apply_field(lambda x: x.split(), field_name='sentence', new_field_name='words')
|
||||
dataset.apply(lambda x: int(x['label']), new_field_name='target')
|
||||
|
||||
``words`` 和 ``target`` 已经足够用于 :class:`~fastNLP.models.CNNText` 的训练了,但我们从其文档
|
||||
:class:`~fastNLP.models.CNNText` 中看到,在 :meth:`~fastNLP.models.CNNText.forward` 的时候,还可以传入可选参数 ``seq_len`` 。
|
||||
所以,我们再使用 :meth:`~fastNLP.DataSet.apply_field` 方法增加一个名为 ``seq_len`` 的 :mod:`~fastNLP.core.field` 。
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')
|
||||
|
||||
观察可知: :meth:`~fastNLP.DataSet.apply_field` 与 :meth:`~fastNLP.DataSet.apply` 类似,
|
||||
但所传入的 `lambda` 函数是针对一个 :class:`~fastNLP.Instance` 中的一个 :mod:`~fastNLP.core.field` 的;
|
||||
而 :meth:`~fastNLP.DataSet.apply` 所传入的 `lambda` 函数是针对整个 :class:`~fastNLP.Instance` 的。
|
||||
|
||||
.. note::
|
||||
`lambda` 函数即匿名函数,是 Python 的重要特性。 ``lambda x: len(x)`` 和下面的这个函数的作用相同::
|
||||
|
||||
def func_lambda(x):
|
||||
return len(x)
|
||||
|
||||
你也可以编写复杂的函数做为 :meth:`~fastNLP.DataSet.apply_field` 与 :meth:`~fastNLP.DataSet.apply` 的参数
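The difference between the two methods is only what the function receives: a whole instance or the value of a single field. The plain-Python sketch below mimics that behavior on a list of dicts; the ``apply`` and ``apply_field`` functions here are simplified stand-ins written for illustration, not the fastNLP methods.

```python
instances = [
    {"raw_sentence": "A Good Movie", "label": "1"},
    {"raw_sentence": "Too Slow", "label": "0"},
]


def apply(data, func, new_field_name):
    # func receives the whole instance (one row of the table)
    for ins in data:
        ins[new_field_name] = func(ins)


def apply_field(data, func, field_name, new_field_name):
    # func receives only the value of one field (one cell of the table)
    for ins in data:
        ins[new_field_name] = func(ins[field_name])


apply(instances, lambda ins: ins["raw_sentence"].lower(), new_field_name="sentence")
apply_field(instances, lambda s: s.split(), field_name="sentence", new_field_name="words")
apply_field(instances, len, field_name="words", new_field_name="seq_len")
print(instances[0])
```

When a transformation needs more than one field, ``apply`` is the natural choice; for a single field, ``apply_field`` keeps the function simpler.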
|
||||
|
||||
Using Vocabulary
    Next we use the :class:`~fastNLP.Vocabulary` class to count the words that occur in the data, and :meth:`~fastNLP.Vocabulary.index_dataset`
    to convert the word sequences into index sequences that can be used for training.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP import Vocabulary
|
||||
|
||||
vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
|
||||
vocab.index_dataset(dataset, field_name='words',new_field_name='words')
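What ``min_freq`` and indexing do can be sketched in a few lines of standard-library Python. This is an illustration of the idea, not the fastNLP implementation; the helper names, the ``<pad>`` and ``<unk>`` tokens, and their positions 0 and 1 are assumptions made for this example.

```python
from collections import Counter


def build_vocab(sentences, min_freq=1, padding="<pad>", unknown="<unk>"):
    """Map each word seen at least min_freq times to an integer index."""
    counts = Counter(word for words in sentences for word in words)
    word2idx = {padding: 0, unknown: 1}
    for word, freq in counts.most_common():
        if freq >= min_freq:
            word2idx[word] = len(word2idx)
    return word2idx


def index_words(words, word2idx, unknown="<unk>"):
    """Replace each word by its index; rare or unseen words map to unknown."""
    return [word2idx.get(w, word2idx[unknown]) for w in words]


sentences = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "film"]]
word2idx = build_vocab(sentences, min_freq=2)
indexed = [index_words(words, word2idx) for words in sentences]
print(word2idx)
print(indexed)
```

With ``min_freq=2`` only ``a`` and ``movie`` get their own entries; every other word shares the unknown index.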
|
||||
|
||||
数据集分割
|
||||
除了修改 :mod:`~fastNLP.core.field` 之外,我们还可以对 :class:`~fastNLP.DataSet` 进行分割,以供训练、开发和测试使用。
|
||||
下面这段代码展示了 :meth:`~fastNLP.DataSet.split` 的使用方法(但实际应该放在后面两段改名和设置输入的代码之后)
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
train_dev_data, test_data = dataset.split(0.1)
|
||||
train_data, dev_data = train_dev_data.split(0.1)
|
||||
len(train_data), len(dev_data), len(test_data)
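Splitting twice with a ratio of 0.1 yields roughly 81%, 9%, and 10% of the data for training, development, and testing. The sketch below checks that arithmetic with a hypothetical ``split`` function that returns the larger part first, as the unpacking order in the code above suggests; it is not the fastNLP method.

```python
import random


def split(items, ratio, seed=0):
    """Hypothetical stand-in: return (larger part, part of about `ratio` size)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * ratio)
    return items[cut:], items[:cut]


data = range(1000)
train_dev, test = split(data, 0.1)
train, dev = split(train_dev, 0.1)
sizes = (len(train), len(dev), len(test))
print(sizes)
```

The second ratio applies to the 90% that remains after the first split, which is why the development set is 9% rather than 10% of the full data.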
|
||||
|
||||
---------------------
|
||||
使用内置模型训练
|
||||
---------------------
|
||||
|
||||
内置模型的输入输出命名
|
||||
fastNLP内置了一些完整的神经网络模型,详见 :doc:`/fastNLP.models` , 我们使用其中的 :class:`~fastNLP.models.CNNText` 模型进行训练。
|
||||
为了使用内置的 :class:`~fastNLP.models.CNNText`,我们必须修改 :class:`~fastNLP.DataSet` 中 :mod:`~fastNLP.core.field` 的名称。
|
||||
在这个例子中模型输入 (forward方法的参数) 为 ``words`` 和 ``seq_len`` ; 预测输出为 ``pred`` ;标准答案为 ``target`` 。
|
||||
具体的命名规范可以参考 :doc:`/fastNLP.core.const` 。
|
||||
|
||||
如果不想查看文档,您也可以使用 :class:`~fastNLP.Const` 类进行命名。下面的代码展示了给 :class:`~fastNLP.DataSet` 中
|
||||
:mod:`~fastNLP.core.field` 改名的 :meth:`~fastNLP.DataSet.rename_field` 方法,以及 :class:`~fastNLP.Const` 类的使用方法。
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP import Const
|
||||
|
||||
dataset.rename_field('words', Const.INPUT)
|
||||
dataset.rename_field('seq_len', Const.INPUT_LEN)
|
||||
dataset.rename_field('target', Const.TARGET)
|
||||
|
||||
在给 :class:`~fastNLP.DataSet` 中 :mod:`~fastNLP.core.field` 改名后,我们还需要设置训练所需的输入和目标,这里使用的是
|
||||
:meth:`~fastNLP.DataSet.set_input` 和 :meth:`~fastNLP.DataSet.set_target` 两个函数。
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
dataset.set_input(Const.INPUT, Const.INPUT_LEN)
|
||||
dataset.set_target(Const.TARGET)
|
||||
|
||||
快速训练
|
||||
现在我们可以导入 fastNLP 内置的文本分类模型 :class:`~fastNLP.models.CNNText` ,并使用 :class:`~fastNLP.Trainer` 进行训练了
|
||||
(其中 ``loss`` 和 ``metrics`` 的定义,我们将在后续两段代码中给出)。
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP.models import CNNText
|
||||
from fastNLP import Trainer
|
||||
|
||||
model_cnn = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
|
||||
|
||||
trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data,
|
||||
loss=loss, metrics=metrics)
|
||||
trainer.train()
|
||||
|
||||
训练过程的输出如下::
|
||||
|
||||
input fields after batch(if batch size is 2):
|
||||
words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
|
||||
target fields after batch(if batch size is 2):
|
||||
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
|
||||
|
||||
training epochs started 2019-05-09-10-59-39
|
||||
Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.333333
|
||||
|
||||
Evaluation at Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.533333
|
||||
|
||||
Evaluation at Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.533333
|
||||
|
||||
Evaluation at Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.533333
|
||||
|
||||
Evaluation at Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.6
|
||||
|
||||
Evaluation at Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.8
|
||||
|
||||
Evaluation at Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.8
|
||||
|
||||
Evaluation at Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.733333
|
||||
|
||||
Evaluation at Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.733333
|
||||
|
||||
Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.733333
|
||||
|
||||
|
||||
In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.8
|
||||
Reloaded the best model.
|
||||
|
||||
损失函数
|
||||
训练模型需要提供一个损失函数, 下面提供了一个在分类问题中常用的交叉熵损失。注意它的 **初始化参数** 。
|
||||
``pred`` 参数对应的是模型的 forward 方法返回的 dict 中的一个 key 的名字。
|
||||
``target`` 参数对应的是 :class:`~fastNLP.DataSet` 中作为标签的 :mod:`~fastNLP.core.field` 的名字。
|
||||
这里我们用 :class:`~fastNLP.Const` 来辅助命名,如果你自己编写模型中 forward 方法的返回值或
|
||||
数据集中 :mod:`~fastNLP.core.field` 的名字与本例不同, 你可以把 ``pred`` 参数和 ``target`` 参数设定符合自己代码的值。
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP import CrossEntropyLoss
|
||||
|
||||
# loss = CrossEntropyLoss() 在本例中与下面这行代码等价
|
||||
loss = CrossEntropyLoss(pred=Const.OUTPUT, target=Const.TARGET)
|
||||
|
||||
评价指标
|
||||
训练模型需要提供一个评价指标。这里使用准确率做为评价指标。参数的 `命名规则` 跟上面类似。
|
||||
``pred`` 参数对应的是模型的 forward 方法返回的 dict 中的一个 key 的名字。
|
||||
``target`` 参数对应的是 :class:`~fastNLP.DataSet` 中作为标签的 :mod:`~fastNLP.core.field` 的名字。
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP import AccuracyMetric
|
||||
|
||||
# metrics=AccuracyMetric() 在本例中与下面这行代码等价
|
||||
metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)
|
||||
|
||||
快速测试
|
||||
与 :class:`~fastNLP.Trainer` 对应,fastNLP 也提供了 :class:`~fastNLP.Tester` 用于快速测试,用法如下
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP import Tester
|
||||
|
||||
tester = Tester(test_data, model_cnn, metrics=AccuracyMetric())
|
||||
tester.test()
|
||||
|
||||
---------------------
|
||||
编写自己的模型
|
||||
---------------------
|
||||
|
||||
因为 fastNLP 是基于 `PyTorch <https://pytorch.org/>`_ 开发的框架,所以我们可以基于 PyTorch 模型编写自己的神经网络模型。
|
||||
与标准的 PyTorch 模型不同,fastNLP 模型中 forward 方法返回的是一个字典,字典中至少需要包含 "pred" 这个字段。
|
||||
而 forward 方法的参数名称必须与 :class:`~fastNLP.DataSet` 中用 :meth:`~fastNLP.DataSet.set_input` 设定的名称一致。
|
||||
模型定义的代码如下:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
|
||||
class LSTMText(nn.Module):
|
||||
def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
|
||||
super().__init__()
|
||||
|
||||
self.embedding = nn.Embedding(vocab_size, embedding_dim)
|
||||
self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True, dropout=dropout)
|
||||
self.fc = nn.Linear(hidden_dim * 2, output_dim)
|
||||
self.dropout = nn.Dropout(dropout)
|
||||
|
||||
def forward(self, words):
|
||||
# (input) words : (batch_size, seq_len)
|
||||
words = words.permute(1,0)
|
||||
# words : (seq_len, batch_size)
|
||||
|
||||
embedded = self.dropout(self.embedding(words))
|
||||
# embedded : (seq_len, batch_size, embedding_dim)
|
||||
output, (hidden, cell) = self.lstm(embedded)
|
||||
# output: (seq_len, batch_size, hidden_dim * 2)
|
||||
# hidden: (num_layers * 2, batch_size, hidden_dim)
|
||||
# cell: (num_layers * 2, batch_size, hidden_dim)
|
||||
|
||||
hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
|
||||
hidden = self.dropout(hidden)
|
||||
# hidden: (batch_size, hidden_dim * 2)
|
||||
|
||||
pred = self.fc(hidden.squeeze(0))
|
||||
# result: (batch_size, output_dim)
|
||||
return {"pred":pred}
|
||||
|
||||
模型的使用方法与内置模型 :class:`~fastNLP.models.CNNText` 一致
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
model_lstm = LSTMText(len(vocab),50,5)
|
||||
|
||||
trainer = Trainer(model=model_lstm, train_data=train_data, dev_data=dev_data,
|
||||
loss=loss, metrics=metrics)
|
||||
trainer.train()
|
||||
|
||||
tester = Tester(test_data, model_lstm, metrics=AccuracyMetric())
|
||||
tester.test()
|
||||
|
||||
.. todo::
|
||||
使用 :doc:`/fastNLP.modules` 编写模型
|
||||
|
||||
--------------------------
|
||||
自己编写训练过程
|
||||
--------------------------
|
||||
|
||||
如果你想用类似 PyTorch 的使用方法,自己编写训练过程,你可以参考下面这段代码。其中使用了 fastNLP 提供的 :class:`~fastNLP.Batch`
|
||||
来获得小批量训练的小批量数据,使用 :class:`~fastNLP.BucketSampler` 做为 :class:`~fastNLP.Batch` 的参数来选择采样的方式。
|
||||
这段代码中使用了 PyTorch 的 `torch.optim.Adam` 优化器 和 `torch.nn.CrossEntropyLoss` 损失函数,并自己计算了正确率
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP import BucketSampler
|
||||
from fastNLP import Batch
|
||||
import torch
|
||||
import time
|
||||
|
||||
model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
|
||||
|
||||
def train(epoch, data):
|
||||
optim = torch.optim.Adam(model.parameters(), lr=0.001)
|
||||
lossfunc = torch.nn.CrossEntropyLoss()
|
||||
batch_size = 32
|
||||
|
||||
train_sampler = BucketSampler(batch_size=batch_size, seq_len_field_name='seq_len')
|
||||
train_batch = Batch(batch_size=batch_size, dataset=data, sampler=train_sampler)
|
||||
|
||||
start_time = time.time()
|
||||
for i in range(epoch):
|
||||
loss_list = []
|
||||
for batch_x, batch_y in train_batch:
|
||||
optim.zero_grad()
|
||||
output = model(batch_x['words'])
|
||||
loss = lossfunc(output['pred'], batch_y['target'])
|
||||
loss.backward()
|
||||
optim.step()
|
||||
loss_list.append(loss.item())
|
||||
print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=" ")
|
||||
print('{:d}ms'.format(round((time.time()-start_time)*1000)))
|
||||
loss_list.clear()
|
||||
|
||||
train(10, train_data)
|
||||
|
||||
tester = Tester(test_data, model, metrics=AccuracyMetric())
|
||||
tester.test()
|
||||
|
||||
这段代码的输出如下::
|
||||
|
||||
Epoch 0 Avg Loss: 2.76 17ms
|
||||
Epoch 1 Avg Loss: 2.55 29ms
|
||||
Epoch 2 Avg Loss: 2.37 41ms
|
||||
Epoch 3 Avg Loss: 2.30 53ms
|
||||
Epoch 4 Avg Loss: 2.12 65ms
|
||||
Epoch 5 Avg Loss: 2.16 76ms
|
||||
Epoch 6 Avg Loss: 1.88 88ms
|
||||
Epoch 7 Avg Loss: 1.84 99ms
|
||||
Epoch 8 Avg Loss: 1.71 111ms
|
||||
Epoch 9 Avg Loss: 1.62 122ms
|
||||
[tester]
|
||||
AccuracyMetric: acc=0.142857
|
||||
|
||||
----------------------------------
|
||||
使用 Callback 增强 Trainer
|
||||
----------------------------------
|
||||
|
||||
如果你不想自己实现繁琐的训练过程,只希望在训练过程中实现一些自己的功能(比如:输出从训练开始到当前 batch 结束的总时间),
|
||||
你可以使用 fastNLP 提供的 :class:`~fastNLP.Callback` 类。下面的例子中,我们继承 :class:`~fastNLP.Callback` 类实现了这个功能。
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from fastNLP import Callback
|
||||
|
||||
start_time = time.time()
|
||||
|
||||
class MyCallback(Callback):
|
||||
def on_epoch_end(self):
|
||||
print('Sum Time: {:d}ms\n\n'.format(round((time.time()-start_time)*1000)))
|
||||
|
||||
|
||||
model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
|
||||
trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
|
||||
loss=CrossEntropyLoss(), metrics=AccuracyMetric(), callbacks=[MyCallback()])
|
||||
trainer.train()
|
||||
|
||||
训练输出如下::
|
||||
|
||||
input fields after batch(if batch size is 2):
|
||||
words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 16])
|
||||
seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
|
||||
target fields after batch(if batch size is 2):
|
||||
target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
|
||||
|
||||
training epochs started 2019-05-12-21-38-40
|
||||
Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.285714
|
||||
|
||||
Sum Time: 51ms
|
||||
|
||||
|
||||
…………………………
|
||||
|
||||
|
||||
Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.857143
|
||||
|
||||
Sum Time: 212ms
|
||||
|
||||
|
||||
|
||||
In Epoch:10/Step:20, got best dev performance:AccuracyMetric: acc=0.857143
|
||||
Reloaded the best model.
|
||||
|
||||
这个例子只是介绍了 :class:`~fastNLP.Callback` 类的使用方法。实际应用(比如:负采样、Learning Rate Decay、Early Stop 等)中
|
||||
很多功能已经被 fastNLP 实现了。你可以直接 import 它们使用,详细请查看文档 :doc:`/fastNLP.core.callback` 。
|
18
docs/source/user/tutorials.rst
Normal file
@ -0,0 +1,18 @@
|
||||
===================
|
||||
fastNLP详细使用教程
|
||||
===================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
1. 使用DataSet预处理文本 </tutorials/tutorial_1_data_preprocess>
|
||||
2. 使用DataSetLoader加载数据集 </tutorials/tutorial_2_load_dataset>
|
||||
3. 使用Embedding模块将文本转成向量 </tutorials/tutorial_3_embedding>
|
||||
4. 动手实现一个文本分类器I-使用Trainer和Tester快速训练和测试 </tutorials/tutorial_4_loss_optimizer>
|
||||
5. 动手实现一个文本分类器II-使用DataSetIter实现自定义训练过程 </tutorials/tutorial_5_datasetiter>
|
||||
6. 快速实现序列标注模型 </tutorials/tutorial_6_seq_labeling>
|
||||
7. 使用Modules和Models快速搭建自定义模型 </tutorials/tutorial_7_modules_models>
|
||||
8. 使用Metric快速评测你的模型 </tutorials/tutorial_8_metrics>
|
||||
9. 使用Callback自定义你的训练过程 </tutorials/tutorial_9_callback>
|
||||
10. 使用fitlog 辅助 fastNLP 进行科研 </tutorials/tutorial_10_fitlog>
|
||||
|
@ -37,7 +37,7 @@ __all__ = [
|
||||
|
||||
"AccuracyMetric",
|
||||
"SpanFPreRecMetric",
|
||||
"SQuADMetric",
|
||||
"ExtractiveQAMetric",
|
||||
|
||||
"Optimizer",
|
||||
"SGD",
|
||||
@ -56,8 +56,9 @@ __all__ = [
|
||||
|
||||
"cache_results"
|
||||
]
|
||||
__version__ = '0.4.0'
|
||||
__version__ = '0.4.5'
|
||||
|
||||
from .core import *
|
||||
from . import models
|
||||
from . import modules
|
||||
from .io import data_loader
|
||||
|
@ -21,7 +21,7 @@ from .dataset import DataSet
|
||||
from .field import FieldArray, Padder, AutoPadder, EngChar2DPadder
|
||||
from .instance import Instance
|
||||
from .losses import LossFunc, CrossEntropyLoss, L1Loss, BCELoss, NLLLoss, LossInForward
|
||||
from .metrics import AccuracyMetric, SpanFPreRecMetric, SQuADMetric
|
||||
from .metrics import AccuracyMetric, SpanFPreRecMetric, ExtractiveQAMetric
|
||||
from .optimizer import Optimizer, SGD, Adam
|
||||
from .sampler import SequentialSampler, BucketSampler, RandomSampler, Sampler
|
||||
from .tester import Tester
|
||||
|
@ -448,10 +448,10 @@ class FitlogCallback(Callback):
|
||||
并将验证结果写入到fitlog中。这些数据集的结果是根据dev上最好的结果报道的,即如果dev在第3个epoch取得了最佳,则
|
||||
fitlog中记录的关于这些数据集的结果就是来自第三个epoch的结果。
|
||||
|
||||
:param DataSet,dict(DataSet) data: 传入DataSet对象,会使用多个Trainer中的metric对数据进行验证。如果需要传入多个
|
||||
:param ~fastNLP.DataSet,dict(~fastNLP.DataSet) data: 传入DataSet对象,会使用多个Trainer中的metric对数据进行验证。如果需要传入多个
|
||||
DataSet请通过dict的方式传入,dict的key将作为对应dataset的name传递给fitlog。若tester不为None时,data需要通过
|
||||
dict的方式传入。如果仅传入DataSet, 则被命名为test
|
||||
:param Tester tester: Tester对象,将在on_valid_end时调用。tester中的DataSet会被称为为`test`
|
||||
:param ~fastNLP.Tester tester: a Tester object, called in on_valid_end. The DataSet in the tester is named `test`
|
||||
:param int log_loss_every: 多少个step记录一次loss(记录的是这几个batch的loss平均值),如果数据集较大建议将该值设置得
|
||||
大一些,不然会导致log文件巨大。默认为0, 即不要记录loss。
|
||||
:param int verbose: 是否在终端打印evaluation的结果,0不打印。
|
||||
@ -674,7 +674,7 @@ class TensorboardCallback(Callback):
|
||||
|
||||
.. warning::
|
||||
fastNLP 已停止对此功能的维护,请等待 fastNLP 兼容 PyTorch1.1 的下一个版本。
|
||||
或者使用和 fastNLP 高度配合的 fitlog(参见 :doc:`/user/with_fitlog` )。
|
||||
或者使用和 fastNLP 高度配合的 fitlog(参见 :doc:`/tutorials/tutorial_10_fitlog` )。
|
||||
|
||||
"""
|
||||
|
||||
|
@ -78,19 +78,7 @@
|
||||
sent, label = line.strip().split('\t')
|
||||
dataset.append(Instance(sentence=sent, label=label))
|
||||
|
||||
2.2 index, 返回结果为对DataSet对象的浅拷贝
|
||||
|
||||
Example::
|
||||
|
||||
import numpy as np
|
||||
from fastNLP import DataSet
|
||||
dataset = DataSet({'a': np.arange(10), 'b': [[_] for _ in range(10)]})
|
||||
d[0] # 使用一个下标获取一个instance
|
||||
>>{'a': 0 type=int,'b': [2] type=list} # 得到一个instance
|
||||
d[1:3] # 使用slice获取一个新的DataSet
|
||||
>>DataSet({'a': 1 type=int, 'b': [2] type=list}, {'a': 2 type=int, 'b': [2] type=list})
|
||||
|
||||
2.3 对DataSet中的内容处理
|
||||
2.2 对DataSet中的内容处理
|
||||
|
||||
Example::
|
||||
|
||||
@ -108,7 +96,7 @@
|
||||
return words
|
||||
dataset.apply(get_words, new_field_name='words')
|
||||
|
||||
2.4 删除DataSet的内容
|
||||
2.3 删除DataSet的内容
|
||||
|
||||
Example::
|
||||
|
||||
@ -124,14 +112,14 @@
|
||||
dataset.delete_field('a')
|
||||
|
||||
|
||||
2.5 遍历DataSet的内容
|
||||
2.4 遍历DataSet的内容
|
||||
|
||||
Example::
|
||||
|
||||
for instance in dataset:
|
||||
# do something
|
||||
|
||||
2.6 Some other operations
|
||||
2.5 Some other operations
|
||||
|
||||
Example::
|
||||
|
||||
|
@ -6,7 +6,7 @@ __all__ = [
|
||||
"MetricBase",
|
||||
"AccuracyMetric",
|
||||
"SpanFPreRecMetric",
|
||||
"SQuADMetric"
|
||||
"ExtractiveQAMetric"
|
||||
]
|
||||
|
||||
import inspect
|
||||
@ -24,6 +24,7 @@ from .utils import seq_len_to_mask
|
||||
from .vocabulary import Vocabulary
|
||||
from abc import abstractmethod
|
||||
|
||||
|
||||
class MetricBase(object):
|
||||
"""
|
||||
Base class of all metrics. Every Metric passed to Trainer or Tester must inherit from this class and override the evaluate() and get_metric() methods.
|
||||
@ -735,11 +736,11 @@ def _pred_topk(y_prob, k=1):
|
||||
return y_pred_topk, y_prob_topk
|
||||
|
||||
|
||||
class SQuADMetric(MetricBase):
|
||||
class ExtractiveQAMetric(MetricBase):
|
||||
r"""
|
||||
Alias: :class:`fastNLP.SQuADMetric` :class:`fastNLP.core.metrics.SQuADMetric`
|
||||
Alias: :class:`fastNLP.ExtractiveQAMetric` :class:`fastNLP.core.metrics.ExtractiveQAMetric`
|
||||
|
||||
Metric for the SQuAD dataset
|
||||
Metric for extractive QA (such as SQuAD).
|
||||
|
||||
:param pred1: mapping for `pred1` in the parameter map; None means the mapping is `pred1` -> `pred1`
|
||||
:param pred2: mapping for `pred2` in the parameter map; None means the mapping is `pred2` -> `pred2`
|
||||
@ -755,7 +756,7 @@ class SQuADMetric(MetricBase):
|
||||
def __init__(self, pred1=None, pred2=None, target1=None, target2=None,
|
||||
beta=1, right_open=True, print_predict_stat=False):
|
||||
|
||||
super(SQuADMetric, self).__init__()
|
||||
super(ExtractiveQAMetric, self).__init__()
|
||||
|
||||
self._init_param_map(pred1=pred1, pred2=pred2, target1=target1, target2=target2)
|
||||
|
||||
|
@ -91,47 +91,84 @@ class Vocabulary(object):
|
||||
self.idx2word = None
|
||||
self.rebuild = True
|
||||
# Holds words that do not need a separate entry; see the from_dataset() method for details
|
||||
self._no_create_word = defaultdict(int)
|
||||
self._no_create_word = Counter()
|
||||
|
||||
@_check_build_status
|
||||
def update(self, word_lst):
|
||||
def update(self, word_lst, no_create_entry=False):
|
||||
"""Increment the vocabulary frequency of each word in the sequence, in order
|
||||
|
||||
:param list word_lst: a list of strings
|
||||
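:param bool no_create_entry: how to handle this word if it is not found in the pretrained vocabulary when a pretrained model is loaded with fastNLP.TokenEmbedding.
|
||||
If True, no separate entry is created for the word and it always points to the unk representation; if False, a separate entry is created
|
||||
for it. Words from dev or test should generally use True, and words from train should generally use False. Two cases deserve attention: if a newly
|
||||
added word has no_create_entry=True but is already in the Vocabulary and is not marked no_create_entry, a separate vector is still
|
||||
created for it; if no_create_entry is False and the word is already in the Vocabulary but is marked no_create_entry,
|
||||
the word is treated as needing its own vector.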
|
||||
"""
|
||||
self._add_no_create_entry(word_lst, no_create_entry)
|
||||
self.word_count.update(word_lst)
|
||||
|
||||
@_check_build_status
|
||||
def add(self, word):
|
||||
def add(self, word, no_create_entry=False):
|
||||
"""
|
||||
Increment the vocabulary frequency of a new word
|
||||
|
||||
:param str word: the new word
|
||||
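:param bool no_create_entry: how to handle this word if it is not found in the pretrained vocabulary when a pretrained model is loaded with fastNLP.TokenEmbedding.
|
||||
If True, no separate entry is created for the word and it always points to the unk representation; if False, a separate entry is created
|
||||
for it. Words from dev or test should generally use True, and words from train should generally use False. Two cases deserve attention: if a newly
|
||||
added word has no_create_entry=True but is already in the Vocabulary and is not marked no_create_entry, a separate vector is still
|
||||
created for it; if no_create_entry is False and the word is already in the Vocabulary but is marked no_create_entry,
|
||||
the word is treated as needing its own vector.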
|
||||
"""
|
||||
self._add_no_create_entry(word, no_create_entry)
|
||||
self.word_count[word] += 1
|
||||
|
||||
|
||||
def _add_no_create_entry(self, word, no_create_entry):
|
||||
"""
|
||||
When a new word is added, check and update the _no_create_word bookkeeping.
|
||||
|
||||
:param str, List[str] word:
|
||||
:param bool no_create_entry:
|
||||
:return:
|
||||
"""
|
||||
if isinstance(word, str):
|
||||
word = [word]
|
||||
for w in word:
|
||||
if no_create_entry and self.word_count.get(w, 0) == self._no_create_word.get(w, 0):
|
||||
self._no_create_word[w] += 1
|
||||
elif not no_create_entry and w in self._no_create_word:
|
||||
self._no_create_word.pop(w)
|
||||
|
||||
@_check_build_status
|
||||
def add_word(self, word):
|
||||
def add_word(self, word, no_create_entry=False):
|
||||
"""
|
||||
Increment the vocabulary frequency of a new word
|
||||
|
||||
:param str word: the new word
|
||||
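:param bool no_create_entry: how to handle this word if it is not found in the pretrained vocabulary when a pretrained model is loaded with fastNLP.TokenEmbedding.
|
||||
If True, no separate entry is created for the word and it always points to the unk representation; if False, a separate entry is created
|
||||
for it. Words from dev or test should generally use True, and words from train should generally use False. Two cases deserve attention: if a newly
|
||||
added word has no_create_entry=True but is already in the Vocabulary and is not marked no_create_entry, a separate vector is still
|
||||
created for it; if no_create_entry is False and the word is already in the Vocabulary but is marked no_create_entry,
|
||||
the word is treated as needing its own vector.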
|
||||
"""
|
||||
if word in self._no_create_word:
|
||||
self._no_create_word.pop(word)
|
||||
self.add(word)
|
||||
self.add(word, no_create_entry=no_create_entry)
|
||||
|
||||
@_check_build_status
|
||||
def add_word_lst(self, word_lst):
|
||||
def add_word_lst(self, word_lst, no_create_entry=False):
|
||||
"""
|
||||
Increment the vocabulary frequency of each word in the sequence, in order
|
||||
|
||||
:param list[str] word_lst: the sequence of words
|
||||
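:param bool no_create_entry: how to handle this word if it is not found in the pretrained vocabulary when a pretrained model is loaded with fastNLP.TokenEmbedding.
|
||||
If True, no separate entry is created for the word and it always points to the unk representation; if False, a separate entry is created
|
||||
for it. Words from dev or test should generally use True, and words from train should generally use False. Two cases deserve attention: if a newly
|
||||
added word has no_create_entry=True but is already in the Vocabulary and is not marked no_create_entry, a separate vector is still
|
||||
created for it; if no_create_entry is False and the word is already in the Vocabulary but is marked no_create_entry,
|
||||
the word is treated as needing its own vector.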
|
||||
"""
|
||||
for word in word_lst:
|
||||
if word in self._no_create_word:
|
||||
self._no_create_word.pop(word)
|
||||
self.update(word_lst)
|
||||
self.update(word_lst, no_create_entry=no_create_entry)
|
||||
|
||||
def build_vocab(self):
|
||||
"""
|
||||
@ -283,23 +320,17 @@ class Vocabulary(object):
|
||||
for fn in field_name:
|
||||
field = ins[fn]
|
||||
if isinstance(field, str):
|
||||
if no_create_entry and field not in self.word_count:
|
||||
self._no_create_word[field] += 1
|
||||
self.add_word(field)
|
||||
self.add_word(field, no_create_entry=no_create_entry)
|
||||
elif isinstance(field, (list, np.ndarray)):
|
||||
if not isinstance(field[0], (list, np.ndarray)):
|
||||
for word in field:
|
||||
if no_create_entry and word not in self.word_count:
|
||||
self._no_create_word[word] += 1
|
||||
self.add_word(word)
|
||||
self.add_word(word, no_create_entry=no_create_entry)
|
||||
else:
|
||||
if isinstance(field[0][0], (list, np.ndarray)):
|
||||
raise RuntimeError("Only support field with 2 dimensions.")
|
||||
for words in field:
|
||||
for word in words:
|
||||
if no_create_entry and word not in self.word_count:
|
||||
self._no_create_word[word] += 1
|
||||
self.add_word(word)
|
||||
self.add_word(word, no_create_entry=no_create_entry)
|
||||
|
||||
for idx, dataset in enumerate(datasets):
|
||||
if isinstance(dataset, DataSet):
|
||||
|
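Note on the `no_create_entry` bookkeeping above: the following is a minimal, standalone sketch of the counting rule in `_add_no_create_entry`, assuming only what the diff shows. `NoCreateTracker` and `needs_own_entry` are hypothetical names used for illustration; they are not part of fastNLP.

```python
from collections import Counter


class NoCreateTracker:
    """Toy stand-in for the counting done in Vocabulary._add_no_create_entry."""

    def __init__(self):
        self.word_count = Counter()      # total occurrences of every word
        self.no_create_word = Counter()  # words so far only added with no_create_entry=True

    def add(self, word, no_create_entry=False):
        seen = self.word_count.get(word, 0)
        if no_create_entry and seen == self.no_create_word.get(word, 0):
            # Every earlier occurrence was also no_create_entry: keep the mark.
            self.no_create_word[word] += 1
        elif not no_create_entry and word in self.no_create_word:
            # A "normal" occurrence clears the mark for good.
            self.no_create_word.pop(word)
        self.word_count[word] += 1

    def needs_own_entry(self, word):
        # Hypothetical helper: True if the word should get its own vector.
        return word in self.word_count and word not in self.no_create_word


tracker = NoCreateTracker()
tracker.add("apple")                        # seen in train
tracker.add("apple", no_create_entry=True)  # seen again in dev
tracker.add("kiwi", no_create_entry=True)   # only ever seen in dev
tracker.add("plum", no_create_entry=True)   # first seen in dev ...
tracker.add("plum")                         # ... then also in train
```

In this sketch only `kiwi` stays marked, which matches the docstring: a word keeps pointing at the unk representation only if every one of its occurrences was added with `no_create_entry=True`.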
@ -12,22 +12,22 @@
|
||||
__all__ = [
|
||||
'EmbedLoader',
|
||||
|
||||
'DataInfo',
|
||||
'DataBundle',
|
||||
'DataSetLoader',
|
||||
|
||||
'CSVLoader',
|
||||
'JsonLoader',
|
||||
'ConllLoader',
|
||||
'PeopleDailyCorpusLoader',
|
||||
'Conll2003Loader',
|
||||
|
||||
'ModelLoader',
|
||||
'ModelSaver',
|
||||
|
||||
'SSTLoader',
|
||||
|
||||
'ConllLoader',
|
||||
'Conll2003Loader',
|
||||
'MatchingLoader',
|
||||
'PeopleDailyCorpusLoader',
|
||||
'SNLILoader',
|
||||
'SSTLoader',
|
||||
'SST2Loader',
|
||||
'MNLILoader',
|
||||
'QNLILoader',
|
||||
'QuoraLoader',
|
||||
@ -35,11 +35,8 @@ __all__ = [
|
||||
]
|
||||
|
||||
from .embed_loader import EmbedLoader
|
||||
from .base_loader import DataInfo, DataSetLoader
|
||||
from .dataset_loader import CSVLoader, JsonLoader, ConllLoader, \
|
||||
PeopleDailyCorpusLoader, Conll2003Loader
|
||||
from .base_loader import DataBundle, DataSetLoader
|
||||
from .dataset_loader import CSVLoader, JsonLoader
|
||||
from .model_io import ModelLoader, ModelSaver
|
||||
|
||||
from .data_loader.sst import SSTLoader
|
||||
from .data_loader.matching import MatchingLoader, SNLILoader, \
|
||||
MNLILoader, QNLILoader, QuoraLoader, RTELoader
|
||||
from .data_loader import *
|
||||
|
@ -1,6 +1,6 @@
|
||||
__all__ = [
|
||||
"BaseLoader",
|
||||
'DataInfo',
|
||||
'DataBundle',
|
||||
'DataSetLoader',
|
||||
]
|
||||
|
||||
@ -109,7 +109,7 @@ def _uncompress(src, dst):
|
||||
raise ValueError('unsupported file {}'.format(src))
|
||||
|
||||
|
||||
class DataInfo:
|
||||
class DataBundle:
|
||||
"""
|
||||
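Processed data, including a set of datasets (for example, separate training, validation and test sets) together with the vocabularies and word embeddings they use.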
|
||||
|
||||
@ -201,20 +201,20 @@ class DataSetLoader:
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def process(self, paths: Union[str, Dict[str, str]], **options) -> DataInfo:
|
||||
def process(self, paths: Union[str, Dict[str, str]], **options) -> DataBundle:
|
||||
"""
|
||||
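For a specific task and dataset, read and process the data, and return a processed DataInfo object or a dict.
|
||||
|
||||
Read data from the files under one or more given paths; a DataInfo object may contain one or more datasets.
|
||||
When several paths are processed, the keys of the dict passed in match the keys of the dict in the returned DataInfo.
|
||||
|
||||
The returned :class:`DataInfo` object has the following attributes:
|
||||
The returned :class:`DataBundle` object has the following attributes:
|
||||
|
||||
- vocabs: a dict of the vocabularies obtained from the datasets, each vocabulary
|
||||
- datasets: a dict containing a set of :class:`~fastNLP.DataSet` objects. Field names follow :mod:`~fastNLP.core.const`
|
||||
|
||||
:param paths: the path(s) from which the raw data is read
|
||||
:param options: parameters designed for the specific task and dataset
|
||||
:return: a DataInfo
|
||||
:return: a DataBundle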
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
@ -4,16 +4,32 @@
|
||||
These modules are used as follows:
|
||||
"""
|
||||
__all__ = [
|
||||
'SSTLoader',
|
||||
|
||||
'ConllLoader',
|
||||
'Conll2003Loader',
|
||||
'IMDBLoader',
|
||||
'MatchingLoader',
|
||||
'SNLILoader',
|
||||
'MNLILoader',
|
||||
'MTL16Loader',
|
||||
'PeopleDailyCorpusLoader',
|
||||
'QNLILoader',
|
||||
'QuoraLoader',
|
||||
'RTELoader',
|
||||
'SSTLoader',
|
||||
'SST2Loader',
|
||||
'SNLILoader',
|
||||
'YelpLoader',
|
||||
]
|
||||
|
||||
from .sst import SSTLoader
|
||||
from .matching import MatchingLoader, SNLILoader, \
|
||||
MNLILoader, QNLILoader, QuoraLoader, RTELoader
|
||||
|
||||
from .conll import ConllLoader, Conll2003Loader
|
||||
from .imdb import IMDBLoader
|
||||
from .matching import MatchingLoader
|
||||
from .mnli import MNLILoader
|
||||
from .mtl import MTL16Loader
|
||||
from .people_daily import PeopleDailyCorpusLoader
|
||||
from .qnli import QNLILoader
|
||||
from .quora import QuoraLoader
|
||||
from .rte import RTELoader
|
||||
from .snli import SNLILoader
|
||||
from .sst import SSTLoader, SST2Loader
|
||||
from .yelp import YelpLoader
|
||||
|
73
fastNLP/io/data_loader/conll.py
Normal file
@ -0,0 +1,73 @@
|
||||
|
||||
from ...core.dataset import DataSet
|
||||
from ...core.instance import Instance
|
||||
from ..base_loader import DataSetLoader
|
||||
from ..file_reader import _read_conll
|
||||
|
||||
|
||||
class ConllLoader(DataSetLoader):
|
||||
"""
|
||||
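Alias: :class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.data_loader.ConllLoader`
|
||||
|
||||
Reads data in CoNLL format. For the data format, see http://conll.cemantix.org/2012/data.html. Lines starting with "-DOCSTART-" are ignored, because
|
||||
this marker is used as the document separator in CoNLL 2003.
|
||||
|
||||
Column numbers start at 0; the columns contain::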
|
||||
|
||||
Column Type
|
||||
0 Document ID
|
||||
1 Part number
|
||||
2 Word number
|
||||
3 Word itself
|
||||
4 Part-of-Speech
|
||||
5 Parse bit
|
||||
6 Predicate lemma
|
||||
7 Predicate Frameset ID
|
||||
8 Word sense
|
||||
9 Speaker/Author
|
||||
10 Named Entities
|
||||
11:N Predicate Arguments
|
||||
N Coreference
|
||||
|
||||
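:param headers: the name of each data column; must be a list or tuple of str. ``headers`` and ``indexes`` correspond one to one
|
||||
:param indexes: indices of the data columns to keep, starting at 0. If ``None``, all columns are kept. Default: ``None``
|
||||
:param dropna: whether to skip invalid data; if ``False``, a ``ValueError`` is raised when invalid data is encountered. Default: ``False``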
|
||||
"""
|
||||
|
||||
def __init__(self, headers, indexes=None, dropna=False):
|
||||
super(ConllLoader, self).__init__()
|
||||
if not isinstance(headers, (list, tuple)):
|
||||
raise TypeError(
|
||||
'invalid headers: {}, should be list of strings'.format(headers))
|
||||
self.headers = headers
|
||||
self.dropna = dropna
|
||||
if indexes is None:
|
||||
self.indexes = list(range(len(self.headers)))
|
||||
else:
|
||||
if len(indexes) != len(headers):
|
||||
raise ValueError
|
||||
self.indexes = indexes
|
||||
|
||||
def _load(self, path):
|
||||
ds = DataSet()
|
||||
for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna):
|
||||
ins = {h: data[i] for i, h in enumerate(self.headers)}
|
||||
ds.append(Instance(**ins))
|
||||
return ds
|
||||
|
||||
|
||||
class Conll2003Loader(ConllLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.dataset_loader.Conll2003Loader`
|
||||
|
||||
Reads the CoNLL-2003 data
|
||||
|
||||
For more information about the dataset, see:
|
||||
https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
headers = [
|
||||
'tokens', 'pos', 'chunks', 'ner',
|
||||
]
|
||||
super(Conll2003Loader, self).__init__(headers=headers)
|
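Note on `headers` and `indexes`: the sketch below illustrates how selected columns could be paired with header names, sentence by sentence. It is a standalone approximation; `read_conll_columns` is a hypothetical helper, and the behaviour of fastNLP's private `_read_conll` is assumed rather than reproduced.

```python
def read_conll_columns(lines, headers, indexes=None):
    """Group CoNLL-style lines into sentences and keep only the chosen columns."""
    if indexes is None:
        indexes = list(range(len(headers)))
    if len(indexes) != len(headers):
        raise ValueError("headers and indexes must have the same length")

    sentences, rows = [], []

    def flush():
        # Turn the buffered token rows into one {header: column} record.
        if rows:
            columns = list(zip(*rows))
            sentences.append({h: list(columns[i]) for h, i in zip(headers, indexes)})
            rows.clear()

    for line in lines:
        line = line.strip()
        if not line:
            flush()                       # a blank line ends a sentence
        elif line.startswith("-DOCSTART-"):
            continue                      # document separator, ignored
        else:
            rows.append(line.split())
    flush()
    return sentences


sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "Peter NNP B-NP B-PER",
]
parsed = read_conll_columns(sample, headers=["tokens", "ner"], indexes=[0, 3])
```

Here `indexes=[0, 3]` keeps the word and named-entity columns, so each sentence becomes a dict with a `tokens` list and a `ner` list of equal length.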
96
fastNLP/io/data_loader/imdb.py
Normal file
@ -0,0 +1,96 @@
|
||||
|
||||
from typing import Union, Dict
|
||||
|
||||
from ..embed_loader import EmbeddingOption, EmbedLoader
|
||||
from ..base_loader import DataSetLoader, DataBundle
|
||||
from ...core.vocabulary import VocabularyOption, Vocabulary
|
||||
from ...core.dataset import DataSet
|
||||
from ...core.instance import Instance
|
||||
from ...core.const import Const
|
||||
|
||||
from ..utils import get_tokenizer
|
||||
|
||||
|
||||
class IMDBLoader(DataSetLoader):
|
||||
"""
|
||||
Reads the IMDB dataset. The DataSet contains the following fields:
|
||||
|
||||
words: list(str), the text to classify
|
||||
target: str, the label of the text
|
||||
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
super(IMDBLoader, self).__init__()
|
||||
self.tokenizer = get_tokenizer()
|
||||
|
||||
def _load(self, path):
|
||||
dataset = DataSet()
|
||||
with open(path, 'r', encoding="utf-8") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
parts = line.split('\t')
|
||||
target = parts[0]
|
||||
words = self.tokenizer(parts[1].lower())
|
||||
dataset.append(Instance(words=words, target=target))
|
||||
|
||||
if len(dataset) == 0:
|
||||
raise RuntimeError(f"{path} has no valid data.")
|
||||
|
||||
return dataset
|
||||
|
||||
def process(self,
|
||||
paths: Union[str, Dict[str, str]],
|
||||
src_vocab_opt: VocabularyOption = None,
|
||||
tgt_vocab_opt: VocabularyOption = None,
|
||||
char_level_op=False):
|
||||
|
||||
datasets = {}
|
||||
info = DataBundle()
|
||||
for name, path in paths.items():
|
||||
dataset = self.load(path)
|
||||
datasets[name] = dataset
|
||||
|
||||
def wordtochar(words):
|
||||
chars = []
|
||||
for word in words:
|
||||
word = word.lower()
|
||||
for char in word:
|
||||
chars.append(char)
|
||||
chars.append('')
|
||||
chars.pop()
|
||||
return chars
|
||||
|
||||
if char_level_op:
|
||||
for dataset in datasets.values():
|
||||
dataset.apply_field(wordtochar, field_name="words", new_field_name='chars')
|
||||
|
||||
datasets["train"], datasets["dev"] = datasets["train"].split(0.1, shuffle=False)
|
||||
|
||||
src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
|
||||
src_vocab.from_dataset(datasets['train'], field_name='words')
|
||||
|
||||
src_vocab.index_dataset(*datasets.values(), field_name='words')
|
||||
|
||||
tgt_vocab = Vocabulary(unknown=None, padding=None) \
|
||||
if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
|
||||
tgt_vocab.from_dataset(datasets['train'], field_name='target')
|
||||
tgt_vocab.index_dataset(*datasets.values(), field_name='target')
|
||||
|
||||
info.vocabs = {
|
||||
Const.INPUT: src_vocab,
|
||||
Const.TARGET: tgt_vocab
|
||||
}
|
||||
|
||||
info.datasets = datasets
|
||||
|
||||
for name, dataset in info.datasets.items():
|
||||
dataset.set_input(Const.INPUT)
|
||||
dataset.set_target(Const.TARGET)
|
||||
|
||||
return info
|
||||
|
||||
|
||||
|
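Note on `char_level_op`: the nested `wordtochar` helper flattens a token list into characters, with an empty-string item between words. A minimal standalone sketch follows; `words_to_chars` and its `separator` parameter are hypothetical, and the guard against an empty input is an addition that the code in the diff does not have.

```python
def words_to_chars(words, separator=""):
    """Flatten words into lowercase characters, with a separator item between words."""
    chars = []
    for word in words:
        chars.extend(word.lower())  # one list item per character
        chars.append(separator)     # marks the word boundary
    if chars:
        chars.pop()                 # drop the trailing separator
    return chars


chars = words_to_chars(["Good", "film"])
```

With the default separator, the word boundary shows up as an empty-string element in the character list, so downstream code can still tell where one word ends and the next begins.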
@ -1,18 +1,17 @@
|
||||
import os
|
||||
|
||||
from typing import Union, Dict
|
||||
from typing import Union, Dict, List
|
||||
|
||||
from ...core.const import Const
|
||||
from ...core.vocabulary import Vocabulary
|
||||
from ..base_loader import DataInfo, DataSetLoader
|
||||
from ..dataset_loader import JsonLoader, CSVLoader
|
||||
from ..base_loader import DataBundle, DataSetLoader
|
||||
from ..file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
|
||||
from ...modules.encoder._bert import BertTokenizer
|
||||
|
||||
|
||||
class MatchingLoader(DataSetLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.MatchingLoader` :class:`fastNLP.io.dataset_loader.MatchingLoader`
|
||||
Alias: :class:`fastNLP.io.MatchingLoader` :class:`fastNLP.io.data_loader.MatchingLoader`
|
||||
|
||||
Reads datasets for Matching tasks
|
||||
|
||||
@ -34,7 +33,8 @@ class MatchingLoader(DataSetLoader):
|
||||
to_lower=False, seq_len_type: str=None, bert_tokenizer: str=None,
|
||||
cut_text: int = None, get_index=True, auto_pad_length: int=None,
|
||||
auto_pad_token: str='<pad>', set_input: Union[list, str, bool]=True,
|
||||
set_target: Union[list, str, bool] = True, concat: Union[str, list, bool]=None, ) -> DataInfo:
|
||||
set_target: Union[list, str, bool]=True, concat: Union[str, list, bool]=None,
|
||||
extra_split: List[str]=None, ) -> DataBundle:
|
||||
"""
|
||||
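:param paths: str or Dict[str, str]. If a str, it is either the directory containing the dataset or a full file path. If it is a directory,
|
||||
the dataset names and file names are looked up in self.paths. If a Dict, it consists of dataset names (such as train, dev, test) and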
|
||||
@ -57,6 +57,7 @@ class MatchingLoader(DataSetLoader):
|
||||
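:param concat: whether to concatenate the two sentences. If False, they are not concatenated. If True, a <sep> is inserted between the two sentences.
|
||||
If a list of length 4 is passed, its items are the markers inserted before the first sentence, after the first sentence, before the second sentence and after the second sentence, respectively. If
|
||||
the string ``bert`` is passed, BERT-style concatenation is used, equivalent to ['[CLS]', '[SEP]', '', '[SEP]'].
|
||||
:param extra_split: extra separators, i.e. characters other than whitespace that are used to split tokens.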
|
||||
:return:
|
||||
"""
|
||||
if isinstance(set_input, str):
|
||||
@ -79,7 +80,7 @@ class MatchingLoader(DataSetLoader):
|
||||
else:
|
||||
path = paths
|
||||
|
||||
data_info = DataInfo()
|
||||
data_info = DataBundle()
|
||||
for data_name in path.keys():
|
||||
data_info.datasets[data_name] = self._load(path[data_name])
|
||||
|
||||
@ -90,6 +91,24 @@ class MatchingLoader(DataSetLoader):
|
||||
if Const.TARGET in data_set.get_field_names():
|
||||
data_set.set_target(Const.TARGET)
|
||||
|
||||
if extra_split is not None:
|
||||
for data_name, data_set in data_info.datasets.items():
|
||||
data_set.apply(lambda x: ' '.join(x[Const.INPUTS(0)]), new_field_name=Const.INPUTS(0))
|
||||
data_set.apply(lambda x: ' '.join(x[Const.INPUTS(1)]), new_field_name=Const.INPUTS(1))
|
||||
|
||||
for s in extra_split:
|
||||
data_set.apply(lambda x: x[Const.INPUTS(0)].replace(s, ' ' + s + ' '),
|
||||
new_field_name=Const.INPUTS(0))
|
||||
data_set.apply(lambda x: x[Const.INPUTS(1)].replace(s, ' ' + s + ' '),
|
||||
new_field_name=Const.INPUTS(1))
|
||||
|
||||
_filt = lambda x: x
|
||||
data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(0)].split(' '))),
|
||||
new_field_name=Const.INPUTS(0), is_input=auto_set_input)
|
||||
data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(1)].split(' '))),
|
||||
new_field_name=Const.INPUTS(1), is_input=auto_set_input)
|
||||
_filt = None
|
||||
|
||||
if to_lower:
|
||||
for data_name, data_set in data_info.datasets.items():
|
||||
data_set.apply(lambda x: [w.lower() for w in x[Const.INPUTS(0)]], new_field_name=Const.INPUTS(0),
|
||||
@ -227,204 +246,3 @@ class MatchingLoader(DataSetLoader):
|
||||
data_set.set_target(*[target for target in set_target if target in data_set.get_field_names()])
|
||||
|
||||
return data_info
|
||||
|
||||
|
||||
class SNLILoader(MatchingLoader, JsonLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.dataset_loader.SNLILoader`
|
||||
|
||||
Reads the SNLI dataset. The resulting DataSet contains the fields::
|
||||
|
||||
words1: list(str), the first sentence, premise
|
||||
words2: list(str), the second sentence, hypothesis
|
||||
target: str, the gold label
|
||||
|
||||
Data source: https://nlp.stanford.edu/projects/snli/snli_1.0.zip
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
fields = {
|
||||
'sentence1_binary_parse': Const.INPUTS(0),
|
||||
'sentence2_binary_parse': Const.INPUTS(1),
|
||||
'gold_label': Const.TARGET,
|
||||
}
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'snli_1.0_train.jsonl',
|
||||
'dev': 'snli_1.0_dev.jsonl',
|
||||
'test': 'snli_1.0_test.jsonl'}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
JsonLoader.__init__(self, fields=fields)
|
||||
|
||||
def _load(self, path):
|
||||
ds = JsonLoader._load(self, path)
|
||||
|
||||
parentheses_table = str.maketrans({'(': None, ')': None})
|
||||
|
||||
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
|
||||
new_field_name=Const.INPUTS(0))
|
||||
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
|
||||
new_field_name=Const.INPUTS(1))
|
||||
ds.drop(lambda x: x[Const.TARGET] == '-')
|
||||
return ds
|
||||
|
||||
|
||||
class RTELoader(MatchingLoader, CSVLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.RTELoader` :class:`fastNLP.io.dataset_loader.RTELoader`
|
||||
|
||||
Reads the RTE dataset. The resulting DataSet contains the fields::
|
||||
|
||||
words1: list(str), the first sentence, premise
|
||||
words2: list(str), the second sentence, hypothesis
|
||||
target: str, the gold label
|
||||
|
||||
Data source:
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'train.tsv',
|
||||
'dev': 'dev.tsv',
|
||||
'test': 'test.tsv' # the test set has no labels
|
||||
}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
self.fields = {
|
||||
'sentence1': Const.INPUTS(0),
|
||||
'sentence2': Const.INPUTS(1),
|
||||
'label': Const.TARGET,
|
||||
}
|
||||
CSVLoader.__init__(self, sep='\t')
|
||||
|
||||
def _load(self, path):
|
||||
ds = CSVLoader._load(self, path)
|
||||
|
||||
for k, v in self.fields.items():
|
||||
if v in ds.get_field_names():
|
||||
ds.rename_field(k, v)
|
||||
for fields in ds.get_all_fields():
|
||||
if Const.INPUT in fields:
|
||||
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
|
||||
|
||||
return ds
|
||||
|
||||
|
||||
class QNLILoader(MatchingLoader, CSVLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.QNLILoader` :class:`fastNLP.io.dataset_loader.QNLILoader`
|
||||
|
||||
Reads the QNLI dataset. The resulting DataSet contains the fields::
|
||||
|
||||
words1: list(str), the first sentence, premise
|
||||
words2: list(str), the second sentence, hypothesis
|
||||
target: str, the gold label
|
||||
|
||||
Data source:
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'train.tsv',
|
||||
'dev': 'dev.tsv',
|
||||
'test': 'test.tsv' # the test set has no labels
|
||||
}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
self.fields = {
|
||||
'question': Const.INPUTS(0),
|
||||
'sentence': Const.INPUTS(1),
|
||||
'label': Const.TARGET,
|
||||
}
|
||||
CSVLoader.__init__(self, sep='\t')
|
||||
|
||||
def _load(self, path):
|
||||
ds = CSVLoader._load(self, path)
|
||||
|
||||
for k, v in self.fields.items():
|
||||
if v in ds.get_field_names():
|
||||
ds.rename_field(k, v)
|
||||
for fields in ds.get_all_fields():
|
||||
if Const.INPUT in fields:
|
||||
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
|
||||
|
||||
return ds
|
||||
|
||||
|
||||
class MNLILoader(MatchingLoader, CSVLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.MNLILoader` :class:`fastNLP.io.dataset_loader.MNLILoader`
|
||||
|
||||
Reads the MNLI dataset. The resulting DataSet contains the fields::
|
||||
|
||||
words1: list(str), the first sentence, premise
|
||||
words2: list(str), the second sentence, hypothesis
|
||||
target: str, the gold label
|
||||
|
||||
Data source:
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'train.tsv',
|
||||
'dev_matched': 'dev_matched.tsv',
|
||||
'dev_mismatched': 'dev_mismatched.tsv',
|
||||
'test_matched': 'test_matched.tsv',
|
||||
'test_mismatched': 'test_mismatched.tsv',
|
||||
# 'test_0.9_matched': 'multinli_0.9_test_matched_unlabeled.txt',
|
||||
# 'test_0.9_mismatched': 'multinli_0.9_test_mismatched_unlabeled.txt',
|
||||
|
||||
# test_0.9_matched and test_0.9_mismatched come from MNLI version 0.9 (data source: Kaggle)
|
||||
}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
CSVLoader.__init__(self, sep='\t')
|
||||
self.fields = {
|
||||
'sentence1_binary_parse': Const.INPUTS(0),
|
||||
'sentence2_binary_parse': Const.INPUTS(1),
|
||||
'gold_label': Const.TARGET,
|
||||
}
|
||||
|
||||
def _load(self, path):
|
||||
ds = CSVLoader._load(self, path)
|
||||
|
||||
for k, v in self.fields.items():
|
||||
if k in ds.get_field_names():
|
||||
ds.rename_field(k, v)
|
||||
|
||||
if Const.TARGET in ds.get_field_names():
|
||||
if ds[0][Const.TARGET] == 'hidden':
|
||||
ds.delete_field(Const.TARGET)
|
||||
|
||||
parentheses_table = str.maketrans({'(': None, ')': None})
|
||||
|
||||
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
|
||||
new_field_name=Const.INPUTS(0))
|
||||
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
|
||||
new_field_name=Const.INPUTS(1))
|
||||
if Const.TARGET in ds.get_field_names():
|
||||
ds.drop(lambda x: x[Const.TARGET] == '-')
|
||||
return ds
|
||||
|
||||
|
||||
class QuoraLoader(MatchingLoader, CSVLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.QuoraLoader` :class:`fastNLP.io.dataset_loader.QuoraLoader`
|
||||
|
||||
Reads the Quora dataset. The resulting DataSet contains the fields::
|
||||
|
||||
words1: list(str), the first sentence, premise
|
||||
words2: list(str), the second sentence, hypothesis
|
||||
target: str, the gold label
|
||||
|
||||
Data source:
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'train.tsv',
|
||||
'dev': 'dev.tsv',
|
||||
'test': 'test.tsv',
|
||||
}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
CSVLoader.__init__(self, sep='\t', headers=(Const.TARGET, Const.INPUTS(0), Const.INPUTS(1), 'pairID'))
|
||||
|
||||
def _load(self, path):
|
||||
ds = CSVLoader._load(self, path)
|
||||
return ds
|
||||
|
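Note on `extra_split`: the new branch in `MatchingLoader.process` joins the tokens, pads every extra separator with spaces, then splits on spaces and drops empty strings. A minimal standalone sketch of that idea follows; `split_with_extra` is a hypothetical name, and the sketch works on a plain token list rather than on a fastNLP `DataSet`.

```python
def split_with_extra(tokens, extra_split):
    """Re-tokenize so that each extra separator becomes a token of its own."""
    text = " ".join(tokens)
    for sep in extra_split:
        # Surround the separator with spaces so that split() isolates it.
        text = text.replace(sep, " " + sep + " ")
    # Splitting on a single space can yield empty strings; filter them out.
    return [tok for tok in text.split(" ") if tok]


tokens = split_with_extra(["state-of-the-art", "model/baseline"], ["-", "/"])
```

The separators are kept as tokens rather than discarded, which is what the `replace(s, ' ' + s + ' ')` calls in the diff do as well.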
60
fastNLP/io/data_loader/mnli.py
Normal file
@ -0,0 +1,60 @@
|
||||
|
||||
from ...core.const import Const
|
||||
|
||||
from .matching import MatchingLoader
|
||||
from ..dataset_loader import CSVLoader
|
||||
|
||||
|
||||
class MNLILoader(MatchingLoader, CSVLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.MNLILoader` :class:`fastNLP.io.data_loader.MNLILoader`
|
||||
|
||||
Reads the MNLI dataset. The resulting DataSet contains the fields::
|
||||
|
||||
words1: list(str), the first sentence, premise
|
||||
words2: list(str), the second sentence, hypothesis
|
||||
target: str, the gold label
|
||||
|
||||
Data source:
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'train.tsv',
|
||||
'dev_matched': 'dev_matched.tsv',
|
||||
'dev_mismatched': 'dev_mismatched.tsv',
|
||||
'test_matched': 'test_matched.tsv',
|
||||
'test_mismatched': 'test_mismatched.tsv',
|
||||
# 'test_0.9_matched': 'multinli_0.9_test_matched_unlabeled.txt',
|
||||
# 'test_0.9_mismatched': 'multinli_0.9_test_mismatched_unlabeled.txt',
|
||||
|
||||
# test_0.9_matched and test_0.9_mismatched come from MNLI version 0.9 (data source: Kaggle)
|
||||
}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
CSVLoader.__init__(self, sep='\t')
|
||||
self.fields = {
|
||||
'sentence1_binary_parse': Const.INPUTS(0),
|
||||
'sentence2_binary_parse': Const.INPUTS(1),
|
||||
'gold_label': Const.TARGET,
|
||||
}
|
||||
|
||||
def _load(self, path):
|
||||
ds = CSVLoader._load(self, path)
|
||||
|
||||
for k, v in self.fields.items():
|
||||
if k in ds.get_field_names():
|
||||
ds.rename_field(k, v)
|
||||
|
||||
if Const.TARGET in ds.get_field_names():
|
||||
if ds[0][Const.TARGET] == 'hidden':
|
||||
ds.delete_field(Const.TARGET)
|
||||
|
||||
parentheses_table = str.maketrans({'(': None, ')': None})
|
||||
|
||||
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
|
||||
new_field_name=Const.INPUTS(0))
|
||||
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
|
||||
new_field_name=Const.INPUTS(1))
|
||||
if Const.TARGET in ds.get_field_names():
|
||||
ds.drop(lambda x: x[Const.TARGET] == '-')
|
||||
return ds
|
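Note on the `sentence*_binary_parse` fields: `MNLILoader._load` recovers plain tokens from a binary parse string by deleting the parentheses with a translation table and splitting on whitespace. A minimal standalone sketch, where `binary_parse_to_tokens` is a hypothetical name and the sample string is made up for illustration:

```python
# Translation table that deletes both kinds of parentheses.
PARENTHESES_TABLE = str.maketrans({"(": None, ")": None})


def binary_parse_to_tokens(parse):
    """Strip the bracketing of a binary parse and return the bare tokens."""
    return parse.translate(PARENTHESES_TABLE).strip().split()


tokens = binary_parse_to_tokens("( ( A man ) ( is ( walking . ) ) )")
```

Because `split()` with no argument collapses runs of whitespace, the gaps left behind by the deleted parentheses do not produce empty tokens.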
65
fastNLP/io/data_loader/mtl.py
Normal file
@ -0,0 +1,65 @@
|
||||
|
||||
from typing import Union, Dict
|
||||
|
||||
from ..base_loader import DataBundle
|
||||
from ..dataset_loader import CSVLoader
|
||||
from ...core.vocabulary import Vocabulary, VocabularyOption
|
||||
from ...core.const import Const
|
||||
from ..utils import check_dataloader_paths
|
||||
|
||||
|
||||
class MTL16Loader(CSVLoader):
|
||||
"""
|
||||
Reads the MTL16 dataset. The DataSet contains the following fields:
|
||||
|
||||
words: list(str), the text to classify
|
||||
target: str, the label of the text
|
||||
|
||||
Data source: https://pan.baidu.com/s/1c2L6vdA
|
||||
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
super(MTL16Loader, self).__init__(headers=(Const.TARGET, Const.INPUT), sep='\t')
|
||||
|
||||
def _load(self, path):
|
||||
dataset = super(MTL16Loader, self)._load(path)
|
||||
dataset.apply(lambda x: x[Const.INPUT].lower().split(), new_field_name=Const.INPUT)
|
||||
if len(dataset) == 0:
|
||||
raise RuntimeError(f"{path} has no valid data.")
|
||||
|
||||
return dataset
|
||||
|
||||
def process(self,
|
||||
paths: Union[str, Dict[str, str]],
|
||||
src_vocab_opt: VocabularyOption = None,
|
||||
tgt_vocab_opt: VocabularyOption = None,):
|
||||
|
||||
paths = check_dataloader_paths(paths)
|
||||
datasets = {}
|
||||
info = DataBundle()
|
||||
for name, path in paths.items():
|
||||
dataset = self.load(path)
|
||||
datasets[name] = dataset
|
||||
|
||||
src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
|
||||
src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT)
|
||||
src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT)
|
||||
|
||||
tgt_vocab = Vocabulary(unknown=None, padding=None) \
|
||||
if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
|
||||
tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET)
|
||||
tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET)
|
||||
|
||||
info.vocabs = {
|
||||
Const.INPUT: src_vocab,
|
||||
Const.TARGET: tgt_vocab
|
||||
}
|
||||
|
||||
info.datasets = datasets
|
||||
|
||||
for name, dataset in info.datasets.items():
|
||||
dataset.set_input(Const.INPUT)
|
||||
dataset.set_target(Const.TARGET)
|
||||
|
||||
return info
|
85
fastNLP/io/data_loader/people_daily.py
Normal file
85
fastNLP/io/data_loader/people_daily.py
Normal file
@ -0,0 +1,85 @@
|
||||
|
||||
from ..base_loader import DataSetLoader
|
||||
from ...core.dataset import DataSet
|
||||
from ...core.instance import Instance
|
||||
from ...core.const import Const
|
||||
|
||||
|
||||
class PeopleDailyCorpusLoader(DataSetLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.dataset_loader.PeopleDailyCorpusLoader`
|
||||
|
||||
Reads the People's Daily corpus
|
||||
"""
|
||||
|
||||
def __init__(self, pos=True, ner=True):
|
||||
super(PeopleDailyCorpusLoader, self).__init__()
|
||||
self.pos = pos
|
||||
self.ner = ner
|
||||
|
||||
def _load(self, data_path):
|
||||
with open(data_path, "r", encoding="utf-8") as f:
|
||||
sents = f.readlines()
|
||||
examples = []
|
||||
for sent in sents:
|
||||
if len(sent) <= 2:
|
||||
continue
|
||||
inside_ne = False
|
||||
sent_pos_tag = []
|
||||
sent_words = []
|
||||
sent_ner = []
|
||||
words = sent.strip().split()[1:]
|
||||
for word in words:
|
||||
if "[" in word and "]" in word:
|
||||
ner_tag = "U"
|
||||
print(word)
|
||||
elif "[" in word:
|
||||
inside_ne = True
|
||||
ner_tag = "B"
|
||||
word = word[1:]
|
||||
elif "]" in word:
|
||||
ner_tag = "L"
|
||||
word = word[:word.index("]")]
|
||||
if inside_ne is True:
|
||||
inside_ne = False
|
||||
else:
|
||||
raise RuntimeError("only ] appears!")
|
||||
else:
|
||||
if inside_ne is True:
|
||||
ner_tag = "I"
|
||||
else:
|
||||
ner_tag = "O"
|
||||
tmp = word.split("/")
|
||||
token, pos = tmp[0], tmp[1]
|
||||
sent_ner.append(ner_tag)
|
||||
sent_pos_tag.append(pos)
|
||||
sent_words.append(token)
|
||||
example = [sent_words]
|
||||
if self.pos is True:
|
||||
example.append(sent_pos_tag)
|
||||
if self.ner is True:
|
||||
example.append(sent_ner)
|
||||
examples.append(example)
|
||||
return self.convert(examples)
|
||||
|
||||
def convert(self, data):
|
||||
"""
|
||||
|
||||
:param data: a built-in Python object
:return: a :class:`~fastNLP.DataSet` object
|
||||
"""
|
||||
data_set = DataSet()
|
||||
for item in data:
|
||||
sent_words = item[0]
|
||||
if self.pos is True and self.ner is True:
|
||||
instance = Instance(
|
||||
words=sent_words, pos_tags=item[1], ner=item[2])
|
||||
elif self.pos is True:
|
||||
instance = Instance(words=sent_words, pos_tags=item[1])
|
||||
elif self.ner is True:
|
||||
instance = Instance(words=sent_words, ner=item[1])
|
||||
else:
|
||||
instance = Instance(words=sent_words)
|
||||
data_set.append(instance)
|
||||
data_set.apply(lambda ins: len(ins[Const.INPUT]), new_field_name=Const.INPUT_LEN)
|
||||
return data_set
|
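The bracket handling in `PeopleDailyCorpusLoader._load` can be sketched in isolation as follows. This is a minimal standalone sketch, not the loader itself: `tag_bracketed_tokens` is a hypothetical name, and unlike the code above it also strips the brackets from single-token entities (the `U` case), which is an assumption about the intended behaviour.

```python
def tag_bracketed_tokens(tokens):
    """Tag 'word/pos' tokens with B/I/L/O/U, using [ and ] as entity boundaries."""
    words, pos_tags, ner_tags = [], [], []
    inside = False
    for tok in tokens:
        if "[" in tok and "]" in tok:
            # Entity that opens and closes on the same token.
            tag = "U"
            tok = tok[1:tok.index("]")]
        elif "[" in tok:
            inside, tag = True, "B"
            tok = tok[1:]
        elif "]" in tok:
            if not inside:
                raise ValueError("closing ] without an opening [")
            inside, tag = False, "L"
            tok = tok[:tok.index("]")]
        else:
            tag = "I" if inside else "O"
        # Raises ValueError if the token has no '/' separator.
        word, pos = tok.split("/")[:2]
        words.append(word)
        pos_tags.append(pos)
        ner_tags.append(tag)
    return words, pos_tags, ner_tags


words, pos_tags, ner_tags = tag_bracketed_tokens(
    ["[Bank/n", "of/p", "China/ns]nt", "said/v"]
)
print(list(zip(words, pos_tags, ner_tags)))
```

Keeping the three lists parallel mirrors how the loader later builds `words`, `pos_tags` and `ner` fields from the same sentence.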
45
fastNLP/io/data_loader/qnli.py
Normal file
@ -0,0 +1,45 @@
|
||||
|
||||
from ...core.const import Const
|
||||
|
||||
from .matching import MatchingLoader
|
||||
from ..dataset_loader import CSVLoader
|
||||
|
||||
|
||||
class QNLILoader(MatchingLoader, CSVLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.QNLILoader` :class:`fastNLP.io.data_loader.QNLILoader`

Reads the QNLI dataset. The resulting DataSet contains the fields::

    words1: list(str), the first sentence (premise)
    words2: list(str), the second sentence (hypothesis)
    target: str, the gold label

Data source:
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'train.tsv',
|
||||
'dev': 'dev.tsv',
|
||||
'test': 'test.tsv' # the test set has no labels
|
||||
}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
self.fields = {
|
||||
'question': Const.INPUTS(0),
|
||||
'sentence': Const.INPUTS(1),
|
||||
'label': Const.TARGET,
|
||||
}
|
||||
CSVLoader.__init__(self, sep='\t')
|
||||
|
||||
def _load(self, path):
|
||||
ds = CSVLoader._load(self, path)
|
||||
|
||||
for k, v in self.fields.items():
|
||||
if k in ds.get_field_names():
|
||||
ds.rename_field(k, v)
|
||||
for fields in ds.get_all_fields():
|
||||
if Const.INPUT in fields:
|
||||
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
|
||||
|
||||
return ds
|
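The two steps in `QNLILoader._load` (rename the TSV columns, then split every input field on whitespace) can be illustrated without fastNLP. The sketch below uses only the standard library; `load_tsv` is a hypothetical helper, and the `words` prefix check stands in for the `Const.INPUT in fields` test, on the assumption that the input fields are named `words1` and `words2` as the docstring states.

```python
import csv
import io


def load_tsv(text, field_map, input_prefix="words"):
    """Read tab-separated text, rename columns, and tokenize the input fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for raw in reader:
        # Columns missing from field_map keep their original name.
        row = {field_map.get(key, key): value for key, value in raw.items()}
        for name in list(row):
            if name.startswith(input_prefix):
                row[name] = row[name].strip().split()
        rows.append(row)
    return rows


sample = "index\tquestion\tsentence\tlabel\n0\tWho wrote it ?\tAlice wrote it .\tentailment\n"
fields = {"question": "words1", "sentence": "words2", "label": "target"}
print(load_tsv(sample, fields))
```

Columns that are absent from a file (for example `label` in an unlabeled test split) are simply not renamed, which matches the `if k in ds.get_field_names()` guard above.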
32
fastNLP/io/data_loader/quora.py
Normal file
@ -0,0 +1,32 @@
|
||||
|
||||
from ...core.const import Const
|
||||
|
||||
from .matching import MatchingLoader
|
||||
from ..dataset_loader import CSVLoader
|
||||
|
||||
|
||||
class QuoraLoader(MatchingLoader, CSVLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.QuoraLoader` :class:`fastNLP.io.data_loader.QuoraLoader`

Reads the Quora dataset. The resulting DataSet contains the fields::

    words1: list(str), the first sentence (premise)
    words2: list(str), the second sentence (hypothesis)
    target: str, the gold label

Data source:
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'train.tsv',
|
||||
'dev': 'dev.tsv',
|
||||
'test': 'test.tsv',
|
||||
}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
CSVLoader.__init__(self, sep='\t', headers=(Const.TARGET, Const.INPUTS(0), Const.INPUTS(1), 'pairID'))
|
||||
|
||||
def _load(self, path):
|
||||
ds = CSVLoader._load(self, path)
|
||||
return ds
|
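`QuoraLoader` passes explicit `headers` to `CSVLoader` because the Quora TSV files are read without a header row. A minimal standard-library sketch of that idea is shown below; `read_with_headers` is a hypothetical helper, and the literal header names assume that `Const.TARGET` and `Const.INPUTS(i)` resolve to `target`, `words1` and `words2`.

```python
import csv
import io


def read_with_headers(text, headers, sep="\t"):
    """Parse header-less delimited text into dicts keyed by the given headers."""
    reader = csv.reader(io.StringIO(text), delimiter=sep, quoting=csv.QUOTE_NONE)
    rows = []
    for line_no, fields in enumerate(reader, 1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(headers):
            raise ValueError(
                "line {}: expected {} columns, got {}".format(line_no, len(headers), len(fields))
            )
        rows.append(dict(zip(headers, fields)))
    return rows


sample = "1\tHow do I learn ?\tHow can I learn ?\t42\n"
print(read_with_headers(sample, ("target", "words1", "words2", "pairID")))
```

Raising on a column-count mismatch is a design choice of this sketch; whether the real loader raises or skips such rows is not shown in this diff.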
45
fastNLP/io/data_loader/rte.py
Normal file
@ -0,0 +1,45 @@
|
||||
|
||||
from ...core.const import Const
|
||||
|
||||
from .matching import MatchingLoader
|
||||
from ..dataset_loader import CSVLoader
|
||||
|
||||
|
||||
class RTELoader(MatchingLoader, CSVLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.RTELoader` :class:`fastNLP.io.data_loader.RTELoader`

Reads the RTE dataset. The resulting DataSet contains the fields::

    words1: list(str), the first sentence (premise)
    words2: list(str), the second sentence (hypothesis)
    target: str, the gold label

Data source:
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'train.tsv',
|
||||
'dev': 'dev.tsv',
|
||||
'test': 'test.tsv' # the test set has no labels
|
||||
}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
self.fields = {
|
||||
'sentence1': Const.INPUTS(0),
|
||||
'sentence2': Const.INPUTS(1),
|
||||
'label': Const.TARGET,
|
||||
}
|
||||
CSVLoader.__init__(self, sep='\t')
|
||||
|
||||
def _load(self, path):
|
||||
ds = CSVLoader._load(self, path)
|
||||
|
||||
for k, v in self.fields.items():
|
||||
if k in ds.get_field_names():
|
||||
ds.rename_field(k, v)
|
||||
for fields in ds.get_all_fields():
|
||||
if Const.INPUT in fields:
|
||||
ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
|
||||
|
||||
return ds
|
44
fastNLP/io/data_loader/snli.py
Normal file
@ -0,0 +1,44 @@
|
||||
|
||||
from ...core.const import Const
|
||||
|
||||
from .matching import MatchingLoader
|
||||
from ..dataset_loader import JsonLoader
|
||||
|
||||
|
||||
class SNLILoader(MatchingLoader, JsonLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.data_loader.SNLILoader`

Reads the SNLI dataset. The resulting DataSet contains the fields::

    words1: list(str), the first sentence (premise)
    words2: list(str), the second sentence (hypothesis)
    target: str, the gold label

Data source: https://nlp.stanford.edu/projects/snli/snli_1.0.zip
|
||||
"""
|
||||
|
||||
def __init__(self, paths: dict=None):
|
||||
fields = {
|
||||
'sentence1_binary_parse': Const.INPUTS(0),
|
||||
'sentence2_binary_parse': Const.INPUTS(1),
|
||||
'gold_label': Const.TARGET,
|
||||
}
|
||||
paths = paths if paths is not None else {
|
||||
'train': 'snli_1.0_train.jsonl',
|
||||
'dev': 'snli_1.0_dev.jsonl',
|
||||
'test': 'snli_1.0_test.jsonl'}
|
||||
MatchingLoader.__init__(self, paths=paths)
|
||||
JsonLoader.__init__(self, fields=fields)
|
||||
|
||||
def _load(self, path):
|
||||
ds = JsonLoader._load(self, path)
|
||||
|
||||
parentheses_table = str.maketrans({'(': None, ')': None})
|
||||
|
||||
ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
|
||||
new_field_name=Const.INPUTS(0))
|
||||
ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
|
||||
new_field_name=Const.INPUTS(1))
|
||||
ds.drop(lambda x: x[Const.TARGET] == '-')
|
||||
return ds
|
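`SNLILoader._load` turns the `sentence*_binary_parse` strings into token lists by deleting the parentheses and splitting on whitespace, then drops examples whose gold label is `-`. The sketch below reproduces those two steps on plain Python data; `parse_to_tokens` and `drop_unlabelled` are hypothetical helper names, and the rows are plain dicts rather than a fastNLP `DataSet`.

```python
# Translation table that deletes both parenthesis characters.
PARENS = str.maketrans({"(": None, ")": None})


def parse_to_tokens(binary_parse):
    """Convert an SNLI binary parse string into a flat list of tokens."""
    return binary_parse.translate(PARENS).strip().split()


def drop_unlabelled(rows, label_key="target"):
    """Remove rows whose label is '-', i.e. no annotator consensus."""
    return [row for row in rows if row[label_key] != "-"]


rows = [
    {"words1": "( ( A man ) ( is sleeping ) )", "target": "neutral"},
    {"words1": "( A dog )", "target": "-"},
]
for row in rows:
    row["words1"] = parse_to_tokens(row["words1"])
print(drop_unlabelled(rows))
```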
@ -1,19 +1,19 @@
|
||||
from typing import Iterable
|
||||
|
||||
from typing import Union, Dict
|
||||
from nltk import Tree
|
||||
import spacy
|
||||
from ..base_loader import DataInfo, DataSetLoader
|
||||
|
||||
from ..base_loader import DataBundle, DataSetLoader
|
||||
from ..dataset_loader import CSVLoader
|
||||
from ...core.vocabulary import VocabularyOption, Vocabulary
|
||||
from ...core.dataset import DataSet
|
||||
from ...core.const import Const
|
||||
from ...core.instance import Instance
|
||||
from ..utils import check_dataloader_paths, get_tokenizer
|
||||
|
||||
|
||||
class SSTLoader(DataSetLoader):
|
||||
URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip'
|
||||
DATA_DIR = 'sst/'
|
||||
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.dataset_loader.SSTLoader`
Alias: :class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.data_loader.SSTLoader`

Reads the SST dataset. The DataSet contains the fields::
|
||||
|
||||
@ -26,6 +26,9 @@ class SSTLoader(DataSetLoader):
|
||||
:param fine_grained: whether to use the SST-5 label set; if ``False``, SST-2 is used. Default: ``False``
|
||||
"""
|
||||
|
||||
URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip'
|
||||
DATA_DIR = 'sst/'
|
||||
|
||||
def __init__(self, subtree=False, fine_grained=False):
|
||||
self.subtree = subtree
|
||||
|
||||
@ -57,8 +60,8 @@ class SSTLoader(DataSetLoader):
|
||||
def _get_one(self, data, subtree):
|
||||
tree = Tree.fromstring(data)
|
||||
if subtree:
|
||||
return [([x.text for x in self.tokenizer(' '.join(t.leaves()))], t.label()) for t in tree.subtrees() ]
|
||||
return [([x.text for x in self.tokenizer(' '.join(tree.leaves()))], tree.label())]
|
||||
return [(self.tokenizer(' '.join(t.leaves())), t.label()) for t in tree.subtrees() ]
|
||||
return [(self.tokenizer(' '.join(tree.leaves())), tree.label())]
|
||||
|
||||
def process(self,
|
||||
paths, train_subtree=True,
|
||||
@ -70,7 +73,7 @@ class SSTLoader(DataSetLoader):
|
||||
tgt_vocab = Vocabulary(unknown=None, padding=None) \
|
||||
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
|
||||
|
||||
info = DataInfo()
|
||||
info = DataBundle()
|
||||
origin_subtree = self.subtree
|
||||
self.subtree = train_subtree
|
||||
info.datasets['train'] = self._load(paths['train'])
|
||||
@ -98,3 +101,75 @@ class SSTLoader(DataSetLoader):
|
||||
|
||||
return info
|
||||
|
||||
|
||||
class SST2Loader(CSVLoader):
|
||||
"""
|
||||
Data source "SST": 'https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8'
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
super(SST2Loader, self).__init__(sep='\t')
|
||||
self.tokenizer = get_tokenizer()
|
||||
self.field = {'sentence': Const.INPUT, 'label': Const.TARGET}
|
||||
|
||||
def _load(self, path: str) -> DataSet:
|
||||
ds = super(SST2Loader, self)._load(path)
|
||||
for k, v in self.field.items():
|
||||
if k in ds.get_field_names():
|
||||
ds.rename_field(k, v)
|
||||
ds.apply(lambda x: self.tokenizer(x[Const.INPUT]), new_field_name=Const.INPUT)
|
||||
print("all count:", len(ds))
|
||||
return ds
|
||||
|
||||
def process(self,
|
||||
paths: Union[str, Dict[str, str]],
|
||||
src_vocab_opt: VocabularyOption = None,
|
||||
tgt_vocab_opt: VocabularyOption = None,
|
||||
char_level_op=False):
|
||||
|
||||
paths = check_dataloader_paths(paths)
|
||||
datasets = {}
|
||||
info = DataBundle()
|
||||
for name, path in paths.items():
|
||||
dataset = self.load(path)
|
||||
datasets[name] = dataset
|
||||
|
||||
def wordtochar(words):
|
||||
chars = []
|
||||
for word in words:
|
||||
word = word.lower()
|
||||
for char in word:
|
||||
chars.append(char)
|
||||
chars.append('')
|
||||
chars.pop()
|
||||
return chars
|
||||
|
||||
input_name, target_name = Const.INPUT, Const.TARGET
|
||||
info.vocabs={}
|
||||
|
||||
# Split the words into characters
|
||||
if char_level_op:
|
||||
for dataset in datasets.values():
|
||||
dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT)
|
||||
src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
|
||||
src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT)
|
||||
src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT)
|
||||
|
||||
tgt_vocab = Vocabulary(unknown=None, padding=None) \
|
||||
if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
|
||||
tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET)
|
||||
tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET)
|
||||
|
||||
info.vocabs = {
|
||||
Const.INPUT: src_vocab,
|
||||
Const.TARGET: tgt_vocab
|
||||
}
|
||||
|
||||
info.datasets = datasets
|
||||
|
||||
for name, dataset in info.datasets.items():
|
||||
dataset.set_input(Const.INPUT)
|
||||
dataset.set_target(Const.TARGET)
|
||||
|
||||
return info
|
||||
|
||||
|
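The nested `wordtochar` helper in `SST2Loader.process` flattens a token list into lowercase characters, inserting an empty string between consecutive words. A standalone sketch is given below; `words_to_chars` is a hypothetical name, and the guard for an empty input is an addition, since the version above would fail on `chars.pop()` when `words` is empty.

```python
def words_to_chars(words):
    """Flatten words into lowercase characters, with '' marking word boundaries."""
    chars = []
    for word in words:
        chars.extend(word.lower())
        chars.append("")  # boundary marker after each word
    if chars:
        chars.pop()  # drop the marker that follows the last word
    return chars


print(words_to_chars(["Hi", "you"]))
```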
127
fastNLP/io/data_loader/yelp.py
Normal file
@ -0,0 +1,127 @@
|
||||
|
||||
import csv
|
||||
from typing import Iterable
|
||||
|
||||
from ...core.const import Const
|
||||
from ...core.dataset import DataSet
|
||||
from ...core.instance import Instance
|
||||
from ...core.vocabulary import VocabularyOption, Vocabulary
|
||||
from ..base_loader import DataBundle, DataSetLoader
|
||||
from typing import Union, Dict
|
||||
from ..utils import check_dataloader_paths, get_tokenizer
|
||||
|
||||
|
||||
class YelpLoader(DataSetLoader):
|
||||
"""
|
||||
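Reads the Yelp_full/Yelp_polarity dataset. The DataSet contains the fields:
words: list(str), the text to classify
target: str, the label of the text
chars: list(str), the un-indexed list of characters

Dataset: yelp_full/yelp_polarity
:param fine_grained: whether to keep the five fine-grained labels; if ``False``, "very negative" and "very positive" are merged into "negative" and "positive". Default: ``False``
:param lower: whether to lowercase the text automatically. Default: ``False``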
|
||||
"""
|
||||
|
||||
def __init__(self, fine_grained=False, lower=False):
|
||||
super(YelpLoader, self).__init__()
|
||||
tag_v = {'1.0': 'very negative', '2.0': 'negative', '3.0': 'neutral',
|
||||
'4.0': 'positive', '5.0': 'very positive'}
|
||||
if not fine_grained:
|
||||
tag_v['1.0'] = tag_v['2.0']
|
||||
tag_v['5.0'] = tag_v['4.0']
|
||||
self.fine_grained = fine_grained
|
||||
self.tag_v = tag_v
|
||||
self.lower = lower
|
||||
self.tokenizer = get_tokenizer()
|
||||
|
||||
def _load(self, path):
|
||||
ds = DataSet()
|
||||
csv_reader = csv.reader(open(path, encoding='utf-8'))
|
||||
all_count = 0
|
||||
real_count = 0
|
||||
for row in csv_reader:
|
||||
all_count += 1
|
||||
if len(row) == 2:
|
||||
target = self.tag_v[row[0] + ".0"]
|
||||
words = clean_str(row[1], self.tokenizer, self.lower)
|
||||
if len(words) != 0:
|
||||
ds.append(Instance(words=words, target=target))
|
||||
real_count += 1
|
||||
print("all count:", all_count)
|
||||
print("real count:", real_count)
|
||||
return ds
|
||||
|
||||
def process(self, paths: Union[str, Dict[str, str]],
|
||||
train_ds: Iterable[str] = None,
|
||||
src_vocab_op: VocabularyOption = None,
|
||||
tgt_vocab_op: VocabularyOption = None,
|
||||
char_level_op=False):
|
||||
paths = check_dataloader_paths(paths)
|
||||
info = DataBundle(datasets=self.load(paths))
|
||||
src_vocab = Vocabulary() if src_vocab_op is None else Vocabulary(**src_vocab_op)
|
||||
tgt_vocab = Vocabulary(unknown=None, padding=None) \
|
||||
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
|
||||
_train_ds = [info.datasets[name]
|
||||
for name in train_ds] if train_ds else info.datasets.values()
|
||||
|
||||
def wordtochar(words):
|
||||
chars = []
|
||||
for word in words:
|
||||
word = word.lower()
|
||||
for char in word:
|
||||
chars.append(char)
|
||||
chars.append('')
|
||||
chars.pop()
|
||||
return chars
|
||||
|
||||
input_name, target_name = Const.INPUT, Const.TARGET
|
||||
info.vocabs = {}
|
||||
# Split the words into characters
|
||||
if char_level_op:
|
||||
for dataset in info.datasets.values():
|
||||
dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT)
|
||||
else:
|
||||
src_vocab.from_dataset(*_train_ds, field_name=input_name)
|
||||
src_vocab.index_dataset(*info.datasets.values(), field_name=input_name, new_field_name=input_name)
|
||||
info.vocabs[input_name] = src_vocab
|
||||
|
||||
tgt_vocab.from_dataset(*_train_ds, field_name=target_name)
|
||||
tgt_vocab.index_dataset(
|
||||
*info.datasets.values(),
|
||||
field_name=target_name, new_field_name=target_name)
|
||||
|
||||
info.vocabs[target_name] = tgt_vocab
|
||||
|
||||
info.datasets['train'], info.datasets['dev'] = info.datasets['train'].split(0.1, shuffle=False)
|
||||
|
||||
for name, dataset in info.datasets.items():
|
||||
dataset.set_input(Const.INPUT)
|
||||
dataset.set_target(Const.TARGET)
|
||||
|
||||
return info
|
||||
|
||||
|
||||
def clean_str(sentence, tokenizer, char_lower=False):
|
||||
"""
|
||||
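Adapted largely from
https://github.com/LukeZhuang/Hierarchical-Attention-Network/blob/master/yelp-preprocess.ipynb
:param sentence: the input string
:param tokenizer: a callable that splits a string into tokens
:param char_lower: whether to lowercase the sentence before tokenizing
:return: the list of cleaned tokens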
|
||||
"""
|
||||
if char_lower:
|
||||
sentence = sentence.lower()
|
||||
import re
|
||||
nonalpnum = re.compile('[^0-9a-zA-Z?!\']+')
|
||||
words = tokenizer(sentence)
|
||||
words_collection = []
|
||||
for word in words:
|
||||
if word in ['-lrb-', '-rrb-', '<sssss>', '-r', '-l', 'b-']:
|
||||
continue
|
||||
tt = nonalpnum.split(word)
|
||||
t = ''.join(tt)
|
||||
if t != '':
|
||||
words_collection.append(t)
|
||||
|
||||
return words_collection
|
||||
|
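The cleaning rule in `clean_str` (skip a few bracket placeholders, then strip every character that is not alphanumeric, `?`, `!` or `'`) can be tried out with a plain whitespace tokenizer. In the sketch below, `clean_tokens` is a hypothetical name and `str.split` replaces the tokenizer returned by `get_tokenizer()`, which is an assumption made to keep the example dependency-free.

```python
import re

NON_ALNUM = re.compile(r"[^0-9a-zA-Z?!']+")
SKIP = {"-lrb-", "-rrb-", "<sssss>", "-r", "-l", "b-"}


def clean_tokens(sentence, tokenizer=str.split, lower=False):
    """Tokenize a sentence and strip unwanted characters from each token."""
    if lower:
        sentence = sentence.lower()
    cleaned = []
    for word in tokenizer(sentence):
        if word in SKIP:
            continue
        # Equivalent to ''.join(NON_ALNUM.split(word)).
        stripped = NON_ALNUM.sub("", word)
        if stripped:
            cleaned.append(stripped)
    return cleaned


print(clean_tokens("Great food -lrb- really -rrb- !!", lower=True))
```

Note that the placeholder check is case-sensitive, so `-LRB-` is only skipped when `lower=True`; the same holds for the function above.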
@ -15,199 +15,13 @@ The dataset_loader module implements many DataSetLoaders, used to read different formats of
|
||||
__all__ = [
|
||||
'CSVLoader',
|
||||
'JsonLoader',
|
||||
'ConllLoader',
|
||||
'PeopleDailyCorpusLoader',
|
||||
'Conll2003Loader',
|
||||
]
|
||||
|
||||
import os
|
||||
from nltk import Tree
|
||||
from typing import Union, Dict
|
||||
from ..core.vocabulary import Vocabulary
|
||||
|
||||
from ..core.dataset import DataSet
|
||||
from ..core.instance import Instance
|
||||
from .file_reader import _read_csv, _read_json, _read_conll
|
||||
from .base_loader import DataSetLoader, DataInfo
|
||||
from ..core.const import Const
|
||||
from ..modules.encoder._bert import BertTokenizer
|
||||
|
||||
|
||||
class PeopleDailyCorpusLoader(DataSetLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.dataset_loader.PeopleDailyCorpusLoader`

Reads the People's Daily corpus.
|
||||
"""
|
||||
|
||||
def __init__(self, pos=True, ner=True):
|
||||
super(PeopleDailyCorpusLoader, self).__init__()
|
||||
self.pos = pos
|
||||
self.ner = ner
|
||||
|
||||
def _load(self, data_path):
|
||||
with open(data_path, "r", encoding="utf-8") as f:
|
||||
sents = f.readlines()
|
||||
examples = []
|
||||
for sent in sents:
|
||||
if len(sent) <= 2:
|
||||
continue
|
||||
inside_ne = False
|
||||
sent_pos_tag = []
|
||||
sent_words = []
|
||||
sent_ner = []
|
||||
words = sent.strip().split()[1:]
|
||||
for word in words:
|
||||
if "[" in word and "]" in word:
|
||||
ner_tag = "U"
|
||||
print(word)
|
||||
elif "[" in word:
|
||||
inside_ne = True
|
||||
ner_tag = "B"
|
||||
word = word[1:]
|
||||
elif "]" in word:
|
||||
ner_tag = "L"
|
||||
word = word[:word.index("]")]
|
||||
if inside_ne is True:
|
||||
inside_ne = False
|
||||
else:
|
||||
raise RuntimeError("only ] appears!")
|
||||
else:
|
||||
if inside_ne is True:
|
||||
ner_tag = "I"
|
||||
else:
|
||||
ner_tag = "O"
|
||||
tmp = word.split("/")
|
||||
token, pos = tmp[0], tmp[1]
|
||||
sent_ner.append(ner_tag)
|
||||
sent_pos_tag.append(pos)
|
||||
sent_words.append(token)
|
||||
example = [sent_words]
|
||||
if self.pos is True:
|
||||
example.append(sent_pos_tag)
|
||||
if self.ner is True:
|
||||
example.append(sent_ner)
|
||||
examples.append(example)
|
||||
return self.convert(examples)
|
||||
|
||||
def convert(self, data):
|
||||
"""
|
||||
|
||||
:param data: a built-in Python object
:return: a :class:`~fastNLP.DataSet` object
|
||||
"""
|
||||
data_set = DataSet()
|
||||
for item in data:
|
||||
sent_words = item[0]
|
||||
if self.pos is True and self.ner is True:
|
||||
instance = Instance(
|
||||
words=sent_words, pos_tags=item[1], ner=item[2])
|
||||
elif self.pos is True:
|
||||
instance = Instance(words=sent_words, pos_tags=item[1])
|
||||
elif self.ner is True:
|
||||
instance = Instance(words=sent_words, ner=item[1])
|
||||
else:
|
||||
instance = Instance(words=sent_words)
|
||||
data_set.append(instance)
|
||||
data_set.apply(lambda ins: len(ins[Const.INPUT]), new_field_name=Const.INPUT_LEN)
|
||||
return data_set
|
||||
|
||||
|
||||
class ConllLoader(DataSetLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.dataset_loader.ConllLoader`

Reads data in CoNLL format; see http://conll.cemantix.org/2012/data.html for the format. Lines starting with "-DOCSTART-" are ignored, because CoNLL 2003 uses that token as a document separator.

Column indices start at 0. The columns are::

    Column  Type
    0       Document ID
    1       Part number
    2       Word number
    3       Word itself
    4       Part-of-Speech
    5       Parse bit
    6       Predicate lemma
    7       Predicate Frameset ID
    8       Word sense
    9       Speaker/Author
    10      Named Entities
    11:N    Predicate Arguments
    N       Coreference

:param headers: the name of each column, a list or tuple of str. ``headers`` corresponds one-to-one with ``indexes``
:param indexes: indices of the columns to keep, starting at 0. If ``None``, all columns are kept. Default: ``None``
:param dropna: whether to skip invalid data. If ``False``, a ``ValueError`` is raised on invalid data. Default: ``False``
|
||||
"""
|
||||
|
||||
def __init__(self, headers, indexes=None, dropna=False):
|
||||
super(ConllLoader, self).__init__()
|
||||
if not isinstance(headers, (list, tuple)):
|
||||
raise TypeError(
|
||||
'invalid headers: {}, should be list of strings'.format(headers))
|
||||
self.headers = headers
|
||||
self.dropna = dropna
|
||||
if indexes is None:
|
||||
self.indexes = list(range(len(self.headers)))
|
||||
else:
|
||||
if len(indexes) != len(headers):
|
||||
raise ValueError
|
||||
self.indexes = indexes
|
||||
|
||||
def _load(self, path):
|
||||
ds = DataSet()
|
||||
for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna):
|
||||
ins = {h: data[i] for i, h in enumerate(self.headers)}
|
||||
ds.append(Instance(**ins))
|
||||
return ds
|
||||
|
||||
|
||||
class Conll2003Loader(ConllLoader):
|
||||
"""
|
||||
Alias: :class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.dataset_loader.Conll2003Loader`

Reads CoNLL 2003 data.

For more information about the dataset, see:
https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
headers = [
|
||||
'tokens', 'pos', 'chunks', 'ner',
|
||||
]
|
||||
super(Conll2003Loader, self).__init__(headers=headers)
|
||||
|
||||
|
||||
def _cut_long_sentence(sent, max_sample_length=200):
|
||||
"""
|
||||
Splits a sentence longer than max_sample_length into several pieces. Cuts happen only at spaces,
so a piece may be longer or shorter than max_sample_length.

:param sent: str.
:param max_sample_length: int.
:return: list of str.
|
||||
"""
|
||||
sent_no_space = sent.replace(' ', '')
|
||||
cutted_sentence = []
|
||||
if len(sent_no_space) > max_sample_length:
|
||||
parts = sent.strip().split()
|
||||
new_line = ''
|
||||
length = 0
|
||||
for part in parts:
|
||||
length += len(part)
|
||||
new_line += part + ' '
|
||||
if length > max_sample_length:
|
||||
new_line = new_line[:-1]
|
||||
cutted_sentence.append(new_line)
|
||||
length = 0
|
||||
new_line = ''
|
||||
if new_line != '':
|
||||
cutted_sentence.append(new_line[:-1])
|
||||
else:
|
||||
cutted_sentence.append(sent)
|
||||
return cutted_sentence
|
||||
from .file_reader import _read_csv, _read_json
|
||||
from .base_loader import DataSetLoader
|
||||
|
||||
|
||||
class JsonLoader(DataSetLoader):
|
||||
@ -272,6 +86,36 @@ class CSVLoader(DataSetLoader):
|
||||
return ds
|
||||
|
||||
|
||||
def _cut_long_sentence(sent, max_sample_length=200):
|
||||
"""
|
||||
Splits a sentence longer than max_sample_length into several pieces. Cuts happen only at spaces,
so a piece may be longer or shorter than max_sample_length.

:param sent: str.
:param max_sample_length: int.
:return: list of str.
|
||||
"""
|
||||
sent_no_space = sent.replace(' ', '')
|
||||
cutted_sentence = []
|
||||
if len(sent_no_space) > max_sample_length:
|
||||
parts = sent.strip().split()
|
||||
new_line = ''
|
||||
length = 0
|
||||
for part in parts:
|
||||
length += len(part)
|
||||
new_line += part + ' '
|
||||
if length > max_sample_length:
|
||||
new_line = new_line[:-1]
|
||||
cutted_sentence.append(new_line)
|
||||
length = 0
|
||||
new_line = ''
|
||||
if new_line != '':
|
||||
cutted_sentence.append(new_line[:-1])
|
||||
else:
|
||||
cutted_sentence.append(sent)
|
||||
return cutted_sentence
|
||||
|
||||
|
||||
def _add_seg_tag(data):
|
||||
"""
|
||||
|
||||
|
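The `_cut_long_sentence` helper relocated in this hunk cuts only at spaces, closing a piece as soon as its accumulated character count exceeds the limit. The sketch below is a compact restatement of that logic for experimentation; `cut_long_sentence` is a hypothetical public name, and it builds pieces with `str.join` instead of string concatenation, which is a stylistic choice of this sketch.

```python
def cut_long_sentence(sent, max_sample_length=200):
    """Split a space-separated sentence into pieces, cutting only at spaces."""
    if len(sent.replace(" ", "")) <= max_sample_length:
        return [sent]
    pieces, current, length = [], [], 0
    for part in sent.strip().split():
        current.append(part)
        length += len(part)
        if length > max_sample_length:
            # The piece is closed after the limit is exceeded,
            # so it can be longer than max_sample_length.
            pieces.append(" ".join(current))
            current, length = [], 0
    if current:
        pieces.append(" ".join(current))
    return pieces


print(cut_long_sentence("aa bb cc dd ee", max_sample_length=3))
```

Because the length check runs after a part is appended, every piece except possibly the last one exceeds the limit by up to one part.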
@ -17,6 +17,10 @@ PRETRAINED_BERT_MODEL_DIR = {
|
||||
'en-large-uncased': 'bert-large-uncased-20939f45.zip',
|
||||
'en-large-cased': 'bert-large-cased-e0cf90fc.zip',
|
||||
|
||||
'en-large-cased-wwm': 'bert-large-cased-wwm-a457f118.zip',
|
||||
'en-large-uncased-wwm': 'bert-large-uncased-wwm-92a50aeb.zip',
|
||||
'en-base-cased-mrpc': 'bert-base-cased-finetuned-mrpc-c7099855.zip',
|
||||
|
||||
'cn': 'bert-base-chinese-29d0a84a.zip',
|
||||
'cn-base': 'bert-base-chinese-29d0a84a.zip',
|
||||
|
||||
@ -68,6 +72,7 @@ def cached_path(url_or_filename: str, cache_dir: Path=None) -> Path:
|
||||
"unable to parse {} as a URL or as a local path".format(url_or_filename)
|
||||
)
|
||||
|
||||
|
||||
def get_filepath(filepath):
|
||||
"""
|
||||
If filepath contains only one file, return the full path of that file directly
|
||||
@ -82,6 +87,7 @@ def get_filepath(filepath):
|
||||
return filepath
|
||||
return filepath
|
||||
|
||||
|
||||
def get_defalt_path():
|
||||
"""
|
||||
Get the default fastNLP cache path. If FASTNLP_CACHE_PATH is set as an environment variable, its value is used, so that not every user has to download the resources.
|
||||
@ -98,6 +104,7 @@ def get_defalt_path():
|
||||
fastnlp_cache_dir = os.path.expanduser(os.path.join("~", ".fastNLP"))
|
||||
return fastnlp_cache_dir
|
||||
|
||||
|
||||
def _get_base_url(name):
|
||||
# The returned URL must end with /
|
||||
if 'FASTNLP_BASE_URL' in os.environ:
|
||||
@ -105,6 +112,7 @@ def _get_base_url(name):
|
||||
return fastnlp_base_url
|
||||
raise RuntimeError("This function is not available right now.")
|
||||
|
||||
|
||||
def split_filename_suffix(filepath):
|
||||
"""
|
||||
Given filepath, return the corresponding name and suffix
|
||||
@ -116,6 +124,7 @@ def split_filename_suffix(filepath):
|
||||
return filename[:-7], '.tar.gz'
|
||||
return os.path.splitext(filename)
|
||||
|
||||
|
||||
def get_from_cache(url: str, cache_dir: Path = None) -> Path:
|
||||
"""
|
||||
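Try to find the resource defined by url in cache_dir. If it is not found, download it from url and store the result under cache_dir; the cache name is inferred from the url.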
|
||||
@ -226,6 +235,7 @@ def get_from_cache(url: str, cache_dir: Path = None) -> Path:
|
||||
|
||||
return get_filepath(cache_path)
|
||||
|
||||
|
||||
def unzip_file(file: Path, to: Path):
|
||||
# unpack and write out in CoNLL column-like format
|
||||
from zipfile import ZipFile
|
||||
@ -234,13 +244,15 @@ def unzip_file(file: Path, to: Path):
|
||||
# Extract all the contents of zip file in current directory
|
||||
zipObj.extractall(to)
|
||||
|
||||
|
||||
def untar_gz_file(file:Path, to:Path):
|
||||
import tarfile
|
||||
|
||||
with tarfile.open(file, 'r:gz') as tar:
|
||||
tar.extractall(to)
|
||||
|
||||
def match_file(dir_name:str, cache_dir:str)->str:
|
||||
|
||||
def match_file(dir_name: str, cache_dir: str) -> str:
|
||||
"""
|
||||
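The matching rule is that a file under cache_dir matches if (1) its name is identical to dir_name, or (2) its name is identical to dir_name apart from the suffix.
If two matches are found, an error is raised. If a match is found, the name of the matching file is returned; otherwise an empty string is returned.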
|
||||
@ -261,6 +273,7 @@ def match_file(dir_name:str, cache_dir:str)->str:
|
||||
else:
|
||||
raise RuntimeError(f"Duplicate matched files:{matched_filenames}, this should be caused by a bug.")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
cache_dir = Path('caches')
|
||||
cache_dir = None
|
||||
|
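The `.tar.gz` special case in `split_filename_suffix` exists because `os.path.splitext` only splits off the last extension. The sketch below shows the idea with a hypothetical `split_name_suffix`; taking the base name of the path first is an assumption, since only the two `return` lines of the original function are visible in this diff.

```python
import os


def split_name_suffix(filepath):
    """Return (name, suffix) for a path, treating '.tar.gz' as a single suffix."""
    filename = os.path.basename(filepath)
    if filename.endswith(".tar.gz"):
        # len(".tar.gz") == 7
        return filename[:-7], ".tar.gz"
    return os.path.splitext(filename)


print(split_name_suffix("cache/bert-base-chinese-29d0a84a.zip"))
print(split_name_suffix("cache/embeddings.tar.gz"))
```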
@ -4,149 +4,209 @@ __all__ = [
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
|
||||
from .base_model import BaseModel
|
||||
from ..core.const import Const
|
||||
from ..modules import decoder as Decoder
|
||||
from ..modules import encoder as Encoder
|
||||
from ..modules import aggregator as Aggregator
|
||||
from ..core.utils import seq_len_to_mask
|
||||
from torch.nn import CrossEntropyLoss
|
||||
|
||||
my_inf = 10e12
|
||||
from fastNLP.models import BaseModel
|
||||
from fastNLP.modules.encoder.embedding import TokenEmbedding
|
||||
from fastNLP.modules.encoder.lstm import LSTM
|
||||
from fastNLP.core.const import Const
|
||||
from fastNLP.core.utils import seq_len_to_mask
|
||||
|
||||
|
||||
class ESIM(BaseModel):
|
||||
"""
|
||||
Alias: :class:`fastNLP.models.ESIM` :class:`fastNLP.models.snli.ESIM`
"""A PyTorch implementation of the ESIM model
Paper: https://arxiv.org/pdf/1609.06038.pdf

A PyTorch implementation of the ESIM model.
Paper for the ESIM model: Enhanced LSTM for Natural Language Inference (arXiv: 1609.06038)

:param int vocab_size: vocabulary size
:param int embed_dim: word embedding dimension
:param int hidden_size: LSTM hidden size
:param float dropout: dropout rate, 0 by default
:param int num_classes: number of labels, 3 by default
:param numpy.array init_embedding: initial word embedding matrix of shape (vocab_size, embed_dim); None by default, meaning the embedding matrix is randomly initialized
:param fastNLP.TokenEmbedding init_embedding: the TokenEmbedding used for initialization
:param int hidden_size: hidden layer size; defaults to the embedding dimension
:param int num_labels: number of target label classes, 3 by default
:param float dropout_rate: dropout rate, 0.3 by default
:param float dropout_embed: dropout rate applied to the embedding, 0.1 by default
|
||||
"""
|
||||
|
||||
def __init__(self, vocab_size, embed_dim, hidden_size, dropout=0.0, num_classes=3, init_embedding=None):
|
||||
|
||||
|
||||
def __init__(self, init_embedding: TokenEmbedding, hidden_size=None, num_labels=3, dropout_rate=0.3,
|
||||
dropout_embed=0.1):
|
||||
super(ESIM, self).__init__()
|
||||
self.vocab_size = vocab_size
|
||||
self.embed_dim = embed_dim
|
||||
self.hidden_size = hidden_size
|
||||
self.dropout = dropout
|
||||
self.n_labels = num_classes
|
||||
|
||||
self.drop = nn.Dropout(self.dropout)
|
||||
|
||||
self.embedding = Encoder.Embedding(
|
||||
(self.vocab_size, self.embed_dim), dropout=self.dropout,
|
||||
)
|
||||
|
||||
self.embedding_layer = nn.Linear(self.embed_dim, self.hidden_size)
|
||||
|
||||
self.encoder = Encoder.LSTM(
|
||||
input_size=self.embed_dim, hidden_size=self.hidden_size, num_layers=1, bias=True,
|
||||
batch_first=True, bidirectional=True
|
||||
)
|
||||
|
||||
self.bi_attention = Aggregator.BiAttention()
|
||||
self.mean_pooling = Aggregator.AvgPoolWithMask()
|
||||
self.max_pooling = Aggregator.MaxPoolWithMask()
|
||||
|
||||
self.inference_layer = nn.Linear(self.hidden_size * 4, self.hidden_size)
|
||||
|
||||
self.decoder = Encoder.LSTM(
|
||||
input_size=self.hidden_size, hidden_size=self.hidden_size, num_layers=1, bias=True,
|
||||
batch_first=True, bidirectional=True
|
||||
)
|
||||
|
||||
self.output = Decoder.MLP([4 * self.hidden_size, self.hidden_size, self.n_labels], 'tanh', dropout=self.dropout)
|
||||
|
||||
def forward(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
|
||||
""" Forward function
|
||||
|
||||
:param torch.Tensor words1: [batch size(B), premise seq len(PL)] token representation of the premise
:param torch.Tensor words2: [B, hypothesis seq len(HL)] token representation of the hypothesis
:param torch.LongTensor seq_len1: [B] length of the premise
:param torch.LongTensor seq_len2: [B] length of the hypothesis
:param torch.LongTensor target: [B] gold target values
:return: dict prediction: [B, n_labels(N)] predictions
|
||||
"""
|
||||
|
||||
premise0 = self.embedding_layer(self.embedding(words1))
|
||||
hypothesis0 = self.embedding_layer(self.embedding(words2))
|
||||
|
||||
if seq_len1 is not None:
|
||||
seq_len1 = seq_len_to_mask(seq_len1)
|
||||
else:
|
||||
seq_len1 = torch.ones(premise0.size(0), premise0.size(1))
|
||||
seq_len1 = (seq_len1.long()).to(device=premise0.device)
|
||||
if seq_len2 is not None:
|
||||
seq_len2 = seq_len_to_mask(seq_len2)
|
||||
else:
|
||||
seq_len2 = torch.ones(hypothesis0.size(0), hypothesis0.size(1))
|
||||
seq_len2 = (seq_len2.long()).to(device=hypothesis0.device)
|
||||
|
||||
_BP, _PSL, _HP = premise0.size()
|
||||
_BH, _HSL, _HH = hypothesis0.size()
|
||||
_BPL, _PLL = seq_len1.size()
|
||||
_HPL, _HLL = seq_len2.size()
|
||||
|
||||
assert _BP == _BH and _BPL == _HPL and _BP == _BPL
|
||||
assert _HP == _HH
|
||||
assert _PSL == _PLL and _HSL == _HLL
|
||||
|
||||
B, PL, H = premise0.size()
|
||||
B, HL, H = hypothesis0.size()
|
||||
|
||||
a0 = self.encoder(self.drop(premise0)) # a0: [B, PL, H * 2]
|
||||
b0 = self.encoder(self.drop(hypothesis0)) # b0: [B, HL, H * 2]
|
||||
|
||||
a = torch.mean(a0.view(B, PL, -1, H), dim=2) # a: [B, PL, H]
|
||||
b = torch.mean(b0.view(B, HL, -1, H), dim=2) # b: [B, HL, H]
|
||||
|
||||
ai, bi = self.bi_attention(a, b, seq_len1, seq_len2)
|
||||
|
||||
ma = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 4 * H]
|
||||
mb = torch.cat((b, bi, b - bi, b * bi), dim=2) # mb: [B, HL, 4 * H]
|
||||
|
||||
f_ma = self.inference_layer(ma)
|
||||
f_mb = self.inference_layer(mb)
|
||||
|
||||
vat = self.decoder(self.drop(f_ma))
|
||||
vbt = self.decoder(self.drop(f_mb))
|
||||
|
||||
va = torch.mean(vat.view(B, PL, -1, H), dim=2) # va: [B, PL, H]
|
||||
vb = torch.mean(vbt.view(B, HL, -1, H), dim=2) # vb: [B, HL, H]
|
||||
|
||||
va_ave = self.mean_pooling(va, seq_len1, dim=1) # va_ave: [B, H]
|
||||
va_max, va_arg_max = self.max_pooling(va, seq_len1, dim=1) # va_max: [B, H]
|
||||
vb_ave = self.mean_pooling(vb, seq_len2, dim=1) # vb_ave: [B, H]
|
||||
vb_max, vb_arg_max = self.max_pooling(vb, seq_len2, dim=1) # vb_max: [B, H]
|
||||
|
||||
v = torch.cat((va_ave, va_max, vb_ave, vb_max), dim=1) # v: [B, 4 * H]
|
||||
|
||||
prediction = torch.tanh(self.output(v)) # prediction: [B, N]
|
||||
|
||||
if target is not None:
|
||||
func = nn.CrossEntropyLoss()
|
||||
loss = func(prediction, target)
|
||||
return {Const.OUTPUT: prediction, Const.LOSS: loss}
|
||||
|
||||
return {Const.OUTPUT: prediction}
|
||||
|
||||
def predict(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
|
||||
""" Predict function
|
||||
|
||||
:param torch.Tensor words1: [batch size(B), premise seq len(PL)] token representation of the premise
|
||||
:param torch.Tensor words2: [B, hypothesis seq len(HL)] token representation of the hypothesis
|
||||
:param torch.LongTensor seq_len1: [B] length of each premise
|
||||
:param torch.LongTensor seq_len2: [B] length of each hypothesis
|
||||
:param torch.LongTensor target: [B] gold target labels
|
||||
:return: dict prediction: [B, n_labels(N)] prediction result
|
||||
self.embedding = init_embedding
|
||||
self.dropout_embed = EmbedDropout(p=dropout_embed)
|
||||
if hidden_size is None:
|
||||
hidden_size = self.embedding.embed_size
|
||||
self.rnn = BiRNN(self.embedding.embed_size, hidden_size, dropout_rate=dropout_rate)
|
||||
# self.rnn = LSTM(self.embedding.embed_size, hidden_size, dropout=dropout_rate, bidirectional=True)
|
||||
|
||||
self.interfere = nn.Sequential(nn.Dropout(p=dropout_rate),
|
||||
nn.Linear(8 * hidden_size, hidden_size),
|
||||
nn.ReLU())
|
||||
nn.init.xavier_uniform_(self.interfere[1].weight.data)
|
||||
self.bi_attention = SoftmaxAttention()
|
||||
|
||||
self.rnn_high = BiRNN(self.embedding.embed_size, hidden_size, dropout_rate=dropout_rate)
|
||||
# self.rnn_high = LSTM(hidden_size, hidden_size, dropout=dropout_rate, bidirectional=True,)
|
||||
|
||||
self.classifier = nn.Sequential(nn.Dropout(p=dropout_rate),
|
||||
nn.Linear(8 * hidden_size, hidden_size),
|
||||
nn.Tanh(),
|
||||
nn.Dropout(p=dropout_rate),
|
||||
nn.Linear(hidden_size, num_labels))
|
||||
|
||||
self.dropout_rnn = nn.Dropout(p=dropout_rate)
|
||||
|
||||
nn.init.xavier_uniform_(self.classifier[1].weight.data)
|
||||
nn.init.xavier_uniform_(self.classifier[4].weight.data)
|
||||
|
||||
def forward(self, words1, words2, seq_len1, seq_len2, target=None):
|
||||
"""
|
||||
prediction = self.forward(words1, words2, seq_len1, seq_len2)[Const.OUTPUT]
|
||||
return {Const.OUTPUT: torch.argmax(prediction, dim=-1)}
|
||||
:param words1: [batch, seq_len]
|
||||
:param words2: [batch, seq_len]
|
||||
:param seq_len1: [batch]
|
||||
:param seq_len2: [batch]
|
||||
:param target:
|
||||
:return:
|
||||
"""
|
||||
mask1 = seq_len_to_mask(seq_len1, words1.size(1))
|
||||
mask2 = seq_len_to_mask(seq_len2, words2.size(1))
|
||||
a0 = self.embedding(words1) # B * len * emb_dim
|
||||
b0 = self.embedding(words2)
|
||||
a0, b0 = self.dropout_embed(a0), self.dropout_embed(b0)
|
||||
a = self.rnn(a0, mask1.byte()) # a: [B, PL, 2 * H]
|
||||
b = self.rnn(b0, mask2.byte())
|
||||
# a = self.dropout_rnn(self.rnn(a0, seq_len1)[0]) # a: [B, PL, 2 * H]
|
||||
# b = self.dropout_rnn(self.rnn(b0, seq_len2)[0])
|
||||
|
||||
ai, bi = self.bi_attention(a, mask1, b, mask2)
|
||||
|
||||
a_ = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 8 * H]
|
||||
b_ = torch.cat((b, bi, b - bi, b * bi), dim=2)
|
||||
a_f = self.interfere(a_)
|
||||
b_f = self.interfere(b_)
|
||||
|
||||
a_h = self.rnn_high(a_f, mask1.byte()) # ma: [B, PL, 2 * H]
|
||||
b_h = self.rnn_high(b_f, mask2.byte())
|
||||
# a_h = self.dropout_rnn(self.rnn_high(a_f, seq_len1)[0]) # ma: [B, PL, 2 * H]
|
||||
# b_h = self.dropout_rnn(self.rnn_high(b_f, seq_len2)[0])
|
||||
|
||||
a_avg = self.mean_pooling(a_h, mask1, dim=1)
|
||||
a_max, _ = self.max_pooling(a_h, mask1, dim=1)
|
||||
b_avg = self.mean_pooling(b_h, mask2, dim=1)
|
||||
b_max, _ = self.max_pooling(b_h, mask2, dim=1)
|
||||
|
||||
out = torch.cat((a_avg, a_max, b_avg, b_max), dim=1) # v: [B, 8 * H]
|
||||
logits = torch.tanh(self.classifier(out))
|
||||
|
||||
if target is not None:
|
||||
loss_fct = CrossEntropyLoss()
|
||||
loss = loss_fct(logits, target)
|
||||
|
||||
return {Const.LOSS: loss, Const.OUTPUT: logits}
|
||||
else:
|
||||
return {Const.OUTPUT: logits}
|
||||
|
||||
def predict(self, **kwargs):
|
||||
pred = self.forward(**kwargs)[Const.OUTPUT].argmax(-1)
|
||||
return {Const.OUTPUT: pred}
|
||||
|
||||
# input [batch_size, len , hidden]
|
||||
# mask [batch_size, len] (111...00)
|
||||
@staticmethod
|
||||
def mean_pooling(input, mask, dim=1):
|
||||
masks = mask.view(mask.size(0), mask.size(1), -1).float()
|
||||
return torch.sum(input * masks, dim=dim) / torch.sum(masks, dim=1)
|
||||
|
||||
@staticmethod
|
||||
def max_pooling(input, mask, dim=1):
|
||||
my_inf = 10e12
|
||||
masks = mask.view(mask.size(0), mask.size(1), -1)
|
||||
masks = masks.expand(-1, -1, input.size(2)).float()
|
||||
return torch.max(input + masks.le(0.5).float() * -my_inf, dim=dim)
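The two pooling helpers above skip padded positions: the mean divides by the number of valid tokens, and the max shifts padded entries far below any real value before reducing. A rough, dependency-free sketch of the same idea follows. It uses plain Python lists instead of tensors, and the names `masked_mean_pool` and `masked_max_pool` are illustrative assumptions, not part of fastNLP.

```python
def masked_mean_pool(rows, mask):
    """Average the vectors in `rows` whose mask flag is 1.

    rows: list of L vectors, each a list of H floats.
    mask: list of L flags (1 = real token, 0 = padding).
    """
    total = sum(mask)
    if total == 0:
        raise ValueError("mask selects no positions")
    # Multiply by the mask so padded rows contribute nothing to the sum.
    return [sum(v * m for v, m in zip(col, mask)) / total for col in zip(*rows)]


def masked_max_pool(rows, mask, big=1e13):
    """Column-wise max over the vectors whose mask flag is 1."""
    # Shift padded rows down by a large constant so they never win the max.
    shifted = [[v - big * (1 - m) for v in row] for row, m in zip(rows, mask)]
    return [max(col) for col in zip(*shifted)]


rows = [[1.0, -2.0], [3.0, -4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
mean_vec = masked_mean_pool(rows, mask)
max_vec = masked_max_pool(rows, mask)
```

The padded row holds the largest raw values on purpose, so a pooling routine that ignored the mask would give visibly different results.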
|
||||
|
||||
|
||||
class EmbedDropout(nn.Dropout):
|
||||
|
||||
def forward(self, sequences_batch):
|
||||
ones = sequences_batch.data.new_ones(sequences_batch.shape[0], sequences_batch.shape[-1])
|
||||
dropout_mask = nn.functional.dropout(ones, self.p, self.training, inplace=False)
|
||||
return dropout_mask.unsqueeze(1) * sequences_batch
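`EmbedDropout` above draws one dropout mask of shape [batch, embed_dim] and broadcasts it over the time axis, so a dropped embedding dimension is zero at every token of that sequence. The sketch below mimics that behaviour with nested Python lists. The function name `embed_dropout` and the explicit `rng` argument are assumptions made for illustration.

```python
import random


def embed_dropout(batch, p, rng, training=True):
    """Drop whole embedding dimensions, one mask per sequence.

    batch: nested lists shaped [B][L][D].
    p: probability of zeroing a dimension (0 <= p < 1).
    rng: a random.Random instance, passed in so results are reproducible.
    """
    if not training or p == 0.0:
        return batch
    scale = 1.0 / (1.0 - p)  # inverted dropout keeps the expected value unchanged
    out = []
    for seq in batch:
        dim = len(seq[0])
        # One keep/drop decision per dimension, reused at every timestep.
        dim_mask = [0.0 if rng.random() < p else scale for _ in range(dim)]
        out.append([[v * m for v, m in zip(tok, dim_mask)] for tok in seq])
    return out


batch = [[[1.0] * 8 for _ in range(3)] for _ in range(2)]  # B=2, L=3, D=8
dropped = embed_dropout(batch, p=0.5, rng=random.Random(0))
```

Because the input is all ones, every token within one sequence comes out identical, which makes the shared mask easy to see.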
|
||||
|
||||
|
||||
class BiRNN(nn.Module):
|
||||
def __init__(self, input_size, hidden_size, dropout_rate=0.3):
|
||||
super(BiRNN, self).__init__()
|
||||
self.dropout_rate = dropout_rate
|
||||
self.rnn = nn.LSTM(input_size, hidden_size,
|
||||
num_layers=1,
|
||||
bidirectional=True,
|
||||
batch_first=True)
|
||||
|
||||
def forward(self, x, x_mask):
|
||||
# Sort x
|
||||
lengths = x_mask.data.eq(1).long().sum(1)
|
||||
_, idx_sort = torch.sort(lengths, dim=0, descending=True)
|
||||
_, idx_unsort = torch.sort(idx_sort, dim=0)
|
||||
lengths = list(lengths[idx_sort])
|
||||
|
||||
x = x.index_select(0, idx_sort)
|
||||
# Pack it up
|
||||
rnn_input = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True)
|
||||
# Apply dropout to input
|
||||
if self.dropout_rate > 0:
|
||||
dropout_input = F.dropout(rnn_input.data, p=self.dropout_rate, training=self.training)
|
||||
rnn_input = nn.utils.rnn.PackedSequence(dropout_input, rnn_input.batch_sizes)
|
||||
output = self.rnn(rnn_input)[0]
|
||||
# Unpack everything
|
||||
output = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)[0]
|
||||
output = output.index_select(0, idx_unsort)
|
||||
if output.size(1) != x_mask.size(1):
|
||||
padding = torch.zeros(output.size(0),
|
||||
x_mask.size(1) - output.size(1),
|
||||
output.size(2)).type(output.data.type())
|
||||
output = torch.cat([output, padding], 1)
|
||||
return output
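`BiRNN.forward` sorts the batch by length before packing and restores the original order afterwards, using the fact that the argsort of a permutation is its inverse. A small sketch of that index bookkeeping follows, in plain Python rather than torch. `sort_by_length` is a hypothetical helper name.

```python
def sort_by_length(lengths):
    """Return (idx_sort, idx_unsort) for a list of sequence lengths."""
    # Positions of the sequences, longest first.
    idx_sort = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    # Argsort of idx_sort is its inverse permutation.
    idx_unsort = sorted(range(len(idx_sort)), key=lambda i: idx_sort[i])
    return idx_sort, idx_unsort


lengths = [2, 5, 3]
batch = ["a", "b", "c"]
idx_sort, idx_unsort = sort_by_length(lengths)
sorted_batch = [batch[i] for i in idx_sort]       # order fed to the packed RNN
restored = [sorted_batch[i] for i in idx_unsort]  # original order again
```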
|
||||
|
||||
|
||||
def masked_softmax(tensor, mask):
|
||||
tensor_shape = tensor.size()
|
||||
reshaped_tensor = tensor.view(-1, tensor_shape[-1])
|
||||
|
||||
# Reshape the mask so it matches the size of the input tensor.
|
||||
while mask.dim() < tensor.dim():
|
||||
mask = mask.unsqueeze(1)
|
||||
mask = mask.expand_as(tensor).contiguous().float()
|
||||
reshaped_mask = mask.view(-1, mask.size()[-1])
|
||||
result = F.softmax(reshaped_tensor * reshaped_mask, dim=-1)
|
||||
result = result * reshaped_mask
|
||||
# 1e-13 is added to avoid divisions by zero.
|
||||
result = result / (result.sum(dim=-1, keepdim=True) + 1e-13)
|
||||
return result.view(*tensor_shape)
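`masked_softmax` above multiplies the scores by the mask, applies a softmax, zeroes the padded positions, and renormalises. The sketch below follows the same steps for a single row using only the standard library. Subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import math


def masked_softmax(scores, mask, eps=1e-13):
    """Softmax over `scores` that assigns zero weight where mask is 0."""
    masked = [s * m for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    # Zero out padded positions, then renormalise the surviving weights.
    probs = [e / total * m for e, m in zip(exps, mask)]
    norm = sum(probs) + eps  # eps avoids division by zero for an all-zero mask
    return [p / norm for p in probs]


weights = masked_softmax([1.0, 2.0, 3.0], [1, 1, 0])
```

The third position has the highest raw score but is masked out, so its weight is exactly zero and the remaining two weights sum to one.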
|
||||
|
||||
|
||||
def weighted_sum(tensor, weights, mask):
|
||||
w_sum = weights.bmm(tensor)
|
||||
while mask.dim() < w_sum.dim():
|
||||
mask = mask.unsqueeze(1)
|
||||
mask = mask.transpose(-1, -2)
|
||||
mask = mask.expand_as(w_sum).contiguous().float()
|
||||
return w_sum * mask
|
||||
|
||||
|
||||
class SoftmaxAttention(nn.Module):
|
||||
|
||||
def forward(self, premise_batch, premise_mask, hypothesis_batch, hypothesis_mask):
|
||||
similarity_matrix = premise_batch.bmm(hypothesis_batch.transpose(2, 1)
|
||||
.contiguous())
|
||||
|
||||
prem_hyp_attn = masked_softmax(similarity_matrix, hypothesis_mask)
|
||||
hyp_prem_attn = masked_softmax(similarity_matrix.transpose(1, 2)
|
||||
.contiguous(),
|
||||
premise_mask)
|
||||
|
||||
attended_premises = weighted_sum(hypothesis_batch,
|
||||
prem_hyp_attn,
|
||||
premise_mask)
|
||||
attended_hypotheses = weighted_sum(premise_batch,
|
||||
hyp_prem_attn,
|
||||
hypothesis_mask)
|
||||
|
||||
return attended_premises, attended_hypotheses
|
||||
|
@ -46,8 +46,8 @@ class StarTransEnc(nn.Module):
|
||||
super(StarTransEnc, self).__init__()
|
||||
self.embedding = get_embeddings(init_embed)
|
||||
emb_dim = self.embedding.embedding_dim
|
||||
#self.emb_fc = nn.Linear(emb_dim, hidden_size)
|
||||
self.emb_drop = nn.Dropout(emb_dropout)
|
||||
self.emb_fc = nn.Linear(emb_dim, hidden_size)
|
||||
# self.emb_drop = nn.Dropout(emb_dropout)
|
||||
self.encoder = StarTransformer(hidden_size=hidden_size,
|
||||
num_layers=num_layers,
|
||||
num_head=num_head,
|
||||
@ -65,7 +65,7 @@ class StarTransEnc(nn.Module):
|
||||
[batch, hidden] the global relay node; see the paper for details
|
||||
"""
|
||||
x = self.embedding(x)
|
||||
#x = self.emb_fc(self.emb_drop(x))
|
||||
x = self.emb_fc(x)
|
||||
nodes, relay = self.encoder(x, mask)
|
||||
return nodes, relay
|
||||
|
||||
|
@ -1,11 +1,11 @@
|
||||
"""
|
||||
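Most neural networks used for NLP tasks can be viewed as consisting of encoding :mod:`~fastNLP.modules.encoder` ,
|
||||
aggregation :mod:`~fastNLP.modules.aggregator` , and decoding :mod:`~fastNLP.modules.decoder` modules, three kinds in total.
|
||||
and decoding :mod:`~fastNLP.modules.decoder` modules, two kinds in total.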
|
||||
|
||||
.. image:: figures/text_classification.png
|
||||
|
||||
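:mod:`~fastNLP.modules` implements the many module components provided by fastNLP, which help users quickly build the networks they need.
|
||||
The functionality and common components of the three kinds of modules are as follows:
|
||||
The functionality and common components of the two kinds of modules are as follows: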
|
||||
|
||||
+-----------------------+-----------------------+-----------------------+
|
||||
| module type | functionality | example |
|
||||
@ -13,9 +13,6 @@
|
||||
| encoder | encodes input into | embedding, RNN, CNN, |
|
||||
| | expressive vectors | transformer |
|
||||
+-----------------------+-----------------------+-----------------------+
|
||||
| aggregator | aggregates info from | self-attention, |
|
||||
| | multiple vectors | max-pooling |
|
||||
+-----------------------+-----------------------+-----------------------+
|
||||
| decoder | decodes meaningful | MLP, CRF |
|
||||
| | vectors into the | |
|
||||
| | required output form | |
|
||||
@ -46,10 +43,8 @@ __all__ = [
|
||||
"allowed_transitions",
|
||||
]
|
||||
|
||||
from . import aggregator
|
||||
from . import decoder
|
||||
from . import encoder
|
||||
from .aggregator import *
|
||||
from .decoder import *
|
||||
from .dropout import TimestepDropout
|
||||
from .encoder import *
|
||||
|
@ -1,14 +0,0 @@
|
||||
__all__ = [
|
||||
"MaxPool",
|
||||
"MaxPoolWithMask",
|
||||
"AvgPool",
|
||||
|
||||
"MultiHeadAttention",
|
||||
]
|
||||
|
||||
from .pooling import MaxPool
|
||||
from .pooling import MaxPoolWithMask
|
||||
from .pooling import AvgPool
|
||||
from .pooling import AvgPoolWithMask
|
||||
|
||||
from .attention import MultiHeadAttention
|
@ -22,7 +22,14 @@ __all__ = [
|
||||
|
||||
"VarRNN",
|
||||
"VarLSTM",
|
||||
"VarGRU"
|
||||
"VarGRU",
|
||||
|
||||
"MaxPool",
|
||||
"MaxPoolWithMask",
|
||||
"AvgPool",
|
||||
"AvgPoolWithMask",
|
||||
|
||||
"MultiHeadAttention",
|
||||
]
|
||||
from ._bert import BertModel
|
||||
from .bert import BertWordPieceEncoder
|
||||
@ -34,3 +41,6 @@ from .lstm import LSTM
|
||||
from .star_transformer import StarTransformer
|
||||
from .transformer import TransformerEncoder
|
||||
from .variational_rnn import VarRNN, VarLSTM, VarGRU
|
||||
|
||||
from .pooling import MaxPool, MaxPoolWithMask, AvgPool, AvgPoolWithMask
|
||||
from .attention import MultiHeadAttention
|
||||
|
@ -6,14 +6,13 @@ from typing import Optional, Tuple, List, Callable
|
||||
|
||||
import os
|
||||
|
||||
import h5py
|
||||
import numpy
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
from torch.nn.utils.rnn import PackedSequence, pad_packed_sequence
|
||||
from ...core.vocabulary import Vocabulary
|
||||
import json
|
||||
import pickle
|
||||
|
||||
from ..utils import get_dropout_mask
|
||||
import codecs
|
||||
@ -244,13 +243,13 @@ class LstmbiLm(nn.Module):
|
||||
def __init__(self, config):
|
||||
super(LstmbiLm, self).__init__()
|
||||
self.config = config
|
||||
self.encoder = nn.LSTM(self.config['encoder']['projection_dim'],
|
||||
self.config['encoder']['dim'],
|
||||
num_layers=self.config['encoder']['n_layers'],
|
||||
self.encoder = nn.LSTM(self.config['lstm']['projection_dim'],
|
||||
self.config['lstm']['dim'],
|
||||
num_layers=self.config['lstm']['n_layers'],
|
||||
bidirectional=True,
|
||||
batch_first=True,
|
||||
dropout=self.config['dropout'])
|
||||
self.projection = nn.Linear(self.config['encoder']['dim'], self.config['encoder']['projection_dim'], bias=True)
|
||||
self.projection = nn.Linear(self.config['lstm']['dim'], self.config['lstm']['projection_dim'], bias=True)
|
||||
|
||||
def forward(self, inputs, seq_len):
|
||||
sort_lens, sort_idx = torch.sort(seq_len, dim=0, descending=True)
|
||||
@ -260,7 +259,7 @@ class LstmbiLm(nn.Module):
|
||||
output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=self.batch_first)
|
||||
_, unsort_idx = torch.sort(sort_idx, dim=0, descending=False)
|
||||
output = output[unsort_idx]
|
||||
forward, backward = output.split(self.config['encoder']['dim'], 2)
|
||||
forward, backward = output.split(self.config['lstm']['dim'], 2)
|
||||
return torch.cat([self.projection(forward), self.projection(backward)], dim=2)
|
||||
|
||||
|
||||
@ -268,13 +267,13 @@ class ElmobiLm(torch.nn.Module):
|
||||
def __init__(self, config):
|
||||
super(ElmobiLm, self).__init__()
|
||||
self.config = config
|
||||
input_size = config['encoder']['projection_dim']
|
||||
hidden_size = config['encoder']['projection_dim']
|
||||
cell_size = config['encoder']['dim']
|
||||
num_layers = config['encoder']['n_layers']
|
||||
memory_cell_clip_value = config['encoder']['cell_clip']
|
||||
state_projection_clip_value = config['encoder']['proj_clip']
|
||||
recurrent_dropout_probability = config['dropout']
|
||||
input_size = config['lstm']['projection_dim']
|
||||
hidden_size = config['lstm']['projection_dim']
|
||||
cell_size = config['lstm']['dim']
|
||||
num_layers = config['lstm']['n_layers']
|
||||
memory_cell_clip_value = config['lstm']['cell_clip']
|
||||
state_projection_clip_value = config['lstm']['proj_clip']
|
||||
recurrent_dropout_probability = 0.0
|
||||
|
||||
self.input_size = input_size
|
||||
self.hidden_size = hidden_size
|
||||
@ -409,199 +408,50 @@ class ElmobiLm(torch.nn.Module):
|
||||
torch.cat(final_memory_states, 0))
|
||||
return stacked_sequence_outputs, final_state_tuple
|
||||
|
||||
def load_weights(self, weight_file: str) -> None:
|
||||
"""
|
||||
Load the pre-trained weights from the file.
|
||||
"""
|
||||
requires_grad = False
|
||||
|
||||
with h5py.File(weight_file, 'r') as fin:
|
||||
for i_layer, lstms in enumerate(
|
||||
zip(self.forward_layers, self.backward_layers)
|
||||
):
|
||||
for j_direction, lstm in enumerate(lstms):
|
||||
# lstm is an instance of LSTMCellWithProjection
|
||||
cell_size = lstm.cell_size
|
||||
|
||||
dataset = fin['RNN_%s' % j_direction]['RNN']['MultiRNNCell']['Cell%s' % i_layer
|
||||
]['LSTMCell']
|
||||
|
||||
# tensorflow packs together both W and U matrices into one matrix,
|
||||
# but pytorch maintains individual matrices. In addition, tensorflow
|
||||
# packs the gates as input, memory, forget, output but pytorch
|
||||
# uses input, forget, memory, output. So we need to modify the weights.
|
||||
tf_weights = numpy.transpose(dataset['W_0'][...])
|
||||
torch_weights = tf_weights.copy()
|
||||
|
||||
# split the W from U matrices
|
||||
input_size = lstm.input_size
|
||||
input_weights = torch_weights[:, :input_size]
|
||||
recurrent_weights = torch_weights[:, input_size:]
|
||||
tf_input_weights = tf_weights[:, :input_size]
|
||||
tf_recurrent_weights = tf_weights[:, input_size:]
|
||||
|
||||
# handle the different gate order convention
|
||||
for torch_w, tf_w in [[input_weights, tf_input_weights],
|
||||
[recurrent_weights, tf_recurrent_weights]]:
|
||||
torch_w[(1 * cell_size):(2 * cell_size), :] = tf_w[(2 * cell_size):(3 * cell_size), :]
|
||||
torch_w[(2 * cell_size):(3 * cell_size), :] = tf_w[(1 * cell_size):(2 * cell_size), :]
|
||||
|
||||
lstm.input_linearity.weight.data.copy_(torch.FloatTensor(input_weights))
|
||||
lstm.state_linearity.weight.data.copy_(torch.FloatTensor(recurrent_weights))
|
||||
lstm.input_linearity.weight.requires_grad = requires_grad
|
||||
lstm.state_linearity.weight.requires_grad = requires_grad
|
||||
|
||||
# the bias weights
|
||||
tf_bias = dataset['B'][...]
|
||||
# tensorflow adds 1.0 to forget gate bias instead of modifying the
|
||||
# parameters...
|
||||
tf_bias[(2 * cell_size):(3 * cell_size)] += 1
|
||||
torch_bias = tf_bias.copy()
|
||||
torch_bias[(1 * cell_size):(2 * cell_size)
|
||||
] = tf_bias[(2 * cell_size):(3 * cell_size)]
|
||||
torch_bias[(2 * cell_size):(3 * cell_size)
|
||||
] = tf_bias[(1 * cell_size):(2 * cell_size)]
|
||||
lstm.state_linearity.bias.data.copy_(torch.FloatTensor(torch_bias))
|
||||
lstm.state_linearity.bias.requires_grad = requires_grad
|
||||
|
||||
# the projection weights
|
||||
proj_weights = numpy.transpose(dataset['W_P_0'][...])
|
||||
lstm.state_projection.weight.data.copy_(torch.FloatTensor(proj_weights))
|
||||
lstm.state_projection.weight.requires_grad = requires_grad
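According to the comments in the removed `load_weights` method above, the TensorFlow checkpoint stacks the LSTM gates as input, memory, forget, output, while the PyTorch cell expects input, forget, memory, output, so the second and third blocks are swapped. A minimal sketch of that swap on a plain list follows. The helper name `tf_to_torch_gate_order` is an assumption, and the string labels stand in for weight rows.

```python
def tf_to_torch_gate_order(rows, cell_size):
    """Reorder stacked LSTM gate blocks from (i, m, f, o) to (i, f, m, o)."""
    c = cell_size
    if len(rows) != 4 * c:
        raise ValueError("expected four gate blocks of size cell_size")
    # Blocks 0 and 3 stay in place; blocks 1 and 2 trade places.
    return rows[:c] + rows[2 * c:3 * c] + rows[c:2 * c] + rows[3 * c:]


tf_rows = ["i0", "i1", "m0", "m1", "f0", "f1", "o0", "o1"]
torch_rows = tf_to_torch_gate_order(tf_rows, cell_size=2)
```

The swap is its own inverse, so applying the function twice returns the original order.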
|
||||
|
||||
|
||||
class LstmTokenEmbedder(nn.Module):
|
||||
def __init__(self, config, word_emb_layer, char_emb_layer):
|
||||
super(LstmTokenEmbedder, self).__init__()
|
||||
self.config = config
|
||||
self.word_emb_layer = word_emb_layer
|
||||
self.char_emb_layer = char_emb_layer
|
||||
self.output_dim = config['encoder']['projection_dim']
|
||||
emb_dim = 0
|
||||
if word_emb_layer is not None:
|
||||
emb_dim += word_emb_layer.n_d
|
||||
|
||||
if char_emb_layer is not None:
|
||||
emb_dim += char_emb_layer.n_d * 2
|
||||
self.char_lstm = nn.LSTM(char_emb_layer.n_d, char_emb_layer.n_d, num_layers=1, bidirectional=True,
|
||||
batch_first=True, dropout=config['dropout'])
|
||||
|
||||
self.projection = nn.Linear(emb_dim, self.output_dim, bias=True)
|
||||
|
||||
def forward(self, words, chars):
|
||||
embs = []
|
||||
if self.word_emb_layer is not None:
|
||||
if hasattr(self, 'words_to_words'):
|
||||
words = self.words_to_words[words]
|
||||
word_emb = self.word_emb_layer(words)
|
||||
embs.append(word_emb)
|
||||
|
||||
if self.char_emb_layer is not None:
|
||||
batch_size, seq_len, _ = chars.shape
|
||||
chars = chars.view(batch_size * seq_len, -1)
|
||||
chars_emb = self.char_emb_layer(chars)
|
||||
# TODO: seq_len should be taken into account here
|
||||
_, (chars_outputs, __) = self.char_lstm(chars_emb)
|
||||
chars_outputs = chars_outputs.contiguous().view(-1, self.config['token_embedder']['embedding']['dim'] * 2)
|
||||
embs.append(chars_outputs)
|
||||
|
||||
token_embedding = torch.cat(embs, dim=2)
|
||||
|
||||
return self.projection(token_embedding)
|
||||
|
||||
|
||||
class ConvTokenEmbedder(nn.Module):
|
||||
def __init__(self, config, weight_file, word_emb_layer, char_emb_layer, char_vocab):
|
||||
def __init__(self, config, weight_file, word_emb_layer, char_emb_layer):
|
||||
super(ConvTokenEmbedder, self).__init__()
|
||||
self.weight_file = weight_file
|
||||
self.word_emb_layer = word_emb_layer
|
||||
self.char_emb_layer = char_emb_layer
|
||||
|
||||
self.output_dim = config['encoder']['projection_dim']
|
||||
self.output_dim = config['lstm']['projection_dim']
|
||||
self._options = config
|
||||
self.requires_grad = False
|
||||
self._load_weights()
|
||||
self._char_embedding_weights = char_emb_layer.weight.data
|
||||
|
||||
def _load_weights(self):
|
||||
self._load_cnn_weights()
|
||||
self._load_highway()
|
||||
self._load_projection()
|
||||
char_cnn_options = self._options['char_cnn']
|
||||
if char_cnn_options['activation'] == 'tanh':
|
||||
self.activation = torch.tanh
|
||||
elif char_cnn_options['activation'] == 'relu':
|
||||
self.activation = torch.nn.functional.relu
|
||||
else:
|
||||
raise Exception("Unknown activation")
|
||||
|
||||
def _load_cnn_weights(self):
|
||||
cnn_options = self._options['token_embedder']
|
||||
filters = cnn_options['filters']
|
||||
char_embed_dim = cnn_options['embedding']['dim']
|
||||
if char_emb_layer is not None:
|
||||
self.char_conv = []
|
||||
cnn_config = config['char_cnn']
|
||||
filters = cnn_config['filters']
|
||||
char_embed_dim = cnn_config['embedding']['dim']
|
||||
convolutions = []
|
||||
|
||||
convolutions = []
|
||||
for i, (width, num) in enumerate(filters):
|
||||
conv = torch.nn.Conv1d(
|
||||
in_channels=char_embed_dim,
|
||||
out_channels=num,
|
||||
kernel_size=width,
|
||||
bias=True
|
||||
)
|
||||
# load the weights
|
||||
with h5py.File(self.weight_file, 'r') as fin:
|
||||
weight = fin['CNN']['W_cnn_{}'.format(i)][...]
|
||||
bias = fin['CNN']['b_cnn_{}'.format(i)][...]
|
||||
for i, (width, num) in enumerate(filters):
|
||||
conv = torch.nn.Conv1d(
|
||||
in_channels=char_embed_dim,
|
||||
out_channels=num,
|
||||
kernel_size=width,
|
||||
bias=True
|
||||
)
|
||||
convolutions.append(conv)
|
||||
self.add_module('char_conv_{}'.format(i), conv)
|
||||
|
||||
w_reshaped = numpy.transpose(weight.squeeze(axis=0), axes=(2, 1, 0))
|
||||
if w_reshaped.shape != tuple(conv.weight.data.shape):
|
||||
raise ValueError("Invalid weight file")
|
||||
conv.weight.data.copy_(torch.FloatTensor(w_reshaped))
|
||||
conv.bias.data.copy_(torch.FloatTensor(bias))
|
||||
self._convolutions = convolutions
|
||||
|
||||
conv.weight.requires_grad = self.requires_grad
|
||||
conv.bias.requires_grad = self.requires_grad
|
||||
n_filters = sum(f[1] for f in filters)
|
||||
n_highway = cnn_config['n_highway']
|
||||
|
||||
convolutions.append(conv)
|
||||
self.add_module('char_conv_{}'.format(i), conv)
|
||||
self._highways = Highway(n_filters, n_highway, activation=torch.nn.functional.relu)
|
||||
|
||||
self._convolutions = convolutions
|
||||
|
||||
def _load_highway(self):
|
||||
# the highway layers have same dimensionality as the number of cnn filters
|
||||
cnn_options = self._options['token_embedder']
|
||||
filters = cnn_options['filters']
|
||||
n_filters = sum(f[1] for f in filters)
|
||||
n_highway = cnn_options['n_highway']
|
||||
|
||||
# create the layers, and load the weights
|
||||
self._highways = Highway(n_filters, n_highway, activation=torch.nn.functional.relu)
|
||||
for k in range(n_highway):
|
||||
# The AllenNLP highway is one matrix multiplication with concatenation of
|
||||
# transform and carry weights.
|
||||
with h5py.File(self.weight_file, 'r') as fin:
|
||||
# The weights are transposed due to multiplication order assumptions in tf
|
||||
# vs pytorch (tf.matmul(X, W) vs pytorch.matmul(W, X))
|
||||
w_transform = numpy.transpose(fin['CNN_high_{}'.format(k)]['W_transform'][...])
|
||||
# -1.0 since AllenNLP is g * x + (1 - g) * f(x) but tf is (1 - g) * x + g * f(x)
|
||||
w_carry = -1.0 * numpy.transpose(fin['CNN_high_{}'.format(k)]['W_carry'][...])
|
||||
weight = numpy.concatenate([w_transform, w_carry], axis=0)
|
||||
self._highways._layers[k].weight.data.copy_(torch.FloatTensor(weight))
|
||||
self._highways._layers[k].weight.requires_grad = self.requires_grad
|
||||
|
||||
b_transform = fin['CNN_high_{}'.format(k)]['b_transform'][...]
|
||||
b_carry = -1.0 * fin['CNN_high_{}'.format(k)]['b_carry'][...]
|
||||
bias = numpy.concatenate([b_transform, b_carry], axis=0)
|
||||
self._highways._layers[k].bias.data.copy_(torch.FloatTensor(bias))
|
||||
self._highways._layers[k].bias.requires_grad = self.requires_grad
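The `-1.0` factor on the carry weights and bias in the removed code above relies on the identity sigmoid(-z) = 1 - sigmoid(z): negating the gate's pre-activation turns one highway convention into the other. The following sketch checks this numerically for scalars. The function names are illustrative, and the two formulas are taken from the comment in the diff.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_tf(x, fx, z):
    """TensorFlow convention: the gate weights the transformed value f(x)."""
    g = sigmoid(z)
    return (1.0 - g) * x + g * fx


def highway_allennlp(x, fx, z):
    """AllenNLP convention: the gate weights the untouched input x."""
    g = sigmoid(z)
    return g * x + (1.0 - g) * fx


x, fx, z = 0.7, -1.3, 0.4  # arbitrary input, transformed value, gate pre-activation
```

Evaluating `highway_allennlp(x, fx, -z)` gives the same value as `highway_tf(x, fx, z)`, which is why flipping the sign of the carry parameters is enough to reuse the TensorFlow weights.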
|
||||
|
||||
def _load_projection(self):
|
||||
cnn_options = self._options['token_embedder']
|
||||
filters = cnn_options['filters']
|
||||
n_filters = sum(f[1] for f in filters)
|
||||
|
||||
self._projection = torch.nn.Linear(n_filters, self.output_dim, bias=True)
|
||||
with h5py.File(self.weight_file, 'r') as fin:
|
||||
weight = fin['CNN_proj']['W_proj'][...]
|
||||
bias = fin['CNN_proj']['b_proj'][...]
|
||||
self._projection.weight.data.copy_(torch.FloatTensor(numpy.transpose(weight)))
|
||||
self._projection.bias.data.copy_(torch.FloatTensor(bias))
|
||||
|
||||
self._projection.weight.requires_grad = self.requires_grad
|
||||
self._projection.bias.requires_grad = self.requires_grad
|
||||
self._projection = torch.nn.Linear(n_filters, self.output_dim, bias=True)
|
||||
|
||||
def forward(self, words, chars):
|
||||
"""
|
||||
@ -616,15 +466,8 @@ class ConvTokenEmbedder(nn.Module):
|
||||
# self._char_embedding_weights
|
||||
# )
|
||||
batch_size, sequence_length, max_char_len = chars.size()
|
||||
character_embedding = self.char_emb_layer(chars).reshape(batch_size*sequence_length, max_char_len, -1)
|
||||
character_embedding = self.char_emb_layer(chars).reshape(batch_size * sequence_length, max_char_len, -1)
|
||||
# run convolutions
|
||||
cnn_options = self._options['token_embedder']
|
||||
if cnn_options['activation'] == 'tanh':
|
||||
activation = torch.tanh
|
||||
elif cnn_options['activation'] == 'relu':
|
||||
activation = torch.nn.functional.relu
|
||||
else:
|
||||
raise Exception("Unknown activation")
|
||||
|
||||
# (batch_size * sequence_length, embed_dim, max_chars_per_token)
|
||||
character_embedding = torch.transpose(character_embedding, 1, 2)
|
||||
@ -634,7 +477,7 @@ class ConvTokenEmbedder(nn.Module):
|
||||
convolved = conv(character_embedding)
|
||||
# (batch_size * sequence_length, n_filters for this width)
|
||||
convolved, _ = torch.max(convolved, dim=-1)
|
||||
convolved = activation(convolved)
|
||||
convolved = self.activation(convolved)
|
||||
convs.append(convolved)
|
||||
|
||||
# (batch_size * sequence_length, n_filters)
|
||||
@ -712,8 +555,8 @@ class _ElmoModel(nn.Module):
|
||||
|
||||
def __init__(self, model_dir: str, vocab: Vocabulary = None, cache_word_reprs: bool = False):
|
||||
super(_ElmoModel, self).__init__()
|
||||
|
||||
dir = os.walk(model_dir)
|
||||
self.model_dir = model_dir
|
||||
dir = os.walk(self.model_dir)
|
||||
config_file = None
|
||||
weight_file = None
|
||||
config_count = 0
|
||||
@ -723,7 +566,7 @@ class _ElmoModel(nn.Module):
|
||||
if file_name.__contains__(".json"):
|
||||
config_file = file_name
|
||||
config_count += 1
|
||||
elif file_name.__contains__(".hdf5"):
|
||||
elif file_name.__contains__(".pkl"):
|
||||
weight_file = file_name
|
||||
weight_count += 1
|
||||
if config_count > 1 or weight_count > 1:
|
||||
@ -734,7 +577,6 @@ class _ElmoModel(nn.Module):
|
||||
config = json.load(open(os.path.join(model_dir, config_file), 'r'))
|
||||
self.weight_file = os.path.join(model_dir, weight_file)
|
||||
self.config = config
|
||||
self.requires_grad = False
|
||||
|
||||
OOV_TAG = '<oov>'
|
||||
PAD_TAG = '<pad>'
|
||||
@ -744,102 +586,84 @@ class _ElmoModel(nn.Module):
|
||||
EOW_TAG = '<eow>'
|
||||
|
||||
# For the model trained with character-based word encoder.
|
||||
if config['token_embedder']['embedding']['dim'] > 0:
|
||||
char_lexicon = {}
|
||||
with codecs.open(os.path.join(model_dir, 'char.dic'), 'r', encoding='utf-8') as fpi:
|
||||
for line in fpi:
|
||||
tokens = line.strip().split('\t')
|
||||
if len(tokens) == 1:
|
||||
tokens.insert(0, '\u3000')
|
||||
token, i = tokens
|
||||
char_lexicon[token] = int(i)
|
||||
char_lexicon = {}
|
||||
with codecs.open(os.path.join(model_dir, 'char.dic'), 'r', encoding='utf-8') as fpi:
|
||||
for line in fpi:
|
||||
tokens = line.strip().split('\t')
|
||||
if len(tokens) == 1:
|
||||
tokens.insert(0, '\u3000')
|
||||
token, i = tokens
|
||||
char_lexicon[token] = int(i)
|
||||
|
||||
# Run some sanity checks
|
||||
for special_word in [PAD_TAG, OOV_TAG, BOW_TAG, EOW_TAG]:
|
||||
assert special_word in char_lexicon, f"{special_word} not found in char.dic."
|
||||
# Run some sanity checks
|
||||
for special_word in [PAD_TAG, OOV_TAG, BOW_TAG, EOW_TAG]:
|
||||
assert special_word in char_lexicon, f"{special_word} not found in char.dic."
|
||||
|
||||
# Build char_vocab from vocab
|
||||
char_vocab = Vocabulary(unknown=OOV_TAG, padding=PAD_TAG)
|
||||
# Make sure <bow> and <eow> are included
|
||||
char_vocab.add_word_lst([BOW_TAG, EOW_TAG, BOS_TAG, EOS_TAG])
|
||||
# Build char_vocab from vocab
|
||||
char_vocab = Vocabulary(unknown=OOV_TAG, padding=PAD_TAG)
|
||||
# Make sure <bow> and <eow> are included
|
||||
char_vocab.add_word_lst([BOW_TAG, EOW_TAG, BOS_TAG, EOS_TAG])
|
||||
|
||||
for word, index in vocab:
|
||||
char_vocab.add_word_lst(list(word))
|
||||
for word, index in vocab:
|
||||
char_vocab.add_word_lst(list(word))
|
||||
|
||||
self.bos_index, self.eos_index, self._pad_index = len(vocab), len(vocab)+1, vocab.padding_idx
|
||||
# Adjust according to char_lexicon; one extra slot is reserved for word padding (the char representation at that position is all zeros)
|
||||
char_emb_layer = nn.Embedding(len(char_vocab)+1, int(config['token_embedder']['embedding']['dim']),
|
||||
padding_idx=len(char_vocab))
|
||||
with h5py.File(self.weight_file, 'r') as fin:
|
||||
char_embed_weights = fin['char_embed'][...]
|
||||
char_embed_weights = torch.from_numpy(char_embed_weights)
|
||||
found_char_count = 0
|
||||
for char, index in char_vocab: # adjust the character embedding
|
||||
if char in char_lexicon:
|
||||
index_in_pre = char_lexicon.get(char)
|
||||
found_char_count += 1
|
||||
else:
|
||||
index_in_pre = char_lexicon[OOV_TAG]
|
||||
char_emb_layer.weight.data[index] = char_embed_weights[index_in_pre]
|
||||
self.bos_index, self.eos_index, self._pad_index = len(vocab), len(vocab) + 1, vocab.padding_idx
|
||||
# Adjust according to char_lexicon; one extra slot is reserved for word padding (the char representation at that position is all zeros)
|
||||
char_emb_layer = nn.Embedding(len(char_vocab) + 1, int(config['char_cnn']['embedding']['dim']),
|
||||
padding_idx=len(char_vocab))
|
||||
|
||||
print(f"{found_char_count} out of {len(char_vocab)} characters were found in pretrained elmo embedding.")
|
||||
# Build the mapping from words to chars
|
||||
if config['token_embedder']['name'].lower() == 'cnn':
|
||||
max_chars = config['token_embedder']['max_characters_per_token']
|
||||
elif config['token_embedder']['name'].lower() == 'lstm':
|
||||
max_chars = max(map(lambda x: len(x[0]), vocab)) + 2 # two extra slots for <bow> and <eow>
|
||||
# Load the pretrained weights; elmo_model here contains the state_dicts of char_cnn and lstm
|
||||
elmo_model = torch.load(os.path.join(self.model_dir, weight_file), map_location='cpu')
|
||||
|
||||
char_embed_weights = elmo_model["char_cnn"]['char_emb_layer.weight']
|
||||
|
||||
found_char_count = 0
|
||||
for char, index in char_vocab: # adjust the character embedding
|
||||
if char in char_lexicon:
|
||||
index_in_pre = char_lexicon.get(char)
|
||||
found_char_count += 1
|
||||
else:
|
||||
raise ValueError('Unknown token_embedder: {0}'.format(config['token_embedder']['name']))
|
||||
index_in_pre = char_lexicon[OOV_TAG]
|
||||
char_emb_layer.weight.data[index] = char_embed_weights[index_in_pre]
|
||||
|
||||
self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab)+2, max_chars),
|
||||
fill_value=len(char_vocab),
|
||||
dtype=torch.long),
|
||||
requires_grad=False)
|
||||
for word, index in list(iter(vocab)) + [(BOS_TAG, len(vocab)), (EOS_TAG, len(vocab)+1)]:
|
||||
if len(word) + 2 > max_chars:
|
||||
word = word[:max_chars - 2]
|
||||
if index == self._pad_index:
|
||||
continue
|
||||
elif word == BOS_TAG or word == EOS_TAG:
|
||||
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(word)] + [
|
||||
char_vocab.to_index(EOW_TAG)]
|
||||
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
|
||||
else:
|
||||
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [
|
||||
char_vocab.to_index(EOW_TAG)]
|
||||
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
|
||||
self.words_to_chars_embedding[index] = torch.LongTensor(char_ids)
|
||||
print(f"{found_char_count} out of {len(char_vocab)} characters were found in pretrained elmo embedding.")
|
||||
# Build the mapping from words to chars
|
||||
max_chars = config['char_cnn']['max_characters_per_token']
|
||||
|
||||
self.char_vocab = char_vocab
|
||||
else:
|
||||
char_emb_layer = None
|
||||
self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab) + 2, max_chars),
|
||||
fill_value=len(char_vocab),
|
||||
dtype=torch.long),
|
||||
requires_grad=False)
|
||||
for word, index in list(iter(vocab)) + [(BOS_TAG, len(vocab)), (EOS_TAG, len(vocab) + 1)]:
|
||||
if len(word) + 2 > max_chars:
|
||||
word = word[:max_chars - 2]
|
||||
if index == self._pad_index:
|
||||
continue
|
||||
elif word == BOS_TAG or word == EOS_TAG:
|
||||
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(word)] + [
|
||||
char_vocab.to_index(EOW_TAG)]
|
||||
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
|
||||
else:
|
||||
char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [
|
||||
char_vocab.to_index(EOW_TAG)]
|
||||
char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
|
||||
self.words_to_chars_embedding[index] = torch.LongTensor(char_ids)
|
||||
|
||||
if config['token_embedder']['name'].lower() == 'cnn':
|
||||
self.token_embedder = ConvTokenEmbedder(
|
||||
config, self.weight_file, None, char_emb_layer, self.char_vocab)
|
||||
elif config['token_embedder']['name'].lower() == 'lstm':
|
||||
self.token_embedder = LstmTokenEmbedder(
|
||||
config, None, char_emb_layer)
|
||||
self.char_vocab = char_vocab
|
||||
|
||||
if config['token_embedder']['word_dim'] > 0 \
|
||||
and vocab._no_create_word_length > 0: # a remapping is needed so that indices of words coming only from dev/test point to unk
|
||||
words_to_words = nn.Parameter(torch.arange(len(vocab) + 2).long(), requires_grad=False)
|
||||
for word, idx in vocab:
|
||||
if vocab._is_word_no_create_entry(word):
|
||||
words_to_words[idx] = vocab.unknown_idx
|
||||
setattr(self.token_embedder, 'words_to_words', words_to_words)
|
||||
self.output_dim = config['encoder']['projection_dim']
|
||||
self.token_embedder = ConvTokenEmbedder(
|
||||
config, self.weight_file, None, char_emb_layer)
|
||||
elmo_model["char_cnn"]['char_emb_layer.weight'] = char_emb_layer.weight
|
||||
self.token_embedder.load_state_dict(elmo_model["char_cnn"])
|
||||
|
||||
# Only ELMo is considered for now
|
||||
if config['encoder']['name'].lower() == 'elmo':
|
||||
self.encoder = ElmobiLm(config)
|
||||
elif config['encoder']['name'].lower() == 'lstm':
|
||||
self.encoder = LstmbiLm(config)
|
||||
self.output_dim = config['lstm']['projection_dim']
|
||||
|
||||
self.encoder.load_weights(self.weight_file)
|
||||
# lstm encoder
|
||||
self.encoder = ElmobiLm(config)
|
||||
self.encoder.load_state_dict(elmo_model["lstm"])
|
||||
|
||||
if cache_word_reprs:
|
||||
if config['token_embedder']['embedding']['dim'] > 0: # only useful when chars are used
|
||||
if config['char_cnn']['embedding']['dim'] > 0: # only useful when chars are used
|
||||
print("Start to generate cache word representations.")
|
||||
batch_size = 320
|
||||
# bos eos
|
||||
@ -848,7 +672,7 @@ class _ElmoModel(nn.Module):
|
||||
int(word_size % batch_size != 0)
|
||||
|
||||
self.cached_word_embedding = nn.Embedding(word_size,
|
||||
config['encoder']['projection_dim'])
|
||||
config['lstm']['projection_dim'])
|
||||
with torch.no_grad():
|
||||
for i in range(num_batches):
|
||||
words = torch.arange(i * batch_size,
|
||||
@ -877,6 +701,8 @@ class _ElmoModel(nn.Module):
|
||||
expanded_words[:, 0].fill_(self.bos_index)
|
||||
expanded_words[torch.arange(batch_size).to(words), seq_len + 1] = self.eos_index
|
||||
seq_len = seq_len + 2
|
||||
zero_tensor = expanded_words.new_zeros(expanded_words.shape)
|
||||
mask = (expanded_words == zero_tensor).unsqueeze(-1)
|
||||
if hasattr(self, 'cached_word_embedding'):
|
||||
token_embedding = self.cached_word_embedding(expanded_words)
|
||||
else:
|
||||
@ -886,20 +712,16 @@ class _ElmoModel(nn.Module):
|
||||
chars = None
|
||||
token_embedding = self.token_embedder(expanded_words, chars) # batch_size x max_len x embed_dim
|
||||
|
||||
if self.config['encoder']['name'] == 'elmo':
|
||||
encoder_output = self.encoder(token_embedding, seq_len)
|
||||
if encoder_output.size(2) < max_len + 2:
|
||||
num_layers, _, output_len, hidden_size = encoder_output.size()
|
||||
dummy_tensor = encoder_output.new_zeros(num_layers, batch_size,
|
||||
max_len + 2 - output_len, hidden_size)
|
||||
encoder_output = torch.cat((encoder_output, dummy_tensor), 2)
|
||||
sz = encoder_output.size() # 2, batch_size, max_len, hidden_size
|
||||
token_embedding = torch.cat((token_embedding, token_embedding), dim=2).view(1, sz[1], sz[2], sz[3])
|
||||
encoder_output = torch.cat((token_embedding, encoder_output), dim=0)
|
||||
elif self.config['encoder']['name'] == 'lstm':
|
||||
encoder_output = self.encoder(token_embedding, seq_len)
|
||||
else:
|
||||
raise ValueError('Unknown encoder: {0}'.format(self.config['encoder']['name']))
|
||||
encoder_output = self.encoder(token_embedding, seq_len)
|
||||
if encoder_output.size(2) < max_len + 2:
|
||||
num_layers, _, output_len, hidden_size = encoder_output.size()
|
||||
dummy_tensor = encoder_output.new_zeros(num_layers, batch_size,
|
||||
max_len + 2 - output_len, hidden_size)
|
||||
encoder_output = torch.cat((encoder_output, dummy_tensor), 2)
|
||||
sz = encoder_output.size() # 2, batch_size, max_len, hidden_size
|
||||
token_embedding = token_embedding.masked_fill(mask, 0)
|
||||
token_embedding = torch.cat((token_embedding, token_embedding), dim=2).view(1, sz[1], sz[2], sz[3])
|
||||
encoder_output = torch.cat((token_embedding, encoder_output), dim=0)
|
||||
|
||||
# Remove <eos> and <bos>. The removal is not exact, but it should not affect the final result.
|
||||
encoder_output = encoder_output[:, :, 1:-1]
|
||||
|
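The word-to-character mapping built in the hunk above can be sketched without PyTorch. This is a minimal illustration, assuming hypothetical tag strings (`<bow>`, `<eow>`, `<pad>`, `<oov>`) and a plain dict in place of the character vocabulary; the real tag constants and the `to_index` behaviour are not shown in this diff. It also skips the special case in which the sentence-level BOS/EOS tokens are looked up as single symbols.

```python
# Hypothetical special symbols; the real tag strings are not shown in this diff.
BOW, EOW, PAD, OOV = "<bow>", "<eow>", "<pad>", "<oov>"


def word_to_char_ids(word, char_to_id, max_chars):
    """Map one word to a list of exactly max_chars character ids."""
    # Leave room for the begin-of-word and end-of-word markers.
    if len(word) + 2 > max_chars:
        word = word[:max_chars - 2]
    oov_id = char_to_id[OOV]
    ids = [char_to_id[BOW]]
    ids += [char_to_id.get(c, oov_id) for c in word]
    ids.append(char_to_id[EOW])
    # Right-pad so that every row has the same length.
    ids += [char_to_id[PAD]] * (max_chars - len(ids))
    return ids


char_to_id = {sym: i for i, sym in enumerate([PAD, OOV, BOW, EOW, "c", "a", "t"])}
print(word_to_char_ids("cat", char_to_id, max_chars=6))
print(word_to_char_ids("cattat", char_to_id, max_chars=6))
```

Truncating to `max_chars - 2` before adding the two markers is what keeps every row at the fixed width that the `words_to_chars_embedding` table expects.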
@ -8,9 +8,9 @@ import torch
|
||||
import torch.nn.functional as F
|
||||
from torch import nn
|
||||
|
||||
from ..dropout import TimestepDropout
|
||||
from fastNLP.modules.dropout import TimestepDropout
|
||||
|
||||
from ..utils import initial_parameter
|
||||
from fastNLP.modules.utils import initial_parameter
|
||||
|
||||
|
||||
class DotAttention(nn.Module):
|
||||
@ -45,8 +45,7 @@ class DotAttention(nn.Module):
|
||||
|
||||
class MultiHeadAttention(nn.Module):
|
||||
"""
|
||||
Alias: :class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.aggregator.attention.MultiHeadAttention`
|
||||
|
||||
Alias: :class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.encoder.attention.MultiHeadAttention`
|
||||
|
||||
:param input_size: int, size of the input dimension, which is also the size of the output dimension.
|
||||
:param key_size: int, dimension of each head.
|
@ -2,35 +2,22 @@
|
||||
import os
|
||||
from torch import nn
|
||||
import torch
|
||||
from ...io.file_utils import _get_base_url, cached_path
|
||||
from ...io.file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
|
||||
from ._bert import _WordPieceBertModel, BertModel
|
||||
|
||||
|
||||
class BertWordPieceEncoder(nn.Module):
|
||||
"""
|
||||
Loads a BERT model. After loading, call the index_dataset method to generate the word_pieces column in the dataset.
|
||||
|
||||
:param fastNLP.Vocabulary vocab: the vocabulary
|
||||
:param str model_dir_or_name: directory containing the model, or the name of the model. Defaults to ``en-base-uncased``
|
||||
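:param str layers: layers to include in the final representation, given as layer indices separated by ','; negative numbers index from the last layer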
|
||||
:param bool requires_grad: whether gradients are required.
|
||||
"""
|
||||
def __init__(self, model_dir_or_name:str='en-base-uncased', layers:str='-1',
|
||||
requires_grad:bool=False):
|
||||
def __init__(self, model_dir_or_name: str='en-base-uncased', layers: str='-1',
|
||||
requires_grad: bool=False):
|
||||
super().__init__()
|
||||
PRETRAIN_URL = _get_base_url('bert')
|
||||
PRETRAINED_BERT_MODEL_DIR = {'en': 'bert-base-cased-f89bfe08.zip',
|
||||
'en-base-uncased': 'bert-base-uncased-3413b23c.zip',
|
||||
'en-base-cased': 'bert-base-cased-f89bfe08.zip',
|
||||
'en-large-uncased': 'bert-large-uncased-20939f45.zip',
|
||||
'en-large-cased': 'bert-large-cased-e0cf90fc.zip',
|
||||
|
||||
'cn': 'bert-base-chinese-29d0a84a.zip',
|
||||
'cn-base': 'bert-base-chinese-29d0a84a.zip',
|
||||
|
||||
'multilingual': 'bert-base-multilingual-cased-1bd364ee.zip',
|
||||
'multilingual-base-uncased': 'bert-base-multilingual-uncased-f8730fe4.zip',
|
||||
'multilingual-base-cased': 'bert-base-multilingual-cased-1bd364ee.zip',
|
||||
}
|
||||
|
||||
if model_dir_or_name in PRETRAINED_BERT_MODEL_DIR:
|
||||
model_name = PRETRAINED_BERT_MODEL_DIR[model_dir_or_name]
|
||||
@ -89,4 +76,4 @@ class BertWordPieceEncoder(nn.Module):
|
||||
outputs = self.model(word_pieces, token_type_ids)
|
||||
outputs = torch.cat([*outputs], dim=-1)
|
||||
|
||||
return outputs
|
||||
return outputs
|
||||
|
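The alias lookup at the top of `BertWordPieceEncoder.__init__` can be illustrated in isolation. The sketch below copies two entries of the table from the diff; the function name `resolve_model`, the directory fallback, and the error message are assumptions made for illustration, since the rest of the constructor is not visible in this hunk.

```python
import os

# Two entries copied from the table in the diff; the real table is larger.
PRETRAINED = {
    "en-base-uncased": "bert-base-uncased-3413b23c.zip",
    "cn-base": "bert-base-chinese-29d0a84a.zip",
}


def resolve_model(model_dir_or_name, table=PRETRAINED):
    """Return ("archive", file_name) for a known alias, ("dir", path) for a local directory."""
    if model_dir_or_name in table:
        return "archive", table[model_dir_or_name]
    # Assumed fallback: treat anything else as a path that must already exist.
    if os.path.isdir(model_dir_or_name):
        return "dir", model_dir_or_name
    raise ValueError(f"Cannot recognize {model_dir_or_name!r}.")


print(resolve_model("en-base-uncased"))
```

Keeping the table in one shared module, as the commit does by importing `PRETRAINED_BERT_MODEL_DIR` from `file_utils`, means every loader resolves aliases against the same mapping.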
@ -135,7 +135,7 @@ class TokenEmbedding(nn.Module):
|
||||
:param torch.LongTensor words: batch_size x max_len
|
||||
:return:
|
||||
"""
|
||||
if self.dropout_word > 0 and self.training:
|
||||
if self.word_dropout > 0 and self.training:
|
||||
mask = torch.ones_like(words).float() * self.word_dropout
|
||||
mask = torch.bernoulli(mask).byte() # the larger word_dropout is, the more positions are set to 1
|
||||
words = words.masked_fill(mask, self._word_unk_index)
|
||||
@ -174,8 +174,16 @@ class TokenEmbedding(nn.Module):
|
||||
def embed_size(self) -> int:
|
||||
return self._embed_size
|
||||
|
||||
@property
|
||||
def embedding_dim(self) -> int:
|
||||
return self._embed_size
|
||||
|
||||
@property
|
||||
def num_embedding(self) -> int:
|
||||
"""
|
||||
This value may be larger than the size of the actual embedding matrix.
|
||||
:return:
|
||||
"""
|
||||
return len(self._word_vocab)
|
||||
|
||||
def get_word_vocab(self):
|
||||
@ -531,11 +539,11 @@ class ElmoEmbedding(ContextualEmbedding):
|
||||
self.model = _ElmoModel(model_dir, vocab, cache_word_reprs=cache_word_reprs)
|
||||
|
||||
if layers=='mix':
|
||||
self.layer_weights = nn.Parameter(torch.zeros(self.model.config['encoder']['n_layers']+1),
|
||||
self.layer_weights = nn.Parameter(torch.zeros(self.model.config['lstm']['n_layers']+1),
|
||||
requires_grad=requires_grad)
|
||||
self.gamma = nn.Parameter(torch.ones(1), requires_grad=requires_grad)
|
||||
self._get_outputs = self._get_mixed_outputs
|
||||
self._embed_size = self.model.config['encoder']['projection_dim'] * 2
|
||||
self._embed_size = self.model.config['lstm']['projection_dim'] * 2
|
||||
else:
|
||||
layers = list(map(int, layers.split(',')))
|
||||
assert len(layers) > 0, "Must choose one output"
|
||||
@ -543,7 +551,7 @@ class ElmoEmbedding(ContextualEmbedding):
|
||||
assert 0 <= layer <= 2, "Layer index should be in range [0, 2]."
|
||||
self.layers = layers
|
||||
self._get_outputs = self._get_layer_outputs
|
||||
self._embed_size = len(self.layers) * self.model.config['encoder']['projection_dim'] * 2
|
||||
self._embed_size = len(self.layers) * self.model.config['lstm']['projection_dim'] * 2
|
||||
|
||||
self.requires_grad = requires_grad
|
||||
|
||||
@ -810,7 +818,7 @@ class CNNCharEmbedding(TokenEmbedding):
|
||||
# positions equal to 1 are masked
|
||||
chars_masks = chars.eq(self.char_pad_index) # batch_size x max_len x max_word_len; positions holding the padding index are padding
|
||||
chars = self.char_embedding(chars) # batch_size x max_len x max_word_len x embed_size
|
||||
self.dropout(chars)
|
||||
chars = self.dropout(chars)
|
||||
reshaped_chars = chars.reshape(batch_size*max_len, max_word_len, -1)
|
||||
reshaped_chars = reshaped_chars.transpose(1, 2) # B' x E x M
|
||||
conv_chars = [conv(reshaped_chars).transpose(1, 2).reshape(batch_size, max_len, max_word_len, -1)
|
||||
@ -962,7 +970,7 @@ class LSTMCharEmbedding(TokenEmbedding):
|
||||
|
||||
chars = self.fc(chars)
|
||||
|
||||
return self.dropout(words)
|
||||
return self.dropout(chars)
|
||||
|
||||
@property
|
||||
def requires_grad(self):
|
||||
|
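The word-dropout step in `TokenEmbedding` (draw a Bernoulli mask with probability `word_dropout`, then overwrite the masked positions with the unknown-word index) can be sketched in plain Python. The function name and the explicit `rng` argument are assumptions made so the sketch is deterministic and runs without PyTorch.

```python
import random


def word_dropout(word_ids, unk_index, p, rng):
    """Replace each id by unk_index with probability p (training time only)."""
    return [unk_index if rng.random() < p else w for w in word_ids]


rng = random.Random(0)
batch = [5, 8, 13, 21]
print(word_dropout(batch, unk_index=1, p=0.0, rng=rng))  # nothing is replaced
print(word_dropout(batch, unk_index=1, p=1.0, rng=rng))  # everything is replaced
```

The renamed attribute in this hunk matters because the guard and the mask must read the same probability; with the old name the guard referred to an attribute that the mask did not use.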
@ -1,7 +1,8 @@
|
||||
__all__ = [
|
||||
"MaxPool",
|
||||
"MaxPoolWithMask",
|
||||
"AvgPool"
|
||||
"AvgPool",
|
||||
"AvgPoolWithMask"
|
||||
]
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
@ -9,7 +10,7 @@ import torch.nn as nn
|
||||
|
||||
class MaxPool(nn.Module):
|
||||
"""
|
||||
Alias: :class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.aggregator.pooling.MaxPool`
|
||||
Alias: :class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.encoder.pooling.MaxPool`
|
||||
|
||||
Max-pooling module.
|
||||
|
||||
@ -58,7 +59,7 @@ class MaxPool(nn.Module):
|
||||
|
||||
class MaxPoolWithMask(nn.Module):
|
||||
"""
|
||||
Alias: :class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.MaxPoolWithMask`
|
||||
Alias: :class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.encoder.pooling.MaxPoolWithMask`
|
||||
|
||||
Max pooling with a mask matrix. Positions whose mask value is 0 are ignored during max pooling.
|
||||
"""
|
||||
@ -98,7 +99,7 @@ class KMaxPool(nn.Module):
|
||||
|
||||
class AvgPool(nn.Module):
|
||||
"""
|
||||
Alias: :class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.aggregator.pooling.AvgPool`
|
||||
Alias: :class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.encoder.pooling.AvgPool`
|
||||
|
||||
Given an input of shape [batch_size, max_len, hidden_size], applies avg pooling over the last dimension. The output has shape [batch_size, hidden_size]
|
||||
"""
|
||||
@ -125,7 +126,7 @@ class AvgPool(nn.Module):
|
||||
|
||||
class AvgPoolWithMask(nn.Module):
|
||||
"""
|
||||
Alias: :class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.AvgPoolWithMask`
|
||||
Alias: :class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.encoder.pooling.AvgPoolWithMask`
|
||||
|
||||
Given an input of shape [batch_size, max_len, hidden_size], applies avg pooling over the last dimension. The output has shape [batch_size, hidden_size];
|
||||
only positions where mask is 1 are considered during pooling
|
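What `AvgPoolWithMask` is documented to compute can be sketched with nested lists. Judging from the stated output shape [batch_size, hidden_size], the average runs over the max_len axis and skips positions whose mask is 0. The function below is an illustrative assumption, not the module's implementation; in particular, the guard for an all-zero mask is a choice made for this sketch.

```python
def avg_pool_with_mask(batch, mask):
    """batch: [batch_size][max_len][hidden_size] floats; mask: [batch_size][max_len] of 0/1."""
    pooled = []
    for rows, row_mask in zip(batch, mask):
        hidden_size = len(rows[0])
        # Keep only the time steps whose mask value is 1.
        kept = [r for r, m in zip(rows, row_mask) if m == 1]
        # Guard against an all-zero mask; this choice is an assumption of the sketch.
        denom = max(len(kept), 1)
        pooled.append([sum(r[j] for r in kept) / denom for j in range(hidden_size)])
    return pooled


batch = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]
print(avg_pool_with_mask(batch, mask))
```

The padded third position carries large values on purpose: a correct masked average ignores it, whereas an unmasked average would be dominated by it.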
@ -34,8 +34,8 @@ class StarTransformer(nn.Module):
|
||||
super(StarTransformer, self).__init__()
|
||||
self.iters = num_layers
|
||||
|
||||
self.norm = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(self.iters)])
|
||||
self.emb_fc = nn.Conv2d(hidden_size, hidden_size, 1)
|
||||
self.norm = nn.ModuleList([nn.LayerNorm(hidden_size, eps=1e-6) for _ in range(self.iters)])
|
||||
# self.emb_fc = nn.Conv2d(hidden_size, hidden_size, 1)
|
||||
self.emb_drop = nn.Dropout(dropout)
|
||||
self.ring_att = nn.ModuleList(
|
||||
[_MSA1(hidden_size, nhead=num_head, head_dim=head_dim, dropout=0.0)
|
||||
|
@ -3,7 +3,7 @@ __all__ = [
|
||||
]
|
||||
from torch import nn
|
||||
|
||||
from ..aggregator.attention import MultiHeadAttention
|
||||
from fastNLP.modules.encoder.attention import MultiHeadAttention
|
||||
from ..dropout import TimestepDropout
|
||||
|
||||
|
||||
|
@ -8,7 +8,8 @@ import os
|
||||
from fastNLP.core.dataset import DataSet
|
||||
from .utils import load_url
|
||||
from .processor import ModelProcessor
|
||||
from fastNLP.io.dataset_loader import _cut_long_sentence, ConllLoader
|
||||
from fastNLP.io.dataset_loader import _cut_long_sentence
|
||||
from fastNLP.io.data_loader import ConllLoader
|
||||
from fastNLP.core.instance import Instance
|
||||
from ..api.pipeline import Pipeline
|
||||
from fastNLP.core.metrics import SpanFPreRecMetric
|
||||
|
@ -2,14 +2,14 @@
|
||||
This directory reproduces models implemented in fastNLP, aiming to match the performance reported in the papers.
|
||||
|
||||
Reproduced models:
|
||||
- [Star-Transformer](Star_transformer/)
|
||||
- [Star-Transformer](Star_transformer)
|
||||
- [Biaffine](https://github.com/fastnlp/fastNLP/blob/999a14381747068e9e6a7cc370037b320197db00/fastNLP/models/biaffine_parser.py#L239)
|
||||
- [CNNText](https://github.com/fastnlp/fastNLP/blob/999a14381747068e9e6a7cc370037b320197db00/fastNLP/models/cnn_text_classification.py#L12)
|
||||
- ...
|
||||
|
||||
# Task reproductions
|
||||
## Text Classification
|
||||
- still in progress
|
||||
- [Text Classification task reproduction](text_classification)
|
||||
|
||||
|
||||
## Matching (natural language inference / sentence matching)
|
||||
@ -20,12 +20,12 @@
|
||||
- [NER](seqence_labelling/ner)
|
||||
|
||||
|
||||
## Coreference resolution
|
||||
- still in progress
|
||||
## Coreference Resolution
|
||||
- [Coreference Resolution task reproduction](coreference_resolution)
|
||||
|
||||
|
||||
## Summarization
|
||||
- still in progress
|
||||
- [Summarization task reproduction](Summarization)
|
||||
|
||||
|
||||
## ...
|
||||
|
@ -9,26 +9,3 @@ paper: [Star-Transformer](https://arxiv.org/abs/1902.09113)
|
||||
|Text Classification|SST|-|51.2|
|
||||
|Natural Language Inference|SNLI|-|83.76|
|
||||
|
||||
## Usage
|
||||
``` python
|
||||
# for sequence labeling(ner, pos tagging, etc)
|
||||
from fastNLP.models.star_transformer import STSeqLabel
|
||||
model = STSeqLabel(
|
||||
vocab_size=10000, num_cls=50,
|
||||
emb_dim=300)
|
||||
|
||||
|
||||
# for sequence classification
|
||||
from fastNLP.models.star_transformer import STSeqCls
|
||||
model = STSeqCls(
|
||||
vocab_size=10000, num_cls=50,
|
||||
emb_dim=300)
|
||||
|
||||
|
||||
# for natural language inference
|
||||
from fastNLP.models.star_transformer import STNLICls
|
||||
model = STNLICls(
|
||||
vocab_size=10000, num_cls=50,
|
||||
emb_dim=300)
|
||||
|
||||
```
|
||||
|
@ -2,8 +2,7 @@ import torch
|
||||
import json
|
||||
import os
|
||||
from fastNLP import Vocabulary
|
||||
from fastNLP.io.dataset_loader import ConllLoader
|
||||
from fastNLP.io.data_loader import SSTLoader, SNLILoader
|
||||
from fastNLP.io.data_loader import ConllLoader, SSTLoader, SNLILoader
|
||||
from fastNLP.core import Const as C
|
||||
import numpy as np
|
||||
|
||||
|
@ -10,7 +10,8 @@ from fastNLP.models.star_transformer import STSeqLabel, STSeqCls, STNLICls
|
||||
from fastNLP.core.const import Const as C
|
||||
import sys
|
||||
#sys.path.append('/remote-home/yfshao/workdir/dev_fastnlp/')
|
||||
pre_dir = '/home/ec2-user/fast_data/'
|
||||
import os
|
||||
pre_dir = os.path.join(os.environ['HOME'], 'workdir/datasets/')
|
||||
|
||||
g_model_select = {
|
||||
'pos': STSeqLabel,
|
||||
@ -19,7 +20,7 @@ g_model_select = {
|
||||
'nli': STNLICls,
|
||||
}
|
||||
|
||||
g_emb_file_path = {'en': pre_dir + 'glove.840B.300d.txt',
|
||||
g_emb_file_path = {'en': pre_dir + 'word_vector/glove.840B.300d.txt',
|
||||
'zh': pre_dir + 'cc.zh.300.vec'}
|
||||
|
||||
g_args = None
|
||||
@ -55,7 +56,7 @@ def get_conll2012_ner():
|
||||
|
||||
|
||||
def get_sst():
|
||||
path = pre_dir + 'sst'
|
||||
path = pre_dir + 'SST'
|
||||
files = ['train.txt', 'dev.txt', 'test.txt']
|
||||
return load_sst(path, files)
|
||||
|
||||
@ -171,10 +172,10 @@ def train():
|
||||
sampler=FN.BucketSampler(100, g_args.bsz, C.INPUT_LEN),
|
||||
callbacks=[MyCallback()])
|
||||
|
||||
trainer.train()
|
||||
print(trainer.train())
|
||||
tester = FN.Tester(data=test_data, model=model, metrics=metric,
|
||||
batch_size=128, device=device)
|
||||
tester.test()
|
||||
print(tester.test())
|
||||
|
||||
|
||||
def test():
|
||||
|
@ -2,7 +2,7 @@ import pickle
|
||||
import numpy as np
|
||||
|
||||
from fastNLP.core.vocabulary import Vocabulary
|
||||
from fastNLP.io.base_loader import DataInfo
|
||||
from fastNLP.io.base_loader import DataBundle
|
||||
from fastNLP.io.dataset_loader import JsonLoader
|
||||
from fastNLP.core.const import Const
|
||||
|
||||
@ -66,7 +66,7 @@ class SummarizationLoader(JsonLoader):
|
||||
:param domain: bool build vocab for publication, use 'X' for unknown
|
||||
:param tag: bool build vocab for tag, use 'X' for unknown
|
||||
:param load_vocab: bool build vocab (False) or load vocab (True)
|
||||
:return: DataInfo
|
||||
:return: DataBundle
|
||||
datasets: dict keys correspond to the paths dict
|
||||
vocabs: dict key: vocab(if "train" in paths), domain(if domain=True), tag(if tag=True)
|
||||
embeddings: optional
|
||||
@ -182,7 +182,7 @@ class SummarizationLoader(JsonLoader):
|
||||
for ds in datasets.values():
|
||||
vocab_dict["vocab"].index_dataset(ds, field_name=Const.INPUT, new_field_name=Const.INPUT)
|
||||
|
||||
return DataInfo(vocabs=vocab_dict, datasets=datasets)
|
||||
return DataBundle(vocabs=vocab_dict, datasets=datasets)
|
||||
|
||||
|
||||
|
||||
|
@ -1,24 +1,36 @@
|
||||
|
||||
|
||||
import unittest
|
||||
from ..data.dataloader import SummarizationLoader
|
||||
|
||||
import sys
|
||||
sys.path.append('..')
|
||||
|
||||
from data.dataloader import SummarizationLoader
|
||||
|
||||
vocab_size = 100000
|
||||
vocab_path = "testdata/vocab"
|
||||
sent_max_len = 100
|
||||
doc_max_timesteps = 50
|
||||
|
||||
class TestSummarizationLoader(unittest.TestCase):
|
||||
|
||||
def test_case1(self):
|
||||
sum_loader = SummarizationLoader()
|
||||
paths = {"train":"testdata/train.jsonl", "valid":"testdata/val.jsonl", "test":"testdata/test.jsonl"}
|
||||
data = sum_loader.process(paths=paths)
|
||||
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps)
|
||||
print(data.datasets)
|
||||
|
||||
def test_case2(self):
|
||||
sum_loader = SummarizationLoader()
|
||||
paths = {"train": "testdata/train.jsonl", "valid": "testdata/val.jsonl", "test": "testdata/test.jsonl"}
|
||||
data = sum_loader.process(paths=paths, domain=True)
|
||||
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps, domain=True)
|
||||
print(data.datasets, data.vocabs)
|
||||
|
||||
def test_case3(self):
|
||||
sum_loader = SummarizationLoader()
|
||||
paths = {"train": "testdata/train.jsonl", "valid": "testdata/val.jsonl", "test": "testdata/test.jsonl"}
|
||||
data = sum_loader.process(paths=paths, tag=True)
|
||||
print(data.datasets, data.vocabs)
|
||||
data = sum_loader.process(paths=paths, vocab_size=vocab_size, vocab_path=vocab_path, sent_max_len=sent_max_len, doc_max_timesteps=doc_max_timesteps, tag=True)
|
||||
print(data.datasets, data.vocabs)
|
||||
|
||||
|
||||
|
||||
|
||||
|
@ -3,7 +3,7 @@ from datetime import timedelta
|
||||
|
||||
from fastNLP.io.dataset_loader import JsonLoader
|
||||
from fastNLP.modules.encoder._bert import BertTokenizer
|
||||
from fastNLP.io.base_loader import DataInfo
|
||||
from fastNLP.io.base_loader import DataBundle
|
||||
from fastNLP.core.const import Const
|
||||
|
||||
class BertData(JsonLoader):
|
||||
@ -110,7 +110,7 @@ class BertData(JsonLoader):
|
||||
# set paddding value
|
||||
datasets[name].set_pad_val('article', 0)
|
||||
|
||||
return DataInfo(datasets=datasets)
|
||||
return DataBundle(datasets=datasets)
|
||||
|
||||
|
||||
class BertSumLoader(JsonLoader):
|
||||
@ -154,4 +154,4 @@ class BertSumLoader(JsonLoader):
|
||||
|
||||
print('Finished in {}'.format(timedelta(seconds=time()-start)))
|
||||
|
||||
return DataInfo(datasets=datasets)
|
||||
return DataBundle(datasets=datasets)
|
||||
|
@ -11,7 +11,7 @@ Coreference resolution是查找文本中指向同一现实实体的所有表达
|
||||
Due to copyright restrictions, the dataset cannot be offered for download here; please obtain it yourself.
|
||||
The original dataset is in CoNLL format; see the dataset's official page for a detailed description.
|
||||
|
||||
The implementation uses the preprocessing method of Lee, the paper's author; see [this link](https://github.com/kentonl/e2e-coref/blob/e2e/setup_training.sh) for details.
|
||||
The implementation uses the preprocessing method of Lee, the paper's author; see [this link](https://github.com/kentonl/e2e-coref/blob/e2e/setup_training.sh) for details.
|
||||
The processed dataset is in JSON format, for example:
|
||||
```
|
||||
{
|
||||
@ -25,12 +25,12 @@ Coreference resolution是查找文本中指向同一现实实体的所有表达
|
||||
### Embedding downloads
|
||||
[turian embedding](https://lil.cs.washington.edu/coref/turian.50d.txt)
|
||||
|
||||
[glove embedding]( https://nlp.stanford.edu/data/glove.840B.300d.zip)
|
||||
[glove embedding](https://nlp.stanford.edu/data/glove.840B.300d.zip)
|
||||
|
||||
|
||||
|
||||
## Running
|
||||
```python
|
||||
```shell
|
||||
# training code
|
||||
CUDA_VISIBLE_DEVICES=0 python train.py
|
||||
# test code
|
||||
@ -39,9 +39,9 @@ CUDA_VISIBLE_DEVICES=0 python valid.py
|
||||
|
||||
## Results
|
||||
The original authors report 67.2% on the test set; the AllenNLP reproduction reaches [63.0%](https://allennlp.org/models).
|
||||
The allenNLP run was trained without speaker information and without variational dropout, and used only 100 antecedents instead of 250.
|
||||
The AllenNLP run was trained without speaker information and without variational dropout, and used only 100 antecedents instead of 250.
|
||||
|
||||
With the same hyperparameters and configuration as allenNLP, this reproduction achieves an F1 of 63.6%.
|
||||
With the same hyperparameters and configuration as AllenNLP, this reproduction achieves an F1 of 63.6%.
|
||||
|
||||
|
||||
## Issues
|
@ -1,7 +1,7 @@
|
||||
from fastNLP.io.dataset_loader import JsonLoader,DataSet,Instance
|
||||
from fastNLP.io.file_reader import _read_json
|
||||
from fastNLP.core.vocabulary import Vocabulary
|
||||
from fastNLP.io.base_loader import DataInfo
|
||||
from fastNLP.io.base_loader import DataBundle
|
||||
from reproduction.coreference_resolution.model.config import Config
|
||||
import reproduction.coreference_resolution.model.preprocess as preprocess
|
||||
|
||||
@ -26,7 +26,7 @@ class CRLoader(JsonLoader):
|
||||
return dataset
|
||||
|
||||
def process(self, paths, **kwargs):
|
||||
data_info = DataInfo()
|
||||
data_info = DataBundle()
|
||||
for name in ['train', 'test', 'dev']:
|
||||
data_info.datasets[name] = self.load(paths[name])
|
||||
|
||||
|
@ -1,7 +1,7 @@
|
||||
|
||||
|
||||
from fastNLP.io.base_loader import DataSetLoader, DataInfo
|
||||
from fastNLP.io.dataset_loader import ConllLoader
|
||||
from fastNLP.io.base_loader import DataSetLoader, DataBundle
|
||||
from fastNLP.io.data_loader import ConllLoader
|
||||
import numpy as np
|
||||
|
||||
from itertools import chain
|
||||
@ -76,7 +76,7 @@ class CTBxJointLoader(DataSetLoader):
|
||||
gold_label_word_pairs:
|
||||
"""
|
||||
paths = check_dataloader_paths(paths)
|
||||
data = DataInfo()
|
||||
data = DataBundle()
|
||||
|
||||
for name, path in paths.items():
|
||||
dataset = self.load(path)
|
||||
|
@ -2,13 +2,13 @@
|
||||
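Here fastNLP is used to reproduce several well-known models for Matching tasks, aiming to match the performance reported in the papers. The evaluation metric for all of these tasks is accuracy (%).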
|
||||
|
||||
Reproduced models (ordered by paper publication date):
|
||||
- CNTN: model code (still in progress)[](); training code (still in progress)[]().
|
||||
- CNTN: [model code](model/cntn.py); [training code](matching_cntn.py).
|
||||
Paper: [Convolutional Neural Tensor Network Architecture for Community-based Question Answering](https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11401/10844).
|
||||
- ESIM: [model code](model/esim.py); [training code](matching_esim.py).
|
||||
Paper: [Enhanced LSTM for Natural Language Inference](https://arxiv.org/pdf/1609.06038.pdf).
|
||||
- DIIN: model code (still in progress)[](); training code (still in progress)[]().
|
||||
Paper: [Natural Language Inference over Interaction Space](https://arxiv.org/pdf/1709.04348.pdf).
|
||||
- MwAN: model code (still in progress)[](); training code (still in progress)[]().
|
||||
- MwAN: [model code](model/mwan.py); [training code](matching_mwan.py).
|
||||
Paper: [Multiway Attention Networks for Modeling Sentence Pairs](https://www.ijcai.org/proceedings/2018/0613.pdf).
|
||||
- BERT: [model code](model/bert.py); [training code](matching_bert.py).
|
||||
Paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf).
|
||||
@ -21,10 +21,10 @@
|
||||
|
||||
model name | SNLI | MNLI | RTE | QNLI | Quora
|
||||
:---: | :---: | :---: | :---: | :---: | :---:
|
||||
CNTN [](); [paper](https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11401/10844) | 74.53 vs - | 60.84/-(dev) vs - | 57.4(dev) vs - | 62.53(dev) vs - | - |
|
||||
CNTN [code](model/cntn.py); [paper](https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11401/10844) | 77.79 vs - | 63.29/63.16(dev) vs - | 57.04(dev) vs - | 62.38(dev) vs - | - |
|
||||
ESIM [code](model/esim.py); [paper](https://arxiv.org/pdf/1609.06038.pdf) | 88.13(glove) vs 88.0(glove)/88.7(elmo) | 77.78/76.49 vs 72.4/72.1* | 59.21(dev) vs - | 76.97(dev) vs - | - |
|
||||
DIIN [](); [paper](https://arxiv.org/pdf/1709.04348.pdf) | - vs 88.0 | - vs 78.8/77.8 | - | - | - vs 89.06 |
|
||||
MwAN [](); [paper](https://www.ijcai.org/proceedings/2018/0613.pdf) | 87.9 vs 88.3 | 77.3/76.7(dev) vs 78.5/77.7 | - | 74.6(dev) vs - | 85.6 vs 89.12 |
|
||||
MwAN [code](model/mwan.py); [paper](https://www.ijcai.org/proceedings/2018/0613.pdf) | 87.9 vs 88.3 | 77.3/76.7(dev) vs 78.5/77.7 | - | 74.6(dev) vs - | 85.6 vs 89.12 |
|
||||
BERT (BASE version) [code](model/bert.py); [paper](https://arxiv.org/pdf/1810.04805.pdf) | 90.6 vs - | - vs 84.6/83.4| 67.87(dev) vs 66.4 | 90.97(dev) vs 90.5 | - |
|
||||
|
||||
*The official MNLI reproduction of ESIM reports 72.4/72.1; the original ESIM paper does not report results on the MNLI dataset.
|
||||
@ -44,7 +44,7 @@ Performance on Test set:
|
||||
|
||||
model name | CNTN | ESIM | DIIN | MwAN | BERT-Base | BERT-Large
|
||||
:---: | :---: | :---: | :---: | :---: | :---: | :---:
|
||||
__performance__ | - | 88.13 | - | 87.9 | 90.6 | 91.16
|
||||
__performance__ | 77.79 | 88.13 | - | 87.9 | 90.6 | 91.16
|
||||
|
||||
## MNLI
|
||||
[Link to MNLI main page](https://www.nyu.edu/projects/bowman/multinli/)
|
||||
@ -60,7 +60,7 @@ Performance on Test set(matched/mismatched):
|
||||
|
||||
model name | CNTN | ESIM | DIIN | MwAN | BERT-Base
|
||||
:---: | :---: | :---: | :---: | :---: | :---: |
|
||||
__performance__ | - | 77.78/76.49 | - | 77.3/76.7(dev) | - |
|
||||
__performance__ | 63.29/63.16(dev) | 77.78/76.49 | - | 77.3/76.7(dev) | - |
|
||||
|
||||
|
||||
## RTE
|
||||
@ -92,7 +92,7 @@ Performance on __Dev__ set:
|
||||
|
||||
model name | CNTN | ESIM | DIIN | MwAN | BERT
|
||||
:---: | :---: | :---: | :---: | :---: | :---:
|
||||
__performance__ | - | 76.97 | - | 74.6 | -
|
||||
__performance__ | 62.38 | 76.97 | - | 74.6 | -
|
||||
|
||||
## Quora
|
||||
|
||||
|
@ -5,7 +5,7 @@ from typing import Union, Dict
|
||||
|
||||
from fastNLP.core.const import Const
|
||||
from fastNLP.core.vocabulary import Vocabulary
|
||||
from fastNLP.io.base_loader import DataInfo, DataSetLoader
|
||||
from fastNLP.io.base_loader import DataBundle, DataSetLoader
|
||||
from fastNLP.io.dataset_loader import JsonLoader, CSVLoader
|
||||
from fastNLP.io.file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
|
||||
from fastNLP.modules.encoder._bert import BertTokenizer
|
||||
@ -35,7 +35,7 @@ class MatchingLoader(DataSetLoader):
|
||||
to_lower=False, seq_len_type: str=None, bert_tokenizer: str=None,
|
||||
cut_text: int = None, get_index=True, auto_pad_length: int=None,
|
||||
auto_pad_token: str='<pad>', set_input: Union[list, str, bool]=True,
|
||||
set_target: Union[list, str, bool] = True, concat: Union[str, list, bool]=None, ) -> DataInfo:
|
||||
set_target: Union[list, str, bool] = True, concat: Union[str, list, bool]=None, ) -> DataBundle:
|
||||
"""
|
||||
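:param paths: str or Dict[str, str]. If a str, it is either the folder containing the dataset or a full file path: if it is a folder,
|
||||
the dataset names and file names are looked up in self.paths. If a Dict, it holds dataset names (such as train, dev, test) and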
|
||||
@ -80,7 +80,7 @@ class MatchingLoader(DataSetLoader):
|
||||
else:
|
||||
path = paths
|
||||
|
||||
data_info = DataInfo()
|
||||
data_info = DataBundle()
|
||||
for data_name in path.keys():
|
||||
data_info.datasets[data_name] = self._load(path[data_name])
|
||||
|
||||
|
145
reproduction/matching/matching_mwan.py
Normal file
@ -0,0 +1,145 @@
|
||||
import sys
|
||||
|
||||
import os
|
||||
import random
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from torch.optim import Adadelta, SGD
|
||||
from torch.optim.lr_scheduler import StepLR
|
||||
|
||||
from tqdm import tqdm
|
||||
|
||||
from fastNLP import CrossEntropyLoss
|
||||
from fastNLP import cache_results
|
||||
from fastNLP.core import Trainer, Tester, Adam, AccuracyMetric, Const
|
||||
from fastNLP.core.predictor import Predictor
|
||||
from fastNLP.core.callback import GradientClipCallback, LRScheduler, FitlogCallback
|
||||
from fastNLP.modules.encoder.embedding import ElmoEmbedding, StaticEmbedding
|
||||
|
||||
from fastNLP.io.data_loader import MNLILoader, QNLILoader, QuoraLoader, SNLILoader, RTELoader
|
||||
from reproduction.matching.model.mwan import MwanModel
|
||||
|
||||
import fitlog
|
||||
fitlog.debug()
|
||||
|
||||
import argparse
|
||||
|
||||
|
||||
argument = argparse.ArgumentParser()
|
||||
argument.add_argument('--task' , choices = ['snli', 'rte', 'qnli', 'mnli'],default = 'snli')
|
||||
argument.add_argument('--batch-size' , type = int , default = 128)
|
||||
argument.add_argument('--n-epochs' , type = int , default = 50)
|
||||
argument.add_argument('--lr' , type = float , default = 1)
|
||||
argument.add_argument('--testset-name' , type = str , default = 'test')
|
||||
argument.add_argument('--devset-name' , type = str , default = 'dev')
|
||||
argument.add_argument('--seed' , type = int , default = 42)
|
||||
argument.add_argument('--hidden-size' , type = int , default = 150)
|
||||
argument.add_argument('--dropout' , type = float , default = 0.3)
|
||||
arg = argument.parse_args()
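The script defines its flags with hyphens (`--batch-size`, `--hidden-size`) but later reads them with underscores (`arg.hidden_size`). The short sketch below shows how `argparse` derives those attribute names. It parses an explicit list rather than `sys.argv` so that it runs anywhere, and it reuses only three of the script's flags.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--batch-size', type=int, default=128)
parser.add_argument('--hidden-size', type=int, default=150)
parser.add_argument('--lr', type=float, default=1)

# Parse an explicit list instead of sys.argv so the sketch runs anywhere.
args = parser.parse_args(['--batch-size', '64'])

# Hyphens in long option names become underscores in the attribute names.
print(args.batch_size, args.hidden_size, args.lr)
```

One detail worth knowing: `type` is applied to values given on the command line and to string defaults, so the non-string default `1` for `--lr` is stored as given, an `int`, unless the flag is passed explicitly.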
|
||||
|
||||
random.seed(arg.seed)
|
||||
np.random.seed(arg.seed)
|
||||
torch.manual_seed(arg.seed)
|
||||
|
||||
n_gpu = torch.cuda.device_count()
|
||||
if n_gpu > 0:
|
||||
torch.cuda.manual_seed_all(arg.seed)
|
||||
print (n_gpu)
|
||||
|
||||
for k in arg.__dict__:
|
||||
print(k, arg.__dict__[k], type(arg.__dict__[k]))
|
||||
|
||||
# load data set
if arg.task == 'snli':
    @cache_results(f'snli_mwan.pkl')
    def read_snli():
        data_info = SNLILoader().process(
            paths='path/to/snli/data', to_lower=True, seq_len_type=None, bert_tokenizer=None,
            get_index=True, concat=False, extra_split=['/','%','-'],
        )
        return data_info
    data_info = read_snli()
elif arg.task == 'rte':
    @cache_results(f'rte_mwan.pkl')
    def read_rte():
        data_info = RTELoader().process(
            paths='path/to/rte/data', to_lower=True, seq_len_type=None, bert_tokenizer=None,
            get_index=True, concat=False, extra_split=['/','%','-'],
        )
        return data_info
    data_info = read_rte()
elif arg.task == 'qnli':
    data_info = QNLILoader().process(
        paths='path/to/qnli/data', to_lower=True, seq_len_type=None, bert_tokenizer=None,
        get_index=True, concat=False , cut_text=512, extra_split=['/','%','-'],
    )
elif arg.task == 'mnli':
    @cache_results(f'mnli_v0.9_mwan.pkl')
    def read_mnli():
        data_info = MNLILoader().process(
            paths='path/to/mnli/data', to_lower=True, seq_len_type=None, bert_tokenizer=None,
            get_index=True, concat=False, extra_split=['/','%','-'],
        )
        return data_info
    data_info = read_mnli()
else:
    raise RuntimeError(f'NOT support {arg.task} task yet!')

print(data_info)
print(len(data_info.vocabs['words']))


model = MwanModel(
    num_class = len(data_info.vocabs[Const.TARGET]),
    EmbLayer = StaticEmbedding(data_info.vocabs[Const.INPUT], requires_grad=False, normalize=False),
    ElmoLayer = None,
    args_of_imm = {
        "input_size" : 300 ,
        "hidden_size" : arg.hidden_size ,
        "dropout" : arg.dropout ,
        "use_allennlp" : False ,
    } ,
)


optimizer = Adadelta(lr=arg.lr, params=model.parameters())
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)

callbacks = [
    LRScheduler(scheduler),
]

if arg.task in ['snli']:
    callbacks.append(FitlogCallback(data_info.datasets[arg.testset_name], verbose=1))
elif arg.task == 'mnli':
    callbacks.append(FitlogCallback({'dev_matched': data_info.datasets['dev_matched'],
                                     'dev_mismatched': data_info.datasets['dev_mismatched']},
                                    verbose=1))

trainer = Trainer(
    train_data = data_info.datasets['train'],
    model = model,
    optimizer = optimizer,
    num_workers = 0,
    batch_size = arg.batch_size,
    n_epochs = arg.n_epochs,
    print_every = -1,
    dev_data = data_info.datasets[arg.devset_name],
    metrics = AccuracyMetric(pred = "pred" , target = "target"),
    metric_key = 'acc',
    device = [i for i in range(torch.cuda.device_count())],
    check_code_level = -1,
    callbacks = callbacks,
    loss = CrossEntropyLoss(pred = "pred" , target = "target")
)
trainer.train(load_best_model=True)

tester = Tester(
    data=data_info.datasets[arg.testset_name],
    model=model,
    metrics=AccuracyMetric(),
    batch_size=arg.batch_size,
    device=[i for i in range(torch.cuda.device_count())],
)
tester.test()
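The training script above pairs `Adadelta` with `StepLR(optimizer, step_size=10, gamma=0.5)`, which halves the learning rate every 10 scheduler steps. The following is a minimal pure-Python sketch of that decay rule, assuming the scheduler is stepped once per epoch; the helper name `step_lr` is hypothetical and is not part of PyTorch or fastNLP.

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 10, gamma: float = 0.5) -> float:
    """Learning rate in effect after `epoch` scheduler steps under step decay."""
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    # With the script defaults (--lr 1, --n-epochs 50) the rate would change
    # at epochs 10, 20, 30 and 40.
    for epoch in (0, 9, 10, 25, 49):
        print(epoch, step_lr(1.0, epoch))
```

Under these assumptions the last ten of the default 50 epochs would run at one sixteenth of the initial rate.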
455
reproduction/matching/model/mwan.py
Normal file
@@ -0,0 +1,455 @@
import torch as tc
import torch.nn as nn
import torch.nn.functional as F
import sys
import os
import math
from fastNLP.core.const import Const

class RNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, bidrect, dropout):
        super(RNNModel, self).__init__()

        if num_layers <= 1:
            dropout = 0.0

        self.rnn = nn.GRU(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers,
                          batch_first=True, dropout=dropout, bidirectional=bidrect)

        self.number = (2 if bidrect else 1) * num_layers

    def forward(self, x, mask):
        '''
        mask: (batch_size, seq_len)
        x: (batch_size, seq_len, input_size)
        '''
        lens = (mask).long().sum(dim=1)
        lens, idx_sort = tc.sort(lens, descending=True)
        _, idx_unsort = tc.sort(idx_sort)

        x = x[idx_sort]

        x = nn.utils.rnn.pack_padded_sequence(x, lens, batch_first=True)
        self.rnn.flatten_parameters()
        y, h = self.rnn(x)
        y, lens = nn.utils.rnn.pad_packed_sequence(y, batch_first=True)

        h = h.transpose(0,1).contiguous() #make batch size first

        y = y[idx_unsort] #(batch_size, seq_len, bid * hid_size)
        h = h[idx_unsort] #(batch_size, number, hid_size)

        return y, h

class Contexualizer(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.3):
        super(Contexualizer, self).__init__()

        self.rnn = RNNModel(input_size, hidden_size, num_layers, True, dropout)
        self.output_size = hidden_size * 2

        self.reset_parameters()

    def reset_parameters(self):
        weights = self.rnn.rnn.all_weights
        for w1 in weights:
            for w2 in w1:
                if len(list(w2.size())) <= 1:
                    w2.data.fill_(0)
                else: nn.init.xavier_normal_(w2.data, gain=1.414)

    def forward(self, s, mask):
        y = self.rnn(s, mask)[0] # (batch_size, seq_len, 2 * hidden_size)

        return y

class ConcatAttention_Param(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=0.2):
        super(ConcatAttention_Param, self).__init__()
        self.ln = nn.Linear(input_size + hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)
        self.vq = nn.Parameter(tc.rand(hidden_size))
        self.drop = nn.Dropout(dropout)

        self.output_size = input_size

        self.reset_parameters()

    def reset_parameters(self):

        nn.init.xavier_uniform_(self.v.weight.data)
        nn.init.xavier_uniform_(self.ln.weight.data)
        self.ln.bias.data.fill_(0)

    def forward(self, h, mask):
        '''
        h: (batch_size, len, input_size)
        mask: (batch_size, len)
        '''

        vq = self.vq.view(1,1,-1).expand(h.size(0), h.size(1), self.vq.size(0))

        s = self.v(tc.tanh(self.ln(tc.cat([h,vq],-1)))).squeeze(-1) # (batch_size, len)

        s = s - ((mask == 0).float() * 10000)
        a = tc.softmax(s, dim=1)

        r = a.unsqueeze(-1) * h # (batch_size, len, input_size)
        r = tc.sum(r, dim=1) # (batch_size, input_size)

        return self.drop(r)


def get_2dmask(mask_hq, mask_hp, siz=None):

    if siz is None:
        siz = (mask_hq.size(0), mask_hq.size(1), mask_hp.size(1))

    mask_mat = 1
    if mask_hq is not None:
        mask_mat = mask_mat * mask_hq.unsqueeze(2).expand(siz)
    if mask_hp is not None:
        mask_mat = mask_mat * mask_hp.unsqueeze(1).expand(siz)
    return mask_mat

def Attention(hq, hp, mask_hq, mask_hp, my_method):
    standard_size = (hq.size(0), hq.size(1), hp.size(1), hq.size(-1))
    mask_mat = get_2dmask(mask_hq, mask_hp, standard_size[:-1])

    hq_mat = hq.unsqueeze(2).expand(standard_size)
    hp_mat = hp.unsqueeze(1).expand(standard_size)

    s = my_method(hq_mat, hp_mat) # (batch_size, len_q, len_p)

    s = s - ((mask_mat == 0).float() * 10000)
    a = tc.softmax(s, dim=1)

    q = a.unsqueeze(-1) * hq_mat #(batch_size, len_q, len_p, input_size)
    q = tc.sum(q, dim=1) #(batch_size, len_p, input_size)

    return q

class ConcatAttention(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=0.2, input_size_2=-1):
        super(ConcatAttention, self).__init__()

        if input_size_2 < 0:
            input_size_2 = input_size
        self.ln = nn.Linear(input_size + input_size_2, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)
        self.drop = nn.Dropout(dropout)

        self.output_size = input_size


        self.reset_parameters()

    def reset_parameters(self):

        nn.init.xavier_uniform_(self.v.weight.data)
        nn.init.xavier_uniform_(self.ln.weight.data)
        self.ln.bias.data.fill_(0)

    def my_method(self, hq_mat, hp_mat):
        s = tc.cat([hq_mat, hp_mat], dim=-1)
        s = self.v(tc.tanh(self.ln(s))).squeeze(-1) #(batch_size, len_q, len_p)
        return s

    def forward(self, hq, hp, mask_hq=None, mask_hp=None):
        '''
        hq: (batch_size, len_q, input_size)
        mask_hq: (batch_size, len_q)
        '''
        return self.drop(Attention(hq, hp, mask_hq, mask_hp, self.my_method))

class MinusAttention(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=0.2):
        super(MinusAttention, self).__init__()
        self.ln = nn.Linear(input_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

        self.drop = nn.Dropout(dropout)
        self.output_size = input_size
        self.reset_parameters()

    def reset_parameters(self):

        nn.init.xavier_uniform_(self.v.weight.data)
        nn.init.xavier_uniform_(self.ln.weight.data)
        self.ln.bias.data.fill_(0)

    def my_method(self, hq_mat, hp_mat):
        s = hq_mat - hp_mat
        s = self.v(tc.tanh(self.ln(s))).squeeze(-1) #(batch_size, len_q, len_p) s[j,t]
        return s

    def forward(self, hq, hp, mask_hq=None, mask_hp=None):
        return self.drop(Attention(hq, hp, mask_hq, mask_hp, self.my_method))

class DotProductAttention(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=0.2):
        super(DotProductAttention, self).__init__()
        self.ln = nn.Linear(input_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

        self.drop = nn.Dropout(dropout)
        self.output_size = input_size
        self.reset_parameters()

    def reset_parameters(self):

        nn.init.xavier_uniform_(self.v.weight.data)
        nn.init.xavier_uniform_(self.ln.weight.data)
        self.ln.bias.data.fill_(0)

    def my_method(self, hq_mat, hp_mat):
        s = hq_mat * hp_mat
        s = self.v(tc.tanh(self.ln(s))).squeeze(-1) #(batch_size, len_q, len_p) s[j,t]
        return s

    def forward(self, hq, hp, mask_hq=None, mask_hp=None):
        return self.drop(Attention(hq, hp, mask_hq, mask_hp, self.my_method))

class BiLinearAttention(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=0.2, input_size_2=-1):
        super(BiLinearAttention, self).__init__()

        input_size_2 = input_size if input_size_2 < 0 else input_size_2

        self.ln = nn.Linear(input_size_2, input_size)
        self.drop = nn.Dropout(dropout)
        self.output_size = input_size

        self.reset_parameters()

    def reset_parameters(self):

        nn.init.xavier_uniform_(self.ln.weight.data)
        self.ln.bias.data.fill_(0)

    def my_method(self, hq, hp, mask_p):
        # (bs, len, input_size)

        hp = self.ln(hp)
        hp = hp * mask_p.unsqueeze(-1)
        s = tc.matmul(hq, hp.transpose(-1,-2))

        return s

    def forward(self, hq, hp, mask_hq=None, mask_hp=None):
        standard_size = (hq.size(0), hq.size(1), hp.size(1), hq.size(-1))
        mask_mat = get_2dmask(mask_hq, mask_hp, standard_size[:-1])

        s = self.my_method(hq, hp, mask_hp) # (batch_size, len_q, len_p)

        s = s - ((mask_mat == 0).float() * 10000)
        a = tc.softmax(s, dim=1)

        hq_mat = hq.unsqueeze(2).expand(standard_size)
        q = a.unsqueeze(-1) * hq_mat #(batch_size, len_q, len_p, input_size)
        q = tc.sum(q, dim=1) #(batch_size, len_p, input_size)

        return self.drop(q)


class AggAttention(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=0.2):
        super(AggAttention, self).__init__()
        self.ln = nn.Linear(input_size + hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)
        self.vq = nn.Parameter(tc.rand(hidden_size, 1))
        self.drop = nn.Dropout(dropout)

        self.output_size = input_size

        self.reset_parameters()

    def reset_parameters(self):

        nn.init.xavier_uniform_(self.vq.data)
        nn.init.xavier_uniform_(self.v.weight.data)
        nn.init.xavier_uniform_(self.ln.weight.data)
        self.ln.bias.data.fill_(0)
        self.vq.data = self.vq.data[:,0]


    def forward(self, hs, mask):
        '''
        hs: [(batch_size, len_q, input_size), ...]
        mask: (batch_size, len_q)
        '''

        hs = tc.cat([h.unsqueeze(0) for h in hs], dim=0)# (4, batch_size, len_q, input_size)

        vq = self.vq.view(1,1,1,-1).expand(hs.size(0), hs.size(1), hs.size(2), self.vq.size(0))

        s = self.v(tc.tanh(self.ln(tc.cat([hs,vq],-1)))).squeeze(-1)# (4, batch_size, len_q)

        s = s - ((mask.unsqueeze(0) == 0).float() * 10000)
        a = tc.softmax(s, dim=0)

        x = a.unsqueeze(-1) * hs
        x = tc.sum(x, dim=0)#(batch_size, len_q, input_size)

        return self.drop(x)

class Aggragator(nn.Module):
    def __init__(self, input_size, hidden_size, dropout=0.3):
        super(Aggragator, self).__init__()

        now_size = input_size
        self.ln = nn.Linear(2 * input_size, 2 * input_size)

        now_size = 2 * input_size
        self.rnn = Contexualizer(now_size, hidden_size, 2, dropout)

        now_size = self.rnn.output_size
        self.agg_att = AggAttention(now_size, now_size, dropout)

        now_size = self.agg_att.output_size
        self.agg_rnn = Contexualizer(now_size, hidden_size, 2, dropout)

        self.drop = nn.Dropout(dropout)

        self.output_size = self.agg_rnn.output_size

    def forward(self, qs, hp, mask):
        '''
        qs: [ (batch_size, len_p, input_size), ...]
        hp: (batch_size, len_p, input_size)
        mask: the same mask as hp's, (batch_size, len_p)
        '''

        hs = [0 for _ in range(len(qs))]

        for i in range(len(qs)):
            q = qs[i]
            x = tc.cat([q, hp], dim=-1)
            g = tc.sigmoid(self.ln(x))
            x_star = x * g
            h = self.rnn(x_star, mask)

            hs[i] = h

        x = self.agg_att(hs, mask) #(batch_size, len_p, output_size)
        h = self.agg_rnn(x, mask) #(batch_size, len_p, output_size)
        return self.drop(h)


class Mwan_Imm(nn.Module):
    def __init__(self, input_size, hidden_size, num_class=3, dropout=0.2, use_allennlp=False):
        super(Mwan_Imm, self).__init__()

        now_size = input_size
        self.enc_s1 = Contexualizer(now_size, hidden_size, 2, dropout)
        self.enc_s2 = Contexualizer(now_size, hidden_size, 2, dropout)

        now_size = self.enc_s1.output_size
        self.att_c = ConcatAttention(now_size, hidden_size, dropout)
        self.att_b = BiLinearAttention(now_size, hidden_size, dropout)
        self.att_d = DotProductAttention(now_size, hidden_size, dropout)
        self.att_m = MinusAttention(now_size, hidden_size, dropout)

        now_size = self.att_c.output_size
        self.agg = Aggragator(now_size, hidden_size, dropout)

        now_size = self.enc_s1.output_size
        self.pred_1 = ConcatAttention_Param(now_size, hidden_size, dropout)
        now_size = self.agg.output_size
        self.pred_2 = ConcatAttention(now_size, hidden_size, dropout,
                                      input_size_2=self.pred_1.output_size)

        now_size = self.pred_2.output_size
        self.ln1 = nn.Linear(now_size, hidden_size)
        self.ln2 = nn.Linear(hidden_size, num_class)

        self.reset_parameters()

    def reset_parameters(self):
        nn.init.xavier_uniform_(self.ln1.weight.data)
        nn.init.xavier_uniform_(self.ln2.weight.data)
        self.ln1.bias.data.fill_(0)
        self.ln2.bias.data.fill_(0)

    def forward(self, s1, s2, mas_s1, mas_s2):
        # NOTE: both sentences are encoded with enc_s1; enc_s2 is constructed above but not used here
        hq = self.enc_s1(s1, mas_s1) #(batch_size, len_q, output_size)
        hp = self.enc_s1(s2, mas_s2)

        mas_s1 = mas_s1[:,:hq.size(1)]
        mas_s2 = mas_s2[:,:hp.size(1)]
        mas_q, mas_p = mas_s1, mas_s2

        qc = self.att_c(hq, hp, mas_s1, mas_s2) #(batch_size, len_p, output_size)
        qb = self.att_b(hq, hp, mas_s1, mas_s2)
        qd = self.att_d(hq, hp, mas_s1, mas_s2)
        qm = self.att_m(hq, hp, mas_s1, mas_s2)

        ho = self.agg([qc,qb,qd,qm], hp, mas_s2) #(batch_size, len_p, output_size)

        rq = self.pred_1(hq, mas_q) #(batch_size, output_size)
        rp = self.pred_2(ho, rq.unsqueeze(1), mas_p)#(batch_size, 1, output_size)
        rp = rp.squeeze(1) #(batch_size, output_size)

        rp = F.relu(self.ln1(rp))
        rp = self.ln2(rp)

        return rp

class MwanModel(nn.Module):
    def __init__(self, num_class, EmbLayer, args_of_imm={}, ElmoLayer=None):
        super(MwanModel, self).__init__()

        self.emb = EmbLayer

        if ElmoLayer is not None:
            self.elmo = ElmoLayer
            self.elmo_preln = nn.Linear(3 * self.elmo.emb_size, self.elmo.emb_size)
            self.elmo_ln = nn.Linear(args_of_imm["input_size"] +
                                     self.elmo.emb_size, args_of_imm["input_size"])

        else:
            self.elmo = None


        self.imm = Mwan_Imm(num_class=num_class, **args_of_imm)
        self.drop = nn.Dropout(args_of_imm["dropout"])


    def forward(self, words1, words2, str_s1=None, str_s2=None, *pargs, **kwargs):
        '''
        str_s1 / str_s2 are only needed when an ELMo layer is supplied; they are ignored otherwise
        str_s: (batch_size, seq_len, word_len)
        '''

        s1, s2 = words1, words2

        mas_s1 = (s1 != 0).float() # mas: (batch_size, seq_len)
        mas_s2 = (s2 != 0).float() # mas: (batch_size, seq_len)

        mas_s1.requires_grad = False
        mas_s2.requires_grad = False

        s1_emb = self.emb(s1)
        s2_emb = self.emb(s2)

        if self.elmo is not None:
            s1_elmo = self.elmo(str_s1)
            s2_elmo = self.elmo(str_s2)

            s1_elmo = tc.tanh(self.elmo_preln(tc.cat(s1_elmo, dim=-1)))
            s2_elmo = tc.tanh(self.elmo_preln(tc.cat(s2_elmo, dim=-1)))

            s1_emb = tc.cat([s1_emb, s1_elmo], dim=-1)
            s2_emb = tc.cat([s2_emb, s2_elmo], dim=-1)

            s1_emb = tc.tanh(self.elmo_ln(s1_emb))
            s2_emb = tc.tanh(self.elmo_ln(s2_emb))

        s1_emb = self.drop(s1_emb)
        s2_emb = self.drop(s2_emb)

        y = self.imm(s1_emb, s2_emb, mas_s1, mas_s2)

        return {
            Const.OUTPUT: y,
        }
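Every attention module in `mwan.py` hides padding the same way: it subtracts 10000 from the scores of masked positions before the softmax, so those positions receive almost no weight. The following is a minimal pure-Python sketch of that trick for a single score vector. The helper `masked_softmax` is hypothetical and does not exist in the file; the real code applies the same idea to batched tensors.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[int], penalty: float = 10000.0) -> List[float]:
    """Softmax over `scores`, pushing positions with mask == 0 towards zero weight."""
    shifted = [s - penalty * (m == 0) for s, m in zip(scores, mask)]
    top = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # The last position is padding and ends up with (numerically) zero attention.
    print(masked_softmax([1.0, 2.0, 3.0], [1, 1, 0]))
```

A finite penalty rather than negative infinity keeps the softmax well defined even if every position in a row is masked.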
@@ -1,7 +1,7 @@
 
 from fastNLP.io.embed_loader import EmbeddingOption, EmbedLoader
 from fastNLP.core.vocabulary import VocabularyOption
-from fastNLP.io.base_loader import DataSetLoader, DataInfo
+from fastNLP.io.base_loader import DataSetLoader, DataBundle
 from typing import Union, Dict, List, Iterator
 from fastNLP import DataSet
 from fastNLP import Instance
@@ -161,7 +161,7 @@ class SigHanLoader(DataSetLoader):
         # It is recommended to validate paths with check_dataloader_paths
         paths = check_dataloader_paths(paths)
         datasets = {}
-        data = DataInfo()
+        data = DataBundle()
         bigram = bigram_vocab_opt is not None
         for name, path in paths.items():
             dataset = self.load(path, bigram=bigram)
93
reproduction/seqence_labelling/ner/data/Conll2003Loader.py
Normal file
@@ -0,0 +1,93 @@
from fastNLP.core.vocabulary import VocabularyOption
from fastNLP.io.base_loader import DataSetLoader, DataBundle
from typing import Union, Dict
from fastNLP import Vocabulary
from fastNLP import Const
from reproduction.utils import check_dataloader_paths

from fastNLP.io import ConllLoader
from reproduction.seqence_labelling.ner.data.utils import iob2bioes, iob2


class Conll2003DataLoader(DataSetLoader):
    def __init__(self, task:str='ner', encoding_type:str='bioes'):
        """
        Load an English corpus in CoNLL-2003 format. Information about the dataset can be found at
        https://www.clips.uantwerpen.be/conll2003/ner/. When task is pos, the target of the returned
        DataSet is taken from column 2; when task is chunk, from column 3; when task is ner, from column 4.
        All "-DOCSTART- -X- O O" lines are ignored, so the number of instances is lower than the figures
        reported in much of the literature. Because "-DOCSTART- -X- O O" only marks document boundaries and
        should not be a prediction target, lines starting with -DOCSTART- are skipped.
        For the ner and chunk tasks the loaded target is converted to encoding_type; for the pos task it is the pos column as-is.

        :param task: the tagging task to load. One of ner, pos, chunk
        """
        assert task in ('ner', 'pos', 'chunk')
        index = {'ner':3, 'pos':1, 'chunk':2}[task]
        self._loader = ConllLoader(headers=['raw_words', 'target'], indexes=[0, index])
        self._tag_converters = []
        if task in ('ner', 'chunk'):
            self._tag_converters = [iob2]
            if encoding_type == 'bioes':
                self._tag_converters.append(iob2bioes)

    def load(self, path: str):
        dataset = self._loader.load(path)
        def convert_tag_schema(tags):
            for converter in self._tag_converters:
                tags = converter(tags)
            return tags
        if self._tag_converters:
            dataset.apply_field(convert_tag_schema, field_name=Const.TARGET, new_field_name=Const.TARGET)
        return dataset

    def process(self, paths: Union[str, Dict[str, str]], word_vocab_opt:VocabularyOption=None, lower:bool=False):
        """
        Read and process the data. Lines starting with '-DOCSTART-' are ignored.

        :param paths:
        :param word_vocab_opt: initialization arguments for the Vocabulary
        :param lower: whether to convert all letters to lowercase.
        :return:
        """
        # Read the data
        paths = check_dataloader_paths(paths)
        data = DataBundle()
        input_fields = [Const.TARGET, Const.INPUT, Const.INPUT_LEN]
        target_fields = [Const.TARGET, Const.INPUT_LEN]
        for name, path in paths.items():
            dataset = self.load(path)
            dataset.apply_field(lambda words: words, field_name='raw_words', new_field_name=Const.INPUT)
            if lower:
                dataset.words.lower()
            data.datasets[name] = dataset

        # Build the word vocabulary
        word_vocab = Vocabulary(min_freq=2) if word_vocab_opt is None else Vocabulary(**word_vocab_opt)
        word_vocab.from_dataset(data.datasets['train'], field_name=Const.INPUT,
                                no_create_entry_dataset=[dataset for name, dataset in data.datasets.items() if name!='train'])
        word_vocab.index_dataset(*data.datasets.values(), field_name=Const.INPUT, new_field_name=Const.INPUT)
        data.vocabs[Const.INPUT] = word_vocab

        # cap words
        cap_word_vocab = Vocabulary()
        cap_word_vocab.from_dataset(data.datasets['train'], field_name='raw_words',
                                    no_create_entry_dataset=[dataset for name, dataset in data.datasets.items() if name!='train'])
        cap_word_vocab.index_dataset(*data.datasets.values(), field_name='raw_words', new_field_name='cap_words')
        input_fields.append('cap_words')
        data.vocabs['cap_words'] = cap_word_vocab

        # Build the vocabulary for target
        target_vocab = Vocabulary(unknown=None, padding=None)
        target_vocab.from_dataset(*data.datasets.values(), field_name=Const.TARGET)
        target_vocab.index_dataset(*data.datasets.values(), field_name=Const.TARGET)
        data.vocabs[Const.TARGET] = target_vocab

        for name, dataset in data.datasets.items():
            dataset.add_seq_len(Const.INPUT, new_field_name=Const.INPUT_LEN)
            dataset.set_input(*input_fields)
            dataset.set_target(*target_fields)

        return data

if __name__ == '__main__':
    pass
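`Conll2003DataLoader` chooses the target column from the task name and drops `-DOCSTART-` lines. The following is a minimal pure-Python sketch of that behaviour, assuming whitespace-separated columns and blank lines between sentences. The function `read_conll` is a hypothetical stand-in for fastNLP's `ConllLoader`, whose actual parsing may differ in details.

```python
from typing import List, Tuple

# Task name to zero-based column index, matching the mapping used by the loader.
TASK_TO_COLUMN = {"pos": 1, "chunk": 2, "ner": 3}


def read_conll(text: str, task: str = "ner") -> List[Tuple[List[str], List[str]]]:
    """Return one (words, tags) pair per sentence, skipping -DOCSTART- lines."""
    column = TASK_TO_COLUMN[task]
    sentences, words, tags = [], [], []
    for line in text.splitlines() + [""]:  # the trailing "" flushes the last sentence
        parts = line.split()
        if not parts or parts[0] == "-DOCSTART-":
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        words.append(parts[0])
        tags.append(parts[column])
    return sentences


if __name__ == "__main__":
    sample = "-DOCSTART- -X- O O\n\nEU NNP I-NP I-ORG\nrejects VBZ I-VP O\n"
    print(read_conll(sample, "ner"))
    print(read_conll(sample, "pos"))
```

The tags in this sample are in IOB1 form, which is why the loader runs `iob2` (and optionally `iob2bioes`) on the target column afterwards.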
152
reproduction/seqence_labelling/ner/data/OntoNoteLoader.py
Normal file
@@ -0,0 +1,152 @@
from fastNLP.core.vocabulary import VocabularyOption
from fastNLP.io.base_loader import DataSetLoader, DataBundle
from typing import Union, Dict
from fastNLP import DataSet
from fastNLP import Vocabulary
from fastNLP import Const
from reproduction.utils import check_dataloader_paths

from fastNLP.io import ConllLoader
from reproduction.seqence_labelling.ner.data.utils import iob2bioes, iob2

class OntoNoteNERDataLoader(DataSetLoader):
    """
    Loads OntoNote data that has already been converted to CoNLL format. For the conversion of OntoNote
    data to CoNLL format, see https://github.com/yhcc/OntoNotes-5.0-NER.

    """
    def __init__(self, encoding_type:str='bioes'):
        assert encoding_type in ('bioes', 'bio')
        self.encoding_type = encoding_type
        if encoding_type=='bioes':
            self.encoding_method = iob2bioes
        else:
            self.encoding_method = iob2

    def load(self, path:str)->DataSet:
        """
        Read the data at the given file path. The returned DataSet contains the following fields
            raw_words: List[str]
            target: List[str]

        :param path:
        :return:
        """
        dataset = ConllLoader(headers=['raw_words', 'target'], indexes=[3, 10]).load(path)
        def convert_to_bio(tags):
            bio_tags = []
            flag = None
            for tag in tags:
                label = tag.strip("()*")
                if '(' in tag:
                    bio_label = 'B-' + label
                    flag = label
                elif flag:
                    bio_label = 'I-' + flag
                else:
                    bio_label = 'O'
                if ')' in tag:
                    flag = None
                bio_tags.append(bio_label)
            return self.encoding_method(bio_tags)

        def convert_word(words):
            converted_words = []
            for word in words:
                word = word.replace('/.', '.')  # some sentence-final periods appear as '/.'
                if not word.startswith('-'):
                    converted_words.append(word)
                    continue
                # the symbols below were escaped in the corpus; map them back
                tfrs = {'-LRB-':'(',
                        '-RRB-': ')',
                        '-LSB-': '[',
                        '-RSB-': ']',
                        '-LCB-': '{',
                        '-RCB-': '}'
                        }
                if word in tfrs:
                    converted_words.append(tfrs[word])
                else:
                    converted_words.append(word)
            return converted_words

        dataset.apply_field(convert_word, field_name='raw_words', new_field_name='raw_words')
        dataset.apply_field(convert_to_bio, field_name='target', new_field_name='target')

        return dataset

    def process(self, paths: Union[str, Dict[str, str]], word_vocab_opt:VocabularyOption=None,
                lower:bool=True)->DataBundle:
        """
        Read and process the data. The returned DataBundle contains the following
            vocabs:
                word: Vocabulary
                target: Vocabulary
            datasets:
                train: DataSet
                    words: List[int], set as input
                    target: int. The label; set as both input and target
                    seq_len: int. The sentence length; set as both input and target
                    raw_words: List[str]
                xxx (may vary with the paths passed in)

        :param paths:
        :param word_vocab_opt: initialization arguments for the Vocabulary
        :param lower: whether to lowercase the words
        :return:
        """
        paths = check_dataloader_paths(paths)
        data = DataBundle()
        input_fields = [Const.TARGET, Const.INPUT, Const.INPUT_LEN]
        target_fields = [Const.TARGET, Const.INPUT_LEN]
        for name, path in paths.items():
            dataset = self.load(path)
            dataset.apply_field(lambda words: words, field_name='raw_words', new_field_name=Const.INPUT)
            if lower:
                dataset.words.lower()
            data.datasets[name] = dataset

        # Build the word vocabulary
        word_vocab = Vocabulary(min_freq=2) if word_vocab_opt is None else Vocabulary(**word_vocab_opt)
        word_vocab.from_dataset(data.datasets['train'], field_name=Const.INPUT,
                                no_create_entry_dataset=[dataset for name, dataset in data.datasets.items() if name!='train'])
        word_vocab.index_dataset(*data.datasets.values(), field_name=Const.INPUT, new_field_name=Const.INPUT)
        data.vocabs[Const.INPUT] = word_vocab

        # cap words
        cap_word_vocab = Vocabulary()
        cap_word_vocab.from_dataset(*data.datasets.values(), field_name='raw_words')
        cap_word_vocab.index_dataset(*data.datasets.values(), field_name='raw_words', new_field_name='cap_words')
        input_fields.append('cap_words')
        data.vocabs['cap_words'] = cap_word_vocab

        # Build the vocabulary for target
        target_vocab = Vocabulary(unknown=None, padding=None)
        target_vocab.from_dataset(*data.datasets.values(), field_name=Const.TARGET)
        target_vocab.index_dataset(*data.datasets.values(), field_name=Const.TARGET)
        data.vocabs[Const.TARGET] = target_vocab

        for name, dataset in data.datasets.items():
            dataset.add_seq_len(Const.INPUT, new_field_name=Const.INPUT_LEN)
            dataset.set_input(*input_fields)
            dataset.set_target(*target_fields)

        return data


if __name__ == '__main__':
    loader = OntoNoteNERDataLoader()
    dataset = loader.load('/hdd/fudanNLP/fastNLP/others/data/v4/english/test.txt')
    print(dataset.target.value_count())
    print(dataset[:4])


"""
train 115812 2200752
development 15680 304684
test 12217 230111

train 92403 1901772
valid 13606 279180
test 10258 204135
"""
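The nested `convert_to_bio` helper in `OntoNoteNERDataLoader.load` turns OntoNotes span annotations such as `(PERSON*`, `*`, `*)` into BIO tags before the encoding method is applied. The following is a standalone sketch of that step only; the name `brackets_to_bio` is hypothetical, and the final `encoding_method` call from the loader is deliberately left out.

```python
from typing import List


def brackets_to_bio(tags: List[str]) -> List[str]:
    """Turn OntoNotes-style span tags such as "(PERSON*", "*", "*)" into BIO tags."""
    bio_tags, current = [], None
    for tag in tags:
        label = tag.strip("()*")
        if "(" in tag:      # a span opens on this token
            bio_label = "B-" + label
            current = label
        elif current:       # still inside the span opened earlier
            bio_label = "I-" + current
        else:
            bio_label = "O"
        if ")" in tag:      # the span closes on this token
            current = None
        bio_tags.append(bio_label)
    return bio_tags


if __name__ == "__main__":
    print(brackets_to_bio(["(PERSON*", "*)", "*", "(GPE)"]))
```

A tag like `(GPE)` opens and closes on the same token, so it yields a lone `B-` tag that the BIOES conversion can later turn into `S-`.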
49
reproduction/seqence_labelling/ner/data/utils.py
Normal file
@@ -0,0 +1,49 @@
from typing import List

def iob2(tags:List[str])->List[str]:
    """
    Check that the tags form valid IOB data; IOB1 tags are converted to IOB2 automatically.

    :param tags: the tags to convert
    """
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        split = tag.split("-")
        if len(split) != 2 or split[0] not in ["I", "B"]:
            raise TypeError("The encoding schema is not a valid IOB type.")
        if split[0] == "B":
            continue
        elif i == 0 or tags[i - 1] == "O":  # conversion IOB1 to IOB2
            tags[i] = "B" + tag[1:]
        elif tags[i - 1][1:] == tag[1:]:
            continue
        else:  # conversion IOB1 to IOB2
            tags[i] = "B" + tag[1:]
    return tags

def iob2bioes(tags:List[str])->List[str]:
    """
    Convert IOB2 tags to the BIOES encoding.
    :param tags:
    :return:
    """
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            new_tags.append(tag)
        else:
            split = tag.split('-')[0]
            if split == 'B':
                if i + 1 != len(tags) and tags[i+1].split('-')[0] == 'I':
                    new_tags.append(tag)
                else:
                    new_tags.append(tag.replace('B-', 'S-'))
            elif split == 'I':
                if i + 1 < len(tags) and tags[i+1].split('-')[0] == 'I':
                    new_tags.append(tag)
                else:
                    new_tags.append(tag.replace('I-', 'E-'))
            else:
                raise TypeError("Invalid IOB format.")
    return new_tags
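`iob2bioes` rewrites a `B-` tag as `S-` when the span has a single token and the last `I-` tag of a span as `E-`. The following is a compact, self-contained sketch of the same rule, useful for checking expectations on small inputs. The name `to_bioes` is hypothetical, and the sketch assumes its input is already valid IOB2.

```python
from typing import List


def to_bioes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to BIOES: one-token spans become S-, span ends become E-."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The span continues only if the next tag is an I- tag.
        continues = i + 1 < len(tags) and tags[i + 1].startswith("I-")
        if prefix == "B":
            out.append(tag if continues else "S-" + label)
        elif prefix == "I":
            out.append(tag if continues else "E-" + label)
        else:
            raise TypeError("Invalid IOB format.")
    return out


if __name__ == "__main__":
    print(to_bioes(["B-PER", "I-PER", "O", "B-LOC"]))
```

Like the original, the sketch only checks the prefix of the next tag, not whether its label matches the current span.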
@@ -106,7 +106,9 @@ class IDCNN(nn.Module):
         if self.crf is not None and target is not None:
             loss = self.crf(y.transpose(1, 2), t, mask)
         else:
-            t.masked_fill_(mask == 0, -100)
+            y.masked_fill_((mask == 0)[:,None,:], -100)
+            # f_mask = mask.float()
+            # t = f_mask * t + (1-f_mask) * -100
             loss = F.cross_entropy(y, t, ignore_index=-100)
         return loss
 
@@ -130,13 +132,3 @@ class IDCNN(nn.Module):
             C.OUTPUT: pred,
         }
 
-    def predict(self, words, seq_len, chars=None):
-        res = self.forward(
-            words=words,
-            seq_len=seq_len,
-            chars=chars,
-            target=None
-        )[C.OUTPUT]
-        return {
-            C.OUTPUT: res
-        }
@@ -11,9 +11,8 @@ from fastNLP import Const
 class CNNBiLSTMCRF(nn.Module):
     def __init__(self, embed, char_embed, hidden_size, num_layers, tag_vocab, dropout=0.5, encoding_type='bioes'):
         super().__init__()
-
-        self.embedding = Embedding(embed, dropout=0.5, dropout_word=0)
-        self.char_embedding = Embedding(char_embed, dropout=0.5, dropout_word=0.01)
+        self.embedding = embed
+        self.char_embedding = char_embed
         self.lstm = LSTM(input_size=self.embedding.embedding_dim+self.char_embedding.embedding_dim,
                          hidden_size=hidden_size//2, num_layers=num_layers,
                          bidirectional=True, batch_first=True)
@ -33,24 +32,24 @@ class CNNBiLSTMCRF(nn.Module):
|
||||
if 'crf' in name:
|
||||
nn.init.zeros_(param)
|
||||
|
||||
def _forward(self, words, cap_words, seq_len, target=None):
|
||||
words = self.embedding(words)
|
||||
chars = self.char_embedding(cap_words)
|
||||
words = torch.cat([words, chars], dim=-1)
|
||||
def _forward(self, words, seq_len, target=None):
|
||||
word_embeds = self.embedding(words)
|
||||
char_embeds = self.char_embedding(words)
|
||||
words = torch.cat((word_embeds, char_embeds), dim=-1)
|
||||
outputs, _ = self.lstm(words, seq_len)
|
||||
self.dropout(outputs)
|
||||
|
||||
logits = F.log_softmax(self.fc(outputs), dim=-1)
|
||||
|
||||
if target is not None:
|
||||
loss = self.crf(logits, target, seq_len_to_mask(seq_len))
|
||||
loss = self.crf(logits, target, seq_len_to_mask(seq_len, max_len=logits.size(1))).mean()
|
||||
return {Const.LOSS: loss}
|
||||
else:
|
||||
pred, _ = self.crf.viterbi_decode(logits, seq_len_to_mask(seq_len))
|
||||
pred, _ = self.crf.viterbi_decode(logits, seq_len_to_mask(seq_len, max_len=logits.size(1)))
|
||||
return {Const.OUTPUT: pred}
|
||||
|
||||
def forward(self, words, cap_words, seq_len, target):
|
||||
return self._forward(words, cap_words, seq_len, target)
|
||||
def forward(self, words, seq_len, target):
|
||||
return self._forward(words, seq_len, target)
|
||||
|
||||
def predict(self, words, cap_words, seq_len):
|
||||
return self._forward(words, cap_words, seq_len, None)
|
||||
def predict(self, words, seq_len):
|
||||
return self._forward(words, seq_len, None)
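The change above passes `max_len=logits.size(1)` to `seq_len_to_mask`, presumably so that the mask always has the same width as the time dimension of the logits. The sketch below is a plain-Python stand-in that shows the effect of such a `max_len` argument; `seq_len_to_mask_sketch` is a hypothetical helper, not fastNLP's implementation.

```python
from typing import List, Optional


def seq_len_to_mask_sketch(seq_len: List[int],
                           max_len: Optional[int] = None) -> List[List[bool]]:
    """Build a boolean padding mask: True marks a real token, False marks padding."""
    # Without max_len, the mask is only as wide as the longest sequence in the batch.
    width = max(seq_len) if max_len is None else max_len
    return [[pos < length for pos in range(width)] for length in seq_len]


mask_default = seq_len_to_mask_sketch([3, 1])
mask_padded = seq_len_to_mask_sketch([3, 1], max_len=5)
print(len(mask_default[0]), len(mask_padded[0]))
```

If the logits are padded to a length greater than the longest sequence in the batch, only the second form produces a mask of matching shape.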
|
||||
|
@ -1,6 +1,7 @@
|
||||
import sys
|
||||
sys.path.append('../../..')
|
||||
|
||||
|
||||
from fastNLP.modules.encoder.embedding import CNNCharEmbedding, StaticEmbedding, BertEmbedding, ElmoEmbedding, LSTMCharEmbedding
|
||||
from fastNLP.modules.encoder.embedding import CNNCharEmbedding, StaticEmbedding, BertEmbedding, ElmoEmbedding, StackEmbedding
|
||||
from fastNLP.core.vocabulary import VocabularyOption
|
||||
|
||||
from reproduction.seqence_labelling.ner.model.lstm_cnn_crf import CNNBiLSTMCRF
|
||||
@ -12,7 +13,10 @@ from torch.optim import SGD, Adam
|
||||
from fastNLP import GradientClipCallback
|
||||
from fastNLP.core.callback import FitlogCallback, LRScheduler
|
||||
from torch.optim.lr_scheduler import LambdaLR
|
||||
from reproduction.seqence_labelling.ner.model.swats import SWATS
|
||||
from fastNLP.core.optimizer import AdamW
|
||||
# from reproduction.seqence_labelling.ner.model.swats import SWATS
|
||||
from reproduction.seqence_labelling.chinese_ner.callbacks import SaveModelCallback
|
||||
from fastNLP import cache_results
|
||||
|
||||
import fitlog
|
||||
fitlog.debug()
|
||||
@ -20,17 +24,20 @@ fitlog.debug()
|
||||
from reproduction.seqence_labelling.ner.data.Conll2003Loader import Conll2003DataLoader
|
||||
|
||||
encoding_type = 'bioes'
|
||||
|
||||
data = Conll2003DataLoader(encoding_type=encoding_type).process('../../../../others/data/conll2003',
|
||||
word_vocab_opt=VocabularyOption(min_freq=2),
|
||||
lower=False)
|
||||
@cache_results('caches/upper_conll2003.pkl')
|
||||
def load_data():
|
||||
data = Conll2003DataLoader(encoding_type=encoding_type).process('../../../../others/data/conll2003',
|
||||
word_vocab_opt=VocabularyOption(min_freq=1),
|
||||
lower=False)
|
||||
return data
|
||||
data = load_data()
|
||||
print(data)
|
||||
char_embed = CNNCharEmbedding(vocab=data.vocabs['cap_words'], embed_size=30, char_emb_size=30, filter_nums=[30],
|
||||
kernel_sizes=[3])
|
||||
char_embed = CNNCharEmbedding(vocab=data.vocabs['words'], embed_size=30, char_emb_size=30, filter_nums=[30],
|
||||
kernel_sizes=[3], word_dropout=0.01, dropout=0.5)
|
||||
# char_embed = LSTMCharEmbedding(vocab=data.vocabs['cap_words'], embed_size=30 ,char_emb_size=30)
|
||||
word_embed = StaticEmbedding(vocab=data.vocabs[Const.INPUT],
|
||||
model_dir_or_name='/hdd/fudanNLP/pretrain_vectors/wiki_en_100_50_case_2.txt',
|
||||
requires_grad=True)
|
||||
word_embed = StaticEmbedding(vocab=data.vocabs['words'],
|
||||
model_dir_or_name='/hdd/fudanNLP/pretrain_vectors/glove.6B.100d.txt',
|
||||
requires_grad=True, lower=True, word_dropout=0.01, dropout=0.5)
|
||||
word_embed.embedding.weight.data = word_embed.embedding.weight.data/word_embed.embedding.weight.data.std()
|
||||
|
||||
# import joblib
|
||||
@ -46,25 +53,28 @@ word_embed.embedding.weight.data = word_embed.embedding.weight.data/word_embed.e
|
||||
# for name, dataset in data.datasets.items():
|
||||
# dataset.apply_field(convert_to_ids, field_name='raw_words', new_field_name=Const.INPUT)
|
||||
|
||||
# word_embed = ElmoEmbedding(vocab=data.vocabs['cap_words'],
|
||||
# model_dir_or_name='/hdd/fudanNLP/fastNLP/others/pretrained_models/elmo_en',
|
||||
# requires_grad=True)
|
||||
# elmo_embed = ElmoEmbedding(vocab=data.vocabs['cap_words'],
|
||||
# model_dir_or_name='.',
|
||||
# requires_grad=True, layers='mix')
|
||||
# char_embed = StackEmbedding([elmo_embed, char_embed])
|
||||
|
||||
model = CNNBiLSTMCRF(word_embed, char_embed, hidden_size=200, num_layers=1, tag_vocab=data.vocabs[Const.TARGET],
|
||||
encoding_type=encoding_type)
|
||||
|
||||
callbacks = [
|
||||
GradientClipCallback(clip_type='value', clip_value=5)
|
||||
, FitlogCallback({'test':data.datasets['test']}, verbose=1)
|
||||
GradientClipCallback(clip_type='value', clip_value=5),
|
||||
FitlogCallback({'test':data.datasets['test']}, verbose=1),
|
||||
# SaveModelCallback('save_models/', top=3, only_param=False, save_on_exception=True)
|
||||
]
|
||||
# optimizer = Adam(model.parameters(), lr=0.005)
|
||||
optimizer = SWATS(model.parameters(), verbose=True)
|
||||
# optimizer = SGD(model.parameters(), lr=0.008, momentum=0.9)
|
||||
# scheduler = LRScheduler(LambdaLR(optimizer, lr_lambda=lambda epoch: 1 / (1 + 0.05 * epoch)))
|
||||
# callbacks.append(scheduler)
|
||||
# optimizer = Adam(model.parameters(), lr=0.001)
|
||||
# optimizer = SWATS(model.parameters(), verbose=True)
|
||||
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
|
||||
scheduler = LRScheduler(LambdaLR(optimizer, lr_lambda=lambda epoch: 1 / (1 + 0.05 * epoch)))
|
||||
callbacks.append(scheduler)
|
||||
|
||||
trainer = Trainer(train_data=data.datasets['train'], model=model, optimizer=optimizer, sampler=BucketSampler(),
|
||||
device=1, dev_data=data.datasets['dev'], batch_size=10,
|
||||
|
||||
trainer = Trainer(train_data=data.datasets['train'], model=model, optimizer=optimizer, sampler=BucketSampler(batch_size=20),
|
||||
device=1, dev_data=data.datasets['dev'], batch_size=20,
|
||||
metrics=SpanFPreRecMetric(tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type),
|
||||
callbacks=callbacks, num_workers=1, n_epochs=100)
|
||||
callbacks=callbacks, num_workers=2, n_epochs=100)
|
||||
trainer.train()
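The script above enables a `LambdaLR` schedule with `lr_lambda=lambda epoch: 1 / (1 + 0.05 * epoch)`. PyTorch's `LambdaLR` multiplies the initial learning rate by the value this function returns, so the effective rate can be previewed without PyTorch. The sketch below assumes the base rate of 0.01 set on the SGD optimizer in the diff; `inverse_time_decay` is a hypothetical name.

```python
def inverse_time_decay(epoch: int, decay: float = 0.05) -> float:
    """Multiplicative factor applied to the base learning rate at a given epoch."""
    return 1.0 / (1.0 + decay * epoch)


base_lr = 0.01  # value passed to SGD in the diff above
# Effective learning rate every 20 epochs over the 100-epoch run.
schedule = {epoch: base_lr * inverse_time_decay(epoch) for epoch in range(0, 101, 20)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch:3d}: lr = {lr:.6f}")
```

With these settings the rate halves by epoch 20 and falls to one sixth of its initial value by epoch 100.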
|
@ -1,4 +1,5 @@
|
||||
from reproduction.seqence_labelling.ner.data.OntoNoteLoader import OntoNoteNERDataLoader
|
||||
from reproduction.seqence_labelling.ner.data.Conll2003Loader import Conll2003DataLoader
|
||||
from fastNLP.core.callback import FitlogCallback, LRScheduler
|
||||
from fastNLP import GradientClipCallback
|
||||
from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR
|
||||
@ -6,11 +7,14 @@ from torch.optim import SGD, Adam
|
||||
from fastNLP import Const
|
||||
from fastNLP import RandomSampler, BucketSampler
|
||||
from fastNLP import SpanFPreRecMetric
|
||||
from fastNLP import Trainer
|
||||
from fastNLP import Trainer, Tester
|
||||
from fastNLP.core.metrics import MetricBase
|
||||
from reproduction.seqence_labelling.ner.model.dilated_cnn import IDCNN
|
||||
from fastNLP.core.utils import Option
|
||||
from fastNLP.modules.encoder.embedding import CNNCharEmbedding, StaticEmbedding
|
||||
from fastNLP.core.utils import cache_results
|
||||
from fastNLP.core.vocabulary import VocabularyOption
|
||||
import fitlog
|
||||
import sys
|
||||
import torch.cuda
|
||||
import os
|
||||
@ -24,7 +28,6 @@ encoding_type = 'bioes'
|
||||
def get_path(path):
|
||||
return os.path.join(os.environ['HOME'], path)
|
||||
|
||||
data_path = get_path('workdir/datasets/ontonotes-v4')
|
||||
|
||||
ops = Option(
|
||||
batch_size=128,
|
||||
@ -33,34 +36,45 @@ ops = Option(
|
||||
repeats=3,
|
||||
num_layers=3,
|
||||
num_filters=400,
|
||||
use_crf=True,
|
||||
use_crf=False,
|
||||
gradient_clip=5,
|
||||
)
|
||||
|
||||
@cache_results('ontonotes-cache')
|
||||
@cache_results('ontonotes-case-cache')
|
||||
def load_data():
|
||||
|
||||
data = OntoNoteNERDataLoader(encoding_type=encoding_type).process(data_path,
|
||||
lower=True)
|
||||
print('loading data')
|
||||
data = OntoNoteNERDataLoader(encoding_type=encoding_type).process(
|
||||
paths = get_path('workdir/datasets/ontonotes-v4'),
|
||||
lower=False,
|
||||
word_vocab_opt=VocabularyOption(min_freq=0),
|
||||
)
|
||||
# data = Conll2003DataLoader(task='ner', encoding_type=encoding_type).process(
|
||||
# paths=get_path('workdir/datasets/conll03'),
|
||||
# lower=False, word_vocab_opt=VocabularyOption(min_freq=0)
|
||||
# )
|
||||
|
||||
# char_embed = CNNCharEmbedding(vocab=data.vocabs['cap_words'], embed_size=30, char_emb_size=30, filter_nums=[30],
|
||||
# kernel_sizes=[3])
|
||||
|
||||
print('loading embedding')
|
||||
word_embed = StaticEmbedding(vocab=data.vocabs[Const.INPUT],
|
||||
model_dir_or_name='en-glove-840b-300',
|
||||
requires_grad=True)
|
||||
return data, [word_embed]
|
||||
|
||||
data, embeds = load_data()
|
||||
print(data)
|
||||
print(data.datasets['train'][0])
|
||||
print(list(data.vocabs.keys()))
|
||||
|
||||
for ds in data.datasets.values():
|
||||
ds.rename_field('cap_words', 'chars')
|
||||
ds.set_input('chars')
|
||||
# for ds in data.datasets.values():
|
||||
# ds.rename_field('cap_words', 'chars')
|
||||
# ds.set_input('chars')
|
||||
|
||||
word_embed = embeds[0]
|
||||
char_embed = CNNCharEmbedding(data.vocabs['cap_words'])
|
||||
word_embed.embedding.weight.data /= word_embed.embedding.weight.data.std()
|
||||
|
||||
# char_embed = CNNCharEmbedding(data.vocabs['cap_words'])
|
||||
char_embed = None
|
||||
# for ds in data.datasets:
|
||||
# ds.rename_field('')
|
||||
|
||||
@ -75,14 +89,44 @@ model = IDCNN(init_embed=word_embed,
|
||||
kernel_size=3,
|
||||
use_crf=ops.use_crf, use_projection=True,
|
||||
block_loss=True,
|
||||
input_dropout=0.33, hidden_dropout=0.2, inner_dropout=0.2)
|
||||
input_dropout=0.5, hidden_dropout=0.2, inner_dropout=0.2)
|
||||
|
||||
print(model)
|
||||
|
||||
callbacks = [GradientClipCallback(clip_value=ops.gradient_clip, clip_type='norm'),]
|
||||
callbacks = [GradientClipCallback(clip_value=ops.gradient_clip, clip_type='value'),]
|
||||
metrics = []
|
||||
metrics.append(
|
||||
SpanFPreRecMetric(
|
||||
tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type,
|
||||
pred=Const.OUTPUT, target=Const.TARGET, seq_len=Const.INPUT_LEN,
|
||||
)
|
||||
)
|
||||
|
||||
class LossMetric(MetricBase):
|
||||
def __init__(self, loss=None):
|
||||
super(LossMetric, self).__init__()
|
||||
self._init_param_map(loss=loss)
|
||||
self.total_loss = 0.0
|
||||
self.steps = 0
|
||||
|
||||
def evaluate(self, loss):
|
||||
self.total_loss += float(loss)
|
||||
self.steps += 1
|
||||
|
||||
def get_metric(self, reset=True):
|
||||
result = {'loss': self.total_loss / (self.steps + 1e-12)}
|
||||
if reset:
|
||||
self.total_loss = 0.0
|
||||
self.steps = 0
|
||||
return result
|
||||
|
||||
metrics.append(
|
||||
LossMetric(loss=Const.LOSS)
|
||||
)
|
||||
|
||||
optimizer = Adam(model.parameters(), lr=ops.lr, weight_decay=0)
|
||||
# scheduler = LRScheduler(LambdaLR(optimizer, lr_lambda=lambda epoch: 1 / (1 + 0.05 * epoch)))
|
||||
scheduler = LRScheduler(LambdaLR(optimizer, lr_lambda=lambda epoch: 1 / (1 + 0.05 * epoch)))
|
||||
callbacks.append(scheduler)
|
||||
# callbacks.append(LRScheduler(CosineAnnealingLR(optimizer, 15)))
|
||||
# optimizer = SWATS(model.parameters(), verbose=True)
|
||||
# optimizer = Adam(model.parameters(), lr=0.005)
|
||||
@ -92,8 +136,20 @@ device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
|
||||
trainer = Trainer(train_data=data.datasets['train'], model=model, optimizer=optimizer,
|
||||
sampler=BucketSampler(num_buckets=50, batch_size=ops.batch_size),
|
||||
device=device, dev_data=data.datasets['dev'], batch_size=ops.batch_size,
|
||||
metrics=SpanFPreRecMetric(
|
||||
tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type),
|
||||
metrics=metrics,
|
||||
check_code_level=-1,
|
||||
callbacks=callbacks, num_workers=2, n_epochs=ops.num_epochs)
|
||||
trainer.train()
|
||||
|
||||
torch.save(model, 'idcnn.pt')
|
||||
|
||||
tester = Tester(
|
||||
data=data.datasets['test'],
|
||||
model=model,
|
||||
metrics=metrics,
|
||||
batch_size=ops.batch_size,
|
||||
num_workers=2,
|
||||
device=device
|
||||
)
|
||||
tester.test()
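The `LossMetric` class added in this script accumulates the per-batch loss and reports its mean. The same bookkeeping can be sketched without fastNLP; `RunningLoss` is a hypothetical stand-in that copies the accumulate, average, and reset logic but does not subclass `MetricBase` or use its parameter mapping.

```python
class RunningLoss:
    """Accumulate per-batch losses and report their mean (illustrative only)."""

    def __init__(self) -> None:
        self.total_loss = 0.0
        self.steps = 0

    def evaluate(self, loss: float) -> None:
        # Called once per batch with that batch's scalar loss.
        self.total_loss += float(loss)
        self.steps += 1

    def get_metric(self, reset: bool = True) -> dict:
        # The small epsilon avoids division by zero when no batch was seen.
        result = {"loss": self.total_loss / (self.steps + 1e-12)}
        if reset:
            self.total_loss = 0.0
            self.steps = 0
        return result


metric = RunningLoss()
for batch_loss in (1.0, 2.0, 3.0):
    metric.evaluate(batch_loss)
result = metric.get_metric()
print(result)
```

Because of the epsilon in the denominator, the reported value is very slightly below the exact mean, and an empty metric reports 0.0 instead of raising an error.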
|
||||
|
||||
|
@ -7,9 +7,9 @@ dpcnn: paper link [Deep Pyramid Convolutional Neural Networks for TextCategoriza
|
||||
|
||||
HAN: paper link [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf)
|
||||
|
||||
LSTM+self_attention: paper link [A Structured Self-attentive Sentence Embedding](<https://arxiv.org/pdf/1703.03130.pdf>)
|
||||
LSTM+self_attention: paper link [A Structured Self-attentive Sentence Embedding](https://arxiv.org/pdf/1703.03130.pdf)
|
||||
|
||||
AWD-LSTM: paper link [Regularizing and Optimizing LSTM Language Models](<https://arxiv.org/pdf/1708.02182.pdf>)
|
||||
AWD-LSTM: paper link [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/pdf/1708.02182.pdf)
|
||||
|
||||
# Summary of datasets and reproduction results
|
||||
|
||||
|
@ -1,6 +1,6 @@
|
||||
from fastNLP.io.embed_loader import EmbeddingOption, EmbedLoader
|
||||
from fastNLP.core.vocabulary import VocabularyOption
|
||||
from fastNLP.io.base_loader import DataSetLoader, DataInfo
|
||||
from fastNLP.io.base_loader import DataSetLoader, DataBundle
|
||||
from typing import Union, Dict, List, Iterator
|
||||
from fastNLP import DataSet
|
||||
from fastNLP import Instance
|
||||
@ -50,7 +50,7 @@ class IMDBLoader(DataSetLoader):
|
||||
char_level_op=False):
|
||||
|
||||
datasets = {}
|
||||
info = DataInfo()
|
||||
info = DataBundle()
|
||||
for name, path in paths.items():
|
||||
dataset = self.load(path)
|
||||
datasets[name] = dataset
|
||||
|
@ -1,6 +1,6 @@
|
||||
from fastNLP.io.embed_loader import EmbeddingOption, EmbedLoader
|
||||
from fastNLP.core.vocabulary import VocabularyOption
|
||||
from fastNLP.io.base_loader import DataSetLoader, DataInfo
|
||||
from fastNLP.io.base_loader import DataSetLoader, DataBundle
|
||||
from typing import Union, Dict, List, Iterator
|
||||
from fastNLP import DataSet
|
||||
from fastNLP import Instance
|
||||
@ -47,7 +47,7 @@ class MTL16Loader(DataSetLoader):
|
||||
|
||||
paths = check_dataloader_paths(paths)
|
||||
datasets = {}
|
||||
info = DataInfo()
|
||||
info = DataBundle()
|
||||
for name, path in paths.items():
|
||||
dataset = self.load(path)
|
||||
datasets[name] = dataset
|
||||
|
@ -1,6 +1,6 @@
|
||||
from typing import Iterable
|
||||
from nltk import Tree
|
||||
from fastNLP.io.base_loader import DataInfo, DataSetLoader
|
||||
from fastNLP.io.base_loader import DataBundle, DataSetLoader
|
||||
from fastNLP.core.vocabulary import VocabularyOption, Vocabulary
|
||||
from fastNLP import DataSet
|
||||
from fastNLP import Instance
|
||||
@ -68,7 +68,7 @@ class SSTLoader(DataSetLoader):
|
||||
tgt_vocab = Vocabulary(unknown=None, padding=None) \
|
||||
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
|
||||
|
||||
info = DataInfo(datasets=self.load(paths))
|
||||
info = DataBundle(datasets=self.load(paths))
|
||||
_train_ds = [info.datasets[name]
|
||||
for name in train_ds] if train_ds else info.datasets.values()
|
||||
src_vocab.from_dataset(*_train_ds, field_name=input_name)
|
||||
@ -134,7 +134,7 @@ class sst2Loader(DataSetLoader):
|
||||
|
||||
paths = check_dataloader_paths(paths)
|
||||
datasets = {}
|
||||
info = DataInfo()
|
||||
info = DataBundle()
|
||||
for name, path in paths.items():
|
||||
dataset = self.load(path)
|
||||
datasets[name] = dataset
|
||||
|
@ -4,7 +4,7 @@ from typing import Iterable
|
||||
from fastNLP import DataSet, Instance, Vocabulary
|
||||
from fastNLP.core.vocabulary import VocabularyOption
|
||||
from fastNLP.io import JsonLoader
|
||||
from fastNLP.io.base_loader import DataInfo,DataSetLoader
|
||||
from fastNLP.io.base_loader import DataBundle,DataSetLoader
|
||||
from fastNLP.io.embed_loader import EmbeddingOption
|
||||
from fastNLP.io.file_reader import _read_json
|
||||
from typing import Union, Dict
|
||||
@ -134,7 +134,7 @@ class yelpLoader(DataSetLoader):
|
||||
char_level_op=False):
|
||||
paths = check_dataloader_paths(paths)
|
||||
datasets = {}
|
||||
info = DataInfo(datasets=self.load(paths))
|
||||
info = DataBundle(datasets=self.load(paths))
|
||||
src_vocab = Vocabulary() if src_vocab_op is None else Vocabulary(**src_vocab_op)
|
||||
tgt_vocab = Vocabulary(unknown=None, padding=None) \
|
||||
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
|
||||
|
@ -11,7 +11,7 @@ from reproduction.text_classification.model.dpcnn import DPCNN
|
||||
from data.yelpLoader import yelpLoader
|
||||
from fastNLP.core.sampler import BucketSampler
|
||||
import torch.nn as nn
|
||||
from fastNLP.core import LRScheduler
|
||||
from fastNLP.core import LRScheduler, Callback
|
||||
from fastNLP.core.const import Const as C
|
||||
from fastNLP.core.vocabulary import VocabularyOption
|
||||
from utils.util_init import set_rng_seeds
|
||||
@ -25,14 +25,14 @@ os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
|
||||
|
||||
class Config():
|
||||
seed = 12345
|
||||
model_dir_or_name = "dpcnn-yelp-p"
|
||||
model_dir_or_name = "dpcnn-yelp-f"
|
||||
embedding_grad = True
|
||||
train_epoch = 30
|
||||
batch_size = 100
|
||||
task = "yelp_p"
|
||||
task = "yelp_f"
|
||||
#datadir = 'workdir/datasets/SST'
|
||||
datadir = 'workdir/datasets/yelp_polarity'
|
||||
# datadir = 'workdir/datasets/yelp_full'
|
||||
# datadir = 'workdir/datasets/yelp_polarity'
|
||||
datadir = 'workdir/datasets/yelp_full'
|
||||
#datafile = {"train": "train.txt", "dev": "dev.txt", "test": "test.txt"}
|
||||
datafile = {"train": "train.csv", "test": "test.csv"}
|
||||
lr = 1e-3
|
||||
@ -73,6 +73,8 @@ def load_data():
|
||||
|
||||
|
||||
datainfo, embedding = load_data()
|
||||
embedding.embedding.weight.data /= embedding.embedding.weight.data.std()
|
||||
print(embedding.embedding.weight.mean(), embedding.embedding.weight.std())
|
||||
|
||||
# 2. Or directly reuse a fastNLP model
|
||||
|
||||
@ -92,11 +94,12 @@ optimizer = SGD([param for param in model.parameters() if param.requires_grad ==
|
||||
lr=ops.lr, momentum=0.9, weight_decay=ops.weight_decay)
|
||||
|
||||
callbacks = []
|
||||
# callbacks.append(LRScheduler(CosineAnnealingLR(optimizer, 5)))
|
||||
callbacks.append(
|
||||
LRScheduler(LambdaLR(optimizer, lambda epoch: ops.lr if epoch <
|
||||
ops.train_epoch * 0.8 else ops.lr * 0.1))
|
||||
)
|
||||
|
||||
callbacks.append(LRScheduler(CosineAnnealingLR(optimizer, 5)))
|
||||
# callbacks.append(
|
||||
# LRScheduler(LambdaLR(optimizer, lambda epoch: ops.lr if epoch <
|
||||
# ops.train_epoch * 0.8 else ops.lr * 0.1))
|
||||
# )
|
||||
|
||||
# callbacks.append(
|
||||
# FitlogCallback(data=datainfo.datasets, verbose=1)
|
||||
|
@ -3,3 +3,4 @@ torch>=1.0.0
|
||||
tqdm>=4.28.1
|
||||
nltk>=3.4.1
|
||||
requests
|
||||
spacy
|
||||
|
@ -88,6 +88,27 @@ class TestAdd(unittest.TestCase):
|
||||
for i in range(num_samples):
|
||||
self.assertEqual(True, vocab._is_word_no_create_entry(chr(start_char + i)+chr(start_char + i)))
|
||||
|
||||
def test_no_entry(self):
|
||||
# Build the vocabulary first, then change no_create_entry and check that it is recognized correctly
|
||||
text = ["FastNLP", "works", "well", "in", "most", "cases", "and", "scales", "well", "in",
|
||||
"works", "well", "in", "most", "cases", "scales", "well"]
|
||||
vocab = Vocabulary()
|
||||
vocab.add_word_lst(text)
|
||||
|
||||
self.assertFalse(vocab._is_word_no_create_entry('FastNLP'))
|
||||
vocab.add_word('FastNLP', no_create_entry=True)
|
||||
self.assertFalse(vocab._is_word_no_create_entry('FastNLP'))
|
||||
|
||||
vocab.add_word('fastnlp', no_create_entry=True)
|
||||
self.assertTrue(vocab._is_word_no_create_entry('fastnlp'))
|
||||
vocab.add_word('fastnlp', no_create_entry=False)
|
||||
self.assertFalse(vocab._is_word_no_create_entry('fastnlp'))
|
||||
|
||||
vocab.add_word_lst(['1']*10, no_create_entry=True)
|
||||
self.assertTrue(vocab._is_word_no_create_entry('1'))
|
||||
vocab.add_word('1')
|
||||
self.assertFalse(vocab._is_word_no_create_entry('1'))
|
||||
|
||||
|
||||
class TestIndexing(unittest.TestCase):
|
||||
def test_len(self):
|
||||
@ -127,6 +148,21 @@ class TestIndexing(unittest.TestCase):
|
||||
self.assertTrue(word in text)
|
||||
self.assertTrue(idx < len(vocab))
|
||||
|
||||
def test_rebuild(self):
|
||||
# Check that adding new words after build leaves the order of the existing words unchanged
|
||||
vocab = Vocabulary()
|
||||
text = [str(idx) for idx in range(10)]
|
||||
vocab.update(text)
|
||||
for i in text:
|
||||
self.assertEqual(int(i)+2, vocab.to_index(i))
|
||||
indexes = []
|
||||
for word, index in vocab:
|
||||
indexes.append((word, index))
|
||||
vocab.add_word_lst([str(idx) for idx in range(10, 13)])
|
||||
for idx, pair in enumerate(indexes):
|
||||
self.assertEqual(pair[1], vocab.to_index(pair[0]))
|
||||
for i in range(13):
|
||||
self.assertEqual(int(i)+2, vocab.to_index(str(i)))
|
||||
|
||||
class TestOther(unittest.TestCase):
|
||||
def test_additional_update(self):
|
||||
|
@ -1,8 +1,7 @@
|
||||
import unittest
|
||||
import os
|
||||
from fastNLP.io import Conll2003Loader, PeopleDailyCorpusLoader, CSVLoader, JsonLoader
|
||||
from fastNLP.io.data_loader import SSTLoader, SNLILoader
|
||||
from reproduction.text_classification.data.yelpLoader import yelpLoader
|
||||
from fastNLP.io import CSVLoader, JsonLoader
|
||||
from fastNLP.io.data_loader import SSTLoader, SNLILoader, Conll2003Loader, PeopleDailyCorpusLoader
|
||||
|
||||
|
||||
class TestDatasetLoader(unittest.TestCase):
|
||||
@ -31,7 +30,7 @@ class TestDatasetLoader(unittest.TestCase):
|
||||
ds = JsonLoader().load('test/data_for_tests/sample_snli.jsonl')
|
||||
assert len(ds) == 3
|
||||
|
||||
def test_SST(self):
|
||||
def no_test_SST(self):
|
||||
train_data = """(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))
|
||||
(4 (4 (4 (2 The) (4 (3 gorgeously) (3 (2 elaborate) (2 continuation)))) (2 (2 (2 of) (2 ``)) (2 (2 The) (2 (2 (2 Lord) (2 (2 of) (2 (2 the) (2 Rings)))) (2 (2 '') (2 trilogy)))))) (2 (3 (2 (2 is) (2 (2 so) (2 huge))) (2 (2 that) (3 (2 (2 (2 a) (2 column)) (2 (2 of) (2 words))) (2 (2 (2 (2 can) (1 not)) (3 adequately)) (2 (2 describe) (2 (3 (2 (2 co-writer\/director) (2 (2 Peter) (3 (2 Jackson) (2 's)))) (3 (2 expanded) (2 vision))) (2 (2 of) (2 (2 (2 J.R.R.) (2 (2 Tolkien) (2 's))) (2 Middle-earth))))))))) (2 .)))
|
||||
(3 (3 (2 (2 (2 (2 (2 Singer\/composer) (2 (2 Bryan) (2 Adams))) (2 (2 contributes) (2 (2 (2 a) (2 slew)) (2 (2 of) (2 songs))))) (2 (2 --) (2 (2 (2 (2 a) (2 (2 few) (3 potential))) (2 (2 (2 hits) (2 ,)) (2 (2 (2 a) (2 few)) (1 (1 (2 more) (1 (2 simply) (2 intrusive))) (2 (2 to) (2 (2 the) (2 story))))))) (2 --)))) (2 but)) (3 (4 (2 the) (3 (2 whole) (2 package))) (2 (3 certainly) (3 (2 captures) (2 (1 (2 the) (2 (2 (2 intended) (2 (2 ,) (2 (2 er) (2 ,)))) (3 spirit))) (2 (2 of) (2 (2 the) (2 piece)))))))) (2 .))
|
||||
@ -65,6 +64,12 @@ class TestDatasetLoader(unittest.TestCase):
|
||||
def test_import(self):
|
||||
import fastNLP
|
||||
from fastNLP.io import SNLILoader
|
||||
ds = SNLILoader().process('test/data_for_tests/sample_snli.jsonl', to_lower=True,
|
||||
get_index=True, seq_len_type='seq_len', extra_split=['-'])
|
||||
assert 'train' in ds.datasets
|
||||
assert len(ds.datasets) == 1
|
||||
assert len(ds.datasets['train']) == 3
|
||||
|
||||
ds = SNLILoader().process('test/data_for_tests/sample_snli.jsonl', to_lower=True,
|
||||
get_index=True, seq_len_type='seq_len')
|
||||
assert 'train' in ds.datasets
|
||||
|