diff --git a/.gitignore b/.gitignore
new file mode 100644
index 00000000..2b2b2b35
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,16 @@
+.gitignore
+
+.DS_Store
+.ipynb_checkpoints
+*.pyc
+__pycache__
+*.swp
+.vscode/
+.idea/**
+
+caches
+
+# fitlog
+.fitlog
+logs/
+.fitconfig
diff --git a/.travis.yml b/.travis.yml
index 559fc86e..210d158a 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -8,7 +8,7 @@ install:
- pip install pytest-cov
# command to run tests
script:
- - pytest --cov=./
+ - pytest --cov=./ test/
after_success:
- bash <(curl -s https://codecov.io/bash)
diff --git a/README.md b/README.md
index 9d949482..b35776dc 100644
--- a/README.md
+++ b/README.md
@@ -6,48 +6,69 @@
![Hex.pm](https://img.shields.io/hexpm/l/plug.svg)
[![Documentation Status](https://readthedocs.org/projects/fastnlp/badge/?version=latest)](http://fastnlp.readthedocs.io/?badge=latest)
-fastNLP 是一款轻量级的 NLP 处理套件。你既可以使用它快速地完成一个命名实体识别(NER)、中文分词或文本分类任务; 也可以使用他构建许多复杂的网络模型,进行科研。它具有如下的特性:
+fastNLP is a lightweight NLP toolkit. You can use it to quickly build a system for sequence labelling ([NER](reproduction/seqence_labelling/ner), POS tagging, etc.), Chinese word segmentation, [text classification](reproduction/text_classification), [matching](reproduction/matching), [coreference resolution](reproduction/coreference_resolution), [summarization](reproduction/Summarization) and other tasks, or use it to construct complex network models for research. Its main features are:
-- 统一的Tabular式数据容器,让数据预处理过程简洁明了。内置多种数据集的DataSet Loader,省去预处理代码。
-- 各种方便的NLP工具,例如预处理embedding加载; 中间数据cache等;
-- 详尽的中文文档以供查阅;
+- A unified tabular data container that makes preprocessing straightforward, with built-in DataSet loaders for many datasets that save you the preprocessing code;
+- A rich set of training and testing components, such as the trainer Trainer, the tester Tester and various evaluation metrics;
+- Handy NLP utilities, such as pretrained embedding loading (including ELMo and BERT) and caching of intermediate data;
+- Detailed Chinese [documentation](https://fastnlp.readthedocs.io/) and [tutorials](https://fastnlp.readthedocs.io/zh/latest/user/tutorials.html);
- 提供诸多高级模块,例如Variational LSTM, Transformer, CRF等;
-- 封装CNNText,Biaffine等模型可供直接使用;
+- Ready-to-use models for sequence labelling, Chinese word segmentation, text classification, matching, coreference resolution, summarization and more; see the [reproduction](reproduction) section for details;
- 便捷且具有扩展性的训练器; 提供多种内置callback函数,方便实验记录、异常捕获等。
## 安装指南
-fastNLP 依赖如下包:
+fastNLP depends on the following packages:
-+ numpy
-+ torch>=0.4.0
-+ tqdm
-+ nltk
++ numpy>=1.14.2
++ torch>=1.0.0
++ tqdm>=4.28.1
++ nltk>=3.4.1
++ requests
++ spacy
-其中torch的安装可能与操作系统及 CUDA 的版本相关,请参见 PyTorch 官网 。
-在依赖包安装完成的情况,您可以在命令行执行如下指令完成安装
+Installing torch may depend on your operating system and CUDA version; please refer to the [PyTorch website](https://pytorch.org/).
+Once the dependencies are installed, you can install fastNLP from the command line:
```shell
pip install fastNLP
+python -m spacy download en
```
+The version currently installed via pip is 0.4.1, which lags behind in several features; the master branch always has the latest content.
+fastNLP 0.5.0 will be released soon; stay tuned.
-## 参考资源
-- [文档](https://fastnlp.readthedocs.io/zh/latest/)
-- [源码](https://github.com/fastnlp/fastNLP)
+## fastNLP Tutorials
+
+- [0. Quick start](https://fastnlp.readthedocs.io/zh/latest/user/quickstart.html)
+- [1. Preprocessing text with DataSet](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_1_data_preprocess.html)
+- [2. Loading datasets with DataSetLoader](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_2_load_dataset.html)
+- [3. Turning text into vectors with the Embedding module](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_3_embedding.html)
+- [4. Building a text classifier I: quick training and testing with Trainer and Tester](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_4_loss_optimizer.html)
+- [5. Building a text classifier II: a custom training loop with DataSetIter](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_5_datasetiter.html)
+- [6. A sequence-labelling model in a few lines](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_6_seq_labeling.html)
+- [7. Building custom models quickly with Modules and Models](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_7_modules_models.html)
+- [8. Evaluating your model with Metric](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_8_metrics.html)
+- [9. Customizing the training loop with Callback](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_9_callback.html)
+- [10. Using fitlog with fastNLP for research](https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_10_fitlog.html)
## 内置组件
-大部分用于的 NLP 任务神经网络都可以看做由编码(encoder)、聚合(aggregator)、解码(decoder)三种模块组成。
+Most neural networks for NLP tasks can be viewed as a combination of word embeddings and two kinds of modules: encoders and decoders.
+
+Taking text classification as an example, the figure below shows the workflow of a BiLSTM + attention text classifier:
![](./docs/source/figures/text_classification.png)
-fastNLP 在 modules 模块中内置了三种模块的诸多组件,可以帮助用户快速搭建自己所需的网络。 三种模块的功能和常见组件如下:
+The embeddings module has several kinds of built-in embeddings: static embeddings (GloVe, word2vec), contextual embeddings
+(ELMo, BERT), and character embeddings (CNN- or LSTM-based CharEmbedding).
+
+In addition, the modules package provides many components for the two kinds of modules, helping you quickly assemble the network you need. Their functions and typical components are listed below:
@@ -57,29 +78,17 @@ fastNLP 在 modules 模块中内置了三种模块的诸多组件,可以帮助
encoder |
- 将输入编码为具有具 有表示能力的向量 |
+ encode the input into vectors with representational capacity |
embedding, RNN, CNN, transformer
|
-
- aggregator |
- 从多个向量中聚合信息 |
- self-attention, max-pooling |
-
decoder |
- 将具有某种表示意义的 向量解码为需要的输出 形式 |
+ decode representation vectors into the required output format |
MLP, CRF |
-## 完整模型
-fastNLP 为不同的 NLP 任务实现了许多完整的模型,它们都经过了训练和测试。
-
-你可以在以下两个地方查看相关信息
-- [介绍](reproduction/)
-- [源码](fastNLP/models/)
-
## 项目结构
![](./docs/source/figures/workflow.png)
@@ -93,7 +102,7 @@ fastNLP的大致工作流程如上图所示,而项目结构如下:
fastNLP.core |
- 实现了核心功能,包括数据处理组件、训练器、测速器等 |
+ implements the core functionality, including data-processing components, the trainer, the tester, etc. |
fastNLP.models |
@@ -103,6 +112,10 @@ fastNLP的大致工作流程如上图所示,而项目结构如下:
fastNLP.modules |
实现了用于搭建神经网络模型的诸多组件 |
+
+ fastNLP.embeddings |
+ turns index sequences into sequences of vectors, including loading pretrained embeddings, etc. |
+
fastNLP.io |
实现了读写功能,包括数据读入,模型读写等 |
diff --git a/docs/Makefile b/docs/Makefile
index 6ba2fa54..2b4de2d8 100644
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -19,6 +19,9 @@ apidoc:
server:
cd build/html && python -m http.server
+dev:
+ rm -rf build/html && make html && make server
+
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 00000000..15dcccda
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,41 @@
+# Quick Start: Writing fastNLP Documentation
+
+This guide is written for fastNLP documentation writers, which includes both contributing developers and documentation maintainers. In most cases you belong to the former group
+and only need to understand part of the overall framework.
+
+## Contributing Developers
+
+FastNLP's documentation is written in the [reStructuredText markup language](http://docutils.sourceforge.net/rst.html) and generated with
+[Sphinx](http://sphinx.pocoo.org/); it is built and hosted automatically by [Read the Docs](https://readthedocs.org/).
+To contribute to the fastNLP documentation, a regular developer only needs to write documentation that follows the reStructuredText syntax
+and submit it via a [PR](https://help.github.com/en/articles/about-pull-requests).
+
+If you want to build the documentation locally and write longer passages, you need to install Sphinx and the sphinx-rtd-theme theme:
+```bash
+fastNLP/docs> pip install sphinx
+fastNLP/docs> pip install sphinx-rtd-theme
+```
+Then run `make dev` in this directory. The command only works on Linux and macOS; you should see output like the following:
+```bash
+fastNLP/docs> make dev
+rm -rf build/html && make html && make server
+Running Sphinx v1.5.6
+making output directory...
+......
+Build finished. The HTML pages are in build/html.
+cd build/html && python -m http.server
+Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
+```
+You can now open http://localhost:8000/ in your browser to view the documentation. If you are working on a remote server, the address is http://{server ip}:8000/ instead,
+and you must make sure port 8000 is open on the server. If port 8000 is already in use on your machine or the server, the program falls back to 8001, 8002 and so on.
+When you are done, press Control (Ctrl) + C to stop the process.
+
+We list the reStructuredText constructs most often used in the fastNLP docs [here](./source/user/example.rst) (view the file in Raw mode on the web);
+reading it is a quick way to get started. Most of the fastNLP documentation is written inside the code and extracted by Sphinx;
+you can also read this [unfinished article](./source/user/docs_in_code.rst) for the conventions of in-code documentation.
+
+## Documentation Maintainers
+
+Documentation maintainers need to understand every command in the Makefile and be aware that the current documentation structure
+was produced by sphinx-apidoc and then adjusted by hand.
+Maintainers should keep improving the automation of the documentation build and make sure that contributors do not break the overall structure of the documentation project.
\ No newline at end of file
diff --git a/docs/make.bat b/docs/make.bat
deleted file mode 100644
index 1c651b1f..00000000
--- a/docs/make.bat
+++ /dev/null
@@ -1,36 +0,0 @@
-@ECHO OFF
-
-pushd %~dp0
-
-REM Command file for Sphinx documentation
-
-if "%SPHINXBUILD%" == "" (
- set SPHINXBUILD=sphinx-build
-)
-set SOURCEDIR=source
-set BUILDDIR=build
-set SPHINXPROJ=fastNLP
-
-if "%1" == "" goto help
-
-%SPHINXBUILD% >NUL 2>NUL
-if errorlevel 9009 (
- echo.
- echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
- echo.installed, then set the SPHINXBUILD environment variable to point
- echo.to the full path of the 'sphinx-build' executable. Alternatively you
- echo.may add the Sphinx directory to PATH.
- echo.
- echo.If you don't have Sphinx installed, grab it from
- echo.http://sphinx-doc.org/
- exit /b 1
-)
-
-%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
-goto end
-
-:help
-%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
-
-:end
-popd
diff --git a/docs/quick_tutorial.md b/docs/quick_tutorial.md
deleted file mode 100644
index 64c51124..00000000
--- a/docs/quick_tutorial.md
+++ /dev/null
@@ -1,2 +0,0 @@
-# FastNLP Quick Tutorial
-
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 3e9753af..2e10bc89 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -24,9 +24,9 @@ copyright = '2018, xpqiu'
author = 'xpqiu'
# The short X.Y version
-version = '0.4'
+version = '0.4.5'
# The full version, including alpha/beta/rc tags
-release = '0.4'
+release = '0.4.5'
# -- General configuration ---------------------------------------------------
diff --git a/docs/source/fastNLP.core.batch.rst b/docs/source/fastNLP.core.batch.rst
index 33a5b730..03008b52 100644
--- a/docs/source/fastNLP.core.batch.rst
+++ b/docs/source/fastNLP.core.batch.rst
@@ -2,6 +2,6 @@ fastNLP.core.batch
==================
.. automodule:: fastNLP.core.batch
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.callback.rst b/docs/source/fastNLP.core.callback.rst
index 31ec627b..74a7825d 100644
--- a/docs/source/fastNLP.core.callback.rst
+++ b/docs/source/fastNLP.core.callback.rst
@@ -2,6 +2,6 @@ fastNLP.core.callback
=====================
.. automodule:: fastNLP.core.callback
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.const.rst b/docs/source/fastNLP.core.const.rst
index c9e3bd97..330a8883 100644
--- a/docs/source/fastNLP.core.const.rst
+++ b/docs/source/fastNLP.core.const.rst
@@ -2,6 +2,6 @@ fastNLP.core.const
==================
.. automodule:: fastNLP.core.const
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.dataset.rst b/docs/source/fastNLP.core.dataset.rst
index b377cb0f..1ad94bb6 100644
--- a/docs/source/fastNLP.core.dataset.rst
+++ b/docs/source/fastNLP.core.dataset.rst
@@ -2,6 +2,6 @@ fastNLP.core.dataset
====================
.. automodule:: fastNLP.core.dataset
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.field.rst b/docs/source/fastNLP.core.field.rst
index 7686e79a..7fc099c9 100644
--- a/docs/source/fastNLP.core.field.rst
+++ b/docs/source/fastNLP.core.field.rst
@@ -2,6 +2,6 @@ fastNLP.core.field
==================
.. automodule:: fastNLP.core.field
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.instance.rst b/docs/source/fastNLP.core.instance.rst
index 14393a91..6e496ac1 100644
--- a/docs/source/fastNLP.core.instance.rst
+++ b/docs/source/fastNLP.core.instance.rst
@@ -2,6 +2,6 @@ fastNLP.core.instance
=====================
.. automodule:: fastNLP.core.instance
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.losses.rst b/docs/source/fastNLP.core.losses.rst
index d2dd492b..8e63dfa1 100644
--- a/docs/source/fastNLP.core.losses.rst
+++ b/docs/source/fastNLP.core.losses.rst
@@ -2,6 +2,6 @@ fastNLP.core.losses
===================
.. automodule:: fastNLP.core.losses
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.metrics.rst b/docs/source/fastNLP.core.metrics.rst
index 69afff36..d3b87bb8 100644
--- a/docs/source/fastNLP.core.metrics.rst
+++ b/docs/source/fastNLP.core.metrics.rst
@@ -2,6 +2,6 @@ fastNLP.core.metrics
====================
.. automodule:: fastNLP.core.metrics
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.optimizer.rst b/docs/source/fastNLP.core.optimizer.rst
index e2100d2e..c80be53f 100644
--- a/docs/source/fastNLP.core.optimizer.rst
+++ b/docs/source/fastNLP.core.optimizer.rst
@@ -2,6 +2,6 @@ fastNLP.core.optimizer
======================
.. automodule:: fastNLP.core.optimizer
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.rst b/docs/source/fastNLP.core.rst
index 82c13e46..cacc6622 100644
--- a/docs/source/fastNLP.core.rst
+++ b/docs/source/fastNLP.core.rst
@@ -2,15 +2,15 @@ fastNLP.core
============
.. automodule:: fastNLP.core
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
子模块
----------
.. toctree::
- :titlesonly:
+ :maxdepth: 1
fastNLP.core.batch
fastNLP.core.callback
@@ -26,4 +26,3 @@ fastNLP.core
fastNLP.core.trainer
fastNLP.core.utils
fastNLP.core.vocabulary
-
diff --git a/docs/source/fastNLP.core.sampler.rst b/docs/source/fastNLP.core.sampler.rst
index 1810d59c..0110f0c0 100644
--- a/docs/source/fastNLP.core.sampler.rst
+++ b/docs/source/fastNLP.core.sampler.rst
@@ -2,6 +2,6 @@ fastNLP.core.sampler
====================
.. automodule:: fastNLP.core.sampler
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.tester.rst b/docs/source/fastNLP.core.tester.rst
index a9e7e09f..4d71a27b 100644
--- a/docs/source/fastNLP.core.tester.rst
+++ b/docs/source/fastNLP.core.tester.rst
@@ -2,6 +2,6 @@ fastNLP.core.tester
===================
.. automodule:: fastNLP.core.tester
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.trainer.rst b/docs/source/fastNLP.core.trainer.rst
index 9e518d4b..60bf2d5b 100644
--- a/docs/source/fastNLP.core.trainer.rst
+++ b/docs/source/fastNLP.core.trainer.rst
@@ -2,6 +2,6 @@ fastNLP.core.trainer
====================
.. automodule:: fastNLP.core.trainer
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.utils.rst b/docs/source/fastNLP.core.utils.rst
index fcd3f50c..3f80b4e8 100644
--- a/docs/source/fastNLP.core.utils.rst
+++ b/docs/source/fastNLP.core.utils.rst
@@ -2,6 +2,6 @@ fastNLP.core.utils
==================
.. automodule:: fastNLP.core.utils
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.core.vocabulary.rst b/docs/source/fastNLP.core.vocabulary.rst
index b3bf4bac..ba9598b9 100644
--- a/docs/source/fastNLP.core.vocabulary.rst
+++ b/docs/source/fastNLP.core.vocabulary.rst
@@ -2,6 +2,6 @@ fastNLP.core.vocabulary
=======================
.. automodule:: fastNLP.core.vocabulary
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.embeddings.bert_embedding.rst b/docs/source/fastNLP.embeddings.bert_embedding.rst
new file mode 100644
index 00000000..24ceff1c
--- /dev/null
+++ b/docs/source/fastNLP.embeddings.bert_embedding.rst
@@ -0,0 +1,7 @@
+fastNLP.embeddings.bert\_embedding
+==================================
+
+.. automodule:: fastNLP.embeddings.bert_embedding
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.embeddings.char_embedding.rst b/docs/source/fastNLP.embeddings.char_embedding.rst
new file mode 100644
index 00000000..501089d8
--- /dev/null
+++ b/docs/source/fastNLP.embeddings.char_embedding.rst
@@ -0,0 +1,7 @@
+fastNLP.embeddings.char\_embedding
+==================================
+
+.. automodule:: fastNLP.embeddings.char_embedding
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.embeddings.elmo_embedding.rst b/docs/source/fastNLP.embeddings.elmo_embedding.rst
new file mode 100644
index 00000000..76669ee3
--- /dev/null
+++ b/docs/source/fastNLP.embeddings.elmo_embedding.rst
@@ -0,0 +1,7 @@
+fastNLP.embeddings.elmo\_embedding
+==================================
+
+.. automodule:: fastNLP.embeddings.elmo_embedding
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.embeddings.embedding.rst b/docs/source/fastNLP.embeddings.embedding.rst
new file mode 100644
index 00000000..5960d2cd
--- /dev/null
+++ b/docs/source/fastNLP.embeddings.embedding.rst
@@ -0,0 +1,7 @@
+fastNLP.embeddings.embedding
+============================
+
+.. automodule:: fastNLP.embeddings.embedding
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.embeddings.rst b/docs/source/fastNLP.embeddings.rst
new file mode 100644
index 00000000..6b168906
--- /dev/null
+++ b/docs/source/fastNLP.embeddings.rst
@@ -0,0 +1,21 @@
+fastNLP.embeddings
+==================
+
+.. automodule:: fastNLP.embeddings
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+子模块
+----------
+
+.. toctree::
+ :maxdepth: 1
+
+ fastNLP.embeddings.bert_embedding
+ fastNLP.embeddings.char_embedding
+ fastNLP.embeddings.elmo_embedding
+ fastNLP.embeddings.embedding
+ fastNLP.embeddings.stack_embedding
+ fastNLP.embeddings.static_embedding
+ fastNLP.embeddings.utils
diff --git a/docs/source/fastNLP.embeddings.stack_embedding.rst b/docs/source/fastNLP.embeddings.stack_embedding.rst
new file mode 100644
index 00000000..4d2115f7
--- /dev/null
+++ b/docs/source/fastNLP.embeddings.stack_embedding.rst
@@ -0,0 +1,7 @@
+fastNLP.embeddings.stack\_embedding
+===================================
+
+.. automodule:: fastNLP.embeddings.stack_embedding
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.embeddings.static_embedding.rst b/docs/source/fastNLP.embeddings.static_embedding.rst
new file mode 100644
index 00000000..e46de81a
--- /dev/null
+++ b/docs/source/fastNLP.embeddings.static_embedding.rst
@@ -0,0 +1,7 @@
+fastNLP.embeddings.static\_embedding
+====================================
+
+.. automodule:: fastNLP.embeddings.static_embedding
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.embeddings.utils.rst b/docs/source/fastNLP.embeddings.utils.rst
new file mode 100644
index 00000000..263bfbd6
--- /dev/null
+++ b/docs/source/fastNLP.embeddings.utils.rst
@@ -0,0 +1,7 @@
+fastNLP.embeddings.utils
+========================
+
+.. automodule:: fastNLP.embeddings.utils
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.io.base_loader.rst b/docs/source/fastNLP.io.base_loader.rst
index c1f9ac14..057867f4 100644
--- a/docs/source/fastNLP.io.base_loader.rst
+++ b/docs/source/fastNLP.io.base_loader.rst
@@ -2,6 +2,6 @@ fastNLP.io.base\_loader
=======================
.. automodule:: fastNLP.io.base_loader
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.io.data_loader.rst b/docs/source/fastNLP.io.data_loader.rst
new file mode 100644
index 00000000..8f990102
--- /dev/null
+++ b/docs/source/fastNLP.io.data_loader.rst
@@ -0,0 +1,7 @@
+fastNLP.io.data\_loader
+==========================
+
+.. automodule:: fastNLP.io.data_loader
+ :members:
+ :undoc-members:
+ :show-inheritance:
\ No newline at end of file
diff --git a/docs/source/fastNLP.io.dataset_loader.rst b/docs/source/fastNLP.io.dataset_loader.rst
index d6663e59..e7990714 100644
--- a/docs/source/fastNLP.io.dataset_loader.rst
+++ b/docs/source/fastNLP.io.dataset_loader.rst
@@ -2,6 +2,6 @@ fastNLP.io.dataset\_loader
==========================
.. automodule:: fastNLP.io.dataset_loader
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.io.embed_loader.rst b/docs/source/fastNLP.io.embed_loader.rst
index 7a8e730c..69e1f7ff 100644
--- a/docs/source/fastNLP.io.embed_loader.rst
+++ b/docs/source/fastNLP.io.embed_loader.rst
@@ -2,6 +2,6 @@ fastNLP.io.embed\_loader
========================
.. automodule:: fastNLP.io.embed_loader
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.io.model_io.rst b/docs/source/fastNLP.io.model_io.rst
index 50d4c25a..537ce752 100644
--- a/docs/source/fastNLP.io.model_io.rst
+++ b/docs/source/fastNLP.io.model_io.rst
@@ -2,6 +2,6 @@ fastNLP.io.model\_io
====================
.. automodule:: fastNLP.io.model_io
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.io.rst b/docs/source/fastNLP.io.rst
index fad05a21..a97ed67d 100644
--- a/docs/source/fastNLP.io.rst
+++ b/docs/source/fastNLP.io.rst
@@ -2,18 +2,18 @@ fastNLP.io
==========
.. automodule:: fastNLP.io
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
子模块
----------
.. toctree::
- :titlesonly:
+ :maxdepth: 1
fastNLP.io.base_loader
- fastNLP.io.dataset_loader
fastNLP.io.embed_loader
+ fastNLP.io.dataset_loader
+ fastNLP.io.data_loader
fastNLP.io.model_io
-
diff --git a/docs/source/fastNLP.models.biaffine_parser.rst b/docs/source/fastNLP.models.biaffine_parser.rst
index a3dd1836..f19504e8 100644
--- a/docs/source/fastNLP.models.biaffine_parser.rst
+++ b/docs/source/fastNLP.models.biaffine_parser.rst
@@ -2,6 +2,6 @@ fastNLP.models.biaffine\_parser
===============================
.. automodule:: fastNLP.models.biaffine_parser
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.models.cnn_text_classification.rst b/docs/source/fastNLP.models.cnn_text_classification.rst
index a935d0bf..eacf6916 100644
--- a/docs/source/fastNLP.models.cnn_text_classification.rst
+++ b/docs/source/fastNLP.models.cnn_text_classification.rst
@@ -2,6 +2,6 @@ fastNLP.models.cnn\_text\_classification
========================================
.. automodule:: fastNLP.models.cnn_text_classification
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.models.rst b/docs/source/fastNLP.models.rst
index 5858ebcd..2ea546e2 100644
--- a/docs/source/fastNLP.models.rst
+++ b/docs/source/fastNLP.models.rst
@@ -2,19 +2,18 @@ fastNLP.models
==============
.. automodule:: fastNLP.models
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
子模块
----------
.. toctree::
- :titlesonly:
+ :maxdepth: 1
fastNLP.models.biaffine_parser
fastNLP.models.cnn_text_classification
fastNLP.models.sequence_labeling
fastNLP.models.snli
fastNLP.models.star_transformer
-
diff --git a/docs/source/fastNLP.models.sequence_labeling.rst b/docs/source/fastNLP.models.sequence_labeling.rst
index 6d569fe1..85e28f06 100644
--- a/docs/source/fastNLP.models.sequence_labeling.rst
+++ b/docs/source/fastNLP.models.sequence_labeling.rst
@@ -2,6 +2,6 @@ fastNLP.models.sequence\_labeling
=================================
.. automodule:: fastNLP.models.sequence_labeling
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.models.snli.rst b/docs/source/fastNLP.models.snli.rst
index 24c2cc53..3b9b555c 100644
--- a/docs/source/fastNLP.models.snli.rst
+++ b/docs/source/fastNLP.models.snli.rst
@@ -2,6 +2,6 @@ fastNLP.models.snli
===================
.. automodule:: fastNLP.models.snli
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.models.star_transformer.rst b/docs/source/fastNLP.models.star_transformer.rst
index c93fb8cd..69d5c5b2 100644
--- a/docs/source/fastNLP.models.star_transformer.rst
+++ b/docs/source/fastNLP.models.star_transformer.rst
@@ -2,6 +2,6 @@ fastNLP.models.star\_transformer
================================
.. automodule:: fastNLP.models.star_transformer
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.modules.aggregator.attention.rst b/docs/source/fastNLP.modules.aggregator.attention.rst
deleted file mode 100644
index dc9c2b53..00000000
--- a/docs/source/fastNLP.modules.aggregator.attention.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.aggregator.attention
-====================================
-
-.. automodule:: fastNLP.modules.aggregator.attention
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.aggregator.pooling.rst b/docs/source/fastNLP.modules.aggregator.pooling.rst
deleted file mode 100644
index 162f889d..00000000
--- a/docs/source/fastNLP.modules.aggregator.pooling.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.aggregator.pooling
-==================================
-
-.. automodule:: fastNLP.modules.aggregator.pooling
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.aggregator.rst b/docs/source/fastNLP.modules.aggregator.rst
deleted file mode 100644
index 44398325..00000000
--- a/docs/source/fastNLP.modules.aggregator.rst
+++ /dev/null
@@ -1,17 +0,0 @@
-fastNLP.modules.aggregator
-==========================
-
-.. automodule:: fastNLP.modules.aggregator
- :members:
- :undoc-members:
- :show-inheritance:
-
-子模块
-----------
-
-.. toctree::
- :titlesonly:
-
- fastNLP.modules.aggregator.attention
- fastNLP.modules.aggregator.pooling
-
diff --git a/docs/source/fastNLP.modules.decoder.crf.rst b/docs/source/fastNLP.modules.decoder.crf.rst
deleted file mode 100644
index 6d5b0d5b..00000000
--- a/docs/source/fastNLP.modules.decoder.crf.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.decoder.CRF
-===========================
-
-.. automodule:: fastNLP.modules.decoder.crf
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.decoder.mlp.rst b/docs/source/fastNLP.modules.decoder.mlp.rst
deleted file mode 100644
index 7d661ebf..00000000
--- a/docs/source/fastNLP.modules.decoder.mlp.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.decoder.MLP
-===========================
-
-.. automodule:: fastNLP.modules.decoder.mlp
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.decoder.rst b/docs/source/fastNLP.modules.decoder.rst
index e42a9f39..ecc2adbd 100644
--- a/docs/source/fastNLP.modules.decoder.rst
+++ b/docs/source/fastNLP.modules.decoder.rst
@@ -2,17 +2,7 @@ fastNLP.modules.decoder
=======================
.. automodule:: fastNLP.modules.decoder
- :members:
- :undoc-members:
- :show-inheritance:
-
-子模块
-----------
-
-.. toctree::
- :titlesonly:
-
- fastNLP.modules.decoder.crf
- fastNLP.modules.decoder.mlp
- fastNLP.modules.decoder.utils
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.modules.decoder.utils.rst b/docs/source/fastNLP.modules.decoder.utils.rst
deleted file mode 100644
index da979d99..00000000
--- a/docs/source/fastNLP.modules.decoder.utils.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.decoder.utils
-=============================
-
-.. automodule:: fastNLP.modules.decoder.utils
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.encoder.bert.rst b/docs/source/fastNLP.modules.encoder.bert.rst
deleted file mode 100644
index 66bd0bbd..00000000
--- a/docs/source/fastNLP.modules.encoder.bert.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.encoder.bert
-============================
-
-.. automodule:: fastNLP.modules.encoder.bert
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.encoder.char_encoder.rst b/docs/source/fastNLP.modules.encoder.char_encoder.rst
deleted file mode 100644
index 61ea3340..00000000
--- a/docs/source/fastNLP.modules.encoder.char_encoder.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.encoder.char\_encoder
-=====================================
-
-.. automodule:: fastNLP.modules.encoder.char_encoder
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.encoder.conv_maxpool.rst b/docs/source/fastNLP.modules.encoder.conv_maxpool.rst
deleted file mode 100644
index 7058a723..00000000
--- a/docs/source/fastNLP.modules.encoder.conv_maxpool.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.encoder.conv\_maxpool
-=====================================
-
-.. automodule:: fastNLP.modules.encoder.conv_maxpool
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.encoder.embedding.rst b/docs/source/fastNLP.modules.encoder.embedding.rst
deleted file mode 100644
index 4427b3bf..00000000
--- a/docs/source/fastNLP.modules.encoder.embedding.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.encoder.embedding
-=================================
-
-.. automodule:: fastNLP.modules.encoder.embedding
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.encoder.lstm.rst b/docs/source/fastNLP.modules.encoder.lstm.rst
deleted file mode 100644
index f9cbea88..00000000
--- a/docs/source/fastNLP.modules.encoder.lstm.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.encoder.lstm
-============================
-
-.. automodule:: fastNLP.modules.encoder.lstm
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.encoder.rst b/docs/source/fastNLP.modules.encoder.rst
index b15232fa..0562f12d 100644
--- a/docs/source/fastNLP.modules.encoder.rst
+++ b/docs/source/fastNLP.modules.encoder.rst
@@ -2,22 +2,6 @@ fastNLP.modules.encoder
=======================
.. automodule:: fastNLP.modules.encoder
- :members:
- :undoc-members:
- :show-inheritance:
-
-子模块
-----------
-
-.. toctree::
- :titlesonly:
-
- fastNLP.modules.encoder.bert
- fastNLP.modules.encoder.char_encoder
- fastNLP.modules.encoder.conv_maxpool
- fastNLP.modules.encoder.embedding
- fastNLP.modules.encoder.lstm
- fastNLP.modules.encoder.star_transformer
- fastNLP.modules.encoder.transformer
- fastNLP.modules.encoder.variational_rnn
-
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/docs/source/fastNLP.modules.encoder.star_transformer.rst b/docs/source/fastNLP.modules.encoder.star_transformer.rst
deleted file mode 100644
index 0c406782..00000000
--- a/docs/source/fastNLP.modules.encoder.star_transformer.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.encoder.star\_transformer
-=========================================
-
-.. automodule:: fastNLP.modules.encoder.star_transformer
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.encoder.transformer.rst b/docs/source/fastNLP.modules.encoder.transformer.rst
deleted file mode 100644
index 6a40c597..00000000
--- a/docs/source/fastNLP.modules.encoder.transformer.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.encoder.transformer
-===================================
-
-.. automodule:: fastNLP.modules.encoder.transformer
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.encoder.variational_rnn.rst b/docs/source/fastNLP.modules.encoder.variational_rnn.rst
deleted file mode 100644
index 348fb3d8..00000000
--- a/docs/source/fastNLP.modules.encoder.variational_rnn.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-fastNLP.modules.encoder.variational\_rnn
-========================================
-
-.. automodule:: fastNLP.modules.encoder.variational_rnn
- :members:
- :undoc-members:
- :show-inheritance:
diff --git a/docs/source/fastNLP.modules.rst b/docs/source/fastNLP.modules.rst
index d04ccdcf..646ef2d3 100644
--- a/docs/source/fastNLP.modules.rst
+++ b/docs/source/fastNLP.modules.rst
@@ -2,16 +2,16 @@ fastNLP.modules
===============
.. automodule:: fastNLP.modules
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
子模块
-----------
.. toctree::
- :titlesonly:
+ :titlesonly:
+ :maxdepth: 1
- fastNLP.modules.aggregator
- fastNLP.modules.decoder
- fastNLP.modules.encoder
\ No newline at end of file
+ fastNLP.modules.decoder
+ fastNLP.modules.encoder
\ No newline at end of file
diff --git a/docs/source/fastNLP.rst b/docs/source/fastNLP.rst
index f0c3d41c..0057a184 100644
--- a/docs/source/fastNLP.rst
+++ b/docs/source/fastNLP.rst
@@ -2,19 +2,18 @@ API 文档
===============
.. automodule:: fastNLP
- :members:
- :undoc-members:
- :show-inheritance:
+ :members:
+ :undoc-members:
+ :show-inheritance:
内部模块
-----------
.. toctree::
- :titlesonly:
- :maxdepth: 3
-
- fastNLP.core
- fastNLP.io
- fastNLP.modules
- fastNLP.models
+ :maxdepth: 1
+ fastNLP.core
+ fastNLP.embeddings
+ fastNLP.io
+ fastNLP.models
+ fastNLP.modules
diff --git a/docs/source/figures/text_classification.png b/docs/source/figures/text_classification.png
index 0d36a2a1..21502708 100644
Binary files a/docs/source/figures/text_classification.png and b/docs/source/figures/text_classification.png differ
diff --git a/docs/source/figures/workflow.png b/docs/source/figures/workflow.png
index d2f22df8..d8e4e455 100644
Binary files a/docs/source/figures/workflow.png and b/docs/source/figures/workflow.png differ
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 03a192dc..d48af986 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -1,61 +1,28 @@
fastNLP 中文文档
=====================
-fastNLP 是一款轻量级的 NLP 处理套件。你既可以使用它快速地完成一个命名实体识别(NER)、中文分词或文本分类任务;
-也可以使用他构建许多复杂的网络模型,进行科研。它具有如下的特性:
+`fastNLP `_ is a lightweight NLP toolkit. You can use it to quickly build systems for sequence labelling
+(NER, POS tagging, etc.), Chinese word segmentation, text classification, matching, coreference resolution, summarization and other tasks
+(see `reproduction `_ );
+or use it to construct complex network models for research. Its main features are:
-- 统一的Tabular式数据容器,让数据预处理过程简洁明了。内置多种数据集的DataSet Loader,省去预处理代码。
-- 各种方便的NLP工具,例如预处理embedding加载; 中间数据cache等;
-- 详尽的中文文档以供查阅;
-- 提供诸多高级模块,例如Variational LSTM, Transformer, CRF等;
-- 封装CNNText,Biaffine等模型可供直接使用;
-- 便捷且具有扩展性的训练器; 提供多种内置callback函数,方便实验记录、异常捕获等。
+- A unified tabular data container that makes preprocessing straightforward, with built-in :mod:`~fastNLP.io.data_loader` for many datasets that save you the preprocessing code;
+- A rich set of training and testing components, such as the trainer :class:`~fastNLP.Trainer` , the tester :class:`~fastNLP.Tester` and various evaluation :mod:`~fastNLP.core.metrics` ;
+- Handy NLP utilities, such as pretrained :mod:`embedding` loading (including ELMo and BERT) and intermediate-data caching :func:`cache ` ;
+- Many high-level modules in :mod:`~fastNLP.modules` , such as :class:`~fastNLP.modules.VarLSTM` , :class:`Transformer` , :class:`CRF` ;
+- Ready-to-use :mod:`~fastNLP.models` for sequence labelling, Chinese word segmentation, text classification, matching, coreference resolution, summarization and more;
+- A convenient, extensible trainer with many built-in :mod:`~fastNLP.core.callback` functions for experiment logging, exception handling and more.
-内置组件
-------------
-
-大部分用于的 NLP 任务神经网络都可以看做由编码(encoder)、聚合(aggregator)、解码(decoder)三种模块组成。
-
-.. image:: figures/text_classification.png
-
-fastNLP 在 :mod:`~fastNLP.modules` 模块中内置了三种模块的诸多组件,可以帮助用户快速搭建自己所需的网络。
-三种模块的功能和常见组件如下:
-
-+-----------------------+-----------------------+-----------------------+
-| module type | functionality | example |
-+=======================+=======================+=======================+
-| encoder | 将输入编码为具有具 | embedding, RNN, CNN, |
-| | 有表示能力的向量 | transformer |
-+-----------------------+-----------------------+-----------------------+
-| aggregator | 从多个向量中聚合信息 | self-attention, |
-| | | max-pooling |
-+-----------------------+-----------------------+-----------------------+
-| decoder | 将具有某种表示意义的 | MLP, CRF |
-| | 向量解码为需要的输出 | |
-| | 形式 | |
-+-----------------------+-----------------------+-----------------------+
-
-
-内置模型
-----------------
-
-fastNLP 在 :mod:`~fastNLP.models` 模块中内置了如 :class:`~fastNLP.models.CNNText` 、
-:class:`~fastNLP.models.SeqLabeling` 等完整的模型,以供用户直接使用。
-
-.. todo::
- 这些模型的介绍如下表所示:(模型名称 + 介绍 + 任务上的结果)
-
用户手册
----------------
.. toctree::
- :maxdepth: 1
+ :maxdepth: 2
- 安装指南
- 快速入门
- 详细指南
- 科研指南
+    Installation Guide
+    Quick Start
+    Detailed Tutorials
API 文档
-------------
@@ -68,11 +35,11 @@ API 文档
fastNLP
-fitlog
-------
+fitlog Documentation
+--------------------
-用户可以 `点此 `_ 查看fitlog的文档。
-fitlog 是由我们团队开发,用于帮助用户记录日志并管理代码的工具
+You can view the fitlog documentation `here `_ .
+fitlog is a logging and code-management tool developed by our team.
索引与搜索
==================
diff --git a/docs/source/user/with_fitlog.rst b/docs/source/tutorials/tutorial_10_fitlog.rst
similarity index 96%
rename from docs/source/user/with_fitlog.rst
rename to docs/source/tutorials/tutorial_10_fitlog.rst
index 51445775..0fa24143 100644
--- a/docs/source/user/with_fitlog.rst
+++ b/docs/source/tutorials/tutorial_10_fitlog.rst
@@ -1,6 +1,6 @@
-=================
-科研向导
-=================
+============================================
+Using fitlog with fastNLP for Research
+============================================
本文介绍结合使用 fastNLP 和 fitlog 进行科研的方法。
diff --git a/docs/source/tutorials/tutorial_1_data_preprocess.rst b/docs/source/tutorials/tutorial_1_data_preprocess.rst
new file mode 100644
index 00000000..0ec63f87
--- /dev/null
+++ b/docs/source/tutorials/tutorial_1_data_preprocess.rst
@@ -0,0 +1,156 @@
+========================================
+Preprocessing Text with DataSet
+========================================
+
+:class:`~fastNLP.DataSet` is the container fastNLP uses to hold data. You can think of a DataSet as a table in which
+each row is a sample (called an :mod:`~fastNLP.core.instance` in fastNLP)
+and each column is a feature (called a :mod:`~fastNLP.core.field` in fastNLP).
+
+.. csv-table::
+ :header: "sentence", "words", "seq_len"
+
+ "This is the first instance .", "[This, is, the, first, instance, .]", 6
+ "Second instance .", "[Second, instance, .]", 3
+ "Third instance .", "[Third, instance, .]", 3
+ "...", "[...]", "..."
+
+The table above shows how a sample DataSet is stored: each row is an :class:`~fastNLP.Instance` object and each column is an :class:`~fastNLP.FieldArray` object.
+
+
+-----------------------------------
+Building and deleting datasets
+-----------------------------------
+
+We can build a dataset by passing in a dictionary, which is the most basic way to initialise a :class:`~fastNLP.DataSet` :
+
+.. code-block:: python
+
+ from fastNLP import DataSet
+ data = {'sentence':["This is the first instance .", "Second instance .", "Third instance ."],
+ 'words': [['this', 'is', 'the', 'first', 'instance', '.'], ['Second', 'instance', '.'], ['Third', 'instance', '.']],
+ 'seq_len': [6, 3, 3]}
+ dataset = DataSet(data)
+    # every value in the dict must be a list of the same length
+
+We can also add data to a dataset with the :func:`~fastNLP.DataSet.append` method:
+
+.. code-block:: python
+
+ from fastNLP import DataSet
+ from fastNLP import Instance
+ dataset = DataSet()
+ instance = Instance(sentence="This is the first instance",
+ words=['this', 'is', 'the', 'first', 'instance', '.'],
+ seq_len=6)
+ dataset.append(instance)
+    # you can keep appending, but every appended instance must have exactly the same fields as the previous ones
+
+Alternatively, a dataset can be built from a list of :class:`~fastNLP.Instance` objects:
+
+.. code-block:: python
+
+ from fastNLP import DataSet
+ from fastNLP import Instance
+ dataset = DataSet([
+ Instance(sentence="This is the first instance",
+ words=['this', 'is', 'the', 'first', 'instance', '.'],
+ seq_len=6),
+ Instance(sentence="Second instance .",
+ words=['Second', 'instance', '.'],
+ seq_len=3)
+ ])
+
+Once the dataset has been built, you can iterate over the contents of the :class:`~fastNLP.DataSet` with a `for` loop.
+
+.. code-block:: python
+
+ for instance in dataset:
+        print(instance)  # do something with each instance
+
+fastNLP also provides several ways to delete data: :func:`~fastNLP.DataSet.drop` , :func:`~fastNLP.DataSet.delete_instance` and :func:`~fastNLP.DataSet.delete_field` :
+
+.. code-block:: python
+
+ from fastNLP import DataSet
+ dataset = DataSet({'a': list(range(-5, 5))})
+    # return the instances that satisfy the condition, placed in a new DataSet
+ dropped_dataset = dataset.drop(lambda ins:ins['a']<0, inplace=False)
+    # delete the instances that satisfy the condition from the dataset in place
+    dataset.drop(lambda ins:ins['a']<0) # the number of instances in dataset decreases
+    # delete the third instance
+ dataset.delete_instance(2)
+    # delete the field named 'a'
+ dataset.delete_field('a')
+
+-----------------------------------
+Simple preprocessing
+-----------------------------------
+
+Because data in fastNLP is stored column by column, most preprocessing operations take a column ( :mod:`~fastNLP.core.field` ) as their target.
+First, we can check whether a :mod:`~fastNLP.core.field` with a given name exists, and rename it:
+
+.. code-block:: python
+
+    # check whether a field named 'a' exists
+    dataset.has_field('a')  # or ('a' in dataset)
+    # rename the field 'a' to 'b'
+    dataset.rename_field('a', 'b')
+    # the length of the DataSet
+ len(dataset)
+
+Next, we can use :func:`~fastNLP.DataSet.apply` or :func:`~fastNLP.DataSet.apply_field` for preprocessing.
+Both methods take a function that operates on a single :mod:`~fastNLP.core.instance` ,
+and automatically call it for every :mod:`~fastNLP.core.instance` of a :mod:`~fastNLP.core.field` to carry out the whole operation.
+The function can be a lambda or a regular named function, and the ``new_field_name`` argument specifies the name of the :mod:`~fastNLP.core.field` in which the results are stored.
+
+.. code-block:: python
+
+ from fastNLP import DataSet
+ data = {'sentence':["This is the first instance .", "Second instance .", "Third instance ."]}
+ dataset = DataSet(data)
+
+    # split each sentence into words; see DataSet.apply()
+ dataset.apply(lambda ins: ins['sentence'].split(), new_field_name='words')
+
+    # or use DataSet.apply_field()
+ dataset.apply_field(lambda sent:sent.split(), field_name='sentence', new_field_name='words')
+
+    # instead of a lambda, a named function can also be passed in
+ def get_words(instance):
+ sentence = instance['sentence']
+ words = sentence.split()
+ return words
+ dataset.apply(get_words, new_field_name='words')
+
+Besides processing datasets by hand, you can also use the various :class:`~fastNLP.io.base_loader.DataSetLoader` classes provided by fastNLP.
+See the tutorial :doc:`Loading datasets with DataSetLoader ` for details.
+
+-----------------------------------
+DataSet and padding
+-----------------------------------
+
+In fastNLP, padding is bound to a :mod:`~fastNLP.core.field` , so different fields can be padded in different ways; for example, in English tasks the padding needed for words and
+the padding needed for characters usually differ. Padding is handled by subclasses of :class:`~fastNLP.Padder` .
+By default, every field uses :class:`~fastNLP.AutoPadder` .
+The padder can be set as shown below (if the padder is set to None, that field is not padded at all).
+In most cases :class:`~fastNLP.AutoPadder` is all you need.
+If neither :class:`~fastNLP.AutoPadder` nor :class:`~fastNLP.EngChar2DPadder` meets your needs,
+you can also write your own :class:`~fastNLP.Padder` .
+
+.. code-block:: python
+
+ from fastNLP import DataSet
+ from fastNLP import EngChar2DPadder
+ import random
+ dataset = DataSet()
+ max_chars, max_words, sent_num = 5, 10, 20
+ contents = [[
+ [random.randint(1, 27) for _ in range(random.randint(1, max_chars))]
+ for _ in range(random.randint(1, max_words))
+ ] for _ in range(sent_num)]
+    # pass the padder when the field is added
+ dataset.add_field('chars', contents, padder=EngChar2DPadder())
+    # or set it directly afterwards
+ dataset.set_padder('chars', EngChar2DPadder())
+    # the padding value can also be set
+ dataset.set_pad_val('chars', -1)
diff --git a/docs/source/tutorials/tutorial_2_load_dataset.rst b/docs/source/tutorials/tutorial_2_load_dataset.rst
new file mode 100644
index 00000000..4fa4a84d
--- /dev/null
+++ b/docs/source/tutorials/tutorial_2_load_dataset.rst
@@ -0,0 +1,224 @@
+========================================
+Loading Datasets with DataSetLoader
+========================================
+
+This section is a tutorial on how to load datasets.
+
+Contents:
+
+    - `Part I: The dataset container`_
+    - `Part II: How datasets are used`_
+    - `Part III: DataSetLoaders for different data formats`_
+    - `Part IV: DataSetLoader examples`_
+    - `Part V: Ready-made dataset loaders in fastNLP`_
+
+
+--------------------------------------------------------
+Part I: The dataset container
+--------------------------------------------------------
+
+In fastNLP, dataset information is stored in a :class:`~fastNLP.io.base_loader.DataBundle` .
+A :class:`~fastNLP.io.base_loader.DataBundle` holds two important members: `datasets` and `vocabs` .
+
+`datasets` is a dict whose keys are dataset names (such as `train` , `dev` and `test` ) and whose values are :class:`~fastNLP.DataSet` objects.
+
+`vocabs` is a dict whose keys are vocabulary names (for example, :attr:`fastNLP.Const.INPUT` for the input-text vocabulary and :attr:`fastNLP.Const.TARGET` for the
+vocabulary of gold labels) and whose values are the vocabularies themselves ( :class:`~fastNLP.Vocabulary` ).
+
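+A minimal sketch of how these two dictionaries are usually accessed (the loader and the file path below are placeholders for whatever loader and data you actually use):
+
+.. code-block:: python
+
+    from fastNLP.io.data_loader import SNLILoader
+
+    # process() returns a DataBundle (see Part IV for the full example)
+    data_bundle = SNLILoader().process(paths='path/to/snli/data')
+    train_set = data_bundle.datasets['train']   # a fastNLP DataSet
+    word_vocab = data_bundle.vocabs['words']    # a fastNLP Vocabulary
+    print(len(train_set), len(word_vocab))
+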
+--------------------------------------------------------
+Part II: How datasets are used
+--------------------------------------------------------
+
+fastNLP uses :class:`~fastNLP.io.base_loader.DataSetLoader` as the base class for loading datasets.
+:class:`~fastNLP.io.base_loader.DataSetLoader` defines the API that every DataSetLoader needs; developers should subclass it to implement loaders for specific datasets.
+A DataSetLoader for a given dataset should provide at least the following:
+
+    - a _load function: read data from a single data file into a :class:`~fastNLP.DataSet`
+    - a load function (the base-class implementation can be reused): read data from one or more data files into one or more :class:`~fastNLP.DataSet` objects
+    - a process function: read data from one or more data files and turn it into a training-ready :class:`~fastNLP.io.DataBundle`
+
+    **\*the process function may call the load or _load function**
+
+The :class:`~fastNLP.DataSet` returned by a DataSetLoader's _load or load function contains the raw text of the dataset, whereas the `datasets` inside the
+:class:`~fastNLP.io.DataBundle` returned by process are already indexed and can be fed directly to a
+:class:`~fastNLP.Trainer` .
+
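+As a rough illustration of this contract, a loader for a hypothetical tab-separated ``text<TAB>label`` file might look like the sketch below (the class name, file format and field names are invented for illustration and are not part of fastNLP):
+
+.. code-block:: python
+
+    from fastNLP import DataSet, Instance
+    from fastNLP.io.base_loader import DataSetLoader
+
+    class TSVClassificationLoader(DataSetLoader):
+        def _load(self, path):
+            # read one file and return a DataSet, as required by the API above
+            ds = DataSet()
+            with open(path, encoding='utf-8') as f:
+                for line in f:
+                    text, label = line.rstrip('\n').split('\t')
+                    ds.append(Instance(raw_words=text.split(), target=label))
+            return ds
+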
+--------------------------------------------------------
+Part III: DataSetLoaders for different data formats
+--------------------------------------------------------
+
+:class:`~fastNLP.io.dataset_loader.CSVLoader`
+    Reads CSV-style dataset files. For example:
+
+ .. code-block:: python
+
+ data_set_loader = CSVLoader(
+ headers=('words', 'target'), sep='\t'
+ )
+        # the first item on each line is put into the 'words' field and the second into the 'target' field,
+        # with the items separated by '\t'
+
+ data_set = data_set_loader._load('path/to/your/file')
+
+    A sample of the dataset looks like this ::
+
+ But it does not leave you with much . 1
+ You could hate it for the same reason . 1
+ The performances are an absolute joy . 4
+
+
+:class:`~fastNLP.io.dataset_loader.JsonLoader`
+    Reads JSON-style dataset files; the data must be stored one JSON object per line, each object holding the various attributes. For example:
+
+ .. code-block:: python
+
+ data_set_loader = JsonLoader(
+ fields={'sentence1': 'words1', 'sentence2': 'words2', 'gold_label': 'target'}
+ )
+        # the values of 'sentence1', 'sentence2' and 'gold_label' in each JSON object are assigned to the 'words1', 'words2' and 'target' fields respectively
+
+ data_set = data_set_loader._load('path/to/your/file')
+
+    A sample of the dataset looks like this ::
+
+ {"annotator_labels": ["neutral"], "captionID": "3416050480.jpg#4", "gold_label": "neutral", "pairID": "3416050480.jpg#4r1n", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is training his horse for a competition.", "sentence2_binary_parse": "( ( A person ) ( ( is ( ( training ( his horse ) ) ( for ( a competition ) ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (VP (VBG training) (NP (PRP$ his) (NN horse)) (PP (IN for) (NP (DT a) (NN competition))))) (. .)))"}
+ {"annotator_labels": ["contradiction"], "captionID": "3416050480.jpg#4", "gold_label": "contradiction", "pairID": "3416050480.jpg#4r1c", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is at a diner, ordering an omelette.", "sentence2_binary_parse": "( ( A person ) ( ( ( ( is ( at ( a diner ) ) ) , ) ( ordering ( an omelette ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (PP (IN at) (NP (DT a) (NN diner))) (, ,) (S (VP (VBG ordering) (NP (DT an) (NN omelette))))) (. .)))"}
+ {"annotator_labels": ["entailment"], "captionID": "3416050480.jpg#4", "gold_label": "entailment", "pairID": "3416050480.jpg#4r1e", "sentence1": "A person on a horse jumps over a broken down airplane.", "sentence1_binary_parse": "( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )", "sentence1_parse": "(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN on) (NP (DT a) (NN horse)))) (VP (VBZ jumps) (PP (IN over) (NP (DT a) (JJ broken) (JJ down) (NN airplane)))) (. .)))", "sentence2": "A person is outdoors, on a horse.", "sentence2_binary_parse": "( ( A person ) ( ( ( ( is outdoors ) , ) ( on ( a horse ) ) ) . ) )", "sentence2_parse": "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (ADVP (RB outdoors)) (, ,) (PP (IN on) (NP (DT a) (NN horse)))) (. .)))"}
+
+--------------------------------------------------------
+Part IV: DataSetLoader examples
+--------------------------------------------------------
+
+Taking matching tasks as an example:
+
+ :class:`~fastNLP.io.data_loader.MatchingLoader`
+        fastNLP provides a data-loading class for matching-task datasets: :class:`~fastNLP.io.data_loader.MatchingLoader` .
+
+        MatchingLoader wraps a function that further preprocesses the text in the dataset:
+        :meth:`~fastNLP.io.data_loader.MatchingLoader.process`
+        This function offers various preprocessing options, such as:
+            - whether to lowercase the text
+            - whether sequence-length information is needed, and of what kind
+            - whether to use a BertTokenizer to obtain WordPiece information for the sequences
+            - and so on
+
+        See :meth:`fastNLP.io.MatchingLoader.process` for details.
+
+ :class:`~fastNLP.io.data_loader.SNLILoader`
+        A DataSetLoader for the SNLI dataset. The SNLI dataset comes from
+        `SNLI Data Set `_ .
+
+        Using :class:`~fastNLP.io.data_loader.SNLILoader` , we can read the dataset from its text files
+        into memory with the following code:
+
+ .. code-block:: python
+
+ data = SNLILoader().process(
+ paths='path/to/snli/data', to_lower=False, seq_len_type='seq_len',
+ get_index=True, concat=False,
+ )
+ print(data)
+
+        The output is::
+
+ In total 3 datasets:
+ train has 549367 instances.
+ dev has 9842 instances.
+ test has 9824 instances.
+ In total 2 vocabs:
+ words has 43154 entries.
+ target has 3 entries.
+
+
+        Here data is a :class:`~fastNLP.io.base_loader.DataBundle` ; the entries of its ``datasets`` dict can be passed directly to a
+        :class:`~fastNLP.Trainer` or :class:`~fastNLP.Tester` for training or testing.
+
+ :class:`~fastNLP.io.data_loader.IMDBLoader`
+        Taking the IMDB dataset as an example, we can read it from its text files into memory
+        with the following :class:`~fastNLP.io.data_loader.IMDBLoader` code:
+
+ .. code-block:: python
+
+ data = IMDBLoader().process(
+ paths={'train': 'path/to/train/file', 'test': 'path/to/test/file'}
+ )
+ print(data)
+
+        The output is::
+
+ In total 3 datasets:
+ train has 22500 instances.
+ test has 25000 instances.
+ dev has 2500 instances.
+ In total 2 vocabs:
+ words has 82846 entries.
+ target has 2 entries.
+
+
+        Here the original train split is divided into a training set and a validation set at a 9:1 ratio.
+
+
+--------------------------------------------------------
+Part V: Ready-made dataset loaders in fastNLP
+--------------------------------------------------------
+
+fastNLP's ready-made dataset loaders cover several kinds of tasks:
+
+    - `Text classification`_
+    - `Sequence labelling`_
+    - `Matching`_
+
+
+Text classification
+--------------------
+
+========================== ==================================================================
+数据集名称 数据集加载器
+-------------------------- ------------------------------------------------------------------
+IMDb :class:`~fastNLP.io.data_loader.IMDBLoader`
+-------------------------- ------------------------------------------------------------------
+SST :class:`~fastNLP.io.data_loader.SSTLoader`
+-------------------------- ------------------------------------------------------------------
+SST-2 :class:`~fastNLP.io.data_loader.SST2Loader`
+-------------------------- ------------------------------------------------------------------
+Yelp Polarity :class:`~fastNLP.io.data_loader.YelpLoader`
+-------------------------- ------------------------------------------------------------------
+Yelp Full :class:`~fastNLP.io.data_loader.YelpLoader`
+-------------------------- ------------------------------------------------------------------
+MTL16 :class:`~fastNLP.io.data_loader.MTL16Loader`
+========================== ==================================================================
+
+
+
+Sequence labelling
+-------------------
+
+========================== ==================================================================
+数据集名称 数据集加载器
+-------------------------- ------------------------------------------------------------------
+Conll :class:`~fastNLP.io.data_loader.ConllLoader`
+-------------------------- ------------------------------------------------------------------
+Conll2003 :class:`~fastNLP.io.data_loader.Conll2003Loader`
+-------------------------- ------------------------------------------------------------------
+人民日报数据集 :class:`~fastNLP.io.data_loader.PeopleDailyCorpusLoader`
+========================== ==================================================================
+
+
+
+Matching
+-------------------
+
+========================== ==================================================================
+数据集名称 数据集加载器
+-------------------------- ------------------------------------------------------------------
+SNLI :class:`~fastNLP.io.data_loader.SNLILoader`
+-------------------------- ------------------------------------------------------------------
+MultiNLI :class:`~fastNLP.io.data_loader.MNLILoader`
+-------------------------- ------------------------------------------------------------------
+QNLI :class:`~fastNLP.io.data_loader.QNLILoader`
+-------------------------- ------------------------------------------------------------------
+RTE :class:`~fastNLP.io.data_loader.RTELoader`
+-------------------------- ------------------------------------------------------------------
+Quora Pair Dataset :class:`~fastNLP.io.data_loader.QuoraLoader`
+========================== ==================================================================
+
diff --git a/docs/source/tutorials/tutorial_3_embedding.rst b/docs/source/tutorials/tutorial_3_embedding.rst
new file mode 100644
index 00000000..5e0a9107
--- /dev/null
+++ b/docs/source/tutorials/tutorial_3_embedding.rst
@@ -0,0 +1,214 @@
+============================================================
+Turning Text into Vectors with the Embedding Module
+============================================================
+
+This section is a tutorial on using embeddings in fastNLP.
+
+Contents:
+
+    - `Part I: Introduction to embeddings`_
+    - `Part II: Randomly initialised embeddings`_
+    - `Part III: Pretrained static embeddings`_
+    - `Part IV: Pretrained contextual embeddings (ELMo & BERT)`_
+    - `Part V: Character-level embeddings`_
+    - `Part VI: Stacking multiple embeddings`_
+
+
+
+
+------------------------------------------------------------
+Part I: Introduction to embeddings
+------------------------------------------------------------
+
+Like torch.nn.Embedding, a fastNLP embedding takes an indexed sequence as input and outputs the embedding of that sequence.
+
+fastNLP embeddings include both pretrained embeddings and randomly initialised embeddings.
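+
+A minimal sketch of that input/output contract (the index values below are arbitrary and only meant to illustrate the shapes):
+
+.. code-block:: python
+
+    import torch
+    from fastNLP.modules.encoder.embedding import Embedding
+
+    embed = Embedding(10000, 50)               # vocabulary size 10000, embedding dimension 50
+    words = torch.LongTensor([[3, 7, 25, 1]])  # one indexed sequence of length 4
+    print(embed(words).size())                 # torch.Size([1, 4, 50])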
+
+
+------------------------------------------------------------
+Part II: Randomly initialised embeddings
+------------------------------------------------------------
+
+For a randomly initialised embedding, see :class:`~fastNLP.modules.encoder.embedding.Embedding` .
+
+You can pass in the vocabulary size and the embedding dimension:
+
+.. code-block:: python
+
+ embed = Embedding(10000, 50)
+
+You can also pass in an initialised weight matrix:
+
+.. code-block:: python
+
+ embed = Embedding(init_embed)
+
+Here init_embed can be a torch.FloatTensor, a torch.nn.Embedding or a numpy.ndarray.
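+
+For instance, a small sketch of initialising from an existing weight matrix (the random matrix here is only a stand-in for real pretrained weights):
+
+.. code-block:: python
+
+    import numpy as np
+    from fastNLP.modules.encoder.embedding import Embedding
+
+    init_embed = np.random.randn(10000, 50).astype('float32')  # vocab_size x embedding_dim
+    embed = Embedding(init_embed)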
+
+
+------------------------------------------------------------
+Part III: Pretrained static embeddings
+------------------------------------------------------------
+
+Before using a pretrained embedding, you need to build a vocabulary :class:`~fastNLP.core.vocabulary.Vocabulary` from your dataset
+and pass it to the pretrained embedding class when that class is initialised.
+
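+One common way to build that vocabulary is sketched below; it assumes a DataSet with a 'words' field as in the DataSet tutorial, and that :class:`~fastNLP.Vocabulary` offers a ``from_dataset`` helper (check the API reference if this does not match your version):
+
+.. code-block:: python
+
+    from fastNLP import DataSet, Vocabulary
+
+    dataset = DataSet({'words': [['this', 'is', 'an', 'example'], ['another', 'example']]})
+    vocab = Vocabulary()
+    vocab.from_dataset(dataset, field_name='words')  # collect words from the 'words' field
+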
+For this fastNLP provides the :class:`~fastNLP.modules.encoder.embedding.StaticEmbedding` class.
+A pretrained static embedding can be loaded through :class:`~fastNLP.modules.encoder.embedding.StaticEmbedding` ,
+for example:
+
+.. code-block:: python
+
+ embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)
+
+Here vocab is the vocabulary built from your dataset; model_dir_or_name can be either a path or the name of an embedding model:
+
+    1. If a path is given, fastNLP reads the pretrained weight file at that path and loads the embedding
+    (both glove-style and word2vec-style weight files are supported).
+
+    2. If a model name is given, fastNLP looks up the embedding model by name. If the model is found in the cache directory it is loaded
+    automatically; otherwise it is downloaded automatically. The cache directory can be customised via the ``FASTNLP_CACHE_DIR`` environment variable, e.g.::
+
+ $ FASTNLP_CACHE_DIR=~/fastnlp_cache_dir python your_python_file.py
+
+With this command fastNLP looks for models in `~/fastnlp_cache_dir` and, if a model is not found there, downloads it into that directory.
+
+The currently supported static embedding models are:
+
+ ========================== ================================
+ 模型名称 模型
+ -------------------------- --------------------------------
+ en glove.840B.300d
+ -------------------------- --------------------------------
+ en-glove-840d-300 glove.840B.300d
+ -------------------------- --------------------------------
+ en-glove-6b-50 glove.6B.50d
+ -------------------------- --------------------------------
+ en-word2vec-300 谷歌word2vec 300维
+ -------------------------- --------------------------------
+ en-fasttext 英文fasttext 300维
+ -------------------------- --------------------------------
+ cn 腾讯中文词向量 200维
+ -------------------------- --------------------------------
+ cn-fasttext 中文fasttext 300维
+ ========================== ================================
+
+
+
+------------------------------------------------------------
+Part IV: Pretrained contextual embeddings (ELMo & BERT)
+------------------------------------------------------------
+
+fastNLP provides embeddings for ELMo and BERT: :class:`~fastNLP.modules.encoder.embedding.ElmoEmbedding`
+and :class:`~fastNLP.modules.encoder.embedding.BertEmbedding` .
+
+ELMo is used much like a static embedding:
+
+.. code-block:: python
+
+ embed = ElmoEmbedding(vocab, model_dir_or_name='small', requires_grad=False)
+
+The currently supported ElmoEmbedding models are:
+
+ ========================== ================================
+ 模型名称 模型
+ -------------------------- --------------------------------
+ small allennlp ELMo的small
+ -------------------------- --------------------------------
+ medium allennlp ELMo的medium
+ -------------------------- --------------------------------
+ original allennlp ELMo的original
+ -------------------------- --------------------------------
+ 5.5b-original allennlp ELMo的5.5B original
+ ========================== ================================
+
+BertEmbedding is used as follows:
+
+.. code-block:: python
+
+ embed = BertEmbedding(
+ vocab, model_dir_or_name='en-base-cased', requires_grad=False, layers='4,-2,-1'
+ )
+
+The layers argument specifies which encoder layers' outputs to use.
+
+The currently supported BertEmbedding models are:
+
+ ========================== ====================================
+ 模型名称 模型
+ -------------------------- ------------------------------------
+ en bert-base-cased
+ -------------------------- ------------------------------------
+ en-base-uncased bert-base-uncased
+ -------------------------- ------------------------------------
+ en-base-cased bert-base-cased
+ -------------------------- ------------------------------------
+ en-large-uncased bert-large-uncased
+ -------------------------- ------------------------------------
+ en-large-cased bert-large-cased
+ -------------------------- ------------------------------------
+ -------------------------- ------------------------------------
+ en-large-cased-wwm bert-large-cased-whole-word-mask
+ -------------------------- ------------------------------------
+ en-large-uncased-wwm bert-large-uncased-whole-word-mask
+ -------------------------- ------------------------------------
+ en-base-cased-mrpc bert-base-cased-finetuned-mrpc
+ -------------------------- ------------------------------------
+ multilingual bert-base-multilingual-cased
+ -------------------------- ------------------------------------
+ multilingual-base-uncased bert-base-multilingual-uncased
+ -------------------------- ------------------------------------
+ multilingual-base-cased bert-base-multilingual-cased
+ ========================== ====================================
+
+-----------------------------------------------------
+Part V: 使用character-level的embedding
+-----------------------------------------------------
+
+除了预训练的embedding以外,fastNLP还提供了CharEmbedding: :class:`~fastNLP.modules.encoder.embedding.CNNCharEmbedding` 和
+:class:`~fastNLP.modules.encoder.embedding.LSTMCharEmbedding` 。
+
+CNNCharEmbedding的使用例子如下:
+
+.. code-block:: python
+
+ embed = CNNCharEmbedding(vocab, embed_size=100, char_emb_size=50)
+
+这表示这个CNNCharEmbedding当中character的embedding维度大小为50,返回的embedding结果维度大小为100。
+
+与CNNCharEmbedding类似,LSTMCharEmbedding的使用例子如下:
+
+.. code-block:: python
+
+ embed = LSTMCharEmbedding(vocab, embed_size=100, char_emb_size=50)
+
+这表示这个LSTMCharEmbedding当中character的embedding维度大小为50,返回的embedding结果维度大小为100。
+
+
+
+-----------------------------------------------------
+Part VI: 叠加使用多个embedding
+-----------------------------------------------------
+
+在fastNLP中,我们使用 :class:`~fastNLP.modules.encoder.embedding.StackEmbedding` 来叠加多个embedding
+
+例子如下:
+
+.. code-block:: python
+
+ embed_1 = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)
+ embed_2 = StaticEmbedding(vocab, model_dir_or_name='en-word2vec-300', requires_grad=True)
+
+ stack_embed = StackEmbedding([embed_1, embed_2])
+
+StackEmbedding会把多个embedding的结果拼接起来,如上面例子的stack_embed返回的embedding维度为350维。
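+
+可以用一个随机生成的词 id batch 简单验证拼接后的维度,下面是一个示意(数值仅为示例):
+
+.. code-block:: python
+
+    import torch
+
+    # 2 个样本、每个样本 5 个词的词 id
+    words = torch.randint(low=0, high=len(vocab), size=(2, 5))
+    print(stack_embed(words).size())   # 预期输出 torch.Size([2, 5, 350])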
+
+除此以外,还可以把静态embedding跟上下文相关的embedding拼接起来:
+
+.. code-block:: python
+
+ elmo_embedding = ElmoEmbedding(vocab, model_dir_or_name='medium', layers='0,1,2', requires_grad=False)
+ glove_embedding = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)
+
+ stack_embed = StackEmbedding([elmo_embedding, glove_embedding])
diff --git a/docs/source/tutorials/tutorial_4_loss_optimizer.rst b/docs/source/tutorials/tutorial_4_loss_optimizer.rst
new file mode 100644
index 00000000..a6e1730a
--- /dev/null
+++ b/docs/source/tutorials/tutorial_4_loss_optimizer.rst
@@ -0,0 +1,267 @@
+==============================================================================
+动手实现一个文本分类器I-使用Trainer和Tester快速训练和测试
+==============================================================================
+
+我们使用和 :doc:`/user/quickstart` 中一样的任务来进行详细的介绍。给出一段评价性文字,预测其情感倾向是积极(label=1)、
+消极(label=0)还是中性(label=2),使用 :class:`~fastNLP.Trainer` 和 :class:`~fastNLP.Tester` 来进行快速训练和测试。
+
+--------------
+数据处理
+--------------
+
+数据读入
+ 我们可以使用 fastNLP :mod:`fastNLP.io` 模块中的 :class:`~fastNLP.io.SSTLoader` 类,轻松地读取SST数据集(数据来源:https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip)。
+ 这里的 dataset 是 fastNLP 中 :class:`~fastNLP.DataSet` 类的对象。
+
+ .. code-block:: python
+
+ from fastNLP.io import SSTLoader
+
+ loader = SSTLoader()
+ #这里的all.txt是下载好数据后train.txt、dev.txt、test.txt的组合
+ dataset = loader.load("./trainDevTestTrees_PTB/trees/all.txt")
+ print(dataset[0])
+
+ 输出数据如下::
+
+ {'words': ['It', "'s", 'a', 'lovely', 'film', 'with', 'lovely', 'performances', 'by', 'Buy', 'and', 'Accorsi', '.'] type=list,
+ 'target': positive type=str}
+
+ 除了读取数据外,fastNLP 还提供了读取其它文件类型的 Loader 类、读取 Embedding的 Loader 等。详见 :doc:`/fastNLP.io` 。
+
+
+数据处理
+ 我们使用 :class:`~fastNLP.DataSet` 类的 :meth:`~fastNLP.DataSet.apply` 方法将 ``target`` :mod:`~fastNLP.core.field` 转化为整数。
+
+ .. code-block:: python
+
+ def label_to_int(x):
+ if x['target']=="positive":
+ return 1
+ elif x['target']=="negative":
+ return 0
+ else:
+ return 2
+
+ # 将label转为整数
+ dataset.apply(lambda x: label_to_int(x), new_field_name='target')
+
+ ``words`` 和 ``target`` 已经足够用于 :class:`~fastNLP.models.CNNText` 的训练了,但我们从其文档
+ :class:`~fastNLP.models.CNNText` 中看到,在 :meth:`~fastNLP.models.CNNText.forward` 的时候,还可以传入可选参数 ``seq_len`` 。
+ 所以,我们再使用 :meth:`~fastNLP.DataSet.apply_field` 方法增加一个名为 ``seq_len`` 的 :mod:`~fastNLP.core.field` 。
+
+ .. code-block:: python
+
+ # 增加长度信息
+ dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')
+
+ 观察可知: :meth:`~fastNLP.DataSet.apply_field` 与 :meth:`~fastNLP.DataSet.apply` 类似,
+ 但所传入的 `lambda` 函数是针对一个 :class:`~fastNLP.Instance` 中的一个 :mod:`~fastNLP.core.field` 的;
+ 而 :meth:`~fastNLP.DataSet.apply` 所传入的 `lambda` 函数是针对整个 :class:`~fastNLP.Instance` 的。
+
+ .. note::
+ `lambda` 函数即匿名函数,是 Python 的重要特性。 ``lambda x: len(x)`` 和下面的这个函数的作用相同::
+
+ def func_lambda(x):
+ return len(x)
+
+ 你也可以编写复杂的函数做为 :meth:`~fastNLP.DataSet.apply_field` 与 :meth:`~fastNLP.DataSet.apply` 的参数
+
+Vocabulary 的使用
+ 我们再用 :class:`~fastNLP.Vocabulary` 类来统计数据中出现的单词,并使用 :meth:`~fastNLP.Vocabulary.index_dataset`
+ 将单词序列转化为训练可用的数字序列。
+
+ .. code-block:: python
+
+ from fastNLP import Vocabulary
+
+ # 使用Vocabulary类统计单词,并将单词序列转化为数字序列
+ vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
+ vocab.index_dataset(dataset, field_name='words',new_field_name='words')
+ print(dataset[0])
+
+ 输出数据如下::
+
+ {'words': [27, 9, 6, 913, 16, 18, 913, 124, 31, 5715, 5, 1, 2] type=list,
+ 'target': 1 type=int,
+ 'seq_len': 13 type=int}
+
+
+---------------------
+使用内置模型训练
+---------------------
+
+内置模型的输入输出命名
+ fastNLP内置了一些完整的神经网络模型,详见 :doc:`/fastNLP.models` , 我们使用其中的 :class:`~fastNLP.models.CNNText` 模型进行训练。
+ 为了使用内置的 :class:`~fastNLP.models.CNNText`,我们必须修改 :class:`~fastNLP.DataSet` 中 :mod:`~fastNLP.core.field` 的名称。
+ 在这个例子中模型输入 (forward方法的参数) 为 ``words`` 和 ``seq_len`` ; 预测输出为 ``pred`` ;标准答案为 ``target`` 。
+ 具体的命名规范可以参考 :doc:`/fastNLP.core.const` 。
+
+ 如果不想查看文档,您也可以使用 :class:`~fastNLP.Const` 类进行命名。下面的代码展示了给 :class:`~fastNLP.DataSet` 中
+ :mod:`~fastNLP.core.field` 改名的 :meth:`~fastNLP.DataSet.rename_field` 方法,以及 :class:`~fastNLP.Const` 类的使用方法。
+
+ .. code-block:: python
+
+ from fastNLP import Const
+
+ dataset.rename_field('words', Const.INPUT)
+ dataset.rename_field('seq_len', Const.INPUT_LEN)
+ dataset.rename_field('target', Const.TARGET)
+
+ print(Const.INPUT)
+ print(Const.INPUT_LEN)
+ print(Const.TARGET)
+ print(Const.OUTPUT)
+
+ 输出结果为::
+
+ words
+ seq_len
+ target
+ pred
+
+ 在给 :class:`~fastNLP.DataSet` 中 :mod:`~fastNLP.core.field` 改名后,我们还需要设置训练所需的输入和目标,这里使用的是
+ :meth:`~fastNLP.DataSet.set_input` 和 :meth:`~fastNLP.DataSet.set_target` 两个函数。
+
+ .. code-block:: python
+
+        #使用dataset的 set_input 和 set_target函数,告诉模型dataset中哪些数据是输入,哪些数据是标签(目标输出)
+ dataset.set_input(Const.INPUT, Const.INPUT_LEN)
+ dataset.set_target(Const.TARGET)
+
+数据集分割
+ 除了修改 :mod:`~fastNLP.core.field` 之外,我们还可以对 :class:`~fastNLP.DataSet` 进行分割,以供训练、开发和测试使用。
+ 下面这段代码展示了 :meth:`~fastNLP.DataSet.split` 的使用方法
+
+ .. code-block:: python
+
+ train_dev_data, test_data = dataset.split(0.1)
+ train_data, dev_data = train_dev_data.split(0.1)
+ print(len(train_data), len(dev_data), len(test_data))
+
+ 输出结果为::
+
+ 9603 1067 1185
+
+评价指标
+    训练模型需要提供一个评价指标。这里使用准确率作为评价指标。参数的 `命名规则` 跟上面类似。
+ ``pred`` 参数对应的是模型的 forward 方法返回的 dict 中的一个 key 的名字。
+ ``target`` 参数对应的是 :class:`~fastNLP.DataSet` 中作为标签的 :mod:`~fastNLP.core.field` 的名字。
+
+ .. code-block:: python
+
+ from fastNLP import AccuracyMetric
+
+ # metrics=AccuracyMetric() 在本例中与下面这行代码等价
+ metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)
+
+损失函数
+    训练模型需要提供一个损失函数,fastNLP中提供了可以直接导入使用的四种loss,分别为:
+
+ * :class:`~fastNLP.CrossEntropyLoss`:包装了torch.nn.functional.cross_entropy()函数,返回交叉熵损失(可以运用于多分类场景)
+ * :class:`~fastNLP.BCELoss`:包装了torch.nn.functional.binary_cross_entropy()函数,返回二分类的交叉熵
+ * :class:`~fastNLP.L1Loss`:包装了torch.nn.functional.l1_loss()函数,返回L1 损失
+ * :class:`~fastNLP.NLLLoss`:包装了torch.nn.functional.nll_loss()函数,返回负对数似然损失
+
+ 下面提供了一个在分类问题中常用的交叉熵损失。注意它的 **初始化参数** 。
+ ``pred`` 参数对应的是模型的 forward 方法返回的 dict 中的一个 key 的名字。
+ ``target`` 参数对应的是 :class:`~fastNLP.DataSet` 中作为标签的 :mod:`~fastNLP.core.field` 的名字。
+ 这里我们用 :class:`~fastNLP.Const` 来辅助命名,如果你自己编写模型中 forward 方法的返回值或
+ 数据集中 :mod:`~fastNLP.core.field` 的名字与本例不同, 你可以把 ``pred`` 参数和 ``target`` 参数设定符合自己代码的值。
+
+ .. code-block:: python
+
+ from fastNLP import CrossEntropyLoss
+
+ # loss = CrossEntropyLoss() 在本例中与下面这行代码等价
+ loss = CrossEntropyLoss(pred=Const.OUTPUT, target=Const.TARGET)
+
+优化器
+ 定义模型运行的时候使用的优化器,可以使用fastNLP包装好的优化器:
+
+ * :class:`~fastNLP.SGD` :包装了torch.optim.SGD优化器
+ * :class:`~fastNLP.Adam` :包装了torch.optim.Adam优化器
+
+    也可以直接使用 torch.optim 中的优化器,并在实例化 :class:`~fastNLP.Trainer` 类的时候传入优化器实参。
+
+ .. code-block:: python
+
+ import torch.optim as optim
+ from fastNLP import Adam
+
+        #注意:这里的 model_cnn 是下文“快速训练”部分定义的模型,实际运行时需要先定义模型再创建优化器
+        #使用 torch.optim 定义优化器
+ optimizer_1=optim.RMSprop(model_cnn.parameters(), lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
+ #使用fastNLP中包装的 Adam 定义优化器
+ optimizer_2=Adam(lr=4e-3, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, model_params=model_cnn.parameters())
+
+快速训练
+    现在我们可以导入 fastNLP 内置的文本分类模型 :class:`~fastNLP.models.CNNText` ,并使用 :class:`~fastNLP.Trainer` 进行训练。
+    除了使用 :class:`~fastNLP.Trainer` 进行训练,我们也可以使用 :class:`~fastNLP.DataSetIter` 来编写自己的训练过程,具体见 :doc:`/tutorials/tutorial_5_datasetiter` 。
+
+ .. code-block:: python
+
+        from fastNLP.models import CNNText
+        from fastNLP import Trainer
+
+ #词嵌入的维度、训练的轮数和batch size
+ EMBED_DIM = 100
+ N_EPOCHS = 10
+ BATCH_SIZE = 16
+
+ #使用CNNText的时候第一个参数输入一个tuple,作为模型定义embedding的参数
+ #还可以传入 kernel_nums, kernel_sizes, padding, dropout的自定义值
+ model_cnn = CNNText((len(vocab),EMBED_DIM), num_classes=3, padding=2, dropout=0.1)
+
+ #如果在定义trainer的时候没有传入optimizer参数,模型默认的优化器为torch.optim.Adam且learning rate为lr=4e-3
+ #这里只使用了optimizer_1作为优化器输入,感兴趣可以尝试optimizer_2或者其他优化器作为输入
+ #这里只使用了loss作为损失函数输入,感兴趣可以尝试其他损失函数输入
+ trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data, loss=loss, metrics=metrics,
+ optimizer=optimizer_1,n_epochs=N_EPOCHS, batch_size=BATCH_SIZE)
+ trainer.train()
+
+ 训练过程的输出如下::
+
+ input fields after batch(if batch size is 2):
+ words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 40])
+ seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
+ target fields after batch(if batch size is 2):
+ target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
+
+ training epochs started 2019-07-08-15-44-48
+ Evaluation at Epoch 1/10. Step:601/6010. AccuracyMetric: acc=0.59044
+
+ Evaluation at Epoch 2/10. Step:1202/6010. AccuracyMetric: acc=0.599813
+
+ Evaluation at Epoch 3/10. Step:1803/6010. AccuracyMetric: acc=0.508903
+
+ Evaluation at Epoch 4/10. Step:2404/6010. AccuracyMetric: acc=0.596064
+
+ Evaluation at Epoch 5/10. Step:3005/6010. AccuracyMetric: acc=0.47985
+
+ Evaluation at Epoch 6/10. Step:3606/6010. AccuracyMetric: acc=0.589503
+
+ Evaluation at Epoch 7/10. Step:4207/6010. AccuracyMetric: acc=0.311153
+
+ Evaluation at Epoch 8/10. Step:4808/6010. AccuracyMetric: acc=0.549203
+
+ Evaluation at Epoch 9/10. Step:5409/6010. AccuracyMetric: acc=0.581068
+
+ Evaluation at Epoch 10/10. Step:6010/6010. AccuracyMetric: acc=0.523899
+
+
+ In Epoch:2/Step:1202, got best dev performance:AccuracyMetric: acc=0.599813
+ Reloaded the best model.
+
+快速测试
+ 与 :class:`~fastNLP.Trainer` 对应,fastNLP 也提供了 :class:`~fastNLP.Tester` 用于快速测试,用法如下
+
+ .. code-block:: python
+
+ from fastNLP import Tester
+
+ tester = Tester(test_data, model_cnn, metrics=AccuracyMetric())
+ tester.test()
+
+    测试过程的输出如下::
+
+ [tester]
+ AccuracyMetric: acc=0.565401
diff --git a/docs/source/tutorials/tutorial_5_datasetiter.rst b/docs/source/tutorials/tutorial_5_datasetiter.rst
new file mode 100644
index 00000000..23d26deb
--- /dev/null
+++ b/docs/source/tutorials/tutorial_5_datasetiter.rst
@@ -0,0 +1,250 @@
+==============================================================================
+动手实现一个文本分类器II-使用DataSetIter实现自定义训练过程
+==============================================================================
+
+我们使用和 :doc:`/user/quickstart` 中一样的任务来进行详细的介绍。给出一段评价性文字,预测其情感倾向是积极(label=1)、
+消极(label=0)还是中性(label=2),使用 :class:`~fastNLP.DataSetIter` 类来编写自己的训练过程。
+自己编写训练过程之前的内容与 :doc:`/tutorials/tutorial_4_loss_optimizer` 中的完全一样,如已经阅读过可以跳过。
+
+--------------
+数据处理
+--------------
+
+数据读入
+ 我们可以使用 fastNLP :mod:`fastNLP.io` 模块中的 :class:`~fastNLP.io.SSTLoader` 类,轻松地读取SST数据集(数据来源:https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip)。
+ 这里的 dataset 是 fastNLP 中 :class:`~fastNLP.DataSet` 类的对象。
+
+ .. code-block:: python
+
+ from fastNLP.io import SSTLoader
+
+ loader = SSTLoader()
+ #这里的all.txt是下载好数据后train.txt、dev.txt、test.txt的组合
+ dataset = loader.load("./trainDevTestTrees_PTB/trees/all.txt")
+ print(dataset[0])
+
+ 输出数据如下::
+
+ {'words': ['It', "'s", 'a', 'lovely', 'film', 'with', 'lovely', 'performances', 'by', 'Buy', 'and', 'Accorsi', '.'] type=list,
+ 'target': positive type=str}
+
+ 除了读取数据外,fastNLP 还提供了读取其它文件类型的 Loader 类、读取 Embedding的 Loader 等。详见 :doc:`/fastNLP.io` 。
+
+
+数据处理
+ 我们使用 :class:`~fastNLP.DataSet` 类的 :meth:`~fastNLP.DataSet.apply` 方法将 ``target`` :mod:`~fastNLP.core.field` 转化为整数。
+
+ .. code-block:: python
+
+ def label_to_int(x):
+ if x['target']=="positive":
+ return 1
+ elif x['target']=="negative":
+ return 0
+ else:
+ return 2
+
+ # 将label转为整数
+ dataset.apply(lambda x: label_to_int(x), new_field_name='target')
+
+ ``words`` 和 ``target`` 已经足够用于 :class:`~fastNLP.models.CNNText` 的训练了,但我们从其文档
+ :class:`~fastNLP.models.CNNText` 中看到,在 :meth:`~fastNLP.models.CNNText.forward` 的时候,还可以传入可选参数 ``seq_len`` 。
+ 所以,我们再使用 :meth:`~fastNLP.DataSet.apply_field` 方法增加一个名为 ``seq_len`` 的 :mod:`~fastNLP.core.field` 。
+
+ .. code-block:: python
+
+ # 增加长度信息
+ dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')
+
+ 观察可知: :meth:`~fastNLP.DataSet.apply_field` 与 :meth:`~fastNLP.DataSet.apply` 类似,
+ 但所传入的 `lambda` 函数是针对一个 :class:`~fastNLP.Instance` 中的一个 :mod:`~fastNLP.core.field` 的;
+ 而 :meth:`~fastNLP.DataSet.apply` 所传入的 `lambda` 函数是针对整个 :class:`~fastNLP.Instance` 的。
+
+ .. note::
+ `lambda` 函数即匿名函数,是 Python 的重要特性。 ``lambda x: len(x)`` 和下面的这个函数的作用相同::
+
+ def func_lambda(x):
+ return len(x)
+
+ 你也可以编写复杂的函数做为 :meth:`~fastNLP.DataSet.apply_field` 与 :meth:`~fastNLP.DataSet.apply` 的参数
+
+Vocabulary 的使用
+ 我们再用 :class:`~fastNLP.Vocabulary` 类来统计数据中出现的单词,并使用 :meth:`~fastNLP.Vocabulary.index_dataset`
+ 将单词序列转化为训练可用的数字序列。
+
+ .. code-block:: python
+
+ from fastNLP import Vocabulary
+
+ # 使用Vocabulary类统计单词,并将单词序列转化为数字序列
+ vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
+ vocab.index_dataset(dataset, field_name='words',new_field_name='words')
+ print(dataset[0])
+
+ 输出数据如下::
+
+ {'words': [27, 9, 6, 913, 16, 18, 913, 124, 31, 5715, 5, 1, 2] type=list,
+ 'target': 1 type=int,
+ 'seq_len': 13 type=int}
+
+
+---------------------
+使用内置模型训练
+---------------------
+
+内置模型的输入输出命名
+ fastNLP内置了一些完整的神经网络模型,详见 :doc:`/fastNLP.models` , 我们使用其中的 :class:`~fastNLP.models.CNNText` 模型进行训练。
+ 为了使用内置的 :class:`~fastNLP.models.CNNText`,我们必须修改 :class:`~fastNLP.DataSet` 中 :mod:`~fastNLP.core.field` 的名称。
+ 在这个例子中模型输入 (forward方法的参数) 为 ``words`` 和 ``seq_len`` ; 预测输出为 ``pred`` ;标准答案为 ``target`` 。
+ 具体的命名规范可以参考 :doc:`/fastNLP.core.const` 。
+
+ 如果不想查看文档,您也可以使用 :class:`~fastNLP.Const` 类进行命名。下面的代码展示了给 :class:`~fastNLP.DataSet` 中
+ :mod:`~fastNLP.core.field` 改名的 :meth:`~fastNLP.DataSet.rename_field` 方法,以及 :class:`~fastNLP.Const` 类的使用方法。
+
+ .. code-block:: python
+
+ from fastNLP import Const
+
+ dataset.rename_field('words', Const.INPUT)
+ dataset.rename_field('seq_len', Const.INPUT_LEN)
+ dataset.rename_field('target', Const.TARGET)
+
+ print(Const.INPUT)
+ print(Const.INPUT_LEN)
+ print(Const.TARGET)
+ print(Const.OUTPUT)
+
+ 输出结果为::
+
+ words
+ seq_len
+ target
+ pred
+
+ 在给 :class:`~fastNLP.DataSet` 中 :mod:`~fastNLP.core.field` 改名后,我们还需要设置训练所需的输入和目标,这里使用的是
+ :meth:`~fastNLP.DataSet.set_input` 和 :meth:`~fastNLP.DataSet.set_target` 两个函数。
+
+ .. code-block:: python
+
+        #使用dataset的 set_input 和 set_target函数,告诉模型dataset中哪些数据是输入,哪些数据是标签(目标输出)
+ dataset.set_input(Const.INPUT, Const.INPUT_LEN)
+ dataset.set_target(Const.TARGET)
+
+数据集分割
+ 除了修改 :mod:`~fastNLP.core.field` 之外,我们还可以对 :class:`~fastNLP.DataSet` 进行分割,以供训练、开发和测试使用。
+ 下面这段代码展示了 :meth:`~fastNLP.DataSet.split` 的使用方法
+
+ .. code-block:: python
+
+ train_dev_data, test_data = dataset.split(0.1)
+ train_data, dev_data = train_dev_data.split(0.1)
+ print(len(train_data), len(dev_data), len(test_data))
+
+ 输出结果为::
+
+ 9603 1067 1185
+
+评价指标
+    训练模型需要提供一个评价指标。这里使用准确率作为评价指标。参数的 `命名规则` 跟上面类似。
+ ``pred`` 参数对应的是模型的 forward 方法返回的 dict 中的一个 key 的名字。
+ ``target`` 参数对应的是 :class:`~fastNLP.DataSet` 中作为标签的 :mod:`~fastNLP.core.field` 的名字。
+
+ .. code-block:: python
+
+ from fastNLP import AccuracyMetric
+
+ # metrics=AccuracyMetric() 在本例中与下面这行代码等价
+ metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)
+
+
+--------------------------
+自己编写训练过程
+--------------------------
+ 如果你想用类似 PyTorch 的使用方法,自己编写训练过程,你可以参考下面这段代码。
+ 其中使用了 fastNLP 提供的 :class:`~fastNLP.DataSetIter` 来获得小批量训练的小批量数据,
+ 使用 :class:`~fastNLP.BucketSampler` 做为 :class:`~fastNLP.DataSetIter` 的参数来选择采样的方式。
+
+DataSetIter
+ fastNLP定义的 :class:`~fastNLP.DataSetIter` 类,用于定义一个batch,并实现batch的多种功能,在初始化时传入的参数有:
+
+ * dataset: :class:`~fastNLP.DataSet` 对象, 数据集
+ * batch_size: 取出的batch大小
+ * sampler: 规定使用的 :class:`~fastNLP.Sampler` 若为 None, 使用 :class:`~fastNLP.RandomSampler` (Default: None)
+ * as_numpy: 若为 True, 输出batch为 `numpy.array`. 否则为 `torch.Tensor` (Default: False)
+ * prefetch: 若为 True使用多进程预先取出下一batch. (Default: False)
+
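+    下面给出一个按上述参数构造 :class:`~fastNLP.DataSetIter` 并取出一个 batch 的简单示意(使用前文切分得到的 train_data):
+
+    .. code-block:: python
+
+        from fastNLP import DataSetIter, RandomSampler
+
+        tmp_iter = DataSetIter(dataset=train_data, batch_size=16, sampler=RandomSampler())
+        for batch_x, batch_y in tmp_iter:
+            # batch_x 中是被 set_input 的 field,batch_y 中是被 set_target 的 field
+            print(batch_x['words'].size(), batch_y['target'].size())
+            break
+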
+sampler
+ fastNLP 实现的采样器有:
+
+ * :class:`~fastNLP.BucketSampler` 可以随机地取出长度相似的元素 【初始化参数: num_buckets:bucket的数量; batch_size:batch大小; seq_len_field_name:dataset中对应序列长度的 :mod:`~fastNLP.core.field` 的名字】
+ * SequentialSampler: 顺序取出元素的采样器【无初始化参数】
+ * RandomSampler:随机化取元素的采样器【无初始化参数】
+
+ 以下代码使用BucketSampler作为 :class:`~fastNLP.DataSetIter` 初始化的输入,运用 :class:`~fastNLP.DataSetIter` 自己写训练程序
+
+ .. code-block:: python
+
+ from fastNLP import BucketSampler
+ from fastNLP import DataSetIter
+ from fastNLP.models import CNNText
+        from fastNLP import Tester
+        from fastNLP import AccuracyMetric
+ import torch
+ import time
+
+ embed_dim = 100
+ model = CNNText((len(vocab),embed_dim), num_classes=3, padding=2, dropout=0.1)
+
+ def train(epoch, data, devdata):
+ optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
+ lossfunc = torch.nn.CrossEntropyLoss()
+ batch_size = 32
+
+            # 定义一个DataSetIter,传入DataSet,规定batch_size和取batch的规则。
+ # 顺序(Sequential),随机(Random),相似长度组成一个batch(Bucket)
+ train_sampler = BucketSampler(batch_size=batch_size, seq_len_field_name='seq_len')
+ train_batch = DataSetIter(batch_size=batch_size, dataset=data, sampler=train_sampler)
+
+ start_time = time.time()
+ print("-"*5+"start training"+"-"*5)
+ for i in range(epoch):
+ loss_list = []
+ for batch_x, batch_y in train_batch:
+ optimizer.zero_grad()
+ output = model(batch_x['words'])
+ loss = lossfunc(output['pred'], batch_y['target'])
+ loss.backward()
+ optimizer.step()
+ loss_list.append(loss.item())
+
+ #这里verbose如果为0,在调用Tester对象的test()函数时不输出任何信息,返回评估信息; 如果为1,打印出验证结果,返回评估信息
+ #在调用过Tester对象的test()函数后,调用其_format_eval_results(res)函数,结构化输出验证结果
+ tester_tmp = Tester(devdata, model, metrics=AccuracyMetric(), verbose=0)
+ res=tester_tmp.test()
+
+ print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=" ")
+                print(tester_tmp._format_eval_results(res), end=" ")
+ print('{:d}ms'.format(round((time.time()-start_time)*1000)))
+ loss_list.clear()
+
+ train(10, train_data, dev_data)
+ #使用tester进行快速测试
+ tester = Tester(test_data, model, metrics=AccuracyMetric())
+ tester.test()
+
+ 这段代码的输出如下::
+
+ -----start training-----
+ Epoch 0 Avg Loss: 1.09 AccuracyMetric: acc=0.480787 58989ms
+ Epoch 1 Avg Loss: 1.00 AccuracyMetric: acc=0.500469 118348ms
+ Epoch 2 Avg Loss: 0.93 AccuracyMetric: acc=0.536082 176220ms
+ Epoch 3 Avg Loss: 0.87 AccuracyMetric: acc=0.556701 236032ms
+ Epoch 4 Avg Loss: 0.78 AccuracyMetric: acc=0.562324 294351ms
+ Epoch 5 Avg Loss: 0.69 AccuracyMetric: acc=0.58388 353673ms
+ Epoch 6 Avg Loss: 0.60 AccuracyMetric: acc=0.574508 412106ms
+ Epoch 7 Avg Loss: 0.51 AccuracyMetric: acc=0.589503 471097ms
+ Epoch 8 Avg Loss: 0.44 AccuracyMetric: acc=0.581068 529174ms
+ Epoch 9 Avg Loss: 0.39 AccuracyMetric: acc=0.572634 586216ms
+ [tester]
+ AccuracyMetric: acc=0.527426
+
+
diff --git a/docs/source/tutorials/tutorial_6_seq_labeling.rst b/docs/source/tutorials/tutorial_6_seq_labeling.rst
new file mode 100644
index 00000000..09a53cdc
--- /dev/null
+++ b/docs/source/tutorials/tutorial_6_seq_labeling.rst
@@ -0,0 +1,114 @@
+=====================
+快速实现序列标注模型
+=====================
+
+这一部分的内容主要展示如何使用fastNLP实现序列标注任务。你可以使用fastNLP的各个组件快捷、方便地完成序列标注任务,达到出色的效果。
+在阅读这篇教程前,希望你已经熟悉了fastNLP的基础使用,包括基本数据结构、数据预处理以及embedding的使用等内容,并对之前的教程有所掌握。
+我们将对CoNLL-03的英文数据集进行处理,展示如何完成命名实体识别(NER)任务的整个训练过程。
+
+载入数据
+===================================
+fastNLP可以方便地载入各种类型的数据。同时,针对常见的数据集,我们已经预先实现了载入方法,其中包含CoNLL-03数据集。
+在设计dataloader时,以DataSetLoader为基类,可以改写并应用于其他数据集的载入。
+
+.. code-block:: python
+
+ class Conll2003DataLoader(DataSetLoader):
+ def __init__(self, task:str='ner', encoding_type:str='bioes'):
+ assert task in ('ner', 'pos', 'chunk')
+ index = {'ner':3, 'pos':1, 'chunk':2}[task]
+ #ConllLoader是fastNLP内置的类
+ self._loader = ConllLoader(headers=['raw_words', 'target'], indexes=[0, index])
+ self._tag_converters = None
+ if task in ('ner', 'chunk'):
+                #iob2和iob2bioes会对tag进行统一、标准化
+ self._tag_converters = [iob2]
+ if encoding_type == 'bioes':
+ self._tag_converters.append(iob2bioes)
+
+ def load(self, path: str):
+ dataset = self._loader.load(path)
+ def convert_tag_schema(tags):
+ for converter in self._tag_converters:
+ tags = converter(tags)
+ return tags
+ if self._tag_converters:
+                #使用apply_field应用convert_tag_schema函数,实际上也支持匿名函数
+ dataset.apply_field(convert_tag_schema, field_name=Const.TARGET, new_field_name=Const.TARGET)
+ return dataset
+
+输出的数据格式如下::
+
+    {'raw_words': ['on', 'Friday', ':'] type=list,
+    'target': ['O', 'O', 'O'] type=list},
+
+
+数据处理
+----------------------------
+我们进一步处理数据。将数据和词表封装在 :class:`~fastNLP.DataBundle` 类中。data是DataBundle的实例。
+我们输入模型的数据包括char embedding,以及word embedding。在数据处理部分,我们尝试完成词表的构建。
+使用fastNLP中的Vocabulary类来构建词表。
+
+.. code-block:: python
+
+ word_vocab = Vocabulary(min_freq=2)
+ word_vocab.from_dataset(data.datasets['train'], field_name=Const.INPUT)
+ word_vocab.index_dataset(*data.datasets.values(),field_name=Const.INPUT, new_field_name=Const.INPUT)
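+
+类似地,标签(target)也需要构建词表,后文的模型和评测指标会用到它。下面只是一个示意,实际的预处理代码中它会被存入 data.vocabs:
+
+.. code-block:: python
+
+    # 标签词表一般不需要 unknown 和 padding
+    tag_vocab = Vocabulary(unknown=None, padding=None)
+    tag_vocab.from_dataset(data.datasets['train'], field_name=Const.TARGET)
+    tag_vocab.index_dataset(*data.datasets.values(), field_name=Const.TARGET)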
+
+处理后的 data 对象内部为::
+
+    dataset
+    vocabs
+
+其中 dataset 保存了 train 和 test 中的数据,均为 DataSet 类型;vocabs 保存了 words、raw_words 以及 target 的词表。
+
+模型构建
+--------------------------------
+我们使用CNN-BiLSTM-CRF模型完成这一任务。在网络构建方面,fastNLP的网络定义继承PyTorch的 :class:`nn.Module` 类,
+用户可以按照PyTorch的方式自行定义网络。需要注意的是各个输入输出的命名:fastNLP的标准命名位于 :class:`~fastNLP.Const` 类中。
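+
+为了说明上面的约定,下面给出一个极简的模型骨架示意(只有 BiLSTM 和全连接层,没有包含教程实际使用的 char embedding 与 CRF),其中的结构和参数都只是示例:
+
+.. code-block:: python
+
+    import torch.nn as nn
+    from fastNLP import Const
+
+    class SkeletonTagger(nn.Module):
+        def __init__(self, word_embed, num_tags, hidden_size=200):
+            super().__init__()
+            # 这里假设传入的 word_embed(如 StaticEmbedding)提供 embedding_dim 属性
+            self.embed = word_embed
+            self.lstm = nn.LSTM(word_embed.embedding_dim, hidden_size // 2,
+                                batch_first=True, bidirectional=True)
+            self.fc = nn.Linear(hidden_size, num_tags)
+
+        def forward(self, words, seq_len=None):
+            # words: (batch_size, max_len) 的词 id;参数名需要与 DataSet 中 set_input 的 field 名一致
+            feats, _ = self.lstm(self.embed(words))
+            # 按照 fastNLP 的命名规范,预测结果以 Const.OUTPUT(即 'pred')作为 key 返回
+            return {Const.OUTPUT: self.fc(feats)}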
+
+模型的训练
+--------------------------------
+首先实例化模型,导入所需的char embedding以及word embedding。Embedding的载入可以参考之前的教程,
+也可以查看 :mod:`~fastNLP.modules.encoder.embedding` 了解所需embedding的载入方法。
+fastNLP将模型的训练过程封装在了 :class:`~fastNLP.Trainer` 类中。
+根据不同的任务调整trainer中的参数即可。通常,一个trainer实例需要指定:训练数据集、模型、优化器、loss函数、评测指标,以及训练的epoch数、batch size等参数。
+
+.. code-block:: python
+
+ #实例化模型
+ model = CNNBiLSTMCRF(word_embed, char_embed, hidden_size=200, num_layers=1, tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type)
+ #定义优化器
+ optimizer = Adam(model.parameters(), lr=0.005)
+    #定义评估指标
+    metrics = SpanFPreRecMetric(tag_vocab=data.vocabs[Const.TARGET], encoding_type=encoding_type)
+    #实例化trainer(这里没有传入callbacks,如需使用callback可参考后面的教程)
+    trainer = Trainer(train_data=data.datasets['train'], model=model, optimizer=optimizer, dev_data=data.datasets['test'],
+                      batch_size=10, metrics=metrics, n_epochs=100)
+ #开始训练
+ trainer.train()
+
+训练中会保存最优的参数配置。
+训练的结果如下::
+
+ Evaluation on DataSet test:
+ SpanFPreRecMetric: f=0.727661, pre=0.732293, rec=0.723088
+ Evaluation at Epoch 1/100. Step:1405/140500. SpanFPreRecMetric: f=0.727661, pre=0.732293, rec=0.723088
+
+ Evaluation on DataSet test:
+ SpanFPreRecMetric: f=0.784307, pre=0.779371, rec=0.789306
+ Evaluation at Epoch 2/100. Step:2810/140500. SpanFPreRecMetric: f=0.784307, pre=0.779371, rec=0.789306
+
+ Evaluation on DataSet test:
+ SpanFPreRecMetric: f=0.810068, pre=0.811003, rec=0.809136
+ Evaluation at Epoch 3/100. Step:4215/140500. SpanFPreRecMetric: f=0.810068, pre=0.811003, rec=0.809136
+
+ Evaluation on DataSet test:
+ SpanFPreRecMetric: f=0.829592, pre=0.84153, rec=0.817989
+ Evaluation at Epoch 4/100. Step:5620/140500. SpanFPreRecMetric: f=0.829592, pre=0.84153, rec=0.817989
+
+ Evaluation on DataSet test:
+ SpanFPreRecMetric: f=0.828789, pre=0.837096, rec=0.820644
+ Evaluation at Epoch 5/100. Step:7025/140500. SpanFPreRecMetric: f=0.828789, pre=0.837096, rec=0.820644
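+
+训练结束后,也可以仿照前面教程的做法,用 :class:`~fastNLP.Tester` 在测试集上再做一次评测。下面是一个简单的示意,沿用上面定义的 model 和 metrics:
+
+.. code-block:: python
+
+    from fastNLP import Tester
+
+    tester = Tester(data=data.datasets['test'], model=model, metrics=metrics, batch_size=10)
+    tester.test()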
+
+
diff --git a/docs/source/tutorials/tutorial_7_modules_models.rst b/docs/source/tutorials/tutorial_7_modules_models.rst
new file mode 100644
index 00000000..680d75fd
--- /dev/null
+++ b/docs/source/tutorials/tutorial_7_modules_models.rst
@@ -0,0 +1,207 @@
+======================================
+使用Modules和Models快速搭建自定义模型
+======================================
+
+:mod:`~fastNLP.modules` 和 :mod:`~fastNLP.models` 用于构建 fastNLP 所需的神经网络模型,它可以和 torch.nn 中的模型一起使用。
+下面我们会分三节介绍编写构建模型的具体方法。
+
+
+----------------------
+使用 models 中的模型
+----------------------
+
+fastNLP 在 :mod:`~fastNLP.models` 模块中内置了如 :class:`~fastNLP.models.CNNText` 、
+:class:`~fastNLP.models.SeqLabeling` 等完整的模型,以供用户直接使用。
+以 :class:`~fastNLP.models.CNNText` 为例,我们看一个简单的文本分类的任务的实现过程。
+
+首先是数据读入和处理部分,这里的代码和 :doc:`快速入门 </user/quickstart>` 中一致。
+
+.. code-block:: python
+
+ from fastNLP.io import CSVLoader
+ from fastNLP import Vocabulary, CrossEntropyLoss, AccuracyMetric
+
+ loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\t')
+ dataset = loader.load("./sample_data/tutorial_sample_dataset.csv")
+
+ dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')
+ dataset.apply_field(lambda x: x.split(), field_name='sentence', new_field_name='words', is_input=True)
+ dataset.apply(lambda x: int(x['label']), new_field_name='target', is_target=True)
+
+ train_dev_data, test_data = dataset.split(0.1)
+ train_data, dev_data = train_dev_data.split(0.1)
+
+ vocab = Vocabulary(min_freq=2).from_dataset(train_data, field_name='words')
+ vocab.index_dataset(train_data, dev_data, test_data, field_name='words', new_field_name='words')
+
+然后我们从 :mod:`~fastNLP.models` 中导入 ``CNNText`` 模型,用它进行训练
+
+.. code-block:: python
+
+ from fastNLP.models import CNNText
+ from fastNLP import Trainer
+
+ model_cnn = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
+
+ trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data,
+ loss=CrossEntropyLoss(), metrics=AccuracyMetric())
+ trainer.train()
+
+在 IPython 环境中输入 ``model_cnn`` ,我们可以看到 ``model_cnn`` 的网络结构
+
+.. parsed-literal::
+
+ CNNText(
+ (embed): Embedding(
+ 169, 50
+ (dropout): Dropout(p=0.0)
+ )
+ (conv_pool): ConvMaxpool(
+ (convs): ModuleList(
+ (0): Conv1d(50, 3, kernel_size=(3,), stride=(1,), padding=(2,))
+ (1): Conv1d(50, 4, kernel_size=(4,), stride=(1,), padding=(2,))
+ (2): Conv1d(50, 5, kernel_size=(5,), stride=(1,), padding=(2,))
+ )
+ )
+ (dropout): Dropout(p=0.1)
+ (fc): Linear(in_features=12, out_features=5, bias=True)
+ )
+
+FastNLP 中内置的 models 如下表所示,您可以点击具体的名称查看详细的 API:
+
+.. csv-table::
+ :header: 名称, 介绍
+
+ :class:`~fastNLP.models.CNNText` , 使用 CNN 进行文本分类的模型
+ :class:`~fastNLP.models.SeqLabeling` , 简单的序列标注模型
+ :class:`~fastNLP.models.AdvSeqLabel` , 更大网络结构的序列标注模型
+ :class:`~fastNLP.models.ESIM` , ESIM 模型的实现
+    :class:`~fastNLP.models.StarTransEnc` , 带 word-embedding 的 Star-Transformer 模型
+ :class:`~fastNLP.models.STSeqLabel` , 用于序列标注的 Star-Transformer 模型
+ :class:`~fastNLP.models.STNLICls` ,用于自然语言推断 (NLI) 的 Star-Transformer 模型
+ :class:`~fastNLP.models.STSeqCls` , 用于分类任务的 Star-Transformer 模型
+ :class:`~fastNLP.models.BiaffineParser` , Biaffine 依存句法分析网络的实现
+
+----------------------------
+使用 torch.nn 编写模型
+----------------------------
+
+FastNLP 完全支持使用 PyTorch 编写的模型,但与 PyTorch 中编写模型的常见方法不同,
+用于 fastNLP 的模型中 forward 函数需要返回一个字典,字典中至少需要包含 ``pred`` 这个字段。
+
+下面是使用 PyTorch 中的 torch.nn 模块编写的文本分类模型,注意观察代码中标注的向量维度。
+由于 PyTorch 使用了约定俗成的维度设置(例如 LSTM 默认把 seq_len 放在第一维),forward 中需要多次处理维度顺序。
+
+.. code-block:: python
+
+ import torch
+ import torch.nn as nn
+
+ class LSTMText(nn.Module):
+ def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
+ super().__init__()
+
+ self.embedding = nn.Embedding(vocab_size, embedding_dim)
+ self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True, dropout=dropout)
+ self.fc = nn.Linear(hidden_dim * 2, output_dim)
+ self.dropout = nn.Dropout(dropout)
+
+ def forward(self, words):
+ # (input) words : (batch_size, seq_len)
+ words = words.permute(1,0)
+ # words : (seq_len, batch_size)
+
+ embedded = self.dropout(self.embedding(words))
+ # embedded : (seq_len, batch_size, embedding_dim)
+ output, (hidden, cell) = self.lstm(embedded)
+ # output: (seq_len, batch_size, hidden_dim * 2)
+ # hidden: (num_layers * 2, batch_size, hidden_dim)
+ # cell: (num_layers * 2, batch_size, hidden_dim)
+
+ hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
+ hidden = self.dropout(hidden)
+ # hidden: (batch_size, hidden_dim * 2)
+
+ pred = self.fc(hidden.squeeze(0))
+ # result: (batch_size, output_dim)
+ return {"pred":pred}
+
+我们同样可以在 iPython 环境中查看这个模型的网络结构
+
+.. parsed-literal::
+
+ LSTMText(
+ (embedding): Embedding(169, 50)
+ (lstm): LSTM(50, 64, num_layers=2, dropout=0.5, bidirectional=True)
+ (fc): Linear(in_features=128, out_features=5, bias=True)
+ (dropout): Dropout(p=0.5)
+ )
+
+----------------------------
+使用 modules 编写模型
+----------------------------
+
+下面我们使用 :mod:`fastNLP.modules` 中的组件来构建同样的网络。由于 fastNLP 统一把 ``batch_size`` 放在第一维,
+在编写代码的过程中会有一定的便利。
+
+.. code-block:: python
+
+ from fastNLP.modules import Embedding, LSTM, MLP
+
+ class Model(nn.Module):
+ def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
+ super().__init__()
+
+ self.embedding = Embedding((vocab_size, embedding_dim))
+ self.lstm = LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
+ self.mlp = MLP([hidden_dim*2,output_dim], dropout=dropout)
+
+ def forward(self, words):
+ embedded = self.embedding(words)
+ _,(hidden,_) = self.lstm(embedded)
+ pred = self.mlp(torch.cat((hidden[-1],hidden[-2]),dim=1))
+ return {"pred":pred}
+
+我们自己编写模型的网络结构如下
+
+.. parsed-literal::
+
+ Model(
+ (embedding): Embedding(
+ 169, 50
+ (dropout): Dropout(p=0.0)
+ )
+ (lstm): LSTM(
+ (lstm): LSTM(50, 64, num_layers=2, batch_first=True, bidirectional=True)
+ )
+ (mlp): MLP(
+ (hiddens): ModuleList()
+ (output): Linear(in_features=128, out_features=5, bias=True)
+ (dropout): Dropout(p=0.5)
+ )
+ )
+
+FastNLP 中包含的各种模块如下表,您可以点击具体的名称查看详细的 API,也可以通过 :doc:`/fastNLP.modules` 进行了解。
+
+.. csv-table::
+ :header: 名称, 介绍
+
+ :class:`~fastNLP.modules.ConvolutionCharEncoder` , char级别的卷积 encoder
+ :class:`~fastNLP.modules.LSTMCharEncoder` , char级别基于LSTM的 encoder
+ :class:`~fastNLP.modules.ConvMaxpool` , 结合了Convolution和Max-Pooling于一体的模块
+ :class:`~fastNLP.modules.LSTM` , LSTM模块, 轻量封装了PyTorch的LSTM
+ :class:`~fastNLP.modules.StarTransformer` , Star-Transformer 的encoder部分
+ :class:`~fastNLP.modules.TransformerEncoder` , Transformer的encoder模块,不包含embedding层
+ :class:`~fastNLP.modules.VarRNN` , Variational Dropout RNN 模块
+ :class:`~fastNLP.modules.VarLSTM` , Variational Dropout LSTM 模块
+ :class:`~fastNLP.modules.VarGRU` , Variational Dropout GRU 模块
+ :class:`~fastNLP.modules.MaxPool` , Max-pooling模块
+ :class:`~fastNLP.modules.MaxPoolWithMask` , 带mask矩阵的max pooling。在做 max-pooling的时候不会考虑mask值为0的位置。
+ :class:`~fastNLP.modules.AvgPool` , Average-pooling模块
+ :class:`~fastNLP.modules.AvgPoolWithMask` , 带mask矩阵的average pooling。在做 average-pooling的时候不会考虑mask值为0的位置。
+ :class:`~fastNLP.modules.MultiHeadAttention` , MultiHead Attention 模块
+ :class:`~fastNLP.modules.MLP` , 简单的多层感知器模块
+ :class:`~fastNLP.modules.ConditionalRandomField` , 条件随机场模块
+ :class:`~fastNLP.modules.viterbi_decode` , 给定一个特征矩阵以及转移分数矩阵,计算出最佳的路径以及对应的分数 (与 :class:`~fastNLP.modules.ConditionalRandomField` 配合使用)
+ :class:`~fastNLP.modules.allowed_transitions` , 给定一个id到label的映射表,返回所有可以跳转的列表(与 :class:`~fastNLP.modules.ConditionalRandomField` 配合使用)
+ :class:`~fastNLP.modules.TimestepDropout` , 简单包装过的Dropout 组件
diff --git a/docs/source/tutorials/tutorial_8_metrics.rst b/docs/source/tutorials/tutorial_8_metrics.rst
new file mode 100644
index 00000000..0b4f86c8
--- /dev/null
+++ b/docs/source/tutorials/tutorial_8_metrics.rst
@@ -0,0 +1,121 @@
+===============================
+使用Metric快速评测你的模型
+===============================
+
+在进行训练时,fastNLP提供了各种各样的 :mod:`~fastNLP.core.metrics` 。
+如 :doc:`/user/quickstart` 中所介绍的,:class:`~fastNLP.AccuracyMetric` 类的对象被直接传到 :class:`~fastNLP.Trainer` 中用于训练
+
+.. code-block:: python
+
+ from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric
+
+ trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
+ loss=CrossEntropyLoss(), metrics=AccuracyMetric())
+ trainer.train()
+
+除了 :class:`~fastNLP.AccuracyMetric` 之外, :class:`~fastNLP.SpanFPreRecMetric` 也是一种常用的评价指标,
+例如在序列标注问题中,常以span的方式计算 F-measure, precision, recall。
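+
+例如在序列标注任务中,可以像下面这样构造一个 :class:`~fastNLP.SpanFPreRecMetric` (示意,其中 tag_vocab 是标签的 :class:`~fastNLP.Vocabulary` ,encoding_type 需要与数据的标注方式一致):
+
+.. code-block:: python
+
+    from fastNLP import SpanFPreRecMetric
+
+    span_metric = SpanFPreRecMetric(tag_vocab=tag_vocab, encoding_type='bio')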
+
+另外,fastNLP 还实现了用于抽取式QA(如SQuAD)的metric :class:`~fastNLP.ExtractiveQAMetric`。
+用户可以参考下面这个表格,点击第一列查看各个 :mod:`~fastNLP.core.metrics` 的详细文档。
+
+.. csv-table::
+ :header: 名称, 介绍
+
+ :class:`~fastNLP.core.metrics.MetricBase` , 自定义metrics需继承的基类
+ :class:`~fastNLP.core.metrics.AccuracyMetric` , 简单的正确率metric
+ :class:`~fastNLP.core.metrics.SpanFPreRecMetric` , "同时计算 F-measure, precision, recall 值的 metric"
+ :class:`~fastNLP.core.metrics.ExtractiveQAMetric` , 用于抽取式QA任务 的metric
+
+更多的 :mod:`~fastNLP.core.metrics` 正在被添加到 fastNLP 当中,敬请期待。
+
+------------------------------
+定义自己的metrics
+------------------------------
+
+在定义自己的metrics类时需继承 fastNLP 的 :class:`~fastNLP.core.metrics.MetricBase`,
+并覆盖写入 ``evaluate`` 和 ``get_metric`` 方法。
+
+ evaluate(xxx) 中传入一个批次的数据,将针对一个批次的预测结果做评价指标的累计
+
+ get_metric(xxx) 当所有数据处理完毕时调用该方法,它将根据 evaluate函数累计的评价指标统计量来计算最终的评价结果
+
+以分类问题中的Accuracy计算为例,假设model的forward返回的dict中包含 `pred` 这个key, 并且该key需要用于计算Accuracy::
+
+ class Model(nn.Module):
+ def __init__(xxx):
+ # do something
+ def forward(self, xxx):
+ # do something
+ return {'pred': pred, 'other_keys':xxx} # pred's shape: batch_size x num_classes
+
+假设dataset中 `label` 这个field是需要预测的值,并且该field已经被设置为了target。
+对应的AccMetric可以按如下方式定义。version1:只在当前场景使用,不考虑复用::
+
+ class AccMetric(MetricBase):
+ def __init__(self):
+ super().__init__()
+
+ # 根据你的情况自定义指标
+ self.corr_num = 0
+ self.total = 0
+
+ def evaluate(self, label, pred): # 这里的名称需要和dataset中target field与model返回的key是一样的,不然找不到对应的value
+ # dev或test时,每个batch结束会调用一次该方法,需要实现如何根据每个batch累加metric
+ self.total += label.size(0)
+ self.corr_num += label.eq(pred).sum().item()
+
+ def get_metric(self, reset=True): # 在这里定义如何计算metric
+ acc = self.corr_num/self.total
+ if reset: # 是否清零以便重新计算
+ self.corr_num = 0
+ self.total = 0
+ return {'acc': acc} # 需要返回一个dict,key为该metric的名称,该名称会显示到Trainer的progress bar中
+
+
+version2,如果需要复用Metric,比如下一次使用AccMetric时,dataset中目标field不叫label而叫y,或者model的输出不是pred::
+
+ class AccMetric(MetricBase):
+ def __init__(self, label=None, pred=None):
+ # 假设在另一场景使用时,目标field叫y,model给出的key为pred_y。则只需要在初始化AccMetric时,
+ # acc_metric = AccMetric(label='y', pred='pred_y')即可。
+            # 当初始化为acc_metric = AccMetric(),即label=None, pred=None时,fastNLP会直接使用'label', 'pred'作为key去索取对应的值
+ super().__init__()
+ self._init_param_map(label=label, pred=pred) # 该方法会注册label和pred. 仅需要注册evaluate()方法会用到的参数名即可
+            # 如果没有注册,则效果与version1是一样的
+
+ # 根据你的情况自定义指标
+ self.corr_num = 0
+ self.total = 0
+
+ def evaluate(self, label, pred): # 这里的参数名称需要和self._init_param_map()注册时一致。
+ # dev或test时,每个batch结束会调用一次该方法,需要实现如何根据每个batch累加metric
+ self.total += label.size(0)
+ self.corr_num += label.eq(pred).sum().item()
+
+ def get_metric(self, reset=True): # 在这里定义如何计算metric
+ acc = self.corr_num/self.total
+ if reset: # 是否清零以便重新计算
+ self.corr_num = 0
+ self.total = 0
+ return {'acc': acc} # 需要返回一个dict,key为该metric的名称,该名称会显示到Trainer的progress bar中
+
+
+``MetricBase`` 将会在输入的字典 ``pred_dict`` 和 ``target_dict`` 中进行检查.
+``pred_dict`` 是模型当中 ``forward()`` 函数或者 ``predict()`` 函数的返回值.
+``target_dict`` 是DataSet当中的ground truth, 判定ground truth的条件是field的 ``is_target`` 被设置为True.
+
+``MetricBase`` 会进行以下的类型检测:
+
+1. self.evaluate当中是否有varargs, 这是不支持的.
+2. self.evaluate当中所需要的参数是否既不在 ``pred_dict`` 也不在 ``target_dict`` .
+3. self.evaluate当中所需要的参数是否既在 ``pred_dict`` 也在 ``target_dict`` .
+
+除此以外,在参数被传入self.evaluate以前,这个函数还会检测 ``pred_dict`` 和 ``target_dict`` 当中没有被用到的参数;
+如果kwargs是self.evaluate的参数,则不进行这一项检测。
+
+
+self.evaluate将计算一个批次(batch)的评价指标并累计,没有返回值;
+self.get_metric将统计当前的评价指标并返回评价结果,返回值需要是一个dict,key是指标名称,value是指标的值。
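+
+自定义的 metric 定义好之后,用法与内置 metric 相同,直接传给 :class:`~fastNLP.Trainer` 或 :class:`~fastNLP.Tester` 即可。下面是一个简单示意,沿用上面 version2 的 AccMetric,其中 test_data 和 model 分别是你的测试集和模型:
+
+.. code-block:: python
+
+    from fastNLP import Tester
+
+    # dataset 中作为标签的 field 名为 'target',模型 forward 返回的 key 为 'pred'
+    tester = Tester(test_data, model, metrics=AccMetric(label='target', pred='pred'))
+    tester.test()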
+
diff --git a/docs/source/tutorials/tutorial_9_callback.rst b/docs/source/tutorials/tutorial_9_callback.rst
new file mode 100644
index 00000000..8e2742bb
--- /dev/null
+++ b/docs/source/tutorials/tutorial_9_callback.rst
@@ -0,0 +1,67 @@
+===================================================
+使用Callback自定义你的训练过程
+===================================================
+
+在训练时,我们常常要使用trick来提高模型的性能(如调节学习率),或者要打印训练中的信息。
+这里我们提供Callback类,在Trainer中插入代码,完成一些自定义的操作。
+
+我们使用和 :doc:`/user/quickstart` 中一样的任务来进行详细的介绍。
+给出一段评价性文字,预测其情感倾向是积极(label=1)、消极(label=0)还是中性(label=2),使用 :class:`~fastNLP.Trainer` 和 :class:`~fastNLP.Tester` 来进行快速训练和测试。
+关于数据处理,Loss和Optimizer的选择可以看其他教程,这里仅在训练时加入学习率衰减。
+
+---------------------
+Callback的构建和使用
+---------------------
+
+创建Callback
+ 我们可以继承fastNLP :class:`~fastNLP.Callback` 类来定义自己的Callback。
+ 这里我们实现一个让学习率线性衰减的Callback。
+
+ .. code-block:: python
+
+ import fastNLP
+
+ class LRDecay(fastNLP.Callback):
+ def __init__(self):
+                super(LRDecay, self).__init__()
+ self.base_lrs = []
+ self.delta = []
+
+ def on_train_begin(self):
+ # 初始化,仅训练开始时调用
+ self.base_lrs = [pg['lr'] for pg in self.optimizer.param_groups]
+ self.delta = [float(lr) / self.n_epochs for lr in self.base_lrs]
+
+ def on_epoch_end(self):
+ # 每个epoch结束时,更新学习率
+ ep = self.epoch
+ lrs = [lr - d * ep for lr, d in zip(self.base_lrs, self.delta)]
+ self.change_lr(lrs)
+
+ def change_lr(self, lrs):
+ for pg, lr in zip(self.optimizer.param_groups, lrs):
+ pg['lr'] = lr
+
+    这里, :class:`~fastNLP.Callback` 中所有以 ``on_`` 开头的类方法会在 :class:`~fastNLP.Trainer` 训练的特定阶段被调用。
+ 如 on_train_begin() 会在训练开始时被调用,on_epoch_end() 会在每个 epoch 结束时调用。
+ 具体有哪些类方法,参见文档 :class:`~fastNLP.Callback` 。
+
+ 另外,为了使用方便,可以在 :class:`~fastNLP.Callback` 内部访问 :class:`~fastNLP.Trainer` 中的属性,如 optimizer, epoch, step,分别对应训练时的优化器,当前epoch数,和当前的总step数。
+ 具体可访问的属性,参见文档 :class:`~fastNLP.Callback` 。
+
+使用Callback
+ 在定义好 :class:`~fastNLP.Callback` 之后,就能将它传入Trainer的 ``callbacks`` 参数,在实际训练时使用。
+
+ .. code-block:: python
+
+ """
+ 数据预处理,模型定义等等
+ """
+
+ trainer = fastNLP.Trainer(
+ model=model, train_data=train_data, dev_data=dev_data,
+ optimizer=optimizer, metrics=metrics,
+ batch_size=10, n_epochs=100,
+ callbacks=[LRDecay()])
+
+ trainer.train()
diff --git a/docs/source/user/docs_in_code.rst b/docs/source/user/docs_in_code.rst
new file mode 100644
index 00000000..a0b9576f
--- /dev/null
+++ b/docs/source/user/docs_in_code.rst
@@ -0,0 +1,3 @@
+===============
+在代码中写文档
+===============
\ No newline at end of file
diff --git a/docs/source/user/example.rst b/docs/source/user/example.rst
new file mode 100644
index 00000000..70ebe628
--- /dev/null
+++ b/docs/source/user/example.rst
@@ -0,0 +1,156 @@
+======
+大标题
+======
+
+.. note::
+ 中文标题需要符号的数量至少是中文字数的两倍
+
+.. warning::
+ 符号的数量只可以多,不可以少。
+
+小标题1
+###########
+
+小标题2
+*********
+
+小标题3(正常使用)
+========================
+
+小标题4
+-------------------
+
+推荐使用大标题、小标题3和小标题4
+
+官方文档 http://docutils.sourceforge.net/docs/user/rst/quickref.html
+
+`熟悉markdown的同学推荐参考这篇文章 `_
+
+\<\>内表示的是链接地址,\<\>外的是显示到外面的文字
+
+常见语法
+============
+
+*emphasis*
+
+**strong**
+
+`text`
+
+``inline literal``
+
+http://docutils.sf.net/ 孤立的网址会自动生成链接
+
+显示为特定的文字的链接 `sohu `_
+
+突出显示的
+ 上面文字
+
+正常缩进
+
+ 形成锻炼
+
+
+
+特殊模块
+============
+
+选项会自动识别
+
+-v An option
+-o file Same with value
+--delta A long option
+--delta=len Same with value
+
+
+图片
+
+.. image:: ../figures/procedures.PNG
+ :height: 200
+ :width: 560
+ :scale: 50
+ :alt: alternate text
+ :align: center
+
+显示一个冒号的代码块::
+
+ 中间要空一行
+
+::
+
+ 不显示冒号的代码块
+
+.. code-block:: python
+    :linenos:
+    :emphasize-lines: 1,3
+
+ print("专业的代码块")
+ print("")
+ print("有行号和高亮")
+
+数学块
+==========
+
+.. math::
+
+ H_2O + Na = NaOH + H_2 \uparrow
+
+复杂表格
+==========
+
++------------------------+------------+----------+----------+
+| Header row, column 1 | Header 2 | Header 3 | Header 4 |
+| (header rows optional) | | | |
++========================+============+==========+==========+
+| body row 1, column 1 | column 2 | column 3 | column 4 |
++------------------------+------------+----------+----------+
+| body row 2 | Cells may span columns. |
++------------------------+------------+---------------------+
+| body row 3 | Cells may | - Table cells |
++------------------------+ span rows. | - contain |
+| body row 4 | | - body elements. |
++------------------------+------------+---------------------+
+
+简易表格
+==========
+
+===== ===== ======
+ Inputs Output
+------------ ------
+ A B A or B
+===== ===== ======
+False False False
+True True True
+===== ===== ======
+
+csv 表格
+============
+
+.. csv-table::
+ :header: sentence, target
+
+ This is the first instance ., 0
+ Second instance ., 1
+ Third instance ., 1
+ ..., ...
+
+
+
+[重要]各种链接
+===================
+
+各种链接帮助我们连接到fastNLP文档的各个位置
+
+\<\>内表示的是链接地址,\<\>外的是显示到外面的文字
+
+:doc:`根据文件名链接 `
+
+:mod:`~fastNLP.core.batch`
+
+:class:`~fastNLP.Batch`
+
+~表示只显示最后一项
+
+:meth:`fastNLP.DataSet.apply`
+
diff --git a/docs/source/user/installation.rst b/docs/source/user/installation.rst
index c218b3e1..42ea402c 100644
--- a/docs/source/user/installation.rst
+++ b/docs/source/user/installation.rst
@@ -7,10 +7,12 @@
fastNLP 依赖如下包::
- torch>=0.4.0
- numpy
- tqdm
- nltk
+ numpy>=1.14.2
+ torch>=1.0.0
+ tqdm>=4.28.1
+ nltk>=3.4.1
+ requests
+ spacy
其中torch的安装可能与操作系统及 CUDA 的版本相关,请参见 `PyTorch 官网 `_ 。
在依赖包安装完成的情况,您可以在命令行执行如下指令完成安装
@@ -18,3 +20,4 @@ fastNLP 依赖如下包::
.. code:: shell
>>> pip install fastNLP
+ >>> python -m spacy download en
diff --git a/docs/source/user/quickstart.rst b/docs/source/user/quickstart.rst
index 43056a26..b92645b0 100644
--- a/docs/source/user/quickstart.rst
+++ b/docs/source/user/quickstart.rst
@@ -49,7 +49,7 @@
.. code-block:: python
from fastNLP.models import CNNText
- model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
+ model = CNNText((len(vocab),50), num_classes=5, dropout=0.1)
:class:`~fastNLP.models.CNNText` 的网络结构如下::
@@ -121,4 +121,4 @@
In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.8
Reloaded the best model.
-这份教程只是简单地介绍了使用 fastNLP 工作的流程,具体的细节分析见 :doc:`/user/tutorial_one`
\ No newline at end of file
+这份教程只是简单地介绍了使用 fastNLP 工作的流程,更多的教程分析见 :doc:`/user/tutorials`
diff --git a/docs/source/user/tutorial_one.rst b/docs/source/user/tutorial_one.rst
deleted file mode 100644
index 0c7be77d..00000000
--- a/docs/source/user/tutorial_one.rst
+++ /dev/null
@@ -1,371 +0,0 @@
-===============
-详细指南
-===============
-
-我们使用和 :doc:`/user/quickstart` 中一样的任务来进行详细的介绍。给出一段文字,预测它的标签是0~4中的哪一个
-(数据来源 `kaggle `_ )。
-
---------------
-数据处理
---------------
-
-数据读入
- 我们可以使用 fastNLP :mod:`fastNLP.io` 模块中的 :class:`~fastNLP.io.CSVLoader` 类,轻松地从 csv 文件读取我们的数据。
- 这里的 dataset 是 fastNLP 中 :class:`~fastNLP.DataSet` 类的对象
-
- .. code-block:: python
-
- from fastNLP.io import CSVLoader
-
- loader = CSVLoader(headers=('raw_sentence', 'label'), sep='\t')
- dataset = loader.load("./sample_data/tutorial_sample_dataset.csv")
-
- 除了读取数据外,fastNLP 还提供了读取其它文件类型的 Loader 类、读取 Embedding的 Loader 等。详见 :doc:`/fastNLP.io` 。
-
-Instance 和 DataSet
- fastNLP 中的 :class:`~fastNLP.DataSet` 类对象类似于二维表格,它的每一列是一个 :mod:`~fastNLP.core.field`
- 每一行是一个 :mod:`~fastNLP.core.instance` 。我们可以手动向数据集中添加 :class:`~fastNLP.Instance` 类的对象
-
- .. code-block:: python
-
- from fastNLP import Instance
-
- dataset.append(Instance(raw_sentence='fake data', label='0'))
-
- 此时的 ``dataset[-1]`` 的值如下,可以看到,数据集中的每个数据包含 ``raw_sentence`` 和 ``label`` 两个
- :mod:`~fastNLP.core.field` ,他们的类型都是 ``str`` ::
-
- {'raw_sentence': fake data type=str, 'label': 0 type=str}
-
-field 的修改
- 我们使用 :class:`~fastNLP.DataSet` 类的 :meth:`~fastNLP.DataSet.apply` 方法将 ``raw_sentence`` 中字母变成小写,并将句子分词。
- 同时也将 ``label`` :mod:`~fastNLP.core.field` 转化为整数并改名为 ``target``
-
- .. code-block:: python
-
- dataset.apply(lambda x: x['raw_sentence'].lower(), new_field_name='sentence')
- dataset.apply_field(lambda x: x.split(), field_name='sentence', new_field_name='words')
- dataset.apply(lambda x: int(x['label']), new_field_name='target')
-
- ``words`` 和 ``target`` 已经足够用于 :class:`~fastNLP.models.CNNText` 的训练了,但我们从其文档
- :class:`~fastNLP.models.CNNText` 中看到,在 :meth:`~fastNLP.models.CNNText.forward` 的时候,还可以传入可选参数 ``seq_len`` 。
- 所以,我们再使用 :meth:`~fastNLP.DataSet.apply_field` 方法增加一个名为 ``seq_len`` 的 :mod:`~fastNLP.core.field` 。
-
- .. code-block:: python
-
- dataset.apply_field(lambda x: len(x), field_name='words', new_field_name='seq_len')
-
- 观察可知: :meth:`~fastNLP.DataSet.apply_field` 与 :meth:`~fastNLP.DataSet.apply` 类似,
- 但所传入的 `lambda` 函数是针对一个 :class:`~fastNLP.Instance` 中的一个 :mod:`~fastNLP.core.field` 的;
- 而 :meth:`~fastNLP.DataSet.apply` 所传入的 `lambda` 函数是针对整个 :class:`~fastNLP.Instance` 的。
-
- .. note::
- `lambda` 函数即匿名函数,是 Python 的重要特性。 ``lambda x: len(x)`` 和下面的这个函数的作用相同::
-
- def func_lambda(x):
- return len(x)
-
- 你也可以编写复杂的函数做为 :meth:`~fastNLP.DataSet.apply_field` 与 :meth:`~fastNLP.DataSet.apply` 的参数
-
-Vocabulary 的使用
- 我们再用 :class:`~fastNLP.Vocabulary` 类来统计数据中出现的单词,并使用 :meth:`~fastNLP.Vocabularyindex_dataset`
- 将单词序列转化为训练可用的数字序列。
-
- .. code-block:: python
-
- from fastNLP import Vocabulary
-
- vocab = Vocabulary(min_freq=2).from_dataset(dataset, field_name='words')
- vocab.index_dataset(dataset, field_name='words',new_field_name='words')
-
-数据集分割
- 除了修改 :mod:`~fastNLP.core.field` 之外,我们还可以对 :class:`~fastNLP.DataSet` 进行分割,以供训练、开发和测试使用。
- 下面这段代码展示了 :meth:`~fastNLP.DataSet.split` 的使用方法(但实际应该放在后面两段改名和设置输入的代码之后)
-
- .. code-block:: python
-
- train_dev_data, test_data = dataset.split(0.1)
- train_data, dev_data = train_dev_data.split(0.1)
- len(train_data), len(dev_data), len(test_data)
-
----------------------
-使用内置模型训练
----------------------
-
-内置模型的输入输出命名
- fastNLP内置了一些完整的神经网络模型,详见 :doc:`/fastNLP.models` , 我们使用其中的 :class:`~fastNLP.models.CNNText` 模型进行训练。
- 为了使用内置的 :class:`~fastNLP.models.CNNText`,我们必须修改 :class:`~fastNLP.DataSet` 中 :mod:`~fastNLP.core.field` 的名称。
- 在这个例子中模型输入 (forward方法的参数) 为 ``words`` 和 ``seq_len`` ; 预测输出为 ``pred`` ;标准答案为 ``target`` 。
- 具体的命名规范可以参考 :doc:`/fastNLP.core.const` 。
-
- 如果不想查看文档,您也可以使用 :class:`~fastNLP.Const` 类进行命名。下面的代码展示了给 :class:`~fastNLP.DataSet` 中
- :mod:`~fastNLP.core.field` 改名的 :meth:`~fastNLP.DataSet.rename_field` 方法,以及 :class:`~fastNLP.Const` 类的使用方法。
-
- .. code-block:: python
-
- from fastNLP import Const
-
- dataset.rename_field('words', Const.INPUT)
- dataset.rename_field('seq_len', Const.INPUT_LEN)
- dataset.rename_field('target', Const.TARGET)
-
- 在给 :class:`~fastNLP.DataSet` 中 :mod:`~fastNLP.core.field` 改名后,我们还需要设置训练所需的输入和目标,这里使用的是
- :meth:`~fastNLP.DataSet.set_input` 和 :meth:`~fastNLP.DataSet.set_target` 两个函数。
-
- .. code-block:: python
-
- dataset.set_input(Const.INPUT, Const.INPUT_LEN)
- dataset.set_target(Const.TARGET)
-
-快速训练
- 现在我们可以导入 fastNLP 内置的文本分类模型 :class:`~fastNLP.models.CNNText` ,并使用 :class:`~fastNLP.Trainer` 进行训练了
- (其中 ``loss`` 和 ``metrics`` 的定义,我们将在后续两段代码中给出)。
-
- .. code-block:: python
-
- from fastNLP.models import CNNText
- from fastNLP import Trainer
-
- model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
-
- trainer = Trainer(model=model_cnn, train_data=train_data, dev_data=dev_data,
- loss=loss, metrics=metrics)
- trainer.train()
-
- 训练过程的输出如下::
-
- input fields after batch(if batch size is 2):
- words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
- target fields after batch(if batch size is 2):
- target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
-
- training epochs started 2019-05-09-10-59-39
- Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.333333
-
- Evaluation at Epoch 2/10. Step:4/20. AccuracyMetric: acc=0.533333
-
- Evaluation at Epoch 3/10. Step:6/20. AccuracyMetric: acc=0.533333
-
- Evaluation at Epoch 4/10. Step:8/20. AccuracyMetric: acc=0.533333
-
- Evaluation at Epoch 5/10. Step:10/20. AccuracyMetric: acc=0.6
-
- Evaluation at Epoch 6/10. Step:12/20. AccuracyMetric: acc=0.8
-
- Evaluation at Epoch 7/10. Step:14/20. AccuracyMetric: acc=0.8
-
- Evaluation at Epoch 8/10. Step:16/20. AccuracyMetric: acc=0.733333
-
- Evaluation at Epoch 9/10. Step:18/20. AccuracyMetric: acc=0.733333
-
- Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.733333
-
-
- In Epoch:6/Step:12, got best dev performance:AccuracyMetric: acc=0.8
- Reloaded the best model.
-
-损失函数
- 训练模型需要提供一个损失函数, 下面提供了一个在分类问题中常用的交叉熵损失。注意它的 **初始化参数** 。
- ``pred`` 参数对应的是模型的 forward 方法返回的 dict 中的一个 key 的名字。
- ``target`` 参数对应的是 :class:`~fastNLP.DataSet` 中作为标签的 :mod:`~fastNLP.core.field` 的名字。
- 这里我们用 :class:`~fastNLP.Const` 来辅助命名,如果你自己编写模型中 forward 方法的返回值或
- 数据集中 :mod:`~fastNLP.core.field` 的名字与本例不同, 你可以把 ``pred`` 参数和 ``target`` 参数设定符合自己代码的值。
-
- .. code-block:: python
-
- from fastNLP import CrossEntropyLoss
-
- # loss = CrossEntropyLoss() 在本例中与下面这行代码等价
- loss = CrossEntropyLoss(pred=Const.OUTPUT, target=Const.TARGET)
-
-评价指标
- 训练模型需要提供一个评价指标。这里使用准确率做为评价指标。参数的 `命名规则` 跟上面类似。
- ``pred`` 参数对应的是模型的 forward 方法返回的 dict 中的一个 key 的名字。
- ``target`` 参数对应的是 :class:`~fastNLP.DataSet` 中作为标签的 :mod:`~fastNLP.core.field` 的名字。
-
- .. code-block:: python
-
- from fastNLP import AccuracyMetric
-
- # metrics=AccuracyMetric() 在本例中与下面这行代码等价
- metrics=AccuracyMetric(pred=Const.OUTPUT, target=Const.TARGET)
-
-快速测试
- 与 :class:`~fastNLP.Trainer` 对应,fastNLP 也提供了 :class:`~fastNLP.Tester` 用于快速测试,用法如下
-
- .. code-block:: python
-
- from fastNLP import Tester
-
- tester = Tester(test_data, model_cnn, metrics=AccuracyMetric())
- tester.test()
-
----------------------
-编写自己的模型
----------------------
-
-因为 fastNLP 是基于 `PyTorch `_ 开发的框架,所以我们可以基于 PyTorch 模型编写自己的神经网络模型。
-与标准的 PyTorch 模型不同,fastNLP 模型中 forward 方法返回的是一个字典,字典中至少需要包含 "pred" 这个字段。
-而 forward 方法的参数名称必须与 :class:`~fastNLP.DataSet` 中用 :meth:`~fastNLP.DataSet.set_input` 设定的名称一致。
-模型定义的代码如下:
-
-.. code-block:: python
-
- import torch
- import torch.nn as nn
-
- class LSTMText(nn.Module):
- def __init__(self, vocab_size, embedding_dim, output_dim, hidden_dim=64, num_layers=2, dropout=0.5):
- super().__init__()
-
- self.embedding = nn.Embedding(vocab_size, embedding_dim)
- self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, bidirectional=True, dropout=dropout)
- self.fc = nn.Linear(hidden_dim * 2, output_dim)
- self.dropout = nn.Dropout(dropout)
-
- def forward(self, words):
- # (input) words : (batch_size, seq_len)
- words = words.permute(1,0)
- # words : (seq_len, batch_size)
-
- embedded = self.dropout(self.embedding(words))
- # embedded : (seq_len, batch_size, embedding_dim)
- output, (hidden, cell) = self.lstm(embedded)
- # output: (seq_len, batch_size, hidden_dim * 2)
- # hidden: (num_layers * 2, batch_size, hidden_dim)
- # cell: (num_layers * 2, batch_size, hidden_dim)
-
- hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
- hidden = self.dropout(hidden)
- # hidden: (batch_size, hidden_dim * 2)
-
- pred = self.fc(hidden.squeeze(0))
- # result: (batch_size, output_dim)
- return {"pred":pred}
-
-模型的使用方法与内置模型 :class:`~fastNLP.models.CNNText` 一致
-
-.. code-block:: python
-
- model_lstm = LSTMText(len(vocab),50,5)
-
- trainer = Trainer(model=model_lstm, train_data=train_data, dev_data=dev_data,
- loss=loss, metrics=metrics)
- trainer.train()
-
- tester = Tester(test_data, model_lstm, metrics=AccuracyMetric())
- tester.test()
-
-.. todo::
- 使用 :doc:`/fastNLP.modules` 编写模型
-
---------------------------
-自己编写训练过程
---------------------------
-
-如果你想用类似 PyTorch 的使用方法,自己编写训练过程,你可以参考下面这段代码。其中使用了 fastNLP 提供的 :class:`~fastNLP.Batch`
-来获得小批量训练的小批量数据,使用 :class:`~fastNLP.BucketSampler` 做为 :class:`~fastNLP.Batch` 的参数来选择采样的方式。
-这段代码中使用了 PyTorch 的 `torch.optim.Adam` 优化器 和 `torch.nn.CrossEntropyLoss` 损失函数,并自己计算了正确率
-
-.. code-block:: python
-
- from fastNLP import BucketSampler
- from fastNLP import Batch
- import torch
- import time
-
- model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
-
- def train(epoch, data):
- optim = torch.optim.Adam(model.parameters(), lr=0.001)
- lossfunc = torch.nn.CrossEntropyLoss()
- batch_size = 32
-
- train_sampler = BucketSampler(batch_size=batch_size, seq_len_field_name='seq_len')
- train_batch = Batch(batch_size=batch_size, dataset=data, sampler=train_sampler)
-
- start_time = time.time()
- for i in range(epoch):
- loss_list = []
- for batch_x, batch_y in train_batch:
- optim.zero_grad()
- output = model(batch_x['words'])
- loss = lossfunc(output['pred'], batch_y['target'])
- loss.backward()
- optim.step()
- loss_list.append(loss.item())
- print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=" ")
- print('{:d}ms'.format(round((time.time()-start_time)*1000)))
- loss_list.clear()
-
- train(10, train_data)
-
- tester = Tester(test_data, model, metrics=AccuracyMetric())
- tester.test()
-
-这段代码的输出如下::
-
- Epoch 0 Avg Loss: 2.76 17ms
- Epoch 1 Avg Loss: 2.55 29ms
- Epoch 2 Avg Loss: 2.37 41ms
- Epoch 3 Avg Loss: 2.30 53ms
- Epoch 4 Avg Loss: 2.12 65ms
- Epoch 5 Avg Loss: 2.16 76ms
- Epoch 6 Avg Loss: 1.88 88ms
- Epoch 7 Avg Loss: 1.84 99ms
- Epoch 8 Avg Loss: 1.71 111ms
- Epoch 9 Avg Loss: 1.62 122ms
- [tester]
- AccuracyMetric: acc=0.142857
-
-----------------------------------
-使用 Callback 增强 Trainer
-----------------------------------
-
-如果你不想自己实现繁琐的训练过程,只希望在训练过程中实现一些自己的功能(比如:输出从训练开始到当前 batch 结束的总时间),
-你可以使用 fastNLP 提供的 :class:`~fastNLP.Callback` 类。下面的例子中,我们继承 :class:`~fastNLP.Callback` 类实现了这个功能。
-
-.. code-block:: python
-
- from fastNLP import Callback
-
- start_time = time.time()
-
- class MyCallback(Callback):
- def on_epoch_end(self):
- print('Sum Time: {:d}ms\n\n'.format(round((time.time()-start_time)*1000)))
-
-
- model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
- trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
- loss=CrossEntropyLoss(), metrics=AccuracyMetric(), callbacks=[MyCallback()])
- trainer.train()
-
-训练输出如下::
-
- input fields after batch(if batch size is 2):
- words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 16])
- seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
- target fields after batch(if batch size is 2):
- target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
-
- training epochs started 2019-05-12-21-38-40
- Evaluation at Epoch 1/10. Step:2/20. AccuracyMetric: acc=0.285714
-
- Sum Time: 51ms
-
-
- …………………………
-
-
- Evaluation at Epoch 10/10. Step:20/20. AccuracyMetric: acc=0.857143
-
- Sum Time: 212ms
-
-
-
- In Epoch:10/Step:20, got best dev performance:AccuracyMetric: acc=0.857143
- Reloaded the best model.
-
-这个例子只是介绍了 :class:`~fastNLP.Callback` 类的使用方法。实际应用(比如:负采样、Learning Rate Decay、Early Stop 等)中
-很多功能已经被 fastNLP 实现了。你可以直接 import 它们使用,详细请查看文档 :doc:`/fastNLP.core.callback` 。
\ No newline at end of file
diff --git a/docs/source/user/tutorials.rst b/docs/source/user/tutorials.rst
new file mode 100644
index 00000000..196f9c29
--- /dev/null
+++ b/docs/source/user/tutorials.rst
@@ -0,0 +1,20 @@
+========================
+fastNLP 详细使用教程
+========================
+
+这里是更详细的使用教程。对于大部分的用户,我们建议你从第一篇开始顺序阅读;如果你只想了解其中的一部分,也可以进行选读。
+
+.. toctree::
+ :maxdepth: 1
+
+ 使用DataSet预处理文本
+ 使用DataSetLoader加载数据集
+ 使用Embedding模块将文本转成向量
+ 动手实现一个文本分类器I-使用Trainer和Tester快速训练和测试
+ 动手实现一个文本分类器II-使用DataSetIter实现自定义训练过程
+ 快速实现序列标注模型
+ 使用Modules和Models快速搭建自定义模型
+ 使用Metric快速评测你的模型
+ 使用Callback自定义你的训练过程
+ 使用fitlog 辅助 fastNLP 进行科研
+
diff --git a/fastNLP/__init__.py b/fastNLP/__init__.py
index c67e5919..ec192568 100644
--- a/fastNLP/__init__.py
+++ b/fastNLP/__init__.py
@@ -1,18 +1,23 @@
"""
-fastNLP 由 :mod:`~fastNLP.core` 、 :mod:`~fastNLP.io` 、:mod:`~fastNLP.modules`、:mod:`~fastNLP.models`
-等子模块组成,你可以点进去查看每个模块的文档。
+fastNLP 由 :mod:`~fastNLP.core` 、 :mod:`~fastNLP.io` 、:mod:`~fastNLP.embeddings` 、 :mod:`~fastNLP.modules`、
+:mod:`~fastNLP.models` 等子模块组成,你可以查看每个模块的文档。
- :mod:`~fastNLP.core` 是fastNLP 的核心模块,包括 DataSet、 Trainer、 Tester 等组件。详见文档 :doc:`/fastNLP.core`
- :mod:`~fastNLP.io` 是实现输入输出的模块,包括了数据集的读取,模型的存取等功能。详见文档 :doc:`/fastNLP.io`
+- :mod:`~fastNLP.embeddings` 提供用于构建复杂网络模型所需的各种embedding。详见文档 :doc:`/fastNLP.embeddings`
- :mod:`~fastNLP.modules` 包含了用于搭建神经网络模型的诸多组件,可以帮助用户快速搭建自己所需的网络。详见文档 :doc:`/fastNLP.modules`
-- :mod:`~fastNLP.models` 包含了一些使用 fastNLP 实现的完整网络模型,包括CNNText、SeqLabeling等常见模型。详见文档 :doc:`/fastNLP.models`
+- :mod:`~fastNLP.models` 包含了一些使用 fastNLP 实现的完整网络模型,包括 :class:`~fastNLP.models.CNNText` 、 :class:`~fastNLP.models.SeqLabeling` 等常见模型。详见文档 :doc:`fastNLP.models`
fastNLP 中最常用的组件可以直接从 fastNLP 包中 import ,他们的文档如下:
"""
__all__ = [
"Instance",
"FieldArray",
- "Batch",
+
+ "DataSetIter",
+ "BatchIter",
+ "TorchLoaderIter",
+
"Vocabulary",
"DataSet",
"Const",
@@ -33,7 +38,7 @@ __all__ = [
"AccuracyMetric",
"SpanFPreRecMetric",
- "SQuADMetric",
+ "ExtractiveQAMetric",
"Optimizer",
"SGD",
@@ -52,8 +57,10 @@ __all__ = [
"cache_results"
]
-__version__ = '0.4.0'
+__version__ = '0.4.5'
from .core import *
from . import models
from . import modules
+from . import embeddings
+from .io import data_loader
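
As the rewritten module docstring above notes, the most commonly used components can be imported straight from the `fastNLP` package. A minimal sketch of that workflow (the toy sentences are illustrative; only the exported names come from the `__all__` list in this diff):

```python
from fastNLP import DataSet, Instance, Vocabulary

# Build a tiny DataSet and index its words with a Vocabulary (toy data only).
ds = DataSet()
ds.append(Instance(raw_words="this is a test".split(), target=0))
ds.append(Instance(raw_words="another toy example".split(), target=1))

vocab = Vocabulary()
vocab.from_dataset(ds, field_name="raw_words")
ds.apply_field(lambda words: [vocab.to_index(w) for w in words],
               field_name="raw_words", new_field_name="words")
print(len(vocab), len(ds))
```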
diff --git a/fastNLP/core/__init__.py b/fastNLP/core/__init__.py
index d6ab8983..c9f51123 100644
--- a/fastNLP/core/__init__.py
+++ b/fastNLP/core/__init__.py
@@ -1,12 +1,12 @@
"""
core 模块里实现了 fastNLP 的核心框架,常用的功能都可以从 fastNLP 包中直接 import。当然你也同样可以从 core 模块的子模块中 import,
-例如 Batch 组件有两种 import 的方式::
+例如 :class:`~fastNLP.DataSetIter` 组件有两种 import 的方式::
# 直接从 fastNLP 中 import
- from fastNLP import Batch
+ from fastNLP import DataSetIter
- # 从 core 模块的子模块 batch 中 import
- from fastNLP.core.batch import Batch
+ # 从 core 模块的子模块 batch 中 import DataSetIter
+ from fastNLP.core.batch import DataSetIter
对于常用的功能,你只需要在 :doc:`fastNLP` 中查看即可。如果想了解各个子模块的具体作用,您可以在下面找到每个子模块的具体文档。
@@ -14,14 +14,14 @@ core 模块里实现了 fastNLP 的核心框架,常用的功能都可以从 fa
介绍core 的子模块的分工,好像必要性不大
"""
-from .batch import Batch
+from .batch import DataSetIter, BatchIter, TorchLoaderIter
from .callback import Callback, GradientClipCallback, EarlyStopCallback, TensorboardCallback, LRScheduler, ControlC
from .const import Const
from .dataset import DataSet
from .field import FieldArray, Padder, AutoPadder, EngChar2DPadder
from .instance import Instance
from .losses import LossFunc, CrossEntropyLoss, L1Loss, BCELoss, NLLLoss, LossInForward
-from .metrics import AccuracyMetric, SpanFPreRecMetric, SQuADMetric
+from .metrics import AccuracyMetric, SpanFPreRecMetric, ExtractiveQAMetric
from .optimizer import Optimizer, SGD, Adam
from .sampler import SequentialSampler, BucketSampler, RandomSampler, Sampler
from .tester import Tester
diff --git a/fastNLP/core/_parallel_utils.py b/fastNLP/core/_parallel_utils.py
new file mode 100644
index 00000000..4a7757d3
--- /dev/null
+++ b/fastNLP/core/_parallel_utils.py
@@ -0,0 +1,88 @@
+
+import threading
+import torch
+from torch.nn.parallel.parallel_apply import get_a_var
+
+from torch.nn.parallel.scatter_gather import scatter_kwargs, gather
+from torch.nn.parallel.replicate import replicate
+
+
+def parallel_apply(modules, func_name, inputs, kwargs_tup=None, devices=None):
+ r"""Applies each `module` in :attr:`modules` in parallel on arguments
+ contained in :attr:`inputs` (positional) and :attr:`kwargs_tup` (keyword)
+ on each of :attr:`devices`.
+
+ :attr:`modules`, :attr:`inputs`, :attr:`kwargs_tup` (if given), and
+ :attr:`devices` (if given) should all have same length. Moreover, each
+ element of :attr:`inputs` can either be a single object as the only argument
+ to a module, or a collection of positional arguments.
+ """
+ assert len(modules) == len(inputs)
+ if kwargs_tup is not None:
+ assert len(modules) == len(kwargs_tup)
+ else:
+ kwargs_tup = ({},) * len(modules)
+ if devices is not None:
+ assert len(modules) == len(devices)
+ else:
+ devices = [None] * len(modules)
+
+ lock = threading.Lock()
+ results = {}
+ grad_enabled = torch.is_grad_enabled()
+
+ def _worker(i, module, input, kwargs, device=None):
+ torch.set_grad_enabled(grad_enabled)
+ if device is None:
+ device = get_a_var(input).get_device()
+ try:
+ with torch.cuda.device(device):
+ # this also avoids accidental slicing of `input` if it is a Tensor
+ if not isinstance(input, (list, tuple)):
+ input = (input,)
+ output = getattr(module, func_name)(*input, **kwargs)
+ with lock:
+ results[i] = output
+ except Exception as e:
+ with lock:
+ results[i] = e
+
+ if len(modules) > 1:
+ threads = [threading.Thread(target=_worker,
+ args=(i, module, input, kwargs, device))
+ for i, (module, input, kwargs, device) in
+ enumerate(zip(modules, inputs, kwargs_tup, devices))]
+
+ for thread in threads:
+ thread.start()
+ for thread in threads:
+ thread.join()
+ else:
+ _worker(0, modules[0], inputs[0], kwargs_tup[0], devices[0])
+
+ outputs = []
+ for i in range(len(inputs)):
+ output = results[i]
+ if isinstance(output, Exception):
+ raise output
+ outputs.append(output)
+ return outputs
+
+
+def _data_parallel_wrapper(func_name, device_ids, output_device):
+ """
+ 这个函数是用于对需要多卡执行的函数的wrapper函数。参考的nn.DataParallel的forward函数
+
+ :param str, func_name: 对network中的这个函数进行多卡运行
+ :param device_ids: nn.DataParallel中的device_ids
+ :param output_device: nn.DataParallel中的output_device
+ :return:
+ """
+ def wrapper(network, *inputs, **kwargs):
+ inputs, kwargs = scatter_kwargs(inputs, kwargs, device_ids, dim=0)
+ if len(device_ids) == 1:
+ return getattr(network, func_name)(*inputs[0], **kwargs[0])
+ replicas = replicate(network, device_ids[:len(inputs)])
+ outputs = parallel_apply(replicas, func_name, inputs, kwargs, device_ids[:len(replicas)])
+ return gather(outputs, output_device)
+ return wrapper
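
To make the intent of `_data_parallel_wrapper` concrete, here is a minimal usage sketch. It assumes at least two CUDA devices and a toy model exposing a non-`forward` method named `predict`; everything except `_data_parallel_wrapper(func_name, device_ids, output_device)` itself is hypothetical:

```python
import torch
from torch import nn
from fastNLP.core._parallel_utils import _data_parallel_wrapper

class ToyModel(nn.Module):
    """Hypothetical model whose `predict` method we want to run on several GPUs."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def predict(self, x):
        return {'pred': self.fc(x).argmax(dim=-1)}

if torch.cuda.device_count() >= 2:
    model = ToyModel().to('cuda:0')
    x = torch.randn(16, 8, device='cuda:0')
    # Scatter the batch over GPUs 0 and 1, call `predict` on each replica, gather on GPU 0.
    parallel_predict = _data_parallel_wrapper('predict', device_ids=[0, 1], output_device=0)
    output = parallel_predict(model, x=x)
    print(output['pred'].shape)  # torch.Size([16])
```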
diff --git a/fastNLP/core/batch.py b/fastNLP/core/batch.py
index 109d4fe9..64c5f48e 100644
--- a/fastNLP/core/batch.py
+++ b/fastNLP/core/batch.py
@@ -1,19 +1,22 @@
"""
-batch 模块实现了 fastNLP 所需的 Batch 类。
+batch 模块实现了 fastNLP 所需的 :class:`~fastNLP.core.batch.DataSetIter` 类。
"""
__all__ = [
- "Batch"
+ "BatchIter",
+ "DataSetIter",
+ "TorchLoaderIter",
]
import atexit
-from queue import Empty, Full
import numpy as np
import torch
-import torch.multiprocessing as mp
+import torch.utils.data
+from numbers import Number
-from .sampler import RandomSampler
+from .sampler import SequentialSampler
+from .dataset import DataSet
_python_is_exit = False
@@ -26,160 +29,189 @@ def _set_python_is_exit():
atexit.register(_set_python_is_exit)
-class Batch(object):
- """
- 别名::class:`fastNLP.Batch` :class:`fastNLP.core.batch.Batch`
+class DataSetGetter:
+ def __init__(self, dataset: DataSet, as_numpy=False):
+ self.dataset = dataset
+ self.inputs = {n: f for n, f in dataset.get_all_fields().items() if f.is_input}
+ self.targets = {n: f for n, f in dataset.get_all_fields().items() if f.is_target}
+ self.as_numpy = as_numpy
+ self.idx_list = list(range(len(dataset)))
- Batch 用于从 `DataSet` 中按一定的顺序, 依次按 ``batch_size`` 的大小将数据取出,
+ def __getitem__(self, idx: int):
+ # mapping idx to sampled idx
+ idx = self.idx_list[idx]
+ inputs = {n:f.get(idx) for n, f in self.inputs.items()}
+ targets = {n:f.get(idx) for n, f in self.targets.items()}
+ return idx, inputs, targets
+
+ def __len__(self):
+ return len(self.dataset)
+
+ def collate_fn(self, batch: list):
+ # TODO 支持在DataSet中定义collate_fn,因为有时候可能需要不同的field之间融合,比如BERT的场景
+ batch_x = {n:[] for n in self.inputs.keys()}
+ batch_y = {n:[] for n in self.targets.keys()}
+ indices = []
+ for idx, x, y in batch:
+ indices.append(idx)
+ for n, v in x.items():
+ batch_x[n].append(v)
+ for n, v in y.items():
+ batch_y[n].append(v)
+
+ def pad_batch(batch_dict, field_array):
+ for n, vlist in batch_dict.items():
+ f = field_array[n]
+ if f.padder is None:
+ batch_dict[n] = np.array(vlist)
+ else:
+ data = f.pad(vlist)
+ if not self.as_numpy:
+ try:
+ data, flag = _to_tensor(data, f.dtype)
+ except TypeError as e:
+ print(f"Field {n} cannot be converted to torch.tensor.")
+ raise e
+ batch_dict[n] = data
+ return batch_dict
+
+ return (indices,
+ pad_batch(batch_x, self.inputs),
+ pad_batch(batch_y, self.targets))
+
+ def set_idx_list(self, idx_list):
+ if len(idx_list) != len(self.idx_list):
+ raise ValueError
+ self.idx_list = idx_list
+
+ def __getattr__(self, item):
+ if hasattr(self.dataset, item):
+ return getattr(self.dataset, item)
+ else:
+ raise AttributeError("'DataSetGetter' object has no attribute '{}'".format(item))
+
+
+class SamplerAdapter(torch.utils.data.Sampler):
+ def __init__(self, sampler, dataset):
+ self.sampler = sampler
+ self.dataset = dataset
+
+ def __iter__(self):
+ return iter(self.sampler(self.dataset))
+
+
+class BatchIter:
+ def __init__(self):
+ self.dataiter = None
+ self.num_batches = None
+ self.cur_batch_indices = None
+ self.batch_size = None
+
+ def init_iter(self):
+ pass
+
+ @staticmethod
+ def get_num_batches(num_samples, batch_size, drop_last):
+ num_batches = num_samples // batch_size
+ if not drop_last and (num_samples % batch_size > 0):
+ num_batches += 1
+ return num_batches
+
+ def __iter__(self):
+ self.init_iter()
+ for indices, batch_x, batch_y in self.dataiter:
+ self.cur_batch_indices = indices
+ yield batch_x, batch_y
+
+ def get_batch_indices(self):
+ return self.cur_batch_indices
+
+ def __len__(self):
+ return self.num_batches
+
+ @property
+ def dataset(self):
+ return self.dataiter.dataset
+
+
+class DataSetIter(BatchIter):
+ """
+ 别名::class:`fastNLP.DataSetIter` :class:`fastNLP.core.batch.DataSetIter`
+
+ DataSetIter 用于从 `DataSet` 中按一定的顺序, 依次按 ``batch_size`` 的大小将数据取出,
组成 `x` 和 `y`::
- batch = Batch(data_set, batch_size=16, sampler=SequentialSampler())
+ batch = DataSetIter(data_set, batch_size=16, sampler=SequentialSampler())
num_batch = len(batch)
for batch_x, batch_y in batch:
# do stuff ...
:param dataset: :class:`~fastNLP.DataSet` 对象, 数据集
:param int batch_size: 取出的batch大小
- :param sampler: 规定使用的 :class:`~fastNLP.Sampler` 方式. 若为 ``None`` , 使用 :class:`~fastNLP.RandomSampler`.
-
+ :param sampler: 规定使用的 :class:`~fastNLP.Sampler` 方式. 若为 ``None`` , 使用 :class:`~fastNLP.SequentialSampler`.
+
Default: ``None``
:param bool as_numpy: 若为 ``True`` , 输出batch为 numpy.array. 否则为 :class:`torch.Tensor`.
-
- Default: ``False``
- :param bool prefetch: 若为 ``True`` 使用多进程预先取出下一batch.
-
+
Default: ``False``
+ :param int num_workers: 使用多少个进程来预处理数据
+ :param bool pin_memory: 是否将产生的tensor使用pin memory, 可能会加快速度。
+ :param bool drop_last: 如果最后一个batch没有batch_size这么多sample,就扔掉最后一个
+ :param timeout:
+ :param worker_init_fn: 在每个worker启动时调用该函数,会传入一个值,该值是worker的index。
"""
-
- def __init__(self, dataset, batch_size, sampler=None, as_numpy=False, prefetch=False):
- self.dataset = dataset
+ def __init__(self, dataset, batch_size=1, sampler=None, as_numpy=False,
+ num_workers=0, pin_memory=False, drop_last=False,
+ timeout=0, worker_init_fn=None):
+ super().__init__()
+ assert isinstance(dataset, DataSet)
+ sampler = SamplerAdapter(sampler=sampler or SequentialSampler(), dataset=dataset)
+ dataset = DataSetGetter(dataset, as_numpy)
+ collate_fn = dataset.collate_fn if hasattr(dataset, 'collate_fn') else None
+ self.dataiter = torch.utils.data.DataLoader(
+ dataset=dataset, batch_size=batch_size, sampler=sampler,
+ collate_fn=collate_fn, num_workers=num_workers,
+ pin_memory=pin_memory, drop_last=drop_last,
+ timeout=timeout, worker_init_fn=worker_init_fn)
+ self.num_batches = self.get_num_batches(len(dataset), batch_size, drop_last)
self.batch_size = batch_size
- if sampler is None:
- sampler = RandomSampler()
- self.sampler = sampler
- self.as_numpy = as_numpy
- self.idx_list = None
- self.curidx = 0
- self.num_batches = len(dataset) // batch_size + int(len(dataset) % batch_size != 0)
- self.cur_batch_indices = None
- self.prefetch = prefetch
- self.lengths = 0
-
- def fetch_one(self):
- if self.curidx >= len(self.idx_list):
- return None
- else:
- endidx = min(self.curidx + self.batch_size, len(self.idx_list))
- batch_x, batch_y = {}, {}
-
- indices = self.idx_list[self.curidx:endidx]
- self.cur_batch_indices = indices
-
- for field_name, field in self.dataset.get_all_fields().items():
- if field.is_target or field.is_input:
- batch = field.get(indices)
- if not self.as_numpy and field.padder is not None:
- batch = _to_tensor(batch, field.dtype)
- if field.is_target:
- batch_y[field_name] = batch
- if field.is_input:
- batch_x[field_name] = batch
-
- self.curidx = endidx
- return batch_x, batch_y
-
- def __iter__(self):
- """
- Iterate on dataset, fetch batch data. Fetch process don't block the iterate process
- :return:
- """
- if self.prefetch:
- return self._run_batch_iter(self)
-
- def batch_iter():
- self.init_iter()
- while 1:
- res = self.fetch_one()
- if res is None:
- break
- yield res
-
- return batch_iter()
-
- def init_iter(self):
- self.idx_list = self.sampler(self.dataset)
- self.curidx = 0
- self.lengths = self.dataset.get_length()
-
- def __len__(self):
- return self.num_batches
-
- def get_batch_indices(self):
- """
- 取得当前batch在DataSet中所在的index下标序列
-
- :return list(int) indexes: 下标序列
- """
- return self.cur_batch_indices
-
- @staticmethod
- def _run_fetch(batch, q):
- try:
- global _python_is_exit
- batch.init_iter()
- # print('start fetch')
- while 1:
- res = batch.fetch_one()
- # print('fetch one')
- while 1:
- try:
- q.put(res, timeout=3)
- break
- except Full:
- if _python_is_exit:
- return
- if res is None:
- # print('fetch done, waiting processing')
- break
- # print('fetch exit')
- except Exception as e:
- q.put(e)
- finally:
- q.join()
-
- @staticmethod
- def _run_batch_iter(batch):
- q = mp.JoinableQueue(maxsize=10)
- fetch_p = mp.Process(target=Batch._run_fetch, args=(batch, q))
- fetch_p.daemon = True
- fetch_p.start()
- # print('fork fetch process')
- while 1:
- try:
- res = q.get(timeout=1)
- q.task_done()
- # print('get fetched')
- if res is None:
- break
- elif isinstance(res, Exception):
- raise res
- yield res
- except Empty as e:
- if fetch_p.is_alive():
- continue
- else:
- break
- fetch_p.terminate()
- fetch_p.join()
- # print('iter done')
-def _to_tensor(batch, dtype):
+class TorchLoaderIter(BatchIter):
+ def __init__(self, dataset):
+ super().__init__()
+ assert isinstance(dataset, torch.utils.data.DataLoader)
+ self.dataiter = dataset
+ self.num_batches = self.get_num_batches(len(dataset), dataset.batch_size, dataset.drop_last)
+ self.batch_size = dataset.batch_size
+
+
+class OnlineDataGetter:
+ # TODO
+ pass
+
+
+class OnlineDataIter(BatchIter):
+ # TODO
+ def __init__(self, dataset, batch_size=1, buffer_size=10000, sampler=None, as_numpy=False,
+ num_workers=0, pin_memory=False, drop_last=False,
+ timeout=0, worker_init_fn=None, **kwargs):
+ super().__init__()
+
+
+def _to_tensor(batch, field_dtype):
try:
- if dtype in (int, np.int8, np.int16, np.int32, np.int64):
- batch = torch.LongTensor(batch)
- if dtype in (float, np.float32, np.float64):
- batch = torch.FloatTensor(batch)
- except:
- pass
- return batch
+ if field_dtype is not None and isinstance(field_dtype, type)\
+ and issubclass(field_dtype, Number) \
+ and not isinstance(batch, torch.Tensor):
+ if issubclass(batch.dtype.type, np.floating):
+ new_batch = torch.as_tensor(batch).float() # 默认使用float32
+ elif issubclass(batch.dtype.type, np.integer):
+ new_batch = torch.as_tensor(batch).long() # 复用内存地址,避免复制
+ else:
+ new_batch = torch.as_tensor(batch)
+ return new_batch, True
+ else:
+ return batch, False
+ except Exception as e:
+ raise e
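
Compared with the removed `Batch` class, `DataSetIter` delegates batching to `torch.utils.data.DataLoader`, which is what makes options such as `num_workers`, `pin_memory` and `drop_last` available. A minimal sketch of iterating a `DataSet` with it (toy data; the constructor arguments are the ones added in this diff):

```python
from fastNLP import DataSet, DataSetIter, SequentialSampler

ds = DataSet({'words': [[1, 2, 3], [4, 5], [6]], 'target': [0, 1, 0]})
ds.set_input('words')
ds.set_target('target')

batch_iter = DataSetIter(ds, batch_size=2, sampler=SequentialSampler(), num_workers=0)
for batch_x, batch_y in batch_iter:
    # 'words' arrives as a padded LongTensor, 'target' as a LongTensor
    print(batch_x['words'].shape, batch_y['target'])
```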
diff --git a/fastNLP/core/callback.py b/fastNLP/core/callback.py
index 483f6dc1..6f855397 100644
--- a/fastNLP/core/callback.py
+++ b/fastNLP/core/callback.py
@@ -2,11 +2,11 @@ r"""
callback模块实现了 fastNLP 中的许多 callback 类,用于增强 :class:`~fastNLP.Trainer` 类。
虽然Trainer本身已经集成了一些功能,但仍然不足以囊括训练过程中可能需要到的功能,
-比如负采样,learning rate decay, Early Stop等。
-为了解决这个问题fastNLP引入了callback的机制,Callback 是一种在Trainer训练过程中特定阶段会运行的函数集合。
-关于Trainer的详细文档,请参见 :doc:`trainer 模块`
+比如负采样,learning rate decay 和 early stop等。
+为了解决这个问题,fastNLP引入了callback的机制,:class:`~fastNLP.Callback` 是一种在Trainer训练过程中特定阶段会运行的函数集合。
+关于 :class:`~fastNLP.Trainer` 的详细文档,请参见 :doc:`trainer 模块`
-我们将 :meth:`~fastNLP.Train.train` 这个函数内部分为以下的阶段,在对应阶段会触发相应的调用::
+我们将 :meth:`~fastNLP.Trainer.train` 这个函数内部分为以下的阶段,在对应阶段会触发相应的调用::
callback.on_train_begin() # 开始进行训练
for i in range(1, n_epochs+1):
@@ -31,8 +31,8 @@ callback模块实现了 fastNLP 中的许多 callback 类,用于增强 :class:
callback.on_train_end() # 训练结束
callback.on_exception() # 这是一个特殊的步骤,在训练过程中遭遇exception会跳转到这里。
-如下面的例子所示,我们可以使用内置的 callback 类,或者继承 :class:`~fastNLP.core.callback.Callback`
-定义自己的 callback 类::
+如下面的例子所示,我们可以使用内置的 callback 组件,或者继承 :class:`~fastNLP.core.callback.Callback`
+定义自己的 callback 组件::
from fastNLP import Callback, EarlyStopCallback, Trainer, CrossEntropyLoss, AccuracyMetric
from fastNLP.models import CNNText
@@ -66,6 +66,8 @@ import os
import torch
from copy import deepcopy
+import sys
+from .utils import _save_model
try:
from tensorboardX import SummaryWriter
@@ -113,7 +115,7 @@ class Callback(object):
@property
def n_steps(self):
- """Trainer一共会运行多少步"""
+ """Trainer一共会采多少个batch。当Trainer中update_every设置为非1的值时,该值不等于update的次数"""
return self._trainer.n_steps
@property
@@ -181,7 +183,7 @@ class Callback(object):
:param dict batch_x: DataSet中被设置为input的field的batch。
:param dict batch_y: DataSet中被设置为target的field的batch。
:param list(int) indices: 这次采样使用到的indices,可以通过DataSet[indices]获取出这个batch采出的Instance,在一些
- 情况下可以帮助定位是哪个Sample导致了错误。仅在Trainer的prefetch为False时可用。
+ 情况下可以帮助定位是哪个Sample导致了错误。仅当num_workers=0时有效。
:return:
"""
pass
@@ -399,10 +401,11 @@ class GradientClipCallback(Callback):
self.clip_value = clip_value
def on_backward_end(self):
- if self.parameters is None:
- self.clip_fun(self.model.parameters(), self.clip_value)
- else:
- self.clip_fun(self.parameters, self.clip_value)
+ if self.step%self.update_every==0:
+ if self.parameters is None:
+ self.clip_fun(self.model.parameters(), self.clip_value)
+ else:
+ self.clip_fun(self.parameters, self.clip_value)
class EarlyStopCallback(Callback):
@@ -445,10 +448,10 @@ class FitlogCallback(Callback):
并将验证结果写入到fitlog中。这些数据集的结果是根据dev上最好的结果报道的,即如果dev在第3个epoch取得了最佳,则
fitlog中记录的关于这些数据集的结果就是来自第三个epoch的结果。
- :param DataSet,dict(DataSet) data: 传入DataSet对象,会使用多个Trainer中的metric对数据进行验证。如果需要传入多个
+ :param ~fastNLP.DataSet,Dict[~fastNLP.DataSet] data: 传入DataSet对象,会使用多个Trainer中的metric对数据进行验证。如果需要传入多个
DataSet请通过dict的方式传入,dict的key将作为对应dataset的name传递给fitlog。若tester不为None时,data需要通过
dict的方式传入。如果仅传入DataSet, 则被命名为test
- :param Tester tester: Tester对象,将在on_valid_end时调用。tester中的DataSet会被称为为`test`
+    :param ~fastNLP.Tester tester: Tester对象,将在on_valid_end时调用。tester中的DataSet会被称为 `test`
:param int log_loss_every: 多少个step记录一次loss(记录的是这几个batch的loss平均值),如果数据集较大建议将该值设置得
大一些,不然会导致log文件巨大。默认为0, 即不要记录loss。
:param int verbose: 是否在终端打印evaluation的结果,0不打印。
@@ -548,7 +551,7 @@ class LRScheduler(Callback):
else:
raise ValueError(f"Expect torch.optim.lr_scheduler for LRScheduler. Got {type(lr_scheduler)}.")
- def on_epoch_begin(self):
+ def on_epoch_end(self):
self.scheduler.step(self.epoch)
@@ -671,7 +674,7 @@ class TensorboardCallback(Callback):
.. warning::
fastNLP 已停止对此功能的维护,请等待 fastNLP 兼容 PyTorch1.1 的下一个版本。
- 或者使用和 fastNLP 高度配合的 fitlog(参见 :doc:`/user/with_fitlog` )。
+ 或者使用和 fastNLP 高度配合的 fitlog(参见 :doc:`/tutorials/tutorial_10_fitlog` )。
"""
@@ -736,6 +739,132 @@ class TensorboardCallback(Callback):
del self._summary_writer
+class WarmupCallback(Callback):
+ """
+ 按一定的周期调节Learning rate的大小。
+
+ :param int,float warmup: 如果warmup为int,则在该step之前,learning rate根据schedule的策略变化; 如果warmup为float,
+ 如0.1, 则前10%的step是按照schedule策略调整learning rate。
+ :param str schedule: 以哪种方式调整。linear: 前warmup的step上升到指定的learning rate(从Trainer中的optimizer处获取的), 后
+        warmup的step下降到0; constant: 前warmup的step上升到指定learning rate,后面的step保持该learning rate不变。
+ """
+ def __init__(self, warmup=0.1, schedule='constant'):
+ super().__init__()
+ self.warmup = max(warmup, 0.)
+
+ self.initial_lrs = [] # 存放param_group的learning rate
+ if schedule == 'constant':
+ self.get_lr = self._get_constant_lr
+ elif schedule == 'linear':
+ self.get_lr = self._get_linear_lr
+ else:
+ raise RuntimeError("Only support 'linear', 'constant'.")
+
+    def _get_constant_lr(self, progress):
+        if progress < self.warmup:
+            return progress / self.warmup
+        return 1
+
+    def _get_linear_lr(self, progress):
+        if progress < self.warmup:
+            return progress / self.warmup
+        return max((progress - 1.) / (self.warmup - 1.), 0.)
+
+    def on_train_begin(self):
+        # 估计总的step数,用于将warmup的比例换算成step
+        self.t_steps = (len(self.trainer.train_data) // (self.batch_size * self.update_every) +
+                        int(len(self.trainer.train_data) % (self.batch_size * self.update_every) != 0)) * self.n_epochs
+        if self.warmup > 1:
+            self.warmup = self.warmup / self.t_steps
+        self.t_steps = max(2, self.t_steps)  # 不能小于2
+ # 获取param_group的初始learning rate
+ for group in self.optimizer.param_groups:
+ self.initial_lrs.append(group['lr'])
+
+ def on_backward_end(self):
+ if self.step%self.update_every==0:
+ progress = (self.step/self.update_every)/self.t_steps
+ for lr, group in zip(self.initial_lrs, self.optimizer.param_groups):
+ group['lr'] = lr * self.get_lr(progress)
+
+
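
A short sketch of attaching the `WarmupCallback` above to a `Trainer`; `model`, `train_data` and `dev_data` are placeholders assumed to exist, only the callback arguments come from the docstring in this diff:

```python
from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric
from fastNLP.core.callback import WarmupCallback

# Ramp the learning rate up over the first 10% of steps, then decay linearly to 0.
warmup_cb = WarmupCallback(warmup=0.1, schedule='linear')
trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
                  loss=CrossEntropyLoss(), metrics=AccuracyMetric(),
                  callbacks=[warmup_cb])
trainer.train()
```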
+class SaveModelCallback(Callback):
+ """
+ 由于Trainer在训练过程中只会保存最佳的模型, 该callback可实现多种方式的结果存储。
+ 会根据训练开始的时间戳在save_dir下建立文件夹,再在文件夹下存放多个模型
+ -save_dir
+ -2019-07-03-15-06-36
+ -epoch:0_step:20_{metric_key}:{evaluate_performance}.pt # metric是给定的metric_key, evaluate_performance是性能
+ -epoch:1_step:40_{metric_key}:{evaluate_performance}.pt
+ -2019-07-03-15-10-00
+            -epoch:0_step:20_{metric_key}:{evaluate_performance}.pt   # metric是给定的metric_key, evaluate_performance是性能
+ :param str save_dir: 将模型存放在哪个目录下,会在该目录下创建以时间戳命名的目录,并存放模型
+ :param int top: 保存dev表现top多少模型。-1为保存所有模型。
+    :param bool only_param: 是否只保存模型的权重。
+ :param save_on_exception: 发生exception时,是否保存一份发生exception的模型。模型名称为epoch:x_step:x_Exception:{exception_name}.
+ """
+ def __init__(self, save_dir, top=3, only_param=False, save_on_exception=False):
+ super().__init__()
+
+ if not os.path.isdir(save_dir):
+            raise NotADirectoryError("{} is not a directory.".format(save_dir))
+ self.save_dir = save_dir
+ if top < 0:
+ self.top = sys.maxsize
+ else:
+ self.top = top
+ self._ordered_save_models = [] # List[Tuple], Tuple[0]是metric, Tuple[1]是path。metric是依次变好的,所以从头删
+
+ self.only_param = only_param
+ self.save_on_exception = save_on_exception
+
+ def on_train_begin(self):
+ self.save_dir = os.path.join(self.save_dir, self.trainer.start_time)
+
+ def on_valid_end(self, eval_result, metric_key, optimizer, is_better_eval):
+ metric_value = list(eval_result.values())[0][metric_key]
+ self._save_this_model(metric_value)
+
+ def _insert_into_ordered_save_models(self, pair):
+ # pair:(metric_value, model_name)
+ # 返回save的模型pair与删除的模型pair. pair中第一个元素是metric的值,第二个元素是模型的名称
+ index = -1
+ for _pair in self._ordered_save_models:
+ if _pair[0]>=pair[0] and self.trainer.increase_better:
+ break
+ if not self.trainer.increase_better and _pair[0]<=pair[0]:
+ break
+ index += 1
+ save_pair = None
+        if len(self._ordered_save_models)<self.top or (len(self._ordered_save_models)>=self.top and index!=-1):
+ save_pair = pair
+ self._ordered_save_models.insert(index+1, pair)
+ delete_pair = None
+ if len(self._ordered_save_models)>self.top:
+ delete_pair = self._ordered_save_models.pop(0)
+ return save_pair, delete_pair
+
+ def _save_this_model(self, metric_value):
+ name = "epoch:{}_step:{}_{}:{:.6f}.pt".format(self.epoch, self.step, self.trainer.metric_key, metric_value)
+ save_pair, delete_pair = self._insert_into_ordered_save_models((metric_value, name))
+ if save_pair:
+ try:
+ _save_model(self.model, model_name=name, save_dir=self.save_dir, only_param=self.only_param)
+ except Exception as e:
+ print(f"The following exception:{e} happens when save model to {self.save_dir}.")
+ if delete_pair:
+ try:
+ delete_model_path = os.path.join(self.save_dir, delete_pair[1])
+ if os.path.exists(delete_model_path):
+ os.remove(delete_model_path)
+ except Exception as e:
+ print(f"Fail to delete model {name} at {self.save_dir} caused by exception:{e}.")
+
+ def on_exception(self, exception):
+ if self.save_on_exception:
+ name = "epoch:{}_step:{}_Exception:{}.pt".format(self.epoch, self.step, exception.__class__.__name__)
+ _save_model(self.model, model_name=name, save_dir=self.save_dir, only_param=self.only_param)
+
+
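
Similarly, a sketch of plugging `SaveModelCallback` into a `Trainer` so that the three best checkpoints on dev are kept under `save_dir` (the directory name and the surrounding Trainer objects are illustrative placeholders):

```python
import os
from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric
from fastNLP.core.callback import SaveModelCallback

save_dir = './checkpoints'
os.makedirs(save_dir, exist_ok=True)  # the callback expects an existing directory

# Keep the 3 best models by the dev metric, and dump a snapshot if training crashes.
save_cb = SaveModelCallback(save_dir=save_dir, top=3, save_on_exception=True)
trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data,
                  loss=CrossEntropyLoss(), metrics=AccuracyMetric(),
                  callbacks=[save_cb])
trainer.train()
```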
class CallbackException(BaseException):
"""
当需要通过callback跳出训练的时候可以通过抛出CallbackException并在on_exception中捕获这个值。
diff --git a/fastNLP/core/dataset.py b/fastNLP/core/dataset.py
index 9f24adf2..7b7fa87a 100644
--- a/fastNLP/core/dataset.py
+++ b/fastNLP/core/dataset.py
@@ -1,7 +1,7 @@
"""
:class:`~fastNLP.core.dataset.DataSet` 是fastNLP中用于承载数据的容器。可以将DataSet看做是一个表格,
-每一行是一个sample (在fastNLP中被称为 :mod:`~.instance` ),
-每一列是一个feature (在fastNLP中称为 :mod:`.field` )。
+每一行是一个sample (在fastNLP中被称为 :mod:`~fastNLP.core.instance` ),
+每一列是一个feature (在fastNLP中称为 :mod:`~fastNLP.core.field` )。
.. csv-table:: Following is a demo layout of DataSet
:header: "sentence", "words", "seq_len"
@@ -13,57 +13,64 @@
在fastNLP内部每一行是一个 :class:`~fastNLP.Instance` 对象; 每一列是一个 :class:`~fastNLP.FieldArray` 对象。
-1 DataSet的创建
- 创建DataSet主要有以下的3种方式
+----------------------------
+1.DataSet的创建
+----------------------------
+
+创建DataSet主要有以下的3种方式
1.1 传入dict
+----------------------------
- Example::
+ .. code-block::
- from fastNLP import DataSet
- data = {'sentence':["This is the first instance .", "Second instance .", "Third instance ."],
- 'words': [['this', 'is', 'the', 'first', 'instance', '.'], ['Second', 'instance', '.'], ['Third', 'instance', '.'],
- 'seq_len': [6, 3, 3]}
- dataset = DataSet(data)
- # 传入的dict的每个key的value应该为具有相同长度的list
+ from fastNLP import DataSet
+ data = {'sentence':["This is the first instance .", "Second instance .", "Third instance ."],
+ 'words': [['this', 'is', 'the', 'first', 'instance', '.'], ['Second', 'instance', '.'], ['Third', 'instance', '.'],
+ 'seq_len': [6, 3, 3]}
+ dataset = DataSet(data)
+ # 传入的dict的每个key的value应该为具有相同长度的list
-1.2 通过构建Instance
+1.2 通过 Instance 构建
+----------------------------
- Example::
+ .. code-block::
- from fastNLP import DataSet
- from fastNLP import Instance
- dataset = DataSet()
- instance = Instance(sentence="This is the first instance",
- words=['this', 'is', 'the', 'first', 'instance', '.'],
- seq_len=6)
- dataset.append(instance)
- # 可以继续append更多内容,但是append的instance应该和第一个instance拥有完全相同的field
+ from fastNLP import DataSet
+ from fastNLP import Instance
+ dataset = DataSet()
+ instance = Instance(sentence="This is the first instance",
+ words=['this', 'is', 'the', 'first', 'instance', '.'],
+ seq_len=6)
+ dataset.append(instance)
+ # 可以继续append更多内容,但是append的instance应该和第一个instance拥有完全相同的field
-1.3 通过list(Instance)
+1.3 通过 List[Instance] 构建
+--------------------------------------
- Example::
+ .. code-block::
- from fastNLP import DataSet
- from fastNLP import Instance
- instances = []
- instances.append(Instance(sentence="This is the first instance",
- words=['this', 'is', 'the', 'first', 'instance', '.'],
- seq_len=6))
- instances.append(Instance(sentence="Second instance .",
- words=['Second', 'instance', '.'],
- seq_len=3))
- dataset = DataSet(instances)
+ from fastNLP import DataSet
+ from fastNLP import Instance
+ instances = []
+    instances.append(Instance(sentence="This is the first instance",
+        words=['this', 'is', 'the', 'first', 'instance', '.'],
+ seq_len=6))
+ instances.append(Instance(sentence="Second instance .",
+ words=['Second', 'instance', '.'],
+ seq_len=3))
+ dataset = DataSet(instances)
+
+--------------------------------------
+2.DataSet与预处理
+--------------------------------------
-2 DataSet与预处理
- 常见的预处理有如下几种
+常见的预处理有如下几种
-2.1 从某个文本文件读取内容 #
+2.1 从某个文本文件读取内容
+--------------------------------------
- .. todo::
- 引用DataLoader
-
- Example::
+ .. code-block::
from fastNLP import DataSet
from fastNLP import Instance
@@ -78,21 +85,13 @@
sent, label = line.strip().split('\t')
dataset.append(Instance(sentence=sent, label=label))
-2.2 index, 返回结果为对DataSet对象的浅拷贝
+ .. note::
+ 直接读取特定数据集的数据请参考 :doc:`/tutorials/tutorial_2_load_dataset`
- Example::
+2.2 对DataSet中的内容处理
+--------------------------------------
- import numpy as np
- from fastNLP import DataSet
- dataset = DataSet({'a': np.arange(10), 'b': [[_] for _ in range(10)]})
- d[0] # 使用一个下标获取一个instance
- >>{'a': 0 type=int,'b': [2] type=list} # 得到一个instance
- d[1:3] # 使用slice获取一个新的DataSet
- >>DataSet({'a': 1 type=int, 'b': [2] type=list}, {'a': 2 type=int, 'b': [2] type=list})
-
-2.3 对DataSet中的内容处理
-
- Example::
+ .. code-block::
from fastNLP import DataSet
data = {'sentence':["This is the first instance .", "Second instance .", "Third instance ."]}
@@ -108,9 +107,10 @@
return words
dataset.apply(get_words, new_field_name='words')
-2.4 删除DataSet的内容
+2.3 删除DataSet的内容
+--------------------------------------
- Example::
+ .. code-block::
from fastNLP import DataSet
dataset = DataSet({'a': list(range(-5, 5))})
@@ -124,16 +124,18 @@
dataset.delete_field('a')
-2.5 遍历DataSet的内容
+2.4 遍历DataSet的内容
+--------------------------------------
- Example::
+ .. code-block::
for instance in dataset:
# do something
-2.6 一些其它操作
+2.5 一些其它操作
+--------------------------------------
- Example::
+ .. code-block::
# 检查是否存在名为'a'的field
dataset.has_field('a') # 或 ('a' in dataset)
@@ -141,21 +143,25 @@
dataset.rename_field('a', 'b')
# DataSet的长度
len(dataset)
+
+--------------------------------------
+3.DataSet与自然语言处理(NLP)
+--------------------------------------
-3 DataSet与自然语言处理(NLP)
- 在目前深度学习的模型中,大都依赖于随机梯度下降法(SGD)进行模型的优化。随机梯度下降需要将数据切分成一个一个的Batch,
- 一个Batch进行一次前向计算(forward)与梯度后向传播(backward)。在自然语言处理的场景下,往往还需要对数据进行pad。这是
- 由于句子的长度一般是不同的,但是一次Batch中的每个field都必须是一个tensor,所以需要将所有句子都补齐到相同的长度。
+在目前深度学习的模型中,大都依赖于随机梯度下降法(SGD)进行模型的优化。随机梯度下降需要将数据切分成一个个的 batch,
+一个batch进行一次前向计算(forward)与梯度后向传播(backward)。在自然语言处理的场景下,往往还需要对数据进行pad。这是
+由于句子的长度一般是不同的,但是一次batch中的每个field都必须是一个tensor,所以需要将所有句子都补齐到相同的长度。
-3.1 DataSet与Batch
+3.1 DataSet与DataSetIter
+--------------------------------------
- 我们先看fastNLP中如何将数据分成一个一个的Batch的例子, 这里我们使用随机生成的数据来模拟一个二分类文本分类任务,
+ 我们先看fastNLP中如何将数据分成一个一个的batch的例子, 这里我们使用随机生成的数据来模拟一个二分类文本分类任务,
words和characters是输入,labels是文本类别
- Example::
+ .. code-block::
from fastNLP import DataSet
- from fastNLP import Batch
+ from fastNLP import DataSetIter
from fastNLP import SequentialSampler
from fastNLP import EngChar2DPadder
@@ -175,7 +181,7 @@
d.set_target('label')
d.set_input('words', 'chars')
- for batch_x, batch_y in Batch(d, sampler=SequentialSampler(), batch_size=2):
+ for batch_x, batch_y in DataSetIter(d, sampler=SequentialSampler(), batch_size=2):
print("batch_x:", batch_x)
print("batch_y:", batch_y)
break
@@ -194,23 +200,26 @@
# [ 0, 0, 0, 0, 0]]])}
# {'label': tensor([0, 0])}
- 其中 :class:`~fastNLP.Batch` 是用于从DataSet中按照batch_size为大小取出batch的迭代器,
- :class:`~fastNLP.SequentialSampler` 用于指示 Batch 以怎样的
+ 其中 :class:`~fastNLP.DataSetIter` 是用于从DataSet中按照batch_size为大小取出batch的迭代器,
+ :class:`~fastNLP.SequentialSampler` 用于指示 :class:`~fastNLP.DataSetIter` 以怎样的
顺序从DataSet中取出instance以组成一个batch,
- 更详细的说明请参照 :class:`~fastNLP.Batch` 和 :class:`~fastNLP.SequentialSampler` 文档。
+ 更详细的说明请参照 :class:`~fastNLP.DataSetIter` 和 :class:`~fastNLP.SequentialSampler` 文档。
- 通过DataSet.set_input('words', 'chars'), fastNLP将认为'words'和'chars'这两个field都是input,并将它们都放入迭代器
- 生成的第一个dict中; DataSet.set_target('labels'), fastNLP将认为'labels'这个field是target,并将其放入到迭代器的第
+ 通过 ``DataSet.set_input('words', 'chars')`` , fastNLP将认为 `words` 和 `chars` 这两个field都是input,并将它们都放入迭代器
+ 生成的第一个dict中; ``DataSet.set_target('labels')`` , fastNLP将认为 `labels` 这个field是target,并将其放入到迭代器的第
二个dict中。如上例中所打印结果。分为input和target的原因是由于它们在被 :class:`~fastNLP.Trainer` 所使用时会有所差异,
详见 :class:`~fastNLP.Trainer`
- 当把某个field设置为'target'或者'input'的时候(两者不是互斥的,可以同时设为input和target),fastNLP不仅仅只是将其放
- 置到不同的dict中,而还会对被设置为input或target的field进行类型检查。类型检查的目的是为了看能否把该field转为
- pytorch的torch.LongTensor或torch.FloatTensor类型(也可以在Batch中设置输出numpy类型,参考 :class:`~fastNLP.Batch` ),如上例所示,
- fastNLP已将words,chars和label转为了Tensor类型。如果field在每个instance都拥有相同的维度(不能超过两维),且最内层
- 的元素都为相同的type(int, float, np.int*, np.float*),则fastNLP默认将对该field进行pad。也支持全为str的field作为
- target和input,这种情况下,fastNLP默认不进行pad。另外,当某个field已经被设置为了target或者input后,之后append的
- instance对应的field必须要和前面已有的内容一致,否则会报错。
+ 当把某个field设置为 `target` 或者 `input` 的时候(两者不是互斥的,可以同时设为两种),fastNLP不仅仅只是将其放
+ 置到不同的dict中,而还会对被设置为 `input` 或 `target` 的 field 进行类型检查。类型检查的目的是为了看能否把该 field 转为
+ pytorch的 :class:`torch.LongTensor` 或 :class:`torch.FloatTensor` 类型
+ (也可以在 :class:`~fastNLP.DataSetIter` 中设置输出numpy类型,参考 :class:`~fastNLP.DataSetIter` )。
+
+ 如上例所示,fastNLP已将 `words` ,`chars` 和 `label` 转为了 :class:`Tensor` 类型。
+ 如果 field 在每个 `instance` 都拥有相同的维度(不能超过两维),且最内层的元素都为相同的 type(int, float, np.int*, np.float*),
+ 则fastNLP默认将对该 field 进行pad。也支持全为str的field作为target和input,这种情况下,fastNLP默认不进行pad。
+ 另外,当某个 field 已经被设置为了 target 或者 input 后,之后 `append` 的
+ `instance` 对应的 field 必须要和前面已有的内容一致,否则会报错。
可以查看field的dtype::
@@ -229,6 +238,7 @@
错误::
from fastNLP import DataSet
+
d = DataSet({'data': [1, 'a']})
d.set_input('data')
        >> RuntimeError: Mixed data types in Field data: [<class 'int'>, <class 'str'>]
@@ -243,6 +253,7 @@
当某个field被设置为忽略type之后,fastNLP将不对其进行pad。
3.2 DataSet与pad
+--------------------------------------
在fastNLP里,pad是与一个field绑定的。即不同的field可以使用不同的pad方式,比如在英文任务中word需要的pad和
character的pad方式往往是不同的。fastNLP是通过一个叫做 :class:`~fastNLP.Padder` 的子类来完成的。
@@ -252,7 +263,7 @@
如果 :class:`~fastNLP.AutoPadder` 或 :class:`~fastNLP.EngChar2DPadder` 无法满足需求,
也可以自己写一个 :class:`~fastNLP.Padder` 。
- Example::
+ .. code-block::
from fastNLP import DataSet
from fastNLP import EngChar2DPadder
@@ -285,7 +296,8 @@ from .field import AutoPadder
from .field import FieldArray
from .instance import Instance
from .utils import _get_func_signature
-
+from .field import AppendToTargetOrInputException
+from .field import SetInputOrTargetException
class DataSet(object):
"""
@@ -416,13 +428,13 @@ class DataSet(object):
"""
将一个instance对象append到DataSet后面。
- :param instance: :class:`~fastNLP.Instance` 类型。若DataSet不为空,则instance应该拥有和DataSet完全一样的field。
+ :param ~fastNLP.Instance instance: 若DataSet不为空,则instance应该拥有和DataSet完全一样的field。
"""
if len(self.field_arrays) == 0:
# DataSet has no field yet
for name, field in instance.fields.items():
- field = field.tolist() if isinstance(field, np.ndarray) else field
+ # field = field.tolist() if isinstance(field, np.ndarray) else field
self.field_arrays[name] = FieldArray(name, [field]) # 第一个样本,必须用list包装起来
else:
if len(self.field_arrays) != len(instance.fields):
@@ -431,14 +443,18 @@ class DataSet(object):
.format(len(self.field_arrays), len(instance.fields)))
for name, field in instance.fields.items():
assert name in self.field_arrays
- self.field_arrays[name].append(field)
+ try:
+ self.field_arrays[name].append(field)
+ except AppendToTargetOrInputException as e:
+ print(f"Cannot append to field:{name}.")
+ raise e
def add_fieldarray(self, field_name, fieldarray):
"""
将fieldarray添加到DataSet中.
:param str field_name: 新加入的field的名称
- :param fieldarray: :class:`~fastNLP.FieldArray` 类型。需要加入DataSet的field的内容
+ :param ~fastNLP.core.FieldArray fieldarray: 需要加入DataSet的field的内容
:return:
"""
if not isinstance(fieldarray, FieldArray):
@@ -454,8 +470,7 @@ class DataSet(object):
:param str field_name: 新增的field的名称
:param list fields: 需要新增的field的内容
- :param None, padder: :class:`~fastNLP.Padder` 类型,
- 如果为None,则不进行pad,默认使用 :class:`~fastNLP.AutoPadder` 自动判断是否需要做pad。
+ :param None,~fastNLP.Padder padder: 如果为None,则不进行pad,默认使用 :class:`~fastNLP.AutoPadder` 自动判断是否需要做pad。
:param bool is_input: 新加入的field是否是input
:param bool is_target: 新加入的field是否是target
:param bool ignore_type: 是否忽略对新加入的field的类型检查
@@ -517,7 +532,7 @@ class DataSet(object):
"""
返回一个dict,key为field_name, value为对应的 :class:`~fastNLP.FieldArray`
- :return: dict: 返回如上所述的字典
+ :return dict: 返回如上所述的字典
"""
return self.field_arrays
@@ -525,7 +540,7 @@ class DataSet(object):
"""
返回一个list,包含所有 field 的名字
- :return: list: 返回如上所述的列表
+ :return list: 返回如上所述的列表
"""
return sorted(self.field_arrays.keys())
@@ -549,6 +564,7 @@ class DataSet(object):
self.field_arrays[new_name].name = new_name
else:
raise KeyError("DataSet has no field named {}.".format(old_name))
+ return self
def set_target(self, *field_names, flag=True):
"""
@@ -565,7 +581,11 @@ class DataSet(object):
assert isinstance(flag, bool), "Only bool type supported."
for name in field_names:
if name in self.field_arrays:
- self.field_arrays[name].is_target = flag
+ try:
+ self.field_arrays[name].is_target = flag
+ except SetInputOrTargetException as e:
+ print(f"Cannot set field:{name} as target.")
+ raise e
else:
raise KeyError("{} is not a valid field name.".format(name))
@@ -581,7 +601,11 @@ class DataSet(object):
"""
for name in field_names:
if name in self.field_arrays:
- self.field_arrays[name].is_input = flag
+ try:
+ self.field_arrays[name].is_input = flag
+ except SetInputOrTargetException as e:
+ print(f"Cannot set field:{name} as input, exception happens at the {e.index} value.")
+ raise e
else:
raise KeyError("{} is not a valid field name.".format(name))
@@ -610,7 +634,7 @@ class DataSet(object):
dataset.set_padder('chars', padder) # 则chars这个field会使用EngChar2DPadder进行pad操作
:param str field_name: 设置field的padding方式为padder
- :param None, Padder padder: 设置为None即删除padder, 即对该field不进行pad操作。
+ :param None,~fastNLP.Padder padder: 设置为None即删除padder, 即对该field不进行pad操作。
"""
if field_name not in self.field_arrays:
raise KeyError("There is no field named {}.".format(field_name))
@@ -658,7 +682,7 @@ class DataSet(object):
2. is_target: bool, 如果为True则将名为 `new_field_name` 的field设置为target
3. ignore_type: bool, 如果为True则将名为 `new_field_name` 的field的ignore_type设置为true, 忽略其类型
- :return: list(Any), 里面的元素为func的返回值,所以list长度为DataSet的长度
+ :return List[Any]: 里面的元素为func的返回值,所以list长度为DataSet的长度
"""
assert len(self) != 0, "Null DataSet cannot use apply_field()."
@@ -685,7 +709,7 @@ class DataSet(object):
"""
将results作为加入到新的field中,field名称为new_field_name
- :param list(str) results: 一般是apply*()之后的结果
+ :param List[str] results: 一般是apply*()之后的结果
:param str new_field_name: 新加入的field的名称
:param dict kwargs: 用户apply*()时传入的自定义参数
:return:
@@ -728,7 +752,7 @@ class DataSet(object):
3. ignore_type: bool, 如果为True则将 `new_field_name` 的field的ignore_type设置为true, 忽略其类型
- :return: list(Any), 里面的元素为func的返回值,所以list长度为DataSet的长度
+ :return List[Any]: 里面的元素为func的返回值,所以list长度为DataSet的长度
"""
assert len(self) != 0, "Null DataSet cannot use apply()."
idx = -1
@@ -748,7 +772,20 @@ class DataSet(object):
self._add_apply_field(results, new_field_name, kwargs)
return results
-
+
+ def add_seq_len(self, field_name:str, new_field_name='seq_len'):
+ """
+ 将使用len()直接对field_name中每个元素作用,将其结果作为seqence length, 并放入seq_len这个field。
+
+ :param field_name: str.
+ :return:
+ """
+ if self.has_field(field_name=field_name):
+ self.apply_field(len, field_name, new_field_name=new_field_name)
+ else:
+ raise KeyError(f"Field:{field_name} not found.")
+ return self
+
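
For example, the new `add_seq_len` helper replaces the manual `apply_field(len, ...)` call (toy data for illustration):

```python
from fastNLP import DataSet

ds = DataSet({'words': [['a', 'b', 'c'], ['d', 'e']]})
ds.add_seq_len('words')                  # adds a 'seq_len' field computed with len()
print(ds.get_field('seq_len').content)   # [3, 2]
```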
def drop(self, func, inplace=True):
"""
func接受一个Instance,返回bool值。返回值为True时,该Instance会被移除或者加入到返回的DataSet中。
@@ -774,17 +811,19 @@ class DataSet(object):
else:
return DataSet()
- def split(self, ratio):
+ def split(self, ratio, shuffle=True):
"""
将DataSet按照ratio的比例拆分,返回两个DataSet
-        :param float ratio: 0<ratio<1, 返回的第一个DataSet拥有 `(1-ratio)` 这么多数据,第二个DataSet拥有 `ratio` 这么多数据
diff --git a/fastNLP/core/field.py b/fastNLP/core/field.py
--- a/fastNLP/core/field.py
+++ b/fastNLP/core/field.py
- # list 跟 非list 混在一起
- raise RuntimeError("Mixed data types in Field {}: {}".format(self.name, list(type_set)))
- # >1维list
- inner_type_set = set()
- for l in content:
- [inner_type_set.add(type(obj)) for obj in l]
- if list not in inner_type_set:
- # 二维list
- self.content_dim = 2
- return self._basic_type_detection(inner_type_set)
- else:
- if len(inner_type_set) == 1:
- # >2维list
- inner_inner_type_set = set()
- for _2d_list in content:
- for _1d_list in _2d_list:
- [inner_inner_type_set.add(type(obj)) for obj in _1d_list]
- if list in inner_inner_type_set:
- raise RuntimeError("FieldArray cannot handle 4-D or more-D list.")
- # 3维list
- self.content_dim = 3
- return self._basic_type_detection(inner_inner_type_set)
- else:
- # list 跟 非list 混在一起
- raise RuntimeError("Mixed data types in Field {}: {}".format(self.name, list(inner_type_set)))
- else:
- # 一维list
- for content_type in type_set:
- if content_type not in self.BASIC_TYPES:
- raise RuntimeError("Unexpected data type in Field '{}'. Expect one of {}. Got {}.".format(
- self.name, self.BASIC_TYPES, content_type))
- self.content_dim = 1
- return self._basic_type_detection(type_set)
-
- def _basic_type_detection(self, type_set):
- """
- :param type_set: a set of Python types
- :return: one of self.BASIC_TYPES
- """
- if len(type_set) == 1:
- return type_set.pop()
- elif len(type_set) == 2:
- # 有多个basic type; 可能需要up-cast
- if float in type_set and int in type_set:
- # up-cast int to float
- return float
- else:
- # str 跟 int 或者 float 混在一起
- raise RuntimeError("Mixed data types in Field {}: {}".format(self.name, list(type_set)))
- else:
- # str, int, float混在一起
- raise RuntimeError("Mixed data types in Field {}: {}".format(self.name, list(type_set)))
-
- def _1d_list_check(self, val):
- """如果不是1D list就报错
- """
- type_set = set((type(obj) for obj in val))
- if any(obj not in self.BASIC_TYPES for obj in type_set):
- raise ValueError("Mixed data types in Field {}: {}".format(self.name, list(type_set)))
- self._basic_type_detection(type_set)
- # otherwise: _basic_type_detection will raise error
- return True
-
- def _2d_list_check(self, val):
- """如果不是2D list 就报错
- """
- type_set = set(type(obj) for obj in val)
- if list(type_set) != [list]:
- raise ValueError("Mixed data types in Field {}: {}".format(self.name, type_set))
- inner_type_set = set()
- for l in val:
- for obj in l:
- inner_type_set.add(type(obj))
- self._basic_type_detection(inner_type_set)
- return True
-
- @staticmethod
- def _map_to_np_type(basic_type):
- type_mapping = {int: np.int64, float: np.float64, str: np.str, np.ndarray: np.ndarray}
- return type_mapping[basic_type]
-
- def __repr__(self):
- return "FieldArray {}: {}".format(self.name, self.content.__repr__())
-
- def append(self, val):
- """将val append到这个field的尾部。如果这个field已经被设置为input或者target,则在append之前会检查该类型是否与已有
- 的内容是匹配的。
+ 检查当前content所有的element是否是同一个类型,且是否每个元素具有相同的维度。通过的话,设置_cell_ndim与_ele_type属性;没有
+ 通过将直接报错.
- :param Any val: 需要append的值。
+ :return:
"""
- if self.ignore_type is False:
- if isinstance(val, list):
- pass
- elif isinstance(val, tuple): # 确保最外层是list
- val = list(val)
- elif isinstance(val, np.ndarray):
- val = val.tolist()
- elif any((isinstance(val, t) for t in self.BASIC_TYPES)):
- pass
- else:
- raise RuntimeError(
- "Unexpected data type {}. Should be list, np.array, or {}".format(type(val), self.BASIC_TYPES))
-
- if self.is_input is True or self.is_target is True:
- if type(val) == list:
- if len(val) == 0:
- raise ValueError("Cannot append an empty list.")
- if self.content_dim == 2 and self._1d_list_check(val):
- # 1维list检查
- pass
- elif self.content_dim == 3 and self._2d_list_check(val):
- # 2维list检查
- pass
- else:
- raise RuntimeError(
- "Dimension not matched: expect dim={}, got {}.".format(self.content_dim - 1, val))
- elif type(val) in self.BASIC_TYPES and self.content_dim == 1:
- # scalar检查
- if type(val) == float and self.pytype == int:
- self.pytype = float
- self.dtype = self._map_to_np_type(self.pytype)
- else:
- raise RuntimeError(
- "Unexpected data type {}. Should be list, np.array, or {}".format(type(val), self.BASIC_TYPES))
- self.content.append(val)
-
+ cell_0 = self.content[0]
+ index = 0
+ try:
+ type_0, dim_0 = _get_ele_type_and_dim(cell_0)
+ for cell in self.content[1:]:
+ index += 1
+ type_i, dim_i = _get_ele_type_and_dim(cell)
+ if type_i!=type_0:
+ raise SetInputOrTargetException("Type:{} in index {} is different from the first element with type:{}."
+ ".".format(type_i, index, type_0))
+ if dim_0!=dim_i:
+ raise SetInputOrTargetException("Dimension:{} in index {} is different from the first element with "
+ "dimension:{}.".format(dim_i, index, dim_0))
+ self._cell_ndim = dim_0
+ self.dtype = type_0
+ except SetInputOrTargetException as e:
+ e.index = index
+ raise e
+
+ def append(self, val:Any):
+ """
+ :param val: 把该val append到fieldarray。
+ :return:
+ """
+ if (self._is_target or self._is_input) and self._ignore_type is False:
+ type_, dim_ = _get_ele_type_and_dim(val)
+ if self.dtype!=type_:
+ raise AppendToTargetOrInputException(f"Value(type:{type_}) are of different types with "
+ f"previous values(type:{self.dtype}).")
+ if self._cell_ndim!=dim_:
+ raise AppendToTargetOrInputException(f"Value(dim:{dim_}) are of different dimensions with "
+ f"previous values(dim:{self._cell_ndim}).")
+ self.content.append(val)
+ else:
+ self.content.append(val)
+
def __getitem__(self, indices):
return self.get(indices, pad=False)
-
+
def __setitem__(self, idx, val):
assert isinstance(idx, int)
+ if (self._is_target or self._is_input) and self.ignore_type is False: # 需要检测类型
+ type_, dim_ = _get_ele_type_and_dim(val)
+ if self.dtype!=type_:
+ raise RuntimeError(f"Value(type:{type_}) are of different types with "
+ f"other values(type:{self.dtype}).")
+ if self._cell_ndim!=dim_:
+ raise RuntimeError(f"Value(dim:{dim_}) are of different dimensions with "
+ f"previous values(dim:{self._cell_ndim}).")
self.content[idx] = val
-
+
def get(self, indices, pad=True):
"""
根据给定的indices返回内容
@@ -257,14 +170,17 @@ class FieldArray(object):
if isinstance(indices, int):
return self.content[indices]
if self.is_input is False and self.is_target is False:
- raise RuntimeError("Please specify either is_input or is_target is True for {}".format(self.name))
-
+ raise RuntimeError("Please specify either is_input or is_target to True for {}".format(self.name))
+
contents = [self.content[i] for i in indices]
if self.padder is None or pad is False:
return np.array(contents)
else:
- return self.padder(contents, field_name=self.name, field_ele_dtype=self.dtype)
-
+ return self.pad(contents)
+
+ def pad(self, contents):
+ return self.padder(contents, field_name=self.name, field_ele_dtype=self.dtype, dim=self._cell_ndim)
+
def set_padder(self, padder):
"""
设置padder,在这个field进行pad的时候用这个padder进行pad,如果为None则不进行pad。
@@ -276,7 +192,7 @@ class FieldArray(object):
self.padder = deepcopy(padder)
else:
self.padder = None
-
+
def set_pad_val(self, pad_val):
"""
修改padder的pad_val.
@@ -286,7 +202,7 @@ class FieldArray(object):
if self.padder is not None:
self.padder.set_pad_val(pad_val)
return self
-
+
def __len__(self):
"""
Returns the size of FieldArray.
@@ -294,7 +210,7 @@ class FieldArray(object):
:return int length:
"""
return len(self.content)
-
+
def to(self, other):
"""
将other的属性复制给本FieldArray(other必须为FieldArray类型).
@@ -303,22 +219,225 @@ class FieldArray(object):
:param other: :class:`~fastNLP.FieldArray` 从哪个field拷贝属性
:return: :class:`~fastNLP.FieldArray`
"""
- assert isinstance(other, FieldArray), "Only support FieldArray type, not {}.".format(type(other))
-
+ assert isinstance(other, FieldArray), "Only supports fastNLP.FieldArray type, not {}.".format(type(other))
+
+ self.ignore_type = other.ignore_type
self.is_input = other.is_input
self.is_target = other.is_target
self.padder = other.padder
- self.ignore_type = other.ignore_type
-
+
return self
+ def split(self, sep:str=None, inplace:bool=True):
+ """
+        依次对自身的元素使用.split()方法,应该只有当本field的元素为str时,该方法才有用。
-def _is_iterable(content):
+ :param sep: 分割符,如果为None则直接调用str.split()。
+ :param inplace: 如果为True,则将新生成值替换本field。否则返回list。
+ :return: List[List[str]] or self
+ """
+ new_contents = []
+ for index, cell in enumerate(self.content):
+ try:
+ new_contents.append(cell.split(sep))
+ except Exception as e:
+ print(f"Exception happens when process value in index {index}.")
+ raise e
+ return self._after_process(new_contents, inplace=inplace)
+
+ def int(self, inplace:bool=True):
+ """
+ 将本field中的值调用int(cell). 支持field中内容为以下两种情况(1)['1', '2', ...](即field中每个值为str的),
+ (2) [['1', '2', ..], ['3', ..], ...](即field中每个值为一个list,list中的值会被依次转换。)
+
+ :param inplace: 如果为True,则将新生成值替换本field。否则返回list。
+ :return: List[int], List[List[int]], self
+ """
+ new_contents = []
+ for index, cell in enumerate(self.content):
+ try:
+ if isinstance(cell, list):
+ new_contents.append([int(value) for value in cell])
+ else:
+ new_contents.append(int(cell))
+ except Exception as e:
+ print(f"Exception happens when process value in index {index}.")
+                raise e
+ return self._after_process(new_contents, inplace=inplace)
+
+ def float(self, inplace=True):
+ """
+ 将本field中的值调用float(cell). 支持field中内容为以下两种情况(1)['1', '2', ...](即field中每个值为str的),
+ (2) [['1', '2', ..], ['3', ..], ...](即field中每个值为一个list,list中的值会被依次转换。)
+
+ :param inplace: 如果为True,则将新生成值替换本field。否则返回list。
+ :return:
+ """
+ new_contents = []
+ for index, cell in enumerate(self.content):
+ try:
+ if isinstance(cell, list):
+ new_contents.append([float(value) for value in cell])
+ else:
+ new_contents.append(float(cell))
+ except Exception as e:
+ print(f"Exception happens when process value in index {index}.")
+ raise e
+ return self._after_process(new_contents, inplace=inplace)
+
+ def bool(self, inplace=True):
+ """
+ 将本field中的值调用bool(cell). 支持field中内容为以下两种情况(1)['1', '2', ...](即field中每个值为str的),
+ (2) [['1', '2', ..], ['3', ..], ...](即field中每个值为一个list,list中的值会被依次转换。)
+
+ :param inplace: 如果为True,则将新生成值替换本field。否则返回list。
+ :return:
+ """
+ new_contents = []
+ for index, cell in enumerate(self.content):
+ try:
+ if isinstance(cell, list):
+ new_contents.append([bool(value) for value in cell])
+ else:
+ new_contents.append(bool(cell))
+ except Exception as e:
+ print(f"Exception happens when process value in index {index}.")
+ raise e
+
+ return self._after_process(new_contents, inplace=inplace)
+
+ def lower(self, inplace=True):
+ """
+ 将本field中的值调用cell.lower(). 支持field中内容为以下两种情况(1)['1', '2', ...](即field中每个值为str的),
+ (2) [['1', '2', ..], ['3', ..], ...](即field中每个值为一个list,list中的值会被依次转换。)
+
+ :param inplace: 如果为True,则将新生成值替换本field。否则返回list。
+ :return: List[int], List[List[int]], self
+ """
+ new_contents = []
+ for index, cell in enumerate(self.content):
+ try:
+ if isinstance(cell, list):
+ new_contents.append([value.lower() for value in cell])
+ else:
+ new_contents.append(cell.lower())
+ except Exception as e:
+ print(f"Exception happens when process value in index {index}.")
+ raise e
+ return self._after_process(new_contents, inplace=inplace)
+
+ def upper(self, inplace=True):
+ """
+        将本field中的值调用cell.upper(). 支持field中内容为以下两种情况(1)['1', '2', ...](即field中每个值为str的),
+ (2) [['1', '2', ..], ['3', ..], ...](即field中每个值为一个list,list中的值会被依次转换。)
+
+ :param inplace: 如果为True,则将新生成值替换本field。否则返回list。
+ :return: List[int], List[List[int]], self
+ """
+ new_contents = []
+ for index, cell in enumerate(self.content):
+ try:
+ if isinstance(cell, list):
+ new_contents.append([value.upper() for value in cell])
+ else:
+ new_contents.append(cell.upper())
+ except Exception as e:
+ print(f"Exception happens when process value in index {index}.")
+ raise e
+ return self._after_process(new_contents, inplace=inplace)
+
+ def value_count(self):
+ """
+ 返回该field下不同value的数量。多用于统计label数量
+
+ :return: Counter, key是label,value是出现次数
+ """
+ count = Counter()
+
+ def cum(cell):
+ if _is_iterable(cell) and not isinstance(cell, str):
+ for cell_ in cell:
+ cum(cell_)
+ else:
+ count[cell] += 1
+ for cell in self.content:
+ cum(cell)
+ return count
+
+ def _after_process(self, new_contents, inplace):
+ """
+ 当调用处理函数之后,决定是否要替换field。
+
+ :param new_contents:
+ :param inplace:
+ :return: self或者生成的content
+ """
+ if inplace:
+ self.content = new_contents
+ try:
+ self.is_input = self.is_input
+                self.is_target = self.is_target
+ except SetInputOrTargetException as e:
+ print("The newly generated field cannot be set as input or target.")
+ raise e
+ return self
+ else:
+ return new_contents
+
+
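
The conversion helpers added above (`split`, `int`, `float`, `bool`, `lower`, `upper`, `value_count`) can be tried directly on a `FieldArray`; a small sketch with illustrative content (constructor arguments beyond name and content are left at their defaults):

```python
from fastNLP import FieldArray

labels = FieldArray('label', ['1', '2', '1', '2', '2'])
labels.int()                     # in place by default: content becomes [1, 2, 1, 2, 2]
print(labels.value_count())      # Counter({2: 3, 1: 2})

raw = FieldArray('raw', ['This is A test', 'Another Sentence'])
raw.lower()                      # lower-case every cell in place
print(raw.split(inplace=False))  # [['this', 'is', 'a', 'test'], ['another', 'sentence']]
```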
+def _get_ele_type_and_dim(cell:Any, dim=0):
+ """
+ 识别cell的类别与dimension的数量
+
+ numpy scalar type:https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.scalars.html
+ :param cell:
+ :param dim:
+ :return:
+ """
+ if isinstance(cell, (str, Number, np.bool_)):
+ if hasattr(cell, 'dtype'):
+ return cell.dtype.type, dim
+ return type(cell), dim
+ elif isinstance(cell, list):
+ dim += 1
+ res = [_get_ele_type_and_dim(cell_i, dim) for cell_i in cell]
+ types = set([i for i,j in res])
+ dims = set([j for i,j in res])
+ if len(types)>1:
+ raise SetInputOrTargetException("Mixed types detected: {}.".format(list(types)))
+ elif len(types)==0:
+ raise SetInputOrTargetException("Empty value encountered.")
+ if len(dims)>1:
+ raise SetInputOrTargetException("Mixed dimension detected: {}.".format(list(dims)))
+ return types.pop(), dims.pop()
+ elif isinstance(cell, torch.Tensor):
+ return cell.dtype, cell.dim() + dim # 如果是torch.mean的结果是0
+ elif isinstance(cell, np.ndarray):
+ if cell.dtype != np.dtype('O'): # 如果不是object的话说明是well-formatted的了
+ return cell.dtype.type, cell.ndim + dim # dtype.type返回的会是np.int32, np.float等
+ # 否则需要继续往下iterate
+ dim += 1
+ res = [_get_ele_type_and_dim(cell_i, dim) for cell_i in cell]
+ types = set([i for i,j in res])
+ dims = set([j for i,j in res])
+ if len(types)>1:
+ raise SetInputOrTargetException("Mixed types detected: {}.".format(list(types)))
+ elif len(types)==0:
+ raise SetInputOrTargetException("Empty value encountered.")
+ if len(dims)>1:
+ raise SetInputOrTargetException("Mixed dimension detected: {}.".format(list(dims)))
+ return types.pop(), dims.pop()
+ else: # 包含tuple, set, dict以及其它的类型
+ raise SetInputOrTargetException(f"Cannot process type:{type(cell)}.")
+
+
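
For reference, a quick sketch of what `_get_ele_type_and_dim` reports for a few well-formed cells (values are illustrative):

```python
import numpy as np
import torch
from fastNLP.core.field import _get_ele_type_and_dim

print(_get_ele_type_and_dim(3))                     # (<class 'int'>, 0)
print(_get_ele_type_and_dim([[1, 2], [3]]))         # (<class 'int'>, 2)
print(_get_ele_type_and_dim(np.array([1.0, 2.0])))  # (<class 'numpy.float64'>, 1)
print(_get_ele_type_and_dim(torch.zeros(2, 3)))     # (torch.float32, 2)
```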
+def _is_iterable(value):
+ # 检查是否是iterable的, duck typing
try:
- _ = (e for e in content)
- except TypeError:
+ iter(value)
+ return True
+ except BaseException as e:
return False
- return True
class Padder:
@@ -327,32 +446,36 @@ class Padder:
所有padder都需要继承这个类,并覆盖__call__方法。
用于对batch进行padding操作。传入的element是inplace的,即直接修改element可能导致数据变化,建议inplace修改之前deepcopy一份。
-
+
.. py:function:: __call__(self, contents, field_name, field_ele_dtype):
- 传入的是List内容。假设有以下的DataSet。
- :param list(Any) contents: 传入的element是inplace的,即直接修改element可能导致数据变化,建议inplace修改之前
+ 传入的是List内容。假设有以下的DataSet。
+
+ :param List[Any] contents: 传入的element是inplace的,即直接修改element可能导致数据变化,建议inplace修改之前
deepcopy一份。
:param str, field_name: field的名称。
:param np.int64,np.float64,np.str,None, field_ele_dtype: 该field的内层元素的类型。如果该field的ignore_type为True,该这个值为None。
:return: np.array([padded_element])
-
+
"""
-
+
def __init__(self, pad_val=0, **kwargs):
self.pad_val = pad_val
-
+
def set_pad_val(self, pad_val):
self.pad_val = pad_val
-
- def __call__(self, contents, field_name, field_ele_dtype):
+
+ @abstractmethod
+ def __call__(self, contents, field_name, field_ele_dtype, dim:int):
"""
传入的是List内容。假设有以下的DataSet。
- :param list(Any) contents: 传入的element是inplace的,即直接修改element可能导致数据变化,建议inplace修改之前
+ :param List[Any] contents: 传入的element是inplace的,即直接修改element可能导致数据变化,建议inplace修改之前
deepcopy一份。
:param str, field_name: field的名称。
- :param np.int64,np.float64,np.str,None, field_ele_dtype: 该field的内层元素的类型。如果该field的ignore_type为True,该这个值为None。
+ :param np.int64,np.float64,np.str,None, field_ele_dtype: 该field的内层元素的类型。如果该field的ignore_type为True,
+ 该这个值为None。
+ :param dim: 这个field的维度。当ignore_type为True时,该值为None
:return: np.array([padded_element])
Example::
@@ -394,50 +517,86 @@ class AutoPadder(Padder):
根据contents的数据自动判定是否需要做padding。
1 如果元素类型(元素类型是指field中最里层元素的数据类型, 可以通过FieldArray.dtype查看,比如['This', 'is', ...]的元素类
- 型为np.str, [[1,2], ...]的元素类型为np.int64)的数据不为(np.int64, np.float64)则不会进行pad
+ 型为str, [[1,2], ...]的元素类型为int)的数据不为数值类型则不会进行pad
- 2 如果元素类型为(np.int64, np.float64),
+ 2 如果元素类型为数值类型,比如np.int64, np.float64, int, float, torch.int64等
- 2.1 如果该field的内容为(np.int64, np.float64),比如为seq_len, 则不进行padding
+ 2.1 如果该field的内容为数值类型(包括int, float等),比如为seq_len, 则不进行padding
- 2.2 如果该field的内容为List, 那么会将Batch中的List pad为一样长。若该List下还有里层的List需要padding,请使用其它padder。
- 即如果Instance中field形如[1, 2, 3, ...],则可以pad;若为[[1,2], [3,4, ...]]则不能进行pad
+ 2.2 如果该field的内容等价于一维list, 那么会将Batch中的List pad为一样长。
+
+ 2.3 如果该field的内容等价于二维list,那么会按照英语character padding的方式进行padding。如果是character padding建议使用
+            :class:`fastNLP.EngChar2DPadder`。
+
+ 2.4 如果该field的内容等价于三维list,则如果每个instance在每个维度上相等,会组成一个batch的tensor返回,这种情况应该是为图片
+ 的情况。
+
+ 3 其它情况不进行处理,返回一个np.array类型。
"""
-
def __init__(self, pad_val=0):
- """
- :param pad_val: int, padding的位置使用该index
- """
super().__init__(pad_val=pad_val)
-
- def _is_two_dimension(self, contents):
- """
- 判断contents是不是只有两个维度。[[1,2], [3]]是两个维度. [[[1,2], [3, 4, 5]], [[4,5]]]有三个维度
- :param contents:
- :return:
- """
- value = contents[0]
- if isinstance(value, (np.ndarray, list)):
- value = value[0]
- if isinstance(value, (np.ndarray, list)):
- return False
- return True
- return False
-
- def __call__(self, contents, field_name, field_ele_dtype):
-
- if not _is_iterable(contents[0]):
- array = np.array([content for content in contents], dtype=field_ele_dtype)
- elif field_ele_dtype in (np.int64, np.float64) and self._is_two_dimension(contents):
- max_len = max([len(content) for content in contents])
- array = np.full((len(contents), max_len), self.pad_val, dtype=field_ele_dtype)
- for i, content in enumerate(contents):
- array[i][:len(content)] = content
- elif field_ele_dtype is None:
- array = np.array(contents) # 当ignore_type=True时,直接返回contents
- else: # should only be str
- array = np.array([content for content in contents])
- return array
+
+ def __call__(self, contents, field_name, field_ele_dtype, dim):
+ if field_ele_dtype:
+ if dim>3:
+ return np.array(contents)
+ if isinstance(field_ele_dtype, type) and \
+ (issubclass(field_ele_dtype, np.number) or issubclass(field_ele_dtype, Number)):
+ if dim==0:
+ array = np.array(contents, dtype=field_ele_dtype)
+ elif dim==1:
+ max_len = max(map(len, contents))
+ array = np.full((len(contents), max_len), self.pad_val, dtype=field_ele_dtype)
+ for i, content_i in enumerate(contents):
+ array[i, :len(content_i)] = content_i
+ elif dim==2:
+ max_len = max(map(len, contents))
+ max_word_len = max([max([len(content_ii) for content_ii in content_i]) for
+ content_i in contents])
+ array = np.full((len(contents), max_len, max_word_len), self.pad_val, dtype=field_ele_dtype)
+ for i, content_i in enumerate(contents):
+ for j, content_ii in enumerate(content_i):
+ array[i, j, :len(content_ii)] = content_ii
+ else:
+ shape = np.shape(contents)
+ if len(shape)==4: # 说明各dimension是相同的大小
+ array = np.array(contents, dtype=field_ele_dtype)
+ else:
+ raise RuntimeError(f"Field:{field_name} has 3 dimensions, every sample should have the same shape.")
+ return array
+ elif str(field_ele_dtype).startswith('torch'):
+ if dim==0:
+ tensor = torch.tensor(contents).to(field_ele_dtype)
+ elif dim==1:
+ max_len = max(map(len, contents))
+ tensor = torch.full((len(contents), max_len), fill_value=self.pad_val, dtype=field_ele_dtype)
+ for i, content_i in enumerate(contents):
+ tensor[i, :len(content_i)] = torch.tensor(content_i)
+ elif dim==2:
+ max_len = max(map(len, contents))
+ max_word_len = max([max([len(content_ii) for content_ii in content_i]) for
+ content_i in contents])
+ tensor = torch.full((len(contents), max_len, max_word_len), fill_value=self.pad_val,
+ dtype=field_ele_dtype)
+ for i, content_i in enumerate(contents):
+ for j, content_ii in enumerate(content_i):
+ tensor[i, j, :len(content_ii)] = torch.tensor(content_ii)
+ else:
+ shapes = set([np.shape(content_i) for content_i in contents])
+ if len(shapes)>1:
+ raise RuntimeError(f"Field:{field_name} has 3 dimensions, every sample should have the same shape.")
+ shape = shapes.pop()
+ if len(shape)==3:
+ tensor = torch.full([len(contents)]+list(shape), fill_value=self.pad_val, dtype=field_ele_dtype)
+ for i, content_i in enumerate(contents):
+ tensor[i] = torch.tensor(content_i, dtype=field_ele_dtype)
+ else:
+ raise RuntimeError(f"Field:{field_name} has 3 dimensions, every sample should have the same shape.")
+ return tensor
+ else:
+ return np.array(contents) # 不进行任何操作
+ else:
+ return np.array(contents)
class EngChar2DPadder(Padder):
@@ -463,7 +622,7 @@ class EngChar2DPadder(Padder):
dataset.set_padder('chars', padder) # chars这个field的设置为了EnChar2DPadder
"""
-
+
def __init__(self, pad_val=0, pad_length=0):
"""
:param pad_val: int, pad的位置使用该index
@@ -471,32 +630,10 @@ class EngChar2DPadder(Padder):
都pad或截取到该长度.
"""
super().__init__(pad_val=pad_val)
-
+
self.pad_length = pad_length
-
- def _exactly_three_dims(self, contents, field_name):
- """
- 检查传入的contents是否刚好是3维,如果不是3维就报错。理论上,第一个维度是batch,第二个维度是word,第三个维度是character
- :param contents:
- :param field_name: str
- :return:
- """
- if not isinstance(contents, list):
- raise TypeError("contents should be a list, not {}.".format(type(contents)))
- value = contents[0]
- try:
- value = value[0]
- except:
- raise ValueError("Field:{} only has one dimension.".format(field_name))
- try:
- value = value[0]
- except:
- raise ValueError("Field:{} only has two dimensions.".format(field_name))
-
- if _is_iterable(value):
- raise ValueError("Field:{} has more than 3 dimension.".format(field_name))
-
- def __call__(self, contents, field_name, field_ele_dtype):
+
+ def __call__(self, contents, field_name, field_ele_dtype, dim):
"""
期望输入类似于
[
@@ -510,11 +647,11 @@ class EngChar2DPadder(Padder):
:param field_ele_dtype
:return:
"""
- if field_ele_dtype not in (np.int64, np.float64):
+ if field_ele_dtype not in (np.int64, np.float64, int, float):
raise TypeError('dtype of Field:{} should be np.int64 or np.float64 to do 2D padding, get {}.'.format(
field_name, field_ele_dtype
))
- self._exactly_three_dims(contents, field_name)
+ assert dim==2, f"Field:{field_name} has {dim} dimensions, EngChar2DPadder only supports input with 2 dimensions."
if self.pad_length < 1:
max_char_length = max([max(len(char_lst) for char_lst in word_lst) for word_lst in contents])
else:
@@ -522,12 +659,12 @@ class EngChar2DPadder(Padder):
max_sent_length = max(len(word_lst) for word_lst in contents)
batch_size = len(contents)
dtype = type(contents[0][0][0])
-
+
padded_array = np.full((batch_size, max_sent_length, max_char_length), fill_value=self.pad_val,
dtype=dtype)
for b_idx, word_lst in enumerate(contents):
for c_idx, char_lst in enumerate(word_lst):
chars = char_lst[:max_char_length]
padded_array[b_idx, c_idx, :len(chars)] = chars
-
+
return padded_array
diff --git a/fastNLP/core/losses.py b/fastNLP/core/losses.py
index 9dc02f3d..1f8923eb 100644
--- a/fastNLP/core/losses.py
+++ b/fastNLP/core/losses.py
@@ -20,12 +20,14 @@ from collections import defaultdict
import torch
import torch.nn.functional as F
+from ..core.const import Const
from .utils import _CheckError
from .utils import _CheckRes
from .utils import _build_args
from .utils import _check_arg_dict_list
from .utils import _check_function_or_method
from .utils import _get_func_signature
+from .utils import seq_len_to_mask
class LossBase(object):
@@ -34,14 +36,23 @@ class LossBase(object):
"""
def __init__(self):
- self.param_map = {}
+ self._param_map = {} # key是get_loss()的参数名,value是从传入dict中取出对应值时使用的key
self._checked = False
-
+
+ @property
+ def param_map(self):
+ if len(self._param_map) == 0: # 如果为空说明还没有初始化
+ func_spect = inspect.getfullargspec(self.get_loss)
+ func_args = [arg for arg in func_spect.args if arg != 'self']
+ for arg in func_args:
+ self._param_map[arg] = arg
+ return self._param_map
+
def get_loss(self, *args, **kwargs):
raise NotImplementedError
def _init_param_map(self, key_map=None, **kwargs):
- """检查key_map和其他参数map,并将这些映射关系添加到self.param_map
+ """检查key_map和其他参数map,并将这些映射关系添加到self._param_map
:param dict key_map: 表示key的映射关系
:param kwargs: key word args里面的每一个的键-值对都会被构造成映射关系
@@ -53,30 +64,30 @@ class LossBase(object):
raise TypeError("key_map must be `dict`, got {}.".format(type(key_map)))
for key, value in key_map.items():
if value is None:
- self.param_map[key] = key
+ self._param_map[key] = key
continue
if not isinstance(key, str):
raise TypeError(f"key in key_map must be `str`, not `{type(key)}`.")
if not isinstance(value, str):
raise TypeError(f"value in key_map must be `str`, not `{type(value)}`.")
- self.param_map[key] = value
+ self._param_map[key] = value
value_counter[value].add(key)
for key, value in kwargs.items():
if value is None:
- self.param_map[key] = key
+ self._param_map[key] = key
continue
if not isinstance(value, str):
raise TypeError(f"in {key}={value}, value must be `str`, not `{type(value)}`.")
- self.param_map[key] = value
+ self._param_map[key] = value
value_counter[value].add(key)
for value, key_set in value_counter.items():
if len(key_set) > 1:
raise ValueError(f"Several parameters:{key_set} are provided with one output {value}.")
- # check consistence between signature and param_map
+ # check consistence between signature and _param_map
func_spect = inspect.getfullargspec(self.get_loss)
func_args = [arg for arg in func_spect.args if arg != 'self']
- for func_param, input_param in self.param_map.items():
+ for func_param, input_param in self._param_map.items():
if func_param not in func_args:
raise NameError(
f"Parameter `{func_param}` is not in {_get_func_signature(self.get_loss)}. Please check the "
@@ -86,22 +97,7 @@ class LossBase(object):
# if func_spect.varargs:
# raise NameError(f"Delete `*{func_spect.varargs}` in {get_func_signature(self.get_loss)}(Do not use "
# f"positional argument.).")
-
- def _fast_param_map(self, pred_dict, target_dict):
- """Only used as inner function. When the pred_dict, target is unequivocal. Don't need users to pass key_map.
- such as pred_dict has one element, target_dict has one element
- :param pred_dict:
- :param target_dict:
- :return: dict, if dict is not {}, pass it to self.evaluate. Otherwise do mapping.
- """
- fast_param = {}
- if len(self.param_map) == 2 and len(pred_dict) == 1 and len(target_dict) == 1:
- fast_param['pred'] = list(pred_dict.values())[0]
- fast_param['target'] = list(target_dict.values())[0]
- return fast_param
- return fast_param
-
def __call__(self, pred_dict, target_dict, check=False):
"""
:param dict pred_dict: 模型的forward函数返回的dict
@@ -109,55 +105,43 @@ class LossBase(object):
:param Boolean check: 每一次执行映射函数的时候是否检查映射表,默认为不检查
:return:
"""
- fast_param = self._fast_param_map(pred_dict, target_dict)
- if fast_param:
- loss = self.get_loss(**fast_param)
- return loss
-
+
if not self._checked:
- # 1. check consistence between signature and param_map
+ # 1. check consistence between signature and _param_map
func_spect = inspect.getfullargspec(self.get_loss)
func_args = set([arg for arg in func_spect.args if arg != 'self'])
- for func_arg, input_arg in self.param_map.items():
+ for func_arg, input_arg in self._param_map.items():
if func_arg not in func_args:
raise NameError(f"`{func_arg}` not in {_get_func_signature(self.get_loss)}.")
- # 2. only part of the param_map are passed, left are not
+ # 2. only part of the _param_map are passed, left are not
for arg in func_args:
- if arg not in self.param_map:
- self.param_map[arg] = arg # This param does not need mapping.
+ if arg not in self._param_map:
+ self._param_map[arg] = arg # This param does not need mapping.
self._evaluate_args = func_args
- self._reverse_param_map = {input_arg: func_arg for func_arg, input_arg in self.param_map.items()}
-
- # need to wrap inputs in dict.
+ self._reverse_param_map = {input_arg: func_arg for func_arg, input_arg in self._param_map.items()}
+
mapped_pred_dict = {}
mapped_target_dict = {}
- duplicated = []
- for input_arg in set(list(pred_dict.keys()) + list(target_dict.keys())):
- not_duplicate_flag = 0
- if input_arg in self._reverse_param_map:
- mapped_arg = self._reverse_param_map[input_arg]
- not_duplicate_flag += 1
- else:
- mapped_arg = input_arg
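+ # 只处理映射表中涉及的key,将pred_dict/target_dict中的对应值映射到get_loss的参数名上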
+ for input_arg, mapped_arg in self._reverse_param_map.items():
if input_arg in pred_dict:
mapped_pred_dict[mapped_arg] = pred_dict[input_arg]
- not_duplicate_flag += 1
if input_arg in target_dict:
mapped_target_dict[mapped_arg] = target_dict[input_arg]
- not_duplicate_flag += 1
- if not_duplicate_flag == 3:
- duplicated.append(input_arg)
# missing
if not self._checked:
+ duplicated = []
+ for input_arg, mapped_arg in self._reverse_param_map.items():
+ if input_arg in pred_dict and input_arg in target_dict:
+ duplicated.append(input_arg)
check_res = _check_arg_dict_list(self.get_loss, [mapped_pred_dict, mapped_target_dict])
# replace missing.
missing = check_res.missing
replaced_missing = list(missing)
for idx, func_arg in enumerate(missing):
# Don't delete `` in this information, nor add ``
- replaced_missing[idx] = f"{self.param_map[func_arg]}" + f"(assign to `{func_arg}` " \
+ replaced_missing[idx] = f"{self._param_map[func_arg]}" + f"(assign to `{func_arg}` " \
f"in `{self.__class__.__name__}`)"
check_res = _CheckRes(missing=replaced_missing,
@@ -170,6 +154,8 @@ class LossBase(object):
if check_res.missing or check_res.duplicated:
raise _CheckError(check_res=check_res,
func_signature=_get_func_signature(self.get_loss))
+ self._checked = True
+
refined_args = _build_args(self.get_loss, **mapped_pred_dict, **mapped_target_dict)
loss = self.get_loss(**refined_args)
@@ -204,15 +190,11 @@ class LossFunc(LossBase):
super(LossFunc, self).__init__()
_check_function_or_method(func)
+ self.get_loss = func
if key_map is not None:
if not isinstance(key_map, dict):
raise RuntimeError(f"Loss error: key_map except a {type({})} but got a {type(key_map)}")
- self.param_map = key_map
- if len(kwargs) > 0:
- for key, val in kwargs.items():
- self.param_map.update({key: val})
-
- self.get_loss = func
+ self._init_param_map(key_map, **kwargs)
class CrossEntropyLoss(LossBase):
@@ -223,7 +205,10 @@ class CrossEntropyLoss(LossBase):
:param pred: 参数映射表中 `pred` 的映射关系,None表示映射关系为 `pred` -> `pred`
:param target: 参数映射表中 `target` 的映射关系,None表示映射关系为 `target` -> `target`
- :param padding_idx: padding的index,在计算loss时将忽略target中标号为padding_idx的内容
+ :param seq_len: 句子的长度, 长度之外的token不会计算loss。
+ :param padding_idx: padding的index,在计算loss时将忽略target中标号为padding_idx的内容, 可以通过该值代替
+ 传入seq_len.
+ :param str reduction: 支持 `mean` ,`sum` 和 `none` .
Example::
@@ -231,15 +216,25 @@ class CrossEntropyLoss(LossBase):
"""
- def __init__(self, pred=None, target=None, padding_idx=-100):
- # TODO 需要做一些检查,F.cross_entropy在计算时,如果pred是(16, 10 ,4), target的形状按道理应该是(16, 10), 但实际需要(16,4)
+ def __init__(self, pred=None, target=None, seq_len=None, padding_idx=-100, reduction='mean'):
super(CrossEntropyLoss, self).__init__()
- self._init_param_map(pred=pred, target=target)
+ self._init_param_map(pred=pred, target=target, seq_len=seq_len)
self.padding_idx = padding_idx
+ assert reduction in ('mean', 'sum', 'none')
+ self.reduction = reduction
- def get_loss(self, pred, target):
+ def get_loss(self, pred, target, seq_len=None):
+ if pred.dim() > 2:
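+ # pred为三维时(如序列标注输出的(batch, seq_len, num_classes)或(batch, num_classes, seq_len)),
+ # 先统一reshape成二维后再计算loss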
+ if pred.size(1) != target.size(1):
+ pred = pred.transpose(1, 2)
+ pred = pred.reshape(-1, pred.size(-1))
+ target = target.reshape(-1)
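+ # 传入seq_len时,将超出句子长度的位置置为padding_idx,使F.cross_entropy忽略这些token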
+ if seq_len is not None:
+ mask = seq_len_to_mask(seq_len).reshape(-1).eq(0)
+ target = target.masked_fill(mask, self.padding_idx)
+
return F.cross_entropy(input=pred, target=target,
- ignore_index=self.padding_idx)
+ ignore_index=self.padding_idx, reduction=self.reduction)
class L1Loss(LossBase):
@@ -250,15 +245,18 @@ class L1Loss(LossBase):
:param pred: 参数映射表中 `pred` 的映射关系,None表示映射关系为 `pred` -> `pred`
:param target: 参数映射表中 `target` 的映射关系,None表示映射关系为 `target` >`target`
+ :param str reduction: 支持'mean','sum'和'none'.
"""
- def __init__(self, pred=None, target=None):
+ def __init__(self, pred=None, target=None, reduction='mean'):
super(L1Loss, self).__init__()
self._init_param_map(pred=pred, target=target)
+ assert reduction in ('mean', 'sum', 'none')
+ self.reduction = reduction
def get_loss(self, pred, target):
- return F.l1_loss(input=pred, target=target)
+ return F.l1_loss(input=pred, target=target, reduction=self.reduction)
class BCELoss(LossBase):
@@ -267,16 +265,19 @@ class BCELoss(LossBase):
二分类交叉熵损失函数
- :param pred: 参数映射表中`pred`的映射关系,None表示映射关系为`pred`->`pred`
- :param target: 参数映射表中`target`的映射关系,None表示映射关系为`target`->`target`
+ :param pred: 参数映射表中 `pred` 的映射关系,None表示映射关系为 `pred` -> `pred`
+ :param target: 参数映射表中 `target` 的映射关系,None表示映射关系为 `target` -> `target`
+ :param str reduction: 支持 `mean` ,`sum` 和 `none` .
"""
- def __init__(self, pred=None, target=None):
+ def __init__(self, pred=None, target=None, reduction='mean'):
super(BCELoss, self).__init__()
self._init_param_map(pred=pred, target=target)
+ assert reduction in ('mean', 'sum', 'none')
+ self.reduction = reduction
def get_loss(self, pred, target):
- return F.binary_cross_entropy(input=pred, target=target)
+ return F.binary_cross_entropy(input=pred, target=target, reduction=self.reduction)
class NLLLoss(LossBase):
@@ -285,16 +286,22 @@ class NLLLoss(LossBase):
负对数似然损失函数
- :param pred: 参数映射表中`pred`的映射关系,None表示映射关系为`pred`->`pred`
- :param target: 参数映射表中`target`的映射关系,None表示映射关系为`target`->`target`
+ :param pred: 参数映射表中 `pred` 的映射关系,None表示映射关系为 `pred` -> `pred`
+ :param target: 参数映射表中 `target` 的映射关系,None表示映射关系为 `target` -> `target`
+ :param ignore_idx: ignore的index,在计算loss时将忽略target中标号为ignore_idx的内容。
+ :param str reduction: 支持 `mean` ,`sum` 和 `none` .
"""
- def __init__(self, pred=None, target=None):
+ def __init__(self, pred=None, target=None, ignore_idx=-100, reduction='mean'):
super(NLLLoss, self).__init__()
self._init_param_map(pred=pred, target=target)
+ assert reduction in ('mean', 'sum', 'none')
+ self.reduction = reduction
+ self.ignore_idx = ignore_idx
def get_loss(self, pred, target):
- return F.nll_loss(input=pred, target=target)
+ return F.nll_loss(input=pred, target=target, ignore_index=self.ignore_idx, reduction=self.reduction)
class LossInForward(LossBase):
@@ -306,7 +313,7 @@ class LossInForward(LossBase):
:param str loss_key: 在forward函数中loss的键名,默认为loss
"""
- def __init__(self, loss_key='loss'):
+ def __init__(self, loss_key=Const.LOSS):
super().__init__()
if not isinstance(loss_key, str):
raise TypeError(f"Only str allowed for loss_key, got {type(loss_key)}.")
diff --git a/fastNLP/core/metrics.py b/fastNLP/core/metrics.py
index 868d67b1..f23eab91 100644
--- a/fastNLP/core/metrics.py
+++ b/fastNLP/core/metrics.py
@@ -6,7 +6,7 @@ __all__ = [
"MetricBase",
"AccuracyMetric",
"SpanFPreRecMetric",
- "SQuADMetric"
+ "ExtractiveQAMetric"
]
import inspect
@@ -22,18 +22,19 @@ from .utils import _check_arg_dict_list
from .utils import _get_func_signature
from .utils import seq_len_to_mask
from .vocabulary import Vocabulary
+from abc import abstractmethod
class MetricBase(object):
"""
- 所有metrics的基类,,所有的传入到Trainer, Tester的Metric需要继承自该对象,需要覆盖写入evaluate(), get_metric()方法。
+ 所有metrics的基类,所有的传入到Trainer, Tester的Metric需要继承自该对象,需要覆盖写入evaluate(), get_metric()方法。
evaluate(xxx)中传入的是一个batch的数据。
get_metric(xxx)当所有数据处理完毕,调用该方法得到最终的metric值
以分类问题中,Accuracy计算为例
- 假设model的forward返回dict中包含'pred'这个key, 并且该key需要用于Accuracy::
+ 假设model的forward返回dict中包含 `pred` 这个key, 并且该key需要用于Accuracy::
class Model(nn.Module):
def __init__(xxx):
@@ -42,7 +43,7 @@ class MetricBase(object):
# do something
return {'pred': pred, 'other_keys':xxx} # pred's shape: batch_size x num_classes
- 假设dataset中'label'这个field是需要预测的值,并且该field被设置为了target
+ 假设dataset中 `label` 这个field是需要预测的值,并且该field被设置为了target
对应的AccMetric可以按如下的定义, version1, 只使用这一次::
class AccMetric(MetricBase):
@@ -115,17 +116,28 @@ class MetricBase(object):
"""
def __init__(self):
- self.param_map = {} # key is param in function, value is input param.
+ self._param_map = {} # key is param in function, value is input param.
self._checked = False
-
+
+ @property
+ def param_map(self):
+ if len(self._param_map) == 0: # 如果为空说明还没有初始化
+ func_spect = inspect.getfullargspec(self.evaluate)
+ func_args = [arg for arg in func_spect.args if arg != 'self']
+ for arg in func_args:
+ self._param_map[arg] = arg
+ return self._param_map
+
+ @abstractmethod
def evaluate(self, *args, **kwargs):
raise NotImplementedError
-
+
+ @abstractmethod
def get_metric(self, reset=True):
raise NotImplemented
def _init_param_map(self, key_map=None, **kwargs):
- """检查key_map和其他参数map,并将这些映射关系添加到self.param_map
+ """检查key_map和其他参数map,并将这些映射关系添加到self._param_map
:param dict key_map: 表示key的映射关系
:param kwargs: key word args里面的每一个的键-值对都会被构造成映射关系
@@ -137,30 +149,30 @@ class MetricBase(object):
raise TypeError("key_map must be `dict`, got {}.".format(type(key_map)))
for key, value in key_map.items():
if value is None:
- self.param_map[key] = key
+ self._param_map[key] = key
continue
if not isinstance(key, str):
raise TypeError(f"key in key_map must be `str`, not `{type(key)}`.")
if not isinstance(value, str):
raise TypeError(f"value in key_map must be `str`, not `{type(value)}`.")
- self.param_map[key] = value
+ self._param_map[key] = value
value_counter[value].add(key)
for key, value in kwargs.items():
if value is None:
- self.param_map[key] = key
+ self._param_map[key] = key
continue
if not isinstance(value, str):
raise TypeError(f"in {key}={value}, value must be `str`, not `{type(value)}`.")
- self.param_map[key] = value
+ self._param_map[key] = value
value_counter[value].add(key)
for value, key_set in value_counter.items():
if len(key_set) > 1:
raise ValueError(f"Several parameters:{key_set} are provided with one output {value}.")
- # check consistence between signature and param_map
+ # check consistence between signature and _param_map
func_spect = inspect.getfullargspec(self.evaluate)
func_args = [arg for arg in func_spect.args if arg != 'self']
- for func_param, input_param in self.param_map.items():
+ for func_param, input_param in self._param_map.items():
if func_param not in func_args:
raise NameError(
f"Parameter `{func_param}` is not in {_get_func_signature(self.evaluate)}. Please check the "
@@ -175,7 +187,7 @@ class MetricBase(object):
:return: dict, if dict is not {}, pass it to self.evaluate. Otherwise do mapping.
"""
fast_param = {}
- if len(self.param_map) == 2 and len(pred_dict) == 1 and len(target_dict) == 1:
+ if len(self._param_map) == 2 and len(pred_dict) == 1 and len(target_dict) == 1:
fast_param['pred'] = list(pred_dict.values())[0]
fast_param['target'] = list(target_dict.values())[0]
return fast_param
@@ -204,42 +216,35 @@ class MetricBase(object):
if not self._checked:
if not callable(self.evaluate):
raise TypeError(f"{self.__class__.__name__}.evaluate has to be callable, not {type(self.evaluate)}.")
- # 1. check consistence between signature and param_map
+ # 1. check consistence between signature and _param_map
func_spect = inspect.getfullargspec(self.evaluate)
func_args = set([arg for arg in func_spect.args if arg != 'self'])
- for func_arg, input_arg in self.param_map.items():
+ for func_arg, input_arg in self._param_map.items():
if func_arg not in func_args:
raise NameError(f"`{func_arg}` not in {_get_func_signature(self.evaluate)}.")
- # 2. only part of the param_map are passed, left are not
+ # 2. only part of the _param_map are passed, left are not
for arg in func_args:
- if arg not in self.param_map:
- self.param_map[arg] = arg # This param does not need mapping.
+ if arg not in self._param_map:
+ self._param_map[arg] = arg # This param does not need mapping.
self._evaluate_args = func_args
- self._reverse_param_map = {input_arg: func_arg for func_arg, input_arg in self.param_map.items()}
+ self._reverse_param_map = {input_arg: func_arg for func_arg, input_arg in self._param_map.items()}
# need to wrap inputs in dict.
mapped_pred_dict = {}
mapped_target_dict = {}
- duplicated = []
- for input_arg in set(list(pred_dict.keys()) + list(target_dict.keys())):
- not_duplicate_flag = 0
- if input_arg in self._reverse_param_map:
- mapped_arg = self._reverse_param_map[input_arg]
- not_duplicate_flag += 1
- else:
- mapped_arg = input_arg
+ for input_arg, mapped_arg in self._reverse_param_map.items():
if input_arg in pred_dict:
mapped_pred_dict[mapped_arg] = pred_dict[input_arg]
- not_duplicate_flag += 1
if input_arg in target_dict:
mapped_target_dict[mapped_arg] = target_dict[input_arg]
- not_duplicate_flag += 1
- if not_duplicate_flag == 3:
- duplicated.append(input_arg)
# missing
if not self._checked:
+ duplicated = []
+ for input_arg, mapped_arg in self._reverse_param_map.items():
+ if input_arg in pred_dict and input_arg in target_dict:
+ duplicated.append(input_arg)
check_res = _check_arg_dict_list(self.evaluate, [mapped_pred_dict, mapped_target_dict])
# only check missing.
# replace missing.
@@ -247,7 +252,7 @@ class MetricBase(object):
replaced_missing = list(missing)
for idx, func_arg in enumerate(missing):
# Don't delete `` in this information, nor add ``
- replaced_missing[idx] = f"{self.param_map[func_arg]}" + f"(assign to `{func_arg}` " \
+ replaced_missing[idx] = f"{self._param_map[func_arg]}" + f"(assign to `{func_arg}` " \
f"in `{self.__class__.__name__}`)"
check_res = _CheckRes(missing=replaced_missing,
@@ -260,10 +265,10 @@ class MetricBase(object):
if check_res.missing or check_res.duplicated:
raise _CheckError(check_res=check_res,
func_signature=_get_func_signature(self.evaluate))
+ self._checked = True
refined_args = _build_args(self.evaluate, **mapped_pred_dict, **mapped_target_dict)
self.evaluate(**refined_args)
- self._checked = True
return
@@ -409,6 +414,37 @@ def _bmeso_tag_to_spans(tags, ignore_labels=None):
]
+def _bioes_tag_to_spans(tags, ignore_labels=None):
+ """
+ 给定一个tags的list,比如['O', 'B-singer', 'I-singer', 'E-singer', 'O', 'O']。
+ 返回[('singer', (1, 4))] (左闭右开区间)
+
+ :param tags: List[str],
+ :param ignore_labels: List[str], 在该list中的label将被忽略
+ :return: List[Tuple[str, Tuple[int, int]]]. [(label, (start, end))]
+ """
+ ignore_labels = set(ignore_labels) if ignore_labels else set()
+
+ spans = []
+ prev_bioes_tag = None
+ for idx, tag in enumerate(tags):
+ tag = tag.lower()
+ bioes_tag, label = tag[:1], tag[2:]
+ if bioes_tag in ('b', 's'):
+ spans.append((label, [idx, idx]))
+ elif bioes_tag in ('i', 'e') and prev_bioes_tag in ('b', 'i') and label == spans[-1][0]:
+ spans[-1][1][1] = idx
+ elif bioes_tag == 'o':
+ pass
+ else:
+ spans.append((label, [idx, idx]))
+ prev_bioes_tag = bioes_tag
+ return [(span[0], (span[1][0], span[1][1] + 1))
+ for span in spans
+ if span[0] not in ignore_labels
+ ]
+
+
def _bio_tag_to_spans(tags, ignore_labels=None):
"""
给定一个tags的lis,比如['O', 'B-singer', 'I-singer', 'I-singer', 'O', 'O']。
@@ -442,7 +478,7 @@ class SpanFPreRecMetric(MetricBase):
别名::class:`fastNLP.SpanFPreRecMetric` :class:`fastNLP.core.metrics.SpanFPreRecMetric`
在序列标注问题中,以span的方式计算F, pre, rec.
- 比如中文Part of speech中,会以character的方式进行标注,句子'中国在亚洲'对应的POS可能为(以BMES为例)
+ 比如中文Part of speech中,会以character的方式进行标注,句子 `中国在亚洲` 对应的POS可能为(以BMES为例)
['B-NN', 'E-NN', 'S-DET', 'B-NN', 'E-NN']。该metric就是为类似情况下的F1计算。
最后得到的metric结果为::
@@ -466,15 +502,15 @@ class SpanFPreRecMetric(MetricBase):
:param tag_vocab: 标签的 :class:`~fastNLP.Vocabulary` 。支持的标签为"B"(没有label);或"B-xxx"(xxx为某种label,比如POS中的NN),
在解码时,会将相同xxx的认为是同一个label,比如['B-NN', 'E-NN']会被合并为一个'NN'.
- :param str pred: 用该key在evaluate()时从传入dict中取出prediction数据。 为None,则使用'pred'取数据
- :param str target: 用该key在evaluate()时从传入dict中取出target数据。 为None,则使用'target'取数据
- :param str seq_len: 用该key在evaluate()时从传入dict中取出sequence length数据。为None,则使用'seq_len'取数据。
- :param str encoding_type: 目前支持bio, bmes
+ :param str pred: 用该key在evaluate()时从传入dict中取出prediction数据。 为None,则使用 `pred` 取数据
+ :param str target: 用该key在evaluate()时从传入dict中取出target数据。 为None,则使用 `target` 取数据
+ :param str seq_len: 用该key在evaluate()时从传入dict中取出sequence length数据。为None,则使用 `seq_len` 取数据。
+ :param str encoding_type: 目前支持bio, bmes, bmeso, bioes
:param list ignore_labels: str 组成的list. 这个list中的class不会被用于计算。例如在POS tagging时传入['NN'],则不会计算'NN'这
个label
:param bool only_gross: 是否只计算总的f1, precision, recall的值;如果为False,不仅返回总的f1, pre, rec, 还会返回每个
label的f1, pre, rec
- :param str f_type: 'micro'或'macro'. 'micro':通过先计算总体的TP,FN和FP的数量,再计算f, precision, recall; 'macro':
+ :param str f_type: `micro` 或 `macro` . `micro` :通过先计算总体的TP,FN和FP的数量,再计算f, precision, recall; `macro` :
分别计算每个类别的f, precision, recall,然后做平均(各类别f的权重相同)
:param float beta: f_beta分数, :math:`f_{beta} = \frac{(1 + {beta}^{2})*(pre*rec)}{({beta}^{2}*pre + rec)}` .
常用为beta=0.5, 1, 2. 若为0.5则精确率的权重高于召回率;若为1,则两者平等;若为2,则召回率权重高于精确率。
@@ -497,6 +533,8 @@ class SpanFPreRecMetric(MetricBase):
self.tag_to_span_func = _bio_tag_to_spans
elif self.encoding_type == 'bmeso':
self.tag_to_span_func = _bmeso_tag_to_spans
+ elif self.encoding_type == 'bioes':
+ self.tag_to_span_func = _bioes_tag_to_spans
else:
raise ValueError("Only support 'bio', 'bmes', 'bmeso' type.")
@@ -698,11 +736,11 @@ def _pred_topk(y_prob, k=1):
return y_pred_topk, y_prob_topk
-class SQuADMetric(MetricBase):
+class ExtractiveQAMetric(MetricBase):
r"""
- 别名::class:`fastNLP.SQuADMetric` :class:`fastNLP.core.metrics.SQuADMetric`
+ 别名::class:`fastNLP.ExtractiveQAMetric` :class:`fastNLP.core.metrics.ExtractiveQAMetric`
- SQuAD数据集metric
+ 抽取式QA(如SQuAD)的metric.
:param pred1: 参数映射表中 `pred1` 的映射关系,None表示映射关系为 `pred1` -> `pred1`
:param pred2: 参数映射表中 `pred2` 的映射关系,None表示映射关系为 `pred2` -> `pred2`
@@ -718,7 +756,7 @@ class SQuADMetric(MetricBase):
def __init__(self, pred1=None, pred2=None, target1=None, target2=None,
beta=1, right_open=True, print_predict_stat=False):
- super(SQuADMetric, self).__init__()
+ super(ExtractiveQAMetric, self).__init__()
self._init_param_map(pred1=pred1, pred2=pred2, target1=target1, target2=target2)
diff --git a/fastNLP/core/optimizer.py b/fastNLP/core/optimizer.py
index ef619042..3036257c 100644
--- a/fastNLP/core/optimizer.py
+++ b/fastNLP/core/optimizer.py
@@ -5,10 +5,14 @@ optimizer 模块定义了 fastNLP 中所需的各种优化器,一般做为 :cl
__all__ = [
"Optimizer",
"SGD",
- "Adam"
+ "Adam",
+ "AdamW"
]
import torch
+import math
+import torch
+from torch.optim.optimizer import Optimizer as TorchOptimizer
class Optimizer(object):
@@ -36,6 +40,23 @@ class Optimizer(object):
"""
return [param for param in params if param.requires_grad]
+class NullOptimizer(Optimizer):
+ """
+ 当不希望Trainer更新optimizer时,传入本optimizer,但请确保通过callback的方式对参数进行了更新。
+
+ """
+ def __init__(self):
+ super().__init__(None)
+
+ def construct_from_pytorch(self, model_params):
+ pass
+
+ def __getattr__(self, item):
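+ # 任意方法调用(如step()、zero_grad())都会得到一个空操作函数,从而跳过参数更新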
+ def pass_func(*args, **kwargs):
+ pass
+
+ return pass_func
+
class SGD(Optimizer):
"""
@@ -80,3 +101,117 @@ class Adam(Optimizer):
return torch.optim.Adam(self._get_require_grads_param(model_params), **self.settings)
else:
return torch.optim.Adam(self._get_require_grads_param(self.model_params), **self.settings)
+
+
+class AdamW(TorchOptimizer):
+ r"""
+ 别名::class:`fastNLP.AdamW` :class:`fastNLP.core.optimizer.AdamW`
+
+ 对AdamW的实现,该实现应该会在pytorch更高版本中出现(见 https://github.com/pytorch/pytorch/pull/21250 ),这里提前加入。
+
+ .. todo::
+ 翻译成中文
+
+ The original Adam algorithm was proposed in `Adam: A Method for Stochastic Optimization`_.
+ The AdamW variant was proposed in `Decoupled Weight Decay Regularization`_.
+
+ :param params (iterable): iterable of parameters to optimize or dicts defining
+ parameter groups
+ :param lr (float, optional): learning rate (default: 1e-3)
+ :param betas (Tuple[float, float], optional): coefficients used for computing
+ running averages of gradient and its square (default: (0.9, 0.999))
+ :param eps (float, optional): term added to the denominator to improve
+ numerical stability (default: 1e-8)
+ :param weight_decay (float, optional): weight decay coefficient (default: 1e-2)
+ :param amsgrad (boolean, optional): whether to use the AMSGrad variant of this
+ algorithm from the paper `On the Convergence of Adam and Beyond`_
+ (default: False)
+
+ .. _Adam\: A Method for Stochastic Optimization:
+ https://arxiv.org/abs/1412.6980
+ .. _Decoupled Weight Decay Regularization:
+ https://arxiv.org/abs/1711.05101
+ .. _On the Convergence of Adam and Beyond:
+ https://openreview.net/forum?id=ryQu7f-RZ
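+
+ 用法示意(参数值仅为示例)::
+
+ optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)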
+ """
+
+ def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
+ weight_decay=1e-2, amsgrad=False):
+ if not 0.0 <= lr:
+ raise ValueError("Invalid learning rate: {}".format(lr))
+ if not 0.0 <= eps:
+ raise ValueError("Invalid epsilon value: {}".format(eps))
+ if not 0.0 <= betas[0] < 1.0:
+ raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
+ if not 0.0 <= betas[1] < 1.0:
+ raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
+ defaults = dict(lr=lr, betas=betas, eps=eps,
+ weight_decay=weight_decay, amsgrad=amsgrad)
+ super(AdamW, self).__init__(params, defaults)
+
+ def __setstate__(self, state):
+ super(AdamW, self).__setstate__(state)
+ for group in self.param_groups:
+ group.setdefault('amsgrad', False)
+
+ def step(self, closure=None):
+ """Performs a single optimization step.
+
+ :param closure: (callable, optional) A closure that reevaluates the model
+ and returns the loss.
+ """
+ loss = None
+ if closure is not None:
+ loss = closure()
+
+ for group in self.param_groups:
+ for p in group['params']:
+ if p.grad is None:
+ continue
+
+ # Perform stepweight decay
+ p.data.mul_(1 - group['lr'] * group['weight_decay'])
+
+ # Perform optimization step
+ grad = p.grad.data
+ if grad.is_sparse:
+ raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
+ amsgrad = group['amsgrad']
+
+ state = self.state[p]
+
+ # State initialization
+ if len(state) == 0:
+ state['step'] = 0
+ # Exponential moving average of gradient values
+ state['exp_avg'] = torch.zeros_like(p.data)
+ # Exponential moving average of squared gradient values
+ state['exp_avg_sq'] = torch.zeros_like(p.data)
+ if amsgrad:
+ # Maintains max of all exp. moving avg. of sq. grad. values
+ state['max_exp_avg_sq'] = torch.zeros_like(p.data)
+
+ exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
+ if amsgrad:
+ max_exp_avg_sq = state['max_exp_avg_sq']
+ beta1, beta2 = group['betas']
+
+ state['step'] += 1
+
+ # Decay the first and second moment running average coefficient
+ exp_avg.mul_(beta1).add_(1 - beta1, grad)
+ exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
+ if amsgrad:
+ # Maintains the maximum of all 2nd moment running avg. till now
+ torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
+ # Use the max. for normalizing running avg. of gradient
+ denom = max_exp_avg_sq.sqrt().add_(group['eps'])
+ else:
+ denom = exp_avg_sq.sqrt().add_(group['eps'])
+
+ bias_correction1 = 1 - beta1 ** state['step']
+ bias_correction2 = 1 - beta2 ** state['step']
+ step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
+
+ p.data.addcdiv_(-step_size, exp_avg, denom)
+
+ return loss
diff --git a/fastNLP/core/predictor.py b/fastNLP/core/predictor.py
index 4f37e105..2d6a7380 100644
--- a/fastNLP/core/predictor.py
+++ b/fastNLP/core/predictor.py
@@ -6,20 +6,20 @@ from collections import defaultdict
import torch
-from . import Batch
+from . import DataSetIter
from . import DataSet
from . import SequentialSampler
-from .utils import _build_args
+from .utils import _build_args, _move_dict_value_to_device, _get_model_device
class Predictor(object):
"""
- An interface for predicting outputs based on trained models.
+ 一个根据训练模型预测输出的预测器(Predictor)
- It does not care about evaluations of the model, which is different from Tester.
- This is a high-level model wrapper to be called by FastNLP.
- This class does not share any operations with Trainer and Tester.
- Currently, Predictor does not support GPU.
+ 与测试器(Tester)不同的是,predictor不关心模型性能的评价指标,只做inference。
+ 这是一个fastNLP调用的高级模型包装器。它与Trainer、Tester不共享任何操作。
+
+ :param torch.nn.Module network: 用来完成预测任务的模型
"""
def __init__(self, network):
@@ -30,22 +30,23 @@ class Predictor(object):
self.batch_size = 1
self.batch_output = []
- def predict(self, data, seq_len_field_name=None):
- """Perform inference using the trained model.
+ def predict(self, data: DataSet, seq_len_field_name=None):
+ """用已经训练好的模型进行inference.
- :param data: a DataSet object.
- :param str seq_len_field_name: field name indicating sequence lengths
- :return: list of batch outputs
+ :param fastNLP.DataSet data: 待预测的数据集
+ :param str seq_len_field_name: 表示序列长度信息的field名字
+ :return: dict dict里面的内容为模型预测的结果
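+
+ 用法示意(假设model与data已按fastNLP的要求准备好)::
+
+ predictor = Predictor(model)
+ results = predictor.predict(data) # dict的key为模型predict()/forward()返回dict中的key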
"""
if not isinstance(data, DataSet):
raise ValueError("Only Dataset class is allowed, not {}.".format(type(data)))
if seq_len_field_name is not None and seq_len_field_name not in data.field_arrays:
raise ValueError("Field name {} not found in DataSet {}.".format(seq_len_field_name, data))
+ prev_training = self.network.training
self.network.eval()
+ network_device = _get_model_device(self.network)
batch_output = defaultdict(list)
- data_iterator = Batch(data, batch_size=self.batch_size, sampler=SequentialSampler(), as_numpy=False,
- prefetch=False)
+ data_iterator = DataSetIter(data, batch_size=self.batch_size, sampler=SequentialSampler(), as_numpy=False)
if hasattr(self.network, "predict"):
predict_func = self.network.predict
@@ -54,6 +55,7 @@ class Predictor(object):
with torch.no_grad():
for batch_x, _ in data_iterator:
+ _move_dict_value_to_device(batch_x, _, device=network_device)
refined_batch_x = _build_args(predict_func, **batch_x)
prediction = predict_func(**refined_batch_x)
@@ -73,4 +75,5 @@ class Predictor(object):
else:
batch_output[key].append(value)
+ self.network.train(prev_training)
return batch_output
diff --git a/fastNLP/core/sampler.py b/fastNLP/core/sampler.py
index c5784f59..d8ba1ad1 100644
--- a/fastNLP/core/sampler.py
+++ b/fastNLP/core/sampler.py
@@ -62,16 +62,27 @@ class BucketSampler(Sampler):
带Bucket的 `Random Sampler`. 可以随机地取出长度相似的元素
:param int num_buckets: bucket的数量
- :param int batch_size: batch的大小
+ :param int batch_size: batch的大小. 默认为None,Trainer在调用BucketSampler时,会将该值正确设置,如果是非Trainer场景使用,需
+ 要显式传递该值
:param str seq_len_field_name: 对应序列长度的 `field` 的名字
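+
+ 用法示意(非Trainer场景需要显式传入batch_size,这里假设data_set中已有'seq_len'这个field)::
+
+ sampler = BucketSampler(num_buckets=10, batch_size=32, seq_len_field_name='seq_len')
+ indices = sampler(data_set) # 返回按长度分桶后打乱的index列表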
"""
- def __init__(self, num_buckets=10, batch_size=32, seq_len_field_name='seq_len'):
+ def __init__(self, num_buckets=10, batch_size=None, seq_len_field_name='seq_len'):
self.num_buckets = num_buckets
self.batch_size = batch_size
self.seq_len_field_name = seq_len_field_name
-
+
+ def set_batch_size(self, batch_size):
+ """
+
+ :param int batch_size: 每个batch的大小
+ :return:
+ """
+ self.batch_size = batch_size
+
def __call__(self, data_set):
+ if self.batch_size is None:
+ raise RuntimeError("batch_size is None.")
seq_lens = data_set.get_all_fields()[self.seq_len_field_name].content
total_sample_num = len(seq_lens)
diff --git a/fastNLP/core/tester.py b/fastNLP/core/tester.py
index 883e0d01..c1d270d1 100644
--- a/fastNLP/core/tester.py
+++ b/fastNLP/core/tester.py
@@ -1,7 +1,7 @@
"""
tester模块实现了 fastNLP 所需的Tester类,能在提供数据、模型以及metric的情况下进行性能测试。
-Example::
+.. code-block::
import numpy as np
import torch
@@ -32,12 +32,10 @@ Tester在验证进行之前会调用model.eval()提示当前进入了evaluation
"""
-import warnings
-
import torch
import torch.nn as nn
-from .batch import Batch
+from .batch import BatchIter, DataSetIter
from .dataset import DataSet
from .metrics import _prepare_metrics
from .sampler import SequentialSampler
@@ -48,6 +46,8 @@ from .utils import _move_dict_value_to_device
from .utils import _get_func_signature
from .utils import _get_model_device
from .utils import _move_model_to_device
+from ._parallel_utils import _data_parallel_wrapper
+from functools import partial
__all__ = [
"Tester"
@@ -60,15 +60,14 @@ class Tester(object):
Tester是在提供数据,模型以及metric的情况下进行性能测试的类。需要传入模型,数据以及metric进行验证。
- :param data: 需要测试的数据集, :class:`~fastNLP.DataSet` 类型
+ :param ~fastNLP.DataSet data: 需要测试的数据集
:param torch.nn.module model: 使用的模型
- :param metrics: :class:`~fastNLP.core.metrics.MetricBase` 或者一个列表的 :class:`~fastNLP.core.metrics.MetricBase`
+ :param ~fastNLP.core.metrics.MetricBase,List[~fastNLP.core.metrics.MetricBase] metrics: 测试时使用的metrics
:param int batch_size: evaluation时使用的batch_size有多大。
:param str,int,torch.device,list(int) device: 将模型load到哪个设备。默认为None,即Trainer不对模型
的计算位置进行管理。支持以下的输入:
- 1. str: ['cpu', 'cuda', 'cuda:0', 'cuda:1', ...] 依次为'cpu'中, 可见的第一个GPU中, 可见的第一个GPU中,
- 可见的第二个GPU中;
+ 1. str: ['cpu', 'cuda', 'cuda:0', 'cuda:1', ...] 依次为'cpu'中, 可见的第一个GPU中,可见的第一个GPU中,可见的第二个GPU中;
2. torch.device:将模型装载到torch.device上。
@@ -82,7 +81,7 @@ class Tester(object):
:param int verbose: 如果为0不输出任何信息; 如果为1,打印出验证结果。
"""
- def __init__(self, data, model, metrics, batch_size=16, device=None, verbose=1):
+ def __init__(self, data, model, metrics, batch_size=16, num_workers=0, device=None, verbose=1):
super(Tester, self).__init__()
if not isinstance(data, DataSet):
@@ -96,23 +95,35 @@ class Tester(object):
self._model = _move_model_to_device(model, device=device)
self.batch_size = batch_size
self.verbose = verbose
-
- # 如果是DataParallel将没有办法使用predict方法
- if isinstance(self._model, nn.DataParallel):
- if hasattr(self._model.module, 'predict') and not hasattr(self._model, 'predict'):
- warnings.warn("Cannot use DataParallel to test your model, because your model offer predict() function,"
- " while DataParallel has no predict() function.")
- self._model = self._model.module
-
- # check predict
- if hasattr(self._model, 'predict'):
- self._predict_func = self._model.predict
- if not callable(self._predict_func):
- _model_name = model.__class__.__name__
- raise TypeError(f"`{_model_name}.predict` must be callable to be used "
- f"for evaluation, not `{type(self._predict_func)}`.")
+
+ if isinstance(data, DataSet):
+ self.data_iterator = DataSetIter(
+ dataset=data, batch_size=batch_size, num_workers=num_workers, sampler=SequentialSampler())
+ elif isinstance(data, BatchIter):
+ self.data_iterator = data
else:
- self._predict_func = self._model.forward
+ raise TypeError("data type {} not support".format(type(data)))
+
+ # check predict
+ if (hasattr(self._model, 'predict') and callable(self._model.predict)) or \
+ (isinstance(self._model, nn.DataParallel) and hasattr(self._model.module, 'predict') and
+ callable(self._model.module.predict)):
+ if isinstance(self._model, nn.DataParallel):
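+ # DataParallel包装的模型本身没有predict方法,这里通过_data_parallel_wrapper将module.predict并行到各卡上调用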
+ self._predict_func_wrapper = partial(_data_parallel_wrapper('predict',
+ self._model.device_ids,
+ self._model.output_device),
+ network=self._model.module)
+ self._predict_func = self._model.module.predict
+ else:
+ self._predict_func = self._model.predict
+ self._predict_func_wrapper = self._model.predict
+ else:
+ if isinstance(self._model, nn.DataParallel):
+ self._predict_func_wrapper = self._model.forward
+ self._predict_func = self._model.module.forward
+ else:
+ self._predict_func = self._model.forward
+ self._predict_func_wrapper = self._model.forward
def test(self):
"""开始进行验证,并返回验证结果。
@@ -124,7 +135,7 @@ class Tester(object):
self._model_device = _get_model_device(self._model)
network = self._model
self._mode(network, is_test=True)
- data_iterator = Batch(self.data, self.batch_size, sampler=SequentialSampler(), as_numpy=False)
+ data_iterator = self.data_iterator
eval_results = {}
try:
with torch.no_grad():
@@ -169,7 +180,7 @@ class Tester(object):
def _data_forward(self, func, x):
"""A forward pass of the model. """
x = _build_args(func, **x)
- y = func(**x)
+ y = self._predict_func_wrapper(**x)
return y
def _format_eval_results(self, results):
diff --git a/fastNLP/core/trainer.py b/fastNLP/core/trainer.py
index 40e5a5c1..671e2736 100644
--- a/fastNLP/core/trainer.py
+++ b/fastNLP/core/trainer.py
@@ -11,288 +11,310 @@ Trainer在fastNLP中用于组织单任务的训练过程,可以避免用户在
(5) 保存获得更好验证性能的模型。
-1 Trainer的基本使用
- 下面的例子是使用神经网络来进行预测一个序列中是否有偶数个1。
- Example::
+----------------------------
+1. Trainer的基本使用
+----------------------------
- import numpy as np
- from torch import nn
- import torch
- import torch.nn.functional as F
- from torch.optim import SGD
+下面的例子是使用神经网络来进行预测一个序列中是否有偶数个1。
- from fastNLP import DataSet
- from fastNLP import Trainer
- from fastNLP import CrossEntropyLoss
- from fastNLP import AccuracyMetric
- from fastNLP.modules.decoder import MLP
+.. code-block:: python
- # 模型
- class Model(nn.Module):
- def __init__(self, input_num):
- super().__init__()
- self.fcs = MLP([input_num, 40, 40, 2], 'relu')
+ import numpy as np
+ from torch import nn
+ import torch
+ import torch.nn.functional as F
+ from torch.optim import SGD
- def forward(self, x):
- x = self.fcs(x)
- return {'pred': x}
- model = Model(10)
+ from fastNLP import DataSet
+ from fastNLP import Trainer
+ from fastNLP import CrossEntropyLoss
+ from fastNLP import AccuracyMetric
+ from fastNLP.modules.decoder import MLP
- # 生成数据
- def generate_psedo_dataset(num_samples):
- dataset = DataSet()
- data = np.random.randint(2, size=(num_samples, 10))
- label = np.sum(data, axis=1)%2
- dataset = DataSet({'x':data.astype(float), 'label': label})
- dataset.set_input('x')
- dataset.set_target('label')
- return dataset
- tr_dataset = generate_psedo_dataset(1000)
- dev_data = generate_psedo_dataset(100)
+ # 模型
+ class Model(nn.Module):
+ def __init__(self, input_num):
+ super().__init__()
+ self.fcs = MLP([input_num, 40, 40, 2], 'relu')
- # 训练
- trainer = Trainer(tr_dataset, model, loss=CrossEntropyLoss(target='label'),
- optimizer=SGD(model.parameters(), lr=0.1),n_epochs=1000,
- dev_data = dev_data, metrics=AccuracyMetric(target='label'))
- trainer.train()
+ def forward(self, x):
+ x = self.fcs(x)
+ return {'pred': x}
+ model = Model(10)
- 由上面的例子可以看出通过使用Trainer,可以使得训练部分的代码大幅减少。
- 使用Trainer需要满足以下几个条件:
+ # 生成数据
+ def generate_psedo_dataset(num_samples):
+ dataset = DataSet()
+ data = np.random.randint(2, size=(num_samples, 10))
+ label = np.sum(data, axis=1)%2
+ dataset = DataSet({'x':data.astype(float), 'label': label})
+ dataset.set_input('x')
+ dataset.set_target('label')
+ return dataset
+ tr_dataset = generate_psedo_dataset(1000)
+ dev_data = generate_psedo_dataset(100)
+
+ # 训练
+ trainer = Trainer(tr_dataset, model, loss=CrossEntropyLoss(target='label'),
+ optimizer=SGD(model.parameters(), lr=0.1),n_epochs=1000,
+ dev_data = dev_data, metrics=AccuracyMetric(target='label'))
+ trainer.train()
+
+由上面的例子可以看出通过使用Trainer,可以使得训练部分的代码大幅减少。
+使用Trainer需要满足以下几个条件:
1.1 模型
- 1 模型的forward()的参数名需要与DataSet中的名字对应。实际上fastNLP在将DataSet中的数据传递给模型forward()时,是
- 通过匹配名称实现的。所以上例中,如果Model的forward函数修改为forward(self, data), 则DataSet中的'x'这个field就应该
- 改名为'data'。
+----------------------------
- 2 传递给forward()的参数是DataSet中被设置为input的那些field。但如果forward()中没有对应的参数,则不会将数据传递
- 给forward()。例如,DataSet中'x1', 'x2'都是input,但是模型的函数为forward(self, x1), 那么'x2'不会传递给forward()。
+1 模型的forward()的参数名需要与DataSet中的名字对应。实际上fastNLP在将DataSet中的数据传递给模型forward()时,是
+通过匹配名称实现的。所以上例中,如果Model的forward函数修改为forward(self, data), 则DataSet中的'x'这个field就应该
+改名为'data'。
- 3 模型的forward()返回值需要为一个dict。
+2 传递给forward()的参数是DataSet中被设置为input的那些field。但如果forward()中没有对应的参数,则不会将数据传递
+给forward()。例如,DataSet中'x1', 'x2'都是input,但是模型的函数为forward(self, x1), 那么'x2'不会传递给forward()。
+
+3 模型的forward()返回值需要为一个dict。
1.2 Loss
- fastNLP中的为了不限制forward函数的返回内容数量(比如一些复杂任务需要返回多个内容,如Dependency Parsing,
- :mod:`Loss` 与 :mod:`Metric` 都使用了通过名称来匹配相应内容的策略。如上面的例子中
+----------------------------
- Example::
+fastNLP为了不限制forward函数的返回内容数量(比如一些复杂任务需要返回多个内容,如Dependency Parsing),
+:mod:`Loss` 与 :mod:`Metric` 都使用了通过名称来匹配相应内容的策略。如上面的例子中
- trainer = Trainer(tr_dataset, model, loss=CrossEntropyLoss(target='label'),
- optimizer=SGD(model.parameters(), lr=0.1),n_epochs=1000,
- dev_data = dev_data, metrics=AccuracyMetric(target='label'))
+.. code-block:: python
- loss被设置为了 :class:`~fastNLP.CrossEntropyLoss` , 但在初始化的时候传入了target='label'这个参数,
- :class:`~fastNLP.CrossEntropyLoss` 的初始化参数为(pred=None, target=None, padding_idx=-100)。
-
- 这里的两个参数分别为计算CrossEntropy时需要使用到的模型的预测值与真实值。
- 其中 `pred` 一般来自于模型forward()的返回结果,`target` 一般是来自于DataSet中被设置为target的field。
- 由于每个人对真实值或者model的返回值取名并不一样,所以fastNLP的 :mod:`Loss` 提供一种类似于映射的机制来匹配对应的值,
- 比如这里 :class:`~fastNLP.CrossEntropyLoss` 将尝试找到名为'label'的内容来作为真实值得到loss;
- 而pred=None, 则 :class:`~fastNLP.CrossEntropyLoss` 使用'pred'作为名称匹配预测值,
- 正好forward的返回值也叫pred,所以这里不需要申明pred。
+ trainer = Trainer(tr_dataset, model, loss=CrossEntropyLoss(target='label'),
+ optimizer=SGD(model.parameters(), lr=0.1),n_epochs=1000,
+ dev_data = dev_data, metrics=AccuracyMetric(target='label'))
- 尽管fastNLP使用了映射机制来使得loss的计算变得比较灵活,但有些情况下loss必须在模型中进行计算,比如使用了CRF的模型。
- fastNLP中提供了 :class:`~fastNLP.LossInForward` 这个loss。
- 这个loss的原理是直接在forward()的返回结果中找到loss_key(默认寻找'loss')指定的那个tensor,并使用它作为loss。
- 如果Trainer初始化没有提供loss则默认使用 :class:`~fastNLP.LossInForward` 。
-
- .. todo::
- 补充一个例子 详细例子可以参照
+loss被设置为了 :class:`~fastNLP.CrossEntropyLoss` , 但在初始化的时候传入了target='label'这个参数,
+:class:`~fastNLP.CrossEntropyLoss` 的初始化参数为(pred=None, target=None, padding_idx=-100)。
+
+这里的两个参数分别为计算CrossEntropy时需要使用到的模型的预测值与真实值。
+其中 `pred` 一般来自于模型forward()的返回结果,`target` 一般是来自于DataSet中被设置为target的field。
+由于每个人对真实值或者model的返回值取名并不一样,所以fastNLP的 :mod:`Loss` 提供一种类似于映射的机制来匹配对应的值,
+比如这里 :class:`~fastNLP.CrossEntropyLoss` 将尝试找到名为'label'的内容来作为真实值得到loss;
+而pred=None, 则 :class:`~fastNLP.CrossEntropyLoss` 使用'pred'作为名称匹配预测值,
+正好forward的返回值也叫pred,所以这里不需要申明pred。
+
+尽管fastNLP使用了映射机制来使得loss的计算变得比较灵活,但有些情况下loss必须在模型中进行计算,比如使用了CRF的模型。
+fastNLP中提供了 :class:`~fastNLP.LossInForward` 这个loss。
+这个loss的原理是直接在forward()的返回结果中找到loss_key(默认寻找'loss')指定的那个tensor,并使用它作为loss。
+如果Trainer初始化没有提供loss则默认使用 :class:`~fastNLP.LossInForward` 。
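+
+下面给出一个简单的示意(假设模型forward()返回的dict中包含'loss'这个key)::
+
+ from fastNLP import LossInForward
+
+ loss = LossInForward(loss_key='loss') # 直接取forward()返回dict中key为'loss'的tensor作为loss
+ trainer = Trainer(tr_dataset, model, loss=loss, optimizer=SGD(model.parameters(), lr=0.1))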
+
+.. todo::
+ 补充一个例子 详细例子可以参照
1.3 Metric
- :mod:`Metric` 使用了与上述Loss一样的策略,即使用名称进行匹配。
- AccuracyMetric(target='label')的情况与CrossEntropyLoss 是同理的。
-
- 在进行验证时,可能用到的计算与forward()中不太一致,没有办法直接从forward()的结果中得到预测值,这时模型可以提供一个predict()方法,
- 如果提供的模型具有predict方法,则在模型验证时将调用predict()方法获取预测结果,
- 传入到predict()的参数也是从DataSet中被设置为input的field中选择出来的;
- 与forward()一样,返回值需要为一个dict。
-
- .. todo::
- 补充一个例子 具体例子可以参考
+----------------------------
-2 Trainer的代码检查
- 由于在fastNLP中采取了映射的机制,所以难免可能存在对应出错的情况。Trainer提供一种映射检查机制,可以通过check_code_level来进行控制
- 比如下面的例子中,由于各种原因产生的报错
+:mod:`Metric` 使用了与上述Loss一样的策略,即使用名称进行匹配。
+AccuracyMetric(target='label')的情况与CrossEntropyLoss 是同理的。
+
+在进行验证时,可能用到的计算与forward()中不太一致,没有办法直接从forward()的结果中得到预测值,这时模型可以提供一个predict()方法,
+如果提供的模型具有predict方法,则在模型验证时将调用predict()方法获取预测结果,
+传入到predict()的参数也是从DataSet中被设置为input的field中选择出来的;
+与forward()一样,返回值需要为一个dict。
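+
+例如,假设predict()返回的dict中用'output'存放预测值,而DataSet中真实值的field名为'label',则可以这样初始化metric(示意)::
+
+ from fastNLP import AccuracyMetric
+
+ metric = AccuracyMetric(pred='output', target='label')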
+
+.. todo::
+ 补充一个例子 具体例子可以参考
+
+----------------------------
+2. Trainer的代码检查
+----------------------------
+
+由于在fastNLP中采取了映射的机制,所以难免可能存在对应出错的情况。Trainer提供一种映射检查机制,可以通过check_code_level来进行控制
+比如下面的例子中,由于各种原因产生的报错
Example2.1
- ::
-
- import numpy as np
- from torch import nn
- import torch
- from torch.optim import SGD
- from fastNLP import Trainer
- from fastNLP import DataSet
+----------------------------
- class Model(nn.Module):
- def __init__(self):
- super().__init__()
- self.fc = nn.Linear(1, 1)
- def forward(self, x, b):
- loss = torch.mean((self.fc(x)-b)**2)
- return {'loss': loss}
- model = Model()
+.. code-block:: python
- dataset = DataSet({'a': np.arange(10), 'b':np.arange(10)*2})
- dataset.set_input('a', 'b')
+ import numpy as np
+ from torch import nn
+ import torch
+ from torch.optim import SGD
+ from fastNLP import Trainer
+ from fastNLP import DataSet
- trainer = Trainer(dataset, model, loss=None, optimizer=SGD(model.parameters(), lr=0.001))
+ class Model(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.fc = nn.Linear(1, 1)
+ def forward(self, x, b):
+ loss = torch.mean((self.fc(x)-b)**2)
+ return {'loss': loss}
+ model = Model()
- trainer = Trainer(dataset, model, SGD(model.parameters()))
- # 会报以下的错误
- # input fields after batch(if batch size is 2):
- # a: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
- # b: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
- # There is no target field.
- # ....
- # NameError:
- # Problems occurred when calling Model.forward(self, x, b)
- # missing param: ['x']
- # unused field: ['a']
- # Suggestion: You need to provide ['x'] in DataSet and set it as input.
+ dataset = DataSet({'a': np.arange(10), 'b':np.arange(10)*2})
+ dataset.set_input('a', 'b')
- 这里就是由于在Trainer初始化的时候,fastNLP会尝试使用一个batch_size=2的batch去运行一遍forward()以及backward()。这里有两类
- 信息可以为你提供参考
+ trainer = Trainer(dataset, model, loss=None, optimizer=SGD(model.parameters(), lr=0.001))
- 1 'input fields after batch...'这部分显示的是train dataset经过Batch操作后,每个field对应的类型以及进行shape。这里
- 因为train dataset没有target所以没有显示。根据这里可以看出是否正确将需要的内容设置为了input或target。
+ trainer = Trainer(dataset, model, SGD(model.parameters()))
+ # 会报以下的错误
+ # input fields after batch(if batch size is 2):
+ # a: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
+ # b: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
+ # There is no target field.
+ # ....
+ # NameError:
+ # Problems occurred when calling Model.forward(self, x, b)
+ # missing param: ['x']
+ # unused field: ['a']
+ # Suggestion: You need to provide ['x'] in DataSet and set it as input.
- 2 NameError,NameError发生在映射出错的情况。这里报错的原因是由于尝试进行forward计算时(可以通过Model.forward(self, x, b)判断
- 出当前是在调取forward),却没有获取到forward()函数中需要的'x';在报错信息中同时指出了缺'x',而'a'没有被使用,那么可能
- 就是由于field的名称不对。这里将dataset中'a'这个field的名称改为'x',或者model的参数从'x'修改为'a'都可以解决问题。
+这里就是由于在Trainer初始化的时候,fastNLP会尝试使用一个batch_size=2的batch去运行一遍forward()以及backward()。这里有两类
+信息可以为你提供参考
- 下面的例子是由于loss计算的时候找不到需要的值
+1 'input fields after batch...'这部分显示的是train dataset经过Batch操作后,每个field对应的类型以及shape。这里
+因为train dataset没有target所以没有显示。根据这里可以看出是否正确将需要的内容设置为了input或target。
+
+2 NameError,NameError发生在映射出错的情况。这里报错的原因是由于尝试进行forward计算时(可以通过Model.forward(self, x, b)判断
+出当前是在调取forward),却没有获取到forward()函数中需要的'x';在报错信息中同时指出了缺'x',而'a'没有被使用,那么可能
+就是由于field的名称不对。这里将dataset中'a'这个field的名称改为'x',或者model的参数从'x'修改为'a'都可以解决问题。
+
+下面的例子是由于loss计算的时候找不到需要的值
Example2.2
- ::
+----------------------------
- import numpy as np
- from torch import nn
- from torch.optim import SGD
- from fastNLP import Trainer
- from fastNLP import DataSet
- from fastNLP import L1Loss
- import torch
+.. code-block:: python
- class Model(nn.Module):
- def __init__(self):
- super().__init__()
- self.fc = nn.Linear(1, 1)
- def forward(self, a):
- return {'pred_b': self.fc(a.unsqueeze(1)).squeeze(1), 'No use':1}
+ import numpy as np
+ from torch import nn
+ from torch.optim import SGD
+ from fastNLP import Trainer
+ from fastNLP import DataSet
+ from fastNLP import L1Loss
+ import torch
- model = Model()
+ class Model(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.fc = nn.Linear(1, 1)
+ def forward(self, a):
+ return {'pred_b': self.fc(a.unsqueeze(1)).squeeze(1), 'No use':1}
- dataset = DataSet({'a': np.arange(10, dtype=float), 'b':np.arange(10, dtype=float)*2})
+ model = Model()
- dataset.set_input('a')
- dataset.set_target('b')
+ dataset = DataSet({'a': np.arange(10, dtype=float), 'b':np.arange(10, dtype=float)*2})
- trainer = Trainer(dataset, model, loss=L1Loss(target='label'), optimizer=SGD(model.parameters(), lr=0.001))
- # 报错信息如下
- # input fields after batch(if batch size is 2):
- # a: (1)type:torch.Tensor (2)dtype:torch.float32, (3)shape:torch.Size([2])
- # target fields after batch(if batch size is 2):
- # b: (1)type:torch.Tensor (2)dtype:torch.float32, (3)shape:torch.Size([2])
- # ....
- # NameError:
- # Problems occurred when calling L1Loss.get_loss(self, pred, target)
- # missing param: ['pred(assign to `pred` in `L1Loss`)', 'label(assign to `target` in `L1Loss`)']
- # unused field: ['b']
- # unused param: ['pred_b', 'No use']
- # target field: ['b']
- # param from Model.forward(self, a): ['pred_b', 'No use']
- # Suggestion: (1). Check key assignment for `target` when initialize L1Loss. Or provide `label` in DataSet or output of Model.forward(self, a).
- # (2). Check key assignment for `pred` when initialize L1Loss. Or provide `pred` in DataSet or output of Model.forward(self, a).
+ dataset.set_input('a')
+ dataset.set_target('b')
- 报错信息也包含两部分:
+ trainer = Trainer(dataset, model, loss=L1Loss(target='label'), optimizer=SGD(model.parameters(), lr=0.001))
+ # 报错信息如下
+ # input fields after batch(if batch size is 2):
+ # a: (1)type:torch.Tensor (2)dtype:torch.float32, (3)shape:torch.Size([2])
+ # target fields after batch(if batch size is 2):
+ # b: (1)type:torch.Tensor (2)dtype:torch.float32, (3)shape:torch.Size([2])
+ # ....
+ # NameError:
+ # Problems occurred when calling L1Loss.get_loss(self, pred, target)
+ # missing param: ['pred(assign to `pred` in `L1Loss`)', 'label(assign to `target` in `L1Loss`)']
+ # unused field: ['b']
+ # unused param: ['pred_b', 'No use']
+ # target field: ['b']
+ # param from Model.forward(self, a): ['pred_b', 'No use']
+ # Suggestion: (1). Check key assignment for `target` when initialize L1Loss. Or provide `label` in DataSet or output of Model.forward(self, a).
+ # (2). Check key assignment for `pred` when initialize L1Loss. Or provide `pred` in DataSet or output of Model.forward(self, a).
- 1 第一部分与上面是一样的
+报错信息也包含两部分:
- 2 这里报错的原因是由于计算loss的时候找不到相应的值(通过L1Loss.get_loss(self, pred, target)判断出来的);
- 报错的原因是因为 `pred` 和 `label` (我们在初始化L1Loss时将target指定为了label)都没有找到。
- 这里'unused field'是DataSet中出现了,但却没有被设置为input或者target的field;
- 'unused param'是forward()中返回且没有被使用到的内容;'target field'是被设置为了target的field;
- 'param from Model.forward(self, a)'是forward()返回的所有key。"Suggestion"是关于当前错误处理的建议。
+1 第一部分与上面是一样的
- 但是在一些情况下,比如forward()返回值只有一个,target也只有一个,fastNLP不会进行匹配,而直接将forward()的结果作为pred,
- 将DataSet中的target设置为target。上面的例子在返回值中加入了一个'No use'则只是为了使得Loss去匹配结果。
+2 这里报错的原因是由于计算loss的时候找不到相应的值(通过L1Loss.get_loss(self, pred, target)判断出来的);
+报错的原因是因为 `pred` 和 `label` (我们在初始化L1Loss时将target指定为了label)都没有找到。
+这里'unused field'是DataSet中出现了,但却没有被设置为input或者target的field;
+'unused param'是forward()中返回且没有被使用到的内容;'target field'是被设置为了target的field;
+'param from Model.forward(self, a)'是forward()返回的所有key。"Suggestion"是关于当前错误处理的建议。
+
+但是在一些情况下,比如forward()返回值只有一个,target也只有一个,fastNLP不会进行匹配,而直接将forward()的结果作为pred,
+将DataSet中的target设置为target。上面的例子在返回值中加入一个'No use',只是为了让Loss必须通过名称匹配的方式去获取结果。
- 下面是带有dev dataset时如果出现错误会发生的报错,
+下面是带有dev dataset时如果出现错误会发生的报错,
Example2.3
- ::
+----------------------------
+
+.. code-block:: python
+
+ import numpy as np
+ from torch import nn
+ from torch.optim import SGD
+ from fastNLP import Trainer
+ from fastNLP import DataSet
+ from fastNLP import AccuracyMetric
+ import torch
+
+ class Model(nn.Module):
+ def __init__(self):
+ super().__init__()
+ self.fc = nn.Linear(1, 1)
+ def forward(self, a, b):
+ loss = torch.mean((self.fc(a.float().unsqueeze(1))-b.float())**2)
+ return {'loss': loss}
+ def predict(self, a): # 使用predict()进行验证
+ return {'output':self.fc(a.float().unsqueeze(1))} #这里return的值不包含'pred'这个key
+ model = Model()
+
+ dataset = DataSet({'a': np.arange(10), 'b':np.arange(10)*2})
+ dev_data = DataSet({'a': np.arange(10, 20), 'b':np.arange(10, 20)*2})
+
+ dataset.set_input('a', 'b')
+ dev_data.set_input('a') # 这里没有设置target
+
+ trainer = Trainer(dataset, model, loss=None, optimizer=SGD(model.parameters(), lr=0.001),
+ dev_data=dev_data, metrics=AccuracyMetric())
+
+ # 报错信息
+ # ...
+ # NameError:
+ # Problems occurred when calling AccuracyMetric.evaluate(self, pred, target, seq_len=None)
+ # missing param: ['pred(assign to `pred` in `AccuracyMetric`)', 'target(assign to `target` in `AccuracyMetric`)']
+ # unused param: ['output']
+ # target field: []
+ # param from Model.predict(self, a): ['output']
+ # Suggestion: (1). Check key assignment for `pred` when initialize AccuracyMetric. Or provide `pred` in DataSet or output of Model.predict(self, a).
+ # (2). Check key assignment for `target` when initialize AccuracyMetric. Or provide `target` in DataSet or output of Model.predict(self, a).
+
+报错信息和前面都是类似的,但是可以通过'AccuracyMetric.evaluate(self, pred, target, seq_len=None)'看出这里是evaluation
+的时候发生了错误。这样就避免了必须等到一整个epoch的训练完成后才能发现evaluation出错的情况。修改的方法是在初始化metric的时候
+指明通过'output'获取`pred`, 即AccuracyMetric(pred='output')。
+
+可以通过check_code_level调节检查的强度。默认为0,即进行检查。
+
+----------------------------
+3. Trainer与callback
+----------------------------
+
+虽然Trainer本身已经集成了一些功能,但仍然不足以囊括训练过程中可能需要到的功能,比如负采样,learning rate decay, Early Stop等。
+为了解决这个问题fastNLP引入了callback的机制,:class:`~fastNLP.Callback` 是一种在Trainer训练过程中特定阶段会运行的函数集合,
+所有的 :class:`~fastNLP.Callback` 都具有on_*(比如on_train_start, on_backward_begin)等函数。
+如果 Callback 实现了该函数,则Trainer运行至对应阶段,会进行调用,例如::
+
+ from fastNLP import Callback, EarlyStopCallback, Trainer, CrossEntropyLoss, AccuracyMetric
+ from fastNLP.models import CNNText
+
+ start_time = time.time()
- import numpy as np
- from torch import nn
- from torch.optim import SGD
- from fastNLP import Trainer
- from fastNLP import DataSet
- from fastNLP import AccuracyMetric
- import torch
-
- class Model(nn.Module):
- def __init__(self):
- super().__init__()
- self.fc = nn.Linear(1, 1)
- def forward(self, a, b):
- loss = torch.mean((self.fc(a.float().unsqueeze(1))-b.float())**2)
- return {'loss': loss}
- def predict(self, a): # 使用predict()进行验证
- return {'output':self.fc(a.float().unsqueeze(1))} #这里return的值不包含'pred'这个key
- model = Model()
-
- dataset = DataSet({'a': np.arange(10), 'b':np.arange(10)*2})
- dev_data = DataSet({'a': np.arange(10, 20), 'b':np.arange(10, 20)*2})
-
- dataset.set_input('a', 'b')
- dev_data.set_input('a') # 这里没有设置target
-
- trainer = Trainer(dataset, model, loss=None, optimizer=SGD(model.parameters(), lr=0.001),
- dev_data=dev_data, metrics=AccuracyMetric())
-
- # 报错信息
- # ...
- # NameError:
- # Problems occurred when calling AccuracyMetric.evaluate(self, pred, target, seq_len=None)
- # missing param: ['pred(assign to `pred` in `AccuracyMetric`)', 'target(assign to `target` in `AccuracyMetric`)']
- # unused param: ['output']
- # target field: []
- # param from Model.predict(self, a): ['output']
- # Suggestion: (1). Check key assignment for `pred` when initialize AccuracyMetric. Or provide `pred` in DataSet or output of Model.predict(self, a).
- # (2). Check key assignment for `target` when initialize AccuracyMetric. Or provide `target` in DataSet or output of Model.predict(self, a).
-
- 报错信息和前面都是类似的,但是可以通过'AccuracyMetric.evaluate(self, pred, target, seq_len=None)'看出这里是evaluation
- 的时候发生了错误。这样避免了需要在完成一整个epoch的训练才能发现evaluation弄错的情况。这里的修改是通过在初始化metric的时候
- 指明通过'output'获取`pred`, 即AccuracyMetric(pred='output')。
-
- 可以通过check_code_level调节检查的强度。默认为0,即进行检查。
-
-3 Trainer与callback
- 虽然Trainer本身已经集成了一些功能,但仍然不足以囊括训练过程中可能需要到的功能,比如负采样,learning rate decay, Early Stop等。
- 为了解决这个问题fastNLP引入了callback的机制,:class:`~fastNLP.Callback` 是一种在Trainer训练过程中特定阶段会运行的函数集合,
- 所有的 :class:`~fastNLP.Callback` 都具有on_*(比如on_train_start, on_backward_begin)等函数。
- 如果 Callback 实现了该函数,则Trainer运行至对应阶段,会进行调用,例如::
+ class MyCallback(Callback):
+ def on_epoch_end(self):
+ print('{:d}ms\n\n'.format(round((time.time()-start_time)*1000)))
- from fastNLP import Callback, EarlyStopCallback, Trainer, CrossEntropyLoss, AccuracyMetric
- from fastNLP.models import CNNText
-
- start_time = time.time()
-
- class MyCallback(Callback):
- def on_epoch_end(self):
- print('{:d}ms\n\n'.format(round((time.time()-start_time)*1000)))
-
- model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
- trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data, loss=CrossEntropyLoss(),
- metrics=AccuracyMetric(), callbacks=[MyCallback(),EarlyStopCallback(10)])
- trainer.train()
-
- 这里,我们通过继承 :class:`~fastNLP.Callback` 类定义了自己的 callback 的,并和内置的 :class:`~fastNLP.EarlyStopCallback`
- 一起传给了 :class:`~fastNLP.Trainer` ,增强了 :class:`~fastNLP.Trainer` 的功能
+ model = CNNText((len(vocab),50), num_classes=5, padding=2, dropout=0.1)
+ trainer = Trainer(model=model, train_data=train_data, dev_data=dev_data, loss=CrossEntropyLoss(),
+ metrics=AccuracyMetric(), callbacks=[MyCallback(),EarlyStopCallback(10)])
+ trainer.train()
- fastNLP已经自带了很多callback函数供使用,可以参考 :doc:`fastNLP.core.callback` 。
+这里,我们通过继承 :class:`~fastNLP.Callback` 类定义了自己的 callback,并和内置的 :class:`~fastNLP.EarlyStopCallback`
+一起传给了 :class:`~fastNLP.Trainer` ,增强了 :class:`~fastNLP.Trainer` 的功能。
+
+fastNLP已经自带了很多callback函数供使用,可以参考 :doc:`fastNLP.core.callback` 。
"""
__all__ = [
@@ -311,8 +333,9 @@ try:
from tqdm.auto import tqdm
except:
from .utils import _pseudo_tqdm as tqdm
+import warnings
-from .batch import Batch
+from .batch import DataSetIter, BatchIter
from .callback import CallbackManager, CallbackException
from .dataset import DataSet
from .losses import _prepare_losser
@@ -320,7 +343,6 @@ from .metrics import _prepare_metrics
from .optimizer import Optimizer
from .sampler import Sampler
from .sampler import RandomSampler
-from .sampler import SequentialSampler
from .tester import Tester
from .utils import _CheckError
from .utils import _build_args
@@ -351,6 +373,8 @@ class Trainer(object):
:param int batch_size: 训练和验证的时候的batch大小。
:param loss: 使用的 :class:`~fastNLP.core.losses.LossBase` 对象。当为None时,默认使用 :class:`~fastNLP.LossInForward`
:param sampler: Batch数据生成的顺序, :class:`~fastNLP.Sampler` 类型。如果为None,默认使用 :class:`~fastNLP.RandomSampler`
+ :param drop_last: 如果最后一个batch没有正好为batch_size这么多数据,就扔掉最后一个batch
+    :param num_workers: int, 使用多少个进程来进行数据pad处理。
:param update_every: int, 多少步更新一次梯度。用于希望累计梯度的场景,比如需要128的batch_size, 但是直接设为128
会导致内存不足,通过设置batch_size=32, update_every=4达到目的。当optimizer为None时,该参数无效。
:param int n_epochs: 需要优化迭代多少次。
@@ -367,7 +391,6 @@ class Trainer(object):
:param int validate_every: 多少个step在验证集上验证一次; 如果为-1,则每个epoch结束验证一次。仅在传入dev_data时有效。
:param str,None save_path: 将模型保存路径。如果为None,则不保存模型。如果dev_data为None,则保存最后一次迭代的模型。
保存的时候不仅保存了参数,还保存了模型结构。即便使用DataParallel,这里也只保存模型。
- :param prefetch: bool, 是否使用额外的进程对产生batch数据。理论上会使得Batch迭代更快。
:param bool use_tqdm: 是否使用tqdm来显示训练进度; 如果为False,则将loss打印在终端中。
:param str,int,torch.device,list(int) device: 将模型load到哪个设备。默认为None,即Trainer不对模型
的计算位置进行管理。支持以下的输入:
@@ -394,16 +417,17 @@ class Trainer(object):
"""
def __init__(self, train_data, model, optimizer=None, loss=None,
- batch_size=32, sampler=None, update_every=1,
- n_epochs=10, print_every=5,
+ batch_size=32, sampler=None, drop_last=False, update_every=1,
+ num_workers=0, n_epochs=10, print_every=5,
dev_data=None, metrics=None, metric_key=None,
- validate_every=-1, save_path=None,
- prefetch=False, use_tqdm=True, device=None,
- callbacks=None,
- check_code_level=0):
+ validate_every=-1, save_path=None, use_tqdm=True, device=None, prefetch=False,
+ callbacks=None, check_code_level=0):
+ if prefetch and num_workers==0:
+ num_workers = 1
+ if prefetch:
+ warnings.warn("prefetch is deprecated, will be removed in version 0.5.0, please use num_workers instead.")
+
super(Trainer, self).__init__()
- if not isinstance(train_data, DataSet):
- raise TypeError(f"The type of train_data must be fastNLP.DataSet, got {type(train_data)}.")
if not isinstance(model, nn.Module):
raise TypeError(f"The type of model must be torch.nn.Module, got {type(model)}.")
@@ -430,25 +454,37 @@ class Trainer(object):
if metric_key is not None:
self.increase_better = False if metric_key[0] == "-" else True
self.metric_key = metric_key[1:] if metric_key[0] == "+" or metric_key[0] == "-" else metric_key
- elif len(metrics) > 0:
- self.metric_key = metrics[0].__class__.__name__.lower().strip('metric')
-
+ else:
+ self.metric_key = None
# prepare loss
losser = _prepare_losser(loss)
# sampler check
if sampler is not None and not isinstance(sampler, Sampler):
raise ValueError("The type of sampler should be fastNLP.BaseSampler, got {}.".format(type(sampler)))
-
- if check_code_level > -1:
+
+ if sampler is None:
+ sampler = RandomSampler()
+ elif hasattr(sampler, 'set_batch_size'):
+ sampler.set_batch_size(batch_size)
+
+ if isinstance(train_data, DataSet):
+ self.data_iterator = DataSetIter(
+ dataset=train_data, batch_size=batch_size, num_workers=num_workers, sampler=sampler, drop_last=drop_last)
+ elif isinstance(train_data, BatchIter):
+ self.data_iterator = train_data
+ else:
+ raise TypeError("train_data type {} not support".format(type(train_data)))
+
+ if check_code_level > -1 and isinstance(self.data_iterator, DataSetIter):
_check_code(dataset=train_data, model=model, losser=losser, metrics=metrics, dev_data=dev_data,
- metric_key=metric_key, check_level=check_code_level,
+ metric_key=self.metric_key, check_level=check_code_level,
batch_size=min(batch_size, DEFAULT_CHECK_BATCH_SIZE))
# _check_code 是 fastNLP 帮助你检查代码是否正确的方法 。如果你在错误栈中看到这行注释,请认真检查你的代码
-
+ self.model = _move_model_to_device(model, device=device)
+
self.train_data = train_data
self.dev_data = dev_data # If None, No validation.
- self.model = model
self.losser = losser
self.metrics = metrics
self.n_epochs = int(n_epochs)
@@ -460,26 +496,22 @@ class Trainer(object):
self.best_dev_epoch = None
self.best_dev_step = None
self.best_dev_perf = None
- self.sampler = sampler if sampler is not None else RandomSampler()
- self.prefetch = prefetch
self.n_steps = (len(self.train_data) // self.batch_size + int(
- len(self.train_data) % self.batch_size != 0)) * self.n_epochs
-
- self.model = _move_model_to_device(self.model, device=device)
-
+            len(self.train_data) % self.batch_size != 0 and not drop_last)) * self.n_epochs
+
if isinstance(optimizer, torch.optim.Optimizer):
self.optimizer = optimizer
elif isinstance(optimizer, Optimizer):
- self.optimizer = optimizer.construct_from_pytorch(model.parameters())
+ self.optimizer = optimizer.construct_from_pytorch(self.model.parameters())
elif optimizer is None:
- self.optimizer = torch.optim.Adam(model.parameters(), lr=4e-3)
+ self.optimizer = torch.optim.Adam(self.model.parameters(), lr=4e-3)
else:
raise TypeError("optimizer can only be torch.optim.Optimizer type, not {}.".format(type(optimizer)))
self.use_tqdm = use_tqdm
self.pbar = None
self.print_every = abs(self.print_every)
-
+
if self.dev_data is not None:
self.tester = Tester(model=self.model,
data=self.dev_data,
@@ -493,15 +525,16 @@ class Trainer(object):
self.callback_manager = CallbackManager(env={"trainer": self},
callbacks=callbacks)
-
- def train(self, load_best_model=True, on_exception='ignore'):
+
+ def train(self, load_best_model=True, on_exception='auto'):
"""
使用该函数使Trainer开始训练。
:param bool load_best_model: 该参数只有在初始化提供了dev_data的情况下有效,如果True, trainer将在返回之前重新加载dev表现
最好的模型参数。
:param str on_exception: 在训练过程遭遇exception,并被 :py:class:Callback 的on_exception()处理后,是否继续抛出异常。
- 支持'ignore'与'raise': 'ignore'将捕获异常,写在Trainer.train()后面的代码将继续运行; 'raise'将异常抛出。
+ 支持'ignore','raise', 'auto': 'ignore'将捕获异常,写在Trainer.train()后面的代码将继续运行; 'raise'将异常抛出;
+ 'auto'将ignore以下两种Exception: CallbackException与KeyboardInterrupt, raise其它exception.
:return dict: 返回一个字典类型的数据,
内含以下内容::
@@ -530,12 +563,16 @@ class Trainer(object):
self.callback_manager.on_train_begin()
self._train()
self.callback_manager.on_train_end()
- except (CallbackException, KeyboardInterrupt, Exception) as e:
+
+ except BaseException as e:
self.callback_manager.on_exception(e)
- if on_exception=='raise':
+ if on_exception == 'auto':
+ if not isinstance(e, (CallbackException, KeyboardInterrupt)):
+ raise e
+ elif on_exception == 'raise':
raise e
- if self.dev_data is not None and hasattr(self, 'best_dev_perf'):
+ if self.dev_data is not None and self.best_dev_perf is not None:
print(
"\nIn Epoch:{}/Step:{}, got best dev performance:".format(self.best_dev_epoch, self.best_dev_step) +
self.tester._format_eval_results(self.best_dev_perf), )
@@ -563,12 +600,14 @@ class Trainer(object):
self.step = 0
self.epoch = 0
start = time.time()
-
+ if isinstance(self.model, nn.DataParallel):
+ self._forward_func = self.model.module.forward
+ else:
+ self._forward_func = self.model.forward
with inner_tqdm(total=self.n_steps, postfix='loss:{0:<6.5f}', leave=False, dynamic_ncols=True) as pbar:
self.pbar = pbar
avg_loss = 0
- data_iterator = Batch(self.train_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False,
- prefetch=self.prefetch)
+ data_iterator = self.data_iterator
self.batch_per_epoch = data_iterator.num_batches
for epoch in range(1, self.n_epochs + 1):
self.epoch = epoch
@@ -600,7 +639,7 @@ class Trainer(object):
if self.step % self.print_every == 0:
avg_loss = float(avg_loss) / self.print_every
if self.use_tqdm:
- print_output = "loss:{0:<6.5f}".format(avg_loss)
+ print_output = "loss:{:<6.5f}".format(avg_loss)
pbar.update(self.print_every)
else:
end = time.time()
@@ -664,15 +703,15 @@ class Trainer(object):
"""Perform weight update on a model.
"""
- if self.optimizer is not None and (self.step + 1) % self.update_every == 0:
+ if self.step % self.update_every == 0:
self.optimizer.step()
def _data_forward(self, network, x):
- x = _build_args(network.forward, **x)
+ x = _build_args(self._forward_func, **x)
y = network(**x)
if not isinstance(y, dict):
raise TypeError(
- f"The return value of {_get_func_signature(network.forward)} should be dict, got {type(y)}.")
+ f"The return value of {_get_func_signature(self._forward_func)} should be dict, got {type(y)}.")
return y
def _grad_backward(self, loss):
@@ -682,7 +721,7 @@ class Trainer(object):
For PyTorch, just do "loss.backward()"
"""
- if self.step % self.update_every == 0:
+ if (self.step-1) % self.update_every == 0:
self.model.zero_grad()
loss.backward()
@@ -741,7 +780,9 @@ class Trainer(object):
:return bool value: True means current results on dev set is the best.
"""
- indicator_val = _check_eval_results(metrics, self.metric_key, self.metrics)
+ indicator, indicator_val = _check_eval_results(metrics, self.metric_key, self.metrics)
+ if self.metric_key is None:
+ self.metric_key = indicator
is_better = True
if self.best_metric_indicator is None:
# first-time validation
@@ -780,15 +821,34 @@ def _get_value_info(_dict):
strs.append(_str)
return strs
-
+from numbers import Number
+from .batch import _to_tensor
def _check_code(dataset, model, losser, metrics, batch_size=DEFAULT_CHECK_BATCH_SIZE,
dev_data=None, metric_key=None,
check_level=0):
# check get_loss 方法
- model_devcie = model.parameters().__next__().device
+ model_devcie = _get_model_device(model=model)
- batch = Batch(dataset=dataset, batch_size=batch_size, sampler=SequentialSampler())
- for batch_count, (batch_x, batch_y) in enumerate(batch):
+ def _iter():
+ start_idx = 0
+        while start_idx < len(dataset):
    ...
        elif len(metric_dict) > 1 and metric_key is None:
- raise RuntimeError(
- f"Got multiple metric keys: {metric_dict}, but metric_key is not set. Which one to use?")
else:
# metric_key is set
if metric_key not in metric_dict:
raise RuntimeError(f"metric key {metric_key} not found in {metric_dict}")
indicator_val = metric_dict[metric_key]
+ indicator = metric_key
else:
raise RuntimeError("Invalid metrics type. Expect {}, got {}".format((tuple, dict), type(metrics)))
- return indicator_val
+ return indicator, indicator_val
diff --git a/fastNLP/core/utils.py b/fastNLP/core/utils.py
index fa6d90a2..2847e724 100644
--- a/fastNLP/core/utils.py
+++ b/fastNLP/core/utils.py
@@ -4,7 +4,6 @@ utils模块实现了 fastNLP 内部和外部所需的很多工具。其中用户
__all__ = [
"cache_results",
"seq_len_to_mask",
- "Example",
]
import _pickle
@@ -16,34 +15,35 @@ from collections import Counter, namedtuple
import numpy as np
import torch
import torch.nn as nn
-
+from typing import List
_CheckRes = namedtuple('_CheckRes', ['missing', 'unused', 'duplicated', 'required', 'all_needed',
'varargs'])
-class Example(dict):
+class Option(dict):
"""a dict can treat keys as attributes"""
+
def __getattr__(self, item):
try:
return self.__getitem__(item)
except KeyError:
raise AttributeError(item)
-
+
def __setattr__(self, key, value):
if key.startswith('__') and key.endswith('__'):
raise AttributeError(key)
self.__setitem__(key, value)
-
+
def __delattr__(self, item):
try:
self.pop(item)
except KeyError:
raise AttributeError(item)
-
+
def __getstate__(self):
return self
-
+
def __setstate__(self, state):
self.update(state)
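``Option`` (the renamed ``Example`` class) is a plain ``dict`` whose keys can also be read and written as attributes. A tiny illustrative sketch (the keys used here are made up):

.. code-block:: python

    opt = Option(lr=0.1)              # behaves like dict(lr=0.1)
    opt.batch_size = 32               # __setattr__ writes into the dict
    assert opt['batch_size'] == 32 and opt.lr == 0.1
    del opt.lr                        # __delattr__ pops the key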
@@ -164,6 +164,31 @@ def cache_results(_cache_fp, _refresh=False, _verbose=1):
return wrapper_
+def _save_model(model, model_name, save_dir, only_param=False):
+ """ 存储不含有显卡信息的state_dict或model
+ :param model:
+ :param model_name:
+ :param save_dir: 保存的directory
+ :param only_param:
+ :return:
+ """
+ model_path = os.path.join(save_dir, model_name)
+ if not os.path.isdir(save_dir):
+ os.makedirs(save_dir, exist_ok=True)
+ if isinstance(model, nn.DataParallel):
+ model = model.module
+ if only_param:
+ state_dict = model.state_dict()
+ for key in state_dict:
+ state_dict[key] = state_dict[key].cpu()
+ torch.save(state_dict, model_path)
+ else:
+ _model_device = _get_model_device(model)
+ model.cpu()
+ torch.save(model, model_path)
+ model.to(_model_device)
+
+
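A usage sketch for ``_save_model`` (the directory and file name below are hypothetical); with ``only_param=True`` only a CPU copy of the ``state_dict`` is written, otherwise the whole module is temporarily moved to CPU, saved, and moved back:

.. code-block:: python

    _save_model(model, model_name='best_model.pkl', save_dir='./checkpoints', only_param=True)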
# def save_pickle(obj, pickle_path, file_name):
# """Save an object into a pickle file.
#
@@ -285,6 +310,7 @@ def _get_model_device(model):
:param model: nn.Module
:return: torch.device,None 如果返回值为None,说明这个模型没有任何参数。
"""
+ # TODO 这个函数存在一定的风险,因为同一个模型可能存在某些parameter不在显卡中,比如BertEmbedding. 或者跨显卡
assert isinstance(model, nn.Module)
parameters = list(model.parameters())
@@ -295,6 +321,13 @@ def _get_model_device(model):
def _build_args(func, **kwargs):
+ """
+ 根据func的初始化参数,从kwargs中选择func需要的参数
+
+ :param func: callable
+ :param kwargs: 参数
+ :return:dict. func中用到的参数
+ """
spect = inspect.getfullargspec(func)
if spect.varkw is not None:
return kwargs
@@ -635,13 +668,13 @@ def _check_forward_error(forward_func, batch_x, dataset, check_level):
warnings.warn(message=_unused_warn)
-def seq_len_to_mask(seq_len):
+def seq_len_to_mask(seq_len, max_len=None):
"""
将一个表示sequence length的一维数组转换为二维的mask,不包含的位置为0。
转变 1-d seq_len到2-d mask.
- Example::
+ .. code-block::
>>> seq_len = torch.arange(2, 16)
>>> mask = seq_len_to_mask(seq_len)
@@ -651,20 +684,26 @@ def seq_len_to_mask(seq_len):
>>> mask = seq_len_to_mask(seq_len)
>>> print(mask.shape)
(14, 15)
+ >>> seq_len = torch.arange(2, 16)
+ >>> mask = seq_len_to_mask(seq_len, max_len=100)
+        >>> print(mask.size())
+ torch.Size([14, 100])
:param np.ndarray,torch.LongTensor seq_len: shape将是(B,)
- :return: np.ndarray or torch.Tensor, shape将是(B, max_length)。 元素类似为bool或torch.uint8
+ :param int max_len: 将长度pad到这个长度。默认(None)使用的是seq_len中最长的长度。但在nn.DataParallel的场景下可能不同卡的seq_len会有
+ 区别,所以需要传入一个max_len使得mask的长度是pad到该长度。
+    :return: np.ndarray, torch.Tensor。shape将是(B, max_length), 元素类型为bool或torch.uint8
"""
if isinstance(seq_len, np.ndarray):
assert len(np.shape(seq_len)) == 1, f"seq_len can only have one dimension, got {len(np.shape(seq_len))}."
- max_len = int(seq_len.max())
+ max_len = int(max_len) if max_len else int(seq_len.max())
broad_cast_seq_len = np.tile(np.arange(max_len), (len(seq_len), 1))
mask = broad_cast_seq_len < seq_len.reshape(-1, 1)
elif isinstance(seq_len, torch.Tensor):
assert seq_len.dim() == 1, f"seq_len can only have one dimension, got {seq_len.dim() == 1}."
batch_size = seq_len.size(0)
- max_len = seq_len.max().long()
+ max_len = int(max_len) if max_len else seq_len.max().long()
broad_cast_seq_len = torch.arange(max_len).expand(batch_size, -1).to(seq_len)
mask = broad_cast_seq_len.lt(seq_len.unsqueeze(1))
else:
@@ -698,3 +737,54 @@ class _pseudo_tqdm:
def __exit__(self, exc_type, exc_val, exc_tb):
del self
+
+
+def iob2(tags: List[str]) -> List[str]:
+ """
+ 检查数据是否是合法的IOB数据,如果是IOB1会被自动转换为IOB2。两者的差异见
+ https://datascience.stackexchange.com/questions/37824/difference-between-iob-and-iob2-format
+
+ :param tags: 需要转换的tags, 需要为大写的BIO标签。
+ """
+ for i, tag in enumerate(tags):
+ if tag == "O":
+ continue
+ split = tag.split("-")
+ if len(split) != 2 or split[0] not in ["I", "B"]:
+ raise TypeError("The encoding schema is not a valid IOB type.")
+ if split[0] == "B":
+ continue
+ elif i == 0 or tags[i - 1] == "O": # conversion IOB1 to IOB2
+ tags[i] = "B" + tag[1:]
+ elif tags[i - 1][1:] == tag[1:]:
+ continue
+ else: # conversion IOB1 to IOB2
+ tags[i] = "B" + tag[1:]
+ return tags
+
+
+def iob2bioes(tags: List[str]) -> List[str]:
+ """
+ 将iob的tag转换为bioes编码
+ :param tags: List[str]. 编码需要是大写的。
+ :return:
+ """
+ new_tags = []
+ for i, tag in enumerate(tags):
+ if tag == 'O':
+ new_tags.append(tag)
+ else:
+ split = tag.split('-')[0]
+ if split == 'B':
+ if i + 1 != len(tags) and tags[i + 1].split('-')[0] == 'I':
+ new_tags.append(tag)
+ else:
+ new_tags.append(tag.replace('B-', 'S-'))
+ elif split == 'I':
+ if i + 1 < len(tags) and tags[i + 1].split('-')[0] == 'I':
+ new_tags.append(tag)
+ else:
+ new_tags.append(tag.replace('I-', 'E-'))
+ else:
+ raise TypeError("Invalid IOB format.")
+ return new_tags
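A short sketch of what the two tag converters above do on a toy sequence (expected outputs worked out by hand from the code):

.. code-block:: python

    tags = ["I-ORG", "O", "I-PER", "I-PER"]     # IOB1-style input
    tags = iob2(tags)                           # ['B-ORG', 'O', 'B-PER', 'I-PER']
    tags = iob2bioes(tags)                      # ['S-ORG', 'O', 'B-PER', 'E-PER']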
diff --git a/fastNLP/core/vocabulary.py b/fastNLP/core/vocabulary.py
index 0cf45049..9ce59a8c 100644
--- a/fastNLP/core/vocabulary.py
+++ b/fastNLP/core/vocabulary.py
@@ -4,12 +4,14 @@ __all__ = [
]
from functools import wraps
-from collections import Counter
+from collections import Counter, defaultdict
from .dataset import DataSet
-from .utils import Example
+from .utils import Option
+from functools import partial
+import numpy as np
-class VocabularyOption(Example):
+class VocabularyOption(Option):
def __init__(self,
max_size=None,
min_freq=None,
@@ -89,41 +91,88 @@ class Vocabulary(object):
self.word2idx = None
self.idx2word = None
self.rebuild = True
+ # 用于承载不需要单独创建entry的词语,具体见from_dataset()方法
+ self._no_create_word = Counter()
@_check_build_status
- def update(self, word_lst):
+ def update(self, word_lst, no_create_entry=False):
"""依次增加序列中词在词典中的出现频率
:param list word_lst: a list of strings
+ :param bool no_create_entry: 在使用fastNLP.TokenEmbedding加载预训练模型时,没有从预训练词表中找到这个词的处理方式。
+            如果为True,则不会为这个词语创建一个单独的entry,它将一直被指向unk的表示; 如果为False,则为这个词创建一个单独
+            的entry。如果这个word来自于dev或者test,一般设置为True,如果来自于train一般设置为False。以下两种情况: 如果新
+ 加入一个word,且no_create_entry为True,但这个词之前已经在Vocabulary中且并不是no_create_entry的,则还是会为这
+ 个词创建一个单独的vector; 如果no_create_entry为False,但这个词之前已经在Vocabulary中且并不是no_create_entry的,
+ 则这个词将认为是需要创建单独的vector的。
"""
+ self._add_no_create_entry(word_lst, no_create_entry)
self.word_count.update(word_lst)
+ return self
@_check_build_status
- def add(self, word):
+ def add(self, word, no_create_entry=False):
"""
增加一个新词在词典中的出现频率
:param str word: 新词
+ :param bool no_create_entry: 在使用fastNLP.TokenEmbedding加载预训练模型时,没有从预训练词表中找到这个词的处理方式。
+            如果为True,则不会为这个词语创建一个单独的entry,它将一直被指向unk的表示; 如果为False,则为这个词创建一个单独
+            的entry。如果这个word来自于dev或者test,一般设置为True,如果来自于train一般设置为False。以下两种情况: 如果新
+ 加入一个word,且no_create_entry为True,但这个词之前已经在Vocabulary中且并不是no_create_entry的,则还是会为这
+ 个词创建一个单独的vector; 如果no_create_entry为False,但这个词之前已经在Vocabulary中且并不是no_create_entry的,
+ 则这个词将认为是需要创建单独的vector的。
"""
+ self._add_no_create_entry(word, no_create_entry)
self.word_count[word] += 1
+ return self
+
+ def _add_no_create_entry(self, word, no_create_entry):
+ """
+ 在新加入word时,检查_no_create_word的设置。
+
+ :param str, List[str] word:
+ :param bool no_create_entry:
+ :return:
+ """
+ if isinstance(word, str):
+ word = [word]
+ for w in word:
+ if no_create_entry and self.word_count.get(w, 0) == self._no_create_word.get(w, 0):
+ self._no_create_word[w] += 1
+ elif not no_create_entry and w in self._no_create_word:
+ self._no_create_word.pop(w)
@_check_build_status
- def add_word(self, word):
+ def add_word(self, word, no_create_entry=False):
"""
增加一个新词在词典中的出现频率
:param str word: 新词
+ :param bool no_create_entry: 在使用fastNLP.TokenEmbedding加载预训练模型时,没有从预训练词表中找到这个词的处理方式。
+            如果为True,则不会为这个词语创建一个单独的entry,它将一直被指向unk的表示; 如果为False,则为这个词创建一个单独
+            的entry。如果这个word来自于dev或者test,一般设置为True,如果来自于train一般设置为False。以下两种情况: 如果新
+ 加入一个word,且no_create_entry为True,但这个词之前已经在Vocabulary中且并不是no_create_entry的,则还是会为这
+ 个词创建一个单独的vector; 如果no_create_entry为False,但这个词之前已经在Vocabulary中且并不是no_create_entry的,
+ 则这个词将认为是需要创建单独的vector的。
"""
- self.add(word)
+ self.add(word, no_create_entry=no_create_entry)
@_check_build_status
- def add_word_lst(self, word_lst):
+ def add_word_lst(self, word_lst, no_create_entry=False):
"""
依次增加序列中词在词典中的出现频率
:param list[str] word_lst: 词的序列
+ :param bool no_create_entry: 在使用fastNLP.TokenEmbedding加载预训练模型时,没有从预训练词表中找到这个词的处理方式。
+            如果为True,则不会为这个词语创建一个单独的entry,它将一直被指向unk的表示; 如果为False,则为这个词创建一个单独
+            的entry。如果这个word来自于dev或者test,一般设置为True,如果来自于train一般设置为False。以下两种情况: 如果新
+ 加入一个word,且no_create_entry为True,但这个词之前已经在Vocabulary中且并不是no_create_entry的,则还是会为这
+ 个词创建一个单独的vector; 如果no_create_entry为False,但这个词之前已经在Vocabulary中且并不是no_create_entry的,
+ 则这个词将认为是需要创建单独的vector的。
"""
- self.update(word_lst)
+ self.update(word_lst, no_create_entry=no_create_entry)
+ return self
def build_vocab(self):
"""
@@ -133,10 +182,10 @@ class Vocabulary(object):
"""
if self.word2idx is None:
self.word2idx = {}
- if self.padding is not None:
- self.word2idx[self.padding] = len(self.word2idx)
- if self.unknown is not None:
- self.word2idx[self.unknown] = len(self.word2idx)
+ if self.padding is not None:
+ self.word2idx[self.padding] = len(self.word2idx)
+ if self.unknown is not None:
+ self.word2idx[self.unknown] = len(self.word2idx)
max_size = min(self.max_size, len(self.word_count)) if self.max_size else None
words = self.word_count.most_common(max_size)
@@ -148,13 +197,15 @@ class Vocabulary(object):
self.word2idx.update({w: i + start_idx for i, (w, _) in enumerate(words)})
self.build_reverse_vocab()
self.rebuild = False
-
+ return self
+
def build_reverse_vocab(self):
"""
- 基于 "word to index" dict, 构建 "index to word" dict.
+ 基于 `word to index` dict, 构建 `index to word` dict.
"""
self.idx2word = {i: w for w, i in self.word2idx.items()}
+ return self
@_check_build_vocab
def __len__(self):
@@ -205,9 +256,9 @@ class Vocabulary(object):
# remember to use `field_name`
vocab.index_dataset(train_data, dev_data, test_data, field_name='words')
- :param datasets: 需要转index的 class:`~fastNLP.DataSet` , 支持一个或多个(list)
+ :param ~fastNLP.DataSet,List[~fastNLP.DataSet] datasets: 需要转index的一个或多个数据集
:param str field_name: 需要转index的field, 若有多个 DataSet, 每个DataSet都必须有此 field.
- 目前仅支持 ``str`` , ``list(str)`` , ``list(list(str))``
+ 目前仅支持 ``str`` , ``List[str]`` , ``List[List[str]]``
:param str new_field_name: 保存结果的field_name. 若为 ``None`` , 将覆盖原field.
Default: ``None``
"""
@@ -240,19 +291,31 @@ class Vocabulary(object):
raise e
else:
raise RuntimeError("Only DataSet type is allowed.")
+ return self
- def from_dataset(self, *datasets, field_name):
+ @property
+ def _no_create_word_length(self):
+ return len(self._no_create_word)
+
+ def from_dataset(self, *datasets, field_name, no_create_entry_dataset=None):
"""
使用dataset的对应field中词构建词典::
# remember to use `field_name`
vocab.from_dataset(train_data1, train_data2, field_name='words')
- :param datasets: 需要转index的 class:`~fastNLP.DataSet` , 支持一个或多个(list)
- :param field_name: 可为 ``str`` 或 ``list(str)`` .
+ :param ~fastNLP.DataSet,List[~fastNLP.DataSet] datasets: 需要转index的一个或多个数据集
+ :param str,List[str] field_name: 可为 ``str`` 或 ``List[str]`` .
构建词典所使用的 field(s), 支持一个或多个field
若有多个 DataSet, 每个DataSet都必须有这些field.
- 目前仅支持的field结构: ``str`` , ``list(str)`` , ``list(list(str))``
+            目前仅支持的field结构: ``str`` , ``List[str]`` , ``List[List[str]]``
+ :param no_create_entry_dataset: 可以传入DataSet, List[DataSet]或者None(默认),该选项用在接下来的模型会使用pretrain
+ 的embedding(包括glove, word2vec, elmo与bert)且会finetune的情况。如果仅使用来自于train的数据建立vocabulary,会导致test与dev
+ 中的数据无法充分利用到来自于预训练embedding的信息,所以在建立词表的时候将test与dev考虑进来会使得最终的结果更好。
+ 如果一个词出现在了train中,但是没在预训练模型中,embedding会为它用unk初始化,但它是单独的一个vector,如果
+ finetune embedding的话,这个词在更新之后可能会有更好的表示; 而如果这个词仅出现在了dev或test中,那么就不能为它们单独建立vector,
+ 而应该让它指向unk这个vector的值。所以只位于no_create_entry_dataset中的token,将首先从预训练的词表中寻找它的表示,
+ 如果找到了,就使用该表示; 如果没有找到,则认为该词的表示应该为unk的表示。
:return self:
"""
if isinstance(field_name, str):
@@ -260,18 +323,21 @@ class Vocabulary(object):
elif not isinstance(field_name, list):
raise TypeError('invalid argument field_name: {}'.format(field_name))
- def construct_vocab(ins):
+ def construct_vocab(ins, no_create_entry=False):
for fn in field_name:
field = ins[fn]
if isinstance(field, str):
- self.add_word(field)
- elif isinstance(field, list):
- if not isinstance(field[0], list):
- self.add_word_lst(field)
+ self.add_word(field, no_create_entry=no_create_entry)
+ elif isinstance(field, (list, np.ndarray)):
+ if not isinstance(field[0], (list, np.ndarray)):
+ for word in field:
+ self.add_word(word, no_create_entry=no_create_entry)
else:
- if isinstance(field[0][0], list):
+ if isinstance(field[0][0], (list, np.ndarray)):
raise RuntimeError("Only support field with 2 dimensions.")
- [self.add_word_lst(w) for w in field]
+ for words in field:
+ for word in words:
+ self.add_word(word, no_create_entry=no_create_entry)
for idx, dataset in enumerate(datasets):
if isinstance(dataset, DataSet):
@@ -281,13 +347,30 @@ class Vocabulary(object):
print("When processing the `{}` dataset, the following error occurred.".format(idx))
raise e
else:
- raise RuntimeError("Only DataSet type is allowed.")
+ raise TypeError("Only DataSet type is allowed.")
+
+ if no_create_entry_dataset is not None:
+ partial_construct_vocab = partial(construct_vocab, no_create_entry=True)
+ if isinstance(no_create_entry_dataset, DataSet):
+ no_create_entry_dataset.apply(partial_construct_vocab)
+ elif isinstance(no_create_entry_dataset, list):
+ for dataset in no_create_entry_dataset:
+ if not isinstance(dataset, DataSet):
+ raise TypeError("Only DataSet type is allowed.")
+ dataset.apply(partial_construct_vocab)
return self
+ def _is_word_no_create_entry(self, word):
+ """
+ 判断当前的word是否是不需要创建entry的,具体参见from_dataset的说明
+ :param word: str
+ :return: bool
+ """
+ return word in self._no_create_word
+
def to_index(self, w):
"""
- 将词转为数字. 若词不再词典中被记录, 将视为 unknown, 若 ``unknown=None`` , 将抛出
- ``ValueError``::
+        将词转为数字. 若词不在词典中被记录, 将视为 unknown, 若 ``unknown=None`` , 将抛出 ``ValueError`` ::
index = vocab.to_index('abc')
# equals to
@@ -338,6 +421,8 @@ class Vocabulary(object):
self.word2idx = None
self.idx2word = None
self.rebuild = True
+ self._no_create_word.clear()
+ return self
def __getstate__(self):
"""Use to prepare data for pickle.
@@ -359,5 +444,7 @@ class Vocabulary(object):
def __repr__(self):
return "Vocabulary({}...)".format(list(self.word_count.keys())[:5])
+ @_check_build_vocab
def __iter__(self):
- return iter(list(self.word_count.keys()))
+ for word, index in self.word2idx.items():
+ yield word, index
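A sketch of the intended ``from_dataset`` workflow with ``no_create_entry_dataset`` (``train_data``/``dev_data``/``test_data`` are assumed to be ``DataSet`` objects with a ``words`` field): the vocabulary covers all splits, but words only seen in dev/test are marked as no-create-entry, so a pretrained TokenEmbedding will point them to unk instead of giving them a separate trainable vector.

.. code-block:: python

    vocab = Vocabulary()
    vocab.from_dataset(train_data, field_name='words',
                       no_create_entry_dataset=[dev_data, test_data])
    vocab.index_dataset(train_data, dev_data, test_data, field_name='words')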
diff --git a/fastNLP/embeddings/__init__.py b/fastNLP/embeddings/__init__.py
new file mode 100644
index 00000000..2bfb2960
--- /dev/null
+++ b/fastNLP/embeddings/__init__.py
@@ -0,0 +1,26 @@
+"""
+embeddings 模块主要用于从各种预训练的模型中获取词语的分布式表示,目前支持的预训练模型包括word2vec, glove, ELMO, BERT等。这里所有
+embedding的forward输入都是形状为 ``(batch_size, max_len)`` 的torch.LongTensor,输出都是 ``(batch_size, max_len, embedding_dim)`` 的
+torch.FloatTensor。所有的embedding都可以使用 `self.num_embedding` 获取最大的输入index范围, 用 `self.embedding_dim` 或 `self.embed_size` 获取embedding的
+输出维度。
+"""
+
+__all__ = [
+ "Embedding",
+ "StaticEmbedding",
+ "ElmoEmbedding",
+ "BertEmbedding",
+ "StackEmbedding",
+ "LSTMCharEmbedding",
+ "CNNCharEmbedding",
+ "get_embeddings"
+]
+
+
+from .embedding import Embedding
+from .static_embedding import StaticEmbedding
+from .elmo_embedding import ElmoEmbedding
+from .bert_embedding import BertEmbedding
+from .char_embedding import CNNCharEmbedding, LSTMCharEmbedding
+from .stack_embedding import StackEmbedding
+from .utils import get_embeddings
\ No newline at end of file
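A quick sketch of the input/output contract described in the module docstring above (this assumes ``StaticEmbedding`` falls back to a randomly initialised table when ``model_dir_or_name`` is ``None``):

.. code-block:: python

    import torch
    from fastNLP import Vocabulary
    from fastNLP.embeddings import StaticEmbedding

    vocab = Vocabulary().add_word_lst("the weather is good .".split())
    embed = StaticEmbedding(vocab, model_dir_or_name=None, embedding_dim=50)  # no pretrained file
    words = torch.LongTensor([[vocab.to_index(w) for w in "the weather is good .".split()]])
    print(embed(words).size())    # torch.Size([1, 5, 50])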
diff --git a/fastNLP/embeddings/bert_embedding.py b/fastNLP/embeddings/bert_embedding.py
new file mode 100644
index 00000000..aa72898a
--- /dev/null
+++ b/fastNLP/embeddings/bert_embedding.py
@@ -0,0 +1,334 @@
+
+import os
+import collections
+
+from torch import nn
+import torch
+import numpy as np
+from itertools import chain
+
+from ..core.vocabulary import Vocabulary
+from ..io.file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
+from ..modules.encoder.bert import _WordPieceBertModel, BertModel, BertTokenizer
+from .contextual_embedding import ContextualEmbedding
+
+
+class BertEmbedding(ContextualEmbedding):
+ """
+ 别名::class:`fastNLP.embeddings.BertEmbedding` :class:`fastNLP.embeddings.bert_embedding.BertEmbedding`
+
+ 使用BERT对words进行编码的Embedding。建议将输入的words长度限制在430以内,而不要使用512(根据预训练模型参数,可能有变化)。这是由于
+    预训练的bert模型长度限制为512个token,而因为输入的word是未进行word piece分割的(word piece的分割由BertEmbedding在输入word
+ 时切分),在分割之后长度可能会超过最大长度限制。
+
+ BertEmbedding可以支持自动下载权重,当前支持的模型有以下的几种(待补充):
+
+ Example::
+
+ >>> import torch
+ >>> from fastNLP import Vocabulary
+        >>> vocab = Vocabulary().add_word_lst("The weather is good .".split())
+        >>> embed = BertEmbedding(vocab, model_dir_or_name='en-base-uncased', requires_grad=False, layers='4,-2,-1')
+        >>> words = torch.LongTensor([[vocab.to_index(word) for word in "The weather is good .".split()]])
+ >>> outputs = embed(words)
+ >>> outputs.size()
+ >>> # torch.Size([1, 5, 2304])
+
+ :param ~fastNLP.Vocabulary vocab: 词表
+ :param str model_dir_or_name: 模型所在目录或者模型的名称。当传入模型所在目录时,目录中应该包含一个词表文件(以.txt作为后缀名),
+ 权重文件(以.bin作为文件后缀名), 配置文件(以.json作为后缀名)。
+ :param str layers: 输出embedding表示来自于哪些层,不同层的结果按照layers中的顺序在最后一维concat起来。以','隔开层数,可以以负数
+ 去索引倒数几层。
+ :param str pool_method: 因为在bert中,每个word会被表示为多个word pieces, 当获取一个word的表示的时候,怎样从它的word pieces
+ 中计算得到它对应的表示。支持 ``last`` , ``first`` , ``avg`` , ``max``。
+ :param float word_dropout: 以多大的概率将一个词替换为unk。这样既可以训练unk也是一定的regularize。
+ :param float dropout: 以多大的概率对embedding的表示进行Dropout。0.1即随机将10%的值置为0。
+    :param bool include_cls_sep: 在bert计算句子的表示的时候,需要在句子开头加上[CLS]、在结尾加上[SEP],该参数决定是否在结果中保留这两个token的表示。这样
+        会使得word embedding的结果比输入的结果长两个token。如果该值为True,则在使用 :class:`StackEmbedding` 时可能会与其它类型的
+        embedding长度不匹配。
+ :param bool requires_grad: 是否需要gradient以更新Bert的权重。
+ """
+ def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en-base-uncased', layers: str='-1',
+ pool_method: str='first', word_dropout=0, dropout=0, requires_grad: bool=False,
+ include_cls_sep: bool=False):
+ super(BertEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
+
+ # 根据model_dir_or_name检查是否存在并下载
+ if model_dir_or_name.lower() in PRETRAINED_BERT_MODEL_DIR:
+ PRETRAIN_URL = _get_base_url('bert')
+ model_name = PRETRAINED_BERT_MODEL_DIR[model_dir_or_name]
+ model_url = PRETRAIN_URL + model_name
+ model_dir = cached_path(model_url)
+ # 检查是否存在
+ elif os.path.isdir(os.path.expanduser(os.path.abspath(model_dir_or_name))):
+ model_dir = model_dir_or_name
+ else:
+ raise ValueError(f"Cannot recognize {model_dir_or_name}.")
+
+ self.model = _WordBertModel(model_dir=model_dir, vocab=vocab, layers=layers,
+ pool_method=pool_method, include_cls_sep=include_cls_sep)
+
+ self.requires_grad = requires_grad
+ self._embed_size = len(self.model.layers)*self.model.encoder.hidden_size
+
+ def _delete_model_weights(self):
+ del self.model
+
+ def forward(self, words):
+ """
+ 计算words的bert embedding表示。计算之前会在每句话的开始增加[CLS]在结束增加[SEP], 并根据include_cls_sep判断要不要
+ 删除这两个token的表示。
+
+ :param torch.LongTensor words: [batch_size, max_len]
+ :return: torch.FloatTensor. batch_size x max_len x (768*len(self.layers))
+ """
+ words = self.drop_word(words)
+ outputs = self._get_sent_reprs(words)
+ if outputs is not None:
+            return self.dropout(outputs)
+ outputs = self.model(words)
+ outputs = torch.cat([*outputs], dim=-1)
+
+ return self.dropout(outputs)
+
+ @property
+ def requires_grad(self):
+ """
+        Embedding的参数是否允许优化。True: 所有参数允许优化; False: 所有参数不允许优化; None: 部分允许优化、部分不允许
+
+ :return:
+ """
+ requires_grads = set([param.requires_grad for name, param in self.named_parameters()
+ if 'word_pieces_lengths' not in name])
+ if len(requires_grads) == 1:
+ return requires_grads.pop()
+ else:
+ return None
+
+ @requires_grad.setter
+ def requires_grad(self, value):
+ for name, param in self.named_parameters():
+ if 'word_pieces_lengths' in name: # 这个不能加入到requires_grad中
+ continue
+ param.requires_grad = value
+
+
+class BertWordPieceEncoder(nn.Module):
+ """
+    读取bert模型,读取之后调用index_datasets方法在dataset中生成word_pieces这一列。
+
+ :param str model_dir_or_name: 模型所在目录或者模型的名称。默认值为 ``en-base-uncased``
+ :param str layers: 最终结果中的表示。以','隔开层数,可以以负数去索引倒数几层
+ :param bool requires_grad: 是否需要gradient。
+ """
+ def __init__(self, model_dir_or_name: str='en-base-uncased', layers: str='-1',
+ requires_grad: bool=False):
+ super().__init__()
+ PRETRAIN_URL = _get_base_url('bert')
+
+ if model_dir_or_name in PRETRAINED_BERT_MODEL_DIR:
+ model_name = PRETRAINED_BERT_MODEL_DIR[model_dir_or_name]
+ model_url = PRETRAIN_URL + model_name
+ model_dir = cached_path(model_url)
+ # 检查是否存在
+ elif os.path.isdir(model_dir_or_name):
+ model_dir = model_dir_or_name
+ else:
+ raise ValueError(f"Cannot recognize {model_dir_or_name}.")
+
+ self.model = _WordPieceBertModel(model_dir=model_dir, layers=layers)
+ self._embed_size = len(self.model.layers) * self.model.encoder.hidden_size
+ self.requires_grad = requires_grad
+
+ @property
+ def requires_grad(self):
+ """
+        Embedding的参数是否允许优化。True: 所有参数允许优化; False: 所有参数不允许优化; None: 部分允许优化、部分不允许
+ :return:
+ """
+ requires_grads = set([param.requires_grad for name, param in self.named_parameters()])
+ if len(requires_grads) == 1:
+ return requires_grads.pop()
+ else:
+ return None
+
+ @requires_grad.setter
+ def requires_grad(self, value):
+ for name, param in self.named_parameters():
+ param.requires_grad = value
+
+ @property
+ def embed_size(self):
+ return self._embed_size
+
+ def index_datasets(self, *datasets, field_name):
+ """
+ 使用bert的tokenizer新生成word_pieces列加入到datasets中,并将他们设置为input。如果首尾不是
+ [CLS]与[SEP]会在首尾额外加入[CLS]与[SEP], 且将word_pieces这一列的pad value设置为了bert的pad value。
+
+ :param datasets: DataSet对象
+ :param field_name: 基于哪一列的内容生成word_pieces列。这一列中每个数据应该是List[str]的形式。
+ :return:
+ """
+ self.model.index_dataset(*datasets, field_name=field_name)
+
+ def forward(self, word_pieces, token_type_ids=None):
+ """
+ 计算words的bert embedding表示。传入的words中应该自行包含[CLS]与[SEP]的tag。
+
+        :param word_pieces: batch_size x max_len
+ :param token_type_ids: batch_size x max_len, 用于区分前一句和后一句话
+ :return: torch.FloatTensor. batch_size x max_len x (768*len(self.layers))
+ """
+ outputs = self.model(word_pieces, token_type_ids)
+ outputs = torch.cat([*outputs], dim=-1)
+
+ return outputs
+
+
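A hypothetical end-to-end sketch for ``BertWordPieceEncoder``: index the datasets once so that a ``word_pieces`` input column is created, then feed that column to the encoder inside a model (``train_data``/``dev_data`` are assumed to be ``DataSet`` objects with a ``words`` field):

.. code-block:: python

    encoder = BertWordPieceEncoder(model_dir_or_name='en-base-uncased', layers='-1')
    encoder.index_datasets(train_data, dev_data, field_name='words')  # adds a 'word_pieces' input column
    # later, inside a model's forward():
    #     reprs = encoder(word_pieces)   # batch_size x max_len x (768 * n_layers)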
+class _WordBertModel(nn.Module):
+ def __init__(self, model_dir:str, vocab:Vocabulary, layers:str='-1', pool_method:str='first', include_cls_sep:bool=False):
+ super().__init__()
+
+ self.tokenzier = BertTokenizer.from_pretrained(model_dir)
+ self.encoder = BertModel.from_pretrained(model_dir)
+ # 检查encoder_layer_number是否合理
+ encoder_layer_number = len(self.encoder.encoder.layer)
+ self.layers = list(map(int, layers.split(',')))
+ for layer in self.layers:
+ if layer<0:
+ assert -layer<=encoder_layer_number, f"The layer index:{layer} is out of scope for " \
+ f"a bert model with {encoder_layer_number} layers."
+ else:
+                assert layer < encoder_layer_number, f"The layer index:{layer} is out of scope for " \
    ...
+    使用CNN生成character embedding。CNN的结构为, embed(x) -> Dropout(x) -> CNN(x) -> activation(x) -> pool -> fc -> Dropout.
+    不同kernel大小的filter结果concat起来,然后通过一层fully connected layer,输出word的表示。
+
+ Example::
+
+        >>> vocab = Vocabulary().add_word_lst("The weather is good .".split())
+        >>> embed = CNNCharEmbedding(vocab, embed_size=50)
+        >>> words = torch.LongTensor([[vocab.to_index(word) for word in "The weather is good .".split()]])
+ >>> outputs = embed(words)
+ >>> outputs.size()
+ >>> # torch.Size([1, 5,50])
+
+ :param vocab: 词表
+ :param embed_size: 该word embedding的大小,默认值为50.
+ :param char_emb_size: character的embed的大小。character是从vocab中生成的。默认值为50.
+ :param float word_dropout: 以多大的概率将一个词替换为unk。这样既可以训练unk也是一定的regularize。
+ :param float dropout: 以多大的概率drop分布式表示与char embedding的输出。
+ :param filter_nums: filter的数量. 长度需要和kernels一致。默认值为[40, 30, 20].
+ :param kernel_sizes: kernel的大小. 默认值为[5, 3, 1].
+ :param pool_method: character的表示在合成一个表示时所使用的pool方法,支持'avg', 'max'.
+ :param activation: CNN之后使用的激活方法,支持'relu', 'sigmoid', 'tanh' 或者自定义函数.
+ :param min_char_freq: character的最少出现次数。默认值为2.
+ """
+ def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, word_dropout:float=0,
+ dropout:float=0.5, filter_nums: List[int]=(40, 30, 20), kernel_sizes: List[int]=(5, 3, 1),
+ pool_method: str='max', activation='relu', min_char_freq: int=2):
+ super(CNNCharEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
+
+ for kernel in kernel_sizes:
+ assert kernel % 2 == 1, "Only odd kernel is allowed."
+
+ assert pool_method in ('max', 'avg')
+ self.dropout = nn.Dropout(dropout)
+ self.pool_method = pool_method
+ # activation function
+ if isinstance(activation, str):
+ if activation.lower() == 'relu':
+ self.activation = F.relu
+ elif activation.lower() == 'sigmoid':
+ self.activation = F.sigmoid
+ elif activation.lower() == 'tanh':
+ self.activation = F.tanh
+ elif activation is None:
+ self.activation = lambda x: x
+ elif callable(activation):
+ self.activation = activation
+ else:
+ raise Exception(
+ "Undefined activation function: choose from: [relu, tanh, sigmoid, or a callable function]")
+
+ print("Start constructing character vocabulary.")
+ # 建立char的词表
+ self.char_vocab = _construct_char_vocab_from_vocab(vocab, min_freq=min_char_freq)
+ self.char_pad_index = self.char_vocab.padding_idx
+ print(f"In total, there are {len(self.char_vocab)} distinct characters.")
+ # 对vocab进行index
+ max_word_len = max(map(lambda x: len(x[0]), vocab))
+ self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab), max_word_len),
+ fill_value=self.char_pad_index, dtype=torch.long),
+ requires_grad=False)
+ self.word_lengths = nn.Parameter(torch.zeros(len(vocab)).long(), requires_grad=False)
+ for word, index in vocab:
+ # if index!=vocab.padding_idx: # 如果是pad的话,直接就为pad_value了。修改为不区分pad, 这样所有的也是同一个embed
+ self.words_to_chars_embedding[index, :len(word)] = \
+ torch.LongTensor([self.char_vocab.to_index(c) for c in word])
+ self.word_lengths[index] = len(word)
+ self.char_embedding = nn.Embedding(len(self.char_vocab), char_emb_size)
+
+ self.convs = nn.ModuleList([nn.Conv1d(
+ char_emb_size, filter_nums[i], kernel_size=kernel_sizes[i], bias=True, padding=kernel_sizes[i] // 2)
+ for i in range(len(kernel_sizes))])
+ self._embed_size = embed_size
+ self.fc = nn.Linear(sum(filter_nums), embed_size)
+ self.init_param()
+
+ def forward(self, words):
+ """
+ 输入words的index后,生成对应的words的表示。
+
+ :param words: [batch_size, max_len]
+ :return: [batch_size, max_len, embed_size]
+ """
+ words = self.drop_word(words)
+ batch_size, max_len = words.size()
+ chars = self.words_to_chars_embedding[words] # batch_size x max_len x max_word_len
+ word_lengths = self.word_lengths[words] # batch_size x max_len
+ max_word_len = word_lengths.max()
+ chars = chars[:, :, :max_word_len]
+ # 为1的地方为mask
+ chars_masks = chars.eq(self.char_pad_index) # batch_size x max_len x max_word_len 如果为0, 说明是padding的位置了
+ chars = self.char_embedding(chars) # batch_size x max_len x max_word_len x embed_size
+ chars = self.dropout(chars)
+ reshaped_chars = chars.reshape(batch_size*max_len, max_word_len, -1)
+ reshaped_chars = reshaped_chars.transpose(1, 2) # B' x E x M
+ conv_chars = [conv(reshaped_chars).transpose(1, 2).reshape(batch_size, max_len, max_word_len, -1)
+ for conv in self.convs]
+ conv_chars = torch.cat(conv_chars, dim=-1).contiguous() # B x max_len x max_word_len x sum(filters)
+ conv_chars = self.activation(conv_chars)
+ if self.pool_method == 'max':
+ conv_chars = conv_chars.masked_fill(chars_masks.unsqueeze(-1), float('-inf'))
+ chars, _ = torch.max(conv_chars, dim=-2) # batch_size x max_len x sum(filters)
+ else:
+ conv_chars = conv_chars.masked_fill(chars_masks.unsqueeze(-1), 0)
+ chars = torch.sum(conv_chars, dim=-2)/chars_masks.eq(0).sum(dim=-1, keepdim=True).float()
+ chars = self.fc(chars)
+ return self.dropout(chars)
+
+ @property
+ def requires_grad(self):
+ """
+        Embedding的参数是否允许优化。True: 所有参数允许优化; False: 所有参数不允许优化; None: 部分允许优化、部分不允许
+ :return:
+ """
+ params = []
+ for name, param in self.named_parameters():
+ if 'words_to_chars_embedding' not in name and 'word_lengths' not in name:
+ params.append(param.requires_grad)
+ requires_grads = set(params)
+ if len(requires_grads) == 1:
+ return requires_grads.pop()
+ else:
+ return None
+
+ @requires_grad.setter
+ def requires_grad(self, value):
+ for name, param in self.named_parameters():
+ if 'words_to_chars_embedding' in name or 'word_lengths' in name: # 这个不能加入到requires_grad中
+ continue
+ param.requires_grad = value
+
+ def init_param(self):
+ for name, param in self.named_parameters():
+ if 'words_to_chars_embedding' in name or 'word_lengths' in name: # 这个不能reset
+ continue
+ if param.data.dim()>1:
+ nn.init.xavier_uniform_(param, 1)
+ else:
+ nn.init.uniform_(param, -1, 1)
+
+
+class LSTMCharEmbedding(TokenEmbedding):
+ """
+ 别名::class:`fastNLP.embeddings.LSTMCharEmbedding` :class:`fastNLP.embeddings.char_embedding.LSTMCharEmbedding`
+
+ 使用LSTM的方式对character进行encode. embed(x) -> Dropout(x) -> LSTM(x) -> activation(x) -> pool -> Dropout
+
+ Example::
+
+        >>> vocab = Vocabulary().add_word_lst("The weather is good .".split())
+        >>> embed = LSTMCharEmbedding(vocab, embed_size=50)
+        >>> words = torch.LongTensor([[vocab.to_index(word) for word in "The weather is good .".split()]])
+ >>> outputs = embed(words)
+ >>> outputs.size()
+ >>> # torch.Size([1, 5,50])
+
+ :param vocab: 词表
+ :param embed_size: embedding的大小。默认值为50.
+ :param char_emb_size: character的embedding的大小。默认值为50.
+ :param float word_dropout: 以多大的概率将一个词替换为unk。这样既可以训练unk也是一定的regularize。
+ :param dropout: 以多大概率drop character embedding的输出以及最终的word的输出。
+ :param hidden_size: LSTM的中间hidden的大小,如果为bidirectional的,hidden会除二,默认为50.
+ :param pool_method: 支持'max', 'avg'。
+ :param activation: 激活函数,支持'relu', 'sigmoid', 'tanh', 或者自定义函数.
+ :param min_char_freq: character的最小出现次数。默认值为2.
+ :param bidirectional: 是否使用双向的LSTM进行encode。默认值为True。
+ """
+ def __init__(self, vocab: Vocabulary, embed_size: int=50, char_emb_size: int=50, word_dropout:float=0,
+ dropout:float=0.5, hidden_size=50,pool_method: str='max', activation='relu', min_char_freq: int=2,
+ bidirectional=True):
+ super(LSTMCharEmbedding, self).__init__(vocab)
+
+        assert hidden_size % 2 == 0, "Only even hidden_size is allowed."
+
+ assert pool_method in ('max', 'avg')
+ self.pool_method = pool_method
+ self.dropout = nn.Dropout(dropout)
+ # activation function
+ if isinstance(activation, str):
+ if activation.lower() == 'relu':
+ self.activation = F.relu
+ elif activation.lower() == 'sigmoid':
+ self.activation = F.sigmoid
+ elif activation.lower() == 'tanh':
+ self.activation = F.tanh
+ elif activation is None:
+ self.activation = lambda x: x
+ elif callable(activation):
+ self.activation = activation
+ else:
+ raise Exception(
+ "Undefined activation function: choose from: [relu, tanh, sigmoid, or a callable function]")
+
+ print("Start constructing character vocabulary.")
+ # 建立char的词表
+ self.char_vocab = _construct_char_vocab_from_vocab(vocab, min_freq=min_char_freq)
+ self.char_pad_index = self.char_vocab.padding_idx
+ print(f"In total, there are {len(self.char_vocab)} distinct characters.")
+ # 对vocab进行index
+ self.max_word_len = max(map(lambda x: len(x[0]), vocab))
+ self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab), self.max_word_len),
+ fill_value=self.char_pad_index, dtype=torch.long),
+ requires_grad=False)
+ self.word_lengths = nn.Parameter(torch.zeros(len(vocab)).long(), requires_grad=False)
+ for word, index in vocab:
+ # if index!=vocab.padding_idx: # 如果是pad的话,直接就为pad_value了. 修改为不区分pad与否
+ self.words_to_chars_embedding[index, :len(word)] = \
+ torch.LongTensor([self.char_vocab.to_index(c) for c in word])
+ self.word_lengths[index] = len(word)
+ self.char_embedding = nn.Embedding(len(self.char_vocab), char_emb_size)
+
+ self.fc = nn.Linear(hidden_size, embed_size)
+ hidden_size = hidden_size // 2 if bidirectional else hidden_size
+
+ self.lstm = LSTM(char_emb_size, hidden_size, bidirectional=bidirectional, batch_first=True)
+ self._embed_size = embed_size
+ self.bidirectional = bidirectional
+
+ def forward(self, words):
+ """
+ 输入words的index后,生成对应的words的表示。
+
+ :param words: [batch_size, max_len]
+ :return: [batch_size, max_len, embed_size]
+ """
+ words = self.drop_word(words)
+ batch_size, max_len = words.size()
+ chars = self.words_to_chars_embedding[words] # batch_size x max_len x max_word_len
+ word_lengths = self.word_lengths[words] # batch_size x max_len
+ max_word_len = word_lengths.max()
+ chars = chars[:, :, :max_word_len]
+ # 为mask的地方为1
+ chars_masks = chars.eq(self.char_pad_index) # batch_size x max_len x max_word_len 如果为0, 说明是padding的位置了
+ chars = self.char_embedding(chars) # batch_size x max_len x max_word_len x embed_size
+ chars = self.dropout(chars)
+ reshaped_chars = chars.reshape(batch_size * max_len, max_word_len, -1)
+ char_seq_len = chars_masks.eq(0).sum(dim=-1).reshape(batch_size * max_len)
+ lstm_chars = self.lstm(reshaped_chars, char_seq_len)[0].reshape(batch_size, max_len, max_word_len, -1)
+ # B x M x M x H
+
+ lstm_chars = self.activation(lstm_chars)
+ if self.pool_method == 'max':
+ lstm_chars = lstm_chars.masked_fill(chars_masks.unsqueeze(-1), float('-inf'))
+ chars, _ = torch.max(lstm_chars, dim=-2) # batch_size x max_len x H
+ else:
+ lstm_chars = lstm_chars.masked_fill(chars_masks.unsqueeze(-1), 0)
+ chars = torch.sum(lstm_chars, dim=-2) / chars_masks.eq(0).sum(dim=-1, keepdim=True).float()
+
+ chars = self.fc(chars)
+
+ return self.dropout(chars)
+
+ @property
+ def requires_grad(self):
+ """
+        Embedding的参数是否允许优化。True: 所有参数允许优化; False: 所有参数不允许优化; None: 部分允许优化、部分不允许
+
+ :return:
+ """
+ params = []
+ for name, param in self.named_parameters():
+ if 'words_to_chars_embedding' not in name and 'word_lengths' not in name:
+                params.append(param.requires_grad)
+ requires_grads = set(params)
+ if len(requires_grads) == 1:
+ return requires_grads.pop()
+ else:
+ return None
+
+ @requires_grad.setter
+ def requires_grad(self, value):
+ for name, param in self.named_parameters():
+ if 'words_to_chars_embedding' in name or 'word_lengths' in name: # 这个不能加入到requires_grad中
+ continue
+ param.requires_grad = value
diff --git a/fastNLP/embeddings/contextual_embedding.py b/fastNLP/embeddings/contextual_embedding.py
new file mode 100644
index 00000000..1831af4e
--- /dev/null
+++ b/fastNLP/embeddings/contextual_embedding.py
@@ -0,0 +1,100 @@
+
+from abc import abstractmethod
+import torch
+
+from ..core.vocabulary import Vocabulary
+from ..core.dataset import DataSet
+from ..core.batch import DataSetIter
+from ..core.sampler import SequentialSampler
+from ..core.utils import _move_model_to_device, _get_model_device
+from .embedding import TokenEmbedding
+
+
+class ContextualEmbedding(TokenEmbedding):
+ def __init__(self, vocab: Vocabulary, word_dropout:float=0.0, dropout:float=0.0):
+ super(ContextualEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
+
+ def add_sentence_cache(self, *datasets, batch_size=32, device='cpu', delete_weights: bool=True):
+ """
+ 由于动态embedding生成比较耗时,所以可以把每句话embedding缓存下来,这样就不需要每次都运行生成过程。
+
+ :param datasets: DataSet对象
+ :param batch_size: int, 生成cache的sentence表示时使用的batch的大小
+        :param device: 参考 :class:`fastNLP.Trainer` 的device
+        :param delete_weights: 是否在生成了cache之后删除权重,在不需要finetune动态模型的情况下,删除权重会大量减少内存占用。
+ :return:
+ """
+ for index, dataset in enumerate(datasets):
+ try:
+ assert isinstance(dataset, DataSet), "Only fastNLP.DataSet object is allowed."
+ assert 'words' in dataset.get_input_name(), "`words` field has to be set as input."
+ except Exception as e:
+ print(f"Exception happens at {index} dataset.")
+ raise e
+
+ sent_embeds = {}
+ _move_model_to_device(self, device=device)
+ device = _get_model_device(self)
+ pad_index = self._word_vocab.padding_idx
+ print("Start to calculate sentence representations.")
+ with torch.no_grad():
+ for index, dataset in enumerate(datasets):
+ try:
+ batch = DataSetIter(dataset, batch_size=batch_size, sampler=SequentialSampler())
+ for batch_x, batch_y in batch:
+ words = batch_x['words'].to(device)
+ words_list = words.tolist()
+ seq_len = words.ne(pad_index).sum(dim=-1)
+ max_len = words.size(1)
+ # 因为有些情况可能包含CLS, SEP, 从后面往前计算比较安全。
+ seq_len_from_behind = (max_len - seq_len).tolist()
+ word_embeds = self(words).detach().cpu().numpy()
+ for b in range(words.size(0)):
+ length = seq_len_from_behind[b]
+ if length==0:
+ sent_embeds[tuple(words_list[b][:seq_len[b]])] = word_embeds[b]
+ else:
+ sent_embeds[tuple(words_list[b][:seq_len[b]])] = word_embeds[b, :-length]
+ except Exception as e:
+ print(f"Exception happens at {index} dataset.")
+ raise e
+ print("Finish calculating sentence representations.")
+ self.sent_embeds = sent_embeds
+ if delete_weights:
+ self._delete_model_weights()
+
+ def _get_sent_reprs(self, words):
+ """
+ 获取sentence的表示,如果有缓存,则返回缓存的值; 没有缓存则返回None
+
+ :param words: torch.LongTensor
+ :return:
+ """
+ if hasattr(self, 'sent_embeds'):
+ words_list = words.tolist()
+ seq_len = words.ne(self._word_pad_index).sum(dim=-1)
+ _embeds = []
+ for b in range(len(words)):
+ words_i = tuple(words_list[b][:seq_len[b]])
+ embed = self.sent_embeds[words_i]
+ _embeds.append(embed)
+ max_sent_len = max(map(len, _embeds))
+ embeds = words.new_zeros(len(_embeds), max_sent_len, self.embed_size, dtype=torch.float,
+ device=words.device)
+ for i, embed in enumerate(_embeds):
+ embeds[i, :len(embed)] = torch.FloatTensor(embed).to(words.device)
+ return embeds
+ return None
+
+ @abstractmethod
+ def _delete_model_weights(self):
+ """删除计算表示的模型以节省资源"""
+ raise NotImplementedError
+
+ def remove_sentence_cache(self):
+ """
+ 删除缓存的句子表示. 删除之后如果模型权重没有被删除,将开始使用动态计算权重。
+
+ :return:
+ """
+ del self.sent_embeds
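A sketch of the caching workflow described above, using ``ElmoEmbedding`` (defined below) as the contextual embedding; ``train_data``/``dev_data`` are assumed to be ``DataSet`` objects whose ``words`` field is set as input:

.. code-block:: python

    embed = ElmoEmbedding(vocab, model_dir_or_name='en', layers='2', requires_grad=False)
    embed.add_sentence_cache(train_data, dev_data, batch_size=32,
                             device='cuda:0', delete_weights=True)
    # forward() now returns the cached sentence representations; with delete_weights=False,
    # embed.remove_sentence_cache() would switch back to on-the-fly computation.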
diff --git a/fastNLP/embeddings/elmo_embedding.py b/fastNLP/embeddings/elmo_embedding.py
new file mode 100644
index 00000000..af94e8ec
--- /dev/null
+++ b/fastNLP/embeddings/elmo_embedding.py
@@ -0,0 +1,337 @@
+
+import os
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import json
+import codecs
+
+from ..core.vocabulary import Vocabulary
+from ..io.file_utils import cached_path, _get_base_url, PRETRAINED_ELMO_MODEL_DIR
+from ..modules.encoder._elmo import ElmobiLm, ConvTokenEmbedder
+from .contextual_embedding import ContextualEmbedding
+
+
+class ElmoEmbedding(ContextualEmbedding):
+ """
+ 别名::class:`fastNLP.embeddings.ElmoEmbedding` :class:`fastNLP.embeddings.elmo_embedding.ElmoEmbedding`
+
+ 使用ELMo的embedding。初始化之后,只需要传入words就可以得到对应的embedding。当前支持的使用名称初始化的模型有以下的这些(待补充)
+
+ Example::
+
+        >>> vocab = Vocabulary().add_word_lst("The weather is good .".split())
+        >>> # 使用不同层的concat的结果
+        >>> embed = ElmoEmbedding(vocab, model_dir_or_name='en', layers='1,2', requires_grad=False)
+        >>> words = torch.LongTensor([[vocab.to_index(word) for word in "The weather is good .".split()]])
+ >>> outputs = embed(words)
+ >>> outputs.size()
+ >>> # torch.Size([1, 5, 2048])
+
+ >>> # 使用不同层的weighted sum。
+ >>> embed = ElmoEmbedding(vocab, model_dir_or_name='en', layers='mix', requires_grad=False)
+ >>> embed.set_mix_weights_requires_grad() # 使得weighted的权重是可以学习的,但ELMO的LSTM部分是不更新
+
+ :param vocab: 词表
+ :param model_dir_or_name: 可以有两种方式调用预训练好的ELMo embedding:第一种是传入ELMo所在文件夹,该文件夹下面应该有两个文件,
+ 其中一个是以json为后缀的配置文件,另一个是以pkl为后缀的权重文件;第二种是传入ELMo版本的名称,将自动查看缓存中是否存在该模型,
+ 没有的话将自动下载并缓存。
+ :param layers: str, 指定返回的层数, 以,隔开不同的层。如果要返回第二层的结果'2', 返回后两层的结果'1,2'。不同的层的结果
+ 按照这个顺序concat起来,默认为'2'。'mix'会使用可学习的权重结合不同层的表示(权重是否可训练与requires_grad保持一致,
+ 初始化权重对三层结果进行mean-pooling, 可以通过ElmoEmbedding.set_mix_weights_requires_grad()方法只将mix weights设置为可学习。)
+ :param requires_grad: bool, 该层是否需要gradient, 默认为False.
+ :param float word_dropout: 以多大的概率将一个词替换为unk。这样既可以训练unk也是一定的regularize。
+ :param float dropout: 以多大的概率对embedding的表示进行Dropout。0.1即随机将10%的值置为0。
+ :param cache_word_reprs: 可以选择对word的表示进行cache; 设置为True的话,将在初始化的时候为每个word生成对应的embedding,
+ 并删除character encoder,之后将直接使用cache的embedding。默认为False。
+ """
+
+ def __init__(self, vocab: Vocabulary, model_dir_or_name: str = 'en', layers: str = '2', requires_grad: bool = False,
+ word_dropout=0.0, dropout=0.0, cache_word_reprs: bool = False):
+ super(ElmoEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
+
+ # 根据model_dir_or_name检查是否存在并下载
+ if model_dir_or_name.lower() in PRETRAINED_ELMO_MODEL_DIR:
+ PRETRAIN_URL = _get_base_url('elmo')
+ model_name = PRETRAINED_ELMO_MODEL_DIR[model_dir_or_name]
+ model_url = PRETRAIN_URL + model_name
+ model_dir = cached_path(model_url)
+ # 检查是否存在
+ elif os.path.isdir(os.path.expanduser(os.path.abspath(model_dir_or_name))):
+ model_dir = model_dir_or_name
+ else:
+ raise ValueError(f"Cannot recognize {model_dir_or_name}.")
+ self.model = _ElmoModel(model_dir, vocab, cache_word_reprs=cache_word_reprs)
+
+ if layers == 'mix':
+ self.layer_weights = nn.Parameter(torch.zeros(self.model.config['lstm']['n_layers'] + 1),
+ requires_grad=requires_grad)
+ self.gamma = nn.Parameter(torch.ones(1), requires_grad=requires_grad)
+ self._get_outputs = self._get_mixed_outputs
+ self._embed_size = self.model.config['lstm']['projection_dim'] * 2
+ else:
+ layers = list(map(int, layers.split(',')))
+ assert len(layers) > 0, "Must choose one output"
+ for layer in layers:
+ assert 0 <= layer <= 2, "Layer index should be in range [0, 2]."
+ self.layers = layers
+ self._get_outputs = self._get_layer_outputs
+ self._embed_size = len(self.layers) * self.model.config['lstm']['projection_dim'] * 2
+
+ self.requires_grad = requires_grad
+
+ def _get_mixed_outputs(self, outputs):
+ # outputs: num_layers x batch_size x max_len x hidden_size
+ # return: batch_size x max_len x hidden_size
+ weights = F.softmax(self.layer_weights + 1 / len(outputs), dim=0).to(outputs)
+ outputs = torch.einsum('l,lbij->bij', weights, outputs)
+ return self.gamma.to(outputs) * outputs
+
+ def set_mix_weights_requires_grad(self, flag=True):
+ """
+ 当初始化ElmoEmbedding时layers被设置为mix时,可以通过调用该方法设置mix weights是否可训练。如果layers不是mix,调用
+ 该方法没有用。
+
+ :param bool flag: 混合不同层表示的结果是否可以训练。
+ :return:
+ """
+ if hasattr(self, 'layer_weights'):
+ self.layer_weights.requires_grad = flag
+ self.gamma.requires_grad = flag
+
+ def _get_layer_outputs(self, outputs):
+ if len(self.layers) == 1:
+ outputs = outputs[self.layers[0]]
+ else:
+ outputs = torch.cat(tuple([*outputs[self.layers]]), dim=-1)
+
+ return outputs
+
+ def forward(self, words: torch.LongTensor):
+ """
+        计算words的elmo embedding表示。根据elmo文章中的介绍,ELMo实际上有2L+1层结果,但是为了让结果比较容易拆分,token的
+        embedding被重复了一次,使得实际上layer=0的结果是[token_embedding;token_embedding], 而layer=1的结果是[forward_hiddens;
+        backward_hiddens].
+
+ :param words: batch_size x max_len
+ :return: torch.FloatTensor. batch_size x max_len x (512*len(self.layers))
+ """
+ words = self.drop_word(words)
+ outputs = self._get_sent_reprs(words)
+ if outputs is not None:
+ return self.dropout(outputs)
+ outputs = self.model(words)
+ outputs = self._get_outputs(outputs)
+ return self.dropout(outputs)
+
+ def _delete_model_weights(self):
+ for name in ['layers', 'model', 'layer_weights', 'gamma']:
+ if hasattr(self, name):
+ delattr(self, name)
+
+ @property
+ def requires_grad(self):
+ """
+        Embedding的参数是否允许优化。True: 所有参数允许优化; False: 所有参数不允许优化; None: 部分允许优化、部分不允许
+
+ :return:
+ """
+ requires_grads = set([param.requires_grad for name, param in self.named_parameters()
+ if 'words_to_chars_embedding' not in name and 'words_to_words' not in name])
+ if len(requires_grads) == 1:
+ return requires_grads.pop()
+ else:
+ return None
+
+ @requires_grad.setter
+ def requires_grad(self, value):
+ for name, param in self.named_parameters():
+ if 'words_to_chars_embedding' in name or 'words_to_words' in name: # 这个不能加入到requires_grad中
+ continue
+ param.requires_grad = value
+
+
+class _ElmoModel(nn.Module):
+ """
+ 该Module是ElmoEmbedding中进行所有的heavy lifting的地方。做的工作,包括
+ (1) 根据配置,加载模型;
+ (2) 根据vocab,对模型中的embedding进行调整. 并将其正确初始化
+ (3) 保存一个words与chars的对应转换,获取时自动进行相应的转换
+ (4) 设计一个保存token的embedding,允许缓存word的表示。
+
+ """
+
+ def __init__(self, model_dir: str, vocab: Vocabulary = None, cache_word_reprs: bool = False):
+ super(_ElmoModel, self).__init__()
+ self.model_dir = model_dir
+ dir = os.walk(self.model_dir)
+ config_file = None
+ weight_file = None
+ config_count = 0
+ weight_count = 0
+ for path, dir_list, file_list in dir:
+ for file_name in file_list:
+                if ".json" in file_name:
+ config_file = file_name
+ config_count += 1
+                elif ".pkl" in file_name:
+ weight_file = file_name
+ weight_count += 1
+ if config_count > 1 or weight_count > 1:
+            raise Exception(f"Multiple config files(*.json) or weight files(*.pkl) detected in {model_dir}.")
+ elif config_count == 0 or weight_count == 0:
+ raise Exception(f"No config file or weight file found in {model_dir}")
+
+ config = json.load(open(os.path.join(model_dir, config_file), 'r'))
+ self.weight_file = os.path.join(model_dir, weight_file)
+ self.config = config
+
+        OOV_TAG = '<oov>'
+        PAD_TAG = '<pad>'
+        BOS_TAG = '<bos>'
+        EOS_TAG = '<eos>'
+        BOW_TAG = '<bow>'
+        EOW_TAG = '<eow>'
+
+ # For the model trained with character-based word encoder.
+ char_lexicon = {}
+ with codecs.open(os.path.join(model_dir, 'char.dic'), 'r', encoding='utf-8') as fpi:
+ for line in fpi:
+ tokens = line.strip().split('\t')
+ if len(tokens) == 1:
+ tokens.insert(0, '\u3000')
+ token, i = tokens
+ char_lexicon[token] = int(i)
+
+ # 做一些sanity check
+ for special_word in [PAD_TAG, OOV_TAG, BOW_TAG, EOW_TAG]:
+ assert special_word in char_lexicon, f"{special_word} not found in char.dic."
+
+ # 从vocab中构建char_vocab
+ char_vocab = Vocabulary(unknown=OOV_TAG, padding=PAD_TAG)
+        # 需要保证<bow>与<eow>在char_vocab里面
+ char_vocab.add_word_lst([BOW_TAG, EOW_TAG, BOS_TAG, EOS_TAG])
+
+ for word, index in vocab:
+ char_vocab.add_word_lst(list(word))
+
+ self.bos_index, self.eos_index, self._pad_index = len(vocab), len(vocab) + 1, vocab.padding_idx
+ # 根据char_lexicon调整, 多设置一位,是预留给word padding的(该位置的char表示为全0表示)
+ char_emb_layer = nn.Embedding(len(char_vocab) + 1, int(config['char_cnn']['embedding']['dim']),
+ padding_idx=len(char_vocab))
+
+ # 读入预训练权重 这里的elmo_model 包含char_cnn和 lstm 的 state_dict
+ elmo_model = torch.load(os.path.join(self.model_dir, weight_file), map_location='cpu')
+
+ char_embed_weights = elmo_model["char_cnn"]['char_emb_layer.weight']
+
+ found_char_count = 0
+ for char, index in char_vocab: # 调整character embedding
+ if char in char_lexicon:
+ index_in_pre = char_lexicon.get(char)
+ found_char_count += 1
+ else:
+ index_in_pre = char_lexicon[OOV_TAG]
+ char_emb_layer.weight.data[index] = char_embed_weights[index_in_pre]
+
+ print(f"{found_char_count} out of {len(char_vocab)} characters were found in pretrained elmo embedding.")
+ # 生成words到chars的映射
+ max_chars = config['char_cnn']['max_characters_per_token']
+
+ self.words_to_chars_embedding = nn.Parameter(torch.full((len(vocab) + 2, max_chars),
+ fill_value=len(char_vocab),
+ dtype=torch.long),
+ requires_grad=False)
+ for word, index in list(iter(vocab)) + [(BOS_TAG, len(vocab)), (EOS_TAG, len(vocab) + 1)]:
+ if len(word) + 2 > max_chars:
+ word = word[:max_chars - 2]
+ if index == self._pad_index:
+ continue
+ elif word == BOS_TAG or word == EOS_TAG:
+ char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(word)] + [
+ char_vocab.to_index(EOW_TAG)]
+ char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
+ else:
+ char_ids = [char_vocab.to_index(BOW_TAG)] + [char_vocab.to_index(c) for c in word] + [
+ char_vocab.to_index(EOW_TAG)]
+ char_ids += [char_vocab.to_index(PAD_TAG)] * (max_chars - len(char_ids))
+ self.words_to_chars_embedding[index] = torch.LongTensor(char_ids)
+
+ self.char_vocab = char_vocab
+
+ self.token_embedder = ConvTokenEmbedder(
+ config, self.weight_file, None, char_emb_layer)
+ elmo_model["char_cnn"]['char_emb_layer.weight'] = char_emb_layer.weight
+ self.token_embedder.load_state_dict(elmo_model["char_cnn"])
+
+ self.output_dim = config['lstm']['projection_dim']
+
+ # lstm encoder
+ self.encoder = ElmobiLm(config)
+ self.encoder.load_state_dict(elmo_model["lstm"])
+
+ if cache_word_reprs:
+ if config['char_cnn']['embedding']['dim'] > 0: # 只有在使用了chars的情况下有用
+ print("Start to generate cache word representations.")
+ batch_size = 320
+ # bos eos
+ word_size = self.words_to_chars_embedding.size(0)
+ num_batches = word_size // batch_size + \
+ int(word_size % batch_size != 0)
+
+ self.cached_word_embedding = nn.Embedding(word_size,
+ config['lstm']['projection_dim'])
+ with torch.no_grad():
+ for i in range(num_batches):
+ words = torch.arange(i * batch_size,
+ min((i + 1) * batch_size, word_size)).long()
+ chars = self.words_to_chars_embedding[words].unsqueeze(1) # batch_size x 1 x max_chars
+ word_reprs = self.token_embedder(words.unsqueeze(1),
+ chars).detach() # batch_size x 1 x config['encoder']['projection_dim']
+ self.cached_word_embedding.weight.data[words] = word_reprs.squeeze(1)
+
+ print("Finish generating cached word representations. Going to delete the character encoder.")
+ del self.token_embedder, self.words_to_chars_embedding
+ else:
+ print("There is no need to cache word representations, since no character information is used.")
+
+ def forward(self, words):
+ """
+
+ :param words: batch_size x max_len
+ :return: num_layers x batch_size x max_len x hidden_size
+ """
+        # 在句首、句尾扩展<bos>, <eos>
+ batch_size, max_len = words.size()
+ expanded_words = words.new_zeros(batch_size, max_len + 2) # 因为pad一定为0,
+ seq_len = words.ne(self._pad_index).sum(dim=-1)
+ expanded_words[:, 1:-1] = words
+ expanded_words[:, 0].fill_(self.bos_index)
+ expanded_words[torch.arange(batch_size).to(words), seq_len + 1] = self.eos_index
+ seq_len = seq_len + 2
+ zero_tensor = expanded_words.new_zeros(expanded_words.shape)
+ mask = (expanded_words == zero_tensor).unsqueeze(-1)
+ if hasattr(self, 'cached_word_embedding'):
+ token_embedding = self.cached_word_embedding(expanded_words)
+ else:
+ if hasattr(self, 'words_to_chars_embedding'):
+ chars = self.words_to_chars_embedding[expanded_words]
+ else:
+ chars = None
+ token_embedding = self.token_embedder(expanded_words, chars) # batch_size x max_len x embed_dim
+
+ encoder_output = self.encoder(token_embedding, seq_len)
+ if encoder_output.size(2) < max_len + 2:
+ num_layers, _, output_len, hidden_size = encoder_output.size()
+ dummy_tensor = encoder_output.new_zeros(num_layers, batch_size,
+ max_len + 2 - output_len, hidden_size)
+ encoder_output = torch.cat((encoder_output, dummy_tensor), 2)
+ sz = encoder_output.size() # 2, batch_size, max_len, hidden_size
+ token_embedding = token_embedding.masked_fill(mask, 0)
+ token_embedding = torch.cat((token_embedding, token_embedding), dim=2).view(1, sz[1], sz[2], sz[3])
+ encoder_output = torch.cat((token_embedding, encoder_output), dim=0)
+
+        # 删除<bos>, <eos>. 这里没有精确地删除,但应该也不会影响最后的结果了。
+ encoder_output = encoder_output[:, :, 1:-1]
+ return encoder_output
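For reference, the scalar mix applied when `layers='mix'` (see `_get_mixed_outputs` above) can be reproduced with plain tensors. The sketch below uses made-up shapes and random values, so it is independent of any pretrained ELMo weights; only the softmax/einsum/gamma arithmetic mirrors the implementation.

```python
import torch
import torch.nn.functional as F

# Made-up sizes: 3 "layers" (token embedding + 2 LSTM layers), hidden = 2 * projection_dim
num_layers, batch_size, max_len, hidden = 3, 2, 5, 1024
outputs = torch.randn(num_layers, batch_size, max_len, hidden)

layer_weights = torch.zeros(num_layers, requires_grad=True)  # plays the role of self.layer_weights
gamma = torch.ones(1, requires_grad=True)                    # plays the role of self.gamma

weights = F.softmax(layer_weights + 1 / num_layers, dim=0)   # starts out as a uniform (mean-pooling) mix
mixed = gamma * torch.einsum('l,lbij->bij', weights, outputs)
print(mixed.shape)  # torch.Size([2, 5, 1024])
```

Calling `set_mix_weights_requires_grad()` simply toggles `requires_grad` on these two parameters without touching the LSTM weights.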
diff --git a/fastNLP/embeddings/embedding.py b/fastNLP/embeddings/embedding.py
new file mode 100644
index 00000000..111bacd0
--- /dev/null
+++ b/fastNLP/embeddings/embedding.py
@@ -0,0 +1,195 @@
+"""
+该模块中的Embedding主要用于随机初始化的embedding(更推荐使用 :class:`fastNLP.embeddings.StaticEmbedding` ),或按照预训练权重初始化Embedding。
+
+"""
+
+
+import torch.nn as nn
+from abc import abstractmethod
+import torch
+
+from .utils import get_embeddings
+
+
+class Embedding(nn.Module):
+ """
+ 别名::class:`fastNLP.embeddings.Embedding` :class:`fastNLP.embeddings.embedding.Embedding`
+
+ 词向量嵌入,支持输入多种方式初始化. 可以通过self.num_embeddings获取词表大小; self.embedding_dim获取embedding的维度.
+
+ Example::
+
+ >>> import numpy as np
+ >>> init_embed = (2000, 100)
+ >>> embed = Embedding(init_embed) # 随机初始化一个具有2000个词,每个词表示为100维的词向量
+ >>> init_embed = np.zeros((2000, 100))
+ >>> embed = Embedding(init_embed) # 使用numpy.ndarray的值作为初始化值初始化一个Embedding
+
+ :param tuple(int,int),torch.FloatTensor,nn.Embedding,numpy.ndarray init_embed: 支持传入Embedding的大小(传入tuple(int, int),
+        第一个int为vocab_size, 第二个int为embed_dim); 或传入Tensor, Embedding, numpy.ndarray等则直接使用该值初始化Embedding;
+ :param float word_dropout: 按照一定概率随机将word设置为unk_index,这样可以使得unk这个token得到足够的训练, 且会对网络有
+ 一定的regularize的作用。设置该值时,必须同时设置unk_index
+ :param float dropout: 对Embedding的输出的dropout。
+ :param int unk_index: drop word时替换为的index。fastNLP的Vocabulary的unk_index默认为1。
+ """
+
+ def __init__(self, init_embed, word_dropout=0, dropout=0.0, unk_index=None):
+
+ super(Embedding, self).__init__()
+
+ self.embed = get_embeddings(init_embed)
+
+ self.dropout = nn.Dropout(dropout)
+ if not isinstance(self.embed, TokenEmbedding):
+ self._embed_size = self.embed.weight.size(1)
+ if word_dropout>0 and not isinstance(unk_index, int):
+ raise ValueError("When drop word is set, you need to pass in the unk_index.")
+ else:
+ self._embed_size = self.embed.embed_size
+ unk_index = self.embed.get_word_vocab().unknown_idx
+ self.unk_index = unk_index
+ self.word_dropout = word_dropout
+
+ def forward(self, words):
+ """
+ :param torch.LongTensor words: [batch, seq_len]
+ :return: torch.Tensor : [batch, seq_len, embed_dim]
+ """
+ if self.word_dropout>0 and self.training:
+ mask = torch.ones_like(words).float() * self.word_dropout
+ mask = torch.bernoulli(mask).byte() # dropout_word越大,越多位置为1
+ words = words.masked_fill(mask, self.unk_index)
+ words = self.embed(words)
+ return self.dropout(words)
+
+ @property
+ def num_embedding(self)->int:
+ if isinstance(self.embed, nn.Embedding):
+ return self.embed.weight.size(0)
+ else:
+ return self.embed.num_embedding
+
+ def __len__(self):
+ return len(self.embed)
+
+ @property
+ def embed_size(self) -> int:
+ return self._embed_size
+
+ @property
+ def embedding_dim(self) -> int:
+ return self._embed_size
+
+ @property
+ def requires_grad(self):
+ """
+        Embedding的参数是否允许优化。True: 所有参数允许优化; False: 所有参数不允许优化; None: 部分允许优化、部分不允许
+ :return:
+ """
+ if not isinstance(self.embed, TokenEmbedding):
+ return self.embed.weight.requires_grad
+ else:
+ return self.embed.requires_grad
+
+ @requires_grad.setter
+ def requires_grad(self, value):
+ if not isinstance(self.embed, TokenEmbedding):
+ self.embed.weight.requires_grad = value
+ else:
+ self.embed.requires_grad = value
+
+ @property
+ def size(self):
+ if isinstance(self.embed, TokenEmbedding):
+ return self.embed.size
+ else:
+ return self.embed.weight.size()
+
+
+class TokenEmbedding(nn.Module):
+ def __init__(self, vocab, word_dropout=0.0, dropout=0.0):
+ super(TokenEmbedding, self).__init__()
+ if vocab.rebuild:
+ vocab.build_vocab()
+ assert vocab.padding is not None, "Vocabulary must have a padding entry."
+ self._word_vocab = vocab
+ self._word_pad_index = vocab.padding_idx
+ if word_dropout>0:
+ assert vocab.unknown is not None, "Vocabulary must have unknown entry when you want to drop a word."
+ self.word_dropout = word_dropout
+ self._word_unk_index = vocab.unknown_idx
+ self.dropout_layer = nn.Dropout(dropout)
+
+ def drop_word(self, words):
+ """
+ 按照设定随机将words设置为unknown_index。
+
+ :param torch.LongTensor words: batch_size x max_len
+ :return:
+ """
+ if self.word_dropout > 0 and self.training:
+ mask = torch.ones_like(words).float() * self.word_dropout
+ mask = torch.bernoulli(mask).byte() # dropout_word越大,越多位置为1
+ words = words.masked_fill(mask, self._word_unk_index)
+ return words
+
+ def dropout(self, words):
+ """
+ 对embedding后的word表示进行drop。
+
+ :param torch.FloatTensor words: batch_size x max_len x embed_size
+ :return:
+ """
+ return self.dropout_layer(words)
+
+ @property
+ def requires_grad(self):
+ """
+        Embedding的参数是否允许优化。True: 所有参数允许优化; False: 所有参数不允许优化; None: 部分允许优化、部分不允许
+ :return:
+ """
+ requires_grads = set([param.requires_grad for param in self.parameters()])
+ if len(requires_grads) == 1:
+ return requires_grads.pop()
+ else:
+ return None
+
+ @requires_grad.setter
+ def requires_grad(self, value):
+ for param in self.parameters():
+ param.requires_grad = value
+
+ def __len__(self):
+ return len(self._word_vocab)
+
+ @property
+ def embed_size(self) -> int:
+ return self._embed_size
+
+ @property
+ def embedding_dim(self) -> int:
+ return self._embed_size
+
+ @property
+ def num_embedding(self) -> int:
+ """
+ 这个值可能会大于实际的embedding矩阵的大小。
+ :return:
+ """
+ return len(self._word_vocab)
+
+ def get_word_vocab(self):
+ """
+ 返回embedding的词典。
+
+ :return: Vocabulary
+ """
+ return self._word_vocab
+
+ @property
+ def size(self):
+        return torch.Size([self.num_embedding, self._embed_size])
+
+ @abstractmethod
+ def forward(self, words):
+ raise NotImplementedError
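The word-level dropout shared by `Embedding.forward` and `TokenEmbedding.drop_word` is just a Bernoulli mask followed by `masked_fill` with the unk index. A self-contained sketch (sizes and indices are made up, and `.bool()` is used where the code above uses the older `.byte()`):

```python
import torch

word_dropout, unk_index = 0.3, 1
words = torch.randint(2, 100, (4, 6))    # batch_size x max_len, real word ids start at 2
mask = torch.bernoulli(torch.full_like(words, word_dropout, dtype=torch.float)).bool()
dropped = words.masked_fill(mask, unk_index)   # masked positions are trained as unk
print((dropped == unk_index).float().mean())   # roughly equal to word_dropout
```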
diff --git a/fastNLP/embeddings/stack_embedding.py b/fastNLP/embeddings/stack_embedding.py
new file mode 100644
index 00000000..8091d598
--- /dev/null
+++ b/fastNLP/embeddings/stack_embedding.py
@@ -0,0 +1,94 @@
+from typing import List
+
+import torch
+from torch import nn as nn
+
+from .embedding import TokenEmbedding
+
+
+class StackEmbedding(TokenEmbedding):
+ """
+ 别名::class:`fastNLP.embeddings.StackEmbedding` :class:`fastNLP.embeddings.stack_embedding.StackEmbedding`
+
+ 支持将多个embedding集合成一个embedding。
+
+ Example::
+
+ >>> from fastNLP import Vocabulary
+ >>> from fastNLP.embeddings import StaticEmbedding
+ >>> vocab = Vocabulary().add_word_lst("The whether is good .".split())
+ >>> embed_1 = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50', requires_grad=True)
+ >>> embed_2 = StaticEmbedding(vocab, model_dir_or_name='en-word2vec-300', requires_grad=True)
+
+ :param embeds: 一个由若干个TokenEmbedding组成的list,要求每一个TokenEmbedding的词表都保持一致
+ :param float word_dropout: 以多大的概率将一个词替换为unk。这样既可以训练unk也是一定的regularize。不同embedidng会在相同的位置
+ 被设置为unknown。如果这里设置了dropout,则组成的embedding就不要再设置dropout了。
+ :param float dropout: 以多大的概率对embedding的表示进行Dropout。0.1即随机将10%的值置为0。
+
+ """
+ def __init__(self, embeds: List[TokenEmbedding], word_dropout=0, dropout=0):
+ vocabs = []
+ for embed in embeds:
+ if hasattr(embed, 'get_word_vocab'):
+ vocabs.append(embed.get_word_vocab())
+ _vocab = vocabs[0]
+ for vocab in vocabs[1:]:
+ assert vocab == _vocab, "All embeddings in StackEmbedding should use the same word vocabulary."
+
+ super(StackEmbedding, self).__init__(_vocab, word_dropout=word_dropout, dropout=dropout)
+ assert isinstance(embeds, list)
+ for embed in embeds:
+ assert isinstance(embed, TokenEmbedding), "Only TokenEmbedding type is supported."
+ self.embeds = nn.ModuleList(embeds)
+ self._embed_size = sum([embed.embed_size for embed in self.embeds])
+
+ def append(self, embed: TokenEmbedding):
+ """
+ 添加一个embedding到结尾。
+ :param embed:
+ :return:
+ """
+ assert isinstance(embed, TokenEmbedding)
+ self.embeds.append(embed)
+
+ def pop(self):
+ """
+ 弹出最后一个embed
+ :return:
+ """
+ return self.embeds.pop()
+
+ @property
+ def embed_size(self):
+ return self._embed_size
+
+ @property
+ def requires_grad(self):
+ """
+        Embedding的参数是否允许优化。True: 所有参数允许优化; False: 所有参数不允许优化; None: 部分允许优化、部分不允许
+ :return:
+ """
+        requires_grads = set([embed.requires_grad for embed in self.embeds])
+ if len(requires_grads) == 1:
+ return requires_grads.pop()
+ else:
+ return None
+
+ @requires_grad.setter
+ def requires_grad(self, value):
+        for embed in self.embeds:
+ embed.requires_grad = value
+
+ def forward(self, words):
+ """
+ 得到多个embedding的结果,并把结果按照顺序concat起来。
+
+ :param words: batch_size x max_len
+ :return: 返回的shape和当前这个stack embedding中embedding的组成有关
+ """
+ outputs = []
+ words = self.drop_word(words)
+ for embed in self.embeds:
+ outputs.append(embed(words))
+ outputs = self.dropout(torch.cat(outputs, dim=-1))
+ return outputs
\ No newline at end of file
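A hedged usage sketch for `StackEmbedding`: it assumes fastNLP from this branch is installed, that the `en-glove-6b-50` and `en-word2vec-300` vectors mentioned in the docstring above can be downloaded, and that their dimensions are 50 and 300 as the names suggest.

```python
import torch
from fastNLP import Vocabulary
from fastNLP.embeddings import StaticEmbedding, StackEmbedding

vocab = Vocabulary().add_word_lst("The whether is good .".split())
glove = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50')
w2v = StaticEmbedding(vocab, model_dir_or_name='en-word2vec-300')
stacked = StackEmbedding([glove, w2v], dropout=0.1)

words = torch.LongTensor([[vocab.to_index(w) for w in "The whether is good .".split()]])
print(stacked.embed_size)    # 350: the per-embedding sizes are summed
print(stacked(words).shape)  # (1, 5, 350): outputs are concatenated on the last dim
```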
diff --git a/fastNLP/embeddings/static_embedding.py b/fastNLP/embeddings/static_embedding.py
new file mode 100644
index 00000000..94f7adb5
--- /dev/null
+++ b/fastNLP/embeddings/static_embedding.py
@@ -0,0 +1,255 @@
+
+import os
+
+import torch
+import torch.nn as nn
+import numpy as np
+import warnings
+
+from ..core.vocabulary import Vocabulary
+from ..io.file_utils import PRETRAIN_STATIC_FILES, _get_base_url, cached_path
+from .embedding import TokenEmbedding
+from ..modules.utils import _get_file_name_base_on_postfix
+
+class StaticEmbedding(TokenEmbedding):
+ """
+ 别名::class:`fastNLP.embeddings.StaticEmbedding` :class:`fastNLP.embeddings.static_embedding.StaticEmbedding`
+
+ StaticEmbedding组件. 给定预训练embedding的名称或路径,根据vocab从embedding中抽取相应的数据(只会将出现在vocab中的词抽取出来,
+ 如果没有找到,则会随机初始化一个值(但如果该word是被标记为no_create_entry的话,则不会单独创建一个值,而是会被指向unk的index))。
+ 当前支持自动下载的预训练vector有以下的几种(待补充);
+
+ Example::
+
+ >>> vocab = Vocabulary().add_word_lst("The whether is good .".split())
+ >>> embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-50')
+
+ >>> vocab = Vocabulary().add_word_lst(["The", 'the', "THE"])
+ >>> embed = StaticEmbedding(vocab, model_dir_or_name="en-glove-50", lower=True)
+ >>> # "the", "The", "THE"它们共用一个vector,且将使用"the"在预训练词表中寻找它们的初始化表示。
+
+ >>> vocab = Vocabulary().add_word_lst(["The", "the", "THE"])
+ >>> embed = StaticEmbedding(vocab, model_dir_or_name=None, embedding_dim=5, lower=True)
+ >>> words = torch.LongTensor([[vocab.to_index(word) for word in ["The", "the", "THE"]]])
+ >>> embed(words)
+        >>> tensor([[[ 0.5773, 0.7251, -0.3104, 0.0777, 0.4849],
+                     [ 0.5773, 0.7251, -0.3104, 0.0777, 0.4849],
+                     [ 0.5773, 0.7251, -0.3104, 0.0777, 0.4849]]],
+                    grad_fn=<EmbeddingBackward>) # 每种word的输出是一致的。
+
+ :param vocab: Vocabulary. 若该项为None则会读取所有的embedding。
+ :param model_dir_or_name: 可以有两种方式调用预训练好的static embedding:第一种是传入embedding文件夹(文件夹下应该只有一个
+ 以.txt作为后缀的文件)或文件路径;第二种是传入embedding的名称,第二种情况将自动查看缓存中是否存在该模型,没有的话将自动下载。
+ 如果输入为None则使用embedding_dim的维度随机初始化一个embedding。
+ :param int embedding_dim: 随机初始化的embedding的维度,仅在model_dir_or_name为None时有效。
+ :param bool requires_grad: 是否需要gradient. 默认为True
+ :param callable init_method: 如何初始化没有找到的值。可以使用torch.nn.init.*中各种方法。调用该方法时传入一个tensor对象。
+ :param bool lower: 是否将vocab中的词语小写后再和预训练的词表进行匹配。如果你的词表中包含大写的词语,或者就是需要单独
+ 为大写的词语开辟一个vector表示,则将lower设置为False。
+ :param float word_dropout: 以多大的概率将一个词替换为unk。这样既可以训练unk也是一定的regularize。
+ :param float dropout: 以多大的概率对embedding的表示进行Dropout。0.1即随机将10%的值置为0。
+ :param bool normalize: 是否对vector进行normalize,使得每个vector的norm为1。
+ """
+ def __init__(self, vocab: Vocabulary, model_dir_or_name: str='en', embedding_dim=100, requires_grad: bool=True,
+ init_method=None, lower=False, dropout=0, word_dropout=0, normalize=False):
+ super(StaticEmbedding, self).__init__(vocab, word_dropout=word_dropout, dropout=dropout)
+
+ # 得到cache_path
+ if model_dir_or_name is None:
+ assert embedding_dim>=1, "The dimension of embedding should be larger than 1."
+ embedding_dim = int(embedding_dim)
+ model_path = None
+ elif model_dir_or_name.lower() in PRETRAIN_STATIC_FILES:
+ PRETRAIN_URL = _get_base_url('static')
+ model_name = PRETRAIN_STATIC_FILES[model_dir_or_name]
+ model_url = PRETRAIN_URL + model_name
+ model_path = cached_path(model_url)
+ # 检查是否存在
+ elif os.path.isfile(os.path.expanduser(os.path.abspath(model_dir_or_name))):
+ model_path = model_dir_or_name
+ elif os.path.isdir(os.path.expanduser(os.path.abspath(model_dir_or_name))):
+ model_path = _get_file_name_base_on_postfix(model_dir_or_name, '.txt')
+ else:
+ raise ValueError(f"Cannot recognize {model_dir_or_name}.")
+
+ # 读取embedding
+ if lower:
+ lowered_vocab = Vocabulary(padding=vocab.padding, unknown=vocab.unknown)
+ for word, index in vocab:
+ if not vocab._is_word_no_create_entry(word):
+ lowered_vocab.add_word(word.lower()) # 先加入需要创建entry的
+ for word in vocab._no_create_word.keys(): # 不需要创建entry的
+ if word in vocab:
+ lowered_word = word.lower()
+ if lowered_word not in lowered_vocab.word_count:
+ lowered_vocab.add_word(lowered_word)
+ lowered_vocab._no_create_word[lowered_word] += 1
+ print(f"All word in vocab have been lowered. There are {len(vocab)} words, {len(lowered_vocab)} unique lowered "
+ f"words.")
+ if model_path:
+ embedding = self._load_with_vocab(model_path, vocab=lowered_vocab, init_method=init_method)
+ else:
+ embedding = self._randomly_init_embed(len(vocab), embedding_dim, init_method)
+ # 需要适配一下
+ if not hasattr(self, 'words_to_words'):
+                self.words_to_words = torch.arange(len(lowered_vocab)).long()
+ if lowered_vocab.unknown:
+ unknown_idx = lowered_vocab.unknown_idx
+ else:
+                unknown_idx = embedding.size(0) - 1 # 否则是最后一个为unknown
+ words_to_words = nn.Parameter(torch.full((len(vocab),), fill_value=unknown_idx).long(),
+ requires_grad=False)
+ for word, index in vocab:
+ if word not in lowered_vocab:
+ word = word.lower()
+ if lowered_vocab._is_word_no_create_entry(word): # 如果不需要创建entry,已经默认unknown了
+ continue
+ words_to_words[index] = self.words_to_words[lowered_vocab.to_index(word)]
+ self.words_to_words = words_to_words
+ else:
+ if model_path:
+ embedding = self._load_with_vocab(model_path, vocab=vocab, init_method=init_method)
+ else:
+ embedding = self._randomly_init_embed(len(vocab), embedding_dim, init_method)
+ if normalize:
+ embedding /= (torch.norm(embedding, dim=1, keepdim=True) + 1e-12)
+ self.embedding = nn.Embedding(num_embeddings=embedding.shape[0], embedding_dim=embedding.shape[1],
+ padding_idx=vocab.padding_idx,
+ max_norm=None, norm_type=2, scale_grad_by_freq=False,
+ sparse=False, _weight=embedding)
+ self._embed_size = self.embedding.weight.size(1)
+ self.requires_grad = requires_grad
+
+ def _randomly_init_embed(self, num_embedding, embedding_dim, init_embed=None):
+ """
+
+ :param int num_embedding: embedding的entry的数量
+ :param int embedding_dim: embedding的维度大小
+ :param callable init_embed: 初始化方法
+ :return: torch.FloatTensor
+ """
+ embed = torch.zeros(num_embedding, embedding_dim)
+
+ if init_embed is None:
+ nn.init.uniform_(embed, -np.sqrt(3/embedding_dim), np.sqrt(3/embedding_dim))
+ else:
+ init_embed(embed)
+
+ return embed
+
+ @property
+ def requires_grad(self):
+ """
+        Embedding的参数是否允许优化。True: 所有参数允许优化; False: 所有参数不允许优化; None: 部分允许优化、部分不允许
+
+ :return:
+ """
+ requires_grads = set([param.requires_grad for name, param in self.named_parameters()
+ if 'words_to_words' not in name])
+ if len(requires_grads) == 1:
+ return requires_grads.pop()
+ else:
+ return None
+
+ @requires_grad.setter
+ def requires_grad(self, value):
+ for name, param in self.named_parameters():
+ if 'words_to_words' in name:
+ continue
+ param.requires_grad = value
+
+    def _load_with_vocab(self, embed_filepath, vocab, dtype=np.float32, padding='<pad>', unknown='<unk>',
+ error='ignore', init_method=None):
+ """
+ 从embed_filepath这个预训练的词向量中抽取出vocab这个词表的词的embedding。EmbedLoader将自动判断embed_filepath是
+ word2vec(第一行只有两个元素)还是glove格式的数据。
+
+ :param str embed_filepath: 预训练的embedding的路径。
+ :param vocab: 词表 :class:`~fastNLP.Vocabulary` 类型,读取出现在vocab中的词的embedding。
+ 没有出现在vocab中的词的embedding将通过找到的词的embedding的正态分布采样出来,以使得整个Embedding是同分布的。
+ :param dtype: 读出的embedding的类型
+ :param str padding: 词表中padding的token
+ :param str unknown: 词表中unknown的token
+ :param str error: `ignore` , `strict` ; 如果 `ignore` ,错误将自动跳过; 如果 `strict` , 错误将抛出。
+ 这里主要可能出错的地方在于词表有空行或者词表出现了维度不一致。
+ :param init_method: 如何初始化没有找到的值。可以使用torch.nn.init.*中各种方法。默认使用torch.nn.init.zeros_
+ :return torch.tensor: shape为 [len(vocab), dimension], dimension由pretrain的embedding决定。
+ """
+ assert isinstance(vocab, Vocabulary), "Only fastNLP.Vocabulary is supported."
+ if not os.path.exists(embed_filepath):
+ raise FileNotFoundError("`{}` does not exist.".format(embed_filepath))
+ with open(embed_filepath, 'r', encoding='utf-8') as f:
+ line = f.readline().strip()
+ parts = line.split()
+ start_idx = 0
+ if len(parts) == 2:
+ dim = int(parts[1])
+ start_idx += 1
+ else:
+ dim = len(parts) - 1
+ f.seek(0)
+ matrix = {}
+ found_count = 0
+ for idx, line in enumerate(f, start_idx):
+ try:
+ parts = line.strip().split()
+ word = ''.join(parts[:-dim])
+ nums = parts[-dim:]
+ # 对齐unk与pad
+ if word == padding and vocab.padding is not None:
+ word = vocab.padding
+ elif word == unknown and vocab.unknown is not None:
+ word = vocab.unknown
+ if word in vocab:
+ index = vocab.to_index(word)
+ matrix[index] = torch.from_numpy(np.fromstring(' '.join(nums), sep=' ', dtype=dtype, count=dim))
+ found_count += 1
+ except Exception as e:
+ if error == 'ignore':
+ warnings.warn("Error occurred at the {} line.".format(idx))
+ else:
+ print("Error occurred at the {} line.".format(idx))
+ raise e
+ print("Found {} out of {} words in the pre-training embedding.".format(found_count, len(vocab)))
+ for word, index in vocab:
+ if index not in matrix and not vocab._is_word_no_create_entry(word):
+                if vocab.unknown_idx in matrix: # 如果有unknown,用unknown初始化
+ matrix[index] = matrix[vocab.unknown_idx]
+ else:
+ matrix[index] = None
+
+ vectors = self._randomly_init_embed(len(matrix), dim, init_method)
+
+ if vocab._no_create_word_length>0:
+ if vocab.unknown is None: # 创建一个专门的unknown
+ unknown_idx = len(matrix)
+ vectors = torch.cat((vectors, torch.zeros(1, dim)), dim=0).contiguous()
+ else:
+ unknown_idx = vocab.unknown_idx
+ words_to_words = nn.Parameter(torch.full((len(vocab),), fill_value=unknown_idx).long(),
+ requires_grad=False)
+ for order, (index, vec) in enumerate(matrix.items()):
+ if vec is not None:
+ vectors[order] = vec
+ words_to_words[index] = order
+ self.words_to_words = words_to_words
+ else:
+ for index, vec in matrix.items():
+ if vec is not None:
+ vectors[index] = vec
+
+ return vectors
+
+ def forward(self, words):
+ """
+ 传入words的index
+
+ :param words: torch.LongTensor, [batch_size, max_len]
+ :return: torch.FloatTensor, [batch_size, max_len, embed_size]
+ """
+ if hasattr(self, 'words_to_words'):
+ words = self.words_to_words[words]
+ words = self.drop_word(words)
+ words = self.embedding(words)
+ words = self.dropout(words)
+ return words
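The `lower=True` / unknown-sharing behaviour above rests on the `words_to_words` indirection: vocabulary indices are first remapped to rows of a (possibly smaller) embedding matrix. A pure-torch sketch with made-up indices:

```python
import torch
import torch.nn as nn

unknown_row = 3
words_to_words = torch.tensor([0, 1, 1, 1, 2, unknown_row])  # vocab index -> row in the matrix
embedding = nn.Embedding(4, 5)                               # only 4 distinct vectors are stored

words = torch.LongTensor([[1, 2, 3, 5]])             # e.g. "The", "the", "THE", <rare word>
vectors = embedding(words_to_words[words])           # looks up rows [[1, 1, 1, 3]]
print(torch.allclose(vectors[0, 0], vectors[0, 1]))  # True: the three casings share one vector
```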
diff --git a/fastNLP/embeddings/utils.py b/fastNLP/embeddings/utils.py
new file mode 100644
index 00000000..b79f563c
--- /dev/null
+++ b/fastNLP/embeddings/utils.py
@@ -0,0 +1,51 @@
+import numpy as np
+import torch
+from torch import nn as nn
+
+from ..core.vocabulary import Vocabulary
+
+__all__ = ['get_embeddings']
+
+
+def _construct_char_vocab_from_vocab(vocab:Vocabulary, min_freq:int=1):
+ """
+ 给定一个word的vocabulary生成character的vocabulary.
+
+ :param vocab: 从vocab
+ :param min_freq:
+ :return:
+ """
+ char_vocab = Vocabulary(min_freq=min_freq)
+ for word, index in vocab:
+ if not vocab._is_word_no_create_entry(word):
+ char_vocab.add_word_lst(list(word))
+ return char_vocab
+
+
+def get_embeddings(init_embed):
+ """
+ 根据输入的init_embed返回Embedding对象。如果输入是tuple, 则随机初始化一个nn.Embedding; 如果输入是numpy.ndarray, 则按照ndarray
+ 的值将nn.Embedding初始化; 如果输入是torch.Tensor, 则按该值初始化nn.Embedding; 如果输入是fastNLP中的embedding将不做处理
+ 返回原对象。
+
+ :param init_embed: 可以是 tuple:(num_embedings, embedding_dim), 即embedding的大小和每个词的维度;也可以传入
+ nn.Embedding 对象, 此时就以传入的对象作为embedding; 传入np.ndarray也行,将使用传入的ndarray作为作为Embedding初始化;
+ 传入torch.Tensor, 将使用传入的值作为Embedding初始化。
+ :return nn.Embedding embeddings:
+ """
+ if isinstance(init_embed, tuple):
+ res = nn.Embedding(
+ num_embeddings=init_embed[0], embedding_dim=init_embed[1])
+ nn.init.uniform_(res.weight.data, a=-np.sqrt(3/res.weight.data.size(1)),
+ b=np.sqrt(3/res.weight.data.size(1)))
+ elif isinstance(init_embed, nn.Module):
+ res = init_embed
+ elif isinstance(init_embed, torch.Tensor):
+ res = nn.Embedding.from_pretrained(init_embed, freeze=False)
+ elif isinstance(init_embed, np.ndarray):
+ init_embed = torch.tensor(init_embed, dtype=torch.float32)
+ res = nn.Embedding.from_pretrained(init_embed, freeze=False)
+ else:
+ raise TypeError(
+ 'invalid init_embed type: {}'.format((type(init_embed))))
+ return res
\ No newline at end of file
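A quick sketch of the three input types `get_embeddings` accepts (assuming this module is importable as `fastNLP.embeddings.utils`):

```python
import numpy as np
import torch
from fastNLP.embeddings.utils import get_embeddings

e1 = get_embeddings((2000, 100))                         # tuple -> randomly initialised nn.Embedding
e2 = get_embeddings(np.zeros((10, 5), dtype='float32'))  # ndarray -> weights copied from the array
e3 = get_embeddings(torch.randn(10, 5))                  # Tensor  -> weights copied from the tensor
print(e1.weight.shape, e2.weight.shape, e3.weight.shape)
```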
diff --git a/fastNLP/io/__init__.py b/fastNLP/io/__init__.py
index c8d6a441..cd0d3527 100644
--- a/fastNLP/io/__init__.py
+++ b/fastNLP/io/__init__.py
@@ -3,29 +3,45 @@
1. 用于读入 embedding 的 :doc:`EmbedLoader ` 类,
-2. 用于读入数据的 :doc:`DataSetLoader ` 类
+2. 用于读入不同格式数据的 :doc:`DataSetLoader ` 类
-3. 用于保存和载入模型的类, 参考 :doc:`/fastNLP.io.model_io`
+3. 用于读入不同数据集并进行预处理的 :doc:`DataLoader ` 类
+
+4. 用于保存和载入模型的类, 参考 :doc:`model_io文档`
这些类的使用方法如下:
"""
__all__ = [
'EmbedLoader',
-
- 'DataSetLoader',
+
'CSVLoader',
'JsonLoader',
+
+ 'DataBundle',
+ 'DataSetLoader',
+
'ConllLoader',
- 'SNLILoader',
- 'SSTLoader',
- 'PeopleDailyCorpusLoader',
'Conll2003Loader',
+ 'IMDBLoader',
+ 'MatchingLoader',
+ 'SNLILoader',
+ 'MNLILoader',
+ 'MTL16Loader',
+ 'PeopleDailyCorpusLoader',
+ 'QNLILoader',
+ 'QuoraLoader',
+ 'RTELoader',
+ 'SSTLoader',
+ 'SST2Loader',
+ 'YelpLoader',
'ModelLoader',
'ModelSaver',
]
from .embed_loader import EmbedLoader
-from .dataset_loader import DataSetLoader, CSVLoader, JsonLoader, ConllLoader, SNLILoader, SSTLoader, \
- PeopleDailyCorpusLoader, Conll2003Loader
+from .base_loader import DataBundle, DataSetLoader
+from .dataset_loader import CSVLoader, JsonLoader
from .model_io import ModelLoader, ModelSaver
+
+from .data_loader import *
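Since `__all__` and the star import from `data_loader` are both touched here, a small smoke test (assuming the package is installed from this branch) can confirm that every exported name actually resolves:

```python
import fastNLP.io as fastnlp_io

missing = [name for name in fastnlp_io.__all__ if not hasattr(fastnlp_io, name)]
print(missing)  # expected to be []
```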
diff --git a/fastNLP/io/base_loader.py b/fastNLP/io/base_loader.py
index adfa8ca1..5d61c16a 100644
--- a/fastNLP/io/base_loader.py
+++ b/fastNLP/io/base_loader.py
@@ -1,6 +1,6 @@
__all__ = [
"BaseLoader",
- 'DataInfo',
+ 'DataBundle',
'DataSetLoader',
]
@@ -10,6 +10,7 @@ from typing import Union, Dict
import os
from ..core.dataset import DataSet
+
class BaseLoader(object):
"""
各个 Loader 的基类,提供了 API 的参考。
@@ -55,8 +56,6 @@ class BaseLoader(object):
return obj
-
-
def _download_from_url(url, path):
try:
from tqdm.auto import tqdm
@@ -110,20 +109,27 @@ def _uncompress(src, dst):
raise ValueError('unsupported file {}'.format(src))
-class DataInfo:
+class DataBundle:
"""
- 经过处理的数据信息,包括一系列数据集(比如:分开的训练集、验证集和测试集)及它们所用的词表和词嵌入。
+ 经过处理的数据信息,包括一系列数据集(比如:分开的训练集、验证集和测试集)以及各个field对应的vocabulary。
:param vocabs: 从名称(字符串)到 :class:`~fastNLP.Vocabulary` 类型的dict
- :param embeddings: 从名称(字符串)到一系列 embedding 的dict,参考 :class:`~fastNLP.io.EmbedLoader`
:param datasets: 从名称(字符串)到 :class:`~fastNLP.DataSet` 类型的dict
"""
- def __init__(self, vocabs: dict = None, embeddings: dict = None, datasets: dict = None):
+ def __init__(self, vocabs: dict = None, datasets: dict = None):
self.vocabs = vocabs or {}
- self.embeddings = embeddings or {}
self.datasets = datasets or {}
+ def __repr__(self):
+ _str = 'In total {} datasets:\n'.format(len(self.datasets))
+ for name, dataset in self.datasets.items():
+ _str += '\t{} has {} instances.\n'.format(name, len(dataset))
+ _str += 'In total {} vocabs:\n'.format(len(self.vocabs))
+ for name, vocab in self.vocabs.items():
+ _str += '\t{} has {} entries.\n'.format(name, len(vocab))
+ return _str
+
class DataSetLoader:
"""
@@ -195,21 +201,20 @@ class DataSetLoader:
"""
raise NotImplementedError
- def process(self, paths: Union[str, Dict[str, str]], **options) -> DataInfo:
+ def process(self, paths: Union[str, Dict[str, str]], **options) -> DataBundle:
"""
对于特定的任务和数据集,读取并处理数据,返回处理DataInfo类对象或字典。
从指定一个或多个路径中的文件中读取数据,DataInfo对象中可以包含一个或多个数据集 。
如果处理多个路径,传入的 dict 的 key 与返回DataInfo中的 dict 中的 key 保存一致。
- 返回的 :class:`DataInfo` 对象有如下属性:
+ 返回的 :class:`DataBundle` 对象有如下属性:
- vocabs: 由从数据集中获取的词表组成的字典,每个词表
- - embeddings: (可选) 数据集对应的词嵌入
- datasets: 一个dict,包含一系列 :class:`~fastNLP.DataSet` 类型的对象。其中 field 的命名参考 :mod:`~fastNLP.core.const`
:param paths: 原始数据读取的路径
:param options: 根据不同的任务和数据集,设计自己的参数
- :return: 返回一个 DataInfo
+ :return: 返回一个 DataBundle
"""
raise NotImplementedError
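A hedged sketch of how a `DataBundle` is meant to be filled and what the new `__repr__` prints; it assumes fastNLP from this branch, and the tiny dataset is made up:

```python
from fastNLP import DataSet, Vocabulary
from fastNLP.io import DataBundle

train = DataSet({'words': [['a', 'b'], ['c']], 'target': ['pos', 'neg']})
vocab = Vocabulary().from_dataset(train, field_name='words')
bundle = DataBundle(vocabs={'words': vocab}, datasets={'train': train})
print(bundle)
# In total 1 datasets:
#     train has 2 instances.
# In total 1 vocabs:
#     words has 5 entries.
```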
diff --git a/fastNLP/io/data_loader/__init__.py b/fastNLP/io/data_loader/__init__.py
new file mode 100644
index 00000000..5d6b08b0
--- /dev/null
+++ b/fastNLP/io/data_loader/__init__.py
@@ -0,0 +1,35 @@
+"""
+用于读数据集的模块, 可以读取文本分类、序列标注、Matching任务的数据集
+
+这些模块的具体介绍如下,您可以通过阅读 :doc:`教程` 来进行了解。
+"""
+__all__ = [
+ 'ConllLoader',
+ 'Conll2003Loader',
+ 'IMDBLoader',
+ 'MatchingLoader',
+ 'SNLILoader',
+ 'MNLILoader',
+ 'MTL16Loader',
+ 'PeopleDailyCorpusLoader',
+ 'QNLILoader',
+ 'QuoraLoader',
+ 'RTELoader',
+ 'SSTLoader',
+ 'SST2Loader',
+ 'YelpLoader',
+]
+
+
+from .conll import ConllLoader, Conll2003Loader
+from .imdb import IMDBLoader
+from .matching import MatchingLoader
+from .mnli import MNLILoader
+from .mtl import MTL16Loader
+from .people_daily import PeopleDailyCorpusLoader
+from .qnli import QNLILoader
+from .quora import QuoraLoader
+from .rte import RTELoader
+from .snli import SNLILoader
+from .sst import SSTLoader, SST2Loader
+from .yelp import YelpLoader
diff --git a/fastNLP/io/data_loader/conll.py b/fastNLP/io/data_loader/conll.py
new file mode 100644
index 00000000..9b2402a2
--- /dev/null
+++ b/fastNLP/io/data_loader/conll.py
@@ -0,0 +1,73 @@
+
+from ...core.dataset import DataSet
+from ...core.instance import Instance
+from ..base_loader import DataSetLoader
+from ..file_reader import _read_conll
+
+
+class ConllLoader(DataSetLoader):
+ """
+ 别名::class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.data_loader.ConllLoader`
+
+ 读取Conll格式的数据. 数据格式详见 http://conll.cemantix.org/2012/data.html. 数据中以"-DOCSTART-"开头的行将被忽略,因为
+ 该符号在conll 2003中被用为文档分割符。
+
+ 列号从0开始, 每列对应内容为::
+
+ Column Type
+ 0 Document ID
+ 1 Part number
+ 2 Word number
+ 3 Word itself
+ 4 Part-of-Speech
+ 5 Parse bit
+ 6 Predicate lemma
+ 7 Predicate Frameset ID
+ 8 Word sense
+ 9 Speaker/Author
+ 10 Named Entities
+ 11:N Predicate Arguments
+ N Coreference
+
+ :param headers: 每一列数据的名称,需为List or Tuple of str。``header`` 与 ``indexes`` 一一对应
+ :param indexes: 需要保留的数据列下标,从0开始。若为 ``None`` ,则所有列都保留。Default: ``None``
+ :param dropna: 是否忽略非法数据,若 ``False`` ,遇到非法数据时抛出 ``ValueError`` 。Default: ``False``
+ """
+
+ def __init__(self, headers, indexes=None, dropna=False):
+ super(ConllLoader, self).__init__()
+ if not isinstance(headers, (list, tuple)):
+ raise TypeError(
+ 'invalid headers: {}, should be list of strings'.format(headers))
+ self.headers = headers
+ self.dropna = dropna
+ if indexes is None:
+ self.indexes = list(range(len(self.headers)))
+ else:
+ if len(indexes) != len(headers):
+ raise ValueError
+ self.indexes = indexes
+
+ def _load(self, path):
+ ds = DataSet()
+ for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna):
+ ins = {h: data[i] for i, h in enumerate(self.headers)}
+ ds.append(Instance(**ins))
+ return ds
+
+
+class Conll2003Loader(ConllLoader):
+ """
+ 别名::class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.data_loader.Conll2003Loader`
+
+ 读取Conll2003数据
+
+ 关于数据集的更多信息,参考:
+ https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data
+ """
+
+ def __init__(self):
+ headers = [
+ 'tokens', 'pos', 'chunks', 'ner',
+ ]
+ super(Conll2003Loader, self).__init__(headers=headers)
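A usage sketch for the two loaders above; the file path is hypothetical and should point at a whitespace-separated CoNLL-2003 style file with the four columns listed in `Conll2003Loader`:

```python
from fastNLP.io import Conll2003Loader

loader = Conll2003Loader()                    # columns are mapped to: tokens, pos, chunks, ner
ds = loader.load('data/conll2003/train.txt')  # hypothetical local path
print(len(ds))
print(ds[0]['tokens'], ds[0]['ner'])
```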
diff --git a/fastNLP/io/data_loader/imdb.py b/fastNLP/io/data_loader/imdb.py
new file mode 100644
index 00000000..d3636cde
--- /dev/null
+++ b/fastNLP/io/data_loader/imdb.py
@@ -0,0 +1,99 @@
+
+from typing import Union, Dict
+
+from ..embed_loader import EmbeddingOption, EmbedLoader
+from ..base_loader import DataSetLoader, DataBundle
+from ...core.vocabulary import VocabularyOption, Vocabulary
+from ...core.dataset import DataSet
+from ...core.instance import Instance
+from ...core.const import Const
+
+from ..utils import get_tokenizer
+
+
+class IMDBLoader(DataSetLoader):
+ """
+ 别名::class:`fastNLP.io.IMDBLoader` :class:`fastNLP.io.data_loader.IMDBLoader`
+
+ 读取IMDB数据集,DataSet包含以下fields:
+
+ words: list(str), 需要分类的文本
+
+ target: str, 文本的标签
+
+ """
+
+ def __init__(self):
+ super(IMDBLoader, self).__init__()
+ self.tokenizer = get_tokenizer()
+
+ def _load(self, path):
+ dataset = DataSet()
+ with open(path, 'r', encoding="utf-8") as f:
+ for line in f:
+ line = line.strip()
+ if not line:
+ continue
+ parts = line.split('\t')
+ target = parts[0]
+ words = self.tokenizer(parts[1].lower())
+ dataset.append(Instance(words=words, target=target))
+
+ if len(dataset) == 0:
+ raise RuntimeError(f"{path} has no valid data.")
+
+ return dataset
+
+ def process(self,
+ paths: Union[str, Dict[str, str]],
+ src_vocab_opt: VocabularyOption = None,
+ tgt_vocab_opt: VocabularyOption = None,
+ char_level_op=False):
+
+ datasets = {}
+ info = DataBundle()
+ for name, path in paths.items():
+ dataset = self.load(path)
+ datasets[name] = dataset
+
+ def wordtochar(words):
+ chars = []
+ for word in words:
+ word = word.lower()
+ for char in word:
+ chars.append(char)
+ chars.append('')
+ chars.pop()
+ return chars
+
+ if char_level_op:
+ for dataset in datasets.values():
+ dataset.apply_field(wordtochar, field_name="words", new_field_name='chars')
+
+ datasets["train"], datasets["dev"] = datasets["train"].split(0.1, shuffle=False)
+
+ src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
+ src_vocab.from_dataset(datasets['train'], field_name='words')
+
+ src_vocab.index_dataset(*datasets.values(), field_name='words')
+
+ tgt_vocab = Vocabulary(unknown=None, padding=None) \
+ if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
+ tgt_vocab.from_dataset(datasets['train'], field_name='target')
+ tgt_vocab.index_dataset(*datasets.values(), field_name='target')
+
+ info.vocabs = {
+ Const.INPUT: src_vocab,
+ Const.TARGET: tgt_vocab
+ }
+
+ info.datasets = datasets
+
+ for name, dataset in info.datasets.items():
+ dataset.set_input(Const.INPUT)
+ dataset.set_target(Const.TARGET)
+
+ return info
+
+
+
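A hedged sketch of `IMDBLoader.process`; the paths are hypothetical, and each file is expected to contain one `label<TAB>text` example per line as `_load` above assumes. Note that `process` carves a dev split out of the train file and stores the vocabularies under fastNLP's `Const.INPUT`/`Const.TARGET` names:

```python
from fastNLP.io import IMDBLoader

paths = {'train': 'data/imdb/train.txt', 'test': 'data/imdb/test.txt'}  # hypothetical paths
bundle = IMDBLoader().process(paths, char_level_op=False)
print(bundle.datasets.keys())  # train and test, plus a dev split carved from train
print(bundle.vocabs.keys())    # the word and target vocabularies
```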
diff --git a/fastNLP/io/data_loader/matching.py b/fastNLP/io/data_loader/matching.py
new file mode 100644
index 00000000..481b5056
--- /dev/null
+++ b/fastNLP/io/data_loader/matching.py
@@ -0,0 +1,248 @@
+import os
+
+from typing import Union, Dict, List
+
+from ...core.const import Const
+from ...core.vocabulary import Vocabulary
+from ..base_loader import DataBundle, DataSetLoader
+from ..file_utils import _get_base_url, cached_path, PRETRAINED_BERT_MODEL_DIR
+from ...modules.encoder.bert import BertTokenizer
+
+
+class MatchingLoader(DataSetLoader):
+ """
+ 别名::class:`fastNLP.io.MatchingLoader` :class:`fastNLP.io.data_loader.MatchingLoader`
+
+ 读取Matching任务的数据集
+
+ :param dict paths: key是数据集名称(如train、dev、test),value是对应的文件名
+ """
+
+ def __init__(self, paths: dict=None):
+ self.paths = paths
+
+ def _load(self, path):
+ """
+ :param str path: 待读取数据集的路径名
+ :return: fastNLP.DataSet ds: 返回一个DataSet对象,里面必须包含3个field:其中两个分别为两个句子
+ 的原始字符串文本,第三个为标签
+ """
+ raise NotImplementedError
+
+ def process(self, paths: Union[str, Dict[str, str]], dataset_name: str=None,
+ to_lower=False, seq_len_type: str=None, bert_tokenizer: str=None,
+ cut_text: int = None, get_index=True, auto_pad_length: int=None,
+                auto_pad_token: str='<pad>', set_input: Union[list, str, bool]=True,
+ set_target: Union[list, str, bool]=True, concat: Union[str, list, bool]=None,
+ extra_split: List[str]=None, ) -> DataBundle:
+ """
+ :param paths: str或者Dict[str, str]。如果是str,则为数据集所在的文件夹或者是全路径文件名:如果是文件夹,
+ 则会从self.paths里面找对应的数据集名称与文件名。如果是Dict,则为数据集名称(如train、dev、test)和
+ 对应的全路径文件名。
+ :param str dataset_name: 如果在paths里传入的是一个数据集的全路径文件名,那么可以用dataset_name来定义
+ 这个数据集的名字,如果不定义则默认为train。
+ :param bool to_lower: 是否将文本自动转为小写。默认值为False。
+ :param str seq_len_type: 提供的seq_len类型,支持 ``seq_len`` :提供一个数字作为句子长度; ``mask`` :
+ 提供一个0/1的mask矩阵作为句子长度; ``bert`` :提供segment_type_id(第一个句子为0,第二个句子为1)和
+ attention mask矩阵(0/1的mask矩阵)。默认值为None,即不提供seq_len
+ :param str bert_tokenizer: bert tokenizer所使用的词表所在的文件夹路径
+ :param int cut_text: 将长于cut_text的内容截掉。默认为None,即不截。
+ :param bool get_index: 是否需要根据词表将文本转为index
+ :param int auto_pad_length: 是否需要将文本自动pad到一定长度(超过这个长度的文本将会被截掉),默认为不会自动pad
+ :param str auto_pad_token: 自动pad的内容
+ :param set_input: 如果为True,则会自动将相关的field(名字里含有Const.INPUT的)设置为input,如果为False
+ 则不会将任何field设置为input。如果传入str或者List[str],则会根据传入的内容将相对应的field设置为input,
+ 于此同时其他field不会被设置为input。默认值为True。
+ :param set_target: set_target将控制哪些field可以被设置为target,用法与set_input一致。默认值为True。
+        :param concat: 是否需要将两个句子拼接起来。如果为False则不会拼接。如果为True则按默认方式直接拼接,不插入额外标记。
+ 如果传入一个长度为4的list,则分别表示插在第一句开始前、第一句结束后、第二句开始前、第二句结束后的标识符。如果
+ 传入字符串 ``bert`` ,则会采用bert的拼接方式,等价于['[CLS]', '[SEP]', '', '[SEP]'].
+ :param extra_split: 额外的分隔符,即除了空格之外的用于分词的字符。
+ :return:
+ """
+ if isinstance(set_input, str):
+ set_input = [set_input]
+ if isinstance(set_target, str):
+ set_target = [set_target]
+ if isinstance(set_input, bool):
+ auto_set_input = set_input
+ else:
+ auto_set_input = False
+ if isinstance(set_target, bool):
+ auto_set_target = set_target
+ else:
+ auto_set_target = False
+ if isinstance(paths, str):
+ if os.path.isdir(paths):
+ path = {n: os.path.join(paths, self.paths[n]) for n in self.paths.keys()}
+ else:
+ path = {dataset_name if dataset_name is not None else 'train': paths}
+ else:
+ path = paths
+
+ data_info = DataBundle()
+ for data_name in path.keys():
+ data_info.datasets[data_name] = self._load(path[data_name])
+
+ for data_name, data_set in data_info.datasets.items():
+ if auto_set_input:
+ data_set.set_input(Const.INPUTS(0), Const.INPUTS(1))
+ if auto_set_target:
+ if Const.TARGET in data_set.get_field_names():
+ data_set.set_target(Const.TARGET)
+
+ if extra_split is not None:
+ for data_name, data_set in data_info.datasets.items():
+ data_set.apply(lambda x: ' '.join(x[Const.INPUTS(0)]), new_field_name=Const.INPUTS(0))
+ data_set.apply(lambda x: ' '.join(x[Const.INPUTS(1)]), new_field_name=Const.INPUTS(1))
+
+ for s in extra_split:
+ data_set.apply(lambda x: x[Const.INPUTS(0)].replace(s, ' ' + s + ' '),
+ new_field_name=Const.INPUTS(0))
+                    data_set.apply(lambda x: x[Const.INPUTS(1)].replace(s, ' ' + s + ' '),
+                                   new_field_name=Const.INPUTS(1))
+
+ _filt = lambda x: x
+ data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(0)].split(' '))),
+ new_field_name=Const.INPUTS(0), is_input=auto_set_input)
+ data_set.apply(lambda x: list(filter(_filt, x[Const.INPUTS(1)].split(' '))),
+ new_field_name=Const.INPUTS(1), is_input=auto_set_input)
+ _filt = None
+
+ if to_lower:
+ for data_name, data_set in data_info.datasets.items():
+ data_set.apply(lambda x: [w.lower() for w in x[Const.INPUTS(0)]], new_field_name=Const.INPUTS(0),
+ is_input=auto_set_input)
+ data_set.apply(lambda x: [w.lower() for w in x[Const.INPUTS(1)]], new_field_name=Const.INPUTS(1),
+ is_input=auto_set_input)
+
+ if bert_tokenizer is not None:
+ if bert_tokenizer.lower() in PRETRAINED_BERT_MODEL_DIR:
+ PRETRAIN_URL = _get_base_url('bert')
+ model_name = PRETRAINED_BERT_MODEL_DIR[bert_tokenizer]
+ model_url = PRETRAIN_URL + model_name
+ model_dir = cached_path(model_url)
+ # 检查是否存在
+ elif os.path.isdir(bert_tokenizer):
+ model_dir = bert_tokenizer
+ else:
+ raise ValueError(f"Cannot recognize BERT tokenizer from {bert_tokenizer}.")
+
+ words_vocab = Vocabulary(padding='[PAD]', unknown='[UNK]')
+ with open(os.path.join(model_dir, 'vocab.txt'), 'r') as f:
+ lines = f.readlines()
+ lines = [line.strip() for line in lines]
+ words_vocab.add_word_lst(lines)
+ words_vocab.build_vocab()
+
+ tokenizer = BertTokenizer.from_pretrained(model_dir)
+
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: tokenizer.tokenize(' '.join(x[fields])), new_field_name=fields,
+ is_input=auto_set_input)
+
+ if isinstance(concat, bool):
+ concat = 'default' if concat else None
+ if concat is not None:
+ if isinstance(concat, str):
+ CONCAT_MAP = {'bert': ['[CLS]', '[SEP]', '', '[SEP]'],
+ 'default': ['', '', '', '']}
+ if concat.lower() in CONCAT_MAP:
+ concat = CONCAT_MAP[concat]
+ else:
+ concat = 4 * [concat]
+ assert len(concat) == 4, \
+ f'Please choose a list with 4 symbols which at the beginning of first sentence ' \
+ f'the end of first sentence, the begin of second sentence, and the end of second' \
+ f'sentence. Your input is {concat}'
+
+ for data_name, data_set in data_info.datasets.items():
+ data_set.apply(lambda x: [concat[0]] + x[Const.INPUTS(0)] + [concat[1]] + [concat[2]] +
+ x[Const.INPUTS(1)] + [concat[3]], new_field_name=Const.INPUT)
+ data_set.apply(lambda x: [w for w in x[Const.INPUT] if len(w) > 0], new_field_name=Const.INPUT,
+ is_input=auto_set_input)
+
+ if seq_len_type is not None:
+ if seq_len_type == 'seq_len': #
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: len(x[fields]),
+ new_field_name=fields.replace(Const.INPUT, Const.INPUT_LEN),
+ is_input=auto_set_input)
+ elif seq_len_type == 'mask':
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: [1] * len(x[fields]),
+ new_field_name=fields.replace(Const.INPUT, Const.INPUT_LEN),
+ is_input=auto_set_input)
+ elif seq_len_type == 'bert':
+ for data_name, data_set in data_info.datasets.items():
+ if Const.INPUT not in data_set.get_field_names():
+ raise KeyError(f'Field ``{Const.INPUT}`` not in {data_name} data set: '
+ f'got {data_set.get_field_names()}')
+ data_set.apply(lambda x: [0] * (len(x[Const.INPUTS(0)]) + 2) + [1] * (len(x[Const.INPUTS(1)]) + 1),
+ new_field_name=Const.INPUT_LENS(0), is_input=auto_set_input)
+ data_set.apply(lambda x: [1] * len(x[Const.INPUT_LENS(0)]),
+ new_field_name=Const.INPUT_LENS(1), is_input=auto_set_input)
+
+ if auto_pad_length is not None:
+ cut_text = min(auto_pad_length, cut_text if cut_text is not None else auto_pad_length)
+
+ if cut_text is not None:
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if (Const.INPUT in fields) or ((Const.INPUT_LEN in fields) and (seq_len_type != 'seq_len')):
+ data_set.apply(lambda x: x[fields][: cut_text], new_field_name=fields,
+ is_input=auto_set_input)
+
+ data_set_list = [d for n, d in data_info.datasets.items()]
+ assert len(data_set_list) > 0, f'There are NO data sets in data info!'
+
+ if bert_tokenizer is None:
+ words_vocab = Vocabulary(padding=auto_pad_token)
+ words_vocab = words_vocab.from_dataset(*[d for n, d in data_info.datasets.items() if 'train' in n],
+ field_name=[n for n in data_set_list[0].get_field_names()
+ if (Const.INPUT in n)],
+ no_create_entry_dataset=[d for n, d in data_info.datasets.items()
+ if 'train' not in n])
+ target_vocab = Vocabulary(padding=None, unknown=None)
+ target_vocab = target_vocab.from_dataset(*[d for n, d in data_info.datasets.items() if 'train' in n],
+ field_name=Const.TARGET)
+ data_info.vocabs = {Const.INPUT: words_vocab, Const.TARGET: target_vocab}
+
+ if get_index:
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: [words_vocab.to_index(w) for w in x[fields]], new_field_name=fields,
+ is_input=auto_set_input)
+
+ if Const.TARGET in data_set.get_field_names():
+ data_set.apply(lambda x: target_vocab.to_index(x[Const.TARGET]), new_field_name=Const.TARGET,
+ is_input=auto_set_input, is_target=auto_set_target)
+
+ if auto_pad_length is not None:
+ if seq_len_type == 'seq_len':
+ raise RuntimeError(f'the sequence will be padded with the length {auto_pad_length}, '
+ f'so the seq_len_type cannot be `{seq_len_type}`!')
+ for data_name, data_set in data_info.datasets.items():
+ for fields in data_set.get_field_names():
+ if Const.INPUT in fields:
+ data_set.apply(lambda x: x[fields] + [words_vocab.to_index(words_vocab.padding)] *
+ (auto_pad_length - len(x[fields])), new_field_name=fields,
+ is_input=auto_set_input)
+ elif (Const.INPUT_LEN in fields) and (seq_len_type != 'seq_len'):
+ data_set.apply(lambda x: x[fields] + [0] * (auto_pad_length - len(x[fields])),
+ new_field_name=fields, is_input=auto_set_input)
+
+ for data_name, data_set in data_info.datasets.items():
+ if isinstance(set_input, list):
+ data_set.set_input(*[inputs for inputs in set_input if inputs in data_set.get_field_names()])
+ if isinstance(set_target, list):
+ data_set.set_target(*[target for target in set_target if target in data_set.get_field_names()])
+
+ return data_info
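A hedged end-to-end sketch of the matching pipeline through `SNLILoader` (defined later in this patch); the directory is hypothetical and should contain the `snli_1.0_*.jsonl` files named in that loader's default `paths`:

```python
from fastNLP.io import SNLILoader

bundle = SNLILoader().process('data/snli_1.0/',   # hypothetical data directory
                              to_lower=True,
                              seq_len_type='seq_len',
                              get_index=True)
print(bundle)                      # DataBundle summary: datasets and vocabularies
print(bundle.datasets['train'][0])  # indexed words1/words2 plus seq_len fields and target
```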
diff --git a/fastNLP/io/data_loader/mnli.py b/fastNLP/io/data_loader/mnli.py
new file mode 100644
index 00000000..65863f3d
--- /dev/null
+++ b/fastNLP/io/data_loader/mnli.py
@@ -0,0 +1,62 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import CSVLoader
+
+
+class MNLILoader(MatchingLoader, CSVLoader):
+ """
+ 别名::class:`fastNLP.io.MNLILoader` :class:`fastNLP.io.data_loader.MNLILoader`
+
+ 读取MNLI数据集,读取的DataSet包含fields::
+
+ words1: list(str),第一句文本, premise
+
+ words2: list(str), 第二句文本, hypothesis
+
+ target: str, 真实标签
+
+ 数据来源:
+ """
+
+ def __init__(self, paths: dict=None):
+ paths = paths if paths is not None else {
+ 'train': 'train.tsv',
+ 'dev_matched': 'dev_matched.tsv',
+ 'dev_mismatched': 'dev_mismatched.tsv',
+ 'test_matched': 'test_matched.tsv',
+ 'test_mismatched': 'test_mismatched.tsv',
+ # 'test_0.9_matched': 'multinli_0.9_test_matched_unlabeled.txt',
+ # 'test_0.9_mismatched': 'multinli_0.9_test_mismatched_unlabeled.txt',
+
+            # test_0.9_matched与mismatched是MNLI0.9版本的(数据来源:kaggle)
+ }
+ MatchingLoader.__init__(self, paths=paths)
+ CSVLoader.__init__(self, sep='\t')
+ self.fields = {
+ 'sentence1_binary_parse': Const.INPUTS(0),
+ 'sentence2_binary_parse': Const.INPUTS(1),
+ 'gold_label': Const.TARGET,
+ }
+
+ def _load(self, path):
+ ds = CSVLoader._load(self, path)
+
+ for k, v in self.fields.items():
+ if k in ds.get_field_names():
+ ds.rename_field(k, v)
+
+ if Const.TARGET in ds.get_field_names():
+ if ds[0][Const.TARGET] == 'hidden':
+ ds.delete_field(Const.TARGET)
+
+ parentheses_table = str.maketrans({'(': None, ')': None})
+
+ ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
+ new_field_name=Const.INPUTS(0))
+ ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
+ new_field_name=Const.INPUTS(1))
+ if Const.TARGET in ds.get_field_names():
+ ds.drop(lambda x: x[Const.TARGET] == '-')
+ return ds
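The only non-obvious step in `_load` above is how the `*_binary_parse` strings are turned into token lists; the bracket stripping is plain `str.translate` and can be checked in isolation:

```python
parse = "( ( The cat ) ( sat down ) )"
table = str.maketrans({'(': None, ')': None})
tokens = parse.translate(table).strip().split()
print(tokens)  # ['The', 'cat', 'sat', 'down']
```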
diff --git a/fastNLP/io/data_loader/mtl.py b/fastNLP/io/data_loader/mtl.py
new file mode 100644
index 00000000..cbca413d
--- /dev/null
+++ b/fastNLP/io/data_loader/mtl.py
@@ -0,0 +1,68 @@
+
+from typing import Union, Dict
+
+from ..base_loader import DataBundle
+from ..dataset_loader import CSVLoader
+from ...core.vocabulary import Vocabulary, VocabularyOption
+from ...core.const import Const
+from ..utils import check_dataloader_paths
+
+
+class MTL16Loader(CSVLoader):
+ """
+ 别名::class:`fastNLP.io.MTL16Loader` :class:`fastNLP.io.data_loader.MTL16Loader`
+
+ 读取MTL16数据集,DataSet包含以下fields:
+
+ words: list(str), 需要分类的文本
+
+ target: str, 文本的标签
+
+ 数据来源:https://pan.baidu.com/s/1c2L6vdA
+
+ """
+
+ def __init__(self):
+ super(MTL16Loader, self).__init__(headers=(Const.TARGET, Const.INPUT), sep='\t')
+
+ def _load(self, path):
+ dataset = super(MTL16Loader, self)._load(path)
+ dataset.apply(lambda x: x[Const.INPUT].lower().split(), new_field_name=Const.INPUT)
+ if len(dataset) == 0:
+ raise RuntimeError(f"{path} has no valid data.")
+
+ return dataset
+
+ def process(self,
+ paths: Union[str, Dict[str, str]],
+ src_vocab_opt: VocabularyOption = None,
+ tgt_vocab_opt: VocabularyOption = None,):
+
+ paths = check_dataloader_paths(paths)
+ datasets = {}
+ info = DataBundle()
+ for name, path in paths.items():
+ dataset = self.load(path)
+ datasets[name] = dataset
+
+ src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
+ src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT)
+ src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT)
+
+ tgt_vocab = Vocabulary(unknown=None, padding=None) \
+ if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
+ tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET)
+ tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET)
+
+ info.vocabs = {
+ Const.INPUT: src_vocab,
+ Const.TARGET: tgt_vocab
+ }
+
+ info.datasets = datasets
+
+ for name, dataset in info.datasets.items():
+ dataset.set_input(Const.INPUT)
+ dataset.set_target(Const.TARGET)
+
+ return info
diff --git a/fastNLP/io/data_loader/people_daily.py b/fastNLP/io/data_loader/people_daily.py
new file mode 100644
index 00000000..5efadb7d
--- /dev/null
+++ b/fastNLP/io/data_loader/people_daily.py
@@ -0,0 +1,85 @@
+
+from ..base_loader import DataSetLoader
+from ...core.dataset import DataSet
+from ...core.instance import Instance
+from ...core.const import Const
+
+
+class PeopleDailyCorpusLoader(DataSetLoader):
+ """
+ 别名::class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.data_loader.PeopleDailyCorpusLoader`
+
+ 读取人民日报数据集
+ """
+
+ def __init__(self, pos=True, ner=True):
+ super(PeopleDailyCorpusLoader, self).__init__()
+ self.pos = pos
+ self.ner = ner
+
+ def _load(self, data_path):
+ with open(data_path, "r", encoding="utf-8") as f:
+ sents = f.readlines()
+ examples = []
+ for sent in sents:
+ if len(sent) <= 2:
+ continue
+ inside_ne = False
+ sent_pos_tag = []
+ sent_words = []
+ sent_ner = []
+ words = sent.strip().split()[1:]
+ for word in words:
+ if "[" in word and "]" in word:
+ ner_tag = "U"
+ print(word)
+ elif "[" in word:
+ inside_ne = True
+ ner_tag = "B"
+ word = word[1:]
+ elif "]" in word:
+ ner_tag = "L"
+ word = word[:word.index("]")]
+ if inside_ne is True:
+ inside_ne = False
+ else:
+ raise RuntimeError("only ] appears!")
+ else:
+ if inside_ne is True:
+ ner_tag = "I"
+ else:
+ ner_tag = "O"
+ tmp = word.split("/")
+ token, pos = tmp[0], tmp[1]
+ sent_ner.append(ner_tag)
+ sent_pos_tag.append(pos)
+ sent_words.append(token)
+ example = [sent_words]
+ if self.pos is True:
+ example.append(sent_pos_tag)
+ if self.ner is True:
+ example.append(sent_ner)
+ examples.append(example)
+ return self.convert(examples)
+
+ def convert(self, data):
+ """
+
+ :param data: python 内置对象
+ :return: 一个 :class:`~fastNLP.DataSet` 类型的对象
+ """
+ data_set = DataSet()
+ for item in data:
+ sent_words = item[0]
+ if self.pos is True and self.ner is True:
+ instance = Instance(
+ words=sent_words, pos_tags=item[1], ner=item[2])
+ elif self.pos is True:
+ instance = Instance(words=sent_words, pos_tags=item[1])
+ elif self.ner is True:
+ instance = Instance(words=sent_words, ner=item[1])
+ else:
+ instance = Instance(words=sent_words)
+ data_set.append(instance)
+ data_set.apply(lambda ins: len(ins[Const.INPUT]), new_field_name=Const.INPUT_LEN)
+ return data_set
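A hedged illustration of the BILOU-style tags `_load` derives from the bracketed corpus format; the corpus line below is made up, and the path in the commented call is hypothetical:

```python
from fastNLP.io import PeopleDailyCorpusLoader

# For a corpus line such as (the leading id token is skipped by _load):
#   19980101-01-001-001/m  [北京/ns  大学/n]nt  举行/v  活动/vn
# the loader is expected to produce, per sentence:
#   words    = ['北京', '大学', '举行', '活动']
#   pos_tags = ['ns', 'n', 'v', 'vn']
#   ner      = ['B', 'L', 'O', 'O']   # B/I/L inside [...], O outside, U for a single-token entity
loader = PeopleDailyCorpusLoader(pos=True, ner=True)
# data_set = loader.load('data/people_daily.txt')
```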
diff --git a/fastNLP/io/data_loader/qnli.py b/fastNLP/io/data_loader/qnli.py
new file mode 100644
index 00000000..84b0f3d6
--- /dev/null
+++ b/fastNLP/io/data_loader/qnli.py
@@ -0,0 +1,47 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import CSVLoader
+
+
+class QNLILoader(MatchingLoader, CSVLoader):
+ """
+ 别名::class:`fastNLP.io.QNLILoader` :class:`fastNLP.io.data_loader.QNLILoader`
+
+ 读取QNLI数据集,读取的DataSet包含fields::
+
+ words1: list(str),第一句文本, premise
+
+ words2: list(str), 第二句文本, hypothesis
+
+ target: str, 真实标签
+
+ 数据来源:
+ """
+
+ def __init__(self, paths: dict=None):
+ paths = paths if paths is not None else {
+ 'train': 'train.tsv',
+ 'dev': 'dev.tsv',
+            'test': 'test.tsv' # test set has no label
+ }
+ MatchingLoader.__init__(self, paths=paths)
+ self.fields = {
+ 'question': Const.INPUTS(0),
+ 'sentence': Const.INPUTS(1),
+ 'label': Const.TARGET,
+ }
+ CSVLoader.__init__(self, sep='\t')
+
+ def _load(self, path):
+ ds = CSVLoader._load(self, path)
+
+ for k, v in self.fields.items():
+ if k in ds.get_field_names():
+ ds.rename_field(k, v)
+ for fields in ds.get_all_fields():
+ if Const.INPUT in fields:
+ ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
+
+ return ds
diff --git a/fastNLP/io/data_loader/quora.py b/fastNLP/io/data_loader/quora.py
new file mode 100644
index 00000000..d0ee41ec
--- /dev/null
+++ b/fastNLP/io/data_loader/quora.py
@@ -0,0 +1,34 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import CSVLoader
+
+
+class QuoraLoader(MatchingLoader, CSVLoader):
+ """
+ 别名::class:`fastNLP.io.QuoraLoader` :class:`fastNLP.io.data_loader.QuoraLoader`
+
+    读取Quora数据集,读取的DataSet包含fields::
+
+ words1: list(str),第一句文本, premise
+
+ words2: list(str), 第二句文本, hypothesis
+
+ target: str, 真实标签
+
+ 数据来源:
+ """
+
+ def __init__(self, paths: dict=None):
+ paths = paths if paths is not None else {
+ 'train': 'train.tsv',
+ 'dev': 'dev.tsv',
+ 'test': 'test.tsv',
+ }
+ MatchingLoader.__init__(self, paths=paths)
+ CSVLoader.__init__(self, sep='\t', headers=(Const.TARGET, Const.INPUTS(0), Const.INPUTS(1), 'pairID'))
+
+ def _load(self, path):
+ ds = CSVLoader._load(self, path)
+ return ds
diff --git a/fastNLP/io/data_loader/rte.py b/fastNLP/io/data_loader/rte.py
new file mode 100644
index 00000000..f8c5e2fc
--- /dev/null
+++ b/fastNLP/io/data_loader/rte.py
@@ -0,0 +1,47 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import CSVLoader
+
+
+class RTELoader(MatchingLoader, CSVLoader):
+ """
+ 别名::class:`fastNLP.io.RTELoader` :class:`fastNLP.io.data_loader.RTELoader`
+
+ 读取RTE数据集,读取的DataSet包含fields::
+
+ words1: list(str),第一句文本, premise
+
+ words2: list(str), 第二句文本, hypothesis
+
+ target: str, 真实标签
+
+ 数据来源:
+ """
+
+ def __init__(self, paths: dict=None):
+ paths = paths if paths is not None else {
+ 'train': 'train.tsv',
+ 'dev': 'dev.tsv',
+            'test': 'test.tsv'  # test set has no label
+ }
+ MatchingLoader.__init__(self, paths=paths)
+ self.fields = {
+ 'sentence1': Const.INPUTS(0),
+ 'sentence2': Const.INPUTS(1),
+ 'label': Const.TARGET,
+ }
+ CSVLoader.__init__(self, sep='\t')
+
+ def _load(self, path):
+ ds = CSVLoader._load(self, path)
+
+ for k, v in self.fields.items():
+ if k in ds.get_field_names():
+ ds.rename_field(k, v)
+ for fields in ds.get_all_fields():
+ if Const.INPUT in fields:
+ ds.apply(lambda x: x[fields].strip().split(), new_field_name=fields)
+
+ return ds
diff --git a/fastNLP/io/data_loader/snli.py b/fastNLP/io/data_loader/snli.py
new file mode 100644
index 00000000..1db0ac5b
--- /dev/null
+++ b/fastNLP/io/data_loader/snli.py
@@ -0,0 +1,46 @@
+
+from ...core.const import Const
+
+from .matching import MatchingLoader
+from ..dataset_loader import JsonLoader
+
+
+class SNLILoader(MatchingLoader, JsonLoader):
+ """
+ 别名::class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.data_loader.SNLILoader`
+
+ 读取SNLI数据集,读取的DataSet包含fields::
+
+ words1: list(str),第一句文本, premise
+
+ words2: list(str), 第二句文本, hypothesis
+
+ target: str, 真实标签
+
+ 数据来源: https://nlp.stanford.edu/projects/snli/snli_1.0.zip
+ """
+
+ def __init__(self, paths: dict=None):
+ fields = {
+ 'sentence1_binary_parse': Const.INPUTS(0),
+ 'sentence2_binary_parse': Const.INPUTS(1),
+ 'gold_label': Const.TARGET,
+ }
+ paths = paths if paths is not None else {
+ 'train': 'snli_1.0_train.jsonl',
+ 'dev': 'snli_1.0_dev.jsonl',
+ 'test': 'snli_1.0_test.jsonl'}
+ MatchingLoader.__init__(self, paths=paths)
+ JsonLoader.__init__(self, fields=fields)
+
+ def _load(self, path):
+ ds = JsonLoader._load(self, path)
+
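+        # the *_binary_parse fields are bracketed parse strings; strip the parentheses and split into tokens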
+ parentheses_table = str.maketrans({'(': None, ')': None})
+
+ ds.apply(lambda ins: ins[Const.INPUTS(0)].translate(parentheses_table).strip().split(),
+ new_field_name=Const.INPUTS(0))
+ ds.apply(lambda ins: ins[Const.INPUTS(1)].translate(parentheses_table).strip().split(),
+ new_field_name=Const.INPUTS(1))
+ ds.drop(lambda x: x[Const.TARGET] == '-')
+ return ds
diff --git a/fastNLP/io/data_loader/sst.py b/fastNLP/io/data_loader/sst.py
index 1e1b8bef..0d881e65 100644
--- a/fastNLP/io/data_loader/sst.py
+++ b/fastNLP/io/data_loader/sst.py
@@ -1,18 +1,19 @@
-from typing import Iterable
+
+from typing import Union, Dict
from nltk import Tree
-from ..base_loader import DataInfo, DataSetLoader
+
+from ..base_loader import DataBundle, DataSetLoader
+from ..dataset_loader import CSVLoader
from ...core.vocabulary import VocabularyOption, Vocabulary
from ...core.dataset import DataSet
+from ...core.const import Const
from ...core.instance import Instance
-from ..embed_loader import EmbeddingOption, EmbedLoader
+from ..utils import check_dataloader_paths, get_tokenizer
class SSTLoader(DataSetLoader):
- URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip'
- DATA_DIR = 'sst/'
-
"""
- 别名::class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.dataset_loader.SSTLoader`
+ 别名::class:`fastNLP.io.SSTLoader` :class:`fastNLP.io.data_loader.SSTLoader`
读取SST数据集, DataSet包含fields::
@@ -25,6 +26,9 @@ class SSTLoader(DataSetLoader):
:param fine_grained: 是否使用SST-5标准,若 ``False`` , 使用SST-2。Default: ``False``
"""
+ URL = 'https://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip'
+ DATA_DIR = 'sst/'
+
def __init__(self, subtree=False, fine_grained=False):
self.subtree = subtree
@@ -34,6 +38,7 @@ class SSTLoader(DataSetLoader):
tag_v['0'] = tag_v['1']
tag_v['4'] = tag_v['3']
self.tag_v = tag_v
+ self.tokenizer = get_tokenizer()
def _load(self, path):
"""
@@ -52,29 +57,37 @@ class SSTLoader(DataSetLoader):
ds.append(Instance(words=words, target=tag))
return ds
- @staticmethod
- def _get_one(data, subtree):
+ def _get_one(self, data, subtree):
tree = Tree.fromstring(data)
if subtree:
- return [(t.leaves(), t.label()) for t in tree.subtrees()]
- return [(tree.leaves(), tree.label())]
+ return [(self.tokenizer(' '.join(t.leaves())), t.label()) for t in tree.subtrees() ]
+ return [(self.tokenizer(' '.join(tree.leaves())), tree.label())]
def process(self,
- paths,
- train_ds: Iterable[str] = None,
+ paths, train_subtree=True,
src_vocab_op: VocabularyOption = None,
- tgt_vocab_op: VocabularyOption = None,
- src_embed_op: EmbeddingOption = None):
+ tgt_vocab_op: VocabularyOption = None,):
+ paths = check_dataloader_paths(paths)
input_name, target_name = 'words', 'target'
src_vocab = Vocabulary() if src_vocab_op is None else Vocabulary(**src_vocab_op)
tgt_vocab = Vocabulary(unknown=None, padding=None) \
if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
- info = DataInfo(datasets=self.load(paths))
- _train_ds = [info.datasets[name]
- for name in train_ds] if train_ds else info.datasets.values()
- src_vocab.from_dataset(*_train_ds, field_name=input_name)
- tgt_vocab.from_dataset(*_train_ds, field_name=target_name)
+ info = DataBundle()
+ origin_subtree = self.subtree
+ self.subtree = train_subtree
+ info.datasets['train'] = self._load(paths['train'])
+ self.subtree = origin_subtree
+ for n, p in paths.items():
+ if n != 'train':
+ info.datasets[n] = self._load(p)
+
+ src_vocab.from_dataset(
+ info.datasets['train'],
+ field_name=input_name,
+ no_create_entry_dataset=[ds for n, ds in info.datasets.items() if n != 'train'])
+ tgt_vocab.from_dataset(info.datasets['train'], field_name=target_name)
+
src_vocab.index_dataset(
*info.datasets.values(),
field_name=input_name, new_field_name=input_name)
@@ -86,10 +99,79 @@ class SSTLoader(DataSetLoader):
target_name: tgt_vocab
}
- if src_embed_op is not None:
- src_embed_op.vocab = src_vocab
- init_emb = EmbedLoader.load_with_vocab(**src_embed_op)
- info.embeddings[input_name] = init_emb
+ return info
+
+
+class SST2Loader(CSVLoader):
+ """
+ 别名::class:`fastNLP.io.SST2Loader` :class:`fastNLP.io.data_loader.SST2Loader`
+
+ 数据来源 SST: https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8
+ """
+
+ def __init__(self):
+ super(SST2Loader, self).__init__(sep='\t')
+ self.tokenizer = get_tokenizer()
+ self.field = {'sentence': Const.INPUT, 'label': Const.TARGET}
+
+ def _load(self, path: str) -> DataSet:
+ ds = super(SST2Loader, self)._load(path)
+ for k, v in self.field.items():
+ if k in ds.get_field_names():
+ ds.rename_field(k, v)
+ ds.apply(lambda x: self.tokenizer(x[Const.INPUT]), new_field_name=Const.INPUT)
+ print("all count:", len(ds))
+ return ds
+
+ def process(self,
+ paths: Union[str, Dict[str, str]],
+ src_vocab_opt: VocabularyOption = None,
+ tgt_vocab_opt: VocabularyOption = None,
+ char_level_op=False):
+
+ paths = check_dataloader_paths(paths)
+ datasets = {}
+ info = DataBundle()
+ for name, path in paths.items():
+ dataset = self.load(path)
+ datasets[name] = dataset
+
+ def wordtochar(words):
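+            # split each word into lower-cased characters, inserting '' between words
+            # and dropping the trailing separator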
+ chars = []
+ for word in words:
+ word = word.lower()
+ for char in word:
+ chars.append(char)
+ chars.append('')
+ chars.pop()
+ return chars
+
+ input_name, target_name = Const.INPUT, Const.TARGET
+ info.vocabs={}
+
+        # 若 char_level_op 为 True,则将词级输入拆分为字符级输入
+ if char_level_op:
+ for dataset in datasets.values():
+ dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT)
+ src_vocab = Vocabulary() if src_vocab_opt is None else Vocabulary(**src_vocab_opt)
+ src_vocab.from_dataset(datasets['train'], field_name=Const.INPUT)
+ src_vocab.index_dataset(*datasets.values(), field_name=Const.INPUT)
+
+ tgt_vocab = Vocabulary(unknown=None, padding=None) \
+ if tgt_vocab_opt is None else Vocabulary(**tgt_vocab_opt)
+ tgt_vocab.from_dataset(datasets['train'], field_name=Const.TARGET)
+ tgt_vocab.index_dataset(*datasets.values(), field_name=Const.TARGET)
+
+ info.vocabs = {
+ Const.INPUT: src_vocab,
+ Const.TARGET: tgt_vocab
+ }
+
+ info.datasets = datasets
+
+ for name, dataset in info.datasets.items():
+ dataset.set_input(Const.INPUT)
+ dataset.set_target(Const.TARGET)
return info
diff --git a/fastNLP/io/data_loader/yelp.py b/fastNLP/io/data_loader/yelp.py
new file mode 100644
index 00000000..333fcab0
--- /dev/null
+++ b/fastNLP/io/data_loader/yelp.py
@@ -0,0 +1,132 @@
+
+import csv
+from typing import Iterable
+
+from ...core.const import Const
+from ...core.dataset import DataSet
+from ...core.instance import Instance
+from ...core.vocabulary import VocabularyOption, Vocabulary
+from ..base_loader import DataBundle, DataSetLoader
+from typing import Union, Dict
+from ..utils import check_dataloader_paths, get_tokenizer
+
+
+class YelpLoader(DataSetLoader):
+ """
+ 别名::class:`fastNLP.io.YelpLoader` :class:`fastNLP.io.data_loader.YelpLoader`
+    读取Yelp_full/Yelp_polarity数据集, DataSet包含fields::
+
+ words: list(str), 需要分类的文本
+
+ target: str, 文本的标签
+
+ chars:list(str),未index的字符列表
+
+ 数据集:yelp_full/yelp_polarity
+
+    :param fine_grained: 是否使用5分类(very negative~very positive),若 ``False`` ,将very negative/very positive分别归并到negative/positive。Default: ``False``
+ :param lower: 是否需要自动转小写,默认为False。
+ """
+
+ def __init__(self, fine_grained=False, lower=False):
+ super(YelpLoader, self).__init__()
+ tag_v = {'1.0': 'very negative', '2.0': 'negative', '3.0': 'neutral',
+ '4.0': 'positive', '5.0': 'very positive'}
+ if not fine_grained:
+ tag_v['1.0'] = tag_v['2.0']
+ tag_v['5.0'] = tag_v['4.0']
+ self.fine_grained = fine_grained
+ self.tag_v = tag_v
+ self.lower = lower
+ self.tokenizer = get_tokenizer()
+
+ def _load(self, path):
+ ds = DataSet()
+ csv_reader = csv.reader(open(path, encoding='utf-8'))
+ all_count = 0
+ real_count = 0
+ for row in csv_reader:
+ all_count += 1
+ if len(row) == 2:
+ target = self.tag_v[row[0] + ".0"]
+ words = clean_str(row[1], self.tokenizer, self.lower)
+ if len(words) != 0:
+ ds.append(Instance(words=words, target=target))
+ real_count += 1
+ print("all count:", all_count)
+ print("real count:", real_count)
+ return ds
+
+ def process(self, paths: Union[str, Dict[str, str]],
+ train_ds: Iterable[str] = None,
+ src_vocab_op: VocabularyOption = None,
+ tgt_vocab_op: VocabularyOption = None,
+ char_level_op=False):
+ paths = check_dataloader_paths(paths)
+ info = DataBundle(datasets=self.load(paths))
+ src_vocab = Vocabulary() if src_vocab_op is None else Vocabulary(**src_vocab_op)
+ tgt_vocab = Vocabulary(unknown=None, padding=None) \
+ if tgt_vocab_op is None else Vocabulary(**tgt_vocab_op)
+ _train_ds = [info.datasets[name]
+ for name in train_ds] if train_ds else info.datasets.values()
+
+ def wordtochar(words):
+ chars = []
+ for word in words:
+ word = word.lower()
+ for char in word:
+ chars.append(char)
+ chars.append('')
+ chars.pop()
+ return chars
+
+ input_name, target_name = Const.INPUT, Const.TARGET
+ info.vocabs = {}
+        # 若 char_level_op 为 True,则将词级输入拆分为字符级输入
+ if char_level_op:
+ for dataset in info.datasets.values():
+ dataset.apply_field(wordtochar, field_name=Const.INPUT, new_field_name=Const.CHAR_INPUT)
+ else:
+ src_vocab.from_dataset(*_train_ds, field_name=input_name)
+ src_vocab.index_dataset(*info.datasets.values(), field_name=input_name, new_field_name=input_name)
+ info.vocabs[input_name] = src_vocab
+
+ tgt_vocab.from_dataset(*_train_ds, field_name=target_name)
+ tgt_vocab.index_dataset(
+ *info.datasets.values(),
+ field_name=target_name, new_field_name=target_name)
+
+ info.vocabs[target_name] = tgt_vocab
+
+        if 'dev' not in info.datasets:
+            info.datasets['train'], info.datasets['dev'] = info.datasets['train'].split(0.1, shuffle=False)
+
+ for name, dataset in info.datasets.items():
+ dataset.set_input(Const.INPUT)
+ dataset.set_target(Const.TARGET)
+
+ return info
+
+
+def clean_str(sentence, tokenizer, char_lower=False):
+ """
+ heavily borrowed from github
+ https://github.com/LukeZhuang/Hierarchical-Attention-Network/blob/master/yelp-preprocess.ipynb
+ :param sentence: is a str
+ :return:
+ """
+ if char_lower:
+ sentence = sentence.lower()
+ import re
+ nonalpnum = re.compile('[^0-9a-zA-Z?!\']+')
+ words = tokenizer(sentence)
+ words_collection = []
+ for word in words:
+ if word in ['-lrb-', '-rrb-', '', '-r', '-l', 'b-']:
+ continue
+ tt = nonalpnum.split(word)
+ t = ''.join(tt)
+ if t != '':
+ words_collection.append(t)
+
+ return words_collection
+
diff --git a/fastNLP/io/dataset_loader.py b/fastNLP/io/dataset_loader.py
index d175d3b9..ad6bbdc1 100644
--- a/fastNLP/io/dataset_loader.py
+++ b/fastNLP/io/dataset_loader.py
@@ -15,195 +15,13 @@ dataset_loader模块实现了许多 DataSetLoader, 用于读取不同格式的
__all__ = [
'CSVLoader',
'JsonLoader',
- 'ConllLoader',
- 'SNLILoader',
- 'SSTLoader',
- 'PeopleDailyCorpusLoader',
- 'Conll2003Loader',
]
-from nltk import Tree
+
from ..core.dataset import DataSet
from ..core.instance import Instance
-from .file_reader import _read_csv, _read_json, _read_conll
+from .file_reader import _read_csv, _read_json
from .base_loader import DataSetLoader
-from .data_loader.sst import SSTLoader
-
-class PeopleDailyCorpusLoader(DataSetLoader):
- """
- 别名::class:`fastNLP.io.PeopleDailyCorpusLoader` :class:`fastNLP.io.dataset_loader.PeopleDailyCorpusLoader`
-
- 读取人民日报数据集
- """
-
- def __init__(self, pos=True, ner=True):
- super(PeopleDailyCorpusLoader, self).__init__()
- self.pos = pos
- self.ner = ner
-
- def _load(self, data_path):
- with open(data_path, "r", encoding="utf-8") as f:
- sents = f.readlines()
- examples = []
- for sent in sents:
- if len(sent) <= 2:
- continue
- inside_ne = False
- sent_pos_tag = []
- sent_words = []
- sent_ner = []
- words = sent.strip().split()[1:]
- for word in words:
- if "[" in word and "]" in word:
- ner_tag = "U"
- print(word)
- elif "[" in word:
- inside_ne = True
- ner_tag = "B"
- word = word[1:]
- elif "]" in word:
- ner_tag = "L"
- word = word[:word.index("]")]
- if inside_ne is True:
- inside_ne = False
- else:
- raise RuntimeError("only ] appears!")
- else:
- if inside_ne is True:
- ner_tag = "I"
- else:
- ner_tag = "O"
- tmp = word.split("/")
- token, pos = tmp[0], tmp[1]
- sent_ner.append(ner_tag)
- sent_pos_tag.append(pos)
- sent_words.append(token)
- example = [sent_words]
- if self.pos is True:
- example.append(sent_pos_tag)
- if self.ner is True:
- example.append(sent_ner)
- examples.append(example)
- return self.convert(examples)
-
- def convert(self, data):
- """
-
- :param data: python 内置对象
- :return: 一个 :class:`~fastNLP.DataSet` 类型的对象
- """
- data_set = DataSet()
- for item in data:
- sent_words = item[0]
- if self.pos is True and self.ner is True:
- instance = Instance(
- words=sent_words, pos_tags=item[1], ner=item[2])
- elif self.pos is True:
- instance = Instance(words=sent_words, pos_tags=item[1])
- elif self.ner is True:
- instance = Instance(words=sent_words, ner=item[1])
- else:
- instance = Instance(words=sent_words)
- data_set.append(instance)
- data_set.apply(lambda ins: len(ins["words"]), new_field_name="seq_len")
- return data_set
-
-
-class ConllLoader(DataSetLoader):
- """
- 别名::class:`fastNLP.io.ConllLoader` :class:`fastNLP.io.dataset_loader.ConllLoader`
-
- 读取Conll格式的数据. 数据格式详见 http://conll.cemantix.org/2012/data.html
-
- 列号从0开始, 每列对应内容为::
-
- Column Type
- 0 Document ID
- 1 Part number
- 2 Word number
- 3 Word itself
- 4 Part-of-Speech
- 5 Parse bit
- 6 Predicate lemma
- 7 Predicate Frameset ID
- 8 Word sense
- 9 Speaker/Author
- 10 Named Entities
- 11:N Predicate Arguments
- N Coreference
-
- :param headers: 每一列数据的名称,需为List or Tuple of str。``header`` 与 ``indexes`` 一一对应
- :param indexes: 需要保留的数据列下标,从0开始。若为 ``None`` ,则所有列都保留。Default: ``None``
- :param dropna: 是否忽略非法数据,若 ``False`` ,遇到非法数据时抛出 ``ValueError`` 。Default: ``False``
- """
-
- def __init__(self, headers, indexes=None, dropna=False):
- super(ConllLoader, self).__init__()
- if not isinstance(headers, (list, tuple)):
- raise TypeError(
- 'invalid headers: {}, should be list of strings'.format(headers))
- self.headers = headers
- self.dropna = dropna
- if indexes is None:
- self.indexes = list(range(len(self.headers)))
- else:
- if len(indexes) != len(headers):
- raise ValueError
- self.indexes = indexes
-
- def _load(self, path):
- ds = DataSet()
- for idx, data in _read_conll(path, indexes=self.indexes, dropna=self.dropna):
- ins = {h: data[i] for i, h in enumerate(self.headers)}
- ds.append(Instance(**ins))
- return ds
-
-
-class Conll2003Loader(ConllLoader):
- """
- 别名::class:`fastNLP.io.Conll2003Loader` :class:`fastNLP.io.dataset_loader.Conll2003Loader`
-
- 读取Conll2003数据
-
- 关于数据集的更多信息,参考:
- https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data
- """
-
- def __init__(self):
- headers = [
- 'tokens', 'pos', 'chunks', 'ner',
- ]
- super(Conll2003Loader, self).__init__(headers=headers)
-
-
-def _cut_long_sentence(sent, max_sample_length=200):
- """
- 将长于max_sample_length的sentence截成多段,只会在有空格的地方发生截断。
- 所以截取的句子可能长于或者短于max_sample_length
-
- :param sent: str.
- :param max_sample_length: int.
- :return: list of str.
- """
- sent_no_space = sent.replace(' ', '')
- cutted_sentence = []
- if len(sent_no_space) > max_sample_length:
- parts = sent.strip().split()
- new_line = ''
- length = 0
- for part in parts:
- length += len(part)
- new_line += part + ' '
- if length > max_sample_length:
- new_line = new_line[:-1]
- cutted_sentence.append(new_line)
- length = 0
- new_line = ''
- if new_line != '':
- cutted_sentence.append(new_line[:-1])
- else:
- cutted_sentence.append(sent)
- return cutted_sentence
class JsonLoader(DataSetLoader):
@@ -242,42 +60,6 @@ class JsonLoader(DataSetLoader):
return ds
-class SNLILoader(JsonLoader):
- """
- 别名::class:`fastNLP.io.SNLILoader` :class:`fastNLP.io.dataset_loader.SNLILoader`
-
- 读取SNLI数据集,读取的DataSet包含fields::
-
- words1: list(str),第一句文本, premise
- words2: list(str), 第二句文本, hypothesis
- target: str, 真实标签
-
- 数据来源: https://nlp.stanford.edu/projects/snli/snli_1.0.zip
- """
-
- def __init__(self):
- fields = {
- 'sentence1_parse': 'words1',
- 'sentence2_parse': 'words2',
- 'gold_label': 'target',
- }
- super(SNLILoader, self).__init__(fields=fields)
-
- def _load(self, path):
- ds = super(SNLILoader, self)._load(path)
-
- def parse_tree(x):
- t = Tree.fromstring(x)
- return t.leaves()
-
- ds.apply(lambda ins: parse_tree(
- ins['words1']), new_field_name='words1')
- ds.apply(lambda ins: parse_tree(
- ins['words2']), new_field_name='words2')
- ds.drop(lambda x: x['target'] == '-')
- return ds
-
-
class CSVLoader(DataSetLoader):
"""
别名::class:`fastNLP.io.CSVLoader` :class:`fastNLP.io.dataset_loader.CSVLoader`
@@ -304,6 +86,36 @@ class CSVLoader(DataSetLoader):
return ds
+def _cut_long_sentence(sent, max_sample_length=200):
+ """
+ 将长于max_sample_length的sentence截成多段,只会在有空格的地方发生截断。
+ 所以截取的句子可能长于或者短于max_sample_length
+
+ :param sent: str.
+ :param max_sample_length: int.
+ :return: list of str.
+ """
+ sent_no_space = sent.replace(' ', '')
+ cutted_sentence = []
+ if len(sent_no_space) > max_sample_length:
+ parts = sent.strip().split()
+ new_line = ''
+ length = 0
+ for part in parts:
+ length += len(part)
+ new_line += part + ' '
+ if length > max_sample_length:
+ new_line = new_line[:-1]
+ cutted_sentence.append(new_line)
+ length = 0
+ new_line = ''
+ if new_line != '':
+ cutted_sentence.append(new_line[:-1])
+ else:
+ cutted_sentence.append(sent)
+ return cutted_sentence
+
+
def _add_seg_tag(data):
"""
diff --git a/fastNLP/io/embed_loader.py b/fastNLP/io/embed_loader.py
index 93861258..91a0919c 100644
--- a/fastNLP/io/embed_loader.py
+++ b/fastNLP/io/embed_loader.py
@@ -10,10 +10,10 @@ import numpy as np
from ..core.vocabulary import Vocabulary
from .base_loader import BaseLoader
-from ..core.utils import Example
+from ..core.utils import Option
-class EmbeddingOption(Example):
+class EmbeddingOption(Option):
def __init__(self,
embed_filepath=None,
dtype=np.float32,
@@ -26,6 +26,7 @@ class EmbeddingOption(Example):
error=error
)
+
class EmbedLoader(BaseLoader):
"""
别名::class:`fastNLP.io.EmbedLoader` :class:`fastNLP.io.embed_loader.EmbedLoader`
@@ -35,9 +36,10 @@ class EmbedLoader(BaseLoader):
def __init__(self):
super(EmbedLoader, self).__init__()
-
+
@staticmethod
- def load_with_vocab(embed_filepath, vocab, dtype=np.float32, normalize=True, error='ignore'):
+ def load_with_vocab(embed_filepath, vocab, dtype=np.float32, padding='', unknown='', normalize=True,
+ error='ignore', init_method=None):
"""
从embed_filepath这个预训练的词向量中抽取出vocab这个词表的词的embedding。EmbedLoader将自动判断embed_filepath是
word2vec(第一行只有两个元素)还是glove格式的数据。
@@ -46,9 +48,12 @@ class EmbedLoader(BaseLoader):
:param vocab: 词表 :class:`~fastNLP.Vocabulary` 类型,读取出现在vocab中的词的embedding。
没有出现在vocab中的词的embedding将通过找到的词的embedding的正态分布采样出来,以使得整个Embedding是同分布的。
:param dtype: 读出的embedding的类型
+ :param str padding: 词表中padding的token
+ :param str unknown: 词表中unknown的token
:param bool normalize: 是否将每个vector归一化到norm为1
:param str error: `ignore` , `strict` ; 如果 `ignore` ,错误将自动跳过; 如果 `strict` , 错误将抛出。
这里主要可能出错的地方在于词表有空行或者词表出现了维度不一致。
+ :param callable init_method: 传入numpy.ndarray, 返回numpy.ndarray, 用以初始化embedding
:return numpy.ndarray: shape为 [len(vocab), dimension], dimension由pretrain的embedding决定。
"""
assert isinstance(vocab, Vocabulary), "Only fastNLP.Vocabulary is supported."
@@ -66,12 +71,21 @@ class EmbedLoader(BaseLoader):
dim = len(parts) - 1
f.seek(0)
matrix = np.random.randn(len(vocab), dim).astype(dtype)
+ if init_method:
+ matrix = init_method(matrix)
for idx, line in enumerate(f, start_idx):
try:
parts = line.strip().split()
- if parts[0] in vocab:
- index = vocab.to_index(parts[0])
- matrix[index] = np.fromstring(' '.join(parts[1:]), sep=' ', dtype=dtype, count=dim)
+ word = ''.join(parts[:-dim])
+ nums = parts[-dim:]
+ # 对齐unk与pad
+ if word==padding and vocab.padding is not None:
+ word = vocab.padding
+ elif word==unknown and vocab.unknown is not None:
+ word = vocab.unknown
+ if word in vocab:
+ index = vocab.to_index(word)
+ matrix[index] = np.fromstring(' '.join(nums), sep=' ', dtype=dtype, count=dim)
hit_flags[index] = True
except Exception as e:
if error == 'ignore':
@@ -81,14 +95,15 @@ class EmbedLoader(BaseLoader):
raise e
total_hits = sum(hit_flags)
print("Found {} out of {} words in the pre-training embedding.".format(total_hits, len(vocab)))
- found_vectors = matrix[hit_flags]
- if len(found_vectors) != 0:
- mean = np.mean(found_vectors, axis=0, keepdims=True)
- std = np.std(found_vectors, axis=0, keepdims=True)
- unfound_vec_num = len(vocab) - total_hits
- r_vecs = np.random.randn(unfound_vec_num, dim).astype(dtype) * std + mean
- matrix[hit_flags == False] = r_vecs
-
+ if init_method is None:
+ found_vectors = matrix[hit_flags]
+ if len(found_vectors) != 0:
+ mean = np.mean(found_vectors, axis=0, keepdims=True)
+ std = np.std(found_vectors, axis=0, keepdims=True)
+ unfound_vec_num = len(vocab) - total_hits
+ r_vecs = np.random.randn(unfound_vec_num, dim).astype(dtype) * std + mean
+ matrix[hit_flags == False] = r_vecs
+
if normalize:
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
@@ -102,14 +117,14 @@ class EmbedLoader(BaseLoader):
:param str embed_filepath: 预训练的embedding的路径。
:param dtype: 读出的embedding的类型
- :param str padding: the padding tag for vocabulary.
- :param str unknown: the unknown tag for vocabulary.
+ :param str padding: 词表中的padding的token. 并以此用做vocab的padding。
+ :param str unknown: 词表中的unknown的token. 并以此用做vocab的unknown。
:param bool normalize: 是否将每个vector归一化到norm为1
:param str error: `ignore` , `strict` ; 如果 `ignore` ,错误将自动跳过; 如果 `strict` , 错误将抛出。这里主要可能出错的地
方在于词表有空行或者词表出现了维度不一致。
- :return numpy.ndarray: shape为 [len(vocab), dimension], dimension由pretrain的embedding决定。
- :return numpy.ndarray: Vocabulary Embedding的shape是[词表大小+x, 词表维度], "词表大小+x"是由于最终的大小还取决与
+ :return (numpy.ndarray, Vocabulary): Embedding的shape是[词表大小+x, 词表维度], "词表大小+x"是由于最终的大小还取决与
是否使用padding, 以及unknown有没有在词表中找到对应的词。 Vocabulary中的词的顺序与Embedding的顺序是一一对应的。
+
"""
vocab = Vocabulary(padding=padding, unknown=unknown)
vec_dict = {}
@@ -126,15 +141,16 @@ class EmbedLoader(BaseLoader):
for idx, line in enumerate(f, start=start):
try:
parts = line.strip().split()
- word = parts[0]
if dim == -1:
dim = len(parts) - 1
- vec = np.fromstring(' '.join(parts[1:]), sep=' ', dtype=dtype, count=dim)
+ word = ''.join(parts[:-dim])
+ nums = parts[-dim:]
+ vec = np.fromstring(' '.join(nums), sep=' ', dtype=dtype, count=dim)
vec_dict[word] = vec
vocab.add_word(word)
if unknown is not None and unknown == word:
found_unknown = True
- if found_pad is not None and padding == word:
+ if padding is not None and padding == word:
found_pad = True
except Exception as e:
if error == 'ignore':
@@ -146,13 +162,17 @@ class EmbedLoader(BaseLoader):
if dim == -1:
raise RuntimeError("{} is an empty file.".format(embed_filepath))
matrix = np.random.randn(len(vocab), dim).astype(dtype)
+ for key, vec in vec_dict.items():
+ index = vocab.to_index(key)
+ matrix[index] = vec
+
if (unknown is not None and not found_unknown) or (padding is not None and not found_pad):
start_idx = 0
if padding is not None:
start_idx += 1
if unknown is not None:
start_idx += 1
-
+
mean = np.mean(matrix[start_idx:], axis=0, keepdims=True)
std = np.std(matrix[start_idx:], axis=0, keepdims=True)
if (unknown is not None and not found_unknown):
@@ -160,10 +180,6 @@ class EmbedLoader(BaseLoader):
if (padding is not None and not found_pad):
matrix[0] = np.random.randn(1, dim).astype(dtype) * std + mean
- for key, vec in vec_dict.items():
- index = vocab.to_index(key)
- matrix[index] = vec
-
if normalize:
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
diff --git a/fastNLP/io/file_reader.py b/fastNLP/io/file_reader.py
index 5963bb56..0ae0a319 100644
--- a/fastNLP/io/file_reader.py
+++ b/fastNLP/io/file_reader.py
@@ -90,11 +90,12 @@ def _read_conll(path, encoding='utf-8', indexes=None, dropna=True):
return sample
with open(path, 'r', encoding=encoding) as f:
sample = []
- start = next(f)
- if '-DOCSTART-' not in start:
+ start = next(f).strip()
+ if '-DOCSTART-' not in start and start!='':
sample.append(start.split())
for line_idx, line in enumerate(f, 1):
- if line.startswith('\n'):
+ line = line.strip()
+ if line=='':
if len(sample):
try:
res = parse_conll(sample)
@@ -103,11 +104,12 @@ def _read_conll(path, encoding='utf-8', indexes=None, dropna=True):
except Exception as e:
if dropna:
continue
- raise ValueError('invalid instance at line: {}'.format(line_idx))
+ raise ValueError('invalid instance ends at line: {}'.format(line_idx))
elif line.startswith('#'):
continue
else:
- sample.append(line.split())
+ if not line.startswith('-DOCSTART-'):
+ sample.append(line.split())
if len(sample) > 0:
try:
res = parse_conll(sample)
@@ -115,4 +117,5 @@ def _read_conll(path, encoding='utf-8', indexes=None, dropna=True):
except Exception as e:
if dropna:
return
- raise ValueError('invalid instance at line: {}'.format(line_idx))
+ print('invalid instance ends at line: {}'.format(line_idx))
+ raise e
diff --git a/fastNLP/io/file_utils.py b/fastNLP/io/file_utils.py
new file mode 100644
index 00000000..cb762eb7
--- /dev/null
+++ b/fastNLP/io/file_utils.py
@@ -0,0 +1,299 @@
+
+import os
+from pathlib import Path
+from urllib.parse import urlparse
+import re
+import requests
+import tempfile
+from tqdm import tqdm
+import shutil
+import hashlib
+
+
+PRETRAINED_BERT_MODEL_DIR = {
+ 'en': 'bert-base-cased-f89bfe08.zip',
+ 'en-base-uncased': 'bert-base-uncased-3413b23c.zip',
+ 'en-base-cased': 'bert-base-cased-f89bfe08.zip',
+ 'en-large-uncased': 'bert-large-uncased-20939f45.zip',
+ 'en-large-cased': 'bert-large-cased-e0cf90fc.zip',
+
+ 'en-large-cased-wwm': 'bert-large-cased-wwm-a457f118.zip',
+ 'en-large-uncased-wwm': 'bert-large-uncased-wwm-92a50aeb.zip',
+ 'en-base-cased-mrpc': 'bert-base-cased-finetuned-mrpc-c7099855.zip',
+
+ 'cn': 'bert-base-chinese-29d0a84a.zip',
+ 'cn-base': 'bert-base-chinese-29d0a84a.zip',
+
+ 'multilingual': 'bert-base-multilingual-cased-1bd364ee.zip',
+ 'multilingual-base-uncased': 'bert-base-multilingual-uncased-f8730fe4.zip',
+ 'multilingual-base-cased': 'bert-base-multilingual-cased-1bd364ee.zip',
+}
+
+PRETRAINED_ELMO_MODEL_DIR = {
+ 'en': 'elmo_en-d39843fe.tar.gz',
+ 'cn': 'elmo_cn-5e9b34e2.tar.gz'
+}
+
+PRETRAIN_STATIC_FILES = {
+ 'en': 'glove.840B.300d-cc1ad5e1.tar.gz',
+ 'en-glove-840b-300': 'glove.840B.300d-cc1ad5e1.tar.gz',
+ 'en-glove-6b-50': "glove.6B.50d-a6028c70.tar.gz",
+ 'en-word2vec-300': "GoogleNews-vectors-negative300-be166d9d.tar.gz",
+ 'en-fasttext': "cc.en.300.vec-d53187b2.gz",
+ 'cn': "tencent_cn-dab24577.tar.gz",
+ 'cn-fasttext': "cc.zh.300.vec-d68a9bcf.gz",
+}
+
+
+def cached_path(url_or_filename: str, cache_dir: Path=None) -> Path:
+ """
+ 给定一个url或者文件名(可以是具体的文件名,也可以是文件),先在cache_dir下寻找该文件是否存在,如果不存在则去下载, 并
+ 将文件放入到cache_dir中
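+
+    Example (a sketch; assumes ``base_url`` points at a file server hosting the archive,
+    as in the test block at the bottom of this file)::
+
+        path = cached_path(base_url + '/bert-base-cased-f89bfe08.zip', cache_dir)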
+ """
+ if cache_dir is None:
+ dataset_cache = Path(get_defalt_path())
+ else:
+ dataset_cache = cache_dir
+
+ parsed = urlparse(url_or_filename)
+
+ if parsed.scheme in ("http", "https"):
+ # URL, so get it from the cache (downloading if necessary)
+ return get_from_cache(url_or_filename, dataset_cache)
+ elif parsed.scheme == "" and Path(os.path.join(dataset_cache, url_or_filename)).exists():
+ # File, and it exists.
+        return Path(os.path.join(dataset_cache, url_or_filename))
+ elif parsed.scheme == "":
+ # File, but it doesn't exist.
+ raise FileNotFoundError("file {} not found".format(url_or_filename))
+ else:
+ # Something unknown
+ raise ValueError(
+ "unable to parse {} as a URL or as a local path".format(url_or_filename)
+ )
+
+
+def get_filepath(filepath):
+ """
+ 如果filepath中只有一个文件,则直接返回对应的全路径
+ :param filepath:
+ :return:
+ """
+ if os.path.isdir(filepath):
+ files = os.listdir(filepath)
+ if len(files)==1:
+ return os.path.join(filepath, files[0])
+ else:
+ return filepath
+ return filepath
+
+
+def get_defalt_path():
+ """
+    获取默认的fastNLP存放路径, 如果将FASTNLP_CACHE_DIR设置在了环境变量中,将使用环境变量的值,使得不用每个用户都去下载。
+
+ :return:
+ """
+ if 'FASTNLP_CACHE_DIR' in os.environ:
+ fastnlp_cache_dir = os.environ.get('FASTNLP_CACHE_DIR')
+ if os.path.exists(fastnlp_cache_dir):
+ return fastnlp_cache_dir
+ raise RuntimeError("Some errors happens on cache directory.")
+ else:
+ raise RuntimeError("There function is not available right now.")
+ fastnlp_cache_dir = os.path.expanduser(os.path.join("~", ".fastNLP"))
+ return fastnlp_cache_dir
+
+
+def _get_base_url(name):
+ # 返回的URL结尾必须是/
+ if 'FASTNLP_BASE_URL' in os.environ:
+ fastnlp_base_url = os.environ['FASTNLP_BASE_URL']
+ return fastnlp_base_url
+ raise RuntimeError("There function is not available right now.")
+
+
+def split_filename_suffix(filepath):
+ """
+ 给定filepath返回对应的name和suffix
+ :param filepath:
+ :return: filename, suffix
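+
+    e.g. 'glove.840B.300d-cc1ad5e1.tar.gz' -> ('glove.840B.300d-cc1ad5e1', '.tar.gz')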
+ """
+ filename = os.path.basename(filepath)
+ if filename.endswith('.tar.gz'):
+ return filename[:-7], '.tar.gz'
+ return os.path.splitext(filename)
+
+
+def get_from_cache(url: str, cache_dir: Path = None) -> Path:
+ """
+ 尝试在cache_dir中寻找url定义的资源; 如果没有找到。则从url下载并将结果放在cache_dir下,缓存的名称由url的结果推断而来。
+ 如果从url中下载的资源解压后有多个文件,则返回directory的路径; 如果只有一个资源,则返回具体的路径。
+
+ """
+ cache_dir.mkdir(parents=True, exist_ok=True)
+
+ filename = re.sub(r".+/", "", url)
+ dir_name, suffix = split_filename_suffix(filename)
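+    # cached archives are expected to be named <name>-<first 8 hex chars of sha256><suffix>,
+    # e.g. bert-base-cased-f89bfe08.zip; the trailing hash is used as a checksum after download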
+    sep_index = dir_name.rfind('-')
+    if sep_index < 0:
+        check_sum = None
+    else:
+        check_sum = dir_name[sep_index + 1:]
+        dir_name = dir_name[:sep_index]
+
+ # 寻找与它名字匹配的内容, 而不关心后缀
+ match_dir_name = match_file(dir_name, cache_dir)
+ if match_dir_name:
+ dir_name = match_dir_name
+ cache_path = cache_dir / dir_name
+
+ # get cache path to put the file
+ if cache_path.exists():
+ return get_filepath(cache_path)
+
+ # make HEAD request to check ETag TODO ETag可以用来判断资源是否已经更新了,之后需要加上
+ response = requests.head(url, headers={"User-Agent": "fastNLP"})
+ if response.status_code != 200:
+ raise IOError(
+ f"HEAD request failed for url {url} with status code {response.status_code}."
+ )
+
+ # add ETag to filename if it exists
+ # etag = response.headers.get("ETag")
+
+ if not cache_path.exists():
+ # Download to temporary file, then copy to cache dir once finished.
+ # Otherwise you get corrupt cache entries if the download gets interrupted.
+ fd, temp_filename = tempfile.mkstemp()
+ print("%s not found in cache, downloading to %s"%(url, temp_filename))
+
+ # GET file object
+ req = requests.get(url, stream=True, headers={"User-Agent": "fastNLP"})
+ content_length = req.headers.get("Content-Length")
+ total = int(content_length) if content_length is not None else None
+ progress = tqdm(unit="B", total=total)
+ sha256 = hashlib.sha256()
+ with open(temp_filename, "wb") as temp_file:
+ for chunk in req.iter_content(chunk_size=1024):
+ if chunk: # filter out keep-alive new chunks
+ progress.update(len(chunk))
+ temp_file.write(chunk)
+ sha256.update(chunk)
+ # check sum
+ digit = sha256.hexdigest()[:8]
+            if check_sum:
+                assert digit == check_sum, "File corrupted during download."
+ progress.close()
+ print(f"Finish download from {url}.")
+
+ # 开始解压
+ delete_temp_dir = None
+ if suffix in ('.zip', '.tar.gz'):
+ uncompress_temp_dir = tempfile.mkdtemp()
+ delete_temp_dir = uncompress_temp_dir
+ print(f"Start to uncompress file to {uncompress_temp_dir}.")
+ if suffix == '.zip':
+ unzip_file(Path(temp_filename), Path(uncompress_temp_dir))
+ else:
+ untar_gz_file(Path(temp_filename), Path(uncompress_temp_dir))
+ filenames = os.listdir(uncompress_temp_dir)
+ if len(filenames)==1:
+ if os.path.isdir(os.path.join(uncompress_temp_dir, filenames[0])):
+ uncompress_temp_dir = os.path.join(uncompress_temp_dir, filenames[0])
+
+ cache_path.mkdir(parents=True, exist_ok=True)
+ print("Finish un-compressing file.")
+ else:
+ uncompress_temp_dir = temp_filename
+        cache_path = Path(str(cache_path) + suffix)
+ success = False
+ try:
+ # 复制到指定的位置
+ print(f"Copy file to {cache_path}.")
+ if os.path.isdir(uncompress_temp_dir):
+ for filename in os.listdir(uncompress_temp_dir):
+ shutil.copyfile(os.path.join(uncompress_temp_dir, filename), cache_path/filename)
+ else:
+ shutil.copyfile(uncompress_temp_dir, cache_path)
+ success = True
+ except Exception as e:
+ print(e)
+ raise e
+ finally:
+ if not success:
+ if cache_path.exists():
+ if cache_path.is_file():
+ os.remove(cache_path)
+ else:
+ shutil.rmtree(cache_path)
+ if delete_temp_dir:
+ shutil.rmtree(delete_temp_dir)
+ os.close(fd)
+ os.remove(temp_filename)
+
+ return get_filepath(cache_path)
+
+
+def unzip_file(file: Path, to: Path):
+ # unpack and write out in CoNLL column-like format
+ from zipfile import ZipFile
+
+ with ZipFile(file, "r") as zipObj:
+ # Extract all the contents of zip file in current directory
+ zipObj.extractall(to)
+
+
+def untar_gz_file(file:Path, to:Path):
+ import tarfile
+
+ with tarfile.open(file, 'r:gz') as tar:
+ tar.extractall(to)
+
+
+def match_file(dir_name: str, cache_dir: str) -> str:
+ """
+ 匹配的原则是,在cache_dir下的文件: (1) 与dir_name完全一致; (2) 除了后缀以外和dir_name完全一致。
+ 如果找到了两个匹配的结果将报错. 如果找到了则返回匹配的文件的名称; 没有找到返回空字符串
+
+ :param dir_name: 需要匹配的名称
+ :param cache_dir: 在该目录下找匹配dir_name是否存在
+ :return: str
+ """
+ files = os.listdir(cache_dir)
+ matched_filenames = []
+ for file_name in files:
+ if re.match(dir_name+'$', file_name) or re.match(dir_name+'\\..*', file_name):
+ matched_filenames.append(file_name)
+ if len(matched_filenames)==0:
+ return ''
+ elif len(matched_filenames)==1:
+ return matched_filenames[-1]
+ else:
+ raise RuntimeError(f"Duplicate matched files:{matched_filenames}, this should be caused by a bug.")
+
+
+if __name__ == '__main__':
+ cache_dir = Path('caches')
+ cache_dir = None
+ # 需要对cache_dir进行测试
+ base_url = 'http://0.0.0.0:8888/file/download'
+ # if True:
+ # for filename in os.listdir(cache_dir):
+ # if os.path.isdir(os.path.join(cache_dir, filename)):
+ # shutil.rmtree(os.path.join(cache_dir, filename))
+ # else:
+ # os.remove(os.path.join(cache_dir, filename))
+ # 1. 测试.txt文件
+ print(cached_path(base_url + '/{}'.format('txt_test-bcb4fe65.txt'), cache_dir))
+ # 2. 测试.zip文件(只有一个文件)
+ print(cached_path(base_url + '/{}'.format('zip_test-40966d39.zip'), cache_dir))
+ # 3. 测试.zip文件(有多个文件)
+ print(cached_path(base_url + '/{}'.format('zip_pack_test-70c0b20d.zip'), cache_dir))
+ # 4. 测试.tar.gz文件
+ print(cached_path(base_url + '/{}'.format('tar_gz_test-3e2679cf.tar.gz'), cache_dir))
+ # 5. 测试.tar.gz多个文件
+ print(cached_path(base_url + '/{}'.format('tar_gz_pack_test-08dfdccd.tar.gz'), cache_dir))
+
+ # 6. 测试.pkl文件
diff --git a/fastNLP/io/utils.py b/fastNLP/io/utils.py
new file mode 100644
index 00000000..a7d2de85
--- /dev/null
+++ b/fastNLP/io/utils.py
@@ -0,0 +1,69 @@
+import os
+
+from typing import Union, Dict
+
+
+def check_dataloader_paths(paths:Union[str, Dict[str, str]])->Dict[str, str]:
+ """
+ 检查传入dataloader的文件的合法性。如果为合法路径,将返回至少包含'train'这个key的dict。类似于下面的结果
+ {
+ 'train': '/some/path/to/', # 一定包含,建词表应该在这上面建立,剩下的其它文件应该只需要处理并index。
+ 'test': 'xxx' # 可能有,也可能没有
+ ...
+ }
+ 如果paths为不合法的,将直接进行raise相应的错误
+
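+    Example (illustrative; the directory below is hypothetical)::
+
+        paths = check_dataloader_paths('/path/to/data_dir')
+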
+ :param paths: 路径. 可以为一个文件路径(则认为该文件就是train的文件); 可以为一个文件目录,将在该目录下寻找train(文件名
+ 中包含train这个字段), test.txt, dev.txt; 可以为一个dict, 则key是用户自定义的某个文件的名称,value是这个文件的路径。
+    :return: Dict[str, str]
+ """
+ if isinstance(paths, str):
+ if os.path.isfile(paths):
+ return {'train': paths}
+ elif os.path.isdir(paths):
+ filenames = os.listdir(paths)
+ files = {}
+ for filename in filenames:
+ path_pair = None
+ if 'train' in filename:
+ path_pair = ('train', filename)
+ if 'dev' in filename:
+ if path_pair:
+                        raise Exception("File:{} in {} contains both `{}` and `dev`.".format(filename, paths, path_pair[0]))
+ path_pair = ('dev', filename)
+ if 'test' in filename:
+ if path_pair:
+                        raise Exception("File:{} in {} contains both `{}` and `test`.".format(filename, paths, path_pair[0]))
+ path_pair = ('test', filename)
+ if path_pair:
+ files[path_pair[0]] = os.path.join(paths, path_pair[1])
+ return files
+ else:
+ raise FileNotFoundError(f"{paths} is not a valid file path.")
+
+ elif isinstance(paths, dict):
+ if paths:
+ if 'train' not in paths:
+ raise KeyError("You have to include `train` in your dict.")
+ for key, value in paths.items():
+ if isinstance(key, str) and isinstance(value, str):
+ if not os.path.isfile(value):
+                    raise FileNotFoundError(f"{value} is not a valid file.")
+ else:
+ raise TypeError("All keys and values in paths should be str.")
+ return paths
+ else:
+ raise ValueError("Empty paths is not allowed.")
+ else:
+ raise TypeError(f"paths only supports str and dict. not {type(paths)}.")
+
+def get_tokenizer():
+ try:
+ import spacy
+ spacy.prefer_gpu()
+ en = spacy.load('en')
+ print('use spacy tokenizer')
+ return lambda x: [w.text for w in en.tokenizer(x)]
+ except Exception as e:
+ print('use raw tokenizer')
+ return lambda x: x.split()
diff --git a/fastNLP/models/bert.py b/fastNLP/models/bert.py
index 02227c0d..adecab60 100644
--- a/fastNLP/models/bert.py
+++ b/fastNLP/models/bert.py
@@ -8,35 +8,7 @@ from torch import nn
from .base_model import BaseModel
from ..core.const import Const
from ..modules.encoder import BertModel
-
-
-class BertConfig:
-
- def __init__(
- self,
- vocab_size=30522,
- hidden_size=768,
- num_hidden_layers=12,
- num_attention_heads=12,
- intermediate_size=3072,
- hidden_act="gelu",
- hidden_dropout_prob=0.1,
- attention_probs_dropout_prob=0.1,
- max_position_embeddings=512,
- type_vocab_size=2,
- initializer_range=0.02
- ):
- self.vocab_size = vocab_size
- self.hidden_size = hidden_size
- self.num_hidden_layers = num_hidden_layers
- self.num_attention_heads = num_attention_heads
- self.intermediate = intermediate_size
- self.hidden_act = hidden_act
- self.hidden_dropout_prob = hidden_dropout_prob
- self.attention_probs_dropout_prob = attention_probs_dropout_prob
- self.max_position_embeddings = max_position_embeddings
- self.type_vocab_size = type_vocab_size
- self.initializer_range = initializer_range
+from ..modules.encoder.bert import BertConfig
class BertForSequenceClassification(BaseModel):
@@ -84,11 +56,17 @@ class BertForSequenceClassification(BaseModel):
self.bert = BertModel.from_pretrained(bert_dir)
else:
if config is None:
- config = BertConfig()
- self.bert = BertModel(**config.__dict__)
+ config = BertConfig(30522)
+ self.bert = BertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, num_labels)
+ @classmethod
+ def from_pretrained(cls, num_labels, pretrained_model_dir):
+ config = BertConfig(pretrained_model_dir)
+ model = cls(num_labels=num_labels, config=config, bert_dir=pretrained_model_dir)
+ return model
+
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
_, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
pooled_output = self.dropout(pooled_output)
@@ -151,11 +129,17 @@ class BertForMultipleChoice(BaseModel):
self.bert = BertModel.from_pretrained(bert_dir)
else:
if config is None:
- config = BertConfig()
- self.bert = BertModel(**config.__dict__)
+ config = BertConfig(30522)
+ self.bert = BertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, 1)
+ @classmethod
+ def from_pretrained(cls, num_choices, pretrained_model_dir):
+ config = BertConfig(pretrained_model_dir)
+ model = cls(num_choices=num_choices, config=config, bert_dir=pretrained_model_dir)
+ return model
+
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
flat_input_ids = input_ids.view(-1, input_ids.size(-1))
flat_token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))
@@ -224,11 +208,17 @@ class BertForTokenClassification(BaseModel):
self.bert = BertModel.from_pretrained(bert_dir)
else:
if config is None:
- config = BertConfig()
- self.bert = BertModel(**config.__dict__)
+ config = BertConfig(30522)
+ self.bert = BertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, num_labels)
+ @classmethod
+ def from_pretrained(cls, num_labels, pretrained_model_dir):
+ config = BertConfig(pretrained_model_dir)
+ model = cls(num_labels=num_labels, config=config, bert_dir=pretrained_model_dir)
+ return model
+
def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
sequence_output = self.dropout(sequence_output)
@@ -302,12 +292,18 @@ class BertForQuestionAnswering(BaseModel):
self.bert = BertModel.from_pretrained(bert_dir)
else:
if config is None:
- config = BertConfig()
- self.bert = BertModel(**config.__dict__)
+ config = BertConfig(30522)
+ self.bert = BertModel(config)
# TODO check with Google if it's normal there is no dropout on the token classifier of SQuAD in the TF version
# self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.qa_outputs = nn.Linear(config.hidden_size, 2)
+ @classmethod
+ def from_pretrained(cls, pretrained_model_dir):
+ config = BertConfig(pretrained_model_dir)
+ model = cls(config=config, bert_dir=pretrained_model_dir)
+ return model
+
def forward(self, input_ids, token_type_ids=None, attention_mask=None, start_positions=None, end_positions=None):
sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
logits = self.qa_outputs(sequence_output)
diff --git a/fastNLP/models/biaffine_parser.py b/fastNLP/models/biaffine_parser.py
index 8533e7af..29487864 100644
--- a/fastNLP/models/biaffine_parser.py
+++ b/fastNLP/models/biaffine_parser.py
@@ -20,7 +20,7 @@ from ..modules.dropout import TimestepDropout
from ..modules.encoder.transformer import TransformerEncoder
from ..modules.encoder.variational_rnn import VarLSTM
from ..modules.utils import initial_parameter
-from ..modules.utils import get_embeddings
+from ..embeddings.utils import get_embeddings
from .base_model import BaseModel
from ..core.utils import seq_len_to_mask
@@ -130,6 +130,8 @@ def _find_cycle(vertices, edges):
class GraphParser(BaseModel):
"""
+    别名::class:`fastNLP.models.GraphParser` :class:`fastNLP.models.biaffine_parser.GraphParser`
+
基于图的parser base class, 支持贪婪解码和最大生成树解码
"""
diff --git a/fastNLP/models/cnn_text_classification.py b/fastNLP/models/cnn_text_classification.py
index 3a71a80a..e00a0697 100644
--- a/fastNLP/models/cnn_text_classification.py
+++ b/fastNLP/models/cnn_text_classification.py
@@ -6,7 +6,9 @@ import torch
import torch.nn as nn
from ..core.const import Const as C
+from ..core.utils import seq_len_to_mask
from ..modules import encoder
+from ..embeddings import embedding
class CNNText(torch.nn.Module):
@@ -21,28 +23,25 @@ class CNNText(torch.nn.Module):
:param int num_classes: 一共有多少类
:param int,tuple(int) out_channels: 输出channel的数量。如果为list,则需要与kernel_sizes的数量保持一致
:param int,tuple(int) kernel_sizes: 输出channel的kernel大小。
- :param int padding: 对句子前后的pad的大小, 用0填充。
:param float dropout: Dropout的大小
"""
-
+
def __init__(self, init_embed,
num_classes,
- kernel_nums=(3, 4, 5),
- kernel_sizes=(3, 4, 5),
- padding=0,
+ kernel_nums=(30, 40, 50),
+ kernel_sizes=(1, 3, 5),
dropout=0.5):
super(CNNText, self).__init__()
-
+
# no support for pre-trained embedding currently
- self.embed = encoder.Embedding(init_embed)
+ self.embed = embedding.Embedding(init_embed)
self.conv_pool = encoder.ConvMaxpool(
in_channels=self.embed.embedding_dim,
out_channels=kernel_nums,
- kernel_sizes=kernel_sizes,
- padding=padding)
+ kernel_sizes=kernel_sizes)
self.dropout = nn.Dropout(dropout)
self.fc = nn.Linear(sum(kernel_nums), num_classes)
-
+
def forward(self, words, seq_len=None):
"""
@@ -51,11 +50,15 @@ class CNNText(torch.nn.Module):
:return output: dict of torch.LongTensor, [batch_size, num_classes]
"""
x = self.embed(words) # [N,L] -> [N,L,C]
- x = self.conv_pool(x) # [N,L,C] -> [N,C]
+ if seq_len is not None:
+ mask = seq_len_to_mask(seq_len)
+ x = self.conv_pool(x, mask)
+ else:
+ x = self.conv_pool(x) # [N,L,C] -> [N,C]
x = self.dropout(x)
x = self.fc(x) # [N,C] -> [N, N_class]
return {C.OUTPUT: x}
-
+
def predict(self, words, seq_len=None):
"""
:param torch.LongTensor words: [batch_size, seq_len],句子中word的index
diff --git a/fastNLP/models/enas_trainer.py b/fastNLP/models/enas_trainer.py
index ef596b03..7abcc45f 100644
--- a/fastNLP/models/enas_trainer.py
+++ b/fastNLP/models/enas_trainer.py
@@ -14,7 +14,7 @@ except:
from ..core.utils import _pseudo_tqdm as tqdm
from ..core.trainer import Trainer
-from ..core.batch import Batch
+from ..core.batch import DataSetIter
from ..core.callback import CallbackManager, CallbackException
from ..core.dataset import DataSet
from ..core.utils import _move_dict_value_to_device
@@ -124,8 +124,8 @@ class ENASTrainer(Trainer):
len(self.train_data) % self.batch_size != 0)) * self.n_epochs
with inner_tqdm(total=total_steps, postfix='loss:{0:<6.5f}', leave=False, dynamic_ncols=True) as pbar:
avg_loss = 0
- data_iterator = Batch(self.train_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False,
- prefetch=self.prefetch)
+ data_iterator = DataSetIter(self.train_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False,
+ prefetch=self.prefetch)
for epoch in range(1, self.n_epochs + 1):
pbar.set_description_str(desc="Epoch {}/{}".format(epoch, self.n_epochs))
last_stage = (epoch > self.n_epochs + 1 - self.final_epochs)
@@ -209,8 +209,8 @@ class ENASTrainer(Trainer):
total_loss = 0
train_idx = 0
avg_loss = 0
- data_iterator = Batch(self.train_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False,
- prefetch=self.prefetch)
+ data_iterator = DataSetIter(self.train_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False,
+ prefetch=self.prefetch)
for batch_x, batch_y in data_iterator:
_move_dict_value_to_device(batch_x, batch_y, device=self._model_device)
@@ -262,8 +262,8 @@ class ENASTrainer(Trainer):
if not isinstance(entropies, np.ndarray):
entropies = entropies.data.cpu().numpy()
- data_iterator = Batch(self.dev_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False,
- prefetch=self.prefetch)
+ data_iterator = DataSetIter(self.dev_data, batch_size=self.batch_size, sampler=self.sampler, as_numpy=False,
+ prefetch=self.prefetch)
for inputs, targets in data_iterator:
valid_loss, hidden, _ = self.get_loss(inputs, targets, hidden, dag)
diff --git a/fastNLP/models/sequence_labeling.py b/fastNLP/models/sequence_labeling.py
index 8e6a5db1..4bf3f95f 100644
--- a/fastNLP/models/sequence_labeling.py
+++ b/fastNLP/models/sequence_labeling.py
@@ -1,19 +1,82 @@
"""
- 本模块实现了两种序列标注模型
+ 本模块实现了几种序列标注模型
"""
__all__ = [
"SeqLabeling",
- "AdvSeqLabel"
+ "AdvSeqLabel",
+ # "BiLSTMCRF"
]
import torch
import torch.nn as nn
+import torch.nn.functional as F
from .base_model import BaseModel
+from ..embeddings import embedding
from ..modules import decoder, encoder
from ..modules.decoder.crf import allowed_transitions
from ..core.utils import seq_len_to_mask
from ..core.const import Const as C
+from ..modules import LSTM
+from ..embeddings import get_embeddings
+from ..modules import ConditionalRandomField
+
+
+class BiLSTMCRF(BaseModel):
+ """
+ 结构为BiLSTM + FC + Dropout + CRF.
+
+ .. todo::
+ 继续补充文档
+
+    :param embed: 支持(1)fastNLP的各种Embedding; (2) tuple, 指明(vocab_size, embed_dim), 将随机初始化一个Embedding
+    :param num_classes: 标签(类别)的数量
+    :param num_layers: BiLSTM的层数
+    :param hidden_size: BiLSTM每个方向的隐层大小
+    :param dropout: dropout的概率
+    :param target_vocab: 标签的 :class:`~fastNLP.Vocabulary` ,与encoding_type一起用于生成CRF的转移约束
+    :param encoding_type: 标签的编码方式,如'bio'、'bmes'等;为 ``None`` 时不对CRF的转移加以约束
+    """
+ def __init__(self, embed, num_classes, num_layers=1, hidden_size=100, dropout=0.5,
+ target_vocab=None, encoding_type=None):
+ super().__init__()
+ self.embed = get_embeddings(embed)
+
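+        # the LSTM below is bidirectional, so its output feature size is hidden_size * 2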
+        if num_layers > 1:
+            self.lstm = LSTM(self.embed.embedding_dim, num_layers=num_layers, hidden_size=hidden_size, bidirectional=True,
+                             batch_first=True, dropout=dropout)
+        else:
+            self.lstm = LSTM(self.embed.embedding_dim, num_layers=num_layers, hidden_size=hidden_size, bidirectional=True,
+                             batch_first=True)
+
+ self.dropout = nn.Dropout(dropout)
+        self.fc = nn.Linear(hidden_size * 2, num_classes)
+
+ trans = None
+ if target_vocab is not None and encoding_type is not None:
+ trans = allowed_transitions(target_vocab.idx2word, encoding_type=encoding_type, include_start_end=True)
+
+ self.crf = ConditionalRandomField(num_classes, include_start_end_trans=True, allowed_transitions=trans)
+
+ def _forward(self, words, seq_len=None, target=None):
+ words = self.embed(words)
+        feats, _ = self.lstm(words, seq_len=seq_len)
+ feats = self.fc(feats)
+ feats = self.dropout(feats)
+ logits = F.log_softmax(feats, dim=-1)
+ mask = seq_len_to_mask(seq_len)
+ if target is None:
+ pred, _ = self.crf.viterbi_decode(logits, mask)
+ return {C.OUTPUT:pred}
+ else:
+ loss = self.crf(logits, target, mask).mean()
+ return {C.LOSS:loss}
+
+ def forward(self, words, seq_len, target):
+ return self._forward(words, seq_len, target)
+
+ def predict(self, words, seq_len):
+ return self._forward(words, seq_len)
class SeqLabeling(BaseModel):
@@ -32,10 +95,10 @@ class SeqLabeling(BaseModel):
def __init__(self, init_embed, hidden_size, num_classes):
super(SeqLabeling, self).__init__()
- self.Embedding = encoder.embedding.Embedding(init_embed)
- self.Rnn = encoder.lstm.LSTM(self.Embedding.embedding_dim, hidden_size)
+ self.Embedding = embedding.Embedding(init_embed)
+ self.Rnn = encoder.LSTM(self.Embedding.embedding_dim, hidden_size)
self.Linear = nn.Linear(hidden_size, num_classes)
- self.Crf = decoder.crf.ConditionalRandomField(num_classes)
+ self.Crf = decoder.ConditionalRandomField(num_classes)
self.mask = None
def forward(self, words, seq_len, target):
@@ -129,7 +192,7 @@ class AdvSeqLabel(nn.Module):
super().__init__()
- self.Embedding = encoder.embedding.Embedding(init_embed)
+ self.Embedding = embedding.Embedding(init_embed)
self.norm1 = torch.nn.LayerNorm(self.Embedding.embedding_dim)
self.Rnn = encoder.LSTM(input_size=self.Embedding.embedding_dim, hidden_size=hidden_size, num_layers=2,
dropout=dropout,
diff --git a/fastNLP/models/snli.py b/fastNLP/models/snli.py
index 395a9bbf..8e35b6bc 100644
--- a/fastNLP/models/snli.py
+++ b/fastNLP/models/snli.py
@@ -4,149 +4,211 @@ __all__ = [
import torch
import torch.nn as nn
+import torch.nn.functional as F
+
+from torch.nn import CrossEntropyLoss
from .base_model import BaseModel
+from ..embeddings.embedding import TokenEmbedding
from ..core.const import Const
-from ..modules import decoder as Decoder
-from ..modules import encoder as Encoder
-from ..modules import aggregator as Aggregator
from ..core.utils import seq_len_to_mask
-my_inf = 10e12
-
class ESIM(BaseModel):
"""
别名::class:`fastNLP.models.ESIM` :class:`fastNLP.models.snli.ESIM`
- ESIM模型的一个PyTorch实现。
- ESIM模型的论文: Enhanced LSTM for Natural Language Inference (arXiv: 1609.06038)
+ ESIM model的一个PyTorch实现
+ 论文参见: https://arxiv.org/pdf/1609.06038.pdf
- :param int vocab_size: 词表大小
- :param int embed_dim: 词嵌入维度
- :param int hidden_size: LSTM隐层大小
- :param float dropout: dropout大小,默认为0
- :param int num_classes: 标签数目,默认为3
- :param numpy.array init_embedding: 初始词嵌入矩阵,形状为(vocab_size, embed_dim),默认为None,即随机初始化词嵌入矩阵
+ :param fastNLP.TokenEmbedding init_embedding: 初始化的TokenEmbedding
+ :param int hidden_size: 隐藏层大小,默认值为Embedding的维度
+ :param int num_labels: 目标标签种类数量,默认值为3
+ :param float dropout_rate: dropout的比率,默认值为0.3
+ :param float dropout_embed: 对Embedding的dropout比率,默认值为0.1
"""
-
- def __init__(self, vocab_size, embed_dim, hidden_size, dropout=0.0, num_classes=3, init_embedding=None):
-
- super(ESIM, self).__init__()
- self.vocab_size = vocab_size
- self.embed_dim = embed_dim
- self.hidden_size = hidden_size
- self.dropout = dropout
- self.n_labels = num_classes
-
- self.drop = nn.Dropout(self.dropout)
-
- self.embedding = Encoder.Embedding(
- (self.vocab_size, self.embed_dim), dropout=self.dropout,
- )
-
- self.embedding_layer = nn.Linear(self.embed_dim, self.hidden_size)
-
- self.encoder = Encoder.LSTM(
- input_size=self.embed_dim, hidden_size=self.hidden_size, num_layers=1, bias=True,
- batch_first=True, bidirectional=True
- )
-
- self.bi_attention = Aggregator.BiAttention()
- self.mean_pooling = Aggregator.AvgPoolWithMask()
- self.max_pooling = Aggregator.MaxPoolWithMask()
-
- self.inference_layer = nn.Linear(self.hidden_size * 4, self.hidden_size)
-
- self.decoder = Encoder.LSTM(
- input_size=self.hidden_size, hidden_size=self.hidden_size, num_layers=1, bias=True,
- batch_first=True, bidirectional=True
- )
-
- self.output = Decoder.MLP([4 * self.hidden_size, self.hidden_size, self.n_labels], 'tanh', dropout=self.dropout)
-
- def forward(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
- """ Forward function
-
- :param torch.Tensor words1: [batch size(B), premise seq len(PL)] premise的token表示
- :param torch.Tensor words2: [B, hypothesis seq len(HL)] hypothesis的token表示
- :param torch.LongTensor seq_len1: [B] premise的长度
- :param torch.LongTensor seq_len2: [B] hypothesis的长度
- :param torch.LongTensor target: [B] 真实目标值
- :return: dict prediction: [B, n_labels(N)] 预测结果
- """
-
- premise0 = self.embedding_layer(self.embedding(words1))
- hypothesis0 = self.embedding_layer(self.embedding(words2))
-
- if seq_len1 is not None:
- seq_len1 = seq_len_to_mask(seq_len1)
- else:
- seq_len1 = torch.ones(premise0.size(0), premise0.size(1))
- seq_len1 = (seq_len1.long()).to(device=premise0.device)
- if seq_len2 is not None:
- seq_len2 = seq_len_to_mask(seq_len2)
- else:
- seq_len2 = torch.ones(hypothesis0.size(0), hypothesis0.size(1))
- seq_len2 = (seq_len2.long()).to(device=hypothesis0.device)
-
- _BP, _PSL, _HP = premise0.size()
- _BH, _HSL, _HH = hypothesis0.size()
- _BPL, _PLL = seq_len1.size()
- _HPL, _HLL = seq_len2.size()
-
- assert _BP == _BH and _BPL == _HPL and _BP == _BPL
- assert _HP == _HH
- assert _PSL == _PLL and _HSL == _HLL
-
- B, PL, H = premise0.size()
- B, HL, H = hypothesis0.size()
-
- a0 = self.encoder(self.drop(premise0)) # a0: [B, PL, H * 2]
- b0 = self.encoder(self.drop(hypothesis0)) # b0: [B, HL, H * 2]
-
- a = torch.mean(a0.view(B, PL, -1, H), dim=2) # a: [B, PL, H]
- b = torch.mean(b0.view(B, HL, -1, H), dim=2) # b: [B, HL, H]
-
- ai, bi = self.bi_attention(a, b, seq_len1, seq_len2)
-
- ma = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 4 * H]
- mb = torch.cat((b, bi, b - bi, b * bi), dim=2) # mb: [B, HL, 4 * H]
-
- f_ma = self.inference_layer(ma)
- f_mb = self.inference_layer(mb)
-
- vat = self.decoder(self.drop(f_ma))
- vbt = self.decoder(self.drop(f_mb))
-
- va = torch.mean(vat.view(B, PL, -1, H), dim=2) # va: [B, PL, H]
- vb = torch.mean(vbt.view(B, HL, -1, H), dim=2) # vb: [B, HL, H]
-
- va_ave = self.mean_pooling(va, seq_len1, dim=1) # va_ave: [B, H]
- va_max, va_arg_max = self.max_pooling(va, seq_len1, dim=1) # va_max: [B, H]
- vb_ave = self.mean_pooling(vb, seq_len2, dim=1) # vb_ave: [B, H]
- vb_max, vb_arg_max = self.max_pooling(vb, seq_len2, dim=1) # vb_max: [B, H]
-
- v = torch.cat((va_ave, va_max, vb_ave, vb_max), dim=1) # v: [B, 4 * H]
-
- prediction = torch.tanh(self.output(v)) # prediction: [B, N]
-
- if target is not None:
- func = nn.CrossEntropyLoss()
- loss = func(prediction, target)
- return {Const.OUTPUT: prediction, Const.LOSS: loss}
-
- return {Const.OUTPUT: prediction}
-
- def predict(self, words1, words2, seq_len1=None, seq_len2=None, target=None):
- """ Predict function
- :param torch.Tensor words1: [batch size(B), premise seq len(PL)] premise的token表示
- :param torch.Tensor words2: [B, hypothesis seq len(HL)] hypothesis的token表示
- :param torch.LongTensor seq_len1: [B] premise的长度
- :param torch.LongTensor seq_len2: [B] hypothesis的长度
- :param torch.LongTensor target: [B] 真实目标值
- :return: dict prediction: [B, n_labels(N)] 预测结果
+ def __init__(self, init_embedding: TokenEmbedding, hidden_size=None, num_labels=3, dropout_rate=0.3,
+ dropout_embed=0.1):
+ super(ESIM, self).__init__()
+
+ self.embedding = init_embedding
+ self.dropout_embed = EmbedDropout(p=dropout_embed)
+ if hidden_size is None:
+ hidden_size = self.embedding.embed_size
+ self.rnn = BiRNN(self.embedding.embed_size, hidden_size, dropout_rate=dropout_rate)
+ # self.rnn = LSTM(self.embedding.embed_size, hidden_size, dropout=dropout_rate, bidirectional=True)
+
+ self.interfere = nn.Sequential(nn.Dropout(p=dropout_rate),
+ nn.Linear(8 * hidden_size, hidden_size),
+ nn.ReLU())
+ nn.init.xavier_uniform_(self.interfere[1].weight.data)
+ self.bi_attention = SoftmaxAttention()
+
+        self.rnn_high = BiRNN(hidden_size, hidden_size, dropout_rate=dropout_rate)
+ # self.rnn_high = LSTM(hidden_size, hidden_size, dropout=dropout_rate, bidirectional=True,)
+
+ self.classifier = nn.Sequential(nn.Dropout(p=dropout_rate),
+ nn.Linear(8 * hidden_size, hidden_size),
+ nn.Tanh(),
+ nn.Dropout(p=dropout_rate),
+ nn.Linear(hidden_size, num_labels))
+
+ self.dropout_rnn = nn.Dropout(p=dropout_rate)
+
+ nn.init.xavier_uniform_(self.classifier[1].weight.data)
+ nn.init.xavier_uniform_(self.classifier[4].weight.data)
+
+ def forward(self, words1, words2, seq_len1, seq_len2, target=None):
"""
- prediction = self.forward(words1, words2, seq_len1, seq_len2)[Const.OUTPUT]
- return {Const.OUTPUT: torch.argmax(prediction, dim=-1)}
+        :param words1: [batch, seq_len] premise的token表示
+        :param words2: [batch, seq_len] hypothesis的token表示
+        :param seq_len1: [batch] premise的长度
+        :param seq_len2: [batch] hypothesis的长度
+        :param target: [batch] 真实目标值,为None时不计算loss
+        :return: dict,包含Const.OUTPUT对应的预测logits;当target不为None时额外包含Const.LOSS
+ """
+ mask1 = seq_len_to_mask(seq_len1, words1.size(1))
+ mask2 = seq_len_to_mask(seq_len2, words2.size(1))
+ a0 = self.embedding(words1) # B * len * emb_dim
+ b0 = self.embedding(words2)
+ a0, b0 = self.dropout_embed(a0), self.dropout_embed(b0)
+ a = self.rnn(a0, mask1.byte()) # a: [B, PL, 2 * H]
+ b = self.rnn(b0, mask2.byte())
+ # a = self.dropout_rnn(self.rnn(a0, seq_len1)[0]) # a: [B, PL, 2 * H]
+ # b = self.dropout_rnn(self.rnn(b0, seq_len2)[0])
+
+ ai, bi = self.bi_attention(a, mask1, b, mask2)
+
+ a_ = torch.cat((a, ai, a - ai, a * ai), dim=2) # ma: [B, PL, 8 * H]
+ b_ = torch.cat((b, bi, b - bi, b * bi), dim=2)
+ a_f = self.interfere(a_)
+ b_f = self.interfere(b_)
+
+ a_h = self.rnn_high(a_f, mask1.byte()) # ma: [B, PL, 2 * H]
+ b_h = self.rnn_high(b_f, mask2.byte())
+ # a_h = self.dropout_rnn(self.rnn_high(a_f, seq_len1)[0]) # ma: [B, PL, 2 * H]
+ # b_h = self.dropout_rnn(self.rnn_high(b_f, seq_len2)[0])
+
+ a_avg = self.mean_pooling(a_h, mask1, dim=1)
+ a_max, _ = self.max_pooling(a_h, mask1, dim=1)
+ b_avg = self.mean_pooling(b_h, mask2, dim=1)
+ b_max, _ = self.max_pooling(b_h, mask2, dim=1)
+
+ out = torch.cat((a_avg, a_max, b_avg, b_max), dim=1) # v: [B, 8 * H]
+ logits = torch.tanh(self.classifier(out))
+
+ if target is not None:
+ loss_fct = CrossEntropyLoss()
+ loss = loss_fct(logits, target)
+
+ return {Const.LOSS: loss, Const.OUTPUT: logits}
+ else:
+ return {Const.OUTPUT: logits}
+
+ def predict(self, **kwargs):
+ pred = self.forward(**kwargs)[Const.OUTPUT].argmax(-1)
+ return {Const.OUTPUT: pred}
+
+ # input [batch_size, len , hidden]
+ # mask [batch_size, len] (111...00)
+ @staticmethod
+ def mean_pooling(input, mask, dim=1):
+ masks = mask.view(mask.size(0), mask.size(1), -1).float()
+ return torch.sum(input * masks, dim=dim) / torch.sum(masks, dim=1)
+
+ @staticmethod
+ def max_pooling(input, mask, dim=1):
+ my_inf = 10e12
+ masks = mask.view(mask.size(0), mask.size(1), -1)
+ masks = masks.expand(-1, -1, input.size(2)).float()
+ return torch.max(input + masks.le(0.5).float() * -my_inf, dim=dim)
+
+
+class EmbedDropout(nn.Dropout):
+
+ def forward(self, sequences_batch):
+ ones = sequences_batch.data.new_ones(sequences_batch.shape[0], sequences_batch.shape[-1])
+ dropout_mask = nn.functional.dropout(ones, self.p, self.training, inplace=False)
+ return dropout_mask.unsqueeze(1) * sequences_batch
+
+
+class BiRNN(nn.Module):
+ def __init__(self, input_size, hidden_size, dropout_rate=0.3):
+ super(BiRNN, self).__init__()
+ self.dropout_rate = dropout_rate
+ self.rnn = nn.LSTM(input_size, hidden_size,
+ num_layers=1,
+ bidirectional=True,
+ batch_first=True)
+
+ def forward(self, x, x_mask):
+ # Sort x
+ lengths = x_mask.data.eq(1).long().sum(1)
+ _, idx_sort = torch.sort(lengths, dim=0, descending=True)
+ _, idx_unsort = torch.sort(idx_sort, dim=0)
+ lengths = list(lengths[idx_sort])
+
+ x = x.index_select(0, idx_sort)
+ # Pack it up
+ rnn_input = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True)
+ # Apply dropout to input
+ if self.dropout_rate > 0:
+ dropout_input = F.dropout(rnn_input.data, p=self.dropout_rate, training=self.training)
+ rnn_input = nn.utils.rnn.PackedSequence(dropout_input, rnn_input.batch_sizes)
+ output = self.rnn(rnn_input)[0]
+ # Unpack everything
+ output = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)[0]
+ output = output.index_select(0, idx_unsort)
+ if output.size(1) != x_mask.size(1):
+ padding = torch.zeros(output.size(0),
+ x_mask.size(1) - output.size(1),
+ output.size(2)).type(output.data.type())
+ output = torch.cat([output, padding], 1)
+ return output
+
+
+def masked_softmax(tensor, mask):
+ tensor_shape = tensor.size()
+ reshaped_tensor = tensor.view(-1, tensor_shape[-1])
+
+ # Reshape the mask so it matches the size of the input tensor.
+ while mask.dim() < tensor.dim():
+ mask = mask.unsqueeze(1)
+ mask = mask.expand_as(tensor).contiguous().float()
+ reshaped_mask = mask.view(-1, mask.size()[-1])
+ result = F.softmax(reshaped_tensor * reshaped_mask, dim=-1)
+ result = result * reshaped_mask
+ # 1e-13 is added to avoid divisions by zero.
+ result = result / (result.sum(dim=-1, keepdim=True) + 1e-13)
+ return result.view(*tensor_shape)
+
+
+def weighted_sum(tensor, weights, mask):
+ w_sum = weights.bmm(tensor)
+ while mask.dim() < w_sum.dim():
+ mask = mask.unsqueeze(1)
+ mask = mask.transpose(-1, -2)
+ mask = mask.expand_as(w_sum).contiguous().float()
+ return w_sum * mask
+
+
+class SoftmaxAttention(nn.Module):
+
+ def forward(self, premise_batch, premise_mask, hypothesis_batch, hypothesis_mask):
+ similarity_matrix = premise_batch.bmm(hypothesis_batch.transpose(2, 1)
+ .contiguous())
+
+ prem_hyp_attn = masked_softmax(similarity_matrix, hypothesis_mask)
+ hyp_prem_attn = masked_softmax(similarity_matrix.transpose(1, 2)
+ .contiguous(),
+ premise_mask)
+
+ attended_premises = weighted_sum(hypothesis_batch,
+ prem_hyp_attn,
+ premise_mask)
+ attended_hypotheses = weighted_sum(premise_batch,
+ hyp_prem_attn,
+ hypothesis_mask)
+
+ return attended_premises, attended_hypotheses
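
The bi-attention step used by the rewritten ESIM (masked_softmax followed by weighted_sum) can be exercised in isolation. The sketch below is a self-contained toy example with random tensors and made-up shapes; it mirrors the multiply-then-renormalise masking defined above rather than calling the fastNLP classes directly.

```python
import torch
import torch.nn.functional as F

def toy_masked_softmax(scores, key_mask):
    # Same idea as masked_softmax above: zero padded keys, softmax, then re-normalise.
    mask = key_mask.unsqueeze(1).expand_as(scores).float()
    probs = F.softmax(scores * mask, dim=-1) * mask
    return probs / (probs.sum(dim=-1, keepdim=True) + 1e-13)

premise = torch.randn(2, 4, 8)        # [batch, premise_len, hidden]
hypothesis = torch.randn(2, 5, 8)     # [batch, hypothesis_len, hidden]
hyp_mask = torch.tensor([[1., 1., 1., 0., 0.],
                         [1., 1., 1., 1., 1.]])  # 1 = real token, 0 = padding

scores = premise.bmm(hypothesis.transpose(1, 2))   # [batch, 4, 5] similarity matrix
attn = toy_masked_softmax(scores, hyp_mask)        # padded hypothesis tokens get zero weight
attended_premise = attn.bmm(hypothesis)            # [batch, 4, 8]; weighted_sum above also re-masks the query side
print(attn[0, 0])                                  # the last two entries are exactly zero
```
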
diff --git a/fastNLP/models/star_transformer.py b/fastNLP/models/star_transformer.py
index 4c944a54..b95d1c25 100644
--- a/fastNLP/models/star_transformer.py
+++ b/fastNLP/models/star_transformer.py
@@ -13,7 +13,7 @@ from torch import nn
from ..modules.encoder.star_transformer import StarTransformer
from ..core.utils import seq_len_to_mask
-from ..modules.utils import get_embeddings
+from ..embeddings.utils import get_embeddings
from ..core.const import Const
@@ -34,7 +34,7 @@ class StarTransEnc(nn.Module):
:param emb_dropout: 词嵌入的dropout概率.
:param dropout: 模型除词嵌入外的dropout概率.
"""
-
+
def __init__(self, init_embed,
hidden_size,
num_layers,
@@ -47,14 +47,14 @@ class StarTransEnc(nn.Module):
self.embedding = get_embeddings(init_embed)
emb_dim = self.embedding.embedding_dim
self.emb_fc = nn.Linear(emb_dim, hidden_size)
- self.emb_drop = nn.Dropout(emb_dropout)
+ # self.emb_drop = nn.Dropout(emb_dropout)
self.encoder = StarTransformer(hidden_size=hidden_size,
num_layers=num_layers,
num_head=num_head,
head_dim=head_dim,
dropout=dropout,
max_len=max_len)
-
+
def forward(self, x, mask):
"""
:param FloatTensor x: [batch, length, hidden] 输入的序列
@@ -65,7 +65,7 @@ class StarTransEnc(nn.Module):
[batch, hidden] 全局 relay 节点, 详见论文
"""
x = self.embedding(x)
- x = self.emb_fc(self.emb_drop(x))
+ x = self.emb_fc(x)
nodes, relay = self.encoder(x, mask)
return nodes, relay
@@ -79,7 +79,7 @@ class _Cls(nn.Module):
nn.Dropout(dropout),
nn.Linear(hid_dim, num_cls),
)
-
+
def forward(self, x):
h = self.fc(x)
return h
@@ -95,7 +95,7 @@ class _NLICls(nn.Module):
nn.Dropout(dropout),
nn.Linear(hid_dim, num_cls),
)
-
+
def forward(self, x1, x2):
x = torch.cat([x1, x2, torch.abs(x1 - x2), x1 * x2], 1)
h = self.fc(x)
@@ -121,7 +121,7 @@ class STSeqLabel(nn.Module):
:param emb_dropout: 词嵌入的dropout概率. Default: 0.1
:param dropout: 模型除词嵌入外的dropout概率. Default: 0.1
"""
-
+
def __init__(self, init_embed, num_cls,
hidden_size=300,
num_layers=4,
@@ -141,7 +141,7 @@ class STSeqLabel(nn.Module):
emb_dropout=emb_dropout,
dropout=dropout)
self.cls = _Cls(hidden_size, num_cls, cls_hidden_size)
-
+
def forward(self, words, seq_len):
"""
@@ -154,7 +154,7 @@ class STSeqLabel(nn.Module):
output = self.cls(nodes)
output = output.transpose(1, 2) # make hidden to be dim 1
return {Const.OUTPUT: output} # [bsz, n_cls, seq_len]
-
+
def predict(self, words, seq_len):
"""
@@ -186,7 +186,7 @@ class STSeqCls(nn.Module):
:param emb_dropout: 词嵌入的dropout概率. Default: 0.1
:param dropout: 模型除词嵌入外的dropout概率. Default: 0.1
"""
-
+
def __init__(self, init_embed, num_cls,
hidden_size=300,
num_layers=4,
@@ -205,8 +205,8 @@ class STSeqCls(nn.Module):
max_len=max_len,
emb_dropout=emb_dropout,
dropout=dropout)
- self.cls = _Cls(hidden_size, num_cls, cls_hidden_size)
-
+ self.cls = _Cls(hidden_size, num_cls, cls_hidden_size, dropout=dropout)
+
def forward(self, words, seq_len):
"""
@@ -219,7 +219,7 @@ class STSeqCls(nn.Module):
y = 0.5 * (relay + nodes.max(1)[0])
output = self.cls(y) # [bsz, n_cls]
return {Const.OUTPUT: output}
-
+
def predict(self, words, seq_len):
"""
@@ -251,7 +251,7 @@ class STNLICls(nn.Module):
:param emb_dropout: 词嵌入的dropout概率. Default: 0.1
:param dropout: 模型除词嵌入外的dropout概率. Default: 0.1
"""
-
+
def __init__(self, init_embed, num_cls,
hidden_size=300,
num_layers=4,
@@ -271,7 +271,7 @@ class STNLICls(nn.Module):
emb_dropout=emb_dropout,
dropout=dropout)
self.cls = _NLICls(hidden_size, num_cls, cls_hidden_size)
-
+
def forward(self, words1, words2, seq_len1, seq_len2):
"""
@@ -283,16 +283,16 @@ class STNLICls(nn.Module):
"""
mask1 = seq_len_to_mask(seq_len1)
mask2 = seq_len_to_mask(seq_len2)
-
+
def enc(seq, mask):
nodes, relay = self.enc(seq, mask)
return 0.5 * (relay + nodes.max(1)[0])
-
+
y1 = enc(words1, mask1)
y2 = enc(words2, mask2)
output = self.cls(y1, y2) # [bsz, n_cls]
return {Const.OUTPUT: output}
-
+
def predict(self, words1, words2, seq_len1, seq_len2):
"""
diff --git a/fastNLP/modules/__init__.py b/fastNLP/modules/__init__.py
index 194fda4e..7959e454 100644
--- a/fastNLP/modules/__init__.py
+++ b/fastNLP/modules/__init__.py
@@ -1,56 +1,56 @@
"""
-大部分用于的 NLP 任务神经网络都可以看做由编码 :mod:`~fastNLP.modules.encoder` 、
-聚合 :mod:`~fastNLP.modules.aggregator` 、解码 :mod:`~fastNLP.modules.decoder` 三种模块组成。
.. image:: figures/text_classification.png
-:mod:`~fastNLP.modules` 中实现了 fastNLP 提供的诸多模块组件,可以帮助用户快速搭建自己所需的网络。
-三种模块的功能和常见组件如下:
+大部分用于 NLP 任务的神经网络都可以看做由 :mod:`embedding` 、 :mod:`~fastNLP.modules.encoder` 、
+:mod:`~fastNLP.modules.decoder` 三种模块组成。 本模块中实现了 fastNLP 提供的诸多模块组件,
+可以帮助用户快速搭建自己所需的网络。几种模块的功能和常见组件如下:
+
+.. csv-table::
+ :header: "类型", "功能", "常见组件"
+
+ "embedding", 参见 :doc:`/fastNLP.embeddings` , "Elmo, Bert"
+ "encoder", "将输入编码为具有表示能力的向量", "CNN, LSTM, Transformer"
+ "decoder", "将具有某种表示意义的向量解码为需要的输出形式 ", "MLP, CRF"
+ "其它", "配合其它组件使用的组件", "Dropout"
-+-----------------------+-----------------------+-----------------------+
-| module type | functionality | example |
-+=======================+=======================+=======================+
-| encoder | 将输入编码为具有具 | embedding, RNN, CNN, |
-| | 有表示能力的向量 | transformer |
-+-----------------------+-----------------------+-----------------------+
-| aggregator | 从多个向量中聚合信息 | self-attention, |
-| | | max-pooling |
-+-----------------------+-----------------------+-----------------------+
-| decoder | 将具有某种表示意义的 | MLP, CRF |
-| | 向量解码为需要的输出 | |
-| | 形式 | |
-+-----------------------+-----------------------+-----------------------+
"""
__all__ = [
# "BertModel",
+
"ConvolutionCharEncoder",
"LSTMCharEncoder",
+
"ConvMaxpool",
- "Embedding",
+
"LSTM",
+
"StarTransformer",
+
"TransformerEncoder",
+
"VarRNN",
"VarLSTM",
"VarGRU",
-
+
"MaxPool",
"MaxPoolWithMask",
"AvgPool",
+ "AvgPoolWithMask",
+
"MultiHeadAttention",
-
+
"MLP",
"ConditionalRandomField",
"viterbi_decode",
"allowed_transitions",
+
+ "TimestepDropout",
]
-from . import aggregator
from . import decoder
from . import encoder
-from .aggregator import *
from .decoder import *
from .dropout import TimestepDropout
from .encoder import *
-from .utils import get_embeddings
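
With the aggregator package folded into encoder, the components listed in the new `__all__` are re-exported from the package root. Assuming the branch this diff targets, imports along these lines should resolve:

```python
# Everything named in the new __all__ above is importable from fastNLP.modules directly.
from fastNLP.modules import LSTM, StarTransformer, TransformerEncoder
from fastNLP.modules import MaxPoolWithMask, AvgPoolWithMask, MultiHeadAttention
from fastNLP.modules import MLP, ConditionalRandomField, viterbi_decode, TimestepDropout
```
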
diff --git a/fastNLP/modules/aggregator/__init__.py b/fastNLP/modules/aggregator/__init__.py
deleted file mode 100644
index a82138e7..00000000
--- a/fastNLP/modules/aggregator/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-__all__ = [
- "MaxPool",
- "MaxPoolWithMask",
- "AvgPool",
-
- "MultiHeadAttention",
-]
-
-from .pooling import MaxPool
-from .pooling import MaxPoolWithMask
-from .pooling import AvgPool
-from .pooling import AvgPoolWithMask
-
-from .attention import MultiHeadAttention
diff --git a/fastNLP/modules/decoder/crf.py b/fastNLP/modules/decoder/crf.py
index beb2b9be..7c496868 100644
--- a/fastNLP/modules/decoder/crf.py
+++ b/fastNLP/modules/decoder/crf.py
@@ -9,15 +9,15 @@ from torch import nn
from ..utils import initial_parameter
-def allowed_transitions(id2target, encoding_type='bio', include_start_end=True):
+def allowed_transitions(id2target, encoding_type='bio', include_start_end=False):
"""
- 别名::class:`fastNLP.modules.allowed_transitions` :class:`fastNLP.modules.decoder.crf.allowed_transitions`
+ 别名::class:`fastNLP.modules.allowed_transitions` :class:`fastNLP.modules.decoder.allowed_transitions`
给定一个id到label的映射表,返回所有可以跳转的(from_tag_id, to_tag_id)列表。
:param dict id2target: key是label的indices,value是str类型的tag或tag-label。value可以是只有tag的, 比如"B", "M"; 也可以是
"B-NN", "M-NN", tag和label之间一定要用"-"隔开。一般可以通过Vocabulary.idx2word得到id2label。
- :param str encoding_type: 支持"bio", "bmes", "bmeso"。
+ :param str encoding_type: 支持"bio", "bmes", "bmeso", "bioes"。
:param bool include_start_end: 是否包含开始与结尾的转换。比如在bio中,b/o可以在开头,但是i不能在开头;
为True,返回的结果中会包含(start_idx, b_idx), (start_idx, o_idx), 但是不包含(start_idx, i_idx);
start_idx=len(id2label), end_idx=len(id2label)+1。为False, 返回的结果中不含与开始结尾相关的内容
@@ -31,7 +31,7 @@ def allowed_transitions(id2target, encoding_type='bio', include_start_end=True):
id_label_lst = list(id2target.items())
if include_start_end:
id_label_lst += [(start_idx, 'start'), (end_idx, 'end')]
-
+
def split_tag_label(from_label):
from_label = from_label.lower()
if from_label in ['start', 'end']:
@@ -41,7 +41,7 @@ def allowed_transitions(id2target, encoding_type='bio', include_start_end=True):
from_tag = from_label[:1]
from_label = from_label[2:]
return from_tag, from_label
-
+
for from_id, from_label in id_label_lst:
         if from_label in ['<pad>', '<unk>']:
continue
@@ -58,7 +58,7 @@ def allowed_transitions(id2target, encoding_type='bio', include_start_end=True):
def _is_transition_allowed(encoding_type, from_tag, from_label, to_tag, to_label):
"""
- :param str encoding_type: 支持"BIO", "BMES", "BEMSO"。
+ :param str encoding_type: 支持"BIO", "BMES", "BEMSO", 'bioes'。
:param str from_tag: 比如"B", "M"之类的标注tag. 还包括start, end等两种特殊tag
:param str from_label: 比如"PER", "LOC"等label
:param str to_tag: 比如"B", "M"之类的标注tag. 还包括start, end等两种特殊tag
@@ -93,7 +93,7 @@ def _is_transition_allowed(encoding_type, from_tag, from_label, to_tag, to_label
return to_tag in ['end', 'b', 'o']
else:
raise ValueError("Unexpect tag {}. Expect only 'B', 'I', 'O'.".format(from_tag))
-
+
elif encoding_type == 'bmes':
"""
第一行是to_tag, 第一列是from_tag,y任意条件下可转,-只有在label相同时可转,n不可转
@@ -134,14 +134,24 @@ def _is_transition_allowed(encoding_type, from_tag, from_label, to_tag, to_label
return to_tag in ['b', 's', 'end', 'o']
else:
raise ValueError("Unexpect tag type {}. Expect only 'B', 'M', 'E', 'S', 'O'.".format(from_tag))
-
+ elif encoding_type == 'bioes':
+ if from_tag == 'start':
+ return to_tag in ['b', 's', 'o']
+ elif from_tag == 'b':
+ return to_tag in ['i', 'e'] and from_label == to_label
+ elif from_tag == 'i':
+ return to_tag in ['i', 'e'] and from_label == to_label
+ elif from_tag in ['e', 's', 'o']:
+ return to_tag in ['b', 's', 'end', 'o']
+ else:
+ raise ValueError("Unexpect tag type {}. Expect only 'B', 'I', 'E', 'S', 'O'.".format(from_tag))
else:
- raise ValueError("Only support BIO, BMES, BMESO encoding type, got {}.".format(encoding_type))
+ raise ValueError("Only support BIO, BMES, BMESO, BIOES encoding type, got {}.".format(encoding_type))
class ConditionalRandomField(nn.Module):
"""
- 别名::class:`fastNLP.modules.ConditionalRandomField` :class:`fastNLP.modules.decoder.crf.ConditionalRandomField`
+ 别名::class:`fastNLP.modules.ConditionalRandomField` :class:`fastNLP.modules.decoder.ConditionalRandomField`
条件随机场。
提供forward()以及viterbi_decode()两个方法,分别用于训练与inference。
@@ -153,21 +163,21 @@ class ConditionalRandomField(nn.Module):
allowed_transitions()函数得到;如果为None,则所有跃迁均为合法
:param str initial_method: 初始化方法。见initial_parameter
"""
-
+
def __init__(self, num_tags, include_start_end_trans=False, allowed_transitions=None,
initial_method=None):
-
+
super(ConditionalRandomField, self).__init__()
-
+
self.include_start_end_trans = include_start_end_trans
self.num_tags = num_tags
-
+
# the meaning of entry in this matrix is (from_tag_id, to_tag_id) score
self.trans_m = nn.Parameter(torch.randn(num_tags, num_tags))
if self.include_start_end_trans:
self.start_scores = nn.Parameter(torch.randn(num_tags))
self.end_scores = nn.Parameter(torch.randn(num_tags))
-
+
if allowed_transitions is None:
constrain = torch.zeros(num_tags + 2, num_tags + 2)
else:
@@ -175,9 +185,9 @@ class ConditionalRandomField(nn.Module):
for from_tag_id, to_tag_id in allowed_transitions:
constrain[from_tag_id, to_tag_id] = 0
self._constrain = nn.Parameter(constrain, requires_grad=False)
-
+
initial_parameter(self, initial_method)
-
+
def _normalizer_likelihood(self, logits, mask):
"""Computes the (batch_size,) denominator term for the log-likelihood, which is the
sum of the likelihoods across all possible state sequences.
@@ -190,21 +200,21 @@ class ConditionalRandomField(nn.Module):
alpha = logits[0]
if self.include_start_end_trans:
alpha = alpha + self.start_scores.view(1, -1)
-
+
flip_mask = mask.eq(0)
-
+
for i in range(1, seq_len):
emit_score = logits[i].view(batch_size, 1, n_tags)
trans_score = self.trans_m.view(1, n_tags, n_tags)
tmp = alpha.view(batch_size, n_tags, 1) + emit_score + trans_score
alpha = torch.logsumexp(tmp, 1).masked_fill(flip_mask[i].view(batch_size, 1), 0) + \
alpha.masked_fill(mask[i].byte().view(batch_size, 1), 0)
-
+
if self.include_start_end_trans:
alpha = alpha + self.end_scores.view(1, -1)
-
+
return torch.logsumexp(alpha, 1)
-
+
def _gold_score(self, logits, tags, mask):
"""
Compute the score for the gold path.
@@ -216,7 +226,7 @@ class ConditionalRandomField(nn.Module):
seq_len, batch_size, _ = logits.size()
batch_idx = torch.arange(batch_size, dtype=torch.long, device=logits.device)
seq_idx = torch.arange(seq_len, dtype=torch.long, device=logits.device)
-
+
# trans_socre [L-1, B]
mask = mask.byte()
flip_mask = mask.eq(0)
@@ -233,7 +243,7 @@ class ConditionalRandomField(nn.Module):
score = score + st_scores + ed_scores
# return [B,]
return score
-
+
def forward(self, feats, tags, mask):
"""
用于计算CRF的前向loss,返回值为一个batch_size的FloatTensor,可能需要mean()求得loss。
@@ -248,9 +258,9 @@ class ConditionalRandomField(nn.Module):
mask = mask.transpose(0, 1).float()
all_path_score = self._normalizer_likelihood(feats, mask)
gold_path_score = self._gold_score(feats, tags, mask)
-
+
return all_path_score - gold_path_score
-
+
def viterbi_decode(self, logits, mask, unpad=False):
"""给定一个特征矩阵以及转移分数矩阵,计算出最佳的路径以及对应的分数
@@ -267,7 +277,7 @@ class ConditionalRandomField(nn.Module):
batch_size, seq_len, n_tags = logits.size()
logits = logits.transpose(0, 1).data # L, B, H
mask = mask.transpose(0, 1).data.byte() # L, B
-
+
# dp
vpath = logits.new_zeros((seq_len, batch_size, n_tags), dtype=torch.long)
vscore = logits[0]
@@ -276,7 +286,7 @@ class ConditionalRandomField(nn.Module):
if self.include_start_end_trans:
transitions[n_tags, :n_tags] += self.start_scores.data
transitions[:n_tags, n_tags + 1] += self.end_scores.data
-
+
vscore += transitions[n_tags, :n_tags]
trans_score = transitions[:n_tags, :n_tags].view(1, n_tags, n_tags).data
for i in range(1, seq_len):
@@ -287,17 +297,17 @@ class ConditionalRandomField(nn.Module):
vpath[i] = best_dst
vscore = best_score.masked_fill(mask[i].eq(0).view(batch_size, 1), 0) + \
vscore.masked_fill(mask[i].view(batch_size, 1), 0)
-
+
if self.include_start_end_trans:
vscore += transitions[:n_tags, n_tags + 1].view(1, -1)
-
+
# backtrace
batch_idx = torch.arange(batch_size, dtype=torch.long, device=logits.device)
seq_idx = torch.arange(seq_len, dtype=torch.long, device=logits.device)
lens = (mask.long().sum(0) - 1)
# idxes [L, B], batched idx from seq_len-1 to 0
idxes = (lens.view(1, -1) - seq_idx.view(-1, 1)) % seq_len
-
+
ans = logits.new_empty((seq_len, batch_size), dtype=torch.long)
ans_score, last_tags = vscore.max(1)
ans[idxes[0], batch_idx] = last_tags
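
To see the new 'bioes' branch in use, here is a small hedged example: the tag indices are made up for the sketch, and the calls follow the signatures defined in this file (the CRF docstring itself points to `allowed_transitions()` for this argument).

```python
from fastNLP.modules import allowed_transitions, ConditionalRandomField

# Toy BIOES tag vocabulary; the indices are arbitrary.
id2target = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'E-PER', 4: 'S-PER'}

# With encoding_type='bioes', B-PER may only move to I-PER/E-PER of the same label,
# while E-PER, S-PER and O may start a new entity or end the sequence.
trans = allowed_transitions(id2target, encoding_type='bioes', include_start_end=True)
crf = ConditionalRandomField(num_tags=len(id2target),
                             include_start_end_trans=True,
                             allowed_transitions=trans)
```
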
diff --git a/fastNLP/modules/decoder/mlp.py b/fastNLP/modules/decoder/mlp.py
index c1579224..9d9d80f2 100644
--- a/fastNLP/modules/decoder/mlp.py
+++ b/fastNLP/modules/decoder/mlp.py
@@ -10,12 +10,13 @@ from ..utils import initial_parameter
class MLP(nn.Module):
"""
- 别名::class:`fastNLP.modules.MLP` :class:`fastNLP.modules.decoder.mlp.MLP`
+ 别名::class:`fastNLP.modules.MLP` :class:`fastNLP.modules.decoder.MLP`
多层感知器
:param List[int] size_layer: 一个int的列表,用来定义MLP的层数,列表中的数字为每一层是hidden数目。MLP的层数为 len(size_layer) - 1
- :param Union[str,func,List[str]] activation: 一个字符串或者函数的列表,用来定义每一个隐层的激活函数,字符串包括relu,tanh和sigmoid,默认值为relu
+ :param Union[str,func,List[str]] activation: 一个字符串或者函数的列表,用来定义每一个隐层的激活函数,字符串包括relu,tanh和
+ sigmoid,默认值为relu
:param Union[str,func] output_activation: 字符串或者函数,用来定义输出层的激活函数,默认值为None,表示输出层没有激活函数
:param str initial_method: 参数初始化方式
:param float dropout: dropout概率,默认值为0
@@ -39,7 +40,7 @@ class MLP(nn.Module):
>>> print(x)
>>> print(y)
"""
-
+
def __init__(self, size_layer, activation='relu', output_activation=None, initial_method=None, dropout=0.0):
super(MLP, self).__init__()
self.hiddens = nn.ModuleList()
@@ -50,9 +51,9 @@ class MLP(nn.Module):
self.output = nn.Linear(size_layer[i - 1], size_layer[i])
else:
self.hiddens.append(nn.Linear(size_layer[i - 1], size_layer[i]))
-
+
self.dropout = nn.Dropout(p=dropout)
-
+
actives = {
'relu': nn.ReLU(),
'tanh': nn.Tanh(),
@@ -81,7 +82,7 @@ class MLP(nn.Module):
else:
raise ValueError("should set activation correctly: {}".format(activation))
initial_parameter(self, initial_method)
-
+
def forward(self, x):
"""
:param torch.Tensor x: MLP接受的输入
diff --git a/fastNLP/modules/decoder/utils.py b/fastNLP/modules/decoder/utils.py
index 249f3ff6..9e773336 100644
--- a/fastNLP/modules/decoder/utils.py
+++ b/fastNLP/modules/decoder/utils.py
@@ -6,7 +6,7 @@ import torch
def viterbi_decode(logits, transitions, mask=None, unpad=False):
r"""
- 别名::class:`fastNLP.modules.viterbi_decode` :class:`fastNLP.modules.decoder.utils.viterbi_decode`
+ 别名::class:`fastNLP.modules.viterbi_decode` :class:`fastNLP.modules.decoder.viterbi_decode`
给定一个特征矩阵以及转移分数矩阵,计算出最佳的路径以及对应的分数
@@ -30,11 +30,11 @@ def viterbi_decode(logits, transitions, mask=None, unpad=False):
mask = mask.transpose(0, 1).data.byte() # L, B
else:
mask = logits.new_ones((seq_len, batch_size), dtype=torch.uint8)
-
+
# dp
vpath = logits.new_zeros((seq_len, batch_size, n_tags), dtype=torch.long)
vscore = logits[0]
-
+
trans_score = transitions.view(1, n_tags, n_tags).data
for i in range(1, seq_len):
prev_score = vscore.view(batch_size, n_tags, 1)
@@ -44,14 +44,14 @@ def viterbi_decode(logits, transitions, mask=None, unpad=False):
vpath[i] = best_dst
vscore = best_score.masked_fill(mask[i].eq(0).view(batch_size, 1), 0) + \
vscore.masked_fill(mask[i].view(batch_size, 1), 0)
-
+
# backtrace
batch_idx = torch.arange(batch_size, dtype=torch.long, device=logits.device)
seq_idx = torch.arange(seq_len, dtype=torch.long, device=logits.device)
lens = (mask.long().sum(0) - 1)
# idxes [L, B], batched idx from seq_len-1 to 0
idxes = (lens.view(1, -1) - seq_idx.view(-1, 1)) % seq_len
-
+
ans = logits.new_empty((seq_len, batch_size), dtype=torch.long)
ans_score, last_tags = vscore.max(1)
ans[idxes[0], batch_idx] = last_tags
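
The decoding loop is easier to follow on a single unpadded sequence. The following standalone sketch reproduces the same DP recurrence and backtrace with plain PyTorch and toy scores; it is an illustration of the algorithm, not a call into fastNLP.

```python
import torch

seq_len, n_tags = 5, 4
logits = torch.randn(seq_len, n_tags)       # emission scores, one row per step
transitions = torch.randn(n_tags, n_tags)   # transitions[i, j]: score of tag i -> tag j

vscore = logits[0]                           # best score of each tag at step 0
backpointers = []
for i in range(1, seq_len):
    # cand[k, j]: best score so far ending in k, then transitioning k -> j and emitting j
    cand = vscore.unsqueeze(1) + transitions + logits[i].unsqueeze(0)
    vscore, best_prev = cand.max(dim=0)      # keep the best predecessor of every tag
    backpointers.append(best_prev)

# Backtrace from the best final tag to recover the path.
best_last = int(vscore.argmax())
path = [best_last]
for bp in reversed(backpointers):
    path.append(int(bp[path[-1]]))
path.reverse()
print(path, float(vscore.max()))
```
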
diff --git a/fastNLP/modules/dropout.py b/fastNLP/modules/dropout.py
index 1363165c..0ea2a2d9 100644
--- a/fastNLP/modules/dropout.py
+++ b/fastNLP/modules/dropout.py
@@ -5,10 +5,8 @@ import torch
class TimestepDropout(torch.nn.Dropout):
"""
- 别名::class:`fastNLP.modules.TimestepDropout`
-
- 接受的参数shape为``[batch_size, num_timesteps, embedding_dim)]`` 使用同一个mask(shape为``(batch_size, embedding_dim)``)
- 在每个timestamp上做dropout。
+ 传入参数的shape为 ``(batch_size, num_timesteps, embedding_dim)``
+ 使用同一个shape为 ``(batch_size, embedding_dim)`` 的mask在每个timestamp上做dropout。
"""
def forward(self, x):
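
The behaviour the new docstring describes can be sketched with plain tensor operations: sample one Bernoulli mask per sequence and reuse it at every timestep. This mirrors the effect only, not the class's exact implementation.

```python
import torch

torch.manual_seed(0)
p = 0.5
x = torch.ones(2, 4, 6)                                   # [batch, timesteps, embedding_dim]
mask = x.new_empty(2, 1, 6).bernoulli_(1 - p) / (1 - p)   # one mask per sequence, scaled like dropout
dropped = x * mask                                        # broadcast over the timestep dimension
print(torch.equal(dropped[:, 0], dropped[:, 3]))          # True: the same columns are zeroed at every step
```
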
diff --git a/fastNLP/modules/encoder/__init__.py b/fastNLP/modules/encoder/__init__.py
index bdc4cbf3..1e99a0fd 100644
--- a/fastNLP/modules/encoder/__init__.py
+++ b/fastNLP/modules/encoder/__init__.py
@@ -1,28 +1,36 @@
__all__ = [
# "BertModel",
-
+
"ConvolutionCharEncoder",
"LSTMCharEncoder",
-
+
"ConvMaxpool",
-
- "Embedding",
-
+
"LSTM",
-
+
"StarTransformer",
-
+
"TransformerEncoder",
-
+
"VarRNN",
"VarLSTM",
- "VarGRU"
+ "VarGRU",
+
+ "MaxPool",
+ "MaxPoolWithMask",
+ "AvgPool",
+ "AvgPoolWithMask",
+
+ "MultiHeadAttention",
]
+
from .bert import BertModel
from .char_encoder import ConvolutionCharEncoder, LSTMCharEncoder
from .conv_maxpool import ConvMaxpool
-from .embedding import Embedding
from .lstm import LSTM
from .star_transformer import StarTransformer
from .transformer import TransformerEncoder
from .variational_rnn import VarRNN, VarLSTM, VarGRU
+
+from .pooling import MaxPool, MaxPoolWithMask, AvgPool, AvgPoolWithMask
+from .attention import MultiHeadAttention
diff --git a/fastNLP/modules/encoder/_elmo.py b/fastNLP/modules/encoder/_elmo.py
new file mode 100644
index 00000000..befae8bc
--- /dev/null
+++ b/fastNLP/modules/encoder/_elmo.py
@@ -0,0 +1,538 @@
+"""
+这个页面的代码大量参考了 allenNLP
+"""
+
+from typing import Optional, Tuple, List, Callable
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.nn.utils.rnn import PackedSequence, pad_packed_sequence
+
+from ..utils import get_dropout_mask
+
+
+class LstmCellWithProjection(torch.nn.Module):
+ """
+ An LSTM with Recurrent Dropout and a projected and clipped hidden state and
+ memory. Note: this implementation is slower than the native Pytorch LSTM because
+    it cannot make use of CUDNN optimizations for stacked RNNs due to the
+    variational dropout and the custom nature of the cell state.
+ Parameters
+ ----------
+ input_size : ``int``, required.
+ The dimension of the inputs to the LSTM.
+ hidden_size : ``int``, required.
+ The dimension of the outputs of the LSTM.
+ cell_size : ``int``, required.
+ The dimension of the memory cell used for the LSTM.
+ go_forward: ``bool``, optional (default = True)
+ The direction in which the LSTM is applied to the sequence.
+ Forwards by default, or backwards if False.
+ recurrent_dropout_probability: ``float``, optional (default = 0.0)
+ The dropout probability to be used in a dropout scheme as stated in
+ `A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
+ `_ . Implementation wise, this simply
+ applies a fixed dropout mask per sequence to the recurrent connection of the
+ LSTM.
+ state_projection_clip_value: ``float``, optional, (default = None)
+ The magnitude with which to clip the hidden_state after projecting it.
+ memory_cell_clip_value: ``float``, optional, (default = None)
+ The magnitude with which to clip the memory cell.
+ Returns
+ -------
+ output_accumulator : ``torch.FloatTensor``
+ The outputs of the LSTM for each timestep. A tensor of shape
+ (batch_size, max_timesteps, hidden_size) where for a given batch
+ element, all outputs past the sequence length for that batch are
+ zero tensors.
+ final_state: ``Tuple[torch.FloatTensor, torch.FloatTensor]``
+ The final (state, memory) states of the LSTM, with shape
+ (1, batch_size, hidden_size) and (1, batch_size, cell_size)
+ respectively. The first dimension is 1 in order to match the Pytorch
+ API for returning stacked LSTM states.
+ """
+
+ def __init__(self,
+ input_size: int,
+ hidden_size: int,
+ cell_size: int,
+ go_forward: bool = True,
+ recurrent_dropout_probability: float = 0.0,
+ memory_cell_clip_value: Optional[float] = None,
+ state_projection_clip_value: Optional[float] = None) -> None:
+ super(LstmCellWithProjection, self).__init__()
+ # Required to be wrapped with a :class:`PytorchSeq2SeqWrapper`.
+ self.input_size = input_size
+ self.hidden_size = hidden_size
+ self.cell_size = cell_size
+
+ self.go_forward = go_forward
+ self.state_projection_clip_value = state_projection_clip_value
+ self.memory_cell_clip_value = memory_cell_clip_value
+ self.recurrent_dropout_probability = recurrent_dropout_probability
+
+ # We do the projections for all the gates all at once.
+ self.input_linearity = torch.nn.Linear(input_size, 4 * cell_size, bias=False)
+ self.state_linearity = torch.nn.Linear(hidden_size, 4 * cell_size, bias=True)
+
+ # Additional projection matrix for making the hidden state smaller.
+ self.state_projection = torch.nn.Linear(cell_size, hidden_size, bias=False)
+ self.reset_parameters()
+
+ def reset_parameters(self):
+ # Use sensible default initializations for parameters.
+ nn.init.orthogonal_(self.input_linearity.weight.data)
+ nn.init.orthogonal_(self.state_linearity.weight.data)
+
+ self.state_linearity.bias.data.fill_(0.0)
+ # Initialize forget gate biases to 1.0 as per An Empirical
+ # Exploration of Recurrent Network Architectures, (Jozefowicz, 2015).
+ self.state_linearity.bias.data[self.cell_size:2 * self.cell_size].fill_(1.0)
+
+ def forward(self, # pylint: disable=arguments-differ
+ inputs: torch.FloatTensor,
+ batch_lengths: List[int],
+ initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None):
+ """
+ Parameters
+ ----------
+ inputs : ``torch.FloatTensor``, required.
+ A tensor of shape (batch_size, num_timesteps, input_size)
+ to apply the LSTM over.
+ batch_lengths : ``List[int]``, required.
+ A list of length batch_size containing the lengths of the sequences in batch.
+ initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None)
+ A tuple (state, memory) representing the initial hidden state and memory
+ of the LSTM. The ``state`` has shape (1, batch_size, hidden_size) and the
+ ``memory`` has shape (1, batch_size, cell_size).
+ Returns
+ -------
+ output_accumulator : ``torch.FloatTensor``
+ The outputs of the LSTM for each timestep. A tensor of shape
+ (batch_size, max_timesteps, hidden_size) where for a given batch
+ element, all outputs past the sequence length for that batch are
+ zero tensors.
+        final_state : ``Tuple[torch.FloatTensor, torch.FloatTensor]``
+            A tuple (state, memory) representing the final hidden state and memory
+ of the LSTM. The ``state`` has shape (1, batch_size, hidden_size) and the
+ ``memory`` has shape (1, batch_size, cell_size).
+ """
+ batch_size = inputs.size()[0]
+ total_timesteps = inputs.size()[1]
+
+ # We have to use this '.data.new().fill_' pattern to create tensors with the correct
+ # type - forward has no knowledge of whether these are torch.Tensors or torch.cuda.Tensors.
+ output_accumulator = inputs.data.new(batch_size,
+ total_timesteps,
+ self.hidden_size).fill_(0)
+ if initial_state is None:
+ full_batch_previous_memory = inputs.data.new(batch_size,
+ self.cell_size).fill_(0)
+ full_batch_previous_state = inputs.data.new(batch_size,
+ self.hidden_size).fill_(0)
+ else:
+ full_batch_previous_state = initial_state[0].squeeze(0)
+ full_batch_previous_memory = initial_state[1].squeeze(0)
+
+ current_length_index = batch_size - 1 if self.go_forward else 0
+ if self.recurrent_dropout_probability > 0.0 and self.training:
+ dropout_mask = get_dropout_mask(self.recurrent_dropout_probability,
+ full_batch_previous_state)
+ else:
+ dropout_mask = None
+
+ for timestep in range(total_timesteps):
+ # The index depends on which end we start.
+ index = timestep if self.go_forward else total_timesteps - timestep - 1
+
+ # What we are doing here is finding the index into the batch dimension
+ # which we need to use for this timestep, because the sequences have
+ # variable length, so once the index is greater than the length of this
+ # particular batch sequence, we no longer need to do the computation for
+ # this sequence. The key thing to recognise here is that the batch inputs
+ # must be _ordered_ by length from longest (first in batch) to shortest
+ # (last) so initially, we are going forwards with every sequence and as we
+ # pass the index at which the shortest elements of the batch finish,
+ # we stop picking them up for the computation.
+ if self.go_forward:
+ while batch_lengths[current_length_index] <= index:
+ current_length_index -= 1
+ # If we're going backwards, we are _picking up_ more indices.
+ else:
+ # First conditional: Are we already at the maximum number of elements in the batch?
+ # Second conditional: Does the next shortest sequence beyond the current batch
+                # index require computation at this timestep?
+ while current_length_index < (len(batch_lengths) - 1) and \
+ batch_lengths[current_length_index + 1] > index:
+ current_length_index += 1
+
+ # Actually get the slices of the batch which we
+ # need for the computation at this timestep.
+ # shape (batch_size, cell_size)
+ previous_memory = full_batch_previous_memory[0: current_length_index + 1].clone()
+ # Shape (batch_size, hidden_size)
+ previous_state = full_batch_previous_state[0: current_length_index + 1].clone()
+ # Shape (batch_size, input_size)
+ timestep_input = inputs[0: current_length_index + 1, index]
+
+ # Do the projections for all the gates all at once.
+ # Both have shape (batch_size, 4 * cell_size)
+ projected_input = self.input_linearity(timestep_input)
+ projected_state = self.state_linearity(previous_state)
+
+ # Main LSTM equations using relevant chunks of the big linear
+ # projections of the hidden state and inputs.
+ input_gate = torch.sigmoid(projected_input[:, (0 * self.cell_size):(1 * self.cell_size)] +
+ projected_state[:, (0 * self.cell_size):(1 * self.cell_size)])
+ forget_gate = torch.sigmoid(projected_input[:, (1 * self.cell_size):(2 * self.cell_size)] +
+ projected_state[:, (1 * self.cell_size):(2 * self.cell_size)])
+ memory_init = torch.tanh(projected_input[:, (2 * self.cell_size):(3 * self.cell_size)] +
+ projected_state[:, (2 * self.cell_size):(3 * self.cell_size)])
+ output_gate = torch.sigmoid(projected_input[:, (3 * self.cell_size):(4 * self.cell_size)] +
+ projected_state[:, (3 * self.cell_size):(4 * self.cell_size)])
+ memory = input_gate * memory_init + forget_gate * previous_memory
+
+ # Here is the non-standard part of this LSTM cell; first, we clip the
+ # memory cell, then we project the output of the timestep to a smaller size
+ # and again clip it.
+
+ if self.memory_cell_clip_value:
+ # pylint: disable=invalid-unary-operand-type
+ memory = torch.clamp(memory, -self.memory_cell_clip_value, self.memory_cell_clip_value)
+
+ # shape (current_length_index, cell_size)
+ pre_projection_timestep_output = output_gate * torch.tanh(memory)
+
+ # shape (current_length_index, hidden_size)
+ timestep_output = self.state_projection(pre_projection_timestep_output)
+ if self.state_projection_clip_value:
+ # pylint: disable=invalid-unary-operand-type
+ timestep_output = torch.clamp(timestep_output,
+ -self.state_projection_clip_value,
+ self.state_projection_clip_value)
+
+ # Only do dropout if the dropout prob is > 0.0 and we are in training mode.
+ if dropout_mask is not None:
+ timestep_output = timestep_output * dropout_mask[0: current_length_index + 1]
+
+ # We've been doing computation with less than the full batch, so here we create a new
+            # variable for the whole batch at this timestep and insert the result for the
+ # relevant elements of the batch into it.
+ full_batch_previous_memory = full_batch_previous_memory.data.clone()
+ full_batch_previous_state = full_batch_previous_state.data.clone()
+ full_batch_previous_memory[0:current_length_index + 1] = memory
+ full_batch_previous_state[0:current_length_index + 1] = timestep_output
+ output_accumulator[0:current_length_index + 1, index] = timestep_output
+
+ # Mimic the pytorch API by returning state in the following shape:
+ # (num_layers * num_directions, batch_size, ...). As this
+ # LSTM cell cannot be stacked, the first dimension here is just 1.
+ final_state = (full_batch_previous_state.unsqueeze(0),
+ full_batch_previous_memory.unsqueeze(0))
+
+ return output_accumulator, final_state
+
+
+class LstmbiLm(nn.Module):
+ def __init__(self, config):
+ super(LstmbiLm, self).__init__()
+ self.config = config
+ self.encoder = nn.LSTM(self.config['lstm']['projection_dim'],
+ self.config['lstm']['dim'],
+ num_layers=self.config['lstm']['n_layers'],
+ bidirectional=True,
+ batch_first=True,
+ dropout=self.config['dropout'])
+ self.projection = nn.Linear(self.config['lstm']['dim'], self.config['lstm']['projection_dim'], bias=True)
+
+ def forward(self, inputs, seq_len):
+ sort_lens, sort_idx = torch.sort(seq_len, dim=0, descending=True)
+ inputs = inputs[sort_idx]
+        inputs = nn.utils.rnn.pack_padded_sequence(inputs, sort_lens, batch_first=True)
+ output, hx = self.encoder(inputs, None) # -> [N,L,C]
+        output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
+ _, unsort_idx = torch.sort(sort_idx, dim=0, descending=False)
+ output = output[unsort_idx]
+ forward, backward = output.split(self.config['lstm']['dim'], 2)
+ return torch.cat([self.projection(forward), self.projection(backward)], dim=2)
+
+
+class ElmobiLm(torch.nn.Module):
+ def __init__(self, config):
+ super(ElmobiLm, self).__init__()
+ self.config = config
+ input_size = config['lstm']['projection_dim']
+ hidden_size = config['lstm']['projection_dim']
+ cell_size = config['lstm']['dim']
+ num_layers = config['lstm']['n_layers']
+ memory_cell_clip_value = config['lstm']['cell_clip']
+ state_projection_clip_value = config['lstm']['proj_clip']
+ recurrent_dropout_probability = 0.0
+
+ self.input_size = input_size
+ self.hidden_size = hidden_size
+ self.num_layers = num_layers
+ self.cell_size = cell_size
+
+ forward_layers = []
+ backward_layers = []
+
+ lstm_input_size = input_size
+ go_forward = True
+ for layer_index in range(num_layers):
+ forward_layer = LstmCellWithProjection(lstm_input_size,
+ hidden_size,
+ cell_size,
+ go_forward,
+ recurrent_dropout_probability,
+ memory_cell_clip_value,
+ state_projection_clip_value)
+ backward_layer = LstmCellWithProjection(lstm_input_size,
+ hidden_size,
+ cell_size,
+ not go_forward,
+ recurrent_dropout_probability,
+ memory_cell_clip_value,
+ state_projection_clip_value)
+ lstm_input_size = hidden_size
+
+ self.add_module('forward_layer_{}'.format(layer_index), forward_layer)
+ self.add_module('backward_layer_{}'.format(layer_index), backward_layer)
+ forward_layers.append(forward_layer)
+ backward_layers.append(backward_layer)
+ self.forward_layers = forward_layers
+ self.backward_layers = backward_layers
+
+ def forward(self, inputs, seq_len):
+ """
+
+ :param inputs: batch_size x max_len x embed_size
+ :param seq_len: batch_size
+ :return: torch.FloatTensor. num_layers x batch_size x max_len x hidden_size
+ """
+ max_len = inputs.size(1)
+ sort_lens, sort_idx = torch.sort(seq_len, dim=0, descending=True)
+ inputs = inputs[sort_idx]
+ inputs = nn.utils.rnn.pack_padded_sequence(inputs, sort_lens, batch_first=True)
+ output, _ = self._lstm_forward(inputs, None)
+ _, unsort_idx = torch.sort(sort_idx, dim=0, descending=False)
+ output = output[:, unsort_idx]
+ return output
+
+ def _lstm_forward(self,
+ inputs: PackedSequence,
+ initial_state: Optional[Tuple[torch.Tensor, torch.Tensor]] = None) -> \
+ Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
+ """
+ Parameters
+ ----------
+ inputs : ``PackedSequence``, required.
+ A batch first ``PackedSequence`` to run the stacked LSTM over.
+ initial_state : ``Tuple[torch.Tensor, torch.Tensor]``, optional, (default = None)
+ A tuple (state, memory) representing the initial hidden state and memory
+ of the LSTM, with shape (num_layers, batch_size, 2 * hidden_size) and
+ (num_layers, batch_size, 2 * cell_size) respectively.
+ Returns
+ -------
+ output_sequence : ``torch.FloatTensor``
+ The encoded sequence of shape (num_layers, batch_size, sequence_length, hidden_size)
+ final_states: ``Tuple[torch.FloatTensor, torch.FloatTensor]``
+ The per-layer final (state, memory) states of the LSTM, with shape
+ (num_layers, batch_size, 2 * hidden_size) and (num_layers, batch_size, 2 * cell_size)
+ respectively. The last dimension is duplicated because it contains the state/memory
+ for both the forward and backward layers.
+ """
+
+ if initial_state is None:
+ hidden_states: List[Optional[Tuple[torch.Tensor,
+ torch.Tensor]]] = [None] * len(self.forward_layers)
+ elif initial_state[0].size()[0] != len(self.forward_layers):
+ raise Exception("Initial states were passed to forward() but the number of "
+ "initial states does not match the number of layers.")
+ else:
+ hidden_states = list(zip(initial_state[0].split(1, 0), initial_state[1].split(1, 0)))
+
+ inputs, batch_lengths = pad_packed_sequence(inputs, batch_first=True)
+ forward_output_sequence = inputs
+ backward_output_sequence = inputs
+
+ final_states = []
+ sequence_outputs = []
+ for layer_index, state in enumerate(hidden_states):
+ forward_layer = getattr(self, 'forward_layer_{}'.format(layer_index))
+ backward_layer = getattr(self, 'backward_layer_{}'.format(layer_index))
+
+ forward_cache = forward_output_sequence
+ backward_cache = backward_output_sequence
+
+ if state is not None:
+ forward_hidden_state, backward_hidden_state = state[0].split(self.hidden_size, 2)
+ forward_memory_state, backward_memory_state = state[1].split(self.cell_size, 2)
+ forward_state = (forward_hidden_state, forward_memory_state)
+ backward_state = (backward_hidden_state, backward_memory_state)
+ else:
+ forward_state = None
+ backward_state = None
+
+ forward_output_sequence, forward_state = forward_layer(forward_output_sequence,
+ batch_lengths,
+ forward_state)
+ backward_output_sequence, backward_state = backward_layer(backward_output_sequence,
+ batch_lengths,
+ backward_state)
+ # Skip connections, just adding the input to the output.
+ if layer_index != 0:
+ forward_output_sequence += forward_cache
+ backward_output_sequence += backward_cache
+
+ sequence_outputs.append(torch.cat([forward_output_sequence,
+ backward_output_sequence], -1))
+ # Append the state tuples in a list, so that we can return
+ # the final states for all the layers.
+ final_states.append((torch.cat([forward_state[0], backward_state[0]], -1),
+ torch.cat([forward_state[1], backward_state[1]], -1)))
+
+ stacked_sequence_outputs: torch.FloatTensor = torch.stack(sequence_outputs)
+        # Stack the hidden state and memory for each layer into 2 tensors of shape
+ # (num_layers, batch_size, hidden_size) and (num_layers, batch_size, cell_size)
+ # respectively.
+ final_hidden_states, final_memory_states = zip(*final_states)
+ final_state_tuple: Tuple[torch.FloatTensor,
+ torch.FloatTensor] = (torch.cat(final_hidden_states, 0),
+ torch.cat(final_memory_states, 0))
+ return stacked_sequence_outputs, final_state_tuple
+
+
+class ConvTokenEmbedder(nn.Module):
+ def __init__(self, config, weight_file, word_emb_layer, char_emb_layer):
+ super(ConvTokenEmbedder, self).__init__()
+ self.weight_file = weight_file
+ self.word_emb_layer = word_emb_layer
+ self.char_emb_layer = char_emb_layer
+
+ self.output_dim = config['lstm']['projection_dim']
+ self._options = config
+
+ char_cnn_options = self._options['char_cnn']
+ if char_cnn_options['activation'] == 'tanh':
+ self.activation = torch.tanh
+ elif char_cnn_options['activation'] == 'relu':
+ self.activation = torch.nn.functional.relu
+ else:
+ raise Exception("Unknown activation")
+
+ if char_emb_layer is not None:
+ self.char_conv = []
+ cnn_config = config['char_cnn']
+ filters = cnn_config['filters']
+ char_embed_dim = cnn_config['embedding']['dim']
+ convolutions = []
+
+ for i, (width, num) in enumerate(filters):
+ conv = torch.nn.Conv1d(
+ in_channels=char_embed_dim,
+ out_channels=num,
+ kernel_size=width,
+ bias=True
+ )
+ convolutions.append(conv)
+ self.add_module('char_conv_{}'.format(i), conv)
+
+ self._convolutions = convolutions
+
+ n_filters = sum(f[1] for f in filters)
+ n_highway = cnn_config['n_highway']
+
+ self._highways = Highway(n_filters, n_highway, activation=torch.nn.functional.relu)
+
+ self._projection = torch.nn.Linear(n_filters, self.output_dim, bias=True)
+
+ def forward(self, words, chars):
+ """
+ :param words:
+ :param chars: Tensor Shape ``(batch_size, sequence_length, 50)``:
+ :return Tensor Shape ``(batch_size, sequence_length + 2, embedding_dim)`` :
+ """
+ # the character id embedding
+ # (batch_size * sequence_length, max_chars_per_token, embed_dim)
+ # character_embedding = torch.nn.functional.embedding(
+ # chars.view(-1, max_chars_per_token),
+ # self._char_embedding_weights
+ # )
+ batch_size, sequence_length, max_char_len = chars.size()
+ character_embedding = self.char_emb_layer(chars).reshape(batch_size * sequence_length, max_char_len, -1)
+ # run convolutions
+
+ # (batch_size * sequence_length, embed_dim, max_chars_per_token)
+ character_embedding = torch.transpose(character_embedding, 1, 2)
+ convs = []
+ for i in range(len(self._convolutions)):
+ conv = getattr(self, 'char_conv_{}'.format(i))
+ convolved = conv(character_embedding)
+ # (batch_size * sequence_length, n_filters for this width)
+ convolved, _ = torch.max(convolved, dim=-1)
+ convolved = self.activation(convolved)
+ convs.append(convolved)
+
+ # (batch_size * sequence_length, n_filters)
+ token_embedding = torch.cat(convs, dim=-1)
+
+ # apply the highway layers (batch_size * sequence_length, n_filters)
+ token_embedding = self._highways(token_embedding)
+
+ # final projection (batch_size * sequence_length, embedding_dim)
+ token_embedding = self._projection(token_embedding)
+
+ # reshape to (batch_size, sequence_length+2, embedding_dim)
+ return token_embedding.view(batch_size, sequence_length, -1)
+
+
+class Highway(torch.nn.Module):
+ """
+ A `Highway layer `_ does a gated combination of a linear
+ transformation and a non-linear transformation of its input. :math:`y = g * x + (1 - g) *
+ f(A(x))`, where :math:`A` is a linear transformation, :math:`f` is an element-wise
+ non-linearity, and :math:`g` is an element-wise gate, computed as :math:`sigmoid(B(x))`.
+ This module will apply a fixed number of highway layers to its input, returning the final
+ result.
+ Parameters
+ ----------
+ input_dim : ``int``
+ The dimensionality of :math:`x`. We assume the input has shape ``(batch_size,
+ input_dim)``.
+ num_layers : ``int``, optional (default=``1``)
+ The number of highway layers to apply to the input.
+ activation : ``Callable[[torch.Tensor], torch.Tensor]``, optional (default=``torch.nn.functional.relu``)
+ The non-linearity to use in the highway layers.
+ """
+
+ def __init__(self,
+ input_dim: int,
+ num_layers: int = 1,
+ activation: Callable[[torch.Tensor], torch.Tensor] = torch.nn.functional.relu) -> None:
+ super(Highway, self).__init__()
+ self._input_dim = input_dim
+ self._layers = torch.nn.ModuleList([torch.nn.Linear(input_dim, input_dim * 2)
+ for _ in range(num_layers)])
+ self._activation = activation
+ for layer in self._layers:
+ # We should bias the highway layer to just carry its input forward. We do that by
+ # setting the bias on `B(x)` to be positive, because that means `g` will be biased to
+            # be high, so we will carry the input forward.  The bias on `B(x)` is the second half
+ # of the bias vector in each Linear layer.
+ layer.bias[input_dim:].data.fill_(1)
+
+ def forward(self, inputs: torch.Tensor) -> torch.Tensor: # pylint: disable=arguments-differ
+ current_input = inputs
+ for layer in self._layers:
+ projected_input = layer(current_input)
+ linear_part = current_input
+ # NOTE: if you modify this, think about whether you should modify the initialization
+ # above, too.
+ nonlinear_part = projected_input[:, (0 * self._input_dim):(1 * self._input_dim)]
+ gate = projected_input[:, (1 * self._input_dim):(2 * self._input_dim)]
+ nonlinear_part = self._activation(nonlinear_part)
+ gate = torch.sigmoid(gate)
+ current_input = gate * linear_part + (1 - gate) * nonlinear_part
+ return current_input
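
For readers unfamiliar with highway networks, the per-layer update applied by the Highway class above can be written out directly. This is a standalone sketch of the formula in the docstring, not a use of the module itself.

```python
import torch

torch.manual_seed(0)
input_dim = 8
x = torch.randn(3, input_dim)

layer = torch.nn.Linear(input_dim, input_dim * 2)   # packs A(x) and B(x) into one projection
proj = layer(x)
nonlinear = torch.relu(proj[:, :input_dim])         # f(A(x))
gate = torch.sigmoid(proj[:, input_dim:])           # g = sigmoid(B(x))
y = gate * x + (1 - gate) * nonlinear               # y = g * x + (1 - g) * f(A(x))
print(y.shape)                                      # torch.Size([3, 8])
```
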
diff --git a/fastNLP/modules/aggregator/attention.py b/fastNLP/modules/encoder/attention.py
similarity index 94%
rename from fastNLP/modules/aggregator/attention.py
rename to fastNLP/modules/encoder/attention.py
index 4101b033..fe3f7fd8 100644
--- a/fastNLP/modules/aggregator/attention.py
+++ b/fastNLP/modules/encoder/attention.py
@@ -8,9 +8,7 @@ import torch
import torch.nn.functional as F
from torch import nn
-from ..dropout import TimestepDropout
-
-from ..utils import initial_parameter
+from fastNLP.modules.utils import initial_parameter
class DotAttention(nn.Module):
@@ -18,15 +16,15 @@ class DotAttention(nn.Module):
.. todo::
补上文档
"""
-
- def __init__(self, key_size, value_size, dropout=0):
+
+ def __init__(self, key_size, value_size, dropout=0.0):
super(DotAttention, self).__init__()
self.key_size = key_size
self.value_size = value_size
self.scale = math.sqrt(key_size)
self.drop = nn.Dropout(dropout)
self.softmax = nn.Softmax(dim=2)
-
+
def forward(self, Q, K, V, mask_out=None):
"""
@@ -37,7 +35,7 @@ class DotAttention(nn.Module):
"""
output = torch.matmul(Q, K.transpose(1, 2)) / self.scale
if mask_out is not None:
- output.masked_fill_(mask_out, -1e8)
+ output.masked_fill_(mask_out, -1e18)
output = self.softmax(output)
output = self.drop(output)
return torch.matmul(output, V)
@@ -45,8 +43,7 @@ class DotAttention(nn.Module):
class MultiHeadAttention(nn.Module):
"""
- 别名::class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.aggregator.attention.MultiHeadAttention`
-
+ 别名::class:`fastNLP.modules.MultiHeadAttention` :class:`fastNLP.modules.encoder.MultiHeadAttention`
:param input_size: int, 输入维度的大小。同时也是输出维度的大小。
:param key_size: int, 每个head的维度大小。
@@ -54,31 +51,30 @@ class MultiHeadAttention(nn.Module):
:param num_head: int,head的数量。
:param dropout: float。
"""
-
+
def __init__(self, input_size, key_size, value_size, num_head, dropout=0.1):
super(MultiHeadAttention, self).__init__()
self.input_size = input_size
self.key_size = key_size
self.value_size = value_size
self.num_head = num_head
-
+
in_size = key_size * num_head
self.q_in = nn.Linear(input_size, in_size)
self.k_in = nn.Linear(input_size, in_size)
self.v_in = nn.Linear(input_size, in_size)
# follow the paper, do not apply dropout within dot-product
- self.attention = DotAttention(key_size=key_size, value_size=value_size, dropout=0)
+ self.attention = DotAttention(key_size=key_size, value_size=value_size, dropout=dropout)
self.out = nn.Linear(value_size * num_head, input_size)
- self.drop = TimestepDropout(dropout)
self.reset_parameters()
-
+
def reset_parameters(self):
sqrt = math.sqrt
nn.init.normal_(self.q_in.weight, mean=0, std=sqrt(2.0 / (self.input_size + self.key_size)))
nn.init.normal_(self.k_in.weight, mean=0, std=sqrt(2.0 / (self.input_size + self.key_size)))
nn.init.normal_(self.v_in.weight, mean=0, std=sqrt(2.0 / (self.input_size + self.value_size)))
nn.init.xavier_normal_(self.out.weight)
-
+
def forward(self, Q, K, V, atte_mask_out=None):
"""
@@ -94,7 +90,7 @@ class MultiHeadAttention(nn.Module):
q = self.q_in(Q).view(batch, sq, n_head, d_k)
k = self.k_in(K).view(batch, sk, n_head, d_k)
v = self.v_in(V).view(batch, sk, n_head, d_v)
-
+
# transpose q, k and v to do batch attention
q = q.permute(2, 0, 1, 3).contiguous().view(-1, sq, d_k)
k = k.permute(2, 0, 1, 3).contiguous().view(-1, sk, d_k)
@@ -102,10 +98,10 @@ class MultiHeadAttention(nn.Module):
if atte_mask_out is not None:
atte_mask_out = atte_mask_out.repeat(n_head, 1, 1)
atte = self.attention(q, k, v, atte_mask_out).view(n_head, batch, sq, d_v)
-
+
# concat all heads, do output linear
atte = atte.permute(1, 2, 0, 3).contiguous().view(batch, sq, -1)
- output = self.drop(self.out(atte))
+ output = self.out(atte)
return output
@@ -126,11 +122,11 @@ class BiAttention(nn.Module):
\end{array}
"""
-
+
def __init__(self):
super(BiAttention, self).__init__()
self.inf = 10e12
-
+
def forward(self, in_x1, in_x2, x1_len, x2_len):
"""
:param torch.Tensor in_x1: [batch_size, x1_seq_len, hidden_size] 第一句的特征表示
@@ -141,36 +137,36 @@ class BiAttention(nn.Module):
torch.Tensor out_x2: [batch_size, x2_seq_len, hidden_size] 第一句attend到的特征表示
"""
-
+
assert in_x1.size()[0] == in_x2.size()[0]
assert in_x1.size()[2] == in_x2.size()[2]
# The batch size and hidden size must be equal.
assert in_x1.size()[1] == x1_len.size()[1] and in_x2.size()[1] == x2_len.size()[1]
# The seq len in in_x and x_len must be equal.
assert in_x1.size()[0] == x1_len.size()[0] and x1_len.size()[0] == x2_len.size()[0]
-
+
batch_size = in_x1.size()[0]
x1_max_len = in_x1.size()[1]
x2_max_len = in_x2.size()[1]
-
+
in_x2_t = torch.transpose(in_x2, 1, 2) # [batch_size, hidden_size, x2_seq_len]
-
+
attention_matrix = torch.bmm(in_x1, in_x2_t) # [batch_size, x1_seq_len, x2_seq_len]
-
+
a_mask = x1_len.le(0.5).float() * -self.inf # [batch_size, x1_seq_len]
a_mask = a_mask.view(batch_size, x1_max_len, -1)
a_mask = a_mask.expand(-1, -1, x2_max_len) # [batch_size, x1_seq_len, x2_seq_len]
b_mask = x2_len.le(0.5).float() * -self.inf
b_mask = b_mask.view(batch_size, -1, x2_max_len)
b_mask = b_mask.expand(-1, x1_max_len, -1) # [batch_size, x1_seq_len, x2_seq_len]
-
+
attention_a = F.softmax(attention_matrix + a_mask, dim=2) # [batch_size, x1_seq_len, x2_seq_len]
attention_b = F.softmax(attention_matrix + b_mask, dim=1) # [batch_size, x1_seq_len, x2_seq_len]
-
+
out_x1 = torch.bmm(attention_a, in_x2) # [batch_size, x1_seq_len, hidden_size]
attention_b_t = torch.transpose(attention_b, 1, 2)
out_x2 = torch.bmm(attention_b_t, in_x1) # [batch_size, x2_seq_len, hidden_size]
-
+
return out_x1, out_x2
@@ -184,10 +180,10 @@ class SelfAttention(nn.Module):
:param float drop: dropout概率,默认值为0.5
:param str initial_method: 初始化参数方法
"""
-
+
def __init__(self, input_size, attention_unit=300, attention_hops=10, drop=0.5, initial_method=None, ):
super(SelfAttention, self).__init__()
-
+
self.attention_hops = attention_hops
self.ws1 = nn.Linear(input_size, attention_unit, bias=False)
self.ws2 = nn.Linear(attention_unit, attention_hops, bias=False)
@@ -196,7 +192,7 @@ class SelfAttention(nn.Module):
self.drop = nn.Dropout(drop)
self.tanh = nn.Tanh()
initial_parameter(self, initial_method)
-
+
def _penalization(self, attention):
"""
compute the penalization term for attention module
@@ -210,7 +206,7 @@ class SelfAttention(nn.Module):
mat = torch.bmm(attention, attention_t) - self.I[:attention.size(0)]
ret = (torch.sum(torch.sum((mat ** 2), 2), 1).squeeze() + 1e-10) ** 0.5
return torch.sum(ret) / size[0]
-
+
def forward(self, input, input_origin):
"""
:param torch.Tensor input: [baz, senLen, h_dim] 要做attention的矩阵
@@ -220,14 +216,14 @@ class SelfAttention(nn.Module):
"""
input = input.contiguous()
size = input.size() # [bsz, len, nhid]
-
+
input_origin = input_origin.expand(self.attention_hops, -1, -1) # [hops,baz, len]
input_origin = input_origin.transpose(0, 1).contiguous() # [baz, hops,len]
-
+
y1 = self.tanh(self.ws1(self.drop(input))) # [baz,len,dim] -->[bsz,len, attention-unit]
attention = self.ws2(y1).transpose(1, 2).contiguous()
# [bsz,len, attention-unit]--> [bsz, len, hop]--> [baz,hop,len]
-
+
attention = attention + (-999999 * (input_origin == 0).float()) # remove the weight on padding token.
attention = F.softmax(attention, 2) # [baz ,hop, len]
return torch.bmm(attention, input), self._penalization(attention) # output1 --> [baz ,hop ,nhid]
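A usage sketch of the multi-hop SelfAttention above (import path assumed from this diff; sizes illustrative). input_origin is the word-id matrix, so 0 marks padding:

import torch
from fastNLP.modules.encoder.attention import SelfAttention  # import path assumed

batch, seq_len, h = 2, 6, 128
feats = torch.randn(batch, seq_len, h)
words = torch.randint(1, 100, (batch, seq_len))  # word ids; 0 means padding
words[:, -2:] = 0                                # last two positions are padding
attn = SelfAttention(input_size=h, attention_unit=64, attention_hops=4)
out, penalty = attn(feats, words)
# out: [batch, hops, h] multi-hop sentence representation, penalty: scalar regularizer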
diff --git a/fastNLP/modules/encoder/bert.py b/fastNLP/modules/encoder/bert.py
index e123fda6..ce175df1 100644
--- a/fastNLP/modules/encoder/bert.py
+++ b/fastNLP/modules/encoder/bert.py
@@ -1,7 +1,15 @@
"""
-bert.py is modified from huggingface/pytorch-pretrained-BERT, which is licensed under the Apache License 2.0.
+这个页面的代码很大程度上参考(复制粘贴)了https://github.com/huggingface/pytorch-pretrained-BERT的代码, 如果你发现该代码对你
+ 有用,也请引用一下他们。
+"""
-"""
+__all__ = [
+ "BertModel"
+]
+
+import collections
+
+import unicodedata
import copy
import json
import math
@@ -9,9 +17,110 @@ import os
import torch
from torch import nn
+import sys
+
+from ..utils import _get_file_name_base_on_postfix
CONFIG_FILE = 'bert_config.json'
-MODEL_WEIGHTS = 'pytorch_model.bin'
+VOCAB_NAME = 'vocab.txt'
+
+
+
+class BertConfig(object):
+ """Configuration class to store the configuration of a `BertModel`.
+ """
+
+ def __init__(self,
+ vocab_size_or_config_json_file,
+ hidden_size=768,
+ num_hidden_layers=12,
+ num_attention_heads=12,
+ intermediate_size=3072,
+ hidden_act="gelu",
+ hidden_dropout_prob=0.1,
+ attention_probs_dropout_prob=0.1,
+ max_position_embeddings=512,
+ type_vocab_size=2,
+ initializer_range=0.02,
+ layer_norm_eps=1e-12):
+ """Constructs BertConfig.
+
+ Args:
+ vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `BertModel`.
+ hidden_size: Size of the encoder layers and the pooler layer.
+ num_hidden_layers: Number of hidden layers in the Transformer encoder.
+ num_attention_heads: Number of attention heads for each attention layer in
+ the Transformer encoder.
+ intermediate_size: The size of the "intermediate" (i.e., feed-forward)
+ layer in the Transformer encoder.
+ hidden_act: The non-linear activation function (function or string) in the
+ encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
+            hidden_dropout_prob: The dropout probability for all fully connected
+ layers in the embeddings, encoder, and pooler.
+ attention_probs_dropout_prob: The dropout ratio for the attention
+ probabilities.
+ max_position_embeddings: The maximum sequence length that this model might
+ ever be used with. Typically set this to something large just in case
+ (e.g., 512 or 1024 or 2048).
+ type_vocab_size: The vocabulary size of the `token_type_ids` passed into
+ `BertModel`.
+            initializer_range: The stddev of the truncated_normal_initializer for
+ initializing all weight matrices.
+ layer_norm_eps: The epsilon used by LayerNorm.
+ """
+ if isinstance(vocab_size_or_config_json_file, str):
+ with open(vocab_size_or_config_json_file, "r", encoding='utf-8') as reader:
+ json_config = json.loads(reader.read())
+ for key, value in json_config.items():
+ self.__dict__[key] = value
+ elif isinstance(vocab_size_or_config_json_file, int):
+ self.vocab_size = vocab_size_or_config_json_file
+ self.hidden_size = hidden_size
+ self.num_hidden_layers = num_hidden_layers
+ self.num_attention_heads = num_attention_heads
+ self.hidden_act = hidden_act
+ self.intermediate_size = intermediate_size
+ self.hidden_dropout_prob = hidden_dropout_prob
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
+ self.max_position_embeddings = max_position_embeddings
+ self.type_vocab_size = type_vocab_size
+ self.initializer_range = initializer_range
+ self.layer_norm_eps = layer_norm_eps
+ else:
+ raise ValueError("First argument must be either a vocabulary size (int)"
+ "or the path to a pretrained model config file (str)")
+
+ @classmethod
+ def from_dict(cls, json_object):
+ """Constructs a `BertConfig` from a Python dictionary of parameters."""
+ config = BertConfig(vocab_size_or_config_json_file=-1)
+ for key, value in json_object.items():
+ config.__dict__[key] = value
+ return config
+
+ @classmethod
+ def from_json_file(cls, json_file):
+ """Constructs a `BertConfig` from a json file of parameters."""
+ with open(json_file, "r", encoding='utf-8') as reader:
+ text = reader.read()
+ return cls.from_dict(json.loads(text))
+
+ def __repr__(self):
+ return str(self.to_json_string())
+
+ def to_dict(self):
+ """Serializes this instance to a Python dictionary."""
+ output = copy.deepcopy(self.__dict__)
+ return output
+
+ def to_json_string(self):
+ """Serializes this instance to a JSON string."""
+ return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
+
+ def to_json_file(self, json_file_path):
+ """ Save this instance to a json file."""
+ with open(json_file_path, "w", encoding='utf-8') as writer:
+ writer.write(self.to_json_string())
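A short sketch of the two ways BertConfig can be constructed (values are illustrative; the import path is assumed from this diff):

from fastNLP.modules.encoder.bert import BertConfig  # import path assumed

config = BertConfig(vocab_size_or_config_json_file=21128,  # e.g. the bert-base-chinese vocab size
                    hidden_size=768, num_hidden_layers=12, num_attention_heads=12)
config.to_json_file("bert_config.json")
same = BertConfig.from_json_file("bert_config.json")
assert same.hidden_size == config.hidden_size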
def gelu(x):
@@ -27,6 +136,8 @@ ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish}
class BertLayerNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-12):
+ """Construct a layernorm module in the TF style (epsilon inside the square root).
+ """
super(BertLayerNorm, self).__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.bias = nn.Parameter(torch.zeros(hidden_size))
@@ -40,16 +151,19 @@ class BertLayerNorm(nn.Module):
class BertEmbeddings(nn.Module):
- def __init__(self, vocab_size, hidden_size, max_position_embeddings, type_vocab_size, hidden_dropout_prob):
+ """Construct the embeddings from word, position and token_type embeddings.
+ """
+
+ def __init__(self, config):
super(BertEmbeddings, self).__init__()
- self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
- self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
- self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
+ self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
+ self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
+ self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
- self.LayerNorm = BertLayerNorm(hidden_size, eps=1e-12)
- self.dropout = nn.Dropout(hidden_dropout_prob)
+ self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, input_ids, token_type_ids=None):
seq_length = input_ids.size(1)
@@ -69,21 +183,21 @@ class BertEmbeddings(nn.Module):
class BertSelfAttention(nn.Module):
- def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob):
+ def __init__(self, config):
super(BertSelfAttention, self).__init__()
- if hidden_size % num_attention_heads != 0:
+ if config.hidden_size % config.num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
- "heads (%d)" % (hidden_size, num_attention_heads))
- self.num_attention_heads = num_attention_heads
- self.attention_head_size = int(hidden_size / num_attention_heads)
+ "heads (%d)" % (config.hidden_size, config.num_attention_heads))
+ self.num_attention_heads = config.num_attention_heads
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
- self.query = nn.Linear(hidden_size, self.all_head_size)
- self.key = nn.Linear(hidden_size, self.all_head_size)
- self.value = nn.Linear(hidden_size, self.all_head_size)
+ self.query = nn.Linear(config.hidden_size, self.all_head_size)
+ self.key = nn.Linear(config.hidden_size, self.all_head_size)
+ self.value = nn.Linear(config.hidden_size, self.all_head_size)
- self.dropout = nn.Dropout(attention_probs_dropout_prob)
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
@@ -120,11 +234,11 @@ class BertSelfAttention(nn.Module):
class BertSelfOutput(nn.Module):
- def __init__(self, hidden_size, hidden_dropout_prob):
+ def __init__(self, config):
super(BertSelfOutput, self).__init__()
- self.dense = nn.Linear(hidden_size, hidden_size)
- self.LayerNorm = BertLayerNorm(hidden_size, eps=1e-12)
- self.dropout = nn.Dropout(hidden_dropout_prob)
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+ self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
@@ -134,10 +248,10 @@ class BertSelfOutput(nn.Module):
class BertAttention(nn.Module):
- def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob):
+ def __init__(self, config):
super(BertAttention, self).__init__()
- self.self = BertSelfAttention(hidden_size, num_attention_heads, attention_probs_dropout_prob)
- self.output = BertSelfOutput(hidden_size, hidden_dropout_prob)
+ self.self = BertSelfAttention(config)
+ self.output = BertSelfOutput(config)
def forward(self, input_tensor, attention_mask):
self_output = self.self(input_tensor, attention_mask)
@@ -146,11 +260,13 @@ class BertAttention(nn.Module):
class BertIntermediate(nn.Module):
- def __init__(self, hidden_size, intermediate_size, hidden_act):
+ def __init__(self, config):
super(BertIntermediate, self).__init__()
- self.dense = nn.Linear(hidden_size, intermediate_size)
- self.intermediate_act_fn = ACT2FN[hidden_act] \
- if isinstance(hidden_act, str) else hidden_act
+ self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
+ if isinstance(config.hidden_act, str):
+ self.intermediate_act_fn = ACT2FN[config.hidden_act]
+ else:
+ self.intermediate_act_fn = config.hidden_act
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
@@ -159,11 +275,11 @@ class BertIntermediate(nn.Module):
class BertOutput(nn.Module):
- def __init__(self, hidden_size, intermediate_size, hidden_dropout_prob):
+ def __init__(self, config):
super(BertOutput, self).__init__()
- self.dense = nn.Linear(intermediate_size, hidden_size)
- self.LayerNorm = BertLayerNorm(hidden_size, eps=1e-12)
- self.dropout = nn.Dropout(hidden_dropout_prob)
+ self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
+ self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
@@ -173,13 +289,11 @@ class BertOutput(nn.Module):
class BertLayer(nn.Module):
- def __init__(self, hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob,
- intermediate_size, hidden_act):
+ def __init__(self, config):
super(BertLayer, self).__init__()
- self.attention = BertAttention(hidden_size, num_attention_heads, attention_probs_dropout_prob,
- hidden_dropout_prob)
- self.intermediate = BertIntermediate(hidden_size, intermediate_size, hidden_act)
- self.output = BertOutput(hidden_size, intermediate_size, hidden_dropout_prob)
+ self.attention = BertAttention(config)
+ self.intermediate = BertIntermediate(config)
+ self.output = BertOutput(config)
def forward(self, hidden_states, attention_mask):
attention_output = self.attention(hidden_states, attention_mask)
@@ -189,13 +303,10 @@ class BertLayer(nn.Module):
class BertEncoder(nn.Module):
- def __init__(self, num_hidden_layers, hidden_size, num_attention_heads, attention_probs_dropout_prob,
- hidden_dropout_prob,
- intermediate_size, hidden_act):
+ def __init__(self, config):
super(BertEncoder, self).__init__()
- layer = BertLayer(hidden_size, num_attention_heads, attention_probs_dropout_prob, hidden_dropout_prob,
- intermediate_size, hidden_act)
- self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(num_hidden_layers)])
+ layer = BertLayer(config)
+ self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)])
def forward(self, hidden_states, attention_mask, output_all_encoded_layers=True):
all_encoder_layers = []
@@ -209,9 +320,9 @@ class BertEncoder(nn.Module):
class BertPooler(nn.Module):
- def __init__(self, hidden_size):
+ def __init__(self, config):
super(BertPooler, self).__init__()
- self.dense = nn.Linear(hidden_size, hidden_size)
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states):
@@ -224,18 +335,27 @@ class BertPooler(nn.Module):
class BertModel(nn.Module):
- """BERT(Bidirectional Embedding Representations from Transformers).
+ """
+ 别名::class:`fastNLP.modules.BertModel` :class:`fastNLP.modules.encoder.BertModel`
+
+    BERT(Bidirectional Encoder Representations from Transformers).
如果你想使用预训练好的权重矩阵,请在以下网址下载.
sources::
- 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
- 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
- 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
- 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
- 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
- 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
- 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
+ 'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin",
+ 'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin",
+ 'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin",
+ 'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-pytorch_model.bin",
+ 'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-pytorch_model.bin",
+ 'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-pytorch_model.bin",
+ 'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin",
+ 'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-pytorch_model.bin",
+ 'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin",
+ 'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-pytorch_model.bin",
+ 'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin",
+ 'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-pytorch_model.bin",
+ 'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-pytorch_model.bin"
用预训练权重矩阵来建立BERT模型::
@@ -259,33 +379,30 @@ class BertModel(nn.Module):
:param int initializer_range: 初始化权重范围,默认值为0.02
"""
- def __init__(self, vocab_size=30522,
- hidden_size=768,
- num_hidden_layers=12,
- num_attention_heads=12,
- intermediate_size=3072,
- hidden_act="gelu",
- hidden_dropout_prob=0.1,
- attention_probs_dropout_prob=0.1,
- max_position_embeddings=512,
- type_vocab_size=2,
- initializer_range=0.02, **kwargs):
+ def __init__(self, config, *inputs, **kwargs):
super(BertModel, self).__init__()
- self.embeddings = BertEmbeddings(vocab_size, hidden_size, max_position_embeddings,
- type_vocab_size, hidden_dropout_prob)
- self.encoder = BertEncoder(num_hidden_layers, hidden_size, num_attention_heads,
- attention_probs_dropout_prob, hidden_dropout_prob, intermediate_size,
- hidden_act)
- self.pooler = BertPooler(hidden_size)
- self.initializer_range = initializer_range
-
+ if not isinstance(config, BertConfig):
+ raise ValueError(
+ "Parameter config in `{}(config)` should be an instance of class `BertConfig`. "
+ "To create a model from a Google pretrained model use "
+ "`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`".format(
+ self.__class__.__name__, self.__class__.__name__
+ ))
+ super(BertModel, self).__init__()
+ self.config = config
+ self.hidden_size = self.config.hidden_size
+ self.embeddings = BertEmbeddings(config)
+ self.encoder = BertEncoder(config)
+ self.pooler = BertPooler(config)
self.apply(self.init_bert_weights)
def init_bert_weights(self, module):
+ """ Initialize the weights.
+ """
if isinstance(module, (nn.Linear, nn.Embedding)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
- module.weight.data.normal_(mean=0.0, std=self.initializer_range)
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
elif isinstance(module, BertLayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
@@ -324,17 +441,20 @@ class BertModel(nn.Module):
return encoded_layers, pooled_output
@classmethod
- def from_pretrained(cls, pretrained_model_dir, state_dict=None, *inputs, **kwargs):
+ def from_pretrained(cls, pretrained_model_dir, *inputs, **kwargs):
+ state_dict = kwargs.get('state_dict', None)
+ kwargs.pop('state_dict', None)
+ kwargs.pop('cache_dir', None)
+ kwargs.pop('from_tf', None)
# Load config
- config_file = os.path.join(pretrained_model_dir, CONFIG_FILE)
- config = json.load(open(config_file, "r"))
- # config = BertConfig.from_json_file(config_file)
+ config_file = _get_file_name_base_on_postfix(pretrained_model_dir, '.json')
+ config = BertConfig.from_json_file(config_file)
# logger.info("Model config {}".format(config))
# Instantiate model.
- model = cls(*inputs, **config, **kwargs)
+ model = cls(config, *inputs, **kwargs)
if state_dict is None:
- weights_path = os.path.join(pretrained_model_dir, MODEL_WEIGHTS)
- state_dict = torch.load(weights_path)
+ weights_path = _get_file_name_base_on_postfix(pretrained_model_dir, '.bin')
+ state_dict = torch.load(weights_path, map_location='cpu')
old_keys = []
new_keys = []
@@ -375,3 +495,424 @@ class BertModel(nn.Module):
print("Weights from pretrained model not used in {}: {}".format(
model.__class__.__name__, unexpected_keys))
return model
+
+
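A hedged loading sketch for the reworked from_pretrained: the directory is expected to contain one *.json config and one *.bin state dict, which are now located by postfix rather than by fixed file names (paths are illustrative):

import torch
from fastNLP.modules.encoder.bert import BertModel  # import path assumed

model = BertModel.from_pretrained('/path/to/bert-base-chinese')  # illustrative directory
input_ids = torch.randint(1, model.config.vocab_size, (2, 10))
encoded_layers, pooled = model(input_ids, output_all_encoded_layers=True)
# encoded_layers: one [2, 10, hidden_size] tensor per encoder layer; pooled: [2, hidden_size]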
+def whitespace_tokenize(text):
+ """Runs basic whitespace cleaning and splitting on a piece of text."""
+ text = text.strip()
+ if not text:
+ return []
+ tokens = text.split()
+ return tokens
+
+
+class WordpieceTokenizer(object):
+ """Runs WordPiece tokenization."""
+
+ def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
+ self.vocab = vocab
+ self.unk_token = unk_token
+ self.max_input_chars_per_word = max_input_chars_per_word
+
+ def tokenize(self, text):
+ """Tokenizes a piece of text into its word pieces.
+
+ This uses a greedy longest-match-first algorithm to perform tokenization
+ using the given vocabulary.
+
+ For example:
+ input = "unaffable"
+ output = ["un", "##aff", "##able"]
+
+ Args:
+ text: A single token or whitespace separated tokens. This should have
+ already been passed through `BasicTokenizer`.
+
+ Returns:
+ A list of wordpiece tokens.
+ """
+
+ output_tokens = []
+ for token in whitespace_tokenize(text):
+ chars = list(token)
+ if len(chars) > self.max_input_chars_per_word:
+ output_tokens.append(self.unk_token)
+ continue
+
+ is_bad = False
+ start = 0
+ sub_tokens = []
+ while start < len(chars):
+ end = len(chars)
+ cur_substr = None
+ while start < end:
+ substr = "".join(chars[start:end])
+ if start > 0:
+ substr = "##" + substr
+ if substr in self.vocab:
+ cur_substr = substr
+ break
+ end -= 1
+ if cur_substr is None:
+ is_bad = True
+ break
+ sub_tokens.append(cur_substr)
+ start = end
+
+ if is_bad:
+ output_tokens.append(self.unk_token)
+ else:
+ output_tokens.extend(sub_tokens)
+ return output_tokens
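A toy run of the greedy longest-match-first split implemented above; the vocabulary is illustrative:

from fastNLP.modules.encoder.bert import WordpieceTokenizer  # import path assumed

vocab = {"un": 0, "##aff": 1, "##able": 2, "[UNK]": 3}
wp = WordpieceTokenizer(vocab)
print(wp.tokenize("unaffable"))     # ['un', '##aff', '##able']
print(wp.tokenize("unaffordable"))  # ['[UNK]'] -- no full segmentation exists in this vocab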
+
+
+def load_vocab(vocab_file):
+ """Loads a vocabulary file into a dictionary."""
+ vocab = collections.OrderedDict()
+ index = 0
+ with open(vocab_file, "r", encoding="utf-8") as reader:
+ while True:
+ token = reader.readline()
+ if not token:
+ break
+ token = token.strip()
+ vocab[token] = index
+ index += 1
+ return vocab
+
+
+class BasicTokenizer(object):
+ """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
+
+ def __init__(self,
+ do_lower_case=True,
+ never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")):
+ """Constructs a BasicTokenizer.
+
+ Args:
+ do_lower_case: Whether to lower case the input.
+ """
+ self.do_lower_case = do_lower_case
+ self.never_split = never_split
+
+ def tokenize(self, text):
+ """Tokenizes a piece of text."""
+ text = self._clean_text(text)
+ # This was added on November 1st, 2018 for the multilingual and Chinese
+ # models. This is also applied to the English models now, but it doesn't
+ # matter since the English models were not trained on any Chinese data
+ # and generally don't have any Chinese data in them (there are Chinese
+ # characters in the vocabulary because Wikipedia does have some Chinese
+ # words in the English Wikipedia.).
+ text = self._tokenize_chinese_chars(text)
+ orig_tokens = whitespace_tokenize(text)
+ split_tokens = []
+ for token in orig_tokens:
+ if self.do_lower_case and token not in self.never_split:
+ token = token.lower()
+ token = self._run_strip_accents(token)
+ split_tokens.extend(self._run_split_on_punc(token))
+
+ output_tokens = whitespace_tokenize(" ".join(split_tokens))
+ return output_tokens
+
+ def _run_strip_accents(self, text):
+ """Strips accents from a piece of text."""
+ text = unicodedata.normalize("NFD", text)
+ output = []
+ for char in text:
+ cat = unicodedata.category(char)
+ if cat == "Mn":
+ continue
+ output.append(char)
+ return "".join(output)
+
+ def _run_split_on_punc(self, text):
+ """Splits punctuation on a piece of text."""
+ if text in self.never_split:
+ return [text]
+ chars = list(text)
+ i = 0
+ start_new_word = True
+ output = []
+ while i < len(chars):
+ char = chars[i]
+ if _is_punctuation(char):
+ output.append([char])
+ start_new_word = True
+ else:
+ if start_new_word:
+ output.append([])
+ start_new_word = False
+ output[-1].append(char)
+ i += 1
+
+ return ["".join(x) for x in output]
+
+ def _tokenize_chinese_chars(self, text):
+ """Adds whitespace around any CJK character."""
+ output = []
+ for char in text:
+ cp = ord(char)
+ if self._is_chinese_char(cp):
+ output.append(" ")
+ output.append(char)
+ output.append(" ")
+ else:
+ output.append(char)
+ return "".join(output)
+
+ def _is_chinese_char(self, cp):
+ """Checks whether CP is the codepoint of a CJK character."""
+ # This defines a "chinese character" as anything in the CJK Unicode block:
+ # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+ #
+ # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+ # despite its name. The modern Korean Hangul alphabet is a different block,
+ # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+ # space-separated words, so they are not treated specially and handled
+        # like all of the other languages.
+ if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
+ (cp >= 0x3400 and cp <= 0x4DBF) or #
+ (cp >= 0x20000 and cp <= 0x2A6DF) or #
+ (cp >= 0x2A700 and cp <= 0x2B73F) or #
+ (cp >= 0x2B740 and cp <= 0x2B81F) or #
+ (cp >= 0x2B820 and cp <= 0x2CEAF) or
+ (cp >= 0xF900 and cp <= 0xFAFF) or #
+ (cp >= 0x2F800 and cp <= 0x2FA1F)): #
+ return True
+
+ return False
+
+ def _clean_text(self, text):
+ """Performs invalid character removal and whitespace cleanup on text."""
+ output = []
+ for char in text:
+ cp = ord(char)
+ if cp == 0 or cp == 0xfffd or _is_control(char):
+ continue
+ if _is_whitespace(char):
+ output.append(" ")
+ else:
+ output.append(char)
+ return "".join(output)
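What the class above does to mixed Chinese/English text, as a small sketch (the output in the comment is what the code should produce):

from fastNLP.modules.encoder.bert import BasicTokenizer  # import path assumed

bt = BasicTokenizer(do_lower_case=True)
print(bt.tokenize("Hello, 世界!"))
# ['hello', ',', '世', '界', '!'] -- CJK characters are split apart, punctuation is isolated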
+
+
+def _is_whitespace(char):
+ """Checks whether `chars` is a whitespace character."""
+    # \t, \n, and \r are technically control characters but we treat them
+ # as whitespace since they are generally considered as such.
+ if char == " " or char == "\t" or char == "\n" or char == "\r":
+ return True
+ cat = unicodedata.category(char)
+ if cat == "Zs":
+ return True
+ return False
+
+
+def _is_control(char):
+ """Checks whether `chars` is a control character."""
+ # These are technically control characters but we count them as whitespace
+ # characters.
+ if char == "\t" or char == "\n" or char == "\r":
+ return False
+ cat = unicodedata.category(char)
+ if cat.startswith("C"):
+ return True
+ return False
+
+
+def _is_punctuation(char):
+ """Checks whether `chars` is a punctuation character."""
+ cp = ord(char)
+ # We treat all non-letter/number ASCII as punctuation.
+ # Characters such as "^", "$", and "`" are not in the Unicode
+ # Punctuation class but we treat them as punctuation anyways, for
+ # consistency.
+ if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
+ (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
+ return True
+ cat = unicodedata.category(char)
+ if cat.startswith("P"):
+ return True
+ return False
+
+
+class BertTokenizer(object):
+ """Runs end-to-end tokenization: punctuation splitting + wordpiece"""
+
+ def __init__(self, vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True,
+ never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")):
+ """Constructs a BertTokenizer.
+
+ Args:
+ vocab_file: Path to a one-wordpiece-per-line vocabulary file
+ do_lower_case: Whether to lower case the input
+ Only has an effect when do_wordpiece_only=False
+ do_basic_tokenize: Whether to do basic tokenization before wordpiece.
+ max_len: An artificial maximum length to truncate tokenized sequences to;
+ Effective maximum length is always the minimum of this
+ value (if specified) and the underlying BERT model's
+ sequence length.
+ never_split: List of tokens which will never be split during tokenization.
+ Only has an effect when do_wordpiece_only=False
+ """
+ if not os.path.isfile(vocab_file):
+ raise ValueError(
+ "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
+ "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file))
+ self.vocab = load_vocab(vocab_file)
+ self.ids_to_tokens = collections.OrderedDict(
+ [(ids, tok) for tok, ids in self.vocab.items()])
+ self.do_basic_tokenize = do_basic_tokenize
+ if do_basic_tokenize:
+ self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case,
+ never_split=never_split)
+ self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+ self.max_len = max_len if max_len is not None else int(1e12)
+
+ def _reinit_on_new_vocab(self, vocab):
+ """
+ 在load bert之后,可能会对vocab进行重新排列。重新排列之后调用这个函数重新初始化与vocab相关的性质
+
+ :param vocab:
+ :return:
+ """
+ self.vocab = vocab
+ self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
+
+ def tokenize(self, text):
+ split_tokens = []
+ if self.do_basic_tokenize:
+ for token in self.basic_tokenizer.tokenize(text):
+ for sub_token in self.wordpiece_tokenizer.tokenize(token):
+ split_tokens.append(sub_token)
+ else:
+ split_tokens = self.wordpiece_tokenizer.tokenize(text)
+ return split_tokens
+
+ def convert_tokens_to_ids(self, tokens):
+ """Converts a sequence of tokens into ids using the vocab."""
+ ids = []
+ for token in tokens:
+ ids.append(self.vocab[token])
+ if len(ids) > self.max_len:
+ print(
+ "Token indices sequence length is longer than the specified maximum "
+                "sequence length for this BERT model ({} > {}). Running this"
+ " sequence through BERT will result in indexing errors".format(len(ids), self.max_len)
+ )
+ return ids
+
+ def convert_ids_to_tokens(self, ids):
+ """Converts a sequence of ids in wordpiece tokens using the vocab."""
+ tokens = []
+ for i in ids:
+ tokens.append(self.ids_to_tokens[i])
+ return tokens
+
+ def save_vocabulary(self, vocab_path):
+ """Save the tokenizer vocabulary to a directory or file."""
+ index = 0
+ if os.path.isdir(vocab_path):
+ vocab_file = os.path.join(vocab_path, VOCAB_NAME)
+ else:
+ vocab_file = vocab_path
+ with open(vocab_file, "w", encoding="utf-8") as writer:
+ for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):
+ if index != token_index:
+ print("Saving vocabulary to {}: vocabulary indices are not consecutive."
+ " Please check that the vocabulary is not corrupted!".format(vocab_file))
+ index = token_index
+ writer.write(token + u'\n')
+ index += 1
+ return vocab_file
+
+ @classmethod
+ def from_pretrained(cls, model_dir, *inputs, **kwargs):
+ """
+ 给定path,直接读取vocab.
+
+ """
+ pretrained_model_name_or_path = _get_file_name_base_on_postfix(model_dir, '.txt')
+ print("loading vocabulary file {}".format(pretrained_model_name_or_path))
+ max_len = 512
+ kwargs['max_len'] = min(kwargs.get('max_position_embeddings', int(1e12)), max_len)
+ # Instantiate tokenizer.
+ tokenizer = cls(pretrained_model_name_or_path, *inputs, **kwargs)
+ return tokenizer
+
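A usage sketch for BertTokenizer; model_dir is assumed to contain a single vocab *.txt file, which from_pretrained now resolves by postfix (paths illustrative):

from fastNLP.modules.encoder.bert import BertTokenizer  # import path assumed

tokenizer = BertTokenizer.from_pretrained('/path/to/bert-base-uncased')  # illustrative directory
tokens = tokenizer.tokenize("fastNLP makes BERT easier")
ids = tokenizer.convert_tokens_to_ids(tokens)
assert tokenizer.convert_ids_to_tokens(ids) == tokens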
+class _WordPieceBertModel(nn.Module):
+ """
+ 这个模块用于直接计算word_piece的结果.
+
+ """
+
+ def __init__(self, model_dir: str, layers: str = '-1'):
+ super().__init__()
+
+ self.tokenzier = BertTokenizer.from_pretrained(model_dir)
+ self.encoder = BertModel.from_pretrained(model_dir)
+ # 检查encoder_layer_number是否合理
+ encoder_layer_number = len(self.encoder.encoder.layer)
+ self.layers = list(map(int, layers.split(',')))
+ for layer in self.layers:
+ if layer < 0:
+ assert -layer <= encoder_layer_number, f"The layer index:{layer} is out of scope for " \
+ f"a bert model with {encoder_layer_number} layers."
+ else:
+ assert layer < encoder_layer_number, f"The layer index:{layer} is out of scope for " \
+ f"a bert model with {encoder_layer_number} layers."
+
+ self._cls_index = self.tokenzier.vocab['[CLS]']
+ self._sep_index = self.tokenzier.vocab['[SEP]']
+ self._wordpiece_pad_index = self.tokenzier.vocab['[PAD]'] # 需要用于生成word_piece
+
+ def index_dataset(self, *datasets, field_name):
+ """
+ 使用bert的tokenizer新生成word_pieces列加入到datasets中,并将他们设置为input。如果首尾不是
+ [CLS]与[SEP]会在首尾额外加入[CLS]与[SEP], 且将word_pieces这一列的pad value设置为了bert的pad value。
+
+ :param datasets: DataSet对象
+ :param field_name: 基于哪一列index
+ :return:
+ """
+
+ def convert_words_to_word_pieces(words):
+ word_pieces = []
+ for word in words:
+ tokens = self.tokenzier.wordpiece_tokenizer.tokenize(word)
+ word_piece_ids = self.tokenzier.convert_tokens_to_ids(tokens)
+ word_pieces.extend(word_piece_ids)
+ if word_pieces[0] != self._cls_index:
+ word_pieces.insert(0, self._cls_index)
+ if word_pieces[-1] != self._sep_index:
+                word_pieces.append(self._sep_index)  # append [SEP] at the end; insert(-1, ...) would place it before the last piece
+ return word_pieces
+
+ for index, dataset in enumerate(datasets):
+ try:
+ dataset.apply_field(convert_words_to_word_pieces, field_name=field_name, new_field_name='word_pieces',
+ is_input=True)
+ dataset.set_pad_val('word_pieces', self._wordpiece_pad_index)
+ except Exception as e:
+ print(f"Exception happens when processing the {index} dataset.")
+ raise e
+
+ def forward(self, word_pieces, token_type_ids=None):
+ """
+
+ :param word_pieces: torch.LongTensor, batch_size x max_len
+ :param token_type_ids: torch.LongTensor, batch_size x max_len
+ :return: num_layers x batch_size x max_len x hidden_size或者num_layers x batch_size x (max_len+2) x hidden_size
+ """
+ batch_size, max_len = word_pieces.size()
+
+ attn_masks = word_pieces.ne(self._wordpiece_pad_index)
+ bert_outputs, _ = self.encoder(word_pieces, token_type_ids=token_type_ids, attention_mask=attn_masks,
+ output_all_encoded_layers=True)
+ # output_layers = [self.layers] # len(self.layers) x batch_size x max_word_piece_length x hidden_size
+ outputs = bert_outputs[0].new_zeros((len(self.layers), batch_size, max_len, bert_outputs[0].size(-1)))
+ for l_index, l in enumerate(self.layers):
+ outputs[l_index] = bert_outputs[l]
+ return outputs
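A sketch of how the layers string selects encoder layers in _WordPieceBertModel (directory illustrative; assumes a 12-layer model):

from fastNLP.modules.encoder.bert import _WordPieceBertModel  # import path assumed

model = _WordPieceBertModel('/path/to/bert-base-chinese', layers='-1,-2,0')
# model(word_pieces) returns a [3, batch_size, max_len, hidden_size] tensor holding
# the last, the second-to-last and the first encoder layer, in that order.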
diff --git a/fastNLP/modules/encoder/char_encoder.py b/fastNLP/modules/encoder/char_encoder.py
index 481ad7ad..6a6e1470 100644
--- a/fastNLP/modules/encoder/char_encoder.py
+++ b/fastNLP/modules/encoder/char_encoder.py
@@ -11,7 +11,7 @@ from ..utils import initial_parameter
# from torch.nn.init import xavier_uniform
class ConvolutionCharEncoder(nn.Module):
"""
- 别名::class:`fastNLP.modules.ConvolutionCharEncoder` :class:`fastNLP.modules.encoder.char_encoder.ConvolutionCharEncoder`
+ 别名::class:`fastNLP.modules.ConvolutionCharEncoder` :class:`fastNLP.modules.encoder.ConvolutionCharEncoder`
char级别的卷积编码器.
@@ -21,15 +21,16 @@ class ConvolutionCharEncoder(nn.Module):
:param tuple kernels: 一个由int组成的tuple. tuple的长度是char级别卷积操作的数目, 第`i`个int表示第`i`个卷积操作的卷积核.
:param initial_method: 初始化参数的方式, 默认为`xavier normal`
"""
-
- def __init__(self, char_emb_size=50, feature_maps=(40, 30, 30), kernels=(3, 4, 5), initial_method=None):
+
+ def __init__(self, char_emb_size=50, feature_maps=(40, 30, 30), kernels=(1, 3, 5), initial_method=None):
super(ConvolutionCharEncoder, self).__init__()
self.convs = nn.ModuleList([
- nn.Conv2d(1, feature_maps[i], kernel_size=(char_emb_size, kernels[i]), bias=True, padding=(0, 4))
+ nn.Conv2d(1, feature_maps[i], kernel_size=(char_emb_size, kernels[i]), bias=True,
+ padding=(0, kernels[i] // 2))
for i in range(len(kernels))])
-
+
initial_parameter(self, initial_method)
-
+
def forward(self, x):
"""
:param torch.Tensor x: ``[batch_size * sent_length, word_length, char_emb_size]`` 输入字符的embedding
@@ -40,7 +41,7 @@ class ConvolutionCharEncoder(nn.Module):
x = x.transpose(2, 3)
# [batch_size*sent_length, channel, height, width]
return self._convolute(x).unsqueeze(2)
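The switch to padding=(0, kernels[i] // 2) above gives "same" padding for odd kernel sizes, so the character dimension keeps its length. A standalone check (sizes illustrative):

import torch
import torch.nn as nn

char_emb, length, k = 50, 9, 5
conv = nn.Conv2d(1, 8, kernel_size=(char_emb, k), padding=(0, k // 2))
x = torch.randn(4, 1, char_emb, length)
assert conv(x).shape == (4, 8, 1, length)  # L_out = L + 2*(k//2) - k + 1 = L for odd k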
-
+
def _convolute(self, x):
feats = []
for conv in self.convs:
@@ -57,13 +58,13 @@ class ConvolutionCharEncoder(nn.Module):
class LSTMCharEncoder(nn.Module):
"""
- 别名::class:`fastNLP.modules.LSTMCharEncoder` :class:`fastNLP.modules.encoder.char_encoder.LSTMCharEncoder`
+ 别名::class:`fastNLP.modules.LSTMCharEncoder` :class:`fastNLP.modules.encoder.LSTMCharEncoder`
char级别基于LSTM的encoder.
"""
-
+
def __init__(self, char_emb_size=50, hidden_size=None, initial_method=None):
"""
:param int char_emb_size: char级别embedding的维度. Default: 50
@@ -73,14 +74,14 @@ class LSTMCharEncoder(nn.Module):
"""
super(LSTMCharEncoder, self).__init__()
self.hidden_size = char_emb_size if hidden_size is None else hidden_size
-
+
self.lstm = nn.LSTM(input_size=char_emb_size,
hidden_size=self.hidden_size,
num_layers=1,
bias=True,
batch_first=True)
initial_parameter(self, initial_method)
-
+
def forward(self, x):
"""
:param torch.Tensor x: ``[ n_batch*n_word, word_length, char_emb_size]`` 输入字符的embedding
@@ -91,6 +92,6 @@ class LSTMCharEncoder(nn.Module):
h0 = nn.init.orthogonal_(h0)
c0 = torch.empty(1, batch_size, self.hidden_size)
c0 = nn.init.orthogonal_(c0)
-
+
_, hidden = self.lstm(x, (h0, c0))
return hidden[0].squeeze().unsqueeze(2)
diff --git a/fastNLP/modules/encoder/conv_maxpool.py b/fastNLP/modules/encoder/conv_maxpool.py
index ae6bea04..8ce6b163 100644
--- a/fastNLP/modules/encoder/conv_maxpool.py
+++ b/fastNLP/modules/encoder/conv_maxpool.py
@@ -5,12 +5,10 @@ import torch
import torch.nn as nn
import torch.nn.functional as F
-from ..utils import initial_parameter
-
class ConvMaxpool(nn.Module):
"""
- 别名::class:`fastNLP.modules.ConvMaxpool` :class:`fastNLP.modules.encoder.conv_maxpool.ConvMaxpool`
+ 别名::class:`fastNLP.modules.ConvMaxpool` :class:`fastNLP.modules.encoder.ConvMaxpool`
集合了Convolution和Max-Pooling于一体的层。给定一个batch_size x max_len x input_size的输入,返回batch_size x
sum(output_channels) 大小的matrix。在内部,是先使用CNN给输入做卷积,然后经过activation激活层,在通过在长度(max_len)
@@ -19,20 +17,15 @@ class ConvMaxpool(nn.Module):
:param int in_channels: 输入channel的大小,一般是embedding的维度; 或encoder的output维度
:param int,tuple(int) out_channels: 输出channel的数量。如果为list,则需要与kernel_sizes的数量保持一致
:param int,tuple(int) kernel_sizes: 输出channel的kernel大小。
- :param int stride: 见pytorch Conv1D文档。所有kernel共享一个stride。
- :param int padding: 见pytorch Conv1D文档。所有kernel共享一个padding。
- :param int dilation: 见pytorch Conv1D文档。所有kernel共享一个dilation。
- :param int groups: 见pytorch Conv1D文档。所有kernel共享一个groups。
- :param bool bias: 见pytorch Conv1D文档。所有kernel共享一个bias。
:param str activation: Convolution后的结果将通过该activation后再经过max-pooling。支持relu, sigmoid, tanh
- :param str initial_method: str。
"""
-
- def __init__(self, in_channels, out_channels, kernel_sizes,
- stride=1, padding=0, dilation=1,
- groups=1, bias=True, activation="relu", initial_method=None):
+
+ def __init__(self, in_channels, out_channels, kernel_sizes, activation="relu"):
super(ConvMaxpool, self).__init__()
-
+
+ for kernel_size in kernel_sizes:
+            assert kernel_size % 2 == 1, "kernel size has to be an odd number."
+
# convolution
if isinstance(kernel_sizes, (list, tuple, int)):
if isinstance(kernel_sizes, int) and isinstance(out_channels, int):
@@ -44,22 +37,22 @@ class ConvMaxpool(nn.Module):
" of kernel_sizes."
else:
raise ValueError("The type of out_channels and kernel_sizes should be the same.")
-
+
self.convs = nn.ModuleList([nn.Conv1d(
in_channels=in_channels,
out_channels=oc,
kernel_size=ks,
- stride=stride,
- padding=padding,
- dilation=dilation,
- groups=groups,
- bias=bias)
+ stride=1,
+ padding=ks // 2,
+ dilation=1,
+ groups=1,
+            bias=False)
for oc, ks in zip(out_channels, kernel_sizes)])
-
+
else:
raise Exception(
'Incorrect kernel sizes: should be list, tuple or int')
-
+
# activation function
if activation == 'relu':
self.activation = F.relu
@@ -70,9 +63,7 @@ class ConvMaxpool(nn.Module):
else:
raise Exception(
"Undefined activation function: choose from: relu, tanh, sigmoid")
-
- initial_parameter(self, initial_method)
-
+
def forward(self, x, mask=None):
"""
@@ -86,7 +77,7 @@ class ConvMaxpool(nn.Module):
xs = [self.activation(conv(x)) for conv in self.convs] # [[N,C,L], ...]
if mask is not None:
mask = mask.unsqueeze(1) # B x 1 x L
- xs = [x.masked_fill_(mask, float('-inf')) for x in xs]
+ xs = [x.masked_fill_(mask.eq(0), float('-inf')) for x in xs]
# max-pooling
xs = [F.max_pool1d(input=i, kernel_size=i.size(2)).squeeze(2)
for i in xs] # [[N, C], ...]
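A usage sketch of the simplified ConvMaxpool (import path taken from this diff; sizes illustrative). With the mask.eq(0) fix above, positions where mask == 0 are filled with -inf and therefore never win the max:

import torch
from fastNLP.modules.encoder.conv_maxpool import ConvMaxpool

batch, seq_len, emb = 2, 11, 100
x = torch.randn(batch, seq_len, emb)
mask = torch.ones(batch, seq_len).long()
mask[:, -3:] = 0                                  # padded positions
cm = ConvMaxpool(in_channels=emb, out_channels=(30, 40, 50), kernel_sizes=(1, 3, 5))
out = cm(x, mask)                                 # [batch, 30 + 40 + 50]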
diff --git a/fastNLP/modules/encoder/embedding.py b/fastNLP/modules/encoder/embedding.py
deleted file mode 100644
index c2dfab65..00000000
--- a/fastNLP/modules/encoder/embedding.py
+++ /dev/null
@@ -1,50 +0,0 @@
-__all__ = [
- "Embedding"
-]
-import torch.nn as nn
-from ..utils import get_embeddings
-
-
-class Embedding(nn.Embedding):
- """
- 别名::class:`fastNLP.modules.Embedding` :class:`fastNLP.modules.encoder.embedding.Embedding`
-
- Embedding组件. 可以通过self.num_embeddings获取词表大小; self.embedding_dim获取embedding的维度"""
-
- def __init__(self, init_embed, padding_idx=None, dropout=0.0, sparse=False, max_norm=None, norm_type=2,
- scale_grad_by_freq=False):
- """
-
- :param tuple(int,int),torch.FloatTensor,nn.Embedding,numpy.ndarray init_embed: Embedding的大小(传入tuple(int, int),
- 第一个int为vocab_zie, 第二个int为embed_dim); 如果为Tensor, Embedding, ndarray等则直接使用该值初始化Embedding
- :param None,int padding_idx: 该index的Embedding将一直为0.
- :param float dropout: 对Embedding的输出的dropout。
- :param bool sparse: 如果为True,则对Embedding的梯度将是sparse的,参考Pytorch Embedding获取更多信息。
- :param None,float max_norm: 每个vector最大的norm能为多大
- :param int norm_type: norm的类型
- :param bool scale_grad_by_freq: 如果为True,将会把梯度除以这个词出现的次数.
- """
- embed = get_embeddings(init_embed)
- num_embeddings, embedding_dim = embed.weight.size()
-
- super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx,
- max_norm=max_norm, norm_type=norm_type, scale_grad_by_freq=scale_grad_by_freq,
- sparse=sparse, _weight=embed.weight.data)
- del embed
-
- self.dropout = nn.Dropout(dropout)
-
- def forward(self, x):
- """
- :param torch.LongTensor x: [batch, seq_len]
- :return: torch.Tensor : [batch, seq_len, embed_dim]
- """
- x = super().forward(x)
- return self.dropout(x)
-
- def size(self):
- """
- Embedding的大小
- :return: torch.Size()
- """
- return self.weight.size()
diff --git a/fastNLP/modules/encoder/lstm.py b/fastNLP/modules/encoder/lstm.py
index b4f960e7..e2358132 100644
--- a/fastNLP/modules/encoder/lstm.py
+++ b/fastNLP/modules/encoder/lstm.py
@@ -10,17 +10,16 @@ import torch
import torch.nn as nn
import torch.nn.utils.rnn as rnn
-from ..utils import initial_parameter
-
class LSTM(nn.Module):
"""
- 别名::class:`fastNLP.modules.LSTM` :class:`fastNLP.modules.encoder.lstm.LSTM`
+ 别名::class:`fastNLP.modules.LSTM` :class:`fastNLP.modules.encoder.LSTM`
- LSTM 模块, 轻量封装的Pytorch LSTM
+ LSTM 模块, 轻量封装的Pytorch LSTM. 在提供seq_len的情况下,将自动使用pack_padded_sequence; 同时默认将forget gate的bias初始化
+ 为1; 且可以应对DataParallel中LSTM的使用问题。
:param input_size: 输入 `x` 的特征维度
- :param hidden_size: 隐状态 `h` 的特征维度
+ :param hidden_size: 隐状态 `h` 的特征维度.
:param num_layers: rnn的层数. Default: 1
:param dropout: 层间dropout概率. Default: 0
:param bidirectional: 若为 ``True``, 使用双向的RNN. Default: ``False``
@@ -28,25 +27,37 @@ class LSTM(nn.Module):
:(batch, seq, feature). Default: ``False``
:param bias: 如果为 ``False``, 模型将不会使用bias. Default: ``True``
"""
-
+
def __init__(self, input_size, hidden_size=100, num_layers=1, dropout=0.0, batch_first=True,
- bidirectional=False, bias=True, initial_method=None):
+ bidirectional=False, bias=True):
super(LSTM, self).__init__()
self.batch_first = batch_first
self.lstm = nn.LSTM(input_size, hidden_size, num_layers, bias=bias, batch_first=batch_first,
dropout=dropout, bidirectional=bidirectional)
- initial_parameter(self, initial_method)
-
+ self.init_param()
+
+ def init_param(self):
+ for name, param in self.named_parameters():
+ if 'bias' in name:
+ # based on https://github.com/pytorch/pytorch/issues/750#issuecomment-280671871
+ param.data.fill_(0)
+ n = param.size(0)
+ start, end = n // 4, n // 2
+ param.data[start:end].fill_(1)
+ else:
+ nn.init.xavier_uniform_(param)
+
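Why param.data[n // 4 : n // 2] is the forget-gate slice: PyTorch packs each LSTM bias as [b_i | b_f | b_g | b_o], four hidden_size-sized chunks, so the second quarter belongs to the forget gate. A quick standalone check:

import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20)
b = lstm.bias_ih_l0             # shape [4 * hidden_size] = [80]
assert b.size(0) == 4 * 20      # the slice [20:40] is the forget-gate bias set to 1 above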
def forward(self, x, seq_len=None, h0=None, c0=None):
"""
:param x: [batch, seq_len, input_size] 输入序列
:param seq_len: [batch, ] 序列长度, 若为 ``None``, 所有输入看做一样长. Default: ``None``
- :param h0: [batch, hidden_size] 初始隐状态, 若为 ``None`` , 设为全1向量. Default: ``None``
- :param c0: [batch, hidden_size] 初始Cell状态, 若为 ``None`` , 设为全1向量. Default: ``None``
+ :param h0: [batch, hidden_size] 初始隐状态, 若为 ``None`` , 设为全0向量. Default: ``None``
+ :param c0: [batch, hidden_size] 初始Cell状态, 若为 ``None`` , 设为全0向量. Default: ``None``
:return (output, ht) 或 output: 若 ``get_hidden=True`` [batch, seq_len, hidden_size*num_direction] 输出序列
和 [batch, hidden_size*num_direction] 最后时刻隐状态.
"""
+ batch_size, max_len, _ = x.size()
if h0 is not None and c0 is not None:
hx = (h0, c0)
else:
@@ -59,7 +70,7 @@ class LSTM(nn.Module):
x = x[:, sort_idx]
x = rnn.pack_padded_sequence(x, sort_lens, batch_first=self.batch_first)
output, hx = self.lstm(x, hx) # -> [N,L,C]
- output, _ = rnn.pad_packed_sequence(output, batch_first=self.batch_first)
+ output, _ = rnn.pad_packed_sequence(output, batch_first=self.batch_first, total_length=max_len)
_, unsort_idx = torch.sort(sort_idx, dim=0, descending=False)
if self.batch_first:
output = output[unsort_idx]
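A usage sketch of the wrapped LSTM (import path from this diff; sizes illustrative). Passing total_length=max_len above keeps the padded output width fixed, which matters under DataParallel where a shard's longest sequence can be shorter than the batch's padded length:

import torch
from fastNLP.modules.encoder.lstm import LSTM

lstm = LSTM(input_size=32, hidden_size=64, bidirectional=True)
x = torch.randn(3, 10, 32)
seq_len = torch.tensor([10, 7, 4])
output, (h, c) = lstm(x, seq_len=seq_len)
# output: [3, 10, 128] -- always padded back to the full 10 steps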
diff --git a/fastNLP/modules/aggregator/pooling.py b/fastNLP/modules/encoder/pooling.py
similarity index 95%
rename from fastNLP/modules/aggregator/pooling.py
rename to fastNLP/modules/encoder/pooling.py
index 51438aae..d8aa54ad 100644
--- a/fastNLP/modules/aggregator/pooling.py
+++ b/fastNLP/modules/encoder/pooling.py
@@ -1,7 +1,8 @@
__all__ = [
"MaxPool",
"MaxPoolWithMask",
- "AvgPool"
+ "AvgPool",
+ "AvgPoolWithMask"
]
import torch
import torch.nn as nn
@@ -9,7 +10,7 @@ import torch.nn as nn
class MaxPool(nn.Module):
"""
- 别名::class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.aggregator.pooling.MaxPool`
+ 别名::class:`fastNLP.modules.MaxPool` :class:`fastNLP.modules.encoder.MaxPool`
Max-pooling模块。
@@ -20,9 +21,9 @@ class MaxPool(nn.Module):
:param kernel_size: max pooling的窗口大小,默认为tensor最后k维,其中k为dimension
:param ceil_mode:
"""
-
+
def __init__(self, stride=None, padding=0, dilation=1, dimension=1, kernel_size=None, ceil_mode=False):
-
+
super(MaxPool, self).__init__()
assert (1 <= dimension) and (dimension <= 3)
self.dimension = dimension
@@ -31,7 +32,7 @@ class MaxPool(nn.Module):
self.dilation = dilation
self.kernel_size = kernel_size
self.ceil_mode = ceil_mode
-
+
def forward(self, x):
if self.dimension == 1:
pooling = nn.MaxPool1d(
@@ -58,15 +59,15 @@ class MaxPool(nn.Module):
class MaxPoolWithMask(nn.Module):
"""
- 别名::class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.MaxPoolWithMask`
+ 别名::class:`fastNLP.modules.MaxPoolWithMask` :class:`fastNLP.modules.encoder.MaxPoolWithMask`
带mask矩阵的max pooling。在做max-pooling的时候不会考虑mask值为0的位置。
"""
-
+
def __init__(self):
super(MaxPoolWithMask, self).__init__()
self.inf = 10e12
-
+
def forward(self, tensor, mask, dim=1):
"""
:param torch.FloatTensor tensor: [batch_size, seq_len, channels] 初始tensor
@@ -81,11 +82,11 @@ class MaxPoolWithMask(nn.Module):
class KMaxPool(nn.Module):
"""K max-pooling module."""
-
+
def __init__(self, k=1):
super(KMaxPool, self).__init__()
self.k = k
-
+
def forward(self, x):
"""
:param torch.Tensor x: [N, C, L] 初始tensor
@@ -98,16 +99,16 @@ class KMaxPool(nn.Module):
class AvgPool(nn.Module):
"""
- 别名::class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.aggregator.pooling.AvgPool`
+ 别名::class:`fastNLP.modules.AvgPool` :class:`fastNLP.modules.encoder.AvgPool`
给定形如[batch_size, max_len, hidden_size]的输入,在最后一维进行avg pooling. 输出为[batch_size, hidden_size]
"""
-
+
def __init__(self, stride=None, padding=0):
super(AvgPool, self).__init__()
self.stride = stride
self.padding = padding
-
+
def forward(self, x):
"""
:param torch.Tensor x: [N, C, L] 初始tensor
@@ -125,16 +126,16 @@ class AvgPool(nn.Module):
class AvgPoolWithMask(nn.Module):
"""
- 别名::class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.aggregator.pooling.AvgPoolWithMask`
+ 别名::class:`fastNLP.modules.AvgPoolWithMask` :class:`fastNLP.modules.encoder.AvgPoolWithMask`
给定形如[batch_size, max_len, hidden_size]的输入,在最后一维进行avg pooling. 输出为[batch_size, hidden_size], pooling
的时候只会考虑mask为1的位置
"""
-
+
def __init__(self):
super(AvgPoolWithMask, self).__init__()
self.inf = 10e12
-
+
def forward(self, tensor, mask, dim=1):
"""
:param torch.FloatTensor tensor: [batch_size, seq_len, channels] 初始tensor
diff --git a/fastNLP/modules/encoder/star_transformer.py b/fastNLP/modules/encoder/star_transformer.py
index 1eec7c13..3927a494 100644
--- a/fastNLP/modules/encoder/star_transformer.py
+++ b/fastNLP/modules/encoder/star_transformer.py
@@ -13,7 +13,7 @@ from torch.nn import functional as F
class StarTransformer(nn.Module):
"""
- 别名::class:`fastNLP.modules.StarTransformer` :class:`fastNLP.modules.encoder.star_transformer.StarTransformer`
+ 别名::class:`fastNLP.modules.StarTransformer` :class:`fastNLP.modules.encoder.StarTransformer`
Star-Transformer 的encoder部分。 输入3d的文本输入, 返回相同长度的文本编码
@@ -29,24 +29,26 @@ class StarTransformer(nn.Module):
模型会为输入序列加上position embedding。
若为`None`,忽略加上position embedding的步骤. Default: `None`
"""
-
+
def __init__(self, hidden_size, num_layers, num_head, head_dim, dropout=0.1, max_len=None):
super(StarTransformer, self).__init__()
self.iters = num_layers
-
- self.norm = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(self.iters)])
+
+ self.norm = nn.ModuleList([nn.LayerNorm(hidden_size, eps=1e-6) for _ in range(self.iters)])
+ # self.emb_fc = nn.Conv2d(hidden_size, hidden_size, 1)
+ self.emb_drop = nn.Dropout(dropout)
self.ring_att = nn.ModuleList(
- [_MSA1(hidden_size, nhead=num_head, head_dim=head_dim, dropout=dropout)
+ [_MSA1(hidden_size, nhead=num_head, head_dim=head_dim, dropout=0.0)
for _ in range(self.iters)])
self.star_att = nn.ModuleList(
- [_MSA2(hidden_size, nhead=num_head, head_dim=head_dim, dropout=dropout)
+ [_MSA2(hidden_size, nhead=num_head, head_dim=head_dim, dropout=0.0)
for _ in range(self.iters)])
-
+
if max_len is not None:
self.pos_emb = nn.Embedding(max_len, hidden_size)
else:
self.pos_emb = None
-
+
def forward(self, data, mask):
"""
:param FloatTensor data: [batch, length, hidden] 输入的序列
@@ -56,34 +58,35 @@ class StarTransformer(nn.Module):
[batch, hidden] 全局 relay 节点, 详见论文
"""
-
+
def norm_func(f, x):
# B, H, L, 1
return f(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
-
+
B, L, H = data.size()
mask = (mask == 0) # flip the mask for masked_fill_
smask = torch.cat([torch.zeros(B, 1, ).byte().to(mask), mask], 1)
-
+
embs = data.permute(0, 2, 1)[:, :, :, None] # B H L 1
- if self.pos_emb:
+ if self.pos_emb and False:
P = self.pos_emb(torch.arange(L, dtype=torch.long, device=embs.device) \
.view(1, L)).permute(0, 2, 1).contiguous()[:, :, :, None] # 1 H L 1
embs = embs + P
-
+ embs = norm_func(self.emb_drop, embs)
nodes = embs
relay = embs.mean(2, keepdim=True)
ex_mask = mask[:, None, :, None].expand(B, H, L, 1)
r_embs = embs.view(B, H, 1, L)
for i in range(self.iters):
ax = torch.cat([r_embs, relay.expand(B, H, 1, L)], 2)
- nodes = nodes + F.leaky_relu(self.ring_att[i](norm_func(self.norm[i], nodes), ax=ax))
+ nodes = F.leaky_relu(self.ring_att[i](norm_func(self.norm[i], nodes), ax=ax))
+ # nodes = F.leaky_relu(self.ring_att[i](nodes, ax=ax))
relay = F.leaky_relu(self.star_att[i](relay, torch.cat([relay, nodes], 2), smask))
-
+
nodes = nodes.masked_fill_(ex_mask, 0)
-
+
nodes = nodes.view(B, H, L).permute(0, 2, 1)
-
+
return nodes, relay.view(B, H)
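A usage sketch of the encoder above (import path from this diff; sizes illustrative). The mask uses 1 for real tokens and 0 for padding:

import torch
from fastNLP.modules.encoder.star_transformer import StarTransformer

enc = StarTransformer(hidden_size=64, num_layers=2, num_head=4, head_dim=16)
x = torch.randn(2, 9, 64)
mask = torch.ones(2, 9).byte()
mask[:, 7:] = 0                    # padding positions
nodes, relay = enc(x, mask)
# nodes: [2, 9, 64] per-token states, relay: [2, 64] global relay node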
@@ -96,19 +99,19 @@ class _MSA1(nn.Module):
self.WK = nn.Conv2d(nhid, nhead * head_dim, 1)
self.WV = nn.Conv2d(nhid, nhead * head_dim, 1)
self.WO = nn.Conv2d(nhead * head_dim, nhid, 1)
-
+
self.drop = nn.Dropout(dropout)
-
+
# print('NUM_HEAD', nhead, 'DIM_HEAD', head_dim)
self.nhid, self.nhead, self.head_dim, self.unfold_size = nhid, nhead, head_dim, 3
-
+
def forward(self, x, ax=None):
# x: B, H, L, 1, ax : B, H, X, L append features
nhid, nhead, head_dim, unfold_size = self.nhid, self.nhead, self.head_dim, self.unfold_size
B, H, L, _ = x.shape
-
+
q, k, v = self.WQ(x), self.WK(x), self.WV(x) # x: (B,H,L,1)
-
+
if ax is not None:
aL = ax.shape[2]
ak = self.WK(ax).view(B, nhead, head_dim, aL, L)
@@ -121,12 +124,12 @@ class _MSA1(nn.Module):
if ax is not None:
k = torch.cat([k, ak], 3)
v = torch.cat([v, av], 3)
-
+
alphas = self.drop(F.softmax((q * k).sum(2, keepdim=True) / NP.sqrt(head_dim), 3)) # B N L 1 U
att = (alphas * v).sum(3).view(B, nhead * head_dim, L, 1)
-
+
ret = self.WO(att)
-
+
return ret
@@ -138,19 +141,19 @@ class _MSA2(nn.Module):
self.WK = nn.Conv2d(nhid, nhead * head_dim, 1)
self.WV = nn.Conv2d(nhid, nhead * head_dim, 1)
self.WO = nn.Conv2d(nhead * head_dim, nhid, 1)
-
+
self.drop = nn.Dropout(dropout)
-
+
# print('NUM_HEAD', nhead, 'DIM_HEAD', head_dim)
self.nhid, self.nhead, self.head_dim, self.unfold_size = nhid, nhead, head_dim, 3
-
+
def forward(self, x, y, mask=None):
# x: B, H, 1, 1, 1 y: B H L 1
nhid, nhead, head_dim, unfold_size = self.nhid, self.nhead, self.head_dim, self.unfold_size
B, H, L, _ = y.shape
-
+
q, k, v = self.WQ(x), self.WK(y), self.WV(y)
-
+
q = q.view(B, nhead, 1, head_dim) # B, H, 1, 1 -> B, N, 1, h
k = k.view(B, nhead, head_dim, L) # B, H, L, 1 -> B, N, h, L
v = v.view(B, nhead, head_dim, L).permute(0, 1, 3, 2) # B, H, L, 1 -> B, N, L, h
diff --git a/fastNLP/modules/encoder/transformer.py b/fastNLP/modules/encoder/transformer.py
index 698ff95c..bc488e54 100644
--- a/fastNLP/modules/encoder/transformer.py
+++ b/fastNLP/modules/encoder/transformer.py
@@ -3,13 +3,13 @@ __all__ = [
]
from torch import nn
-from ..aggregator.attention import MultiHeadAttention
+from fastNLP.modules.encoder.attention import MultiHeadAttention
from ..dropout import TimestepDropout
class TransformerEncoder(nn.Module):
"""
- 别名::class:`fastNLP.modules.TransformerEncoder` :class:`fastNLP.modules.encoder.transformer.TransformerEncoder`
+ 别名::class:`fastNLP.modules.TransformerEncoder` :class:`fastNLP.modules.encoder.TransformerEncoder`
transformer的encoder模块,不包含embedding层
@@ -22,7 +22,7 @@ class TransformerEncoder(nn.Module):
:param int num_head: head的数量。
:param float dropout: dropout概率. Default: 0.1
"""
-
+
class SubLayer(nn.Module):
def __init__(self, model_size, inner_size, key_size, value_size, num_head, dropout=0.1):
super(TransformerEncoder.SubLayer, self).__init__()
@@ -33,7 +33,7 @@ class TransformerEncoder(nn.Module):
nn.Linear(inner_size, model_size),
TimestepDropout(dropout), )
self.norm2 = nn.LayerNorm(model_size)
-
+
def forward(self, input, seq_mask=None, atte_mask_out=None):
"""
@@ -48,11 +48,11 @@ class TransformerEncoder(nn.Module):
output = self.norm2(output + norm_atte)
output *= seq_mask
return output
-
+
def __init__(self, num_layers, **kargs):
super(TransformerEncoder, self).__init__()
self.layers = nn.ModuleList([self.SubLayer(**kargs) for _ in range(num_layers)])
-
+
def forward(self, x, seq_mask=None):
"""
:param x: [batch, seq_len, model_size] 输入序列
diff --git a/fastNLP/modules/encoder/variational_rnn.py b/fastNLP/modules/encoder/variational_rnn.py
index 29b728e5..8e5e804b 100644
--- a/fastNLP/modules/encoder/variational_rnn.py
+++ b/fastNLP/modules/encoder/variational_rnn.py
@@ -28,14 +28,14 @@ class VarRnnCellWrapper(nn.Module):
"""
Wrapper for normal RNN Cells, make it support variational dropout
"""
-
+
def __init__(self, cell, hidden_size, input_p, hidden_p):
super(VarRnnCellWrapper, self).__init__()
self.cell = cell
self.hidden_size = hidden_size
self.input_p = input_p
self.hidden_p = hidden_p
-
+
def forward(self, input_x, hidden, mask_x, mask_h, is_reversed=False):
"""
:param PackedSequence input_x: [seq_len, batch_size, input_size]
@@ -47,13 +47,13 @@ class VarRnnCellWrapper(nn.Module):
hidden: for LSTM, tuple of (h_n, c_n), [batch_size, hidden_size]
for other RNN, h_n, [batch_size, hidden_size]
"""
-
+
def get_hi(hi, h0, size):
h0_size = size - hi.size(0)
if h0_size > 0:
return torch.cat([hi, h0[:h0_size]], dim=0)
return hi[:size]
-
+
is_lstm = isinstance(hidden, tuple)
input, batch_sizes = input_x.data, input_x.batch_sizes
output = []
@@ -64,7 +64,7 @@ class VarRnnCellWrapper(nn.Module):
else:
batch_iter = batch_sizes
idx = 0
-
+
if is_lstm:
hn = (hidden[0].clone(), hidden[1].clone())
else:
@@ -91,7 +91,7 @@ class VarRnnCellWrapper(nn.Module):
hi = cell(input_i, hi)
hn[:size] = hi
output.append(hi)
-
+
if is_reversed:
output = list(reversed(output))
output = torch.cat(output, dim=0)
@@ -117,7 +117,7 @@ class VarRNNBase(nn.Module):
:param hidden_dropout: 对每个隐状态的dropout概率. Default: 0
:param bidirectional: 若为 ``True``, 使用双向的RNN. Default: ``False``
"""
-
+
def __init__(self, mode, Cell, input_size, hidden_size, num_layers=1,
bias=True, batch_first=False,
input_dropout=0, hidden_dropout=0, bidirectional=False):
@@ -141,7 +141,7 @@ class VarRNNBase(nn.Module):
cell, self.hidden_size, input_dropout, hidden_dropout))
initial_parameter(self)
self.is_lstm = (self.mode == "LSTM")
-
+
def _forward_one(self, n_layer, n_direction, input, hx, mask_x, mask_h):
is_lstm = self.is_lstm
idx = self.num_directions * n_layer + n_direction
@@ -150,7 +150,7 @@ class VarRNNBase(nn.Module):
output_x, hidden_x = cell(
input, hi, mask_x, mask_h, is_reversed=(n_direction == 1))
return output_x, hidden_x
-
+
def forward(self, x, hx=None):
"""
@@ -170,13 +170,13 @@ class VarRNNBase(nn.Module):
else:
max_batch_size = int(x.batch_sizes[0])
x, batch_sizes = x.data, x.batch_sizes
-
+
if hx is None:
hx = x.new_zeros(self.num_layers * self.num_directions,
max_batch_size, self.hidden_size, requires_grad=True)
if is_lstm:
hx = (hx, hx.new_zeros(hx.size(), requires_grad=True))
-
+
mask_x = x.new_ones((max_batch_size, self.input_size))
mask_out = x.new_ones(
(max_batch_size, self.hidden_size * self.num_directions))
@@ -185,7 +185,7 @@ class VarRNNBase(nn.Module):
training=self.training, inplace=True)
nn.functional.dropout(mask_out, p=self.hidden_dropout,
training=self.training, inplace=True)
-
+
hidden = x.new_zeros(
(self.num_layers * self.num_directions, max_batch_size, self.hidden_size))
if is_lstm:
@@ -207,22 +207,22 @@ class VarRNNBase(nn.Module):
else:
hidden[idx] = hidden_x
x = torch.cat(output_list, dim=-1)
-
+
if is_lstm:
hidden = (hidden, cellstate)
-
+
if is_packed:
output = PackedSequence(x, batch_sizes)
else:
x = PackedSequence(x, batch_sizes)
output, _ = pad_packed_sequence(x, batch_first=self.batch_first)
-
+
return output, hidden
class VarLSTM(VarRNNBase):
"""
- 别名::class:`fastNLP.modules.VarLSTM` :class:`fastNLP.modules.encoder.variational_rnn.VarLSTM`
+ 别名::class:`fastNLP.modules.VarLSTM` :class:`fastNLP.modules.encoder.VarLSTM`
Variational Dropout LSTM.
@@ -236,18 +236,18 @@ class VarLSTM(VarRNNBase):
:param hidden_dropout: 对每个隐状态的dropout概率. Default: 0
:param bidirectional: 若为 ``True``, 使用双向的LSTM. Default: ``False``
"""
-
+
def __init__(self, *args, **kwargs):
super(VarLSTM, self).__init__(
mode="LSTM", Cell=nn.LSTMCell, *args, **kwargs)
-
+
def forward(self, x, hx=None):
return super(VarLSTM, self).forward(x, hx)
class VarRNN(VarRNNBase):
"""
- 别名::class:`fastNLP.modules.VarRNN` :class:`fastNLP.modules.encoder.variational_rnn.VarRNN`
+ 别名::class:`fastNLP.modules.VarRNN` :class:`fastNLP.modules.encoder.VarRNN`
Variational Dropout RNN.
@@ -261,18 +261,18 @@ class VarRNN(VarRNNBase):
:param hidden_dropout: 对每个隐状态的dropout概率. Default: 0
:param bidirectional: 若为 ``True``, 使用双向的RNN. Default: ``False``
"""
-
+
def __init__(self, *args, **kwargs):
super(VarRNN, self).__init__(
mode="RNN", Cell=nn.RNNCell, *args, **kwargs)
-
+
def forward(self, x, hx=None):
return super(VarRNN, self).forward(x, hx)
class VarGRU(VarRNNBase):
"""
- 别名::class:`fastNLP.modules.VarGRU` :class:`fastNLP.modules.encoder.variational_rnn.VarGRU`
+ 别名::class:`fastNLP.modules.VarGRU` :class:`fastNLP.modules.encoder.VarGRU`
Variational Dropout GRU.
@@ -286,10 +286,10 @@ class VarGRU(VarRNNBase):
:param hidden_dropout: 对每个隐状态的dropout概率. Default: 0
:param bidirectional: 若为 ``True``, 使用双向的GRU. Default: ``False``
"""
-
+
def __init__(self, *args, **kwargs):
super(VarGRU, self).__init__(
mode="GRU", Cell=nn.GRUCell, *args, **kwargs)
-
+
def forward(self, x, hx=None):
return super(VarGRU, self).forward(x, hx)
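
The three wrapper classes above share VarRNNBase's constructor and forward, so a padded batch (or a PackedSequence) can be fed straight through them. Below is a minimal usage sketch; the import alias comes from the docstrings in this hunk, while the shapes and dropout rates are purely illustrative, not taken from the diff.

```python
# Minimal sketch: variational-dropout LSTM over a padded batch (illustrative shapes).
import torch
from fastNLP.modules import VarLSTM  # alias given in the docstring above

lstm = VarLSTM(input_size=50, hidden_size=100, num_layers=2,
               batch_first=True, input_dropout=0.3, hidden_dropout=0.3,
               bidirectional=True)

x = torch.randn(4, 20, 50)        # [batch_size, seq_len, input_size]
output, (h_n, c_n) = lstm(x)      # output: [4, 20, 200] = hidden_size * num_directions
print(output.shape, h_n.shape)    # h_n: [num_layers * num_directions, batch_size, hidden_size]
```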
diff --git a/fastNLP/modules/utils.py b/fastNLP/modules/utils.py
index 741429bb..dbae9c73 100644
--- a/fastNLP/modules/utils.py
+++ b/fastNLP/modules/utils.py
@@ -1,6 +1,5 @@
from functools import reduce
-import numpy as np
import torch
import torch.nn as nn
import torch.nn.init as init
@@ -70,31 +69,6 @@ def initial_parameter(net, initial_method=None):
net.apply(weights_init)
-def get_embeddings(init_embed):
- """
- 根据输入的init_embed生成nn.Embedding对象。
-
- :param init_embed: 可以是 tuple:(num_embedings, embedding_dim), 即embedding的大小和每个词的维度;也可以传入
- nn.Embedding 对象, 此时就以传入的对象作为embedding; 传入np.ndarray也行,将使用传入的ndarray作为作为Embedding初始
- 化; 传入orch.Tensor, 将使用传入的值作为Embedding初始化。
- :return nn.Embedding embeddings:
- """
- if isinstance(init_embed, tuple):
- res = nn.Embedding(
- num_embeddings=init_embed[0], embedding_dim=init_embed[1])
- elif isinstance(init_embed, nn.Embedding):
- res = init_embed
- elif isinstance(init_embed, torch.Tensor):
- res = nn.Embedding.from_pretrained(init_embed, freeze=False)
- elif isinstance(init_embed, np.ndarray):
- init_embed = torch.tensor(init_embed, dtype=torch.float32)
- res = nn.Embedding.from_pretrained(init_embed, freeze=False)
- else:
- raise TypeError(
- 'invalid init_embed type: {}'.format((type(init_embed))))
- return res
-
-
def summary(model: nn.Module):
"""
得到模型的总参数量
@@ -130,3 +104,33 @@ def summary(model: nn.Module):
strings = [bar] + strings + [bar]
print('\n'.join(strings))
return total, total_train, total_nontrain
+
+
+def get_dropout_mask(drop_p: float, tensor: torch.Tensor):
+ """
+ 根据tensor的形状,生成一个mask
+
+ :param drop_p: float, 以多大的概率置为0。
+    :param tensor: torch.Tensor
+    :return: torch.FloatTensor. 与tensor一样的shape
+    """
+    mask_x = torch.ones_like(tensor)
+    nn.functional.dropout(mask_x, p=drop_p,
+                          training=True, inplace=True)
+ return mask_x
+
+import glob
+import os
+
+def _get_file_name_base_on_postfix(dir_path, postfix):
+ """
+ 在dir_path中寻找后缀为postfix的文件.
+ :param dir_path: str, 文件夹
+ :param postfix: 形如".bin", ".json"等
+ :return: str,文件的路径
+ """
+ files = glob.glob(os.path.join(dir_path, '*' + postfix))
+ if len(files) == 0:
+        raise FileNotFoundError(f"There is no file ending with {postfix} in {dir_path}")
+    elif len(files) > 1:
+        raise FileExistsError(f"There are multiple files ending with {postfix} in {dir_path}")
+    return files[0]
\ No newline at end of file
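
The new get_dropout_mask helper produces a single mask that is meant to be reused across timesteps, mirroring how VarRnnCellWrapper applies mask_x and mask_h in the hunk above. A minimal sketch of that pattern follows; the direct module import path and the shapes are illustrative assumptions.

```python
# Minimal sketch: reuse one dropout mask at every timestep (variational dropout).
import torch
from fastNLP.modules.utils import get_dropout_mask  # path of the file patched above

h = torch.randn(32, 100)           # [batch_size, hidden_size]
mask_h = get_dropout_mask(0.3, h)  # zeros ~30% of entries, scales the rest by 1/(1-p)
for _ in range(10):                # same mask at every step, unlike ordinary dropout
    h = torch.tanh(h) * mask_h
```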
diff --git a/legacy/api/api.py b/legacy/api/api.py
index d5d1df6b..1408731f 100644
--- a/legacy/api/api.py
+++ b/legacy/api/api.py
@@ -8,7 +8,8 @@ import os
from fastNLP.core.dataset import DataSet
from .utils import load_url
from .processor import ModelProcessor
-from fastNLP.io.dataset_loader import _cut_long_sentence, ConllLoader
+from fastNLP.io.dataset_loader import _cut_long_sentence
+from fastNLP.io.data_loader import ConllLoader
from fastNLP.core.instance import Instance
from ..api.pipeline import Pipeline
from fastNLP.core.metrics import SpanFPreRecMetric
diff --git a/reproduction/CNN-sentence_classification/.gitignore b/reproduction/CNN-sentence_classification/.gitignore
deleted file mode 100644
index 4ae0ed76..00000000
--- a/reproduction/CNN-sentence_classification/.gitignore
+++ /dev/null
@@ -1,110 +0,0 @@
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-# Usually these files are written by a python script from a template
-# before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*.cover
-.hypothesis/
-.pytest_cache/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
-# PyBuilder
-target/
-
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# pyenv
-.python-version
-
-# celery beat schedule file
-celerybeat-schedule
-
-# SageMath parsed files
-*.sage.py
-
-# Environments
-.env
-.venv
-env/
-venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache
-
-#custom
-GoogleNews-vectors-negative300.bin/
-GoogleNews-vectors-negative300.bin.gz
-models/
-*.swp
diff --git a/reproduction/CNN-sentence_classification/README.md b/reproduction/CNN-sentence_classification/README.md
deleted file mode 100644
index ee752779..00000000
--- a/reproduction/CNN-sentence_classification/README.md
+++ /dev/null
@@ -1,77 +0,0 @@
-## Introduction
-This is the implementation of [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882) paper in PyTorch.
-* MRDataset, non-static model (word2vec trained by Mikolov et al. (2013) on 100 billion words of Google News)
-* It can be run on both CPU and GPU
-* The best accuracy is 82.61%, which is better than 81.5% in the paper
-(by Jingyuan Liu @ Fudan University; Email: fdjingyuan@outlook.com. Discussion is welcome!)
-
-## Requirement
-* python 3.6
-* pytorch > 0.1
-* numpy
-* gensim
-
-## Run
-STEP 1
-install packages such as gensim (the other required packages are installed the same way)
-```
-pip install gensim
-```
-
-STEP 2
-download the MR dataset and word2vec resources
-* MR dataset: you can download it from (https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz)
-* word2vec: you can download the file from (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)
-
-Since this file is more than 1.5 GB, it is not included in the repository. After downloading it, remember to modify the path in the function `word_embeddings(path='./GoogleNews-vectors-negative300.bin/')`.
-
-
-STEP 3
-train the model
-```
-python train.py
-```
-you will see information printed on the screen, like
-```
-Epoch [1/20], Iter [100/192] Loss: 0.7008
-Test Accuracy: 71.869159 %
-Epoch [2/20], Iter [100/192] Loss: 0.5957
-Test Accuracy: 75.700935 %
-Epoch [3/20], Iter [100/192] Loss: 0.4934
-Test Accuracy: 78.130841 %
-
-......
-Epoch [20/20], Iter [100/192] Loss: 0.0364
-Test Accuracy: 81.495327 %
-Best Accuracy: 82.616822 %
-Best Model: models/cnn.pkl
-```
-
-## Hyperparameters
-According to the paper and experiment, I set:
-
-|Epoch|Kernel Size|dropout|learning rate|batch size|
-|---|---|---|---|---|
-|20|\(h,300,100\)|0.5|0.0001|50|
-
-h = [3,4,5]
-If the accuracy does not improve, the learning rate is multiplied by 0.8 (a scheduler-based sketch follows this file).
-
-## Result
-I tried only one dataset: MR. (The other 6 datasets in the paper are SST-1, SST-2, TREC, CR, MPQA.)
-There are four models in the paper: CNN-rand, CNN-static, CNN-non-static, CNN-multichannel.
-I have tried CNN-non-static: a model with pre-trained vectors from word2vec.
-All words, including the unknown ones that are randomly initialized, and the pretrained vectors are fine-tuned for each task
-(it has almost the best performance and is the most difficult to implement among the four models).
-
-|Dataset|Class Size|Best Result|Kim's Paper Result|
-|---|---|---|---|
-|MR|2|82.617%(CNN-non-static)|81.5%(CNN-nonstatic)|
-
-
-
-## Reference
-* [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
-* https://github.com/Shawn1993/cnn-text-classification-pytorch
-* https://github.com/junwang4/CNN-sentence-classification-pytorch-2017/blob/master/utils.py
-
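
The deleted README above decays the learning rate by 0.8 whenever test accuracy fails to improve. The same schedule can be expressed with PyTorch's built-in ReduceLROnPlateau scheduler; the sketch below is not the deleted script's code, and `model` and `test_acc` are placeholders.

```python
# Sketch of the README's decay rule with a standard scheduler (placeholder model and metric).
import torch

model = torch.nn.Linear(10, 2)   # placeholder for CNN_text
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.8, patience=0)

for epoch in range(20):
    # ... train for one epoch, then evaluate ...
    test_acc = 0.80              # placeholder for the real test accuracy
    scheduler.step(test_acc)     # multiplies lr by 0.8 when accuracy stops improving
```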
diff --git a/reproduction/CNN-sentence_classification/dataset.py b/reproduction/CNN-sentence_classification/dataset.py
deleted file mode 100644
index 4cbe17a4..00000000
--- a/reproduction/CNN-sentence_classification/dataset.py
+++ /dev/null
@@ -1,136 +0,0 @@
-import codecs
-import random
-import re
-
-import gensim
-import numpy as np
-from gensim import corpora
-from torch.utils.data import Dataset
-
-
-def clean_str(string):
- """
- Tokenization/string cleaning for all datasets except for SST.
- Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
- """
- string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
- string = re.sub(r"\'s", " \'s", string)
- string = re.sub(r"\'ve", " \'ve", string)
- string = re.sub(r"n\'t", " n\'t", string)
- string = re.sub(r"\'re", " \'re", string)
- string = re.sub(r"\'d", " \'d", string)
- string = re.sub(r"\'ll", " \'ll", string)
- string = re.sub(r",", " , ", string)
- string = re.sub(r"!", " ! ", string)
- string = re.sub(r"\(", " \( ", string)
- string = re.sub(r"\)", " \) ", string)
- string = re.sub(r"\?", " \? ", string)
- string = re.sub(r"\s{2,}", " ", string)
- return string.strip()
-
-
-def pad_sentences(sentence, padding_word=" "):
- sequence_length = 64
- sent = sentence.split()
- padded_sentence = sentence + padding_word * (sequence_length - len(sent))
- return padded_sentence
-
-
-# data loader
-class MRDataset(Dataset):
- def __init__(self):
-
- # load positive and negative sentenses from files
- with codecs.open("./rt-polaritydata/rt-polarity.pos", encoding='ISO-8859-1') as f:
- positive_examples = list(f.readlines())
- with codecs.open("./rt-polaritydata/rt-polarity.neg", encoding='ISO-8859-1') as f:
- negative_examples = list(f.readlines())
- # s.strip: clear "\n"; clear_str; pad
- positive_examples = [pad_sentences(clean_str(s.strip())) for s in positive_examples]
- negative_examples = [pad_sentences(clean_str(s.strip())) for s in negative_examples]
- self.examples = positive_examples + negative_examples
- self.sentences_texts = [sample.split() for sample in self.examples]
-
- # word dictionary
- dictionary = corpora.Dictionary(self.sentences_texts)
- self.word2id_dict = dictionary.token2id # transform to dict, like {"human":0, "a":1,...}
-
- # set lables: postive is 1; negative is 0
- positive_labels = [1 for _ in positive_examples]
- negative_labels = [0 for _ in negative_examples]
- self.lables = positive_labels + negative_labels
- examples_lables = list(zip(self.examples, self.lables))
- random.shuffle(examples_lables)
- self.MRDataset_frame = examples_lables
-
- # transform word to id
- self.MRDataset_wordid = \
- [(
- np.array([self.word2id_dict[word] for word in sent[0].split()], dtype=np.int64),
- sent[1]
- ) for sent in self.MRDataset_frame]
-
- def word_embeddings(self, path="./GoogleNews-vectors-negative300.bin/GoogleNews-vectors-negative300.bin"):
- # establish from google
- model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)
-
- print('Please wait ... (it could take a while to load the file : {})'.format(path))
- word_dict = self.word2id_dict
- embedding_weights = np.random.uniform(-0.25, 0.25, (len(self.word2id_dict), 300))
-
- for word in word_dict:
- word_id = word_dict[word]
- if word in model.wv.vocab:
- embedding_weights[word_id, :] = model[word]
- return embedding_weights
-
- def __len__(self):
- return len(self.MRDataset_frame)
-
- def __getitem__(self, idx):
-
- sample = self.MRDataset_wordid[idx]
- return sample
-
- def getsent(self, idx):
-
- sample = self.MRDataset_wordid[idx][0]
- return sample
-
- def getlabel(self, idx):
-
- label = self.MRDataset_wordid[idx][1]
- return label
-
- def word2id(self):
-
- return self.word2id_dict
-
- def id2word(self):
-
- id2word_dict = dict([val, key] for key, val in self.word2id_dict.items())
- return id2word_dict
-
-
-class train_set(Dataset):
-
- def __init__(self, samples):
- self.train_frame = samples
-
- def __len__(self):
- return len(self.train_frame)
-
- def __getitem__(self, idx):
- return self.train_frame[idx]
-
-
-class test_set(Dataset):
-
- def __init__(self, samples):
- self.test_frame = samples
-
- def __len__(self):
- return len(self.test_frame)
-
- def __getitem__(self, idx):
- return self.test_frame[idx]
diff --git a/reproduction/CNN-sentence_classification/model.py b/reproduction/CNN-sentence_classification/model.py
deleted file mode 100644
index 0aca34c7..00000000
--- a/reproduction/CNN-sentence_classification/model.py
+++ /dev/null
@@ -1,42 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-
-class CNN_text(nn.Module):
- def __init__(self, kernel_h=[3, 4, 5], kernel_num=100, embed_num=1000, embed_dim=300, num_classes=2, dropout=0.5,
- L2_constrain=3,
- pretrained_embeddings=None):
- super(CNN_text, self).__init__()
-
- self.embedding = nn.Embedding(embed_num, embed_dim)
- self.dropout = nn.Dropout(dropout)
- if pretrained_embeddings is not None:
- self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
-
- # the network structure
- # Conv2d: input- N,C,H,W output- (50,100,62,1)
- self.conv1 = nn.ModuleList([nn.Conv2d(1, kernel_num, (K, embed_dim)) for K in kernel_h])
- self.fc1 = nn.Linear(len(kernel_h) * kernel_num, num_classes)
-
- def max_pooling(self, x):
- x = F.relu(self.conv1(x)).squeeze(3) # N,C,L - (50,100,62)
- x = F.max_pool1d(x, x.size(2)).squeeze(2)
- # x.size(2)=62 squeeze: (50,100,1) -> (50,100)
- return x
-
- def forward(self, x):
- x = self.embedding(x) # output: (N,H,W) = (50,64,300)
- x = x.unsqueeze(1) # (N,C,H,W)
- x = [F.relu(conv(x)).squeeze(3) for conv in self.conv1] # [N, C, H(50,100,62),(50,100,61),(50,100,60)]
- x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x] # [N,C(50,100),(50,100),(50,100)]
- x = torch.cat(x, 1)
- x = self.dropout(x)
- x = self.fc1(x)
- return x
-
-
-if __name__ == '__main__':
- model = CNN_text(kernel_h=[1, 2, 3, 4], embed_num=3, embed_dim=2)
- x = torch.LongTensor([[1, 2, 1, 2, 0]])
- print(model(x))
diff --git a/reproduction/CNN-sentence_classification/train.py b/reproduction/CNN-sentence_classification/train.py
deleted file mode 100644
index 6e35ee5e..00000000
--- a/reproduction/CNN-sentence_classification/train.py
+++ /dev/null
@@ -1,92 +0,0 @@
-import os
-
-import torch
-import torch.nn as nn
-from torch.autograd import Variable
-
-from . import dataset as dst
-from .model import CNN_text
-
-# Hyper Parameters
-batch_size = 50
-learning_rate = 0.0001
-num_epochs = 20
-cuda = True
-
-# split Dataset
-dataset = dst.MRDataset()
-length = len(dataset)
-
-train_dataset = dataset[:int(0.9 * length)]
-test_dataset = dataset[int(0.9 * length):]
-
-train_dataset = dst.train_set(train_dataset)
-test_dataset = dst.test_set(test_dataset)
-
-# Data Loader
-train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
- batch_size=batch_size,
- shuffle=True)
-
-test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
- batch_size=batch_size,
- shuffle=False)
-
-# cnn
-
-cnn = CNN_text(embed_num=len(dataset.word2id()), pretrained_embeddings=dataset.word_embeddings())
-if cuda:
- cnn.cuda()
-
-# Loss and Optimizer
-criterion = nn.CrossEntropyLoss()
-optimizer = torch.optim.Adam(cnn.parameters(), lr=learning_rate)
-
-# train and test
-best_acc = None
-
-for epoch in range(num_epochs):
- # Train the Model
- cnn.train()
- for i, (sents, labels) in enumerate(train_loader):
- sents = Variable(sents)
- labels = Variable(labels)
- if cuda:
- sents = sents.cuda()
- labels = labels.cuda()
- optimizer.zero_grad()
- outputs = cnn(sents)
- loss = criterion(outputs, labels)
- loss.backward()
- optimizer.step()
-
- if (i + 1) % 100 == 0:
- print('Epoch [%d/%d], Iter [%d/%d] Loss: %.4f'
- % (epoch + 1, num_epochs, i + 1, len(train_dataset) // batch_size, loss.data[0]))
-
- # Test the Model
- cnn.eval()
- correct = 0
- total = 0
- for sents, labels in test_loader:
- sents = Variable(sents)
- if cuda:
- sents = sents.cuda()
- labels = labels.cuda()
- outputs = cnn(sents)
- _, predicted = torch.max(outputs.data, 1)
- total += labels.size(0)
- correct += (predicted == labels).sum()
- acc = 100. * correct / total
- print('Test Accuracy: %f %%' % (acc))
-
- if best_acc is None or acc > best_acc:
- best_acc = acc
- if os.path.exists("models") is False:
- os.makedirs("models")
- torch.save(cnn.state_dict(), 'models/cnn.pkl')
- else:
- learning_rate = learning_rate * 0.8
-
-print("Best Accuracy: %f %%" % best_acc)
-print("Best Model: models/cnn.pkl")
diff --git a/reproduction/Char-aware_NLM/LICENSE b/reproduction/Char-aware_NLM/LICENSE
deleted file mode 100644
index 9689f68b..00000000
--- a/reproduction/Char-aware_NLM/LICENSE
+++ /dev/null
@@ -1,21 +0,0 @@
-MIT License
-
-Copyright (c) 2017
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
\ No newline at end of file
diff --git a/reproduction/Char-aware_NLM/README.md b/reproduction/Char-aware_NLM/README.md
deleted file mode 100644
index 4bb06386..00000000
--- a/reproduction/Char-aware_NLM/README.md
+++ /dev/null
@@ -1,40 +0,0 @@
-
-# PyTorch-Character-Aware-Neural-Language-Model
-
-This is the PyTorch implementation of character-aware neural language model proposed in this [paper](https://arxiv.org/abs/1508.06615) by Yoon Kim.
-
-## Requirements
-The code is run and tested with **Python 3.5.2** and **PyTorch 0.3.1**.
-
-## HyperParameters
-| HyperParam | value |
-| ------ | :-------|
-| LSTM batch size | 20 |
-| LSTM sequence length | 35 |
-| LSTM hidden units | 300 |
-| epochs | 35 |
-| initial learning rate | 1.0 |
-| character embedding dimension | 15 |
-
-## Demo
-Train the model with split train/valid/test data.
-
-`python train.py`
-
-The trained model will be saved in `cache/net.pkl`.
-Test the model.
-
-`python test.py`
-
-Best result on the test set:
-PPL = 127.2163
-cross-entropy loss = 4.8459
-
-## Acknowledgement
-This implementation borrowed ideas from
-
-https://github.com/jarfo/kchar
-
-https://github.com/cronos123/Character-Aware-Neural-Language-Models
-
-
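
The two test-set numbers reported in the README above are consistent with each other: perplexity is the exponential of the average cross-entropy loss, as this quick check shows.

```python
# Quick check: PPL = exp(cross-entropy loss)
import math
print(math.exp(4.8459))  # ~127.22, matching the reported PPL of 127.2163
```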
diff --git a/reproduction/Char-aware_NLM/main.py b/reproduction/Char-aware_NLM/main.py
deleted file mode 100644
index 6467d98d..00000000
--- a/reproduction/Char-aware_NLM/main.py
+++ /dev/null
@@ -1,9 +0,0 @@
-PICKLE = "./save/"
-
-
-def train():
- pass
-
-
-if __name__ == "__main__":
- train()
diff --git a/reproduction/Char-aware_NLM/model.py b/reproduction/Char-aware_NLM/model.py
deleted file mode 100644
index 7880d6eb..00000000
--- a/reproduction/Char-aware_NLM/model.py
+++ /dev/null
@@ -1,145 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-
-class Highway(nn.Module):
- """Highway network"""
-
- def __init__(self, input_size):
- super(Highway, self).__init__()
- self.fc1 = nn.Linear(input_size, input_size, bias=True)
- self.fc2 = nn.Linear(input_size, input_size, bias=True)
-
- def forward(self, x):
- t = F.sigmoid(self.fc1(x))
- return torch.mul(t, F.relu(self.fc2(x))) + torch.mul(1 - t, x)
-
-
-class charLM(nn.Module):
- """CNN + highway network + LSTM
- # Input:
- 4D tensor with shape [batch_size, in_channel, height, width]
- # Output:
- 2D Tensor with shape [batch_size, vocab_size]
- # Arguments:
- char_emb_dim: the size of each character's attention
- word_emb_dim: the size of each word's attention
- vocab_size: num of unique words
- num_char: num of characters
- use_gpu: True or False
- """
-
- def __init__(self, char_emb_dim, word_emb_dim,
- vocab_size, num_char, use_gpu):
- super(charLM, self).__init__()
- self.char_emb_dim = char_emb_dim
- self.word_emb_dim = word_emb_dim
- self.vocab_size = vocab_size
-
- # char attention layer
- self.char_embed = nn.Embedding(num_char, char_emb_dim)
-
- # convolutions of filters with different sizes
- self.convolutions = []
-
- # list of tuples: (the number of filter, width)
- self.filter_num_width = [(25, 1), (50, 2), (75, 3), (100, 4), (125, 5), (150, 6)]
-
- for out_channel, filter_width in self.filter_num_width:
- self.convolutions.append(
- nn.Conv2d(
- 1, # in_channel
- out_channel, # out_channel
- kernel_size=(char_emb_dim, filter_width), # (height, width)
- bias=True
- )
- )
-
- self.highway_input_dim = sum([x for x, y in self.filter_num_width])
-
- self.batch_norm = nn.BatchNorm1d(self.highway_input_dim, affine=False)
-
- # highway net
- self.highway1 = Highway(self.highway_input_dim)
- self.highway2 = Highway(self.highway_input_dim)
-
- # LSTM
- self.lstm_num_layers = 2
-
- self.lstm = nn.LSTM(input_size=self.highway_input_dim,
- hidden_size=self.word_emb_dim,
- num_layers=self.lstm_num_layers,
- bias=True,
- dropout=0.5,
- batch_first=True)
-
- # output layer
- self.dropout = nn.Dropout(p=0.5)
- self.linear = nn.Linear(self.word_emb_dim, self.vocab_size)
-
- if use_gpu is True:
- for x in range(len(self.convolutions)):
- self.convolutions[x] = self.convolutions[x].cuda()
- self.highway1 = self.highway1.cuda()
- self.highway2 = self.highway2.cuda()
- self.lstm = self.lstm.cuda()
- self.dropout = self.dropout.cuda()
- self.char_embed = self.char_embed.cuda()
- self.linear = self.linear.cuda()
- self.batch_norm = self.batch_norm.cuda()
-
- def forward(self, x, hidden):
- # Input: Variable of Tensor with shape [num_seq, seq_len, max_word_len+2]
- # Return: Variable of Tensor with shape [num_words, len(word_dict)]
- lstm_batch_size = x.size()[0]
- lstm_seq_len = x.size()[1]
-
- x = x.contiguous().view(-1, x.size()[2])
- # [num_seq*seq_len, max_word_len+2]
-
- x = self.char_embed(x)
- # [num_seq*seq_len, max_word_len+2, char_emb_dim]
-
- x = torch.transpose(x.view(x.size()[0], 1, x.size()[1], -1), 2, 3)
- # [num_seq*seq_len, 1, max_word_len+2, char_emb_dim]
-
- x = self.conv_layers(x)
- # [num_seq*seq_len, total_num_filters]
-
- x = self.batch_norm(x)
- # [num_seq*seq_len, total_num_filters]
-
- x = self.highway1(x)
- x = self.highway2(x)
- # [num_seq*seq_len, total_num_filters]
-
- x = x.contiguous().view(lstm_batch_size, lstm_seq_len, -1)
- # [num_seq, seq_len, total_num_filters]
-
- x, hidden = self.lstm(x, hidden)
- # [seq_len, num_seq, hidden_size]
-
- x = self.dropout(x)
- # [seq_len, num_seq, hidden_size]
-
- x = x.contiguous().view(lstm_batch_size * lstm_seq_len, -1)
- # [num_seq*seq_len, hidden_size]
-
- x = self.linear(x)
- # [num_seq*seq_len, vocab_size]
- return x, hidden
-
- def conv_layers(self, x):
- chosen_list = list()
- for conv in self.convolutions:
- feature_map = F.tanh(conv(x))
- # (batch_size, out_channel, 1, max_word_len-width+1)
- chosen = torch.max(feature_map, 3)[0]
- # (batch_size, out_channel, 1)
- chosen = chosen.squeeze()
- # (batch_size, out_channel)
- chosen_list.append(chosen)
-
- # (batch_size, total_num_filers)
- return torch.cat(chosen_list, 1)
diff --git a/reproduction/Char-aware_NLM/test.py b/reproduction/Char-aware_NLM/test.py
deleted file mode 100644
index abf3f44d..00000000
--- a/reproduction/Char-aware_NLM/test.py
+++ /dev/null
@@ -1,117 +0,0 @@
-import os
-from collections import namedtuple
-
-import numpy as np
-import torch
-import torch.nn as nn
-from torch.autograd import Variable
-from utilities import *
-
-
-def to_var(x):
- if torch.cuda.is_available():
- x = x.cuda()
- return Variable(x)
-
-
-def test(net, data, opt):
- net.eval()
-
- test_input = torch.from_numpy(data.test_input)
- test_label = torch.from_numpy(data.test_label)
-
- num_seq = test_input.size()[0] // opt.lstm_seq_len
- test_input = test_input[:num_seq * opt.lstm_seq_len, :]
- # [num_seq, seq_len, max_word_len+2]
- test_input = test_input.view(-1, opt.lstm_seq_len, opt.max_word_len + 2)
-
- criterion = nn.CrossEntropyLoss()
-
- loss_list = []
- num_hits = 0
- total = 0
- iterations = test_input.size()[0] // opt.lstm_batch_size
- test_generator = batch_generator(test_input, opt.lstm_batch_size)
- label_generator = batch_generator(test_label, opt.lstm_batch_size * opt.lstm_seq_len)
-
- hidden = (to_var(torch.zeros(2, opt.lstm_batch_size, opt.word_embed_dim)),
- to_var(torch.zeros(2, opt.lstm_batch_size, opt.word_embed_dim)))
-
- add_loss = 0.0
- for t in range(iterations):
- batch_input = test_generator.__next__()
- batch_label = label_generator.__next__()
-
- net.zero_grad()
- hidden = [state.detach() for state in hidden]
- test_output, hidden = net(to_var(batch_input), hidden)
-
- test_loss = criterion(test_output, to_var(batch_label)).data
- loss_list.append(test_loss)
- add_loss += test_loss
-
- print("Test Loss={0:.4f}".format(float(add_loss) / iterations))
- print("Test PPL={0:.4f}".format(float(np.exp(add_loss / iterations))))
-
-
-#############################################################
-
-if __name__ == "__main__":
-
- word_embed_dim = 300
- char_embedding_dim = 15
-
- if os.path.exists("cache/prep.pt") is False:
- print("Cannot find prep.pt")
-
- objetcs = torch.load("cache/prep.pt")
-
- word_dict = objetcs["word_dict"]
- char_dict = objetcs["char_dict"]
- reverse_word_dict = objetcs["reverse_word_dict"]
- max_word_len = objetcs["max_word_len"]
- num_words = len(word_dict)
-
- print("word/char dictionary built. Start making inputs.")
-
- if os.path.exists("cache/data_sets.pt") is False:
-
- test_text = read_data("./test.txt")
- test_set = np.array(text2vec(test_text, char_dict, max_word_len))
-
- # Labels are next-word index in word_dict with the same length as inputs
- test_label = np.array([word_dict[w] for w in test_text[1:]] + [word_dict[test_text[-1]]])
-
- category = {"test": test_set, "tlabel": test_label}
- torch.save(category, "cache/data_sets.pt")
- else:
- data_sets = torch.load("cache/data_sets.pt")
- test_set = data_sets["test"]
- test_label = data_sets["tlabel"]
- train_set = data_sets["tdata"]
- train_label = data_sets["trlabel"]
-
- DataTuple = namedtuple("DataTuple", "test_input test_label train_input train_label ")
- data = DataTuple(test_input=test_set,
- test_label=test_label, train_label=train_label, train_input=train_set)
-
- print("Loaded data sets. Start building network.")
-
- USE_GPU = True
- cnn_batch_size = 700
- lstm_seq_len = 35
- lstm_batch_size = 20
-
- net = torch.load("cache/net.pkl")
-
- Options = namedtuple("Options", ["cnn_batch_size", "lstm_seq_len",
- "max_word_len", "lstm_batch_size", "word_embed_dim"])
- opt = Options(cnn_batch_size=lstm_seq_len * lstm_batch_size,
- lstm_seq_len=lstm_seq_len,
- max_word_len=max_word_len,
- lstm_batch_size=lstm_batch_size,
- word_embed_dim=word_embed_dim)
-
- print("Network built. Start testing.")
-
- test(net, data, opt)
diff --git a/reproduction/Char-aware_NLM/test.txt b/reproduction/Char-aware_NLM/test.txt
deleted file mode 100644
index 92aaec44..00000000
--- a/reproduction/Char-aware_NLM/test.txt
+++ /dev/null
@@ -1,320 +0,0 @@
- no it was n't black monday
- but while the new york stock exchange did n't fall apart friday as the dow jones industrial average plunged N points most of it in the final hour it barely managed to stay this side of chaos
- some circuit breakers installed after the october N crash failed their first test traders say unable to cool the selling panic in both stocks and futures
- the N stock specialist firms on the big board floor the buyers and sellers of last resort who were criticized after the N crash once again could n't handle the selling pressure
- big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock traders say
- heavy selling of standard & poor 's 500-stock index futures in chicago beat stocks downward
- seven big board stocks ual amr bankamerica walt disney capital cities\/abc philip morris and pacific telesis group stopped trading and never resumed
- the has already begun
- the equity market was
- once again the specialists were not able to handle the imbalances on the floor of the new york stock exchange said christopher senior vice president at securities corp
- james chairman of specialists henderson brothers inc. it is easy to say the specialist is n't doing his job
- when the dollar is in a even central banks ca n't stop it
- speculators are calling for a degree of liquidity that is not there in the market
- many money managers and some traders had already left their offices early friday afternoon on a warm autumn day because the stock market was so quiet
- then in a plunge the dow jones industrials in barely an hour surrendered about a third of their gains this year up a 190.58-point or N N loss on the day in trading volume
- trading accelerated to N million shares a record for the big board
- at the end of the day N million shares were traded
- the dow jones industrials closed at N
- the dow 's decline was second in point terms only to the black monday crash that occurred oct. N N
- in percentage terms however the dow 's dive was the ever and the sharpest since the market fell N or N N a week after black monday
- the dow fell N N on black monday
- shares of ual the parent of united airlines were extremely active all day friday reacting to news and rumors about the proposed $ N billion buy-out of the airline by an group
- wall street 's takeover-stock speculators or risk arbitragers had placed unusually large bets that a takeover would succeed and ual stock would rise
- at N p.m. edt came the news the big board was trading in ual pending news
- on the exchange floor as soon as ual stopped trading we for a panic said one top floor trader
- several traders could be seen shaking their heads when the news
- for weeks the market had been nervous about takeovers after campeau corp. 's cash crunch spurred concern about the prospects for future highly leveraged takeovers
- and N minutes after the ual trading halt came news that the ual group could n't get financing for its bid
- at this point the dow was down about N points
- the market
- arbitragers could n't dump their ual stock but they rid themselves of nearly every rumor stock they had
- for example their selling caused trading halts to be declared in usair group which closed down N N to N N delta air lines which fell N N to N N and industries which sank N to N N
- these stocks eventually reopened
- but as panic spread speculators began to sell blue-chip stocks such as philip morris and international business machines to offset their losses
- when trading was halted in philip morris the stock was trading at N down N N while ibm closed N N lower at N
- selling because of waves of automatic stop-loss orders which are triggered by computer when prices fall to certain levels
- most of the stock selling pressure came from wall street professionals including computer-guided program traders
- traders said most of their major institutional investors on the other hand sat tight
- now at N one of the market 's post-crash reforms took hold as the s&p N futures contract had plunged N points equivalent to around a drop in the dow industrials
- under an agreement signed by the big board and the chicago mercantile exchange trading was temporarily halted in chicago
- after the trading halt in the s&p N pit in chicago waves of selling continued to hit stocks themselves on the big board and specialists continued to prices down
- as a result the link between the futures and stock markets apart
- without the of stock-index futures the barometer of where traders think the overall stock market is headed many traders were afraid to trust stock prices quoted on the big board
- the futures halt was even by big board floor traders
- it things up said one major specialist
- this confusion effectively halted one form of program trading stock index arbitrage that closely links the futures and stock markets and has been blamed by some for the market 's big swings
- in a stock-index arbitrage sell program traders buy or sell big baskets of stocks and offset the trade in futures to lock in a price difference
- when the airline information came through it every model we had for the marketplace said a managing director at one of the largest program-trading firms
- we did n't even get a chance to do the programs we wanted to do
- but stocks kept falling
- the dow industrials were down N points at N p.m. before the halt
- at N p.m. at the end of the cooling off period the average was down N points
- meanwhile during the the s&p trading halt s&p futures sell orders began up while stocks in new york kept falling sharply
- big board chairman john j. phelan said yesterday the circuit breaker worked well
- i just think it 's at this point to get into a debate if index arbitrage would have helped or hurt things
- under another post-crash system big board president richard mr. phelan was flying to as the market was falling was talking on an hot line to the other exchanges the securities and exchange commission and the federal reserve board
- he out at a high-tech center on the floor of the big board where he could watch on prices and pending stock orders
- at about N p.m. edt s&p futures resumed trading and for a brief time the futures and stock markets started to come back in line
- buyers stepped in to the futures pit
- but the of s&p futures sell orders weighed on the market and the link with stocks began to fray again
- at about N the s&p market to still another limit of N points down and trading was locked again
- futures traders say the s&p was that the dow could fall as much as N points
- during this time small investors began ringing their brokers wondering whether another crash had begun
- at prudential-bache securities inc. which is trying to cater to small investors some brokers thought this would be the final
- that 's when george l. ball chairman of the prudential insurance co. of america unit took to the internal system to declare that the plunge was only mechanical
- i have a that this particular decline today is something more about less
- it would be my to advise clients not to sell to look for an opportunity to buy mr. ball told the brokers
- at merrill lynch & co. the nation 's biggest brokerage firm a news release was prepared merrill lynch comments on market drop
- the release cautioned that there are significant differences between the current environment and that of october N and that there are still attractive investment opportunities in the stock market
- however jeffrey b. lane president of shearson lehman hutton inc. said that friday 's plunge is going to set back relations with customers because it the concern of volatility
- and i think a lot of people will on program trading
- it 's going to bring the debate right back to the
- as the dow average ground to its final N loss friday the s&p pit stayed locked at its trading limit
- jeffrey of program trader investment group said N s&p contracts were for sale on the close the equivalent of $ N million in stock
- but there were no buyers
- while friday 's debacle involved mainly professional traders rather than investors it left the market vulnerable to continued selling this morning traders said
- stock-index futures contracts settled at much lower prices than indexes of the stock market itself
- at those levels stocks are set up to be by index arbitragers who lock in profits by buying futures when futures prices fall and simultaneously sell off stocks
- but nobody knows at what level the futures and stocks will open today
- the between the stock and futures markets friday will undoubtedly cause renewed debate about whether wall street is properly prepared for another crash situation
- the big board 's mr. said our performance was good
- but the exchange will look at the performance of all specialists in all stocks
- obviously we 'll take a close look at any situation in which we think the obligations were n't met he said
- see related story fed ready to big funds wsj oct. N N
- but specialists complain privately that just as in the N crash the firms big investment banks that support the market by trading big blocks of stock stayed on the sidelines during friday 's
- mr. phelan said it will take another day or two to analyze who was buying and selling friday
- concerning your sept. N page-one article on prince charles and the it 's a few hundred years since england has been a kingdom
- it 's now the united kingdom of great britain and northern ireland northern ireland scotland and oh yes england too
- just thought you 'd like to know
- george
- ports of call inc. reached agreements to sell its remaining seven aircraft to buyers that were n't disclosed
- the agreements bring to a total of nine the number of planes the travel company has sold this year as part of a restructuring
- the company said a portion of the $ N million realized from the sales will be used to repay its bank debt and other obligations resulting from the currently suspended operations
- earlier the company announced it would sell its aging fleet of boeing co. because of increasing maintenance costs
- a consortium of private investors operating as funding co. said it has made a $ N million cash bid for most of l.j. hooker corp. 's real-estate and holdings
- the $ N million bid includes the assumption of an estimated $ N million in secured liabilities on those properties according to those making the bid
- the group is led by jay chief executive officer of investment corp. in and a. boyd simpson chief executive of the atlanta-based simpson organization inc
- mr. 's company specializes in commercial real-estate investment and claims to have $ N billion in assets mr. simpson is a developer and a former senior executive of l.j. hooker
- the assets are good but they require more money and management than can be provided in l.j. hooker 's current situation said mr. simpson in an interview
- hooker 's philosophy was to build and sell
- we want to build and hold
- l.j. hooker based in atlanta is operating with protection from its creditors under chapter N of the u.s. bankruptcy code
- its parent company hooker corp. of sydney australia is currently being managed by a court-appointed provisional
- sanford chief executive of l.j. hooker said yesterday in a statement that he has not yet seen the bid but that he would review it and bring it to the attention of the creditors committee
- the $ N million bid is estimated by mr. simpson as representing N N of the value of all hooker real-estate holdings in the u.s.
- not included in the bid are teller or b. altman & co. l.j. hooker 's department-store chains
- the offer covers the massive N forest fair mall in cincinnati the N fashion mall in columbia s.c. and the N town center mall in
- the mall opened sept. N with a 's as its the columbia mall is expected to open nov. N
- other hooker properties included are a office tower in atlanta expected to be completed next february vacant land sites in florida and ohio l.j. hooker international the commercial real-estate brokerage company that once did business as merrill lynch commercial real estate plus other shopping centers
- the consortium was put together by the london-based investment banking company that is a subsidiary of security pacific corp
- we do n't anticipate any problems in raising the funding for the bid said campbell the head of mergers and acquisitions at in an interview
- is acting as the consortium 's investment bankers
- according to people familiar with the consortium the bid was project a reference to the film in which a played by actress is saved from a businessman by a police officer named john
- l.j. hooker was a small company based in atlanta in N when mr. simpson was hired to push it into commercial development
- the company grew modestly until N when a majority position in hooker corp. was acquired by australian developer george currently hooker 's chairman
- mr. to launch an ambitious but $ N billion acquisition binge that included teller and b. altman & co. as well as majority positions in merksamer jewelers a sacramento chain inc. the retailer and inc. the southeast department-store chain
- eventually mr. simpson and mr. had a falling out over the direction of the company and mr. simpson said he resigned in N
- since then hooker corp. has sold its interest in the chain back to 's management and is currently attempting to sell the b. altman & co. chain
- in addition robert chief executive of the chain is seeking funds to buy out the hooker interest in his company
- the merksamer chain is currently being offered for sale by first boston corp
- reached in mr. said that he believes the various hooker can become profitable with new management
- these are n't mature assets but they have the potential to be so said mr.
- managed properly and with a long-term outlook these can become investment-grade quality properties
- canadian production totaled N metric tons in the week ended oct. N up N N from the preceding week 's total of N tons statistics canada a federal agency said
- the week 's total was up N N from N tons a year earlier
- the total was N tons up N N from N tons a year earlier
- the treasury plans to raise $ N million in new cash thursday by selling about $ N billion of 52-week bills and $ N billion of maturing bills
- the bills will be dated oct. N and will mature oct. N N
- they will be available in minimum denominations of $ N
- bids must be received by N p.m. edt thursday at the treasury or at federal reserve banks or branches
- as small investors their mutual funds with phone calls over the weekend big fund managers said they have a strong defense against any wave of withdrawals cash
- unlike the weekend before black monday the funds were n't with heavy withdrawal requests
- and many fund managers have built up cash levels and say they will be buying stock this week
- at fidelity investments the nation 's largest fund company telephone volume was up sharply but it was still at just half the level of the weekend preceding black monday in N
- the boston firm said redemptions were running at less than one-third the level two years ago
- as of yesterday afternoon the redemptions represented less than N N of the total cash position of about $ N billion of fidelity 's stock funds
- two years ago there were massive redemption levels over the weekend and a lot of fear around said c. bruce who runs fidelity investments ' $ N billion fund
- this feels more like a deal
- people are n't
- the test may come today
- friday 's stock market sell-off came too late for many investors to act
- some shareholders have held off until today because any fund exchanges made after friday 's close would take place at today 's closing prices
- stock fund redemptions during the N debacle did n't begin to until after the market opened on black monday
- but fund managers say they 're ready
- many have raised cash levels which act as a buffer against steep market declines
- mario for instance holds cash positions well above N N in several of his funds
- windsor fund 's john and mutual series ' michael price said they had raised their cash levels to more than N N and N N respectively this year
- even peter lynch manager of fidelity 's $ N billion fund the nation 's largest stock fund built up cash to N N or $ N million
- one reason is that after two years of monthly net redemptions the fund posted net inflows of money from investors in august and september
- i 've let the money build up mr. lynch said who added that he has had trouble finding stocks he likes
- not all funds have raised cash levels of course
- as a group stock funds held N N of assets in cash as of august the latest figures available from the investment company institute
- that was modestly higher than the N N and N N levels in august and september of N
- also persistent redemptions would force some fund managers to dump stocks to raise cash
- but a strong level of investor withdrawals is much more unlikely this time around fund managers said
- a major reason is that investors already have sharply scaled back their purchases of stock funds since black monday
- sales have rebounded in recent months but monthly net purchases are still running at less than half N levels
- there 's not nearly as much said john chairman of vanguard group inc. a big valley forge pa. fund company
- many fund managers argue that now 's the time to buy
- vincent manager of the $ N billion wellington fund added to his positions in bristol-myers squibb woolworth and dun & bradstreet friday
- and today he 'll be looking to buy drug stocks like eli lilly pfizer and american home products whose dividend yields have been bolstered by stock declines
- fidelity 's mr. lynch for his part snapped up southern co. shares friday after the stock got
- if the market drops further today he said he 'll be buying blue chips such as bristol-myers and kellogg
- if they stocks like that he said it presents an opportunity that is the kind of thing you dream about
- major mutual-fund groups said phone calls were at twice the normal weekend pace yesterday
- but most investors were seeking share prices and other information
- trading volume was only modestly higher than normal
- still fund groups are n't taking any chances
- they hope to avoid the phone lines and other that some fund investors in october N
- fidelity on saturday opened its N investor centers across the country
- the centers normally are closed through the weekend
- in addition east coast centers will open at N edt this morning instead of the normal N
- t. rowe price associates inc. increased its staff of phone representatives to handle investor requests
- the group noted that some investors moved money from stock funds to money-market funds
- but most investors seemed to be in an information mode rather than in a transaction mode said steven a vice president
- and vanguard among other groups said it was adding more phone representatives today to help investors get through
- in an unusual move several funds moved to calm investors with on their phone lines
- we view friday 's market decline as offering us a buying opportunity as long-term investors a recording at & co. funds said over the weekend
- the group had a similar recording for investors
- several fund managers expect a rough market this morning before prices stabilize
- some early selling is likely to stem from investors and portfolio managers who want to lock in this year 's fat profits
- stock funds have averaged a staggering gain of N N through september according to lipper analytical services inc
- who runs shearson lehman hutton inc. 's $ N million sector analysis portfolio predicts the market will open down at least N points on technical factors and some panic selling
- but she expects prices to rebound soon and is telling investors she expects the stock market wo n't decline more than N N to N N from recent highs
- this is not a major crash she said
- nevertheless ms. said she was with phone calls over the weekend from nervous shareholders
- half of them are really scared and want to sell she said but i 'm trying to talk them out of it
- she added if they all were bullish i 'd really be upset
- the backdrop to friday 's slide was different from that of the october N crash fund managers argue
- two years ago unlike today the dollar was weak interest rates were rising and the market was very they say
- from the investors ' standpoint institutions and individuals learned a painful lesson by selling at the lows on black monday said stephen boesel manager of the $ N million t. rowe price growth and income fund
- this time i do n't think we 'll get a panic reaction
- newport corp. said it expects to report earnings of between N cents and N cents a share somewhat below analysts ' estimates of N cents to N cents
- the maker of scientific instruments and laser parts said orders fell below expectations in recent months
- a spokesman added that sales in the current quarter will about equal the quarter 's figure when newport reported net income of $ N million or N cents a share on $ N million in sales
- from the strike by N machinists union members against boeing co. reached air carriers friday as america west airlines announced it will postpone its new service out of houston because of delays in receiving aircraft from the seattle jet maker
- peter vice president for planning at the phoenix ariz. carrier said in an interview that the work at boeing now entering its 13th day has caused some turmoil in our scheduling and that more than N passengers who were booked to fly out of houston on america west would now be put on other airlines
- mr. said boeing told america west that the N it was supposed to get this thursday would n't be delivered until nov. N the day after the airline had been planning to service at houston with four daily flights including three to phoenix and one to las vegas
- now those routes are n't expected to begin until jan
- boeing is also supposed to send to america west another N aircraft as well as a N by year 's end
- those too are almost certain to arrive late
- at this point no other america west flights including its new service at san antonio texas newark n.j. and calif. have been affected by the delays in boeing deliveries
- nevertheless the company 's reaction the effect that a huge manufacturer such as boeing can have on other parts of the economy
- it also is sure to help the machinists put added pressure on the company
- i just do n't feel that the company can really stand or would want a prolonged tom baker president of machinists ' district N said in an interview yesterday
- i do n't think their customers would like it very much
- america west though is a smaller airline and therefore more affected by the delayed delivery of a single plane than many of its competitors would be
- i figure that american and united probably have such a hard time counting all the planes in their fleets they might not miss one at all mr. said
- indeed a random check friday did n't seem to indicate that the strike was having much of an effect on other airline operations
- southwest airlines has a boeing N set for delivery at the end of this month and expects to have the plane on time
- it 's so close to completion boeing 's told us there wo n't be a problem said a southwest spokesman
- a spokesman for amr corp. said boeing has assured american airlines it will deliver a N on time later this month
- american is preparing to take delivery of another N in early december and N more next year and is n't anticipating any changes in that timetable
- in seattle a boeing spokesman explained that the company has been in constant communication with all of its customers and that it was impossible to predict what further disruptions might be triggered by the strike
- meanwhile supervisors and employees have been trying to finish some N aircraft mostly N and N jumbo jets at the company 's wash. plant that were all but completed before the
- as of friday four had been delivered and a fifth plane a N was supposed to be out over the weekend to air china
- no date has yet been set to get back to the bargaining table
- we want to make sure they know what they want before they come back said doug hammond the federal mediator who has been in contact with both sides since the strike began
- the investment community for one has been anticipating a resolution
- though boeing 's stock price was battered along with the rest of the market friday it actually has risen over the last two weeks on the strength of new orders
- the market has taken two views that the labor situation will get settled in the short term and that things look very for boeing in the long term said howard an analyst at j. lawrence inc
- boeing 's shares fell $ N friday to close at $ N in composite trading on the new york stock exchange
- but mr. baker said he thinks the earliest a pact could be struck would be the end of this month that the company and union may resume negotiations as early as this week
- still he said it 's possible that the strike could last considerably longer
- i would n't expect an immediate resolution to anything
- last week boeing chairman frank sent striking workers a letter saying that to my knowledge boeing 's offer represents the best overall three-year contract of any major u.s. industrial firm in recent history
- but mr. baker called the letter and the company 's offer of a N N wage increase over the life of the pact plus bonuses very weak
- he added that the company the union 's resolve and the workers ' with being forced to work many hours overtime
- in separate developments talks have broken off between machinists representatives at lockheed corp. and the calif. aerospace company
- the union is continuing to work through its expired contract however
- it had planned a strike vote for next sunday but that has been pushed back indefinitely
- united auto workers local N which represents N workers at boeing 's helicopter unit in delaware county pa. said it agreed to extend its contract on a basis with a notification to cancel while it continues bargaining
- the accord expired yesterday
- and boeing on friday said it received an order from for four model N valued at a total of about $ N million
- the planes long range versions of the