Multi-Criteria-CWS

An implementation of "Multi-Criteria Chinese Word Segmentation with Transformer", built with fastNLP.

Dataset

Overview

We use the same datasets as listed in the paper.

  • sighan2005
    • pku
    • msr
    • as
    • cityu
  • sighan2008
    • ctb
    • ckip
    • cityu (combined with data in sighan2005)
    • ncc
    • sxu

Preprocess

First, install OpenCC (the Python reimplementation), which is used to convert between Traditional Chinese and Simplified Chinese.

pip install opencc-python-reimplemented
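As a rough illustration of what the conversion step looks like in Python, the sketch below wraps the `OpenCC` class from `opencc-python-reimplemented`. The helper names `make_t2s_converter` and `convert_lines`, and the choice of the `t2s` (Traditional to Simplified) configuration, are assumptions made for this example; the preprocessing scripts in this directory may convert in a different direction or in a different way.

```python
from typing import Callable, Iterable, List


def make_t2s_converter() -> Callable[[str], str]:
    """Return a Traditional-to-Simplified converter backed by OpenCC.

    Hypothetical helper: the import is done lazily so that the rest of
    this sketch can be used without the package installed.
    """
    from opencc import OpenCC  # provided by opencc-python-reimplemented

    return OpenCC("t2s").convert


def convert_lines(lines: Iterable[str], convert: Callable[[str], str]) -> List[str]:
    """Strip the trailing newline from each line and apply `convert` to it."""
    return [convert(line.rstrip("\n")) for line in lines]


# Example usage (requires the package installed above):
#   with open("path/to/corpus.txt", encoding="utf-8") as f:
#       simplified = convert_lines(f, make_t2s_converter())
```

Passing the converter in as a plain callable keeps the file handling separate from OpenCC, so a different configuration can be substituted without touching the loop.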

Then, set a path to save processed data, and run the shell script to process the data.

export DATA_DIR=path/to/processed-data
bash make_data.sh path/to/sighan2005 path/to/sighan2008

Processing takes a few minutes to finish.

Model

The model is built on a Transformer, as described in the paper.
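For readers new to the task, word segmentation is commonly framed as per-character tagging, which is the kind of output a Transformer encoder can predict. The sketch below assumes the widely used BMES scheme (Begin, Middle, End, Single); it is not taken from `model.py`, and the function names are hypothetical. Refer to the paper and the code for the tag set and input format actually used here.

```python
from typing import List, Tuple


def words_to_bmes(words: List[str]) -> Tuple[List[str], List[str]]:
    """Turn a segmented sentence into parallel character and BMES tag lists."""
    chars: List[str] = []
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        chars.extend(word)
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return chars, tags


def bmes_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted BMES tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends on E or S
            words.append(current)
            current = ""
    if current:  # keep a trailing, unterminated word
        words.append(current)
    return words
```

For example, the segmentation `["我", "喜欢", "自然语言"]` maps to the tags `S B E B M M E`, and decoding those tags recovers the same three words.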

Train

Finally, run the shell script to train the model. train.sh takes one argument, a comma-separated list of the GPU IDs to use, for example:

bash train.sh 0,1

This command uses the GPUs with IDs 0 and 1.
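If you wrap the training script in your own tooling, it can help to validate the GPU-ID argument first. The sketch below parses a string such as `0,1` and builds a `CUDA_VISIBLE_DEVICES` setting from it. Both function names are hypothetical, and whether train.sh passes the IDs on through `CUDA_VISIBLE_DEVICES` is an assumption; check the script itself.

```python
from typing import Dict, List


def parse_gpu_ids(arg: str) -> List[int]:
    """Parse a comma-separated GPU-ID string such as "0,1" into integers."""
    ids = [int(part) for part in arg.split(",") if part.strip()]
    if not ids:
        raise ValueError("expected at least one GPU ID")
    if any(i < 0 for i in ids):
        raise ValueError("GPU IDs must be non-negative")
    if len(set(ids)) != len(ids):
        raise ValueError("GPU IDs must be unique")
    return ids


def visible_devices_env(arg: str) -> Dict[str, str]:
    """Build an environment mapping that restricts CUDA to the given GPUs.

    Assumption: the training process honours CUDA_VISIBLE_DEVICES.
    """
    ids = parse_gpu_ids(arg)
    return {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in ids)}
```

The resulting mapping can be merged into `os.environ` before launching a subprocess, so a malformed argument fails early rather than partway through start-up.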

Note: Please refer to the paper for details of the hyper-parameters, and modify the settings in train.sh to match your experiment environment.

Run

python main.py --help

to list all the arguments that can be specified for training.

Performance

Results on the test sets of eight CWS datasets with multi-criteria learning.

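| Dataset        | MSRA  | AS    | PKU   | CTB   | CKIP  | CITYU | NCC   | SXU   | Avg.  |
|----------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Original paper | 98.05 | 96.44 | 96.41 | 96.99 | 96.51 | 96.91 | 96.04 | 97.61 | 96.87 |
| Ours           | 96.92 | 95.71 | 95.65 | 95.96 | 96.00 | 96.09 | 94.61 | 96.64 | 95.95 |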
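The Avg. column can be checked directly from the per-dataset scores. The short sketch below uses only the numbers reported above and assumes Avg. is the unweighted mean over the eight datasets, which the arithmetic bears out; the variable names are chosen for this example.

```python
DATASETS = ["MSRA", "AS", "PKU", "CTB", "CKIP", "CITYU", "NCC", "SXU"]
PAPER = [98.05, 96.44, 96.41, 96.99, 96.51, 96.91, 96.04, 97.61]
OURS = [96.92, 95.71, 95.65, 95.96, 96.00, 96.09, 94.61, 96.64]


def mean(scores):
    """Unweighted mean of a list of scores."""
    return sum(scores) / len(scores)


# Per-dataset gap between the paper's result and this reproduction.
gaps = {name: round(p - o, 2) for name, p, o in zip(DATASETS, PAPER, OURS)}

print(round(mean(PAPER), 2))  # 96.87
print(round(mean(OURS), 2))   # 95.95
print(max(gaps, key=gaps.get), max(gaps.values()))  # NCC 1.43
```

The reproduction trails the paper by roughly 0.9 points on average, with the largest gap on NCC and the smallest on CKIP.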