
Java implementation of SentencePiece tokenization

SentencePiece is a text tokenizer open-sourced by Google. Its core idea is to use statistical algorithms to learn, from a corpus, a segmenter-like model, along with the ability to convert words into tokens.


Run the example - SpTokenizerExample

After a successful run, the command line should show the following output:


# Test tokenization and recovering the sentence from its tokens
[INFO ] - Test Tokenize
[INFO ] - Input sentence: Hello World
[INFO ] - Tokens: [▁He, ll, o, ▁, W, or, l, d]
[INFO ] - Recovered sentence: Hello World

# Test Encode (generate ids) and Decode (recover the sentence from ids)
[INFO ] - Test Encode & Decode
[INFO ] - Input sentence: Hello World
[INFO ] - Ids: [151, 88, 21, 4, 321, 54, 31, 17]
[INFO ] - Recovered sentence: Hello World

# Test GetToken: look up tokens by id
[INFO ] - Test GetToken
[INFO ] - ids: [151, 88, 21, 4, 321, 54, 31, 17]
[INFO ] - ▁He
[INFO ] - ll
[INFO ] - o
[INFO ] - ▁
[INFO ] - W
[INFO ] - or
[INFO ] - l
[INFO ] - d

# Test GetId: look up ids by token
[INFO ] - Test GetId
[INFO ] - tokens: [▁He, ll, o, ▁, W, or, l, d]
[INFO ] - 151
[INFO ] - 88
[INFO ] - 21
[INFO ] - 4
[INFO ] - 321
[INFO ] - 54
[INFO ] - 31
[INFO ] - 17
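
The code that produces this output lives in SpTokenizerExample. As a minimal sketch of the same flow, assuming the SDK wraps DJL's ai.djl.sentencepiece extension (the pom.xml's 0.17.0 upgrade matches DJL's versioning); the class name and model path below are illustrative:

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import ai.djl.sentencepiece.SpProcessor;
import ai.djl.sentencepiece.SpTokenizer;

public class SpTokenizerSketch {
    public static void main(String[] args) throws IOException {
        // Illustrative path; the repo keeps test models under build/test/models.
        Path modelPath = Paths.get("build/test/models/sp_model.model");

        try (SpTokenizer tokenizer = new SpTokenizer(modelPath)) {
            String input = "Hello World";

            // Tokenize, then rebuild the sentence from its tokens.
            List<String> tokens = tokenizer.tokenize(input);
            String recovered = tokenizer.buildSentence(tokens);

            // Encode to ids and decode back through the underlying processor.
            SpProcessor processor = tokenizer.getProcessor();
            int[] ids = processor.encode(input);
            String decoded = processor.decode(ids);

            // Map ids back to tokens, and tokens back to ids.
            String firstToken = processor.getToken(ids[0]);
            int firstId = processor.getId(tokens.get(0));

            System.out.println("Tokens: " + tokens);
            System.out.println("Recovered sentence: " + recovered);
            System.out.println("Decoded sentence: " + decoded);
            System.out.println("First token: " + firstToken + ", first id: " + firstId);
        }
    }
}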

How to train a model?

Reference: https://github.com/google/sentencepiece/blob/master/README.md

1. Build and install sentencepiece

% git clone https://github.com/google/sentencepiece.git 
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

2. Train a model:

% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
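
As an illustration (the corpus file name and settings below are hypothetical), training an 8k-vocabulary unigram model on a plain-text corpus could look like:

% spm_train --input=corpus.txt --model_prefix=mymodel --vocab_size=8000 --character_coverage=0.9995 --model_type=unigram

Training writes <model_name>.model and <model_name>.vocab; the .model file is what the Java tokenizer loads.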

Official website:

Official website link

Git repositories

GitHub link
Gitee link