fastNLP/reproduction/HAN-document_classification

Introduction

This is a PyTorch implementation of the paper Hierarchical Attention Networks for Document Classification.

  • The dataset is 600k documents extracted from Yelp 2018 customer reviews
  • NLTK and Stanford CoreNLP are used to tokenize documents and sentences
  • Supports both CPU and GPU
  • The best accuracy is 71%, matching the performance reported in the paper
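
The hierarchical model consumes each document as a list of sentences, and each sentence as a list of words. The sketch below shows only that nested shape. It uses a simple regular-expression splitter as a stand-in, not NLTK or Stanford CoreNLP, and the function name `split_document` is hypothetical rather than taken from preprocess.py.

```python
import re

def split_document(text):
    """Split a document into sentences, then each sentence into lowercase word tokens.

    Regex stand-in for a real tokenizer (an assumption, not the repository's code).
    """
    # Break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters, digits or apostrophes; punctuation is dropped.
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]

doc = "The food was great. Service was slow, though! Would I return? Yes."
for sentence in split_document(doc):
    print(sentence)
```

A real tokenizer handles abbreviations, quotes and similar cases that this splitter gets wrong, so treat it as an illustration of the data layout only.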

Requirements

  • python 3.6
  • pytorch = 0.3.0
  • numpy
  • gensim
  • nltk
  • coreNLP

Parameters

Based on the paper and my own experiments, I set the model parameters as follows:

| word embedding dimension | GRU hidden size | GRU layer | word/sentence context vector dimension |
|---|---|---|---|
| 200 | 50 | 1 | 100 |
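
To show how these sizes fit together, here is a minimal pure-Python sketch of the attention pooling described in the paper: u_t = tanh(W h_t + b), alpha_t = softmax(u_t · context), pooled = sum of alpha_t h_t. It assumes a bidirectional GRU, so each annotation has 2 x 50 = 100 dimensions. The function name `attention_pool`, the random weights and the 7-word sentence are hypothetical and not taken from model.py.

```python
import math
import random

def attention_pool(annotations, W, b, context):
    """Return (weights, pooled) for a list of annotation vectors."""
    scores = []
    for h in annotations:
        # Hidden representation u_t = tanh(W h_t + b).
        u = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
        # Similarity of u_t with the learned context vector.
        scores.append(sum(u_i * c_i for u_i, c_i in zip(u, context)))
    # Numerically stable softmax over the time steps.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the annotations.
    pooled = [sum(a * h[j] for a, h in zip(weights, annotations))
              for j in range(len(annotations[0]))]
    return weights, pooled

def rand_vec(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]

random.seed(0)
HIDDEN = 50          # GRU hidden size from the table
ANNOT = 2 * HIDDEN   # assumption: forward and backward GRU states concatenated
CONTEXT = 100        # word/sentence context vector dimension from the table

W = [rand_vec(ANNOT) for _ in range(CONTEXT)]
b = rand_vec(CONTEXT)
context = rand_vec(CONTEXT)
annotations = [rand_vec(ANNOT) for _ in range(7)]  # hypothetical 7-word sentence

weights, sentence_vector = attention_pool(annotations, W, b, context)
print(len(weights), len(sentence_vector))
```

The same pooling is applied twice in the hierarchical model: once over words to build sentence vectors, and once over sentences to build the document vector.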

And the training parameters:

| epochs | learning rate | momentum | batch size |
|---|---|---|---|
| 3 | 0.01 | 0.9 | 64 |
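
As an illustration of what the learning rate and momentum control, the sketch below applies SGD with momentum to a toy objective. It assumes the update form used by PyTorch's `torch.optim.SGD` without dampening, weight decay or Nesterov momentum: v = momentum * v + grad, then p = p - lr * v. Whether train.py uses exactly this optimizer is an assumption, and `sgd_momentum_step` is a hypothetical helper.

```python
LEARNING_RATE = 0.01  # from the table
MOMENTUM = 0.9        # from the table

def sgd_momentum_step(params, grads, velocity, lr=LEARNING_RATE, momentum=MOMENTUM):
    """Update params and velocity in place and return params."""
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] + g  # accumulate past gradients
        params[i] -= lr * velocity[i]             # step along the velocity
    return params

# Toy objective f(w) = w^2 with gradient 2w (hypothetical, for illustration only).
w = [1.0]
v = [0.0]
for step in range(3):
    grad = [2.0 * w[0]]
    sgd_momentum_step(w, grad, v)
    print(step, round(w[0], 6))
```

With momentum 0.9 the velocity keeps most of the previous update, so each successive step on this toy problem is larger than the one before it.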

Run

  1. Prepare the dataset. Download the dataset and unzip the customer reviews into a single file. Use preprocess.py to transform that file into the dataset used as model input.
  2. Train the model. The word embeddings for the training data are stored in 'yelp.word2vec'. The model is trained and automatically saved to 'model.dict'.
python train.py
  3. Test the model.
python evaluate.py