CWSNER

Chinese Word Segmentation and Named Entity Recognition

Requirement

  • Split the training data 7:3 (train:test) and evaluate the model with P/R/F1 metrics
  • Train the model on the entire training data, run it on the test data, and submit the predictions in the same format as the training data
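
The two requirements above can be sketched roughly as follows. This is a minimal illustration, not the code in `cws_crf.py` or `ner_crf.py`: `split_train_test` and `prf1` are hypothetical helper names, the shuffle seed is arbitrary, and scoring per tag over individual tokens is an assumption (a word-level or entity-level score would be stricter).

```python
import random

def split_train_test(sentences, train_parts=7, test_parts=3, seed=0):
    """Shuffle a copy of the sentences and split it train:test (7:3 by default)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    total = train_parts + test_parts
    # Integer arithmetic, rounded to the nearest sentence.
    cut = (len(shuffled) * train_parts + total // 2) // total
    return shuffled[:cut], shuffled[cut:]

def prf1(gold_tags, pred_tags, positive):
    """Token-level precision, recall and F1 for a single tag."""
    pairs = list(zip(gold_tags, pred_tags))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

train, test = split_train_test(range(95304))
precision, recall, f1 = prf1(["B", "I", "N", "N"], ["B", "N", "N", "B"], "B")
```

With 95304 sentences, rounding to the nearest sentence gives 66713 training sentences, the same count listed for the 70% CWS split.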

The data files are encoded as UTF-16 little endian.
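
Reading and writing such files in Python might look like the sketch below. The helper names, the file name and the sample sentences are made up for illustration, and the actual line layout of the corpus files is not assumed here. Note that Python's `utf-16-le` codec does not strip a byte order mark, so the reader drops one if it is present.

```python
import os
import tempfile

def read_utf16le_lines(path):
    """Read a UTF-16 LE text file into a list of lines without newlines."""
    with open(path, encoding="utf-16-le") as f:
        text = f.read()
    # The utf-16-le codec keeps a leading byte order mark, so drop it if present.
    return text.lstrip("\ufeff").splitlines()

def write_utf16le_lines(path, lines):
    """Write lines back as UTF-16 LE, one line per entry."""
    with open(path, "w", encoding="utf-16-le", newline="\n") as f:
        f.write("\n".join(lines) + "\n")

# Round trip on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    write_utf16le_lines(sample_path, ["時間 就 是 金錢", "台北 很 美"])
    demo_lines = read_utf16le_lines(sample_path)
```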

Corpus

This is a Traditional Chinese corpus.

CWS Data

  • Max sentence (sequence) length: 165 (training data max: 164)
  • Total unique words (including PAD): 4744
  • Training data size
    • examples (sentences): 66713 (70% training data)
    • examples (sentences): 95304 (100% training data)
    • words (max sentence length): 165
    • features (one-hot encoded, i.e. total unique words): 4744
    • tags (cws tags): 4

NER Data

  • Max sentence (sequence) length: 374 (training data max > test data max)
  • Total unique words (including PAD): 4379
  • Training data size
    • examples (sentences): 25434 (70% training data)
    • examples (sentences): 36334 (100% training data)
    • words (max sentence length): 374
    • features (one-hot encoded, i.e. total unique words): 4379
    • tags (ner tags): 7 (PER x 2 + LOC x 2 + ORG x 2 + N)
Tag   Category
PER   Person
LOC   Location
ORG   Organization

A named entity starts with B-[Tag]; if it spans multiple words, the following words are tagged I-[Tag].

A word that is not part of a named entity gets the N tag.
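
As a small illustration of this tagging scheme, the sketch below converts entity spans to tags and back. The function names, the `(start, end, category)` span representation and the example sentence are assumptions made for this example, not part of the corpus format.

```python
def spans_to_tags(num_words, entities):
    """Build one tag per word from (start, end_exclusive, category) spans."""
    tags = ["N"] * num_words
    for start, end, category in entities:
        tags[start] = "B-" + category          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + category          # remaining words of the entity
    return tags

def tags_to_spans(tags):
    """Recover (start, end_exclusive, category) spans from a tag sequence."""
    spans = []
    start = category = None
    # A trailing "N" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["N"]):
        continues = tag.startswith("I-") and tag[2:] == category
        if start is not None and not continues:
            spans.append((start, i, category))
            start = category = None
        if tag.startswith("B-"):
            start, category = i, tag[2:]
    return spans

words = ["王", "小", "明", "在", "台", "北"]   # made-up sentence
entities = [(0, 3, "PER"), (4, 6, "LOC")]
tags = spans_to_tags(len(words), entities)
```

Here `tags` is `["B-PER", "I-PER", "I-PER", "N", "B-LOC", "I-LOC"]`, and `tags_to_spans(tags)` returns the original spans.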

Report

  • Method/Approach
  • Experiment Settings and Steps
  • The 30% test data evaluation result
  • Question analysis and discussion

Submission files should be named Name-ID.seg and Name-ID.ner

Usage

Train and Predict

  • python3 cws_crf.py
  • python3 ner_crf.py

Chinese Word Segmentation

CWS Evaluation

previous notes

Named Entity Recognition

NER Evaluation

Model

CRF

BiLSTM + CRF

cws

ner

Resources

TensorFlow CRF

Example

Not same task but similar model

--

Appendix

Links

Python Logging

TensorFlow notes

One-hot solutions

One-time transform

When I tried the approaches below, they consumed more than 50 GB of memory.

  • Numpy
    • np.eye
      • np.eye(num_features, dtype=np.uint8)[numpy_dataset]
    • np.eye + np.reshape
      • np.squeeze(np.eye(num_features, dtype=np.uint8)[numpy_dataset.reshape(-1)]).reshape([num_examples, num_words, num_features])
  • scipy.sparse: currently doesn't support 3-dimensional matrices (SciPy issue: 3D sparse matrices #8868)
    • list + scipy.sparse.eye => np.array
      • sparse.eye(num_features, dtype=np.uint8).tolil()[numpy_dataset.reshape(-1)].toarray().reshape((num_examples, num_words, num_features)) (doesn't work)
    • list + scipy.sparse.eye -> tf.sparse.SparseTensor: this requires modifying the network structure (X)
  • Keras
  • Pandas: doesn't seem to support 3-dimensional data either
  • TensorFlow: this requires modifying the network structure, which limits generalization
  • Scikit Learn
    • sklearn.preprocessing.OneHotEncoder: this requires the "original word: encoding" pairs as input, which is not what I want
    • sklearn.preprocessing.LabelBinarizer: can't transform 3-dimensional data
  • lazyarray
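
To make the `np.eye` approach and its memory cost concrete, here is a minimal sketch on a tiny made-up array, together with a size estimate for the dense result using the corpus figures listed earlier. The helper names are hypothetical, and the estimate assumes one byte per element (`np.uint8`) with no sparsity.

```python
import numpy as np

def one_hot(indices, num_features):
    """One-hot encode an integer array of any shape; adds a trailing axis."""
    return np.eye(num_features, dtype=np.uint8)[indices]

def dense_one_hot_bytes(num_examples, num_words, num_features, itemsize=1):
    """Bytes needed to hold the fully materialized one-hot array."""
    return num_examples * num_words * num_features * itemsize

tiny = np.array([[1, 2, 0], [3, 0, 0]])      # 2 sentences, 3 words each
encoded = one_hot(tiny, num_features=4)      # shape (2, 3, 4)

cws_70 = dense_one_hot_bytes(66713, 165, 4744)     # 70% of the CWS data
cws_full = dense_one_hot_bytes(95304, 165, 4744)   # 100% of the CWS data
ner_full = dense_one_hot_bytes(36334, 374, 4379)   # 100% of the NER data
```

The estimates come out at roughly 52 GB, 75 GB and 60 GB respectively, which is consistent with the memory use of more than 50G reported above for one-time transforms.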

Batch transform
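
One possible reading of a batch transform is to one-hot encode only the sentences in the current batch, so the full 3-dimensional array never has to exist in memory at once. The generator below is a sketch under that assumption; `one_hot_batches` is a hypothetical name and is not taken from the training scripts.

```python
import numpy as np

def one_hot_batches(dataset, num_features, batch_size):
    """Yield one-hot encoded slices of an integer dataset, one batch at a time."""
    identity = np.eye(num_features, dtype=np.uint8)
    for start in range(0, len(dataset), batch_size):
        # Fancy indexing encodes only this batch: (batch, num_words, num_features).
        yield identity[dataset[start:start + batch_size]]

dataset = np.array([[1, 2, 0], [3, 0, 0], [2, 2, 1], [0, 0, 0], [3, 1, 0]])
batch_shapes = [b.shape for b in one_hot_batches(dataset, num_features=4, batch_size=2)]
```

Peak memory then scales with `batch_size * num_words * num_features` rather than with the number of sentences in the corpus.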

Transform back from one-hot

  • Numpy
    • np.argmax
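
A minimal sketch of the reverse step, assuming the one-hot axis is the last one; `from_one_hot` is a hypothetical helper name. The same call also turns per-tag scores from a model into predicted tag indices.

```python
import numpy as np

def from_one_hot(encoded):
    """Recover integer indices from a one-hot array along its last axis."""
    return np.argmax(encoded, axis=-1)

indices = np.array([[1, 2, 0], [3, 0, 0]])
encoded = np.eye(4, dtype=np.uint8)[indices]   # shape (2, 3, 4)
decoded = from_one_hot(encoded)                # shape (2, 3)
```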