
Chinese Word Segmentation and Named Entity Recognition

Requirement

Split the training data 7:3 (train:test) and evaluate the result with P/R/F1 (precision, recall, F1) metrics
Train the model on the entire training data, predict on the test data, and submit the predictions in the same format as the training data
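
The 7:3 split can be sketched as below. This is a minimal illustration, not necessarily how the scripts in this repo do it: `split_sentences`, the shuffling, and the seed are assumptions. Rounding the 70% boundary happens to give the same example counts that are listed in the Corpus section.

```python
import random


def split_sentences(sentences, train_ratio=0.7, seed=0):
    """Shuffle the sentences and split them into (train, test) lists.

    Hypothetical helper: the ratio follows the requirement above,
    the shuffle and the seed are illustrative choices.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in data: integers instead of real sentences, sized like the CWS corpus.
train, test = split_sentences(range(95304))
print(len(train), len(test))
```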

The data files are encoded as UTF-16 little endian
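
Reading such a file in Python only needs the right `encoding` argument. The sketch below is an assumption about usage, not code from this repo: `read_utf16le_lines` is a hypothetical helper, and the sample text (including its space-separated layout) is made up for the round-trip demo.

```python
import os
import tempfile


def read_utf16le_lines(path):
    """Read a UTF-16 LE text file into a list of lines without newlines."""
    with open(path, encoding="utf-16-le") as f:
        text = f.read()
    # The utf-16-le codec keeps a leading byte order mark, so drop it if present.
    return text.lstrip("\ufeff").splitlines()


# Round-trip demo on a temporary file (made-up sample content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-16-le") as f:
        f.write("\ufeff今天 天氣 很 好\n我 住 在 台北\n")
    lines = read_utf16le_lines(path)

print(lines)
```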

Corpus

This is a Traditional Chinese corpus.

CWS Data

Max sentence (sequence) length: 165 (training data max: 164)
Total unique words (including PAD): 4744
Training data size
- examples (sentences): 66713 (70% of the training data)
- examples (sentences): 95304 (100% of the training data)
- words (max sentence length): 165
- features (one-hot encoded, i.e. total unique words): 4744
- tags (CWS tags): 4

NER Data

Max sentence (sequence) length: 374 (training data max > test data max)
Total unique words (including PAD): 4379
Training data size
- examples (sentences): 25434 (70% of the training data)
- examples (sentences): 36334 (100% of the training data)
- words (max sentence length): 374
- features (one-hot encoded, i.e. total unique words): 4379
- tags (NER tags): 7 (PER x 2 + LOC x 2 + ORG x 2 + N)

Tag	Category
PER	Person
LOC	Location
ORG	Organization

A named entity starts with B-[Tag]; if it spans multiple words, the following words are tagged I-[Tag].

A word that is not part of a named entity gets the N tag.
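
The tagging scheme above can be illustrated with a small sketch. `bio_tags` and the `(start, end, category)` span format are hypothetical, and the example sentence is made up; only the B-/I-/N convention comes from the description above.

```python
def bio_tags(n_tokens, entities):
    """Build a tag sequence from entity spans.

    entities: iterable of (start, end, category) with end exclusive,
    e.g. (0, 3, "PER"). Tokens outside every span get the N tag.
    """
    tags = ["N"] * n_tokens
    for start, end, category in entities:
        tags[start] = f"B-{category}"       # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"       # remaining words of the entity
    return tags


tokens = list("王小明住在台北")
tags = bio_tags(len(tokens), [(0, 3, "PER"), (5, 7, "LOC")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```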

Report

Method/Approach
Experiment Settings and Steps
The 30% test data evaluation result
Question analysis and discussion

Submissions should be named Name-ID.seg and Name-ID.ner

Usage

Train and Predict

python3 cws_crf.py
python3 ner_crf.py

Chinese Word Segmentation

CWS Evaluation

previous notes

[Original] How to evaluate the segmentation quality of a Chinese word segmenter

Named Entity Recognition

NER Evaluation

sklearn_crfsuite Evaluation

Named-Entity evaluation metrics based on entity-level

example-full-named-entity-evaluation.ipynb

Performance per label type per token
- sklearn_crfsuite.metrics
Performance over full named-entity
- davidsbatista/NER-Evaluation
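
To make the difference between the two views concrete, here is a minimal exact-match sketch of entity-level P/R/F1. It is not the davidsbatista/NER-Evaluation implementation (which also covers partial-match schemes); `extract_entities` and `entity_prf` are hypothetical names, and the tag sequences are made up.

```python
def extract_entities(tags):
    """Return the set of (start, end, category) spans, end exclusive."""
    entities, start, category = set(), None, None
    for i, tag in enumerate(list(tags) + ["N"]):  # sentinel closes the last span
        if tag.startswith("I-") and start is not None and tag[2:] == category:
            continue  # still inside the current entity
        if start is not None:
            entities.add((start, i, category))
            start, category = None, None
        if tag.startswith("B-"):
            start, category = i, tag[2:]
    return entities


def entity_prf(gold_tags, pred_tags):
    """Precision, recall and F1 over exactly matching entity spans."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "N", "B-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "N", "B-LOC", "N"]
print(entity_prf(gold, pred))
```

In this example four of the five tokens are tagged correctly, but only one of the two predicted entities matches a gold entity exactly, so the entity-level scores are lower than the per-token view suggests.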

Model

CRF

BiLSTM + CRF

tf.contrib.layers.xavier_initializer
tf.nn.xw_plus_b: Computes matmul(x, weights) + biases.
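
What these two TensorFlow utilities compute can be sketched in plain NumPy. This is an illustration of the math only, not the TensorFlow code: the function names mirror the TF ones, and the Xavier (Glorot) variant shown is the uniform one, with limit sqrt(6 / (fan_in + fan_out)).

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def xw_plus_b(x, weights, biases):
    """NumPy equivalent of matmul(x, weights) + biases."""
    return x @ weights + biases


x = np.array([[1.0, 2.0]])                    # one example, two features
weights = np.array([[1.0, 0.0], [0.0, 1.0]])  # identity, for a readable result
biases = np.array([10.0, 20.0])
out = xw_plus_b(x, weights, biases)
print(out)

# A randomly initialized projection from 3 inputs to 4 outputs.
w_init = xavier_uniform(3, 4, np.random.default_rng(0))
print(w_init.shape)
```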

Resources

sklearn_crfsuite API

TensorFlow CRF

Example

macanv/BERT-BiLSTM-CRF-NER: BERT + BiLSTM + CRF
- lstm_crf_layer.py
scofield7419/sequence-labeling-BiLSTM-CRF: BiLSTM + CRF
- BiLSTM_CRFs.py
  - blocks.py

Not same task but similar model

nyu-mll/multiNLI - Baseline Models for MultiNLI Corpus
- bilstm.py

--

Appendix

Links

Python Logging

TensorFlow notes

Graphs and Sessions
Save and Restore
global_step
Variables
- Sharing Variable
  - What happens when setting reuse=True in tensorflow
    - reuse and variable scopes in general are deprecated and will be removed in TF2
    - the recommendation is to build the model with tf.keras layers instead, which can be reused by simply reusing the layer objects
  - tf.variable_scope
- Is Training
  - Question of tensorflow : How could I turn is_training of batchnorm to False

One-hot solutions

Convert array of indices to 1-hot encoded numpy array

Smarter Ways to Encode Categorical Data for Machine Learning

How can I one hot encode in Python?

Python: One-hot encoding for huge data

Tutorial: (Robust) One Hot Encoding in Python

One-time transform

The approaches below transform the whole dataset at once. When I tried them, they used more than 50 GB of memory.

Numpy
- np.eye
  - np.eye(num_features, dtype=np.uint8)[numpy_dataset]
- np.eye + np.reshape
  - np.squeeze(np.eye(num_features, dtype=np.uint8)[numpy_dataset.reshape(-1)]).reshape([num_examples, num_words, num_features])
Scipy.sparse: currently doesn't support 3-dim matrices (scipy issue - 3D sparse matrices #8868)
- list + scipy.sparse.eye => np.array
  - sparse.eye(num_features, dtype=np.uint8).tolil()[numpy_dataset.reshape(-1)].toarray().reshape((num_examples, num_words, num_features)) (doesn't work)
- list + scipy.sparse.eye -> tf.sparse.SparseTensor: this requires modifying the network structure (X)
Keras
- keras.utils.to_categorical (tf.keras.utils.to_categorical)
  - to_categorical(numpy_dataset, num_classes=num_features)
Pandas: doesn't seem to support 3-dim data either
- pandas.get_dummies (How to do sparse one hot encoding with pandas?)
TensorFlow: this requires modifying the network structure, which limits how general the solution is
- tf.one_hot
Scikit Learn
- sklearn.preprocessing.OneHotEncoder: this requires the "original word: encoding" pairs as input, which is not what I want
- sklearn.preprocessing.LabelBinarizer: can't transform 3-dim data
lazyarray
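
The `np.eye` approach from the list above, shown on a toy index array, together with a back-of-the-envelope size estimate for the full CWS training set using the numbers from the Corpus section. The helper name `one_hot` and the toy data are illustrative.

```python
import numpy as np


def one_hot(indices, num_features):
    """Index rows of an identity matrix to one-hot encode an integer array."""
    return np.eye(num_features, dtype=np.uint8)[indices]


# Toy dataset: 2 sentences x 3 words, vocabulary of 4 (index 0 plays the PAD role).
dataset = np.array([[1, 2, 0], [3, 0, 0]])
encoded = one_hot(dataset, 4)
print(encoded.shape)  # (num_examples, num_words, num_features)

# Dense uint8 size of the full CWS training set, one byte per element.
num_examples, num_words, num_features = 95304, 165, 4744
size_bytes = num_examples * num_words * num_features
print(f"{size_bytes / 1e9:.1f} GB")
```

At one byte per element the dense array alone is on the order of 75 GB, which is consistent with the memory blow-up noted at the top of this section.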

Batch transform
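
One way to avoid materializing the whole one-hot array is to encode one batch at a time. The generator below is a minimal sketch under that assumption; `one_hot_batches` and the batch size are hypothetical and not taken from this repo's code.

```python
import numpy as np


def one_hot_batches(dataset, num_features, batch_size):
    """Yield one-hot encoded slices of an integer dataset, batch by batch."""
    eye = np.eye(num_features, dtype=np.uint8)  # built once, reused per batch
    for start in range(0, len(dataset), batch_size):
        yield eye[dataset[start:start + batch_size]]


# Toy dataset: 3 sentences x 2 words, vocabulary of 4.
dataset = np.array([[1, 2], [3, 0], [2, 2]])
batches = list(one_hot_batches(dataset, num_features=4, batch_size=2))
for batch in batches:
    print(batch.shape)
```

Only one batch of shape (batch_size, num_words, num_features) is held in memory at a time, at the cost of repeating the encoding every epoch.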

Transform back from one-hot

Coverting Back One Hot Encoded Results back to single Column in Python

Numpy
- np.argmax
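
A short sketch of the reverse direction: taking `np.argmax` over the last axis recovers the indices from a one-hot array (and, applied to a model's per-tag scores, picks the highest-scoring tag). The toy data is made up.

```python
import numpy as np

# Toy dataset: 2 sentences x 3 words, vocabulary of 4.
indices = np.array([[1, 2, 0], [3, 0, 0]])
encoded = np.eye(4, dtype=np.uint8)[indices]   # shape (2, 3, 4)

# The last axis is the one-hot (feature) axis, so argmax over it undoes the encoding.
decoded = np.argmax(encoded, axis=-1)
print(decoded)
```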
