
Sentence Boundary Detection


Tasks and requirements

  1. Extract word vectors from "GoogleNews-vectors-negative300" (see the sketch after this list)
  2. Train a model with CAFFE that can detect sentence boundaries
  • It can be purely lexical or a hybrid of lexical and acoustic features
  • Word vectors must be used
  • Training network optimization
  3. Create a complete prototype
  • Input: audio file or ASR transcript
  • Prediction based on the model trained
  • Necessary heuristic operations
  • Output: segmented transcript
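
A minimal sketch of step 1, assuming the vectors are read with gensim; the file name, binary format, and the zero-vector fallback for out-of-vocabulary words are assumptions, not part of the plan above:

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed path to the downloaded pre-trained vectors (binary word2vec format).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def word_vector(token, dim=300):
    """Return the 300-dimensional embedding for a token,
    or a zero vector for out-of-vocabulary words."""
    return kv[token] if token in kv else np.zeros(dim, dtype=np.float32)

tokens = "we propose a model for sentence boundary detection".split()
vectors = np.stack([word_vector(t) for t in tokens])  # shape: (len(tokens), 300)
```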

Baseline framework

  1. Purely lexical model with only word vectors as features
  2. Take a 5-word window as a sample and classify whether there should be a punctuation mark after the third word.
  3. Use an existing DNN or CNN structure for training
  4. Use "Pause" to segment the ASR transcript into Sentence Units (SUs)
  5. Check whether the boundary between two adjacent SUs is lexically correct:
  • In most cases, it should be
  • If not, merge the two SUs
  6. Set a threshold for the maximum SU length, and split longer SUs at the most probable punctuation position according to the predictions of the trained lexical model (see the sketch after this list).
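
A minimal sketch of the 5-word sample construction and the SU-length heuristic, reusing the hypothetical word_vector() helper from the sketch above; the Caffe-trained classifier is stood in for by a placeholder boundary_score(), and the vector dimension and length threshold are assumptions:

```python
import numpy as np

MAX_SU_LEN = 40   # assumed threshold for the maximum SU length (in words)

def make_sample(tokens, i):
    """Concatenate the vectors of the 5-word window centred on position i
    (the word after which a boundary may occur), padding with zero
    vectors at the transcript edges."""
    window = [word_vector(tokens[j]) if 0 <= j < len(tokens)
              else np.zeros(300, dtype=np.float32)
              for j in range(i - 2, i + 3)]
    return np.concatenate(window)  # shape: (1500,)

def boundary_score(sample):
    """Placeholder for the boundary probability that the trained
    Caffe model would produce for one sample."""
    return 0.5

def split_long_su(tokens):
    """Split an over-long SU at the position with the highest predicted
    boundary probability, recursing until every piece fits the threshold."""
    if len(tokens) <= MAX_SU_LEN:
        return [tokens]
    scores = [boundary_score(make_sample(tokens, i)) for i in range(1, len(tokens) - 1)]
    cut = int(np.argmax(scores)) + 1          # word after which to split
    return split_long_su(tokens[:cut + 1]) + split_long_su(tokens[cut + 1:])
```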

Potential improvements

  1. Sample structure: a 7-word window? A different boundary position within the window?
  2. Features used: POS tag? Acoustic features?
  3. Training network structure
  4. SU operation process