
Sentence Boundary Detection


Tasks and requirements

  1. Extract word vectors from "GoogleNews-vectors-negative300" (see the sketch after this list)
  2. Train a model with CAFFE that can detect sentence boundaries
  • It can be purely lexical or a hybrid of lexical and acoustic features
  • Word vectors must be used
  • Training network optimization
  3. Create a complete prototype
  • Input: audio file or ASR transcript
  • Prediction based on the model trained
  • Necessary heuristic operations
  • Output: segmented transcript
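
A minimal sketch of step 1, assuming the vectors are read with gensim; the file name, binary format, and the zero-vector fallback for out-of-vocabulary words are assumptions, not part of the plan above:

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed path to the downloaded pre-trained vectors (binary word2vec format).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def word_vector(token, dim=300):
    """Return the 300-dimensional embedding for a token,
    or a zero vector for out-of-vocabulary words."""
    return kv[token] if token in kv else np.zeros(dim, dtype=np.float32)

tokens = "we propose a model for sentence boundary detection".split()
vectors = np.stack([word_vector(t) for t in tokens])  # shape: (len(tokens), 300)
```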

Baseline framework

  1. Purely lexical model with only word vectors as features
  2. Take a 5-word window as a sample and classify whether there should be a punctuation mark after the third word.
  3. Use an existing DNN or CNN structure for training
  4. Use "Pause" to segment the ASR transcript into Sentence Units (SUs)
  5. Check whether the boundary between two adjacent SUs is lexically correct:
  • In most cases, it should be
  • If not, merge the two SUs
  6. Set a threshold for the maximum SU length, and split longer SUs at the most probable punctuation position according to the predictions of the trained lexical model (see the sketch after this list).
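
A minimal sketch of the 5-word sample construction and the SU-length heuristic, reusing the hypothetical word_vector() helper from the sketch above; the Caffe-trained classifier is stood in for by a placeholder boundary_score(), and the vector dimension and length threshold are assumptions:

```python
import numpy as np

MAX_SU_LEN = 40   # assumed threshold for the maximum SU length (in words)

def make_sample(tokens, i):
    """Concatenate the vectors of the 5-word window centred on position i
    (the word after which a boundary may occur), padding with zero
    vectors at the transcript edges."""
    window = [word_vector(tokens[j]) if 0 <= j < len(tokens)
              else np.zeros(300, dtype=np.float32)
              for j in range(i - 2, i + 3)]
    return np.concatenate(window)  # shape: (1500,)

def boundary_score(sample):
    """Placeholder for the boundary probability that the trained
    Caffe model would produce for one sample."""
    return 0.5

def split_long_su(tokens):
    """Split an over-long SU at the position with the highest predicted
    boundary probability, recursing until every piece fits the threshold."""
    if len(tokens) <= MAX_SU_LEN:
        return [tokens]
    scores = [boundary_score(make_sample(tokens, i)) for i in range(1, len(tokens) - 1)]
    cut = int(np.argmax(scores)) + 1          # word after which to split
    return split_long_su(tokens[:cut + 1]) + split_long_su(tokens[cut + 1:])
```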

Potential improvements

  1. Sample structure: a 7-word window? A different boundary position within the window?
  2. Features used: POS tag? Acoustic features?
  3. Training network structure
  4. SU operation process