This project uses Keras to implement a bidirectional LSTM for Twitter hashtag segmentation. The task is to take a hashtag and split it into the words of the phrase it stands for. This may seem trivial, but a single hashtag can often be segmented in several different ways:
#wordsoftheday
=> "word soft he day" or "words of the day"
#statefarmisthere
=> "state far mist here" or "state farm is here"
#brainstorm
=> "bra in storm" or "brain storm"
#doubledown
=> "do u bled own" or "double down"
#votedems
=> "voted ems" or "vote dems"
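To see where the ambiguity comes from, the sketch below enumerates every way a hashtag can be split into words from a small vocabulary. It is only an illustration: the `VOCAB` set and the `segmentations` helper are hypothetical names chosen to cover two of the examples above, and they are not part of the model described here.

```python
from functools import lru_cache

# Hypothetical toy vocabulary, chosen only to cover two of the examples above.
VOCAB = frozenset({
    "word", "words", "soft", "of", "he", "the", "day",
    "vote", "voted", "dems", "ems",
})


def segmentations(text, vocab=VOCAB):
    """Return every way to split `text` into words found in `vocab`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the string yields one (empty) segmentation.
        if start == len(text):
            return [[]]
        results = []
        # Try every prefix starting at `start`; keep those that are words.
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return [" ".join(words) for words in split_from(0)]


for phrase in segmentations("wordsoftheday"):
    print(phrase)
```

A dictionary lookup alone finds both candidates but cannot say which one is intended, which is why a learned model is used to pick the segmentation.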
The approach treats segmentation as a sequence labeling problem: each character is one timestep, and each character is assigned a binary label, 1 if it should be followed by a space and 0 otherwise. For example:
#nlprocks
input: [n, l, p, r, o, c, k, s]
label: [0, 0, 1, 0, 0, 0, 0, 0]
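The labeling scheme above can be sketched in a few lines of Python. The helper names `to_labels` and `from_labels` are hypothetical and only illustrate the encoding; the project's actual preprocessing may differ.

```python
def to_labels(segmented):
    """Convert a segmented phrase into (characters, binary labels).

    A label of 1 means the character is followed by a space.
    """
    chars, labels = [], []
    for word in segmented.split():
        for ch in word:
            chars.append(ch)
            labels.append(0)
        labels[-1] = 1  # the last character of each word precedes a space
    if labels:
        labels[-1] = 0  # except the final character of the whole phrase
    return chars, labels


def from_labels(chars, labels):
    """Rebuild the segmented phrase from characters and predicted labels."""
    out = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label == 1:
            out.append(" ")
    return "".join(out).strip()


chars, labels = to_labels("nlp rocks")
print(chars)   # ['n', 'l', 'p', 'r', 'o', 'c', 'k', 's']
print(labels)  # [0, 0, 1, 0, 0, 0, 0, 0]
print(from_labels(chars, labels))  # nlp rocks
```

With this encoding, a model only has to predict one binary value per character, and the predicted labels can be decoded back into a phrase with `from_labels`.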
The data consists of ~700,000 segmented hashtags from Twitter.