Write output to file and calculate stats based on the file.
Fix Bugs
Use the split and joined pseudo-queries as well
It was reading sentences with a dot at the end; hadn't tested that case because I'd only been looking at phrases so far.
Memoize get-prior
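A minimal memoization sketch (assumes get_prior is a pure per-term lookup; the counts dict is a toy stand-in for the real lexicon):

from functools import lru_cache

UNIGRAM_COUNTS = {'help': 120, 'hell': 30}   # toy counts, not the real lexicon
TOTAL = sum(UNIGRAM_COUNTS.values())

@lru_cache(maxsize=None)
def get_prior(term):
    # Cache the prior per term so repeated calls skip the lookup entirely.
    return UNIGRAM_COUNTS.get(term, 0) / TOTAL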
Code
Testing
Use Suggestion objects everywhere
First, make suggestions into Suggestion objects
Then, queries too
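Rough sketch of one possible Suggestion shape (the field names are assumptions, not the actual code):

from dataclasses import dataclass

@dataclass(frozen=True)
class Suggestion:
    terms: tuple        # suggested terms, e.g. ('motor', 'cycles')
    likelihood: float   # P(query | suggestion) -- assumed field
    prior: float        # P(suggestion) -- assumed field

    def posterior_weight(self):
        # Unnormalized posterior, handy for ranking and deduping.
        return self.prior * self.likelihood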
Test with refined edit distance
Ain’t working as of now
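For reference, one common refinement is making the distance transposition-aware (optimal string alignment); a sketch, not necessarily what the code should end up using:

def edit_distance(a, b):
    # Insertions, deletions, substitutions and adjacent transpositions all cost 1.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]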
Restrict the number of final suggestions for which we look up MS API
Deal with the cases where the number of terms in the suggestion and query is different (e.g. run-on and split queries).
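Sketch of how the split/joined pseudo-queries could be generated (helper names and the lexicon check are assumptions):

def joined_variants(terms):
    # Join each adjacent pair, e.g. ['motor', 'cycles'] -> ['motorcycles'].
    for i in range(len(terms) - 1):
        yield terms[:i] + [terms[i] + terms[i + 1]] + terms[i + 2:]

def split_variants(terms, lexicon):
    # Split a term wherever both halves are known words, e.g. 'motorcycles' -> 'motor cycles'.
    for i, term in enumerate(terms):
        for k in range(1, len(term)):
            left, right = term[:k], term[k:]
            if left in lexicon and right in lexicon:
                yield terms[:i] + [left, right] + terms[i + 1:]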
Output the stuff in order
Get rid of duplicates in the suggestions
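One way to dedupe while keeping the input order and the best-scored copy (assumes Suggestion objects like the sketch above):

def dedupe(suggestions):
    best, order = {}, []
    for s in suggestions:
        key = s.terms
        if key not in best:
            order.append(key)
            best[key] = s
        elif s.posterior_weight() > best[key].posterior_weight():
            best[key] = s
    return [best[key] for key in order]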
Fix the get_all_stats bug - why are the stats different only now (after adding a new lexicon)?
Include proper nouns? - chester a arthur, missouri
Include abbreviations? - dept
When taking the top suggestions in generate_suggestions, we are filtering based on the likelihood, but the “correct” suggestion could have a low likelihood…
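One option, sketched below (not what generate_suggestions currently does): take the top k by the unnormalized posterior prior * likelihood, so a low-likelihood but high-prior candidate can still make the cut.

def top_suggestions(candidates, k):
    # Rank by prior * likelihood instead of likelihood alone.
    return sorted(candidates, key=lambda s: s.prior * s.likelihood, reverse=True)[:k]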
Sliding window - i.e. use a smaller n-gram instead of the whole phrase or sentence
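Sliding-window sketch (the window size is arbitrary here):

def windows(terms, n=3):
    # Overlapping n-term windows; fall back to the whole list if it's shorter.
    if len(terms) <= n:
        return [terms]
    return [terms[i:i + n] for i in range(len(terms) - n + 1)]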
Test data in data/0643.zip
import re

# Pull out the ALL-CAPS words from the test passage.
string = """I OFTEN VISITED my AUNT. She lived in a MAGNIFICENT
HOUSE OPPOSITE the GALLERY. I REMEMBER her SPLENDID PURPLE
CURTAINS. She WROTE POETRY. The PROBLEM was nobody could
UNDERSTAND it. Her LATEST POEMS had words like prunty,
slimber, grondel, blomp. I WANTED to LAUGH but I had to
PRETEND to like them. However, I REALLY like the SPECIAL
REFRESHMENT. THERE was BLUE JUICE, CAKE and BISCUITS. When
I left, my STOMACH was full and I was happy and CONTENTED."""
print(re.findall('[A-Z]+', string))
Evaluation
run-test on full script took ~7m
Why is the EF1 so low?
Words
ecstacy (them) vs ecstasy (us)
fail vs fails (0.2 + 0.25 vs just 0.25)
respe + ct not being considered. Why?
hellpp not having help as a human suggestion. WTH!
thruout -> throughout is difficult
volly goes to everything else but folly (with volley having very low prob)
Phrases
ny city -> we don’t know ‘ny’
cause … vs cause/causes …
dept
motorcycles vs motorcycles/motor cycles
chester a arthur vs
u s vs us
Sentences
discuss the bil -> [bill, bid]
both seem ok to me, but only bill is accepted
I think in the TREC dataset, they have correct phrases along with misspelled phrases, whereas in the 15 cases we have, pretty much all are misspelled.
Noop speller: This speller does nothing and returns exactly the same query as the input. It achieved 0.91 EF1, 0.94 EP and 0.87 ER. This result implies that recall might be the bottleneck.
Bing speller: We tried the Bing speller API. Although it was expected to achieve a high score, it actually scored 0.91 EF1: lower than the Noop speller. This result reminded us that the annotation policy of the TREC dataset differs considerably from that of a commercial search engine's query speller: the TREC annotation policy prefers a variety of spellings rather than one exact spelling.
Problem with the evaluation measure: ER is just Recall. The posteriors are not considered at all… So you can cheat and increase ER by giving suggestions with very low posterior, without hurting EP.
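Toy illustration of the cheat, assuming the metrics work roughly like this (per query, ignoring the averaging over queries; the data is made up): EP sums the posteriors of returned suggestions the annotators accepted, ER is the fraction of annotated variants that appear anywhere in the returned list.

def expected_precision(returned, gold):
    # returned: list of (suggestion, posterior); gold: annotated variants.
    return sum(p for s, p in returned if s in gold)

def expected_recall(returned, gold):
    suggested = {s for s, _ in returned}
    return sum(1 for g in gold if g in suggested) / len(gold)

gold = {'motorcycles', 'motor cycles'}   # made-up annotation set
honest = [('motorcycles', 1.0)]
padded = [('motorcycles', 0.98), ('motor cycles', 0.01), ('motorcycle', 0.01)]

for name, run in [('honest', honest), ('padded', padded)]:
    print(name, expected_precision(run, gold), expected_recall(run, gold))
# honest: EP 1.00, ER 0.50; padded: EP 0.99, ER 1.00 -- the low-posterior
# extras double ER while costing EP almost nothing.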