Notes

[2012-09-19 Wed] Performance

Problem Statement

Word spell check - standalone words are given and you are supposed to suggest corrections.

Phrase spell check - words present in a phrase need to be checked for spelling

Sentence spell check - an entire sentence needs to be checked for spelling

Metrics

EP, ER, EF1 (expected precision, expected recall, and their harmonic mean)

Word spell-check

Edit distance will suffice

But up to what edit distance should we consider alternative words?

The Kernighan paper has the costs for each edit operation

Phrase and sentence spell-check

Context-sensitive spell-checking

Paper 1 - A Data-driven approach for correcting search queries

Why purely data-driven?

Because large datasets are available

Algorithm

Generation step - Postulate a set of alternative spellings

Check if it exists in the dictionary

If not, suggestion set = term + set of N_t suggestions from the dictionary

How to trim the search space

Trimming and sorting - use the probability calculation below (a code sketch follows the parameter values)

log (P (c|q)) = log (P (c)) - (r_e * lev (q, c)) / len (q)

r_e - unknown constant representing the error rate

Finally, normalize the probabilities

P (c_i | q) = relative probability

The parameters to be tuned

r_e (his final value = 36)

N_t (his final value = 2)
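A minimal sketch of the trimming-and-sorting step, assuming get_prior(c) returns the prior P(c) and lev(q, c) the edit distance, as in the function list further down; r_e and N_t are the tuned constants above, and the exact trimming policy here is a guess rather than the paper's.

import math

R_E = 36    # error-rate constant (the final value above)
N_T = 2     # number of alternative suggestions kept (the final value above)

def log_posterior(candidate, query, get_prior, lev):
    # log P(c|q) = log P(c) - r_e * lev(q, c) / len(q)
    return math.log(get_prior(candidate)) - R_E * lev(query, candidate) / float(len(query))

def normalize(scored):
    # Turn the log-scores into relative probabilities P(c_i|q) that sum to 1.
    probs = [math.exp(score) for _, score in scored]
    total = sum(probs)
    return [(cand, p / total) for (cand, _), p in zip(scored, probs)]

def trim_and_sort(candidates, query, get_prior, lev):
    scored = [(c, log_posterior(c, query, get_prior, lev)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Keep only the best few (roughly the original term plus N_t alternatives).
    return normalize(scored[:N_T + 1])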

Evaluation

We need C (q), the set of all suggestions we produced for q, and the corresponding posterior probabilities P (c | q)

S (q) - set of human-annotated plausible spelling variations for q

Possible Improvements

Edit Distance with different costs for each operation

We’re not considering two-edit words if one-edit words exist. But there could be better two-edit words

e.g. ‘adres’ would now go to ‘acres’ instead of ‘address’ (which is more probable)

Join all the words and sort using Sujeet’s costs

Newly split and joined queries can be considered as new queries

Filter words while generating them by ignoring operations with low probability (insert q next to z).

Unknown words

Maybe change 2/3/4-letter sequences according to some rules

Code - SpellChecker

filter_valid_words (list of words) -> list of valid words OR is_in_dict (term) -> true or false

Store a list of dictionary words.

generate_candidate_terms (term) -> list of N_t candidate terms + the original term

Make sure it returns only valid terms.

Maybe do set intersection with the dictionary.
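generate_candidate_terms is only pinned down by its signature, so this is just one possible sketch: generate all one-edit variants of the term and intersect them with the dictionary (the set-intersection idea above). The alphabet, the dictionary being a set, and the alphabetical tie-break are assumptions.

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def one_edit_variants(term):
    # Every string at edit distance 1 from term: deletes, transposes, replaces, inserts.
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + ch + b[1:] for a, b in splits if b for ch in ALPHABET]
    inserts = [a + ch + b for a, b in splits for ch in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def generate_candidate_terms(term, dictionary, n_t=2):
    # Valid terms only: set intersection with the dictionary.
    valid = one_edit_variants(term) & dictionary
    # Ranking by prior would be better; alphabetical order is just a placeholder.
    return [term] + sorted(valid - set([term]))[:n_t]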

generate_candidate_suggestions (list of list of candidates for each term) -> list of candidate suggestions (phrases)

cross-product of the lists
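The cross-product is exactly what itertools.product gives (toy input below):

import itertools

def generate_candidate_suggestions(candidate_lists):
    # candidate_lists: one list of candidate terms per position in the phrase.
    return [' '.join(terms) for terms in itertools.product(*candidate_lists)]

# generate_candidate_suggestions([['gud', 'good'], ['morning']])
# -> ['gud morning', 'good morning']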

get_corrected_split_phrases (query) -> list of phrases consisting of split words joined together if it results in valid terms

Assumption: max two words will be joined.

get_corrected_run_on_phrases (query) -> list of phrases consisting of run-on words split if it results in valid terms

Assumption: max three words have been joined together.
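A sketch of the core of get_corrected_run_on_phrases under that assumption, with is_in_dict as listed above; how the resulting splits are merged back into full phrases is left out.

def split_run_on_word(word, is_in_dict):
    # All ways of splitting one run-on word into 2 or 3 valid dictionary terms.
    splits = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if is_in_dict(left) and is_in_dict(right):
            splits.append([left, right])
        if is_in_dict(left):
            for j in range(1, len(right)):
                mid, last = right[:j], right[j:]
                if is_in_dict(mid) and is_in_dict(last):
                    splits.append([left, mid, last])
    return splits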

Could have used the parts helper below, but went with the inefficient yet more Pythonic version

def parts(list_, indices):
    # http://stackoverflow.com/questions/1198512/split-a-list-into-parts-based-on-a-set-of-indexes-in-python
    indices = [0] + indices + [len(list_)]
    return [list_[v:indices[k + 1]] for k, v in enumerate(indices[:-1])]
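For reference, what it does:

parts(['ran', 'domw', 'ords'], [1])   # -> [['ran'], ['domw', 'ords']]
parts([1, 2, 3, 4, 5], [2, 4])        # -> [[1, 2], [3, 4], [5]]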

get_prior (n-gram) -> prior probability for the n-gram

lev (c, q) -> edit distance with different costs for each operation
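A sketch of a weighted edit distance; the flat per-operation costs are placeholders, and per-character-pair costs (the confusion-matrix idea from the Kernighan paper) would slot in where ins_cost/del_cost/sub_cost are used.

def weighted_lev(q, c, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    # Standard dynamic-programming edit distance, but each operation has its own cost.
    m, n = len(q), len(c)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if q[i - 1] == c[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,    # delete a character of q
                          d[i][j - 1] + ins_cost,    # insert a character of c
                          d[i - 1][j - 1] + sub)     # substitute (or match)
    return d[m][n]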

get_normalized_probability (list of posterior probabilities) -> normalized list

get_posterior (suggestion, query)

generate_suggestions (q) -> [(suggestion, posterior), …]

or C (q) and P (c | q)

evaluate_suggestions (q, C (q), P (c | q), S (q)) -> [EP, ER]

get_EP

get_ER

EF1 = HM (EP, ER)
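A sketch of those three for a single query, assuming the Speller Challenge-style definitions the notes imply: EP sums the posterior mass that lands on annotator-accepted suggestions, ER is plain coverage of S(q) (exactly the complaint at the end of these notes), and EF1 is their harmonic mean. Averaging over the query set Q is omitted.

def get_EP(C_q, posteriors, S_q):
    # Posterior mass on suggestions the human annotators also produced.
    return sum(p for c, p in zip(C_q, posteriors) if c in S_q)

def get_ER(C_q, S_q):
    # Fraction of the annotated variants that we actually suggested.
    return sum(1.0 for s in S_q if s in C_q) / len(S_q)

def get_EF1(ep, er):
    # EF1 = HM(EP, ER)
    return 2 * ep * er / (ep + er) if (ep + er) else 0.0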

run_spell_check ()

record_performance (EF1, Q)

Script to test performance on different queries

Python learning

Nested for loops (as generators)

print list((e1, e2)
           # Outer loop
           for e1 in xrange(10)
           # Inner loop
           for e2 in xrange(3, 6))

Code for evaluation

Testing

Test words, phrases, and sentences separately

Dataset

query1 suggestion1 …

Input

query1

query2

Output

query{i} suggestion1 prob1 suggestion2 prob2 …

prob1 = P (suggestion1 | query{i})

Write the output to a file and calculate stats based on that file.
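A sketch of reading that output file back for the stats step; the tab delimiter is an assumption (suggestions themselves contain spaces, so splitting on whitespace would be ambiguous).

def read_output_file(path):
    # Each line: query<TAB>suggestion1<TAB>prob1<TAB>suggestion2<TAB>prob2 ...
    results = {}
    with open(path) as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            query, rest = fields[0], fields[1:]
            suggestions = rest[0::2]
            probs = [float(p) for p in rest[1::2]]
            results[query] = list(zip(suggestions, probs))
    return results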

Fix Bugs

Use the split and joined pseudo-queries as well

It was reading sentences with a dot at the end. Hadn’t tested it because I was only looking at phrases so far.

Memoize get_prior

Code

Testing
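A minimal dict-based memoization sketch for get_prior (functools.lru_cache only exists from Python 3.2, so a hand-rolled cache stays Python-2 friendly; it assumes the n-gram argument is hashable, e.g. a string or tuple of terms):

def memoize(fn):
    cache = {}
    def wrapper(ngram):
        if ngram not in cache:
            cache[ngram] = fn(ngram)
        return cache[ngram]
    return wrapper

# get_prior = memoize(get_prior), or use @memoize as a decorator on get_prior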

Use Suggestion objects everywhere

First, make suggestions into Suggestion objects

Then, queries too

Test with refined edit distance

Ain’t working as of now

Restrict the number of final suggestions for which we look up MS API

Deal with the cases where the number of terms in the suggestion and query are different (e.g. run-on and split queries).

Output the stuff in order

Get rid of duplicates in the suggestions

Fix the get_all_stats bug - why are the stats different only now (after adding a new lexicon)?

Include proper nouns? - chester a arthur, missouri

Include abbreviations? - dept

When taking the top suggestions in generate_suggestions, we are filtering based on the likelihood, but the “correct” suggestion could have a low likelihood…

Use two-edit words as well

Rank them using the fine edit distance

Sliding window - i.e. use a smaller n-gram instead of the whole phrase or sentence
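A sketch of the sliding-window idea: score the contiguous n-grams of a suggestion instead of the whole phrase (the window size n is a parameter to tune, not something fixed in these notes).

def sliding_ngrams(terms, n):
    # e.g. n=2 over ['discuss', 'the', 'bil'] -> [('discuss', 'the'), ('the', 'bil')]
    if len(terms) <= n:
        return [tuple(terms)]
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]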

Test data in data/0643.zip

import re

string = """I OFTEN VISITED my AUNT. She lived in a MAGNIFICENT HOUSE OPPOSITE the GALLERY. I REMEMBER her SPLENDID PURPLE CURTAINS. She WROTE POETRY. The PROBLEM was nobody could UNDERSTAND it. Her LATEST POEMS had words like prunty, slimber, grondel, blomp. I WANTED to LAUGH but I had to PRETEND to like them. However, I REALLY like the SPECIAL REFRESHMENT. THERE was BLUE JUICE, CAKE and BISCUITS. When I left, my STOMACH was full and I was happy and CONTENTED."""

print re.findall('[A-Z]+', string)

Evaluation

run-test on full script took ~7m

Why is the EF1 so low?

Words

ecstacy (them) vs ecstasy (us)

fail vs fails (0.2 + 0.25 vs just 0.25)

respe + ct not being considered. Why?

hellpp not having help as a human suggestion. WTH!

thruout -> throughout is difficult

volly goes to everything else but folly (with volley having very low prob)

Phrases

ny city -> we don’t know ‘ny’

cause … vs cause/causes …

dept

motorcycles vs motorcycles/motor cycles

chester a arthur vs

u s vs us

Sentences

discuss the bil -> [bill, bid]

both seem ok to me, but only bill is accepted

I think in the TREC dataset, they have correct phrases along with misspelled phrases, whereas in the 15 cases we have, pretty much all are misspelled.

Noop speller: This speller actually does nothing and returns exactly the same string as the input query. This speller achieved 0.91 EF1, 0.94 EP and 0.87 ER. This result implies that recall might be the bottleneck.

http://spellerchallenge.com/Rules.aspx

Bing speller: We tried the Bing speller API. Although this was expected to achieve a high score, its score was actually 0.91 EF1: lower than the Noop speller. This result reminded us that the annotation policy of the TREC dataset is quite different from a commercial search engine query speller. The TREC annotation policy prefers a variety of spellings rather than one exact spelling.

Problem with the evaluation measure: ER is just Recall. The posteriors are not considered at all… So you can cheat and increase ER by giving suggestions with very low posterior, without hurting EP.