Skip to content
/ typod Public

Spelling errors correction for the search engine (based on SphinxSearch) on your site.

License

Notifications You must be signed in to change notification settings

kalloc/typod

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Typod

Spelling errors correction for the search engine (based on sphinx) on your site.

What is it?

Typod is a missing spelling corrector for sphinx, and it uses sphinx dictionary as source for spell checking. It's based on original research and PoC that were written by @iamdonefor

What have we?

We have python classes for spell checking and a tcp interface to them

How to use it?

You can use typod in your python application:

from typo import TypoDefault

TYPO = 'начь улеца фанарь аптека бесмысленый итусклый светт'.decode('utf8')
corrector = TypoDefault('examples/http/test.index')
corrected, ok = corrector.suggestion(TYPO)
print corrected
"""
ночь улица фонарь аптека бессмысленный и тусклый свет
"""

and you can use typod as service (TaaS)

TCP Server:

[email protected] typod[master] $ python -m typo --corrector-index examples/http/test.index server --port 3333
INFO:typo.cmd_server:Run server on 0.0.0.0:3333, using default corrector
[email protected] typod[master] $ (echo QUERY начь улеца фанарь аптека бесмысленый итусклый светт;sleep 1) | nc localhost 3333
ночь улица фонарь аптека бессмысленный и тусклый свет

or

import socket

def suggest(query):
    sock = socket.socket()
    sock.settimeout(0.1)
    try:
        sock.connect(('localhost', 3333))
        sock.send('QUERY {}\n'.format(query.encode('utf-8')))
        data = sock.recv(4096).strip().decode('utf-8')
    except socket.error as e:
        # TODO: do something with connection error
        print e
        data = None
    except socket.timeout:
        data = None
    finally:
        sock.close()
    return data if data != query else query
    
print suggest('w0rd') # word

uWSGI server:

uwsgi --plugin corerouter,python,http --http=:9090   --module typo.wsgi --pyargv='--index=examples/http/test.index'

Example for POST request:

POST / HTTP/1.1
Content-Length: 96
Accept-Encoding: gzip, deflate
Host: localhost:9090
Accept: application/json
User-Agent: HTTPie/0.9.2
Connection: keep-alive
Content-Type: application/json

начь улеца фанарь аптека бесмысленый итусклый светт


HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8

ночь улица фонарь аптека бессмысленный и тусклый свет

and for GET request:

GET /?query=%D0%BD%D0%B0%D1%87%D1%8C%20%D1%83%D0%BB%D0%B5%D1%86%D0%B0%20%D1%84%D0%B0%D0%BD%D1%82%D0%B0%D0%BD%20%D0%B0%D0%BF%D1%82%D0%B5%D0%BA%D0%B0%20%D0%B1%D0%B5%D1%81%D0%BC%D1%8B%D1%81%D0%BB%D0%B5%D0%BD%D1%8B%D0%B9%20%D0%B8%20%D1%82%D1%83%D1%81%D0%BA%D0%BB%D0%B8%D0%B9%20%D1%81%D0%B2%D0%B5%D0%B4 HTTP/1.1
Connection: keep-alive
Host: localhost:9090
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: HTTPie/0.9.2



HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8

ночь улица фонарь аптека бессмысленный и тусклый свет

How to make an index?

# Run indextool to dump a sphinx dictionary
se@goat $ indextool --dumpdict texts_index  > ~/dumpdict.txt

# copy to your machine
scp se@goat:~/dumpdict.txt .

# Convert a sphinx dictionary to typo index
[email protected] typod[master*] $ python -m typo --corrector-index typo.index convert --sphinx-dump dumpdict.txt
Convert indextool format to "TypoDefault" corrector format
Converting  [####################################]  100%
Export result to /usr/home/x/src/typod/typo.index
//EOE

# Test
[email protected] typod[master*] $ python -m typo --corrector-index typo.index console
Phrase: w0rd
Result is word (True) spend time 0.000121, 216.26 mb usage
Phrase: ploy
Result is play (True) spend time 0.000217, 216.27 mb usage
Phrase: w0rld
Result is world (True) spend time 0.000109, 216.27 mb usage

Special thanks:

TODO

  • support n-gram
  • support noisy_channel
  • support blocking-server
  • support custom dictionaries
  • add tests

About

Spelling errors correction for the search engine (based on SphinxSearch) on your site.

Resources

License

Stars

Watchers

Forks

Packages

No packages published