This is a simple information retrieval system.
- Python 2.7
- Noteworthy modules: nltk, future.
$ python search.py
On the first run, it downloads the JSON database and builds the index.
A search prompt will appear. The query string is a list of words that will be searched in all fields. Example:
> led light bulb
The system shows a paginated list of documents ordered by relevance. Press Enter to show the next page, or enter "q" to exit the query.
Field-specific searches are possible by using the field name (title, merchant, description) followed by a colon. Example:
> shirt merchant: lewis
The word "shirt" will be searched in all fields. "lewis" and every word after it will be searched only in the merchant field.
To disable a previously set field name, use the keyword "all:". Example:
> merchant: lewis all: shirt
Important: At this stage, field-specific searches only increase the weight of matches in the corresponding field. They do not exclude documents that match in other fields. With a little more time, a stricter filter could be implemented in the weighWord functions.
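The field-prefix syntax described above could be parsed roughly as follows. This is only a minimal sketch, not the code in search.py: the function name parse_query and the (word, field) pair format are hypothetical, and it assumes the field name and its colon are written as one token followed by a space, as in the examples. It runs on both Python 2.7 and Python 3.

```python
# Field names taken from the description above.
FIELDS = ("title", "merchant", "description")


def parse_query(query):
    """Split a query into (word, field) pairs.

    field is None when the word should be searched in all fields.
    Hypothetical helper for illustration only.
    """
    current_field = None  # None means "all fields"
    pairs = []
    for token in query.lower().split():
        if token.endswith(":"):
            name = token[:-1]
            if name == "all":
                # "all:" cancels a previously set field name.
                current_field = None
                continue
            if name in FIELDS:
                # Every following word is tied to this field.
                current_field = name
                continue
            # Unknown prefix: treat it as an ordinary word (assumption).
            token = name
        if token:
            pairs.append((token, current_field))
    return pairs
```

With this sketch, "shirt merchant: lewis" yields the word "shirt" with no field restriction and "lewis" tied to the merchant field.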
To exit the program, enter a single character "q".
> q
To force the download of the database, run the retrieve_data module.
$ python retrieve_data.py
This also forces the index to be rebuilt the next time search.py is run.
To rebuild the index, run the index module.
$ python index.py
- Basic NLP (tokenization, lemmatization, stopwords).
- Field-specific weighting scheme.
- Positional weighting scheme.
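The features listed above can be illustrated with a small scoring sketch. It is an assumption-laden example, not the project's implementation: the weight values, the stopword list, the 1/(1+position) decay, and the names tokenize, position_weight and score_document are all hypothetical. A regular expression stands in for the nltk tokenizer, and lemmatization is left out to keep the example self-contained.

```python
import re

# Hypothetical stopword list; the project itself relies on nltk.
STOPWORDS = frozenset(["a", "an", "and", "the", "of", "for", "in", "on", "with"])

# Hypothetical field weights: a match in the title counts more
# than a match in the description.
FIELD_WEIGHTS = {"title": 3.0, "merchant": 2.0, "description": 1.0}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def position_weight(position):
    """Words near the start of a field weigh more (assumed decay)."""
    return 1.0 / (1.0 + position)


def score_document(doc, query):
    """Sum field weight times positional weight over all matching tokens."""
    query_words = set(tokenize(query))
    score = 0.0
    for field, field_weight in FIELD_WEIGHTS.items():
        # Positions are counted after stopword removal.
        for position, token in enumerate(tokenize(doc.get(field, ""))):
            if token in query_words:
                score += field_weight * position_weight(position)
    return score
```

Under these assumptions, a document titled "LED light bulb" scores higher for the query "led light bulb" than a document that only mentions the same words late in its description.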
Improvements of this version:
- Better documentation
- Better function naming
- Function modularization (calcWeight)
- Weight factor modified by word or lemma match
- Save search to file for later inspection
- Avoid some duplication in indexing
- Display result weights for better debugging
- Improve "search by field". Currently, it is implemented via weight manipulation (i.e. if the word is matched in the desired field, the weight is multiplied). This approach works well for a single field, but when two or more fields are searched at the same time, the results may not be what the user expects. It is probably better to implement it as a filter.
- Index the POS (part-of-speech) tag of each word, along with its position. Nouns in the title could then be given a higher weight than verbs when matched at search time.
- The output of the function weighWord (docs dict) could hold more information than just the gained weight. Extra information could help "global" query evaluation, for instance the proximity of the matched words (in the query and in the field).
- Spelling correction, etc.
- More advanced corpus indexing.
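The filter-based "search by field" suggested in the list above might look like the sketch below. It is a hypothetical illustration, not existing project code: the names passes_field_filter and filter_documents, the (word, field) constraint format, and the choice to leave unconstrained words to the ranking step are all assumptions. Documents are assumed to be dicts mapping field names to strings.

```python
def passes_field_filter(doc, constraints):
    """Return True if doc satisfies every field-specific constraint.

    constraints is a list of (word, field) pairs; a field of None means
    the word is not tied to a field and is left to the ranking step.
    """
    for word, field in constraints:
        if field is None:
            continue
        # Simple whitespace tokenization stands in for the real indexer.
        if word not in doc.get(field, "").lower().split():
            return False
    return True


def filter_documents(docs, constraints):
    """Keep only the documents that pass all field constraints."""
    return [doc for doc in docs if passes_field_filter(doc, constraints)]
```

Because each constraint is checked independently, a query restricted to two fields at once (for example title and merchant) keeps only documents that match in both, which is the case the weight-based approach handles poorly.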