
ntson2002/topic-based-retrieval

Step 1. Prepare data

A folder containing the legal articles
A JSON file containing all documents

Step 2. Document indexing

  1. Build TF-IDF vectors from corpus
    echo "=========================================="
    echo "Indexing TFIDF ..."
    INPUT=data/all_articles.json
    OUTPUT=output/model_TFIDF.pkl
    python document-indexing.py --index_type tfidf --file_type json --input $INPUT --output $OUTPUT
  2. Build TF-IDF vectors from the corpus, then use MDS to reduce the dimensionality
    echo "=========================================="
    echo "Indexing MDS ..."
    INPUT=data/all_articles.json
    OUTPUT=output/model_TFIDF_MDS.pkl
    python document-indexing.py --index_type mds --file_type json --input $INPUT --output $OUTPUT
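
For readers unfamiliar with TF-IDF indexing, the idea behind the first step can be sketched in plain Python. This is only an illustration under simplifying assumptions (whitespace tokenization, smoothed IDF, L2-normalized vectors); it is not the code in `document-indexing.py`, and the names `tokenize` and `build_tfidf` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def build_tfidf(documents):
    """Return (vocabulary, idf, vectors) for a list of document strings."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vocabulary = sorted(df)
    # Smoothed IDF, so terms found in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize so that a dot product equals cosine similarity.
        norm = math.sqrt(sum(w * w for w in vec))
        vectors.append([w / norm for w in vec] if norm else vec)
    return vocabulary, idf, vectors
```

Each document becomes one row over a shared, sorted vocabulary. The MDS variant would then project these rows into a lower-dimensional space before they are stored.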

Step 3. Build topic vectors from corpus

  1. Build topic model file from corpus
    echo "=========================================="
    echo "Creating topic vectors ..."
    INPUT=data/all_articles.json
    OUTPUT=output/topic.pickle
    python create-topic-model.py --file_type json --input $INPUT --output $OUTPUT

Step 4. Retrieval

Three types of query are supported:

1) query on TF-IDF space
2) query on MDS space (using MDS to reduce dimension)
3) query on TF-IDF space with injection of topic vectors
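
The three query types can be pictured as cosine-similarity ranking over different vector spaces. The sketch below shows ranking in general, plus one possible way to inject topic vectors: appending a weighted topic vector to the TF-IDF vector. That concatenation scheme, the `weight` parameter, and the function names are assumptions made for illustration; the repository may combine the two spaces differently.

```python
import math


def cosine(a, b):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def inject_topics(tfidf_vec, topic_vec, weight=0.5):
    # Hypothetical injection scheme: append the topic vector scaled by `weight`.
    return list(tfidf_vec) + [weight * t for t in topic_vec]


def rank(query_vec, doc_vecs, top_k=3):
    # Score every document, then sort by descending score (ties broken by index).
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top_k]]
```

With this scheme, two documents with identical TF-IDF vectors can still be separated when their topic vectors differ, which is the intuition behind the third query type.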

Step 5. API

  1. Start the search API (default port = 8081)
    $ python search-api.py --port 8081

Step 6. Query the API from a web browser

Note: use "_" instead of spaces in the query.

Query: A demand for payment shall not have the effect

URL:

http://0.0.0.0:8081/api/search/A_demand_for_payment_shall_not_have_the_effect
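
The same request can also be made from a script. The sketch below builds the URL by replacing spaces with underscores, as in the example above. The helper names `build_search_url` and `search` are hypothetical, and since the response format is not documented here, the body is returned as raw text.

```python
from urllib.request import urlopen

# Base URL taken from the example above (default port 8081).
BASE_URL = "http://0.0.0.0:8081/api/search/"


def build_search_url(query, base_url=BASE_URL):
    # Join the words with "_" in place of spaces.
    # Characters that need URL escaping are not handled in this sketch.
    return base_url + "_".join(query.split())


def search(query):
    # Requires the search API from Step 5 to be running.
    with urlopen(build_search_url(query)) as response:
        return response.read().decode("utf-8")
```

Calling `build_search_url("A demand for payment shall not have the effect")` produces the URL shown above.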

Notes

  1. Upload to server
    git checkout master
    git add -A . && git commit -m "Upload"
    git push origin master 
  2. Download from server
    git checkout master
    git pull origin master
