Simple keyword/feature extractor written in Python

kpalac/smallsem


ABOUT


SmallSem is a simple module and CLI application for extracting features/keywords from text and for summarizing it.

The keywords are meant to be characteristic of a document, so they can be used for finding similar documents etc. SmallSem aims to be simple,
reasonably fast and accurate enough to be usable in other projects.

It uses a Xapian database to index the vocabulary of a language and then uses word frequencies to classify words as interesting.
Word pairs are also used if they co-occur in a document.
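
The idea above can be sketched roughly as follows. This is not SmallSem's actual code: the function names are made up for illustration, a plain dict (background_df) stands in for the Xapian index, the TF-IDF style weighting is an assumption, and "co-occurring" is narrowed here to adjacent words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_keywords(text, background_df, background_docs, top_n=5):
    """Rank words by in-document count times inverse background document frequency.

    background_df is a hypothetical stand-in for the language index:
    word -> number of background documents that contain it.
    """
    counts = Counter(tokenize(text))
    scores = {}
    for word, tf in counts.items():
        df = background_df.get(word, 0)
        # Words that are rare in the background corpus get a higher weight.
        idf = math.log((background_docs + 1) / (df + 1))
        scores[word] = tf * idf
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


def cooccurring_pairs(text, keywords):
    """Return adjacent word pairs in which both words are keywords."""
    tokens = tokenize(text)
    keyset = set(keywords)
    pairs = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if a in keyset and b in keyset
    )
    return [pair for pair, _ in pairs.most_common()]
```

Common words such as "the" appear in almost every background document, so their weight drops to about zero and they fall out of the ranking.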

The Xapian indexes for the language models come as archives named **_index.zip and must be unzipped to the same folder to be functional.

New languages can be added by modifying the generator_en.py script and by training a new Xapian DB on a corpus in the given language.


You can learn new texts by using the SmallSemTrainer class or the command:
   
    smallsem.py --lang=[SOME LANGUAGE SYMBOL] --learn-from-dir [DIRECTORY WITH PLAINTEXT]

The text provided should be plaintext. The bigger the database, the more accurate the extraction is.
Language data is stored in a separate folder.
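
To give a feel for what learning from a directory could involve, here is a minimal sketch that builds a document-frequency table from plaintext. It does not use SmallSemTrainer or Xapian. Both function names and the *.txt file pattern are assumptions made for this example.

```python
import re
from collections import Counter
from pathlib import Path


def learn_from_texts(texts):
    """Count, for each word, how many of the given texts contain it."""
    doc_freq = Counter()
    doc_count = 0
    for text in texts:
        # A set ensures each word counts once per document.
        words = set(re.findall(r"[a-z]+", text.lower()))
        doc_freq.update(words)
        doc_count += 1
    return doc_freq, doc_count


def learn_from_dir(directory):
    """Apply learn_from_texts to every *.txt file in a directory (assumed layout)."""
    paths = sorted(Path(directory).glob("*.txt"))
    return learn_from_texts(
        p.read_text(encoding="utf-8", errors="ignore") for p in paths
    )
```

The more documents go in, the better the counts separate everyday words from distinctive ones, which is why a bigger database helps.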


You can extract keywords from a text file with the command:

    smallsem.py --keywords [TEXT_FILE]

You can summarize text using:

    smallsem.py --level=[1..100] --summarize [TEXT_FILE]
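
What --level means exactly is not spelled out here. The sketch below assumes it is the percentage of sentences to keep and picks sentences by average word frequency. It is only an illustration of frequency-based extractive summarizing, not SmallSem's own method, and the summarize function is hypothetical.

```python
import math
import re
from collections import Counter


def summarize(text, level=30):
    """Keep roughly `level` percent of the sentences, in their original order."""
    if not 1 <= level <= 100:
        raise ValueError("level must be between 1 and 100")
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if not sentences:
        return ""
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words within the whole text.
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    keep = max(1, math.ceil(len(sentences) * level / 100))
    # Rank by descending score; earlier sentences win ties.
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:keep])
    return " ".join(sentences[i] for i in chosen)
```

With this reading, level=100 returns the whole text and low values keep only the highest-scoring sentences.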

If no vocabulary DB is present for a language, simple dictionaries are used instead.


SmallSem also has a simple language detection feature that picks one of the available languages based on a text sample.
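
One simple way to do this kind of detection is to count how many common words of each language show up in the sample. The sketch below assumes that approach. The tiny word lists, the language codes and the detect_language name are illustrative only and are not taken from SmallSem.

```python
import re

# Tiny illustrative word lists; a real setup would draw on each language's vocabulary data.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "it", "that"},
    "pl": {"i", "w", "nie", "na", "to", "jest", "się", "z"},
}


def detect_language(sample, stopwords=STOPWORDS):
    """Return the language whose common words occur most often in the sample."""
    tokens = re.findall(r"\w+", sample.lower())
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in stopwords.items()
    }
    # Iterate languages in sorted order so ties resolve deterministically.
    best = max(sorted(scores), key=lambda lang: scores[lang])
    # No hits at all means the sample gives no evidence for any language.
    return best if scores[best] > 0 else None
```

A longer sample gives a more reliable answer, since short snippets may contain few or no common words.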

Feel free to modify and play around :)


CONTACT: Karol Pałac, [email protected]





