Skip to content

Topic Modeling with Latent Dirichlet Allocation (LDA) for NIPS papers

Notifications You must be signed in to change notification settings

lotaibe/topic-modeling-nips

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Neural Information Processing Systems (NIPS) is one of the top machine learning conferences in the world where groundbreaking work is published. Since the year 1987, a lot of exciting work has been published in this confernece, but are the trends in recent machine learning research published to this Journal. The objective of this project is to analyze a large collection of NIPS research papers from the past decades to discover the latest trends in machine learning.

Data

The data was gotten from Kaggle and includes the title, authors, abstracts, and extracted text for all NIPS papers up to 2017 (ranging from the first 1987 conference to the current 2016 conference).

Method

  • The data was first cleaned and processed using:
    • Regular Expression/Normalization — to convert text to lowercase and remove punctuation and numbers
    • Stop Words Removal — to remove commonly used words in any language
    • Tokenization — to split the text into smaller pieces called tokens
    • Bigram, Trigram Models and Lemmatized - to group together the inflected forms of a word so they can be analyzed as a single item

After processing the data, I followed a structured workflow to build an insightful topic model based on the Latent Dirichlet Allocation (LDA) algorithm. The topic model was built using gensim’s native LdaModel and visualized the results using matplotlib plots.

Next steps:

  • Build the LDA topic model using LdaModel(), the corpus and the dictionary.

I answered the following questions graphically:

  • Dominant topic and its percentage contribution in each document
  • The most representative sentence for each topic
  • What are the most discussed topics in the documents?

Finally, I visualized:

  • The information using pyLavis.
  • Theclusters of documents using t-SNE (t-distributed stochastic neighbor embedding) algorithm

You can view the source code here

Results

About

Topic Modeling with Latent Dirichlet Allocation (LDA) for NIPS papers

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages