This repository contains code to run LDA (Latent Dirichlet Allocation) topic modeling. This model usually requires a lot of memory and can be quite slow in Python, so it is worth knowing a couple of ways to run it faster when datasets are large, in this case using Apache Spark with the Python API.
It was built with Dataproc on Google Cloud, using a cluster configuration that allows working with Jupyter Notebooks.
The dataset comes from Kaggle and contains over a million news headlines.
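Below is a minimal sketch of the Spark route, assuming a CSV of headlines with a `headline_text` column; the file name, column name, and parameter values (`vocabSize`, `k`, etc.) are placeholders, not the repository's actual settings.

```python
# Sketch: LDA over news headlines with PySpark ML (file/column names are assumptions).
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-headlines").getOrCreate()

# Load the headlines, one per row.
df = spark.read.csv("headlines.csv", header=True).select("headline_text")

# Tokenize, drop stop words, and build a bag-of-words representation.
tokens = RegexTokenizer(inputCol="headline_text", outputCol="tokens",
                        pattern="\\W+").transform(df)
filtered = StopWordsRemover(inputCol="tokens", outputCol="filtered").transform(tokens)
cv_model = CountVectorizer(inputCol="filtered", outputCol="features",
                           vocabSize=5000, minDF=5).fit(filtered)
vectorized = cv_model.transform(filtered)

# Fit LDA with an assumed number of topics; tune k for the corpus.
lda_model = LDA(k=10, maxIter=20, featuresCol="features").fit(vectorized)

# Print the top terms of each topic.
vocab = cv_model.vocabulary
for row in lda_model.describeTopics(maxTermsPerTopic=8).collect():
    print([vocab[i] for i in row.termIndices])
```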
- A Google Cloud account
- Python 3
- PySpark v2.2.1
- Gensim Mallet module
- https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
- https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
- http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA
- https://spark.apache.org/docs/2.1.0/ml-clustering.html#latent-dirichlet-allocation-lda
There are two notebooks that approach the task in quite different ways; this one is based on the Gensim toolkit.
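For reference, here is a minimal sketch of the Gensim route (pre-4.0 Gensim, which still ships the Mallet wrapper). The sample headlines, parameter values, and the Mallet path are illustrative placeholders, not the repository's actual data or configuration.

```python
# Sketch: Gensim LDA on a toy list of headlines (all values are placeholders).
import gensim
from gensim import corpora

headlines = [
    "example headline about interest rates and housing",
    "another example headline about a local sports final",
]

# Tokenize and build the dictionary / bag-of-words corpus Gensim expects.
texts = [gensim.utils.simple_preprocess(line) for line in headlines]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Plain Gensim LDA.
lda = gensim.models.LdaModel(corpus=corpus, id2word=dictionary,
                             num_topics=10, passes=5)
print(lda.print_topics(num_topics=10, num_words=8))

# Optional Mallet-backed LDA (needs a local Mallet install; path is a placeholder).
# mallet_path = "/path/to/mallet/bin/mallet"
# lda_mallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus,
#                                               id2word=dictionary, num_topics=10)
```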