Skip to content

Use of RNNs to build a TV script using 'Simpsons by the data' dataset from Kaggle

Notifications You must be signed in to change notification settings

nvmoyar/dlnd-text-generation

Repository files navigation

Deep Learning Nanodegree

Text Generation using Recurrent Neural Networks: A new script for The Simpsons

Project overview

Recurrent Neural Networks are proven a good approach to deal with sequential data like time series or meaningful text. In this project, a new script for the Simpsons TV series is created from the 'Simpsons by the data' dataset from Kaggle.

Initially, for this Udacity's project, only the graph built was trained only using all scenes at Moe's Tavern. However, for experimenting purposes, the whole dataset has been used for training in order to get a more sensical script.

Unlike the Udacity's repo, in this project we find two Jupiter notebooks:

  • Data preparation --> the additional operations to prepare a text with the whole The Simpsons Dataset from Kaggle.
  • Model + Training --> the model, the training and the predicted script.

We could visualize the input words passed as one-hot encoded vectors, and on this case, this might work if we are dealing with a small dataset, however, with a larger corpus, the dimensionality will be so huge that the computation could be unbearable. To deal with this high dimensionality problem, we would need a better vector representation for our words, since words that could be used in the same context should appear closer than unrelated words. This weight matrix is usually called the embedding matrix or embedding look-up table and Tensorflow provides a convenient function which does this lookup to get the embedding tensors. In this case, we build lookup tables for later use when creating the word embeddings.

A tokenizer is needed in order to remove unnecessary marks. The Tokenizer function outputs a dictionary that will be used to token the symbols and add the delimiter (space) around it. This separates the symbols as it's own word, making it easier for the neural network to predict the next word.

The final model is a stack of LSTM cells that will be fed in a custom length of batches of an embedding tensor randomly initialized between [-1, 1]. That means that additionally to the usual batch size, it is necessary to tune the length of the sequence of the batch jointly with the rest of hyperparameters.

Besides, the neural network uses element-wise gradient clipping when it exceeds an absolute threshold. Although it introduces an additional hyperparameter, this parameter has not been exposed jointly with the rest of the hyperparameters to tune in this neural graph. Thresholds have been manually established in [-1., 1.].

The idea is to use the final graph to build a word picker function, that will be in charge of giving a word, find the probabilities of the next word, given a word set initially as a start of the sequence.

Despite we use the whole 'Simpsons by the Data' dataset, still, the script is non-sensical, so few ideas are provided to increase the accuracy at the end of this document.

Install environment, Test

Requirements

FloydHub is a platform for training and deploying deep learning models in the cloud. It removes the hassle of launching your own cloud instances and configuring the environment. For example, FloydHub will automatically set up an AWS instance with TensorFlow, the entire Python data science toolkit, and a GPU. Then you can run your scripts or Jupyter notebooks on the instance. For this project:

floyd run --mode jupyter --gpu --env tensorflow-1.0

You can see your instance on the list is running and has ID XXXXXXXXXXXXXXXXXXXXXX. So you can stop this instance with Floyd stop XXXXXXXXXXXXXXXXXXXXXX. Also, if you want more information about that instance, use Floyd info XXXXXXXXXXXXXXXXXXXXXX

Environments

FloydHub comes with a bunch of popular deep learning frameworks such as TensorFlow, Keras, Caffe, Torch, etc. You can specify which framework you want to use by setting the environment. Here's the list of environments FloydHub has available, with more to come!

Datasets

Already included in this project in the datasets folder: The Simpsons by the Data

Usage

floyd run --gpu --env tensorflow-1.1 --mode jupyter

Output

Often you'll be writing data out, things like TensorFlow checkpoints, updated notebooks, trained models and HDF5 files. You will find all these files, you can get links to the data with:

floyd output run_ID

TODO

  • Try to reduce loss by initializing weights
  • Try to reduce loss by using pre-trained embedding weights
  • Try to reduce loss by adding Batch Normalization layers. It has been proven as perform enhancer in Speech Recognition systems.

About

Use of RNNs to build a TV script using 'Simpsons by the data' dataset from Kaggle

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published