Welcome to the hands-on assignments for our course on NLP concepts. These assignments give you practical experience with core ideas and help you explore the nuances and challenges of natural language processing.
This assignment explores the concept of the curse of dimensionality in language representation. You will create one-hot encodings for words in a dataset and compare them to word embeddings to understand the practical implications of high-dimensional representations.
- 20 Newsgroups Dataset: Approximately 20,000 newsgroup documents organized into 20 categories, suitable for generating word encodings and analyzing vocabulary size (a loading sketch follows this list).
- SMS Spam Collection Dataset: A smaller dataset of text messages labeled as spam or ham, useful for exploring word encoding on shorter texts.
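If you pick the 20 Newsgroups option, scikit-learn can fetch it directly. Below is a minimal loading-and-tokenization sketch, assuming scikit-learn is installed; the regex tokenizer is just one simple choice, not a requirement of the assignment.

```python
import re
from sklearn.datasets import fetch_20newsgroups

# Download the training split; strip headers, footers, and quotes so only body text remains
newsgroups = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

# A deliberately simple tokenizer: lowercase, keep runs of alphabetic characters only
tokenized_docs = [re.findall(r"[a-z]+", doc.lower()) for doc in newsgroups.data]

vocabulary = sorted({token for doc in tokenized_docs for token in doc})
print(f"{len(tokenized_docs)} documents, vocabulary size {len(vocabulary)}")
```

The vocabulary size you print here is the one-hot dimensionality you will be working with in the tasks below.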
- Load and preprocess the dataset, including tokenizing the text.
- Create one-hot encodings for the words in the dataset, observing the high dimensionality.
- Compute and discuss memory usage of the one-hot encodings for various vocabulary sizes.
- Compare this with pre-trained word embeddings (such as GloVe or Word2Vec) for the same text to observe the dimensionality reduction and the benefits of dense embeddings (see the sketch after this list).
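Here is a minimal sketch of the one-hot step and the back-of-the-envelope memory comparison. The toy vocabulary, the float32 storage assumption, and the 100-dimensional GloVe figure are illustrative stand-ins; swap in the vocabulary you built from your dataset and whichever embedding model you actually load.

```python
import numpy as np

# Toy vocabulary stands in for the one you build from the dataset
vocabulary = ["the", "curse", "of", "dimensionality", "in", "nlp"]
vocab_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    """Dense one-hot vector for `word`; its length equals the vocabulary size."""
    vec = np.zeros(len(vocab_index), dtype=np.float32)
    vec[vocab_index[word]] = 1.0
    return vec

print(one_hot("curse"))  # a single 1.0 in an otherwise all-zero vector

# Memory needed to store one dense float32 one-hot vector per vocabulary word
for vocab_size in (10_000, 50_000, 100_000):
    megabytes = vocab_size * vocab_size * 4 / 1e6   # vocab_size vectors x vocab_size dims x 4 bytes
    print(f"{vocab_size:>7} words -> {megabytes:>10,.0f} MB as one-hot vectors")

# The same calculation for a 100-dimensional dense embedding (e.g. GloVe)
embedding_dim = 100
for vocab_size in (10_000, 50_000, 100_000):
    megabytes = vocab_size * embedding_dim * 4 / 1e6
    print(f"{vocab_size:>7} words -> {megabytes:>10,.0f} MB as {embedding_dim}-d embeddings")

# To load actual pre-trained vectors, one option (an assumption, not required here)
# is gensim's downloader:
#   import gensim.downloader as api
#   glove = api.load("glove-wiki-gigaword-100")
#   print(glove["language"].shape)   # (100,) regardless of vocabulary size
```

Notice that the one-hot figures grow quadratically with vocabulary size, while the embedding figures grow only linearly; that gap is the practical face of the curse of dimensionality.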
- Why does the curse of dimensionality pose a challenge in natural language processing?
- How do dense embeddings improve computational efficiency compared to one-hot encoding?
- Are there trade-offs when switching from one-hot encoding to dense word embeddings?
In this assignment, you will create a simple text generator using Markov Chains. You’ll train a model on a text dataset, create bigram or trigram chains, and generate new text sequences to observe how realistic they can become.
- The Adventures of Sherlock Holmes: A single-text corpus that’s rich in vocabulary and structure.
- Shakespeare’s Complete Works: Offers structured text with dialogues, suitable for sequence modeling and generating responses.
- Load and preprocess the dataset, focusing on tokenizing and lowercasing words.
- Create a bigram or trigram Markov Chain based on word sequences.
- Generate new text by randomly selecting transitions based on the chain’s probabilities.
- Experiment with different starting words or phrases to see how the generated text changes (a minimal sketch follows this list).
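One possible shape for the generator, using a bigram chain in which each word conditions the next. The `sherlock.txt` filename, the starting word, and the 50-word output length are placeholders for whatever corpus file and settings you choose.

```python
import random
from collections import defaultdict

def build_bigram_chain(tokens):
    """Map each word to the list of words observed to follow it in the corpus."""
    chain = defaultdict(list)
    for current_word, next_word in zip(tokens, tokens[1:]):
        chain[current_word].append(next_word)
    return chain

def generate(chain, start_word, length=50):
    """Walk the chain, sampling each next word in proportion to how often it followed."""
    word, output = start_word, [start_word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:          # dead end: this word was never followed by anything
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

# `sherlock.txt` is a placeholder for the plain-text corpus you downloaded
with open("sherlock.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

chain = build_bigram_chain(tokens)
print(generate(chain, start_word="holmes"))
```

For a trigram chain, key the dictionary on `(previous_word, current_word)` pairs instead of single words; the sampling loop stays the same.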
- How does the n-gram order of the chain (unigram, bigram, trigram) affect the coherence of the generated text?
- What are the limitations of Markov Chains in generating human-like text?
- What improvements might you suggest to make the generated text more realistic?
In this assignment, you will create a basic rule-based chatbot using a Markov Chain model trained on text conversations. This chatbot will respond to simple queries based on word patterns and sequences in the text. Test its “human-like” qualities by having others interact with it and give feedback.
- Cornell Movie Dialogues Dataset: Contains dialogue snippets from movies, ideal for conversational modeling.
- Persona-Chat Dataset: A conversational dataset with various dialogue examples, good for creating chat patterns.
- Load and preprocess the dataset, structuring it as a series of dialogue exchanges.
- Use a bigram or trigram Markov Chain to model response sequences based on user inputs.
- Implement simple keyword matching to improve response relevance.
- Test the chatbot’s responses by interacting with it and making adjustments to improve coherence (a minimal sketch follows this list).
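A minimal sketch of one way to combine the two ideas: keyword overlap picks a seed word from the user’s message, and a bigram chain trained on the dialogue text continues from it. The `dialogues.txt` file is a placeholder for the text you extract from either corpus; each dataset has its own format and needs its own parsing step.

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Bigram chain: each word maps to the list of words observed to follow it."""
    chain = defaultdict(list)
    for current_word, next_word in zip(tokens, tokens[1:]):
        chain[current_word].append(next_word)
    return chain

def respond(chain, user_message, max_words=25):
    """Seed the chain with a keyword from the user's message if one is known."""
    known = [w for w in user_message.lower().split() if w in chain]
    word = random.choice(known) if known else random.choice(list(chain))
    reply = [word]
    for _ in range(max_words - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        reply.append(word)
    return " ".join(reply)

# `dialogues.txt` is a placeholder: dialogue text after your own preprocessing
with open("dialogues.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

chain = build_chain(tokens)
while True:
    message = input("you> ")
    if message.strip().lower() in {"quit", "exit"}:
        break
    print("bot>", respond(chain, message))
```

The keyword step is deliberately crude; as you test the bot, consider weighting seed words by frequency or matching against whole response candidates instead of single words.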
- How well does the chatbot handle ambiguous or complex questions?
- How might you modify this chatbot to make it more adaptable to different conversation topics?
- What are the ethical implications of deploying simplistic chatbots in real-world scenarios?
These assignments will give you practical experience with critical NLP concepts, including the curse of dimensionality in word representations, language generation with Markov Chains, and the basics of conversational modeling. Good luck, and enjoy exploring these foundational topics in NLP!