Welcome to the hands-on assignments for our course on NLP concepts. These assignments give you practical experience with core ideas and help you explore the nuances and challenges of natural language processing.
This assignment explores the concept of the curse of dimensionality in language representation. You will create one-hot encodings for words in a dataset and compare them to word embeddings to understand the practical implications of high-dimensional representations.
- 20 Newsgroups Dataset: Approximately 20,000 newsgroup documents organized into 20 categories, suitable for generating word encodings and analyzing vocabulary size (a loading sketch follows this list).
- SMS Spam Collection Dataset: A smaller dataset of text messages labeled as spam or ham, useful for exploring word encoding on shorter texts.
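If you pick the 20 Newsgroups option, scikit-learn can fetch it directly. Below is a minimal loading-and-tokenization sketch, assuming scikit-learn is installed; the regex tokenizer is just one simple choice, not a requirement of the assignment.

```python
import re
from sklearn.datasets import fetch_20newsgroups

# Download the training split; strip headers, footers, and quotes so only body text remains
newsgroups = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

# A deliberately simple tokenizer: lowercase, keep runs of alphabetic characters only
tokenized_docs = [re.findall(r"[a-z]+", doc.lower()) for doc in newsgroups.data]

vocabulary = sorted({token for doc in tokenized_docs for token in doc})
print(f"{len(tokenized_docs)} documents, vocabulary size {len(vocabulary)}")
```

The vocabulary size you print here is the one-hot dimensionality you will be working with in the tasks below.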
- Load and preprocess the dataset, including tokenizing the text.
- Create one-hot encodings for the words in the dataset, observing the high dimensionality.
- Compute and discuss memory usage of the one-hot encodings for various vocabulary sizes.
- Compare this with pre-trained word embeddings (such as GloVe or Word2Vec) for the same text to observe the dimensionality reduction and the benefits of dense embeddings (see the sketch after this list).
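Here is a minimal sketch of the one-hot step and the back-of-the-envelope memory comparison. The toy vocabulary, the float32 storage assumption, and the 100-dimensional GloVe figure are illustrative stand-ins; swap in the vocabulary you built from your dataset and whichever embedding model you actually load.

```python
import numpy as np

# Toy vocabulary stands in for the one you build from the dataset
vocabulary = ["the", "curse", "of", "dimensionality", "in", "nlp"]
vocab_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    """Dense one-hot vector for `word`; its length equals the vocabulary size."""
    vec = np.zeros(len(vocab_index), dtype=np.float32)
    vec[vocab_index[word]] = 1.0
    return vec

print(one_hot("curse"))  # a single 1.0 in an otherwise all-zero vector

# Memory needed to store one dense float32 one-hot vector per vocabulary word
for vocab_size in (10_000, 50_000, 100_000):
    megabytes = vocab_size * vocab_size * 4 / 1e6   # vocab_size vectors x vocab_size dims x 4 bytes
    print(f"{vocab_size:>7} words -> {megabytes:>10,.0f} MB as one-hot vectors")

# The same calculation for a 100-dimensional dense embedding (e.g. GloVe)
embedding_dim = 100
for vocab_size in (10_000, 50_000, 100_000):
    megabytes = vocab_size * embedding_dim * 4 / 1e6
    print(f"{vocab_size:>7} words -> {megabytes:>10,.0f} MB as {embedding_dim}-d embeddings")

# To load actual pre-trained vectors, one option (an assumption, not required here)
# is gensim's downloader:
#   import gensim.downloader as api
#   glove = api.load("glove-wiki-gigaword-100")
#   print(glove["language"].shape)   # (100,) regardless of vocabulary size
```

Notice that the one-hot figures grow quadratically with vocabulary size, while the embedding figures grow only linearly; that gap is the practical face of the curse of dimensionality.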
- Why does the curse of dimensionality pose a challenge in natural language processing?
- How do dense embeddings improve computational efficiency compared to one-hot encoding?
- Are there trade-offs when switching from one-hot encoding to dense word embeddings?
In this assignment, you will create a simple text generator using Markov Chains. You’ll train a model on a text dataset, create bigram or trigram chains, and generate new text sequences to observe how realistic they can become.
- The Adventures of Sherlock Holmes: A single-text corpus that’s rich in vocabulary and structure.
- Shakespeare’s Complete Works: Offers structured text with dialogues, suitable for sequence modeling and generating responses.
- Load and preprocess the dataset, focusing on tokenizing and lowercasing words.
- Create a bigram or trigram Markov Chain based on word sequences.
- Generate new text by randomly selecting transitions based on the chain’s probabilities.
- Experiment with different starting words or phrases to see how the generated text changes (a minimal sketch follows this list).
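One possible shape for the generator, using a bigram chain in which each word conditions the next. The `sherlock.txt` filename, the starting word, and the 50-word output length are placeholders for whatever corpus file and settings you choose.

```python
import random
from collections import defaultdict

def build_bigram_chain(tokens):
    """Map each word to the list of words observed to follow it in the corpus."""
    chain = defaultdict(list)
    for current_word, next_word in zip(tokens, tokens[1:]):
        chain[current_word].append(next_word)
    return chain

def generate(chain, start_word, length=50):
    """Walk the chain, sampling each next word in proportion to how often it followed."""
    word, output = start_word, [start_word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:          # dead end: this word was never followed by anything
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

# `sherlock.txt` is a placeholder for the plain-text corpus you downloaded
with open("sherlock.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

chain = build_bigram_chain(tokens)
print(generate(chain, start_word="holmes"))
```

For a trigram chain, key the dictionary on `(previous_word, current_word)` pairs instead of single words; the sampling loop stays the same.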
- How does the n-gram order of the chain (unigram, bigram, trigram) affect the coherence of the generated text?
- What are the limitations of Markov Chains in generating human-like text?
- What improvements might you suggest to make the generated text more realistic?
In this assignment, you will create a basic rule-based chatbot using a Markov Chain model trained on text conversations. This chatbot will respond to simple queries based on word patterns and sequences in the text. Test its “human-like” qualities by having others interact with it and give feedback.
- Cornell Movie Dialogues Dataset: Contains dialogue snippets from movies, ideal for conversational modeling.
- Persona-Chat Dataset: A conversational dataset with various dialogue examples, good for creating chat patterns.
- Load and preprocess the dataset, structuring it as a series of dialogue exchanges.
- Use a bigram or trigram Markov Chain to model response sequences based on user inputs.
- Implement simple keyword matching to improve response relevance.
- Test the chatbot’s responses by interacting with it and making adjustments to improve coherence (a minimal sketch follows this list).
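A minimal sketch of one way to combine the two ideas: keyword overlap picks a seed word from the user’s message, and a bigram chain trained on the dialogue text continues from it. The `dialogues.txt` file is a placeholder for the text you extract from either corpus; each dataset has its own format and needs its own parsing step.

```python
import random
from collections import defaultdict

def build_chain(tokens):
    """Bigram chain: each word maps to the list of words observed to follow it."""
    chain = defaultdict(list)
    for current_word, next_word in zip(tokens, tokens[1:]):
        chain[current_word].append(next_word)
    return chain

def respond(chain, user_message, max_words=25):
    """Seed the chain with a keyword from the user's message if one is known."""
    known = [w for w in user_message.lower().split() if w in chain]
    word = random.choice(known) if known else random.choice(list(chain))
    reply = [word]
    for _ in range(max_words - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        reply.append(word)
    return " ".join(reply)

# `dialogues.txt` is a placeholder: dialogue text after your own preprocessing
with open("dialogues.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

chain = build_chain(tokens)
while True:
    message = input("you> ")
    if message.strip().lower() in {"quit", "exit"}:
        break
    print("bot>", respond(chain, message))
```

The keyword step is deliberately crude; as you test the bot, consider weighting seed words by frequency or matching against whole response candidates instead of single words.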
- How well does the chatbot handle ambiguous or complex questions?
- How might you modify this chatbot to make it more adaptable to different conversation topics?
- What are the ethical implications of deploying simplistic chatbots in real-world scenarios?
These assignments will give you practical experience with critical NLP concepts, including the curse of dimensionality in word representations, language generation with Markov Chains, and the basics of conversational modeling. Good luck, and enjoy exploring these foundational topics in NLP!