An index of Natural Language Processing (NLP) concepts serves as a foundational glossary for students beginning their journey into this complex field. Here are some key concepts; short code sketches illustrating several of them follow the list:
- Tokenization: The process of splitting text into individual words or phrases.
- Corpus: A large collection of text data used for NLP tasks.
- Stemming: Reducing words to their root form by removing suffixes.
- Lemmatization: Converting words to their dictionary form by considering the context.
- Part-of-Speech Tagging: Identifying the grammatical parts of speech of each word in a sentence.
- Named Entity Recognition (NER): Identifying and classifying named entities (people, places, organizations) in text.
- Word Embeddings: Representation of words in a continuous vector space where similar words have similar representations.
- Word2Vec: A method for producing word embeddings with a shallow neural network trained to predict a word from its context (or the context from a word).
- GloVe (Global Vectors): An unsupervised learning algorithm for generating word embeddings by aggregating global word-word co-occurrence statistics from a corpus.
- Syntax Tree: A tree representation of the syntactic structure of sentences.
- Dependency Parsing: Analyzing the grammatical structure of a sentence to establish relationships between "head" words and words that modify those heads.
- Bag of Words (BoW): A representation of text that describes the occurrence of words within a document.
- TF-IDF (Term Frequency-Inverse Document Frequency): A numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
- N-grams: Contiguous sequences of n items from a given sample of text or speech.
- Stop Words: Commonly used words (such as "the", "a", "an", "in") which are typically ignored in NLP tasks.
- Recurrent Neural Networks (RNNs): A class of neural networks for processing sequential data.
- Long Short-Term Memory (LSTM): A special kind of RNN capable of learning long-term dependencies.
- Gated Recurrent Unit (GRU): A gated RNN similar to the LSTM but simpler: it merges the forget and input gates into a single update gate and has no separate cell state.
- Attention Mechanism: A component of neural networks that lets the model weigh the most relevant parts of the input when producing each part of the output.
- Transformer Architecture: A model architecture that uses self-attention mechanisms and has become the basis for many state-of-the-art NLP models.
- BERT (Bidirectional Encoder Representations from Transformers): A method of pre-training language representations which obtains state-of-the-art results on a wide array of NLP tasks.
- GPT (Generative Pretrained Transformer): An autoregressive language model that uses deep learning to produce human-like text.
- Language Modeling: The task of predicting the next word in a sentence given the previous words.
- Machine Translation: The task of automatically converting text from one language to another.
- Text Classification: The task of assigning predefined categories to text.
- Sentiment Analysis: The process of determining the emotional tone of a text to understand the attitudes, opinions, and emotions expressed.
- Topic Modeling: The task of identifying topics that best describe a set of documents.
- Dialog Systems and Chatbots: Computer systems designed to converse with human users via natural language.
- Speech Recognition: The process of converting spoken words into text.
- Natural Language Generation (NLG): The task of producing natural language text from structured data or another machine representation.
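
To make the first few entries concrete, here is a minimal sketch of tokenization, stemming, and lemmatization using NLTK. The `nltk.download` calls fetch the tokenizer and WordNet data on first run, and the example sentence is an arbitrary choice:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources
# (recent NLTK releases may additionally need "punkt_tab").
nltk.download("punkt")
nltk.download("wordnet")

text = "The striped bats are hanging on their feet"
tokens = nltk.word_tokenize(text)          # tokenization: text -> word list
print(tokens)

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])   # crude suffix stripping: "hanging" -> "hang"

lemmatizer = WordNetLemmatizer()
# Lemmatization consults a dictionary (WordNet) and a part-of-speech hint.
print(lemmatizer.lemmatize("hanging", pos="v"))  # -> "hang"
print(lemmatizer.lemmatize("feet"))              # -> "foot" (stemming cannot do this)
```

Note the last line: a stemmer only strips suffixes, while the lemmatizer maps the irregular plural "feet" back to its dictionary form "foot".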
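Part-of-speech tagging and NER, sketched with NLTK's built-in tagger and chunker. The resource names assume a classic NLTK install; newer releases may package them under slightly different names:

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

tokens = nltk.word_tokenize("Ada Lovelace worked with Charles Babbage in London.")

tagged = nltk.pos_tag(tokens)   # e.g. ('Ada', 'NNP'), ('worked', 'VBD'), ...
print(tagged)

# ne_chunk groups tagged tokens into named-entity subtrees (PERSON, GPE, ...).
tree = nltk.ne_chunk(tagged)
print(tree)
```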
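A small Word2Vec sketch with gensim. The three-sentence corpus is a toy stand-in, since useful embeddings require millions of tokens; pretrained GloVe vectors can similarly be loaded through gensim's `KeyedVectors`:

```python
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens.
sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["language", "models", "process", "natural", "text"],
    ["word", "embeddings", "map", "words", "to", "vectors"],
]

# Train a small skip-gram model (sg=1); vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["language"].shape)               # (50,): one dense vector per word
print(model.wv.most_similar("language", topn=3))  # nearest neighbors in vector space
```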
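Dependency parsing, sketched with spaCy. This assumes the small English model has been installed separately with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The quick brown fox jumps over the lazy dog")
for token in doc:
    # Each token points to its syntactic head with a dependency label.
    print(f"{token.text:<6} --{token.dep_}--> {token.head.text}")
```

Printing `token.dep_` and `token.head` makes the head-modifier relationships in the definition above directly visible, e.g. "fox" is the nominal subject of "jumps".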
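Bag of Words and TF-IDF side by side with scikit-learn, over three toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

bow = CountVectorizer()
X_counts = bow.fit_transform(docs)      # raw term counts per document
print(bow.get_feature_names_out())
print(X_counts.toarray())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)     # counts reweighted by rarity across docs
print(X_tfidf.toarray().round(2))
```

Comparing the two matrices shows the point of TF-IDF: words that appear in every document (like "the") are downweighted relative to words specific to one document.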
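N-grams and stop-word filtering need nothing beyond plain Python plus scikit-learn's built-in English stop-word list:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def ngrams(tokens, n):
    """Return all contiguous n-token sequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(ngrams(tokens, 2))   # bigrams: ('the', 'quick'), ('quick', 'brown'), ...

# Stop-word removal keeps only the content-bearing words.
content = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(content)             # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```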
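The RNN variants above, sketched with PyTorch's built-in `nn.LSTM` and `nn.GRU` modules; the tensor shapes here are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# A batch of 2 sequences, each 7 steps long, with 10 features per step.
x = torch.randn(2, 7, 10)

lstm = nn.LSTM(input_size=10, hidden_size=16, batch_first=True)
out, (h_n, c_n) = lstm(x)   # LSTM tracks both a hidden state and a cell state
print(out.shape)            # torch.Size([2, 7, 16]): one output per time step

gru = nn.GRU(input_size=10, hidden_size=16, batch_first=True)
out, h_n = gru(x)           # GRU has fewer gates and no separate cell state
print(out.shape)            # torch.Size([2, 7, 16])
```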
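The attention mechanism at the heart of the Transformer reduces to one equation, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, shown here in plain NumPy with random placeholder matrices:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query positions, dimension 4
K = rng.normal(size=(5, 4))   # 5 key/value positions
V = rng.normal(size=(5, 4))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # (3, 4): one weighted summary of V per query
print(w.sum(axis=-1)) # each row of attention weights sums to 1
```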
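BERT and GPT can both be tried through the Hugging Face `transformers` pipeline API; each call downloads pretrained weights on first run, and the generated text will vary between runs:

```python
from transformers import pipeline

# BERT was pre-trained with masked-word prediction, so it can fill in blanks.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Paris is the capital of [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))

# GPT-2 is autoregressive: it continues a prompt one token at a time.
generate = pipeline("text-generation", model="gpt2")
print(generate("Natural language processing is", max_new_tokens=20)[0]["generated_text"])
```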
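Language modeling in its simplest form is a bigram model, sketched in plain Python over a toy corpus:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the log".split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Estimate P(next | word) from relative bigram frequency."""
    counts = follows[word]
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common()]

print(predict_next("the"))  # [('cat', 0.25), ('mat', 0.25), ('dog', 0.25), ('log', 0.25)]
print(predict_next("sat"))  # [('on', 1.0)]
```

Neural language models like GPT replace these counted tables with learned parameters, but the task is the same: a probability distribution over the next word.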
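Finally, a toy text-classification/sentiment sketch with scikit-learn. Four hand-written training examples stand in for a real labeled dataset, so treat the predicted labels as likely rather than guaranteed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I loved this movie, it was wonderful",
    "what a fantastic, uplifting story",
    "this film was terrible and boring",
    "I hated every minute of it",
]
labels = ["pos", "pos", "neg", "neg"]

# TF-IDF features feeding a linear classifier: a classic text-classification baseline.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["an absolutely wonderful film"]))  # likely ['pos']
print(clf.predict(["boring and terrible"]))           # likely ['neg']
```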