Text: Sentiment and Emotional Analysis #153

Open · adi611 wants to merge 10 commits into main from 34-text-sentiment-analysis
Conversation

@adi611 adi611 commented Sep 11, 2024

No description provided.

pyproject.toml Outdated
@@ -57,6 +56,7 @@ umap-learn = "~=0.5"
scikit-learn = "~=1.5"
nltk = "~=3.8"
vocos = "~=0.1"
huggingface-hub = {extras = ["cli"], version = "^0.24.6"}
Collaborator:

why is cli needed?

Author (adi611):

Not needed; I planned to remove this in the final PR.

from enum import Enum


class Emotion(Enum):
Collaborator:

I am not sure that pre-defining the emotional classes is a good idea. What if someone wants to use a model that can capture an emotion that is not in your pool?

Author (adi611):

You're right; I will make changes to the test cases as well.

if not pieces_of_text:
raise ValueError("Input list is empty or None.")

if model is None:
Collaborator:

Why don't you specify the default in the param definition (L21)?

def analyze_emotion(
pieces_of_text: List[str],
device: Optional[DeviceType] = None,
model: Optional[HFModel] = None,
Collaborator:

I think model should be of type SenselabModel. This way, we can accept models that are not necessarily HF models.

device: Optional[DeviceType] = None,
model: Optional[HFModel] = None,
max_length: int = 512,
overlap: int = 128,
Collaborator:

max_length and overlap look like model-specific params and shouldn't be in the general emotion recognition API. You can obtain them as part of **kwargs.
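For illustration, here is roughly what the suggested signature could look like once the model-specific params move into **kwargs and the model type is widened to SenselabModel; the import paths below are assumptions, not the PR's actual code.

from typing import Any, Dict, List, Optional

from senselab.utils.data_structures import DeviceType, SenselabModel  # assumed import path


def analyze_emotion(
    pieces_of_text: List[str],
    device: Optional[DeviceType] = None,
    model: Optional[SenselabModel] = None,  # the default model could also be set right here
    **kwargs: Any,  # model-specific params such as max_length and overlap
) -> List[Dict[str, Any]]:
    ...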

emotion_labels = model_instance.config.id2label

results: List[Dict[str, Any]] = []
for text in pieces_of_text:
Collaborator:

What is the advantage of using the tokenizer and the model instead of working with the Hugging Face text classification pipeline? https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline

Author (adi611), Sep 13, 2024:

I was mainly using the model directly to process long texts, but the same can be done with a pipeline:

Pipeline approach: text → text chunks → pass chunks to the pipeline → average the scores for each label.
Direct model approach: text → tokenized chunks → pass chunks to the model → softmax for probabilities → average the scores for each label.

I'll switch to using the pipeline for simplicity, though I'll still need the tokenizer for chunking.
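For concreteness, a minimal sketch of that pipeline-based flow (not the PR's actual code); the model name is only illustrative, and chunk_text here is a simplified stand-in for the PR's helper.

from collections import defaultdict
from typing import Dict, List

from transformers import AutoTokenizer, pipeline

MODEL_NAME = "j-hartmann/emotion-english-distilroberta-base"  # illustrative emotion model
MAX_LENGTH = 512
OVERLAP = 128

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
classifier = pipeline("text-classification", model=MODEL_NAME, tokenizer=tokenizer, top_k=None)


def chunk_text(text: str, max_length: int = MAX_LENGTH, overlap: int = OVERLAP) -> List[str]:
    """Split text into overlapping chunks that fit within the model's token limit."""
    budget = max_length - tokenizer.num_special_tokens_to_add()  # leave room for special tokens
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = budget - overlap
    return [tokenizer.decode(token_ids[i : i + budget]) for i in range(0, len(token_ids), step)]


def analyze_long_text(text: str) -> Dict[str, float]:
    """Text -> chunks -> pipeline -> averaged score per label."""
    chunks = chunk_text(text)
    if not chunks:
        return {}
    totals: Dict[str, float] = defaultdict(float)
    for chunk_scores in classifier(chunks):
        for item in chunk_scores:
            totals[item["label"]] += item["score"]
    return {label: total / len(chunks) for label, total in totals.items()}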

adi611 marked this pull request as ready for review on September 16, 2024, 17:41.
adi611 commented Sep 16, 2024

Hi @fabiocat93 - please review. I've made the requested changes and tried to make the code as extensible as possible. If a new model type needs to be added in the future, we won’t need to modify the existing code - just create utils for it and add it to the MODEL_TYPE_TO_UTILS dict. These utility classes could also be useful elsewhere in the code.
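For reference, a minimal sketch of the registry idea described above; the helper name get_utils_for and the data-structure import path are illustrative assumptions rather than the PR's final code.

from typing import Dict, Type

from senselab.utils.data_structures import HFModel, SenselabModel  # assumed import path
from senselab.utils.model_utils import BaseModelSourceUtils, HFUtils

MODEL_TYPE_TO_UTILS: Dict[Type[SenselabModel], Type[BaseModelSourceUtils]] = {
    HFModel: HFUtils,
    # Supporting a new model type means registering its utils class here,
    # without touching the existing analysis code:
    # NewModelType: NewModelUtils,
}


def get_utils_for(model: SenselabModel) -> Type[BaseModelSourceUtils]:
    """Look up the utils class for a given model instance."""
    utils = MODEL_TYPE_TO_UTILS.get(type(model))
    if utils is None:
        raise NotImplementedError(f"Model type '{type(model).__name__}' is not supported.")
    return utils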

adi611 force-pushed the 34-text-sentiment-analysis branch from 602faff to 426e19d on September 17, 2024, 18:01
adi611 force-pushed the 34-text-sentiment-analysis branch from 426e19d to dcca19b on September 17, 2024, 18:04
for text in input_data:
cls.validate_input(text)

chunks = chunk_text(text=text, tokenizer=tokenizer, max_length=max_length, overlap=overlap)
Collaborator:

From what I understand, pipeline internally implements chunk batching (https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-chunk-batching). Can you please clarify why we can't use that directly?

Author (adi611):

The ChunkPipeline is meant only for zero-shot-classification and question-answering tasks, not for text-classification. It throws an error if the text goes over the max token limit. I tried it with a few BERT-based models that have a 512 token limit, and it fails when the length exceeds that. But please let me know if I am missing something here.

List[Dict[str, Any]]: A list of dictionaries containing emotional analysis results.
"""
model_type = type(model)
model_utils = MODEL_TYPE_TO_UTILS.get(model_type)
Collaborator:

If you use a pipeline, you can simply pass the model's name and the revision. What's the advantage here of using the model + tokenizer?

Author (adi611):

Right, I am using the tokenizer only to handle chunking for long texts; the classifier itself still comes from the pipeline. I haven’t come across a robust method for handling long texts in text classification tasks beyond this. From what I have tested and read in the HF docs, the chunk batching pipeline is only supported for zero-shot-classification and question-answering, not for text-classification.

from senselab.utils.model_utils import BaseModelSourceUtils


class BaseAnalysis(ABC):
Collaborator:

This is an interesting interface. Some thoughts:

  • I am curious to see how it generalizes to other tasks.
  • It may be good to include a validate_output method as well (a rough sketch follows below).
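A rough sketch of what such a hook could look like; validate_input mirrors the PR, but the output format checked here is only illustrative, not the PR's actual structure.

from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseAnalysis(ABC):
    @classmethod
    @abstractmethod
    def validate_input(cls, input_data: Any) -> None:
        """Raise if the input is malformed (e.g., empty or None)."""

    @classmethod
    def validate_output(cls, output_data: List[Dict[str, Any]]) -> None:
        """Raise if the analysis output is malformed (e.g., a score outside [0, 1])."""
        for result in output_data:
            for label, score in result.get("scores", {}).items():
                if not 0.0 <= score <= 1.0:
                    raise ValueError(f"Score for '{label}' is out of range: {score}")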

)

tokenizer = model_utils.get_tokenizer(task="sentiment-analysis")
pipe = model_utils.get_pipeline(task="sentiment-analysis", device=device, torch_dtype=torch_dtype, top_k=None)
Collaborator:

Is there an advantage in handling SentimentAnalysis with a different pipeline than "text-classification"?

Author (adi611):

Not really an advantage; it actually uses the same TextClassificationPipeline as text-classification. But I’ve noticed in the docs that sentiment-analysis is often mentioned as the task when building a sentiment analysis classifier. I guess that’s just to explicitly define the type of text classification; functionality-wise, it’s not really different.
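For what it's worth, this is easy to confirm outside the PR: in transformers, "sentiment-analysis" is an alias for "text-classification", so both task strings produce a TextClassificationPipeline (small check below; the default models are downloaded if none is specified).

from transformers import TextClassificationPipeline, pipeline

# Both task strings resolve to the same pipeline class.
sentiment_pipe = pipeline("sentiment-analysis")
text_clf_pipe = pipeline("text-classification")

assert isinstance(sentiment_pipe, TextClassificationPipeline)
assert isinstance(text_clf_pipe, TextClassificationPipeline)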

from senselab.utils.model_utils import HFUtils


class Emotion(Enum):
Collaborator:

Do we really need this enum?

Author (adi611):

I'm using this in the tests to avoid hardcoding strings, but it's not really used in the actual classification code.

from senselab.utils.tasks.chunking import chunk_text


class EmotionAnalysis(BaseTextAnalysis):
Collaborator:

Clarify that this class is for emotion analysis with HF.

Author (adi611):

Yes, I will make the change. My initial thought was to use the same class for every model type, including HF, and have the utils classes return the classifier via a get_classifier method. But you’re right: different model types might have different output formats, and my approach would fail then. I should’ve shared an LLD doc first to clear this up.

@@ -0,0 +1,160 @@
"""This module provides utility classes for handling common utilities based on model type."""
Collaborator:

This is HF-specific; not every framework is structured with tokenizers, feature extractors, models, and pipelines. I still wonder if we really need all this or if it's just an over-complication. Please help me understand your choice.

Author (adi611):

Thank you for your feedback. The plan was to have functions like get_classifier (now get_pipeline) and get_tokenizer to create a standardized way to access these components across different utils, making it easier to work with various models. You’d just call model_utils.get_classifier() to get the classifier and model_utils.get_tokenizer() to fetch the tokenizer for the model. If a tokenizer is not available for a model and a long text is passed that exceeds the token limit, we could raise an error. But I understand your concern about complexity and can rework the whole thing to work as it does currently, with if-else checks at the api.py level.
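A minimal sketch of the standardized access described here, using the method names from the comment (get_pipeline, get_tokenizer); the exact signatures are assumptions, and returning None from get_tokenizer is one possible way to let callers raise a clear error when long texts need chunking.

from abc import ABC, abstractmethod
from typing import Any, Optional


class BaseModelSourceUtils(ABC):
    """Framework-agnostic access to the components an analysis task needs."""

    @abstractmethod
    def get_pipeline(self, task: str, **kwargs: Any) -> Any:
        """Return a ready-to-use classifier for the given task."""

    @abstractmethod
    def get_tokenizer(self, task: str) -> Optional[Any]:
        """Return a tokenizer for chunking, or None if the framework has none."""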

from transformers import AutoTokenizer


def chunk_text(text: str, tokenizer: AutoTokenizer, max_length: int, overlap: int) -> List[str]:
Collaborator:

  1. This is a text-specific utility and not a general utility.
  2. I think chunking may already be managed inside the text pipeline (at least for HF).

fabiocat93 (Collaborator):
> Hi @fabiocat93 - please review. I've made the requested changes and tried to make the code as extensible as possible. If a new model type needs to be added in the future, we won’t need to modify the existing code - just create utils for it and add it to the MODEL_TYPE_TO_UTILS dict. These utility classes could also be useful elsewhere in the code.

Thank you @adi611. I have left some comments and questions. Please feel free to address them to help me understand your way of thinking better.

adi611 commented Sep 26, 2024

> Hi @fabiocat93 - please review. I've made the requested changes and tried to make the code as extensible as possible. If a new model type needs to be added in the future, we won’t need to modify the existing code - just create utils for it and add it to the MODEL_TYPE_TO_UTILS dict. These utility classes could also be useful elsewhere in the code.

> Thank you @adi611. I have left some comments and questions. Please feel free to address them to help me understand your way of thinking better.

Hi, I was thinking we could have one EmotionAnalysis and one SentimentAnalysis class to handle all model types, and use MODEL_TYPE_TO_UTILS to get the utils for a model type. The BaseModelSourceUtils could have something like a get_classifier abstract method, which the concrete utils classes would implement to get the right classifier for each model type. So we wouldn’t have to update existing code if support for more model types is added; we'd just create the new utils class and add it to MODEL_TYPE_TO_UTILS. But I get your point: it might oversimplify things across models and not always be practical. I should’ve written up a design doc first, gotten your feedback, and then moved ahead with the code. I’ll make sure to do that going forward to keep things smoother!

adi611 commented Sep 28, 2024

Proposed Changes

1. Base Analysis Classes

  • Add validate_output method to BaseAnalysis and BaseTextAnalysis.
  • Move BaseTextAnalysis to something like senselab/text/utils/interfaces.py.

2. Chunking Strategy

  • If custom chunking is needed, move it to senselab/text/utils/.

3. Restructure Utilities

  • Remove BaseModelSourceUtils. HFUtils will not be a derived class and can still be used across the codebase to avoid repeating similar pipeline code.

4. Decouple Analysis Logic from Model Type

def analyze_emotion(...):
    if isinstance(model, HFModel):
        return HFEmotionAnalysis.analyze(input_data=pieces_of_text, device=device, **kwargs)
    else:
        raise NotImplementedError("The specified model type is not supported.")

5. Emotion Enum

  • Remove the Emotion enum and use string values in the tests.
  • Or move it to the test file, since it is only used there.

Please review and provide feedback. cc: @fabiocat93

adi611 commented Oct 3, 2024

Hi @fabiocat93, let me know if there's another task I can start working on in the meantime.
