Text: Sentiment and Emotional Analysis #153
base: main
Conversation
pyproject.toml (outdated)
@@ -57,6 +56,7 @@
 umap-learn = "~=0.5"
 scikit-learn = "~=1.5"
 nltk = "~=3.8"
 vocos = "~=0.1"
+huggingface-hub = {extras = ["cli"], version = "^0.24.6"}
why is `cli` needed?
not needed, planned to remove this in the final PR
from enum import Enum


class Emotion(Enum):
I am not sure that pre-defining the emotional classes is a good idea. What if someone wants to use a model that can capture an emotion that is not in your pool?
you're right, will make changes to the test cases as well.
if not pieces_of_text:
    raise ValueError("Input list is empty or None.")

if model is None:
why don't you specify the default in the param definition space (L21)?
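A minimal sketch of what the reviewer is suggesting: resolve the default in the signature rather than with a `model is None` branch in the body. The import paths, the `HFModel` field names, and the default checkpoint are assumptions for illustration, not the PR's code.

```python
from typing import List, Optional

from senselab.utils.data_structures.device import DeviceType  # assumed import path
from senselab.utils.data_structures.model import HFModel      # assumed import path


def analyze_emotion(
    pieces_of_text: List[str],
    device: Optional[DeviceType] = None,
    # default declared here instead of checking `if model is None` inside the body
    model: HFModel = HFModel(
        path_or_uri="j-hartmann/emotion-english-distilroberta-base",  # illustrative default
        revision="main",
    ),
) -> None:
    ...
```

One trade-off of this pattern is that the default object is constructed once at import time, so a `None` sentinel plus an in-body default is also a defensible choice.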
def analyze_emotion(
    pieces_of_text: List[str],
    device: Optional[DeviceType] = None,
    model: Optional[HFModel] = None,
I think `model` should be of type `SenselabModel`. This way, we can accept models that are not necessarily HF models.
    device: Optional[DeviceType] = None,
    model: Optional[HFModel] = None,
    max_length: int = 512,
    overlap: int = 128,
`max_length` and `overlap` look like model-specific params and shouldn't be in the general emotion recognition API. You can obtain them as part of `**kwargs`.
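A short sketch of how that could look, with the chunking knobs pulled from `**kwargs` only where the HF-specific path needs them; the type names are placeholders for the senselab data structures and the defaults are assumptions.

```python
from typing import Any, Dict, List, Optional


def analyze_emotion(
    pieces_of_text: List[str],
    device: Optional["DeviceType"] = None,    # assumed senselab type
    model: Optional["SenselabModel"] = None,  # assumed senselab type
    **kwargs: Any,  # model-specific options, e.g. max_length / overlap for HF chunking
) -> List[Dict[str, Any]]:
    # Only the HF backend cares about chunking parameters; other backends
    # simply never read these keys.
    max_length = kwargs.pop("max_length", 512)
    overlap = kwargs.pop("overlap", 128)
    # ... dispatch to the backend-specific analysis here ...
    raise NotImplementedError
```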
emotion_labels = model_instance.config.id2label

results: List[Dict[str, Any]] = []
for text in pieces_of_text:
what is the advantage of using the tokenizer and the model instead of working with the huggingface text classification pipeline? https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline
I was mainly using the model directly for processing long texts, but can do the same with a pipeline:
- Pipeline: text → text chunks → pass chunks to the pipeline → average the scores for each label.
- Model (current): text → tokenized chunks → pass chunks to the model → softmax for probabilities → average the scores for each label.
I'll switch to using the pipeline for simplicity; I'll still need the tokenizer for chunking.
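For illustration, a rough sketch of that pipeline-based flow (chunk, classify each chunk, average the per-label scores). The checkpoint name, chunk sizes, and result handling are assumptions, not the PR's implementation.

```python
from collections import defaultdict
from typing import Dict, List

from transformers import AutoTokenizer, pipeline

model_name = "j-hartmann/emotion-english-distilroberta-base"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("text-classification", model=model_name, top_k=None)

text = "I am thrilled about how well this worked! " * 200  # longer than the token limit
token_ids = tokenizer.encode(text, add_special_tokens=False)

# Split the token ids into overlapping windows and decode each one back to text.
max_length, overlap = 510, 128
chunks = [
    tokenizer.decode(token_ids[i : i + max_length], skip_special_tokens=True)
    for i in range(0, len(token_ids), max_length - overlap)
]

# Classify every chunk and average the scores per label.
per_label: Dict[str, List[float]] = defaultdict(list)
for chunk_scores in classifier(chunks):
    for item in chunk_scores:
        per_label[item["label"]].append(item["score"])

averaged = {label: sum(vals) / len(vals) for label, vals in per_label.items()}
print(averaged)
```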
Hi @fabiocat93 - please review. I've made the requested changes and tried to make the code as extensible as possible. If a new model type needs to be added in the future, we won't need to modify the existing code - just create utils for it and add it to the `MODEL_TYPE_TO_UTILS` mapping.
for text in input_data:
    cls.validate_input(text)

    chunks = chunk_text(text=text, tokenizer=tokenizer, max_length=max_length, overlap=overlap)
From what I understand, pipeline internally implements chunk batching (https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-chunk-batching). Can you please clarify why we can't use that directly?
The `ChunkPipeline` is meant only for `zero-shot-classification` and `question-answering` tasks, not for `text-classification`. It throws an error if the text goes over the max token limit. I tried it with a few BERT-based models that have a 512-token limit, and it fails when the length exceeds that. But please let me know if I am missing something here.
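For reference, a minimal sketch of the failure mode being described, assuming standard `transformers` behaviour and an example 512-token checkpoint (the model name is an assumption):

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # assumed 512-token example
    top_k=None,
)

long_text = "I am thrilled about how well this worked! " * 500  # far beyond 512 tokens

try:
    classifier(long_text)
except Exception as exc:
    # typically an index/size error from the position embeddings
    print(f"pipeline failed on the long input: {exc}")

# Passing truncation avoids the error but silently drops the tail of the text,
# which is why chunking + averaging is used instead.
print(classifier(long_text, truncation=True, max_length=512))
```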
        List[Dict[str, Any]]: A list of dictionaries containing emotional analysis results.
    """
    model_type = type(model)
    model_utils = MODEL_TYPE_TO_UTILS.get(model_type)
If you use a pipeline, you can simply pass the model's name and the revision. What's the advantage here of using the model + tokenizer?
Right, I am using the tokenizer to handle chunking for long texts and not to actually create the classifier from the `pipeline`. I haven't come across a robust method for handling long texts in text classification tasks beyond this. From what I have tested and read on the HF docs, the chunk batching pipeline is only supported for `zero-shot-classification` and `question-answering` and not for `text-classification`.
from senselab.utils.model_utils import BaseModelSourceUtils


class BaseAnalysis(ABC):
this is an interesting interface. some thoughts:
- I am curious to see how it generalizes on other tasks
- it may be good to include a validate_output method as well
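A possible shape for that suggestion, keeping the PR's `validate_input` idea and adding a `validate_output` hook; the method signatures and the checked fields are assumptions, not the PR's actual interface.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseAnalysis(ABC):
    """Abstract interface for text analysis tasks (emotion, sentiment, ...)."""

    @classmethod
    @abstractmethod
    def analyze(cls, input_data: List[str], **kwargs: Any) -> List[Dict[str, Any]]:
        """Run the analysis and return one result dict per input text."""

    @classmethod
    def validate_input(cls, text: str) -> None:
        """Reject empty or non-string inputs before running the model."""
        if not isinstance(text, str) or not text.strip():
            raise ValueError("Input text must be a non-empty string.")

    @classmethod
    def validate_output(cls, result: Dict[str, Any]) -> None:
        """Check that a result dict has the fields downstream code expects."""
        if "scores" not in result:  # expected field name is an assumption
            raise ValueError("Analysis result is missing the 'scores' field.")
```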
)

tokenizer = model_utils.get_tokenizer(task="sentiment-analysis")
pipe = model_utils.get_pipeline(task="sentiment-analysis", device=device, torch_dtype=torch_dtype, top_k=None)
is there an advantage in handling SentimentAnalysis with a different pipeline than "text-classification"?
Not really an advantage, it actually uses the same `TextClassificationPipeline` as `text-classification`. But I've noticed in the docs that `sentiment-analysis` is often mentioned as the task when building any sentiment analysis classifier. I guess that's just to explicitly define the type of text classification. Functionality-wise, it's not really different.
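A quick way to verify this point: "sentiment-analysis" is an alias for the "text-classification" task in `transformers`, so both calls build the same pipeline class (the checkpoint name is an assumed example).

```python
from transformers import pipeline

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed example checkpoint

p1 = pipeline("sentiment-analysis", model=model_name)
p2 = pipeline("text-classification", model=model_name)

print(type(p1).__name__, type(p2).__name__)
# both print: TextClassificationPipeline
```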
from senselab.utils.model_utils import HFUtils


class Emotion(Enum):
do we really need this enum?
using this in the tests to not hardcode any strings, but not really used in the actual classification code.
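For illustration, a sketch of that test-side usage; the enum members and the result shape shown here are assumptions, not necessarily the PR's.

```python
from enum import Enum


class Emotion(Enum):  # members assumed for illustration
    ANGER = "anger"
    JOY = "joy"
    SADNESS = "sadness"


def test_analyze_emotion_returns_known_labels() -> None:
    # hypothetical result; a real test would call analyze_emotion(...) instead
    result = {"scores": {"joy": 0.91, "sadness": 0.05, "anger": 0.04}}
    valid_labels = {emotion.value for emotion in Emotion}
    # compare against enum values rather than hardcoded label strings
    assert set(result["scores"]).issubset(valid_labels)
```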
from senselab.utils.tasks.chunking import chunk_text


class EmotionAnalysis(BaseTextAnalysis):
clarify that this class is for emotion analysis with HF
Yes, will be making the change. My initial thought was to use the same class for every model type, including HF, and have the utils classes return the classifier with a `get_classifier` method. But you're right, different model types might have different output formats and my approach would fail then. I should've shared an LLD doc first to clear this up.
@@ -0,0 +1,160 @@
"""This module provides utility classes for handling common utilities based on model type."""
this is HF-specific. not every framework is structured with tokenizers, feature_extractor, models, and pipelines. I still wonder if we really need all this or if it's just an over-complication. please, help me understand your choice
Thank you for your feedback. The plan was to have functions like `get_classifier` (`get_pipeline` now) and `get_tokenizer` to create a standardized way to access these components across different utils, making it easier to work with various models. You'd just call `model_utils.get_classifier()` to get the classifier and `model_utils.get_tokenizer()` to fetch the tokenizer for the model. If a tokenizer is not available for a model and a long text is passed that exceeds the token limit, we could show an error. But I understand your concern about complexity and can rework the entire thing to work as it does currently, with if-else checks at the `api.py` level.
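A minimal sketch of the layering described above: a base utils interface plus an HF-specific implementation exposing `get_tokenizer`/`get_pipeline`. Class and method names follow the discussion; the bodies and exact signatures are assumptions, not the PR's code.

```python
from abc import ABC, abstractmethod
from typing import Any

from transformers import AutoTokenizer, Pipeline, pipeline


class BaseModelSourceUtils(ABC):
    """Common access point for model components, regardless of the framework."""

    @abstractmethod
    def get_pipeline(self, task: str, **kwargs: Any) -> Any:
        """Return a ready-to-use classifier for the given task."""

    def get_tokenizer(self, task: str) -> Any:
        """Return a tokenizer if the framework exposes one.

        Frameworks without tokenizers raise, so callers can fail fast when a
        long text would exceed the token limit.
        """
        raise NotImplementedError("This model source does not expose a tokenizer.")


class HFUtils(BaseModelSourceUtils):
    def __init__(self, path_or_uri: str, revision: str = "main") -> None:
        self.path_or_uri = path_or_uri
        self.revision = revision

    def get_tokenizer(self, task: str) -> Any:
        return AutoTokenizer.from_pretrained(self.path_or_uri, revision=self.revision)

    def get_pipeline(self, task: str, **kwargs: Any) -> Pipeline:
        return pipeline(task, model=self.path_or_uri, revision=self.revision, **kwargs)
```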
from transformers import AutoTokenizer


def chunk_text(text: str, tokenizer: AutoTokenizer, max_length: int, overlap: int) -> List[str]:
- this is a text-specific utility and not a general utility.
- I think chunking may already be managed inside the text pipeline (at least for HF)
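For context, one possible body for the `chunk_text` helper under discussion: window the token ids with the requested overlap and decode each window back to text. This is a sketch under those assumptions, not the PR's actual implementation.

```python
from typing import List

from transformers import AutoTokenizer


def chunk_text(text: str, tokenizer: AutoTokenizer, max_length: int, overlap: int) -> List[str]:
    """Split text into overlapping chunks that each fit within max_length tokens."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    stride = max_length - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than max_length.")

    chunks = []
    for start in range(0, len(token_ids), stride):
        window = token_ids[start : start + max_length]
        chunks.append(tokenizer.decode(window, skip_special_tokens=True))
        if start + max_length >= len(token_ids):
            break
    return chunks
```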
Thank you @adi611. I have left some comments and questions. Please feel free to address them to help me better understand your way of thinking.
Hi, I was thinking we could have one …
Proposed Changes

1. Base Analysis Classes
2. Chunking Strategy
3. Restructure Utilities
4. Decouple Analysis Logic from Model Type

       def analyze_emotion(...):
           if isinstance(model, HFModel):
               return HFEmotionAnalysis.analyze(input_data=pieces_of_text, device=device, **kwargs)
           else:
               raise NotImplementedError("The specified model type is not supported.")

5. Emotion Enum
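For illustration, a hypothetical call against the dispatcher in item 4; the senselab import paths, field names, and checkpoint are assumptions.

```python
from senselab.utils.data_structures.device import DeviceType  # assumed import path
from senselab.utils.data_structures.model import HFModel      # assumed import path

# analyze_emotion is the dispatcher sketched in item 4 above
results = analyze_emotion(
    pieces_of_text=["I can't believe how well this worked!", "This is so frustrating."],
    device=DeviceType.CPU,
    model=HFModel(path_or_uri="j-hartmann/emotion-english-distilroberta-base", revision="main"),
)
print(results)  # one dict of emotion scores per input text
```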
Please review and provide feedback. cc: @fabiocat93
Hi @fabiocat93, let me know if there's another task I can start working on in the meantime.