Text: Sentiment and Emotional Analysis #153

Open · adi611 wants to merge 10 commits into main from 34-text-sentiment-analysis
Conversation

@adi611 adi611 commented Sep 11, 2024

No description provided.

pyproject.toml Outdated
@@ -57,6 +56,7 @@ umap-learn = "~=0.5"
scikit-learn = "~=1.5"
nltk = "~=3.8"
vocos = "~=0.1"
huggingface-hub = {extras = ["cli"], version = "^0.24.6"}
Collaborator:

why is cli needed?

Author (adi611):

Not needed; I planned to remove this in the final PR.

from enum import Enum


class Emotion(Enum):
Collaborator:

I am not sure that pre-defining the emotional classes is a good idea. What if someone wants to use a model that can capture an emotion that is not in your pool?

Author (adi611):

You're right; I will make changes to the test cases as well.

if not pieces_of_text:
raise ValueError("Input list is empty or None.")

if model is None:
Collaborator:

Why don't you specify the default in the param definition (L21)?

def analyze_emotion(
pieces_of_text: List[str],
device: Optional[DeviceType] = None,
model: Optional[HFModel] = None,
Collaborator:

I think model should be of type SenselabModel. This way, we can accept models that are not necessarily HF models.

device: Optional[DeviceType] = None,
model: Optional[HFModel] = None,
max_length: int = 512,
overlap: int = 128,
Collaborator:

max_length and overlap look like model-specific params and shouldn't be in the general emotion recognition API. You can obtain them as part of **kwargs.
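For illustration, here is roughly what the suggested signature could look like once the model-specific params move into **kwargs and the model type is widened to SenselabModel; the import paths below are assumptions, not the PR's actual code.

from typing import Any, Dict, List, Optional

from senselab.utils.data_structures import DeviceType, SenselabModel  # assumed import path


def analyze_emotion(
    pieces_of_text: List[str],
    device: Optional[DeviceType] = None,
    model: Optional[SenselabModel] = None,  # the default model could also be set right here
    **kwargs: Any,  # model-specific params such as max_length and overlap
) -> List[Dict[str, Any]]:
    ...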

emotion_labels = model_instance.config.id2label

results: List[Dict[str, Any]] = []
for text in pieces_of_text:
Collaborator:

What is the advantage of using the tokenizer and the model instead of working with the Hugging Face text classification pipeline? https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline

Author (adi611), Sep 13, 2024:

I was mainly using the model directly to process long texts, but the same can be done with a pipeline:

Pipeline approach: text → text chunks → pass chunks to the pipeline → average the scores for each label.
Direct model approach: text → tokenized chunks → pass chunks to the model → softmax for probabilities → average the scores for each label.

I'll switch to using the pipeline for simplicity, though I'll still need the tokenizer for chunking.
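For concreteness, a minimal sketch of that pipeline-based flow (not the PR's actual code); the model name is only illustrative, and chunk_text here is a simplified stand-in for the PR's helper.

from collections import defaultdict
from typing import Dict, List

from transformers import AutoTokenizer, pipeline

MODEL_NAME = "j-hartmann/emotion-english-distilroberta-base"  # illustrative emotion model
MAX_LENGTH = 512
OVERLAP = 128

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
classifier = pipeline("text-classification", model=MODEL_NAME, tokenizer=tokenizer, top_k=None)


def chunk_text(text: str, max_length: int = MAX_LENGTH, overlap: int = OVERLAP) -> List[str]:
    """Split text into overlapping chunks that fit within the model's token limit."""
    budget = max_length - tokenizer.num_special_tokens_to_add()  # leave room for special tokens
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = budget - overlap
    return [tokenizer.decode(token_ids[i : i + budget]) for i in range(0, len(token_ids), step)]


def analyze_long_text(text: str) -> Dict[str, float]:
    """Text -> chunks -> pipeline -> averaged score per label."""
    chunks = chunk_text(text)
    if not chunks:
        return {}
    totals: Dict[str, float] = defaultdict(float)
    for chunk_scores in classifier(chunks):
        for item in chunk_scores:
            totals[item["label"]] += item["score"]
    return {label: total / len(chunks) for label, total in totals.items()}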

adi611 marked this pull request as ready for review on September 16, 2024, 17:41.
adi611 commented Sep 16, 2024

Hi @fabiocat93 - please review. I've made the requested changes and tried to make the code as extensible as possible. If a new model type needs to be added in the future, we won’t need to modify the existing code - just create utils for it and add it to the MODEL_TYPE_TO_UTILS dict. These utility classes could also be useful elsewhere in the code.
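For reference, a minimal sketch of the registry idea described above; the helper name get_utils_for and the data-structure import path are illustrative assumptions rather than the PR's final code.

from typing import Dict, Type

from senselab.utils.data_structures import HFModel, SenselabModel  # assumed import path
from senselab.utils.model_utils import BaseModelSourceUtils, HFUtils

MODEL_TYPE_TO_UTILS: Dict[Type[SenselabModel], Type[BaseModelSourceUtils]] = {
    HFModel: HFUtils,
    # Supporting a new model type means registering its utils class here,
    # without touching the existing analysis code:
    # NewModelType: NewModelUtils,
}


def get_utils_for(model: SenselabModel) -> Type[BaseModelSourceUtils]:
    """Look up the utils class for a given model instance."""
    utils = MODEL_TYPE_TO_UTILS.get(type(model))
    if utils is None:
        raise NotImplementedError(f"Model type '{type(model).__name__}' is not supported.")
    return utils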

adi611 force-pushed the 34-text-sentiment-analysis branch from 602faff to 426e19d on September 17, 2024, 18:01
adi611 force-pushed the 34-text-sentiment-analysis branch from 426e19d to dcca19b on September 17, 2024, 18:04
for text in input_data:
cls.validate_input(text)

chunks = chunk_text(text=text, tokenizer=tokenizer, max_length=max_length, overlap=overlap)
Collaborator:

From what I understand, pipeline internally implements chunk batching (https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-chunk-batching). Can you please clarify why we can't use that directly?

Author (adi611):

The ChunkPipeline is meant only for zero-shot-classification and question-answering tasks, not for text-classification. It throws an error if the text goes over the max token limit. I tried it with a few BERT-based models that have a 512 token limit, and it fails when the length exceeds that. But please let me know if I am missing something here.

List[Dict[str, Any]]: A list of dictionaries containing emotional analysis results.
"""
model_type = type(model)
model_utils = MODEL_TYPE_TO_UTILS.get(model_type)
Collaborator:

If you use a pipeline, you can simply pass the model's name and the revision. What's the advantage here of using the model + tokenizer?

Author (adi611):

Right, I am using the tokenizer only to handle chunking for long texts; the classifier itself still comes from the pipeline. I haven’t come across a robust method for handling long texts in text classification tasks beyond this. From what I have tested and read in the HF docs, the chunk batching pipeline is only supported for zero-shot-classification and question-answering, not for text-classification.

from senselab.utils.model_utils import BaseModelSourceUtils


class BaseAnalysis(ABC):
Collaborator:

This is an interesting interface. Some thoughts:

  • I am curious to see how it generalizes to other tasks.
  • It may be good to include a validate_output method as well (a rough sketch follows below).
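A rough sketch of what such a hook could look like; validate_input mirrors the PR, but the output format checked here is only illustrative, not the PR's actual structure.

from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseAnalysis(ABC):
    @classmethod
    @abstractmethod
    def validate_input(cls, input_data: Any) -> None:
        """Raise if the input is malformed (e.g., empty or None)."""

    @classmethod
    def validate_output(cls, output_data: List[Dict[str, Any]]) -> None:
        """Raise if the analysis output is malformed (e.g., a score outside [0, 1])."""
        for result in output_data:
            for label, score in result.get("scores", {}).items():
                if not 0.0 <= score <= 1.0:
                    raise ValueError(f"Score for '{label}' is out of range: {score}")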

)

tokenizer = model_utils.get_tokenizer(task="sentiment-analysis")
pipe = model_utils.get_pipeline(task="sentiment-analysis", device=device, torch_dtype=torch_dtype, top_k=None)
Collaborator:

Is there an advantage in handling SentimentAnalysis with a different pipeline than "text-classification"?

Author (adi611):

Not really an advantage; it actually uses the same TextClassificationPipeline as text-classification. But I’ve noticed in the docs that sentiment-analysis is often mentioned as the task when building a sentiment analysis classifier. I guess that’s just to explicitly define the type of text classification; functionality-wise, it’s not really different.
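For what it's worth, this is easy to confirm outside the PR: in transformers, "sentiment-analysis" is an alias for "text-classification", so both task strings produce a TextClassificationPipeline (small check below; the default models are downloaded if none is specified).

from transformers import TextClassificationPipeline, pipeline

# Both task strings resolve to the same pipeline class.
sentiment_pipe = pipeline("sentiment-analysis")
text_clf_pipe = pipeline("text-classification")

assert isinstance(sentiment_pipe, TextClassificationPipeline)
assert isinstance(text_clf_pipe, TextClassificationPipeline)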

from senselab.utils.model_utils import HFUtils


class Emotion(Enum):
Collaborator:

Do we really need this enum?

Author (adi611):

I'm using this in the tests to avoid hardcoding strings, but it's not really used in the actual classification code.

from senselab.utils.tasks.chunking import chunk_text


class EmotionAnalysis(BaseTextAnalysis):
Collaborator:

Clarify that this class is for emotion analysis with HF.

Author (adi611):

Yes, I will make the change. My initial thought was to use the same class for every model type, including HF, and have the utils classes return the classifier via a get_classifier method. But you’re right: different model types might have different output formats, and my approach would fail then. I should’ve shared an LLD doc first to clear this up.

@@ -0,0 +1,160 @@
"""This module provides utility classes for handling common utilities based on model type."""
Collaborator:

This is HF-specific; not every framework is structured with tokenizers, feature extractors, models, and pipelines. I still wonder if we really need all this or if it's just an over-complication. Please help me understand your choice.

Author (adi611):

Thank you for your feedback. The plan was to have functions like get_classifier (now get_pipeline) and get_tokenizer to create a standardized way to access these components across different utils, making it easier to work with various models. You’d just call model_utils.get_classifier() to get the classifier and model_utils.get_tokenizer() to fetch the tokenizer for the model. If a tokenizer is not available for a model and a long text is passed that exceeds the token limit, we could raise an error. But I understand your concern about complexity and can rework the whole thing to work as it does currently, with if-else checks at the api.py level.
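A minimal sketch of the standardized access described here, using the method names from the comment (get_pipeline, get_tokenizer); the exact signatures are assumptions, and returning None from get_tokenizer is one possible way to let callers raise a clear error when long texts need chunking.

from abc import ABC, abstractmethod
from typing import Any, Optional


class BaseModelSourceUtils(ABC):
    """Framework-agnostic access to the components an analysis task needs."""

    @abstractmethod
    def get_pipeline(self, task: str, **kwargs: Any) -> Any:
        """Return a ready-to-use classifier for the given task."""

    @abstractmethod
    def get_tokenizer(self, task: str) -> Optional[Any]:
        """Return a tokenizer for chunking, or None if the framework has none."""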

from transformers import AutoTokenizer


def chunk_text(text: str, tokenizer: AutoTokenizer, max_length: int, overlap: int) -> List[str]:
Collaborator:

  1. This is a text-specific utility and not a general utility.
  2. I think chunking may already be managed inside the text pipeline (at least for HF).

fabiocat93 (Collaborator):
> Hi @fabiocat93 - please review. I've made the requested changes and tried to make the code as extensible as possible. If a new model type needs to be added in the future, we won’t need to modify the existing code - just create utils for it and add it to the MODEL_TYPE_TO_UTILS dict. These utility classes could also be useful elsewhere in the code.

Thank you @adi611. I have left some comments and questions. Please feel free to address them to help me understand your way of thinking better.

adi611 commented Sep 26, 2024

> Hi @fabiocat93 - please review. I've made the requested changes and tried to make the code as extensible as possible. If a new model type needs to be added in the future, we won’t need to modify the existing code - just create utils for it and add it to the MODEL_TYPE_TO_UTILS dict. These utility classes could also be useful elsewhere in the code.

> Thank you @adi611. I have left some comments and questions. Please feel free to address them to help me understand your way of thinking better.

Hi, I was thinking we could have one EmotionAnalysis and one SentimentAnalysis class to handle all model types, and use MODEL_TYPE_TO_UTILS to get the utils for a model type. The BaseModelSourceUtils could have something like a get_classifier abstract method, which the concrete utils classes would implement to get the right classifier for each model type. So we wouldn’t have to update existing code if support for more model types is added; we'd just create the new utils class and add it to MODEL_TYPE_TO_UTILS. But I get your point: it might oversimplify things across models and not always be practical. I should’ve written up a design doc first, gotten your feedback, and then moved ahead with the code. I’ll make sure to do that going forward to keep things smoother!

adi611 commented Sep 28, 2024

Proposed Changes

1. Base Analysis Classes

  • Add validate_output method to BaseAnalysis and BaseTextAnalysis.
  • Move BaseTextAnalysis to something like senselab/text/utils/interfaces.py.

2. Chunking Strategy

  • If custom chunking is needed, move it to senselab/text/utils/.

3. Restructure Utilities

  • Remove BaseModelSourceUtils. HFUtils will not be a derived class and can still be used across the codebase to avoid repeating similar pipeline code.

4. Decouple Analysis Logic from Model Type

def analyze_emotion(...):
    if isinstance(model, HFModel):
        return HFEmotionAnalysis.analyze(input_data=pieces_of_text, device=device, **kwargs)
    else:
        raise NotImplementedError("The specified model type is not supported.")

5. Emotion Enum

  • Remove the Emotion enum and use string values in the tests.
  • Or move it to the test file, since it is only used there.

Please review and provide feedback. cc: @fabiocat93

adi611 commented Oct 3, 2024

Hi @fabiocat93, let me know if there's another task I can start working on in the meantime.
