How to change alignment_mode in a hf pipelines in code? #13548

Open
mikrzol opened this issue Jun 26, 2024 · 1 comment

Comments

@mikrzol
mikrzol commented Jun 26, 2024

I'm having the same problem described here: microsoft/presidio#1262 - some of the annotations are skipped because of alignment problems between the spaCy pipeline and the hf pipeline wrapper.

I would like to simply substitute one of the components of the spaCy pipeline with a HF model I trained for NER and use it for this task. I'm trying to do this using this code:

import spacy
nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("ner")

nlp.add_pipe("hf_token_pipe", config={"model": "mikrz/bert-vir_naeus-ner"})

The model loads correctly and technically works fine, but some of the tokens are skipped, e.g.:

text = 'A novel bacteriophage vB_SauS_SA2 (hereafter designated SA2) that infects Staphylococcus aureus was isolated.'
doc = nlp(text)

For the code above I'm getting this warning:

spacy_huggingface_pipelines/token_classification.py:129: UserWarning: Skipping annotation, {'entity_group': 'VIR', 'score': 0.67677, 'word': '_ SauS _ SA2', 'start': 24, 'end': 33} is overlapping or can't be aligned for doc 'A novel bacteriophage vB_SauS_SA2 (hereafter designated SA2) that infects Staphylococcus aureus was ...'
  warnings.warn(

I saw that in microsoft/presidio#1262 someone recommended changing the alignment_mode of the hf pipeline component. How can I do this in code when using the default en_core_web_sm model?
For my use case, I need the spans of the named entities in the form of word numbers, so in the example above it would be:

vB_SauS_SA2: (3,3),
Staphylococcus aureus: (11,12)

where we get the positions of the start and end of the named entities.
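
A minimal sketch of how such inclusive word-number spans could be read off once the entities are on the doc. It relies on the fact that a spaCy `Span` exposes `start` and `end` as token indices, with `end` exclusive. The helper name `inclusive_token_span` and the `SimpleNamespace` stand-ins are hypothetical and only there so the snippet runs without a model; with a real pipeline you would loop over `doc.ents`.

```python
from types import SimpleNamespace


def inclusive_token_span(span):
    """Return (first_token_index, last_token_index) for a spaCy-like span.

    Span.end is exclusive in spaCy, so the last token index is end - 1.
    """
    return (span.start, span.end - 1)


# Stand-ins for spaCy Span objects (hypothetical values matching the
# example sentence). With a real pipeline, iterate over doc.ents instead.
fake_ents = [
    SimpleNamespace(text="vB_SauS_SA2", start=3, end=4),
    SimpleNamespace(text="Staphylococcus aureus", start=11, end=13),
]

spans = {ent.text: inclusive_token_span(ent) for ent in fake_ents}
print(spans)
```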

I'd like to change just this part - I would like to avoid training a custom pipeline which, if I understand https://spacy.io/usage/training correctly, seems to be necessary when creating a new spaCy pipeline from the config file. Or did I misunderstand and there is an option to just change parts of the config file? If that's the case, could you instruct me how to create a config file, where to put it and how to use it?

@Siddharth-Latthe-07

@mikrzol I think the issue arises from the fact that the tokens produced by the Hugging Face model might not align perfectly with the tokens produced by the spaCy model. This misalignment can lead to warnings about overlapping annotations or annotations that can't be aligned.
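
To make the misalignment concrete, here is a rough pure-Python sketch of what the three alignment modes of `Doc.char_span` ("strict", "contract", "expand") do with the span from the warning (characters 24 to 33). This is not spaCy's implementation: `token_offsets` and `align_char_span` are hypothetical helpers, and whitespace splitting stands in for the spaCy tokenizer on a shortened version of the sentence.

```python
import re


def token_offsets(text):
    """Whitespace tokenization as a stand-in for spaCy's tokenizer."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def align_char_span(offsets, start, end, mode="strict"):
    """Map a character span to inclusive token indices, or None."""
    # Tokens that lie completely inside [start, end)
    inside = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end]
    if mode == "strict":
        # Both boundaries must coincide exactly with token boundaries
        exact = (
            inside
            and offsets[inside[0]][0] == start
            and offsets[inside[-1]][1] == end
        )
        idx = inside if exact else []
    elif mode == "contract":
        # Shrink the span to the tokens fully covered by it
        idx = inside
    elif mode == "expand":
        # Grow the span to every token it overlaps, even partially
        idx = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode!r}")
    return (idx[0], idx[-1]) if idx else None


text = "A novel bacteriophage vB_SauS_SA2 was isolated."
offsets = token_offsets(text)

# The HF model reported characters 24..33 ("_SauS_SA2"), which starts
# in the middle of the token "vB_SauS_SA2" (characters 22..33).
results = {
    mode: align_char_span(offsets, 24, 33, mode)
    for mode in ("strict", "contract", "expand")
}
print(results)  # {'strict': None, 'contract': None, 'expand': (3, 3)}
```

Under these assumptions only "expand" recovers the whole token, which is why the annotation gets skipped with a stricter mode.
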
Try out this sample code snippet, which adjusts the spans returned by the Hugging Face model to match the token boundaries of the spaCy model more accurately.

import spacy
from spacy.language import Language
from spacy.util import filter_spans
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Remove default NER pipeline component
nlp.remove_pipe("ner")

# Load Hugging Face model and tokenizer
model_name = "mikrz/bert-vir_naeus-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# aggregation_strategy="simple" merges word pieces into whole entities
# and returns the label under the "entity_group" key
hf_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Custom pipeline component for Hugging Face NER
# (spaCy v3 requires function components to be registered under a name)
@Language.component("hf_ner_component")
def hf_ner_component(doc):
    # Run the Hugging Face pipeline on the original text so that the
    # character offsets it returns are valid for this doc
    hf_ner_results = hf_pipeline(doc.text)

    # Convert Hugging Face NER results to spaCy entities
    entities = []
    for ent in hf_ner_results:
        # "expand" snaps a span that starts or ends mid-token outwards to the
        # enclosing spaCy tokens; "contract" can return None for such a span
        span = doc.char_span(ent["start"], ent["end"], label=ent["entity_group"], alignment_mode="expand")
        if span is not None:
            entities.append(span)

    # doc.ents must not overlap, so keep the longest non-overlapping spans
    doc.ents = filter_spans(entities)
    return doc

# Add custom component to spaCy pipeline (by its registered name)
nlp.add_pipe("hf_ner_component", last=True)

# Example usage
text = 'A novel bacteriophage vB_SauS_SA2 (hereafter designated SA2) that infects Staphylococcus aureus was isolated.'
doc = nlp(text)

# Print named entities detected by the new NER component
for ent in doc.ents:
    print(ent.text, ent.label_)
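
On the original question of changing alignment_mode without a custom component: as far as I can tell from the spacy-huggingface-pipelines README, the `hf_token_pipe` factory takes an `alignment_mode` setting (default "strict") next to `model`, so it should be possible to pass it in the same `config` dict given to `nlp.add_pipe`. Values passed there override the factory defaults, so no config file or retraining is needed for this. A sketch under that assumption; `build_nlp` is a hypothetical helper and is not called here because it needs the models installed:

```python
# Settings for the "hf_token_pipe" factory. The "alignment_mode" key is
# assumed from the spacy-huggingface-pipelines README and is expected to be
# one of "strict", "contract" or "expand" (as for Doc.char_span).
hf_config = {
    "model": "mikrz/bert-vir_naeus-ner",
    "alignment_mode": "expand",
}


def build_nlp():
    """Load en_core_web_sm without its NER and add the HF component."""
    import spacy  # imported lazily so the config can be inspected without spaCy

    nlp = spacy.load("en_core_web_sm", exclude=["ner"])
    # Keys in `config` override the defaults of the component factory
    nlp.add_pipe("hf_token_pipe", config=hf_config)
    return nlp
```

Calling `nlp = build_nlp()` and then `nlp(text)` should give the same pipeline as in the question, only with the different alignment mode.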

Please let me know if the above works.
Thanks
