Lowercase lemmatization in pipe, when tok2vec disabled #13511

BMarcin · 2024-05-29T08:56:36Z

Introduction

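Hi! I am using the spaCy lemmatizer and noticed that, when I process texts with nlp.pipe to speed things up, I get different lemmas depending on whether tok2vec is disabled. With tok2vec disabled, every lemma in my example comes back as the lowercased token; with it enabled, lemmas such as "Marcin" and "HomeLab" keep their case. Preserving case is critical for me. Is the behaviour below expected?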

How to reproduce the behaviour

Case 1 (ner and tok2vec disabled)

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["Hello! My name is Marcin.", "I have a SFTP server running in my HomeLab"]

# Disable both ner and tok2vec for this run.
for doc in nlp.pipe(texts, batch_size=100, n_process=1, disable=["ner", "tok2vec"]):
    for token in doc:
        print(token.text, token.lemma_)
    print("")

Output:

Hello hello
! !
My my
name name
is is
Marcin marcin
. .

I i
have have
a a
SFTP sftp
server server
running running
in in
my my
HomeLab homelab

Case 2 (only ner disabled)

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["Hello! My name is Marcin.", "I have a SFTP server running in my HomeLab"]

# Disable only ner; tok2vec stays enabled for this run.
for doc in nlp.pipe(texts, batch_size=100, n_process=1, disable=["ner"]):
    for token in doc:
        print(token.text, token.lemma_)
    print("")

Output:

Hello hello
! !
My my
name name
is be
Marcin Marcin
. .

I I
have have
a a
SFTP sftp
server server
running run
in in
my my
HomeLab HomeLab
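
To make the differences between the two runs easier to spot, the outputs pasted above can be diffed position by position. This is only a minimal sketch: `diff_lemmas` is a hypothetical helper written for this report, not part of spaCy, and the `case1` / `case2` lists are copied by hand from the two outputs above.

```python
# Hypothetical helper (not a spaCy API): compare (token, lemma) pairs from two runs.
def diff_lemmas(run_a, run_b):
    """Return (token, lemma_a, lemma_b) for every position where the lemmas differ."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must contain the same number of tokens")
    diffs = []
    for (tok_a, lem_a), (tok_b, lem_b) in zip(run_a, run_b):
        if tok_a != tok_b:
            raise ValueError(f"token mismatch: {tok_a!r} vs {tok_b!r}")
        if lem_a != lem_b:
            diffs.append((tok_a, lem_a, lem_b))
    return diffs


# Case 1 output (ner and tok2vec disabled), copied from the report.
case1 = [
    ("Hello", "hello"), ("!", "!"), ("My", "my"), ("name", "name"),
    ("is", "is"), ("Marcin", "marcin"), (".", "."),
    ("I", "i"), ("have", "have"), ("a", "a"), ("SFTP", "sftp"),
    ("server", "server"), ("running", "running"), ("in", "in"),
    ("my", "my"), ("HomeLab", "homelab"),
]

# Case 2 output (only ner disabled), copied from the report.
case2 = [
    ("Hello", "hello"), ("!", "!"), ("My", "my"), ("name", "name"),
    ("is", "be"), ("Marcin", "Marcin"), (".", "."),
    ("I", "I"), ("have", "have"), ("a", "a"), ("SFTP", "sftp"),
    ("server", "server"), ("running", "run"), ("in", "in"),
    ("my", "my"), ("HomeLab", "HomeLab"),
]

for token, lemma_disabled, lemma_enabled in diff_lemmas(case1, case2):
    print(f"{token}: {lemma_disabled!r} (tok2vec disabled) vs {lemma_enabled!r} (tok2vec enabled)")
```

Five tokens differ: "is", "Marcin", "I", "running" and "HomeLab". So the difference is not only casing: with tok2vec disabled, "is" and "running" are also left uninflected instead of becoming "be" and "run".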

Info about spaCy

spaCy version: 3.7.2
Platform: Windows-10-10.0.19045-SP0
Python version: 3.10.13
Pipelines: en_core_web_sm (3.7.1), en_core_web_trf (3.7.3), es_core_news_lg (3.7.0), es_core_news_sm (3.7.0), pl_core_news_lg (3.7.0), pl_core_news_sm (3.7.0)
