Span.start_char has unexpected return value #5541

In doc[i], i is the token index, not the character offset in the text string. You also don't need to go through a span to get a token's character position: it's available as Token.idx. A shorter way to do this is:

for i, token in enumerate(nlp(text)):
    print(i, token.idx)  # token index, character offset (also: doc[i].idx)

Token.i is the token position and Token.idx is the character offset, which is admittedly a bit confusing because it's not consistent with the Span API, where Span.start is a token index and Span.start_char is the character offset.
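
To make the naming difference concrete, here is a minimal sketch that prints the token-level and character-level attributes side by side. It assumes spaCy is installed and uses a blank English pipeline (tokenizer only) so that no model download is needed. The example text is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough here, since only tokenization is needed.
nlp = spacy.blank("en")
text = "Hello world again"  # hypothetical example text
doc = nlp(text)

for token in doc:
    # token.i is the position in the token sequence;
    # token.idx is the character offset of the token in `text`.
    print(token.i, token.idx, token.text)

span = doc[1:3]  # slicing a Doc uses token indices
# Span.start / Span.end are token indices;
# Span.start_char / Span.end_char are character offsets.
print(span.start, span.end, span.start_char, span.end_char)
```

Slicing the original string with span.start_char and span.end_char gives back span.text, which is a quick way to check that you are working with character offsets rather than token indices.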

Answer selected by ines
Labels
feat / doc Feature: Doc, Span and Token objects
This discussion was converted from issue #5541 on December 11, 2020 00:21.