How can I use sciBERT for Token Classification? #98

Open
Sachit1137 opened this issue Jul 24, 2020 · 5 comments

@Sachit1137

Sachit1137 commented Jul 24, 2020

I tried with the code below:

from transformers import AutoTokenizer, AutoModel,AutoModelForTokenClassification
import torch

# I am getting the label list from the labels.txt file present in the PyTorch Hugging Face model (scibert_scivocab_uncased)
def read_label_list():
    f = open('labels.txt','r')
    label_list = []
    for line in f:
        label_list.append(line)
    return label_list

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

sequence = 'Effectiveness of current drug treatments for hospitalized patients with SARS-CoV-2 infection (COVID-19 patients) in routine clinical practice|Risk factors or modifiers of pharmacological effect such as demographic characteristics, comorbidity or underlying pathology, concomitant medication.'

label_list = read_label_list()
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim = 2)

for token, prediction in zip(tokens,predictions[0].numpy()):
    print((token, label_list[prediction]))

I am getting the following output, which does not make sense:
('[CLS]', '##.49\n')
('effectiveness', '##.49\n')
('of', '##.49\n')
('current', '##.49\n')
('drug', '##.49\n')
('treatments', '##.49\n')
('for', '##.49\n')
('hospitalized', '##.49\n')
('patients', '##.49\n')
('with', '##.49\n')
('sar', '##.49\n')
('##s', '##.49\n')
('-', '##.49\n')
('cov', '##.49\n')
('-', '##.49\n')
('2', '##.49\n')
('infection', '##.49\n')
('(', '##.49\n')
('cov', '##.49\n')
('##id', '##.49\n')
('-', '##.49\n')
('19', '##.49\n')
('patients', '##.49\n')
(')', '##.49\n')
('in', '##.49\n')
('routine', '##.49\n')
('clinical', '##.49\n')
('practice', '##.49\n')
('|', '##.49\n')
('risk', '##.49\n')
('factors', '##.49\n')
('or', '##.49\n')
('modi', '##.49\n')
('##fi', '##.49\n')
('##ers', '##.49\n')
('of', '##.49\n')
('pharmacological', '##.49\n')
('effect', '##.49\n')
('such', '##.49\n')
('as', '##.49\n')
('demographic', '##.49\n')
('characteristics', '##.49\n')
(',', '##.49\n')
('comorbidity', '##.49\n')
('or', '##.49\n')
('underlying', '##.49\n')
('pathology', '##.49\n')
(',', '##.49\n')
('concomitant', '##.49\n')
('medication', '##.49\n')
('.', '##1-4\n')
('[SEP]', '##.49\n')

@Sachit1137
Author

Can someone help me figure this out?

@ibeltagy
Collaborator

The code is a bit difficult to read without formatting, but the obvious issues are that you need to use AutoModelForTokenClassification, and the round trip tokenizer.tokenize(tokenizer.decode(tokenizer.encode(string))) is odd. I think the easiest solution is to follow the huggingface NER pytorch-lightning example here.
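
For reference, a minimal sketch of what the token-classification head looks like (the label set below is hypothetical, and the classification head is randomly initialized until you fine-tune it, so its predictions only become meaningful after training):

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

label_list = ["O", "B-ENT", "I-ENT"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    num_labels=len(label_list),
)

sequence = "Effectiveness of current drug treatments for hospitalized patients with SARS-CoV-2 infection."
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    # .logits requires a recent transformers version; older releases return a tuple,
    # in which case use model(**inputs)[0]
    logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

predictions = torch.argmax(logits, dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, pred in zip(tokens, predictions.tolist()):
    print(token, label_list[pred])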

@stefan-it

stefan-it commented Jul 24, 2020

@Sachit1137 just use the fine-tuning example for token-classification from Transformers:

https://github.com/huggingface/transformers/tree/master/examples/token-classification

There are two examples given there, which you just need to adapt to your dataset.

Later, you can just use the Transformers pipelines feature to make predictions; see this example.

If you need help with the token classification example, just ping me :)
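
As a rough sketch of that last step (the checkpoint path "./scibert-finetuned-ner" is hypothetical and assumes you have already fine-tuned a model with the example above):

from transformers import pipeline

# "ner" is the token-classification pipeline; model and tokenizer are loaded
# from the directory produced by the fine-tuning example
ner = pipeline(
    "ner",
    model="./scibert-finetuned-ner",
    tokenizer="./scibert-finetuned-ner",
)

text = "Effectiveness of current drug treatments for hospitalized patients with SARS-CoV-2 infection."
for entity in ner(text):
    print(entity["word"], entity["entity"], round(entity["score"], 3))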

@zcyzhuangzhou


Hello, why is there no labels.txt in the model files I downloaded? I want to fine-tune on my own data, but I don't know the format of the SciBERT training data or the full label set.

@jaynibandh

I get "No such file or directory: 'labels.txt'", the same error @zcyzhuangzhou encountered. What is the expected format, or is there an example file to look at?
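
For what it's worth, a rough sketch of the data layout the transformers token-classification example expects (the label names below are hypothetical; use the tag set for your own task): labels.txt lists one label per line, and train.txt / dev.txt / test.txt contain one "token label" pair per line with a blank line between sentences (CoNLL-style):

labels = ["O", "B-DRUG", "I-DRUG", "B-DISEASE", "I-DISEASE"]  # hypothetical tag set

# labels.txt: one label per line
with open("labels.txt", "w") as f:
    f.write("\n".join(labels) + "\n")

# train.txt (and dev.txt / test.txt): one "token label" pair per line,
# sentences separated by a blank line (CoNLL-style)
sentence = [
    ("Remdesivir", "B-DRUG"),
    ("was", "O"),
    ("given", "O"),
    ("to", "O"),
    ("COVID-19", "B-DISEASE"),
    ("patients", "O"),
    (".", "O"),
]
with open("train.txt", "w") as f:
    for token, label in sentence:
        f.write(f"{token} {label}\n")
    f.write("\n")  # blank line ends the sentence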
