Hi.
Thank you for publishing the model.
I have a problem, but I can't find where I went wrong.
I have the titles and abstracts of some articles and want to get vector embeddings for them.
I used the code below, following your suggestion on how to use the model.
Interestingly, when I tried to process about 500 records in one go, I never got a response because my computer's 16 GB of RAM filled up.
So I split the data into chunks and got the embeddings quickly.
But, for example, the vectors I get with a chunk size of 10 differ from the vectors I get with a chunk size of 20.
I'm probably using the tokenizer wrong.
If you have any ideas, I would be glad to hear them.
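(As an aside, part of the memory pressure may come from autograd: calling the model directly builds a gradient graph for every forward pass, which is unnecessary for pure inference. This is a guess on my part; the sketch below uses a toy `nn.Linear` standing in for the real model, just to show the effect of `torch.no_grad()`.)

```python
import torch
import torch.nn as nn

# Toy stand-in for the transformer; any torch module behaves the same
# way with respect to gradient tracking (this is NOT the SciNCL model).
model = nn.Linear(768, 768)
x = torch.randn(4, 768)

# A plain forward pass records an autograd graph on the output.
out_with_grad = model(x)
assert out_with_grad.requires_grad

# Wrapping inference in no_grad() skips graph construction, which can
# noticeably reduce peak RAM when embedding many batches.
with torch.no_grad():
    out_no_grad = model(x)
assert not out_no_grad.requires_grad
```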
Best wishes.
Orhan
import json

import numpy as np
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('scincl')
model = AutoModel.from_pretrained('scincl')

csize = 20


def normalizer(x):
    normalized_vector = x / np.linalg.norm(x)
    return np.array(normalized_vector)


def split_chunks(data):
    return [data[x:x + csize] for x in range(0, len(data), csize)]


def get_vectors(chunks):
    title_vecs = np.empty(shape=[0, 768])
    abstract_vecs = np.empty(shape=[0, 768])
    for chunk in chunks:
        title = [d['title'] for d in chunk]
        abstract = [d['abstract'] for d in chunk]
        inputs = tokenizer(title, padding=True, truncation=True, return_tensors="pt", max_length=512)
        result = model(**inputs)
        # [CLS] token embedding for each title in the batch
        embedT = result.last_hidden_state[:, 0, :]
        title_vecs = np.append(title_vecs, normalizer(embedT.detach().numpy()), axis=0)
        inputs = tokenizer(abstract, padding=True, truncation=True, return_tensors="pt", max_length=512)
        result = model(**inputs)
        # [CLS] token embedding for each abstract in the batch
        embedA = result.last_hidden_state[:, 0, :]
        abstract_vecs = np.append(abstract_vecs, normalizer(embedA.detach().numpy()), axis=0)
    return title_vecs, abstract_vecs


print("started")
with open('abstracts.json', 'r') as f:
    data = json.load(f)
chunks = split_chunks(data)
title_vecs, abstract_vecs = get_vectors(chunks)
np.save('data_title_scincl_norm_' + str(csize) + '.npy', title_vecs)
np.save('data_abstract_scincl_norm_' + str(csize) + '.npy', abstract_vecs)
print("finished")
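One possible culprit I'm wondering about (my guess, not confirmed): `np.linalg.norm` called on a 2-D array without an `axis` argument returns the Frobenius norm of the whole batch, so each vector gets divided by a quantity that depends on which other vectors share its chunk. A minimal sketch of the difference between batch-wide and per-row normalization:

```python
import numpy as np

def normalize_rows(x):
    # Per-row L2 normalization: each embedding is scaled independently,
    # so the result does not depend on how many rows are in the batch.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

batch = np.arange(12, dtype=np.float64).reshape(4, 3)

# Without axis=, np.linalg.norm returns the Frobenius norm of the whole
# 2-D array, so splitting the same rows into different batch sizes
# yields different "normalized" vectors.
whole = batch / np.linalg.norm(batch)
halves = np.vstack([b / np.linalg.norm(b) for b in np.split(batch, 2)])
assert not np.allclose(whole, halves)

# Per-row normalization is independent of the chunking.
per_row_whole = normalize_rows(batch)
per_row_halves = np.vstack([normalize_rows(b) for b in np.split(batch, 2)])
assert np.allclose(per_row_whole, per_row_halves)
```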