scincl for sentence embedding #10

Open
orhansonmeztr opened this issue Apr 18, 2023 · 0 comments

Hi.
Thank you for publishing the model.
I have a problem, but I can't find where I went wrong.
I have the titles and abstracts of some articles and want to get vector embeddings for them.
I used the code below, following your suggestion on how to use the model.
Interestingly, when I tried to process all ~500 records at once, I never got a result because my computer's 16 GB of RAM filled up.
So I split the data into chunks and got the embeddings quickly.
But, for example, the vectors I get with a chunk size of 10 differ from the vectors I get with a chunk size of 20.
I'm probably using the tokenizer wrong.
If you have any ideas, I would be glad to hear them.
Best wishes,
Orhan

import json
import numpy as np
from transformers import AutoTokenizer, AutoModel

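# Load the SciNCL tokenizer and model from the 'scincl' checkpoint.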
tokenizer = AutoTokenizer.from_pretrained('scincl')
model = AutoModel.from_pretrained('scincl')
csize = 20

def normalizer(x):
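    # NOTE: np.linalg.norm(x) with no axis argument returns a single Frobenius
    # norm for the whole matrix, so every row is scaled by a value that depends
    # on all the other rows in the batch.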
    normalized_vector = x / np.linalg.norm(x)
    return np.array(normalized_vector)

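# Split the record list into consecutive chunks of csize items each.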
def split_chunks(data):
    return [data[x:x + csize] for x in range(0, len(data), csize)]

def get_vectors(chunks):
    title_vecs = np.empty(shape=[0, 768])
    abstract_vecs = np.empty(shape=[0, 768])
    for chunk in chunks:
        title = [d['title'] for d in chunk]
        abstract = [d['abstract'] for d in chunk]

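        # Tokenize the batch of titles and take the last hidden state of the
        # first ([CLS]) token as the embedding.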
        inputs = tokenizer(title, padding=True, truncation=True, return_tensors="pt", max_length=512)
        result = model(**inputs)
        embedT = result.last_hidden_state[:, 0, :]
        title_vecs = np.append(title_vecs, normalizer(embedT.detach().numpy()), axis=0)

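        # Same for the abstracts, again using the [CLS] token.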
        inputs = tokenizer(abstract, padding=True, truncation=True, return_tensors="pt", max_length=512)
        result = model(**inputs)
        embedA = result.last_hidden_state[:, 0, :]
        abstract_vecs = np.append(abstract_vecs, normalizer(embedA.detach().numpy()), axis=0)

    return title_vecs, abstract_vecs

print("started")
f = open('abstracts.json', "r")
data = json.loads(f.read())
chunks = split_chunks(data)
title_vecs, abstract_vecs = get_vectors(chunks)
np.save('data_title_scincl_norm_' + str(csize) + '.npy', title_vecs)
np.save('data_abstract_scincl_norm_' + str(csize) + '.npy', abstract_vecs)
print("finished")