
How to speed up for large dataset #214

Open
linWujl opened this issue Apr 20, 2023 · 3 comments

linWujl commented Apr 20, 2023

Hello, my corpus is 700 GB. Is there any way to speed this up?

AngledLuffa (Contributor) commented Apr 20, 2023 via email

linWujl (Author) commented Apr 20, 2023

The cooccur step has already taken about 7,500 minutes and is still at the merge stage.

Would it be possible to use Spark to construct the cooccurrence statistics and then train the model with TensorFlow?
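A rough sketch of what the Spark half of that idea could look like, as a PySpark job that produces GloVe-style distance-weighted counts (a pair at distance d contributes 1/d, matching what cooccur computes). The corpus path, output path, and window size are placeholders, and vocabulary filtering and min-count pruning are left out:

```python
# Sketch: distance-weighted co-occurrence counts with PySpark.
# Assumes a whitespace-tokenized corpus, one document per line.
# Paths and WINDOW are placeholders; vocab filtering is omitted.
from pyspark.sql import SparkSession

WINDOW = 15  # GloVe demo default

def pair_weights(line):
    """Emit ((center, context), 1/distance) for every pair in a line."""
    tokens = line.split()
    for i, center in enumerate(tokens):
        for j in range(max(0, i - WINDOW), i):
            d = float(i - j)
            # symmetric window: count the pair in both directions
            yield ((center, tokens[j]), 1.0 / d)
            yield ((tokens[j], center), 1.0 / d)

spark = SparkSession.builder.appName("glove-cooccur").getOrCreate()
counts = (
    spark.sparkContext.textFile("hdfs:///path/to/corpus")   # placeholder
    .flatMap(pair_weights)
    .reduceByKey(lambda a, b: a + b)
)
counts.map(lambda kv: "%s\t%s\t%f" % (kv[0][0], kv[0][1], kv[1])) \
      .saveAsTextFile("hdfs:///path/to/cooccurrences")      # placeholder
spark.stop()
```

Two caveats: GloVe's shuffle and glove binaries read a binary record format (two int word ids plus a double), so this text output would still need a conversion step; and on the C side, cooccur and shuffle both accept a -memory flag (in GB), so if that was left at a small default, cooccur will spill many overflow files, and merging them is exactly where the pipeline crawls.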

AngledLuffa (Contributor)

We did try converting it to torch at one point, but it wound up being significantly slower than the C version. We may try again sometime. You are welcome to try...

Do you have enough memory? It might be worth checking top.
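For the TensorFlow half of the question, here is a minimal sketch of GloVe's weighted least-squares objective, assuming the (i, j, X_ij) co-occurrence triples are already loaded as integer/float tensors. The vocabulary size, vector size, and hyperparameters below are the usual published defaults rather than anything tuned, and the input pipeline (which is where a 700 GB corpus would actually hurt) is elided:

```python
# Sketch: GloVe's weighted least-squares loss in TensorFlow.
# Illustrative only -- not the project's C implementation or its
# torch port. Batching and the data pipeline are omitted.
import tensorflow as tf

VOCAB, DIM = 400_000, 300        # placeholder sizes
X_MAX, ALPHA = 100.0, 0.75       # GloVe paper defaults

w     = tf.Variable(tf.random.uniform([VOCAB, DIM], -0.5 / DIM, 0.5 / DIM))
w_ctx = tf.Variable(tf.random.uniform([VOCAB, DIM], -0.5 / DIM, 0.5 / DIM))
b     = tf.Variable(tf.zeros([VOCAB]))
b_ctx = tf.Variable(tf.zeros([VOCAB]))
opt = tf.keras.optimizers.Adagrad(learning_rate=0.05)  # AdaGrad, as in the paper

@tf.function
def train_step(i, j, x_ij):
    """One step on a batch of (center id, context id, count) triples."""
    with tf.GradientTape() as tape:
        dot = tf.reduce_sum(tf.gather(w, i) * tf.gather(w_ctx, j), axis=1)
        pred = dot + tf.gather(b, i) + tf.gather(b_ctx, j)
        # f(x) = min(1, (x / x_max)^alpha) down-weights rare pairs
        f = tf.minimum(1.0, tf.pow(x_ij / X_MAX, ALPHA))
        loss = tf.reduce_mean(f * tf.square(pred - tf.math.log(x_ij)))
    variables = [w, w_ctx, b, b_ctx]
    opt.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```

An objective this gather/scatter-heavy tends to be memory-bandwidth-bound rather than compute-bound, which may be part of why the torch port mentioned above came out slower than the cache-friendly, multithreaded C code.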
