
How to train? #6

Open
alphaonex86 opened this issue Mar 18, 2024 · 2 comments

Comments


alphaonex86 commented Mar 18, 2024

Hi, I'd like to train on a larger base set (like Gentoo) and on multiple architectures (RISC-V, ARM, MIPS v2, x86, ...).
How would I do that?
Does your code support foreign architectures?
I'd also like to train with old compilers; that would help analyze old, unmaintained code on the MIPS architecture.

albertan017 (Owner) commented

Due to the sequence length constraints of most large language models (LLMs), which typically range from 1,000 to 16,000 tokens, processing extensive inputs directly isn't feasible. It's better to segment your dataset into smaller, function-level chunks that pair the binary code with its corresponding source code. Once the data is prepared, it can be fed into the LLM for fine-tuning.
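The function-level pairing described above can be sketched as follows. This is a minimal illustration, not the project's actual data pipeline: the field names (`asm`, `src`), the prompt format, and the 4-characters-per-token heuristic are all assumptions (a real setup would use the model's tokenizer).

```python
# Hypothetical sketch: build function-level (binary, source) pairs and drop
# any pair that would exceed the model's context window. Field names and the
# chars-per-token heuristic are assumptions, not this repo's actual format.

MAX_TOKENS = 4096        # assumed context limit for the target LLM
CHARS_PER_TOKEN = 4      # rough heuristic; use a real tokenizer in practice

def approx_tokens(text):
    """Crude token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN

def make_pairs(functions):
    """functions: iterable of dicts with 'asm' (disassembly) and 'src' (C source)."""
    pairs = []
    for fn in functions:
        prompt = "# Decompile this function:\n" + fn["asm"]
        # Skip functions whose prompt + target would not fit in the context.
        if approx_tokens(prompt) + approx_tokens(fn["src"]) <= MAX_TOKENS:
            pairs.append({"input": prompt, "output": fn["src"]})
    return pairs

sample = [
    {"asm": "push %rbp\nmov %rsp,%rbp\nmov $0x2a,%eax\npop %rbp\nret",
     "src": "int answer(void) { return 42; }"},
    {"asm": "nop\n" * 100000, "src": "void big(void) {}"},  # too long, filtered out
]
print(len(make_pairs(sample)))  # prints 1: only the small function survives
```

The same length filter is also why whole-binary inputs (as raised in this issue) have to be split up before training.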

Currently, our model is trained to support C language decompilation on the Linux x86_64 architecture. For your interest in working with older compilers, the LLM generally treats input from various compilers similarly, without significant differentiation.

alphaonex86 (Author) commented

I understand completely. But in the real world, where everybody has blocking, unmaintained binaries, we have a lot of large, messy binaries. I also don't see how to decompile in parts; that implies fitting the previous/next chunk into the token window and being able to rewrite the previously written file.
Maybe chunk automatically by function.
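The "chunk automatically by function" idea can be sketched by splitting disassembly on the `<name>:` headers that GNU objdump emits for each function. This is an illustrative sketch, not part of this project: the sample text below is hand-written in objdump's format, not real tool output.

```python
import re

# Sketch of "auto chunk by function": split `objdump -d` style disassembly
# into per-function chunks using the "address <name>:" headers that GNU
# objdump emits. The sample below is illustrative, not real tool output.

FUNC_HEADER = re.compile(r"^[0-9a-f]+ <(?P<name>[^>]+)>:$")

def chunk_by_function(disasm):
    """Return {function_name: body_text} from objdump-style disassembly."""
    chunks, name, lines = {}, None, []
    for line in disasm.splitlines():
        m = FUNC_HEADER.match(line)
        if m:
            if name:                      # close the previous function
                chunks[name] = "\n".join(lines)
            name, lines = m.group("name"), []
        elif name:
            lines.append(line)
    if name:                              # close the last function
        chunks[name] = "\n".join(lines)
    return chunks

sample = """0000000000001129 <main>:
    1129: push %rbp
    112a: mov %rsp,%rbp

0000000000001140 <helper>:
    1140: ret
"""
print(sorted(chunk_by_function(sample)))  # prints ['helper', 'main']
```

Each resulting chunk could then be decompiled independently, which sidesteps the context-window problem for large binaries (at the cost of losing cross-function context).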

Currently, our model is trained to support C language decompilation on the Linux x86_64 architecture

Yes, I'd like to train with more architectures, because I have some router code to study (from MIPS) and code built with GCC 4.6 (kernel modules), obfuscated across multiple .ko files and .so files for userspace.
