How to train? #6
Due to the sequence-length constraints of most large language models (LLMs), which typically range from 1,000 to 16,000 tokens, processing extensive inputs directly isn't feasible. It's better to segment your dataset into smaller, function-level chunks that pair each function's binary code with its corresponding source code. Once the data is prepared, it can be fed into the LLM for fine-tuning. Currently, our model is trained to support C-language decompilation on the Linux x86_64 architecture. As for your interest in older compilers, the LLM generally treats input from different compilers similarly, without significant differentiation.
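The function-level chunking described above can be sketched as follows. This is a minimal illustration only: the sample format, the rough token heuristic, and all names here are assumptions, not the project's actual data pipeline.

```python
# Hypothetical sketch: pairing one function's disassembly with its C
# source for decompilation fine-tuning. The prompt format and the
# tokens-per-character heuristic are illustrative assumptions.

MAX_TOKENS = 4096  # example context budget for the fine-tuned model


def approx_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for code)."""
    return len(text) // 4


def make_sample(asm: str, source: str, max_tokens: int = MAX_TOKENS):
    """Pair one function's assembly with its C source.

    Returns a prompt/completion dict, or None if the pair would exceed
    the model's context window and should be skipped (or split further).
    """
    prompt = (
        "# Decompile the following x86_64 assembly to C:\n"
        + asm
        + "# C source:\n"
    )
    if approx_tokens(prompt) + approx_tokens(source) > max_tokens:
        return None  # too long for the context window
    return {"prompt": prompt, "completion": source}


# Example: a trivial function pair
asm = (
    "add:\n"
    "    lea    eax, [rdi+rsi]\n"
    "    ret\n"
)
src = "int add(int a, int b) { return a + b; }\n"
sample = make_sample(asm, src)
```

In practice each pair would come from compiling a corpus function by function and disassembling the result (e.g. with objdump), so that every training sample stays within the model's context window.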
I understand completely. But in the real world, where everyone has critical unmaintained binaries, we have lots of messy, large binaries. I also don't see how to decompile part by part: that implies feeding the previous/next chunk into the context and being able to rewrite the previously written output file.
Yes, I'd like to train on more architectures, because I have some router code to study (MIPS) and code built with gcc 4.6 (kernel modules), obfuscated across multiple .ko files and .so files for userspace.
Hi, I'd like to train on a larger base set (like Gentoo) and on multiple architectures (RISC-V, ARM, MIPS v2, x86, ...).
How can I do this?
Does your code support other architectures?
I'd also like to train with older compilers; that would help with analyzing old, unmaintained code on the MIPS architecture.