Support embedding markdown while retaining formatting #119
Sorry for the delay; I was making some updates to main. I will pull this today, test it, and fix the conflicts. I'll let you know.
Hi, I was going through this. Do we need a separate
Can do; I was just worried about making any breaking changes. We should probably add
That's fine. I can handle some patching if there are any breaking changes. You are correct; integrating HTML would also be beneficial. I'm also curious: do you know any effective methods for JSON and code splitting? I've been researching this and would like to hear if you've come across any useful approaches.
I haven't looked at JSON at all, but we can tackle that after these MD and HTML changes 😄
I was having a play around with this locally, but the
# Conflicts:
#	Cargo.lock
#	rust/Cargo.toml
#	rust/src/lib.rs
That's indeed one valid approach, but I would suggest we keep the embedding logic out of the processors. One reason is that it could result in a lot of repeated code: after chunking, embedding is the same for all file types. The second is separation of concerns. The idea is for the processors to make everything ready for the embeddings. What we can do is move the chunking/splitting logic into each processor so that they can return the chunks and the
So once we have a file, we can convert it to a suitable document type based on your implementation of
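To make the suggested split concrete, here is a minimal sketch of processors that only chunk, followed by one embedding step shared by every file type. `MarkdownProcessor` and `HtmlProcessor` are names from this PR; everything else (`Document`, `DocumentProcessor`, `Embedder`, `LengthEmbedder`, `embed_document`) is a hypothetical name chosen for illustration, and the splitting rules are deliberately naive stand-ins, not the crate's actual logic.

```rust
/// Chunks produced by a processor, ready to be embedded (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub chunks: Vec<String>,
}

/// Hypothetical trait: each file type owns its own splitting logic.
pub trait DocumentProcessor {
    fn process(&self, content: &str) -> Document;
}

pub struct MarkdownProcessor;

impl DocumentProcessor for MarkdownProcessor {
    // Naive rule for illustration: start a new chunk at every heading line,
    // keeping the Markdown formatting inside each chunk.
    fn process(&self, content: &str) -> Document {
        let mut chunks: Vec<String> = Vec::new();
        let mut current = String::new();
        for line in content.lines() {
            if line.starts_with('#') && !current.trim().is_empty() {
                chunks.push(current.trim_end().to_string());
                current.clear();
            }
            current.push_str(line);
            current.push('\n');
        }
        if !current.trim().is_empty() {
            chunks.push(current.trim_end().to_string());
        }
        Document { chunks }
    }
}

pub struct HtmlProcessor;

impl DocumentProcessor for HtmlProcessor {
    // Naive rule for illustration: split after each closing </p> tag,
    // keeping the markup inside each chunk.
    fn process(&self, content: &str) -> Document {
        let chunks = content
            .split_inclusive("</p>")
            .map(|s| s.trim().to_string())
            .filter(|s| !s.is_empty())
            .collect();
        Document { chunks }
    }
}

/// Stand-in for a real embedding model (hypothetical trait).
pub trait Embedder {
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Toy embedder that returns the chunk length, so the sketch runs without a model.
pub struct LengthEmbedder;

impl Embedder for LengthEmbedder {
    fn embed(&self, text: &str) -> Vec<f32> {
        vec![text.len() as f32]
    }
}

/// The shared embedding step: identical for every file type.
pub fn embed_document(doc: &Document, embedder: &dyn Embedder) -> Vec<(String, Vec<f32>)> {
    doc.chunks
        .iter()
        .map(|chunk| (chunk.clone(), embedder.embed(chunk)))
        .collect()
}
```

With this shape, supporting a new file type would only mean adding another `DocumentProcessor` implementation; `embed_document` stays untouched.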
In the future we can also have a
Seems reasonable, I'll try keeping the splitters in each processor and see how I go :)
Alright, I've got some changes with the Rust side building
Hi, do let me know if you are still going ahead with this.
I'm currently on holiday. I'll be back in April, but I'll probably have work to catch up on too, so this will be on hold until then. I'll mark it as ready for review once it's in a place I'm happy with 🙂
Summary of changes
- embed_markdown for it
- process_markdown and process_markdown_file to MarkdownProcessor
- HtmlProcessor, which I assume is what we want
- embed_markdown convenience function

I've kept the original Markdown text processing pipeline intact for now to avoid breaking changes.