Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support embedding markdown while retaining formatting #119

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

boswelja
Copy link
Contributor

Summary of changes

  • Created MarkdownDocument struct
    • Implemented embed_markdown for it
  • Added process_markdown and process_markdown_file to MarkdownProcessor
    • This brings the processor in-line with HtmlProcessor, which I assume is what we want
  • Added embed_markdown convenience function

I've kept the original Markdown text processing pipeline intact for now to avoid breaking changes.

@akshayballal95
Copy link
Collaborator

Sorry for the delay; I was making some updates to the main. I will pull this today, test and fix the conflicts. Will let you know

@akshayballal95
Copy link
Collaborator

Hi i was going through this. Do we need a separate embed_markdown function in lib.rs. The idea is to use embed_file to capture all extensions automatically. Could this not be integrated into the existing extract_text to get the content and change the split_into_chunks function in TextLoader to use the right splitter based on the file extension. This will also be more scalable once we introduce code splitting and json splitting.

@boswelja
Copy link
Contributor Author

Can do, I was just worried about making any breaking changes

We should probably add html to embed_file types at some stage too in that case 🤔

@akshayballal95
Copy link
Collaborator

That's fine. I can handle some patching if there are any breaking changes.

You are correct; integrating HTML would also be beneficial.

I'm also curious—do you know any effective methods for JSON and Code Splitting? I've been researching on this and would like to hear if you've come across any useful approaches.

@boswelja
Copy link
Contributor Author

I haven't looked at JSON at all, but we can tackle that after these MD and HTML changes 😄

@boswelja
Copy link
Contributor Author

Could this not be integrated into the existing extract_text to get the content and change the split_into_chunks function in TextLoader to use the right splitter based on the file extension.

I was having a play around with this locally, but the *Processor struct ends up only reading a file. Is that really what we want?
I'm wondering if it makes more sense to keep Markdown handling within MarkdownProcessor - embed_file would see the .md extension and pass it on to MarkdownProcessor, which would return the embeddings. With this approach, we end up with a self-contained Processor for every file type, and a utility function (embed_file) that handles choosing and using them.

# Conflicts:
#	Cargo.lock
#	rust/Cargo.toml
#	rust/src/lib.rs
@akshayballal95
Copy link
Collaborator

akshayballal95 commented Feb 20, 2025

That's indeed one valid approach. But I would suggest we keep the embedding logic out of the processors. One reason is because it could result in a lot of repeated code because after chunking, embedding is the same for all file types. Second, is the division of concern. The idea for the processors is to make everything ready for the embeddings. What can be done is to move chunking/splitting logic to each processor such that they can return the chunks and the emb_text function just consumes those chunks. This way we can also completely make the *Processor standalone for generating chunks just like document loaders in langchain or llamaindex.

So once we have a file, we can convert it to a suitable document type based on your implementation of MarkdownDocument and have extract and split_into_chunks methods associated for each document. We can infact have a Document trait which implements these methods. This can make the whole processor module very scalable.

@akshayballal95
Copy link
Collaborator

We can also in the future have a get_metadata function for each *Document. For example for a markdown file we can output the front matter as metadata, for pdf page numbers etc.

@boswelja
Copy link
Contributor Author

Seems reasonable, I'll try keeping the splitters in each processor and see how I go :)

@boswelja boswelja marked this pull request as draft February 22, 2025 05:15
@boswelja
Copy link
Contributor Author

Alright I've got some changes with the Rust side building
I haven't tested it or anything - we need a bunch of new tests now too)
There's also some parts I'm generally not happy with, I'll be thinking on them over the weekend 😅

@akshayballal95
Copy link
Collaborator

Hi, do let me know if you are still going ahead with this.

@boswelja
Copy link
Contributor Author

boswelja commented Mar 8, 2025

I'm currently on holiday, I'll be back in April but I'll probably have work to catch up on too so this will be on hold until then.

I'll mark it as ready for review once it's in a place I'm happy with 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants