Support embedding markdown while retaining formatting #119

boswelja · 2025-02-14T10:10:09Z

Summary of changes

Created MarkdownDocument struct
- Implemented embed_markdown for it
Added process_markdown and process_markdown_file to MarkdownProcessor
- This brings the processor in line with HtmlProcessor, which I assume is what we want
Added embed_markdown convenience function

I've kept the original Markdown text processing pipeline intact for now to avoid breaking changes.

akshayballal95 · 2025-02-17T13:32:49Z

Sorry for the delay; I was making some updates to main. I will pull this today, test it, and fix the conflicts. I'll let you know how it goes.

akshayballal95 · 2025-02-18T21:04:17Z

Hi, I was going through this. Do we need a separate embed_markdown function in lib.rs? The idea is to use embed_file to capture all extensions automatically. Could this not be integrated into the existing extract_text to get the content, with the split_into_chunks function in TextLoader changed to use the right splitter based on the file extension? This will also be more scalable once we introduce code splitting and JSON splitting.
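A minimal sketch of what "use the right splitter based on the file extension" could look like. The `SplitterKind` enum and the `splitter_for` function are hypothetical names made up for illustration; they are not part of the crate's API, and the real TextLoader may be structured differently.

```rust
use std::path::Path;

/// Hypothetical splitter kinds; illustrative names only, not the crate's API.
#[derive(Debug, PartialEq)]
enum SplitterKind {
    Markdown,
    Html,
    PlainText,
}

/// Pick a splitter from the file extension (case-insensitive),
/// falling back to plain text for unknown or missing extensions.
fn splitter_for(path: &Path) -> SplitterKind {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());
    match ext.as_deref() {
        Some("md") | Some("markdown") => SplitterKind::Markdown,
        Some("html") | Some("htm") => SplitterKind::Html,
        _ => SplitterKind::PlainText,
    }
}
```

With this shape, supporting code or JSON splitting later would mean adding a variant and a match arm rather than a new top-level function.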

boswelja · 2025-02-18T21:44:41Z

Can do, I was just worried about making any breaking changes

We should probably add html to embed_file types at some stage too in that case 🤔

akshayballal95 · 2025-02-18T22:06:00Z

That's fine. I can handle some patching if there are any breaking changes.

You are correct; integrating HTML would also be beneficial.

I'm also curious: do you know any effective methods for JSON and code splitting? I've been researching this and would like to hear if you've come across any useful approaches.

boswelja · 2025-02-18T23:48:43Z

I haven't looked at JSON at all, but we can tackle that after these MD and HTML changes 😄

boswelja · 2025-02-20T03:48:28Z

> Could this not be integrated into the existing extract_text to get the content and change the split_into_chunks function in TextLoader to use the right splitter based on the file extension.

I was having a play around with this locally, but the *Processor struct ends up only reading a file. Is that really what we want?
I'm wondering if it makes more sense to keep Markdown handling within MarkdownProcessor - embed_file would see the .md extension and pass it on to MarkdownProcessor, which would return the embeddings. With this approach, we end up with a self-contained Processor for every file type, and a utility function (embed_file) that handles choosing and using them.


akshayballal95 · 2025-02-20T12:08:22Z

That's indeed one valid approach, but I would suggest we keep the embedding logic out of the processors. The first reason is that it could result in a lot of repeated code, since embedding is the same for all file types once chunking is done. The second is separation of concerns: the idea of the processors is to make everything ready for the embeddings. What we can do is move the chunking/splitting logic into each processor so that they return the chunks, and the emb_text function just consumes those chunks. This way each *Processor also becomes fully standalone for generating chunks, just like the document loaders in langchain or llamaindex.

So once we have a file, we can convert it to a suitable document type based on your implementation of MarkdownDocument, with extract and split_into_chunks methods associated with each document. We can in fact have a Document trait that declares these methods. This can make the whole processor module very scalable.
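The shape being proposed could be sketched roughly as below. This is only an illustration under assumptions: the trait signature, the `max_len` parameter, the `chunks_for_embedding` helper and the heading-based splitting rule are all made up here, and the MarkdownDocument in the PR may look quite different.

```rust
/// Hypothetical trait: each document type extracts its own text and
/// splits it into chunks. Embedding is deliberately not part of it.
trait Document {
    fn extract(&self) -> String;
    fn split_into_chunks(&self, max_len: usize) -> Vec<String>;
}

/// Illustrative stand-in for the PR's MarkdownDocument.
struct MarkdownDocument {
    content: String,
}

/// Push the trimmed buffer as a chunk (if non-blank) and reset it.
fn flush(current: &mut String, chunks: &mut Vec<String>) {
    let trimmed = current.trim();
    if !trimmed.is_empty() {
        chunks.push(trimmed.to_string());
    }
    current.clear();
}

impl Document for MarkdownDocument {
    fn extract(&self) -> String {
        self.content.clone()
    }

    /// Naive splitter: start a new chunk at every line beginning with `#`,
    /// or when adding the next line would exceed `max_len` bytes.
    /// It does not understand fenced code blocks or other Markdown syntax.
    fn split_into_chunks(&self, max_len: usize) -> Vec<String> {
        let mut chunks = Vec::new();
        let mut current = String::new();
        for line in self.content.lines() {
            let at_heading = line.starts_with('#');
            let too_long = current.len() + line.len() + 1 > max_len;
            if at_heading || too_long {
                flush(&mut current, &mut chunks);
            }
            current.push_str(line);
            current.push('\n');
        }
        flush(&mut current, &mut chunks);
        chunks
    }
}

/// Stand-in for the shared embedding entry point: it only sees chunks,
/// so it works for any `Document` implementation.
fn chunks_for_embedding(doc: &dyn Document, max_len: usize) -> Vec<String> {
    doc.split_into_chunks(max_len)
}
```

Because the shared step takes `&dyn Document`, a new file type only needs its own `Document` implementation; the embedding code stays untouched.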

akshayballal95 · 2025-02-20T12:29:34Z

In the future we can also have a get_metadata function for each *Document. For example, for a Markdown file we can output the front matter as metadata; for a PDF, page numbers; and so on.
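As a rough illustration of the Markdown case, front matter could be read along these lines. The `front_matter` function is a hypothetical helper, and it assumes flat `key: value` lines between `---` delimiters; it is not a YAML parser.

```rust
use std::collections::HashMap;

/// Parse simple `key: value` front matter delimited by `---` lines.
/// Assumption: values are flat strings. Nested maps, lists and
/// multi-line values are not handled.
fn front_matter(markdown: &str) -> HashMap<String, String> {
    let mut meta = HashMap::new();
    let mut lines = markdown.lines();
    // Front matter must start on the very first line.
    if lines.next().map(str::trim) != Some("---") {
        return meta;
    }
    for line in lines {
        if line.trim() == "---" {
            return meta; // closing delimiter reached
        }
        if let Some((key, value)) = line.split_once(':') {
            meta.insert(key.trim().to_string(), value.trim().to_string());
        }
    }
    // No closing delimiter: treat the file as having no front matter.
    HashMap::new()
}
```

A real implementation would more likely hand the delimited block to a proper YAML or TOML parser; the point here is only that metadata extraction can live next to extract and split_into_chunks on each document type.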

boswelja · 2025-02-21T22:53:23Z

Seems reasonable, I'll try keeping the splitters in each processor and see how I go :)

boswelja · 2025-02-22T05:16:24Z

Alright, I've got some changes with the Rust side building.
I haven't tested it or anything (we need a bunch of new tests now too).
There are also some parts I'm generally not happy with; I'll be thinking on them over the weekend 😅

akshayballal95 · 2025-03-08T11:59:14Z

Hi, do let me know if you are still going ahead with this.

boswelja · 2025-03-08T23:47:43Z

I'm currently on holiday. I'll be back in April, but I'll probably have work to catch up on too, so this will be on hold until then.

I'll mark it as ready for review once it's in a place I'm happy with 🙂

boswelja added 3 commits February 14, 2025 20:52

Support embedding markdown while retaining formatting

9ac84e0

Fix build

12fbb87

Revert rogue changes

a51877e

boswelja added 3 commits February 20, 2025 14:37

Adjust embed_markdown to take content instead of a file

7a4acf4

Remove generics from embed_file

8439b94

Remove generics from emb_text

3522840

Merge branch 'main' into improved-markdown-embedding

d20838b

# Conflicts: # Cargo.lock # rust/Cargo.toml # rust/src/lib.rs

boswelja added 4 commits February 22, 2025 13:32

Lift embedding out of markdown_processor.rs

8bc0cce

Drafting processor traits

39b6323

Fixes

c6961f5

Building

5a1e5cb

boswelja marked this pull request as draft February 22, 2025 05:15
