[Feature][File] Add markdown parser for RAG support #9714 #9760

joonseolee · 2025-08-24T23:08:48Z

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config

joonseolee · 2025-08-24T23:10:46Z

I have studied this project as much as possible and added the RAG functionality, but since this is my first time working on it, there may be parts that I missed. If you could let me know about those parts, I will make sure to correct them immediately.

Hisoka-X · 2025-08-25T05:43:30Z

Hi @joonseolee , thanks for your PR.

However, this implementation may require some adjustments before it can be merged.

1. We should add a new format in

seatunnel/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/FileFormat.java

Line 46 in 555f6c6

public enum FileFormat implements Serializable {

2. We should use the file-series source to read the Markdown file and parse it in a MarkdownReadStrategy added in step 1.

So all of this functionality belongs in the file source connector, not in a separate transform. Please refer to https://github.com/apache/seatunnel/blob/dev/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/MultipleTableFileSourceReader.java#L42
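As an illustration of step 1, the sketch below shows how an enum constant for a read-only format might expose a read strategy while rejecting write requests. The names `ReadOnlyFormatSketch`, `Format`, and the one-method `ReadStrategy`/`WriteStrategy` interfaces are simplified assumptions made for this example; they are not the real SeaTunnel types or signatures.

```java
final class ReadOnlyFormatSketch {
    // Hypothetical, simplified stand-ins for the connector's strategy types.
    interface ReadStrategy {
        String describe();
    }

    interface WriteStrategy {
        String describe();
    }

    // Each constant decides which strategies it supports.
    enum Format {
        TEXT {
            @Override
            ReadStrategy getReadStrategy() {
                return () -> "text reader";
            }

            @Override
            WriteStrategy getWriteStrategy() {
                return () -> "text writer";
            }
        },
        MARKDOWN {
            @Override
            ReadStrategy getReadStrategy() {
                return () -> "markdown reader";
            }

            @Override
            WriteStrategy getWriteStrategy() {
                // Assumption: the format is read-only, so the write side fails fast.
                throw new UnsupportedOperationException(
                        "File format 'markdown' does not support writing.");
            }
        };

        abstract ReadStrategy getReadStrategy();

        abstract WriteStrategy getWriteStrategy();
    }

    public static void main(String[] args) {
        System.out.println(Format.MARKDOWN.getReadStrategy().describe());
        try {
            Format.MARKDOWN.getWriteStrategy();
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the unsupported direction inside the enum constant means callers get a clear error at configuration time instead of a silent no-op.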

joonseolee · 2025-08-25T23:10:04Z

@Hisoka-X

Thank you for your comment ! :))
So, is my understanding correct that instead of adding the new format (Markdown) to the seatunnel-transforms-v2 module, I should add it to the FileFormat enum and implement a corresponding ReadStrategy? And based on the chunkSize and overlap values, I can create a new class similar to MultipleTableFileSourceReader to convert it into structured data. Additionally, I should implement a new SourceSplit, for example, by creating something like RagFileSourceSplit.

cc @iinow

Hisoka-X · 2025-08-26T02:58:37Z

I should add it to the FileFormat enum and implement a corresponding ReadStrategy?

Yes.

And based on the chunkSize and overlap values, I can create a new class similar to MultipleTableFileSourceReader to convert it into structured data. Additionally, I should implement a new SourceSplit, for example, by creating something like RagFileSourceSplit.

No, MultipleTableFileSourceReader and SourceSplit are shared by every FileFormat.

MultipleTableFileSourceReader is used to read file paths; ReadStrategy is used to parse the data in each file.
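The division of responsibilities described here can be sketched roughly as follows: one shared reader walks the file paths, and only the per-format strategy knows how to parse content. Everything in this block (`ReaderVsStrategySketch`, `SourceReader`, `readAll`, the in-memory map of paths to content) is a hypothetical simplification for illustration, not the actual SeaTunnel API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReaderVsStrategySketch {
    // Hypothetical stand-in for ReadStrategy: turns one file's content into rows.
    interface ReadStrategy {
        List<String> parse(String content);
    }

    // Hypothetical stand-in for the shared source reader: it only walks file
    // paths and hands each file's content to whichever strategy was configured.
    static final class SourceReader {
        private final ReadStrategy strategy;

        SourceReader(ReadStrategy strategy) {
            this.strategy = strategy;
        }

        // "files" maps a path to its content so the sketch needs no real I/O.
        List<String> readAll(Map<String, String> files) {
            List<String> rows = new ArrayList<>();
            for (Map.Entry<String, String> file : files.entrySet()) {
                for (String row : strategy.parse(file.getValue())) {
                    rows.add(file.getKey() + ": " + row);
                }
            }
            return rows;
        }
    }

    // Format-specific logic lives only here; the reader above stays unchanged.
    // This toy strategy emits one row per non-blank line.
    static final ReadStrategy MARKDOWN_LINES = content -> {
        List<String> rows = new ArrayList<>();
        for (String line : content.split("\n")) {
            if (!line.trim().isEmpty()) {
                rows.add(line.trim());
            }
        }
        return rows;
    };

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.md", "# Title\n\nBody text.");
        new SourceReader(MARKDOWN_LINES).readAll(files).forEach(System.out::println);
    }
}
```

Under this split, supporting a new format means adding a strategy, while the reader and split classes stay untouched.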

joonseolee · 2025-08-26T06:58:26Z

I should add it to the FileFormat enum and implement a corresponding ReadStrategy?

Yes.

Ah, then for now, in this ticket, should I just proceed with the things mentioned above?
I’ll make sure that the chunk-related functionality is discussed in ticket #9717.
And I’ll try to develop it as quickly as possible so that I can submit a PR soon.

cc @iinow

Hisoka-X · 2025-08-26T07:20:04Z

Ah, then for now, in this ticket, should I just proceed with the things mentioned above?

Choose the way you like :)

joonseolee · 2025-08-28T00:30:39Z

@Hisoka-X

For this ticket, I have added the markdown read strategy first, and I plan to focus on improvements in the split ticket #9717 next.
I have worked on this again. Could you please take a look when you have a moment?

Hisoka-X · 2025-08-28T03:57:13Z

...ile-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/FileFormat.java

+        @Override
+        public WriteStrategy getWriteStrategy(FileSinkConfig fileSinkConfig) {
+            throw new UnsupportedOperationException(
+                    "File format 'markdown' does not support reading.");


Suggested change

-                    "File format 'markdown' does not support reading.");
+                    "File format 'markdown' does not support writing.");

Hisoka-X · 2025-08-28T06:05:47Z

.../java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/MarkdownReadStrategy.java

+                new String[]{"type", "value"},
+                new org.apache.seatunnel.api.table.type.SeaTunnelDataType[]{
+                        BasicType.STRING_TYPE, BasicType.STRING_TYPE


The row fields are too simple.
We should include:

element_id

element_type

heading_level

text

page_number

position_index

parent_id

child_ids

For example:

# Data Source Configuration

## Kafka Configuration

Users can configure Kafka by specifying the bootstrap.servers, topic, and group.id.

## MySQL Configuration

MySQL requires a JDBC URL, username, and password.

### Notes

Make sure to test the connection before deploying.

The row should be

{
  "element_id": "uuid-elem-1",
  "element_type": "heading",
  "heading_level": 1,
  "text": "Data Source Configuration",
  "page_number": null,
  "position_index": 0,
  "parent_id": null,
  "child_ids": ["uuid-elem-2", "uuid-elem-4"]
}
{
  "element_id": "uuid-elem-2",
  "element_type": "heading",
  "heading_level": 2,
  "text": "Kafka Configuration",
  "page_number": null,
  "position_index": 1,
  "parent_id": "uuid-elem-1",
  "child_ids": ["uuid-elem-3"]
}
{
  "element_id": "uuid-elem-3",
  "element_type": "paragraph",
  "heading_level": null,
  "text": "Users can configure Kafka by specifying the bootstrap.servers, topic, and group.id.",
  "page_number": null,
  "position_index": 2,
  "parent_id": "uuid-elem-2",
  "child_ids": []
}
{
  "element_id": "uuid-elem-4",
  "element_type": "heading",
  "heading_level": 2,
  "text": "MySQL Configuration",
  "page_number": null,
  "position_index": 3,
  "parent_id": "uuid-elem-1",
  "child_ids": ["uuid-elem-5", "uuid-elem-6"]
}
{
  "element_id": "uuid-elem-5",
  "element_type": "paragraph",
  "heading_level": null,
  "text": "MySQL requires a JDBC URL, username, and password.",
  "page_number": null,
  "position_index": 4,
  "parent_id": "uuid-elem-4",
  "child_ids": []
}
{
  "element_id": "uuid-elem-6",
  "element_type": "heading",
  "heading_level": 3,
  "text": "Notes",
  "page_number": null,
  "position_index": 5,
  "parent_id": "uuid-elem-4",
  "child_ids": ["uuid-elem-7"]
}
{
  "element_id": "uuid-elem-7",
  "element_type": "paragraph",
  "heading_level": null,
  "text": "Make sure to test the connection before deploying.",
  "page_number": null,
  "position_index": 6,
  "parent_id": "uuid-elem-6",
  "child_ids": []
}
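One way to derive the parent_id and child_ids values in the example is to keep a stack of currently open headings. The sketch below is a simplified, line-based illustration under stated assumptions: it treats every non-blank line as either an ATX heading or a paragraph, uses sequential `elem-N` IDs instead of UUIDs, and omits page_number (always null for Markdown). The class and method names are hypothetical; the PR itself relies on the flexmark AST rather than this hand-written scan.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class MarkdownHierarchySketch {
    // One output row; field names follow the review comment.
    static final class Element {
        final String elementId;
        final String elementType;
        final Integer headingLevel; // null for non-heading elements
        final String text;
        final int positionIndex;
        final String parentId; // null for top-level elements
        final List<String> childIds = new ArrayList<>();

        Element(String elementId, String elementType, Integer headingLevel,
                String text, int positionIndex, String parentId) {
            this.elementId = elementId;
            this.elementType = elementType;
            this.headingLevel = headingLevel;
            this.text = text;
            this.positionIndex = positionIndex;
            this.parentId = parentId;
        }
    }

    // Returns the ATX heading level (1-6), or 0 if the line is not a heading.
    static int headingLevel(String line) {
        int hashes = 0;
        while (hashes < line.length() && line.charAt(hashes) == '#') {
            hashes++;
        }
        boolean isHeading = hashes >= 1 && hashes <= 6
                && hashes < line.length() && line.charAt(hashes) == ' ';
        return isHeading ? hashes : 0;
    }

    static List<Element> parse(String markdown) {
        List<Element> rows = new ArrayList<>();
        Deque<Element> openHeadings = new ArrayDeque<>(); // innermost heading on top
        for (String rawLine : markdown.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            int level = headingLevel(line);
            if (level > 0) {
                // A new heading closes every open heading at the same or a deeper level.
                while (!openHeadings.isEmpty() && openHeadings.peek().headingLevel >= level) {
                    openHeadings.pop();
                }
            }
            Element parent = openHeadings.peek();
            Element element = new Element(
                    "elem-" + (rows.size() + 1),
                    level > 0 ? "heading" : "paragraph",
                    level > 0 ? Integer.valueOf(level) : null,
                    level > 0 ? line.substring(level).trim() : line,
                    rows.size(),
                    parent == null ? null : parent.elementId);
            if (parent != null) {
                parent.childIds.add(element.elementId);
            }
            if (level > 0) {
                openHeadings.push(element);
            }
            rows.add(element);
        }
        return rows;
    }

    public static void main(String[] args) {
        String markdown = "# Data Source Configuration\n## Kafka Configuration\n"
                + "Users can configure Kafka.\n## MySQL Configuration\n"
                + "MySQL requires a JDBC URL.\n### Notes\nTest the connection.";
        for (Element e : parse(markdown)) {
            System.out.println(e.positionIndex + " " + e.elementType + " " + e.elementId
                    + " parent=" + e.parentId + " children=" + e.childIds);
        }
    }
}
```

Because child_ids are filled in as later elements arrive, rows can only be emitted after the whole document has been scanned.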

joonseolee · 2025-08-31T23:58:07Z

@Hisoka-X

I'm sorry to bother you, but could you please check it again?

Hisoka-X · 2025-09-01T13:21:14Z

...a/org/apache/seatunnel/connectors/seatunnel/file/source/reader/MarkdownReadStrategyTest.java

+class MarkdownReadStrategyTest {
+
+    @Test
+    public void testReadMarkdown() throws Exception {


Could you test all field values in one row?

@Hisoka-X

Thank you ! I have done it

Hisoka-X · 2025-09-01T13:21:46Z

.../java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/MarkdownReadStrategy.java

+
+package org.apache.seatunnel.connectors.seatunnel.file.source.reader;
+
+import com.vladsch.flexmark.ast.*;


Please run mvn spotless:apply to fix code style.

@Hisoka-X

I have done it all. Can you check it again?

Hisoka-X · 2025-09-03T14:18:50Z

Hi @joonseolee. Thanks for the update! Please follow the guide to enable GitHub Actions on your fork repository. https://github.com/apache/seatunnel/pull/9760/checks?check_run_id=49358045346

joonseolee · 2025-09-04T00:55:31Z

@Hisoka-X

Can you check the below again?
I have passed all tests excluding unit-test (8, windows-latest).

It printed the message like this.

Write Count So Far        :                   0
Average Read Count        :                 0/s
Average Write Count       :                 0/s
Last Statistic Time       : 2025-09-04 00:39:30
Current Statistic Time    : 2025-09-04 00:40:30
***********************************************

org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
Terminate batch job (Y/N)? 
Error: The operation was canceled.

Hisoka-X · 2025-09-04T14:15:22Z

Please re-trigger the failed CI. It may just be unstable.

joonseolee · 2025-09-05T07:10:03Z

@Hisoka-X

Finally! I have passed all CI!
Can you check it again?

Hisoka-X · 2025-09-05T08:53:50Z

Hi @joonseolee . Could you update the docs with another PR?

joonseolee · 2025-09-05T21:57:07Z

@Hisoka-X

Got it. I will make a PR about that :)

corgy-w · 2025-09-08T10:40:04Z

@joonseolee Once the documentation PR is complete, I will merge them together.

zhangshenghang · 2025-09-09T09:08:01Z

@joonseolee Can the image links in Markdown be parsed?

joonseolee · 2025-09-09T09:29:02Z

@corgy-w

I already modified the files and opened a PR, so are you saying that you are going to write the documentation instead of me?

joonseolee · 2025-09-09T09:29:59Z

@zhangshenghang

Of course, I also made it so that the image links can be extracted.

zhangshenghang · 2025-09-09T13:40:34Z

@zhangshenghang

Of course, I also made it so that the image links can be extracted.

@joonseolee However, I don't seem to have found any specialized logic for processing images.

joonseolee · 2025-09-09T13:56:44Z

@zhangshenghang

Sorry, I didn’t check it properly. To explain in more detail, as you saw in the code, I divided it based on Heading, Paragraph, ListItem, BulletList, OrderedList, BlockQuote, FencedCodeBlock, and TableBlock. Elements like image, bold, and italic are only defined as syntax within those blocks. It would be better to divide them in more detail, but for now I set it up so that the bigger sections are separated while image, bold, and italic retain their syntax. If you think my approach is lacking, I can update it right away.
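Since inline image syntax is retained inside each block's text, a downstream consumer could still recover the links afterwards. The sketch below is one hedged way to do that with a regular expression; `InlineImageSketch` and `imageUrls` are hypothetical names, and the pattern covers only inline images of the form `![alt](url)` with an optional quoted title, not reference-style images or URLs containing spaces or parentheses. It is not part of the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineImageSketch {
    // Matches ![alt](url) or ![alt](url "title"); group 2 captures the URL.
    private static final Pattern INLINE_IMAGE =
            Pattern.compile("!\\[([^\\]]*)\\]\\(([^)\\s]+)(?:\\s+\"[^\"]*\")?\\)");

    // Collects image URLs from the raw text of one parsed block, in order.
    static List<String> imageUrls(String blockText) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = INLINE_IMAGE.matcher(blockText);
        while (matcher.find()) {
            urls.add(matcher.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See ![logo](https://example.com/logo.png) and ![](img/a.png \"title\").";
        System.out.println(imageUrls(text));
    }
}
```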

github-actions bot added document Transform-v2 api labels Aug 24, 2025

joonseolee force-pushed the feature/rag-markdown-parser branch from 48f1343 to 47e523e Compare August 28, 2025 00:23

github-actions bot added connectors-v2 file and removed document api labels Aug 28, 2025

joonseolee force-pushed the feature/rag-markdown-parser branch from 47e523e to 37d009a Compare August 28, 2025 00:25

Hisoka-X reviewed Aug 28, 2025

joonseolee force-pushed the feature/rag-markdown-parser branch from 37d009a to 5437c32 Compare August 31, 2025 23:56

Hisoka-X reviewed Sep 1, 2025

Hisoka-X mentioned this pull request Sep 1, 2025

Support parse pdf to structured data (Parser + Normalization). #9716

Open

joonseolee force-pushed the feature/rag-markdown-parser branch from 5437c32 to 49bab53 Compare September 1, 2025 22:25

joonseolee force-pushed the feature/rag-markdown-parser branch 3 times, most recently from e0050e1 to 3fee7d2 Compare September 3, 2025 23:07

joonseolee force-pushed the feature/rag-markdown-parser branch from 3fee7d2 to 8b3c078 Compare September 5, 2025 05:17

Hisoka-X approved these changes Sep 5, 2025

github-actions bot added approved reviewed labels Sep 5, 2025

[Feature][File] Add markdown parser apache#9714

7a52620

joonseolee force-pushed the feature/rag-markdown-parser branch from 8b3c078 to 7a52620 Compare September 7, 2025 22:25

joonseolee changed the title ~~[Feature][API] Add markdown parser for RAG support #9714~~ [Feature][File] Add markdown parser for RAG support #9714 Sep 7, 2025

joonseolee mentioned this pull request Sep 7, 2025

[Feature][File] Add markdown parser documentation #9834

Open

3 tasks