joonseolee

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@joonseolee
Author

@Hisoka-X @iinow

I have studied this project as much as possible and added the RAG functionality, but since this is my first time working on it, there may be parts that I missed. If you could let me know about those parts, I will make sure to correct them immediately.

@Hisoka-X
Member

Hi @joonseolee , thanks for your PR.

However, this implementation may require some adjustments before it can be merged.

  1. We should add new format in
  2. We should use the file-series source to read the markdown file and parse it in the MarkdownReadStrategy added in step 1.

So all of the functionality belongs in the file source connector, not in a separate transform. Please refer to https://github.com/apache/seatunnel/blob/dev/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/MultipleTableFileSourceReader.java#L42

@joonseolee
Author

@Hisoka-X

Thank you for your comment! :))
So, is my understanding correct that instead of adding the new format (Markdown) to the seatunnel-transforms-v2 module, I should add it to the FileFormat enum and implement a corresponding ReadStrategy? And based on the chunkSize and overlap values, I can create a new class similar to MultipleTableFileSourceReader to convert it into structured data. Additionally, I should implement a new SourceSplit, for example, by creating something like RagFileSourceSplit.

cc @iinow

@Hisoka-X
Member

> I should add it to the FileFormat enum and implement a corresponding ReadStrategy?

Yes.

> And based on the chunkSize and overlap values, I can create a new class similar to MultipleTableFileSourceReader to convert it into structured data. Additionally, I should implement a new SourceSplit, for example, by creating something like RagFileSourceSplit.

No, the MultipleTableFileSourceReader and SourceSplit are shared by all FileFormat values.

MultipleTableFileSourceReader is used to read the file paths; ReadStrategy is used to parse the data in each file.
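
For readers following the thread, the division of labor described above can be sketched roughly as follows. This is a minimal illustration, not the SeaTunnel code: `SketchReadStrategy`, `SketchWriteStrategy`, `SketchFileFormat`, and `SketchFileSourceReader` are hypothetical, simplified stand-ins, and files are modeled as an in-memory map of path to content. The point it tries to show is that adding a format only adds an enum constant and a parsing strategy, while the reader that walks the files stays the same for every format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a per-format parser; not the real SeaTunnel ReadStrategy. */
interface SketchReadStrategy {
    List<String> read(String path, String content);
}

/** Hypothetical stand-in for a per-format writer. */
interface SketchWriteStrategy {
    void write(String row);
}

/** Hypothetical format enum: each constant decides how file content is parsed. */
enum SketchFileFormat {
    TEXT {
        @Override
        SketchReadStrategy getReadStrategy() {
            // One row per line.
            return (path, content) -> Arrays.asList(content.split("\n"));
        }
    },
    MARKDOWN {
        @Override
        SketchReadStrategy getReadStrategy() {
            // One row per block, where blocks are separated by blank lines.
            return (path, content) -> {
                List<String> blocks = new ArrayList<>();
                for (String block : content.split("\\n\\s*\\n")) {
                    if (!block.trim().isEmpty()) {
                        blocks.add(block.trim());
                    }
                }
                return blocks;
            };
        }

        @Override
        SketchWriteStrategy getWriteStrategy() {
            // Read-only format in this sketch.
            throw new UnsupportedOperationException(
                    "File format 'markdown' does not support writing.");
        }
    };

    abstract SketchReadStrategy getReadStrategy();

    SketchWriteStrategy getWriteStrategy() {
        return row -> { }; // no-op writer, enough for the sketch
    }
}

/** Shared by every format: walks the files and delegates parsing to the strategy. */
final class SketchFileSourceReader {
    static List<String> readAll(SketchFileFormat format, Map<String, String> files) {
        SketchReadStrategy strategy = format.getReadStrategy();
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            rows.addAll(strategy.read(file.getKey(), file.getValue()));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("docs/a.md", "# Title\n\nFirst paragraph.\n\nSecond paragraph.");
        readAll(SketchFileFormat.MARKDOWN, files).forEach(System.out::println);
    }
}
```

Under these assumptions, chunking by size and overlap would sit in or after the strategy, which is consistent with deferring it to the separate ticket.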

@joonseolee
Author

> > I should add it to the FileFormat enum and implement a corresponding ReadStrategy?
>
> Yes.

Ah, then for now, in this ticket, should I just proceed with the things mentioned above?
I’ll make sure that the chunk related functionality is discussed in ticket #9717.
And I’ll try to develop it as quickly as possible so that I can submit a PR soon.

cc @iinow

@Hisoka-X
Member

> Ah, then for now, in this ticket, should I just proceed with the things mentioned above?

Choose the way you like :)

@joonseolee joonseolee force-pushed the feature/rag-markdown-parser branch from 48f1343 to 47e523e Compare August 28, 2025 00:23
@joonseolee joonseolee force-pushed the feature/rag-markdown-parser branch from 47e523e to 37d009a Compare August 28, 2025 00:25
@joonseolee
Author

@Hisoka-X

For this ticket, I have added the markdown read strategy first, and I plan to focus on improvements in the split ticket #9717 next.
I have worked on this again. Could you please take a look when you have a moment?

@Override
public WriteStrategy getWriteStrategy(FileSinkConfig fileSinkConfig) {
throw new UnsupportedOperationException(
"File format 'markdown' does not support reading.");
Member

Suggested change
- "File format 'markdown' does not support reading.");
+ "File format 'markdown' does not support writing.");

Comment on lines 157 to 159
new String[]{"type", "value"},
new org.apache.seatunnel.api.table.type.SeaTunnelDataType[]{
BasicType.STRING_TYPE, BasicType.STRING_TYPE
Member

The row fields are too simple.
We should include:

  • element_id
  • element_type
  • heading_level
  • text
  • page_number
  • position_index
  • parent_id
  • child_ids

For example:

# Data Source Configuration

## Kafka Configuration
Users can configure Kafka by specifying the bootstrap.servers, topic, and group.id.

## MySQL Configuration
MySQL requires a JDBC URL, username, and password.

### Notes
Make sure to test the connection before deploying.

The rows should be:

{
  "element_id": "uuid-elem-1",
  "element_type": "heading",
  "heading_level": 1,
  "text": "Data Source Configuration",
  "page_number": null,
  "position_index": 0,
  "parent_id": null,
  "child_ids": ["uuid-elem-2", "uuid-elem-4"]
}
{
  "element_id": "uuid-elem-2",
  "element_type": "heading",
  "heading_level": 2,
  "text": "Kafka Configuration",
  "page_number": null,
  "position_index": 1,
  "parent_id": "uuid-elem-1",
  "child_ids": ["uuid-elem-3"]
}
{
  "element_id": "uuid-elem-3",
  "element_type": "paragraph",
  "heading_level": null,
  "text": "Users can configure Kafka by specifying the bootstrap.servers, topic, and group.id.",
  "page_number": null,
  "position_index": 2,
  "parent_id": "uuid-elem-2",
  "child_ids": []
}
{
  "element_id": "uuid-elem-4",
  "element_type": "heading",
  "heading_level": 2,
  "text": "MySQL Configuration",
  "page_number": null,
  "position_index": 3,
  "parent_id": "uuid-elem-1",
  "child_ids": ["uuid-elem-5", "uuid-elem-6"]
}
{
  "element_id": "uuid-elem-5",
  "element_type": "paragraph",
  "heading_level": null,
  "text": "MySQL requires a JDBC URL, username, and password.",
  "page_number": null,
  "position_index": 4,
  "parent_id": "uuid-elem-4",
  "child_ids": []
}
{
  "element_id": "uuid-elem-6",
  "element_type": "heading",
  "heading_level": 3,
  "text": "Notes",
  "page_number": null,
  "position_index": 5,
  "parent_id": "uuid-elem-4",
  "child_ids": ["uuid-elem-7"]
}
{
  "element_id": "uuid-elem-7",
  "element_type": "paragraph",
  "heading_level": null,
  "text": "Make sure to test the connection before deploying.",
  "page_number": null,
  "position_index": 6,
  "parent_id": "uuid-elem-6",
  "child_ids": []
}
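
One way the parent and child links in rows like these could be derived is with a stack of open headings: each new heading closes every heading of the same or deeper level, and whatever heading remains on top becomes the parent. The sketch below illustrates that idea only. It is not the PR's implementation (which parses with flexmark); `MarkdownElement` and `MarkdownElementSketch` are hypothetical names, the parser is a naive line-based one that treats every non-blank, non-heading line as its own paragraph, and the ids are sequential placeholders where a real implementation would presumably use UUIDs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One output row; field names mirror the fields requested in the review. */
final class MarkdownElement {
    final String elementId;
    final String elementType;    // "heading" or "paragraph" in this sketch
    final Integer headingLevel;  // null for non-headings
    final String text;
    final Integer pageNumber = null; // plain markdown has no pages
    final int positionIndex;
    final String parentId;       // null for top-level elements
    final List<String> childIds = new ArrayList<>();

    MarkdownElement(String elementId, String elementType, Integer headingLevel,
                    String text, int positionIndex, String parentId) {
        this.elementId = elementId;
        this.elementType = elementType;
        this.headingLevel = headingLevel;
        this.text = text;
        this.positionIndex = positionIndex;
        this.parentId = parentId;
    }
}

final class MarkdownElementSketch {
    /** Returns 1-6 for an ATX heading line ("# " .. "###### "), otherwise 0. */
    static int headingLevel(String line) {
        int n = 0;
        while (n < line.length() && line.charAt(n) == '#') {
            n++;
        }
        return (n >= 1 && n <= 6 && n < line.length() && line.charAt(n) == ' ') ? n : 0;
    }

    static List<MarkdownElement> parse(String markdown) {
        List<MarkdownElement> elements = new ArrayList<>();
        Deque<MarkdownElement> openHeadings = new ArrayDeque<>();
        for (String rawLine : markdown.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            int level = headingLevel(line);
            if (level > 0) {
                // A new heading closes all open headings at the same or a deeper level.
                while (!openHeadings.isEmpty() && openHeadings.peek().headingLevel >= level) {
                    openHeadings.pop();
                }
            }
            MarkdownElement parent = openHeadings.peek();
            // Placeholder id; a real reader would likely use UUID.randomUUID().
            String id = "uuid-elem-" + (elements.size() + 1);
            MarkdownElement element = new MarkdownElement(
                    id,
                    level > 0 ? "heading" : "paragraph",
                    level > 0 ? Integer.valueOf(level) : null,
                    level > 0 ? line.substring(level).trim() : line,
                    elements.size(),
                    parent == null ? null : parent.elementId);
            if (parent != null) {
                parent.childIds.add(id);
            }
            elements.add(element);
            if (level > 0) {
                openHeadings.push(element);
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        String md = "# Data Source Configuration\n\n## Kafka Configuration\nKafka text.\n";
        for (MarkdownElement e : parse(md)) {
            System.out.println(e.elementId + " parent=" + e.parentId + " children=" + e.childIds);
        }
    }
}
```

Because `child_ids` is filled in as later elements arrive, rows built this way can only be emitted once the whole file has been parsed.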

@joonseolee joonseolee force-pushed the feature/rag-markdown-parser branch from 37d009a to 5437c32 Compare August 31, 2025 23:56
@joonseolee
Author

@Hisoka-X

I'm sorry to bother you, but could you please check it again?

class MarkdownReadStrategyTest {

@Test
public void testReadMarkdown() throws Exception {
Member

Could you test all field values in one row?

Author

@Hisoka-X

Thank you! I have done it.


package org.apache.seatunnel.connectors.seatunnel.file.source.reader;

import com.vladsch.flexmark.ast.*;
Member

Please run mvn spotless:apply to fix code style.

Author

@Hisoka-X

I have done it all. Can you check it again?

@Hisoka-X
Member

Hisoka-X commented Sep 3, 2025

Hi @joonseolee. Thanks for the update! Please follow the guide to enable GitHub Actions on your fork repository. https://github.com/apache/seatunnel/pull/9760/checks?check_run_id=49358045346

@joonseolee joonseolee force-pushed the feature/rag-markdown-parser branch 3 times, most recently from e0050e1 to 3fee7d2 Compare September 3, 2025 23:07
@joonseolee
Author

@Hisoka-X

Can you check the below again?
I have passed all tests excluding unit-test (8, windows-latest).

It printed a message like this:

Write Count So Far        :                   0
Average Read Count        :                 0/s
Average Write Count       :                 0/s
Last Statistic Time       : 2025-09-04 00:39:30
Current Statistic Time    : 2025-09-04 00:40:30
***********************************************

org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
Terminate batch job (Y/N)? 
Error: The operation was canceled.

@Hisoka-X
Member

Hisoka-X commented Sep 4, 2025

Please re-trigger the failed CI. It may just be unstable.

@joonseolee joonseolee force-pushed the feature/rag-markdown-parser branch from 3fee7d2 to 8b3c078 Compare September 5, 2025 05:17
@joonseolee
Author

joonseolee commented Sep 5, 2025

@Hisoka-X

Finally! I have passed all CI!
Can you check it again?

@Hisoka-X
Member

Hisoka-X commented Sep 5, 2025

Hi @joonseolee . Could you update the docs with another PR?

@joonseolee
Author

@Hisoka-X

Got it. I will make a PR about that :)

@joonseolee joonseolee force-pushed the feature/rag-markdown-parser branch from 8b3c078 to 7a52620 Compare September 7, 2025 22:25
@joonseolee joonseolee changed the title from "[Feature][API] Add markdown parser for RAG support #9714" to "[Feature][File] Add markdown parser for RAG support #9714" Sep 7, 2025
@corgy-w
Contributor

corgy-w commented Sep 8, 2025

@joonseolee I will wait for the documentation PR to be completed and then merge the two together.

@zhangshenghang
Member

@joonseolee Can the image links in Markdown be parsed?

@joonseolee
Author

@corgy-w

I already modified the files and opened a PR, so are you saying that you are going to write the documentation instead of me?

@joonseolee
Author

@zhangshenghang

Of course, I also made it so that the image links can be extracted.

@zhangshenghang
Member

> @zhangshenghang
>
> Of course, I also made it so that the image links can be extracted.

@joonseolee However, I don't seem to have found any specialized logic for processing images.

@joonseolee
Author

@zhangshenghang

Sorry, I didn’t check it properly. To explain in more detail, as you saw in the code, I divided it based on Heading, Paragraph, ListItem, BulletList, OrderedList, BlockQuote, FencedCodeBlock, and TableBlock. Elements like image, bold, and italic are only defined as syntax within those blocks. It would be better to divide them in more detail, but for now I set it up so that the bigger sections are separated while image, bold, and italic retain their syntax. If you think my approach is lacking, I can update it right away.
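
Since inline images keep their markdown syntax inside the block text under this approach, a downstream step could still recover the links from that text. The following is a rough sketch of that idea and not part of the PR: `ImageLinkSketch` and `imageUrls` are hypothetical names, and the regular expression only covers inline images of the form `![alt](url)` or `![alt](url "title")`. Reference-style images and URLs containing spaces or parentheses are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls inline image URLs out of a block's raw markdown text. */
final class ImageLinkSketch {
    // Group 1 = alt text, group 2 = URL; an optional quoted title is skipped.
    private static final Pattern INLINE_IMAGE =
            Pattern.compile("!\\[([^\\]]*)\\]\\(\\s*([^\\s)]+)(?:\\s+\"[^\"]*\")?\\s*\\)");

    static List<String> imageUrls(String blockText) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = INLINE_IMAGE.matcher(blockText);
        while (matcher.find()) {
            urls.add(matcher.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See ![logo](https://example.com/logo.png) for the logo.";
        System.out.println(imageUrls(text));
    }
}
```

A parser-based alternative would be to visit image nodes in the markdown AST directly, which avoids the regex's blind spots at the cost of more code in the read strategy.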

4 participants