[Feature][File] Add markdown parser for RAG support #9714 #9760
base: dev
Hi @joonseolee, thanks for your PR. However, this implementation may require some adjustments before it can be merged.
So all functionality should live in the file source connector, not in a separate transform. Please refer to https://github.com/apache/seatunnel/blob/dev/seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/MultipleTableFileSourceReader.java#L42
Thank you for your comment! :)) cc @iinow
Yes.
No, MultipleTableFileSourceReader and SourceSplit are used by all file formats. MultipleTableFileSourceReader is used to read the file path, and ReadStrategy is used to parse the data in the file.
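The split described above (one format-agnostic reader that walks files, one format-specific strategy that parses them) can be sketched roughly as follows. This is only an illustration of the pattern: `ReaderStrategySketch`, `FormatStrategy`, `LineStrategy`, and `readAll` are hypothetical names, and the real SeaTunnel interfaces have different signatures.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical names throughout; this is not SeaTunnel's actual API.
class ReaderStrategySketch {

    /** Format-specific part: turns the content of one file into rows. */
    interface FormatStrategy {
        void read(String path, String content, Consumer<String> out);
    }

    /** Toy strategy: every non-blank line becomes one row, prefixed with its path. */
    static final class LineStrategy implements FormatStrategy {
        @Override
        public void read(String path, String content, Consumer<String> out) {
            for (String line : content.split("\n")) {
                if (!line.trim().isEmpty()) {
                    out.accept(path + ": " + line.trim());
                }
            }
        }
    }

    /** Format-agnostic part: iterates over the files and delegates parsing. */
    static List<String> readAll(Map<String, String> files, FormatStrategy strategy) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            strategy.read(file.getKey(), file.getValue(), rows::add);
        }
        return rows;
    }

    public static void main(String[] args) {
        // In-memory "files" keep the sketch self-contained.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.md", "# Title\n\nBody text\n");
        for (String row : readAll(files, new LineStrategy())) {
            System.out.println(row);
        }
    }
}
```

Under this arrangement, supporting a new format such as markdown only means adding another strategy; the reader loop stays unchanged.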
Ah, then for now, in this ticket, should I just proceed with the things mentioned above? cc @iinow
Choose the way you like :)
48f1343 to 47e523e
47e523e to 37d009a
@Override
public WriteStrategy getWriteStrategy(FileSinkConfig fileSinkConfig) {
    throw new UnsupportedOperationException(
            "File format 'markdown' does not support reading.");
Suggested change:
- "File format 'markdown' does not support reading.");
+ "File format 'markdown' does not support writing.");
new String[]{"type", "value"},
new org.apache.seatunnel.api.table.type.SeaTunnelDataType[]{
    BasicType.STRING_TYPE, BasicType.STRING_TYPE
The row fields are too simple.
Each row should contain:
- element_id
- element_type
- heading_level
- text
- page_number
- position_index
- parent_id
- child_ids
For example:
# Data Source Configuration
## Kafka Configuration
Users can configure Kafka by specifying the bootstrap.servers, topic, and group.id.
## MySQL Configuration
MySQL requires a JDBC URL, username, and password.
### Notes
Make sure to test the connection before deploying.
The rows should be:
{
"element_id": "uuid-elem-1",
"element_type": "heading",
"heading_level": 1,
"text": "Data Source Configuration",
"page_number": null,
"position_index": 0,
"parent_id": null,
"child_ids": ["uuid-elem-2","uuid-elem-4"]
}
{
"element_id": "uuid-elem-2",
"element_type": "heading",
"heading_level": 2,
"text": "Kafka Configuration",
"page_number": null,
"position_index": 1,
"parent_id": "uuid-elem-1",
"child_ids": ["uuid-elem-3"]
}
{
"element_id": "uuid-elem-3",
"element_type": "paragraph",
"heading_level": null,
"text": "Users can configure Kafka by specifying the bootstrap.servers, topic, and group.id.",
"page_number": null,
"position_index": 2,
"parent_id": "uuid-elem-2",
"child_ids": []
}
{
"element_id": "uuid-elem-4",
"element_type": "heading",
"heading_level": 2,
"text": "MySQL Configuration",
"page_number": null,
"position_index": 3,
"parent_id": "uuid-elem-1",
"child_ids": ["uuid-elem-5","uuid-elem-6"]
}
{
"element_id": "uuid-elem-5",
"element_type": "paragraph",
"heading_level": null,
"text": "MySQL requires a JDBC URL, username, and password.",
"page_number": null,
"position_index": 4,
"parent_id": "uuid-elem-4",
"child_ids": []
}
{
"element_id": "uuid-elem-6",
"element_type": "heading",
"heading_level": 3,
"text": "Notes",
"page_number": null,
"position_index": 5,
"parent_id": "uuid-elem-4",
"child_ids": ["uuid-elem-7"]
}
{
"element_id": "uuid-elem-7",
"element_type": "paragraph",
"heading_level": null,
"text": "Make sure to test the connection before deploying.",
"page_number": null,
"position_index": 6,
"parent_id": "uuid-elem-6",
"child_ids": []
}
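The parent/child structure in the rows above can be derived with a stack of open headings. The sketch below is a simplified, line-based illustration with hypothetical names (`MarkdownElementSketch`, `Element`, `parse`); it is not the flexmark-based code in this PR. It assumes ATX headings (`#` to `######`), treats every other non-blank line as its own paragraph, uses sequential IDs instead of UUIDs, and omits `page_number` since it is always null in the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only; names and parsing rules are assumptions.
class MarkdownElementSketch {

    static final class Element {
        final String elementId;
        final String elementType;   // "heading" or "paragraph"
        final Integer headingLevel; // null for paragraphs
        final String text;
        final int positionIndex;
        String parentId;            // null for top-level elements
        final List<String> childIds = new ArrayList<>();

        Element(String id, String type, Integer level, String text, int position) {
            this.elementId = id;
            this.elementType = type;
            this.headingLevel = level;
            this.text = text;
            this.positionIndex = position;
        }
    }

    static List<Element> parse(String markdown) {
        List<Element> rows = new ArrayList<>();
        Deque<Element> openHeadings = new ArrayDeque<>(); // innermost heading on top
        for (String raw : markdown.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            int level = headingLevel(line);
            boolean isHeading = level > 0;
            Element element = new Element(
                    "elem-" + (rows.size() + 1),
                    isHeading ? "heading" : "paragraph",
                    isHeading ? Integer.valueOf(level) : null,
                    isHeading ? line.substring(level).trim() : line,
                    rows.size());
            if (isHeading) {
                // A new heading closes every open heading at the same or a deeper level.
                while (!openHeadings.isEmpty() && openHeadings.peek().headingLevel >= level) {
                    openHeadings.pop();
                }
            }
            Element parent = openHeadings.peek(); // null when no heading is open
            if (parent != null) {
                element.parentId = parent.elementId;
                parent.childIds.add(element.elementId);
            }
            if (isHeading) {
                openHeadings.push(element);
            }
            rows.add(element);
        }
        return rows;
    }

    /** Returns 1..6 for an ATX heading line such as "## Title", otherwise 0. */
    static int headingLevel(String line) {
        int n = 0;
        while (n < line.length() && line.charAt(n) == '#') {
            n++;
        }
        boolean isHeading = n >= 1 && n <= 6 && n < line.length() && line.charAt(n) == ' ';
        return isHeading ? n : 0;
    }

    public static void main(String[] args) {
        String md = "# Data Source Configuration\n## Kafka Configuration\nKafka text.\n";
        for (Element e : parse(md)) {
            System.out.println(e.positionIndex + " " + e.elementType + " parent=" + e.parentId
                    + " children=" + e.childIds);
        }
    }
}
```

Keeping only the currently open headings on the stack means each element finds its parent in constant time, and `position_index` is simply the order in which rows are emitted.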
37d009a to 5437c32
I'm sorry to bother you, but could you please check it again?
class MarkdownReadStrategyTest {

    @Test
    public void testReadMarkdown() throws Exception {
Could you test all field values in one row?
Thank you! I have done it.
package org.apache.seatunnel.connectors.seatunnel.file.source.reader;

import com.vladsch.flexmark.ast.*;
Please run mvn spotless:apply to fix the code style.
I have done it all. Can you check it again?
5437c32 to 49bab53
Hi @joonseolee. Thanks for the update! Please follow the guide to enable GitHub Actions on your fork repository. https://github.com/apache/seatunnel/pull/9760/checks?check_run_id=49358045346
e0050e1 to 3fee7d2
Can you check the output below again? It printed the following message:
Write Count So Far : 0
Average Read Count : 0/s
Average Write Count : 0/s
Last Statistic Time : 2025-09-04 00:39:30
Current Statistic Time : 2025-09-04 00:40:30
***********************************************
org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7767ea8d
(the line above appears 23 times in total)
Terminate batch job (Y/N)?
Error: The operation was canceled.
Please re-trigger the failed CI. It may just be unstable.
3fee7d2 to 8b3c078
Finally! All CI checks have passed!
Hi @joonseolee. Could you update the docs in another PR?
Got it. I will make a PR about that :)
8b3c078 to 7a52620
@joonseolee I will wait for the documentation PR to be completed and then merge them together.
@joonseolee Can the image links in Markdown be parsed?
I already modified the files and opened a PR, so are you saying that you are going to write the documentation instead of me?
Of course, I also made it so that the image links can be extracted.
@joonseolee However, I could not find any dedicated logic for processing images.
Sorry, I didn’t check it properly. To explain in more detail: as you saw in the code, I split the document by Heading, Paragraph, ListItem, BulletList, OrderedList, BlockQuote, FencedCodeBlock, and TableBlock. Elements like image, bold, and italic are only kept as syntax within those blocks. It would be better to split them in more detail, but for now I set it up so that the larger sections are separated while image, bold, and italic retain their syntax. If you think my approach is lacking, I can update it right away.
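Since images currently stay as inline syntax inside the block text, a downstream step could still pull the links out of each row's text. A minimal sketch, assuming a regular expression is good enough for inline images of the form `![alt](url)`; `ImageLinkSketch` and `imageUrls` are hypothetical names that are not part of this PR, and reference-style images or URLs containing parentheses are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, shown only to illustrate post-processing of the text field.
class ImageLinkSketch {

    // Matches ![alt](url) and ![alt](url "title"); group 2 captures the URL.
    private static final Pattern IMAGE =
            Pattern.compile("!\\[([^\\]]*)\\]\\(\\s*([^)\\s]+)[^)]*\\)");

    /** Returns the image URLs found in a block's text, in order of appearance. */
    static List<String> imageUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMAGE.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String paragraph = "See ![logo](https://example.com/logo.png \"Logo\") and ![x](a.png).";
        System.out.println(imageUrls(paragraph));
    }
}
```

A parser-level alternative would be to visit image nodes in the AST and emit them as their own rows, which avoids the regex limitations at the cost of a more detailed row model.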