How to Integrate Amazon Textract with S3 and EC2 for Async Data Extraction and RAG Application Update #14480

crman · 2024-07-01T05:56:45Z

Hello everyone,

I am working on a Retrieval-Augmented Generation (RAG) application for tabular data extraction from PDF documents. Here is my current setup:

Tabular Data Extraction:

Using Amazon Textract to extract tabular information from PDF documents.
Converting the extracted data into CSV files and saving them in a directory.

Loading and Orchestration:

Loading these CSV files using an orchestration framework such as LlamaIndex.

I now want to enhance my setup by using Amazon Textract's asynchronous capabilities to extract tables and store the extracted CSV files in an S3 bucket. My goal is to run my RAG application on an EC2 instance and sync the local directory with the S3 bucket. This way, whenever new files are added to the S3 bucket, they are also updated in my EC2 local directory. This update should trigger my application to update the vector store, ensuring that I get the latest responses based on new documents.

Specific Requirements:

Use Amazon Textract's async feature to extract tables and store them in an S3 bucket.
Sync the S3 bucket with a local directory on an EC2 instance.
Automatically trigger the application to update the vector store whenever new files are added to the bucket.

I am able to refresh the vector store based on the arrival of new documents. However, I need guidance on whether there is any specific implementation available that can synchronize the S3 bucket with the local EC2 directory effectively.

Any suggestions or guidance would be greatly appreciated.

@dosu · 2024-07-01T06:08:04Z

To integrate Amazon Textract with S3 and EC2 for asynchronous data extraction and ensure your RAG application updates accordingly, you can follow these steps:

Amazon Textract Async Processing:

Use Amazon Textract's asynchronous API to process PDF documents and extract tabular data. Store the results in an S3 bucket.
Here is an example of how to start an asynchronous Textract job and check its status:

import time

import boto3

textract = boto3.client('textract')

# Start the asynchronous job
response = textract.start_document_analysis(
    DocumentLocation={'S3Object': {'Bucket': 'your-bucket', 'Name': 'your-document.pdf'}},
    FeatureTypes=['TABLES']
)
job_id = response['JobId']

# Check the job status
response = textract.get_document_analysis(JobId=job_id)
status = response['JobStatus']
while status == 'IN_PROGRESS':
    time.sleep(5)
    response = textract.get_document_analysis(JobId=job_id)
    status = response['JobStatus']

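if status == 'SUCCEEDED':
    # Process the response and save the extracted data to S3.
    # Results can span several pages; follow response['NextToken'] to read them all.
    # ...
    pass
elif status == 'FAILED':
    raise RuntimeError(f"Textract job {job_id} failed")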
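
The "process the response" step could look roughly like the sketch below. It assumes the standard Textract block layout (TABLE blocks whose CHILD relationships point to CELL blocks, which in turn point to WORD blocks) and ignores merged cells and selection elements. The helper names `collect_blocks` and `tables_to_csv` are illustrative and not part of boto3.

```python
import csv
import io


def collect_blocks(textract_client, job_id):
    """Gather Blocks from every result page of a finished analysis job."""
    blocks, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = textract_client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks


def _child_ids(block):
    """Return the IDs listed in a block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def tables_to_csv(blocks):
    """Convert each TABLE block into one CSV string (merged cells are ignored)."""
    by_id = {b["Id"]: b for b in blocks}
    csv_tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in _child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [by_id[i]["Text"] for i in _child_ids(cell)
                     if by_id[i]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        # Textract row and column indices start at 1
        for r in range(1, n_rows + 1):
            writer.writerow([cells.get((r, c), "") for c in range(1, n_cols + 1)])
        csv_tables.append(buffer.getvalue())
    return csv_tables
```

Each returned string can then be written to the bucket, for example with `boto3.client('s3').put_object(Bucket='your-bucket', Key='tables/your-document-0.csv', Body=csv_text.encode('utf-8'))`, where the key naming is only a placeholder.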

Sync S3 Bucket with EC2 Local Directory:
- Use AWS CLI or a tool like s3fs to sync the S3 bucket with a local directory on your EC2 instance. You can set up a cron job or a systemd service to periodically sync the S3 bucket with the local directory.
```
# Install AWS CLI
sudo apt-get install awscli

# Sync S3 bucket with local directory
aws s3 sync s3://your-bucket /path/to/local/directory
```
- Alternatively, you can use s3fs to mount the S3 bucket as a local file system:
```
# Install s3fs
sudo apt-get install s3fs

# Mount S3 bucket
s3fs your-bucket /path/to/mountpoint -o iam_role=auto
```
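
If you would rather drive the sync from Python than from cron, a one-way download can be sketched with boto3 as below. This is a minimal illustration, not a replacement for `aws s3 sync`: it compares object size only (the CLI also compares timestamps), never deletes local files, and the function name `sync_bucket_to_dir` is made up for this example.

```python
import os


def sync_bucket_to_dir(s3_client, bucket, local_dir, prefix=""):
    """Download objects that are missing locally or whose size differs.

    Returns the list of local paths that were (re)downloaded.
    """
    downloaded = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                # Skip zero-byte "folder" placeholder objects
                continue
            target = os.path.join(local_dir, *key.split("/"))
            if os.path.exists(target) and os.path.getsize(target) == obj["Size"]:
                # Same size locally: assume unchanged (a deliberately crude check)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3_client.download_file(bucket, key, target)
            downloaded.append(target)
    return downloaded
```

Calling `sync_bucket_to_dir(boto3.client('s3'), 'your-bucket', '/path/to/local/directory')` on a timer gives you the list of new or changed files directly, which can be handed to the re-indexing step without watching the file system at all.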

Trigger Application to Update Vector Store:

Use a file system watcher like watchdog in Python to monitor the local directory for changes and trigger the re-indexing process in your application.

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
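# Assumes llama-index >= 0.10, where the core classes live under llama_index.core,
# and that an embedding model is configured (the default needs an OpenAI API key)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

class Watcher:
    def __init__(self, directory_to_watch, index):
        self.DIRECTORY_TO_WATCH = directory_to_watch
        self.index = index
        self.observer = Observer()

    def run(self):
        event_handler = Handler(self.index)
        self.observer.schedule(event_handler, self.DIRECTORY_TO_WATCH, recursive=True)
        self.observer.start()
        try:
            while True:
                time.sleep(5)
        except KeyboardInterrupt:
            self.observer.stop()
        self.observer.join()

class Handler(FileSystemEventHandler):
    def __init__(self, index):
        self.index = index

    def on_modified(self, event):
        if event.is_directory:
            return
        # Reload the changed file. filename_as_id=True keeps document IDs stable,
        # so refresh_ref_docs updates the existing entry instead of duplicating it.
        docs = SimpleDirectoryReader(
            input_files=[event.src_path], filename_as_id=True
        ).load_data()
        self.index.refresh_ref_docs(docs)

    # Also index files that appear for the first time
    on_created = on_modified

if __name__ == '__main__':
    directory = "/path/to/local/directory"
    # Build the initial index from the files already in the directory
    docs = SimpleDirectoryReader(directory, filename_as_id=True).load_data()
    index = VectorStoreIndex.from_documents(docs)
    w = Watcher(directory, index)
    w.run()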
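
In case file-system events prove unreliable on your setup (for example on a mounted bucket rather than a synced directory), change detection can also be done by polling. The sketch below uses only the standard library; `snapshot` and `changed_paths` are hypothetical helper names, and comparing modification time plus size is an assumption about what counts as "changed".

```python
import os


def snapshot(directory):
    """Map each file path under directory to its (mtime_ns, size) signature."""
    state = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            state[path] = (info.st_mtime_ns, info.st_size)
    return state


def changed_paths(before, after):
    """Return paths that are new in `after` or whose signature differs."""
    return sorted(p for p, sig in after.items() if before.get(p) != sig)
```

Taking a snapshot after each sync and diffing it against the previous one yields the files to re-index, at the cost of a directory walk per poll.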

Taken together, these three steps keep the local directory on the EC2 instance in step with the S3 bucket (within the sync interval you choose) and refresh the vector store whenever a synced file is added or changed.
