An open-source file conversion webapp built with NextJs, Python, and AWS, which provides the HTTP API, Lambda functions, and S3 object storage.
Converts .docx files to .pdf.
- Website
  - NextJs App Router
  - Amazon Web Services for backend functionality
  - Support for HTTP API, S3 File Storage, and Lambda functions
  - Edge runtime-ready
- AWS Infrastructure
  - Amazon S3 allows for object storage and static site hosting
  - API Gateway hosts the HTTP API
  - AWS Lambda for processing JSON and filtering required data
  - Amazon EC2 for provisioning VM instances
- A static site is hosted on S3 with a document upload form. We use API Gateway to create an API which makes a `GET` request to a Lambda function after the user clicks Upload File on the form.
- The API sends a presigned bucket URL for the `uploads-bucket`. The site then automatically conducts a `PUT` request to the same bucket with the `.docx` file data.
- Another Lambda function is configured to listen for PUT Object events in the S3 `uploads-bucket`. It parses the event record for the file name and sends a `POST` request to the Python Flask app that performs the document conversion.
- An EC2 instance is deployed with an Ubuntu OS image. A Python script is set up to run as a background process.
- The Python microservice converts documents using the pandoc package and is exposed as an API using Flask, listening for `POST` requests on a specified port.
- It downloads and saves the specified file with its ID, then uploads the converted file to the `output-bucket` on S3. The static site returns the download link for the converted file from the `output-bucket`.
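The event-parsing step of the trigger Lambda can be sketched as follows. This is a minimal sketch, not the project's actual trigger_converter.py: the function name `extract_uploaded_objects` and the sample key are assumptions made for illustration, while the `Records` / `s3` / `bucket` / `object` layout follows the standard S3 event notification format.

```python
from urllib.parse import unquote_plus


def extract_uploaded_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # S3 URL-encodes object keys in event records (spaces arrive as "+").
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


# Trimmed-down sample of a PUT Object event (hypothetical file name).
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "uploads-bucket"},
                "object": {"key": "my+report.docx"},
            },
        }
    ]
}

print(extract_uploaded_objects(sample_event))
# prints [('uploads-bucket', 'my report.docx')]
```

Decoding the key matters because a file uploaded as `my report.docx` would otherwise be requested from S3 under the wrong name.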
The frontend of the app is hosted as a static site in a separate S3 bucket.
Note
To learn more about the S3 static site and how to deploy it, visit frontend/README.md.
The HTTP API is hosted on AWS using API Gateway and a Lambda function, which deploys the getPresignedURL.js app. The source code for the Lambda function is in lambda/presignedURL.js.
Note
To learn more about the getPresignedURL.js app and how to deploy it, visit lambda/README.md.
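For readers who want to see the idea behind the presigned URL endpoint, here is a rough Python sketch. The project's own Lambda is written in JavaScript, so this is an illustration only: `build_presigned_response`, the random key scheme, the 300-second expiry, and the `uploadURL` / `key` response fields are all assumptions. The only library call used is boto3's `generate_presigned_url`, and the client is passed in so the sketch can be read and run without AWS credentials.

```python
import json
import uuid


def build_presigned_response(s3_client, bucket="uploads-bucket", expires_in=300):
    """Return an API Gateway proxy response carrying a presigned PUT URL."""
    # Hypothetical key scheme: a random ID plus the .docx extension.
    key = f"{uuid.uuid4().hex}.docx"
    url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"uploadURL": url, "key": key}),
    }


# Inside Lambda, the client would come from boto3:
#   import boto3
#   def lambda_handler(event, context):
#       return build_presigned_response(boto3.client("s3"))
```

The site would then read the URL from the response body and send the file to it with a `PUT` request.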
- Create an EC2 t2.micro instance with an Ubuntu Linux AMI and note the VM's public IPv4 address.
- Assign an IAM role to the EC2 instance with the AmazonS3FullAccess policy attached.
- Run the Flask development server within the VM:
Before installing, check that the correct Python version is present via python3 -V
sudo apt update && sudo apt upgrade
sudo apt install pandoc texlive python3.10-venv
python3 -m venv venv
source venv/bin/activate
pip install pypandoc boto3 flask
mkdir inputs outputs
touch app.py
Copy the contents of app.py into the new file by opening it with any code editor (nano, vim, etc.).
sudo su
nohup python3 app.py > log.txt 2>&1 &
- The Flask app should now be able to handle requests 24/7. It runs as a background process under the nohup command, which keeps the application up for as long as the VM is running, even after we exit the remote shell.
- The logs, along with stdout and stderr, are saved to log.txt in the same directory.
- The trailing & runs the command in the background; the shell then prints the process ID of the Python process, which can be noted down to run kill <PID> if the process needs to be stopped.
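The conversion step inside the microservice might look roughly like the sketch below. It is not the project's app.py: the project uses the pypandoc package, whereas this sketch calls the pandoc command-line tool directly through `subprocess` so that it only needs the standard library. The function names are hypothetical; the `inputs` and `outputs` directories match the ones created above.

```python
import subprocess
from pathlib import Path


def output_path_for(input_path, output_dir="outputs"):
    """Map inputs/<name>.docx to outputs/<name>.pdf."""
    return Path(output_dir) / (Path(input_path).stem + ".pdf")


def build_pandoc_command(input_path, output_path):
    """Build the pandoc CLI call; pandoc infers formats from the extensions."""
    return ["pandoc", str(input_path), "-o", str(output_path)]


def convert_docx_to_pdf(input_path, output_dir="outputs"):
    """Run pandoc and return the path of the generated PDF."""
    output_path = output_path_for(input_path, output_dir)
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(build_pandoc_command(input_path, output_path), check=True)
    return output_path
```

PDF output from pandoc relies on a LaTeX engine, which is why texlive is installed alongside pandoc in the steps above.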
The Flask app should now be running on: http://{ec2-instance-public-ipv4-address}:5000
Use this address as the API endpoint URL in the trigger_converter.py Lambda function so that the .docx files uploaded to S3 are sent to the Flask microservice for conversion.
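As an illustration of how the Lambda function could address the microservice, the sketch below builds the `POST` request with the standard library. The payload field `filename`, the default path `/`, and the function name are assumptions, since the actual request format is defined in trigger_converter.py and app.py. The IP address is a documentation placeholder.

```python
import json
import urllib.request


def build_convert_request(public_ip, key, port=5000, path="/"):
    """Build a POST request that asks the Flask app to convert one file."""
    url = f"http://{public_ip}:{port}{path}"
    # Hypothetical payload: the object key of the uploaded .docx file.
    payload = json.dumps({"filename": key}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_convert_request("203.0.113.10", "report.docx")
print(request.get_method(), request.full_url)
# prints POST http://203.0.113.10:5000/

# Sending it would be: urllib.request.urlopen(request, timeout=30)
```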
Warning
This command only starts the webapp. For full functionality, you will also need to configure the instance's Security Group on AWS to allow inbound TCP connections to port 5000 of the EC2 instance from any external IPv4 address (0.0.0.0/0).
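One quick way to confirm that the Security Group rule took effect is to try opening a TCP connection to the port from outside the VM. The helper below is a small sketch using only the standard library; the function name and timeout are arbitrary choices.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `is_port_open("<ec2-instance-public-ipv4-address>", 5000)` from your own machine should return `True` once the Flask app is running and the inbound rule is in place.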
Note
Follow the above steps for the PNG and CSV converter microservices in similar fashion, in separate directories, and expose them on different ports.
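With several converters on one VM, the caller needs a way to pick the right port for each file. A minimal sketch of such a lookup is shown below. Only port 5000 for the .docx converter comes from this README; ports 5001 and 5002, the mapping name, and the function name are hypothetical.

```python
from pathlib import PurePosixPath

# Port 5000 is the .docx converter described above; 5001 and 5002 are
# hypothetical ports for the PNG and CSV converters.
CONVERTER_PORTS = {".docx": 5000, ".png": 5001, ".csv": 5002}


def converter_url_for(key, public_ip):
    """Pick the converter base URL from the uploaded file's extension."""
    suffix = PurePosixPath(key).suffix.lower()
    try:
        port = CONVERTER_PORTS[suffix]
    except KeyError:
        raise ValueError(f"no converter configured for {suffix!r}") from None
    return f"http://{public_ip}:{port}"
```

Lower-casing the extension means that `photo.PNG` and `photo.png` are routed to the same converter.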
Tip
If the webapp demo videos below aren't loading in the README, please visit YouTube.
- site.mp4: DOCX to PDF Conversion
- image.mp4: PNG to PDF Conversion
- S3 uploads-bucket for .docx files
- S3 output-bucket for .pdf files
- Flask App process running in EC2
This project was created by MLSA KIIT for the Cloud Computing Domain's Project Wing:
- Sourasish Basu (@SourasishBasu) - MLSA KIIT
| Version | Date | Comments |
|---|---|---|
| 1.0 | Jan 24th, 2024 | Initial release |
Website/API
- File Validation and Sanitization on server side
- Better PDF conversion engine to retain original formatting in higher quality
- Better Error Handling
AWS Infrastructure
- Actual implementation in production
- Conversion feature between multiple file types
- Implementing image compression using methods such as Huffman Encoding