Service that listens for incoming messages from SQS. On receipt of a message this service will download reference PDF, extract ALTO file per-page and upload to specified S3 bucket.
If COMPLETED_TOPIC_ARN
env var specified a notification will be raised.
The incoming message is in the shape:
{
"pdfLocation": "https://www.hq.nasa.gov/alsj/a17/A17_FlightPlan.pdf",
"pdfIdentifier": "a17_flightplan",
"outputLocation": "s3://pdf-to-alto/a17_flightplan_alto"
}
Where
pdfLocation
- the URL where PDF can be downloaded from.pdfIdentifier
- unique identifier for PDF. wherei
is 0-based page index. If omitted a random uuid will be used.outputLocation
- s3 location where final ALTO files will be output. With or without precedings3://
and no trailing/
.
(See sample.json)
The completed notification message echos back the original message with "numberOfFiles"
property added.
The generated alto file will be placed in outputLocation
. The format of each file will depend on value of PREPEND_ID
envvar. This format will be (where i
is the page number):
- If true:
f"{pdfIdentifier}-{i:04d}.xml"
- else
f"{i:04d}.xml"
,
This is a Python script that utilises the following libraries:
- pdfalto - C lib used to generate ALTO files.
- PyMuPDF - Python lib used to query PDF object for page dimensions.
- lxml - Used to parse ALTO files and update scaled values.
- requests - Used to download PDF files.
The following environment variables can be used to configure the app:
Env Var | Description | Default |
---|---|---|
DOWNLOAD_CHUNK_SIZE | Chunk size for downloading PDF | 2048 |
WORKING_FOLDER | Local working folder for storing generated files | ./work |
REMOVE_WORK_DIR | Whether to clean up working dir on completion | True |
RESCALE_ALTO | Whether to rescale generated ALTO to page | True |
MONITOR_SLEEP_SECS | How long to sleep long polling operations if no messages received | 30 |
AWS_REGION | AWS region being used | eu-west-1 |
INCOMING_QUEUE | The name of the SQS queue to monitor for incoming messages. Required | |
COMPLETED_TOPIC_ARN | The ARN of a topic to post completion notifications to | |
LOCALSTACK | If using LocalStack | False |
LOCALSTACK_ADDRESS | Address for LocalStack instance | http://localhost:4566 |
(See .env.dist for sample .env file)
There is a multi-stage Dockerfile that builds the pdfalto
binary and copies it to a new stage.
docker-compose.yml will build and start the main Python app alongside a LocalStack instance.
# build and start image using LocalStack
docker-compose up
# build image
docker build --tag pdf-to-alto:local .
# run docker image and listen to queue
docker run --env-file .env -it --rm --name pdf-to-alto pdf-to-alto:local
# run docker image to process 1 single api
docker run -it --rm --name pdf-to-alto \
pdf-to-alto:local \
opt/app/app/pdf_processor.py https://text.example/test.pdf my-pdf-identifier s3://pdf-bucket/alto
Note: Building pdfalto from source takes a few minutes
The docker-compose.local.yml
file will spin up a LocalStack instance and configure a few
resource for local testing:
- An S3 bucket titled "pdf-to-alto"
- An SNS topic "incoming-topic" with an SQS subscription to "incoming"
- An SNS topic "completed-topic" with an SQS subscription to "completed"
docker-compose -f docker-compose.local.yml up
To use LocalStack set the LOCALSTACK
and LOCALSTACK_ADDRESS
env vars (see above).
When using the aws-cli with LocalStack the --endpoint-url
needs to be specified. Below are some handy commands to use
when testing:
# raise sample notification using sample.json
aws --endpoint-url=http://localhost:4566 sns publish --topic-arn arn:aws:sns:eu-west-1:000000000000:incoming-topic --message file://sample.json --region eu-west-1
# clear incoming queue
aws --endpoint-url=http://localhost:4566 sqs purge-queue --queue-url "http://localstack:4566/000000000000/incoming" --region eu-west-1
# check number of 'completed' notifications raised
aws --endpoint-url=http://localhost:4566 sqs get-queue-attributes --queue-url "http://localstack:4566/000000000000/completed" --attribute-names All --region eu-west-1
# check contents of s3
aws --endpoint-url=http://localhost:4566 s3 ls pdf-to-alto --recursive --region eu-west-1