
DLCS - Composite Handler

The DLCS Composite Handler is an implementation of DLCS RFC011.

About

The component is written in Python and utilises Django, along with a number of extensions, including Django Q (task queue management) and django-environ (configuration).

Additionally, the project uses Poppler (via pdf2image) for PDF rasterization, Boto3 for S3 access, and gunicorn (fronted by nginx) to serve the API.

Getting Started

The project ships with a docker-compose.yml that can be used to get a local version of the component running:

docker compose up

Note that for the Composite Handler to be able to interact with the target S3 bucket, the Docker Compose configuration assumes that the AWS_PROFILE environment variable has been set and a valid AWS session is available.
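For example, assuming a named profile has been configured via the AWS CLI (the profile name dlcs below is a placeholder), it can be supplied inline:

AWS_PROFILE=dlcs docker compose up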

This will create a PostgreSQL instance, bootstrap it with the required tables, deploy a single instance of the API, and deploy three instances of the engine. Requests can then be targeted at localhost:8000.

The component can also be run directly, either in an IDE or from the CLI. The component must first be configured either via the creation of a .env file (see .env.dist for an example configuration), or via a set of environment variables (see the Configuration section).
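As an illustrative sketch, a minimal .env for local development might look like the following - the values shown are examples only, and the full set of variables is described in the Configuration section:

DJANGO_DEBUG=True
DJANGO_SECRET_KEY=<randomly generated 50-character string>
DATABASE_URL=postgresql://dlcs:password@localhost:5432/compositedb
CACHE_URL=dbcache://app_cache
SCRATCH_DIRECTORY=/tmp/scratch
DLCS_API_ROOT=https://api.dlcs.digirati.io
DLCS_S3_BUCKET_NAME=dlcs-composite-images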

Once configuration is in place, the following commands will start the API and/or the engine:

  • API: python manage.py runserver 0.0.0.0:8000
  • Engine: python manage.py qcluster

Should the required tables not exist in the target database, the following commands should be run first:

python manage.py migrate
python manage.py createcachetable

Once the API is running, an administrator interface can be accessed via the browser at http://localhost:8000/admin. To create an administrator login, run the following command:

python manage.py createsuperuser

The administrator user can be used to browse the database and manage the queue (including deleting tasks and resubmitting failed tasks into the queue).

Entrypoints

Three entrypoint scripts are provided to make the above easier:

  • entrypoint.sh - waits for Postgres to be available, then runs manage.py migrate and manage.py createcachetable if MIGRATE=True. It runs manage.py createsuperuser if INIT_SUPERUSER=True (which also requires the DJANGO_SUPERUSER_* envvars to be set - see the example below)
  • entrypoint-api.sh - runs the above, then starts an nginx instance fronting a gunicorn process
  • entrypoint-worker.sh - runs the above, then runs python manage.py qcluster
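For example, the API entrypoint might be started with migrations and superuser creation enabled - the superuser values below are placeholders:

MIGRATE=True \
INIT_SUPERUSER=True \
DJANGO_SUPERUSER_USERNAME=admin \
DJANGO_SUPERUSER_EMAIL=admin@example.com \
DJANGO_SUPERUSER_PASSWORD=changeme \
./entrypoint-api.sh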

Configuration

The following environment variables are supported:

| Environment Variable | Default Value | Component(s) | Description |
| --- | --- | --- | --- |
| DJANGO_DEBUG | True | API, Engine | Whether Django should run in debug mode. Useful for development purposes, but should be set to False in production. |
| DJANGO_SECRET_KEY | None | API, Engine | The secret key used by Django when generating sensitive tokens. This should be a randomly generated 50-character string. |
| SCRATCH_DIRECTORY | /tmp/scratch | Engine | A locally accessible filesystem path where work-in-progress files are written during rasterization. |
| WEB_SERVER_SCHEME | http | API | The HTTP scheme used when generating URIs. |
| WEB_SERVER_HOSTNAME | localhost:8000 | API | The hostname (and optional port) used when generating URIs. |
| ORIGIN_CHUNK_SIZE | 8192 | Engine | The chunk size, in bytes, used when retrieving objects from origins. Tuning this value can, in theory, improve download speeds. |
| DATABASE_URL | None | API, Engine | The URL of the target PostgreSQL database, in a format acceptable to django-environ, e.g. postgresql://dlcs:password@postgres:5432/compositedb. |
| CACHE_URL | None | API, Engine | The URL of the target cache, in a format acceptable to django-environ, e.g. dbcache://app_cache. |
| PDF_RASTERIZER_THREAD_COUNT | 3 | Engine | The number of concurrent Poppler threads spawned when a worker is rasterizing a PDF. Each thread typically consumes 100% of a CPU core. |
| PDF_RASTERIZER_DPI | 500 | Engine | The DPI of images generated during rasterization. For JPEGs, the default value of 500 typically produces images approximately 1.5MiB to 2MiB in size. |
| PDF_RASTERIZER_FALLBACK_DPI | 200 | Engine | The fallback DPI used for pages that exceed the pdftoppm memory limit and produce a 1x1 pixel image (see Belval/pdf2image#34). |
| PDF_RASTERIZER_FORMAT | jpg | Engine | The format of the rasterized images. Supported values are ppm, jpeg / jpg, png and tiff. |
| PDF_RASTERIZER_MAX_LENGTH | 0 | Engine | Optional; the maximum length, in pixels, of the longest edge of a saved image. If a rasterized image exceeds this, it is resized, maintaining aspect ratio. |
| DLCS_API_ROOT | https://api.dlcs.digirati.io | Engine | The root URI of the API of the target DLCS deployment, without the trailing slash. |
| DLCS_S3_BUCKET_NAME | dlcs-composite-images | Engine | The S3 bucket that the Composite Handler will push rasterized images to, for consumption by the wider DLCS. Both the Composite Handler and the DLCS must have access to this bucket. |
| DLCS_S3_OBJECT_KEY_PREFIX | composites | Engine | The S3 key prefix used when pushing images to the DLCS_S3_BUCKET_NAME - in other words, the folder within the S3 bucket into which images are stored. |
| DLCS_S3_UPLOAD_THREADS | 8 | Engine | The number of concurrent threads used when pushing images to the S3 bucket. More threads significantly lower the time spent pushing images to S3, but too high a value will cause issues with Boto3. The default of 8 is a tested and sensible value. |
| ENGINE_WORKER_COUNT | 2 | Engine | The number of workers a single instance of the engine will spawn. Each worker handles the processing of a single PDF, so the total number of PDFs that can be processed concurrently is engine_count * worker_count. |
| ENGINE_WORKER_TIMEOUT | 3600 | Engine | The number of seconds that a task (i.e. the processing of a single PDF) can run before being terminated and treated as a failure. This value is useful for purging "stuck" tasks which haven't technically failed but are occupying a worker. |
| ENGINE_WORKER_RETRY | 4500 | Engine | The number of seconds after a task is presented for processing before a worker will re-run it, regardless of whether it is still running or has failed. This value must therefore be higher than ENGINE_WORKER_TIMEOUT. |
| ENGINE_WORKER_MAX_ATTEMPTS | 0 | Engine | The number of processing attempts a single task will undergo before it is abandoned. Setting this value to 0 causes a task to be retried forever. |
| MIGRATE | None | API, Engine | If "True", the entrypoint scripts run migrations and createcachetable on startup. |
| INIT_SUPERUSER | None | API, Engine | If "True", the entrypoint scripts attempt to create a superuser. Requires the standard Django envvars (e.g. DJANGO_SUPERUSER_USERNAME, DJANGO_SUPERUSER_EMAIL, DJANGO_SUPERUSER_PASSWORD) to be set. |
| GUNICORN_WORKERS | 2 | API | The value of the --workers argument when running gunicorn. |
| SQS_BROKER_QUEUE_NAME | None | API, Engine | If set, the django-q SQS broker is used, and the queue is created if it doesn't exist. If empty, the default Django ORM broker is used. |

Note that in order to access the S3 bucket, the Composite Handler assumes that valid AWS credentials are available in the environment - either in the form of environment variables, or as ambient credentials.
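For example, credentials might be supplied explicitly, or via a named profile - all values below are placeholders:

export AWS_ACCESS_KEY_ID=AKIAEXAMPLE         # placeholder access key
export AWS_SECRET_ACCESS_KEY=secretexample   # placeholder secret key
# or, alternatively, a named profile:
export AWS_PROFILE=dlcs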

Django Q Broker

By default, Django Q uses the Django ORM broker.

The SQS broker can be configured by specifying the SQS_BROKER_QUEUE_NAME environment variable. Default SQS broker behaviour is to create this queue if it is not found.

As with S3 above, the Composite Handler assumes that valid AWS credentials are available in the environment.
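For example, switching from the ORM broker to SQS should only require setting the queue name - the name below is a placeholder:

SQS_BROKER_QUEUE_NAME=composite-handler-tasks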

Building

The project ships with a Dockerfile:

docker build -t dlcs/composite-handler:local .

This will produce a single image that can be used to execute any of the supported Django commands, including running the API and the engine:

docker run --env-file .env dlcs/composite-handler:local python manage.py migrate # Apply any pending DB schema changes
docker run --env-file .env dlcs/composite-handler:local python manage.py createcachetable # Create the cache table (if it doesn't exist)
docker run --env-file .env -it --rm dlcs/composite-handler:local /srv/dlcs/entrypoint-api.sh # Run the API
docker run --env-file .env -it --rm dlcs/composite-handler:local /srv/dlcs/entrypoint-worker.sh # Run the engine
docker run --env-file .env dlcs/composite-handler:local python manage.py qmonitor # Monitor the workers