Ingest data from new commits in IMGTHLA repository #102

Open · wants to merge 460 commits into main

Conversation

@chrisammon3000 (Contributor) commented Feb 12, 2024

Description

  • Updates to gfe-db are triggered by new commits instead of just new branches (quarterly releases)
  • Execution state is tracked for each commit and release version in the IMGTHLA repo
  • Pipeline execution requests are idempotent when executions are in progress
  • Errors during pipeline executions are handled
  • Concurrency for database loading is constrained to one release version at a time to avoid fatal collisions
  • Formatted notifications of pipeline executions are sent by email
  • Pydantic models are implemented for automatic validation

New Commits

The CheckSourceUpdate Lambda function monitors the IMGTHLA source repository for new commits daily. If one or more new commits are found, the service resolves the release version number, creates new rows in the Execution State table (DynamoDB), and processes only the most recent commit for any given release. The commit's execution status is updated throughout pipeline execution using an enum class:

from enum import Enum


class ExecutionStatus(str, Enum):
    """
    ExecutionStatus is synced using the Step Functions DynamoDB integration:
    NOT_PROCESSED: never processed (set by CheckSourceUpdate) ✅
    SKIPPED: skipped, will not be processed (set by CheckSourceUpdate) ✅
    PENDING: state machine execution started (set by CheckSourceUpdate) ✅
    BUILD_IN_PROGRESS: build started (set by State Machine) ✅
    BUILD_SUCCESS: build succeeded (set by State Machine) ✅
    BUILD_FAILED: build failed (set by State Machine) ✅
    LOAD_IN_PROGRESS: load started (set by State Machine) ✅
    LOAD_COMPLETE: load complete (set by State Machine)
    LOAD_SUCCESS: load succeeded (set by State Machine) ✅
    LOAD_FAILED: load failed (set by State Machine) ✅
    LOAD_INVALID: load invalid from query results (set by State Machine) ✅
    LOAD_SKIPPED: load skipped (set by State Machine) ✅
    EXECUTION_FAILED: build or load failed (set by State Machine) ✅
    ABORTED: build or load aborted (set by UpdateExecutionState) ✅
    """
    NOT_PROCESSED = "NOT_PROCESSED"
    SKIPPED = "SKIPPED"
    PENDING = "PENDING"
    BUILD_IN_PROGRESS = "BUILD_IN_PROGRESS"
    BUILD_SUCCESS = "BUILD_SUCCESS"
    BUILD_FAILED = "BUILD_FAILED"
    LOAD_IN_PROGRESS = "LOAD_IN_PROGRESS"
    LOAD_COMPLETE = "LOAD_COMPLETE"
    LOAD_SUCCESS = "LOAD_SUCCESS"
    LOAD_FAILED = "LOAD_FAILED"
    LOAD_INVALID = "LOAD_INVALID"
    LOAD_SKIPPED = "LOAD_SKIPPED"
    EXECUTION_FAILED = "EXECUTION_FAILED"
    ABORTED = "ABORTED"

Execution State

A DynamoDB table is deployed to store state for pipeline executions. For new deployments, the repository's state is built using the GitHub REST API and loaded into the table.
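
As a rough sketch of that backfill step, the commit history can be paged through with the GitHub REST API and batch-written to the table. The record shape is an illustrative assumption, and resolving each commit to a release version is elided:

import boto3
import requests

table = boto3.resource("dynamodb").Table("GfeDbExecutionStateTable")

def build_repository_state(repo: str = "ANHIG/IMGTHLA") -> list[dict]:
    """Page through the repository's commit history and emit one state
    record per commit. Unauthenticated requests are rate-limited; pass a
    token header for production use."""
    records, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/commits",
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        commits = resp.json()
        if not commits:
            break
        for c in commits:
            # Record shape is an assumption for this sketch
            records.append({
                "commit_sha": c["sha"],
                "execution_status": ExecutionStatus.NOT_PROCESSED.value,
            })
        page += 1
    return records

with table.batch_writer() as batch:
    for record in build_repository_state():
        batch.put_item(Item=record)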


Pipeline Request Idempotency

Idempotency is achieved by using SQS FIFO queues, where the group ID is the unique deployment ID (${STAGE}-${APP_NAME}) and the deduplication ID is the release version. This means that duplicate messages already in the queue are not processed and that releases are loaded in chronological order.
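
A sketch of the enqueue call; the environment variable names for the queue URL and deployment ID are assumptions here:

import json
import os

import boto3

sqs = boto3.client("sqs")

def request_pipeline_execution(release_version: str, payload: dict) -> None:
    """Enqueue a pipeline request. The single group ID keeps releases in
    order, and the deduplication ID drops a duplicate release request."""
    sqs.send_message(
        QueueUrl=os.environ["GFE_DB_PROCESSING_QUEUE_URL"],  # assumed env var
        MessageBody=json.dumps(payload),
        MessageGroupId=f"{os.environ['STAGE']}-{os.environ['APP_NAME']}",
        MessageDeduplicationId=release_version,
    )

Note that SQS FIFO honors deduplication IDs within a five-minute deduplication interval.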

Error Handling

Errors that occur during pipeline execution are caught, and the state table entry is updated to reflect failed or aborted executions.
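
As a minimal sketch of the pattern, a step can wrap its work, record a terminal status, and re-raise, reusing the hypothetical update_execution_status helper shown earlier (the event shape and run_step are placeholders):

def run_step(event: dict) -> None:
    ...  # placeholder for the actual build or load logic

def handler(event, context):
    """Lambda-style entry point; the event shape is assumed."""
    commit_sha = event["commit_sha"]
    try:
        run_step(event)
    except Exception:
        update_execution_status(commit_sha, ExecutionStatus.EXECUTION_FAILED)
        raise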

Database Concurrency Management

Database concurrency is managed by a state machine called the Load Concurrency Manager (LCM). The LCM runs continuously while the main pipeline is running (by monitoring a CloudWatch Alarm) and handles the pre- and post-execution backups. This ensures that the database is not overloaded or shut down during the loading process. All requests to load data into Neo4j pass through a FIFO queue to avoid duplication and maintain the order of release versions. The consumer of the FIFO queue (the LcmReceiveMessage Lambda) receives only one message at a time and is not invoked again until loading has succeeded or failed. Once the queue is empty and all releases have been loaded, the LCM stops running.
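
The one-at-a-time behavior falls out of FIFO semantics: with a single message group, the next message is not delivered until the in-flight one is deleted (success) or its visibility timeout expires. A sketch of the consumer side, with an assumed queue URL environment variable:

import os

import boto3

sqs = boto3.client("sqs")

def receive_next_release() -> dict | None:
    """Fetch at most one message from the load queue."""
    resp = sqs.receive_message(
        QueueUrl=os.environ["GFE_DB_LOAD_QUEUE_URL"],  # assumed env var
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling
    )
    messages = resp.get("Messages", [])
    return messages[0] if messages else None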


Success/Failure Notifications

Notifications are sent by email and include execution outcomes, validation results, and error information in the event of failure.
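
A sketch of the format-and-publish step; delivery by email through an SNS topic subscription, and the topic ARN environment variable, are assumptions of this sketch:

import os

import boto3

sns = boto3.client("sns")

def notify(subject: str, outcome: dict) -> None:
    """Publish a formatted execution summary for email delivery."""
    body = "\n".join(f"{key}: {value}" for key, value in outcome.items())
    sns.publish(
        TopicArn=os.environ["NOTIFICATIONS_TOPIC_ARN"],  # assumed env var
        Subject=subject,
        Message=body,
    )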


Pydantic Models

Pydantic is a Python library for data validation. Every object within the pipeline now uses a Pydantic class for automatic schema and type validation. This prevents records with corrupt or missing fields from being read from or written to the state table.
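
As an illustration, a state record might be modeled as follows; the field names are hypothetical, not the deployed schema:

from pydantic import BaseModel


class ExecutionStateItem(BaseModel):
    """Hypothetical shape of a state table record."""
    commit_sha: str
    release_version: str
    execution_status: ExecutionStatus
    updated_utc: str


# Invalid or missing fields raise a ValidationError at construction time,
# so a malformed record never reaches the table.
item = ExecutionStateItem(
    commit_sha="abc123",
    release_version="3.55.0",
    execution_status=ExecutionStatus.PENDING,
    updated_utc="2024-02-12T00:00:00Z",
)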

Infrastructure Changes

  • New Lambda functions
    • FormatResults - Formats notification messages
    • InvokeLoadConcurrencyManager - Triggers the LCM when the Update Pipeline state machine has executions in progress
    • LcmReceiveMessage - Checks the GfeDbLoadQueue for messages
    • UpdateExecutionState - Handles aborted state machines and updates the state table
  • Lambda Layers
    • GfeDbModelsLayer - Contains the logic and methods for building execution state from GitHub API calls as well as models for data handling and validation
  • SQS Queues
    • GfeDbProcessingQueue - Queues releases for processing
    • GfeDbLoadQueue - Queues releases for loading once they are built
  • DynamoDB table
    • GfeDbExecutionStateTable - Stores state for each commit, release and execution combination
  • State Machines
    • LoadConcurrencyManager - Runs continuously during release processing and limits concurrency of loading to 1 release at a time

Known Issues

  • Some of the earlier releases are missing because 1) their commits are not on the default branch (Latest) and 2) there are inconsistencies in the versioning and availability of metadata (fix in progress)
  • Commits should only be processed if they include a change to hla.dat or msf/, since these assets contain the source data (fix in progress; one possible shape of the filter is sketched below)
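
A sketch of such a filter using the GitHub commit endpoint, which lists the files changed by a commit (pagination and file-list truncation are glossed over here):

import requests


def touches_source_data(repo: str, sha: str) -> bool:
    """Return True if the commit modifies hla.dat or anything under msf/."""
    resp = requests.get(f"https://api.github.com/repos/{repo}/commits/{sha}")
    resp.raise_for_status()
    changed = [f["filename"] for f in resp.json().get("files", [])]
    return any(name == "hla.dat" or name.startswith("msf/") for name in changed)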

Next Steps

  • Address the known issues
  • Merge CSV builds before loading to Neo4j to speed up loading
  • Update the documentation
  • Write tests

@chrisammon3000 (Author) commented:
This is behind the most recent PR for Amazon Linux 2, but I'll update once it's caught up.

Successfully merging this pull request may close the following issue: Upgrade the pipeline trigger logic to run on updates to previously processed release branches