diff --git a/README.md b/README.md index 0f0a596..b4ff925 100644 --- a/README.md +++ b/README.md @@ -90,12 +90,12 @@ Heimdall supports a growing set of pluggable command types:

| Plugin | Description | Execution Mode |
| ----------- | -------------------------------------- | -------------- |
-| `ping` | Basic plugin used for testing | Sync or Async |
+| `ping` | [Basic plugin used for testing](https://github.com/patterninc/heimdall/blob/main/plugins/ping/README.md) | Sync or Async |
| `shell` | [Shell command execution](https://github.com/patterninc/heimdall/blob/main/plugins/shell/README.md) | Sync or Async |
-| `glue` | Pulling Iceberg table metadata | Sync or Async |
-| `dynamodb` | DynamoDB read operation | Sync or Async |
+| `glue` | [Pulling Iceberg table metadata](https://github.com/patterninc/heimdall/blob/main/plugins/glue/README.md) | Sync or Async |
+| `dynamo` | [DynamoDB read operation](https://github.com/patterninc/heimdall/blob/main/plugins/dynamo/README.md) | Sync or Async |
| `snowflake` | Query execution in Snowflake | Async |
-| `spark` | SparkSQL query execution on EMR on EKS | Async |
+| `spark` | [SparkSQL query execution on EMR on EKS](https://github.com/patterninc/heimdall/blob/main/plugins/spark/README.md) | Async |

---

diff --git a/plugins/dynamo/README.md b/plugins/dynamo/README.md new file mode 100644 index 0000000..99c3279 --- /dev/null +++ b/plugins/dynamo/README.md @@ -0,0 +1,101 @@
# πŸ“¦ DynamoDB Plugin

The **DynamoDB Plugin** enables Heimdall to run read-only PartiQL queries against AWS DynamoDB tables. It connects securely using AWS credentials or an optional assumed IAM role, supports pagination, and returns structured query results.

πŸ”’ **Read-Only:** This plugin only supports data retrieval via PartiQL β€” no writes or modifications.
---

## 🧩 Plugin Overview

* **Plugin Name:** `dynamo`
* **Execution Mode:** Sync
* **Use Case:** Querying DynamoDB tables using PartiQL from within Heimdall workflows

---

## βš™οΈ Defining a DynamoDB Command

### πŸ“ Command Definition

```yaml
- name: dynamo-0.0.1
  status: active
  plugin: dynamo
  version: 0.0.1
  description: Read data using PartiQL
  tags:
    - type:dynamodb
  cluster_tags:
    - type:dynamodb
```

### πŸ–₯️ Cluster Definition (AWS Authentication)

```yaml
- name: dynamo-0.0.1
  status: active
  version: 0.0.1
  description: AWS DynamoDB
  context:
    role_arn: arn:aws:iam::123456789012:role/HeimdallDynamoQueryRole
  tags:
    - type:dynamodb
    - data:prod
```

* `role_arn` is optional. If provided, Heimdall will assume this IAM role to authenticate requests.

---

## πŸš€ Submitting a DynamoDB Job

Define the PartiQL query and an optional result limit in the job context:

```json
{
  "name": "list-items-job",
  "version": "0.0.1",
  "command_criteria": ["type:dynamodb"],
  "cluster_criteria": ["data:prod"],
  "context": {
    "query": "SELECT * FROM my_table WHERE category = 'books'",
    "limit": 100
  }
}
```

---

## πŸ“Š Returning Job Results

The plugin paginates through query results automatically and returns structured output like:

```json
{
  "columns": [
    {"name": "id", "type": "string"},
    {"name": "category", "type": "string"},
    {"name": "price", "type": "float"}
  ],
  "data": [
    ["123", "books", 12.99],
    ["124", "books", 15.50]
  ]
}
```

Retrieve results via:

```
GET /api/v1/job/<job_id>/result
```

---

## 🧠 Best Practices

* Use IAM roles with **least privilege** for security.
* Test queries in the AWS console or CLI before running them via Heimdall.
* Avoid large result sets by leveraging the `limit` parameter.
* Validate job input to prevent malformed PartiQL queries.
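The pagination loop described above can be sketched in Python. This is an illustrative sketch, not the plugin's actual implementation: the `execute` callable stands in for a PartiQL executor with an `execute_statement`-style contract (accepts `Statement`/`NextToken`, returns a dict with `Items` and, when more pages exist, `NextToken`), and the `limit` handling mirrors the job-context `limit` shown earlier.

```python
def run_partiql(execute, statement, limit=None):
    """Collect PartiQL result items, following pagination tokens."""
    items, token = [], None
    while True:
        kwargs = {"Statement": statement}
        if token:
            kwargs["NextToken"] = token
        page = execute(**kwargs)
        items.extend(page.get("Items", []))
        if limit is not None and len(items) >= limit:
            return items[:limit]  # honor the job-context limit
        token = page.get("NextToken")
        if token is None:
            return items


# Stubbed executor standing in for a real DynamoDB client:
def fake_execute(Statement, NextToken=None):
    pages = {
        None: {"Items": [{"id": {"S": "123"}}], "NextToken": "p2"},
        "p2": {"Items": [{"id": {"S": "124"}}]},
    }
    return pages[NextToken]


rows = run_partiql(fake_execute, "SELECT * FROM my_table WHERE category = 'books'")
```

With real AWS credentials, `boto3.client("dynamodb").execute_statement` follows the same contract and could be passed in directly.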
diff --git a/plugins/glue/README.md b/plugins/glue/README.md new file mode 100644 index 0000000..7373306 --- /dev/null +++ b/plugins/glue/README.md @@ -0,0 +1,69 @@
# 🍯 Glue Plugin

The **Glue Plugin** enables Heimdall to query the AWS Glue Data Catalog and retrieve metadata about a specific table, letting you integrate Glue metadata queries directly into your orchestration workflows.

---

## 🧩 Plugin Overview

* **Plugin Name:** `glue`
* **Execution Mode:** Sync
* **Use Case:** Fetching AWS Glue table metadata for auditing, validation, or downstream processing

---

## βš™οΈ Defining a Glue Command

```yaml
- name: glue-metadata-0.0.1
  status: active
  plugin: glue
  version: 0.0.1
  description: Query AWS Glue catalog for table metadata
  context:
    catalog_id: 123456789012
  tags:
    - type:glue-query
  cluster_tags:
    - type:localhost
```

* `catalog_id` is the AWS Glue catalog identifier (optional; defaults to the AWS account ID if omitted).

---

## πŸš€ Submitting a Glue Job

Specify the Glue table to fetch metadata for in the job context:

```json
{
  "name": "fetch-glue-table-metadata",
  "version": "0.0.1",
  "command_criteria": ["type:glue-query"],
  "cluster_criteria": ["data:local"],
  "context": {
    "table_name": "my_database.my_table"
  }
}
```

---

## πŸ“Š Returning Job Results

The plugin returns raw metadata as a JSON string containing the table schema and properties.

Retrieve results via:

```
GET /api/v1/job/<job_id>/result
```

---

## 🧠 Best Practices

* Use appropriate IAM permissions for Glue Catalog access.
* Validate the fully qualified `table_name` to avoid query errors.
* Use the metadata for auditing, documentation, or dynamic pipeline decisions.
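To illustrate how the fully qualified `table_name` maps onto a Glue lookup, here is a small hedged sketch. The request shape matches the standard Glue `GetTable` API (`DatabaseName`, `Name`, optional `CatalogId`), but the helper itself is hypothetical β€” it is not the plugin's actual code.

```python
def glue_table_request(table_name, catalog_id=None):
    """Split 'database.table' into the arguments of a Glue GetTable call.

    Mirrors the fully qualified table_name convention described above;
    raises ValueError if the name is not of the form database.table.
    """
    parts = table_name.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'database.table', got {table_name!r}")
    database, table = parts
    request = {"DatabaseName": database, "Name": table}
    if catalog_id is not None:
        request["CatalogId"] = str(catalog_id)  # Glue expects a string ID
    return request


req = glue_table_request("my_database.my_table", catalog_id=123456789012)
# With boto3 this request would be issued as:
#   boto3.client("glue").get_table(**req)
```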
diff --git a/plugins/ping/README.md b/plugins/ping/README.md new file mode 100644 index 0000000..4064c30 --- /dev/null +++ b/plugins/ping/README.md @@ -0,0 +1,102 @@
# πŸ“ Ping Plugin

The **Ping Plugin** is a sample command used for testing Heimdall’s orchestration flow. Instead of sending actual ICMP packets, it responds instantly with a predefined message β€” perfect for dry runs, plugin testing, or simply checking your Heimdall wiring. 🚧

⚠️ **Testing Only:** This plugin is a *no-op*. It does **not** reach out to real hosts. Use it to verify that jobs run through Heimdall correctly.

---

## 🧩 Plugin Overview

* **Plugin Name:** `ping`
* **Execution Mode:** Sync
* **Use Case:** Testing job submission, validation, or plugin behavior without side effects

---

## βš™οΈ Defining a Ping Command

You don’t need to specify much β€” just use the `ping` plugin and give it a name.

```yaml
- name: ping-0.0.1
  status: active
  plugin: ping
  version: 0.0.1
  description: Check Heimdall wiring
  tags:
    - type:ping
  cluster_tags:
    - type:localhost
```

πŸ”Ή When this job runs, Heimdall will simulate a ping and respond with a message like:

```
Hello, <username>!
```

---

## πŸ–₯️ Cluster Configuration

Use a simple localhost cluster (or any compatible test target) to execute ping jobs:

```yaml
- name: localhost-0.0.1
  status: active
  version: 0.0.1
  description: Localhost
  tags:
    - type:localhost
    - data:local
```

---

## πŸš€ Submitting a Ping Job

Here’s how to submit an example ping command via the Heimdall API:

```json
{
  "name": "ping-check-job",
  "version": "0.0.1",
  "command_criteria": ["type:ping"],
  "cluster_criteria": ["data:local"],
  "context": {}
}
```

🟒 This will run the ping plugin and instantly return a result.
---

## πŸ“Š Returning Job Results

The plugin returns this result:

```json
{
  "columns": [
    {"name": "message", "type": "string"}
  ],
  "data": [
    ["Hello, alice!"]
  ]
}
```

You can retrieve the result from:

```
GET /api/v1/job/<job_id>/result
```

---

## 🧠 Best Practices

* Use this plugin to **test your pipelines** before running real jobs.
* It’s great for **CI/CD checks**, plugin regression tests, or mocking command behavior.
* Don't forget: **no real pinging happens** β€” it's just a friendly "Hello!" 🎯

diff --git a/plugins/spark/README.md b/plugins/spark/README.md new file mode 100644 index 0000000..1078711 --- /dev/null +++ b/plugins/spark/README.md @@ -0,0 +1,119 @@
# πŸ”₯ Spark Plugin

The **Spark Plugin** enables Heimdall to submit SparkSQL batch jobs to AWS EMR on EKS clusters. It uploads SQL queries to S3, runs them via EMR Containers, and optionally returns query results in Avro format.

---

## 🧩 Plugin Overview

* **Plugin Name:** `spark`
* **Execution Mode:** Asynchronous batch jobs
* **Use Case:** Running SparkSQL queries with configurable properties on EMR on EKS clusters

---

## βš™οΈ Defining a Spark Command

A Spark command requires a SQL `query` in its job context and optionally accepts job-specific Spark properties. The plugin uploads queries to, and reads results from, S3 paths configured in the command context.
```yaml
- name: spark-sql-3.5.3
  status: active
  plugin: spark
  version: 3.5.3
  description: Run a SparkSQL query
  context:
    queries_uri: s3://bucket/spark/queries
    results_uri: s3://bucket/spark/results
    logs_uri: s3://bucket/spark/logs
    wrapper_uri: s3://bucket/contrib/spark/spark-sql-s3-wrapper.py
    properties:
      spark.executor.instances: "1"
      spark.executor.memory: "500M"
      spark.executor.cores: "1"
      spark.driver.cores: "1"
  tags:
    - type:sparksql
  cluster_tags:
    - type:spark
```

πŸ”Έ This defines the S3 URIs for queries, results, logs, and the [Python wrapper script](https://github.com/patterninc/heimdall/blob/main/configs/spark-sql-s3-wrapper.py) used for queries that return results. Additional Spark properties can be specified globally or per job; job properties take precedence over properties defined in the command.

---

## πŸ–₯️ Cluster Configuration

Clusters must specify their EMR execution role ARN and EMR release label, and may optionally specify an IAM role ARN to assume:

```yaml
- name: emr-on-eks
  status: active
  version: 3.5.3
  description: EMR on EKS cluster (7.6.0)
  context:
    execution_role_arn: arn:aws:iam::123456789012:role/EMRExecutionRole
    emr_release_label: emr-6.5.0-latest
    role_arn: arn:aws:iam::123456789012:role/AssumeRoleForEMR
    properties:
      spark.driver.memory: "2G"
      spark.sql.catalog.glue_catalog: "org.apache.iceberg.spark.SparkCatalog"
  tags:
    - type:spark
    - data:prod
```

---

## πŸš€ Submitting a Spark Job

A typical Spark job includes a SQL query and optional Spark properties:

```json
{
  "name": "run-my-query",
  "version": "0.0.1",
  "command_criteria": ["type:sparksql"],
  "cluster_criteria": ["data:prod"],
  "context": {
    "query": "SELECT * FROM my_table WHERE dt='2023-01-01'",
    "properties": {
      "spark.sql.shuffle.partitions": "10"
    },
    "return_result": true
  }
}
```

πŸ”Ή The job uploads the query to S3, submits it to EMR on EKS, and polls until completion.
If `return_result` is true, results are fetched from S3 in Avro format.

---

## πŸ“¦ Job Context & Runtime

The Spark plugin handles:

* Uploading the SQL query file to the configured `queries_uri` in S3
* Submitting the job with the specified properties
* Polling the EMR job status until completion or failure
* Fetching results from the `results_uri` if requested
* Streaming job logs to the configured `logs_uri`

---

## πŸ“Š Returning Job Results

If `return_result` is set, Heimdall reads Avro-formatted query results from the results S3 path and exposes them via:

```
GET /api/v1/job/<job_id>/result
```

---

## 🧠 Best Practices

* Provide IAM roles with the minimal permissions needed for EMR and S3 access.
* Tune Spark properties carefully to optimize resource usage and performance.
* Use the `wrapper_uri` to customize job execution logic as needed.
* Monitor job states and handle failures gracefully in your workflows.
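The poll-until-terminal step in the runtime flow above can be sketched as follows. The terminal state names match the EMR on EKS job-run states (`COMPLETED`, `FAILED`, `CANCELLED`), but the helper and its `get_state` callable are illustrative stand-ins for the plugin's actual status checks against the EMR Containers API, not Heimdall's real implementation.

```python
import time

# Terminal EMR-on-EKS job-run states (per the EMR Containers API).
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def poll_job(get_state, interval_s=0.0, max_polls=100):
    """Poll get_state() until the job reaches a terminal state.

    Returns the final state, or raises TimeoutError if the job
    never settles within max_polls checks.
    """
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval_s)  # back off between status checks
    raise TimeoutError("job did not reach a terminal state")


# Stub: a job that is pending, then running, then completes.
states = iter(["PENDING", "RUNNING", "COMPLETED"])
final = poll_job(lambda: next(states))
```

A real caller would treat `FAILED`/`CANCELLED` as errors and, on `COMPLETED`, proceed to fetch Avro results from `results_uri`.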