Feat/automated staging publish and configure on PR #165

slesaad · 2024-08-15T21:00:58Z

Changes

Adds a workflow that automatically publishes data from a dataset-config file to staging and creates a PR in veda-config with a minimal .data.mdx file representing the data.

Testing

See a test run in this ghgc-data PR - https://github.com/US-GHG-Center/ghgc-data/pull/45

Test dataset config json: https://github.com/US-GHG-Center/ghgc-data/blob/4a0f7ee817e38ceac6b1147a78e9d2532b883d1c/ingestion-data/dataset-config/test.json
Automatically generated PR comment: https://github.com/US-GHG-Center/ghgc-data/pull/45#issuecomment-2291741264
PR created in veda-config-ghg repo: Add dataset [Automated workflow] US-GHG-Center/veda-config-ghg#487

Disclaimer

I haven't tested this in veda-data yet, but wanted to open this PR for visibility and discussion.
If anyone else wants to test it out, go for it!

Planned future work

We also want to automatically publish to production once everything in staging looks good.
So after review and approval, the merge to main would do the following:

Automatically transfer the files to veda-data-store if transfer is set to True in the dataset config, before publishing items to STAC (the Airflow DAG needs work to support this)
Update the PR in veda-config to revert the changes to the .env file

Questions?

Since we have staging/ and production/ folders right now, we need to figure out where the dataset-config folder is going to go. Ideally we'd avoid the need to create two PRs for staging and production ingest.

ciaransweet · 2024-08-16T09:42:24Z

Never changes wherever I work 😆

ciaransweet

More general questions than anything specific I'd change!

ciaransweet · 2024-08-16T09:45:43Z

.github/workflows/pr.yml

+        id: init-comment
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |


I've seen a few of our other workflows define shell: bash. Worth doing for these run steps?

ciaransweet · 2024-08-16T09:50:56Z

.github/workflows/pr.yml

+        run: |
+          if [ -z "$WORKFLOWS_URL" ]; then
+            echo "WORKFLOWS_URL is not set"
+            exit 1
+          fi
+
+          if [ -z "$AUTH_TOKEN" ]; then
+            echo "AUTH_TOKEN is not set"
+            exit 1
+          fi
+
+          publish_url="${WORKFLOWS_URL%/}/dataset/publish"
+          bearer_token=$AUTH_TOKEN
+
+          # Track successful publications
+          all_failed=true
+          success_collections=()
+          status_message='### Collection Publication Status
+          '
+
+          for file in "${ALL_CHANGED_FILES[@]}"; do
+            echo $file
+            if [ -f "$file" ]; then
+              dataset_config=$(jq '.' "$file")
+              collection_id=$(jq -r '.collection' "$file")
+
+              response=$(curl -s -w "%{http_code}" -o response.txt -X POST "$publish_url" \
+                -H "Content-Type: application/json" \
+                -H "Authorization: Bearer $AUTH_TOKEN" \
+                -d "$dataset_config"
+              )
+
+              status_code=$(tail -n1 <<< "$response")
+
+              # Update status message based on response code
+              if [ "$status_code" -eq 200 ] || [ "$status_code" -eq 201 ]; then
+                echo "$collection_id successfully published ✅"
+                status_message+="- **$collection_id**: Successfully published ✅
+                "
+                success_collections+=("$file")
+                all_failed=false
+              else
+                echo "$collection_id failed to publish ❌"
+                status_message+="- **$collection_id**: Failed to publish ❌
+                "
+              fi
+            else
+              echo "File $file does not exist"
+              exit 1
+            fi
+          done
+
+          # Exit workflow if all the requests fail
+          if [ "$all_failed" = true ]; then
+            echo "All collections failed to publish."
+            exit 1
+          fi
+
+          # Output only successful collections to be used in subsequent steps
+          echo "success_collections=$(IFS=','; echo "${success_collections[*]}")" >> $GITHUB_OUTPUT
+
+          # Update PR comment
+          CURRENT_BODY=$(gh api -H "Authorization: token $GITHUB_TOKEN" /repos/${{ github.repository }}/issues/comments/$COMMENT_ID --jq '.body')
+          UPDATED_BODY="$CURRENT_BODY
+
+          $status_message"
+          gh api -X PATCH -H "Authorization: token $GITHUB_TOKEN" /repos/${{ github.repository }}/issues/comments/$COMMENT_ID -f body="$UPDATED_BODY"


General question - Do we have a preference for scripting and what language we use? Would we prefer this in Python for example? It's pretty involved.
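For discussion, here is a rough sketch of what the publish loop above might look like in Python. It is not the PR's implementation: the function names `post_json` and `publish_all` are hypothetical, and only the `/dataset/publish` path, the bearer token header, the 200/201 success check, and the status message format are taken from the shell step. Network errors other than HTTP error responses are left unhandled here.

```python
import json
import urllib.error
import urllib.request
from pathlib import Path


def post_json(url, token, payload):
    """POST payload as JSON with a bearer token and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise HTTPError; report the status code instead.
        return error.code


def publish_all(files, workflows_url, token, post=post_json):
    """Publish each dataset config file.

    Returns (succeeded_files, failed_files, status_message).
    `post` is injectable so the loop can be exercised without a network call.
    """
    # Mirrors "${WORKFLOWS_URL%/}/dataset/publish" from the shell step.
    publish_url = workflows_url.removesuffix("/") + "/dataset/publish"
    succeeded, failed = [], []
    lines = ["### Collection Publication Status"]
    for name in files:
        config = json.loads(Path(name).read_text())
        collection_id = config.get("collection")
        if post(publish_url, token, config) in (200, 201):
            succeeded.append(name)
            lines.append(f"- **{collection_id}**: Successfully published ✅")
        else:
            failed.append(name)
            lines.append(f"- **{collection_id}**: Failed to publish ❌")
    return succeeded, failed, "\n".join(lines)
```

Keeping the HTTP call behind a parameter is one way to make the loop unit-testable, which is harder to do with the inline shell version.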

ciaransweet · 2024-08-16T09:53:00Z

.github/workflows/pr.yml

+          fi
+
+          # Output only successful collections to be used in subsequent steps
+          echo "success_collections=$(IFS=','; echo "${success_collections[*]}")" >> $GITHUB_OUTPUT


Worth also publishing the failed ones?

Saves someone having to find those manually?
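A minimal sketch of how both lists could be exposed as step outputs, assuming the step is written in Python. The output name `failed_collections` and the function `write_publish_outputs` are hypothetical; `success_collections` and the comma-separated format match the line in the diff, and `GITHUB_OUTPUT` is the file path GitHub Actions provides for step outputs.

```python
import os


def write_publish_outputs(succeeded, failed, output_path=None):
    """Append comma-separated success and failure lists to the step output file."""
    # GitHub Actions sets GITHUB_OUTPUT to the path of the step output file.
    output_path = output_path or os.environ["GITHUB_OUTPUT"]
    with open(output_path, "a", encoding="utf-8") as handle:
        handle.write(f"success_collections={','.join(succeeded)}\n")
        # Hypothetical extra output so later steps can report failures too.
        handle.write(f"failed_collections={','.join(failed)}\n")
```

A later step could then read `steps.<step id>.outputs.failed_collections` and list the failures in the PR comment.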

ciaransweet · 2024-08-16T09:55:22Z

.github/workflows/pr.yml

+      - name: Update PR comment for PR creation
+        if: success()
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          COMMENT_ID: ${{ steps.init-comment.outputs.COMMENT_ID }}
+        run: |
+          CURRENT_BODY=$(gh api -H "Authorization: token $GITHUB_TOKEN" /repos/${{ github.repository }}/issues/comments/$COMMENT_ID --jq '.body')
+          UPDATED_BODY="$CURRENT_BODY
+
+          **Creating a PR in veda-config...**"
+          gh api -X PATCH -H "Authorization: token $GITHUB_TOKEN" /repos/${{ github.repository }}/issues/comments/$COMMENT_ID -f body="$UPDATED_BODY"


I might be wrong, but if this is just adding a note to the comment from the step above to say it's creating the PR, why a separate step? It seems like these 4 extra run lines could go in there; I'm not sure a separate step gives much extra value.

ciaransweet · 2024-08-16T09:56:05Z

.github/workflows/pr.yml

+          pip install -r scripts/requirements.txt
+          for file in "${PUBLISHED_COLLECTION_FILES[@]}"
+          do
+            python3 scripts/mdx.py "$file"


Could the script accept a list of files and do the iteration in that?
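One way this could look, sketched with `argparse` from the standard library. This is an illustration rather than the contents of scripts/mdx.py: `parse_args`, `main`, and the `generate` callback (a stand-in for whatever the script does per file) are hypothetical names.

```python
import argparse


def parse_args(argv=None):
    """Parse one or more dataset-config paths from the command line."""
    parser = argparse.ArgumentParser(
        description="Generate .data.mdx files from dataset-config JSON files."
    )
    parser.add_argument(
        "files", nargs="+", help="One or more dataset-config JSON files"
    )
    return parser.parse_args(argv)


def main(argv=None, generate=print):
    """Run `generate` once per file and return how many files were handled."""
    args = parse_args(argv)
    for path in args.files:
        # `generate` stands in for the script's existing per-file logic.
        generate(path)
    return len(args.files)
```

With something like this in place, the workflow loop could collapse to a single call such as `python3 scripts/mdx.py "${PUBLISHED_COLLECTION_FILES[@]}"`.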

ciaransweet · 2024-08-16T10:04:21Z

scripts/dataset.mdx

@@ -0,0 +1,5 @@
+<Block>


Could you explain what the purpose of this file is?

From what I can tell, we'll generate a file in the form:

<yaml populated from json> <This file's contents unchanged?>

Does this filler text get changed elsewhere?
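To make that reading concrete, here is a small sketch of the composition as described in this comment: YAML front matter built from the JSON config, followed by the template body unchanged. It reflects the reviewer's understanding only, not confirmed behaviour of scripts/mdx.py. The function name `build_mdx` is hypothetical, and the sketch relies on JSON values being valid YAML flow values instead of using a YAML library.

```python
import json


def build_mdx(front_matter, template_body):
    """Return front matter (as YAML) followed by the template body, unchanged."""
    # JSON scalars, lists, and objects are valid YAML flow values,
    # so json.dumps is enough for this illustration. Keys are assumed
    # to be plain identifiers that need no quoting.
    lines = [f"{key}: {json.dumps(value)}" for key, value in front_matter.items()]
    return "---\n" + "\n".join(lines) + "\n---\n\n" + template_body
```

If the filler text in the template is meant to be edited later, that would happen outside a function like this one.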

ciaransweet · 2024-08-16T10:04:55Z

scripts/mdx.py

@@ -0,0 +1,126 @@
+"""


General question again (sorry 😅) - Do we have a decision point on when we do and don't add type hints to Python code?

anayeaye · 2024-08-20T22:48:49Z

Since we have staging/ and production/ folders right now, we need to figure out where the dataset-config folder is going to go. Ideally we'd avoid the need to create two PRs for staging and production ingest.

What about adding a publication PR template with a couple of required fields (collection_id and publication stage)? This would be enough information for us to select only the collection(s) intended for publication (instead of diffing all of the JSON files in a folder), and it would tell us whether the PR is intended for staging or production. That would let us choose between vars.STAGING_WORKFLOWS_URL and vars.PRODUCTION_WORKFLOWS_URL based on stage and infer the full path to the JSON input configuration: ingestion-inputs/{stage}/dataset-config/{collection_id}.json

- type: textarea
  attributes:
    label: Collection ID
    description: What is the id of the collection you are publishing, i.e. openveda.cloud/collections/{collection_id}
    placeholder: "collection_id"
    value: collection_id
  validations:
    required: true
- type: dropdown
  attributes:
    label: Publication stage
    description: What stage do you want to publish this collection to
    multiple: false
    options:
      - staging (Default)
      - production
    default: 0
    value: publication_stage
  validations:
    required: true
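A sketch of how a workflow script might turn those two fields into a workflows URL and a config path. The variable names and the path pattern come from the comment above; the function name `resolve_publication`, the `variables` mapping (standing in for repository variables), and the normalisation of the "staging (Default)" label are assumptions made for illustration.

```python
def resolve_publication(collection_id, stage_label, variables):
    """Return (workflows_url, config_path) for a collection and stage label."""
    # Dropdown labels may carry a suffix, e.g. "staging (Default)".
    parts = stage_label.split()
    stage = parts[0].lower() if parts else ""
    if stage not in ("staging", "production"):
        raise ValueError(f"Unknown publication stage: {stage_label!r}")
    # Picks STAGING_WORKFLOWS_URL or PRODUCTION_WORKFLOWS_URL.
    workflows_url = variables[f"{stage.upper()}_WORKFLOWS_URL"]
    config_path = f"ingestion-inputs/{stage}/dataset-config/{collection_id}.json"
    return workflows_url, config_path
```

Rejecting unknown stages up front keeps a typo in the template from silently publishing to the wrong environment.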

anayeaye · 2024-08-20T22:50:06Z

.github/workflows/pr.yml

+
+              # Update status message based on response code
+              if [ "$status_code" -eq 200 ] || [ "$status_code" -eq 201 ]; then
+                echo "$collection_id successfully published ✅"


anayeaye · 2024-08-20T22:56:30Z

Automatically generated PR comment: https://github.com/US-GHG-Center/ghgc-data/pull/45#issuecomment-2291741264

PR created in veda-config-ghg repo: Add dataset [Automated workflow] US-GHG-Center/veda-config-ghg#487

🤩

slesaad added 5 commits August 7, 2024 17:39

Add script to create a dataset mdx file

e479004

Add a workflow to run staging publish on pr

a6b78a9

Add requirements

02d0d4e

Clean up and add comments

307780d

Add documentation about the automated workflow to README

e0623de

slesaad requested a review from smohiudd as a code owner August 15, 2024 21:00

slesaad requested review from anayeaye, ciaransweet, botanical, amarouane-ABDELHAK and ividito August 15, 2024 21:01

slesaad mentioned this pull request Aug 15, 2024

Auto staging publish workflow #161

Closed

slesaad added 2 commits August 15, 2024 16:14

Update mdx.py

8167761

Make the lint god happy

dedc8b6

ciaransweet reviewed Aug 16, 2024

anayeaye reviewed Aug 20, 2024

smohiudd mentioned this pull request Oct 28, 2024

Automated ingestion to Staging and Production Catalogs #180

Open

1 task

botanical mentioned this pull request Nov 4, 2024

fix: remove veda-config related steps #181

Open
