MBD Batch Improvements: Continued stories #2791

Jkd-eth · 2024-08-19T21:25:01Z

User Story:
As a data engineer, I want to set up an internal batch scoring MBD API endpoint, so that I can process large datasets efficiently for the data team and provide results in a downloadable CSV file.

Acceptance Criteria:
GIVEN the internal API endpoint,
WHEN the data team submits a list of addresses with their API key,
THEN the API should provide an estimated processing time and a job ID, allow status checks via a separate endpoint, and return an S3 bucket link to download the CSV file with the results when the job is completed.
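A minimal sketch of the submit / status-check flow described in the acceptance criteria, using an in-memory dict in place of a real endpoint and database. Every name here (`BatchJob`, `submit_batch`, `get_status`, `mark_done`, `SECONDS_PER_ADDRESS`) and the per-address time estimate are assumptions for illustration, not the actual implementation; API key validation is omitted.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed average processing time per address; a placeholder, not a measured value.
SECONDS_PER_ADDRESS = 0.5


@dataclass
class BatchJob:
    job_id: str
    addresses: List[str]
    status: str = "pending"           # pending | error | done
    result_url: Optional[str] = None  # set once the CSV is available


# Hypothetical in-memory store standing in for the database.
JOBS: Dict[str, BatchJob] = {}


def submit_batch(addresses: List[str]) -> dict:
    """Register a job and return its ID with an estimated processing time."""
    job = BatchJob(job_id=str(uuid.uuid4()), addresses=list(addresses))
    JOBS[job.job_id] = job
    return {
        "job_id": job.job_id,
        "estimated_seconds": len(job.addresses) * SECONDS_PER_ADDRESS,
    }


def get_status(job_id: str) -> dict:
    """Status check for the separate status endpoint."""
    job = JOBS[job_id]
    return {"job_id": job.job_id, "status": job.status, "result_url": job.result_url}


def mark_done(job_id: str, result_url: str) -> None:
    """Record the download link once the job has completed."""
    job = JOBS[job_id]
    job.status = "done"
    job.result_url = result_url
```

The result link stays `None` until the job is marked done, so the status endpoint can be polled safely while the job runs.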

Tech Details:

Given the work we did on MBD batch scoring (Data Week: Internal Batch Scoring MBD API Endpoint (Part 2) #2680), we noticed some improvements that could enhance the process and increase its speed.

This is a continuation of stories #2794, #2795, and #2796, which have already done work on improving the MBD batch model.

Story 4:

Save the data in the DB instead of S3, then do a data dump at the end of the run to create the S3 file
Two separate Django models
Cron job
(team will review during hangout meeting)
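One way to picture the "save to the DB, dump at the end" idea is the sketch below. It uses the standard-library `sqlite3` and `csv` modules as stand-ins for the Django ORM and the real database; the table name `item_result`, its columns, and the helper names are hypothetical. Uploading the resulting CSV text to S3 is left out.

```python
import csv
import io
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    """Create a hypothetical per-address result table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item_result ("
        "id INTEGER PRIMARY KEY, "
        "request_id INTEGER NOT NULL, "
        "address TEXT NOT NULL, "
        "status TEXT NOT NULL, "
        "details TEXT)"
    )


def save_result(conn, request_id, address, status, details=""):
    """Store one result row during the run, instead of writing to S3."""
    conn.execute(
        "INSERT INTO item_result (request_id, address, status, details) "
        "VALUES (?, ?, ?, ?)",
        (request_id, address, status, details),
    )


def dump_to_csv(conn, request_id) -> str:
    """At the end of the run, dump all rows of one request as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["address", "status", "details"])
    rows = conn.execute(
        "SELECT address, status, details FROM item_result "
        "WHERE request_id = ? ORDER BY id",
        (request_id,),
    )
    writer.writerows(rows)
    return buf.getvalue()
```

Because rows are written as they are produced, a failed run can resume from what is already stored, and the dump can be regenerated at any time.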

Story 5: Infra update(s)

Move the DNS records for the data science models to public names that resolve to private IP addresses
Add VPN access to staging and review
3 pts

Story 6: Improve the Queries (Data science team)

Improve the lambdas' data gathering and analysis
Could create libraries specific to querying and processing
Optimize model logic

Story 7:

Move processing to shared resources or separate passport resource

Open Questions:

Notes/Assumptions:

nutrina · 2024-09-10T19:37:35Z

Regarding point 4 above, a possible solution concept:

We already have BatchModelScoringRequest.
- For each line in the CSV, we create one BatchModelScoringItemResult with, for example, the following fields:
  - address
  - status (pending|error|done)
  - details ...
It will also be very important to be able to:
- start / stop / restart the task from the Django admin, as opposed to having it started automatically on the S3 event
- split the list of BatchModelScoringItemResult into ranges and start a separate task for each range to increase the processing speed
For exporting the results to S3, we can also create a button or a download link in the Django admin.
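The range-splitting idea above could look roughly like this. It is a sketch only: `split_into_ranges` is a hypothetical helper, and it assumes the items can be addressed by a contiguous index (for example, their position in the CSV), with each returned half-open `(start, end)` range handed to its own task.

```python
from typing import List, Tuple


def split_into_ranges(total_items: int, num_tasks: int) -> List[Tuple[int, int]]:
    """Split total_items into at most num_tasks half-open (start, end) ranges.

    Sizes differ by at most one item; empty ranges are not returned.
    """
    if total_items < 0 or num_tasks < 1:
        raise ValueError("total_items must be >= 0 and num_tasks >= 1")

    base, extra = divmod(total_items, num_tasks)
    ranges = []
    start = 0
    for i in range(num_tasks):
        # The first `extra` ranges take one additional item each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((start, start + size))
        start += size
    return ranges


# Example: 10 items across 3 tasks.
print(split_into_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Since each item carries its own status, a task restarted from the admin only needs to pick up the items in its range that are still pending.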

Jkd-eth added this to Passport New Aug 19, 2024

Jkd-eth moved this to Prioritized in Passport New Aug 19, 2024

Jkd-eth moved this from Prioritized to Backlog in Passport New Aug 20, 2024

Jkd-eth changed the title ~~MBD Batch Improvements~~ MBD Batch Improvements: Continued stories Aug 20, 2024
