MBD Batch Improvements: Continued stories #2791

Open
7 tasks
Jkd-eth opened this issue Aug 19, 2024 · 1 comment

@Jkd-eth
Contributor

Jkd-eth commented Aug 19, 2024

User Story:
As a data engineer, I want to set up an internal batch scoring MBD API endpoint, so that I can process large datasets efficiently for the data team and provide results in a downloadable CSV file.

Acceptance Criteria:
GIVEN the internal API endpoint,
WHEN the data team submits a list of addresses with their API key,
THEN the API should provide an estimated processing time and a job ID, allow status checks via a separate endpoint, and return an S3 bucket link to download the CSV file with the results when the job is completed.
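The flow in the acceptance criteria could be sketched roughly as below. This is only an illustration, not the project's API: `submit_batch`, `complete_job`, `get_status`, the in-memory `JOBS` store, and the `SECONDS_PER_ADDRESS` figure are all hypothetical, and API-key checking is left out.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Assumed throughput, purely illustrative; the issue gives no real figure.
SECONDS_PER_ADDRESS = 0.5


@dataclass
class BatchJob:
    job_id: str
    addresses: list
    status: str = "pending"           # becomes "done" once the job completes
    result_url: Optional[str] = None  # S3 link, set only when the job completes


JOBS = {}  # in-memory stand-in for persistent storage (hypothetical)


def submit_batch(addresses):
    """Register a job and return its ID plus an estimated processing time."""
    job = BatchJob(job_id=str(uuid.uuid4()), addresses=list(addresses))
    JOBS[job.job_id] = job
    return {
        "job_id": job.job_id,
        "estimated_seconds": len(job.addresses) * SECONDS_PER_ADDRESS,
    }


def complete_job(job_id, result_url):
    """Mark a job as finished and attach the download link."""
    job = JOBS[job_id]
    job.status = "done"
    job.result_url = result_url


def get_status(job_id):
    """What a separate status endpoint could return for a job."""
    job = JOBS[job_id]
    return {"job_id": job.job_id, "status": job.status, "result_url": job.result_url}
```

A caller would keep the `job_id` from `submit_batch` and poll `get_status` until `status` is `"done"`, at which point `result_url` holds the link to the CSV.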

Tech Details:

This continues stories #2794, #2795, and #2796, which have already done work on improving the MBD batch model.

Story 4:

  • Save the data in the DB instead of S3, then do a data dump at the end of the run to create the S3 file
    2 separate Django models
    Cron job
    (team will review during hangout meeting)
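As a rough sketch of the "data dump" step in Story 4, the per-address rows stored in the DB could be serialized into a single CSV at the end of the run. The function name `dump_results_to_csv` is hypothetical, the column names are borrowed from the solution concept in the comment on this issue, and the upload to S3 is deliberately left out.

```python
import csv
import io


def dump_results_to_csv(rows):
    """Serialize per-address result rows (dicts) into one CSV string."""
    buffer = io.StringIO()
    # Column names are an assumption based on the fields suggested in the comment.
    writer = csv.DictWriter(buffer, fieldnames=["address", "status", "details"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

In a Django setup the rows would presumably come from a queryset rather than a list of dicts, and the returned string would be written to the S3 file by whatever mechanism the team settles on (for example the cron job mentioned above).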

Story 5: Infra update(s)

  • Move DNS records for the data science models so that they use public names but resolve to private IP addresses
  • Add VPN access to staging and review
    3 pts

Story 6: Improve the Queries (Data science team)

  • Improve lambdas data gathering and analysis
  • Could create libraries specific to querying and processing
  • Optimize model logic

Story 7:

  • Move processing to shared resources or separate passport resource

Open Questions:

Notes/Assumptions:

@Jkd-eth Jkd-eth changed the title from "MBD Batch Improvements" to "MBD Batch Improvements: Continued stories" on Aug 20, 2024
@nutrina
Collaborator

nutrina commented Sep 10, 2024

Regarding point 4 above, a possible solution concept:

  • we have BatchModelScoringRequest already
    • for each line in the CSV we create 1 BatchModelScoringItemResult with the following fields, for example:
      • address
      • status (pending|error|done)
      • details ...
  • it will also be very important to have the possibility to:
    • start / stop / restart the task from the Django admin, as opposed to having it started automatically on the S3 event
    • split the list of BatchModelScoringItemResult into ranges and start a separate task for each range to increase the processing speed
  • for exporting the results to S3, we can also create a button or a download link in the Django admin
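To make the range-splitting idea concrete, here is a minimal sketch in plain Python. `ItemResult` is only a stand-in for a `BatchModelScoringItemResult` row (the real thing would be a Django model), and `ItemStatus` and `split_into_ranges` are hypothetical names; the status values are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum


class ItemStatus(str, Enum):
    PENDING = "pending"
    ERROR = "error"
    DONE = "done"


@dataclass
class ItemResult:
    """Plain-Python stand-in for a BatchModelScoringItemResult row."""
    address: str
    status: ItemStatus = ItemStatus.PENDING
    details: str = ""


def split_into_ranges(total, chunk_size):
    """Return half-open (start, end) index ranges that cover `total` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total))
        for start in range(0, total, chunk_size)
    ]
```

Each `(start, end)` pair could then be handed to its own task. Because every item carries a status, a restarted task could skip items already marked `done` and only retry those still `pending` or in `error`.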

Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

2 participants