[AIRFLOW DAG] Update covid ch dashboard #148

Open
eduardocorrearaujo opened this issue Oct 19, 2022 · 18 comments

@eduardocorrearaujo
Contributor

eduardocorrearaujo commented Oct 19, 2022

DAG Description

DAGs need to be created to update the dashboard results weekly:
https://epigraphhub.org/covidch/

Basic Workflow

  • what would be the expected result?
    This DAG would pull the data from the epigraphhub database and apply machine learning models to it to produce forecasts. After forecasting, the DAG should upload the data frame with the forecasted values to the database.

  • are the modules/methods for the dag already written? if so, where can it be found?
    The code uses some functions from the epigraphhub_py package, and I already have the scripts written on my personal machine.

  • when should the DAG run? daily? weekly? triggered by another DAG?
    This DAG should run after the DAG that uploads the foph tables.

More info

No response

@luabida
Contributor

luabida commented Oct 19, 2022

@eduardocorrearaujo could you provide more information about the DAG, for example:

  • what would be the expected result?
  • are the modules/methods for the dag already written? if so, where can it be found?
  • when should the DAG run? daily? weekly? triggered by another DAG?

@eduardocorrearaujo
Contributor Author

@eduardocorrearaujo could you provide more information about the DAG, for example:

what would be the expected result?
This DAG would pull the data from the epigraphhub database and apply machine learning models to it to produce forecasts. After forecasting, the DAG should upload the data frame with the forecasted values to the database.

are the modules/methods for the dag already written? if so, where can it be found?
The code uses some functions from the epigraphhub_py package, and I already have the scripts written on my personal machine.

when should the DAG run? daily? weekly? triggered by another DAG?
This DAG should run after the DAG that uploads the foph tables.

@luabida
Contributor

luabida commented Oct 24, 2022

This DAG should run after the DAG that uploads the foph tables.

I've created a template with the start and end tasks (triggered by foph - done task). Let me know if you have any questions about this template.

@eduardocorrearaujo
Contributor Author

[Screenshot, 2022-10-25 12:41:56]
The code that updates the covid ch dashboard results should run after the foph data is updated. In the foph tasks, the done task is followed by a task that removes the downloaded data (remove_csv_dir).

Do you think it would be better to put the dashboard update task after done and before remove_csv_dir, so that we could use the data saved in the CSV instead of pulling it from the epigraphhub database (which could save some time)? Or should the task go after remove_csv_dir and pull the data from the database?

@eduardocorrearaujo
Contributor Author

Another possible issue with this DAG is that I train the models on my machine before applying them. So we should define an interval for retraining the models periodically with new data. Do you think 2 months is a good interval?

@luabida
Contributor

luabida commented Oct 25, 2022

Do you think it would be better to put the dashboard update task after done and before remove_csv_dir

If you want to use the CSV with the data that goes into the DB before it is deleted, just remove the remove_csv_dir task and let the other DAG take care of deleting the CSV. That way you would avoid querying the database.
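
The two options discussed above (read the CSV while it still exists, or query the database) can be combined into a single loader. This is only a minimal sketch: `load_rows`, the CSV path, and the `query_db` callable are hypothetical names for illustration, not part of epigraphhub_py or the existing DAGs.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, str]


def load_rows(csv_path: Path, query_db: Callable[[], List[Row]]) -> List[Row]:
    """Return rows from the CSV if it still exists, otherwise from the database.

    `query_db` is a hypothetical callable standing in for whatever function
    pulls the same table from the epigraphhub database.
    """
    if csv_path.is_file():
        # The CSV has not been cleaned up yet, so no database round trip is needed.
        with csv_path.open(newline="") as handle:
            return list(csv.DictReader(handle))
    # The CSV is gone (for example, remove_csv_dir already ran): query instead.
    return query_db()
```

With this shape the dashboard task works regardless of where it sits relative to remove_csv_dir; it is simply faster when the CSV is still on disk.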

@luabida
Contributor

luabida commented Oct 25, 2022

Another possible issue with this DAG is that I train the models on my machine before applying them. So we should define an interval for retraining the models periodically with new data. Do you think 2 months is a good interval?

Sorry, I'm not quite sure I understood the problem here.

@fccoelho
Contributor

Another possible issue with this DAG is that I train the models on my machine before applying them. So we should define an interval for retraining the models periodically with new data. Do you think 2 months is a good interval?

Yes. The models don't need to be retrained very often; only the prediction has to be generated every week.
But the script that trains the models should be versioned somewhere, even if it is not run by a DAG.
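
As a rough illustration of the "predict weekly, retrain occasionally" split, a weekly run could check whether the models are due for retraining before forecasting. This is a sketch under assumptions: `needs_retraining` is a hypothetical helper, and "2 months" is approximated here as 60 days.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: the "2 months" from the discussion is approximated as 60 days.
RETRAIN_INTERVAL = timedelta(days=60)


def needs_retraining(
    last_trained: Optional[date],
    today: date,
    interval: timedelta = RETRAIN_INTERVAL,
) -> bool:
    """Return True when the models were never trained or the interval has elapsed."""
    if last_trained is None:
        return True
    return today - last_trained >= interval
```

The weekly forecast would run unconditionally, while the training script (versioned separately) would run only when this check returns True.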

@luabida
Contributor

luabida commented Oct 25, 2022

The DAG could still be triggered by an external task and have a 2-month interval at the same time. The version would be the timestamp of the DAG run, something like: scheduled__2022-10-17T14:06:34.055306+00:00
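
If the run timestamp is used as the model version, it can be extracted from a run id shaped like the example above. This is a minimal sketch that assumes the `<run type>__<ISO 8601 timestamp>` layout seen in that example; `split_run_id` and the version string format are hypothetical choices.

```python
from datetime import datetime
from typing import Tuple


def split_run_id(run_id: str) -> Tuple[str, datetime]:
    """Split a run id of the form '<run type>__<ISO timestamp>' into its parts."""
    run_type, _, stamp = run_id.partition("__")
    if not stamp:
        raise ValueError(f"unexpected run id: {run_id!r}")
    return run_type, datetime.fromisoformat(stamp)


run_type, run_time = split_run_id("scheduled__2022-10-17T14:06:34.055306+00:00")
# A compact, sortable label that could tag the trained model files.
model_version = run_time.strftime("%Y%m%dT%H%M%S")
```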

@eduardocorrearaujo
Contributor Author

@fccoelho @luabida the data collection code is saved in the path epigraphhub.data.data_collection.foph of epigraphhub. After the refactoring you are doing, where do you think the functions used to upload the dashboard results should be stored? I was thinking of something like epigraphhub.apps.switzerland.foph. What are your thoughts? I would just like to remind you that this code is used to train and apply ML models.

@luabida
Contributor

luabida commented Oct 28, 2022

From my point of view, apps implies that switzerland is an application. Wouldn't something like epigraphhub.models.switzerland.foph be better?

@eduardocorrearaujo
Contributor Author

eduardocorrearaujo commented Oct 28, 2022

By apps I mean that this code is related to a dashboard application created by epigraphhub, so this code is not general. The functions used in this script already come from epigraphhub.analysis.forecast_models.ngboost_models (which is general).

@fccoelho
Contributor

Any code that is specific to a single dashboard does not belong in the library. Keep that in mind.

@eduardocorrearaujo
Contributor Author

Any code that is specific to a single dashboard does not belong in the library. Keep that in mind.

In that case, I don't know where I should put this code since, as I understand it, the Epigraphhub repo should contain as little code as possible. So my idea was to save the functions in the library and just import them to run the DAGs in the epigraphhub repo, the same way Lua did with the data collection scripts.

@fccoelho
Contributor

Code that is specific to the dashboard and nothing more can stay in the dashboard repo.

@eduardocorrearaujo
Contributor Author

Code that is specific to the dashboard and nothing more can stay in the dashboard repo.

But the code has to be used by the Airflow DAGs. How could I import it if it is saved in the COVID-CH-dashboard repo?

@fccoelho
Contributor

In that case, it can live in a standalone executable script that is run by the DAG. For that, the Airflow container may need to mount an external directory with this and other such scripts.
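
A standalone script of this kind might look roughly like the sketch below. Everything in it is hypothetical: the file name, the `--dry-run` flag, and the placeholder `run_update` function stand in for the dashboard-specific code, which is not shown in this thread. The main design point is that the script reports success or failure through its exit code, so whatever task runs it can succeed or fail accordingly.

```python
#!/usr/bin/env python3
"""Hypothetical standalone entry point, e.g. update_covidch_forecasts.py."""
import argparse
import sys
from typing import List, Optional


def run_update(dry_run: bool) -> None:
    """Placeholder for the dashboard-specific steps (load, forecast, upload)."""
    if dry_run:
        print("dry run: nothing uploaded")
    else:
        print("forecast update would run here")


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Update the covid ch dashboard forecasts")
    parser.add_argument("--dry-run", action="store_true", help="skip the upload step")
    args = parser.parse_args(argv)
    try:
        run_update(args.dry_run)
    except Exception as exc:
        # A non-zero exit code signals failure to the caller.
        print(f"update failed: {exc}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

An Airflow task (for example, a BashOperator) could then invoke the script from the mounted directory, without the DAG needing to import the dashboard code.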

@eduardocorrearaujo
Contributor Author

ok
