Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run dbt models only on the relevant databases #176

Open
1yuv opened this issue Oct 25, 2024 · 6 comments
Open

Run dbt models only on the relevant databases #176

1yuv opened this issue Oct 25, 2024 · 6 comments
Assignees
Labels
Type: Feature Add something new

Comments

@1yuv
Copy link
Member

1yuv commented Oct 25, 2024

Describe the issue
It looks like currently dbt models are run on all the couchdb databases synced. While there's an issue in cht-sync syncing multiple databases (#165 ), if you specify any database other than medic, it would still sync. Couchdb has separated databases for different types of data storing, and running default dbt models on all databases is not helpful and this can actualy cause a performance problem.

Describe the improvement you'd like
We should separate the couchdb database and models we want to run on them. For example, the dbt models you want to run on medic will be totally different from medic-users-meta or medic-sentinel databases.

This is critical also from the perfromance point of view. There's no point in running the dbt models where it's not going to make any changes. medic-users-meta and medic-sentinel dbs could be as big as medic and running medic's models on those databases is not helpful.

Related:
medic/cht-pipeline#168 is related to this as it looks as of now we only have models defined for medic database.

@1yuv 1yuv added the Type: Technical issue Improve something that users won't notice label Oct 25, 2024
@dianabarsan
Copy link
Member

This is a serious shift from the way cht-sync works now and the way couch2pg worked before, and would likely require a significant overhaul of how data is managed.

I don't think this is a straight-forward technical issue, because one solution requires having multiple data source tables in Postgres, one for each medic database, or at least keep medic separate from the others.
This would mean that any query that relies on all data being present in a single source would break.
I think keeping duplicate data is problematic for disk space, so having both - one common table and several separate tables - is probably very costly.

Another solution could be to store which database which document was copied from, and update the base models to exclude non-medic docs.
This would still mean that base models would run over all documents, but subsequent modules would not.
We would need to quantify how big of a performance improvement this yields.

@dianabarsan dianabarsan added Type: Feature Add something new and removed Type: Technical issue Improve something that users won't notice labels Oct 28, 2024
@witash
Copy link
Contributor

witash commented Oct 28, 2024

Another solution could be to store which database which document was copied from

Seems like the simplest solution, and I don't think it will be a significant performance issue to have more rows in the source table which are then ignored in all downstream models.

@1yuv
Copy link
Member Author

1yuv commented Oct 29, 2024

because one solution requires having multiple data source tables in Postgres

Is it not possible on cht-sync to store data in two separate tables? As under the hood cht-sync also uses cht-couch2pg to sync data, is it not possible to insert medic database's data to one table and medic-users-meta or any other database's data to another table?

I think keeping duplicate data is problematic for disk space, so having both - one common table and several separate tables - is probably very costly.

I am not suggesting storing data duplicately. The main concern of this issue is to run dbt models that are required to run or that are useful to run on that data set. It does not make any sense running models like contact or patient on database other than medic.

@dianabarsan
Copy link
Member

Is it not possible on cht-sync to store data in two separate tables? As under the hood cht-sync also uses cht-couch2pg to sync data, is it not possible to insert medic database's data to one table and medic-users-meta or any other database's data to another table?

If anyone wrote models or queries directly against the main database, those would be broken. So it's a potentially breaking change.

@dianabarsan
Copy link
Member

Not to mention that migrating to the new "structure" will require a re-sync and model rebuild for all projects that have already deployed cht-sync.

@1yuv
Copy link
Member Author

1yuv commented Nov 21, 2024

We added medic-users-meta to be synced recently on one of our deployment without adding any new models or tests. The dbt run time increased from 4 hours to 20 hours. Since all databases are still inserted into one table and all models are made from that table. I will post update after next run.

19:10:33  Finished running 5 materialized view models, 1 project hook in 0 hours 10 minutes and 4.17 seconds (604.17s).
21:47:57  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 37 minutes and 15.21 seconds (9435.21s).
23:44:43  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 55 minutes and 32.44 seconds (6932.44s).
01:39:00  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 53 minutes and 4.41 seconds (6784.41s).
03:42:37  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 2 minutes and 25.67 seconds (7345.67s).
05:51:18  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 7 minutes and 28.30 seconds (7648.30s).
08:27:46  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 35 minutes and 12.13 seconds (9312.13s).
11:54:04  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 3 hours 25 minutes and 0.88 seconds (12300.88s).
15:25:43  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 3 hours 30 minutes and 19.57 seconds (12619.57s).
18:20:25  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 53 minutes and 28.41 seconds (10408.42s).
20:47:46  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 26 minutes and 6.52 seconds (8766.52s).
22:39:16  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 50 minutes and 15.65 seconds (6615.65s).
00:32:14  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 51 minutes and 48.41 seconds (6708.41s).
02:28:44  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 55 minutes and 13.71 seconds (6913.71s).
05:14:03  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 44 minutes and 4.30 seconds (9844.30s).
09:40:17  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 4 hours 24 minutes and 59.23 seconds (15899.23s).
15:06:27  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 5 hours 24 minutes and 53.49 seconds (19493.49s).
19:20:36  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 4 hours 12 minutes and 53.30 seconds (15173.30s).
15:22:33  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 20 hours 0 minutes and 43.03 seconds (72043.03s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature Add something new
Projects
Status: This Week's commitments
Development

No branches or pull requests

4 participants