chore: Cleanup data pipeline pages (#1520)
andrablaj authored Sep 4, 2024
1 parent b70caf4 commit c80ade7
Showing 3 changed files with 22 additions and 24 deletions.
4 changes: 1 addition & 3 deletions content/en/apps/guides/data/analytics/introduction.md
@@ -11,15 +11,13 @@ relatedContent: >

{{% pageinfo %}}
The pages in this section apply to both CHT 3.x (beyond 3.12) and CHT 4.x.

[CHT Sync schema](https://github.com/medic/cht-sync/blob/main/postgres/init-dbt-resources.sh) differs from [CHT Couch2pg](https://github.com/medic/cht-couch2pg).
{{% /pageinfo %}}

Most CHT deployments require some sort of analytics so that stakeholders can make data driven decisions. CouchDB, which is the database used by the CHT, is not designed for analytics. It is a document database, which means that it is optimized for storing and retrieving documents, and not for aggregating data. For example, if you wanted to know how many patients were registered in a particular area, you would have to query the database for all the patients in that area, and then count them. This is not a very efficient process. It is much more efficient to store the number of patients in a particular area in a separate database, and update that number whenever a patient is registered or unregistered. This is what CHT Sync paired with CHT Pipeline is designed to do.

## CHT Sync Introduction

Medic maintains CHT Sync which is an integrated solution designed to enable data synchronization between CouchDB and PostgreSQL for the purpose of analytics. It can easily be deployed using Docker. It is supported on CHT 3.12 and later, including CHT 4.x. By using CHT Sync, a CHT deployment can easily get analytics by using a data visualization tool. All tools are open-source and have no licensing fees.
CHT Sync is an integrated solution designed to enable data synchronization between CouchDB and PostgreSQL for the purpose of analytics. It can easily be deployed using Docker. It is supported on CHT 3.12 and later, including CHT 4.x. By using CHT Sync, a CHT deployment can easily get analytics by using a data visualization tool. All tools are open-source and have no licensing fees.

CHT Sync has been designed to work in both local development environments for testing models or workflows, and in production environments. The setup can accommodate the needs of different environments.

34 changes: 17 additions & 17 deletions content/en/apps/guides/data/analytics/production.md
@@ -9,9 +9,9 @@ relatedContent: >
core/overview/cht-sync
---

We recommend running cht-sync in production using Kubernetes. This guide will walk you through setting up a production deployment of [CHT Sync](https://github.com/medic/cht-sync) with the CHT using Kubernetes.
We recommend running [CHT Sync](https://github.com/medic/cht-sync) in production using Kubernetes. This guide will walk you through setting up a production deployment of CHT Sync with the CHT using Kubernetes.

## Pre-requisites:
## Prerequisites:
- A Kubernetes cluster: You can use a managed Kubernetes service like Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), or Azure Kubernetes Service (AKS), or you can set up a cluster using a tool like Minikube.
- kubectl: The Kubernetes command-line tool. You can install it using the [kubectl installation](https://kubernetes.io/docs/tasks/tools/install-kubectl/) instructions.
- Helm: The Kubernetes package manager. You can install it using the [helm installation guide](https://helm.sh/docs/intro/install/).
@@ -22,9 +22,9 @@ We recommend running cht-sync in production using Kubernetes. This guide will wa
- If you require a Postgres database to be set up in the cluster, you can use the `postgres.enabled` flag in the `values.yaml` file. If you already have a Postgres database outside the cluster, you can set the `postgres.enabled` flag to `false`.
- If outside the cluster, specify `host` and `port` in this section
- In either case, specify `user`, `password`, `db`, `schema`, and `table`
- schema can be used to separate cht models from any other data that may already be in the database
- table is the name of the table that couch2pg will write couch documents to, and the source table for dbt models. It is recommended to leave this as `couchdb`
```
- `schema` can be used to separate CHT models from any other data that may already be in the database
- `table` is the name of the table that couch2pg will write couch documents to, and the source table for dbt models. It is recommended to leave this as `couchdb`.
```yaml
postgres:
  enabled: true
  user: "postgres"
@@ -34,29 +34,29 @@ postgres:
  table: "couchdb"
```
- Set CouchDB shared values in the `values.yaml` file.
```
```yaml
couchdb:
  user: "your_couchdb_user"
  dbs: "medic"
  port: "443"
  secure: "true"
```
- Configure the CouchDB instance to be replicated in the `values.yaml` file. For the host, use the CouchDB host URL used to publicly access the instance and for the password, use the password associated with the user set above.
```
```yaml
couchdbs:
  - host: "host1.cht-core.test"
    password: "password1"
```
- If you have multiple CouchDB instances to replicate, you can add them to the `couchdbs` list.
```
```yaml
couchdbs:
  - host: "host1.cht-core.test"
    password: "password1"
  - host: "host2.cht-core.test"
    password: "password2"
```
- If an instance has a different port, user or different CouchDB databases to be synced, you can specify it in the `couchdbs` list.
```
```yaml
- host: "host1" # required for all couchdb instances
  password: "" # required for all couchdb instances
- host: "host2.cht-core.test"
@@ -70,26 +70,26 @@ couchdbs:
```

- Set the CHT Pipeline Branch URL in the `values.yaml` file.
```
```yaml
cht_pipeline_branch_url: "https://github.com/medic/cht-pipeline.git#main"
```
- (Optional) Configure the Metrics Exporter. If enabled, this creates a SQL exporter that queries the database for couch2pg status, the number of pending changes, and the current sequence, and exposes these metrics in Prometheus format at a service named `metrics` on port 9399, for use with [CHT Watchdog](https://docs.communityhealthtoolkit.org/hosting/monitoring/setup/) or any other monitoring service.
An HTTP ingress needs to be created to allow access from outside the cluster.
```
```yaml
metrics_exporter:
  enabled: true
```
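
The settings listed above all live in one `values.yaml`. As a rough sketch, not taken from the commit, the script below collects the values shown in this guide into a single override file. The file name `cht-sync-values.yaml` and the two-space nesting are assumptions, and every credential is a placeholder. The Postgres `password`, `db`, and `schema` keys are left out because their values are not shown here; they still need to be set before deploying.

```shell
#!/bin/sh
# Sketch only: gather the settings described in this guide into one override file.
# The file name is an assumption and all credentials are placeholders.
set -eu

VALUES_FILE="${VALUES_FILE:-cht-sync-values.yaml}"

# Quoted heredoc delimiter so nothing inside is expanded by the shell.
cat > "$VALUES_FILE" <<'EOF'
postgres:
  enabled: true
  user: "postgres"
  table: "couchdb"
  # password, db and schema must also be set (values not shown in this guide)
couchdb:
  user: "your_couchdb_user"
  dbs: "medic"
  port: "443"
  secure: "true"
couchdbs:
  - host: "host1.cht-core.test"
    password: "password1"
cht_pipeline_branch_url: "https://github.com/medic/cht-pipeline.git#main"
metrics_exporter:
  enabled: true
EOF

echo "wrote $VALUES_FILE"
# The script does not deploy anything; it only prints the command to run next.
echo "next: helm install cht-sync cht-sync --values $VALUES_FILE"
```

Keeping the overrides in a separate file leaves the chart's own `values.yaml` untouched, which can make it easier to review what a given deployment changes.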
## Deploy
- Run the command below to deploy the cht-sync helm chart. If installing from root, specify path to directory containing `chart.yaml` and `values.yaml`
```
Run the command below to deploy the cht-sync helm chart. If installing from root, specify path to directory containing `chart.yaml` and `values.yaml`
```shell
helm install cht-sync cht-sync --values values.yaml
```
## Verify the deployment
- Run the following command to get the status of the deployment.
```
Run the following command to get the status of the deployment.
```shell
kubectl get pods
```
- Run the following command to get the logs of a pod.
```
Run the following command to get the logs of a pod.
```shell
kubectl logs -f cht-sync-<pod-id>
```
8 changes: 4 additions & 4 deletions content/en/core/overview/cht-sync.md
@@ -13,30 +13,30 @@ relatedContent: >
---

## Overview
CHT Sync is an integrated solution designed to enable data synchronization between CouchDB and PostgreSQL for the purpose of analytics. It combines several technologies to achieve this synchronization and provides an efficient workflow for data processing and visualization. The synchronization occurs in real-time, ensuring that the data displayed on dashboards is up-to-date.
CHT Sync is an integrated solution designed to enable data synchronization between CouchDB and PostgreSQL for the purpose of analytics. It combines several technologies to achieve this synchronization and provides an efficient workflow for data processing and visualization. The synchronization occurs in near real-time, ensuring that the data displayed on dashboards is up-to-date.

Read more about setting up [CHT Sync]({{< relref "apps/guides/data/analytics/setup" >}}).

<!-- make updates to this diagram on the google slides: -->
<!-- https://docs.google.com/presentation/d/1j4jPsi-gHbiaLBfgYOyru1g_YV98PkBrx2zs7bwhoEQ/ -->
{{< figure src="cht-sync.png" link="cht-sync.png" class=" center col-8 col-lg-6" >}}

[CHT Sync](https://github.com/medic/cht-sync) uses `couch2pg` to replicate data from CouchDB to PostgreSQL in a real-time manner. It listens to changes in the CHT database, and updates the analytics database accordingly.
[CHT Sync](https://github.com/medic/cht-sync) uses `couch2pg` to replicate data from CouchDB to PostgreSQL in a near real-time manner. It listens to changes in the CHT database, and updates the analytics database accordingly.
It is not designed to be accessed by users, and it does not have a user interface. It is designed to be run on the same server as the CHT, but it can be run on a separate server if necessary.

Because CHT Sync writes all new data to a single PostgreSQL table with a `jsonb` column, the raw data is not very useful for analytics. [CHT Pipeline](https://github.com/medic/cht-pipeline) is a set of SQL queries that transform the data in the `jsonb` column into a more useful format. It uses [dbt](https://www.getdbt.com/) to define the models that are translated into PostgreSQL tables or views, which makes it easier to query the data in the analytics platform of choice.

#### couch2pg

[couch2pg](https://github.com/medic/cht-sync/tree/main/couch2pg) streams data from CouchDB and forwards it to PostgreSQL, ensuring real-time updates.
[couch2pg](https://github.com/medic/cht-sync/tree/main/couch2pg) streams data from CouchDB and forwards it to PostgreSQL, ensuring near real-time updates.

#### PostgreSQL

A free and open source SQL database used for analytics queries. See more at the [PostgreSQL](https://www.postgresql.org) site.

#### dbt

Once the data is synchronized and stored in PostgreSQL, it undergoes transformation using predefined [dbt](https://www.getdbt.com/) models from the [cht-pipeline](https://github.com/medic/cht-pipeline). dbt is used to ingest raw JSON data from the PostgreSQL database (`jsonb` column) and normalize it into a relational schema to make it easier to query. A daemon runs CHT Pipeline, and it updates the database whenever the data in the `jsonb` column changes.
Once the data is synchronized and stored in PostgreSQL, it undergoes transformation using predefined [dbt](https://www.getdbt.com/) models from the [CHT Pipeline](https://github.com/medic/cht-pipeline). dbt is used to ingest raw JSON data from the PostgreSQL database (`jsonb` column) and normalize it into a relational schema to make it easier to query. A daemon runs CHT Pipeline, and it updates the database whenever the data in the `jsonb` column changes.

#### Data Visualization
