# Loading large datasets

## Running a large pipeline

To speed up the execution time of large pipelines (such as the Short Variants pipeline), add additional worker nodes to the Dataproc cluster you create.

```
./deployctl dataproc-cluster start variants --num-workers 32
```

## Loading large datasets into Elasticsearch

To speed up loading large datasets into Elasticsearch, spin up many temporary pods and spread indexing across them.
Then move ES shards from temporary pods onto permanent pods.

### 1. Set [shard allocation filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html)

This configures existing indices so that data stays on permanent data pods and does not migrate to temporary ingest pods. You can issue the following request to the `_all` index to pin existing indices to a particular nodeSet.

```
curl -u "elastic:$ELASTICSEARCH_PASSWORD" "http://localhost:9200/_all/_settings" <<EOF
EOF
```
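Expanded, the settings request might look like the following sketch. The node-name pattern `gnomad-es-data-*` and the exact flags are assumptions; substitute the actual pod names of your persistent data nodeSets:

```
# Require shards of all existing indices to stay on permanent data nodes.
# The node-name pattern below is illustrative; use your data nodeSets' pod names.
curl -u "elastic:$ELASTICSEARCH_PASSWORD" -XPUT \
  "http://localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.routing.allocation.require._name": "gnomad-es-data-*"}'
```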

### 2. Add nodes to the es-ingest node pool

This can be done by adjusting the `pool_num_nodes` variable in our [terraform deployment](https://github.com/broadinstitute/gnomad-terraform/blob/8519ea09e697afc7993b278f1c2b4240ae21c8a4/exac-gnomad/services/browserv4/main.tf#L99) and opening a PR to review and apply the infrastructure change.

### 3. When the new es-ingest GKE nodes are ready, add temporary pods to the Elasticsearch cluster

```
./deployctl elasticsearch apply --n-ingest-pods=48
```

The number of ingest pods should match the number of nodes in the `es-ingest` node pool.

Watch pods' readiness with `kubectl get pods -w`.

### 4. Create a Dataproc cluster and load a Hail table.

The number of workers in the cluster should match the number of ingest pods.

```
./deployctl dataproc-cluster stop es
```
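A full run of this step might look like the following sketch; the `--num-workers` value and the `load-datasets` subcommand usage are assumptions, so verify the exact interface against `./deployctl --help`:

```
# Start a cluster sized to match the ingest pods (flags illustrative)
./deployctl dataproc-cluster start es --num-workers 48

# Load the Hail table into Elasticsearch (subcommand and arguments assumed;
# check deployctl's help output for the real interface)
./deployctl elasticsearch load-datasets --dataproc-cluster es $DATASET

# Tear the cluster down when the load finishes
./deployctl dataproc-cluster stop es
```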

### 5. Determine available space on the persistent data nodes, and how much space you'll need for the new data

First, look at the total size of all indices in Elasticsearch to see how much storage will be required for permanent pods. Add up the values in the `store.size` column output from the [cat indices API](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cat-indices.html).
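As a sketch, the sizes can be summed mechanically; passing `bytes=b` makes the cat API print plain byte counts. The curl pipeline is illustrative (it mirrors the credentials and endpoint used elsewhere in this doc), and the summing step is shown here against sample input:

```shell
# Against a live cluster (illustrative):
#   curl -s -u "elastic:$ELASTICSEARCH_PASSWORD" \
#     "localhost:9200/_cat/indices?h=store.size&bytes=b" \
#     | awk '{sum += $1} END {printf "%.2f GiB\n", sum / 2^30}'
# The summing step itself, demonstrated on sample byte counts:
printf '1073741824\n536870912\n' \
  | awk '{sum += $1} END {printf "%.2f GiB\n", sum / 2^30}'
```

On the sample input this prints `1.50 GiB`.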

Depending on how much additional space you need, you can take one of three options:

1. You don't need any more space, so no modifications are needed.
2. You need slightly more space, so you can increase the size of the disks in the current persistent data nodeSet.
3. You need at least as much space as ~80% of a data pod holds (around ~1.4TB as of this writing), so you add another persistent data node.
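The choice between these options can be sketched as a quick calculation; all three numbers below are illustrative placeholders, not measurements from a real cluster:

```shell
free_gib=200            # space left on the persistent data pods (illustrative)
needed_gib=500          # size of the new data (illustrative)
pod_capacity_gib=1740   # what one data pod holds (illustrative; ~80% of this is ~1.4TB)

if [ "$needed_gib" -le "$free_gib" ]; then
  echo "Option 1: no modifications"
elif [ "$needed_gib" -lt $(( pod_capacity_gib * 80 / 100 )) ]; then
  echo "Option 2: increase disk size"
else
  echo "Option 3: add another persistent data node"
fi
```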

#### Option 1: No modifications

You don't need to make modifications, so you can simply move your new index to the permanent data pods. Skip to [Move data to persistent data pods](#6-move-data-to-persistent-data-pods).

```
curl -u "elastic:$ELASTICSEARCH_PASSWORD" "http://localhost:9200/_all/_settings" <<EOF
EOF
```

#### Option 2: Increase disk size

You can increase the size of the disks on the existing data pods. This performs an online resize of the disks. It's a good idea to ensure you have a recent snapshot of the cluster before doing this. See [Elasticsearch snapshots](./ElasticsearchSnapshots.md) for more information.

Then apply the changes:

```
./deployctl elasticsearch apply --n-ingest-pods=48
```

#### Option 3: Add another persistent data node

Edit [elasticsearch.yaml.jinja2](../manifests/elasticsearch/elasticsearch.yaml.jinja2) and add a new pod to the persistent nodeSet by incrementing the `count` parameter in the `data-{green,blue}` nodeSet. Note that when applied, this will cause data movement as Elasticsearch rebalances shards across the persistent nodeSet. This is generally low-impact, but it's a good idea to do this during a low-traffic period.

Apply the changes:

```
./deployctl elasticsearch apply --n-ingest-pods=48
```

### 6. Move data to persistent data pods

Set [shard allocation filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html) on new indices to move their shards to the persistent data nodeSet. Do this for any newly loaded indices as well as any pre-existing indices that will be kept. Replace $INDEX with the name of each index you need to move.

```
curl -u "elastic:$ELASTICSEARCH_PASSWORD" -XPUT "localhost:9200/_cluster/settings"
```
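As a sketch, the per-index filter request could look like the following; the node-name pattern `gnomad-es-data-*` is an assumption, so substitute the actual pod names of your persistent data nodeSets:

```
# Require shards of $INDEX to live on the permanent data nodes
# (node-name pattern illustrative)
curl -u "elastic:$ELASTICSEARCH_PASSWORD" -XPUT \
  "localhost:9200/$INDEX/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.routing.allocation.require._name": "gnomad-es-data-*"}'
```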

### 7. Once data is done being moved, remove the ingest pods

```
./deployctl elasticsearch apply --n-ingest-pods=0
```

Watch the cluster to ensure that the ingest pods are successfully terminated:

```
kubectl get pods -w
```

### 8. Once the ingest pods are terminated, resize the es-ingest node pool to 0

Set the `pool_num_nodes` variable for the es-ingest node pool to 0 in our [terraform deployment](https://github.com/broadinstitute/gnomad-terraform/blob/8519ea09e697afc7993b278f1c2b4240ae21c8a4/exac-gnomad/services/browserv4/main.tf#L99) and open a PR to review and apply the infrastructure change.

### 9. Clean up and delete any unused indices

If cleaning up unused indices affords you enough space to remove a persistent data node, you can do so by editing the `count` parameter in the `data-{green/blue}` nodeSet in [elasticsearch.yaml.jinja2](../manifests/elasticsearch/elasticsearch.yaml.jinja2). Note that applying this will cause data movement, and it's a good idea to do this during a low-traffic period.


Lastly, be sure to update the relevant [Elasticsearch index aliases](./ElasticsearchIndexAliases.md) and [clear caches](./RedisCache.md).

### References

- [Elastic Cloud on K8S](https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-overview.html)
- [Run Elasticsearch on ECK](https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-elasticsearch-specification.html)