
Commit f4f5ca9

docs: add horizontal scalability documentation
Added a new guide explaining how FHIR Data Pipes scales using sharding, Beam runners, autoscaling, and parallel I/O.
1 parent 4f74e8c commit f4f5ca9

1 file changed

doc/horizontal_scalability.md

Lines changed: 45 additions & 0 deletions
# Horizontal Scalability in FHIR Data Pipes

Horizontal scalability allows FHIR Data Pipes to process large volumes of FHIR resources by distributing the workload across multiple machines or workers. This enables faster data ingestion and transformation, particularly when working with large clinical datasets.
## Why Horizontal Scalability Matters

- FHIR servers can contain millions of resources.
- Single-machine processing becomes slow for analytics workloads.
- Distributed pipelines help parallelize:
  - Resource extraction
  - Transformation
  - Writing data to Parquet or to another FHIR store
## How FHIR Data Pipes Scales Horizontally

FHIR Data Pipes supports horizontal scalability through:

1. **Apache Beam Runner Scaling**
   - Beam supports distributed runners such as:
     - Google Cloud Dataflow
     - Apache Flink
     - Apache Spark
   - These runners can automatically scale workers based on job load.
2. **Sharding FHIR Resources**
   - Resources can be split by:
     - Resource type
     - Logical ID ranges
     - Page tokens
   - Each shard is processed independently.
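The ID-range option above can be sketched with a small helper. This is a hypothetical illustration using a numeric ID space, not FHIR Data Pipes' actual sharding code:

```python
# Hypothetical sketch: split a numeric logical-ID space into contiguous,
# non-overlapping shards that workers can process independently.
def shard_id_ranges(min_id: int, max_id: int, num_shards: int):
    """Split [min_id, max_id] into num_shards contiguous ranges."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_shards)
    ranges = []
    start = min_id
    for i in range(num_shards):
        # The first `extra` shards absorb the remainder, one ID each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(shard_id_ranges(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Because the ranges never overlap, each shard can be fetched, transformed, and written without coordinating with the others.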
3. **Parallel I/O**
   - Multiple workers fetch resources from the FHIR store in parallel.
   - Writes to Parquet or downstream systems are also parallelized.
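Conceptually, parallel fetching looks like the stdlib sketch below; `fetch_page` is a stand-in for a real HTTP request against the FHIR server's search API, not the pipeline's actual I/O code:

```python
# Conceptual sketch: several workers fetching search pages concurrently.
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page_token: int) -> list:
    # Stand-in for an HTTP call; returns fake resource IDs for one page.
    return [f"Patient/{page_token * 10 + i}" for i in range(10)]

def fetch_all(page_tokens, max_workers: int = 4) -> list:
    """Fetch every page in parallel and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = pool.map(fetch_page, page_tokens)
        return [rid for page in pages for rid in page]

ids = fetch_all(range(5))
print(len(ids))  # 50
```

In a distributed runner the same idea applies, except the "workers" are separate machines managed by Beam rather than threads in one process.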
4. **Stateless Transformations**
   - Most transformations in the pipelines are stateless.
   - This allows Beam to distribute them across workers without coordination issues.
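A stateless transformation in this sense is just a pure function of a single resource. The field names below are real FHIR Patient elements, but the flat row schema is illustrative, not FHIR Data Pipes' actual Parquet schema:

```python
# Minimal sketch of a stateless transform: each resource maps to a flat
# row using only its own fields, so workers need no shared state.
def patient_to_row(resource: dict) -> dict:
    return {
        "id": resource.get("id"),
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }

patients = [
    {"resourceType": "Patient", "id": "p1", "gender": "female", "birthDate": "1980-01-01"},
    {"resourceType": "Patient", "id": "p2", "gender": "male", "birthDate": "1975-06-30"},
]

# Because patient_to_row depends only on its input, the list can be
# split across any number of workers and mapped in any order.
rows = [patient_to_row(p) for p in patients]
print(rows[0]["birth_date"])  # 1980-01-01
```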
## Recommended Deployment Pattern

To scale horizontally:

1. Choose a distributed runner (Dataflow is recommended for GCP users).
2. Enable autoscaling.
3. Increase worker limits according to dataset size.
4. Use sharded execution for large Patient-centric datasets.
5. Monitor worker performance using runner-specific dashboards.
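As a hedged sketch, a Dataflow launch following the steps above might look like this. The `--runner`, `--project`, `--region`, `--autoscalingAlgorithm`, and `--maxNumWorkers` flags are standard Beam/Dataflow runner options; the jar name and the pipeline flags (`--fhirServerUrl`, `--outputParquetPath`) are assumptions and should be checked against the FHIR Data Pipes documentation for your version:

```shell
# Assumed jar name and pipeline flags; Dataflow runner flags are standard.
java -jar batch-pipeline.jar \
  --fhirServerUrl=http://fhir-server:8080/fhir \
  --outputParquetPath=gs://my-bucket/parquet/ \
  --runner=DataflowRunner \
  --project=my-gcp-project \
  --region=us-central1 \
  --autoscalingAlgorithm=THROUGHPUT_BASED \
  --maxNumWorkers=20
```

With `THROUGHPUT_BASED` autoscaling, Dataflow adjusts the worker count between its minimum and `--maxNumWorkers` based on observed throughput and backlog.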
## Example High-Level Architecture
