Dataflow Production-Ready Pipeline

Introduction

This repo provides a reference implementation of several best practices for Google Cloud Dataflow (Apache Beam) through a sample pipeline. The sample pipeline does not focus on complex transformations or specific business logic, but rather on the scaffolding around data pipelines in terms of:

Unit testing
Integration testing
Infrastructure automation
Deployment automation

The repo uses many of the concepts explained in the Google Cloud blog series Building production-ready data pipelines using Dataflow.

Sample pipelines

The repo provides a data-preprocessing pipeline for a hypothetical ML use case in both Python and Java (WIP). The pipeline reads and parses a CSV file in the following format:

source_address;source_city;target_address;target_city

It then applies text cleaning to the fields and calculates two similarity features, address_similarity and city_similarity, between the source and target attributes. The output records and the rejected records are written to two separate BigQuery tables.
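To make the record flow concrete, here is a minimal standard-library sketch of the per-line logic described above. It is not the repo's implementation: the function names (`clean_text`, `similarity`, `process_line`), the cleaning rules (lowercasing and whitespace collapsing), and the use of `difflib.SequenceMatcher` as the similarity metric are all assumptions made for illustration. Only the four field names and the two feature names come from this README.

```python
from difflib import SequenceMatcher

# Field order taken from the input format shown above.
FIELDS = ("source_address", "source_city", "target_address", "target_city")


def clean_text(value: str) -> str:
    """Lowercase, trim, and collapse whitespace (assumed cleaning rules)."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; a stand-in for whatever metric the pipeline uses."""
    return SequenceMatcher(None, a, b).ratio()


def process_line(line: str):
    """Return ("output", record) for a well-formed line, else ("rejected", line)."""
    parts = line.split(";")
    if len(parts) != len(FIELDS):
        # Malformed lines are routed to the rejected branch unchanged.
        return "rejected", line
    record = {name: clean_text(part) for name, part in zip(FIELDS, parts)}
    record["address_similarity"] = similarity(
        record["source_address"], record["target_address"]
    )
    record["city_similarity"] = similarity(
        record["source_city"], record["target_city"]
    )
    return "output", record
```

Keeping this logic in plain functions means it can be unit tested without a runner. In a Beam pipeline, the same function would typically be called from a DoFn that emits the rejected branch as a tagged output.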

The main goal of the repo is to demonstrate the following:

Beam Pipeline structuring and patterns
- DoFns
- PTransform
- Counters
- Side Inputs
- Multiple outputs (Error output)
- Writing to BigQuery
Testing
- Structuring the pipeline into testable units
- Unit tests (Python methods, DoFns, TestPipeline, PAssert)
- Transform-integration test with static data (PTransform)
- System-integration test on Dataflow service
Flex Template
- Packaging the pipeline code and dependencies into a container image
- Using multiple Python modules (Python example)
CD pipeline
- Using Cloud Build
- Running unit tests
- Building and deploying Flex template
- Running system integration test with Flex template
Infrastructure automation
- Using Terraform to automate environment creation for the data pipeline
