Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tutorial showing how to migrate from DynamoDB #194

Merged
merged 2 commits into from
Aug 14, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 52 additions & 0 deletions .github/workflows/tutorial-dynamodb.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
name: "Tests / Tutorials"
on:
push:
branches:
- master
pull_request:

env:
TUTORIAL_DIR: docs/source/tutorials/dynamodb-to-scylladb-alternator

jobs:
test:
name: DynamoDB migration
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Cache Docker images
uses: ScribeMD/[email protected]
with:
key: docker-${{ runner.os }}-${{ hashFiles('docker-compose-tests.yml') }}
- uses: actions/setup-java@v4
with:
distribution: temurin
java-version: 8
cache: sbt
- name: Build migrator
run: |
./build.sh
mv migrator/target/scala-2.13/scylla-migrator-assembly.jar "$TUTORIAL_DIR/spark-data"
- name: Set up services
run: |
cd $TUTORIAL_DIR
docker compose up -d
- name: Wait for the services to be up
run: |
.github/wait-for-port.sh 8000 # DynamoDB
.github/wait-for-port.sh 8001 # ScyllaDB Alternator
.github/wait-for-port.sh 8080 # Spark master
.github/wait-for-port.sh 8081 # Spark worker
- name: Run tutorial
run: |
cd $TUTORIAL_DIR
aws configure set region us-west-1
aws configure set aws_access_key_id dummy
aws configure set aws_secret_access_key dummy
sed -i 's/seq 1 40000/seq 1 40/g' ./create-data.sh
./create-data.sh
. ./run-migrator.sh
- name: Stop services
run: |
cd $TUTORIAL_DIR
docker compose down
2 changes: 1 addition & 1 deletion docs/source/getting-started/ansible.rst
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ The Ansible playbook expects to be run in an Ubuntu environment where the direct
- Ensure networking is configured to allow you access spark master node via TCP ports 8080 and 4040
- visit ``http://<spark-master-hostname>:8080``

9. `Review and modify config.yaml <../#configure-the-migration>`_ based whether you're performing a migration to CQL or Alternator
9. `Review and modify config.yaml <./#configure-the-migration>`_ based whether you're performing a migration to CQL or Alternator

- If you're migrating to ScyllaDB CQL interface (from Apache Cassandra, ScyllaDB, or other CQL source), make a copy review the comments in ``config.yaml.example``, and edit as directed.
- If you're migrating to Alternator (from DynamoDB or other ScyllaDB Alternator), make a copy, review the comments in ``config.dynamodb.yml``, and edit as directed.
Expand Down
4 changes: 2 additions & 2 deletions docs/source/getting-started/aws-emr.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c
--output-document=config.yaml


2. `Configure the migration <../#configure-the-migration>`_ according to your needs.
2. `Configure the migration <./#configure-the-migration>`_ according to your needs.

3. Download the latest release of the Migrator.

Expand Down Expand Up @@ -67,7 +67,7 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c

spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar

See also our `general recommendations to tune the Spark job <../#run-the-migration>`_.
See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.

- Add a Bootstrap action to download the Migrator and the migration configuration:

Expand Down
4 changes: 2 additions & 2 deletions docs/source/getting-started/docker.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ This page describes how to set up a Spark cluster locally on your machine by usi

http://localhost:8080

5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure <../#configure-the-migration>`_ it according to your needs.
5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure <./#configure-the-migration>`_ it according to your needs.

6. Finally, run the migration.

Expand All @@ -47,7 +47,7 @@ This page describes how to set up a Spark cluster locally on your machine by usi

The ``spark-master`` container mounts the ``./migrator/target/scala-2.13`` dir on ``/jars`` and the repository root on ``/app``.

See also our `general recommendations to tune the Spark job <../#run-the-migration>`_.
See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.

7. You can monitor progress by observing the Spark web console you opened in step 4. Additionally, after the job has started, you can track progress via ``http://localhost:4040``.

Expand Down
4 changes: 4 additions & 0 deletions docs/source/getting-started/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,10 @@ A Spark cluster is made of several *nodes*, which can contain several *workers*

We recommend provisioning at least 2 GB of memory per CPU on each node. For instance, a cluster node with 4 CPUs should have at least 8 GB of memory.

.. caution::

Make sure the Spark version, the Scala version, and the Migrator version you use are `compatible together <../#compatibility-matrix>`_.

The following pages describe various alternative ways to set up a Spark cluster:

* :doc:`on your infrastructure, using Ansible </getting-started/ansible>`,
Expand Down
4 changes: 2 additions & 2 deletions docs/source/getting-started/spark-standalone.rst
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ This page describes how to set up a Spark cluster on your infrastructure and to
wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
--output-document=config.yaml

4. `Configure the migration <../#configure-the-migration>`_ according to your needs.
4. `Configure the migration <./#configure-the-migration>`_ according to your needs.

5. Finally, run the migration as follows from the Spark master node.

Expand All @@ -32,6 +32,6 @@ This page describes how to set up a Spark cluster on your infrastructure and to
--conf spark.scylla.config=<path to config.yaml> \
<path to scylla-migrator-assembly.jar>

See also our `general recommendations to tune the Spark job <../#run-the-migration>`_.
See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.

6. You can monitor progress from the `Spark web UI <https://spark.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging>`_.
3 changes: 2 additions & 1 deletion docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ The ScyllaDB Migrator is a Spark application that migrates data to ScyllaDB. Its
* It can rename columns along the way.
* When migrating from DynamoDB it can transfer a snapshot of the source data, or continuously migrate new data as they come.

Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster for a migration.
Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster and to configure your migration. Alternatively, follow our :doc:`step-by-step tutorial to perform a migration between fake databases using Docker </tutorials/dynamodb-to-scylladb-alternator/index>`.

--------------------
Compatibility Matrix
Expand All @@ -33,3 +33,4 @@ Migrator Spark Scala
rename-columns
validate
configuration
tutorials/index
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
#!/usr/bin/env sh

generate_25_items() {
local items=""
for i in `seq 1 25`; do
items="${items}"'{
"PutRequest": {
"Item": {
"id": { "S": "'"$(uuidgen)"'" },
"col1": { "S": "'"$(uuidgen)"'" },
"col2": { "S": "'"$(uuidgen)"'" },
"col3": { "S": "'"$(uuidgen)"'" },
"col4": { "S": "'"$(uuidgen)"'" },
"col5": { "S": "'"$(uuidgen)"'" }
}
}
},'
done
echo "${items%,}" # remove trailing comma
}

aws \
--endpoint-url http://localhost:8000 \
dynamodb batch-write-item \
--request-items '{
"Example": ['"$(generate_25_items)"']
}' > /dev/null
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
#!/usr/bin/env sh

# Create table
aws \
--endpoint-url http://localhost:8000 \
dynamodb create-table \
--table-name Example \
--attribute-definitions AttributeName=id,AttributeType=S \
--key-schema AttributeName=id,KeyType=HASH \
--provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

# Add items in parallel
# Change 40000 into 400 below for a faster demo (10,000 items instead of 1,000,000)
seq 1 40000 | xargs --max-procs=8 --max-args=1 ./create-25-items.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
services:

dynamodb:
command: "-jar DynamoDBLocal.jar -sharedDb -inMemory"
image: "amazon/dynamodb-local:2.5.2"
ports:
- "8000:8000"
working_dir: /home/dynamodblocal

spark-master:
build: dockerfiles/spark
command: master
environment:
SPARK_PUBLIC_DNS: localhost
ports:
- 4040:4040
- 8080:8080
volumes:
- ./spark-data:/app

spark-worker:
build: dockerfiles/spark
command: worker
environment:
SPARK_WORKER_CORES: 2
SPARK_WORKER_MEMORY: 4G
SPARK_WORKER_WEBUI_PORT: 8081
SPARK_PUBLIC_DNS: localhost
ports:
- 8081:8081
depends_on:
- spark-master

scylla:
image: scylladb/scylla:6.0.1
expose:
- 8001
ports:
- "8001:8001"
command: "--smp 1 --memory 2048M --alternator-port 8001 --alternator-write-isolation only_rmw_uses_lwt"
151 changes: 151 additions & 0 deletions docs/source/tutorials/dynamodb-to-scylladb-alternator/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
=========================================================
Migrate from DynamoDB to ScyllaDB Alternator Using Docker
=========================================================

In this tutorial, you will replicate 1,000,000 items from a DynamoDB table to ScyllaDB Alternator.

All the scripts and configuration files shown on the tutorial can be found in our `GitHub repository <https://github.com/scylladb/scylla-migrator/tree/master/docs/source/tutorials/dynamodb-to-scylladb-alternator>`_.

The whole system is composed of the DynamoDB service, a Spark cluster with a single worker node, and a ScyllaDB cluster with a single node, as illustrated below:

.. image:: architecture.png
:alt: Architecture
:width: 600

To follow this tutorial, you need to install `Docker <https://docker.com>`_ and the `AWS CLI <https://aws.amazon.com/cli/>`_.

----------------------------------------------------
Set Up the Services and Populate the Source Database
----------------------------------------------------

In an empty directory, create the following ``docker-compose.yaml`` file to define all the services:

.. literalinclude:: docker-compose.yaml
:language: YAML

Let’s break down this Docker Compose file.

1. We define the DynamoDB service by reusing the official image ``amazon/dynamodb-local``. We use the TCP port 8000 for communicating with DynamoDB.
2. We define the Spark master and Spark worker services by using a custom image (see below). Indeed, the official Docker images for Spark 3.5.1 only support Scala 2.12 for now, but we need Scala 2.13. We mount the local directory ``./spark-data`` to the Spark master container path ``/app`` so that we can supply the Migrator jar and configuration to the Spark master node. We expose the ports 8080 and 4040 of the master node to access the Spark UIs from our host environment. We allocate 2 cores and 4 GB of memory to the Spark worker node. As a general rule, we recommend allocating 2 GB of memory per core on each worker.
3. We define the ScyllaDB service by reusing the official image ``scylladb/scylla``. We use the TCP port 8001 for communicating with ScyllaDB Alternator.

Create the ``Dockerfile`` required by the Spark services at path ``./dockerfiles/spark/Dockerfile`` and write the following content:

.. literalinclude:: dockerfiles/spark/Dockerfile
:language: Dockerfile

This ``Dockerfile`` installs Java and a Spark distribution. It uses a custom shell script as entry point. Create the file ``./dockerfiles/spark/entrypoint.sh``, and write the following content:

.. literalinclude:: dockerfiles/spark/entrypoint.sh
:language: sh

The entry point takes an argument that can be either ``master`` or ``worker`` to control whether to start a master node or a worker node.

Prepare your system for building the Spark Docker image with the following commands, which create the ``spark-data`` directory and make the entry point executable:

.. code-block:: sh

mkdir spark-data
chmod +x entrypoint.sh

Finally, start all the services with the following command:

.. code-block:: sh

docker compose up

Your system's Docker daemon will download the DynamoDB and ScyllaDB images and build your Spark Docker image.

Check that you can access the Spark cluster UI by opening http://localhost:8080 in your browser. You should see your worker node in the workers list.

.. image:: spark-cluster.png
:alt: Spark UI listing the worker node
:width: 883

Once all the services are up, you can access your local DynamoDB instance and your local ScyllaDB instance by using the standard AWS CLI. Make sure to configure the AWS CLI as follows before running the ``dynamodb`` commands:

.. code-block:: sh

# Set dummy region and credentials
aws configure set region us-west-1
aws configure set aws_access_key_id dummy
aws configure set aws_secret_access_key dummy
# Access DynamoDB
aws --endpoint-url http://localhost:8000 dynamodb list-tables
# Access ScyllaDB Alternator
aws --endpoint-url http://localhost:8001 dynamodb list-tables

The last preparatory step consists of creating a table in DynamoDB and filling it with random data. Create a file named ``create-data.sh``, make it executable, and write the following content into it:

.. literalinclude:: create-data.sh
:language: sh

This script creates a table named ``Example`` and adds 1 million items to it. It does so by invoking another script, ``create-25-items.sh``, which uses the ``batch-write-item`` command to insert 25 items in a single call:

.. literalinclude:: create-25-items.sh
:language: sh

Every added item contains an id and five columns, all filled with random data.
Run the script ``./create-data.sh`` and wait for a couple of hours until all the data is inserted (or change the last line of ``create-data.sh`` to insert fewer items).

---------------------
Perform the Migration
---------------------

Once you have set up the services and populated the source database, you are ready to perform the migration.

Download the latest stable release of the Migrator in the ``spark-data`` directory:

.. code-block::

wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
--directory-prefix=./spark-data

Create a configuration file in ``./spark-data/config.yaml`` and write the following content:

.. literalinclude:: spark-data/config.yaml
:language: YAML

This configuration tells the Migrator to read the items from the table ``Example`` in the ``dynamodb`` service, and to write them to the table of the same name in the ``scylla`` service.

Finally, start the migration with the following command:

.. literalinclude:: run-migrator.sh
:language: sh

This command calls ``spark-submit`` in the ``spark-master`` service with the file ``scylla-migrator-assembly.jar``, which bundles the Migrator and all its dependencies.

In the ``spark-submit`` command invocation, we explicitly tell Spark to use 4 GB of memory; otherwise, it would default to 1 GB only. We also explicitly tell Spark to use 2 cores. This is not really necessary as the default behavior is to use all the available cores, but we set it for the sake of illustration. If the Spark worker node had 20 cores, it would be better to use only 10 cores per executor to optimize the throughput (big executors require more memory management operations, which decrease the overall application performance). We would achieve this by passing ``--executor-cores 10``, and the Spark engine would allocate two executors for our application to fully utilize the resources of the worker node.

The migration process inspects the source table, replicates its schema to the target database if it does not exist, and then migrates the data. The data migration uses the Hadoop framework under the hood to leverage the Spark cluster resources. The migration process breaks down the data to transfer chunks of about 128 MB each, and processes all the partitions in parallel. Since the source is a DynamoDB table in our example, each partition translates into a `scan segment <https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.ParallelScan>`_ to maximize the parallelism level when reading the data. Here is a diagram that illustrates the migration process:

.. image:: process.png
:alt: Migration process
:width: 700

During the execution of the command, a lot of logs are printed, mostly related to Spark scheduling. Still, you should be able to spot the following relevant lines:

.. code-block:: text

24/07/22 15:46:13 INFO migrator: ScyllaDB Migrator 0.9.2
24/07/22 15:46:20 INFO alternator: We need to transfer: 2 partitions in total
24/07/22 15:46:20 INFO alternator: Starting write…
24/07/22 15:46:20 INFO DynamoUtils: Checking for table existence at destination

And when the migration ends, you will see the following line printed:

.. code-block:: text

24/07/22 15:46:24 INFO alternator: Done transferring table snapshot

During the migration, it is possible to monitor the underlying Spark job by opening the Spark UI available at http://localhost:4040.

.. image:: stages.png
:alt: Spark stages
:width: 900

`Example of a migration broken down in 6 tasks. The Spark UI allows us to follow the overall progress, and it can also show specific metrics such as the memory consumption of an executor`.

In our example the size of the source table is ~200 MB. In practice, it is common to migrate tables containing several terabytes of data. If necessary, and as long as your DynamoDB source supports a higher read throughput level, you can increase the migration throughput by adding more Spark worker nodes. The Spark engine will automatically spread the workload between all the worker nodes.

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
docker compose exec spark-master \
/spark/bin/spark-submit \
--executor-memory 4G \
--executor-cores 2 \
--class com.scylladb.migrator.Migrator \
--master spark://spark-master:7077 \
--conf spark.driver.host=spark-master \
--conf spark.scylla.config=/app/config.yaml \
/app/scylla-migrator-assembly.jar
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Loading