Add documentation content
- Set up a Spark cluster with Ansible (and remove corresponding section from README)
- Set up a Spark cluster with AWS EMR
- Set up a Spark cluster manually
- Set up a Spark cluster with Docker (and remove corresponding section from README)
- Configure source for C* migration
- Minor fixes in `config.yaml.example`
- Minor typo fixes in Ansible files
julienrf committed Jun 27, 2024
1 parent 995a019 commit c78adca
Showing 12 changed files with 376 additions and 87 deletions.
77 changes: 2 additions & 75 deletions README.md
@@ -1,31 +1,3 @@
# Ansible deployment

An Ansible playbook is provided in the `ansible` folder. The playbook installs the prerequisites and Spark on the master and workers listed in the `ansible/inventory/hosts` file. The Scylla Migrator is installed on the Spark master node.
1. Update the `ansible/inventory/hosts` file with the master and worker instances
2. Update `ansible/ansible.cfg` with the location of your private key, if necessary
3. The `ansible/templates/spark-env-master-sample` and `ansible/templates/spark-env-worker-sample` files contain the environment variables that determine the number of workers, CPUs per worker, and memory allocations, as well as considerations for setting them.
4. Run `ansible-playbook scylla-migrator.yml`
5. On the Spark master node:
   `cd scylla-migrator`
   `./start-spark.sh`
6. On the Spark worker nodes:
   `./start-slave.sh`
7. Open the Spark web console
   - Ensure networking is configured to allow you to access the Spark master node via ports 8080 and 4040
   - Visit http://<spark-master-hostname>:8080
8. Review and modify `config.yaml` based on whether you're performing a migration to CQL or Alternator
   - If you're migrating to the Scylla CQL interface (from Cassandra, Scylla, or another CQL source), make a copy of `config.yaml.example`, review the comments, and edit as directed.
   - If you're migrating to Alternator (from DynamoDB or another Scylla Alternator deployment), make a copy of `config.dynamodb.yml`, review the comments, and edit as directed.
9. As part of the Ansible deployment, sample submit jobs were created. You may edit and use them.
   - For a CQL migration: Edit `scylla-migrator/submit-cql-job.sh` and change the line `--conf spark.scylla.config=config.yaml \` to point to whatever you named the `config.yaml` in the previous step.
   - For an Alternator migration: Edit `scylla-migrator/submit-alternator-job.sh` and change the line `--conf spark.scylla.config=/home/ubuntu/scylla-migrator/config.dynamodb.yml \` to reference the configuration file you created and modified in the previous step.
10. Ensure the table has been created in the target environment.
11. Submit the migration by submitting the appropriate job
   - CQL migration: `./submit-cql-job.sh`
   - Alternator migration: `./submit-alternator-job.sh`
12. You can monitor progress by observing the Spark web console you opened in step 7. Additionally, after the job has started, you can track progress via http://<spark-master-hostname>:4040.
FYI: When no Spark jobs are actively running, the Spark progress page at port 4040 is unavailable; it renders only while a Spark job is in progress.

# Configuring the Migrator

Create a `config.yaml` for your migration using the template `config.yaml.example` in the repository root. Read the comments throughout carefully.
@@ -74,54 +46,9 @@
spark-submit --class com.scylladb.migrator.Migrator \
<path to scylla-migrator-assembly.jar>
```

# Running the validator

This project also includes an entrypoint for comparing the source
table and the target table. You can launch it as follows (after performing
the previous steps):

```shell
spark-submit --class com.scylladb.migrator.Validator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
<path to scylla-migrator-assembly.jar>
```

# Running locally

To run in the local Docker-based setup:

1. First start the environment:
```shell
docker compose up -d
```

2. Launch `cqlsh` in Cassandra's container and create a keyspace and a table with some data:
```shell
docker compose exec cassandra cqlsh
<create stuff>
```

3. Launch `cqlsh` in Scylla's container and create the destination keyspace and table with the same schema as the source table:
```shell
docker compose exec scylla cqlsh
<create stuff>
```

4. Edit the `config.yaml` file; note the comments throughout.

5. Run `build.sh`.

6. Then, launch `spark-submit` in the master's container to run the job:
```shell
docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
--master spark://spark-master:7077 \
--conf spark.driver.host=spark-master \
--conf spark.scylla.config=/app/config.yaml \
/jars/scylla-migrator-assembly.jar
```
# Documentation

The `spark-master` container mounts the `./migrator/target/scala-2.13` dir on `/jars` and the repository root on `/app`. To update the jar with new code, just run `build.sh` and then run `spark-submit` again.
See https://migrator.docs.scylladb.com.

# Building

2 changes: 1 addition & 1 deletion ansible/templates/spark-env-master-sample
@@ -8,7 +8,7 @@
# MEMORY is used in the spark-submit job and allocates the memory per executor.
# You can have one or more executors per worker.
#
# By using multiple workers on an instance, we can control the velocit of the migration.
# By using multiple workers on an instance, we can control the velocity of the migration.
#
# Eg.
# Target system is 3 x i4i.4xlarge (16 vCPU, 128G)
2 changes: 1 addition & 1 deletion ansible/templates/spark-env-worker-sample
@@ -8,7 +8,7 @@
# MEMORY is used in the spark-submit job and allocates the memory per executor.
# You can have one or more executors per worker.
#
# By using multiple workers on an instance, we can control the velocit of the migration.
# By using multiple workers on an instance, we can control the velocity of the migration.
#
# Eg.
# Target system is 3 x i4i.4xlarge (16 vCPU, 128G)
3 changes: 1 addition & 2 deletions config.yaml.example
@@ -268,8 +268,7 @@
renames: []
# create a savepoint file with this filled.
skipTokenRanges: []

# Configuration section for running the validator. The validator is run manually (see README)
# and currently only supports comparing a Cassandra source to a Scylla target.
# Configuration section for running the validator. The validator is run manually (see README).
validation:
# Should WRITETIMEs and TTLs be compared?
compareTimestamps: true
35 changes: 35 additions & 0 deletions docs/source/configuration.rst
@@ -0,0 +1,35 @@
=======================
Configuration Reference
=======================

------------------
AWS Authentication
------------------

When reading from DynamoDB or S3, or when writing to DynamoDB, the communication with AWS can be configured with the properties ``credentials``, ``endpoint``, and ``region`` in the configuration:

.. code-block:: yaml

   credentials:
     accessKey: <access-key>
     secretKey: <secret-key>
   # Optional AWS endpoint configuration
   endpoint:
     host: <host>
     port: <port>
   # Optional AWS availability region, required if you use a custom endpoint
   region: <region>

Additionally, you can authenticate with `AssumeRole <https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html>`_. In such a case, the ``accessKey`` and ``secretKey`` are the credentials of the user whose access to the resource (DynamoDB table or S3 bucket) has been granted via a “role”, and you need to add the property ``assumeRole`` as follows:

.. code-block:: yaml

   credentials:
     accessKey: <access-key>
     secretKey: <secret-key>
     assumeRole:
       arn: <role-arn>
       # Optional session name to use. If not set, we use 'scylla-migrator'.
       sessionName: <role-session-name>
   # Note that the region is mandatory when you use `assumeRole`
   region: <region>
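
Under the hood, ``assumeRole`` corresponds to an STS AssumeRole call. A quick way to check that your base credentials are allowed to assume the role is the AWS CLI sketch below (the role ARN is a placeholder): ::

   # Check that the base credentials can assume the role (placeholder ARN)
   aws sts assume-role \
     --role-arn <role-arn> \
     --role-session-name scylla-migrator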
40 changes: 40 additions & 0 deletions docs/source/getting-started/ansible.rst
@@ -1,3 +1,43 @@
===================================
Set Up a Spark Cluster with Ansible
===================================

An `Ansible <https://www.ansible.com/>`_ playbook is provided in the `ansible <https://github.com/scylladb/scylla-migrator/tree/master/ansible>`_ folder of our Git repository. The playbook installs the prerequisites and Spark on the master and workers listed in the ``ansible/inventory/hosts`` file. The Scylla Migrator is installed on the Spark master node.

1. Update the ``ansible/inventory/hosts`` file with the master and worker instances (a sample inventory sketch follows this list).
2. Update ``ansible/ansible.cfg`` with the location of your private key, if necessary.
3. The ``ansible/templates/spark-env-master-sample`` and ``ansible/templates/spark-env-worker-sample`` files contain the environment variables that determine the number of workers, CPUs per worker, and memory allocations, as well as considerations for setting them.
4. Run ``ansible-playbook scylla-migrator.yml``.
5. On the Spark master node: ::

cd scylla-migrator
./start-spark.sh

6. On the Spark worker nodes: ::

./start-slave.sh

7. Open the Spark web console.

- Ensure networking is configured to allow you to access the Spark master node via TCP ports 8080 and 4040.
- Visit ``http://<spark-master-hostname>:8080``.

8. Review and modify ``config.yaml`` based on whether you're performing a migration to CQL or Alternator.

- If you're migrating to the Scylla CQL interface (from Cassandra, Scylla, or another CQL source), make a copy of ``config.yaml.example``, review the comments, and edit as directed.
- If you're migrating to Alternator (from DynamoDB or another Scylla Alternator deployment), make a copy of ``config.dynamodb.yml``, review the comments, and edit as directed.

9. As part of the Ansible deployment, sample submit jobs were created. You may edit and use them.

- For a CQL migration: edit ``scylla-migrator/submit-cql-job.sh`` and change the line ``--conf spark.scylla.config=config.yaml \`` to point to whatever you named the ``config.yaml`` in the previous step.
- For an Alternator migration: edit ``scylla-migrator/submit-alternator-job.sh`` and change the line ``--conf spark.scylla.config=/home/ubuntu/scylla-migrator/config.dynamodb.yml \`` to reference the configuration file you created and modified in the previous step.

10. Ensure the table has been created in the target environment.
11. Start the migration by submitting the appropriate job

- CQL migration: ``./submit-cql-job.sh``
- Alternator migration: ``./submit-alternator-job.sh``

12. You can monitor progress by observing the Spark web console you opened in step 7. Additionally, after the job has started, you can track progress via ``http://<spark-master-hostname>:4040``.

FYI: When no Spark jobs are actively running, the Spark progress page at port 4040 is unavailable. It renders only while a Spark job is in progress.
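
For reference, here is a sketch of creating a minimal ``ansible/inventory/hosts`` file from a shell. The group names (``spark_master``, ``spark_workers``) and the addresses are hypothetical; use the group names expected by the playbook in your checkout. ::

   # Hypothetical sketch: adjust group names and addresses to your environment
   cat > ansible/inventory/hosts <<'EOF'
   [spark_master]
   10.0.0.10

   [spark_workers]
   10.0.0.11
   10.0.0.12
   EOF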
57 changes: 57 additions & 0 deletions docs/source/getting-started/aws-emr.rst
@@ -2,3 +2,60 @@
Set Up a Spark Cluster with AWS EMR
===================================

This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.com/emr/>`_. This approach is useful if you already have an AWS account, or if you do not want to manage your infrastructure manually.

1. Download the file ``config.yaml.example`` from our Git repository. ::

wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
--output-document=config.yaml

2. `Configure the migration </getting-started/#configure-the-migration>`_ according to your needs.

3. Download the latest release of the Migrator. ::

wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar

4. Upload them to an S3 bucket. ::

aws s3 cp config.yaml s3://<your-bucket>/scylla-migrator/config.yaml
aws s3 cp scylla-migrator-assembly.jar s3://<your-bucket>/scylla-migrator/scylla-migrator-assembly.jar

Replace ``<your-bucket>`` with an S3 bucket name that you manage.

Each time you change the migration configuration, re-upload it to the bucket.

5. Create a script named ``copy-files.sh`` that loads the files ``config.yaml`` and ``scylla-migrator-assembly.jar`` from your S3 bucket. ::

#!/bin/bash
aws s3 cp s3://<your-bucket>/scylla-migrator/config.yaml /mnt1/config.yaml
aws s3 cp s3://<your-bucket>/scylla-migrator/scylla-migrator-assembly.jar /mnt1/scylla-migrator-assembly.jar

6. Upload the script to your S3 bucket as well. ::

aws s3 cp copy-files.sh s3://<your-bucket>/scylla-migrator/copy-files.sh

7. Log in to the `AWS EMR console <https://console.aws.amazon.com/emr>`_.

8. Choose “Create cluster” to create a new cluster based on EC2.

9. Configure the cluster as follows:

- Choose the EMR release ``emr-7.1.0``, or any EMR release that is compatible with the Spark version used by the Migrator.
- Make sure to include Spark in the application bundle.
- Choose all-purpose EC2 instance types (e.g., i4i).
- Make sure to include at least one task node.
- Add a Step to run the Migrator:

- Type: Custom JAR
- JAR location: ``command-runner.jar``
- Arguments: ::

spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar

- Add a Bootstrap action to download the Migrator and the migration configuration:

- Script location: ``s3://<your-bucket>/scylla-migrator/copy-files.sh``

10. Finalize your cluster configuration according to your needs, and choose “Create cluster”. (An equivalent AWS CLI invocation is sketched after this list.)

11. The migration will start automatically once the cluster is fully up.
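
Alternatively, the same cluster can be created from the command line. The following ``aws emr create-cluster`` invocation is only a sketch: the cluster name, instance type, and instance count are placeholders, and it assumes the default EMR IAM roles already exist in your account (they can be created with ``aws emr create-default-roles``). ::

   # Sketch: create an EMR cluster that runs the Migrator as a step (placeholders throughout)
   aws emr create-cluster \
     --name scylla-migrator \
     --release-label emr-7.1.0 \
     --applications Name=Spark \
     --use-default-roles \
     --instance-type i4i.xlarge \
     --instance-count 3 \
     --bootstrap-actions Path=s3://<your-bucket>/scylla-migrator/copy-files.sh \
     --steps 'Type=CUSTOM_JAR,Name=scylla-migrator,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,--class,com.scylladb.migrator.Migrator,--conf,spark.scylla.config=/mnt1/config.yaml,/mnt1/scylla-migrator-assembly.jar]'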
45 changes: 45 additions & 0 deletions docs/source/getting-started/docker.rst
@@ -0,0 +1,45 @@
==================================
Set Up a Spark Cluster with Docker
==================================

This page describes how to set up a Spark cluster locally on your machine by using Docker containers. This approach is useful if you do not need a high level of performance and want to quickly try out the Migrator without having to set up a real cluster of nodes. It requires Docker and Git.

1. Clone the Migrator repository. ::

git clone https://github.com/scylladb/scylla-migrator.git
cd scylla-migrator

2. Download the latest release of the ``scylla-migrator-assembly.jar`` and put it in the directory ``migrator/target/scala-2.13/``. ::

mkdir -p migrator/target/scala-2.13
wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
--directory-prefix=migrator/target/scala-2.13

3. Start the Spark cluster. ::

docker compose up -d

4. Open the Spark web UI.

http://localhost:8080

Tip: add the following aliases to your ``/etc/hosts`` to make the links in the Spark UI work ::

127.0.0.1 spark-master
127.0.0.1 spark-worker

5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure </getting-started/#configure-the-migration>`_ it according to your needs.

6. Finally, run the migration. ::

docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
--master spark://spark-master:7077 \
--conf spark.driver.host=spark-master \
--conf spark.scylla.config=/app/config.yaml \
/jars/scylla-migrator-assembly.jar

The ``spark-master`` container mounts the ``./migrator/target/scala-2.13`` directory on ``/jars`` and the repository root on ``/app``.

7. You can monitor progress by observing the Spark web console you opened in step 4. Additionally, after the job has started, you can track progress via ``http://localhost:4040``. (A few handy commands for inspecting the containers follow below.)

FYI: When no Spark jobs are actively running, the Spark progress page at port 4040 is unavailable. It renders only while a Spark job is in progress.
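
While the job runs, the standard Docker Compose commands are handy for inspecting the cluster (the ``spark-master`` service name comes from the Compose file at the repository root): ::

   # List the services of the Compose project and their status
   docker compose ps
   # Follow the logs of the Spark master
   docker compose logs -f spark-master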
35 changes: 34 additions & 1 deletion docs/source/getting-started/index.rst
@@ -2,9 +2,42 @@
Getting Started
===============

Since the Migrator is packaged as a Spark application, you have to set up a Spark cluster to use it. Then, you submit the application along with its :doc:`configuration </configuration>` on the Spark cluster, which will execute the migration by reading from your source database and writing to your target database.

----------------------
Set Up a Spark Cluster
----------------------

The following pages describe various alternative ways to set up a Spark cluster:

* on your infrastructure, using :doc:`Ansible </getting-started/ansible>`,
* on your infrastructure, :doc:`manually </getting-started/spark-standalone>`,
* using :doc:`AWS EMR </getting-started/aws-emr>`,
* or, on a single machine, using :doc:`Docker </getting-started/docker>`.

-----------------------
Configure the Migration
-----------------------

Once you have a Spark cluster ready to run the ``scylla-migrator-assembly.jar``, download the file `config.yaml.example <https://github.com/scylladb/scylla-migrator/blob/master/config.yaml.example>`_ and rename it to ``config.yaml``. This file contains properties such as ``source`` or ``target`` defining how to connect to the source database and to the target database, as well as other settings to perform the migration. Adapt it to your case according to the following guides:

- :doc:`migrate from Cassandra or Parquet files to ScyllaDB </migrate-from-cassandra-or-parquet>`,
- or, :doc:`migrate from DynamoDB to ScyllaDB’s Alternator </migrate-from-dynamodb>`.
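
For instance, assuming ``wget`` is available on your machine, you can fetch and rename the template as follows: ::

   # Download the configuration template and rename it to config.yaml
   wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
     --output-document=config.yaml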

--------------
Extra Features
--------------

You might also be interested in the following extra features:

* :doc:`rename columns along the migration </rename-columns>`,
* :doc:`replicate changes applied to the source data after the initial snapshot transfer has completed </stream-changes>`,
* :doc:`validate that the migration was complete and correct </validate>`.

.. toctree::
:hidden:

ansible
aws-emr
spark-standalone
aws-emr
docker
23 changes: 23 additions & 0 deletions docs/source/getting-started/spark-standalone.rst
@@ -2,3 +2,26 @@
Manual Set Up of a Spark Cluster
================================

This page describes how to set up a Spark cluster on your own infrastructure and how to use it to perform a migration.

1. Follow the `official documentation <https://spark.apache.org/docs/latest/spark-standalone.html>`_ to install Spark on each node of your cluster, and start the Spark master and the Spark workers.

2. On the Spark master node, download the latest release of the Migrator. ::

wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar

3. On the Spark master node, download the file ``config.yaml.example`` from our Git repository. ::

wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
--output-document=config.yaml

4. `Configure the migration </getting-started/#configure-the-migration>`_ according to your needs.

5. Finally, run the migration as follows from the Spark master node. ::

spark-submit --class com.scylladb.migrator.Migrator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
<path to scylla-migrator-assembly.jar>

6. You can monitor progress from the `Spark web UI <https://spark.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging>`_. (A sketch of running the validator afterwards follows below.)
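
After the migration completes, the project's validator entry point can be submitted from the same node in the same way, swapping the main class for ``com.scylladb.migrator.Validator``: ::

   # Compare the source table with the target table after the migration
   spark-submit --class com.scylladb.migrator.Validator \
     --master spark://<spark-master-hostname>:7077 \
     --conf spark.scylla.config=<path to config.yaml> \
     <path to scylla-migrator-assembly.jar>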
9 changes: 5 additions & 4 deletions docs/source/index.rst
@@ -4,10 +4,10 @@
ScyllaDB Migrator Documentation

The Scylla Migrator is a Spark application that migrates data to ScyllaDB. Its main features are the following:

* it can read from Cassandra, Parquet, DynamoDB, or a DynamoDB S3 export
* it can be distributed over multiple nodes of a Spark cluster to scale with your database cluster
* it can rename columns along the way
* it can transfer a snapshot of the source data, or continuously migrate new data as they come
* it can read from Cassandra, Parquet, DynamoDB, or a DynamoDB S3 export,
* it can be distributed over multiple nodes of a Spark cluster to scale with your database cluster,
* it can rename columns along the way,
* it can transfer a snapshot of the source data, or continuously migrate new data as they come.

Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster for a migration.

@@ -20,3 +20,4 @@
Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster for a migration.
stream-changes
rename-columns
validate
configuration