Add documentation content
- Set up a Spark cluster with Ansible (and remove corresponding section from README)
- Set up a Spark cluster with AWS EMR
- Set up a Spark cluster manually
- Set up a Spark cluster with Docker (and remove corresponding section from README)
- Configure source for C* migration
- Minor fixes in `config.yaml.example`
- Minor typo fixes in Ansible files
julienrf committed Jun 27, 2024
1 parent 995a019 commit c78adca
Showing 12 changed files with 376 additions and 87 deletions.
77 changes: 2 additions & 75 deletions README.md
@@ -1,31 +1,3 @@
# Ansible deployment

An Ansible playbook is provided in the `ansible` folder. The playbook installs the prerequisites and Spark on the master and workers listed in the `ansible/inventory/hosts` file. The Scylla Migrator is installed on the Spark master node.
1. Update the `ansible/inventory/hosts` file with the master and worker instances
2. Update `ansible/ansible.cfg` with the location of your private key, if necessary
3. The `ansible/templates/spark-env-master-sample` and `ansible/templates/spark-env-worker-sample` files contain the environment variables that determine the number of workers, CPUs per worker, and memory allocations, as well as considerations for setting them.
4. Run `ansible-playbook scylla-migrator.yml`
5. On the Spark master node:
   `cd scylla-migrator`
   `./start-spark.sh`
6. On the Spark worker nodes:
   `./start-slave.sh`
7. Open the Spark web console
   - Ensure networking is configured to allow you to access the Spark master node via ports 8080 and 4040
   - Visit http://<spark-master-hostname>:8080
8. Review and modify `config.yaml` based on whether you're performing a migration to CQL or Alternator
   - If you're migrating to the Scylla CQL interface (from Cassandra, Scylla, or another CQL source), make a copy of `config.yaml.example`, review the comments, and edit as directed.
   - If you're migrating to Alternator (from DynamoDB or another Scylla Alternator deployment), make a copy of `config.dynamodb.yml`, review the comments, and edit as directed.
9. As part of the Ansible deployment, sample submit jobs were created. You may edit and use them.
   - For a CQL migration: Edit `scylla-migrator/submit-cql-job.sh` and change the line `--conf spark.scylla.config=config.yaml \` to point to whatever you named the `config.yaml` in the previous step.
   - For an Alternator migration: Edit `scylla-migrator/submit-alternator-job.sh` and change the line `--conf spark.scylla.config=/home/ubuntu/scylla-migrator/config.dynamodb.yml \` to reference the configuration file you created and modified in the previous step.
10. Ensure the table has been created in the target environment.
11. Submit the migration by submitting the appropriate job
   - CQL migration: `./submit-cql-job.sh`
   - Alternator migration: `./submit-alternator-job.sh`
12. You can monitor progress by observing the Spark web console you opened in step 7. Additionally, after the job has started, you can track progress via http://<spark-master-hostname>:4040.
FYI: When no Spark jobs are actively running, the Spark progress page at port 4040 is unavailable; it renders only while a Spark job is in progress.

# Configuring the Migrator

Create a `config.yaml` for your migration using the template `config.yaml.example` in the repository root. Read the comments throughout carefully.
@@ -74,54 +46,9 @@
spark-submit --class com.scylladb.migrator.Migrator \
<path to scylla-migrator-assembly.jar>
```

# Running the validator

This project also includes an entrypoint for comparing the source
table and the target table. You can launch it as follows (after performing
the previous steps):

```shell
spark-submit --class com.scylladb.migrator.Validator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
<path to scylla-migrator-assembly.jar>
```

# Running locally

To run in the local Docker-based setup:

1. First start the environment:
```shell
docker compose up -d
```

2. Launch `cqlsh` in Cassandra's container and create a keyspace and a table with some data:
```shell
docker compose exec cassandra cqlsh
<create stuff>
```

3. Launch `cqlsh` in Scylla's container and create the destination keyspace and table with the same schema as the source table:
```shell
docker compose exec scylla cqlsh
<create stuff>
```

4. Edit the `config.yaml` file; note the comments throughout.

5. Run `build.sh`.

6. Then, launch `spark-submit` in the master's container to run the job:
```shell
docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
--master spark://spark-master:7077 \
--conf spark.driver.host=spark-master \
--conf spark.scylla.config=/app/config.yaml \
/jars/scylla-migrator-assembly.jar
```
# Documentation

The `spark-master` container mounts the `./migrator/target/scala-2.13` dir on `/jars` and the repository root on `/app`. To update the jar with new code, just run `build.sh` and then run `spark-submit` again.
See https://migrator.docs.scylladb.com.

# Building

2 changes: 1 addition & 1 deletion ansible/templates/spark-env-master-sample
@@ -8,7 +8,7 @@
# MEMORY is used in the spark-submit job and allocates the memory per executor.
# You can have one or more executors per worker.
#
# By using multiple workers on an instance, we can control the velocit of the migration.
# By using multiple workers on an instance, we can control the velocity of the migration.
#
# Eg.
# Target system is 3 x i4i.4xlarge (16 vCPU, 128G)
2 changes: 1 addition & 1 deletion ansible/templates/spark-env-worker-sample
@@ -8,7 +8,7 @@
# MEMORY is used in the spark-submit job and allocates the memory per executor.
# You can have one or more executors per worker.
#
# By using multiple workers on an instance, we can control the velocit of the migration.
# By using multiple workers on an instance, we can control the velocity of the migration.
#
# Eg.
# Target system is 3 x i4i.4xlarge (16 vCPU, 128G)
3 changes: 1 addition & 2 deletions config.yaml.example
@@ -268,8 +268,7 @@
renames: []
# create a savepoint file with this filled.
skipTokenRanges: []

# Configuration section for running the validator. The validator is run manually (see README)
# and currently only supports comparing a Cassandra source to a Scylla target.
# Configuration section for running the validator. The validator is run manually (see README).
validation:
# Should WRITETIMEs and TTLs be compared?
compareTimestamps: true
35 changes: 35 additions & 0 deletions docs/source/configuration.rst
@@ -0,0 +1,35 @@
=======================
Configuration Reference
=======================

------------------
AWS Authentication
------------------

When reading from DynamoDB or S3, or when writing to DynamoDB, the communication with AWS can be configured with the properties ``credentials``, ``endpoint``, and ``region`` in the configuration:

.. code-block:: yaml

   credentials:
     accessKey: <access-key>
     secretKey: <secret-key>
   # Optional AWS endpoint configuration
   endpoint:
     host: <host>
     port: <port>
   # Optional AWS availability region, required if you use a custom endpoint
   region: <region>

Additionally, you can authenticate with `AssumeRole <https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html>`_. In such a case, the ``accessKey`` and ``secretKey`` are the credentials of the user whose access to the resource (DynamoDB table or S3 bucket) has been granted via a “role”, and you need to add the property ``assumeRole`` as follows:

.. code-block:: yaml

   credentials:
     accessKey: <access-key>
     secretKey: <secret-key>
     assumeRole:
       arn: <role-arn>
       # Optional session name to use. If not set, we use 'scylla-migrator'.
       sessionName: <role-session-name>
   # Note that the region is mandatory when you use `assumeRole`
   region: <region>
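
Under the hood, ``assumeRole`` corresponds to an STS AssumeRole call. A quick way to check that your base credentials are allowed to assume the role is the AWS CLI sketch below (the role ARN is a placeholder): ::

   # Check that the base credentials can assume the role (placeholder ARN)
   aws sts assume-role \
     --role-arn <role-arn> \
     --role-session-name scylla-migrator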
40 changes: 40 additions & 0 deletions docs/source/getting-started/ansible.rst
@@ -1,3 +1,43 @@
===================================
Set Up a Spark Cluster with Ansible
===================================

An `Ansible <https://www.ansible.com/>`_ playbook is provided in the `ansible <https://github.com/scylladb/scylla-migrator/tree/master/ansible>`_ folder of our Git repository. The playbook installs the prerequisites and Spark on the master and workers listed in the ``ansible/inventory/hosts`` file. The Scylla Migrator is installed on the Spark master node.

1. Update the ``ansible/inventory/hosts`` file with the master and worker instances (a sample inventory sketch follows this list).
2. Update ``ansible/ansible.cfg`` with the location of your private key, if necessary.
3. The ``ansible/templates/spark-env-master-sample`` and ``ansible/templates/spark-env-worker-sample`` files contain the environment variables that determine the number of workers, CPUs per worker, and memory allocations, as well as considerations for setting them.
4. Run ``ansible-playbook scylla-migrator.yml``.
5. On the Spark master node: ::

cd scylla-migrator
./start-spark.sh

6. On the Spark worker nodes: ::

./start-slave.sh

7. Open the Spark web console.

- Ensure networking is configured to allow you to access the Spark master node via TCP ports 8080 and 4040.
- Visit ``http://<spark-master-hostname>:8080``.

8. Review and modify ``config.yaml`` based on whether you're performing a migration to CQL or Alternator.

- If you're migrating to the Scylla CQL interface (from Cassandra, Scylla, or another CQL source), make a copy of ``config.yaml.example``, review the comments, and edit as directed.
- If you're migrating to Alternator (from DynamoDB or another Scylla Alternator deployment), make a copy of ``config.dynamodb.yml``, review the comments, and edit as directed.

9. As part of the Ansible deployment, sample submit jobs were created. You may edit and use them.

- For a CQL migration: edit ``scylla-migrator/submit-cql-job.sh`` and change the line ``--conf spark.scylla.config=config.yaml \`` to point to whatever you named the ``config.yaml`` in the previous step.
- For an Alternator migration: edit ``scylla-migrator/submit-alternator-job.sh`` and change the line ``--conf spark.scylla.config=/home/ubuntu/scylla-migrator/config.dynamodb.yml \`` to reference the configuration file you created and modified in the previous step.

10. Ensure the table has been created in the target environment.
11. Start the migration by submitting the appropriate job

- CQL migration: ``./submit-cql-job.sh``
- Alternator migration: ``./submit-alternator-job.sh``

12. You can monitor progress by observing the Spark web console you opened in step 7. Additionally, after the job has started, you can track progress via ``http://<spark-master-hostname>:4040``.

FYI: When no Spark jobs are actively running, the Spark progress page at port 4040 is unavailable. It renders only while a Spark job is in progress.
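
For reference, here is a sketch of creating a minimal ``ansible/inventory/hosts`` file from a shell. The group names (``spark_master``, ``spark_workers``) and the addresses are hypothetical; use the group names expected by the playbook in your checkout. ::

   # Hypothetical sketch: adjust group names and addresses to your environment
   cat > ansible/inventory/hosts <<'EOF'
   [spark_master]
   10.0.0.10

   [spark_workers]
   10.0.0.11
   10.0.0.12
   EOF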
57 changes: 57 additions & 0 deletions docs/source/getting-started/aws-emr.rst
@@ -2,3 +2,60 @@
Set Up a Spark Cluster with AWS EMR
===================================

This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.com/emr/>`_. This approach is useful if you already have an AWS account, or if you do not want to manage your infrastructure manually.

1. Download the file ``config.yaml.example`` from our Git repository. ::

wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
--output-document=config.yaml

2. `Configure the migration </getting-started/#configure-the-migration>`_ according to your needs.

3. Download the latest release of the Migrator. ::

wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar

4. Upload them to an S3 bucket. ::

aws s3 cp config.yaml s3://<your-bucket>/scylla-migrator/config.yaml
aws s3 cp scylla-migrator-assembly.jar s3://<your-bucket>/scylla-migrator/scylla-migrator-assembly.jar

Replace ``<your-bucket>`` with an S3 bucket name that you manage.

Each time you change the migration configuration, re-upload it to the bucket.

5. Create a script named ``copy-files.sh`` that loads the files ``config.yaml`` and ``scylla-migrator-assembly.jar`` from your S3 bucket. ::

#!/bin/bash
aws s3 cp s3://<your-bucket>/scylla-migrator/config.yaml /mnt1/config.yaml
aws s3 cp s3://<your-bucket>/scylla-migrator/scylla-migrator-assembly.jar /mnt1/scylla-migrator-assembly.jar

6. Upload the script to your S3 bucket as well. ::

aws s3 cp copy-files.sh s3://<your-bucket>/scylla-migrator/copy-files.sh

7. Log in to the `AWS EMR console <https://console.aws.amazon.com/emr>`_.

8. Choose “Create cluster” to create a new cluster based on EC2.

9. Configure the cluster as follows:

- Choose the EMR release ``emr-7.1.0``, or any EMR release that is compatible with the Spark version used by the Migrator.
- Make sure to include Spark in the application bundle.
- Choose all-purpose EC2 instance types (e.g., i4i).
- Make sure to include at least one task node.
- Add a Step to run the Migrator:

- Type: Custom JAR
- JAR location: ``command-runner.jar``
- Arguments: ::

spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar

- Add a Bootstrap action to download the Migrator and the migration configuration:

- Script location: ``s3://<your-bucket>/scylla-migrator/copy-files.sh``

10. Finalize your cluster configuration according to your needs, and choose “Create cluster”. (An equivalent AWS CLI invocation is sketched after this list.)

11. The migration will start automatically once the cluster is fully up.
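
Alternatively, the same cluster can be created from the command line. The following ``aws emr create-cluster`` invocation is only a sketch: the cluster name, instance type, and instance count are placeholders, and it assumes the default EMR IAM roles already exist in your account (they can be created with ``aws emr create-default-roles``). ::

   # Sketch: create an EMR cluster that runs the Migrator as a step (placeholders throughout)
   aws emr create-cluster \
     --name scylla-migrator \
     --release-label emr-7.1.0 \
     --applications Name=Spark \
     --use-default-roles \
     --instance-type i4i.xlarge \
     --instance-count 3 \
     --bootstrap-actions Path=s3://<your-bucket>/scylla-migrator/copy-files.sh \
     --steps 'Type=CUSTOM_JAR,Name=scylla-migrator,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,--class,com.scylladb.migrator.Migrator,--conf,spark.scylla.config=/mnt1/config.yaml,/mnt1/scylla-migrator-assembly.jar]'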
45 changes: 45 additions & 0 deletions docs/source/getting-started/docker.rst
@@ -0,0 +1,45 @@
==================================
Set Up a Spark Cluster with Docker
==================================

This page describes how to set up a Spark cluster locally on your machine by using Docker containers. This approach is useful if you do not need a high level of performance and want to quickly try out the Migrator without having to set up a real cluster of nodes. It requires Docker and Git.

1. Clone the Migrator repository. ::

git clone https://github.com/scylladb/scylla-migrator.git
cd scylla-migrator

2. Download the latest release of the ``scylla-migrator-assembly.jar`` and put it in the directory ``migrator/target/scala-2.13/``. ::

mkdir -p migrator/target/scala-2.13
wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
--directory-prefix=migrator/target/scala-2.13

3. Start the Spark cluster. ::

docker compose up -d

4. Open the Spark web UI.

http://localhost:8080

Tip: add the following aliases to your ``/etc/hosts`` to make the links in the Spark UI work ::

127.0.0.1 spark-master
127.0.0.1 spark-worker

5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure </getting-started/#configure-the-migration>`_ it according to your needs.

6. Finally, run the migration. ::

docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
--master spark://spark-master:7077 \
--conf spark.driver.host=spark-master \
--conf spark.scylla.config=/app/config.yaml \
/jars/scylla-migrator-assembly.jar

The ``spark-master`` container mounts the ``./migrator/target/scala-2.13`` directory on ``/jars`` and the repository root on ``/app``.

7. You can monitor progress by observing the Spark web console you opened in step 4. Additionally, after the job has started, you can track progress via ``http://localhost:4040``. (A few handy commands for inspecting the containers follow below.)

FYI: When no Spark jobs are actively running, the Spark progress page at port 4040 is unavailable. It renders only while a Spark job is in progress.
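
While the job runs, the standard Docker Compose commands are handy for inspecting the cluster (the ``spark-master`` service name comes from the Compose file at the repository root): ::

   # List the services of the Compose project and their status
   docker compose ps
   # Follow the logs of the Spark master
   docker compose logs -f spark-master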
35 changes: 34 additions & 1 deletion docs/source/getting-started/index.rst
@@ -2,9 +2,42 @@
Getting Started
===============

Since the Migrator is packaged as a Spark application, you have to set up a Spark cluster to use it. Then, you submit the application along with its :doc:`configuration </configuration>` on the Spark cluster, which will execute the migration by reading from your source database and writing to your target database.

----------------------
Set Up a Spark Cluster
----------------------

The following pages describe various alternative ways to set up a Spark cluster:

* on your infrastructure, using :doc:`Ansible </getting-started/ansible>`,
* on your infrastructure, :doc:`manually </getting-started/spark-standalone>`,
* using :doc:`AWS EMR </getting-started/aws-emr>`,
* or, on a single machine, using :doc:`Docker </getting-started/docker>`.

-----------------------
Configure the Migration
-----------------------

Once you have a Spark cluster ready to run the ``scylla-migrator-assembly.jar``, download the file `config.yaml.example <https://github.com/scylladb/scylla-migrator/blob/master/config.yaml.example>`_ and rename it to ``config.yaml``. This file contains properties such as ``source`` or ``target`` defining how to connect to the source database and to the target database, as well as other settings to perform the migration. Adapt it to your case according to the following guides:

- :doc:`migrate from Cassandra or Parquet files to ScyllaDB </migrate-from-cassandra-or-parquet>`,
- or, :doc:`migrate from DynamoDB to ScyllaDB’s Alternator </migrate-from-dynamodb>`.
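
For instance, assuming ``wget`` is available on your machine, you can fetch and rename the template as follows: ::

   # Download the configuration template and rename it to config.yaml
   wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
     --output-document=config.yaml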

--------------
Extra Features
--------------

You might also be interested in the following extra features:

* :doc:`rename columns along the migration </rename-columns>`,
* :doc:`replicate changes applied to the source data after the initial snapshot transfer has completed </stream-changes>`,
* :doc:`validate that the migration was complete and correct </validate>`.

.. toctree::
:hidden:

ansible
aws-emr
spark-standalone
aws-emr
docker
23 changes: 23 additions & 0 deletions docs/source/getting-started/spark-standalone.rst
@@ -2,3 +2,26 @@
Manual Set Up of a Spark Cluster
================================

This page describes how to set up a Spark cluster on your own infrastructure and how to use it to perform a migration.

1. Follow the `official documentation <https://spark.apache.org/docs/latest/spark-standalone.html>`_ to install Spark on each node of your cluster, and start the Spark master and the Spark workers.

2. On the Spark master node, download the latest release of the Migrator. ::

wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar

3. On the Spark master node, download the file ``config.yaml.example`` from our Git repository. ::

wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
--output-document=config.yaml

4. `Configure the migration </getting-started/#configure-the-migration>`_ according to your needs.

5. Finally, run the migration as follows from the Spark master node. ::

spark-submit --class com.scylladb.migrator.Migrator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
<path to scylla-migrator-assembly.jar>

6. You can monitor progress from the `Spark web UI <https://spark.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging>`_. (A sketch of running the validator afterwards follows below.)
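
After the migration completes, the project's validator entry point can be submitted from the same node in the same way, swapping the main class for ``com.scylladb.migrator.Validator``: ::

   # Compare the source table with the target table after the migration
   spark-submit --class com.scylladb.migrator.Validator \
     --master spark://<spark-master-hostname>:7077 \
     --conf spark.scylla.config=<path to config.yaml> \
     <path to scylla-migrator-assembly.jar>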
9 changes: 5 additions & 4 deletions docs/source/index.rst
@@ -4,10 +4,10 @@
ScyllaDB Migrator Documentation

The Scylla Migrator is a Spark application that migrates data to ScyllaDB. Its main features are the following:

* it can read from Cassandra, Parquet, DynamoDB, or a DynamoDB S3 export
* it can be distributed over multiple nodes of a Spark cluster to scale with your database cluster
* it can rename columns along the way
* it can transfer a snapshot of the source data, or continuously migrate new data as they come
* it can read from Cassandra, Parquet, DynamoDB, or a DynamoDB S3 export,
* it can be distributed over multiple nodes of a Spark cluster to scale with your database cluster,
* it can rename columns along the way,
* it can transfer a snapshot of the source data, or continuously migrate new data as they come.

Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster for a migration.

@@ -20,3 +20,4 @@
Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster for a migration.
stream-changes
rename-columns
validate
configuration