diff --git a/README.md b/README.md index a377d0b9..c7f24df5 100644 --- a/README.md +++ b/README.md @@ -1,31 +1,3 @@ -# Ansible deployment - -An ansible playbook is provided in ansible folder. The ansible playbook will install the pre-requisites, spark, on the master and workers added to the `ansible/inventory/hosts` file. Scylla-migrator will be installed on the spark master node. -1. Update `ansible/inventory/hosts` file with master and worker instances -2. Update `ansible/ansible.cfg` with location of private key if necessary -3. The `ansible/template/spark-env-master-sample` and `ansible/template/spark-env-worker-sample` contain environment variables determining number of workers, CPUs per worker, and memory allocations - as well as considerations for setting them. -4. run `ansible-playbook scylla-migrator.yml` -5. On the spark master node: - cd scylla-migrator - `./start-spark.sh` -6. On the spark worker nodes: - `./start-slave.sh` -7. Open spark web console - - Ensure networking is configured to allow you access spark master node via 8080 and 4040 - - visit http://:8080 -8. Review and modify `config.yaml` based whether you're performing a migration to CQL or Alternator - - If you're migrating to Scylla CQL interface (from Cassandra, Scylla, or other CQL source), make a copy review the comments in `config.yaml.example`, and edit as directed. - - If you're migrating to Alternator (from DynamoDB or other Scylla Alternator), make a copy, review the comments in `config.dynamodb.yml`, and edit as directed. -9. As part of ansible deployment, sample submit jobs were created. You may edit and use the submit jobs. - - For CQL migration: Edit `scylla-migrator/submit-cql-job.sh`, change line `--conf spark.scylla.config=config.yaml \` to point to the whatever you named the config.yaml in previous step. - - For Alternator migration: Edit `scylla-migrator/submit-alternator-job.sh`, change line `--conf spark.scylla.config=/home/ubuntu/scylla-migrator/config.dynamodb.yml \` to reference the config.yaml file you created and modified in previous step. -10. Ensure the table has been created in the target environment. -11. Submit the migration by submitting the appropriate job - - CQL migration: `./submit-cql-job.sh` - - Alternator migration: `./submit-alternator-job.sh` -12. You can monitor progress by observing the spark web console you opened in step 7. Additionally, after the job has started, you can track progress via http://:4040. - FYI: When no spark jobs are actively running, the spark progress page at port 4040 displays unavailable. It is only useful and renders when a spark job is in progress. - # Configuring the Migrator Create a `config.yaml` for your migration using the template `config.yaml.example` in the repository root. Read the comments throughout carefully. @@ -74,54 +46,9 @@ spark-submit --class com.scylladb.migrator.Migrator \ ``` -# Running the validator - -This project also includes an entrypoint for comparing the source -table and the target table. You can launch it as so (after performing -the previous steps): - -```shell -spark-submit --class com.scylladb.migrator.Validator \ - --master spark://:7077 \ - --conf spark.scylla.config= \ - -``` - -# Running locally - -To run in the local Docker-based setup: - -1. First start the environment: -```shell -docker compose up -d -``` - -2. Launch `cqlsh` in Cassandra's container and create a keyspace and a table with some data: -```shell -docker compose exec cassandra cqlsh - -``` - -3. 
Launch `cqlsh` in Scylla's container and create the destination keyspace and table with the same schema as the source table: -```shell -docker compose exec scylla cqlsh - -``` - -4. Edit the `config.yaml` file; note the comments throughout. - -5. Run `build.sh`. - -6. Then, launch `spark-submit` in the master's container to run the job: -```shell -docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \ - --master spark://spark-master:7077 \ - --conf spark.driver.host=spark-master \ - --conf spark.scylla.config=/app/config.yaml \ - /jars/scylla-migrator-assembly.jar -``` +# Documentation -The `spark-master` container mounts the `./migrator/target/scala-2.13` dir on `/jars` and the repository root on `/app`. To update the jar with new code, just run `build.sh` and then run `spark-submit` again. +See https://migrator.docs.scylladb.com. # Building diff --git a/ansible/templates/spark-env-master-sample b/ansible/templates/spark-env-master-sample index c79ec7e8..60767491 100644 --- a/ansible/templates/spark-env-master-sample +++ b/ansible/templates/spark-env-master-sample @@ -8,7 +8,7 @@ # MEMORY is used in the spark-submit job and allocates the memory per executor. # You can have one or more executors per worker. # -# By using multiple workers on an instance, we can control the velocit of the migration. +# By using multiple workers on an instance, we can control the velocity of the migration. # # Eg. # Target system is 3 x i4i.4xlarge (16 vCPU, 128G) diff --git a/ansible/templates/spark-env-worker-sample b/ansible/templates/spark-env-worker-sample index e2d143e7..72826684 100644 --- a/ansible/templates/spark-env-worker-sample +++ b/ansible/templates/spark-env-worker-sample @@ -8,7 +8,7 @@ # MEMORY is used in the spark-submit job and allocates the memory per executor. # You can have one or more executors per worker. # -# By using multiple workers on an instance, we can control the velocit of the migration. +# By using multiple workers on an instance, we can control the velocity of the migration. # # Eg. # Target system is 3 x i4i.4xlarge (16 vCPU, 128G) diff --git a/config.yaml.example b/config.yaml.example index af186154..e076ec7a 100644 --- a/config.yaml.example +++ b/config.yaml.example @@ -268,8 +268,7 @@ renames: [] # create a savepoint file with this filled. skipTokenRanges: [] -# Configuration section for running the validator. The validator is run manually (see README) -# and currently only supports comparing a Cassandra source to a Scylla target. +# Configuration section for running the validator. The validator is run manually (see README). validation: # Should WRITETIMEs and TTLs be compared? compareTimestamps: true diff --git a/docs/source/configuration.rst b/docs/source/configuration.rst new file mode 100644 index 00000000..e7a80cdd --- /dev/null +++ b/docs/source/configuration.rst @@ -0,0 +1,35 @@ +======================= +Configuration Reference +======================= + +------------------ +AWS Authentication +------------------ + +When reading from DynamoDB or S3, or when writing to DynamoDB, the communication with AWS can be configured with the properties ``credentials``, ``endpoint``, and ``region`` in the configuration: + +.. code-block:: yaml + + credentials: + accessKey: + secretKey: + # Optional AWS endpoint configuration + endpoint: + host: + port: + # Optional AWS availability region, required if you use a custom endpoint + region: + +Additionally, you can authenticate with `AssumeRole `_. 
In such a case, the ``accessKey`` and ``secretKey`` are the credentials of the user whose access to the resource (DynamoDB table or S3 bucket) has been granted via a “role”, and you need to add the property ``assumeRole`` as follows:
+
+.. code-block:: yaml
+
+  credentials:
+    accessKey:
+    secretKey:
+    assumeRole:
+      arn:
+      # Optional session name to use. If not set, we use 'scylla-migrator'.
+      sessionName:
+  # Note that the region is mandatory when you use `assumeRole`
+  region:
diff --git a/docs/source/getting-started/ansible.rst b/docs/source/getting-started/ansible.rst
index e2f3cf10..0483a917 100644
--- a/docs/source/getting-started/ansible.rst
+++ b/docs/source/getting-started/ansible.rst
@@ -1,3 +1,43 @@
 ===================================
 Set Up a Spark Cluster with Ansible
 ===================================
+
+An `Ansible `_ playbook is provided in the `ansible `_ folder of our Git repository. The playbook installs the prerequisites and Spark on the master and worker instances listed in the ``ansible/inventory/hosts`` file. The Scylla Migrator is installed on the Spark master node.
+
+1. Update the ``ansible/inventory/hosts`` file with your master and worker instances.
+2. Update ``ansible/ansible.cfg`` with the location of your private key, if necessary.
+3. The ``ansible/templates/spark-env-master-sample`` and ``ansible/templates/spark-env-worker-sample`` files contain environment variables that determine the number of workers, CPUs per worker, and memory allocations, as well as considerations for setting them.
+4. Run ``ansible-playbook scylla-migrator.yml``.
+5. On the Spark master node: ::
+
+    cd scylla-migrator
+    ./start-spark.sh
+
+6. On the Spark worker nodes: ::
+
+    ./start-slave.sh
+
+7. Open the Spark web console:
+
+   - Ensure networking is configured to allow you to access the Spark master node on TCP ports 8080 and 4040.
+   - Visit ``http://<spark-master-ip>:8080``.
+
+8. Review and modify ``config.yaml`` based on whether you're performing a migration to CQL or Alternator:
+
+   - If you're migrating to the Scylla CQL interface (from Cassandra, ScyllaDB, or another CQL source), make a copy of ``config.yaml.example``, review the comments in it, and edit as directed.
+   - If you're migrating to Alternator (from DynamoDB or another ScyllaDB Alternator deployment), make a copy of ``config.dynamodb.yml``, review the comments in it, and edit as directed.
+
+9. As part of the Ansible deployment, sample submit scripts were created. You may edit and use them (a sketch of the underlying command is shown after this list).
+
+   - For a CQL migration: edit ``scylla-migrator/submit-cql-job.sh`` and change the line ``--conf spark.scylla.config=config.yaml \`` to point to whatever you named the configuration file in the previous step.
+   - For an Alternator migration: edit ``scylla-migrator/submit-alternator-job.sh`` and change the line ``--conf spark.scylla.config=/home/ubuntu/scylla-migrator/config.dynamodb.yml \`` to reference the configuration file you created and modified in the previous step.
+
+10. Ensure the table has been created in the target environment.
+11. Submit the migration by running the appropriate script:
+
+    - CQL migration: ``./submit-cql-job.sh``
+    - Alternator migration: ``./submit-alternator-job.sh``
+
+12. You can monitor progress from the Spark web console you opened in step 7. Additionally, after the job has started, you can track progress via ``http://<spark-master-ip>:4040``.
+
+    Note: the Spark progress page at port 4040 only renders while a Spark job is running; when no job is active, it is unavailable.
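+
+For reference, each submit script wraps a ``spark-submit`` invocation. A minimal sketch of the command behind ``submit-cql-job.sh`` is shown below; the master address, configuration path, and jar location are placeholders, and the actual script created by the playbook may include additional tuning options: ::
+
+    spark-submit --class com.scylladb.migrator.Migrator \
+      --master spark://<spark-master-ip>:7077 \
+      --conf spark.scylla.config=<path to your config.yaml> \
+      <path to scylla-migrator-assembly.jar>
+
+The only part you normally need to change is the ``--conf spark.scylla.config=...`` line, so that it points at the configuration file you prepared in step 8.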
diff --git a/docs/source/getting-started/aws-emr.rst b/docs/source/getting-started/aws-emr.rst
index 7cbb9649..4d523e11 100644
--- a/docs/source/getting-started/aws-emr.rst
+++ b/docs/source/getting-started/aws-emr.rst
@@ -2,3 +2,60 @@
 Set Up a Spark Cluster with AWS EMR
 ===================================
 
+This page describes how to use the Migrator in `Amazon EMR `_. This approach is useful if you already have an AWS account, or if you do not want to manage your infrastructure manually.
+
+1. Download the ``config.yaml.example`` file from our Git repository. ::
+
+     wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
+       --output-document=config.yaml
+
+2. `Configure the migration `_ according to your needs.
+
+3. Download the latest release of the Migrator. ::
+
+     wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar
+
+4. Upload both files to an S3 bucket. ::
+
+     aws s3 cp config.yaml s3://<bucket-name>/scylla-migrator/config.yaml
+     aws s3 cp scylla-migrator-assembly.jar s3://<bucket-name>/scylla-migrator/scylla-migrator-assembly.jar
+
+   Replace ``<bucket-name>`` with the name of an S3 bucket that you manage.
+
+   Each time you change the migration configuration, re-upload it to the bucket.
+
+5. Create a script named ``copy-files.sh`` that loads the files ``config.yaml`` and ``scylla-migrator-assembly.jar`` from your S3 bucket. ::
+
+     #!/bin/bash
+     aws s3 cp s3://<bucket-name>/scylla-migrator/config.yaml /mnt1/config.yaml
+     aws s3 cp s3://<bucket-name>/scylla-migrator/scylla-migrator-assembly.jar /mnt1/scylla-migrator-assembly.jar
+
+6. Upload the script to your S3 bucket as well. ::
+
+     aws s3 cp copy-files.sh s3://<bucket-name>/scylla-migrator/copy-files.sh
+
+7. Log in to the `AWS EMR console `_.
+
+8. Choose “Create cluster” to create a new cluster based on EC2.
+
+9. Configure the cluster as follows:
+
+   - Choose the EMR release ``emr-7.1.0``, or any EMR release that is compatible with the Spark version used by the Migrator.
+   - Make sure to include Spark in the application bundle.
+   - Choose all-purpose EC2 instance types (e.g., i4i).
+   - Make sure to include at least one task node.
+   - Add a Step to run the Migrator:
+
+     - Type: Custom JAR
+     - JAR location: ``command-runner.jar``
+     - Arguments: ::
+
+         spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar
+
+   - Add a Bootstrap action to download the Migrator and the migration configuration:
+
+     - Script location: ``s3://<bucket-name>/scylla-migrator/copy-files.sh``
+
+10. Finalize your cluster configuration according to your needs, and choose “Create cluster”.
+
+11. The migration starts automatically once the cluster is fully up.
diff --git a/docs/source/getting-started/docker.rst b/docs/source/getting-started/docker.rst
new file mode 100644
index 00000000..177737fb
--- /dev/null
+++ b/docs/source/getting-started/docker.rst
@@ -0,0 +1,45 @@
+==================================
+Set Up a Spark Cluster with Docker
+==================================
+
+This page describes how to set up a Spark cluster locally on your machine by using Docker containers. This approach is useful if you do not need a high level of performance and want to quickly try out the Migrator without having to set up a real cluster of nodes. It requires Docker and Git.
+
+1. Clone the Migrator repository. ::
+
+     git clone https://github.com/scylladb/scylla-migrator.git
+     cd scylla-migrator
+
+2. Download the latest release of the ``scylla-migrator-assembly.jar`` and put it in the directory ``migrator/target/scala-2.13/``. ::
+
+     mkdir -p migrator/target/scala-2.13
+     wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
+       --directory-prefix=migrator/target/scala-2.13
+
+3. Start the Spark cluster. ::
+
+     docker compose up -d
+
+4. Open the Spark web UI at ``http://localhost:8080``.
+
+   Tip: add the following aliases to your ``/etc/hosts`` to make links work in the Spark UI. ::
+
+     127.0.0.1 spark-master
+     127.0.0.1 spark-worker
+
+5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure `_ it according to your needs.
+
+6. Finally, run the migration. ::
+
+     docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
+       --master spark://spark-master:7077 \
+       --conf spark.driver.host=spark-master \
+       --conf spark.scylla.config=/app/config.yaml \
+       /jars/scylla-migrator-assembly.jar
+
+   The ``spark-master`` container mounts the ``./migrator/target/scala-2.13`` directory on ``/jars`` and the repository root on ``/app``.
+
+7. You can monitor progress from the Spark web console you opened in step 4. Additionally, after the job has started, you can track progress via ``http://localhost:4040``.
+
+   Note: the Spark progress page at port 4040 only renders while a Spark job is running; when no job is active, it is unavailable.
diff --git a/docs/source/getting-started/index.rst b/docs/source/getting-started/index.rst
index 0d40b881..6eb9f611 100644
--- a/docs/source/getting-started/index.rst
+++ b/docs/source/getting-started/index.rst
@@ -2,9 +2,42 @@
 Getting Started
 ===============
 
+Since the Migrator is packaged as a Spark application, you have to set up a Spark cluster to use it. You then submit the application along with its :doc:`configuration ` to the Spark cluster, which executes the migration by reading from your source database and writing to your target database.
+
+----------------------
+Set Up a Spark Cluster
+----------------------
+
+The following pages describe several alternative ways to set up a Spark cluster:
+
+* on your infrastructure, using :doc:`Ansible `,
+* on your infrastructure, :doc:`manually `,
+* using :doc:`AWS EMR `,
+* or, on a single machine, using :doc:`Docker `.
+
+-----------------------
+Configure the Migration
+-----------------------
+
+Once you have a Spark cluster ready to run the ``scylla-migrator-assembly.jar``, download the file `config.yaml.example `_ and rename it to ``config.yaml``. This file contains properties such as ``source`` and ``target`` that define how to connect to the source database and to the target database, as well as other settings to perform the migration. Adapt it to your case according to the following guides:

+- :doc:`migrate from Cassandra or Parquet files to ScyllaDB `,
+- or, :doc:`migrate from DynamoDB to ScyllaDB’s Alternator `.
+
+--------------
+Extra Features
+--------------
+
+You might also be interested in the following extra features:
+
+* :doc:`rename columns along the migration `,
+* :doc:`replicate changes applied to the source data after the initial snapshot transfer has completed `,
+* :doc:`validate that the migration was complete and correct `.
+
 .. toctree::
    :hidden:
 
    ansible
-   aws-emr
    spark-standalone
+   aws-emr
+   docker
diff --git a/docs/source/getting-started/spark-standalone.rst b/docs/source/getting-started/spark-standalone.rst
index 2542a539..acefc663 100644
--- a/docs/source/getting-started/spark-standalone.rst
+++ b/docs/source/getting-started/spark-standalone.rst
@@ -2,3 +2,26 @@
 Manual Set Up of a Spark Cluster
 ================================
 
+This page describes how to set up a Spark cluster on your own infrastructure and how to use it to perform a migration.
+
+1. Follow the `official documentation `_ to install Spark on each node of your cluster, and start the Spark master and the Spark workers.
+
+2. On the Spark master node, download the latest release of the Migrator. ::
+
+    wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar
+
+3. On the Spark master node, copy the file ``config.yaml.example`` from our Git repository. ::
+
+    wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
+      --output-document=config.yaml
+
+4. `Configure the migration `_ according to your needs.
+
+5. Finally, run the migration as follows from the Spark master node. ::
+
+    spark-submit --class com.scylladb.migrator.Migrator \
+      --master spark://<spark-master-ip>:7077 \
+      --conf spark.scylla.config=<path to config.yaml> \
+      <path to scylla-migrator-assembly.jar>
+
+6. You can monitor progress from the `Spark web UI `_.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 05ecc120..aedebc14 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -4,10 +4,10 @@ ScyllaDB Migrator Documentation
 
 The Scylla Migrator is a Spark application that migrates data to ScyllaDB. Its main features are the following:
 
-* it can read from Cassandra, Parquet, DynamoDB, or a DynamoDB S3 export
-* it can be distributed over multiple nodes of a Spark cluster to scale with your database cluster
-* it can rename columns along the way
-* it can transfer a snapshot of the source data, or continuously migrate new data as they come
+* it can read from Cassandra, Parquet, DynamoDB, or a DynamoDB S3 export,
+* it can be distributed over multiple nodes of a Spark cluster to scale with your database cluster,
+* it can rename columns along the way,
+* it can transfer a snapshot of the source data, or continuously migrate new data as they come.
 
 Read over the :doc:`Getting Started ` page to set up a Spark cluster for a migration.
 
@@ -20,3 +20,4 @@ Read over the :doc:`Getting Started ` page to set up a S
     stream-changes
     rename-columns
     validate
+    configuration
diff --git a/docs/source/migrate-from-cassandra-or-parquet.rst b/docs/source/migrate-from-cassandra-or-parquet.rst
index 895f2a9d..b6bdbdb4 100644
--- a/docs/source/migrate-from-cassandra-or-parquet.rst
+++ b/docs/source/migrate-from-cassandra-or-parquet.rst
@@ -1,5 +1,134 @@
-=================================
-Migrate from Cassandra or Parquet
-=================================
+=============================================
+Migrate from Cassandra or from a Parquet File
+=============================================
+
+This page explains how to fill the ``source`` and ``target`` properties of the `configuration file `_ to migrate data:
+
+- from Cassandra, ScyllaDB, or from a `Parquet `_ file,
+- to Cassandra or ScyllaDB.
+
+In the file ``config.yaml``, make sure to keep only one ``source`` property and one ``target`` property, and configure them as explained in the following subsections according to your case.
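+
+To illustrate the expected layout, here is a minimal sketch of the overall structure of ``config.yaml``. Only the top-level structure is shown; the actual properties of each section are described in the following subsections and in the comments of ``config.yaml.example``:
+
+.. code-block:: yaml
+
+  # Exactly one `source` section...
+  source:
+    type: cassandra
+    # ... connection and read settings, see “Configuring the Source” below
+
+  # ... and exactly one `target` section.
+  target:
+    # ... connection and write settings, see “Configuring the Destination” below
+
+  # Other top-level sections, such as `renames` or `validation`, are
+  # documented in the comments of config.yaml.example.
+  renames: []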
+
+----------------------
+Configuring the Source
+----------------------
+
+The data ``source`` can be a Cassandra or ScyllaDB database, or a Parquet file.
+
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Reading from Cassandra or ScyllaDB
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Whether you read from Cassandra or from ScyllaDB, the type of the source should be ``cassandra`` in the configuration file. Here is a minimal ``source`` configuration:
+
+.. code-block:: yaml
+
+  source:
+    type: cassandra
+    # Host name of one of the nodes of your database cluster
+    host: <host>
+    # TCP port to use for CQL
+    port: 9042
+    # Keyspace in which the table is located
+    keyspace: <keyspace>
+    # Name of the table to read
+    table: <table>
+    # Consistency Level for the source connection.
+    # Options are: LOCAL_ONE, ONE, LOCAL_QUORUM, QUORUM.
+    # We recommend using LOCAL_QUORUM. If using ONE or LOCAL_ONE, ensure the source system is fully repaired.
+    consistencyLevel: LOCAL_QUORUM
+    # Preserve TTLs and WRITETIMEs of cells in the source database. Note that this
+    # option is *incompatible* with tables that contain collections (lists, maps, sets).
+    preserveTimestamps: true
+    # Number of splits to use - this should be at minimum the number of cores
+    # available in the Spark cluster, and optimally more; higher splits will lead
+    # to more fine-grained resumes. Aim for 8 * (Spark cores).
+    splitCount: 256
+    # Number of connections to use to Cassandra when copying
+    connections: 8
+    # Number of rows to fetch in each read
+    fetchSize: 1000
+
+Where the values ``<host>``, ``<keyspace>``, and ``<table>`` should be replaced with your specific values.
+
+Additionally, you can also set the following optional properties:
+
+.. code-block:: yaml
+
+  source:
+    # ... same as above
+
+    # Datacenter to use
+    localDC: <your-datacenter>
+
+    # Connection credentials
+    credentials:
+      username: <username>
+      password: <password>
+
+    # SSL options as per https://github.com/scylladb/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-ssl-connection-options
+    sslOptions:
+      clientAuthEnabled: false
+      enabled: false
+      # All below are optional! (generally only trustStorePassword and trustStorePath are needed)
+      trustStorePassword:
+      trustStorePath:
+      trustStoreType: JKS
+      keyStorePassword:
+      keyStorePath:
+      keyStoreType: JKS
+      enabledAlgorithms:
+        - TLS_RSA_WITH_AES_128_CBC_SHA
+        - TLS_RSA_WITH_AES_256_CBC_SHA
+      protocol: TLS
+
+    # Condition to filter data that will be migrated
+    where: race_start_date = '2015-05-27' AND race_end_date = '2015-05-27'
+
+Where ``<your-datacenter>``, ``<username>``, ``<password>``, the SSL store paths and passwords, and the content of the ``where`` property should be replaced with your specific values.
+
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Reading from a Parquet File
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The Migrator can read data from a Parquet file located on the filesystem of the Spark master node, or on an S3 bucket. In both cases, set the source type to ``parquet``. Here is a complete ``source`` configuration to read from the filesystem:
+
+.. code-block:: yaml
+
+  source:
+    type: parquet
+    path: /<path-to-file>
+
+Where ``/<path-to-file>`` should be replaced with your actual file path.
+
+Here is a minimal ``source`` configuration to read the Parquet file from an S3 bucket:
+
+.. code-block:: yaml
+
+  source:
+    type: parquet
+    path: s3a://<bucket-name>/<path-to-file>
+
+Where ``<bucket-name>`` and ``<path-to-file>`` should be replaced with your actual S3 bucket name and object key.
+
+If the object is not publicly accessible in the S3 bucket, you can provide the AWS credentials to use as follows:
+
+.. code-block:: yaml
+
+  source:
+    type: parquet
+    path: s3a://my-bucket/my-key.parquet
+    credentials:
+      accessKey: <access-key>
+      secretKey: <secret-key>
+
+Where ``<access-key>`` and ``<secret-key>`` should be replaced with your actual AWS access key and secret key.
+
+The Migrator also supports advanced AWS authentication options such as using `AssumeRole `_. Please read the `configuration reference `_ for more details.
+
+---------------------------
+Configuring the Destination
+---------------------------
+