Add more documentation content
- Use proper syntax highlighting in code blocks
- Fix some typos
- Add a section describing how to configure the target database when migrating to Scylla
- Add a page describing how to configure the source and target when migrating from DynamoDB
julienrf committed Jun 29, 2024
1 parent c78adca commit b8fe56d
Showing 6 changed files with 297 additions and 22 deletions.
10 changes: 7 additions & 3 deletions docs/source/getting-started/ansible.rst
@@ -2,18 +2,22 @@
Set Up a Spark Cluster with Ansible
===================================

An `Ansible <https://www.ansible.com/>`_ playbook is provided in the `ansible <https://github.com/scylladb/scylla-migrator/tree/master/ansible>`_ folder of our Git repository. The Ansible playbook will install the pre-requisites, Spark, on the master and workers added to the ``ansible/inventory/hosts`` file. Scylla-migrator will be installed on the spark master node.
An `Ansible <https://www.ansible.com/>`_ playbook is provided in the `ansible folder <https://github.com/scylladb/scylla-migrator/tree/master/ansible>`_ of our Git repository. The Ansible playbook installs the prerequisites (including Spark) on the master and worker nodes listed in the ``ansible/inventory/hosts`` file. The Migrator will be installed on the Spark master node.

1. Update the ``ansible/inventory/hosts`` file with your master and worker instances (a sample inventory is sketched after this list).
2. Update ``ansible/ansible.cfg`` with the location of your private key, if necessary.
3. The ``ansible/template/spark-env-master-sample`` and ``ansible/template/spark-env-worker-sample`` files contain environment variables that determine the number of workers, CPUs per worker, and memory allocations, as well as considerations for setting them.
4. Run ``ansible-playbook scylla-migrator.yml``.
5. On the Spark master node: ::
5. On the Spark master node:

.. code-block:: bash

   cd scylla-migrator
   ./start-spark.sh
6. On the Spark worker nodes: ::
6. On the Spark worker nodes:

.. code-block:: bash

   ./start-slave.sh
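The ``ansible/inventory/hosts`` file follows the standard Ansible INI inventory format. Here is a purely illustrative sketch (the group names and addresses below are placeholders; keep the group names already defined in the repository's inventory file and fill in the addresses of your own instances):

.. code-block:: ini

   [spark-master]
   203.0.113.10

   [spark-workers]
   203.0.113.11
   203.0.113.12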
25 changes: 19 additions & 6 deletions docs/source/getting-started/aws-emr.rst
@@ -4,18 +4,25 @@ Set Up a Spark Cluster with AWS EMR

This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.com/emr/>`_. This approach is useful if you already have an AWS account, or if you do not want to manage your infrastructure manually.

1. Download the ``config.yaml.example`` from our Git repository. ::
1. Download the ``config.yaml.example`` from our Git repository.

.. code-block:: bash

   wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
     --output-document=config.yaml
2. `Configure the migration </getting-started/#configure-the-migration>`_ according to your needs.

3. Download the latest release of the Migrator. ::
3. Download the latest release of the Migrator.

.. code-block:: bash

   wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar
4. Upload them to an S3 bucket. ::
4. Upload them to an S3 bucket.

.. code-block:: bash

   aws s3 cp config.yaml s3://<your-bucket>/scylla-migrator/config.yaml
   aws s3 cp scylla-migrator-assembly.jar s3://<your-bucket>/scylla-migrator/scylla-migrator-assembly.jar
@@ -24,13 +31,17 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c

Each time you change the migration configuration, re-upload it to the bucket.
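For instance, assuming the same bucket layout as above, re-uploading the configuration looks like this:

.. code-block:: bash

   aws s3 cp config.yaml s3://<your-bucket>/scylla-migrator/config.yaml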

4. Create a script named ``copy-files.sh``, to load the files ``config.yaml`` and ``scylla-migrator-assembly.jar`` from your S3 bucket. ::
4. Create a script named ``copy-files.sh``, to load the files ``config.yaml`` and ``scylla-migrator-assembly.jar`` from your S3 bucket.

.. code-block:: bash

   #!/bin/bash
   aws s3 cp s3://<your-bucket>/scylla-migrator/config.yaml /mnt1/config.yaml
   aws s3 cp s3://<your-bucket>/scylla-migrator/scylla-migrator-assembly.jar /mnt1/scylla-migrator-assembly.jar
5. Upload the script to your S3 bucket as well. ::
5. Upload the script to your S3 bucket as well.

.. code-block:: bash

   aws s3 cp copy-files.sh s3://<your-bucket>/scylla-migrator/copy-files.sh
Expand All @@ -48,7 +59,9 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c

- Type: Custom JAR
- JAR location: ``command-runner.jar``
- Arguments: ::
- Arguments:

.. code-block:: text

   spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar
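Alternatively, if you script your EMR setup with the AWS CLI, a step with the same effect could be added along these lines. This is an untested sketch: ``<your-cluster-id>`` is a placeholder, and you should double-check the ``aws emr add-steps`` shorthand syntax against the AWS documentation before relying on it.

.. code-block:: bash

   # Sketch only: adds a Custom JAR step running the same spark-submit command as above.
   aws emr add-steps --cluster-id <your-cluster-id> \
     --steps "Type=CUSTOM_JAR,Name=ScyllaMigrator,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,--class,com.scylladb.migrator.Migrator,--conf,spark.scylla.config=/mnt1/config.yaml,/mnt1/scylla-migrator-assembly.jar]"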
20 changes: 15 additions & 5 deletions docs/source/getting-started/docker.rst
@@ -4,33 +4,43 @@ Set Up a Spark Cluster with Docker

This page describes how to set up a Spark cluster locally on your machine by using Docker containers. This approach is useful if you do not need a high level of performance and want to quickly try out the Migrator without having to set up a real cluster of nodes. It requires Docker and Git.

1. Clone the Migrator repository. ::
1. Clone the Migrator repository.

.. code-block:: bash

   git clone https://github.com/scylladb/scylla-migrator.git
   cd scylla-migrator
2. Download the latest release of the ``scylla-migrator-assembly.jar`` and put it in the directory ``migrator/target/scala-2.13/``. ::
2. Download the latest release of the ``scylla-migrator-assembly.jar`` and put it in the directory ``migrator/target/scala-2.13/``.

.. code-block:: bash

   mkdir -p migrator/target/scala-2.13
   wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
     --directory-prefix=migrator/target/scala-2.13
3. Start the Spark cluster. ::
3. Start the Spark cluster.

.. code-block:: bash

   docker compose up -d
4. Open the Spark web UI.

http://localhost:8080

Tip: add the following aliases to your ``/etc/hosts`` to make links work in the Spark UI ::
Tip: add the following aliases to your ``/etc/hosts`` to make links work in the Spark UI

.. code-block:: text

   127.0.0.1 spark-master
   127.0.0.1 spark-worker
5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure </getting-started/#configure-the-migration>`_ it according to your needs.
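For example, from the root of the cloned repository:

.. code-block:: bash

   mv config.yaml.example config.yaml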

6. Finally, run the migration. ::
6. Finally, run the migration.

.. code-block:: bash

   docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
     --master spark://spark-master:7077 \
12 changes: 9 additions & 3 deletions docs/source/getting-started/spark-standalone.rst
@@ -6,18 +6,24 @@ This page describes how to set up a Spark cluster on your infrastructure and to

1. Follow the `official documentation <https://spark.apache.org/docs/latest/spark-standalone.html>`_ to install Spark on each node of your cluster, and start the Spark master and the Spark workers.

2. In the Spark master node, download the latest release of the Migrator. ::
2. In the Spark master node, download the latest release of the Migrator.

.. code-block:: bash

   wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar
3. In the Spark master node, copy the file ``config.yaml.example`` from our Git repository. ::
3. In the Spark master node, copy the file ``config.yaml.example`` from our Git repository.

.. code-block:: bash

   wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
     --output-document=config.yaml
4. `Configure the migration </getting-started/#configure-the-migration>`_ according to your needs.

5. Finally, run the migration as follows from the Spark master node. ::
5. Finally, run the migration as follows from the Spark master node.

.. code-block:: bash

   spark-submit --class com.scylladb.migrator.Migrator \
     --master spark://<spark-master-hostname>:7077 \
80 changes: 75 additions & 5 deletions docs/source/migrate-from-cassandra-or-parquet.rst
@@ -13,7 +13,7 @@ In file ``config.yaml``, make sure to keep only one ``source`` property and one
Configuring the Source
----------------------

The data `source` can be a Cassandra or ScyllaDB database, or a Parquet file.
The data ``source`` can be a Cassandra or ScyllaDB table, or a Parquet file.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Reading from Cassandra or ScyllaDB
@@ -25,7 +25,7 @@
source:
  type: cassandra
  # host name of one of the nodes of your database cluster
  # Host name of one of the nodes of your database cluster
  host: <cassandra-server-01>
  # TCP port to use for CQL
  port: 9042
@@ -117,18 +117,88 @@ In case the object is not public in the S3 bucket, you can provide the AWS crede
source:
  type: parquet
  path: s3a://my-bucket/my-key.parquet
  path: s3a://<my-bucket>/<my-key>.parquet
  credentials:
    accessKey: <access-key>
    secretKey: <secret-key>
Where ``<access-key>`` and ``<my-secret-key>`` should be replaced with your actual AWS access key and secret key.
Where ``<access-key>`` and ``<secret-key>`` should be replaced with your actual AWS access key and secret key.

The Migrator also supports advanced AWS authentication options such as using `AssumeRole <https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html>`_. Please read the `configuration reference </configuration#aws-authentication>` for more details.
The Migrator also supports advanced AWS authentication options such as using `AssumeRole <https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html>`_. Please read the `configuration reference </configuration#aws-authentication>`__ for more details.

---------------------------
Configuring the Destination
---------------------------

The migration ``target`` can be a Cassandra or ScyllaDB database. In both cases, use the type ``cassandra`` in the configuration. Here is a minimal ``target`` configuration for writing to Cassandra or ScyllaDB:

.. code-block:: yaml

   target:
     # Can be either 'cassandra' or 'scylla'; both are handled the same way.
     type: cassandra
     # Host name of one of the nodes of your target database cluster
     host: <scylla-server-01>
     port: 9042
     keyspace: <keyspace>
     # Name of the table to write. If it does not exist, it will be created on the fly.
     # It has to have the same schema as the source table. If needed, you can rename
     # columns along the way; see the documentation page “Rename Columns”.
     table: <table>
     # Consistency level to use for the target connection.
     # Options are: LOCAL_ONE, ONE, LOCAL_QUORUM, QUORUM.
     consistencyLevel: LOCAL_QUORUM
     # Number of connections to use to Scylla/Cassandra when copying
     connections: 16
     # Spark pads decimals with zeros appropriate to their scale. This causes values
     # like '3.5' to be copied as '3.5000000000...' to the target. There is currently
     # no good way to preserve the original value, so this flag can strip trailing
     # zeros on decimal values before they are written.
     stripTrailingZerosForDecimals: false
Where ``<scylla-server-01>``, ``<keyspace>``, and ``<table>`` should be replaced with your specific values.

Additionally, you can set the following optional properties:

.. code-block:: yaml

   target:
     # ... same as above
     # Datacenter to use
     localDC: <datacenter>
     # Authentication credentials
     credentials:
       username: <username>
       password: <pass>
     # SSL options, as per https://github.com/scylladb/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-ssl-connection-options
     sslOptions:
       clientAuthEnabled: false
       enabled: false
       # All the properties below are optional (generally, only trustStorePassword
       # and trustStorePath are needed).
       trustStorePassword: <pass>
       trustStorePath: <path>
       trustStoreType: JKS
       keyStorePassword: <pass>
       keyStorePath: <path>
       keyStoreType: JKS
       enabledAlgorithms:
         - TLS_RSA_WITH_AES_128_CBC_SHA
         - TLS_RSA_WITH_AES_256_CBC_SHA
       protocol: TLS
     # If timestamps are not preserved (that is, when preserveTimestamps is false in
     # the source), the writer can enforce a single TTL or write timestamp for ALL
     # written records. Such a write timestamp can, for example, be set to a time
     # BEFORE starting dual writes, which makes your migration safe from overwriting
     # dual-written data, even for collections.
     # ALL written rows will get the same TTL, the same write timestamp, or both
     # (you can uncomment just one of them, both, or none).
     # TTL in seconds (sample 7776000 is 90 days)
     writeTTLInS: 7776000
     # Write timestamp in microseconds (sample 1640998861000 is Saturday, January 1, 2022 2:01:01 AM GMT+01:00)
     writeWritetimestampInuS: 1640998861000
Where ``<datacenter>``, ``<username>``, ``<pass>``, and ``<path>`` should be replaced with your specific values.
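Putting the ``source`` and ``target`` pieces together, a complete ``config.yaml`` for a Cassandra-to-ScyllaDB migration could start from a sketch like the one below. This only illustrates the overall shape: the ``keyspace`` and ``table`` properties under ``source`` are assumed here by analogy with the ``target`` section, and the full ``config.yaml.example`` in the repository contains additional settings you will likely need as well.

.. code-block:: yaml

   source:
     type: cassandra
     host: <cassandra-server-01>
     port: 9042
     # Assumed by analogy with the target section; check config.yaml.example
     # for the exact source properties.
     keyspace: <keyspace>
     table: <table>

   target:
     type: cassandra
     host: <scylla-server-01>
     port: 9042
     keyspace: <keyspace>
     table: <table>
     consistencyLevel: LOCAL_QUORUM
     connections: 16
     stripTrailingZerosForDecimals: false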