Add more documentation content
- Use proper syntax highlighting in code blocks
- Fix some typos
- Add a section describing how to configure the target database when migrating to Scylla
- Add a page describing how to configure the source and target when migrating from DynamoDB
julienrf committed Jun 29, 2024
1 parent c78adca commit b8fe56d
Showing 6 changed files with 297 additions and 22 deletions.
10 changes: 7 additions & 3 deletions docs/source/getting-started/ansible.rst
@@ -2,18 +2,22 @@
Set Up a Spark Cluster with Ansible
===================================

An `Ansible <https://www.ansible.com/>`_ playbook is provided in the `ansible <https://github.com/scylladb/scylla-migrator/tree/master/ansible>`_ folder of our Git repository. The Ansible playbook will install the pre-requisites, Spark, on the master and workers added to the ``ansible/inventory/hosts`` file. Scylla-migrator will be installed on the spark master node.
An `Ansible <https://www.ansible.com/>`_ playbook is provided in the `ansible folder <https://github.com/scylladb/scylla-migrator/tree/master/ansible>`_ of our Git repository. The Ansible playbook installs the prerequisites (including Spark) on the master and worker nodes listed in the ``ansible/inventory/hosts`` file. The Migrator will be installed on the Spark master node.

1. Update the ``ansible/inventory/hosts`` file with your master and worker instances (a sample inventory is sketched after this list).
2. Update ``ansible/ansible.cfg`` with the location of your private key, if necessary.
3. The ``ansible/template/spark-env-master-sample`` and ``ansible/template/spark-env-worker-sample`` files contain environment variables that determine the number of workers, CPUs per worker, and memory allocations, as well as considerations for setting them.
4. Run ``ansible-playbook scylla-migrator.yml``.
5. On the Spark master node: ::
5. On the Spark master node:

.. code-block:: bash

   cd scylla-migrator
   ./start-spark.sh
6. On the Spark worker nodes: ::
6. On the Spark worker nodes:

.. code-block:: bash

   ./start-slave.sh
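The ``ansible/inventory/hosts`` file follows the standard Ansible INI inventory format. Here is a purely illustrative sketch (the group names and addresses below are placeholders; keep the group names already defined in the repository's inventory file and fill in the addresses of your own instances):

.. code-block:: ini

   [spark-master]
   203.0.113.10

   [spark-workers]
   203.0.113.11
   203.0.113.12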
25 changes: 19 additions & 6 deletions docs/source/getting-started/aws-emr.rst
@@ -4,18 +4,25 @@ Set Up a Spark Cluster with AWS EMR

This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.com/emr/>`_. This approach is useful if you already have an AWS account, or if you do not want to manage your infrastructure manually.

1. Download the ``config.yaml.example`` from our Git repository. ::
1. Download the ``config.yaml.example`` from our Git repository.

.. code-block:: bash

   wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
     --output-document=config.yaml
2. `Configure the migration </getting-started/#configure-the-migration>`_ according to your needs.

3. Download the latest release of the Migrator. ::
3. Download the latest release of the Migrator.

.. code-block:: bash

   wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar
4. Upload them to an S3 bucket. ::
4. Upload them to an S3 bucket.

.. code-block:: bash

   aws s3 cp config.yaml s3://<your-bucket>/scylla-migrator/config.yaml
   aws s3 cp scylla-migrator-assembly.jar s3://<your-bucket>/scylla-migrator/scylla-migrator-assembly.jar
@@ -24,13 +31,17 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c

Each time you change the migration configuration, re-upload it to the bucket.
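For instance, assuming the same bucket layout as above, re-uploading the configuration looks like this:

.. code-block:: bash

   aws s3 cp config.yaml s3://<your-bucket>/scylla-migrator/config.yaml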

4. Create a script named ``copy-files.sh``, to load the files ``config.yaml`` and ``scylla-migrator-assembly.jar`` from your S3 bucket. ::
4. Create a script named ``copy-files.sh``, to load the files ``config.yaml`` and ``scylla-migrator-assembly.jar`` from your S3 bucket.

.. code-block:: bash

   #!/bin/bash
   aws s3 cp s3://<your-bucket>/scylla-migrator/config.yaml /mnt1/config.yaml
   aws s3 cp s3://<your-bucket>/scylla-migrator/scylla-migrator-assembly.jar /mnt1/scylla-migrator-assembly.jar
5. Upload the script to your S3 bucket as well. ::
5. Upload the script to your S3 bucket as well.

.. code-block:: bash

   aws s3 cp copy-files.sh s3://<your-bucket>/scylla-migrator/copy-files.sh
Expand All @@ -48,7 +59,9 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c

- Type: Custom JAR
- JAR location: ``command-runner.jar``
- Arguments: ::
- Arguments:

.. code-block:: text

   spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar
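Alternatively, if you script your EMR setup with the AWS CLI, a step with the same effect could be added along these lines. This is an untested sketch: ``<your-cluster-id>`` is a placeholder, and you should double-check the ``aws emr add-steps`` shorthand syntax against the AWS documentation before relying on it.

.. code-block:: bash

   # Sketch only: adds a Custom JAR step running the same spark-submit command as above.
   aws emr add-steps --cluster-id <your-cluster-id> \
     --steps "Type=CUSTOM_JAR,Name=ScyllaMigrator,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,--class,com.scylladb.migrator.Migrator,--conf,spark.scylla.config=/mnt1/config.yaml,/mnt1/scylla-migrator-assembly.jar]"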
20 changes: 15 additions & 5 deletions docs/source/getting-started/docker.rst
@@ -4,33 +4,43 @@ Set Up a Spark Cluster with Docker

This page describes how to set up a Spark cluster locally on your machine by using Docker containers. This approach is useful if you do not need a high level of performance and want to quickly try out the Migrator without having to set up a real cluster of nodes. It requires Docker and Git.

1. Clone the Migrator repository. ::
1. Clone the Migrator repository.

.. code-block:: bash

   git clone https://github.com/scylladb/scylla-migrator.git
   cd scylla-migrator
2. Download the latest release of the ``scylla-migrator-assembly.jar`` and put it in the directory ``migrator/target/scala-2.13/``. ::
2. Download the latest release of the ``scylla-migrator-assembly.jar`` and put it in the directory ``migrator/target/scala-2.13/``.

.. code-block:: bash

   mkdir -p migrator/target/scala-2.13
   wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
     --directory-prefix=migrator/target/scala-2.13
3. Start the Spark cluster. ::
3. Start the Spark cluster.

.. code-block:: bash

   docker compose up -d
4. Open the Spark web UI.

http://localhost:8080

Tip: add the following aliases to your ``/etc/hosts`` to make links work in the Spark UI ::
Tip: add the following aliases to your ``/etc/hosts`` to make links work in the Spark UI

.. code-block:: text

   127.0.0.1 spark-master
   127.0.0.1 spark-worker
5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure </getting-started/#configure-the-migration>`_ it according to your needs.
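For example, from the root of the cloned repository:

.. code-block:: bash

   mv config.yaml.example config.yaml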

6. Finally, run the migration. ::
6. Finally, run the migration.

.. code-block:: bash

   docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
     --master spark://spark-master:7077 \
12 changes: 9 additions & 3 deletions docs/source/getting-started/spark-standalone.rst
@@ -6,18 +6,24 @@ This page describes how to set up a Spark cluster on your infrastructure and to

1. Follow the `official documentation <https://spark.apache.org/docs/latest/spark-standalone.html>`_ to install Spark on each node of your cluster, and start the Spark master and the Spark workers.

2. In the Spark master node, download the latest release of the Migrator. ::
2. In the Spark master node, download the latest release of the Migrator.

.. code-block:: bash

   wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar
3. In the Spark master node, copy the file ``config.yaml.example`` from our Git repository. ::
3. In the Spark master node, copy the file ``config.yaml.example`` from our Git repository.

.. code-block:: bash

   wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
     --output-document=config.yaml
4. `Configure the migration </getting-started/#configure-the-migration>`_ according to your needs.

5. Finally, run the migration as follows from the Spark master node. ::
5. Finally, run the migration as follows from the Spark master node.

.. code-block:: bash

   spark-submit --class com.scylladb.migrator.Migrator \
     --master spark://<spark-master-hostname>:7077 \
80 changes: 75 additions & 5 deletions docs/source/migrate-from-cassandra-or-parquet.rst
@@ -13,7 +13,7 @@ In file ``config.yaml``, make sure to keep only one ``source`` property and one
Configuring the Source
----------------------

The data `source` can be a Cassandra or ScyllaDB database, or a Parquet file.
The data ``source`` can be a Cassandra or ScyllaDB table, or a Parquet file.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Reading from Cassandra or ScyllaDB
@@ -25,7 +25,7 @@
source:
  type: cassandra
  # host name of one of the nodes of your database cluster
  # Host name of one of the nodes of your database cluster
  host: <cassandra-server-01>
  # TCP port to use for CQL
  port: 9042
@@ -117,18 +117,88 @@ In case the object is not public in the S3 bucket, you can provide the AWS crede
source:
  type: parquet
  path: s3a://my-bucket/my-key.parquet
  path: s3a://<my-bucket>/<my-key>.parquet
  credentials:
    accessKey: <access-key>
    secretKey: <secret-key>
Where ``<access-key>`` and ``<my-secret-key>`` should be replaced with your actual AWS access key and secret key.
Where ``<access-key>`` and ``<secret-key>`` should be replaced with your actual AWS access key and secret key.

The Migrator also supports advanced AWS authentication options such as using `AssumeRole <https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html>`_. Please read the `configuration reference </configuration#aws-authentication>` for more details.
The Migrator also supports advanced AWS authentication options such as using `AssumeRole <https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html>`_. Please read the `configuration reference </configuration#aws-authentication>`__ for more details.

---------------------------
Configuring the Destination
---------------------------

The migration ``target`` can be a Cassandra or ScyllaDB database. In both cases, use the type ``cassandra`` in the configuration. Here is a minimal ``target`` configuration for writing to Cassandra or ScyllaDB:

.. code-block:: yaml

   target:
     # Can be either 'cassandra' or 'scylla'; both are handled the same way.
     type: cassandra
     # Host name of one of the nodes of your target database cluster
     host: <scylla-server-01>
     port: 9042
     keyspace: <keyspace>
     # Name of the table to write. If it does not exist, it will be created on the fly.
     # It has to have the same schema as the source table. If needed, you can rename
     # columns along the way; see the documentation page “Rename Columns”.
     table: <table>
     # Consistency level to use for the target connection.
     # Options are: LOCAL_ONE, ONE, LOCAL_QUORUM, QUORUM.
     consistencyLevel: LOCAL_QUORUM
     # Number of connections to use to Scylla/Cassandra when copying
     connections: 16
     # Spark pads decimals with zeros appropriate to their scale. This causes values
     # like '3.5' to be copied as '3.5000000000...' to the target. There is currently
     # no good way to preserve the original value, so this flag can strip trailing
     # zeros on decimal values before they are written.
     stripTrailingZerosForDecimals: false
Where ``<scylla-server-01>``, ``<keyspace>``, and ``<table>`` should be replaced with your specific values.

Additionally, you can set the following optional properties:

.. code-block:: yaml

   target:
     # ... same as above
     # Datacenter to use
     localDC: <datacenter>
     # Authentication credentials
     credentials:
       username: <username>
       password: <pass>
     # SSL options, as per https://github.com/scylladb/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-ssl-connection-options
     sslOptions:
       clientAuthEnabled: false
       enabled: false
       # All the properties below are optional (generally, only trustStorePassword
       # and trustStorePath are needed).
       trustStorePassword: <pass>
       trustStorePath: <path>
       trustStoreType: JKS
       keyStorePassword: <pass>
       keyStorePath: <path>
       keyStoreType: JKS
       enabledAlgorithms:
         - TLS_RSA_WITH_AES_128_CBC_SHA
         - TLS_RSA_WITH_AES_256_CBC_SHA
       protocol: TLS
     # If timestamps are not preserved (that is, when preserveTimestamps is false in
     # the source), the writer can enforce a single TTL or write timestamp for ALL
     # written records. Such a write timestamp can, for example, be set to a time
     # BEFORE starting dual writes, which makes your migration safe from overwriting
     # dual-written data, even for collections.
     # ALL written rows will get the same TTL, the same write timestamp, or both
     # (you can uncomment just one of them, both, or none).
     # TTL in seconds (sample 7776000 is 90 days)
     writeTTLInS: 7776000
     # Write timestamp in microseconds (sample 1640998861000 is Saturday, January 1, 2022 2:01:01 AM GMT+01:00)
     writeWritetimestampInuS: 1640998861000
Where ``<datacenter>``, ``<username>``, ``<pass>``, and ``<path>`` should be replaced with your specific values.
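Putting the ``source`` and ``target`` pieces together, a complete ``config.yaml`` for a Cassandra-to-ScyllaDB migration could start from a sketch like the one below. This only illustrates the overall shape: the ``keyspace`` and ``table`` properties under ``source`` are assumed here by analogy with the ``target`` section, and the full ``config.yaml.example`` in the repository contains additional settings you will likely need as well.

.. code-block:: yaml

   source:
     type: cassandra
     host: <cassandra-server-01>
     port: 9042
     # Assumed by analogy with the target section; check config.yaml.example
     # for the exact source properties.
     keyspace: <keyspace>
     table: <table>

   target:
     type: cassandra
     host: <scylla-server-01>
     port: 9042
     keyspace: <keyspace>
     table: <table>
     consistencyLevel: LOCAL_QUORUM
     connections: 16
     stripTrailingZerosForDecimals: false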