diff --git a/docs/source/getting-started/ansible.rst b/docs/source/getting-started/ansible.rst
index 0483a917..3ea30f88 100644
--- a/docs/source/getting-started/ansible.rst
+++ b/docs/source/getting-started/ansible.rst
@@ -2,18 +2,22 @@
 Set Up a Spark Cluster with Ansible
 ===================================
 
-An `Ansible `_ playbook is provided in the `ansible `_ folder of our Git repository. The Ansible playbook will install the pre-requisites, Spark, on the master and workers added to the ``ansible/inventory/hosts`` file. Scylla-migrator will be installed on the spark master node.
+An `Ansible `_ playbook is provided in the `ansible folder`_ of our Git repository. The Ansible playbook will install the prerequisites, including Spark, on the master and workers added to the ``ansible/inventory/hosts`` file. Scylla-migrator will be installed on the Spark master node.
 
 1. Update ``ansible/inventory/hosts`` file with master and worker instances
 2. Update ``ansible/ansible.cfg`` with location of private key if necessary
 3. The ``ansible/template/spark-env-master-sample`` and ``ansible/template/spark-env-worker-sample`` contain environment variables determining number of workers, CPUs per worker, and memory allocations - as well as considerations for setting them.
 4. run ``ansible-playbook scylla-migrator.yml``
-5. On the Spark master node: ::
+5. On the Spark master node:
+
+   .. code-block:: bash
 
      cd scylla-migrator
      ./start-spark.sh
-6. On the Spark worker nodes: ::
+6. On the Spark worker nodes:
+
+   .. code-block:: bash
 
      ./start-slave.sh
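+As an optional sanity check before step 4, you can confirm that Ansible reaches every host declared in the inventory. The commands below are a minimal sketch and assume they are run from the ``ansible`` directory with the inventory path used above:
+
+.. code-block:: bash
+
+   # Verify SSH connectivity to all master and worker hosts in the inventory
+   ansible -i inventory/hosts all -m ping
+   # Then run the playbook against the same inventory (step 4)
+   ansible-playbook -i inventory/hosts scylla-migrator.yml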
diff --git a/docs/source/getting-started/aws-emr.rst b/docs/source/getting-started/aws-emr.rst
index 4d523e11..80b7f7cf 100644
--- a/docs/source/getting-started/aws-emr.rst
+++ b/docs/source/getting-started/aws-emr.rst
@@ -4,18 +4,25 @@
 This page describes how to use the Migrator in `Amazon EMR `_. This approach is useful if you already have an AWS account, or if you do not want to manage your infrastructure manually.
 
-1. Download the ``config.yaml.example`` from our Git repository. ::
+1. Download the ``config.yaml.example`` from our Git repository.
+
+   .. code-block:: bash
 
      wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
        --output-document=config.yaml
+
 2. `Configure the migration `_ according to your needs.
 
-3. Download the latest release of the Migrator. ::
+3. Download the latest release of the Migrator.
+
+   .. code-block:: bash
 
      wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar
 
-4. Upload them to an S3 bucket. ::
+4. Upload them to an S3 bucket.
+
+   .. code-block:: bash
 
      aws s3 cp config.yaml s3://<bucket>/scylla-migrator/config.yaml
     aws s3 cp scylla-migrator-assembly.jar s3://<bucket>/scylla-migrator/scylla-migrator-assembly.jar
@@ -24,13 +31,17 @@ This page describes how to use the Migrator in `Amazon EMR 
 
      aws s3 cp s3://<bucket>/scylla-migrator/config.yaml /mnt1/config.yaml
      aws s3 cp s3://<bucket>/scylla-migrator/scylla-migrator-assembly.jar /mnt1/scylla-migrator-assembly.jar
 
-5. Upload the script to your S3 bucket as well. ::
+5. Upload the script to your S3 bucket as well.
+
+   .. code-block:: bash
 
      aws s3 cp copy-files.sh s3://<bucket>/scylla-migrator/copy-files.sh
@@ -48,7 +59,9 @@ This page describes how to use the Migrator in `Amazon EMR `_ it according to your needs.
-6. Finally, run the migration. ::
+6. Finally, run the migration.
+
+   .. code-block:: bash
 
      docker compose exec spark-master /spark/bin/spark-submit --class com.scylladb.migrator.Migrator \
        --master spark://spark-master:7077 \
diff --git a/docs/source/getting-started/spark-standalone.rst b/docs/source/getting-started/spark-standalone.rst
index acefc663..47ebb4b8 100644
--- a/docs/source/getting-started/spark-standalone.rst
+++ b/docs/source/getting-started/spark-standalone.rst
@@ -6,18 +6,24 @@
 This page describes how to set up a Spark cluster on your infrastructure and to use it to perform a migration.
 
 1. Follow the `official documentation `_ to install Spark on each node of your cluster, and start the Spark master and the Spark workers.
 
-2. In the Spark master node, download the latest release of the Migrator. ::
+2. On the Spark master node, download the latest release of the Migrator.
+
+   .. code-block:: bash
 
      wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar
 
-3. In the Spark master node, copy the file ``config.yaml.example`` from our Git repository. ::
+3. On the Spark master node, copy the file ``config.yaml.example`` from our Git repository.
+
+   .. code-block:: bash
 
      wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
        --output-document=config.yaml
 
 4. `Configure the migration `_ according to your needs.
 
-5. Finally, run the migration as follows from the Spark master node. ::
+5. Finally, run the migration as follows from the Spark master node.
+
+   .. code-block:: bash
 
      spark-submit --class com.scylladb.migrator.Migrator \
        --master spark://<spark-master-host>:7077 \
diff --git a/docs/source/migrate-from-cassandra-or-parquet.rst b/docs/source/migrate-from-cassandra-or-parquet.rst
index b6bdbdb4..e50d4763 100644
--- a/docs/source/migrate-from-cassandra-or-parquet.rst
+++ b/docs/source/migrate-from-cassandra-or-parquet.rst
@@ -13,7 +13,7 @@ In file ``config.yaml``, make sure to keep only one ``source`` property and one ``target`` property, and configure them as explained in the following subsections according to your case.
 
 Configuring the Source
 ----------------------
 
-The data `source` can be a Cassandra or ScyllaDB database, or a Parquet file.
+The data ``source`` can be a Cassandra or ScyllaDB table, or a Parquet file.
 
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Reading from Cassandra or ScyllaDB
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -25,7 +25,7 @@ In both cases, when reading from Cassandra or ScyllaDB, the type of source shoul
 
   source:
     type: cassandra
-    # host name of one of the nodes of your database cluster
+    # Host name of one of the nodes of your database cluster
     host: <host>
     # TCP port to use for CQL
     port: 9042
@@ -117,18 +117,88 @@ In case the object is not public in the S3 bucket, you can provide the AWS crede
 
   source:
     type: parquet
-    path: s3a://my-bucket/my-key.parquet
+    path: s3a://<bucket>/<key>
     credentials:
       accessKey: <access-key>
       secretKey: <secret-key>
 
-Where ```` and ```` should be replaced with your actual AWS access key and secret key.
+Where ``<access-key>`` and ``<secret-key>`` should be replaced with your actual AWS access key and secret key.
 
-The Migrator also supports advanced AWS authentication options such as using `AssumeRole `_. Please read the `configuration reference ` for more details.
+The Migrator also supports advanced AWS authentication options such as using `AssumeRole `_. Please read the `configuration reference `__ for more details.
 
 ---------------------------
 Configuring the Destination
 ---------------------------
 
+The migration ``target`` can be Cassandra or ScyllaDB. In both cases, we use the type ``cassandra`` in the configuration. Here is a minimal ``target`` configuration to write to Cassandra or ScyllaDB:
+
+.. code-block:: yaml
+
+  target:
+    # Can be 'cassandra' or 'scylla', it does not matter
+    type: cassandra
+    # Host name of one of the nodes of your target database cluster
+    host: <host>
+    port: 9042
+    keyspace: <keyspace>
+    # Name of the table to write. If it does not exist, it will be created on the fly.
+    # It has to have the same schema as the source table. If needed, you can rename
+    # columns along the way; look at the documentation page “Rename Columns”.
+    table: <table>
+    # Consistency Level for the target connection.
+    # Options are: LOCAL_ONE, ONE, LOCAL_QUORUM, QUORUM.
+    consistencyLevel: LOCAL_QUORUM
+    # Number of connections to use to Scylla/Cassandra when copying
+    connections: 16
+    # Spark pads decimals with zeros appropriate to their scale. This causes values
+    # like '3.5' to be copied as '3.5000000000...' to the target. There's no good way
+    # currently to preserve the original value, so this flag can strip trailing zeros
+    # on decimal values before they are written.
+    stripTrailingZerosForDecimals: false
+
+Where ``<host>``, ``<keyspace>``, and ``<table>`` should be replaced with your specific values.
+
+Additionally, you can also set the following optional properties:
+
+.. code-block:: yaml
+
+  target:
+    # ... same as above
+
+    # Datacenter to use
+    localDC: <localDC>
+
+    # Authentication credentials
+    credentials:
+      username: <username>
+      password: <password>
+    # SSL as per https://github.com/scylladb/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-ssl-connection-options
+    sslOptions:
+      clientAuthEnabled: false
+      enabled: false
+      # All the properties below are optional (generally, just trustStorePassword and trustStorePath are needed)
+      trustStorePassword: <trustStorePassword>
+      trustStorePath: <trustStorePath>
+      trustStoreType: JKS
+      keyStorePassword: <keyStorePassword>
+      keyStorePath: <keyStorePath>
+      keyStoreType: JKS
+      enabledAlgorithms:
+        - TLS_RSA_WITH_AES_128_CBC_SHA
+        - TLS_RSA_WITH_AES_256_CBC_SHA
+      protocol: TLS
+    # If we do not persist timestamps (when preserveTimestamps is false in the source),
+    # we can enforce in the writer a single TTL or writetimestamp for ALL written records.
+    # Such a writetimestamp can, e.g., be set to a time BEFORE starting dual writes,
+    # and this will make your migration safe from overwriting dual writes,
+    # even for collections.
+    # ALL rows written will get the same TTL or writetimestamp or both
+    # (you can uncomment just one of them, or both, or none).
+    # TTL in seconds (sample 7776000 is 90 days)
+    writeTTLInS: 7776000
+    # writetime in microseconds (sample 1640998861000 is Saturday, January 1, 2022 2:01:01 AM GMT+01:00)
+    writeWritetimestampInuS: 1640998861000
+
+Where ``<localDC>``, ``<username>``, ``<password>``, and the trust/key store paths and passwords should be replaced with your specific values.
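+
+Putting the two sections together, a complete pair of ``source`` and ``target`` sections for a Cassandra-to-ScyllaDB migration could look like the following sketch. All host names, keyspace, and table names are placeholders, and the remaining settings from ``config.yaml.example`` (such as savepoints) still apply:
+
+.. code-block:: yaml
+
+  source:
+    type: cassandra
+    host: cassandra-node-1.example.com
+    port: 9042
+    keyspace: my_keyspace
+    table: my_table
+
+  target:
+    type: cassandra
+    host: scylla-node-1.example.com
+    port: 9042
+    keyspace: my_keyspace
+    table: my_table
+    consistencyLevel: LOCAL_QUORUM
+    connections: 16
+    stripTrailingZerosForDecimals: false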
diff --git a/docs/source/migrate-from-dynamodb.rst b/docs/source/migrate-from-dynamodb.rst
index b2f80442..86b675ea 100644
--- a/docs/source/migrate-from-dynamodb.rst
+++ b/docs/source/migrate-from-dynamodb.rst
@@ -3,3 +3,175 @@ Migrate from DynamoDB
 =====================
 
+This page explains how to fill the ``source`` and ``target`` properties of the `configuration file `_ to migrate data:
+
+- from a DynamoDB table, a ScyllaDB Alternator table, or a `DynamoDB S3 export `_,
+- to a DynamoDB table or a ScyllaDB Alternator table.
+
+In file ``config.yaml``, make sure to keep only one ``source`` property and one ``target`` property, and configure them as explained in the following subsections according to your case.
+
+----------------------
+Configuring the Source
+----------------------
+
+The data ``source`` can be a DynamoDB or Alternator table, or a DynamoDB S3 export.
+
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Reading from DynamoDB or Alternator
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In both cases, when reading from DynamoDB or Alternator, the type of source should be ``dynamodb`` in the configuration file. Here is a minimal ``source`` configuration to read a DynamoDB table:
+
+.. code-block:: yaml
+
+  source:
+    type: dynamodb
+    table: <table>
+    region: <region>
+
+Where ``<table>`` is the name of the table to read, and ``<region>`` is the AWS region where the DynamoDB instance is located.
+
+To read from Alternator, you need to provide an ``endpoint`` instead of a ``region``:
+
+.. code-block:: yaml
+
+  source:
+    type: dynamodb
+    table: <table>
+    endpoint:
+      host: http://<host>
+      port: <port>
+
+Where ``<host>`` and ``<port>`` should be replaced with the host name and TCP port of your Alternator instance.
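+
+For instance, if your ScyllaDB cluster exposes Alternator on its default port (commonly 8000 — check your cluster configuration), the ``endpoint`` section could look like this sketch, where the host and table names are placeholders:
+
+.. code-block:: yaml
+
+  source:
+    type: dynamodb
+    table: my_table
+    endpoint:
+      host: http://scylla-node-1.example.com
+      port: 8000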
+
+In practice, your source database (DynamoDB or Alternator) may require authentication. You can provide the AWS credentials with the ``credentials`` property:
+
+.. code-block:: yaml
+
+  source:
+    type: dynamodb
+    table: <table>
+    region: <region>
+    credentials:
+      accessKey: <access-key>
+      secretKey: <secret-key>
+
+Where ``<access-key>`` and ``<secret-key>`` should be replaced with your actual AWS access key and secret key.
+
+The Migrator also supports advanced AWS authentication options such as using `AssumeRole `_. Please read the `configuration reference `_ for more details.
+
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Reading a DynamoDB S3 Export
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To read the content of a table exported to S3, use the ``source`` type ``dynamodb-s3-export``. Here is a minimal source configuration:
+
+.. code-block:: yaml
+
+  source:
+    type: dynamodb-s3-export
+    # Name of the S3 bucket where the DynamoDB table has been exported
+    bucket: <bucket>
+    # Key of the `manifest-summary.json` object in the bucket
+    manifestKey: <manifest-key>
+    # Key schema and attribute definitions, see https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TableCreationParameters.html
+    tableDescription:
+      # See https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_AttributeDefinition.html
+      attributeDefinitions:
+        - name: <name>
+          type: <type>
+        - ...
+      # See https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_KeySchemaElement.html
+      keySchema:
+        - name: <name>
+          type: <type>
+        - ...
+
+Additionally, you can also provide the following optional properties:
+
+.. code-block:: yaml
+
+  source:
+    # ... same as above
+
+    # Connect to a custom endpoint instead of the standard AWS S3 endpoint
+    endpoint:
+      # Specify the hostname without a protocol
+      host: <host>
+      port: <port>
+
+    # AWS availability region
+    region: <region>
+
+    # Connection credentials:
+    credentials:
+      accessKey: <access-key>
+      secretKey: <secret-key>
+
+    # Whether to use “path-style access” in S3 (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html). Default is false.
+    usePathStyleAccess: true
+
+Where ``<host>``, ``<port>``, ``<region>``, ``<access-key>``, and ``<secret-key>`` should be replaced with your specific values.
+
+The Migrator also supports advanced AWS authentication options such as using `AssumeRole `_. Please read the :doc:`configuration reference ` for more details.
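+
+For reference, DynamoDB S3 exports typically place their files under an ``AWSDynamoDB/<export-id>/`` prefix in the bucket, so a filled-in ``bucket`` and ``manifestKey`` pair often looks like the following sketch (the bucket name and export ID are placeholders):
+
+.. code-block:: yaml
+
+  source:
+    type: dynamodb-s3-export
+    bucket: my-dynamodb-exports
+    manifestKey: AWSDynamoDB/01234567890123-abcdefgh/manifest-summary.json
+    # ... plus the tableDescription section shown above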
+
+---------------------------
+Configuring the Destination
+---------------------------
+
+The migration ``target`` can be DynamoDB or Alternator. In both cases, we use the type ``dynamodb`` in the configuration. Here is a minimal ``target`` configuration to write to DynamoDB or Alternator:
+
+.. code-block:: yaml
+
+  target:
+    type: dynamodb
+    # Name of the table to write. If it does not exist, it will be created on the fly.
+    table: <table>
+    # Split factor for reading/writing. This is required for Scylla targets.
+    scanSegments: 1
+    # Throttling settings, set based on your database capacity (or wanted capacity)
+    readThroughput: 1
+    # Can be between 0.1 and 1.5, inclusive.
+    # 0.5 represents the default read rate, meaning that the job will attempt to consume half of the read capacity of the table.
+    # If you increase the value above 0.5, Spark will increase the request rate; decreasing the value below 0.5 decreases the read request rate.
+    # (The actual read rate will vary, depending on factors such as whether there is a uniform key distribution in the DynamoDB table.)
+    throughputReadPercent: 1.0
+    # At most how many tasks per Spark executor?
+    maxMapTasks: 1
+    # When transferring DynamoDB sources to DynamoDB targets (such as other DynamoDB tables or Alternator tables),
+    # the Migrator supports transferring live changes occurring on the source table after transferring an initial
+    # snapshot.
+    # Please see the documentation page “Stream Changes” for more details about this option.
+    streamChanges: false
+
+Where ``<table>`` should be replaced with your specific value.
+
+Additionally, you can also set the following optional properties:
+
+.. code-block:: yaml
+
+  target:
+    # ... same as above
+
+    # Connect to a custom endpoint. Mandatory if writing to Scylla Alternator.
+    endpoint:
+      # If writing to Scylla Alternator, prefix the hostname with 'http://'.
+      host: <host>
+      port: <port>
+
+    # AWS availability region:
+    region: <region>
+
+    # Authentication credentials:
+    credentials:
+      accessKey: <access-key>
+      secretKey: <secret-key>
+
+    # When streamChanges is true, skip the initial snapshot transfer and only stream changes.
+    # This setting is ignored if streamChanges is false.
+    skipInitialSnapshotTransfer: false
+
+Where ``<host>``, ``<port>``, ``<region>``, ``<access-key>``, and ``<secret-key>`` should be replaced with your specific values.
+
+The Migrator also supports advanced AWS authentication options such as using `AssumeRole `_. Please read the :doc:`configuration reference ` for more details.
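+
+Putting it together, a minimal pair of ``source`` and ``target`` sections for copying a DynamoDB table into a ScyllaDB Alternator table could look like the following sketch. Host names and table names are placeholders, and the optional properties listed above (for example ``credentials`` or ``region``) may also be required depending on your setup:
+
+.. code-block:: yaml
+
+  source:
+    type: dynamodb
+    table: source_table
+    region: us-east-1
+
+  target:
+    type: dynamodb
+    table: target_table
+    endpoint:
+      host: http://scylla-node-1.example.com
+      port: 8000
+    scanSegments: 1
+    readThroughput: 1
+    throughputReadPercent: 1.0
+    maxMapTasks: 1
+    streamChanges: false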