Document better the resources to allocate to the Spark executors #200

Merged: 3 commits, Aug 27, 2024
Changes from all commits
4 changes: 2 additions & 2 deletions docker-compose-tests.yml
@@ -69,8 +69,8 @@ services:
build: dockerfiles/spark
command: worker
environment:
-SPARK_WORKER_CORES: 3
-SPARK_WORKER_MEMORY: 1024m
+SPARK_WORKER_CORES: 2
+SPARK_WORKER_MEMORY: 4G
SPARK_WORKER_WEBUI_PORT: 8081
SPARK_PUBLIC_DNS: localhost
expose:
4 changes: 2 additions & 2 deletions docker-compose.yaml
@@ -17,8 +17,8 @@ services:
build: dockerfiles/spark
command: worker
environment:
-SPARK_WORKER_CORES: 3
-SPARK_WORKER_MEMORY: 1024m
+SPARK_WORKER_CORES: 2
+SPARK_WORKER_MEMORY: 4G
SPARK_WORKER_WEBUI_PORT: 8081
SPARK_PUBLIC_DNS: localhost
ports:
5 changes: 3 additions & 2 deletions docs/source/getting-started/aws-emr.rst
@@ -65,9 +65,10 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c

.. code-block:: text

-spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar
+spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml <... other arguments> /mnt1/scylla-migrator-assembly.jar

+See a complete description of the expected arguments to ``spark-submit`` in the page :doc:`Run the Migration </run-the-migration>`, and replace “<... other arguments>” above with the appropriate arguments.

-See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.

- Add a Bootstrap action to download the Migrator and the migration configuration:

3 changes: 2 additions & 1 deletion docs/source/getting-started/docker.rst
@@ -43,11 +43,12 @@ This page describes how to set up a Spark cluster locally on your machine by usi
--master spark://spark-master:7077 \
--conf spark.driver.host=spark-master \
--conf spark.scylla.config=/app/config.yaml \
+<... other arguments> \
/jars/scylla-migrator-assembly.jar

The ``spark-master`` container mounts the ``./migrator/target/scala-2.13`` dir on ``/jars`` and the repository root on ``/app``.
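As a quick sanity check, here is a minimal sketch (assuming the ``docker compose`` command and the ``spark-master`` service name referenced above) to verify that the assembly jar and the configuration file are mounted where the ``spark-submit`` invocation expects them:

.. code-block:: bash

   # Sketch: confirm the mounted assembly jar and configuration file are visible
   # inside the spark-master container before submitting the job.
   docker compose exec spark-master ls /jars/scylla-migrator-assembly.jar /app/config.yaml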

-See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.
+See a complete description of the expected arguments to ``spark-submit`` in the page :doc:`Run the Migration </run-the-migration>`, and replace “<... other arguments>” above with the appropriate arguments.

7. You can monitor progress by observing the Spark web console you opened in step 4. Additionally, after the job has started, you can track progress via ``http://localhost:4040``.

20 changes: 3 additions & 17 deletions docs/source/getting-started/index.rst
@@ -8,9 +8,9 @@ Since the Migrator is packaged as a Spark application, you first have to set up
Set Up a Spark Cluster
----------------------

-A Spark cluster is made of several *nodes*, which can contain several *workers* (although there is usually just one worker per node). When you start the Migrator, the Spark *driver* looks at the job content and splits it into tasks. It then spawns *executors* on the cluster workers and feed them with the tasks to compute.
+A Spark cluster is made of several *nodes*, which can contain several *workers* (although there is usually just one worker per node). When you start the Migrator, the Spark *driver* looks at the job content and splits it into tasks. It then spawns *executors* on the cluster workers and feeds them the tasks to compute. Since the tasks are processed in parallel, you can increase the possible throughput of the migration by increasing the number of worker nodes. Note that the migration throughput is also limited by the read throughput of the source database and the write throughput of the target database.

-We recommend provisioning at least 2 GB of memory per CPU on each node. For instance, a cluster node with 4 CPUs should have at least 8 GB of memory.
+We suggest starting with a small cluster containing a single worker node with 5 to 10 CPUs, and increasing the number of worker nodes (or the number of CPUs per node) if necessary, as long as the source and target databases are not saturated. We recommend provisioning at least 2 GB of memory per CPU on each node. For instance, a cluster node with 8 CPUs should have at least 16 GB of memory.
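For illustration, here is a minimal sketch of how such a worker node could advertise its resources in a Spark standalone setup, using the same ``SPARK_WORKER_CORES`` and ``SPARK_WORKER_MEMORY`` environment variables that appear in the ``docker-compose`` files changed by this PR (the 8 CPUs and 16 GB figures are only an example following the 2 GB per CPU guideline):

.. code-block:: bash

   # Example sizing for a single worker node with 8 CPUs and 2 GB of memory per CPU,
   # typically placed in conf/spark-env.sh of a standalone Spark installation.
   # Adjust these values to the actual hardware of your worker node.
   export SPARK_WORKER_CORES=8
   export SPARK_WORKER_MEMORY=16G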

.. caution::

@@ -36,21 +36,7 @@ Once you have a Spark cluster ready to run the ``scylla-migrator-assembly.jar``,
Run the Migration
-----------------

-The way to start the Migrator depends on how the Spark cluster was installed. Please refer to the page that describes your Spark cluster setup to see how to invoke the ``spark-submit`` command. The remainder of this section describes general options you can use to fine-tune the Migration job.
-
-We recommend using between 5 to 10 CPUs per Spark executor. For instance, if your Spark worker node has 16 CPUs, you could use 8 CPUs per executor (the Spark driver would then allocate two executors on the worker to fully utilize its resources). You can control the number of CPUs per executors with the argument ``--executor-cores`` passed to the ``spark-submit`` command:
-
-.. code-block:: bash
-
-   --executor-cores 8
-
-We also recommend using 2 GB of memory per CPU. So, if you provide 8 CPU per executor, you should require 16 GB of memory on the executor. You can control the amount of memory per executor with the argument ``--executor-memory`` passed to the ``spark-submit`` command:
-
-.. code-block:: bash
-
-   --executor-memory 16G
-
-As long as your source and target databases are not saturated during the migration, you can increase the migration throughput by adding more worker nodes to your Spark cluster.
+Start the migration by invoking the ``spark-submit`` command with the appropriate arguments, as explained in the page :doc:`/run-the-migration`.

--------------
Extra Features
3 changes: 2 additions & 1 deletion docs/source/getting-started/spark-standalone.rst
@@ -30,8 +30,9 @@ This page describes how to set up a Spark cluster on your infrastructure and to
spark-submit --class com.scylladb.migrator.Migrator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
+<... other arguments> \
<path to scylla-migrator-assembly.jar>

-See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.
+See a complete description of the expected arguments to ``spark-submit`` in the page :doc:`Run the Migration </run-the-migration>`, and replace “<spark-master-hostname>”, “<... other arguments>”, and “<path to scylla-migrator-assembly.jar>” above with appropriate values.

6. You can monitor progress from the `Spark web UI <https://spark.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging>`_.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -29,6 +29,7 @@ Migrator Spark Scala
getting-started/index
migrate-from-cassandra-or-parquet
migrate-from-dynamodb
+run-the-migration
stream-changes
rename-columns
validate
61 changes: 61 additions & 0 deletions docs/source/run-the-migration.rst
@@ -0,0 +1,61 @@
=================
Run the Migration
=================

After you have `set up a Spark cluster <./getting-started#set-up-a-spark-cluster>`_ and `configured the migration <./getting-started#configure-the-migration>`_, you can start the migration by submitting a job to your Spark cluster. The exact command to submit the job depends on how the Spark cluster was installed, so please refer to the page that describes your Spark cluster setup to see how to invoke the ``spark-submit`` command. This page describes the arguments you need to pass to ``spark-submit`` to control the resources allocated to the migration job.

-----------------------
Invoke ``spark-submit``
-----------------------

The ``spark-submit`` command submits a job to the Spark cluster. You can run it from the Spark master node. You should supply the following arguments:

.. code-block:: bash

spark-submit \
--class com.scylladb.migrator.Migrator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
--executor-cores 2 \
--executor-memory 4G \
<path to scylla-migrator-assembly.jar>

Here is an explanation of the arguments shown above (a fully filled-in example follows this list):

- ``--class com.scylladb.migrator.Migrator`` sets the entry point of the Migrator.
- ``--master spark://<spark-master-hostname>:7077`` indicates the URI of the Spark master node. Replace ``<spark-master-hostname>`` with the actual hostname of your master node.
- ``--conf spark.scylla.config=<path to config.yaml>`` indicates the location of the migration :doc:`configuration file </configuration>`. It must be a path on the Spark master node.
- ``--executor-cores 2`` and ``--executor-memory 4G`` set the CPU and memory requirements for the Spark executors. See the section `below <#executor-resources>`_ for an explanation of how to set these values.
- Finally, ``<path to scylla-migrator-assembly.jar>`` indicates the location of the program binaries. It must be a path on the Spark master node.
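For illustration only, here is a sketch of the same invocation with hypothetical values substituted for the placeholders (the hostname and paths below are made up for this example; replace them with your own):

.. code-block:: bash

   # Hypothetical values: spark-master.example.internal and the /opt/migrator paths
   # are placeholders for this sketch, not defaults of the Migrator.
   spark-submit \
     --class com.scylladb.migrator.Migrator \
     --master spark://spark-master.example.internal:7077 \
     --conf spark.scylla.config=/opt/migrator/config.yaml \
     --executor-cores 2 \
     --executor-memory 4G \
     /opt/migrator/scylla-migrator-assembly.jar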

------------------
Executor Resources
------------------

When the Spark master node starts the application, it breaks down the work into multiple tasks, and spawns *executors* on the worker nodes to compute these tasks.

.. caution:: You should explicitly indicate the CPU and memory requirements of the Spark executors; otherwise, by default, Spark will create a single executor that uses all the cores but only 1 GB of memory, which may not be enough and can lead to run-time errors such as ``OutOfMemoryError``.

The number of CPUs and the amount of memory to allocate to the Spark executors depend on the number of CPUs and the amount of memory of the Spark worker nodes.

We recommend using between 5 and 10 CPUs per Spark executor. For instance, if your Spark worker node has 16 CPUs, you could use 8 CPUs per executor (the Spark driver would then allocate two executors on the worker to fully utilize its resources). You can control the number of CPUs per executor with the argument ``--executor-cores`` passed to the ``spark-submit`` command:

.. code-block:: bash

--executor-cores 8

We also recommend using 2 GB of memory per CPU. So, if you use 8 CPUs per executor, you should allocate 16 GB of memory to the executor. You can control the amount of memory per executor with the argument ``--executor-memory`` passed to the ``spark-submit`` command:

.. code-block:: bash

--executor-memory 16G

As long as your source and target databases are not saturated during the migration, you can increase the migration throughput by adding more worker nodes to your Spark cluster.
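For example, a sketch assuming a Spark 3.x standalone cluster, where worker daemons are started with the ``sbin/start-worker.sh`` script shipped with Spark: registering an additional worker node against the master adds its cores and memory to the pool available to the executors.

.. code-block:: bash

   # On the additional worker node, assuming Spark 3.x is installed under $SPARK_HOME.
   # The worker registers itself with the master and advertises its resources.
   $SPARK_HOME/sbin/start-worker.sh spark://<spark-master-hostname>:7077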

.. caution::

To decrease the migration throughput, do not reduce the number of executor cores. If you do, Spark will simply allocate several executors to fully utilize the resources of the cluster. Instead, you can:

- use a “smaller” Spark cluster (i.e., one with fewer worker nodes, each having fewer cores),
- limit the total number of cores allocated to the application by passing the argument ``--conf spark.cores.max=2`` (see the example after this note),
- in the case of a DynamoDB migration, decrease the values of the configuration properties ``throughputReadPercent`` and ``throughputWritePercent``.
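For instance, here is a sketch that combines the arguments already shown on this page to cap the whole job at two cores in total while keeping the per-executor settings unchanged:

.. code-block:: bash

   # Sketch: limit the entire application to 2 cores with spark.cores.max.
   spark-submit \
     --class com.scylladb.migrator.Migrator \
     --master spark://<spark-master-hostname>:7077 \
     --conf spark.scylla.config=<path to config.yaml> \
     --conf spark.cores.max=2 \
     --executor-cores 2 \
     --executor-memory 4G \
     <path to scylla-migrator-assembly.jar>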
1 change: 1 addition & 0 deletions docs/source/tutorials/index.rst
@@ -4,5 +4,6 @@ Tutorials


.. toctree::
+:maxdepth: 1

dynamodb-to-scylladb-alternator/index
2 changes: 2 additions & 0 deletions tests/src/test/scala/com/scylladb/migrator/SparkUtils.scala
@@ -44,6 +44,8 @@ object SparkUtils {
"spark.driver.host=spark-master",
"--conf",
s"spark.scylla.config=/app/configurations/${migratorConfigFile}",
"--executor-cores", "2",
"--executor-memory", "4G",
// Uncomment one of the following lines to plug a remote debugger on the Spark master or worker.
// "--conf", "spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5005",
// "--conf", "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=*:5006",