[data/docs] Key Concepts Page #50129

richardliaw · 2025-01-29T16:23:13Z

Why are these changes needed?

Refresher for #50022, but on a separate page and a bit more holistic.

It's not tightly integrated into the other pages yet, but I will revise the quickstart/overview/data.rst pages.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Richard Liaw <[email protected]>

doc/source/data/key-concepts.rst

alexeykudinkin · 2025-01-30T23:12:27Z

doc/source/data/key-concepts.rst

+*Blocks* are the basic unit of data that Ray Data operates on. A block is a contiguous
+subset of rows from a dataset.


Suggested change

*Blocks* are the basic unit of data that Ray Data operates on. A block is a contiguous

subset of rows from a dataset.

To parallelize processing every `Dataset` is split into `Blocks` -- a subset of rows distributed and processed independently.

The original version sounds more natural. We can add a sentence like "Blocks are distributed and processed across a Ray cluster independently" if we want.
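
To make the "contiguous subset of rows" wording concrete, here is a minimal pure-Python sketch. `split_into_blocks` is a hypothetical helper written for this thread, not a Ray Data API, and the 3 x 1000 sizes simply mirror the figure described on the page.

```python
def split_into_blocks(rows, num_blocks):
    """Split rows into num_blocks contiguous slices of near-equal size."""
    base, extra = divmod(len(rows), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # The first `extra` blocks take one additional row each.
        size = base + (1 if i < extra else 0)
        blocks.append(rows[start:start + size])
        start += size
    return blocks


rows = [{"id": i} for i in range(3000)]
blocks = split_into_blocks(rows, 3)
print([len(block) for block in blocks])  # prints [1000, 1000, 1000]
```

Each block keeps the original row order, so concatenating the blocks gives back the dataset. That property is what "contiguous" is meant to convey.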

alexeykudinkin · 2025-01-30T23:13:18Z

doc/source/data/key-concepts.rst

+Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution
+(which is usually the entrypoint of the program, referred to as the *driver*)
+and stores the blocks as objects in Ray's shared-memory
+:ref:`object store <objects-in-ray>`. Underneath the hood, blocks are represented as


Suggested change

:ref:`object store <objects-in-ray>`. Underneath the hood, blocks are represented as

:ref:`object store <objects-in-ray>`. Underneath the hood, blocks could be represented as

alexeykudinkin · 2025-01-30T23:14:53Z

doc/source/data/key-concepts.rst

+.. code-block:: python
+
+    dataset = ray.data.read_csv("s3://my-bucket/my-file.csv")
+    dataset = dataset.map(lambda x: x + 1)


Let's add to a column instead
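
One possible shape for this, as a sketch only: it assumes rows reach the mapped function as dicts and that the CSV has a numeric column named `col1` (the name is borrowed from the `select_columns("col1")` line in the diff, and its type is an assumption). The loop below is a plain-Python stand-in so the snippet runs without a cluster.

```python
def add_one_to_col1(row):
    """Return a copy of the row with 1 added to the (assumed numeric) col1 column."""
    return {**row, "col1": row["col1"] + 1}


# Stand-in rows; with Ray Data the function would be passed to
# dataset.map(add_one_to_col1) rather than applied in a list comprehension.
rows = [{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}]
updated = [add_one_to_col1(row) for row in rows]
print(updated)
```

Returning a new dict instead of mutating the input keeps the example free of side effects and leaves the other columns untouched.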

alexeykudinkin · 2025-01-30T23:19:41Z

doc/source/data/key-concepts.rst

+    dataset = dataset.map(lambda x: x + 1)
+    dataset = dataset.select_columns("col1")
+
+The logical plan for this program, which you can expect by calling ``print(dataset)``, is:


Suggested change

The logical plan for this program, which you can expect by calling ``print(dataset)``, is:

You can inspect logical plan by doing ``print(dataset)``:

alexeykudinkin · 2025-01-30T23:20:28Z

doc/source/data/key-concepts.rst

+    +- Map(<lambda>)
+       +- Dataset(schema={...})
+
+When a dataset's execution plan is executed, the logical plan is optimized and transformed into a *physical plan* that in turn is also optimized. A *physical plan* is a graph of *physical operators*, which contain actual implementation of the data transformation and may also handle orchestration and execution across different Ray actors/workers. Read more about Ray actors and workers in :ref:`Ray Core Concepts <core-key-concepts>`.


Let's capture the whole pipeline:

- Logical Optimization
- Planning
- Physical Optimization
- Execution

And expand on these in detail.
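
A toy model of the four stages listed above might help when drafting that expansion. Everything here is hypothetical and deliberately simplified: the operator names, the "drop no-op" rule, and the fusion step are illustrations, not Ray Data's actual optimizer rules or classes.

```python
def optimize_logical(plan):
    """Logical optimization: drop operators that have no effect."""
    return [op for op in plan if op != ("identity",)]


def plan_physical(plan):
    """Planning: turn each logical operator ("what") into a callable ("how")."""
    impls = {
        "add": lambda n: (lambda row: {**row, "x": row["x"] + n}),
        "scale": lambda n: (lambda row: {**row, "x": row["x"] * n}),
    }
    return [impls[name](arg) for name, arg in plan]


def optimize_physical(fns):
    """Physical optimization: fuse all row-wise steps into a single pass."""
    def fused(row):
        for fn in fns:
            row = fn(row)
        return row
    return [fused]


def execute(fns, blocks):
    """Execution: run the physical operators over each block independently."""
    out = []
    for block in blocks:
        for fn in fns:
            block = [fn(row) for row in block]
        out.append(block)
    return out


logical_plan = [("add", 1), ("identity",), ("scale", 10)]
physical_plan = optimize_physical(plan_physical(optimize_logical(logical_plan)))
blocks = [[{"x": 1}, {"x": 2}], [{"x": 3}]]
result = execute(physical_plan, blocks)
print(result)  # prints [[{'x': 20}, {'x': 30}], [{'x': 40}]]
```

The logical plan stays a list of plain descriptions until planning, which is the "what" versus "how" split raised elsewhere in this review.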

gvspraveen · 2025-01-29T16:51:44Z

doc/source/data/key-concepts.rst

+
+When a user writes a program using the Dataset API, a *logical plan* is constructed underneath the hood.
+
+A *logical plan* represents a sequence of data transformations, each of which is represented by a *logical operator*. For example, a ``Map`` operator represents applying a function to each row of the dataset, and a ``Project`` operator represents selecting a subset of columns from the dataset.


The current docs say that the logical plan describes "what" to do and the physical plan describes "how" to do it. I feel that is a good explanation.
Is there a way to incorporate it here?

Which part of the docs are you referring to?

This one - https://docs.ray.io/en/latest/data/data-internals.html#operators

gvspraveen · 2025-01-29T16:54:27Z

doc/source/data/key-concepts.rst

+
+The Dataset API is lazy, meaning that operations aren't executed until you call an action
+like :meth:`~ray.data.Dataset.show`. This allows Ray Data to optimize the execution plan
+and execute operations in parallel.


"call an action"
Do you want to make this clearer by saying something like "materialized or consumed"?
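
For readers of the page, the difference between declaring and consuming could be illustrated along these lines. `LazyDataset` is a toy class invented for this comment, not Ray Data's implementation; it only shows that recorded operations run when the data is consumed.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only when consumed."""

    def __init__(self, rows, ops=()):
        self._rows = rows
        self._ops = tuple(ops)

    def map(self, fn):
        # Record the operation; nothing is executed yet.
        return LazyDataset(self._rows, self._ops + (fn,))

    def take_all(self):
        # Consuming the dataset triggers every recorded operation.
        out = []
        for row in self._rows:
            for fn in self._ops:
                row = fn(row)
            out.append(row)
        return out


calls = []


def traced_increment(row):
    """Increment x and log the call so execution timing is observable."""
    calls.append(row["x"])
    return {"x": row["x"] + 1}


ds = LazyDataset([{"x": 1}, {"x": 2}]).map(traced_increment)
calls_before_consumption = list(calls)  # still empty: map() only recorded the step
result = ds.take_all()                  # consumption runs the recorded step
print(calls_before_consumption, result, calls)
```

In this sketch "consumed" corresponds to `take_all()`. The real page could name the consuming and materializing methods it wants to highlight.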

scottsun94 · 2025-01-31T01:12:14Z

doc/source/data/key-concepts.rst

+
+The following figure visualizes a dataset with three blocks, each holding 1000 rows.
+Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution
+(which is usually the entrypoint of the program, referred to as the *driver*)


Link to https://docs.ray.io/en/latest/ray-references/glossary.html#term-Driver?

Is there a proper RST way to reference items in the glossary?

scottsun94 · 2025-01-31T01:18:04Z

doc/source/data/images/streaming-topology.svg

Why add "Build" at the beginning of the image title?

scottsun94 · 2025-01-31T01:19:51Z

Nice. I learned something new after reading it through.

key-concepts

6003c80

Signed-off-by: Richard Liaw <[email protected]>

richardliaw requested a review from a team as a code owner January 29, 2025 16:23

richardliaw marked this pull request as draft January 29, 2025 16:24

richardliaw added 2 commits January 29, 2025 22:23

execution-plan

8ec1951

Signed-off-by: Richard Liaw <[email protected]>

goodstuff

da518c1

Signed-off-by: Richard Liaw <[email protected]>

richardliaw added the "data" (Ray Data-related issues) and "go" (add ONLY when ready to merge, run all tests) labels Jan 30, 2025

richardliaw marked this pull request as ready for review January 30, 2025 22:31

alexeykudinkin reviewed Jan 30, 2025

gvspraveen reviewed Jan 30, 2025

scottsun94 reviewed Jan 31, 2025
