Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data/docs] Key Concepts Page #50129

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

richardliaw
Copy link
Contributor

@richardliaw richardliaw commented Jan 29, 2025

Why are these changes needed?

Refresher for #50022, but on a separate page and a bit more holistic.

It's not tightly integrated into the other pages yet but I will do a revision of quickstart/overview/data.rst pages.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Richard Liaw <[email protected]>
@richardliaw richardliaw requested a review from a team as a code owner January 29, 2025 16:23
@richardliaw richardliaw marked this pull request as draft January 29, 2025 16:24
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
@richardliaw richardliaw added data Ray Data-related issues go add ONLY when ready to merge, run all tests labels Jan 30, 2025
@richardliaw richardliaw marked this pull request as ready for review January 30, 2025 22:31
Comment on lines +25 to +26
*Blocks* are the basic unit of data that Ray Data operates on. A block is a contiguous
subset of rows from a dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*Blocks* are the basic unit of data that Ray Data operates on. A block is a contiguous
subset of rows from a dataset.
To parallelize processing every `Dataset` is split into `Blocks` -- a subset of rows distributed and processed independently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original version sounds more natural. We can add a sentence like "Blocks are distributed and processed across a Ray cluster independently" if we want

Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution
(which is usually the entrypoint of the program, referred to as the *driver*)
and stores the blocks as objects in Ray's shared-memory
:ref:`object store <objects-in-ray>`. Underneath the hood, blocks are represented as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:ref:`object store <objects-in-ray>`. Underneath the hood, blocks are represented as
:ref:`object store <objects-in-ray>`. Underneath the hood, blocks could be represented as

.. code-block:: python

dataset = ray.data.read_csv("s3://my-bucket/my-file.csv")
dataset = dataset.map(lambda x: x + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add to a column instead

dataset = dataset.map(lambda x: x + 1)
dataset = dataset.select_columns("col1")

The logical plan for this program, which you can expect by calling ``print(dataset)``, is:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The logical plan for this program, which you can expect by calling ``print(dataset)``, is:
You can inspect logical plan by doing ``print(dataset)``:

+- Map(<lambda>)
+- Dataset(schema={...})

When a dataset's execution plan is executed, the logical plan is optimized and transformed into a *physical plan* that in turn is also optimized. A *physical plan* is a graph of *physical operators*, which contain actual implementation of the data transformation and may also handle orchestration and execution across different Ray actors/workers. Read more about Ray actors and workers in :ref:`Ray Core Concepts <core-key-concepts>`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's capture the whole pipline:

  1. Logical Optimization
  2. Planning
  3. Physical Optimization
  4. Execution

And expand on these in details


When a user writes a program using the Dataset API, a *logical plan* is constructed underneath the hood.

A *logical plan* represents a sequence of data transformations, each of which is represented by a *logical operator*. For example, a ``Map`` operator represents applying a function to each row of the dataset, and a ``Project`` operator represents selecting a subset of columns from the dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current docs specify - Logical plan describe “what” to do. Physical plan “how” to do it". I feel that is a good explanation.
is there a way to incorporate it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part of the docs are you referring to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


The Dataset API is lazy, meaning that operations aren't executed until you call an action
like :meth:`~ray.data.Dataset.show`. This allows Ray Data to optimize the execution plan
and execute operations in parallel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call an action
Do you want to make it clear by saying something like "materialized or consumed"?


The following figure visualizes a dataset with three blocks, each holding 1000 rows.
Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution
(which is usually the entrypoint of the program, referred to as the *driver*)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a proper rst way to reference items in glossary?

Copy link
Contributor

@scottsun94 scottsun94 Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why adding a "Build" at the beginning of the image title?

@scottsun94
Copy link
Contributor

scottsun94 commented Jan 31, 2025

nice. I learnt something new after reading it through

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants