-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data/docs] Key Concepts Page #50129
base: master
Are you sure you want to change the base?
[data/docs] Key Concepts Page #50129
Conversation
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
*Blocks* are the basic unit of data that Ray Data operates on. A block is a contiguous | ||
subset of rows from a dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*Blocks* are the basic unit of data that Ray Data operates on. A block is a contiguous | |
subset of rows from a dataset. | |
To parallelize processing every `Dataset` is split into `Blocks` -- a subset of rows distributed and processed independently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original version sounds more natural. We can add a sentence like "Blocks are distributed and processed across a Ray cluster independently" if we want
Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution | ||
(which is usually the entrypoint of the program, referred to as the *driver*) | ||
and stores the blocks as objects in Ray's shared-memory | ||
:ref:`object store <objects-in-ray>`. Underneath the hood, blocks are represented as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:ref:`object store <objects-in-ray>`. Underneath the hood, blocks are represented as | |
:ref:`object store <objects-in-ray>`. Underneath the hood, blocks could be represented as |
.. code-block:: python | ||
|
||
dataset = ray.data.read_csv("s3://my-bucket/my-file.csv") | ||
dataset = dataset.map(lambda x: x + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add to a column instead
dataset = dataset.map(lambda x: x + 1) | ||
dataset = dataset.select_columns("col1") | ||
|
||
The logical plan for this program, which you can expect by calling ``print(dataset)``, is: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logical plan for this program, which you can expect by calling ``print(dataset)``, is: | |
You can inspect logical plan by doing ``print(dataset)``: |
+- Map(<lambda>) | ||
+- Dataset(schema={...}) | ||
|
||
When a dataset's execution plan is executed, the logical plan is optimized and transformed into a *physical plan* that in turn is also optimized. A *physical plan* is a graph of *physical operators*, which contain actual implementation of the data transformation and may also handle orchestration and execution across different Ray actors/workers. Read more about Ray actors and workers in :ref:`Ray Core Concepts <core-key-concepts>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's capture the whole pipline:
- Logical Optimization
- Planning
- Physical Optimization
- Execution
And expand on these in details
|
||
When a user writes a program using the Dataset API, a *logical plan* is constructed underneath the hood. | ||
|
||
A *logical plan* represents a sequence of data transformations, each of which is represented by a *logical operator*. For example, a ``Map`` operator represents applying a function to each row of the dataset, and a ``Project`` operator represents selecting a subset of columns from the dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current docs specify - Logical plan describe “what” to do. Physical plan “how” to do it". I feel that is a good explanation.
is there a way to incorporate it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which part of the docs are you referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
The Dataset API is lazy, meaning that operations aren't executed until you call an action | ||
like :meth:`~ray.data.Dataset.show`. This allows Ray Data to optimize the execution plan | ||
and execute operations in parallel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call an action
Do you want to make it clear by saying something like "materialized or consumed"?
|
||
The following figure visualizes a dataset with three blocks, each holding 1000 rows. | ||
Ray Data holds the :class:`~ray.data.Dataset` on the process that triggers execution | ||
(which is usually the entrypoint of the program, referred to as the *driver*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a proper rst way to reference items in glossary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why adding a "Build" at the beginning of the image title?
nice. I learnt something new after reading it through |
Why are these changes needed?
Refresher for #50022, but on a separate page and a bit more holistic.
It's not tightly integrated into the other pages yet but I will do a revision of quickstart/overview/data.rst pages.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.