Internal workings of an RDD
RDDs operate in parallel. This is the strongest advantage of working in Spark: each transformation is applied to the partitions of the dataset in parallel, which can greatly increase speed.
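The idea of partition-wise parallel execution can be sketched in plain Python. This is only an analogy, not Spark code: the helpers `split_into_partitions` and `parallel_map` are hypothetical names invented for illustration, and a thread pool stands in for Spark's executors (in CPython, threads illustrate the structure rather than deliver real CPU speedups).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks (hypothetical helper)."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_map(func, data, num_partitions=4):
    """Apply func to every element, handling each partition in its own task."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Each partition is transformed independently of the others.
        results = list(pool.map(lambda part: [func(x) for x in part], partitions))
    # Concatenate the per-partition results in their original order.
    return [y for part in results for y in part]


squares = parallel_map(lambda x: x * x, list(range(10)))
print(squares)
```

Because no partition depends on another, the work can be spread over as many workers as there are partitions; this is the property that lets a transformation scale out.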
Transformations on the dataset are lazy: a transformation is only recorded when it is defined, and it is executed only when an action is called on the dataset. Because Spark sees the whole chain of transformations before running anything, it can optimize the execution.
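A toy model can make the split between transformations and actions concrete. The class below is a plain-Python sketch, not Spark's implementation: `LazyDataset` and `noisy_double` are hypothetical names, and only the method names `map`, `filter`, and `collect` are chosen to mirror the RDD API.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: only remembered, nothing is executed yet.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        # Transformation: only remembered, nothing is executed yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: the recorded steps are applied to the data now.
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)


calls = []


def noisy_double(x):
    calls.append(x)  # record every time the function really runs
    return 2 * x


lazy = LazyDataset([1, 2, 3]).map(noisy_double).filter(lambda x: x > 2)
calls_before_action = list(calls)  # no element has been processed so far
result = lazy.collect()            # the action triggers the whole chain
print(calls_before_action, result, calls)
```

Defining the chain touches no data; `noisy_double` runs only once `collect()` is called. Since the full list of steps is known at that point, a real engine has the opportunity to plan the work before executing it.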
For instance, consider the following common steps an analyst might take to get familiar with a dataset:
- Count the occurrence of distinct values in a certain column.
- Select those that start with an A.
- Print the results to the screen.
-
Simple as these steps sound, if only the items that start with the letter A are of interest, there is no point in counting distinct values for all the other items. Thus, instead of following the steps in the order listed, Spark could count only the items that start with A, and then print the results to the screen.
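The two orderings can be compared in a short plain-Python sketch. This is not Spark code, and the sample values and function names (`count_then_filter`, `filter_then_count`) are made up for illustration; it only shows that filtering first yields the same answer while counting fewer items.

```python
from collections import Counter

# Hypothetical sample column values.
values = ["Apple", "Banana", "Avocado", "Apple", "Cherry", "Banana", "Apple"]


def count_then_filter(items):
    """Follow the steps as listed: count everything, then keep keys starting with A."""
    counts = Counter(items)
    return {key: n for key, n in counts.items() if key.startswith("A")}


def filter_then_count(items):
    """Reordered plan: drop uninteresting items first, then count what is left."""
    return dict(Counter(v for v in items if v.startswith("A")))


naive = count_then_filter(values)
optimized = filter_then_count(values)
print(naive)
print(optimized)
```

Both plans produce the same counts, but the second one never counts `Banana` or `Cherry`. Reordering of this kind is only possible because the transformations are lazy, so the full plan is known before any work starts.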