-
Notifications
You must be signed in to change notification settings - Fork 125
Creating RDDs
There are two ways to create an RDD in PySpark: you can either .parallelize(...) a collection (list or an array of some elements):
data = sc.parallelize([('Amber', 22), ('Alfred', 23), ('Skye',4), ('Albert', 12), ('Amber', 9)])
Object holding data type - ParallelCollectionRDD
Or you can reference a file (or files) located either locally or somewhere externally:
data_from_file = sc. textFile('/Users/drabast/Documents/VS14MORT.txt.gz',4)
Object holding data type - MapPartitionsRDD The last parameter in sc.textFile(..., n) specifies the number of partitions the dataset is divided into.
RDDs are schema-less data structures (unlike DataFrames, which we will discuss in the next). Thus, parallelizing a dataset, such as in the following code snippet, is perfectly fine with Spark when using RDDs:
data_heterogenous = sc.parallelize([ ('Ferrari', 'fast'), {'Porsche': 100000}, ['Spain','visited', 4504] ]).collect()
So, we can mix almost anything: a tuple, a dict, or a list and Spark will not complain.
Once you .collect() the dataset (that is, run an action to bring it back to the driver) you can access the data in the object as you would normally do in Python:
data_heterogenous[1]['Porsche'] 100000
The .collect() method returns all the elements of the RDD to the driver where it is serialized as a list.
- Understanding Spark
- Spark Jobs & API
- Architecture
- RDD Internals
- Creating RDD
- Understanding Deployment & Program Behaviour
- RDD Transformation & Action
- Assignments 1
- Best Practices -1
- Introduction to DataFrame
- PySpark SQL
- Pandas to DataFrames
- Machine Learning with PySpark
- Transformers
- Estimators
- Spark Streaming
- Structured Streaming
- GraphX & GraphFrames
- Data Processing Architectures
- Problems