Data Algorithms with Spark by Mahmoud Parsian
"... This book will be a great resource for both readers looking to implement existing algorithms in a scalable fashion and readers who are developing new, custom algorithms using Spark. ..." Dr. Matei Zaharia, Original Creator of Apache Spark
Foreword by Dr. Matei Zaharia (Original Creator of Apache Spark)
Author: Mahmoud Parsian
- This new O'Reilly book is the successor edition of Data Algorithms (published by O'Reilly)
- This book uses PySpark (much simpler and more readable)
- @OReillyMedia: Data Algorithms with Spark, by @mahmoudparsian
- Author contact: [Email] [Mahmoud Parsian @LinkedIn] [Mahmoud Parsian @GitHub]
- This GitHub repository will host all source code and scripts for Data Algorithms with Spark
- Chapter solutions are provided in PySpark and Scala
  - PySpark solutions are provided by Mahmoud Parsian
  - Scala solutions are provided by Deepak Kumar and Biman Mandal

All programs are tested with the following software:

| Spark | Python | Scala | Java |
|---|---|---|---|
| Apache Spark 3.4.0 | Python 3.10.5 | Scala 2.13 | Java 11 |

| Chapter | Title |
|---|---|
| Glossary | Glossary of Big Data, MapReduce, Spark |
| Chapter 1 | Introduction to Data Algorithms |
| Chapter 2 | Transformations in Action |
| Chapter 3 | Mapper Transformations |
| Chapter 4 | Reductions in Spark |
| Chapter 5 | Partitioning Data |
| Chapter 6 | Graph Algorithms |
| Chapter 7 | Interacting with External Data Sources |
| Chapter 8 | Ranking Algorithms |
| Chapter 9 | Fundamental Data Design Patterns |
| Chapter 10 | Common Data Design Patterns |
| Chapter 11 | Join Design Patterns |
| Chapter 12 | Feature Engineering in PySpark |

| Bonus Chapter | Title / Description |
|---|---|
| Glossary | Glossary of Big Data, MapReduce, Spark |
| Word Count | Solutions for Word Count using RDDs and DataFrames |
| Anagrams | Find words that are anagrams of each other |
| Lambda Expressions | Using Lambda Expressions in PySpark programs |
| TF-IDF | Term Frequency - Inverse Document Frequency |
| K-mers | K-mers for DNA Sequences |
| Correlation | All vs. All Correlation |
| Mapping Partitions | mapPartitions() Complete Example |
| UDF | User-Defined Function Examples |
| DataFrames Transformations | Examples on Creation and Transformation of DataFrames |
| DataFrames Tutorials | Creating DataFrames from collections and CSV text files |
| Join Operations | Examples on join of RDDs and DataFrames |
| PySpark Tutorial 101 | Examples on using PySpark RDDs and DataFrames |
| Physical Data Partitioning | Tutorial on Physical Data Partitioning |
| Monoids and Combiners | Monoid as a Design Principle |