drabastomek/PySparkCookbook

PySpark Cookbook

Code base for the PySpark Cookbook by Denny Lee and Tomasz Drabas.

Introduction

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem.

You'll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You'll then get familiar with the modules available in PySpark and start using them. You'll also discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You'll then move on to the ML and MLlib modules to tackle machine learning problems with PySpark, and use GraphFrames to solve graph-processing problems.

By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.

Table of contents:

  1. Installing and configuring Spark
  2. Abstracting data with RDDs
  3. Abstracting data with DataFrames
  4. Preparing Data for Modeling
  5. Introducing MLlib
  6. Introducing the ML Module
  7. Structured Streaming with PySpark
  8. GraphFrames - Graph Theory with PySpark

About the authors

Denny Lee is a Technical Product Marketing Manager with Databricks, working as close to Apache Spark as humanly possible. Previously, Denny was a Principal Program Manager at Microsoft on the team for Azure Cosmos DB, Microsoft's blazing-fast, planet-scale managed document store service. He is a hands-on distributed systems and data sciences engineer with more than 20 years of experience developing Internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments.

He has extensive experience building greenfield teams and acting as a turnaround and change catalyst. Prior to joining the Azure Cosmos DB team, Denny worked as a Technology Evangelist at Databricks; he has been working with Apache Spark since version 0.5. He was also the Senior Director of Data Sciences Engineering at Concur, and was on the incubation team that built Microsoft's Hadoop on Windows and Azure service (currently known as HDInsight). Denny also has a Master's in Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers for the last fifteen years.

Tomasz Drabas is a Senior Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 15 years of experience in data analytics and data science in numerous fields, including advanced technology, airlines, telecommunications, finance, and consulting, gained while working on three continents: Europe, Australia, and North America. While in Australia, Tomasz worked on his PhD in Operations Research with a focus on choice modeling and revenue management applications in the airline industry.

At Microsoft, Tomasz works with big data on a daily basis, using Spark to solve machine learning problems such as anomaly detection, churn prediction, and pattern recognition.

Tomasz also co-authored Learning PySpark with Denny Lee in 2017 and authored the Practical Data Analysis Cookbook (Python focused), published by Packt Publishing in 2016.

You can purchase our books and videos from
