---
layout: home
title: Home
custom_title: Apache Spark™ - Unified Engine for large-scale data analytics
description: Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
type: page
navigation:
  weight: 1
  show: true
---

Simple.
Fast.
Scalable.
Unified.

Key features

Batch/streaming data

Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.

SQL analytics

Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.

Data science at scale

Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.

Machine learning

Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.

Python SQL Scala Java R

Run now

Installing with 'pip'

$ pip install pyspark

$ pyspark

QuickStart Machine Learning Analytics & Data Science

{% highlight python %}
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
{% endhighlight %}

{% highlight python %}
from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])

# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)

# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)

# Fit the model to the training data
model = rf.fit(train_df)

# Generate predictions on the test dataset.
model.transform(test_df).show()
{% endhighlight %}

{% highlight python %}
df = spark.read.csv("accounts.csv", header=True)

# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")

# Generate summary statistics
filtered_df.summary().show()
{% endhighlight %}

Run now

$ docker run -it --rm apache/spark /opt/spark/bin/spark-sql

spark-sql>

{% highlight sql %}
SELECT
  name.first AS first_name,
  name.last AS last_name,
  age
FROM json.`logs.json`
WHERE age > 21;
{% endhighlight %}

Run now

$ docker run -it --rm apache/spark /opt/spark/bin/spark-shell

scala>

{% highlight scala %}
val df = spark.read.json("logs.json")
df.where("age > 21")
  .select("name.first").show()
{% endhighlight %}

Run now

$ docker run -it --rm apache/spark /opt/spark/bin/spark-shell

scala>

{% highlight java %}
Dataset<Row> df = spark.read().json("logs.json");
df.where("age > 21")
  .select("name.first").show();
{% endhighlight %}

Run now

$ $SPARK_HOME/bin/sparkR

>

{% highlight r %}
df <- read.json(path = "logs.json")
df <- filter(df, df$age > 21)
head(select(df, df$name.first))
{% endhighlight %}

The most widely used engine for scalable computing

Thousands of companies, including 80% of the Fortune 500, use Apache Spark™.
The open source project has over 2,000 contributors from industry and academia.

Ecosystem

Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.

Data science and Machine learning

SQL analytics and BI

Storage and Infrastructure

Spark SQL engine: under the hood

Apache Spark™ is built on an advanced distributed SQL engine for large-scale data.

Adaptive Query Execution

Spark SQL adapts the execution plan at runtime, for example by automatically setting the number of reducers and choosing join algorithms.

Support for ANSI SQL

Use the same SQL you’re already comfortable with.

Structured and unstructured data

Spark SQL works on structured tables and unstructured data such as JSON or images.

TPC-DS 1TB No-Stats With vs. Without Adaptive Query Execution

Accelerates TPC-DS queries up to 8x

Join the community

Spark has a thriving open source community, with contributors from around the globe building features, writing documentation, and assisting other users.
