---
layout: home
title: Home
custom_title: Apache Spark™ - Unified Engine for large-scale data analytics
description: Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
type: page
navigation:
  weight: 1
  show: true
---
Simple. Fast. Scalable. Unified.
Key features
Batch/streaming data
Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R (see the sketch after this list).
SQL analytics
Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
Data science at scale
Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
Machine learning
Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
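
To make the batch/streaming unification concrete, here is a minimal sketch (not part of the original page) that applies one transformation to a static file and, unchanged, to a stream. It reuses the quickstart's `logs.json`; the `incoming/` directory and the console sink are illustrative assumptions.

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# One transformation, written once.
def adults(df):
    return df.where("age > 21").select("name.first")

# Batch: run it over a static JSON file.
adults(spark.read.json("logs.json")).show()

# Streaming: run the same code over JSON files arriving in a
# directory ("incoming/" is a placeholder for this sketch).
schema = spark.read.json("logs.json").schema
stream = spark.readStream.schema(schema).json("incoming/")
adults(stream).writeStream.format("console").start()
{% endhighlight %}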
Python SQL Scala Java R
Run now
Installing with 'pip'

$ pip install pyspark

$ pyspark

QuickStart Machine Learning Analytics & Data Science
{% highlight python %}
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
{% endhighlight %}
{% highlight python %}
from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])

# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)

# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)

# Fit the model to the training data
model = rf.fit(train_df)

# Generate predictions on the test dataset
model.transform(test_df).show()
{% endhighlight %}

{% highlight python %}
df = spark.read.csv("accounts.csv", header=True)

# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")

# Generate summary statistics
filtered_df.summary().show()
{% endhighlight %}

Run now

$ docker run -it --rm apache/spark /opt/spark/bin/spark-sql

spark-sql>

{% highlight sql %}
SELECT name.first AS first_name, name.last AS last_name, age
FROM json.`logs.json`
WHERE age > 21;
{% endhighlight %}
Run now

$ docker run -it --rm apache/spark /opt/spark/bin/spark-shell

scala>

{% highlight scala %}
val df = spark.read.json("logs.json")
df.where("age > 21")
  .select("name.first").show()
{% endhighlight %}
Run now

$ docker run -it --rm apache/spark /opt/spark/bin/spark-shell

scala>

{% highlight java %}
Dataset<Row> df = spark.read().json("logs.json");
df.where("age > 21")
  .select("name.first").show();
{% endhighlight %}
Run now

$ $SPARK_HOME/bin/sparkR

>

{% highlight r %}
df <- read.json(path = "logs.json")
df <- filter(df, df$age > 21)
head(select(df, df$name.first))
{% endhighlight %}

The most widely used engine for scalable computing
Thousands of companies, including 80% of the Fortune 500, use Apache Spark.
Over 2,000 contributors to the open source project from industry and academia.
Ecosystem
Apache Spark integrates with your favorite frameworks, helping to scale them to thousands of machines.
Data science and Machine learning
SQL analytics and BI
Storage and Infrastructure
Spark SQL engine: under the hood
Apache Spark is built on an advanced distributed SQL engine for large-scale data.
Adaptive Query Execution

Spark SQL adapts the execution plan at runtime, for example by automatically setting the number of reducers and choosing join algorithms.
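
As a minimal sketch of experimenting with this yourself: AQE is controlled by standard configuration keys, and is on by default in recent Spark releases.

{% highlight python %}
# Illustrative sketch: AQE is enabled by default since Spark 3.2;
# the keys below are the standard configuration switches.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce shuffle partitions (the "number of reducers") at runtime.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
{% endhighlight %}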

Support for ANSI SQL

Use the same SQL you’re already comfortable with.
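
ANSI semantics are controlled by a single configuration key (its default varies by release); a minimal sketch:

{% highlight python %}
# Illustrative sketch: enable ANSI SQL semantics, e.g. runtime errors
# instead of silent nulls on invalid casts or arithmetic overflow.
spark.conf.set("spark.sql.ansi.enabled", "true")
{% endhighlight %}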

Structured and unstructured data

Spark SQL works on structured tables and unstructured data such as JSON or images.
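
As a hedged illustration of that range, the sketch below reads a semi-structured JSON file alongside a directory of image files using Spark's built-in image data source; the paths are placeholders.

{% highlight python %}
# Illustrative sketch: one DataFrame API across different kinds of data.
people = spark.read.json("logs.json")                 # semi-structured JSON
photos = spark.read.format("image").load("images/")  # unstructured images
photos.select("image.origin", "image.width", "image.height").show()
{% endhighlight %}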

Chart: TPC-DS 1TB no-stats benchmark, with vs. without Adaptive Query Execution. AQE accelerates TPC-DS queries by up to 8x.
Join the community
Spark has a thriving open source community, with contributors from around the globe building features, writing documentation, and assisting other users.