README.md (+10 −10)
@@ -13,7 +13,7 @@ All the projects were homework assignments which were implemented by me.
Flight records in the USA are stored, and some of them are made available for research purposes at [Statistical Computing](http://stat-computing.org/dataexpo/2009/the-data.html). The data are separated by year, from 1987 to 2008. The attributes include the common properties a flight record has (e.g. date, origin and destination airports, air time, scheduled and actual departure and arrival times, etc.).
- During a practical course called 'Big Data Analytics Tools with Open-Source Platforms' at BME, we had a homework assignment that contained two questions. The questions had to be answered by implementing a data analysis chain that retrieves the necessary information from the input files. We could use several technologies from the Hadoop framework. I used Spark and native Java MapReduce.
+ During a practical course called 'Big Data Analytics Tools with Open-Source Platforms' at BME, we had a homework assignment that contained two questions. The questions had to be answered by implementing a data analysis chain that retrieves the necessary information from the input files. We could use several technologies from the Hadoop framework. I used Apache Spark™ and native Java MapReduce.
**We had to work with the dataset of year 2008, which is also stored on the [datasets branch](https://github.com/benedekh/bigdata-projects/tree/datasets) of this repository.**
@@ -27,11 +27,11 @@ During a practical course called 'Big Data Analytics Tools with Open-Source Plat
According to the dataset of year 2008.
- I used Spark and Java MapReduce to answer the question. The Spark solution is available [here](https://github.com/benedekh/bigdata-projects/tree/master/hu.bme.bigdata.homework.spark.flight2/), while the Java MapReduce solution is available [here](https://github.com/benedekh/bigdata-projects/tree/master/hu.bme.bigdata.homework.mapreduce.flight2/).
+ I used Apache Spark™ and Java MapReduce to answer the question. The Apache Spark™ solution is available [here](https://github.com/benedekh/bigdata-projects/tree/master/hu.bme.bigdata.homework.spark.flight2/), while the Java MapReduce solution is available [here](https://github.com/benedekh/bigdata-projects/tree/master/hu.bme.bigdata.homework.mapreduce.flight2/).
- #### Spark
+ #### Apache Spark™
- You should use Spark 1.5.1 for Hadoop 2.6. I downloaded a pre-built version [here](http://spark.apache.org/downloads.html), and I used Spark in standalone mode, without Hadoop.
+ You should use Apache Spark™ 1.5.1 for Hadoop 2.6. I downloaded a pre-built version [here](http://spark.apache.org/downloads.html), and I used Apache Spark™ in standalone mode, without Hadoop.
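The download step itself is not spelled out in this excerpt; a minimal sketch, assuming the archive name the Apache archives use for the pre-built 1.5.1 / Hadoop 2.6 package, could look like this:

```sh
# Fetch and unpack the pre-built Spark 1.5.1 package for Hadoop 2.6
# (archive name assumed from the Apache archive naming scheme).
wget https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
tar -xzf spark-1.5.1-bin-hadoop2.6.tgz
cd spark-1.5.1-bin-hadoop2.6

# Quick sanity check: start the interactive shell in local mode.
./bin/spark-shell --master "local[*]"
```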
To compile the source code of the implementation, you should use Maven:
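The actual build command falls outside this diff hunk; a plausible sketch, assuming the standard Maven lifecycle and the module directory named in the links above, is:

```sh
# Build the Spark module of the first assignment; the JAR ends up under target/.
cd hu.bme.bigdata.homework.spark.flight2
mvn clean package
```

The MapReduce module would presumably be built the same way from its own directory.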
@@ -96,11 +96,11 @@ Winter is between 1st November and 7th March. Other dates belong to summer.
According to the dataset of year 2008.
- I used Spark to answer the question. The Spark solution is available [here](https://github.com/benedekh/bigdata-projects/tree/master/hu.bme.bigdata.homework.spark.flight3/).
+ I used Apache Spark™ to answer the question. The Apache Spark™ solution is available [here](https://github.com/benedekh/bigdata-projects/tree/master/hu.bme.bigdata.homework.spark.flight3/).
- #### Spark
+ #### Apache Spark™
- You should use Spark 1.5.1 for Hadoop 2.6. I downloaded a pre-built version [here](http://spark.apache.org/downloads.html), and I used Spark in standalone mode, without Hadoop.
+ You should use Apache Spark™ 1.5.1 for Hadoop 2.6. I downloaded a pre-built version [here](http://spark.apache.org/downloads.html), and I used Apache Spark™ in standalone mode, without Hadoop.
To compile the source code of the implementation, you should use Maven:
@@ -119,21 +119,21 @@ To run from the command line:
The parameters are self-explanatory, though the partitions parameter should be set to the number of cores your CPU has (use *--partitions 1* if you are not sure how many cores your CPU has).
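For illustration only (the main class, JAR file name and the *--input* flag below are placeholders, not taken from the repository), a spark-submit invocation honouring the *--partitions* advice might look like this:

```sh
# Hypothetical run of the second assignment's Spark job in local mode.
# --class and the JAR name are placeholders; --partitions should match your core count.
spark-submit \
  --master "local[4]" \
  --class hu.bme.bigdata.homework.spark.flight3.Main \
  target/hu.bme.bigdata.homework.spark.flight3-1.0.jar \
  --input 2008.csv \
  --partitions 4
```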
### Benchmark
- I benchmarked the two Spark solutions for the questions, and the Java MapReduce implementation for the first question.
+ I benchmarked the two Apache Spark™ solutions for the questions, and the Java MapReduce implementation for the first question.
_The input data was **2008.csv**, which is available in a compressed archive [here](https://github.com/benedekh/bigdata-projects/tree/datasets) in the repository, and [here](http://stat-computing.org/dataexpo/2009/the-data.html) on the original website._
The benchmarking was done on a computer with an Intel Core i7-4700MQ @ 2.4 GHz CPU and 8 GB RAM.
- Spark was run on a VirtualBox virtual machine using 4 CPU cores and 5 GB RAM. The Spark implementations of the assignments were run with 4 partitions as a parameter. The Spark solution of the 1st assignment is called **Flight2-Spark**, and that of the 2nd assignment **Flight3-Spark**, on the figure.
+ Apache Spark™ was run on a VirtualBox virtual machine using 4 CPU cores and 5 GB RAM. The Apache Spark™ implementations of the assignments were run with 4 partitions as a parameter. The Apache Spark™ solution of the 1st assignment is called **Flight2-Spark**, and that of the 2nd assignment **Flight3-Spark**, on the figure.
Java MapReduce was run on the Cloudera VM using 4 CPU cores and 5 GB RAM. The Java MapReduce solution of the 1st assignment is called **Flight2-MR** on the figure.