PURPOSE: This chapter explains how to use Docker to create a Hadoop cluster and a Big Data application in Java. It highlights several concepts, such as service scaling, dynamic port allocation, container links, integration tests, and debugging.
Big Data applications usually involve distributed processing with tools like Hadoop or Spark. These services can be scaled up, running with several nodes to support more parallelism. Running tools like Hadoop and Spark on Docker makes it easy to scale them up and down. This is very useful for simulating a cluster at development time and for running integration tests before taking your application to production.
The application in this example reads a file, counts how many words are in it using a MapReduce job implemented with Hadoop, and then saves the result to a MongoDB database. To do that, we will run a Hadoop cluster and a MongoDB server on Docker.
Note
Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. The Hadoop framework itself is mostly written in Java.
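Conceptually, the word-count job boils down to a map step (split each line into words) and a reduce step (sum the per-line counts). The plain-Java sketch below shows that core logic only; the real application implements it as Hadoop Mapper/Reducer classes, and all class and method names here are illustrative, not the project's actual code.

```java
import java.util.Arrays;

// Minimal sketch of the word-count logic the MapReduce job performs.
// Names are hypothetical; the real job uses Hadoop's Mapper/Reducer API.
public class WordCountSketch {

    // "map" phase: split one line of text into words
    static String[] map(String line) {
        return line.trim().split("\\s+");
    }

    // "reduce" phase: sum the word counts emitted for each line
    static long reduce(long... perLineCounts) {
        return Arrays.stream(perLineCounts).sum();
    }

    public static void main(String[] args) {
        String[] lines = {"hello big data", "hadoop on docker"};
        long total = reduce(
                Arrays.stream(lines).mapToLong(l -> map(l).length).toArray());
        System.out.println(total); // 6
    }
}
```

In the real job, Hadoop distributes the map step across the cluster nodes and shuffles the intermediate counts to the reducers; the arithmetic is the same.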
Clone the project at https://github.com/fabianenardon/hadoop-docker-demo
Inspect the sample/docker/docker-compose.yml file. It defines a MongoDB service and the services needed to run a Hadoop cluster. It also defines a service for our application. See how the services are linked together.
cd sample
mvn clean install -Papp-docker-image
In the command above, -Papp-docker-image activates the app-docker-image profile, defined in the application's pom.xml. This profile creates a dockerized version of the application, building two images:
- docker-hadoop-example: Docker image used to run the application
- docker-hadoop-example-tests: Docker image used to run the integration tests
Go to the sample/docker folder and start the services:
cd docker
docker-compose up -d
See the logs and wait until everything is up:
docker-compose logs -f
Here is a sample output:
Attaching to docker_nodemanager_1, docker-hadoop-example, yarn, secondarynamenode, docker_datanode_1, namenode, mongo
docker-hadoop-example | Usage: hdfs [--config confdir] [--loglevel loglevel] COMMAND
docker-hadoop-example | where COMMAND is one of:
. . .
nodemanager_1 | 17/10/11 18:26:27 INFO nodemanager.NodeManager: STARTUP_MSG:
docker-hadoop-example | zkfc run the ZK Failover Controller daemon
nodemanager_1 | /************************************************************
docker-hadoop-example | datanode run a DFS datanode
nodemanager_1 | STARTUP_MSG: Starting NodeManager
docker-hadoop-example | dfsadmin run a DFS admin client
datanode_1 | 17/10/11 18:26:25 INFO datanode.DataNode: STARTUP_MSG:
nodemanager_1 | STARTUP_MSG: host = db2d63621ba4/172.23.0.8
namenode | FORMATTING NAMENODE
. . .
secondarynamenode | STARTUP_MSG: Starting SecondaryNameNode
datanode_1 | STARTUP_MSG: build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41; compiled by 'jenkins' on 2016-01-26T00:08Z
docker-hadoop-example | oev apply the offline edits viewer to an edits file
namenode | STARTUP_MSG: build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41; compiled by 'jenkins' on 2016-01-26T00:08Z
nodemanager_1 | 17/10/11 18:26:27 INFO nodemanager.NodeManager: registered UNIX signal handlers for [TERM, HUP, INT]
secondarynamenode | STARTUP_MSG: host = secondarynamenode/172.23.0.5
datanode_1 | STARTUP_MSG: java = 1.8.0_112
. . .
namenode | 17/10/11 18:27:31 INFO namenode.TransferFsImage: Transfer took 0.00s at 0.00 KB/s
namenode | 17/10/11 18:27:31 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000000015 size 946 bytes.
namenode | 17/10/11 18:27:31 INFO namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 0
secondarynamenode | 17/10/11 18:27:32 INFO namenode.TransferFsImage: Uploaded image with txid 15 to namenode at http://namenode:50070 in 0.115 seconds
secondarynamenode | 17/10/11 18:27:32 WARN namenode.SecondaryNameNode: Checkpoint done. New Image Size: 946
In order to see if everything is up, open http://localhost:8088/cluster. You should see 1 active node when everything is up and running.
This application reads a text file from HDFS and counts how many words it has. The result is saved to MongoDB.
First, create a folder on HDFS. We will store the file to be processed in it:
docker-compose exec yarn hdfs dfs -mkdir /files/
In the command above, we are executing hdfs dfs -mkdir /files/ on the yarn service. This command creates a new folder called /files/ on HDFS, the distributed file system used by Hadoop.
Put the file we are going to process on HDFS:
docker-compose run docker-hadoop-example \
hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
The text_for_word_count.txt file was added to the application image by Maven when we built it, so we can use it for testing. The command above transfers the text_for_word_count.txt file from the local disk to the /files/ folder on HDFS, so the Hadoop job can access it.
Run our application:
docker-compose run docker-hadoop-example \
hadoop jar /maven/jar/docker-hadoop-example-1.0-SNAPSHOT-mr.jar \
hdfs://namenode:9000 /files mongo yarn:8050
The command above will run our jar file on the Hadoop cluster. The hdfs://namenode:9000 parameter is the HDFS address. The /files parameter is where the file to process can be found on HDFS. The mongo parameter is the MongoDB host address. The yarn:8050 parameter is the Hadoop YARN address, where the MapReduce job will be deployed. Note that since we are running the Hadoop components (namenode, yarn), MongoDB, and our application as Docker services, they can all find each other, and we can use the service names as host addresses.
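The four positional arguments can be pictured as a small configuration object inside the application. The sketch below is an assumption about how the arguments map to settings; the real main class in the repository may parse them differently, and the class and field names here are hypothetical.

```java
// Hypothetical holder for the four command-line arguments passed to the jar.
// The argument order matches the command above: HDFS URI, input path,
// MongoDB host, YARN address.
public class JobArgs {
    final String hdfsUri;     // e.g. hdfs://namenode:9000
    final String inputPath;   // e.g. /files
    final String mongoHost;   // e.g. mongo
    final String yarnAddress; // e.g. yarn:8050

    JobArgs(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "usage: <hdfs-uri> <input-path> <mongo-host> <yarn-address>");
        }
        hdfsUri = args[0];
        inputPath = args[1];
        mongoHost = args[2];
        yarnAddress = args[3];
    }

    public static void main(String[] args) {
        JobArgs parsed = new JobArgs(new String[] {
            "hdfs://namenode:9000", "/files", "mongo", "yarn:8050"});
        System.out.println(parsed.yarnAddress); // yarn:8050
    }
}
```

Because the hosts are plain Docker service names, the same jar can run against any compose environment (development, test) just by changing these arguments.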
If you go to http://localhost:8088/cluster, you can see your job running and, once it completes, its final status.
If everything ran successfully, you should be able to see the results in MongoDB.
Connect to the Mongo container:
docker-compose exec mongo mongo
When connected, type:
use mongo_hadoop
db.word_count.find();
You should see the results of running the application. Something like this:
> db.word_count.find();
{ "_id" : "Counts on Sat Mar 18 18:16:20 UTC 2017", "words" : 256 }
If you want, you can scale your cluster, adding more Hadoop nodes to it:
docker-compose scale nodemanager=2
This means that you want to have 2 nodes in your Hadoop cluster. Go to http://localhost:8088/cluster and refresh until you see 2 active nodes.
The trick to scaling the nodes is to use dynamically allocated ports, letting Docker assign a different host port to each new nodemanager. You can see this approach in this snippet of the docker-compose.yml file:
nodemanager:
  image: tailtarget/hadoop:2.7.2
  command: yarn nodemanager
  ports:
    - "8042" # local port dynamically assigned. allows node to be scaled up and down
  links:
    - namenode
    - datanode
    - yarn
  hostname: nodemanager
Stop all the services:
docker-compose down
Note that since our docker-compose.yml file defines volume mappings for HDFS and MongoDB, the next time you start the services your data will still be there.
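Such persistence comes from volume mappings along these lines. This is a hypothetical fragment, not the project's actual configuration; the host and container paths are assumptions, so check the docker-compose.yml in the repository for the real ones:

```yaml
# Hypothetical volume mappings -- paths are assumptions for illustration.
namenode:
  volumes:
    - ./data/namenode:/opt/hadoop/dfs/name  # HDFS metadata survives restarts
mongo:
  volumes:
    - ./data/mongo:/data/db                 # MongoDB data survives restarts
```

Because the data lives on the host, docker-compose down removes the containers but not the mapped directories.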
Debugging distributed Hadoop applications can be cumbersome. However, you can configure your environment to use the Docker Hadoop cluster and debug your code easily from an IDE.
First, make sure your services are up:
docker-compose up -d
Then, add this to your /etc/hosts file:
127.0.0.1 datanode
127.0.0.1 yarn
127.0.0.1 namenode
127.0.0.1 secondarynamenode
127.0.0.1 nodemanager
This configuration will allow you to access the Docker Hadoop cluster from your IDE.
Then, open the project from https://github.com/fabianenardon/hadoop-docker-demo in NetBeans (or any other IDE) and run the application file. Note that you will be connecting to the Docker services at localhost.
You can also set a breakpoint in your application and debug it.
Shut down the services:
docker-compose down
When running integration tests, you want to test your application in an environment as close to production as possible, so you can test the interactions between the various components: services, databases, network communication, etc. Fortunately, Docker can help you a lot with integration tests.
There are several strategies to run integration tests, but in this application we are going to use the following:
- Start the services with a docker-compose.yml file created for testing purposes. This file won't have any volumes mapped, so when the test is over, no state will be saved. The test docker-compose.yml file won't publish any ports on the host machine, so we can run simultaneous tests.
- Run the application, using the services started with the docker-compose.yml test file.
- Run Maven integration tests to check whether the application execution produced the expected results. This is done by checking what was saved in the MongoDB database.
- Stop the services. No state will be stored, so the next time you run the integration tests, you will have a clean environment.
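The verification step essentially reads the word_count collection and asserts on the words field. The sketch below shows that assertion logic in plain Java, with the MongoDB read stubbed out as a Map; the real integration tests use a MongoDB client against the mongo service, and all names here are hypothetical.

```java
import java.util.Map;

// Sketch of the kind of check the Maven integration tests perform after the
// job runs. fetchResultDocument() stands in for querying the word_count
// collection; the real tests read it through the MongoDB Java driver.
public class WordCountVerificationSketch {

    // Stand-in for db.word_count.find() returning one result document
    static Map<String, Object> fetchResultDocument() {
        return Map.of("_id", "Counts on Sat Mar 18 18:16:20 UTC 2017",
                      "words", 256);
    }

    // Does the stored document report the expected word count?
    static boolean hasExpectedCount(Map<String, Object> doc, int expected) {
        Object words = doc.get("words");
        return words instanceof Integer && (Integer) words == expected;
    }

    public static void main(String[] args) {
        System.out.println(hasExpectedCount(fetchResultDocument(), 256)); // true
    }
}
```

Because the test compose file maps no volumes, this check always runs against data produced by the current test run, never by a previous one.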
Here is how to execute this strategy, step by step. The complete source code for this is in the sample directory of https://github.com/fabianenardon/hadoop-docker-demo.
Start the services with the test configuration:
docker-compose --file src/test/resources/docker-compose.yml up -d
Make sure all services are started, then create the folder we need on HDFS for the test:
docker-compose --file src/test/resources/docker-compose.yml exec yarn hdfs dfs -mkdir /files/
Put the test file on HDFS:
docker-compose --file src/test/resources/docker-compose.yml \
run docker-hadoop-example \
hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
Run the application:
docker-compose --file src/test/resources/docker-compose.yml \
run docker-hadoop-example \
hadoop jar /maven/jar/docker-hadoop-example-1.0-SNAPSHOT-mr.jar \
hdfs://namenode:9000 /files mongo yarn:8050
Run our integration tests:
docker-compose --file src/test/resources/docker-compose.yml \
run docker-hadoop-example-tests mvn -f /maven/code/pom.xml \
-Dmaven.repo.local=/m2/repository -Pintegration-test verify
Stop all the services:
docker-compose --file src/test/resources/docker-compose.yml down
If you want to remote debug tests, run the tests this way instead:
docker run -v ~/.m2:/m2 -p 5005:5005 \
--link mongo:mongo \
--net resources_default \
docker-hadoop-example-tests \
mvn -f /maven/code/pom.xml \
-Dmaven.repo.local=/m2/repository \
-Pintegration-test verify \
-Dmaven.failsafe.debug
Running with this configuration, the application will wait until an IDE connects for remote debugging on port 5005.
See more about integration tests in the CI/CD using Docker chapter.