This tutorial is from the Community section of the tutorials for the Hortonworks Sandbox - a single-node Hadoop cluster running in a virtual machine. Download the Sandbox to run this and other tutorials in the series.
This is a basic step-by-step tutorial on getting Hadoop and Elasticsearch to talk to each other.
To follow the steps in this tutorial, your computer must have the following items installed and running:
- Hortonworks Sandbox VM
- Elasticsearch, up and running
- the elasticsearch-hadoop jars
- Kibana (optional, but a plus)
Just download apache.zip from Costin Leau's video tutorial.
#### Elasticsearch setup
Launch Elasticsearch. We won't go into the details of how Elasticsearch works here, as that is beyond the scope of this tutorial. For more information, please refer to the Elasticsearch website.
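To quickly confirm that Elasticsearch is reachable before moving on, you can hit its HTTP endpoint (a minimal check, assuming the default port 9200 on localhost; adjust the host to your setup):

# should return a small JSON document with the cluster name and version
curl http://localhost:9200/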
#### Hortonworks Sandbox
Download and launch the sandbox. It should show a screen like this one:
Then connect to your sandbox (as the root user) and edit your /etc/hosts so that the sandbox knows your Elasticsearch cluster's IP.
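For example, the entry might look like this (a sketch with a hypothetical private IP; use your actual Elasticsearch host's address and name):

192.168.56.10   your_es_cluster_hostname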
You can now connect to the sandbox using your favorite browser at http://127.0.0.1:8000. You should see this screen:
Click on "Go to sandbox", then follow these steps :
- In file browser, upload apache.zip (and note the path - for my example it will be
/user/apache
) - Still in file browser, upload elasticsearch-hadoop jar (as for now elasticsearch-hadoop-1.3.0.M1.jar)
- In Hive Query Editor, add elasticsearch-hadoop jar to your query
- Then launch the following query (from Costin Leau tutorial)
-- create a regular Hive table and load the Apache logs into it
CREATE TABLE logs (type STRING, time STRING, ext STRING, ip STRING, req STRING, res INT, bytes INT, phpmem INT, agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/user/apache/apache.log' OVERWRITE INTO TABLE logs;

-- create an external table backed by Elasticsearch; rows written to it are indexed into the demo/hive resource
CREATE EXTERNAL TABLE eslogs (time STRING, extension STRING, clientip STRING, request STRING, response INT, agent STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'demo/hive',
'es.mapping.names' = 'time:@timestamp',
'es.nodes' = 'your_es_cluster_hostname');

-- copy the data from the Hive table into Elasticsearch
INSERT OVERWRITE TABLE eslogs SELECT s.time, s.ext, s.ip, s.req, s.res, s.agent FROM logs s;
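For reference, the jar can be registered from within the query itself using Hive's ADD JAR command (a minimal sketch; the path below is hypothetical, so point it at wherever you uploaded the jar):

-- register the elasticsearch-hadoop jar for this session (adjust the path to your upload location)
ADD JAR /path/to/elasticsearch-hadoop-1.3.0.M1.jar;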
You should now have your data injected into your Elasticsearch cluster. (Note: I had to run the queries one at a time.)
To check that everything went well, you can now launch Kibana, point it at the demo index, and you should be done.
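You can also verify directly from the command line with a quick search against the index (assuming Elasticsearch listens on its default port 9200; adjust the host to your setup):

# fetch one document from the demo index to confirm the data arrived
curl 'http://your_es_cluster_hostname:9200/demo/hive/_search?size=1&pretty'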
The next step is to query data from Elasticsearch and use it in Hadoop. Back to your sandbox:
- In the Hive Query Editor, add the elasticsearch-hadoop jar to your query (as above)
- Then launch the following query:
-- external table reading from Elasticsearch; the @timestamp field is mapped back to the time column
CREATE EXTERNAL TABLE logses (
time TIMESTAMP,
extension STRING,
clientip STRING,
request STRING,
response BIGINT,
agent STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'demo/hive',
'es.nodes' = 'your_es_cluster_hostname',
'es.mapping.names' = 'time:@timestamp');

-- stream data from Elasticsearch
SELECT * FROM logses;
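Since logses behaves like any other Hive table, regular HiveQL works over the Elasticsearch data too (a small illustrative query, not part of the original tutorial):

-- count log entries per HTTP response code, reading straight from Elasticsearch
SELECT response, count(*) AS hits FROM logses GROUP BY response;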
If everything went OK, you should see something like this: