In a previous post I showed how to run a simple job using AWS Elastic MapReduce (EMR). In this post I continue with EMR, this time to run a Hive job. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
To create the job in EMR I will again use the CLI (written in Ruby) supplied with EMR (for installation see here). The job that I am going to create is described in more detail in the ‘Getting Started Guide: Analyzing Big Data with AWS’.
- Create the EMR cluster
elastic-mapreduce-ruby$ ./elastic-mapreduce --create --name "My Job Flow" --hive-interactive --key-pair-file ../../../4synergy_palma.pem --enable-debugging --alive
Created job flow j-2CC8Q43IWSQ42
I enabled debugging and, like I showed here, I defined the logging directory as a parameter in my credentials.json file.
While the job is running and the cluster is being created you can see the progress by listing the job details:
elastic-mapreduce-ruby$ ./elastic-mapreduce --list -j j-2CC8Q43IWSQ42
j-2CC8Q43IWSQ42     STARTING     My Job Flow
   PENDING     Setup Hadoop Debugging
   PENDING     Setup Hive
After a few minutes:
elastic-mapreduce-ruby$ ./elastic-mapreduce --list -j j-2CC8Q43IWSQ42
j-2CC8Q43IWSQ42     STARTING     ec2-54-228-55-226.eu-west-1.compute.amazonaws.com     My Job Flow
   PENDING     Setup Hadoop Debugging
   PENDING     Setup Hive
The public DNS name of the master node is now listed, but the setup is still running, so we wait a little longer until we see this:
elastic-mapreduce-ruby$ ./elastic-mapreduce --list -j j-2CC8Q43IWSQ42
j-2CC8Q43IWSQ42     WAITING     ec2-54-228-55-226.eu-west-1.compute.amazonaws.com     My Job Flow
   COMPLETED     Setup Hadoop Debugging
   COMPLETED     Setup Hive
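Re-running the list command by hand until the state changes can also be scripted. The following is a minimal Python sketch and not part of the EMR CLI: the helper names `parse_job_flow_state`, `list_job_flow` and `wait_for_state` are made up for illustration, the path to the CLI is an assumption, and the parsing assumes that the job flow line starts with the id followed by the state, as in the listings above.

```python
import subprocess
import time


def parse_job_flow_state(list_output, job_flow_id):
    """Return the state column for job_flow_id, or None if it is not listed.

    Assumes the layout shown in the post: the job flow line starts with the
    id and the state is the second whitespace-separated column.
    """
    for line in list_output.splitlines():
        columns = line.split()
        if len(columns) >= 2 and columns[0] == job_flow_id:
            return columns[1]
    return None


def list_job_flow(job_flow_id, cli="./elastic-mapreduce"):
    """Run the same --list command as in the post and return its output."""
    return subprocess.check_output(
        [cli, "--list", "-j", job_flow_id], universal_newlines=True
    )


def wait_for_state(job_flow_id, wanted="WAITING", interval=30,
                   max_attempts=40, fetch=list_job_flow):
    """Poll until the job flow reports the wanted state, or give up."""
    for _ in range(max_attempts):
        state = parse_job_flow_state(fetch(job_flow_id), job_flow_id)
        if state == wanted:
            return state
        time.sleep(interval)
    raise TimeoutError("job flow %s did not reach %s" % (job_flow_id, wanted))
```

Calling `wait_for_state("j-2CC8Q43IWSQ42")` would then block until the cluster reports WAITING. The `max_attempts` limit is there so that the loop does not run forever if the cluster never gets that far.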
Now we can ssh into the master node by supplying the following command:
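ssh hadoop@ec2-54-228-55-226.eu-west-1.compute.amazonaws.com -i 4synergy_palma.pem
You might need to restrict the pem file so that only your own user can read it, because ssh refuses to use a key file that is accessible to others. You can do so by running: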
chmod og-rwx ~/mykeypair.pem
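If you prefer to do this from a script, the same permission change can be sketched in Python. This is only an illustration: the function name `restrict_to_owner` is made up, and it simply clears the group and other permission bits, which is what `chmod og-rwx` does on a POSIX system.

```python
import os
import stat


def restrict_to_owner(path):
    """Remove all group and other permissions, like `chmod og-rwx path`."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    # Clear the group (S_IRWXG) and other (S_IRWXO) bits, keep the owner bits.
    os.chmod(path, current & ~(stat.S_IRWXG | stat.S_IRWXO))
    return stat.S_IMODE(os.stat(path).st_mode)
```

The function returns the resulting mode, so a key file that started out with mode 644 comes back as 600.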
Add the host to the list of known hosts and we get the following startup screen:
Linux (none) 3.2.30-49.59.amzn1.i686 #1 SMP Wed Oct 3 19:55:00 UTC 2012 i686
--------------------------------------------------------------------------------
Welcome to Amazon Elastic MapReduce running Hadoop and Debian/Squeeze.
Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
/mnt/var/log/hadoop/steps for diagnosing step failures.
The Hadoop UI can be accessed via the following commands:
  JobTracker    lynx http://localhost:9100/
  NameNode      lynx http://localhost:9101/
--------------------------------------------------------------------------------
hadoop@ip-10-48-206-175:~$
Next we start up the Hive console on this node so we can add a Jar library to the Hive runtime. This Jar library is used, for instance, to have easy access to S3 buckets:
hadoop@ip-10-48-206-175:~$ hive
Logging initialized using configuration in file:/home/hadoop/.versions/hive-0.8.1/conf/hive-log4j.properties
Hive history file=/mnt/var/lib/hive_081/tmp/history/hive_job_log_hadoop_201305261845_2098337447.txt
hive> add jar /home/hadoop/hive/lib/hive_contrib.jar;
Added /home/hadoop/hive/lib/hive_contrib.jar to class path
Added resource: /home/hadoop/hive/lib/hive_contrib.jar
hive>
Now let's create the Hive table and have it represent the Apache log files that are in an S3 bucket.
Run the following command in the Hive console to create the table:
hive> CREATE TABLE serde_regex(
    > host STRING,
    > identity STRING,
    > user STRING,
    > time STRING,
    > request STRING,
    > status STRING,
    > size STRING,
    > referer STRING,
    > agent STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES (
    > "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
    > "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
    > )
    > LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';
OK
Time taken: 17.146 seconds
hive>
Now we can run Hive queries on this table. To run a job to count all records in the Apache log files:
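To get a feeling for what the "input.regex" property does, it helps to try the pattern outside Hive. The sketch below is plain Python, not Hive code: it uses the same regular expression with the string-literal escaping removed, and maps the nine capture groups onto the nine column names of the table. The sample log line and the names `APACHE_LOG`, `COLUMNS` and `parse_line` are made up for illustration, and requiring the whole line to match is my assumption about how the SerDe applies the pattern.

```python
import re

# The "input.regex" value from the CREATE TABLE statement, with the
# escaping needed inside the Hive string literal removed.
APACHE_LOG = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Column names in the order they are declared in the table.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]


def parse_line(line):
    """Map one log line onto the table's columns, or None if it does not match."""
    match = APACHE_LOG.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical sample line in Apache combined log format.
sample = ('127.0.0.1 - - [20/Jul/2009:20:12:22 -0700] '
          '"GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"')
row = parse_line(sample)
print(row["host"], row["status"], row["request"])
```

Note that the surrounding double quotes stay part of the request, referer and agent values, because they sit inside the capture groups.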
hive> select count(1) from serde_regex;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201305261839_0001, Tracking URL = http://ip-10-48-206-175.eu-west-1.compute.internal:9100/jobdetails.jsp?jobid=job_201305261839_0001
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.48.206.175:9001 -kill job_201305261839_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-05-26 19:06:46,442 Stage-1 map = 0%, reduce = 0%
2013-05-26 19:07:02,857 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 4.03 sec
2013-05-26 19:07:03,871 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 4.03 sec
... break ....
2013-05-26 19:07:59,677 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.06 sec
2013-05-26 19:08:00,709 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.06 sec
MapReduce Total cumulative CPU time: 11 seconds 60 msec
Ended Job = job_201305261839_0001
Counters:
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Accumulative CPU: 11.06 sec HDFS Read: 593 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 60 msec
OK
239344
Time taken: 111.722 seconds
hive>
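The output above reports one mapper and one reducer for this query. As a rough mental model only, and not the code that Hive actually generates, you can picture a count as every map task counting the records of its own input split and a single reduce task adding up those partial counts. The function names and the sample splits in this Python sketch are invented for illustration.

```python
def map_count(records):
    """Map side: count the records of one input split."""
    return sum(1 for _ in records)


def reduce_count(partial_counts):
    """Reduce side: add up the partial counts of all map tasks."""
    return sum(partial_counts)


# Hypothetical splits; on the cluster these would be chunks of the log files.
splits = [["line 1", "line 2", "line 3"], ["line 4", "line 5"]]
total = reduce_count(map_count(split) for split in splits)
print(total)
```

Because only the small partial counts travel to the reducer, one reduce task is enough no matter how many map tasks there are.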
To show all fields of a row:
hive> select * from serde_regex limit 1;
OK
22.214.171.124 - - [20/Jul/2009:20:12:22 -0700] "GET /gallery/main.php?g2_controller=exif.SwitchDetailMode&g2_mode=detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D15741&g2_returnName=photo HTTP/1.1" 302 5 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Time taken: 2.335 seconds
After playing around we have to terminate the cluster (so costs are kept to a minimum). Because we started the job in ‘interactive’ mode, so that we could log on to the server and run our ‘ad-hoc’ queries, the cluster does not shut down by itself and we have to terminate it ourselves:
elastic-mapreduce-ruby$ ./elastic-mapreduce --terminate j-2CC8Q43IWSQ42
Terminated job flow j-2CC8Q43IWSQ42
elastic-mapreduce-ruby$ ./elastic-mapreduce --list -j j-2CC8Q43IWSQ42
j-2CC8Q43IWSQ42     SHUTTING_DOWN     ec2-54-228-55-226.eu-west-1.compute.amazonaws.com     My Job Flow
   COMPLETED     Setup Hadoop Debugging
   COMPLETED     Setup Hive
elastic-mapreduce-ruby$
After termination we still have access to the created log files in our defined S3 bucket.
Although this might not be very useful here, because we ran the cluster in interactive mode, the option can be helpful when you bootstrap the cluster. In that case the queries run automatically and the cluster terminates when it is finished, so the log files in S3 are what remains to inspect.
One way to browse through this logging is by using the Debugging Tool of EMR. Go to the Management Console and select the EMR service. In the start screen select the Job Flow you used for this example and click the ‘Debug’ button.
Now we see the steps of our previous Job Flow. The step in which we are interested here is the Interactive Jobs step. Click on the ‘View Jobs’ link of that line.
Now we see two jobs, of which we can ‘View Tasks’ by clicking the corresponding link. Finally click the ‘View Attempts’ link of the reduce or map task and you will have access to the copied log files.
For more information about using Hive with EMR see here.