Running Hive jobs on AWS EMR

In a previous post I showed how to run a simple job using AWS Elastic MapReduce (EMR). In this example we continue to make use of EMR, but this time to run a Hive job. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
To create the job in EMR I will again make use of the CLI (written in Ruby) supplied with EMR (for installation see here). The job that I am going to create is described in more detail in the ‘Getting Started Guide: Analyzing Big Data with AWS’.

  • Create the EMR cluster
  • elastic-mapreduce-ruby$ ./elastic-mapreduce --create --name "My Job Flow" --hive-interactive --key-pair-file ../../../4synergy_palma.pem --enable-debugging --alive
    Created job flow j-2CC8Q43IWSQ42
    

    I enabled debugging and, as I showed here, I defined the logging directory as a parameter in my credentials.json file (a sketch of such a file is shown below).
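
    For reference, a minimal sketch of what such a credentials.json can look like is shown below, following the format documented for the EMR Ruby CLI. The access keys, the log bucket (my-log-bucket) and the key pair name are placeholders or assumptions; replace them with your own values:

    {
      "access_id": "<your AWS access key id>",
      "private_key": "<your AWS secret access key>",
      "keypair": "4synergy_palma",
      "key-pair-file": "../../../4synergy_palma.pem",
      "log_uri": "s3n://my-log-bucket/emr-logs/",
      "region": "eu-west-1"
    }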

    While the cluster is being created you can follow the progress by listing the job flow details:

    elastic-mapreduce-ruby$ ./elastic-mapreduce --list -j j-2CC8Q43IWSQ42
    j-2CC8Q43IWSQ42     STARTING                                                         My Job Flow
       PENDING        Setup Hadoop Debugging
       PENDING        Setup Hive
    

    After a few minutes:

    elastic-mapreduce-ruby$ ./elastic-mapreduce --list -j j-2CC8Q43IWSQ42
    j-2CC8Q43IWSQ42     STARTING       ec2-54-228-55-226.eu-west-1.compute.amazonaws.com My Job Flow
       PENDING        Setup Hadoop Debugging
       PENDING        Setup Hive
    

    We see that a public DNS name for the master node is now provided, but the setup is still running, so we wait a little longer until we see this:

    elastic-mapreduce-ruby$ ./elastic-mapreduce --list -j j-2CC8Q43IWSQ42
    j-2CC8Q43IWSQ42     WAITING        ec2-54-228-55-226.eu-west-1.compute.amazonaws.com My Job Flow
       COMPLETED      Setup Hadoop Debugging
       COMPLETED      Setup Hive
    
  • Log in to the EMR cluster (master node)
  • Now we can ssh into the master node by supplying the following command:
    ssh hadoop@ec2-54-228-55-226.eu-west-1.compute.amazonaws.com -i 4synergy_palma.pem
    You might need to restrict the permissions on the .pem file so that ssh accepts it for the user you log in with. You can do so by running chmod og-rwx ~/mykeypair.pem

    After adding the host to the list of known hosts we get the following startup screen:

    Linux (none) 3.2.30-49.59.amzn1.i686 #1 SMP Wed Oct 3 19:55:00 UTC 2012 i686
    --------------------------------------------------------------------------------
    
    Welcome to Amazon Elastic MapReduce running Hadoop and Debian/Squeeze.
    
    Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
    /mnt/var/log/hadoop/steps for diagnosing step failures.
    
    The Hadoop UI can be accessed via the following commands:
    
      JobTracker    lynx http://localhost:9100/
      NameNode      lynx http://localhost:9101/
    
    --------------------------------------------------------------------------------
    hadoop@ip-10-48-206-175:~$
    
  • Start up and configure Hive
  • Next we start the Hive console on this node so we can add a JAR library to Hive's runtime. This JAR contains, among other things, the RegexSerDe that we will use in the table definition below:

    hadoop@ip-10-48-206-175:~$ hive
    Logging initialized using configuration in file:/home/hadoop/.versions/hive-0.8.1/conf/hive-log4j.properties
    Hive history file=/mnt/var/lib/hive_081/tmp/history/hive_job_log_hadoop_201305261845_2098337447.txt
    hive> add jar /home/hadoop/hive/lib/hive_contrib.jar;
    Added /home/hadoop/hive/lib/hive_contrib.jar to class path
    Added resource: /home/hadoop/hive/lib/hive_contrib.jar
    hive>
    

    Now let's create a Hive table that represents the Apache log files stored in an S3 bucket.
    Run the following command in the Hive console to create the table:

    hive> CREATE TABLE serde_regex(
        > host STRING,
        > identity STRING,
        > user STRING,
        > time STRING,
        > request STRING,
        > status STRING,
        > size STRING,
        > referer STRING,
        > agent STRING)
        > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
        > WITH SERDEPROPERTIES (
        > "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^
        > \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^
        > \"]*|\"[^\"]*\"))?",
        > "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
        > )
        > LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';
    OK
    Time taken: 17.146 seconds
    hive>
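
    To verify that the table was created as expected you can describe it. The following statements are standard Hive and were not part of the original session; they list the tables and the columns with their types:

    hive> SHOW TABLES;
    hive> DESCRIBE serde_regex;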
    
  • Run Hive queries
  • Now we can run Hive queries against this table. To run a job that counts all records in the Apache log files:

    hive> select count(1) from serde_regex;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_201305261839_0001, Tracking URL = http://ip-10-48-206-175.eu-west-1.compute.internal:9100/jobdetails.jsp?jobid=job_201305261839_0001
    Kill Command = /home/hadoop/bin/hadoop job  -Dmapred.job.tracker=10.48.206.175:9001 -kill job_201305261839_0001
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2013-05-26 19:06:46,442 Stage-1 map = 0%,  reduce = 0%
    2013-05-26 19:07:02,857 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 4.03 sec
    2013-05-26 19:07:03,871 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 4.03 sec
    ... (output omitted) ...
    2013-05-26 19:07:59,677 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.06 sec
    2013-05-26 19:08:00,709 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.06 sec
    MapReduce Total cumulative CPU time: 11 seconds 60 msec
    Ended Job = job_201305261839_0001
    Counters:
    MapReduce Jobs Launched:
    Job 0: Map: 1  Reduce: 1   Accumulative CPU: 11.06 sec   HDFS Read: 593 HDFS Write: 7 SUCCESS
    Total MapReduce CPU Time Spent: 11 seconds 60 msec
    OK
    239344
    Time taken: 111.722 seconds
    hive>
    

    To show all fields of a row:

    hive> select * from serde_regex limit 1;
    OK
    66.249.67.3	-	-	[20/Jul/2009:20:12:22 -0700]	"GET /gallery/main.php?g2_controller=exif.SwitchDetailMode&g2_mode=detailed&g2_return=%2Fgallery%2Fmain.php%3Fg2_itemId%3D15741&g2_returnName=photo HTTP/1.1"	302	5	"-"	"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    Time taken: 2.335 seconds
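
    As a further example of the ad-hoc analysis that is now possible, the following query counts the requests per HTTP status code. It was not part of the original session, but it only uses the table defined above:

    hive> SELECT status, COUNT(*) AS requests
        > FROM serde_regex
        > GROUP BY status
        > ORDER BY requests DESC;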
    
  • Terminate the cluster
  • After playing around we have to terminate the cluster (so costs are kept to a minimum). Because we started the job flow in ‘interactive’ mode, which allowed us to log on to the server and run our ‘ad-hoc’ queries, we have to terminate it ourselves:

    elastic-mapreduce-ruby$ ./elastic-mapreduce --terminate  j-2CC8Q43IWSQ42
    Terminated job flow j-2CC8Q43IWSQ42
    elastic-mapreduce-ruby$ ./elastic-mapreduce --list -j j-2CC8Q43IWSQ42
    j-2CC8Q43IWSQ42     SHUTTING_DOWN  ec2-54-228-55-226.eu-west-1.compute.amazonaws.com My Job Flow
       COMPLETED      Setup Hadoop Debugging
       COMPLETED      Setup Hive
    elastic-mapreduce-ruby$
    
  • Analyze the log files
  • After termination we still have access to the log files that were created in the S3 bucket we defined:
    [Screenshot: the EMR log files in the S3 bucket]
    Although this might not be very useful in this case, because we ran the cluster in interactive mode, this option can be very helpful when you start the cluster non-interactively with the queries as a step. In that case the queries run automatically and the cluster terminates itself when it is finished, while the copied log files remain available in S3 (see the sketch after this list).
    One way to browse through these logs is to use the EMR Debugging Tool. Go to the Management Console and select the EMR service. In the start screen select the job flow you used for this example and click the ‘Debug’ button:
    [Screenshot: the steps of the job flow in the EMR Debugging Tool]
    Now we see the steps of our previous job flow. The step we are interested in here is ‘Interactive Jobs’. Click the ‘View Jobs’ link on that line:
    [Screenshot: the jobs of the ‘Interactive Jobs’ step]
    Now we see two jobs for which we can ‘View Tasks’ by clicking the corresponding link. Finally, click ‘View Attempts’ for the map or reduce task and you will have access to the copied log files:
    [Screenshot: the task attempts with links to the copied log files]
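
As noted above, running the job flow with --alive and querying it interactively is mainly convenient for experimenting. For a non-interactive run you can pass a Hive script as a step when creating the job flow, so the queries run automatically and the cluster terminates itself once they have finished. A rough sketch with the EMR Ruby CLI is shown below; the script location s3://my-bucket/scripts/count.q is a placeholder for a Hive script (for example the CREATE TABLE statement and the count query from above) that you have uploaded to S3 yourself:

    elastic-mapreduce-ruby$ ./elastic-mapreduce --create --name "My Hive Batch Job" \
        --hive-script \
        --arg s3://my-bucket/scripts/count.q \
        --enable-debugging

Because --alive is omitted, the cluster shuts down after the step has completed, and the step and task logs end up in the log directory configured in credentials.json.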

For more information about using Hive with EMR see here.
