Running MapReduce Design Patterns on Cloudera’s CDH5

One of the better books I have read so far about MapReduce is ‘MapReduce Design Patterns‘, as I mentioned in my previous post. In this post I describe the steps to get the Hadoop source code that goes with the book running on Cloudera’s latest Hadoop distribution, CDH5. I decided to make use of HDFS and YARN for testing the patterns. Take the following steps to get it all up and running:

  • Get CDH5 and run it
  • Install IntelliJ IDEA
  • Upgrade GIT client
  • Create local directory
  • Checkout source code
  • Install source data
  • Run the job

Please note that some of these steps are optional and just reflect my personal preferences. Let’s go through each step in more detail.

  • Get CDH5 and run it
  • I chose to download the Cloudera QuickStart VM. You can simply run this on a Mac with VMware Fusion or VirtualBox, and it comes with all the Hadoop tools needed to get started. After downloading, follow these guidelines to get it up and running. A quick sanity check is shown below.
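
    Once the VM has booted you can verify from a terminal that the Hadoop services are available. These are standard Hadoop shell commands, nothing specific to this setup:

    # show the installed Hadoop version
    hadoop version

    # list the root of the HDFS filesystem; this fails if HDFS is not running
    hdfs dfs -ls /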

  • Install IntelliJ IDEA
  • As I prefer IntelliJ over Eclipse, I installed IntelliJ on the VM. How to do this is well described here; a rough sketch of the installation is shown below.
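
    For reference, the installation roughly comes down to downloading the tarball and starting the IDE. The version number below is just an example; check the linked guide for the exact steps:

    # example only: download and unpack IntelliJ IDEA Community Edition
    wget https://download.jetbrains.com/idea/ideaIC-13.1.4.tar.gz
    tar xzf ideaIC-13.1.4.tar.gz

    # start the IDE (the directory name depends on the build)
    ./idea-IC-*/bin/idea.sh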

  • Upgrade GIT client
  • To be able to use Git from within IntelliJ IDEA I had to upgrade the Git client on the VM. This is described here; the gist of it is sketched below.
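
    One common approach on the CentOS based QuickStart VM is to build a recent Git from source. The version below is just an example, and the linked post may describe a different route:

    # install the build dependencies
    sudo yum install curl-devel expat-devel gettext-devel openssl-devel zlib-devel gcc perl-ExtUtils-MakeMaker

    # download, build and install a recent Git release
    wget --no-check-certificate https://www.kernel.org/pub/software/scm/git/git-2.1.0.tar.gz
    tar xzf git-2.1.0.tar.gz
    cd git-2.1.0
    make prefix=/usr/local all
    sudo make prefix=/usr/local install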

  • Create local directory
  • Create a local directory named ‘play_area’ in ‘/home/cloudera/’ by running the following commands on the VM:
    cd ~
    mkdir play_area

    We use this directory to run our compiled and packaged jobs from.

  • Checkout source code
  • The ‘raw’ version of the source code that goes with the book is made available in Git by the author of the book here. Since I wanted to be able to play around with it, and the Maven pom file was missing, I forked the repository and added a pom to it. My version can be found here. Clone it with Git from within IntelliJ to get the necessary sources.
    If all is well you can now run ‘mvn clean package’ to build the jar containing the Hadoop jobs, as sketched below. The jar should end up in the directory ‘/home/cloudera/play_area/’.
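
    Assuming the repository was cloned into ‘/home/cloudera/play_area/’ and the clone is named ‘mapreduce-patterns’ (both are examples, adjust them to your situation), the build looks like this:

    # build the jar from the root of the cloned repository
    cd ~/play_area/mapreduce-patterns
    mvn clean package

    # copy the jar to the play area in case the pom does not do this for you
    cp target/*.jar ~/play_area/mapreduce-patterns.jar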

  • Install source data
  • In the book the examples use data from the StackOverflow website, which is publicly available here. I decided to download the data about ‘Apple’ to use as test data.
    I ran the following commands in a terminal to put the data in HDFS on the VM:

    # Get the data as a compressed 7z file
    wget --no-check-certificate

    # Install p7zip tool to unpack the data
    sudo yum install p7zip

    # unzip the downloaded data
    7za e

    # Create an HDFS directory to store the data in
    hdfs dfs -mkdir /mrdp

    # Put the local file on the HDFS filesystem
    hdfs dfs -put Comments.xml /mrdp/Comments.xml

    # Create an HDFS directory for the output of the Hadoop jobs
    hdfs dfs -mkdir /mrdp/output
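
    With the data in place you can do a quick check that the file actually landed in HDFS, again with standard HDFS shell commands:

    # verify the file is in place and peek at the first lines
    hdfs dfs -ls /mrdp
    hdfs dfs -cat /mrdp/Comments.xml | head -3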

  • Run the job
  • Now we are ready to run some jobs. Run the following commands to execute the Hadoop job for the first pattern in the book:

    cd /home/cloudera/play_area
    yarn jar mapreduce-patterns.jar mrdp.ch1.CommentWordCount /mrdp/Comments.xml /mrdp/output/job1

    If all is well the job finishes successfully and you can inspect the output with the following command:
    hdfs dfs -cat /mrdp/output/job1/part-r-00000 | head -25
    Or you can transfer the output to the local filesystem with
    hdfs dfs -get /mrdp/output/job1/part-r-00000 output.txt

    and then open the file with an editor of your choice and have a look at the result:

    m	602
    ma	102
    maaaaannn	1
    maattachedwindow	2
    mabe	1
    maby	1
    mac	8978
    macaco	21
    macademic	1
    macadmins	1
    macair	7
    macally	1
    macapper	1
    macappstore	7
    macattack	1
    macaulay	1
    macbartender	5
    macbidouille	1
    macbitz	1
    macblog	1
    macboo	1
    macbook	2508
    macbookair	10
    macbookpro	62
    macbookpros	1
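
    One thing to be aware of when experimenting: Hadoop refuses to start a job whose output directory already exists, so remove it (or pick a new job name) before re-running:

    # remove the previous output before re-running the same job
    hdfs dfs -rm -r /mrdp/output/job1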

Note that I have only tested the first job with this setup. I also noticed several deprecation warnings from the existing code, where old API calls should be replaced with their newer counterparts. But as I said before, this should be enough to get you started with putting the described patterns into practice and playing around with them!
