Paradigm: Hadoop application

Introduction

The Hadoop application paradigm follows the vision in Paradigm-based development by Rolf Wiegand Storgaard.

Creating an application on Hadoop can be quite messy! An application can contain methods for getting data into HDFS, a job for processing the data, some kind of trigger that decides when the job should run, etc. All of this is normally tied together with small bash scripts that depend on the given Hadoop cluster.

The Hadoop community has not created a standard artifact that contains everything needed for a Hadoop application. This means that most deployments are done differently, which makes it hard to implement a good DevOps approach.

IBM has a good approach, using Oozie (http://oozie.apache.org/) to create a trigger and a workflow. IBM then has a proprietary format that stores everything needed for a complete application in a ZIP file. Unfortunately, this is a custom solution that only runs on IBM software. Still, IBM's approach is sound, and using Oozie as the core of a Hadoop application is the way to go.

Oozie as the core Hadoop technology

Apache Oozie has the basic building blocks for a generalized Hadoop application, and most applications only need what is in the default Oozie toolbox. Oozie consists of three base components, described in the sections below.

Oozie Properties

The Oozie properties file is the base where all the default behaviour of a Hadoop application is stated. The file contains cluster-specific properties, mandatory Oozie properties as well as application-specific properties.

If cluster-specific properties are needed, this file needs to be adjusted on each cluster.
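
A minimal properties file (e.g. properties/job.properties) could look like the sketch below. The host names, paths and application name are assumptions for illustration, while oozie.coord.application.path and oozie.use.system.libpath are standard Oozie properties.

# Cluster specific properties (host names are assumptions for illustration)
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8032
oozie.use.system.libpath=true

# Mandatory Oozie property pointing to the coordinator definition in HDFS
oozie.coord.application.path=${nameNode}/apps/myapp/coordinators/coordinator.xml

# Application specific properties used by the coordinator and workflow
appName=myapp
startTime=2018-01-01T00:00Z
endTime=2099-12-31T00:00Z
workflowPath=${nameNode}/apps/myapp/workflows/workflow.xml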

Oozie Coordinator

For each properties file there must be a corresponding coordinator XML file. The coordinator specifies what should trigger the application to execute. The trigger can be time-based (CRON) or event-based, and the coordinator also states how many times the application should be executed.
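
A minimal time-based coordinator could look like the sketch below; it assumes the appName, startTime, endTime and workflowPath properties from the example properties file above.

<coordinator-app name="${appName}-coordinator"
                 frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- The workflow definition triggered by this coordinator -->
      <app-path>${workflowPath}</app-path>
    </workflow>
  </action>
</coordinator-app>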

Oozie Workflow

The workflow contains the individual steps that a given application needs to perform. The workflow can be quite big (calling sub-workflows) or it can contain just a single step, depending on what needs to be solved. The basic specification for a workflow can be found here.

A number of very helpful actions/steps are provided by Oozie out of the box (a minimal workflow sketch follows the list):

  • execute Map-reduce
  • start a Pig job
  • FS for running commands on HDFS
  • execute JAVA
  • send out e-mail
  • run shell commands on the OS
  • execute Hive scripts and commands
  • Sqoop data from an external database to HDFS
  • execute SSH commands and copy data
  • execute a Spark application.
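
As promised above, here is a minimal workflow sketch with a single FS action that deletes a hypothetical temporary folder on HDFS; ${appName} and ${nameNode} are assumed to come from the properties file.

<workflow-app name="${appName}-workflow" xmlns="uri:oozie:workflow:0.5">
  <start to="cleanup"/>
  <!-- FS action: a simple HDFS command, here deleting a hypothetical temp folder -->
  <action name="cleanup">
    <fs>
      <delete path="${nameNode}/tmp/${appName}"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>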

Construct the artifact

There are many good de facto tools in the development world that are widespread and commonly known by developers. The tools in focus in this blog post come from the JAVA world, but a similar approach using tools from the .NET stack could probably also be used.

The goal is to use existing technologies to get a standardized artifact that can easily be installed on the Hadoop platform.

The foundation of every project is the template for organizing the individual components. When working with enterprise applications and JAVA, an obvious choice is Apache Maven as the foundation. Maven states where to place the core components and provides a main configuration file where plug-ins for almost every problem can be added.

Maven structure

Maven has the configuration file (pom.xml) in the root folder of the project and four core folders for code, resources and tests. Other code folders can be added instead of JAVA, e.g. Scala (src/main/scala).
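
As a sketch, a project following this paradigm is laid out like this:

pom.xml                 Maven configuration
src/main/java           JAVA source code (e.g. the Spark application)
src/main/resources      Oozie properties, coordinators, workflows and lib
src/test/java           unit and acceptance test code
src/test/resources      test-only resources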


Maven pom.xml

The pom.xml is where Maven is configured. For the Hadoop application paradigm, Maven is used for the following things (a skeleton pom.xml is shown after the list):

  • The name of the project
  • Source code repository management
  • Distribution management
  • Main properties
  • Dependency management
  • JAVA version
  • Divide tests into unit tests and acceptance tests
  • JGit flow (the jgitflow plug-in) for handling versioning
  • Define how to assemble the Hadoop application ZIP package
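
A skeleton pom.xml covering these points could look like the sketch below; the group and artifact IDs match the archetype example later in this post, while the SCM URL is a placeholder.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- The name of the project -->
  <groupId>dk.wiegand.test</groupId>
  <artifactId>test</artifactId>
  <version>1.0.0-SNAPSHOT</version>

  <!-- Source code repository management (placeholder URL) -->
  <scm>
    <connection>scm:git:ssh://git@bitbucket.example.com/hadoop/test.git</connection>
  </scm>

  <!-- Main properties, including the JAVA version -->
  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>

  <!-- Dependency management, distribution management and the build plug-ins
       (assembly, jgitflow, unit/acceptance test separation) are added in the
       sections that follow. -->
</project>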

Maven src/main/java

This is where Maven states the source code must be placed. In this paradigm, JAVA is used as the source language for constructing Apache Spark applications.

Maven src/main/resources

This is where Maven states the resources for a project must be placed. The Oozie information – properties, coordinators and workflows – is nested here. The resources folder is ordered into four subfolders.

Folder: properties

This is where the Oozie properties files nest. A Hadoop application can have a single properties file placed here that points to the Oozie coordinator. The folder can also contain multiple properties files, and each properties file will represent an Oozie coordinator.

Folder: coordinators

This is where the Oozie coordinators nest. For each Oozie coordinator an Oozie properties file must exist. An Oozie coordinator can also be shared between properties files: you can have an Oozie properties file per country – both a Danish and a Swedish properties file can point to the same Oozie coordinator file.

Folder: workflows

This is where the Oozie workflows nest. For each Oozie workflow an Oozie coordinator file must exist. Again, an Oozie workflow can be shared between multiple Oozie coordinators.

Folder: lib

This is where supporting files nest. These can be shell scripts, extra configuration, drivers, etc.

Maven src/test/java

This is where Maven states the test code needs to be placed – both unit test code and deeper test code like acceptance tests. Only non-production code should be put here.

I recommend placing a small JUnit test here that ensures everything needed in the resources folder is in place.
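
A minimal sketch of such a test could be the JUnit 4 class below; the class name and the exact checks are just an example.

import static org.junit.Assert.assertTrue;

import java.io.File;

import org.junit.Test;

public class OozieStructureTest {

    // Fails the build if one of the expected Oozie folders is missing.
    @Test
    public void oozieFoldersArePresent() {
        for (String folder : new String[] {"properties", "coordinators", "workflows", "lib"}) {
            File dir = new File("src/main/resources/" + folder);
            assertTrue("Missing Oozie folder: " + folder, dir.isDirectory());
        }
    }

    // Fails the build if no Oozie properties file has been added.
    @Test
    public void atLeastOnePropertiesFileExists() {
        String[] files = new File("src/main/resources/properties").list();
        assertTrue("Expected at least one Oozie properties file", files != null && files.length > 0);
    }
}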

Maven src/test/resources

This is where Maven states the test resources need to be placed. Only non-production resources should be placed here. I recommend testing against the production resources instead of adding extra files to the test resources, but sometimes specialized test resources are necessary.

How to package the Hadoop application

Maven is mainly focused on packaging JAVA code and constructing bundles of JAVA code. Special packaging types are possible, and some have been developed through external projects. For the purpose of packaging the Hadoop application a ZIP file is preferred, and that can easily be accomplished with the maven-assembly-plugin.

Maven assembly plug-in in POM

The snippet below should be inlined in the Maven POM file. Version 3.1.0 is the latest edition at the time of this article. The plug-in points to an assembly descriptor that describes how to package the application. The descriptor for the Hadoop application is placed in the folder “src/assembly” with the name “oozie.xml“.

<!-- Assembly -->
<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>3.1.0</version>
  <configuration>
    <descriptors>
      <descriptor>src/assembly/oozie.xml</descriptor>
    </descriptors>
  </configuration>
  <executions>
    <execution>
      <id>create-archive</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>

 Hadoop application assembly descriptor

The descriptor contains information on the package format, where to get the files, where to place them and how to handle them.

The format of the target package is chosen as ZIP, and the string in the id tag will be appended to the name of the ZIP package, e.g. [artifactId]-[version]-[id].[format] -> [artifactId]-[version]-oozie.zip.

Three file sets need to be handled. The first file set contains the Oozie application located in “src/main/resources“. Almost everything should be included, and a JUnit test should ensure that the needed structure and information are in place. Hadoop mostly runs on a Linux OS, and therefore “lineEnding” must be stated as “unix“. I have enabled filtering so Maven properties will be substituted into all the Oozie files – this makes it possible to reuse name, version, etc. from Maven in the Oozie application.

The second file set handles the lib folder separately; otherwise binary files like JARs would be corrupted. The line endings are still corrected, but no filtering by Maven is done.

Note: if properties files are placed in the lib folder, Maven filtering is not applied.

The third file set includes the constructed JAR file, if any. This is the Spark application, if one is needed in the given setup. If an über JAR is created, an original-*.jar file is also placed in the output directory, and that needs to be excluded.

<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">
  <id>oozie</id>
  <formats>
    <format>zip</format>
  </formats>
  <fileSets>
    <fileSet>
      <directory>${project.basedir}/src/main/resources</directory>
      <outputDirectory></outputDirectory>
      <includes>
        <include>**/*</include>
      </includes>
      <excludes>
        <exclude>lib/</exclude>
      </excludes>
      <lineEnding>unix</lineEnding>
      <filtered>true</filtered>
    </fileSet>
    <fileSet>
      <directory>${project.basedir}/src/main/resources/lib</directory>
      <outputDirectory>lib/</outputDirectory>
      <includes>
        <include>**/*</include>
      </includes>
      <lineEnding>unix</lineEnding>
      <filtered>false</filtered>
    </fileSet>
    <fileSet>
      <directory>${project.build.directory}</directory>
      <outputDirectory>lib/</outputDirectory>
      <includes>
        <include>*.jar</include>
      </includes>
      <excludes>
        <exclude>original-*.jar</exclude>
      </excludes>
    </fileSet>
  </fileSets>
</assembly>

Release handling with Maven

Release handling with Maven is a combination of different infrastructure components. It covers tagging in a source code repository and releasing an artifact to a Maven repository.

Source code repositories can be CVS, Subversion, MS Team Foundation Server, GIT, etc. A good safe choice in 2018 is GIT. NorthTech uses Atlassian Bitbucket – a nice, easy and affordable GIT repository. It is possible to use Maven with most source code repositories, but of course some of them are easier to manage.

The two big open source players that have implemented Maven repositories are JFrog Artifactory and Sonatype Nexus. NorthTech has tried both and ended up using Nexus.

Maven is used to tie the infrastructure components together. With GIT as the source code repository and Nexus as the artifact repository, jgitflow by Atlassian is easy to implement. A standard setup of the jgitflow Maven plug-in could look like this:

<!-- GIT flow -->
<plugin>
  <groupId>external.atlassian.jgitflow</groupId>
  <artifactId>jgitflow-maven-plugin</artifactId>
  <version>1.0-m5.1</version>
  <configuration>
    <pushReleases>true</pushReleases>
    <username>${git.username}</username>
    <password>${git.password}</password>
  </configuration>
</plugin>

When the setup of the jgitflow Maven plug-in above is inserted in the pom.xml, the release will automatically be pushed to the origin GIT server. A username and password need to be provided to access the origin GIT server. No further setup is needed to use GIT from the jgitflow plug-in.

In order to upload the new artifact to the Maven repository, distribution management needs to be specified in Maven. NorthTech has a Maven repository both for releases and for snapshots.

<distributionManagement>
  <repository>
    <id>northtech-releases</id>
    <url>http://m2.northtech.dk/content/repositories/releases/</url>
  </repository>
  <snapshotRepository>
    <id>northtech-snapshots</id>
    <url>http://m2.northtech.dk/content/repositories/snapshots/</url>
  </snapshotRepository>
</distributionManagement>

The credentials for the Maven repository are specified in the Maven settings.xml file located in the local Maven repository root.

<!-- NorthTech Maven release repository -->
<server>
  <id>northtech-releases</id>
  <username>username</username>
  <password>password</password>
</server>
<!-- NorthTech Maven snapshot repository -->
<server>
  <id>northtech-snapshots</id>
  <username>username</username>
  <password>password</password>
</server>

Note that the id in the settings.xml file needs to be exactly the same as the id in the pom.xml file.

When the above configuration is done a snapshot can be deployed with the command below.

mvn clean deploy

And a complete release can be done with the command below.

mvn clean --batch-mode jgitflow:release-start jgitflow:release-finish -Dgit.username=username -Dgit.password=password

Using Spark

TBD: use the Maven shade plugin to build an über JAR, set the JAVA version (another language could also be used) and add the Cloudera repository. A first sketch of the shade plug-in setup is shown below.
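
A minimal maven-shade-plugin setup for building the über JAR could look like the snippet below (version 3.1.0 is assumed, matching the assembly plug-in above). With the default configuration the plug-in keeps the unshaded JAR as original-*.jar in the output directory, which is exactly what the assembly descriptor above excludes.

<!-- Shade: bundle the Spark application and its dependencies into an über JAR -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.1.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>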

Maven Archetype

From the documentation above, a project could be put together that incorporates the structure and sets up Maven correctly. This is not something that should be done by hand for each application. A first shot is creating a demo project that incorporates everything needed, but it is a pain to clone this project, do the needed modifications over and over, and remember all the small tweaks.

Maven has a templating toolkit called Archetype. Here a generic template can be created, and projects can then be generated from it with a list of configuration parameters.

I have implemented the Hadoop application paradigm as an archetype, and the implementation can be found in NorthTech GIT. You are welcome to clone the GIT repository and suggest improvements to the implementation in the comments on this post. The Hadoop application paradigm is released to the NorthTech Maven repository – both as a snapshot and as a release.

To use the Hadoop application archetype you need to include the NorthTech Maven repository in your Maven settings.xml. I am doing this as a profile.

<profiles>
  <profile>
    <id>northtech-maven-profile</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <repositories>
      <repository>
        <id>northtech-releases</id>
        <url>http://m2.northtech.dk/content/repositories/releases/</url>
        <layout>default</layout>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>false</enabled>
        </snapshots>
      </repository>
      <repository>
        <id>northtech-snapshots</id>
        <url>http://m2.northtech.dk/content/repositories/snapshots/</url>
        <releases>
          <enabled>false</enabled>
        </releases>
        <snapshots>
          <enabled>true</enabled>
        </snapshots>
      </repository>
    </repositories>
  </profile>
</profiles>

When the NorthTech Maven repository is added to the settings.xml, the following command should be possible to execute. The command below will download the snapshot – NOT the release version – of the Hadoop application paradigm.

mvn archetype:generate -DarchetypeGroupId=dk.northtech.paradigms -DarchetypeArtifactId=paradigm-hadoop-application -DarchetypeVersion=1.0.0-SNAPSHOT -DgroupId=dk.wiegand.test -DartifactId=test -Dversion=1.0.0-SNAPSHOT

This command generates a project from the release version

mvn archetype:generate -DarchetypeGroupId=dk.northtech.paradigms -DarchetypeArtifactId=paradigm-hadoop-application -DarchetypeVersion=1.0.0 -DgroupId=dk.wiegand.test -DartifactId=test -Dversion=1.0.0-SNAPSHOT

The command above does:

  • Use the Maven archetype with
    • The archetype group ID: dk.northtech.paradigms
    • The archetype artifact ID: paradigm-hadoop-application
    • The archetype version: 1.0.0
  • Specify the following parameters
    • groupId: dk.wiegand.test
    • artifactId: test
    • version: 1.0.0-SNAPSHOT

You now have a complete Hadoop application that can execute all Oozie actions. The paradigm is configured to run and test Spark from JAVA as well. To build your new project and execute the Spark test etc. run the command below in the root of the project.

mvn clean package

Conclusion and next steps

The Hadoop application paradigm is an easy and scalable way of doing applications on Hadoop in your organization. It is a good way to introduce an enterprise approach where an application can be handled as a common artifact, so deployment, testing, logging, etc. are only configured once and a development team can focus on the business logic rather than the infrastructure.

The Hadoop application paradigm implementation is just one part of the deployment process. Another step is doing the installation on the Hadoop cluster. An installation program that understands the artifact needs to be constructed. I have done an edition of the installation program as a bash script but am working on a JAVA edition. You can follow my work on the Hadoop application installer in JAVA on this blog.
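
As a minimal sketch, such an installer essentially unpacks the artifact, uploads it to HDFS and submits the job to Oozie. The artifact name, HDFS path and Oozie URL below are assumptions for illustration, and the HDFS target must match what oozie.coord.application.path in the properties file points to.

# unpack the ZIP artifact produced by the maven-assembly-plugin
unzip test-1.0.0-oozie.zip -d test-1.0.0
# upload the application to HDFS (path must match the Oozie properties)
hdfs dfs -put -f test-1.0.0 /apps/test
# submit the coordinator job to the Oozie server
oozie job -oozie http://oozieserver.example.com:11000/oozie -config test-1.0.0/properties/job.properties -run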
