Skip to main content

Oozie workflow basic example


I asked to build periodic job to perform data aggregation MR and upload to HBase based on:

1. period of time
2. start running when input folder exists
3. start running when input folder contains _SUCCESS file

So I define 4 steps in my workflow

Step 1: based on job requirements check if input folder exists and contains _SUCCESS file

<decision name="upload-decision">
   <switch>
      <case to="create-csv">
         ${fs:exists(startFlag)}
      </case>
      <default to="end"/>
    </switch>
</decision>

Step 2: Running MR job as java action of Oozie
In prepare section appears deletion of _SUCCESS file and MR job output folder

<action name="create-csv">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}${startFlag}"/>
        <delete path="${nameNode}${csvOutputDir}"/>
      </prepare>
  <main-class>LocationToCsvMRMain</main-class>
<arg>${rowDataInputDir}</arg>
<arg>${csvOutputDir}</arg>
<arg>${numOfReducers}</arg>
<capture-output />
</java>
<ok to="import-csv-to-hbase" />
<error to="fail" />
</action>

Step 3: Import data in CSV format to HBase as shell action of Oozie

<action name="import-csv-to-hbase">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>import-csv-to-hbase.sh</exec>
<argument>${csvOutputDir}</argument>
<argument>${zkHostPort}</argument>
<file>import-csv-to-hbase.sh#import-csv-to-hbase.sh</file>
<capture-output />
</shell>
<ok to="cleanup" />
<error to="fail" />
</action>

Step 4: Move input folder to archive folder to prevent multiple processing of the same data

<action name="cleanup">
<fs>
    <mkdir path='${nameNode}${archiveFolder}${wf:id()}'/>
    <move source='${nameNode}${rowDataInputDir}' target='${archiveFolder}${wf:id()}/processed-input'/>
</fs>
   <ok to="end"/>
   <error to="fail"/>
</action>

To provide periodic behavior I wrap this workflow in simple coordinator

<coordinator-app name="simple-coordinator"
  frequency="${coord:minutes(15)}"
  start="2014-06-01T00:00Z" end="2014-06-03T00:15Z" timezone="America/Los_Angeles"   xmlns="uri:oozie:coordinator:0.2">
   <action>
    <workflow>
      <app-path>${nameNode}${wfPath}</app-path>
      <configuration>
            <property>
              <name>nameNode</name>
              <value>${nameNode}</value>
            </property>
             <property>
              <name>jobTracker</name>
              <value>${jobTracker}</value>
            </property>
         </configuration>
   </workflow>
  </action>
</coordinator-app>









Comments

Popular posts from this blog

Geo-spatial search in HBase

Nowadays geospatial query of the data became a part of almost every application working with location coordinates. There are a number of NoSQL databases supports geospatial search out of the box such as MongoDB, Couchbase, Solr etc. I'm going to write about bounding box query over HBase . I'm using this kind of query to select points located on visible geographic area of the WEB client. During my investigation, I realized that more effective and more complex solution is using Geohash or QuadKeys approach. These approaches required to redesign you data model then I found more simple (and less effective) solution - using built-in HBase encoder OrderedBytes  (hbase-common-0.98.4-hadoop2.jar) Current example of code working with HBase version 0.98.4. Actually bounding box query required comparison of latitude and longitude coordinates of the point saved in HBase. As you know HBase keep data in a binary format according to lexicographical order. Because of coordinates ...

Distributed bloom filter

Requirements To filter in very quick and efficient way the stream of location data coming from the external source with a rate of 1M events per second. Assumption that 80% of events should be filtered out and only 20% pass to the analytic system for further processing. Filtering process should find the match between the event on input and predefined half-static data structure (predefined location areas) contains 60,000M entries. In this article, I assume that reader familiar with Storm framework  and Bloom filter definition. Solution Once requirement talking about a stream of data Storm framework was chosen, to provide efficient filtering Guava Bloom filter was chosen. Using Bloom filter calculator  find out that having 0.02% false positive probability Bloom filter bit array takes about 60G memory which can't be loaded into java process heap memory. Created simple Storm topology contains Kafka Spout and Filter bolts. There are several possible solutions ...