Monday, August 24, 2015

Jenkins as a Hadoop Job Scheduler

"Which Way" by oatsy40
Jenkins is a well-known continuous integration server - checking out source code, running unit tests, yada yada yada.  However, because of its simplicity, I've been able to leverage it for a variety of use cases: lately, as a Hadoop Job Scheduler.

The expert panel pick was Oozie. But whenever I asked those "experts" for their use cases for Oozie, they'd tell me, "Oh, I've never actually used it, just heard that's what you do." Well, that's just great. I played around with Oozie by scheduling a simple "echo 123" command; it launched a JVM on all the nodes and never printed the result. At my company, not everything we schedule on the Hadoop cluster is a Hadoop job, and certainly not a map-reduce job. We have bash scripts that verify data. We have Groovy scripts that poll a database on a different server and, if triggered, kick off a Hadoop job. I found Oozie to be cumbersome and limited in features.

The underground pick was Azkaban, created by LinkedIn. I looked at it and liked the simplicity of job workflows being described as plain text files. I loved the workflow diagrams it provided, giving you a clear picture of how multiple jobs fit together. However, it too was limited in features. In particular, concurrent execution was an all-or-nothing setting at the global level. We wanted a global cap on how much runs at once, plus the ability to restrict certain types of jobs to a single running instance at a time.

Jenkins is what came to mind, but I felt peer pressure to try Oozie and Azkaban first. Jenkins was not a popular choice for scheduling Hadoop jobs. Did I say not a popular choice? I mean I heard things like, "Nobody in their right mind would choose Jenkins for this! Isn't Jenkins for continuous integration?!"... but wait a minute!

Here's what we get with Jenkins:

Ability to run any command
  Not just Hadoop sorts of jobs like Java MapReduce, Sqoop, Pig, or Hive, but absolutely anything that can be scripted. You can use the right tool for the job and conditionally launch those parallel Hadoop jobs.


Cron-like scheduling
  Basic, I know. Most tools have this, but Jenkins's is also easy to use.

Email notification
  One neat thing about Jenkins is that the plugin ecosystem is very active. There are plugins for templated emails, plugins to fail the build when certain text appears in the console, plugins to send a tweet or an SMS message, and so on.

Console log streamed to web UI
  We get a nice history of all the script output and Hadoop driver output in one place in Jenkins. We can cap the history at a number of builds or by date.

Some concurrent job support
  Jenkins has a global executor maximum in the Manage Jenkins administration screen, and each job can be set to allow concurrent builds or not. There is also a plugin that gives you finer control over builds (see the Throttle Concurrent Builds plugin).

Parameters for ad-hoc runs
  This feature is really handy, and I've not found it in other tools. Jenkins has two ways to programmatically kick off a build: a command-line interface (CLI) and a REST API. We basically built a job submission UI that launches a Jenkins job underneath, giving us all that history, progress, and console output, not to mention failure notification. (Of course, Jenkins also has a standard way of prompting you for a build's parameters when you use the UI directly.) A rough sketch of the REST call follows this list.
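
For illustration, here is roughly what kicking off a parameterized build over the Jenkins REST API can look like from Java. The host, job name, parameter, user, and API token below are made up for the example; Jenkins accepts HTTP Basic auth with a user's API token and a form-encoded POST to the job's buildWithParameters endpoint.

// TriggerJenkinsJob.java - illustrative sketch, names are hypothetical
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TriggerJenkinsJob {
   public static void main(String[] args) throws Exception {
      // Substitute your own Jenkins URL and job name here.
      URL url = new URL("http://jenkins.example.com/job/hadoop-ingest/buildWithParameters");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestMethod("POST");

      // Jenkins accepts HTTP Basic auth using a user's API token as the password.
      String auth = Base64.getEncoder()
            .encodeToString("jenkinsUser:apiToken".getBytes(StandardCharsets.UTF_8));
      conn.setRequestProperty("Authorization", "Basic " + auth);
      conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

      // Build parameters go in the form-encoded request body.
      conn.setDoOutput(true);
      try (OutputStream out = conn.getOutputStream()) {
         out.write("RUN_DATE=2015-08-24".getBytes(StandardCharsets.UTF_8));
      }

      System.out.println("Jenkins responded with HTTP " + conn.getResponseCode());
   }
}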

Limitations:
I recognize there are limitations to what Jenkins can do - it is not an enterprise scheduler, so we may someday outgrow it and migrate to one. However, it's been a great ride over the past 2 years getting Hadoop going the way we wanted. Here are a few of those limitations:

 - No way to set up automatic reruns of failed jobs
 - No cross-system control where a client can stream feedback to a server to trigger other jobs
 - No dashboard view of jobs that ran at a particular time of day
 - Poor load control; other tools can limit the number of instances of specific kinds of jobs based on capacity
 - We ran Jenkins on our submitting edge node, inside the cluster. Depending on your security requirements, you may not have that luxury.

Nonetheless, if you are searching for a lightweight way to get more exposure to your Hadoop cluster, I recommend giving Jenkins a shot.

Monday, June 15, 2015

Dependency Injection on Hadoop (without Guice)

"Folding Bike" by dog97209
Does your Java map-reduce code look like a bunch of dominoes strung together, where you can't play one piece until you've chained the other two together to get to that point? There is so much cruft - so much bootstrapping code - in the Mapper and Reducer setup methods. When it comes to Hadoop, we seem to have thrown away what we learned from other software projects. If you don't believe me, just take a look at any WordCount example and you'll see nested classes stuffed inside one outer class. I get the brevity, but I've seen people take this code and put the blinders on when working with big data. They view it as a quick-and-dirty job to get at some result. Instead, I view it as a living, breathing application that you can extend and maintain. That's why I propose using Dependency Injection on Hadoop. It will decouple your code and make it testable - YES, I said testable Hadoop code! Each class can focus on the bare minimum pieces it needs to carry out its duty. For example, you could inject an instance of Counter rather than pass Context around to every class that needs to increment it (or, heaven forbid, mock out an entire Context for a test). Here's how you get there.

Enter... Spit-DI - a lightweight dependency injection framework built for Hadoop.

Spit-DI overcomes the challenges unique to the Hadoop map-reduce framework, where you are given a Mapper or Reducer as your starting point: you are not in control of instantiating Mappers and Reducers. Spit-DI lets you set dependency instances on the Mapper or Reducer itself. It works by using a temporary IoC container (just a Map) that pushes out the queued-up singleton values when inject() is called. Spit-DI uses the JSR-250 @Resource annotation. It works with statics. It finds and injects those same annotations on parent classes of an instance. It's tiny. It's simple. Give it a try!
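
To make that mechanism concrete, here is a simplified sketch of the idea - a Map of queued bindings plus reflection over @Resource fields. This is only an illustration of the approach described above, not Spit-DI's actual source; the class name TinyInjector is made up.

// TinyInjector.java - illustrative sketch only
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;
import javax.annotation.Resource;

public class TinyInjector {
   private final Map<String, Object> bindingsByName = new HashMap<>();
   private final Map<Class<?>, Object> bindingsByType = new HashMap<>();

   public void bindByName(Class<?> type, String name, Object value) {
      bindingsByName.put(name, value);
   }

   public void bindByType(Class<?> type, Object value) {
      bindingsByType.put(type, value);
   }

   // Walks the instance's class hierarchy and sets any @Resource field
   // whose name or type matches a queued binding (static fields included).
   public void inject(Object target) {
      for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
         for (Field field : c.getDeclaredFields()) {
            if (!field.isAnnotationPresent(Resource.class)) {
               continue;
            }
            Object value;
            if (bindingsByName.containsKey(field.getName())) {
               value = bindingsByName.get(field.getName());
            } else if (bindingsByType.containsKey(field.getType())) {
               value = bindingsByType.get(field.getType());
            } else {
               continue; // no binding queued for this field
            }
            try {
               field.setAccessible(true);
               field.set(target, value); // for static fields the target instance is ignored
            } catch (IllegalAccessException e) {
               throw new IllegalStateException("Could not inject " + field, e);
            }
         }
      }
   }
}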

But hold the phone. There are so many wonderful DI frameworks out there already, right?

Well... none that fit the Hadoop use case very well. Here is the rationale for creating Spit-DI. Many thanks to Keegan Witt for investigating each of the options out there. It was one of those things that seemed easy in principle but became painful in practice.
  • We first thought of Spring, but it felt a little heavy-handed for what we were trying to do. If we were already using Spring for other purposes, we probably would have just used it for DI.
  • Then we found PicoContainer, but we wanted to inject annotated fields by name. That is, if a class had two Strings on it, stuffA and stuffB, both could be injected. Pico offered NamedFieldInjection and AnnotatedFieldInjection, but not both at the same time. It also was not ideal in that it did not work with JSR-250's @Resource annotation.
  • We also found Plexus. It was pretty specific to the Maven repository use case, and its syntax was not very terse.
  • We really wanted to use Guice because it seemed to be gaining popularity as a lightweight DI framework. So we brought it in and started working with it. It got us further, but it still fell short of our desires. Here's why Guice did not fit our Hadoop project and why I wrote Spit-DI. (Please bear with me. I'm anticipating most readers will suggest Guice, so I feel the need to defend myself by walking through some examples here.)
  1. Lack of JSR-250 annotation support. Guice relies on its own @Inject annotation rather than the standard javax one. The Mycila extension adds JSR-250 support, but once you switch to javax's @Resource you can no longer mix in @Optional and @Nullable - those only work with Guice's own @Inject. Here's how your code looks using Guice annotations:
    @Inject
    String stuffA;
    
  2. Redundant by-name bindings. Guice out of the box requires @Inject @Named("stuff") String stuff, rather than relying on the name of the variable by convention. Having to litter the code with @Named("sameAsVariableName") is not ideal.  Here's how your code looks now:
    @Inject @Named("stuffA")
    String stuffA;
    @Inject @Named("stuffB")
    String stuffB;
    
  3. Optionals. You may wish to use a model with injected things on it in both the map and reduce phases, where some injected properties only make sense in the map phase and some only in reduce. Having to litter the code with optional=true is not ideal. And now you see it's getting worse...
    @Inject(optional=true) @Named("stuffA")
    String stuffA;
  4. Nullables. It may be the case that you are injecting null into a field and that is valid - especially when unit testing. Having to litter the code with @Nullable is not ideal.
    @Inject(optional=true) @Nullable @Named("stuffA")
    String stuffA;
    
  5. Statics. It was working against the grain, but we actually wanted some static fields on our POJOs: we create instances based on input data, and asking the container to wire up a new one every time was inefficient. We have domain entities with static fields (singletons) that are set once up front but available for reference by each smart domain model. More cruft...
    @Inject(optional=true) @Nullable @Named("stuffA")
    static String stuffA;
    
    //...elsewhere, in your Guice module's configure()...
    requestStaticInjection(MyClass.class);
    

Ok. Now do you believe me? Are you ready for the clean and simple way using Spit-DI?!!

Hadoop map-reduce with Spit-DI:
class MovieMapper extends Mapper {
   @Resource
   private Movie movie;

   @Override
   protected void setup(Context context) {
      DependencyInjector.instance().using(context).injectOn(this);
   }
}

class Movie {
   @Resource
   private Counter numMoviesRequested;
   
   public Integer getYear(String title) { 
     numMoviesRequested.increment(1);
     // more code...
   }
}

/**
 * You can have a wrapper class around Spit-DI for all your configuration.
 * (We have a TestDependencyInjector as well for the context of unit testing.)
 */
class DependencyInjector {
   private SpitDI spit = new SpitDI();

   public void injectOn(Object instance) {
      spit.inject(instance);
   }

   public DependencyInjector using(final Mapper.Context context) {
      spit.bindByType(Movie.class, new Movie());
      spit.bindByName(Counter.class, "numMoviesRequested", context.getCounter("movies", "numMoviesRequested"));
      return this;
   }
}
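
Because the dependencies are injected, the Movie model can now be unit tested with a mocked Counter and no Hadoop Context at all. Here is a minimal sketch of such a test, assuming JUnit 4 and Mockito are on the classpath (the test name and movie title are made up):

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.apache.hadoop.mapreduce.Counter;
import org.junit.Test;

public class MovieTest {
   @Test
   public void incrementsCounterWhenYearIsRequested() {
      // Mock the Hadoop Counter so no cluster or Context is needed.
      Counter numMoviesRequested = mock(Counter.class);
      Movie movie = new Movie();

      // Bind the mock the same way DependencyInjector binds the real counter.
      SpitDI spit = new SpitDI();
      spit.bindByName(Counter.class, "numMoviesRequested", numMoviesRequested);
      spit.inject(movie);

      movie.getYear("The Sting");

      verify(numMoviesRequested).increment(1);
   }
}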

Ah, I can breathe again. In conclusion, Spit-DI doesn't have all the features of the others, but it was all we ever wanted for Hadoop Dependency Injection. I hope it works for you too. Please leave your feedback and feature requests and happy coding!


(PS: I realize it has been 2 years since my last blog post. I switched from web development to Hadoop development, so it took me this long to have my own thoughts, I guess. :) Hopefully, more to come!)

Monday, March 18, 2013

Unit of Work Pattern Proved Useful in the View Layer

Today, I'm going to talk about the Unit of Work pattern and how it can be useful in a presentation tier. For those of you familiar with this pattern, it was intended for aggregating changes to objects and later committing them to a database. The book definition of Unit of Work is:


Maintains a list of objects affected by a business transaction and coordinates the writing out of changes...

I discovered that at least part of this - maintains a list ... and coordinates the writing out of changes - proved useful in creating cohesive, reusable web components. Why? Because in today's rich web experience, those components are composed of both HTML and JavaScript. HTML has a one-to-one correspondence between a tag and an element on the web page. With JavaScript, however, we can ascribe behavior to multiple elements at once; it can be one-to-many. HTML is the templating language, so where it appears in the source matters. JavaScript augments that markup and can be attached before or after the page is rendered. Let's walk through an example using Unit of Work for a "unit on the screen".

We're all in agreement that attaching JavaScript behavior after your DOM, like this, is good. It gives you separation of concerns, and putting the script at the bottom allows the page to load faster.

// markup
<html>
 <body>
  <form>
   <input id="firstName" name="firstName" type="text" class="textbox">
   <input id="middleName" name="middleName" type="text" class="textbox">
   <input id="lastName" name="lastName" type="text" class="textbox">
   <input id="birthDate" name="birthDate" type="text" class="datebox">
  </form>
  <script src="behavior.js"></script>
 </body>
</html>

// behavior.js
$(document).ready(function() {
 $(".textbox").change(function() {
  Utils.uppercase($(this));
 });
 $("#birthDate").datepicker();
});

But what if we want to make some reusable components for our application? After all, that is the principle of DRY: Don't Repeat Yourself.

// ui:text
<input id="{{id}}" name="{{id}}" type="text" class="textbox">

// ui:date
<input id="{{id}}" name="{{id}}" type="text" class="datebox">

// markup
<html>
 <body>
  <form>
   <ui:text id="firstName"/>
   <ui:text id="middleName"/>   
   <ui:text id="lastName"/>   
   <ui:date id="birthDate"/>   
  </form>
  <script src="behavior.js"></script>
 </body>
</html>

// behavior.js
$(document).ready(function() {
 $(".textbox").change(function() {
  Utils.uppercase($(this));
 });
 $("#birthDate").datepicker();
});

This is bad. The caller of ui:text has no way of knowing the class of the resulting element in order to set up the CSS selector. The caller can also forget to attach the behavior, so the idea of a reusable, consistent ui:text is not preserved across the application. There is no cohesion.

Solution: BottomJS. It implements the Unit of Work pattern: the reusable component renders its HTML and also registers the JavaScript to be attached later at the bottom of the page.

// ui:text
<input id="{{id}}" name="{{id}}" type="text" class="textbox">
<ui:bottomJs>
$(".textbox").change(function() {
 Utils.uppercase($(this));
});
</ui:bottomJs>

// ui:date
<input id="{{id}}" name="{{id}}" type="text" class="datebox">
<ui:bottomJs>
$("{{id}}").datepicker();
</ui:bottomJs>

// markup
<html>
 <body>
  <form>
   <ui:text id="firstName"/>
   <ui:text id="middleName"/>   
   <ui:text id="lastName"/>   
   <ui:date id="birthDate"/>
  </form>
  <ui:bottomJs/>
 </body>
</html>

This generates the following. It works because BottomJS builds an in-memory set of the registered JavaScript snippets and emits them in a single final step, so redundant registrations are ignored.

<html>
 <body>
  <form>
   <input id="firstName" name="firstName" type="text" class="textbox">
   <input id="middleName" name="middleName" type="text" class="textbox">
   <input id="lastName" name="lastName" type="text" class="textbox">
   <input id="birthDate" name="birthDate" type="text" class="datebox">
  </form>
  <script>
   $(document).ready(function() {
    $(".textbox").change(function() {
     Utils.uppercase($(this));
    });
    $("#birthDate").datepicker();
   });
  </script>
 </body>
</html>
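
For illustration, here is a rough Java sketch of the unit of work behind a ui:bottomJs-style component - a registry that collects scripts during rendering and flushes them once at the bottom of the page. The class and method names are made up; this shows the idea, not BottomJS's actual implementation.

import java.util.LinkedHashSet;
import java.util.Set;

public class BottomJsRegistry {
   // LinkedHashSet drops duplicate registrations while preserving insertion order.
   private final Set<String> scripts = new LinkedHashSet<>();

   // Called by each component as it renders its HTML.
   public void register(String script) {
      scripts.add(script);
   }

   // Called once by the <ui:bottomJs/> tag at the bottom of the page.
   public String flush() {
      StringBuilder out = new StringBuilder("<script>\n$(document).ready(function() {\n");
      for (String script : scripts) {
         out.append(script).append("\n");
      }
      out.append("});\n</script>");
      return out.toString();
   }
}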

Tuesday, February 19, 2013

TID: Test-If-Development (A more pragmatic TDD)

The moment you need to introduce some if-logic into a method, you jump over and write the test - but not before. This approach argues that it is unnecessary overhead to write a test before you create the method, because the method may contain no conditional logic at all, as when it is composed entirely of calls to other methods.

Example - not needing a test


void populateContactInfo() {
   populateName();
   populateAddress();
   populatePhone();
}


Example - needs at least 2 tests because there are 2 branches of code


void populateContactInfo() {
   if (hasName) {
       populateName();
   }
   else {
       populateDefaultName();  
   }
   populateAddress();
   populatePhone();
}
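
For completeness, here is a minimal sketch of those two tests, assuming JUnit 4 and that populateContactInfo() lives on a Contact class whose name can be observed afterward. The Contact constructor, getName(), and the default value "Unknown" are made up for the illustration.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class ContactTest {
   @Test
   public void populatesNameWhenOneIsPresent() {
      Contact contact = new Contact("Jane Doe");   // hasName == true
      contact.populateContactInfo();
      assertEquals("Jane Doe", contact.getName());
   }

   @Test
   public void populatesDefaultNameWhenNoneIsPresent() {
      Contact contact = new Contact(null);         // hasName == false
      contact.populateContactInfo();
      assertEquals("Unknown", contact.getName());
   }
}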


The phrase Test-If-Development has one other benefit. That is, "If" you are doing "Development", you "Test". Period.