
Monday, June 15, 2015

Dependency Injection on Hadoop (without Guice)

"Folding Bike" by dog97209
Does your Java map-reduce code look like a bunch of dominoes strung together, in which you can't play one piece until you have the other 2 chained together to get to that point? There is so much cruft - so much bootstrapping code - in the Mapper and Reducer setup methods. When it comes to Hadoop, we seem to have thrown away what we learned from other software projects. If you don't believe me, just take a look at any WordCount example and you'll see nested classes inside one super class. I get the brevity, but I've seen people take this code and put the blinders on when working with big data. They view it as a quick and dirty job to get at some result. Instead, I view it as a living, breathing application that you can extend and maintain. That's why I propose using Dependency Injection on Hadoop. It will decouple your code and make it testable - YES, I said testable Hadoop code! Each class can focus on the bare minimum pieces it needs to carry out its duty. For example, you could inject an instance of Counter rather than pass around Context to every class that needs to increment it (or, heaven forbid, mock out an entire Context for a test). Here's how you get there.

Enter... Spit-DI - a lightweight dependency injection framework built for Hadoop.

Spit-DI overcomes the challenges unique to the Hadoop map-reduce framework, where you are given a Mapper or Reducer as your starting point: you are not in control of instantiating Mappers and Reducers. Spit-DI lets you set dependency instances on the Mapper or Reducer itself. It works by using a temporary IoC container (just a Map) that pushes out the queued-up singleton values when inject() is called. Spit-DI uses the JSR250 @Resource annotation. It works with statics. It finds and injects those same annotations on parent classes of an instance. It's tiny. It's simple. Give it a try!
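The mechanism is simple enough to sketch in a few lines of plain Java. TinyDI below is a toy illustration of the idea, not the real Spit-DI code - the class and its bindings mirror the library's style, and the @Resource stand-in annotation is defined inline so the sketch compiles with no dependencies:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Stand-in for JSR250's javax.annotation.Resource so this sketch is self-contained.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Resource {}

// A toy container in the spirit of Spit-DI: queue up singletons in plain Maps,
// then push them onto @Resource fields - by name first, falling back to type -
// when inject() is called. Walking up getSuperclass() is what lets annotated
// fields on parent classes get injected too.
class TinyDI {
    private final Map<String, Object> byName = new HashMap<>();
    private final Map<Class<?>, Object> byType = new HashMap<>();

    TinyDI bindByName(String name, Object value) { byName.put(name, value); return this; }
    TinyDI bindByType(Class<?> type, Object value) { byType.put(type, value); return this; }

    void inject(Object target) {
        for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (!f.isAnnotationPresent(Resource.class)) continue;
                // Prefer a by-name binding; otherwise fall back to the field's type.
                Object value = byName.containsKey(f.getName())
                        ? byName.get(f.getName())
                        : byType.get(f.getType());
                try {
                    f.setAccessible(true);
                    f.set(target, value); // also works for static fields (target is ignored)
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
    }
}

// Two fields of the same type, distinguished purely by name - no @Named noise.
class TwoStrings {
    @Resource String stuffA;
    @Resource String stuffB;
}
```

Note how two Strings on the same class are injected by convention, from the variable name alone - the exact case that motivated the library, as the Guice comparison below shows.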

But hold the phone. There are so many wonderful DI frameworks out there already, right?

Well... none that fit the Hadoop use case very well. Here is the rationale for creating Spit-DI. Many thanks to Keegan Witt for investigating each of the existing options. It was one of those things that seemed easy in principle but became painful in practice.
  • We first thought of Spring, but it felt a little heavy-handed for what we were trying to do. If we were already using Spring for other purposes, we probably would have just used this for DI.
  • Then we found PicoContainer, but we wanted to inject fields based on an annotation and a certain name. That is, if a class had two Strings on it, stuffA and stuffB, both could be injected. Pico offered NamedFieldInjection and AnnotatedFieldInjection, but not both at the same time. It was also not ideal in that it did not work with JSR250's @Resource annotation.
  • We also found Plexus. It was pretty specific to the maven repository use case and its syntax was not very terse.
  • We really wanted to use Guice because it seemed to be gaining popularity as a lightweight DI framework. So we brought it in and started working with it. It got us farther, but still fell short of our desires. Here's why Guice did not fit our Hadoop project and why I wrote Spit-DI. (Please bear with me. I'm anticipating most readers will suggest Guice so I feel the need to defend myself by walking through some examples here.)
  1. Lack of JSR250 annotation support. Guice relies on its own @Inject annotation, not the standard javax one. The Mycila extension adds this, but when you switch to using javax's @Resource, you can no longer mix in @Optional and @Nullable - those only work with Guice's own @Inject. Here's how your code looks using Guice annotations:
    @Inject
    String stuffA;
    
  2. Redundant by-name bindings. Guice out of the box requires @Inject @Named("stuff") String stuff, rather than relying on the name of the variable by convention. Having to litter the code with @Named("sameAsVariableName") is not ideal.  Here's how your code looks now:
    @Inject @Named("stuffA")
    String stuffA;
    @Inject @Named("stuffB")
    String stuffB;
    
  3. Optionals. You may wish to use a model with injected things on it both in Map phase and Reduce phase, where some injected properties only make sense for Map phase and some only make sense in Reduce. Having to litter the code with @Optional is not ideal. And now you see it's getting worse...
    @Inject(optional=true) @Named("stuffA")
    String stuffA;
  4. Nullables. It may be the case that you are injecting a null into a field and that is valid - especially when unit testing. Having to litter the code with @Nullable is not ideal.
    @Inject(optional=true) @Nullable @Named("stuffA")
    String stuffA;
    
  5. Statics. It was working against the grain, but we actually wanted some static fields on our POJOs: we create instances based on input data, and asking the container to wire up a new one every time was inefficient. We have domain entities with static fields (singletons) that are set once up front but available for reference by each smart domain model. More cruft...
    @Inject(optional=true) @Nullable @Named("stuffA")
    static String stuffA;
    
    //...elsewhere in Guice config...
    requestStatic(binder(), MyClass.class);
    

Ok. Now do you believe me? Are you ready for the clean and simple way using Spit-DI?!!

Hadoop map-reduce with Spit-DI:
class MovieMapper extends Mapper {
   @Resource
   private Movie movie;

   @Override
   protected void setup(Context context) {
      DependencyInjector.instance().using(context).injectOn(this);
   }
}

class Movie {
   @Resource
   private Counter numMoviesRequested;
   
   public Integer getYear(String title) { 
     numMoviesRequested.increment(1);
     // more code...
   }
}

/**
 * You can have a wrapper class around Spit-DI for all your configuration.
 * (We have a TestDependencyInjector as well for the context of unit testing.)
 */
class DependencyInjector {
   private SpitDI spit = new SpitDI();

   public void injectOn(Object instance) {
      spit.inject(instance);
   }

   public DependencyInjector using(final Mapper.Context context) {
      spit.bindByType(Movie.class, new Movie());
      spit.bindByName(Counter.class, "numMoviesRequested", context.getCounter("movies", "numMoviesRequested"));
      return this;
   }
}

Ah, I can breathe again. In conclusion, Spit-DI doesn't have all the features of the others, but it was all we ever wanted for Hadoop Dependency Injection. I hope it works for you too. Please leave your feedback and feature requests and happy coding!


(PS: I realize it has been 2 years since my last blog post. I switched from Web Development to Hadoop Development, so it took me this long to have my own thoughts, I guess. :) Hopefully, more to come!)

Monday, March 18, 2013

Unit of Work Pattern Proved Useful in the View Layer

Today, I'm going to talk about the Unit of Work pattern and how it can be useful in a presentation tier.  For those of you familiar with this pattern, it was intended for aggregating changes to objects and later committing them to a database. The book definition for Unit of Work is:


Maintains a list of objects affected by a business transaction and coordinates the writing out of changes...

I discovered that at least part of this - Maintains a list ... and coordinates the writing out of changes - proved useful in creating cohesive, reusable web components.  Why?  Because in today's rich web experience, those components are composed of both HTML and JavaScript.  HTML has a one-to-one correspondence between tag and element on the web page.  However, with JavaScript we can ascribe behavior to multiple elements at once; it can be one-to-many.  HTML is the templating language, so where it appears in the source matters.  JavaScript augments that markup and can be attached before or after the page is rendered.  Let's walk through an example using Unit of Work for a "Unit on the Screen".

We're all in agreement that attaching JavaScript behavior after your DOM like this is good. It has Separation of Concerns. Putting it at the bottom allows the page to load faster.

// markup
<html>
 <body>
  <form>
   <input id="firstName" name="firstName" type="text" class="textbox">
   <input id="middleName" name="middleName" type="text" class="textbox">
   <input id="lastName" name="lastName" type="text" class="textbox">
   <input id="birthDate" name="birthDate" type="text" class="datebox">
  </form>
 </body>
 <script src="behavior.js"></script>
</html>

// behavior.js
$(document).ready(function() {
 $(".textbox").change(function() {
  Utils.uppercase($(this));
 });
 $("#birthDate").datepicker();
});

But what if we want to make some reusable components for our application? After all, that is the principle of DRY, Don't Repeat Yourself.

// ui:text
<input id="{{id}}" name="{{id}}" type="text" class="textbox">

// ui:date
<input id="{{id}}" name="{{id}}" type="text" class="datebox">

// markup
<html>
 <body>
  <form>
   <ui:text id="firstName"/>
   <ui:text id="middleName"/>   
   <ui:text id="lastName"/>   
   <ui:date id="birthDate"/>   
  </form>
 </body>
 <script src="behavior.js"></script>
</html>

// behavior.js
$(document).ready(function() {
 $(".textbox").change(function() {
  Utils.uppercase($(this));
 });
 $("#birthDate").datepicker();
});

This is bad. The caller of ui:text has no way of knowing the class of the resulting element to set up the CSS selector. Also, the caller can forget to attach the behavior, so a reusable "ui:text" is no longer guaranteed to behave consistently across the application. There is no cohesion.

Solution: BottomJS. It implements the Unit of Work pattern to have the reusable component render HTML and also add the JavaScript to be attached later at the bottom of the page.

// ui:text
<input id="{{id}}" name="{{id}}" type="text" class="textbox">
<ui:bottomJs>
$(".textbox").change(function() {
 Utils.uppercase($(this));
});
</ui:bottomJs>

// ui:date
<input id="{{id}}" name="{{id}}" type="text" class="datebox">
<ui:bottomJs>
$("#{{id}}").datepicker();
</ui:bottomJs>

// markup
<html>
 <body>
  <form>
   <ui:text id="firstName"/>
   <ui:text id="middleName"/>   
   <ui:text id="lastName"/>   
   <ui:date id="birthDate"/>
  </form>
 </body>
 <ui:bottomJs/>
</html>

This generates the following. The reason is that BottomJS builds an in-memory set of the registered scripts and emits them in one final step at the bottom of the page, so redundant registrations are ignored.

<html>
 <body>
  <form>
   <input id="firstName" name="firstName" type="text" class="textbox">
   <input id="middleName" name="middleName" type="text" class="textbox">
   <input id="lastName" name="lastName" type="text" class="textbox">
   <input id="birthDate" name="birthDate" type="text" class="datebox">
  </form>
 </body>
 <script>
  $(document).ready(function() {
   $(".textbox").change(function() {
    Utils.uppercase($(this));
   });
   $("#birthDate").datepicker();
  });
 </script>
</html>
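The unit-of-work collector behind this can be sketched in a few lines. This is an illustrative Java version, not the actual BottomJS implementation - the class and method names are assumptions for the sketch:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// A sketch of the BottomJS unit of work: while the page renders, each component
// registers the script it needs. A LinkedHashSet keeps first-registration order
// and silently drops duplicates, so three ui:text tags contribute the ".textbox"
// handler exactly once. The final <ui:bottomJs/> tag "commits" the unit of work
// by rendering every collected script inside one document-ready block.
class BottomJs {
    private final Set<String> scripts = new LinkedHashSet<>();

    void register(String script) {
        scripts.add(script); // redundant registrations are ignored
    }

    String render() {
        StringBuilder out = new StringBuilder("<script>\n $(document).ready(function() {\n");
        for (String s : scripts) {
            out.append("  ").append(s).append("\n");
        }
        return out.append(" });\n</script>").toString();
    }
}
```

The design choice worth noting is the ordered set: it gives the "maintains a list ... and coordinates the writing out of changes" behavior from the pattern's definition, while deduplication is what lets every ui:text declare its own behavior without fear of attaching the same handler three times.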