Monday, June 15, 2015

Dependency Injection on Hadoop (without Guice)

"Folding Bike" by dog97209
Does your Java map-reduce code look like a bunch of dominoes strung together, where you can't play one piece until you've chained together the two before it? There is so much cruft - so much bootstrapping code - in the Mapper and Reducer setup methods. When it comes to Hadoop, we seem to have thrown away what we learned from other software projects. If you don't believe me, just take a look at any WordCount example and you'll see nested classes inside one super class. I get the brevity, but I've seen people take this code and put the blinders on when working with big data. They view it as a quick and dirty job to get at some result. Instead, I view it as a living, breathing application that you can extend and maintain. That's why I propose using Dependency Injection on Hadoop. It will decouple your code and make it testable - YES, I said testable Hadoop code! Each class can focus on the bare minimum pieces it needs to carry out its duty. For example, you could inject an instance of Counter rather than pass around Context to every class that needs to increment it (or, heaven forbid, mock out an entire Context for a test). Here's how you get there.

Enter... Spit-DI - a lightweight dependency injection framework built for Hadoop.

Spit-DI overcomes the challenges unique to the Hadoop map-reduce framework, where you are handed a Mapper or Reducer as your starting point: you are not in control of instantiating Mappers and Reducers, so Spit-DI lets you set dependency instances on the Mapper or Reducer itself. It works by using a temporary IoC container (just a Map) that pushes out the queued-up singleton values when inject() is called. Spit-DI uses the JSR 250 @Resource annotation. It works with statics. It finds and injects those same annotations on parent classes of an instance. It's tiny. It's simple. Give it a try!
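To make that concrete, here's a minimal sketch of the flow, using only the SpitDI calls that appear in the full example later in this post (the Example class and the "hello" value are just placeholders):

import javax.annotation.Resource;

class Example {
   @Resource
   private String stuffA;   // matched by field name, no @Named required

   public static void main(String[] args) {
      SpitDI spit = new SpitDI();
      // Queue up the singleton value in the temporary container...
      spit.bindByName(String.class, "stuffA", "hello");
      Example target = new Example();
      // ...then push it onto the @Resource fields of the instance.
      spit.inject(target);
      System.out.println(target.stuffA);   // prints "hello"
   }
}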

But hold the phone. There are so many wonderful DI frameworks out there already, right?

Well... none that fit the Hadoop use case very well. Here was the rationale for creating Spit. Many thanks to Keegan Witt for investigating each of the existing options. It was one of those things that seemed easy in principle, but became painful in practice.
  • We first thought of Spring, but it felt a little heavy-handed for what we were trying to do. If we had already been using Spring for other purposes, we probably would have just used it for DI as well.
  • Then we found PicoContainer, but we wanted to inject annotated fields by name. That is, if a class had two Strings on it, stuffA and stuffB, both could be injected. Pico offered NamedFieldInjection and AnnotatedFieldInjection, but not both at the same time. It also was not ideal in that it did not work with JSR 250's @Resource annotation.
  • We also found Plexus. It was pretty specific to the maven repository use case and its syntax was not very terse.
  • We really wanted to use Guice because it seemed to be gaining popularity as a lightweight DI framework. So we brought it in and started working with it. It got us farther, but still fell short of our desires. Here's why Guice did not fit our Hadoop project and why I wrote Spit-DI. (Please bear with me. I'm anticipating most readers will suggest Guice so I feel the need to defend myself by walking through some examples here.)
  1. Lack of JSR 250 annotation support. Guice relies on its own @Inject annotation, not the standard javax one. The Mycila extension adds javax support, but when you switch to using javax's @Resource, you are unable to mix in @Optional and @Nullable - those only work with Guice's own @Inject. Here's how your code looks using Guice annotations:
    @Inject
    String stuffA;
    
  2. Redundant by-name bindings. Guice out of the box requires @Inject @Named("stuff") String stuff, rather than relying on the name of the variable by convention. Having to litter the code with @Named("sameAsVariableName") is not ideal.  Here's how your code looks now:
    @Inject @Named("stuffA")
    String stuffA;
    @Inject @Named("stuffB")
    String stuffB;
    
  3. Optionals. You may wish to use a model with injected fields both in the Map phase and the Reduce phase, where some injected properties only make sense for Map and some only make sense for Reduce. Having to litter the code with @Optional is not ideal. And now you see it's getting worse...
    @Inject(optional=true) @Named("stuffA")
    String stuffA;
  4. Nullables. It may be the case that you are injecting a null into a field and that is valid - especially when unit testing. Having to litter the code with @Nullable is not ideal.
    @Inject(optional=true) @Nullable @Named("stuffA")
    String stuffA;
    
  5. Statics. It was working against the grain, but we actually wanted some static fields on our POJOs: we create instances based on input data, and asking the container to wire up a new one every time was inefficient. We have domain entities with static fields (singletons) that are set once up front but available for reference by each smart domain model. More cruft (see the module sketch after this list)...
    @Inject(optional=true) @Nullable @Named("stuffA")
    static String stuffA;
    
    //...elsewhere in your Guice module's configure()...
    requestStaticInjection(MyClass.class);
    
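For reference, here's roughly what the Guice side of all that wiring looks like - a sketch using standard Guice APIs (bindConstant, requestStaticInjection, injectMembers), where HadoopModule and the bound values are my own placeholder names:

import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.name.Names;

// Hypothetical module showing the wiring the annotations above demand.
class HadoopModule extends AbstractModule {
   @Override
   protected void configure() {
      // One by-name binding per @Named field.
      bindConstant().annotatedWith(Names.named("stuffA")).to("valueA");
      bindConstant().annotatedWith(Names.named("stuffB")).to("valueB");
      // Required for every class that has @Inject static fields.
      requestStaticInjection(MyClass.class);
   }
}

// ...and in your Mapper's setup():
// Guice.createInjector(new HadoopModule()).injectMembers(this);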

Ok. Now do you believe me? Are you ready for the clean and simple way using Spit-DI?!!

Hadoop map-reduce with Spit-DI:
class MovieMapper extends Mapper {
   @Resource
   private Movie movie;

   @Override
   protected void setup(Context context) {
      DependencyInjector.instance().using(context).injectOn(this);
   }
}

class Movie {
   @Resource
   private Counter numMoviesRequested;
   
   public Integer getYear(String title) { 
     numMoviesRequested.increment(1);
     // more code...
   }
}

/**
 * You can have a wrapper class around Spit-DI for all your configuration.
 * (We have a TestDependencyInjector as well for the context of unit testing.)
 */
class DependencyInjector {
   private static final DependencyInjector INSTANCE = new DependencyInjector();

   private SpitDI spit = new SpitDI();

   // Matches the DependencyInjector.instance() call in the Mapper's setup() above.
   public static DependencyInjector instance() {
      return INSTANCE;
   }

   public void injectOn(Object instance) {
      spit.inject(instance);
   }

   public DependencyInjector using(final Mapper.Context context) {
      spit.bindByType(Movie.class, new Movie());
      spit.bindByName(Counter.class, "numMoviesRequested", context.getCounter("movies", "numMoviesRequested"));
      return this;
   }
}
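
And here's the payoff on the testability claim: a minimal sketch of a unit test, assuming JUnit 4 on the classpath (the test class name, movie title, and assertion are my own placeholders; Hadoop's in-memory Counters supplies a real Counter, so no Context needs to be mocked):

import static org.junit.Assert.assertEquals;

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.junit.Test;

public class MovieTest {
   @Test
   public void getYearIncrementsTheCounter() {
      // A real, in-memory Counter -- no mocked Context required.
      Counter counter = new Counters().findCounter("movies", "numMoviesRequested");

      // Bind it by name, exactly as the wrapper above does with the real Context.
      SpitDI spit = new SpitDI();
      spit.bindByName(Counter.class, "numMoviesRequested", counter);

      Movie movie = new Movie();
      spit.inject(movie);

      movie.getYear("Jurassic Park");
      assertEquals(1L, counter.getValue());
   }
}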

Ah, I can breathe again. In conclusion, Spit-DI doesn't have all the features of the others, but it was all we ever wanted for Hadoop dependency injection. I hope it works for you too. Please leave your feedback and feature requests, and happy coding!


(PS: I realize it has been 2 years since my last blog. I switched from Web Development to Hadoop Development so it took me this long to have my own thoughts, I guess. :) Hopefully, more to come!)

1 comment:

  1. Thanks Paul for the article.

    How does Spit-DI compare to Dagger 2? Any plan to use JSR 330?
