Monday, August 24, 2015

Jenkins as a Hadoop Job Scheduler

[Photo: "Which Way" by oatsy40]
Jenkins is a well-known continuous integration server - checking out source code, running unit tests, yada yada yada.  However, because of its simplicity, I've been able to leverage it for a variety of use cases: lately, as a Hadoop Job Scheduler.

The expert panel pick was Oozie.  But whenever I asked those "experts" for their use cases for Oozie, they'd tell me, "Oh, I've never actually used it, just heard that's what you do."  Well, that's just great.  I played around with Oozie by scheduling a simple "echo 123" command: it launched a JVM on every node and never printed the result. At my company, not everything we schedule on the Hadoop cluster is a Hadoop job, and certainly not a map-reduce job. We have bash scripts that verify data. We have Groovy scripts that poll a database on a different server and, if triggered, run a Hadoop job. I found Oozie cumbersome and limited in features.

The underground pick was Azkaban, created by LinkedIn. I looked at that and liked the simplicity of job workflows being described as plain text files. I loved the workflow diagrams it provided, giving you a clear picture of how multiple jobs fit together.  However, it too was limited in features. In particular, concurrency was an all-or-nothing setting at the global level.  We wanted to cap the total number of jobs running at once globally, while also limiting certain job types to a single running instance at a time.

Jenkins is what came to mind, but I felt peer pressure to try Oozie and Azkaban. Jenkins was not a popular choice for scheduling Hadoop jobs. Did I say not a popular choice? I mean I heard things like, "Nobody in their right mind would choose Jenkins for this! Isn't Jenkins for continuous integration?!?" ... but wait a minute!

Here's what we get with Jenkins:

Ability to run any command
  Not just Hadoop sorts of jobs like Java MapReduce, Sqoop, Pig, or Hive, but absolutely anything that can be scripted. You can use the right tool for the job and conditionally launch those parallel Hadoop jobs.
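As a sketch of what an "Execute shell" build step might look like, here is a hypothetical wrapper that checks a data-ready marker file before deciding whether to launch a Hadoop job (the function name, marker path, and jar name are all illustrative, not from our actual setup):

```shell
#!/usr/bin/env bash
# Hypothetical Jenkins "Execute shell" build step: only launch the Hadoop
# job when the upstream data-ready marker file exists.
launch_if_ready() {
  local marker="$1"
  if [ -f "$marker" ]; then
    # In a real job this line would be something like:
    #   hadoop jar our-etl.jar com.example.EtlDriver /data/incoming
    echo "launch"
  else
    echo "skip"
  fi
}
```

A nice side effect: when `hadoop jar` exits nonzero, Jenkins marks the build as failed and fires notifications with no extra work on your part.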
Cron-like scheduling
  Basic, I know. Most tools have this, but Jenkins's is also easy to use.
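For illustration, the "Build periodically" field takes a cron-style expression; a nightly run at some point in the 2 AM hour (Jenkins's `H` token spreads jobs across the hour so they don't all fire at once) might look like:

```
# Jenkins "Build periodically" schedule: MINUTE HOUR DOM MONTH DOW
H 2 * * *
```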

Email notification
  One neat thing about Jenkins is its very active plugin ecosystem. There are plugins for templated emails, plugins that fail the build when certain text appears in the console, plugins that send a tweet or an SMS text message, and so on.

Console log streamed to web UI
  We have a nice history of all the job output from scripts and Hadoop driver output all in one place on Jenkins. We can cap the history at any number of builds or by date.

Some concurrent job support
  Jenkins has a global maximum of concurrent builds in the Manage Jenkins administration screen, and each job can be set to allow concurrent builds or not.  There is also a plugin for finer-grained control over concurrent builds (see the Throttle Concurrent Builds plugin).

Parameters for ad-hoc runs
  This feature is really handy, and I've not found it in other tools. Jenkins has two ways to programmatically kick off a build: a command-line interface (CLI) and a REST API.  We basically built a job submission UI which launches a Jenkins job underneath, giving us all that history, progress, and console output, not to mention failure notification. (Of course, Jenkins also has a standard way of prompting you for the parameters of a build when using the UI directly.)
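For instance, kicking off a parameterized build over the REST API is a single POST to the job's `buildWithParameters` endpoint. Here is a small helper that assembles that URL (the host, job name, and parameter names below are placeholders, not our real setup):

```shell
#!/usr/bin/env bash
# Build the Jenkins REST URL for triggering a parameterized build.
# Usage: build_url JENKINS_URL JOB_NAME [KEY=VALUE ...]
build_url() {
  local jenkins="$1" job="$2"
  shift 2
  local url="${jenkins}/job/${job}/buildWithParameters"
  local sep="?" kv
  for kv in "$@"; do
    url="${url}${sep}${kv}"
    sep="&"
  done
  echo "$url"
}

# Then submit it with your credentials, e.g.:
#   curl -X POST -u user:apitoken \
#     "$(build_url http://jenkins:8080 nightly-etl RUN_DATE=2015-08-24)"
```

The CLI route is similar in spirit: something like `java -jar jenkins-cli.jar -s http://jenkins:8080 build nightly-etl -p RUN_DATE=2015-08-24` (again, names are illustrative).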

Limitations:
I recognize there are limitations to what Jenkins can do - it is not an enterprise scheduler. So we may someday outgrow Jenkins and migrate to an enterprise scheduler. However, it's been a great ride for the past two years of getting Hadoop going the way we wanted.  Here are a few of those limitations:

 - No way to set up automatic reruns of failed jobs
 - No cross-system control, where a client can stream feedback to a server to trigger other jobs
 - No dashboard view of jobs that ran at a particular time of day
 - Poor load control; other tools can limit the number of instances of specific kinds of jobs based on capacity
 - We ran Jenkins on our submitting edge node, in the cluster. Depending on your security, you may not have that luxury

Nonetheless, if you are searching for a lightweight way to get more exposure to your Hadoop cluster, I recommend giving Jenkins a shot.

5 comments:

  1. We also ran Jenkins as our Hadoop job scheduler at one of my big clients, it just works. Oozie has such a legacy feel to it.

  2. I've used Jenkins to schedule Hadoop and other jobs in my last 4 ventures; it's the ultimate in down-and-dirty v0 job coordination. I picked up the trick from a team of data scientists at Apple.

  3. Paul, this is a very interesting approach. I like Jenkins too, but I have never tried to use it this way. Great run down of how you used it.

  4. At my last gig, the client was using Hudson to schedule and manage dozens of legacy batch jobs, notifications, and dependencies. I was very surprised at first, but once you think about it, it makes a great scheduler.
