Monday, August 10, 2015

Advanced dependency management - Part 1


I remember when I was reading Jez Humble's Continuous Delivery book, I could not wait to get to the "hard" part. Having practiced continuous integration (CI) for a few years already, I knew most of the practices and benefits, but I also knew that the real difficulty with CI lay in testing the integration between components and orchestrating their releases. A component can be anything that is packaged and released on its own cycle: a binary library, a service, an application. The intricate dependencies between components and their evolution over time are quite complex, and I was looking for some practical advice on the topic from the experts. Unfortunately, Chapter 13, titled "Managing Components and Dependencies," left me disappointed. The toy example pictured above, which is discussed in the chapter, is spot on, but there is so much more to solving dependency CI problems in practice. For a few years now I've been waiting for a more in-depth treatment of the subject, or for a CI tool that would treat dependencies as a first-class citizen. There is still nothing out there that offers advice for dealing with this problem in practice and at scale, or at least holds my hand while I glance into the abyss of complexity and tells me that everything is going to be OK.

Finally, thinking the problem could not be that difficult, I decided to tackle it during one of the Carfax hackathons. I figured that by limiting the solution to managing Java's versioned jars, it would be a breeze to write a tool following a simple algorithm that would automatically update all dependencies in a graph. As you can imagine, I would not be writing these blog posts if this approach had been successful on the first try.

If you've ever been frustrated by a mysterious NoClassDefFoundError, a missing method, or a null pointer in a jar that had nothing to do with the code you were changing, you are in good company. It is simply the price we pay for sort of having our cake and eating it too. And if you think you have it bad: I was recently involved in an outage of a redundant cloud service that was supposed to be bulletproof. At the root of the disaster was - you guessed it - a dependency bump with unexpected consequences. In this series we will explore the weak points of the versioned binary dependency management that is the staple of the Java world, and hopefully come out the other end with some ideas on how to do CI on a realistic dependency graph.

Versioned module dependency management in Java

Dividing up code into pre-compiled modules speeds up releases and testing feedback (as anyone who has compiled the Linux kernel from scratch can attest). If we store the compiled modules somewhere (a place called an artifact repository), a developer only needs to download the modules their code needs to compile and run, and so time is saved both on not having to retrieve the entire codebase and on not having to recompile the modules.

Over time, modules need to be modified. Not all of these changes are backwards compatible, and even the ones intended to be may break some functionality of the code that uses them. The choices made at this point are at the core of CI practices. In an ideal world, the person making the changes would go in and verify that all the downstream dependents still work. Unfortunately, this doesn't scale very well, and it puts a huge premium on making a change. So the next option is to produce an artifact and let the developers of the downstream apps verify that everything works the next time they compile the dependent module. After enough changes have been put into the system like that, practically every application breaks the first time it is built, and developers have to dig through code they don't understand to figure out how to integrate the incoming changes. Needless to say, an unpredictable release schedule, plus the waste of having everyone learn the low-level details of breaking changes, is not very efficient. And yet this is what Java development looked like for its first nine years, from 1995 to 2004.

In 2004, what seems like ages ago, two tools appeared that started versioning the modules: Ivy and Maven. Each successive iteration of a module gets an incremental version number assigned, and a separate artifact is stored in the repository for each version. Downstream modules specify the artifact name and version of each of their dependencies. This allows developers to make changes in upstream modules, while developers working on downstream modules choose when to integrate. The versioned dependency model is so brilliantly simple, and has worked out so well in practice, that it has become the de facto standard delivery model in the Java world. However, it does have some fundamental shortcomings, which have been a source of frustration to those who run into them.
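As a concrete illustration, this is what pinning a dependency looks like in Maven's model; the coordinates below are made up for the example:

```xml
<!-- Illustrative POM fragment: the group/artifact/version coordinates
     pin this module to one specific published iteration of a library,
     so its developers decide when to pick up newer versions. -->
<dependencies>
  <dependency>
    <groupId>com.example</groupId>
    <artifactId>parser</artifactId>
    <version>1.4.2</version>
  </dependency>
</dependencies>
```

Bumping the integration point is then a one-line change to the `<version>` element, made on the downstream module's own schedule.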

First, when dependency resolution brings in two versions of the same jar, only one can remain on the classpath. Picking the survivor is known as conflict resolution, and by default the latest of the jars is allowed to remain. The problem is that the dependency which needed the older version is now implicitly using the new one, with unpredictable consequences. No compile error is generated, since all modules have been precompiled. There may be a runtime error about a missing class or method, but worst of all, a bug may go completely undetected at this point, since the upgrade has never been verified by the automated tests of the module that specified the old dependency. The reason this has not been a significant problem in practice is that the great majority of module changes are backwards compatible.
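To make the default behavior concrete, here is a minimal sketch of a latest-version-wins strategy applied to transitively collected coordinates. This is not any actual tool's implementation, and the artifact names and simple numeric version scheme are assumptions for the example:

```java
import java.util.*;

class ConflictResolver {
    // Given candidate (artifact, version) pairs pulled in transitively,
    // keep only the highest version of each artifact -- the "latest
    // wins" strategy described above. The losing version's consumer
    // silently runs against the winner.
    static Map<String, String> resolve(List<String[]> candidates) {
        Map<String, String> winners = new HashMap<>();
        for (String[] dep : candidates) {
            String artifact = dep[0], version = dep[1];
            winners.merge(artifact, version,
                (current, incoming) ->
                    compareVersions(current, incoming) >= 0 ? current : incoming);
        }
        return winners;
    }

    // Compare dotted versions numerically, component by component
    // (a simplifying assumption; real schemes are messier).
    static int compareVersions(String a, String b) {
        String[] as = a.split("\\."), bs = b.split("\\.");
        for (int i = 0; i < Math.max(as.length, bs.length); i++) {
            int ai = i < as.length ? Integer.parseInt(as[i]) : 0;
            int bi = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (ai != bi) return Integer.compare(ai, bi);
        }
        return 0;
    }
}
```

Note that nothing in this step consults the modules that declared the losing versions, which is exactly why the consequences are unpredictable.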

Second, the complexity of the system expands beyond what can be understood, or even effectively visualized, by a human. Consider the following simple example of (unversioned) module dependencies:

With versioned dependencies, we run into a situation where modules depend on past versions of other modules, so in reality the dependencies are the ones shown by the blue arrows: a third dimension of past versions and their dependencies is added to dependency resolution, one that cannot be visualized or manually managed effectively except in the simplest cases. While this third dimension cannot be completely eliminated, CI principles dictate that it should be kept to a minimum. And yet to date, no CI tooling exists that is aware of versioned module dependencies, not to mention facilitates their integration.

How a dependency management CI tool might work

It seems that a fairly straightforward, naive approach to building such a tool would be to:
  • arrange jars in layers, starting with the ones that have no dependencies
  • beginning with the second-lowest layer, update all dependencies and produce new artifacts
  • if any updates fail, triangulate until a successful combination of updates is found
  • move up the layers until the top is reached
This is pretty much the essence of Chaffee's mysterious "Cautious Optimism" algorithm described in Chapter 13 of the Continuous Delivery book. In the next post I will explain why this approach is too simplistic for the dependency graphs that could benefit from it the most, and offer some advice on how to approach dependency integration in such environments.
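For illustration, the layering step of this naive approach can be sketched as repeatedly peeling off dependency-free modules (a variant of topological sorting). The module names are made up, and the update/triangulate steps are only hinted at in a comment:

```java
import java.util.*;

class LayeredUpdater {
    // Step 1 of the naive algorithm: arrange modules in layers, where
    // layer 0 has no dependencies and each higher layer depends only
    // on layers below it. deps maps a module to its direct dependencies.
    static List<List<String>> layers(Map<String, Set<String>> deps) {
        Map<String, Set<String>> remaining = new HashMap<>();
        deps.forEach((m, d) -> remaining.put(m, new HashSet<>(d)));
        List<List<String>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Peel off everything whose dependencies are all satisfied.
            List<String> layer = new ArrayList<>();
            for (var e : remaining.entrySet())
                if (e.getValue().isEmpty()) layer.add(e.getKey());
            if (layer.isEmpty())
                throw new IllegalStateException("dependency cycle detected");
            Collections.sort(layer);  // deterministic order for readability
            layer.forEach(remaining::remove);
            remaining.values().forEach(s -> s.removeAll(layer));
            result.add(layer);
        }
        return result;
    }
    // The remaining steps would walk the result from layer 1 upward,
    // bump each module's dependencies to the artifacts just produced
    // below, rebuild and test, and on failure triangulate over subsets
    // of the updates -- the part the next post argues is too simplistic.
}
```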
