Sunday, January 26, 2014

What does it take to improve batch process performance 4,500 times?

The subject is a bit atypical for this blog, but colleagues have asked me multiple times about the details of this project, so here it is for others to comment on and draw ideas from.

Our team recently achieved a 4,500-fold performance improvement in a key component of our legacy batch processing system. The component is responsible for classifying records, and it was under constant strain. It had a messaging subsystem and a deduplication strategy which ensured that only records that changed that day were reclassified. Over the years, the classification data drifted out of sync, but it was impossible to reclassify all records, because the system could only process a day's worth of changes, about 2 million records. Additionally, any change to the rules that would update a significant portion of the records had to be carefully planned and throttled. The new system we put in place does not concern itself with the daily changeset: it can reclassify the entire dataset of 30 million records in 5 minutes.

To achieve this, we focused on understanding the computation that was taking place. Our classifier essentially performs a map-reduce operation, with about 7 input records sharing the same key field producing one output record. There are approximately 30 million input records. The system is I/O bound, with the reduce step taking little computational power. Knowing these facts gave us pause, because there is nothing daunting about processing 30 million records, and yet our system was unable to cope with a tenth of that. At this scale, we did not feel it necessary to introduce any special technology, and stayed within our Oracle relational database.

Efficient mapping was accomplished by keeping an index on the key field and using an order by clause in the select statement. This allowed us to start processing records before the query completed. Additionally, we noticed that characters in certain positions of the key were well distributed, so we used the underscore wildcard of the like operator and multithreaded the job. The select statement for each thread looks like this:

select * from table where key like '___X____' order by key
where X is a different character for each thread, and keys are distributed evenly across threads. The source code reads one record at a time from the result set, and when the key changes between records, it outputs the classification for all records with the previous key. See the partial pseudo-code below.
recordsToClassify = new List()
currentKey = 'NOT_INITIALIZED'
while (results.next()) {
  currentRecord = readRecord(results)
  if (currentRecord.key != currentKey && !recordsToClassify.isEmpty()) {
    classify(recordsToClassify)
    recordsToClassify.clear()
  }
  currentKey = currentRecord.key
  recordsToClassify.add(currentRecord)
}
classify(recordsToClassify)  // flush the final group
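The same streaming group-by-key logic, written out as a runnable sketch (in Python for brevity; `classify` here is a hypothetical stand-in that simply collects each completed group):

```python
def classify_stream(sorted_records, classify):
    """Stream records that arrive sorted by key; each time the key
    changes, classify the accumulated group and start a new one."""
    group, current_key = [], None
    for record in sorted_records:
        if group and record["key"] != current_key:
            classify(list(group))
            group.clear()
        current_key = record["key"]
        group.append(record)
    if group:
        classify(list(group))  # flush the final group

# Hypothetical usage: collect the groups instead of classifying them.
records = sorted(
    [{"key": "A", "v": 1}, {"key": "B", "v": 2}, {"key": "A", "v": 3}],
    key=lambda r: r["key"],
)
groups = []
classify_stream(records, groups.append)
```

Note that grouping this way never holds more than one key's worth of records in memory, which is what makes streaming over the full 30 million rows feasible.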

Finally, on the output end of things, since we did not have to deal with adds and drops, we could truncate the classification table before writing data to it.

It is difficult to gauge how much each part of this solution contributes to the performance improvement, since the new and old systems are so different. However, the important performance differentiators are as follows:

  • Batch processing. Asking the database to retrieve all records at once is a lot more efficient than running millions of queries for a small subset of records
  • Processing while the query is running. Under the hood, the database retrieves records in batches, and ResultSet.next() is JDBC's abstraction that allows reading records as they become available, without waiting for the query to complete. Besides the performance opportunity, this also ensures that the system does not run out of memory while reading a large dataset. Plain JDBC seems to be the tool of choice here, as just about every framework makes getting at this functionality difficult, if not impossible.
  • No distractions: logging, deduplication of changes. A system that does a complete resync in 5 minutes does not need careful bookkeeping.
  • Writes are much faster than updates. The new system outputs to an empty table, while the old one needed to modify existing records.
  • Recognizing the opportunity for distributed processing. Our map-reduce job was partitioned by key, and selecting keys on a specific character through the like clause with underscores is very efficient. Since keys were already indexed, each query just walks its portion of the index and retrieves records in index order.
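To illustrate the last point, here is a small sketch (in Python, with a made-up 8-character key format) of how a well-distributed character position splits the dataset into disjoint shards that workers can scan in parallel:

```python
from collections import defaultdict

def partition(keys, position):
    """Shard keys by the character at a fixed position,
    mimicking a SQL pattern such as like '_______X'."""
    shards = defaultdict(list)
    for key in keys:
        shards[key[position]].append(key)
    return shards

# Made-up numeric keys; the last position is well distributed.
keys = ["%08d" % i for i in range(1000)]
shards = partition(keys, position=7)
# Ten disjoint shards that together cover every key,
# so ten threads can each run their own like-query.
```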

One final point to ponder is how simple the end result turned out to be. Having been there at the front lines, I can attest that achieving this simplicity wasn't easy. Even after we thought we understood what the old system was doing, we iterated over several prototypes. Each prototype was simpler than its predecessor, and each time it seemed we had reached the limits of optimization. While many optimizations trade complexity for performance, this is an example of an optimization achieved mainly through the reduction of complexity.

Monday, June 10, 2013

Migrating interdependent jars

We have recently completed a build and infrastructure migration at Carfax. It involved moving around 200 jars to a new build and distribution system. We wanted to work in small steps and be able to ship applications throughout the migration. This was challenging, because our dependency graph was highly connected. We found it useful to transform the dependency graph by removing certain redundant edges and ordering nodes by how many levels of separation they had from leaf nodes. We then migrated leaf nodes first, then the next level up, and so on until all levels were converted. This bottom-up, breadth-first approach simplifies many issues around migration.

To explain what we did, it will be helpful to look at how dependencies may be laid out for a single application:

If we were to migrate jar4 first, our new build system would have to understand how to pull jar2 from the old build system, and jar3 would need to understand both the old and new build systems in order to retrieve all of its dependencies. If we then migrated jar2 next, we would have to rebuild jar4 in order to eliminate the old version of jar2 from its dependency tree. On the other hand, if we started with jar2 and then converted jar4, the new build system would not need to know anything about the old build system, and jar4 would only need to be built once. So it should be clear that the order in which jars get migrated makes a difference in how much work has to be done. What, then, is the optimal migration order?

Intuition from the above example suggests that in order to minimize work, we should ensure that all of a given jar's dependencies have been migrated before the jar itself is. If we assign level 1 to all jars that do not have any dependencies, level 2 to all jars that depend only on level 1 jars, and so on, we arrive at the following graph:

Note that jar3's level is determined by the highest level among its dependencies, which is jar4's. If we left jar3 at the same level as jar4, it would not be clear which one to migrate first. Another thing to note is that for the purposes of visualizing the migration, we can safely remove 3 edges, because the corresponding jars are brought in transitively anyway. Now we can proceed with converting all level 1 jars, followed by level 2, and so on.
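The level assignment can be expressed compactly: a jar's level is one more than the highest level among its dependencies. A Python sketch over a hypothetical graph shaped like the example (jar4 depends on jar2; jar3 depends on jar2 and jar4):

```python
def level(node, deps):
    """Migration level: 1 for leaves, otherwise 1 + highest dependency level."""
    if not deps[node]:
        return 1
    return 1 + max(level(d, deps) for d in deps[node])

# Hypothetical dependency graph matching the example above.
deps = {
    "jar1": [],
    "jar2": [],
    "jar4": ["jar2"],
    "jar3": ["jar2", "jar4"],
    "app":  ["jar1", "jar3"],
}
levels = {jar: level(jar, deps) for jar in deps}
# Migrate all level 1 jars first, then level 2, and so on.
```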

The above example is very simplified. This is the problem our team was facing in reality:

Note that the levels above were assigned by the graphing software (GraphViz) and do not have any meaning in the context of the migration.

And now a graph that orders nodes by their migration level:

The bottom-up breadth-first migration approach has these important characteristics:

  • Can always ship latest software.
  • Number of times a jar is rebuilt is minimized.
  • Build system complexity is reduced. The new build does not need to integrate with the old build. (The old build, however, still needs to integrate with the new build.)
  • Throughput is maximized. Libraries of the same level can be migrated in parallel, which means less coordination is necessary on the migration team.
  • No surprises. No transitive dependencies using the old build will ever show up during dependency resolution.

Monday, June 3, 2013

3 quick metrics to understand large codebases

Sonar is the tool of choice for single-project metrics, but it is a bit harder to use it to look at an entire codebase. The Carfax code corpus that I have been working with consists of about 3 million lines of Java and Groovy code, 1,100 projects (applications, jars, sql and static content), and about 35 thousand classes. Even though at 6.5 GB it is small enough to fit on a thumb drive, getting a handle on such a codebase is beyond what a single person can do with an IDE or even most visualization tools. There are 3 metrics that are easy to generate with basic parsing tools and that yield interesting insights.

The metrics I use are class size, class collaborators and class popularity. They give quick insight into the underlying relationships. Class size is measured in lines of code as it appears in a text editor. Class collaborators are the other classes needed to compile, measured by import statements. Class popularity is measured the same way, in reverse: how many other classes use this particular class as their collaborator. Import statements are not a perfect network measurement tool: wildcard imports are not counted, imports from the same package are not counted, and an imported class may not actually be needed to compile. After spot-checking for these potential problems, I do not believe they are an issue in this corpus.
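All three metrics can be harvested with a few lines of parsing. A sketch (in Python; the regex is a simplification, and per the caveats above, wildcard imports are filtered out):

```python
import re
from collections import Counter

# Matches "import com.foo.Bar;" and "import static com.foo.Bar;".
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.*]+)", re.MULTILINE)

def corpus_metrics(sources):
    """sources maps a fully-qualified class name to its file text."""
    size = {name: len(text.splitlines()) for name, text in sources.items()}
    collaborators = {
        name: [imp for imp in IMPORT_RE.findall(text) if not imp.endswith(".*")]
        for name, text in sources.items()
    }
    popularity = Counter(imp for imps in collaborators.values() for imp in imps)
    return size, collaborators, popularity

# Two tiny made-up source files for illustration.
sources = {
    "com.app.Main": "package com.app;\nimport com.util.Strings;\nimport com.util.*;\nclass Main {}\n",
    "com.util.Strings": "package com.util;\nclass Strings {}\n",
}
size, collaborators, popularity = corpus_metrics(sources)
```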

Size and popularity

First, I looked for frequently imported classes that are highly complex. In the diagram below, the "References" axis is the number of times the given class is imported. A large number of such classes would indicate problems in the codebase. The corpus seems healthy. There are still a few classes that are potential problems, and these are worth a closer look.

Size and collaborators

Next, I look for problem classes in another way: are there large classes that also use many other classes? This is usually the group targeted by code-smell analysis.

In this codebase, having over 40 imports puts a class outside the mainstream. It is also interesting to take a look at the outliers. Some of them turn out to be quite innocent from the code cleanliness perspective, like classes representing some particularly large file format. I do not believe this diagram carries much value, because it does not convey the actual impact of the offending classes, and yet this type of breakdown is what you get with traditional static analysis tools.

Popularity and collaborators

Finally, time to plot popularity and collaborators.

The striking pattern is that popular classes do not have many collaborators. I am still trying to understand the cause and effect here. The most popular classes fall into two groups: domain and framework. Domain classes represent something key to the business. Since Carfax is in the vehicle history business, these classes have to do with cars and their history. Framework classes deal with processing policies, for instance persistence or error reporting. And yet both groups follow the same pattern, which says: if you want to be popular, you had better not have many friends.

I am also puzzled by the magical power of the number 20. This number holds over a huge range of popularity: with the exception of one class, anything used more than 20 times imports fewer than 20 classes. How does this property emerge? What happens to a very useful class with over 20 imports as a developer tries to bring it into his project?

Parting thoughts

I think this approach is underrepresented in software engineering. It has practical uses. A consultant coming into a company would do well to familiarize himself with the outliers in the above diagrams. It also has implications for training new hires and setting coding standards. But there is more. I think that among the meaningless noise lie hidden laws that govern the fabric of software development.

Sunday, April 14, 2013

Groovy and Java class sizes in practice

The shared perception is that Groovy offers productivity gains over Java. So how much smaller is an average Groovy class than a Java one? Here is how it works out at our company. At Carfax we have about 10 years of Java and 5 years of Groovy in our codebase. The entire dataset is 3 million lines of code. For this comparison I only included Java files which have been modified since the introduction of Groovy, so as to account for how coding practices change over time. A summary of the results is in the table below (LOC stands for lines of code).

                  Java    Groovy
 total classes  26,135     5,681
 avg. LOC           98        70
 median LOC         57        38

Groovy files are shorter overall, but not by that much. Groovy classes are around 70% the size of Java classes. The one difference I was able to find is for the subset of small classes, under 300 LOC. Take a look at the diagram below.

Raw data:

 LOC      Java %   Groovy %
 0-9         3.1       10.7
 10-19       8.7       19.0
 20-29      11.7       10.4

Notice how 30% of Groovy classes are under 20 lines of code, while it is only 12% for Java. It appears that useful Groovy classes can be smaller than their Java counterparts, which matches the "less boilerplate" perception of Groovy. It is surprising how small an effect this actually has in practice. If you open up an average Groovy file in our codebase, it will not take much less screen space than a Java one.
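The bucket percentages in the raw-data table can be reproduced from per-class LOC counts with a few lines; a sketch in Python, run here on a tiny made-up sample:

```python
from collections import Counter

def loc_histogram(loc_values, bucket=10):
    """Percentage of classes falling into each LOC bucket (0-9, 10-19, ...)."""
    counts = Counter(loc // bucket for loc in loc_values)
    total = len(loc_values)
    return {
        "%d-%d" % (b * bucket, b * bucket + bucket - 1): round(100.0 * n / total, 1)
        for b, n in sorted(counts.items())
    }

# Five hypothetical classes with these line counts.
histogram = loc_histogram([5, 12, 17, 25, 40])
```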

Thursday, March 7, 2013

Key unsolved problems with Java jar architecture

Writing correct code and testing thoroughly are difficult enough without having to worry about the mechanics of how the build puts application code together. Any manual step that stands in the way of a developer getting feedback on his code changes will eventually be paid for with fatigue, lowered productivity and bugs. The dependency system, architecture, and build system should intuitively do the right thing, and be streamlined for the most common development operations. The purpose of this article is to describe a type of Java architecture I am familiar with and its underlying limitations. Throughout, I will interject some possible remedies, although I think the most value comes just from articulating the issues and recognizing that they can be solved.

World of JAR graphs

The kind of Java ecosystem I focus on here consists of applications and the modules they depend on. The distinguishing feature of an application is that it either has a Main class or is deployed on a web server. Duplicated code is abstracted into modules called jars, which can be shared among applications. An additional benefit of modules is that they can be precompiled into jars and stored in a repository, so the application can be built much more quickly. In order to stabilize the application promotion process, jars are versioned and versions are explicitly specified for every dependency. Jars can depend on other jars, and thus the whole ecosystem forms one or more directed graphs.

To get an idea of the kind of problems I will be describing here, note how the existence of versioned jars introduces intrinsic problems into the build. There may not be a single class in common between two versions of the same jar, and their dependency graphs could be completely different. If we wanted to assure correctness, we would have to treat each version of a module as a separate module in its own right. But then how do we account for the fact that most modules share most of the same class names between versions? If this is the kind of thing that keeps you awake at night, and frustrated during the day, please read on.

Local build problem

When code is located in the application, the change-test cycle is quick and seamless. Change code, run the application, observe new behavior: this seems to be the smallest possible set of steps (although even that has been questioned by Bret Victor). This quick cycle allows for good flow and encourages small steps. However, as soon as the code change has to happen in a module, additional steps are introduced into the feedback cycle. Once the change is made, the module has to be published to a local repository, usually with a new version number, and the application has to specify that new version number before it is built. Having to perform these steps breaks the flow by making the developer shift focus to the mechanics of the build.

The local build problem is addressed by many techniques, from introducing a project dependency in Eclipse, through Maven snapshot builds, to the git submodule feature. None of these is satisfactory. Ideally, the build system should detect whether source code is present on the local developer machine and build from source, and otherwise use the binary dependency (I first saw this idea mentioned in Adam Murdoch's post).

Publish problem

Once a change in the module has been made and tested with the local version of the application, it is time to publish that module to a public repository and start the promotion process for the app. This single logical operation involves multiple steps, for instance: manually change the version, commit the code, build the jar to be published, and change the application to use the new version. In an ideal world, the developer would just commit the code and everything else would happen automatically.

Possible solution to the local build and publish problem

We do not have to accept the status quo. Consider an example of a hypothetical system that supports a low-friction workflow. This system uses version naming conventions to facilitate automation.

We will use the following version numbering scheme: moduleA-X.Y.Z.jar, where X, Y, and Z are version numbers. Position X can be manually updated by the developer if there are breaking changes that need to be visually indicated to other developers or to the dependency conflict resolution mechanism. Position Z is reserved for local builds. Artifacts in the public repository always have 0 in position Z. When a developer makes a change in the module and runs the deploy script, it automatically increments Z, say by assigning a timestamp to it. The application is always configured to let local workstation versions of the module win resolution against the versions in the public repository, i.e., the dependency is specified as a "plus" type (moduleA-X.Y.+). When the application is built locally, the build system automatically deletes from the local workstation repository all versions from the previous day (facilitated by using timestamps). This prevents stale versions from unrelated work in other modules from getting into the local application build. When the application is built for promotion, say on a CI server, the + dependency resolves to the officially published version in the public repository, which always has Z equal to 0.
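A sketch of the resolution rule (in Python; the version strings are made up, and `resolve` is a simplified stand-in for what a real dependency manager does with a "plus" dependency):

```python
def parse(version):
    return tuple(int(part) for part in version.split("."))

def resolve(pattern, candidates):
    """Resolve an 'X.Y.+' dependency to the highest matching version.
    Local builds carry a timestamp in Z, so they beat the published Z of 0."""
    prefix = pattern[:-1]          # "2.5.+" -> "2.5."
    matching = [v for v in candidates if v.startswith(prefix)]
    return max(matching, key=parse)

# On a workstation, a timestamped local build wins over the published 0:
local = resolve("2.5.+", ["2.4.0", "2.5.0", "2.5.20130607120000"])
# On a CI server only published artifacts exist, so '+' resolves to Z == 0:
published = resolve("2.5.+", ["2.4.0", "2.5.0"])
```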

The publish problem is then somewhat alleviated by allocating position Y for automatically generated public versions of the jar. Upon commit, the Continuous Integration server picks up the change and starts a publish job, with a timestamp or other incremental value in position Y (and 0 in position Z). This solution addresses only part of the publish problem by eliminating the manual version bump. The application's dependencies still need to be updated with the new public version of the jar. Read on to see how this problem could be solved with the introduction of a new build primitive.

Bump downstream problem

When an update is needed in a jar that the application brings in transitively, the publish problem degenerates into what I refer to as the bump downstream problem. Consider a case where AppX depends on moduleA, which depends on moduleB, which depends on moduleC. If the change needs to happen in moduleC, the versions have to be bumped in the hierarchy all the way up to the application level. This is a manual and tedious process. If additional applications also depend on moduleC, it is unlikely that the developer would try to integrate his change in moduleC into those applications, because the manual process is in the way.

What could be useful is a build that supports a primitive operation "bump-downstream" with the terminal node in the path as the parameter (i.e., from within moduleC's build directory we could execute the command "bump-downstream(AppX)") to have all the necessary modules brought in locally, bumped, unit tested, committed, and deployed all the way through the application level. The process would terminate if tests failed at any level before getting to the application.
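The heart of such a primitive is just a walk of the dependency graph: find every module between the changed jar and the application, and bump them in dependency order. A sketch (in Python; module names follow the example above):

```python
def bump_order(changed, app, deps):
    """Modules needing a version bump when `changed` changes,
    listed in rebuild order (dependencies first).
    `deps` maps each module to its direct dependencies."""
    def reaches(module):
        # A module needs a bump if it can reach `changed` through deps.
        return module == changed or any(reaches(d) for d in deps.get(module, ()))
    order, seen = [], set()
    def visit(module):
        if module in seen or not reaches(module):
            return
        seen.add(module)
        for d in deps.get(module, ()):
            visit(d)
        order.append(module)
    visit(app)
    return order

deps = {"AppX": ["moduleA"], "moduleA": ["moduleB"],
        "moduleB": ["moduleC"], "moduleC": []}
chain = bump_order("moduleC", "AppX", deps)
# ["moduleC", "moduleB", "moduleA", "AppX"]
```

Each module in the returned chain would be bumped, tested, committed and published before the next one is touched, terminating early if tests fail.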

Workflow example

Let us take a look at what a streamlined workflow with a module may look like once we add the "bump-downstream" operation to it.

The developer checks out AppX and the moduleA source code. He writes a failing test in AppX. He makes a change in the moduleA source, runs the "local-deploy" target, and runs his tests in the app. He continues this cycle until he is satisfied with local development and testing. He then commits the module and AppX code. Finally, he runs "bump-downstream(AppX)" from within moduleA.

Note that this workflow is the same no matter how distant a dependency moduleA is. There are no manual modifications to version numbers anywhere. All the complexity of maintaining the network of dependencies and publishing is managed automatically inside the bump-downstream operation. This kind of solution is achievable by any development team with the use of existing Java build tools and a bit of dedication.

Continuous integration problem

The existence of multiple versions of jars in the dependency graph poses a number of problems.

ModuleA and moduleB could depend on different, incompatible versions of moduleC (e.g., a class could have been deleted or renamed). If an application needs both moduleA and moduleB, a choice has to be made about which version of moduleC to bring in. Dependency conflict resolution tools do not have a good way of dealing with incompatible versions; all they can offer is a "latest wins" or "fail on major conflict" strategy. OSGi offers to solve this problem, but at the price of complicating the build.
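The "latest wins" strategy is simple to the point of being dangerous; a sketch of what resolvers essentially do (in Python, with made-up version numbers):

```python
def latest_wins(requests):
    """Pick, for each module, the highest requested version --
    whether or not the versions are actually compatible."""
    def key(version):
        return tuple(int(part) for part in version.split("."))
    chosen = {}
    for module, version in requests:
        if module not in chosen or key(version) > key(chosen[module]):
            chosen[module] = version
    return chosen

# moduleA asks for moduleC 1.2.0, moduleB asks for moduleC 2.0.0:
picks = latest_wins([("moduleC", "1.2.0"), ("moduleC", "2.0.0")])
# 2.0.0 wins even if it renamed a class that moduleA still needs.
```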

Another issue with multiple simultaneous live versions is that if a change in a jar causes a compatibility issue downstream, the developer who makes it gets no immediate feedback. If the developer knew of the problem, he could try another, non-breaking solution, or at least would have everything fresh in his mind to resolve as many downstream problems as possible. The static dependency system that is so essential to application release stability makes it hard to do the right thing.

I refer to this as the continuous integration problem, because I am biased toward solving it through increasing integration pressure. Using the "bump-downstream" primitive and the cautiously optimistic algorithm described in the Continuous Delivery book's chapter on component architectures, the system can automatically increment all inter-jar dependencies in the graph, while leaving applications alone. This is not a trivial problem, as any automated system at this scale needs good traceability and has to be able to automatically recover from incompatible changes.

Appendix: Other problems

Since this article has a humble title that nevertheless makes pretenses to completeness, I wanted to list some other problems here which I do not discuss in depth. These problems result from insufficient checks by the JVM or compiler, which allow logic errors in dependency organization to go undetected. Feedback is delayed until a problem manifests itself as a hard-to-track bug.

Cycles in the graph

It is possible to have cycles in an application's dependency graph. The problems with dependency cycles have been well described, and even though the remedies are known, not even high-profile projects like Logback are immune to them. The Java compiler and virtual machine do nothing to prevent cycles.
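Detecting cycles mechanically is a straightforward depth-first search over the jar graph; a sketch (in Python) that returns one offending cycle if present:

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None."""
    visiting, done, stack = set(), set(), []
    def dfs(node):
        visiting.add(node)
        stack.append(node)
        for dep in deps.get(node, ()):
            if dep in visiting:          # back edge: cycle found
                return stack[stack.index(dep):] + [dep]
            if dep not in done:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        stack.pop()
        return None
    for node in list(deps):
        if node not in done:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None
```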

Class name conflict

The Java package system discourages name conflicts, but does not prevent them. Class org.foo.Bar can exist in more than one jar, and neither the compiler nor the JVM does anything to prevent this. Which class gets used depends on the runtime behavior of the app. One common scenario in the open source world is when a module's name or organization changes and the old versions continue to exist in public repositories under the old names. Another is when a framework requires certain classes by name, leaving the implementation of these classes to third parties. Some frameworks will complain at runtime if multiple bindings exist in such a case. OSGi can help solve this problem, but at the cost of complicating the build. This problem exists in spring-struts 3.0.6.RELEASE (see my post on another site).
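Finding such conflicts mechanically amounts to listing the .class entries across all jars on the classpath and flagging duplicates; a sketch (in Python, demonstrated on two in-memory jars rather than real files):

```python
import io
import zipfile
from collections import defaultdict

def class_conflicts(jars):
    """Map each duplicated .class entry to the jars containing it.
    `jars` is a list of (name, file-like zip) pairs."""
    owners = defaultdict(list)
    for name, jar in jars:
        with zipfile.ZipFile(jar) as zf:
            for entry in zf.namelist():
                if entry.endswith(".class"):
                    owners[entry].append(name)
    return {cls: names for cls, names in owners.items() if len(names) > 1}

# Demo: two hypothetical jars that both ship org/foo/Bar.class.
def fake_jar(entries):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for entry in entries:
            zf.writestr(entry, b"")
    return buf

jars = [("old-foo.jar", fake_jar(["org/foo/Bar.class", "org/foo/Baz.class"])),
        ("new-foo.jar", fake_jar(["org/foo/Bar.class"]))]
conflicts = class_conflicts(jars)
```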

Stale dependencies problem

Nothing prevents a module from specifying dependencies that are not needed. While it is possible to point out which dependencies are not needed to compile the existing code, only a thorough test suite can ensure that all declared dependencies are actually exercised. I am not familiar with a tool that would identify unneeded runtime dependencies. Having unnecessary dependencies in the graph makes it harder to draw correct conclusions from graph analysis.

Dependencies not specified at the correct level

The dependency resolution systems I am familiar with resolve the entire graph and add everything to the classpath before attempting to compile. This means that a dependency needed at compile time may be brought in transitively through a different module. If the next version of that module removes the dependency, the result is a surprising compile error that may be difficult to track down. While this problem is in principle solvable for compile-time dependencies, it is very hard to both solve and debug for runtime dependencies.

Tools

The Maven dependency plugin has an analyze goal that addresses stale and missing dependencies. Tattletale is a straightforward tool that I have used to identify class name conflicts and circular dependencies.