Monday, June 3, 2013

3 quick metrics to understand large codebases

Sonar is the tool of choice for single-project metrics, but it is harder to use it to look at an entire codebase. The Carfax Code Corpus, which I have been working with, consists of about 3 million lines of Java and Groovy code, 1100 projects (applications, jars, SQL and static content), and about 35 thousand classes. Even though at 6.5 GB it is small enough to fit on a thumb drive, getting a handle on such a codebase is beyond what a single person can do with an IDE or even most visualization tools. There are three metrics, easy to generate with basic parsing tools, that yield interesting insights.

The metrics I use are class size, class collaborators and class popularity. Together they give quick insight into the underlying relationships. Class size is measured in lines of code, as the file appears in a text editor. Class collaborators are the other classes needed to compile, measured by counting import statements. Class popularity is the same measurement inverted: how many other classes use this particular class as a collaborator. Import statements are not a perfect network measurement tool: wildcard imports are not counted, imports from the same package are not counted, and an imported class may not actually be needed to compile. After spot-checking for these potential problems, I do not believe they are an issue in this corpus.
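To make this concrete, here is a minimal sketch of the kind of parsing involved, assuming a directory tree of .java and .groovy sources. The class name, regexes and file handling below are my illustration, not the exact tooling I ran against the corpus:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.*;
import java.util.regex.*;

public class ClassMetrics {
    // wildcard imports fail this regex, so they are skipped, as noted above
    static final Pattern IMPORT = Pattern.compile("^import\\s+([\\w.]+)\\s*;");
    static final Pattern PACKAGE = Pattern.compile("^package\\s+([\\w.]+)\\s*;");

    final Map<String, Integer> size = new HashMap<>();              // class -> lines of code
    final Map<String, Set<String>> collaborators = new HashMap<>(); // class -> what it imports
    final Map<String, Integer> popularity = new HashMap<>();        // class -> times imported

    void scan(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                String name = file.getFileName().toString();
                if (name.endsWith(".java") || name.endsWith(".groovy")) record(file);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    void record(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        String pkg = "";
        Set<String> imports = new HashSet<>();
        for (String line : lines) {
            Matcher p = PACKAGE.matcher(line.trim());
            if (p.find()) pkg = p.group(1);
            Matcher i = IMPORT.matcher(line.trim());
            if (i.find()) imports.add(i.group(1));
        }
        String simpleName = file.getFileName().toString().replaceAll("\\.(java|groovy)$", "");
        String cls = pkg.isEmpty() ? simpleName : pkg + "." + simpleName;
        size.put(cls, lines.size());           // size: LOC as seen in a text editor
        collaborators.put(cls, imports);       // collaborators: classes it imports
        for (String imported : imports) {      // popularity: the reverse of collaboration
            Integer n = popularity.get(imported);
            popularity.put(imported, n == null ? 1 : n + 1);
        }
    }
}
```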

Size and popularity

First, I looked for frequently imported classes that are highly complex. In the diagram below, the "References" axis is the number of times a given class is imported. A large number of such classes would indicate problems in the codebase. The corpus seems healthy. There are still a few classes that are potential problems, and these are worth a closer look.
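Continuing the sketch above, a query for this group might look like the following. The 500-LOC and 50-reference cut-offs are hypothetical, chosen only to show the shape of the filter:

```java
// flag classes that are both large and widely referenced
for (Map.Entry<String, Integer> e : size.entrySet()) {
    Integer refs = popularity.get(e.getKey());
    if (refs != null && refs > 50 && e.getValue() > 500) {
        System.out.printf("%s: %d LOC, %d references%n", e.getKey(), e.getValue(), refs);
    }
}
```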

Size and collaborators

Next, I look for problem classes in another way: are there large classes that also do a lot of work with other classes? This is the group usually targeted when hunting for code smells.

In this codebase, having over 40 imports puts a class outside the mainstream. It is also interesting to take a look at the outliers. Some of them turn out to be quite innocent from the code-cleanliness perspective, like classes representing some particularly large file format. I do not believe this diagram carries much value, because it does not convey the actual impact of the offending classes, and yet this type of breakdown is what you get with traditional static analysis tools.
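For reference, pulling these outliers out of the sketch above is a one-loop affair:

```java
// list the classes outside the 40-import mainstream
for (Map.Entry<String, Set<String>> e : collaborators.entrySet()) {
    if (e.getValue().size() > 40) {
        System.out.printf("%s: %d imports%n", e.getKey(), e.getValue().size());
    }
}
```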

Popularity and collaborators

Finally, it is time to plot popularity against collaborators.

The striking pattern is that popular classes do not have many collaborators. I am still trying to understand the cause and effect here. The most popular classes fall into two groups: domain and framework. Domain classes represent something key in the business. Since Carfax is in the vehicle history business, these classes have to do with cars and their history. Framework classes deal with processing policies, for instance persistence or error reporting. And yet both of these groups follow the same pattern, which says: if you want to be popular, you had better not have many friends.

I am also puzzled by the magical power of the number 20. This number holds over a huge range of popularity: with the exception of one class, anything used more than 20 times imports fewer than 20 classes. How does this property emerge? What happens to a very useful class with over 20 imports as a developer tries to pull it into his project?
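The property is easy to test directly against the maps from the sketch above. This loop prints the exceptions, and in this corpus it would print just the one class:

```java
// rule of 20: classes imported more than 20 times should
// themselves import fewer than 20 classes
for (Map.Entry<String, Integer> e : popularity.entrySet()) {
    Set<String> imports = collaborators.get(e.getKey());
    int fanOut = (imports == null) ? 0 : imports.size();
    if (e.getValue() > 20 && fanOut >= 20) {
        System.out.printf("%s: %d references, %d imports%n", e.getKey(), e.getValue(), fanOut);
    }
}
```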

Parting thoughts

I think this approach is underrepresented in software engineering. It has practical uses. A consultant coming into a company would do well to familiarize himself with the outliers in the above diagrams. It also has implications for training new hires and setting coding standards. But there is more. I think that among the meaningless noise lie hidden laws that govern the fabric of software development.
