Managing highly repetitive code and documentation in Java
Highly repetitive code is generally a bad thing, and there are design patterns that can help minimize it. Sometimes, however, it's simply inevitable due to the constraints of the language itself. Take the following example from java.util.Arrays:
/**
* Assigns the specified long value to each element of the specified
* range of the specified array of longs. The range to be filled
* extends from index <tt>fromIndex</tt>, inclusive, to index
* <tt>toIndex</tt>, exclusive. (If <tt>fromIndex==toIndex</tt>, the
* range to be filled is empty.)
*
* @param a the array to be filled
* @param fromIndex the index of the first element (inclusive) to be
* filled with the specified value
* @param toIndex the index of the last element (exclusive) to be
* filled with the specified value
* @param val the value to be stored in all elements of the array
* @throws IllegalArgumentException if <tt>fromIndex > toIndex</tt>
* @throws ArrayIndexOutOfBoundsException if <tt>fromIndex < 0</tt> or
* <tt>toIndex > a.length</tt>
*/
public static void fill(long[] a, int fromIndex, int toIndex, long val) {
    rangeCheck(a.length, fromIndex, toIndex);
    for (int i=fromIndex; i<toIndex; i++)
        a[i] = val;
}
The above snippet appears in the source code 8 times, with very little variation in the documentation/method signature but exactly the same method body, one for each of the root array types int[], short[], char[], byte[], boolean[], double[], float[], and Object[].
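The int[] variant, for instance, reads essentially as follows (Javadoc elided):
public static void fill(int[] a, int fromIndex, int toIndex, int val) {
    rangeCheck(a.length, fromIndex, toIndex);
    for (int i=fromIndex; i<toIndex; i++)
        a[i] = val;
}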
I believe that unless one resorts to reflection (which is an entirely different subject in itself), this repetition is inevitable. I understand that such a high concentration of repetitive Java code is atypical of anything but a utility class, but even with best practices, repetition does happen! Refactoring doesn't always work, because it's not always possible (the obvious case being when the repetition is in the documentation).
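Just to illustrate the reflective escape hatch: a single method can be written against java.lang.reflect.Array (a minimal sketch; ReflectiveFill is a hypothetical name, and every store pays the cost of a reflective call plus boxing):
import java.lang.reflect.Array;

public final class ReflectiveFill {
    // Fills array[fromIndex..toIndex) with val for ANY array type,
    // primitive or reference; Array.set unboxes val as needed.
    public static void fill(Object array, int fromIndex, int toIndex, Object val) {
        if (fromIndex > toIndex)
            throw new IllegalArgumentException("fromIndex > toIndex");
        if (fromIndex < 0 || toIndex > Array.getLength(array))
            throw new ArrayIndexOutOfBoundsException();
        for (int i = fromIndex; i < toIndex; i++)
            Array.set(array, i, val);   // one reflective call per element
    }
}
Usage would be ReflectiveFill.fill(new long[10], 0, 10, 42L); it works, but the per-element reflection cost is exactly why the JDK spells the overloads out by hand.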
Obviously, maintaining such source code is a nightmare. A slight typo in the documentation, or a minor bug in the implementation, is multiplied by however many repetitions were made. In fact, the best example happens to involve this exact class:
Google Research Blog - Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken (by Joshua Bloch, Software Engineer)
The bug is a surprisingly subtle one, occurring in what many thought to be just a simple and straightforward algorithm.
// int mid = (low + high) / 2;  // the bug
int mid = (low + high) >>> 1;   // the fix
The above line appears 11 times in the source code!
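To see the failure concretely (a small standalone sketch, not from the JDK):
public class MidpointOverflow {
    public static void main(String[] args) {
        int low = 1_500_000_000;
        int high = 1_600_000_000;
        // low + high overflows int, so the signed division yields a
        // negative "midpoint":
        System.out.println((low + high) / 2);    // -597483648
        // The unsigned shift reads the same overflowed bits as a value
        // in [0, 2^32), recovering the correct midpoint:
        System.out.println((low + high) >>> 1);  // 1550000000
    }
}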
So my questions are:
- How are these kinds of repetitive Java code/documentation handled in practice? How are they developed, maintained, and tested?
- Do you start with "the original", and make it as mature as possible, and then copy and paste as necessary and hope you didn't make a mistake?
- And if you did make a mistake in the original, then just fix it everywhere, unless you're comfortable with deleting the copies and repeating the whole replication process?
- And you apply this same process for the testing code as well?
- Would Java benefit from some sort of limited-use source code preprocessing for this kind of thing?
- Perhaps Sun has their own preprocessor to help write, maintain, document, and test this kind of repetitive library code?
A comment requested another example, so I pulled this one from Google Collections: com.google.common.base.Predicates lines 276-310 (AndPredicate) vs lines 312-346 (OrPredicate).
The source for these two classes is identical, except for:
- AndPredicate vs OrPredicate (each appears 5 times in its class)
- "And(" vs "Or(" (in the respective toString() methods)
- #and vs #or (in the @see Javadoc comments)
- true vs false (in apply; ! can be rewritten out of the expression; sketched below)
- -1 /* all bits on */ vs 0 /* all bits off */ (in hashCode())
- &= vs |= (in hashCode())
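The true vs false difference is just the polarity of the short-circuit in apply. Roughly (a paraphrase of the two methods, not the exact Google Collections source):
import java.util.List;

final class AndOrSketch {
    interface Predicate<T> { boolean apply(T t); }

    // AndPredicate's logic: fail fast on the first false component.
    static <T> boolean applyAnd(List<Predicate<? super T>> components, T t) {
        for (Predicate<? super T> p : components)
            if (!p.apply(t)) return false;
        return true;
    }

    // OrPredicate's logic: succeed fast on the first true component.
    static <T> boolean applyOr(List<Predicate<? super T>> components, T t) {
        for (Predicate<? super T> p : components)
            if (p.apply(t)) return true;
        return false;
    }
}
Flipping three booleans converts one method into the other, which is exactly the kind of systematic variation that copy-and-paste eventually gets wrong.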
For people that absolutely need performance, boxing and unboxing, generified collections, and whatnot are big no-nos.
The same problem happens in performance computing, where you need the same complex code to work for both float and double (say, some of the methods shown in Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" paper).
There's a reason why Trove's TIntIntHashMap runs circles around Java's HashMap<Integer,Integer> when working with a similar amount of data.
Now how are Trove collection's source code written?
By using source code instrumentation of course :)
There are several Java libraries for higher performance (much higher than the default Java ones) that use code generators to create the repeated source code.
We all know that "source code instrumentation" is evil and that code generation is crap, but still that's how people who really know what they're doing (i.e. the kind of people that write stuff like Trove) do it :)
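A minimal sketch of the template idea, assuming a hypothetical Fill.java.template file with @T@ and @Name@ placeholders (using Java 11's Files.readString/writeString for brevity; real generators such as Trove's are far more elaborate):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FillGenerator {
    private static final String[] TYPES = {
        "long", "int", "short", "char", "byte", "boolean", "double", "float"
    };

    public static void main(String[] args) throws IOException {
        String template = Files.readString(Path.of("Fill.java.template"));
        for (String t : TYPES) {
            String name = Character.toUpperCase(t.charAt(0)) + t.substring(1);
            // Plain string substitution: @T@ -> primitive, @Name@ -> Long, Int, ...
            String source = template.replace("@T@", t).replace("@Name@", name);
            Files.writeString(Path.of(name + "ArrayFill.java"), source);
        }
    }
}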
For what it is worth, we generate source code that contains big warnings like:
/*
* This .java source file has been auto-generated from the template xxxxx
*
* DO NOT MODIFY THIS FILE FOR IT SHALL GET OVERWRITTEN
*
*/
If you absolutely must duplicate code, follow the great examples you've given and group all of that code in one place where it's easy to find and fix when you have to make a change. Document the duplication and, more importantly, the reason for the duplication so that everyone who comes after you is aware of both.
From Wikipedia's Don't Repeat Yourself (DRY), also known as Duplication is Evil (DIE):
In some contexts, the effort required to enforce the DRY philosophy may be greater than the effort to maintain separate copies of the data. In some other contexts, duplicated information is immutable or kept under a control tight enough to make DRY not required.
There is probably no answer or technique to prevent problems like that.
Even fancy-pants languages like Haskell have repetitive code (see my post on Haskell and serialization).
It seems there are three approaches to this problem:
- Use reflection and lose performance
- Use preprocessing like Template Haskell, or the Camlp4 equivalent for your language, and live with the nastiness
- Or, my personal favorite, use macros if your language supports them (Scheme and Lisp)
I consider macros different from preprocessing because macros are usually written in the same language as the target, whereas preprocessing uses a different language.
I think Lisp/Scheme macros would solve many of these problems.
I get that Sun has to document like this for the Java SE library code, and maybe other third-party library writers do as well.
However, I think it is an utter waste to copy and paste documentation throughout a file like this in code that is only used in-house. I know many people will disagree because it makes their in-house Javadocs look less clean. However, the trade-off is that it makes their code cleaner, which, in my opinion, is more important.
Java primitive types screw you, especially when it comes to arrays. If you're specifically asking about code involving primitive types, then I would say just try to avoid them. The Object[] method is sufficient if you use the boxed types.
In general, you need lots of unit tests and there really isn't anything else to be done, other than resorting to reflection. Like you said, it's another subject entirely, but don't be too afraid of reflection. Write the DRYest code you can first, then profile it and determine if the reflection performance hit is really bad enough to warrant writing out and maintaining the extra code.
You could use a code generator to construct variations of the code using a template. In that case, the java source is a product of the generator and the real code is the template.
Given two code fragments that are claimed to be similar, most languages have limited facilities for constructing abstractions that unify the code fragments into a monolith. To abstract when your language can't do it, you have to step outside the language :-{
The most general "abstraction" mechanism is a full macro processor which can apply arbitrary computations to the "macro body" while instantiating it (think Post or string-rewriting system, which is Turing capable). M4 and GPM are quintessential examples. The C preprocessor isn't one of these.
If you have such a macro processor, you can construct an "abstraction" as a macro, and run the macro processor on your "abstracted" source text to produce the actual source code you compile and run.
You can also use more limited versions of the idea, often called "code generators". These are usually not Turing capable, but in many cases they work well enough. It depends on how sophisticated your "macro instantiation" needs to be. (The reason people are enamored with the C++ template mechanism is that, despite its ugliness, it is Turing capable, and so people can do truly ugly but astonishing code generation tasks with it.) Another answer here mentions Trove, which is apparently in the more limited but still very useful category.
Really general macro processors (like M4) manipulate just text; that makes them powerful, but they don't handle the structure of programming languages well, and it is really awkward to write a generator in such a macro processor that can not only produce code but optimize the generated result. Most code generators that I encounter are "plug this string into this string template" and so cannot do any optimization of a generated result. If you want generation of arbitrary code and high performance to boot, you need something that is Turing capable but understands the structure of the generated code so it can easily manipulate (e.g., optimize) it.
Such a tool is called a Program Transformation System. Such a tool parses the source text just like a compiler does, and then carries out analyses/transformations on it to achieve a desired effect. If you can put markers in the source text of your program (e.g., structured comments or annotations in languages that have them) directing the program transformation tool what to do, then you can use it to carry out such abstraction instantiation, code generation, and/or code optimization. (One poster's suggestion of hooking into the Java compiler is a variation on this idea.) Using a general-purpose transformation system (such as the DMS Software Reengineering Toolkit) means you can do this for essentially any language.
A lot of this kind of repetition can now be avoided thanks to generics. They're a godsend when writing the same code where only the types change.
Sadly though, I think generic arrays are still not very well supported. For now at least, use containers that allow you to take advantage of generics. Polymorphism is also a useful tool to reduce this kind of code duplication.
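For example, a single generic method replaces all the reference-array variants (a sketch; primitive arrays remain out of reach because T cannot be a primitive, and creating a T[] is where generic arrays bite):
final class GenericFill {
    // One method covers Object[], String[], Integer[], ...; primitive
    // arrays still need hand-written overloads since T can't be int etc.
    static <T> void fill(T[] a, int fromIndex, int toIndex, T val) {
        if (fromIndex > toIndex)
            throw new IllegalArgumentException("fromIndex > toIndex");
        if (fromIndex < 0 || toIndex > a.length)
            throw new ArrayIndexOutOfBoundsException();
        for (int i = fromIndex; i < toIndex; i++)
            a[i] = val;
    }
}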
To answer your question about how to handle code that absolutely must be duplicated: tag each instance with easily searchable comments. There are some Java preprocessors out there that add C-style macros; I think I remember NetBeans having one.