How to write Java code that doesn't suck

2017-09-04 02:42 by Ian

It only takes one simple piece of knowledge to write Java code that has a runtime heap usage more in line with that of a comparable C program. I am writing this post as a public service to teach your engineers how to remove Java from their Java code.

IMO, Java's worst design decision was making String an "immutable" type by the means that they did (deep-copy upon passage as a parameter). I put quotes around immutable, because when received as an argument to a function, the String is perfectly happy to be mutated, and passed further down the stack frame, being copied into a fresh allocation that must then experience a garbage collection cycle (sometimes several).

All of these choices are made by the JVM and driven by the Java VM's CPU specification. Nothing the Java programmer can do will stop String from being treated in this fashion.
Non-string objects, however, are treated in the same way as object pointers in C++ (without the notation * to denote so). So we have to use Java in an un-Java way.

Make a wrapper object for String, and pass that around. You wouldn't need to implement the "length" member of such a shim object in Java, since Java has its own way of handling .length(). It should be the Java equivalent of this:

class StrObj{
  public:
    String  str;   // The string.
};
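
For reference, a minimal Java sketch of such a wrapper might look like the following. The names StrObjDemo and appendSuffix are illustrative assumptions, not part of any library. The sketch only shows that caller and callee share the same wrapper object, so a field replaced inside the callee is visible to the caller. It does not measure heap usage.

```java
class StrObjDemo {
    // The callee receives a reference to the same StrObj the caller holds,
    // so replacing the field here is visible to the caller afterwards.
    static void appendSuffix(StrObj s, String suffix) {
        s.str = s.str + suffix;
    }

    public static void main(String[] args) {
        StrObj name = new StrObj("report");
        appendSuffix(name, ".txt");
        System.out.println(name.str);
    }
}

// Hypothetical wrapper: a plain holder object for a String.
final class StrObj {
    String str;   // The string.

    StrObj(String str) {
        this.str = str;
    }
}
```

Internal APIs would then take StrObj where they previously took String.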

Possible indications that this might be a worthwhile thing to do to your Java codebase:

If any of these apply to you, tell your team that on review, all internal APIs that were using String as an argument should use something like the StrObj type above instead (which will be passed as a pointer, thus avoiding the JVM's implicit deep-copy of the String content). But remember: Java devs don't usually know how to manage RAM at this level. So expect your team to have trouble with this, and not understand why you are blacklisting String use in function calls. Depending on the codebase, making that sort of a conversion looks like a waste of effort, simplicity, and intelligibility for no obvious gain.

But if they make the conversion, much of their memory bloat will be gone. Thread latency spikes in both the JVM and its host OS will also be reduced, since the garbage collector isn't being invoked on hundreds of megabytes of redundant string data.
Everyone will be amazed and demand that you explain how you knew.

Strings

This is basic computer science stuff, but you might be shocked to know how many people out there are writing code every day without knowing the basics of strings (and/or their language).

Consider that strings are not simply types. No matter how you represent them, strings are always data structures. As far as I am aware, there are really only two basic ways to implement a string's data structure:

Null-termination (AKA \0, C-style)
  A string is... a sequence of bytes whose value is not zero, terminated by (and including) a zero.
  Exemplified by: C/C++

Pointer-Length
  struct String {
    byte* str;
    int32 len;
  };
  A string is... a sequence of a specified number of bytes.
  Exemplified by: Golang

NOTE: For our purposes here, ptr-len strings also include len-data style strings (i.e., the same basic data structure, but with the member order reversed, and with no pointer indirection). All that is important is that it is an explicitly composite type with length as a fixed-size integer.

C-style strings may be the simplest possible implementation of a String. They only require a single extra byte to implement (the null-terminator), and the software architecture for handling them is so close to the hardware that traversal loops can be as small as 4 bytes of machine code.

From a purely linguistic standpoint, \0 forces you into a goofball definition of what constitutes a String. Brittle definitions lead to brittle code, and practically, \0 is a nightmare to cope with.

It means you must over-allocate by one byte to copy the literal content, and leads directly to the unique flavor of brain-damage experienced by people who know what a null-terminated string is. The need to distinguish between a length versus an ending offset, and the ever-present risk of mishandling the null-terminator both invite off-by-one errors from several different angles.
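
To make the extra byte concrete, here is a small sketch that mimics a C-style buffer with a Java byte array. The names NulCopyDemo and toNulTerminated are assumptions made up for illustration, and the text is assumed to be plain ASCII. The point is only that the buffer size, the content length, and the offset of the terminator are three different numbers that are easy to mix up.

```java
class NulCopyDemo {
    // Builds a C-style buffer: the content bytes followed by one 0 byte.
    static byte[] toNulTerminated(String text) {
        byte[] content = text.getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        // One byte more than the content, to leave room for the terminator.
        byte[] buf = new byte[content.length + 1];
        System.arraycopy(content, 0, buf, 0, content.length);
        // Java zero-fills new arrays; the write is spelled out for clarity.
        buf[content.length] = 0;
        return buf;
    }

    public static void main(String[] args) {
        String text = "hello";
        byte[] buf = toNulTerminated(text);
        System.out.println("content length    = " + text.length());
        System.out.println("buffer size       = " + buf.length);
        System.out.println("last content byte = index " + (text.length() - 1));
        System.out.println("terminator        = index " + (buf.length - 1));
    }
}
```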

Worse than that, it means that to know the length, you must traverse the entire length. This means that null-terminated strings cannot be made constant-time. And because you may not find the \0 before you run into other memory segments or data, every invocation of strlen() in your program is a possible halting problem. But instead of running forever, execution generates a fault as a finite memory boundary is crossed before \0 is encountered.
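
The difference in cost can be sketched in Java with byte arrays standing in for raw memory. Everything named here (StrLenDemo, nulTerminatedLength, PtrLen) is hypothetical and exists only for this illustration. Note one deliberate difference from C: a Java array knows its own bounds, so the scan below can stop and report failure, where a real strlen() would keep reading into whatever memory follows.

```java
class StrLenDemo {
    // C-style length: walk the buffer until a 0 byte is found, O(n).
    // Returns -1 if the buffer ends before any terminator is seen.
    static int nulTerminatedLength(byte[] buf) {
        for (int i = 0; i < buf.length; i++) {
            if (buf[i] == 0) {
                return i;
            }
        }
        return -1;
    }

    // Hypothetical pointer-length string: the length travels with the data.
    static final class PtrLen {
        final byte[] str;
        final int len;

        PtrLen(byte[] str, int len) {
            this.str = str;
            this.len = len;
        }

        // Constant time: no traversal, no terminator to find.
        int length() {
            return len;
        }
    }

    public static void main(String[] args) {
        byte[] terminated = {'a', 'b', 'c', 0};
        byte[] unterminated = {'a', 'b', 'c'};

        System.out.println(nulTerminatedLength(terminated));
        System.out.println(nulTerminatedLength(unterminated));
        System.out.println(new PtrLen(unterminated, 3).length());
    }
}
```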

Null-terminated strings are a safety risk for those reasons. And good C/C++ programmers implicitly hate their language's use of null-terminated strings. A ptr-len style string type is one of the earliest commits in a codebase that does lots of dynamic string handling (if it doesn't import something for the same purpose).

Golang has the sanest treatment of strings I've ever seen in a language. Go set out with the same design goal as Java: Immutable strings that are pass-by-value in function calls, but did it in a much cleaner way by opting for a pointer-length arrangement (in that structural order). This has two serious advantages versus C-style strings:
