Wednesday, September 8, 2010

Tony Hoare Billion Dollar Mistake Transcript

Transcript from Tony Hoare's presentation on the danger of nulls.

Don’t know how much it’s cost the IT industry – somewhere in the order of $100m to $10bn.
How I came to invent this rather bad idea.
ALGOL-60
Reference to array with subscript
Check every subscript against the array bounds on every access to an array – historically a good idea.
Java has reproduced this decision made 30 years ago
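A minimal illustration of that inherited check in Java (the array size and index here are arbitrary):

```java
class SubscriptCheck {
    public static void main(String[] args) {
        int[] a = new int[5];
        // Every array access is bounds-checked at run time, as in ALGOL:
        // an out-of-range subscript raises ArrayIndexOutOfBoundsException
        // instead of silently reading or corrupting neighbouring memory.
        System.out.println(a[7]);
    }
}
```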

An extension of the ALGOL-60 committee designed a successor to ALGOL-60.
Hoare suggested “record handling” – the concept of an object to which a reference could be made via a pointer.
Bitter experience with machine code using pointers and indirect addressing – these could wreak ultimate havoc on a program, because if you happened to use a floating-point number or an integer as a pointer and updated the contents of whatever location it pointed to, as likely as not you would update a piece of your own code.
Took it for granted that for every variable or attribute that could take a reference as its value, the programmer must declare the type of the object that the pointer pointed to – now standard in languages like C++.
Pointer types can be checked at compile time.
Discovered only years later that this was quite an original idea – no other language checked the types of the variables at the indirect addresses pointed to by pointers.
The great thing about record handling was that if you structured your data as a set of records you would never have a subscript error – there is no need to test whether a pointer is within range, because you cannot construct a pointer that doesn’t point to something that exists and is of the expected type. That is logically impossible, and so here is a whole class of errors that can never happen again. It was a good idea to do this checking at compile time, because there was a time penalty in doing subscript checking in the ALGOL compiler. Customers were asked if they’d like the option of switching off the checking after they’d tested their programs. I once said that removing the checking from your running programs and using it only for testing is like wearing a life jacket on your practice emergency drills and taking it off as soon as your ship is really sinking. But when the customers were asked if they’d like the option of taking the life jacket off, they said no – so the option was never put in.

Still, it was a very serious mistake for Elliott, because a lot of potential customers were Fortran users rather than ALGOL users. The idea was to translate from Fortran to ALGOL. Disaster – no Fortran user would touch it: they couldn’t run any of their programs without hitting a subscript error. Users didn’t care about subscript errors – they just wanted their programs to run as they did on their own machines.
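A small sketch of that guarantee in Java terms (Person and Account are illustrative names, not from the talk): every reference carries a declared type, so the compiler rejects any assignment that would leave it pointing at an object of the wrong type.

```java
class Person { String name; }
class Account { long balance; }

class TypedReferences {
    public static void main(String[] args) {
        Person p = new Person();   // p can only ever refer to a Person (or null)
        p.name = "Tony";           // this field access needs no run-time type check
        // Account a = p;          // would not compile: incompatible types
        System.out.println(p.name);
    }
}
```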

So, everything so far seems fine. We’ve got our object-orientation with careful type-checking of all the pointers: it’s safe, and there’s none of the overhead of testing every subscript for its correctness.
And then I went and invented a null pointer.
And if you use a null pointer you either have to check every reference or you risk disaster – most people of course (like the Fortran programmers) would prefer to risk disaster, and indeed suffer disaster, rather than check all their subscripts. I didn’t know that at the time, but my friend Dijkstra, who was also working with the team developing ALGOL-60, thought the null reference was a bad idea, and he gave a very subtle and abstract reason for it. He said that if you have a null reference, then every bachelor you represent in your object structure would seem to be married – polygamously, or rather polyandrously – to the same person, whose name is “null”. I thought that was rather a nice criticism, but my real criticism is that it brings back, quite unnecessarily, all the agony of having to choose whether to run your program fast without checking or slow with checking.
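A minimal Java sketch of that choice (the variable name is illustrative): either every use of the reference is guarded by a test, or an unguarded use fails at run time.

```java
class NullRisk {
    public static void main(String[] args) {
        String spouse = null;                    // null is a legal value of every reference type
        // Either check before every use...
        if (spouse != null) {
            System.out.println(spouse.length());
        }
        // ...or risk disaster: this line throws NullPointerException.
        System.out.println(spouse.length());
    }
}
```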

Now, I did know that there was a solution to this problem. It was based on the idea of discrimination between objects belonging to a disjoint union class. The idea, which I got from pure mathematics, is that of a disjoint union: a union of two sets which have no members in common. So a language can quite reasonably have a facility for declaring a new class not as the Cartesian product of attributes drawn from other classes, but rather as a disjoint union of various other classes. For example, if you had a class of vehicles you might classify them as either buses or private cars, separately declaring different structures of attributes for those two classes: a bus has a maximum number of passengers and a private car has the capacity of its trunk/boot. Every time you accessed either of those components of the object you would have to do it within a discrimination clause which tested the object: it looked at the vehicle and said, when it is a bus you can look at its maximum passengers, and when it is a car you can look at its capacity. But the size of the source program gets quite a bit larger through having to write all these discrimination clauses, and you have to deal with both cases separately.

If we insisted on a discrimination clause, then we should make null not a value of a pointer but rather a class – a class which never had any objects in it and which had only one pointer, which obviously didn’t point to any objects. So now, whenever you wanted to have a null pointer, you would declare your pointer to be either a pointer to the null class or a pointer to the Vehicle class or the Wife class, and that would give you a way of specifying whether you wanted this pointer to be able to take a null value, or whether you wanted it to remain definitely pointing to something, in which case you would never have to test it.
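A rough sketch of both ideas in modern Java terms, assuming Java 21 (sealed interfaces, records and pattern-matching switch); Vehicle, Bus and Car follow the talk’s example, while the field names and the use of Optional are illustrative. The switch plays the role of the discrimination clause, and Optional<Vehicle> stands in for “a pointer to either the null class or the Vehicle class”.

```java
import java.util.Optional;

// Disjoint union: a Vehicle is exactly one of Bus or Car, never both.
sealed interface Vehicle permits Bus, Car {}
record Bus(int maxPassengers) implements Vehicle {}
record Car(double bootCapacityLitres) implements Vehicle {}

class Discrimination {
    // The "discrimination clause": the compiler insists every alternative is
    // handled, and each branch can only see its own alternative's attributes.
    static String describe(Vehicle v) {
        return switch (v) {
            case Bus b -> "bus carrying up to " + b.maxPassengers() + " passengers";
            case Car c -> "car with a " + c.bootCapacityLitres() + " litre boot";
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(new Bus(70)));

        // "Pointer to the null class or to the Vehicle class": absence is part
        // of the type, and the caller is forced to deal with it explicitly.
        Optional<Vehicle> parked = Optional.empty();
        System.out.println(parked.map(Discrimination::describe).orElse("no vehicle"));
    }
}
```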

Cumbersome. Lots of corner cases. What happens if you assign a new value to the pointer which you are currently analysing and assuming to be a member of the Bus class?
There is an even worse problem with initialisation. One of the things you want a high-level language to do for you is to protect you against uninitialised variables. The standard way of doing that is to assign a fixed, known value to every variable or attribute, and null is a very – well, it’s really the only – certain thing that you can assign as the value of a recently created pointer or pointer attribute. And so if you want to avoid using null, then you have to invent a whole sub-language for initialising pointers that aren’t allowed to be null. This is all right as long as all your objects are in a tree structure, because you can then start with the leaves of the tree and build up, in a hierarchical and well-ordered fashion, the value of a little tree network to be the initial value of your new variable. But if you wanted to create a cyclic structure this way there is no way of doing it: you would assign a null value to the pointer and later on insert the cyclic pointer to somewhere else in the tree. In the end I didn’t want to deal with all those problems, and that led me to suggest that a null pointer should be a possible value of every reference variable – and a possible mistake on every use of that reference variable. And perhaps it was a billion-dollar mistake.
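A minimal sketch of the cyclic-structure problem in Java; Node and next are illustrative names, not from the talk. Tree-shaped data can be built bottom-up, but the first node of a cycle has nothing yet to point at, so its link starts as null and is patched afterwards.

```java
final class Node {
    Node next;
    Node(Node next) { this.next = next; }
}

class CycleDemo {
    public static void main(String[] args) {
        Node a = new Node(null);   // b does not exist yet, so null stands in
        Node b = new Node(a);
        a.next = b;                // insert the cyclic pointer afterwards
        System.out.println(a.next == b && b.next == a);   // true
    }
}
```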

Well, the world is gradually recovering from the mistake, I’m glad to say. Modern languages like C#, Spec# and Java are introducing the idea of reference declarations and reference parameters which are declared to be non-null, together with compile-time checking which checks that such variables do not have – could not possibly have – null values. And they have had to tackle all the problems of initialisation and discrimination in a context which is now very much more complicated than I had ever thought of: all the contexts of overloading and inheritance make the concept of initialisation very much more elaborate than I would ever have dreamt of. But that is what is happening, and I think the movement must have been motivated by the realisation that null references were an expensive mistake.
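A hedged sketch of that style in Java: the language itself does not yet enforce non-null reference types at compile time (external annotation-based checkers can add that), so this sketch uses java.util.Optional to make absence explicit instead. Bachelor, name and spouse are illustrative names.

```java
import java.util.Optional;

// Intent: fields typed String are always initialised and never null, while
// possible absence is spelled out as Optional<String>, so callers must handle it.
class Bachelor {
    private final String name;                 // never null by construction
    private final Optional<String> spouse;     // absence is explicit in the type

    Bachelor(String name) {
        this.name = name;
        this.spouse = Optional.empty();        // no raw null anywhere in this class
    }

    String describe() {
        return name + spouse.map(s -> ", married to " + s).orElse(", unmarried");
    }

    public static void main(String[] args) {
        System.out.println(new Bachelor("Tony").describe());   // "Tony, unmarried"
    }
}
```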

A billion dollars... to put things in reasonable proportion... I did feel I ought to take responsibility for this mistake, but since it was unlikely I’d ever be able to pay back a billion dollars, I would keep rather quiet about it – so don’t tell anyone about it, please.

I put forward the view (which really derived from making ALGOL octal-dump proof) that a programming language designer should be responsible for the mistakes that are made by programmers using the language. And that means that programming language design is rather a serious engineering activity – not one that you give to, shall we say, programmers with nine months’ experience in machine-code programming [referring to self] – but rather one which requires a good scientific basis, good understanding, ingenuity, invention, a lot of control of detail, and a clear objective: that the programs written by people using the language would be proof against at least certain kinds of error, and be at least as easy as possible to get right, not full of little traps and syntactic awkwardnesses that people are constantly bumping their toes against.

This was the view that led me to the idea of using proof – formal verification of programs, using logical and mathematical models – as a method of conducting research into the design of good languages. I wasn’t too optimistic in 1969 about the likelihood that people out there would actually be using proofs to guarantee the correctness of their programs, at least not in the immediate future – in fact not for the next 30 years was my prediction – but by investigating the logical properties of your programming language and finding out how difficult it would be to prove correctness if you wanted to, you would get an objective measurement of how easy the language is to use. So if the proof of program correctness requires a very large number of different proof rules, and if each proof rule has a lot of side conditions – in particular, if the validity of the local application of a rule to a small bit of program depends on properties which can only be established by a scan of the program as a whole – then you know you’ve done a bad job as a language designer, and you do not need your customers to tell you that. Mind you, they don’t, because it’s actually very easy to persuade the customers of your language that everything that goes wrong is their fault, not yours. But I rejected that and thought that no, language design is a serious scientific engineering activity, and we should begin to take responsibility for the mistakes our users make when they use our languages.

It’s beginning to happen again – Java and its successors have all used avoidance of error as one of the criteria in the detailed design of new features of the language. Of course it’s only one of the criteria, and it’s not – or at least it wasn’t at the time – the most important criterion. The most important criterion is of course compatibility with everything that has gone before: you can’t throw away millions of lines of code that have been written in other languages. In every commercial product you have to make concessions to commercial and historical reality – you cannot pursue an ideal to its limit. But gradually ideas change: programmers get more interested in correctness and demonstrable correctness, and in production techniques and checkers, analytical tools, test-case generators and so on that are going to help them get their programs correct.
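For a flavour of what the proof rules mentioned above look like, here are two rules from Hoare’s 1969 “An Axiomatic Basis for Computer Programming” – the assignment axiom and the rule of composition – written in the now-standard triple notation (P, Q, R are assertions, E an expression, S₁ and S₂ program fragments):

\[
\{P[E/x]\}\;\; x := E \;\;\{P\}
\qquad\qquad
\frac{\{P\}\; S_1\; \{Q\} \qquad \{Q\}\; S_2\; \{R\}}{\{P\}\; S_1 ;\, S_2\; \{R\}}
\]

The point made in the talk is that for a well-designed construct a rule like these can be applied by looking only at the small piece of program in front of you, without side conditions that depend on the rest of the program.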

The analogy that I draw is with agricultural pollution and vehicle safety. When Ralph Nader first started publishing his book Unsafe at Any Speed it had no connection with the marketplace – customers were just not asking for reliability or safety as properties of their vehicles. But gradually, over 30 years, customer feeling about unreliable vehicles has changed, with the aid of law-making; there are legal constraints now which require basic standards of safety in every vehicle sold. And so there is a possibility that the marketplace and commercial necessity will move in the direction of greater reliability of programs and of the languages in which they’re expressed.

You know what’s driving this move towards more ideal programming languages? It’s not idealism – although I think many professional engineers do have ideals, and they do pursue them in preference to not pursuing them whenever the opportunity arises. No, the real commercial imperative which requires greater attention to the formal correctness of programs is the virus. You see, the virus/malware/worm enters your computer and does dreadful things to it. It reaches parts of your program that normal execution never reaches. So it’s no longer adequate to test your program against all the cases that are likely to arise, because even if you do, the virus will find a case that is not likely to arise. It forces you to get the whole program correct, not just the parts that are going to be used by your customers: you’ve got to get the parts that will be used by viruses correct too. And that can’t be done by testing – it has to be done by analysis of the source code, using type-checking techniques at the simplest, but increasingly sophisticated reasoning techniques are now being applied to high-volume code to check that it doesn’t contain any naughty things like null dereferences.

Well, if I’m responsible for a billion-dollar mistake, I think my reason for bringing it up is only to put even more blame elsewhere: the designers of C [shakes his finger]. Here one can definitely quantify it, because the buffer overflow is a direct result of the “gets” routine in C, which does not check the subscript bounds of the string that it is asked to input. That allowed the early, very simple viruses to get in by overwriting the return pointers in the code, and these viruses taught the world how to write malware. Without this very simple entry it’s quite possible that nobody would ever have thought to look for the more subtle things which are now being exploited every day by people who are motivated and skilled and whose profession and income it is to write botware/malware. So if it hadn’t been for the “gets” routine of C we might have had no malware. One virus, Code Red, was estimated to have cost the world economy $4bn, because it really brought down networks all over the world, with interruptions to business and banking.