Bites of Bytes – Ed. 8 – Duff’s Device

A long time ago in a Ranch in San Francisco far, far away….

It is a period of great change. George Lucas has set up Lucasfilm near the North Bay, a rebel base of sorts where in his own words, creators could “shake up the status quo…of how movies were made and what they were about.” They won their first big victory with the widespread success of A New Hope getting critical acclaim for it’s groundbreaking special effects.

Research in computer graphics is just starting to take off and recognising it’s potential the CG Divison begins to assemble a crack team of top programmers. The Genesis effect was the first fully computer generated film sequence and soon the skies and stars were the limit to what could be done.

But that’s not this story (More on that here if you’re interested).

Faced with the problem of real time animation at a time when computing was still in it’s infancy, there were quite a few ingenious innovations at the time and I want to take a look at one of these…

(Coming down to earth after satisfying the geek in me :P)

Tom Duff, of the Computer Research and Development Division at Lucasfilm, came up with a trick called Duff’s Device that leveraged how the C compiler was designed to speeden up a loop.

Loop unrolling is where you try to reduce the number of iterations required to perform a task by writing extra code that would otherwise be performed in a different iteration. To illustrate this, take a look at the following example I pulled off Wikipedia.

LoopUnrolling

Anyone with any working knowledge of computer programming can see that both the above code snippets accomplish the same thing. With loop unrolling though, you trade off running fewer iterations with writing extra, perhaps messier, code. So how useful is this? Well in the normal loop, x is set to zero then the condition is evaluated i.e. x<100 which is true and then the body is executed following which x is incremented. So in effect, the condition is checked at every iteration (i.e. 100 times) along with the increment and whatever time it takes to execute the body. Whereas if you look at the second snippet, now the checking of the condition and increment operations only occur 20 times while not increasing the time it takes to perform the delete function 100 times. The saving in overhead is potentially huge and only limited by your patience to write it out. This trick was already in use at the time.

You might notice one problem though. What if there were only 97 items to delete? Or any number where the increment by 5 caused an error. That’s going to cause some serious issues right? You can’t sit and hard code everything in every time.

That was the problem faced by Duff where they had to copy some data from one location to another quickly so that their animation code would run in real time but it was too slow by about 50%. He looked to speed it up by unrolling but the issue of the leftover tail was still there. The solution he came up with could alternately be described as a moment of heavenly inspiration or a hellish botching of the C language.

The source code as written by him:

send(to, from, count)
register short *to, *from;
register count;
{
	register n=(count+7)/8;
	switch(count%8){
	case 0:	do{	*to = *from++;
	case 7:		*to = *from++;
	case 6:		*to = *from++;
	case 5:		*to = *from++;
	case 4:		*to = *from++;
	case 3:		*to = *from++;
	case 2:		*to = *from++;
	case 1:		*to = *from++;
		}while(--n>0);
	}
}

When I first saw this, my first reaction was that there was absolutely no way that this would compile. But turns out it’s completely valid C code because the implementation of the switch statement uses goto jump statements between them so the above code compiles with no errors. What is happening is that in place of the switch, the compiler checks which condition is satisfied and then uses the goto to jump to that particular label. In fact each of the cases that you enumerate followed by a colon actually ends up as a label used by the goto. This is also the reason why the cases don’t break automatically and you need to add in the break statement which is again just another goto to the finish.

Now that that’s out of the way, exactly what is happening here. So the count is taken and your counter variable ‘n’ is initialised in terms of the number of case statements you are willing to write (The reason for this will soon become clear). And then the condition is evaluated of count mod 8.

Suppose you want to copy 29 bytes of data. If you copy this 8 bytes at a time by the loop unrolling then you will perform this copy 3 times (24 bytes copied) and then be left over with a tail of 5 bytes. To solve this, Duff uses the switch to see what is 29 % 8 = 5 (length of the tail) and in the first iteration of the do while itself, the amount of data equal to the tail can be copied and you are now left with a multiple of 8!! Magic!!

In the first iteration, the switch evaluates to 5 and by the nature of the conditional, cases 5 to 1 are all executed and 5 bytes are copied. So in subsequent iterations of the do while the data can be copied 8 bytes at a time until it is successfully and correctly complete. We can also see why ‘n’ is used that way. Basically ‘n’ + 7 / 8 will give you the number of times this copy has to be performed. If count was a multiple of 8 then we would see no increase in n as (count + 7 / 8) = (count / 8) and the case would execute from case zero each time. In our example, count is 29 so n is (29 + 7 / 8) = 4.

Tracing it to the end. First 5 bytes is copied, 24 is left, n is decreased from 4 to 3, then 8 bytes are copied thrice and n becomes zero and the loop exits correctly.

If you ask me, I’m definitely of the opinion that this is genius. Now when I look back at it, it looks so simple but before you have the eureka moment when you figure it out it looks utterly incomprehensible. Duff himself doesn’t seem particularly impressed as you can see from this mail to Dennis Ritchie. His point that case statements not breaking on their own being the worst feature of C is an opinion I used to share myself until I saw this particular snippet.

How on earth did he think of this? Well, in an interview, he says that at the time the compiler for the language was around 10000 lines of code and because he had actually read the full source code he knew how to use it to his benefit. If there’s a lesson to be learnt from this, it’s that for anyone who claims to be a programmer, reading and understanding source code can be incredibly valuable and is not just something that you groan and complain about.

Just think of the ramifications of what we just saw for a second. This is such a ubiquitous problem and you can adapt this solution to suit your needs and you will see the speed-up. Granted with the advancement of processors we might not need it today but it’s fascinating nonetheless.

At the time, innovations like this were the difference between a movie sequence being disjointed and unrealistic versus smooth and believable. The success achieved by those early hits could honestly be considered the bedrock of all animated films we see today so the next time you kick back and watch Kung Fu Panda or The Incredibles you might want to raise a toast to these pioneers who put the wheels in motion that took the industry to the heights it has achieved today.

P.S. I’ve spent a long time trying to dig up exactly which movie this was first developed for. He says in an interview that this was in November 1983 which puts it too late for Return of the Jedi but it could have found it’s way into Indiana Jones and the Temple of Doom which was May 1984. If you’re in the know do share it in the comments or to me by mail and I’ll add that in.

Leave a comment