Wednesday, January 30, 2013

Bitcoin as a Disruptive Technology

One of the most useful ways of thinking about how technological and commercial forces create change is Clayton Christensen's notion of a Disruptive Technology. According to Christensen, new technologies generally start by serving a niche market that is currently under-served by the incumbent technology (and the companies that sell it). The incumbents ignore the new technology because they can't see how to make money from it: it doesn't fit their big profitable customers, and the niche market is too small to be interesting. Meanwhile the new technology matures and improves to the point where it can serve those big customers, and then suddenly the incumbent technology (and the companies that sell it) is obsolete.

Like any model of a complex situation this doesn't cover every angle, but it's interesting to look at Bitcoin from this perspective. Christensen's approach in his book is to look at the problem from the point of view of a manager in an incumbent company (who sees a disruptive technology as a threat) or in a start-up (who sees it as an opportunity). I'm simply going to look at the major market forces, in particular the niche markets that Bitcoin might serve better than the incumbents, and the possible paths out from those niches.

The Underworld

An obvious group of people who are served poorly by the incumbent financial system are criminals. Unlike most markets this is a matter of deliberate design. Over the past century governments have systematically regulated all forms of financial transactions in ways that make it difficult to hide ill-gotten gains, or for that matter legitimately-gotten gains that might incur taxes. Banks are legally required to know who their customers are and report any transactions that seem suspicious, or merely large. For people who move millions of dollars around there are methods to avoid such scrutiny, but these are themselves expensive; the necessary accountants and lawyers don't come cheap. Hence there is a significant group of people who have a pressing need to avoid the scrutiny that comes when you move tens of thousands of dollars around the world, but who can't afford the infrastructure used by those who move millions.

Traditionally these people have used cash, but that is hard to do across national borders because you have to physically move the stuff, making it vulnerable to interception by the authorities or thieves. So Bitcoin, with its ability to slide across national borders without trace or cost, is very attractive.

The Grey Market

A related group are those who deal in stuff that is legal in some jurisdictions but not in others. Porn and gambling are two major businesses here. Certain drugs also fit into this category, but you can't move the product over wires, so it is vulnerable to conventional methods of interdiction (although that doesn't stop everyone).

Governments trying to control porn and gambling have generally followed the money. This tends not to work well with porn because there is too much available for free. But gambling needs to move money to work, and so authorities in several countries have attacked it at this point. Hence Bitcoin is very attractive in this market as well; punters can easily convert their local currency into Bitcoins, and if they manage to win something then they can convert their winnings back almost as easily.

The Unbanked

This is a catch-all term for people who have no access to conventional bank accounts, and hence have to deal in cash or barter.

Financial solutions for these people have traditionally been expensive and piecemeal. Moving money long distance is done by wire transfer, with hefty charges and the need to physically pick it up. Cash is difficult to hide from brigands and corrupt officials. Credit can be available, but only at punitive interest.

Things have improved; across Africa variations on the themes of microfinance and mobile phone banking are changing the lives of millions, but these solutions are still vulnerable. Local laws can limit access, and accounts in local currency are vulnerable to inflation. A stable currency that can be easily hidden and transferred quickly over long distances could meet a real demand, although it still can't provide credit. Mobile phone credit already serves this role in some places, so something designed for the job should be readily adopted.

Actually holding Bitcoins requires rather more computing power than many third-world mobile phones can provide. But that is unlikely to be a problem for long. If MPESA can have an app, then so can Bitcoin.

Conclusion

Bitcoin looks like a classic disruptive technology: it has multiple niches in markets that are under-served by conventional money, and the grey market and the unbanked provide a ready path up to higher-value markets in places that are now reasonably well served by cash and credit cards. The black market will also provide market pull for Bitcoin uptake, but if that were the only early market niche then the mere use of Bitcoins would raise strong suspicion of illegal activity. The presence of legitimate, or at least lawful, uses of Bitcoin provides a rationale for those offering to extend conventional services to Bitcoin users and plausible deniability for those whose Bitcoins have in fact come from illegal operations.

Accordingly we should expect to see Bitcoin become a strong force in finance over the next decade.

Software has CivEng Envy

There is a school of thought which says that developing software should be like constructing a building. To make a building you have an architect draw blueprints, and these blueprints are then handed over to a builder who constructs what the architect has specified. According to this school of thought the problem with the software industry is that it doesn't create blueprints before it starts building the software. They look with envy at the world of civil engineering, where suspension bridges and tunnels and tall buildings routinely get finished on budget and schedule.

This is a superficially attractive idea; software is indeed difficult, and it would indeed be a foolish project manager on a building site who directed the builders to start laying the foundations before the plans had been finalised. But on a closer examination it starts to fall apart.

Suppose that a big software project does indeed need something analogous to a blueprint before starting on the coding. What, exactly, is a blueprint? What purpose does it serve? And where would that fit into the software lifecycle?

A blueprint for a building is a precise and complete specification of everything that will go into the building. The builder has the problem of assembling what the blueprint shows, but there is no ambiguity and no variation can be permitted. This is because buildings are safety critical infrastructure. The Hyatt Regency walkway collapse was a horrible example of what can happen when someone makes a seemingly innocuous change to the plans for a building. So before a building is constructed the plans have to be approved by a structural engineer who certifies that the building is indeed going to stay up, and by an electrical engineer who certifies that it isn't going to electrocute anyone or catch fire due to an electrical fault, and by a bunch of other engineers with less critical specialties, like air conditioning. The details matter, so the blueprints have to specify, literally, every nut and bolt, their dimensions, the metal they are made from and the torque to which they should be tightened (most of these things are standardised rather than being written down individually for every nut and bolt, but they are still part of the specification). Without this the structural engineer cannot tell whether the building is structurally sound. Similarly the electrical engineer must know about every piece of wire and electrical device. So by the time the blueprints are handed to the builder inventiveness and creativity are neither required nor allowed.

The only artefact in software development that specifies how the software will operate to this degree of precision is the source code. Once the source code has been written, it is to be executed exactly as written: inventiveness and creativity in its execution are neither required nor allowed. But those who promote the idea of "software blueprints" seem to think that something else, something more abstract, can be a blueprint, and that once these blueprints have been drawn the "construction" of the software (that is, turning these blueprints into source code) can proceed in an orderly and planned fashion, putting one line of code after the next in the manner of a bricklayer putting one brick on top of another.

But when you look at the artefacts that these people proffer, it is clear that they are nothing like precise enough to act as a blueprint; they are more like the artist's impressions, sketched floor plans and cardboard models that architects produce during the early phases of design. These artefacts can help people understand how the building will be used and how it will fit into its environment, but they are not blueprints.

(By the way, the old chestnut about unstable software requirements being like "deciding that a half-constructed house ought to have a basement" fails for the same reason: requirements documents are not blueprints either. The problem with unstable requirements is real, but the analogy is wrong.)

But if the blueprint for software is the source code, then the builder for software is the compiler. This should not be a surprise: when computer scientists encounter a task that does not require inventiveness and creativity, their first instinct is to automate it. Seen that way, it is really civil engineering that should be envious of software engineering.

Software is not like buildings in other ways too:
  • Buildings have very few moving parts and little dynamic behaviour, whereas software is all about dynamic behaviour (and buildings with dynamic behaviour often have problems).
  • Novelty in buildings is rare. I work in a three-storey steel-frame office block, on an estate of very similar three-storey steel-frame office blocks. Software, on the other hand, is almost always novel: if software to do a job is already available then it will be reused. I run Fedora Linux; I don't write my own operating system from scratch.
So please can we drop this half-baked analogy between writing software and civil engineering?

Friday, January 18, 2013

When Haskell is faster than C

Conventional wisdom says that no programming language is faster than C, and all higher level languages (such as Haskell) are doomed to be much slower because of their distance from the real machine.

TL;DR: Conventional wisdom is wrong. Nothing can beat highly micro-optimised C, but real everyday C code is not like that, and is often several times slower than the micro-optimised version would be. Meanwhile the high level of Haskell means that the compiler has lots of scope for doing micro-optimisation of its own. As a result it is quite common for everyday Haskell code to run faster than everyday C. Not always, of course, but enough to make the speed difference moot unless you actually plan on doing lots of micro-optimisation.

This view seems to be borne out by the Computer Language Benchmarks Game (aka the Shootout), which collects implementations of various programs written in different languages and measures their execution speed and memory usage. The winningest programs are always written in highly optimised C. The Haskell versions run anything from slightly to several times slower than the C versions. Furthermore, if you read the Haskell it is also highly optimised, to the point where it no longer looks like Haskell. Instead it reads like C translated into Haskell.  Based on this, you would be justified in concluding that everyday Haskell (that is, Haskell that plays to the strengths of the language in brevity, correctness and composability) must be irredeemably slow.

But then I read this presentation by Tony Albrecht, in which he talks about how some seemingly innocent everyday C++ code is actually very inefficient. Two things in particular caught my eye:
  • A fetch from the main memory when there is a cache miss costs 400 cycles. Pointer indirection in particular tends to fill up cache lines, and hence increase the number of cache misses overall.
  • A wrong branch prediction costs over 20 cycles. A branch that skips a simple calculation can actually be slower than just doing the calculation.
To put it another way, C is no longer close to the real machine. The real machine has three levels of CPU cache (some of them shared between multiple cores), long instruction pipelines, branch prediction, multiple ALUs and FPUs, and hardware data-flow analysis performed while the program is executing in order to schedule all this in a way that makes it look like a simple processor executing one instruction at a time. C doesn't expose all that to the programmer, so it seems to me that the only way to write highly optimised C is to have a sophisticated mental model of the processor and its memory architecture, decide what you want this machine to do, and then reverse-engineer the C that will make it happen. The result is difficult to understand and hence hard to maintain. Look at the C implementations of the Shootout problems for examples of what I mean.

But most code isn't written like that. Tony Albrecht is a games programmer, an expert at squeezing cycles out of the rendering loop. Most developers do not live in that world. For them the objective is to produce code that meets the requirements, which includes being fast enough. This is not laziness or waste, but practicality. First design the optimal algorithm, then implement it in idiomatic code. Only if that does not run fast enough should you start detailed profiling and micro-optimisation. Not only is the micro-optimisation process itself expensive, but it makes the code hard to understand for future maintenance.

So I wondered: the high level of Haskell gives the compiler many more opportunities for micro-optimisation than C does. Rather than comparing micro-optimised programs, it therefore seemed sensible to compare everyday programs of the sort that might be produced when readability is more important than raw speed. I wanted a problem large enough to have a useful mix of work, but small enough that I could write Haskell and C versions fairly quickly. After poking around the Shootout website I settled on the reverse-complement problem.
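For those who haven't seen it, the reverse-complement task reads DNA sequences in FASTA format from standard input and, for each ">" header, writes out the sequence reversed and complemented, wrapped at 60 characters per line. To give a flavour of what I mean by "everyday" code, here is a minimal Haskell sketch of the problem. This is only an illustration for this post, not the HaskellText program whose timings are discussed below; the structure and the complement table are just the obvious choices.

module Main where

import qualified Data.ByteString.Char8 as B
import Data.Char (toUpper)

-- Complement of a single nucleotide code; anything unrecognised maps to itself.
complement :: Char -> Char
complement c = case toUpper c of
  'A' -> 'T'; 'T' -> 'A'; 'U' -> 'A'; 'G' -> 'C'; 'C' -> 'G'
  'M' -> 'K'; 'K' -> 'M'; 'R' -> 'Y'; 'Y' -> 'R'
  'V' -> 'B'; 'B' -> 'V'; 'H' -> 'D'; 'D' -> 'H'
  other -> other

-- Reverse-complement one sequence body and wrap it at 60 columns.
revComp :: B.ByteString -> B.ByteString
revComp = B.unlines . chunks 60 . B.map complement . B.reverse . B.filter (/= '\n')
  where
    chunks n bs | B.null bs = []
                | otherwise = let (h, t) = B.splitAt n bs in h : chunks n t

main :: IO ()
main = do
  input <- B.getContents
  -- Each block after a '>' is a header line followed by the sequence body.
  mapM_ process (drop 1 (B.split '>' input))
  where
    process blk = do
      let (header, body) = B.break (== '\n') blk
      B.putStrLn (B.cons '>' header)
      B.putStr (revComp body)

The point is that this reads like a direct statement of the problem: split the input on ">", complement and reverse each body, and re-wrap the lines.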

A potential issue was that one of my programs might inadvertently use something highly non-optimal, so I decided I would profile the code and remove anything that turned out to be pointlessly slow, but not change the structure of the program or add complexity merely for the sake of speed. With that in mind I wrote Haskell and C versions. I also downloaded the Shootout winner to get some feel for how my programs compared. You can see my code at the bottom of this post.

The first version of the Haskell took 30 seconds (compared with the Shootout time of about 0.6 seconds). As I had feared, profiling did indeed reveal something pointlessly slow in it. In order to filter out carriage returns from the input I used "isLetter", but unlike the C char type the Haskell Char covers the whole of Unicode, and determining whether one of those is a letter is not trivial. I moved the filter to after the complement operation and compared the result with zero, which in addition to being faster is also the Right Thing if the input contains invalid characters. Once I had this fixed it dropped to a much more respectable 4.3 seconds. Interestingly, profiling suggests that about half the time is spent writing out the 60-character lines; merely printing the result with no line breaks cut execution down to around 2 seconds.
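To make the change concrete, here is a sketch of the two filtering strategies. The names and the toy complement function are illustrative, not the code from HaskellText: the slow version classifies every character with the Unicode-aware isLetter, while the fast version complements first, letting anything unrecognised (newlines, carriage returns) map to a zero byte, and then filters with a cheap comparison.

import qualified Data.ByteString.Char8 as B
import Data.Char (isLetter)

-- Complement a nucleotide; anything unrecognised (e.g. '\n', '\r') becomes '\0'.
complement :: Char -> Char
complement 'A' = 'T'
complement 'C' = 'G'
complement 'G' = 'C'
complement 'T' = 'A'
complement _   = '\0'

-- Slow: full Unicode classification of every input character.
cleanSlow :: B.ByteString -> B.ByteString
cleanSlow = B.map complement . B.filter isLetter

-- Fast: complement first, then drop the zero bytes with a simple comparison
-- (which also drops any invalid characters, not just line breaks).
cleanFast :: B.ByteString -> B.ByteString
cleanFast = B.filter (/= '\0') . B.map complement

Both versions remove the line breaks; the difference is how much work is done per character to decide what to drop.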

The C version, meanwhile, took 8.2 seconds. Profiling did not directly reveal the cause, but it seemed to imply that the processor was spending most of its time somewhere other than my code. I strongly suspect that this time is being spent in putc(3) and getc(3). The obvious optimisation would be to use fread(3) and fwrite(3) to read and write characters in blocks instead of one at a time, but that would require significant changes to the code: extra bookkeeping to deal with the start of the next sequence (signalled by a ">" character) when it turns up half way through a block, and similar bookkeeping for inserting newlines. Unlike the replacement of isLetter in Haskell, this would require new variables and control structures driven by the cost of the solution, rather than a simple switch to a less expensive expression.

It might be argued that I have tilted the playing field against C by not making these changes, and that any halfway competent C programmer would do so when faced with code that runs too slowly. If isLetter is pointlessly slow, isn't the same thing true of putc(3) and getc(3)? But I think there is a clear difference.  Both programs are written in a character-oriented way because the problem is described in terms of characters. I wrote the inner loop of the C to operate on a linked list of blocks because it looked like a faster and simpler choice than copying the whole string into a new buffer twice the size every time it overflowed (on average this algorithm copies each character once or twice; see Knuth for details). I might have considered reading or writing the characters in reverse order rather than doing the in-memory reverse in a separate function, but profiling didn't show that as a significant time sink.  Overall, getting decent performance out of the C is going to take about the same amount of work as writing the code in the first place.

On the other hand the Haskell has decent performance out of the box because the compiler automates a lot of the micro-optimisation that C forces you to do manually. It may not do as good a job as a human with plenty of time might do, but it does it automatically and reasonably well.

This isn't principally about code size, but for the record the Haskell has 42 SLOC, of which 21 are executable. The C has 115 SLOC, of which 63 are executable. The Shootout C winner has 70 SLOC, of which 46 are executable (counting comma separated statements as one line per statement).

So here is the bottom line. If you really need your code to run as fast as possible, and you are planning to spend half your development time doing profiling and optimisation, then go ahead and write it in C because no other language (except possibly Assembler) will give you the level of control you need. But if you aren't planning to do a significant amount of micro-optimisation then you don't need C. Haskell will give you the same performance, or better, and cut your development and maintenance costs by 75 to 90 percent as well.


Update: here are the compilation and run commands:

ghc -main-is HaskellText -O2 -fllvm -funbox-strict-fields HaskellText.hs
gcc -O2 naive-c.c -o naive

time ./HaskellText < big-fasta.txt > /dev/null
time ./naive < big-fasta.txt > /dev/null