Approachably Reclusive: Episode 4 is up, so it's time for some cryptic remarks about Silence

There is a sentence in "Silence Like Diamonds" -- oh,that? How nice of you to ask. The fourth episode is up now, over at Light Reading, and if this is your first time here you can find my explanation here and Light Reading's explanation there. If you really just wanted to read a good story, go read! This will wait!

(Smiling and waving as a few folks leave the room)

RUN FOR YOUR LIFE, IT'S ABOUT TO BE MATH IN HERE!

(crouching behind podium until pandemonium subsides, addressing the few remaining cool people in the front row)

All right then.

As I was saying, there's a sentence in "Silence Like Diamonds" that draws a lot of questions from readers, so I thought today I'd talk about that. It's in Episode 3, which ran last Friday:

Ever since the Yan-Dimri fast factorization algorithm had flipped the advantage from the encryptors to the cryptanalysts, only isolated systems could be really secure (at the cost of being really useless).

I'm happy to note that in the 72 hours after that, "Yan-Dimri fast factorization" drew a few search hits on Google, so to begin with: I made that up. We call this stuff science fiction, you know?

But I'm guessing that the reason why it drew Google attention was because there are plenty of people who recognize the implications in that sentence; in fact I'm a bit surprised that more people don't.

So to begin at the beginning: since about the 16th century in the European cultural sphere, and no later than 1900 everywhere, there's been a relentless arms race between encryptors (codemakers) and cryptanalysts (codebreakers). Mathematical tools have gotten better and better on both sides, and theoretical understanding of encryption/decryption has advanced rapidly, driven first by the competing great powers in the last half of the 1800s, and then by war and cold war, and now by a huge international information market.

The balance began to swing -- apparently permanently -- toward encryption with the introduction of keyword, master key, etc. systems. I'm skipping over a vast amount of detail here, so real crypto people please forgive all the variations and interesting sidelines that I might introduce. Here's a too-brief and too-undetailed version of how we got where we are now, and how the world of "Silence Like Diamonds" is different.

Think of the code that small children used to use, where A=1, B=2, etc. Suppose we encode a block of text:

HOW ALL OCCASIONS DO INFORM AGAINST ME AND SPUR MY DULL REVENGE

For our purposes in this example, I'll use 0 for space. Encoded in that simple fashion, the message is now:

8-15-23-0-1-12-12-0-15-3-3-1-19-9-15-14-19-0-4-15-0-9-14-6-15-18-13-0-1-7-1-9-14-19-20-0-13-5-0-1-14-4-0-19-16-21-18-0-13-25-0-4-21-12-12-0-18-5-22-5-14-7-5

To read it, then, the recipient would simply replace the numbers with letters (and zeros with spaces).

Of course that cipher would be pathetically easy to break, especially with modern gear like any old spreadsheet. In fact, if you knew it was that simple a code, you'd probably just set up a list of permutations in your spreadsheet (a row where you tried A=1 B=2, a row where you tried B=0, C=1, etc) and read down a few lines till you saw an English sentence.

Yet systems like that were used down through the American Civil War; you can find Edgar Allen Poe explaining how to break that sort of code in "The Gold Bug".

Many other ideas were tried, but the road that proved most profitable was the one that leads through the Enigma machine, originally created for high-security business cables in the 1920s and then adapted by the Germans during World War II. Leaving aside all the fascinating mechanics -- and they really are worth your while if you've never read the story -- the math of it was simply that if instead of a simple substitution, you had some number of encoding rules, and a key that told you which rule to use for each successive character, then every character of the clear text would be represented by several different characters, eventually all the possible ones, in the code. For example, suppose we add the successive digits of π to the substitution above; then the new code is:

11-16-27-1-6-21-14-6-20-6-8-9-28-16-24-17-21-3-12-19-6-11-20-10-18-21-21-3-3-14-10-14-14-21-28-8-17-6-9-8-15-10-9-22-25-30-21-7-18-26-0-9-29-14-12-9-25-9-31-9-18-12-14

Of course it doesn't have to be π and it doesn't have to be addition; I could have multiplied each by the successive digits of √5 and then subtracted successive digits of √22, so that the message recipient would then decode by first adding successive digits of √22, then dividing by the successive digits of √5, and finally substituting that simple numeric code. Important thing to note right here -- you'll see why later -- in this primitive version, it takes just about as long to encode as it does to decode.

This is a considerable improvement over the simple substitution ciphers; Herbert Yardley's codebreakers, in the United States's Black Chamber around World War I, would have had to work at it for a few days, especially with message this short. Notice that the E's in "revenge" are now represented by 9, 9, and 14. The spaces are represented by 1,6,3,3,8, 9,9,7,0, and 9 successively, the double L in all is 21-14 but the double L in dull is 14-12, and so forth. With a longer sample (say 1500 characters instead of the 62 here) and a long afternoon (and ideally a spreadsheet), though, a good amateur cryptographer could crack this while barely breaking a sweat. (I'm skipping how, here; let me recommend Simon Singh's THE CODE BOOK as a good place to start a fascination with cryptography -- I wish it had been around when I first became interested, confidential information ago).

But rudimentary as my little demonstration here is, it introduces the basic tricks of modern encryption: there's a key, there's an operation performed by combining the message with the key, and there's a resulting coded message, which is hard to read without the key, but easy to read by applying the key through a defined operation. As long as the sender and the intended receiver both have the key, and the would-be interceptor doesn't, the would-be interceptor has a fairly-tough-up-to-impossible math problem to crack.

Some things make the problem harder than others. There are patterns in every language -- ETAOINSHRDLU is a famous one, that is, in a long enough passage of random English, E will be the most frequent letter, followed by T, A, etc. If there is also a pattern in the key -- say, for example, it is the digits of π, beginning over again with 3 for every message -- then in a big enough sample, the combined pattern will emerge and can be teased out, especially if something is a frequent part of a message (you might have seen The Imitation Game and remember what a difference it made when British cryptographers that realized Heil Hitler would occur frequently in the encrypted messages).

But what a complicated world is hidden in that phrase "the sender and the intended receiver both have the key, and the would-be interceptor doesn't."

A perfect, unbreakable key would simply be a list of random digits held by both sender and receiver, as long as the list was used once and then thrown away; with no pattern in the key, there'd be no breaking it. But that would require that every spy, warship, infiltrator, or whatever go out with an extremely long random number list of his/her own, in some cases smuggling it through enemy lines, and never accidentally losing it or having it confiscated. Whenever they can, modern encrypters do use those unbreakable "one time pads" ("pad" because the original WW2 version was a printed, gridded pad of sheets, on each of which you wrote your message on a top line, added the random numbers from the middle line, and sent the resulting code from the bottom line, destroying it as soon as you finished. For a really fascinating story about how that all worked in practice, see Leo Marks's BETWEEN SILK AND CYANIDE).

Suppose, though, that you can't send a one-time pad with someone. Perhaps they will be going elsewhere indefinitely, or forever. Perhaps a one-time pad is too dangerous for them to carry (since it's pretty obviously incriminating evidence). What can you do instead?

For World War II and most of the Cold War, the answer was highly unsatisfactory: you made hard-to-figure-out keys that didn't look like keys. Another example from The Imitation Game, (a great movie if you're not trying to substitute it for reading Andrew Hodge's brilliant biography Alan Turing: The Enigma): the Soviet mole, who is working inside a far-above-top-secret facility into which he dare not take any suspicious object, was using passages from the King James Bible as his key. Though it still produces some very difficult code to break manually, presumably there will be recurring patterns produced by overlaps of frequent words. For example, if GERMAN and BEGAT line up every so often (one is likely to be frequent in the message, the other in the key), and if what we're doing is adding the numbers for the two letters, subtracting 26 if necessary, and then converting to a letter, the letter combination IJYNU would show up extra frequently.

The genius of the Enigma machine was that the key wasn't written out as a key, but an initial setting for a machine that then automatically generated the key as needed; each time you pushed a key to encode a letter, three internal rotors turned by specified amounts, reorganizing the substitution of the next letter. To read the message you had to know which three (of a possible 5) rotors were in which order at which starting positions, plus how a plugboard was arranged, and since the pattern was changed every night at midnight, the Germans thought it couldn't be broken at all, given the huge number of possible combinations. But again, it was breakable because there was a pattern to the Enigma's generation of the key, and a pattern to the German messages underneath, and an enormous sample (when there's that much of a war going on, and everyone is talking to everyone else, the message volume is huge).

So the Cold War began in a rough tie between encrypters and decrypters (since the British had shared some of their codebreaking methods with us voluntarily and others with the Soviets via espionage). One-time pads were unbreakable (and still are and always will be as long as the numbers are truly random) but you had to hand off the pad between sender and receiver somehow, and the pad itself could be stolen, intercepted, lost, etc. Generated keys had patterns in them and were therefore breakable, even with enormously complex and sophisticated key-generation machines (and later, software). If a government, agency, business, or other organization changed keys often enough, and the key wasn't reused too much, and the process of generating it was not too obvious or wasn't stolen by the opposition, then most messages could be kept secret for long enough, most of the time, with the codebreakers occasionally getting ahead of the game because they got bigger, faster computers to run the breaking software on, or because the encrypters made a mistake somewhere.

And then in the 1970s some nice mathematicians came along and put the encrypters so far ahead that they've been there ever since. They figured out a variety of processes using (what they hope are) one-way functions.

A one-way function is simply a process with numbers that is easy to do in one direction but hard to do in reverse. "Easy" and "hard" might have a genuine universal meaning related to the P=NP question, but for applied engineering purposes, it's always just "relative to the capabilities of the people with the secret and the people trying to get the secret."

In particular, the encryption mathematicians discovered a process in which it was quite easy to encode using the digits of a very large number (say two thousand bits, which is about a 600-digit number in decimal) that was a semi-prime, i.e. the product of two primes, but the process only worked backwards with enormous effort, so the message was hard to decode -- unless you had the "trap door," the value of one of the two primes, in which case it was easy.

So the Chief of the Global GoodORBad Guys Org (GG|BGO) could simply broadcast the semiprime and the encryption rule, and anyone who copied it down could send him a message, with no one else being able to read it. Using one of the two prime factors that he had multiplied to create the semi-prime, the Chief could decode that same message quickly and easily. If he kept those factors secret, to break the code and read his agents' messages, someone would have to factor the public key.

Furthermore, his agents could simply mail him a public key (presumably generated on their laptops or phones) from any anonymous mailbox or email, keeping their trapdoor primes somewhere concealed (a few hundred digits is not hard to conceal, after all), and when he wanted to reply he could send it by any public channel. (Ever wonder what those numbers stations on the shortwave might be?)

And there matters have rested, because factoring is a hard operation. You can see this a little by just looking at the time it takes to factor a semi-prime by hand: 35 is a semi-prime, and its primes are 5 and 7, as anyone who remembers the multiplication table figured out in less time than it took you to read this. How about 2491? How long did it take you to find 47*53? (highlight to read) Now, if you're really a demon at factoring, try 307,961. (If you really must know, that's 547 times 563).

Here's the interesting thing, from a computing standpoint: not only is factoring a hard operation but it gets much harder (in terms of machine time taken) the bigger the number, and in modern crypto, it's just not practical at all.

So that's why RSA and its many cousins, which rely on semiprimes, are nearly as unbreakable as a 1-way pad. And a good thing too, because those methods of encryption are keeping thieves out of your bank account, your boss out of that naughty website you frequent, trolls out of taking over your i.d. and sending threatening notes to the president, and all the rest. True, if the white hats don't change the key often enough, and the black hats can keep guessing for years or centuries, eventually they might manage to factor that semiprime ... In fact a few very limited tricks are known for doing just that on a few very special classes of large semiprimes (look up Fermat's method for one). The public record so far is for factoring a 232-digit number, which took hundreds of hours of computer time across about two years, and was essentially just a very-well-thought out brute force search.

So ... since it's easy to create those really big semiprimes, and since factoring takes so long that the key will change long before anyone factors the public key and gets in, it's an encryption paradise. As long as the government doesn't make everyone give them one of the factors and leave all of them in a pile on someone's desk over the weekend we should be fine, right?

Except.

No one has actually satisfactorily shown that there can't be a fast way to factor a very large semiprime.

Except, if it's really impossible, someone ought to have proved it's impossible by now, given how big the economic incentive is. Or if it is possible, you'd think someone would have done it.

And math problems can have a really long hang time for solution.

Remember, Fermat's Last Theorem took 350 years. The four-color problem stood open for about 130. This Wolfram article on unsolved problems contains many that have been open for a century or more.

Right now, somewhere out there, mathematicians who would like to have a name in the history books are working on factoring large semi-primes, and other equally ambitious mathematicians are working on proving it's impossible. (Technically, too, someone might show it to be possible without showing how, but that idea didn't lead me to a story in whicht things blow up and people daringly do deeds of derring-do).

If the factoring side wins the race -- which they could do at any time, and it's always possible that a spy service somewhere might already have done so -- instantly,

"... there is nothing covered, that shall not be revealed; neither hid, that shall not be known. Therefore whatsoever ye have spoken in darkness shall be heard in the light; and that which ye have spoken in the ear in closets shall be proclaimed upon the housetops."

Everything. All at once. Private bank accounts and passwords, credit card accounts, corporate records, everything; the only limit will be how fast human beings can absorb and use the information.

There are other encryption methods possible, but most of them are also driven by large semiprimes.

So ... in my imagined future world, two mathematicians named Yan and Dimri came up with a fast-factorization algorithm somewhere between 2020 and 2025. It's a world where all encryption, everywhere (except for one-time pads) is broken quickly; encryption only slows corporate and political espionage down a little bit. And thus it's a world where people like Yip, Yazzy, Dusan, and Markus are never short of work or things to do; in a world where there are no good locks, there's always a job for a guard.

Do I think that world will happen?

I think no one yet knows (or at least no one yet admits they know) whether it could. So, for the moment, it's perfect for science fiction.

And if I were running a large operation's security, I might deputize a math-literate person to watch the literature on semi-prime factorization.

Anyway, enjoy the story, see you Friday with some other topic, and this time I got to talk about two favorite subjects, math and fiction, so consider yourself lucky that I only wrote this much.

Tuesday, August 4, 2015

Episode 4 is up, so it's time for some cryptic remarks about Silence