Imperial units for measuring information

July 30th, 2018

I used to be a metric man, myself. Celsius, meter, kilometer, liter, kilogram. In fact, living in Transylvania, I was intimately familiar with even finer gradations. If you went to the grocery store, and the clerk was speaking Hungarian, you asked for 20 decagrams of raisins. If she was speaking Romanian, you asked for 200 grams. Fun, fun, fun.

But now, after decades of living in English-speaking countries, this continental precision and spirit of the French revolution [1] seems overly technical. Inches, yards and miles are now dear to me and so much a part of my life that I dread the conversion math whenever I travel abroad. So, how many liters per hundred kilometers does a car consume if it goes 34 miles per gallon?

Now, however, I am getting uncomfortable when dealing with the metric system for information. Bits, Bytes, Kilobytes, Kilobits, Mega, Giga, Tera!!! What sort of continental approach is this? Ok, some people argue that these should actually be multiples of 1024, not 1000, but then again, nobody really cares about this. One only cares about it when you go back to Best Buy and try to get a 2.4% discount because your 1TB hard disk is not really 1TB. Good luck with that.

So, somebody needs to develop a proper set of imperial units for measuring information (IUMI). Some time ago my friend Srikar Rajamani and I came up with some names while waiting in traffic in Silicon Valley. Somebody had to do it, no?

Let us start from the beginning. The basic unit of information in IUMI is a bite. Coincidentally, it denotes the same amount of information as the bit in the metric system. Do not confuse it with a byte. One bite is 0.125 bytes.

Now, 7 bites make a brain. No, it is not coming from this [2] but from this [3].

One brain is suitable for the representation of an ASCII character, if you don’t worry about continental letters like é or è.  

One thousand brains make a binch, or 6.83KB. You can measure your Python file’s length in binches, like in the following sentence:

My Tensorflow script is just a binch long, but it has been running for weeks!

Sixteen binches are a bounce, or 116.21KB. It is suitable for measuring very badly scanned family photos, or very long personal notes to the management.

Now, obviously, one boot is twelve binches and one bard is three beet. They can be used to measure the speed of your wireless connection (beet/second) and USB drive (bards/second).

A buart is 32 bounces or 3.71MB (as opposed to the bard which is 4.08MB). There is not that much you can use a buart for, but it is important because four buarts make a ballon. A ballon is 14.872 MB, and can be used to measure the size of your old, unreadable SD cards from 2008 or the memory of your Motorola RAZR. Use kiloballon for the post-iPhone era.

The use of the kilo prefix is necessary, because we have a large step here. One bacre is 4840 bards or 19.75GB. It comes to mind that, now that your SD cards are a lot larger, maybe you can use a bacre instead of a kiloballon. The conversion is simple: a bacre is 1.361 kiloballons, or 1 kiloballon is 0.734 bacres.

One bile is 1760 bacres. You can easily compute in your head that this is 34.754 TB, so it is a suitable measure of your hard disk size several years from now. Or when you speak about the size of big data as reflected by the Google and Facebook data centers: I can see for biles and biles. Finally, 1 nautical bile is 2025.37 bacres or exactly 40TB. It can be used when talking about the amount of data you lose when you keep your lifelog on 5 large hard disks in a RAID-0 configuration [4].
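As a sketch, the whole IUMI system fits in a few lines of code. This is just my own illustration (the dictionary and function names are mine, not part of IUMI); it encodes the unit ratios exactly as stated above, so where the rounded KB/MB figures in the text disagree slightly with the exact chain, the code sides with the ratios.

```python
# Sizes of IUMI units, stored in bites (one bite = one metric bit).
BITES_PER = {
    "bite": 1,
    "brain": 7,                        # 7 bites make a brain
    "binch": 7 * 1000,                 # 1000 brains make a binch
    "bounce": 7 * 1000 * 16,           # 16 binches are a bounce
    "buart": 7 * 1000 * 16 * 32,       # a buart is 32 bounces
    "ballon": 7 * 1000 * 16 * 32 * 4,  # four buarts make a ballon
}
# The length-like chain: a boot is 12 binches, a bard is 3 beet
# (the plural of boot), a bacre is 4840 bards, a bile is 1760 bacres.
BITES_PER["boot"] = 12 * BITES_PER["binch"]
BITES_PER["bard"] = 3 * BITES_PER["boot"]
BITES_PER["bacre"] = 4840 * BITES_PER["bard"]
BITES_PER["bile"] = 1760 * BITES_PER["bacre"]

def convert(amount: float, src: str, dst: str) -> float:
    """Convert between IUMI units via their size in bites."""
    return amount * BITES_PER[src] / BITES_PER[dst]
```

For instance, `convert(1, "bile", "bacre")` returns 1760.0, and `convert(1, "bard", "boot")` returns 3.0, so you can finally price your hard disk in biles without mental arithmetic.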


A modest proposal to change the notation of Boolean algebra

October 8th, 2016

It is always fun to explain to people what “and” and “or” mean in Boolean algebra. To explain how cool it is that they don’t mean the same thing as in English. To pretend that while their meaning in English is unclear (it is not), in Boolean algebra they are well defined. To imply that the world would be a better place, if only people used “and” and “or” in their daily lives with the Boolean algebra semantics.

Well, ok. Maybe we can make a proposal to change English to suit Boolean algebra. Or here is a more modest proposal: let us change the Boolean algebra notation to better match English:

A or B ---> A and/or B
A and B ---> both A and B
A xor B ---> A or B
A -> B ---> if A then surely B, (but it can also be B if not A)
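The table above can be sketched as a handful of toy functions. This is my own illustration (the function names are mine, chosen to mirror the English-matching names in the table):

```python
def and_or(a: bool, b: bool) -> bool:
    """English 'A and/or B': the old Boolean 'or'."""
    return a or b

def both_and(a: bool, b: bool) -> bool:
    """English 'both A and B': the old Boolean 'and'."""
    return a and b

def plain_or(a: bool, b: bool) -> bool:
    """English exclusive 'A or B': the old Boolean 'xor'."""
    return a != b

def if_then_surely(a: bool, b: bool) -> bool:
    """'if A then surely B (but it can also be B if not A)':
    the old Boolean implication, not(A) or B."""
    return (not a) or b
```

Note how `plain_or(True, True)` is False, exactly as in “you can have soup or salad”.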

Coincidence? I think not. Admissible heuristics in A* search and human cognitive biases

October 7th, 2016

I always wondered whether anybody had made this parallel before. I am sure that some people have made it, but I couldn’t find anything on the web, so I might as well write it up.

Part 1: A* search (for non-technical people)

A* is a search algorithm used in artificial intelligence and robotics. It is a way to search for solutions to a problem. One can, of course, find solutions by randomly trying out stuff (1) or by methodically trying out everything (2). What A* does is use some knowledge about how close we are to a solution – this is called a heuristic. Basically, imagine that the heuristic is playing a hot-cold game: as you search, it tells you “freezing”, “cold”, “getting warmer”, “hot!”.

Now, of course, if we genuinely knew the exact distance to the solution, we wouldn’t even need to search; we could just walk there directly. So the heuristic is normally just an approximation. We would assume that the closer the heuristic is to reality, the better for the search, but it turns out that things are more bizarre than that. It is provable that the good heuristics are the ones that underestimate the distance to the solution, that is, they are optimistic (3). These kinds of heuristics will tell you “warm” when it is merely “cold”, and “hot!” when it is merely “warm”. Even a heuristic which always yells “hot!” (4) is still better (5) than one that approximates better, but from the pessimistic side. Note that this is a formally provable result.

How do we create such heuristics? Most of the time what we do is take the original problem and (a) ignore some of its difficulties, such as assuming that there are no traffic jams (6), or (b) attribute superpowers to ourselves.
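For the technical readers, here is a minimal sketch of A* on a grid, assuming 4-connected moves of unit cost (my own toy illustration, not from any particular library). The Manhattan-distance heuristic is admissible precisely because it relaxes the problem: it ignores the walls, the way one might assume there are no traffic jams.

```python
import heapq

def manhattan(a, b):
    # Optimistic estimate: straight-line grid distance, ignoring walls.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def astar(start, goal, walls, width, height):
    """Return the cost of the cheapest path from start to goal,
    or None if the goal is unreachable."""
    frontier = [(manhattan(start, goal), 0, start)]
    best_cost = {start: 0}
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:
            # With an admissible heuristic, the first pop of the goal
            # is guaranteed to carry the optimal cost.
            return cost
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            if nxt in walls:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                # priority = cost so far + optimistic estimate of the rest:
                # the heuristic playing the hot-cold game.
                heapq.heappush(
                    frontier,
                    (new_cost + manhattan(nxt, goal), new_cost, nxt))
    return None
```

Replacing `manhattan` with a heuristic that always returns 0 (footnote 4) still finds the optimal path, just more slowly; replacing it with an overestimating, pessimistic one can make the search miss the best solution.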

Part 2: Some cognitive biases

Ok, here I will need to rely mostly on our good friend Wikipedia. Basically, a cognitive bias is a human reasoning pattern which psychologists believe to be “irrational” or “illogical”. Here are some examples:

  • The planning fallacy, first proposed by Daniel Kahneman and Amos Tversky in 1979, is a phenomenon in which predictions about how much time will be needed to complete a future task display an optimism bias (underestimate the time needed).
  • The optimism bias (also known as unrealistic or comparative optimism) is a cognitive bias that causes a person to believe that they are less at risk of experiencing a negative event compared to others.
  • The illusion of control is the tendency for people to overestimate their ability to control events; for example, it occurs when someone feels a sense of control over outcomes that they demonstrably do not influence.
  • Illusory superiority is a cognitive bias whereby individuals overestimate their own qualities and abilities, relative to others. This is evident in a variety of areas including performance on tasks or tests, and the possession of desirable characteristics or personality traits.


So, is this a coincidence or not? Well, it hinges on whether the human problem-solving style is anything similar to A* search. We are certainly very bad at systematically searching for something, we are bad at backtracking, and everybody loves the hot-cold game.

(1) stochastic search
(2) uniform cost search, for instance
(3) admissible heuristics
(4) h(x) = 0
(5) what “better” means in this context is a bit more complicated. Let us say that if the heuristic is pessimistic, you will probably not find the best solution.
(6) Problem relaxation

The fallacy of accidental knowledge in AI

August 17th, 2014

Ok, so I want to propose a new fallacy in the way people judge artificially intelligent agents: the fallacy of accidental knowledge. This fallacy is basically about misjudging the nature of knowledge: assuming some kind of knowledge to be fundamental to cognition when, in reality, it is just learned knowledge, acquired through the accidents of the autobiography of a human.

This fallacy is an error in evaluating the strengths and weaknesses of an AI. It happens when an AI system models a domain which is too familiar to the human who is evaluating it. The AI makes a mistake easily detectable by the human. The human judge then draws general conclusions about the ways in which AI systems in general work, which usually include statements about how AIs will never learn to perform commonsense reasoning.

The fallacy here is based on the fact that much of the commonsense knowledge used by humans has been acquired in anecdotal form, from real-world situations amounting to anecdotes.

The mistake made by the AI means only that it had not yet been presented with the appropriate anecdotes; it does not say anything about its reasoning powers. The problem with the fallacy of accidental knowledge is that it pushes AI developers to look for deep, systemic solutions, instead of simply providing the AI with the missing anecdotal knowledge.

My recent personal experience with Xapagy: the paper presented at AGI-14 has several examples of reasoning about the outcome of the fight between Achilles and Hector, based on the agent’s experience with previous fights it witnessed. And indeed, the agent predicts that Achilles will kill Hector.

Ok, so at this point I was wondering what Achilles would do next. So I decided to run the continuations beyond the death of Hector. Well, the next event predicted by the agent was that Hector would strike Achilles with his sword.

Stupid system! Didn’t it say, just in the previous sentence, that Hector is killed? Well, yes, but with the given autobiography, the agent had no way to know that dead people don’t continue to fight. This is not a trivial thing: children take a long time to learn what death properly means, and it is not quite clear what personal experiences are sufficient for correct inference in this case.


Present-shock in science fiction

August 7th, 2013

I am reading Kim Stanley Robinson’s newest novel 2312. Now, this book counts as serious social speculation about a potential future of humanity. It is supposed to be hard science fiction about a future in which, as far as I can tell, technological progress has continued unabated for three hundred years. There is a rolling city on Mercury called Terminator which rolls on rails to keep out of the direct sunlight, there is the finished terraforming of Mars, the ongoing terraforming of Venus, etc.

But then:

  • The main characters are at this moment walking for 30 days, about 1000 km, in the service tunnels of the circular track. The service stations seem to be about 90km away from each other. The builders provided supplies of food, lighting, air etc. They did not think about some kind of transportation: I think using bicycles would shorten that trip time.
  • The tunnel stations are unmarked: the walkers do not seem to know where they are, unless they go up to the surface and check.
  • The Mercurian rail on which the city moves is constructed at the 45th parallel, making it 3200km long, instead of at the poles, where it would need a much shorter track. The track is a single uninterrupted circle; there are no alternative routes.
  • A single point of failure can bring down the communication infrastructure. They seem to have wall-mounted telephones in the tunnels, which go completely silent, because a short section of the 3000km rail had been bombarded.
  • The sun-walkers communicate via walkie-talkies, in what is apparently a common analog band. Mercury does not have cell phone coverage.
  • People sign up for dishwasher duties at restaurants. Apparently in 300 years we still won’t have manipulators which can put in and take out dishes from the dishwasher.

Now, this is not unusual in science fiction. We have, notoriously, Connie Willis, in whose novels people in 2058 communicate by sending personal messengers to find each other. But even in highly tech-competent Charlie Stross’s latest novel, personal communication has such high importance that robots 7000 years in the future visit the travel agent in person (something I haven’t done in 20 years) and surgically change themselves to become a mermaid in order to have a brief sister-to-sister chat.

So my conclusion is that science fiction has not even been able to fully digest our current stage of technology, let alone that of the future. Never mind future-shock – we have a present-shock.

Of course, some of the reason may be that technology somehow breaks the narrative tropes we so love. What kind of story would it make when Little Red Riding Hood calls 911 after meeting the wolf, Grandma has an elderly person’s panic button, the roads in the forest have security cameras, and Hansel and Gretel get home regularly thanks to Google Maps?


New Scientist article about Xapagy

December 11th, 2012

New Scientist has published a short article about Xapagy, focusing mostly on the story generation aspect.

It is a good article for general reading, and I am quite comfortable with it. It was based on last year’s crop of tech reports I had uploaded to arXiv. Since then, the Xapagy work has been more focused on the representation of tricky sentences and story segments, like “If Clinton was the Titanic, the iceberg would have sunk” and the like.

No, I still don’t have synthetic autobiographies of sufficient size to start doing really interesting stuff – like creating whole stories from scratch. But slowly, slowly, it is getting to the point that one can translate almost anything to Xapi.

Google Scholar and the lowercase PageRank paper (also shakespeare)

November 28th, 2010

Here is the paper which started the Google revolution, in its original form, the Stanford technical report.

Scientific Commons believes it to be:

  The PageRank Citation Ranking: Bringing Order to the Web
  Larry Page, Sergey Brin, R. Motwani, and T. Winograd.

Note the way the Google founders have first names, while the other authors are restricted to initials. Then, here is the way the same paper appears on Google Scholar (3508 citations):

  The pagerank citation ranking: Bringing order to the web
  L Page, S Brin, R Motwani...

Let me not dwell on the wisdom of citing three authors of a four-author paper and replacing the fourth one with …. Of course, it is a bit better than another accepted typographic convention where the citation would be: L. Page et al. As in American universities the general convention is that the supervising professor’s name comes last, this approach guarantees that his name will always be missing.

Now let us see the BibTeX entry generated by Google Scholar:

  title={{The pagerank citation ranking: Bringing order to the web}},
  author={Page, L. and Brin, S. and Motwani, R. and Winograd, T.},
  publisher={Technical report, Stanford Digital Library Technologies Project, 1998}

So, Google Scholar is quite sure that the capitalization is correct the way it is, that is, with “pagerank” all lowercase and “Bringing” in uppercase, so it decides to protect the complete title capitalization with double braces. It also believes it to be a journal article. The words “Technical report” in the publisher field do not, apparently, raise the suspicion of the parser, and neither does the lack of journal, volume, number or page fields.

Let us now move to the Stanford publication server. This one generates a rather generous BibTeX entry:

          number = {1999-66},
           month = {November},
          author = {Lawrence Page and Sergey Brin and Rajeev Motwani and Terry Winograd},
            note = {Previous number = SIDL-WP-1999-0120},
           title = {The PageRank Citation Ranking: Bringing Order to the Web.},
            type = {Technical Report},
       publisher = {Stanford InfoLab},
            year = {1999},
     institution = {Stanford InfoLab},
             url = {},
        abstract = {The importance of a Web page is an inherently
            subjective matter, which depends on the readers interests,
            knowledge and attitudes. But there is still much that can
            be said objectively about the relative importance of Web
            pages. This paper describes PageRank, a method for rating
            Web pages objectively and mechanically, effectively measuring
            the human interest and attention devoted to them. We compare
            PageRank to an idealized random Web surfer. We show how to
            efficiently compute PageRank for large numbers of pages. And, we
            show how to apply PageRank to search and to user navigation.}

Now, we can finally figure out the first names of all the authors, and we learn the fact that the year is 1999, not 1998, which might or might not be correct. Also Larry is now Lawrence, which shows how much more serious this entry is compared to the others. Terry, however, remains Terry.

The only thing this entry gets wrong is the title: although it correctly capitalizes PageRank, it forgets to protect the capital letters, which will make LaTeX cite it in all lowercase: “pagerank”. Ok, there is also an extra period at the end of the title.

To my mind, the correct title line would be like this:

title = {The {P}age{R}ank Citation Ranking: Bringing Order to the Web},

but that is only me.

Three thousand citations and the largest amount of money ever generated by an algorithm, and we still don’t know how to cite the paper exactly.

Next week, Antony and cleopatra, by Google Scholar (search for “antony and cleopatra”, and look at the first link).

Invest in Dittmer

April 22nd, 2010

I was in Silicon Valley for a day this summer, and boy oh boy, how the names have changed. The company where I worked in 2001-2002 (CPlane), of course, is nowhere, and of course the company that almost bought us is also nowhere. Our major target customers are all bankrupt, and our company investor Sun, well,… was bankr… acqui…. merged. The AT&T Research labs, where I worked in 1998 and 1999, are long-time history.

So what remained? Here are the most solid institutions of Silicon Valley.

Dittmer’s Gourmet Meats & Wurst-Haus, Inc


The Milk Pail Market

Sic transit gloria mundi  — good thing we can always fall back on cheeses and sausages.

The importance of semantics in natural language understanding

April 13th, 2010

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

Q: What was the time of the day?

A: It was brillig.

Q: What were the toves doing?

A: They were gyring and gimbling.

Q: Were the borogoves slithy?

A: I don’t know. They were certainly mimsy.

Take this, Cyc.

Pick your science fiction idea here: Simulation

August 9th, 2009

Some notes I had written previously about William Gibson’s book Idoru: how come that in so many books and, especially, movies, people assume that the computers of the future will have three-dimensional interfaces which we will try to manipulate the way we currently manipulate our physical environment?

As it happens, every time we try to implement a three dimensional interface, we fail in a most miserable way. At the same time, our user interfaces have standardized on the overlapping windows, menus, buttons approach – and this will not change in the foreseeable future.

Idea for science fiction authors: we are a simulation on somebody’s computer. Our attempts to build computers are just an incremental attempt to simulate the computer on which we ourselves are simulated. The fact that we are converging towards a windowing system only shows that our underlying OS is also windowing-based. We are simulated in a future version of Windows! Bugs introduced today might still be present in the future version. Apocalyptic scenarios involving time travel and applying patches to operating systems in the future which simulate their own past ensue.

For the film version, this idea can be developed with an appropriate amount of romantic complications, car chases, expensive computer graphics, etc.