Archive for the ‘Software’ Category

Imperial units for measuring information

Monday, July 30th, 2018

I used to be a metric man, myself. Celsius, meter, kilometer, liter, kilogram. In fact, living in Transylvania, I was intimately familiar with even finer gradations. If you went to the grocery store, and the clerk was speaking Hungarian, you asked for 20 decagrams of raisins. If she was speaking Romanian, you asked for 200 grams. Fun, fun, fun.

But now, after decades of living in English-speaking countries, this continental precision and spirit of the French revolution [1] seem overly technical. Inches, yards and miles are now dear to me, and so much a part of my life that I dread the conversion math whenever I travel abroad. So, how many liters per hundred kilometers does a car that goes 34 miles per gallon consume?
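For the record, the dreaded conversion can be sketched in a few lines of Python. This assumes US gallons (the imperial gallon is larger, naturally); the function name is made up for the occasion.

```python
# Assumption: US liquid gallon and international mile.
LITERS_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

def mpg_to_liters_per_100km(mpg: float) -> float:
    """Convert miles per US gallon to liters per 100 kilometers."""
    km_per_liter = mpg * KM_PER_MILE / LITERS_PER_US_GALLON
    return 100.0 / km_per_liter

print(round(mpg_to_liters_per_100km(34), 1))  # about 6.9
```

Note that the relation is an inverse one: doubling the miles per gallon halves the liters per hundred kilometers, which is part of why it is so unpleasant to do in your head.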

Now, however, I am getting uncomfortable when dealing with the metric system for information. Bits, Bytes, Kilobytes, Kilobits, Mega, Giga, Tera!!! What sort of continental approach is this? Ok, some people argue that these should actually be multiples of 1024, not 1000, but then again, nobody really cares about this. You only care about it when you go back to Best Buy and try to get a 9% discount because your 1TB hard disk is not really 1TB. Good luck with that.
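If you do want to argue with the clerk, here is a small sketch of how far the decimal prefixes fall short of their 1024-based cousins. The function name is mine; the gap grows from a bit over 2% at kilo to about 9% at tera.

```python
def decimal_shortfall(power: int) -> float:
    """Fraction by which 1000**power falls short of 1024**power."""
    return 1 - (1000 ** power) / (1024 ** power)

# One line per prefix, printed as a percentage.
for name, power in [("kilo", 1), ("mega", 2), ("giga", 3), ("tera", 4)]:
    print(f"{name}: {decimal_shortfall(power):.2%}")
```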

So, somebody needs to develop a proper set of imperial units for the measure of information (IUMI). Some time ago my friend Srikar Rajamani and I came up with some names while waiting in traffic in Silicon Valley. Somebody had to do it, no?

Let us start from the beginning. The basic unit of information in IUMI is a bite. Coincidentally, it denotes the same amount of information as the bit in the metric system. Do not confuse it with a byte. One bite is 0.125 bytes.

Now, 7 bites make a brain. No, it does not come from this [2] but from this [3].

One brain is suitable for the representation of an ASCII character, if you don’t worry about continental letters like é or è.  
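A minimal sketch of the brain test, assuming a bite is one bit as defined above: a character fits in one brain if its code point is below 2^7. The function name is, of course, not part of any standard.

```python
def fits_in_one_brain(ch: str) -> bool:
    """True if the character's code point fits in 7 bites (bits)."""
    return ord(ch) < 2 ** 7

# Plain ASCII letters fit; the continental ones do not.
for ch in "aZé":
    print(ch, fits_in_one_brain(ch))
```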

One thousand brains make a binch, or 6.83KB. You can measure your Python file’s length in binches, like in the following sentence:

My Tensorflow script is just a binch long, but it has been running for weeks!

Sixteen binches are a bounce, or 116.21KB. It is a suitable unit for measuring very badly scanned family photos, or very long personal notes to the management.

Now, obviously, one boot is twelve binches and one bard is three beet. They can be used to measure the speed of your wireless connection (beet/second) and USB drive (bards/second).

A buart is 32 bounces or 3.71MB (as opposed to the bard which is 4.08MB). There is not that much you can use a buart for, but it is important because four buarts make a ballon. A ballon is 14.872 MB, and can be used to measure the size of your old, unreadable SD cards from 2008 or the memory of your Motorola RAZR. Use kiloballon for the post-iPhone era.

The use of the kilo prefix is necessary, because we have a large step here. One bacre is 4840 bards or 19.75GB. It comes to mind that, now that your SD cards are a lot larger, maybe you can use a bacre instead of a kiloballon. The conversion is simple: a bacre is 1.361 kiloballons, or 1 kiloballon is 0.734 bacres.

One bile is 1760 bacres. You can easily compute in your head that this is 34.754 TB, so it is a suitable measure of your hard disk size several years from now. Or when you speak about the size of big data as reflected by the Google and Facebook data centers: I can see for biles and biles. Finally, 1 nautical bile is 2025.37 bacres or exactly 40TB. It can be used when talking about the amount of data you lose when you keep your lifelog on 5 large hard disks in a RAID-0 configuration [4].






Google Scholar and the lowercase PageRank paper (also shakespeare)

Sunday, November 28th, 2010

Here is the paper which started the Google revolution, in its original form, the Stanford technical report.

Scientific Commons believes it to be:

  The PageRank Citation Ranking: Bringing Order to the Web
  Larry Page, Sergey Brin, R. Motwani, and T. Winograd.

Note the way the Google founders have first names, while the other authors are restricted to initials. Then, here is the way the same paper appears on Google Scholar (3508 citations):

  The pagerank citation ranking: Bringing order to the web
  L Page, S Brin, R Motwani...

Let me not dwell on the wisdom of citing three authors of a four-author paper and replacing the fourth one with “…”. Of course, it is a bit better than another accepted typographic convention, where the citation would be: L. Page et al. Since the general convention in American universities is that the supervising professor’s name comes last, this approach guarantees that his name will always be missing.

Now let us see the BibTeX entry generated by Google Scholar:

  title={{The pagerank citation ranking: Bringing order to the web}},
  author={Page, L. and Brin, S. and Motwani, R. and Winograd, T.},
  publisher={Technical report, Stanford Digital Library Technologies Project, 1998}

So, Google Scholar is quite sure that the capitalization is correct the way it is, that is, with “pagerank” all lowercase and “Bringing” capitalized, so it decides to protect the complete title capitalization with double braces. It also believes it to be a journal article. The words “technical report” in the publisher field do not, apparently, raise the suspicion of the parser, and neither does the lack of the journal, volume, number or pages fields.

Let us now move to the Stanford publication server. This one generates a rather generous BibTeX entry:

          number = {1999-66},
           month = {November},
          author = {Lawrence Page and Sergey Brin and Rajeev Motwani and Terry Winograd},
            note = {Previous number = SIDL-WP-1999-0120},
           title = {The PageRank Citation Ranking: Bringing Order to the Web.},
            type = {Technical Report},
       publisher = {Stanford InfoLab},
            year = {1999},
     institution = {Stanford InfoLab},
             url = {},
        abstract = {The importance of a Web page is an inherently
            subjective matter, which depends on the readers interests,
            knowledge and attitudes. But there is still much that can
            be said objectively about the relative importance of Web
            pages. This paper describes PageRank, a mathod for rating
            Web pages objectively and mechanically, effectively measuring
            the human interest and attention devoted to them. We compare
            PageRank to an idealized random Web surfer. We show how to
            efficiently compute PageRank for large numbers of pages. And, we
            show how to apply PageRank to search and to user navigation.}

Now we can finally figure out the first names of all the authors, and we learn that the year is 1999, not 1998, which might or might not be correct. Also, Larry is now Lawrence, which shows how much more serious this entry is compared to the others. Terry, however, remains Terry.

The only thing this entry gets wrong is the title: although it correctly capitalizes PageRank, it forgets to protect the capital letters, which will make many BibTeX styles print it in all lowercase: “pagerank”. Ok, there is also an extra period at the end of the title.

To my mind, the correct title line would be like this:

title = {The {P}age{R}ank Citation Ranking: Bringing Order to the Web},

but that is only me.
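For those who would rather not brace their titles by hand, here is a rough sketch of the idea in Python. It is only my own heuristic, not anything Google Scholar or the Stanford server does: it wraps the capitals of any word that has a capital letter after its first character, and leaves ordinary words alone.

```python
import re

def protect_mixed_case(title: str) -> str:
    """Brace the capitals of mixed-case words such as "PageRank"."""
    def protect(match: re.Match) -> str:
        word = match.group(0)
        # Only touch words with a capital after the first letter.
        if any(c.isupper() for c in word[1:]):
            return re.sub(r"[A-Z]", lambda m: "{" + m.group(0) + "}", word)
        return word
    return re.sub(r"[A-Za-z]+", protect, title)

print(protect_mixed_case("The PageRank Citation Ranking: Bringing Order to the Web"))
# The {P}age{R}ank Citation Ranking: Bringing Order to the Web
```

The heuristic handles ASCII letters only, and acronyms such as “HTML” would get every letter braced, which is ugly but harmless.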

Three thousand citations and the largest amount of money ever generated by an algorithm, and we still don’t know how to cite the paper exactly.

Next week, Antony and cleopatra, by Google Scholar (search for “antony and cleopatra”, and look at the first link).

User modeling (was: stupid word processors)

Monday, March 23rd, 2009

I was editing an exam in OpenOffice, and I had to make a table with headings showing resources: r1, r2, r3… As I was typing them in, OpenOffice Writer was happily capitalizing them behind me: R1, R2, R3… As this was incorrect, I had to go back and change them back to r1, r2, r3… And OO was capitalizing them again: R1, R2, R3… I had to go through some significant acrobatics to make it leave things as they were (exiting the cell downwards rather than leftwards, and weird stuff like that).

Now, two issues:

  • Apparently the OpenOffice background processor cannot figure out that a word like r1 is probably not a regular lexical word subject to capitalization. ‘Cause English words do come with numbers in them. But this is the lesser problem.
  • It seems that the OpenOffice system does not have even a minimal model of the user. It only knows about the document (BTW, Microsoft Word is just like that). Well, if you are automatically doing things to the document, like these programs do, then you probably see document editing as a joint effort, where you are trying to help the user achieve what they want. If that is what you really want, then probably the first rule is: “If you have made a change, and the user has gone back and reverted that change right away, then probably the user wants it like that, so do not change it back again”. What this means, though, is that you need a user model as well as a document model, and in this case the document model is overridden by the user model. Now, by the way, implementing this particular thing would be an afternoon’s work, if somebody wants to do it right: e.g. after I have fixed the first R1 –> r1, the system might guess that I don’t want it to mess with r2 in the next cell.
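To show that this really is an afternoon’s work, here is a minimal sketch of such a rule in Python. Everything in it is hypothetical (the class, the method names, and the idea of generalizing r1 and r2 to the same “shape”); it has nothing to do with how OpenOffice or Word are actually implemented.

```python
import re

class AutoCapitalizer:
    """Toy autocorrect that remembers which corrections the user rejected."""

    def __init__(self) -> None:
        # The "user model": shapes of words the user wants left alone.
        self.rejected_shapes: set[str] = set()

    @staticmethod
    def _shape(word: str) -> str:
        # Generalize "r1" and "r2" to the same shape, "r#".
        return re.sub(r"\d+", "#", word)

    def suggest(self, word: str) -> str:
        """Capitalize the word unless the user model says hands off."""
        if self._shape(word) in self.rejected_shapes:
            return word
        return word[:1].upper() + word[1:]

    def user_reverted(self, original_word: str) -> None:
        """Call this when the user undoes a capitalization right away."""
        self.rejected_shapes.add(self._shape(original_word))

corrector = AutoCapitalizer()
print(corrector.suggest("r1"))   # R1, the annoying default
corrector.user_reverted("r1")    # the user changes it back to r1
print(corrector.suggest("r2"))   # r2, left alone this time
```

The document model still says “capitalize”, but the user model gets the last word, which is the whole point.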

Now, I know that the OpenOffice guys have limited resources, but Microsoft???

The new wave of software?

Tuesday, December 5th, 2006

I was looking for an easy way to create blogs for a while, and I stumbled upon TiddlyWiki. I think it is safe to say that there is a completely new set of applications around which are breaking new ground compared to the classical way of working with C/C++/Java or whatever. TiddlyWiki is a single HTML (!) file, and it is programmed in HTML, CSS and JavaScript. A similar revelation was Scrapbook, a little Firefox extension which allows you to capture the webpages you are visiting, and potentially edit and comment on them. (Here editing mostly refers to cleanup, in the sense of removing all the ads, links and other @#$% which infect today’s pages; it provides a little tool called the DOM Eraser for this.) I was looking for the source for a while, and then I realized that the whole thing is implemented in JavaScript, and it was a 65KB (!) download. And inevitably, one needs to think about all the AJAX (supposedly, Asynchronous JavaScript and XML) type of interfaces which are popping up everywhere. The most obvious ones are the Google mail and mapping applications, but of course there are many others, including most of the new online mailers from Microsoft and Yahoo. I am ambivalent about all this flurry of new applications:

  • they put back the fun in hacking – they allow very small applications to be useful – this was not true for a long time.
    • BUT: they are annoyingly hackish and spend a lot of effort to do things for which clean implementations exist. Designing active user interfaces in HTML is an unqualified nightmare (and that includes tag libraries, JavaServer Faces etc). And this at a time when user interface libraries are the poster child of clean object-oriented design! Seems like a step back to me.
  • the apps are undoubtedly cool and useful. I use them all the time.
    • BUT: they won’t scale. I am not talking about the GMail kind of AJAX: obviously what you see in a single page can be handled, and everything else goes on the server side, whatever that may be. But unfortunately TiddlyWiki and Scrapbook cannot become the big generalized knowledge repositories we are all dreaming of; their architecture simply does not permit this.

Tabbed console in Windows XP

Thursday, July 6th, 2006

Guess what, I can finally do in Windows XP what I have been able to do in KDE for about five years: namely, have a tabbed command line. Which means that I can log in to multiple remote hosts and I don’t need to clutter my desktop with 100 open CMD terminals, which, by the way, have the wonderful property that they look exactly identical in the task list. And they also change their order in the Ctrl-Tab list, such that you can never remember which is which. Ok, so the miracle software is Console, and it was written by a fellow called Marko Bozikovic. Thanks Marko!

Ok, so this is not the whole thing, of course, because then you need a command line ssh client. The whole thing was that I kept waiting for PuTTY to become multitab, no? So there is a command line interface to PuTTY, called plink. I have thrown both of them in a directory in the path, and then I can type plink in the Console. Rather cool.