
Google Scholar and the lowercase PageRank paper (also shakespeare)

Sunday, November 28th, 2010

Here is the paper which started the Google revolution, in its original form, the Stanford technical report.

Scientific Commons believes it to be:

  The PageRank Citation Ranking: Bringing Order to the Web
  Larry Page, Sergey Brin, R. Motwani, and T. Winograd.

Note the way the Google founders have first names, while the other authors are restricted to initials. Then, here is the way the same paper appears on Google Scholar (3508 citations):

  The pagerank citation ranking: Bringing order to the web
  L Page, S Brin, R Motwani...

Let me not dwell on the wisdom of citing three authors of a four-author paper and replacing the fourth one with “…”. Of course, it is a bit better than the other accepted typographic convention, where the citation would be “L. Page et al.” Since the general convention in American universities is that the supervising professor’s name comes last, this approach guarantees that his name will always be missing.

Now let us see the BibTeX entry generated by Google Scholar:

  title={{The pagerank citation ranking: Bringing order to the web}},
  author={Page, L. and Brin, S. and Motwani, R. and Winograd, T.},
  publisher={Technical report, Stanford Digital Library Technologies Project, 1998}

So, Google Scholar is quite sure that the capitalization is correct the way it is, that is, with “pagerank” all lowercase and “Bringing” capitalized, so it decides to protect the complete title capitalization with double braces. It also believes the paper to be a journal article. The words “Technical report” in the publisher field apparently do not raise the parser’s suspicion, and neither does the lack of journal, volume, number or page fields.
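For comparison, here is a rough sketch of how the same three pieces of information might be filed as a technical report instead. The citation key `page1998pagerank` is made up, and splitting the publisher string into `institution` and `year` is my own reading of it, not something Google Scholar produces. The title is left exactly as Google Scholar wrote it.

```bibtex
% Sketch only: the key "page1998pagerank" is hypothetical.
% institution and year are split out of Google Scholar's publisher string.
@techreport{page1998pagerank,
  title       = {{The pagerank citation ranking: Bringing order to the web}},
  author      = {Page, L. and Brin, S. and Motwani, R. and Winograd, T.},
  institution = {Stanford Digital Library Technologies Project},
  year        = {1998}
}
```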

Let us now move to the Stanford publication server. This one generates a rather generous BibTeX entry:

          number = {1999-66},
           month = {November},
          author = {Lawrence Page and Sergey Brin and Rajeev Motwani and Terry Winograd},
            note = {Previous number = SIDL-WP-1999-0120},
           title = {The PageRank Citation Ranking: Bringing Order to the Web.},
            type = {Technical Report},
       publisher = {Stanford InfoLab},
            year = {1999},
     institution = {Stanford InfoLab},
             url = {},
        abstract = {The importance of a Web page is an inherently
            subjective matter, which depends on the readers interests,
            knowledge and attitudes. But there is still much that can
            be said objectively about the relative importance of Web
            pages. This paper describes PageRank, a mathod for rating
            Web pages objectively and mechanically, effectively measuring
            the human interest and attention devoted to them. We compare
            PageRank to an idealized random Web surfer. We show how to
            efficiently compute PageRank for large numbers of pages. And, we
            show how to apply PageRank to search and to user navigation.}

Now we can finally figure out the first names of all the authors, and we learn that the year is 1999, not 1998, which might or might not be correct. Also, Larry is now Lawrence, which shows how much more serious this entry is compared to the others. Terry, however, remains Terry.

The only thing this entry gets wrong is the title: although it correctly capitalizes PageRank, it forgets to protect the capital letters, so any bibliography style that converts titles to sentence case will print it in all lowercase: “pagerank”. OK, there is also an extra period at the end of the title.

To my mind, the correct title line would be like this:

title = {The {P}age{R}ank Citation Ranking: Bringing Order to the Web},

but that is only me.
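Combining that title line with the fields the Stanford server supplies, a complete entry might look roughly like the sketch below. The citation key `page1999pagerank` is a made-up name, and the choice of `@techreport` is my assumption based on the `type` field. I have left out the `publisher`, `url` and `abstract` fields, which the standard BibTeX styles do not print for technical reports.

```bibtex
% Sketch only: the key "page1999pagerank" is hypothetical.
% Field values are copied from the Stanford entry; only the title is changed
% (capitals in PageRank protected with braces, trailing period removed).
@techreport{page1999pagerank,
  author      = {Lawrence Page and Sergey Brin and Rajeev Motwani and Terry Winograd},
  title       = {The {P}age{R}ank Citation Ranking: Bringing Order to the Web},
  institution = {Stanford InfoLab},
  type        = {Technical Report},
  number      = {1999-66},
  month       = nov,
  year        = {1999},
  note        = {Previous number = SIDL-WP-1999-0120}
}
```

Only the two capitals in PageRank are wrapped in braces, so a style remains free to change the case of the rest of the title.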

Three thousand citations and the largest amount of money ever generated by an algorithm, and we still don’t know how to cite the paper exactly.

Next week, Antony and cleopatra, by Google Scholar (search for “antony and cleopatra”, and look at the first link).