From Jones and Bartlett, a book on Stem Cells from Dr. Ann A. Kiessling and Scott C. Anderson:

Selected Articles:

May 5, 2004

The Mathematics of the Internet

The Internet can be viewed as a graph, and that means you can do math with it.

The Internet is connected by links that point to other web pages that have links, etc. If you look at the links as "edges" and the pages as "nodes," you can view the Internet as a giant graph. A graph can be manipulated by the rules of mathematics and that means you can do some very clever things. This article describes one of those clever things.

First, some background. We can characterize each web page by two simple numbers: the links going out from it and the links coming in to it. A page with a lot of out-links is a source page and can be a useful start for a search. A page with many in-links is a destination page, an authoritative page where the buck stops. It would be interesting to see if we could figure out a way to separate these two groups and then sort them by popularity.

A search that could retrieve this kind of information might be more targeted than the pages retrieved by a typical search engine, which can't distinguish between sources and destinations.

A first stab at an algorithm would be to do a normal search, say at AltaVista or Google, and get back a list of around a hundred pages. From this list, count the ins and outs of each page, and sort them accordingly. This is a good start, but it has the unfortunate property of listing source pages highly simply based on the number of links they have. A better algorithm would take into account the popularity of the sites being linked to. A site with a few links to popular sites should have a higher ranking than a site full of links to unpopular sites.

But what's a popular destination site? That's easy -- it's a site that's pointed to by popular source sites. If that sounds circular, you've been paying attention. And if you know anything about programming, you know that this sort of thing can either converge on a steady solution or, more likely, it can blow up and give nonsensical answers.

That's why it's so handy that the Internet can be represented as a graph. A graph can also be represented as a matrix, and with a bit of mathematical massaging (see the monograph by Jon Kleinberg of Cornell), you can show that this particular circular definition actually converges. That means we can create a well-behaved algorithm such as this:

  • Start out with equally weighted (equally popular) in- and out-links for each page.
  • Loop through the pages, setting the source weights to the sum of the out-link weights.
  • Loop through the pages, setting the destination weights to the sum of the in-link weights.
  • Normalize the new weights so they all sum to one.

That's pretty much it. The first time you do run it, not much happens. But as you run it repeatedly, the popular sites start to bubble to the top. As a destination site becomes more popular, the source sites that point to it become more popular themselves, ratcheting all of them up the list. After about 20 iterations, it will pretty much have all the sites sorted properly.

When this algorighm is implemented, it does pretty much what you would expect. A search on "automobile manufacturers," for instance, will return web pages for car makers. These are definitely destination sites, because many sites point to them as the ultimate authority. They are also notable for two things they lack. The first is out-links. Destination pages just don't link out that much, except to internal pages at their own site (which the algorithm ignores). The second thing often lacking is more surprising: the search string. This algorithm finds highly relevant sites, but unlike other search engines, the search string doesn't have to be anywhere in the page.

How does that happen? Let's go back to the scheme: first we get a starting batch of web pages from a search engine (they, of course, all should have the search string). Then we need to get the in-links and the out-links for each of these retrieved pages. It's easy to see how you would get out-links; just bring in each web page and search through it for links. But how do you get in-links? How do you know which of the billions of pages across the vast Internet point to a given site? Fortunately, there's a trick for this. Type in "link:" before the URL in most search engines and you will get back a list of sites that point to that URL. Note something quite interesting about these two techniques: neither the out-links nor the in-links need to contain the original search string.

Another astonishing thing to note: the popularity of the web pages is arrived at without regard to content. The algorithm doesn't know whether you're searching for cars or cartoons, yet it can pick out the most popular sources and destinations. Again, this is due to the abstraction of the links to a mathematical graph: the starting batch is pulled from a search engine, but after that, it's all links in and links out.

Don't take my word for it, try it yourself. I wrote up a program to demonstrate the algorithm, and it works better than I expected. I call it Buckstop. In general, it brings back sites that are more germane to my search than does a typical search engine. For instance the web sites of "automobile manufacturers" are not listed in the top 50 searches from a typical search engine, but they all show up in Buckstop.

The search is not perfect. For one thing, it takes too long. After 5 seconds, most people get antsy. This search can take up to two minutes. That's because the algorithm has to go out and read hundreds of web pages to gather its links. If you want to change how long it takes, click the advanced link above the search field. Select a smaller number (no less than 10) of initial pages for a faster, less accurate result. Or select a bigger number (up to 400) to get a much better list of destinations. It can take up to 5 seconds for each initial page, so keep that in mind as you choose.

If enough people find this search technique useful, perhaps some company like Google will add it to their tool kit. With their massive network of servers, they could execute a Buckstop search as quickly as a normal search. Click the button below to give it a shot!

For more information about Jon Kleinberg's paper, click here.

Copyright © 2000-2014 by Scott Anderson
For reprint rights, email the author:

Here are some other suggested readings about graph theory: