May 5, 2004
The Mathematics of the Internet
The Internet can
be viewed as a graph, and that means you can do math with it.
By Scott C. Anderson
The Internet is connected by links that point to other
web pages that have links, etc. If you look at the links as "edges"
and the pages as "nodes," you can view the Internet as
a giant graph. A graph can be manipulated by the rules of mathematics
and that means you can do some very clever things. This article
describes one of those clever things.
First, some background. We can characterize each web
page by two simple numbers: the links going out from it and the
links coming in to it. A page with a lot of out-links is a source
page and can be a useful start for a search. A page with many in-links
is a destination page, an authoritative page where the buck stops.
It would be interesting to see if we could figure out a way to separate
these two groups and then sort them by popularity.
A search that could retrieve this kind of information
might be more targeted than the pages retrieved by a typical search
engine, which can't distinguish between sources and destinations.
A first stab at an algorithm would be to do a normal
search, say at AltaVista or Google, and get back a list of around
a hundred pages. From this list, count the ins and outs of each
page, and sort them accordingly. This is a good start, but it has
the unfortunate property of listing source pages highly simply based
on the number of links they have. A better algorithm would take
into account the popularity of the sites being linked to. A site
with a few links to popular sites should have a higher ranking than
a site full of links to unpopular sites.
But what's a popular destination site? That's easy
-- it's a site that's pointed to by popular source sites. If that
sounds circular, you've been paying attention. And if you know anything
about programming, you know that this sort of thing can either converge
on a steady solution or, more likely, it can blow up and give nonsensical
answers.
That's why it's so handy that the Internet can be
represented as a graph. A graph can also be represented as a matrix,
and with a bit of mathematical massaging (see the monograph
by Jon Kleinberg of Cornell), you can show that this particular
circular definition actually converges. That means we can create
a well-behaved algorithm such as this:
- Start out with equally weighted (equally popular) in- and out-links
for each page.
- Loop through the pages, setting the source weights to the sum
of the out-link weights.
- Loop through the pages, setting the destination weights to the
sum of the in-link weights.
- Normalize the new weights so they all sum to one.
That's pretty much it. The first time you do run it,
not much happens. But as you run it repeatedly, the popular sites
start to bubble to the top. As a destination site becomes more popular,
the source sites that point to it become more popular themselves,
ratcheting all of them up the list. After about 20 iterations, it
will pretty much have all the sites sorted properly.
When this algorighm is implemented, it does pretty
much what you would expect. A search on "automobile manufacturers,"
for instance, will return web pages for car makers. These are definitely
destination sites, because many sites point to them as the ultimate
authority. They are also notable for two things they lack. The first
is out-links. Destination pages just don't link out that much, except
to internal pages at their own site (which the algorithm ignores).
The second thing often lacking is more surprising: the search string.
This algorithm finds highly relevant sites, but unlike other search
engines, the search string doesn't have to be anywhere in the page.
How does that happen? Let's go back to the scheme:
first we get a starting batch of web pages from a search engine
(they, of course, all should have the search string). Then we need
to get the in-links and the out-links for each of these retrieved
pages. It's easy to see how you would get out-links; just bring
in each web page and search through it for links. But how do you
get in-links? How do you know which of the billions of pages across
the vast Internet point to a given site? Fortunately, there's a
trick for this. Type in "link:" before the URL in most
search engines and you will get back a list of sites that point
to that URL. Note something quite interesting about these two techniques:
neither the out-links nor the in-links need to contain the original
search string.
Another astonishing thing to note: the popularity
of the web pages is arrived at without regard to content. The algorithm
doesn't know whether you're searching for cars or cartoons, yet
it can pick out the most popular sources and destinations. Again,
this is due to the abstraction of the links to a mathematical graph:
the starting batch is pulled from a search engine, but after that,
it's all links in and links out.
Don't take my word for it, try it yourself. I wrote
up a program to demonstrate the algorithm, and it works better than
I expected. I call it Buckstop. In general, it brings back sites
that are more germane to my search than does a typical search engine.
For instance the web sites of "automobile manufacturers"
are not listed in the top 50 searches from a typical search engine,
but they all show up in Buckstop.
The search is not perfect. For one thing, it takes
too long. After 5 seconds, most people get antsy. This search can
take up to two minutes. That's because the algorithm has to go out
and read hundreds of web pages to gather its links. If you want
to change how long it takes, click the advanced link above the search
field. Select a smaller number (no less than 10) of initial pages
for a faster, less accurate result. Or select a bigger number (up
to 400) to get a much better list of destinations. It can take up
to 5 seconds for each initial page, so keep that in mind as you
choose.
If enough people find this search technique useful,
perhaps some company like Google will add it to their tool kit.
With their massive network of servers, they could execute a Buckstop
search as quickly as a normal search. Click the button below to
give it a shot!
For more information about Jon Kleinberg's paper,
click here.
Copyright © 2000-2004 by Scott Anderson
For reprint rights, email the author:
Scott_Anderson@ScienceForPeople.com
Here are some other suggested readings about graph
theory:
|