Transitioning from a regular lattice to a small world topology can strongly affect the properties of the graph. For example, a small fraction of random links added to a regular lattice allows disease to spread much more rapidly across the graph. An iterated multiplayer prisoner's dilemma game is less likely to lead to cooperation if the connections between the players form a small world network rather than a regular lattice. Costs for search problems such as graph coloring have heavier tails for small world graphs than for random graphs, calling for different solving strategies.
So far, several man-made and naturally occurring networks have been identified as small world graphs. The power grid of the western US, the collaboration graph of film actors, and the neural network of the worm Caenorhabditis elegans (the only completely mapped neural network) have all been shown to have small world topologies. In the case of the graph of film actors, the distance between any two actors is found as follows: if the two have acted together, their distance is one; if they have not costarred together but have both costarred with the same actor, their distance is two; and so on.
The fact that relationships between individuals tend to form small world networks has been captured in several popular games. For example, in the game 'Six Degrees of Kevin Bacon', one attempts to find the shortest path from any actor to Kevin Bacon. Because the graph of film actors is a small world, it is difficult to find any actor with a degree of separation greater than 4 from Kevin Bacon. There is also the Erdos number for scientists: if a scientist has published an article with the famous Hungarian mathematician Erdos, their number is 1; if they have published with someone who has published with Erdos, their number is 2.
In this paper I show that another man-made network, the World Wide Web, has a small world topology as well. Web sites tend to be clustered, but at the same time only a few links separate any one site from any other. This topology has implications for the way users surf the Web and the ease with which they gather information. The link structure additionally provides information about the underlying relationship between people, their interests, and communities.
The first graph considered was the Web at the site level. Site A has a directed edge to site B if any of the pages within A points to any page within B. The data set used was extracted by Jim Pitkow at Xerox PARC from an Alexa crawl made approximately one year ago. It contains 50 million pages and 259,794 sites. Initially all links were considered to be undirected. From the 259,794 sites in the data set, the leaf nodes were removed, leaving 153,127 sites. An estimate of L was formed by averaging the path lengths in breadth-first search trees rooted at approximately 60,000 nodes. 84.5% of the paths were realizable; the rest were labeled with -1. The resulting histogram is shown below.
L was small, a mere 3.1 hops on average between any two connected sites. C was 0.1078, compared to 2.3e-4 for a random graph with the same number of nodes and edges.
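The two quantities reported here can be computed directly from an adjacency structure. The following is a minimal sketch (not the code used in the study) of estimating L by averaging over breadth-first search trees, and of computing the clustering coefficient C as the mean fraction of a node's neighbor pairs that are themselves linked; the toy graph is illustrative only:

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, root):
    """Return {node: hop distance from root} for all reachable nodes."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def average_path_length(adj, roots):
    """Estimate L by averaging pairwise distances over BFS trees
    rooted at the sampled `roots`."""
    total = count = 0
    for r in roots:
        for v, d in bfs_distances(adj, r).items():
            if v != r:
                total += d
                count += 1
    return total / count

def clustering_coefficient(adj):
    """Mean over nodes of the fraction of neighbor pairs that are linked."""
    coeffs = []
    for u, nbrs in adj.items():
        if len(nbrs) < 2:
            continue  # clustering is undefined for degree < 2
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(links / (len(nbrs) * (len(nbrs) - 1) / 2))
    return sum(coeffs) / len(coeffs)

# Toy undirected graph: a triangle (0,1,2) plus a pendant node 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
L = average_path_length(adj, roots=list(adj))
C = clustering_coefficient(adj)
```

On the real data set, `roots` would be the ~60,000 sampled root nodes rather than every node.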
Next, directed links were considered. This is a more natural interpretation of navigation between sites, since a user cannot follow a link in the reverse direction using a standard browser. The largest strongly connected component (SCC), i.e. the largest subset of nodes such that any node within it can be reached from any other node by following directed links, contained 64,826 sites. In order to sample the distribution of distances between nodes, breadth-first search trees were formed from a fraction of the nodes. The corresponding histogram is given in the figure below.
L was slightly higher at 4.228, because the number of available paths is reduced when edges are no longer bidirectional. C was 0.081, compared to 1.05e-3 for a random graph with the same number of nodes and edges. In short, even though sites are highly clustered locally, one can still hop among 65,000 sites by following on average only 4.2 between-site links (note that there might be additional hops within sites that are not counted in this framework). There is indeed a small world network of sites.
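The directed analysis hinges on extracting the largest SCC. A minimal sketch, using Kosaraju's two-pass algorithm on a toy adjacency dict (the site names here are illustrative, not from the data set):

```python
def strongly_connected_components(adj):
    """Kosaraju's algorithm: a DFS pass to order nodes by finish time,
    then a pass over the reversed graph to peel off components."""
    order, seen = [], set()
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(adj[start]))]
        while stack:  # iterative DFS recording finish order
            node, it = stack[-1]
            advanced = False
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj[v])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()
    rev = {u: set() for u in adj}
    for u in adj:
        for v in adj[u]:
            rev[v].add(u)
    comps, assigned = [], set()
    for u in reversed(order):
        if u in assigned:
            continue
        comp, stack = set(), [u]
        assigned.add(u)
        while stack:  # flood fill in the reversed graph
            x = stack.pop()
            comp.add(x)
            for v in rev[x]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

# Toy directed site graph: A, B, C form a cycle; D only points inward.
adj = {'A': {'B'}, 'B': {'C'}, 'C': {'A'}, 'D': {'A'}}
largest = max(strongly_connected_components(adj), key=len)
```

On the site graph this yields the 64,826-site component described above; distance sampling then proceeds within it.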
In order to have a more accurate comparison between the small world network of sites and the corresponding random graph, the subset of .edu sites was considered. Because the .edu subset is significantly smaller, distances between every pair of nodes could be computed. The complete histogram is shown below. 3,456 of the 11,000 .edu sites formed the largest SCC. C and L were computed for a generated random graph with the same number of nodes and directed edges.
L for the .edu graph was 4.062, similar to that of sites overall. This was remarkably close to the L of the random graph: 4.048. At the same time, C was much higher: 0.156 vs. 0.0012 for the random graph. The semi-log plot below shows the difference in the tails of the two shortest path histograms. While L is almost the same for both graphs, long paths (of up to 13) occur in the .edu site graph; for the corresponding random graph the maximum path length is 8, and long paths are unlikely. Thus the small world network distinguishes itself by a few unusually long paths as well as a much larger C.
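The baseline used throughout is a random graph matched in size. A minimal sketch of generating such a G(n, m) directed random graph, with hypothetical sizes standing in for the actual node and edge counts:

```python
import random

def random_directed_graph(n, m, seed=0):
    """Erdos-Renyi style G(n, m): n nodes and m distinct directed
    edges (no self-loops) chosen uniformly at random."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    placed = 0
    while placed < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v and v not in adj[u]:
            adj[u].add(v)
            placed += 1
    return adj

# Baseline matched to a hypothetical site graph:
# 50 nodes, 200 directed edges (the real comparison used the
# node and edge counts of the .edu SCC).
adj = random_directed_graph(50, 200)
```

Measuring C and L on this graph, with the same routines used for the real graph, gives the random baseline to which 0.156 vs. 0.0012 and 4.062 vs. 4.048 refer.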
In summary, the largest SCCs of both sites in general and the subset of .edu sites are small world networks with small L.
Starting from the assumption that a good Web document references other good documents of similar content, one would expect that there exist groups of pages of similar content which refer to one another. The quality of these pages is vouched for by the recommendations implicit in the links among them. Such groups can be extracted from the results matching a particular query. Within each group there are documents which are good "centers": the distance from them to any other document within the group is minimal on average. Centers tend to be index pages, and hence constitute good starting points for exploring related query results.
An application of these ideas was built around webbase, a repository of Web pages crawled by Google in the first half of 1998. For any given search string, webbase returns matching pages ranked by their PageRank and text match. Webbase also provides link information for each page.
Rather than presenting a list of documents containing many sequential entries from the same site, the search engine can present just the center of each SCC. By sorting the centers by the size of their SCCs, one can present the user with the maximum span of the search space in the minimum number of entries. Given a good starting point, users can explore the SCCs on their own.
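This selection step can be sketched as follows: within each SCC, pick the page whose average directed distance to every other member is smallest, then list one center per component, largest component first. The graph and component below are toy examples, not query results:

```python
from collections import deque

def bfs_avg_distance(adj, root, members):
    """Average directed hop distance from `root` to the other
    members of its component (BFS restricted to the component)."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v in members and v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return sum(dist.values()) / (len(members) - 1)

def centers_by_scc(adj, sccs):
    """One center per SCC (the page with minimum average distance
    to the rest of its component), largest components first."""
    result = []
    for scc in sorted(sccs, key=len, reverse=True):
        if len(scc) < 2:
            continue  # singleton components have no meaningful center
        center = min(scc, key=lambda u: bfs_avg_distance(adj, u, scc))
        result.append((center, len(scc)))
    return result

# Toy component: 'a' links both ways to 'b' and 'c', so it is central.
adj = {'a': {'b', 'c'}, 'b': {'a'}, 'c': {'a'}}
centers = centers_by_scc(adj, [{'a', 'b', 'c'}])
```

Within an SCC every member is reachable from every other, so the restricted BFS always covers the whole component.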
Informal observation suggests that pages which do not belong to any large SCC tend to focus on a narrow topic, have few outlinks, or have few pages referencing them (which implies that they are probably not worth reading). When centers are sorted by the size of their SCCs, these documents are listed last. One further observes that the SCCs that span several sites tend to contain the main relevant pages, and are rich in "hubs", or pages that contain links to many good pages. The algorithm will present these at the top of the results list.
If we look at the top 10 centers (see table) among the 144 pages in the largest SCC, we see that they are compilations of links to Java resources on the Web. Such pages are obviously good starting points. Since they are contained in the SCC, each must be linked to by at least one other page, which means that the list has been evaluated and deemed useful by at least one other source. Pages which are compilations of links tend to be subdivided into topics and to have brief human evaluations or summaries of the pages they link to. If a search engine is able to present the user with such man-made sources of links, it could potentially save itself the trouble of ranking, clustering, or otherwise evaluating the documents matching the search string.
It is interesting to note that this procedure might not return the single best resource immediately, but with high probability the best resource is only one step away. For example, java.sun.com is not among the top 5 centers, but all 5 centers link to it. One of the top five documents returned by webbase, www.gamelan.com, does not appear in the SCC. It has many back links but no forward links, and hence is disconnected from the other pages. Still, 4 of the top 5 centers reference gamelan, so that, again, this important site is just a click away from a center.
| DOCID | Av. Min. Distance | URL | Title |
|---|---|---|---|
| 7458633 | 2.47222 | www.infospheres.caltech.edu/resources/java.html | Infospheres - Java Resources |
| 2011210 | 2.48611 | www.apl.jhu.edu/~hall/java/ | Java Programming Resources: Java, Java and More Java. |
| 9488549 | 2.73611 | www.cat.syr.edu/3Si/Java/Links.html | 3Si - Java Resources |
| 3963615 | 2.77778 | www.december.com/works/java/info.html | Presenting Java: Information Sources |
| 2897228 | 2.79861 | www.javaworld.com/javaworld/common/jw-jumps.html | JavaWorld - Java Jumps |
| 5887098 | 3.01389 | javaboutique.internet.com/javafaqs.html | The Java(TM) Boutique: Java FAQs |
| 2681499 | 3.04167 | sunsite.unc.edu/javafaq/ | Cafe au Lait Java FAQs, News, and Resources |
What if, on the other hand, one were interested in just the best single page on Java? Then one could look for the page with the minimum average distance to it from any other page in the SCC. However, it is possible for a page to be very good and yet have no outlinks, as in the example of gamelan.com; such a page would not be found using our method of largest strongly connected components.
| DOCID | Av. Min. Distance | URL | Title |
|---|---|---|---|
| 1360147 | 2.29861 | java.sun.com/products/ | Products & APIs |
| 1241416 | 2.33333 | java.sun.com/nav/used/ | Java(TM) Technology in the Real World |
| 98165 | 2.53472 | java.sun.com:81/ | Java Home Page |
Pages at java.sun.com come out on top, as one would hope.
In summary, for the "java" search string, the largest SCC is a group of high quality resources spanning several sites. Pages with the lowest average distance to any other page within the largest SCC are good hubs, i.e. pages with many links to good sources. Pages with the lowest average distance to them from any other page in the largest SCC tend to be good sources of information, or "authorities" on Java.
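The two rankings differ only in the direction of the distance. A minimal sketch, ranking by average distance from a page (hub-like) and by average distance to it (authority-like), on a toy SCC built for illustration:

```python
from collections import deque

def hubs_and_authorities(adj, scc):
    """Rank pages in an SCC two ways: by average distance *from* the
    page (hub-like centers) and by average distance *to* it
    (authority-like pages), using BFS restricted to the SCC."""
    # Reverse the edges within the component.
    rev = {u: {v for v in scc if u in adj[v]} for u in scc}

    def avg(graph, root):
        dist = {root: 0}
        q = deque([root])
        while q:
            u = q.popleft()
            for v in graph[u]:
                if v in scc and v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return sum(dist.values()) / (len(scc) - 1)

    hubs = sorted(scc, key=lambda u: avg(adj, u))
    auths = sorted(scc, key=lambda u: avg(rev, u))
    return hubs, auths

# Toy SCC: 'h' links out to everything (a hub); 'x', 'y', and 'h'
# all funnel into 's' (an authority); 's' links back to 'h'.
adj = {'h': {'x', 'y', 's'}, 'x': {'s'}, 'y': {'s'}, 's': {'h'}}
hubs, auths = hubs_and_authorities(adj, set(adj))
```

Reversing the edges turns "distance to a page" into an ordinary forward BFS, so one routine serves both rankings.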
Future work should include a comparison with Kleinberg's algorithm, which also finds "hubs" and "authorities" using the link structure of the Web.
To see what insight one could gain from identifying strongly connected components and average shortest paths, three search strings were issued to the search engine application outlined above.
The pro-life query results, on the other hand, contained a pro-life strongly connected component of 41 pages, which spanned 16 sites. One could conclude that pro-lifers not only have a stronger Web presence (804 vs. 645 documents returned for the two queries), but that the pro-life community is more tightly knit, and possibly better organized.
The results of the UFO query contained a largest strongly connected component of 95 pages, spanning 21 sites. Apparently there is a lot of interest in UFOs, and UFO enthusiasts are interested in others' sightings and speculations.
The largest strongly connected components for all three queries had a high clustering coefficient and a small average shortest path, showing that groups of people with common interests are linked to one another via a small world network on the Web.
How does this tie into marketing? Suppose one were interested in informing others about upcoming legislation regarding abortion. For example, a while back one could have either supported or opposed the partial birth abortion ban bill and wanted to start a ribbon campaign. One could drop a red ribbon in support of the bill in the middle of the pro-life strongly connected sites and expect the ribbon to find its way to other pro-life sites. On the other hand, if one wanted to start a black ribbon campaign in opposition to the bill, one would have to drop a black ribbon at several pro-choice sites, because one could not expect the ribbon to propagate on its own. In general, one can reach a large community of people by placing an ad on a central page of an SCC. If the community is represented on the Web by many small SCCs, the advertiser would need to place ads in many of them in order to reach as much of the target audience as possible.