Information Dynamics Lab

Information Dynamics Lab, HP Labs

Palo Alto, CA 94304

Abstract |
---|

Many man made and naturally occurring phenomena, including city sizes, incomes, word frequencies, and earthquake magnitudes, are distributed according to a power-law distribution. A power-law implies that small occurrences are extremely common, whereas large instances are extremely rare. This regularity or 'law' is sometimes also referred to as Zipf and sometimes Pareto. To add to the confusion, the laws alternately refer to ranked and unranked distributions. Here we show that all three terms, Zipf, power-law, and Pareto, can refer to the same thing, and how to easily move from the ranked to the unranked distributions and relate their exponents. |

A line appears on a log-log plot. One hears shouts of "Zipf!","power-law!","Pareto"! Well, which one is it? The answer is that it's quite possibly all three. Let's try to disentangle some of the confusion surrounding these matters and then tie it all back neatly together.

All three terms are used to describe phenomena where large events are rare, but small ones quite common. For example, there are few large earthquakes but many small ones. There are a few mega-cities, but many small towns. There are few words, such as 'and' and 'the' that occur very frequently, but many which occur rarely.

Zipf's law usually refers to the 'size' ** y** of an occurrence of an event relative to it's rank

Pareto was interested in the distribution of income. Instead of asking what the ** r **th largest income is, he asked how many people have an income greater than

It states that there are a few multi-billionaires, but most people make only a modest income.

What is usually called a power law distribution tells us not how many people had an income greater than ** x**, but the number of people whose income is exactly

See Appendix 1 for discussion of Pareto and power-law.

Although the literature surrounding both the Zipf and Pareto distributions is vast, there are very few direct connections made between Zipf and Pareto, and when they exist, it is by way of a vague reference [1] or an overly complicated mathematical analysis[2,3]. Here I show a simple and direct relationship between the two by walking through an example using real data.

Recently, attention has turned to the internet which seems to display quite a number of power-law distributions: the number of visits to a site [4], the number of pages within a site [5], and the number of links to a page [6], to name a few. My example will be the distribution of visits to web sites.

Figure 1a below shows the distribution of AOL users' visits to various sites on a December day in 1997. One can observe that a few sites get upward of 2000 visitors, whereas most sites got only a few visits (70,000 sites received only a single visit). The distribution is so extreme that if the full range was shown on the axes, the curve would be a perfect L shape. Figure 1b below shows the same plot, but on a log-log scale the same distribution shows itself to be linear. This is the characteristic signature of a power-law.

Fig. 1a Linear scale plot of the distribution of users among web sites |
Fig. 1b Log-log scale plot of the distribution of users among web sites |

Let ** y** = number of sites that were visited by

In a power-law we have

So a power-law with exponent

Now one just might be tempted to fit the curve in Fig. 1b to a line to extract the exponent ** a**. A word of caution is in order here. The tail end of the distribution in Fig. 1b is 'messy' - there are only a few sites with a large number of visitors. For example, the most popular site, Yahoo.com, had 129,641 visitors, but the next most popular site had only 25,528. Because there are so few data points in that range, simply fitting a straight line to the data in Fig. 1b gives a slope that is too shallow (a = 1.17). To get a proper fit, we need to bin the data into exponentially wider bins (they will appear evenly spaced on a log scale) as shown in Fig. 2a. A clean linear relationship now extends over 4 decades (1-10

Fig. 2a Binned distribution of users to sites |
Fig. 2b Cumulative distribution of users to sites |

So far we have only looked at the power-law PDF of sites visits. In order to see Zipf's law, we need to plot the number of visitors to each site against its rank. Fig. 3 shows such a plot for the same data set of AOL users' site visits. The relationship is nearly linear on a log-log plot, and the slope is -1, which makes it Zipf. In order for there to be perfectly linear relationship, the most popular sites would have to be slightly popular, and the less popular sites slightly more numerous. It might be worthwhile to fit this distribution with alternate distributions, such as the stretched exponential [7], or parabolic fractal [8]. In any case, most would happy to call this rank distribution Zipf, and we will call it Zipf here as well.

Fig. 3 Sites rank ordered by their popularity |

At first, it appears that we have discovered two separate power laws, one produced by ranking the variables, the other by looking at the frequency distribution.
Some papers even make the mistake of saying so [9]. But the key is to formulate the rank distribution in the proper way to see its direct relationship to the Pareto. The phrase "The ** r** th largest city has

then the Pareto exponent is

(See Appendix 2 for details).

Of course, since the power-law distribution is a direct derivative of Pareto's Law, its exponent is given by ** (1+1/b)**. This also implies that any process generating an exact Zipf rank distribution must have a strictly power-law probability density function. As demonstrated with the AOL data, in the case

Finally, instead of touting two separate power-laws, we have confirmed that they are different ways of looking at the same thing.

1. Per Bak, "How Nature Works: The science of self-organized criticality", Springer-Verlag, New York, 1996.

2. G. Troll and P. beim Graben (1998), "Zipf's law is not a consequence of the central limit theorem", Phys. Rev. E, **57(2)**:1347-1355.

3. R. Gunther, L. Levitin, B. Shapiro, P. Wagner (1996), "Zipf's law and the effect of ranking on probability distributions", International Journal of Theoretical Physics, **35(2)**:395-417

4. L.A. Adamic and B.A. Huberman (2000), "The Nature of Markets in the World Wide Web", QJEC **1(1)**:5-12.

5. B.A. Huberman and L.A. Adamic (1999), "Growth Dynamics of the World Wide Web", Nature **401**:131.

6. R. Albert, H. Jeoung, A-L Barabasi, "The Diameter of the World Wide Web", Nature **401**:130.

7. Jean Laherrere, D Sornette (1998), "Stretched exponential distributions in Nature and Economy: 'Fat tails' with characteristic scales", European Physical Journals, B2:525-539. http://xxx.lanl.gov/abs/cond-mat/9801293

8. Jean Laherrere (1996), "'Parabolic fractal' distributions in Nature".

9. M. Faloutsos, P. Faloutsos, and C. Faloutsos, "On Power-Law Relationships of the Internet Topology", SIGCOMM '99 pp. 251-262.

10. N. Johnson, S. Kotz, N. Balakrishnan, "Continuous Univariate Distributions Vol. 1", Wiley, New York, 1994.

where

As a consequence, the CDF

and the PDF is

Note that the shape parameter of the Pareto distribution,

Another property, which holds for all ** k**, not just those

Let the slope of the ranked plot be ** b**.

Then the expected value

which means that there are

Changing variables we get:

To get the PDF from the CDF, we take the derivative with respect to

Which gives the desired correspondence between the two exponents.

**This tutorial exists only in an online version but some of the discussion is included inL.A. Adamic and B.A.
Huberman, 'Zipf’s
law and the Internet', Glottometrics 3, 2002,143-150**