Technology

How and Why We Crawl the Web

Alexa continually crawls all publicly available websites to create a series of snapshots of the web. We use the data we collect to create features and services:

  • Site Info: Traffic Ranks, search analytics, demographics, and more
  • Related Links: Sites that are similar or relevant to the one you are currently viewing

Alexa has been crawling the web since early 1996, and we have steadily increased the amount of information that we gather. We currently gather approximately 1.6 terabytes (1,600 gigabytes) of web content per day. Each snapshot of the web takes approximately two months to complete and contains 4.5 billion pages from over 16 million sites.
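As a rough sanity check on these figures, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: it assumes that "approximately two months" means 60 days and that the daily rate holds for the whole snapshot, neither of which is stated above. The function and constant names are illustrative.

```python
# Figures quoted in the text.
TB_PER_DAY = 1.6             # terabytes of web content gathered per day
PAGES_PER_SNAPSHOT = 4.5e9   # pages in one snapshot
SITES_PER_SNAPSHOT = 16e6    # sites in one snapshot

# Assumption (not from the text): "approximately two months" is taken as 60 days.
DAYS_PER_SNAPSHOT = 60


def snapshot_estimates(tb_per_day=TB_PER_DAY,
                       days=DAYS_PER_SNAPSHOT,
                       pages=PAGES_PER_SNAPSHOT,
                       sites=SITES_PER_SNAPSHOT):
    """Return rough per-snapshot estimates derived from the quoted figures."""
    total_tb = tb_per_day * days
    return {
        "total_tb": total_tb,
        "pages_per_site": pages / sites,
        # Decimal units, matching the text (1 TB = 1,000 GB = 1e9 KB).
        "kb_per_page": total_tb * 1e9 / pages,
    }


if __name__ == "__main__":
    for name, value in snapshot_estimates().items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions, a snapshot comes to roughly 96 terabytes, an average of about 281 pages per site, and about 21 kilobytes per page.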

To programmatically access Alexa's vast information about the Web, please visit Alexa Web Information Service. To keep Alexa from crawling your site, please visit this page.

Gathering Web Usage Information

In addition to the Alexa crawl, which tells us what's on the web, Alexa employs web usage information, which tells us what's being seen on the web by real people. This information comes from our community of Alexa Toolbar users. Each member of the community, in addition to getting a useful tool, is giving back. Simply by using the Toolbar, each member contributes valuable information about the web, how it is used, and what is important and what isn't. This information is returned to the community in the form of improved Related Links, Traffic Ranks, and more.

Finding Patterns in Data

The Alexa services are derived from our uniquely powerful combination of web content and usage information.

  • Site Info

    Alexa gathers information from a variety of sources to provide key statistics about each site on the web. For example, Traffic Rank, the number of Pageviews, and Average Load Time are derived from our community of Toolbar users. For an example of Site Info, see the Alexa Site Info page for Schwab.com.

  • Contact Info

    In addition to site owner-provided data, Alexa provides contact information for web sites by mining the web content gathered in the crawl. This information includes the site owner, address, and contact e-mail. See Contact Info for Schwab.com.

  • Related Links

    Whenever an Alexa Toolbar user visits a web page, the Alexa Toolbar retrieves information from Alexa's servers to suggest other pages that might be of interest to the user. To generate Related Links, we use several techniques, including:

    • The usage paths of the collective Alexa community - this is the most important source of our information, since these paths show us which websites our users believe are important and interesting.
    • Clustering - the hundreds of millions of links on the Web can be used to find clusters of sites that are similar and relevant to one another. We mine this data by using custom databases to find and identify these clusters.
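The two techniques listed above can be illustrated with a small Python sketch. It is not Alexa's actual method, which is not described here: co-visit counting stands in for the analysis of usage paths, and Jaccard similarity of inbound-link sets stands in for link clustering. All function names, site names, and data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical usage paths: each list is the set of sites one user visited.
PATHS = [
    ["news.example", "sports.example", "weather.example"],
    ["news.example", "weather.example"],
    ["shop.example", "news.example"],
]

# Hypothetical link data: site -> set of pages that link to it.
INBOUND = {
    "news.example": {"a", "b", "c"},
    "weather.example": {"a", "b"},
    "shop.example": {"d"},
    "sports.example": {"c", "d"},
}


def related_by_usage(paths, site, top_n=3):
    """Rank sites by how often they share a usage path with `site`."""
    co_visits = defaultdict(Counter)
    for path in paths:
        # Count each unordered pair of distinct sites once per path.
        for a, b in combinations(sorted(set(path)), 2):
            co_visits[a][b] += 1
            co_visits[b][a] += 1
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(co_visits[site].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


def related_by_links(inbound, site, top_n=3):
    """Rank sites by Jaccard similarity of their inbound-link sets."""
    target = inbound.get(site, set())
    scores = {}
    for other, sources in inbound.items():
        if other == site:
            continue
        union = target | sources
        similarity = len(target & sources) / len(union) if union else 0.0
        if similarity > 0:
            scores[other] = similarity
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_n]


if __name__ == "__main__":
    print(related_by_usage(PATHS, "news.example"))
    print(related_by_links(INBOUND, "news.example"))
```

In this toy data, weather.example ranks first for news.example under both measures, because it shares the most usage paths and the most inbound links with it. A production system would need to combine and weight such signals across far more data.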

The Alexa Toolbar

The Alexa Toolbar is a program written by Alexa Internet that users install into their browsers. Every time the user changes pages, the Alexa Toolbar communicates with Alexa servers to retrieve information, which is then displayed in the Toolbar.

Donation of the Information to the Internet Archive

As a service to future historians, scholars, and other interested parties, Alexa Internet donates a copy of each crawl of the web to the Internet Archive, a 501(c)(3) non-profit organization committed to the long-term preservation and maintenance of a growing collection of data about the web. At Alexa, we believe that saving and preserving our early digital heritage is important today and essential for future generations. We also believe that a public charity is the best kind of organization for preserving this global asset. More information about accessing archived materials is available at the Internet Archive, www.archive.org.