Internet Archive

The Internet Archive is a nonprofit organization established to build an Internet library and make it available to the public. Unlike the Library of Congress’s web collection efforts (e.g., Web Capture), which focus on particular topics, the Internet Archive seeks to create a comprehensive record of web content for use by scholars and researchers. It has been archiving web pages for almost twelve years and archives approximately two billion web pages per month. It makes this material available on its website after a delay of one week to six months from collection.

The Internet Archive relies on a protocol known as the “Oakland Archive Policy” in collecting and providing access to web content. Website owners can opt out of having their content copied, or “harvested.” This can be done mechanically by placing a robots.txt file on the site. The Internet Archive’s web crawling utility will respond to the robots.txt file and bypass the site. Upon notification, the Internet Archive will also block access to previously collected website material. The ability to opt out arguably protects website owners who derive financial and other benefits from making older material available, and minimizes the risk of a copyright infringement lawsuit against the Internet Archive.
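
The robots.txt opt-out described above can be illustrated with a short sketch. It is not the Internet Archive's actual crawler code. It uses Python's standard `urllib.robotparser` module to show how a crawler might decide whether to bypass a site. The user-agent token `ia_archiver`, the sample robots.txt content, the helper name `may_harvest`, and the `example.com` URL are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish to opt out.
# "ia_archiver" is assumed here as the crawler's user-agent token.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def may_harvest(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt allow user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text rather than fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/page.html"
    # The excluded crawler is told to bypass the site.
    print(may_harvest(ROBOTS_TXT, "ia_archiver", page))  # False
    # A crawler not named in the file is unaffected by this rule.
    print(may_harvest(ROBOTS_TXT, "OtherBot", page))  # True
```

A rule of this kind only affects harvesting. Blocking access to material already collected depends on the separate notification step described above.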