To most of the web-surfing public, the Internet Archive’s Wayback Machine is the face of the Archive’s web archiving activities. Via a simple interface, anyone can type in a URL and see how it has changed over the last 20 years. Yet behind that simple search box lies an exquisitely complex assemblage of datasets and partners that make possible the Archive’s vast repository of the web. How does the Archive really work, what does its crawl workflow look like, how does it handle issues like robots.txt, and what can all of this teach us about the future of web archiving? …
See the full article at http://www.forbes.com/sites/kalevleetaru/2016/01/18/the-internet-archive-turns-20-a-behind-the-scenes-look-at-archiving-the-web/#2715e4857a0b22403ae67800