
A Guide to Archiving on the Internet

And why having records is essential for fact-checkers.

Published Nov 12, 2022

Image Via Archiv Schwaz /Wikimedia Commons

This page is part of an ongoing effort by the Snopes newsroom to teach the public the ins and outs of online fact-checking and, as a result, strengthen people's media literacy skills. Misinformation is everyone's problem. The more we can all get involved, the better job we can do combating it. Have a question about how we do what we do? Let us know.

Here at Snopes, archiving web links is key to our fact-checking practice. And thanks to numerous archival resources on the internet, that practice has become easier than ever. Keeping records on the internet is essential not just to understanding the history of the web, but also to tracking whether a tweet was deleted or whether someone amended a statement on a web page.

But this is not unique to our roles as fact-checkers. Governments also keep archives of the websites of each administration, in the interests of transparency and public access. Former U.S. President Donald Trump's White House website is trumpwhitehouse.archives.gov, while Barack Obama's White House website can be found at obamawhitehouse.archives.gov. The Clinton administration established the first White House website in 1994. These sites are labeled as "historical material, 'frozen in time.'" Some federal sites are "harvested" and saved by the Federal Depository Library Program Web Archive, which aims to "provide permanent public access to Federal Agency Web content."

Estimates of the average lifespan of a webpage have varied over time. In 1997, Scientific American estimated it was 44 days, and in 2015 the New Yorker suggested it could be 100 days. But some web pages can be deleted in a matter of hours, especially if they are of a politically sensitive nature.

In 2014, when Malaysia Airlines Flight 17 was shot down over Ukrainian airspace, a Ukrainian separatist leader, Igor Girkin, also known as Strelkov, reportedly wrote, "We just downed a plane, an AN-26." While an AN-26 is a Soviet-built military cargo plane, the photographs on the post appeared to be of a Boeing 777. The Wayback Machine saved the post, which was deleted from Strelkov's page only a couple of hours later. By the time a journalist tweeted a picture of the saved webpage, writing, "Grab of Donetsk militant Strelkov's claim of downing what appears to have been MH17," Strelkov's page had been edited and the claim deleted. The only proof of that post was the saved screenshot on archive.org. While the post could possibly have been misleading, the incident revealed the Internet Archive's role in collecting receipts that became useful to journalistic investigations.

The Internet Archive (archive.org) is considered to be one of the largest such archives of the internet, with around 625 billion web pages saved since its founding in 1996. Its Wayback Machine allows users to go through 25 years of web history, and the organization partners with the Federal Depository Library Program and other organizations through Archive-It.

The Internet Archive is not the only online database. Others include archive.today, perma.cc, the U.K. Web Archive (specific to sites from the United Kingdom and a collaboration with U.K. Legal Deposit Libraries), and Time Travel. Wikipedia also has a long list of international archiving efforts. 

How to Archive a Web Page

The most straightforward site to get started on, however, is archive.org. Here, you simply input a link into the Wayback Machine and click on "Browse History" to see whether an archived copy already exists. Below that, another option allows you to "Save Page Now" and create a new archived link.
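For readers who prefer a script to the search box, the same two steps (look up a page, then save it) can be sketched in Python. This is a minimal sketch, not an official client: it assumes the Wayback Machine's public availability lookup at archive.org/wayback/available and the save address at web.archive.org/save/, and the function names are our own. It only builds the addresses and reads the reply, so it runs without touching the network.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoints; check archive.org's own documentation before relying on them.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
SAVE_ENDPOINT = "https://web.archive.org/save/"


def availability_query(page_url, timestamp=None):
    """Build the address that asks whether page_url already has a saved copy."""
    params = {"url": page_url}
    if timestamp:
        # Optional YYYYMMDD or YYYYMMDDhhmmss value: ask for the copy closest to that moment.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the link to the closest saved copy from the JSON reply, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def save_request_url(page_url):
    """Build the address that asks the archive to capture page_url now."""
    return SAVE_ENDPOINT + page_url
```

Opening the address returned by availability_query in a browser should show a short JSON reply. If closest_snapshot finds nothing in that reply, the page has likely not been archived yet, and the address from save_request_url is the scripted counterpart of "Save Page Now."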

If you want to browse through the history of a web page, you will be directed to a list of all the times it has been archived, organized like a calendar, down to the month, day, and time it was saved. You can click on a date (indicated by a blue bubble) to get access to a saved copy of the webpage. The larger the bubble, the more times the page was archived on that day. Note that a green link indicates a webpage was redirected and may not work, so users should click on blue links.

The top of the search results page also tells users how many times a webpage was archived, and over what date range. The top bar shows the years in which the page was saved, while the calendar below it lets users click on the month, day, and time.
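The date and time shown in the calendar are also written into each archived link as a 14-digit number, as in the Scientific American link in our source list (web.archive.org/web/19970504212157/...). The sketch below, with function names of our own choosing, reads that number back into a date and counts saved copies per day, which is roughly what the size of a calendar bubble reflects. It assumes the link follows the /web/ plus 14 digits pattern and is not a description of how the Wayback Machine builds its calendar.

```python
from collections import Counter
from datetime import datetime


def parse_capture_time(archived_url):
    """Read the 14-digit YYYYMMDDhhmmss stamp that follows '/web/' in an archived link."""
    marker = "/web/"
    start = archived_url.index(marker) + len(marker)
    stamp = archived_url[start:start + 14]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


def captures_per_day(archived_urls):
    """Count how many saved copies fall on each calendar day."""
    return Counter(parse_capture_time(url).date().isoformat() for url in archived_urls)


link = "https://web.archive.org/web/19970504212157/https://www.sciam.com/0397issue/0397kahle.html"
print(parse_capture_time(link))  # 1997-05-04 21:21:57
```

A day with a higher count in captures_per_day corresponds to a larger bubble on the calendar.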

Archive.org also has a large collection of books that we have frequently relied on in our research. 

On archive.today, you can likewise check whether a link has been archived before, or archive one yourself.

How Do We Know Archived Pages Are Not Manipulated?

While people have screenshotted webpages and tweets in the past, it is easier to manipulate simple images than it is to edit an already archived webpage. According to an article by professor of computer science Michele C. Weigle, published by the Social Science Research Council (SSRC):

In addition, screenshots are static. There can be no interaction with the page—no scrolling, no hovering, no clicking of links or even revealing what web pages the links on the page referred to.

Web archives, on the other hand, record the entire contents of a web page, including its source HTML and embedded images, stylesheets, or JavaScript source. Upon playback, the user can interact with the archived page, including clicking links to explore what the web page was connected to. In addition, public web archives are created and stored by independent archival organizations, such as the Internet Archive. We trust that the contents of these public web archives have not been tampered with or maliciously manipulated.

However, archived links are not perfect, and come with a range of possible glitches, according to SSRC:

Although web archives provide a valuable service, they are not perfect, and archiving a web page is very different from archiving a physical object or even a static file such as a PDF. Web pages have become increasingly more complex over the years, with many loading hundreds or even thousands of images, stylesheets, and JavaScript resources, which can include advertisements and trackers. These JavaScript resources are executed by web browsers, and many of their interactions cannot be captured by all web archives. The embedded and linked nature of HTML makes the direct replay of archived web pages difficult, so web archives must make some limited transformations to the original web page. This includes rewriting links and locations of embedded resources so that they are loaded from the archive instead of the live web. This prevents someone from viewing a web page captured in 2012, for instance, and seeing an advertisement from 2018 embedded in that 2012 web page.
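The link rewriting described in that passage can be illustrated with a deliberately simplified sketch. It is not the Internet Archive's actual replay code: it assumes archived addresses take the form web.archive.org/web/ plus a timestamp plus the original address, handles only double-quoted src and href attributes, and uses names we made up for illustration.

```python
import re
from urllib.parse import urljoin

# Assumed form of an archived address: prefix + timestamp + "/" + original address.
ARCHIVE_PREFIX = "https://web.archive.org/web/"

# Matches src="..." and href="..." attributes (double-quoted only, for simplicity).
ATTR_PATTERN = re.compile(r'(?P<attr>\b(?:src|href))="(?P<url>[^"]*)"')


def rewrite_links(html, page_url, timestamp):
    """Point every src/href at the archive's copy from the same capture time."""

    def replace(match):
        # Resolve relative links against the original page before prefixing.
        absolute = urljoin(page_url, match.group("url"))
        return f'{match.group("attr")}="{ARCHIVE_PREFIX}{timestamp}/{absolute}"'

    return ATTR_PATTERN.sub(replace, html)
```

Because every embedded resource is requested from the archive with the capture's own timestamp, a page saved in 2012 asks for its 2012 images and scripts rather than whatever sits at those addresses on the live web today.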

Despite the imperfections in online archival resources, here at Snopes we have still relied on them for numerous fact checks, including ones about the Twitter history of public figures like Raphael Warnock, old quotes from magazines, and much more.

Sources

"Archived Presidential White House Websites." National Archives, 9 Jan. 2017, https://www.archives.gov/presidential-libraries/archived-websites. Accessed 10 Nov. 2022.

"Archive.Ph." https://archive.ph/. Accessed 10 Nov. 2022.

Emery, David. "Is This 'Mayonnaise Safety' Military Handbook Real?" Snopes, 8 Aug. 2022, https://www.snopes.com/fact-check/mayonnaise-safety-military-handbook/. Accessed 10 Nov. 2022.

Evon, Dan. "Did Trump Write 'Never Admit Defeat' in 'Art of the Deal'?" Snopes, 10 Nov. 2020, https://www.snopes.com/fact-check/trump-art-of-the-deal/. Accessed 10 Nov. 2022.

"Federal Depository Library Program Web Archive." Archive-it. https://archive-it.org/home/FDLPwebarchive?fc=meta_Creator%3AU.S.+Department+of+Health+and+Human+Services. Accessed 10 Nov. 2022.

"How Web Archivists and Other Digital Sleuths Are Unraveling the Mystery of MH17." Washington Post. www.washingtonpost.com, https://www.washingtonpost.com/news/the-intersect/wp/2014/07/21/how-web-archivists-and-other-digital-sleuths-are-unraveling-the-mystery-of-mh17/. Accessed 10 Nov. 2022.

"Internet Archive: About IA." https://archive.org/about/. Accessed 10 Nov. 2022.

"Internet Archive: Wayback Machine." https://archive.org/web/. Accessed 10 Nov. 2022.

Lepore, Jill. "What the Web Said Yesterday." The New Yorker, 19 Jan. 2015. www.newyorker.com, https://www.newyorker.com/magazine/2015/01/26/cobweb. Accessed 10 Nov. 2022.

Liles, Jordan. "Did Raphael Warnock Tweet About 'the Meaning of Easter'?" Snopes, 18 Apr. 2022, https://www.snopes.com/fact-check/warnock-easter-tweet/. Accessed 10 Nov. 2022.

Liles, Jordan. "'Handmaid's Tale' Tweet Deleted from CNN Host Brian Stelter's Twitter Account." Snopes, 2 Sept. 2021, https://www.snopes.com/fact-check/brian-stelter-handmaids-tale-cnn/. Accessed 10 Nov. 2022.

"List of Web Archiving Initiatives." Wikipedia, 7 Nov. 2022. https://en.wikipedia.org/w/index.php?title=List_of_Web_archiving_initiatives&oldid=1120507741. Accessed 10 Nov. 2022.

MacGuill, Dan. "Did Wired Mag Publish 'Scary Accurate' Predictions About 21st Century in 1997?" Snopes, 27 Nov. 2021, https://www.snopes.com/fact-check/wired-1997-predictions/. Accessed 10 Nov. 2022.

"Preserving the Internet." Scientific American: Article—Special Report, 1997, https://web.archive.org/web/19970504212157/https://www.sciam.com/0397issue/0397kahle.html. Accessed 10 Nov. 2022.

"The White House." Whitehouse.Gov, 12 Mar. 2015, https://obamawhitehouse.archives.gov/homepage. Accessed 10 Nov. 2022.

"The White House." Whitehouse.Gov, https://trumpwhitehouse.archives.gov/. Accessed 10 Nov. 2022.

"Time Travel." https://timetravel.mementoweb.org/. Accessed 10 Nov. 2022.

"UKWA Home." https://www.webarchive.org.uk/ukwa/. Accessed 10 Nov. 2022.

"Web Evidence Points to Pro-Russia Rebels in Downing of MH17." Christian Science Monitor, 17 July 2014. Christian Science Monitor, https://www.csmonitor.com/World/Europe/2014/0717/Web-evidence-points-to-pro-Russia-rebels-in-downing-of-MH17. Accessed 10 Nov. 2022.

"Websites Change. Perma Links Don't." Perma, https://perma.cc. Accessed 10 Nov. 2022.

Weigle, Michele C. "On the Importance of Web Archiving." Items, https://items.ssrc.org/parameters/on-the-importance-of-web-archiving/. Accessed 10 Nov. 2022.

Nur Nasreen Ibrahim is a reporter with experience working in television, international news coverage, fact checking, and creative writing.