The internet hosts billions of pages, from forums, social networks, and personal blogs to shopping sites, digital libraries, and academic archives. However, this immense repository of information is not as permanent as many might believe. The dreaded “Error 404” is the most recognized symbol of this problem, signaling the disappearance of once-accessible content.
A recent study by the Pew Research Center revealed an alarming fact: a quarter of all web pages that existed at some point between 2013 and 2023 are no longer accessible. Millions of pages, articles, images, videos, and even academic documents have disappeared from their original sources, leaving behind only broken links and the classic “Error 404 – Page Not Found” message.

According to the study, 38% of the web pages that existed in 2013 were no longer accessible a decade later. Error 404 occurs when a server cannot locate the requested page, typically because it was removed, its address was changed without proper redirection, or the link became obsolete. This digital disappearance is not limited to personal blogs and websites; it also affects government portals, renowned news outlets, Wikipedia, and other widely used domains.
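The causes described above map onto HTTP status codes, so a basic link check can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the study: the function names `classify_status` and `check_link` and the category labels are hypothetical, and grouping 410 (Gone) together with 404 is an assumption made here.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse link-health label (labels are illustrative)."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirected"
    if status in (404, 410):  # assumption: treat "Gone" like "Not Found"
        return "missing"
    return "other-error"


def check_link(url: str, timeout: float = 10.0) -> str:
    """Request a URL and classify the outcome. Requires network access.

    urlopen follows redirects on its own, so a page that was moved with a
    proper redirect is reported as "ok" rather than "redirected".
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; err.code is the status.
        return classify_status(err.code)
    except (urllib.error.URLError, TimeoutError):
        # DNS failure, refused connection, or timeout: the server never answered.
        return "unreachable"
```

Some servers reject `HEAD` requests or return a normal 200 page that merely says the content is gone, so a check like this can both overcount and undercount dead links.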
These findings debunk the idea that the internet is eternal. Many people assume that once something is published online, it will remain available indefinitely. However, the Pew Research Center study shows that even recent pages are not immune: about 8% of the pages that existed in 2023 were already inaccessible by the time of the analysis, highlighting the ephemeral nature of the digital environment.
Researchers analyzed nearly a million pages recorded by the non-profit organization Common Crawl, which regularly scans the web. The results showed that the problem spans all corners of the internet. For example, 23% of the news pages analyzed had at least one broken link. On Wikipedia, 54% of pages contained at least one reference link pointing to a page that no longer exists.
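Figures such as "23% of pages had at least one broken link" count pages, not links: a page with one dead link out of a hundred counts the same as a page where every link is dead. The arithmetic can be sketched as follows. The function name `share_with_broken_link` and the sample data are invented for illustration and are not taken from the study.

```python
def share_with_broken_link(pages: dict[str, list[int]]) -> float:
    """Return the fraction of pages with at least one link that returned 404.

    `pages` maps a page identifier to the HTTP status codes of its outgoing links.
    """
    if not pages:
        return 0.0
    broken_pages = sum(1 for statuses in pages.values() if 404 in statuses)
    return broken_pages / len(pages)


# Hypothetical sample: four pages and the status codes of their outgoing links.
sample = {
    "page-a": [200, 200, 404],  # one dead link: counts as affected
    "page-b": [200],            # all links alive
    "page-c": [404, 404],       # every link dead: still counts once
    "page-d": [],               # no outgoing links
}

print(f"{share_with_broken_link(sample):.0%}")  # prints "50%"
```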
This disappearance has troubling implications for fields such as journalism, education, and academic research. Students and researchers often encounter inactive links in bibliographic references, hindering access to original sources and compromising the integrity of academic work.
Although a significant amount of content has vanished from its original domains, not all is lost. The Wayback Machine, a project by the Internet Archive dedicated to preserving the web’s history, has managed to store about two-thirds of these lost pages. The project’s director, Mark Graham, emphasized the importance of this initiative to Business Insider: “If a physical library burns down, many of the books might exist elsewhere. But the digital world is inherently fragile and potentially ephemeral.”
The Wayback Machine archives over a billion URLs daily, including web pages, images, and even YouTube videos. However, not everything can be preserved. Many websites impose barriers to automated crawling, such as paywalls (systems that restrict exclusive content to subscribers) and crawler blockers. These limitations make digital preservation efforts more challenging.
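Whether a particular lost page is among those the Wayback Machine has preserved can be checked programmatically. The sketch below assumes the Internet Archive's public availability endpoint at `https://archive.org/wayback/available` and the JSON shape it has been documented to return. The helper names are hypothetical, and the sample response is written by hand for illustration rather than fetched from the service.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumed endpoint of the Internet Archive's availability service.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url: str) -> str:
    """Build the availability request URL for the page being looked up."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": url})


def closest_snapshot(payload: dict) -> Optional[str]:
    """Extract the archived copy's URL from a response, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived_copy(url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the service for the closest snapshot of `url`. Requires network access."""
    with urllib.request.urlopen(build_query(url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))


# Hand-written response in the assumed shape, used here instead of a live call.
sample_response = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
}

print(closest_snapshot(sample_response))
print(closest_snapshot({"archived_snapshots": {}}))  # prints "None"
```

Keeping the parsing step separate from the network call makes it possible to test the logic against a stored response without contacting the service.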
In addition to the Internet Archive, Common Crawl also plays a significant role in this scenario. Unlike the Wayback Machine, which aims to keep content publicly accessible, Common Crawl collects data for research and analysis purposes, helping to understand the evolution of the web over time.
Another concern raised by experts is the centralization of digital data in the hands of large corporations. Marlene Manoff, senior collection strategist at MIT Libraries, warned about the risks associated with this concentration: “In the long run, it’s impossible to preserve a digital object in its original form. But when it comes to corporate ownership, the likelihood of responsible and lasting management of digital content decreases,” she said in an interview with Business Insider.
Companies like Google and Meta hold immense volumes of data but do not always prioritize the preservation of historical content. Algorithm changes, the removal of older content, and storage policies can contribute to the gradual disappearance of relevant information. Furthermore, commercial or political decisions can lead to the deletion of important data, affecting public access to information.
The massive loss of online content also raises concerns about preserving collective memory. The internet documents historical events, social movements, cultural debates, and scientific advancements. When this information disappears, part of humanity’s recent history is erased.
Moreover, the phenomenon of “link rot”—when links become inactive over time—directly impacts the reliability of academic, journalistic, and legal content. Without efficient preservation mechanisms, we risk creating significant gaps in the digital historical record.
The fact that 38% of the web pages that existed in 2013 disappeared within a decade highlights the urgent need to develop effective digital preservation strategies. Without initiatives like the Wayback Machine and Common Crawl, much of humanity’s recent history would be at risk of being lost.
Even though these projects face technical and legal limitations, they ensure that important fragments of the digital past remain accessible. Looking ahead, raising awareness about the fragility of the online environment and the need for preservation measures is essential to ensure that digital memory does not become merely a corrupted file or a broken link.
The challenge is complex and requires collaboration among governments, corporations, non-profit organizations, and civil society. Only through joint efforts will it be possible to preserve the digital legacy for future generations.
The disappearance of more than a third of 2013’s web pages within a decade serves as a warning about the ephemerality of the digital world. In an era where knowledge and history are increasingly stored online, ensuring the preservation of this data is essential to protect humanity’s collective memory.
Without robust archiving initiatives and policies promoting the durability of digital content, we risk losing valuable records of our culture, science, and society. The internet, which should be an eternal repository of information, proves to be fragile and vulnerable to oblivion.
It is imperative that companies, governments, and civil society organizations collaborate to develop solutions that ensure the continuity of digital knowledge. Preserving the online past is not just a technical issue but an ethical commitment to future generations.