The Internet Is Failing The Website Preservation Test

Gork@beehaw.org · 1 year ago

The Internet Is Failing The Website Preservation Test

thejml@lemm.ee · 1 year ago

It’s important here to think about a few large issues with this data.

First Data Storage. Other people in here are talking about decentralizing and creating fully redundant arrays so multiple copies are always online and can be easily migrated from one storage tech to the next. There’s a lot of work here not just in getting all the data, but making sure it continues to move forward as we develop new technologies and new storage techniques. This won’t be a cheap endeavor, but it’s one we should try to keep up with. Hard drives die, bit rot happens. Even off, a spinning drive will fail, as will an SSD with time. CD’s I’ve written 15+ years ago aren’t 100% readable.

Second, there’s data organization. How can you find what you want later when all you have are images of systems, backups of databases, static flat files of websites? A lot of sites now require JavaScript and other browser operations to be able to view/use the site. You’ll just have a flat file with a bunch of rendered HTML, can you really still find the one you want? Search boxes wont work, API calls will fail without the real site up and running. Databases have to be restored to be queried and if they’re relational, who will know how to connect those dots?

Third, formats. Sort of like the previous, but what happens when JPG is deprecated in favor of something better? Can you currently open up that file you wrote in 1985? Will there still be a program available to decode it? We’ll have to back those up as well… along with the OSes that they run on. And if there’s no processors left that can run on, we’ll need emulators. Obviously standards are great here, we may not forget how to read a PCX or GIF or JPG file for a while, but more niche things will definitely fall by the wayside.

Fourth, Timescale. Can we keep this stuff for 50 yrs? 100 yrs? 1000 yrs? What happens when our great*30-grand-children want to find this info. We regularly find things from a few thousand years ago here on earth with archeological digsites and such. There’s a difference between backing something up for use in a few months, and for use in a few years, what about a few hundred or thousand? Data storage will be vastly different, as will processors and displays and such. … Or what happens in a Horizon Zero Dawn scenario where all the secrets are locked up in a vault of technology left to rot that no one knows how to use because we’ve nuked ourselves into regression.

digitallyfree@kbin.social · edit-2 1 year ago

lloram239@feddit.de · edit-2 1 year ago

Ultimately this is a problem that’s never going away until we replace URLs. The HTTP approach to find documents by URL, i.e. server/path, is fundamentally brittle. Doesn’t matter how careful you are, doesn’t matter how much best practice you follow, that URL is going to be dead in a few years. The problem is made worse by DNS, which in turn makes URLs expensive and expire.

There are approaches like IPFS, which uses content-based addressing (i.e. fancy file hashes), but that’s note enough either, as it provide no good way to update a resource.

The best™ solution would be some kind of global blockchain thing that keeps record of what people publish, giving each document a unique id, hash, and some way to update that resource in a non-destructive way (i.e. the version history is preserved). Hosting itself would still need to be done by other parties, but a global log file that lists out all the stuff humans have published would make it much easier and reliable to mirror it.

The end result should be “Internet as globally distributed immutable data structure”.

Bit frustrating that this whole problem isn’t getting the attention it deserves. And that even relatively new projects like the Fediverse aren’t putting in the extra effort to at least address it locally.

Lucien@beehaw.org · edit-2 1 year ago

I don’t think this will ever happen. The web is more than a network of changing documents. It’s a network of portals into systems which change state based on who is looking at them and what they do.

In order for something like this to work, you’d need to determine what the “official” view of any given document is, but the reality is that most documents are generated on the spot from many sources of data. And they aren’t just generated on the spot, they’re Turing complete documents which change themselves over time.

It’s a bit of a quantum problem - you can’t perfectly store a document while also allowing it to change, and the change in many cases is what gives it value.

Snapshots, distributed storage, and change feeds only work for static documents. Archive.org does this, and while you could probably improve the fidelity or efficiency, you won’t be able to change the underlying nature of what it is storing.

If all of reddit were deleted, it would definitely be useful to have a publically archived snapshot of Reddit. Doing so is definitely possible, particularly if they decide to cooperate with archival efforts. On the other hand, you can’t preserve all of the value by simply making a snapshot of the static content available.

All that said, if we limit ourselves to static documents, you still need to convince everyone to take part. That takes time and money away from productive pursuits such as actually creating content, to solve something which honestly doesn’t matter to the creator. It’s a solution to a problem which solely affects people accessing information after those who created it are no longer in a position to care about said information, with deep tradeoffs in efficiency, accessibility, and cost at the time of creation. You’d never get enough people to agree to it that it would make a difference.

LewsTherinTelescope@beehaw.org · edit-2 1 year ago

Inability to edit or delete anything also fundamentally has a lot of problems on its own. Accidentally post a picture with a piece of mail in the background and catch it a second after sending? Too late, anyone who looks now has your home address. Child shares too much online and parent wants to undo that? No can do, it’s there forever now. Post a link and later learn it was misinformation and want to take it down? Sucks to be you, or anyone else that sees it. Your ex post revenge porn? Just gotta live with it for the rest of time.

There’s always a risk of that when posting anything online, but that doesn’t mean systems should be designed to lean into that by default.

kool_newt@beehaw.org · 1 year ago

Capitalism has no interest in preservation except where it is profitable. Thinking about the long-term future, archaeologist’s success and acting on it is not profitiable.

FuckFashMods@lib.lgbt · 1 year ago

Its not just capitalism lol

Preserving things costs money/resources/time. This happens in a lot of societies.

strainedl0ve@beehaw.org · 1 year ago

This is a very good point and one that is not discussed enough. Archive.org is doing amazing work but there is absolutely not enough of that and they have very limited resources.

The whole internet is extremely ephemeral, more than people realize, and it’s concerning in my opinion. Funny enough, I actually think that federation/decentralization might be the solution. A distributed system to back-up the internet that anyone can contribute storage and bandwidth to might be the only sustainable solution. I wonder.if anyone has thought about it already.

entropicdrift@lemmy.sdf.org · 1 year ago

I’d argue that it can help or hurt to decentralize, depending on how it’s handled. If most sites are caching/backing up data that’s found elsewhere, that’s both good for resilience and for preservation, but if the data in question is centralized by its home server, then instead of backing up one site we’re stuck backing up a thousand, not to mention the potential issues with discovery

The Internet Is Failing The Website Preservation Test

The Internet Is Failing The Website Preservation Test

archive.ph