The cloud has promised us many things that will apparently change the way we do our computing forever. But will cloud-based archive ever be a real thing?
The idea of cloud computing, defined as doing work using computers that could be anywhere, is a good one. If the situation evolves as it might, we should be able to solve a lot of the performance problems in modern computing by filling warehouses with an increasing number of identical servers. People who make workstations might find this a bit uncomfortable, and it's certainly a way off, but, at the moment, a workstation is a weighty, fixed box of electronics that sits there, going out of date. It's a lot more efficient to supply computing power as a commodity with near-zero shipping costs.
One of the jobs the cloud is being particularly proposed for at the moment is storage, and even archive, which instinctively seems a pretty reasonable idea. Modern cloud storage is just a rebranding of the sort of thing offered by generic file binning websites dating all the way back to the nineties. Perhaps the best-known (and perhaps most notorious) example is Megaupload. Shut down in 2012 after a series of legal interventions, Megaupload was suspected of being widely used to facilitate piracy, but regardless of the use to which it was put, the service represented cloud storage in action.
The concept is completely valid. If there's a need for caution, it's that it isn't quite clear how well the cloud works as a way to create long-term, stable, reliable archives. Using the cloud as a way to maintain a second copy of something is very sensible, since it represents a form of offsite storage that can survive, for instance, a fire taking out an entire building. That's a great idea. The problems start when we begin comparing cloud storage to things like tape as if they're technological alternatives, as has been happening of late.
One thing to be aware of is that almost all cloud storage, and particularly the type of storage that's posited as a backup solution, is implemented as object storage, and this is often cited as a reason it's good for archiving. In the conventional storage setup we're all used to, data is divided into files and held in a hierarchical tree structure of directories. Directories can be renamed, files can be deleted and appended to, and so on. The ability to do all these things makes administration of the disk difficult: gaps left by deleted files must be filled in, data appended to files must be put in any available spare space and then linked to the existing file data somehow, and so on.
With object storage, there are fewer features. A file is often turned into a series of blocks of data and stored on disk alongside other, similar blocks. Deleting a block frees up space which can be filled with another block, but there's often no way to append to files and no real concept of directories. Because of this simplicity, it's easier to manage files, add more storage space and administer the system. The upside is that object storage is often much, much cheaper because those limitations make it much less labour-intensive to operate.
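To make the contrast concrete, here is a minimal sketch of the object storage model in Python. It is a toy, not any provider's actual interface: the class name `ObjectStore`, its methods and the example keys are all made up for illustration, and it keeps whole objects in memory rather than splitting them into blocks on disk.

```python
class ObjectStore:
    """Toy flat object store (hypothetical): whole objects, no directories, no append."""

    def __init__(self):
        self._objects = {}  # key -> bytes, one flat namespace

    def put(self, key, data):
        # A write always replaces the whole object; there is no partial update.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        # Deleting just frees the slot; nothing else has to be relinked.
        del self._objects[key]

    def list(self, prefix=""):
        # "Directories" are only a naming convention: keys that share a prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ObjectStore()
store.put("projects/film/reel1.mov", b"frame data")
store.put("projects/film/reel2.mov", b"more frame data")
store.put("notes.txt", b"hello")

film_keys = store.list("projects/film/")
print(film_keys)
```

Notice what is missing: there is no rename, no append and no real directory tree. That small set of operations is the simplicity that makes a system like this easier to run at scale.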
One thing object storage isn't particularly intended to do is create archival reliability, and claims that object storage, without any elaboration, is intrinsically suitable for archival work are tenuous at best. It's still just data on a disk, and the disk can fail. Object storage certainly can, and almost invariably does, involve RAID-like protection, even across machines, but protection measures like this will happily preserve user errors just as faithfully as important data. RAID is not backup and object storage does not make RAID into backup. It just makes it into a more flexible, easier-to-manage RAID.
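The difference between redundancy and backup can be shown with a short Python sketch. Everything here is an assumption made for illustration: `ReplicatedStore` is a hypothetical stand-in for RAID-like mirroring, and the `backup` dictionary stands in for a separate, point-in-time copy such as a tape.

```python
class ReplicatedStore:
    """Toy mirrored store (hypothetical): every change goes to all live replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]

    def _live(self):
        # Replicas that have not failed.
        return [r for r in self.replicas if r is not None]

    def put(self, key, data):
        for replica in self._live():
            replica[key] = data

    def get(self, key):
        for replica in self._live():
            if key in replica:
                return replica[key]
        raise KeyError(key)

    def fail_replica(self, index):
        # Simulate losing one disk or one machine.
        self.replicas[index] = None


store = ReplicatedStore()
store.put("master.mov", b"original footage")

# A separate point-in-time copy, taken before anything goes wrong.
backup = {"master.mov": store.get("master.mov")}

# Hardware failure: the mirrors do their job and the data survives.
store.fail_replica(0)
after_disk_failure = store.get("master.mov")

# User error: the file is overwritten with nothing, on every live replica.
store.put("master.mov", b"")
after_user_error = store.get("master.mov")

print(after_disk_failure, after_user_error, backup["master.mov"])
```

The mirrors shrug off the failed replica, but they copy the mistaken overwrite just as diligently. Only the separate copy still holds the original footage.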
Given all this, the use of object storage has very little impact either way on the suitability of cloud storage for critical archiving. Users of cloud storage are dependent on whatever reliability mechanism the cloud storage provider has in place. Often, that'll mean multiple redundant copies being held at various physically different sites, which is a pretty good reliability procedure, though still, really, just a macro-scaled RAID. In the immediate term, a user would still be at risk of an internet connection falling over, so while the existence of the data is reasonably well assured, its availability is less certain.
So, while the cloud might be a great place to hold a second copy of important data, it's not the same as dumping that data to an LTO tape, or better yet two LTO tapes, and placing one of them in a disused salt mine with all the others. The two technologies don't try to solve the same problems, and comparing them is not a very useful way to evaluate the capabilities of either.
Image: Shutterstock