Humans have been creating and adapting communication systems for thousands of years – at least, that’s what we know based on fragments of the historical archive that have stood the test of time, preserved often by accident and discovered much later by newer generations that need to piece it together to understand the meanings.
Fast forward to 2018, and humans are much better at communicating. We use social media to document what we do, who we spend time with, even what we eat. We use websites to share information, news and opinions with the four billion other Internet users. It’s the stuff of dreams for future historians, or it would be, if the thousands of tweets per second, hundreds of thousands of Facebook posts per minute, and overall billions of GB of data created online each day weren’t at risk of being lost forever.
Organisations across all sectors now recognise that digital content is a precious asset that can deliver value both in the short term and for decades to come. Without the right planning and foresight, however, they risk relying on obsolete technologies or formats, running into issues with the third-party platforms they publish on, or depending on content management and backups that only provide security in the short or medium term.
It’s already happened with sites such as MySpace, with former users of the once world-leading social media platform losing pages, messages, and photos. Similarly, it is very hard to find older versions of websites or web pages – search engines are designed to show you only the very latest content. Most people assume that content uploaded to the web is safe there, but action needs to be taken to protect these digital communications and make them accessible for future generations.
The legacy and history of 2018 will come from these digital communications, and preserving them requires a digital archive. Storing the data is not enough: it needs to be usable to be valuable, and a digital archive can give you a snapshot of what a website or social media looked like at a certain point in time. It creates a permanent and unalterable record of what any organisation, from governments to drinks brands, has been putting online at any given period. This has huge, positive implications – companies can digitally preserve content of legacy and historical significance, and organisations can demonstrate compliance, which is especially important in regulated sectors.
Cloud technologies are the only way to achieve this efficiently and cost-effectively.
Archive
Traditional archiving on physical hardware still holds a lure for many people, but it is limited in its capabilities. When the data you’re archiving is growing as rapidly as an organisation’s digital communications, hardware could mean a continuous need to invest in new infrastructure to accommodate it. Cloud storage allows new capacity to be added whenever it is needed, thanks to its ability to scale to almost unlimited capacity.
Reliability
Physical hardware such as servers and hard drives can easily be overloaded or fail. By comparison, cloud infrastructure has a higher level of redundancy (supplying duplicate copies of data), which can be utilised in case of problems. If a hard drive, data centre or server fails, there is peace of mind in knowing that normal service can be resumed with as little disruption as possible.
Security
Using an archiving solution that is cloud-native and certified to ISO standards of security means that any organisation can be assured they have full control of who can access their archives, which is a key consideration especially for any personal data held on the archiving platform. It is also incredibly important for public sector organisations that may be holding sensitive information of national importance. The data can be stored in highly secure cloud data centres, protected by strong safeguards, and the scalability of cloud technologies means that the safety is always there no matter the size of the dataset.
Compliance
Regulated companies and firms that need to record and retain all electronic communications under various rules and legislation – such as MiFID II or the recently introduced GDPR – need to be able to prove compliance. Cloud technologies provide the scalability and future-proofing that are necessary to demonstrate that the data has been permanently stored in an unalterable format.
Cost savings
Moving to the cloud cuts down on a lot of the overhead that comes with maintaining your own infrastructure – the physical space and the costs and bills that come with it. This allows the flexibility to focus on improving the archive, whether it’s simple user interface updates or much more advanced capabilities, like transferring large amounts of data for large-scale research projects.
Usability
An archive is worthless if no-one can use it – whether that’s internal staff within an organisation, or students, researchers, historians and anyone else who may wish to access a public-facing resource. Search technologies can be difficult to implement because search engines don’t scan an entire set of documents one by one – they use indexes to return useful results faster.
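To make the point about indexes concrete, here is a minimal sketch of an inverted index, the general structure that lets a search look up a word directly instead of scanning every document. The function names and sample pages are hypothetical and purely illustrative; production search engines add ranking, stemming and much more.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the first word's matches, then narrow with each further word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample pages, for illustration only.
documents = {
    "page-1": "Annual report for 2018",
    "page-2": "Press release about the annual conference",
    "page-3": "Contact details",
}
index = build_index(documents)
matches = search(index, "annual report")
```

Each query touches only the index entries for its own words, which is why lookups stay fast as the archive grows.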
Web archives consist of website data stored in the WARC file format. Playback of these files requires an index, which is essentially a list of all the assets within the web archive, including HTML data or PDFs. Building one for large archives can be challenging, sometimes with billions of very small items to index, but content stored in a flexible cloud environment is much simpler to process. Cloud-based search technology can also scale up or down depending on demand, and it can help improve the quality of the data, for example by deduplicating pages. Using a mixture of this and its own technology, MirrorWeb managed to process 1.4 billion documents for The UK National Archives in just 10 hours.
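As a rough illustration of what such an index and deduplication step might look like, the sketch below lists each captured asset and flags any whose content is identical to an earlier capture. It is an assumption-laden example, not MirrorWeb's method: the record layout (`url`, `timestamp`, `content_type`, `payload`) and the function name are hypothetical, and the records are plain dictionaries rather than entries parsed from a real WARC file.

```python
import hashlib


def build_archive_index(records):
    """Return (index, duplicate_count) for a sequence of captured records.

    Each record is a hypothetical dict with 'url', 'timestamp',
    'content_type' and 'payload' (bytes) keys.
    """
    index = []
    first_seen = {}  # content digest -> position of the first capture
    duplicate_count = 0
    for position, record in enumerate(records):
        digest = hashlib.sha256(record["payload"]).hexdigest()
        duplicate_of = first_seen.get(digest)
        if duplicate_of is None:
            first_seen[digest] = position
        else:
            duplicate_count += 1
        index.append({
            "url": record["url"],
            "timestamp": record["timestamp"],
            "content_type": record["content_type"],
            "digest": digest,
            "duplicate_of": duplicate_of,  # None for the first capture
        })
    return index, duplicate_count


# Hypothetical captures: the same page fetched twice, plus one PDF.
records = [
    {"url": "http://example.org/", "timestamp": "20180101000000",
     "content_type": "text/html", "payload": b"<html>Home</html>"},
    {"url": "http://example.org/report.pdf", "timestamp": "20180101000005",
     "content_type": "application/pdf", "payload": b"%PDF-1.4 sample"},
    {"url": "http://example.org/", "timestamp": "20180201000000",
     "content_type": "text/html", "payload": b"<html>Home</html>"},
]
archive_index, duplicate_count = build_archive_index(records)
```

Because each record is hashed independently, work of this kind can be split across many machines, which is where an elastic cloud environment helps.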
If companies haven’t been archiving their websites since the dawn of the Internet, then there is likely already a 20-plus-year black hole – a huge void that will never be filled. Without those archives, firms will be unable to look back at the contents of their first website or their digital presence when social media first took off.
In 2018, the world is collectively estimated to spend one billion years’ worth of time online. To avoid losing valuable pieces of history, organisations must wake up to the urgent need for digital archiving, and the multiple benefits of doing so on the cloud.
Philip Clegg is Chief Technical Officer at cloud-native digital communications archiving company MirrorWeb, which he formed along with co-founders David Clee and Karl Stringer. He studied Electronic and Microelectronic Systems Engineering, Electronics and Computing before moving into a variety of IT and computing consultant roles. Since the creation of MirrorWeb in 2012, Philip has been instrumental in the archiving of many organisations from the public sector to financial services, including The National Archives’ gigantic 120TB web archive, which was transferred to and preserved on the cloud in less than two weeks.