Integrity is a word that means different things to different people. Generally speaking, integrity means soundness, completeness and incorruptibility. We’re used to hearing this term associated with politicians, or more accurately with their lack of it, but in the world of data and cloud services this single word has a wide range of implications depending on whether you come from an IT, business, legal or digital preservation perspective.
I recently gave a talk on digital preservation at a DIA event that was all about Electronic Document Management for pharmaceutical and medical products. Use of SaaS is just starting to take off in this sector. The session I was in looked at how to keep and use data for the long term, which is a challenge that many businesses face. There was a really interesting mix of presentations on regulatory compliance, legal admissibility and ‘evidential’ weight, and risk-based approaches to digital preservation (my bit).
One of the questions to the panel was ‘what does integrity mean for electronic documents?’ shortly followed by ‘how can it be maintained over say 20 or even 50 years?’. Imagine that you’ve spent 6 years doing a clinical trial for a pharmaceutical drug and have just collected and submitted the results to the regulator, e.g. the FDA or EMA, in the form of an electronic Trial Master File (eTMF). That eTMF contains maybe 100,000 PDF documents and associated material for the trial – and now you need to be able to keep it accessible and readable for decades. At the same time, it needs to be held securely, held in a way that allows you to demonstrate integrity (that word again), held in systems that you can show have been validated, held in a way that gives you a full audit trail.
A pretty daunting task, but actually one that applies to a wide range of content in different sectors. At Arkivum we come across this problem on a daily basis, with examples including gene sequences, social science surveys, records of hazardous material disposal, and even digital works of art. It’s a major challenge, especially where cloud services are being used. Rarely do you have the transparency to ‘see inside’ a cloud service so you can assess what they are doing with your data and even more rarely is there any form of contractual guarantee, certification, or auditable evidence that they are ‘doing the right thing’ – especially in a form that would stand up to scrutiny by a regulator.
But back to integrity for a minute. We discussed at least three levels of integrity at the DIA event.
At the IT level, integrity is often shorthand for ‘data integrity in storage’, which in turn means knowing that the ‘bits’ in a file or object are correct and haven’t changed – i.e. avoiding data corruption or loss. In the case of an eTMF this might mean using a checksum or other form of ‘digital fingerprint’ for each one of the files and then ‘checking the checksums’ to detect any loss of integrity, an activity known as ‘scrubbing’. Sometimes this is handled by the storage system, e.g. through parity in RAID arrays. Sometimes it is handled by filesystems designed to protect integrity, e.g. ZFS. And sometimes it is handled by tagging files with checksums as part of the ‘metadata’ so manual checks can be made when a data transfer or storage operation needs to be confirmed. Sometimes all this happens behind the scenes and sometimes it’s an activity that the user has to do, e.g. a document download might have an MD5 checksum provided along with the file so you can make sure that you’ve downloaded it correctly.
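To make the idea of ‘checking the checksums’ concrete, here is a minimal sketch in Python using only the standard library. The function names `file_checksum` and `scrub`, and the idea of keeping recorded checksums in a simple dictionary, are assumptions made for illustration; they are not how any particular storage system or filesystem does it.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrub(recorded, algorithm="sha256"):
    """Re-check every file against the checksum recorded when it was stored.

    `recorded` is a hypothetical mapping of file path -> hex digest.
    Returns the paths that are missing or whose contents no longer match.
    """
    failures = []
    for path, expected in recorded.items():
        if not Path(path).is_file() or file_checksum(path, algorithm) != expected:
            failures.append(path)
    return failures
```

The same helper covers the download case: calling `file_checksum` with `algorithm="md5"` on the downloaded file and comparing the result with the published value confirms the transfer. A scrub that returns an empty list only says the bits match what was recorded; it says nothing about whether the recorded values were right in the first place.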
But what happens if the file itself has to be changed, e.g. to keep it readable because it was created 20 years ago and the software that understands the original file is no longer available? This could mean changing the format of a Word document, spreadsheet, database, or any other form of proprietary data so that it’s still readable. Checksums don’t help you here because the ‘bits’ are changing. So what do you do if the file format needs to be changed, i.e. converted so it can be read in tomorrow’s software? This is firmly in digital preservation territory and raises questions of what is ‘significant’ about the file that needs to be ‘preserved’ during the conversion. Then the question is how the conversion can be tested and validated – something that’s key to asserting and proving that integrity has been maintained, especially in a regulated environment. The key here is to think in terms of who will need to use the content in the future, and to create a Trusted Digital Repository that collects, stores and manages the content and provides access to it for this future community in order to meet their needs.
It’s also not enough just to have integrity at the ‘file level’; it’s needed for whole collections of documents and data. If just one of those 100,000 PDF documents in an eTMF goes AWOL or is corrupted then the overall integrity is lost. This is about completeness and correctness – nothing missing, nothing added, nothing tampered with, no links broken, no changes to order or structure. If you have some form of description of what should be there, e.g. manifests and document lists, then you at least stand a chance of spotting when something goes missing – or proving to a regulator that it hasn’t. But this requires strict control over this ‘metadata’ as well as the data it describes: if the integrity of the metadata can be compromised then so can the data.
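As a rough illustration of collection-level checks, the sketch below compares a directory tree against a manifest and reports what is missing, unexpected or changed, and also fingerprints the manifest itself so that tampering with the ‘metadata’ can be noticed. All names here (`build_manifest`, `manifest_digest`, `audit_collection`) are hypothetical, the manifest is assumed to be a plain mapping of relative path to SHA-256 digest, and the sketch does not cover links between documents or their order.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, POSIX style) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def manifest_digest(manifest):
    """Fingerprint the manifest itself so changes to the metadata are detectable."""
    lines = "".join(f"{digest}  {name}\n" for name, digest in sorted(manifest.items()))
    return hashlib.sha256(lines.encode("utf-8")).hexdigest()


def audit_collection(root, manifest):
    """Compare a directory tree with its manifest and report every discrepancy."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "changed": sorted(
            name
            for name in set(manifest) & set(current)
            if manifest[name] != current[name]
        ),
    }
```

In this sketch the value returned by `manifest_digest` would need to be kept somewhere the manifest’s editors cannot reach, otherwise both could be altered together and the check would still pass.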
We’ve been tackling these problems for a while at Arkivum and we take integrity in all senses of the word very seriously – data integrity is a contractually guaranteed part of our service. This involves extensive use of checksums, data replication, active integrity monitoring, trained staff, and tightly controlled processes and other measures to guard against a very wide range of risks that could lead to data corruption and loss. Perhaps the most important thing is a chain of custody with our customers – unless we know that we have received data that’s complete and correct then we could end up storing duff data for decades – and that’s no good to anyone. We provide ways for customers to access checksums and confirm that we have the right data – we can digitally sign an audit trail to say this has happened. There is no hiding place when this has occurred – no escape clause – no excuses based on a string of 9s where we could say ‘sorry, statistically you were just the unlucky one’. We use BagIt from the Library of Congress to allow large collections of files to be ‘bagged and tagged’ so any changes in any part of the content or structure can immediately be detected. Customer data is escrowed too so that each customer has a built-in exit plan, which includes an audit trail that shows each ‘bag’ of customer data has made it to escrow with nothing gone missing on the way. It would be easy for me to go on at length about all the infrastructure, people and processes that we use to ensure data integrity. But fundamentally, integrity starts and ends with the customer – it’s their data. We need to be sure that we have been given the right thing before we can guarantee to deliver it back again unchanged. Integrity is tightly bound to chain of custody and that chain links us and our customers together.
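For readers who haven’t met BagIt, the sketch below shows the basic shape of a ‘bag’: a `data/` directory holding the payload, a `bagit.txt` declaration, and a `manifest-sha256.txt` listing a checksum for every payload file. It is a simplified illustration written against the BagIt specification (RFC 8493) using only the Python standard library. It is not Arkivum’s implementation and not the Library of Congress tooling, the function names `make_bag` and `validate_bag` are hypothetical, and it skips optional parts of the format such as tag manifests and the encoding of special characters in file names.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data, then write the declaration and manifest.

    bag_dir must not exist yet; shutil.copytree creates it.
    """
    bag_dir = Path(bag_dir)
    payload = bag_dir / "data"
    shutil.copytree(Path(source_dir), payload)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    lines = [
        f"{_sha256(p)}  {p.relative_to(bag_dir).as_posix()}\n"
        for p in sorted(payload.rglob("*"))
        if p.is_file()
    ]
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def validate_bag(bag_dir):
    """Return a list of problems; an empty list means payload and manifest agree."""
    bag_dir = Path(bag_dir)
    listed = {}
    manifest_text = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest_text.splitlines():
        if line.strip():
            digest, name = line.split(None, 1)
            listed[name] = digest
    on_disk = {
        p.relative_to(bag_dir).as_posix()
        for p in (bag_dir / "data").rglob("*")
        if p.is_file()
    }
    problems = [f"missing: {name}" for name in sorted(set(listed) - on_disk)]
    problems += [f"not in manifest: {name}" for name in sorted(on_disk - set(listed))]
    problems += [
        f"checksum mismatch: {name}"
        for name in sorted(on_disk & set(listed))
        if _sha256(bag_dir / name) != listed[name]
    ]
    return problems
```

Validating on receipt, and again whenever the bag moves, is what ties the format to chain of custody: the sender and receiver can each run the same check and compare results.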
What’s interesting for me is how to help our customers achieve ‘data integrity in the cloud’ – especially in a way that meets compliance requirements so they can sleep at night knowing that they can access and use their data for a long time to come. Provable data integrity is something I hope we’ll see a lot more of in cloud services – and with it new models for chain of custody, transparency and auditability. But maybe there’s a paradox here to solve too. Cloud on the one hand has the virtue that it ‘hides the details’ and ‘takes away the complexity’ – you don’t have to worry about how and where the data is stored – you just have access when you need it. After all, that’s the whole point about ‘services’. But on the other hand, we also need transparency and auditability so we can be sure of integrity – opening up the black box to scrutiny. We need to be able to look under the hood and validate what cloud services are really doing – or as Ronald Reagan put it ‘trust but verify’. Only then do you know where your integrity really is.
Chief Technology Officer for Arkivum
Matthew is CTO of Arkivum and previously worked at the University of Southampton IT Innovation Centre. Over the last decade, Matthew has worked with a wide range of organisations on solving the challenges of long-term data retention and access, including in the life sciences, aerospace, broadcasting and scientific sectors. By working with national archives and industry leaders across the UK and Europe, Matthew has investigated the issues of long-term data archiving, including preservation strategies, system architectures, total cost of ownership and how to mitigate the risk of loss of critical data assets. This resulted in IT Innovation spinning out Arkivum Ltd, which provides data archiving as a service. Matthew currently works on risk-based approaches to digital preservation and data retention, including how to meet regulatory compliance as well as keep data accessible.