Recently I’ve been wrapped up in finding and reading documentation for our systems at the John F. Kennedy Presidential Library. Before I begin writing my own policies, I need to know what documentation is out there. One challenge I’ve faced in collecting digital preservation documentation is the secrecy surrounding proprietary software and hardware. Sometimes the barrier is poorly shared or outdated documentation rather than intentional secrecy. This post will go over some of the challenges I’ve faced in addressing file fixity with proprietary tools. To start off, here is a diagram of the tools used for management and storage in our digital archive:
The JFK Library uses a digital asset management system (DAMS) called Documentum, created by EMC. Documentum is mainly used by businesses for managing records and internal documents. Although it was not created for digital archives, it has some clear advantages for managing a digital repository as large as ours. We currently store over 70 terabytes and that number is growing as we continue to digitize with the goal of making all the presidential papers available online. Documentum has excelled at making the digital archive accessible to novice JFK enthusiasts and experienced researchers alike.
However, a system built for improved access may not be an ideal solution for digital preservation. The storage system, Centera, uses a tool called MD5scrubber to ensure file integrity. The tool creates an MD5 checksum for each blob. Right now you might be thinking “Blob? Does she mean file or folder?” No…I mean blob. Centera stores the bit sequence for each digital file as a blob. Blob stands for ‘Binary Large Object’ (although depending on who you believe, that acronym may have been invented after the fact because people felt unprofessional saying ‘Blob’ all the time). The MD5scrubber regularly validates the MD5 checksum for each blob, and if the checksum has changed, the blob is moved for review by EMC. If the file has been corrupted, it is then recreated from the mirror copy at Iron Mountain. Creating and automatically validating checksums is important for verifying that a digital object has not been altered or corrupted.
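To make the idea of creating and validating a checksum concrete, here is a minimal Python sketch. It is not how MD5scrubber works internally (that is exactly the part I can’t see); it only shows the general technique using Python’s standard `hashlib` module. The function names `md5_of_file` and `is_intact` are my own, made up for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a hex string.

    The file is read in chunks so that very large objects do not
    have to fit in memory all at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, recorded_md5):
    """Compare a freshly computed checksum with the recorded one.

    Hypothetical helper: returns True when the bits on disk still
    match the checksum that was recorded earlier.
    """
    return md5_of_file(path) == recorded_md5.strip().lower()
```

An archivist could record `md5_of_file(path)` at ingest and later call `is_intact(path, recorded_value)` whenever a manual check seems necessary. A changed checksum means the bits have changed; it does not say why.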
What worries me about this tool is that there is no input from the staff here at the Library. Archivists are not even alerted if a file is corrupted. EMC staff maintains a seamless experience, despite file corruption and re-creation. This is great for users who want immediate access to the digital archives, but as an archivist… I have trust issues. I want our archivists to know how and when checksums are created, stored, and validated. I also want our archivists to be able to manually validate a checksum when they believe it necessary. Since Centera controls the checksums behind the scenes, our staff can’t view or manage file fixity information through Documentum. We also can’t confirm that the fixity information is stored separately from the digital object, a practice recommended by TRAC to protect against malicious attacks.
Until recently the staff here at the JFK Library was unsure whether Centera or Documentum was managing file fixity at all. One archivist began creating checksums and storing the MD5 value in the ‘technical description’ field, which also stores various information about digitization. This will allow future archivists to manually validate file fixity, but it cannot be automatically checked since the MD5 is mixed with other information instead of parsed into a unique field. These checksums are stored with the digital object, so they are also vulnerable to attack.
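For contrast, here is a rough Python sketch of what a dedicated, separately stored fixity record could look like. Everything in it is an assumption for illustration: the JSON manifest layout, the `md5` and `recorded` field names, and the functions `record_fixity` and `audit` are hypothetical and have nothing to do with Documentum or Centera. The point is only that a checksum kept in its own field, in a file that lives apart from the digital objects, can be checked automatically.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(manifest_path, object_path):
    """Add or update one object's checksum in a separate manifest file."""
    manifest_path = Path(manifest_path)
    manifest = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    manifest[str(object_path)] = {
        "md5": md5_of_file(object_path),  # checksum in its own field
        "recorded": datetime.now(timezone.utc).isoformat(),
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))


def audit(manifest_path):
    """Return the paths whose current checksum no longer matches.

    Objects that have gone missing are reported as failures too.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    failures = []
    for object_path, entry in manifest.items():
        target = Path(object_path)
        if not target.exists() or md5_of_file(target) != entry["md5"]:
            failures.append(object_path)
    return failures
```

Keeping the manifest on different storage from the objects is the design choice that matters here: someone who alters an object would also have to alter the manifest to hide the change.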
Here is where I explain my victory over the proprietary beast and describe the perfect system for file fixity…right?
Sorry, but I’m not there yet. Just understanding the current situation feels like a victory at this point. However, I do have some thoughts on a possible solution. Introducing a third storage location, true digital preservation storage, could ease my worries about fixity and allow EMC to continue the great work of providing seamless access to our patrons.
If the preservation copies and associated metadata were sent to preservation storage, our archivists could maintain control over digital holdings without affecting our users’ experience. I’m still in the research phase of the project, so I haven’t determined a specific solution yet. Don’t read too much into the cloud shape used in the diagram; cloud storage is just one option I’m looking into. A third storage location would improve on our existing digital preservation program and allow archivists a higher level of authority over the digital archives.