The weather in Boston is beginning to warm. The Red Sox have opened their season and the Boston Marathon was run earlier this week. No doubt the crew teams are rowing in the Charles and the Swan Boats will soon be paddling across the pond in the Public Garden. Although spring is the season of renewal, this year it signals the end of the 2015-16 NDSR Boston projects.
For this blog post, I thought I would update you on the progress I have made on my project so far. To refresh your memory, my project is developing and implementing a digital preservation plan for the University Archives and Special Collections at UMass Boston using Archivematica and DuraCloud. The collection I am working with is the Mass. Memories Road Show (MMRS). I began the project researching digital preservation standards and best practices while familiarizing myself with the digitization and digital asset management practices in use at UMass Boston. I have been busy lately adjusting existing practices for processing digital collections and developing preservation workflows related to the use of Archivematica.
Performing a gap analysis early in the project was an important step. By comparing the existing practices at UMass Boston with digital preservation standards, guidelines and best practices, I identified the greatest areas of need as preparing the collection for ingest and implementing archival storage. To address these broad areas of need, it would be necessary to incorporate the following tasks into the digital preservation workflow.
- Generate checksums
- Screen for duplicate and unwanted files
- Create/assign unique IDs to files
- Store files in multiple locations
- Include descriptive metadata in archival storage
- Create/manage administrative, technical and preservation metadata
Archivematica addresses several of these issues. Other needs are being met by making adjustments to the existing practices.
I discovered that the first two tasks identified by the gap analysis were related. Generating checksums is an important task because it protects the authenticity and data integrity of the collection. Checksums, created by applying a cryptographic algorithm to the file, produces a unique alphanumeric code for each file that acts like a digital finger print. Periodically verifying that checksums have not changed provides evidence that the file has not been modified or damaged over time. Since the objects in the Mass. Memories Road Show collection are copied between hard drives several times and uploaded to the cloud, it is necessary to have a way to verify that each file has retained its original bit stream.
Checksums also played an important role in helping to identify and remove duplicate files. The existing file processing workflow had resulted in the accumulation of numerous duplicate video files. Duplicate files have identical checksums. A checksum tool called The HashMyFiles generates and compares checksums, and identifies when two or more files are identical. Using this tool, 3,500 video files occupying about 200GB of space were removed from the collection, saving critical processing time and storage capacity.
Other modifications being made to the file processing workflow involve adopting a file copying tool and standard terminology, adjusting the file naming conventions and digitizing registration forms. Usually a file’s creation date is overwritten when the file is copied. A tool called TeraCopy has been adopted to copy files because it retains the original creation date. Standard terminology has been adopted as well. Digital files that were previously categorized as “originals” and “edited masters” are now identified as “preservation masters” and “production masters.” Since preservation master files and production master files often share identical file names, suffixes have been added to the file naming convention to differentiate the two types. Preservation masters are now identified by an “.f0” suffix while production masters are labeled with an “.f1” suffix. Lastly, registration forms, which give UMass Boston consent to use the digital files in the collection, will now be digitized and uploaded to archival storage with the files they represent, providing additional intellectual control.
Archivematica specific adjustments are also being made to the workflow that will protect the collection’s data integrity, manage metadata and assign unique identifiers to the collection and to the files. An additional checksum file and a text file with descriptive metadata will be created and uploaded to Archivematica with each submission. Archivematica uses the checksum file to verify files are not damaged during the upload to the Archivematica servers. Archivematica parses the descriptive metadata file into a METS file allowing the descriptive metadata to be stored with the collection in DuraCloud. The normal Archivematica processing extracts technical metadata from the files, generates additional administrative and preservation metadata, and creates and assigns universally unique identifiers (UUIDs) to the objects in the submission. The metadata is all saved into the previously mentioned METS file, satisfying all digital preservation best practices for metadata. The UUIDs are saved to a text file which will be downloaded and imported into the digital asset management system ensuring that the identifiers created during archival storage are associated with the access copies.
So, a lot of progress has been made thus far. There are still a few decisions to make and a little more testing left to do before the entire collection can be uploaded to the cloud, processed through Archivematica and deposited in DuraCloud. The final tasks will be to finish documenting the new procedures and training the archives staff to use the new digital preservation tools and Archivematica.
The Boston Marathon is an appropriate metaphor for the NDSR project. There is a lot of anticipation and a “feeling out” process in the beginning. This is followed by a period where you settle in to a steady and comfortable pace. Along the way, you encounter and overcome challenges. At this point, you have made it over Heartbreak Hill. Next, Boylston Street and the Finish Line come into view and there is a hectic push to the end. After crossing the Finish Line, there will be the satisfaction and sense of accomplishment that comes with the successful completion of the end of the project. Maybe the traditional meal of a big bowl of pasta will be my reward.
Thanks for reading, Jeff