Week 3 is already coming to a close, and like Tricia I am swimming in a sea of information – some tried-and-true best practices in digital preservation and some new approaches toward making it actually work. At Harvard Library, my host institution for the NDSR, we are grappling with formats migration within Harvard’s Digital Repository Service (DRS to us acronym-inclined archivists), understanding that though migration has been oft-discussed in the field, few sustainable solutions have emerged for ensuring long-term access to digital objects across formats.
To back up for a moment, migration is the process of updating analog and digital objects to keep up with the ever-changing technological landscape, knowing that though the objects themselves might not deteriorate, the means and technologies for viewing and experiencing them often do. Migration has been a popular preservation action within libraries and archives for quite some time, and digital migration (from one digital format to another) has been met with hand-wringing for several decades. Many studies have noted the loss of significant properties across iterative migrations – defining features of the original format such as color space, fonts, timing, and interactivity, to name a few. Without going into too much detail on other access strategies such as emulation (recreating the original environment and external dependencies for the object), there are ongoing debates over which properties of an object should be the focus of preservation. However, everyone can agree that each format comes with its own special challenges and that there is no monolithic way to preserve everything with a single click of the mouse. While Harvard is looking to institute a broad workflow and framework for file formats migration, my project focuses on how to implement this while principally being concerned with the needs of three specific, now-obsolete formats — Kodak PhotoCD, RealAudio, and SMIL playlists.
Before diving into these three formats, I started my work by flinging myself madly into the immense body of literature around digital migration and how various institutions are carrying it out. Perhaps the first challenge I came up against was knowing when to distinguish theory from practice. Many new solutions have been proposed for any number of identified gaps in the workflow – monitoring obsolescence, creating agnostic containers for unsupported/undocumented formats (e.g. XML-based), implementing tools regardless of how well they fit into a repository workflow. While it was often tempting to discover such a resource and think “Eureka!”, I had to apply a fair degree of skepticism, particularly when sources were a few years old and the tools they described were, as yet, largely unused in any real workflows. Nonetheless, I was able to compile a bibliography and begin to map the core arguments of these resources onto a hypothetical workflow that considers many possible migration strategies based on the specific challenges of each format (I will be finalizing this workflow map throughout October). The part of the workflow that, at this point in the research, seems most in need of refinement is tooling for identifying and validating formats – a significant step in the process, since you must decide whether a file is indeed what it says it is. Identifying and validating a file empowers the digital steward to begin checking off which significant properties will be the focus of preservation for that specific format and to determine the best tools and services (that means people too!) for doing the job. For example, jpylyzer is a tool for validating JPEG 2000 images, ensuring that the compression algorithms, header info, color space, etc., all comply with the standard.
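To make that validation step concrete, here is a minimal sketch of how a workflow script might reduce a jpylyzer-style report to a pass/fail decision. The XML below is a simplified, illustrative stand-in for real jpylyzer output (actual reports are namespaced and far more detailed), and the file name is hypothetical:

```python
# Sketch: deciding pass/fail from a jpylyzer-style validation report.
# The sample XML is a simplified stand-in for real jpylyzer output.
import xml.etree.ElementTree as ET

SAMPLE_REPORT = """\
<jpylyzer>
  <file>
    <fileInfo><fileName>page_001.jp2</fileName></fileInfo>
    <isValid format="jp2">True</isValid>
  </file>
</jpylyzer>
"""

def is_valid(report_xml: str) -> bool:
    """Return True if the report marks the file as valid."""
    root = ET.fromstring(report_xml)
    node = root.find(".//isValid")
    return node is not None and node.text.strip() == "True"

print(is_valid(SAMPLE_REPORT))  # prints True for this sample
```

A file that passes this gate would continue down the workflow; one that fails would be flagged for manual review before any migration is attempted.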
Once a file has been successfully validated, it can be passed to the next step in the workflow (determined by the format’s requirements) to ensure that significant properties are taken into account – for example, performing a post-migration comparison against the original through QA tools (e.g. ImageMagick).
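One way to script such a comparison is to shell out to ImageMagick’s `compare` utility, which writes its metric to stderr. The sketch below builds and runs that command; the `AE` metric counts differing pixels, the paths are hypothetical, and on ImageMagick 7 the command may need to be invoked as `magick compare` instead:

```python
# Sketch of a post-migration QA step using ImageMagick's `compare`.
# Paths are hypothetical; ImageMagick must be installed to run pixel_diff.
import subprocess

def build_compare_cmd(original: str, migrated: str, metric: str = "AE"):
    # `null:` discards the visual diff image; we only want the metric.
    return ["compare", "-metric", metric, original, migrated, "null:"]

def pixel_diff(original: str, migrated: str) -> float:
    """Run ImageMagick compare; the metric arrives on stderr."""
    proc = subprocess.run(
        build_compare_cmd(original, migrated),
        capture_output=True, text=True,
    )
    return float(proc.stderr.strip().split()[0])
```

A result of `0.0` with the `AE` metric would suggest a pixel-identical migration; anything else warrants a closer look at which significant properties shifted.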
While tools for identifying (DROID) and validating (FITS does both) exist for these purposes, they only cover a finite number of formats, which are generally already well-defined and documented. Herein lies the problem…what happens to all those rare, obsolete “orphaned” formats? Beyond that, determining how much of this process – choosing and running tools for each format – should be manual versus automated is also a major consideration in workflow design. Given the mass of material within a digital repository, a trustworthy tool for automating this process, such as Plato or Taverna, is desirable. However, more research will need to be conducted to account for the existing architecture of the DRS and the stakeholders involved throughout the process (Tricia’s example below from MIT diagrams this nicely). This just goes to show that every institution is at a different place in solidifying these workflows, and there is not necessarily one model institution that has everything figured out (this is, of course, an ongoing process, and no two institutions function – or collect – alike).
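To illustrate what the identification step hands off to the rest of the workflow, here is a toy parse of a FITS-style identification block. The XML is a simplified stand-in for real FITS output (which is namespaced and records each contributing tool’s verdict), and the format values shown are just examples:

```python
# Illustrative sketch: pulling format identity out of a FITS-style
# report. Real FITS output is namespaced and much more detailed.
import xml.etree.ElementTree as ET

SAMPLE = """\
<fits>
  <identification>
    <identity format="RealAudio" mimetype="audio/x-realaudio">
      <tool toolname="DROID" toolversion="6.1"/>
    </identity>
  </identification>
</fits>
"""

def identify(report_xml: str):
    """Return (format name, MIME type) from the first identity found."""
    root = ET.fromstring(report_xml)
    ident = root.find(".//identity")
    if ident is None:
        return None  # the "orphaned format" case: no tool recognized it
    return (ident.get("format"), ident.get("mimetype"))

print(identify(SAMPLE))  # prints ('RealAudio', 'audio/x-realaudio')
```

The `None` branch is exactly the orphaned-format problem described above: when no tool can identify the file, the workflow has to fall back on human judgment.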
All these points are merely considerations at this stage as we look at what solutions could be applied for a more transparent and streamlined migration process. Next steps are delving deeply into the innards of the DRS and crystallizing how the various administrative pockets inform preservation at Harvard. As my initial research made clear, tools that solve one part of the problem are great, but that doesn’t always guarantee they will work with existing systems and processes. Sometimes the discovery of new solutions brings up new problems, but that’s what research is all about!