Migration – “Get the ball rolling” but don’t “recreate the wheel!” – and other bad spherical adages

Digital Preservation folks talk a lot about how to turn projects into programs: creating broader, scalable, and sustainable processes for approaching a common issue. This assumes that before a monolithic workflow can be developed for a multitude of scenarios, collections, and formats, you first need some honest-to-goodness cases to work against. This desire to avoid “recreating the wheel” is surely not a sign of laziness but rather shows an understanding that good digital preservation practices aim to leverage as much material as possible while still showing a little love to each object and its unique qualities. In some cases many iterations of a workflow will emerge to suit the varying degrees of involvement within the process, falling somewhere between a broad framework and (in Harvard’s case for format migration projects) format policies with technical details that border on ad nauseam. Perhaps it’s the masochist in me that truly loves this part of the process!

As mentioned in an earlier post, at Harvard we are developing a formats migration framework which will be tested against projects (and resulting workflows) for converting Kodak Photo CD, RealAudio, and SMIL playlists. In preparing for these projects I did three things: 1) researched migration projects at other institutions and literature on how migration exists in tandem with developments in digital preservation praxis (the bibliography for this research can be found here); 2) interviewed Harvard Library staff in Preservation Planning, Digital Imaging, Media Preservation, Metadata, and Library Technology Services to gauge the existing terrain and library architecture at Harvard; and 3) performed initial research on the three formats in question so as to ensure that format oddities are still considered within a broad framework. After gaining an acceptable (albeit somewhat scattershot) education on identifying needs in order to conduct a migration project, I sat down with Andrea Goethals (Manager of Digital Preservation and Repository Services at Harvard) to take a crack at identifying the necessary steps to be considered in this broad migration framework. It looked something like this:

[Image: HUL_FM_MigrationFramework]

18 steps – not bad, right? Of course, many of these steps speak somewhat specifically to the Harvard ecosystem, though they could feasibly apply to other institutions, particularly those as large as Harvard. While the steps do go into some detail about what needs to be defined and at what point in the process, they still leave room for defining how a format’s specific attributes may affect that overall process. For example, in defining stakeholders, Kodak PhotoCD will involve consultation with staff in Imaging Services whereas RealAudio and SMIL will involve folks in Media Preservation. Though, let’s not get too far ahead of ourselves (at least not yet).

For a number of reasons (though mostly based on comparative ease) we decided to first move ahead on creating a plan for Kodak PhotoCD (PCD). We thus went to the top of the list to begin the process: write a summary of why migration is being done. For the most part we already did this, or rather we defined these objects as the focus of a migration project. Nonetheless this will need to be stated in subsequent documents for added context and just for good measure.

The second step was to analyze and describe the content. This meant having to plunge headfirst into the Digital Repository Service (DRS) to mine the data and paint a picture of what PCD looks like at Harvard Library. The DRS is a custom-built repository that sits on top of an underlying Oracle database. It is the main repository for preserving and providing access to digital objects. As can be seen in Harvard’s Library Technology Systems Architecture map, the DRS falls under the Digital Asset Management tent, and the map further shows that it must point to other services in order to provide unique identifiers (NRS), bibliographic description of objects (e.g. Olivia/Shared Shelf for Images), and access (e.g. Via for Images).

Harvard Library Technology Portfolio (2012): http://hul.harvard.edu/ois/images/2012_December_portfolio_1.jpg

For the purposes of defining the objects based on their technical characteristics and relationships to other objects, the DRS would more than suffice. To do this, I needed to learn a bit about SQL queries so that I could generate statistics based on criteria that exist across multiple tables within the DRS. For example, if I wanted to find how many PCD objects were derivatives of other PCD objects (instances where an original PCD would be used to generate a cropped production master in PCD, which would in turn generate deliverables in JPEG) and used RGB color space (rather than the more typical YCC found in PCD), I would have to create a query that pulled data from the “IMAGE_METADATA,” “DRS_OBJECTS,” and “RELATIONSHIP_MAP” tables (again, this structuring is quite Harvard-specific).
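To make that kind of cross-table query a little more concrete, here is a minimal sketch. It is not the actual DRS query: the real DRS sits on Oracle, whereas this uses an in-memory SQLite database as a stand-in, and every column name (`object_id`, `format`, `color_space`, `source_id`, `derivative_id`) and every sample row is an assumption made up for illustration. Only the three table names come from the description above.

```python
import sqlite3

# Stand-in for the DRS: an in-memory SQLite database. The real DRS runs on
# Oracle, and every column name and row below is hypothetical; only the
# three table names are taken from the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DRS_OBJECTS (object_id INTEGER PRIMARY KEY, format TEXT);
CREATE TABLE IMAGE_METADATA (object_id INTEGER, color_space TEXT);
CREATE TABLE RELATIONSHIP_MAP (source_id INTEGER, derivative_id INTEGER);

INSERT INTO DRS_OBJECTS VALUES
    (1, 'PCD'), (2, 'PCD'), (3, 'JPEG'), (4, 'PCD'), (5, 'PCD');
INSERT INTO IMAGE_METADATA VALUES
    (1, 'YCC'), (2, 'RGB'), (3, 'RGB'), (4, 'YCC'), (5, 'YCC');
-- 1 -> 2 (PCD to PCD), 2 -> 3 (PCD to JPEG), 4 -> 5 (PCD to PCD)
INSERT INTO RELATIONSHIP_MAP VALUES (1, 2), (2, 3), (4, 5);
""")

# Count PCD objects that are derivatives of another PCD object and that
# use the RGB color space. DRS_OBJECTS is joined twice: once for the
# source object and once for the derivative.
QUERY = """
SELECT COUNT(DISTINCT child.object_id)
FROM RELATIONSHIP_MAP rel
JOIN DRS_OBJECTS parent ON parent.object_id = rel.source_id
JOIN DRS_OBJECTS child ON child.object_id = rel.derivative_id
JOIN IMAGE_METADATA meta ON meta.object_id = child.object_id
WHERE parent.format = 'PCD'
  AND child.format = 'PCD'
  AND meta.color_space = 'RGB'
"""


def count_rgb_pcd_derivatives(connection):
    """Return the number of RGB PCD objects derived from other PCD objects."""
    return connection.execute(QUERY).fetchone()[0]


rgb_pcd_derivatives = count_rgb_pcd_derivatives(conn)
print(rgb_pcd_derivatives)
```

With the sample rows above, only object 2 qualifies: object 3 is a JPEG deliverable and object 5 is a PCD derivative that still uses YCC. The same join pattern would presumably carry over to Oracle, though the real schema will differ.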

Analysis is still being performed on these formats, so the report is not yet complete and thus not sharable at this point. Nonetheless, the initial analysis has already demonstrated that there will in fact need to be at least two parallel migration projects. To explain why, recall the hypothetical query above for finding the number of times where PCD objects are derivatives of other PCDs.

PCD was one of the first affordable ways to have slides and photographs scanned in an archival manner. It most notably used YCC color space in order to efficiently and losslessly encode color information. Ultimately its unique form of compression and color mapping are what led to it being far from archival (in terms of long-term sustainability). In the late 1990s and early 2000s Harvard engaged with an external vendor to have their objects converted into PCD format. The PCDs were taken into Digital Imaging Services, and production copies were rendered in RGB using Luna Imaging (which at that point in time supported PhotoCD). These production masters were created in consultation with the curators, who gave the necessary approval to crop the images to the desired size for creating deliverables and providing them through the web access system (the raw archival files included reference color swatches for quality assurance and were thus not desirable as a final deliverable). At a later point in time Digital Imaging moved away from PCD as a production format (perhaps sensing the unsustainability of the format) and generated production masters as TIFFs.

Thus, in creating a migration workflow we will need to consider the distinction between migrating “archival” and “production” masters. We will want to convert the archival masters as losslessly as possible, given their use of the Image Pac compression and YCC color space, though we’ll want to use the PCD production masters to generate new production masters (perhaps in TIFF) so as to not lose the cropping information. This is just one of the extra considerations that will be involved in the format-specific plan which otherwise would not be appropriate in the broader framework.
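The archival/production distinction can be sketched as a small routing rule. This is only an illustration of the reasoning, not Harvard's actual workflow: the `MasterFile` class, its field names, the role labels, and the wording of each track are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MasterFile:
    """Hypothetical record for one master file; not a real DRS structure."""
    object_id: str
    file_format: str  # e.g. "PCD" or "TIFF"
    role: str         # "archival" or "production"


def migration_track(item: MasterFile) -> str:
    """Pick a migration track for a master file under the assumptions above."""
    if item.file_format != "PCD":
        # Later production masters were already generated as TIFFs.
        return "no migration needed"
    if item.role == "archival":
        # Uncropped scan in Image Pac/YCC: convert as losslessly as possible.
        return "lossless conversion of the archival master"
    if item.role == "production":
        # Cropped, curator-approved RGB copy: the crop must survive.
        return "new production master that keeps the approved crop"
    raise ValueError(f"unknown role: {item.role!r}")


for example in (
    MasterFile("a1", "PCD", "archival"),
    MasterFile("p1", "PCD", "production"),
    MasterFile("p2", "TIFF", "production"),
):
    print(example.object_id, "->", migration_track(example))
```

Keeping the rule in one function makes it easy to see why two parallel projects fall out of the analysis: the two PCD branches have different goals, so they would likely need different tools and different quality checks.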

Obviously there are still 14 more steps before the Kodak PhotoCD migration project is complete and 2 steps on top of that for adding to the program-level plan…stay tuned for updates! So, to all my fellow Digital Preservation Masochists, press on!

-Joey Heinen
