In honor of Halloween, I give you not a trick, but a treat: Data about Data.
Beastly bit rot and ominous obsolescence,
The 27th of October, besides being the very special Fourth Day Before Halloween, was proclaimed World Day for Audiovisual Heritage by UNESCO in 2005. This year’s theme for the day is “Archives at Risk: Much More To Do” – something you don’t have to tell us digital stewards twice!
Films, radio and television programs, oral histories, music performances – these and countless other audiovisual treasures hold many 20th and 21st century primary records. The Official Website for Audiovisual Heritage 2014 notes that “it is estimated that we have no more than 10 to 15 years to transfer audiovisual records to digital to prevent their loss” – though much has been lost already. And as digital preservation professionals understand, this not only means creating digital surrogates of the records but also designing the means with which to preserve these digital surrogates.
In celebration of this internationally recognized day – and to spread a little awareness – I thought I’d share a little more about MIT’s digital audio preservation project, particularly the cool-cat collection we are starting with. MIT’s Lewis Music Library is a subject-specific library popular with faculty, alumni, and students alike – in fact, music is the second most popular minor here at MIT! My office is on the second floor, where on a given day I might see a student composing on computer software or hear the tinkle of piano keys from downstairs, where bio-engineering majors who are piano virtuosos on the side stop by for a lunchtime performance. The Library offers some personal digitizing outlets as well, which you can read more about here.
In its special collections, the Lewis Music Library has 31 shelves full of recordings on reel-to-reel, audiocassette and videocassette tapes, phonographs, DAT tapes, and film. The impetus for my project was funding for a specific digitization initiative, which catalyzed the need to preserve and provide access to the content once it was transferred.
The first set of digital audio content we are testing in our workflow is a batch from the Herb Pomeroy collection. Herb Pomeroy was a jazz trumpeter and music educator from Massachusetts. In his early career, he played with such jazz luminaries as Lionel Hampton and Charlie Parker. In the 1950s, he put together his own big band, gaining national attention and playing at venues such as Carnegie Hall. Though he had an illustrious and influential career as a musician, he was also well known for his teaching career, helping to found the Jazz Workshop, teaching for 41 years at the Berklee College of Music, and directing the MIT jazz ensemble the Techtonians – later known as the Festival Jazz Ensemble – for 22 years. You can check out interviews with Pomeroy from the Lewis Music Library’s Oral History Project here.
The collection itself is composed of recordings of performing groups he was coaching, as well as performances he gave around town in big bands and smaller groups. He even played at the Chestnut Hill Mall fairly regularly! We will be digitizing a selected portion of the audio content and walking it through the workflow to test it and identify gaps. Simultaneously, we are evaluating Avalon Media System as a dissemination platform, so people everywhere can enjoy the collection once it is digitized.
I am delighted to be contributing to the preservation and access of such a cool collection because I think it’s these kinds of records that really underline the impact of the work we do as digital stewards and the significance of World Day for Audiovisual Heritage. Digitization and digital preservation not only ensure the endurance of our historical records and cultural heritage; they also mean expanded intellectual access – for everyone. And that is a true step toward the democratization of knowledge. I mean, c’mon… this work is so cool, you guys.
Happy heritage holiday,
Jen here, finally chiming in on this blog. (Sorry for the long silence, friends!)
I’ve spent the past couple of months swimming (read: drowning) in information about digital preservation, digital repositories, and born-digital material. More specifically, I’ve been gathering any and all information out there about ingesting digital materials into digital repositories, ideally with some kind of digital preservation component involved. I’ve also learned about and played around with Northeastern’s soon-to-be-launched (well, soft-launched) brand-new digital repository system. I’m still trying to take it all in, but I’m finally feeling a little more prepared to take on these projects.
A little bit of back story: As I was completing the application for the NDSR program in Boston, I needed regular pep talks from my (very patient) partner to keep at it. These pep talks were approximately every hour. See, aside from a couple of really excellent digital libraries / digitization courses I took, my MLIS program didn’t emphasize the technical side of librarianship, let alone the specific technical aspects of digital archives. The instructor for a class on metadata architectures had to revise the syllabus when he realized that none of us had been taught XML in our core, required library technologies class. In an archival arrangement and description class, our professor was horrified to learn that none of us had the faintest idea how to define the word “checksum,” let alone generate a fixity check. I was confident in my ability to learn the technical skills required of digital archivists / digital preservationists and had been fairly successful as a self-taught student of a number of things, but I felt underprepared for the residency, and so I questioned my ability to even apply. I eventually submitted the (oh so awkward) application video and supporting documents and crossed my fingers, and somehow, on a Monday in early June, I got an email offering me the NDSR position at Northeastern University. Cue equal amounts of excitement about returning to the east coast for such a great opportunity and panic about feeling so very underprepared for this kind of position. (Cue also devastation about leaving beautiful Colorado, warm fuzzies about being closer to family, uncertainty about starting my career on the technical side instead of the people-facing / teaching side, eagerness to build up the technical side of my brain. Basically, I had a lot of feelings.)
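(Incidentally, for anyone else whose program skipped the checksum lecture: a checksum is just a digest computed from a file’s bytes, and a fixity check compares a freshly computed checksum against the one recorded at ingest to confirm the file hasn’t silently changed. A minimal sketch in Python – the file path is whatever you happen to be checking:)

```python
import hashlib

def checksum(path, algorithm="md5", chunk_size=8192):
    """Compute a file's checksum, reading in chunks so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fixity_check(path, stored_checksum, algorithm="md5"):
    """A fixity check: does the file still match the checksum recorded at ingest?"""
    return checksum(path, algorithm) == stored_checksum
```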
Fast forward two months and one cross-country road trip with two cats, my partner, and a precariously over-packed car to the first day of the NDSR immersion week. I’ll be honest and admit that I still felt pretty underprepared, but as we went through the Library of Congress’ DPOE modules, all of the disparate pieces of digital archives / digital preservation knowledge I’d picked up along the way started to make sense. Additionally, I got to know both my mentor at Northeastern and Northeastern’s archival collections a little better, and I immediately felt a strong ideological connection to this particular host institution. A lot of archives talk about incorporating “diversity” into their collections, but through my own research, I’ve found that this doesn’t always mean much. Northeastern’s commitment to documenting Boston’s underrepresented communities (e.g. African-American, Chinese, GLBTQ, and Latino communities) in their archives, though, is front and center in both mission and practice. I’d been worried that this residency would pull me away from the things I find most intriguing about archival practice (how we responsibly build truly inclusive, diverse collections), but as it turns out, the NDSR Boston powers-that-be did an excellent job in matching us to our host institutions. Now, instead of thinking of this as a 9-month break from the social justice side of archives, I’m thinking of this residency as a 9-month opportunity to build my technical skills in ways that are relevant to the kinds of collections I’m most interested in highlighting.
Fast forward another month and a half, and I’ve been burning through podcasts on the T for my daily one-hour-each-way commute to and from Northeastern. (Nerdette, TED Radio Hour, Radiolab, and Serial are my top favs, in case anyone was wondering.) The projects as initially described in Northeastern’s proposal may change slightly to more explicitly incorporate concepts of digital preservation (like, for example, creating an actual digital preservation plan), but the first projects I’m working on are staying true to the initial proposal.
First, though, I’ve been in information-gathering mode. Think of a squirrel hoarding nuts for a long, cold winter. (I’m secretly crossing my fingers for a long, cold winter! I’ve missed them during my years in always-sunny Colorado!) That’s me, minus the bushy tail. This nut-hoarding currently looks like a mess of a Google doc with links to various case studies, success stories, failure stories, webinars, and digital archives white papers and manuals, but eventually, this will become a beautiful (informal) report that I’ll be able to refer back to all year. And likely in whatever job comes after this. And after that. And so on until the information is outdated.
I’ve also been gathering information on Northeastern’s systems in general: the new digital repository, the old digital repository, the management of our digital archives, etc. That, too, is turning into a short cheat-sheet/report, mostly for my own reference as I continue this residency.
(Fun fact: as I was looking for a squirrel cute enough for this blog post, I came across this science article that explains squirrels’ nut-hoarding tendencies as strategies for long-term savings. If squirrels are interested in long-term access (to food) and digital preservationists are interested in long-term access (to data), I think this means that squirrels are preservationists. Please don’t question my logic.)
Second on the docket is my first big project, which has already started (primarily in information-gathering/nut-hoarding mode).
Northeastern University has a strong digital humanities presence, and luckily, the digital humanities folks here seem to have a strong working relationship with Northeastern’s libraries. Our Digital Scholarship Group, for example, is located in the library (in the beautiful and recently constructed Digital Scholarship Commons) and the manager of Northeastern’s digital repository is technically under the auspices of the DSG. The DSG had a big hand in Northeastern becoming the digital home for the Our Marathon archive, a crowd-sourced digital archive that was created following the Boston Marathon bombing of 2013.
Our Marathon was a joint effort with a number of community partners, and documented the community’s response to the bombing. As you might guess, a lot of this is emotionally difficult stuff, and there have been a lot of comparisons to, for example, the 9/11 digital archive. While the Our Marathon digital archive was built, thankfully, in consultation with archives staff (which means that we have things like meaningful metadata and some documentation of donated materials; as digital humanities projects go, we’re in pretty ok shape), it was built on an Omeka platform without an explicit plan in place to eventually pull it into the archives’ digital collections. Basically: we have it, but we don’t have it in a stable location that will allow for long-term access. Yet.
That’s where I come in. Once the new digital repository is up and running, we’ll need to ingest all of this born-digital material and its accompanying metadata into the DRS. There are a ton of different file formats (audio, various formats of text, images), for one, and while there’s thankfully a ton of useful metadata, it’s all in Dublin Core. The DRS uses MODS. The metadata also doesn’t always reflect technical or administrative information – like, for example, all of the rights information we need. All of these agreements exist somewhere, but tracking them down will be a task. We’ve also determined that there is useful metadata that Omeka supports (e.g. geotagging) that may or may not translate to the DRS. And, of course, there’s the matter that Our Marathon, as it stands on Omeka, is currently being supported on a volunteer basis by a doctoral student whose time to dedicate to this project is pretty limited.
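(To make the Dublin Core-to-MODS hurdle a little more concrete, here’s a toy crosswalk sketch – my own illustration, not Northeastern’s actual mapping, which will be far more nuanced. It covers a few of the simpler correspondences from the Library of Congress DC-to-MODS crosswalk:)

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

def dc_to_mods(dc_record):
    """Map a few simple Dublin Core elements onto their MODS homes.
    A real crosswalk handles many more elements and edge cases."""
    ET.register_namespace("mods", MODS_NS)
    mods = ET.Element("{%s}mods" % MODS_NS)

    if "title" in dc_record:  # dc:title -> mods:titleInfo/mods:title
        title_info = ET.SubElement(mods, "{%s}titleInfo" % MODS_NS)
        ET.SubElement(title_info, "{%s}title" % MODS_NS).text = dc_record["title"]
    if "creator" in dc_record:  # dc:creator -> mods:name/mods:namePart
        name = ET.SubElement(mods, "{%s}name" % MODS_NS)
        ET.SubElement(name, "{%s}namePart" % MODS_NS).text = dc_record["creator"]
    if "description" in dc_record:  # dc:description -> mods:abstract
        ET.SubElement(mods, "{%s}abstract" % MODS_NS).text = dc_record["description"]
    if "rights" in dc_record:  # dc:rights -> mods:accessCondition
        ET.SubElement(mods, "{%s}accessCondition" % MODS_NS).text = dc_record["rights"]
    return mods
```

(The sample field names are hypothetical; the sticking points in practice are exactly the ones that don’t map one-to-one, like rights and technical metadata.)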
My task, as I initially understood it, was going to be so easy! Just create a workflow for ingesting this already-full-of-metadata digital collection from one platform to another. Sure! Simple! Workflows are fun! (Nerd confession: I do genuinely enjoy creating workflows.) Oh, the sweet naivete of two-months-ago-Jen.
Stay tuned for how this all shakes out.
Signing off for now,
Shira Peltzman over at NDSR NY has already written up an excellent post about AMIA’s inaugural open-source track this year, but since it so heavily informs the work that I’m doing at WGBH, I wanted to take a little bit of a closer look at some of the themes that were running through this section of the conference.
(…ok, maybe I just wanted a reason to post a lot of puppy gifs. But we’ll get there.)
To recap, open-source software is built upon essentially free code — it’s available publicly and collaboratively developed, so anybody can, in theory, download it, implement it and improve upon it. Hack Day, which Joey and I posted about last week, is supposed to result in the development of open-source tools that can be freely used by the community. (The main site used for collaborative coding is GitHub, and knowledge of how to use GitHub is almost essential for working in the open-source community; the track this year actually started with a demo from LoC’s Lauren Sorensen about how to dip your feet into the GitHub waters. GitHub’s tools for submitting comments and changes to a project and tracking their implementation can actually be pretty useful for other things besides code — the PBCore Committee, for example, is using GitHub to receive comments and review updates to the PBCore metadata standard — but that’s a whole other story.)
Anyway, there are a lot of reasons why open-source is a pretty great thing for archives. For one thing, the fact that the code isn’t locked away behind a proprietary license makes it much more likely that people ten, twenty or fifty years in the future will be able to figure out how it worked and how to recreate or emulate it — reassuring, if you don’t want to run the risk of losing content due to obsolescence, or of uploading a bunch of material and metadata into a system that you can’t then get it back out of. Additionally, open-source technology provides a lot more opportunities for archives to customize and control their own preservation solutions; as an archivist, it’s always fairly unnerving to feel like the survival of your content is completely in somebody else’s hands.
Open-source also tends to sound like a great solution financially for archives. Who wants to pay software licensing fees when you can just download code from GitHub that will do the same thing for free? However, this is where it gets a little tricky. WGBH’s Karen Cariani, in one of the most quotable moments of the open-source stream, explained it like this:
What people expect from open-source software is something like free beer.
However, what you actually tend to get is more like a free puppy.
At this point you may be thinking, ‘but puppies are great! Way better than beer!’ It’s true, puppies are pretty great. They’re cute and they’re cuddly and when you contribute to their support you get a warm feeling inside of generally doing the right thing for the universe. Still, when someone gives you a puppy, it’s not exactly ‘free’ — now that you’ve got the puppy, you have the responsibility of shelling out a significant amount of cash on food, equipment, vet’s bills … not to mention the responsibility of housetraining it, taking it for daily walks, and cleaning up after it when it forgets all the training you gave it and pees on the floor. And this is all now going to be your job for pretty much the rest of the puppy’s lifespan.
That’s basically what open-source software is like. You’re getting the initial code for free, and that’s pretty great — but once you’ve got the code, getting it to work is going to involve a significant amount of time, and probably a significant amount of money as well. When you work with a proprietary software company, training the dog and walking it and taking it to the vet (in other words, customizing it, updating it and checking it for bugs) are all part of the company’s job; you’re paying them to take care of all that hassle for you. If you’re jumping on the open-source train, either you then have to hire someone else to make your open-source software behave — and a lot of open-source companies fund themselves by hiring out developers to do that — or figuring out how to do it becomes your job. And if you’re an archivist in a financially-strapped institution, odds are you’re already doing at least two jobs.
The idea here isn’t to discourage people from using open-source tools; far from it! Karen Cariani made this analogy as part of her presentation about WGBH’s decision to work with Hydra, an open-source repository solution that’s being adopted by a number of large archives. (I talked about this a little in my post on change management, too.) All the great reasons that I mentioned above for archives to invest in open-source software remain really solid reasons to invest in open-source software. It’s just important to be aware that it is an investment, and not go in expecting to get a lot of exciting something for nothing.
The thing about open-source software, though, is that the more people become aware of the options, and start talking about them and using them and documenting them and contributing to them, the better and easier they all become for everybody. The power of open-source comes from an informed community. The importance of AMIA’s open-source track this year wasn’t even so much about the actual tools presented, although of course there were a lot of fantastic open-source tools presented (in addition to Hydra, the WebVTT standard for time-aligning metadata with web-streaming content got a lot of buzz, and I’ll never pass up an opportunity to give the QCTools project a shout-out, since it’s going to be a godsend for anyone whose job involves error-checking digital video files). But specific projects aside, in order to be part of the open-source community, it’s important to really understand what open-source is, and what it means — and the frank and open discussions about open-source at AMIA this year played a huge role in broadening that understanding.
– Rebecca (who does really like free puppies)
…though, when is the work ever light? Now in its second year, the Association of Moving Image Archivists (AMIA) and the Digital Library Federation’s (DLF) joint project – Hack Day – once again brought together archivists, programmers, and developers for an intense day of hacking and open-source tool development. Led and organized by AMIA’s Open Source Committee, this year’s menu featured an impressive and diverse line-up of project ideas, all tackling some rather sinewy problems – a new-and-improved PBCore record generator and validator; a GUI for streamlining selection of target formats/containers during video capture in Black Magic; and building out documentation for various tools and addressing some eyebrow-raising oversights in the “Digital Preservation” Wikipedia page, to name a few (see the complete digest of projects here).
Fellow NDSRer Rebecca Fraimow will report on her project involving documentation for FFmpeg. As for me (Joey Heinen), the project I jumped on board with generates a report that pulls the results from various video characterization tools in order to point out discrepancies between them, proposed by Kara van Malssen as a continuation of a project started at the Open Repositories 2014 Developer Challenge. The tool functions much the same way that FITS generates output from several tools at once so as to choose the “best fit,” though it addresses FITS’s lack of robustness in characterizing A/V formats. Tools such as MediaInfo, FFprobe, and ExifTool are designed to parse and analyze A/V files and report on their various embedded descriptive and technical characteristics (e.g. codec, number and type of A/V streams, bit depth, color profiles, etc.) so that archivists can better manage their collections. However, at this year’s American Institute for Conservation of Historic and Artistic Works (AIC) conference, Joanna Phillips, conservator at the Guggenheim Museum, and Agathe Jarczyk, conservator and educator at the University of Bern, delivered a vital call-to-arms paper exposing alarming errors in these characterization tools. Some noteworthy instances include reporting MPEG-4 for QuickTime 10-bit uncompressed video, or examples where the timecode was incorrect by minutes (not just seconds). While some of these examples can be justified (particularly the famous MPEG-4 example, which is further explained in this great post by Erik Piil), there is clearly much work to be done to ensure that tools are correctly reporting on these factors, and that users are well informed about how to interpret this information and understand the inner workings of the tools (also illustrated in Erik’s post).
The group discussed what we considered to be the significant attributes for comparison across the tools and how to design the layout of the resulting CSV so as to best facilitate this comparison. We then broke into two groups to complete the task: Group 1 (Ben Fino-Radin, Morgan Morel, Eugene Gekhter) designed the Python script for running the analyses through the three tools and formatting the combined output; Group 2 (Kara van Malssen, Nicole Martin, Karen Cariani, and me) determined the XPath for the various attributes so that we could be certain the most accurate field from each tool was selected.
Group 2 decided to first select diverse and sometimes zany video formats so that we could get a good sense of how the three tools handle different formats: AVCHD, XDCAM, Super 35mm MXF, MKV, uncompressed MOV, iPhone MP4. From this list we each generated raw XML reports in MediaInfo, FFprobe, and ExifTool and began poring over the results, using the following command-line invocations for each tool:
MediaInfo: mediainfo -f --Language=raw --Output=XML [filename]
FFprobe: ffprobe -show_format -show_streams -show_data -show_error -show_versions -print_format xml [filename]
ExifTool: exiftool -X [filename]
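(For the curious, those three invocations translate into a small runner script along these lines – my own sketch of the approach, not the group’s actual code, and it assumes the three tools are installed and on your PATH:)

```python
import subprocess

def characterization_commands(filename):
    """The three characterization invocations from above, as argument lists."""
    return {
        "mediainfo": ["mediainfo", "-f", "--Language=raw", "--Output=XML", filename],
        "ffprobe": ["ffprobe", "-show_format", "-show_streams", "-show_data",
                    "-show_error", "-show_versions", "-print_format", "xml", filename],
        "exiftool": ["exiftool", "-X", filename],
    }

def run_characterization(filename):
    """Run each tool and collect its raw XML report, keyed by tool name."""
    reports = {}
    for tool, cmd in characterization_commands(filename).items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        reports[tool] = result.stdout
    return reports
```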
In examining each of the XML outputs, it didn’t take us long to realize that there were some major differences in how information was being formatted, as well as in how the attributes were listed – something that would prove to be a headache later on when reconciling the Python scripts to locate the respective XPath. For example, ExifTool would often output track information in a general “track” namespace, leaving the user to interpret whether the track was audio or video based on the attributes contained therein (e.g. the presence of “Frame Rate” leading the user to conclude that Track 2, for example, was a video track). To complicate matters further, some video formats would output the audio within “Track 1” while others would use “Track 2.” Because of this issue, ExifTool is still not completely implemented in the group’s GitHub repository – but it should be soon! Since ExifTool is most commonly used for still image collections, given its initial development within that community, it is not altogether surprising that it often proved to be the most problematic tool for video formats. However, in the example of the iPhone MP4 file, neither FFprobe nor MediaInfo could report on the bit depth of the audio, whereas ExifTool could.
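(To make the track ambiguity concrete, here’s a toy illustration – the XML below is a simplified stand-in, not real ExifTool output. With a generic track element, you’re reduced to guessing the track type from whichever attributes happen to be present:)

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the kind of generic track listing described above.
SAMPLE = """
<File>
  <Track1><AudioSampleRate>48000</AudioSampleRate></Track1>
  <Track2><FrameRate>29.97</FrameRate><ImageWidth>720</ImageWidth></Track2>
</File>
"""

def guess_track_types(xml_text):
    """Infer each track's type from tell-tale attributes, since the element
    name alone ('Track1', 'Track2') doesn't say whether it's audio or video."""
    types = {}
    for track in ET.fromstring(xml_text):
        tags = {child.tag for child in track}
        if tags & {"FrameRate", "ImageWidth", "ImageHeight"}:
            types[track.tag] = "video"
        elif tags & {"AudioSampleRate", "AudioChannels"}:
            types[track.tag] = "audio"
        else:
            types[track.tag] = "unknown"
    return types
```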
While FFprobe and MediaInfo were, on the whole, the most faithful tools for outputting information, we often struggled with FFprobe’s (over)abundance of data, vacillating between attributes such as “codec” and “codec_long_name” – the “long_name” attributes often simply contain the same data formatted in a somewhat different way, which, depending on the attribute, was sometimes more congruent with the other tools than the shortened attribute. FFprobe would also often display information that was correct but arrived at through somewhat indirect formulas (e.g. a frame rate of 30000/1001 to represent 29.97 – both meaning the same thing, but presenting the info in a somewhat misleading way). I could continue down this rabbit hole of other minute discoveries, but the point is that the behavior of these tools depends on the format and all the complexities found therein. While in some instances there may in fact be bugs in the code that result in false information, in other instances it’s a matter of knowing how each tool parses the packets of information, which can be unique to each format. An interesting follow-up project might be to chart the behavior of a number of these formats from tool to tool to begin to understand how this data is parsed. But for seven people plowing through this work over the course of seven hours, I’d say we did our fair share of excavation. Thanks to all the organizers for creating such a great event and to all the amazing minds that came together for the cause!
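(A postscript on that 30000/1001 example: FFprobe’s fraction is an exact rational representation of the nominal NTSC rate, so any comparison script has to normalize both forms – with a little tolerance – before flagging a mismatch. Something like:)

```python
from fractions import Fraction

def normalize_frame_rate(value):
    """Turn either '30000/1001' or '29.97' into a comparable float."""
    if "/" in value:
        return float(Fraction(value))
    return float(value)

def frame_rates_match(a, b, tolerance=0.01):
    """Compare two frame-rate strings after normalization; 30000/1001 is
    29.97002997..., so exact equality with '29.97' would falsely fail."""
    return abs(normalize_frame_rate(a) - normalize_frame_rate(b)) < tolerance
```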
This is Rebecca, jumping in. I was hanging out over in the Wiki Edit-a-thon portion of Hack Day — a new aspect of the event spearheaded by Kathryn Gronsbell of AVPreserve, focused on building out some of the documentation available around what we do as digital media preservation specialists.
Staring at text documents while your neighbors at the next table are discovering how to digitize video using only their brains might seem like a slightly dry way to spend the day, but adding an editing and documentation aspect to Hack Day is really a pretty great idea for a number of reasons. The digital archiving world — and, specifically for AMIA, the a/v archiving world — has a solid component of amazing and skilled coders and tech wizards who are completely comfortable breaking tools apart and seeing how they run. Still, for every archivist who has the piece of technical knowledge that they need to find a solution to their problem, there are ten or a hundred people who don’t even know where to start looking. As AVPreserve’s Chris Lacinak pointed out in this great post by Shira Peltzman, documentation is key to making sure that our toolsets are findable, accessible and usable. It’s also another way for people who might not have the training in Python or Ruby to still feel like they’re making a solid contribution at Hack Day, so it becomes an even more engaging event for the broader community.
This time, the Edit-A-Thon was composed of three teams. Kathryn Gronsbell and crew set to Wiki-ing at the highest level, tackling Wikipedia’s coverage of digital preservation as a broad topic. Among other tasks, they buckled down on providing really solid information about preservation fundamentals and links to resources for anyone who might come to Wikipedia looking for help (which, of course, is the first place most people go). Meanwhile, Kristin MacDonough, of BAVC and Video Data Bank, recruited a group to help her boost the usability of the A/V Artifact Atlas — a fantastic informational resource which documents the artifacts and errors that are likely to show up in converting analog video to digital formats. Most of the people who worked on that project don’t work with video in their day jobs, which made them the perfect team to act as test users and focus in on ways to make the site more understandable to the target audience of people who may not know very much about video.
As for me, I sat down with Erica Titkemeyer to tackle some documentation of FFmpeg, an open-source video characterization and transcoding program that forms the backbone of most of the open-source software options available for archivists working with video. FFmpeg can be a really useful tool in and of itself when working with digital video for all kinds of functions, including playing arcane formats, taking screenshots or clips, capturing metadata, and transcoding. However, since it’s only accessible through the command line — and the official documentation is designed for developers planning to integrate the tool into other programs, rather than for laypeople who just want to know how to make a YouTube file — many archivists become intimidated when they sit down to try and use it in their regular workflow.
At last year’s Hack Day, a team of intrepid FFmpegers created a wiki on GitHub to explain the installation and use of FFmpeg for laypeople. Erica and I built on that project, creating use cases for some of the most common functions and breaking down the code into its component parts so people could really see what was going on with the program and make it work to serve their needs. Common Use Cases and Transcoding Use Cases are the pages to check out there, if you’re interested. I learned a ton that I didn’t know about FFmpeg while trying to find ways to explain the stuff that I did know, and I plan to keep working on the project throughout the year. Something else to keep an eye on is Ashley Blewer’s work-in-progress, an app called ffmproviser which should allow users to make requests and generate lines of ffmpeg code for common commands; it’s not quite ready for full use yet, but stay tuned, because it’s got a ton of potential as a tool for a/v archivists.
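(For a flavor of that kind of breakdown – a sketch in the spirit of the wiki rather than copied from it, with hypothetical filenames – here’s a common “make a web-friendly file” transcode, one flag at a time:)

```python
def web_transcode_command(input_file, output_file):
    """Build a common FFmpeg transcode command as an argument list,
    with each flag explained."""
    return [
        "ffmpeg",
        "-i", input_file,       # the source file, in whatever arcane format
        "-c:v", "libx264",      # encode the video stream to H.264
        "-pix_fmt", "yuv420p",  # a pixel format most players can handle
        "-crf", "18",           # quality-based rate control (lower = better quality)
        "-c:a", "aac",          # encode the audio stream to AAC
        output_file,            # the .mp4 extension tells FFmpeg the container
    ]

# To actually run it (assuming FFmpeg is installed):
#   subprocess.run(web_transcode_command("lecture.mov", "lecture.mp4"), check=True)
```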
This was my first time participating in Hack Day, and it was a pretty great experience — it’s easy to get caught up in the collective thrill of working with a team who are all dedicated to pushing the field a little further, and exciting to know that the work you do is going to be of direct use. It was also a great way to kick off AMIA, which focused very heavily this year on open-source digital tools, but that’s a topic for another post.
A week or two after finding out that I was going to be heading up to Boston as the NDSR resident for WGBH, I got an email from NDSR DC alum Erica Titkemeyer, asking if I would use the experience from my project to fill in an empty slot for her panel “Sailing the Ship: Supporting and Managing Change at Large Institutions” at the Association of Moving Image Archivists 2014 conference in Savannah, October 7-11.
WGBH is certainly a large institution, and my projects definitely involve change management – the Media, Library and Archives department is squarely in the middle of the process of switching over from a proprietary Artesia digital asset management system to an open-source Hydra-based repository system. One of my responsibilities over the next nine months consists of streamlining and documenting the process of ingesting materials into the DAM during this transitional phase.
…the only problem was that the panel was scheduled for early in October, at which point I still wouldn’t have actually done much of any of that yet. As a result, I’ve spent the lead-up weeks to AMIA pestering most of the WGBH team for their accumulated wisdom about the process of envisioning and implementing a major change in their digital archives management. Here’s the very, very boiled-down version of what I’ve learned so far.
As a major archive operated out of a production institution that does not, in and of itself, have a mission to preserve, the Media, Library and Archives Department often finds itself trying to walk the line between competing directives and obligations. The decision to switch over to an open-source HydraDAM system was a deliberate choice to adopt a system that would serve the functions of an archive without trying to be all things to all people within the broader framework of WGBH. The expiration of the license on the old Artesia system served as the catalyst for the switch. Once the combined cost of renewing the license and adding back in all of the special features required for system functionality was taken into account – not to mention the risk of continuing to rely on the old LTO 4 tape robots used by the Artesia system, which would become obsolete relatively soon – the archival team was able to make the case for adopting an entirely new system without licensing requirements, with its own LTO 6 decks that could be departmentally controlled without having to rely on the broader WGBH infrastructure.
There are no real solutions to a lot of these problems – they’re all part and parcel of the price of change in a digital archive – but the WGBH team did have some tips to share about smoothing the way.
So that’s pretty much what I spoke about at AMIA (minus all the gratuitous Pirates of the Caribbean gifs; I mean, we did have ships right there in the title).
The session also included fantastic presentations from Erica Titkemeyer, now the AV Conservator at UNC Chapel Hill’s Southern Folklife Collection, and Crystal Sanchez, the Digital Video Specialist for the Smithsonian’s Digital Asset Management system. Erica is in the process of getting the Southern Folklife Collection’s A/V materials digitized and included in Chapel Hill’s DAM system, and Crystal has spent the past year overseeing the implementation of an updated Artesia system throughout the Smithsonian. If they put their slides up, I’ll come back and link them here; they’re definitely worth checking out.
We ended the session at AMIA by asking if anybody else had advice or thoughts they wanted to share about their own change management processes, so I’ll do the same thing here — input is definitely welcome! In the meantime, Rebecca Fraimow, signing off.
Week 3 is already coming to a close and, like Tricia, I am swimming in a sea of information – some tried-and-true best practices in digital preservation and some new approaches towards making it actually work. At Harvard Library, my host for the NDSR, we are grappling with format migration within Harvard’s Digital Repository Service (DRS to us acronym-inclined archivists), understanding that though the topic of migration has been oft-discussed in the field, few sustainable solutions have emerged for ensuring long-term access to digital objects across formats.
To back up for a moment, migration is the process of updating analog and digital objects to keep up with the ever-changing technological landscape, knowing that though the objects themselves might not deteriorate, the means and technologies for viewing and experiencing them often do. Migration has been a popular preservation action within libraries and archives for quite some time, and digital migration (from one digital format to another) has been met with some hand-wringing for several decades. Many studies have noted the loss of significant properties of an object across iterative migrations – defining features of the original format such as color space, fonts, timing, and interactivity, to name a few. Without going into too much detail on other means of access such as emulation (recreating the original environment and external dependencies for the object), there are ongoing debates over which properties of the object should be the focus of preservation. However, everyone can agree that each format comes with its own special challenges and there is no monolithic way to preserve everything with a single click of the mouse. While Harvard is looking to institute a broad workflow and framework for file format migration, my project focuses on how to implement this while principally being concerned with the needs of three specific, now-obsolete formats – Kodak PhotoCD, RealAudio, and SMIL Playlists.
Before diving into these three formats, I started my work by flinging myself madly into the immense body of literature around digital migration and how institutions are putting it into practice. Perhaps the first challenge that I came up against was knowing when to distinguish theory from practice. Many new solutions have been brought forth for any number of identified gaps in the workflow – monitoring obsolescence, creating agnostic containers for unsupported/undocumented formats (e.g. XML-based), implementing tools irrespective of their configurability across a repository workflow. While it was at many times tempting to discover such a resource and think “Eureka!”, I had to apply a fair degree of skepticism, particularly if certain sources were a few years old and the respective tools were, as of yet, largely unused in any real workflows. Nonetheless, I was able to compile a bibliography and begin to map the core arguments of these resources to a hypothetical workflow that considers many possible migration strategies based on the specific challenges of each format (I will be finalizing this workflow map throughout October). One end of the workflow that, at this point in the research, seems most in need of refinement is the set of tools for identifying and validating formats – a significant step in the process, as you must decide whether a file is indeed what it says it is. Identifying and validating a file empowers the digital steward to begin to check off which significant properties will be the focus of preservation for that specific format and to determine the best tools and services (that means people too!) for doing the job. For example, Jpylyzer is a tool for validating JPEG 2000 images, ensuring that the compression algorithms, header info, color space, etc., all comply with the standard.
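To make the identification step concrete, here is a toy sketch of signature-based ("magic bytes") identification, the first check tools like DROID perform before deeper validation. The signature bytes below are real, but a production tool consults a full registry such as PRONOM rather than a hand-rolled dictionary like this one.

```python
# Toy format identification by file signature. SIGNATURES is a tiny,
# illustrative subset; real tools draw on registries of hundreds of formats.
SIGNATURES = {
    b"\x00\x00\x00\x0cjP  \r\n\x87\n": "JPEG 2000 (JP2)",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"RIFF": "RIFF container (e.g. WAV, AVI)",
}

def identify(path):
    """Return a best-guess format name based on the file's leading bytes,
    or None if no known signature matches (a candidate 'orphaned' format)."""
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return None
```

Signature matching only answers "what does this file claim to be?" – validation (Jpylyzer, FITS) then checks whether the file actually conforms to that format's specification.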
Once a file has been successfully validated, it can be passed to the next step in the workflow (determined by the format's requirements) to ensure that significant properties are taken into account – for example, performing a post-migration comparison against the original format with QA tools (e.g. ImageMagick).
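That post-migration QA check can be sketched as a simple property comparison. In practice the property values would come from tool output (FITS characterization, ImageMagick's `identify`); the property names and sample values below are hypothetical stand-ins for whatever a given format's significant properties turn out to be.

```python
def qa_report(original, migrated,
              must_match=("width", "height", "color_space")):
    """Compare significant properties extracted before and after migration.

    `original` and `migrated` are dicts of property name -> value.
    Returns a list of (property, original_value, migrated_value) tuples
    for every property that did NOT survive the migration intact.
    An empty list means the migration passed this QA check.
    """
    return [(k, original.get(k), migrated.get(k))
            for k in must_match
            if original.get(k) != migrated.get(k)]
```

For example, a PhotoCD image migrated to TIFF might keep its pixel dimensions but land in a different color space; `qa_report` would flag only the `color_space` entry, prompting a human decision about whether that loss is acceptable.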
While tools for identifying (DROID) and validating (FITS does both) exist for these purposes, they cover only a finite number of formats, which are generally already well-defined and documented. Herein lies the problem: what happens to all those rare, obsolete, “orphaned” formats? Beyond that, determining the extent to which selecting tools – and deciding which ones to use for which format – should be a manual or an automated process is also a major consideration in workflow design. Given the mass of material within a digital repository, a trustworthy tool for automating this process is desirable – for example, Plato or Taverna. However, more research will need to be conducted to account for the existing architecture of the DRS and the stakeholders involved throughout the process (Tricia’s example below from MIT diagrams this nicely). This just goes to show that every institution is at a different place in solidifying these workflows, and there is not necessarily one model institution that has everything figured out (this is, of course, an ongoing process, and no two institutions function – or collect – alike).
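The automation question above often comes down to a decision table: once a format is identified, route it to a pre-approved migration pathway, and fall back to human review for anything the table doesn't cover. Here is a minimal sketch of that idea; the target formats and action names are illustrative assumptions on my part, not Harvard's actual migration choices.

```python
# Hypothetical routing table from identified format to migration pathway.
# The three formats are the ones my project targets; the actions are
# plausible examples, not decided policy.
PATHWAYS = {
    "Kodak PhotoCD": "migrate to TIFF",
    "RealAudio": "transcode to WAV",
    "SMIL Playlist": "rewrite as an open playlist format",
}

def route(format_name, default="queue for manual review"):
    """Return the migration action for an identified format.

    Rare or 'orphaned' formats absent from the table are not guessed at;
    they fall through to the human-review queue instead."""
    return PATHWAYS.get(format_name, default)
```

The value of keeping the fallback explicit is that automation handles the well-documented bulk of the repository while the genuinely hard cases still reach a digital steward.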
All these points are merely considerations at this stage as we look at what solutions could be applied for a more transparent and streamlined migration process. Next steps are delving deeply into the innards of the DRS and crystallizing how the various administrative pockets inform preservation at Harvard. As my initial research noted, tools that solve one part of the problem are great, but that doesn’t always guarantee compatibility with existing systems and processes. Sometimes the discovery of new solutions brings up new problems, but that’s what research is all about!