NDSR Project Update from UMass Boston

Hello Readers.

The weather in Boston is beginning to warm. The Red Sox have opened their season and the Boston Marathon was run earlier this week. No doubt the crew teams are rowing in the Charles and the Swan Boats will soon be paddling across the pond in the Public Garden. Although spring is the season of renewal, this year it signals the end of the 2015-16 NDSR Boston projects.

Swan Boats in Boston’s Public Garden – image by George Barker 1889

For this blog post, I thought I would update you on the progress I have made on my project so far. To refresh your memory, my project is developing and implementing a digital preservation plan for the University Archives and Special Collections at UMass Boston using Archivematica and DuraCloud. The collection I am working with is the Mass. Memories Road Show (MMRS). I began the project researching digital preservation standards and best practices while familiarizing myself with the digitization and digital asset management practices in use at UMass Boston. I have been busy lately adjusting existing practices for processing digital collections and developing preservation workflows related to the use of Archivematica.

Performing a gap analysis early in the project was an important step. By comparing the existing practices at UMass Boston with digital preservation standards, guidelines and best practices, I identified the greatest areas of need as preparing the collection for ingest and implementing archival storage. To address these broad areas of need, it would be necessary to incorporate the following tasks into the digital preservation workflow.

  1. Generate checksums
  2. Screen for duplicate and unwanted files
  3. Create/assign unique IDs to files
  4. Store files in multiple locations
  5. Include descriptive metadata in archival storage
  6. Create/manage administrative, technical and preservation metadata

Archivematica addresses several of these issues. Other needs are being met by making adjustments to the existing practices.

Boston Red Sox Scorecard 1934 – image from FenwayParkDiaries.com

I discovered that the first two tasks identified by the gap analysis were related. Generating checksums is an important task because it protects the authenticity and data integrity of the collection. A checksum, created by applying a cryptographic hash algorithm to a file, is an alphanumeric code that is effectively unique to that file’s contents and acts like a digital fingerprint. Periodically verifying that checksums have not changed provides evidence that the file has not been modified or damaged over time. Since the objects in the Mass. Memories Road Show collection are copied between hard drives several times and uploaded to the cloud, it is necessary to have a way to verify that each file has retained its original bit stream.
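
For readers who want to see the idea in code, here is a minimal sketch in Python using the standard library's hashlib module. The function names `file_checksum` and `verify_checksum` and the choice of SHA-256 are my own assumptions for illustration; this is not the tool or algorithm used in the project.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Return True when the file's current checksum matches the recorded one."""
    return file_checksum(path, algorithm) == expected.lower()
```

Recording the value returned by `file_checksum` when a file is first received, and calling `verify_checksum` after each copy or upload, is one way to confirm that the bit stream has not changed.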

Checksums also played an important role in helping to identify and remove duplicate files. The existing file processing workflow had resulted in the accumulation of numerous duplicate video files. Duplicate files have identical checksums. A checksum tool called HashMyFiles generates and compares checksums, and identifies when two or more files are identical. Using this tool, 3,500 video files occupying about 200 GB of space were removed from the collection, saving critical processing time and storage capacity.
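
The same principle can be sketched in a few lines of Python. This is only an illustration of grouping files by checksum, not a description of how HashMyFiles works internally; the function name `find_duplicates` and the use of SHA-256 are assumptions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def _sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large video files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by checksum; return only groups with two or more files."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[_sha256(path)].append(path)
    return {digest: paths for digest, paths in by_digest.items() if len(paths) > 1}
```

Each value in the returned dictionary is a list of files with identical content, so all but one file in each list is a candidate for removal after a human review.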

Crew practice on the Charles River – image by Leslie Jones ca. 1930

Other modifications being made to the file processing workflow involve adopting a file copying tool and standard terminology, adjusting the file naming conventions and digitizing registration forms. Usually a file’s creation date is overwritten when the file is copied. A tool called TeraCopy has been adopted to copy files because it retains the original creation date. Standard terminology has been adopted as well. Digital files that were previously categorized as “originals” and “edited masters” are now identified as “preservation masters” and “production masters.” Since preservation master files and production master files often share identical file names, suffixes have been added to the file naming convention to differentiate the two types. Preservation masters are now identified by an “.f0” suffix while production masters are labeled with an “.f1” suffix. Lastly, registration forms, which give UMass Boston consent to use the digital files in the collection, will now be digitized and uploaded to archival storage with the files they represent, providing additional intellectual control.
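
To make the naming convention concrete, here is a small Python sketch. The ".f0" and ".f1" suffixes come from the workflow described above, but where the suffix sits in the file name (here, between the base name and the extension) is an assumption on my part, as are the function name `master_name` and the sample file name.

```python
from pathlib import Path

# Suffixes named in the post; their position in the file name is an assumption.
ROLE_SUFFIXES = {"preservation": ".f0", "production": ".f1"}


def master_name(filename, role):
    """Insert the role suffix between the base name and the extension (assumed placement)."""
    path = Path(filename)
    return f"{path.stem}{ROLE_SUFFIXES[role]}{path.suffix}"
```

With this placement, a hypothetical `photo_001.tif` would become `photo_001.f0.tif` as a preservation master and `photo_001.f1.tif` as a production master, so the two no longer collide.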

Archivematica-specific adjustments are also being made to the workflow to protect the collection’s data integrity, manage metadata and assign unique identifiers to the collection and to the files. An additional checksum file and a text file with descriptive metadata will be created and uploaded to Archivematica with each submission. Archivematica uses the checksum file to verify that files are not damaged during the upload to the Archivematica servers. Archivematica parses the descriptive metadata file into a METS file, allowing the descriptive metadata to be stored with the collection in DuraCloud. Normal Archivematica processing extracts technical metadata from the files, generates additional administrative and preservation metadata, and creates and assigns universally unique identifiers (UUIDs) to the objects in the submission. All of this metadata is saved into the same METS file, in keeping with digital preservation best practices for metadata. The UUIDs are saved to a text file, which will be downloaded and imported into the digital asset management system, ensuring that the identifiers created during archival storage are associated with the access copies.
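
A checksum file of this kind can be sketched in Python as follows. The "digest, two spaces, relative path" line layout mirrors the output of common checksum utilities and is an assumption here, as are the function name `write_manifest` and the SHA-256 default; the exact file name, location and algorithm Archivematica expects should be taken from its documentation, not from this sketch.

```python
import hashlib
from pathlib import Path


def write_manifest(source_dir, manifest_path, algorithm="sha256"):
    """Write one '<digest>  <relative path>' line per file and return the line count."""
    source = Path(source_dir)
    lines = []
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        digest = hashlib.new(algorithm)
        with path.open("rb") as handle:
            # Hash in chunks so large files are not read into memory at once.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        relative = path.relative_to(source).as_posix()
        lines.append(f"{digest.hexdigest()}  {relative}")
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Writing the manifest outside the directory being hashed avoids including the manifest in its own listing.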

So, a lot of progress has been made thus far. There are still a few decisions to make and a little more testing left to do before the entire collection can be uploaded to the cloud, processed through Archivematica and deposited in DuraCloud. The final tasks will be to finish documenting the new procedures and training the archives staff to use the new digital preservation tools and Archivematica.

Boston Marathon winner Johnny Miles 1923 – image by Underwood & Underwood/Corbis

The Boston Marathon is an appropriate metaphor for the NDSR project. There is a lot of anticipation and a “feeling out” process in the beginning. This is followed by a period where you settle into a steady and comfortable pace. Along the way, you encounter and overcome challenges. At this point, you have made it over Heartbreak Hill. Next, Boylston Street and the Finish Line come into view and there is a hectic push to the end. After crossing the Finish Line, there will be the satisfaction and sense of accomplishment that comes with the successful completion of the project. Maybe the traditional meal of a big bowl of pasta will be my reward.

Thanks for reading, Jeff

Data Visualization: Choropleths and Cartograms and Treemaps, oh my!

Hello Readers.

Last week the NDSR Boston cohort visited with Helen Bailey, a digital curation analyst at MIT. In her spare time, Helen has become a data visualization expert. Helen provides data visualization support to the MIT Libraries and is sharing her knowledge of data visualization through presentations and workshops. If you think you are unfamiliar with data visualization, think again. I guarantee you have used data visualizations and maybe even created a few.

To set the record straight, let’s define the term before giving some examples and talking about why data visualizations are useful and what it takes to produce them. Helen offered the following two definitions:

“Information visualization is a mapping between discrete data and a visual representation.” from Lev Manovich, “What Is Visualization”

“Information visualization is a set of technologies that use visual computing to amplify human cognition with abstract information.” from Stuart Card, “Information Visualization”

While both definitions make sense, I prefer the second definition because a well-chosen visualization really provides meaning and understanding where there might otherwise be only information overload. It seems to me that the age old saying, “a picture is worth a thousand words” is appropriate when discussing the purpose and usefulness of data visualizations.

(Image: a spreadsheet of raw data beside a map visualization)

Helen notes that visualizing data can be used to summarize a data set, highlight specific aspects of the data, and identify patterns and outliers. Data, typically organized in tables or spreadsheets, can be almost impossible to digest, especially in large quantities. Even smaller data sets contain so many rows and columns they literally run right off your screen making it difficult to draw conclusions or spot trends. Organizing the raw data into visual representations is really the only practical way to make the data useful.

Alluvial Diagram

The first steps in creating data visualizations are to determine:

  • What questions will the data answer?
  • How will the visualization be used?
  • What is the best type of visualization to use?
  • Who is the target audience using the visualization?

It’s important to answer each of these questions because there are so many types of visualizations available. You can’t be a one-trick pony, reusing the same representation for all occasions. Certain representations work better for temporal (when), geospatial (where), topical/statistical (what, how much), relational (with whom) and hierarchical (ordered relationships) data.

Cartogram

The types of representations range from simple to complex and from traditional to innovative. The names of the visualizations in the list below will either allow you to draw a picture in your mind’s eye or send you for a dictionary. A few types of the many available to consider are:

  • Gantt Charts, Stream Graphs and Alluvial Diagrams for temporal representations
  • Choropleths and Cartograms for geospatial representations
  • Histograms, Pie Charts and Heat Maps for topical/statistical representations
  • Node-link, Chord and Arc Diagrams for relational representations
  • Dendrograms, Treemaps and Radial Trees for hierarchical representations

It’s easy to be overwhelmed by the choices. Helen presented a decision tree designed to help identify which representation to use depending on the parameters of your project. Do you need to show comparisons in your data over time with just a few periods of time but with many categories? Try a Column or Line Chart. But remember what Ben Fry mentions in Visualizing Data: data visualization is just another form of communication, and it will only be successful if the representations make sense to your audience.

Arc Diagram

Are you interested in creating a data visualization for your project? Four members of our group were, and one of us has already created one of her own. Simple data visualizations, such as line, bar and pie charts, can be created with the spreadsheet application installed on your computer. If you have more complex data and feel like challenging yourself, there are several online tools available. Helen recommended and gave brief introductions to Voyager (http://vega.github.io/voyager), Tableau (http://www.tableau.com) and RAW (http://raw.densitydesign.org/), to mention only a few. Do be forewarned though, some of these data visualization tools have a steep learning curve and may be easier to use if you have some experience with coding and scripting.

If all else fails, use your Photoshop skills and convert your favorite data visualization into a piece of modern art or a poster to hang on your wall.

Thank you, Helen Bailey, for introducing NDSR Boston to data vis! And thank you for reading.

Jeff Erickson

Image Credits:

  1. Spreadsheet image produced from a data set from the US Dept. of Labor, Bureau of Labor Statistics retrieved from http://www.bls.gov
  2. U.S. map image, Unemployment data visualization, created by Mike Bostock, retrieved from http://bl.ocks.org/mbostock/4060606
  3. Alluvial Diagram image retrieved from http://www.mapequation.org/apps/AlluvialGenerator.html
  4. Cartogram image retrieved from http://www.stephabegg.com/home/projects/cartograms
  5. Arc Diagram image retrieved from http://www.chrisharrison.net/index.php/Visualizations/BibleViz

Having FITS Over Digital Preservation?

This week my fellow residents and I were fortunate to receive an introduction to the File Information Tool Set (FITS) from Andrea Goethals. Andrea is the Manager of Digital Preservation and Repository Services at Harvard Library, Director of the NDSR Boston program and a developer of the FITS tool. Released in 2009, FITS is a digital preservation tool designed and developed at Harvard Library to identify and validate a wide assortment of file formats, determine technical characteristics, and extract embedded metadata. The technical metadata generated and collected by FITS can be exported in a variety of XML schemas and may be included in other files for digital preservation purposes, such as Harvard Library’s inclusion of FITS output in METS files in its preservation repository.

Digital preservation repositories accept into their care electronic files that are created and saved in a growing number of file formats. Proper identification of a file’s format and the extraction of embedded technical metadata are key aspects of preserving digital objects. Proper identification helps determine how digital objects will be managed and extracting embedded technical metadata provides information that future repository staff or users need to render, transform, access and use the digital objects.

There are several tools available that can identify and validate file formats and extract technical metadata. The great thing about FITS is that it bundles many of them together. The current version of FITS, 0.10.0, includes the following applications:

(Image: list of the tools bundled with FITS)

An explanation of each tool can be found on the FITS web site.

While these tools can be used individually, using them under the FITS umbrella is more efficient. FITS runs all the tools simultaneously, saving you time. FITS knows the strengths and weaknesses of the applications and which tools support which file formats. You benefit by installing and running a single application and receiving output from multiple applications that is appropriate to each file format.

Receiving output from multiple tools can help you verify accurate information when the tools agree, or flag a concern when they don’t. It is also helpful that FITS consolidates and normalizes the output, providing a homogenized data set that is easier to interpret. Each tool’s output is converted to a common FITS XML schema ensuring labels and terminology are used consistently. The extracted metadata can then be exported to different technical metadata schemas such as MIX for images, TextMD for text and DocumentMD for documents. Any of these schemas can then be inserted into other files like METS to provide repository documentation suitable for digital preservation.
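
The agree-or-flag idea can be illustrated with a short Python sketch. This is not FITS code and does not use the FITS XML schema; the function name `consolidate`, the tool names, and the "AGREE"/"CONFLICT" labels are hypothetical, and the input is assumed to be format labels that have already been normalized to a common vocabulary.

```python
from collections import Counter


def consolidate(tool_reports):
    """Merge per-tool format identifications into one record and flag disagreement.

    tool_reports maps a tool name to the format label it reported (hypothetical input).
    """
    if not tool_reports:
        raise ValueError("at least one tool report is required")
    votes = Counter(tool_reports.values())
    status = "AGREE" if len(votes) == 1 else "CONFLICT"
    return {
        "status": status,
        # For each reported format, list the tools that reported it.
        "formats": {
            fmt: sorted(tool for tool, reported in tool_reports.items() if reported == fmt)
            for fmt in votes
        },
    }
```

When every tool reports the same format the record is marked as agreed; any disagreement is surfaced along with which tools said what, so a person can decide which answer to trust.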

FITS is an open-source, Java-based application that is freely available from GitHub or the FITS web site. Because it is Java-based, it runs on Windows, Mac or Linux platforms from a command-line interface. It also provides an API and can be embedded in other applications; it is one of the microservices included in Archivematica. Using a command-line interface can sometimes be intimidating and confusing, but FITS employs a limited number of intuitive commands.

FITS configuration is managed with several XML files that are easily edited with a text editor. The main configuration file, fits.xml, allows you to prioritize tools, include or exclude certain file formats from processing, enable or disable additional features like generating checksums, and determine the various output options. Another positive for the digital preservation community is that FITS is actively maintained, so there is a procedure for addressing bugs and a schedule for releasing updates.

The FITS web site (fitstool.org) is well organized and fully documents the installation, configuration, use, and output options.

I know my post pales in comparison to a live demo of the application. But if it piques your interest, take it for a test drive. You’ve got nothing to lose and you might add a new tool to your digital preservation tool box.

Thanks for reading, Jeff

NDSR Boston Goes to ALA-MW

On Saturday, January 9, 2016, the residents from this year’s cohort of the National Digital Stewardship Residency program in Boston made a trip to the Convention and Exhibition Center on Boston’s waterfront. Boston was the site of this year’s ALA Mid-Winter Meetings. Our mission was to spread the word about the NDSR program and to let the American Library Association know about our digital preservation projects and how they are progressing.

I want to extend a big thank you to Frances Harrell of the Northeast Document Conservation Center for introducing me to Laura McCann and Kate Contakos, the co-Chairs of ALA’s Preservation Administrators Interest Group (PAIG). Laura and Kate gave us a warm welcome and top billing, allowing us to give our presentations while the morning’s first cup of coffee was jolting everybody to attention.

I gave a brief presentation introducing the NDSR program. Then we each talked about our individual projects, describing the projects and summarizing our outcomes to date. After a presentation about a project involving the preservation of electronic news, we all sat together on stage for a question and answer session.

I am pleased to report that our presentations garnered enough interest to elicit several questions. There were project-specific questions as well as a general question about our thoughts on the NDSR program. There was a request to Julie to share the Excel sheet and Wiki she created to manage the dozens of metrics she is investigating in her ISO 16363-focused project. The inquirer thought the fruits of her labor would be a useful tool to others preparing for TRAC certification. In the interest of saving time for the remainder of the agenda, the Q&A had to be ended before all the questions could be answered.

On Sunday, I spread my wings and flew from the NDSR Boston nest to give an individual presentation on the importance of digital preservation. The presentation was an Ignite Presentation, which employs a unique format. There is a five-minute time limit and a mandatory twenty slides. The slides advance automatically every 15 seconds. It was challenging. The fastest five minutes of my life. I got some positive responses afterwards and I am glad I did it.

Next up? The NDSR mid-year event. I’m all warmed up!

Thanks, Jeff

P.S.  If you want to join us for the NDSR mid-year event, it is scheduled for January 26th from 3 until 5 pm in room 021 of 90 Mt. Auburn Street in Harvard Square (with social event following).

NDSR Boston’s NEDCC Field Trip

NDSR Boston took a field trip to scenic Andover, Massachusetts this week to visit the facilities of the Northeast Document Conservation Center. Founded in 1973 by the State Libraries of the six New England states, NEDCC has become a leader in the preservation and conservation community, providing the highest quality conservation services to cultural heritage institutions in the region. It is fitting that NEDCC’s facilities are located in a renovated historic mill building.

The facility is organized into three functional areas: the conservation lab, the imaging lab and the audio lab.

The conservation lab addresses issues related to books, paper and photographs. We met Todd, who described three different treatments for manuscripts with binding issues. He explained that each project is considered in its own context, which results in individualized treatments for each object. We also met with Amanda, a photography conservator. She took a moment to examine a few daguerreotypes of mine. She helped me date the photographs and gave me care and handling instructions.

The imaging lab performs preservation-level imaging, creating digital surrogates of many types of objects, including objects being treated in the conservation lab. Terrence and David explained the imaging equipment and NEDCC’s approach to imaging. They explained that large production imaging and digitization is performed with sophisticated, high quality digital camera equipment. They also demonstrated their custom designed X/Y positioning table with a vacuum feature for holding materials in place. The table moves on two axes (X and Y), both front-to-back and side-to-side, beneath a stationary camera, allowing the greatest flexibility in capturing all types of materials.

The audio lab captures sound with sophisticated new imaging technology called IRENE. IRENE uses 2D and 3D cameras to photograph the grooves in audio cylinders and discs as they rotate. The captured images are then processed by software that converts the images to sound. The revolutionary procedure allows the audio recordings to be captured without additional wear and tear to the objects. Audio can even be captured from cylinders and discs that have been broken into pieces. It is very cool. The audio lab is expanding and will soon be capable of digitizing audio from magnetic tape.

We also met with Frances and Eva, who work in Preservation Services. They have an important role with perhaps a greater impact than their colleagues preserving objects in the labs. Preservation Services is the outreach and educational arm of the operation. They are active in the community, fielding questions, performing preservation assessments, attending conferences and educating their stakeholders with workshops and webinars. They survey the community to keep abreast of what projects people are working on and what questions are being asked.

While NEDCC is focused on the conservation of paper-based collections, they have a growing digital presence. In addition to the imaging and audio labs, they are increasingly providing consulting services and assessments for digital materials and collections. They are aware of the importance of digital stewardship and noted that there are challenges to be overcome by traditional preservation administrators in embracing their role in digital preservation. Aspects of preservation and conservation that are common to both digital and traditional preservation are the importance of organizational support and the need for long-term planning and risk management.

If you live or work in the Boston area and you have never been to NEDCC, I strongly encourage you to organize a tour for your cultural heritage institution or local working group. I guarantee that you will not be disappointed.

Thanks for checking in.

Jeff Erickson

An NDSR Field Trip to Martha’s Vineyard

Hello Readers.

I am the National Digital Stewardship Resident at UMass Boston’s University Archives and Special Collections. Last weekend I visited Martha’s Vineyard, the summer playground of presidents and the site of the most recent Mass Memories Road Show event, the 43rd overall. Despite cooler temperatures and overcast skies, it was a beautiful Fall weekend on the island.

For the uninitiated, the Mass Memories Road Show is an ongoing community-based digital humanities project conducted by UMass Boston since 2004. The goal of the project is to collaborate with Massachusetts cities and towns to organize community-building events where images and stories that document the history of Massachusetts through the eyes of its citizens are collected one town at a time. To learn more about the Mass Memories Road Show project, visit UMass Boston’s Open Archives web site.

I didn’t travel to Martha’s Vineyard on this cool and overcast late October weekend just to support my host institution or to dine on an obscenely large lobster roll at the famous Black Dog Tavern. No, I went to work on my NDSR project. My project is to develop a digital preservation plan for UMass Boston. The ingredients in my test kitchen are the images and videos of the Mass Memories Road Show collection, so I went to learn more about how these digital objects are created and collected.

My original intention was simply to float between several stations responsible for generating the digital objects and the metadata for these objects. I was primarily interested in observing the information stations where metadata about contributed photographs is captured; the scanning and digital stations where contributed photographs are digitized; the keepsake station where large format photographs, artifacts and contributors are photographed; and the video station where contributors are recorded telling their stories. By observing these stations, I hoped to better understand the workflows used to create the collection and to gain insight on how best to preserve the resulting materials.

The surprise for me was the interest in digital preservation that was expressed to me by many of the volunteers affectionately known as “roadies.” Many event volunteers are information professionals who have a personal and professional stake in digital preservation. As I moved around the room, introducing myself to the volunteers, I was continually asked about digital preservation issues and my NDSR project.

An artist and volunteer from the Martha’s Vineyard Museum was interested in how he could create awareness at the museum of the need for digital preservation. He was concerned that no one is collecting born-digital cultural heritage on Martha’s Vineyard and that there will be a large gap in the cultural record as a result. Several librarians engaged me in discussions about digital preservation topics such as preservation-friendly file formats, concerns over obsolescence and the use of cloud storage.

At the end of the day, on the road in Martha’s Vineyard, I had accomplished far more than I had set out to do. My presence and participation at the road show had raised awareness of digital preservation and prompted people to talk about it. All that and a lobster roll, what a trip.

Please visit this space each week and help NDSR Boston continue to generate awareness and interest in digital stewardship issues.

Thanks, Jeff Erickson