NDSR Project Update from UMass Boston

Hello Readers.

The weather in Boston is beginning to warm. The Red Sox have opened their season and the Boston Marathon was run earlier this week. No doubt the crew teams are rowing in the Charles and the Swan Boats will soon be paddling across the pond in the Public Garden. Although spring is the season of renewal, this year it signals the end of the 2015-16 NDSR Boston projects.

Swan Boats in Boston’s Public Garden – image by George Barker 1889

For this blog post, I thought I would update you on the progress I have made on my project so far. To refresh your memory, my project is developing and implementing a digital preservation plan for the University Archives and Special Collections at UMass Boston using Archivematica and DuraCloud. The collection I am working with is the Mass. Memories Road Show (MMRS). I began the project researching digital preservation standards and best practices while familiarizing myself with the digitization and digital asset management practices in use at UMass Boston. I have been busy lately adjusting existing practices for processing digital collections and developing preservation workflows related to the use of Archivematica.

Performing a gap analysis early in the project was an important step. By comparing the existing practices at UMass Boston with digital preservation standards, guidelines and best practices, I identified the greatest areas of need as preparing the collection for ingest and implementing archival storage. To address these broad areas of need, it would be necessary to incorporate the following tasks into the digital preservation workflow.

  1. Generate checksums
  2. Screen for duplicate and unwanted files
  3. Create/assign unique IDs to files
  4. Store files in multiple locations
  5. Include descriptive metadata in archival storage
  6. Create/manage administrative, technical and preservation metadata

Archivematica addresses several of these issues. Other needs are being met by making adjustments to the existing practices.

Boston Red Sox Scorecard 1934 – image from FenwayParkDiaries.com

I discovered that the first two tasks identified by the gap analysis were related. Generating checksums is an important task because it protects the authenticity and data integrity of the collection. A checksum, created by applying a cryptographic hash algorithm to a file, is an alphanumeric code that is effectively unique to that file’s contents and acts like a digital fingerprint. Periodically verifying that checksums have not changed provides evidence that the file has not been modified or damaged over time. Since the objects in the Mass. Memories Road Show collection are copied between hard drives several times and uploaded to the cloud, it is necessary to have a way to verify that each file has retained its original bit stream.
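
To make the idea concrete, here is a minimal Python sketch of generating a checksum and verifying it later. It is only an illustration, not the tool used in this project: the choice of SHA-256 and the function names `file_checksum` and `verify_checksum` are assumptions made for the example.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Return True if the file's current checksum matches the recorded one."""
    return file_checksum(path, algorithm) == expected.lower()
```

Recording the value returned by `file_checksum` when a file is first received, and calling `verify_checksum` after each copy or upload, is one way to confirm that the bit stream has not changed.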

Checksums also played an important role in helping to identify and remove duplicate files. The existing file processing workflow had resulted in the accumulation of numerous duplicate video files. Because duplicate files have identical checksums, they can be found by comparing checksums. A checksum tool called HashMyFiles generates and compares checksums, and identifies when two or more files are identical. Using this tool, 3,500 duplicate video files occupying about 200GB of space were identified and removed from the collection, saving critical processing time and storage capacity.
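
The same principle can be sketched in a few lines of Python. This is not how HashMyFiles works internally; it is a hypothetical illustration of grouping files by checksum, and the names `sha256_of` and `find_duplicates` are invented for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by checksum and keep only groups of two or more."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each entry in the result lists files with identical content. Deciding which copy to keep is still a curatorial decision, so a sketch like this only reports candidates and deletes nothing.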

Crew practice on the Charles River – image by Leslie Jones ca. 1930

Other modifications being made to the file processing workflow involve adopting a file copying tool and standard terminology, adjusting the file naming conventions and digitizing registration forms. Usually a file’s creation date is overwritten when the file is copied. A tool called TeraCopy has been adopted to copy files because it retains the original creation date. Standard terminology has been adopted as well. Digital files that were previously categorized as “originals” and “edited masters” are now identified as “preservation masters” and “production masters.” Since preservation master files and production master files often share identical file names, suffixes have been added to the file naming convention to differentiate the two types. Preservation masters are now identified by an “.f0” suffix while production masters are labeled with an “.f1” suffix. Lastly, registration forms, which give UMass Boston consent to use the digital files in the collection, will now be digitized and uploaded to archival storage with the files they represent, providing additional intellectual control.

Archivematica-specific adjustments are also being made to the workflow to protect the collection’s data integrity, manage metadata and assign unique identifiers to the collection and to the files. An additional checksum file and a text file with descriptive metadata will be created and uploaded to Archivematica with each submission. Archivematica uses the checksum file to verify that files are not damaged during the upload to the Archivematica servers. Archivematica parses the descriptive metadata file into a METS file, allowing the descriptive metadata to be stored with the collection in DuraCloud. The normal Archivematica processing extracts technical metadata from the files, generates additional administrative and preservation metadata, and creates and assigns universally unique identifiers (UUIDs) to the objects in the submission. All of this metadata is saved into the previously mentioned METS file, satisfying all digital preservation best practices for metadata. The UUIDs are saved to a text file, which will be downloaded and imported into the digital asset management system, ensuring that the identifiers created during archival storage are associated with the access copies.
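
As a rough illustration of what preparing a checksum file for a submission might look like, here is a hypothetical Python sketch. The manifest layout (one `digest  relative/path` line per file, similar to the output of common checksum utilities), the use of SHA-256 and the function name `write_manifest` are all assumptions for the example; the file name and format Archivematica actually expects should be checked against its documentation.

```python
import hashlib
from pathlib import Path


def write_manifest(objects_dir, manifest_path, chunk_size=1024 * 1024):
    """Write one 'digest  relative/path' line per file; return the file count."""
    objects_dir = Path(objects_dir)
    lines = []
    for path in sorted(objects_dir.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # Chunked reads keep memory use low for large video files.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        relative = path.relative_to(objects_dir).as_posix()
        lines.append(f"{digest.hexdigest()}  {relative}")
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

A manifest like this travels with the files, so the receiving system can recompute each checksum after upload and compare it with the recorded value.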

So, a lot of progress has been made thus far. There are still a few decisions to make and a little more testing left to do before the entire collection can be uploaded to the cloud, processed through Archivematica and deposited in DuraCloud. The final tasks will be to finish documenting the new procedures and training the archives staff to use the new digital preservation tools and Archivematica.

Boston Marathon winner Johnny Miles 1923 – image by Underwood & Underwood/Corbis

The Boston Marathon is an appropriate metaphor for the NDSR project. There is a lot of anticipation and a “feeling out” process in the beginning. This is followed by a period where you settle into a steady and comfortable pace. Along the way, you encounter and overcome challenges. At this point, you have made it over Heartbreak Hill. Next, Boylston Street and the Finish Line come into view and there is a hectic push to the end. After crossing the Finish Line, there will be the satisfaction and sense of accomplishment that comes with the successful completion of the project. Maybe the traditional meal of a big bowl of pasta will be my reward.

Thanks for reading, Jeff

Web Statistics: CHECK

We’ve accomplished a big milestone here at the State Library: we have completed our review of the web statistics! One of the main objectives of my project was to perform a comprehensive assessment of Massachusetts state government publications, and we chose to use web statistics as a way of accomplishing this goal. The web statistics, gathered by Mass.gov, show us where on agency websites materials are posted and, after a categorization process, tell us what kinds of content agencies are producing. By implementing a priority ranking system, we can also see which kinds of documents are high priority or low priority (according to the collection policy statement we created at the beginning of this process).

We began working with the web stats as a means for identifying and selecting the content we want to preserve and provide access to through the DSpace repository. As the residents learned in our first few months of NDSR, the identification and selection of content are the first steps an institution should take in planning for current and future preservation needs. Reviewing the documents from the web statistics answered the questions of what content our producers create, what content we are required to keep, and what content we feel is most valuable to the library. The answers to these questions will inform an inventory of the kinds of content that agencies produce and will help us update the collection policy statement that we began working on in the fall. The policy statement is meant to be a living document that is continually updated as priorities or types of content change.

Having a policy statement then guides the selection of content for long-term preservation and access. Referring to documentation of our practices allows the staff to make well-informed decisions about what kinds of content are most valuable for the library and its patrons, and helps us maximize resources. Rather than spending time and energy capturing things like ephemeral material, we can allocate time and resources towards capturing things like reports or meeting materials. Our policy is something we can use both to select materials and to justify those decisions if a patron asks why we capture certain items and not others. Documenting these actions and procedures is an important step for the State Library in building its digital preservation practice.

So how many documents did we go through? All told, we reviewed and appraised over 75,000 documents, which is pretty incredible! Many of these documents are already in DSpace and many are low priority, so we do not need to catalog and ingest every single one. I’m currently compiling and analyzing the data we pulled from the statistics (which includes the total number per agency as well as the breakdown of monographic and serial documents). I’ll know more soon about how many high priority documents we need to handle, and then will be working on a plan for the low priority documents as well. All of this will be documented and included in my final report for the State Library. In addition to using the data collected from web statistics in the identification and selection process, the web statistics allow us to use quantitative data as justification for requesting additional resources. Knowing that we have only so many resources in place currently and seeing how much work needs to be done (with data to back that up), we can use this as proof of what resources we should add to handle the workload ahead.

This process was not always shiny or fancy, and at times it was an uphill climb (going through 10,000 documents from one agency was a particular low point for me!), but we continually fine-tuned the workflow until the whole staff got into a steady rhythm. This was a great lesson for me in designing and testing workflows over time, being flexible and open to new ideas, and keeping the big picture in mind. Some of the challenges included managing many, many spreadsheets at once; tracking progress over time (each staff member was responsible for their own agencies, but I was in charge of the big picture, so I needed to be kept up-to-date on everyone’s status without being overbearing); and ensuring we were capturing only the necessary data (which was part of the workflow evolution: we began by tracking lots of data, then boiled it down to the most essential to save time). Every tweak or change in the workflow was done in service of getting a better understanding of the scope of state publications, and ultimately I feel we’ve achieved that.

I’m taking the team out for lunch next week to thank them for all of their help reviewing these documents and to celebrate this accomplishment. Again, this step meets a major goal for us and will help inform the next steps for my project. With a month left, I’ll be documenting this whole process and including much of my data collection in a final report for the State Library. Thanks for checking in!


MIT Libraries Host Event: Resumes & Interviews

On Wednesday, March 30th, MIT Libraries held their NDSR Host Event, a Resume and Interview Workshop. Each NDSR resident brought the description of a job for which they wanted to apply, their resume, and a cover letter. Each resident paired up with an interviewer (one of the NDSR hosts) to review whether the resume and cover letter addressed the job description. The interviewers posed some practice questions to each resident, then provided some suggestions for real job searches. I enjoyed organizing the event and learned a lot from the discussions. I hope the feedback will be as helpful to you as it has been to me.

Before the interview, it is important that you do your research about the job and the institution:

  • Be prepared for questions about what interests you about the job.
  • Learn the institution’s terminology.
  • Look over the projects they are doing and think about how you could contribute to them.
  • Arrive early to the interview but wait till actual interview time to approach them. This is a weird balancing act.

Here is some advice from the hosts about preparing for the interview:

  • In your interview, try to define the more technical lingo if it’s not already defined in the job description. This is a delicate balance between showing that you know the lingo and lecturing the interviewers. You want to be able to do the former, not the latter.
  • Show how your skills can transfer to the job, point out similarities between this job and your past jobs, and have different examples ready to showcase collaboration skills.
  • When they ask the question “How can you bring your knowledge into the institution?” you can wrap your answer around the institution’s future projects and/or mission.
  • When asked a question referencing the “required experience” skill set, make a case with examples that are short and to the point.

At the end of the interview, there is a question that is almost always asked: “Do you have any questions for me/us?”

  • One thing that is rarely considered is that the job interview works both ways. During a job interview you also have the opportunity to decide if you would really like the position, your co-workers, or the institution.
  • If the job description has everything but the kitchen sink in the list of duties/skills, you can ask: “This is a really wide range of skills you are looking for. What are your priorities?”

Presentations can be a part of an interview process, since it is all about communication. Here are some tips for presentations:

  • Address the question.
  • Be aware of time.
  • While creating your presentation, ask yourself if you are addressing the topic within the time frame while saying it well.
  • Practice.

Having references is key. Before this event, I hadn’t given much thought to their importance in a job application. I assumed I would get good reviews if someone indicated that I could use them as a reference. The following feedback proves how wrong that assumption was:

  • Before you apply for a job, send your chosen references the job description and ask them if they will be a reference for that particular job. Nothing is worse than having them ask your interviewers “So what job is this in reference to…?” or “What’s their name, again?”
  • If you will need recommendation letters, let your references know at least two weeks ahead of time—longer if possible.
  • Ask your reference if they would give you a positive reference. If they won’t, don’t use them as a reference. Imagine your interviewers getting this response: “She really gave me as a reference?”
  • In the document listing your references, write why that person is a reference for you.
  • Take into account how responsive your reference is, and whether they are likely to return emails or phone calls.
  • A very hard but good question to ask your reference is: “Would you hire me again?” You may not want to hear the answer, but it’s a less subtle indicator of what your reference will say about you!

Good luck!