Update on My Project

Hello friends!
Spring is here, allegedly, although you wouldn’t know it from the snow we got recently. But it’s beginning to warm up, the river is thawing, and Harvard Yard is filling up with robins and tourists…


Spring comes to Cambridge (taken on March 22nd)

The arrival of “Spring” also heralds the end of our residency…that May 31st deadline is on the horizon now. In the next few months, I’ll be trying to tie up all the loose ends and finish the project, and I thought I’d update you on what I’ve been up to so far.

The big news is, I officially finished the “self-assessment” phase of my project (also the longest phase). Hooray! If you recall from my earlier posts, I was using an Excel sheet to track how Harvard met the different metrics of ISO 16363 – green for things that are being done and documented, yellow for things that are being done but not documented, and red for things that are not being done at all. So now the spreadsheet is all filled in!



Zoomed out so you can see all the colors


Now that that’s all done (whew!), I’m working on summarizing my findings in a report. Then, with the report done, I’m going to attempt to make some data visualizations that show my results in a more visually appealing manner. Andrea has given me some questions, which I’ll share below, and which I hope to address with the visualizations and my report. They are:

  • Where do we stand related to the standard?
  • Where are the gap areas?
  • How can we characterize the gap areas?
  • How might we address the gaps? What would be a good strategy to approach tackling the gap areas?

In particular, I am looking to see what commonalities exist, if any, among the gap areas. My hope is that I can suggest a few documents that could each fill several gaps at once – this would allow the DRS to close the gaps most efficiently. For example, one thing I’ve found so far is that many of the yellow areas relate to the ingest process, so it seems like a single document about the ingest process could fill several gaps. In the coming weeks, I’m going to keep looking for those kinds of commonalities and try to display them visually. I got some good ideas from Helen at our workshop last week (which Jeff blogged about), and I hope I can find a good way to display all this information.
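If you’re curious what that commonality check might look like in code, here’s a minimal sketch. It assumes a hypothetical CSV export of the assessment spreadsheet with `metric`, `status`, and `process_area` columns – the real spreadsheet’s layout and column names will differ:

```python
import csv
from collections import Counter

def gap_areas_by_process(csv_path):
    """Tally yellow/red (gap) metrics by process area, e.g. 'ingest'."""
    gaps = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Yellow = done but undocumented; red = not done at all.
            if row["status"].strip().lower() in ("yellow", "red"):
                gaps[row["process_area"]] += 1
    return gaps

# A process area with many gaps (e.g. ingest) is a candidate for a
# single document that closes several gaps at once.
```

Sorting the result with `gaps.most_common()` would surface the areas where one new document pays off the most.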

Outside of the main project, I’ve also been working on a few “twenty-percent” things – those professional-development, non-project projects.

For example, the rest of the residents and I will be hosting a webinar in a few weeks for Simmons Continuing Education. We’ll be talking about different standards, such as ISO 16363, and how these standards can be used for a gap analysis.

Additionally, during the last week of May, I’ll be participating in ALA Preservation Week here at Harvard. I’m going to have a table about Personal Digital Archiving, and I’ll be rotating around to different libraries and schools on campus and teaching students about how they can save their digital lives!

Finally, I’m recording a webinar, along with D.C. resident Jessica Tieman, which we’ll make available to other NDSR residents afterwards.

So…lots of stuff going on! It’s going to be a busy couple of months, so make sure to keep checking back here at our blog to see how things wrap up!



Data Visualization: Choropleths and Cartograms and Treemaps, oh my!

Hello, readers.

Last week the NDSR Boston cohort visited with Helen Bailey, a digital curation analyst at MIT. In her spare time, Helen has become a data visualization expert. Helen provides data visualization support to the MIT Libraries and is sharing her knowledge of data visualization through presentations and workshops. If you think you are unfamiliar with data visualization, think again. I guarantee you have used data visualizations and maybe even created a few.

To set the record straight, let’s define the term before giving some examples and talking about why data visualizations are useful and what it takes to produce them. Helen offered the following two definitions:

“Information visualization is a mapping between discrete data and a visual representation.” from Lev Manovich, “What Is Visualization”

“Information visualization is a set of technologies that use visual computing to amplify human cognition with abstract information.” from Stuart Card, “Information Visualization”

While both definitions make sense, I prefer the second, because a well-chosen visualization provides meaning and understanding where there might otherwise be only information overload. It seems to me that the age-old saying “a picture is worth a thousand words” is appropriate when discussing the purpose and usefulness of data visualizations.


Helen notes that visualizing data can be used to summarize a data set, highlight specific aspects of the data, and identify patterns and outliers. Data, typically organized in tables or spreadsheets, can be almost impossible to digest, especially in large quantities. Even smaller data sets can contain so many rows and columns that they run right off your screen, making it difficult to draw conclusions or spot trends. Organizing the raw data into visual representations is often the only practical way to make it useful.


Alluvial Diagram

The first steps in creating a data visualization are to determine:

  • What questions will the data answer?
  • How will the visualization be used?
  • What is the best type of visualization to use?
  • Who is the target audience?

It’s important to answer each of these questions because there are so many types of visualizations available. You can’t be a one-trick pony, reusing the same representation for all occasions. Certain representations work better for temporal (when), geospatial (where), topical/statistical (what, how much), relational (with whom) and hierarchical (ordered relationships) data.



The types of representations range from simple to complex and from traditional to innovative. The names of the visualizations in the list below will either allow you to draw a picture in your mind’s eye or send you for a dictionary. A few types of the many available to consider are:

  • Gantt Charts, Stream Graphs and Alluvial Diagrams for temporal representations
  • Choropleths and Cartograms for geospatial representations
  • Histograms, Pie Charts and Heat Maps for topical/statistical representations
  • Node-link, Chord and Arc Diagrams for relational representations
  • Dendrograms, Treemaps and Radial Trees for hierarchical representations

It’s easy to be overwhelmed by the choices. Helen presented a decision tree designed to help identify which representation to use depending on the parameters of your project. Do you need to show comparisons in your data over time, with just a few periods but many categories? Try a Column or Line Chart. But remember what Ben Fry notes in Visualizing Data: data visualization is just another form of communication, and it will only succeed if the representations make sense to your audience.
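As a rough illustration of the decision-tree idea – not Helen’s actual tree, just a sketch pairing the data categories above with the chart families listed earlier:

```python
# Map the kind of question your data answers (when, where, what, with whom,
# in what order) to a family of chart types. The pairings follow the lists
# above; a real decision tree would branch on more parameters than this.
CHART_FAMILIES = {
    "temporal": ["Gantt chart", "stream graph", "alluvial diagram"],
    "geospatial": ["choropleth", "cartogram"],
    "statistical": ["histogram", "pie chart", "heat map"],
    "relational": ["node-link diagram", "chord diagram", "arc diagram"],
    "hierarchical": ["dendrogram", "treemap", "radial tree"],
}

def suggest_charts(data_kind):
    """Return candidate representations for a given kind of data."""
    try:
        return CHART_FAMILIES[data_kind]
    except KeyError:
        raise ValueError(f"unknown data kind: {data_kind!r}")
```

Even a toy lookup like this captures the main lesson: pick the representation to match the question, not the other way around.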


Arc Diagram

Are you interested in creating a data visualization for your project? Four members of our group were, and one of us has already created her own. Simple data visualizations – line, bar and pie charts – can be created with the spreadsheet application installed on your computer. If you have more complex data and feel like challenging yourself, there are several online tools available. Helen recommended and gave brief introductions to Voyager (http://vega.github.io/voyager), Tableau (http://www.tableau.com) and RAW (http://raw.densitydesign.org/), to mention only a few. Be forewarned, though: some of these data visualization tools have a steep learning curve and may be easier to use if you have some experience with coding and scripting.

If all else fails, use your Photoshop skills and convert your favorite data visualization into a piece of modern art or a poster to hang on your wall.

Thank you, Helen Bailey, for introducing NDSR Boston to data vis! And thank you for reading.

Jeff Erickson

Image Credits:

  1. Spreadsheet image produced from a data set from the US Dept. of Labor, Bureau of Labor Statistics retrieved from http://www.bls.gov
  2. U.S. map image, Unemployment data visualization, created by Mike Bostock, retrieved from http://bl.ocks.org/mbostock/4060606
  3. Alluvial Diagram image retrieved from http://www.mapequation.org/apps/AlluvialGenerator.html
  4. Cartogram image retrieved from http://www.stephabegg.com/home/projects/cartograms
  5. Arc Diagram image retrieved from http://www.chrisharrison.net/index.php/Visualizations/BibleViz


Last week I attended my first Code4Lib conference, in Philadelphia. Code4Lib started in 2003 as a mailing list and has since grown into a thriving community of hackers, cataloguers, designers, developers, librarians and even archivists. This year was the 11th annual conference, and there was a significant online presence, including an IRC channel, a Slack channel, and the hashtag #c4l16. All presentations and lightning talks from the conference were streamed live, and the videos are still available on the Code4Lib YouTube channel.


Code4Lib 2016 Annual Conference logo

The week started off with a day of pre-conference workshops. I attended the Code4Arc workshop, which focused on how coding and tech are used slightly differently in the archives world. Since archives have different goals and use different descriptive standards, it makes sense to carve out a space exclusive to archival concerns. One common interest was in how multiple tools are connected when they’re implemented in the same archive. Many attendees were implementing ArchivesSpace to handle archival description and were concerned about interoperability with other tools. Another concern was the processing and management of hybrid collections, which contain both analog and digital material. Digital is often addressed as completely separate from analog, but many collections come into the archive containing both, and that relationship must be maintained. Archivists in the workshop called for tools to be inclusive of both digital and analog holdings, especially with regard to processing and description.

I joined NDSR-NYC alum Shira Peltzman to kick off the presentation part of the conference with a discussion of Implementing ‘Good Enough’ Digital Preservation (video here). The goal of our presentation was to make digital preservation attainable, even for those with limited support. We began with a brief overview of the three tenets of digital preservation – bit preservation, content accessibility, and ongoing management – before diving into specific resources and strategies for implementing ‘good enough’ digital preservation. We defined ‘good enough’ as the most you can do with what you have, based on available staff and budget, collection needs, and institutional priorities. The main offerings from our presentation were expansions on the NDSA Levels of Digital Preservation. We mapped each recommendation to useful tools, resources and policy recommendations based on our experience in NDSR and beyond. We also proposed an additional level to address the access issues related to digital preservation, such as redaction of personal information and making finding aids publicly accessible. Since the NDSA Levels are such a common tool for getting started with digital preservation, we hope that these additions will make it easier to move to the next level – no matter which level you are currently at.


Shira and myself presenting on “Good Enough” Digital Preservation

Our talk ended with a call for engagement with our NDSA Level additions and, more generally, for sharing policies and workflows with the community. The call for shared documentation was a common thread through many presentations at the conference. Dinah Handel and Ashley Blewer also discussed this in their talk “Free Your Workflows (and the rest will follow)” (video here). They made a great point about why people don’t share their documentation: because it’s scary! There’s the constant battle against imposter syndrome and the fear of public failure, not to mention the fear that as soon as a policy is polished enough to share widely, it is already outdated. All of these are very real reasons to hesitate, but the advantages of shared documentation far outweigh them. And nowhere is this more true than in the realm of open source solutions. Open source projects often rely on the community to notify them of bugs and to help create complete and accessible documentation. Shared policies and workflows help to build that community, and help the field understand how tools and strategies are actually implemented.

If you are reading this and thinking, “I have documentation that I could share, but where would I put it?” – worry not! There are a few great places for sharing open documents.

Scalable Preservation Environments (SCAPE) collects published digital preservation policies.

Library Workflow Exchange collects all library-related workflows, including digital preservation.

Community Owned digital Preservation Tool Registry (COPTR) is a wiki for all digital preservation tools, and provides a space for users to present their experiences with any given tool. This information is automatically pushed to the interactive tool grid created by Preserving digital Objects With Restricted Resources (POWRR).

GitHub is known as a repository for code, but it can also be a great storage option for workflows and documentation. It is especially useful for managing version control for living documents.

Do you know of another place to share digital preservation documentation? Let us know in the comments!


An Update from the State Library

In college, I took several courses that involved working closely with one of the many helpful librarians on campus. She would often refer to our projects as “iterative” – so much so that she would even laugh as she said it. Six months into my residency at the State Library of Massachusetts, the joke is on me, as our process has been very iterative. This post will cover what we’ve been up to recently and what is ahead for us in the next few months.

A quick recap: we’re exploring more efficient ways of finding, downloading, and providing access to digital state publications. We’ve been working with web statistics downloaded from Mass.gov to assess the extent of digital publications and to determine what is most valuable to preserve for the Library and its users.

The web statistics workflow has, of course, evolved, requiring flexibility and an open mind. When we began using the statistics, each member of the project team was checking each URL listed and noting the type of document it was; then each team member would rank the document on a scale of 1–5 (1 being lowest priority, 5 being highest) in shared spreadsheets. Once we all had a solid understanding of what was highest and lowest priority, we determined that we didn’t all need to rank every type of document, so each staff member took on a different agency and entered their own priority rankings. We also created a new spreadsheet to consolidate that data: how many documents there were in total, and how many of each priority ranking. This gives a bigger-picture assessment of how many state publications exist and how many high-priority documents we need to handle quickly. A few weeks later, we decided to add a category in the spreadsheets noting whether documents were series, serials, or monographs, which affects how the items are cataloged. Though these are relatively minor changes to the workflow, they reflect how important it is to continually check in with the project team about what’s working well and what could be improved. It is very iterative!
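The consolidation step is simple enough to sketch. Here’s a rough illustration (not our actual spreadsheet formulas), assuming the per-agency rankings have been reduced to (agency, priority) pairs:

```python
from collections import Counter

def consolidate(rankings):
    """Combine per-agency ranking sheets into one big-picture tally.

    rankings: iterable of (agency, priority) pairs, with priority 1-5
    (1 = lowest, 5 = highest). Returns (total documents, counts by priority).
    """
    by_priority = Counter(priority for _, priority in rankings)
    total = sum(by_priority.values())
    return total, by_priority

# by_priority[5] is the number of high-priority publications to handle
# quickly; total is the overall count of state publications reviewed.
```

The same shape of tally works whether the source is a shared spreadsheet export or a database query; the point is having one number per priority level to plan against.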

While that process is ongoing, we are also examining how to download the thousands of publications we’ve reviewed through the web stats. I researched tools that would help us batch-download PDF or Word documents from sites, taking into account the Library’s resources. Though CINCH, a tool developed by the State Library of North Carolina, fits our needs well, the installation requirements were not feasible for us. So I began playing around with a Firefox add-on called DownThemAll! (yes, the exclamation mark is part of the name – though it is very exciting). DownThemAll (dTa) allows a user to upload a list of URLs and specify the folder in which you’d like the files saved; then, like magic, the files are downloaded (dTa has other features and functions, such as a download accelerator). Any URLs that fail are noted and skipped, so you can go back and check whether the problem was, for example, a 404 error or human error.
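In a pinch, the same “list of URLs in, files out” pattern can be approximated with a short script. This is a rough stdlib sketch of the idea, not a replacement for dTa (the function name and error handling are mine); like dTa, it records failures for later review instead of stopping:

```python
import os
import urllib.error
import urllib.request

def download_all(urls, dest_dir):
    """Download each URL into dest_dir; return (saved, failed) lists.

    Failures (e.g. 404s or bad URLs) are collected rather than raised,
    so they can be checked afterwards, much like dTa's error list.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        # Derive a filename from the last path segment of the URL.
        name = url.rstrip("/").rsplit("/", 1)[-1] or "index"
        path = os.path.join(dest_dir, name)
        try:
            urllib.request.urlretrieve(url, path)
            saved.append(path)
        except (urllib.error.URLError, OSError) as exc:
            failed.append((url, str(exc)))
    return saved, failed
```

A script like this is only a stopgap too, of course, but it has the advantage of not depending on any one browser.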

The tool is free, easy, and works very well! My concern, however, is that it is not backed by an institution, and it’s unclear how much funding or technical support the developers have. What if I come into work tomorrow and it’s gone? Whom do I contact? Though they offer some support, it’s limited (for example, I emailed about an issue three weeks ago and haven’t heard back). dTa works only with Firefox: what if there’s an issue with the browser and we can no longer access the tool? While the tool works well and will be useful in the short term, I don’t see it as a sustainable solution for batch downloading. This is another part of the process that we’ll need to keep revisiting over time. And if anyone has ideas or suggestions, please let me know!

One big success we’ve had is collaborating with MassIT to gain access to their Archive-It account. Though MassIT manages the account, they’re capturing the material that we need – webpages with links to documents published by state agencies – so it makes perfect sense to work together to use Archive-It to its full capacity. I worked with MassIT to customize the metadata on the site, then wrote some information for the general public, published on our website, about how to access and use Archive-It. We’re considering how best to incorporate Archive-It into our workflow. While DSpace will remain our central repository, where we can provide enhanced access to publications through metadata, Archive-It is capturing more material than we will be able to, which is a huge help to us. (Archive-It also lets us generate PDF reports listing all the PDFs captured in its crawls, and we can use dTa to download them. We’re not using this yet, but it is an option for the State Library going forward.)

With each iteration of the workflow, I feel we are getting closer to solving some of the big questions of my project. We hold weekly staff meetings to check in about the current process. Hearing each staff member’s thoughts on challenges or potential areas of improvement has taught me much about how to continually bring fresh eyes to an ongoing process, and how to keep the big picture in mind while working through smaller details. Flexibility is key not only with this project, but with digital preservation as a whole, as processes, tools, software, and other factors continue to evolve.

I hope to leave the State Library with some options for how to take this project forward, even if not all of the questions have a definitive answer. We’re also now focusing our attention on other issues in the project, such as outreach to state agencies and the cataloging workflow between the Library’s Evergreen OPAC and DSpace. There’s much to accomplish in the remaining weeks, and I look forward to updating you as we make progress on these goals.

Thank you!