Data Visualization: Choropleths and Cartograms and Treemaps, oh my!

Hello Readers.

Last week the NDSR Boston cohort visited with Helen Bailey, a digital curation analyst at MIT. In her spare time, Helen has become a data visualization expert. Helen provides data visualization support to the MIT Libraries and is sharing her knowledge of data visualization through presentations and workshops. If you think you are unfamiliar with data visualization, think again. I guarantee you have used data visualizations and maybe even created a few.

To set the record straight, let’s define the term before giving some examples and talking about why data visualizations are useful and what it takes to produce them. Helen offered the following two definitions:

“Information visualization is a mapping between discrete data and a visual representation.” from Lev Manovich, “What Is Visualization”

“Information visualization is a set of technologies that use visual computing to amplify human cognition with abstract information.” from Stuart Card, “Information Visualization”

While both definitions make sense, I prefer the second because a well-chosen visualization provides meaning and understanding where there might otherwise be only information overload. It seems to me that the age-old saying “a picture is worth a thousand words” is appropriate when discussing the purpose and usefulness of data visualizations.

A raw spreadsheet versus a U.S. unemployment map visualization

Helen notes that visualizing data can be used to summarize a data set, highlight specific aspects of the data, and identify patterns and outliers. Data, typically organized in tables or spreadsheets, can be almost impossible to digest, especially in large quantities. Even smaller data sets contain so many rows and columns that they run right off your screen, making it difficult to draw conclusions or spot trends. Organizing the raw data into visual representations is often the only practical way to make the data useful.

Alluvial Diagram

The first steps in creating data visualizations are to determine:

  • What questions will the data answer?
  • How will the visualization be used?
  • What is the best type of visualization to use?
  • Who is the target audience for the visualization?

It’s important to answer each of these questions because there are so many types of visualizations available. You can’t be a one-trick pony, reusing the same representation for all occasions. Certain representations work better for temporal (when), geospatial (where), topical/statistical (what, how much), relational (with whom) and hierarchical (ordered relationships) data.

Cartogram

The types of representations range from simple to complex and from traditional to innovative. The names of the visualizations in the list below will either let you draw a picture in your mind’s eye or send you reaching for a dictionary. A few of the many types available to consider are:

  • Gantt Charts, Stream Graphs and Alluvial Diagrams for temporal representations
  • Choropleths and Cartograms for geospatial representations
  • Histograms, Pie Charts and Heat Maps for topical/statistical representations
  • Node-link, Chord and Arc Diagrams for relational representations
  • Dendrograms, Treemaps and Radial Trees for hierarchical representations

It’s easy to be overwhelmed by the choices. Helen presented a decision tree designed to help identify which representation to use depending on the parameters of your project. Do you need to show comparisons in your data over time, with just a few periods but many categories? Try a column or line chart. But remember what Ben Fry notes in Visualizing Data: data visualization is just another form of communication, and it will only be successful if the representations make sense to your audience.
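To make that “few time periods, many categories” case concrete, here is a minimal sketch in Python with matplotlib (not one of the tools Helen demonstrated). The category names and numbers are made up purely for illustration.

```python
# A line chart comparing several categories over a few time periods.
# Categories and counts are invented example data.
import matplotlib.pyplot as plt

years = [2013, 2014, 2015]  # just a few time periods
categories = {               # many categories
    "Manuscripts": [120, 135, 150],
    "Photographs": [300, 280, 310],
    "Audio": [40, 65, 90],
    "Video": [25, 30, 70],
}

for name, counts in categories.items():
    plt.plot(years, counts, marker="o", label=name)

plt.xticks(years)
plt.xlabel("Year")
plt.ylabel("Items digitized")
plt.title("Comparison across categories over a few time periods")
plt.legend()
plt.show()
```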

Arc Diagram

Are you interested in creating a data visualization for your project? Four members of our group were, and one of us has already created one of her own. Simple data visualizations (line, bar and pie charts) can be created with the spreadsheet application installed on your computer. If you have more complex data and feel like challenging yourself, there are several online tools available. Helen recommended and gave brief introductions to Voyager (http://vega.github.io/voyager), Tableau (http://www.tableau.com) and RAW (http://raw.densitydesign.org/), to mention only a few. Be forewarned, though: some of these data visualization tools have a steep learning curve and may be easier to use if you have some experience with coding and scripting.

If all else fails, use your Photoshop skills and convert your favorite data visualization into a piece of modern art or a poster to hang on your wall.

Thank you, Helen Bailey, for introducing NDSR Boston to data vis! And thank you for reading.

Jeff Erickson

Image Credits:

  1. Spreadsheet image produced from a data set from the US Dept. of Labor, Bureau of Labor Statistics retrieved from http://www.bls.gov
  2. U.S. map image, Unemployment data visualization, created by Mike Bostock, retrieved from http://bl.ocks.org/mbostock/4060606
  3. Alluvial Diagram image retrieved from http://www.mapequation.org/apps/AlluvialGenerator.html
  4. Cartogram image retrieved from http://www.stephabegg.com/home/projects/cartograms
  5. Arc Diagram image retrieved from http://www.chrisharrison.net/index.php/Visualizations/BibleViz


Code4Lib

Last week I attended my first Code4Lib conference, held this year in Philadelphia. Code4Lib started in 2003 as a mailing list and has since grown into a thriving community of hackers, cataloguers, designers, developers, librarians and even archivists. This year was the 11th annual conference and there was a significant online presence, including an IRC channel, a Slack channel, and the hashtag #c4l16. All presentations and lightning talks from the conference were streamed live and the videos are still available on the Code4Lib YouTube channel.

c4llogo

Code4Lib 2016 Annual Conference logo

The week started off with a day of pre-conference workshops. I attended the Code4Arc workshop, which focused on how coding and tech are used slightly differently in the archives world. Since archives have different goals and use different descriptive standards, it makes sense to carve out a space exclusive to archival concerns. One common interest was in how multiple tools are connected when they’re implemented in the same archive. Many attendees were implementing ArchivesSpace to handle archival description and were concerned about interoperability with other tools. Another concern was the processing and management of hybrid collections, which contain both analog and digital material. Digital is often addressed as completely separate from analog, but many collections come into the archive containing both, and that relationship must be maintained. Archivists in the workshop called for tools to be inclusive of both digital and analog holdings, especially with regard to processing and description.

I joined NDSR-NYC alum Shira Peltzman to kick off the presentation part of the conference with a discussion of Implementing ‘Good Enough’ Digital Preservation (video here). The goal of our presentation was to make digital preservation attainable, even for those with limited support. We began with a brief overview of the three tenets of digital preservation (bit preservation, content accessibility, and ongoing management) before diving into specific resources and strategies for implementing good enough digital preservation.

Shira and me presenting on “Good Enough” Digital Preservation

We defined ‘good enough’ as the most you can do with what you have, based on available staff and budget, collection needs, and institutional priorities. The main offerings from our presentation were expansions on the NDSA Levels of Digital Preservation. We mapped each recommendation to useful tools, resources and policy recommendations based on our experience in NDSR and beyond. We also proposed an additional level to address the access issues related to digital preservation, such as redaction of personal information and making finding aids publicly accessible. Since the NDSA Levels are such a common tool for getting started with digital preservation, we hope that these additions will make it easier to move to the next level, no matter what level you are currently at.

Our talk ended with a call for engagement with our NDSA Levels additions and, more generally, for sharing policies and workflows with the community. The call for shared documentation was a common thread through many presentations at the conference. Dinah Handel and Ashley Blewer also discussed this in their talk “Free Your Workflows (and the rest will follow)” (video here). They made a great point about why people don’t share their documentation: because it’s scary! There’s the constant battle against imposter syndrome, the fear of public failure, not to mention the fear that as soon as a policy is polished enough to share widely it is also outdated. All of these are very real reasons to hesitate, but the advantages that come from shared documentation far outweigh them. And nowhere is this more true than in the realm of open source solutions. Open source projects often rely on the community to report bugs and help create complete and accessible documentation. Shared policies and workflows help to build that community, and help the field understand how tools and strategies are actually implemented.

If you are reading this and thinking, “I have documentation that I could share, but where would I put it?” – worry not! There are a few great places for sharing open documents.

Scalable Preservation Environments (SCAPE) collects published digital preservation policies.

Library Workflow Exchange collects all library-related workflows, including digital preservation.

Community Owned digital Preservation Tool Registry (COPTR) is a wiki for all digital preservation tools, and provides a space for users to present their experiences with any given tool. This information is automatically pushed to the interactive tool grid created by Preserving digital Objects With Restricted Resources (POWRR).

GitHub is known as a repository for code, but it can also be a great storage option for workflows and documentation. It is especially useful for managing version control for live documents.

Do you know of another place to share digital preservation documentation? Let us know in the comments!


An Update from the State Library

In college, I took several courses that involved working closely with one of the many helpful librarians on campus. She would often refer to our projects as “iterative” – so much so that she would even laugh as she said it. Six months into my residency at the State Library of Massachusetts, the joke is on me, as our process has been very iterative. This post will cover what we’ve been up to recently and what is ahead for us in the next few months.

A quick recap: we’re exploring more efficient ways of finding, downloading, and providing access to digital state publications. We’ve been working with web statistics downloaded from Mass.gov to assess the extent of digital publications and to determine what is most valuable to preserve for the Library and its users.

The web statistics workflow has, of course, evolved, requiring flexibility and an open mind. When we began using the statistics, each member of the project team checked each URL listed, noted the type of document it was, and then ranked the document on a scale of 1-5 (1 being lowest priority, 5 being highest) using shared spreadsheets. Once we all had a solid understanding of what was highest and lowest priority, we determined that we didn’t each need to rank each type of document, so each staff member would tackle a different agency and enter their own priority rankings. We also created a new spreadsheet to consolidate that data into how many documents there were in total and how many there were of each priority ranking. This gives a bigger-picture assessment of how many state publications exist, and how many high-priority documents we need to handle quickly. A few weeks later, we decided to add a category in the spreadsheets to note whether these documents were series, serials, or monographs, which affects the way the items are cataloged. Though these are relatively minor changes in the workflow, they do reflect how important it is to continually check in with the project team about what’s working well and what could be improved. It is very iterative!
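For the curious, that consolidation step can also be done outside a spreadsheet. Here is a minimal sketch in Python with pandas; the file name and column names are hypothetical, not our actual spreadsheets.

```python
# Consolidate per-URL priority rankings into summary counts.
# "agency_review.csv" and the "priority" column are hypothetical names.
import pandas as pd

# One row per reviewed URL, with a priority ranking from 1 to 5.
df = pd.read_csv("agency_review.csv")

total_documents = len(df)
by_priority = df["priority"].value_counts().sort_index()

print(f"Total documents reviewed: {total_documents}")
print("Documents per priority ranking (1 = lowest, 5 = highest):")
print(by_priority)

# High-priority documents to handle quickly:
high_priority = df[df["priority"] >= 4]
print(f"High-priority documents: {len(high_priority)}")
```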

While that process is ongoing, we are also examining how to download the thousands of publications we’ve reviewed through the web stats. I researched tools that would help us batch download PDF or Word documents from sites, taking into account the Library’s resources. Though CINCH, a tool developed by the State Library of North Carolina, fits our needs well, the installation requirements were not feasible for us. I began playing around with a Firefox add-on called DownThemAll! (yes, the exclamation mark is part of the name, though it is very exciting). DownThemAll (dTa) allows a user to upload a list of URLs and specify the folder in which you’d like the files saved; then, like magic, the files are fully downloaded (dTa has other features and functions, such as a download accelerator). Any URLs that fail are noted and not downloaded, so you can go back and check whether the problem was, for example, a 404 error or human error. A rough script equivalent of that workflow is sketched below.
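This sketch, in Python with the requests library, only illustrates the same pattern of batch downloading from a URL list (it is not how dTa works internally), and the file and folder names are hypothetical.

```python
# Read a list of URLs, save each file into a chosen folder,
# and log any failures for later review.
import os
import requests

url_list = "publication_urls.txt"   # hypothetical: one URL per line
out_dir = "downloads"               # hypothetical destination folder
os.makedirs(out_dir, exist_ok=True)

errors = []
with open(url_list) as f:
    for url in (line.strip() for line in f):
        if not url:
            continue
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()  # flags 404s and other HTTP errors
            filename = url.rsplit("/", 1)[-1] or "index.html"
            with open(os.path.join(out_dir, filename), "wb") as out:
                out.write(resp.content)
        except requests.RequestException as e:
            errors.append((url, str(e)))  # note the error, skip the file

for url, err in errors:
    print(f"FAILED: {url} ({err})")
```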

The tool is free, easy, and works very well! My concern, however, is that it is not backed by an institution, and it’s unclear how much funding or technical support the developers have. What if I come into work tomorrow and it’s gone? Who do I contact? Though they offer some support, it’s limited (for example, I emailed about an issue three weeks ago and haven’t heard back). dTa works only with Firefox; what if there’s an issue with the browser and we can no longer access the tool? While the tool works well and will be useful in the short term, I don’t see it being a sustainable solution for batch downloading. This is another part of the process that we’ll need to keep revisiting over time. And if anyone has ideas or suggestions, please let me know!

One big success we’ve had is collaborating with MassIT to gain access to their Archive-It account. Though MassIT manages the account, they’re capturing the material that we need (webpages with links to documents published by state agencies), so it makes perfect sense to work together to use Archive-It to its full capacity. I worked with MassIT to customize the metadata on the site, then wrote some information, published on our website, about how the general public can access and use Archive-It. We’re considering how best to incorporate Archive-It into our workflow. While DSpace will remain our central repository, where we can provide enhanced access to publications through metadata, Archive-It is capturing more material than we will be able to, which is a huge help to us. (Archive-It also allows us to print PDF reports listing all PDFs captured in their crawls, and we can use dTa to download them. We’re not using this yet, but it is an option for the State Library going forward.)

With each iteration of the workflow, I feel we are getting closer to solving some of the big questions of my project. We hold weekly staff meetings to check in about the current process. Hearing each staff member’s thoughts on challenges or potential areas of improvement has taught me much about how to continually bring fresh eyes to an ongoing process, and how to keep the big picture in mind while working through smaller details. Flexibility is key not only with this project, but with digital preservation as a whole, as processes, tools, software, and other factors continue to evolve.

I hope to leave the State Library with some options for how to take this project forward, even if not all of the questions have a definitive answer. We’re also now focusing our attention on other issues in the project, such as outreach to state agencies and the cataloging workflow between our OPAC, Evergreen, and DSpace. There’s much to accomplish in the remaining weeks, and I look forward to updating you as we make progress on these goals.

Thank you!
Stefanie

Digital Preservation UnConference

This Tuesday we hosted our first Digital Preservation UnConference at the John F. Kennedy Presidential Library. We had a great turnout from a number of institutions around Boston and the larger New England community. A range of topics around digital preservation were discussed, from social media and web archiving to wrangling data for a system migration.

As you may know, part of the residency program includes hosting an event at your institution. At the JFK Library we immediately knew we wanted to host a public unconference. This actually came up in discussions between my host mentor, Erica Boudreau, and me before I even arrived in Boston. I had been to a few digital humanities and library themed unconferences and I was excited to see how this format could be used to address issues specific to digital preservation.

In planning for the event we created a WordPress site, including an UnConference 101, registration information, and directions to the event. We used the website commenting function to allow attendees to propose sessions ahead of time. We also created a Twitter handle, @jfkdigipres, for sharing updates and event information.

Attendees get ready for the day of UnConference-ing!

The day started with brief opening remarks from me, followed by session proposals from the attendees. Then we broke for coffee and attendees voted on the proposals. Once voting was over, a few dedicated volunteers and I entered the session proposals into the schedule. And with that, the sessions were underway!

Volunteers and attendees collaborated on community notes, which recorded the main points and resources discussed in each session. If you couldn’t make it to the event, or are curious about what happened in the sessions you missed, I highly recommend checking out these collaborative notes. Some great tools and ideas are discussed there.

Since I’m currently looking into plans for a potential system migration, I led a discussion on content migration for digital preservation. It was great to hear how others have dealt, and are dealing, with large-scale migrations like this. Someone made a great point that a system migration is iterative: you always have to keep an eye on the horizon, because your current system might lose support or fail to meet your collection requirements in the future.


Fellow resident Jeff Erickson led a discussion on preparing to use ArchivesDirect, a tool he’s currently researching for preserving content collected through the Mass. Memories Road Show. The group also discussed the evolution of the tool Archivematica and the necessity of exit strategies when working with cloud storage providers.

The last session I attended was on personal digital collections and how public history is changing in the digital world. Now that few people are writing physical letters, how will day-to-day communication be preserved in the future? Will Twitter accounts and email inboxes be included in future donations of personal collections? There were differing opinions on who is responsible for preserving these kinds of collections. Historical societies and community archives have traditionally taken on these roles, but with limited staff and technical expertise, can they continue doing so in the born-digital world?

The event had a strong presence on Twitter, where tweets were shared with the hashtag #jfkdigipres. We collected these tweets through a Storify page so we can preserve this discussion around digital preservation.

Overall the event was a great success!  I hope the conversations started here will continue both online and through future digital preservation events.


National Digital Stewardship Residents, past and present, in front of the John F. Kennedy Presidential Library


Digital Commonwealth Visit

Last week, a group of brave NDSR-ers trekked out during a snowstorm to visit with staff from the Digital Commonwealth, which is based at the Boston Public Library.

As the blizzard raged outside, we met Tom Blake in the lobby of the library. Tom and his staff were kind enough to meet with us that morning, and they fought high winds and T delays to get there!

Blizzarding outside the BPL. It was like a scene from The Revenant.

For those who are not familiar, here’s a short description of the Digital Commonwealth, taken from their website:

“Digital Commonwealth is a non-profit collaborative organization that provides resources and services to support the creation, management, and dissemination of cultural heritage materials held by Massachusetts libraries, museums, historical societies, and archives. Digital Commonwealth currently has over 130 member institutions from across the state.

This site provides access to thousands of images, documents, and sound recordings that have been digitized by member institutions so that they may be available to researchers, students, and the general public.”

The Digital Commonwealth both hosts and harvests materials. That is, they may store digitized or digital material on their own servers (hosting), or they may include material hosted elsewhere (such as in DSpace, CONTENTdm, etc.) as part of their collections and link out to its original location.
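For readers curious about the mechanics, aggregators often harvest metadata from member repositories over OAI-PMH, a standard protocol for exposing repository records. Here is a minimal sketch in Python using the Sickle library; the endpoint URL and metadata prefix are assumptions for illustration, and whether any given Digital Commonwealth partner is harvested exactly this way is my own assumption.

```python
# Harvest record headers from an OAI-PMH endpoint.
# The endpoint URL below is hypothetical.
from sickle import Sickle  # pip install sickle

sickle = Sickle("https://example-repository.org/oai")
# "mods" is an assumed metadata prefix; many repositories
# expose "oai_dc" instead, or both.
records = sickle.ListRecords(metadataPrefix="mods")

for record in records:
    header = record.header
    print(header.identifier, header.datestamp)
```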

They will also digitize materials for organizations, which I think is a pretty amazing service to provide! This means that organizations can get their materials digitized without having to buy expensive equipment or allocate staff to digitizing, which, I’m sure many of you know, can be a time-consuming task. They also help organizations create and clean up metadata, using the MODS metadata schema. You can read more about their metadata requirements here.

During our visit, Tom and his staff emphasized that the Digital Commonwealth is very access-driven, and this is reflected in their collecting. He said that if an organization comes to them with materials and makes a case for why users would want to access those materials, they will almost always take those materials in. In fact, I believe one of the Digital Commonwealth staff members at our meeting said that the phrase “But someone will want to use this!” is kind of like their kryptonite. (Hope I’m not giving away a big secret by saying that.) I thought this commitment to access and their focus on users was really admirable!


I was initially interested in visiting the Digital Commonwealth because, over the course of the residency, I’ve begun to wonder about how smaller organizations with limited resources can participate in digital preservation. To me, digital preservation seems like a resource-demanding endeavor. You’ve got to pay for storage, pay for staff to process and preserve digital materials, pay for digitizing or for technologies to manage born-digital materials, plus you need expertise on your staff and support from your administration. I was concerned that small organizations, such as local historical societies, wouldn’t be able to participate in digital preservation because of their limited resources. But it’s not as though they can just ignore digital preservation: they probably want to digitize materials, or they might have a donor with born-digital materials. So what are small organizations to do?

I think the Digital Commonwealth is a great example of a solution to this problem. It allows small organizations to benefit from the resources and expertise available at larger organizations. It also gives smaller organizations a wider audience, because their materials are available on the Digital Commonwealth website alongside materials from a variety of other organizations.

At the meeting, we discussed examples of this kind of resource sharing in other places, such as the Connecticut Digital Archive. I would be curious to hear if you, reader, know of any others, or of examples where many small organizations have come together to pool their resources. And are you concerned about small organizations and digital preservation? Why or why not?

Thanks for reading!

Harvard Yard in the snow

Having FITS Over Digital Preservation?

This week my fellow residents and I were fortunate to receive an introduction to the File Information Tool Set (FITS) from Andrea Goethals. Andrea is the Manager of Digital Preservation and Repository Services at Harvard Library, Director of the NDSR Boston program, and a developer of the FITS tool. Released in 2009, FITS is a digital preservation tool designed and developed at Harvard Library to identify and validate a wide assortment of file formats, determine technical characteristics, and extract embedded metadata. The technical metadata generated and collected by FITS can be exported in a variety of XML schemas and may be included in other files for digital preservation purposes, such as Harvard Library’s inclusion of FITS output in METS files in its preservation repository.

Digital preservation repositories accept into their care electronic files that are created and saved in a growing number of file formats. Proper identification of a file’s format and the extraction of embedded technical metadata are key aspects of preserving digital objects. Proper identification helps determine how digital objects will be managed, and extracting embedded technical metadata provides information that future repository staff or users will need to render, transform, access and use the digital objects.


There are several tools available that can identify and validate file formats and extract technical metadata. The great thing about FITS is that it bundles many of them together. The current version of FITS, 0.10.0, includes applications such as JHOVE, ExifTool, DROID, the National Library of New Zealand Metadata Extractor and Apache Tika. An explanation of each bundled tool can be found on the FITS web site.

While these tools can be used individually, using them under the FITS umbrella is more efficient. FITS runs all the tools simultaneously, saving you time. FITS knows the strengths and weaknesses of the applications and which tools support which file formats. You benefit by installing and running a single application while receiving output from multiple tools appropriate to each file format.

Receiving output from multiple tools can help you verify accurate information when the tools agree, or flag a concern when they don’t. It is also helpful that FITS consolidates and normalizes the output, providing a homogenized data set that is easier to interpret. Each tool’s output is converted to a common FITS XML schema ensuring labels and terminology are used consistently. The extracted metadata can then be exported to different technical metadata schemas such as MIX for images, TextMD for text and DocumentMD for documents. Any of these schemas can then be inserted into other files like METS to provide repository documentation suitable for digital preservation.


FITS is an open-source, Java-based application that is freely available from GitHub or the FITS web site. Because it is Java-based, it runs on Windows, Mac or Linux platforms from a command-line interface. It also provides an API and can be embedded in other applications; it is one of the micro-services included in Archivematica. Using a command-line interface can sometimes be intimidating and confusing, but FITS employs a limited number of intuitive commands.
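As an illustration of how simple those commands are, here is a sketch of calling FITS from a Python script. The install path and file names are hypothetical; the -i (input) and -o (output) flags follow the FITS documentation.

```python
# Shell out to the FITS launcher script (FITS itself is Java).
# Paths below are hypothetical examples.
import subprocess

fits_script = "/opt/fits/fits.sh"       # fits.bat on Windows
input_file = "samples/report.pdf"       # file to identify and characterize
output_file = "output/report-fits.xml"  # consolidated FITS XML output

subprocess.run(
    [fits_script, "-i", input_file, "-o", output_file],
    check=True,  # raise an error if FITS exits abnormally
)
print(f"FITS XML written to {output_file}")
```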

FITS configuration is managed with several XML files that are easily edited with a text editor. The main configuration file, fits.xml, allows you to prioritize tools, include or exclude certain file formats from processing, enable or disable additional features like generating checksums, and determine the various output options. Another positive for the digital preservation community is that FITS is actively maintained, so there is a procedure for addressing bugs and a schedule for releasing updates.

The FITS web site (fitstool.org) is well organized and fully documents the installation, configuration, use, and output options.

I know my post pales in comparison to a live demo of the application. But if it piques your interest, take it for a test drive. You’ve got nothing to lose, and you might add a new tool to your digital preservation toolbox.

Thanks for reading, Jeff

NDSR Tour of the Massachusetts State Archives

This week, my fellow residents, our hosts, and members of the NDSR community visited the Massachusetts State Archives. Located on Columbia Point, the Archives house, preserve, and make accessible public records of the Massachusetts government. We talked with Electronic Records Archivist Veronica Martzahl about digital preservation efforts and learned about the Archives’ amazing collections from Executive Director Michael Comeau. Thanks to you both, and to the Archives staff, for having us!

Veronica shared what led to the creation of her role at the Archives and told us about some digital preservation initiatives that are underway. When former Massachusetts governor Mitt Romney left office, his hard drives were swept clean and no electronic records were transferred to the Archives. This alone would be an issue in terms of government transparency and leaving a historical record (and definitely not in line with best archival practice!), but it became even more critical when Romney ran for president. This provided the impetus for the Archives to develop a digital preservation program that would ensure better procedures moving forward.

For about two years now, Veronica has been working tirelessly to implement a new digital repository, work that has included testing, cost analysis, and training, and she has had her hands in several other projects as well. In the end, the Archives chose Preservica Standard Edition for their digital collections. The big take-away is that the process was long and challenging. Dealing with factors such as IT constraints, budgeting, and the usual politics involved in government work presented some hurdles, but there was strong institutional commitment to the project, which is such an important factor in digital preservation. This taught us much about the reality of selecting systems for your institution, something I’m sure all of the residents will deal with sooner or later! We were all very impressed with the amount of work Veronica has achieved, and we can see the long-term positive impact that this repository will have for the Archives.

As the resident at the State Library, I was particularly interested in what we can learn from another government agency working to preserve digital government information. Veronica was kind enough to spend some time with me last October discussing the state of digital preservation at the State Archives, and I was excited to expand on that today and to hear updates since we last talked. One question I often get is: why don’t the Library and the Archives collaborate on digital preservation? In a case of maddening bureaucracy, the Library reports to the Department of Administration and Finance, while the Archives report to the Secretary of the Commonwealth. This fracture often results in some confusion, but the staff at both institutions are very supportive (we often refer users to the Archives for research, and vice versa). I hope the Archives and Library staff can continue to find opportunities for collaboration, especially in regard to digital preservation.

After Veronica caught us up on the digital projects, Michael provided us with some interesting background information about the Archives, its vast collection, and their emergency preparedness plan. Columbia Point, where the Archives are located, is very close to the water and susceptible to serious damage from natural disasters. Michael explained that this is partially why the building is designed to be so strong: it has to withstand some intense weather!

We were able to see the original versions of some founding documents in Massachusetts history, as well as the Bill of Rights, on display in the Commonwealth Museum. As a history nerd, I was pretty jazzed to be in the same room as these materials. Hearing Michael discuss the process of designing a proper space to house these documents was equally interesting. They worked with a scientist at MIT to create a home for these materials, thus protecting them for the long term. The encasements they designed have allowed these crucial pieces of history to be well preserved. Though our focus may be on digital preservation, it was a great chance to hear a case study in the preservation of print materials and to consider how necessary preservation is, regardless of format.

This month we also get to hear about the Digital Commonwealth, get a demonstration of the FITS tool from Harvard, and attend an UnConference at the JFK Library. Looking forward to it!