In college, I took several courses that involved working closely with one of the many helpful librarians on campus. She would often refer to our projects as “iterative,” so much so that she would even laugh as she said it. Six months into my residency at the State Library of Massachusetts, the joke is on me, as our process has been very iterative. This post will cover what we’ve been up to recently and what is ahead for us in the next few months.
A quick recap: we’re exploring more efficient ways of finding, downloading, and providing access to digital state publications. We’ve been working with web statistics downloaded from Mass.gov to assess the extent of digital publications and to determine what is most valuable to preserve for the Library and its users.
The web statistics workflow has, of course, evolved, requiring flexibility and an open mind. When we began using the statistics, each member of the project team checked every URL listed, noted the type of document it was, and then ranked the document on a scale of 1–5 (1 being lowest priority, 5 highest) in shared spreadsheets. Once we all had a solid understanding of what was highest and lowest priority, we determined that we didn’t all need to rank each type of document, so each staff member now tackles a different agency and enters their own priority rankings. We also created a new spreadsheet to consolidate that data: the total number of documents and the number at each priority ranking. This gives a bigger-picture assessment of how many state publications exist and how many high-priority documents we need to handle quickly. A few weeks later, we decided to add a category in the spreadsheets to note whether each document is a series, serial, or monograph, which affects the way the items are cataloged. Though these are relatively minor changes in the workflow, they do reflect how important it is to continually check in with the project team about what’s working well and what could be improved. It is very iterative!
While that process is ongoing, we are also examining how to download the thousands of publications we’ve reviewed through the web stats. I researched tools that would help us batch-download PDF or Word documents from sites, taking into account the Library’s resources. Though CINCH, a tool developed by the State Library of North Carolina, fits our needs well, its installation requirements were not feasible for us. I began experimenting with a Firefox add-on called DownThemAll! (yes, the exclamation mark is part of the name, though it is very exciting). DownThemAll (dTa) lets a user upload a list of URLs and specify the folder in which the files should be saved; then, like magic, the files are downloaded (dTa has other features and functions as well, such as a download accelerator). Any URLs that fail are flagged rather than downloaded, so you can go back and check whether the problem was a 404 error or human error, for example.
The tool is free, easy to use, and works very well! My concern, however, is that it is not backed by an institution, and it’s unclear how much funding or technical support the developers have. What if I come into work tomorrow and it’s gone? Who do I contact? Though they offer some support, it’s limited (for example, I emailed about an issue three weeks ago and haven’t heard back). dTa works only with Firefox; what if there’s an issue with the browser and we can no longer access the tool? While the tool works well and will be useful in the short term, I don’t see it being a sustainable solution for batch downloading. This is another part of the process that we’ll need to keep revisiting over time. And if anyone has ideas or suggestions, please let me know!
One big success we’ve had is collaborating with MassIT to gain access to their Archive-It account. Though MassIT manages the account, they’re capturing the material that we need, webpages with links to documents published by state agencies, so it makes perfect sense to work together to use Archive-It to its full capacity. I worked with MassIT to customize the metadata on the site, and then wrote some information to publish on our website about how the general public can access and use Archive-It. We’re considering how best to incorporate Archive-It into our workflow. While DSpace will remain our central repository, where we can provide enhanced access to publications through metadata, Archive-It is capturing more material than we will be able to, which is a huge help to us. (Archive-It also allows us to generate reports listing all PDFs captured in its crawls, and we can use dTa to download them. We’re not using this option now, but it is one the State Library could pursue going forward.)
With each iteration of the workflow, I feel we are getting closer to solving some of the big questions of my project. We hold weekly staff meetings to check in about the current process. Hearing each staff member’s thoughts on challenges or potential areas of improvement has taught me much about how to continually bring fresh eyes to an ongoing process, and how to keep the big picture in mind while working through smaller details. Flexibility is key not only with this project, but with digital preservation as a whole, as processes, tools, software, and other factors continue to evolve.
I hope to leave the State Library with some options for how to take this project forward, even if not all of the questions have a definitive answer. We’re also now focusing our attention on other issues in the project, such as outreach to state agencies and the cataloging workflow between the Library’s OPAC, Evergreen, and DSpace. There’s much to accomplish in the remaining weeks, and I look forward to updating you as we make progress on these goals.