Adoption of “BagIt” standard for Ingest and Export of Digital Objects
As alluded to in a previous post, the Library’s Digital Collections platform recently underwent a relatively major change: adopting the “BagIt” specification, published through the Internet Engineering Task Force (IETF), for ingest and export of our digital objects. The impetus for this change came from meeting with a potential future content producer for the platform and thinking about how to model, ingest, provide access to, and preserve their content over time. With only an ad hoc ingest process in place, we knew more formalized ingest procedures would be needed down the road, and this was a perfect opportunity to lay the groundwork for those. Addressing ingest workflows required stepping back and identifying every point in the system this would affect. Precisely because the effects were far-reaching, spanning from how assets are organized for ingest, through storage and management in the repository, all the way to their display on the front-end, adopting a widely used standard such as BagIt has had a very positive and normalizing impact on the Digital Collections platform as a whole.
The BagIt standard is “a hierarchical file packaging format designed to support disk-based storage and network transfer of arbitrary digital content”. BagIt is, at its simplest, a set of rules, checksums, and naming conventions for grouping and packaging files. It is not a file format like Zip (.zip) archives or Tarballs (.tar), both of which result in a single file, with or without compression; when you “bagify” a directory, you are still left with the original directory! What it does provide are manifests and checksums for all files that are part of the “bag”, resulting in a package that can be created, moved, disassembled, reassembled, and checked for file integrity and consistency throughout multiple transformations and storage locations. For content producers who may not have access to all the nooks and crannies of a repository’s back-end, it is a good way to ensure they can retrieve exactly what they deposited.
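To make the idea of manifests and checksums a little more concrete, here is a minimal sketch in Python using only the standard library. It assumes the usual bag layout (payload files under a data/ directory, a bagit.txt declaration, and a manifest of "checksum path" lines) and SHA-256 as the checksum algorithm. The function names write_manifest and verify_manifest are made up for illustration and are not taken from our ingest code; in practice an existing BagIt library would likely be a better choice than hand-rolled code like this.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir):
    """Write bagit.txt and manifest-sha256.txt for files already under bag_dir/data."""
    bag_dir = Path(bag_dir)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if path.is_file():
            # Manifest paths are relative to the bag root and use forward slashes.
            relative = path.relative_to(bag_dir).as_posix()
            lines.append(f"{file_sha256(path)}  {relative}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")


def verify_manifest(bag_dir):
    """Return the payload paths that are missing or whose checksum has changed."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    failures = []
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, relative = line.split(None, 1)
        target = bag_dir / relative
        if not target.is_file() or file_sha256(target) != expected:
            failures.append(relative)
    return failures
```

Calling write_manifest on a directory that already holds its files under data/ leaves that directory in place, now accompanied by the two tag files. Running verify_manifest at any later point, on any copy of the directory, returns an empty list when every payload file still matches its recorded checksum.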
Adopting the BagIt packaging standard for our ingest process has allowed us to move away from our original XSLT-based ingest workflow, which varied widely from collection to collection and resulted in objects that were fairly similar, but not identical, in structure. At best this was irritating and hindered access; at worst it was detrimental to preservation efforts. By putting more time and effort into the creation of well-formed bags, which become our ingest SIPs (Submission Information Packages), and then ingesting all BagIt archives through the same Python-based ingest process, we are left with objects in the repository that have been made from the same mold and are therefore much easier to manage. It has also wrested complex object structure out of XSLT stylesheets and into human-readable JSON files that can be created by hand or, for larger collections, programmatically.
Overview of BagIt object ingest, storage, and export for Digital Collections platform
The BagIt standard is fairly widely used, by the Library of Congress, Chronopolis, and the Stanford Digital Repository, among others, and is being popularized by other repository platforms such as Archivematica, which uses BagIt as the packaging format for AIPs (Archival Information Packages) in the system. Because we use Fedora Commons as our repository storage system, we are using BagIt objects only for ingest and export. Originally, BagIt packages were useful because we could show, outside of the repository itself, that exported objects were “identical” (at least as far as individual file checksums are concerned) to the objects ingested. But throughout the process of incorporating them into our ingest / export workflow, they have also proved to be very handy packages for neatly wrapping up objects that may contain 1, 100, 1,000, or more component pieces.
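The "identical as far as checksums are concerned" comparison can be sketched in a few lines of Python. This is only an illustration under assumptions: both bags are assumed to carry a manifest named manifest-sha256.txt, and read_manifest and compare_bags are hypothetical helpers, not functions from our platform. It compares the checksums recorded in the two manifests, so each bag should be validated against its own manifest first.

```python
from pathlib import Path


def read_manifest(bag_dir, name="manifest-sha256.txt"):
    """Parse a BagIt manifest into a {relative_path: checksum} dictionary."""
    entries = {}
    text = (Path(bag_dir) / name).read_text(encoding="utf-8")
    for line in text.splitlines():
        if line.strip():
            checksum, relative = line.split(None, 1)
            entries[relative] = checksum.lower()
    return entries


def compare_bags(ingested_dir, exported_dir):
    """Report payload files that are missing, extra, or changed in the exported bag."""
    before = read_manifest(ingested_dir)
    after = read_manifest(exported_dir)
    shared = set(before) & set(after)
    return {
        "missing": sorted(set(before) - set(after)),
        "extra": sorted(set(after) - set(before)),
        "changed": sorted(p for p in shared if before[p] != after[p]),
    }
```

If all three lists come back empty, the exported bag records the same files with the same checksums as the bag that was ingested, whether the object has one component or thousands.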