From aggregation to prototype

Natalie Fulkerson, HathiTrust, and Stuart Lewis, National Library of Scotland

The initial task for the GDD Network was to match and aggregate bibliographic records from different sources – specifically from the core project partners: HathiTrust, the National Library of Scotland, the British Library and the National Library of Wales.

The next job was to publish this as a dataset, and then to use that dataset to build a prototype search service over the collection.

Part 1: The aggregation

Our goal was to create an aggregation of the records from all four partners containing enough bibliographic information to be useful, while also managing the size of the resulting file. To do this, we evaluated each set of records to identify common fields and assessed the prevalence of those fields across the records. In this way, we were able to create a data file consisting of records that were reasonably complete and comparable between institutions, while also being mindful of any future applications or hosting that might be negatively impacted if the file were too large. Even with only these limited fields, aggregating 17½ million records produces more than 5GB of data.

Data Element          Description                       MARC field(s)
Volume Identifier     Permanent item identifier         –
Title                 –                                 245|a
Imprint               Publisher + date of publication   260|bc
Publication Place     –                                 008 (bytes 15-17)
Author                –                                 100|abcd; 110|abcd
(URI)                 Link to digital object            856
(Publication Place)   –                                 260|a
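
As an illustration of that prevalence assessment, here is a minimal sketch (the project does not describe its actual tooling; the pymarc library, the file name and the candidate field list are assumptions):

```python
# Sketch: count how often candidate MARC fields appear across a file of
# records, to judge which fields are common enough to aggregate.
from collections import Counter
from pymarc import MARCReader

CANDIDATE_FIELDS = ["245", "260", "008", "100", "110", "856"]

def field_prevalence(path):
    counts = Counter()
    total = 0
    with open(path, "rb") as handle:
        for record in MARCReader(handle):
            if record is None:  # skip records pymarc could not parse
                continue
            total += 1
            for tag in CANDIDATE_FIELDS:
                if record.get_fields(tag):
                    counts[tag] += 1
    return total, counts

total, counts = field_prevalence("records.mrc")
for tag in CANDIDATE_FIELDS:
    print(f"{tag}: {counts[tag]} of {total} records")
```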

The dataset also includes the source of the records (either the HathiTrust or one of the three national libraries) as a separate field.  This wasn’t present in the source data; we added it as part of the aggregation.   

We did not include subject, because the HathiFile records we worked from did not contain it, and we did not include language because uptake across the partners was too variable. If we were to do it again, we might look more closely at subject data, and we could break out the imprint fields separately to allow faceting by publication date.

The aggregation exercise highlighted an inconsistent practice in the use of item URLs: one of the partners provided records whose URLs link to the catalogue record rather than the item itself, others pointed to the digitised resources at the Library, and some pointed to Google Books, where Google had undertaken the digitisation.

Part 2: The prototype

The aggregated data was provided as a 5GB CSV file by the team at HathiTrust. As a dataset, this is an easy-to-use format which can be manipulated in many tools, including spreadsheet software such as Excel. However, many people would like to be able to interrogate the data without having to load it into a tool first.
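
The file is also too large to open comfortably in a spreadsheet. As a rough sketch, it can instead be processed in a streaming fashion with pandas (the file name and the "source" column are illustrative assumptions, not the actual schema):

```python
# Sketch: stream the 5GB CSV in chunks rather than loading it whole,
# tallying records per contributing institution as an example task.
import pandas as pd

counts = {}
for chunk in pd.read_csv("gdd_aggregation.csv", chunksize=100_000):
    for source, n in chunk["source"].value_counts().items():
        counts[source] = counts.get(source, 0) + n

print(counts)  # e.g. records per contributing institution
```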

The GDD Network had identified core use cases for a Global Register of Digitised Texts, and to fulfil some of these we needed to create an easy interface for users to first search for items (either single items or groups of items), and then to go to the item or, in the case of a large collection of items, to download the relevant URLs for their research.

The scope of the project meant that we didn’t have any dedicated resource to build a large custom search system, but thankfully there are a number of tools that can easily be integrated together to build a working prototype.

Step one was to select a search index – something that can store the data and provide the capability to search it. There are a number of open source tools that do this, such as Lucene, Elasticsearch and Solr. We chose Solr, simply because we had prior experience with it. Thankfully, Solr can natively import CSV files, so other than a few changes to replace double quotes in a few titles, the data imported very simply.
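
As a sketch of that import step (the core name "gdd", the Solr URL, the file names and the crude quote replacement are all assumptions here; Solr's CSV update handler itself is a standard feature):

```python
# Sketch: replace stray double quotes that can break CSV parsing, then
# load the cleaned file into Solr via its built-in CSV update handler.
import csv
import requests

# Crude clean-up: swap double quotes inside cells for single quotes.
with open("gdd_aggregation.csv", newline="") as src, \
     open("gdd_clean.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([cell.replace('"', "'") for cell in row])

# Stream the cleaned file to Solr and commit in one request.
with open("gdd_clean.csv", "rb") as data:
    resp = requests.post(
        "http://localhost:8983/solr/gdd/update?commit=true",
        data=data,
        headers={"Content-Type": "application/csv"},
    )
resp.raise_for_status()
```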

One of the many benefits of cloud computing is the ability to grow and shrink servers. This allowed us to use a reasonably powerful server for the two hours and 18 minutes it took to import the data, and then, once the import had completed, to resize the server down to a more modest size.
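
The post doesn't name the cloud provider, so purely as an illustration, this is what that resize might look like on AWS EC2 with boto3 (the instance ID and instance type are placeholders):

```python
# Sketch: shrink an EC2 instance after a heavy one-off import job.
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

# The instance must be stopped before its type can be changed.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

# Switch from the import-sized instance to a modest day-to-day one.
ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": "t3.medium"},
)

ec2.start_instances(InstanceIds=[INSTANCE_ID])
```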

The next job was to provide a faceted search user interface on top of the Solr index. Again, open source software came to the rescue, this time through the Solrdora system by Hector Correa. This allowed us to easily produce a search system: 

Screenshot showing a list of all the records in the GDD Network prototype, a total of 17,567,832. The search results show the title, authors, publishers and either the dates of the authors' lives or the date of publication. They also show the source of the information. On the left-hand side are the facets of the search, including the number of records for each facet. These include sources: HathiTrust - 17,036,009; British Library - 517,599; National Library of Scotland - 11,350; National Library of Wales - 2,874. The other facet shown is Publishers; the examples include: United States; United States Bureau of the Census; Great Britain, Parliament, House of Commons; United States, General Accounting Office; Sweden, Riksdagen; France; United States, Department of State; India, Parliament, Lok Sabha; Great Britain, Parliament; United States, Department of Agriculture.

Using the prototype allows the data to be explored more easily than in the CSV, and lets us evaluate whether the current dataset is sufficient to meet the core use cases. For example, some immediate issues come to light when trying to use the system: many of the items are only available to members of the particular libraries, but access statements or rights information are not displayed, so it is not possible to know whether you will be able to access any particular record.

Other than source (the Library that provided the records) and publisher, there didn’t appear to be any other fields consistent enough to support useful facets. This could be considered in future iterations of the dataset.
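
For reference, this is the style of Solr facet query that sits behind those two facets (a sketch only: the core name "gdd" and the field names "source" and "publisher" are assumptions about the schema):

```python
# Sketch: ask Solr for facet counts on source and publisher,
# without returning any result rows.
import requests

resp = requests.get(
    "http://localhost:8983/solr/gdd/select",
    params={
        "q": "*:*",
        "rows": 0,              # facet counts only, no documents
        "facet": "true",
        "facet.field": ["source", "publisher"],
        "facet.limit": 10,
    },
)
facets = resp.json()["facet_counts"]["facet_fields"]
# Solr returns each facet as a flat [value, count, value, count, ...] list.
for field, flat in facets.items():
    print(field, list(zip(flat[::2], flat[1::2])))
```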

The ‘export results’ button also doesn’t work; this would need to be implemented in order to meet the digital scholarship use case where a user wants to download their complete search results.
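
One way such an export could be implemented is with Solr's cursorMark deep paging, sketched below (the core name and the "id" and "uri" field names are assumptions):

```python
# Sketch: stream every matching item URL to a file using Solr's
# cursorMark paging, which avoids the limits of offset-based paging.
import requests

SOLR = "http://localhost:8983/solr/gdd/select"

def export_urls(query, out_path):
    cursor = "*"
    with open(out_path, "w") as out:
        while True:
            resp = requests.get(SOLR, params={
                "q": query,
                "fl": "uri",
                "rows": 1000,
                "sort": "id asc",     # cursorMark needs a stable sort on the unique key
                "cursorMark": cursor,
            }).json()
            for doc in resp["response"]["docs"]:
                if "uri" in doc:
                    out.write(doc["uri"] + "\n")
            next_cursor = resp["nextCursorMark"]
            if next_cursor == cursor:  # cursor unchanged means no more pages
                break
            cursor = next_cursor

export_urls('title:"natural history"', "result_urls.txt")
```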

The prototype has offered an opportunity to assess the dataset against the use cases, but it is only a short-term tool and will soon be switched off. However, the GDD Network partners will incorporate the knowledge gathered from using it into the final project report, due out in early 2020.