The GDD Network is exploring ways to understand the distribution of digitised texts worldwide by aggregating library records and making the resulting dataset available for further research. The project described here lays the foundation for that larger initiative by analysing library catalogue data from the three UK national libraries – the British Library (BL), the National Library of Scotland (NLS), and the National Library of Wales (NLW) – and HathiTrust, in order to understand the extent of overlap between the four digitised collections.
Earlier this year, HathiTrust team members Natalie Fulkerson, Josh Steverman and Martin Warin conducted a trial matching of library catalogue records contributed by the three library partners and HathiTrust, in order to determine the feasibility of producing an aggregated dataset containing records from all four partners. To do this, we needed a process for accepting, matching, and aggregating bibliographic records from different sources.
Method
Each library partner agreed to contribute full MARC catalogue records for their digitised works:
Library partner | # records contributed |
British Library | 516,212 |
National Library of Scotland | 10,919 |
National Library of Wales | 2,290 |
HathiTrust | 8,611,997¹ |
Cataloguing practices vary between institutions and over time, so we needed to understand what the record sets from each library contained before we could develop a workable mechanism for matching them. HathiTrust has an established workflow for handling and analysing bibliographic records from multiple sources, so our first step was to see how well it fit the needs of the proposed registry.
The HathiTrust process relies on the presence of an OCLC Control Number (OCN), which uniquely identifies each work, to determine whether more than one library holds a copy of the same title.² OCN assignment is more-or-less standard practice in US cataloguing, but UK libraries use OCNs less often, so we suspected that most of the UK records would lack them. Our analysis confirmed this was the case:
Library partner | # records | # OCNs | % of records with OCN |
British Library | 516,212 | 611 | 0.11 |
National Library of Scotland | 10,919 | 561 | 5 |
National Library of Wales | 2,290 | 744 | 32 |
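The OCN-based check described above can be sketched in a few lines. This is only an illustration: the record layout (dicts with `id`, `source`, and `ocn` keys) and the function names are hypothetical, not HathiTrust's actual data model or code.

```python
from collections import defaultdict

def ocn_coverage(records):
    """Return the percentage of records that carry an OCN."""
    if not records:
        return 0.0
    with_ocn = sum(1 for r in records if r.get("ocn"))
    return 100.0 * with_ocn / len(records)

def cluster_by_ocn(records):
    """Group (source, id) pairs by OCN; records without an OCN are skipped."""
    clusters = defaultdict(list)
    for r in records:
        if r.get("ocn"):
            clusters[r["ocn"]].append((r["source"], r["id"]))
    return dict(clusters)

def shared_ocns(clusters):
    """Keep only the OCNs that appear in records from more than one source."""
    return {
        ocn: members
        for ocn, members in clusters.items()
        if len({source for source, _ in members}) > 1
    }

# Hypothetical toy records, invented for illustration only.
records = [
    {"id": "bl-1", "source": "BL", "ocn": "12345"},
    {"id": "ht-1", "source": "HT", "ocn": "12345"},
    {"id": "bl-2", "source": "BL", "ocn": None},
    {"id": "ht-2", "source": "HT", "ocn": "67890"},
]

print(ocn_coverage(records))
print(shared_ocns(cluster_by_ocn(records)))
```

A record with no OCN, like `bl-2` here, simply never enters a cluster, which is why low OCN coverage makes this approach unworkable on its own.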
With so few OCNs in the UK records, we needed a better way to identify all the overlaps between their digitised collections and items in the HathiTrust Digital Library.
Next, we looked for the presence of other identifiers in the UK records, for example, ISBN:
Library partner | # records | # ISBNs | % of records with ISBN |
British Library | 516,212 | 34 | 0.006 |
National Library of Scotland | 10,919 | 55 | 0.5 |
National Library of Wales | 2,290 | 17 | 0.74 |
No luck there either.
So, on to string matching!
This is an emerging research area in libraries, so we developed a protocol for testing a few different methods:
- Literal string matching
- Normalised string matching
- Word-by-word (bag of words) matching
- Machine learning approaches
Each method has different benefits and drawbacks. Literal string matching is very precise and not computationally expensive, but likely excludes many true matches:
Library partner | # digitised records | # matching HathiTrust records | % overlap |
British Library | 516,212 | 4,559 | 0.88 |
National Library of Scotland | 10,919 | 343 | 3.14 |
National Library of Wales | 2,290 | 51 | 2.23 |
Normalised string matching (using pre-processing to downcase and remove non-alpha characters prior to running the match script) is still an inexpensive process, and it produced more matches:
Library partner | # digitised records | # matching HathiTrust records | % overlap |
British Library | 516,212 | 39,815 | 7.71 |
National Library of Scotland | 10,919 | 1,746 | 16 |
National Library of Wales | 2,290 | 253 | 11.04 |
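To illustrate why normalisation finds more matches than literal comparison, here is a minimal sketch. It assumes that "non-alpha characters" includes spaces and punctuation, and the function names and toy titles are invented for illustration; the actual match script may differ.

```python
def normalise(title):
    """Downcase and keep only alphabetic characters (spaces and punctuation go)."""
    return "".join(ch for ch in title.lower() if ch.isalpha())

def match_titles(partner_titles, hathi_titles, key=lambda t: t):
    """Return the partner titles whose key equals the key of any HathiTrust title."""
    hathi_keys = {key(t) for t in hathi_titles}  # set lookup keeps this cheap
    return [t for t in partner_titles if key(t) in hathi_keys]

# Toy titles that differ only in punctuation and capitalisation.
partner = ["St. Paul at Philippi. A Seatonian poem."]
hathi = ["St. Paul at Philippi : a Seatonian poem"]

literal = match_titles(partner, hathi)                    # exact comparison
normalised = match_titles(partner, hathi, key=normalise)  # after pre-processing

print(literal)
print(normalised)
```

Both variants reduce to a set lookup per title, which is why the cost stays low even for hundreds of thousands of records. Neither can match a title that carries extra words, such as a statement of responsibility.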
Exploratory approaches
Word-by-word matching decouples character strings (words) from the order in which they appear and compares the resulting “bags” of words. We used the following approach:
- For each title in the British Library record set, downcase, eliminate stopwords, and produce a “bag of words”
- Grab each HathiTrust record whose title matches any of the words in the bag
- For each candidate, determine precision and recall, average them into a confidence score, and rank by score
The resulting output is a list of candidate matches from the HathiTrust collection, for each British Library record, with a corresponding confidence score. For example:
****************
## BL title: “St. Paul at Philippi. A Seatonian poem.”
## Bag of words: st,paul,philippi,seatonian,poem
score | p | r | match_title |
0.514 | 0.600 | 0.429 | The martyrdom of St. Peter and St. Paul; a poem. By George Burgess. |
0.514 | 0.600 | 0.429 | St. Catharine of Siena. The Seatonian prize poem for 1948. |
0.857 | 1.000 | 0.714 | St. Paul at Philippi : a Seatonian poem / by Thomas E. Hankinson. |
****************
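The scoring in the example above can be reproduced with a short sketch. Several details are assumptions inferred from the example figures rather than taken from the project's code: the stopword list, the dropping of single-character tokens, precision as the share of the BL bag found in the candidate, recall as the share of the candidate's bag found in the BL bag, and the score as their plain average.

```python
import re

# Assumed stopword list; the list actually used is not given in the post.
STOPWORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "the", "to"}

def bag_of_words(title):
    """Downcase, split on non-alphanumerics (ASCII only), drop stopwords
    and single-character tokens, and return the remaining words as a set."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return {t for t in tokens if len(t) > 1 and t not in STOPWORDS}

def score(query_title, candidate_title):
    """Return (score, precision, recall) for one candidate title."""
    query, cand = bag_of_words(query_title), bag_of_words(candidate_title)
    shared = query & cand
    if not shared:
        return 0.0, 0.0, 0.0
    p = len(shared) / len(query)  # share of the query bag that was found
    r = len(shared) / len(cand)   # share of the candidate bag that was found
    return (p + r) / 2, p, r

def rank(query_title, candidates):
    """Score every candidate sharing at least one word; best score first."""
    scored = [(*score(query_title, c), c) for c in candidates]
    return sorted((s for s in scored if s[0] > 0),
                  key=lambda s: s[0], reverse=True)

bl_title = "St. Paul at Philippi. A Seatonian poem."
hathi_titles = [
    "The martyrdom of St. Peter and St. Paul; a poem. By George Burgess.",
    "St. Catharine of Siena. The Seatonian prize poem for 1948.",
    "St. Paul at Philippi : a Seatonian poem / by Thomas E. Hankinson.",
]

for s, p, r, title in rank(bl_title, hathi_titles):
    print(f"{s:.3f} | {p:.3f} | {r:.3f} | {title}")
```

Under these assumptions the Hankinson title ranks first with a score of 0.857, and the other two candidates tie at 0.514, matching the figures shown.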
Next, we explored possible machine learning methods. We posed the following question: “Can you train an SVM classifier to distinguish between title matches and non-matches?” In other words, if you give a classifier a large number of record pairs, can you teach it to determine which ones match and which ones don’t?
Our method calculated the Damerau-Levenshtein (D-L) distance³ between pairs of records, plotted the results, and used two SVM kernels (polynomial and Gaussian RBF) to attempt to draw a boundary between matching and non-matching records. We did this for each of 16 different components of the MARC record for each title.
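For readers unfamiliar with this kind of edit distance, here is a minimal sketch. Note that it implements the restricted variant (optimal string alignment), which counts insertions, deletions, substitutions, and swaps of adjacent characters but never edits a swapped pair again; the post does not say which variant the project used. The field names in `field_distances` are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def field_distances(rec_a, rec_b, fields):
    """One distance per field: a feature vector a classifier could consume."""
    return [osa_distance(rec_a.get(f, ""), rec_b.get(f, "")) for f in fields]

print(osa_distance("poem", "peom"))  # one adjacent swap
print(field_distances(
    {"title": "seatonian poem", "author": "hankinson"},
    {"title": "seatonian peom", "author": "hankinson"},
    ["title", "author"],
))
```

Computing one distance per record component, as in `field_distances`, gives the classifier a vector of numbers for each record pair rather than a single score.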
We applied this method to a small subset of HathiTrust records pre-clustered by OCLC control number. Essentially, we asked the classifier to determine whether two records should belong to the same cluster, without using the available OCLC number as a reference point. Then we checked against the pre-clustered data to validate the classifier output.
Preliminary results for this method look promising:
Kernel | # of clusters in test set | # of predicted clusters | Precision | Recall |
Polynomial | 5,989 | 5,558 | 0.982 | 0.911 |
Gaussian RBF | 5,989 | 5,948 | 0.975 | 0.968 |
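The post does not spell out how precision and recall were computed when checking predicted clusters against the OCN clusters. One common option, shown here purely as an assumption, is pairwise scoring: a pair of records counts as a hit when both the prediction and the reference place them in the same cluster. All names and the toy data are hypothetical.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """Return every unordered pair of record ids that share a cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs

def pairwise_precision_recall(predicted, truth):
    """Compare predicted clusters with reference clusters, pair by pair."""
    pred = co_clustered_pairs(predicted)
    true = co_clustered_pairs(truth)
    hits = len(pred & true)
    precision = hits / len(pred) if pred else 1.0  # no predicted pairs: nothing wrong
    recall = hits / len(true) if true else 1.0     # no true pairs: nothing missed
    return precision, recall

# Toy data: the prediction splits record "c" away from its true cluster.
truth = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"c"}, {"d", "e"}]

print(pairwise_precision_recall(predicted, truth))
```

In the toy data every predicted pair is correct, so precision is perfect, but two of the four true pairs are missed. Over-splitting clusters lowers recall in this way without affecting precision.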
Insights
As we expected, the UK records contained too few OCNs or ISBNs to serve as a reliable match point. Other methods showed varying levels of promise:
- Literal string matching is simple but very rigid; word-by-word matching is more complicated but presents a more nuanced view
- Machine learning methods are more accurate, but require training time and are resource-intensive to run
Our exploration thus far points to a possible hybrid approach: use less computationally expensive methods as a first pass to eliminate clear non-matches, then apply more expensive methods to improve precision and recall.
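A rough sketch of what such a two-stage pipeline might look like follows. Everything here is an assumption for illustration: the first pass is a simple shared-word filter, and the standard library's `difflib.SequenceMatcher` stands in for the costlier second stage, which in practice might be a trained classifier. The thresholds are arbitrary.

```python
import re
from difflib import SequenceMatcher

def tokens(title):
    """Downcased alphanumeric words of a title, as a set."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def cheap_filter(query, candidates, min_shared=2):
    """First pass: discard candidates sharing fewer than min_shared words."""
    q = tokens(query)
    return [c for c in candidates if len(q & tokens(c)) >= min_shared]

def expensive_score(a, b):
    """Second pass stand-in: character-level similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def hybrid_match(query, candidates, threshold=0.6):
    """Run the costly comparison only on what survives the cheap filter."""
    shortlist = cheap_filter(query, candidates)
    scored = [(expensive_score(query, c), c) for c in shortlist]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)

candidates = ["st. paul at philippi", "A history of Wales"]
print(hybrid_match("St. Paul at Philippi", candidates))
```

The design point is that the second function is only ever called on the shortlist, so its cost scales with the number of plausible candidates instead of the full collection.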
Conclusion and future directions
The project team learned firsthand that duplicate detection is hard and involves tradeoffs. Our work to date suggests the following activities for a future stage:
1. Continue refining the machine learning process:
a) Explore other ML methods (other algorithms, alternatives to D-L distance).
b) Identify other training datasets (reduce bias).
c) Explore ways to make the process less resource-intensive.
2. Explore other methods, such as stemming and fuzzy logic.
3. Test the hybrid approach described above.
We look forward to presenting this work in more detail at a GDD Network meeting in Glasgow, Scotland, in December 2019.
For more information, please contact Natalie Fulkerson at nfulkers@hathitrust.org.
***
Natalie is Collection Services Librarian at HathiTrust and project manager for the data analysis portion of the GDD Network grant project.
Josh and Martin are application developers within Library Information Technology, University Library, at the University of Michigan.
Footnotes
1 HathiTrust clusters records for multiple copies of the same item, so these 8.6 million records correspond to roughly 17.5 million items.
2 As with any large, complex identifier scheme, OCN assignment is not perfect. Duplicate records do exist, and misapplication of OCLC control numbers does occur. For this project, we assume that two records with the same OCN refer to two copies of the same work.
3 Damerau, Fred J. (March 1964), “A technique for computer detection and correction of spelling errors”, Communications of the ACM, 7 (3): 171–176, doi:10.1145/363958.363994