Since 2015, the Natural History Museum, London has made its research and collections data available through its Data Portal. Important new features have just been added that make it easier for users to reuse this data.
The Portal provides free and open access to research datasets as well as digitised items from the Museum’s specimen collection. This includes 3D scans, images and audio recordings as well as other structured data in tables.
The Data Portal currently has over 4.2 million records from the specimens collection and a further 5.5 million records from other research datasets. Since 2015, more than 250 scientific publications have cited data from the Data Portal, either directly or through aggregators such as the Global Biodiversity Information Facility (GBIF), though many more citations cannot currently be tracked. The Museum’s data are vital resources for answering some of the big challenges facing science and society, including maintaining food security, improving healthcare, reducing the loss of biodiversity and tackling climate change.
If this kind of data are not available, the possibility to lead spatial epidemiological studies of vector-borne infectious diseases is significantly reduced…It would have been necessary to generate the information through field campaigns, or by travelling to see the collections in person, which could have represented a very high amount of time/resources.
Alberto Jose Alaniz, lead author of “Spatial quantification of the world population potentially exposed to Zika virus” (Alaniz et al., 2017)
Users can download data from the Portal and are encouraged to cite the source. Until now, however, there has been no way for users to cite subsets of the data returned through a query, nor a way to persistently identify the data subset they are citing. This is a common issue with scientific data put online, particularly when the cited data changes frequently, as is the case with the Museum’s specimen collection. Over the past five years the Museum’s digital collections have doubled, and they are updated on the Data Portal five days a week as more of the collection is digitised.
The Data Portal team have implemented changes to meet the Research Data Alliance’s (RDA) Data Citation recommendations on supporting “a dynamic, query centric view of data sets” (Rauber et al. 2015). When users search and download data from the Portal, a Digital Object Identifier (DOI) is created for each unique search, providing a snapshot of the exact version of the data used. The Portal stores links to the versioned data rather than a copy of each variation of the dataset, eliminating the need to physically store every version of every dataset ever downloaded. Combining this versioning information into the search index also allows queries against historical data, opening up the possibility of gaining insights into the way our digitised collection has changed over time.
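The idea of citing a query plus a data version, rather than a stored copy of the results, can be illustrated with a minimal Python sketch. This is not the Data Portal's implementation: the `VersionedStore` class, the `cite` and `resolve` functions, the integer timestamps and the `query-…` identifier format are all hypothetical names invented for illustration, and a real system would register a DOI with a registration agency rather than derive a hash locally.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Hypothetical store keeping every version of each record."""
    # record_id -> list of (timestamp, data) pairs, oldest first
    history: dict = field(default_factory=dict)

    def put(self, record_id, timestamp, data):
        versions = self.history.setdefault(record_id, [])
        versions.append((timestamp, data))
        versions.sort(key=lambda pair: pair[0])

    def as_of(self, timestamp):
        """Return {record_id: data} as the store looked at `timestamp`."""
        snapshot = {}
        for record_id, versions in self.history.items():
            visible = [data for ts, data in versions if ts <= timestamp]
            if visible:
                snapshot[record_id] = visible[-1]
        return snapshot

    def search(self, query, timestamp):
        """Ids of records whose fields equal every key/value in `query`."""
        return sorted(
            record_id
            for record_id, data in self.as_of(timestamp).items()
            if all(data.get(key) == value for key, value in query.items())
        )


# identifier -> (query, timestamp); only the query and version are stored,
# never a copy of the result set itself
citations = {}


def cite(query, timestamp):
    """Record a query and data version under a stable identifier."""
    payload = json.dumps({"query": query, "version": timestamp}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    identifier = "query-" + digest  # stand-in for a real DOI
    citations[identifier] = (query, timestamp)
    return identifier


def resolve(store, identifier):
    """Re-run the cited query against the data as it was when cited."""
    query, timestamp = citations[identifier]
    return store.search(query, timestamp)


# Usage: cite a search, then change the data afterwards.
store = VersionedStore()
store.put("r1", 1, {"genus": "Aedes"})
store.put("r2", 1, {"genus": "Culex"})
doi = cite({"genus": "Aedes"}, timestamp=1)
store.put("r2", 2, {"genus": "Aedes"})  # record re-identified later

original = resolve(store, doi)                   # data as originally cited
latest = store.search({"genus": "Aedes"}, 2)     # same query, current data
```

In this sketch `original` contains only `r1`, while `latest` contains both records, which mirrors the choice between reproducing the cited data and updating to the current version.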
Reproducibility is essential for scientific research, yet ironically few data repositories support the citation of dynamic datasets that change over time. This new feature of the Museum’s Data Portal not only makes it possible to reproduce the exact version of a subset of our collection data that has been incorporated into other research, but also allows users to update these data, helping users stay updated with additions and improvements to our datasets.
Vince Smith, Head of Informatics Division at the Natural History Museum, London
By persistently identifying query results, researchers can cite data precisely and have confidence that, although the data may change after they use it, readers of their work will be able to access the data as it looked when they originally studied it. Users can also choose to update their datasets, allowing them to incorporate changes made over time. This should also encourage the systematic use of citations, making it easier to track both the usage and impact of research and collections datasets.
These new features apply across the Portal, so staff-uploaded research datasets also have DOIs created on download, allowing us to track the usage of these datasets too.
Tracking our Impact
Tracking the impact of the Data Portal is critical to ensuring its continued support and success. Prior to this latest upgrade, DOIs were only available at the dataset level or for the full online collection. While this is useful, it doesn’t provide us with much insight into what our users are actually doing with the data – for example, are most people downloading entire resources or single records? Producing DOIs for subsets of resources gives us the ability to track the use of the data at a more granular level, improving the accuracy and depth of our reporting. Plans are currently being developed for how best to track these new DOIs and how we can present that information to users on the Data Portal – for example, by showing the number of citations a DOI has and which papers it has been used in.
Additionally, we can now track the queries being used for cited data on the Portal. This provides us with new insights into how users are querying the Data Portal, which can then be fed back into improvements and new features, ensuring they reflect our users’ habits. We can also start to track changes to data over time, such as the most frequently updated specimen record or the most cited specimens, giving us new insights into the collection.
The changes that the Data Portal team have made pave the way for future evolution of the Data Portal and unlock further improvements. New features expected to be implemented along with other smaller improvements over the coming months include:
- More advanced search abilities, including the ability to use “and”, “or” and “not”, as well as enhanced data querying such as numerical queries
- Cross-resource search, allowing searches across all tabulated data on the Portal
- A redesign of the front page and search results pages
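To give a feel for what combining “and”, “or” and “not” with numerical queries could look like, here is a minimal Python sketch. It is purely illustrative: the nested-tuple query format, the operator names (`eq`, `gt`, `lt`) and the `matches` function are assumptions made for this example and say nothing about how the Portal's search will actually be expressed.

```python
def matches(record, query):
    """Evaluate a nested query (hypothetical format) against one record."""
    op, *args = query
    # Boolean operators combine sub-queries.
    if op == "and":
        return all(matches(record, sub) for sub in args)
    if op == "or":
        return any(matches(record, sub) for sub in args)
    if op == "not":
        return not matches(record, args[0])
    if op not in ("eq", "gt", "lt"):
        raise ValueError(f"unknown operator: {op}")
    # Leaf operators compare one field with one value.
    field_name, value = args
    actual = record.get(field_name)
    if op == "eq":
        return actual == value
    if actual is None:
        return False  # a missing field never satisfies a numerical test
    return actual > value if op == "gt" else actual < value


# Usage: Aedes specimens that were NOT collected before 1980.
records = [
    {"genus": "Aedes", "year": 1950},
    {"genus": "Culex", "year": 1990},
    {"genus": "Aedes", "year": 2001},
]
query = ("and", ("eq", "genus", "Aedes"), ("not", ("lt", "year", 1980)))
hits = [record for record in records if matches(record, query)]
```

Here `hits` holds only the 2001 Aedes record: the 1950 record fails the “not before 1980” clause and the Culex record fails the genus clause.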
Longer term, we would like to add the ability for users to annotate our data, helping to address some of the challenges we face in maintaining data quality and providing better links to associated data resources. Find out more about the digital collections programme at www.nhm.ac.uk/digitalcollections. Explore our data online at data.nhm.ac.uk and, if you would like to get in touch, email us at firstname.lastname@example.org.