Benchmark Data Repositories for Better Benchmarking
- Comparatively less attention has been paid to the repositories where these datasets are stored, documented, and shared.
Introduction
- Benchmarking can help quantify progress on these tasks over time, and the availability of a well-studied, standard task evaluation environment can be a critical first step before moving to real-world applications, especially in high-stakes or expensive domains.
- Early data repositories, such as the UCI ML Repository, arose to address the data needs that come with ML benchmarking.
- The process of selecting a dataset is often less about a scientific or engineering application and more about the compositional characteristics of the data and its associated tasks, for which a particular class of methods is applicable.
- The term “benchmark repository” is used to describe repositories that support the discovery and use of datasets for evaluating ML models.
Valuing Datasets as Research Contributions
Dataset Citations and Metrics
- A persistent identifier (PID), such as a DOI, that can reliably be used to access a dataset has been widely recommended by experts.
- ML datasets and their documentation frequently lack PIDs and are often only available via GitHub or personal/research group websites.
- In ML, however, datasets are often referred to using combinations of names, descriptions, and associated papers, which can be challenging to disambiguate.
- Dataset citations should include the minted DOI.
- A PID connects a dataset to associated research entities (such as the dataset’s creators or maintainers, publications, code, or other datasets).
- Introductory papers can give the data a story beyond standardized documentation, providing useful context about the problem, background on data collection procedures, and guidance about tasks for which the data have already been used.
- Introductory papers can be included in benchmark repositories as a standardized metadata field.
- Repositories can record a dataset’s point of contact: someone responsible for answering questions about the dataset and addressing any issues.
- Although long-term data maintenance, and determining different stakeholders’ responsibilities in that maintenance, remain challenging tasks [28, 60, 61], establishing a point of contact can help prevent the development of a disconnect between a dataset and its creators, which is not uncommon in ML.
Dataset Licenses
- Data repositories can include licenses as part of a dataset’s metadata.
- It is ambiguous whether models trained on a dataset count as “derivative work”.
- Current licensing practices for ML datasets are often irregular, conflicting, poorly documented, or over-permissive given the dataset content.
Addressing Issues with Dataset Content
- Technical flaws such as labeling errors and annotation artifacts
- Privacy and copyright violations
- Inclusion of hate speech or other harmful content
- Representational biases
- Miscellaneous ethical issues
Contextual Metadata
- By including information about a dataset’s source, funding, collection, annotation, and preprocessing, benchmark repositories can illuminate the assumptions and motivations of dataset creators and flag potential dataset issues.
- Data Cards
- datasheets
- Dataset Nutrition Label
- FAIR principles
- Data selection, filtering, and annotation processes are important design decisions that can significantly impact downstream performance [7, 38, 75, 100, 101]; however, they tend to be under-documented.
Quality Review
- Datasets containing personally identifiable information should not be released.
- Benchmark repositories can help identify these problems throughout the data lifecycle by (1) performing a pre-release quality review [92] to catch issues before a dataset is shared, and (2) serving as a centralized location to collect users’ reports and concerns [72] to flag issues throughout a dataset’s use and reuse.
- It is an open question to what extent repositories should be involved in these decisions; several popular repositories (e.g., Zenodo and Mendeley) view their role as only providing infrastructure and not conducting any kind of data review.
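Where a repository does perform pre-release review, part of it could be automated. The sketch below is a minimal, assumed example of one such check (flagging email-like strings as possible personally identifiable information); real reviews would combine many automated checks with human judgment, and the regex here is deliberately simplistic.

```python
import re

# Crude email pattern; a real PII scan would cover many more identifier types.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def flag_potential_pii(rows: list[str]) -> list[int]:
    """Return indices of rows containing email-like strings (possible PII)."""
    return [i for i, row in enumerate(rows) if EMAIL_RE.search(row)]

rows = [
    "user comment: great product",
    "contact me at jane@example.com",   # invented example value
    "no issues reported",
]
print(flag_potential_pii(rows))  # indices of flagged rows
```

Flagged rows would go to a human reviewer rather than being dropped automatically, since such patterns produce both false positives and false negatives.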
Dataset Revision and Deprecation
- Repositories can support dataset revision by documenting data versions and connecting each dataset to a responsible point of contact.
- Repositories can enforce versioning by assigning a new version number whenever a data file is changed.
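One way such version enforcement could work is to track a content hash per data file and bump the version whenever the hash changes. This is a minimal sketch under that assumption, not how any particular repository is implemented.

```python
import hashlib

class VersionedFile:
    """Assign a new version number whenever the file's content changes."""
    def __init__(self) -> None:
        self.version = 0
        self._last_hash: str | None = None

    def update(self, content: bytes) -> int:
        digest = hashlib.sha256(content).hexdigest()
        if digest != self._last_hash:
            # Content differs from the last recorded upload: bump the version.
            self._last_hash = digest
            self.version += 1
        return self.version

vf = VersionedFile()
print(vf.update(b"id,label\n1,cat\n"))   # 1: first upload
print(vf.update(b"id,label\n1,cat\n"))   # 1: identical content, same version
print(vf.update(b"id,label\n1,dog\n"))   # 2: content changed, version bumped
```

Hashing the content rather than trusting uploader-declared versions means silent in-place edits cannot masquerade as the same version.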
- Repositories can also support the deprecation of datasets.
- Creators often withdraw their dataset without an explanation of why it was withdrawn or explicit instructions not to use the dataset.
- Deprecation reports are posted in a scattered, decentralized manner via news articles, conference papers, or researcher or lab websites.
- It can be unclear to researchers whether a dataset is acceptable to use; it is not uncommon for datasets to remain in use after their deprecation, including in published, peer-reviewed papers.
- If a deprecation report is clearly displayed in the same place where a dataset was available, it clarifies to researchers (and reviewers) that the dataset should not be used.
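Surfacing the deprecation report at the point of access could look like the following sketch, where the entry returns the report in place of the download link once a dataset is deprecated. The class and its interface are hypothetical, invented for illustration.

```python
class DatasetEntry:
    """Hypothetical repository entry that surfaces deprecation status on access."""
    def __init__(self, name: str, url: str) -> None:
        self.name = name
        self.url = url
        self.deprecation_report: str | None = None

    def deprecate(self, report: str) -> None:
        self.deprecation_report = report

    def access(self) -> str:
        if self.deprecation_report is not None:
            # Show the report in the same place the dataset was available.
            return f"DEPRECATED: {self.deprecation_report}"
        return self.url

entry = DatasetEntry("ExampleFaces", "https://example.org/examplefaces")
print(entry.access())                      # download URL while the dataset is active
entry.deprecate("Withdrawn over consent concerns; do not use.")
print(entry.access())                      # deprecation report after withdrawal
```

Keeping the report on the original landing page avoids the scattered, decentralized announcements described above.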
- Compositional metadata describe the makeup of a dataset; Croissant is one standardized format for such metadata.
- Task metadata include the intended or appropriate ML tasks for a dataset (e.g., image classification or time-series prediction) and specialized metadata relevant to those tasks.
- As an example, HuggingFace Datasets [17], which specializes in NLP data, collects metadata on language and multilinguality, text creation, and fine-grained NLP tasks.
- Repositories can require that data donors provide high-quality compositional and task metadata, including specialized task metadata, where appropriate.
- If a repository displays a particular benchmarked result for a dataset, it should ensure that specific details on all settings and hyperparameter values used to obtain that result are available.
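A record combining compositional and task metadata might look like the simplified sketch below. The field names are loosely inspired by formats such as Croissant and HuggingFace dataset cards, but they are illustrative assumptions rather than either format's actual schema, and the dataset is invented.

```python
import json

dataset_record = {
    "name": "example-nlp-corpus",           # hypothetical dataset
    "compositional": {                      # describes the makeup of the data
        "modality": "text",
        "num_examples": 10_000,
        "features": ["text", "label"],
    },
    "task": {                               # intended tasks + specialized metadata
        "intended_tasks": ["text-classification"],
        "languages": ["en", "fr"],
        "multilinguality": "multilingual",
    },
}
print(json.dumps(dataset_record, indent=2))
```

Structured fields like these are what make metadata-based search and filtering (discussed later) possible, as opposed to free-text descriptions.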
Encouraging Holistic Evaluation
- Kaggle [18] and Papers with Code
- Leaderboarding
- Further, as measures of uncertainty are seldom incorporated, seemingly record-breaking performance improvements are not always statistically significant.
Analysis Beyond Single Metrics
- Repositories can display a variety of metrics, capturing model size and complexity, energy consumption, inference latency, and the amount of data used.
- Error analysis or disaggregated evaluations
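Disaggregated evaluation means reporting a metric separately per subgroup instead of only in aggregate. A minimal sketch with made-up predictions, labels, and group assignments:

```python
from collections import defaultdict

def disaggregated_accuracy(preds, labels, groups):
    """Accuracy computed separately for each subgroup."""
    correct, total = defaultdict(int), defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    return {g: correct[g] / total[g] for g in total}

# Synthetic data: per-example predictions, labels, and subgroup membership.
preds  = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]
print(disaggregated_accuracy(preds, labels, groups))
```

An aggregate accuracy can hide a subgroup on which the model performs far worse; the per-group breakdown makes such gaps visible.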
Metric Uncertainty
- Metrics shown without any measure of uncertainty can prompt fallacious conclusions, e.g., that one model performs definitively better than another.
- Instead, including uncertainty makes these benchmarked results more informative [1, 157], and a growing body of methodologies has been developed for estimating uncertainty, computing confidence intervals, and performing statistical significance testing in the context of model comparison.
- Variance, uncertainty, and statistical significance in model analysis
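One common instance of such methodology is a paired percentile bootstrap over per-example scores, giving a confidence interval on the accuracy difference between two models evaluated on the same test set. The sketch below uses synthetic per-example correctness data and a fixed seed; it is a minimal sketch, not a recommendation of specific settings.

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b), paired by example."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample test-set indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic per-example correctness (1 = correct) for two models on one test set.
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
b = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
lo, hi = bootstrap_diff_ci(a, b)
print(f"95% CI for accuracy difference: [{lo:.2f}, {hi:.2f}]")
# If the interval contains 0, the apparent improvement may not be significant.
```

Pairing by example matters: resampling the two models' scores independently would ignore that both were evaluated on the same items and inflate the variance of the difference.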
Living Datasets
- Leaderboards can evaluate submitted models on a private, hitherto unused test set [167–169] (e.g., as done by Kaggle) or on out-of-distribution data.
- Leaderboards can also support “living” or evolving datasets, to which dataset creators continuously add new examples or tasks and remove outdated or erroneous examples.
- Benchmarked performances of two models evaluated at two different points in time may not be directly comparable.
- Robust models will generally outperform those using a specific trick or artifact as evaluations are repeated over time, de-incentivizing overfitting and helping bridge the gap between benchmarked and real-world performance.
Dataset Discoverability
- There also exists a plethora of other high-quality datasets that could have been used but did not win the “benchmark lottery”.
- Existing standards emphasize the importance of standardized, rich metadata [39, 129, 130], which enable searching for datasets via keywords, filtering, and controlled vocabularies.
- Repositories can support search based on compositional and task metadata.
- The UCI ML Repository’s search functionality includes a filter for classification, regression, clustering, or other datasets.
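A task-type filter like the one described can be sketched as a simple predicate over a metadata catalog. The catalog entries and field names below are invented for illustration, not drawn from the UCI repository's actual schema.

```python
# Hypothetical catalog: each entry carries task metadata for filtering.
datasets = [
    {"name": "iris-like",    "tasks": ["classification", "clustering"]},
    {"name": "housing-like", "tasks": ["regression"]},
    {"name": "traffic-like", "tasks": ["time-series", "regression"]},
]

def filter_by_task(catalog, task):
    """Return names of datasets whose task metadata includes the requested task."""
    return [d["name"] for d in catalog if task in d["tasks"]]

print(filter_by_task(datasets, "regression"))  # -> ['housing-like', 'traffic-like']
```

Filters like this only work when donors supply the structured task metadata discussed earlier, which is why repositories requiring such fields directly improves discoverability.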