Benchmark Data Repositories for Better Benchmarking
- Comparatively less attention has been paid to the repositories where these datasets are stored, documented, and shared.
Introduction
- Benchmarking can help quantify progress on these tasks over time, and the availability of a well-studied, standard task evaluation environment can be a critical first step before moving to real-world applications, especially in high-stakes or expensive domains.
- Early data repositories, such as the UCI ML Repository, arose to address the data needs that come with ML benchmarking.
- The process of selecting a dataset is often less about a scientific or engineering application and more about the compositional characteristics of the data and its associated tasks, for which a particular class of methods is applicable.
- The term “benchmark repository” is used to describe repositories that support the discovery and use of datasets for evaluating ML models.
Valuing Datasets as Research Contributions
Dataset Citations and Metrics
- A persistent identifier (PID), such as a DOI, that can reliably be used to access a dataset has been widely recommended by experts.
- ML datasets and their documentation frequently lack PIDs and are often only available via GitHub or personal/research group websites.
- In ML, however, datasets are often referred to using combinations of names, descriptions, and associated papers, which can be challenging to disambiguate.
- Dataset citations should include the minted DOI.
- A PID connects a dataset to associated research entities (such as the dataset’s creators or maintainers, publications, code, or other datasets).
- Introductory papers can give the data a story beyond standardized documentation, providing useful context about the problem, background on data collection procedures, and guidance about tasks for which the data have already been used.
- Introductory papers can be included in benchmark repositories as a standardized metadata field.
- Repositories can record a dataset’s point of contact: someone responsible for answering questions about the dataset and addressing any issues.
- Although long-term data maintenance, and determining different stakeholders’ responsibilities in that maintenance, remain challenging tasks [28, 60, 61], establishing a point of contact can help prevent the development of a disconnect between a dataset and its creators, which is not uncommon in ML.
Dataset Licenses
- Data repositories can include licenses as part of a dataset’s metadata.
- It is ambiguous whether models trained on a dataset count as “derivative work”.
- Current licensing practices for ML datasets are often irregular, conflicting, poorly documented, or over-permissive given the dataset content.
Addressing Issues with Dataset Content
- Technical flaws such as labeling errors and annotation artifacts
- Privacy and copyright violations
- Inclusion of hate speech or other harmful content
- Representational biases
- Miscellaneous ethical issues
Contextual Metadata
- By including information about a dataset’s source, funding, collection, annotation, and preprocessing, benchmark repositories can illuminate the assumptions and motivations of dataset creators and flag potential dataset issues.
- Data Cards
- datasheets
- Dataset Nutrition Label
- FAIR principles
- Data selection, filtering, and annotation processes are important design decisions that can significantly impact downstream performance [7, 38, 75, 100, 101]; however, they tend to be under-documented.
Quality Review
- Datasets containing personally identifiable information should not be released.
- Benchmark repositories can help identify these problems throughout the data lifecycle by (1) performing a pre-release quality review [92] to catch issues before a dataset is shared, and (2) serving as a centralized location to collect users’ reports and concerns [72] to flag issues throughout a dataset’s use and reuse.
- It is an open question to what extent repositories should be involved in these decisions; several popular repositories (e.g., Zenodo and Mendeley) view their role as only providing infrastructure and not conducting any kind of data review.
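Where a repository does perform pre-release review, part of it could be automated. The sketch below is a minimal, assumed example of one such check (flagging email-like strings as possible personally identifiable information); real reviews would combine many automated checks with human judgment, and the regex here is deliberately simplistic.

```python
import re

# Crude email pattern; a real PII scan would cover many more identifier types.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def flag_potential_pii(rows: list[str]) -> list[int]:
    """Return indices of rows containing email-like strings (possible PII)."""
    return [i for i, row in enumerate(rows) if EMAIL_RE.search(row)]

rows = [
    "user comment: great product",
    "contact me at jane@example.com",   # invented example value
    "no issues reported",
]
print(flag_potential_pii(rows))  # indices of flagged rows
```

Flagged rows would go to a human reviewer rather than being dropped automatically, since such patterns produce both false positives and false negatives.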
Dataset Revision and Deprecation
- Repositories can support dataset revision by documenting data versions and connecting each dataset to a responsible point of contact.
- Repositories can enforce versioning by assigning a new version number whenever a data file is changed.
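One way such version enforcement could work is to track a content hash per data file and bump the version whenever the hash changes. This is a minimal sketch under that assumption, not how any particular repository is implemented.

```python
import hashlib

class VersionedFile:
    """Assign a new version number whenever the file's content changes."""
    def __init__(self) -> None:
        self.version = 0
        self._last_hash: str | None = None

    def update(self, content: bytes) -> int:
        digest = hashlib.sha256(content).hexdigest()
        if digest != self._last_hash:
            # Content differs from the last recorded upload: bump the version.
            self._last_hash = digest
            self.version += 1
        return self.version

vf = VersionedFile()
print(vf.update(b"id,label\n1,cat\n"))   # 1: first upload
print(vf.update(b"id,label\n1,cat\n"))   # 1: identical content, same version
print(vf.update(b"id,label\n1,dog\n"))   # 2: content changed, version bumped
```

Hashing the content rather than trusting uploader-declared versions means silent in-place edits cannot masquerade as the same version.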
- Repositories can also support the deprecation of datasets.
- Creators often withdraw their dataset without an explanation of why it was withdrawn or explicit instructions not to use the dataset.
- Deprecation reports are posted in a scattered, decentralized manner via news articles, conference papers, or researcher or lab websites.
- It can be unclear to researchers whether a dataset is acceptable to use; it is not uncommon for datasets to remain in use after their deprecation, including in published, peer-reviewed papers.
- If a deprecation report is clearly displayed in the same place where a dataset was available, it clarifies to researchers (and reviewers) that the dataset should not be used.
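Surfacing the deprecation report at the point of access could look like the following sketch, where the entry returns the report in place of the download link once a dataset is deprecated. The class and its interface are hypothetical, invented for illustration.

```python
class DatasetEntry:
    """Hypothetical repository entry that surfaces deprecation status on access."""
    def __init__(self, name: str, url: str) -> None:
        self.name = name
        self.url = url
        self.deprecation_report: str | None = None

    def deprecate(self, report: str) -> None:
        self.deprecation_report = report

    def access(self) -> str:
        if self.deprecation_report is not None:
            # Show the report in the same place the dataset was available.
            return f"DEPRECATED: {self.deprecation_report}"
        return self.url

entry = DatasetEntry("ExampleFaces", "https://example.org/examplefaces")
print(entry.access())                      # download URL while the dataset is active
entry.deprecate("Withdrawn over consent concerns; do not use.")
print(entry.access())                      # deprecation report after withdrawal
```

Keeping the report on the original landing page avoids the scattered, decentralized announcements described above.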
- Compositional metadata describe the makeup of a dataset; Croissant is one standardized format for such metadata.
- Task metadata include the intended or appropriate ML tasks for a dataset (e.g., image classification or time-series prediction) and specialized metadata relevant to those tasks.
- As an example, HuggingFace Datasets [17], which specializes in NLP data, collects metadata on language and multilinguality, text creation, and fine-grained NLP tasks.
- Repositories can require that data donors provide high-quality compositional and task metadata, including specialized task metadata, where appropriate.
- If a repository displays a particular benchmarked result for a dataset, it should ensure that specific details on all settings and hyperparameter values used to obtain that result are available.
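A record combining compositional and task metadata might look like the simplified sketch below. The field names are loosely inspired by formats such as Croissant and HuggingFace dataset cards, but they are illustrative assumptions rather than either format's actual schema, and the dataset is invented.

```python
import json

dataset_record = {
    "name": "example-nlp-corpus",           # hypothetical dataset
    "compositional": {                      # describes the makeup of the data
        "modality": "text",
        "num_examples": 10_000,
        "features": ["text", "label"],
    },
    "task": {                               # intended tasks + specialized metadata
        "intended_tasks": ["text-classification"],
        "languages": ["en", "fr"],
        "multilinguality": "multilingual",
    },
}
print(json.dumps(dataset_record, indent=2))
```

Structured fields like these are what make metadata-based search and filtering (discussed later) possible, as opposed to free-text descriptions.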
Encouraging Holistic Evaluation
- Kaggle [18] and Papers with Code
- Leaderboarding
- Further, as measures of uncertainty are seldom incorporated, seemingly record-breaking performance improvements are not always statistically significant.
Analysis Beyond Single Metrics
- Repositories can display a variety of metrics, capturing model size and complexity, energy consumption, inference latency, and the amount of data used.
- Error analysis or disaggregated evaluations
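Disaggregated evaluation means reporting a metric separately per subgroup instead of only in aggregate. A minimal sketch with made-up predictions, labels, and group assignments:

```python
from collections import defaultdict

def disaggregated_accuracy(preds, labels, groups):
    """Accuracy computed separately for each subgroup."""
    correct, total = defaultdict(int), defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    return {g: correct[g] / total[g] for g in total}

# Synthetic data: per-example predictions, labels, and subgroup membership.
preds  = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]
print(disaggregated_accuracy(preds, labels, groups))
```

An aggregate accuracy can hide a subgroup on which the model performs far worse; the per-group breakdown makes such gaps visible.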
Metric Uncertainty
- Metrics shown without any measure of uncertainty can prompt fallacious conclusions, e.g., that one model performs definitively better than another.
- Instead, including uncertainty makes these benchmarked results more informative [1, 157], and a growing body of methodologies has been developed for estimating uncertainty, computing confidence intervals, and performing statistical significance testing in the context of model comparison.
- Variance, uncertainty, and statistical significance in model analysis
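One common instance of such methodology is a paired percentile bootstrap over per-example scores, giving a confidence interval on the accuracy difference between two models evaluated on the same test set. The sketch below uses synthetic per-example correctness data and a fixed seed; it is a minimal sketch, not a recommendation of specific settings.

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b), paired by example."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample test-set indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic per-example correctness (1 = correct) for two models on one test set.
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
b = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
lo, hi = bootstrap_diff_ci(a, b)
print(f"95% CI for accuracy difference: [{lo:.2f}, {hi:.2f}]")
# If the interval contains 0, the apparent improvement may not be significant.
```

Pairing by example matters: resampling the two models' scores independently would ignore that both were evaluated on the same items and inflate the variance of the difference.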
Living Datasets
- Leaderboards can evaluate submitted models on a private, hitherto unused test set [167–169] (e.g., as done by Kaggle) or on out-of-distribution data.
- Leaderboards can also support “living” or evolving datasets, to which dataset creators continuously add new examples or tasks and remove outdated or erroneous examples.
- Benchmarked performances of two models evaluated at two different points in time may not be directly comparable.
- Robust models will generally outperform those using a specific trick or artifact as evaluations are repeated over time, de-incentivizing overfitting and helping bridge the gap between benchmarked and real-world performance.
Dataset Discoverability
- There also exists a plethora of other high-quality datasets that could have been used but did not win the “benchmark lottery”.
- Existing standards emphasize the importance of standardized, rich metadata [39, 129, 130], which enable searching for datasets via keywords, filtering, and controlled vocabularies.
- Repositories can support search based on compositional and task metadata.
- The UCI ML Repository’s search functionality includes a filter for classification, regression, clustering, or other datasets.
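A task-type filter like the one described can be sketched as a simple predicate over a metadata catalog. The catalog entries and field names below are invented for illustration, not drawn from the UCI repository's actual schema.

```python
# Hypothetical catalog: each entry carries task metadata for filtering.
datasets = [
    {"name": "iris-like",    "tasks": ["classification", "clustering"]},
    {"name": "housing-like", "tasks": ["regression"]},
    {"name": "traffic-like", "tasks": ["time-series", "regression"]},
]

def filter_by_task(catalog, task):
    """Return names of datasets whose task metadata includes the requested task."""
    return [d["name"] for d in catalog if task in d["tasks"]]

print(filter_by_task(datasets, "regression"))  # -> ['housing-like', 'traffic-like']
```

Filters like this only work when donors supply the structured task metadata discussed earlier, which is why repositories requiring such fields directly improves discoverability.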