OpenML Feedback

To contact

Name	Email	Paper	Done	Where	Link
Damian			y
Irene			y	SURF
Aparna			y	Ireland
Guillaume			y	Paris, Probabl
Ihsan			y	Edinburgh
Sebastian			y	Paris, sklearn
Binh P. Nguyen	binh.p.nguyen@vuw.ac.nz	Challenges and opportunities of generative models on tabular data		New Zealand
Hariharan Manikandan	hmanikan@cs.cmu.edu	Language models are weak learners		CMU
Jacobus G. M. van der Linden	J.G.M.vanderLinden@tudelft.nl	Optimal or Greedy Decision Trees? Revisiting their Objectives, Tuning, and Performance		TU Delft	https://arxiv.org/pdf/2409.12788
Ning Zhang, Ruidong Wang	drningzhang@global.tencent.com , ruidwang@global.tencent.com	Class-aware and Augmentation-free Contrastive Learning from Label Proportion		Tencent Amsterdam	https://arxiv.org/pdf/2408.06743
Dan Stowell	elodie.briefer@bio.ku.dk	Sound evidence for biodiversity monitoring		Tilburg
Ben van Werkhoven	b.van.werkhoven@liacs.leidenuniv.nl	FAIR Sharing of Data in Autotuning Research		Leiden

Email Templates

Idea 1 - People on OpenML Slack

Hello, this is Subhaditya, a research engineer from OpenML. We are doing a small user study for version 2.0 of OpenML Python and would love to hear about your experience using OpenML.

- Would you be open to a chat?

- If you're in NL, we can meet over a coffee

- Hello <name>, this is Subhaditya, a developer at OpenML. It's nice to e-meet you :)
- I saw that you recently used OpenML for a paper, and would love to hear about your experience using OpenML. We are currently doing a user study for version 2.0 of OpenML Python, and any views are welcome.

- Would you be open to a chat sometime next week?

Idea 2 - People Who Don't Know OpenML

Hello <name>,

I’m Subhaditya, a researcher at TU Eindhoven working on an NWO project about open science in machine learning. It’s centered around OpenML, a free and open platform for sharing datasets and benchmarks (maybe you’ve heard about it?). I’m exploring how to make it easier to use and help accelerate scientific research.

I’d love to talk to you about your experiences in storing research data and how we can help researchers across the Netherlands.

Would you or someone in your lab be open to a brief call sometime? I made a calendar where you can pick any time that suits you. If you’re in the Netherlands, I’d also be happy to meet in person over coffee.

Idea 3

local / Eindhoven
NWO project

This is Subhaditya, a research engineer from OpenML. We are a platform for collaborative open science with features for storing datasets and benchmarks easily and reproducibly. We are a team based at TU Eindhoven; maybe you have heard of us before?

We would very much like to hear about your experiences with storing research data and if there is some way we can help.

Our goal is to make open science more accessible for the research community, especially in the context of broader ML research. All the data is freely available (both to store and access) and can be accessed from the web or via programmatic interfaces. You can use OpenML to share your data and experiments. We find that a lot of our users tend to use this in their research papers as well.

Perhaps this might be useful to your lab?

We are currently working on a new version of our API and trying to make it more user-friendly for researchers. To that end, would you or someone in your lab be open to a brief chat sometime next week? If you’re in the Netherlands, I’d be happy to meet in person over coffee. If not, we can set up a short call instead.

Damian

dataset size
- GUI does not really work for it
- Bitbucket
- dataset class
evaluation datasets
AutoML
- company - PDF export, Excel export
  - monetization (probably not but yeah)
- academia - model download
- information - metrics, performance metrics, how long it takes to run
- what to do next from here?
  - data distribution is narrow for example
  - how to optimize different metrics
  - run AutoAttack model: adversarial robustness guarantees
  - what can be derived from the data
- similar datasets
  - acting like a proxy
how can you help people with all the knowledge in OpenML

Irene

Dataset upload: This process seems very straightforward, so overall it is easy enough to use.
def_tar_att and ‘Ignore attribute’: Most users will likely know what to fill in there, but some people getting started with OpenML and machine learning in general may have a harder time. Perhaps it’d be good to add a link to this page or a short description?
Collection date: I am not sure if you are looking for some uniformity there, but it would be nice to add instructions on the date/month/year format you prefer.
Auto-ML reports: Since I won’t be an end user, it’s a bit difficult for me to provide feedback here. I am looping in my colleague @Yue Zhao from the SURF HPML Team, who joined our meeting in Eindhoven and who may provide feedback herself or forward this request to someone else within the HPML team.

Aparna

part-time lecturer (business) at TU Dublin + PhD
hasn't really used it much before
KEEL
- randomized dataset
- different variants of datasets
hard to understand which is original
keep track of results manually
ran small scale only and openml feels large scale
found openml because of datasets
- unique datasets
feels like it'll do everything automatically
PhD
- metalearning - feature selection recommendation
  - ~100 datasets
- auto extraction of target variables
  - first and last col : column
  - image features - issues (like MNIST)

Guillaume

Probabl
we met during the hackathon as well
scikit-learn
- parquet not done
- fetch from openml - merged
- skrub - data preprocessing
  - use files from OpenML
  - logs about files
  - away from simple CSVs
    - more than one single file
  - upload several sources of data to OpenML and connect them
difficult part - harder data
- real life datasets
- what's there now is well curated
- time series data - not so easy to find
  - so try to generate synthetic since it's hard to find

Ihsan Ullah

Research Software Engineer at ChaLearn. The projects I work on span Machine Learning, Competitions, Software development, Academic writing, etc.
He and his team were very happy with the OpenML team and its support
Uses Codabench
- mainly focused on competitions
- on Codabench datasets/experiments/tasks are not visible like in OpenML but I think OpenML is the best when we want to find a specific item (runs/tasks/datasets etc)
Problems with OpenML
- when you search Meta_Album you get nothing but with Meta Album you get the correct datasets (Personal note. This really is true :/ I just checked)
- Yes, I had a bad experience when uploading through the UI, but the Python API works really well. And the support from OpenML is also great.
- Difficult to understand what OpenML does
  - add some videos, diagrams etc (Personal note - Bring back the cute robot!!)
- UI upload was super bad, it seems
- Modernize the UI (Personal note - Probably done with the new version)
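
A possible direction for the Meta_Album vs Meta Album search issue noted above, as a minimal sketch only. It is not how OpenML search is implemented; the function names and the sample dataset names are made up for illustration. The idea is to normalize separators on both the query and the stored names before matching, so both spellings return the same results.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse underscores, hyphens and whitespace into single spaces."""
    return re.sub(r"[\s_\-]+", " ", text).strip().lower()


def search(query: str, names: list[str]) -> list[str]:
    """Return the names that contain the query once both sides are normalized."""
    q = normalize(query)
    return [name for name in names if q in normalize(name)]


# Illustrative names only (hypothetical sample, not pulled from the server)
names = ["Meta_Album_BRD_Mini", "Meta_Album_PLK_Micro", "mnist_784"]

print(search("Meta Album", names))
print(search("Meta_Album", names))
```

With this kind of normalization, the underscore and space variants of a query match the same dataset names.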

Sebastian Fisher

hardest to use
- weird error messages
  - mark a column with the wrong type: evaluation engine runs and dataset stays unprocessed - no error messages
- as long as you follow the main path
  - so many features but not everything up to same standard
- survival task - doesn't really work
benchmark datasets and stuff are indistinguishable, task collections are just confusing (for NeurIPS)
edit the description
researcher doesn't need the AutoML report
- but still useful
- upload datasets where the point is not to run on it

General

User page
