Croissant

  • Croissant is a metadata description format
  • Ml datasets are a combination of structured and unstructured data, which make them complicated to manage
  • Croissant was built on top of schema.org, and has more details relative to it
  • The format has 4 layers
    • dataset level metadata
    • resource description
    • content structure
    • ml semantics
  • Croissant does not require any changes to underlying data
  • Analysis and visualization tools work out of the box for all datasets
  • Using croissant, datasets can be exposed consistently throughout platforms
  • Collaborations with google, huggingface, google dataset search also exist
  • openml has deeper dataset description by default, slightly lesser in HF and kaggle
  • once loaded, datasets can be imported elsewhere (torch, tf etc) easily
  • Croissant editor - web app where you can use a GUI to enter the dataset descriptions
  • NeurIPS also now recommends using the Croissant format