Data Storage for Different AI Dataset Services


Template

  • How do they store data?
    • blank
  • Why is it useful?
    • blank
  • Data upload interface
  • Data view interface

Kaggle

  • They started with datasets and competitions, then expanded into other offerings (notebooks, models, courses)
  • How do they store data?
    • Pretty much any format, but usually a folder of files (CSVs, images, etc.) plus dataset metadata
  • Why is it useful?
    • competitions
    • they were pretty much the first ones in this space
    • community features (likes etc)
  • Data upload interface
  • Data view interface

Hugging Face

  • Started with models, integrated with deep learning libraries; they took advantage of the rise of transformer architectures and have supported the major ones since
  • How do they store data?
    • A folder of Parquet files, with the occasional metadata CSV
  • Why is it useful?
    • easy to view the dataset directly without downloading
    • Spaces to run demos on the data directly
    • directly copy the API command for loading a dataset
    • many papers publish their datasets and models here
    • Git LFS uploads and GitHub-style discussions
    • community features (likes etc)
  • Data upload interface
  • Data view interface

NCBI (National Center for Biotechnology Information)

  • Domain-specific biotech data (genomics, gene expression, etc.)
  • How do they store data?
    • A tab-delimited file in the DataSet SOFT format (distributed gzip-compressed as .gz)
  • Why is it useful?
    • Domain-specific data views, e.g. per-gene views
  • Data view interface
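
A minimal toy illustration of the storage format above — a gzip-compressed tab-delimited table, read with only the standard library (the record content here is invented, not a real GEO entry):

```python
import csv
import gzip
import os
import tempfile

# Made-up rows in the spirit of a DataSet SOFT table (not real data).
rows = [
    ["ID_REF", "IDENTIFIER", "GSM_SAMPLE"],
    ["1007_s_at", "DDR1", "1234.5"],
]

# Write the table as a gzip-compressed, tab-delimited .gz file.
path = os.path.join(tempfile.mkdtemp(), "GDS0000.soft.gz")
with gzip.open(path, "wt", newline="") as fh:
    csv.writer(fh, delimiter="\t").writerows(rows)

# Reading it back is just gzip + csv with a tab delimiter.
with gzip.open(path, "rt", newline="") as fh:
    parsed = list(csv.reader(fh, delimiter="\t"))
print(parsed[1])
```

Real SOFT files also carry header/metadata lines before the table, so a full parser would skip those first; this sketch only covers the tab-delimited body.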

U.S. Department of Energy Office of Scientific and Technical Information (Physics)

  • How do they store data?
    • Images as .zip archives, tabular data as .csv, and metadata as .xml
  • Why is it useful?
  • Data upload interface
  • Data view interface
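
A toy sketch of unpacking a record in the layout described above (tabular data as .csv, metadata as .xml, bundled in an archive) using only the standard library; all file names and fields here are made up:

```python
import csv
import io
import os
import tempfile
import xml.etree.ElementTree as ET
import zipfile

# Build a hypothetical record bundle: a CSV table plus XML metadata.
path = os.path.join(tempfile.mkdtemp(), "record.zip")
with zipfile.ZipFile(path, "w") as zf:
    zf.writestr("measurements.csv", "energy_keV,count\n10,42\n")
    zf.writestr("metadata.xml", "<record><title>Example</title></record>")

# Reading it back: csv for the table, ElementTree for the metadata.
with zipfile.ZipFile(path) as zf:
    table = list(csv.reader(io.TextIOWrapper(zf.open("measurements.csv"))))
    title = ET.fromstring(zf.read("metadata.xml")).findtext("title")
print(table[1], title)
```

Images in the real bundles would sit alongside these files in the same archive and can be extracted with the same `zipfile` calls.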

Interesting Features for Us

  • Community features - likes, comments
  • Notebooks
  • Directly copy API command
  • More PR (implementations of recent datasets/models, etc.)