The Multimodal Universe Dataset

Available Versions

Version   Size      Release Date           Link
v1        98.34TB   2024-10-30 23:52 UTC   Download

Dataset Overview

The Multimodal Universe dataset is a large-scale collection of multimodal astronomical data, including images, spectra, and light curves, which aims to enable research into foundation models for astrophysics and beyond.

Abstract

We present the Multimodal Universe, a large-scale multimodal dataset of scientific astronomical data, compiled specifically to facilitate machine learning research. Overall, our dataset contains hundreds of millions of astronomical observations, constituting 100TB of multi-channel and hyper-spectral images, spectra, and multivariate time series, as well as a wide variety of associated scientific measurements and metadata. In addition, we include a range of benchmark tasks representative of standard practices for machine learning methods in astrophysics. This massive dataset will enable the development of large multi-modal models specifically targeted towards scientific applications. All code used to compile the dataset and a description of how to access the data are available at https://github.com/MultimodalUniverse/MultimodalUniverse. The paper is available at https://openreview.net/forum?id=EWm9zR5Qy1.

Available Data Types

The full dataset contains contributions from over 20 major astronomical surveys, totaling approximately 100TB of scientific data. All data is available through HTTPS or Globus from the Flatiron Institute, with preview datasets accessible via the Hugging Face Hub.
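
As a rough illustration of the Hugging Face Hub route mentioned above, the following sketch streams a preview dataset with the `datasets` library, so records arrive lazily instead of being downloaded in full. It assumes the previews are published under a `MultimodalUniverse` organization on the Hub. The survey name `example_survey` and the helpers `preview_repo_id` and `stream_preview` are hypothetical names introduced only for this example; check the project repository for the actual dataset names.

```python
from typing import Any, Iterator

# Assumption: preview datasets live under a "MultimodalUniverse" organization
# on the Hugging Face Hub. Verify this against the project repository.
HUB_ORG = "MultimodalUniverse"


def preview_repo_id(survey: str, org: str = HUB_ORG) -> str:
    """Build a Hub repository ID of the form '<org>/<survey>'."""
    survey = survey.strip().strip("/")
    if not survey:
        raise ValueError("survey name must be non-empty")
    return f"{org}/{survey}"


def stream_preview(survey: str, split: str = "train") -> Iterator[dict[str, Any]]:
    """Lazily iterate over a preview dataset without downloading it in full."""
    # Imported here so preview_repo_id works even without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(preview_repo_id(survey), split=split, streaming=True)
    return iter(dataset)


# Usage (needs network access and `pip install datasets`).
# "example_survey" is a placeholder, not a real dataset name:
#
#     first = next(stream_preview("example_survey"))
#     print(sorted(first.keys()))
```

Streaming suits a quick look at the schema of a preview. For the full ~100TB release, the HTTPS or Globus transfer described above is the more practical route.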

Changelog

v1 (Initial Release)

Cite: BibTeX Citation

@inproceedings{
    TheMultimodalUniverse2024,
    title={The Multimodal Universe: Enabling Large-Scale Machine Learning with 100{TB} of Astronomical Scientific Data},
    author={
        {The Multimodal Universe Collaboration} and Eirini Angeloudi and Jeroen Audenaert and
        Micah Bowles and Benjamin M. Boyd and David Chemaly and
        Brian Cherinka and Ioana Ciuca and Miles Cranmer and
        Aaron Do and Matthew Grayling and Erin Elizabeth Hayes and
        Tom Hehir and Shirley Ho and Marc Huertas-Company and
        Kartheik G. Iyer and Maja Jablonska and Francois Lanusse and
        Henry W. Leung and Kaisey Mandel and Juan Rafael Mart{\'i}nez-Galarza and
        Peter Melchior and Lucas Thibaut Meyer and Liam Holden Parker and
        Helen Qu and Jeff Shen and Michael J. Smith and
        Connor Stone and Mike Walmsley and John F Wu
    },
    booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2024},
    url={https://openreview.net/forum?id=EWm9zR5Qy1}
}