| Version | Size | Release Date | Link |
|---|---|---|---|
| v1 | 98.34TB | 2024-10-30 23:52 UTC | Download |

The MultimodalUniverse dataset is a large-scale collection of multimodal astronomical data, including images, spectra, and light curves, which aims to enable research into foundation models for astrophysics and beyond.

We present the Multimodal Universe, a large-scale multimodal dataset of scientific astronomical data, compiled specifically to facilitate machine learning research. Overall, our dataset contains hundreds of millions of astronomical observations, constituting 100TB of multi-channel and hyper-spectral images, spectra, multivariate time series, as well as a wide variety of associated scientific measurements and metadata. In addition, we include a range of benchmark tasks representative of standard practices for machine learning methods in astrophysics. This massive dataset will enable the development of large multimodal models specifically targeted towards scientific applications. All code used to compile the dataset, along with a description of how to access the data, is available at https://github.com/MultimodalUniverse/MultimodalUniverse. The paper is available at https://openreview.net/forum?id=EWm9zR5Qy1.

The full dataset contains contributions from over 20 major astronomical surveys, totaling approximately 100TB of scientific data. All data is available through HTTPS or Globus from the Flatiron Institute, with preview datasets accessible via the Hugging Face Hub.
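
Because the full release is close to 100TB, it can help to estimate the transfer time before choosing between a full download and the preview datasets. The following is a minimal sketch of that arithmetic, not part of the official tooling. The function name `estimate_transfer_days`, the `efficiency` parameter, and the link speeds are illustrative assumptions, and the size is read as decimal terabytes (1 TB = 10^12 bytes).

```python
# Back-of-the-envelope transfer-time estimate for the full v1 release.
# Assumptions (not from the dataset documentation): decimal terabytes
# (1 TB = 10**12 bytes) and a sustained link speed for the whole transfer.

V1_SIZE_TB = 98.34  # size listed for v1 in the release table


def estimate_transfer_days(size_tb: float, link_gbps: float,
                           efficiency: float = 1.0) -> float:
    """Return the days needed to move size_tb terabytes over a link.

    link_gbps is the nominal link speed in gigabits per second;
    efficiency is the (hypothetical) fraction of it actually sustained.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    total_bits = size_tb * 1e12 * 8                 # terabytes -> bits
    seconds = total_bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400                         # seconds -> days


if __name__ == "__main__":
    for gbps in (0.1, 1.0, 10.0):
        days = estimate_transfer_days(V1_SIZE_TB, gbps)
        print(f"{gbps:>5.1f} Gbit/s -> {days:6.1f} days")
```

Under these assumptions, a fully sustained 1 Gbit/s link would need about 9.1 days for the complete release, which is one reason to start with the preview datasets on the Hugging Face Hub.
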
@inproceedings{
TheMultimodalUniverse2024,
title={The Multimodal Universe: Enabling Large-Scale Machine Learning with 100{TB} of Astronomical Scientific Data},
author={
{The Multimodal Universe Collaboration} and Eirini Angeloudi and Jeroen Audenaert and
Micah Bowles and Benjamin M. Boyd and David Chemaly and
Brian Cherinka and Ioana Ciuca and Miles Cranmer and
Aaron Do and Matthew Grayling and Erin Elizabeth Hayes and
Tom Hehir and Shirley Ho and Marc Huertas-Company and
Kartheik G. Iyer and Maja Jablonska and Francois Lanusse and
Henry W. Leung and Kaisey Mandel and Juan Rafael Mart{\'i}nez-Galarza and
Peter Melchior and Lucas Thibaut Meyer and Liam Holden Parker and
Helen Qu and Jeff Shen and Michael J. Smith and
Connor Stone and Mike Walmsley and John F Wu
},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=EWm9zR5Qy1}
}