Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised framework in which users may flexibly combine the modalities of vision, speech, and language into unified general-purpose vector representations. In this framework, data...