Lecture 8: Introduction to Multimodal Machine Learning — Part 1 Representation
Challenges in Multimodal Learning
- Representation
- Alignment
- Translation
- Fusion
- Co-Learning
Representation
Definition: Learning representations that reflect cross-modal interactions between individual elements, across different modalities.
Representation Fusion
Definition: Learn a joint representation that models cross-modal interactions between individual elements of different modalities.
- Unimodal encoders can be learned jointly with the fusion network, or pre-trained
Early and Late Fusion
Basic Concepts
- Additive terms
- Multiplicative ‘interaction’ term
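For two scalar features, the two kinds of terms can be written out as below. This is a sketch in assumed notation: x_A and x_B are the unimodal features, w_1, w_2, w_3 are learned weights, and ε is a residual; none of these symbols come from the slides.

```latex
z = \underbrace{w_1 x_A + w_2 x_B}_{\text{additive terms}}
  + \underbrace{w_3 \, (x_A \times x_B)}_{\text{multiplicative interaction term}}
  + \epsilon
```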
Additive Fusion
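A minimal sketch of additive fusion on vector features, assuming the form z = W_a x_a + W_b x_b (the names and toy values are illustrative, not from the lecture). It also checks that this is the same as one linear layer applied to the concatenated input.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_fusion(W_a, x_a, W_b, x_b):
    """z = W_a x_a + W_b x_b: each modality contributes its own additive term."""
    return [za + zb for za, zb in zip(matvec(W_a, x_a), matvec(W_b, x_b))]

# Toy features and weights (illustrative values only).
x_a = [1.0, 2.0]          # modality A feature, dimension 2
x_b = [3.0, 4.0, 5.0]     # modality B feature, dimension 3
W_a = [[1.0, 0.0], [0.0, 1.0]]
W_b = [[1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]

z = additive_fusion(W_a, x_a, W_b, x_b)

# Same result from one linear layer on the concatenation [x_a; x_b].
W_cat = [row_a + row_b for row_a, row_b in zip(W_a, W_b)]
z_cat = matvec(W_cat, x_a + x_b)
```

Because the two terms are only summed, no output component depends on a product of features from both modalities.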
Multiplicative Fusion
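A minimal sketch of multiplicative fusion, assuming the simplest variant: an element-wise product of two feature vectors that have already been projected to the same dimension. The function name and values are illustrative.

```python
def multiplicative_fusion(x_a, x_b):
    """Element-wise (Hadamard) product of two same-length feature vectors."""
    if len(x_a) != len(x_b):
        raise ValueError("features must have the same dimension")
    return [a * b for a, b in zip(x_a, x_b)]

x_a = [1.0, 2.0, 0.0]
x_b = [3.0, 0.5, 7.0]
z = multiplicative_fusion(x_a, x_b)
```

Each output component depends on both modalities at once: the zero in the last position of x_a suppresses the corresponding component of x_b, which an additive term cannot do.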
Tensor Fusion
The weight matrix may end up quite large!
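A rough sketch of why the weights grow so quickly. It assumes the common formulation in which a constant 1 is appended to each feature before taking the outer product, so the fused tensor holds both unimodal and bimodal terms; the helper names and dimensions are hypothetical.

```python
def tensor_fusion(x_a, x_b):
    """Outer product of [x_a; 1] and [x_b; 1] as a (d_a+1) x (d_b+1) grid."""
    a = x_a + [1.0]
    b = x_b + [1.0]
    return [[ai * bj for bj in b] for ai in a]

x_a = [2.0, 3.0]
x_b = [5.0, 7.0, 11.0]
Z = tensor_fusion(x_a, x_b)
# Z[i][j] for i < 2, j < 3 holds the bimodal products x_a[i] * x_b[j];
# the last column holds x_a, the last row holds x_b, and the corner is 1.

def weight_count(d_a, d_b, d_out):
    """Weights in a dense layer applied to the flattened fused tensor."""
    return (d_a + 1) * (d_b + 1) * d_out

n_weights = weight_count(128, 128, 64)
```

With two 128-dimensional features and a 64-dimensional output, the dense layer already needs over a million weights, and the count multiplies again with every extra modality.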
Low-rank Fusion
Conventional tensor fusion
The weight matrix can be extremely large.
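Weight decomposition
The weight matrix is decomposed into sub-weight matrices: one set for the visual modality and one set for the language modality.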
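Written out per output component, the decomposition can be sketched as follows. The notation is assumed rather than taken from the slides: z_v and z_l are the visual and language features (dimensions d_v and d_l), h is the fused output, r is the rank, and W_v^{(i)}, W_l^{(i)} are the sub-weight matrices of rank component i.

```latex
h_k = \sum_{a=1}^{d_v} \sum_{b=1}^{d_l} \mathcal{W}_{k a b} \, z_{v,a} \, z_{l,b},
\qquad
\mathcal{W}_{k a b} = \sum_{i=1}^{r} \big[ W_v^{(i)} \big]_{k a} \, \big[ W_l^{(i)} \big]_{k b}
```

Under these assumptions the full tensor has d_h · d_v · d_l entries for an output of dimension d_h, while the factors together have r · d_h · (d_v + d_l).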
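Input feature decomposition
The input feature can also be decomposed into multiple sub-features.
Each visual sub-feature is the visual feature passed through a projection matrix; the language sub-features are obtained analogously.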
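A small numeric check of the idea, as a sketch under assumed notation: with sub-weight matrices W_v[i] and W_l[i] for each of r rank components, the fused output can be computed as h = sum_i (W_v[i] z_v) * (W_l[i] z_l) (element-wise product), without ever building the full weight tensor. All names and values below are illustrative.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def low_rank_fusion(W_v, W_l, z_v, z_l):
    """h = sum_i (W_v[i] z_v) * (W_l[i] z_l), element-wise, no full tensor."""
    d_h = len(W_v[0])
    h = [0.0] * d_h
    for Wv_i, Wl_i in zip(W_v, W_l):
        sub_v = matvec(Wv_i, z_v)  # visual sub-feature for this rank component
        sub_l = matvec(Wl_i, z_l)  # language sub-feature for this rank component
        for k in range(d_h):
            h[k] += sub_v[k] * sub_l[k]
    return h

def full_tensor_fusion(W_v, W_l, z_v, z_l):
    """Reference: build W[k][a][b] explicitly and contract it with z_v (x) z_l."""
    d_h, d_v, d_l = len(W_v[0]), len(z_v), len(z_l)
    h = []
    for k in range(d_h):
        total = 0.0
        for a in range(d_v):
            for b in range(d_l):
                w_kab = sum(Wv_i[k][a] * Wl_i[k][b]
                            for Wv_i, Wl_i in zip(W_v, W_l))
                total += w_kab * z_v[a] * z_l[b]
        h.append(total)
    return h

# Rank r = 2, output dimension 2, d_v = 2, d_l = 3 (toy values).
W_v = [[[1.0, 0.0], [2.0, 1.0]], [[0.0, 1.0], [1.0, 1.0]]]
W_l = [[[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]], [[2.0, 0.0, 1.0], [1.0, 0.0, 0.0]]]
z_v = [1.0, 2.0]
z_l = [1.0, 0.0, 3.0]

h_low = low_rank_fusion(W_v, W_l, z_v, z_l)
h_full = full_tensor_fusion(W_v, W_l, z_v, z_l)
```

Both routes give the same h here; the low-rank route only ever touches the small projection matrices, which is where the savings come from.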
Contrastive Language-Image Pretraining (CLIP)
Here, similarity between features is measured with a similarity function, usually cosine similarity.
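One common way to write a CLIP-style contrastive objective is sketched below. The symbols are assumed notation, not taken from the lecture: v_i and t_i are the image and text features of pair i, N is the batch size, and τ is a temperature.

```latex
\mathcal{L}_{v \to t} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\!\big(\mathrm{sim}(v_i, t_i)/\tau\big)}
            {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(v_i, t_j)/\tau\big)},
\qquad
\mathrm{sim}(v, t) = \frac{v^\top t}{\lVert v \rVert \, \lVert t \rVert},
\qquad
\mathcal{L} = \tfrac{1}{2} \big( \mathcal{L}_{v \to t} + \mathcal{L}_{t \to v} \big)
```

The text-to-image term is the same expression with the roles of v and t swapped.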
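During training, CLIP builds a matrix of positive and negative samples from each batch. The similarity matrix between the features is used to compute the loss: diagonal entries are positive pairs and off-diagonal entries are negative pairs.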
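The batch construction described above can be sketched in plain Python. This is an illustrative sketch, not CLIP's actual implementation: the function names are hypothetical, the features are toy 2-D vectors, and the temperature of 0.07 is an assumed default.

```python
import math

def l2_normalize(v):
    """Scale v to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_feats, text_feats):
    """S[i][j] = cosine similarity between image i and text j."""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(t) for t in text_feats]
    return [[sum(a * b for a, b in zip(v, t)) for t in txts] for v in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy over rows, with the diagonal entry as the target."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss: diagonal = positives, off-diagonal = negatives."""
    S = similarity_matrix(image_feats, text_feats)
    logits = [[s / temperature for s in row] for row in S]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# A batch of 3 toy image/text pairs; pair i sits at position i in both lists.
image_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_texts = [[2.0, 0.1], [0.1, 3.0], [-1.5, 0.0]]
shuffled_texts = [aligned_texts[1], aligned_texts[2], aligned_texts[0]]

loss_aligned = clip_style_loss(image_feats, aligned_texts)
loss_shuffled = clip_style_loss(image_feats, shuffled_texts)
```

When the matching texts sit on the diagonal the loss is small; shuffling the texts moves the high similarities off the diagonal and the loss rises, which is the signal that pulls matched pairs together during training.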