Multimodal Paper Reading: BLIP

Posted by 在路上 171871 on 2023-11-07 10:14:00

Title

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

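Model perspective: existing pre-trained models adopt either an encoder-based model (e.g. CLIP, ALBEF) or an encoder-decoder model. However, encoder-based models are less straightforward to directly transfer to text generation tasks (e.g. image captioning), whereas encoder-decoder models have not been successfully adopted for image-text retrieval tasks. Is there a unified framework that can handle both kinds of task?
Data perspective: state-of-the-art methods (e.g. CLIP, ALBEF) are all pre-trained on image-text pairs collected from the web. Although scaling up the dataset has brought performance gains, this paper shows that noisy web text is suboptimal for vision-language learning.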
