Using the CLIP Multimodal Model Library in Python

2025-04-29 09:24 Development Author: 彬彬侠

1. Installing the official OpenAI CLIP

pip install git+https://github.com/openai/CLIP.git

Dependencies: torch, torchvision, numpy, and PIL (Pillow)

2. Quick-start example

import clip
import torch
from PIL import Image

# Load the model and its preprocessing function
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load the image and preprocess it
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)

# Write the candidate text descriptions
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

# Extract features and compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probabilities:", probs)
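The probabilities printed above come from a softmax over `logits_per_image`. In the official implementation these logits are the cosine similarities between the normalized image and text embeddings, multiplied by a learned scale (about 100 for the released checkpoints). The following is a minimal pure-Python sketch of that arithmetic only; the similarity values and the scale are assumptions chosen for illustration, not real model outputs.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one image and two captions
# ("a photo of a cat", "a photo of a dog"); illustrative values only.
cosine_sims = [0.27, 0.21]

# Assumed scale factor; CLIP learns this value during training.
logit_scale = 100.0

logits = [logit_scale * s for s in cosine_sims]
probs = softmax(logits)
print("Label probabilities:", probs)
```

Because of the large scale factor, a small gap in cosine similarity turns into a sharp difference in probability.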

3. Model options

Supported models include the following (clip.available_models() returns the full list):

"ViT-B/32": fastest and most commonly used
"ViT-B/16": smaller patches, so slower but more accurate
"RN50", "RN101": ResNet-based

4. Text encoding

text = ["a photo of a banana", "a dog", "a car"]
tokens = clip.tokenize(text).to(device)

with torch.no_grad():
    text_features = model.encode_text(tokens)
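`clip.tokenize` returns one fixed-length row of token ids per input string (77 positions by default in the official package). Each text is wrapped in start and end markers and the remainder is zero-padded; over-long text raises an error unless `truncate=True` is passed. The sketch below mimics that layout in plain Python. The `pad_tokens` helper and the token ids fed into it are hypothetical stand-ins, not the real BPE tokenizer.

```python
SOT, EOT = 49406, 49407  # start/end-of-text ids, assumed to match CLIP's vocabulary

def pad_tokens(token_ids, context_length=77, truncate=False):
    """Lay out token ids as [SOT, ..., EOT, 0, 0, ...] with a fixed length."""
    row = [SOT] + list(token_ids) + [EOT]
    if len(row) > context_length:
        if not truncate:
            raise ValueError(f"input is too long for context length {context_length}")
        row = row[:context_length]
        row[-1] = EOT  # keep the end marker after truncation
    return row + [0] * (context_length - len(row))

# Toy ids standing in for the BPE tokens of a short caption.
short_row = pad_tokens([320, 1125, 539, 320, 2368])
# A deliberately over-long input, truncated to fit.
long_row = pad_tokens(range(1000, 1100), truncate=True)
print(len(short_row), len(long_row))
```

Every row has the same length regardless of the input, which is what lets a list of captions be stacked into a single batch tensor.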

5. Image encoding

from PIL import Image

image = Image.open("example.jpg")
image_input = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
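The `preprocess` transform returned by `clip.load` resizes and center-crops the image to the model's input resolution, converts it to RGB, and normalizes each channel. As a rough illustration of the last step only, the sketch below normalizes a single RGB pixel in plain Python. The mean and standard deviation are assumed to match the constants in the official repository, so check the source before relying on them; `normalize_pixel` is a hypothetical helper.

```python
# Per-channel statistics assumed to match the official CLIP preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [channel / 255.0 for channel in rgb]
    return [(v - m) / s for v, m, s in zip(scaled, CLIP_MEAN, CLIP_STD)]

black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print(black, white)
```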

6. Similarity comparison

import torch.nn.functional as F

# Cosine similarity
similarity = F.cosine_similarity(image_features, text_features)
print(similarity)
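For reference, `F.cosine_similarity` divides the dot product of two vectors by the product of their lengths, so the score depends only on direction and not on magnitude. A minimal sketch in plain Python, with short made-up vectors in place of real CLIP embeddings (which have 512 dimensions for ViT-B/32):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional stand-ins for one image and three text embeddings.
image_vec = [1.0, 2.0, 2.0]
text_vecs = [[2.0, 4.0, 4.0], [2.0, -1.0, 0.0], [-1.0, -2.0, -2.0]]

# One image against several texts yields one score per text.
scores = [cosine_similarity(image_vec, t) for t in text_vecs]
print(scores)
```

The three toy texts were chosen to point in the same direction as the image vector, at a right angle to it, and in the opposite direction.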

7. Zero-shot image classification

labels = ["a dog", "a cat", "a car"]
text_inputs = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_inputs)
    image_features = model.encode_image(image_input)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity scores
logits = (image_features @ text_features.T)
pred = logits.argmax().item()

print(f"Predicted label: {labels[pred]}")
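`argmax` keeps only the best label. When the runner-up matters too, the similarity scores can be sorted to rank every label. A small sketch in plain Python, assuming the scores have already been copied out of the `logits` tensor (for example with `logits[0].tolist()`); the values and the `rank_labels` helper are invented for illustration:

```python
def rank_labels(labels, scores, top_k=3):
    """Return (label, score) pairs sorted from most to least similar."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

labels = ["a dog", "a cat", "a car"]
scores = [0.21, 0.27, 0.12]  # hypothetical cosine similarities

for label, score in rank_labels(labels, scores, top_k=2):
    print(f"{label}: {score:.2f}")
```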

8. Comparison with other models

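Feature	CLIP	BLIP / Flamingo	BERT / GPT
Image-text alignment	Yes	Yes	No
Multimodal capability	Strong (image + text)	Stronger (supports generation)	Weak
Zero-shot capability	Strong	Strong	None (for image tasks)
Suitable tasks	Image-text retrieval, matching, classification	Caption generation, question answering, VQA	Language tasks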

9. A more powerful option: open_clip

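open_clip is a community-maintained implementation of CLIP that supports many more pretrained models (for example, those trained on LAION datasets):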

pip install open_clip_torch

import open_clip

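# create_model_and_transforms returns (model, training preprocessing, evaluation preprocessing)
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
# The tokenizer is obtained separately
tokenizer = open_clip.get_tokenizer('ViT-B-32')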

10. Summary

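Function	Method
Load a model	`clip.load()`
Text encoding	`model.encode_text()`
Image encoding	`model.encode_image()`
Image-text similarity	`model(image, text)` or cosine similarity
Zero-shot image classification	Embed the text descriptions, then pick the highest similarity
Supported models	`"ViT-B/32"`, `"ViT-B/16"`, etc.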

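CLIP is a landmark among modern multimodal models and is widely used for image retrieval, image-text classification, visual question answering, and cross-modal search. It performs well even in zero-shot settings, which makes it a strong building block for general-purpose image-text understanding systems.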
