# Day 11: Multimodal Models
Multimodal models process not only text but also images, audio, video, and other forms of input simultaneously. Since 2024, most cutting-edge LLMs support multimodal capabilities.
## Multimodal Model Comparison (by Model Family)
| Model Family | Input | Output | Features |
|---|---|---|---|
| OpenAI Multimodal | Text + Image (+Audio) | Text (+Audio) | General-purpose API, rich ecosystem |
| Claude Multimodal | Text + Image | Text | Strong in document/chart interpretation |
| Gemini Multimodal | Text + Image (+Video) | Text | Long context/video capabilities |
| LLaVA/Qwen-VL | Text + Image | Text | Open-source, easy local execution |
Version/model IDs change frequently, so check each provider’s model listing documentation before use.
## OpenAI Vision API (Chat Completions-Compatible Example)
```python
from openai import OpenAI
import base64

client = OpenAI()

# Method 1: pass the image via URL
response = client.chat.completions.create(
    model="gpt-4o",  # can be replaced with the latest multimodal model available in your project
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this image? Please describe it."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

# Method 2: pass a local image as base64
def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_base64 = encode_image("screenshot.png")

response = client.chat.completions.create(
    model="gpt-4o",  # can be replaced with the latest multimodal model available in your project
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze the error in this screenshot."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_base64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```
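The base64 example above hard-codes `image/png` in the data URL. A small helper (illustrative, not part of the SDK) can guess the MIME type from the file extension so JPEG and PNG files both work:

```python
import base64
import mimetypes

def image_to_data_url(image_path):
    """Encode a local image file as a data URL with a MIME type guessed from its extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None:
        mime = "image/png"  # fallback; adjust for your files
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```

The returned string can be dropped straight into the `image_url` field of the request above.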
## Claude Vision API
```python
import anthropic
import base64

client = anthropic.Anthropic()

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_data = encode_image("chart.png")

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Analyze the data in this chart and explain the key trends.",
                },
            ],
        }
    ],
)
print(message.content[0].text)
```
## LLaVA Local Execution (Open-Source)
```python
# Run LLaVA with Ollama (local, free)
# ollama pull llava:13b
import ollama

response = ollama.chat(
    model="llava:13b",
    messages=[
        {
            "role": "user",
            "content": "Describe this image.",
            "images": ["./photo.jpg"],  # local image path
        }
    ],
)
print(response["message"]["content"])

# Comparing multiple images is also possible
response = ollama.chat(
    model="llava:13b",
    messages=[
        {
            "role": "user",
            "content": "Find the differences between these two images.",
            "images": ["./before.jpg", "./after.jpg"],
        }
    ],
)
print(response["message"]["content"])
```
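Whichever backend you use, large images inflate token costs, so a quick local size check before uploading can help. This sketch (names are illustrative) reads the dimensions straight from a PNG's IHDR header using only the standard library, with no image library installed; it works for PNG files only:

```python
import struct

def png_dimensions(path):
    """Read (width, height) from a PNG file's IHDR chunk without an image library."""
    with open(path, "rb") as f:
        header = f.read(24)  # 8-byte signature + chunk length/type + width/height
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

For JPEG and other formats, a library such as Pillow (`Image.open(path).size`) is the more general option.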
## Multimodal Use Cases
| Application | Description | Suitable Models |
|---|---|---|
| Document OCR + Analysis | Read scanned documents and summarize content | OpenAI/Claude/Gemini Multimodal |
| Code Screenshot Debugging | Analyze error screens to identify root causes | OpenAI/Claude Multimodal |
| Chart/Graph Interpretation | Describe visualized data as text | Claude/Gemini Multimodal |
| Product Image Classification | Classify product photos by category | LLaVA (local, cost-saving) |
| UI/UX Review | Analyze app screenshots and suggest improvements | OpenAI/Claude/Gemini Multimodal |
| Medical Imaging Assistance | Initial analysis of X-ray, CT images | Specialized models required |
The key insight about multimodal models is that they don't truly "understand" images: they convert images into token sequences and process them alongside text. An image encoder (usually ViT-based) slices the image into patches and turns each patch into a vector, and these vectors are projected into the language model's embedding space next to the text tokens.
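The patch-slicing step above is simple arithmetic. The numbers below assume a 14×14 patch size (as in the ViT-L/14 encoders used by CLIP and LLaVA) and treat one patch as roughly one visual token; real encoders vary and often add pooling or resampling on top:

```python
def vit_patch_count(width, height, patch_size=14):
    """Number of patches (≈ visual tokens before any pooling) a ViT produces for an image."""
    return (width // patch_size) * (height // patch_size)

# A 336x336 input with 14x14 patches yields a 24x24 grid:
print(vit_patch_count(336, 336))  # 576 patches
```

This is why even a modest image can consume as many tokens as several paragraphs of text.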
## Today’s Exercises
- Use a commercial multimodal API (choose from OpenAI/Claude/Gemini) to analyze your own screenshot. Among text recognition, layout understanding, and semantic comprehension, which does it perform best at?
- Run LLaVA locally via Ollama and compare the response quality with a commercial multimodal API for the same image.
- Research the relationship between image resolution and token count in multimodal models. How much does the cost increase when sending high-resolution images?
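As a starting point for the last exercise: OpenAI has published a tile-based formula for high-detail image inputs on gpt-4o-class models. The image is fit within a 2048×2048 square, scaled so its shortest side is 768 px, cut into 512 px tiles, and billed at 170 tokens per tile plus a fixed 85. The constants differ per model and may change, so treat this as an estimate rather than a guarantee:

```python
import math

def openai_image_tokens(width, height):
    """Estimate high-detail image token cost for gpt-4o-class models (published tile formula)."""
    # Step 1: fit within a 2048x2048 square, preserving aspect ratio (never upscale here).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: scale so the shortest side is 768 px.
    scale = 768 / min(width, height)
    width, height = width * scale, height * scale
    # Step 3: count 512 px tiles; each tile costs 170 tokens, plus a fixed 85.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(openai_image_tokens(1024, 1024))  # 765 (matches OpenAI's documented example)
```

Doubling resolution can therefore multiply the image's token cost severalfold, which is exactly the trade-off exercise 3 asks you to measure.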