# Lesson 07 — Snap-and-Identify Image Assistant
Take a photo with your phone, send it to your Telegram Bot, and AI tells you what's in it. Foreign menus, confusing formulas, code screenshots, street signs — just snap and send.
## What This Does
```
You take a photo with your phone
        ↓
Send it to your Telegram Bot
        ↓
OpenClaw receives the image
        ↓
MiniMax VL-01 (vision model) analyzes the image
        ↓
Tells you in plain language what's in it
```
No need to open a browser, no manual uploading — just send a photo.
## Real-World Scenarios
| Scenario | You Send | AI Returns |
|---|---|---|
| Traveling abroad | Photo of a Japanese menu | Each dish's name and approximate price |
| Reading a paper | Screenshot of a math formula | The formula's meaning and derivation |
| Writing code | Screenshot of an error | Error cause and fix suggestions |
| Shopping | Ingredient list photo | Key ingredient analysis, flags allergens |
| Reviewing documents | A page from a contract | Plain-language explanation of the clause |
| Identifying plants | Photo of a roadside flower | Plant name, characteristics, toxicity |
## Prerequisites
- Complete Lesson 01 (gateway running)
- Complete Lesson 02 (Telegram connected)
- MiniMax configured in `openclaw.json` (VL-01 supports image input)
## Step 1: Confirm Vision Model Is Configured
Add VL-01 to `minimax.models` in `~/.openclaw/openclaw.json`:
```json
{
  "id": "MiniMax-VL-01",
  "name": "MiniMax VL-01",
  "reasoning": false,
  "input": ["text", "image"],
  "cost": { "input": 15, "output": 60, "cacheRead": 2, "cacheWrite": 10 },
  "contextWindow": 200000,
  "maxTokens": 8192
}
```

Verify:
```bash
pnpm openclaw models list --all | grep VL
# You should see: minimax/MiniMax-VL-01  text+image  yes
```
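If you prefer checking the config file programmatically, here is a minimal sketch. It assumes only the `minimax.models` layout shown above; the helper name is illustrative, not part of OpenClaw:

```python
def find_vision_models(config: dict) -> list[str]:
    """Return ids of models under minimax.models that accept image input."""
    models = config.get("minimax", {}).get("models", [])
    return [m["id"] for m in models if "image" in m.get("input", [])]

# The Step 1 entry, trimmed to the fields we check,
# plus a hypothetical text-only sibling:
sample = {"minimax": {"models": [
    {"id": "MiniMax-VL-01", "input": ["text", "image"]},
    {"id": "MiniMax-Text-01", "input": ["text"]},
]}}
print(find_vision_models(sample))  # ['MiniMax-VL-01']
```

An empty or missing `minimax` section simply yields an empty list, so the check is safe to run before the model is configured.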
## Step 2: Create the Image Identification Skill

Create `~/.openclaw/workspace/skills/identify/SKILL.md`:

```bash
mkdir -p ~/.openclaw/workspace/skills/identify
```

```markdown
# Image Identification Assistant

The user has sent an image. You need to:

1. **Carefully observe** all details in the image
2. **Determine the scene type**: is it text/formula/code/a physical object/a screenshot/something else?
3. Based on the scene type, provide the most helpful response:

### If it's text/menu/sign/document

- Transcribe all visible text in full
- Translate to English if it's in a foreign language
- Explain the content in natural language

### If it's code/screenshot/error message

- Identify the programming language and framework
- Explain what the code does / what the error is
- Provide improvement suggestions or a fix

### If it's a formula/chart

- Explain the formula's meaning in plain words
- Describe what each variable represents
- If it's a chart, analyze the data trends

### If it's a physical object/plant/food

- Identify what it is
- Provide relevant background information (origin, uses, safety notes, etc.)

## Tone

Direct and concise — lead with the most important conclusion, then expand on details.
Respond entirely in English.
```

## Step 3: That's It — Just Use It
Open Telegram, send your Bot an image, along with or followed by a message describing what you want to know:
### Example 1: Japanese Menu

```
[Send a photo of a Japanese menu]
What should I order here? Anything good for someone who doesn't eat spicy food?
```
### Example 2: Code Error

```
[Send a screenshot of a terminal error]
What does this error mean and how do I fix it?
```
### Example 3: Send Just the Image
Send a photo without any text, and AI will automatically judge what you most likely want to know based on the image content.
## Advanced: Travel Image Pack

Create a dedicated travel skill at `~/.openclaw/workspace/skills/travel-assistant/SKILL.md`:

```markdown
# Travel Image Assistant

You are an experienced travel assistant. The user will send you various photos taken while traveling.

## Menus

- Identify each dish with ingredients and flavor profile
- Note the price (approximate in USD after conversion)
- Recommend 2–3 dishes that would suit a general Western palate

## Transportation (subway maps, signs, timetables)

- Explain the current location or direction
- Give the simplest possible action to take

## Attractions

- Identify where this is
- Provide a brief historical background (2–3 sentences)
- Share a practical visiting tip

## Shopping (price tags, ingredient labels)

- Convert prices to USD
- Flag common allergens in the ingredients
- Say whether it's worth buying (compare to domestic prices)

Stay direct and practical in English.
```

## How It Works
When OpenClaw receives a Telegram image message:
1. The image is converted to base64 format
2. It's sent to MiniMax VL-01 along with your text question
3. VL-01 understands both the image and the text simultaneously
4. The reply is sent back to you via Telegram
The entire process typically completes in 3–8 seconds.
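The first two steps can be sketched as follows. This is illustrative only: the payload uses the widely adopted OpenAI-style multimodal message shape, which may differ from OpenClaw's actual internal wire format:

```python
import base64

def build_vision_message(image_bytes: bytes, question: str,
                         mime: str = "image/jpeg") -> dict:
    """Pack an image and a text question into one multimodal chat message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # The image travels inline as a base64 data URI:
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# JPEG files start with the bytes FF D8 FF, which encode to "/9j/":
msg = build_vision_message(b"\xff\xd8\xff", "What does this error mean?")
print(msg["content"][1]["image_url"]["url"])  # data:image/jpeg;base64,/9j/
```

Because both the text part and the image part sit in the same `content` array, the model sees them as a single turn, which is what lets it answer a question *about* the photo.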
## Tips
**Send multiple images at once:** Telegram supports photo albums, and AI will analyze all images together.
**Ask follow-up questions:** AI has context memory, so after sending a photo you can keep asking:

```
[Send menu photo]
How is this dish prepared?
→ Does the second item contain peanuts? I'm allergic.
→ What would you recommend instead?
```
**Specify the output language:** If you want a response in a specific language, just say so:

```
[Send image] Please respond in French.
```
## FAQ
### What image formats are supported?
JPEG, PNG, WebP, GIF (first frame only), and other mainstream formats. Photos sent via Telegram are automatically compressed to JPEG; screenshots are usually PNG — both are fully supported.
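For the curious, these formats are typically detected from a file's first few "magic" bytes rather than its extension. A toy sniffer (not OpenClaw code) covering the formats above:

```python
def sniff_format(data: bytes) -> str:
    """Guess an image format from its leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):          # JPEG: FF D8 FF
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):     # PNG 8-byte signature
        return "png"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":  # WebP in a RIFF container
        return "webp"
    if data[:6] in (b"GIF87a", b"GIF89a"):        # GIF version headers
        return "gif"
    return "unknown"

print(sniff_format(b"\x89PNG\r\n\x1a\n"))  # png
```

This is why a screenshot renamed from `.png` to `.jpg` still gets handled correctly: the bytes, not the name, decide.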
### How accurate is the image recognition?
MiniMax VL-01 performs well on text recognition (OCR), scene understanding, and code identification, with high accuracy for clear photos. Blurry, low-light, or very small text will reduce accuracy. For best results, hold the camera steady in good lighting.
### Can it recognize code errors in a screenshot?
Yes — this is one of the most practical use cases. Send a terminal screenshot and AI can identify the error message, pinpoint the problematic code, and suggest a fix. It supports error formats from most mainstream programming languages.
### Is there a size limit for images?
Telegram's photo mode supports up to about 10MB. OpenClaw automatically handles resizing after receiving, so you usually don't need to compress manually. Very high-resolution images (over 4000px) may benefit from compression first to save tokens.
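If you do want to pre-shrink manually, the target size is simple arithmetic: cap the longer side at 4000 px and preserve the aspect ratio. The helper below is just that math; the actual resampling would be done by any image tool:

```python
def fit_within(width: int, height: int, max_side: int = 4000) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side,
    keeping the aspect ratio; small images are returned unchanged."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

print(fit_within(6000, 4000))  # (4000, 2667)
print(fit_within(1920, 1080))  # (1920, 1080) — already small enough
```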
### What happens if I send an image without activating the identification skill?
It still works. Without a dedicated skill activated, AI will analyze the image using general capabilities, but the output may be less structured than with a dedicated skill. We recommend creating a `/identify` skill as shown in this lesson for more consistent output.
## Why This Is Interesting
This is a classic example of OpenClaw connecting three components: a messaging channel (Telegram), a vision AI (MiniMax VL-01), and a skill system (SKILL.md).
In the past, to analyze an image you'd open the ChatGPT website → upload the image → wait. Now you just send a photo in Telegram and AI handles it automatically — as natural as messaging a real assistant.
This idea of "embedding AI into the tools you already use every day" is the core design philosophy of OpenClaw.