Lesson 07 — Snap-and-Identify Image Assistant

Take a photo with your phone, send it to your Telegram Bot, and AI tells you what's in it. Foreign menus, confusing formulas, code screenshots, street signs — just snap and send.


What This Does

You take a photo with your phone
   ↓
Send it to your Telegram Bot
   ↓
OpenClaw receives the image
   ↓
MiniMax VL-01 (vision model) analyzes the image
   ↓
Tells you in plain language what's in it

No need to open a browser, no manual uploading — just send a photo.


Real-World Scenarios

| Scenario | You Send | AI Returns |
|---|---|---|
| Traveling abroad | Photo of a Japanese menu | Each dish's name and approximate price |
| Reading a paper | Screenshot of a math formula | The formula's meaning and derivation |
| Writing code | Screenshot of an error | Error cause and fix suggestions |
| Shopping | Ingredient list photo | Key ingredient analysis, flags allergens |
| Reviewing documents | A page from a contract | Plain-language explanation of the clause |
| Identifying plants | Photo of a roadside flower | Plant name, characteristics, toxicity |

Prerequisites

  • Complete Lesson 01 (gateway running)
  • Complete Lesson 02 (Telegram connected)
  • MiniMax configured in openclaw.json (VL-01 supports image input)

Step 1: Confirm Vision Model Is Configured

Add VL-01 to minimax.models in ~/.openclaw/openclaw.json:

{
  "id": "MiniMax-VL-01",
  "name": "MiniMax VL-01",
  "reasoning": false,
  "input": ["text", "image"],
  "cost": { "input": 15, "output": 60, "cacheRead": 2, "cacheWrite": 10 },
  "contextWindow": 200000,
  "maxTokens": 8192
}

Verify:

pnpm openclaw models list --all | grep VL
# You should see minimax/MiniMax-VL-01  text+image  yes
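If you'd rather inspect the config file directly than rely on the CLI, you can parse the JSON and confirm the entry declares `image` input. This is a minimal sketch that assumes the `minimax.models` layout referenced above; adjust the path and nesting if your file differs.

```python
import json
import os

def supports_images(entry: dict) -> bool:
    """True if a model entry declares "image" among its inputs."""
    return "image" in entry.get("input", [])

# Assumption: models live under minimax.models, as referenced in this lesson.
path = os.path.expanduser("~/.openclaw/openclaw.json")
if os.path.exists(path):
    with open(path) as f:
        cfg = json.load(f)
    for model in cfg.get("minimax", {}).get("models", []):
        print(model["id"], "image-capable:", supports_images(model))
```

A model without `"image"` in its `input` list will receive only your text, so this is the first thing to check if image messages seem to be ignored.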

Step 2: Create the Image Identification Skill

Create the skill directory:

mkdir -p ~/.openclaw/workspace/skills/identify

Then create ~/.openclaw/workspace/skills/identify/SKILL.md with the following content:

# Image Identification Assistant
 
The user has sent an image. You need to:
 
1. **Carefully observe** all details in the image
2. **Determine the scene type**: is it text/formula/code/a physical object/a screenshot/something else?
3. Based on the scene type, provide the most helpful response:
 
### If it's text/menu/sign/document
- Transcribe all visible text in full
- Translate to English if it's in a foreign language
- Explain the content in natural language
 
### If it's code/screenshot/error message
- Identify the programming language and framework
- Explain what the code does / what the error is
- Provide improvement suggestions or a fix
 
### If it's a formula/chart
- Explain the formula's meaning in plain words
- Describe what each variable represents
- If it's a chart, analyze the data trends
 
### If it's a physical object/plant/food
- Identify what it is
- Provide relevant background information (origin, uses, safety notes, etc.)
 
## Tone
Direct and concise — lead with the most important conclusion, then expand on details.
Respond entirely in English.

Step 3: That's It — Just Use It

Open Telegram and send your Bot an image, with a caption (or a follow-up message) describing what you want to know:

Example 1: Japanese Menu

[Send a photo of a Japanese menu]
What should I order here? Anything good for someone who doesn't eat spicy food?

Example 2: Code Error

[Send a screenshot of a terminal error]
What does this error mean and how do I fix it?

Example 3: Send Just the Image

Send a photo without any text, and AI will infer what you most likely want to know from the image content.


Advanced: Travel Image Pack

Create a dedicated travel skill at ~/.openclaw/workspace/skills/travel-assistant/SKILL.md:

# Travel Image Assistant
 
You are an experienced travel assistant. The user will send you various photos taken while traveling.
 
## Menus
- Identify each dish with ingredients and flavor profile
- Note the price (approximate in USD after conversion)
- Recommend 2–3 dishes that would suit a general Western palate
 
## Transportation (subway maps, signs, timetables)
- Explain the current location or direction
- Give the simplest possible action to take
 
## Attractions
- Identify where this is
- Provide a brief historical background (2–3 sentences)
- Share a practical visiting tip
 
## Shopping (price tags, ingredient labels)
- Convert prices to USD
- Flag common allergens in the ingredients
- Say whether it's worth buying (compare to prices back home)
 
Stay direct and practical in English.

How It Works

When OpenClaw receives a Telegram image message:

  1. The image is converted to base64 format
  2. It's sent to MiniMax VL-01 along with your text question
  3. VL-01 understands both the image and the text simultaneously
  4. The reply is sent back to you via Telegram

The entire process typically completes in 3–8 seconds.
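The message assembly in steps 1–2 can be sketched as follows. The wire format here is an assumption (an OpenAI-style multimodal content array); MiniMax's actual field names may differ, so treat this as an illustration of the shape, not the exact API.

```python
import base64

def build_vision_message(image_bytes: bytes, question: str) -> dict:
    """Pair a base64-encoded image with a text question in one chat message.

    Assumption: an OpenAI-style content array with image_url/text parts.
    The real MiniMax request body may use different field names.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": question},
        ],
    }
```

Because the image and the question travel in the same message, the model sees them together rather than as two separate turns — which is what makes follow-up questions about the same photo work.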


Tips

Send multiple images at once: Telegram supports photo albums, and AI will analyze all images together

Ask follow-up questions: AI has context memory, so after sending a photo you can keep asking:

[Send menu photo]
How is this dish prepared?
→ Does the second item contain peanuts? I'm allergic.
→ What would you recommend instead?

Specify the output language: If you want a response in a specific language, just say so:

[Send image] Please respond in French.

FAQ

What image formats are supported?

JPEG, PNG, WebP, GIF (first frame only), and other mainstream formats. Photos sent via Telegram are automatically compressed to JPEG; screenshots are usually PNG — both are fully supported.
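If you're curious how tools tell these formats apart, they typically sniff the file's first few "magic" bytes rather than trusting the filename. A small illustrative sketch (not OpenClaw's actual code) covering the formats listed above:

```python
def sniff_image_format(data: bytes) -> str:
    """Guess an image format from its magic bytes (illustrative only)."""
    if data.startswith(b"\xff\xd8\xff"):          # JPEG: FF D8 FF
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):     # PNG 8-byte signature
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):   # GIF headers
        return "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":  # WebP: RIFF container
        return "webp"
    return "unknown"
```

This is also why re-compression by Telegram is harmless: whatever container the photo ends up in, the receiving side can identify it from the bytes alone.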

How accurate is the image recognition?

MiniMax VL-01 performs well on text recognition (OCR), scene understanding, and code identification, with high accuracy for clear photos. Blurry, low-light, or very small text will reduce accuracy. For best results, hold the camera steady in good lighting.

Can it recognize code errors in a screenshot?

Yes — this is one of the most practical use cases. Send a terminal screenshot and AI can identify the error message, pinpoint the problematic code, and suggest a fix. It supports error formats from most mainstream programming languages.

Is there a size limit for images?

Telegram's photo mode supports images up to about 10MB. OpenClaw resizes the image automatically after receiving it, so you usually don't need to compress manually. Very high-resolution images (over 4000px) may still benefit from downscaling first to save tokens.
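If you do downscale before sending, the only math involved is capping the longer side while preserving aspect ratio. A small sketch using the 4000px figure above (the helper name is ours; pair it with any image tool, e.g. Pillow's `Image.resize`):

```python
def capped_size(width: int, height: int, max_side: int = 4000) -> tuple:
    """Shrink (width, height) so the longer side is at most max_side,
    preserving aspect ratio. Returns the size unchanged if already small."""
    longest = max(width, height)
    if longest <= max_side:
        return (width, height)
    scale = max_side / longest
    return (round(width * scale), round(height * scale))

# e.g. an 8000x6000 photo would come out at 4000x3000
```

Scaling down, rather than cropping, keeps the whole scene visible to the model while cutting the token cost of the image.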

What happens if I send an image without activating the identification skill?

It still works. Without a dedicated skill, AI will analyze the image using its general capabilities, but the output may be less structured. We recommend creating the identify skill shown in this lesson for more consistent results.


Why This Is Interesting

This is a classic example of OpenClaw connecting three components: a messaging channel (Telegram), a vision AI (MiniMax VL-01), and a skill system (SKILL.md).

In the past, to analyze an image you'd open the ChatGPT website → upload the image → wait. Now you just send a photo in Telegram and AI handles it automatically — as natural as messaging a real assistant.

This idea of "embedding AI into the tools you already use every day" is the core design philosophy of OpenClaw.

Stay up to date with OpenClaw

Follow @lanmiaoai on X for tips, updates and new tutorials.