ESP32-S3 CAM · XiaoZhi AI

ESP32-S3 CAM AI Voice Assistant
With OV3660 Camera & TFT Display

Build a voice-controlled AI assistant with camera vision using the ESP32-S3 CAM module, OV3660 3MP camera, and 1.8-inch ST7735 TFT display. Features dual firmware variants (Normal UI and WeChat-style UI), MCP device tools, and browser-based flashing — no coding required.

~1 hour · OV3660 3MP · ST7735 TFT · Free Firmware · Chrome / Edge

What is the ESP32-S3 CAM XiaoZhi AI?

Open-source AI voice assistant with camera vision

XiaoZhi AI is an open-source firmware project that transforms an ordinary ESP32 microcontroller into a cloud-connected AI voice assistant. Unlike commercial assistants like Alexa or Google Assistant, XiaoZhi gives you full control over your AI's personality, language, and behavior. You can configure it to speak Hindi, English, or Hinglish, give it a custom name and persona, and even connect it to external services — all for free using the open-source tier.

This particular build takes it to the next level using the ESP32-S3 CAM module with an OV3660 3MP camera for AI-powered visual recognition and a 1.8-inch ST7735 TFT LCD for vibrant status display. The ESP32-S3 brings dual-core 240MHz processing power with onboard PSRAM, making it capable of handling camera capture, audio processing, and display updates simultaneously without lag.

The AI can see its surroundings and describe them, respond to voice commands, control a lamp via GPIO14, adjust speaker volume and display brightness, tell the current time and date, provide weather updates, play music, tell jokes, and maintain natural multi-turn conversations — all through the xiaozhi.me cloud platform. The two UI variants give you the choice between a clean information dashboard and a chat-style messaging interface.

How It Works

Understanding the technology behind your AI voice assistant.

System Architecture Overview

From voice input to AI response — the complete flow

When you press the BOOT button and speak, here is what happens behind the scenes:

  1. Voice Capture: The INMP441 I2S microphone captures your voice at 16kHz sample rate. The I2S interface ensures clean digital audio with minimal noise.
  2. Cloud Processing: Audio is streamed to the xiaozhi.me cloud servers via WiFi, where speech-to-text converts it to text.
  3. AI Inference: The text is processed by a large language model (LLM) — Xiaozhi Lite on the free tier. The AI can invoke MCP tools like camera capture or lamp control during this phase.
  4. Text-to-Speech: The AI response is converted to natural-sounding speech and streamed back to the ESP32-S3.
  5. Audio Output: The MAX98357A amplifier drives the speaker, while the ST7735 display updates with status, emotions, or chat bubbles.

The entire round trip typically takes 1-3 seconds depending on your internet speed and the complexity of the request. Camera capture adds a moment for image processing.
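
To get a feel for how much data step 1 produces, the arithmetic below works out the raw audio rate of a 16 kHz microphone stream. It is only a sketch: the 16-bit mono sample format and the 60 ms frame length are assumptions made for illustration, not values confirmed by this guide, and the firmware may well compress audio before uploading it.

```python
SAMPLE_RATE_HZ = 16_000  # from the guide: the INMP441 is sampled at 16 kHz
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM samples
CHANNELS = 1             # assumption: mono capture
FRAME_MS = 60            # assumption: hypothetical frame length for streaming


def raw_bytes_per_second(rate_hz=SAMPLE_RATE_HZ,
                         bytes_per_sample=BYTES_PER_SAMPLE,
                         channels=CHANNELS):
    """Uncompressed audio data rate in bytes per second."""
    return rate_hz * bytes_per_sample * channels


def samples_per_frame(rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of samples contained in one frame of the given length."""
    return rate_hz * frame_ms // 1000


if __name__ == "__main__":
    print("raw bytes per second:", raw_bytes_per_second())
    print("samples per frame:", samples_per_frame())
    print("raw bytes per frame:", samples_per_frame() * BYTES_PER_SAMPLE * CHANNELS)
```

Under these assumptions the uncompressed stream is small by Wi-Fi standards, which is consistent with network latency and cloud processing, rather than audio bandwidth, dominating the round-trip time.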

Camera Vision Capability

How the OV3660 3MP sensor enables visual AI

The OV3660 is a 3-megapixel camera sensor that connects to the ESP32-S3 CAM module through a dedicated FPC (Flexible Printed Circuit) ribbon connector. When you ask "What do you see?" or "Describe this room," the firmware captures a VGA-resolution image (640x480) using the camera's built-in JPEG compression engine.

The image is uploaded to the cloud AI, which analyzes it using computer vision capabilities and returns a natural language description. This works for:

  • Identifying objects and people in the room
  • Reading text from signs, labels, or documents
  • Describing colors, lighting conditions, and spatial layout
  • Recognizing whether lights are on or off
  • Assisting with component identification during electronics work

The camera is configured to use PSRAM for frame buffer storage, ensuring smooth operation without exhausting the main memory. The MCP tool self.camera.take_photo must be enabled in the xiaozhi.me console under MCP Services.
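
A quick size calculation suggests why the frame buffer lives in PSRAM. The sketch below compares uncompressed frame sizes with the 512 KB of on-chip SRAM on the ESP32-S3. The RGB565 pixel format (2 bytes per pixel) is an assumption used only for comparison; the firmware captures JPEG, which is much smaller, so treat these numbers as an upper bound rather than what the device actually stores.

```python
INTERNAL_SRAM_BYTES = 512 * 1024  # ESP32-S3 on-chip SRAM, shared with the rest of the firmware


def raw_frame_bytes(width, height, bytes_per_pixel=2):
    """Size of one uncompressed frame; 2 bytes per pixel corresponds to RGB565."""
    return width * height * bytes_per_pixel


vga = raw_frame_bytes(640, 480)      # resolution used for captures in this guide
qxga = raw_frame_bytes(2048, 1536)   # full 3MP sensor resolution

for label, size in (("VGA", vga), ("QXGA", qxga)):
    fits = "fits in" if size <= INTERNAL_SRAM_BYTES else "exceeds"
    print(f"{label}: {size} bytes, {fits} internal SRAM")
```

Even a single uncompressed VGA frame is larger than the internal SRAM, so camera buffers are a natural candidate for external PSRAM.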

Why ESP32-S3 CAM?

What makes the ESP32-S3 the right choice for this project.

ESP32-S3 vs Classic ESP32

Key differences that matter for camera + display + audio

The ESP32-S3 is a significant upgrade over the classic ESP32 for projects that need to handle multiple data streams simultaneously. Here is why it matters:

Feature | ESP32 | ESP32-S3
Processor | Xtensa dual-core 240MHz | Xtensa dual-core 240MHz + vector extensions
PSRAM | Up to 8MB (external) | Up to 16MB (external, octal)
Camera Interface | Requires external wiring | Built-in FPC connector on CAM module
USB Interface | UART via CP2102/CH340 | Native USB Serial/JTAG (no extra chip)
Flash | Up to 16MB | Up to 16MB, octal support
AI Accelerator | None | Vector instructions for ML inference

The built-in camera FPC connector on the ESP32-S3 CAM module eliminates the need for messy wiring to the camera. The native USB Serial/JTAG means no driver installation is needed on modern operating systems — just plug and flash.

Key Features

Voice-controlled capabilities powered by the MCP tool system and cloud AI.

Camera Vision

Ask "What do you see?" and the AI captures and describes its surroundings using the OV3660 3MP camera.

Two Display Modes

Choose between a clean Normal UI with clock and status, or a WeChat-style chat bubble interface.

Voice Control

Adjust speaker volume, display brightness, toggle lamp, get weather, play music — all by voice.

Custom Personality

Configure name, language, voice, and behavior. Supports English, Hindi, and Hinglish.

MCP Tool System

Built-in tools: camera, lamp, volume, brightness, time, weather, WiFi info, reboot, and more.

OTA Updates

Change AI personality anytime through the cloud console. Settings persist across reboots.

Required Components

All parts are commonly available online. The ESP32-S3 CAM module includes the camera connector onboard.

Component | Description | Qty
ESP32-S3 CAM Module | Dual-core 240MHz with PSRAM and FPC camera connector | 1
OV3660 Camera Sensor | 3MP camera module with FPC ribbon cable | 1
1.8" TFT LCD ST7735 | 128x160 SPI color display with LED backlight | 1
INMP441 I2S Mic | Digital I2S microphone module | 1
MAX98357A Amplifier | I2S DAC + 3W class-D amp module | 1
2W 4 Ohm Speaker | Small speaker for voice output | 1
USB-C Data Cable | Must support data transfer | 1
Jumper Wires | M-M and M-F assorted | ~30
Breadboard 400 Tie | For prototyping | 2

Circuit Diagram

The OV3660 connects via FPC ribbon to the ESP32-S3 CAM module. All other peripherals connect through pin headers.

[Figure: ESP32-S3 CAM circuit diagram, SKR Electronics Lab]

Circuit Notes: The OV3660 camera connects via the FPC ribbon to the ESP32-S3 CAM module. Double-check all connections before powering on.

Flash the Firmware

Choose your preferred display UI and flash directly from the browser. Use Chrome or Edge on desktop.

Flashing Instructions: Click the flash button and note the COM ports already listed. Then hold BOOT on the ESP32-S3 while plugging in the USB-C cable, and select the new port that appears. Check "Erase Device" and click Install.
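
The "note the COM ports, then select the new one" step is just a before/after comparison. The sketch below shows that comparison on two lists of port names. The names are hypothetical examples; on a real machine the lists could come from a tool such as pyserial's `serial.tools.list_ports.comports()`, which is not used here so that the example stays dependency-free.

```python
def new_ports(before, after):
    """Return the ports present in `after` but not in `before`, keeping order."""
    known = set(before)
    return [port for port in after if port not in known]


# Hypothetical port names, as they might appear on Windows.
ports_before = ["COM3", "COM4"]          # noted before plugging in the board
ports_after = ["COM3", "COM4", "COM7"]   # listed after holding BOOT and plugging in

print(new_ports(ports_before, ports_after))  # ['COM7']
```

If the result is empty, the board was not detected, which usually points to the cable or download-mode issues covered in the troubleshooting section.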

Normal UI

Clean dashboard with clock, status icons, and emotion display

WeChat-Style UI

Chat bubble interface with message history and waveform animation

Step 2: Connect to Your Wi-Fi Network

Via captive portal at 192.168.4.1 · 2.4 GHz networks only

After successful flashing, the ESP32-S3 boots and creates a temporary Wi-Fi hotspot named Xiaozhi-SKR-XXXX. Connect to this hotspot from your phone or computer, then follow these steps to configure your home Wi-Fi credentials through the device's built-in web portal.

1. Connect to the "Xiaozhi-SKR-XXXX" hotspot

Open your phone or laptop Wi-Fi settings and connect to the network named Xiaozhi-SKR-XXXX. No password is required; this is the ESP32-S3's temporary access point.

2. Open 192.168.4.1 in your browser

Once connected, open a browser and go to 192.168.4.1. The ESP32-S3 configuration portal loads. You will see the main dashboard with device status and available tabs.

[Screenshot: Configuration Portal, main dashboard]

3. Go to the Advanced tab and select your timezone

Tap the Advanced tab in the top navigation. Find the Timezone dropdown and select your local timezone (e.g. Asia/Kolkata for India). This ensures the device reports accurate time in responses.

[Screenshot: Advanced tab, timezone selection]

4. Click Save Configuration, then switch to WiFi Config

After selecting your timezone, click the Save Configuration button. Once saved, switch to the WiFi Config tab to set up your home network connection.

[Screenshot: WiFi Config tab, select network and enter password]

5. Select your network, enter password, and click Connect

In the WiFi Config tab, click your home Wi-Fi network from the available list (2.4 GHz only). Type your Wi-Fi password in the password field and click the Connect button. The device will attempt to connect.

6. Wait for the success confirmation message

Once connected successfully, the portal displays a green success message. The ESP32-S3 restarts automatically and joins your home network. The ST7735 TFT display will show a pairing code — do not disconnect power during this process.

[Screenshot: Connection successful]

2.4 GHz networks only: The ESP32-S3 does not support 5 GHz Wi-Fi. If your router broadcasts both bands under the same name, temporarily connect a phone or laptop to confirm which SSID is the 2.4 GHz one.
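
The timezone chosen in step 3 matters because network time is normally delivered as UTC, and the device has to apply a local offset before it can answer "What time is it?". The sketch below illustrates that conversion for India Standard Time using a fixed UTC+05:30 offset. It is an illustration of the idea only; how the firmware stores and applies the timezone is not described in this guide.

```python
from datetime import datetime, timedelta, timezone

# India Standard Time is a fixed UTC+05:30 offset with no daylight saving.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def utc_to_local(utc_dt, tz=IST):
    """Convert a timezone-aware UTC datetime to the given local timezone."""
    return utc_dt.astimezone(tz)


utc_noon = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
local = utc_to_local(utc_noon)
print(local.isoformat())  # 2024-01-01T17:30:00+05:30
```

With the wrong timezone selected, the spoken time would be off by the difference between the two offsets even though the clock itself is synchronized correctly.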

Step 3: Create Your XiaoZhi Account

Free account at xiaozhi.me · Google login recommended

Before you can pair your device, you need a free account on the XiaoZhi AI platform. This is where you manage your agent's personality, language, voice, and advanced settings.

1. Reconnect to your home Wi-Fi network

Switch your phone or computer back from the XiaoZhi hotspot to your regular network.

2. Open xiaozhi.me in your browser

Click "Console" in the navigation.

3. Sign up using Google

The fastest method. Your account is active immediately.

[Screenshot: xiaozhi.me homepage]

Step 4: Pair Your ESP32-S3 Device

Enter the 6-digit code scrolling on the TFT display

After creating your account and signing into the console, you'll see the Agents page. Now link your physical ESP32-S3 to your account using the pairing code displayed on the TFT display.

1. Click "+ Add Device" in the Agents console

An input dialog appears for the pairing code.

2. Read the 6-digit code scrolling on your TFT display

The code refreshes every 30 seconds. Type it quickly.

3. Enter the code and click Confirm

The device links to your account.

4. Accept the agreement and click "Start Using"

Select the Open Source (Free) tier to continue.

[Screenshot: Add Device dialog in the console]

Device paired successfully: Your ESP32-S3 now appears as an agent card in the console, listed as Online with a green indicator.
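
Because the pairing code is described as six digits, a mistyped entry is easy to catch before submitting it. The check below is a hypothetical helper written for illustration; it is not part of the XiaoZhi console, and it assumes the code contains digits only.

```python
import re


def is_valid_pairing_code(text):
    """True if `text` is exactly six ASCII digits after trimming whitespace."""
    return re.fullmatch(r"[0-9]{6}", text.strip()) is not None


print(is_valid_pairing_code(" 123456 "))  # True
print(is_valid_pairing_code("12345"))     # False: too short
print(is_valid_pairing_code("12a456"))    # False: contains a letter
```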

Step 5: Configure Your AI Agent's Personality

Set name, voice, language, system prompt, MCP tools

Click "Configure Role" on your device card. This opens the full configuration panel where you design your AI's identity — its name, voice, language, and behavioral instructions.

[Screenshot: Configure Role, XiaoZhi console]
What is "Role Introduction"? This is the system prompt — the core instruction set that defines who your AI is, how it behaves, what language it speaks, and what it knows. It's the AI's personality blueprint.

Settings reference — what each option does:

Setting | What It Controls | Recommended
Assistant Name | What the AI calls itself in greetings | Any name, e.g., "Maxon" or "Jarvis"
Dialogue Language | Primary language for voice output | Switch to preferred language
Voice Role | TTS voice and accent selection | Try several and pick the best fit
Role Introduction | Full personality and behavior system prompt | Use the generator tool below
Memory Type | How the AI retains conversation context | Short-term Memory
Language Model | AI engine powering responses | Xiaozhi Lite (free, fast)
Voice Recognition Speed | Speech-to-text processing speed | Normal
Character Speech Speed | How fast the AI talks | Normal or slightly slower
Official Services (MCP) | Built-in tools: Weather, Music, Jokes | Enable Weather, Music

Use the interactive system prompt generator below to craft a detailed, personalized instruction set for your AI.

System Prompt Generator Build a detailed XiaoZhi role prompt — fully customizable
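The generator offers quick templates and three steps (1. Identity, 2. Personality, 3. Voice & Skills), with fields for details about your AI assistant and, optionally, about the user. The output pane, labeled generated_prompt.txt, shows a 2000-character limit.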
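
If you prefer to draft the Role Introduction by hand, the same idea can be sketched in a few lines: assemble identity, personality, and skills into one text and check it against the 2000-character counter shown by the generator. The function name, field names, and sentence templates below are hypothetical; they do not reproduce what the generator on the page actually outputs.

```python
MAX_PROMPT_CHARS = 2000  # limit shown by the generator's character counter


def build_role_prompt(name, language, personality, skills, user_notes=""):
    """Assemble a role prompt from a few fields and enforce the length limit."""
    lines = [
        f"You are {name}, a voice assistant.",
        f"Always reply in {language}.",
        f"Personality: {personality}.",
        "Skills: " + ", ".join(skills) + ".",
    ]
    if user_notes:
        lines.append(f"About the user: {user_notes}")
    prompt = "\n".join(lines)
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"prompt is {len(prompt)} characters; the limit is {MAX_PROMPT_CHARS}"
        )
    return prompt


prompt = build_role_prompt(
    name="Maxon",
    language="Hinglish",
    personality="friendly and concise",
    skills=["weather", "music", "describing what the camera sees"],
)
print(prompt)
print(f"{len(prompt)} / {MAX_PROMPT_CHARS} characters")
```

Paste the resulting text into the Role Introduction field. Short, direct instructions tend to leave more of the character budget for details about the user.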

Step 6: Set Up Custom Wake Word

Say your own phrase to wake the AI — no button press needed

This is the most powerful upgrade in Version 2. Instead of pressing the BOOT button every time, you can wake your AI by saying a custom phrase — like "Hey Maxon". The ESP32-S3 runs the MultiNet model locally to detect your phrase.

The setup is done entirely through the XiaoZhi web console — no code, no flashing, just a few clicks. Here is the complete flow:

  1. Manage Devices: Click this on your agent card.
  2. Customize: Visible only when the device is Online.
  3. Custom Wake Word: Select it in Theme Design.
  4. Generate assets.bin: Flash it OTA to the device.

Detailed step-by-step:

01. Click "Manage Devices" on your agent card

In the Agents section of the console, find your device card and click "Manage Devices".

02. Click "Customize" (only visible when the device is Online)

This button appears next to Theme Settings when your ESP32-S3 is powered on and connected.

[Screenshot: Manage Devices → Customize button. Click "Manage Devices", then "Customize", to open the Customization tool.]
03. Step 1: Chip & Screen loads automatically

The tool auto-detects your chip (ESP32-S3) and screen (ST7735 128×160). Click Next.

[Screenshot: Step 1, chip configuration auto-detected as ESP32-S3 with ST7735 128×160 screen]

Auto-detection not working? Expand "Manual Configuration" and set the chip to ESP32-S3 and the screen to ST7735 128×160 manually.
04. Step 2: Theme Design, click "Custom Wake Word"

Four tabs appear: Wake Word Config, Font Config, Emoji Collection, Chat Background. Under Wake Word Config, click "Custom Wake Word".

[Screenshot: Theme Design → Wake Word Config → "Custom Wake Word" button]
05. Enter Wake Word Name and Wake Command

  • Wake Word Name: a label (e.g., Maxon)
  • Wake Command: the exact phrase to speak (e.g., Hey Maxon)

Keep the phrase to 2–3 syllables for best accuracy.

[Screenshot: Custom Wake Word settings. Wake Word Name: Maxon · Wake Command: Hey Maxon · Model: MultiNet6 (English)]
06. Select Recognition Model: MultiNet6 (English)

Choose MultiNet6 (English) for English commands or MultiNet6 (Chinese) for Chinese. The default sensitivity threshold of 20 is fine.

[Screenshot: Recognition model set to MultiNet6 (English), available only on ESP32-S3]
07. Click Next → Preview → Click "Generate assets.bin"

Step 3 shows a simulated preview of the display. Click the green "Generate assets.bin" button.

[Screenshot: Step 3, preview confirming the wake word with the "Generate assets.bin" button]
08. Confirm → Click "Start Generate"

The summary shows Chip: ESP32-S3, Resolution: 128×160 (ST7735), and the wake word. Files: index.json (~1KB) and srmodels.bin (~1.2MB). Click "Start Generate".

[Screenshot: Generate assets.bin confirmation dialog with all settings confirmed]
09. Generation done → Click "Flash to Device Online"

assets.bin generates in about 2 seconds (3.61 MB in 2.2s in this example). Ensure the device is online, then click the blue "Flash to Device Online" button.

[Screenshot: assets.bin ready, 3.61 MB generated in 2.2s]
10. OTA flashing: device says "Updating the System"

The progress bar shows the upload in real time, and the ESP32-S3 speaks "Updating the System". Do not close the page or power off the device.

[Screenshot: OTA flashing in progress at 50%]
11. Wait 1–2 min: device restarts, wake word active

After flashing, the device reboots automatically. Once reconnected, say your wake phrase and the AI responds immediately.

Wake word is active: Say "Hey Maxon" and the AI wakes. No button needed. Face animation activates on the TFT display.

Tips for best recognition: Use 2–3 syllable phrases. Speak at normal volume from 1–2m away. If false triggers occur, make detection less sensitive; if the wake word is often missed, make it more sensitive.

Final Activation — Start Using Your AI

Save, reset, and bring your voice assistant to life

Your device is flashed, paired, configured, and wake word is set. Complete this final activation sequence:

Save all settings in the XiaoZhi console

Click Save after configuring role, personality, and MCP tools.

Hard reset the ESP32-S3

Press the physical RST/EN button to apply all settings.

Wait for Wi-Fi and NTP sync

TFT display shows connecting → then face animation appears.

Say your wake word

Speak "Hey Maxon" (or your phrase) — the AI activates and is ready.

Manual Activation (Backup)

Short press BOOT also wakes the AI if you prefer not to use wake word.

Auto-Sleep

After silence, the device sleeps to save power. Wake word or BOOT activates it again.

Update Personality Anytime

Change role, voice, or language in the console. Save + hard reset to apply.

Change Wake Word Anytime

Repeat Customize → Generate assets.bin → Flash to use a different phrase.

Settings not applying? Always do a hard RST/EN press after saving changes. Software reboot alone may not apply new settings.

Your AI Voice Assistant is Ready

You've built, flashed, configured, and set up a custom wake word on a fully functional XiaoZhi AI V2. Say your phrase and start talking.

Save in Console → Hard Reset Board → Wait for Connection → Speak Wake Word

Want It Pre-Built & Ready to Go?

Get a fully assembled, tested XiaoZhi AI S3 kit with wake word pre-configured. Power it on and start talking immediately.

Order on WhatsApp — +91 8535889926
Pre-Tested · Fast Shipping · Free Support · Wake Word Configured

MCP Device Tools

Built-in tools accessible to the AI during conversation

The firmware exposes multiple MCP (Model Context Protocol) tools that the AI can invoke during conversation. Enable them in the xiaozhi.me console under MCP Services.

Tool | Voice Command Example
self.camera.take_photo | "What do you see?"
self.audio_speaker.set_volume | "Set volume to 50%"
self.screen.set_brightness | "Dim the screen"
self.lamp.turn_on / off | "Turn on the lamp"
self.get_current_time | "What time is it?"
self.system.reboot | "Restart the device"
self.system.get_wifi_info | "What's my WiFi status?"
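
For readers curious what a tool invocation looks like on the wire: MCP messages follow JSON-RPC 2.0, and a tool is invoked with a `tools/call` request that carries the tool name and its arguments. The sketch below builds such a request for the volume tool. The tool name comes from the table above, but the `volume` argument name is an assumption, and how the XiaoZhi firmware transports and handles these messages is not covered in this guide.

```python
import json


def make_tool_call(request_id, tool_name, arguments=None):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }


# "volume" is a hypothetical argument name used for illustration.
msg = make_tool_call(1, "self.audio_speaker.set_volume", {"volume": 50})
print(json.dumps(msg, indent=2))
```

When you say "Set volume to 50%", the language model decides which tool matches the request and fills in the arguments; the device only has to execute the call and report the result.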

Troubleshooting Guide

Common issues and how to resolve them.

Device Won't Flash / No COM Port Detected
  • Use a good USB-C cable. Some cables are charge-only and do not carry data. Try a different cable.
  • Reset the device. Press the RESET button while the flash page is trying to connect.
  • Enter download mode. Hold BOOT, press RESET (still holding BOOT), then release BOOT. This forces download mode.
  • Try a different browser. Google Chrome or Microsoft Edge are recommended.
  • USB port power. Use a direct computer port, not a hub.
WiFi Issues
  • Hotspot not appearing: Wait 15s after power-on. Press RESET if needed. If TFT lights up but no hotspot, reflash the firmware.
  • 5GHz not visible: Expected. ESP32-S3 only supports 2.4GHz. Create a separate 2.4GHz IoT network.
  • Keeps disconnecting: Weak signal or poor power. Use a quality power supply and move closer to the router.
  • Not appearing in xiaozhi.me: Enter the pairing code exactly as shown on the TFT display, and make sure the device is connected to Wi-Fi before you add it in the console.
Audio Problems
  • No sound: Check MAX98357A wiring: BCLK=GPIO40, LRC=GPIO41, DIN=GPIO39 (the ESP32-S3's I2S data output). The speaker connects across the amplifier's two output terminals, not to GND.
  • Quiet or distorted: Use a 4-8 ohm speaker. Try "Volume 100" to raise the volume.
  • Mic not picking up: Verify INMP441 — SD=GPIO42, WS=GPIO1, SCK=GPIO2. Sound hole must face you.
  • Echo/feedback: Keep the mic away from the speaker.
Camera Not Working
  • MCP tool off: In xiaozhi.me console, enable self.camera.take_photo under MCP Services.
  • FPC connector: Ensure the ribbon cable is fully inserted and the latch clicks.
  • Low light: The OV3660 needs adequate illumination. Add more light.
  • "Take photo failed": Memory issue. Restart the device. PSRAM is already enabled in firmware.
Device Gets Hot / Unstable
  • Normal warmth: ESP32-S3 at 240MHz with camera + display + audio runs warm. This is normal.
  • Too hot: Check for short circuits on the breadboard. Ensure nothing touches the bottom of the CAM module.
  • Random reboots: Power issue. ESP32-S3 draws up to 500mA. Use a 1A+ power source.
  • Display flickering: Loose SPI connections or insufficient power. Recheck wiring and power.

Frequently Asked Questions

Quick answers to common questions about the project.

Do I need to know programming to build this?

No. The firmware comes precompiled. You just click the flash button, configure WiFi through the web portal, and pair via the xiaozhi.me website. No code writing required.

Can I use a different display or camera?

This firmware is specifically built for ST7735 1.8-inch TFT displays and OV3660 cameras. Different hardware would require firmware modifications. Check the official XiaoZhi AI GitHub for other board configurations.

Can I use this without the xiaozhi.me cloud service?

No. The firmware relies on the xiaozhi.me cloud for speech-to-text, AI inference, and text-to-speech. The device is not designed for offline use.

Is the free tier really free? What are the limits?

Yes, the free open-source tier is completely free. It uses the Xiaozhi Lite AI model which is capable enough for most conversations. Paid tiers offer better AI models, voice cloning, and higher usage limits.

Can I control home appliances beyond the lamp GPIO?

The lamp control is a simple GPIO14 toggle. You can connect a relay module to control higher-power devices. The firmware supports additional MCP tools that can be extended with custom development.

Can I change the UI after flashing?

Yes, by flashing the other firmware variant. The Normal UI shows a dashboard with clock, status, and AI responses. The WeChat UI mimics a chat messaging interface. Both are available as separate flashable firmware files.

How do I update the firmware later?

Simply repeat the flashing process. Your paired device will reconnect to xiaozhi.me after reflashing. No need to re-pair unless you want a fresh start.

Conclusion & Next Steps

What you built and where to go from here.

You Have Built a Voice-Powered AI Assistant

Complete with camera vision, display, and cloud AI

Congratulations. By following this guide, you have assembled and configured a fully functional AI assistant with:

  • Voice interaction (custom wake word, or push-to-talk via the BOOT button)
  • Visual AI recognition using the OV3660 3MP camera
  • Color TFT status display (ST7735 1.8-inch)
  • Cloud AI processing through xiaozhi.me
  • Hardware control (lamp, brightness, volume)
  • Customizable personality and language
  • Two UI choices: Normal Dashboard or WeChat-style Messaging

This project demonstrates just how powerful modern microcontrollers have become. With built-in WiFi, camera interface, display support, and external sensor connectivity, the ESP32-S3 CAM is a versatile platform for AI edge devices.

Next Steps & Ideas

Ways to extend your XiaoZhi AI project
Add a Relay Module

Connect a 5V relay to GPIO14 to control lamps, fans, or other home appliances with voice commands.

Enclosure Design

3D print a custom enclosure to make your XiaoZhi look like a proper desktop gadget. Design one on Onshape or Tinkercad.

Explore the Code

The XiaoZhi AI project is fully open-source. Clone the GitHub repo and explore adding your own MCP tools or custom features.
