A toolkit for remote human-robot interaction through Zoom video conferencing, powered by Large Language Models. This system serves as a core component of the Robi Butler project (ICRA 2025 paper).
This is a minimal implementation enabling remote human-robot interaction through Zoom meetings with:
- Zoom Interface: Bidirectional communication via Zoom chat and voice (to output the robot's voice, generate speech with ElevenLabs and route it into the meeting through a virtual microphone)
- Robi Butler: Intelligent agent for instruction processing and robot control
- LLM Planner: GPT-4 powered natural language understanding
- Fetch Agent: Robot control interface with manipulation primitives
- Pointing Server (optional): Web-based pointing/gesture input
User → Robot:
- User speaks/types "Robi, pick up the cup from the table" in Zoom
- `zoom_interface.py` detects the wake word → publishes to `/user_instruction`
- `robi_butler.py` receives the message → calls `ChatAgent.get_response()`
- LLM generates: `[move("table"), pick("cup")]`
- Butler executes via `fetch_agent` → actions sent to the robot
- Butler publishes the result to `/robot_feedback`
- `zoom_interface.py` writes the response to Zoom chat
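The wake-word detection and the bracketed action plan above can be sketched in plain Python. The function names and the `WAKE_WORD` constant here are illustrative, not the actual identifiers in `zoom_interface.py` or `llm_planner.py`:

```python
import re
from typing import List, Optional, Tuple

WAKE_WORD = "robi"  # assumed wake word, per the examples in this README

def extract_instruction(message: str) -> Optional[str]:
    """Return the instruction text if the message starts with the wake word."""
    text = message.strip()
    if text.lower().startswith(WAKE_WORD):
        instruction = text[len(WAKE_WORD):].lstrip(" ,:")
        return instruction or None
    return None

def parse_plan(plan: str) -> List[Tuple[str, str]]:
    """Parse an LLM plan like '[move("table"), pick("cup")]' into (action, arg) pairs."""
    return re.findall(r'(\w+)\("([^"]*)"\)', plan)
```

Anything `extract_instruction` returns would be published on `/user_instruction`; the `(action, arg)` pairs are what the butler hands to `fetch_agent`.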
Pointing (optional):
- User clicks on the web interface (via the ngrok URL)
- `server.py` publishes `Point32(x, y)` to `/user_point`
- `robi_butler.py` stores the point in a buffer
- User says "Robi, pick this"
- LLM generates: `[pick("*")]`
- Butler uses the stored point coordinates for the action
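The point-buffering step can be sketched as follows; the class and method names are illustrative assumptions, and `robi_butler.py` may implement this differently:

```python
from typing import Optional, Tuple, Union

Point = Tuple[float, float]

class PointBuffer:
    """Holds the most recent click from /user_point; '*' arguments resolve to it."""

    def __init__(self) -> None:
        self._point: Optional[Point] = None

    def update(self, x: float, y: float) -> None:
        # Called when a new Point32 arrives; only the latest click is kept.
        self._point = (x, y)

    def resolve(self, arg: str) -> Union[str, Point]:
        # Substitute the buffered coordinates for the '*' placeholder;
        # ordinary arguments like "cup" pass through unchanged.
        if arg == "*" and self._point is not None:
            return self._point
        return arg
```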
```
zoom_communication/
├── robi_butler.py        # Main agent: instruction processing & execution
├── zoom_interface.py     # Zoom bridge: chat/voice ↔ ROS topics
├── llm_planner.py        # ChatAgent: LLM-based planning
├── fetch_agent.py        # Robot interface: action primitives
├── robots/
│   └── fake_robot.py     # Simulated robot with scene graph
├── webgesture/server/
│   └── server.py         # Pointing server (Flask)
├── requirements.txt      # Python dependencies
├── .env.example          # Environment template
├── .gitignore
├── LICENSE
└── geckodriver           # Firefox WebDriver
```
| Topic | Type | Publisher | Subscriber | Description |
|---|---|---|---|---|
| `/user_instruction` | `String` | `zoom_interface.py` | `robi_butler.py` | User commands |
| `/robot_feedback` | `String` | `robi_butler.py` | `zoom_interface.py` | Robot responses |
| `/user_point` | `Point32` | `server.py` | `robi_butler.py` | Pointing coordinates |
- Python 3.8+
- ROS Noetic (or later)
- Firefox browser
- OpenAI API key (set the `OPENAI_API_KEY` environment variable, or put it in the `.env` file)
- Geckodriver (already included in the repository, or download it from the geckodriver releases page)
- Install system dependencies:

  ```bash
  sudo apt-get install portaudio19-dev python3-pyaudio
  ```

- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment variables:

  ```bash
  cp .env.example .env
  nano .env
  ```

  Edit `.env` (or export `OPENAI_API_KEY` in your shell instead):

  ```bash
  OPENAI_API_KEY=sk-your-openai-api-key-here
  # Full Zoom meeting URL, taken from the meeting invite email
  ZOOM_MEETING_URL=https://zoom.us/j/YOUR_MEETING_ID?pwd=YOUR_PASSWORD
  ```

- Make geckodriver executable:

  ```bash
  chmod +x geckodriver
  ```
Terminal 1 - ROS Core:

```bash
roscore
```

Terminal 2 - Zoom Interface:

```bash
python3 zoom_interface.py
```

- Opens Firefox and joins the Zoom meeting (join from browser)
- Listens for chat messages and voice commands (mute the robot's microphone, open the chat box)
- Publishes to `/user_instruction`
- Writes responses from `/robot_feedback` to Zoom
Terminal 3 - Robi Butler:

```bash
python3 robi_butler.py
```

- Subscribes to `/user_instruction` and `/user_point`
- Processes commands through the LLM
- Executes robot actions
- Publishes to `/robot_feedback`
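The plan–execute–report cycle this terminal runs can be sketched without the ROS plumbing. The callables below stand in for the real `ChatAgent` and `fetch_agent`, and the returned string is what would be published on `/robot_feedback`:

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (action name, argument), e.g. ("pick", "cup")

def butler_step(instruction: str,
                plan: Callable[[str], List[Action]],
                execute: Callable[[str, str], bool]) -> str:
    """One instruction -> plan -> execute -> feedback cycle.

    `plan` maps an instruction to (action, argument) pairs and `execute`
    runs one action, returning True on success.
    """
    for name, arg in plan(instruction):
        if not execute(name, arg):
            # Stop at the first failed primitive and report it.
            return f"Failed: {name}({arg!r})"
    return "Done: " + instruction
```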
Terminal 4 (Optional) - Pointing Server:

```bash
cd webgesture/server
python3 server.py
# In another terminal, expose with ngrok
ngrok http 8888
```

- Provides a web interface for pointing
- Share the ngrok URL with users
- Publishes clicks to `/user_point`
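A click arriving at the server has to be mapped from browser pixels to the coordinates published on `/user_point`. A minimal sketch of that mapping follows; the [0, 1] normalization convention is an assumption, and the real `server.py` may publish raw pixel coordinates instead:

```python
from typing import Tuple

def normalize_click(px: float, py: float, width: int, height: int) -> Tuple[float, float]:
    """Map a pixel click on the video frame to normalized [0, 1] coordinates."""
    if width <= 0 or height <= 0:
        raise ValueError("frame dimensions must be positive")
    # Clamp so clicks on the frame border stay in range.
    x = min(max(px / width, 0.0), 1.0)
    y = min(max(py / height, 0.0), 1.0)
    return x, y
```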
Voice/Chat (say "Robi" to activate):

```
Robi, move to the kitchen
Robi, pick up the cup from the table
Robi, open the fridge and check if there is pizza
Robi, what do you see?
```

With Pointing (use "this"):

```
Robi, pick this       [user clicks on object]
Robi, what is this?   [user clicks on object]
Robi, move to this    [user clicks on location]
```
Zoom not connecting:
- Verify `ZOOM_MEETING_URL` in `.env`
- Check that Firefox is installed
- Wait ~35 seconds for the page to load
- Check that geckodriver is executable: `chmod +x geckodriver`
No microphone input:

```bash
# List devices
python3 -c "import speech_recognition as sr; print(sr.Microphone.list_microphone_names())"
```

```python
# Set the device in zoom_interface.py
self.mic_device = 1  # your device index
```

OpenAI API errors:

```bash
# Verify the key
echo $OPENAI_API_KEY
# Or check the .env file
grep OPENAI_API_KEY .env
```

ROS topics not working:

```bash
# Check topics
rostopic list
# Check connections
rostopic info /user_instruction
# Test manually
rostopic pub /user_instruction std_msgs/String "test"
```

ModuleNotFoundError:

```bash
pip install -r requirements.txt
# ROS packages (if missing)
sudo apt-get install ros-noetic-cv-bridge ros-noetic-geometry-msgs
```

MIT License - see LICENSE file for details
If you use this system in your research, please cite:
```bibtex
@inproceedings{xiao2025robi,
  title={Robi Butler: Multimodal Remote Interaction with a Household Robot Assistant},
  author={Xiao, Anxing and Janaka, Nuwan and Hu, Tianrun and Gupta, Anshul and Li, Kaixin and Yu, Cunjun and Hsu, David},
  booktitle={2025 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={4337--4344},
  year={2025},
  organization={IEEE}
}
```
