Zoom Communication Interface for Remote Human-Robot Interaction

A toolkit for remote human-robot interaction through Zoom video conferencing, powered by Large Language Models. This system serves as a core component of the Robi Butler project (ICRA 2025 paper).

Overview

This is a minimal implementation that enables remote human-robot interaction through Zoom meetings. It comprises:

  • Zoom Interface: Bidirectional communication via Zoom chat and voice (to output the robot's voice, generate speech with ElevenLabs and route it into the system through a virtual microphone)
  • Robi Butler: Intelligent agent for instruction processing and robot control
  • LLM Planner: GPT-4 powered natural language understanding
  • Fetch Agent: Robot control interface with manipulation primitives
  • Pointing Server (optional): Web-based pointing/gesture input

Data Flow

  1. User → Robot:

    • User speaks/types "Robi, pick up the cup from the table" in Zoom
    • zoom_interface.py detects wake word → publishes to /user_instruction
    • robi_butler.py receives message → calls ChatAgent.get_response()
    • LLM generates: [move("table"), pick("cup")]
    • Butler executes via fetch_agent → actions sent to robot
    • Butler publishes result to /robot_feedback
    • zoom_interface.py writes response to Zoom chat
  2. Pointing (optional):

    • User clicks on web interface (via ngrok URL)
    • server.py publishes Point32(x, y) to /user_point
    • robi_butler.py stores point in buffer
    • User says "Robi, pick this"
    • LLM generates: [pick("*")]
    • Butler uses stored point coordinates for action
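The plan strings above have a simple, regular shape that a small helper can decompose. The sketch below is hypothetical (the repository's actual parsing lives in robi_butler.py); `parse_plan` is an illustrative name, not an API from the repo:

```python
import re

def parse_plan(plan):
    """Split an LLM plan such as '[move("table"), pick("cup")]' into
    (primitive, argument) tuples. Illustrative helper only; the real
    parsing is done inside robi_butler.py."""
    return re.findall(r'(\w+)\("([^"]*)"\)', plan)

steps = parse_plan('[move("table"), pick("cup")]')
# -> [("move", "table"), ("pick", "cup")]
# A "*" argument, as in [pick("*")], marks an object the user pointed at.
```

Each (primitive, argument) pair then maps onto a fetch_agent action; a "*" argument is resolved against the buffered pointing coordinates.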

Project Structure

zoom_communication/
├── robi_butler.py          # Main agent: instruction processing & execution
├── zoom_interface.py       # Zoom bridge: chat/voice ↔ ROS topics
├── llm_planner.py         # ChatAgent: LLM-based planning
├── fetch_agent.py         # Robot interface: action primitives
├── robots/
│   └── fake_robot.py      # Simulated robot with scene graph
├── webgesture/server/
│   └── server.py          # Pointing server (Flask)
├── requirements.txt       # Python dependencies
├── .env.example          # Environment template
├── .gitignore
├── LICENSE
└── geckodriver           # Firefox WebDriver

ROS Topics

Topic               Type     Publisher           Subscriber          Description
/user_instruction   String   zoom_interface.py   robi_butler.py      User commands
/robot_feedback     String   robi_butler.py      zoom_interface.py   Robot responses
/user_point         Point32  server.py           robi_butler.py      Pointing coordinates

Installation

Prerequisites

  • Python 3.8+
  • ROS Noetic (or later)
  • Firefox browser
  • OpenAI API key (export the OPENAI_API_KEY environment variable, or put it in the .env file)
  • Geckodriver (included in the repository, or download it from the geckodriver releases page)
  • Install system dependencies:
    sudo apt-get install portaudio19-dev python3-pyaudio

Setup

  1. Install Python dependencies:

    pip install -r requirements.txt
  2. Configure environment variables:

    cp .env.example .env
    nano .env

    Edit .env:

    OPENAI_API_KEY=sk-your-openai-api-key-here
    ZOOM_MEETING_URL=https://zoom.us/j/YOUR_MEETING_ID?pwd=YOUR_PASSWORD  # full meeting URL, from the Zoom meeting invite
  3. Make geckodriver executable:

    chmod +x geckodriver
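To see how the variables are picked up, the loading step can be pictured with a stdlib-only sketch. The repository most likely uses a library such as python-dotenv; `load_env` below is only an illustration of the precedence (variables already exported in the shell win over .env entries):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' starts a comment.
    Stdlib-only sketch; a real project would use python-dotenv."""
    try:
        with open(path) as fh:
            for line in fh:
                line = line.split("#", 1)[0].strip()
                if "=" in line:
                    key, _, value = line.partition("=")
                    # setdefault: never overwrite variables from the shell
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # fall back to whatever is already exported

load_env()
api_key = os.environ.get("OPENAI_API_KEY")  # None if unset anywhere
```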

Usage

Start the System (3-4 terminals)

Terminal 1 - ROS Core:

roscore

Terminal 2 - Zoom Interface:

python3 zoom_interface.py
  • Opens Firefox and joins the Zoom meeting (via the "join from browser" option)
  • Listens for chat messages and voice commands (keep the robot's microphone muted in Zoom and open the chat box)
  • Publishes to /user_instruction
  • Writes responses from /robot_feedback to Zoom

Terminal 3 - Robi Butler:

python3 robi_butler.py
  • Subscribes to /user_instruction and /user_point
  • Processes commands through LLM
  • Executes robot actions
  • Publishes to /robot_feedback

Terminal 4 (Optional) - Pointing Server:

cd webgesture/server
python3 server.py

# In another terminal, expose with ngrok
ngrok http 8888
  • Provides web interface for pointing
  • Share ngrok URL with users
  • Publishes clicks to /user_point
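The butler's handling of pointing input can be pictured as a one-slot buffer: the latest click overwrites any earlier, unused one, and executing a "*" argument consumes it. This is a hypothetical sketch of the buffering described above, not the code in robi_butler.py:

```python
class PointBuffer:
    """One-slot buffer for the most recent /user_point click.
    Hypothetical illustration; not the repository's implementation."""

    def __init__(self):
        self._point = None

    def update(self, x, y):
        # A new click replaces any earlier, unused one.
        self._point = (x, y)

    def consume(self):
        # Return and clear the stored point; None if nothing was clicked.
        point, self._point = self._point, None
        return point

buf = PointBuffer()
buf.update(0.42, 0.77)   # user clicks on the web interface
target = buf.consume()   # butler resolves pick("*") to the stored (x, y)
```

Clearing on consume means a stale click cannot leak into the next "this" command.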

Example Commands

Voice/Chat (say "Robi" to activate):

Robi, move to the kitchen
Robi, pick up the cup from the table
Robi, open the fridge and check if there is pizza
Robi, what do you see?

With Pointing (use "this"):

Robi, pick this          [user clicks on object]
Robi, what is this?      [user clicks on object]
Robi, move to this       [user clicks on location]
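The wake-word behavior can be pictured as a small filter in front of the /user_instruction topic: only utterances beginning with "Robi" are forwarded. This is a hypothetical sketch assuming the "Robi," prefix shown in the examples, not the code in zoom_interface.py:

```python
def extract_instruction(utterance, wake_word="robi"):
    """Return the command after the wake word, or None if it is absent.
    Hypothetical sketch of wake-word filtering; assumes a comma after
    the wake word, as in the example commands."""
    head, _, rest = utterance.strip().partition(",")
    if head.strip().lower() == wake_word:
        return rest.strip()
    return None

extract_instruction("Robi, pick up the cup from the table")
# only non-None results would be published to /user_instruction
```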

Troubleshooting

Zoom not connecting:

  • Verify ZOOM_MEETING_URL in .env
  • Check Firefox is installed
  • Wait up to 35 seconds for the page to load
  • Check geckodriver: chmod +x geckodriver

No microphone input:

# List devices
python3 -c "import speech_recognition as sr; print(sr.Microphone.list_microphone_names())"

# Set device in zoom_interface.py
self.mic_device = 1  # your device index

OpenAI API errors:

# Verify key
echo $OPENAI_API_KEY

# Or check .env file
grep OPENAI_API_KEY .env

ROS topics not working:

# Check topics
rostopic list

# Check connections
rostopic info /user_instruction

# Test manually
rostopic pub /user_instruction std_msgs/String "test"

ModuleNotFoundError:

pip install -r requirements.txt

# ROS packages (if missing)
sudo apt-get install ros-noetic-cv-bridge ros-noetic-geometry-msgs

License

MIT License - see LICENSE file for details

Citation

If you use this system in your research, please cite:

@inproceedings{xiao2025robi,
  title={Robi Butler: Multimodal Remote Interaction with a Household Robot Assistant},
  author={Xiao, Anxing and Janaka, Nuwan and Hu, Tianrun and Gupta, Anshul and Li, Kaixin and Yu, Cunjun and Hsu, David},
  booktitle={2025 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={4337--4344},
  year={2025},
  organization={IEEE}
}
