
Sonance

A real-time, voice-powered AI DJ agent that listens, thinks, and plays the perfect track — instantly.


Table of Contents

  • Overview
  • Key Features
  • Architecture
  • Tech Stack
  • Getting Started
  • Data Ingestion
  • Running Sonance
  • DJ Personas
  • Project Structure
  • License

Overview

Sonance is a real-time AI DJ that you talk to naturally. Say "play something chill for a rainy evening" and Sonance translates that mood into precise acoustic parameters, runs a hybrid semantic + acoustic search across a 10k-track vector database, and starts playing β€” all within sub-second latency.

It combines live WebRTC voice capture, ultra-fast LLM reasoning (Groq), hybrid vector search (Superlinked + Qdrant), and DJ-quality text-to-speech (ElevenLabs) into a seamless, conversational music experience.


Key Features

  • πŸŽ™οΈ Real-time voice interaction β€” zero-latency WebRTC audio pipeline with VAD and Deepgram Nova-2 STT
  • 🧠 Mood-aware LLM reasoning β€” a LangChain ReAct agent powered by Groq maps abstract descriptions to acoustic dimensions
  • πŸ” Hybrid vector search β€” Superlinked fuses semantic lyric embeddings with numeric acoustic spaces (valence, energy, tempo, instrumentalness)
  • 🎡 Seamless music playback β€” YouTube IFrame API plays tracks in-browser with no Spotify auth required
  • πŸ“‹ 5-track smart queue β€” auto-advances through a curated queue with skip-forward, skip-back, and history
  • 🎭 Multiple DJ personas β€” choose between DJ, Tara, Leo, Zoe, and Mia, each with a distinct style and ElevenLabs voice
  • πŸ“Š Full observability β€” every interaction traced end-to-end with Opik
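The 5-track queue behaviour can be sketched in a few lines. This is an illustrative Python model only — the real logic lives in the frontend's `useYouTubePlayer.ts` hook, and the class name and method signatures here are hypothetical:

```python
class TrackQueue:
    """Illustrative sketch of the 5-track smart queue with auto-advance,
    skip-forward, skip-back, and history. Not the actual frontend code."""

    def __init__(self, tracks, size=5):
        self.queue = list(tracks[:size])  # upcoming tracks
        self.history = []                 # already-played tracks
        self.current = None

    def advance(self):
        """Auto-advance (or skip-forward): archive the current track and
        move to the next one in the queue."""
        if self.current is not None:
            self.history.append(self.current)
        self.current = self.queue.pop(0) if self.queue else None
        return self.current

    def skip_back(self):
        """Return to the most recently played track, pushing the current
        one back onto the front of the queue."""
        if self.history:
            if self.current is not None:
                self.queue.insert(0, self.current)
            self.current = self.history.pop()
        return self.current
```

The same `advance` method serves both the "track ended" auto-advance and an explicit skip-forward, which keeps the state machine small.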

Architecture

System Overview

The full pipeline spans four distinct layers: the browser captures voice and pipes it over WebRTC; the backend transcribes → reasons → searches → speaks; and a playback command is sent back to the frontend, which starts the music — all in under a second.

Sonance Sub-Second AI Voice Architecture


Sensory Layer β€” Voice Input

The browser captures the microphone stream via the WebRTC API. FastRTC handles audio capture and VAD (Voice Activity Detection). When the user finishes speaking, the audio buffer is sent to Deepgram Nova-2 for transcription in under 300 ms.
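Conceptually, VAD decides that an utterance has ended once the audio energy stays below a threshold for a run of consecutive frames. FastRTC ships its own VAD; the sketch below is a generic illustration of the idea, with made-up threshold and frame-count values:

```python
def frame_energy(samples):
    """Mean squared amplitude of one audio frame (samples in [-1, 1])."""
    return sum(s * s for s in samples) / len(samples)

def speech_ended(frame_energies, threshold=0.01, silence_frames=8):
    """Return True once the trailing `silence_frames` frames are all below
    the energy threshold, i.e. the speaker has gone quiet. Illustrative
    only; FastRTC's real VAD is more sophisticated."""
    if len(frame_energies) < silence_frames:
        return False
    return all(e < threshold for e in frame_energies[-silence_frames:])
```

Once `speech_ended` fires, the buffered audio is flushed to the STT service in one shot, which is what keeps transcription latency under a few hundred milliseconds.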

Sensory Layer β€” VAD and STT flow


Cognitive Layer β€” Agent Reasoning

The transcript is passed to a LangChain ReAct agent running on Groq (Llama 3 / Mixtral). The agent decides whether to call music_discovery_tool or pause_music_tool, then generates a natural-language DJ response in the voice of the selected persona.
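To make the mood → acoustic-parameter step concrete, here is a toy keyword-based stand-in for what the LLM does. The preset values and the `mood_to_acoustics` function are entirely hypothetical — in Sonance the agent derives these dimensions through LLM reasoning, not a lookup table:

```python
# Hypothetical presets; the real values come from the LLM's reasoning.
MOOD_PRESETS = {
    "chill":    {"valence": 0.5, "energy": 0.25, "tempo": 90,  "instrumentalness": 0.6},
    "hype":     {"valence": 0.8, "energy": 0.90, "tempo": 128, "instrumentalness": 0.1},
    "romantic": {"valence": 0.7, "energy": 0.35, "tempo": 95,  "instrumentalness": 0.3},
}

def mood_to_acoustics(description: str) -> dict:
    """Crude keyword match standing in for the agent's mood analysis."""
    for mood, params in MOOD_PRESETS.items():
        if mood in description.lower():
            return params
    # Neutral defaults when no mood keyword matches.
    return {"valence": 0.5, "energy": 0.5, "tempo": 110, "instrumentalness": 0.3}
```

The output of this step — a handful of numeric acoustic targets — is exactly what the retrieval layer consumes.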

Cognitive Layer β€” ReAct agent tool selection


Retrieval Layer β€” Hybrid Music Search

music_discovery_tool calls Superlinked with a natural language query and optional acoustic targets. Superlinked encodes both a semantic lyric vector and numeric acoustic vectors, fuses them in Qdrant, filters out already-played tracks, and returns the top-K results.
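The fusion step can be sketched as a weighted combination of semantic similarity and closeness to the acoustic targets, with played tracks filtered out. The weights, field names, and scoring formula below are illustrative assumptions — the real fusion happens inside Superlinked/Qdrant:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(tracks, query_vec, targets, played, k=5, w_sem=0.6, w_ac=0.4):
    """Score each unplayed track by fusing lyric-vector similarity with
    closeness to the acoustic targets, then return the top-k track ids.
    Weights and the linear acoustic-distance score are illustrative."""
    scored = []
    for t in tracks:
        if t["id"] in played:
            continue  # skip already-played tracks
        sem = cosine(t["lyric_vec"], query_vec)
        ac = 1.0 - sum(abs(t[f] - targets[f]) for f in targets) / len(targets)
        scored.append((w_sem * sem + w_ac * ac, t["id"]))
    scored.sort(reverse=True)
    return [tid for _, tid in scored[:k]]
```

Filtering the played set before scoring (rather than after top-k) guarantees the DJ never repeats a track even when the best matches have already been heard.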

Retrieval Layer β€” Superlinked hybrid search flow


Executive Layer β€” Playback & Response

The LLM's final response triggers two parallel streams: ElevenLabs TTS synthesises the DJ voice and streams audio chunks back over the WebRTC audio track; simultaneously, a JSON control_command is sent via the FastRTC data channel, triggering the hidden YouTube IFrame player in the browser. When a track ends, the queue auto-advances.

Executive Layer β€” TTS, control command, and auto-queue


Tech Stack

| Component | Technology | Role |
|---|---|---|
| Frontend | React 18 + Vite + TypeScript | 3-column DJ dashboard |
| WebRTC Transport | FastRTC | Sub-second audio streaming (browser ↔ backend) |
| Speech-to-Text | Deepgram Nova-2 (via Groq) | Transcription < 300 ms |
| LLM Reasoning | Groq · Llama 3 / Mixtral | ReAct agent: mood → acoustic parameters |
| Vector Search | Superlinked + Qdrant Cloud | Hybrid semantic + acoustic music retrieval |
| Text-to-Speech | ElevenLabs / Orpheus | DJ persona voices, streamed as WebRTC audio |
| Music Playback | YouTube IFrame API | In-browser audio; no Spotify auth required |
| Lyrics Data | LyricsGenius | Semantic lyric embeddings for search |
| Observability | Opik | Full trace of every agent interaction |
| API Framework | FastAPI | REST + WebRTC signalling server |

Getting Started

Prerequisites

  • Python 3.11+ and uv package manager
  • Node.js 18+ and npm
  • A Qdrant Cloud cluster (free tier is enough to start)
  • A Groq API key
  • A Genius API access token (for lyrics ingestion)
  • An ElevenLabs API key (for DJ voice)

Installation

# 1. Clone the repository
git clone https://github.com/your-org/sonance
cd sonance

# 2. Install backend dependencies
uv sync

# 3. Copy and configure environment variables
cp .env.example .env
# → Open .env and fill in your API keys (see Configuration below)

# 4. Install frontend dependencies
cd frontend && npm install

Configuration

Create a .env file at the project root based on .env.example:

| Variable | Required | Description |
|---|---|---|
| GROQ__API_KEY | ✅ | Groq API key for LLM reasoning and STT |
| QDRANT__CLUSTER_URL | ✅ | Your Qdrant Cloud cluster URL |
| QDRANT__API_KEY | ✅ | Your Qdrant Cloud API key |
| GENIUS__ACCESS_TOKEN | ✅ | Genius API token for lyrics ingestion |
| ELEVENLABS__API_KEY | ✅ | ElevenLabs API key for DJ voice TTS |
| SPOTIFY__CLIENT_ID | ⚪ | Spotify app client ID (metadata only) |
| SPOTIFY__CLIENT_SECRET | ⚪ | Spotify app client secret (metadata only) |
| HF_HOME | ⚪ | HuggingFace cache dir; point to a drive with ≥ 10 GB free |

Note

Spotify credentials are only needed during data ingestion to fetch track metadata and audio features. They are not needed to run the agent or play music.
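The double-underscore variable names (`GROQ__API_KEY`, `QDRANT__CLUSTER_URL`) suggest nested settings groups, which is how pydantic-settings' `env_nested_delimiter` works — `config.py` uses pydantic-settings, though the exact model layout is an assumption here. A minimal stdlib sketch of that parsing:

```python
def load_nested_settings(environ: dict, delimiter: str = "__") -> dict:
    """Group FOO__BAR=x style env vars into {'foo': {'bar': 'x'}},
    mirroring pydantic-settings' env_nested_delimiter behaviour.
    Illustrative only; the real config lives in config.py."""
    settings: dict = {}
    for key, value in environ.items():
        if delimiter not in key:
            continue  # flat vars like HF_HOME are handled separately
        section, field = key.lower().split(delimiter, 1)
        settings.setdefault(section, {})[field] = value
    return settings
```

So `GROQ__API_KEY=gsk_...` would populate a `groq` settings group with an `api_key` field, keeping each provider's credentials together.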


Data Ingestion

Before running the agent for the first time, populate your Qdrant cluster with the music index. The ingestion pipeline fetches tracks from curated Spotify playlists, downloads lyrics via Genius, computes Superlinked embeddings, and upserts everything into Qdrant.

Important

Run ingestion once before starting the agent. Start with --limit 5 to verify your API keys and Qdrant connectivity before a full run.

# Quick smoke-test: 5 tracks per playlist
uv run python src/sonance_agents/infrastructure/ingest.py --limit 5

# Full ingestion: ~100 tracks per playlist across 4 playlists (~400 tracks total)
uv run python src/sonance_agents/infrastructure/ingest.py --limit 100

What the pipeline does:

  1. Authenticates with Spotify via the Client Credentials flow (no browser popup)
  2. Fetches track metadata and audio features in bulk
  3. Downloads lyrics from Genius for each track
  4. Saves the raw dataset to data/seed_tracks.csv
  5. Upserts all tracks into Qdrant via Superlinked (semantic + acoustic vectors)
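The five steps above reduce to a simple orchestration loop. This sketch uses stub callables in place of the real Spotify/Genius clients and the Superlinked upsert, and every name in it is hypothetical rather than the actual `ingest.py` API:

```python
def ingest(playlists, fetch_tracks, fetch_lyrics, upsert, limit=5):
    """Fetch tracks per playlist, enrich each with lyrics, then upsert
    the whole batch. `fetch_tracks`, `fetch_lyrics`, and `upsert` stand
    in for the real Spotify, Genius, and Superlinked→Qdrant clients."""
    rows = []
    for playlist in playlists:
        for track in fetch_tracks(playlist)[:limit]:  # --limit flag
            track["lyrics"] = fetch_lyrics(track["title"], track["artist"])
            rows.append(track)
    upsert(rows)  # in the real pipeline: Superlinked embeddings → Qdrant
    return rows
```

Applying `--limit` at the per-playlist fetch (rather than globally) is what makes the 5-track smoke test exercise every playlist and every API key before a full run.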

Note

The first run downloads sentence-transformers/all-MiniLM-L6-v2 (~90 MB). Set HF_HOME to a drive with sufficient free space.


Running Sonance

Open two terminals from the project root:

Terminal 1 β€” Backend

uv run fastapi dev src/sonance_agents/api/main.py

The API is available at http://localhost:8000. The WebRTC signalling endpoint lives at http://localhost:8000/webrtc/offer.

Terminal 2 β€” Frontend

cd frontend
npm run dev

Open http://localhost:5173 in your browser. Click the voice button, start talking, and let your DJ take over.

DJ Personas

Select your DJ from the animated avatar picker in the left panel. Each persona has a distinct personality and voice:

| Avatar | Personality | Voice Style |
|---|---|---|
| DJ | Charismatic all-rounder | Energetic, confident |
| Tara | Warm & soulful | Smooth, expressive |
| Leo | Hype & energetic | Punchy, upbeat |
| Zoe | Chill & lo-fi | Laid-back, mellow |
| Mia | Romantic & dreamy | Soft, atmospheric |

Orb State Colours

The 3D orb in the center panel reflects the current agent state:

| State | Colour |
|---|---|
| Idle | Indigo / Blue-violet |
| Listening | Cyan / Sky blue |
| Thinking | Amber / Yellow |
| Talking | Purple / Violet |

Project Structure

sonance/
├── frontend/                          ← React + Vite + TypeScript dashboard
│   └── src/
│       ├── App.tsx                    ← 3-column DJ dashboard (chat · orb · now playing)
│       ├── App.css                    ← Futuristic dark theme with ambient blobs
│       ├── hooks/
│       │   ├── useWebRTC.ts           ← WebRTC connection, data channel, mic control
│       │   └── useYouTubePlayer.ts    ← YouTube IFrame player, queue, skip logic
│       └── components/ui/             ← Orb, VoiceButton, LiveWaveform, AnimatedTooltip…
│
├── src/sonance_agents/
│   ├── agent/
│   │   ├── fastrtc_agent.py           ← STT → ReAct Agent → TTS + control command dispatch
│   │   ├── stream.py                  ← VoiceAgentStream & make_control_command helpers
│   │   └── tools/
│   │       └── music_discovery.py     ← MusicDiscoveryTool: hybrid semantic + acoustic search
│   ├── avatars/
│   │   ├── base.py                    ← DJ system prompt template
│   │   └── definitions/               ← YAML persona configs (dj, tara, leo, zoe, mia)
│   ├── infrastructure/
│   │   ├── ingest.py                  ← Data ingestion pipeline (Spotify + Genius → Qdrant)
│   │   └── superlinked_integration/
│   │       ├── schema.py              ← Track schema
│   │       ├── index.py               ← Superlinked index definition
│   │       ├── query.py               ← Hybrid search query builder
│   │       └── service.py             ← MusicSearchService (query execution)
│   ├── api/
│   │   └── main.py                    ← FastAPI app: WebRTC signalling + REST endpoints
│   └── config.py                      ← Pydantic-settings (all env vars)
│
├── static/
│   ├── Sonance.png                    ← Hero banner
│   └── diagrams/                      ← Architecture diagrams
├── data/                              ← seed_tracks.csv (generated after ingestion)
└── pyproject.toml

License

This project is licensed under the MIT License — see the LICENSE file for details.
