53 commits
dc1318d
Fix Agent UI docs: correct CLI commands, API method names, and missin…
kovtcharov Mar 18, 2026
d826a93
Agent UI polish: refined typography, glassmorphism styling, and eval …
kovtcharov Mar 18, 2026
0788c8b
Agent UI: terminal-style animations with pixelated red cursor
kovtcharov Mar 18, 2026
eeb0283
Fix Black formatting in file_tools.py
kovtcharov Mar 18, 2026
e3f7396
Agent UI polish: streaming transitions, design consistency, and final…
kovtcharov Mar 19, 2026
1306665
Fix broken hardware query: add Windows/Linux system info commands to …
kovtcharov Mar 19, 2026
8fd61d0
Advanced UI animations: modal exits, delete transitions, and session …
kovtcharov Mar 19, 2026
4062772
Update default model to Qwen3.5-35B-A3B and improve network query hints
kovtcharov Mar 19, 2026
42db71e
Fix session list disappearing from sidebar during backend glitches
kovtcharov Mar 19, 2026
0243069
Fix false positive LLM health check banner under heavy load
kovtcharov Mar 19, 2026
b203fa4
Agent UI: thinking display, Lemonade stats, model override, security …
kovtcharov Mar 19, 2026
e17bf72
Fix thinking display: single cursor, no flash, smoother animations
kovtcharov Mar 19, 2026
c994caf
Remove dead .msg-entering CSS, fix thinking indicator light theme
kovtcharov Mar 19, 2026
66c6628
Fix unit test: update default model assertion to Qwen3.5-35B-A3B-GGUF
kovtcharov Mar 19, 2026
94d6fda
Fix SSE handler tests: start_progress emits status, not thinking
kovtcharov Mar 19, 2026
37f9672
Stable thinking toolbar: no visual changes on state transitions
kovtcharov Mar 19, 2026
d38f025
feat: Agent UI eval benchmark framework with `gaia eval agent` command
kovtcharov Mar 20, 2026
bb5f679
fix: Agent UI capabilities, streaming cleanup, MCP management, and ev…
kovtcharov Mar 21, 2026
c6d3d0c
fix: gate SD tool registration on enable_sd_tools config flag
kovtcharov Mar 21, 2026
46a93cb
docs: update eval monitor log with SD regression fix and final valida…
kovtcharov Mar 21, 2026
b60d06a
fix: prevent index_document→list_indexed_documents→memory-answer hall…
kovtcharov Mar 21, 2026
41e86c7
docs: update monitor log with large_document fix (Fix Round 4)
kovtcharov Mar 21, 2026
eae1919
fix: add negative-assertion guard and multi-doc topic-switch rule
kovtcharov Mar 22, 2026
85aeec9
fix: update sd_graceful_degradation scenario for opt-in SD tools
kovtcharov Mar 22, 2026
632d5fe
fix: add when-uncertain fallback and conversation context recall rules
kovtcharov Mar 22, 2026
db9f578
fix: prevent planning-text responses before tool calls
kovtcharov Mar 22, 2026
9a83180
docs: final validation — 34/34 pass rate confirmed
kovtcharov Mar 22, 2026
ed5c72e
fix: resolve CI failures — lint, unit tests, and SDK memory test
kovtcharov Mar 22, 2026
3d4d5c7
fix: black-format runner.py subprocess.run call
kovtcharov Mar 22, 2026
86c1575
fix: improve agent reliability and eval benchmark quality — 34/34 PASS
kovtcharov Mar 23, 2026
f36f329
Restore changes reverted by accidental PR #566 merge
itomek Mar 23, 2026
9190980
Fix security regressions and add shell command guardrail tests
itomek Mar 23, 2026
682b4a6
Fix missing Always Allow checkbox, }}} streaming artifact, and JSON f…
itomek Mar 23, 2026
9413e85
Fix } streaming artifact: extend regex to match thought+tool+tool_arg…
itomek Mar 23, 2026
38efada
feat: upgrade default model to Qwen3.5-35B-A3B-GGUF and restore rever…
kovtcharov Mar 23, 2026
ceec4b5
merge: integrate tomas/restore-reverted-prs-564-565-568
kovtcharov Mar 23, 2026
9e2cfe7
merge: integrate kalin/fix-agent-ui-docs animations and UI polish
kovtcharov Mar 23, 2026
f0b5c78
fix: eval benchmark — category field, scorecard accuracy, audit corre…
kovtcharov Mar 23, 2026
be6ff06
feat: add Open Folder button to Document Library file rows
kovtcharov Mar 23, 2026
a254421
feat: add LLM context size and model download validation to Agent UI
kovtcharov Mar 23, 2026
db04329
merge: pull in latest main (b7a97e6)
kovtcharov Mar 23, 2026
d604553
fix: detect loaded model when Lemonade omits root-level model_loaded …
kovtcharov Mar 23, 2026
8adacbc
feat: add hover performance stats to Agent UI messages
kovtcharov Mar 23, 2026
61d018d
feat: add Agent UI eval benchmark with RAG quality scenarios
kovtcharov Mar 23, 2026
2085226
perf: optimize Agent UI chat response latency
kovtcharov Mar 23, 2026
14673bb
feat: auto-index in query_specific_file, UI polish, eval doc and form…
kovtcharov Mar 23, 2026
2e955b8
feat: warn when unexpected LLM model is loaded in Agent UI
kovtcharov Mar 23, 2026
469a29c
feat: detect and warn when wrong Lemonade model is loaded
kovtcharov Mar 23, 2026
1b01f22
fix: skip real_world scenarios in manifest cross-reference tests when…
kovtcharov Mar 23, 2026
3639b79
fix: apply Black formatting to test_eval.py and test_server.py
kovtcharov Mar 23, 2026
2f211ac
perf: pre-load heavy modules at server startup
kovtcharov Mar 23, 2026
d953ef6
fix: fix lint errors in _preload_modules (isort order, noqa placement…
kovtcharov Mar 23, 2026
69157cf
fix: fix lint errors in _preload_modules (isort order, noqa placement…
kovtcharov Mar 23, 2026
29 changes: 19 additions & 10 deletions .github/workflows/test_eval.yml
@@ -13,6 +13,9 @@ on:
 branches: ["main"]
 paths:
 - 'src/gaia/eval/**'
+- 'eval/scenarios/**'
+- 'eval/corpus/**'
+- 'eval/prompts/**'
 - 'tests/test_eval.py'
 - 'setup.py'
 - '.github/workflows/test_eval.yml'
@@ -21,6 +24,9 @@ on:
 types: [opened, synchronize, reopened, ready_for_review]
 paths:
 - 'src/gaia/eval/**'
+- 'eval/scenarios/**'
+- 'eval/corpus/**'
+- 'eval/prompts/**'
 - 'tests/test_eval.py'
 - 'setup.py'
 - '.github/workflows/test_eval.yml'
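
Review note: the `paths` filters in these two hunks decide which changed files trigger the eval workflow. The sketch below is a simplified approximation of that matching, not GitHub's actual glob implementation: it handles only literal paths and a trailing `/**`, which is all these patterns use. The names `EVAL_PATHS` and `matches_path_filter` are hypothetical.

```python
# Path patterns copied from the workflow's `paths` lists after this change.
EVAL_PATHS = [
    "src/gaia/eval/**",
    "eval/scenarios/**",
    "eval/corpus/**",
    "eval/prompts/**",
    "tests/test_eval.py",
    "setup.py",
    ".github/workflows/test_eval.yml",
]


def matches_path_filter(changed_file, patterns=EVAL_PATHS):
    """Return True if `changed_file` matches any pattern.

    Simplified assumption: a pattern is either a literal path or a
    directory followed by '/**' (anything underneath that directory).
    """
    for pattern in patterns:
        if pattern.endswith("/**"):
            # Strip the two asterisks, keep the trailing slash as a prefix.
            if changed_file.startswith(pattern[:-2]):
                return True
        elif changed_file == pattern:
            return True
    return False
```

With the three new entries, an edit to a scenario, corpus, or prompt file now counts as a trigger, where before only changes under `src/gaia/eval/` and the listed individual files did.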
@@ -79,6 +85,7 @@ jobs:
 node-version: '18'

 - name: Test webapp functionality
+shell: pwsh
 run: |
 cd src/gaia/eval/webapp
 # Install dependencies
@@ -88,19 +95,21 @@ jobs:
 # Test that server can start (Windows-compatible version)
 $env:PORT = 3456 # Use non-default port to avoid conflicts
 $process = Start-Process node -ArgumentList "server.js" -PassThru -ErrorAction Stop
-Start-Sleep -Seconds 3
-if ($process.HasExited) {
-Write-Error "Server failed to start or crashed immediately"
-exit 1
-}
-# Try to connect to the server
 try {
+Start-Sleep -Seconds 3
+if ($process.HasExited) {
+Write-Error "Server failed to start or crashed immediately"
+exit 1
+}
+# Try to connect to the server
 $response = Invoke-WebRequest -Uri "http://localhost:3456" -TimeoutSec 5 -UseBasicParsing
 Write-Output "Server responded with status: $($response.StatusCode)"
+Write-Output "Webapp server test passed"
 } catch {
-Write-Error "Server did not respond to HTTP request"
-Stop-Process -Id $process.Id -Force -ErrorAction SilentlyContinue
+Write-Error "Server did not respond to HTTP request: $_"
 exit 1
+} finally {
+if (-not $process.HasExited) {
+Stop-Process -Id $process.Id -Force -ErrorAction SilentlyContinue
+}
 }
-Stop-Process -Id $process.Id -Force
-Write-Output "Webapp server test passed"
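
Review note: the reworked step moves the startup wait and the early-exit check inside `try`, and moves process cleanup into `finally`, so the server is stopped on every path. A minimal Python sketch of the same pattern follows, for readers who want to try it outside PowerShell. It is an illustration only, not part of the workflow; `probe_server` and its parameters are hypothetical names.

```python
import subprocess
import time
import urllib.request


def probe_server(cmd, url, startup_wait=3.0, timeout=5.0):
    """Start `cmd`, wait briefly, then return the HTTP status of `url`.

    Raises RuntimeError if the process exits during the startup wait.
    The child process is always stopped, mirroring the `finally` block.
    """
    process = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        time.sleep(startup_wait)
        # poll() returns None while the process is still running.
        if process.poll() is not None:
            raise RuntimeError("Server failed to start or crashed immediately")
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    finally:
        # Clean up only if the process is still alive.
        if process.poll() is None:
            process.terminate()
            process.wait(timeout=10)
```

A call such as `probe_server(["node", "server.js"], "http://localhost:3456")` would correspond to the workflow step, assuming `PORT` is already set in the environment as the step does with `$env:PORT`.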
5 changes: 4 additions & 1 deletion .gitignore
@@ -227,4 +227,7 @@ docs/playbooks/sd-agent/index-backup.mdx
 .claude/settings.local.json

 # Custom util scripts
-util/custom/*
+util/custom/*
+
+# Real-world eval corpus documents (sourced from public web, not committed)
+eval/corpus/real_world/
20 changes: 13 additions & 7 deletions CLAUDE.md
@@ -254,18 +254,18 @@ gaia/

 | Agent | Location | Description | Default Model |
 |-------|----------|-------------|---------------|
-| **ChatAgent** | `agents/chat/agent.py` | Document Q&A with RAG | Qwen3-Coder-30B |
-| **CodeAgent** | `agents/code/agent.py` | Code generation with orchestration | Qwen3-Coder-30B |
-| **JiraAgent** | `agents/jira/agent.py` | Jira issue management | Qwen3-Coder-30B |
-| **BlenderAgent** | `agents/blender/agent.py` | 3D scene automation | Qwen3-Coder-30B |
-| **DockerAgent** | `agents/docker/agent.py` | Container management | Qwen3-Coder-30B |
+| **ChatAgent** | `agents/chat/agent.py` | Document Q&A with RAG | Qwen3.5-35B |
+| **CodeAgent** | `agents/code/agent.py` | Code generation with orchestration | Qwen3.5-35B |
+| **JiraAgent** | `agents/jira/agent.py` | Jira issue management | Qwen3.5-35B |
+| **BlenderAgent** | `agents/blender/agent.py` | 3D scene automation | Qwen3.5-35B |
+| **DockerAgent** | `agents/docker/agent.py` | Container management | Qwen3.5-35B |
 | **MedicalIntakeAgent** | `agents/emr/agent.py` | Medical form processing | Qwen3-VL-4B (VLM) |
-| **RoutingAgent** | `agents/routing/agent.py` | Intelligent agent selection | Qwen3-Coder-30B |
+| **RoutingAgent** | `agents/routing/agent.py` | Intelligent agent selection | Qwen3.5-35B |
 | **SDAgent** | `agents/sd/agent.py` | Stable Diffusion image generation | SDXL-Turbo |

 ### Default Models
 - General tasks: `Qwen3-0.6B-GGUF`
-- Code/Agents: `Qwen3-Coder-30B-A3B-Instruct-GGUF`
+- Code/Agents: `Qwen3.5-35B-A3B-GGUF`
 - Vision tasks: `Qwen3-VL-4B-Instruct-GGUF`

 ## CLI Commands
@@ -530,3 +530,9 @@ Specialized agents are available in `.claude/agents/` for specific tasks (23 age
 - **ui-ux-designer** (opus) - User-centered design, accessibility

 When invoking a proactive agent from `.claude/agents/`, indicate which agent you are using in your response.
+
+## Learned Skills
+
+**Read these before starting related tasks:**
+
+- `.claude/skills/gaia-eval-benchmark.md` - How to run, audit, and trust/distrust the GAIA Agent UI eval benchmark; covers RAG cache integrity, response rendering bugs, eval judge leniency, and MCP session inspection (tags: eval, rag, mcp, gaia-agent-ui, debugging, hallucination, ci-cd, testing)
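
Review note: the CLAUDE.md hunk above changes the Code/Agents default to `Qwen3.5-35B-A3B-GGUF`, and several commits in this PR mention warning when an unexpected model is loaded. The sketch below shows one way such a check could look. It is an assumption for illustration, not GAIA's implementation: the dictionary keys, `DEFAULT_MODELS`, and `model_mismatch_warning` are hypothetical; only the model names come from the diff.

```python
# Default models as listed in CLAUDE.md after this change.
# The keys are hypothetical labels chosen for this sketch.
DEFAULT_MODELS = {
    "general": "Qwen3-0.6B-GGUF",
    "agents": "Qwen3.5-35B-A3B-GGUF",
    "vision": "Qwen3-VL-4B-Instruct-GGUF",
}


def model_mismatch_warning(task, loaded_model):
    """Return a warning string if `loaded_model` is not the default for `task`.

    Returns None when the loaded model matches the expected default.
    """
    expected = DEFAULT_MODELS[task]
    if loaded_model == expected:
        return None
    return f"Expected {expected} for {task} tasks, but {loaded_model} is loaded"
```

For example, `model_mismatch_warning("agents", "Qwen3-Coder-30B-A3B-Instruct-GGUF")` returns a warning naming both models, which is the situation a user would hit if the previous default were still loaded.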
3 changes: 2 additions & 1 deletion docs/docs.json
@@ -300,7 +300,8 @@
 "group": "Evaluation Framework",
 "pages": [
 "reference/eval",
-"reference/eval/fix-code-testbench"
+"reference/eval/fix-code-testbench",
+"eval"
 ]
 },
 "reference/dependency-management",