Skill Issue is an autonomous learning system that tests real competence by periodically sending short challenges via push notifications. Instead of tracking study time or content consumption, it measures whether you can actually use a skill when tested.
The system continuously improves itself. Three autonomous agents handle scheduling, challenge generation, and skill progression. When users rate questions poorly, Opik automatically optimizes the prompts used to generate those questions, analyzing what went wrong and improving them through reasoning-based refinement. The improved prompts are then put into use immediately. Every LLM call and agent decision is traced in Opik for full observability.
- Competence over content - Measure what you can do, not what you've read
- Measurement over teaching - Test real ability in the moment
- Short interruptions, not study sessions - Quick challenges that fit into your day
- Autonomous adaptation - System learns from your performance automatically
- Self-improving prompts - Opik automatically optimizes question generation when quality drops
- Strong observability and evaluation - Every LLM call and agent decision is traced and scored via Opik
| Layer | Technology |
|---|---|
| Monorepo | PNPM workspaces |
| Language | TypeScript (backend & frontends), Python (prompt optimization) |
| Backend | Express, node-cron |
| Mobile | React Native (Expo 54), Zustand, React Query |
| Web | React 19, Vite 5, React Router |
| Auth | Clerk |
| Database | Supabase (PostgreSQL) |
| AI - Challenge Generation | Anthropic Claude SDK (Claude Haiku 4.5) |
| AI - Dataset Generation | OpenAI GPT-4o |
| AI - Prompt Optimization | Opik Optimizer (HRPO, EvolutionaryOptimizer, MetaPrompt) |
| Observability | Opik - tracing, feedback scores, prompt versioning, datasets, experiments |
| Push Notifications | Expo Server SDK |
| Containerization | Docker (multi-stage Node 18 Alpine) |
/
  /packages
    /backend - Express API server with autonomous agents, LLM-as-Judge, Opik tracing
    /mobile - React Native mobile app (Expo 54)
    /shared - Common TypeScript types, schemas, API clients
  /optimization - Python prompt optimization system (Opik Optimizer)
The backend runs three autonomous agents orchestrated by a cron-based scheduler:
Scheduler Tick (cron: every 30 min)
|
v
Agent 1 - Scheduling Decision
Evaluates all users/skills, decides who needs a challenge
Factors: time since last challenge, quiet hours, accuracy threshold, max daily challenges
|
v
Agent 2 - Challenge Design
Checks question pool for reusable high-quality questions
If none found -> Generates MCQ via Claude -> Validates structure -> LLM-as-Judge quality gate
If validation or judge fails -> regenerate (up to retry limit)
|
v
Agent 3 - Skill State Update
Evaluates answer correctness, adjusts difficulty target (±1), updates streaks
|
v
Push Notification -> User answers -> Agent 3 processes response
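The scheduling factors listed for Agent 1 can be illustrated with a small eligibility check. This is a minimal sketch, not the project's actual agent: the `SkillState` and `SchedulingConfig` shapes, their field names, and any example values are hypothetical, and the accuracy threshold is left out because the flow above does not say how it is applied.

```typescript
// Hypothetical shapes; field names are illustrative assumptions, not the project's schema.
interface SkillState {
  userId: string;
  skillId: string;
  lastChallengeAt: Date | null; // null if the user has never been challenged on this skill
  challengesToday: number;
}

interface SchedulingConfig {
  minIntervalMinutes: number; // minimum gap between two challenges (assumed setting)
  maxDailyChallenges: number; // daily cap per user and skill (assumed setting)
  quietStartHour: number; // start of quiet hours, local hour 0-23 (assumed setting)
  quietEndHour: number; // end of quiet hours, exclusive (assumed setting)
}

// True if `hour` falls inside the quiet window, including windows that wrap past midnight.
function isQuietHour(hour: number, start: number, end: number): boolean {
  if (start === end) return false; // empty window
  return start < end ? hour >= start && hour < end : hour >= start || hour < end;
}

// Decides whether one user/skill pair should receive a challenge on this scheduler tick.
function needsChallenge(
  state: SkillState,
  now: Date,
  localHour: number,
  cfg: SchedulingConfig
): boolean {
  if (isQuietHour(localHour, cfg.quietStartHour, cfg.quietEndHour)) return false;
  if (state.challengesToday >= cfg.maxDailyChallenges) return false;
  if (state.lastChallengeAt === null) return true;
  const elapsedMinutes = (now.getTime() - state.lastChallengeAt.getTime()) / 60000;
  return elapsedMinutes >= cfg.minIntervalMinutes;
}
```

With a 120-minute interval, a cap of 3 per day, and quiet hours from 22:00 to 08:00 (all made-up values), a user last challenged three hours ago would be eligible at noon but not at 23:00.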
Check Question Pool
  -> Found high-quality match?
    -> Reuse existing question (saves API calls)
  -> Not found?
    -> Generate new challenge (Claude Haiku 4.5)
    -> Structural Validation (4 answer options, question length, JSON format)
    -> LLM-as-Judge Evaluation (5 quality dimensions, weighted composite score)
      -> Pass (>= 0.7 composite, no dimension below 0.4 veto threshold)
        -> Save to DB & question pool -> Send push notification
      -> Fail
        -> Regenerate (flow repeats, max retries configurable)
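The structural validation step in this flow (JSON format, 4 answer options, question length) might look roughly like the following. It is a sketch under assumptions: the `Challenge` shape, the `correctIndex` field, and the length bounds are hypothetical, and the project may well use its Zod schemas for this instead of hand-written checks.

```typescript
// Hypothetical challenge shape; the real schema is not shown in this README.
interface Challenge {
  question: string;
  options: string[];
  correctIndex: number;
  explanation: string;
}

// Illustrative length bounds (assumed values).
const MIN_QUESTION_LENGTH = 10;
const MAX_QUESTION_LENGTH = 500;

type ValidationResult =
  | { ok: true; challenge: Challenge }
  | { ok: false; error: string };

// Parses raw model output and checks its structure before it reaches the judge.
function validateChallenge(raw: string): ValidationResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "output is not valid JSON" };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, error: "expected a JSON object" };
  }
  const { question, options, correctIndex, explanation } = data as Record<string, unknown>;
  if (typeof question !== "string") {
    return { ok: false, error: "question must be a string" };
  }
  const length = question.trim().length;
  if (length < MIN_QUESTION_LENGTH || length > MAX_QUESTION_LENGTH) {
    return { ok: false, error: "question length out of bounds" };
  }
  if (!Array.isArray(options) || options.length !== 4) {
    return { ok: false, error: "exactly 4 answer options are required" };
  }
  if (!options.every((o) => typeof o === "string" && o.trim() !== "")) {
    return { ok: false, error: "every option must be a non-empty string" };
  }
  if (
    typeof correctIndex !== "number" ||
    Math.floor(correctIndex) !== correctIndex ||
    correctIndex < 0 ||
    correctIndex > 3
  ) {
    return { ok: false, error: "correctIndex must be an integer from 0 to 3" };
  }
  if (typeof explanation !== "string" || explanation.trim() === "") {
    return { ok: false, error: "explanation must be a non-empty string" };
  }
  return { ok: true, challenge: { question, options: options as string[], correctIndex, explanation } };
}
```

Returning a reason string rather than a bare boolean makes it easier to record why an attempt was rejected before regenerating.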
The question pool significantly reduces LLM API calls by reusing high-quality questions across users. Questions are eligible for reuse if they have an average rating of 2.0+ stars or haven't been rated yet.
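The reuse rule can be written as a small predicate. This is a sketch only: the `PooledQuestion` shape is hypothetical, and the 2.0 threshold mirrors the value stated above.

```typescript
// Hypothetical shape for a pooled question.
interface PooledQuestion {
  id: string;
  ratings: number[]; // 1-5 star user ratings; empty if the question was never rated
}

const QUESTION_POOL_MIN_RATING = 2.0;

// Returns the mean rating, or null when there are no ratings yet.
function averageRating(ratings: number[]): number | null {
  if (ratings.length === 0) return null;
  return ratings.reduce((sum, r) => sum + r, 0) / ratings.length;
}

// A question is reusable if it is unrated or its average meets the minimum.
function isReusable(q: PooledQuestion, minRating: number = QUESTION_POOL_MIN_RATING): boolean {
  const avg = averageRating(q.ratings);
  return avg === null || avg >= minRating;
}
```

Treating unrated questions as eligible lets new questions collect ratings in the first place; once their average falls below the minimum they drop out of rotation.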
Every generated challenge passes through an LLM-as-Judge quality gate before being stored. The judge evaluates 5 weighted dimensions:
| Dimension | Weight | What it measures |
|---|---|---|
| Clarity | 20% | Is the question unambiguous? |
| Difficulty Alignment | 25% | Does complexity match the target difficulty (1-10)? |
| Distractor Quality | 20% | Are wrong options plausible but clearly incorrect? |
| Educational Value | 15% | Does the explanation effectively teach WHY? |
| Skill Relevance | 20% | Does the question genuinely test the stated skill? |
Quality gate: Composite score >= 0.7 to pass. Any single dimension below 0.4 triggers a veto regardless of composite score.
Each evaluation is recorded as feedback scores on the Opik trace, so you can see exactly how every challenge scored and why it passed or failed.
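The gate arithmetic follows directly from the weights and thresholds above. The sketch below assumes each dimension is scored on a 0 to 1 scale (implied by the 0.7 and 0.4 thresholds); the type and function names are hypothetical.

```typescript
type Dimension =
  | "clarity"
  | "difficultyAlignment"
  | "distractorQuality"
  | "educationalValue"
  | "skillRelevance";

type JudgeScores = Record<Dimension, number>; // each score assumed to be in [0, 1]

// Weights from the table above; they sum to 1.
const WEIGHTS: JudgeScores = {
  clarity: 0.2,
  difficultyAlignment: 0.25,
  distractorQuality: 0.2,
  educationalValue: 0.15,
  skillRelevance: 0.2,
};

const PASS_THRESHOLD = 0.7;
const VETO_THRESHOLD = 0.4;

interface GateResult {
  composite: number;
  vetoedBy: Dimension[]; // dimensions that scored below the veto threshold
  passed: boolean;
}

// Computes the weighted composite and applies the pass and veto rules.
function evaluateQualityGate(scores: JudgeScores): GateResult {
  const dims = Object.keys(WEIGHTS) as Dimension[];
  const composite = dims.reduce((sum, d) => sum + WEIGHTS[d] * scores[d], 0);
  const vetoedBy = dims.filter((d) => scores[d] < VETO_THRESHOLD);
  return { composite, vetoedBy, passed: composite >= PASS_THRESHOLD && vetoedBy.length === 0 };
}
```

For example, a challenge scoring 0.3 on clarity and 1.0 on everything else has a composite of 0.2 × 0.3 + 0.8 = 0.86, which clears the 0.7 bar but is still rejected by the clarity veto.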
Opik is deeply integrated throughout the system for observability, evaluation, prompt versioning, and optimization.
Every operation is traced with hierarchical spans:
- `challenge_generation` - Contains all challenge generation LLM calls and LLM-as-Judge evaluation calls as nested spans. When a challenge fails the judge and gets regenerated, you can see each attempt as a separate span within the trace. Each `llm_judge_evaluation` span has a feedback tab showing per-dimension scores, and the parent `challenge_generation` trace shows the scores of the winning challenge.
- `scheduling_tick` - Root trace for each scheduler run, linking to individual challenge traces
- `scheduling_decision` - Agent 1's decision about which users to challenge
- `skill_state_update` - Agent 3's difficulty adjustments after a user answers
- `answer_submission` - User answer processing
- `dataset_generation_examples` - Dataset creation for prompt optimization
- `prompt_optimization` - Full optimization run with refinement iterations
Feedback scores are attached at both the span and trace level:
- Span-level: Each `llm_judge_evaluation` span shows per-dimension scores (clarity, difficulty alignment, distractor quality, educational value, skill relevance)
- Trace-level: The parent `challenge_generation` trace shows the final winning challenge's scores
- Answer feedback: Correct/incorrect outcomes are scored on answer submission traces
- User ratings: 1-5 star ratings from users feed back into question quality tracking
Prompts are saved to Opik with full version history. When optimization improves a prompt, the new version is stored with metrics showing baseline vs. improved scores. The admin dashboard shows all versions, their performance, and which is currently active.
- Test datasets are generated per skill per difficulty level (e.g., `skill_{id}_level_{7}_examples`)
- Datasets contain 5 high-quality example challenges generated by GPT-4o
- These serve as benchmarks for measuring prompt quality during optimization
- Experiments run prompts against these datasets and measure quality via LLM-as-Judge
- After optimization, experiments compare metrics between baseline and optimized prompts
The system monitors question quality continuously and optimizes prompts automatically:
How it works:
- Every hour, checks average user ratings for each skill+level combination
- When ratings drop below 2.5 stars (with 10+ rated questions), triggers optimization
- Automatically generates benchmark dataset if it doesn't exist (5 examples via GPT-4o)
- Runs Opik's HRPO (HierarchicalReflectiveOptimizer) with 5 refinement cycles
- HRPO analyzes why prompts fail and hierarchically refines them through reasoning
- Measures quality using the same LLM-as-Judge metric (5 dimensions, composite score)
- If improved, automatically deploys the optimized prompt
- Won't re-optimize unless rating changes by 0.1+ points (prevents wasteful reruns)
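The trigger conditions in this list can be condensed into one predicate. This is a sketch under assumptions: the `RatingStats` shape is hypothetical, and the "changes by 0.1+ points" rule is interpreted here as a comparison against the average rating recorded at the previous optimization run.

```typescript
// Hypothetical per skill+level statistics.
interface RatingStats {
  skillId: string;
  level: number;
  averageRating: number; // current average of user star ratings
  ratedQuestionCount: number;
  ratingAtLastOptimization: number | null; // null if never optimized (assumed bookkeeping)
}

const RATING_THRESHOLD = 2.5;
const MIN_RATED_QUESTIONS = 10;
const MIN_RATING_CHANGE = 0.1;

// Decides whether the hourly check should queue an optimization job.
function shouldOptimize(stats: RatingStats): boolean {
  if (stats.ratedQuestionCount < MIN_RATED_QUESTIONS) return false; // not enough signal yet
  if (stats.averageRating >= RATING_THRESHOLD) return false; // quality is acceptable
  if (stats.ratingAtLastOptimization === null) return true; // first run for this skill+level
  const change = Math.abs(stats.averageRating - stats.ratingAtLastOptimization);
  return change >= MIN_RATING_CHANGE; // skip reruns when nothing has moved
}
```

The three constants correspond to the thresholds stated in the list; in the backend they would presumably come from configuration rather than being hard-coded.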
Optimizer types supported:
- HierarchicalReflectiveOptimizer (HRPO) - Default. Uses reflective analysis and structured reasoning
- EvolutionaryOptimizer - Mutates prompts across generations
- MetaPromptOptimizer - Uses meta-prompting to generate candidate prompts
The admin dashboard shows optimization status, pending jobs, low-rated skills, and full version history with scores.
Prompt variants can be registered in Opik and selected via weighted random assignment. Each variant is tagged on the trace so you can compare performance between prompt versions in the Opik dashboard. A/B testing is currently disabled by default (`AB_TEST_CHALLENGE_PROMPT_ENABLED=false`).
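Weighted random assignment can be sketched as follows. The `PromptVariant` shape and the injectable `random` parameter are assumptions made for illustration; how variants are actually stored and tagged is not shown here.

```typescript
// Hypothetical variant record; weights are relative and need not sum to 1.
interface PromptVariant {
  name: string;
  weight: number;
}

// Picks a variant with probability proportional to its weight.
// `random` is injectable so the choice can be made deterministic in tests.
function pickVariant(variants: PromptVariant[], random: () => number = Math.random): PromptVariant {
  const total = variants.reduce((sum, v) => sum + v.weight, 0);
  if (variants.some((v) => v.weight < 0) || total <= 0) {
    throw new Error("weights must be non-negative with a positive total");
  }
  let remaining = random() * total;
  let last: PromptVariant | undefined;
  for (const v of variants) {
    if (v.weight === 0) continue; // zero-weight variants are never selected
    remaining -= v.weight;
    last = v;
    if (remaining < 0) return v;
  }
  if (last === undefined) throw new Error("no selectable variant");
  return last; // only reached through floating-point rounding at the upper edge
}
```

With weights 3 and 1, the first variant would be chosen for about 75% of challenges. The selected variant's `name` is what would be attached to the trace as a tag.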
Express-based API server that orchestrates the core system:
- Three autonomous agents (scheduling, challenge design, skill state)
- Challenge generation via Anthropic Claude SDK (Haiku 4.5)
- LLM-as-Judge quality gate with 5-dimension evaluation
- Full Opik tracing with hierarchical spans, feedback scores, and prompt versioning
- Question pool for reusing high-quality challenges across users
- Automated prompt optimization scheduler (hourly checks)
- Admin dashboard for viewing prompts, optimization status, and triggering manual optimization
- A/B testing support for prompt variants
- Push notification delivery via Expo Server SDK
- Cron-based scheduling with configurable intervals
Key Dependencies: Express, @anthropic-ai/sdk, @supabase/supabase-js, node-cron, expo-server-sdk, openai, opik, zod
React Native mobile app built with Expo 54:
- Cross-platform iOS/Android
- User authentication via Clerk
- Challenge quiz interface with answer submission
- Feedback collection (confidence, difficulty ratings, 1-5 star quality ratings)
- Push notification handling for challenge delivery
- Skill browsing and selection
- Calibration assessments
Key Dependencies: Expo 54, React Native 0.81, @clerk/clerk-expo, expo-notifications, @tanstack/react-query, zustand, expo-router
Shared TypeScript package:
- Common type definitions
- Zod schemas for validation
- API client utilities
Key Dependencies: @anthropic-ai/sdk, zod
Prompt optimization system using Opik's optimizer library:
- Per-skill, per-difficulty-level prompt optimization
- Prompts are "baked" with concrete values before optimization (no template variables)
- Uses LLM-as-Judge as the quality metric for the optimizer
- Supports HRPO (default), EvolutionaryOptimizer, and MetaPromptOptimizer
- Exports optimized prompts with baseline vs. improved scores to JSON and database
- Called by TypeScript backend via subprocess when optimization is triggered
Key Dependencies: opik, opik-optimizer, anthropic, openai, litellm, supabase, python-dotenv
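On the TypeScript side, the subprocess call needs an argument list for the Python script. The helper below is a hypothetical sketch: the script name and flags mirror the manual command in this README, `hrpo` is the only optimizer value shown there, and the 1-10 level range follows the difficulty scale used by the judge.

```typescript
// Hypothetical helper: builds the argument list for the optimizer script.
function buildOptimizerArgs(skillId: string, level: number, optimizer: string = "hrpo"): string[] {
  if (skillId.trim() === "") {
    throw new Error("skillId is required");
  }
  if (Math.floor(level) !== level || level < 1 || level > 10) {
    throw new Error(`level must be an integer from 1 to 10, got ${level}`);
  }
  return [
    "optimize_challenge_prompt.py",
    "--skill", skillId,
    "--level", String(level),
    "--optimizer", optimizer,
  ];
}

// In Node.js the list could then be handed to child_process.spawn, for example:
//   spawn("python", buildOptimizerArgs("<skill_id>", 3), { cwd: "optimization" });
```

Passing the arguments as an array rather than a single command string keeps the skill ID out of shell interpolation.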
# Install dependencies
pnpm install
# Run backend
cd packages/backend
pnpm dev
# Run mobile app
cd packages/mobile
pnpm start
# Run prompt optimization (Python) - normally called automatically by backend
cd optimization
pip install -r requirements.txt
python optimize_challenge_prompt.py --skill <skill_id> --level 3 --optimizer hrpo
Key environment variables for the backend:
# Prompt Optimization (automated)
PROMPT_OPTIMIZATION_ENABLED=true
PROMPT_OPTIMIZATION_RATING_THRESHOLD=2.5
PROMPT_OPTIMIZATION_MIN_QUESTIONS=10
PROMPT_OPTIMIZATION_CHECK_INTERVAL_HOURS=1
PROMPT_OPTIMIZATION_AUTO_DEPLOY=true
PROMPT_OPTIMIZATION_MAX_CONCURRENT_JOBS=2
PROMPT_OPTIMIZATION_REFINEMENTS=5
# Question Pool (reduces API costs)
QUESTION_POOL_ENABLED=true
QUESTION_POOL_MIN_RATING=2.0
ADD_GENERATED_TO_POOL=true
# LLM Settings
LLM_TEMPERATURE=0.7
LLM_MAX_TOKENS=400
# A/B Testing
AB_TEST_CHALLENGE_PROMPT_ENABLED=false
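A typed loader for these variables might look like the sketch below. It is not the backend's actual config code: the function names and the grouping of settings are hypothetical, and the fallback values simply reuse the values listed above, on the assumption that they are also the defaults.

```typescript
type Env = Record<string, string | undefined>;

// Reads a boolean flag; anything other than "true" (case-insensitive) is false.
function readBool(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  return raw.trim().toLowerCase() === "true";
}

// Reads a numeric setting and fails loudly on malformed input.
function readNumber(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!isFinite(value)) throw new Error(`${key} must be a number, got "${raw}"`);
  return value;
}

// Pass process.env in Node.js; a plain object works for tests.
function loadConfig(env: Env) {
  return {
    promptOptimization: {
      enabled: readBool(env, "PROMPT_OPTIMIZATION_ENABLED", true),
      ratingThreshold: readNumber(env, "PROMPT_OPTIMIZATION_RATING_THRESHOLD", 2.5),
      minQuestions: readNumber(env, "PROMPT_OPTIMIZATION_MIN_QUESTIONS", 10),
      checkIntervalHours: readNumber(env, "PROMPT_OPTIMIZATION_CHECK_INTERVAL_HOURS", 1),
      autoDeploy: readBool(env, "PROMPT_OPTIMIZATION_AUTO_DEPLOY", true),
      maxConcurrentJobs: readNumber(env, "PROMPT_OPTIMIZATION_MAX_CONCURRENT_JOBS", 2),
      refinements: readNumber(env, "PROMPT_OPTIMIZATION_REFINEMENTS", 5),
    },
    questionPool: {
      enabled: readBool(env, "QUESTION_POOL_ENABLED", true),
      minRating: readNumber(env, "QUESTION_POOL_MIN_RATING", 2.0),
      addGeneratedToPool: readBool(env, "ADD_GENERATED_TO_POOL", true),
    },
    llm: {
      temperature: readNumber(env, "LLM_TEMPERATURE", 0.7),
      maxTokens: readNumber(env, "LLM_MAX_TOKENS", 400),
    },
    abTestChallengePromptEnabled: readBool(env, "AB_TEST_CHALLENGE_PROMPT_ENABLED", false),
  };
}
```

Parsing everything once at startup means a malformed value such as `LLM_TEMPERATURE=warm` surfaces as an immediate error instead of as `NaN` deep inside a request.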