lex-eval

LLM output evaluation framework for LegionIO. Provides LLM-as-judge and code-based evaluators for scoring LLM outputs against expected results, with per-row results and summary statistics.

Overview

lex-eval runs structured evaluation suites against LLM outputs. Each evaluation takes a list of input/output/expected triples, scores them with the chosen evaluator, and returns a result set with pass/fail per row and an aggregate score.

Installation

Add the gem to your Gemfile:

gem 'lex-eval'

Usage

require 'legion/extensions/eval'

client = Legion::Extensions::Eval::Client.new

# Run an LLM-judge evaluation
result = client.run_evaluation(
  evaluator_name: 'accuracy',
  evaluator_config: { type: :llm_judge, criteria: 'factual correctness' },
  inputs: [
    { input: 'What is BGP?', output: 'Border Gateway Protocol', expected: 'Border Gateway Protocol' },
    { input: 'What is OSPF?', output: 'Open Shortest Path First', expected: 'Open Shortest Path First' }
  ]
)
# => { evaluator: 'accuracy',
#      results: [{ passed: true, score: 1.0, row_index: 0 }, ...],
#      summary: { total: 2, passed: 2, failed: 0, avg_score: 1.0 } }
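The summary block is a straightforward roll-up of the per-row results. As an illustration (plain Ruby, independent of the gem's internals), the aggregation can be sketched as:

```ruby
# Sketch of how per-row results roll up into the summary statistics.
# The row shape mirrors the result set above; the aggregation itself
# assumes nothing beyond that shape.
results = [
  { passed: true,  score: 1.0, row_index: 0 },
  { passed: true,  score: 0.8, row_index: 1 },
  { passed: false, score: 0.2, row_index: 2 }
]

passed = results.count { |r| r[:passed] }
summary = {
  total: results.size,
  passed: passed,
  failed: results.size - passed,
  avg_score: (results.sum { |r| r[:score] } / results.size).round(2)
}
# summary => { total: 3, passed: 2, failed: 1, avg_score: 0.67 }
```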

# Run a code-based evaluation
client.run_evaluation(
  evaluator_name: 'json-validity',
  evaluator_config: { type: :code },
  inputs: [{ input: 'parse this', output: '{"valid": true}', expected: nil }]
)

# List built-in evaluator templates
client.list_evaluators

Evaluator Types

Type        Description
:llm_judge  Uses legion-llm to score the output against the expected value using natural-language criteria
:code       Runs a Ruby proc against each row or checks structural validity
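A :code evaluator boils down to a predicate over each row. As an illustration only (the proc name and row shape here are assumptions, not the gem's internal evaluator interface), a JSON-validity check in the spirit of the json-validity evaluator could look like:

```ruby
require 'json'

# Illustrative :code-style evaluator: passes when the output parses as JSON.
# This proc and its return shape are assumptions for the sketch.
json_validity = proc do |row|
  JSON.parse(row[:output])
  { passed: true, score: 1.0 }
rescue JSON::ParserError
  { passed: false, score: 0.0 }
end

json_validity.call(output: '{"valid": true}')  # => { passed: true, score: 1.0 }
json_validity.call(output: 'not json')         # => { passed: false, score: 0.0 }
```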

Built-In Templates

12 YAML evaluator templates ship with the gem and are returned by list_evaluators:

hallucination, relevance, toxicity, faithfulness, qa_correctness, sql_generation, code_generation, code_readability, tool_calling, human_vs_ai, rag_relevancy, summarization

Annotation Queues

Human-in-the-loop annotation for labeling LLM outputs:

client = Legion::Extensions::Eval::Client.new(db: Sequel.sqlite)
Legion::Extensions::Eval::Helpers::AnnotationSchema.create_tables(client.instance_variable_get(:@db))

client.create_queue(name: 'review', description: 'Manual review queue')
client.enqueue_items(queue_name: 'review', items: [{ input: 'q', output: 'a' }])
client.assign_next(queue_name: 'review', annotator: 'alice', count: 5)
client.complete_annotation(item_id: 1, label_score: 0.9, label_category: 'correct')
client.queue_stats(queue_name: 'review')
client.export_to_dataset(queue_name: 'review')

Agentic Review

An AI reviewer scores another model's output and escalates to human review based on confidence:

client = Legion::Extensions::Eval::Client.new
result = client.review_output(input: 'question', output: 'answer')
# => { confidence: 0.92, recommendation: 'approve', issues: [], explanation: '...' }

result = client.review_with_escalation(input: 'q', output: 'a')
# => { action: :auto_approve, escalated: false, ... }  (confidence > 0.9)
# => { action: :light_review, escalated: true, priority: :low, ... }  (0.6-0.9)
# => { action: :full_review, escalated: true, priority: :high, ... }  (< 0.6)
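The thresholds shown above imply a simple dispatch on the reviewer's confidence. A plain-Ruby sketch of that routing (the cutoffs and return shape are taken from the comments above; the helper itself is hypothetical, not part of the gem's API):

```ruby
# Hypothetical helper mirroring the documented thresholds:
# confidence > 0.9 auto-approves, 0.6-0.9 gets a light review,
# below 0.6 escalates to a full review.
def route_review(confidence)
  if confidence > 0.9
    { action: :auto_approve, escalated: false }
  elsif confidence >= 0.6
    { action: :light_review, escalated: true, priority: :low }
  else
    { action: :full_review, escalated: true, priority: :high }
  end
end

route_review(0.92)  # => { action: :auto_approve, escalated: false }
route_review(0.75)  # => { action: :light_review, escalated: true, priority: :low }
route_review(0.40)  # => { action: :full_review, escalated: true, priority: :high }
```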

Development

bundle install
bundle exec rspec
bundle exec rubocop

License

MIT
