Gemlift

Gemlift is a RAG-based library upgrade assistant. It crawls documentation, stores it in a vector database, and uses an LLM to generate upgrade guidance grounded in the retrieved documentation. It learns from past failures through an error memory store and can autonomously upgrade a Rails application, verifying the result with RSpec.

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Knowledge Pipeline                       │
│                                                              │
│  WebCrawler ──► DocumentChunker ──► ChromaDB                 │
│  (crawl4ai)     (langchain)         upgrade_knowledge        │
│                                                              │
│                              ChromaDB                        │
│                              upgrade_errors  ◄── CI/CD hook  │
│                              (error memory)                  │
└─────────────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│                    Upgrade Agent                             │
│                                                              │
│  User Request ──► HybridRetriever ──► LLM ──► Response      │
│                   (KB + errors,                              │
│                    verified-fix boost)                       │
└─────────────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────┐
│                  Upgrader (autonomous)                       │
│                                                              │
│  RepoReader ──► plan_agent (LLM+RAG) ──► UpgradePlan        │
│                                              │               │
│                                         apply_to(repo)       │
│                                              │               │
│                                         RailsExecutor        │
│                                         (bundle, rspec)      │
└─────────────────────────────────────────────────────────────┘
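
The Upgrader stage in the diagram can be pictured with a small sketch. The field names on FileChange and UpgradePlan, the from_json helper, and the path check inside apply_to are assumptions made for illustration; the actual definitions live in upgrader/plan.py and may differ.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FileChange:
    # Hypothetical fields: a repo-relative path and the full new file content.
    path: str
    new_content: str


@dataclass
class UpgradePlan:
    # Hypothetical fields; the real plan may carry more metadata.
    summary: str
    changes: list[FileChange] = field(default_factory=list)

    @classmethod
    def from_json(cls, raw: str) -> "UpgradePlan":
        """Parse the structured JSON an LLM might return into a plan."""
        data = json.loads(raw)
        changes = [FileChange(**c) for c in data.get("changes", [])]
        return cls(summary=data.get("summary", ""), changes=changes)

    def apply_to(self, repo: Path) -> list[Path]:
        """Write every change under `repo` and return the touched paths."""
        root = repo.resolve()
        written = []
        for change in self.changes:
            target = (root / change.path).resolve()
            # Refuse paths that would escape the repository root.
            if not target.is_relative_to(root):
                raise ValueError(f"unsafe path: {change.path}")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(change.new_content)
            written.append(target)
        return written
```

Keeping the plan as plain data makes it easy to inspect or log before anything touches the repository, and the executor only runs once apply_to has finished.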

Module Overview

src/gemlift/
├── config.py               # All settings via pydantic-settings + .env
├── interfaces.py           # Protocol definitions for all components
│
├── ingestion/              # Crawl → chunk → store
│   ├── crawler.py          # crawl4ai async crawler (SOURCES registry)
│   ├── chunker.py          # Markdown-header + recursive text splitting
│   └── pipeline.py         # Orchestrates crawl→chunk→embed→upsert
│
├── indexing/               # Embedding + vector storage
│   ├── embeddings.py       # SentenceTransformer (local, 384-dim)
│   └── vector_store.py     # ChromaDB wrapper (cosine similarity, upsert dedup)
│
├── memory/
│   └── error_store.py      # Stores upgrade failures; marks verified fixes
│
├── retrieval/
│   └── retriever.py        # HybridRetriever: KB + error memory, score boosting
│
├── agent/                  # LangGraph upgrade agent
│   ├── state.py            # UpgradeState TypedDict
│   ├── nodes.py            # retrieve_node + generate_node factories
│   └── graph.py            # StateGraph: retrieve → generate → END
│
├── upgrader/               # Autonomous repo upgrader
│   ├── repo_reader.py      # Reads Gemfile, .ruby-version, config files
│   ├── plan.py             # UpgradePlan + FileChange dataclasses
│   ├── plan_agent.py       # LLM+RAG → structured JSON upgrade plan
│   └── executor.py         # RailsExecutor: bundle/rspec via subprocess + rbenv
│
├── scheduler/
│   └── refresh.py          # Daily re-crawl daemon
│
└── cli.py                  # Click CLI entry point
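
The overview says the vector store deduplicates on upsert but does not say how. One common approach, shown here purely as an assumption, is to derive a stable ID from the source URL and the chunk text so that re-ingesting unchanged content overwrites the same entry. The names chunk_id and upsert_chunks are hypothetical, and a plain dict stands in for the ChromaDB collection.

```python
import hashlib


def chunk_id(source_url: str, text: str) -> str:
    """Derive a stable ID from the source URL and the chunk text."""
    digest = hashlib.sha256(f"{source_url}\n{text}".encode("utf-8"))
    return digest.hexdigest()[:32]


def upsert_chunks(store: dict[str, str], source_url: str, chunks: list[str]) -> int:
    """Upsert chunks into `store` (a dict standing in for a collection).

    Returns how many IDs were new; re-ingesting identical text adds nothing.
    """
    added = 0
    for text in chunks:
        key = chunk_id(source_url, text)
        if key not in store:
            added += 1
        store[key] = text
    return added
```

With content-derived IDs, a daily re-crawl only grows the collection when a page actually changes.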

Stack

Layer	Tool
Crawling	`crawl4ai` (async, JS-capable, playwright)
Chunking	`langchain` MarkdownHeaderTextSplitter + RecursiveCharacterTextSplitter
Embeddings	`sentence-transformers` all-MiniLM-L6-v2 (local, free)
Vector DB	`ChromaDB` (embedded, persistent)
Agent	`LangGraph` StateGraph
LLM	LiteLLM proxy or direct Anthropic (`langchain-openai` / `langchain-anthropic`)
CLI	`click`
Package manager	`uv`

Setup

Prerequisites

Python 3.11+
uv (see the uv installation guide)
Ruby + rbenv (for e2e Rails upgrade tests)

Install

git clone https://github.com/nhatbui-eh/gemlift
cd gemlift
uv sync

Playwright (required for web crawling)

uv run crawl4ai-setup

Environment

Copy .env.example to .env and fill in credentials:

cp .env.example .env

# Option A — LiteLLM proxy (preferred)
LITELLM_API_KEY=sk-...
LITELLM_BASE_URL=https://your-litellm-proxy.com
LITELLM_MODEL=claude-sonnet-4-5-20250929

# Option B — Direct Anthropic
ANTHROPIC_API_KEY=sk-ant-...

The real_llm test fixture (a shared fixture in tests/conftest.py) auto-detects which credentials are set and prefers LiteLLM when both are present.
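
The detection order can be sketched in a few lines. The function name detect_llm_provider and its return values are hypothetical; only the two environment variable names and the LiteLLM-first preference come from this README.

```python
import os
from typing import Mapping, Optional


def detect_llm_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return "litellm", "anthropic", or None based on which key is set."""
    env = os.environ if env is None else env
    # LiteLLM is checked first, so it wins when both keys are present.
    if env.get("LITELLM_API_KEY"):
        return "litellm"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    return None
```

A fixture built on a helper like this could skip the test when it returns None instead of failing.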

CLI Usage

Ingest documentation

# Ingest all known sources (rails, ruby, bundler)
uv run gemlift ingest

# Ingest a specific library
uv run gemlift ingest --library rails

# Ingest a custom URL
uv run gemlift ingest --url https://guides.rubyonrails.org/upgrading_ruby_on_rails.html

Ask the upgrade agent

uv run gemlift ask "How do I upgrade from Rails 7.1 to 8.0?" \
  --library rails --from-version 7.1 --to-version 8.0

Report a CI/CD failure to error memory

uv run gemlift report-error \
  --library rails --from-version 7.1 --to-version 8.0 \
  --error "ActiveRecord::StrictLoadingViolationError in production" \
  --context "has_many :posts, strict_loading: true"

Confirm a fix (promotes it with a score boost)

uv run gemlift confirm-fix <error-id> "Add strict_loading: false to the association"

Start the daily refresh daemon

uv run gemlift refresh

Testing

Test layout

tests/
├── conftest.py                     # Shared fixtures (real_embedding_fn, real_llm, …)
├── unit/                           # 42 tests — no I/O, mocks only
│   ├── test_crawler.py
│   ├── test_chunker.py
│   ├── test_embeddings.py
│   ├── test_vector_store.py
│   ├── test_error_store.py
│   └── test_retriever.py
├── integration/                    # 12 tests — real ChromaDB + sentence-transformers
│   ├── test_ingestion_pipeline.py  # Crawls real URLs (marked network)
│   └── test_query_pipeline.py      # Retrieval + error memory
└── e2e/                            # 7 tests — real LLM + real Rails subprocess
    ├── test_upgrade_workflow.py     # gemlift agent end-to-end
    └── test_rails_upgrade.py        # Upgrade hw-rails-intro app Rails 7.1→8.0

Run by scope

# Unit tests only (fast, no network, no LLM)
uv run pytest tests/unit/ -v

# Integration — real embeddings + ChromaDB, no crawling
uv run pytest -m "integration and not network" -v

# Integration — includes real web crawling
uv run pytest -m integration -v

# E2E — no LLM (error memory + retrieval only)
uv run pytest -m "e2e and not network" -v

# E2E — full (needs LITELLM_API_KEY or ANTHROPIC_API_KEY, network)
uv run pytest -m e2e -v

# Everything
uv run pytest -v

Markers

Marker	Meaning
`unit`	Pure unit tests, no external dependencies
`integration`	Real ChromaDB + sentence-transformers
`e2e`	Full pipeline including LLM
`network`	Makes real HTTP requests (crawler or LLM)
`slow`	Takes more than a few seconds
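
Custom markers need to be registered so pytest does not warn about unknown marks. A minimal sketch of doing that from conftest.py follows; whether this project registers them there or in pyproject.toml is not stated here, so treat the placement as an assumption. The descriptions are copied from the table above.

```python
# conftest.py (sketch)
MARKERS = {
    "unit": "Pure unit tests, no external dependencies",
    "integration": "Real ChromaDB + sentence-transformers",
    "e2e": "Full pipeline including LLM",
    "network": "Makes real HTTP requests (crawler or LLM)",
    "slow": "Takes more than a few seconds",
}


def pytest_configure(config):
    """Register each custom marker with pytest at startup."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

Markers stack, so a test that crawls a real URL would carry both integration and network, which is what lets the expression "integration and not network" exclude it.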

The Rails upgrade e2e test

tests/e2e/test_rails_upgrade.py runs a complete end-to-end upgrade of the bundled hw-rails-intro Rails app:

1. Copies the app to a temp directory
2. Runs bundle install + db:setup with Ruby 3.3.9 / Rails 7.1
3. Asserts all 15 RSpec specs pass (baseline)
4. Calls the gemlift plan_agent with RAG context → LLM returns a structured JSON upgrade plan
5. Applies the file changes (Gemfile, .ruby-version, config/application.rb)
6. Runs bundle update with Ruby 3.4.4
7. Asserts all 15 RSpec specs still pass on Rails 8.0

# Requires rbenv with Ruby 3.3.9 + 3.4.4 and LLM credentials
uv run pytest tests/e2e/test_rails_upgrade.py -v -s
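
Running bundle and rspec under two different Ruby versions is the part of this test that depends on rbenv. A minimal sketch of one way to do it sets RBENV_VERSION, the environment variable rbenv reads to select a version, on the subprocess. The helper names rbenv_env and run_in_repo are hypothetical, the sketch assumes rbenv shims are already on PATH, and the real RailsExecutor may work differently.

```python
import os
import subprocess
from pathlib import Path
from typing import Optional


def rbenv_env(ruby_version: str, base: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Copy the environment and pin the Ruby version for rbenv shims."""
    env = dict(os.environ if base is None else base)
    env["RBENV_VERSION"] = ruby_version
    return env


def run_in_repo(repo: Path, ruby_version: str, *cmd: str) -> subprocess.CompletedProcess:
    """Run `cmd` inside `repo` with the given Ruby version selected."""
    return subprocess.run(
        list(cmd),
        cwd=repo,
        env=rbenv_env(ruby_version),
        capture_output=True,
        text=True,
        check=False,  # the caller inspects returncode instead of catching
    )
```

A baseline run would then look like run_in_repo(app_dir, "3.3.9", "bundle", "exec", "rspec"), followed by the same call with "3.4.4" after the plan has been applied; app_dir here is a placeholder for the temp copy of the app.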

Error Memory & Learning

Every failed upgrade attempt can be stored in the upgrade_errors ChromaDB collection. Verified fixes receive a +0.2 score boost during hybrid retrieval, surfacing them above unverified errors.

CI pipeline fails
      │
      ▼
gemlift report-error --library rails --error "..." --context "..."
      │
      ▼
Error stored in ChromaDB (UNVERIFIED)
      │
Developer fixes it
      │
      ▼
gemlift confirm-fix <id> "the fix that worked"
      │
      ▼
Document updated → FIX STATUS: VERIFIED WORKING
Future queries retrieve this with +0.2 boost
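
The effect of the boost on ranking can be sketched as follows. Only the +0.2 value comes from this README; the Hit dataclass, the merge_hits function, and the assumption that scores are similarities where higher is better are illustrative, and the real HybridRetriever may combine results differently.

```python
from dataclasses import dataclass

VERIFIED_BOOST = 0.2  # boost value stated in this README


@dataclass
class Hit:
    text: str
    score: float            # similarity, higher is better (assumed)
    source: str             # "kb" or "error" (hypothetical labels)
    verified: bool = False  # True once a fix has been confirmed


def merge_hits(kb_hits: list[Hit], error_hits: list[Hit], top_k: int = 5) -> list[Hit]:
    """Merge knowledge-base and error-memory hits, boosting verified fixes."""
    boosted = [
        Hit(
            h.text,
            h.score + VERIFIED_BOOST if h.verified else h.score,
            h.source,
            h.verified,
        )
        for h in error_hits
    ]
    ranked = sorted(kb_hits + boosted, key=lambda h: h.score, reverse=True)
    return ranked[:top_k]
```

In this sketch a verified fix with a raw similarity of 0.60 ends up at 0.80, which places it ahead of an unverified error at 0.65 and a documentation chunk at 0.70.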

Data

Path	Contents
`data/chroma_db/`	Persistent ChromaDB (knowledge + error memory)
`data/models/`	Local sentence-transformer model (optional, falls back to HuggingFace download)
`data/raw/`	Raw crawl outputs (optional)
`data/eval/`	Evaluation datasets (optional)
