suriyasureshok/GeneFlow

🧬 GeneFlow: ADK-Powered Bioinformatics Copilot

Python 3.10+ | Streamlit | Google ADK | License: MIT

Overview

GeneFlow is a bioinformatics analysis platform built on Google ADK (Agent Development Kit). It combines a multi-agent architecture with generative AI to give researchers and bioinformaticians conversational tools for DNA sequence analysis, protein prediction, literature search, and hypothesis generation.

Key Capabilities

  • 🧬 Intelligent Sequence Analysis: GC content, ORF detection, motif scanning
  • 🔬 Protein Prediction: Physicochemical properties from DNA sequences
  • 📚 Literature Integration: AI-powered research paper discovery and synthesis
  • 💡 Hypothesis Generation: AI-driven research direction suggestions
  • 📊 Advanced Visualizations: Interactive plots and 3D structure modeling
  • 🤖 Multi-Agent Architecture: Specialized agents for different bioinformatics tasks
  • 💾 Session Management: Persistent conversation history and context
  • 📈 Performance Monitoring: Real-time metrics and cost tracking

Quick Start

Prerequisites

  • Python 3.10 or higher
  • Google API Key (for generative AI capabilities)
  • 4GB RAM minimum

Installation

  1. Clone the repository

    git clone https://github.com/suriyasureshok/geneflow.git
    cd geneflow  # git names the directory after the clone URL
  2. Create and activate virtual environment

    python -m venv gene

    # Windows
    gene\Scripts\activate

    # macOS/Linux
    source gene/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Set up environment variables

    # Create .env file in root directory
    echo GOOGLE_API_KEY=your_api_key_here > .env
  5. Launch the application

    python main.py

    The application will automatically:

    • Check all dependencies
    • Create necessary directories (sessions/, metrics/, geneflow_plots/)
    • Launch the Streamlit UI at http://localhost:8501
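
    The startup steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the helper names `missing_modules` and `ensure_directories` are hypothetical, and only the three directory names come from this README.

```python
import importlib.util
from pathlib import Path

# Directory names taken from the README; everything else here is illustrative.
REQUIRED_DIRS = ("sessions", "metrics", "geneflow_plots")


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ensure_directories(base: Path) -> list[Path]:
    """Create the working directories under base if they do not exist yet."""
    created = []
    for name in REQUIRED_DIRS:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)  # no error if already present
        created.append(path)
    return created
```

    A launcher could call `missing_modules(["streamlit"])` first, stop with a readable message if the list is non-empty, and only then create the directories and start the UI.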

Application Architecture

System Overview

┌──────────────────────────────────────────────┐
│           Streamlit Web UI                   │
│  (Home, Dashboard, Chat, Analysis Pages)     │
└────────────────────┬─────────────────────────┘
                     │
┌────────────────────▼─────────────────────────┐
│      UnifiedCoordinator (Router)             │
│  - Routes to Chat or Analysis based on input │
└────────────────────┬─────────────────────────┘
          ┌──────────┴─────────┐
          │                    │
    ┌─────▼─────┐        ┌─────▼──────────┐
    │ ChatAgent │        │ ADKCoordinator │
    │ (Fast)    │        │ (Comprehensive)│
    └───────────┘        └─────┬──────────┘
                               │
                    ┌──────────┴─────────┐
                    │                    │
            ┌───────▼──────┐      ┌──────▼────────┐
            │ Sequence     │      │ Protein       │
            │ Analyzer     │      │ Prediction    │
            └──────────────┘      └───────────────┘
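
The diagram says the router picks Chat or Analysis based on the input, but the README does not state the rule. One plausible heuristic is sketched below, assuming that anything made only of nucleotide letters and above a minimum length counts as a sequence. The function names and the 20-base threshold are assumptions for illustration, not GeneFlow's actual logic.

```python
import re

# Assumed rule: input is a sequence if, after dropping FASTA headers and
# whitespace, it contains only nucleotide letters and is long enough.
DNA_PATTERN = re.compile(r"[ACGTN]+", re.IGNORECASE)
MIN_SEQUENCE_LENGTH = 20  # hypothetical threshold, not taken from GeneFlow


def looks_like_dna(text: str) -> bool:
    """Return True if the text is plausibly a raw or FASTA-formatted DNA sequence."""
    lines = [line.strip() for line in text.strip().splitlines()]
    body = "".join(line for line in lines if not line.startswith(">"))
    body = re.sub(r"\s+", "", body)
    return len(body) >= MIN_SEQUENCE_LENGTH and DNA_PATTERN.fullmatch(body) is not None


def route(text: str) -> str:
    """Pick a handler name: 'analysis' for sequences, 'chat' for everything else."""
    return "analysis" if looks_like_dna(text) else "chat"
```

With this rule a plain question goes to the fast chat path, while a pasted sequence or FASTA record goes to the full analysis path.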

Module Structure

GeneFlow/
├── main.py                      # Application entry point
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── Architecture.md              # System design documentation
├── Modules.md                   # Module reference guide
│
├── src/
│   ├── agents/                  # AI agent implementations
│   │   ├── adk_coordinator.py   # Main ADK-based orchestrator
│   │   ├── unified_coordinator.py # Request router
│   │   ├── chat_agent.py        # Lightweight chat
│   │   ├── sequence_analyzer.py # Sequence analysis agent
│   │   ├── protein_prediction.py # Protein analysis
│   │   ├── comparison.py        # Sequence comparison
│   │   ├── hypothesis.py        # Hypothesis generation
│   │   ├── literature.py        # Literature search
│   │   └── coordinator.py       # Legacy coordinator
│   │
│   ├── core/                    # Core infrastructure
│   │   ├── session_manager.py   # User session management
│   │   ├── monitoring.py        # Performance metrics
│   │   ├── adk_tools.py         # ADK tool definitions
│   │   ├── agent_factory.py     # Agent creation
│   │   ├── context_manager.py   # Execution context
│   │   └── memory.py            # Memory management
│   │
│   ├── utils/                   # Utility modules
│   │   ├── visualizer.py        # Plot generation
│   │   ├── reporter.py          # PDF report creation
│   │   └── structure_generator.py # 3D structure modeling
│   │
│   ├── ui/                      # Streamlit user interface
│   │   ├── Home.py              # Landing page
│   │   └── pages/
│   │       ├── 1_Dashboard.py   # Analytics dashboard
│   │       ├── 2_Chat.py        # Chat interface
│   │       └── 3_Analysis.py    # Full analysis
│   │
│   ├── data/                    # Reference data
│   │   └── known_sequences.fasta # Sequence database
│   │
│   └── tests/                   # Unit tests
│       ├── test_adk_pipeline.py
│       ├── test_*.py            # Component tests
│       └── ...
│
├── sessions/                    # User session storage
├── metrics/                     # Performance metrics
└── geneflow_plots/              # Generated visualizations

Workflow Examples

Example 1: Quick Chat (1-3 seconds)

from src.agents.unified_coordinator import UnifiedCoordinator

coordinator = UnifiedCoordinator()

# Simple question - routes to ChatAgent
result = coordinator.process_message(
    "What is GC content and why is it important?",
    session_id="user_123"
)

print(result['response'])

Example 2: Full DNA Analysis Pipeline (30-60 seconds)

from src.agents.unified_coordinator import UnifiedCoordinator

coordinator = UnifiedCoordinator()

# DNA sequence - routes to ADKCoordinator with full tools
sequence = "ATGAAATATAAAGCGTACGTGCTTGAATGCCTTATAAACGTAGCTAG"

result = coordinator.run_pipeline(
    sequence=sequence,
    session_id="user_123"
)

# Pull the analysis results out of the nested result dictionary
analysis = result['results']['analysis']

print("Analysis complete!")
print(f"GC Content: {analysis['gc_percent']}%")
print(f"ORFs Found: {len(analysis['orfs'])}")
print(f"Report saved to: {result['results']['report']['report_path']}")

Example 3: Session-based Conversation

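from src.agents.unified_coordinator import UnifiedCoordinator

coordinator = UnifiedCoordinator()
session_id = "researcher_001"  # Reusing the same ID keeps the conversation context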

# First message
result1 = coordinator.process_message(
    "I'm studying bacterial resistance genes",
    session_id=session_id
)

# Follow-up with context
result2 = coordinator.process_message(
    "Can you analyze this sequence for me?",
    session_id=session_id
)

# The agent remembers previous conversation context
print(result2['response'])

Performance Characteristics

Operation            Time       Tokens
Chat Response        1-3s       200-500
Sequence Analysis    5-15s      500-1000
Full Pipeline        60-120s    2000-5000
PDF Report Gen       5-15s      -
3D Structure Gen     10-20s     -

Configuration

Environment Variables

# Required
GOOGLE_API_KEY=your_api_key_here

# Optional
LOG_LEVEL=INFO                    # Logging level
SESSION_MAX_AGE_HOURS=24         # Session expiration
MAX_SEQUENCE_LENGTH=100000       # Max sequence size
CACHE_ENABLED=true               # Enable caching
REDIS_URL=redis://localhost:6379 # Redis cache (optional)
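
A minimal sketch of how these variables could be read and validated at startup is given below. The `Settings` dataclass and `load_settings` function are hypothetical names, and treating the values shown above as defaults is an assumption (only the 24-hour session lifetime is stated as a default elsewhere in this README).

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    google_api_key: str
    log_level: str = "INFO"
    session_max_age_hours: int = 24
    max_sequence_length: int = 100_000
    cache_enabled: bool = True
    redis_url: Optional[str] = None


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ); fail fast without a key."""
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY not found")
    return Settings(
        google_api_key=api_key,
        log_level=env.get("LOG_LEVEL", "INFO"),
        session_max_age_hours=int(env.get("SESSION_MAX_AGE_HOURS", "24")),
        max_sequence_length=int(env.get("MAX_SEQUENCE_LENGTH", "100000")),
        # Accept a few common spellings of "true"; anything else is False.
        cache_enabled=env.get("CACHE_ENABLED", "true").strip().lower() in {"1", "true", "yes"},
        redis_url=env.get("REDIS_URL") or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.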

Performance Tuning

# In your initialization code
from src.agents.unified_coordinator import UnifiedCoordinator
from src.core.monitoring import PerformanceMonitor
from src.core.session_manager import SessionManager

# Customize session storage
session_manager = SessionManager(
    storage_path="custom_sessions",
    max_session_age_hours=48  # Longer session lifetime (default is 24)
)

# Customize performance monitoring
monitor = PerformanceMonitor(
    storage_path="custom_metrics",
    enabled=True  # Set to False to turn monitoring off, e.g. in production
)

# Pass both to the coordinator
coordinator = UnifiedCoordinator(
    session_manager=session_manager,
    performance_monitor=monitor
)

Features in Detail

1. Sequence Analysis

  • GC Content: Percentage of guanine and cytosine bases
  • ORF Detection: Open Reading Frame identification (ATG to stop codon)
  • Motif Scanning: Regulatory element detection (TATA box, Kozak sequence, etc.)
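
The three analyses above can be illustrated with a short standard-library sketch. It is not the code in sequence_analyzer.py: the function names are hypothetical, only the forward strand is scanned, and the TATA box is matched with the common TATAWAW consensus (W = A or T).

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}
TATA_BOX = re.compile(r"TATA[AT]A[AT]")  # TATAWAW consensus


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases; 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def find_orfs(seq: str, min_codons: int = 3):
    """Forward-strand ORFs as (start, end) pairs, end exclusive.

    An ORF runs from an ATG to the first in-frame stop codon;
    min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)


def find_motif(seq: str, pattern=TATA_BOX):
    """Start positions of non-overlapping matches of a motif pattern."""
    return [match.start() for match in pattern.finditer(seq.upper())]
```

A fuller analyzer would also scan the reverse complement and report ORFs that run off the end of the sequence; both are left out here to keep the sketch short.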

2. Protein Prediction

  • Translation: DNA to amino acid conversion
  • Molecular Weight: Protein mass calculation
  • Hydrophobicity: Protein property analysis
  • Signal Peptide: N-terminal signal detection
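
Translation and hydrophobicity can be sketched as follows, assuming the standard genetic code and the Kyte-Doolittle hydropathy scale (GRAVY is the mean hydropathy per residue). The function names are hypothetical; molecular weight and signal peptide detection are not covered.

```python
from itertools import product

# Standard genetic code written in the usual TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def translate(dna: str) -> str:
    """Translate from the first base, stopping at the first stop codon.

    Codons with ambiguous bases become 'X'; a trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def gravy(protein: str) -> float:
    """Mean Kyte-Doolittle hydropathy; residues outside the scale are skipped."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    return sum(scores) / len(scores) if scores else 0.0
```

Positive GRAVY values indicate a more hydrophobic protein overall, negative values a more hydrophilic one.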

3. Sequence Comparison

  • Homology Search: Find similar sequences
  • Alignment: Compare multiple sequences
  • Similarity Scoring: Quantify sequence relationships
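
One simple, alignment-free way to quantify sequence similarity is the Jaccard index over k-mer sets, sketched below. This is an illustrative technique chosen for brevity; the README does not say which scoring method comparison.py uses, and the function names are hypothetical.

```python
def kmers(seq: str, k: int = 3) -> set:
    """Set of all overlapping substrings of length k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by total distinct k-mers, in the range 0.0 to 1.0."""
    kmers_a, kmers_b = kmers(a, k), kmers(b, k)
    union = kmers_a | kmers_b
    if not union:  # both sequences shorter than k
        return 0.0
    return len(kmers_a & kmers_b) / len(union)
```

For example, `jaccard_similarity("ATGCA", "ATGCC")` compares {ATG, TGC, GCA} with {ATG, TGC, GCC}: two shared 3-mers out of four distinct ones.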

4. Literature Integration

  • PubMed Search: Scientific paper discovery
  • Citation Analysis: Find related research
  • Trend Analysis: Identify research directions
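
A PubMed search can be issued through the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and parses the JSON reply; whether literature.py uses E-utilities is an assumption, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term: str, max_results: int = 10) -> str:
    """Return an ESearch URL that asks PubMed for matching article IDs as JSON."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": max_results,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_pubmed_ids(response_text: str) -> list:
    """Pull the list of PubMed IDs out of an ESearch JSON response body."""
    payload = json.loads(response_text)
    return payload.get("esearchresult", {}).get("idlist", [])
```

The URL can be fetched with `urllib.request.urlopen`. NCBI rate-limits requests, so a real client should throttle its calls.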

5. Hypothesis Generation

  • Pattern-based: From sequence analysis results
  • Literature-informed: Based on research context
  • Confidence Scoring: Probability estimation

6. Visualization Suite

  • GC Content Plots: Sliding window analysis
  • ORF Maps: Linear genome representation
  • 3D Structure: DNA/Protein visualization
  • Property Charts: Physicochemical analysis
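
The data behind a sliding-window GC plot can be computed as sketched below. The function name and the default window and step sizes are illustrative assumptions; plotting is left to whichever library the visualizer uses.

```python
def gc_sliding_window(seq: str, window: int = 10, step: int = 1):
    """Return (window_start, gc_percent) pairs for each full window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = seq.upper()
    points = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc_count = sum(base in "GC" for base in chunk)
        points.append((start, 100.0 * gc_count / window))
    return points
```

Sequences shorter than the window yield an empty list, so callers can decide whether to shrink the window or skip the plot.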

Testing

# Run all tests
pytest src/tests/

# Run specific test
pytest src/tests/test_sequence_analyzer.py -v

# With coverage
pytest src/tests/ --cov=src --cov-report=html

Troubleshooting

Issue: "GOOGLE_API_KEY not found"

Solution: Set the environment variable:

Windows (Command Prompt):

set GOOGLE_API_KEY=your_key

macOS/Linux:

export GOOGLE_API_KEY=your_key

Issue: Slow responses

Solutions:

  • Check network connectivity
  • Verify API quota limits
  • Reduce sequence length for initial analysis
  • Enable local caching (CACHE_ENABLED=true)

Issue: Session not found

Solution: Sessions expire after 24 hours by default. Create a new session or adjust SESSION_MAX_AGE_HOURS.
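
The expiry rule can be pictured as a simple timestamp comparison. This is a sketch under the assumption that each session records when it was last active; `is_expired` is a hypothetical helper, not part of SessionManager.

```python
from datetime import datetime, timedelta, timezone


def is_expired(last_active: datetime, max_age_hours: int = 24, now=None) -> bool:
    """True if more than max_age_hours have passed since last_active.

    last_active is expected to be timezone-aware (UTC).
    """
    now = now or datetime.now(timezone.utc)
    return now - last_active > timedelta(hours=max_age_hours)
```

Raising SESSION_MAX_AGE_HOURS corresponds to passing a larger `max_age_hours` here.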

Issue: Out of memory

Solution: Reduce sequence length (see MAX_SEQUENCE_LENGTH) or enable Redis caching for session storage by setting REDIS_URL.

Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use GeneFlow in your research, please cite:

@software{geneflow2024,
  author = {Suriya Sureshkumar},
  title = {GeneFlow: ADK-Powered Bioinformatics Copilot},
  year = {2024},
  url = {https://github.com/suriyasureshok/geneflow}
}

Last Updated: November 2024

Version: 1.0.0
