I am a Senior Data Scientist and Technical Lead specializing in high-throughput genomic data engineering. Currently, I serve as the Bioinformatics Regional Resource for the Mountain Region, architecting scalable, reproducible workflows for public health surveillance.
My work focuses on workflow orchestration (Nextflow), containerization (Docker/Singularity), and cloud infrastructure (AWS) to turn petabytes of raw sequencing data into actionable epidemiological insights.
- Languages: Python (Pandas, SciPy, pysam), R, Groovy, Bash
- Workflow Orchestration: Nextflow (DSL2), Snakemake, WDL
- Infrastructure: Docker, Singularity, AWS (Batch, S3, HealthOmics), GitHub Actions
- Data Engineering: ETL pipeline design, algorithmic benchmarking, metadata governance
Role: Lead Architect & Maintainer
The standard-of-care SARS-CoV-2 sequencing pipeline used by the CDC and public health laboratories across the US.
- Tech: Nextflow, Docker, Singularity, AWS Batch.
- Scale: Orchestrates alignment, variant calling, and lineage classification for thousands of concurrent samples.
- Impact: CLIA-validated and deployed for real-time genomic surveillance.
Role: Lead Maintainer
A command-line tool for unsupervised machine learning in genomic epidemiology.
- Tech: Python, Scikit-learn (PCA, Silhouette Analysis), Fastcluster.
- ML Features: Uses Auto-K optimization to mathematically identify lineage thresholds and PCA for cluster validation.
- Performance: Optimized $O(N^2)$ clustering for large-scale distance matrices.
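
As a rough illustration of the threshold-selection idea above, here is a minimal sketch that scans candidate distance cutoffs on a hierarchical clustering tree and keeps the one with the best silhouette score. It is not the tool's actual implementation: the function name `auto_threshold`, the toy distance matrix, and the candidate cutoffs are all hypothetical, and SciPy's `linkage` stands in for Fastcluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score


def auto_threshold(dist, candidates):
    """Pick the distance cutoff whose flat clustering has the highest silhouette.

    dist       -- square, symmetric distance matrix with a zero diagonal
    candidates -- iterable of cutoffs to try (hypothetical values)
    Returns (threshold, score, labels), or None if no cutoff yields 2..N-1 clusters.
    """
    # Hierarchical clustering works on the condensed (upper-triangle) form,
    # which still holds N*(N-1)/2 entries, i.e. O(N^2) memory.
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")

    best = None
    for t in candidates:
        labels = fcluster(tree, t=t, criterion="distance")
        k = len(set(labels))
        # Silhouette is only defined for 2 <= k <= N-1 clusters.
        if k < 2 or k >= len(labels):
            continue
        score = silhouette_score(dist, labels, metric="precomputed")
        if best is None or score > best[1]:
            best = (t, score, labels)
    return best


# Toy data (an assumption for illustration): six samples on a line, in two tight groups.
positions = np.array([0, 1, 2, 20, 21, 22], dtype=float)
dist = np.abs(positions[:, None] - positions[None, :])

threshold, score, labels = auto_threshold(dist, [0.5, 1.2, 5.0, 50.0])
print(f"threshold={threshold}, clusters={len(set(labels))}, silhouette={score:.3f}")
```

The tree is built once and only the cutoff is rescanned, so each extra candidate costs one flat cut plus one silhouette evaluation rather than a full re-clustering.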
Role: Core Maintainer
A community-driven repository for reproducible bioinformatics containers.
- Tech: Docker, GitHub Actions CI/CD.
- Impact: Solves the "it works on my machine" problem by providing version-controlled, public-health-grade images.





