A comprehensive collection of bioinformatics tools and scripts for sequence analysis, quality control, and data processing.
Requirements: Python 3.13+, Biopython 1.86+
-
fastx.py - FASTA/FASTQ file operations (sequence-level operations)
- Degenerate base counting
- Sequence validation and filtering
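As an illustration of degenerate base counting, here is a minimal sketch. The function name `count_degenerate_bases` is hypothetical and is not taken from fastx.py; it assumes a "degenerate base" means any IUPAC nucleotide ambiguity code other than A, C, G, T, or U.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes (everything except A, C, G, T/U).
DEGENERATE_BASES = frozenset("RYSWKMBDHVN")


def count_degenerate_bases(sequence: str) -> Counter:
    """Count each IUPAC ambiguity code in a sequence (case-insensitive).

    Hypothetical helper for illustration; not the fastx.py implementation.
    """
    return Counter(base for base in sequence.upper() if base in DEGENERATE_BASES)


if __name__ == "__main__":
    print(count_degenerate_bases("ACGTNNRy"))
```

Returning a `Counter` keeps the per-code breakdown available while still allowing a simple total through `sum(counts.values())`.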
-
readsets.py - Read pair operations (R1/R2 aware)
- Complexity filtering
- Paired-end sequence analysis
-
sambams.py - SAM/BAM file manipulation
- CIGAR string filtering
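To show what filtering on a CIGAR string can look like, the sketch below parses a CIGAR string into (length, operation) pairs and checks for selected operations. The names `parse_cigar` and `has_operation` are hypothetical, and the actual filter criteria used by sambams.py are not described here, so clipping is used only as an example.

```python
import re

# One CIGAR element: a run length followed by an operation code from the SAM spec.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string such as '5S90M5S' into (length, op) pairs."""
    return [(int(length), op) for length, op in _CIGAR_ELEMENT.findall(cigar)]


def has_operation(cigar: str, operations: str) -> bool:
    """Return True if the CIGAR string contains any of the given operation codes."""
    return any(op in operations for _, op in parse_cigar(cigar))


if __name__ == "__main__":
    # Example criterion (an assumption): flag alignments with soft or hard clipping.
    for cigar in ("100M", "5S95M", "10H80M10H"):
        print(cigar, has_operation(cigar, "SH"))
```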
-
fastx_utils.py - Utility functions
- Sequence complement/reverse complement (DNA, RNA, IUPAC)
- FASTA/FASTQ parsing helpers
- File collection utilities
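A reverse complement that handles IUPAC ambiguity codes can be sketched with a translation table, as below. The function name `reverse_complement` and the table layout are assumptions for illustration and may differ from what fastx_utils.py provides; RNA handling is left out.

```python
# Complement table for DNA, including IUPAC ambiguity codes, in both cases.
# Each code maps to the code for the complementary base set (e.g. R=A/G -> Y=T/C).
_IUPAC_DNA_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the result.

    Hypothetical helper for illustration; characters outside the table
    (gaps, for instance) pass through unchanged.
    """
    return sequence.translate(_IUPAC_DNA_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(reverse_complement("AAR"))
```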
-
macguffin_classes.py - Core classes
- Primer - Genomic primer representation (chromosome, position, coordinates)
- RunCollection - Pipeline run organization (samples, read sets)
- RunSet - Paired-end read set management
-
configs.py - Configuration settings
-
tools/blastools.py - BLAST result manipulation
- Self-hit removal from BLAST reports
-
tools/ref_filler.py - Reference sequence expansion
- Fills alignment gaps by extracting variants from aligned sequences
- Generates subsequence FASTA files for missing regions
-
tools/pairwise_align.py - Pairwise sequence alignment
- Aligns sequence pairs using Biopython
-
tools/ann_to_bed.py - Format conversion
- Converts UCSC RepMask annotation format to BED format
-
tools/clstr_splitter.py - CD-HIT cluster parser
- Splits CD-HIT cluster output into individual cluster files
-
tools/dircutadapt.py - CutAdapt wrapper
- Batch adapter trimming for FASTQ files in a directory
-
tools/subsample.py - Subsampling tool
- Generates subsampled FASTQ files using SeqTK
- Optional bulk processing from TSV file
-
tools/extract_index_files.py - Index extraction
- Extracts index sequences from paired-end FASTQ (generates I1/I2 from R1/R2)
- Supports gzipped input
-
tools/fasta_subseq_dl.py - Sequence download
- Downloads FASTA subsequences from NCBI using Accession IDs and coordinates
-
tools/bed_expansion.py - BED region expansion
- Expands BED regions below a minimum length threshold symmetrically
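Symmetric expansion of a short region can be sketched as follows. This is not the code from bed_expansion.py: the function name `expand_region`, the clamping at coordinate 0, and giving the extra base to the right side when the shortfall is odd are all assumptions. Coordinates are treated as 0-based, half-open, as in the BED format.

```python
def expand_region(start: int, end: int, min_length: int) -> tuple[int, int]:
    """Grow a BED interval to at least min_length, splitting the padding evenly.

    Hypothetical helper for illustration. Regions already at or above the
    threshold are returned unchanged.
    """
    length = end - start
    if length >= min_length:
        return start, end

    shortfall = min_length - length
    left_pad = shortfall // 2
    right_pad = shortfall - left_pad  # the right side takes the odd base

    new_start = max(0, start - left_pad)
    # If the start was clamped at 0, shift the unused padding to the right side.
    right_pad += left_pad - (start - new_start)
    return new_start, end + right_pad


if __name__ == "__main__":
    print(expand_region(100, 110, 20))
    print(expand_region(2, 6, 10))
```

A real tool would likely also clamp the end coordinate to the chromosome length, which this sketch does not know.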
-
tools/graph_assembler.py - De Bruijn-like sequence assembly
- Graph-based read assembly using greedy overlap matching
- Implements Depth-First Search path finding
-
tools/detect_fusions.py - Fusion read detection
- K-mer based detection of spanning and chimeric fusion reads in paired-end FASTQ
- Deduplicates k-mers shared between fusion partners to reduce false positives
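The k-mer deduplication idea can be sketched as a set difference: k-mers present in both fusion partners cannot tell the partners apart, so they are dropped before reads are screened. The function names and the simple "read hits partner" test below are assumptions for illustration, not the logic of detect_fusions.py.

```python
def kmer_set(sequence: str, k: int) -> set[str]:
    """All distinct k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def partner_specific_kmers(gene_a: str, gene_b: str, k: int) -> tuple[set[str], set[str]]:
    """Return k-mers unique to each partner, with shared k-mers removed."""
    kmers_a = kmer_set(gene_a, k)
    kmers_b = kmer_set(gene_b, k)
    shared = kmers_a & kmers_b
    return kmers_a - shared, kmers_b - shared


def read_hits(read: str, kmers: set[str], k: int) -> bool:
    """True if the read contains at least one of the given k-mers."""
    return not kmer_set(read, k).isdisjoint(kmers)


if __name__ == "__main__":
    only_a, only_b = partner_specific_kmers("AAAACCCC", "CCCCGGGG", k=4)
    read = "AACCCCGG"  # toy read crossing a junction between the two toy genes
    print(read_hits(read, only_a, 4), read_hits(read, only_b, 4))
```

In this sketch a read that hits both partner-specific sets would be a candidate chimeric read, while a pair whose mates each hit a different set would be a candidate spanning pair; those definitions are assumptions here.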
-
tools/detect_tandem_repeats.py - Tandem repeat detection
- Identifies assembled contigs containing tandem repeats of a gene
- Fetches exon sequences from NCBI by accession and aligns against contigs
The archive/ directory contains deprecated/retired tools:
- fadiff.py - FASTA difference (superseded by standard tools)
- fauniq.py - FASTA unique (superseded by standard tools)
- fqfilter.py - FASTQ filtering
- cigar_filter.py - CIGAR filtering (see sambams.py)
- map_accessiontaxid.py - Accession to TaxID mapping
- taxid_annotate.py - TaxID annotation
- snp_primer_validate.py - SNP primer validation
- read_pair_merger.py - Paired-end merging
- tabtodb.py - TAB to database conversion
- windowshopper.py - Window-based sequence analysis
- rpm_prep - RPM packaging preparation