A tool for designing RNA/DNA sequence libraries for MaP-seq experiments. Handles preprocessing, padding with hairpin structures, barcode generation, and optional read-count prediction.
Requires cmake and a C++20 compiler.
git clone https://github.com/hmblair/fld
cd fld
./configure
Add ./bin to your PATH.
For read-count prediction features, also install rn-coverage.
Design a library with padding and barcodes:
fld pipeline -o output/ --pad-to 130 --barcode-length 10 designs.fasta
This produces:
- output/library.csv - Full library with all sequence components
- output/library.fasta - Complete sequences in FASTA format
- output/t7-library.fasta - Same sequences with T7 promoter prefix (GGGAACG)
If you have rn-coverage installed, add --predict to automatically balance padding and barcodes by predicted read counts:
fld pipeline -o output/ --pad-to 130 --barcode-length 10 --predict designs.fasta
The pipeline command is the main entry point. Here are common use cases:
fld pipeline -o output/ --pad-to 130 --barcode-length 10 designs.fasta
What it does:
- Preprocesses input FASTA
- Pads sequences to 130nt with hairpin structures
- Generates unique 10nt-stem barcodes
- Outputs library files
Output structure:
output/
├── library.csv # Final library (all components)
├── library.fasta # Complete sequences
├── t7-library.fasta # With T7 prefix (GGGAACG)
└── tmp/ # Intermediate files
fld pipeline -o output/ --pad-to 130 --barcode-length 10 lib1.fasta lib2.fasta lib3.fasta
Each file becomes a separate sublibrary. The index column restarts at 1 for each file, and the sublibrary column tracks the source file.
fld pipeline -o output/ --pad-to 130 --barcode-length 10 --m2 designs.fasta
Generates complement sequences for M2-seq experiments before padding.
fld pipeline -o output/ --pad-to 130 --no-barcodes designs.fasta
Pads sequences but skips barcode generation.
fld pipeline -o output/ --pad-to 130 --barcode-length 10 --predict designs.fasta
Requires: rn-coverage on PATH
What it does:
- Preprocesses input FASTA
- Generates padding, barcodes, and design sequences separately
- Predicts read counts for each component in isolation
- Merges padding using read-count balancing within each design-length group
- Predicts read counts for padded designs
- Merges barcodes using read-count balancing (low-read designs get high-read barcodes)
- Predicts final read counts for the fully assembled library
- Verifies begin/end columns match design positions
fld pipeline -o output/ --pad-to 130 --barcode-length 10 \
--max-gc 4 --max-gu 0 --closing-gc 2 designs.fasta
Limits GC pairs to 4 per stem, forbids GU pairs, and closes each stem with 2 GC pairs.
fld pipeline -o output/ --pad-to 130 --barcode-length 10 \
--five-const "GGGAAACCC" --three-const "AAAAAAAAAA" designs.fasta
The output CSV contains these columns:
| Column | Description |
|---|---|
| index | 1-based position in original input file |
| name | Sequence identifier from FASTA header |
| sublibrary | Source file name (for tracking) |
| five_const | 5' constant/primer region |
| five_padding | 5' padding hairpins |
| design | Original design sequence |
| three_padding | 3' padding hairpins |
| barcode | Barcode hairpin |
| three_const | 3' constant/primer region |
| begin | 1-based start position of design in full sequence |
| end | 1-based end position of design in full sequence |
The full sequence is: five_const + five_padding + design + three_padding + barcode + three_const
The begin and end columns indicate where the original design is located within this full sequence. Sanity check: end - begin + 1 == len(design).
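The sanity check above can be scripted. The following is a minimal sketch, assuming the CSV uses exactly the column names listed in the table. The example row is made up for illustration and is not real fld output.

```python
import csv
import io

# Sequence columns in 5' to 3' order, as listed in the column table.
PARTS = ["five_const", "five_padding", "design",
         "three_padding", "barcode", "three_const"]

def full_sequence(row):
    """Concatenate the sequence columns into the full construct."""
    return "".join(row[part] for part in PARTS)

def check_row(row):
    """Return True if begin/end (1-based, inclusive) locate the design."""
    full = full_sequence(row)
    begin, end = int(row["begin"]), int(row["end"])
    design = row["design"]
    return end - begin + 1 == len(design) and full[begin - 1:end] == design

# Hypothetical one-row library, standing in for output/library.csv.
example = io.StringIO(
    "index,name,sublibrary,five_const,five_padding,design,"
    "three_padding,barcode,three_const,begin,end\n"
    "1,demo,designs,ACTC,GGAAACC,ATGGCT,,CCGAAAGG,AAAC,12,17\n"
)
rows = list(csv.DictReader(example))
results = [check_row(row) for row in rows]
```

To check a real library, replace the `io.StringIO` object with an open file handle on the library CSV.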
When using fld design or other commands that read CSV files, only the sequence columns are required:
| Required Column | Description |
|---|---|
| five_const | 5' constant/primer region |
| five_padding | 5' padding hairpins |
| design | Original design sequence |
| three_padding | 3' padding hairpins |
| barcode | Barcode hairpin |
| three_const | 3' constant/primer region |
Optional metadata columns (index, name, sublibrary, begin, end) are used if present, otherwise sensible defaults are applied. Columns can appear in any order.
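Before handing a hand-edited CSV to fld, it can help to confirm that the required columns are present. This is a minimal sketch of such a check, not part of fld itself. The helper name `missing_columns` is hypothetical, and the sketch does not try to reproduce the defaults fld applies to missing metadata columns.

```python
import csv
import io

# Required sequence columns, taken from the table above.
REQUIRED = {"five_const", "five_padding", "design",
            "three_padding", "barcode", "three_const"}

def missing_columns(header):
    """Return the required columns absent from a CSV header, sorted."""
    return sorted(REQUIRED - set(header))

# Hypothetical header: columns in arbitrary order, no metadata, no barcode.
example = io.StringIO(
    "design,three_const,five_const,three_padding,five_padding\n"
    "ATGGCT,AAAC,ACTC,,GGAAACC\n"
)
reader = csv.DictReader(example)
missing = missing_columns(reader.fieldnames)
```

Because the check works on sets of names, column order does not matter, which matches the statement that columns can appear in any order.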
| Option | Default | Description |
|---|---|---|
| -o | (required) | Output directory |
| --pad-to | 130 | Target sequence length for design region |
| --barcode-length | 10 | Barcode stem length (0 to disable) |
| --no-barcodes | false | Skip barcode generation entirely |
| --m2 | false | Generate M2-seq complement sequences |
| --predict | false | Run rn-coverage prediction with padding and barcode balancing |
| --sort-by-reads | false | Sort output by predicted reads (default: preserve input order) |
| --overwrite | false | Overwrite existing output directory |
| --five-const | ACTCGAGTAGAGTCGAAAA | 5' constant sequence |
| --three-const | AAAAGAAACAACAACAACAAC | 3' constant sequence |
| --min-stem-length | 7 | Minimum hairpin stem length |
| --max-stem-length | 13 | Maximum hairpin stem length |
| --max-gc | 5 | Maximum GC pairs per stem |
| --max-gu | 0 | Maximum GU pairs per stem |
| --closing-gc | 1 | GC pairs to close each stem |
| --spacer | 2 | PolyA spacer length between stems |
"Not enough barcodes" error:
Increase --max-gu (e.g., --max-gu 1 or --max-gu 2). You can also increase --max-gc, but this may affect experimental results.
rn-coverage not found: Install from https://github.com/hmblair/rn-coverage and ensure it's on your PATH.
Convert FASTA to CSV format for manual processing:
fld preprocess -o library.csv --sublibrary mylib designs.fasta
Run just the design step (padding + barcoding) on a CSV:
fld design -o output --pad-to 130 --barcode-length 10 library.csv
Manually merge barcodes with read-count balancing:
fld merge --library library.csv --library-reads lib_reads.txt \
--barcodes barcodes.txt --barcode-reads bc_reads.txt \
-o merged
Add read counts to a library CSV:
fld sort --reads reads.txt -o sorted input.csv
Use --sort-by-reads to sort by read count, --descending for highest first.
Generate standalone barcodes:
fld barcodes --count 1000 --length 10 -o barcodes.txt
Check sequence lengths in FASTA files:
fld inspect designs.fasta
fld inspect --sort designs.fasta # Sort by count
Generate M2-seq complement sequences:
fld m2 -o output.fasta input.fasta
fld m2 -o output.fasta --all input.fasta # All three mutants
Add a prefix to all sequences:
fld prepend --sequence GGGAACG -o output.fasta input.fasta
Convert between DNA and RNA:
fld to-rna -o rna.fasta dna.fasta # T → U
fld to-dna -o dna.fasta rna.fasta # U → T
Convert library CSV to plain text (one sequence per line):
fld txt -o output library.csv
Compare two FASTA files:
fld diff file1.fasta file2.fasta
Duplicate sequences N times:
fld duplicate --count 3 -o output.fasta input.fasta
Generate random sequences:
fld random --count 100 --length 50 -o random.txt
fld random --count 100 --length 50 -o random.fasta --fasta
Split FASTA by sequence length:
fld categorize -o output_dir/ --bins 130 240 500 input.fasta
Run the test suite:
fld test
A complete construct has this structure (5' to 3'):
[5' const] [5' padding] [design] [3' padding] [barcode] [3' const]
- 5'/3' const: Primer binding sites (constant across library)
- 5'/3' padding: Hairpin structures to reach target length
- design: Your original sequence of interest
- barcode: Unique identifier hairpin for each sequence
Sequences are padded to the target length using hairpin structures:
- Stem length randomly sampled from configured range
- Each stem closed with GC pairs (configurable)
- Tetraloop chosen from common stable loops
- Multiple stems separated by polyA spacers
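To make the hairpin layout concrete, here is an illustrative sketch of one padding hairpin (stem, tetraloop, reverse-complement stem) and of joining several with a polyA spacer. It is not fld's implementation. The following are assumptions made for the example: the RNA alphabet, GAAA as the tetraloop, closing GC pairs placed next to the loop, closing pairs counted against the GC limit, and no GU pairs.

```python
import random

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Watson-Crick reverse complement of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def make_hairpin(stem_length, closing_gc=1, max_gc=5, loop="GAAA", rng=None):
    """Build stem + loop + reverse-complement stem.

    Assumption: closing GC pairs sit next to the loop and count
    toward max_gc; the remaining positions are sampled at random.
    """
    rng = rng or random.Random()
    gc_budget = max_gc - closing_gc
    five_prime = []
    for _ in range(stem_length - closing_gc):
        # Once the GC budget is used up, only A/U may be chosen.
        base = rng.choice("AUGC" if gc_budget > 0 else "AU")
        if base in "GC":
            gc_budget -= 1
        five_prime.append(base)
    five_prime.extend("G" * closing_gc)
    left = "".join(five_prime)
    return left + loop + reverse_complement(left)

def join_hairpins(hairpins, spacer=2):
    """Separate consecutive hairpins with a polyA spacer."""
    return ("A" * spacer).join(hairpins)

rng = random.Random(0)
hairpin = make_hairpin(8, closing_gc=2, max_gc=4, rng=rng)
padding = join_hairpins([hairpin, make_hairpin(7, rng=rng)], spacer=2)
```

Each hairpin has length 2 × stem length + loop length, so reaching an exact target length means choosing stem lengths and spacers whose total fits the remaining gap.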
Barcodes are hairpin structures with:
- Configurable stem length
- Same GC/GU constraints as padding
- Guaranteed Hamming distance ≥ 2 between all barcodes
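The distance guarantee can be checked independently on any barcode list. This minimal sketch computes the smallest pairwise Hamming distance. It assumes equal-length barcodes compared position by position over the whole sequence, since the text does not say whether fld measures distance on the stem or on the full hairpin. The example barcodes are made up.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_hamming(barcodes):
    """Smallest Hamming distance over all pairs of barcodes."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

# Hypothetical barcodes, for illustration only.
barcodes = ["AAAA", "AAUU", "UUAA"]
ok = min_pairwise_hamming(barcodes) >= 2
```

The pairwise scan is quadratic in the number of barcodes, which is fine for a few thousand sequences but slow for much larger sets.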
When using --predict, both padding and barcodes are assigned to balance coverage using a two-round inverse pairing strategy:
Round 1 — Padding: Within each group of designs that share the same length (and therefore need the same amount of padding):
- Sort designs by predicted reads (ascending)
- Sort padding sequences by predicted reads (descending)
- Pair them: low-read designs get high-read padding
Round 2 — Barcodes: On the now-padded library:
- Sort padded designs by predicted reads (ascending)
- Sort barcodes by predicted reads (descending)
- Pair them: low-read designs get high-read barcodes
This two-round approach runs 5 prediction steps total, predicting each component in isolation first, then predicting the assembled combinations after each merge.
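The inverse pairing step used in both rounds can be sketched as follows. This illustrates the sort-and-pair idea described above and is not fld's code. The function name `inverse_pair` and the read counts are hypothetical.

```python
def inverse_pair(design_reads, component_reads):
    """Pair low-read designs with high-read components.

    Both arguments map a name to its predicted read count.
    Returns a dict from design name to assigned component name.
    """
    if len(design_reads) != len(component_reads):
        raise ValueError("need one component per design")
    # Designs sorted by predicted reads, ascending.
    designs = sorted(design_reads, key=design_reads.get)
    # Components (padding or barcodes) sorted by predicted reads, descending.
    components = sorted(component_reads, key=component_reads.get, reverse=True)
    return dict(zip(designs, components))

# Made-up predicted read counts.
design_reads = {"d1": 120, "d2": 30, "d3": 75}
barcode_reads = {"b1": 0.9, "b2": 1.4, "b3": 1.1}
pairing = inverse_pair(design_reads, barcode_reads)
```

For Round 1, the same function would be applied separately to each group of designs that share a length, using the padding sequences generated for that group.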
