DNA counts differ between conditions that use the same DNA FASTQ file

Hello,

I ran the experiment portion of MPRAsnakeflow and noticed that I’m getting different DNA counts for the same replicate, even though both conditions use the same DNA FASTQ file.

My understanding is that if two conditions share the same DNA file, the resulting DNA counts for each oligo should be identical, since they’re derived from the same source. However, that’s not what I’m seeing.

Here is my experiment file for two conditions:
```
Condition,Replicate,DNA_BC_F,RNA_BC_F
jurkat0hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_0hr.fastq.gz
jurkat0hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_0hr.fastq.gz
jurkat0hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_0hr.fastq.gz
jurkat0hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_0hr.fastq.gz
jurkat0hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_0hr.fastq.gz
jurkat12hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_12hr.fastq.gz
jurkat12hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_12hr.fastq.gz
jurkat12hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_12hr.fastq.gz
jurkat12hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_12hr.fastq.gz
jurkat12hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_12hr.fastq.gz
```

Here, `jurkat0hr_rep1` and `jurkat12hr_rep1` both use the same DNA file `ms_plasmid_rep1.fastq.gz`.

Despite sharing the same DNA input, these two conditions show different DNA counts for the same oligos.

**Example:** oligo `10:115676701:NA:NA`

File | replicate | dna_counts
-- | -- | --
jurkat0hr_allreps_merged.tsv.gz | 1 | 1445
jurkat12hr_allreps_merged.tsv.gz | 1 | 1850

(pwd: `results/experiments/MSmpraExp/assigned_counts/MSmpraAssignBbmapFW/default/`)

First 5 lines (header plus 4 rows) of `jurkat0hr_allreps_merged.tsv.gz`:
```
replicate	oligo_name	dna_counts	rna_counts	dna_normalized	rna_normalized	log2FoldChange	n_bc
1	10:115676701:NA:NA	1445	1095	1.0049	0.9281	-0.1147	81
1	10:124924855:NA:NA	1958	1230	1.161	0.8889	-0.3853	95
1	10:1307637:T:C:R:wC	4852	4688	1.1065	1.303	0.2358	247
1	10:134755559:NA:NA	2403	1674	1.1569	0.9822	-0.2361	117
```

First 5 lines (header plus 4 rows) of `jurkat12hr_allreps_merged.tsv.gz`:
```
replicate	oligo_name	dna_counts	rna_counts	dna_normalized	rna_normalized	log2FoldChange	n_bc
1	10:115676701:NA:NA	1850	1390	1.1857	0.9526	-0.3159	83
1	10:124924855:NA:NA	2124	1947	0.9495	0.9306	-0.029	119
1	10:1307637:T:C:R:wC	6803	10559	1.0934	1.8145	0.7308	331
1	10:134755559:NA:NA	2534	2160	1.037	0.9451	-0.1338	130
```
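
To check whether this affects every oligo and not just the first few rows, the two tables can be compared key by key. Below is a minimal sketch, assuming the column layout shown above; `read_dna_counts` and `compare_dna_counts` are hypothetical helper names, not part of MPRAsnakeflow, and the inline sample rows are copied from the two tables above.

```python
import csv
import io


def read_dna_counts(handle):
    """Map (replicate, oligo_name) -> dna_counts from a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["replicate"], row["oligo_name"]): int(row["dna_counts"])
        for row in reader
    }


def compare_dna_counts(handle_a, handle_b):
    """Return {key: (count_a, count_b)} for shared keys whose counts differ."""
    a = read_dna_counts(handle_a)
    b = read_dna_counts(handle_b)
    shared = sorted(a.keys() & b.keys())
    return {key: (a[key], b[key]) for key in shared if a[key] != b[key]}


# Sample rows taken from the two merged tables shown above (subset of columns).
HEADER = "replicate\toligo_name\tdna_counts"
T0 = "\n".join([
    HEADER,
    "1\t10:115676701:NA:NA\t1445",
    "1\t10:124924855:NA:NA\t1958",
]) + "\n"
T12 = "\n".join([
    HEADER,
    "1\t10:115676701:NA:NA\t1850",
    "1\t10:124924855:NA:NA\t2124",
]) + "\n"

diff = compare_dna_counts(io.StringIO(T0), io.StringIO(T12))
for (replicate, oligo), (count_0hr, count_12hr) in diff.items():
    print(replicate, oligo, count_0hr, count_12hr)
```

For the real output files, the same functions should work on handles opened with `gzip.open(path, "rt")` in place of the `io.StringIO` objects.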

The counts also differ depending on which tool I used in the assignment stage. For oligo `10:115676701:NA:NA`:

Tool | File | replicate | dna_counts
-- | -- | -- | --
bbmap | jurkat0hr_allreps_merged.tsv.gz | 1 | 1445
bbmap | jurkat12hr_allreps_merged.tsv.gz | 1 | 1850
bwa | jurkat0hr_allreps_merged.tsv.gz | 1 | 1407
bwa | jurkat12hr_allreps_merged.tsv.gz | 1 | 1791
exact | jurkat0hr_allreps_merged.tsv.gz | 1 | 764
exact | jurkat12hr_allreps_merged.tsv.gz | 1 | 900

Am I misunderstanding the output file? Is this discrepancy expected behavior, or could this indicate a bug in how the DNA counts are processed across conditions using the same file? 

Best,
Grace 

For reference, this is my config file:
```
version: "0.5"
experiments:
  MSmpraExp: # change
    bc_length: 20
    data_folder: /home/go274/palmer_scratch/practice/MS_mpra_exp_data
    experiment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/experiment.csv
    demultiplex: false
    assignments:
      MSmpraAssignBbmapFW:
        type: file
        assignment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/results/assignment/forward_MS_mpra_assign_bbmap/assignment_barcodes.default_config.tsv.gz
      MSmpraAssignBwaFW:
        type: file
        assignment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/results/assignment/forward_MS_mpra_assign_bwa/assignment_barcodes.default_config.tsv.gz
      MSmpraAssignExactFW:
        type: file
        assignment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/results/assignment/forward_MS_mpra_assign_exact/assignment_barcodes.default_config.tsv.gz
    design_file: /home/go274/palmer_scratch/practice/sequences.fasta
    label_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/labels.tsv.gz
    configs:
      default:
        filter:
            bc_threshold: 1 # Changed from 10 default to 1, to assume most lenient BC counting
            min_dna_counts: 1
            min_rna_counts: 1
            outlier_detection:
              methods: none
              mad_bins: 20
              times_mad: 5
              times_zscore: 3
            DNA:
              min_counts: 1
            RNA:
              min_counts: 1
```
