DNA counts differ between conditions that use the same DNA FASTQ file #226
Description
Hello,
I ran the experiment portion of MPRAsnakeflow and noticed that I’m getting different DNA counts for the same replicate, even though both conditions use the same DNA FASTQ file.
My understanding is that if two conditions share the same DNA file, the resulting DNA counts for each oligo should be identical, since they’re derived from the same source. However, that’s not what I’m seeing.
Here is my experiment file for two conditions:
Condition,Replicate,DNA_BC_F,RNA_BC_F
jurkat0hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_0hr.fastq.gz
jurkat0hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_0hr.fastq.gz
jurkat0hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_0hr.fastq.gz
jurkat0hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_0hr.fastq.gz
jurkat0hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_0hr.fastq.gz
jurkat12hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_12hr.fastq.gz
jurkat12hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_12hr.fastq.gz
jurkat12hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_12hr.fastq.gz
jurkat12hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_12hr.fastq.gz
jurkat12hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_12hr.fastq.gz
Here, jurkat0hr_rep1 and jurkat12hr_rep1 both use the same DNA file ms_plasmid_rep1.fastq.gz.
Despite sharing the same DNA input, these two conditions show different DNA counts for the same oligos.
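A minimal sketch of how the sharing of DNA files can be checked directly from the experiment file. It uses only the Python standard library and embeds two rows of the CSV above as sample input; the function name `shared_dna_files` is hypothetical and not part of MPRAsnakeflow.

```python
import csv
import io
from collections import defaultdict

# Two rows copied from the experiment file above (sample input only).
EXPERIMENT_CSV = """\
Condition,Replicate,DNA_BC_F,RNA_BC_F
jurkat0hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_0hr.fastq.gz
jurkat12hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_12hr.fastq.gz
"""


def shared_dna_files(csv_text):
    """Map each DNA FASTQ file to the (condition, replicate) pairs using it.

    Hypothetical helper: only files referenced by more than one pair are kept.
    """
    users = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        users[row["DNA_BC_F"]].append((row["Condition"], row["Replicate"]))
    return {f: pairs for f, pairs in users.items() if len(pairs) > 1}


shared = shared_dna_files(EXPERIMENT_CSV)
for dna_file, pairs in shared.items():
    print(dna_file, pairs)
```

To run this on the full experiment file, replace `EXPERIMENT_CSV` with the contents of `experiment.csv`.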
Example: oligo 10:115676701:NA:NA
| File | replicate | dna_counts |
|---|---|---|
| jurkat0hr_allreps_merged.tsv.gz | 1 | 1445 |
| jurkat12hr_allreps_merged.tsv.gz | 1 | 1850 |
(pwd: results/experiments/MSmpraExp/assigned_counts/MSmpraAssignBbmapFW/default/)
First 5 rows of jurkat0hr_allreps_merged.tsv.gz
| replicate | oligo_name | dna_counts | rna_counts | dna_normalized | rna_normalized | log2FoldChange | n_bc |
|---|---|---|---|---|---|---|---|
| 1 | 10:115676701:NA:NA | 1445 | 1095 | 1.0049 | 0.9281 | -0.1147 | 81 |
| 1 | 10:124924855:NA:NA | 1958 | 1230 | 1.161 | 0.8889 | -0.3853 | 95 |
| 1 | 10:1307637:T:C:R:wC | 4852 | 4688 | 1.1065 | 1.303 | 0.2358 | 247 |
| 1 | 10:134755559:NA:NA | 2403 | 1674 | 1.1569 | 0.9822 | -0.2361 | 117 |
First 5 rows of jurkat12hr_allreps_merged.tsv.gz
| replicate | oligo_name | dna_counts | rna_counts | dna_normalized | rna_normalized | log2FoldChange | n_bc |
|---|---|---|---|---|---|---|---|
| 1 | 10:115676701:NA:NA | 1850 | 1390 | 1.1857 | 0.9526 | -0.3159 | 83 |
| 1 | 10:124924855:NA:NA | 2124 | 1947 | 0.9495 | 0.9306 | -0.029 | 119 |
| 1 | 10:1307637:T:C:R:wC | 6803 | 10559 | 1.0934 | 1.8145 | 0.7308 | 331 |
| 1 | 10:134755559:NA:NA | 2534 | 2160 | 1.037 | 0.9451 | -0.1338 | 130 |
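A minimal sketch of the per-oligo comparison between the two merged files, using the rows shown above as inline sample data. The helper names `load_counts` and `compare_dna_counts` are hypothetical. The sketch reports `n_bc` next to `dna_counts`, since the number of barcodes per oligo also differs between the two files.

```python
HEADER = ("replicate oligo_name dna_counts rna_counts "
          "dna_normalized rna_normalized log2FoldChange n_bc")

# Rows copied from the two merged files shown above.
JURKAT_0HR = HEADER + """
1 10:115676701:NA:NA 1445 1095 1.0049 0.9281 -0.1147 81
1 10:124924855:NA:NA 1958 1230 1.161 0.8889 -0.3853 95
1 10:1307637:T:C:R:wC 4852 4688 1.1065 1.303 0.2358 247
1 10:134755559:NA:NA 2403 1674 1.1569 0.9822 -0.2361 117
"""

JURKAT_12HR = HEADER + """
1 10:115676701:NA:NA 1850 1390 1.1857 0.9526 -0.3159 83
1 10:124924855:NA:NA 2124 1947 0.9495 0.9306 -0.029 119
1 10:1307637:T:C:R:wC 6803 10559 1.0934 1.8145 0.7308 331
1 10:134755559:NA:NA 2534 2160 1.037 0.9451 -0.1338 130
"""


def load_counts(text):
    """Parse whitespace-separated rows into {(replicate, oligo): (dna_counts, n_bc)}."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    counts = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        key = (row["replicate"], row["oligo_name"])
        counts[key] = (int(row["dna_counts"]), int(row["n_bc"]))
    return counts


def compare_dna_counts(a, b):
    """Return oligos present in both tables whose dna_counts differ."""
    diffs = {}
    for key in sorted(a.keys() & b.keys()):
        (dna_a, nbc_a), (dna_b, nbc_b) = a[key], b[key]
        if dna_a != dna_b:
            diffs[key] = (dna_a, dna_b, nbc_a, nbc_b)
    return diffs


diffs = compare_dna_counts(load_counts(JURKAT_0HR), load_counts(JURKAT_12HR))
for (replicate, oligo), (dna_a, dna_b, nbc_a, nbc_b) in diffs.items():
    print(f"rep {replicate} {oligo}: dna {dna_a} vs {dna_b}, n_bc {nbc_a} vs {nbc_b}")
```

For the real files, the text could be read with `gzip.open(path, "rt").read()` in place of the inline strings.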
The counts also differ depending on which tool I used in the assignment stage:
For oligo 10:115676701:NA:NA
| Tool | File | replicate | dna_counts |
|---|---|---|---|
| bbmap | jurkat0hr_allreps_merged.tsv.gz | 1 | 1445 |
| bbmap | jurkat12hr_allreps_merged.tsv.gz | 1 | 1850 |
| bwa | jurkat0hr_allreps_merged.tsv.gz | 1 | 1407 |
| bwa | jurkat12hr_allreps_merged.tsv.gz | 1 | 1791 |
| exact | jurkat0hr_allreps_merged.tsv.gz | 1 | 764 |
| exact | jurkat12hr_allreps_merged.tsv.gz | 1 | 900 |
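To put the discrepancy in the table above on a common scale, here is a small sketch that computes the absolute difference and the 12hr/0hr ratio of `dna_counts` for each assignment tool. The numbers are copied from the table; the function name `summarize` is hypothetical.

```python
# dna_counts for oligo 10:115676701:NA:NA, replicate 1, copied from the table above.
DNA_COUNTS = {
    "bbmap": {"jurkat0hr": 1445, "jurkat12hr": 1850},
    "bwa": {"jurkat0hr": 1407, "jurkat12hr": 1791},
    "exact": {"jurkat0hr": 764, "jurkat12hr": 900},
}


def summarize(counts):
    """Return {tool: (12hr - 0hr difference, 12hr / 0hr ratio)}."""
    summary = {}
    for tool, by_condition in counts.items():
        diff = by_condition["jurkat12hr"] - by_condition["jurkat0hr"]
        ratio = by_condition["jurkat12hr"] / by_condition["jurkat0hr"]
        summary[tool] = (diff, ratio)
    return summary


summary = summarize(DNA_COUNTS)
for tool, (diff, ratio) in summary.items():
    print(f"{tool}: difference={diff}, ratio={ratio:.3f}")
```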
Am I misunderstanding the output file? Is this discrepancy expected behavior, or could this indicate a bug in how the DNA counts are processed across conditions using the same file?
Best,
Grace
This is my config file.
    version: "0.5"
    experiments:
      MSmpraExp: # change
        bc_length: 20
        data_folder: /home/go274/palmer_scratch/practice/MS_mpra_exp_data
        experiment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/experiment.csv
        demultiplex: false
        assignments:
          MSmpraAssignBbmapFW:
            type: file
            assignment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/results/assignment/forward_MS_mpra_assign_bbmap/assignment_barcodes.default_config.tsv.gz
          MSmpraAssignBwaFW:
            type: file
            assignment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/results/assignment/forward_MS_mpra_assign_bwa/assignment_barcodes.default_config.tsv.gz
          MSmpraAssignExactFW:
            type: file
            assignment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/results/assignment/forward_MS_mpra_assign_exact/assignment_barcodes.default_config.tsv.gz
        design_file: /home/go274/palmer_scratch/practice/sequences.fasta
        label_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/labels.tsv.gz
        configs:
          default:
            filter:
              bc_threshold: 1 # Changed from 10 default to 1, to assume most lenient BC counting
              min_dna_counts: 1
              min_rna_counts: 1
              outlier_detection:
                methods: none
                mad_bins: 20
                times_mad: 5
                times_zscore: 3
              DNA:
                min_counts: 1
              RNA:
                min_counts: 1