DNA counts differ between conditions that use the same DNA FASTQ file #226

@graceoualline

Hello,

I ran the experiment portion of MPRAsnakeflow and noticed that I’m getting different DNA counts for the same replicate, even though both conditions use the same DNA FASTQ file.

My understanding is that if two conditions share the same DNA file, the resulting DNA counts for each oligo should be identical, since they’re derived from the same source. However, that’s not what I’m seeing.

Here is my experiment file for two conditions:

Condition,Replicate,DNA_BC_F,RNA_BC_F
jurkat0hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_0hr.fastq.gz
jurkat0hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_0hr.fastq.gz
jurkat0hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_0hr.fastq.gz
jurkat0hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_0hr.fastq.gz
jurkat0hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_0hr.fastq.gz
jurkat12hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_12hr.fastq.gz
jurkat12hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_12hr.fastq.gz
jurkat12hr,3,ms_plasmid_rep3.fastq.gz,ms_Jurkat_MS_rep3_12hr.fastq.gz
jurkat12hr,4,ms_plasmid_rep4.fastq.gz,ms_Jurkat_MS_rep4_12hr.fastq.gz
jurkat12hr,5,ms_plasmid_rep5.fastq.gz,ms_Jurkat_MS_rep5_12hr.fastq.gz

Here, jurkat0hr_rep1 and jurkat12hr_rep1 both use the same DNA file ms_plasmid_rep1.fastq.gz.
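The sharing can be checked directly from the experiment file. This is a minimal sketch, assuming Python and an inline copy of the first two replicates per condition; the function name `shared_dna_files` is made up for illustration and is not part of MPRAsnakeflow.

```python
import csv
import io
from collections import defaultdict

# Excerpt of the experiment file above (replicates 1 and 2 only, for brevity).
EXPERIMENT_CSV = """\
Condition,Replicate,DNA_BC_F,RNA_BC_F
jurkat0hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_0hr.fastq.gz
jurkat0hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_0hr.fastq.gz
jurkat12hr,1,ms_plasmid_rep1.fastq.gz,ms_Jurkat_MS_rep1_12hr.fastq.gz
jurkat12hr,2,ms_plasmid_rep2.fastq.gz,ms_Jurkat_MS_rep2_12hr.fastq.gz
"""


def shared_dna_files(csv_text):
    """Map each DNA FASTQ file to the (condition, replicate) pairs that use it.

    Hypothetical helper for illustration; only files used more than once
    are returned.
    """
    users = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        users[row["DNA_BC_F"]].append((row["Condition"], row["Replicate"]))
    return {f: pairs for f, pairs in users.items() if len(pairs) > 1}


shared = shared_dna_files(EXPERIMENT_CSV)
for dna_file, pairs in shared.items():
    print(dna_file, pairs)
```

Each plasmid file in the excerpt is listed with both conditions for the same replicate number, which is the setup described above.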

Despite sharing the same DNA input, these two conditions show different DNA counts for the same oligos.

Example: oligo 10:115676701:NA:NA

File	replicate	dna_counts
jurkat0hr_allreps_merged.tsv.gz	1	1445
jurkat12hr_allreps_merged.tsv.gz	1	1850

(pwd: results/experiments/MSmpraExp/assigned_counts/MSmpraAssignBbmapFW/default/)

First 5 rows of jurkat0hr_allreps_merged.tsv.gz

replicate	oligo_name	dna_counts	rna_counts	dna_normalized	rna_normalized	log2FoldChange	n_bc
1	10:115676701:NA:NA	1445	1095	1.0049	0.9281	-0.1147	81
1	10:124924855:NA:NA	1958	1230	1.161	0.8889	-0.3853	95
1	10:1307637:T:C:R:wC	4852	4688	1.1065	1.303	0.2358	247
1	10:134755559:NA:NA	2403	1674	1.1569	0.9822	-0.2361	117

First 5 rows of jurkat12hr_allreps_merged.tsv.gz

replicate	oligo_name	dna_counts	rna_counts	dna_normalized	rna_normalized	log2FoldChange	n_bc
1	10:115676701:NA:NA	1850	1390	1.1857	0.9526	-0.3159	83
1	10:124924855:NA:NA	2124	1947	0.9495	0.9306	-0.029	119
1	10:1307637:T:C:R:wC	6803	10559	1.0934	1.8145	0.7308	331
1	10:134755559:NA:NA	2534	2160	1.037	0.9451	-0.1338	130
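To make the comparison explicit, here is a small sketch, assuming Python, that lines up the replicate 1 rows shown above by oligo name. The values are copied from the two tables; the helper name `compare` is made up for illustration. Besides dna_counts, the rows also show n_bc differing between the two conditions for every listed oligo.

```python
# (oligo_name, dna_counts, n_bc) for replicate 1, copied from the rows above.
rows_0hr = [
    ("10:115676701:NA:NA", 1445, 81),
    ("10:124924855:NA:NA", 1958, 95),
    ("10:1307637:T:C:R:wC", 4852, 247),
    ("10:134755559:NA:NA", 2403, 117),
]
rows_12hr = [
    ("10:115676701:NA:NA", 1850, 83),
    ("10:124924855:NA:NA", 2124, 119),
    ("10:1307637:T:C:R:wC", 6803, 331),
    ("10:134755559:NA:NA", 2534, 130),
]


def compare(rows_a, rows_b):
    """Return (oligo, dna_counts difference, n_bc difference) as b minus a."""
    b_by_oligo = {name: (dna, n_bc) for name, dna, n_bc in rows_b}
    diffs = []
    for name, dna_a, n_bc_a in rows_a:
        dna_b, n_bc_b = b_by_oligo[name]
        diffs.append((name, dna_b - dna_a, n_bc_b - n_bc_a))
    return diffs


diffs = compare(rows_0hr, rows_12hr)
for name, d_dna, d_n_bc in diffs:
    print(f"{name}\tdna_counts diff: {d_dna}\tn_bc diff: {d_n_bc}")
```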

The DNA counts also differ depending on which tool I used in the assignment stage. For oligo 10:115676701:NA:NA, replicate 1:

Tool	File	replicate	dna_counts
bbmap	jurkat0hr_allreps_merged.tsv.gz	1	1445
bbmap	jurkat12hr_allreps_merged.tsv.gz	1	1850
bwa	jurkat0hr_allreps_merged.tsv.gz	1	1407
bwa	jurkat12hr_allreps_merged.tsv.gz	1	1791
exact	jurkat0hr_allreps_merged.tsv.gz	1	764
exact	jurkat12hr_allreps_merged.tsv.gz	1	900
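As a quick arithmetic check, assuming Python, the gap between the two conditions can be computed per assignment tool from the numbers in the table above. The dictionary layout is just for illustration.

```python
# dna_counts for oligo 10:115676701:NA:NA, replicate 1, from the table above.
counts = {
    "bbmap": {"jurkat0hr": 1445, "jurkat12hr": 1850},
    "bwa": {"jurkat0hr": 1407, "jurkat12hr": 1791},
    "exact": {"jurkat0hr": 764, "jurkat12hr": 900},
}

# Difference in dna_counts between conditions that share the same DNA FASTQ file.
gaps = {tool: c["jurkat12hr"] - c["jurkat0hr"] for tool, c in counts.items()}

for tool, gap in gaps.items():
    print(f"{tool}: jurkat12hr - jurkat0hr = {gap}")
```

The gap is nonzero for all three tools, so the discrepancy does not appear to be specific to one assignment method.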

Am I misunderstanding the output file? Is this discrepancy expected behavior, or could this indicate a bug in how the DNA counts are processed across conditions using the same file?

Best,
Grace

This is my config file.

version: "0.5"
experiments:
  MSmpraExp: # change
    bc_length: 20
    data_folder: /home/go274/palmer_scratch/practice/MS_mpra_exp_data
    experiment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/experiment.csv
    demultiplex: false
    assignments:
      MSmpraAssignBbmapFW:
        type: file
        assignment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/results/assignment/forward_MS_mpra_assign_bbmap/assignment_barcodes.default_config.tsv.gz
      MSmpraAssignBwaFW:
        type: file
        assignment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/results/assignment/forward_MS_mpra_assign_bwa/assignment_barcodes.default_config.tsv.gz
      MSmpraAssignExactFW:
        type: file
        assignment_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/results/assignment/forward_MS_mpra_assign_exact/assignment_barcodes.default_config.tsv.gz
    design_file: /home/go274/palmer_scratch/practice/sequences.fasta
    label_file: /home/go274/palmer_scratch/practice/MPRAsnakeflow_practice/MS_mpra/labels.tsv.gz
    configs:
      default:
        filter:
            bc_threshold: 1 # Changed from 10 default to 1, to assume most lenient BC counting
            min_dna_counts: 1
            min_rna_counts: 1
            outlier_detection:
              methods: none
              mad_bins: 20
              times_mad: 5
              times_zscore: 3
            DNA:
              min_counts: 1
            RNA:
              min_counts: 1
