extract_clusters skipping annotation for first and last proteins #123

@galacmr

Hello!
The latest version of cblaster seems to have an issue with the extract_clusters command: it does not pull the annotations for the proteins listed first and last in the output from running search.

-- Example --
Here are the proteins that were found:

Cluster 1 score 6.3:
Query                                                      Subject             Identity  Coverage  E-value  Bitscore  Start
intermediate                                               WP_002526074.1      -         -         -        -         657707
intermediate                                               WP_204158246.1      -         -         -        -         658253
intermediate                                               WP_171015102.1      -         -         -        -         660183
intermediate                                               WP_002522771.1      -         -         -        -         660448
intermediate                                               WP_002522770.1      -         -         -        -         661923
intermediate                                               HMPREF9575_RS13755  -         -         -        -         662199  662272
intermediate                                               HMPREF9575_RS13480  -         -         -        -         662311
response_regulator_receiver_domain_protein_translation     WP_002522768.1      100       100       3e-141   403       662707
histidine_kinase_translation                               WP_002522767.1      100       100       0        779       663312
intermediate                                               WP_002526075.1      -         -         -        -         664515
hypothetical_protein_translation                           WP_002526076.1      99.8      100       0        1154      664760
transporter _major_facilitator_family_protein_translation  WP_032501619.1      100       100       0        885       667259
intermediate                                               WP_002522761.1      -         -         -        -         668730
intermediate                                               WP_002522760.1      -         -         -        -         669860
intermediate                                               WP_002522759.1      -         -         -        -         670528
intermediate                                               WP_002526079.1      -         -         -        -         672873

And here is the gbk file that extract_clusters created, which does not include the first or last gene listed in the search output (note: I left out the middle genes for brevity):


LOCUS       NZ_GL383323.1          15998 bp    DNA              UNK 01-JAN-1980
DEFINITION  Genes for cluster 1 on scaffold NZ_GL383323.1 of species
            Cutibacterium acnes HL110PA1.
ACCESSION   NZ_GL383323
VERSION     NZ_GL383323.1

FEATURES             Location/Qualifiers
     CDS             complement(547..2478)
                     /protein_id="WP_204158246.1"
                     /translation="MTWVQASFWRSQDQINDTRDLSDLLASPATVPVGMRWYREPNEAS
                     IFNITDPEANQTFKPGDSGSFTVTGTPAQMGLASPNAVDAIGIHVQASPENQSRRTVGR
                     ARVLTVLSDAHTSANLAPVIVLSTMPTRRIDGTFTDESLADDITHRLKPLAEAAHTRNA
                     TVLVDPSLIDEARAMASGYRVAGKGTATVEGKGQQTAREWLDLVDPLLTTGQAYRLPYG
                     NADVIGAVRQGRPNVLLTVKHALDPSNPAAKLPLAVVDPSAELDRSSFKTLTKELSPAL
                     VLTCAASARDGVRGESGGKGIGLADTARTDGHPQSNSDPQRRGMLLSQALLMTHESIPA
                     VTLVTTVNDVQATAPVGWLHLQNLSAVLTGAKPGLRLPGTRAGDITLKGPWWRVQHDVG
                     IDSDDWSDLVGAPTEATSLTSAKFVSRSLSSSLQDREAWATDVMRPAADAMAGKGLVLH
                     SAPQFVMSSSTNDFPLTVTNSLAQTVHVKVVVFSENPQRIDIPDTQVVTIQPRETQTIR
                     FAPKASSNGVIEMQAHLSTPSGRSLGSQTSFVVKATQMDDVGWIIIVVSALVLIIATVL
                     RIRQVTASSRRQAESNGEPQTSGPTAGSTSDNISDTTPSPSAVEDPDTASDDDSEHHLP
                     TGEGNLAE"
                     /cluster_role="intermediate"
     CDS             2477..2722
                     /protein_id="WP_171015102.1"
                     /translation="MDTEEVFVTVPLMVMGVLGSPVGGLTEVIVTLTNSAAVRAWAGED
                     PAKTVVRTINNPSSQAAGQRREAGDTFVMVVGRGRD"
                     /cluster_role="intermediate"
     ~~ADDITIONAL GENES IN HERE~~
     CDS             complement(12154..12825)
                     /protein_id="WP_002522760.1"
                     /translation="MTTTLLPRVAQPGPGVSHVDTDRPIECIITHHPHMPRDRGGHATS
                     VSGRYLLRQAAELMLGTDPAHCPVVDPSRRWYWPGINLHGSVSHVPGWSLTTLSTGGHI
                     GADIQDFRERPGAMAFIGDLVKLSRSASLREFAECEAVVKVSELTKETFGHVRLPEWTP
                     GWRHVFEDYWVWSLEMHGMGVIALASDLPRAIRWWRCDADARGRLQALRPISSLGPGRP
                     S"
                     /cluster_role="intermediate"
     CDS             complement(12822..15161)
                     /protein_id="WP_002522759.1"
                     /translation="MTITAENATTRSDIARSIAVTGVGLVTAQGDHTDECWTELVDGVC
                     GITMNVTFDDSGTTIPCAGVAPIPNSDSIDRCYLLGVHAMREALEMSGIDLDSVGRDRI
                     GLVVGSSLGAMPTLEAAHRRAIETGVLDAGLAADSQLHCVADHLAAEFDIRGPRVVTSN
                     ACAAGAVAIGYAAELLWSDDVDLVVCGGVDPLAQISANGFTCLGALDNLPCSPMAGSSG
                     LTLGEGAGFMVLERTDAAAARGQEVMAEIAGYGTSCDGYHQTAPDPGGNGARSSMEAAL
                     RSAHLKPSDVSYVNLHGTGTPTNDAVEPKALRSLFKSDDLPPVSSVKGAIGHTLGAAGA
                     IEAVCSIKAIHEGVLPPTVNNRGQASRTGLDIVPECARKAAPDVVISNSFAFGGNNASV
                     VITAPRGGVHCTAPAQLREVGISGMAALAGKAANSEELLSALSEDCPIWMADEKTWEGD
                     AVQTGHVDIKRLSRTINPSKVRRMDPLGIISSAVVTDLYARHGKLSRKDAESTGIIFAT
                     GYGPVTAVTQFNDGIIRHGSEGANALVFPNTVVNAAAGHLAMLNRYRGYTATLACGGTS
                     SLMALLLAARVVGRGAADRIMVVIADEFPSIAVQAVAKLPGYRHRVDGSGAVLSEGAVC
                     VLVEAVEVAEARGTAPMALLRGFGSRGESVGVGHTASDGRAWAKAMAAALGPAGLTASD
                     VSTVVAASSGHPRVDRAEQAARRIVGLSATATTFPKAIVGETHGSAAGIGLFGALCGSR
                     SAAHQNILVNAFSHGGGYASMVVESL"
                     /cluster_role="intermediate"
ORIGIN.....

It looks like extract_clusters is pulling the full DNA sequence but not all of the annotations. You can see this in the gbk file: the record is 15998 bp long, but the first CDS starts at position 547 and the last one ends at position 15161, so neither of the genes listed reaches the end of the sequence. This is also evident in the clinker images, where none of the clusters end with arrows; they all end with sticks of unannotated sequence.
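In case it helps with reproducing this, here is a minimal sketch of how the gap could be checked without opening each file by hand. It is not part of cblaster: `cds_summary` and `missing_ids` are hypothetical helper names, it uses only the Python standard library, and it assumes the simple layout seen in the file above (plain `N..M` or `complement(N..M)` locations, with `/protein_id` as the first qualifier of each CDS).

```python
import re

# Matches a CDS line with a simple location, followed directly by its
# /protein_id qualifier (the layout assumed from the example file).
CDS_PATTERN = re.compile(
    r'^[ \t]+CDS[ \t]+(?:complement\()?(\d+)\.\.(\d+)\)?[ \t]*\n'
    r'[ \t]+/protein_id="([^"]+)"',
    re.MULTILINE,
)


def cds_summary(gbk_text):
    """Return (locus_length, [(start, end, protein_id), ...]) for one record."""
    locus = re.search(r"^LOCUS\s+\S+\s+(\d+) bp", gbk_text, re.MULTILINE)
    length = int(locus.group(1)) if locus else None
    features = [
        (int(m.group(1)), int(m.group(2)), m.group(3))
        for m in CDS_PATTERN.finditer(gbk_text)
    ]
    return length, features


def missing_ids(expected_ids, features):
    """List the expected protein IDs that have no CDS feature, in input order."""
    found = {protein_id for _, _, protein_id in features}
    return [protein_id for protein_id in expected_ids if protein_id not in found]
```

Passing the Subject column from the search output as `expected_ids` and the text of the extracted gbk file as `gbk_text` should list the proteins that lost their annotation. The unannotated flanks can be estimated from the same result, as the first CDS start minus one and the locus length minus the largest CDS end.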

I looked back at some older cblaster results from previous versions, and this isn't how the software used to behave. I didn't see any changes listed that would have led to this difference. Can this be fixed?
Let me know if there is anything you need from me that would help!
Thanks!
