Dot plots

                    Dr Avril Coghlan
                   alc@sanger.ac.uk

Note: this talk contains animations that can only be seen by
downloading it and using ‘View Slide Show’ in PowerPoint
Dot plots
• How can we compare the human & Drosophila
  melanogaster Eyeless protein sequences?
  One method is a dot plot
• A dot plot is a graphical method for assessing
  similarity between two sequences
  Make a matrix (table) with one row for each letter in sequence 1, & one
       column for each letter in sequence 2
  Colour in each cell where the row letter and the column letter are identical
  Regions of local similarity between the 2 sequences appear as diagonal
       lines of coloured cells (‘dots’)
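e.g. for sequences ‘RQQEPVRSTC’ and ‘QQESGPVRST’ (* = identical letters):

                              Sequence 2
                   Q   Q   E   S   G   P   V   R   S   T
               R   .   .   .   .   .   .   .   *   .   .
               Q   *   *   .   .   .   .   .   .   .   .
               Q   *   *   .   .   .   .   .   .   .   .
               E   .   .   *   .   .   .   .   .   .   .
Sequence 1     P   .   .   .   .   .   *   .   .   .   .
               V   .   .   .   .   .   .   *   .   .   .
               R   .   .   .   .   .   .   .   *   .   .
               S   .   .   .   *   .   .   .   .   *   .
               T   .   .   .   .   .   .   .   .   .   *
               C   .   .   .   .   .   .   .   .   .   .

     Regions of local similarity between the 2 sequences appear as
     diagonal lines (here QQE and PVRST)
     Some dots off these diagonals may be due to chance similarities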
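
The matrix described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the R code used to draw the slides; the function names `dot_matrix` and `render` are made up for this example.

```python
def dot_matrix(seq1, seq2):
    """Return a grid of booleans: grid[i][j] is True when seq1[i] == seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(seq1, seq2, grid):
    """Draw the grid as text, with seq2 across the top and seq1 down the side."""
    lines = ["  " + " ".join(seq2)]
    for letter, row in zip(seq1, grid):
        lines.append(letter + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


seq1, seq2 = "RQQEPVRSTC", "QQESGPVRST"
grid = dot_matrix(seq1, seq2)
print(render(seq1, seq2, grid))
```

Each `*` is one coloured cell; the diagonal runs of `*` are the regions of local similarity.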
Problem
• Make a dot-plot for DNA sequences “GCATCGGC” &
  “CCATCGCCATCG”. Are there regions of similarity?
Answer
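• Make a dot-plot for DNA sequences “GCATCGGC” &
  “CCATCGCCATCG”. Are there regions of similarity?
       C   C   A   T   C   G   C   C   A   T   C   G
   G   .   .   .   .   .   *   .   .   .   .   .   *
   C   *   *   .   .   *   .   *   *   .   .   *   .
   A   .   .   *   .   .   .   .   .   *   .   .   .
   T   .   .   .   *   .   .   .   .   .   *   .   .
   C   *   *   .   .   *   .   *   *   .   .   *   .
   G   .   .   .   .   .   *   .   .   .   .   .   *
   G   .   .   .   .   .   *   .   .   .   .   .   *
   C   *   *   .   .   *   .   *   *   .   .   *   .

  Yes: CATCG in sequence 1 (letters 2 to 6) appears twice in sequence 2
  (letters 2 to 6 and letters 8 to 12), giving two diagonal lines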
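
Reading diagonals by eye gets harder as the plot grows, so the same answer can be found programmatically. The sketch below lists every maximal run of identical letters along a diagonal; `diagonal_runs` and its `min_len` parameter are hypothetical names chosen for this example, and positions are 0-based.

```python
def diagonal_runs(seq1, seq2, min_len=3):
    """Return (start1, start2, text) for each maximal diagonal run of
    identical letters that is at least min_len long (0-based starts)."""
    runs = []
    for i in range(len(seq1)):
        for j in range(len(seq2)):
            if seq1[i] != seq2[j]:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]:
                continue
            k = 0
            while (i + k < len(seq1) and j + k < len(seq2)
                   and seq1[i + k] == seq2[j + k]):
                k += 1
            if k >= min_len:
                runs.append((i, j, seq1[i:i + k]))
    return runs


runs = diagonal_runs("GCATCGGC", "CCATCGCCATCG")
for start1, start2, text in runs:
    print(f"seq1[{start1}] / seq2[{start2}]: {text}")
```

For these two sequences the only runs of 3 or more letters are the two copies of CATCG.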
Dot plots with thresholds
• If you colour in all cells with an identical letter, some
  dots may be due to chance similarities
• Therefore, it is common to use a threshold to decide
  whether to plot a ‘dot’ in a cell
  A window of a certain size (e.g. window size = 3) is moved along every
        diagonal, one position at a time
  A score is calculated for each position of the window on a diagonal:
        the number of identical letters in the window
  If the score is equal to or above the threshold (e.g. threshold = score of
        2), all the cells in the window are coloured in
  The window size and threshold for the dot plot are usually chosen by
        trial and error
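e.g. for sequences “GCATCGGC” and “CCATCGCCATCG”, using a window
      size of 3, and a threshold of ≥2:

  [Animation: a sliding window of 3 cells moves along each diagonal of the grid,
   with sequence 1 (GCATCGGC) down the rows and sequence 2 (CCATCGCCATCG)
   across the columns]

  At each window position:
          Score = 2 or 3, ≥ threshold → colour in
          Score = 0 or 1, < threshold → leave blank
                                            and so on....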
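
The sliding-window procedure can be sketched as follows. This is a minimal Python illustration under the rules stated on the slide, not the author's R script; `windowed_dot_plot` and the `mark_all` switch are assumptions introduced here. With `mark_all=True` the whole window is coloured, as on the slide; with `mark_all=False` only the first cell of a passing window is coloured, which is the convention some tools use.

```python
def windowed_dot_plot(seq1, seq2, window=3, threshold=2, mark_all=True):
    """Return the set of (row, col) cells to colour, using 0-based indices.

    Every window of `window` consecutive cells on every diagonal is scored
    by counting identical letters. If the score is >= threshold, the whole
    window is coloured (mark_all=True) or only its first cell (mark_all=False).
    """
    coloured = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if score >= threshold:
                cells = range(window) if mark_all else range(1)
                coloured.update((i + k, j + k) for k in cells)
    return coloured


seq1, seq2 = "GCATCGGC", "CCATCGCCATCG"
cells = windowed_dot_plot(seq1, seq2, window=3, threshold=2)
print("  " + " ".join(seq2))
for i, letter in enumerate(seq1):
    row = ("*" if (i, j) in cells else "." for j in range(len(seq2)))
    print(letter + " " + " ".join(row))
```

Note that colouring whole windows can colour a cell that is not itself a match: the top-left cell (G against C) is coloured because its window scores 2.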
Real data: fruitfly & human Eyeless
• A dot plot of fruitfly & human Eyeless proteins:
  [Figure: dot plot with axes labelled ‘Fruitfly Eyeless’ and ‘Human Eyeless’;
   window size = 10, threshold = 3]
  Do you think we chose good values for the
  window size and threshold?
Real data: fruitfly & human Eyeless
• Here is a dot plot of fruitfly and human Eyeless
  proteins, made using window size = 10, threshold = 5:
  [Figure: dot plot with axes labelled ‘Fruitfly Eyeless’ and ‘Human Eyeless’;
   window size = 10, threshold = 5]
  Are there any regions of similarity?
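
One rough way to reason about the two thresholds is to ask how often a window of unrelated letters would pass by chance. The sketch below uses a simple binomial model and assumes, purely for illustration, that all 20 amino acids are equally likely (real proteins are not that uniform, so the numbers are only indicative); `chance_of_passing` is a name invented for this example.

```python
from math import comb


def chance_of_passing(window, threshold, p_match):
    """P(at least `threshold` identities in a window of unrelated letters)."""
    return sum(
        comb(window, k) * p_match**k * (1 - p_match) ** (window - k)
        for k in range(threshold, window + 1)
    )


p_match = 1 / 20  # assumption: 20 amino acids, all equally likely
p3 = chance_of_passing(10, 3, p_match)
p5 = chance_of_passing(10, 5, p_match)
for threshold, p in ((3, p3), (5, p5)):
    print(f"threshold {threshold}: P(chance window passes) = {p:.6f}")
```

Under this model roughly 1% of unrelated windows pass a threshold of 3, against well under 0.01% for a threshold of 5. Since the number of window positions is the product of the two sequence lengths, the lower threshold can be expected to leave many more chance dots in the plot.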
Pros and cons of dot plots
• Advantages
  A dot plot can be used to identify long regions of strong similarity
  between two sequences
  It produces a plot, which is easy to make and to interpret
  It can be used to compare very short or very long sequences (even whole
        chromosomes, millions of bases long)
• Disadvantages
  The best window size and threshold have to be found by trial and error
  A dot plot can only compare 2 sequences at a time, not >2 sequences
  It doesn’t tell you what mutations occurred in the region of
  similarity (if there is one) since the two sequences shared a
  common ancestor
Software for making dotplots
• dotPlot() function in the SeqinR R package
  Allows you to specify a window size and threshold
  If the score in a window is ≥ the threshold, it colours in the 1st cell in
        the window (not all cells)
• EMBOSS dottup
  Allows you to specify a window size (word size) but not a threshold
  If all cells in a window are identities, it colours in all cells in the window
• EMBOSS dotmatcher
  Allows you to specify a window size and threshold
  Instead of using the number of identities in a window as the window
        score, it calculates a more complex score based on the
        similarities of the bases/amino acids (taken from a scoring matrix)
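
Requiring every cell in a window to be an identity is the same as looking for exact shared words, which can be done with a lookup table instead of scoring every window. The sketch below shows that general idea in Python; it is not EMBOSS code, and `word_matches` and `word_size` are names assumed for this example.

```python
from collections import defaultdict


def word_matches(seq1, seq2, word_size=3):
    """Return (start1, start2) pairs where seq1 and seq2 share an identical
    word of length word_size (0-based starts)."""
    # Index every word of seq2 by its text, remembering where it starts.
    index = defaultdict(list)
    for j in range(len(seq2) - word_size + 1):
        index[seq2[j:j + word_size]].append(j)
    hits = []
    for i in range(len(seq1) - word_size + 1):
        for j in index.get(seq1[i:i + word_size], []):
            hits.append((i, j))
    return hits


hits = word_matches("GCATCGGC", "CCATCGCCATCG", word_size=3)
print(hits)
```

Each hit marks the start of a fully identical window. Because the table is built once, this approach avoids scoring every pair of positions, which matters for long sequences.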
Problem
• Make a dot-plot for amino acid sequences
  “RQQEPVRSTC” and “QQESGPVRST”, using a
  window size of 3, and a threshold of ≥3
Answer
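•   Make a dot-plot for sequences “RQQEPVRSTC” and “QQESGPVRST”,
    using window size: 3, threshold: ≥3 (* = coloured cell)

                Q   Q   E   S   G   P   V   R   S   T
            R   .   .   .   .   .   .   .   .   .   .
            Q   *   .   .   .   .   .   .   .   .   .
            Q   .   *   .   .   .   .   .   .   .   .
            E   .   .   *   .   .   .   .   .   .   .
            P   .   .   .   .   .   *   .   .   .   .
            V   .   .   .   .   .   .   *   .   .   .
            R   .   .   .   .   .   .   .   *   .   .
            S   .   .   .   .   .   .   .   .   *   .
            T   .   .   .   .   .   .   .   .   .   *
            C   .   .   .   .   .   .   .   .   .   .

    Only the two diagonals QQE and PVRST remain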
Further reading
•   Chapter 3 in Introduction to Computational Genomics by Cristianini & Hahn
•   Practical on dot plots in R in A Little Book of R for Bioinformatics:
    https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html


Editor's Notes

  • #4: In R:
        setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
        library("seqinr")
        seq1 <- "RQQEPVRSTC"
        seq2 <- "QQESGPVRST"
        seq1b <- s2c(seq1)    # s2c() splits a string into a vector of characters
        seq2b <- s2c(seq2)
        source("dotplot.R")   # local script (not included here)
        makeDotPlot1(seq1b, seq2b, dotsize=1)
  • #6: In R:
        setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
        library("seqinr")
        seq1 <- "GCATCGGC"
        seq2 <- "CCATCGCCATCG"
        seq1b <- s2c(seq1)
        seq2b <- s2c(seq2)
        source("dotplot.R")
        makeDotPlot1(seq1b, seq2b, dotsize=1)
  • #8: In R:
        setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
        library("seqinr")
        seq1 <- "GCATCGGC"
        seq2 <- "CCATCGCCATCG"
        seq1b <- s2c(seq1)
        seq2b <- s2c(seq2)
        source("dotplot.R")
        makeDotPlot2(seq1b, seq2b, dotsize=1, windowsize=3, threshold=2)
  • #9: In R:
        setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
        library("seqinr")
        seq1 <- read.fasta("human.fa")   # human Eyeless
        seq2 <- read.fasta("fly.fa")     # fruitfly Eyeless
        seq1b <- seq1[[1]]               # first sequence in the file
        seq2b <- seq2[[1]]
        source("dotplot.R")
        makeDotPlot2(seq1b, seq2b, dotsize=1, windowsize=10, threshold=3)
    Saved picture as dotplot2.png
  • #10: In R:
        setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
        library("seqinr")
        seq1 <- read.fasta("human.fa")   # human Eyeless
        seq2 <- read.fasta("fly.fa")     # fruitfly Eyeless
        seq1b <- seq1[[1]]
        seq2b <- seq2[[1]]
        source("dotplot.R")
        makeDotPlot2(seq1b, seq2b, dotsize=1, windowsize=10, threshold=5)
    Saved picture as dotplot1.png
  • #14: In R:
        setwd("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/MScCourseLectures/MB6301Lectures/MB6301_Ls3456_Aln")
        library("seqinr")
        seq1 <- "RQQEPVRSTC"
        seq2 <- "QQESGPVRST"
        seq1b <- s2c(seq1)
        seq2b <- s2c(seq2)
        source("dotplot.R")
        makeDotPlot2(seq1b, seq2b, dotsize=1, windowsize=3, threshold=3)