1      Example of self-documenting data journalism notes
This is an example of using Sweave to combine code and output from the R statistical programming
environment with the LaTeX document processing environment to generate a self-documenting
script, in which the actual code used to do stats and generate statistical graphics is displayed
alongside the charts it directly produces.
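
For readers who have not seen a Sweave source file, the following is a minimal sketch of what one might look like. The chunk contents and the file name notes.Rnw mentioned afterwards are illustrative assumptions, not taken from the source of these notes.

```
\documentclass{article}
\begin{document}

Some explanatory text, written as ordinary LaTeX.

<<echo=T>>=
# An R code chunk: with echo=T the code is shown as well as its output
x <- c(6, 7, 6, 11, 11, 7)
mean(x)
@

<<fig=T, echo=T>>=
# fig=T asks Sweave to include the plot drawn by this chunk as a figure
plot(x)
@

\end{document}
```

Saved as, say, notes.Rnw, a file like this can be processed from R with Sweave("notes.Rnw"), which writes a .tex file that is then run through LaTeX in the usual way.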

1.1     Getting Started...
The aim is to try to replicate a graphic included by Ben Goldacre in his article DIY statistical
analysis: experience the thrill of touching real data 1 .

>   # The << echo = T >>= identifies an R code region;
>   # echo=T means the code itself is shown in the document,
>   # along with whatever is printed when the code is run
>   # In the code area, lines beginning with a # are comment lines and are not executed
>
>   #First, we need to load in the XML library that contains the scraper function
>   library(XML)
>   #Now we scrape the table
>   srcURL='http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis'
>   cancerdata=data.frame(
+     readHTMLTable( srcURL, which=1, header=c('Area','Rate','Population','Number') ) )
>
>   #The @ symbol on its own at the start of a line marks the end of a code block

   The format is simple: readHTMLTable(url, which=TABLENUMBER), where TABLENUMBER selects
the Nth table in the page. The header argument labels the columns (the data pulled in from
the HTML table itself contains all sorts of clutter).
   We can inspect the data we’ve imported as follows:

>   #Look at the whole table (the whole table is quite long,
>   # so don't display it; the command is commented out for now).
>   #cancerdata
>   #If you are using RStudio, you can inspect the data using the command: View(cancerdata)
>   #Look at the column headers
>   names(cancerdata)

[1] "Area"       "Rate"       "Population" "Number"

> #Look at the first few rows (head() shows six rows by default)
> head(cancerdata)

              Area Rate Population Number
1 Shetland Islands 19.15     31332      6
2         Limavady 21.49     32573      7
3       Ballymoney 17.05     35191      6
4   Orkney Islands 29.87     36826     11
5            Larne 27.54     39942     11
6      Magherafelt 15.26     45872      7

> #Look at the last few rows (tail() shows six rows by default)
> tail(cancerdata)

          Area  Rate Population Number
374  Wiltshire 18.69     727662    136
375  Sheffield  16.9     757396    128
376     Durham 17.29     786582    136
377      Leeds  17.3     959538    166
378   Cornwall 15.44    1062176    164
379 Birmingham 19.78    1268959    251

   1 http://www.guardian.co.uk/commentisfree/2011/oct/28/bad-science-diy-data-analysis

> #What sort of datatype is in the Number column?
> class(cancerdata$Number)

[1] "factor"

   The last line, class(cancerdata$Number), identifies the data as type factor. In order to
do stats and plot graphs, we need the Number, Rate and Population columns to contain actual
numbers. (Factors organise data according to categories; when the table is loaded in, the data is
loaded in as strings of characters, so rather than seeing each number as a number, R identifies it
as a category.) The conversion is done as follows:

>   #Convert the numerical columns to a numeric datatype
>   cancerdata$Rate =
+     as.numeric(levels(cancerdata$Rate)[as.numeric(cancerdata$Rate)])
>   cancerdata$Population =
+     as.numeric(levels(cancerdata$Population)[as.integer(cancerdata$Population)])
>   cancerdata$Number =
+     as.numeric(levels(cancerdata$Number)[as.integer(cancerdata$Number)])
>   #Just check it worked...
>   class(cancerdata$Number)

[1] "numeric"

> class(cancerdata$Rate)

[1] "numeric"

> class(cancerdata$Population)

[1] "numeric"

> head(cancerdata)

              Area Rate Population Number
1 Shetland Islands 19.15     31332      6
2         Limavady 21.49     32573      7
3       Ballymoney 17.05     35191      6
4   Orkney Islands 29.87     36826     11
5            Larne 27.54     39942     11
6      Magherafelt 15.26     45872      7
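
   The reason for indexing levels() in the conversion above, rather than calling as.numeric() directly on the column, can be seen in a small sketch. The toy factor below reuses four values from the Rate column; the variable names are illustrative only.

```r
# A toy factor standing in for a scraped column (values taken from the Rate column above)
rate <- factor(c("19.15", "21.49", "17.05", "29.87"))

# Calling as.numeric() directly returns the internal level codes, not the values
codes <- as.numeric(rate)
print(codes)

# Indexing the levels by those codes recovers the original strings,
# which can then be converted to numbers
values <- as.numeric(levels(rate)[as.integer(rate)])
print(values)

# An equivalent, commonly used shortcut goes via character strings
values2 <- as.numeric(as.character(rate))
print(identical(values, values2))
```

The first print shows the codes (positions in the sorted list of levels), which is why a direct conversion silently gives the wrong numbers.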

   We can now plot the data as a simple scatterplot using the plot command (figure 1) or we
can add a title to the graph and tweak the axis labels (figure 2).
   The plot command is great for generating quick charts. If we want a bit more control over
the charts we produce, the ggplot2 library is the way to go. (ggplot2 isn’t part of the standard R
bundle, so you’ll need to install the package yourself if you haven’t already installed it. In RStudio,
find the Packages tab, click Install Packages, search for ggplot2 and then install it, along with its
dependencies...). You can see the sort of chart ggplot creates out of the box in figure 3.
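
   If you prefer the R console to the RStudio Packages tab, the same installation can be sketched as below. The helper name ensure_package is hypothetical; requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: install a package only when it is missing,
# then report whether it is now available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs a network connection and a CRAN mirror
  }
  requireNamespace(pkg, quietly = TRUE)
}

# Usage (commented out so that nothing is installed just by running this sketch):
# ensure_package("ggplot2")
# library(ggplot2)
```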


> #Plot the Number of deaths by the Population
> plot(Number ~ Population, data=cancerdata)

                  Figure 1: Vanilla scatter plot (Number against Population)

> #Plot the Number of deaths by the Population.
> #Add in a title (main) and tweak the y-axis label (ylab).
> plot(Number ~ Population, data=cancerdata,
+      main='Bowel Cancer Occurrence by Population', ylab='Number of deaths')



                  Figure 2: Vanilla scatter plot (titled 'Bowel Cancer Occurrence by Population';
                  Number of deaths against Population)

>   require(ggplot2)
>   #Plot the Number of deaths by the Population
>   p=ggplot(cancerdata)+geom_point(aes(x=Population, y=Number))
>   print(p)


                  Figure 3: A rather prettier plot (Number against Population)

1.2    Generating the Funnel Plot
Doing a bit of searching for the “funnel plot” chart type used to display the data in Goldacre’s
article, I came across a post on Cross Validated, the Stack Overflow/Stack Exchange site dedicated
to statistics-related Q&A: How to draw funnel plot using ggplot2 in R? 2
    The meta-analysis answer seemed to produce a similar chart type, so I had a go at cribbing
the code, with confidence limits set at the 95% and 99.9% levels. Note that I needed to do a couple
of things:

  1. work out what values to use where! I did this by looking at the ggplot code to see what
     was plotted. p was on the y-axis and should be used to present the death rate. The data
     provides this as a rate per 100,000, so we need to divide by 100,000 to make it a rate in the
     range 0..1. The x-axis is the population.
  2. change the range and width of samples used to create the curves
  3. change the y-axis range.

   You can see the result in the funnel plot produced by the code below.
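
   The curves in this kind of funnel plot come from the normal approximation to the binomial: for an overall rate p0 and a population n, the limits are p0 plus or minus z times sqrt(p0(1-p0)/n). The sketch below shows that calculation on its own. The function name funnel_limits and the value chosen for p0 are assumptions for illustration, not values computed from the data.

```r
# Hypothetical helper: control limits around an overall rate p0 for populations n,
# using the normal approximation to the binomial: p0 +/- z * sqrt(p0 * (1 - p0) / n)
funnel_limits <- function(p0, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for 95%, about 3.29 for 99.9%
  se <- sqrt(p0 * (1 - p0) / n)
  data.frame(n = n, lower = p0 - z * se, upper = p0 + z * se)
}

# A rate quoted per 100,000 is first rescaled to a proportion between 0 and 1
rate_per_100k <- 19.15              # Shetland Islands, from the table above
p <- rate_per_100k / 100000

# Illustrative overall rate (an assumed value, not the estimate from the full data)
p0 <- 0.00018
pops <- c(50000, 200000, 800000)
limits95 <- funnel_limits(p0, pops, level = 0.95)
limits999 <- funnel_limits(p0, pops, level = 0.999)
print(limits95)
print(limits999)
```

The limits narrow as the population grows, which is what gives the plot its funnel shape; the 99.9% limits sit outside the 95% ones at every population.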




   2 http://stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210

>   #TH: funnel plot code from:
>   #stats.stackexchange.com/questions/5195/how-to-draw-funnel-plot-using-ggplot2-in-r/5210#5210
>   #TH: Use our cancerdata
>   number=cancerdata$Population
>   #TH: The rate is given as a 'per 100,000' value, so normalise it
>   p=cancerdata$Rate/100000
>   p.se <- sqrt((p*(1-p)) / (number))
>   df <- data.frame(p, number, p.se, Area=cancerdata$Area)
>   ## common effect (fixed effect model)
>   p.fem <- weighted.mean(p, 1/p.se^2)
>   ## lower and upper limits for 95% and 99.9% CI, based on FEM estimator
>   #TH: I'm going to alter the spacing of the samples used to generate the curves
>   number.seq <- seq(1000, max(number), 1000)
>   number.ll95 <- p.fem - 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ul95 <- p.fem + 1.96 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ll999 <- p.fem - 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   number.ul999 <- p.fem + 3.29 * sqrt((p.fem*(1-p.fem)) / (number.seq))
>   dfCI <- data.frame(number.ll95, number.ul95, number.ll999, number.ul999, number.seq, p.fem)
>   ## draw plot
>   #TH: note that we need to tweak the limits of the y-axis
>   fp <- ggplot(aes(x = number, y = p), data = df) +
+   geom_point(shape = 1) +
+   geom_line(aes(x = number.seq, y = number.ll95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul95), data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ll999), linetype = 2, data = dfCI) +
+   geom_line(aes(x = number.seq, y = number.ul999), linetype = 2, data = dfCI) +
+   geom_hline(aes(yintercept = p.fem), data = dfCI) +
+   xlab("Population") + ylab("Bowel cancer death rate") + theme_bw()
>   #Automatically set the maximum y-axis value to be just a bit larger than the max data value
>   fp=fp+scale_y_continuous(limits = c(0,1.1*max(p)))
>   #Label the outlier point
>   fp=fp+geom_text(aes(x = number, y = p,label=Area),size=3,data=subset(df,p>0.0003))
>   print(fp)




                  [Funnel plot: Bowel cancer death rate against Population;
                  the outlying point is labelled Glasgow City]
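
   The outlier above was labelled using a hand-picked threshold (p>0.0003). As a hedged alternative, points could be flagged by testing each area against the 99.9% limits at its own population. In the sketch below, the first three rows come from the table shown earlier, while the last row and the value of p.fem are invented purely for illustration.

```r
# Toy data: the first three rows are taken from the table above; the last row is an
# invented area, included only so that something falls outside the limits
df <- data.frame(
  Area = c("Shetland Islands", "Limavady", "Ballymoney", "Hypothetical City"),
  number = c(31332, 32573, 35191, 600000),
  p = c(19.15, 21.49, 17.05, 30.00) / 100000
)

# Assumed overall rate, for illustration only (the code above estimates it from all the data)
p.fem <- 0.00018

# 99.9% limits evaluated at each area's own population
se <- sqrt(p.fem * (1 - p.fem) / df$number)
df$lower <- p.fem - 3.29 * se
df$upper <- p.fem + 3.29 * se

# Flag the areas whose rate falls outside the limits
df$outside <- df$p < df$lower | df$p > df$upper
print(df[df$outside, c("Area", "number", "p")])
```

With a column like outside in place, the labelling step could use subset(df, outside) rather than a fixed cut-off, although that is a design choice rather than something the original code does.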
