Stat405
     Graphical extensions & missing values



                          Hadley Wickham
Tuesday, 31 August 2010
1. Graphical extensions (1d & 2d)
2. Subsetting




1d extensions
[Figure: histograms of price (x-axis, 0 to 15000) against count (y-axis, 0 to 6000), one panel per cut: Fair, Good, Very Good, Premium, Ideal]
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
[Figure: the same histograms of price faceted by cut (Fair, Good, Very Good, Premium, Ideal)]
What makes it difficult to compare the distributions?
Brainstorm for 1 minute.
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)
Problems

                    Each histogram is far away from the others,
                    but we know stacking is hard to read →
                    use another way of displaying densities

                    Varying relative abundance makes
                    comparisons difficult → rescale to ensure
                    constant area



# Large distances make comparisons hard
     qplot(price, data = diamonds, binwidth = 500) +
       facet_wrap(~ cut)

     # Stacked heights hard to compare
     qplot(price, data = diamonds, binwidth = 500, fill = cut)

     # Much better - but still have differing relative abundance
     qplot(price, data = diamonds, binwidth = 500,
       geom = "freqpoly", colour = cut)

     # Instead of displaying count on y-axis, display density
     # .. indicates that variable isn't in original data
     qplot(price, ..density.., data = diamonds, binwidth = 500,
       geom = "freqpoly", colour = cut)

     # To use with histogram, you need to be explicit
     qplot(price, ..density.., data = diamonds, binwidth = 500,
       geom = "histogram") + facet_wrap(~ cut)
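
What the density rescaling does can be checked by hand in base R. The sketch below is only an illustration: it uses made-up prices for two groups (not the diamonds data) and `hist()` with `plot = FALSE` instead of qplot(). Density is the count divided by (total count × bin width), so every group encloses an area of 1 regardless of how many observations it has.

```r
# Made-up prices for two groups of very different size,
# standing in for two cuts of diamonds
small_group <- c(400, 700, 900, 1200, 1800)
large_group <- c(300, 450, 600, 650, 800, 950, 1100, 1300, 1400, 1900,
                 2100, 2400, 2600, 3100, 3300, 3700, 4200, 4400, 4800, 4900)

binwidth <- 500
breaks <- seq(0, 5000, by = binwidth)

# plot = FALSE returns the binned values without drawing anything
h_small <- hist(small_group, breaks = breaks, plot = FALSE)
h_large <- hist(large_group, breaks = breaks, plot = FALSE)

# Counts depend on group size, so bar heights are not comparable
sum(h_small$counts)
sum(h_large$counts)

# Density = count / (total count * bin width)
dens_small <- h_small$counts / (sum(h_small$counts) * binwidth)
dens_large <- h_large$counts / (sum(h_large$counts) * binwidth)

# After rescaling, each group has total area 1
sum(dens_small * binwidth)
sum(dens_large * binwidth)
```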

Your turn


                    Use this technique to explore the
                    relationship between price and clarity,
                    and carat and clarity.




2d extensions
Idea                ggplot
Small points        shape = I(".")
Transparency        alpha = I(1/50)
Jittering           geom = "jitter"
Smooth curve        geom = "smooth"
2d bins             geom = "bin2d" or geom = "hex"
Density contours    geom = "density2d"
# There are two ways to add additional geoms
     # 1) A vector of geom names:
     qplot(price, carat, data = diamonds,
       geom = c("point", "smooth"))

     # 2) Add on extra geoms
     qplot(price, carat, data = diamonds) + geom_smooth()

     # This is how you get help about a specific geom:
     ?geom_smooth
     # or go to http://had.co.nz/ggplot2/geom_smooth.html



# To set aesthetics to a particular value, you need
     # to wrap that value in I()

     # Without I(), "blue" is treated as data: a one-level variable
     # mapped to the default colour scale, with a legend
     qplot(price, carat, data = diamonds, colour = "blue")
     # With I(), the points are drawn in blue
     qplot(price, carat, data = diamonds, colour = I("blue"))

     # Practical application: varying alpha
     qplot(price, carat, data = diamonds, alpha = I(1/10))
     qplot(price, carat, data = diamonds, alpha = I(1/50))
     qplot(price, carat, data = diamonds, alpha = I(1/100))
     qplot(price, carat, data = diamonds, alpha = I(1/250))
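
`I()` is a base R function that marks a value to be used "as is". The following is a minimal sketch of what it returns. It does not show how qplot() acts on that marker internally.

```r
# I() leaves the value unchanged but adds the class "AsIs"
blue <- I("blue")
class(blue)
is_marked <- inherits(blue, "AsIs")

# A plain string carries no such marker
plain_marked <- inherits("blue", "AsIs")

# The underlying value is untouched
alpha <- I(1/50)
same_value <- unclass(alpha) == 1/50
```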




Your turn

                    Explore the relationship between carat,
                    price and clarity, using these techniques.
                    Which did you find most useful?




Subsetting

Motivation

                    Look at histograms and scatterplots of x,
                    y, z from the diamonds dataset
                    Which values are clearly incorrect? Which
                    values might we be able to correct?
                    (Remember measurements are in millimetres;
                    1 inch ≈ 25 mm)




Plots

                qplot(x, data = diamonds, binwidth = 0.1)
                qplot(y, data = diamonds, binwidth = 0.1)
                qplot(z, data = diamonds, binwidth = 0.1)
                qplot(x, y, data = diamonds)
                qplot(x, z, data = diamonds)
                qplot(y, z, data = diamonds)




Modifying data
                    To modify data, you must first know how to
                    extract, or subset, it. R offers many
                    different methods. We'll start with the most
                    explicit, then learn some shortcuts next
                    time.
                    Basic structure:
                    df$varname
                    df[row index, column index]
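
The two basic forms can be tried on a small data frame. This is a sketch with made-up values; `df` and its columns are placeholders, not the diamonds data.

```r
# A small made-up data frame standing in for diamonds
df <- data.frame(
  carat = c(0.23, 0.31, 1.01, 0.70),
  cut   = c("Ideal", "Good", "Fair", "Premium"),
  price = c(326, 335, 3500, 2100)
)

# df$varname extracts one column as a vector
prices <- df$price

# df[row index, column index] extracts rows and columns
second_row <- df[2, ]                   # second row, all columns
two_cols   <- df[, c("carat", "price")] # all rows, two columns
one_value  <- df[2, "price"]            # a single value
```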


$

                    Remember str(diamonds) ?
                    That hints at how to extract individual
                    variables:
                    diamonds$carat
                    diamonds$price



Index type    Effect
blank         include all
integer       +ve: include
              -ve: exclude
logical       include TRUEs
character     lookup by name
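
Each kind of index can be seen in action on a toy data frame. The values and the object names below are made up for illustration.

```r
df <- data.frame(
  carat = c(0.23, 0.31, 1.01, 0.70, 1.50),
  price = c(326, 335, 3500, 2100, 9000)
)

# blank: include everything
all_rows <- df[, ]

# positive integers: include rows 1 and 2
first_two <- df[1:2, ]

# negative integers: exclude rows 1 and 2
drop_two <- df[-(1:2), ]

# logical: keep the rows where the index is TRUE
over_1ct <- df[c(FALSE, FALSE, TRUE, FALSE, TRUE), ]

# character: look up a column by name
by_name <- df[, "price"]
```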


Integer subsetting



# Nothing
     str(diamonds[, ])

     # Positive integers & nothing
     diamonds[1:6, ] # same as head(diamonds)
     diamonds[, 1:4] # watch out!

     # Two positive integers in rows & columns
     diamonds[1:10, 1:4]

     # Repeating input repeats output
     diamonds[c(1,1,1,2,2), 1:4]

     # Negative integers drop values
     diamonds[-(1:53900), -1]


# Useful technique: Order by one or more columns
     diamonds <- diamonds[order(diamonds$price), ]

     # Useful technique: Combine two tables
     carats <- data.frame(table(carat = diamonds$carat))
     mtch <- match(diamonds$carat, carats$carat)
     diamonds$carat_count <- carats$Freq[mtch]
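
Both techniques are integer subsetting in disguise, which is easier to see on a few rows. The sketch below repeats the same steps on a made-up data frame so the intermediate values can be inspected.

```r
df <- data.frame(
  carat = c(0.5, 0.3, 0.5, 1.0, 0.3, 0.5),
  price = c(1500, 400, 1400, 5000, 450, 1600)
)

# order() returns the row positions that would sort the column;
# using them as an integer row index reorders the data frame
sorted <- df[order(df$price), ]

# Lookup table with one row per distinct carat value and its frequency
carats <- data.frame(table(carat = df$carat))

# For each row of df, the position of its carat in the lookup table
mtch <- match(df$carat, carats$carat)

# Indexing Freq by those positions repeats each count once per matching row
df$carat_count <- carats$Freq[mtch]
```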




Logical subsetting



# The most complicated to understand, but
     # the most powerful. Lets you extract a
     # subset defined by some characteristic of
     # the data
     x_big <- diamonds$x > 10

     head(x_big)
     sum(x_big)
     mean(x_big)
     table(x_big)

     diamonds$x[x_big]
     diamonds[x_big, ]
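
The same steps on a six-row toy data frame (made-up values) make it easy to verify each result by eye.

```r
df <- data.frame(
  x     = c(4.1, 10.2, 5.7, 0, 10.7, 6.3),
  price = c(500, 18000, 1200, 900, 17500, 2500)
)

# A comparison returns one TRUE/FALSE per row
x_big <- df$x > 10

n_big    <- sum(x_big)   # number of TRUEs, since TRUE counts as 1
prop_big <- mean(x_big)  # proportion of TRUEs
table(x_big)             # counts of FALSE and TRUE

# Use the logical vector as a row index to keep only the TRUE rows
big <- df[x_big, ]
```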


small <- diamonds[diamonds$carat < 1, ]
     lowqual <- diamonds[diamonds$clarity
       %in% c("I1", "SI2", "SI1"), ]

     # Comparison functions:
     # < > <= >= != == %in%

     # Boolean operators: & | !
     small <- diamonds$carat < 1 &
       diamonds$price > 500
     # The column in the diamonds dataset is named color
     lowqual <- diamonds$color == "D" |
       diamonds$cut == "Fair"

[Figure: Venn diagrams of two sets a and b, illustrating a | b, a & b, a & !b, and xor(a, b)]
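
The Boolean operators can be checked against every combination of two logical values. This is a small illustrative sketch; the vectors `a`, `b` and `clarity` are made up.

```r
# Every combination of two logical values
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

either   <- a | b      # in a or b (or both)
both     <- a & b      # in both a and b
only_a   <- a & !b     # in a but not in b
just_one <- xor(a, b)  # in exactly one of a and b

# %in% tests membership in a set of values
clarity <- c("I1", "VS1", "SI2", "IF")
low <- clarity %in% c("I1", "SI2", "SI1")
```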




Useful functions for logical vectors

                table(zeros)
                sum(zeros)
                mean(zeros)

                TRUE = 1; FALSE = 0



Your turn
                    Select the diamonds that:
                    Have equal x and y dimensions.
                    Have depth between 55 and 70.
                    Have carat smaller than the mean.
                    Cost more than $10,000 per carat.
                    Are of good quality or better.


Saving results
                # Prints to screen
                diamonds[diamonds$x > 10, ]

                # Saves to new data frame
                big <- diamonds[diamonds$x > 10, ]

                # Overwrites existing data frame. Dangerous!
                diamonds <- diamonds[diamonds$x < 10,]



diamonds <- diamonds[1, 1]
     diamonds

     # Uh oh!

     rm(diamonds)
     str(diamonds)

     # Phew!
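
Why `rm()` rescues the situation: the assignment creates a new object in the workspace that hides the dataset supplied by the package, and the package's copy is never changed. The sketch below shows the same behaviour with `mtcars` from R's built-in datasets package, swapped in for diamonds here only so that it runs without ggplot2 loaded.

```r
# mtcars ships with R's datasets package (used instead of diamonds
# so the sketch runs without ggplot2)
mtcars <- mtcars[1, 1]  # creates a workspace copy that hides the original
mtcars                  # now a single number

rm(mtcars)              # removes only the workspace copy
str(mtcars)             # the packaged data frame is visible again
```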




Your turn
                    Create a logical vector that selects
                    diamonds with equal x & y. Create a new
                    dataset that only contains these values.
                    Create a logical vector that selects
                    diamonds with incorrect/unusual x, y, or z
                    values. Create a new dataset that omits
                    these values. (Hint: do this one variable
                    at a time)


equal_dim <- diamonds$x == diamonds$y
     equal <- diamonds[equal_dim, ]

     y_big <- diamonds$y > 10
     z_big <- diamonds$z > 6

     x_zero <- diamonds$x == 0
     y_zero <- diamonds$y == 0
     z_zero <- diamonds$z == 0
     zeros <- x_zero | y_zero | z_zero

     bad <- y_big | z_big | zeros
     good <- diamonds[!bad, ]


