03 extensions

Stat405
Graphical extensions & missing values

Hadley Wickham
Tuesday, 31 August 2010

1. Graphical extensions (1d & 2d)
2. Subsetting


1d extensions

[Figure: histograms of price (x-axis, 0 to 15000) against count (y-axis, 0 to 6000), one panel per cut: Fair, Good, Very Good, Premium, Ideal]
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

What makes it difficult to compare the distributions?
Brainstorm for 1 minute.

[Figure: histograms of count against price, faceted by cut (Fair, Good, Very Good, Premium, Ideal)]
qplot(price, data = diamonds, binwidth = 500) + facet_wrap(~ cut)

Problems

Each histogram is far away from the others, and we know stacked
bars are hard to read → use another way of displaying densities

Varying relative abundance makes comparisons difficult → rescale
so that every group has the same total area


# Large distances make comparisons hard
qplot(price, data = diamonds, binwidth = 500) +
  facet_wrap(~ cut)

# Stacked heights hard to compare
qplot(price, data = diamonds, binwidth = 500, fill = cut)

# Much better - but still have differing relative abundance
qplot(price, data = diamonds, binwidth = 500,
  geom = "freqpoly", colour = cut)

# Instead of displaying count on y-axis, display density
# The surrounding .. indicates that the variable isn't in the original data
# (it is computed by the statistic)
qplot(price, ..density.., data = diamonds, binwidth = 500,
  geom = "freqpoly", colour = cut)

# To use with a histogram, you need to set geom = "histogram" explicitly
qplot(price, ..density.., data = diamonds, binwidth = 500,
  geom = "histogram") + facet_wrap(~ cut)
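To make the rescaling concrete, here is a minimal base-R sketch of what a density scale means: each bin's count is divided by the total count times the bin width, so the bars of every group cover a total area of 1. The prices vector and the break points are made up for illustration, and ggplot2 may place its bin edges differently.

```r
# Toy data (hypothetical values, not taken from diamonds)
prices <- c(300, 450, 700, 900, 1200, 1400, 1450, 1800)
binwidth <- 500
breaks <- seq(0, 2000, by = binwidth)

# Bin the data without drawing anything
h <- hist(prices, breaks = breaks, plot = FALSE)

# Density = count / (total count * bin width)
manual_density <- h$counts / (sum(h$counts) * binwidth)

# Agrees with the density that hist() reports ...
all.equal(manual_density, h$density)

# ... and the bar areas (height * width) add up to 1
sum(manual_density * binwidth)
```

Because every group is rescaled to area 1, a rare cut such as Fair and a common one such as Ideal can be compared by shape rather than by size.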


Your turn

Use this technique to explore the
relationship between price and clarity,
and carat and clarity.


2d extensions

Idea              ggplot
Small points      shape = I(".")
Transparency      alpha = I(1/50)
Jittering         geom = "jitter"
Smooth curve      geom = "smooth"
2d bins           geom = "bin2d" or geom = "hex"
Density contours  geom = "density2d"

# There are two ways to add additional geoms
# 1) A vector of geom names:
qplot(price, carat, data = diamonds,
  geom = c("point", "smooth"))

# 2) Add on extra geoms
qplot(price, carat, data = diamonds) + geom_smooth()

# This is how you get help about a specific geom:
?geom_smooth
# or go to http://had.co.nz/ggplot2/geom_smooth.html


# To set aesthetics to a particular value, you need
# to wrap that value in I()

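# Maps the constant string "blue" to colour: you get a legend, not blue points
qplot(price, carat, data = diamonds, colour = "blue")
# Sets every point to blue
qplot(price, carat, data = diamonds, colour = I("blue"))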

# Practical application: varying alpha
qplot(price, carat, data = diamonds, alpha = I(1/10))


Your turn

Explore the relationship between carat,
price and clarity, using these techniques.
Which did you find most useful?


Subsetting


Motivation

Look at histograms and scatterplots of x,
y, z from the diamonds dataset.

Which values are clearly incorrect? Which
values might we be able to correct?
(Remember measurements are in millimetres,
1 inch ≈ 25 mm)


Plots

qplot(x, data = diamonds, binwidth = 0.1)
qplot(y, data = diamonds, binwidth = 0.1)
qplot(z, data = diamonds, binwidth = 0.1)
qplot(x, y, data = diamonds)
qplot(x, z, data = diamonds)
qplot(y, z, data = diamonds)


Modifying data
To modify data, you must first know how to
extract, or subset, it. R offers many different
methods. We’ll start with the most explicit,
then learn some shortcuts next time.
Basic structure:
df$varname
df[row index, column index]


$

Remember str(diamonds)?
Its output lists each variable with a $ in front,
which hints at how to extract individual variables:
diamonds$carat
diamonds$price


Index type  Effect
blank       include all
integer     +ve: include; -ve: exclude
logical     include TRUEs
character   look up by name
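The four index types can be tried on a small data frame before moving to diamonds. This is a minimal sketch; the df data frame and its values are made up for illustration.

```r
# Toy data frame (hypothetical values)
df <- data.frame(
  id    = 1:4,
  size  = c(0.3, 1.2, 0.7, 2.0),
  grade = c("Fair", "Good", "Ideal", "Good")
)

df[, ]                  # blank: all rows, all columns
df[c(1, 3), ]           # positive integers: keep rows 1 and 3
df[-1, ]                # negative integer: drop row 1
df[df$size > 1, ]       # logical: keep rows where the test is TRUE
df[, c("id", "grade")]  # character: look up columns by name
```

The same kinds of index work in both the row and the column position, and the two positions can be mixed freely.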


Integer subsetting


# Nothing
str(diamonds[, ])

# Positive integers & nothing
diamonds[1:6, ] # same as head(diamonds)
diamonds[, 1:4] # watch out! keeps all 53,940 rows

# Two positive integers in rows & columns
diamonds[1:10, 1:4]

# Repeating input repeats output
diamonds[c(1,1,1,2,2), 1:4]

# Negative integers drop values
diamonds[-(1:53900), -1]


# Useful technique: Order by one or more columns
diamonds <- diamonds[order(diamonds$price), ]

# Useful technique: Combine two tables
carats <- data.frame(table(carat = diamonds$carat))
mtch <- match(diamonds$carat, carats$carat)
diamonds$carat_count <- carats$Freq[mtch]
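The table-and-match technique is easier to follow on a few rows. The sketch below repeats the same three steps on a made-up toy data frame, so the intermediate objects are small enough to print and inspect.

```r
# Toy data (hypothetical values)
toy <- data.frame(carat = c(0.3, 0.5, 0.3, 0.7, 0.3))

# One row per distinct carat value, with its frequency in Freq
counts <- data.frame(table(carat = toy$carat))

# For every row of toy, the position of its carat value in counts
idx <- match(toy$carat, counts$carat)

# Use those positions to pull the matching frequency back into toy
toy$carat_count <- counts$Freq[idx]
toy
```

Here idx is an integer index with repeats, so this is integer subsetting again: repeating an input position repeats the corresponding output.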


Logical subsetting


# The most complicated to understand, but
# the most powerful. Lets you extract a
# subset defined by some characteristic of
# the data
x_big <- diamonds$x > 10

head(x_big)
sum(x_big)
mean(x_big)
table(x_big)

diamonds$x[x_big]
diamonds[x_big, ]


small <- diamonds[diamonds$carat < 1, ]
lowqual <- diamonds[diamonds$clarity
  %in% c("I1", "SI2", "SI1"), ]

# Comparison functions:
# < > <= >= != == %in%

# Boolean operators: & | !
small <- diamonds$carat < 1 &
  diamonds$price > 500
lowqual <- diamonds$color == "D" |
  diamonds$cut == "Fair"

[Figure: Venn diagrams of two sets a and b, illustrating a | b, a & b, a & !b and xor(a, b)]
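The Boolean operators above work element by element on logical vectors. A minimal sketch with two made-up vectors that cover all four combinations of TRUE and FALSE:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a | b      # TRUE  TRUE  TRUE  FALSE : in a or in b (or both)
a & b      # TRUE  FALSE FALSE FALSE : in both a and b
a & !b     # FALSE TRUE  FALSE FALSE : in a but not in b
xor(a, b)  # FALSE TRUE  TRUE  FALSE : in exactly one of a and b
```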


Useful functions for logical vectors

table(zeros)  # counts of FALSE and TRUE
sum(zeros)    # number of TRUEs
mean(zeros)   # proportion of TRUEs

TRUE = 1; FALSE = 0
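Because TRUE counts as 1 and FALSE as 0, arithmetic on a logical vector answers "how many?" and "what fraction?". A small sketch, with a made-up vector v standing in for a column of measurements:

```r
v <- c(0, 2.5, 0, 4, 1)  # hypothetical measurements
zeros <- v == 0          # TRUE where the measurement is exactly zero

sum(zeros)    # 2: number of TRUEs
mean(zeros)   # 0.4: proportion of TRUEs
table(zeros)  # FALSE appears 3 times, TRUE 2 times
```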


Your turn
Select the diamonds that:
Have equal x and y dimensions.
Have depth between 55 and 70.
Have carat smaller than the mean.
Cost more than $10,000 per carat.
Are of good quality or better.


Saving results
# Prints to screen
diamonds[diamonds$x > 10, ]

# Saves to new data frame
big <- diamonds[diamonds$x > 10, ]

# Overwrites existing data frame. Dangerous!
diamonds <- diamonds[diamonds$x < 10, ]


diamonds <- diamonds[1, 1]
diamonds

# Uh oh!

rm(diamonds)
str(diamonds)

# Phew!
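The recovery works because the assignment only creates a copy of diamonds in your workspace, which masks the one shipped with ggplot2. rm() deletes the workspace copy, and the packaged original becomes visible again. The same mechanism can be seen with base R's built-in constant pi, used here only as a stand-in so that the sketch runs without ggplot2:

```r
pi <- 3   # creates a workspace variable that masks base R's pi
pi        # 3
rm(pi)    # removes only the workspace copy
pi        # 3.141593: the built-in value is visible again
```

Changes you want to keep should still go into a new, differently named object, since rm() brings back the original and discards all modifications.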


Your turn
Create a logical vector that selects
diamonds with equal x & y. Create a new
dataset that only contains these values.

Create a logical vector that selects
diamonds with incorrect/unusual x, y, or z
values. Create a new dataset that omits
these values. (Hint: do this one variable
at a time)


equal_dim <- diamonds$x == diamonds$y
equal <- diamonds[equal_dim, ]

y_big <- diamonds$y > 10
z_big <- diamonds$z > 6

x_zero <- diamonds$x == 0
y_zero <- diamonds$y == 0
z_zero <- diamonds$z == 0
zeros <- x_zero | y_zero | z_zero

bad <- y_big | z_big | zeros
good <- diamonds[!bad, ]

