ggplot gotcha: beware jittering your data

2015/12/21

I like to overlay boxplots with a scatter plot, so that you get the comfort of seeing the real data along with the ease of reading off the median and IQR. Here’s an example with some fake data in a data frame that I’ll call dat:

group  val
A      0.6
B      1.6
A      0.7
B      1.7
A      0.8
...
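To follow along, a data frame of this shape can be built by hand. This is only a sketch: it contains just the five rows listed above, since the rest of the fake data is not shown.

```r
# Stand-in for dat: only the five rows listed in the post.
# The remaining rows are not shown there, so they are left out here.
dat <- data.frame(
  group = c("A", "B", "A", "B", "A"),
  val   = c(0.6, 1.6, 0.7, 1.7, 0.8)
)

# Inspect the structure: one grouping column, one numeric column.
str(dat)
```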

It looks like:

ggplot(dat, aes(x=group, y=val)) + geom_boxplot() + geom_point()

[Figure: p1]

Often there’s a lot of data, so I want to jitter the points left and right to prevent them overlapping. I’ve found that width=0.1 is good enough to get that kind of separation:

ggplot(dat, aes(x=group, y=val)) + geom_boxplot() + geom_point(position=position_jitter(width=0.1))

[Figure: p2]

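Unfortunately, this seemingly reasonable command did something very insidious: it also jittered the points in the y-axis direction! When height is left out, position_jitter does not default to zero; it defaults to 40% of the resolution of the data, so every point is nudged up or down as well as sideways. To see that, I’ll draw horizontal lines where the original data were: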

ggplot(dat, aes(x=group, y=val)) + geom_boxplot() + geom_point(position=position_jitter(width=0.1)) + geom_hline(aes(yintercept=val))

[Figure: p25]
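The same thing can be checked numerically instead of by eye. The sketch below assumes a five-row stand-in for dat (the full fake data are not shown) and uses ggplot2's layer_data() to pull out the coordinates that actually get drawn. The names p, drawn and y_shift are just for illustration.

```r
library(ggplot2)

# Stand-in for dat: only the five rows listed in the post.
dat <- data.frame(
  group = c("A", "B", "A", "B", "A"),
  val   = c(0.6, 1.6, 0.7, 1.7, 0.8)
)

# Points layer only, jittered with width set and height left at its default.
p <- ggplot(dat, aes(x = group, y = val)) +
  geom_point(position = position_jitter(width = 0.1))

# layer_data() returns the layer's data as drawn,
# i.e. after the position adjustment has been applied.
drawn <- layer_data(p, 1)

# Vertical displacement of each point relative to the original values.
y_shift <- drawn$y - dat$val
print(y_shift)

# TRUE if at least one point was moved vertically.
print(max(abs(y_shift)) > 0)
```

Because the jitter is random, the individual shifts differ from run to run unless a seed is fixed, but they should not all be zero.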

To get this right, you need to explicitly set h=0 (short for height=0):

ggplot(dat, aes(x=group, y=val)) + geom_boxplot() + geom_point(position=position_jitter(height=0, width=0.1))

[Figure: p3]

To show that it’s right:

ggplot(dat, aes(x=group, y=val)) + geom_boxplot() + geom_point(position=position_jitter(height=0, width=0.1)) + geom_hline(aes(yintercept=val))

[Figure: p35]
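The fix can also be confirmed numerically. This sketch again assumes a five-row stand-in for dat, and it additionally tries geom_jitter(), ggplot2's shorthand for geom_point(position = "jitter"), which takes width and height directly. The variable names are illustrative.

```r
library(ggplot2)

# Stand-in for dat: only the five rows listed in the post.
dat <- data.frame(
  group = c("A", "B", "A", "B", "A"),
  val   = c(0.6, 1.6, 0.7, 1.7, 0.8)
)

# Horizontal jitter only, written with position_jitter().
fixed <- ggplot(dat, aes(x = group, y = val)) +
  geom_point(position = position_jitter(width = 0.1, height = 0))

# The same request written with the geom_jitter() shorthand.
shorthand <- ggplot(dat, aes(x = group, y = val)) +
  geom_jitter(width = 0.1, height = 0)

# y coordinates as drawn, after the position adjustment.
y_fixed     <- layer_data(fixed, 1)$y
y_shorthand <- layer_data(shorthand, 1)$y

# TRUE means the plotted y values match the original data.
print(isTRUE(all.equal(y_fixed, dat$val)))
print(isTRUE(all.equal(y_shorthand, dat$val)))
```

Either spelling works; the important part is that height is given explicitly rather than left to its default.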