Part 3 Plotting, using ggplot2
3.1 Introduction
R has some serious capabilities for plotting. Note that the package is named ggplot2, while the function is ggplot().
The ggplot2 system of plotting is quite modular. First, you initiate a graph with ggplot(), then you add different modules of a graph (for example, bar graph, text labels, colors, legend, title, etc.) with the plus sign (“+”) followed by a command. Here is how it would look:
df %>%
  ggplot() +
  module1() +
  module2()...
Because the system is so modular, you can combine a line graph, a scatter graph, etc. all in one graph. The different types of graphs are called “geoms” in ggplot2, where geom_bar is a bar graph (by default it counts the rows for each x value, much like a histogram), geom_boxplot is a boxplot, and geom_point is a scatter graph. More than 20 other types of graphs are available in ggplot2, plus extensions. A bookmark-worthy official reference for ggplot2 is here: https://ggplot2.tidyverse.org/reference
After going through this tutorial and getting used to ggplot2, check out some extensions at https://www.ggplot2-exts.org/gallery (you can even animate your graphs into GIFs!).
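To make the modular idea concrete, here is a minimal sketch that stacks a scatter graph, a fitted line, and a title in one graph. It uses mtcars, a dataset that ships with R, purely as a stand-in; the variables and the title are illustrative and not part of this tutorial's data.

```r
library(ggplot2)

# mtcars ships with R; it stands in for your own data frame here.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                          # module 1: scatter graph
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # module 2: straight fitted line on top
  ggtitle("Fuel economy vs. weight")                      # module 3: title

p
```

Each line after the first is an independent module, so removing or swapping one leaves the rest of the graph intact.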
3.2 Coding the graph
Let’s try some real code! Before diving in, let’s install and load the ggdark package, which gives the graphs a nice dark theme (the default dark theme in ggplot2 is not so good; if you’re looking for a nice light theme, theme_classic(), which comes with ggplot2, is generally recommended).
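install.packages("ggdark")
library(ggdark)
library(ggplot2) # assumed installed; harmless if already loaded earlier
library(dplyr)   # provides %>%, select(), mutate(); assumed installed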
Now on to graphing…
df_age <- nshap %>% # keep only age and gender
  select(age, gender) %>%
  mutate(age = as.numeric(age))
df_age %>%
  ggplot() +
  geom_bar(aes(x = age,
               colour = gender)) +
  dark_theme_bw()
# spike at 68: what do you think explains this?
The first chunk of the code reduces the nshap dataframe to two variables (age and gender), then converts age into a numeric variable. This reduced dataframe is saved as df_age.
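One caveat with as.numeric(), offered as an aside rather than a claim about how nshap stores age: if a column happens to be a factor, as.numeric() returns the internal level codes instead of the printed numbers. The sketch below uses a small made-up data frame (toy, with hypothetical values) to show the difference and the usual workaround.

```r
library(dplyr)

# Hypothetical toy data: age stored as a factor, not as numbers.
toy <- data.frame(
  age    = factor(c("62", "75", "68")),
  gender = c("male", "female", "female")
)

converted <- toy %>%
  select(age, gender) %>%
  mutate(age_codes  = as.numeric(age),                # level codes, not ages
         age_values = as.numeric(as.character(age)))  # the actual ages

converted
```

If the column is already character or numeric, plain as.numeric() behaves as the tutorial code expects.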
Then, we start the graph with the command ggplot() and set the graph type to a bar graph with geom_bar (again, it counts occurrences by default, like a histogram). The aes() function stands for the “aesthetics” of the graph, where you set out what R needs to draw. In this case, it maps the variable age to the x-axis and automatically assigns a different color to each discrete value of the gender variable. Note that colour sets the outline color of the bars; use fill instead to color the bars themselves. Finally, dark_theme_bw() gives the graph a nice dark theme. Because this is a histogram, we do not need to assign a \(y\) variable: geom_bar just counts up the number of occurrences of each age value.
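The following sketch illustrates two points from the paragraph above on a tiny made-up data frame (toy, hypothetical values): the difference between the colour and fill aesthetics, and the counting that geom_bar does when no y variable is given. layer_data() is a ggplot2 helper that returns the values computed for a layer.

```r
library(ggplot2)

# Hypothetical toy data standing in for df_age.
toy <- data.frame(
  age    = c(60, 60, 61, 61, 61, 62),
  gender = c("female", "male", "female", "female", "male", "male")
)

p_outline <- ggplot(toy) + geom_bar(aes(x = age, colour = gender)) # colored outlines
p_filled  <- ggplot(toy) + geom_bar(aes(x = age, fill = gender))   # colored bars

# One row per age/gender combination, with the count ggplot2 computed.
counts <- layer_data(p_filled)
counts[, c("x", "count")]
```

The counts add up to the number of rows in the data, which is why no y variable is needed.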
3.3 Normal bar graph
Instead of the default histogram, let’s try making a normal bar graph.
nshap$maritlst <- factor(nshap$maritlst,
                         levels = c("married",
                                    "living with a partner",
                                    "separated",
                                    "divorced",
                                    "never married",
                                    "widowed"))
marital <- nshap %>%
  group_by(maritlst, gender) %>%
  summarise(n = n())
marital %>%
  ggplot() +
  geom_bar(aes(y = n,
               x = maritlst,
               colour = gender),
           stat = "identity") +
  ggtitle("Count of respondents vs. marital status by gender") +
  dark_theme_bw()
# More widowed females, possibly explained by male deaths from war,
# lower life expectancy for males, etc.
The necessary arguments are y for aes() and stat for geom_bar(). While this is still a histogram in the sense that it plots counts (n), it demonstrates how to make a normal bar graph. stat = "identity" is what makes it a normal bar graph: the bar heights come straight from the data, so you must supply the y argument. You can then set the title of the graph with ggtitle().
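As a side note, ggplot2 also provides geom_col(), which is a shorthand for geom_bar(stat = "identity"), and dplyr provides count(), which combines group_by() and summarise(n = n()). The sketch below shows both on a small made-up data frame (toy and its values are hypothetical); it is an alternative spelling, not a replacement for the tutorial's code.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for nshap.
toy <- data.frame(
  status = factor(c("married", "married", "widowed", "divorced", "widowed", "married"),
                  levels = c("married", "divorced", "widowed")), # levels set the bar order
  gender = c("male", "female", "female", "male", "female", "male")
)

tally <- toy %>% count(status, gender) # one row per status/gender pair, count in column n

p <- tally %>%
  ggplot() +
  geom_col(aes(x = status, y = n, fill = gender)) # heights taken directly from n

p
```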
3.4 Switching the axes
Sometimes the labels are too long to fit legibly along the x-axis. You can either rotate the labels so the text becomes vertical, or switch the axes. The code below shows how to switch the axes.
nshap$par_nerve <- factor(nshap$par_nerve,
                          levels = c("often",
                                     "some of the time",
                                     "hardly ever or rarely",
                                     "never",
                                     "refused",
                                     "don't know"))
nerve <- nshap %>%
  group_by(par_nerve, gender) %>%
  summarise(n = n())
#head(nerve)
nerve %>%
  ggplot() +
  geom_bar(aes(y = n,
               x = par_nerve,
               colour = gender),
           stat = "identity") +
  ylab("Count") +
  xlab("How often do you find your partner to be annoying?") +
  coord_flip() +
  dark_theme_bw()
# Looking at raw averages of the respondents, males tend to be annoying
# more frequently, but have to watch out for missing data, indicated by
# "NA" here
coord_flip() flips the \(x\) and the \(y\) axes, making the graph a horizontal bar graph.
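The other option mentioned in this section, rotating the labels, can be sketched with theme() and element_text(), both part of ggplot2. The data frame toy and its counts are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy counts with long category labels.
toy <- data.frame(
  answer = factor(c("often", "some of the time", "hardly ever or rarely", "never"),
                  levels = c("often", "some of the time", "hardly ever or rarely", "never")),
  n = c(12, 30, 41, 17)
)

p <- ggplot(toy) +
  geom_col(aes(x = answer, y = n)) +
  # Rotate the x-axis labels to vertical and align them against the axis.
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

p
```

Rotated labels keep the bars vertical but cost some vertical space; flipping the axes is often easier to read when labels are long.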
3.5 geom_smooth()
A scatter graph is nice when there are only a few data points, maybe fewer than 100. However, the NSHAP data is rather sizable, which makes geom_smooth() a neat solution. By default it fits LOESS (locally estimated scatterplot smoothing, a local regression similar in spirit to a moving average) when the largest group has fewer than 1,000 observations, and switches to a generalized additive model (GAM) for larger groups. It also shows the 95% confidence interval as a shaded area by default. The CI can be misleading where there are few data points to work with in some local region, for example in the extreme income brackets.
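The smoother and the confidence band described above can also be requested explicitly instead of relying on the defaults. The sketch below does so on simulated data; the data frame toy, its column names, and the numbers are all made up for illustration and have nothing to do with NSHAP.

```r
library(ggplot2)

# Simulated stand-in data: earnings drifting down with age, plus noise.
set.seed(1)
toy <- data.frame(age = runif(200, min = 57, max = 85))
toy$earn <- 60000 - 500 * (toy$age - 57) + rnorm(200, sd = 8000)

p <- ggplot(toy, aes(x = age, y = earn)) +
  geom_smooth(method = "loess",  # pick the smoother explicitly
              formula = y ~ x,
              span = 0.75,       # fraction of points used for each local fit
              level = 0.95)      # confidence level of the shaded band

# The fitted curve (y) and the band limits (ymin, ymax) computed by ggplot2.
fit <- layer_data(p)
head(fit[, c("x", "y", "ymin", "ymax")])
```

Setting se = FALSE in geom_smooth() hides the band altogether, which can be the more honest choice where the data is sparse.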
hag <- nshap %>% # hearn, age, gender, educ
  select(hearn, age, gender, educ) %>%
  mutate(hearn = as.numeric(hearn),
         age = as.numeric(age))
hag <- na.omit(hag)
hag$educ <- factor(hag$educ,
                   levels = c("bachelors or more",
                              "voc cert/some college/assoc",
                              "hs/equiv",
                              "< hs"))
plotted <- hag %>%
  ggplot() +
  geom_smooth(aes(y = hearn,
                  x = age,
                  colour = educ)) +
  facet_wrap(~gender) +
  ylab("Household earnings") +
  xlab("Age") +
  ggtitle("Household earnings vs. age") +
  dark_theme_bw(base_size = 11) +
  scale_y_continuous(labels = scales::dollar)
plotted
The graph is a bit ugly, but there are several new elements here. Let’s go over them one by one, then address the graph readability later.
First, the graph is saved into the object named plotted.
facet_wrap(~gender): splits the graph into one panel for each of gender’s (two) discrete values.
ylab: y-axis label. A similar story applies to xlab and ggtitle.
scale_y_continuous(labels = scales::dollar): R recognizes the dollar scale and prepends “$” to the y-axis values.
coord_cartesian(ylim = c(0, 250000)): only shows \(y\) coordinates between the set min and max values (but does not remove the values outside of the range from the data; for more, see “clipping” below).
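The faceting and dollar formatting described above can be tried in isolation. The sketch below uses a small made-up data frame (toy, hypothetical values) and checks what scales::dollar does to a plain number.

```r
library(ggplot2)

# Hypothetical toy data: earnings by age for two genders.
toy <- data.frame(
  age    = c(60, 65, 70, 75, 60, 65, 70, 75),
  earn   = c(90000, 80000, 60000, 50000, 70000, 65000, 50000, 40000),
  gender = rep(c("male", "female"), each = 4)
)

p <- ggplot(toy, aes(x = age, y = earn)) +
  geom_line() +
  facet_wrap(~gender) +                        # one panel per gender value
  scale_y_continuous(labels = scales::dollar)  # format the axis labels as dollars

# The labelling function can also be called directly on numbers.
label <- scales::dollar(250000)
label
```

Calling the labeller directly is a quick way to preview how the axis labels will look before drawing the graph.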
3.5.1 Clipping
knitr::include_graphics("clipping.png")
Here, t is the ggplot2 graph object. knitr::include_graphics() shows the image file in this tutorial.
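The image itself is not reproduced here, so the following is only a sketch of the general distinction, assuming that is what the figure illustrates: coord_cartesian() zooms in without touching the data, while limits set on the scale turn out-of-range values into missing values before anything is drawn. The data frame toy is made up, and base plays the role of the graph object.

```r
library(ggplot2)

# Hypothetical toy data with one extreme value.
toy <- data.frame(x = 1:10, y = c(1:9, 100))

base <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# Zoom only: all ten points stay in the data, the extreme one is just out of view.
zoomed <- base + coord_cartesian(ylim = c(0, 10))

# Scale limits: the out-of-range y value becomes NA and is dropped when drawing.
dropped <- base + scale_y_continuous(limits = c(0, 10))

sum(is.na(layer_data(zoomed)$y))
sum(is.na(layer_data(dropped)$y))
```

The difference matters most for geoms that compute something from the data, such as geom_smooth(): with scale limits the discarded values no longer influence the fitted curve, whereas with coord_cartesian() they still do.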
3.6 Saving the graph
Conveniently, ggplot2 has its own function for saving graphs, ggsave(). Base R options for saving graphics are not recommended.
ggsave(filename = "hearn_vs_age.png",
plot = plotted,
width = 12.8,
height = 7.2,
dpi = 150)
knitr::include_graphics("hearn_vs_age.png")
Notice that the graph, fonts, legend, borders, etc. are all rescaled according to the ggsave() arguments and look much better. You can tweak these settings to improve readability compared with the previous ugly default graph.
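It may help to see how the ggsave() arguments translate into image size. ggsave() interprets width and height in inches by default, so the pixel dimensions of the saved PNG are simply inches times dpi. The short calculation below uses the values from the call above; the variable names are only for illustration.

```r
# Values taken from the ggsave() call above.
width_in  <- 12.8
height_in <- 7.2
dpi       <- 150

# Pixel dimensions of the resulting image: inches multiplied by dots per inch.
pixels <- c(width = width_in * dpi, height = height_in * dpi)
pixels
```

Because text sizes are defined in points, a larger width and height make the fonts look smaller relative to the graph, while a higher dpi adds pixels without changing those proportions.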