Important aspect of data science -> Communicate information clearly and efficiently to the community.
Powerful tool to discover patterns in the data.
It makes complex data more accessible -> reveals the data.
Bad graphics can be a reason for paper rejection!
The data-ink ratio is the proportion of ink that is used to present actual data compared to the total amount of ink used in the entire display.
\[ \text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}} \]
The data-ink ratio should be kept as high as possible.
It is easy to exaggerate effects or distort reality with graphs.
3D graphics are very rarely useful:
First, they lower the data-ink ratio.
Second, they can distort reality.
The value of C is 3…
ggplot2 is an extremely powerful package based on the grammar of graphics to produce complicated graphics in an elegant manner.
ggplot2 works best when you have tidy data.
Graphics are built by combining layers.
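As a minimal sketch of the layering idea, assuming ggplot2 is already installed and using its built-in mpg dataset (the choice of displ and hwy is only for illustration), a plot can be built up one layer at a time:

```r
library(ggplot2)

# Start with the data and the aesthetic mappings; nothing is drawn yet
p <- ggplot(mpg, aes(x = displ, y = hwy))

# Layer 1: the raw observations as points
p <- p + geom_point()

# Layer 2: a linear trend line drawn on top of the points
p <- p + geom_smooth(method = "lm")

p  # printing the object renders the plot
```

Each `+` adds a layer to the same plot object, so layers can be added or removed without rewriting the rest of the code.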
You can refer to the printed cheat sheet for an overview of the package’s functions.
ggplot2 is not part of base R, so it needs to be installed.
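A typical way to install and load the package is sketched below. It assumes an internet connection and a configured CRAN mirror; the installation only needs to be done once per machine.

```r
# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package (needed in every new R session)
library(ggplot2)
```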
For the following examples we are going to use the data from the mpg dataset. This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov.
row | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | audi | a4 | 1.80 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
2 | audi | a4 | 1.80 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
3 | audi | a4 | 2.00 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
4 | audi | a4 | 2.00 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
5 | audi | a4 | 2.80 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
6 | audi | a4 | 2.80 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
7 | audi | a4 | 3.10 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
8 | audi | a4 quattro | 1.80 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
9 | audi | a4 quattro | 1.80 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
10 | audi | a4 quattro | 2.00 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
geoms is short for geometric objects, which specify the type of graphic you want to produce (boxplot, barplot, scatter, …).
## [1] "geom_abline" "geom_area" "geom_bar"
## [4] "geom_bin2d" "geom_blank" "geom_boxplot"
## [7] "geom_col" "geom_contour" "geom_count"
## [10] "geom_crossbar" "geom_curve" "geom_density"
## [13] "geom_density_2d" "geom_density2d" "geom_dotplot"
## [16] "geom_errorbar" "geom_errorbarh" "geom_freqpoly"
## [19] "geom_hex" "geom_histogram" "geom_hline"
## [22] "geom_jitter" "geom_label" "geom_line"
## [25] "geom_linerange" "geom_map" "geom_path"
## [28] "geom_point" "geom_pointrange" "geom_polygon"
## [31] "geom_qq" "geom_qq_line" "geom_quantile"
## [34] "geom_raster" "geom_rect" "geom_ribbon"
## [37] "geom_rug" "geom_segment" "geom_sf"
## [40] "geom_sf_label" "geom_sf_text" "geom_smooth"
## [43] "geom_spoke" "geom_step" "geom_text"
## [46] "geom_tile" "geom_violin" "geom_vline"
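A listing like the one above can be obtained with something along these lines. This is a sketch, not necessarily the command used for the slide, and the exact set of geoms depends on the installed ggplot2 version.

```r
library(ggplot2)

# List every exported object in ggplot2 whose name starts with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")
geoms
```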
There are two main types of one-variable graphics:
Graphic type | Geom | Description |
---|---|---|
Histogram | geom_histogram() | Produces histograms for continuous data. |
Barplot | geom_bar() | Produces bar plots for discrete data. |
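To illustrate the two geoms, here is a minimal sketch using the mpg dataset. The variables hwy and class and the number of bins are arbitrary choices made for this example.

```r
library(ggplot2)

# Continuous variable: the histogram groups values into bins
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

# Discrete variable: the bar plot counts observations per category
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

p_hist
p_bar
```

In both cases only x is mapped; the counts on the y-axis are computed by ggplot2.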
Create a histogram of the displ variable and change the default color of the bars to red.
Hint:
Two-variable graphics are more common.
Graphic type | Geom | Description |
---|---|---|
Scatter plot | geom_point() | Produces a scatter plot between \(x\) and \(y\). |
Line plot | geom_line() | Produces a line plot between \(x\) and \(y\). |
Boxplot | geom_boxplot() | Produces a boxplot between \(x\) and \(y\). |
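A short sketch of the first two geoms follows. The scatter plot uses mpg; the line plot uses economics, another dataset shipped with ggplot2, chosen here only because it contains a time series.

```r
library(ggplot2)

# Scatter plot: one point per car
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Line plot: observations are connected in order of x
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

p_scatter
p_line
```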
Create a scatter plot between hwy (x) and cty (y). Change the color of the points to blue and the size to 4.
In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles (Wikipedia).
To make a boxplot, we need to have a discrete/categorical variable on \(x\) and a continuous variable on \(y\).
Create a boxplot using the following code:
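The code from the original slide is not reproduced here. As a stand-in, a minimal boxplot on the mpg dataset could look like the sketch below, where class (categorical) on \(x\) and hwy (continuous) on \(y\) are assumptions made for this example.

```r
library(ggplot2)

# Categorical variable on x, continuous variable on y
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

p_box
```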
It can be useful to add colors to graphics. To change the color of the points, use the color parameter.
To set a color based on a variable, use the aesthetic: aes(colour = variable).
In the same manner, the size of the dots can be based on a particular variable.
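The difference between setting and mapping can be sketched as follows. The variables and the color name are arbitrary choices for illustration.

```r
library(ggplot2)

# Fixed color: set outside aes(), applies to every point
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "darkgreen")

# Color mapped to a variable: set inside aes(), a legend is added
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class))

# Size mapped to a variable, in the same manner
p_size <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(size = cyl))
```

A value placed inside aes() is treated as data to be mapped; a value placed outside is applied as a constant.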
Try to reproduce this graphic.
What is wrong with this graphic?
Faceting is a very powerful feature of ggplot2 that allows you to display additional categorical variables in facets.
There are two types of faceting: facet_grid() and facet_wrap().
2D facet graphics are made using the facet_grid() function.
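A minimal sketch of both functions on the mpg dataset, with the faceting variables (class, drv, cyl) chosen only as an example:

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# One categorical variable: panels wrap into rows and columns
p_wrap <- base + facet_wrap(~class)

# Two categorical variables: rows by drv, columns by cyl
p_grid <- base + facet_grid(drv ~ cyl)

p_wrap
p_grid
```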
There are a ton of options to modify the look and feel of your graphics, and we cannot go through them all in a short period of time.
Book: ggplot2: Elegant Graphics for Data Analysis (Use R!)
Here I present the principal functions I usually use to make publication-ready graphics.
library(ggplot2)
ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = factor(cyl))) + # Scatterplot color based on cyl (mapped inside aes())
  stat_smooth(method = "lm") + # Add a linear smoother to the data
  labs(colour = "Number of\ncylinders") + # Title of the color legend
  xlab("Engine displacement (L)") + # Change x-axis title
  ylab("City miles per gallon") + # Change y-axis title
  ggtitle("This is my title") + # Add a title on top of the plot
  theme(legend.position = "top") + # Change legend position
  xlim(0, 8) # Change limits of x-axis
Saving your ggplot2 graphics is easy with the ggsave() function.
p <- ggplot(mpg, aes(x = displ, y = cty)) +
geom_point()
# Vector formats
ggsave("path/to/myfile.pdf", p, width = 5.97, height = 4.79)
ggsave("path/to/myfile.eps", p, width = 5.97, height = 4.79)
ggsave("path/to/myfile.ps", p, width = 5.97, height = 4.79)
# Raster formats
ggsave("path/to/myfile.jpg", p, width = 5.97, height = 4.79)
ggsave("path/to/myfile.tiff", p, width = 5.97, height = 4.79)
ggsave("path/to/myfile.png", p, width = 5.97, height = 4.79)
Use the following data and reproduce the plot on the next slide.
Hint #1: Use bind_rows() to bind both datasets.
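As a rough sketch of what bind_rows() does, assuming the dplyr package is installed: the two small data frames below and their column names are hypothetical stand-ins, not the exercise data.

```r
library(dplyr)

# Hypothetical toy data frames with the same columns
first <- tibble(x = c(1, 2), value = c(10, 20))
second <- tibble(x = c(3, 4), value = c(30, 40))

# Stack the rows; .id adds a column recording which input each row came from
combined <- bind_rows(first = first, second = second, .id = "source")
combined
```

Keeping the origin of each row in a column makes it possible to map it to an aesthetic such as colour later on.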
Hint #2: Before starting, take time to analyze the graphic and think about how you need to format the data and what the different components of the plot are.
Data source: http://climrun.cyi.ac.cy/?q=csv