Data visualisation with ggplot2

Data visualization

  • An important aspect of data science: communicating information clearly and efficiently to the community.

  • A powerful tool to discover patterns in the data.

  • It makes complex data more accessible and reveals what the data contain.

  • Bad graphics can be a reason for paper rejection!

  • A picture is worth a thousand words.
    • Always, always, always plot the data!
    • When possible, replace tables with figures that are more compelling.

What is a good graph?

Data-ink ratio

The data-ink ratio is the proportion of ink that is used to present actual data compared to the total amount of ink used in the entire display.


\[ \text{data-ink ratio} = \frac{\text{data ink}}{\text{total ink used to print the graphic}} \]

The data-ink ratio should be kept as high as possible.


Examples

How to lie with graphs

It is easy to exaggerate effects or distort reality with graphs.

Examples

It seems that the second bar is 3 times higher than the first bar.
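
One common way to produce this kind of impression is a truncated y axis. The sketch below uses hypothetical values (96 and 98, invented for illustration) and assumes, without confirmation from the original figure, that truncation is the cause of the distortion.

```r
library(ggplot2)

# Hypothetical data, invented for illustration only.
df <- data.frame(group = c("A", "B"), value = c(96, 98))

# Honest version: bars start at zero, so their heights are comparable.
p_honest <- ggplot(df, aes(x = group, y = value)) +
  geom_col()

# Misleading version: the visible y range starts at 95, so bar B
# (98 - 95 = 3 units visible) looks three times taller than bar A
# (96 - 95 = 1 unit visible), although the values differ by about 2%.
p_truncated <- p_honest +
  coord_cartesian(ylim = c(95, 99), expand = FALSE)

p_honest
p_truncated
```

Comparing the two plots side by side shows how much the choice of axis limits drives the visual message.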

Do not use 3D, ever!

3D graphics are very rarely useful:

  1. First, they break the data-ink ratio rule.

  2. Secondly, they can distort reality.

Examples

What is the value of C?

Source: http://consultantjournal.com/blog/use-3d-charts-at-your-own-risk

Examples

Source: http://consultantjournal.com/blog/use-3d-charts-at-your-own-risk

The value of C is 3…

Examples

What is the value of y at z = low and x = t1?

Top 3 of bad graphs

Source: http://bit.ly/1OnKlEi

ggplot2

  • ggplot2 is an extremely powerful package based on the grammar of graphics to produce complicated graphics in an elegant manner.

  • ggplot2 works best when you have tidy data.

  • Graphics are built by combining layers.

  • You can refer to the printed cheat sheet for an overview of the package’s functions.

ggplot2

ggplot2 is not part of base R, so it needs to be installed.
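
A minimal sketch of the installation step, assuming a working internet connection and a configured CRAN mirror. The package only needs to be installed once, but it has to be loaded in every new R session.

```r
# Install ggplot2 only if it is not already available on this machine.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package so that its functions and datasets can be used.
library(ggplot2)
```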

The data

For the following examples we are going to use the data from the mpg dataset. This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov.

   manufacturer model      displ year cyl trans      drv cty hwy fl class
1  audi         a4          1.80 1999   4 auto(l5)   f    18  29 p  compact
2  audi         a4          1.80 1999   4 manual(m5) f    21  29 p  compact
3  audi         a4          2.00 2008   4 manual(m6) f    20  31 p  compact
4  audi         a4          2.00 2008   4 auto(av)   f    21  30 p  compact
5  audi         a4          2.80 1999   6 auto(l5)   f    16  26 p  compact
6  audi         a4          2.80 1999   6 manual(m5) f    18  26 p  compact
7  audi         a4          3.10 2008   6 auto(av)   f    18  27 p  compact
8  audi         a4 quattro  1.80 1999   4 manual(m5) 4    18  26 p  compact
9  audi         a4 quattro  1.80 1999   4 auto(l5)   4    16  25 p  compact
10 audi         a4 quattro  2.00 2008   4 manual(m6) 4    20  28 p  compact
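
The mpg data ships with ggplot2, so it becomes available as soon as the package is loaded. A short sketch of how the data could be inspected before plotting (the exact commands used to print the rows above are an assumption):

```r
library(ggplot2)

# Show the first ten rows of the dataset.
head(mpg, 10)

# Number of rows and columns.
dim(mpg)

# Column names and types.
str(mpg)
```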

Basic structure

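
A minimal sketch of the usual structure of a ggplot2 call, using the mpg data: a call to ggplot() supplies the data and the aesthetic mappings, and one or more layers are then added with +. The choice of displ and hwy is an assumption made for illustration.

```r
library(ggplot2)

# 1. ggplot(): the data and the mapping of variables to aesthetics.
# 2. geom_*(): a layer that decides how the data are drawn.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Printing the object draws the plot.
p
```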

geoms

geoms is short for geometric objects, which specify the type of graphic you want to produce (boxplot, barplot, scatter plot, …).

##  [1] "geom_abline"     "geom_area"       "geom_bar"       
##  [4] "geom_bin2d"      "geom_blank"      "geom_boxplot"   
##  [7] "geom_col"        "geom_contour"    "geom_count"     
## [10] "geom_crossbar"   "geom_curve"      "geom_density"   
## [13] "geom_density_2d" "geom_density2d"  "geom_dotplot"   
## [16] "geom_errorbar"   "geom_errorbarh"  "geom_freqpoly"  
## [19] "geom_hex"        "geom_histogram"  "geom_hline"     
## [22] "geom_jitter"     "geom_label"      "geom_line"      
## [25] "geom_linerange"  "geom_map"        "geom_path"      
## [28] "geom_point"      "geom_pointrange" "geom_polygon"   
## [31] "geom_qq"         "geom_qq_line"    "geom_quantile"  
## [34] "geom_raster"     "geom_rect"       "geom_ribbon"    
## [37] "geom_rug"        "geom_segment"    "geom_sf"        
## [40] "geom_sf_label"   "geom_sf_text"    "geom_smooth"    
## [43] "geom_spoke"      "geom_step"       "geom_text"      
## [46] "geom_tile"       "geom_violin"     "geom_vline"
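
A list like the one above can probably be obtained by searching the package namespace for exported names that start with geom_. This is a sketch under that assumption; the exact result depends on the installed ggplot2 version.

```r
library(ggplot2)

# List every exported object in ggplot2 whose name starts with "geom_".
geoms <- ls("package:ggplot2", pattern = "^geom_")
geoms
```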

One-variable graphics

There are two main types of one-variable graphics:

Graphic type   Geom               Description
Histogram      geom_histogram()   Produces histograms for continuous data.
Barplot        geom_bar()         Produces bar plots of counts for discrete data.

Histogram

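
A minimal histogram sketch using the mpg data. The choice of hwy and of 20 bins is an assumption made for illustration.

```r
library(ggplot2)

# Distribution of highway fuel economy (a continuous variable).
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

p_hist
```

Setting bins (or binwidth) explicitly avoids relying on the default number of bins.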

Exercise

Exercise #1

Create a histogram of the displ variable and change the default color of the bars to red.

Hint:

Barplot
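
A minimal barplot sketch using the mpg data. geom_bar() counts the number of rows in each category; the choice of class is an assumption made for illustration.

```r
library(ggplot2)

# Number of cars in each vehicle class (a discrete variable).
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

p_bar
```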

Two-variable graphics

Two-variable graphics are more common.

Graphic type   Geom             Description
Scatter plot   geom_point()     Produces a scatter plot between \(x\) and \(y\).
Line plot      geom_line()      Produces a line plot between \(x\) and \(y\).
Boxplot        geom_boxplot()   Produces a boxplot of \(y\) for each level of \(x\).

Scatter plot
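
A minimal scatter plot sketch using the mpg data. The variables displ and hwy are an assumption made for illustration.

```r
library(ggplot2)

# Highway fuel economy against engine displacement.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_scatter
```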

Line plot
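
A minimal line plot sketch. Line plots suit ordered data such as time series, so this example assumes the economics dataset that ships with ggplot2 rather than mpg; the dataset used on the original slide is not known.

```r
library(ggplot2)

# Number of unemployed people over time (economics ships with ggplot2).
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

p_line
```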

Exercise

Exercise #1

Create a scatter plot between hwy (x) and cty (y). Change the color of the points to blue and the size to 4.

Boxplot

In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles (Wikipedia).

To make a boxplot, we need to have a discrete/categorical variable on \(x\) and a continuous variable on \(y\).

Boxplot
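
A minimal boxplot sketch using the mpg data, with a categorical variable on x and a continuous variable on y. The choice of class and hwy is an assumption made for illustration.

```r
library(ggplot2)

# Distribution of highway fuel economy within each vehicle class.
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

p_box
```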

Exercise

Exercise #1

Create a boxplot using the following code:

Working with colors

It can be useful to add colors in graphics. To change the colors of the points, we have to use the color parameter.
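
A minimal sketch of a fixed color, using the mpg data. The color is given as an argument of the geom, outside aes(), so every point gets the same color. The color "red" is an arbitrary choice.

```r
library(ggplot2)

# All points are drawn in the same color.
p_fixed_color <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")

p_fixed_color
```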

Working with colors

If we want to set a color based on a variable, we have to use the aesthetic: aes(colour = variable).
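
A minimal sketch of a color mapped to a variable, using the mpg data. Because colour sits inside aes(), ggplot2 picks one color per level of the variable and adds a legend. The choice of class is an assumption made for illustration.

```r
library(ggplot2)

# The color of each point depends on the vehicle class.
p_mapped_color <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

p_mapped_color
```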

Working with size

In the same manner, the size of the dots can be based on a particular variable.
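
A minimal sketch of a size mapped to a variable, using the mpg data. The choice of cyl (number of cylinders) and of the transparency value is an assumption made for illustration.

```r
library(ggplot2)

# The size of each point depends on the number of cylinders.
# alpha makes overlapping points easier to see.
p_size <- ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point(alpha = 0.5)

p_size
```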

Exercise

Exercise #1

Try to reproduce this graphic.

What is wrong with this graphic?

Faceting

Faceting is a very powerful feature of the ggplot2 package that lets you display additional categorical variables in separate panels (facets).

There are two types of faceting: facet_grid() and facet_wrap().

1D facets

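
A minimal sketch of 1D faceting with facet_wrap(), using the mpg data: one panel is drawn per level of a single categorical variable. The choice of class is an assumption made for illustration.

```r
library(ggplot2)

# One panel per vehicle class, wrapped over several rows.
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)

p_wrap
```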

2D facets

2D facet graphics are made using the facet_grid() function.
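
A minimal sketch of 2D faceting with facet_grid(), using the mpg data: the variable on the left of ~ defines the rows and the variable on the right defines the columns. The choice of drv and cyl is an assumption made for illustration.

```r
library(ggplot2)

# Rows: type of drive train (drv). Columns: number of cylinders (cyl).
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_grid
```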

Graphics appearance

There are many options for modifying the look and feel of your graphics, and we cannot go through them all in a short period of time.

Book: ggplot2: Elegant Graphics for Data Analysis (Use R!)

Here I present the principal functions I usually use to make publication-ready graphics.

Graphics appearance
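
A sketch of a few functions often used to adjust appearance. Which functions the author has in mind is not stated, so labs(), theme_bw() and theme() are an assumption here, as are the label texts.

```r
library(ggplot2)

p_pretty <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  # Axis, legend and plot titles.
  labs(
    title = "Fuel economy decreases with engine size",
    x = "Engine displacement (L)",
    y = "Highway fuel economy (miles per gallon)",
    colour = "Vehicle class"
  ) +
  # A plain black-and-white theme with a slightly larger base font.
  theme_bw(base_size = 12) +
  # Fine-tune individual elements, here the legend position.
  theme(legend.position = "bottom")

p_pretty
```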

Saving your graphic

Saving your pretty ggplot2 graphics is pretty easy with the ggsave() function.
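
A minimal sketch of ggsave(), assuming a scatter plot of the mpg data. The file name, the temporary directory and the dimensions are arbitrary choices; ggsave() picks the output format from the file extension.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Write to a temporary directory here; use any path you like in practice.
out_file <- file.path(tempdir(), "mpg_scatter.pdf")

# Width and height are given in inches.
ggsave(filename = out_file, plot = p, width = 6, height = 4, units = "in")
```

Passing the plot explicitly is safer than relying on the default, which saves the last plot displayed.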

Exercise

Exercise #1

Use the following data and reproduce the plot on the next slide.

Hint #1: Use bind_rows() to bind both datasets.

Hint #2: Before starting, take time to analyze the graphic and think how you need to format the data and what are the different components of the plot.
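
Hint #1 can be sketched as follows. bind_rows() comes from the dplyr package; the two data frames and their values below are hypothetical, invented only to show the call, and are not the exercise data.

```r
library(dplyr)

# Two hypothetical datasets with the same columns (invented values).
df_a <- data.frame(year = c(2000, 2001), value = c(1.5, 2.0))
df_b <- data.frame(year = c(2000, 2001), value = c(3.5, 4.0))

# Stack the rows; .id adds a column recording which dataset each row
# came from, taken from the argument names ("a" and "b").
combined <- bind_rows(a = df_a, b = df_b, .id = "dataset")
combined
```

The extra column can then be mapped to an aesthetic such as colour, or used for faceting.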

Data source: http://climrun.cyi.ac.cy/?q=csv