Working with real data

Importing your data

Until now we have been working with simple data generated within R.

However, most of the time we want to work with external data.

  • Various sources and types of data: text file, images (geotif/raster).
  • Most of the time either .csv or .xls formats.

Data frames

Data frames are important objects in R which are created when reading a file.

Can be seen as an Excel tabular sheet:

  • Lines are observations
  • Columns are variables

A data frame can be seen as a matrix with the difference that columns (variables) can be of different types (numerics, dates, characters, etc.).

Examples

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Structure of a df

After you opened a data file, it is always a good idea to look the structure of the returned data frame. This ensure that all variables have the right types.

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Useful functions

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
## [1] 32
## [1] 11

Accessing elements of data frame

There are two main ways to access data of a data frame.

First method

The first way to access elements of a data frame consists in using indexes as we did for accessing matrix.

For example,

## [1] 160

Accessing columns of a data frame.

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109

Accessing rows of a data frame.

##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Second method

The second method to access elements of a data frame consists in using the $ operator using the df$variable scheme.

##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

Using the second method makes things more obvious and easy to read since you don’t have to wonder the position (indexes) of the variables in the data frame.

##  [1] 6 6 4 6 8 6 8 4 4 6

Quick-RStudio tip

After typing the name of a data frame, the list of all variables within this data frame will appears. Use the keyboard to select the variable of interest.

CSV and Excel files

I recommend to use readr and readxl libraries for reading CSV and Excel files. These two libraries are not installed by default.

Reading Excel files

Reading csv files

Use read_csv2() if you are using CSV files produced with EU settings (; instead of , separator).

Other file formats

There are many file format supported by R. This is a list of functions to use to open most used file format.

Data format Function Package
Comma separated values (.csv) read_csv() readr
Excel (.xls, .xlsx) read_excel() readxl
Other text files (.txt, .dat, …) read_delim() readr
Stata read_stata() haven
SPSS read_spss() haven
SAS read_sas() haven
Matlab file (.mat) readMat() R.matlab

Exercise

Exercise #1

Open your data in R and start exploring them.

  • Data types as expected?
  • How missing values (NA) have been handled by R?
  • Any problems?

Data manipulation

It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it’s not just a first step, but it must be repeated many over the course of analysis as new problems come to light or new data is collected.

  • Two main characteristics of good data manipulation:
    • speed (results are returned in reasonable time)
    • elegance (easy to use and understand what is going on)

The tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

The tidyverse

The tidyverse provides a set of packages which provides flexible tools for data manipulation.

The tidyverse

The package is not included in the base installation of R.

You can refer to the printed cheat sheet for an overview of the package’s functions.

dplyr

All dplyr functions use the same scheme: the first argument to provide is a data frame.

Function Definition
filter(df, …) Select a subset of the rows of a data frame.
arrange(df, …) Reorder rows of a data frame.
select(df, …) Select columns of a data frame.
mutate(df, …) Add new columns to a data frame.
summarise(df, …) Summarize a data frame into a single row. This function is very helpful in combination with the group_by() function.
group_by(df, …) Group a data frame based on the specified columns.

Lets practice!

To learn how to use these functions, we will use datasets from package nycflights13 which we first need to install.

nycflights13

This package contains information about all flights that departed from New York (i.e., EWR, JFK and LGA) in 2013.

For the following examples we are going to use fights, planes, airports, airlines tables.

Exploring the data

flights

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

planes

## Classes 'tbl_df', 'tbl' and 'data.frame':    3322 obs. of  9 variables:
##  $ tailnum     : chr  "N10156" "N102UW" "N103US" "N104UW" ...
##  $ year        : int  2004 1998 1999 1999 2002 1999 1999 1999 1999 1999 ...
##  $ type        : chr  "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" ...
##  $ manufacturer: chr  "EMBRAER" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" ...
##  $ model       : chr  "EMB-145XR" "A320-214" "A320-214" "A320-214" ...
##  $ engines     : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ seats       : int  55 182 182 182 55 182 182 182 182 182 ...
##  $ speed       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ engine      : chr  "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" ...

airports

## Classes 'tbl_df', 'tbl' and 'data.frame':    1458 obs. of  8 variables:
##  $ faa  : chr  "04G" "06A" "06C" "06N" ...
##  $ name : chr  "Lansdowne Airport" "Moton Field Municipal Airport" "Schaumburg Regional" "Randall Airport" ...
##  $ lat  : num  41.1 32.5 42 41.4 31.1 ...
##  $ lon  : num  -80.6 -85.7 -88.1 -74.4 -81.4 ...
##  $ alt  : int  1044 264 801 523 11 1593 730 492 1000 108 ...
##  $ tz   : num  -5 -6 -6 -5 -5 -5 -5 -5 -5 -8 ...
##  $ dst  : chr  "A" "A" "A" "A" ...
##  $ tzone: chr  "America/New_York" "America/Chicago" "America/Chicago" "America/New_York" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_integer(),
##   ..   name = col_character(),
##   ..   city = col_character(),
##   ..   country = col_character(),
##   ..   faa = col_character(),
##   ..   icao = col_character(),
##   ..   lat = col_double(),
##   ..   lon = col_double(),
##   ..   alt = col_integer(),
##   ..   tz = col_double(),
##   ..   dst = col_character(),
##   ..   tzone = col_character()
##   .. )

airlines

## Classes 'tbl_df', 'tbl' and 'data.frame':    16 obs. of  2 variables:
##  $ carrier: chr  "9E" "AA" "AS" "B6" ...
##  $ name   : chr  "Endeavor Air Inc." "American Airlines Inc." "Alaska Airlines Inc." "JetBlue Airways" ...

Relationships between tables

Filtering data

filter

Most of the time you will want to work on subset observations of your data.

  • filter() allows you to select a subset of the rows of a data frame.

  • The first argument is the name of the data frame, and the second and subsequent are filtering expressions evaluated in the context of that data frame.

Example

Select all flights with tail number equal to “N14228”.

## # A tibble: 111 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5
##  2  2013     1     8     1435           1440        -5     1717           1746       -29 UA        1579 N14228  EWR    MIA        150     1085    14
##  3  2013     1     9      717            700        17      812            815        -3 UA        1142 N14228  EWR    BOS         39      200     7
##  4  2013     1     9     1143           1144        -1     1425           1445       -20 UA        1707 N14228  EWR    TPA        149      997    11
##  5  2013     1    13      835            824        11     1030            951        39 UA        1572 N14228  EWR    BOS         43      200     8
##  6  2013     1    16     1829           1730        59     2117           2023        54 UA        1637 N14228  EWR    TPA        144      997    17
##  7  2013     1    22     1902           1808        54     2214           2106        68 UA        1269 N14228  EWR    PBI        163     1023    18
##  8  2013     1    23     1050           1056        -6     1143           1208       -25 UA        1047 N14228  EWR    BOS         39      200    10
##  9  2013     1    23     1533           1529         4     1641           1645        -4 UA        1116 N14228  EWR    BOS         47      200    15
## 10  2013     1    25      724            720         4     1000           1023       -23 UA        1724 N14228  EWR    PBI        137     1023     7
## # … with 101 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>

Example

Select all flights with a departure delay >= 180 minutes.

## # A tibble: 3,945 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1  2013     1     1      848           1835       853     1001           1950       851 MQ        3944 N942MQ  JFK    BWI         41      184    18
##  2  2013     1     1     1815           1325       290     2120           1542       338 EV        4417 N17185  EWR    OMA        213     1134    13
##  3  2013     1     1     1842           1422       260     1958           1535       263 EV        4633 N18120  EWR    BTV         46      266    14
##  4  2013     1     1     2006           1630       216     2230           1848       222 EV        4644 N14972  EWR    SAV        121      708    16
##  5  2013     1     1     2115           1700       255     2330           1920       250 9E        3347 N924XJ  JFK    CVG        115      589    17
##  6  2013     1     1     2205           1720       285       46           2040       246 AA        1999 N5DNAA  EWR    MIA        146     1085    17
##  7  2013     1     1     2312           2000       192       21           2110       191 EV        4312 N13958  EWR    DCA         44      199    20
##  8  2013     1     1     2343           1724       379      314           1938       456 EV        4321 N21197  EWR    MCI        222     1092    17
##  9  2013     1     2     1244            900       224     1431           1104       207 EV        4412 N13958  EWR    MYR         94      550     9
## 10  2013     1     2     1332            904       268     1616           1128       288 EV        4364 N41104  EWR    MCI        203     1092     9
## # … with 3,935 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>

Example

Select all flights from carriers “MQ” or “EV”.

## # A tibble: 80,570 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53      229     6
##  2  2013     1     1      600            600         0      837            825        12 MQ        4650 N542MQ  LGA    ATL        134      762     6
##  3  2013     1     1      602            605        -3      821            805        16 MQ        4401 N730MQ  LGA    DTW        105      502     6
##  4  2013     1     1      608            600         8      807            735        32 MQ        3768 N9EAMQ  EWR    ORD        139      719     6
##  5  2013     1     1      624            630        -6      909            840        29 EV        4626 N11107  EWR    MSP        190     1008     6
##  6  2013     1     1      624            630        -6      840            830        10 MQ        4599 N518MQ  LGA    MSP        166     1020     6
##  7  2013     1     1      632            608        24      740            728        12 EV        4144 N13553  EWR    IAD         52      212     6
##  8  2013     1     1      656            705        -9     1007            940        27 MQ        4534 N722MQ  LGA    XNA        233     1147     7
##  9  2013     1     1      749            710        39      939            850        49 MQ        3737 N508MQ  EWR    ORD        148      719     7
## 10  2013     1     1      800            810       -10      949            955        -6 MQ        4406 N828MQ  JFK    RDU         80      427     8
## # … with 80,560 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>

Example

Select all flights from carriers MQ OR EV.

## # A tibble: 80,570 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53      229     6
##  2  2013     1     1      600            600         0      837            825        12 MQ        4650 N542MQ  LGA    ATL        134      762     6
##  3  2013     1     1      602            605        -3      821            805        16 MQ        4401 N730MQ  LGA    DTW        105      502     6
##  4  2013     1     1      608            600         8      807            735        32 MQ        3768 N9EAMQ  EWR    ORD        139      719     6
##  5  2013     1     1      624            630        -6      909            840        29 EV        4626 N11107  EWR    MSP        190     1008     6
##  6  2013     1     1      624            630        -6      840            830        10 MQ        4599 N518MQ  LGA    MSP        166     1020     6
##  7  2013     1     1      632            608        24      740            728        12 EV        4144 N13553  EWR    IAD         52      212     6
##  8  2013     1     1      656            705        -9     1007            940        27 MQ        4534 N722MQ  LGA    XNA        233     1147     7
##  9  2013     1     1      749            710        39      939            850        49 MQ        3737 N508MQ  EWR    ORD        148      719     7
## 10  2013     1     1      800            810       -10      949            955        -6 MQ        4406 N828MQ  JFK    RDU         80      427     8
## # … with 80,560 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>

arrange

  • arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them.
  • The function takes a data frame, and a set of column names (or more complicated expressions) to order by.

Example

Order flights on the departure delay.

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1  2013    12     7     2040           2123       -43       40           2352        48 B6          97 N592JB  JFK    DEN        265     1626    21
##  2  2013     2     3     2022           2055       -33     2240           2338       -58 DL        1715 N612DL  LGA    MSY        162     1183    20
##  3  2013    11    10     1408           1440       -32     1549           1559       -10 EV        5713 N825AS  LGA    IAD         52      229    14
##  4  2013     1    11     1900           1930       -30     2233           2243       -10 DL        1435 N934DL  LGA    TPA        139     1010    19
##  5  2013     1    29     1703           1730       -27     1947           1957       -10 F9         837 N208FR  LGA    DEN        250     1620    17
##  6  2013     8     9      729            755       -26     1002            955         7 MQ        3478 N711MQ  LGA    DTW         88      502     7
##  7  2013    10    23     1907           1932       -25     2143           2143         0 EV        4361 N13994  EWR    TYS        111      631    19
##  8  2013     3    30     2030           2055       -25     2213           2250       -37 MQ        4573 N725MQ  LGA    DTW         87      502    20
##  9  2013     3     2     1431           1455       -24     1601           1631       -30 9E        3318 N929XJ  JFK    BUF         55      301    14
## 10  2013     5     5      934            958       -24     1225           1309       -44 B6         375 N531JB  LGA    FLL        150     1076     9
## # … with 336,766 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>

Example

Order flights on the departure delay and by arrival delay

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1  2013    12     7     2040           2123       -43       40           2352        48 B6          97 N592JB  JFK    DEN        265     1626    21
##  2  2013     2     3     2022           2055       -33     2240           2338       -58 DL        1715 N612DL  LGA    MSY        162     1183    20
##  3  2013    11    10     1408           1440       -32     1549           1559       -10 EV        5713 N825AS  LGA    IAD         52      229    14
##  4  2013     1    11     1900           1930       -30     2233           2243       -10 DL        1435 N934DL  LGA    TPA        139     1010    19
##  5  2013     1    29     1703           1730       -27     1947           1957       -10 F9         837 N208FR  LGA    DEN        250     1620    17
##  6  2013     8     9      729            755       -26     1002            955         7 MQ        3478 N711MQ  LGA    DTW         88      502     7
##  7  2013     3    30     2030           2055       -25     2213           2250       -37 MQ        4573 N725MQ  LGA    DTW         87      502    20
##  8  2013    10    23     1907           1932       -25     2143           2143         0 EV        4361 N13994  EWR    TYS        111      631    19
##  9  2013     5     5      934            958       -24     1225           1309       -44 B6         375 N531JB  LGA    FLL        150     1076     9
## 10  2013     9    18     1631           1655       -24     1812           1845       -33 AA        2223 N4XXAA  LGA    STL        125      888    16
## # … with 336,766 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>

Example

Order flights on the departure delay and by decreasing arrival delay

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1  2013    12     7     2040           2123       -43       40           2352        48 B6          97 N592JB  JFK    DEN        265     1626    21
##  2  2013     2     3     2022           2055       -33     2240           2338       -58 DL        1715 N612DL  LGA    MSY        162     1183    20
##  3  2013    11    10     1408           1440       -32     1549           1559       -10 EV        5713 N825AS  LGA    IAD         52      229    14
##  4  2013     1    11     1900           1930       -30     2233           2243       -10 DL        1435 N934DL  LGA    TPA        139     1010    19
##  5  2013     1    29     1703           1730       -27     1947           1957       -10 F9         837 N208FR  LGA    DEN        250     1620    17
##  6  2013     8     9      729            755       -26     1002            955         7 MQ        3478 N711MQ  LGA    DTW         88      502     7
##  7  2013    10    23     1907           1932       -25     2143           2143         0 EV        4361 N13994  EWR    TYS        111      631    19
##  8  2013     3    30     2030           2055       -25     2213           2250       -37 MQ        4573 N725MQ  LGA    DTW         87      502    20
##  9  2013     5    14      914            938       -24     1143           1204       -21 9E        4065 N8747B  LGA    SDF        107      659     9
## 10  2013     3     2     1431           1455       -24     1601           1631       -30 9E        3318 N929XJ  JFK    BUF         55      301    14
## # … with 336,766 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>

select

  • select() is used to extract columns from a data frame.
  • Useful when you want to perform data analysis on specific columns and not the complete dataset.

Example

Select “faa” and “name” columns from the airports data frame.

## # A tibble: 1,458 x 2
##    faa   name                          
##    <chr> <chr>                         
##  1 04G   Lansdowne Airport             
##  2 06A   Moton Field Municipal Airport 
##  3 06C   Schaumburg Regional           
##  4 06N   Randall Airport               
##  5 09J   Jekyll Island Airport         
##  6 0A9   Elizabethton Municipal Airport
##  7 0G6   Williams County Airport       
##  8 0G7   Finger Lakes Regional Airport 
##  9 0P2   Shoestring Aviation Airfield  
## 10 0S9   Jefferson County Intl         
## # … with 1,448 more rows

Example

You can also use select() to remove columns by using - in front of the columns you want to remove.

## # A tibble: 1,458 x 6
##      lat    lon   alt    tz dst   tzone              
##    <dbl>  <dbl> <int> <dbl> <chr> <chr>              
##  1  41.1  -80.6  1044    -5 A     America/New_York   
##  2  32.5  -85.7   264    -6 A     America/Chicago    
##  3  42.0  -88.1   801    -6 A     America/Chicago    
##  4  41.4  -74.4   523    -5 A     America/New_York   
##  5  31.1  -81.4    11    -5 A     America/New_York   
##  6  36.4  -82.2  1593    -5 A     America/New_York   
##  7  41.5  -84.5   730    -5 A     America/New_York   
##  8  42.9  -76.8   492    -5 A     America/New_York   
##  9  39.8  -76.6  1000    -5 U     America/New_York   
## 10  48.1 -123.    108    -8 A     America/Los_Angeles
## # … with 1,448 more rows

mutate and transmute

  • The mutate() and transmute() functions are used to add columns to a data frame.
  • The difference between both functions is that mutate adds new variables and preserves existing while transmute drops existing variables.

Example

Add a new column named “distance_km” which converts mile distances into km distances.

## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5
##  2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5
##  3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5
##  4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5
##  5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6
##  6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150      719     5
##  7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL        158     1065     6
##  8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53      229     6
##  9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO        140      944     6
## 10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138      733     6
## # … with 336,766 more rows, and 3 more variables: minute <dbl>, time_hour <dttm>, distance_km <dbl>

Example

transmute() drops all other columns.

## # A tibble: 336,776 x 1
##    distance_km
##          <dbl>
##  1       2253.
##  2       2279.
##  3       1753.
##  4       2536.
##  5       1226.
##  6       1157.
##  7       1714.
##  8        369.
##  9       1519.
## 10       1180.
## # … with 336,766 more rows

Example

It is also possible to add many new variable in a single operation.

## # A tibble: 336,776 x 2
##    distance_km distance_knot
##          <dbl>         <dbl>
##  1       2253.         1217.
##  2       2279.         1230.
##  3       1753.          946.
##  4       2536.         1370.
##  5       1226.          662.
##  6       1157.          625.
##  7       1714.          925.
##  8        369.          199.
##  9       1519.          820.
## 10       1180.          637.
## # … with 336,766 more rows

summarise

  • The summarise() function allows to summarise columns from a dataframe.
  • Compress a data frame into single values (i.e. one row).

Hence, you have to provide a function that will be used to calculate a certain value.

Example

Calculate the average departure and arrival delays in the flights table.

## # A tibble: 1 x 2
##   average_departure_delay average_arrival_delay
##                     <dbl>                 <dbl>
## 1                      NA                    NA

What is going on?

Example

Calculate the average departure and arrival delays in the flights table.

## # A tibble: 1 x 2
##   average_departure_delay average_arrival_delay
##                     <dbl>                 <dbl>
## 1                    12.6                  6.90

Working with grouped data

  • Alone, the summarise() function is not really helpful because it collapses a data frame to a single row.
  • Usually we want to work on grouped data.
  • group_by() “prepares” a data frame to send it to the summarise() function.

Example

Calculate the average departure delay for each carrier.

## # A tibble: 16 x 2
##    carrier average_departure_delay
##    <chr>                     <dbl>
##  1 9E                        16.7 
##  2 AA                         8.59
##  3 AS                         5.80
##  4 B6                        13.0 
##  5 DL                         9.26
##  6 EV                        20.0 
##  7 F9                        20.2 
##  8 FL                        18.7 
##  9 HA                         4.90
## 10 MQ                        10.6 
## 11 OO                        12.6 
## 12 UA                        12.1 
## 13 US                         3.78
## 14 VX                        12.9 
## 15 WN                        17.7 
## 16 YV                        19.0

Example

It might be interested to know how many observations were used in each group to compute the average. To do so, we can use the n() function.

## # A tibble: 16 x 3
##    carrier average_departure_delay number_of_observation
##    <chr>                     <dbl>                 <int>
##  1 9E                        16.7                  18460
##  2 AA                         8.59                 32729
##  3 AS                         5.80                   714
##  4 B6                        13.0                  54635
##  5 DL                         9.26                 48110
##  6 EV                        20.0                  54173
##  7 F9                        20.2                    685
##  8 FL                        18.7                   3260
##  9 HA                         4.90                   342
## 10 MQ                        10.6                  26397
## 11 OO                        12.6                     32
## 12 UA                        12.1                  58665
## 13 US                         3.78                 20536
## 14 VX                        12.9                   5162
## 15 WN                        17.7                  12275
## 16 YV                        19.0                    601

Example

It is possible to summarise more than one column at once.

## # A tibble: 16 x 5
##    carrier number_of_flight average_departure_delay min_delay max_delay
##    <chr>              <int>                   <dbl>     <dbl>     <dbl>
##  1 9E                 18460                   16.7        -24       747
##  2 AA                 32729                    8.59       -24      1014
##  3 AS                   714                    5.80       -21       225
##  4 B6                 54635                   13.0        -43       502
##  5 DL                 48110                    9.26       -33       960
##  6 EV                 54173                   20.0        -32       548
##  7 F9                   685                   20.2        -27       853
##  8 FL                  3260                   18.7        -22       602
##  9 HA                   342                    4.90       -16      1301
## 10 MQ                 26397                   10.6        -26      1137
## 11 OO                    32                   12.6        -14       154
## 12 UA                 58665                   12.1        -20       483
## 13 US                 20536                    3.78       -19       500
## 14 VX                  5162                   12.9        -20       653
## 15 WN                 12275                   17.7        -13       471
## 16 YV                   601                   19.0        -16       387

Exercise

For this exercise, we will look at statistics of R packages downloaded on 2015-11-21. You can load the data as follow:

Question #1

Which package has been downloaded the most that day?

Exercise

Question #2

From which countries people downloaded the most packages?

Exercise

Question #3

How many packages have been downloaded that day from people in Denmark?

Joining tables

Joining tables

  • Data are rarely stored in a single data frame.

  • Good database practices recommend to have similar data in separate tables.

  • Joins are used to “merge” data frame together.

Joins in R

There are many ways of joining tables (data frames) in R. We will however focus on the merging tools offered by the package dplyr.

Types of join

Source: http://bit.ly/1QZ8Oil

Examples

To understand how joins work we will create two small data frames from data in nycflights13.

## # A tibble: 336,776 x 5
##     year month   day  hour origin
##    <int> <int> <int> <dbl> <chr> 
##  1  2013     1     1     5 EWR   
##  2  2013     1     1     5 LGA   
##  3  2013     1     1     5 JFK   
##  4  2013     1     1     5 JFK   
##  5  2013     1     1     6 LGA   
##  6  2013     1     1     5 EWR   
##  7  2013     1     1     6 EWR   
##  8  2013     1     1     6 LGA   
##  9  2013     1     1     6 JFK   
## 10  2013     1     1     6 LGA   
## # … with 336,766 more rows

Examples

## # A tibble: 1,458 x 2
##    faa   name                          
##    <chr> <chr>                         
##  1 04G   Lansdowne Airport             
##  2 06A   Moton Field Municipal Airport 
##  3 06C   Schaumburg Regional           
##  4 06N   Randall Airport               
##  5 09J   Jekyll Island Airport         
##  6 0A9   Elizabethton Municipal Airport
##  7 0G6   Williams County Airport       
##  8 0G7   Finger Lakes Regional Airport 
##  9 0P2   Shoestring Aviation Airfield  
## 10 0S9   Jefferson County Intl         
## # … with 1,448 more rows

left join

With a left join, all observations from table A are keep and only observations from table B that match will be merged.

## Error: `by` required, because the data sources have no common variables

left join

With a left join, all observations from table A are keep and only observations from table B that match will be merged.

## # A tibble: 336,776 x 6
##     year month   day  hour origin name               
##    <int> <int> <int> <dbl> <chr>  <chr>              
##  1  2013     1     1     5 EWR    Newark Liberty Intl
##  2  2013     1     1     5 LGA    La Guardia         
##  3  2013     1     1     5 JFK    John F Kennedy Intl
##  4  2013     1     1     5 JFK    John F Kennedy Intl
##  5  2013     1     1     6 LGA    La Guardia         
##  6  2013     1     1     5 EWR    Newark Liberty Intl
##  7  2013     1     1     6 EWR    Newark Liberty Intl
##  8  2013     1     1     6 LGA    La Guardia         
##  9  2013     1     1     6 JFK    John F Kennedy Intl
## 10  2013     1     1     6 LGA    La Guardia         
## # … with 336,766 more rows

left join

Attention, the order is important!

## # A tibble: 338,231 x 6
##    faa   name                            year month   day  hour
##    <chr> <chr>                          <int> <int> <int> <dbl>
##  1 04G   Lansdowne Airport                 NA    NA    NA    NA
##  2 06A   Moton Field Municipal Airport     NA    NA    NA    NA
##  3 06C   Schaumburg Regional               NA    NA    NA    NA
##  4 06N   Randall Airport                   NA    NA    NA    NA
##  5 09J   Jekyll Island Airport             NA    NA    NA    NA
##  6 0A9   Elizabethton Municipal Airport    NA    NA    NA    NA
##  7 0G6   Williams County Airport           NA    NA    NA    NA
##  8 0G7   Finger Lakes Regional Airport     NA    NA    NA    NA
##  9 0P2   Shoestring Aviation Airfield      NA    NA    NA    NA
## 10 0S9   Jefferson County Intl             NA    NA    NA    NA
## # … with 338,221 more rows

right join

With a right join, all observations from table B are keep and only observations from table A that match will be merged.

## # A tibble: 338,231 x 6
##     year month   day  hour origin name                          
##    <int> <int> <int> <dbl> <chr>  <chr>                         
##  1    NA    NA    NA    NA 04G    Lansdowne Airport             
##  2    NA    NA    NA    NA 06A    Moton Field Municipal Airport 
##  3    NA    NA    NA    NA 06C    Schaumburg Regional           
##  4    NA    NA    NA    NA 06N    Randall Airport               
##  5    NA    NA    NA    NA 09J    Jekyll Island Airport         
##  6    NA    NA    NA    NA 0A9    Elizabethton Municipal Airport
##  7    NA    NA    NA    NA 0G6    Williams County Airport       
##  8    NA    NA    NA    NA 0G7    Finger Lakes Regional Airport 
##  9    NA    NA    NA    NA 0P2    Shoestring Aviation Airfield  
## 10    NA    NA    NA    NA 0S9    Jefferson County Intl         
## # … with 338,221 more rows

inner join

With a inner join, only observations that match both tables are kept.

## # A tibble: 336,776 x 6
##     year month   day  hour origin name               
##    <int> <int> <int> <dbl> <chr>  <chr>              
##  1  2013     1     1     5 EWR    Newark Liberty Intl
##  2  2013     1     1     5 LGA    La Guardia         
##  3  2013     1     1     5 JFK    John F Kennedy Intl
##  4  2013     1     1     5 JFK    John F Kennedy Intl
##  5  2013     1     1     6 LGA    La Guardia         
##  6  2013     1     1     5 EWR    Newark Liberty Intl
##  7  2013     1     1     6 EWR    Newark Liberty Intl
##  8  2013     1     1     6 LGA    La Guardia         
##  9  2013     1     1     6 JFK    John F Kennedy Intl
## 10  2013     1     1     6 LGA    La Guardia         
## # … with 336,766 more rows

full join

With a full join, all observations from both tables are kept.

## # A tibble: 338,231 x 6
##     year month   day  hour origin name               
##    <int> <int> <int> <dbl> <chr>  <chr>              
##  1  2013     1     1     5 EWR    Newark Liberty Intl
##  2  2013     1     1     5 LGA    La Guardia         
##  3  2013     1     1     5 JFK    John F Kennedy Intl
##  4  2013     1     1     5 JFK    John F Kennedy Intl
##  5  2013     1     1     6 LGA    La Guardia         
##  6  2013     1     1     5 EWR    Newark Liberty Intl
##  7  2013     1     1     6 EWR    Newark Liberty Intl
##  8  2013     1     1     6 LGA    La Guardia         
##  9  2013     1     1     6 JFK    John F Kennedy Intl
## 10  2013     1     1     6 LGA    La Guardia         
## # … with 338,221 more rows

anti join

With an anti join, only observations from table A that do not match table B are kept.

## # A tibble: 1,455 x 2
##    faa   name                          
##    <chr> <chr>                         
##  1 04G   Lansdowne Airport             
##  2 06A   Moton Field Municipal Airport 
##  3 06C   Schaumburg Regional           
##  4 06N   Randall Airport               
##  5 09J   Jekyll Island Airport         
##  6 0A9   Elizabethton Municipal Airport
##  7 0G6   Williams County Airport       
##  8 0G7   Finger Lakes Regional Airport 
##  9 0P2   Shoestring Aviation Airfield  
## 10 0S9   Jefferson County Intl         
## # … with 1,445 more rows

Exercise

Exercise #1

First, load the statistics of R packages downloaded on 2015-11-21 and countries ISO codes as follow:

From which countries there were no packages downloaded this particular date?

Tidy data

Tidy data

In a tidy data set, each row correspond to an observation whereas each column is a variable.

Tidying a dataset consists in cleaning and structuring datasets to facilitate analysis.

What is a messy dataset?

https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

There are five main common problems that define a messy dataset:

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

Is there a solution?

Of course! Most of these problems can be easily addressed with little effort using the tidyr package (included in tidyverse).

You can refer to the printed cheat sheet for an overview of the package’s functions.

Column headers are values, not variable name

The messy data

The following dataset explores the relationship between income and religion in the US.

The messy data

Take a minute to look at the following table to find what is the main problems.

religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k Don’t know/refused
1 Agnostic 27.00 34.00 60.00 81.00 76.00 137.00 122.00 109.00 84.00 96.00
2 Atheist 12.00 27.00 37.00 52.00 35.00 70.00 73.00 59.00 74.00 76.00
3 Buddhist 27.00 21.00 30.00 34.00 33.00 58.00 62.00 39.00 53.00 54.00
4 Catholic 418.00 617.00 732.00 670.00 638.00 1116.00 949.00 792.00 633.00 1489.00
5 Don’t know/refused 15.00 14.00 15.00 11.00 10.00 35.00 21.00 17.00 18.00 116.00
6 Evangelical Prot 575.00 869.00 1064.00 982.00 881.00 1486.00 949.00 723.00 414.00 1529.00

The problem

Column headers are values, not variable names

What we need to do here is to gather information that is spread in the column names.

This is done using the gather(df, key, value) function where df is a data frame, key is the new column name that will contain the values of the data spread in the old dataframe and value is the name of the new column that will contain the actual values.

We can also specify variables that we do not need to tidy using the -var operator. In this case, -religion.

gather

religion income frequency
1 Agnostic <$10k 27.00
2 Atheist <$10k 12.00
3 Buddhist <$10k 27.00
4 Catholic <$10k 418.00
5 Don’t know/refused <$10k 15.00
6 Evangelical Prot <$10k 575.00
7 Hindu <$10k 1.00
8 Historically Black Prot <$10k 228.00
9 Jehovah’s Witness <$10k 20.00
10 Jewish <$10k 19.00

This new dataset is tidy each column represents a variable and each row represents an observation. We will later see how tidy data can be very useful for analysis and plotting (especially with ggplot2).

Multiple variables stored in one column

The messy data

The following messy dataset comes from the World Health Organisation, and records the counts of confirmed tuberculosis cases between 1980 and 2008.

The demographic groups are broken down by sex (m, f) and age (0-14, 15-25, 25-34, 35-44, 45-54, 55-64, unknown).

http://www.who.int/tb/country/data/download/en/

The messy data

Take a minute to look at the following table to find what are the main problems.

iso2 year new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
1 AD 1989.00 NA NA NA NA NA NA
2 AD 1990.00 NA NA NA NA NA NA
3 AD 1991.00 NA NA NA NA NA NA
4 AD 1992.00 NA NA NA NA NA NA
5 AD 1993.00 NA NA NA NA NA NA
6 AD 1994.00 NA NA NA NA NA NA
7 AD 1996.00 NA NA 0.00 0.00 0.00 4.00
8 AD 1997.00 NA NA 0.00 0.00 1.00 2.00
9 AD 1998.00 NA NA 0.00 0.00 0.00 1.00
10 AD 1999.00 NA NA 0.00 0.00 0.00 1.00
11 AD 2000.00 NA NA 0.00 0.00 1.00 0.00
12 AD 2001.00 NA NA 0.00 NA NA 2.00

The problems

There are two main problems with this dataset:

  1. We have column names representing observations.
  2. There are two type of information in column names (sex and age).

Problem #1

Column headers are values, not variable names

As for the previous dataset, we use the gather() function.

iso2 year demographic frequency
1 AD 1989.00 new_sp_m04 NA
2 AD 1990.00 new_sp_m04 NA
3 AD 1991.00 new_sp_m04 NA
4 AD 1992.00 new_sp_m04 NA
5 AD 1993.00 new_sp_m04 NA
6 AD 1994.00 new_sp_m04 NA

At this point, the dataset looks already better.

Problem #2

Multiple variables stored in one column

The column demographic contains two types of information (sex and age).

It seems ti be always the same pattern: sex start at the 7th position and the age at the 8th position.

Problem #2

To “split” this column into three different columns, we will use the separate() function.

The result

iso2 year junk sex age_group frequency
1 AD 1989.00 new_sp_ m 04 NA
2 AD 1990.00 new_sp_ m 04 NA
3 AD 1991.00 new_sp_ m 04 NA
4 AD 1992.00 new_sp_ m 04 NA
5 AD 1993.00 new_sp_ m 04 NA
6 AD 1994.00 new_sp_ m 04 NA
7 AD 1996.00 new_sp_ m 04 NA
8 AD 1997.00 new_sp_ m 04 NA
9 AD 1998.00 new_sp_ m 04 NA
10 AD 1999.00 new_sp_ m 04 NA
11 AD 2000.00 new_sp_ m 04 NA
12 AD 2001.00 new_sp_ m 04 NA
13 AD 2002.00 new_sp_ m 04 NA
14 AD 2003.00 new_sp_ m 04 NA
15 AD 2004.00 new_sp_ m 04 NA

Variables are stored in both rows and columns

The messy data

The following messy dataset is the daily weather data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010.

Take a minute to look at the following table to find what are the main problems.

id year month element d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
1 MX17004 2010.00 1.00 tmax NA NA NA NA NA NA NA NA NA NA
2 MX17004 2010.00 1.00 tmin NA NA NA NA NA NA NA NA NA NA
3 MX17004 2010.00 2.00 tmax NA 27.30 24.10 NA NA NA NA NA NA NA
4 MX17004 2010.00 2.00 tmin NA 14.40 14.40 NA NA NA NA NA NA NA
5 MX17004 2010.00 3.00 tmax NA NA NA NA 32.10 NA NA NA NA 34.50
6 MX17004 2010.00 3.00 tmin NA NA NA NA 14.20 NA NA NA NA 16.80

Problem #1

Column headers are values, not variable names

It is easy to understand that we have values spread in the columns. For instance, d1 means day 1. We should instead have a column named day with values from 1 to 31.

First problem resolved

It feels better already.

id year month element day temperature
1 MX17004 2010.00 12.00 tmax d1 29.90
2 MX17004 2010.00 12.00 tmin d1 13.80
3 MX17004 2010.00 2.00 tmax d2 27.30
4 MX17004 2010.00 2.00 tmin d2 14.40
5 MX17004 2010.00 11.00 tmax d2 31.30
6 MX17004 2010.00 11.00 tmin d2 16.30
7 MX17004 2010.00 2.00 tmax d3 24.10
8 MX17004 2010.00 2.00 tmin d3 14.40
9 MX17004 2010.00 7.00 tmax d3 28.60
10 MX17004 2010.00 7.00 tmin d3 17.50
11 MX17004 2010.00 11.00 tmax d4 27.20
12 MX17004 2010.00 11.00 tmin d4 12.00

Problem #2

The variable day is not numeric.

## Classes 'tbl_df', 'tbl' and 'data.frame':    66 obs. of  6 variables:
##  $ id         : chr  "MX17004" "MX17004" "MX17004" "MX17004" ...
##  $ year       : num  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ month      : num  12 12 2 2 11 11 2 2 7 7 ...
##  $ element    : chr  "tmax" "tmin" "tmax" "tmin" ...
##  $ day        : chr  "d1" "d1" "d2" "d2" ...
##  $ temperature: num  29.9 13.8 27.3 14.4 31.3 16.3 24.1 14.4 28.6 17.5 ...
##  - attr(*, "na.action")= 'omit' Named int  1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr  "1" "2" "3" "4" ...

Problem #2

We can easily extract numerical values from character using the parse_number() function.

id year month element day temperature
1 MX17004 2010.00 12.00 tmax 1.00 29.90
2 MX17004 2010.00 12.00 tmin 1.00 13.80
3 MX17004 2010.00 2.00 tmax 2.00 27.30
4 MX17004 2010.00 2.00 tmin 2.00 14.40
5 MX17004 2010.00 11.00 tmax 2.00 31.30
6 MX17004 2010.00 11.00 tmin 2.00 16.30
7 MX17004 2010.00 2.00 tmax 3.00 24.10
8 MX17004 2010.00 2.00 tmin 3.00 14.40
9 MX17004 2010.00 7.00 tmax 3.00 28.60
10 MX17004 2010.00 7.00 tmin 3.00 17.50
11 MX17004 2010.00 11.00 tmax 4.00 27.20
12 MX17004 2010.00 11.00 tmin 4.00 12.00

Problem #3

Variables are stored in column

Looking at the table we see that there is something strange with element. Can you guess it?

id year month element day temperature
1 MX17004 2010.00 12.00 tmax 1.00 29.90
2 MX17004 2010.00 12.00 tmin 1.00 13.80
3 MX17004 2010.00 2.00 tmax 2.00 27.30
4 MX17004 2010.00 2.00 tmin 2.00 14.40
5 MX17004 2010.00 11.00 tmax 2.00 31.30
6 MX17004 2010.00 11.00 tmin 2.00 16.30
7 MX17004 2010.00 2.00 tmax 3.00 24.10
8 MX17004 2010.00 2.00 tmin 3.00 14.40
9 MX17004 2010.00 7.00 tmax 3.00 28.60
10 MX17004 2010.00 7.00 tmin 3.00 17.50
11 MX17004 2010.00 11.00 tmax 4.00 27.20
12 MX17004 2010.00 11.00 tmin 4.00 12.00

Problem #3

tmin and tmax are variables, not values and we should make them columns instead.

This is easily done using the spread() function.

id year month day tmax tmin
1 MX17004 2010.00 1.00 30.00 27.80 14.50
2 MX17004 2010.00 2.00 2.00 27.30 14.40
3 MX17004 2010.00 2.00 3.00 24.10 14.40
4 MX17004 2010.00 2.00 11.00 29.70 13.40
5 MX17004 2010.00 2.00 23.00 29.90 10.70
6 MX17004 2010.00 3.00 5.00 32.10 14.20
7 MX17004 2010.00 3.00 10.00 34.50 16.80
8 MX17004 2010.00 3.00 16.00 31.10 17.60
9 MX17004 2010.00 4.00 27.00 36.30 16.70
10 MX17004 2010.00 5.00 27.00 33.20 18.20

Your turn

Using your data:

  • Is you data tidy?
  • If not, what can you do?
  • Do you need to merge different data sources? What type of joins would you use?