Until now we have been working with simple data generated within R.
However, most of the time we want to work with external data.
Data frames are important objects in R; they are created when reading a file.
A data frame can be seen as an Excel spreadsheet: it is like a matrix, with the difference that columns (variables) can be of different types (numerics, dates, characters, etc.).
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
After you open a data file, it is always a good idea to look at the structure of the returned data frame. This ensures that all variables have the right types.
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
## [1] 32
## [1] 11
There are two main ways to access the data in a data frame.
The first way to access elements of a data frame consists of using indexes, as we did for matrices.
For example:
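A minimal sketch, assuming the mtcars data frame shown above (its third column is disp):
mtcars[1, 3] # Value at row 1, column 3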
## [1] 160
Accessing columns of a data frame.
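For instance (an illustrative call on mtcars):
mtcars[, "hp"] # All values of the hp column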
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
Accessing rows of a data frame.
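For instance (again a sketch on mtcars):
mtcars[1, ] # The first row, with all its columns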
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
The second method to access elements of a data frame consists of using the $ operator with the df$variable scheme.
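For instance (a sketch on mtcars):
mtcars$cyl # The cyl column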
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Using the second method makes things more obvious and easier to read, since you do not have to remember the positions (indexes) of the variables in the data frame.
## [1] 6 6 4 6 8 6 8 4 4 6
After typing the name of a data frame followed by $, a list of all the variables within this data frame will appear. Use the keyboard to select the variable of interest.
I recommend using the readr and readxl packages for reading CSV and Excel files. These two packages are not installed by default.
Reading Excel files
library(readxl) # Load the readxl package
mydata <- read_excel("/path/to/myfile.xls")
mydata <- read_excel("/path/to/myfile.xlsx", sheet = 3)
Reading CSV files
Use read_csv() for standard comma-separated files, and read_csv2() if you are using CSV files produced with EU settings (; instead of , as separator).
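A minimal sketch, assuming a hypothetical file path:
library(readr) # Load the readr package
mydata <- read_csv("/path/to/myfile.csv")  # Comma-separated values
mydata <- read_csv2("/path/to/myfile.csv") # Semicolon-separated values (EU settings)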
Many file formats are supported by R. Here is a list of the functions to use for the most common formats.
Data format | Function | Package |
---|---|---|
Comma separated values (.csv) | read_csv() | readr |
Excel (.xls, .xlsx) | read_excel() | readxl |
Other text files (.txt, .dat, …) | read_delim() | readr |
Stata | read_stata() | haven |
SPSS | read_spss() | haven |
SAS | read_sas() | haven |
Matlab file (.mat) | readMat() | R.matlab |
Open your data in R and start exploring them.
How have missing values (NA) been handled by R?
It is often said that 80% of data analysis is spent on cleaning and preparing the data. And it is not just a first step: it must be repeated many times over the course of an analysis, as new problems come to light or new data is collected.
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
The tidyverse provides a set of packages offering flexible tools for data manipulation.
The tidyverse is not included in the base installation of R.
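A minimal sketch of installing and loading it:
install.packages("tidyverse") # Only needed once
library(tidyverse)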
You can refer to the printed cheat sheet for an overview of the package’s functions.
All dplyr functions use the same scheme: the first argument to provide is a data frame.
Function | Definition |
---|---|
filter(df, …) | Select a subset of the rows of a data frame. |
arrange(df, …) | Reorder rows of a data frame. |
select(df, …) | Select columns of a data frame. |
mutate(df, …) | Add new columns to a data frame. |
summarise(df, …) | Summarize a data frame into a single row. This function is very helpful in combination with the group_by() function. |
group_by(df, …) | Group a data frame based on the specified columns. |
To learn how to use these functions, we will use datasets from the nycflights13 package, which we first need to install.
This package contains information about all flights that departed from New York (i.e., EWR, JFK and LGA) in 2013.
For the following examples we are going to use the flights, planes, airports and airlines tables.
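A minimal sketch of installing, loading and inspecting the package (the str() outputs below were produced by calls of this kind):
install.packages("nycflights13") # Only needed once
library(nycflights13)
str(flights) # Same idea for planes, airports and airlines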
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
## Classes 'tbl_df', 'tbl' and 'data.frame': 3322 obs. of 9 variables:
## $ tailnum : chr "N10156" "N102UW" "N103US" "N104UW" ...
## $ year : int 2004 1998 1999 1999 2002 1999 1999 1999 1999 1999 ...
## $ type : chr "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" ...
## $ manufacturer: chr "EMBRAER" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" ...
## $ model : chr "EMB-145XR" "A320-214" "A320-214" "A320-214" ...
## $ engines : int 2 2 2 2 2 2 2 2 2 2 ...
## $ seats : int 55 182 182 182 55 182 182 182 182 182 ...
## $ speed : int NA NA NA NA NA NA NA NA NA NA ...
## $ engine : chr "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" ...
## Classes 'tbl_df', 'tbl' and 'data.frame': 1458 obs. of 8 variables:
## $ faa : chr "04G" "06A" "06C" "06N" ...
## $ name : chr "Lansdowne Airport" "Moton Field Municipal Airport" "Schaumburg Regional" "Randall Airport" ...
## $ lat : num 41.1 32.5 42 41.4 31.1 ...
## $ lon : num -80.6 -85.7 -88.1 -74.4 -81.4 ...
## $ alt : int 1044 264 801 523 11 1593 730 492 1000 108 ...
## $ tz : num -5 -6 -6 -5 -5 -5 -5 -5 -5 -8 ...
## $ dst : chr "A" "A" "A" "A" ...
## $ tzone: chr "America/New_York" "America/Chicago" "America/Chicago" "America/New_York" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_integer(),
## .. name = col_character(),
## .. city = col_character(),
## .. country = col_character(),
## .. faa = col_character(),
## .. icao = col_character(),
## .. lat = col_double(),
## .. lon = col_double(),
## .. alt = col_integer(),
## .. tz = col_double(),
## .. dst = col_character(),
## .. tzone = col_character()
## .. )
## Classes 'tbl_df', 'tbl' and 'data.frame': 16 obs. of 2 variables:
## $ carrier: chr "9E" "AA" "AS" "B6" ...
## $ name : chr "Endeavor Air Inc." "American Airlines Inc." "Alaska Airlines Inc." "JetBlue Airways" ...
Most of the time you will want to work on a subset of the observations in your data.
filter() allows you to select a subset of the rows of a data frame.
The first argument is the name of the data frame, and the second and subsequent arguments are filtering expressions evaluated in the context of that data frame.
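A hedged sketch of the calling pattern (the conditions are illustrative, not necessarily those behind the outputs below):
filter(flights, tailnum == "N14228")                # Flights of one specific aircraft
filter(flights, dep_delay > 180)                    # Flights delayed by more than three hours
filter(flights, carrier == "EV" | carrier == "MQ")  # Flights operated by EV or MQ
filter(flights, carrier %in% c("EV", "MQ"))         # Same result using %in%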
## # A tibble: 111 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 2013 1 8 1435 1440 -5 1717 1746 -29 UA 1579 N14228 EWR MIA 150 1085 14
## 3 2013 1 9 717 700 17 812 815 -3 UA 1142 N14228 EWR BOS 39 200 7
## 4 2013 1 9 1143 1144 -1 1425 1445 -20 UA 1707 N14228 EWR TPA 149 997 11
## 5 2013 1 13 835 824 11 1030 951 39 UA 1572 N14228 EWR BOS 43 200 8
## 6 2013 1 16 1829 1730 59 2117 2023 54 UA 1637 N14228 EWR TPA 144 997 17
## 7 2013 1 22 1902 1808 54 2214 2106 68 UA 1269 N14228 EWR PBI 163 1023 18
## 8 2013 1 23 1050 1056 -6 1143 1208 -25 UA 1047 N14228 EWR BOS 39 200 10
## 9 2013 1 23 1533 1529 4 1641 1645 -4 UA 1116 N14228 EWR BOS 47 200 15
## 10 2013 1 25 724 720 4 1000 1023 -23 UA 1724 N14228 EWR PBI 137 1023 7
## # … with 101 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>
## # A tibble: 3,945 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 848 1835 853 1001 1950 851 MQ 3944 N942MQ JFK BWI 41 184 18
## 2 2013 1 1 1815 1325 290 2120 1542 338 EV 4417 N17185 EWR OMA 213 1134 13
## 3 2013 1 1 1842 1422 260 1958 1535 263 EV 4633 N18120 EWR BTV 46 266 14
## 4 2013 1 1 2006 1630 216 2230 1848 222 EV 4644 N14972 EWR SAV 121 708 16
## 5 2013 1 1 2115 1700 255 2330 1920 250 9E 3347 N924XJ JFK CVG 115 589 17
## 6 2013 1 1 2205 1720 285 46 2040 246 AA 1999 N5DNAA EWR MIA 146 1085 17
## 7 2013 1 1 2312 2000 192 21 2110 191 EV 4312 N13958 EWR DCA 44 199 20
## 8 2013 1 1 2343 1724 379 314 1938 456 EV 4321 N21197 EWR MCI 222 1092 17
## 9 2013 1 2 1244 900 224 1431 1104 207 EV 4412 N13958 EWR MYR 94 550 9
## 10 2013 1 2 1332 904 268 1616 1128 288 EV 4364 N41104 EWR MCI 203 1092 9
## # … with 3,935 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>
## # A tibble: 80,570 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6
## 2 2013 1 1 600 600 0 837 825 12 MQ 4650 N542MQ LGA ATL 134 762 6
## 3 2013 1 1 602 605 -3 821 805 16 MQ 4401 N730MQ LGA DTW 105 502 6
## 4 2013 1 1 608 600 8 807 735 32 MQ 3768 N9EAMQ EWR ORD 139 719 6
## 5 2013 1 1 624 630 -6 909 840 29 EV 4626 N11107 EWR MSP 190 1008 6
## 6 2013 1 1 624 630 -6 840 830 10 MQ 4599 N518MQ LGA MSP 166 1020 6
## 7 2013 1 1 632 608 24 740 728 12 EV 4144 N13553 EWR IAD 52 212 6
## 8 2013 1 1 656 705 -9 1007 940 27 MQ 4534 N722MQ LGA XNA 233 1147 7
## 9 2013 1 1 749 710 39 939 850 49 MQ 3737 N508MQ EWR ORD 148 719 7
## 10 2013 1 1 800 810 -10 949 955 -6 MQ 4406 N828MQ JFK RDU 80 427 8
## # … with 80,560 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>
## # A tibble: 80,570 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6
## 2 2013 1 1 600 600 0 837 825 12 MQ 4650 N542MQ LGA ATL 134 762 6
## 3 2013 1 1 602 605 -3 821 805 16 MQ 4401 N730MQ LGA DTW 105 502 6
## 4 2013 1 1 608 600 8 807 735 32 MQ 3768 N9EAMQ EWR ORD 139 719 6
## 5 2013 1 1 624 630 -6 909 840 29 EV 4626 N11107 EWR MSP 190 1008 6
## 6 2013 1 1 624 630 -6 840 830 10 MQ 4599 N518MQ LGA MSP 166 1020 6
## 7 2013 1 1 632 608 24 740 728 12 EV 4144 N13553 EWR IAD 52 212 6
## 8 2013 1 1 656 705 -9 1007 940 27 MQ 4534 N722MQ LGA XNA 233 1147 7
## 9 2013 1 1 749 710 39 939 850 49 MQ 3737 N508MQ EWR ORD 148 719 7
## 10 2013 1 1 800 810 -10 949 955 -6 MQ 4406 N828MQ JFK RDU 80 427 8
## # … with 80,560 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>
arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them.
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 12 7 2040 2123 -43 40 2352 48 B6 97 N592JB JFK DEN 265 1626 21
## 2 2013 2 3 2022 2055 -33 2240 2338 -58 DL 1715 N612DL LGA MSY 162 1183 20
## 3 2013 11 10 1408 1440 -32 1549 1559 -10 EV 5713 N825AS LGA IAD 52 229 14
## 4 2013 1 11 1900 1930 -30 2233 2243 -10 DL 1435 N934DL LGA TPA 139 1010 19
## 5 2013 1 29 1703 1730 -27 1947 1957 -10 F9 837 N208FR LGA DEN 250 1620 17
## 6 2013 8 9 729 755 -26 1002 955 7 MQ 3478 N711MQ LGA DTW 88 502 7
## 7 2013 10 23 1907 1932 -25 2143 2143 0 EV 4361 N13994 EWR TYS 111 631 19
## 8 2013 3 30 2030 2055 -25 2213 2250 -37 MQ 4573 N725MQ LGA DTW 87 502 20
## 9 2013 3 2 1431 1455 -24 1601 1631 -30 9E 3318 N929XJ JFK BUF 55 301 14
## 10 2013 5 5 934 958 -24 1225 1309 -44 B6 375 N531JB LGA FLL 150 1076 9
## # … with 336,766 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 12 7 2040 2123 -43 40 2352 48 B6 97 N592JB JFK DEN 265 1626 21
## 2 2013 2 3 2022 2055 -33 2240 2338 -58 DL 1715 N612DL LGA MSY 162 1183 20
## 3 2013 11 10 1408 1440 -32 1549 1559 -10 EV 5713 N825AS LGA IAD 52 229 14
## 4 2013 1 11 1900 1930 -30 2233 2243 -10 DL 1435 N934DL LGA TPA 139 1010 19
## 5 2013 1 29 1703 1730 -27 1947 1957 -10 F9 837 N208FR LGA DEN 250 1620 17
## 6 2013 8 9 729 755 -26 1002 955 7 MQ 3478 N711MQ LGA DTW 88 502 7
## 7 2013 3 30 2030 2055 -25 2213 2250 -37 MQ 4573 N725MQ LGA DTW 87 502 20
## 8 2013 10 23 1907 1932 -25 2143 2143 0 EV 4361 N13994 EWR TYS 111 631 19
## 9 2013 5 5 934 958 -24 1225 1309 -44 B6 375 N531JB LGA FLL 150 1076 9
## 10 2013 9 18 1631 1655 -24 1812 1845 -33 AA 2223 N4XXAA LGA STL 125 888 16
## # … with 336,766 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 12 7 2040 2123 -43 40 2352 48 B6 97 N592JB JFK DEN 265 1626 21
## 2 2013 2 3 2022 2055 -33 2240 2338 -58 DL 1715 N612DL LGA MSY 162 1183 20
## 3 2013 11 10 1408 1440 -32 1549 1559 -10 EV 5713 N825AS LGA IAD 52 229 14
## 4 2013 1 11 1900 1930 -30 2233 2243 -10 DL 1435 N934DL LGA TPA 139 1010 19
## 5 2013 1 29 1703 1730 -27 1947 1957 -10 F9 837 N208FR LGA DEN 250 1620 17
## 6 2013 8 9 729 755 -26 1002 955 7 MQ 3478 N711MQ LGA DTW 88 502 7
## 7 2013 10 23 1907 1932 -25 2143 2143 0 EV 4361 N13994 EWR TYS 111 631 19
## 8 2013 3 30 2030 2055 -25 2213 2250 -37 MQ 4573 N725MQ LGA DTW 87 502 20
## 9 2013 5 14 914 938 -24 1143 1204 -21 9E 4065 N8747B LGA SDF 107 659 9
## 10 2013 3 2 1431 1455 -24 1601 1631 -30 9E 3318 N929XJ JFK BUF 55 301 14
## # … with 336,766 more rows, and 2 more variables: minute <dbl>, time_hour <dttm>
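A hedged sketch of typical arrange() calls behind results like those above (desc() gives a descending order):
arrange(flights, dep_delay)             # Earliest departures (most negative delays) first
arrange(flights, dep_delay, arr_delay)  # Break ties with a second column
arrange(flights, desc(dep_delay))       # Largest delays first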
select() is used to extract columns from a data frame.
## # A tibble: 1,458 x 2
## faa name
## <chr> <chr>
## 1 04G Lansdowne Airport
## 2 06A Moton Field Municipal Airport
## 3 06C Schaumburg Regional
## 4 06N Randall Airport
## 5 09J Jekyll Island Airport
## 6 0A9 Elizabethton Municipal Airport
## 7 0G6 Williams County Airport
## 8 0G7 Finger Lakes Regional Airport
## 9 0P2 Shoestring Aviation Airfield
## 10 0S9 Jefferson County Intl
## # … with 1,448 more rows
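A minimal sketch of a call producing a result like the one above:
select(airports, faa, name) # Keep only the faa and name columns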
You can also use select() to remove columns by using - in front of the columns you want to remove.
## # A tibble: 1,458 x 6
## lat lon alt tz dst tzone
## <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 41.1 -80.6 1044 -5 A America/New_York
## 2 32.5 -85.7 264 -6 A America/Chicago
## 3 42.0 -88.1 801 -6 A America/Chicago
## 4 41.4 -74.4 523 -5 A America/New_York
## 5 31.1 -81.4 11 -5 A America/New_York
## 6 36.4 -82.2 1593 -5 A America/New_York
## 7 41.5 -84.5 730 -5 A America/New_York
## 8 42.9 -76.8 492 -5 A America/New_York
## 9 39.8 -76.6 1000 -5 U America/New_York
## 10 48.1 -123. 108 -8 A America/Los_Angeles
## # … with 1,448 more rows
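A minimal sketch for a result like the one above:
select(airports, -faa, -name) # Keep everything except faa and name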
The mutate() and transmute() functions are used to add columns to a data frame.
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5
## 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5
## 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6
## 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5
## 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6
## 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6
## 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6
## 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6
## # … with 336,766 more rows, and 3 more variables: minute <dbl>, time_hour <dttm>, distance_km <dbl>
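A hedged sketch of how such a column could be added, assuming distance is expressed in miles (as documented for nycflights13):
mutate(flights, distance_km = distance * 1.60934) # Add the distance in kilometres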
transmute() drops all other columns.
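For example (same conversion as above, but keeping only the new column):
transmute(flights, distance_km = distance * 1.60934)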
## # A tibble: 336,776 x 1
## distance_km
## <dbl>
## 1 2253.
## 2 2279.
## 3 1753.
## 4 2536.
## 5 1226.
## 6 1157.
## 7 1714.
## 8 369.
## 9 1519.
## 10 1180.
## # … with 336,766 more rows
It is also possible to add many new variables in a single operation.
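A hedged sketch creating two columns at once (the second expression reuses the first new column; the unit conversions are assumptions):
transmute(flights,
          distance_km   = distance * 1.60934,   # Miles to kilometres
          distance_knot = distance_km / 1.852)  # Kilometres to nautical miles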
## # A tibble: 336,776 x 2
## distance_km distance_knot
## <dbl> <dbl>
## 1 2253. 1217.
## 2 2279. 1230.
## 3 1753. 946.
## 4 2536. 1370.
## 5 1226. 662.
## 6 1157. 625.
## 7 1714. 925.
## 8 369. 199.
## 9 1519. 820.
## 10 1180. 637.
## # … with 336,766 more rows
The summarise() function allows you to summarise columns of a data frame. You have to provide a function that will be used to calculate the summary value.
summarise(flights,
average_departure_delay = mean(dep_delay),
average_arrival_delay = mean(arr_delay))
## # A tibble: 1 x 2
## average_departure_delay average_arrival_delay
## <dbl> <dbl>
## 1 NA NA
What is going on? mean() returns NA as soon as its input contains missing values, so we have to remove them with na.rm = TRUE.
summarise(flights,
average_departure_delay = mean(dep_delay, na.rm = TRUE),
average_arrival_delay = mean(arr_delay, na.rm = TRUE))
## # A tibble: 1 x 2
## average_departure_delay average_arrival_delay
## <dbl> <dbl>
## 1 12.6 6.90
On its own, the summarise() function is not really helpful because it collapses a data frame to a single row. group_by() “prepares” a data frame to send it to the summarise() function.
## # A tibble: 16 x 2
## carrier average_departure_delay
## <chr> <dbl>
## 1 9E 16.7
## 2 AA 8.59
## 3 AS 5.80
## 4 B6 13.0
## 5 DL 9.26
## 6 EV 20.0
## 7 F9 20.2
## 8 FL 18.7
## 9 HA 4.90
## 10 MQ 10.6
## 11 OO 12.6
## 12 UA 12.1
## 13 US 3.78
## 14 VX 12.9
## 15 WN 17.7
## 16 YV 19.0
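A hedged sketch of the group_by() + summarise() combination that leads to a result like the one above:
summarise(group_by(flights, carrier),
          average_departure_delay = mean(dep_delay, na.rm = TRUE))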
It might be interesting to know how many observations were used in each group to compute the average. To do so, we can use the n() function.
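For example (a sketch):
summarise(group_by(flights, carrier),
          average_departure_delay = mean(dep_delay, na.rm = TRUE),
          number_of_observation   = n())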
## # A tibble: 16 x 3
## carrier average_departure_delay number_of_observation
## <chr> <dbl> <int>
## 1 9E 16.7 18460
## 2 AA 8.59 32729
## 3 AS 5.80 714
## 4 B6 13.0 54635
## 5 DL 9.26 48110
## 6 EV 20.0 54173
## 7 F9 20.2 685
## 8 FL 18.7 3260
## 9 HA 4.90 342
## 10 MQ 10.6 26397
## 11 OO 12.6 32
## 12 UA 12.1 58665
## 13 US 3.78 20536
## 14 VX 12.9 5162
## 15 WN 17.7 12275
## 16 YV 19.0 601
It is possible to summarise more than one column at once.
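A hedged sketch computing several summaries per carrier:
summarise(group_by(flights, carrier),
          number_of_flight        = n(),
          average_departure_delay = mean(dep_delay, na.rm = TRUE),
          min_delay               = min(dep_delay, na.rm = TRUE),
          max_delay               = max(dep_delay, na.rm = TRUE))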
## # A tibble: 16 x 5
## carrier number_of_flight average_departure_delay min_delay max_delay
## <chr> <int> <dbl> <dbl> <dbl>
## 1 9E 18460 16.7 -24 747
## 2 AA 32729 8.59 -24 1014
## 3 AS 714 5.80 -21 225
## 4 B6 54635 13.0 -43 502
## 5 DL 48110 9.26 -33 960
## 6 EV 54173 20.0 -32 548
## 7 F9 685 20.2 -27 853
## 8 FL 3260 18.7 -22 602
## 9 HA 342 4.90 -16 1301
## 10 MQ 26397 10.6 -26 1137
## 11 OO 32 12.6 -14 154
## 12 UA 58665 12.1 -20 483
## 13 US 20536 3.78 -19 500
## 14 VX 5162 12.9 -20 653
## 15 WN 12275 17.7 -13 471
## 16 YV 601 19.0 -16 387
For this exercise, we will look at statistics of R packages downloaded on 2015-11-21. You can load the data as follows:
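A hedged sketch, assuming the CRAN download log has been saved locally (the URL pattern and file path below are hypothetical):
# http://cran-logs.rstudio.com/2015/2015-11-21.csv.gz
downloads <- read_csv("data/2015-11-21.csv.gz")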
Which package was downloaded the most that day?
From which countries did people download the most packages?
How many packages were downloaded that day by people in Denmark?
Data are rarely stored in a single data frame.
Good database practice recommends keeping related data in separate tables.
Joins are used to “merge” data frames together.
There are many ways of joining tables (data frames) in R. We will, however, focus on the merging tools offered by the dplyr package.
To understand how joins work, we will create two small data frames from data in nycflights13.
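The two small tables below could have been built along these lines (a sketch; the original code is not shown):
flights2  <- select(flights, year, month, day, hour, origin)
airports2 <- select(airports, faa, name)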
## # A tibble: 336,776 x 5
## year month day hour origin
## <int> <int> <int> <dbl> <chr>
## 1 2013 1 1 5 EWR
## 2 2013 1 1 5 LGA
## 3 2013 1 1 5 JFK
## 4 2013 1 1 5 JFK
## 5 2013 1 1 6 LGA
## 6 2013 1 1 5 EWR
## 7 2013 1 1 6 EWR
## 8 2013 1 1 6 LGA
## 9 2013 1 1 6 JFK
## 10 2013 1 1 6 LGA
## # … with 336,766 more rows
## # A tibble: 1,458 x 2
## faa name
## <chr> <chr>
## 1 04G Lansdowne Airport
## 2 06A Moton Field Municipal Airport
## 3 06C Schaumburg Regional
## 4 06N Randall Airport
## 5 09J Jekyll Island Airport
## 6 0A9 Elizabethton Municipal Airport
## 7 0G6 Williams County Airport
## 8 0G7 Finger Lakes Regional Airport
## 9 0P2 Shoestring Aviation Airfield
## 10 0S9 Jefferson County Intl
## # … with 1,448 more rows
With a left join, all observations from table A are kept and only the observations from table B that match will be merged.
## Error: `by` required, because the data sources have no common variables
The error above occurs because flights2 and airports2 have no columns in common; we therefore have to tell dplyr which columns to join on using the by argument.
# By default, dplyr will merge using all common columns.
left_join(flights2, airports2, by = c("origin" = "faa"))
## # A tibble: 336,776 x 6
## year month day hour origin name
## <int> <int> <int> <dbl> <chr> <chr>
## 1 2013 1 1 5 EWR Newark Liberty Intl
## 2 2013 1 1 5 LGA La Guardia
## 3 2013 1 1 5 JFK John F Kennedy Intl
## 4 2013 1 1 5 JFK John F Kennedy Intl
## 5 2013 1 1 6 LGA La Guardia
## 6 2013 1 1 5 EWR Newark Liberty Intl
## 7 2013 1 1 6 EWR Newark Liberty Intl
## 8 2013 1 1 6 LGA La Guardia
## 9 2013 1 1 6 JFK John F Kennedy Intl
## 10 2013 1 1 6 LGA La Guardia
## # … with 336,766 more rows
Attention: the order of the tables is important!
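A hedged sketch of the swapped call (presumably what produced the output below):
left_join(airports2, flights2, by = c("faa" = "origin"))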
## # A tibble: 338,231 x 6
## faa name year month day hour
## <chr> <chr> <int> <int> <int> <dbl>
## 1 04G Lansdowne Airport NA NA NA NA
## 2 06A Moton Field Municipal Airport NA NA NA NA
## 3 06C Schaumburg Regional NA NA NA NA
## 4 06N Randall Airport NA NA NA NA
## 5 09J Jekyll Island Airport NA NA NA NA
## 6 0A9 Elizabethton Municipal Airport NA NA NA NA
## 7 0G6 Williams County Airport NA NA NA NA
## 8 0G7 Finger Lakes Regional Airport NA NA NA NA
## 9 0P2 Shoestring Aviation Airfield NA NA NA NA
## 10 0S9 Jefferson County Intl NA NA NA NA
## # … with 338,221 more rows
With a right join, all observations from table B are kept and only the observations from table A that match will be merged.
# This is the same as the previous "inverted" left join
right_join(flights2, airports2, by = c("origin" = "faa"))
## # A tibble: 338,231 x 6
## year month day hour origin name
## <int> <int> <int> <dbl> <chr> <chr>
## 1 NA NA NA NA 04G Lansdowne Airport
## 2 NA NA NA NA 06A Moton Field Municipal Airport
## 3 NA NA NA NA 06C Schaumburg Regional
## 4 NA NA NA NA 06N Randall Airport
## 5 NA NA NA NA 09J Jekyll Island Airport
## 6 NA NA NA NA 0A9 Elizabethton Municipal Airport
## 7 NA NA NA NA 0G6 Williams County Airport
## 8 NA NA NA NA 0G7 Finger Lakes Regional Airport
## 9 NA NA NA NA 0P2 Shoestring Aviation Airfield
## 10 NA NA NA NA 0S9 Jefferson County Intl
## # … with 338,221 more rows
With an inner join, only observations that match in both tables are kept.
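A hedged sketch of the corresponding call:
inner_join(flights2, airports2, by = c("origin" = "faa"))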
## # A tibble: 336,776 x 6
## year month day hour origin name
## <int> <int> <int> <dbl> <chr> <chr>
## 1 2013 1 1 5 EWR Newark Liberty Intl
## 2 2013 1 1 5 LGA La Guardia
## 3 2013 1 1 5 JFK John F Kennedy Intl
## 4 2013 1 1 5 JFK John F Kennedy Intl
## 5 2013 1 1 6 LGA La Guardia
## 6 2013 1 1 5 EWR Newark Liberty Intl
## 7 2013 1 1 6 EWR Newark Liberty Intl
## 8 2013 1 1 6 LGA La Guardia
## 9 2013 1 1 6 JFK John F Kennedy Intl
## 10 2013 1 1 6 LGA La Guardia
## # … with 336,766 more rows
With a full join, all observations from both tables are kept.
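A hedged sketch of the corresponding call:
full_join(flights2, airports2, by = c("origin" = "faa"))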
## # A tibble: 338,231 x 6
## year month day hour origin name
## <int> <int> <int> <dbl> <chr> <chr>
## 1 2013 1 1 5 EWR Newark Liberty Intl
## 2 2013 1 1 5 LGA La Guardia
## 3 2013 1 1 5 JFK John F Kennedy Intl
## 4 2013 1 1 5 JFK John F Kennedy Intl
## 5 2013 1 1 6 LGA La Guardia
## 6 2013 1 1 5 EWR Newark Liberty Intl
## 7 2013 1 1 6 EWR Newark Liberty Intl
## 8 2013 1 1 6 LGA La Guardia
## 9 2013 1 1 6 JFK John F Kennedy Intl
## 10 2013 1 1 6 LGA La Guardia
## # … with 338,221 more rows
With an anti join, only observations from table A that do not match table B are kept.
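A hedged sketch keeping the airports that never appear as an origin in flights2:
anti_join(airports2, flights2, by = c("faa" = "origin"))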
## # A tibble: 1,455 x 2
## faa name
## <chr> <chr>
## 1 04G Lansdowne Airport
## 2 06A Moton Field Municipal Airport
## 3 06C Schaumburg Regional
## 4 06N Randall Airport
## 5 09J Jekyll Island Airport
## 6 0A9 Elizabethton Municipal Airport
## 7 0G6 Williams County Airport
## 8 0G7 Finger Lakes Regional Airport
## 9 0P2 Shoestring Aviation Airfield
## 10 0S9 Jefferson County Intl
## # … with 1,445 more rows
First, load the statistics of R packages downloaded on 2018-07-15 and the countries' ISO codes as follows:
# http://cran-logs.rstudio.com/2018/2018-07-15.csv.gz
downloads <- read_csv("data/2018-07-15.csv.gz")
countries <- read_csv("data/countries_ISO.csv")
From which countries were no packages downloaded on this particular date?
In a tidy dataset, each row corresponds to an observation whereas each column is a variable.
Tidying a dataset consists of cleaning and structuring it to facilitate analysis.
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
There are five common problems that define a messy dataset:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
Fortunately, most of these problems can be easily addressed with little effort using the tidyr package (included in the tidyverse).
You can refer to the printed cheat sheet for an overview of the package’s functions.
The following dataset explores the relationship between income and religion in the US.
Take a minute to look at the following table to find the main problem.
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don’t know/refused | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Agnostic | 27.00 | 34.00 | 60.00 | 81.00 | 76.00 | 137.00 | 122.00 | 109.00 | 84.00 | 96.00 |
2 | Atheist | 12.00 | 27.00 | 37.00 | 52.00 | 35.00 | 70.00 | 73.00 | 59.00 | 74.00 | 76.00 |
3 | Buddhist | 27.00 | 21.00 | 30.00 | 34.00 | 33.00 | 58.00 | 62.00 | 39.00 | 53.00 | 54.00 |
4 | Catholic | 418.00 | 617.00 | 732.00 | 670.00 | 638.00 | 1116.00 | 949.00 | 792.00 | 633.00 | 1489.00 |
5 | Don’t know/refused | 15.00 | 14.00 | 15.00 | 11.00 | 10.00 | 35.00 | 21.00 | 17.00 | 18.00 | 116.00 |
6 | Evangelical Prot | 575.00 | 869.00 | 1064.00 | 982.00 | 881.00 | 1486.00 | 949.00 | 723.00 | 414.00 | 1529.00 |
What we need to do here is to gather the information that is spread across the column names.
This is done using the gather(df, key, value, ...) function, where df is a data frame, key is the name of the new column that will contain the old column names (here, the income categories) and value is the name of the new column that will contain the actual values.
We can also specify variables that we do not need to tidy by using the -var operator. In this case, -religion.
# Tidy the data frame (2 new columns and keep "religion" as is).
fixed <- gather(messy, income, frequency, -religion)
religion | income | frequency | |
---|---|---|---|
1 | Agnostic | <$10k | 27.00 |
2 | Atheist | <$10k | 12.00 |
3 | Buddhist | <$10k | 27.00 |
4 | Catholic | <$10k | 418.00 |
5 | Don’t know/refused | <$10k | 15.00 |
6 | Evangelical Prot | <$10k | 575.00 |
7 | Hindu | <$10k | 1.00 |
8 | Historically Black Prot | <$10k | 228.00 |
9 | Jehovah’s Witness | <$10k | 20.00 |
10 | Jewish | <$10k | 19.00 |
This new dataset is tidy: each column represents a variable and each row represents an observation. We will later see how tidy data can be very useful for analysis and plotting (especially with ggplot2).
The following messy dataset comes from the World Health Organisation, and records the counts of confirmed tuberculosis cases between 1980 and 2008.
The demographic groups are broken down by sex (m, f) and age (0-14, 15-24, 25-34, 35-44, 45-54, 55-64, unknown).
Take a minute to look at the following table to find the main problems.
iso2 | year | new_sp_m04 | new_sp_m514 | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | |
---|---|---|---|---|---|---|---|---|
1 | AD | 1989.00 | NA | NA | NA | NA | NA | NA |
2 | AD | 1990.00 | NA | NA | NA | NA | NA | NA |
3 | AD | 1991.00 | NA | NA | NA | NA | NA | NA |
4 | AD | 1992.00 | NA | NA | NA | NA | NA | NA |
5 | AD | 1993.00 | NA | NA | NA | NA | NA | NA |
6 | AD | 1994.00 | NA | NA | NA | NA | NA | NA |
7 | AD | 1996.00 | NA | NA | 0.00 | 0.00 | 0.00 | 4.00 |
8 | AD | 1997.00 | NA | NA | 0.00 | 0.00 | 1.00 | 2.00 |
9 | AD | 1998.00 | NA | NA | 0.00 | 0.00 | 0.00 | 1.00 |
10 | AD | 1999.00 | NA | NA | 0.00 | 0.00 | 0.00 | 1.00 |
11 | AD | 2000.00 | NA | NA | 0.00 | 0.00 | 1.00 | 0.00 |
12 | AD | 2001.00 | NA | NA | 0.00 | NA | NA | 2.00 |
There are two main problems with this dataset:
- The column names (new_sp_m04, new_sp_m1524, etc.) are values, not variable names: they encode demographic groups.
- Each of these names mixes two variables (sex and age) in a single field.
As for the previous dataset, we use the gather() function.
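A hedged sketch, assuming the messy table is stored in a data frame named tb (a hypothetical name):
tb_long <- gather(tb, demographic, frequency, -iso2, -year)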
iso2 | year | demographic | frequency | |
---|---|---|---|---|
1 | AD | 1989.00 | new_sp_m04 | NA |
2 | AD | 1990.00 | new_sp_m04 | NA |
3 | AD | 1991.00 | new_sp_m04 | NA |
4 | AD | 1992.00 | new_sp_m04 | NA |
5 | AD | 1993.00 | new_sp_m04 | NA |
6 | AD | 1994.00 | new_sp_m04 | NA |
At this point, the dataset already looks better.
The column demographic contains two types of information (sex and age).
It seems to always be the same pattern: the sex starts at the 7th position and the age at the 8th position.
To “split” this column into three different columns, we will use the separate() function.
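A hedged sketch, splitting after character positions 7 and 8 (tb_long is the hypothetical name introduced above):
tb_separated <- separate(tb_long, demographic,
                         into = c("junk", "sex", "age_group"), sep = c(7, 8))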
iso2 | year | junk | sex | age_group | frequency | |
---|---|---|---|---|---|---|
1 | AD | 1989.00 | new_sp_ | m | 04 | NA |
2 | AD | 1990.00 | new_sp_ | m | 04 | NA |
3 | AD | 1991.00 | new_sp_ | m | 04 | NA |
4 | AD | 1992.00 | new_sp_ | m | 04 | NA |
5 | AD | 1993.00 | new_sp_ | m | 04 | NA |
6 | AD | 1994.00 | new_sp_ | m | 04 | NA |
7 | AD | 1996.00 | new_sp_ | m | 04 | NA |
8 | AD | 1997.00 | new_sp_ | m | 04 | NA |
9 | AD | 1998.00 | new_sp_ | m | 04 | NA |
10 | AD | 1999.00 | new_sp_ | m | 04 | NA |
11 | AD | 2000.00 | new_sp_ | m | 04 | NA |
12 | AD | 2001.00 | new_sp_ | m | 04 | NA |
13 | AD | 2002.00 | new_sp_ | m | 04 | NA |
14 | AD | 2003.00 | new_sp_ | m | 04 | NA |
15 | AD | 2004.00 | new_sp_ | m | 04 | NA |
The following messy dataset is the daily weather data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010.
Take a minute to look at the following table to find the main problems.
id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | MX17004 | 2010.00 | 1.00 | tmax | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
2 | MX17004 | 2010.00 | 1.00 | tmin | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
3 | MX17004 | 2010.00 | 2.00 | tmax | NA | 27.30 | 24.10 | NA | NA | NA | NA | NA | NA | NA |
4 | MX17004 | 2010.00 | 2.00 | tmin | NA | 14.40 | 14.40 | NA | NA | NA | NA | NA | NA | NA |
5 | MX17004 | 2010.00 | 3.00 | tmax | NA | NA | NA | NA | 32.10 | NA | NA | NA | NA | 34.50 |
6 | MX17004 | 2010.00 | 3.00 | tmin | NA | NA | NA | NA | 14.20 | NA | NA | NA | NA | 16.80 |
It is easy to understand that we have values spread across the columns. For instance, d1 means day 1. We should instead have a column named day with values from 1 to 31.
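A hedged sketch, assuming the messy table is stored in a data frame named weather (a hypothetical name):
weather_long <- gather(weather, day, temperature, -id, -year, -month, -element)
weather_long <- na.omit(weather_long) # Drop day/element combinations with no measurement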
It feels better already.
id | year | month | element | day | temperature | |
---|---|---|---|---|---|---|
1 | MX17004 | 2010.00 | 12.00 | tmax | d1 | 29.90 |
2 | MX17004 | 2010.00 | 12.00 | tmin | d1 | 13.80 |
3 | MX17004 | 2010.00 | 2.00 | tmax | d2 | 27.30 |
4 | MX17004 | 2010.00 | 2.00 | tmin | d2 | 14.40 |
5 | MX17004 | 2010.00 | 11.00 | tmax | d2 | 31.30 |
6 | MX17004 | 2010.00 | 11.00 | tmin | d2 | 16.30 |
7 | MX17004 | 2010.00 | 2.00 | tmax | d3 | 24.10 |
8 | MX17004 | 2010.00 | 2.00 | tmin | d3 | 14.40 |
9 | MX17004 | 2010.00 | 7.00 | tmax | d3 | 28.60 |
10 | MX17004 | 2010.00 | 7.00 | tmin | d3 | 17.50 |
11 | MX17004 | 2010.00 | 11.00 | tmax | d4 | 27.20 |
12 | MX17004 | 2010.00 | 11.00 | tmin | d4 | 12.00 |
day is not numeric.
## Classes 'tbl_df', 'tbl' and 'data.frame': 66 obs. of 6 variables:
## $ id : chr "MX17004" "MX17004" "MX17004" "MX17004" ...
## $ year : num 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ month : num 12 12 2 2 11 11 2 2 7 7 ...
## $ element : chr "tmax" "tmin" "tmax" "tmin" ...
## $ day : chr "d1" "d1" "d2" "d2" ...
## $ temperature: num 29.9 13.8 27.3 14.4 31.3 16.3 24.1 14.4 28.6 17.5 ...
## - attr(*, "na.action")= 'omit' Named int 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr "1" "2" "3" "4" ...
We can easily extract the numerical values from a character column using the parse_number() function.
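For example (a sketch on the hypothetical weather_long table):
weather_long <- mutate(weather_long, day = parse_number(day)) # parse_number() is from readr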
id | year | month | element | day | temperature | |
---|---|---|---|---|---|---|
1 | MX17004 | 2010.00 | 12.00 | tmax | 1.00 | 29.90 |
2 | MX17004 | 2010.00 | 12.00 | tmin | 1.00 | 13.80 |
3 | MX17004 | 2010.00 | 2.00 | tmax | 2.00 | 27.30 |
4 | MX17004 | 2010.00 | 2.00 | tmin | 2.00 | 14.40 |
5 | MX17004 | 2010.00 | 11.00 | tmax | 2.00 | 31.30 |
6 | MX17004 | 2010.00 | 11.00 | tmin | 2.00 | 16.30 |
7 | MX17004 | 2010.00 | 2.00 | tmax | 3.00 | 24.10 |
8 | MX17004 | 2010.00 | 2.00 | tmin | 3.00 | 14.40 |
9 | MX17004 | 2010.00 | 7.00 | tmax | 3.00 | 28.60 |
10 | MX17004 | 2010.00 | 7.00 | tmin | 3.00 | 17.50 |
11 | MX17004 | 2010.00 | 11.00 | tmax | 4.00 | 27.20 |
12 | MX17004 | 2010.00 | 11.00 | tmin | 4.00 | 12.00 |
Looking at the table, we see that there is something strange with element. Can you guess what it is?
id | year | month | element | day | temperature | |
---|---|---|---|---|---|---|
1 | MX17004 | 2010.00 | 12.00 | tmax | 1.00 | 29.90 |
2 | MX17004 | 2010.00 | 12.00 | tmin | 1.00 | 13.80 |
3 | MX17004 | 2010.00 | 2.00 | tmax | 2.00 | 27.30 |
4 | MX17004 | 2010.00 | 2.00 | tmin | 2.00 | 14.40 |
5 | MX17004 | 2010.00 | 11.00 | tmax | 2.00 | 31.30 |
6 | MX17004 | 2010.00 | 11.00 | tmin | 2.00 | 16.30 |
7 | MX17004 | 2010.00 | 2.00 | tmax | 3.00 | 24.10 |
8 | MX17004 | 2010.00 | 2.00 | tmin | 3.00 | 14.40 |
9 | MX17004 | 2010.00 | 7.00 | tmax | 3.00 | 28.60 |
10 | MX17004 | 2010.00 | 7.00 | tmin | 3.00 | 17.50 |
11 | MX17004 | 2010.00 | 11.00 | tmax | 4.00 | 27.20 |
12 | MX17004 | 2010.00 | 11.00 | tmin | 4.00 | 12.00 |
tmin and tmax are variables, not values, and we should make them columns instead. This is easily done using the spread() function.
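A hedged sketch:
weather_tidy <- spread(weather_long, element, temperature)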
id | year | month | day | tmax | tmin | |
---|---|---|---|---|---|---|
1 | MX17004 | 2010.00 | 1.00 | 30.00 | 27.80 | 14.50 |
2 | MX17004 | 2010.00 | 2.00 | 2.00 | 27.30 | 14.40 |
3 | MX17004 | 2010.00 | 2.00 | 3.00 | 24.10 | 14.40 |
4 | MX17004 | 2010.00 | 2.00 | 11.00 | 29.70 | 13.40 |
5 | MX17004 | 2010.00 | 2.00 | 23.00 | 29.90 | 10.70 |
6 | MX17004 | 2010.00 | 3.00 | 5.00 | 32.10 | 14.20 |
7 | MX17004 | 2010.00 | 3.00 | 10.00 | 34.50 | 16.80 |
8 | MX17004 | 2010.00 | 3.00 | 16.00 | 31.10 | 17.60 |
9 | MX17004 | 2010.00 | 4.00 | 27.00 | 36.30 | 16.70 |
10 | MX17004 | 2010.00 | 5.00 | 27.00 | 33.20 | 18.20 |
Using your data: