Until now we have been working with simple data generated within R.
However, most of the time we want to work with external data.
Data frames are important objects in R; reading a file typically returns one.
A data frame can be seen as an Excel sheet, or as a matrix with the difference that columns (variables) can be of different types (numeric, date, character, etc.).
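The listing that follows looks like the first rows of mtcars, a data set that ships with R. The original code is not shown, so the call below is an assumption about how such a preview could be produced.

```r
# mtcars is part of the built-in datasets package, so no file is needed
first_rows <- head(mtcars)  # head() returns the first six rows by default
print(first_rows)
```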
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
After you have opened a data file, it is always a good idea to look at the structure of the returned data frame. This ensures that all variables have the right types.
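One way to inspect the structure is the str() function. The sketch below assumes the data frame is the built-in mtcars; with your own data you would pass the object returned by the reading function instead.

```r
# str() prints one line per variable: its name, type and first values
str(mtcars)

# The types can also be collected in a named vector for a quick check
column_types <- sapply(mtcars, class)
print(column_types)
```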
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
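The variable names and the dimensions of a data frame can be queried separately. The three values printed next are consistent with calls such as the following (an assumption, since the original code is not shown).

```r
names(mtcars)  # names of the variables (columns)
nrow(mtcars)   # number of observations (rows)
ncol(mtcars)   # number of variables (columns)

# dim() returns both numbers at once: rows first, then columns
size <- dim(mtcars)
print(size)
```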
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
## [1] 32
## [1] 11
There are two main ways to access the data in a data frame.
The first consists of using indexes, as we did for matrices.
For example, a single element is addressed by its row and column position.
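A minimal sketch of element access, assuming the mtcars data frame: the row index comes first, the column index second. Row 1, column 3 is the disp value of the Mazda RX4.

```r
# Element in the first row and third column (disp of the Mazda RX4)
value <- mtcars[1, 3]
print(value)
```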
## [1] 160
Accessing columns of a data frame.
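Leaving the row index empty selects every row of the chosen column. The values listed next match the hp variable, which is the fourth column of mtcars, so a call like this one is a plausible source (an assumption).

```r
# All rows of the fourth column (hp); the result is a plain numeric vector
horsepower <- mtcars[, 4]
print(horsepower)
```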
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
Accessing rows of a data frame.
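Conversely, leaving the column index empty selects every column of the chosen row. Unlike a column, a row is returned as a one-row data frame, because its values may be of different types. The sketch assumes mtcars.

```r
# All columns of the first row; the result is still a data frame
first_car <- mtcars[1, ]
print(first_car)
```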
##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
The second method to access elements of a data frame consists of using the $ operator, following the df$variable scheme.
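As a small illustration with mtcars (assumed here to be the data frame in use), the cyl variable is extracted by name:

```r
# Extract the cyl variable by name; the result is a numeric vector
cylinders <- mtcars$cyl
print(cylinders)
```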
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Using the second method makes the code more obvious and easier to read, since you don't have to remember the position (index) of each variable in the data frame.
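The two approaches return the same values, and the vector obtained with $ can itself be indexed. The sketch below assumes mtcars, where cyl is the second column; selecting the first ten values is a guess at what produced the short listing that follows.

```r
# Name-based and position-based access give the same vector
same <- identical(mtcars$cyl, mtcars[, 2])
print(same)

# The extracted vector can be subset like any other vector
first_ten <- mtcars$cyl[1:10]
print(first_ten)
```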
## [1] 6 6 4 6 8 6 8 4 4 6
After you type the name of a data frame followed by $, the list of all variables within this data frame appears. Use the keyboard to select the variable of interest.
I recommend using the readr and readxl packages for reading CSV and Excel files. These two packages are not installed by default.
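The sketch below shows one way to install and use the two packages. The temporary CSV file and the object names are made up for illustration, and the Excel workbook is one of the example files bundled with readxl. The reading steps are skipped when a package is not installed.

```r
# Both packages must be installed once before use (needs internet access):
# install.packages(c("readr", "readxl"))

# Write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

if (requireNamespace("readr", quietly = TRUE)) {
  # read_csv() returns a tibble, which is a kind of data frame
  cars_csv <- readr::read_csv(csv_path)
  str(cars_csv)
}

if (requireNamespace("readxl", quietly = TRUE)) {
  # readxl ships a few example workbooks; datasets.xlsx is one of them
  xlsx_path <- readxl::readxl_example("datasets.xlsx")
  cars_xlsx <- readxl::read_excel(xlsx_path, sheet = "mtcars")
  str(cars_xlsx)
}
```

With your own files, replace csv_path and xlsx_path with the paths to the files you want to read, then check the result with str() as described earlier.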