Working with real data

Importing your data

Until now we have been working with simple data generated within R.

However, most of the time we want to work with external data.

  • Various sources and types of data: text files, images (GeoTIFF/raster).
  • Most of the time, the data come in either .csv or .xls format.

Data frames

Data frames are central objects in R; reading a file usually returns one.

A data frame can be seen as an Excel spreadsheet:

  • Rows are observations
  • Columns are variables

A data frame is similar to a matrix, except that its columns (variables) can be of different types (numeric, date, character, etc.).
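To make the difference with a matrix concrete, here is a minimal sketch. The `trees` data frame and its values are made up for illustration; they are not part of the course data.

```r
# A small, made-up data frame: each column has its own type.
trees <- data.frame(
  species  = c("oak", "pine", "birch"),                             # character
  height_m = c(21.5, 30.2, 15.8),                                   # numeric
  planted  = as.Date(c("1990-04-12", "1985-09-30", "2001-05-03")),  # Date
  stringsAsFactors = FALSE
)

# One class per column
sapply(trees, class)

# A matrix, by contrast, holds a single type: mixing numbers and text
# coerces every element to character.
m <- cbind(c(21.5, 30.2), c("oak", "pine"))
class(m[1, 1])
```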

Examples

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
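The rows above match the built-in `mtcars` dataset. Assuming that is the data frame being shown, the listing can be reproduced like this:

```r
# mtcars ships with R (datasets package), so nothing needs to be imported.
first_rows <- head(mtcars)  # head() returns the first 6 rows by default
first_rows
```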

Structure of a data frame

After you have opened a data file, it is always a good idea to look at the structure of the returned data frame. This lets you check that all variables have the right types.

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
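A listing like the one above is what `str()` prints. A minimal sketch, assuming the data frame is the built-in `mtcars`:

```r
# Compact overview: number of observations and variables,
# then the type and first values of each column.
str(mtcars)

# The same type check, done programmatically
col_types <- sapply(mtcars, class)
col_types
```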

Useful functions

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
## [1] 32
## [1] 11
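The three results above correspond to the variable names, the number of rows and the number of columns. A sketch of calls that return them, again assuming `mtcars`:

```r
var_names <- names(mtcars)  # column (variable) names
n_obs     <- nrow(mtcars)   # number of rows (observations)
n_vars    <- ncol(mtcars)   # number of columns (variables)

var_names
n_obs
n_vars

# dim() returns both counts at once: rows, then columns
dim(mtcars)
```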

Accessing elements of data frame

There are two main ways to access the data in a data frame.

First method

The first way to access elements of a data frame is to use indexes, as we did for matrices.

For example,

## [1] 160
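One indexing call that returns this value is sketched below. The exact indexes used on the slide are not shown, so row 1 and column 3 are an assumption.

```r
# df[row, column]: first row, third column (disp of the Mazda RX4)
value <- mtcars[1, 3]
value
```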

Accessing columns of a data frame.

##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109
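The values above are those of the `hp` column, which is the fourth column of `mtcars`. Leaving the row index empty selects every row:

```r
# df[, column]: all rows of the fourth column
hp_values <- mtcars[, 4]
hp_values
```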

Accessing rows of a data frame.

##           mpg cyl disp  hp drat   wt  qsec vs am gear carb
## Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
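Conversely, leaving the column index empty selects every column of a row. A sketch for the first row of `mtcars`:

```r
# df[row, ]: all columns of the first row; the result is still a data frame
first_car <- mtcars[1, ]
first_car
```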

Second method

The second way to access elements of a data frame is to use the $ operator, following the df$variable scheme.

##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
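The values above are those of the `cyl` column. With the `$` operator, the column is selected by name rather than by position:

```r
# df$variable: the whole cyl column as a vector
cylinders <- mtcars$cyl
cylinders

# Same result as indexing by position (cyl is the second column)
identical(cylinders, mtcars[, 2])
```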

The second method is more explicit and easier to read, since you don’t have to remember the positions (indexes) of the variables in the data frame.

##  [1] 6 6 4 6 8 6 8 4 4 6
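These are the first ten values of `cyl`. The call used on the slide is not shown; one way to get them is to combine `$` with vector indexing:

```r
# Select the column by name, then the first ten elements of that vector
first_ten <- mtcars$cyl[1:10]
first_ten

# head() with n = 10 gives the same values
head(mtcars$cyl, 10)
```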

Quick RStudio tip

After you type the name of a data frame followed by $, the list of all variables in that data frame will appear. Use the keyboard to select the variable of interest.

CSV and Excel files

I recommend using the readr and readxl packages for reading CSV and Excel files. These two packages are not installed by default.
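A minimal sketch of installing and using both packages. The CSV file is a temporary file written only to keep the example self-contained, and the Excel file is an example workbook that ships with readxl; in practice both paths would point to your own files.

```r
# One-time installation (uncomment if the packages are missing)
# install.packages(c("readr", "readxl"))

library(readr)
library(readxl)

# Write a throwaway CSV so that there is something to read back
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars), csv_path, row.names = FALSE)

# read_csv() returns a tibble, which is a kind of data frame
cars_csv <- read_csv(csv_path)
str(cars_csv)

# Example workbook bundled with readxl; the sheet is selected by name
xlsx_path <- readxl_example("datasets.xlsx")
cars_xlsx <- read_excel(xlsx_path, sheet = "mtcars")
str(cars_xlsx)
```

As with any imported file, checking the result with `str()` right after reading confirms that each column was given the expected type.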