Sometimes when you open a data file (lets say a .csv), variables will be recognized as factor whereas it should be numeric. It is therefore tempting to simply convert the variable to numeric using as.numeric(). Big mistake! If you use directly as.numeric() on a variable of the type factor, R will returns the levels of the factor rather the actual values. To overcome this hassle, you should first convert the variable into a string with as.character() and thereafter use as.numeric().

x <- as.factor(rnorm(5))
x
#> [1] 0.702789386505402   2.14588100918678    -1.64008283187289  
#> [4] -0.0431189013978952 0.809021048617338  
#> 5 Levels: -1.64008283187289 -0.0431189013978952 ... 2.14588100918678
## Wrong way
as.numeric(x)
#> [1] 3 5 1 2 4
## Right way
as.numeric(as.character(x))
#> [1]  0.7027894  2.1458810 -1.6400828 -0.0431189  0.8090210

Another way to prevent numeric to be converted into factor would be to use as.is = T when opening the data file with read.csv.

Remove all objects in the workspace**

rm(list = ls())

Replacing values of a data frame with NA

library(xtable)

## Random matrix with with some (50) zeros
df <- matrix(rnorm(100), ncol = 10)

df[sample(100, 50)] <- 0

df <- data.frame(df)

print(xtable(df), type = "html")

<table border=1>

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 1 0.00 0.00 0.29 0.43 0.00 -0.02 0.00 0.12 -0.26 0.00 2 0.00 0.00 0.00 -2.03 0.00 0.00 -0.12 0.00 0.00 1.72 3 0.00 0.00 0.00 -1.40 0.00 0.03 0.00 -0.89 0.00 0.00 4 0.00 0.00 0.00 -1.23 0.00 0.00 0.00 -0.21 0.00 0.00 5 -0.64 0.00 -1.96 1.62 -0.69 1.00 -0.37 -1.99 -1.91 0.00 6 -0.95 0.64 0.00 0.00 0.00 -0.48 0.00 0.00 0.00 0.00 7 0.00 0.00 0.00 -0.91 -0.55 1.09 -2.64 0.00 1.36 0.00 8 -0.35 -0.50 0.00 0.45 0.16 0.38 -1.16 1.29 0.45 -0.42 9 -1.93 -1.01 -0.28 0.05 0.00 0.00 1.30 0.00 0.00 0.32 10 -0.85 0.00 0.00 1.67 0.00 0.70 0.00 0.00 0.55 -0.59

</table>

## Replace the 0 with NA
df[df == 0] <- NA

print(xtable(df), type = "html", NA.string = "NA")

<table border=1>

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 1 NA NA 0.29 0.43 NA -0.02 NA 0.12 -0.26 NA 2 NA NA NA -2.03 NA NA -0.12 NA NA 1.72 3 NA NA NA -1.40 NA 0.03 NA -0.89 NA NA 4 NA NA NA -1.23 NA NA NA -0.21 NA NA 5 -0.64 NA -1.96 1.62 -0.69 1.00 -0.37 -1.99 -1.91 NA 6 -0.95 0.64 NA NA NA -0.48 NA NA NA NA 7 NA NA NA -0.91 -0.55 1.09 -2.64 NA 1.36 NA 8 -0.35 -0.50 NA 0.45 0.16 0.38 -1.16 1.29 0.45 -0.42 9 -1.93 -1.01 -0.28 0.05 NA NA 1.30 0.00 NA 0.32 10 -0.85 NA NA 1.67 NA 0.70 NA NA 0.55 -0.59

</table>

##Count % of missing values per column

The other day I was working with a large database. I was looking for a way to calculate % of missing values for each variable. Here is how I calculated it.

## Generate some data
df <- matrix(rnorm(100), ncol = 10)

df[sample(100, 50)] <- NA

df <- data.frame(df)

print(xtable(df), type = "html", NA.string = "NA")

<table border=1>

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 1 NA 0.27 NA NA NA NA -0.28 NA 0.76 NA 2 NA -0.85 -0.05 0.38 NA -1.37 NA 0.73 0.30 -1.21 3 NA -0.05 0.37 NA -1.06 0.09 -0.26 -0.26 -0.04 -0.48 4 -1.25 NA NA -0.15 NA 1.65 1.24 -0.13 NA 0.11 5 NA -0.18 NA NA NA 0.14 NA -0.95 0.56 NA 6 NA -0.18 NA -0.18 NA -0.59 NA 0.51 NA NA 7 NA -0.59 NA NA NA NA -0.84 NA NA NA 8 -0.41 NA -0.51 NA 0.03 NA -0.42 1.22 -1.08 0.02 9 0.23 NA -0.71 0.94 -0.76 NA NA -0.14 1.49 NA 10 1.46 NA NA NA NA NA 0.66 NA 1.64 NA

</table>

## Find % of missing values in each column with apply
missing <- apply(df, 2, function(x) sum(is.na(x)))/nrow(df)*100

## Print the result
df2 <- data.frame(variable = names(df), missing = missing)

print(xtable(df2), type = "html", include.rownames = FALSE)

<table border=1>

variable missing X1 60.00 X2 40.00 X3 60.00 X4 60.00 X5 70.00 X6 50.00 X7 40.00 X8 30.00 X9 30.00 X10 60.00

</table>

Grouping data frame

Tsshere is various way to merge data frame in R. For me, the most easiest way is to use the join functions from the dplyr package.

library(dplyr)

df1 <- data.frame(ID = sample(LETTERS[1:20],10, replace = FALSE), x = rnorm(10))
df2 <- data.frame(ID = sample(LETTERS[1:20],10, replace = FALSE), y = rnorm(10))

print(xtable(df1), type = "html", include.rownames = FALSE)

<table border=1>

ID x R -0.24 M -0.78 P -0.40 N 0.57 B -0.01 Q 0.38 I 1.00 F 0.49 S 0.20 H -0.78

</table>

print(xtable(df2), type = "html", include.rownames = FALSE)

<table border=1>

ID y M 1.08 O -0.24 E -0.71 I 0.16 A -0.62 H -1.24 S -0.27 G 1.32 F -0.97 L 0.29

</table>

inner_join(df1, df2, by = "ID")

ID x y 1 M -0.7774292 1.0780574 2 I 1.0025206 0.1625215 3 F 0.4894344 -0.9664112 4 S 0.2042161 -0.2672898 5 H -0.7775847 -1.2408061

df_exact_match <- merge(df1,df2, by = "ID") ## Exact match
print(xtable(df_exact_match), type = "html", include.rownames = FALSE, NA.string = "NA")

<table border=1>

ID x y F 0.49 -0.97 H -0.78 -1.24 I 1.00 0.16 M -0.78 1.08 S 0.20 -0.27

</table>

df_all_x <- merge(df1,df2, by = "ID", all.x = T) ## Keep all x (discard y that are not matched)
print(xtable(df_all_x), type = "html", include.rownames = FALSE, NA.string = "NA")

<table border=1>

ID x y B -0.01 NA F 0.49 -0.97 H -0.78 -1.24 I 1.00 0.16 M -0.78 1.08 N 0.57 NA P -0.40 NA Q 0.38 NA R -0.24 NA S 0.20 -0.27

</table>

df_all_y <- merge(df1,df2, by = "ID", all.y = T) ## Keep all y (discard x that are not matched)
print(xtable(df_all_y), type = "html", include.rownames = FALSE, NA.string = "NA")

<table border=1>

ID x y F 0.49 -0.97 H -0.78 -1.24 I 1.00 0.16 M -0.78 1.08 S 0.20 -0.27 A NA -0.62 E NA -0.71 G NA 1.32 L NA 0.29 O NA -0.24

</table>

df_all <- merge(df1,df2, by = "ID", all = T) ## Keep everything
print(xtable(df_all), type = "html", include.rownames = FALSE, NA.string = "NA")

<table border=1>

ID x y B -0.01 NA F 0.49 -0.97 H -0.78 -1.24 I 1.00 0.16 M -0.78 1.08 N 0.57 NA P -0.40 NA Q 0.38 NA R -0.24 NA S 0.20 -0.27 A NA -0.62 E NA -0.71 G NA 1.32 L NA 0.29 O NA -0.24

</table>