Recently I have been more interested to perform web scraping to extract public data to perform data analysis. There are probably many R packages out there that do a good job at such task, but I found out that the rvest was among the most popular ones. Hence I decided to give it a try.

In this post, I’m using the rvest library to visualize how many medals were obtained by each country during the summer Olympics 2016. I thought it would be a first good exercise to get my hands around it.

Extracting data from web

First, let’s extract medals count information from the web. Here we are going to use data from NBC website.

library(rvest)
library(dplyr)

url1 <- "http://www.nbcolympics.com/medals"

medals <- read_html(url1) %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

knitr::kable(head(medals))
Place Country Gold Silver Bronze Total
1 United States 46 37 38 121
2 China 26 18 26 70
3 Great Britain 27 23 17 67
4 Russia 19 18 19 56
5 Germany 17 10 15 42
6 France 10 18 14 42

Get country codes and tidy the data

We are now almost ready to plot the data. I thought it would be interesting to enhance the visualization by associating each country to its flag. To do so, I first need to extract the two-characters ISO code for each country using the countrycode package. This is pretty straightforward using the countrycode() function. The next few lines are only used to tidy the data so it works well with ggplot2.

library(countrycode)
library(tidyr)

medals <- medals %>%
  mutate(code = countrycode(Country, "country.name", "iso2c")) %>%
  mutate(code = tolower(code)) %>%
  gather(medal_color, count, Gold, Silver, Bronze) %>%
  mutate(medal_color = factor(medal_color, levels = c("Gold", "Silver", "Bronze"))) %>%
  rename(country = Country) %>%
  drop_na(country, code)

knitr::kable(head(medals))
Place country Total code medal_color count
1 United States 121 us Gold 46
2 China 70 cn Gold 26
3 Great Britain 67 gb Gold 27
4 Russia 56 ru Gold 19
5 Germany 42 de Gold 17
6 France 42 fr Gold 10

Plotting the data

Using ggplot2 it is easy to plot the data. Do not forget to install the ggflags package from Github using devtools::install_github("baptiste/ggflags").

# devtools::install_github("baptiste/ggflags")
library(ggplot2)
library(ggflags)

medals %>%
  filter(Total >= 10) %>%
  ggplot(aes(x = reorder(country, Total), y = count)) +
  geom_bar(stat = "identity", aes(fill = medal_color)) +
  geom_flag(y = -7, aes(country = code), size = 10) +
  theme(axis.text.x = element_text(
    angle = 90,
    hjust = 1,
    size = 7,
    vjust = 0.5
  )) +
  scale_fill_manual(values = c(
    "Gold" = "gold",
    "Bronze" = "#cd7f32",
    "Silver" = "#C0C0C0"
  )) +
  scale_y_continuous(expand = c(0.1, 1)) +
  xlab("Country") +
  ylab("Number of medals") +
  theme_bw() +
  theme(legend.justification = c(1, 0), legend.position = c(1, 0)) +
  theme(legend.title = element_blank()) +
  coord_flip()

plot of chunk unnamed-chunk-3

As simple as that!