Advanced Data Manipulation Homework

(Refer back to the Advanced Data Manipulation lesson).

Key Concepts

dplyr verbs

the pipe %>%

variable creation

multiple conditions

properties of grouped data

aggregation

summary functions

window functions

Getting Started

We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download the gapminder.csv data by clicking here or using the link above. Download it, and save it in a data/ subfolder of the project directory where you can access it easily from R.

Load the dplyr and readr packages, and read the gapminder data into R using the read_csv() function (n.b. read_csv() is not the same as read.csv()). Assign the data to an object called gm. Run gm to display it.

Note, the code is available by hitting the “Code” button above each expected output, but try not to use it unless you’re stuck.

# Load required libraries
library(dplyr)
library(readr)

# Read the data
gm <- read_csv("data/gapminder.csv")

# Take a look
gm

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

Problem set

Use dplyr functions to address the following questions:

How many unique countries are represented per continent? (Hint: group_by then summarize with a call to n_distinct(...)).

gm %>% 
  group_by(continent) %>% 
  summarize(n=n_distinct(country))

## # A tibble: 5 x 2
##   continent     n
##   <chr>     <int>
## 1 Africa       52
## 2 Americas     25
## 3 Asia         33
## 4 Europe       30
## 5 Oceania       2

Which European nation had the lowest GDP per capita in 1997? (Hint: filter, arrange, head(n=1))

gm %>%
    filter(continent == "Europe" & year == 1997) %>%
    arrange(gdpPercap) %>%
    head(1)

## # A tibble: 1 x 6
##   country continent  year lifeExp     pop gdpPercap
##   <chr>   <chr>     <int>   <dbl>   <int>     <dbl>
## 1 Albania Europe     1997    73.0 3428038     3193.

According to the data available, what was the average life expectancy across each continent in the 1980s? (Hint: filter, group_by, summarize)

gm %>%
    filter(year == 1982 | year == 1987) %>%
    group_by(continent) %>%
    summarize(mean_lifeExp = mean(lifeExp))

## # A tibble: 5 x 2
##   continent mean_lifeExp
##   <chr>            <dbl>
## 1 Africa            52.5
## 2 Americas          67.2
## 3 Asia              63.7
## 4 Europe            73.2
## 5 Oceania           74.8

What 5 countries have the highest total GDP over all years combined? (Hint: GDP per capita is simply GDP divided by the total population size. To get GDP back, you’d mutate to calculate GDP as the product of GDP per capita times the population size. Mutate, group_by, summarize, arrange, head(n=5))

gm %>%
    mutate(gdp = gdpPercap*pop) %>%
    group_by(country) %>%
    summarise(Total.GDP = sum(gdp)) %>%
    arrange(desc(Total.GDP)) %>%
    head(5)

## # A tibble: 5 x 2
##   country        Total.GDP
##   <chr>              <dbl>
## 1 United States    7.68e13
## 2 Japan            2.54e13
## 3 China            2.04e13
## 4 Germany          1.95e13
## 5 United Kingdom   1.33e13

What countries and years had life expectancies of at least 80 years? N.b. only output the columns of interest: country, life expectancy and year (in that order).

gm %>%
    filter(lifeExp >= 80) %>%
    select(country, lifeExp, year)

## # A tibble: 22 x 3
##    country          lifeExp  year
##    <chr>              <dbl> <int>
##  1 Australia           80.4  2002
##  2 Australia           81.2  2007
##  3 Canada              80.7  2007
##  4 France              80.7  2007
##  5 Hong Kong, China    80    1997
##  6 Hong Kong, China    81.5  2002
##  7 Hong Kong, China    82.2  2007
##  8 Iceland             80.5  2002
##  9 Iceland             81.8  2007
## 10 Israel              80.7  2007
## # ... with 12 more rows

What 10 countries have the strongest correlation (in either direction) between life expectancy and per capita GDP?

gm %>%
    group_by(country) %>%
    summarise(cor = cor(lifeExp, gdpPercap)) %>%
    arrange(desc(abs(cor))) %>%
    head(10)

## # A tibble: 10 x 2
##    country          cor
##    <chr>          <dbl>
##  1 France         0.996
##  2 Austria        0.993
##  3 Belgium        0.993
##  4 Norway         0.992
##  5 Oman           0.991
##  6 United Kingdom 0.990
##  7 Italy          0.990
##  8 Israel         0.988
##  9 Denmark        0.987
## 10 Australia      0.986

Which combinations of continent (besides Asia) and year have the highest average population across all countries? N.b. your output should include all results sorted by highest average population. (Hint: filter where continent != Asia, group by two variables, summarize, then arrange.)

gm %>%
    filter(continent != "Asia") %>%
    group_by(continent, year) %>%
    summarise(mean.pop = mean(pop)) %>%
    arrange(desc(mean.pop))

## # A tibble: 48 x 3
## # Groups:   continent [4]
##    continent  year  mean.pop
##    <chr>     <int>     <dbl>
##  1 Americas   2007 35954847.
##  2 Americas   2002 33990910.
##  3 Americas   1997 31876016.
##  4 Americas   1992 29570964.
##  5 Americas   1987 27310159.
##  6 Americas   1982 25211637.
##  7 Americas   1977 23122708.
##  8 Americas   1972 21175368.
##  9 Europe     2007 19536618.
## 10 Europe     2002 19274129.
## # ... with 38 more rows

Which three countries have had the most consistent population estimates (i.e. lowest standard deviation) across the years of available data?

gm %>%
    group_by(country) %>%
    summarize(sd.pop = sd(pop)) %>%
    arrange(sd.pop) %>%
    head(3)

## # A tibble: 3 x 2
##   country               sd.pop
##   <chr>                  <dbl>
## 1 Sao Tome and Principe 45906.
## 2 Iceland               48542.
## 3 Montenegro            99738.

Bonus! Which observations indicate that the population of a country has decreased from the previous year and the life expectancy has increased from the previous year? See the vignette on window functions.

gm %>% 
  arrange(country, year) %>% 
  group_by(country) %>% 
  filter(pop < lag(pop) & lifeExp > lag(lifeExp))

## # A tibble: 36 x 6
## # Groups:   country [22]
##    country                continent  year lifeExp      pop gdpPercap
##    <chr>                  <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan            Asia       1982    39.9 12881816      978.
##  2 Bosnia and Herzegovina Europe     1992    72.2  4256013     2547.
##  3 Bosnia and Herzegovina Europe     1997    73.2  3607000     4766.
##  4 Bulgaria               Europe     2002    72.1  7661799     7697.
##  5 Bulgaria               Europe     2007    73.0  7322858    10681.
##  6 Croatia                Europe     1997    73.7  4444595     9876.
##  7 Czech Republic         Europe     1997    74.0 10300707    16049.
##  8 Czech Republic         Europe     2002    75.5 10256295    17596.
##  9 Czech Republic         Europe     2007    76.5 10228744    22833.
## 10 Equatorial Guinea      Africa     1977    42.0   192675      959.
## # ... with 26 more rows