(Refer back to the Advanced Data Manipulation lesson).
- dplyr verbs
- the pipe
%>%- variable creation
- multiple conditions
- properties of grouped data
- aggregation
- summary functions
- window functions
We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download the gapminder.csv data by clicking here or using the link above. Download it, and save it in a data/ subfolder of the project directory where you can access it easily from R.
Load the dplyr and readr packages, and read the gapminder data into R using the read_csv() function (n.b. read_csv() is not the same as read.csv()). Assign the data to an object called gm. Run gm to display it.
Note, the code is available by hitting the “Code” button above each expected output, but try not to use it unless you’re stuck.
# Load required libraries
library(dplyr)
library(readr)
# Read the data
gm <- read_csv("data/gapminder.csv")
# Take a look
gm## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <chr>       <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rowsUse dplyr functions to address the following questions:
group_by then summarize with a call to n_distinct(...)).gm %>% 
  group_by(continent) %>% 
  summarize(n=n_distinct(country))## # A tibble: 5 x 2
##   continent     n
##   <chr>     <int>
## 1 Africa       52
## 2 Americas     25
## 3 Asia         33
## 4 Europe       30
## 5 Oceania       2head(n=1))gm %>%
    filter(continent == "Europe" & year == 1997) %>%
    arrange(gdpPercap) %>%
    head(1)## # A tibble: 1 x 6
##   country continent  year lifeExp     pop gdpPercap
##   <chr>   <chr>     <int>   <dbl>   <int>     <dbl>
## 1 Albania Europe     1997    73.0 3428038     3193.gm %>%
    filter(year == 1982 | year == 1987) %>%
    group_by(continent) %>%
    summarize(mean_lifeExp = mean(lifeExp))## # A tibble: 5 x 2
##   continent mean_lifeExp
##   <chr>            <dbl>
## 1 Africa            52.5
## 2 Americas          67.2
## 3 Asia              63.7
## 4 Europe            73.2
## 5 Oceania           74.8head(n=5))gm %>%
    mutate(gdp = gdpPercap*pop) %>%
    group_by(country) %>%
    summarise(Total.GDP = sum(gdp)) %>%
    arrange(desc(Total.GDP)) %>%
    head(5)## # A tibble: 5 x 2
##   country        Total.GDP
##   <chr>              <dbl>
## 1 United States    7.68e13
## 2 Japan            2.54e13
## 3 China            2.04e13
## 4 Germany          1.95e13
## 5 United Kingdom   1.33e13gm %>%
    filter(lifeExp >= 80) %>%
    select(country, lifeExp, year)## # A tibble: 22 x 3
##    country          lifeExp  year
##    <chr>              <dbl> <int>
##  1 Australia           80.4  2002
##  2 Australia           81.2  2007
##  3 Canada              80.7  2007
##  4 France              80.7  2007
##  5 Hong Kong, China    80    1997
##  6 Hong Kong, China    81.5  2002
##  7 Hong Kong, China    82.2  2007
##  8 Iceland             80.5  2002
##  9 Iceland             81.8  2007
## 10 Israel              80.7  2007
## # ... with 12 more rowsgm %>%
    group_by(country) %>%
    summarise(cor = cor(lifeExp, gdpPercap)) %>%
    arrange(desc(abs(cor))) %>%
    head(10)## # A tibble: 10 x 2
##    country          cor
##    <chr>          <dbl>
##  1 France         0.996
##  2 Austria        0.993
##  3 Belgium        0.993
##  4 Norway         0.992
##  5 Oman           0.991
##  6 United Kingdom 0.990
##  7 Italy          0.990
##  8 Israel         0.988
##  9 Denmark        0.987
## 10 Australia      0.986!= Asia, group by two variables, summarize, then arrange.)gm %>%
    filter(continent != "Asia") %>%
    group_by(continent, year) %>%
    summarise(mean.pop = mean(pop)) %>%
    arrange(desc(mean.pop)) ## # A tibble: 48 x 3
## # Groups:   continent [4]
##    continent  year  mean.pop
##    <chr>     <int>     <dbl>
##  1 Americas   2007 35954847.
##  2 Americas   2002 33990910.
##  3 Americas   1997 31876016.
##  4 Americas   1992 29570964.
##  5 Americas   1987 27310159.
##  6 Americas   1982 25211637.
##  7 Americas   1977 23122708.
##  8 Americas   1972 21175368.
##  9 Europe     2007 19536618.
## 10 Europe     2002 19274129.
## # ... with 38 more rowsgm %>%
    group_by(country) %>%
    summarize(sd.pop = sd(pop)) %>%
    arrange(sd.pop) %>%
    head(3)## # A tibble: 3 x 2
##   country               sd.pop
##   <chr>                  <dbl>
## 1 Sao Tome and Principe 45906.
## 2 Iceland               48542.
## 3 Montenegro            99738.gm %>% 
  arrange(country, year) %>% 
  group_by(country) %>% 
  filter(pop < lag(pop) & lifeExp > lag(lifeExp))## # A tibble: 36 x 6
## # Groups:   country [22]
##    country                continent  year lifeExp      pop gdpPercap
##    <chr>                  <chr>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan            Asia       1982    39.9 12881816      978.
##  2 Bosnia and Herzegovina Europe     1992    72.2  4256013     2547.
##  3 Bosnia and Herzegovina Europe     1997    73.2  3607000     4766.
##  4 Bulgaria               Europe     2002    72.1  7661799     7697.
##  5 Bulgaria               Europe     2007    73.0  7322858    10681.
##  6 Croatia                Europe     1997    73.7  4444595     9876.
##  7 Czech Republic         Europe     1997    74.0 10300707    16049.
##  8 Czech Republic         Europe     2002    75.5 10256295    17596.
##  9 Czech Republic         Europe     2007    76.5 10228744    22833.
## 10 Equatorial Guinea      Africa     1977    42.0   192675      959.
## # ... with 26 more rows