(Refer back to the Advanced Data Manipulation lesson).
- dplyr verbs
- the pipe
%>%
- variable creation
- multiple conditions
- properties of grouped data
- aggregation
- summary functions
- window functions
We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download the gapminder.csv data by clicking here or using the link above. Download it, and save it in a data/
subfolder of the project directory where you can access it easily from R.
Load the dplyr and readr packages, and read the gapminder data into R using the read_csv()
function (n.b. read_csv()
is not the same as read.csv()
). Assign the data to an object called gm
. Run gm
to display it.
Note, the code is available by hitting the “Code” button above each expected output, but try not to use it unless you’re stuck.
# Load required libraries
library(dplyr)
library(readr)
# Read the data
gm <- read_csv("data/gapminder.csv")
# Take a look
gm
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
Use dplyr functions to address the following questions:
group_by
then summarize with a call to n_distinct(...)
).gm %>%
group_by(continent) %>%
summarize(n=n_distinct(country))
## # A tibble: 5 x 2
## continent n
## <chr> <int>
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
head(n=1)
)gm %>%
filter(continent == "Europe" & year == 1997) %>%
arrange(gdpPercap) %>%
head(1)
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Albania Europe 1997 73.0 3428038 3193.
gm %>%
filter(year == 1982 | year == 1987) %>%
group_by(continent) %>%
summarize(mean_lifeExp = mean(lifeExp))
## # A tibble: 5 x 2
## continent mean_lifeExp
## <chr> <dbl>
## 1 Africa 52.5
## 2 Americas 67.2
## 3 Asia 63.7
## 4 Europe 73.2
## 5 Oceania 74.8
head(n=5)
)gm %>%
mutate(gdp = gdpPercap*pop) %>%
group_by(country) %>%
summarise(Total.GDP = sum(gdp)) %>%
arrange(desc(Total.GDP)) %>%
head(5)
## # A tibble: 5 x 2
## country Total.GDP
## <chr> <dbl>
## 1 United States 7.68e13
## 2 Japan 2.54e13
## 3 China 2.04e13
## 4 Germany 1.95e13
## 5 United Kingdom 1.33e13
gm %>%
filter(lifeExp >= 80) %>%
select(country, lifeExp, year)
## # A tibble: 22 x 3
## country lifeExp year
## <chr> <dbl> <int>
## 1 Australia 80.4 2002
## 2 Australia 81.2 2007
## 3 Canada 80.7 2007
## 4 France 80.7 2007
## 5 Hong Kong, China 80 1997
## 6 Hong Kong, China 81.5 2002
## 7 Hong Kong, China 82.2 2007
## 8 Iceland 80.5 2002
## 9 Iceland 81.8 2007
## 10 Israel 80.7 2007
## # ... with 12 more rows
gm %>%
group_by(country) %>%
summarise(cor = cor(lifeExp, gdpPercap)) %>%
arrange(desc(abs(cor))) %>%
head(10)
## # A tibble: 10 x 2
## country cor
## <chr> <dbl>
## 1 France 0.996
## 2 Austria 0.993
## 3 Belgium 0.993
## 4 Norway 0.992
## 5 Oman 0.991
## 6 United Kingdom 0.990
## 7 Italy 0.990
## 8 Israel 0.988
## 9 Denmark 0.987
## 10 Australia 0.986
!=
Asia, group by two variables, summarize, then arrange.)gm %>%
filter(continent != "Asia") %>%
group_by(continent, year) %>%
summarise(mean.pop = mean(pop)) %>%
arrange(desc(mean.pop))
## # A tibble: 48 x 3
## # Groups: continent [4]
## continent year mean.pop
## <chr> <int> <dbl>
## 1 Americas 2007 35954847.
## 2 Americas 2002 33990910.
## 3 Americas 1997 31876016.
## 4 Americas 1992 29570964.
## 5 Americas 1987 27310159.
## 6 Americas 1982 25211637.
## 7 Americas 1977 23122708.
## 8 Americas 1972 21175368.
## 9 Europe 2007 19536618.
## 10 Europe 2002 19274129.
## # ... with 38 more rows
gm %>%
group_by(country) %>%
summarize(sd.pop = sd(pop)) %>%
arrange(sd.pop) %>%
head(3)
## # A tibble: 3 x 2
## country sd.pop
## <chr> <dbl>
## 1 Sao Tome and Principe 45906.
## 2 Iceland 48542.
## 3 Montenegro 99738.
gm %>%
arrange(country, year) %>%
group_by(country) %>%
filter(pop < lag(pop) & lifeExp > lag(lifeExp))
## # A tibble: 36 x 6
## # Groups: country [22]
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1982 39.9 12881816 978.
## 2 Bosnia and Herzegovina Europe 1992 72.2 4256013 2547.
## 3 Bosnia and Herzegovina Europe 1997 73.2 3607000 4766.
## 4 Bulgaria Europe 2002 72.1 7661799 7697.
## 5 Bulgaria Europe 2007 73.0 7322858 10681.
## 6 Croatia Europe 1997 73.7 4444595 9876.
## 7 Czech Republic Europe 1997 74.0 10300707 16049.
## 8 Czech Republic Europe 2002 75.5 10256295 17596.
## 9 Czech Republic Europe 2007 76.5 10228744 22833.
## 10 Equatorial Guinea Africa 1977 42.0 192675 959.
## # ... with 26 more rows