- CAREER TRACK: Data Scientist with R
- First course: Intermediate R
- Second course: Introduction to the Tidyverse
- Third course: Importing Data in R
- Class I Importing data from flat files with utils, date: 2018-10-22
- Class II readr & data.table, date: 2018-10-22
- Class III Importing Excel data, date: 2018-10-22
- Class IV Reproducible Excel work with XLConnect, date: 2018-10-22
- Class V Importing data from databases, date: 2018-10-22
- Class VI Importing data from the web, date: 2018-10-22
- Class VII Importing data from statistical software packages, date: 2018-10-22
- Fourth course: Cleaning Data in R
- Class I Introduction and exploring raw data, date: 2018-10-22
- Class II Exploring raw data, date: 2018-10-22
- Class III Tidying data, date: 2018-10-22
- Class IV Preparing data for analysis, date: 2018-10-22
- Class V Missing and special values, date: 2018-10-22
- Class VI Outliers and obvious errors, date: 2018-10-22
- Fifth course: Importing & Cleaning Data in R: Case Studies
- Sixth course: Writing Functions in R
- Seventh course: Data Manipulation in R with dplyr
- Eighth course: Joining Data in R with dplyr
- Ninth course: Intro to SQL for Data Science
- Tenth course: Data Visualization with ggplot2
- Eleventh course: Working with Dates and Times in R
This is a study note of DataCamp in R Markdown document. For more details on learning at DataCamp see https://www.datacamp.com/tracks/data-scientist-with-r
CAREER TRACK: Data Scientist with R
A Data Scientist combines statistical and machine learning techniques with R programming to analyze and interpret complex data.
First course: Intermediate R
Class I Conditionals and Control Flow, date: 2018-10-20
1.Relational Operators
The basic form of comparison is equality and inequality
Equality ==
Inequality !=
It is useful to compare a vector with a number, such as:
c(1, 2, 3, 4, 5) > 2
[1] FALSE FALSE TRUE TRUE TRUE
Compare matrices
R's ability to deal with different data structures for comparisons does not stop at vectors. Matrices and relational operators also work together seamlessly!
linkin <- c(15, 13, 15, 16, 17)
facebook <- c(25, 17, 12, 18, 22)
views <- matrix(c(linkin, facebook), nrow = 2, byrow = TRUE)
print(views)
[,1] [,2] [,3] [,4] [,5]
[1,] 15 13 15 16 17
[2,] 25 17 12 18 22
views == 13
[,1] [,2] [,3] [,4] [,5]
[1,] FALSE TRUE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE
2.Logical Operators
And operator &
Or operator |
Not operator !
TRUE & TRUE is TRUE; TRUE & FALSE is FALSE;
TRUE | FALSE is TRUE; FALSE | FALSE is FALSE;
!TRUE is FALSE
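These operators also work element-wise on vectors; a minimal sketch (the vectors a and b are illustrative):

```r
# element-wise logical operations on two small logical vectors
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

a & b  # TRUE FALSE FALSE
a | b  # TRUE TRUE FALSE
!a     # FALSE FALSE TRUE
```

Note that the single forms & and | are vectorized, while && and || are meant for single logical values.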
3.Conditional Statements
if statement, syntax:
if (condition) {
expr
}
else statement, syntax:
if (condition) {
expr1
} else {
expr2
}
else if statement, syntax:
if (condition1) {
expr1
} else if (condition2) {
expr2
} else {
expr3
}
x <- 10
if (x < 10) {
print("x is < 10")
} else if (x > 10) {
print("x is > 10")
} else {
print("x is = 10")
}
[1] "x is = 10"
Class II Loops, date: 2018-10-20
Loops can come in handy on numerous occasions. While loops are like repeated if statements; the for loop is designed to iterate over all elements in a sequence. Learn all about them in this chapter.
While loop
while loop, syntax:
while (condition) {
expr
}
break statement
for example:
x <- 1
while (x < 10) {
  print(paste("x is set to :", x))
  if (x >= 4) {
    break
  }
  x <- x + 1
}
[1] "x is set to : 1"
[1] "x is set to : 2"
[1] "x is set to : 3"
[1] "x is set to : 4"
For loop
for loop, syntax:
for (var in seq) {
expr
}
for loop in vector, for loop in list, break statement, next statement
for example:
x <- c(rep(LETTERS, 1))
for (i in x) {
if (i == "G") {
break
} else if (i == "E") {
next
}
print(i)
}
[1] "A"
[1] "B"
[1] "C"
[1] "D"
[1] "F"
Class III Functions, date: 2018-10-20
Functions are an extremely important concept in almost every programming language; R is not different. After learning what a function is and how you can use one, you’ll take full control by writing your own functions.
Writing functions
syntax:
fun_name <- function(arg1, arg2) {
body
}
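For example, a minimal sketch (the function name add2 is illustrative, not from the course); the return value is the last evaluated expression:

```r
# add2 takes two arguments and returns their sum (the last expression in the body)
add2 <- function(a, b) {
  a + b
}

add2(2, 4)  # returns 6
```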
Class IV The apply family, date: 2018-10-20
The apply family is often more concise and efficient than for and while loops.
lapply() for list or vector
syntax:
lapply(list, fun, arg)
price <- list(2.37, 2.49, 2.18, 2.22, 2.47)
multiprice <- function(x, times) {
x * times
}
price_3 <- lapply(price, multiprice, times = 3)
print(as.vector(unlist(price_3)))
[1] 7.11 7.47 6.54 6.66 7.41
The output of lapply() is a list; we can then use as.vector(unlist(mylist)) to transform the list into a vector.
sapply() for lists or vectors
sapply() is the same as lapply(); the only difference is that sapply() simplifies the returned list to a vector or matrix when possible.
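A short sketch of the difference (the price list here is illustrative):

```r
price <- list(2.0, 3.0, 4.0)

# lapply() always returns a list
as_list <- lapply(price, function(x) x * 3)

# sapply() simplifies the result to a vector when possible
as_vec <- sapply(price, function(x) x * 3)

as_list  # a list holding 6, 9, 12
as_vec   # 6 9 12
```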
Second course: Introduction to the Tidyverse
Class I Data wrangling, date: 2018-10-21
1.The gapminder dataset
Before we can follow the introduction in this course, we need to install the gapminder and dplyr packages.
The gapminder package provides a dataset named gapminder.
2.The filter verb
filter() subset observations.
Use filter() find rows/cases where conditions are true. Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.
```r
gapminder %>% filter(year == 2007, country == "United States")
# Conditions joined by a comma are combined with &
gapminder %>% filter(year == 2007 | lifeExp == 34)
# If the relationship between conditions is OR, use | like this
```
3.The arrange verb
arrange() sorts a table based on a variable.
We use the pipe to connect the filter() and arrange().
```r
gapminder %>% arrange(lifeExp)
# Default is ascending; to sort the data in descending order, use desc(lifeExp)
```
4.The mutate verb
mutate() changes or add variables.
mutate() adds new variables and preserves existing; transmute() drops existing variables and keeps only the variables you create.
We can use mutate() to replace existing variables or to generate a new variable.
```r
gapminder %>% mutate(pop = pop / 10000, gdp = gdpPercap * 10000)
# pop will be replaced and gdp is a new variable; the result is returned as a new copy, which you can assign back to gapminder or to a new object
```
mutate_all() and transmute_all()
apply the function(s) to all the columns
```r
mtcars %>% mutate_all(., function(x) { x / 2 })
```
mutate_at(), summarise_at() and transmute_at()
apply the function(s) to the columns selected
```r
mtcars %>% mutate_at(., c("mpg", "cyl"), function(x) { x / 2 })
mtcars %>% mutate_at(., vars(starts_with("X")), function(x) { x / 2 })
```
mutate_if(), summarise_if() and transmute_if()
apply the function(s) to the columns selected by some characteristic
```r
mtcars %>% mutate_if(., is.numeric, round)
```
Class II Data visualization, date: 2018-10-21
You have already been able to answer some questions about the data through dplyr, but you have engaged with them just as a table (such as one showing the life expectancy in the US each year). Often a better way to understand and present such data is as a graph. Here you’ll learn the essential skill of data visualization, using the ggplot2 package. Visualization and manipulation are often intertwined, so you will see how the dplyr and ggplot2 packages work closely together to create informative graphs.
1.Visualizing with ggplot2
2.Log scales
When x and y need a log transform, we can use scale_x_log10() and scale_y_log10()
3.Additional aesthetics
Additional aes including fill, group, colour, size can be used to customize our graphic.
4.Faceting
When we have to produce many similar plots with the same method, we can use facet_wrap() to do it automatically.
Class III Grouping and summarizing, date: 2018-10-22
So far you have been answering questions about individual country-year pairs, but we may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you’ll learn to use the group by and summarize verbs, which collapse large datasets into manageable summaries.
1.The summarize verb
summarize() collapses a data frame into a single row of summary statistics.
We often need such summaries in descriptive statistics, and the summarize() function in the dplyr package is very useful for this; combine it with ddply() to do the same for each separate id.
gapminder %>% summarize(leMean = mean(lifeExp), popMean = mean(pop))
leMean popMean
1 59.47444 29601212
2.The group_by verb
Another common action in descriptive statistics is the cross table (group-wise summaries). There are two methods: group_by() and ddply().
For example:
gapminder %>% group_by(year, continent) %>% summarize(leMean = mean(lifeExp), popMean = mean(pop))
# This action can also be done by ddply() in the plyr package
gapminder %>% ddply(., .(year, continent), summarize, leMean = mean(lifeExp), popMean = mean(pop))
year continent leMean popMean
1 1952 Africa 39.13550 4570010
2 1952 Americas 53.27984 13806098
3 1952 Asia 46.31439 42283556
4 1952 Europe 64.40850 13937362
5 1952 Oceania 69.25500 5343003
6 1957 Africa 41.26635 5093033
7 1957 Americas 55.96028 15478157
8 1957 Asia 49.31854 47356988
9 1957 Europe 66.70307 14596345
10 1957 Oceania 70.29500 5970988
11 1962 Africa 43.31944 5702247
12 1962 Americas 58.39876 17330810
13 1962 Asia 51.56322 51404763
14 1962 Europe 68.53923 15345172
15 1962 Oceania 71.08500 6641759
16 1967 Africa 45.33454 6447875
17 1967 Americas 60.41092 19229865
18 1967 Asia 54.66364 57747361
19 1967 Europe 69.73760 16039299
20 1967 Oceania 71.31000 7300207
21 1972 Africa 47.45094 7305376
22 1972 Americas 62.39492 21175368
23 1972 Asia 57.31927 65180977
24 1972 Europe 70.77503 16687835
25 1972 Oceania 71.91000 8053050
26 1977 Africa 49.58042 8328097
27 1977 Americas 64.39156 23122708
28 1977 Asia 59.61056 72257987
29 1977 Europe 71.93777 17238818
30 1977 Oceania 72.85500 8619500
31 1982 Africa 51.59287 9602857
32 1982 Americas 66.22884 25211637
33 1982 Asia 62.61794 79095018
34 1982 Europe 72.80640 17708897
35 1982 Oceania 74.29000 9197425
36 1987 Africa 53.34479 11054502
37 1987 Americas 68.09072 27310159
38 1987 Asia 64.85118 87006690
39 1987 Europe 73.64217 18103139
40 1987 Oceania 75.32000 9787208
41 1992 Africa 53.62958 12674645
42 1992 Americas 69.56836 29570964
43 1992 Asia 66.53721 94948248
44 1992 Europe 74.44010 18604760
45 1992 Oceania 76.94500 10459826
46 1997 Africa 53.59827 14304480
47 1997 Americas 71.15048 31876016
48 1997 Asia 68.02052 102523803
49 1997 Europe 75.50517 18964805
50 1997 Oceania 78.19000 11120715
51 2002 Africa 53.32523 16033152
52 2002 Americas 72.42204 33990910
53 2002 Asia 69.23388 109145521
54 2002 Europe 76.70060 19274129
55 2002 Oceania 79.74000 11727415
56 2007 Africa 54.80604 17875763
57 2007 Americas 73.60812 35954847
58 2007 Asia 70.72848 115513752
59 2007 Europe 77.64860 19536618
60 2007 Oceania 80.71950 12274974
Class IV Types of visualizations, date: 2018-10-22
You have learned to create scatter plots with ggplot2. In this chapter you’ll learn to create line plots, bar plots, histograms, and boxplots. You will see how each plot needs different kinds of data manipulation to prepare for it, and understand the different roles of each of these plot types in data analysis.
Types of visualizations include geom_line(), geom_bar(), geom_histogram(), and geom_boxplot()
Third course: Importing Data in R
Class I Importing data from flat files with utils, date: 2018-10-22
This course will introduce five methods to import different types of data into R, starting with the base package utils: flat files, Excel, databases (SQL), the web, and statistical software (SAS, SPSS, Stata)
1.Introduction & read.csv
read.csv("path/file.csv", stringsAsFactors = FALSE)
Usually, we can define the path of the file first, such as file.path("~", "dir", "file.csv")
2.txt file
read.table("path/file.txt", header = TRUE, sep = "/")
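A self-contained sketch of the round trip (the file contents here are made up for illustration): write a small csv to a temporary file, then read it back with read.csv().

```r
# write a tiny csv to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ann,3", "bob,5"), tmp)

# read it back; stringsAsFactors = FALSE keeps name as character
df <- read.csv(tmp, stringsAsFactors = FALSE)
str(df)  # 2 obs. of 2 variables
```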
Class II readr & data.table, date: 2018-10-22
Next to base R, there are also dedicated packages to easily and efficiently import flat file data. We'll talk about two such packages: readr and data.table. With them there is no need to set stringsAsFactors = FALSE.
1.Package readr
readr::read_csv() for csv files, readr::read_tsv() for tab-separated text files
2.Package data.table
fread("file.csv")
- Infer column types and separators
- It simply works
- Extremely fast
- Possible to specify numerous parameters
- Improved read.table()
- Fast, convenient, customizable
Class III Importing Excel data, date: 2018-10-22
1.Package readxl
For Excel files, we can use the function excel_sheets() to get all sheet names from the Excel file. Then, use the function read_excel() to import the file.
read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, skip = 2)
The col_types parameter accepts "text", "numeric", "date", and "blank".
2.Package gdata
The gdata package is an extension of utils.
Class IV Reproducible Excel work with XLConnect, date: 2018-10-22
Package XLConnect
- loadWorkbook()
- getSheets()
- readWorksheet()
- createSheet()
- writeWorksheet()
- saveWorkbook()
- renameSheet()
- removeSheet()
XLConnect is a bridge between Excel and R.
Class V Importing data from databases, date: 2018-10-22
1.Connecting to database
Main package: DBI
Different types of databases:
- MySQL: R package RMySQL
con <- dbConnect(RMySQL::MySQL(), dbname = "tweater", host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", port = 3306, user = "student", password = "datacamp")
- Oracle: R package ROracle
2.Import table data
Using dbListTables() to list the tables and dbReadTable() to import a table:
users <- dbReadTable(con, "users")
dbDisconnect(con)
3.SQL Queries from inside R
dbGetQuery()
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")
4.DBI internals
You have used dbGetQuery() multiple times now. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:
- Sending the specified query with dbSendQuery();
- Fetching the result of executing the query on the database with dbFetch();
- Clearing the result with dbClearResult().
Let us not use dbGetQuery() this time and implement the steps above. This is tedious to write, but it gives you the ability to fetch the query result in chunks rather than all at once. You can do this by specifying the n argument inside dbFetch().
Class VI Importing data from the web, date: 2018-10-22
1.HTTP
Download the dataset from the http/https.
# library(readr)
# url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
# pools <- read_csv(url_csv)
# head(pools)
But this method does not work for Excel files.
2.Downloading files
Using the download.file() function to download Excel or RData files from the web. There are two methods:
# url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
# Method 1: read directly from the URL (requires a Perl installation)
# url_data <- gdata::read.xls(url_xls)
# Method 2: download the file to local disk first
# download.file(url_xls, destfile = "local_latitude.xls")
# excel_readxl <- readxl::read_excel("local_latitude.xls")
3.Httr
Using the httr package to crawl data from the web.
Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files. httr provides a convenient function, GET() to execute this GET request. The result is a response object, that provides easy access to the status code, content-type and, of course, the actual content. You can extract the content from the request using the content() function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or an R object, such as a list. If you don’t tell content() how to retrieve the content through the as argument, it will try its best to figure out which type is most appropriate based on the content-type.
APIs & JSON
Package: jsonlite
Functions: fromJSON(), toJSON(), minify(), prettify()
Class VII Importing data from statistical software packages, date: 2018-10-22
Package: haven by Hadley Wickham and foreign by the R Core Team
haven:
- read_sas() for .sas7bdat files
- read_dta() or read_stata() for Stata files
- read_spss() or read_sav() for SPSS files
foreign:
- Cannot import .sas7bdat files, only SAS transport files (.xport); to read .sas7bdat you need the sas7bdat package
- For Stata 5 to 12: read.dta()
Fourth course: Cleaning Data in R
Class I Introduction and exploring raw data, date: 2018-10-22
This chapter will give you an overview of the process of data cleaning with R, then walk you through the basics of exploring raw data.
3 steps: exploring raw data, tidying data, and preparing data for analysis.
Class II Exploring raw data, date: 2018-10-22
1.Understanding the structure of data
class(), dim(), str(), or glimpse() in dplyr package.
summary()
2.Looking at your data
head(), tail()
3.Visualizing your data
hist(), plot()
Class III Tidying data, date: 2018-10-22
1.Introduction to tidy data
Principles of tidy data
2.Introduction to tidyr
Easily tidy data with spread() and gather() in tidyr package.
gather() is to gather the columns of wide data, syntax:
gather(data, key, value, …)
- key: bare name of new key column.
- value: bare name of new value column.
- …: bare names of columns to gather (or not).
spread() is to spread key-value pairs into columns.
separate() is to separate one variable into more.
separate(data, var_old, c("var1", "var2"), sep = "")
unite() is to unite more variables into one.
unite(data, var_new, var1, var2, sep = "")
3.Addressing common symptoms of messy data
1).Column headers are values, not variable names.
2).Variables are stored in both rows and columns.
3).Multiple values are stored in one column.
4).A single observational unit is stored in multiple tables.
5).Multiple types of observational units are stored in the same table.
Class IV Preparing data for analysis, date: 2018-10-22
1.Dates with the lubridate package.
2.String manipulation
Using the stringr package:
str_trim("  hello world ")
[1] "hello world"
str_pad("123", width = 6, side = "left", pad = "0")
[1] "000123"
str_detect(c("a", "b", "c"), "a")
[1] TRUE FALSE FALSE
str_replace(c("a", "b", "c"), "c", "d")
[1] "a" "b" "d"
Class V Missing and special values, date: 2018-10-22
Check for NAs
- is.na()
- any(is.na()) means are there any NAs
- sum(is.na())
- complete.cases() means find rows with no missing values
- na.omit() means remove rows with NAs
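These checks can be sketched on a small illustrative data frame:

```r
# a data frame with some missing values
df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))

any(is.na(df))      # TRUE: at least one NA somewhere
sum(is.na(df))      # 3: total count of NAs
complete.cases(df)  # FALSE FALSE TRUE: only row 3 is complete
na.omit(df)         # keeps only the complete rows
```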
Class VI Outliers and obvious errors, date: 2018-10-22
summary(), barplot(), hist(), boxplot()
Fifth course: Importing & Cleaning Data in R: Case Studies
Running exciting analyses on interesting datasets is the dream of every data scientist. But first, some importing and cleaning must be done. In this series of four case studies, you’ll revisit key concepts from our courses on importing and cleaning data in R.
Ticket Sales Data, date: 2018-10-23
1.Dealing with missing values
Load stringr
library(stringr)
Find columns of sales5 containing "dt": date_cols
date_cols <- str_detect(names(sales5), "dt")
library(lubridate)
Coerce date columns into Date objects
sales5[, date_cols] <- lapply(sales5[date_cols], ymd)
Create logical vectors indicating missing values (don’t change)
missing <- lapply(sales5[, date_cols], is.na)
Create a numerical vector containing the number of NA values in each date column
num_missing <- sapply(missing, sum)
Sixth course: Writing Functions in R
Functions are a fundamental building block of the R language. You’ve probably used dozens (or even hundreds) of functions written by others, but in order to take your R game to the next level, you’ll need to learn to write your own functions. This course will teach you the fundamentals of writing functions in R so that, among other things, you can make your code more readable, avoid coding errors, and automate repetitive tasks.
Class I Writing a function in R, date: 2018-10-23
1.Basic information of function in R:
- Three parts of a function
- Arguments
- Body
- Environment
- Return value is the last executed expression, or the first executed return() statement
- Functions can be treated like usual R objects
2.Data structures
Atomic vector of six types: logical, integer, double, character, complex, and raw
Lists, a.k.a recursive vectors
Atomic vectors are homogeneous, lists can be heterogeneous
Lists
Created with list()
Subset with [, [[ or $
[ extracts a sublist
[[ and $ extract elements, remove a level of hierarchy
For example:
l <- list(a = 1:5, c = list("ABC", 2, 4, "EFG"))
l
$a
[1] 1 2 3 4 5

$c
$c[[1]]
[1] "ABC"

$c[[2]]
[1] 2

$c[[3]]
[1] 4

$c[[4]]
[1] "EFG"

l["c"]  # [ extracts a sublist
$c
$c[[1]]
[1] "ABC"

$c[[2]]
[1] 2

$c[[3]]
[1] 4

$c[[4]]
[1] "EFG"

l[["c"]]  # [[ removes a level of hierarchy
[[1]]
[1] "ABC"

[[2]]
[1] 2

[[3]]
[1] 4

[[4]]
[1] "EFG"

l[["c"]][1]  # [ on the inner list again returns a sublist
[[1]]
[1] "ABC"

l$c  # $ is shorthand for [[ with the element's name
[[1]]
[1] "ABC"

[[2]]
[1] 2

[[3]]
[1] 4

[[4]]
[1] "EFG"
3.For loops
Looping over columns in a data frame.
df is a dataframe, we can do like this:
```r
for (i in 1 : ncol(df)) {
print(median(df[[i]])) # df[[i]] also can be replaced by df[, i]
}
# a safer method: using the seq_along() function
for (i in seq_along(df)) {
print(median(df[[i]]))
}
#Output the return
output <- vector(mode = "double", length = ncol(df))
for (i in seq_along(df)) {
output[i] <- median(df[[i]])
}
```
4.How can you write a good function?
1.Good name:
- Should generally be verbs
- Should be descriptive
2.Argument names:
- Should generally be nouns
- Use the very common short names when appropriate: x, y, z, df
3.Argument order:
- Data arguments come first
- Detail arguments should have sensible defaults
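A sketch of a function written along these lines (the name count_missing and its behavior are illustrative): a verb-like descriptive name, the data argument first, and a detail argument with a sensible default.

```r
# count NAs per column; df comes first, cols defaults to all columns
count_missing <- function(df, cols = names(df)) {
  sapply(df[cols], function(x) sum(is.na(x)))
}

df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))
count_missing(df)       # x: 1, y: 2
count_missing(df, "y")  # y: 2
```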
Class II Functional programming, date: 2018-10-23
You already know how to use a for loop. The goal of this chapter is to teach you how to use the map functions in the purrr package which remove the code that’s duplicated across multiple for loops. After completing this chapter you’ll be able to solve new iteration problems with greater ease (faster and with fewer bugs).
The map() functions are a more advanced alternative to the apply family of functions.
1.Introducing purrr
Package purrr, function map()
Advantages of the map functions in purrr:
- Handy shortcuts for specifying .f
- More consistent than sapply() and lapply(), which makes them better for programming
- Take much less time to solve iteration problems
All the map functions in purrr take a vector, .x, as the first argument, then return .f applied to each element of .x. The type of object that is returned is determined by function suffix (the part after _):
- map() returns a list or data frame
- map_lgl() returns a logical vector
- map_int() returns an integer vector
- map_dbl() returns a double vector
- map_chr() returns a character vector
2.Shortcuts
In R, a one-sided formula starts with a ~, followed by an R expression. In purrr’s map functions, the R expression can refer to an element of the .x argument using the . character.
map(cyl, function(df) lm(mpg ~ wt, data = df))
OR:
map(cyl, ~ lm(mpg ~ wt, data = .))
3.Dealing with failure
safely() is an adverb; it takes a verb and modifies it. That is, it takes a function as an argument and it returns a function as its output. The function that is returned is modified so it never throws an error (and never stops the rest of your computation!).
For example:
Create safe_readLines() by passing the function readLines (not a call to it) to safely()
safe_readLines <- safely(readLines)
Call readLines() directly (a bad connection throws an error)
readLines("url")
Call safe_readLines() (it returns a list with result and error elements instead)
safe_readLines("url")
4.Maps over multiple arguments
Mapping over many arguments
- map2() iterates over two arguments
- pmap() iterates over many arguments
- invoke_map() iterates over functions and arguments
pmap()
Compare the following two calls to pmap() (run them in the console and compare their output too!):
pmap(list(n, mu, sd), rnorm)
pmap(list(mu, n, sd), rnorm)
What's the difference? By default pmap() matches the elements of the list to the arguments in the function by position. In the first case, n is matched to the n argument of rnorm(), mu to the mean argument, and sd to the sd argument. In the second case, mu gets matched to the n argument of rnorm(), which is clearly not what we intended!
Instead, name the elements of the list so pmap() matches them to arguments by name:
pmap(list(mean = mu, n = n, sd = sd), rnorm)
Mapping over functions and their arguments
Sometimes it’s not the arguments to a function you want to iterate over, but a set of functions themselves. Imagine that instead of varying the parameters to rnorm() we want to simulate from different distributions, say, using rnorm(), runif(), and rexp(). How do we iterate over calling these functions?
In purrr, this is handled by the invoke_map() function. The first argument is a list of functions. In our example, something like:
f <- list("rnorm", "runif", "rexp")
The second argument specifies the arguments to the functions. In the simplest case, all the functions take the same argument, and we can specify it directly, relying on … to pass it to each function. In this case, call each function with the argument n = 5:
invoke_map(f, n = 5), for example:
```r
# Define list of functions
f <- list("rnorm", "runif", "rexp")
# Parameter list for rnorm()
rnorm_params <- list(mean = 10)
# Add a min element with value 0 and max element with value 5
runif_params <- list(min = 0, max = 5)
# Add a rate element with value 5
rexp_params <- list(rate = 5)
# Define params for each function
params <- list(
rnorm_params,
runif_params,
rexp_params
)
# Call invoke_map() on f supplying params as the second argument
invoke_map(f, params, n = 5)
```
5.Maps with side effects
walk() operates just like map() except it’s designed for functions that don’t return anything. You use walk() for functions with side effects like printing, plotting or saving.
```r
# Define list of functions
f <- list(Normal = "rnorm", Uniform = "runif", Exp = "rexp")
# Define params
params <- list(
Normal = list(mean = 10),
Uniform = list(min = 0, max = 5),
Exp = list(rate = 5)
)
# Assign the simulated samples to sims
sims <- invoke_map(f, params, n = 50)
# Use walk() to make a histogram of each element in sims
sims %>% walk(hist)
```
Walking over two or more arguments
Those histograms were pretty good, but they really needed better breaks for the bins on the x-axis. That means we need to vary two arguments to hist(): x and breaks. Remember map2()? That allowed us to iterate over two arguments. Guess what? There is a walk2(), too!
```r
# Replace "Sturges" with reasonable breaks for each sample
breaks_list <- list(
Normal = seq(6, 16, 0.5),
Uniform = seq(0, 5, 0.25),
Exp = seq(0, 1.5, 0.1)
)
# Use walk2() to make histograms with the right breaks
sims %>% walk2(breaks_list, hist)
```
Walking with many arguments: pwalk
Ugh! Nice breaks but those plots had UUUUGLY labels and titles. The x-axis labels are easy to fix if we don’t mind every plot having its x-axis labeled the same way. We can use the … argument to any of the map() or walk() functions to pass in further arguments to the function .f. In this case, we might decide we don’t want any labels on the x-axis, in which case we need to pass an empty string to the xlab argument of hist()
```r
library(purrr)
f <- list(Normal = "rnorm", Uniform = "runif", Exp = "rexp")
params <- list(
  Normal = list(mean = 10),
  Uniform = list(min = 0, max = 5),
  Exp = list(rate = 5)
)
# Simulate samples of size 100
sims <- invoke_map(f, params, n = 100)
# Compute nice_breaks (don't change this)
find_breaks <- function(x) {
  rng <- range(x, na.rm = TRUE)
  seq(rng[1], rng[2], length.out = 30)
}
nice_breaks <- map(sims, find_breaks)
# Create a vector nice_titles
nice_titles <- list("Normal(10, 1)", "Uniform(0, 5)", "Exp(5)")
# Use pwalk() instead of walk2()
pwalk(list(x = sims, breaks = nice_breaks, main = nice_titles), hist, xlab = "")
```
Class III Robust functions, date: 2018-10-23
1.An error is better than a surprise
Recall our both_na() function from Chapter 2, that finds the number of entries where vectors x and y both have missing values:
both_na <- function(x, y) {
sum(is.na(x) & is.na(y))
}
We had an example where the behavior was a little surprising:
- x <- c(NA, NA, NA)
- y <- c( 1, NA, NA, NA)
- both_na(x, y)
The function works and returns 3, but we certainly didn’t design this function with the idea that people could pass in different length arguments.
Using stopifnot() is a quick way to have your function stop, if a condition isn’t met. stopifnot() takes logical expressions as arguments and if any are FALSE an error will occur.
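A sketch of both_na() guarded with stopifnot() (the example vectors are illustrative):

```r
both_na <- function(x, y) {
  # stop with an error unless the lengths match
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}

both_na(c(NA, 1, NA), c(NA, NA, 2))  # 1
# both_na(c(NA, NA, NA), c(1, NA, NA, NA)) now throws an error instead of recycling
```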
2.An informative error is even better
Using stop() instead of stopifnot() allows you to specify a more informative error message. Recall the general pattern for using stop() is:
if (condition) {
stop("Error", call. = FALSE)
}
Writing good error messages is an important part of writing a good function! We recommend your error tells the user what should be true, not what is false. For example, here a good error would be “x and y must have the same length”, rather than the bad error “x and y don’t have the same length”.
Let’s use this pattern to write a better check for the length of x and y.
# Define troublesome x and y
x <- c(NA, NA, NA)
y <- c(1, NA, NA, NA)
both_na <- function(x, y) {
# Replace condition with logical
if (length(x) != length(y)) {
# Replace "Error" with better message
stop("x and y must have the same length", call. = FALSE)
}
sum(is.na(x) & is.na(y))
}
# Call both_na()
# both_na(x, y)
Seventh course: Data Manipulation in R with dplyr
In this interactive tutorial, you will learn how to perform sophisticated dplyr techniques to carry out your data manipulation with R. First you will master the five verbs of R data manipulation with dplyr: select, mutate, filter, arrange and summarise. Next, you will learn how you can chain your dplyr operations using the pipe operator of the magrittr package. In the final section, the focus is on practicing how to subset your data using the group_by function, and how you can access data stored outside of R in a database. All said and done, you will be familiar with data manipulation tools and techniques that will allow you to efficiently manipulate data.
Class I Introduction to dplyr and tbls, date: 2018-10-27
As Garrett explained, a tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using as_tibble() or tbl_df().
The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame.
R will return the values of the lookup table that correspond to the names in the character string. To see how this works, run following code in the console:
two <- c("AA", "AS")
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue")
two <- lut[two]
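The result of indexing the lookup table keeps the lookup codes as names; a minimal sketch showing the lookup and unname() to drop them:

```r
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue")
two <- c("AA", "AS")

lut[two]          # a named vector: AA = "American", AS = "Alaska"
unname(lut[two])  # "American" "Alaska"
```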
Class II The five verbs and select in more detail, date: 2018-10-27
The dplyr package contains five key data manipulation functions, also called verbs:
- select(), which returns a subset of the columns,
- filter(), that is able to return a subset of the rows,
- arrange(), that reorders the rows according to single or multiple variables,
- mutate(), used to add columns from existing data,
- summarize(), which reduces each group to a single row by calculating aggregate measures.
Syntax:
1.select()
select(df, var1, var2) OR select(df, 1:4, -2)
dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:
- starts_with("X"): every name that starts with "X",
- ends_with("X"): every name that ends with "X",
- contains("X"): every name that contains "X",
- matches("X"): every name that matches "X", where "X" can be a regular expression,
- num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5,
- one_of(x): every name that appears in x, which should be a character vector.
2.mutate()
mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.
Take this example that adds a new column, z, which is the element-wise sum of the columns x and y, to the data frame df:
mutate(df, z = x + y)
3.filter()
R comes with a set of logical operators that you can use inside filter():
- x < y, TRUE if x is less than y
- x <= y, TRUE if x is less than or equal to y
- x == y, TRUE if x equals y
- x != y, TRUE if x does not equal y
- x >= y, TRUE if x is greater than or equal to y
- x > y, TRUE if x is greater than y
- x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)
filter(df, a > 0)
Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following two calls are completely equivalent:
filter(df, a > 0 & b > 0) is equivalent to filter(df, a > 0, b > 0)
4.arrange()
arrange() can be used to rearrange rows according to any type of data.
5.summarize()
summarize(), the last of the 5 verbs, follows the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column in the case of mutate().
In contrast to the four other data manipulation functions, summarize() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.
You can use any function you like in summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:
- min(x) - minimum value of vector x.
- max(x) - maximum value of vector x.
- mean(x) - mean value of vector x.
- median(x) - median value of vector x.
- quantile(x, p) - pth quantile of vector x.
- sd(x) - standard deviation of vector x.
- var(x) - variance of vector x.
- IQR(x) - Inter Quartile Range (IQR) of vector x.
- diff(range(x)) - total range of vector x.
dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:
- first(x) - The first element of vector x.
- last(x) - The last element of vector x.
- nth(x, n) - The nth element of vector x.
- n() - The number of rows in the data.frame or group of observations that summarize() describes.
- n_distinct(x) - The number of unique values in vector x
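A minimal sketch combining a few of these aggregating functions in one summarize() call (made-up data; assumes dplyr is installed):

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3, 4, 10))

stats <- summarize(df,
                   min_x  = min(x),
                   mean_x = mean(x),
                   n_obs  = n(),
                   range  = diff(range(x)))
stats  # one row of summarizing statistics
```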
6.Chaining your functions: the pipe operator
As another example of the %>%, have a look at the following two commands that are completely equivalent:
mean(c(1, 2, 3, NA), na.rm = TRUE)
c(1, 2, 3, NA) %>% mean(na.rm = TRUE)
The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving the Dagwood sandwich problem.
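A longer chain makes the point clearer: the nested call reads inside-out, the piped call reads left-to-right.

```r
library(dplyr)  # provides %>% (re-exported from magrittr)

# Nested, inside-out:
nested <- round(sqrt(mean(c(1, 4, 9, NA), na.rm = TRUE)), 1)

# Piped, left-to-right: each %>% feeds the result in as the first argument
piped <- c(1, 4, 9, NA) %>%
  mean(na.rm = TRUE) %>%
  sqrt() %>%
  round(1)

identical(nested, piped)  # TRUE
```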
Class III Group_by and working with databases, date: 2018-10-27
1.Get group-wise insights: group_by
As Garrett explained, group_by() lets you define groups within your data set. Its influence becomes clear when calling summarize() on a grouped dataset: summarizing statistics are calculated for the different groups separately.
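A minimal sketch with a made-up flights table (assumes dplyr is installed): summarize() now produces one row per group.

```r
library(dplyr)

df <- data.frame(carrier = c("AA", "AA", "B6"),
                 delay   = c(10, 20, 5))

by_carrier <- df %>%
  group_by(carrier) %>%
  summarize(mean_delay = mean(delay),
            n_flights  = n())
by_carrier  # one row per carrier
```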
2.dplyr and databases
hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:
library(data.table)
hflights2 <- as.data.table(hflights)
hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.
Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.
Eighth course: Joining Data in R with dplyr
Class I Mutating joins, date: 2018-10-27
There are four join functions in dplyr:
- left_join() is the basic join function in dplyr
- right_join()
- inner_join()
- full_join()
syntax:
left_join(df1, df2, by = "key") OR
left_join(df1, df2, by = c("key1", "key2"))
To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x.a to y.b.
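A minimal sketch with made-up tables, including the named-vector case where the key columns have different names:

```r
library(dplyr)

flights  <- data.frame(carrier = c("AA", "B6", "XX"), delay = c(10, 5, 7))
airlines <- data.frame(code = c("AA", "B6"), name = c("American", "JetBlue"))

# Keys have different names, so use a named vector;
# the unmatched "XX" row is kept, with name set to NA
joined <- left_join(flights, airlines, by = c("carrier" = "code"))
joined
```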
Class II Filtering joins and set operations, date: 2018-10-27
Filtering joins and set operations combine information from datasets without adding new variables. Filtering joins filter the observations of one dataset based on whether or not they occur in a second dataset. Set operations use combinations of observations from both datasets to create a new dataset.
1.Semi-joins
semi_join() is an advanced filtering method.
As you saw in the video, semi-joins provide a concise way to filter data from the first dataset based on information in a second dataset.
2.Anti-joins
anti_joins provide a concise way to filter data from the first dataset based on information NOT in a second dataset.
For example, you can use an anti-join to see which rows will not be matched to a second dataset by a join.
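A minimal sketch contrasting the two filtering joins (made-up tables; assumes dplyr is installed):

```r
library(dplyr)

flights  <- data.frame(carrier = c("AA", "B6", "XX"), delay = c(10, 5, 7))
airlines <- data.frame(carrier = c("AA", "B6"))

matched   <- semi_join(flights, airlines, by = "carrier")  # rows that WOULD match
unmatched <- anti_join(flights, airlines, by = "carrier")  # rows that would NOT match
matched
unmatched
```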
3.Set operations
Set operations combine whole rows, treating the two datasets as sets of observations:
- union(df1, df2) binds the rows of df1 and df2, dropping duplicates
- intersect(df1, df2) keeps only the observations that appear in both df1 and df2
- setdiff(df1, df2) keeps the observations of df1 that do not appear in df2
4.Comparing datasets
setequal(df1, df2) tests whether the two datasets contain the same rows (regardless of order), returning TRUE if they do; identical(df1, df2) is an EXACT comparison, including row order.
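A minimal sketch of the three set operations on made-up one-column data frames (dplyr's data-frame versions mask the base functions once it is loaded):

```r
library(dplyr)

df1 <- data.frame(x = c(1, 2, 3))
df2 <- data.frame(x = c(2, 3, 4))

union(df1, df2)      # all distinct rows from both: 1, 2, 3, 4
intersect(df1, df2)  # rows present in both: 2, 3
setdiff(df1, df2)    # rows in df1 but not in df2: 1
```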
Class III Advanced joining, date: 2018-10-27
Joining multiple tables
Chain several joins with the %>% pipe operator, or use the reduce() function from the purrr package.
For example:
tables <- list(df1, df2, df3)
reduce(tables, left_join, by = "key")
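A runnable sketch with three small made-up tables (assumes dplyr and purrr are installed):

```r
library(dplyr)
library(purrr)

df1 <- data.frame(key = c("a", "b"), v1 = 1:2)
df2 <- data.frame(key = c("a", "b"), v2 = 3:4)
df3 <- data.frame(key = c("a", "b"), v3 = 5:6)

tables <- list(df1, df2, df3)

# reduce() folds left_join over the list:
# left_join(left_join(df1, df2), df3)
combined <- reduce(tables, left_join, by = "key")
combined
```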
Ninth course: Intro to SQL for Data Science
The role of a data scientist is to turn raw data into actionable insights. Much of the world’s raw data—from electronic medical records to customer transaction histories—lives in organized collections of tables called relational databases. Therefore, to be an effective data scientist, you must know how to wrangle and extract data from these databases using a language called SQL (pronounced ess-que-ell, or sequel). This course teaches you everything you need to know to begin working with databases today!
Class I Selecting columns, date: 2018-11-05
In SQL, you can select data from a table using a SELECT statement. For example, the following query selects the name column from the people table:
SELECT name FROM people;
It’s also good practice (but not necessary for the exercises in this course) to include a semicolon at the end of your query.
This tells SQL where the end of your query is!
SELECT COUNT(birthdate)
FROM people;
SELECT COUNT(DISTINCT birthdate)
FROM people;
Count the number of non-missing values:
SELECT COUNT(birthdate)
FROM people
WHERE birthdate IS NOT NULL;
Class II Filtering rows, date: 2018-11-05
Filtering results
Congrats on finishing the first chapter! You now know how to select columns and perform basic counts. This chapter will focus on filtering your results.
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:
- = equal
- <> not equal
- < less than
- > greater than
- <= less than or equal to
- >= greater than or equal to
For example, you can filter text records such as title. The following code returns all films with the title ‘Metropolis’:
SELECT title
FROM films
WHERE title = 'Metropolis';
Notice that the WHERE clause always comes after the FROM statement!
Note that in this course we will use <> and not != for the not equal operator, as per the SQL standard.
1.NULL and IS NULL
Now that you know what NULL is and what it’s used for, it’s time for some practice!
SELECT title FROM films WHERE budget IS NULL;
2.LIKE and NOT LIKE
As you’ve seen, the WHERE clause can be used to filter text data. However, so far you’ve only been able to filter by specifying the exact text you’re interested in. In the real world, often you’ll want to search for a pattern rather than a specific text string.
In SQL, the LIKE operator can be used in a WHERE clause to search for a pattern in a column. To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:
The % wildcard will match zero, one, or many characters in text. For example, the following query matches companies like 'Data', 'DataC', 'DataCamp', 'DataMind', and so on:
SELECT name
FROM companies
WHERE name LIKE 'Data%';
3.Sorting multiple columns
ORDER BY can also be used to sort on multiple columns. It will sort by the first column specified, then sort by the next, then the next, and so on. For example (assuming the films table from the course, with release_year and title columns):
SELECT title
FROM films
ORDER BY release_year DESC, title;
Tenth course: Data Visualization with ggplot2
This ggplot2 tutorial builds on your knowledge from the first course to produce meaningful explanatory plots. We’ll explore the last four optional layers. Statistics will be calculated on the fly and we’ll see how Coordinates and Facets aid in communication. Publication quality plots will be produced directly in R using the Themes layer. We’ll also discuss details on data visualization best practices with ggplot2 to help make sure you have a sound understanding of what works and why. By the end of the course, you’ll have all the tools needed to make a custom plotting function to explore a large data set, combining statistics and excellent visuals.
- Statistics
- Coordinates
- Facets
- Themes
Class I Statistics, date: 2018-11-06
Two categories of functions in the statistics layer:
- Called from within a geom, such as geom_smooth()
- Called independently, such as stat_smooth()
geom_smooth() <-> stat_smooth()
You can use either stat_smooth() or geom_smooth() to apply a linear model.
stat_quantile() fits a quantile regression.
stat_sum() calculates the total number of overlapping observations and is another good way to deal with overplotting.
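A minimal sketch of stat_sum() on mtcars, where many points share the same (cyl, am) position:

```r
library(ggplot2)

# Point size encodes how many observations overlap at each position
p_sum <- ggplot(mtcars, aes(x = cyl, y = am)) +
  stat_sum()
p_sum
```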
```r
library(ggplot2)
library(RColorBrewer)

# Three brewer colors for the cyl groups, plus black for the "All" smooth
myColors <- c(brewer.pal(3, "Dark2"), "black")

ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  # Linear model per cylinder group (span applies to loess, not lm)
  stat_smooth(method = "lm", se = FALSE) +
  # One LOESS smooth over all points (group = 1)
  stat_smooth(method = "loess",
              aes(group = 1, col = "All"),
              se = FALSE, span = 0.7) +
  # Map the manual colors and give the legend a title
  scale_color_manual("Cylinders", values = myColors)
```
Class II Coordinates and Facets, date: 2018-11-06
1.Zooming in
You saw different ways of using the coordinates layer to zoom in.
scale_x_continuous(limits = c(3, 6), expand = c(0, 0)) <-> xlim()
scale_y_continuous(limits = c(3, 6), expand = c(0, 0)) <-> ylim()
Note that these are not strictly equivalent to:
coord_cartesian(xlim = c(3, 6)) OR coord_cartesian(ylim = c(3, 6))
Setting scale limits removes the observations outside the range before any statistics are computed, while coord_cartesian() keeps all the data and only zooms the view.
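A minimal sketch on mtcars showing both approaches; with scale limits the smooth is refit on the subset, with coord_cartesian() it is fit on all the data:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Scale limits: observations outside [3, 6] are dropped before fitting
p_scale <- p + scale_x_continuous(limits = c(3, 6), expand = c(0, 0))

# Coordinate limits: the full data is kept; only the view is zoomed
p_coord <- p + coord_cartesian(xlim = c(3, 6))
```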
2.Aspect Ratio
We can set the aspect ratio of a plot with coord_fixed() or coord_equal(). Both use ratio = 1 as a default. A 1:1 aspect ratio is most appropriate when two continuous variables are on the same scale
3.Pie Charts
The coord_polar() function converts a planar x-y Cartesian plot to polar coordinates. This can be useful if you are producing pie charts.
4.Facets: the basics
The most straightforward way of using facets is facet_grid(). Here we just need to specify the categorical variable to use on rows and columns using standard R formula notation (rows ~ columns).
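A minimal sketch of the formula notation on mtcars, with am on the rows and cyl on the columns:

```r
library(ggplot2)

# rows ~ columns: one panel per combination of am and cyl
p_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl)
p_facet
```

Use a dot to facet along one dimension only, e.g. facet_grid(. ~ cyl) for columns only.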
5.Themes
plot.background is used to define the background of the full plot, apart from the panel (figure) background.
panel.grid is used to define the grid of the panel.
axis.line is used to define the vertical and horizontal axis lines.
axis.title is used to define the axis titles.
strip.background is used to define the facet labels when using facet_grid() or facet_wrap().
We can use the theme_set() function to set the theme globally, and we can reset it with theme_set(original); the default is theme_grey().
```r
library(ggplot2)
library(gridExtra)

p <- ggplot(mtcars, aes(x = disp, y = hp)) +
  geom_point() +
  geom_line() +
  facet_grid(. ~ gear)

p1 <- p + theme(plot.background = element_rect(fill = "green"))
p2 <- p + theme(plot.background = element_rect(fill = "green", color = "black", size = 3),
                rect = element_blank())
p3 <- p2 + theme(panel.grid = element_blank(),
                 axis.line = element_line(color = "red"),
                 axis.ticks = element_line(color = "red"),
                 strip.background = element_blank())
p4 <- p3 + theme(axis.title = element_text(color = "red", hjust = 0, face = "italic", size = 16))

grid.arrange(p, p1, p2, p3, p4, nrow = 5)
```
6.Heat map
In the video you saw reasons for not using heat maps. Nonetheless, you may encounter a case in which you really do want to use one. Luckily, they’re fairly straightforward to produce in ggplot2.
We begin by specifying two categorical variables for the x and y aesthetics. At the intersection of each category we’ll draw a box, except here we call it a tile, using the geom_tile() layer. Then we will fill each tile with a continuous variable.
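A minimal sketch on mtcars (aggregation is hypothetical; assumes dplyr is also installed): the mean mpg at each cyl/gear intersection fills the tiles.

```r
library(ggplot2)
library(dplyr)

# One row per (cyl, gear) combination, with the continuous fill value
heat <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")

p_heat <- ggplot(heat, aes(x = factor(cyl), y = factor(gear), fill = mean_mpg)) +
  geom_tile()
p_heat
```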
Eleventh course: Working with Dates and Times in R
Dates and times are abundant in data and essential for answering questions that start with when, how long, or how often. However, they can be tricky, as they come in a variety of formats and can behave in unintuitive ways. This course teaches you the essentials of parsing, manipulating, and computing with dates and times in R. By the end, you’ll have mastered the lubridate package, a member of the tidyverse, specifically designed to handle dates and times.
Class I Dates and Times in R, date: 2018-11-26
Getting datetimes into R
Just like dates without times, if you want R to recognize a string as a datetime you need to convert it, although now you use as.POSIXct(). as.POSIXct() expects strings to be in the format YYYY-MM-DD HH:MM:SS.
The only tricky thing is that times will be interpreted in local time based on your machine’s set up. You can check your timezone with Sys.timezone(). If you want the time to be interpreted in a different timezone, you just set the tz argument of as.POSIXct(). You’ll learn more about time zones in Chapter 4.
In this exercise you’ll input a couple of datetimes by hand and then see that read_csv() also handles datetimes automatically in a lot of cases.
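A minimal sketch in base R showing how the tz argument changes the interpretation of the same string:

```r
# Parse a datetime string; tz controls how the clock time is interpreted
x <- as.POSIXct("2018-11-26 09:30:00", tz = "UTC")

# The same clock time in a different zone is a different instant:
# 09:30 in New York (UTC-5 in November) is 14:30 UTC
y <- as.POSIXct("2018-11-26 09:30:00", tz = "America/New_York")

difftime(y, x, units = "hours")  # 5 hours
```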
Functions in lubridate package.
ymd()
ydm()
mdy()
myd()
dmy()
dym()
parse_date_time(x = , orders = )
e.g. parse_date_time(x = "2010-10-10", orders = "ymd")
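A minimal sketch of the parsers on a few formats (assumes lubridate is installed):

```r
library(lubridate)

ymd("2010-10-10")                  # "2010-10-10"
dmy("10/10/2010")                  # "2010-10-10"
mdy("October 10, 2010")            # "2010-10-10"

# parse_date_time() handles formats the shortcut parsers cannot
parse_date_time("2010-10-10 09:30", orders = "ymd HM")
```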