Study note of Data Scientist with R in DataCamp

Chi Shen

2019-05-21

This is a study note from DataCamp, written as an R Markdown document. For more details on learning at DataCamp, see https://www.datacamp.com/tracks/data-scientist-with-r

CAREER TRACK: Data Scientist with R

A Data Scientist combines statistical and machine learning techniques with R programming to analyze and interpret complex data.


First course: Intermediate R

Class I Conditionals and Control Flow, date: 2018-10-20

1.Relational Operators

The basic forms of comparison are equality and inequality:
Equality ==
Inequality !=
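
The code chunk behind the output below is hidden in the rendered note; a set of comparisons like the following (a reconstruction, not the original exercise) produces the same pattern of results:

```r
3 == (2 + 1)             # TRUE: numeric equality
"hello" == "hi"          # FALSE: different strings
TRUE != FALSE            # TRUE: the logicals differ
"Rchitect" != "rchitect" # TRUE: comparison is case-sensitive
-6 * 5 != 30             # TRUE: -30 is not 30
```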

[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE
[1] TRUE

It is useful to compare a vector with a number, such as:
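
The vector itself is not shown; assuming it holds 1 through 5 and is compared against 2, a chunk like this reproduces the output:

```r
x <- c(1, 2, 3, 4, 5)
x      # print the vector
x > 2  # element-wise comparison with a single number
```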

[1] 1 2 3 4 5
[1] FALSE FALSE  TRUE  TRUE  TRUE

Compare matrices: R’s ability to deal with different data structures for comparisons does not stop at vectors. Matrices and relational operators also work together seamlessly!
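
A hedged reconstruction of the hidden chunk: in the result below only the value 13 is TRUE, so the comparison was presumably == 13.

```r
m <- matrix(c(15, 25, 13, 17, 15, 12, 16, 18, 17, 22), nrow = 2)
m        # print the matrix
m == 13  # element-wise comparison works on matrices too
```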

     [,1] [,2] [,3] [,4] [,5]
[1,]   15   13   15   16   17
[2,]   25   17   12   18   22
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] FALSE  TRUE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE

2.Logical Operators

And operator &
Or operator |
Not operator !

TRUE & TRUE is TRUE; TRUE & FALSE is FALSE;
TRUE | FALSE is TRUE; FALSE | FALSE is FALSE;
!TRUE is FALSE

3.Conditional Statements

if statement, syntax:
if (condition) {
expr
}

else statement, syntax:
if (condition) {
expr1
} else {
expr2
}

else if statement, syntax:
if (condition1) {
expr1
} else if (condition2) {
expr2
} else {
expr3
}
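
The chunk producing the output below is hidden; an if/else chain like this reproduces it (x = 10 is an assumption):

```r
x <- 10
if (x < 10) {
  print("x is < 10")
} else if (x > 10) {
  print("x is > 10")
} else {
  print("x is = 10")
}
```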

[1] "x is = 10"

Class II Loops, date: 2018-10-20

Loops can come in handy on numerous occasions. While loops are like repeated if statements; the for loop is designed to iterate over all elements in a sequence. Learn all about them in this chapter.

While loop

while loop, syntax:
while (condition) {
expr
}

break statement
for example:
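
A hedged reconstruction: a while loop that breaks once x reaches 4 reproduces the output below.

```r
x <- 1
while (x < 10) {
  print(paste("x is set to :", x))
  if (x >= 4) {
    break  # leave the loop once x reaches 4
  }
  x <- x + 1
}
```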

[1] "x is set to : 1"
[1] "x is set to : 2"
[1] "x is set to : 3"
[1] "x is set to : 4"

For loop

for loop, syntax:
for (var in seq) {
expr
}

for loop in vector, for loop in list, break statement, next statement
for example:
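
The looped-over object is not shown; a character vector of grades is one assumption that reproduces the output:

```r
grades <- c("A", "B", "C", "D", "F")
for (g in grades) {
  print(g)
}
```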

[1] "A"
[1] "B"
[1] "C"
[1] "D"
[1] "F"

Class III Functions, date: 2018-10-20

Functions are an extremely important concept in almost every programming language; R is no different. After learning what a function is and how you can use one, you’ll take full control by writing your own functions.

Writing functions

syntax:
fun_name <- function(arg1, arg2) {
body
}

For example:
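
The original example is hidden; a minimal function along these lines returns the 6 shown below:

```r
my_sum <- function(arg1, arg2) {
  arg1 + arg2  # the last expression is the return value
}
my_sum(2, 4)
```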

[1] 6

Class IV The apply family, date: 2018-10-20

The apply family is much more efficient than for and while loops.

lapply() for list or vector

syntax:
lapply(list, fun, arg)
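
A hedged reconstruction with made-up values: rounding each element of a list with lapply() and flattening the result yields output of this shape:

```r
values <- list(7.113, 7.468, 6.537, 6.662, 7.409)  # hypothetical data
rounded <- lapply(values, round, digits = 2)       # lapply() returns a list
unlist(rounded)                                    # simplify to a vector
```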

[1] 7.11 7.47 6.54 6.66 7.41

The output of lapply() is a list, and we can then use as.vector(unlist(mylist[1])) to transform a list element into a vector or data frame.

sapply() for lists

sapply() is the same as lapply(); the only difference is that sapply() simplifies the returned list to a vector or matrix when possible.

Second course: Introduction to the Tidyverse

Class I Data wrangling, date: 2018-10-21

1.The gapminder dataset

Before we can follow along with this course, we need to install the gapminder and dplyr packages.
The gapminder package provides a dataset named gapminder.

2.The filter verb

filter() subsets observations.
Use filter() to find rows/cases where conditions are true. Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.

```r
gapminder %>% filter(year == 2007, country == "United States")
# Conditions joined by a comma are combined with AND (the same as &)
gapminder %>% filter(year == 2007 | lifeExp == 34)
# Use | when the relationship between the conditions is OR
```

3.The arrange verb

arrange() sorts a table based on a variable.

We use the pipe to connect the filter() and arrange().

```r
gapminder %>% arrange(lifeExp)
# Default is ascending; to sort descending, use desc():
gapminder %>% arrange(desc(lifeExp))
```

4.The mutate verb

mutate() changes or adds variables.
mutate() adds new variables and preserves existing ones; transmute() drops existing variables and keeps only the variables you create.

We can use mutate() to replace existing variables or to create new ones.

```r
gapminder %>% mutate(pop = pop / 10000, gdp = gdpPercap * 10000)
# pop is overwritten and gdp is a new variable; the result is returned
# as a new data frame, so assign it to gapminder or a new object to keep it
```

mutate_all() and transmute_all()

apply the function(s) to all the columns

```r
mtcars %>% mutate_all(., function(x) { x / 2 })
```

mutate_at(), summarise_at() and transmute_at()

apply the function(s) to the columns selected

```r
mtcars %>% mutate_at(., c("mpg", "cyl"), function(x) { x / 2 })
mtcars %>% mutate_at(., vars(starts_with("X")), function(x) { x / 2 })
```

mutate_if(), summarise_if() and transmute_if()

apply the function(s) to the columns selected by some characteristic (a predicate function)

```r
mtcars %>% mutate_if(., is.numeric, round)
```

Class II Data visualization, date: 2018-10-21

You have already been able to answer some questions about the data through dplyr, but you have engaged with them just as a table (such as one showing the life expectancy in the US each year). Often a better way to understand and present such data is as a graph. Here you’ll learn the essential skill of data visualization, using the ggplot2 package. Visualization and manipulation are often intertwined, so you will see how the dplyr and ggplot2 packages work closely together to create informative graphs.

2.Log scales

When x and y need to be log-transformed, we can use scale_x_log10() and scale_y_log10().
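
A minimal sketch with the gapminder data (gdpPercap spans several orders of magnitude, so a log10 x scale helps):

```r
library(ggplot2)
library(gapminder)

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
```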

4.Faceting

When we have to draw many plots of the same kind for different subsets, we can use facet_wrap() to do it automatically.
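
For example, the same scatter plot split into one panel per continent:

```r
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent)
```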

Class III Grouping and summarizing, date: 2018-10-22

So far you have been answering questions about individual country-year pairs, but we may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you’ll learn to use the group by and summarize verbs, which collapse large datasets into manageable summaries.

1.The summarize verb

summarize() summarises a data frame.
We often need summaries for descriptive statistics, and the summarize() function in the dplyr package is very useful for this.

Combine it with ddply() (from the plyr package) to do this for each separate id.
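
A hedged reconstruction of the chunk behind the output below, using the gapminder data:

```r
library(dplyr)
library(gapminder)

gapminder %>%
  summarize(leMean = mean(lifeExp), popMean = mean(pop))
```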

    leMean  popMean
1 59.47444 29601212

2.The group_by verb

Another common task in descriptive statistics is a cross-table. There are two methods: group_by() and ddply().
For example:
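
A hedged reconstruction: the first block repeats the overall summary, the second groups by year and continent before summarizing:

```r
gapminder %>%
  summarize(leMean = mean(lifeExp), popMean = mean(pop))

gapminder %>%
  group_by(year, continent) %>%
  summarize(leMean = mean(lifeExp), popMean = mean(pop))
```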

    leMean  popMean
1 59.47444 29601212
   year continent   leMean   popMean
1  1952    Africa 39.13550   4570010
2  1952  Americas 53.27984  13806098
3  1952      Asia 46.31439  42283556
4  1952    Europe 64.40850  13937362
5  1952   Oceania 69.25500   5343003
6  1957    Africa 41.26635   5093033
7  1957  Americas 55.96028  15478157
8  1957      Asia 49.31854  47356988
9  1957    Europe 66.70307  14596345
10 1957   Oceania 70.29500   5970988
11 1962    Africa 43.31944   5702247
12 1962  Americas 58.39876  17330810
13 1962      Asia 51.56322  51404763
14 1962    Europe 68.53923  15345172
15 1962   Oceania 71.08500   6641759
16 1967    Africa 45.33454   6447875
17 1967  Americas 60.41092  19229865
18 1967      Asia 54.66364  57747361
19 1967    Europe 69.73760  16039299
20 1967   Oceania 71.31000   7300207
21 1972    Africa 47.45094   7305376
22 1972  Americas 62.39492  21175368
23 1972      Asia 57.31927  65180977
24 1972    Europe 70.77503  16687835
25 1972   Oceania 71.91000   8053050
26 1977    Africa 49.58042   8328097
27 1977  Americas 64.39156  23122708
28 1977      Asia 59.61056  72257987
29 1977    Europe 71.93777  17238818
30 1977   Oceania 72.85500   8619500
31 1982    Africa 51.59287   9602857
32 1982  Americas 66.22884  25211637
33 1982      Asia 62.61794  79095018
34 1982    Europe 72.80640  17708897
35 1982   Oceania 74.29000   9197425
36 1987    Africa 53.34479  11054502
37 1987  Americas 68.09072  27310159
38 1987      Asia 64.85118  87006690
39 1987    Europe 73.64217  18103139
40 1987   Oceania 75.32000   9787208
41 1992    Africa 53.62958  12674645
42 1992  Americas 69.56836  29570964
43 1992      Asia 66.53721  94948248
44 1992    Europe 74.44010  18604760
45 1992   Oceania 76.94500  10459826
46 1997    Africa 53.59827  14304480
47 1997  Americas 71.15048  31876016
48 1997      Asia 68.02052 102523803
49 1997    Europe 75.50517  18964805
50 1997   Oceania 78.19000  11120715
51 2002    Africa 53.32523  16033152
52 2002  Americas 72.42204  33990910
53 2002      Asia 69.23388 109145521
54 2002    Europe 76.70060  19274129
55 2002   Oceania 79.74000  11727415
56 2007    Africa 54.80604  17875763
57 2007  Americas 73.60812  35954847
58 2007      Asia 70.72848 115513752
59 2007    Europe 77.64860  19536618
60 2007   Oceania 80.71950  12274974

Class IV Types of visualizations, date: 2018-10-22

You have learned to create scatter plots with ggplot2. In this chapter you’ll learn to create line plots, bar plots, histograms, and boxplots. You will see how each plot needs different kinds of data manipulation to prepare for it, and understand the different roles of each of these plot types in data analysis.

Types of visualizations include geom_line(), geom_bar(), geom_histogram(), and geom_boxplot()

Third course: Importing Data in R

Class I Importing data from flat files with utils, date: 2018-10-22

This course introduces five ways to import different types of data into R, starting with the base package utils: flat files, Excel, databases (SQL), the web, and statistical software (SAS, SPSS, Stata).

1.Introduction & read.csv

read.csv("path/file.csv", stringsAsFactors = FALSE)

Usually, we define the file path first, e.g. file.path("~", "dir", "file.csv")

2.txt file

read.table("path/file.txt", header = TRUE, sep = "/")

Class II readr & data.table, date: 2018-10-22

Next to base R, there are also dedicated packages to easily and efficiently import flat file data. We’ll talk about two such packages: readr and data.table. With these, there is no need to set stringsAsFactors = FALSE.

1.Package readr

readr::read_csv() for csv files, readr::read_tsv() for tab-separated text files

2.Package data.table

fread("file.csv")

  • Infer column types and separators
  • It simply works
  • Extremely fast
  • Possible to specify numerous parameters
  • Improved read.table()
  • Fast, convenient, customizable

Class III Importing Excel data, date: 2018-10-22

2.Package readxl

For Excel, we can use the function excel_sheets() to get all sheet names from the Excel file. Then, use the function read_excel() to import a sheet.

read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, skip = 2)
The col_types parameter accepts "text", "numeric", "date", and "blank".

3.Package gdata

The gdata package is a extension of utils.

Class IV Reproducible Excel work with XLConnect, date: 2018-10-22

Package XLConnect

  • loadWorkbook()
  • getSheets()
  • readWorksheet()
  • createSheet()
  • writeWorksheet()
  • saveWorkbook()
  • renameSheet()
  • removeSheet()

A bridge between Excel and R.

Class V Importing data from databases, date: 2018-10-22

1.Connecting to database

Main package: DBI
Different types of databases:

  • MySQL: the RMySQL package
    con <- dbConnect(RMySQL::MySQL(), dbname = "tweater", host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", port = 3306, user = "student", password = "datacamp")
  • Oracle: the ROracle package

2.Import table data

Use dbListTables() to list the tables and dbReadTable() to import one; disconnect when done:
users <- dbReadTable(con, "users")
dbDisconnect(con)

3.SQL Queries from inside R

dbGetQuery():
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")

4.DBI internals

You have used dbGetQuery() multiple times now. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:

  • Sending the specified query with dbSendQuery();
  • Fetching the result of executing the query on the database with dbFetch();
  • Clearing the result with dbClearResult().

Let us not use dbGetQuery() this time and implement the steps above. This is tedious to write, but it gives you the ability to fetch the query result in chunks rather than all at once. You can do this by specifying the n argument inside dbFetch().
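
A sketch of those three steps, assuming the con connection and tweats table from earlier in this class:

```r
res <- dbSendQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")

# Fetch the result two rows at a time until the query is exhausted
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 2)
  print(chunk)
}

dbClearResult(res)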

Class VI Importing data from the web, date: 2018-10-22

3.Httr

Use the httr package to retrieve data from the web.

Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files. httr provides a convenient function, GET() to execute this GET request. The result is a response object, that provides easy access to the status code, content-type and, of course, the actual content. You can extract the content from the request using the content() function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or an R object, such as a list. If you don’t tell content() how to retrieve the content through the as argument, it will try its best to figure out which type is most appropriate based on the content-type.
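
A minimal sketch of GET() and content() (the URL is a placeholder):

```r
library(httr)

resp <- GET("http://www.example.com")
status_code(resp)           # e.g. 200
content(resp, as = "text")  # retrieve the body as a character vector
```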

Class VII Importing data from statistical software packages, date: 2018-10-22

Packages: haven by Hadley Wickham and foreign by the R Core Team

haven:

  • read_sas() for .sas7bdat files
  • read_dta() or read_stata() for Stata files
  • read_spss() or read_sav() for SPSS files

foreign:

  • Cannot import .sas7bdat files, only SAS libraries (.xport); if you need .sas7bdat, use the sas7bdat package
  • For Stata 5 to 12: read.dta()

Fourth course: Cleaning Data in R

Class I Introduction and exploring raw data, date: 2018-10-22

This chapter will give you an overview of the process of data cleaning with R, then walk you through the basics of exploring raw data.
3 steps: explore raw data, tidy the data, prepare the data for analysis.

Class II Exploring raw data, date: 2018-10-22

1.Understanding the structure of data

class(), dim(), str(), or glimpse() in dplyr package.
summary()

2.Looking at your data

head(), tail()

3.Visualizing your data

hist(), plot()

Class III Tidying data, date: 2018-10-22

1.Introduction to tidy data

Principles of tidy data

2.Introduction to tidyr

Easily tidy data with the spread() and gather() functions in the tidyr package.

gather() is to gather the columns of wide data, syntax:
gather(data, key, value, …)

  • key: bare name of new key column.
  • value: bare name of new value column.
  • …: bare names of columns to gather (or not).

spread() is to spread key-value pairs into columns.
separate() splits one variable into several.
separate(data, var_old, c("var1", "var2"), sep = "")

unite() is to unite more variables into one.
unite(data, var_new, var1, var2, sep = "")

3.Addressing common symptoms of messy data

1).Column headers are values, not variable names.
2).Variables are stored in both rows and columns.
3).Multiple values are stored in one column.
4).A single observational unit is stored in multiple tables.
5).Multiple types of observational units are stored in the same table.

Class IV Preparing data for analysis, date: 2018-10-22

1.Dates with the lubridate package.

2.String manipulation

Using stringr package
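
The chunk behind the output below is hidden; these stringr calls (a reconstruction, not the original exercise) produce the same results:

```r
library(stringr)

str_c("hello", "world", sep = " ")       # combine strings
str_pad("123", width = 6, pad = "0")     # pad to width 6 with zeros
str_detect(c("a", "b", "c"), "a")        # which elements match "a"?
str_replace(c("a", "b", "c"), "c", "d")  # replace "c" with "d"
```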

[1] "hello world"
[1] "000123"
[1]  TRUE FALSE FALSE
[1] "a" "b" "d"

Class V Missing and special values, date: 2018-10-22

Check for NAs

  • is.na()
  • any(is.na()) means are there any NAs
  • sum(is.na())
  • complete.cases() means find rows with no missing values
  • na.omit() means remove rows with NAs

Class VI Outliers and obvious errors, date: 2018-10-22

summary(), barplot(), hist(), boxplot()

Fifth course: Importing & Cleaning Data in R: Case Studies

Running exciting analyses on interesting datasets is the dream of every data scientist. But first, some importing and cleaning must be done. In this series of four case studies, you’ll revisit key concepts from our courses on importing and cleaning data in R.

Ticket Sales Data, date: 2018-10-23

1.Dealing with missing values

# Load stringr
library(stringr)

# Find columns of sales5 containing "dt": date_cols
date_cols <- str_detect(names(sales5), "dt")

# Load lubridate
library(lubridate)

# Coerce date columns into Date objects
sales5[, date_cols] <- lapply(sales5[, date_cols], ymd)

# Create logical vectors indicating missing values (don't change)
missing <- lapply(sales5[, date_cols], is.na)

# Create a numeric vector containing the number of NA values in each date column
num_missing <- sapply(missing, sum)

Sixth course: Writing Functions in R

Functions are a fundamental building block of the R language. You’ve probably used dozens (or even hundreds) of functions written by others, but in order to take your R game to the next level, you’ll need to learn to write your own functions. This course will teach you the fundamentals of writing functions in R so that, among other things, you can make your code more readable, avoid coding errors, and automate repetitive tasks.

Class I Writing a function in R, date: 2018-10-23

1.Basic information of function in R:

  • Three parts of a function
    • Arguments
    • Body
    • Environment
  • The return value is the last evaluated expression, or the first executed return() statement
  • Functions can be treated like usual R objects

2.Data structures

Atomic vectors come in six types: logical, integer, double, character, complex, and raw
Lists, a.k.a recursive vectors
Atomic vectors are homogeneous, lists can be heterogeneous

Lists
Created with list()
Subset with [, [[, or $
[ extracts a sublist
[[ and $ extract elements, remove a level of hierarchy

For example:
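
The chunk is hidden; assuming a list like the one below, the printed output that follows corresponds to printing the whole list and then subsetting it in the ways just described:

```r
l <- list(a = 1:5, c = list("ABC", 2, 4, "EFG"))
l          # print the whole list
l["c"]     # [ returns a sublist (still a named list)
l[["c"]]   # [[ removes a level of hierarchy
l$c[1]     # first element of c, kept as a sublist
l$c        # $ behaves like [[
```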

$a
[1] 1 2 3 4 5
$c
$c[[1]]
[1] "ABC"

$c[[2]]
[1] 2

$c[[3]]
[1] 4

$c[[4]]
[1] "EFG"
$c
$c[[1]]
[1] "ABC"

$c[[2]]
[1] 2

$c[[3]]
[1] 4

$c[[4]]
[1] "EFG"
[[1]]
[1] "ABC"

[[2]]
[1] 2

[[3]]
[1] 4

[[4]]
[1] "EFG"
[[1]]
[1] "ABC"
[[1]]
[1] "ABC"

[[2]]
[1] 2

[[3]]
[1] 4

[[4]]
[1] "EFG"

3.For loops

Looping over columns in a data frame.
If df is a data frame, we can do this:

```r
for (i in 1 : ncol(df)) {
    print(median(df[[i]])) # df[[i]] can also be replaced by df[, i]
}
# a safer alternative: loop over seq_along(df)
for (i in seq_along(df)) {
    print(median(df[[i]]))
}

# Collect the results in a pre-allocated output vector
output <- vector(mode = "double", length = ncol(df))
for (i in seq_along(df)) {
    output[i] <- median(df[[i]])
}
```

4.How can you write a good function?

1.Good name:

  • Should generally be verbs
  • Should be descriptive

2.Argument names:

  • Should generally be nouns
  • Use the very common short names when appropriate: x, y, z, df

3.Argument order:

  • Data arguments come first
  • Detail arguments should have sensible defaults

Class II Functional programming, date: 2018-10-23

You already know how to use a for loop. The goal of this chapter is to teach you how to use the map functions in the purrr package which remove the code that’s duplicated across multiple for loops. After completing this chapter you’ll be able to solve new iteration problems with greater ease (faster and with fewer bugs).

The map() functions are a more advanced alternative to the apply family.

1.Introducing purrr

Package purrr, function map()
Advantages of the map functions in purrr:

  • Handy shortcuts for specifying .f
  • More consistent than sapply() and lapply(), which makes them better for programming
  • They take much less time to solve iteration problems

All the map functions in purrr take a vector, .x, as the first argument, then return .f applied to each element of .x. The type of object that is returned is determined by function suffix (the part after _):

  • map() returns a list
  • map_lgl() returns a logical vector
  • map_int() returns an integer vector
  • map_dbl() returns a double vector
  • map_chr() returns a character vector
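
A minimal sketch of the difference in return types:

```r
library(purrr)

df <- data.frame(a = 1:5, b = 6:10)
map(df, mean)      # a list of means
map_dbl(df, mean)  # a named double vector
```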

2.Shortcuts

In R, a one-sided formula starts with a ~, followed by an R expression. In purrr’s map functions, the R expression can refer to an element of the .x argument using the . character.

map(cyl, function(df) lm(mpg ~ wt, data = df))

OR:

map(cyl, ~ lm(mpg ~ wt, data = .))
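
Here cyl is assumed to be a list of data frames; in the course it comes from splitting mtcars by the number of cylinders:

```r
library(purrr)

cyl <- split(mtcars, mtcars$cyl)
map(cyl, ~ lm(mpg ~ wt, data = .))
```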

3.Dealing with failure

safely() is an adverb; it takes a verb and modifies it. That is, it takes a function as an argument and it returns a function as its output. The function that is returned is modified so it never throws an error (and never stops the rest of your computation!).

For example:
Create safe_readLines() by passing readLines() to safely():
safe_readLines <- safely(readLines)

Call readLines(), which errors on a bad connection:
readLines("url")

Call safe_readLines(), which always returns a list with result and error components:
safe_readLines("url")

4.Maps over multiple arguments

Mapping over many arguments

  • map2() iterates over two arguments
  • pmap() iterates over many arguments
  • invoke_map() iterates over functions and arguments

pmap()
Compare the following two calls to pmap() (run them in the console and compare their output too!):

pmap(list(n, mu, sd), rnorm)
pmap(list(mu, n, sd), rnorm)
What’s the difference? By default pmap() matches the elements of the list to the arguments in the function by position. In the first case, n is matched to the n argument of rnorm(), mu to the mean argument of rnorm(), and sd to the sd argument of rnorm(). In the second case mu gets matched to the n argument of rnorm(), which is clearly not what we intended!

pmap(list(mean = mu, n = n, sd = sd), rnorm)
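
That named call only runs once n, mu, and sd exist; a runnable sketch with assumed parameter lists:

```r
library(purrr)

n  <- list(5, 10, 20)    # assumed sample sizes
mu <- list(1, 5, 10)     # assumed means
sd <- list(0.1, 1, 0.1)  # assumed standard deviations

# Naming the list elements removes any ambiguity about matching
pmap(list(mean = mu, n = n, sd = sd), rnorm)
```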

Mapping over functions and their arguments

Sometimes it’s not the arguments to a function you want to iterate over, but a set of functions themselves. Imagine that instead of varying the parameters to rnorm() we want to simulate from different distributions, say, using rnorm(), runif(), and rexp(). How do we iterate over calling these functions?

In purrr, this is handled by the invoke_map() function. The first argument is a list of functions. In our example, something like:

f <- list("rnorm", "runif", "rexp")

The second argument specifies the arguments to the functions. In the simplest case, all the functions take the same argument, and we can specify it directly, relying on … to pass it to each function. In this case, call each function with the argument n = 5:

invoke_map(f, n = 5), for example:

```r
# Define list of functions
f <- list("rnorm", "runif", "rexp")

# Parameter list for rnorm()
rnorm_params <- list(mean = 10)

# Add a min element with value 0 and max element with value 5
runif_params <- list(min = 0, max = 5)

# Add a rate element with value 5
rexp_params <- list(rate = 5)

# Define params for each function
params <- list(
  rnorm_params,
  runif_params,
  rexp_params
)

# Call invoke_map() on f supplying params as the second argument
invoke_map(f, params, n = 5)
```

5.Maps with side effects

walk() operates just like map() except it’s designed for functions that don’t return anything. You use walk() for functions with side effects like printing, plotting or saving.

```r
# Define list of functions
f <- list(Normal = "rnorm", Uniform = "runif", Exp = "rexp")

# Define params
params <- list(
  Normal = list(mean = 10),
  Uniform = list(min = 0, max = 5),
  Exp = list(rate = 5)
)

# Assign the simulated samples to sims
sims <- invoke_map(f, params, n = 50)

# Use walk() to make a histogram of each element in sims
sims %>% walk(hist)
```

Walking over two or more arguments

Those histograms were pretty good, but they really needed better breaks for the bins on the x-axis. That means we need to vary two arguments to hist(): x and breaks. Remember map2()? That allowed us to iterate over two arguments. Guess what? There is a walk2(), too!

```r
# Replace "Sturges" with reasonable breaks for each sample
breaks_list <- list(
  Normal = seq(6, 16, 0.5),
  Uniform = seq(0, 5, 0.25),
  Exp = seq(0, 1.5, 0.1)
)

# Use walk2() to make histograms with the right breaks
sims %>% walk2(breaks_list, hist)
```

Walking with many arguments: pwalk

Ugh! Nice breaks, but those plots had UUUUGLY labels and titles. The x-axis labels are easy to fix if we don’t mind every plot having its x-axis labeled the same way. We can use the … argument to any of the map() or walk() functions to pass further arguments to the function .f. In this case, we might decide we don’t want any labels on the x-axis, in which case we need to pass an empty string to the xlab argument of hist(), as in the sketch below.
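
A hedged sketch continuing the sims/breaks_list example, with panel titles supplied explicitly:

```r
pwalk(
  list(x = sims, breaks = breaks_list, main = c("Normal", "Uniform", "Exp")),
  hist,
  xlab = ""  # the same (empty) x-axis label for every plot
)
```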

Class III Robust functions, date: 2018-10-23

1.An error is better than a surprise

Recall our both_na() function from Chapter 2, that finds the number of entries where vectors x and y both have missing values:

both_na <- function(x, y) {
sum(is.na(x) & is.na(y))
}
We had an example where the behavior was a little surprising:

  • x <- c(NA, NA, NA)
  • y <- c( 1, NA, NA, NA)
  • both_na(x, y)

The function works and returns 3, but we certainly didn’t design this function with the idea that people could pass in different length arguments.

Using stopifnot() is a quick way to have your function stop, if a condition isn’t met. stopifnot() takes logical expressions as arguments and if any are FALSE an error will occur.
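
For example, a minimal check added to both_na() (a sketch of the course exercise):

```r
both_na <- function(x, y) {
  # Fail early if the vectors differ in length
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}
```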

2.An informative error is even better

Using stop() instead of stopifnot() allows you to specify a more informative error message. Recall the general pattern for using stop() is:

if (condition) {
stop("Error", call. = FALSE)
}

Writing good error messages is an important part of writing a good function! We recommend your error tells the user what should be true, not what is false. For example, here a good error would be “x and y must have the same length”, rather than the bad error “x and y don’t have the same length”.

Let’s use this pattern to write a better check for the length of x and y.
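
A sketch of that better check:

```r
both_na <- function(x, y) {
  if (length(x) != length(y)) {
    stop("x and y must have the same length", call. = FALSE)
  }
  sum(is.na(x) & is.na(y))
}
```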

Seventh course: Data Manipulation in R with dplyr

In this interactive tutorial, you will learn how to perform sophisticated dplyr techniques to carry out your data manipulation with R. First you will master the five verbs of R data manipulation with dplyr: select, mutate, filter, arrange and summarise. Next, you will learn how you can chain your dplyr operations using the pipe operator of the magrittr package. In the final section, the focus is on practicing how to subset your data using the group_by function, and how you can access data stored outside of R in a database. All said and done, you will be familiar with data manipulation tools and techniques that will allow you to efficiently manipulate data.

Class I Introduction to dplyr and tbls, date: 2018-10-27

As Garrett explained, a tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using as_tibble() or tbl_df().

The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame.

R will return the values of the lookup table that correspond to the names in the character string. To see how this works, run following code in the console:

two <- c("AA", "AS")
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue")
two <- lut[two]

Class II The five verbs and select in more detail, date: 2018-10-27

The dplyr package contains five key data manipulation functions, also called verbs:

  • select(), which returns a subset of the columns,
  • filter(), that is able to return a subset of the rows,
  • arrange(), that reorders the rows according to single or multiple variables,
  • mutate(), used to add columns from existing data,
  • summarize(), which reduces each group to a single row by calculating aggregate measures.

Syntax:

1.select()

select(df, var1, var2) OR select(df, 1:4, -2)

dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:

  • starts_with("X"): every name that starts with "X",
  • ends_with("X"): every name that ends with "X",
  • contains("X"): every name that contains "X",
  • matches("X"): every name that matches "X", where "X" can be a regular expression,
  • num_range("x", 1:5): the variables named x01, x02, x03, x04 and x05,
  • one_of(x): every name that appears in x, which should be a character vector.

2.mutate()

mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.

Take this example that adds a new column, z, which is the element-wise sum of the columns x and y, to the data frame df:
mutate(df, z = x + y)

3.filter()

R comes with a set of logical operators that you can use inside filter():

  • x < y, TRUE if x is less than y
  • x <= y, TRUE if x is less than or equal to y
  • x == y, TRUE if x equals y
  • x != y, TRUE if x does not equal y
  • x >= y, TRUE if x is greater than or equal to y
  • x > y, TRUE if x is greater than y
  • x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

filter(df, a > 0)

Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following two calls are completely equivalent:

filter(df, a > 0 & b > 0) is equivalent to filter(df, a > 0, b > 0)

4.arrange()

arrange() can be used to rearrange rows according to any type of data.

5.Summarize()

summarize(), the last of the 5 verbs, follows the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column in the case of mutate().

In contrast to the four other data manipulation functions, summarize() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

You can use any function you like in summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:

  • min(x) - minimum value of vector x.
  • max(x) - maximum value of vector x.
  • mean(x) - mean value of vector x.
  • median(x) - median value of vector x.
  • quantile(x, p) - pth quantile of vector x.
  • sd(x) - standard deviation of vector x.
  • var(x) - variance of vector x.
  • IQR(x) - Inter Quartile Range (IQR) of vector x.
  • diff(range(x)) - total range of vector x.

dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

  • first(x) - The first element of vector x.
  • last(x) - The last element of vector x.
  • nth(x, n) - The nth element of vector x.
  • n() - The number of rows in the data.frame or group of observations that summarize() describes.
  • n_distinct(x) - The number of unique values in vector x

6.Chaining your functions: the pipe operator

As another example of the %>%, have a look at the following two commands that are completely equivalent:

mean(c(1, 2, 3, NA), na.rm = TRUE)
c(1, 2, 3, NA) %>% mean(na.rm = TRUE)

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving the Dagwood sandwich problem.

Class III Group_by and working with databases, date: 2018-10-27

1.Get group-wise insights: group_by

As Garrett explained, group_by() lets you define groups within your data set. Its influence becomes clear when calling summarize() on a grouped dataset: summarizing statistics are calculated for the different groups separately.
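
A minimal sketch with the hflights data used later in this class (column names as in the hflights package):

```r
library(dplyr)
library(hflights)

hflights %>%
  group_by(UniqueCarrier) %>%
  summarize(avg_delay = mean(ArrDelay, na.rm = TRUE))
```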

2.dplyr and databases

hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:

library(data.table)
hflights2 <- as.data.table(hflights)

hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.

Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.

Eighth course: Joining Data in R with dplyr

Class I Mutating joins, date: 2018-10-27

There are four join functions in dplyr:

  • left_join() is the basic join function in dplyr
  • right_join()
  • inner_join()
  • full_join()

syntax:

left_join(df1, df2, by = "key") OR
left_join(df1, df2, by = c("key1", "key2"))

To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x.a to y.b.

Class II Filtering joins and set operations, date: 2018-10-27

Filtering joins and set operations combine information from datasets without adding new variables. Filtering joins filter the observations of one dataset based on whether or not they occur in a second dataset. Set operations use combinations of observations from both datasets to create a new dataset.

1.Semi-joins

semi_join() is an advanced filtering method.

As you saw in the video, semi-joins provide a concise way to filter data from the first dataset based on information in a second dataset.

2.Anti-joins

anti_join() provides a concise way to filter data from the first dataset based on information NOT in a second dataset.

For example, you can use an anti-join to see which rows will not be matched to a second dataset by a join.

3.Set operations

Set operations behave like conditional forms of rbind():

  • union() returns every (unique) observation that appears in df1 or df2
  • intersect() keeps only the observations that appear in both df1 and df2
  • setdiff() excludes the observations in df2 from df1: setdiff(df1, df2)

4.Comparing datasets

setequal(df1, df2) compares the rows regardless of order and returns TRUE if the datasets contain the same observations; identical(df1, df2) is an exact comparison (same rows in the same order).

Class III Advanced joining, date: 2018-10-27

Joining multiple tables

Use the pipe (%>%) or the reduce() function in the purrr package.

For example:

tables <- list(df1, df2, df3)
reduce(tables, left_join, by = "key")

Ninth course: Intro to SQL for Data Science

The role of a data scientist is to turn raw data into actionable insights. Much of the world’s raw data—from electronic medical records to customer transaction histories—lives in organized collections of tables called relational databases. Therefore, to be an effective data scientist, you must know how to wrangle and extract data from these databases using a language called SQL (pronounced ess-que-ell, or sequel). This course teaches you everything you need to know to begin working with databases today!

Class I Selecting columns, date: 2018-11-05

In SQL, you can select data from a table using a SELECT statement. For example, the following query selects the name column from the people table:

SELECT name FROM people;

It’s also good practice (but not necessary for the exercises in this course) to include a semicolon at the end of your query.

This tells SQL where the end of your query is!

SELECT COUNT(birthdate)
FROM people;

SELECT COUNT(DISTINCT birthdate)
FROM people;

Count the number of non-missing values:

SELECT COUNT(birthdate)
FROM people
WHERE birthdate IS NOT NULL;

Class II Filtering rows, date: 2018-11-05

Filtering results

Congrats on finishing the first chapter! You now know how to select columns and perform basic counts. This chapter will focus on filtering your results.

In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:

  • = equal
  • <> not equal
  • < less than
  • > greater than
  • <= less than or equal to
  • >= greater than or equal to

For example, you can filter text records such as title. The following code returns all films with the title ‘Metropolis’:

SELECT title
FROM films
WHERE title = 'Metropolis';

Notice that the WHERE clause always comes after the FROM statement!

Note that in this course we will use <> and not != for the not equal operator, as per the SQL standard.

1.NULL and IS NULL

Now that you know what NULL is and what it’s used for, it’s time for some practice!

SELECT title FROM films WHERE budget IS NULL;

2.LIKE and NOT LIKE

As you’ve seen, the WHERE clause can be used to filter text data. However, so far you’ve only been able to filter by specifying the exact text you’re interested in. In the real world, often you’ll want to search for a pattern rather than a specific text string.

In SQL, the LIKE operator can be used in a WHERE clause to search for a pattern in a column. To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:

The % wildcard will match zero, one, or many characters in text. For example, the following query matches companies like ‘Data’, ‘DataC’, ‘DataCamp’, ‘DataMind’, and so on:

SELECT name
FROM companies
WHERE name LIKE 'Data%';

3.Sorting multiple columns

ORDER BY can also be used to sort on multiple columns. It will sort by the first column specified, then sort by the next, then the next, and so on. For example, to sort people by birth date and break ties by name:

SELECT birthdate, name
FROM people
ORDER BY birthdate, name;

Tenth course: Data Visualization with ggplot2

This ggplot2 tutorial builds on your knowledge from the first course to produce meaningful explanatory plots. We’ll explore the last four optional layers. Statistics will be calculated on the fly and we’ll see how Coordinates and Facets aid in communication. Publication quality plots will be produced directly in R using the Themes layer. We’ll also discuss details on data visualization best practices with ggplot2 to help make sure you have a sound understanding of what works and why. By the end of the course, you’ll have all the tools needed to make a custom plotting function to explore a large data set, combining statistics and excellent visuals.

  • Statistics
  • Coordinates
  • Facets
  • Themes

Class I Statistics, date: 2018-11-06

Two categories of functions in statistics layer:

Called from within a geom, such as geom_smooth()
Called independently, such as stat_smooth()

geom_smooth() <-> stat_smooth()

You can use either stat_smooth() or geom_smooth() to apply a linear model.

stat_quantile() applies a quantile regression.

stat_sum() calculates the total number of overlapping observations and is another good way to deal with overplotting.

```r
library(ggplot2)
library(RColorBrewer)

myColors <- c(brewer.pal(3, "Dark2"), "black")
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, span = 0.7) +
  stat_smooth(method = "loess", 
              aes(group = 1, col="All"), 
              se = FALSE, span = 0.7) +
  # Add correct arguments to scale_color_manual
  scale_color_manual("Cylinders", values = myColors)
```

Class II Coordinates and Facets, date: 2018-11-06

1.Zooming in

You saw different ways of using the coordinates layer to zoom in.

scale_x_continuous(limits = c(3, 6), expand = c(0, 0)) <-> xlim()
scale_y_continuous(limits = c(3, 6), expand = c(0, 0)) <-> ylim()

equals to:

coord_cartesian(xlim = c(3, 6)) OR coord_cartesian(ylim = c(3, 6))

2.Aspect Ratio

We can set the aspect ratio of a plot with coord_fixed() or coord_equal(). Both use ratio = 1 as a default. A 1:1 aspect ratio is most appropriate when two continuous variables are on the same scale.

4.Facets: the basics

The most straightforward way of using facets is facet_grid(). Here we just need to specify the categorical variable to use on rows and columns using standard R formula notation (rows ~ columns).
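
A minimal sketch with mtcars, putting am on the rows and cyl on the columns:

```r
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl)
```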

5.Themes

plot.background is used to define the background of the whole plot, outside the plotting panel.
panel.grid is used to define the grid of the plotting panel.
axis.line is used to define the axis lines.
axis.title is used to define the axis titles.

strip.background is used to define the facet labels when faceting is applied.

We can use the theme_set() function to set a theme globally; it returns the previous theme, so we can save that and restore it later with theme_set(original). The default theme is theme_grey().
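
A hedged sketch of setting a few of these elements (p is assumed to be an existing ggplot object):

```r
# p is an existing ggplot object
p + theme(
  plot.background = element_rect(fill = "grey90"),
  panel.grid = element_blank(),
  axis.line = element_line(color = "black")
)
```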

6.Heat map

In the video you saw reasons for not using heat maps. Nonetheless, you may encounter a case in which you really do want to use one. Luckily, they’re fairly straightforward to produce in ggplot2.

We begin by specifying two categorical variables for the x and y aesthetics. At the intersection of each category we’ll draw a box, except here we call it a tile, using the geom_tile() layer. Then we will fill each tile with a continuous variable.
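
A self-contained sketch with made-up data:

```r
library(ggplot2)

df <- expand.grid(x = letters[1:3], y = letters[1:3])  # two categorical variables
df$z <- runif(nrow(df))                                # a continuous fill variable

ggplot(df, aes(x = x, y = y, fill = z)) +
  geom_tile()
```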

Eleventh course: Working with Dates and Times in R

Dates and times are abundant in data and essential for answering questions that start with when, how long, or how often. However, they can be tricky, as they come in a variety of formats and can behave in unintuitive ways. This course teaches you the essentials of parsing, manipulating, and computing with dates and times in R. By the end, you’ll have mastered the lubridate package, a member of the tidyverse, specifically designed to handle dates and times.

Class I Dates and Times in R, date: 2018-11-26

Getting datetimes into R

Just like dates without times, if you want R to recognize a string as a datetime you need to convert it, although now you use as.POSIXct(). as.POSIXct() expects strings to be in the format YYYY-MM-DD HH:MM:SS.

The only tricky thing is that times will be interpreted in local time based on your machine’s set up. You can check your timezone with Sys.timezone(). If you want the time to be interpreted in a different timezone, you just set the tz argument of as.POSIXct(). You’ll learn more about time zones in Chapter 4.

In this exercise you’ll input a couple of datetimes by hand and then see that read_csv() also handles datetimes automatically in a lot of cases.
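
For example (the time zone choice is an assumption):

```r
as.POSIXct("2010-10-01 12:12:00")
as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")
```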

Functions in lubridate package.

ymd()
ydm()
mdy()
myd()
dmy()
dym()

parse_date_time(x = , orders = )

e.g. parse_date_time(x = "2010-10-10", orders = "ymd")