R#
Sections#
- Data Computing 2
- R for Data Science 1
- Silge & Robinson’s Text Mining with R
- Forecasting: Principles and Practice 3
- A01 - Tidy Data
- A02 - Graphical Exploration
- A04 - Data Wrangling
- A05 - Project: Popular Names
- A06 - Project: Bird Species
- A07 - Project: Bicycle Sharing
- A08 - Project: Scraping Nuclear Reactors
- A09 - Project: Street or Road?
- A10 - Base R
Resources#
R Lang#
R
R Markdown
R and Jupyter
IRkernel for Jupyter
R and Anaconda
Packages
Cheatsheets
Other Resources#
Ansell, Brendan. “Introduction to R - tidyverse”. Home.
[G] Bryan, Jennifer. “Happy Git and GitHub for the useR”. Home.
Coltekin, Cagri. (2015). A hands-on tutorial on using R for (mostly) linguistics research.
Guilherme D. Garcia
freeCodeCamp. (2021). “R Shiny for Data Science Tutorial - Build Interactive Data-Driven Web Apps”. YouTube.
Stewart, Matthew. (2019). “Guide to R and Python in a Single Jupyter Notebook”. Towards Data Science. Page.
“Using R for Time Series Analysis”. https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
Unofficial solution manual to the first edition of R for Data Science: https://jrnold.github.io/r4ds-exercise-solutions/
Texts#
Baayen, R. H. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press.
Baumer, Benjamin; Daniel Kaplan; & Nicholas J. Horton. Modern Data Science with R. 2nd Ed. CRC Press. Home.
Buisson, Florent. (2021). Behavioral Data Analysis with R and Python: Customer-Driven Data for Real Business Results. O’Reilly.
Bruce, Peter; Andrew Bruce; & Peter Gedeck. (2020). Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. O’Reilly.
Chang, Winston. (2018). R Graphics Cookbook: Practical Recipes for Visualizing Data. 2nd Ed. O’Reilly. Home.
Davies, Tilman M. (2016). The Book of R: A First Course in Programming and Statistics. No Starch Press.
Grolemund, Garrett. Hands-On Programming with R: Write Your Own Functions and Simulations. O’Reilly. Home. GitHub.
Hvitfeldt, Emil & Julia Silge. (2022). Supervised Machine Learning for Text Analysis in R. Home.
Hyndman, Rob J. & George Athanasopoulos. (2021). Forecasting: Principles and Practice. 3rd Ed. Home.
Kaplan, Daniel & Matthew Beckman. (2021). Data Computing. 2nd Ed. Home.
Kuhn, Max & Julia Silge. (2022). Tidy Modeling with R: A Framework for Modeling in the Tidyverse. O’Reilly. Home.
Long, J. D. & Paul Teetor. (2019). R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics. 2nd Ed. O’Reilly.
Matloff, Norman. (2023). The Art of Machine Learning: Algorithms + Data + R. No Starch Press.
Matloff, Norman. (2011). The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.
Nolan, Deborah & Duncan Temple Lang. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
Scavetta, Rick & Boyan Angelov. (2021). Python and R for the Modern Data Scientist: The Best of Both Worlds. O’Reilly.
Silge, Julia & David Robinson. Text Mining with R: A Tidy Approach. Home.
Unwin, Antony. (2015). Graphical Data Analysis with R. CRC Press. Home.
Wickham, Hadley. (????). ggplot2: Elegant Graphics for Data Analysis. 3rd Ed. Springer. Home.
Wickham, Hadley. (2021). Mastering Shiny: Build Interactive Apps, Reports, and Dashboards Powered by R. O’Reilly.
Wickham, Hadley. (2019). Advanced R. 2nd Ed. CRC Press.
Wickham, Hadley. (2010). “A Layered Grammar of Graphics”. Paper.
Wickham, Hadley & Jennifer Bryan. (????). R Packages: Organize, Test, Document, and Share Your Code. O’Reilly. Home.
Wickham, Hadley; Mine Çetinkaya-Rundel; & Garrett Grolemund. (2023). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd Ed. O’Reilly. Home.
Wickham, Hadley & Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st Ed. O’Reilly. Home.
Wilkinson, Leland. (2005). The Grammar of Graphics. 2nd Ed. Springer.
Xie, Yihui; Christophe Dervieux; & Emily Riederer. (2023). R Markdown Cookbook. CRC Press. Home.
Xie, Yihui; J. J. Allaire; & Garrett Grolemund. (2018). R Markdown: The Definitive Guide. Chapman & Hall/CRC. Home.
Figures#
[W] Wickham, Hadley
Terms#
Notes#
Most builtin R functions work with vectors of values.
.libPaths()
get local path to R library installation location. StackOverflow.
installed.packages()
get installed packages
Data Viz#
Scatterplot
shows the relationship between two (or more) variables, across some number of cases
usually: Cartesian coordinate system where axes represent a variable
Histogram
shows a distribution
shows how many cases fall into given ranges of the variable
Frequency Polygon
like a histogram, but lets you break up cases by other variables
Bar Chart
compares values of a single variable across groups
effective when you want to compare a few different quantities
Map
shows how a variable relates to geography
Choropleth Map
the fill color of each region reflects the value of a variable
Network Diagram
shows how entities are connected to one another
a vertex corresponds to a case, and the network describes which cases are connected to which other cases
How to create a data graphic:
choose the glyph-ready data frame
select the kind of graphic
MAPPING a variable to a graphical attribute: decide which variables from the data frame will be assigned to which roles in the graphic
axes
bar lengths
glyphs
fill color
border color
shapes
sizes
transparency
etc.
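The three steps above can be sketched with ggplot2. This is only an illustration: the data frame `glyph_ready` and its values are hypothetical, and the sketch assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Step 1: a glyph-ready data frame, one row per glyph (hypothetical values).
glyph_ready <- data.frame(
  height = c(150, 160, 170, 180),
  weight = c(50, 58, 69, 80),
  group  = c("a", "a", "b", "b")
)

# Step 2: the kind of graphic is a scatterplot, so the geom is geom_point().
# Step 3: the mapping assigns variables to roles:
#   height -> x axis, weight -> y axis, group -> color of each point.
p <- ggplot(glyph_ready, aes(x = height, y = weight, color = group)) +
  geom_point(size = 3)

# Printing the object (print(p), or typing p at the console) draws the plot.
```

Note that `size = 3` sits outside `aes()`, so it is a constant setting rather than a mapping from a variable.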
interactive
mosaic
mplot()
mWorldMap(key=,fill=)
mUSMap(key=,fill=)
esquisse
coordinate systems
coord_fixed
coord_flip
coord_map
coord_polar
coord_quickmap
facets
facet_grid(. ~ v)
facet_wrap(~ v)
geoms
geom_abline
geom_bar
geom_boxplot
geom_col
geom_count
geom_histogram
geom_jitter
geom_line
geom_point
geom_pointrange
position adjustments
position_dodge
position_fill
position_identity
position_jitter
position_stack
scales
scale_color_brewer
stats
stat
stat_bin
stat_count
stat_smooth
stat_summary
Data Verb
single-table in, single-table out
arrange()
filter()
mutate()
summarize()
transmute()
double-table in, single-table out
left_join() - the output has all the cases from the left, even if there is no match in the right
inner_join() - the output has only the cases from the left with a match in the right
full_join() - the output will have all the cases from both the left and the right
MUTATION
to mutate means to add new variables to a data frame
data verb
mutate()
allows you to add new variables constructed by transformation operations on existing variables (join data verbs allow something similar, constructing new variables by pulling them out of a different data frame)
translation, a special kind of mutation
translate levels in one variable into a different set of levels
JOIN - a data verb that looks for info to match rows in one table to zero or more rows in a second table
when there is a match between a case in the left table and a case in the right table, the remaining variables from the right table are concatenated onto the case from the left table
Remaining questions
What happens when a case in the right table has no matches in the left table?
What happens when a case in the left table has no matches in the right table?
What happens when there are multiple cases in the left table that match a case in the right table?
all joins behave the same when there are multiple matches in the right table for a case in the left table: each of these multiple matches will produce a case in the output
LeftTable %>% joinOperation(RightTable, vars_for_matching)
When used with the chaining syntax %>%, most data verbs have arguments that consist only of reduction and transformation functions, constants, and variables.
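The questions above can be explored with a small experiment. The following is a minimal sketch, assuming dplyr is installed; the tables `left_table` and `right_table` are hypothetical. Key 1 has two matches in the right table, keys 2 and 3 have none, and key 4 appears only in the right table.

```r
library(dplyr)

left_table  <- tibble(key = c(1, 2, 3), x = c("a", "b", "c"))
right_table <- tibble(key = c(1, 1, 4), y = c("p", "q", "r"))

# All cases from the left; unmatched cases get NA for y.
left_out  <- left_table %>% left_join(right_table, by = "key")

# Only the cases from the left with a match in the right.
inner_out <- left_table %>% inner_join(right_table, by = "key")

# All cases from both tables.
full_out  <- left_table %>% full_join(right_table, by = "key")

nrow(left_out)   # 4: key 1 twice (two matches), keys 2 and 3 once each
nrow(inner_out)  # 2: key 1 twice
nrow(full_out)   # 5: the left join result plus key 4
```

In every case, the two matches for key 1 each produce a row in the output.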
MUTATING JOINS
inner equijoin
outer joins
left outer join
right outer join
full outer join
INNER EQUIJOIN
an inner join matches pairs of observations whenever their keys are equal
inner equijoin, because the keys are matched using the equality operator
unmatched rows are not included in the result
an inner join keeps observations that appear in both tables
OUTER JOIN
an outer join keeps observations that appear in at least one of the tables
outer joins work by adding an additional virtual observation to each table; this observation has a key that always matches (if no other key matches), and a value filled with NA
LEFT OUTER JOIN
keeps all observations in x
RIGHT OUTER JOIN
keeps all observations in y
FULL OUTER JOIN
keeps all observations in x and y
NATURAL JOIN
the default, by = NULL, uses all variables that appear in both tables
Filtering Joins
match observations in the same way as mutating joins but affect observations, not variables
SEMI JOIN
keeps all observations in x that have a match in y
useful for matching filtered summary tables back to the original rows
ANTI JOIN
drops all observations in x that have a match in y
useful for diagnosing join mismatches
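A short sketch of the two filtering joins, assuming dplyr is installed; the tables `x_table` and `y_table` are hypothetical. Unlike a mutating join, neither verb adds columns from `y_table`, and a semi join never duplicates rows of `x_table` even when there are several matches.

```r
library(dplyr)

x_table <- tibble(key = c(1, 2, 3), x = c("a", "b", "c"))
y_table <- tibble(key = c(1, 1, 4), y = c("p", "q", "r"))

# Rows of x_table that have at least one match in y_table.
kept    <- x_table %>% semi_join(y_table, by = "key")

# Rows of x_table that have no match in y_table.
dropped <- x_table %>% anti_join(y_table, by = "key")

kept$key      # 1
dropped$key   # 2 3
names(kept)   # "key" "x"
```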
Keys
the variables used to connect each pair of tables are called keys
a key is a variable or set of variables that uniquely identifies an observation
a primary key uniquely identifies an observation in its own table
a foreign key uniquely identifies an observation in another table
a variable can be both a primary key and a foreign key
verify that the primary key uniquely identifies each observation: one way to do this is to count() the primary keys and look for entries where n is greater than one
sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it
surrogate key: if a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number()
a primary key and the corresponding foreign key in another table form a relation
relations are typically one-to-many
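The primary key check and the surrogate key can be sketched as follows, assuming dplyr is installed. The table `orders` and its column names are hypothetical.

```r
library(dplyr)

orders <- tibble(
  order_id = c(1, 2, 2, 3),
  item     = c("pen", "ink", "ink", "pad")
)

# Check whether order_id is a primary key: count each value and
# keep the entries that occur more than once.
duplicates <- orders %>%
  count(order_id) %>%
  filter(n > 1)

duplicates$order_id  # 2, so order_id does not uniquely identify a row

# Add a surrogate key based on row position.
orders_keyed <- orders %>%
  mutate(row_id = row_number())

orders_keyed$row_id  # 1 2 3 4
```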
Relations
one-to-many
one-to-one, a special case of one-to-many
many-to-many = many-to-one + one-to-many
filter - select observations by their values
arrange - reorder rows
select - select variables by their names
starts_with('abc') - matches names that begin with 'abc'
ends_with('abc') - matches names that end with 'abc'
contains('abc') - matches names that contain 'abc'
matches('(.)\\1') - selects variables that match a regular expression
num_range('x',1:3) - matches x1, x2, and x3
rename()
everything()
any_of()
contains()
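The selection helpers can be tried on a one-row table. This is a sketch assuming dplyr is installed; the table `df` and its column names are hypothetical and chosen so that each helper picks out something different.

```r
library(dplyr)

df <- tibble(abc_one = 1, one_abc = 2, x1 = 3, x2 = 4, x3 = 5, aa = 6)

starts <- names(select(df, starts_with("abc")))     # "abc_one"
ends   <- names(select(df, ends_with("abc")))       # "one_abc"
has    <- names(select(df, contains("abc")))        # "abc_one" "one_abc"
# The regular expression (.)\\1 matches any repeated character.
rep_ch <- names(select(df, matches("(.)\\1")))      # "aa"
nums   <- names(select(df, num_range("x", 1:3)))    # "x1" "x2" "x3"

# everything() is handy for moving a few columns to the front.
front  <- names(select(df, aa, everything()))

# any_of() ignores names that are not present instead of raising an error.
some   <- names(select(df, any_of(c("x1", "missing"))))  # "x1"

# rename() keeps every column and changes only the named ones.
renamed <- names(rename(df, first = abc_one))
```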
mutate - create new variables with functions of existing variables
transmute()
summarize - collapse many values down to a single summary
group_by - changes the scope of each function from operating on the entire dataset to operating on it group-by-group
+ addition
- subtraction
* multiplication
/ division
^ exponentiation
x / sum(x) the proportion of a total
y - mean(y) the difference from the mean
%/% integer division
%% remainder
log()
log2()
log10()
lead()
lag()
x - lag(x) running difference
x != lag(x) find when values change
lag(x)
lead(x)
cumsum(x)
cumprod(x)
cummin(x)
cummax(x)
cummean(x)
min_rank(y)
min_rank(desc(y))
row_number(y)
dense_rank(y)
percent_rank(y)
cume_dist(y)
ntile()
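A few of these functions applied to a small vector, as a sketch assuming dplyr is installed. The vector `x` is made up for illustration, and `dplyr::lag()` is written with its namespace because base R also has a `lag()` function in the stats package.

```r
library(dplyr)

x <- c(1, 3, 3, 6)

int_div   <- 7 %/% 2             # 3
remainder <- 7 %% 2              # 1
share     <- x / sum(x)          # proportions of the total; they sum to 1

running_diff <- x - dplyr::lag(x)    # NA  2  0  3
changed      <- x != dplyr::lag(x)   # NA  TRUE FALSE  TRUE
running_sum  <- cumsum(x)            # 1  4  7 13
running_mean <- cummean(x)           # 1, 2, 7/3, 13/4

# Ranking functions differ in how they treat ties.
r_min   <- min_rank(x)           # 1 2 2 4  (ties share the smallest rank)
r_desc  <- min_rank(desc(x))     # 4 2 2 1  (largest value ranked first)
r_dense <- dense_rank(x)         # 1 2 2 3  (no gaps after ties)
r_row   <- row_number(x)         # 1 2 3 4  (ties broken by position)
```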
verb
* the first argument is the dataframe
* the subsequent arguments describe what to do with the dataframe using the variable names
* the result is a new dataframe
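Because every verb takes a data frame first and returns a new data frame, verbs chain naturally. A minimal sketch, assuming dplyr is installed; the table `scores` is hypothetical.

```r
library(dplyr)

scores <- tibble(
  group = c("a", "a", "b", "b", "c"),
  value = c(1, 3, 10, 20, 100)
)

result <- scores %>%
  filter(value < 50) %>%            # select observations by their values
  group_by(group) %>%               # later verbs now work group-by-group
  summarize(
    mean_value = mean(value),       # collapse each group to one row
    n = n()
  ) %>%
  arrange(desc(mean_value))         # reorder rows, largest mean first

result$group       # "b" "a"
result$mean_value  # 15 2

# The original table is unchanged: verbs return new data frames.
nrow(scores)       # 5
```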
Exploratory Data Analysis (EDA)
use visualization and transformation to explore data in a systematic way
EDA is an iterative process
generate questions about the data
search for the answers by visualizing, transforming, modeling the data
use what is learned to refine the questions, or generate new questions
EDA is not a formal process with a strict set of rules
some important questions
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
Variable
a variable is a quantity, quality, or property that can be measured
Value
a value is the state of a variable when you measure it; the value of a variable may change from measurement to measurement
Observation
an observation is a set of measurements made under similar conditions (i.e., at the same time and on the same object)
an observation contains several values each of which is associated with a different variable
Tabular Data
tabular data is a set of values each of which is associated with a variable and an observation
tabular data is tidy if each value is placed in its own cell; each variable in its own column; and each observation in its own row
Variation
variation is the tendency of the values of a variable to change from measurement to measurement
if you measure any continuous variable twice, you will get two different results (this is true even if you measure quantities that are constant like the speed of light)
each of your measurements will include a small amount of error that varies from measurement to measurement
categorical variables can also vary if you measure across different subjects or different times
variation describes the behavior within a variable
Categorical Variable
a variable is categorical if it can only take one of a small set of values
in R, categorical variables are usually saved as factors or character vectors
use a bar chart to examine the distribution of a categorical variable
the height of the bars displays how many observations occurred with each x value (computed manually with dplyr::count())
Continuous Variable
a variable is continuous if it can take any of an infinite set of ordered values (e.g., numbers and datetimes)
use a histogram to examine the distribution of a continuous variable
a histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall into each bin
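The numbers behind a bar chart and a histogram can be computed without drawing anything. This sketch assumes dplyr is installed; the data are made up, and base R's `hist()` with `plot = FALSE` is used here only to expose the bin counts.

```r
library(dplyr)

# Categorical variable: the bar heights are the counts per level.
colors <- tibble(color = c("red", "blue", "red", "red"))
bar_heights <- colors %>% count(color)
bar_heights$color  # "blue" "red"
bar_heights$n      # 1 3

# Continuous variable: cut the axis into equally spaced bins
# and count the observations that fall into each one.
values <- c(1, 2, 2, 3, 7, 8, 9, 9, 9, 10)
bins <- hist(values, breaks = seq(0, 10, by = 5), plot = FALSE)
bins$counts        # 4 6, for the bins [0, 5] and (5, 10]
```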
many of the questions will prompt you to explore a relationship between variables to see if the values of one variable can explain the behavior of another variable
Outlier
outliers are unusual observations that don’t seem to fit the pattern; sometimes they are data entry errors; other times they suggest important new science
Covariation
covariation describes the behavior between variables
covariation is the tendency for the values of two or more variables to vary together in a related way
the best way to spot covariation is to visualize the relationship between two or more variables
Boxplot
display the distribution of a continuous variable broken down by a categorical variable
a boxplot is a type of visual shorthand for a distribution of values
a boxplot consists of
a box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR)
in the middle of the box is a line that displays the median, i.e., the 50th percentile of the distribution
these three lines provide a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side
visual points that display observations that fall more than 1.5 times the IQR from either edge of the box
a line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution
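The pieces of a boxplot can be computed by hand in base R. This is a sketch on made-up data; it uses `quantile()` with its default method, so the box edges may differ slightly from the hinges that a plotting function computes.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- quantile(x, 0.25, names = FALSE)  # 3.25, lower edge of the box
med <- median(x)                         # 5.5, line inside the box
q3  <- quantile(x, 0.75, names = FALSE)  # 7.75, upper edge of the box
iqr <- q3 - q1                           # 4.5

# Points more than 1.5 * IQR beyond either edge are drawn individually.
lower_fence <- q1 - 1.5 * iqr            # -3.5
upper_fence <- q3 + 1.5 * iqr            # 14.5
outliers <- x[x < lower_fence | x > upper_fence]   # 30

# Whiskers reach the farthest points that are not outliers.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])  # 1 9
```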
Atomic Vectors
6 types of atomic vectors
Logical
Numeric
Integer
Double
Character
Complex
Raw
atomic vectors are homogeneous
each type of atomic vector has its own missing value; you can always use NA and it will be converted to the correct type using the implicit coercion rules
Recursive Vectors (Lists)
lists can contain other lists
lists can be heterogeneous
NULL
NULL is often used to represent the absence of a vector as opposed to NA which is used to represent the absence of a value in a vector
NULL typically behaves like a vector of length 0
Properties of Vectors
type, determined with
typeof()
length, determined with
length()
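A quick base R sketch of the vector types and the two properties, using only throwaway values:

```r
# The six atomic types, as reported by typeof().
types <- c(
  typeof(TRUE),     # "logical"
  typeof(1L),       # "integer"
  typeof(1),        # "double"
  typeof("a"),      # "character"
  typeof(1i),       # "complex"
  typeof(raw(1))    # "raw"
)

# NA takes on the type of the vector it is placed in.
na_int <- typeof(c(NA, 1L))     # "integer"
na_chr <- typeof(c(NA, "a"))    # "character"

# Lists are recursive and may be heterogeneous.
lst <- list(1, "a", list(2, 3))
lst_type <- typeof(lst)         # "list"
lst_len  <- length(lst)         # 3: the inner list counts as one element

# NULL behaves like a vector of length 0.
null_len <- length(NULL)        # 0
```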
Augmented Vectors
vectors can contain arbitrary additional metadata in the form of attributes; attributes are used to create augmented vectors, which add behavior on top of the underlying vector
3 types of augmented vectors
Factors
factors represent categorical data that can take a fixed set of possible values
built on top of integer vectors
factors have a levels attribute
Dates and Datetimes
built on top of numeric vectors
dates: represent the number of days since 1 Jan 1970
datetimes: class POSIXct (Portable Operating System Interface, calendar time), represent the number of seconds since 1 Jan 1970
Data frames and tibbles, built on top of lists
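Looking underneath each augmented vector shows what it is built on. A base R sketch with made-up values:

```r
# A factor is an integer vector with a levels attribute.
f <- factor(c("lo", "hi", "lo"), levels = c("lo", "hi"))
f_type  <- typeof(f)        # "integer"
f_codes <- as.integer(f)    # 1 2 1
f_lvls  <- levels(f)        # "lo" "hi"

# A date is a number of days since 1 Jan 1970.
d <- as.Date("1970-01-02")
d_type <- typeof(d)         # "double"
d_days <- as.numeric(d)     # 1

# A datetime (POSIXct) is a number of seconds since 1 Jan 1970.
dt <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
dt_secs <- as.numeric(dt)   # 60

# A data frame is a list of equal-length columns.
df_type <- typeof(data.frame(a = 1:2, b = c("x", "y")))  # "list"
```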
Logical Vectors
three possible values: FALSE, TRUE, NA
logical vectors are usually constructed with comparison operators; they can also be constructed manually with c()
Numeric Vectors
in R, numbers are doubles by default; to make an integer, place an L after the number
doubles are approximations
integers have one special value: NA
doubles have four special values: NA, NaN, Inf, -Inf, the latter three of which can arise during division
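A base R sketch of these points. The comparison with `all.equal()` is one common way to compare doubles with a tolerance; it is shown here as an option, not as the only approach.

```r
typeof(1)    # "double"
typeof(1L)   # "integer"

# Doubles are approximations, so exact equality can fail.
exact  <- sqrt(2)^2 == 2                   # FALSE
approx <- isTRUE(all.equal(sqrt(2)^2, 2))  # TRUE

# The special double values arising from division.
pos_inf <- 1 / 0     # Inf
neg_inf <- -1 / 0    # -Inf
not_num <- 0 / 0     # NaN

# Use the helper functions rather than == to test for them.
is.infinite(pos_inf)  # TRUE
is.nan(not_num)       # TRUE
is.na(not_num)        # TRUE: NaN also counts as missing
```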
Character Vectors
each element of a character vector is a string, and a string can contain an arbitrary amount of data
R uses a global string pool: this means that each unique string is only stored in memory once and every use of the string points to that representation which reduces the amount of memory needed by duplicated strings
a pointer is 8 bytes, so 1000 pointers to a 152 B string is 8 x 1000 + 152 = 8,152 B (about 8.15 kB)
Explicit coercion
as.logical()
as.integer()
as.double()
as.character()
Implicit coercion
of the type of a vector
when creating a vector containing multiple types with c(), the most complex type always wins
of the length of a vector (vector recycling)
the shorter vector is repeated to the same length as the longer vector
this is most useful when mixing vectors and “scalars” (vectors of length 1)
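Both kinds of coercion can be seen in a few lines of base R. The values are arbitrary.

```r
# Explicit coercion.
as_int <- as.integer("3")       # 3L
as_lgl <- as.logical("TRUE")    # TRUE

# Implicit coercion of type: the most complex type wins.
t1 <- typeof(c(TRUE, 1L))       # "integer"
t2 <- typeof(c(1L, 1.5))        # "double"
t3 <- typeof(c(1.5, "a"))       # "character"

# Logical values become 0 and 1 in a numeric context.
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2

# Implicit coercion of length (recycling).
scalar_add <- 1:3 + 10          # 11 12 13: the "scalar" 10 is repeated
recycled   <- 1:6 + 1:2         # 2 4 4 6 6 8: 1:2 is repeated three times
```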
Subsetting
numeric vector of integers
logical vector, keeps all values corresponding to a TRUE value
named vectors can be subsetted with a character vector
x[] returns the complete x
this is not useful for subsetting vectors, but it is useful for subsetting matrices and other high-dimensional structures because it lets you select all the rows or all the columns by leaving that index blank
[[]] extracts a single element and always drops names
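Each kind of subsetting applied to a small named vector and a matrix, as a base R sketch with made-up values:

```r
x <- c(a = 10, b = 20, c = 30)

by_pos  <- x[c(1, 3)]       # a = 10, c = 30 (positive integers keep)
by_neg  <- x[-1]            # b = 20, c = 30 (negative integers drop)
by_lgl  <- x[x > 15]        # b = 20, c = 30 (keeps the TRUE positions)
by_name <- x[c("a", "c")]   # a = 10, c = 30
single  <- x[[2]]           # 20, with the name dropped

# Leaving an index blank selects everything along that dimension.
m <- matrix(1:6, nrow = 2)
first_row  <- m[1, ]        # 1 3 5
second_col <- m[, 2]        # 3 4
```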
Attributes
names, used to name the elements of a vector
dimensions (dims), make a vector behave like a matrix or array
class, used to implement the S3 object oriented system
controls how generic functions work
generic functions
generic functions are key to object oriented programming in R because they make functions behave differently for different classes of input
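A minimal sketch of how the class attribute drives a generic function in S3. The generic `describe()` and the classes `circle` and `square` are hypothetical names invented for this example.

```r
# The generic only dispatches; the work happens in the methods.
describe <- function(x, ...) UseMethod("describe")

# One method per class, named generic.class.
describe.circle <- function(x, ...) paste("circle with radius", x$radius)
describe.square <- function(x, ...) paste("square with side", x$side)

# The default method handles every other class.
describe.default <- function(x, ...) "something else"

# An S3 object is an ordinary vector or list with a class attribute.
shape1 <- structure(list(radius = 2), class = "circle")
shape2 <- structure(list(side = 3), class = "square")

out1 <- describe(shape1)   # "circle with radius 2"
out2 <- describe(shape2)   # "square with side 3"
out3 <- describe(1:3)      # "something else"
```

The same call, `describe(x)`, behaves differently depending on `class(x)`, which is how functions such as `print()` and `summary()` work.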