R#

Sections#

Resources#

R Lang#

R

Home
Docs
- The R Manuals
  - An Introduction to R
Wiki

R Markdown

R and Jupyter

IRkernel for Jupyter
- Home

R and Anaconda

Using R with Anaconda

Packages

Esquisse
- CRAN Get started with esquisse
Tidyverse
- dplyr
- ggplot2
- readr
vowels
- Docs

Cheatsheets

https://swirlstats.com/

Other Resources#

Marin Stats Lectures
Ansell, Brendan. “Introduction to R - tidyverse”. Home.
[G] Bryan, Jennifer. “Happy Git and GitHub for the useR”. Home.
Coltekin, Cagri. (2015). A hands-on tutorial on using R for (mostly) linguistics research.
Guilherme D. Garcia
freeCodeCamp. (2021). “R Shiny for Data Science Tutorial - Build Interactive Data-Driven Web Apps”. YouTube.
Stewart, Matthew. (2019). “Guide to R and Python in a Single Jupyter Notebook”. Towards Data Science. Page.
“Using R for Time Series Analysis”. https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
Unofficial solution manual to the first edition of R for Data Science: https://jrnold.github.io/r4ds-exercise-solutions/

Texts#

Bookdown

Baayen, R. H. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press.
Baumer, Benjamin; Daniel Kaplan; & Nicholas J. Horton. Modern Data Science with R. 2nd Ed. CRC Press. Home.
Buisson, Florent. (2021). Behavioral Data Analysis with R and Python: Customer-Driven Data for Real Business Results. O’Reilly.
Bruce, Peter; Andrew Bruce; & Peter Gedeck. (2020). Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. O’Reilly.
Chang, Winston. (2018). R Graphics Cookbook: Practical Recipes for Visualizing Data. 2nd Ed. O’Reilly. Home.
Davies, Tilman M. (2016). The Book of R: A First Course in Programming and Statistics. No Starch Press.
Grolemund, Garrett. Hands-On Programming with R: Write Your Own Functions and Simulations. O’Reilly. Home. GitHub.
Hvitfeldt, Emil & Julia Silge. (2022). Supervised Machine Learning for Text Analysis in R. Home.
Hyndman, Rob J. & George Athanasopoulos. (2021). Forecasting: Principles and Practice. 3rd Ed. Home.
Kaplan, Daniel & Matthew Beckman. (2021). Data Computing. 2nd Ed. Home.
Kuhn, Max & Julia Silge. (2022). Tidy Modeling with R: A Framework for Modeling in the Tidyverse. O’Reilly. Home.
Long, J. D. & Paul Teetor. (2019). R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics. 2nd Ed. O’Reilly.
Matloff, Norman. (2023). The Art of Machine Learning: Algorithms + Data + R. No Starch Press.
Matloff, Norman. (2011). The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.
Nolan, Deborah & Duncan Temple Lang. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
Scavetta, Rick & Boyan Angelov. (2021). Python and R for the Modern Data Scientist: The Best of Both Worlds. O’Reilly.
Silge, Julia & David Robinson. Text Mining with R: A Tidy Approach. Home.
Unwin, Antony. (2015). Graphical Data Analysis with R. CRC Press. Home.
Wickham, Hadley. (????). ggplot2: Elegant Graphics for Data Analysis. 3rd Ed. Springer. Home.
Wickham, Hadley. (2021). Mastering Shiny: Build Interactive Apps, Reports, and Dashboards Powered by R. O’Reilly.
Wickham, Hadley. (2019). Advanced R. 2nd Ed. CRC Press.
Wickham, Hadley. (2014). “Tidy Data”. Home. CRAN.
Wickham, Hadley. (2010). “A Layered Grammar of Graphics”. Paper.
Wickham, Hadley & Jennifer Bryan. (????). R Packages: Organize, Test, Document, and Share Your Code. O’Reilly. Home.
Wickham, Hadley; Mine Çetinkaya-Rundel; & Garrett Grolemund. (2023). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd Ed. O’Reilly. Home.
Wickham, Hadley; Mine Çetinkaya-Rundel; & Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st Ed. O’Reilly. Home.
- unofficial solution guide
Wilkinson, Leland. (2005). The Grammar of Graphics. 2nd Ed. Springer.
Xie, Yihui; Christophe Dervieux; & Emily Riederer. (2023). R Markdown Cookbook. CRC Press. Home.
Xie, Yihui; J. J. Allaire; & Garrett Grolemund. (2018). R Markdown: The Definitive Guide. Chapman & Hall/CRC. Home.

Figures#

[W] Wickham, Hadley

Terms#

[W] Comprehensive R Archive Network (CRAN)
[W] dplyr
[W] ggplot2
[W] R
[W] R Package
[W] RStudio
[W] Tidyverse

Notes#

Most builtin R functions work with vectors of values.
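
A minimal sketch of what "works with vectors" means in practice. The vector `x` is made up for illustration.

```r
# Hypothetical example data
x <- c(1, 4, 9, 16)

# sqrt() is applied to every element; no explicit loop is needed
roots <- sqrt(x)

# Arithmetic operators are vectorized too
doubled <- x * 2

print(roots)
print(doubled)
```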

.libPaths() - get the local path(s) of the R library installation (StackOverflow)
installed.packages() - get the installed packages

Data Viz#

Scatterplot

shows the relationship between two (or more) variables, across some number of cases
usually: a Cartesian coordinate system where each axis represents a variable

Histogram

shows a distribution
shows how many cases fall into given ranges of the variable

Frequency Polygon

like a histogram, but lets you break up cases by other variables

Bar Chart

compares values of a single variable across groups
effective when you want to compare a few different quantities

Map

shows how a variable relates to geography

Choropleth Map

the fill color of each region reflects the value of a variable

Network Diagram

shows how entities are connected to one another
a vertex corresponds to a case, and the network describes which cases are connected to which other cases

How to create a data graphic:

choose the glyph-ready data frame
select the kind of graphic
MAPPING a variable to a graphical attribute: decide which variables from the data frame will be assigned to which roles in the graphic
- axes
- bar lengths
- glyphs
  - fill color
  - border color
  - shapes
  - sizes
  - transparency
- etc.
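
The three steps above might look like this in ggplot2. This is only a sketch: the built-in `mtcars` data frame stands in for the glyph-ready data, and the choice of `wt`, `mpg`, and `cyl` is an assumption made for illustration.

```r
library(ggplot2)

# 1. glyph-ready data frame: mtcars (ships with R)
# 2. kind of graphic: scatterplot, so the glyph is a point
# 3. mapping: wt -> x axis, mpg -> y axis, cyl -> color of each glyph
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2, alpha = 0.7)  # size and transparency set as constants

print(p)
```

Attributes set inside `aes()` vary with the data; attributes set outside it (here `size` and `alpha`) are the same for every glyph.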

interactive

mosaic
- mplot()
- mWorldMap(key=,fill=)
- mUSMap(key=,fill=)
esquisse

coordinate systems

coord_fixed
coord_flip
coord_map
coord_polar
coord_quickmap

facets

facet_grid(. ~ v)
facet_wrap(~ v)

geoms

geom_abline
geom_bar
geom_boxplot
geom_col
geom_count
geom_histogram
geom_jitter
geom_line
geom_point
geom_pointrange

position adjustments

position_dodge
position_fill
position_identity
position_jitter
position_stack

scales

scale_color_brewer

stats

stat
stat_bin
stat_count
stat_smooth
stat_summary

Data Verb

single-table in, single-table out
- arrange()
- filter()
- mutate()
- summarize()
- transmute()
double-table in, single-table out
- left_join() - the output has all the cases from the left, even if there is no match in the right
- inner_join() - the output has only the cases from the left with a match in the right
- full_join() - the output will have all the cases from both the left and the right

MUTATION

to mutate means to add new variables to a data frame
data verb mutate() allows you to add new variables constructed by transformation operations on existing variables (join data verbs allow something similar, constructing new variables by pulling them out of a different data frame)
translation, a special kind of mutation
- translate levels in one variable into a different set of levels

JOIN - a data verb that looks for info to match rows in one table to zero or more rows in a second table

when there is a match between a case in the left table and a case in the right table, the remaining variables from the right table are concatenated onto the case from the left table
Remaining questions
- What happens when a case in the right table has no matches in the left table?
- What happens when a case in the left table has no matches in the right table?
- What happens when there are multiple cases in the left table that match a case in the right table?
all joins behave the same when there are multiple matches in the right table for a case in the left table: each of these multiple matches will produce a case in the output
```
LeftTable %>%
  joinOperation(RightTable, vars_for_matching)
```
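
A small worked example that answers the questions above. The tables `left` and `right` and the matching variable `id` are hypothetical.

```r
library(dplyr)

# Hypothetical tables; `id` is the matching variable
left  <- tibble(id = c(1, 2, 3), name = c("a", "b", "c"))
right <- tibble(id = c(1, 3, 3, 4), score = c(10, 30, 31, 40))

left_out  <- left %>% left_join(right, by = "id")   # all cases from left; id 2 gets NA for score
inner_out <- left %>% inner_join(right, by = "id")  # only cases from left with a match in right
full_out  <- left %>% full_join(right, by = "id")   # all cases from both tables

nrow(left_out)   # 4: id 1, id 2 (unmatched), and id 3 twice
nrow(inner_out)  # 3: id 1, and id 3 twice
nrow(full_out)   # 5: the left join rows plus id 4 from right
```

In all three results the case with `id` 3 appears twice, once for each of its matches in `right`.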

When used with the chaining syntax %>%, most data verbs have arguments that consist only of reduction and transformation functions, constants, and variables.

MUTATING JOINS

inner equijoin
outer joins
- left outer join
- right outer join
- full outer join

INNER EQUIJOIN

an inner join matches pairs of observations whenever their keys are equal
inner equijoin, because the keys are matched using the equality operator
unmatched rows are not included in the result
an inner join keeps observations that appear in both tables

OUTER JOIN

an outer join keeps observations that appear in at least one of the tables
outer joins work by adding an additional virtual observation to each table; this observation has a key that always matches (if no other key matches), and a value filled with NA

LEFT OUTER JOIN

keeps all observations in x

RIGHT OUTER JOIN

keeps all observations in y

FULL OUTER JOIN

keeps all observations in x and y

NATURAL JOIN

the default by = NULL uses all variables that appear in both tables

Filtering Joins

match observations in the same way as mutating joins but affect observations, not variables

SEMI JOIN

keeps all observations in x that have a match in y
useful for matching filtered summary tables back to the original rows

ANTI JOIN

drops all observations in x that have a match in y
useful for diagnosing join mismatches
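
A sketch of both filtering joins on two hypothetical tables, `x` and `y`, matched on `key`.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

kept    <- semi_join(x, y, by = "key")  # rows of x with a match in y (keys 1 and 2)
dropped <- anti_join(x, y, by = "key")  # rows of x without a match in y (key 3)

# Filtering joins never add columns from y
names(kept)  # "key" "val_x"
```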

Keys

the variables used to connect each pair of tables are called keys
a key is a variable or set of variables that uniquely identifies an observation
a primary key uniquely identifies an observation in its own table
a foreign key uniquely identifies an observation in another table
a variable can be both a primary key and a foreign key
verify that the primary key uniquely identifies each observation: one way to do this is to count() the primary keys and look for entries where n is greater than one
sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it
surrogate key: if a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number()
a primary key and the corresponding foreign key in another table form a relation
relations are typically one-to-many
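
A sketch of the primary key check and of adding a surrogate key. The `orders` table is hypothetical, and its `id` column deliberately contains a duplicate.

```r
library(dplyr)

# Hypothetical table in which `id` is meant to be the primary key
orders <- tibble(id = c(1, 2, 2, 3), item = c("pen", "ink", "ink", "pad"))

# Any row returned here is a key value that fails to identify a single observation
duplicates <- orders %>%
  count(id) %>%
  filter(n > 1)

# Surrogate key: number the rows when no natural key is reliable
orders_keyed <- orders %>%
  mutate(row_id = row_number())
```

An empty `duplicates` table would mean that `id` uniquely identifies each observation.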

Relations

one-to-many
one-to-one, a special case of one-to-many
many-to-many = many-to-one + one-to-many

filter    - select observations by their values
arrange   - reorder rows
select    - select variables by their names
              starts_with('abc') - matches names that begin with 'abc'
              ends_with('abc')   - matches names that end with 'abc'
              contains('abc')    - matches names that contain 'abc'
              matches('(.)\\1')  - selects variables that match a regular expression
              num_range('x',1:3) - matches x1, x2, and x3
              rename()
              everything()
              any_of()
              contains()
mutate    - create new variables with functions of existing variables
              transmute()
summarize - collapse many values down to a single summary
group_by  - changes the scope of each function from operating on the entire dataset to operating on it group-by-group

+           addition
-           subtraction
*           multiplication
/           division
^           exponentiation
x / sum(x)  the proportion of a total
y - mean(y) the difference from the mean
%/%         integer division
%%          remainder
log()
log2()
log10()
lead()
lag()
x - lag(x) running difference
x != lag(x) find when values change

lag(x)
lead(x)
cumsum(x)
cumprod(x)
cummin(x)
cummax(x)
cummean(x)

min_rank(y)
min_rank(desc(y))
row_number(y)
dense_rank(y)
percent_rank(y)
cume_dist(y)
ntile()
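
A few of the offset, cumulative, and ranking functions above applied to one small made-up vector, to show how they differ on ties.

```r
library(dplyr)

x <- c(3, 3, 5, 2)

lag(x)         # NA 3 3 5
x - lag(x)     # NA 0 2 -3          (running difference)
x != lag(x)    # NA FALSE TRUE TRUE (where values change)
cumsum(x)      # 3 6 11 13
cummean(x)     # 3 3 3.666667 3.25
min_rank(x)    # 2 2 4 1  (ties share the lowest rank, leaving a gap)
dense_rank(x)  # 2 2 3 1  (ties share a rank, no gap)
row_number(x)  # 2 3 4 1  (ties broken by position)
```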

verb
* the first argument is the dataframe
* the subsequent arguments describe what to do with the dataframe using the variable names
* the result is a new dataframe
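
A sketch of several verbs chained together, using the built-in `mtcars` data frame. The particular filter threshold and the derived column `hp_per_wt` are assumptions made for illustration.

```r
library(dplyr)

# Every step takes a data frame and returns a new one; mtcars itself is not modified
by_cyl <- mtcars %>%
  filter(hp > 100) %>%                    # select observations by their values
  mutate(hp_per_wt = hp / wt) %>%         # new variable from existing variables
  group_by(cyl) %>%                       # later steps operate group-by-group
  summarize(n = n(), mean_hp_per_wt = mean(hp_per_wt)) %>%
  arrange(desc(mean_hp_per_wt))           # reorder rows

by_cyl
```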

Exploratory Data Analysis (EDA)

use visualization and transformation to explore data in a systematic way
EDA is an iterative process
- generate questions about the data
- search for the answers by visualizing, transforming, modeling the data
- use what is learned to refine the questions, or generate new questions
EDA is not a formal process with a strict set of rules
some important questions
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?

Variable

a variable is a quantity, quality, or property that can be measured

Value

a value is the state of a variable when you measure it; the value of a variable may change from measurement to measurement

Observation

an observation is a set of measurements made under similar conditions (i.e., at the same time and on the same object)
an observation contains several values each of which is associated with a different variable

Tabular Data

tabular data is a set of values each of which is associated with a variable and an observation
tabular data is tidy if each value is placed in its own cell; each variable in its own column; and each observation in its own row

Variation

variation is the tendency of the values of a variable to change from measurement to measurement
if you measure any continuous variable twice, you will get two different results (this is true even if you measure quantities that are constant like the speed of light)
each of your measurements will include a small amount of error that varies from measurement to measurement
categorical variables can also vary if you measure across different subjects or different times
variation describes the behavior within a variable

Categorical Variable

a variable is categorical if it can only take one of a small set of values
in R, categorical variables are usually saved as factors or character vectors
use a bar chart to examine the distribution of a categorical variable
the height of the bars displays how many observations occurred with each x value (computed manually with dplyr::count())

Continuous Variable

a variable is continuous if it can take any of an infinite set of ordered values (e.g., numbers and datetimes)
use a histogram to examine the distribution of a continuous variable
a histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall into each bin
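
The binning step can be inspected without drawing anything. A minimal sketch in base R with made-up measurements:

```r
# Hypothetical measurements
x <- c(1.2, 1.8, 2.5, 2.7, 3.1, 4.6)

# Bins of width 1 from 1 to 5; plot = FALSE returns the counts instead of drawing
h <- hist(x, breaks = seq(1, 5, by = 1), plot = FALSE)

h$breaks  # 1 2 3 4 5
h$counts  # 2 2 1 1
```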

many of the questions will prompt you to explore a relationship between variables to see if the values of one variable can explain the behavior of another variable

Outlier

outliers are unusual observations that don’t seem to fit the pattern; sometimes they are data entry errors; other times they suggest important new science

Covariation

covariation describes the behavior between variables
covariation is the tendency for the values of two or more variables to vary together in a related way
the best way to spot covariation is to visualize the relationship between two or more variables

Boxplot

display the distribution of a continuous variable broken down by a categorical variable
a boxplot is a type of visual shorthand for a distribution of values
a boxplot consists of
- a box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR)
  - in the middle of the box is a line that displays the median, i.e., the 50th percentile of the distribution
  - these three lines provide a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side
- visual points that display observations that fall more than 1.5 times the IQR from either edge of the box
- a line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution
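
The pieces of a boxplot can be computed by hand. A sketch in base R with made-up data; it assumes the default definition used by `quantile()`, and plotting functions may compute the box edges slightly differently.

```r
# Hypothetical data with one unusually large value
y <- c(2, 4, 5, 6, 7, 8, 30)

q   <- quantile(y, c(0.25, 0.5, 0.75))  # box edges and median: 4.5, 6, 7.5
iqr <- IQR(y)                           # 75th percentile minus 25th percentile: 3

# Points more than 1.5 * IQR beyond either edge of the box are drawn individually
lower_fence <- q[["25%"]] - 1.5 * iqr   # 0
upper_fence <- q[["75%"]] + 1.5 * iqr   # 12
outliers <- y[y < lower_fence | y > upper_fence]  # 30

# Whiskers reach the farthest points that are not outliers
whiskers <- range(y[y >= lower_fence & y <= upper_fence])  # 2 8
```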

Atomic Vectors

6 types of atomic vectors
- Logical
- Numeric
  - Integer
  - Double
- Character
- Complex
- Raw
atomic vectors are homogeneous
each type of atomic vector has its own missing value; you can always use NA and it will be converted to the correct type using the implicit coercion rules
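
A quick check of the six types and of how `NA` takes on the type of the vector it is placed in:

```r
typeof(TRUE)    # "logical"
typeof(1L)      # "integer"
typeof(1.5)     # "double"
typeof("a")     # "character"
typeof(1i)      # "complex"
typeof(raw(1))  # "raw"

# A plain NA is logical, but it is coerced to the type of the vector it joins
typeof(NA)           # "logical"
typeof(c(1.5, NA))   # "double"
typeof(c("a", NA))   # "character"
typeof(NA_integer_)  # "integer"
```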

Recursive Vectors (Lists)

lists can contain other lists
lists can be heterogeneous

NULL

NULL is often used to represent the absence of a vector as opposed to NA which is used to represent the absence of a value in a vector
NULL typically behaves like a vector of length 0
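
A short illustration of the difference between `NULL` and `NA`:

```r
length(NULL)       # 0
c(1, NULL, 2)      # 1 2     (NULL contributes no elements)
c(1, NA, 2)        # 1 NA 2  (NA is a missing value that still occupies a slot)
is.null(list()$a)  # TRUE    (an absent list element is reported as NULL)
```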

Properties of Vectors

type, determined with typeof()
length, determined with length()

Augmented Vectors

vectors can contain arbitrary additional metadata in the form of attributes; attributes are used to create augmented vectors, which build in additional behavior
3 types of augmented vectors
- Factors
  - factors represent categorical data that can take a fixed set of possible values
  - built on top of integer vectors
  - factors have a levels attribute
- Dates and Datetimes
  - built on top of numeric vectors
  - dates: represent the number of days since 1 Jan 1970
  - datetimes: class POSIXct (Portable Operating System Interface, calendar time), represent the number of seconds since 1 Jan 1970
- Data frames and tibbles, built on top of lists
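
Stripping the class shows the underlying vector in each case. A base R sketch with made-up values:

```r
f <- factor(c("lo", "hi", "lo"), levels = c("lo", "hi"))
typeof(f)      # "integer"
attributes(f)  # $levels and $class
as.integer(f)  # 1 2 1

d <- as.Date("1970-01-10")
typeof(d)      # "double"
unclass(d)     # 9 (days since 1970-01-01)

dt <- as.POSIXct("1970-01-01 01:00:00", tz = "UTC")
unclass(dt)    # 3600, plus a tzone attribute (seconds since 1970-01-01 UTC)

typeof(data.frame(a = 1))  # "list"
```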

Logical Vectors

three possible values: FALSE, TRUE, NA
logical vectors are usually constructed with comparison operators; they can also be constructed manually with c()

Numeric Vectors

in R, numbers are doubles by default; to make an integer, place an L after the number
doubles are approximations
integers have one special value: NA
doubles have four special values: NA, NaN, Inf, -Inf, the latter three of which can arise during division
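
A few lines that demonstrate each of the points above:

```r
typeof(1)   # "double"
typeof(1L)  # "integer"

# Doubles are approximations, so compare with a tolerance rather than ==
sqrt(2)^2 == 2                   # FALSE
isTRUE(all.equal(sqrt(2)^2, 2))  # TRUE

# Special values arising from division
c(-1, 0, 1) / 0  # -Inf NaN Inf
```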

Character Vectors

each element of a character vector is a string, and a string can contain an arbitrary amount of data
R uses a global string pool: this means that each unique string is only stored in memory once and every use of the string points to that representation which reduces the amount of memory needed by duplicated strings
a pointer is 8 bytes, so 1000 pointers to a 152 B string take 8 x 1000 + 152 = 8,152 B, or about 8.15 kB

Explicit coercion

as.logical()
as.integer()
as.double()
as.character()

Implicit coercion

of the type of a vector
- when creating a vector containing multiple types with c() the most complex type always wins
of the length of a vector (vector recycling)
- the shorter vector is repeated to the same length as the longer vector
- this is most useful when mixing vectors and “scalars” (vectors of length 1)
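
A sketch of both kinds of implicit coercion, plus explicit coercion for contrast. The values are made up.

```r
# Type coercion: the most complex type wins
typeof(c(TRUE, 1L))  # "integer"
typeof(c(1L, 1.5))   # "double"
typeof(c(1.5, "a"))  # "character"

# Logical to numeric coercion lets sum() and mean() count TRUEs
x <- c(5, 12, 30)
sum(x > 10)   # 2
mean(x > 10)  # 0.6666667

# Length coercion (recycling)
1:6 + 1:2  # 2 4 4 6 6 8
1:6 * 10   # 10 20 30 40 50 60

# Explicit coercion
as.integer("7")  # 7
as.logical(0)    # FALSE
```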

Subsetting

numeric vector of integers
logical vector, keeps all values corresponding to a TRUE value
named vectors can be subsetted with a character vector
x[] returns the complete x
- this is not useful for subsetting vectors, but it is useful for subsetting matrices and other high-dimensional structures because it lets you select all the rows or all the columns by leaving that index blank
[[]] extracts a single element and always drops names
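
Each kind of subsetting on a small named vector and a matrix. The data are made up; dropping elements with negative integers is a standard form of integer subsetting not listed above.

```r
x <- c(a = 10, b = 20, c = 30)

x[c(1, 3)]      # a = 10, c = 30  (positive integers keep those positions)
x[-1]           # b = 20, c = 30  (negative integers drop positions)
x[x > 15]       # b = 20, c = 30  (logical vector keeps the TRUE positions)
x[c("a", "c")]  # a = 10, c = 30  (character vector selects by name)
x[["b"]]        # 20, with the name dropped

m <- matrix(1:6, nrow = 2)
m[1, ]          # 1 3 5  (blank column index selects all columns)
m[, 2]          # 3 4    (blank row index selects all rows)
```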

Attributes

names, used to name the elements of a vector
dimensions (dims), make a vector behave like a matrix or array
class, used to implement the S3 object oriented system
- controls how generic functions work

generic functions

generic functions are key to object oriented programming in R because they make functions behave differently for different classes of input
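
A minimal S3 sketch. The generic `describe()` and its methods are hypothetical names invented for this example; only `UseMethod()` and `nlevels()` are base R.

```r
# A generic only dispatches: UseMethod() picks a method based on the class of x
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) "something else"

# Method used when x has class "factor"
describe.factor <- function(x, ...) paste("a factor with", nlevels(x), "levels")

describe(1:3)                       # "something else"
describe(factor(c("a", "b", "a")))  # "a factor with 2 levels"
```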
