R#


Sections#


Resources#

R Lang#

R

R Markdown

R and Jupyter

  • IRkernel for Jupyter

R and Anaconda

Packages

Cheatsheets

https://swirlstats.com/

Other Resources#


Texts#

Bookdown

  • Baayen, R. H. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press.

  • Baumer, Benjamin; Daniel Kaplan; & Nicholas J. Horton. Modern Data Science with R. 2nd Ed. CRC Press. Home.

  • Buisson, Florent. (2021). Behavioral Data Analysis with R and Python: Customer-Driven Data for Real Business Results. O’Reilly.

  • Bruce, Peter; Andrew Bruce; & Peter Gedeck. (2020). Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. O’Reilly.

  • Chang, Winston. (2018). R Graphics Cookbook: Practical Recipes for Visualizing Data. 2nd Ed. O’Reilly. Home.

  • Davies, Tilman M. (2016). The Book of R: A First Course in Programming and Statistics. No Starch Press.

  • Grolemund, Garrett. Hands-On Programming with R: Write Your Own Functions and Simulations. O’Reilly. Home. GitHub.

  • Hvitfeldt, Emil & Julia Silge. (2022). Supervised Machine Learning for Text Analysis in R. Home.

  • Hyndman, Rob J. & George Athanasopoulos. (2021). Forecasting: Principles and Practice. 3rd Ed. Home.

  • Kaplan, Daniel & Matthew Beckman. (2021). Data Computing. 2nd Ed. Home.

  • Kuhn, Max & Julia Silge. (2022). Tidy Modeling with R: A Framework for Modeling in the Tidyverse. O’Reilly. Home.

  • Long, J. D. & Paul Teetor. (2019). R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics. 2nd Ed. O’Reilly.

  • Matloff, Norman. (2023). The Art of Machine Learning: Algorithms + Data + R. No Starch Press.

  • Matloff, Norman. (2011). The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.

  • Nolan, Deborah & Duncan Temple Lang. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.

  • Scavetta, Rick & Boyan Angelov. (2021). Python and R for the Modern Data Scientist: The Best of Both Worlds. O’Reilly.

  • Silge, Julia & David Robinson. Text Mining with R: A Tidy Approach. Home.

  • Unwin, Antony. (2015). Graphical Data Analysis with R. CRC Press. Home.

  • Wickham, Hadley. (????). ggplot2: Elegant Graphics for Data Analysis. 3rd Ed. Springer. Home.

  • Wickham, Hadley. (2021). Mastering Shiny: Build Interactive Apps, Reports, and Dashboards Powered by R. O’Reilly.

  • Wickham, Hadley. (2019). Advanced R. 2nd Ed. CRC Press.

  • Wickham, Hadley. (2014). “Tidy Data”. Home. CRAN.

  • Wickham, Hadley. (2010). “A Layered Grammar of Graphics”. Paper.

  • Wickham, Hadley & Jennifer Bryan. (????). R Packages: Organize, Test, Document, and Share Your Code. O’Reilly. Home.

  • Wickham, Hadley; Mine Çetinkaya-Rundel; & Garrett Grolemund. (2023). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd Ed. O’Reilly. Home.

  • Wickham, Hadley & Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st Ed. O’Reilly. Home.

  • Wilkinson, Leland. (2005). The Grammar of Graphics. 2nd Ed. Springer.

  • Xie, Yihui; Christophe Dervieux; & Emily Riederer. (2023). R Markdown Cookbook. CRC Press. Home.

  • Xie, Yihui; J. J. Allaire; & Garrett Grolemund. (2018). R Markdown: The Definitive Guide. Chapman & Hall/CRC. Home.


Figures#

  • [W] Wickham, Hadley


Terms#

  • [W] Comprehensive R Archive Network (CRAN)

  • [W] dplyr

  • [W] ggplot2

  • [W] R

  • [W] R Package

  • [W] RStudio

  • [W] Tidyverse


Notes#

Most built-in R functions are vectorized: they operate on whole vectors of values.

  • .libPaths() - get the local paths where R packages are installed. StackOverflow

  • installed.packages() - get the installed packages
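
A minimal sketch of the notes above, using base R only. The values in `x` are made up for illustration, and the output of the two library calls depends on the machine.

```r
# Arithmetic and many built-in functions work element-wise on whole vectors.
x <- c(1, 4, 9, 16)
roots <- sqrt(x)      # one square root per element
shifted <- x + 10     # 10 is added to every element

# Where packages live, and which ones are installed (machine dependent).
lib_paths <- .libPaths()        # character vector of library directories
pkgs <- installed.packages()    # character matrix, one row per package
head(rownames(pkgs))
```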

Data Viz#

Scatterplot

  • shows the relationship between two (or more) variables, across some number of cases

  • usually: Cartesian coordinate system where axes represent a variable

Histogram

  • shows a distribution

  • shows how many cases fall into given ranges of the variable

Frequency Polygon

  • like a histogram, but lets you break up cases by other variables

Bar Chart

  • compares values of a single variable across groups

  • effective when you want to compare a few different quantities

Map

  • shows how a variable relates to geography

Choropleth Map

  • the fill color of each region reflects the value of a variable

Network Diagram

  • shows how entities are connected to one another

  • a vertex corresponds to a case, and the network describes which cases are connected to which other cases

How to create a data graphic:

  • choose the glyph-ready data frame

  • select the kind of graphic

  • MAPPING a variable to a graphical attribute: decide which variables from the data frame will be assigned to which roles in the graphic

    • axes

    • bar lengths

    • glyphs

      • fill color

      • border color

      • shapes

      • sizes

      • transparency

    • etc.
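
The steps above can be sketched with ggplot2, assuming the package is installed. The built-in `mtcars` data frame stands in for a glyph-ready table; the choice of variables for each role is only an illustration.

```r
library(ggplot2)

# Each row of mtcars is one case (one car), drawn as one glyph (a point).
p <- ggplot(mtcars, aes(x = wt, y = mpg,            # axes
                        color = factor(cyl),        # glyph color
                        size = hp)) +               # glyph size
  geom_point(alpha = 0.7) +                         # fixed transparency
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Cylinders", size = "Horsepower")

print(p)
```

Variables placed inside `aes()` are mapped to graphical attributes; values set outside it (here `alpha`) are constants that apply to every glyph.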

interactive

  • mosaic

    • mplot()

    • mWorldMap(key=,fill=)

    • mUSMap(key=,fill=)

  • esquisse

coordinate systems

  • coord_fixed

  • coord_flip

  • coord_map

  • coord_polar

  • coord_quickmap

facets

  • facet_grid(. ~ v)

  • facet_wrap(~ v)

geoms

  • geom_abline

  • geom_bar

  • geom_boxplot

  • geom_col

  • geom_count

  • geom_histogram

  • geom_jitter

  • geom_line

  • geom_point

  • geom_pointrange

position adjustments

  • position_dodge

  • position_fill

  • position_identity

  • position_jitter

  • position_stack

scales

  • scale_color_brewer

stats

  • stat

  • stat_bin

  • stat_count

  • stat_smooth

  • stat_summary

Data Verb

  • single-table in, single-table out

    • arrange()

    • filter()

    • mutate()

    • summarize()

    • transmute()

  • double-table in, single-table out

    • left_join() - the output has all the cases from the left, even if there is no match in the right

    • inner_join() - the output has only the cases from the left with a match in the right

    • full_join() - the output will have all the cases from both the left and the right

MUTATION

  • to mutate means to add new variables to a data frame

  • data verb mutate() allows you to add new variables constructed by transformation operations on existing variables (join data verbs allow something similar, constructing new variables by pulling them out of a different data frame)

  • translation, a special kind of mutation

    • translate levels in one variable into a different set of levels

JOIN - a data verb that looks for info to match rows in one table to zero or more rows in a second table

  • when there is a match between a case in the left table and a case in the right table, the remaining variables from the right table are concatenated onto the case from the left table

  • Remaining questions

    • What happens when a case in the right table has no matches in the left table?

    • What happens when a case in the left table has no matches in the right table?

    • What happens when there are multiple cases in the left table that match a case in the right table?

  • all joins behave the same when there are multiple matches in the right table for a case in the left table: each of these multiple matches will produce a case in the output

    LeftTable %>%
      joinOperation(RightTable, vars_for_matching)
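
    # A small sketch of the pattern above, assuming dplyr is installed.
    # The tables `left` and `right` and their contents are hypothetical.

```r
library(dplyr)

left  <- tibble(id = c(1, 2, 3), name = c("Ana", "Bo", "Cy"))
right <- tibble(id = c(1, 1, 4), grade = c("A", "B", "C"))

# id 1 has two matches in the right table, so it yields two output cases.
left_rows  <- left %>% left_join(right, by = "id")   # all left cases kept
inner_rows <- left %>% inner_join(right, by = "id")  # only matched cases
full_rows  <- left %>% full_join(right, by = "id")   # cases from both sides

nrow(left_rows)   # 4
nrow(inner_rows)  # 2
nrow(full_rows)   # 5
```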
    

When used with the chaining syntax %>%, most data verbs have arguments that consist only of reduction and transformation functions, constants, and variables.

MUTATING JOINS

  • inner equijoin

  • outer joins

    • left outer join

    • right outer join

    • full outer join

INNER EQUIJOIN

  • an inner join matches pairs of observations whenever their keys are equal

  • inner equijoin, because the keys are matched using the equality operator

  • unmatched rows are not included in the result

  • an inner join keeps observations that appear in both tables

OUTER JOIN

  • an outer join keeps observations that appear in at least one of the tables

  • outer joins work by adding an additional virtual observation to each table; this observation has a key that always matches (if no other key matches), and a value filled with NA

LEFT OUTER JOIN

  • keeps all observations in x

RIGHT OUTER JOIN

  • keeps all observations in y

FULL OUTER JOIN

  • keeps all observations in x and y

NATURAL JOIN

  • the default by = NULL uses all variables that appear in both tables

Filtering Joins

  • match observations in the same way as mutating joins but affect observations, not variables

SEMI JOIN

  • keeps all observations in x that have a match in y

  • useful for matching filtered summary tables back to the original rows

ANTI JOIN

  • drops all observations in x that have a match in y

  • useful for diagnosing join mismatches
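
A short sketch of the two filtering joins, assuming dplyr is installed. The `orders` and `customers` tables are hypothetical.

```r
library(dplyr)

orders    <- tibble(order_id = 1:4, customer = c("a", "b", "c", "a"))
customers <- tibble(customer = c("a", "b"))

# semi_join keeps the rows of x with a match in y; no columns are added.
kept <- semi_join(orders, customers, by = "customer")

# anti_join keeps the rows of x without a match in y.
dropped <- anti_join(orders, customers, by = "customer")

kept$order_id     # 1 2 4
dropped$order_id  # 3
```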

Keys

  • the variables used to connect each pair of tables are called keys

  • a key is a variable or set of variables that uniquely identifies an observation

  • a primary key uniquely identifies an observation in its own table

  • a foreign key uniquely identifies an observation in another table

  • a variable can be both a primary key and a foreign key

  • verify that the primary key uniquely identifies each observation: one way to do this is to count() the primary keys and look for entries where n is greater than one

  • sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it

  • surrogate key: if a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number()

  • a primary key and the corresponding foreign key in another table form a relation

  • relations are typically one-to-many
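
The key check and the surrogate key described above might look like this, assuming dplyr is installed. The `people` table is hypothetical and deliberately contains a duplicated `id`.

```r
library(dplyr)

people <- tibble(id = c(1, 2, 2, 3), name = c("Ana", "Bo", "Bo", "Cy"))

# Any key value with n > 1 means `id` does not uniquely identify a row.
duplicates <- people %>%
  count(id) %>%
  filter(n > 1)

# Surrogate key: number the rows when no natural key is reliable.
with_key <- people %>%
  mutate(row_id = row_number())
```

An empty `duplicates` table is the desired result; here it has one row, for `id` 2.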

Relations

  • one-to-many

  • one-to-one, a special case of one-to-many

  • many-to-many = many-to-one + one-to-many

filter    - select observations by their values
arrange   - reorder rows
select    - select variables by their names
              starts_with('abc') - matches names that begin with 'abc'
              ends_with('abc')   - matches names that end with 'abc'
              contains('abc')    - matches names that contain 'abc'
              matches('(.)\\1')  - selects variables that match a regular expression
              num_range('x',1:3) - matches x1, x2, and x3
              rename()
              everything()
              any_of()
mutate    - create new variables with functions of existing variables
              transmute()
summarize - collapse many values down to a single summary
group_by  - changes the scope of each function from operating on the entire dataset to operating on it group-by-group
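
A sketch that chains the verbs listed above on the built-in `mtcars` data, assuming dplyr is installed. The new column `kpl` and the conversion factor 0.4251 (approximate miles per gallon to km per litre) are illustrative choices.

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 100) %>%                           # keep rows by value
  select(mpg, cyl, wt) %>%                       # keep columns by name
  mutate(kpl = mpg * 0.4251) %>%                 # new variable from an existing one
  group_by(cyl) %>%                              # later verbs work per group
  summarize(mean_kpl = mean(kpl), n = n()) %>%   # one row per group
  arrange(desc(mean_kpl))                        # reorder rows

result
```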

+           addition
-           subtraction
*           multiplication
/           division
^           exponentiation
x / sum(x)  the proportion of a total
y - mean(y) the difference from the mean
%/%         integer division
%%          remainder
log()
log2()
log10()
lead()
lag()
x - lag(x) running difference
x != lag(x) find when values change

lag(x)
lead(x)
cumsum(x)
cumprod(x)
cummin(x)
cummax(x)
cummean(x)

min_rank(y)
min_rank(desc(y))
row_number(y)
dense_rank(y)
percent_rank(y)
cume_dist(y)
ntile()
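
A small worked example of the offset, cumulative, and ranking functions above, assuming dplyr is installed (it provides this version of `lag()`, `cummean()`, and the ranking helpers). The vector `x` is made up.

```r
library(dplyr)

x <- c(3, 3, 5, 2)

lagged   <- lag(x)          # NA 3 3 5
run_diff <- x - lag(x)      # NA 0 2 -3
changed  <- x != lag(x)     # NA FALSE TRUE TRUE

totals   <- cumsum(x)       # 3 6 11 13
averages <- cummean(x)      # running mean of the values so far

ranks_min   <- min_rank(x)    # 2 2 4 1 (ties share the lowest rank)
ranks_dense <- dense_rank(x)  # 2 2 3 1 (no gaps after ties)
ranks_row   <- row_number(x)  # 2 3 4 1 (ties broken by position)
```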

verb
* the first argument is the dataframe
* the subsequent arguments describe what to do with the dataframe using the variable names
* the result is a new dataframe

Exploratory Data Analysis (EDA)

  • use visualization and transformation to explore data in a systematic way

  • EDA is an iterative process

    • generate questions about the data

    • search for the answers by visualizing, transforming, modeling the data

    • use what is learned to refine the questions, or generate new questions

  • EDA is not a formal process with a strict set of rules

  • some important questions

    • What type of variation occurs within my variables?

    • What type of covariation occurs between my variables?

Variable

  • a variable is a quantity, quality, or property that can be measured

Value

  • a value is the state of a variable when you measure it; the value of a variable may change from measurement to measurement

Observation

  • an observation is a set of measurements made under similar conditions (i.e., at the same time and on the same object)

  • an observation contains several values each of which is associated with a different variable

Tabular Data

  • tabular data is a set of values each of which is associated with a variable and an observation

  • tabular data is tidy if each value is placed in its own cell; each variable in its own column; and each observation in its own row

Variation

  • variation is the tendency of the values of a variable to change from measurement to measurement

  • if you measure any continuous variable twice, you will get two different results (this is true even if you measure quantities that are constant like the speed of light)

  • each of your measurements will include a small amount of error that varies from measurement to measurement

  • categorical variables can also vary if you measure across different subjects or different times

  • variation describes the behavior within a variable

Categorical Variable

  • a variable is categorical if it can only take one of a small set of values

  • in R, categorical variables are usually saved as factors or character vectors

  • use a bar chart to examine the distribution of a categorical variable

  • the height of the bars displays how many observations occurred with each x value (computed manually with dplyr::count())

Continuous Variable

  • a variable is continuous if it can take any of an infinite set of ordered values (e.g., numbers and datetimes)

  • use a histogram to examine the distribution of a continuous variable

  • a histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall into each bin
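
The counts behind a bar chart and a histogram can be computed without drawing anything. This is a base R sketch with made-up values; `table()` stands in for the per-category count and `hist(plot = FALSE)` returns the per-bin counts.

```r
# Categorical: how many observations take each value.
colors <- c("red", "blue", "red", "green", "red")
bar_heights <- table(colors)

# Continuous: equally spaced bins [0, 2], (2, 4], (4, 6].
x <- c(0.5, 1.2, 1.9, 2.5, 3.1, 3.3, 3.8, 5.6)
h <- hist(x, breaks = c(0, 2, 4, 6), plot = FALSE)

bar_heights[["red"]]  # 3
h$counts              # 3 4 1
```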

many of the questions will prompt you to explore a relationship between variables to see if the values of one variable can explain the behavior of another variable

Outlier

  • outliers are unusual observations that don’t seem to fit the pattern; sometimes they are data entry errors; other times they suggest important new science

Covariation

  • covariation describes the behavior between variables

  • covariation is the tendency for the values of two or more variables to vary together in a related way

  • the best way to spot covariation is to visualize the relationship between two or more variables

Boxplot

  • display the distribution of a continuous variable broken down by a categorical variable

  • a boxplot is a type of visual shorthand for a distribution of values

  • a boxplot consists of

    • a box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR)

      • in the middle of the box is a line that displays the median, i.e., the 50th percentile of the distribution

      • these three lines provide a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side

    • visual points that display observations that fall more than 1.5 times the IQR from either edge of the box

    • a line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution
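
The parts of a boxplot can be computed by hand. This base R sketch uses made-up data and `quantile()` with its default method; plotting functions may compute the box edges slightly differently.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q <- quantile(x, c(0.25, 0.5, 0.75))   # box edges and median line
iqr <- q[[3]] - q[[1]]                 # interquartile range

# Points beyond 1.5 * IQR from either box edge are drawn individually.
lower <- q[[1]] - 1.5 * iqr
upper <- q[[3]] + 1.5 * iqr
outliers <- x[x < lower | x > upper]

# Whiskers reach the farthest points that are not outliers.
whiskers <- range(x[x >= lower & x <= upper])

outliers  # 30
whiskers  # 1 9
```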

Atomic Vectors

  • 6 types of atomic vectors

    • Logical

    • Numeric

      • Integer

      • Double

    • Character

    • Complex

    • Raw

  • atomic vectors are homogeneous

  • each type of atomic vector has its own missing value; you can always use NA and it will be converted to the correct type using the implicit coercion rules

Recursive Vectors (Lists)

  • lists can contain other lists

  • lists can be heterogeneous

NULL

  • NULL is often used to represent the absence of a vector as opposed to NA which is used to represent the absence of a value in a vector

  • NULL typically behaves like a vector of length 0

Properties of Vectors

  • type, determined with typeof()

  • length, determined with length()
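
A quick base R check of the vector types and the two properties above; the values are arbitrary.

```r
types <- c(
  typeof(TRUE),          # "logical"
  typeof(1L),            # "integer"
  typeof(1.5),           # "double"
  typeof("a"),           # "character"
  typeof(1i),            # "complex"
  typeof(raw(1)),        # "raw"
  typeof(list(1, "a"))   # "list" (a recursive vector)
)

# NA takes on the type of the vector it sits in.
na_type <- typeof(c(NA, 1L))   # "integer"

# NULL behaves like a vector of length 0.
null_length <- length(NULL)    # 0
list_length <- length(list(1, "a", list(2)))  # 3
```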

Augmented Vectors

  • vectors can carry arbitrary additional metadata in the form of attributes; attributes are used to create augmented vectors, which add behavior on top of the underlying vector

  • 3 types of augmented vectors

    • Factors

      • factors represent categorical data that can take a fixed set of possible values

      • built on top of integer vectors

      • factors have a levels attribute

    • Dates and Datetimes

      • built on top of numeric vectors

      • dates: represent the number of days since 1 Jan 1970

      • datetimes: class POSIXct (Portable Operating System Interface, calendar time), represent the number of seconds since 1 Jan 1970

    • Data frames and tibbles, built on top of lists
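
Stripping the class with `unclass()` or asking for `typeof()` shows what each augmented vector is built on. A base R sketch with made-up values:

```r
f <- factor(c("lo", "hi", "lo"), levels = c("lo", "hi"))
f_type  <- typeof(f)       # "integer"
f_codes <- as.integer(f)   # 1 2 1, positions in the levels attribute
f_attrs <- names(attributes(f))   # "levels" "class"

d <- as.Date("1970-01-11")
d_days <- unclass(d)       # 10 days since 1 Jan 1970

dt <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
dt_secs <- as.numeric(dt)  # 60 seconds since 1 Jan 1970

df_type <- typeof(mtcars)  # "list": a data frame is a list of columns
```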

Logical Vectors

  • three possible values: FALSE, TRUE, NA

  • logical vectors are usually constructed with comparison operators; they can also be constructed manually with c()

Numeric Vectors

  • in R, numbers are doubles by default; to make an integer, place an L after the number

  • doubles are approximations

  • integers have one special value: NA

  • doubles have four special values: NA, NaN, Inf, -Inf, the latter three of which can arise during division
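
A few base R lines illustrating the points above about integers, approximate doubles, and special values:

```r
typeof(1)     # "double"
typeof(1L)    # "integer"

# Doubles are approximations, so exact equality can fail.
exact <- sqrt(2)^2 == 2                    # FALSE
close <- isTRUE(all.equal(sqrt(2)^2, 2))   # TRUE, compares with a tolerance

# The special values that arise from division.
specials <- c(-1, 0, 1) / 0                # -Inf NaN Inf
```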

Character Vectors

  • each element of a character vector is a string, and a string can contain an arbitrary amount of data

  • R uses a global string pool: this means that each unique string is only stored in memory once and every use of the string points to that representation which reduces the amount of memory needed by duplicated strings

  • a pointer is 8 bytes, so 1000 pointers to a 152 B string take about 8 x 1000 + 152 = 8152 B, roughly 8.15 kB

Explicit coercion

  • as.logical()

  • as.integer()

  • as.double()

  • as.character()

Implicit coercion

  • of the type of a vector

    • when creating a vector containing multiple types with c() the most complex type always wins

  • of the length of a vector (vector recycling)

    • the shorter vector is repeated to the same length as the longer vector

    • this is most useful when mixing vectors and “scalars” (vectors of length 1)
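
Both kinds of coercion in a short base R sketch; the values are arbitrary.

```r
# Explicit coercion.
as_int <- as.integer("42")        # 42L

# Implicit type coercion: the most complex type wins.
t1 <- typeof(c(TRUE, 1L))         # "integer"
t2 <- typeof(c(1L, 1.5))          # "double"
t3 <- typeof(c(1.5, "a"))         # "character"
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2, logicals counted as 0 and 1

# Implicit length coercion (recycling).
with_scalar <- 1:6 + 10           # 11 12 13 14 15 16
with_pair   <- 1:6 + c(0, 100)    # 1 102 3 104 5 106
```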

Subsetting

  • numeric vector of integers

  • logical vector, keeps all values corresponding to a TRUE value

  • named vectors can be subsetted with a character vector

  • x[] returns the complete x

    • this is not useful for subsetting vectors, but it is useful for subsetting matrices and other high-dimensional structures because it lets you select all the rows or all the columns by leaving that index blank

  • [[]] extracts a single element and always drops names
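
Each subsetting form above, sketched in base R on a small named vector and a matrix (both made up):

```r
x <- c(a = 10, b = 20, c = 30)

by_position <- x[c(1, 3)]      # elements 1 and 3
by_negative <- x[-1]           # everything except element 1
by_logical  <- x[x > 15]       # elements where the condition is TRUE
by_name     <- x[c("a", "c")]  # elements named "a" and "c"
single      <- x[["b"]]        # 20, with the name dropped

# Leaving an index blank selects the whole row or column.
m <- matrix(1:6, nrow = 2)
first_row  <- m[1, ]           # 1 3 5
second_col <- m[, 2]           # 3 4
```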

Attributes

  • names, used to name the elements of a vector

  • dimensions (dims), make a vector behave like a matrix or array

  • class, used to implement the S3 object oriented system

    • controls how generic functions work

generic functions

  • generic functions are key to object oriented programming in R because they make functions behave differently for different classes of input
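
A minimal S3 sketch in base R. The generic `describe` and its methods are hypothetical names invented for this example; the dispatch mechanism (`UseMethod()` choosing a method from the class attribute) is the standard one.

```r
# A generic function delegates to a method chosen by the class of `x`.
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists.
describe.default <- function(x, ...) {
  paste("an object of type", typeof(x))
}

# Method used for objects whose class is "factor".
describe.factor <- function(x, ...) {
  paste("a factor with", nlevels(x), "levels")
}

msg_int    <- describe(1:3)                  # "an object of type integer"
msg_factor <- describe(factor(c("a", "b")))  # "a factor with 2 levels"
```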