R for Data Science 1

Wickham, Hadley & Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st Ed. O’Reilly.

Revised

08 Jun 2023

Programming Environment

packages <- c(
  'hexbin',       # library(hexbin)
  'lubridate',    # library(lubridate)
  'maps',         # library(maps)
  'modelr',       # library(modelr)
  'mosaic',       # library(mosaic)
  'mosaicData',   # library(mosaicData)
  'nycflights13', # library(nycflights13)
  'pryr',         # library(pryr)
  'purrr',        # library(purrr)
  'tidyverse',    # library(tidyverse)
  'RcppRoll'      # library(RcppRoll)
)

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

str_c('EXECUTED : ', now())
sessionInfo()
# R.version.string # R.Version()
# .libPaths()
# installed.packages()

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union

Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: ‘mosaic’

The following objects are masked from ‘package:dplyr’:

    count, do, tally

The following object is masked from ‘package:Matrix’:

    mean

The following object is masked from ‘package:ggplot2’:

    stat

The following object is masked from ‘package:modelr’:

    resample

The following objects are masked from ‘package:stats’:

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var

The following objects are masked from ‘package:base’:

    max, mean, min, prod, range, sample, sum

Attaching package: ‘pryr’

The following object is masked from ‘package:mosaic’:

    inspect

The following object is masked from ‘package:dplyr’:

    where

Attaching package: ‘purrr’

The following objects are masked from ‘package:pryr’:

    compose, partial

The following object is masked from ‘package:mosaic’:

    cross

The following object is masked from ‘package:maps’:

    map

── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0     ✔ tibble  3.2.1
✔ readr   2.1.4     ✔ tidyr   1.3.0
✔ stringr 1.5.0     

── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()     masks pryr::compose()
✖ mosaic::count()      masks dplyr::count()
✖ purrr::cross()       masks mosaic::cross()
✖ mosaic::do()         masks dplyr::do()
✖ tidyr::expand()      masks Matrix::expand()
✖ dplyr::filter()      masks stats::filter()
✖ dplyr::lag()         masks stats::lag()
✖ purrr::map()         masks maps::map()
✖ ggformula::na.warn() masks modelr::na.warn()
✖ tidyr::pack()        masks Matrix::pack()
✖ purrr::partial()     masks pryr::partial()
✖ mosaic::resample()   masks modelr::resample()
✖ mosaic::stat()       masks ggplot2::stat()
✖ mosaic::tally()      masks dplyr::tally()
✖ tidyr::unpack()      masks Matrix::unpack()
✖ pryr::where()        masks dplyr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
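
The conflict report above means that a bare name such as `filter` resolves to whichever package was attached last. A minimal sketch of sidestepping this with explicit `pkg::fun` calls, using only the `dplyr::filter()` / `stats::filter()` conflict listed above (the variable names are illustrative):

```r
# Call each masked function through its namespace so attach order is irrelevant.
four_cyl <- dplyr::filter(mtcars, cyl == 4)    # row filter on a data frame

# stats::filter() is a linear filter for series: here a 3-point moving average.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

nrow(four_cyl)
as.numeric(smoothed)
```

The same pattern applies to the other pairs in the report, for example `dplyr::count()` versus `mosaic::count()`.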

'EXECUTED : 2025-01-23 15:23:18.850391'

R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RcppRoll_0.3.0     forcats_1.0.0      stringr_1.5.0      readr_2.1.4       
 [5] tidyr_1.3.0        tibble_3.2.1       tidyverse_2.0.0    purrr_1.0.2       
 [9] pryr_0.1.6         nycflights13_1.0.2 mosaic_1.8.4.2     mosaicData_0.20.3 
[13] ggformula_0.10.4   dplyr_1.1.2        Matrix_1.5-4       ggplot2_3.4.3     
[17] lattice_0.21-8     modelr_0.1.11      maps_3.4.1         lubridate_1.9.2   
[21] hexbin_1.28.3     

loaded via a namespace (and not attached):
 [1] utf8_1.2.3         generics_0.1.3     stringi_1.7.12     hms_1.1.3         
 [5] digest_0.6.31      magrittr_2.0.3     evaluate_0.21      grid_4.3.0        
 [9] timechange_0.2.0   pbdZMQ_0.3-9       fastmap_1.1.1      jsonlite_1.8.5    
[13] backports_1.4.1    fansi_1.0.4        scales_1.2.1       tweenr_2.0.2      
[17] codetools_0.2-19   cli_3.6.1          labelled_2.11.0    rlang_1.1.1       
[21] crayon_1.5.2       polyclip_1.10-4    munsell_0.5.0      base64enc_0.1-3   
[25] withr_2.5.0        repr_1.1.6         tools_4.3.0        tzdb_0.4.0        
[29] uuid_1.1-0         colorspace_2.1-0   mosaicCore_0.9.2.1 broom_1.0.5       
[33] IRdisplay_1.1      vctrs_0.6.3        R6_2.5.1           ggridges_0.5.4    
[37] lifecycle_1.0.3    ggstance_0.3.6     MASS_7.3-58.4      pkgconfig_2.0.3   
[41] pillar_1.9.0       gtable_0.3.3       glue_1.6.2         Rcpp_1.0.10       
[45] ggforce_0.4.1      haven_2.5.2        tidyselect_1.2.0   IRkernel_1.3.2    
[49] farver_2.1.1       htmltools_0.5.5    compiler_4.3.0    

03 - Data Visualization

head(x = mpg, n = 5)

A tibble: 5 × 11
manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
<chr>	<chr>	<dbl>	<int>	<int>	<chr>	<chr>	<int>	<int>	<chr>	<chr>
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact

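# baseline scatterplot: engine displacement vs. highway mileage
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
# map class to color
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
# map class to size (warns: size implies an order that class does not have)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
# map class to alpha (warns for the same reason)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# map class to shape (only six shapes by default, so the seventh class is dropped)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))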

../../../_images/62185af54332fefaa8009b117b95ccc240e087919871ebb5f551aafa6a43b784.png

Warning message:
“Using size for a discrete variable is not advised.”

../../../_images/6c119391fccc56a6293da32159f0a1aad7f647207d0a9f8ef66344d49af564ca.png

Warning message:
“Using alpha for a discrete variable is not advised.”

../../../_images/8c272dd9a365bcf50c00cd3ebde1df842bfcb45c62df148b966b5466defb5e9c.png

Warning message:
“The shape palette can deal with a maximum of 6 discrete values because
more than 6 becomes difficult to discriminate; you have 7. Consider
specifying shapes manually if you must have them.”

Warning message:
“Removed 62 rows containing missing values (`geom_point()`).”

../../../_images/ac5a59bc1b808cf93da314c9be809d9756b5a3fa62cc873d9824388c90159eb4.png

../../../_images/5fe7fe563a7d9998f5817f88d41d62dc62d182c8b2d23d74b916324c80ddec27.png
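
The shape warning above suggests specifying shapes manually. A minimal sketch of that suggestion, assuming the seven shape codes `0:6` are an acceptable (arbitrary) choice; the 62 removed rows correspond to the `suv` class, which sorts last and so receives no default shape:

```r
library(ggplot2)

# mpg$class has seven levels, one more than the default shape palette covers.
n_classes <- length(unique(mpg$class))

# Assumption: shapes 0-6 are an arbitrary illustrative choice,
# assigned to the classes in alphabetical order.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, shape = class)) +
  geom_point() +
  scale_shape_manual(values = 0:6)

# Every row now receives a shape, so no points are dropped.
plotted <- layer_data(p)
sum(is.na(plotted$shape))
```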

ggplot2::mpg %>%
  ggplot() +
    geom_point(mapping=aes(x=displ, y=hwy, color=displ<5))

../../../_images/ae59301642c3db9ce59bf11a5d1c849c0fbaa4394d304c024f3575c1dbfdd9aa.png

ggplot2::mpg %>%
  ggplot() +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_wrap(~ class, nrow=2)
ggplot2::mpg %>%
  ggplot() +
    geom_point(mapping = aes(x = displ, y = hwy)) +
    facet_grid(drv ~ cyl)

../../../_images/da949d4139ff9b7aa9671eb8b311283b9b96f861420aafc50c143fdd70f8b3d7.png

../../../_images/0f3090a844c7c7c771f3cb4dc170608dec64cff3229bb1454bd24e9161f88ec3.png

ggplot2::mpg %>%
  ggplot(mapping = aes(x = displ, y = hwy, color = drv)) +
    geom_point (                               show.legend=FALSE) +
    geom_smooth(mapping = aes(linetype = drv), show.legend=FALSE)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

../../../_images/861240d5fa2a42234c4de7d2712ba8825187c929eeaed938db67f9a40bd1f11b.png

ggplot2::mpg %>%
  ggplot(mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(color = class)) +
    geom_smooth(data = filter(mpg, class == 'subcompact'), se=FALSE)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

../../../_images/9320f8eb71984e237d09a79f446da650b6f4eff3802310cc26a575706ab9fcd6.png

# LINE PLOT
x  <- 1:10
y  <- cumsum(rnorm(10))
df <- data.frame(x, y)
ggplot(df, mapping = aes(x = x, y = y)) +
  geom_line(linewidth = 0.8) +
  ggtitle('Evolution')

../../../_images/d7f0610dc6b697b99c61b5e7fdb11e0d91e6820c24f69c2d75af75e24ac269c9.png

# BOX PLOT
mtcars %>%
  ggplot(mapping = aes(x = as.factor(cyl), y = mpg)) +
    geom_boxplot(fill = 'slateblue', alpha = 0.2) +
    xlab('cyl')

../../../_images/51dd87f8e5503987fddfebc4a5c9fed05bd195e9d688f5df993a94dd4b4caa0e.png

ggplot(data    = mpg,
       mapping = aes(x     = displ,
                     y     = hwy,
                     color = drv)) +
  geom_point() +
  geom_smooth(se=FALSE)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

../../../_images/352f4ad902d6a5e102c7804b9a21f83eb6be0a7718b1d586c9343d4ae815ec72.png

ggplot(data    = mpg,
       mapping = aes(x = displ,
                     y = hwy)) +
  geom_point () +
  geom_smooth()

ggplot() +
  geom_point (data    = mpg,
              mapping = aes(x = displ,
                            y = hwy)) +
  geom_smooth(data    = mpg,
              mapping = aes(x = displ,
                            y = hwy))

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

../../../_images/c9ab4b35d51ea6c6dbf91c37ba68d50d118bfca480d66d6dfd89886a9a3b7e18.png

ggplot(data = diamonds) +
  geom_bar  (mapping = aes(x = cut))
ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

../../../_images/d11e0e33f0dc911cb6a0a617a55010a29043c71d0b50aa99ef0290c0ed947a4f.png

demo <- tribble(
  ~cut,  ~freq,
  'Fair',      1610,
  'Good',      4906,
  'Very Good',12082,
  'Premium',  13791,
  'Ideal',    21551
)
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut,
                         y = freq),
           stat    = 'identity')

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x     = cut,
                         y     = after_stat(prop),
                         group = 1))

ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun     = median
  )
ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat    = 'summary',
    fun.min = min,
    fun.max = max,
    fun     = median
  )

../../../_images/5066383e316b451e8b8363a2c977c86458565e31a76fe46356dc3b36b87c1428.png

../../../_images/d5c36c31ae02ec2f10b1054f617e13852432abcf50d71b5c865c5f0b4d9ed0db.png

../../../_images/e0bb4fc0d12c2bf5abfc5dc2b60388c9394ede2c9d391ba79044aa7674b905a2.png

g  <- ggplot(data = mpg, mapping = aes(x = class))

df <- data.frame(x = rep(c(2.9, 3.1, 4.5), c(5, 10, 4)))
ggplot(data = df, mapping = aes(x)) + geom_bar()
ggplot(data = df, mapping = aes(x)) + geom_histogram(binwidth = 2.5)
df <- data.frame(trt = c('a', 'b', 'c'), outcome = c(2.3, 1.9, 3.2))
ggplot(data = df,  mapping = aes(x = trt, y = outcome)) + geom_point()
ggplot(data = df,  mapping = aes(x = trt, y = outcome)) + geom_col()
ggplot(data = mpg, mapping = aes(y = class)) + geom_bar(mapping = aes(fill = drv), position = position_stack(reverse = TRUE)) + theme(legend.position='top')
g + geom_bar(mapping = aes(fill = drv))
ggplot(data = mpg, mapping = aes(y = class)) + geom_bar()
g + geom_bar(mapping = aes(weight = displ))
g + geom_bar()

../../../_images/8408bf2fa6496b0fee934431b6cf8470a5d3212a85674376fc9f05e883df9407.png

../../../_images/0720f9c46c587ad2125bcc5eb3a8c1ec2ad1e7885197f10028a96ab100fc3ece.png

../../../_images/9483188b889dd00096587834695d391c5b01499d014a88dc0d6db02f49cca401.png

../../../_images/865aa51c0943fe091d54d014c1c215b854a697a083abc9224eb9c6f26d7772c9.png

../../../_images/5451f283dca86460be7fef26acdb272eb977ced2a71d6b219012566c77b97873.png

../../../_images/36270d7a6a234da4a83df2b6aaec31b3849055d11188bb1a4da0b464ccce8b5a.png

../../../_images/773a0a41dbc24cc71eb7f4f6b3870d904775f0f6be12d4ac4e95cbbc3fa393c7.png

../../../_images/25288fb488c755ce8ba7f4d69a114aa474a3011728fba97751252d8c9cf73d85.png

../../../_images/562610a5795c5ccac4ea4c7537d5590e07f70585e0a8f49de55ad1e8fefb5453.png

# 3.7.1 [5]

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), fill = color))

../../../_images/3ad7439ccae15c370195f31c2a82e4f82af5bbfb00f4edc17e2c542e1030b7d4.png

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = 'dodge')
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = 'fill')
ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
  geom_bar(fill = NA, position = 'identity')
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = 'identity')
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

../../../_images/5fe3e55b3d1be67faa8f26fefb905b2653c30d2c2ba934739b908ad1d113aa01.png

../../../_images/5886e79758fb3f25ad2de1c1c3bad62576407f9033214c23afcb9f9c0be0208d.png

../../../_images/924f853b8dff603d607f255a60fca71ddfdbe4cf7446922fcf07196b2faf2fab.png

../../../_images/f58683caf1a596a12868d2ef25611895a4cd86d541a122074e0bb3f2bd0d2b7d.png

../../../_images/bc6097db8f36e66f1a7edf5256c862f24c442be569b35918ff454879d2d20f5f.png

../../../_images/21b9b6ad7076690f4e8c8aa5226cf87393ae8b3fbf5a9d2e728e91a7736f525a.png

../../../_images/bd805cb664e03b540b199099bd3e97142612254636764d9d576efa0d37ece658.png

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = 'jitter')

../../../_images/ca79f1aa51586a2213c3a0bcd7d9ade2c5b5b5fc63da322fbd19444e3d1a72a8.png

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_count()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point()

../../../_images/c9cb91ad620e07d74718cef8e3c9c7b715e0ec759c8c9f60f244ebe63ffca205.png

../../../_images/2276ac1fdef37b1fa4d58c3945ed731c7596a2590491b24ca9efa1ee2f41b21e.png

../../../_images/53b3d5bf668de9ee640447562fd7b393b1fe2d868151f09d3634a533e7a12250.png

head(mpg)

A tibble: 6 × 11
manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
<chr>	<chr>	<dbl>	<int>	<int>	<chr>	<chr>	<int>	<int>	<chr>	<chr>
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = displ, fill = manufacturer))

../../../_images/e1fe6d0e624efc14c2e7931911afc8260ff7c58a67a2165b22c39253f74e1db0.png

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot() + coord_flip()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot()

../../../_images/29b6b32edd0375763b99c7e0813fe358473cc84b94debc7df0ffe8b9bdd4d7d7.png

../../../_images/c95785ff59d270955569a119905a610da90549a4d9765d82f4852e1f9b77303f.png

nz <- map_data('nz')
ggplot(data = nz, mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = 'white', color = 'black') +
  coord_quickmap()
ggplot(data = nz, mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = 'white', color = 'black')

../../../_images/b608c81ae62396d2e034310bb441110663faa4c112ad9d1c1cbbb27b2952c9dc.png

../../../_images/b973bb32c75f8718584744eaa379dda98b27e02fa6ed5febaedf22ffdd7db9fa.png

bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()
bar + coord_polar()

../../../_images/01f0687ad40da84da74fbdc652f85ec1b991b47119bf63b2625b648371b76144.png

../../../_images/322431e65dc5a705d1a86b5f0a8def52bd6b358b0658941fa28938ee56227dee.png

# 3.9.1 [4]

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()

../../../_images/a6cf85314459bcf9555f86c106a9eb96dcb9b37ef4b05609abdf98e8d3a27b9b.png

04 - Workflow: basics

sin(pi/2)
seq(1,10)
typeof(seq(1,10))
seq(1,10,length.out=5)
1:10

1

1
2
3
4
5
6
7
8
9
10

'integer'

1
3.25
5.5
7.75
10

1
2
3
4
5
6
7
8
9
10

07 - Exploratory Data Analysis

diamonds %>%
  count(cut)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut)) +
  theme(text = element_text(size = 20))

A tibble: 5 × 2
cut	n
<ord>	<int>
Fair	1610
Good	4906
Very Good	12082
Premium	13791
Ideal	21551

../../../_images/8586e83af2fef79db67ea7a2b567732905fff3c81af47f8395d5f897522ea0f6.png

diamonds %>%
  count(cut_width(x = carat, width = 0.5))

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5) +
  theme(text = element_text(size = 20))

A tibble: 11 × 2
cut_width(x = carat, width = 0.5)	n
<fct>	<int>
[-0.25,0.25]	785
(0.25,0.75]	29498
(0.75,1.25]	15977
(1.25,1.75]	5313
(1.75,2.25]	2002
(2.25,2.75]	322
(2.75,3.25]	32
(3.25,3.75]	5
(3.75,4.25]	4
(4.25,4.75]	1
(4.75,5.25]	1

../../../_images/2955c29419ba38eef4b22b77a656bd2248566173a002ac81105e7198e24ef5ca.png

smaller <-
  diamonds %>%
    filter(carat < 3)

smaller %>%
  count(cut_width(x = carat, width = 0.1))

ggplot(data = smaller) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.1) +
  theme(text = element_text(size = 20))

A tibble: 27 × 2
cut_width(x = carat, width = 0.1)	n
<fct>	<int>
[0.15,0.25]	785
(0.25,0.35]	10273
(0.35,0.45]	6231
(0.45,0.55]	5417
(0.55,0.65]	2328
(0.65,0.75]	5249
(0.75,0.85]	1725
(0.85,0.95]	2656
(0.95,1.05]	6258
(1.05,1.15]	2687
(1.15,1.25]	2651
(1.25,1.35]	1063
(1.35,1.45]	325
(1.45,1.55]	2556
(1.55,1.65]	738
(1.65,1.75]	631
(1.75,1.85]	140
(1.85,1.95]	57
(1.95,2.05]	1173
(2.05,2.15]	407
(2.15,2.25]	225
(2.25,2.35]	135
(2.35,2.45]	69
(2.45,2.55]	81
(2.55,2.65]	21
(2.65,2.75]	16
(2.75,2.85]	3

../../../_images/6963f7d6fb7e6a761b0c3a76dda2db4c01b653ea60adb4e67c9d513a4c5eca77.png

ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1) +
  theme(text = element_text(size = 20))

../../../_images/66018a7b8ca2ed2259ea772f67ea72b4bad05f3e247e42fce5931cb78bdeeafe.png

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01) +
  theme(text = element_text(size = 20))

../../../_images/09eb2c97e5eceb75c65ab39d66169b156160a41e1f66c9af55a4e20662e72bef.png

ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25) +
  theme(text = element_text(size = 20))

../../../_images/8e3e3f5ac3bbfd088f90cb799dc33bc10e52ca4b33c463c8008e8a8d1fdec4a3.png

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  theme(text = element_text(size = 20))

../../../_images/7bd5c4cab0ccefe66efcc4bda2fa138cf1a4f7942f615b3858a801082c4d84f1.png

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50)) +
  theme(text = element_text(size = 20))

../../../_images/565a82546323709987c5bd285bc622b9a83e0d0df565d35cd685ae1e7e725762.png

unusual <-
  diamonds %>%
    filter(y < 3 | y > 20) %>%
    select(price, x, y, z) %>%
    arrange(y)
unusual

A tibble: 9 × 4
price	x	y	z
<int>	<dbl>	<dbl>	<dbl>
5139	0.00	0.0	0.00
6381	0.00	0.0	0.00
12800	0.00	0.0	0.00
15686	0.00	0.0	0.00
18034	0.00	0.0	0.00
2130	0.00	0.0	0.00
2130	0.00	0.0	0.00
2075	5.15	31.8	5.12
12210	8.09	58.9	8.06

7.4 - Missing values

Drop the unusual values

diamonds2 <-
  diamonds %>%
    filter(between(x = y, left = 3, right = 20))
head(x = diamonds2)

A tibble: 6 × 10
carat	cut	color	clarity	depth	table	price	x	y	z
<dbl>	<ord>	<ord>	<ord>	<dbl>	<dbl>	<int>	<dbl>	<dbl>	<dbl>
0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
0.23	Good	E	VS1	56.9	65	327	4.05	4.07	2.31
0.29	Premium	I	VS2	62.4	58	334	4.20	4.23	2.63
0.31	Good	J	SI2	63.3	58	335	4.34	4.35	2.75
0.24	Very Good	J	VVS2	62.8	57	336	3.94	3.96	2.48

Replace the unusual values with missing values

diamonds2 <-
  diamonds %>%
    mutate(y = ifelse(test = y < 3 | y > 20, yes = NA, no = y))
head(x = diamonds2)

A tibble: 6 × 10
carat	cut	color	clarity	depth	table	price	x	y	z
<dbl>	<ord>	<ord>	<ord>	<dbl>	<dbl>	<int>	<dbl>	<dbl>	<dbl>
0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
0.23	Good	E	VS1	56.9	65	327	4.05	4.07	2.31
0.29	Premium	I	VS2	62.4	58	334	4.20	4.23	2.63
0.31	Good	J	SI2	63.3	58	335	4.34	4.35	2.75
0.24	Very Good	J	VVS2	62.8	57	336	3.94	3.96	2.48

diamonds2 <-
  diamonds %>%
    mutate(y = case_when(y >= 3 & y <= 20 ~ y, .default = NA))
head(x = diamonds2)

A tibble: 6 × 10
carat	cut	color	clarity	depth	table	price	x	y	z
<dbl>	<ord>	<ord>	<ord>	<dbl>	<dbl>	<int>	<dbl>	<dbl>	<dbl>
0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
0.23	Good	E	VS1	56.9	65	327	4.05	4.07	2.31
0.29	Premium	I	VS2	62.4	58	334	4.20	4.23	2.63
0.31	Good	J	SI2	63.3	58	335	4.34	4.35	2.75
0.24	Very Good	J	VVS2	62.8	57	336	3.94	3.96	2.48
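
A third variant, sketched here as an alternative to the `ifelse()` and `case_when()` cells above: `dplyr::if_else()` is stricter about types, so the missing value is written as `NA_real_` to match the double column. The name `diamonds_na` is illustrative. The count of replaced values should match the nine rows of `unusual`:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# if_else() expects both branches to share a type, hence NA_real_ for a
# double column.
diamonds_na <- diamonds %>%
  mutate(y = if_else(y < 3 | y > 20, NA_real_, y))

sum(is.na(diamonds_na$y))            # number of values replaced
nrow(diamonds_na) == nrow(diamonds)  # no rows were dropped
```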

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE) +
  theme(text = element_text(size = 20))

../../../_images/7d9e821b9c469372fa6b167a2840c8c5d1444b11fb29878e2433b38541d80113.png

nycflights13::flights %>%
  mutate(
    cancelled      = is.na(dep_time),
    sched_hour     = sched_dep_time %/% 100,
    sched_min      = sched_dep_time %%  100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
    ggplot(mapping = aes(sched_dep_time)) +
      geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4) +
      theme(text = element_text(size = 20))

../../../_images/d51c1dae934921c0e41bed5730c9c9c5e2bee4f795105b3acf3ade1045843de6.png

7.5 - Covariation

7.5.1 - One categorical variable and one continuous variable

ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500) +
  theme(text = element_text(size = 20))

../../../_images/1229d2ea2973bff191a7aa863617d835c8a1d5ff0ac5b62871d2bd46ec319266.png

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut)) +
  theme(text = element_text(size = 20))

ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500) +
  theme(text = element_text(size = 20))

../../../_images/bdec9f62f5143512ed145bf27c830e41cdba17ee7aff6596b1ff08f69a87116a.png
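
Plotting density instead of count makes the cuts comparable because each group is rescaled to unit area. A rough base-R analogue for the 'Fair' cut, assuming 500-dollar bins starting at 0 (ggplot2 places its bin boundaries differently, so the individual values are not identical to the plot):

```r
# Prices of 'Fair' diamonds, binned in 500-dollar steps.
fair_price <- ggplot2::diamonds$price[ggplot2::diamonds$cut == 'Fair']
bins <- hist(fair_price, breaks = seq(0, 19000, by = 500), plot = FALSE)

# density = count / (n * bin width), so the bar areas sum to one.
manual_density <- bins$counts / (length(fair_price) * 500)
area <- sum(bins$density * diff(bins$breaks))
area
```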

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot() +
  theme(text = element_text(size = 20))

../../../_images/6dc6d189d574f906e27ed0646e50781638146ad77cf79c8a151bf719b4e2df84.png

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  theme(text = element_text(size = 20))

../../../_images/0316634c75b2e49652a6eac6a7a612004aca1255487f234eac01f91f2270d5a3.png

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  theme(text = element_text(size = 20))

../../../_images/cc3381012eb320f475b2edf233eba04e93d9bb9420dc50a493cff2e47030ac25.png

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip() +
  theme(text = element_text(size = 20))

../../../_images/bb4d5af2e3d80ea37778e39886b20766ee4301f1b8228428180a6638121e6827.png

7.5.2 - Two categorical variables

# to visualize the covariation between two categorical variables, count the number of observations for each combination of values
# one way to do this is the built-in `geom_count()`: the size of each circle shows how many observations occurred at each combination
# covariation then appears as a strong correlation between specific x values and specific y values
# another approach is to compute the counts with dplyr and visualize them with `geom_tile()` and the fill aesthetic
# if the categorical variables are unordered, the seriation package can reorder the rows and columns simultaneously to reveal patterns more clearly
# for larger plots, try the d3heatmap or heatmaply packages, which create interactive plots

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color)) +
  theme(text = element_text(size = 20))

diamonds %>%
  count(color, cut)

diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n)) +
    theme(text = element_text(size = 20))

A tibble: 35 × 3
color	cut	n
<ord>	<ord>	<int>
D	Fair	163
D	Good	662
D	Very Good	1513
D	Premium	1603
D	Ideal	2834
E	Fair	224
E	Good	933
E	Very Good	2400
E	Premium	2337
E	Ideal	3903
F	Fair	312
F	Good	909
F	Very Good	2164
F	Premium	2331
F	Ideal	3826
G	Fair	314
G	Good	871
G	Very Good	2299
G	Premium	2924
G	Ideal	4884
H	Fair	303
H	Good	702
H	Very Good	1824
H	Premium	2360
H	Ideal	3115
I	Fair	175
I	Good	522
I	Very Good	1204
I	Premium	1428
I	Ideal	2093
J	Fair	119
J	Good	307
J	Very Good	678
J	Premium	808
J	Ideal	896

../../../_images/5c9d951aa22dbd782cf5770c33964875d504e3d7cacea917bc046e5b9ec3205b.png

../../../_images/651f63353854b8281d2a7d4e514de3d7e94f5812c705ac36f79d155d5d6bed44.png
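
The per-combination counts that `geom_count()` and `geom_tile()` display can also be read as a contingency table. A minimal base-R sketch (the name `combos` is illustrative), which should agree with the 35-row `count(color, cut)` output above:

```r
library(ggplot2)  # provides the diamonds data

# Rows are color grades, columns are cut grades.
combos <- table(color = diamonds$color, cut = diamonds$cut)

dim(combos)
combos['D', 'Fair']
```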

7.5.3 - Two continuous variables

# one way to visualize the covariation between two continuous variables is to draw a scatterplot with `geom_point()`
# covariation can be seen as a pattern in the points
# for example, an exponential relationship between carat size and price of a diamond can be seen

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  theme(text = element_text(size = 20))

../../../_images/6ea853c16bdc2584d5bad5ae79d21513704537d946975624b5e5a277ebecbe97.png

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 1/100) +
  theme(text = element_text(size = 20))

../../../_images/fe2c1b7d88ee9a42ebf1377c38677220d1df7f9836c68abb6f18bb167d2c12c1.png

ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price)) +
  theme(text = element_text(size = 20))

../../../_images/65b85acb03e02ec868e3a9727d390d57121dbb505546748724545750afdd9a98.png

ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price)) +
  theme(text = element_text(size = 20))

../../../_images/f3503763f930329bf1e5baae0b33dcd3c48ca134c6a34002dfadf6291175eb58.png

ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(x = carat, width = 0.1))) +
  theme(text = element_text(size = 20))

../../../_images/81a14924916b96ba5feca662902489316666a97016a52d1795261ca1268ff8aa.png

ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(x = carat, n = 20))) +
  theme(text = element_text(size = 20))

../../../_images/4e18973d3e08d9cd5965cc3cb989222c956cc671eb41f775be862ef0ce7a38c1.png
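
The last two box plots differ only in how `carat` is binned. A small sketch of the difference, using the same `carat < 3` subset as `smaller`: `cut_width()` fixes the bin width and lets the counts vary, while `cut_number()` aims for similar counts and lets the widths vary.

```r
library(ggplot2)

carat <- diamonds$carat[diamonds$carat < 3]

# cut_width(): bins of equal width, so counts per bin vary widely.
by_width <- table(cut_width(carat, width = 0.5))

# cut_number(): bins with roughly equal counts, so widths vary instead.
by_number <- table(cut_number(carat, n = 20))

range(by_width)
range(by_number)
```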

7.6 - Patterns and models

ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting)) +
  theme(text = element_text(size = 20))

../../../_images/50b396023dd7669e457fc10b426e8fe29b0c71c34cc50e17d2bdaf782ebbf11d.png

mod <- lm(formula = log(x = price) ~ log(x = carat), data = diamonds)

diamonds2 <-
  diamonds %>%
    add_residuals(model = mod) %>%
    mutate(resid = exp(x = resid))

ggplot(data = diamonds2) +
  geom_point(mapping = aes(x = carat, y = resid)) +
  theme(text = element_text(size = 20))

../../../_images/a1efa2577afeefe88af4db9e339fa700dbea7021d97dc66de34e1213faa14d97.png

ggplot(data = diamonds2) +
  geom_boxplot(mapping = aes(x = cut, y = resid)) +
  theme(text = element_text(size = 20))

../../../_images/b9009d739006fdb9c06ad63d6c737349b35ef6e383e6e470c1105194330af6aa.png
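
Because the model is fitted on the log scale, exponentiating a residual gives the ratio of the actual price to the predicted price. A short sketch checking that identity without modelr (the variable names are illustrative):

```r
library(ggplot2)  # provides the diamonds data

mod <- lm(log(price) ~ log(carat), data = diamonds)

# residual = log(price) - log(predicted), so exp(residual) = price / predicted.
ratio_from_resid <- exp(unname(residuals(mod)))
ratio_direct     <- diamonds$price / exp(unname(fitted(mod)))

all.equal(ratio_from_resid, ratio_direct)
```

A value above 1 therefore marks a diamond priced higher than its carat weight alone would predict.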

12 - Tidy Data

table1
table2
table3
table4a
table4b

A tibble: 6 × 4
country	year	cases	population
<chr>	<dbl>	<dbl>	<dbl>
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

A tibble: 12 × 4
country	year	type	count
<chr>	<dbl>	<chr>	<dbl>
Afghanistan	1999	cases	745
Afghanistan	1999	population	19987071
Afghanistan	2000	cases	2666
Afghanistan	2000	population	20595360
Brazil	1999	cases	37737
Brazil	1999	population	172006362
Brazil	2000	cases	80488
Brazil	2000	population	174504898
China	1999	cases	212258
China	1999	population	1272915272
China	2000	cases	213766
China	2000	population	1280428583

A tibble: 6 × 3
country	year	rate
<chr>	<dbl>	<chr>
Afghanistan	1999	745/19987071
Afghanistan	2000	2666/20595360
Brazil	1999	37737/172006362
Brazil	2000	80488/174504898
China	1999	212258/1272915272
China	2000	213766/1280428583

A tibble: 3 × 3
country	1999	2000
<chr>	<dbl>	<dbl>
Afghanistan	745	2666
Brazil	37737	80488
China	212258	213766

A tibble: 3 × 3
country	1999	2000
<chr>	<dbl>	<dbl>
Afghanistan	19987071	20595360
Brazil	172006362	174504898
China	1272915272	1280428583

# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)

# Compute cases per year
table1 %>%
  count(year, wt = cases)

# Visualize changes over time
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country))

A tibble: 6 × 5
country	year	cases	population	rate
<chr>	<dbl>	<dbl>	<dbl>	<dbl>
Afghanistan	1999	745	19987071	0.372741
Afghanistan	2000	2666	20595360	1.294466
Brazil	1999	37737	172006362	2.193930
Brazil	2000	80488	174504898	4.612363
China	1999	212258	1272915272	1.667495
China	2000	213766	1280428583	1.669488

A tibble: 2 × 2
year	n
<dbl>	<dbl>
1999	250740
2000	296920

../../../_images/064c99a4d0d24feadaf03fdd79611b164732df7af602b3d56de52b72d8a6c98a.png

table4a
table4a %>%
  pivot_longer(c(`1999`,`2000`),names_to='year',values_to='cases')
table4b
table4b %>%
  pivot_longer(c(`1999`,`2000`),names_to='year',values_to='population')
tidy4a <- table4a %>%
  pivot_longer(c(`1999`,`2000`),names_to='year',values_to='cases')
tidy4b <- table4b %>%
  pivot_longer(c(`1999`,`2000`),names_to='year',values_to='population')
left_join(tidy4a,tidy4b)

A tibble: 3 × 3
country	1999	2000
<chr>	<dbl>	<dbl>
Afghanistan	745	2666
Brazil	37737	80488
China	212258	213766

A tibble: 6 × 3
country	year	cases
<chr>	<chr>	<dbl>
Afghanistan	1999	745
Afghanistan	2000	2666
Brazil	1999	37737
Brazil	2000	80488
China	1999	212258
China	2000	213766

A tibble: 3 × 3
country	1999	2000
<chr>	<dbl>	<dbl>
Afghanistan	19987071	20595360
Brazil	172006362	174504898
China	1272915272	1280428583

A tibble: 6 × 3
country	year	population
<chr>	<chr>	<dbl>
Afghanistan	1999	19987071
Afghanistan	2000	20595360
Brazil	1999	172006362
Brazil	2000	174504898
China	1999	1272915272
China	2000	1280428583

Joining with `by = join_by(country, year)`

A tibble: 6 × 4
country	year	cases	population
<chr>	<chr>	<dbl>	<dbl>
Afghanistan	1999	745	19987071
Afghanistan	2000	2666	20595360
Brazil	1999	37737	172006362
Brazil	2000	80488	174504898
China	1999	212258	1272915272
China	2000	213766	1280428583

13 - Relational data #

# tibble `flights` connects to tibble `planes`   via variable  `tailnum`
# tibble `flights` connects to tibble `airlines` via variable  `carrier`
# tibble `flights` connects to tibble `airports` via variables `origin` and `dest`
# tibble `flights` connects to tibble `weather`  via variables `origin`, `year`, `month`, `day`, and `hour`

data(package = 'nycflights13')

# tibble `airlines` looks up the full carrier name from its abbreviated code
head(x = nycflights13::airlines, n = 6)

A tibble: 6 × 2
carrier	name
<chr>	<chr>
9E	Endeavor Air Inc.
AA	American Airlines Inc.
AS	Alaska Airlines Inc.
B6	JetBlue Airways
DL	Delta Air Lines Inc.
EV	ExpressJet Airlines Inc.

# tibble `airports` describes each airport, identified by its FAA airport code (variable `faa`)
head(x = nycflights13::airports, n = 6)

A tibble: 6 × 8
faa	name	lat	lon	alt	tz	dst	tzone
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>	<chr>
04G	Lansdowne Airport	41.13047	-80.61958	1044	-5	A	America/New_York
06A	Moton Field Municipal Airport	32.46057	-85.68003	264	-6	A	America/Chicago
06C	Schaumburg Regional	41.98934	-88.10124	801	-6	A	America/Chicago
06N	Randall Airport	41.43191	-74.39156	523	-5	A	America/New_York
09J	Jekyll Island Airport	31.07447	-81.42778	11	-5	A	America/New_York
0A9	Elizabethton Municipal Airport	36.37122	-82.17342	1593	-5	A	America/New_York

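# tibble `flights` gives info about each flight
#   foreign keys - `tailnum` (to `planes`), `carrier` (to `airlines`), `origin` and `dest` (to `airports`)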

head(x = nycflights13::flights, n = 6)

# not quite a primary key: some (year, month, day, flight, tailnum) combinations occur more than once
flights %>%
  count(year, month, day, flight, tailnum) %>%
    filter(n > 1)

A tibble: 6 × 19
year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
<int>	<int>	<int>	<int>	<int>	<dbl>	<int>	<int>	<dbl>	<chr>	<int>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dttm>
2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15	2013-01-01 05:00:00
2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29	2013-01-01 05:00:00
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00

A tibble: 11 × 6
year	month	day	flight	tailnum	n
<int>	<int>	<int>	<int>	<chr>	<int>
2013	2	9	303	NA	2
2013	2	9	655	NA	2
2013	2	9	1623	NA	2
2013	6	8	2269	N487WN	2
2013	6	15	2269	N230WN	2
2013	6	22	2269	N440LV	2
2013	6	29	2269	N707SA	2
2013	7	6	2269	N259WN	2
2013	8	3	2269	N446WN	2
2013	8	10	2269	N478WN	2
2013	12	15	398	NA	2
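
When no combination of variables identifies each row, one option is to add a surrogate key. The sketch below uses a small made-up table (`toy_flights`, not part of nycflights13) and a hypothetical column name `flight_id`; it only illustrates the idea.

```r
library(dplyr)

# Made-up stand-in for `flights`; rows 1 and 2 agree on every candidate key column
toy_flights <- tribble(
  ~year, ~month, ~day, ~flight, ~tailnum,
   2013,      6,    8,    2269, 'N487WN',
   2013,      6,    8,    2269, 'N487WN',
   2013,      6,   15,    2269, 'N230WN'
)

# Surrogate key: the row number is unique by construction
toy_keyed <-
  toy_flights %>%
    mutate(flight_id = row_number())

# Same duplicate check as before, now on the surrogate key
dupes <-
  toy_keyed %>%
    count(flight_id) %>%
      filter(n > 1)
dupes
```

An empty result from the final check suggests that `flight_id` can serve as a primary key.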

# tibble `planes` describes each plane, identified by its `tailnum`
#   primary key - variable `tailnum` (the check below returns no duplicates)

head(x = nycflights13::planes, n = 6)

planes %>%
  count(tailnum) %>%
    filter(n > 1)

A tibble: 6 × 9
tailnum	year	type	manufacturer	model	engines	seats	speed	engine
<chr>	<int>	<chr>	<chr>	<chr>	<int>	<int>	<int>	<chr>
N10156	2004	Fixed wing multi engine	EMBRAER	EMB-145XR	2	55	NA	Turbo-fan
N102UW	1998	Fixed wing multi engine	AIRBUS INDUSTRIE	A320-214	2	182	NA	Turbo-fan
N103US	1999	Fixed wing multi engine	AIRBUS INDUSTRIE	A320-214	2	182	NA	Turbo-fan
N104UW	1999	Fixed wing multi engine	AIRBUS INDUSTRIE	A320-214	2	182	NA	Turbo-fan
N10575	2002	Fixed wing multi engine	EMBRAER	EMB-145LR	2	55	NA	Turbo-fan
N105UW	1999	Fixed wing multi engine	AIRBUS INDUSTRIE	A320-214	2	182	NA	Turbo-fan

A tibble: 0 × 2
tailnum	n
<chr>	<int>

# tibble `weather` gives the weather at each NYC airport for each hour
#   primary key? - variables `year`, `month`, `day`, `hour`, `origin`

head(x = nycflights13::weather, n = 6)

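# not quite a primary key: the only duplicates fall at hour 1 on 3 November 2013,
#   when clocks were set back at the end of daylight saving time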
weather %>%
  count(year, month, day, hour, origin) %>%
    filter(n > 1)

A tibble: 6 × 15
origin	year	month	day	hour	temp	dewp	humid	wind_dir	wind_speed	wind_gust	precip	pressure	visib	time_hour
<chr>	<int>	<int>	<int>	<int>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dttm>
EWR	2013	1	1	1	39.02	26.06	59.37	270	10.35702	NA	0	1012.0	10	2013-01-01 01:00:00
EWR	2013	1	1	2	39.02	26.96	61.63	250	8.05546	NA	0	1012.3	10	2013-01-01 02:00:00
EWR	2013	1	1	3	39.02	28.04	64.43	240	11.50780	NA	0	1012.5	10	2013-01-01 03:00:00
EWR	2013	1	1	4	39.92	28.04	62.21	250	12.65858	NA	0	1012.2	10	2013-01-01 04:00:00
EWR	2013	1	1	5	39.02	28.04	64.43	260	12.65858	NA	0	1011.9	10	2013-01-01 05:00:00
EWR	2013	1	1	6	37.94	28.04	67.21	240	11.50780	NA	0	1012.4	10	2013-01-01 06:00:00

A tibble: 3 × 6
year	month	day	hour	origin	n
<int>	<int>	<int>	<int>	<chr>	<int>
2013	11	3	1	EWR	2
2013	11	3	1	JFK	2
2013	11	3	1	LGA	2

flights2 <-
  flights %>%
    select(year:day, hour, origin, dest, tailnum, carrier)
head(x = flights2, n = 6)

A tibble: 6 × 8
year	month	day	hour	origin	dest	tailnum	carrier
<int>	<int>	<int>	<dbl>	<chr>	<chr>	<chr>	<chr>
2013	1	1	5	EWR	IAH	N14228	UA
2013	1	1	5	LGA	IAH	N24211	UA
2013	1	1	5	JFK	MIA	N619AA	AA
2013	1	1	5	JFK	BQN	N804JB	B6
2013	1	1	6	LGA	ATL	N668DN	DL
2013	1	1	5	EWR	ORD	N39463	UA

flights2 %>%
  select(-origin, -dest) %>%
    left_join(y = airlines, by = 'carrier') %>%
      head(n = 6)

A tibble: 6 × 7
year	month	day	hour	tailnum	carrier	name
<int>	<int>	<int>	<dbl>	<chr>	<chr>	<chr>
2013	1	1	5	N14228	UA	United Air Lines Inc.
2013	1	1	5	N24211	UA	United Air Lines Inc.
2013	1	1	5	N619AA	AA	American Airlines Inc.
2013	1	1	5	N804JB	B6	JetBlue Airways
2013	1	1	6	N668DN	DL	Delta Air Lines Inc.
2013	1	1	5	N39463	UA	United Air Lines Inc.

flights2 %>%
  select(-origin, -dest) %>%
    mutate(
      name = airlines$name[match(carrier, airlines$carrier)]
    ) %>%
      head(n = 6)

A tibble: 6 × 7
year	month	day	hour	tailnum	carrier	name
<int>	<int>	<int>	<dbl>	<chr>	<chr>	<chr>
2013	1	1	5	N14228	UA	United Air Lines Inc.
2013	1	1	5	N24211	UA	United Air Lines Inc.
2013	1	1	5	N619AA	AA	American Airlines Inc.
2013	1	1	5	N804JB	B6	JetBlue Airways
2013	1	1	6	N668DN	DL	Delta Air Lines Inc.
2013	1	1	5	N39463	UA	United Air Lines Inc.

x <- tribble(
  ~key, ~val_x,
     1,   'x1',
     2,   'x2',
     3,   'x3'
)

y <- tribble(
  ~key, ~val_y,
     1,   'y1',
     2,   'y2',
     4,   'y3'
)

inner_join(x = x, y = y, by = 'key')

base::merge(x = x, y = y)

A tibble: 2 × 3
key	val_x	val_y
<dbl>	<chr>	<chr>
1	x1	y1
2	x2	y2

A data.frame: 2 × 3
key	val_x	val_y
<dbl>	<chr>	<chr>
1	x1	y1
2	x2	y2
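
The inner join above drops keys 3 and 4 because they have no match. As a sketch of the alternatives, the outer joins below keep unmatched rows and fill the missing side with `NA`. The object names (`left`, `right`, `full`, `base_left`, `base_full`) are illustrative only.

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
     1,   'x1',
     2,   'x2',
     3,   'x3'
)

y <- tribble(
  ~key, ~val_y,
     1,   'y1',
     2,   'y2',
     4,   'y3'
)

# keep every row of x; key 3 has no match, so val_y is NA
left  <- left_join(x = x, y = y, by = 'key')
# keep every row of y; key 4 has no match, so val_x is NA
right <- right_join(x = x, y = y, by = 'key')
# keep every row of both tables
full  <- full_join(x = x, y = y, by = 'key')

# base R equivalents use the `all.x`, `all.y`, and `all` arguments of merge()
base_left <- base::merge(x = x, y = y, all.x = TRUE)
base_full <- base::merge(x = x, y = y, all = TRUE)

full
```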

# ONE TABLE HAS DUPLICATE KEYS
#   this is useful when you want to add in additional info (there is typically a one-to-many relationship)

x <- tribble(
  ~key, ~val_x,
     1,   'x1',
     2,   'x2',
     2,   'x3',
     1,   'x4'
)

y <- tribble(
  ~key, ~val_y,
     1,   'y1',
     2,   'y2'
)

left_join(x = x, y = y, by = 'key')

base::merge(x = x, y = y)

A tibble: 4 × 3
key	val_x	val_y
<dbl>	<chr>	<chr>
1	x1	y1
2	x2	y2
2	x3	y2
1	x4	y1

A data.frame: 4 × 3
key	val_x	val_y
<dbl>	<chr>	<chr>
1	x1	y1
1	x4	y1
2	x2	y2
2	x3	y2

# BOTH TABLES HAVE DUPLICATE KEYS
#   this is usually an error because in neither table do the keys uniquely identify an observation
#   when you join duplicated keys you get all possible combinations, the Cartesian product

x <- tribble(
  ~key, ~val_x,
     1,   'x1',
     2,   'x2',
     2,   'x3',
     3,   'x4'
)

y <- tribble(
  ~key, ~val_y,
     1,   'y1',
     2,   'y2',
     2,   'y3',
     3,   'y4'
)

left_join(x = x, y = y, by = 'key')

base::merge(x = x, y = y)

Warning message in left_join(x = x, y = y, by = "key"):
“Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship = "many-to-many"` to silence this warning.”

A tibble: 6 × 3
key	val_x	val_y
<dbl>	<chr>	<chr>
1	x1	y1
2	x2	y2
2	x2	y3
2	x3	y2
2	x3	y3
3	x4	y4

A data.frame: 6 × 3
key	val_x	val_y
<dbl>	<chr>	<chr>
1	x1	y1
2	x2	y2
2	x2	y3
2	x3	y2
2	x3	y3
3	x4	y4
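
If the Cartesian product really is intended, the warning itself suggests declaring the relationship. A minimal sketch, assuming dplyr 1.1.0 or later (where the `relationship` argument is available):

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
     1,   'x1',
     2,   'x2',
     2,   'x3',
     3,   'x4'
)

y <- tribble(
  ~key, ~val_y,
     1,   'y1',
     2,   'y2',
     2,   'y3',
     3,   'y4'
)

# Declaring the relationship tells dplyr that duplicate keys on both sides are expected,
# so the join returns the same six rows without the warning
joined <- left_join(x = x, y = y, by = 'key', relationship = 'many-to-many')
joined
```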

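# NATURAL JOIN
#   with no `by` argument, the join uses every variable that appears in both tables
#   here that is equivalent to by = c('year', 'month', 'day', 'hour', 'origin')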

flights2 %>%
  left_join(y = weather) %>%
    head()

Joining with `by = join_by(year, month, day, hour, origin)`

A tibble: 6 × 18
year	month	day	hour	origin	dest	tailnum	carrier	temp	dewp	humid	wind_dir	wind_speed	wind_gust	precip	pressure	visib	time_hour
<int>	<int>	<int>	<dbl>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dttm>
2013	1	1	5	EWR	IAH	N14228	UA	39.02	28.04	64.43	260	12.65858	NA	0	1011.9	10	2013-01-01 05:00:00
2013	1	1	5	LGA	IAH	N24211	UA	39.92	24.98	54.81	250	14.96014	21.86482	0	1011.4	10	2013-01-01 05:00:00
2013	1	1	5	JFK	MIA	N619AA	AA	39.02	26.96	61.63	260	14.96014	NA	0	1012.1	10	2013-01-01 05:00:00
2013	1	1	5	JFK	BQN	N804JB	B6	39.02	26.96	61.63	260	14.96014	NA	0	1012.1	10	2013-01-01 05:00:00
2013	1	1	6	LGA	ATL	N668DN	DL	39.92	24.98	54.81	260	16.11092	23.01560	0	1011.7	10	2013-01-01 06:00:00
2013	1	1	5	EWR	ORD	N39463	UA	39.02	28.04	64.43	260	12.65858	NA	0	1011.9	10	2013-01-01 05:00:00

flights2 %>%
  left_join(y = planes, by = 'tailnum') %>%
    head()

A tibble: 6 × 16
year.x	month	day	hour	origin	dest	tailnum	carrier	year.y	type	manufacturer	model	engines	seats	speed	engine
<int>	<int>	<int>	<dbl>	<chr>	<chr>	<chr>	<chr>	<int>	<chr>	<chr>	<chr>	<int>	<int>	<int>	<chr>
2013	1	1	5	EWR	IAH	N14228	UA	1999	Fixed wing multi engine	BOEING	737-824	2	149	NA	Turbo-fan
2013	1	1	5	LGA	IAH	N24211	UA	1998	Fixed wing multi engine	BOEING	737-824	2	149	NA	Turbo-fan
2013	1	1	5	JFK	MIA	N619AA	AA	1990	Fixed wing multi engine	BOEING	757-223	2	178	NA	Turbo-fan
2013	1	1	5	JFK	BQN	N804JB	B6	2012	Fixed wing multi engine	AIRBUS	A320-232	2	200	NA	Turbo-fan
2013	1	1	6	LGA	ATL	N668DN	DL	1991	Fixed wing multi engine	BOEING	757-232	2	178	NA	Turbo-fan
2013	1	1	5	EWR	ORD	N39463	UA	2012	Fixed wing multi engine	BOEING	737-924ER	2	191	NA	Turbo-fan
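
Both tables have a `year` column that is not part of the key, so the join disambiguates them as `year.x` and `year.y`. As a sketch, the `suffix` argument can give them more descriptive names. The miniature tables and the suffixes `_flight` and `_built` are made up for illustration.

```r
library(dplyr)

# Made-up miniature versions of `flights2` and `planes`
toy_flights <- tribble(
  ~year, ~tailnum,
   2013, 'N14228',
   2013, 'N24211'
)

toy_planes <- tribble(
  ~tailnum, ~year,
  'N14228',  1999,
  'N24211',  1998
)

# `suffix` replaces the default '.x' / '.y' on clashing non-key columns
joined <-
  toy_flights %>%
    left_join(y = toy_planes, by = 'tailnum', suffix = c('_flight', '_built'))

names(joined)
```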

flights2 %>%
  left_join(y = airports, by = c('dest' = 'faa')) %>%
    head()

A tibble: 6 × 15
year	month	day	hour	origin	dest	tailnum	carrier	name	lat	lon	alt	tz	dst	tzone
<int>	<int>	<int>	<dbl>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>	<chr>
2013	1	1	5	EWR	IAH	N14228	UA	George Bush Intercontinental	29.98443	-95.34144	97	-6	A	America/Chicago
2013	1	1	5	LGA	IAH	N24211	UA	George Bush Intercontinental	29.98443	-95.34144	97	-6	A	America/Chicago
2013	1	1	5	JFK	MIA	N619AA	AA	Miami Intl	25.79325	-80.29056	8	-5	A	America/New_York
2013	1	1	5	JFK	BQN	N804JB	B6	NA	NA	NA	NA	NA	NA	NA
2013	1	1	6	LGA	ATL	N668DN	DL	Hartsfield Jackson Atlanta Intl	33.63672	-84.42807	1026	-5	A	America/New_York
2013	1	1	5	EWR	ORD	N39463	UA	Chicago Ohare Intl	41.97860	-87.90484	668	-6	A	America/Chicago

flights2 %>%
  left_join(y = airports, by = c('origin' = 'faa')) %>%
    head()

A tibble: 6 × 15
year	month	day	hour	origin	dest	tailnum	carrier	name	lat	lon	alt	tz	dst	tzone
<int>	<int>	<int>	<dbl>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<chr>	<chr>
2013	1	1	5	EWR	IAH	N14228	UA	Newark Liberty Intl	40.69250	-74.16867	18	-5	A	America/New_York
2013	1	1	5	LGA	IAH	N24211	UA	La Guardia	40.77725	-73.87261	22	-5	A	America/New_York
2013	1	1	5	JFK	MIA	N619AA	AA	John F Kennedy Intl	40.63975	-73.77893	13	-5	A	America/New_York
2013	1	1	5	JFK	BQN	N804JB	B6	John F Kennedy Intl	40.63975	-73.77893	13	-5	A	America/New_York
2013	1	1	6	LGA	ATL	N668DN	DL	La Guardia	40.77725	-73.87261	22	-5	A	America/New_York
2013	1	1	5	EWR	ORD	N39463	UA	Newark Liberty Intl	40.69250	-74.16867	18	-5	A	America/New_York

options(repr.plot.width=20, repr.plot.height=10)

airports %>%
  semi_join(y = flights, by = c('faa' = 'dest')) %>%
    ggplot(mapping = aes(x = lon, y = lat)) +
      borders('state') +
      geom_point() +
      coord_quickmap() +
      theme(
        text=element_text(size=20)
      )

[Figure: map of US state borders with a point at each destination airport, positioned by longitude and latitude]

Filtering Joins #

Semi Join #

x <- tribble(
  ~key, ~val_x,
     1,   'x1',
     2,   'x2',
     3,   'x3'
)

y <- tribble(
  ~key, ~val_y,
     1,   'y1',
     2,   'y2',
     4,   'y3'
)

semi_join(x = x, y = y)

Joining with `by = join_by(key)`

A tibble: 2 × 2
key	val_x
<dbl>	<chr>
1	x1
2	x2

x <- tribble(
  ~key, ~val_x,
     1,   'x1',
     2,   'x2',
     2,   'x3',
     3,   'x4'
)

y <- tribble(
  ~key, ~val_y,
     1,   'y1',
     2,   'y2',
     2,   'y3',
     3,   'y4'
)

semi_join(x = x, y = y)

Joining with `by = join_by(key)`

A tibble: 4 × 2
key	val_x
<dbl>	<chr>
1	x1
2	x2
2	x3
3	x4

top_dest <-
  flights %>%
    count(dest, sort = TRUE) %>%
      head(n = 10)
top_dest

flights %>%
  filter(dest %in% top_dest$dest) %>%
    head()

flights %>%
  semi_join(y = top_dest) %>%
    head()

A tibble: 10 × 2
dest	n
<chr>	<int>
ORD	17283
ATL	17215
LAX	16174
BOS	15508
MCO	14082
CLT	14064
SFO	13331
FLL	12055
MIA	11728
DCA	9705

A tibble: 6 × 19
year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
<int>	<int>	<int>	<int>	<int>	<dbl>	<int>	<int>	<dbl>	<chr>	<int>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dttm>
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00
2013	1	1	555	600	-5	913	854	19	B6	507	N516JB	EWR	FLL	158	1065	6	0	2013-01-01 06:00:00
2013	1	1	557	600	-3	838	846	-8	B6	79	N593JB	JFK	MCO	140	944	6	0	2013-01-01 06:00:00
2013	1	1	558	600	-2	753	745	8	AA	301	N3ALAA	LGA	ORD	138	733	6	0	2013-01-01 06:00:00

Joining with `by = join_by(dest)`

A tibble: 6 × 19
year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
<int>	<int>	<int>	<int>	<int>	<dbl>	<int>	<int>	<dbl>	<chr>	<int>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dttm>
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00
2013	1	1	555	600	-5	913	854	19	B6	507	N516JB	EWR	FLL	158	1065	6	0	2013-01-01 06:00:00
2013	1	1	557	600	-3	838	846	-8	B6	79	N593JB	JFK	MCO	140	944	6	0	2013-01-01 06:00:00
2013	1	1	558	600	-2	753	745	8	AA	301	N3ALAA	LGA	ORD	138	733	6	0	2013-01-01 06:00:00

Anti Join #

x <- tribble(
  ~key, ~val_x,
     1,   'x1',
     2,   'x2',
     3,   'x3'
)

y <- tribble(
  ~key, ~val_y,
     1,   'y1',
     2,   'y2',
     4,   'y3'
)

anti_join(x = x, y = y)

Joining with `by = join_by(key)`

A tibble: 1 × 2
key	val_x
<dbl>	<chr>
3	x3

x <- tribble(
  ~key, ~val_x,
     1,   'x1',
     2,   'x2',
     2,   'x3',
     3,   'x4'
)

y <- tribble(
  ~key, ~val_y,
     1,   'y1',
     2,   'y2',
     2,   'y3',
     3,   'y4'
)

anti_join(x = x, y = y)

Joining with `by = join_by(key)`

A tibble: 0 × 2
key	val_x
<dbl>	<chr>

# anti joins are useful for diagnosing join mismatches
# for example, you might be interested to know that there are many `flights` that don't have a match in `planes`

flights %>%
  anti_join(y = planes, by = 'tailnum') %>%
    count(tailnum, sort = TRUE) %>%
      head()

A tibble: 6 × 2
tailnum	n
<chr>	<int>
NA	2512
N725MQ	575
N722MQ	513
N723MQ	507
N713MQ	483
N735MQ	396

13.7 - Set operations #

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)

df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

intersect(x = df1, y = df2)

union(x = df1, y = df2)

setdiff(x = df1, y = df2)

setdiff(x = df2, y = df1)

A tibble: 1 × 2
x	y
<dbl>	<dbl>
1	1

A tibble: 3 × 2
x	y
<dbl>	<dbl>
1	1
2	1
1	2

A tibble: 1 × 2
x	y
<dbl>	<dbl>
2	1

A tibble: 1 × 2
x	y
<dbl>	<dbl>
1	2
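
The set operations compare whole rows. As a small sketch of two related dplyr helpers, `union_all()` keeps duplicated rows that `union()` drops, and `setequal()` tests whether two tables contain the same rows regardless of order. The result names below are illustrative.

```r
library(dplyr)

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)

df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

# union() drops the duplicated row (1, 1); union_all() keeps it
n_union     <- nrow(dplyr::union(x = df1, y = df2))
n_union_all <- nrow(dplyr::union_all(x = df1, y = df2))

# setequal() ignores row order: df1 with its rows reversed is still the same set
same_rows <- dplyr::setequal(x = df1, y = df1[2:1, ])

c(n_union, n_union_all)
same_rows
```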

14 - Strings #

string1      <- "This is a string"
string2      <- 'If I want to include a "quote" inside a string, I use single quotes'
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

# the printed representation of a string is not the same as the string itself
x <- c("\"", "\\")
x
writeLines(x)

'"'
'\\'

"
\

?'"'

x <- "\u00b5"
x

c('one', 'two', 'three')

'µ'

'one'
'two'
'three'

str_length(c('a', 'R for data science', NA))

1
18
<NA>

str_c('x', 'y', 'z')

'xyz'

x <- c('abc', NA)

str_c('|-', x, '-|')

str_c('|-', str_replace_na(x), '-|')

'|-abc-|'
NA

'|-abc-|'
'|-NA-|'

str_c('prefix-', c('a', 'b', 'c'), '-suffix')

'prefix-a-suffix'
'prefix-b-suffix'
'prefix-c-suffix'

name        <- 'Hadley'
time_of_day <- 'morning'
birthday    <- FALSE

str_c('Good ', time_of_day, ' ', name, if (birthday) ' and HAPPY BIRTHDAY', '.')

'Good morning Hadley.'

str_c(c('x', 'y', 'z'), collapse = ', ')

'x, y, z'

x <- c('Apple', 'Banana', 'Pear')

str_sub(x,  1,  3)
str_sub(x, -3, -1)
str_sub(x,  1, 10)

'App'
'Ban'
'Pea'

'ple'
'ana'
'ear'

'Apple'
'Banana'
'Pear'

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))

x

'apple'
'banana'
'pear'

str_to_upper(c('i', 'ı'))
str_to_upper(c('i', 'ı'), locale = 'tr')

'I'
'I'

'İ'
'I'

x <- c('apple', 'eggplant', 'banana')

str_sort(x, locale = 'en')  # English
str_sort(x, locale = 'haw') # Hawaiian

'apple'
'banana'
'eggplant'

'apple'
'eggplant'
'banana'

x <- c('apple', 'banana', 'pear')

str_view(x, 'an')

[2] │ b<an><an>a

str_view(x, '.a.')

[2] │ <ban>ana
[3] │ p<ear>

dot <- '\\.'

dot

writeLines(dot)

'\\.'

\.

str_view(c('abc', 'a.c', 'bef'), 'a\\.c')

[2] │ <a.c>

x <- 'a\\b'

x

writeLines(x)

str_view(x, '\\\\')

'a\\b'

a\b

[1] │ a<\>b
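
Matching one literal backslash takes four backslashes in an ordinary string. As a sketch, assuming R 4.0.0 or later, raw string syntax avoids the string-level escaping so only the regular-expression escaping remains. The variable names are illustrative.

```r
library(stringr)

x <- 'a\\b' # the three-character string: a, backslash, b

# Raw string (R >= 4.0.0): the text between r"( and )" is taken literally,
# so this is the two-character regular expression that matches one backslash
pattern_raw     <- r"(\\)"
# The same two characters written with ordinary string escapes
pattern_escaped <- '\\\\'

identical(pattern_raw, pattern_escaped)
str_detect(x, pattern_raw)
```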

16 - Dates and times #

today()

2023-06-19

now()

[1] "2023-06-19 21:40:12 EDT"

lubridate::ymd('2017-01-31')
lubridate::mdy('January 31st, 2017')
lubridate::dmy('31-Jan-2017')
lubridate::ymd(20170131)
lubridate::ymd_hms('2017-01-31 20:11:59')
lubridate::mdy_hm('01/31/2017 08:01')
lubridate::ymd(20170131, tz = 'UTC')

2017-01-31

[1] "2017-01-31 20:11:59 UTC"

[1] "2017-01-31 08:01:00 UTC"

[1] "2017-01-31 UTC"

19 - Functions #

df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10),
)
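# NOTE: the `df$b` line contains a copy-and-paste slip (`max(df$d, ...)` where `max(df$b, ...)` was intended),
#   which is why column `b` does not reach 1 in the output below; mistakes like this motivate writing a function
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))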
df

A tibble: 10 × 4
a	b	c	d
<dbl>	<dbl>	<dbl>	<dbl>
0.2603453	0.0000000	0.52291366	0.17190835
0.4428665	0.3911715	0.00000000	0.34864602
0.7679751	0.2958880	0.61301233	0.00000000
0.0000000	0.3256557	0.03846392	1.00000000
0.9597766	0.3409009	0.30398213	0.38970973
0.7516023	0.2671197	1.00000000	0.14085417
1.0000000	0.4352529	0.35195509	0.08903446
0.3199440	0.5370480	0.57422199	0.77900010
0.7130083	0.0714432	0.37854517	0.07351166
0.1629208	0.1235444	0.53272103	0.50560502

rescale01 <- function (x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10))
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))

0
0.5
1

0
0.5
1

0
0.25
0.5
<NA>
1

df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10),
)
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
df

A tibble: 10 × 4
a	b	c	d
<dbl>	<dbl>	<dbl>	<dbl>
0.5620421	0.4772677	0.00000000	0.2779846
0.0000000	0.1481486	0.20080052	0.2611846
0.3120153	0.8396613	0.04548743	0.4313317
0.5319523	0.6228401	0.47050875	0.9074133
0.4530802	0.3482687	0.33315823	0.1256405
0.4333381	1.0000000	0.33140720	0.5639065
0.4721171	0.5706066	0.14447456	0.7455952
0.1415588	0.0000000	0.15490717	1.0000000
1.0000000	0.3948759	1.00000000	0.0000000
0.3231058	0.6069666	0.73818349	0.9683428
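
The four assignments still repeat the same call once per column. One way to remove that duplication, sketched here with dplyr's `across()` (dplyr 1.0.0 or later), is to apply `rescale01()` to every column in a single step. The seed and the name `df_scaled` are arbitrary choices for the example.

```r
library(dplyr)

rescale01 <- function (x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(1) # arbitrary seed, only so the sketch is reproducible
df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

# Apply rescale01() to every column at once
df_scaled <-
  df %>%
    mutate(across(.cols = everything(), .fns = rescale01))

# Each column should now have minimum 0 and maximum 1
sapply(df_scaled, range)
```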

x <- c(1:10, Inf)

rescale01(x)

0
0
0
0
0
0
0
0
0
0
NaN

rescale01 <- function (x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(1:10, Inf)

rescale01(x)

0
0.111111111111111
0.222222222222222
0.333333333333333
0.444444444444444
0.555555555555556
0.666666666666667
0.777777777777778
0.888888888888889
1
Inf

Exercise 19.2.1 #

1 #

[1] Why is TRUE not a parameter to rescale01? What would happen if x contained a single missing value, and na.rm was FALSE?

rescale01 <- function (x, remove_missing_data = TRUE, finite_data = TRUE) {
  rng <- range(x, na.rm = remove_missing_data, finite = finite_data)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10),       FALSE, FALSE)
rescale01(c(-10, 0, 10),     FALSE, FALSE)
rescale01(c(1, 2, 3, NA, 5), FALSE, FALSE)
rescale01(c(1:10, Inf),      FALSE, FALSE)

rescale01(c(0, 5, 10),       FALSE, TRUE)
rescale01(c(-10, 0, 10),     FALSE, TRUE)
rescale01(c(1, 2, 3, NA, 5), FALSE, TRUE)
rescale01(c(1:10, Inf),      FALSE, TRUE)

rescale01(c(0, 5, 10),       TRUE, FALSE)
rescale01(c(-10, 0, 10),     TRUE, FALSE)
rescale01(c(1, 2, 3, NA, 5), TRUE, FALSE)
rescale01(c(1:10, Inf),      TRUE, FALSE)

rescale01(c(0, 5, 10),       TRUE, TRUE)
rescale01(c(-10, 0, 10),     TRUE, TRUE)
rescale01(c(1, 2, 3, NA, 5), TRUE, TRUE)
rescale01(c(1:10, Inf),      TRUE, TRUE)

0
0.5
1

0
0.5
1

<NA>
<NA>
<NA>
<NA>
<NA>

0
0
0
0
0
0
0
0
0
0
NaN

0
0.5
1

0
0.5
1

0
0.25
0.5
<NA>
1

0
0.111111111111111
0.222222222222222
0.333333333333333
0.444444444444444
0.555555555555556
0.666666666666667
0.777777777777778
0.888888888888889
1
Inf

0
0.5
1

0
0.5
1

0
0.25
0.5
<NA>
1

0
0
0
0
0
0
0
0
0
0
NaN

0
0.5
1

0
0.5
1

0
0.25
0.5
<NA>
1

0
0.111111111111111
0.222222222222222
0.333333333333333
0.444444444444444
0.555555555555556
0.666666666666667
0.777777777777778
0.888888888888889
1
Inf

2 #

[2] In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1.

rescale01 <- function (x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  y   <- (x - rng[1]) / (rng[2] - rng[1])
  y[y == -Inf] <- 0
  y[y ==  Inf] <- 1
  y
}

x <- c(1:10, Inf)

rescale01(x)

0
0.111111111111111
0.222222222222222
0.333333333333333
0.444444444444444
0.555555555555556
0.666666666666667
0.777777777777778
0.888888888888889
1
1

3 #

[3] Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?

x <- c(1:10, NA, NA)
x

# proportion of missing (NA) values
prop_na <- function (x) {
  mean(is.na(x))
}

prop_na(x)

1
2
3
4
5
6
7
8
9
10
<NA>
<NA>

0.166666666666667

x <- c(1:10, NA, NA)
x

# standardization, sum to unity
sum_to_one <- function (x, na.rm = FALSE) {
  x / sum(x, na.rm = na.rm)
}

sum_to_one(x)
sum_to_one(x, TRUE)

1
2
3
4
5
6
7
8
9
10
<NA>
<NA>

<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>

0.0181818181818182
0.0363636363636364
0.0545454545454545
0.0727272727272727
0.0909090909090909
0.109090909090909
0.127272727272727
0.145454545454545
0.163636363636364
0.181818181818182
<NA>
<NA>

# coefficient of variation
coef_variation <- function (x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

x <- c(1:10, NA, NA)
x

coef_variation(x)
coef_variation(x, TRUE)

1
2
3
4
5
6
7
8
9
10
<NA>
<NA>

<NA>

0.55048188256318

4 #

[4] Write your own functions to compute the variance and skewness of a numeric vector.

$ \begin{aligned} \text{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x})^2 \end{aligned} $

where

$ \begin{aligned} \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \end{aligned} $

$ \begin{aligned} \text{Skew}(x) = \frac{\frac{1}{n - 2} (\sum_{i=1}^n (x_i - \bar{x})^3)}{\text{Var}(x)^\frac{3}{2}} \end{aligned} $

variance <- function (x, na.rm = FALSE) {
  # drop missing values up front so that `n` and the sum use the same observations
  if (na.rm) x <- x[!is.na(x)]
  n      <- length(x)
  m      <- mean(x)
  sq_err <- (x - m)^2
  sum(sq_err) / (n - 1)
}

var(1:10)
variance(1:10)

9.16666666666667

skewness <- function (x, na.rm = FALSE) {
  # drop missing values up front so that `n` and the sum use the same observations
  if (na.rm) x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  v <- var(x)
  (sum((x - m)^3) / (n - 2)) / v^(3/2)
}

skewness(c(1, 2, 3, 100))

1.49875099748886

5 #

[5] Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

both_na <- function (x, y) {
  sum(is.na(x) & is.na(y))
}

x <- c(NA, NA,  1, 2)
y <- c(NA,  1, NA, 2)

both_na(x, y)

1

x <- c(NA, NA,  1, 2, NA, NA, 1)
y <- c(NA,  1, NA, 2, NA, NA, 1)

both_na(x, y)

3

6 #

[6] What do the following functions do? Why are they useful even though they are so short?

is_directory <- function (x) file.info(x)$isdir
is_readable  <- function (x) file.access(x, 4) == 0

# `is_directory()` checks whether the path in `x` is a directory.
is_directory <- function (x) file.info(x)$isdir

# `is_readable()` checks whether the path in `x` is readable (i.e., whether the file exists and the user has permission to open it).
# Both functions are short, but their names state the intent far more clearly than the calls they wrap.
is_readable <- function (x) file.access(x, 4) == 0
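
A small usage sketch, assuming a temporary file created only for the demonstration (the names `dir_path`, `file_path`, and the result variables are illustrative):

```r
is_directory <- function (x) file.info(x)$isdir
is_readable  <- function (x) file.access(x, 4) == 0

dir_path  <- tempdir()                  # a directory that exists for this R session
file_path <- tempfile(fileext = '.txt') # a fresh path inside that directory
writeLines('hello', file_path)          # create a small readable file

dir_is_dir  <- is_directory(dir_path)
file_is_dir <- is_directory(file_path)
file_ok     <- is_readable(file_path)

# The helper names document the intent at the call site
contents <- NULL
if (!file_is_dir && file_ok) {
  contents <- readLines(file_path)
}

unlink(file_path) # clean up the temporary file
c(dir_is_dir, file_is_dir, file_ok)
```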

7 #

[7] Read the complete lyrics to “Little Bunny Foo Foo”. There’s a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.

19.4 Conditional execution #

# Here's a simple function that uses an `if` statement.
# The goal of this function is to return a logical vector describing whether or not each element of a vector is named.
# This function takes advantage of the standard return rule: a function returns the last value that it computed. Here that is either one of the two branches of the `if` statement.

has_name <- function (x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))
  } else {
    !is.na(nms) & nms != ""
  }
}
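
A brief usage sketch showing both branches of the `if` statement. The input vectors and result names are made up for illustration.

```r
has_name <- function (x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))
  } else {
    !is.na(nms) & nms != ""
  }
}

# partly named vector: the unnamed element gets the empty name ""
partly_named <- has_name(c(a = 1, 2, b = 3))

# no names attribute at all: names() returns NULL, so the first branch runs
unnamed <- has_name(1:3)

partly_named
unnamed
```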

19.5 Function arguments #

# Compute confidence interval around mean using normal approximation

mean_ci <- function (x, conf = 0.95) {
  se    <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- runif(100)

mean_ci(x)
mean_ci(x, conf = 0.99)

0.460097864269066
0.574181602919688

0.44217400988909
0.592105457299664

19.5.2 Checking values #

# weighted mean summary statistic
wt_mean <- function (x, w) {
  sum(x * w) / sum(w)
}

# weighted variance summary statistic
wt_var  <- function (x, w) {
  mu <- wt_mean(x, w)
  sum(w * (x - mu)^2) / sum(w)
}

# weighted standard deviation summary statistic
wt_sd   <- function (x, w) {
  sqrt(wt_var(x, w))
}

# What happens if `x` and `w` are not the same length?
#   In this case, we don't get an error because of R's recycling rules: `w` is silently recycled to 1, 2, 3, 1, 2, 3.

x <- 1:6
w <- 1:3

x
w

wt_mean(x, w)

1
2
3
4
5
6

1
2
3

7.66666666666667

# weighted mean summary statistic
wt_mean <- function (x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(x * w) / sum(w)
}

# weighted variance summary statistic
wt_var  <- function (x, w) {
  mu <- wt_mean(x, w)
  sum(w * (x - mu)^2) / sum(w)
}

# weighted standard deviation summary statistic
wt_sd   <- function (x, w) {
  sqrt(wt_var(x, w))
}
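
A short sketch of how the length check behaves. `tryCatch()` is used here only so that the error message can be inspected without halting the script; the names `msg` and `ok` are illustrative.

```r
wt_mean <- function (x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(x * w) / sum(w)
}

# Mismatched lengths now fail loudly instead of being recycled
msg <- tryCatch(
  wt_mean(1:6, 1:3),
  error = function (e) conditionMessage(e)
)
msg

# Equal-length inputs still work: (1*1 + 2*1 + 3*2) / (1 + 1 + 2) = 2.25
ok <- wt_mean(1:3, c(1, 1, 2))
ok
```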

20 - Vectors #

letters

'a'
'b'
'c'
'd'
'e'
'f'
'g'
'h'
'i'
'j'
'k'
'l'
'm'
'n'
'o'
'p'
'q'
'r'
's'
't'
'u'
'v'
'w'
'x'
'y'
'z'

typeof(letters)

'character'

1:10

1
2
3
4
5
6
7
8
9
10

typeof(1:10)

'integer'

x <- list('a', 'b', 1:10)
x

'a'
'b'
1. 1
2. 2
3. 3
4. 4
5. 5
6. 6
7. 7
8. 8
9. 9
10. 10

length(x)

3

Logical #

1:10 %% 3 == 0

FALSE
FALSE
TRUE
FALSE
FALSE
TRUE
FALSE
FALSE
TRUE
FALSE

c(TRUE, TRUE, FALSE, NA)

TRUE
TRUE
FALSE
<NA>

Numeric #

typeof(1)

'double'

typeof(1L)

'integer'

1.5L

1.5

x <- sqrt(x = 2) ^ 2
x
x - 2
x == 2
near(x = x, y = 2)

2

4.44089209850063e-16

FALSE

TRUE
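
The comparison with `near()` succeeds because it allows a small tolerance. A base R sketch of the same idea follows; the tolerance `1e-8` is an arbitrary choice for the example, not necessarily the default that `near()` uses.

```r
x <- sqrt(2) ^ 2

# exact comparison fails because of floating-point rounding
exact <- x == 2

# compare within a tolerance instead
tol      <- 1e-8
within   <- abs(x - 2) < tol

# all.equal() also compares with a tolerance; wrap it in isTRUE() to get a single logical
base_cmp <- isTRUE(all.equal(x, 2))

c(exact, within, base_cmp)
```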

c(-1, 0, 1) / 0

-Inf
NaN
Inf

is.finite(0)
is.finite(Inf)
is.finite(-Inf)
is.finite(NA)
is.finite(NaN)

TRUE

FALSE

FALSE

FALSE

FALSE

is.infinite(0)
is.infinite(Inf)
is.infinite(-Inf)
is.infinite(NA)
is.infinite(NaN)

FALSE

TRUE

TRUE

FALSE

FALSE

is.na(0)
is.na(Inf)
is.na(-Inf)
is.na(NA)
is.na(NaN)

FALSE

FALSE

FALSE

TRUE

TRUE

is.nan(0)
is.nan(Inf)
is.nan(-Inf)
is.nan(NA)
is.nan(NaN)

FALSE

FALSE

FALSE

FALSE

TRUE

Character #

# `y` doesn't take up 1000x as much memory as `x`, because each element of `y` is just a pointer to that same string
# a pointer is 8 bytes, so 1000 pointers to one shared string cost roughly 8 x 1000 B plus the 152 B of `x`, about 8.1 kB

x <- 'This is a reasonably long string.'
pryr::object_size(x)

y <- rep(x, 1000)
pryr::object_size(y)

152 B

8.14 kB

Missing values #

# each type of atomic vector has its own missing value

NA            # logical
NA_integer_   # integer
NA_real_      # double
NA_character_ # character

<NA>

NA

20.4 - Using atomic vectors #

20.4.1 - Coercion #

x <- sample(20, 100, replace = TRUE)
x

16
20
11
5
18
19
6
17
2
12
12
12
5
9
17
18
9
19
3
11
4
3
1
12
11
7
6
13
10
14
15
9
4
2
9
7
20
11
16
9
14
18
9
19
16
19
7
16
4
18
16
14
1
14
4
10
6
5
17
11
7
9
17
15
2
18
4
11
9
6
4
12
11
9
19
8
13
5
6
4
6
5
2
20
4
18
11
2
10
16
19
4
11
15
3
20
19
13
10
9

y <- x > 10
y

TRUE
TRUE
TRUE
FALSE
TRUE
TRUE
FALSE
TRUE
FALSE
TRUE
TRUE
TRUE
FALSE
FALSE
TRUE
TRUE
FALSE
TRUE
FALSE
TRUE
FALSE
FALSE
FALSE
TRUE
TRUE
FALSE
FALSE
TRUE
FALSE
TRUE
TRUE
FALSE
FALSE
FALSE
FALSE
FALSE
TRUE
TRUE
TRUE
FALSE
TRUE
TRUE
FALSE
TRUE
TRUE
TRUE
FALSE
TRUE
FALSE
TRUE
TRUE
TRUE
FALSE
TRUE
FALSE
FALSE
FALSE
FALSE
TRUE
TRUE
FALSE
FALSE
TRUE
TRUE
FALSE
TRUE
FALSE
TRUE
FALSE
FALSE
FALSE
TRUE
TRUE
FALSE
TRUE
FALSE
TRUE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
TRUE
FALSE
TRUE
TRUE
FALSE
FALSE
TRUE
TRUE
FALSE
TRUE
TRUE
FALSE
TRUE
TRUE
TRUE
FALSE
FALSE

# How many are greater than 10?

sum(y)

51

# What proportion are greater than 10?

mean(y)

0.51

typeof(c(TRUE, 1L))

'integer'

typeof(c(1L, 1.5))

'double'

typeof(c(1.5, 'a'))

'character'

20.4.2 - Test functions #

purrr::is_logical(TRUE)
purrr::is_logical(1L)
purrr::is_logical(1.5)
purrr::is_logical('a')

TRUE

FALSE

purrr::is_integer(TRUE)
purrr::is_integer(1L)
purrr::is_integer(1.5)
purrr::is_integer('a')

FALSE

TRUE

FALSE

purrr::is_double(TRUE)
purrr::is_double(1L)
purrr::is_double(1.5)
purrr::is_double('a')

FALSE

TRUE

FALSE

purrr::is_character(TRUE)
purrr::is_character(1L)
purrr::is_character(1.5)
purrr::is_character('a')

FALSE

TRUE

purrr::is_atomic(TRUE)
purrr::is_atomic(1L)
purrr::is_atomic(1.5)
purrr::is_atomic('a')

TRUE

purrr::is_list(TRUE)
purrr::is_list(1L)
purrr::is_list(1.5)
purrr::is_list('a')

FALSE

purrr::is_vector(TRUE)
purrr::is_vector(1L)
purrr::is_vector(1.5)
purrr::is_vector('a')

TRUE

sample(x = 10) + 100

109
106
102
108
110
101
103
105
104
107

runif(n = 10) > 0.5

TRUE
TRUE
FALSE
FALSE
FALSE
TRUE
FALSE
FALSE
FALSE
FALSE

1:10 + 1:2

2
4
4
6
6
8
8
10
10
12

1:10 + 1:3

Warning message in 1:10 + 1:3:
"longer object length is not a multiple of shorter object length"

2
4
6
5
7
9
8
10
12
11

tibble(x = 1:4, y = 1:2)

Error in `tibble()`:
! Tibble columns must have compatible sizes.
* Size 4: Existing data.
* Size 2: Column `y`.
i Only values of size one are recycled.

tibble(x = 1:4, y = rep(1:2, 2))

A tibble: 4 × 2
x	y
<int>	<int>
1	1
2	2
3	1
4	2

tibble(x = 1:4, y = rep(1:2, each = 2))

A tibble: 4 × 2
x	y
<int>	<int>
1	1
2	1
3	2
4	2

20.4.4 - Naming vectors #

c(x = 1, y = 2, z = 4)

x: 1
y: 2
z: 4

purrr::set_names(1:3, c('a', 'b', 'c'))

a: 1
b: 2
c: 3

20.4.5 - Subsetting #

x <- c('one', 'two', 'three', 'four', 'five')
x
x[c(3,2,5)]
x[c(1,1,5,5,5,2)]
x[c(-1,-3,-5)]
x[0]

'one'
'two'
'three'
'four'
'five'

'three'
'two'
'five'

'one'
'one'
'five'
'five'
'five'
'two'

'two'
'four'

x <- c(10, 3, NA, 5, 8, 1, NA)
x
x[!is.na(x)]
x[x %% 2 == 0]

10
3
<NA>
5
8
1
<NA>

10
3
5
8
1

10
<NA>
8
<NA>

x <- c(abc = 1, def = 2, xyz = 5)
x
x[c('xyz', 'def')]

abc: 1
def: 2
xyz: 5

xyz: 5
def: 2
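
In the logical subsetting example above, `x[x %% 2 == 0]` keeps an `NA` for every missing input, because the comparison itself is `NA`. A small sketch of two ways to drop those entries, assuming base R only; the variable names `v` and `even_no_na` are illustrative and not from the source.

```r
# Illustrative vector, same values as the example above
v <- c(10, 3, NA, 5, 8, 1, NA)

# A logical index yields NA wherever the condition is NA
v[v %% 2 == 0]

# which() returns only the positions where the condition is TRUE
even_no_na <- v[which(v %% 2 == 0)]
even_no_na

# Equivalent: combine the condition with an explicit missing-value check
v[!is.na(v) & v %% 2 == 0]
```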

20.5 - Recursive vectors (lists) #

x <- list(1, 2, 3)
x

1
2
3

str(object = x)

List of 3
 $ : num 1
 $ : num 2
 $ : num 3

x_named <- list(a = 1, b = 2, c = 3)
x_named

$a: 1
$b: 2
$c: 3

str(object = x_named)

List of 3
 $ a: num 1
 $ b: num 2
 $ c: num 3

y <- list('a', 1L, 1.5, TRUE)
y

'a'
1
1.5
TRUE

str(object = y)

List of 4
 $ : chr "a"
 $ : int 1
 $ : num 1.5
 $ : logi TRUE

20.5.1 - Visualizing lists #

x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

x1
x2
x3

str(x1)
str(x2)
str(x3)

1. 1
2. 2
1. 3
2. 4

1. 1
2. 2
1. 3
2. 4

1
1. 2
2. 1. 3

List of 2
 $ : num [1:2] 1 2
 $ : num [1:2] 3 4
List of 2
 $ :List of 2
  ..$ : num 1
  ..$ : num 2
 $ :List of 2
  ..$ : num 3
  ..$ : num 4
List of 2
 $ : num 1
 $ :List of 2
  ..$ : num 2
  ..$ :List of 1
  .. ..$ : num 3

20.5.2 - Subsetting #

a <- list(a = 1:3, b = 'a string', c = pi, d = list(-1, -5))
a

$a

1
2
3

$b

'a string'

$c

3.14159265358979

$d

-1
-5

a[1:2]

$a

1
2
3

$b

'a string'

a[4]

$d

-1
-5

a[[4]]

-1
-5

a[[4]][1]

-1

a[[4]][[1]]

-1

a$a

1
2
3

a[['a']]

1
2
3
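
The cells above differ mainly in what kind of object comes back. As a rough sketch (the name `lst` is illustrative; the list has the same contents as `a`), `[` returns a smaller list while `[[` and `$` return the element itself:

```r
# Same contents as the list `a` above, under an illustrative name
lst <- list(a = 1:3, b = 'a string', c = pi, d = list(-1, -5))

# `[` keeps the list wrapper, even for a single element
typeof(lst[1])

# `[[` removes one level of hierarchy and returns the element itself
typeof(lst[[1]])

# `$` is shorthand for `[[` with a name
identical(lst$a, lst[['a']])

# Drilling into a nested list: both forms reach the same value
lst[[4]][[1]]
lst$d[[1]]
```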

20.6 - Attributes #

x <- 1:10

attr(x, 'greeting')
attr(x, 'greeting') <- 'Hi!'
attr(x, 'farewell') <- 'Bye!'
attributes(x)

NULL

$greeting: 'Hi!'
$farewell: 'Bye!'

# the call to "UseMethod" means that this is a generic function
# and it will call a specific method based on the class of the first argument

as.Date

function (x, ...) 
UseMethod("as.Date")

# list all the methods for a generic function with `methods()`

methods(generic.function = 'as.Date')

[1] as.Date.POSIXct*    as.Date.POSIXlt*    as.Date.character* 
[4] as.Date.default*    as.Date.factor*     as.Date.numeric*   
[7] as.Date.vctrs_sclr* as.Date.vctrs_vctr*
see '?methods' for accessing help and source code

# see the specific implementation of a method with `getS3method()`

getS3method(f = 'as.Date', class = 'default')
getS3method(f = 'as.Date', class = 'numeric')

function (x, ...) 
{
    if (inherits(x, "Date")) 
        x
    else if (is.null(x)) 
        .Date(numeric())
    else if (is.logical(x) && all(is.na(x))) 
        .Date(as.numeric(x))
    else stop(gettextf("do not know how to convert '%s' to class %s", 
        deparse1(substitute(x)), dQuote("Date")), domain = NA)
}

function (x, origin, ...) 
if (missing(origin)) .Date(x) else as.Date(origin, ...) + x
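
To see the dispatch mechanism described in the comments above on a small scale, here is a sketch of a user-defined S3 generic. The generic `describe()` and its methods are hypothetical and exist only for illustration; they are not part of base R or the book.

```r
# Hypothetical generic: UseMethod() picks a method from the class of `x`
describe <- function(x, ...) {
  UseMethod('describe')
}

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) {
  paste('an object of type', typeof(x))
}

# Method chosen when `x` has class "factor"
describe.factor <- function(x, ...) {
  paste('a factor with', nlevels(x), 'levels')
}

describe(1:3)
describe(factor(c('ab', 'cd')))
```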

20.7 - Augmented vectors #

20.7.1 - Factors #

x <- factor(c('ab', 'cd', 'ab'), levels = c('ab', 'cd', 'ef'))
x

ab
cd
ab

Levels:

'ab'
'cd'
'ef'

typeof(x)

'integer'

attributes(x)

$levels

'ab'
'cd'
'ef'

$class

'factor'
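
Since a factor is an integer vector with a `levels` attribute, the labels can be rebuilt from the integer codes. A minimal sketch, using illustrative names (`f`, `codes`, `labels`) that are not from the source:

```r
# Same factor as above, under an illustrative name
f <- factor(c('ab', 'cd', 'ab'), levels = c('ab', 'cd', 'ef'))

# The underlying integer codes index into the levels attribute
codes  <- as.integer(f)
labels <- levels(f)[codes]

# Rebuilding the labels from the codes matches as.character()
identical(labels, as.character(f))

# Removing the class leaves the bare integers plus the levels attribute
unclass(f)
```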

20.7.2 - Dates and datetimes #

x <- as.Date('1971-01-01')
x

unclass(x = x)
typeof(x = x)
attributes(x = x)

1971-01-01

365

'double'

$class = 'Date'

x <- lubridate::ymd_hm('1970-01-01 01:00')
x

[1] "1970-01-01 01:00:00 UTC"
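
A date-time can be inspected the same way as the date above. The sketch below uses base R's `as.POSIXct()` in place of `lubridate::ymd_hm()` so that it runs without extra packages; the names `dt` and `dt_ny` are illustrative, and the time zone `'America/New_York'` is assumed to be available on the system.

```r
# Base R stand-in for lubridate::ymd_hm('1970-01-01 01:00')
dt <- as.POSIXct('1970-01-01 01:00', tz = 'UTC')

# A POSIXct value is a double: seconds since 1970-01-01 00:00:00 UTC
typeof(dt)
as.numeric(dt)

# The time zone is stored as an attribute and only affects printing
attr(dt, 'tzone')

# Changing the tzone attribute changes the display, not the instant
dt_ny <- dt
attr(dt_ny, 'tzone') <- 'America/New_York'
as.numeric(dt_ny) == as.numeric(dt)
```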

Bibliography #

Wickham, Hadley; Mine Çetinkaya-Rundel; & Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st Ed. O’Reilly. Home.
