A08 - Project: Scraping Nuclear Reactors


Kaplan, Daniel & Matthew Beckman. (2021). Data Computing. 2nd Ed. Home.


Revised

19 Jun 2023


Programming Environment

library(lubridate)
library(rvest)
library(tidyverse)

str_c('EXECUTED : ', now())
sessionInfo()
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
 dplyr   1.1.2      readr   2.1.4
 forcats 1.0.0      stringr 1.5.0
 ggplot2 3.4.3      tibble  3.2.1
 purrr   1.0.2      tidyr   1.3.0
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter()         masks stats::filter()
 readr::guess_encoding() masks rvest::guess_encoding()
 dplyr::lag()            masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
'EXECUTED : 2024-06-20 10:49:29.358445'
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.2    
 [5] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.3  
 [9] tidyverse_2.0.0 rvest_1.0.3     lubridate_1.9.2

loaded via a namespace (and not attached):
 [1] gtable_0.3.3     jsonlite_1.8.5   compiler_4.3.0   crayon_1.5.2    
 [5] tidyselect_1.2.0 IRdisplay_1.1    xml2_1.3.4       scales_1.2.1    
 [9] uuid_1.1-0       fastmap_1.1.1    IRkernel_1.3.2   R6_2.5.1        
[13] generics_0.1.3   munsell_0.5.0    tzdb_0.4.0       pillar_1.9.0    
[17] rlang_1.1.1      utf8_1.2.3       stringi_1.7.12   repr_1.1.6      
[21] timechange_0.2.0 cli_3.6.1        withr_2.5.0      magrittr_2.0.3  
[25] digest_0.6.31    grid_4.3.0       hms_1.1.3        base64enc_0.1-3 
[29] pbdZMQ_0.3-9     lifecycle_1.0.3  vctrs_0.6.3      evaluate_0.21   
[33] glue_1.6.2       fansi_1.0.4      colorspace_2.1-0 httr_1.4.6      
[37] tools_4.3.0      pkgconfig_2.0.3  htmltools_0.5.5 

In what ways is the table tidy? How is it not tidy? What’s different about it from a tidy table?

Tidy Data Criteria

  • (i) rows (or cases, observations) must each represent one case of the same kind (i.e., each observation must have its own row)

  • (ii) columns (or variables) must each hold the same kind of value for every row (i.e., each variable must have its own column)

  • (iii) each value must have its own cell

  • The three criteria are interrelated: it is impossible to satisfy only two of the three. This implies the following.

    • (i) put each dataset into a tibble

    • (ii) put each variable into a column

There is at least one row that does not represent a typical case (that is, the header row(s)). Certain columns are blank and need to be removed. The remaining columns may contain heterogeneous data formats or data types. And missing values must be addressed. But with some cleaning, the table hints at what a typical case should look like.
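The cleaning steps named above (dropping a leftover header row, removing blank columns, fixing data types, and marking missing values) can be sketched on a small made-up table. The tibble `raw` and its column names are hypothetical and are not taken from the Wikipedia page; this is only an illustration of the idea, assuming the problems look roughly like this.

```r
library(dplyr)
library(tibble)

# Hypothetical scraped table: one leftover header row, one blank column,
# numbers stored as text, and a missing value written as an empty string.
raw <- tribble(
  ~Plant,  ~Capacity, ~Notes,
  "Plant", "(MW)",    NA,
  "Alpha", "439",     NA,
  "Beta",  "",        NA,
  "Gamma", "1067",    NA
)

cleaned <-
  raw %>%
    filter(row_number() > 1) %>%            # drop the leftover header row
    select(where(~ !all(is.na(.x)))) %>%    # drop columns that are entirely blank
    mutate(
      Capacity = na_if(Capacity, ""),       # empty strings become NA
      Capacity = as.integer(Capacity)       # text becomes integer
    )

cleaned
```

The order matters in this sketch: the header row is removed first so that its text does not turn into a spurious NA when the column is converted to integer.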

page      <- 'https://en.wikipedia.org/wiki/List_of_commerical_nuclear_reactors'
tableList <-
  page %>%
    read_html() %>%
    html_nodes(css = 'table') %>%
    html_table(fill = TRUE)
length(tableList)

Japan <-
  tableList[[21]] %>%
    select(1:9)
#names(Japan)[c(3, 6)] <- c('type', 'grossMW')
head(Japan)
55
A tibble: 6 × 9
Plantname          UnitNo.  Type   Model  Status      Capacity(MW)  Beginbuilding  Commercialoperation  Closed
<chr>              <int>    <chr>  <chr>  <chr>       <int>         <chr>          <chr>                <chr>
Fugen              1        HWLWR  ATR    Shut down   148           10 May 1972    20 Mar 1979          29 Mar 2003
Fukushima Daiichi  1        BWR    BWR-3  Inoperable  439           25 Jul 1967    26 Mar 1971          19 May 2011
Fukushima Daiichi  2        BWR    BWR-4  Inoperable  760           9 Jun 1969     18 Jul 1974          19 May 2011
Fukushima Daiichi  3        BWR    BWR-4  Inoperable  760           28 Dec 1970    27 Mar 1976          19 May 2011
Fukushima Daiichi  4        BWR    BWR-4  Inoperable  760           12 Feb 1973    12 Oct 1978          19 May 2011
Fukushima Daiichi  5        BWR    BWR-4  Shut down   760           22 May 1972    18 Apr 1978          17 Dec 2013

Among other things, some of the variable names appear redundant and others have multiple words separated by spaces. You can rename variables using the data verb rename(), finding appropriate names from the Wikipedia table. Another problem is that the first row is not data but a continuation of the variable names. So row number 1 should be dropped.

Japan <-
  Japan %>%
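    # NOTE: in the head() output above, row 1 is Fugen (a data row, not a header),
    # so this filter removes one reactor from the table.
    filter(row_number() > 1) %>%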
    rename(
      name         = Plantname,
      reactor      = `UnitNo.`,
      type         = Type,
      model        = Model,
      status       = Status,
      netMW        = `Capacity(MW)`,
      construction = Beginbuilding,
      operation    = Commercialoperation,
      closure      = Closed
    )
head(Japan)
A tibble: 6 × 9
name               reactor  type   model  status      netMW  construction  operation    closure
<chr>              <int>    <chr>  <chr>  <chr>       <int>  <chr>         <chr>        <chr>
Fukushima Daiichi  1        BWR    BWR-3  Inoperable  439    25 Jul 1967   26 Mar 1971  19 May 2011
Fukushima Daiichi  2        BWR    BWR-4  Inoperable  760    9 Jun 1969    18 Jul 1974  19 May 2011
Fukushima Daiichi  3        BWR    BWR-4  Inoperable  760    28 Dec 1970   27 Mar 1976  19 May 2011
Fukushima Daiichi  4        BWR    BWR-4  Inoperable  760    12 Feb 1973   12 Oct 1978  19 May 2011
Fukushima Daiichi  5        BWR    BWR-4  Shut down   760    22 May 1972   18 Apr 1978  17 Dec 2013
Fukushima Daiichi  6        BWR    BWR-5  Shut down   1067   26 Oct 1973   24 Oct 1979  17 Dec 2013
str(Japan)
tibble [61 × 9] (S3: tbl_df/tbl/data.frame)
 $ name        : chr [1:61] "Fukushima Daiichi" "Fukushima Daiichi" "Fukushima Daiichi" "Fukushima Daiichi" ...
 $ reactor     : int [1:61] 1 2 3 4 5 6 1 2 3 4 ...
 $ type        : chr [1:61] "BWR" "BWR" "BWR" "BWR" ...
 $ model       : chr [1:61] "BWR-3" "BWR-4" "BWR-4" "BWR-4" ...
 $ status      : chr [1:61] "Inoperable" "Inoperable" "Inoperable" "Inoperable" ...
 $ netMW       : int [1:61] 439 760 760 760 760 1067 1067 1067 1067 1067 ...
 $ construction: chr [1:61] "25 Jul 1967" "9 Jun 1969" "28 Dec 1970" "12 Feb 1973" ...
 $ operation   : chr [1:61] "26 Mar 1971" "18 Jul 1974" "27 Mar 1976" "12 Oct 1978" ...
 $ closure     : chr [1:61] "19 May 2011" "19 May 2011" "19 May 2011" "19 May 2011" ...

Using your cleaned data, make a plot of net generation capacity versus date of construction. Color the points by the type of reactor (for example, BWR, PWR, etc.) In addition to your plot, give a sentence or two of interpretation; what patterns do you see?

# BWR boiling water reactor
# FBR fast breeder reactor
# GCR gas-cooled reactor
# PWR pressurized water reactor

Japan <-
  Japan %>%
    mutate(
      netMW        = as.integer(netMW),
      construction = lubridate::dmy(construction),
      operation    = lubridate::dmy(operation),
      closure      = lubridate::dmy(closure)
    )
head(Japan)

Japan %>%
  ggplot(mapping = aes(x = construction, y = netMW)) +
    geom_point(aes(color = type))
Warning message:
“There were 2 warnings in `mutate()`.
The first warning was:
 In argument: `construction = lubridate::dmy(construction)`.
Caused by warning:
!  1 failed to parse.
 Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.”
A tibble: 6 × 9
name               reactor  type   model  status      netMW  construction  operation   closure
<chr>              <int>    <chr>  <chr>  <chr>       <int>  <date>        <date>      <date>
Fukushima Daiichi  1        BWR    BWR-3  Inoperable  439    1967-07-25    1971-03-26  2011-05-19
Fukushima Daiichi  2        BWR    BWR-4  Inoperable  760    1969-06-09    1974-07-18  2011-05-19
Fukushima Daiichi  3        BWR    BWR-4  Inoperable  760    1970-12-28    1976-03-27  2011-05-19
Fukushima Daiichi  4        BWR    BWR-4  Inoperable  760    1973-02-12    1978-10-12  2011-05-19
Fukushima Daiichi  5        BWR    BWR-4  Shut down   760    1972-05-22    1978-04-18  2013-12-17
Fukushima Daiichi  6        BWR    BWR-5  Shut down   1067   1973-10-26    1979-10-24  2013-12-17
Warning message:
“Removed 1 rows containing missing values (`geom_point()`).”
[Figure: scatter plot of netMW versus construction date for the Japan reactors, points colored by reactor type]
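The "failed to parse" warning does not say which entry caused it. One way to find out, sketched here on made-up values, is to parse into a separate column and keep the rows where text was present but the parsed date is NA. The tibble `dates` and the value "unknown" are hypothetical; the actual failing entry in the Japan table is not shown in the output. This check has to run while the date columns are still character, that is, before the mutate() step overwrites them.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Hypothetical raw values; the third entry is not a day-month-year date.
dates <- tibble(
  name         = c("A", "B", "C"),
  construction = c("25 Jul 1967", "9 Jun 1969", "unknown")
)

failed <-
  dates %>%
    mutate(parsed = dmy(construction, quiet = TRUE)) %>%  # quiet = TRUE suppresses the parse warning
    filter(!is.na(construction), is.na(parsed))           # text was present but did not parse

failed
```

Rows returned by this filter are candidates for manual inspection; rows that were already missing in the source are excluded by the first condition.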

Carry out the same cleaning process for the China reactor table, and then append it to the Japan data. Use mutate() to add a variable that has the name of the country. (Hint: functions such as bind_cols() or bind_rows() from the dplyr package are helpful for appending data frames.) Collating the data for all countries is a matter of repeating this process over and over. Inevitably, there are inconsistencies. For example, the US data had been organized in a somewhat different format when compared to the Japan and China data for many years until Wikipedia editors decided to reconcile them.

China <-
  tableList[[10]] %>%
    select(1:9)
China <-
  China %>%
    filter(row_number() > 2) %>%
    rename(
      name         = Plantname,
      reactor      = `UnitNo.`,
      type         = Type,
      model        = Model,
      status       = Status,
      netMW        = `Capacity(MW)`,
      construction = Beginbuilding,
      operation    = Commercialoperation,
      closure      = Closed
    )
China <-
  China %>%
    mutate(
      netMW        = as.integer(netMW),
      construction = lubridate::dmy(construction),
      operation    = lubridate::dmy(operation),
      closure      = lubridate::dmy(closure)
    )
China %>%
  ggplot(mapping = aes(x = construction, y = netMW)) +
    geom_point(aes(color = type))
Warning message:
“There were 2 warnings in `mutate()`.
The first warning was:
 In argument: `construction = lubridate::dmy(construction)`.
Caused by warning:
!  17 failed to parse.
 Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.”
Warning message:
“Removed 50 rows containing missing values (`geom_point()`).”
[Figure: scatter plot of netMW versus construction date for the China reactors, points colored by reactor type]
bind_rows(Japan, China) %>%
  mutate(Country = ifelse(name %in% unique(Japan$name), 'Japan', 'China')) %>%
  head()
Error in `bind_rows()`:
! Can't combine `..1$reactor` <integer> and `..2$reactor` <character>.
Traceback:

1. bind_rows(Japan, China) %>% mutate(Country = ifelse(name %in% 
 .     unique(Japan$name), "Japan", "China")) %>% head()
2. head(.)
3. mutate(., Country = ifelse(name %in% unique(Japan$name), "Japan", 
 .     "China"))
4. bind_rows(Japan, China)
5. vec_rbind(!!!dots, .names_to = .id, .error_call = current_env())
6. (function () 
 . vec_default_ptype2(x = x, y = y, x_arg = x_arg, y_arg = y_arg, 
 .     call = call, `vctrs:::from_dispatch` = `vctrs:::from_dispatch`, 
 .     `vctrs:::s3_fallback` = `vctrs:::s3_fallback`))()
7. vec_default_ptype2(x = x, y = y, x_arg = x_arg, y_arg = y_arg, 
 .     call = call, `vctrs:::from_dispatch` = `vctrs:::from_dispatch`, 
 .     `vctrs:::s3_fallback` = `vctrs:::s3_fallback`)
8. withRestarts(stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg, 
 .     `vctrs:::from_dispatch` = match_from_dispatch(...), call = call), 
 .     vctrs_restart_ptype2 = function(ptype) {
 .         ptype
 .     })
9. withOneRestart(expr, restarts[[1L]])
10. doWithOneRestart(return(expr), restart)
11. stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg, `vctrs:::from_dispatch` = match_from_dispatch(...), 
  .     call = call)
12. stop_incompatible(x, y, x_arg = x_arg, y_arg = y_arg, details = details, 
  .     ..., message = message, class = c(class, subclass, "vctrs_error_incompatible_type"), 
  .     call = call)
13. stop_vctrs(message, class = c(class, "vctrs_error_incompatible"), 
  .     x = x, y = y, details = details, ..., call = call)
14. abort(message, class = c(class, "vctrs_error"), ..., call = call)
15. signal_abort(cnd, .file)
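The error above arises because `reactor` is integer in one table and character in the other, and bind_rows() will not combine those two types. A minimal sketch of one possible fix, using made-up stand-ins for the two tables, is to coerce `reactor` to character in both and to add the country column to each table before binding, rather than inferring it from plant names afterwards. The tibbles `japan_toy` and `china_toy`, the plant name "Plant X", and the unit label "2A" are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the cleaned Japan and China tables.
japan_toy <- tibble(
  name    = c("Fukushima Daiichi", "Fukushima Daiichi"),
  reactor = c(1L, 2L)                    # integer, as in the Japan table
)
china_toy <- tibble(
  name    = c("Plant X", "Plant X"),
  reactor = c("1", "2A")                 # character, as in the China table
)

combined <-
  bind_rows(
    japan_toy %>% mutate(reactor = as.character(reactor), country = "Japan"),
    china_toy %>% mutate(reactor = as.character(reactor), country = "China")
  )

combined
```

Character is chosen as the common type here on the assumption that some unit labels are not plain integers; if every label turns out to be numeric, converting both columns with as.integer() would work as well.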

Make an informative graphic that shows how long it took between start of construction and commissioning for operation of each nuclear reactor in Japan (or another country of your choice). One possibility: use reactor name vs date as the frame. For each reactor, set the glyph to be a line extending from start of construction to commissioning. You can do this with geom_segment() using name as the y coordinate and time as the x coordinate. (Tip: use the paste() function to create the reactorID on the vertical axis.)

options(warn = -1)

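Japan %>%
  mutate(
    reactorID = paste(name, reactor),     # one label per reactor, e.g. "Fukushima Daiichi 1"
    time      = operation - construction  # build duration as a difftime (not used in the plot)
  ) %>%
  ggplot() +
    geom_segment(
      linewidth = 2,
      mapping = aes(
        x     = construction,  # segment starts at start of construction
        xend  = operation,     # and ends at commercial operation
        y     = reactorID,
        yend  = reactorID,
        color = type
      )
    )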
[Figure: one horizontal segment per Japan reactor (reactorID on the y-axis) running from start of construction to commercial operation, colored by reactor type]
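The segment lengths in the graphic can also be expressed as numbers. The sketch below computes the time from construction to operation in years for the first three Fukushima Daiichi rows shown earlier and sorts by it. The tibble `toy` is a stand-in built from those printed dates, and dividing by 365.25 is an approximation chosen here for illustration.

```r
library(dplyr)
library(tibble)

# First three Fukushima Daiichi rows, copied from the parsed output above.
toy <- tibble(
  name         = "Fukushima Daiichi",
  reactor      = c(1L, 2L, 3L),
  construction = as.Date(c("1967-07-25", "1969-06-09", "1970-12-28")),
  operation    = as.Date(c("1971-03-26", "1974-07-18", "1976-03-27"))
)

durations <-
  toy %>%
    mutate(
      reactorID = paste(name, reactor),
      # difftime in days, converted to approximate years
      years     = as.numeric(difftime(operation, construction, units = "days")) / 365.25
    ) %>%
    arrange(desc(years))                  # longest build first

durations
```

The same `years` column could be passed to reorder() on the y aesthetic to sort the segments by build time, if that ordering is more useful than the default alphabetical one.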