A08 - Project: Scraping Nuclear Reactors

A08 - Project: Scraping Nuclear Reactors #

Kaplan, Daniel & Matthew Beckman. (2021). Data Computing. 2nd Ed. Home.

Revised

19 Jun 2023

Programming Environment #

library(lubridate)
library(rvest)
library(tidyverse)

str_c('EXECUTED : ', now())
sessionInfo()

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union

── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr   1.1.2     ✔ readr   2.1.4
✔ forcats 1.0.0     ✔ stringr 1.5.0
✔ ggplot2 3.4.3     ✔ tibble  3.2.1
✔ purrr   1.0.2     ✔ tidyr   1.3.0

── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

'EXECUTED : 2025-01-23 15:23:35.444161'

R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.2    
 [5] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.3  
 [9] tidyverse_2.0.0 rvest_1.0.3     lubridate_1.9.2

loaded via a namespace (and not attached):
 [1] gtable_0.3.3     jsonlite_1.8.5   compiler_4.3.0   crayon_1.5.2    
 [5] tidyselect_1.2.0 IRdisplay_1.1    xml2_1.3.4       scales_1.2.1    
 [9] uuid_1.1-0       fastmap_1.1.1    IRkernel_1.3.2   R6_2.5.1        
[13] generics_0.1.3   munsell_0.5.0    tzdb_0.4.0       pillar_1.9.0    
[17] rlang_1.1.1      utf8_1.2.3       stringi_1.7.12   repr_1.1.6      
[21] timechange_0.2.0 cli_3.6.1        withr_2.5.0      magrittr_2.0.3  
[25] digest_0.6.31    grid_4.3.0       hms_1.1.3        base64enc_0.1-3 
[29] pbdZMQ_0.3-9     lifecycle_1.0.3  vctrs_0.6.3      evaluate_0.21   
[33] glue_1.6.2       fansi_1.0.4      colorspace_2.1-0 httr_1.4.6      
[37] tools_4.3.0      pkgconfig_2.0.3  htmltools_0.5.5 

In what ways is the table tidy? How is it not tidy? What’s different about it from a tidy table?

Tidy Data Criteria

(i) rows (or cases, observations) must each represent the same underlying attribute (i.e., each observation must have its own row)
(ii) columns (or variables, values) must each contain the same type of value for each row (i.e., each variable must have its own column)
(iii) each value must have its own cell
It’s impossible to only satisfy two of the three criteria. This implies the following.
- (i) put each dataset into a tibble
- (ii) put each variable into a column

There is at least one row that does not represent a typical case (that is, the header row(s)). Certain columns are blank and need to be removed. The remaining columns may contain heterogeneous data formats or data types. And missing values must be addressed. But with some cleaning, the table hints at what a typical case should look like.

page      <- 'https://en.wikipedia.org/wiki/List_of_commerical_nuclear_reactors'
tableList <-
  page %>%
    read_html() %>%
    html_nodes(css = 'table') %>%
    html_table(fill = TRUE)
length(tableList)

Japan <-
  tableList[[21]] %>%
    select(1:9)
#names(Japan)[c(3, 6)] <- c('type', 'grossMW')
head(Japan)

55

A tibble: 6 × 9
Plantname	UnitNo.	Type	Model	Status	Capacity(MW)	Beginbuilding	Commercialoperation	Closed
<chr>	<int>	<chr>	<chr>	<chr>	<int>	<chr>	<chr>	<chr>
Fugen	1	HWLWR	ATR	Shut down	148	10 May 1972	20 Mar 1979	29 Mar 2003
Fukushima Daiichi	1	BWR	BWR-3	Inoperable	439	25 Jul 1967	26 Mar 1971	19 May 2011
Fukushima Daiichi	2	BWR	BWR-4	Inoperable	760	9 Jun 1969	18 Jul 1974	19 May 2011
Fukushima Daiichi	3	BWR	BWR-4	Inoperable	760	28 Dec 1970	27 Mar 1976	19 May 2011
Fukushima Daiichi	4	BWR	BWR-4	Inoperable	760	12 Feb 1973	12 Oct 1978	19 May 2011
Fukushima Daiichi	5	BWR	BWR-4	Shut down	760	22 May 1972	18 Apr 1978	17 Dec 2013

Among other things, some of the variable names appear redundant and others have multiple words separated by spaces. You can rename variables using the data verb rename(), finding appropriate names from the Wikipedia table. Another problem is that the first row is not data but a continuation of the variable names. So row number 1 should be dropped.

Japan <-
  Japan %>%
    filter(row_number() > 1) %>%
    rename(
      name         = Plantname,
      reactor      = `UnitNo.`,
      type         = Type,
      model        = Model,
      status       = Status,
      netMW        = `Capacity(MW)`,
      construction = Beginbuilding,
      operation    = Commercialoperation,
      closure      = Closed
    )
head(Japan)

A tibble: 6 × 9
name	reactor	type	model	status	netMW	construction	operation	closure
<chr>	<int>	<chr>	<chr>	<chr>	<int>	<chr>	<chr>	<chr>
Fukushima Daiichi	1	BWR	BWR-3	Inoperable	439	25 Jul 1967	26 Mar 1971	19 May 2011
Fukushima Daiichi	2	BWR	BWR-4	Inoperable	760	9 Jun 1969	18 Jul 1974	19 May 2011
Fukushima Daiichi	3	BWR	BWR-4	Inoperable	760	28 Dec 1970	27 Mar 1976	19 May 2011
Fukushima Daiichi	4	BWR	BWR-4	Inoperable	760	12 Feb 1973	12 Oct 1978	19 May 2011
Fukushima Daiichi	5	BWR	BWR-4	Shut down	760	22 May 1972	18 Apr 1978	17 Dec 2013
Fukushima Daiichi	6	BWR	BWR-5	Shut down	1067	26 Oct 1973	24 Oct 1979	17 Dec 2013

str(Japan)

tibble [61 × 9] (S3: tbl_df/tbl/data.frame)
 $ name        : chr [1:61] "Fukushima Daiichi" "Fukushima Daiichi" "Fukushima Daiichi" "Fukushima Daiichi" ...
 $ reactor     : int [1:61] 1 2 3 4 5 6 1 2 3 4 ...
 $ type        : chr [1:61] "BWR" "BWR" "BWR" "BWR" ...
 $ model       : chr [1:61] "BWR-3" "BWR-4" "BWR-4" "BWR-4" ...
 $ status      : chr [1:61] "Inoperable" "Inoperable" "Inoperable" "Inoperable" ...
 $ netMW       : int [1:61] 439 760 760 760 760 1067 1067 1067 1067 1067 ...
 $ construction: chr [1:61] "25 Jul 1967" "9 Jun 1969" "28 Dec 1970" "12 Feb 1973" ...
 $ operation   : chr [1:61] "26 Mar 1971" "18 Jul 1974" "27 Mar 1976" "12 Oct 1978" ...
 $ closure     : chr [1:61] "19 May 2011" "19 May 2011" "19 May 2011" "19 May 2011" ...

Using your cleaned data, make a plot of net generation capacity versus date of construction. Color the points by the type of reactor (for example, BWR, PWR, etc.) In addition to your plot, give a sentence or two of interpretation; what patterns do you see?

# BWR boiling water reactor
# FBR
# GCR
# PWR pressurized water reactor

Japan <-
  Japan %>%
    mutate(
      netMW        = as.integer(netMW),
      construction = lubridate::dmy(construction),
      operation    = lubridate::dmy(operation),
      closure      = lubridate::dmy(closure)
    )
head(Japan)

Japan %>%
  ggplot(mapping = aes(x = construction, y = netMW)) +
    geom_point(aes(color = type))

Warning message:
“There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `construction = lubridate::dmy(construction)`.
Caused by warning:
!  1 failed to parse.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.”

A tibble: 6 × 9
name	reactor	type	model	status	netMW	construction	operation	closure
<chr>	<int>	<chr>	<chr>	<chr>	<int>	<date>	<date>	<date>
Fukushima Daiichi	1	BWR	BWR-3	Inoperable	439	1967-07-25	1971-03-26	2011-05-19
Fukushima Daiichi	2	BWR	BWR-4	Inoperable	760	1969-06-09	1974-07-18	2011-05-19
Fukushima Daiichi	3	BWR	BWR-4	Inoperable	760	1970-12-28	1976-03-27	2011-05-19
Fukushima Daiichi	4	BWR	BWR-4	Inoperable	760	1973-02-12	1978-10-12	2011-05-19
Fukushima Daiichi	5	BWR	BWR-4	Shut down	760	1972-05-22	1978-04-18	2013-12-17
Fukushima Daiichi	6	BWR	BWR-5	Shut down	1067	1973-10-26	1979-10-24	2013-12-17

Warning message:
“Removed 1 rows containing missing values (`geom_point()`).”

../../../../_images/64ed74d44048fc2126de2bb30fa5370dd18d05499b55dea00dfdcade464cb459.png

Carry out the same cleaning process for the China reactor table, and then append it with the Japan data. Use mutate() to add a variable that has the name of the country. (Hint: functions such as bind_cols() or bind_rows() form the dplyr package are helpful for appending data frames.) Collating the data for all countries is a matter of repeating this process over and over. Inevitably, there are inconsistencies. For example, the US data had been organized in a somewhat different format when compared to the Japan and China data for many years until Wikipedia editors decided to reconcile them.

China <-
  tableList[[10]] %>%
    select(1:9)
China <-
  China %>%
    filter(row_number() > 2) %>%
    rename(
      name         = Plantname,
      reactor      = `UnitNo.`,
      type         = Type,
      model        = Model,
      status       = Status,
      netMW        = `Capacity(MW)`,
      construction = Beginbuilding,
      operation    = Commercialoperation,
      closure      = Closed
    )
China <-
  China %>%
    mutate(
      netMW        = as.integer(netMW),
      construction = lubridate::dmy(construction),
      operation    = lubridate::dmy(operation),
      closure      = lubridate::dmy(closure)
    )
China %>%
  ggplot(mapping = aes(x = construction, y = netMW)) +
    geom_point(aes(color = type))

Warning message:
“There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `construction = lubridate::dmy(construction)`.
Caused by warning:
!  20 failed to parse.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.”

Warning message:
“Removed 53 rows containing missing values (`geom_point()`).”

../../../../_images/19259fb3a3c94a3414334ece237d7bb45866622c700b3a08169eb4d692a57ea4.png

bind_rows(Japan, China) %>%
  mutate(Country = ifelse(name %in% unique(Japan$name), 'Japan', 'China')) %>%
  head()

Error in `bind_rows()`:
! Can't combine `..1$reactor` <integer> and `..2$reactor` <character>.
Traceback:

1. bind_rows(Japan, China) %>% mutate(Country = ifelse(name %in% 
 .     unique(Japan$name), "Japan", "China")) %>% head()
2. head(.)
3. mutate(., Country = ifelse(name %in% unique(Japan$name), "Japan", 
 .     "China"))
4. bind_rows(Japan, China)
5. vec_rbind(!!!dots, .names_to = .id, .error_call = current_env())
6. (function () 
 . vec_default_ptype2(x = x, y = y, x_arg = x_arg, y_arg = y_arg, 
 .     call = call, `vctrs:::from_dispatch` = `vctrs:::from_dispatch`, 
 .     `vctrs:::s3_fallback` = `vctrs:::s3_fallback`))()
7. vec_default_ptype2(x = x, y = y, x_arg = x_arg, y_arg = y_arg, 
 .     call = call, `vctrs:::from_dispatch` = `vctrs:::from_dispatch`, 
 .     `vctrs:::s3_fallback` = `vctrs:::s3_fallback`)
8. withRestarts(stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg, 
 .     `vctrs:::from_dispatch` = match_from_dispatch(...), call = call), 
 .     vctrs_restart_ptype2 = function(ptype) {
 .         ptype
 .     })
9. withOneRestart(expr, restarts[[1L]])
10. doWithOneRestart(return(expr), restart)
11. stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg, `vctrs:::from_dispatch` = match_from_dispatch(...), 
  .     call = call)
12. stop_incompatible(x, y, x_arg = x_arg, y_arg = y_arg, details = details, 
  .     ..., message = message, class = c(class, subclass, "vctrs_error_incompatible_type"), 
  .     call = call)
13. stop_vctrs(message, class = c(class, "vctrs_error_incompatible"), 
  .     x = x, y = y, details = details, ..., call = call)
14. abort(message, class = c(class, "vctrs_error"), ..., call = call)
15. signal_abort(cnd, .file)

Make an informative graphic that shows how long it took between start of construction and commissioning for operation of each nuclear reactor in Japan (or another country of your choice). One possibility: use reactor name vs date as the frame. For each reactor, set the glyph to be a line extending from start of construction to commissioning. You can do this with geom_segment() using name as the y coordinate and time as the x coordinate. (Tip: use the paste() function to create the reactorID on the vertical axis.)

options(warn = -1)

Japan %>%
  mutate(
    reactorID = paste(name, reactor),
    time      = operation - construction
  ) %>%
  ggplot() +
    geom_segment(
      linewidth = 2,
      mapping = aes(
        x     = operation,
        xend  = construction,
        y     = reactorID,
        yend  = reactorID,
        color = type
      )
    )

../../../../_images/cb842782173f3b67767816cc811fa9f056637bc6fdb668883964a35844d31018.png

A08 - Project: Scraping Nuclear Reactors

Contents

A08 - Project: Scraping Nuclear Reactors#

Programming Environment#

A08 - Project: Scraping Nuclear Reactors #

Programming Environment #