Big Data#
Resources#
[ h ][ w ] Hadoop
[ y ] 12-12-2018. Computerphile. “Apache Spark - Computerphile”.
[ y ] 07-14-2021. freeCodeCamp. “PySpark Tutorial”.
[ y ] 03-02-2023. Simplilearn. “🔥Spark Full Course 2023 | Spark Tutorial For Beginners | Learn Apache Spark | Simplilearn”.
[ y ] 05-18-2021. Simplilearn. “Spark Full Course | Spark Tutorial For Beginners | Learn Apache Spark | Simplilearn”.
[ y ] 08-01-2019. Simplilearn. “What Is Apache Spark? | Apache Spark Tutorial | Apache Spark For Beginners | Simplilearn”.
[ y ] 01-05-2024. Business Analytics for Beginners. “Batch Processing VS Stream Processing in Big Data Analytics”.
Texts#
[ h ][ course ] Leskovec, Jure; Anand Rajaraman; & Jeff Ullman. Mining of Massive Datasets.
Hadoop
Kunigk, Jan et al. (2018). Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale. O’Reilly.
Sammer, Eric. (2012). Hadoop Operations: A Guide for Developers and Administrators. O’Reilly. Home. GitHub.
White, Tom. (2015). Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale. 4th Ed. O’Reilly.
Spark
Chambers, Bill & Matei Zaharia. (2018). Spark: The Definitive Guide: Big Data Processing Made Simple. O’Reilly. GitHub.
Damji et al. (2020). Learning Spark: Lightning-Fast Data Analytics. 2nd Ed. O’Reilly. GitHub.
Karau, Holden & Rachel Warren. (2017). High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark. O’Reilly. GitHub.
Maas, Gerard & Francois Garillot. (2019). Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming. O’Reilly. GitHub.
Parsian, Mahmoud. (2022). Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up Using PySpark. O’Reilly. GitHub.
Perrin, Jean-Georges. (2020). Spark in Action: With examples in Java, Python, and Scala. 2nd Ed. Manning.
Polak, Adi. (2023). Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch. O’Reilly.
Ryza et al. (2017). Advanced Analytics with Spark: Patterns for Learning from Data at Scale. 2nd Ed. O’Reilly. GitHub.
Tandon et al. (2022). Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark. O’Reilly.
Notes#
Hadoop#
Configuration
echo "export HADOOP_HOME=\"/opt/homebrew/Cellar/hadoop/3.3.3/libexec\"" >> ~/.zshrc
$HADOOP_HOME/etc/hadoop/hadoop-env.sh
$HADOOP_HOME/etc/hadoop/core-site.xml
$HADOOP_HOME/etc/hadoop/hdfs-site.xml
$HADOOP_HOME/etc/hadoop/mapred-site.xml
$HADOOP_HOME/etc/hadoop/yarn-site.xml
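A quick way to confirm that HADOOP_HOME points at a usable installation is to check that each of the files above exists. The helper below is a minimal sketch; the function name check_hadoop_conf is hypothetical, not part of Hadoop.

```shell
# Report which site configuration files exist under a Hadoop home directory.
# check_hadoop_conf is a hypothetical helper name, not a Hadoop command.
check_hadoop_conf() {
  local home="${1:-${HADOOP_HOME:-}}" f
  if [ -z "$home" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  for f in hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
    if [ -f "$home/etc/hadoop/$f" ]; then
      echo "ok      $f"
    else
      echo "missing $f"
    fi
  done
}

# Usage: check_hadoop_conf            (uses $HADOOP_HOME)
#        check_hadoop_conf /some/path (checks an explicit directory)
```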
Defaults
$HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-common/core-default.xml
$HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
$HADOOP_HOME/share/doc/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
$HADOOP_HOME/share/doc/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
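To look up the shipped default for a single key without opening these files, something like the following may help. It is a sketch only: hadoop_default is a hypothetical helper, and it assumes the <value> element sits on the line directly after the matching <name> line, which is how the default files are usually laid out.

```shell
# Print the default value recorded for a key in one of the *-default.xml files.
# Assumption: <value> is on the line immediately following the <name> line.
hadoop_default() {
  local key="$1" file="$2"
  # -F treats the key literally, so the dots in it are not regex wildcards.
  grep -F -A1 "<name>${key}</name>" "$file" |
    sed -n 's|.*<value>\(.*\)</value>.*|\1|p'
}

# Usage:
# hadoop_default dfs.replication \
#   "$HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml"
```

For the value actually in effect after the site files are applied, hdfs getconf -confKey dfs.replication asks Hadoop directly.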
Startup
hdfs --daemon start namenode
hdfs --daemon start secondarynamenode
hdfs --daemon start datanode
yarn --daemon start resourcemanager
yarn --daemon start nodemanager
yarn --daemon start proxyserver
mapred --daemon start historyserver
jps
Lists the instrumented Java Virtual Machines (JVMs) on the target system. This command is experimental and unsupported.
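After running the startup commands, the jps listing can be checked against the daemons that were started. The sketch below assumes the process names jps typically shows for these daemons (for example WebAppProxyServer for the proxy server and JobHistoryServer for the history server); missing_daemons is a hypothetical helper name.

```shell
# Print the name of every expected Hadoop daemon that is absent from
# jps-style output supplied on stdin. The daemon names are assumptions
# based on the main class names that jps usually reports.
missing_daemons() {
  local listing name
  listing="$(cat)"
  for name in NameNode SecondaryNameNode DataNode ResourceManager \
              NodeManager WebAppProxyServer JobHistoryServer; do
    # jps prints "<pid> <MainClass>"; compare the second field exactly so
    # that NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$listing" |
         awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      echo "$name"
    fi
  done
}

# Usage: jps | missing_daemons
```

No output means every expected daemon was found in the listing.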
Monitoring
hdfs dfsadmin -report
hdfs fsck /
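Both commands print long reports, so for a quick health check it can be enough to pull out one line from each. The helpers below are a sketch with hypothetical names; they assume the report contains a line of the form "Live datanodes (N):" and that fsck ends with a line containing "is HEALTHY", which matches typical output but may vary between versions.

```shell
# Extract the live DataNode count from `hdfs dfsadmin -report` output on stdin.
# Assumes a line such as "Live datanodes (3):".
live_datanodes() {
  sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*/\1/p'
}

# Succeed if `hdfs fsck` output on stdin reports a healthy filesystem.
# Assumes the summary contains the text "is HEALTHY".
fsck_healthy() {
  grep -q 'is HEALTHY'
}

# Usage:
# hdfs dfsadmin -report | live_datanodes
# hdfs fsck / | fsck_healthy && echo "HDFS looks healthy"
```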
Spark#
Startup
spark-shell --master spark://10.0.4.146:7077 --driver-memory 8g --executor-memory 6g