Big Data


Resources

[ h ][ w ] Hadoop

[ h ][ w ] Apache Spark

  • Downloads

  • [ y ] 12-12-2018. Computerphile. “Apache Spark - Computerphile”.

  • [ y ] 07-14-2021. freeCodeCamp. “PySpark Tutorial”.

  • [ y ] 03-02-2023. Simplilearn. “🔥Spark Full Course 2023 | Spark Tutorial For Beginners | Learn Apache Spark | Simplilearn”.

  • [ y ] 05-18-2021. Simplilearn. “Spark Full Course | Spark Tutorial For Beginners | Learn Apache Spark | Simplilearn”.

  • [ y ] 08-01-2019. Simplilearn. “What Is Apache Spark? | Apache Spark Tutorial | Apache Spark For Beginners | Simplilearn”.

More

  • [ y ] 01-05-2024. Business Analytics for Beginners. “Batch Processing VS Stream Processing in Big Data Analytics”.


Texts

[ h ][ course ] Leskovec, Jure; Anand Rajaraman; & Jeff Ullman. Mining of Massive Datasets.

Hadoop

  • Kunigk, Jan et al. (2018). Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale. O’Reilly.

  • Sammer, Eric. (2012). Hadoop Operations: A Guide for Developers and Administrators. O’Reilly. Home. GitHub.

  • White, Tom. (2015). Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale. 4th Ed. O’Reilly.

Spark

  • Chambers, Bill & Matei Zaharia. (2018). Spark: The Definitive Guide: Big Data Processing Made Simple. O’Reilly. GitHub.

  • Damji et al. (2020). Learning Spark: Lightning-Fast Data Analytics. 2nd Ed. O’Reilly. GitHub.

  • Karau, Holden & Rachel Warren. (2017). High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark. O’Reilly. GitHub.

  • Maas, Gerard & Francois Garillot. (2019). Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming. O’Reilly. GitHub.

  • Parsian, Mahmoud. (2022). Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up Using PySpark. O’Reilly. GitHub.

  • Perrin, Jean-Georges. (2020). Spark in Action: With examples in Java, Python, and Scala. 2nd Ed. Manning.

  • Polak, Adi. (2023). Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch. O’Reilly.

  • Ryza et al. (2017). Advanced Analytics with Spark: Patterns for Learning from Data at Scale. 2nd Ed. O’Reilly. GitHub.

  • Tandon et al. (2022). Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark. O’Reilly.


Notes

Hadoop

Configuration

  • echo "export HADOOP_HOME=\"/opt/homebrew/Cellar/hadoop/3.3.3/libexec\"" >> ~/.zshrc

  • $HADOOP_HOME/etc/hadoop/hadoop-env.sh

  • $HADOOP_HOME/etc/hadoop/core-site.xml

  • $HADOOP_HOME/etc/hadoop/hdfs-site.xml

  • $HADOOP_HOME/etc/hadoop/mapred-site.xml

  • $HADOOP_HOME/etc/hadoop/yarn-site.xml
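For a single-node (pseudo-distributed) setup, the two files that usually need editing first are core-site.xml and hdfs-site.xml. The sketch below writes a minimal pair. The function name, the scratch directory, the hdfs://localhost:9000 address, and the replication factor of 1 are all assumptions for illustration, not values taken from these notes.

```shell
# Write a minimal pseudo-distributed configuration into a target directory.
# Assumption: the target defaults to a scratch directory rather than
# $HADOOP_HOME/etc/hadoop, so an existing configuration is not overwritten.
write_minimal_conf() {
  conf_dir="${1:-./hadoop-conf-sketch}"
  mkdir -p "$conf_dir"

  # core-site.xml: where clients find the default filesystem (assumed address).
  cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

  # hdfs-site.xml: one replica per block, since there is a single DataNode.
  cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}
```

Run write_minimal_conf with no argument to inspect the output in ./hadoop-conf-sketch, then copy the files into $HADOOP_HOME/etc/hadoop once they look right.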

Defaults

  • $HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-common/core-default.xml

  • $HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

  • $HADOOP_HOME/share/doc/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

  • $HADOOP_HOME/share/doc/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
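To look up what a property defaults to without opening these files by hand, something like the following can help. It is a rough sketch: hadoop_prop is a hypothetical helper name, and it assumes each <name> and <value> element sits on its own line, which is how the shipped default files are laid out.

```shell
# Print the <value> of a named property from a Hadoop *-default.xml or *-site.xml file.
# Assumption: <name> and <value> are each on their own line inside <property>.
hadoop_prop() {
  file="$1"
  prop="$2"
  awk -v want="<name>$prop</name>" '
    index($0, want) { found = 1; next }          # literal match on the property name
    found && /<value>/ {                         # first <value> after the name
      sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
      print; exit
    }
    found && /<\/property>/ { exit }             # property had no <value> line
  ' "$file"
}
```

For example, hadoop_prop "$HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml" dfs.replication prints the shipped default. To see the value actually in effect after the site files are applied, hdfs getconf -confKey dfs.replication is the more direct route.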

Startup

  • hdfs --daemon start namenode

  • hdfs --daemon start secondarynamenode

  • hdfs --daemon start datanode

  • yarn --daemon start resourcemanager

  • yarn --daemon start nodemanager

  • yarn --daemon start proxyserver

  • mapred --daemon start historyserver

  • jps (lists the instrumented Java Virtual Machines (JVMs) on the target system; this command is experimental and unsupported)
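After running the start commands, the jps listing can be checked mechanically rather than by eye. The helper below is a sketch under assumptions: missing_daemons is a made-up name, and the expected list covers only the six class names NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer (the proxy server is left out; add its class name if it is needed).

```shell
# Report which expected Hadoop daemons are absent from `jps` output read on stdin.
# Assumption: jps prints one "<pid> <ClassName>" pair per line.
missing_daemons() {
  jps_out="$(cat)"
  missing=""
  for d in NameNode SecondaryNameNode DataNode ResourceManager NodeManager JobHistoryServer; do
    # Compare against the whole second field so NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$jps_out" | awk -v d="$d" '$2 == d { ok = 1 } END { exit !ok }'; then
      missing="$missing $d"
    fi
  done
  # Print the space-separated names (empty line if everything is running).
  printf '%s\n' "${missing# }"
}
```

Typical use is jps | missing_daemons, which prints an empty line when all six daemons are up.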

Monitoring

  • hdfs dfsadmin -report

  • hdfs fsck /
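Both commands print long reports, so it can be handy to reduce them to a single answer. The two filters below are a sketch only. Their names are hypothetical, and they assume the report contains a line of the form "Live datanodes (N):" and that fsck ends with a summary such as "The filesystem under path '/' is HEALTHY"; check the output of the installed version before relying on them.

```shell
# Extract the live DataNode count from `hdfs dfsadmin -report` output on stdin.
# Assumption: the report contains a line like "Live datanodes (3):".
live_datanodes() {
  sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*/\1/p'
}

# Succeed only if `hdfs fsck` output on stdin reports a healthy filesystem.
# Assumption: the summary line ends in "is HEALTHY".
fsck_is_healthy() {
  grep -q "is HEALTHY"
}
```

Usage would look like hdfs dfsadmin -report | live_datanodes and hdfs fsck / | fsck_is_healthy && echo "HDFS healthy".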

Spark

Startup

  • spark-shell --master spark://10.0.4.146:7077 --driver-memory 8g --executor-memory 6g
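The command above hard-codes the standalone master's address and the memory sizes. One way to keep those adjustable is a small wrapper that prints the command line. This is a sketch: spark_shell_cmd and its positional parameters are invented for illustration, and the defaults simply mirror the values in the command above.

```shell
# Print a spark-shell command line for a standalone master.
# Arguments (all optional, illustrative): host, port, driver memory, executor memory.
spark_shell_cmd() {
  master_host="${1:-10.0.4.146}"
  master_port="${2:-7077}"      # 7077 is the standalone master's default port
  driver_mem="${3:-8g}"
  executor_mem="${4:-6g}"
  printf 'spark-shell --master spark://%s:%s --driver-memory %s --executor-memory %s\n' \
    "$master_host" "$master_port" "$driver_mem" "$executor_mem"
}
```

Calling spark_shell_cmd with no arguments reproduces the original command; spark_shell_cmd 10.0.4.146 7077 4g 2g prints a variant with smaller memory settings that can be pasted into the terminal.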