Big Data

[ h ][ w ] Hadoop

[ h ][ w ] Apache Spark

Downloads
[ y ] 12-12-2018. Computerphile. “Apache Spark - Computerphile”.
[ y ] 07-14-2021. freeCodeCamp. “PySpark Tutorial”.
[ y ] 03-02-2023. Simplilearn. “🔥Spark Full Course 2023 | Spark Tutorial For Beginners | Learn Apache Spark | Simplilearn”.
[ y ] 05-18-2021. Simplilearn. “Spark Full Course | Spark Tutorial For Beginners | Learn Apache Spark | Simplilearn”.
[ y ] 08-01-2019. Simplilearn. “What Is Apache Spark? | Apache Spark Tutorial | Apache Spark For Beginners | Simplilearn”.

[ y ] 01-05-2024. Business Analytics for Beginners. “Batch Processing VS Stream Processing in Big Data Analytics”.

[ h ][ course ] Leskovec, Jure; Anand Rajaraman; & Jeff Ullman. Mining of Massive Datasets.

Hadoop

Kunigk, Jan et al. (2018). Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale. O’Reilly.
Sammer, Eric. (2012). Hadoop Operations: A Guide for Developers and Administrators. O’Reilly. Home. GitHub.
White, Tom. (2015). Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale. 4th Ed. O’Reilly.

Spark

Chambers, Bill & Matei Zaharia. (2018). Spark: The Definitive Guide: Big Data Processing Made Simple. O’Reilly. GitHub.
Damji et al. (2020). Learning Spark: Lightning-Fast Data Analytics. 2nd Ed. O’Reilly. GitHub.
Karau, Holden & Rachel Warren. (2017). High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark. O’Reilly. GitHub.
Maas, Gerard & François Garillot. (2019). Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming. O’Reilly. GitHub.
Parsian, Mahmoud. (2022). Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up Using PySpark. O’Reilly. GitHub.
Perrin, Jean-Georges. (2020). Spark in Action: With examples in Java, Python, and Scala. 2nd Ed. Manning.
Polak, Adi. (2023). Scaling Machine Learning with Spark: Distributed ML with MLlib, TensorFlow, and PyTorch. O’Reilly.
Ryza et al. (2017). Advanced Analytics with Spark: Patterns for Learning from Data at Scale. 2nd Ed. O’Reilly. GitHub.
Tandon et al. (2022). Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark. O’Reilly.

Configuration

Defaults

$HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-common/core-default.xml
$HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
$HADOOP_HOME/share/doc/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
$HADOOP_HOME/share/doc/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
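The default files listed above are plain Hadoop configuration XML, so a single default can be looked up without opening the file. The following is a minimal sketch, assuming each `<name>` and `<value>` sits on its own line, as they appear to in the shipped defaults. `default_value` is a hypothetical helper, not a Hadoop command.

```shell
#!/usr/bin/env bash
# Print the default value of one property from a Hadoop *-default.xml file.
# Assumption: <name> and <value> are each on their own line.
# default_value is a hypothetical helper, not part of Hadoop.
default_value() {
  local key="$1" file="$2"
  awk -v key="$key" '
    # Remember that the requested property name has just been seen.
    index($0, "<name>" key "</name>") { found = 1; next }
    # The first <value> line after the name holds the default.
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
    # Give up if the property closes without a <value> line.
    found && /<\/property>/ { exit }
  ' "$file"
}
```

For example, `default_value dfs.replication "$HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml"` should print the shipped default. This reads only the documented default; the effective value on a configured node can differ, and `hdfs getconf -confKey dfs.replication` is the usual way to query that.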

Startup

hdfs --daemon start namenode
hdfs --daemon start secondarynamenode
hdfs --daemon start datanode
yarn --daemon start resourcemanager
yarn --daemon start nodemanager
yarn --daemon start proxyserver
mapred --daemon start historyserver
jps  # lists the instrumented Java Virtual Machines (JVMs) on the target system; experimental and unsupported per the JDK documentation
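The startup commands above can be wrapped in a small script for a single-node setup. This is only a sketch: `start_all` and `missing_daemons` are hypothetical helper names, and the main-class names that `missing_daemons` looks for are the ones these daemons are generally reported under by `jps`, so check them against your own output.

```shell
#!/usr/bin/env bash
# Start the daemons in the order listed above.
# Set DRY_RUN=1 to print the commands instead of running them.
start_all() {
  local entry tool daemon
  for entry in \
    "hdfs namenode" "hdfs secondarynamenode" "hdfs datanode" \
    "yarn resourcemanager" "yarn nodemanager" "yarn proxyserver" \
    "mapred historyserver"; do
    tool="${entry%% *}"    # word before the space, e.g. hdfs
    daemon="${entry#* }"   # word after the space, e.g. namenode
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$tool --daemon start $daemon"
    else
      "$tool" --daemon start "$daemon" || return 1
    fi
  done
}

# Read jps output ("<pid> <MainClass>" per line) on stdin and print
# every expected daemon class that is absent from it.
# Assumption: the class names below match what jps reports on your system.
missing_daemons() {
  local seen name
  seen="$(awk '{ print $2 }')"
  for name in NameNode SecondaryNameNode DataNode ResourceManager \
              NodeManager WebAppProxyServer JobHistoryServer; do
    # -x matches the whole line, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$seen" | grep -qx "$name" || echo "$name"
  done
}
```

A possible use is `start_all && jps | missing_daemons`, where empty output suggests that every expected daemon is up. `DRY_RUN=1 start_all` only prints the seven commands.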

Monitoring

Startup

spark-shell --master spark://10.0.4.146:7077 --driver-memory 8g --executor-memory 6g
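The three flags above correspond to the Spark properties `spark.master`, `spark.driver.memory`, and `spark.executor.memory`, which can also live in `conf/spark-defaults.conf` so that a bare `spark-shell` picks them up. A minimal sketch follows; `spark_defaults` is a hypothetical helper, and its default arguments simply repeat the values from the command above.

```shell
#!/usr/bin/env bash
# Print spark-defaults.conf lines equivalent to the spark-shell flags above.
# spark_defaults is a hypothetical helper; the property names are Spark's own.
# Arguments (all optional): master URL, driver memory, executor memory.
spark_defaults() {
  local master="${1:-spark://10.0.4.146:7077}"
  local driver_mem="${2:-8g}"
  local executor_mem="${3:-6g}"
  printf 'spark.master           %s\n' "$master"
  printf 'spark.driver.memory    %s\n' "$driver_mem"
  printf 'spark.executor.memory  %s\n' "$executor_mem"
}
```

One way to apply it is `spark_defaults >> "$SPARK_HOME/conf/spark-defaults.conf"`, assuming `SPARK_HOME` points at the Spark installation. Flags given on the command line still take precedence over this file, so the explicit `spark-shell` invocation keeps working unchanged.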