Department of Computer and Information Science (IDA)

Reading materials

Python (refresh)

Relational databases (refresh)

Ramez Elmasri and Shamkant B Navathe, Fundamentals of Database Systems, 7th edition, 2016: chapters 3-6 and 9, section 7.1.
SQL tutorial

Parallel processing (recommended reading)

C. Lin, L. Snyder: Principles of Parallel Programming. Pearson/Addison Wesley, 2008. 978-0-321-54942.

MapReduce and Hadoop (recommended reading)

Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. Proc. OSDI, ACM, 2004. (There is also the journal version in CACM 2008, which is under 'Machine Learning' on this page.)
Apache Hadoop: https://hadoop.apache.org
Donald Miner and Adam Shook: MapReduce Design Patterns. O'Reilly, 2012.

Spark (recommended reading)

Matei Zaharia et al.: Spark: cluster computing with working sets. Proc. HotCloud'10, USENIX, 2010.
Apache Spark: http://spark.apache.org
A. Nandi: Spark for Python Developers. Packt Publishing, 2015.

Resource management in big-data clusters (recommended reading)

Vinod Kumar Vavilapalli et al.: Apache Hadoop YARN: Yet Another Resource Negotiator. Proc. SoCC'13, ACM, 2013.
Apache Hadoop YARN: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html
Benjamin Hindman et al.: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Proc. NSDI'11, USENIX, 2011.
Apache Mesos: http://mesos.apache.org/

NoSQL data stores and techniques (recommended reading)

Elmasri et al.: Fundamentals of Database Systems, 7th edition, 2016: Chapter 24; Sections 20.1-20.3; 23.1-23.4.
Strauch: NoSQL Databases.
Cattell: Scalable SQL and NoSQL data stores. ACM SIGMOD Record 2010, pages 12-27.
Grolinger et al.: Data management in cloud environments: NoSQL and NewSQL data stores. Journal of Cloud Computing, 2013.
Stonebraker et al.: “One Size Fits All”: An Idea Whose Time Has Come and Gone. ICDE 2005, pages 2-11.
Brewer: Towards Robust Distributed Systems. Keynote talk at ACM PODC 2000.
Coulouris et al.: Distributed Systems: Concepts and Design, 5th edition, Chapter: Time & Global States.
Fox et al.: Cluster-Based Scalable Network Services. SOSP 1997, pages 78-91.
Karger et al.: Consistent Hashing and Random Trees. ACM STOC 1997, pages 654-663.
Vogels: Eventually Consistent. Communications of the ACM 2009, pages 40-44.

HDFS

(recommended reading) Shvachko et al.: The Hadoop Distributed File System. IEEE MSST 2010, pages 1-10.
(optional) White: Hadoop: The Definitive Guide, Chapter: The Hadoop Distributed File System. 2011.

Dynamo (recommended reading)

HBase (recommended reading)

Borthakur et al.: Apache Hadoop goes realtime at Facebook. ACM SIGMOD 2011, pages 1071-1080.
George: HBase: The Definitive Guide, Chapter: Introduction. 2011.

Hive and Shark/SparkSQL (recommended reading)

Machine learning (recommended reading)

Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
Chu, C.-T. et al. Map-Reduce for Machine Learning on Multicore. NIPS 19, 281-288, 2006.
Zaharia, M. et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 12, 15-28, 2012.
Meng, X. et al. MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research, 17(34):1-7, 2016.

Page responsible: BDA
Last updated: 2017-03-15