Reading materials
- Interactive Python web tutorial
- Python code visualization
- Python tutorial
- Codecademy
- Infographic on R vs Python
- Cheat sheet
- Python's assignment (or "binding") model
- Ramez Elmasri and Shamkant B Navathe, Fundamentals of Database Systems, 7th edition, 2016: chapters 3-6 and 9, section 7.1.
- SQL tutorial
- C. Lin, L. Snyder: Principles of Parallel Programming. Pearson/Addison Wesley, 2008. 978-0-321-54942.
- Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. Proc. OSDI, ACM, 2004. (There is also the journal version in CACM 2008, which is under 'Machine Learning' on this page.)
- Apache Hadoop: https://hadoop.apache.org
- Donald Miner and Adam Shook: MapReduce Design Patterns. O'Reilly, 2012.
- Matei Zaharia et al.: Spark: cluster computing with working sets. Proc. HotCloud'10, USENIX, 2010.
- Apache Spark: http://spark.apache.org
- A. Nandi: Spark for Python Developers. Packt Publishing, 2015.
- Vinod Kumar Vavilapalli et al.: Apache Hadoop YARN: Yet Another Resource Negotiator. Proc. SoCC'13, ACM, 2013.
- Apache Hadoop YARN: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html
- Benjamin Hindman et al.: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Proc. NSDI'11, USENIX, 2011.
- Apache Mesos: http://mesos.apache.org/
- Elmasri et al.: Fundamentals of Database Systems, 7th edition, 2016: Chapter 24; Sections 20.1-20.3; 23.1-23.4.
- Strauch: NoSQL Databases.
- Cattell: Scalable SQL and NoSQL data stores. ACM SIGMOD Record 2010, pages 12-27.
- Grolinger et al.: Data management in cloud environments: NoSQL and NewSQL data stores. Journal of Cloud Computing, 2013.
- Stonebraker et al.: “One Size Fits All”: An Idea Whose Time Has Come and Gone. ICDE 2005, pages 2-11.
- Brewer: Towards Robust Distributed Systems. Keynote talk at ACM PODC 2000.
- Coulouris et al.: Distributed Systems: Concepts and Design, Chapter: Time & Global States, 5th Edition.
- Fox et al.: Cluster-Based Scalable Network Services. SOSP 1997, pages 78-91.
- Karger et al.: Consistent Hashing and Random Trees. ACM STOC 1997, pages 654-663.
- Vogels: Eventually Consistent. Communications of the ACM 2009, pages 40-44.
- (recommended reading) Shvachko et al.: The Hadoop Distributed File System. IEEE MSST 2010, pages 1-10.
- (optional) White: Hadoop: The Definitive Guide, Chapter: The Hadoop Distributed File System. 2011.
- DeCandia et al.: Dynamo: Amazon's Highly Available Key-value Store. ACM SOSP 2007, pages 205-220.
- Borthakur et al.: Apache Hadoop goes realtime at Facebook. ACM SIGMOD 2011, pages 1071-1080.
- George: HBase: The Definitive Guide, Chapter: Introduction. 2011.
- Thusoo et al.: Hive - a petabyte scale data warehouse using Hadoop. ICDE 2010, pages 996-1005.
- Prokopp: The Free Hive Book.
- Hive Manual.
- Xin et al.: Shark: SQL and Rich Analytics at Scale. ACM SIGMOD 2013, pages 13-24.
- SparkSQL Manual.
- Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
- Chu, C.-T. et al. Map-Reduce for Machine Learning on Multicore. NIPS 19, 281-288, 2006.
- Zaharia, M. et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 12, 15-28, 2012.
- Meng, X. et al. MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research, 17(34):1-7, 2016.
Page responsible: BDA
Last updated: 2017-03-15