MapReduce Programming Paradigm

1. Distributed storage for big data

1.1. Working with big data

1.2. Types of storage

1.3. Google File System (GFS)

1.4. Design and implementation of GFS/HDFS


2. MapReduce Programming Model

2.1. Motivation

2.2. In a nutshell

int square(int x) { return x * x; }
map square [1,2,3,4] -> [1,4,9,16]

reduce ([1,2,3,4]) using sum -> 10
reduce ([1,2,3,4]) using multiply -> 24
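
The map and reduce operations above can be tried directly in Python. This is a minimal sketch using the built-in `map` and `functools.reduce` on a single machine. It illustrates only the functional idea, not the distributed framework, and the variable names are chosen for illustration.

```python
from functools import reduce
import operator


def square(x):
    """Return x squared; applied independently to each element by map."""
    return x * x


numbers = [1, 2, 3, 4]

# map: apply square to every element, producing a list of the same length.
squares = list(map(square, numbers))

# reduce: fold the whole list into a single value with a binary operator.
total = reduce(operator.add, numbers)
product = reduce(operator.mul, numbers)

print(squares)  # [1, 4, 9, 16]
print(total)    # 10
print(product)  # 24
```

Because `square` looks at one element at a time, the map step could be split across machines without coordination. The reduce step can be parallelized too when the operator is associative, as addition and multiplication are.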

3. Word Count: the “Hello, World” of Big Data

3.1. Problem statement

3.2. MapReduce workflow

3.3. MapReduce workflow - what do you really do

[Figure: mapping and reducing]
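
As a rough illustration of the two pieces a programmer typically writes for word count, the sketch below simulates the workflow in plain Python. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the `shuffle` step stands in for the grouping a real framework would perform between the map and reduce phases.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key; a real framework does this between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum all the counts collected for a single word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


counts = word_count(["hello world", "hello big data"])
print(counts)  # {'hello': 2, 'world': 1, 'big': 1, 'data': 1}
```

Only `map_phase` and `reduce_phase` correspond to user code here. Splitting the input, grouping by key, scheduling, and fault handling are assumed to be the framework's responsibility.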

3.4. Everything else …

3.5. Distributed storage and MapReduce

The MapReduce framework lends itself nicely to the distributed storage model of GFS/HDFS, as shown in the figure below.


4. Algorithms using MapReduce

4.1. Overview

4.2. Matrix-Vector multiplication

Example Colab Notebook: Matrix Vector Dot Product
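
One common way to express matrix-vector multiplication in MapReduce is sketched below. It assumes the matrix is stored as sparse (row, column, value) triples and that the vector is small enough for every mapper to hold in memory. The names `map_entry`, `reduce_row`, and `mat_vec` are hypothetical, and the sketch is not taken from the notebook referenced above.

```python
from collections import defaultdict


def map_entry(entry, v):
    """Mapper: for matrix element m_ij, emit (i, m_ij * v_j)."""
    i, j, m_ij = entry
    return i, m_ij * v[j]


def reduce_row(i, products):
    """Reducer: sum the partial products for row i to get x_i."""
    return i, sum(products)


def mat_vec(entries, v):
    """Simulate map, group-by-key, and reduce on a single machine."""
    grouped = defaultdict(list)
    for entry in entries:
        i, product = map_entry(entry, v)
        grouped[i].append(product)
    return dict(reduce_row(i, products) for i, products in grouped.items())


# The 2x2 matrix [[1, 2], [3, 4]] as (row, column, value) triples.
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
v = [5, 6]

result = mat_vec(M, v)
print(result)  # {0: 17, 1: 39}
```

The row index serves as the key, so all partial products for one output element arrive at the same reducer.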

4.3. Matrix multiplication

4.4. Relational algebra: selection

4.5. Relational algebra: projection

4.6. Relational algebra: union, intersection, and difference

4.7. Relational algebra: natural join

4.8. Relational algebra: grouping and aggregation


5. Extensions to MapReduce

5.1. Overview

5.2. Workflow systems

6. Cost Model

6.1. Communication cost

6.2. Wall-Clock time