# MapReduce Programming Paradigm
## 1. Distributed storage for big data

### 1.1. Working with big data

Working with big data takes hundreds or thousands of machines, which means we must:

- Distribute data for storage
- Distribute data computation
- Handle failure
- The papers

### 1.2. Types of storage

- Single computer storage
- Remote large-scale storage
- Remote distributed storage
- Parallel file systems
- Distributed file systems
### 1.3. Google File System (GFS)

The Google File System (and its open-source counterpart, the Hadoop Distributed File System) addressed the distributed storage question. Its design assumptions:

- Hardware failure is the norm rather than the exception
- Streaming data access
- Not for general-purpose applications
- For batch processing rather than interactive use
- For high throughput of data access rather than low latency of data access
- Large data sets (terabytes in size)
- Simple coherency model (write once, read many)
- Moving computation is cheaper than moving data
- Portability across heterogeneous hardware and software platforms

How do you write programs to process data stored in this manner?

### 1.4. Design and implementation of GFS/HDFS

- Master node (GFS) / NameNode (HDFS)
  - Stores metadata about where files are stored
  - Might be replicated
- Chunk servers (GFS) / DataNodes (HDFS)
  - Each file is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk is replicated (usually 2x or 3x) across chunk servers/DataNodes
  - Replicas are kept in different racks where possible
- Client library for file access
  - Talks to the master to find the chunk servers
  - Connects directly to the chunk servers to access data (see the sketch below)
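To make the division of labor concrete, here is a toy, single-process sketch of that lookup. All names (`ToyNameNode`, `CHUNK_SIZE`, the DataNode addresses) are hypothetical illustrations, not the actual GFS/HDFS interfaces.

```python
# Illustrative sketch only (hypothetical names, not the real GFS/HDFS API):
# a toy NameNode holds chunk metadata, while the data itself would live on
# the DataNodes / chunk servers.
CHUNK_SIZE = 64 * 1024 * 1024  # chunks are typically 16-64 MB

class ToyNameNode:
    def __init__(self):
        # metadata only: (filename, chunk index) -> list of DataNode addresses
        self.chunk_map = {}

    def add_chunk(self, filename, chunk_index, datanodes):
        self.chunk_map[(filename, chunk_index)] = list(datanodes)

    def locate(self, filename, offset):
        """Return the chunk index covering `offset` and the replicas holding it."""
        chunk_index = offset // CHUNK_SIZE
        return chunk_index, self.chunk_map[(filename, chunk_index)]

# The client library asks the NameNode where a chunk lives, then reads the
# bytes directly from one of the DataNodes; data never flows through the NameNode.
nn = ToyNameNode()
nn.add_chunk("web.log", 0, ["datanode-3", "datanode-7", "datanode-12"])
print(nn.locate("web.log", 10_000_000))
# (0, ['datanode-3', 'datanode-7', 'datanode-12'])
```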
## 2. MapReduce Programming Model

### 2.1. Motivation

Challenges:

- The input data is usually large.
- The computations have to be distributed across hundreds or thousands of machines.
- The job has to finish in a reasonable amount of time.

Addressing these challenges causes the original simple computation to become obscured by large amounts of complex supporting code.

What MapReduce does:

- It is a new abstraction.
- It expresses the simple core computations.
- It hides the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.

Why MapReduce? The name is inspired by the _map_ and _reduce_ primitives present in Lisp and many other functional languages. Most data computations involve:

- applying a _map_ operation to each logical record in the input in order to compute a set of intermediate key/value pairs, and then
- applying a _reduce_ operation to all the values that share the same key to combine the derived data appropriately.

### 2.2. In a nutshell

What is map? A function/procedure that is applied to every individual element of a collection/list/array/…
```
int square(x) { return x * x; }

map square [1, 2, 3, 4] -> [1, 4, 9, 16]
```
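The same idea in runnable Python, using the built-in `map` (the snippet above is language-neutral pseudocode):

```python
def square(x):
    return x * x

# apply square to every element of the list
print(list(map(square, [1, 2, 3, 4])))  # [1, 4, 9, 16]
```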
What is reduce? A function/procedure that performs an operation on a list, folding/reducing the list into a single value (or a smaller subset).
```
reduce([1, 2, 3, 4]) using sum      -> 10
reduce([1, 2, 3, 4]) using multiply -> 24
```
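In Python, the same folds can be written with `functools.reduce`:

```python
from functools import reduce

# fold the list with addition, then with multiplication
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10
print(reduce(lambda a, b: a * b, [1, 2, 3, 4]))  # 24
```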
## 3. Word Count: the “Hello, World” of Big Data

### 3.1. Problem statement

We have a large amount of text:

- It could be stored in a single massive file.
- It could be stored in multiple files.

We want to count the number of times each distinct word appears in the files.

Sample applications:

- Analyze web server logs to find popular URLs/keywords.
- Analyze security logs to find incidents.

A standard parallel programming approach involves:

- Counting the number of files, or setting up seek distances within a file.
- Setting the number of processes.
- Possibly setting up dynamic workload assignment.
- A lot of data transfer.
- Significant coding effort.

Serial implementation: see the sketch below.
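A minimal serial implementation in Python might look like the following sketch (the file names and whitespace tokenization are illustrative assumptions):

```python
# Minimal serial word count: read each file, split lines into words, and tally.
from collections import Counter

def word_count(filenames):
    counts = Counter()
    for name in filenames:
        with open(name) as f:
            for line in f:
                counts.update(line.split())
    return counts

# counts = word_count(["part-0.txt", "part-1.txt"])  # hypothetical input files
# print(counts.most_common(10))
```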
### 3.2. MapReduce workflow

### 3.3. MapReduce workflow: what do you really do

- Input: a set of key/value pairs.
- The programmer specifies two methods:
  - Map(k, v) -> (k', v')
    - Takes a key/value pair and outputs a set of intermediate key/value pairs.
    - E.g., the key is the filename and the value is a single line in the file.
    - There is one Map call for every (k, v) pair.
  - Reduce(k', <v'>) -> <k', v''>
    - All values v' with the same key k' are reduced together and processed in v' order.
    - There is one Reduce call per unique key k'.

*(Figure: mapping and reducing)*
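For word count, a toy single-machine version of these two methods (plus the group-by-key step the framework performs, described in 3.4) might look like this; it is a sketch of the programming model only, not a distributed implementation:

```python
from collections import defaultdict

def map_fn(filename, line):
    """Map(k, v): k is a filename, v is one line; emit (word, 1) for each word."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce(k', <v'>): sum all counts emitted for the same word."""
    yield (word, sum(counts))

# Toy driver: the real framework would run mappers and reducers on many machines.
lines = [("doc1", "big data is big"), ("doc2", "data moves to computation")]

groups = defaultdict(list)                    # group by key (the shuffle step)
for filename, line in lines:
    for word, count in map_fn(filename, line):
        groups[word].append(count)

result = [pair for word, counts in sorted(groups.items())
               for pair in reduce_fn(word, counts)]
print(result)
# [('big', 2), ('computation', 1), ('data', 2), ('is', 1), ('moves', 1), ('to', 1)]
```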
### 3.4. Everything else …

The MapReduce framework takes care of:

- Partitioning the input data
- Scheduling the program’s execution across a set of machines
- Performing the group-by-key step
- Handling machine failures
- Managing the required inter-machine communication

### 3.5. Distributed storage and MapReduce

The MapReduce framework lends itself nicely to the distributed storage model of GFS/HDFS, as shown in the figure below.
## 4. Algorithms using MapReduce

### 4.1. Overview

- MapReduce is not a solution to every problem.
- GFS/HDFS is only beneficial for files that are very large and rarely updated.
- Original purpose of Google’s MapReduce implementation: matrix-vector calculation.
- Also covered here: matrix-matrix calculation and relational algebra.

### 4.2. Matrix-vector multiplication

Suppose we have:

- an $n\times n$ matrix $M$ whose element in row $i$ and column $j$ is denoted $m_{ij}$, and
- a vector $v$ of length $n$ whose $j^{th}$ element is $v_j$.

The matrix-vector product is the vector $x$ with $x_i = \sum_{j=1}^{n} m_{ij}v_j$.

First, assume $n$ is large, but $v$ can still fit into the main memory of a node.

- Map: each map task operates on a chunk of $M$ and reads in the entirety of $v$. For each matrix element $m_{ij}$, return the key/value pair $(i, m_{ij}v_j)$.
- Reduce: sum up all values of pairs with the same key $i$ (a toy code sketch follows the notebook example below).

??? example "Example Colab Notebook"
    Matrix Vector Dot Product
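A toy, single-machine illustration of this Map/Reduce pair (the sparse `(i, j, m_ij)` entry format and the in-memory shuffle are assumptions made for the sketch):

```python
from collections import defaultdict

def map_task(chunk_of_M, v):
    """chunk_of_M: the (i, j, m_ij) entries assigned to this task; v fits in memory."""
    for i, j, m_ij in chunk_of_M:
        yield (i, m_ij * v[j])              # key = row index i

def reduce_task(i, values):
    return (i, sum(values))                 # x_i = sum_j m_ij * v_j

# Example: M = [[1, 2], [3, 4]], v = [10, 20]
M_entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
v = [10, 20]

groups = defaultdict(list)
for key, value in map_task(M_entries, v):   # map
    groups[key].append(value)               # group by key (shuffle)
print([reduce_task(i, vals) for i, vals in sorted(groups.items())])
# [(0, 50), (1, 110)]
```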
If $n$ does not fit into main memory:

- Divide the matrix into vertical stripes of equal width $w$: $M$ becomes $n/w$ submatrices, each of dimension $(n, w)$.
- Divide the vector into equal stripes of the same length $w$.
- Develop the Map and Reduce procedures in a similar fashion, with each map task pairing a chunk of one matrix stripe with the corresponding vector stripe.

### 4.3. Matrix multiplication

### 4.4. Relational algebra: selection

Apply a condition $C$ to each tuple in the relation and produce as output only those tuples that satisfy $C$. This does not need the full power of MapReduce:

- Map: for each data element $e$, test whether it satisfies condition $C$. If so, produce the key/value pair $(e, e)$.
- Reduce: not needed; simply pass the pair $(e, e)$ through.

### 4.5. Relational algebra: projection

For some subset $S$ of the attributes of the relation, produce from each tuple only the components for the attributes in $S$. Projection could generate duplicates.

- Map: for each tuple $t$ in $R$, construct $t'$ by removing from $t$ the components whose attributes are not in $S$.
- Reduce: group pairs by key; pairs with multiple values are flattened to a single value (removing duplicates).

### 4.6. Relational algebra: union, intersection, and difference

Consider relations $R$ and $S$; a tuple $t$ could belong to either relation (or both).

- Union: $t \xrightarrow{map} (t,t) \xrightarrow{shuffle} (t,t)\ or\ (t,[t,t]) \xrightarrow{reduce} (t,t)$
- Intersection: $t \xrightarrow{map} (t,t) \xrightarrow{shuffle} (t,t)\ or\ (t,[t,t]) \xrightarrow{reduce} (t,t)\ from\ (t,[t,t])\ only$
- Difference (assume $R - S$): $t \xrightarrow{map} (t,R)\ or\ (t,S) \xrightarrow{shuffle} (t,R)\ or\ (t,S)\ or\ (t,[R,S]) \xrightarrow{reduce} (t,t)\ from\ (t,R)\ only$

### 4.7. Relational algebra: natural join

Join $R(A, B)$ with $S(B, C)$.

- Map:
  - $(a,b)\ of\ R \xrightarrow{map} (b, (R,a))$
  - $(b,c)\ of\ S \xrightarrow{map} (b, (S,c))$
  - Here $R$ and $S$ are single-bit tags indicating whether the value comes from $R$ or $S$.
- Reduce:
  - $(b,[(R,a),(S,c)]) \xrightarrow{reduce} (somekey, (a,b,c))$
  - Each value tagged $R$ is paired with each value tagged $S$ that shares the same key $b$.

### 4.8. Relational algebra: grouping and aggregation

Assume a relation $R(A, B, C)$ whose tuples have three attributes: $A$ for grouping, $B$ for aggregating, and $C$ for the rest.

- Map: for each tuple $(a, b, c)$, produce the key/value pair $(a, b)$.
- Reduce: perform the aggregation on the list-type value $(a, [b_1, b_2, \dots, b_n])$.

## 5. Extensions to MapReduce

### 5.1. Overview

MapReduce influenced a number of extensions and modifications. They share similarities:

- Built on a distributed file system.
- Manage a very large number of tasks spawned from a small amount of user-written code (bring computation to data).
- Built-in fault tolerance and error recovery.

### 5.2. Workflow systems

- Extend MapReduce from a two-step workflow to any collection of functions.
- An acyclic graph represents the workflow among the functions (dataflow programming).
- Each function of the workflow can be executed by many tasks (data parallelism).
- Blocking property: a function only delivers output after completion.
- Examples

## 6. Cost Model

### 6.1. Communication cost

Assume an algorithm implemented by an acyclic network of tasks:

- a single MapReduce job,
- chained MapReduce jobs,
- …

The communication cost of a task is the size of the input to the task. The communication cost of the algorithm is the sum of the communication costs of all its tasks.

For massive data, communication cost is the key performance indicator:

- The computation performed by a single task tends to be simple and linear in its input size.
- There is an I/O bottleneck due to network transfer.
- There is an I/O bottleneck due to disk-to-memory transfer.

Example: the MapReduce natural join of 4.7 (a worked cost sketch is given at the end of this section).

### 6.2. Wall-clock time

- MapReduce is parallel by nature (computation moves to the distributed data storage).
- Wall-clock time should be weighed very carefully when traded off against communication cost.
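As a worked sketch of that join example (the symbols $r$ and $s$ for the number of tuples in $R$ and $S$ are introduced here for illustration): each map task reads tuples of $R$ and $S$ and emits one key/value pair of about the same size per input tuple, so the reduce tasks receive roughly as much data as the map tasks read. The total communication cost is therefore

$$
\underbrace{(r+s)}_{\text{map input}} \;+\; \underbrace{(r+s)}_{\text{reduce input}} \;=\; 2(r+s) \;=\; O(r+s).
$$

Following the convention above that cost is measured by task inputs, the join's own output is not counted here; it becomes the input cost of whatever job consumes it.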