Performance Techniques in MapReduce
- The Google File System (GFS) stores multiple copies (typically 3)
of data files on different computers for redundancy and availability.
- The master assigns each worker data that is already stored
on that worker's disk, or on a machine in the same rack.
This reduces network communication, since network bandwidth is scarce.
- Combiner functions can perform partial reductions (e.g., summing
"1" values) on the Map worker before data are written out to disk,
reducing both I/O and network traffic.
- The master can start redundant workers to process the same data
as a dead or "slacker" worker. The master will use the result from
the worker that finishes first; results from later workers will
be ignored.
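- Reduce workers can start fetching intermediate data as soon as
some Map workers have finished their data; the reduce function
itself runs only once all Map output for its keys has arrived.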
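
The combiner technique above can be illustrated with a small, single-process sketch. It is not Google's implementation: the word-count task, the function names (map_words, combine, reduce_counts) and the two input splits are all assumptions made for illustration. It only shows that summing counts locally on each Map worker shrinks the number of intermediate pairs without changing the final result.

```python
from collections import defaultdict


def map_words(text):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner (hypothetical helper): sum counts per word on the map side."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())


def reduce_counts(all_pairs):
    """Reduce step: sum the (possibly pre-combined) counts per word."""
    totals = defaultdict(int)
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


# Two made-up input splits, one per simulated Map worker.
splits = ["the cat saw the dog", "the dog saw the cat the end"]

raw = [map_words(s) for s in splits]          # map output, no combiner
combined = [combine(pairs) for pairs in raw]  # map output after the combiner

# Number of intermediate pairs that would be written and shipped to reducers.
pairs_without_combiner = sum(len(p) for p in raw)
pairs_with_combiner = sum(len(p) for p in combined)

# The final counts are the same either way, because addition is
# associative and commutative.
totals_raw = reduce_counts(pair for p in raw for pair in p)
totals_combined = reduce_counts(pair for p in combined for pair in p)

print(pairs_without_combiner, pairs_with_combiner)
```

On these two splits the intermediate data drops from 12 pairs to 9. The saving grows with the number of repeated keys in each split, which is why combiners help most on skewed data such as word frequencies.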