Some data values are much more popular than others. For example, there were 13 people on a class roster whose names started with S, but only one K, and no Q or X.
If MapReduce assigned one Reduce task per key, some Reduce tasks would have very large inputs and run slowly, while others would have too little work to do.
MapReduce performs load balancing by having a large number R of Reduce tasks and using hashing to assign data to Reduce tasks: task = Hash(key) mod R
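A minimal sketch of this partitioning rule, using `crc32` as a stand-in for the framework's hash function (the actual hash function and the names `partition` and `R` are assumptions for illustration):

```python
import zlib

R = 4  # number of Reduce tasks

def partition(key: str, num_tasks: int = R) -> int:
    # task = Hash(key) mod R; crc32 is a stable stand-in for the
    # framework's hash function (an assumption for this sketch).
    return zlib.crc32(key.encode()) % num_tasks
```

Because the hash is deterministic, every Map task sends a given key to the same Reduce task, and keys spread roughly evenly across the R tasks.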
This assigns many keys to the same Reduce task. The Reduce task reads the files produced by all Map tasks for its hash value (a remote read over the network), sorts the combined input by key, and concatenates the value lists for each key before calling the application's Reduce function once per key.
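The merge step above can be sketched as follows, assuming each Map task's output is a list of (key, value) pairs destined for this Reduce task (the function name `run_reduce_task` and the in-memory representation are assumptions; a real implementation reads files over the network):

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(map_outputs, reduce_fn):
    """Simulate one Reduce task: merge the (key, value) pairs received
    from every Map task, sort by key, and call reduce_fn once per key
    with that key's full list of values."""
    pairs = [kv for output in map_outputs for kv in output]
    pairs.sort(key=itemgetter(0))  # sort the combined input by key
    results = {}
    for key, group in groupby(pairs, key=itemgetter(0)):
        results[key] = reduce_fn(key, [v for _, v in group])
    return results

# Example: counting name initials from two Map tasks' outputs.
map_outputs = [
    [("S", 1), ("K", 1), ("S", 1)],
    [("S", 1), ("A", 1)],
]
counts = run_reduce_task(map_outputs, lambda key, values: sum(values))
# counts == {"A": 1, "K": 1, "S": 3}
```

Sorting first means each key's values end up adjacent, so the Reduce function can be invoked once per key in a single pass.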