The Reduce Phase
Once the input data has been split and each split has been processed by its own map task, the reduce phase begins. The outputs of the map tasks serve as the input to the reduce tasks, which also run in parallel with one another. Each reduce task aggregates the intermediate results it receives from the different map tasks, and together the reduce tasks produce the final consolidated result. By default, this result is stored in HDFS.
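To make the data flow concrete, here is a minimal in-memory sketch of a word count job in Python. It is not Hadoop code: the names `map_task`, `partition`, `shuffle`, `reduce_task`, and `run_job` are hypothetical, the character-sum partition function is only a stand-in for a real partitioner, and merging every reducer's output into one dictionary is a simplification made for readability.

```python
from collections import defaultdict


def map_task(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def partition(key, num_reducers):
    """Choose the reduce task for a key (illustrative stand-in partitioner)."""
    return sum(ord(ch) for ch in key) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every intermediate pair to its reducer and group values by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets


def reduce_task(bucket):
    """Reduce: aggregate all values collected for each key in this bucket."""
    return {key: sum(values) for key, values in bucket.items()}


def run_job(splits, num_reducers=2):
    """Run map, shuffle, and reduce sequentially over in-memory splits."""
    map_outputs = [map_task(split) for split in splits]
    buckets = shuffle(map_outputs, num_reducers)
    result = {}
    for bucket in buckets:
        # Simplification: combine the per-reducer outputs into one dict.
        result.update(reduce_task(bucket))
    return result
```

Calling `run_job(["the cat sat", "the dog sat"])` counts each word across both splits. Because a key is always routed to the same reduce task, every count is computed in exactly one place, whichever map task emitted the pairs.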
Map and Reduce Phases Run in Parallel
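We saw that the output of the map phase is fed as input to the reduce phase; in practice, however, the two phases are not strictly sequential. As soon as a map task completes, the reduce tasks can begin fetching its output without waiting for the remaining map tasks to finish, so the map and reduce phases overlap in time. The final aggregation for a given key can only complete once all map tasks have finished, because any map task that is still running may emit another value for that key.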
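The overlap can be illustrated with a small Python sketch that uses a thread pool in place of a cluster. This is an analogy under stated assumptions, not how Hadoop is implemented: `map_task` and `run_overlapped` are hypothetical names, and a single in-process "reducer" stands in for the parallel reduce tasks.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed


def map_task(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def run_overlapped(splits, max_workers=3):
    """Collect map output as each map task finishes, then reduce."""
    grouped = defaultdict(list)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(map_task, split) for split in splits]
        # as_completed yields each future when its map task is done, so
        # collecting output overlaps with map tasks that are still running.
        for future in as_completed(futures):
            for key, value in future.result():
                grouped[key].append(value)
    # The aggregation runs only after every map output has been collected.
    return {key: sum(values) for key, values in grouped.items()}
```

The collection loop starts working while later map tasks are still running, whereas the final sum over each key waits until all map output is in.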