Hadoop MapReduce is a software framework developed by Apache for processing huge datasets in parallel across clusters of commodity hardware. The process involves two basic steps: Map and Reduce. In this article, we will see how the Map and Reduce steps work. But before that, let us review another important term: the MapReduce Job.
A Job is a top-level unit of work. A Job typically consists of two phases, map and reduce, although the reduce phase can be omitted. A typical example of a MapReduce Job is counting the number of occurrences of a specific word across thousands of documents. During the Map phase of the job, occurrences of the word are counted in each document, while the Reduce phase aggregates the per-document counts into a total sum.
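The word-count job described above can be sketched in plain Python as a conceptual model (this is not the Hadoop API; the document contents are made-up examples). The map step emits a (word, 1) pair for every word occurrence, and the reduce step sums the pairs per word:

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every occurrence of a word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Aggregate the per-document counts into a total sum per word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(pairs)
print(counts["the"])  # "the" occurs once in each of the three documents
```

In a real Hadoop job, the map calls run as separate tasks on different cluster nodes, and the framework shuffles the intermediate pairs so that all counts for the same word reach the same reducer.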
The Map Phase
In the Map phase, the MapReduce framework reads its input data, by default from the Hadoop Distributed File System (HDFS). The input data is divided into small chunks, called input splits, each of which is processed by a separate map task; the map tasks run in parallel across the Hadoop cluster.
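The splitting behaviour can be illustrated with a minimal sketch (hypothetical helper names, not Hadoop code): the input is divided into chunks, and each map task sees only its own chunk. Real Hadoop runs the tasks concurrently on different nodes; here they are just sequential function calls.

```python
def split_input(lines, num_splits):
    # Divide the input lines into roughly equal chunks (the "input splits").
    size = max(1, len(lines) // num_splits)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_task(chunk):
    # Each map task processes only its own chunk, emitting (word, 1) pairs.
    return [(word.lower(), 1) for line in chunk for word in line.split()]

lines = ["alpha beta", "beta gamma", "gamma alpha", "alpha beta"]
splits = split_input(lines, 2)          # two chunks of two lines each
results = [map_task(chunk) for chunk in splits]  # one output list per map task
```

Because each task depends only on its own split, the framework can schedule the tasks on whichever nodes hold the corresponding data blocks, which is what makes the phase scale across the cluster.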