If you have heard of big data, you have probably heard of Hadoop as well. The two terms are often used together because Hadoop is currently the primary software framework for the distributed storage and distributed processing of huge amounts of data on clusters of computers built from commodity hardware.
Companies such as LinkedIn, Facebook, Yahoo and eBay currently use Hadoop to process huge amounts of data and extract useful information from it. Hadoop draws its inspiration from Google's publications on Bigtable, MapReduce and the Google File System (GFS). The fact that Hadoop can be hosted on very simple commodity hardware, e.g. an Intel-based PC running Linux with a few terabytes of disk, makes it one of the most popular big data processing frameworks. The Hadoop framework consists of many tools; HDFS and Hadoop MapReduce are the two core components of the Hadoop environment.
Hadoop Distributed File System (HDFS)
HDFS is the file system used by the Hadoop environment. To the user, HDFS behaves much like any other file system. The difference is that when a file is stored on HDFS, it is divided into fixed-size blocks, and each block is replicated on three servers by default. This replication increases the fault tolerance of the file system: if one server fails, the data remains available on the others.
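The block-splitting and replication idea can be illustrated with a small sketch. This is not the real HDFS implementation; the tiny block size, the node names and the round-robin placement are made-up simplifications of what HDFS actually does at megabyte scale.

```python
# Conceptual sketch of HDFS-style storage: split data into fixed-size
# blocks and place each block on 3 of the available "datanodes",
# mimicking HDFS's default replication factor of three.
# Block size and node names are illustrative, not real HDFS values.
import itertools

BLOCK_SIZE = 4                 # bytes here; real HDFS uses 64/128 MB blocks
REPLICATION = 3                # HDFS default replication factor
DATANODES = ["node1", "node2", "node3", "node4", "node5"]

def store(data: bytes):
    """Split data into blocks and pick REPLICATION nodes for each block."""
    node_cycle = itertools.cycle(DATANODES)   # simplistic placement policy
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for idx, block in enumerate(blocks):
        replicas = [next(node_cycle) for _ in range(REPLICATION)]
        placement[idx] = (block, replicas)
    return placement

layout = store(b"hello hadoop!")
for idx, (block, replicas) in layout.items():
    print(idx, block, replicas)
```

Because every block lives on three different nodes, losing any single node never loses data — the namenode simply re-replicates the affected blocks from the surviving copies.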
Hadoop MapReduce
Hadoop MapReduce is a programming model in which a large processing job is split into many smaller tasks that are sent to several processing servers, which execute them in parallel. This technique uses the aggregate CPU power of the cluster in an efficient, horizontally scalable manner.
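The classic illustration of the model is counting words. The sketch below simulates the map, shuffle and reduce phases in plain Python; real Hadoop jobs are typically written in Java against the org.apache.hadoop.mapreduce API, and Hadoop would run the mappers and reducers on different servers in parallel rather than in one process.

```python
# Word count expressed in the MapReduce model (local simulation only).
from collections import defaultdict

def map_phase(split):
    # Each mapper handles one input split and emits (word, 1) pairs.
    return [(word, 1) for word in split.split()]

def shuffle(mapped):
    # Between map and reduce, Hadoop groups all emitted values by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer sums the counts for the words assigned to it.
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data big cluster", "data on a cluster", "big big data"]
mapped = [map_phase(s) for s in splits]   # on Hadoop, these run in parallel
result = reduce_phase(shuffle(mapped))
print(result)
```

Because each mapper only needs its own input split and each reducer only needs the values for its own keys, the phases parallelize naturally across as many servers as the cluster provides.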
Apart from these two core technologies, the Hadoop ecosystem includes many other tools and technologies, such as the HBase database, the Lucene search library, the ZooKeeper coordination service, and higher-level languages such as Pig and Hive.