Hadoop Overview and Its Ecosystem


Hadoop overview

  • Hadoop is an open source implementation of the Map Reduce platform and a distributed file system, written in Java.
  • Hadoop is actually a collection of tools, and an ecosystem built on top of the tools.
  • The problem Hadoop solves is how to store and process big data. When we need to store and process petabytes of information, the monolithic approach to computing no longer makes sense.
  • When data is loaded into the system, it is split into blocks, typically of 64 MB or 128 MB each.
  • The first part of the Map Reduce system works on relatively small portions of data, typically a single block.
  • A master program allocates work to nodes such that a Map task works on a block of data stored locally on that node whenever possible. Many nodes work in parallel, each on its own part of the overall data set.
  • Hadoop consists of two core components:
    • The Hadoop Distributed File System (HDFS)
    • Map Reduce
  • Many other projects are built around core Hadoop; together they are often referred to as the Hadoop Ecosystem.
  • The Hadoop Ecosystem includes Pig, Hive, HBase, Flume, Oozie, Sqoop and ZooKeeper.
  • A set of machines running HDFS and Map Reduce is known as a Hadoop cluster.
  • In a Hadoop cluster, individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand.
  • Adding nodes to a Hadoop cluster generally improves performance, because more blocks can be stored and processed in parallel.
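The block splitting and locality-aware allocation described above can be sketched in a few lines of Python. This is only an illustration, not Hadoop's actual scheduler: the function names `split_into_blocks` and `assign_map_tasks`, the node names, and the idea of a fixed number of free "slots" per node are all assumptions made for the example.

```python
BLOCK_SIZE_MB = 128  # assumed block size; the text mentions 64 MB or 128 MB


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than the rest
    return sizes


def assign_map_tasks(block_locations, free_slots):
    """Assign one Map task per block, preferring a node that stores the block.

    block_locations: dict of block id -> list of nodes holding that block
    free_slots: dict of node -> number of Map tasks it can still accept
    Returns a dict of block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, holders in block_locations.items():
        # First choice: a node that already stores this block.
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fallback: any node with a free slot; the block must be
            # read over the network in this case.
            remote = [n for n, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node, is_local = remote[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


blocks = split_into_blocks(300)
print(blocks)  # [128, 128, 44]

# Hypothetical placement: b0 and b2 live on node1, b1 lives on node2.
locations = {"b0": ["node1"], "b1": ["node2"], "b2": ["node1"]}
plan = assign_map_tasks(locations, {"node1": 1, "node2": 1, "node3": 1})
print(plan)
```

In this run, b0 and b1 are processed on the nodes that store them, while b2 falls back to node3 because node1 has no free slot left. That is the "whenever possible" part of the allocation rule.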

Hadoop Cluster

Hadoop Daemons

Hadoop comprises five separate daemons:

  • Name Node: Holds the metadata for HDFS.
  • Secondary Name Node: Performs housekeeping functions for the Name Node. It is not a backup or hot standby for the Name Node.
  • Data Node: Stores the actual HDFS data blocks.
  • Job Tracker: Manages Map Reduce jobs and distributes individual tasks to the machines running the Task Tracker.
  • Task Tracker: Instantiates and monitors individual Map and Reduce tasks.

Each daemon runs in its own Java Virtual Machine (JVM).

No node on a real cluster will run all five daemons, although this is technically possible.

We can consider nodes to be in two different categories:

  • Master Nodes: Run the Name Node, Secondary Name Node and Job Tracker daemons.
  • Slave Nodes: Run the Data Node and Task Tracker daemons. Each slave node runs both of these daemons.

Basic cluster configuration


Master Nodes

  Job Tracker    Name Node      Secondary Name Node

Slave Nodes

  Data Node      Data Node      Data Node
  Task Tracker   Task Tracker   Task Tracker


On very small clusters, the Name Node, Job Tracker and Secondary Name Node can all reside on a single machine. It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes.

Each dotted box on the previous diagram represents a separate Java Virtual Machine (JVM).
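As a rough illustration of the layout rules above, the following Python sketch builds a machine-to-daemon plan for a cluster of a given size. It is not a Hadoop tool or configuration format: the function `plan_layout`, the machine names, and the exact threshold of 20 slave nodes (the text only gives a range of 20-30) are assumptions chosen for the example.

```python
MASTER_DAEMONS = ["Name Node", "Secondary Name Node", "Job Tracker"]
SLAVE_DAEMONS = ["Data Node", "Task Tracker"]


def plan_layout(num_slaves, split_masters_above=20):
    """Return a dict of machine name -> list of daemons to run on it.

    split_masters_above is an assumed cut-off: above it, each master
    daemon gets its own machine; at or below it, they share one.
    """
    layout = {}
    if num_slaves > split_masters_above:
        # Larger cluster: one machine per master daemon.
        for i, daemon in enumerate(MASTER_DAEMONS, start=1):
            layout["master%d" % i] = [daemon]
    else:
        # Very small cluster: all master daemons on a single machine.
        layout["master1"] = list(MASTER_DAEMONS)
    # Every slave node runs both the Data Node and the Task Tracker.
    for i in range(1, num_slaves + 1):
        layout["slave%d" % i] = list(SLAVE_DAEMONS)
    return layout


small = plan_layout(3)
large = plan_layout(40)

for machine, daemons in small.items():
    print(machine, "->", ", ".join(daemons))
print(len(large), "machines in the larger cluster")
```

Each entry in a machine's list stands for one daemon and therefore one JVM, so the small plan runs three JVMs on its single master machine and two on every slave.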



