Hadoop Architechture – What It Is And How It Is Important?
Apache Hadoop is a very famous and the most demandable open-source software framework. It is basically used for storage and large-scale processing of data-sets on clusters of commodity hardware. Talking about its structure, there are mainly five building blocks inside this runtime environment.
Hortonworks- the founder has already predicted that by the end of 2020, 75% of Fortune 2000 companies will be running 1000 node hadoop clusters in production. Hadoop- a tiny toy elephant in the big data room has become very popular when it comes to sort out the big data issues across the globe. There is a large amount of implementation of Hadoop in the production and very important when deployment and management challenges like scalability, flexibility and cost effectiveness take place. Those businesses that venture into enterprise adoption of Hadoop by business users or anyone else doesn’t have any knowledge on how a good hadoop architecture design should be and how actually a hadoop cluster works when it comes to the production.
There is a huge lack of knowledge among the IT professionals and which clearly leads to design of a hadoop cluster and make it very complex to run. Apache Hadoop prime reason to develop is to have a low–cost, redundant data store that would allow organizations to leverage big data analytics at economical cost so that the profit can be maximized.
Hadoop File System was developed so that distributed file system design can be developed. It always runs on commodity hardware and unlike other distributed systems, HDFS is highly fault tolerant and specifically designed to use low-cost hardware. HDFS is known for holding quite large amount of data and provides easier access. If you are looking for storing such huge data, the files are stored across multiple machines. The files in the machines are stored in redundant fashion to rescue the system from possible data losses in case of failure. HDFS is the best as it allows the applications available to parallel processing.
Features of HDFS
It is important to know the features of the HDFS to check how important it is and how it is contributing. The foremost reason is -It is suitable for the distributed storage and processing and Hadoop provides a command interface to interact with HDFS. The term namenode and datanode are always there to help users to easily check the status of cluster. HDFS always provides file permissions and authentication as well as streaming access to file system data is possible.
Hadoop Architecture Overview
The cluster is the set of host machines (nodes) and nodes may be partitioned in racks. Talking about YARN Infrastructure, which is also called as- Yet Another Resource Negotiator is the framework responsible for providing the computational resources, like- CPUs, memory, and so on and it is needed for application executions.
It also includes most important elements that are: the Resource Manager, which is known as the master and it completely knows where the slaves are located or best for the Rack Awareness. It also knows how many resources they have as well as it is best when it comes to run several services. In this the most important part is the Resource Scheduler, which decides how to assign the resources.
The Node Manager is the slave of the infrastructure. When it starts, it announces itself to the Resource Manager. Time to time, it sends heartbeats to the Resource Manager and each node offers some resources to the cluster. At the time of processing, the Resource Scheduler decides how to use this capacity: a Container is a fraction of the NM capacity and it is used by the client for running a program.
It is important to know more about HDFS Federation, which is the framework supports in offering permanent, reliable and distributed storage. This is typically used for storing inputs and output, but, ignores the intermediate ones. It is important to know that Amazon uses the Simple Storage Service (S3). Talking about the MapReduce Framework, it is the software layer implementing the MapReduce paradigm.
As Apache Hadoop offers a scalable, flexible and reliable distributed computing big data framework along with the storage capacity and local computing power by leveraging, thus, it becomes very important to go with. Hadoop always works with the Master Slave architecture in order to transformation and analysis of large data sets using Hadoop MapReduce paradigm. Another most important hadoop components that play a vital role in the Hadoop architecture are
- Hadoop Distributed File System
- Patterned after the UNIX file system
- Hadoop MapReduce Yet Another Resource Negotiator (YARN).
Hadoop skillset always requires a thoughtful knowledge of every layer in the hadoop stack right from the understanding of the various components in the hadoop architecture along with designing a hadoop cluster, performance tuning it and setting up the top chain responsible for data processing. As said, Hadoop follows a master slave architecture design for storing the data and processing the same with the help of HDFS and MapReduce respectively, works in a proper order for quick results. The master node for data storage is called hadoop HDFS, which is the NameNode and the master node for parallel processing of data using Hadoop MapReduce is the Job Tracker. Talking about the slave nodes in the hadoop architecture are the other machines in the Hadoop cluster, which functions to store data and perform complex computations. It is important to note while understanding the Hadoop Architecture, every slave node has a Task Tracker daemon and a DataNode that help in synchronizing the overall processes with the Job Tracker and NameNode respectively. In Hadoop architectural implementation the master or slave systems can be setup in the cloud or on-premise.
Role of Distributed Storage – HDFS in Hadoop Application Architecture Implementation
There is a great importance of the HDFS as it helps in splitting into multiple blocks and each is replicated within the Hadoop cluster. A block on HDFS is called a blob of data where we can find out the underlying file system with a default size of 64MB. This is the size of a block can be expanded up to 256 MB based on the requirements, needs and usage. Hadoop Distributed File System later stores the application data and file system metadata by keeping them separate on the dedicated servers. When it comes to the components of the Hadoop Distributed file system, it has got two important components, called- NameNode and DataNode. Those Applications or data are stored on servers referred to as DataNodes and file system metadata is stored on servers referred to as NameNode.
Hadoop Distributed File System reproduces the file content on multiple DataNodes so that replication factor can be done in order to ensure reliability of data. In order for better communication of NameNode and DataNode with each other it is important to use TCP based protocols. In order to make Hadoop architecture to be performed very efficient and effective, it is mandatory that HDFS must satisfy certain pre-requisites as follows:
- A-Z hard drives must have a high throughput.
- Good network speed is highly important to manage intermediate data transfer and block replications.
It is important to know the term NameNode in order to know more about the architecture of Hadoop. It is called NameNode as all the files and directories in the HDFS namespace are represented on the NameNode by Inodes. These Inodes is a house of various attributes like permissions, modification timestamp, access times, namespace quota and various others. NameNode also maps the entire file system structure into memory. In this, two files fsimage and edits are used for persistence during restarts. Fsimage file contains the Inodes and the list of blocks which define the metadata. We can use the edits file, when it comes to any modifications that have been performed on the content of the fsimage file.
Next is DataNode which is here to manage the state of an HDFS node as well as helps in interacting with the blocks. A DataNode performs most of the CPU intensive jobs including- semantic and language analysis, statistics and machine learning tasks, and I/O intensive jobs. It is important to know I/O intensive jobs, which are called for clustering, data import, data export, search, decompression, and indexing. A DataNode is very much required for I/O for data processing and transfer. In the beginning, every DataNode starts connecting with the NameNode and performs a handshake to verify the namespace ID and the software version of the DataNode. If in case, they both don’t match then the DataNode shuts down automatically. A DataNode also helps in verifying the block replicas in its ownership by sending a block report to the NameNode. Once, the DataNode registers, the first block report is sent. DataNode also known for sending heartbeat to the NameNode in every 3 seconds to confirm that the DataNode is operating well as well as the block replicas it hosts are available.
Goals of HDFS
Fault detection and recovery
As HDFS includes a large number of commodity hardware, there is a huge chance of the failure of components is frequent. HDFS have all the mechanisms which generally help to find quick and automatic fault as well as helps in recovering the same.
HDFS always have more than hundreds of nodes per cluster to manage the applications having huge datasets. Via the same applications run in the best manner and this really helps the businesses.
Hardware at data
In order to perform a requested task to be done effectively and efficiently, HDFS plays a very important role when the computation takes place near the data. If there is a huge datasets involved, it reduces the network traffic and increases the throughput.
How does the Hadoop MapReduce architecture work?
It is important to know how and when execution of a MapReduce job begins. Well, the job begins, when the client submits the job configuration to the Job Tracker which specifies the map, as well as it combines and reduces the functions along with the location for input and output data. Once the job configuration is received, the job tracker identifies the number of splits based on the input path and select Task Trackers based on their network vicinity to the data sources. Job Tracker is very efficient when it comes to a request to the selected Task Trackers. The processing of the Map phase begins once the Task Tracker extracts the input data from the splits. On the completion of the map task, the Task Tracker notifies the Job Tracker and when it is done, the Job Tracker notifies the selected Task Trackers to begin the reduce phase. Later, the Task Tracker reads the region files and sorts the key-value pairs for each key. The reduce function is then invoked which collects the aggregated values into the output file.
Hadoop Architecture Design And The Best Practices
- If you would like to make the best and great use of Hadoop Architecture, it is important to focus on good-quality commodity servers to make it cost efficient and flexible to scale out for complex business use cases. Once the commodity servers will be in a good condition, everything will be sorted out as works as the requested manner. Don’t know about the best configuration to go with? Well, Hadoop architecture can be developed using 6 core processors, 96 GB of memory and 1 0 4 TB of local hard drives. This is called an ideal configuration but can’t say it an absolute one.
- When it comes to faster and efficient processing of data, it will be the best idea if you move the processing in close proximity to the data instead of separating the two. This is an ideal process will surely help to get efficient processing.
- Hadoop always performs better if you use the local drives only. Therefore, you can plan to opt only- Just a Bunch of Disks (JBOD) with replication and just ignore redundant array of independent disks (RAID).
- One can plan to design the Hadoop architecture for multi-tenancy by sharing the compute capacity with capacity scheduler and share HDFS storage. It will be good if you don’t edit the metadata files as it can corrupt the state of the Hadoop cluster.
About Hadoop Online Training @ BigClasses
BigClasses is one of the best online training organizations offer Hadoop Online training. We have qualified and experienced faculties who are responsible for taking the online sessions. We provide study materials and 24 hours support to our national and international learners as well. If you are interested in Hadoop online training, contact us for the detailed course and the free demo classes.