In this Hadoop tutorial we’ll talk about data locality in Hadoop, how data locality helps in running the job faster and saves cluster bandwidth.
Data locality in Hadoop
When a file is stored in HDFS it is divided into blocks of 128 MB (Default block size) and these blocks are stored on different nodes across the cluster. These HDFS blocks are also replicated as per the replication factor (Default is 3). Even at the time of creating replicas Hadoop takes the cluster topology into consideration and tries to honor the data locality.
- Refer HDFS Replica Placement Policy for details.
When a MapReduce job is started to process a file in Hadoop, MapReduce job calculates the input splits for the job, by default input split size is same as HDFS block size i.e. 128 MB. Hadoop framework creates as many map tasks as there are input splits on the job.
For example – There is a 1 GB file which is stored as 8 HDFS blocks of 128 MB each. If MapReduce job has to process this file, it will calculate that there are 8 input splits, Hadoop framework will start 8 map tasks to process these 8 input splits.
Now what makes more sense sending the map tasks, which will be few KBs in most cases, to the node where data is residing (128 MB block which map task has to process) or transferring the data across to the network where Map task is started? Don’t forget that there are 8 Map tasks and all of them will want their split data which means a lot of pressure on bandwidth if all of that data is transferred across nodes to their respective map tasks.
To avoid this Hadoop framework does the smart thing known as “data locality optimization”, rather than bringing data to computation it sends computation to data. Hadoop tries to run the Map tasks on the same nodes where the split data resides in HDFS thus making the task data local.
Task execution in YARN
When the application master requests containers for map tasks from ResourceManager data locality is also considered. Scheduler tries to allocate container on the node where the data resides so that the task is data local. But that is not possible always as there may not be enough resources available on the node where data resides to run a map task that brings us to the topic of levels of proximity between map task and data.
Map task and data proximity categories
Data locality in Hadoop can be categorized into 3 categories based on the proximity between the Map task and the data.
- Data local – If map task runs on the same node where data resides that is the optimal case and known as data local.
- Rack local – If a map task run on the same rack though not on the same node where the split resides that is known as rack local.
- Different rack – If map task can’t run on the same node, not even on the same rack then map task has to get the data it has to process from different rack. This is the least preferred scenario.
That’s all for the topic What is Data Locality in Hadoop. If something is missing or you have something to share about the topic please write a comment.
You may also like