Wednesday, July 6, 2011

BigData: How data is distributed in Hadoop HDFS (Hadoop Distributed File System)

The Apache Hadoop framework uses the Google MapReduce model and the Google File System design. In Hadoop, data is split into chunks and distributed across all the nodes in the cluster. This concept is inherited from the Google File System; in Hadoop it is called HDFS (Hadoop Distributed File System). While loading data into HDFS, Hadoop starts distributing it across the nodes based on a few parameters. Here we will look at two important parameters to consider for better performance.

1. Chunk size (dfs.block.size, in bytes) - 64 MB, 128 MB, 256 MB or 512 MB. It is preferable to choose the size based on the amount of input data to be processed and the power of each node.

2. Replication Factor (dfs.replication=3) - by default it is 3, meaning each chunk of data will be available on 3 nodes, i.e. 3 times around the cluster. If there is a high chance of node failure, it is better to increase the replication factor. The need for data replication is this: if any node in the cluster fails, the data on that node cannot be processed, so we would not get a complete result. A sketch showing how to set both parameters from client code follows this list.
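If you want to experiment with these two settings from client code rather than in hdfs-site.xml, here is a minimal sketch using the Hadoop Java API. It uses the old-style property names from this post (dfs.block.size, dfs.replication); the 128 MB value and the /user/demo/input.txt path are placeholders chosen for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        // Client-side overrides; cluster-wide defaults normally live in hdfs-site.xml.
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB chunk size
        conf.setInt("dfs.replication", 3);                  // keep 3 copies of each chunk

        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/demo/input.txt")); // placeholder path
        out.writeBytes("sample record\n");
        out.close();
        fs.close();
    }
}

Settings passed this way apply only to files written by this client; the values in hdfs-site.xml remain the cluster-wide defaults.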


For example, to process 1 TB of data with a 1000-node cluster: 1 TB (1024 GB) * 3 (replication factor) = 3072 GB of data will be stored across the 1000-node cluster. We can specify the chunk size based on each node's capability; if a node has more than 2 GB of memory (RAM), then a 512 MB chunk size can be specified, so the TaskTracker on a node will process one chunk at a time. If it is a dual-core processor, one node can process 2 chunks at the same time. So specify the chunk size based on the memory available on each node. It is recommended not to use the NameNode (master) as a DataNode as well, otherwise that single node gets overloaded with the work of both the TaskTracker and the JobTracker.
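To make the arithmetic above concrete, here is a small stand-alone Java sketch that works out the replicated storage and the number of chunks for the 1 TB / 512 MB / replication-3 example described in this paragraph.

public class HdfsSizingExample {
    public static void main(String[] args) {
        long inputBytes = 1024L * 1024 * 1024 * 1024; // 1 TB of input data
        long blockBytes = 512L * 1024 * 1024;         // 512 MB chunk size
        int replication = 3;                          // dfs.replication

        long storedBytes = inputBytes * replication;              // bytes kept across the cluster
        long chunks = (inputBytes + blockBytes - 1) / blockBytes; // chunks of the original data
        long chunkCopies = chunks * replication;                  // chunk copies placed on DataNodes

        System.out.println("Stored GB    : " + storedBytes / (1024L * 1024 * 1024)); // 3072
        System.out.println("Chunks       : " + chunks);                              // 2048
        System.out.println("Chunk copies : " + chunkCopies);                         // 6144
    }
}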

 

Will that data be distributed equally among the Hadoop cluster's nodes?

No, it is not distributed as exactly 3 GB per node. Some nodes will have 8 GB of data, others 5 GB or 1 GB, and so on. But a node will always hold complete chunks; a chunk will not be split with half here and half there.
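You can check the placement yourself. The sketch below lists which DataNodes hold each chunk of a file using the Hadoop FileSystem.getFileBlockLocations API; the file path is just a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockPlacement {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt")); // placeholder path

        // One BlockLocation per chunk; getHosts() lists the DataNodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + java.util.Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}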

In upcoming posts we will look at more Hadoop parameters to improve cluster performance. If you like this post, please click the +1 button below to recommend this page and click the 'Like' button to get updates on Facebook (only once a week).

4 comments:

  1. Thanks for posting these tips.
    They were very important.
    Shermi

  2. Why is the replication factor 3 by default? Is there any specific reason?

    Replies
    1. The replication factor should be more than one to avoid a single point of failure. Consider 100 MB of data distributed across 10 nodes with exactly one copy (i.e. replication factor 1), 10 MB per node. If one node fails, MapReduce cannot give a complete result for any job; it can only process the remaining 90 MB of data.

      Now a new question arises: why isn't the replication factor 2 then? Of course, that is a fair question. When a request comes in to process a particular data chunk and the node holding it is busy processing another chunk, instead of waiting for that node to finish, the work can go to another node that holds the same data chunk. With a replication factor of 3, if one node is busy, Hadoop has 2 more nodes available with the same data chunk to process.

      You can go for a replication factor of 2; it will reduce disk usage, but a replication factor of 1 is not recommended. I hope I have explained this well enough to answer your question.
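      As a rough sketch (the path and the factor of 2 are just placeholders), the replication of an existing file can also be changed afterwards with the Hadoop FileSystem.setReplication API:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ChangeReplication {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              // Ask the NameNode to keep only 2 copies of this (placeholder) file from now on.
              fs.setReplication(new Path("/user/demo/input.txt"), (short) 2);
              fs.close();
          }
      }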
