Wednesday, July 6, 2011

BigData: How data is distributed in Hadoop HDFS (Hadoop Distributed File System)

The Apache Hadoop framework uses the Google MapReduce model and the Google File System design. In Hadoop, data is split into chunks and distributed across all the nodes in the cluster. This concept is inherited from the Google File System; in Hadoop it is called HDFS (Hadoop Distributed File System). While loading data into HDFS, Hadoop starts distributing it across the nodes based on a few parameters. Here we will look at two important parameters to consider for better performance.

1. Chunk size (dfs.block.size, in bytes) - 64 MB, 128 MB, 256 MB or 512 MB. It is preferable to choose the size based on the amount of input data to be processed and the power of each node.

2. Replication Factor (dfs.replication=3) - by default it is 3, meaning each chunk of data will be available on 3 nodes, i.e. 3 times around the cluster. If there is a high chance of node failure, it is better to increase the replication factor. The need for data replication is this: if any node in the cluster fails, the data on that node cannot be processed, so we would not get a complete result. A sketch showing how to set both parameters from client code follows this list.
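If you want to experiment with these two settings from client code rather than in hdfs-site.xml, here is a minimal sketch using the Hadoop Java API. It uses the old-style property names from this post (dfs.block.size, dfs.replication); the 128 MB value and the /user/demo/input.txt path are placeholders chosen for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        // Client-side overrides; cluster-wide defaults normally live in hdfs-site.xml.
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB chunk size
        conf.setInt("dfs.replication", 3);                  // keep 3 copies of each chunk

        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/demo/input.txt")); // placeholder path
        out.writeBytes("sample record\n");
        out.close();
        fs.close();
    }
}

Settings passed this way apply only to files written by this client; the values in hdfs-site.xml remain the cluster-wide defaults.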


For example, to process 1 TB of data with a 1000-node cluster: 1 TB (1024 GB) * 3 (replication factor) = 3072 GB of data will be stored across the 1000-node cluster. We can specify the chunk size based on each node's capability; if a node has more than 2 GB of memory (RAM), then a 512 MB chunk size can be specified, so the TaskTracker on a node will process one chunk at a time. If it is a dual-core processor, one node can process 2 chunks at the same time. So specify the chunk size based on the memory available on each node. It is recommended not to use the NameNode (master) as a DataNode as well, otherwise that single node gets overloaded with the work of both the TaskTracker and the JobTracker.
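To make the arithmetic above concrete, here is a small stand-alone Java sketch that works out the replicated storage and the number of chunks for the 1 TB / 512 MB / replication-3 example described in this paragraph.

public class HdfsSizingExample {
    public static void main(String[] args) {
        long inputBytes = 1024L * 1024 * 1024 * 1024; // 1 TB of input data
        long blockBytes = 512L * 1024 * 1024;         // 512 MB chunk size
        int replication = 3;                          // dfs.replication

        long storedBytes = inputBytes * replication;              // bytes kept across the cluster
        long chunks = (inputBytes + blockBytes - 1) / blockBytes; // chunks of the original data
        long chunkCopies = chunks * replication;                  // chunk copies placed on DataNodes

        System.out.println("Stored GB    : " + storedBytes / (1024L * 1024 * 1024)); // 3072
        System.out.println("Chunks       : " + chunks);                              // 2048
        System.out.println("Chunk copies : " + chunkCopies);                         // 6144
    }
}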

 

Will that data be distributed equally among the Hadoop cluster's nodes?

No, it is not distributed as exactly 3 GB per node. Some nodes will have 8 GB of data, others 5 GB or 1 GB, and so on. But a node will always hold complete chunks; a chunk will not be split with half here and half there.
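You can check the placement yourself. The sketch below lists which DataNodes hold each chunk of a file using the Hadoop FileSystem.getFileBlockLocations API; the file path is just a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockPlacement {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt")); // placeholder path

        // One BlockLocation per chunk; getHosts() lists the DataNodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + java.util.Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}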

In upcoming posts we will look at more Hadoop parameters to improve cluster performance. If you like this post, please click the +1 button below to recommend this page and click the 'Like' button to get updates on Facebook (only once a week).

4 comments:

  1. Thanks for posting these tips.
    They were very important.
    Shermi

  2. Why is the replication factor 3 by default? Is there any specific reason?

    Replies
    1. The replication factor should be more than one to avoid a single point of failure. Consider 100 MB of data distributed across 10 nodes with exactly one copy (i.e. replication factor 1), 10 MB per node. If one node fails, MapReduce cannot give a complete result for any job; it can only process the remaining 90 MB of data.

      Now a new question arises: why isn't the replication factor 2 then? Of course, that is a fair question. When a request comes in to process a particular data chunk and the node holding it is busy processing another chunk, instead of waiting for that node to finish, the work can go to another node that holds the same data chunk. With a replication factor of 3, if one node is busy, Hadoop has 2 more nodes available with the same data chunk to process.

      You can go for a replication factor of 2; it will reduce disk usage, but a replication factor of 1 is not recommended. I hope I have explained this well enough to answer your question.
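      As a rough sketch (the path and the factor of 2 are just placeholders), the replication of an existing file can also be changed afterwards with the Hadoop FileSystem.setReplication API:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ChangeReplication {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());
              // Ask the NameNode to keep only 2 copies of this (placeholder) file from now on.
              fs.setReplication(new Path("/user/demo/input.txt"), (short) 2);
              fs.close();
          }
      }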
