Saturday, July 16, 2011

Hadoop Performance Tuning (Hadoop-Hive)

Hadoop cluster performance tuning can be tricky, because the Hadoop framework uses every type of resource while processing and analyzing data, so there is no single set of parameter values that performs well everywhere. Parameter values should be chosen based on the following properties of the cluster (a small sizing sketch follows the list):
  • Operating system
  • Processor type and number of cores per node
  • Memory (RAM) per node
  • Number of nodes in the cluster
  • Storage capacity of each node
  • Network bandwidth
  • Amount of input data
  • Number of jobs in the business logic
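
For example, the number of cores and the amount of RAM per node drive how many map and reduce slots each TaskTracker should run. The following is a minimal Java sketch of one common rule of thumb; the per-node core count, RAM, and the memory reserved for the OS and Hadoop daemons are illustrative assumptions, not values taken from a real cluster.

    // Rough estimate of per-node task slots from the hardware items listed above.
    public class SlotEstimator {
        public static void main(String[] args) {
            int coresPerNode = 8;     // assumption: 8 cores per worker node
            int ramPerNodeGb = 16;    // assumption: 16 GB RAM per worker node
            int heapPerTaskGb = 1;    // typical child JVM heap (mapred.child.java.opts)

            // Leave ~2 cores for the DataNode and TaskTracker daemons,
            // then split the remaining cores between map and reduce slots.
            int usableCores = Math.max(1, coresPerNode - 2);
            int mapSlots = (int) Math.ceil(usableCores * 0.67);
            int reduceSlots = Math.max(1, usableCores - mapSlots);

            // Make sure the slots also fit in memory (reserve ~4 GB for OS/daemons).
            int memoryBound = (ramPerNodeGb - 4) / heapPerTaskGb;

            System.out.println("mapred.tasktracker.map.tasks.maximum    = "
                    + Math.min(mapSlots, memoryBound));
            System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
                    + Math.min(reduceSlots, memoryBound));
        }
    }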

The recommended OS for Hadoop clusters is Linux; Windows and other GUI-based operating systems run many graphical user interface (GUI) processes that occupy a large share of memory.

Each node should have at least 5 GB of storage to spare after the distributed HDFS input data is stored. For example, with 1 TB of input data on a 1000-node cluster, (1024 GB x 3 (replication factor)) / 1000 nodes ≈ 3 GB of distributed data per node, so it is recommended to have at least 8 GB of storage on each node, because every data node also writes logs and needs some space for memory swapping.
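
The arithmetic above can be captured in a few lines; this sketch simply restates the example (1 TB input, replication factor 3, 1000 nodes, 5 GB of headroom for logs and swap):

    // Per-node storage estimate for the example in the paragraph above.
    public class StorageEstimate {
        public static void main(String[] args) {
            double inputGb = 1024;   // 1 TB of input data
            int replication = 3;     // dfs.replication
            int nodes = 1000;        // data nodes in the cluster
            double headroomGb = 5;   // logs, temp files, swap space

            double perNodeDataGb = (inputGb * replication) / nodes; // ~3 GB
            System.out.printf("HDFS data per node : %.0f GB%n", perNodeDataGb);
            System.out.printf("Recommended disk   : %.0f GB%n", perNodeDataGb + headroomGb);
        }
    }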

Thursday, July 14, 2011

Big Data with Cloud Computing (Amazon Web Services)

As Big Data projects require a huge amount of resources, cloud computing helps avoid the headache of maintaining those resources; this was discussed in more detail in the previous post. Here we look at big data with the cloud provider Amazon Web Services (AWS).

Three major resources are required for any type of computing: processor (CPU), memory (RAM), and storage (hard disk). The amount of each resource a project requires varies, and big data projects in particular need more of all three to process huge amounts of data.

Amazon provides Elastic Compute Cloud (EC2) instances. An instance is similar to a single desktop machine, except that it runs in the cloud and the user computes over the network. EC2 instances come in several types, from Micro up to very large, and a user can create any number of them. An EC2 Small instance has 1.7 GB of memory, 160 GB of local storage, and one 32-bit virtual core. Users who need more power in a single instance can go up to the EC2 Cluster GPU instance, which has 22 GB of memory, 33.5 EC2 Compute Units (64-bit), and 1680 GB of local storage.
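
As a concrete illustration, the snippet below launches one Small instance with the AWS SDK for Java. The AMI ID, access key, and secret key are placeholders; "m1.small" and "cg1.4xlarge" are the API names for the Small and Cluster GPU types mentioned above.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.RunInstancesRequest;
    import com.amazonaws.services.ec2.model.RunInstancesResult;

    // Launches one EC2 Small instance; swap "m1.small" for "cg1.4xlarge"
    // (Cluster GPU) when a single, more powerful instance is needed.
    public class LaunchInstance {
        public static void main(String[] args) {
            AmazonEC2Client ec2 = new AmazonEC2Client(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")); // placeholders

            RunInstancesRequest request = new RunInstancesRequest()
                    .withImageId("ami-xxxxxxxx")   // placeholder AMI ID
                    .withInstanceType("m1.small")
                    .withMinCount(1)
                    .withMaxCount(1);

            RunInstancesResult result = ec2.runInstances(request);
            System.out.println("Launched: "
                    + result.getReservation().getInstances().get(0).getInstanceId());
        }
    }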

Wednesday, July 6, 2011

Big Data: How data is distributed in Hadoop HDFS (Hadoop Distributed File System)

The Apache Hadoop framework uses the Google MapReduce model and the Google File System design. In Hadoop, data is split into chunks and distributed across all nodes in the cluster. This concept is inherited from the Google File System; in Hadoop it is called HDFS (Hadoop Distributed File System). While data is loaded into HDFS, it is distributed to all nodes based on a few parameters. Here we look at two important parameters to consider for better performance.
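
For reference, loading a file into HDFS can be done from the command line (hadoop fs -put) or programmatically. The Java sketch below uses the FileSystem API; the NameNode address and the paths are placeholder assumptions. Once the file is written, it is split into blocks and the replicas are spread across the data nodes according to the two parameters described next.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies a local file into HDFS; block placement and replication
    // then follow the parameters described below.
    public class HdfsLoad {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder NameNode URI
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path("/local/data/input.log"),  // local source (placeholder)
                                 new Path("/user/hadoop/input/"));   // HDFS destination
            fs.close();
        }
    }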

1. Chunk size (dfs.block.size, in bytes) - commonly 64 MB, 128 MB, 256 MB, or 512 MB. It is preferable to choose the size based on the amount of input data to be processed and the power of each node.

2. Replication factor (dfs.replication) - 3 by default, which means each block of data is stored on 3 nodes (3 copies across the cluster). If the chance of node failure is high, it is better to increase the replication factor. Replication is needed because, if a node fails and its data exists nowhere else, that data cannot be processed and the job will not produce a complete result. Both parameters can also be overridden per client or per job, as in the sketch below.
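
Both parameters normally live in hdfs-site.xml; a minimal sketch of overriding them through the Configuration object (the values here are just examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Overrides chunk size and replication factor for files written by
    // this client; cluster-wide defaults stay in hdfs-site.xml.
    public class HdfsTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB chunks (example)
            conf.setInt("dfs.replication", 3);                  // keep 3 copies of each block
            FileSystem fs = FileSystem.get(conf);
            // ... files written through fs pick up these settings ...
            fs.close();
        }
    }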