Thursday, September 22, 2011

Hadoop Performance Tuning (Hadoop-Hive) Part 3

[Note: This post is the third part of the Hadoop performance tuning series. If you reached this page directly, please click here for Part 1 and click here for Part 2.]


Before we look at more configuration parameters for performance tuning, I would like to ask you a question: have you ever observed the JobTracker and TaskTracker WebUI? There you can see that a lot of jobs and tasks get killed a few seconds or minutes before completion. Why is that? Have you ever thought about it? Of course, a few of you already know. Those who already know about this can skip the next paragraph.

[NOTE: To check the WebUI of a Hadoop cluster, open a browser and go to http://masternode-machineip(or)localhost:portnumber. The default port numbers and their configuration parameters are listed in the table below; you can change a port by setting the corresponding configuration parameter to the port number you want (see the sketch after the table).]



Name            Port     Configuration parameter
JobTracker      50030    mapred.job.tracker.http.address
TaskTracker     50060    mapred.task.tracker.http.address
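For illustration, here is a minimal Java sketch of overriding those two parameters, assuming the Hadoop 0.20/1.x client jar is on the classpath; the ports 50031 and 50061 are just example values, not defaults. In practice the daemons read these settings from mapred-site.xml at startup, so the snippet only shows how the parameters are named, set and read in code:

    import org.apache.hadoop.conf.Configuration;

    public class WebUiPorts {
        public static void main(String[] args) {
            // Loads core-site.xml / mapred-site.xml from the classpath if present.
            Configuration conf = new Configuration();

            // Move the JobTracker and TaskTracker WebUIs to non-default ports.
            // "0.0.0.0" means "bind on all network interfaces".
            conf.set("mapred.job.tracker.http.address", "0.0.0.0:50031");
            conf.set("mapred.task.tracker.http.address", "0.0.0.0:50061");

            System.out.println(conf.get("mapred.job.tracker.http.address"));
            System.out.println(conf.get("mapred.task.tracker.http.address"));
        }
    }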

Monday, September 19, 2011

Hadoop with Hive

My new post about Hive has been published on LearnComputer.com. There I discuss how to use Hadoop as a backend resource with the help of Hive. Hive acts like an interface that accepts SQL-type queries (HQL), converts each HQL query into a MapReduce job and passes it to the Hadoop cluster for processing. Please click here for the link for more....
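As a rough illustration of that flow, here is a minimal Java sketch that submits an HQL query through the Hive JDBC driver. It assumes a Hive 0.x HiveServer listening on its default port 10000 and the org.apache.hadoop.hive.jdbc.HiveDriver class on the classpath; the table sample_table and its category column are hypothetical. Hive compiles the query into one or more MapReduce jobs and runs them on the cluster:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver (Hive 0.x class name).
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

            // Connect to a HiveServer running on the default port 10000.
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");

            Statement stmt = con.createStatement();

            // This HQL is compiled into a MapReduce job and run on the cluster.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM sample_table GROUP BY category");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }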

Please leave a comment and recommend this post by clicking the Facebook ‘Like’ button and ‘+1’ at the bottom of this page. By clicking the Like button you will get regular updates about my posts in your Facebook feed.

Tuesday, August 2, 2011

Hadoop Performance Tuning (Hadoop-Hive) Part 2


[Note: This post is the second part of the Hadoop performance tuning series. If you reached this page directly, please click here for Part 1.]

I tested these parameters with the Hadoop and Hive frameworks using SQL-based queries. To check the performance improvement from each configuration parameter, I used sample data of 100 million records and ran some complex queries through the Hive interface on top of Hadoop. In this Part 2 we will see a few more Hadoop configuration parameters that help to get the maximum performance improvement out of a Hadoop cluster.

Map Output Compression (mapred.compress.map.output)
By default this value is set to false. It is recommended to set this parameter to true for a cluster with a large amount of input data to be processed, because with compression the data transfer between nodes is faster. Map output does not move directly to the reducer; it is first written to local disk, so this setting also saves disk space and speeds up disk read/write. It is not recommended to set this parameter to true for a small amount of input data, because compressing and decompressing the data adds processing time. For big data, however, the compression and decompression time is small compared to the time it saves in data transfer and disk read/write.
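A minimal sketch of turning this on for a job, using the old (pre-mapreduce.*) property names discussed in this post and Hadoop's built-in GzipCodec; the codec choice is only an example, and the same values can of course be set once in mapred-site.xml instead of in code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class MapOutputCompression {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Compress the intermediate map output before it is spilled to disk
            // and shuffled across the network to the reducers.
            conf.setBoolean("mapred.compress.map.output", true);

            // Which codec to use for the compression (GzipCodec as an example).
            conf.setClass("mapred.map.output.compression.codec",
                          GzipCodec.class, CompressionCodec.class);

            System.out.println("mapred.compress.map.output = "
                    + conf.get("mapred.compress.map.output"));
        }
    }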

Saturday, July 16, 2011

Hadoop Performance Tuning (Hadoop-Hive)

Hadoop cluster performance tuning is a little hectic, because the Hadoop framework uses all types of resources for processing and analyzing data, so there is no single static set of parameter values that performs well everywhere. Parameter values should be changed based on the following characteristics of the cluster:
  • Operating system
  • Processor and its number of cores
  • Memory (RAM)
  • Number of nodes in the cluster
  • Storage capacity of each node
  • Network bandwidth
  • Amount of input data
  • Number of jobs in the business logic

The recommended OS for Hadoop clusters is Linux, because Windows and other GUI-based operating systems run a lot of GUI (graphical user interface) processes that occupy a large share of the memory.

Each node should have at least 5 GB of storage left over after storing its share of the distributed HDFS input data. For example, with 1 TB of input data on a 1000-node cluster, (1024 GB x 3 (replication factor)) / 1000 nodes = approximately 3 GB of distributed data per node, so it is recommended to have at least 8 GB of storage on each node, because each DataNode also writes logs and needs some space for memory swapping.
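A small worked version of that arithmetic, as a Java sketch; the 5 GB overhead for logs and swap is the figure assumed in this post, and the class and method names are just illustrative:

    public class StorageEstimate {
        // Distributed data per node plus a fixed overhead for logs and swap.
        static double perNodeStorageGb(double inputGb, int replicationFactor,
                                       int nodes, double overheadGb) {
            double distributedGb = (inputGb * replicationFactor) / nodes;
            return distributedGb + overheadGb;
        }

        public static void main(String[] args) {
            // 1 TB input, replication factor 3, 1000 nodes, 5 GB overhead.
            double needed = perNodeStorageGb(1024, 3, 1000, 5);
            System.out.printf("Approx. storage needed per node: %.1f GB%n", needed);
            // Prints roughly 8.1 GB, matching the "at least 8 GB" recommendation.
        }
    }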

Thursday, July 14, 2011

Big Data with Cloud computing (Amazon Web Service)

Since Big Data projects require a huge amount of resources, cloud computing helps to avoid the headache of resource maintenance; I already discussed this in more detail in a previous post. Here I discuss big data with the cloud provider Amazon Web Services (AWS).

The three major resources required for any type of computing are processor (CPU), memory (RAM) and storage (hard disk). The amount of each resource required varies by project; big data projects in particular need more of each to process huge amounts of data.

Amazon provides Elastic Compute Cloud (EC2) instances. An EC2 instance is similar to a single desktop machine, except that it lives in the cloud and the user computes over the network. EC2 instances are available in different types, from micro instances to very large instances, and a user can create any number of them. An EC2 small instance has 1.7 GB of memory, 160 GB of disk and one 32-bit compute unit. Those who require more power in a single instance can go up to the EC2 Cluster GPU instance, which has 22 GB of memory, 33.5 EC2 Compute Units (64-bit) and 1680 GB of disk.

Wednesday, July 6, 2011

BigData: How data is distributed in Hadoop HDFS (Hadoop Distributed File System)

The Apache Hadoop framework uses the Google MapReduce model and Google File System concepts. In Hadoop, data is split into chunks and distributed across all nodes in the cluster. This concept is inherited from the Google File System; in Hadoop we call it HDFS (Hadoop Distributed File System). While data is loaded into HDFS, it is distributed to the nodes based on a few parameters. Here we will see two important parameters to consider for better performance.

1. Chunk size (dfs.block.size, in bytes) - typically 64 MB, 128 MB, 256 MB or 512 MB. It is preferable to choose the size based on the amount of input data to be processed and the power of each node.

2. Replication factor (dfs.replication) - by default it is 3, which means each block is available on 3 nodes, i.e. 3 copies around the cluster. If there is a high chance of node failure, it is better to increase the replication factor. The need for data replication is that if a node in the cluster fails and its data is not replicated elsewhere, the data on that node cannot be processed and we will not get a complete result. (A sketch of setting both parameters follows below.)
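Here is a minimal Java sketch of setting both parameters programmatically, using the old-style property names mentioned above (dfs.block.size in bytes, dfs.replication). They are normally set once in hdfs-site.xml, and changing them only affects files written after the change:

    import org.apache.hadoop.conf.Configuration;

    public class HdfsTuning {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // 128 MB block size, expressed in bytes.
            conf.setLong("dfs.block.size", 128L * 1024 * 1024);

            // Keep 3 copies of every block across the cluster.
            conf.setInt("dfs.replication", 3);

            System.out.println("dfs.block.size  = " + conf.get("dfs.block.size"));
            System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        }
    }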

Thursday, June 30, 2011

Big Data with Cloud Computing

What is Big Data?


Big Data usually refers to processing/analysing huge amounts of data or data sets (terabytes, petabytes, etc. of data) that take a long time to process in RDBMS-type databases. Big Data projects use a lot of technologies and frameworks to process data. Google first introduced the MapReduce framework in 2004, and to this day Google uses MapReduce to index the whole WWW for the Google search engine. A few other frameworks used for Big Data are massively parallel processing (MPP) databases, data-mining grids, the Apache Hadoop framework, etc.

How is cloud computing related to Big Data? That is a big question.
To answer it, we just need to know how MapReduce works.

For example, consider a scenario where you have two tables with 1 TB of data, or roughly 1 billion records (1000 million), in each table. Querying these two tables with a complex join condition will take around 30 minutes (approximately); this may vary depending on your database server's capability. The MapReduce framework has a strategy to handle this situation. The strategy is simple: the big task is split up and given to multiple workers, so the task gets done sooner.
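To make that split-and-combine idea concrete, here is the classic word-count job written against the Hadoop 0.20/1.x Java API. It is a standard illustration rather than the two-table join described above, and the class names and input/output paths are just placeholders: the input is split into blocks, each mapper processes one split in parallel, and the reducers combine the partial results per key:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Each mapper works on its own input split, in parallel.
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                // Partial counts from all mappers are merged per key.
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");   // Hadoop 0.20-era constructor
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }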