Big Data (Hadoop) & Cloud Computing: August 2011

[Note: This post is the second part of Hadoop performance tuning. If you reached this page directly, please click here for part 1.]

I am testing these parameters with the Hadoop and Hive frameworks using SQL-based queries. To check the performance improvement from each configuration parameter, I use sample data of 100 million records and run some complex queries through the Hive interface on top of Hadoop. In this part 2 we will look at a few more Hadoop configuration parameters for getting the best performance out of a Hadoop cluster.

Map Output compression (mapred.compress.map.output)

By default this value is set to false. It is recommended to set this parameter to true for clusters that process a large amount of input data, because compressed data transfers faster between nodes. Map output does not move directly to the reducer; it is first written to disk as intermediate data, so this setting also saves disk space and speeds up disk reads and writes. It is not recommended to set this parameter to true when only a small amount of input data is processed, because compressing and decompressing the data adds processing time. For big data, however, the compression and decompression time is small compared to the time it saves in data transfer and disk read/write.
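To make the trade-off above concrete, here is a minimal Python sketch. It does not use Hadoop at all: it uses the standard zlib module as a stand-in for whatever codec the cluster is configured with, and the record layout and the function names make_map_output and compression_summary are made up for illustration. It only shows that repetitive, record-like data shrinks considerably when compressed, at the cost of some CPU time.

```python
import time
import zlib


def make_map_output(num_records: int) -> bytes:
    """Build fake tab-separated records, loosely resembling intermediate map output.

    The field names and values are hypothetical; real map output depends on the job.
    """
    lines = [f"user_{i % 1000}\t{i % 7}\tpage_view\n" for i in range(num_records)]
    return "".join(lines).encode("utf-8")


def compression_summary(payload: bytes) -> dict:
    """Compress and decompress the payload, reporting sizes and elapsed time."""
    start = time.perf_counter()
    compressed = zlib.compress(payload)
    restored = zlib.decompress(compressed)
    elapsed = time.perf_counter() - start
    return {
        "raw_bytes": len(payload),
        "compressed_bytes": len(compressed),
        "ratio": len(compressed) / len(payload),
        "codec_seconds": elapsed,          # CPU cost paid for compression
        "lossless": restored == payload,   # round trip must give back the input
    }


if __name__ == "__main__":
    # Compare a tiny payload with a larger one.
    for n in (10, 100_000):
        summary = compression_summary(make_map_output(n))
        print(n, summary)
```

The numbers this prints depend on the machine and the data, so treat them as a rough indication only. On a real cluster the behaviour is controlled by the mapred.compress.map.output parameter discussed above, and the actual gain depends on the codec and on how compressible the map output is.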


Tuesday, August 2, 2011

Hadoop Performance tuning (Hadoop-Hive) Part 2