Big Data (Hadoop) & Cloud Computing: Hadoop Performance tuning (Hadoop-Hive) Part 2

[Note: This post is the second part of the Hadoop performance tuning series. If you reached this page directly, please click here for part 1.]

I am testing these parameters with Hadoop and the Hive framework using SQL-based queries. To check the performance improvement from each configuration parameter, I use sample data of 100 million records and run some complex queries through the Hive interface on top of Hadoop. In this part 2 we will look at a few more Hadoop configuration parameters that can help get the most performance out of a Hadoop cluster.

Map output compression (mapred.compress.map.output)

By default this value is set to false. It is recommended to set this parameter to true for clusters that process a large amount of input data, because compressed data transfers between nodes faster. Map output does not move directly to the reducer; it is first written to disk. So this setting also saves disk space and speeds up disk reads and writes. It is not recommended to set this parameter to true when only a small amount of input data is processed, because compressing and decompressing the data adds processing time. For big data, however, the compression and decompression time is small compared to the time saved in data transfer and disk read/write.
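The small-versus-large trade-off can be illustrated outside Hadoop. The following is a minimal sketch that uses Python's zlib module as a stand-in for a Hadoop codec; the sample records and the `compression_ratio` helper are made up for illustration and are not part of Hadoop or Hive.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# A tiny payload: the fixed header and checksum overhead dominates.
tiny = b"1,alice\n"

# A larger payload of similar, repetitive records (hypothetical sample data).
large = b"".join(b"%d,user%d,click\n" % (i, i % 100) for i in range(100_000))

tiny_ratio = compression_ratio(tiny)
large_ratio = compression_ratio(large)

print(f"tiny:  {len(tiny)} bytes, ratio {tiny_ratio:.2f}")
print(f"large: {len(large)} bytes, ratio {large_ratio:.2f}")
```

The tiny payload grows when compressed, while the large one shrinks substantially. This only shows the size side of the trade-off; the CPU cost depends on the codec and the hardware.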

Once the above parameter is set to true, the dependent parameters take effect, such as the compression technique (codec) and the compression type.

Compression method or codec (mapred.map.output.compression.codec)

The default value for this parameter is org.apache.hadoop.io.compress.DefaultCodec. Another available codec is org.apache.hadoop.io.compress.GzipCodec. DefaultCodec takes more time but compresses more. LZO, which is typically installed as a separate library, takes less time but compresses less. You can also add your own codec. Choose the codec or compression library that best suits your input data type.
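To get a feel for the speed-versus-size trade-off between codecs, here is a rough sketch. LZO is not in the Python standard library, so zlib at level 1 stands in for a "fast, lighter" codec and zlib at level 9 for a "slower, stronger" one. That substitution, the `measure` helper, and the sample records are all assumptions for illustration; real codec behavior on a cluster will differ.

```python
import time
import zlib
from typing import Tuple


def measure(data: bytes, level: int) -> Tuple[int, float]:
    """Compress data at the given zlib level; return (size, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Verify the round trip so the comparison is between valid outputs.
    assert zlib.decompress(compressed) == data
    return len(compressed), elapsed


# Hypothetical sample records standing in for map output.
records = b"".join(
    b"%d,user%d,page%d\n" % (i, i % 1000, i % 37) for i in range(100_000)
)

print(f"original: {len(records)} bytes")
for label, level in (("fast (level 1)", 1), ("strong (level 9)", 9)):
    size, seconds = measure(records, level)
    print(f"{label}: {size} bytes in {seconds:.3f} s")
```

Running this on a sample of your own data gives a quick first impression before you try the actual codecs on the cluster.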

The mapred.map.output.compression.type parameter determines the unit in which data is compressed. The user can set either RECORD or BLOCK. RECORD is the default type, in which each individual value is compressed on its own. BLOCK is the recommended type, in which groups of key-value pairs are compressed together as a block, which usually gives a better compression ratio. In Cloudera Hadoop, the default type is set to BLOCK for better performance.
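The difference between the two types can be sketched in a simplified way. The example below again uses zlib as a stand-in codec and ignores keys, sync markers and the other details of Hadoop's real file formats; the helper names and sample values are hypothetical.

```python
import zlib
from typing import List


def record_compressed_size(values: List[bytes]) -> int:
    """RECORD-style: compress every value on its own and sum the sizes."""
    return sum(len(zlib.compress(value)) for value in values)


def block_compressed_size(values: List[bytes]) -> int:
    """BLOCK-style: compress a whole batch of values as one stream."""
    return len(zlib.compress(b"".join(values)))


# Hypothetical short values, similar to one another as log-like data often is.
values = [
    b"user%d visited /products/item/%d" % (i % 50, i % 20)
    for i in range(10_000)
]

raw_size = sum(len(value) for value in values)
record_size = record_compressed_size(values)
block_size = block_compressed_size(values)

print(f"raw:    {raw_size} bytes")
print(f"RECORD: {record_size} bytes")
print(f"BLOCK:  {block_size} bytes")
```

With short values, compressing each one separately cannot exploit the redundancy between records, which is why block-style compression usually comes out smaller.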

There are three more configuration parameters:

1. mapred.output.compress

2. mapred.output.compression.type

3. mapred.output.compression.codec

The same rules apply here, but these parameters are meant for the MapReduce job output, whereas the first three parameters control compression of the map output alone. These three configuration parameters specify whether the final job output should be compressed and, if so, with which type and codec.
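In a Hive session, Hadoop parameters like these can be set per session with SET statements. The sketch below just renders such statements from a dictionary. The `hive_set_statements` helper is hypothetical, and the values (GzipCodec, BLOCK) are example choices rather than recommendations. Hive also has its own switches, such as hive.exec.compress.intermediate and hive.exec.compress.output, which may take precedence depending on the Hive version, so check the effective settings on your cluster.

```python
from typing import Dict, List


def hive_set_statements(settings: Dict[str, str]) -> List[str]:
    """Render each name/value pair as a Hive 'SET name=value;' statement."""
    return [f"SET {name}={value};" for name, value in settings.items()]


# Example values only; pick the codec and type that suit your data.
compression_settings = {
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.GzipCodec",
    "mapred.output.compress": "true",
    "mapred.output.compression.type": "BLOCK",
    "mapred.output.compression.codec":
        "org.apache.hadoop.io.compress.GzipCodec",
}

for statement in hive_set_statements(compression_settings):
    print(statement)
```

The printed statements can be pasted at the top of a Hive script, which makes it easy to run the same query with and without compression and compare timings.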

More configuration parameters for Hadoop and Hive performance tuning will be discussed in upcoming posts.

Hadoop performance tuning part 3 >> Click Here

The suggestions above are based on observations from a Hadoop cluster queried through Hive. Please leave a comment.

7 comments:

Anonymous, August 4, 2011 at 3:53 AM
Good summary. Please note that compression might not always work better than uncompressed map output, since there is always an overhead for compression and decompression. Only when the data transfer volumes are high does that overhead become negligible in comparison.
Also, Google's compression library, Snappy, seems to be emerging as a more performant compression/decompression library in some cases.
VenkataHari Shankar, August 4, 2011 at 6:14 AM
Thanks indoos (Sanjay Sharma). I hope I mentioned that it is not recommended when only a small amount of input data is processed, but your comment delivers the message clearly and enriches my post. Thanks once again.
Sourav Mazumder, August 5, 2011 at 7:45 AM
To add to Indoos' comment: it all depends on whether your workload is CPU bound, memory bound or disk bound, and the way you want to (or are limited to) optimize it. If it is already CPU bound, compression and decompression will not give you much benefit.

Tuesday, August 2, 2011
