Big Data (Hadoop) & Cloud Computing: 2011

Thursday, September 22, 2011

Hadoop performance tuning (Hadoop-Hive) part -3

[Note: This post is the third part of Hadoop performance tuning. If you reached this page directly, please click here for part 1 and click here for part 2.]

Before looking at more configuration parameters for performance tuning, I would like to ask you a question. Have you ever watched the job tracker and task tracker web UI? There you can see a lot of jobs being killed after a few seconds or minutes, before they complete. Why is that? Have you ever thought about it? Of course, a few of you already know. Those who already know about this, please skip the next paragraph.

[NOTE: To check the web UI of a Hadoop cluster, open a browser and go to http://masternode-machineip:portnumber (or http://localhost:portnumber). The port number can also be changed by setting the configuration parameter listed below to the port we want.]

Name	Port	Configuration parameter
Jobtracker	50030	`mapred.job.tracker.http.address`
Task trackers	50060	`mapred.task.tracker.http.address`
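
As a minimal sketch, the default ports in the table can be assembled into web UI URLs like this. The helper name `webui_urls` and the host values are hypothetical; only the two default ports come from the table above.

```python
# Default web UI ports, taken from the table above.
DEFAULT_PORTS = {"jobtracker": 50030, "tasktracker": 50060}


def webui_urls(host="localhost", ports=DEFAULT_PORTS):
    """Return the web UI URL of each daemon on the given host (hypothetical helper)."""
    return {name: f"http://{host}:{port}" for name, port in ports.items()}


if __name__ == "__main__":
    for name, url in webui_urls().items():
        print(name, url)
```

If the ports were changed through the configuration parameters, pass a different `ports` mapping instead of relying on the defaults.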

Hadoop with Hive

My new post about Hive is published on LearnComputer.com. There I discuss how to use Hadoop as a back-end resource with the help of Hive. Hive acts as an interface that accepts SQL-like queries (HQL), converts each HQL query into a MapReduce job and passes it to the Hadoop cluster for processing. Please click here to follow the link for more.


Tuesday, August 2, 2011

Hadoop Performance tuning (Hadoop-Hive) Part 2

[Note: This post is the second part of Hadoop performance tuning. If you reached this page directly, please click here for part 1.]

I am testing these parameters with Hadoop and the Hive framework, using SQL-based queries. To check the performance improvement from each configuration parameter, I use sample data of 100 million records and run some complex queries through the Hive interface on top of Hadoop. In this part 2 we will see a few more Hadoop configuration parameters for getting the maximum performance improvement from a Hadoop cluster.

Map Output compression ( mapred.compress.map.output )

By default this value is set to false. It is recommended to set this parameter to true on clusters that process a large amount of input data, because compressed data moves faster between nodes. Map output does not go directly to the reducer; it is first written to disk as intermediate data. So this setting also saves disk space and speeds up disk reads and writes. It is not recommended to set this parameter to true when only a small amount of input data is processed, because compressing and decompressing the data adds processing time. For big data, however, the compression and decompression time is small compared with the time saved in data transfer and disk reads and writes.
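
As a rough sketch, the setting can be written as a property entry in the usual Hadoop site-file layout. The Python helper names below are hypothetical and only generate the XML text; which site file the entry belongs in depends on your installation.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name, value):
    """Build a <property> element in the Hadoop *-site.xml layout (hypothetical helper)."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    # Hadoop expects lowercase "true"/"false" for boolean values.
    text = str(value).lower() if isinstance(value, bool) else str(value)
    ET.SubElement(prop, "value").text = text
    return prop


def render(prop):
    """Serialize the element to a string."""
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    print(render(hadoop_property("mapred.compress.map.output", True)))
```

From a Hive session, the same property can usually be set for the current session with `SET mapred.compress.map.output=true;`.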

Hadoop Performance Tuning (Hadoop-Hive)

Tuning a Hadoop cluster for performance is a little tricky, because the Hadoop framework uses every type of resource for processing and analyzing data. So there is no single static set of parameter values that gives good performance. Parameter values should be changed based on the following characteristics of the cluster:

· Operating System
· Processor and its number of cores
· Memory (RAM)
· Number of nodes in cluster
· Storage capacity of each node
· Network bandwidth
· Amount of input data
· Number of jobs in business logic

The recommended OS for Hadoop clusters is Linux, because Windows and other GUI-based operating systems run a lot of GUI (graphical user interface) processes, which take up memory that Hadoop could otherwise use.

Each node should have at least 5GB of storage left over after storing its share of the distributed HDFS input data. For example, if the input data is 1 TB and the cluster has 1000 nodes, then (1024GB x 3 (replication factor)) / 1000 nodes = approx 3GB of distributed data on each node, so it is recommended to have at least 8GB of storage on each node. The extra space is needed because each data node writes logs and needs some space for swapping memory.
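
The arithmetic above can be reproduced with a short sketch. The function name and the `headroom_gb` parameter are illustrative; the 5GB headroom and the example figures are the ones used in this post.

```python
def storage_per_node_gb(input_gb, replication, nodes, headroom_gb=5):
    """Return (HDFS data per node, recommended storage per node) in GB."""
    distributed = input_gb * replication / nodes
    return distributed, distributed + headroom_gb


if __name__ == "__main__":
    # 1 TB of input, replication factor 3, 1000 nodes.
    dist, recommended = storage_per_node_gb(1024, 3, 1000)
    print(f"{dist:.2f} GB of HDFS data per node, about {recommended:.0f} GB recommended")
```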

Big Data with Cloud computing (Amazon Web Service)

Big Data projects require a huge amount of resources, and cloud computing helps to avoid the headache of maintaining those resources. I already discussed this in more detail in the previous post. Here I discuss more about big data with the cloud provider Amazon Web Services.

The three major resources required for any type of computing are processor (CPU), memory (RAM) and storage (hard disk). The amount of each resource required varies from project to project, and big data projects in particular need more of each to process huge amounts of data.

Amazon provides Elastic Compute Cloud (EC2) instances. An instance is similar to a single desktop machine, except that it lives in the cloud and the user computes on it over the network. EC2 instances are available in different types, from micro instances to very large instances, and a user can create any number of them. An EC2 small instance has 1.7GB of memory, a 160GB hard disk and one 32-bit processor. Those who require more power in a single instance can go up to the EC2 Cluster GPU instance, which has 22GB of memory, 33.5 EC2 Compute Units (64-bit) and a 1680GB hard disk.

BigData : How data distributed in Hadoop HDFS (Hadoop Distributed File System)

The Apache Hadoop framework uses the logic of Google's MapReduce model and the Google File System. In Hadoop, data is split into chunks and distributed across all nodes in the cluster. This concept is inherited from the Google File System; in Hadoop we call it HDFS (Hadoop Distributed File System). While data is being loaded into HDFS, it is distributed to the nodes based on a few parameters. Here we will see two important parameters to consider for better performance.

1. Chunk size (dfs.block.size, in bytes) - 64MB, 128MB, 256MB or 512MB. It is preferable to choose the size based on the amount of input data to be processed and the power of each node.

2. Replication factor (dfs.replication=3) - by default it is 3, which means each piece of data is stored on 3 nodes, or 3 times around the cluster. If nodes have a high chance of failure, it is better to increase the replication factor. Replication is needed because, without it, if any node in the cluster fails, the data on that node cannot be processed and we will not get a complete result.
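
To see how the two parameters interact, here is a small sketch that estimates how many blocks a file is split into and how many block copies the cluster ends up storing. The function name is hypothetical, and the 64MB default is simply the first size from the list above.

```python
import math

MB = 1024 * 1024


def block_layout(file_bytes, block_bytes=64 * MB, replication=3):
    """Return (number of blocks, total block copies stored across the cluster)."""
    blocks = math.ceil(file_bytes / block_bytes)
    return blocks, blocks * replication


if __name__ == "__main__":
    blocks, copies = block_layout(1024 * MB)  # a 1GB file
    print(blocks, "blocks,", copies, "block copies in the cluster")
```

A larger block size means fewer blocks to track for the same file, while a higher replication factor multiplies the number of copies and the storage used.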

Big Data with Cloud Computing

What is Big Data?

Big Data usually refers to processing or analysing a huge amount of data or data sets (terabytes, petabytes and so on) that would take a long time to process in RDBMS-type databases. Big Data projects use a lot of technologies and frameworks to process data. Google first introduced the MapReduce framework in 2004, and Google still uses MapReduce today to index the whole WWW for its search engine. A few other frameworks used for Big Data are massively parallel processing (MPP) databases, data-mining grids, the Apache Hadoop framework, etc.

How is cloud computing related to Big Data? That is a big question.

To answer it, we just need to know how MapReduce works.

For example, consider a scenario in which you have two tables with 1TB of data, or say 1 billion records (1000 million), in each table. A query that joins these two tables with a complex join condition will take around 30 minutes (approx); this might vary depending on the capability of your database server. The MapReduce framework has a strategy to handle this situation. The strategy is simple: a big task is split up and given to multiple people, so the task gets done sooner.
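
The split-and-combine strategy can be sketched in a few lines. This is a toy, single-process illustration that counts records per key, not Hadoop's actual API; all function names are hypothetical, and in a real cluster each chunk would be mapped on a different node in parallel.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (key, 1) pair for every record in one chunk."""
    return [(record, 1) for record in chunk]


def shuffle(pairs):
    """Group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(records, workers=4):
    """Split the input into one chunk per worker, then map, shuffle and reduce."""
    size = max(1, -(-len(records) // workers))  # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(run_job(["a", "b", "a", "c", "a"], workers=2))
```

The result does not depend on how many workers the input is split across, which is what lets the work be spread over many machines.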

Reason for Cloud computing popularity and rapid development

Nowadays the number of cloud computing users is growing exponentially, because cloud computing has a lot of features such as pooled resources, elasticity, instance startup, on-demand computing, a self-service model, location independence, multi-tenancy, reliability, easy maintenance and a pay-per-use model. One more main reason is that cloud computing is not a single technology; it is a group of technologies that together are named cloud computing. Or we can say that cloud service providers bring in features from other technologies and keep introducing more and more new features along with cloud to beat their competitors.

Which technologies are clubbed together in cloud computing, or what are the underlying technologies of cloud computing?

Client-server model - a distributed environment in which the client requests a service, and the server processes the request and sends back the result. The computing happens on the server.

What is iCloud?

Following its series of products and services such as iPhone, iPad, iPod, iOS, iTunes, iAd and iBookStore, Apple Inc. has come up with a new one: iCloud, a free cloud service for its customers.

Advantages of iCloud:

Nowadays it is hectic to keep all your data updated across all your devices. iCloud helps by updating your data automatically on all your devices. This is done wirelessly, over either Wi-Fi or 3G (based on user configuration), so users need not worry about plugging in or transferring data between devices like iPad, iPhone and iPod. Changes made to your data on one device are automatically applied to your other Apple devices. For example, a photo taken on your iPhone will be pushed to your PC automatically.

iCloud provides about 5GB of free storage space to store user data and synchronize it between devices such as iPhone, iPad, iPod and a Mac or Windows PC. It also allows up to 10 devices to be configured for each user. Moreover, the space required to store music, books and apps purchased from Apple is not counted against this 5GB, so users get a dedicated 5GB to store their personal mail, documents and backup data. Users who need more storage space can extend it by paying an extra price.

What is Virtualization?

Virtualization means creating a virtual form of hardware and/or software resources. Virtualization of a server means partitioning a physical server into several virtual servers, or machines. Each virtual machine is isolated, so it can interact with other devices, applications, users and data independently. This makes it possible to install and run different operating systems in the virtual machines even though they all run on the same physical server. Because of this isolation, if one virtual machine crashes it will not affect the other virtual machines.

Virtualization is not only about partitioning resources; it can also combine a group of physical servers into a single server. The main concept of virtualization is to use resources efficiently by either partitioning or combining them.

Why virtualization?

Take the example of a data center: most of the machines use only 10-15% of their total capacity most of the time, which wastes electricity and maintenance cost. To make optimal use of the remaining capacity we can use virtualization, which also helps to avoid setting up platform- or application-specific data centers.

What is Cloud Computing ?

Cloud computing has become the next big thing in the world of computers. In this blog I am going to share, in simple terms, the basics of what cloud computing is and where and how it is being used.

Now, what is "Cloud Computing"? A basic definition can be derived by splitting the term "Cloud Computing" into its two words, Cloud and Computing.

Cloud:- Generally we use the cloud symbol to represent the internet, which is a collection of web servers, file servers, supercomputers, printers, etc. So some resources are available somewhere in the network, and we can access and make use of them based on the privileges we hold.

Computing:- It explicitly tells us that something is being computed or processed on the cloud.
From this we can understand that the main functionality of cloud computing is the use of resources over a network as a service, on an on-demand basis.
We can take a web server as one of the best examples of cloud computing. As users, we send a request to the web server to access a particular page for a certain purpose. The web server receives our request, computes or processes it based on the specification we have given, and sends a response back to us. I have mentioned this example only for basic understanding; in reality cloud computing is much more than that.

Big Data (Hadoop) & Cloud Computing

Thursday, September 22, 2011

Hadoop performance tuning (Hadoop-Hive) part -3

Monday, September 19, 2011

Hadoop with Hive

Tuesday, August 2, 2011

Hadoop Performance tuning (Hadoop-Hive) Part 2

Saturday, July 16, 2011

Hadoop Performance Tuning (Hadoop-Hive)

Thursday, July 14, 2011

Big Data with Cloud computing (Amazon Web Service)

Wednesday, July 6, 2011

BigData : How data distributed in Hadoop HDFS (Hadoop Distributed File System)

Thursday, June 30, 2011

Big Data with Cloud Computing

Monday, June 27, 2011

Reason for Cloud computing popularity and rapid development

Saturday, June 25, 2011

What is iCloud?

Saturday, June 18, 2011

What is Virtualization?

Wednesday, June 15, 2011

What is Cloud Computing ?