Saturday, July 16, 2011

Hadoop Performance Tuning (Hadoop-Hive)

Hadoop cluster performance tuning can be hectic, because the Hadoop framework uses every type of resource to process and analyze data, so there is no single static set of tuning parameters. Parameter values should be changed based on the following characteristics of the cluster:
  • Operating system
  • Processor and number of cores
  • Memory (RAM)
  • Number of nodes in the cluster
  • Storage capacity of each node
  • Network bandwidth
  • Amount of input data
  • Number of jobs in the business logic

The recommended OS for Hadoop clusters is Linux, because Windows and other GUI-based operating systems run many graphical user interface (GUI) processes that occupy a large share of memory.

Each node should have at least 5 GB of spare storage after the distributed HDFS input data is stored. For example, with 1 TB of input data on a 1000-node cluster, (1024 GB x 3 (replication factor)) / 1000 nodes ≈ 3 GB of distributed data on each node, so it is recommended to have at least 8 GB of storage per node, because each DataNode also writes logs and needs some space for swap.
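The estimate above can be sketched as a small helper. The numbers (1 TB input, replication factor 3, 1000 nodes, 5 GB headroom) follow the example in the text; plug in your own cluster's values.

```python
# Per-node storage estimate for an HDFS cluster, following the
# example in the text: distributed data per node plus headroom
# for DataNode logs and swap space.

def per_node_storage_gb(input_gb, replication=3, nodes=1000, headroom_gb=5):
    """Return the recommended minimum storage per node, in GB."""
    distributed_gb = (input_gb * replication) / nodes
    return distributed_gb + headroom_gb

# 1 TB of input on a 1000-node cluster:
print(per_node_storage_gb(1024))  # ~3 GB of data + 5 GB headroom ≈ 8.07 GB
```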

Thursday, July 14, 2011

Big Data with Cloud computing (Amazon Web Service)

Because Big Data projects require a huge amount of resources, cloud computing helps avoid the headache of resource maintenance, as discussed in more detail in the previous post. Here we discuss Big Data with the cloud provider Amazon Web Services.

The three major resources required for any type of computing are the processor (CPU), memory (RAM), and storage (hard disk). The amount of each resource a project requires varies, and Big Data projects in particular need more of all three to process huge amounts of data.

Amazon provides Elastic Compute Cloud (EC2) instances; each is similar to a single desktop machine, except that it lives in the cloud and the user computes over the network. EC2 instances are available in a range of types, from micro to very large, and a user can create any number of them. An EC2 small instance has 1.7 GB of memory, 160 GB of hard disk, and one 32-bit processor. Users who need more power in a single instance can go up to an EC2 Cluster GPU instance, which has 22 GB of memory, 33.5 EC2 Compute Units (64-bit), and 1680 GB of hard disk.

Wednesday, July 6, 2011

BigData : How data distributed in Hadoop HDFS (Hadoop Distributed File System)

The Apache Hadoop framework uses the Google MapReduce model and Google File System concepts. In Hadoop, data is split into chunks and distributed across all nodes in the cluster. This concept is inherited from the Google File System; in Hadoop it is called HDFS (Hadoop Distributed File System). While data is being loaded into HDFS, it is distributed to all nodes based on a few parameters. Here we will look at two important parameters to consider for better performance.

1. Chunk size (dfs.block.size, in bytes) - typically 64 MB, 128 MB, 256 MB, or 512 MB. It is preferable to choose the size based on the amount of input data to be processed and the power of each node.

2. Replication factor (dfs.replication=3) - 3 by default, meaning each block of data is available on 3 nodes, i.e. stored 3 times around the cluster. If there is a high chance of node failure, it is better to increase the replication factor. Replication is needed because if a node in the cluster fails and its data exists nowhere else, that data cannot be processed and the final result will be incomplete.
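The effect of these two parameters can be sketched with a bit of arithmetic: a file is cut into blocks of dfs.block.size, and each block is stored dfs.replication times. This is an illustrative sketch, not Hadoop's actual placement logic.

```python
# Back-of-the-envelope view of dfs.block.size and dfs.replication:
# a file is split into fixed-size blocks, and each block is stored
# `replication` times across the cluster.

import math

def hdfs_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, upper bound on total MB stored).

    The total is an upper bound because the last block may be
    only partially filled.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    total_stored_mb = blocks * block_size_mb * replication
    return blocks, total_stored_mb

# A 1 GB file with default 64 MB blocks and replication 3:
print(hdfs_blocks(1024))  # (16, 3072) -> 16 blocks, up to 3 GB stored
```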

Thursday, June 30, 2011

Big Data with Cloud Computing

What is Big Data?


Big Data usually refers to processing/analysing huge amounts of data or data sets (terabytes, petabytes, etc.) that would take a long time to process in an RDBMS-type database. Big Data projects use many technologies and frameworks to process data. Google first introduced the MapReduce framework in 2004, and to this day Google uses MapReduce to index the whole WWW for its search engine. A few other frameworks used for Big Data are massively parallel processing (MPP) databases, data-mining grids, the Apache Hadoop framework, etc.

How is cloud computing related to Big Data? That is a big question.
To answer it, we just need to know how MapReduce works.

As an example, consider a scenario where you have two tables with 1 TB of data each, or roughly 1 billion records (1000 million) per table. A query over these two tables with a complex join condition might take around 30 minutes (approximately; the time varies with your database server's capability). The MapReduce framework has a strategy for this situation, and it is simple: a big task is split up and handed to many workers, so the task is done sooner.
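The split-and-combine strategy can be illustrated with a toy word count, the classic MapReduce example: each chunk of input is counted independently (the map step, which could run on many machines in parallel), and the partial counts are then merged (the reduce step).

```python
# Toy illustration of the MapReduce "split the big task" strategy:
# count words across chunks of input, mapping each chunk on its own
# and then reducing (merging) the partial results.

from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Map step: count the words in one chunk of input."""
    return Counter(chunk.split())

def reduce_counts(a, b):
    """Reduce step: merge two partial counts."""
    return a + b

chunks = ["big data big", "data cloud", "big cloud cloud"]
partials = [map_chunk(c) for c in chunks]   # in a real cluster, runs in parallel
total = reduce(reduce_counts, partials)
print(total["big"], total["cloud"])  # 3 3
```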

Monday, June 27, 2011

Reason for Cloud computing popularity and rapid development


Nowadays the number of cloud computing users is growing exponentially, because it offers many features: pooled resources, elasticity, fast instance startup, on-demand computing, a self-service model, location independence, multi-tenancy, reliability, easy maintenance, and a pay-per-use model. One more important reason is that cloud computing is not a single technology; it is a group of technologies under the name cloud computing. Put another way, cloud service providers borrow features from other technologies and keep introducing new ones alongside the cloud to beat their competitors.


What are the technologies clubbed together as cloud computing, or the underlying technologies in cloud computing?

Client-server model - a distributed environment in which the client requests a service and the server processes the request and sends back the result; the computing happens on the server.
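The request/response flow described above can be sketched with a minimal loopback TCP exchange; the server's "computation" here is just uppercasing the request, purely for illustration.

```python
# Minimal sketch of the client-server model: the client sends a request
# over a loopback TCP socket; the server computes (here: uppercases the
# text) and returns the result.

import socket
import threading

def serve_once(srv):
    conn, _ = srv.accept()
    request = conn.recv(1024)       # receive the client's request
    conn.sendall(request.upper())   # the computing happens on the server
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))          # let the OS pick a free port
srv.listen(1)
threading.Thread(target=serve_once, args=(srv,), daemon=True).start()

client = socket.socket()
client.connect(srv.getsockname())
client.sendall(b"hello server")
reply = client.recv(1024)
print(reply)                        # b'HELLO SERVER'
client.close()
srv.close()
```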

Saturday, June 25, 2011

What is iCloud?

Following its series of products and services such as the iPhone, iPad, iPod, iOS, iTunes, iAd, and iBookstore, Apple Inc. has come up with a new one, iCloud, a free cloud service for its customers.

Advantages of iCloud:
Nowadays it is hectic to keep all your data up to date across all your devices. iCloud keeps your data updated automatically on all of them, wirelessly over either Wi-Fi or 3G (based on user configuration), so users no longer need to worry about plugging in or transferring data between devices such as the iPad, iPhone, and iPod. Changes made to your data on one device are automatically pushed to your other Apple devices; for example, a photo taken on your iPhone is pushed to your PC automatically.
iCloud provides about 5 GB of free storage space for user data and synchronizes data between devices such as the iPhone, iPad, iPod, and a Mac or Windows PC. It also allows each user to configure up to 10 devices. Moreover, the space required to store music, books, and apps purchased from Apple is not counted against this 5 GB, so users get a dedicated 5 GB to store their personal mail, documents, and backup data. Users who need more storage space can extend it by paying an extra price.

Saturday, June 18, 2011

What is Virtualization?

Virtualization means creating a virtual form of hardware and/or software resources. Virtualizing a server means partitioning a physical server into several virtual servers, or machines. Each virtual machine is isolated, so it can interact with other devices, applications, users, and data independently. This makes it possible to install and run different operating systems in the virtual machines even though they run on the same physical server. Because of this isolation, if one virtual machine crashes, it does not affect the others.
Virtualization is not only about partitioning resources; it can also combine a group of physical servers into a single server. The main idea of virtualization is to utilize resources effectively, by either partitioning or combining them.

Why virtualization?
Take a data center as an example: most machines utilize only 10-15% of their total capacity most of the time, which wastes electricity and maintenance cost. To make optimal use of the remaining capacity we can apply virtualization, which also helps avoid building platform- or application-specific data centers.