Thursday, June 30, 2011

Big Data with Cloud Computing

What is Big Data?


Big Data usually refers to processing/analysing huge amounts of data or data sets (terabytes, petabytes, etc.) that take a long time to process in RDBMS-type databases. Big Data projects use a lot of technologies and frameworks to process data. Google introduced the MapReduce framework in 2004, and even today Google uses MapReduce to index the whole WWW for its search engine. A few other technologies used for Big Data are massively parallel processing (MPP) databases, data-mining grids, the Apache Hadoop framework, etc.

How is cloud computing related to Big Data? That is a big question.
To answer it, we first need to know how MapReduce works.

For example, consider a scenario where you have two tables with 1 TB of data each, or roughly 1 billion (1,000 million) records per table. A query over these two tables with a complex join condition can take around 30 minutes (approximately; the actual time varies depending on your database server's capability). The MapReduce framework has a strategy to handle this situation. The strategy is simple: the big task is split up and given to multiple workers, so the task gets done much sooner.


MapReduce has two functions (a minimal code sketch follows this list):

Map - The input data is partitioned and mapped across multiple (ideally all) nodes in the cluster. Once a query is given, each node processes its own partition and sends back its result.

Reduce - The reducer then combines the results from all the nodes and returns the final, combined result.
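
To make the two phases concrete, here is a minimal, single-machine sketch in plain Java (this is not the Hadoop API; the class and method names are made up for illustration). It counts words by first mapping each line of input to (word, 1) pairs, then grouping the pairs by word, and finally reducing each group to a single count. In a real cluster the map calls run in parallel on many nodes and the framework does the grouping (the "shuffle") across the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy, single-machine illustration of the two phases.
public class MapReduceSketch {

    // Map phase: turn one input record (a line of text)
    // into intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: combine every value that shares the same key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "big data with cloud computing",
                "cloud computing scales big data");

        // "Shuffle": group the intermediate pairs by key so that each
        // distinct word reaches exactly one reduce call.
        Map<String, List<Integer>> grouped = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.groupingBy(
                        e -> e.getKey(),
                        Collectors.mapping(e -> e.getValue(), Collectors.toList())));

        grouped.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(word, counts)));
    }
}
```

The important point is that each map call only sees one record and each reduce call only sees one key, which is what lets the framework spread the work across as many nodes as it likes.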

As per the scenario above, assume a 1,000-node (worker) cluster. The 1 TB of data is partitioned into 256, 512 or 1024 MB data blocks and mapped to the nodes in the cluster by the mapper while the data is loaded. Once the operation is initiated, each node processes its blocks and sends its result back to the reducer, which combines them and returns the final result. A 1,000-node cluster is more than enough to process 1 TB of data, and the running time drops to roughly 1/500th of the single-server time, as the rough calculation below shows.
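
The "almost 500 times" figure is easy to sanity-check with the numbers from this scenario. The back-of-envelope sketch below is illustrative only (the class and variable names are made up, and it ignores shuffle, reduce and network overhead): 1 TB split into 1024 MB blocks gives 1,024 blocks, which 1,000 nodes can work through in two waves, for an ideal speedup of about 512x.

```java
// Back-of-envelope check of the scenario above (illustrative only;
// it ignores shuffle, reduce and network overhead).
public class SpeedupEstimate {
    public static void main(String[] args) {
        long dataMB = 1024L * 1024L;    // 1 TB expressed in MB
        long blockMB = 1024L;           // chosen block size
        long nodes = 1000L;             // workers in the cluster
        double serialMinutes = 30.0;    // single-server time from the scenario

        long blocks = dataMB / blockMB;              // 1,024 blocks
        long waves = (blocks + nodes - 1) / nodes;   // 2 waves: 1,000 blocks, then 24
        double perBlock = serialMinutes / blocks;    // minutes of work in one block
        double idealMinutes = perBlock * waves;      // each wave runs fully in parallel
        double speedup = serialMinutes / idealMinutes;

        System.out.printf("blocks=%d, waves=%d, ideal time=%.2f min, speedup=%.0fx%n",
                blocks, waves, idealMinutes, speedup);
    }
}
```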

But why should we use cloud computing for Big Data?

There are two main reasons:
  •  It is difficult to set up and maintain 1,000 nodes of our own just to process data for a short period of time.
  •  There is every chance the data size will grow to 100 TB, 200 TB and so on, and it is difficult to provision the required number of nodes on demand ourselves.

This is where cloud computing helps us: a cloud is scalable to any size and any number of nodes, and the utility computing model lets us pay only for the duration consumed to process our data. After looking at the picture above, you might have a lot of doubts about the master, the user program and the intermediate files; we will cover those in the upcoming posts on the Apache Hadoop framework.
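
To make the utility computing point concrete, here is a rough cost sketch. The per-node-hour rate below is a hypothetical placeholder, not any real provider's price; the point is only that renting 1,000 nodes for a few minutes costs far less than keeping the same 1,000 nodes running all month.

```java
// Illustrative only: the hourly rate is a hypothetical placeholder,
// not an actual cloud provider's price.
public class UtilityCostSketch {
    public static void main(String[] args) {
        double pricePerNodeHour = 0.10;   // hypothetical $/node-hour
        int nodes = 1000;                 // size of the on-demand cluster
        double jobHours = 10.0 / 60.0;    // assume the whole job takes about 10 minutes

        // Pay-per-use: we are billed only while the job runs.
        double burstCost = nodes * jobHours * pricePerNodeHour;
        System.out.printf("Approximate cost of the 10-minute burst: $%.2f%n", burstCost);

        // Owning (or renting 24x7) the same capacity is paid for even while idle.
        double monthlyCost = nodes * 24 * 30 * pricePerNodeHour;
        System.out.printf("Same 1,000 nodes running all month:     $%.2f%n", monthlyCost);
    }
}
```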

11 comments:

  1. Nice one. I am waiting for your Hadoop framework post.

  2. From all your previous posts I thought we could do a lot more with cloud computing, but this one is amazing. By following your blog I am able to learn a lot of new things. Can you give me some real-time examples of Big Data, and where it is used?

  3. @John : Thanks, my next post is about Hadoop, as you requested.

    @Maria : You mean real-time Big Data projects, right?
    1. Google uses Big Data for indexing the web.
    2. Facebook uses Big Data / MapReduce technology.
    3. Most social networks use Big Data / MapReduce to process billions of users' data.

  4. Nice post buddy, very informative!

  5. Very interesting details you have noted; thanks for putting this up.
    online backup service

  6. Hi,

    I am new here and want to learn Hadoop. I don't know Java. Is Java a must for learning Hadoop? Please suggest.


    Thanks,
    Vaibhav

  7. I want to know more about Hadoop. Please suggest the best way to learn it.

    Amazon Web Services

  8. Thanks for providing a useful article containing valuable information. Start learning the best online software courses.

    Workday HCM Online Training
