Big Data

Srinivasreddy
3 min read · Sep 16, 2020

What is Big Data?

Big Data describes a problem where the volume of data is so huge that it goes beyond the capacity of our storage. Such a problem is known as a Big Data problem.

Example: suppose we have 100 GB of storage capacity, but in the future we need to store more than 100 GB. How can we store the data?

*Here we face a problem of Volume.

(Image: Google storage)
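As a tiny illustration of the volume problem, here is a minimal Python sketch that checks whether incoming data still fits on a disk. The 120 GB figure is just an invented number continuing the example above:

```python
import shutil

GB = 10**9  # decimal gigabyte

def fits_on_disk(path, incoming_bytes):
    """Return True if `incoming_bytes` of new data fit on the disk at `path`."""
    usage = shutil.disk_usage(path)  # named tuple: (total, used, free)
    return incoming_bytes <= usage.free

# Continuing the example: ~100 GB of capacity, more than 100 GB arriving.
incoming = 120 * GB
if not fits_on_disk("/", incoming):  # "/" is just an example mount point
    print("Volume problem: the data no longer fits on a single disk.")
```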

We can buy additional storage beyond 100 GB, but buying that much storage comes at a cost.

Suppose we buy the storage anyway, despite the high cost. We have now overcome the problem of Volume: we have high-capacity storage, and say we have filled it completely, 100/100 GB.

Then reading one huge storage device and producing a result can take a lot of time, right?

*Here we face a new problem: I/O (the speed of reading and writing data).
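To see why I/O becomes the bottleneck, here is a rough back-of-the-envelope calculation in Python. The 150 MB/s figure is an assumption, roughly the sequential read speed of a typical single hard disk:

```python
GB = 10**9
MB = 10**6

storage_bytes = 100 * GB  # the full 100 GB disk from the example
disk_speed = 150 * MB     # assumed sequential read speed: ~150 MB/s

seconds = storage_bytes / disk_speed
print(f"Reading 100 GB sequentially takes about {seconds / 60:.0f} minutes.")
# -> roughly 11 minutes just to read the disk once, before any processing
```

Eleven minutes just to read the disk once, before doing any actual processing, and the wait grows linearly with the size of the storage.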

Facebook generates 4 petabytes of data per day, which is about 4 million gigabytes. All that data is stored in what is known as the Hive, which contains about 300 petabytes of data.

Now let us take the example of Google:

A data center normally holds petabytes to exabytes of data. Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters.

Here we can see that Google processes over 20 petabytes per day. Now imagine that we search for something on Google: Google needs to read the relevant data in its storage and return a result. Reading that huge storage from a single machine would take a very long time, maybe even days…

If we searched for something and Google returned the result after 5 to 10 days, would we still use Google?

So Google needs to overcome the problem of I/O, right?

It is not only Google that handles this much data; Facebook and other social media platforms face the same problem.
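To put a number on "maybe even days": if those 20 petabytes had to be scanned from a single disk, the wait would be measured in years, not days. A quick sketch, again assuming ~150 MB/s per disk:

```python
PB = 10**15
MB = 10**6

data = 20 * PB            # data Google processes per day
disk_speed = 150 * MB     # assumed single-disk read speed

seconds = data / disk_speed
print(f"One disk would need about {seconds / 86400:.0f} days "
      f"({seconds / (86400 * 365):.1f} years) to read 20 PB.")
# -> about 1543 days, i.e. over 4 years, on a single disk
```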

So we have two major sub-problems of Big Data:

  1. Volume
  2. I/O

How can we overcome these problems?

Here we have a technique for that: the distributed storage cluster.

(Image: master-slave cluster topology)

Here we have one master and three slave nodes. Say the master has a capacity of 50 GB but needs to store more than 50 GB; this is the problem of Volume.

To reduce the problem of Volume, we have the slaves. Let each slave have a capacity of 20 GB; together with the master's storage, the overall capacity becomes 50 + 3 × 20 = 110 GB. In this way we reduce the problem of Volume, and if we ever need to store more than 110 GB, we can simply add more slaves.

Since we are now using multiple slaves, each slave holds only a limited share of the data, and all of them can be read at the same time. In this way we solve the problem of I/O too, as the sketch below shows.

Each slave is connected to the master through the network.
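Here is a minimal Python sketch of the idea: a file is split into blocks that are spread round-robin across the slaves, and the slaves then read their shares in parallel. The block size, slave count, and disk speed are illustrative assumptions, not values from the article or from Hadoop:

```python
GB = 10**9
MB = 10**6

SLAVES = 3                 # the three slave nodes from the topology above
BLOCK = 128 * MB           # illustrative block size
DISK_SPEED = 150 * MB      # assumed read speed of each node's disk

def distribute(total_bytes):
    """Round-robin the file's blocks across the slaves."""
    stored = [0] * SLAVES
    offset = 0
    while offset < total_bytes:
        chunk = min(BLOCK, total_bytes - offset)
        stored[(offset // BLOCK) % SLAVES] += chunk
        offset += chunk
    return stored

data = 60 * GB             # too big for any single 20 GB slave
stored = distribute(data)
for i, n in enumerate(stored):
    print(f"slave {i}: {n / GB:.1f} GB")  # each slave holds only ~20 GB

# Volume: 60 GB fits because no single node has to hold it all.
# I/O: the slaves read their shares simultaneously, so the scan takes
# max(share) / DISK_SPEED instead of total / DISK_SPEED.
print(f"serial scan:   {data / DISK_SPEED:.0f} s")
print(f"parallel scan: {max(stored) / DISK_SPEED:.0f} s")
```

With three slaves the scan finishes roughly three times faster, and adding more slaves shrinks it further; this is how the cluster attacks both sub-problems at once.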

Hadoop is the name of a product used to create clusters based on this master-slave model.

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
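To give a concrete taste of the MapReduce model Hadoop uses, here is the classic word-count example condensed into a single toy Python script. In real Hadoop the map and reduce steps run on many cluster nodes; this single-process rendition is only a sketch of the model:

```python
from collections import defaultdict

# A toy, single-process rendition of the MapReduce word count.
# In real Hadoop, many mappers and reducers run on different cluster nodes.

def mapper(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    """Reduce: sum the counts for one word."""
    return (word, sum(counts))

lines = ["big data is a big problem", "hadoop solves the big data problem"]

pairs = [pair for line in lines for pair in mapper(line)]
for word, counts in sorted(shuffle(pairs).items()):
    print(reducer(word, counts))
# ('big', 3), ('data', 2), ('problem', 2), ...
```

The framework's job is everything around these two small functions: splitting the input across nodes, shuffling the intermediate pairs, and collecting the reduced results.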
