Big Data refers to a very large volume of data. The term describes data sets so large and fast-growing that they exceed the storage and processing capabilities of traditional data management tools. Some examples:
- Facebook, which stores data about your posts, notification clicks, post likes and uploaded photos, generates about 600 TB of data every day, which comes to roughly 18 petabytes of data in a month.
Reference : https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
- The NCCS (NASA Center for Climate Simulation), which focuses on climate and weather data, houses around 32 petabytes of data.
- The size of the climate change data repositories alone is projected to grow to nearly 350 petabytes by 2030.
Reference : https://open.nasa.gov/blog/what-is-nasa-doing-with-big-data-today/
- Wal-Mart handles more than a million customer transactions each hour and imports those into databases estimated to contain more than 2.5 petabytes of data.
Reference : https://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/big-data-meets-big-data-analytics-105777.pdf
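As a rough check of the figures above, the arithmetic can be sketched in a few lines of Python. The 30-day month and decimal units (1 PB = 1000 TB) are assumptions, and the function names are made up for illustration.

```python
# Assumption: decimal (SI) units, where 1 petabyte = 1000 terabytes.
TB_PER_PB = 1000


def monthly_petabytes(tb_per_day: float, days: int = 30) -> float:
    """Convert a daily volume in terabytes to a monthly volume in petabytes."""
    return tb_per_day * days / TB_PER_PB


def daily_transactions(per_hour: int) -> int:
    """Scale an hourly transaction count to a full 24-hour day."""
    return per_hour * 24


if __name__ == "__main__":
    # 600 TB a day over an assumed 30-day month
    print(monthly_petabytes(600))         # 18.0
    # one million transactions an hour, around the clock
    print(daily_transactions(1_000_000))  # 24000000
```

So 600 TB a day does work out to about 18 petabytes a month, and a million transactions an hour means about 24 million a day.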
What to do with Big Data
Examples of petabytes of data are impressive, but the real question is what to do with that much data. Big Data is not just about generating huge volumes of data. One aspect of Big Data is coming up with technologies to store data at this scale. Another, more important aspect is being able to analyse that data and use it to make business decisions faster and more accurately, and to better understand consumer behavior.
Data in Big Data
Data in Big Data can be of any type: structured, semi-structured or unstructured, such as text, video, audio, sensor data, log files etc.
- Structured data – Any data that is organized in a fixed format can be termed structured data, such as data stored in relational databases or in a spreadsheet.
For structured data there are predefined rules on what type of data will be stored and how that data will be stored.
- Semi-structured data – Any data that doesn’t conform to the rigid structure associated with structured data but still has some structure, such as tags or other markers that separate and identify different elements and establish hierarchies of records and fields within the data, can be termed semi-structured data.
Examples are XML and JSON.
- Unstructured data – As the name suggests, unstructured data is the exact opposite of structured data, which means it doesn’t conform to any predefined rules in terms of data types and field positions within a file or record. Unstructured data usually includes multiple types of data, where you may have a combination of text, videos and images in no defined arrangement.
Examples of unstructured data are books, web pages, email messages etc.
Because it doesn’t fit any defined format, unstructured data is very difficult to analyze.
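The difference between the three kinds of data shows up as soon as you try to read them programmatically. Here is a minimal sketch using only the Python standard library. The sample order, user and email contents are invented for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same predefined schema,
# so a field can be addressed by name and aggregated directly.
csv_text = "order_id,customer,amount\n1,alice,25.50\n2,bob,12.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: keys and nesting identify the elements,
# but individual records are free to carry different fields.
json_text = (
    '[{"user": "alice", "tags": ["new"]},'
    ' {"user": "bob", "address": {"city": "Pune"}}]'
)
records = json.loads(json_text)
keys_per_record = [sorted(record.keys()) for record in records]

# Unstructured: free text with no fields at all. Without further analysis
# only crude operations, such as counting words, are possible.
email_text = "Hi team, the order arrived late and the box was damaged. Please advise."
word_count = len(email_text.split())

if __name__ == "__main__":
    print(total)            # 37.5
    print(keys_per_record)  # [['tags', 'user'], ['address', 'user']]
    print(word_count)       # 13
```

The CSV rows can be summed by column name, the JSON records have to be inspected one by one to see which fields they carry, and the email gives you nothing to query until some text analysis is applied.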
3 Vs of Big Data
Big Data can be described by the following characteristics:
- Volume – This characteristic refers to the volume of data that is generated and stored. It’s the size of data that determines the potential insight that can be derived from it and even determines whether the data can actually be considered as big data or not.
- Velocity – This characteristic refers to the speed at which data is generated and processed.
For example, processing the trade data created each day in a stock exchange to identify potential fraud.
Another example is analyzing the click stream data of a consumer in real time to offer that consumer suitable alternatives or products.
- Variety – This characteristic refers to the type and nature of the data. Data may be structured, unstructured or semi-structured. Analyzing all these types of data together provides better insights.
These 3 Vs have since been expanded to 5 Vs, adding two new characteristics of Big Data.
- Variability – This characteristic refers to the inconsistency of the data flow. There may be peak times when the data flow is so heavy that the processes in place to handle and manage data become ineffective.
- Veracity – This characteristic refers to the quality of data collected from multiple sources.
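Velocity and variability can be made concrete by counting how many events arrive per time window and comparing the busiest window with the average. This is only a sketch under stated assumptions: the timestamps are invented, the 60-second window is arbitrary, and the function names are hypothetical.

```python
from collections import Counter


def events_per_window(timestamps, window_seconds=60):
    """Count events in fixed-size windows, keyed by the window start time.

    Windows that received no events are not included in the result.
    """
    counts = Counter()
    for ts in timestamps:
        window_start = int(ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


def peak_to_mean(counts):
    """Ratio of the busiest window to the average over non-empty windows."""
    values = list(counts.values())
    return max(values) / (sum(values) / len(values))


# Hypothetical arrival times of click events, in seconds.
timestamps = [1, 5, 20, 61, 62, 63, 64, 65, 70, 130]
counts = events_per_window(timestamps)

if __name__ == "__main__":
    print(counts)  # {0: 3, 60: 6, 120: 1}
    print(round(peak_to_mean(counts), 2))
```

The per-window counts describe velocity, while a high peak-to-mean ratio hints at variability: a system sized for the average load may struggle during the peak window.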
Some Big Data technologies
Some of the Big Data technologies for storing and analyzing big data are:
- Apache Hadoop – Over the years Hadoop has grown into a whole ecosystem of related technologies such as HDFS, Hive, Pig and even Apache Spark.
- NoSQL Databases – Used for storing unstructured data and providing very fast performance. Some of the NoSQL databases are MongoDB, Cassandra and HBase.
- Presto – Developed by Facebook, Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
That’s all for the topic What is Big Data. If something is missing or you have something to share about the topic please write a comment.