What is Big Data?

Welcome to Internap’s Big Data Video series. First, let’s cover the Big Data Basics.

What is meant by the term "Big Data"? Why is it important? And what are some common Big Data use cases?

Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Big data is also defined by three characteristics: volume, variety, and velocity.

Volume refers to the enormous amount of data being stored. A big data project or application may involve petabytes of data and tens of millions of transactions per hour. For example, Twitter alone generates more than 7 terabytes of data every day.

Variety refers to the wide range of data types used as part of the analytical and decision making process. Many of these unstructured or semi-structured data sets don’t fit into typical organizational schemas. Tweets, social media blurbs, security camera images, weather reports and the like are all examples of data that can be highly variable.

Velocity is the speed at which information arrives, is processed, and is delivered in an actionable presentation. Within a big data scenario, data streams with real-time or near real-time analysis requirements are not uncommon, and they can be far faster than transactional streams. The combination of these elements requires significantly more flexibility in organizing, processing and analyzing than traditional approaches can deliver.
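To give a sense of what variety and velocity mean in practice, here is a small, purely illustrative Python sketch (the event types and fields are invented for this example, not taken from any real system) that consumes a stream of semi-structured events with no fixed schema and keeps a running count per event type as each record arrives:

```python
from collections import Counter

# Invented semi-structured events: each record has a type, but otherwise
# the fields vary from record to record (no fixed schema).
event_stream = [
    {"type": "tweet", "user": "@alice", "text": "big data!"},
    {"type": "camera", "camera_id": 7, "frame_bytes": 48213},
    {"type": "weather", "station": "KATL", "temp_f": 71.3},
    {"type": "tweet", "user": "@bob", "text": "hello"},
]

# Velocity: process each event as it arrives instead of batching it
# into a traditional database load.
running_counts = Counter()
for event in event_stream:
    running_counts[event["type"]] += 1

print(running_counts)  # Counter({'tweet': 2, 'camera': 1, 'weather': 1})
```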

In the 1970s, data management systems were primitive and rigidly structured; they typically relied on mainframes and lacked complex relational capabilities. In the ’80s and ’90s, data became more usable via the development of multifaceted relational databases.

Fortunately, a number of tools and processes have been developed within the past several years to address big data processing and analysis needs. These include MapReduce, a large-scale parallel processing framework developed and patented by Google; distributed file systems such as HDFS; NoSQL databases; and on-demand virtual and bare-metal infrastructure as a service.
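To make the MapReduce idea concrete, here is a minimal Python sketch of the pattern, not Google's or Hadoop's actual code: each record is mapped to key/value pairs, the pairs are grouped by key, and a reduce step combines the values for each key. In a real cluster, the map and reduce phases run in parallel across many machines over files stored in a distributed file system such as HDFS.

```python
from collections import defaultdict

# Toy corpus standing in for a large, distributed data set.
records = [
    "big data is a collection of data sets",
    "data sets so large and complex",
]

def map_phase(record):
    # Emit a (word, 1) pair for every word in the record.
    for word in record.split():
        yield word, 1

def reduce_phase(key, values):
    # Combine all counts emitted for the same word.
    return key, sum(values)

# Shuffle step: group intermediate pairs by key.
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

# Reduce step: one call per distinct key.
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts["data"])  # -> 3
```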

What is big data being used for? Oftentimes, companies are trying to use as much applicable data as is available to answer why something happened, to predict what will happen next, or to determine which questions to ask.

One common use scenario involves marketers using big data to understand consumer purchasing behavior. For example, when you scan your savings card at your local grocery store, an abundance of information is captured: what you bought, what time it was, whether it was on sale, what complementary products were also purchased, the time of year, and so on. Marketers then attempt to use this information to put their products in front of you at the most advantageous time and price.
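As a rough illustration of how that captured information might be analyzed, here is a small Python sketch (the transaction format and product names are hypothetical, not an actual retail system) that tallies which products are most often bought together in the same basket, the kind of complementary-purchase signal marketers look for:

```python
from collections import Counter
from itertools import combinations

# Hypothetical loyalty-card transactions: each is a basket of products
# with the hour of day it was scanned.
transactions = [
    {"hour": 18, "items": ["chips", "salsa", "soda"]},
    {"hour": 19, "items": ["chips", "salsa", "beer"]},
    {"hour": 9,  "items": ["coffee", "milk"]},
]

# Count how often each pair of products appears in the same basket.
pair_counts = Counter()
for txn in transactions:
    for pair in combinations(sorted(txn["items"]), 2):
        pair_counts[pair] += 1

# Products most frequently bought together.
print(pair_counts.most_common(1))  # -> [(('chips', 'salsa'), 2)]
```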

Our own IP architecture group here at Internap provides a real-life, close-to-home example of Big Data in action. Managed Internet Route Optimizer™ (MIRO), our proprietary IP routing algorithm, captures more than 1 trillion path and performance data points over the course of a 90-day period. These include real-time NetFlow and SNMP data, latency and jitter statistics, and path-plotting decisions.

Through the use of a commercial Hadoop distribution running on our own AgileSERVERs, we’re able to scale out to address exponential data growth, while efficiently processing and analyzing tremendous amounts of high-velocity, variable data.
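As a hedged sketch of what that kind of processing might look like (the record format and field names are assumptions for illustration, not MIRO's actual schema), a Hadoop-style job could average latency per network path with a simple mapper and reducer pair; the local sort below stands in for the distributed shuffle between phases:

```python
from itertools import groupby

# Hypothetical latency measurements: "path_id<TAB>latency_ms" lines,
# as a Hadoop Streaming job would read them from stdin.
def mapper(lines):
    # Emit (path_id, latency) for every measurement record.
    for line in lines:
        path_id, latency_ms = line.strip().split("\t")
        yield path_id, float(latency_ms)

def reducer(pairs):
    # Pairs arrive sorted by key; compute average latency per path.
    for path_id, group in groupby(pairs, key=lambda kv: kv[0]):
        latencies = [latency for _, latency in group]
        yield path_id, sum(latencies) / len(latencies)

# Local stand-in for the distributed shuffle/sort between phases.
sample = ["pathA\t12.0", "pathB\t30.0", "pathA\t18.0"]
for path_id, avg in reducer(sorted(mapper(sample))):
    print(path_id, avg)  # pathA 15.0, pathB 30.0
```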

In our next video, we’ll provide more detail on the different tools, processes, and infrastructure architectures used in big data applications, and tell you which ones fit and which ones don’t.

Watch next: Why is Big Data Important?