Demystifying Big Data


GE recently set up a BBQ research center at the South by Southwest (SXSW) conference in Austin, Texas. At the center, they installed a 12-foot-tall barbecue cooker with embedded sensors that generate data about temperature, pressure, and relative humidity inside the cooker. The data shows up in real time on a nearby computer screen, giving the barbecue’s pitmaster up-to-the-minute information on how the meat is cooking without having to open the cooker, which helps produce more consistent barbecue. The exhibit gives an idea of how GE is incorporating big data into the design of devices such as turbines and generators to optimize industrial processes, which generate far more data than a single BBQ cooker. How do firms like GE manage and analyze such large quantities of data? That’s where specialized big data software like Apache Hadoop comes in.

Why big data

So why is big data suddenly important now? We have always had specialized areas that grapple with lots of data, such as finance and weather applications. Typically those areas have employed supercomputers to crunch large amounts of data and have invested in expensive storage systems to hold it. In recent years, however, two major trends have emerged: (1) big data technologies have benefited from a drastic fall in data storage and processing costs, and (2) many new technologies now capture data more efficiently, such as mobile devices, RFID chips (e.g. E-ZPass), wearable computing, and so on.

The result is that the volume, velocity, and variety of data have increased exponentially. Volume refers to the sheer quantity of data: Walmart processes 2.5 petabytes of data every hour, the equivalent of 20 million filing cabinets. Velocity refers to the speed with which data about real-world events is collected and processed: MIT researchers were able to use location data from Macy’s parking lots to estimate Macy’s Thanksgiving sales even before Macy’s could. Variety refers to the multiple data sources and types: social media; mobile devices such as smartphones, e-book readers, and MP3 players; wearable computing; sensors embedded in machinery; and others.

Data that was once measured in kilobytes and megabytes is now routinely measured in gigabytes, terabytes, and petabytes. A megabyte is about 1,000 kilobytes; a gigabyte is about 1,000 megabytes; a terabyte is about 1,000 gigabytes; and a petabyte is about 1,000 terabytes. A typical smartphone photo is about one megabyte. An entry-level iPhone can store up to 16 gigabytes of data, equivalent to roughly 16,000 photos: if you took one photo every minute, you would need to keep shooting for about eleven days nonstop to fill it up. A typical desktop has one terabyte of storage, equivalent to about 60 such iPhones.
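The arithmetic behind that comparison is easy to check. The snippet below is a quick back-of-the-envelope calculation in Python using the approximate figures quoted above (1 MB per photo, 16 GB per phone, decimal conversions); it is an illustration, not exact binary arithmetic.

```python
# Back-of-the-envelope check of the storage comparison above.
photo_mb = 1                          # one smartphone photo ~ 1 MB
phone_gb = 16                         # entry-level iPhone ~ 16 GB

photos = phone_gb * 1000 // photo_mb  # ~16,000 photos fit on the phone
days = photos / 60 / 24               # one photo per minute, nonstop

print(photos, round(days, 1))         # 16000 photos, ~11.1 days
```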

Google, which pioneered much of the big data software in use today, indexes 60 trillion web pages worldwide and stores 100 petabytes of data (equivalent to 100,000 iMacs, each with a terabyte of storage). The company processes 20 petabytes of data every day. Facebook ingests 500 terabytes of data every day, a large part of which is photos uploaded by users: 300 million photos every day, from over 900 million daily active users.

Walmart, which processes over 1 million customer transactions every hour (2.5 petabytes of data per hour), uses big data software to set AdWords pricing, see what’s trending on social media, analyze customers’ buying patterns, track competitors’ pricing in real time, and so on. It recently raised the minimum purchase for free shipping from $45 to $50 based on big data analytics.

The insurance industry is increasingly adopting big data analytics for a variety of applications. One such area is risk minimization in selecting insurance customers, a role historically played by community-based insurance agents. Increasingly, insurance companies have access to multiple data sources on customer behavior, weather patterns, and employment statistics, which help them select customers and price insurance more accurately. They are also better able to personalize products to fit customer needs, cross-sell and up-sell products based on deeper insight into those needs, and employ sophisticated fraud detection based on visibility into online behavior (e.g. social media posts).

In auto insurance, companies such as Progressive, Travelers, and Allstate are beginning to offer usage-based insurance: customers plug a GPS tracking device into their cars, which records driving speed, braking frequency, time of day (so drivers can avoid late-night driving), and a host of other driving information. This information is used, or could be used, to price car insurance more accurately, and could also inform claims-processing decisions.

Cigna has launched a digital health coaching program that combines insights from sociology on motivation and behavior change with health tools to gamify healthy behaviors for its customers. Members earn badges and rewards for achieving health and exercise milestones, which translates into better health for members and lower insurance costs for Cigna. Supporting the program requires large-scale data collection and analysis.

The data ecosystem

So what is the data ecosystem that companies are embracing? The most common entry point for data in a company is its transaction processing system, which uses a relational database management system (RDBMS) to store transactions such as purchases, returns, and personnel hires. The data from multiple such repositories within the company is aggregated into a data warehouse, which enables data analysts to (a) produce periodic reports, i.e. aggregations of data for managerial decision making, and (b) perform special analytics for additional business insight. Here is a graphical representation of a typical data ecosystem in a company.

[Figure: Data ecosystem]

In this ecosystem, big data software is increasingly finding a place. The most commonly used software for big data applications is Hadoop, free, open-source software supported by the non-profit Apache Software Foundation. Hadoop runs on a cluster of interconnected computers, with a master computer coordinating storage and computation across the worker computers. Hadoop derives its time savings by (a) dividing big datasets into small chunks that are stored across the cluster, and (b) dividing computation tasks into small pieces that can be carried out on tens or hundreds of computers in parallel. These computers can be inexpensive commodity machines, purchased in bulk at volume discounts. Hadoop has fault tolerance built in: it stores every chunk of data on three computers, so that if one computer in the cluster fails, the others still have its data. Hadoop is scalable: you can add more computers to the cluster as your data processing needs grow. It also offers analytics capabilities for almost every imaginable need, along with tools that let data engineers develop analytics for specialized needs, and because the software is open source, companies are free to develop new applications.
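To make the divide-and-conquer idea concrete, here is a minimal sketch of the classic word-count job written for Hadoop Streaming in Python. The script names, input data, and sample command are illustrative assumptions rather than anything from the text above; the point is simply that many copies of the mapper run in parallel, each on one chunk of the data, while the reducer aggregates their output.

```python
#!/usr/bin/env python3
# mapper.py -- emits (word, 1) for every word it reads on stdin.
# Hadoop runs many copies of this script in parallel, one per chunk of input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word.
# Hadoop sorts the mapper output by key, so all lines for a word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A cluster run would look something like `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/text -output /data/wordcounts`, where the jar location and the input and output paths are hypothetical.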

The Hadoop ecosystem is built on the Hadoop Distributed File System (HDFS), which is responsible for storing and retrieving data across the computer cluster. On top of HDFS sits a resource manager called YARN (“Yet Another Resource Negotiator”), which is responsible for assigning and keeping track of computing jobs spread across the cluster. Above this sit a variety of applications such as Pig (for slicing and dicing data), Hive (for creating reports), Spark (for in-memory computing), Solr (for search applications), and Storm (for streaming data), along with other software for a variety of analytics needs.
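As a taste of what these higher-level tools look like, here is a minimal PySpark sketch that counts page requests per URL from a web server log stored in HDFS. The HDFS path and the space-separated log format are assumptions made for illustration, not a reference to any system named above.

```python
# Minimal PySpark sketch: count page views per URL in a web log stored in HDFS.
# The path and log layout (URL in the 7th space-separated field) are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-counts").getOrCreate()

logs = spark.read.text("hdfs:///logs/access.log")        # one request per line
urls = logs.selectExpr("split(value, ' ')[6] AS url")     # pull out the requested URL
counts = urls.groupBy("url").count().orderBy("count", ascending=False)

counts.show(10)                                            # ten most requested pages
spark.stop()
```

Because Spark can cache the parsed DataFrame in memory, follow-up queries on the same data run much faster than if the log were re-read from disk each time.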

Overview of Analytics

There are several types of analytics methods for a variety of needs, including classification (e.g. fraud detection, spam filtering), regression (for detecting relationships between variables of interest), recommender systems (for recommending products to customers, as Amazon and Netflix do), clustering (e.g. plotting crime or disease on a map to see which areas need attention), time series analysis (for data that varies over time, such as stock prices), frequent item-set mining (for discovering which items go together, e.g. supermarket shelf management or detecting communities on Twitter), and mining data streams (e.g. incoming tweets, Facebook comments, or trending articles in the New York Times).

The data that can be analyzed with big data software runs the gamut: structured data (e.g. transactions), semi-structured data (e.g. web logs), unstructured data (e.g. emails, customer reviews), graph data (e.g. social media links of who knows whom, web page links), streaming data (e.g. credit card approvals, insurance claims), mobile data (e.g. iPhone app usage), and others.

We now explore one analytics method, recommender systems, in some detail. A recommender system is designed to recommend items to users. Items could be products (e.g. Amazon), movies (e.g. Netflix), friends (Facebook), news articles, or any other objects. Users could be paying customers, subscribers or other users.

It is possible to build a recommender system that makes recommendations based on a user’s prior history, or based on an item’s similarity to other items. However, modern recommender systems use a technique known as “collaborative filtering,” in which they leverage the item choices of other users when making recommendations.

The system works in two steps. Using the example of Amazon, suppose the system sees that you are interested in (or have bought) a product, the focal product, and wants to recommend other products to you. In the first step, it measures the similarity of the focal product to other products: two products that share a lot of common buyers (relative to the overall number of buyers of either product) are deemed very similar. In the second step, it ranks products by their similarity to the focal product and recommends the most similar one(s) to you.
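Here is a toy sketch of that two-step logic in Python. The product names and purchase sets are made up, and “shared buyers relative to overall buyers” is formalized here as Jaccard similarity; production systems such as Amazon’s use related but more elaborate measures (e.g. cosine similarity over purchase vectors), so treat this purely as an illustration.

```python
# Toy item-to-item collaborative filtering.
# Hypothetical purchase history: product -> set of user ids who bought it.
buyers = {
    "espresso_maker": {"ann", "bob", "carol", "dan"},
    "coffee_grinder": {"ann", "bob", "carol"},
    "milk_frother":   {"bob", "dan"},
    "yoga_mat":       {"eve", "frank"},
}

def similarity(item_a, item_b):
    """Jaccard similarity: shared buyers divided by all buyers of either item."""
    a, b = buyers[item_a], buyers[item_b]
    return len(a & b) / len(a | b)

def recommend(focal_item, top_n=2):
    """Step 2: rank every other item by similarity to the focal item."""
    others = (item for item in buyers if item != focal_item)
    ranked = sorted(others, key=lambda item: similarity(focal_item, item), reverse=True)
    return ranked[:top_n]

print(recommend("espresso_maker"))   # ['coffee_grinder', 'milk_frother']
```

Running `recommend("espresso_maker")` puts the coffee grinder first, since three of the espresso maker’s four buyers also bought it, while the yoga mat, with no shared buyers, is never recommended.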

Peer-reviewed academic research has found that recommender systems positively affect sales and prices, and that their impact is stronger than that of reviews. Displaying co-purchase relationships has also been found to increase the influence of complementary products on product demand.