10 | 2012

"With world Internet usage quintupling per decade, there is no upper limit on the number and value of new business opportunities for those who can bend the swelling flood of data to their purposes."

-- Ralph Hughes, Guest Editor

Opening Statement

Interest in Big Data analytics (BDA) has skyrocketed in the past few years to reach a fever pitch, with the market for this technology projected to grow at a 58% compound annual rate over the next five years.1 Indeed, when I walked the vendor exhibit halls at several TDWI World Conferences during the past year, it seemed that nearly all the application vendors had introduced a new package offering a "Big Data" solution. At every booth, plenty of curious attendees lined up to hear about these new features. The vendors were certainly happy for the attention, but they also confided to me that they had grown tired of answering the same question day after day, namely "What is Big Data?"

I believe this lament is actually more emblematic of the state of BDA today than any particular solution being offered. When vendors rush to cater to needs that many customers do not yet understand, are we at risk of solving the wrong problem or cementing in place a basic strategy we will later regret? Perhaps at this early juncture we should carefully dodge the hype about Big Data and offer a sober appraisal of this new technology before acting.

LOOKING PAST THE HYPE

Industry pundits, in the area of data warehousing at least, take a jaundiced view of the buzz surrounding Big Data. "When haven't business intelligence applications had to deal with 'Big Data'?" they ask. Any type of data requires deliberate engineering to acquire, store, summarize, and present it in a way that generates business insights. The cynics among us discount the fever over Big Data as a vendor-stoked overreaction to a few white papers by computer science wonks at Google and Yahoo! who found a couple of processing shortcuts while taming their own flood of Web stream data. These cynics see Big Data as a craze that will quickly fade.

Such skepticism might be too extreme, however. New technologies do frequently follow quick lifecycles, but several considerations suggest that Big Data represents a sea change for enterprise information. With the cost of processing and data storage falling so rapidly each year, our society no longer seems constrained as to the amount of information it can create and retain. Today's burgeoning numbers of online users now leave a trail of "digital exhaust" as they cruise social networking sites; e-commerce continues to grow at 35% per year; and RFID tags are steadily appearing on wholesalers' pallets and manufacturers' products. We are entering the "Internet of Things," in which phones, cars, trains, and planes -- plus process controllers, appliances, and medical devices -- all transmit a steady stream of data for interested parties to mine. Even dairy cows now sport portable monitors announcing when they come into heat.2 The data our society generates in a single year recently surpassed a zettabyte (a trillion gigabytes), which is a hundred million times more information than is contained in the print collection of the US Library of Congress -- and this onslaught is doubling every two years.3
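The growth figures above are easier to weigh once they are converted into common units. The short Python sketch below does only that arithmetic, using the numbers quoted in this article and the decimal (SI) definitions of the byte prefixes; the variable and function names are illustrative, and the projection simply assumes that the doubling rate holds steady, which is an assumption rather than a forecast.

```python
ZETTABYTE = 10**21  # bytes, decimal (SI) definition
GIGABYTE = 10**9    # bytes, decimal (SI) definition

# How many gigabytes make up one zettabyte.
gigabytes_per_zettabyte = ZETTABYTE // GIGABYTE

# "A hundred million times" a collection implies a collection of this size.
implied_collection_bytes = ZETTABYTE // 10**8

# Doubling every two years, restated as an equivalent annual growth rate.
annual_growth = 2 ** (1 / 2) - 1


def projected_zettabytes(years: float, start: float = 1.0,
                         doubling_period: float = 2.0) -> float:
    """Volume after `years`, assuming steady doubling every `doubling_period` years."""
    return start * 2 ** (years / doubling_period)


print(f"Gigabytes per zettabyte: {gigabytes_per_zettabyte:.0e}")
print(f"Implied print collection: {implied_collection_bytes / 10**12:.0f} TB")
print(f"Equivalent annual growth: {annual_growth:.1%}")
print(f"Zettabytes per year after a decade: {projected_zettabytes(10):.0f}")
```

Read this way, "doubling every two years" is roughly 41% growth per year, and a decade of it would multiply the annual volume by 32.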

Naturally, people worry about how much of this data they should capture, manage, and analyze. We frequently read about creative entrepreneurs discovering riches hidden in this information. For example, companies can now measure customer sentiment toward their products by mining the comments, ratings, and even images shared on the Web. They can correlate these sentiment statistics with purchase records provided by loyalty programs at grocers and retail stores, empowering marketers to customize advertising campaigns for individual consumers. As we move between websites today, we encounter a sequence of offers that are so subtle they go unnoticed but are so aligned with our individual preferences and behavioral triggers that we are almost certain to buy. With world Internet usage quintupling per decade,4 there is no upper limit on the number and value of new business opportunities for those who can bend the swelling flood of data to their purposes. In this context, the frenzied interest in Big Data makes sense because the power of such analytics has been proven, and rational companies should be actively seeking to profit from it.

MAPREDUCE IS NOT OUR SILVER BULLET

Unfortunately, the best method of channeling this informational deluge is far from clear, because the term "Big Data" has not yet been well defined. Big Data analytics is frequently described as the management of information volumes much larger than our ordinary data management tools can handle. Pundits usually refer to Doug Laney's "3Vs"5 -- volume, velocity, and variety -- which will be explored in the articles in this issue. Yet the 3Vs are only a description of the problem, one that leaves most of us searching for an industry standard approach proven to overcome the challenge. Such a search does not uncover a single direction, however, but instead a myriad of competing strategies. Despite the fact that experts have been discussing Big Data for over 10 years now, the field is still very new, and for all the urgency we feel, no silver bullet yet exists.

The most commonly cited solution for BDA involves a technology pioneered by the large Internet search engines, called "MapReduce" (MR). So frequently do Big Data conversations gravitate to MR that Hadoop, the open source implementation of MapReduce, is now a standard component of most mainstream databases.6, 7

Yet MapReduce is not a universal solution to all Big Data problems, for several reasons. First, it solves only problems that can be formulated in terms of key-value pairs. This approach is capable of some powerful insights, but it has a distinct sweet spot that generally requires the input data to be already assembled into a flat file. Second, as anyone who has tried to join multiple tables using MR (or even wrestle it into printing "Hello, world!") can tell you, MR is not a general solution to many common data management challenges. Third, the interface to MR data stores is fairly primitive in comparison to the standard DBMSs -- a team must know Java well. Fourth, attempts to provide an SQL-like querying tool for MR still lack many ANSI SQL-92 commands and other common SQL extensions. Fifth, solid MR programmers are difficult to find, so the added cost and risk of building MR applications can far exceed the investment required by the many alternatives.
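To make the first point concrete, here is a minimal sketch of what "formulated in terms of key-value pairs" means in practice. It is plain, single-process Python rather than Hadoop's actual Java API, and the function names (map_words, reduce_counts, run_mapreduce) are hypothetical; the aim is only to show the map, shuffle, and reduce steps on the classic word-count problem.

```python
from collections import defaultdict

# Hypothetical helper names; this is an in-memory illustration, not the Hadoop API.


def map_words(line):
    """Map step: emit one (key, value) pair per word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce step: collapse all values collected for one key into a total."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    # Shuffle: gather every emitted value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: one call per distinct key.
    return dict(reducer(key, values) for key, values in groups.items())


lines = ["big data is big", "data is data"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

Word counting fits the model because each record can be processed on its own and the results combine by key. A join across several tables has no such natural shape: every table must first be re-emitted under a shared key so that matching rows land in the same reduce call, which is where much of the awkwardness described above comes from.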

Because MapReduce is not the only solution available for high data volume, velocity, and variety, a solid Big Data strategy should look at other technologies as well. There are many more columnar databases available today than there are MR implementations. Many of these columnar DBMSs are embedded in data warehouse appliances that allow our existing business intelligence (BI) applications to handle very large volumes of data using a standard SQL interface. Furthermore, many columnar databases are more mature than MR, allowing Big Data applications to be designed and built by developers with more typical skills. For organizations willing to consider newer offerings by smaller vendors, there are also the numerous types of Big Data solutions found in the NoSQL ("Not Only SQL") universe, such as key-value pair databases that do not require MR programming; graph databases that use "triples" rather than key-value pairs; and in-memory relational databases that settle for "eventual consistency" in the interest of very fast read-write operations. These products, too, often look more like our traditional tools, making them easier to work with, and several of them can tackle analytical questions that MR cannot begin to address.
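The difference between a key-value pair and a triple can be sketched in a few lines. The example below is a toy illustration in plain Python, not the interface of any particular NoSQL or RDF product; the customer data and the match helper are invented for the purpose. It shows how facts stored as (subject, predicate, object) statements can be queried by pattern and chained across relationships.

```python
# Key-value view: one opaque value per key, retrieved by exact key.
kv_store = {"customer:42": {"name": "Ada", "city": "Zurich"}}

# Triple view: every fact is a (subject, predicate, object) statement.
# All data here is invented for illustration.
triples = {
    ("ada", "bought", "espresso_machine"),
    ("ada", "lives_in", "zurich"),
    ("bob", "bought", "espresso_machine"),
    ("bob", "lives_in", "geneva"),
}


def match(pattern, facts):
    """Return the facts that fit a pattern; None in the pattern is a wildcard."""
    return [
        fact for fact in facts
        if all(want is None or want == have for want, have in zip(pattern, fact))
    ]


# Who bought an espresso machine?
buyers = sorted(s for s, _, _ in match((None, "bought", "espresso_machine"), triples))

# Follow a second relationship: where do those buyers live?
cities = sorted(o for b in buyers for _, _, o in match((b, "lives_in", None), triples))

print(buyers)
print(cities)
```

In the key-value store, answering the same question would require knowing in advance which keys to fetch; the triple form lets a query start from any position in the statement and hop from one relationship to the next.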

REPLACING FEAR WITH DISCIPLINE

Given the limits of MapReduce and the presence of many alternative solutions, it is odd that so many conversations about Big Data turn instantly to Hadoop. This knee-jerk reaction is driven mostly by fear. Both business and IT executives feel threatened by the accelerating flood of data coming from a proliferating number of sources. They worry that they should be doing something creative and profitable with it today, before competitors blindside them with new capabilities. They naturally want to start storing everything now, even if they cannot articulate the value of this information, and they hope against all odds that grabbing hold of this information is going to be quick and easy. Indeed, Forbes8 notes that Big Data today is ill-defined, intimidating, and immediate (i.e., demanding action now) -- all of which adds up to a set of "3 Is" that may be more important to consider than the 3Vs.

A more sober view of the situation might suggest that data streams in the exabytes are only another chapter in data management, just as terabytes and petabytes challenged us in previous decades. We must remind ourselves that new technologies frequently get overhyped by the media and vendors, and that our search for a silver bullet often leads to profound disappointment. We will need time and discipline to see what Big Data can realistically offer. A disciplined approach should begin with compelling use cases that express clearly attainable business impacts. Only by articulating realistic objectives can we rationally choose a technical solution from the several competing Big Data technologies. Moreover, any Big Data solutions must integrate into our existing strategies for "not-so-Big Data," so that the information flood from the coming "Internet of everything" calmly fills our carefully architected BI ecosystems with usable data rather than washing them away.

IN THIS ISSUE

The articles selected for this issue of Cutter IT Journal provide a handy opportunity to conduct that sober evaluation of Big Data technology. The discussion first provides a solid introduction to the world of BDA and then explores a set of important extensions of the technology. Richard Walsh, Richard O'Callaghan, and Sabine Yoffou start off our collection by systematically defining Big Data so that we can begin successfully planning a serious implementation effort. Next, IBM's Matthew Ganis and Avinash Kohirkar examine one of the most common uses of BDA, namely mining social media discussions. Rich Johnson and Ron Zahavi of Microsoft then address the essential topic of incorporating this new style of analytics into our traditional data warehousing programs, so that we end up with well-integrated BI platforms.

The theme of extending Big Data technology begins with Frank Coyle, who discusses one of the primary competitors to MapReduce -- the RDF triple, which will someday soon enable the Semantic Web. Holly Korda, Ann Magee, and Lori Damiano then explore Big Data's potential in a specific industry, showing how it can be leveraged to bring transparency and accountability to the world of healthcare. Finally, Saeed Lajami, Anson Mok, Mario Wahyu Prabowo, and Cutter Senior Consultant Sara Cullen provide an interesting alternative for our solutions toolkit by advocating the use of crowdsourcing to solve Big Data challenges.

Together these articles introduce insights of breadth and depth into the new and quickly evolving world of BDA. We hope they will help you begin to explore and understand how this technology can solve what will be some of IT's most pressing challenges for the foreseeable future.

ENDNOTES

1 Kelly, Jeff. "Big Data Market Size and Vendor Revenues." Wikibon, 16 October 2012 (http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues).

2 Tagliabue, John. "Swiss Cows Send Texts to Announce They're in Heat." The New York Times, 1 October 2012.

3 Gantz, John F., and David Reinsel. "The 2011 Digital Universe Study: Extracting Value from Chaos." IDC, June 2011 (www.emc.com/digital_universe).

4 Internet World Stats (www.internetworldstats.com/stats.htm).

5 Laney, Doug. "3D Data Management: Controlling Data Volume, Velocity, and Variety." Meta Group, 2001.

6 Groenfeldt, Tom. "Microsoft Does Big Data -- Hadoop on Windows." Forbes, 5 June 2012.

7 Dijcks, Jean-Pierre. "Oracle: Big Data for the Enterprise" (PDF). Oracle, October 2011 (www.oracle.com/technetwork/database/nosqldb/learnmore/wp-big-data-with-oracle-521209.pdf).

8 Feinleib, Dave. "The 3 I's of Big Data." Forbes, 9 July 2012.
