Advisor

Big Data -- Does Size Matter?

Posted March 11, 2013 | Leadership | Technology

While the term "Big Data" continues to encroach on the common vernacular, it's important to understand the drivers of the phenomenon, its possible value, and also its potential for misuse. Many of the "misuse" examples come from areas of privacy and ethics, but there is also the potential for misuse that is just bad business.

One tantalizing aspect of the Big Data wave (and it surely is a wave, just as 4GLs, artificial intelligence, mainframes, and personal computers were waves; as opposed to the Internet, a true disruption) is the possibility for insight stemming primarily from the enormous amount of data we can now collect. That hope brings to mind the old story about the young child digging in a large pile of horse manure, because "there's got to be a pony in here somewhere." How much of what we are rushing (and spending lots of money) to build is just adding to the pile?

I'm also reminded of the classic "beer and diapers" story from the early(er) days of data analytics. The story goes that a convenience store noticed that beer and diapers frequently appeared in shopping carts together, leading to a decision to put stocks of diapers near beer coolers, which drove increased beer sales.

While heartwarming to BI marketers, this story is a myth. Some who retell it point out that one possible explanation is that fathers sent out on an errand to buy diapers might treat themselves to beer as a reward. The phenomenon could be real, but it would still yield a correlation that is mathematically genuine yet useless, or even harmful, as a basis for business planning.

Nassim Taleb jumps on this problem in a recent article in Wired:

Modernity provides too many variables, but too little data per variable. So the spurious relationships grow much, much faster than real information.

In other words, Big Data may mean more information, but it also means more false information.
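
To make Taleb's point concrete, here is a minimal simulation (my own sketch, not from his article): test every pair among many pure-noise variables measured on only a few observations, and "significant" correlations appear by chance alone, growing with the number of variable pairs. All names and numbers below are illustrative.

    # Sketch: spurious correlations among pure-noise variables.
    # Assumes NumPy and SciPy are available; numbers are illustrative only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def spurious_correlations(n_vars, n_obs, alpha=0.01):
        # Count pairs of independent noise variables that look "significant."
        data = rng.standard_normal((n_obs, n_vars))
        hits = 0
        for i in range(n_vars):
            for j in range(i + 1, n_vars):
                _, p = stats.pearsonr(data[:, i], data[:, j])
                if p < alpha:
                    hits += 1
        return hits

    for n_vars in (10, 50, 200):
        print(n_vars, "variables ->", spurious_correlations(n_vars, n_obs=30), "spurious 'findings'")

More variables mean more false positives, even though the amount of data per variable has not changed at all.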

Traps to avoid in the analysis of any data include cherry-picking just the data that supports one's case and drawing conclusions without disproving the null hypothesis. Both presume that there is a hypothesis to test in the first place. A larger problem with the current marketing of Big Data is the implication that, just by collecting the data, we will be able to see and act upon emergent properties not visible in smaller sets.

Putting aside Taleb's argument that "in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal)," we are still faced with the sheer size of the data we are collecting. Mike Hoffman writes in DefenseTech:

The military has a problem with "big data" -- the problem being that it collects too much of it. The infatuation with unmanned vehicles and the sensors mounted onto them has spurred a wave of data collected on the battlefield....

[Deputy Assistant Secretary of Defense for Research Reggie Brothers] used the ARGUS-IS as an example of the major advances being made in the world of intelligence sensors. The ARGUS-IS can stream up to a million terabytes of data and record 5,000 hours of high definition footage per day. It can do this with the 1.8 gigapixel camera and 368 different sensors all housed in the ARGUS-IS sensor that can fly on an MQ-9 Reaper.

However, the analysis of the data collected by those sensors can't keep up.

The military response to this problem seems not to be "stop collecting data" or "be smarter about the data we collect," but to increase its search for "data scientists" who can formulate the search strategies and hypotheses needed to extract value, while avoiding the problems above, in an ever-expanding pile of data. Because there's got to be a pony in there somewhere.

There are at least two issues on the table:

  1. Should I collect data, just because I can?
  2. How do I find value in the data I collect?

Prior to any discussion of the first issue, I recommend viewing several episodes of the TV show Hoarders and then asking yourself whether what you are doing will end in a house overrun with junk that could ultimately become a health hazard. If you are collecting data without any sunset policies, you are well on your way. As to the second item, smart people are already seeing through some of the breathless claims that patterns will emerge from large data sets without additional effort (the current analog of "lose weight while sleeping" or "write programs without requirements").

In a piece for the Harvard Business Review ("Why Data Will Never Replace Thinking"), Justin Fox reminds us that hypotheses are always present when looking at data, but sometimes they are implicit or subconscious. To not make them explicit is to abandon the scientific method -- it's not research, it's shopping:

We humans are quite capable of coming up with stories to explain just about anything after the fact. It's only by trying to come up with our stories beforehand, then testing them, that we can reliably learn the lessons of our experiences -- and our data.

Lastly, any discussion of Big Data would be remiss without noting that one of the most successful predictions in recent memory came from a decidedly small data set. Nate Silver's work on the 2012 US elections was based on aggregations of multiple polls, each with a sample size in the low thousands. We're not talking petabytes here. What made Silver's prediction astonishingly accurate was his process, his hard work, and his ability to draw correct conclusions from the data.
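
For a sense of how little data was involved, here is a toy poll-aggregation sketch (my own illustration with made-up numbers, not Silver's actual model): weighting each poll by its sample size gives a pooled estimate, and the combined sample is still only a few thousand respondents.

    # Toy poll aggregation -- hypothetical polls, not real 2012 data.
    import math

    polls = [  # (candidate's share, sample size)
        (0.52, 1200),
        (0.49, 800),
        (0.51, 1500),
        (0.53, 1000),
    ]

    total_n = sum(n for _, n in polls)
    pooled = sum(share * n for share, n in polls) / total_n
    moe = 1.96 * math.sqrt(pooled * (1 - pooled) / total_n)  # rough 95% margin of error

    print(f"pooled estimate: {pooled:.3f} +/- {moe:.3f} from {total_n} respondents")

The value comes from the weighting, the model, and the interpretation, not from the volume of data.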

Silver's results prove to me that size doesn't always matter. But sometimes it does. We'll explore that topic in Part II.

I welcome your comments about this Advisor and encourage you to send your insights to me at comments@cutter.com.

-- Lou Mazzucchelli, Fellow, Cutter Business Technology Council

About The Author
Lou Mazzucchelli
Lou Mazzucchelli is a Fellow of Cutter Consortium and a member of Arthur D. Little's AMP open consulting network. He provides advisory services to technology and media companies. He will lend his broad expertise (and considerable wit) to Cutter Summit 2022 as its Moderator. Recently, Mr. Mazzucchelli was the coordinator of Bryant University’s Entrepreneurship Program, where he retooled and taught senior-level entrepreneurship courses.