“Big data” and “analytics” are among the most overhyped and abused terms in today’s IT lexicon. Despite widespread use for almost a decade, their precise meanings remain mysterious and fluid. It is beyond doubt that the volume of data being generated and gathered has been growing exponentially and will continue to do so, intuitively validating the big moniker. However, other vital characteristics of today’s data, such as structure, transience, and — most disturbingly — meaning and value, remain highly ambiguous. Analytics also remains troublingly vague, as it is prefixed with adjectives ranging from operational to predictive.
In such circumstances, defining what success in big data analytics (BDA) might mean is problematic. Describing how it could be cultivated would seem especially challenging. Nonetheless, that is what this issue of Cutter IT Journal seeks to do, beginning with the premise that, irrespective of definitional difficulties, success in implementing BDA is predicated on addressing four specific aspects of the overall process:
- People. Data scientists have been called unicorns due to their elusive nature. Finding them externally or growing them internally and defining mandatory skills, roles, and responsibilities are among the challenges organizations face.
- Preparation. Sourcing, cleansing, and contextualizing incoming data before analysis can make the difference between valid discoveries and rabid nonsense. With such preparation reportedly taking 80% of the time and effort of data scientists, new approaches are needed to streamline the process.
- Prediction. As the industry progresses from business intelligence and hand-crafted analysis to self-improving deep learning algorithms, issues of understanding, control, and trustworthiness will need to be addressed.
- Production. Procedures for effective and seamless transition from discoveries in the analytic laboratory to action in the production environments of manufacturing, operations, and sales will be vital.
These four aspects neither stand alone as all that you must address, nor are they unique to big data analytics. You will need to take care of all the other, usual pieces of process and project management involved in any significant IT undertaking. However, one or (usually) more of these four aspects present particular difficulties in most BDA implementations.
Teaming Up the Right People
BDA has a long history in the open source software scene, particularly focused around Apache Hadoop and its accompanying menagerie, and an association with the Web behemoths of big data, such as Google, Facebook, and Amazon. It was within this environment that the job title of “data scientist” was popularized and from here that it has become the allegedly sexiest job in the world. Many people now want to be one, and many companies want one or more of them. Recruiters declare huge shortages, and authors have created lists of diverse characteristics for the role, ranging from degrees in statistics to Perl programming skills, and from experience in presenting to and influencing management to deep knowledge of business data and processing.
Such guidance is undoubtedly valuable but may be offset by advice from vendors and consultants to avoid “polluting” your future-oriented data scientist team with “BInosaurs” — staff who have grown up in the traditional business intelligence (BI) environment. This is unfortunate. Many companies that are transitioning from a purely physical business environment to the digitized world face the challenge of crafting a functioning team that can extract the undisputed value that exists in big data while also creating and maintaining a data management environment where data quality and governance are critical. Engaging the right people is important, but creating an integrated and empowered team is the first and most vital step toward success in BDA.
Preparing the Ground
In data warehousing, this step used to go under the label of ETL — extract, transform, and load — and it was renowned as the most challenging aspect of building a BI system. Data in existing operational sources was often not how it was described in the specs. Even the specs were often not what they seemed. Data that should match across sources didn’t. The list goes on, but the result was that preparing data for consumption in the BI system was the part of the project that consumed the most resources and could be almost guaranteed to overrun.
Fast-forward to big data analytics, and it appears that history is doomed to repeat itself. Despite the fact that the majority of big data is sourced externally to the enterprise, coming from notoriously uncertified social media sources and highly unreliable Internet of Things (IoT) sensors, data scientists express ongoing incredulity that data preparation (now sexily called “wrangling”) takes such time and effort. According to Monica Rogati, VP for data science at Jawbone, “Data wrangling is a huge — and surprisingly so — part of the job. At times, it feels like everything we do.”
Since then, a number of vendors have introduced or expanded offerings that address preparation, cleansing, wrangling, and quality of big data in the Hadoop environment. This is, of course, welcome. However, it remains a technical, product-level solution to a broader problem.
Successful BDA requires a fundamental rethinking of the process of data preparation. This begins with a policy decision that only data of known and agreed levels of quality can be introduced into particular analyses. Some initial analyses may use “dirtier” data, of course, but as the process progresses toward actual decision making, more stringent requirements on meaning, structure, and completeness may be mandated. Modeling of data, both in advance of ingestion and on an ongoing, in-stream basis, must become the norm. Rules that limit mixing or combining data of different levels of quality will be required and must be enforced. For example, at the most obvious level, combining data from social media sources with regulated financial data would be disallowed. “Health warnings” should be attached to input and output data sets, as well as analytical reports, clearly stating the business process or circumstances under which these sets may or may not be used.
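One way to picture such a policy is as a small rule check applied whenever two data sets are combined. The sketch below is illustrative only: the `Quality` tiers, the `DataSet` record, the `combine` function, and the `max_gap` threshold are all hypothetical names and values chosen for this example, not part of any product or standard. It assumes that each data set carries an agreed quality level and a "health warning," that sets too far apart in quality may not be mixed, and that a combined set inherits the lower of the two quality levels.

```python
from dataclasses import dataclass
from enum import IntEnum


class Quality(IntEnum):
    """Hypothetical quality tiers; a higher value means more trusted data."""
    RAW = 1        # e.g., unvetted social media or sensor feeds
    CLEANSED = 2   # cleansed and modeled, but not certified
    REGULATED = 3  # e.g., audited financial data


@dataclass(frozen=True)
class DataSet:
    name: str
    quality: Quality
    health_warning: str  # circumstances under which the set may be used


def combine(a: DataSet, b: DataSet, max_gap: int = 1) -> DataSet:
    """Combine two sets only if their quality tiers are at most max_gap apart.

    The result is labeled with the lower of the two tiers and carries
    a health warning that records where it came from.
    """
    if abs(a.quality - b.quality) > max_gap:
        raise ValueError(
            f"Policy violation: cannot mix {a.quality.name} data "
            f"({a.name}) with {b.quality.name} data ({b.name})"
        )
    lowest = min(a.quality, b.quality)
    return DataSet(
        name=f"{a.name}+{b.name}",
        quality=lowest,
        health_warning=(
            f"Derived from {a.name} and {b.name}; "
            f"treat as {lowest.name} quality."
        ),
    )
```

Under these assumptions, combining a `RAW` social media feed with a `REGULATED` ledger raises an error, while combining it with a `CLEANSED` set succeeds and yields a set labeled `RAW`. In practice, the tiers and the permitted combinations would be a business policy decision, not a hard-coded constant.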
Predicting the Future Is Hard
Of all the adjectives attached to analytics in the market, it is predictive that gains most attention. This is because the actual business value of collecting and analyzing data is closely related to — and arguably depends on — the ability of business to forecast future events, outcomes, or behaviors. As a result, the term “analytics” has largely displaced “business intelligence” in the market over the past decade.
In common usage, the meaning of [choose an adjective] analytics ranges from basic data query and reporting, through statistical analysis, to the application of advanced AI techniques to decision-making support. One useful approach to clearing some of the confusion is to look at the purpose and time frame of the analytic activity. This leads to five classes of analytics:
- Descriptive — focuses on the past to describe what happened at a detailed and/or summary level, corresponding to traditional BI (query and reporting) and data mining (statistics)
- Operational — focuses on the present moment, often down to subsecond intervals, and seeks to know what is currently happening in great detail in real time
- Diagnostic — spans the past and present time frames to understand why the things discovered in descriptive and operational analytics actually occurred, to deduce causation
- Predictive — focuses on the future in an effort to forecast what may happen with some level of statistical probability
- Prescriptive — takes input from the previous four types of analytics and attempts to influence future behaviors and events, using optimization and simulation techniques, for example
Different authors use different subsets of the above list and use their lists for different purposes. For example, these categories can be used to evaluate tool and product capabilities when comparing vendor solutions. Alternatively, one can describe an organization’s maturity in decision-making support by observing its capabilities in these types of analytics, understanding that the classes build one upon the other in the order listed above (i.e., descriptive is the most basic and prescriptive the most advanced).
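The maturity reading of this list can be made concrete with a short sketch. The code below is an illustration under stated assumptions, not an established assessment method: the `AnalyticsClass` enumeration and the `maturity` function are hypothetical names, and the rule that a class counts only when every class below it is also in place is one strict interpretation of "build one upon the other."

```python
from enum import IntEnum
from typing import Iterable, Optional


class AnalyticsClass(IntEnum):
    """The five classes, ordered from most basic to most advanced."""
    DESCRIPTIVE = 1
    OPERATIONAL = 2
    DIAGNOSTIC = 3
    PREDICTIVE = 4
    PRESCRIPTIVE = 5


def maturity(capabilities: Iterable[AnalyticsClass]) -> Optional[AnalyticsClass]:
    """Return the highest class reached without skipping a lower one.

    Returns None when even descriptive analytics is missing.
    """
    present = set(capabilities)
    reached = None
    for cls in AnalyticsClass:  # iterates in definition order
        if cls not in present:
            break
        reached = cls
    return reached
```

By this reading, an organization with descriptive, operational, and predictive capabilities but no diagnostic capability would be rated at the operational level, because its predictive work lacks part of the foundation it should rest on.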
It is useful to note the overlap and interdependence of the categories listed above. Success in predicting behavior or outcomes is built upon a strong foundation of descriptive work, both BI and data mining, as well as — in many cases — serious operational, real-time analytics. Beneath both these aspects lies a firm foundation of data management and quality work, of which building and maintaining an enterprise data warehouse (EDW) infrastructure is most important.
The role of the EDW in prediction and, indeed, prescription, is to create and manage core business information, the legally binding record of the state and history of the business. The task of forecasting the future can be eased only if this core information is of high quality. Only then can the business have confidence that the results obtained have a high probability of being valid.
The challenge now emerging — and doing so rapidly — is to understand and address the implications of models and algorithms that are capable of self-improvement. Rapid advances in a range of overlapping fields such as deep learning, AI, and cognitive computing are being incorporated in predictive and prescriptive analytics. Whether implemented through automation (replacement of human decision makers), augmentation (collaboration between humans and machines according to their respective strengths), or, most likely, a combination of both, new processes and models are urgently required, and new legislative and ethical frameworks will need to be devised.
Production Is the End Point
The popularity of data scientists and the creation of “analytics labs” in larger organizations have led to a popular image of white-coated researchers pursuing lofty searches for truth in the data lakes of the business world. Unfortunately, this image misleads. Data science falls more correctly under the category of applied R&D rather than that of pure, fundamental research. As with all applied R&D, the aim of data science is not to discover new truths for their own sake, but rather to take discoveries from the lab into day-to-day business situations to improve the bottom line. This transition from exploration to production is usually messy. However, doing it well — in terms of speed, quality, and ease — is the final guarantor of success in BDA.
In traditional, physical industries, R&D and production are very distinct and separate activities, carried out in different places, using different materials and tools, and so on. There exists between them a well-defined boundary with well-formulated procedures and rules for moving from one side to the other. In the case of analytics, this boundary is far from obvious, for both historical and practical reasons.
Historically, IT has created virtually all of the data used by the business; such process-oriented data is essentially a byproduct of automating business processes via computers. In this situation, R&D on such data is largely meaningless. The focus of IT has thus first been on creating and managing the data itself (via operational systems) and, second, making it available for management reporting and problem solving (via informational or BI systems). In both cases, quality and consistency are key. In essence, IT has had no historical reason to differentiate between R&D and production.
The current drive toward digitization of business changes the situation dramatically, with an influx of external data swamping traditional process-oriented data in both physical volume and business attention. Such data, from social media and the IoT, differs significantly from process-mediated data in terms of structure, quality, lifespan, and more. It provides ample opportunities for R&D (analytics), but this is best performed in conjunction with the core business information contained in process-mediated data. This need to blend the two types of data in analytics blurs the boundary between R&D and production in practice.
Formalizing and strengthening this boundary is vital for success in BDA. While technology does have a role to play (through metadata management and data quality tools, for example), this is far from the simplistic formula of “relational databases for production and Hadoop for analytics.” New conceptual and logical architectural frameworks that address both modern real-time business needs and today’s disparate data types are necessary. In addition, formal methodologies and real-life implementations are emerging that show how this can be achieved.
In This Issue
This issue of Cutter IT Journal offers a variety of perspectives on what is required to ensure success in big data analytics, covering topics from methodologies to architectures, as well as a dive into one of the key technologies of the field. Our authors provide five distinctly different views on where you should direct your attention over the coming years of evolving BDA practice.
We open with a thought-provoking article by Steve Bell and Karen Whitley Bell, who use the “lessons learned in over five decades of Lean Thinking” to consider how to get the most value from BDA. Starting from a consideration of the adaptive learning organization, they take us to a Six P Model that relates purpose, process, and people to planning, performance, and problems, concluding that “paradoxically, to achieve desired outcomes, managers must pay more attention to the process and less to those outcomes.”
In our second article, Cutter Senior Consultant Bhuvan Unhelkar combines academic and hands-on experience to offer the Big Data Framework for Agile Business as an architectural foundation for BDA. Arguing that technology must be balanced with a deep appreciation of business drivers and realities, Unhelkar’s framework includes “agile values for business, organizational roles in big data, building blocks of big data strategies for business (including the role of analytics within those strategies), key artifacts in big data adoption, business conditions and limitations, agile practices, and a compendium (repository)” as the basis for successful implementation.
Our third offering, from Jeff Carr, dives into a fascinating exploration of the importance of semistructured data and NoSQL technology in support of BDA. He defines a generic data model for NoSQL and describes eight fundamental capabilities a NoSQL analytics system must have to derive analytic value from arbitrary semistructured data. These attributes can form the basis for evaluating the ability of any tool or system to perform generalized NoSQL analytics.
Next, Donald Wynn and Renée Pratt take us back to the organizational challenges to maximizing value in BDA from the viewpoint of innovation. These challenges revolve around managing the implementation of processes, data management, and staffing. They note the resemblance of driving BDA innovation to traditional business process management, describing an iterative process that begins where the previous evaluation phase concluded, which “naturally leads to the contemplation of desired future changes.” Of particular interest is their analysis of when top-down and bottom-up approaches to BDA implementation are most appropriate in organizations.
Our fifth and final article, from Mohan Babu K, examines the implementation of big data analytics in an industry seldom associated with IT: agriculture. While most industries have arrived at BDA from a history of business intelligence, agriculture offers a greenfield (pun intended) scenario as IoT sensors provide a completely new foundation for augmenting human decision making among people for whom “analysis of data is not their core competence.” Babu K describes a framework for analytics in agriculture that will be familiar to practitioners across all industries, once again demonstrating that analytics applies everywhere.
Despite the IT industry’s many years of talking about and implementing big data analytics, the articles in this issue of Cutter IT Journal serve to emphasize that this field is still undergoing significant evolution and that there remain widely varying ways of planning and implementing BDA. In some sense, we have only scratched the surface of the field, although we have dug deep in particular areas. I trust you will enjoy reading the articles here and believe you will find some nuggets of inspiration that will help you drive success in your own projects.