In my last blog I tried to define the concept of insight. In this post I discuss insight generation.
Insights are generated by systematically and exhaustively examining a) the output of various analytic models (including predictive, benchmarking, and outlier-detection models) created from a body of data, b) the content and structure of the models themselves that give rise to the set of actions, i.e., the plan, associated with the insight, and c) the decisioning process itself. Insight generation is a process that follows model generation. It is separate from the decisioning process, during which a set of insights is applied to a new situation described through data.
The generation of insights depends on our ability to a) collect, organize and retain data, b) generate a variety of analytic models from that data, c) examine the generated models, and d) understand the goals and constraints of the decisioning process. Therefore, in order to generate insights, we must have the ability to generate models from collected data, from data derived from the collected data, and from the metadata of the collected data. This means that we need to be thinking not only about the data collection, management and archiving processes, but also about how to post-process the collected data: what attributes to derive and what metadata to generate.
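To make the distinction between collected data, derived data and metadata concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the record layout and the field names (session_start, session_end, pages_viewed) are hypothetical and do not come from any particular system.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def derive_attributes(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a collected record with derived attributes added.

    The field names used here are hypothetical examples.
    """
    start = datetime.fromisoformat(record["session_start"])
    end = datetime.fromisoformat(record["session_end"])
    minutes = (end - start).total_seconds() / 60

    derived = dict(record)
    derived["session_minutes"] = minutes
    # Avoid dividing by zero for zero-length sessions.
    rate: Optional[float] = record["pages_viewed"] / minutes if minutes > 0 else None
    derived["pages_per_minute"] = rate
    return derived


def describe_field(records: List[Dict[str, Any]], field: str) -> Dict[str, Any]:
    """Generate simple metadata about one field across all records."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    return {
        "field": field,
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }


collected = [
    {"session_start": "2024-01-01T10:00:00",
     "session_end": "2024-01-01T10:30:00",
     "pages_viewed": 15},
    {"session_start": "2024-01-01T11:00:00",
     "session_end": "2024-01-01T11:00:00",
     "pages_viewed": 1},
]

derived_records = [derive_attributes(r) for r in collected]
metadata = describe_field(derived_records, "pages_per_minute")
```

Models can then be built from any of the three layers: the collected records, the derived attributes, or the metadata describing them.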
The role of data
Data may be collected in a variety of ways. Sometimes data is collected from an instrumented environment, for example the data collected from the sensors of autonomous vehicles. Other times data is collected by conducting reproducible experiments or simulations (synthetic data). As an example, consider the data collected by companies like Waymo by simulating the performance of their autonomous vehicle fleets under a variety of conditions. In some situations there may be only one shot at collecting a particular data set of interest. For example, consider the data collected by the Voyager spacecraft during their various flybys. Regardless, insight generation is highly dependent on how an environment, real or synthetic, is “instrumented.” For example, consumer marketers have gone from measuring a few attributes per consumer (think of the early consumer panels run by companies such as Nielsen) to measuring thousands of attributes, including consumer web behavior and consumer interactions in social networks. The “right” instrumentation is not always immediately obvious, i.e., it is not obvious which of the data that can be captured needs to be captured. Oftentimes, it may not even be immediately possible to capture particular types of data. For example, it took some time between the advent of the web and our ability to capture browsing activity through cookies. Knowing how to instrument an environment, and ultimately how to use the instrumentation to measure and gather data, can be thought of as an experiment-design process and frequently requires domain knowledge.
As the body of knowledge in a particular domain increases, it is important to constantly explore whether new insights can be generated from a set of archived data. Sometimes combining archived data with new data may lead to insights beyond those generated in the past. Other times, insights can be generated only when the data being collected reaches a “tipping point.” It is therefore important to utilize scalable big data infrastructures that enable this capability.
The role of models
Insight generation is serendipitous in nature. For this reason, insights are more likely to be generated by examining, and often combining, the output of several analytic models that have been created from the same body of data, because each model-creation approach considers different characteristics of a target data set to identify relations or other characteristics. Because insight generation is based not only on the output of the models but also on the analysis of the models themselves, the process is facilitated when models can be expressed declaratively. A good example of the advocated approach is IBM’s Watson system. This system uses ensemble learning to create many expert analytic models. Each created model provides a different perspective on the target topic. Watson’s ensemble learning approach utilizes techniques such as optimization, outlier identification and analysis, and benchmarking in the process of trying to generate insights.
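The idea of combining the output of several models built from the same data can be sketched in a few lines of Python. This is a toy illustration, not a description of how Watson works: the three outlier-detection rules, the Model class, the thresholds and the sample data are all assumptions chosen for the example. Each model carries a short declarative description of its rule so that the model itself, and not only its output, can be inspected.

```python
from dataclasses import dataclass
from statistics import mean, median, pstdev, quantiles
from typing import Callable, Dict, List


def zscore_flags(values: List[float], threshold: float = 2.0) -> List[bool]:
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


def iqr_flags(values: List[float], k: float = 1.5) -> List[bool]:
    """Flag values outside the quartiles extended by k times the IQR."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v < q1 - k * spread or v > q3 + k * spread for v in values]


def mad_flags(values: List[float], threshold: float = 3.5) -> List[bool]:
    """Flag values using the modified z-score based on the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


@dataclass(frozen=True)
class Model:
    name: str
    rule: str  # declarative, human-readable statement of what the model does
    flag: Callable[[List[float]], List[bool]]


MODELS = [
    Model("z-score", "abs(x - mean) / stdev > 2.0", zscore_flags),
    Model("iqr", "x outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]", iqr_flags),
    Model("mad", "0.6745 * abs(x - median) / MAD > 3.5", mad_flags),
]


def agreeing_models(values: List[float], min_votes: int = 2) -> Dict[int, List[str]]:
    """Map each data index to the models that flagged it, keeping only
    indices flagged by at least `min_votes` models."""
    votes: Dict[int, List[str]] = {}
    for model in MODELS:
        for index, flagged in enumerate(model.flag(values)):
            if flagged:
                votes.setdefault(index, []).append(model.name)
    return {i: names for i, names in votes.items() if len(names) >= min_votes}


daily_orders = [10, 11, 9, 10, 12, 11, 10, 48, 9, 11]
candidates = agreeing_models(daily_orders)
```

An observation flagged by several independent models is only a candidate. Whether it becomes an insight still depends on the other characteristics discussed in this post, including whether a plan of action can be associated with it.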
While we are able to describe data collection and engineering, as well as model creation in prescriptive ways, and have been able to largely automate them, this is still not the case with insight generation. This is in fact the most compelling reason for offering insight as a service. Today insights are generated manually. Newer academic research is proposing approaches for the automatic generation of insights. The analysis of the derived analytic models will enable us to understand which of the relations comprising a model are simply correlations supported by the analyzed data set (but don’t constitute insights because they don’t satisfy the other characteristics an insight must possess), and which are actually meaningful, important and satisfy all the characteristics we outlined before.
For an insight to be valid it must have a plan associated with it. This plan, which consists of a set of actions, is applied during a decisioning process. The characteristics of a particular decisioning process also need to be considered during insight generation, because the costs of applying the plan must be taken into account. The time allotted to execute the plan is one such characteristic. Watson’s Jeopardy play provided a great illustration of this point, as the system had a limited amount of time to come up with the correct response to beat its opponents. Below I provide an initial, rudimentary illustration of the time available to take specific actions in particular domains.
We are starting to make progress in understanding the difference between insights and the patterns and correlations derived from a data set. This is becoming particularly important because we are dealing more frequently with big data and because we need to use insights to gain a competitive advantage. Offering manual insight-generation services provides us with some short-term reprieve, but ultimately we need to develop automated systems because the data is getting bigger and our ability to act on it is not improving proportionately.
More details on the insight generation process