Big Data and the Internet of Things

The physical world (from goods to equipment) is becoming digitally connected through a multitude of sensors.  Sensors can be found today in most industrial equipment, from metal presses to airplane engines, shipping containers (RFID), and automobiles (telematics devices).  Consumer mobile devices are essentially sensor platforms.  These connected devices can automatically provide status updates, performance updates, maintenance requirements, and machine-to-machine (M2M) interaction updates.  They can also be described in terms of their characteristics, their location, etc.  Until recently these sensors have been interconnected using proprietary protocols.  More recently, however, sensors are starting to be connected via IP to form the Internet of Things, and by 2020 an estimated 50 billion devices are expected to be connected in this way.  The connected physical world is becoming a source of an immense amount of low-level, structured and semi-structured data, i.e., big data.

Collecting and utilizing sensor data is not new.  For example, GE uses data from sensors to monitor the performance of industrial equipment, locomotives, jet engines and health care equipment.  United Airlines uses sensors to monitor the performance of its planes on each flight. And government organizations, such as the TSA, collect data from the various scanners they use at airports.  The key applications that have emerged through these earlier efforts are remote service and predictive maintenance.

While our ability to collect the data from these interconnected devices is increasing, our ability to effectively, securely and economically store, manage, clean and, in general, prepare the data for exploration, analysis, simulation, and visualization is not keeping pace.  Today we seem to be preoccupied with trying to put all of the data we collect into a single database.  Even in this task we are not doing a particularly good job.  The existing database management systems are proving inadequate for this task.  They may be able to process the time series data collected by sensors, but they cannot correlate it.  The effectiveness of newer data management systems, e.g., Hadoop and NoSQL databases such as MongoDB and Cassandra, is also proving inconsistent and depends largely on the type of application accessing the database and operating on the collected data.

The new generation of applications that will exploit the big data collected by sensors must take a ground-up approach to the problem they are trying to address, not unlike that taken by Splunk.  In Splunk’s case, the application developers considered the ways the sensor data being collected from data centers must be cleaned, the other data sets with which it must be integrated/fused, the approach to interacting with the resulting data sets, etc.  Splunk’s developers were able to accomplish this and deliver a very effective application because they understood the problem, the spectrum of data that must be used to address the problem, and the role the low-level data plays in this spectrum.  They also appear to have understood the importance of providing effective analyses of the low-level data as well as of the higher-level data sets that result when several different data sources are fused.

The Internet of Things necessitates the creation of two types of systems with data implications.  First, a new type of ERP system (the system of record) that will enable organizations to manage their infrastructure (IT infrastructure, human infrastructure, manufacturing infrastructure, field infrastructure, transportation infrastructure, etc.) in the same way that the current generation of ERP systems allows corporations to manage their critical business processes.  Second, a new analytic system that will enable organizations to organize, clean, fuse, explore and experiment with, simulate and mine the data that is being stored to create predictive patterns and insights.  Today our ability to analyze the collected data is inadequate because:

  1. The sensor data we collect is too low-level; it needs to be integrated with data from other sensors, as well as higher-level data, e.g., weather data, supply chain logistics data, to create information-richer data sets. Data integration is important because a) high-velocity sensor data must be brought together and b) low-granularity sensor data needs to be integrated with other higher-granularity data.  Today integration of sensor data is still done manually on a case-by-case basis.  Standards-based ways to integrate such data, e.g., RESTful APIs and other types of web services, have not yet been adopted broadly in the Internet of Things world, and they need to be.  We need to start thinking of sensor data APIs in the same way we have been thinking about APIs for higher-level data.  And once we start defining these standards-based APIs we also need to start thinking about API management.
  2. We don’t yet know the range of complex analyses to perform on the collected sensor data because we don’t know yet what enterprise and government problems we can solve through this data.
  3. Even for the analyses we perform, we often lack the ability to translate any analysis results to specific actions.
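
The integration described in point 1 can be illustrated with a small sketch that attaches the most recent weather observation to each low-level sensor reading.  Everything here is an assumption made for illustration: the record types, the field names (vibration_mm_s, temperature_c), and the one-hour freshness limit are hypothetical and do not come from any particular sensor platform or API.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SensorReading:
    # Hypothetical low-level record emitted by a piece of equipment.
    sensor_id: str
    timestamp: datetime
    vibration_mm_s: float


@dataclass
class WeatherObservation:
    # Hypothetical higher-level record from an external weather feed.
    timestamp: datetime
    temperature_c: float


def enrich_with_weather(readings, weather, max_age=timedelta(hours=1)):
    """Attach the latest weather observation at or before each reading.

    Observations older than max_age (an assumed freshness limit) are
    ignored, and the reading is kept with temperature_c set to None.
    """
    weather = sorted(weather, key=lambda w: w.timestamp)
    weather_times = [w.timestamp for w in weather]
    enriched = []
    for r in readings:
        # Index of the last observation whose timestamp is <= the reading's.
        i = bisect_right(weather_times, r.timestamp) - 1
        obs = weather[i] if i >= 0 else None
        if obs is not None and r.timestamp - obs.timestamp > max_age:
            obs = None
        enriched.append({
            "sensor_id": r.sensor_id,
            "timestamp": r.timestamp,
            "vibration_mm_s": r.vibration_mm_s,
            "temperature_c": obs.temperature_c if obs else None,
        })
    return enriched
```

Doing this by hand for every pair of data sources is the case-by-case integration described above; a shared, standards-based API for sensor data would let one such routine serve many sources.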

Finally, along with these two types of systems we will need to effectively manage the IP addresses of all devices that are being connected in these sensor networks.  IPv6 gives us the ability to connect the billions of sensors using IP.  We need better ways to manage these connected devices.  Most organizations today manage them on spreadsheets.
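
As a rough illustration of what managing devices beyond a spreadsheet could look like, the sketch below keeps a small in-memory inventory keyed by IPv6 address, using Python's standard ipaddress module.  The DeviceRegistry class, its method names, and the device names are hypothetical, and the addresses come from the 2001:db8::/32 range reserved for documentation; a real system would add persistence, access control and much more.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    address: ipaddress.IPv6Address
    location: str


class DeviceRegistry:
    """Hypothetical in-memory inventory of devices on one IPv6 network."""

    def __init__(self, network):
        self.network = ipaddress.IPv6Network(network)
        self._devices = {}

    def register(self, name, address, location):
        # Parsing normalizes the address, so different spellings of the
        # same address map to the same key.
        addr = ipaddress.IPv6Address(address)
        if addr not in self.network:
            raise ValueError(f"{addr} is outside {self.network}")
        if addr in self._devices:
            owner = self._devices[addr].name
            raise ValueError(f"{addr} is already assigned to {owner}")
        device = Device(name, addr, location)
        self._devices[addr] = device
        return device

    def lookup(self, address):
        # Returns None when no device holds the address.
        return self._devices.get(ipaddress.IPv6Address(address))

    def by_location(self, location):
        return [d for d in self._devices.values() if d.location == location]
```

For example, after reg = DeviceRegistry("2001:db8::/64") and reg.register("press-01", "2001:db8::1", "plant-a"), a lookup of the fully expanded form of that address finds the same device, and a second attempt to register the address is rejected.  Both checks are easy to miss in a spreadsheet.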

The big data generated by the Internet of Things is opening up great opportunities for a new generation of operational and analytic applications.  Creating these applications will require taking a ground-up approach, from the basic sensor technology and the data sensors can generate, to the ways sensors are managed and data is integrated, to the actions that can be taken as a result of the analyzed data.