The physical world (from goods to equipment) is becoming digitally connected through a multitude of sensors. Sensors can be found today in most industrial equipment, from metal presses to airplane engines, shipping containers (RFID), and automobiles (telematics devices). Consumer mobile devices are essentially sensor platforms. These connected devices can automatically provide status updates, performance updates, maintenance requirements, and machine-to-machine (M2M) interaction updates. They can also be described in terms of their characteristics, their location, etc. Until recently these sensors have been interconnected using proprietary protocols. More recently, however, sensors are starting to be connected via IP to form the Internet of Things, and by 2020 an estimated 50 billion devices are expected to be connected in this way. The connected physical world is becoming a source of an immense amount of low-level, structured and semi-structured data, i.e., big data.
Collecting and utilizing sensor data is not new. For example, GE uses data from sensors to monitor the performance of industrial equipment, locomotives, jet engines and health care equipment. United Airlines uses sensors to monitor the performance of its planes on each flight. And government organizations, such as the TSA, collect data from the various scanners they use at airports. The key applications that have emerged through these earlier efforts are remote service and predictive maintenance.
While our ability to collect the data from these interconnected devices is increasing, our ability to effectively, securely and economically store, manage, clean and, in general, prepare the data for exploration, analysis, simulation, and visualization is not keeping pace. Today we seem to be preoccupied with the goal of trying to put all of the data we collect into a single database. Even in this task we are not doing a particularly good job. The existing database management systems are proving inadequate for this task. They may be able to process the time series data collected by sensors, but they cannot correlate it. The effectiveness of newer data management systems, e.g., Hadoop and NoSQL databases such as MongoDB and Cassandra, is also proving inconsistent and depends largely on the type of application accessing the database and operating on the collected data.
The new generation of applications that will exploit the big data collected by sensors must take a ground-up approach to the problem they are trying to address, not unlike that taken by Splunk. In Splunk’s case, the application developers considered the ways the sensor data being collected from data centers must be cleaned, the other data sets with which it must be integrated/fused, the ways to interact with the resulting data sets, etc. Splunk’s developers were able to accomplish this and deliver a very effective application because they understood the problem, the spectrum of data that must be used to address the problem, and the role the low-level data is playing in this spectrum. They also appear to have understood the importance of providing effective analyses of the low-level data as well as of the higher-level data sets that result when several different data sources are fused.
The Internet of Things necessitates the creation of two types of systems with data implications. First, a new type of ERP system (the system of record) that will enable organizations to manage their infrastructure (IT infrastructure, human infrastructure, manufacturing infrastructure, field infrastructure, transportation infrastructure, etc.) in the same way that the current generation of ERP systems allows corporations to manage their critical business processes. Second, a new analytic system that will enable organizations to organize, clean, fuse, explore and experiment with, simulate and mine the data that is being stored to create predictive patterns and insights. Today our ability to analyze the collected data is inadequate because:
- The sensor data we collect is too low-level; it needs to be integrated with data from other sensors, as well as higher-level data, e.g., weather data, supply chain logistics data, to create information-richer data sets. Data integration is important because a) high-velocity sensor data must be brought together and b) low-granularity sensor data needs to be integrated with other higher-granularity data. Today integration of sensor data is still done manually on a case-by-case basis. Standards-based ways to integrate such data, e.g., RESTful APIs and other types of web services, have not yet been adopted broadly in the Internet of Things world, and they need to be. We need to start thinking of sensor data APIs in the same way we have been thinking about APIs for higher-level data. And once we start defining these standards-based APIs we also need to start thinking about API management.
- We don’t yet know the range of complex analyses to perform on the collected sensor data because we don’t know yet what enterprise and government problems we can solve through this data.
- Even for the analyses we perform, we often lack the ability to translate any analysis results to specific actions.
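To make the integration point in the first bullet concrete, here is a minimal Python sketch of fusing low-level sensor readings with a higher-level data set, in this case weather observations. It is only an illustration under stated assumptions: the record types, the field names, and the "latest observation at or before the reading" matching rule are hypothetical and do not come from any particular product or standard.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SensorReading:
    """Hypothetical low-level record emitted by a device."""
    timestamp: int          # seconds since epoch
    device_id: str
    temperature_c: float


@dataclass
class WeatherObservation:
    """Hypothetical higher-level record from an external weather feed."""
    timestamp: int          # seconds since epoch
    ambient_c: float


def enrich(readings: List[SensorReading],
           weather: List[WeatherObservation]) -> List[Dict[str, object]]:
    """Attach to each reading the latest weather observation at or before it."""
    ordered = sorted(weather, key=lambda w: w.timestamp)
    times = [w.timestamp for w in ordered]
    enriched = []
    for r in readings:
        # Index of the last observation whose timestamp is <= the reading's.
        i = bisect_right(times, r.timestamp) - 1
        ambient: Optional[float] = ordered[i].ambient_c if i >= 0 else None
        enriched.append({
            "device_id": r.device_id,
            "timestamp": r.timestamp,
            "temperature_c": r.temperature_c,
            "ambient_c": ambient,
            # Difference from ambient is often more informative than the raw value.
            "delta_c": None if ambient is None else r.temperature_c - ambient,
        })
    return enriched
```

The weather feed is sorted once and each reading is matched with a binary search, so the cost stays modest even when readings arrive far more often than weather observations. A reading that predates every observation is kept with an empty ambient value rather than dropped.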
Finally, along with these two types of systems we will need to effectively manage the IP addresses of all devices that are being connected in these sensor networks. IPv6 gives us the ability to connect the billions of sensors using IP. We need better ways to manage these connected devices. Most organizations today manage them in spreadsheets.
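As a rough illustration of what moving beyond spreadsheets could look like, the following Python sketch keeps a small in-memory inventory of devices keyed by IPv6 address, using the standard library `ipaddress` module. The `Device` and `DeviceRegistry` names, the fields, and the example addresses (taken from the 2001:db8::/32 documentation range) are assumptions made for illustration, not a description of any existing device management product.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Device:
    """Hypothetical inventory record for one connected device."""
    address: ipaddress.IPv6Address
    kind: str
    location: str


class DeviceRegistry:
    """Hypothetical in-memory inventory keyed by IPv6 address."""

    def __init__(self) -> None:
        self._devices: Dict[ipaddress.IPv6Address, Device] = {}

    def register(self, address: str, kind: str, location: str) -> Device:
        # Parsing validates the address and normalizes equivalent spellings.
        addr = ipaddress.IPv6Address(address)
        if addr in self._devices:
            raise ValueError(f"address already registered: {addr}")
        device = Device(addr, kind, location)
        self._devices[addr] = device
        return device

    def in_subnet(self, prefix: str) -> List[Device]:
        """Return the registered devices inside a prefix, ordered by address."""
        network = ipaddress.IPv6Network(prefix)
        matches = [d for d in self._devices.values() if d.address in network]
        return sorted(matches, key=lambda d: d.address)
```

Parsing addresses instead of storing them as free text means that two different spellings of the same address are caught as duplicates, and questions such as "which devices sit in this plant's subnet" become a single call.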
The big data generated by the Internet of Things is opening up great opportunities for a new generation of operational and analytic applications. Creating these applications will require taking a ground-up approach, from the basic sensor technology and the data sensors can generate, to the ways sensors are managed and data is integrated, to the actions that can be taken as a result of the analyzed data.
1 thought on “Big Data and the Internet of Things”
this is a very interesting perspective, and i agree with lots of what you say – except for one thing.
i don’t see the storing of the data as a necessity. the low-level data you refer to is, for the most part (something like 99.99% of the time), inconsequential. it repeats the same every time – i am on, i am off, i am fine, i need nothing, etc. well, that is the interpretation of that data. the number of devices available won’t matter as long as we have the ability to process the non-equal ones – the alerts to react to.
sure, for compliance reasons eventually you may need to store some of this, but the vast majority of it will be discarded. and the same applies to social noise – we don’t store every single tweet that comes across with your keywords – we may route a few (usually less than 1/2 of a percent of those that mention us) for further processing, but the majority of tweets that hate united or att or whomever are discarded since there is no resolution possible.
and therein lies the rub – the “revolution” of big data is not about the data but the ability to process it closer to real time than ever before. it was never about the data, it was (as you point out) about the outcomes that are necessary / expected.
and this is the main problem today – not the lack of “data scientists” but the lack of knowing what to do with all this “data”.
and for that, the answer cannot come from a vendor or a technology but from people understanding what is the purpose of the biz they built (well, beyond making money that is).