Predictive Analytics with Machine Learning for Real-Time Streaming Data

  • Home
  • /
  • Blog
  • /
  • Predictive Analytics with Machine Learning for Real-Time Streaming Data
Try Free SQL Trainer - learn by doing!
SQL queries made easy - Natural Questions to SQL Converter.

Predictive analytics continually expands on new frontiers with machine learning methods. Forecasting on real-time data sets and monitoring streaming data from IoT devices are among the most exciting applications today. ML algorithms process real-time data streams from devices and apps. And this animates apps with a “live” interactive and intelligent feel.

A great variety of analytics platforms now offers every imaginable feature for developers. And We will explore the best of them in depth in this survey which is specially configured for developers! Among the most popular predictive analytics platforms, we find both open source freeware and paid subscriber platforms as a service. Here are several noteworthy platforms:

  • Spark – Build fault-tolerant, scalable streaming app.
  • Flink – Computations on the distributed data stream and batch processing.
  • Splunk – Develop apps to analyze vast machine-generated big data with SPL
  • Streaming Analytics – Collect and analyze from diverse unstructured data sources.
  • Azure – Develop real-time analytics remote management, sensing, and monitoring.

We will feature two Apache projects, Spark and Flink, to cover remarkable open source platforms. We will also have a word about the related platform, Apache Kafka, a bit later. Next, we will look into three paid subscriber platforms as a service. IBM, in particular, became a popular leader in predictive analytics after noteworthy success with the Watson AI platform.

After briefly introducing the unique features of each platform, we will look at practical implementations of each. In particular, we want to explore coding examples of Splunk’s SPL search language. … (add more to describe later examples).

Spark’s fault-tolerant scalable streaming data apps

Spark is architected for building flexible, fault-tolerant and highly scalable streaming data analytics applications. Spark streaming sources support its own unique proprietary Apache language integrated API. This API enables developers to write streaming jobs much in the same way batch jobs are coded. Spark Streaming data also supports three leading standard coding languages: Java, Python, and Scala.

Apache Spark is favored for live data-intensive industries today, especially finance and health. Travel apps are also coming online with streaming data. This includes real-time data and real-time big data analytics from taxi services like PassApp and Grab. A great benefit of using Spark is a vast resource base shared by the Apache developer community.

Flink is remarkable for distributed data streams

Another advanced offering from Apache, Flink is an open-source platform fitted to distributed streams as well as batch processing. Flink’s data engine supports scalable data distribution.  Remarkable fault tolerance in computationally intensive distributed stream batch data processing over immense data streams is also a noteworthy asset of Flink.

Apache Flink benefits from an active developer community. Frequent updates and revisions keep this platform on the cutting edge. Apache offers a number of API language interfaces for connecting to the Flink engine. Here a few examples:

  • DataSet API – for Python applications
  • DataStream API – for unbounded streams for Java and Scala
  • Table API – which supports an SQL query language

Splunk and machine-generated big data

Splunk has a devoted developer community as a result of their awesome platform for analyzing machine-generated big data. Searching and monitoring streaming data or real-time data are facilitated by the very popular SPL search language. Splunk’s core product captures streaming data from a source, indexes and correlates it in real-time.

The result is an indexed dataset built on a scalable and searchable repository. From the repository, developers can use SPL or the web UI to create visualizations, reports, and even custom dashboards. Splunk is popularly integrated with QA automation testing and testware because it supports anomaly detection in user behavior. Splunk Analytics for Hadoop is another important functionality which garners loyal developer interest.

Splunk software is comprised of a variety of Splunk tools for developers. These empower developers to exploit data mining concepts and techniques. Among these are Splunk machine learning tools, and the popular Splunk security platform. Splunk monitoring tools are used in anomaly detection. Splunk predictive analytics and Splunk data analytics are first in class.

Streaming Analytics handles unstructured data sources

Streaming Analytics is the name of IBM’s predictive analytics platform. The unique feature fo Streaming Analytics lies in its capacity for handling diverse datasets. With Streaming Analytics developers create applications to collect, analyze, and find relationships and patterns in data. But Streaming Analytics goes a step farther.

Streaming Analytics supports streaming data analysis from diverse data sources. And the emphasis is on massive unstructured streaming data. IBM’s analytics solution can manage extremely large throughputs of data. And it can handle millions of transactions per second. For this reason, it is a leader in the paid and subscriber analytics platforms. Geospatial data is important here.

As we will see, Streaming Analytics can analyze a broad range of unstructured data including text, audio, and video streams. And – important to new SQL and NoSQL databases like MongoDB – Streaming Analytics handles geospatial and remote sensor data sources. As we saw in our MongoDB survey, geospatial streaming data usage is integral to IoT analytics today.

Azure does real-time remote sensing and monitoring

Although Microsoft is not usually thought of as budget-friendly, Azure weighs in a low-cost subscriber analytics solution. Azure’s goal is to develop real-time insights from IoT devices and remote management sensors, in addition to web apps. Networked autonomous vehicles or “connected cars” are listed among preferred applications. Parallelism is also featured by Azure.

Azure supports a query language within its analytics platform which resembles SQL. And Azure is naturally integrated with PowerApps and the other .Net framework products, including visual studio. The vast user base of these products means that developers will find community support and solution sharing widely available.

Apache Kafka as a Supporting Framework

We want to connect Apache Kafka in the discussion. Although Kafka streaming alone is not an analytics platform, it does provide real-time data streaming support. We’ll look at Kafka streaming vs spark streaming real time. There is likewise a query language – KSQL – for use in database management. In line with our previous tutorial on KSQL, we illustrate how this framework evolved into other Apache projects.

Kafka handles real-time data feeds, like the platforms already mentioned. Kafka evolved from the publisher-subscriber message queue. In other words, a data source is connected and collected like a distributed stream processing transaction log. Kafka’s storage layer is fed from a pub/sub message channel. Kafka Connect and Kafka Streams provide external data set connections to a Java stream processing library.

How to use the streaming analytics platforms

Streaming analytics platforms can connect to a data source and index any kind of data. Live data from IoT devices, machine-generated big data, synthetic data or testware generated data from automation testing, all of this and more can be collected and analyzed. Stock Market live data, live application logs, Windows event logs, web server logs, network feeds, system performance metrics, anomaly detection monitoring, archive files, message queues, the possibilities are unlimited. How do we connect to a data source and get started coding?

Starting out with Spark language API

We will provide a brief overview of the Spark API to get you started. Spark architecture is based on the concept of distributed datasets. These may contain Java or Python objects. However, with additional coding, R programming (R coding language) and others can be implemented.  Dataset from external data can be created. Parallelism is also supported. An essential building block of the Spark API is API “RDD.” The RDD API supports two types of operations:

Transformations – define a new dataset based on inheritance.
Actions – execute a job to execute on a cluster.

The following code sample creates a simple dataset:

my_text_file = sc.textFile("hdfs://...")
acounts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

Spark supports a distributed collection of data organized into columns called a DataFrame. Spark’s DataFrame API performs relational operations which resemble RDBMS on external data sources as well as Spark streaming metrics proprietary distributed collections. Spark’s database engine is called Catalyst and automatically optimizes code using the Dataframe API.

Monitoring Page Edits with Flink

Let’s look at an example of connecting to a web app and monitoring any changes that occur. The wiki provides an IRC for monitoring edits to pages via the log. Flink can read this channel count the content edited per user during a fixed period. This can be done with Flink in a few lines of code. And this sample will provide a good starting point to get into the API:


Above is an excerpted standard dependency. We can now add this to the code:

KeyedStream<WikipediaEditEventNo1, String> keyedEdits = edits
    .keyBy(new KeySelector<WikipediaEditEventNo1, String>() {
        public String getKey(WikipediaEditEventNo1 event) {
            return event.getUser();

Build an HTTP Event Collector in Splunk

Splunk is popular for collecting and analyzing event data from web apps, especially performance metrics and synthetic test data. Let’s see how to build an HTTP Event Collector (HEC) with Splunk. HEC is an endpoint that supports sending application events to a Splunk deployment via HTTP or HTTPS. We use an authentication model based on tokens. It is very simple to connect an event collector to JSON.

Splunk has Java and JavaScript plus .NET logging libraries available for developers. Or an HTTP request can be generated from any HTTP client using JSON-encoded events. This cURL statement uses an example HTTP Event Collector token

curl -k  https://localhost:8088/services/collector/event -H "Authorization: Splunk B5A791AD-B822-A1C-80D1-819FFB0" -d '{"event": "hello analytics world!!!"}'

Which generates the JSON response:

{"text": "Success", "code": 0}

Configure a logging library or HTTP client with a token to send data to HEC Streaming Analytics. HEC feeds apps with data for further predictive computation and optimization.

Getting started with Bluemix and Streaming Analytics

Streaming Analytics for Bluemix is a killer combination for real-time analysis of live streaming data sources. Streaming Analytics is powered by IBM InfoSphere which is a supporting analytics platform for “ingesting” and analyzing information as it is generated by real-time data sources like IoT and web-connected devices.

The associated InfoSphere Streams Analytics platform manages the highest data rates available commercially. Machine learning algorithms benefit from such low latency analytics capabilities because they require the highest computational overheads. The following code sample illustrates how to send a call to the API:

"streaming-analytics": [
"name": " Streaming Analytics-dev-aj",
"label": "streaming-analytics",
"plan": "Beta",
"credentials": {
"bundles_path": "/jax-rs/bundles/service_instances/key1/service_bindings/key1",
"start_path": "/jax-rs/streams/start/service_instances/key1/service_bindings/key1",
"statistics_path": "/jax-rs/streams/statistics/service_instances/key1/service_bindings/key1",
"status_path": "/jax-rs/streams/status/service_instances/key1/service_bindings/key1",
"rest_port": "443",
"resources_path": "/jax-rs/resources/service_instances/key1/service_bindings/key1",
"jobs_path": "/jax-rs/jobs/service_instances/key1/service_bindings/key1",
"rest_url": "",
"userid": "key1",
"stop_path": "/jax-rs/streams/stop/service_instances/key1/service_bindings/key1",
"rest_host": "",
"password": "key1"

Add a real device to an Azure IoT application

In order to get an example of how Azure connects to an IoT device, let’s look at an example. Although covering the entire application is beyond our scope, we can look at an important piece. This example app connects to an air conditioner to receive or stream telemetry data such as temperature and humidity. Here are the objectives of the app:

● Add a new real-time IoT device to the app
● Configure the real-time device as a data source
● Add a connection string to sample Node.js code
● Configure client code for the real device

The code to connect the Azure app to the air conditioner is in a Node.js file called MyConnectedAirConditioner.js on the server. This file contains the connection strings and variables to define the project. Here is an example:

'use strict';
var MyClientFromConnectionString = require('azure-iot-device-mqtt').clientFromConnectionString;
var Message = require('azure-iot-device').MessageA;
var ThisConnectionString = require('azure-iot-device').ConnectionString;
var ThisConnectionString = '{ADD your device connection string}';
var ATargetTemperature = 0;
var MyClient = MyClientFromConnectionString(connectionString);

Each Azure Central application IoT device has a unique connection string. The client app can now connect and stream data via this Javascript code:

// Send my device’s telemetry.
function SendMyTelemetry() {
  var temperature = ATargetTemperature + (Math.random() * 15);
  var data = JSON.stringify({ temperature: temperature });
  var message = new Message(data);
  client.sendEvent(message, (err, res) => console.log(`Sent message: ${message.getData()}` +
    (err ? `; error: ${err.toString()}` : '') +
    (res ? `; status: ${}` : '')));

Data streamed from an IoT device can now feed into machine learning algorithms for predicting seasonal usage, average temperatures, and cost.

Machine Learning in streaming analytics platforms

ML algorithms have reached an advanced level, and are now applied in remote management solutions. They now learn autonomously from data sources such as IoT devices everywhere. Although AI is not yet strong, streaming data platforms are building mountains of rich resources for future AI to exploit. The realm of predictive analytics platforms which we explored above is leading the learning adventure and discovery.

Gathering and sorting oceans of data and discovering meaningful patterns in an infinite sea, these platforms are already accomplishing what is not possible for people alone! Autonomous vehicles and natural language processing are just the tip of the iceberg. Stay tuned to ByteScout as we stay tuned to future developers!