Jethro - We Make Real-Time Business Intelligence Work on Hadoop

Real-Time Data Analytics

By Remy Rosenbaum on May 12, 2016

Before we get into how to achieve real-time data analytics, we first need to define what it is. In the book Real-Time Big Data Analytics: Emerging Architecture, big data author Mike Barlow provides a striking example of what real-time data analytics is all about:

Imagine that it’s 2007. You’re a top executive at a major search engine company, and Steve Jobs has just unveiled the iPhone. You immediately ask yourself, “Should we shift resources away from some of our current projects so we can create an experience expressly for iPhone users?” Then you begin wondering, “What if it’s all hype? Steve is a great showman … how can we predict if the iPhone is a fad or the next big thing?” The good news is that you’ve got plenty of data at your disposal. The bad news is that you have no way of querying that data and discovering the answer to a critical question: How many people are accessing my sites from their iPhones? Back in 2007, you couldn’t even ask the question without upgrading the schema in your data warehouse, an expensive process that might have taken two months. Your only choice was to wait and hope that a competitor didn’t eat your lunch in the meantime.

Real-time analytics is growing in importance as businesses collect larger quantities of data from multiple sources, including the public web, software and machine log data, sensor data and the Internet of Things, historical archives, business apps, social media, and more (see an infographic from Datafloq). Businesses have long extracted insights from these sources using a variety of analytical techniques. Now they need to extract those insights, visualize them, and make them available to different parts of the business in real time.

The Meaning of “Real Time”

There are several definitions of “real time”. According to Techopedia, real time means using the data less than one minute after it was produced. O’Reilly provides a richer definition:

Real-time denotes the ability to process data as it arrives [...] processing data in the present, rather than in the future. But “the present” also has different meanings to different users. From the perspective of an online merchant, “the present” means the attention span of a potential customer. If the processing time of a transaction exceeds the customer’s attention span, the merchant doesn’t consider it real time. From the perspective of an options trader, however, real time means milliseconds. From the perspective of a guided missile, real time means microseconds.

For most data analysts, real time means “pretty fast” at the data layer and “very fast” at the decision layer.

Uses of Real-Time Analytics

Traditional uses of real-time analytics range from network monitoring, to risk management and fraud detection, to transaction cost analysis and pricing analytics, to algorithmic trading, to intelligence and surveillance. In many of these use cases, machines are capturing and analyzing the data in real time, and sending out reports or alerts for human consumption.

A newer application of real-time analytics is real-time business intelligence: business analysts and decision makers view rich visualizations of data in tools like Tableau and Qlik, and those visualizations are updated in real time from data warehouses and big data platforms like Hadoop.

The Missing Piece in Real-Time Data Analytics: Interactive Query Engines

Real-time data analytics works well when it is based on predefined algorithms and queries. Current tools can combine multiple streams of rapidly flowing data and perform complex operations on them. This kind of processing was traditionally known as ETL (extract, transform, load) or CEP (complex event processing), and it now runs at real-time speeds using tools such as Apache Storm, Apache Spark, Apache Samza and DataTorrent (see a review by Kai Wähner on InfoQ).
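To make "predefined queries" concrete, here is a minimal sketch in plain Python of one fixed question applied to a stream: count events in a sliding time window and raise an alert when the count reaches a threshold. It is an illustration only and does not use the API of Storm, Spark, Samza or DataTorrent. The class name, the 60-second window, the threshold of 3, and the sample timestamps are all assumptions made for the example.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen in the last `window_seconds` seconds.

    Assumes timestamps arrive in non-decreasing order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one event and return the count inside the current window."""
        self.timestamps.append(timestamp)
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


def alerts(events, window_seconds=60, threshold=3):
    """Yield (timestamp, count) whenever the windowed count reaches the threshold."""
    counter = SlidingWindowCounter(window_seconds)
    for ts in events:
        count = counter.add(ts)
        if count >= threshold:
            yield ts, count


# Hypothetical event timestamps in seconds (for example, failed logins).
sample = [0, 10, 20, 100, 110, 115]
fired = list(alerts(sample))
print(fired)
```

The query is fixed before any data arrives, which is why it can run continuously and cheaply. Asking a different question means writing and deploying different code, and that is the limitation the next paragraph turns to.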

But what’s missing is interactivity. Is it possible for a human to explore the data by issuing different types of queries, “slicing and dicing”, and drilling down until a specific insight is uncovered? When the data gets big, even today’s highest performing tools can’t live up to the challenge.

We recently ran a benchmark that tested Cloudera Impala and Amazon Redshift, two high-performance compute engines for big data. The test used a dataset of ~2.9 billion rows, visualized in a dashboard on the Tableau business intelligence platform. We measured how fast the dashboard refreshed as the user successively added filters to the query. Both Impala and Redshift, considered to be among the world’s fastest big data compute engines, took an average of 1.5 minutes to refresh the Tableau dashboard. That is far from real time by any measure, and it does not allow analysts to explore the data interactively.

Index-Based Query Processing for Complex User Queries

An index-access architecture is the most flexible solution for directly querying big data sources. Most SQL-on-Hadoop engines rely on MPP (massively parallel processing) architectures that scan the entire dataset with every query. By contrast, an index-access architecture removes the need for limiting extracts or cubes and lets analysts query the data and drill down any way they like, at interactive speed.

Figure: Hadoop cluster load

Jethro is an index-access SQL acceleration engine built for the specific scenario of flexible user queries on large data sets. Instead of fully scanning the data, as MPP-based SQL-on-Hadoop tools do, Jethro indexes every single column of a big data “lake”.

Jethro uses these indexes to access only the data a query needs instead of waiting for a full scan, so responses to complex queries are faster by an order of magnitude. This enables a true real-time response to user queries and interactive exploration of the data. Queries can combine multiple indexes for better performance: the more a user drills down, the faster the query runs.
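The difference between scanning and index access can be sketched with a toy example. The code below is not Jethro's implementation. It is a minimal in-memory illustration in which every (column, value) pair maps to a set of row ids and each added filter intersects those sets. The table contents and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: each row is a dict of column -> value.
ROWS = [
    {"country": "US", "device": "iPhone", "year": 2007},
    {"country": "US", "device": "Android", "year": 2008},
    {"country": "DE", "device": "iPhone", "year": 2007},
    {"country": "US", "device": "iPhone", "year": 2008},
    {"country": "FR", "device": "Desktop", "year": 2007},
]


def full_scan(rows, filters):
    """Check every row against every filter.

    Returns (matching row ids, number of rows read).
    """
    matches = [i for i, row in enumerate(rows)
               if all(row[col] == val for col, val in filters.items())]
    return matches, len(rows)


def build_indexes(rows):
    """Map each (column, value) pair to the set of row ids that hold it."""
    indexes = defaultdict(set)
    for i, row in enumerate(rows):
        for col, val in row.items():
            indexes[(col, val)].add(i)
    return indexes


def index_lookup(indexes, filters):
    """Intersect the row-id sets of all filters (expects at least one filter).

    Returns (matching row ids, number of rows that would need fetching).
    """
    id_sets = [indexes.get((col, val), set()) for col, val in filters.items()]
    matches = sorted(set.intersection(*id_sets)) if id_sets else []
    return matches, len(matches)


INDEXES = build_indexes(ROWS)
drill_1 = {"country": "US"}
drill_2 = {"country": "US", "device": "iPhone"}
drill_3 = {"country": "US", "device": "iPhone", "year": 2008}

for filters in (drill_1, drill_2, drill_3):
    print(filters, "index:", index_lookup(INDEXES, filters),
          "scan:", full_scan(ROWS, filters))
```

In this toy model the scan reads all five rows regardless of the filters, while the index path touches three, then two, then one row id as filters are added. That is the intuition behind "the more a user drills down, the faster the query runs". A real engine also has to account for index storage and maintenance costs, which this sketch ignores.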

In the benchmark mentioned above, Jethro was also tested, and it refreshed the Tableau dashboard visualizing ~2.9 billion rows in 6 seconds, compared with an average of 1.5 minutes for Impala and Redshift. This enables truly interactive real-time analytics on big data.
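As a quick check on the "order of magnitude" claim, the two figures reported above can be compared directly. The snippet uses only the numbers stated in this post (an average of 1.5 minutes versus 6 seconds). The variable names are illustrative.

```python
# Figures as reported in the benchmark described above.
baseline_seconds = 1.5 * 60   # Impala / Redshift average dashboard refresh
jethro_seconds = 6            # Jethro dashboard refresh

speedup = baseline_seconds / jethro_seconds
print(f"Speedup: {speedup:.0f}x")
```

A ratio of 15 is consistent with a speedup of roughly one order of magnitude on this particular dashboard workload.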