Solving Common BI on Hadoop Performance Pitfalls

Introduction

Your company has recently made the jump to big data, and now that a few months, or even years, have passed, you think your team has a firm grip on getting data in and out of the environment, right? If so, you are probably also discovering that you aren't quite sure how to easily hook one of your Business Intelligence (BI) tools to this data, point it at the data you really want (i.e., a subset), and produce fast, responsive results. These three simple tasks are almost universal to any BI request, yet they can wreak havoc on teams in the big data age, with so many services available but none providing a complete solution to any of them.

Oftentimes this leads us elsewhere: purchasing expensive alternatives from third-party companies to manipulate, display, and provide meaningful access to the data for our users and ourselves. And in most cases, what are we left with? The sad truth is that we're left either supporting multiple products or going back to what is familiar, like SQL Server or newer in-memory alternatives such as Microsoft's Analysis Services Tabular models; neither of which we want to do.

In this article, we'll explore some of the ways we can answer the question of how to do BI on big data and compare the alternatives with Jethro's approach. Most of the alternatives mentioned in this piece have been explored, and even implemented (sad to admit), by me before big data really grew into what is available now. So please enjoy, feel free to provide input or email comments, and share your experiences with me and the rest of the Jethro team so we can advise on similar situations and help you avoid the pitfalls described here.

Core Hadoop

One of the most common, and frankly naïve, ways people approach BI with Hadoop, which generally ends in failure, is using core Hadoop to power BI reporting with tools such as Tableau, SSRS, or some other third-party system like Business Objects. When we talk about core Hadoop, we are referring to the following services:

  • HDFS
  • MapReduce/MR2 YARN
  • Hive/HiveServer 2
  • Zookeeper

Now don't get me wrong, MR2/YARN and the others listed above are excellent at what they are meant to do, but one of these services is often misrepresented as a multipurpose solution that it is not: Hive. The purpose of Hive is simple: batch processing in and out; that's it. Working with a company a few years back, when there weren't many alternatives, I remember seeing demos with Hive tables and views as the source for a Tableau dashboard. At the time, we were looking for ways to present data from Hive in Tableau more quickly and ultimately settled on SQL Server Analysis Services Tabular (more to come on this). I can clearly remember asking the presenter to manipulate filters via screen share, and then the awkward waiting as one, two, five minutes passed while a MapReduce job ran in the background, unbeknownst to the audience, to fetch the new data.

One of the major problems with this approach is that Tableau, and other tools that rely on queries to refresh data, will cause MapReduce jobs to be launched every time users explore the data or change the initial data set. There are clever ways around this, such as extracting all the data up front, but oftentimes the cost of doing so is not worth it.

For example, take the connection below to a table with ~75M rows:

[Screenshot: Connecting Tableau to Hive + Spark against a table with ~75M rows]

The screenshot illustrates how long the wait was just for metadata generation (I killed it after two minutes). One could argue that limiting the initial import would fix this issue; i.e., with a LIMIT 500,000 I can pull the data through, and that is true. But what do you do when your smallest dataset starts at 25M, 50M rows, or more?

You can spend many hours attempting to make this work with partitions, shrinking data into smaller tables, or other downstream tasks, but none of it will ever prove sustainable or provide the speed your customers are expecting.

Retro Solutioning

Retro Solutioning, which may be a coined term (not sure, to be honest), is what happens when you don't know enough about the Hadoop services outside of the core features and decide to stick with what's "safe." The resulting solution utilizes technologies your team has at hand and is comfortable with, and it is usually the second line of attack when the initial plan doesn't quite work out (see the previous section).

Although there are valid situations that may call for utilizing existing technology, and every company and situation is unique, more often than not it proves to be additional work and, by nature, additional processes for you to manage. Not only does this negate the purpose of Hadoop and big data, it wastes your team's time and resources and has a high likelihood of becoming a band-aid solution instead of a scalable one.

Having been a Microsoft SQL developer for 10+ years prior to getting into big data, I felt this example, derived from my own experience, was very relevant to this topic. It involves utilizing Microsoft SQL Server and the Tabular Analysis Services features to make up for less-than-desired performance out of Hive. We'll start by outlining the hypothetical project, then go into the implemented solution and the results. As you read through it, think about how you'd approach this problem and whether it lines up (if it doesn't line up, you may have done it right!).

The “Hypothetical” Project

Project: Reporting Dashboards
Original Systems:
  • Microsoft SQL Server (DB)
  • Microsoft .NET for dashboards – yikes, I know
  • Microsoft Reporting Services

Project Overview:

Take existing reports that have an MS SQL Server database source and upgrade them to Tableau, deprecating any use of .NET as a dashboarding interface and removing any reporting dependency on MS Reporting Services.

 

Project Requirements:

  • All reports should load in <10 seconds on initial browse
  • All reports should respond to filter changes in <10 seconds
  • The data source should hold at least 1 year of data for users to pull into active reports/dashboards
  • The solution should handle a user base of 1,000 users, with 200 concurrent

The total data size is 225 million rows.

Seems straightforward enough, right? The answer: Maybe.

Knowing that plugging your reporting tool, Tableau in this case, into HS2 is a no-no for BI when a performance SLA is involved, we needed to come up with a way to solve this quickly. Since we had SSAS talent and owned pretty beefy machines, we decided to aggregate where possible and push the data into memory using in-memory Tabular SSAS. Acknowledging that there would be overhead to develop the exports of data from Hive using ODBC and SQL Linked Servers, and that there would be duplication of data between what was stored in HDFS and Tabular, we set off on our journey. The thing that hurt us most was the time it took to extract data from Hive into the in-memory area for SSAS; we were hit with almost the same overhead as if we had just connected Tableau directly to HS2.

Here is a breakdown of the different systems via SQL Linked Server and ODBC. Note: neither Impala nor Jethro was available to us at the time because we didn't even consider exploring them!

 

SQL Server ODBC Query Times

 

A few things to note from the above:

  1. The query times are using OPENQUERY to import data into SSAS.
  2. Each result set was against an 80M-row fact table unrelated to the one from our hypothetical project, and the testing scenario used a query that would generate a dimension with ~1,200 rows.
  3. "Hive" here is Hive + Spark, which definitely helped improve timing after a container was created.

So, with the time trials in mind, let's get back to the hypothetical project and its import. The next screenshot (below) shows an import of 1 day of data, ~7.1M rows; although the time is not shown, this import took approximately 20 minutes. If you do the math, it would take roughly 632 minutes (over 10 hours) for the entire dataset.

Table Import Wizard

Importing the data at an aggregate level for our reporting did help, knocking the row counts down to about half the total, but that is still well beyond the recommended size for in-memory Tabular Analysis Services, and it showed when we hit it with Tableau. Query filtering took 10 seconds at best and could hit upwards of 30-60 seconds, depending on whether we were cascading filter boxes. We quickly learned that if we were seeing this type of performance during development, while hitting these reports on Tableau Server, it was not going to work for our customers and would become a nightmare as we scaled up in size and user base.

The screenshot below shows the wait on opening the dashboard in Tableau Desktop. This wait would easily be prolonged when opening from the server, and it busts one of the SLA requirements.

The Pitfalls

We learned how to connect different systems to HS2, but we also learned that relying on it to feed a performance-driven solution was not going to work without doing one of the following:

  1. More hardware – faster and more expensive SSAS machines would certainly help – but where is the budget for that?
  2. Compromise on the solution: reduce the amount of data kept and displayed, assuming the customer would allow that – probably not.

After reading this you may be saying, "Well, there are many other solutions, like Cognos, SAS, or AtScale, that you can use as a layer to prepare data for BI," and that is certainly true. However, in my humble opinion, those solutions are not retro, since most companies have not had them in-house for years like your SQL Server or Oracle databases (MS Access – don't even go there). The point of this section wasn't to bash older database technologies or discourage you from pulling data into them if that works for your customers' needs, but rather to illustrate how they may solve for the short term yet fail in the long run once we include the other aspects of what makes data "big."

The Best of Both Worlds

How do we get the best of both worlds? That is the question of the day: what magical solution will allow us to get the benefits of Hadoop and power our BI while avoiding costly add-on tools that merely mimic familiar functionality? Sticking with your current BI tool, whether that is Tableau or Qlik, is the right start, but you also need to identify the overall direction for BI and reporting within your organization. Oftentimes we jump into tools with punch lines such as "bring Hadoop to your analysts" or "data discovery at your fingertips," and while these tools may be great at specific tasks, they ultimately fall short for BI.

Keeping in mind there may never be a "catch-all" solution (and why would there be? this is Hadoop, and it's an ecosystem), let's break it down by some common BI needs…

Power User Pandemonium

This is a situation derived from personal experience: having an abnormal number of power users who have decent-to-good SQL and database skills and who really just want to get at the data and manipulate it for their own reports. This BI nightmare, or dream depending on how you look at it, often includes many ad-hoc queries tapping your source system and the use of easily accessible analyst tools like Microsoft Excel and Access to pull data in and do what is needed with it.

Report Stores

This is the case of having a bunch of reports that are maintained by a development team and utilized by analysts but are relatively trivial when inspected individually. We all have instances of these "quick-hit" reports that were spun up to answer 20-30 users' constant inquiries and never went away.

Dashboards

The traditional dashboard requests that serve to answer questions from the executive level on down (depending on your organization). Dashboard development is the best part of BI, in my opinion, because it encompasses all the challenges associated with BI: learning the business, the data, and the dashboarding tool.

What’s the Answer?

Easy… there isn't one answer to handle it all, as I alluded to in the previous sections. But if you approach your BI challenges with an open mind toward the different technologies, and combinations of them, you are already starting down the right path when solutioning. For example, one of my favorite approaches to solving the BI needs listed above (realizing there are more than those three) is using this stack:

  • Core Hadoop

Hive + Spark for batch processing and heavy lifting, alongside HDFS, YARN, and Zookeeper.

  • Impala

Deployed to ad-hoc users and analysts who know their way around SQL and won't have the patience, or need, for Hive. Impala is a big win for organizations jumping into Hadoop that have a lot of power users, because it provides a happy medium between big data and what they (your advanced users) are used to working with.

  • Jethro

This is my SLA-saver. Impala works great and can be used for a lot of quick reports or outputs that a few users rely on, but it can easily fall short when it comes to speed and tight SLAs. If you are presented with a strict SLA, such as reports responding in <10 seconds or filters needing to be near-instant on change, Jethro is the answer. With access to HDFS and the new manager tool (Jethro 3.0), data can easily be imported and queried with many of the common BI tools; basically, any tool that can make ODBC connections can utilize the data within Jethro, which makes it very powerful.

  • Hue

Replaces many users' need for a tool similar to Microsoft's SQL Management Studio and makes the abilities of Hadoop visible. I find Hue is a great way to help users new to the Hadoop world understand its basic principles while still getting their data; i.e., they can browse HDFS, use the import table wizard, etc.

  • Sentry

Provides role-level security for your stack across Hive, Impala, and Jethro. This is a must when you are going to have advanced users playing around via Impala, because Impala shares the Hive metastore and you'll want to restrict users so they don't accidentally drop objects or start creating duplicates.

The listing above, although high level, is an example of my favorite stack setup. It has proven successful for solving big data and BI needs because it is focused on ensuring there is a service for every type of scenario while avoiding getting backed into a technology corner, per se.

Related: Read our white paper on making interactive BI work on Big Data


Having worked as a database architect for 10+ years, beginning with Microsoft SQL Server 2000+, Analysis Services, and ultimately big data on Hadoop, AJ Adams has come across many different challenges. He has worked with many types of organizations covering a variety of industries, including energy, government, healthcare, and sales/marketing, designing and implementing architecture for BI-related solutions using tools such as Tableau, Qlik, and others. As data sizes grew, he moved into Hadoop, primarily Cloudera and the services associated with that distribution, specializing in administration, development, and overall stack setup. He believes one of the keys to success in making the move to big data is to approach it openly and focus on which sets of services can best handle the various needs, rather than battling to come up with a 'one-solution-fits-all' design.

On-Demand Webinar: How Tata Communications Uses Jethro to Deliver Interactive BI on Hadoop

Learn how Tata Communications CDN overcame Hadoop latency pitfalls to deliver its end users a BI dashboard on a shared Hadoop data lake, with thousands of concurrent users, that works at interactive speed.

This on-demand webinar is hosted by Tata Communications CDN Head of Engineering, Samik Mukherjee and Jethro CEO Eli Singer. You will learn about Tata Communications CDN’s Hadoop deployment, performance requirements, and architectural considerations.

 

Easy Steps for Jethro Deployment

Install and Create Instance

Set up a server

Jethro runs on a Linux server. Typical deployment requires a RedHat/CentOS server with 128 GB of RAM, 16 CPU cores and 500GB of disk space.

Jethro stores its data on Hadoop, but can use any file system, either local or network, instead.

See the installation guide for server requirements and setup:

http://jethro.io/learn/documentation

If using Amazon AWS, you can view the following post on how to install Jethro on an EC2 instance:

http://jethro.io/blog/how-to-set-up-an-amazon-instance

You can also view this video that shows how to install Jethro on AWS:

https://www.youtube.com/watch?v=jdREOTFmcdk&t=691s

Download the software

Go to http://jethro.io/download1 to download the software.

Install Jethro

Download the installation rpm to the Jethro server and run the following command as root, to install it:

rpm -Uvh <rpm file>

See the installation guide for Jethro install:

http://jethro.io/learn/documentation

Create an Instance

Before creating an instance, you need to create a directory for the storage and a directory for the local cache, and change the ownership of these directories to the user "jethro", which was created as part of the installation process.
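As a minimal sketch of that preparation, assuming a local cache directory of /mnt/jethro_cache and an HDFS storage directory of /user/jethro/instances (both paths are placeholders; use whatever fits your environment):

# Local cache directory on the Jethro server (placeholder path)
mkdir -p /mnt/jethro_cache
chown -R jethro:jethro /mnt/jethro_cache

# Storage directory on HDFS, owned by the jethro user (placeholder path)
sudo -u hdfs hadoop fs -mkdir -p /user/jethro/instances
sudo -u hdfs hadoop fs -chown -R jethro /user/jethro/instances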

Then, login as user “jethro” (password is also “jethro”) and run the command:

JethroAdmin create-instance <instance name> -storage-path=<storage path> -cache-path=<local cache path> -cache-size=300G

 

Replace <instance name> with the name you want to give the instance, <storage path> with the full path of the storage directory on Hadoop, and <local cache path> with the full path of the local cache directory. Set the cache size according to the size of the local cache directory.
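As a concrete sketch, assuming an instance named demo, HDFS storage under /user/jethro/instances, and a local cache at /mnt/jethro_cache (all placeholder names):

JethroAdmin create-instance demo -storage-path=/user/jethro/instances -cache-path=/mnt/jethro_cache -cache-size=300G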

This video shows how to connect Jethro to Hadoop:

https://www.youtube.com/watch?v=4zOEpdiBQjA

 

For instances with storage NOT on Hadoop, add the following parameter to the end of the command:

-Dstorage.type=posix


 

After creating the instance, you can start the Jethro service by running:

service jethro start

Analyze data

Analyzing the data will help you determine the correct data type for each column, and it will also create the table creation script and description files that are needed to load data.

Download the Data Analysis tool

Download the tool using this link:

https://github.com/JethroData/AnalyzeData/archive/master.zip

Unzip the file and then go to the AnalyzeData-master directory.

Analyze sample data

Use the AnalyzeData script to analyze the data.

It requires python to be installed on the server.

It also uses an optional python package called “tabulate” to generate the data analysis report. To install tabulate, use the command:

TABULATE_INSTALL=lib-only pip install tabulate

If pip is not installed, install it using the command:

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py

Note: Installing tabulate requires root or sudo access to the server. If you do not have root or sudo access, you can still use AnalyzeData. In this case, it will fall back to csv mode when displaying the results.

Run Analyze data with the following optional parameters:

AnalyzeData.py [-i <rows to read>] [-d <delimiter>] [-q <quote char>] [-n] [-c] [-g <table name>] [<input file>]

-i: Number of rows to read from the input. Default=unlimited.

-d: The input data delimiter. Default="," (comma).

-q: The input data quote character. This allows a delimiter character inside quotes to be treated as part of the data. Default=" (double quote).

-n: If specified, treats the first row as a header row containing the column names.

-c: CSV formatted output. Write the output report as a tab delimited file instead of a formatted table. Installing tabulate is not required in this mode.

-g: Generate a create table script and a description file using the given table name.

input file: The input file to read. If not specified, read from standard input.

It is recommended to limit the input to 1 million rows (-i 1000000).
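For example, one possible invocation, assuming a pipe-delimited file named customers.csv with a header row (the file name is a placeholder), generating scripts for a table named test:

python AnalyzeData.py -i 1000000 -d "|" -n -g test customers.csv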

This will generate a report such as the following:

 

Number: The column serial number.

Name: The column name, if the data contains headers. Otherwise it is c1..cN.

Rows: The number of rows for the column.

Type: The suggested type to use based on the data. A non-string type is suggested when non-string values are found and there are no more than 5 distinct exception values.

Category: For certain values, a category can be detected based on regular expressions. It also specifies “Primary Key” when a column has unique values and “High Cardinality” if it has many unique values.

Percent: The percentage of the values of the suggested type out of all values.

Exceptions: A list of up to 5 exception values. Exception values are values that do not match the suggested type.

Distinct: The number of distinct values.

Samples: Sample values of the suggested type.

In addition, if the -g parameter is specified with a table name, then a create table script and a description file are generated based on the data. For the above data, the following scripts are generated as test.ddl and test.desc when given the table name "test":

 

create table test
(
  name STRING,
  age BIGINT,
  birth_date TIMESTAMP,
  balance DOUBLE,
  phone STRING
);

table test
row format delimited
  fields terminated by '|'
  null defined as 'NULL'
OPTIONS
  SKIP 1
(
  name,
  age,
  birth_date format='yyyy-M-d',
  balance null defined as '-',
  phone
)

You can view the report to determine the following:

  1. Is the suggested type acceptable?
  2. Is there more than one exception value? Jethro supports only one NULL value for each column, so if there is more than one exception value you either need to modify the values before loading or load this column as a string.
  3. How many distinct values are there? For a string column with many distinct values, you may choose not to load it, or to truncate the values so they become less unique (in the case of URLs, for example).

Then, you can modify the table create script and description files if necessary:

  1. Assign column names if they were not provided as part of the input.
  2. Change types as required.
  3. Add a partition definition.

For example, to partition the above table by birth_date every month, add the following just before the create table ending semicolon:

Partition by range(birth_date) every (interval, '1', month)
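Using the generated test.ddl from above, the modified create statement would look something like this sketch:

create table test
(
  name STRING,
  age BIGINT,
  birth_date TIMESTAMP,
  balance DOUBLE,
  phone STRING
)
Partition by range(birth_date) every (interval, '1', month);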

 

  4. Change description file options, like rows or columns to skip.
  5. To learn more about partitions and description files, consult the reference guide:

https://jethrodownload.s3.amazonaws.com/JethroReferenceGuide.pdf

Load data

Once the data is analyzed and the scripts are created, you can start loading your data.

Create tables

Before loading data, you need to create tables in Jethro.

Start the JethroClient using the command:

JethroClient <instance name> localhost:9111 -p jethro

 

Copy the create table command and paste it at the command prompt.

Alternatively, you can run the JethroClient with the -i option specifying the ddl file of the table. For example:

JethroClient <instance name> localhost:9111 -p jethro -i test.ddl
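For instance, continuing the earlier sketch where AnalyzeData generated test.ddl and the instance is named demo (a placeholder name), creating the table might look like:

JethroClient demo localhost:9111 -p jethro -i test.ddl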

Load data to tables

Once the tables are created, you can start loading data.

To load data to a table use the command:

JethroLoader <instance name> <description file> <input files or directories>

 

If the data is not in text format, you may need to make it accessible to Jethro first. In this case, you can use a pipe to forward the data to the Jethro loader, specifying "stdin" instead of files or directories. For example, to load data from a Hive table, you could run:

hive -S -e "select * from <db>.<table>" | JethroLoader <instance name> <description file> stdin

 

The load process may take some time, so it is recommended to run it in the background by adding "&" at the end of the command.

When the loader starts running, it will print the path to its log file. You can view this file using tail -f to monitor the load progress.
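Putting it together, a hedged sketch of a background load and how to monitor it, again using the placeholder instance demo, the description file test.desc, and a data file customers.csv:

JethroLoader demo test.desc customers.csv &
# the loader prints the path of its log file on startup; monitor it with:
# tail -f <printed log file path>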

You can learn more about loading data from the reference guide and from the following video:

https://www.youtube.com/watch?v=Dol7TyTbgVU&t=27s

Connect BI tool and run queries

Once data is loaded, you can connect your BI tool and start working with the data.

Download connection drivers

Connection to Jethro is made through ODBC or JDBC. You first need to download the Jethro drivers from here:

http://jethro.io/driver-downloads

Install the ODBC driver

After downloading the installation file, double click on it to start the install process.

If Tableau is installed on your box, the installer automatically detects that and will ask you to copy the Jethro.tdc file to the proper location.

Create a connection

After installing the ODBC driver, you need to create a connection to Jethro.

Open "ODBC Data Sources" and go to the "System DSN" tab.

Click the "Add" button, select "JethroODBCDriver" from the list of drivers, and click "Finish".

Fill in the details. Choose a data source name. In the Host field, specify all the Jethro servers connected to the instance in the format <host>:<port>;<host>:<port>;…;<host>:<port>. This allows the ODBC driver to load balance across the servers by sending each query to a different one. You can leave the Port field empty. In the Instance field, specify the Jethro instance name. The Login ID is "jethro" and the password is also "jethro", if you did not change it.
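For example, assuming two query servers with the hypothetical host names jethro1.example.com and jethro2.example.com, both listening on port 9111 (the port used elsewhere in this guide), the Host field would contain:

jethro1.example.com:9111;jethro2.example.com:9111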

Click OK to save the connection.

Install the JDBC driver

After downloading the JDBC zip file, uncompress it and put its content in a folder of your choice.

The zip file contains 2 jar files: JethroDataJDBC.jar and protobuf-java-2.4.1.jar.

It is important to use both when defining the driver for your tool.

For example, to configure SQL Workbench to connect to Jethro, first add the JDBC driver:

Then, create a new connection providing the connection details:

Connect BI tool

Once a connection is created, you can use it in order to connect your BI tool.

The following video shows how to connect Tableau to Jethro:

https://www.youtube.com/watch?v=secprIbz4tY&t=250s

Setup Automatic Microcubes Generation

Microcubes provide interactive speed data access for queries with a low number of filters by generating and caching the results for all value combinations of a given query pattern.

Microcubes can be generated automatically from the queries sent by users. It is recommended to set up a separate Jethro server that is dedicated to cube generation. A typical environment should include at least 2 Jethro servers: one server runs the maint process and is used to generate the microcubes (this server can also be used to load data), and any other server is used to query data.

We assume that the existing server will be used for microcubes generation and any new server will be used for queries.

Set up query servers

  1. Set up the new Jethro servers.
  2. On each server, attach the server to the instance by running the following command (a worked example follows this list):

JethroAdmin attach-instance <instance name> -storage-path=<storage path> -cache-path=<local cache path> -cache-size=300G

Replace <instance name> with the name of the instance, <storage path> with the full path of the storage directory on Hadoop, and <local cache path> with the full path of the local cache directory. Set the cache size according to the size of the local cache directory.

For instances with storage NOT on Hadoop, add the following parameter to the end of the command:

-Dstorage.type=posix

  3. Start the Jethro service on each server:

service jethro start

  4. Edit any ODBC/JDBC connection to include the list of query servers that you set up.
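As a sketch, attaching a second query server to the same placeholder instance demo used earlier, with the same placeholder storage and cache paths:

JethroAdmin attach-instance demo -storage-path=/user/jethro/instances -cache-path=/mnt/jethro_cache -cache-size=300G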

 

Configure autocube generation

  1. On the first server that you set up, where you created the instance, edit the file:

/opt/jethro/instances/<instance name>/local-conf.ini

  2. At the end of the file, add the following 3 lines:

dynamic.aggregation.auto.generate.enable=1
dynamic.aggregation.auto.update.incremental.cubes.enable=1
dynamic.aggregation.auto.generate.execution.hosts=localhost:9111

  3. Save the file.
  4. Restart the Jethro service:

service jethro restart

 

Now, microcubes will be generated automatically based on the queries that are being sent by users.

 

Generate microcubes manually

If you do not want to allocate a server for microcubes generation, you can still generate microcubes manually for specific queries.

Connect to the server using JethroClient or any other SQL client and run the command:

generate cubes <cube name> from <sql query>

 

You should replace <cube name> with a cube name of your choice.

This command instructs the server to try to generate microcubes for the given SQL query. If microcubes can be generated for the query, they will be generated and you will get a response. Otherwise, you will see a response with the reason why the microcube was not generated.
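For instance, a hedged example run from JethroClient against a hypothetical sales table (the cube name, table, and columns are placeholders):

generate cubes sales_by_region from select region, count(*), sum(amount) from sales group by region;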

To see the existing microcubes, run the following sql command:

show cubes

Webinar: Advancing From Tableau Extracts to Live Connect on Hadoop

Learn how Jethro overcomes Hadoop latency pitfalls to deliver interactive Tableau on Hadoop

In this on-demand webinar hosted by Jethro CEO Eli Singer, you will learn the benefits of "live connect" vs. "extracts" in Tableau and understand the performance pitfalls involved. Eli will discuss how Jethro solves Hadoop performance issues and enables Tableau users to live-connect to their data sets on Hadoop without sacrificing speed.

Eli Singer Discusses How to Make Business Intelligence Work on Hadoop

Jethro CEO Eli Singer discusses the transition from traditional EDWs to Hadoop and the resulting latency issues. He explains why this is problematic for Business Intelligence use cases and describes the ideal SQL-on-Hadoop architecture to solve the problem, showing how to enable Tableau, Qlik, and MicroStrategy to deliver business users an interactive business intelligence experience on Hadoop.

Database Architectures for Business Intelligence Demystified: A Concise Comparison


Introduction

The year 2008 was one of drastic change in the world economy and in society. Major financial institutions collapsed, while Internet social networking service providers such as Facebook and Twitter (and later Uber and WhatsApp) grew exponentially.

This resulted in enormous changes in the database (DB) arena, which had been based on the RDBMS (Relational Database Management System) paradigm since the late 1970s. This paradigm ensured high consistency and data integrity on one hand and high availability on the other, and it was popular within the enterprise world, especially among financial institutions that could afford the high maintenance costs and the effort required to scale up their databases.

However, the new social media companies required something drastically different. The new service providers preferred high availability and the ability to scale their networks with low CapEx and OpEx in terms of hardware costs and personnel.

Data stores cannot support availability, consistency and partitioning (scaling) simultaneously (according to the CAP theorem), and this provided an opportunity for new radically different data storage products that began leading in many segments.

But are the new products really significant, and should you use them? And, most importantly, do they work with the leading Business Intelligence (BI) tools in the market? In this paper we explore these questions and provide clear guidelines as to which products we recommend, and for what purpose.

What data stores should you consider?

Data stores only become relevant when people and organizations use them. Important considerations when choosing a data store include the ease of report generation and analysis, and whether it is supported by Business Intelligence (BI) tools like Tableau, Qlik or Microstrategy. When a BI software vendor invests significant resources to support a data store then this is a good indication that the data store is worth considering.

Most of us are familiar with the most common RDBMS systems, such as Oracle, MySQL, Microsoft SQL Server, IBM DB2, and Sybase ASE, which still lead the data store market today (see the db-engines.com ranking scores). However, there are new data stores on the rise.

According to our analysis, and data from the DB-Engines knowledge base, columnar databases (Cassandra, HPE Vertica, Sybase IQ, MonetDB, and SAP HANA), MPP (Massively Parallel Processing) databases (Redshift, IBM Netezza, Pivotal Greenplum, Teradata, Snowflake, and Hadoop Hive), search engines (Splunk, Solr, Attivio), and document data stores (MongoDB, Couchbase, MarkLogic) are gaining significant market share and becoming mainstream.

So what is causing this dramatic shift? Why have so many new data store categories been created in the last 10 years, and what business purposes do they serve? We will provide some possible answers to these questions below.

RDBMS: Relational Database Management System

Short intro

RDBMS systems are probably the most solid data store solution to date and have been at the heart of the most common information storage solutions since the 1970s. The design of RDBMS systems best fits enterprise requirements where no transaction can afford to be lost and high availability is a priority.

Leading products in this category

The leading products in this category are Oracle DB, MySQL, Microsoft SQL Server, IBM DB2, Sybase ASE, and Postgres.

Key architecture

  • Presents data to the user as relationships between entities. Each entity is presented in a tabular format and each table consists of a set of rows and columns.
  • Provides relational operators such as JOIN and UNION to manipulate the data in its tabular format
  • The most common standardized query language is ANSI SQL, although most vendors use extensions to this standard
  • Data is saved fully normalized

Strengths

  • Strong consistency of the data
  • Includes mature High Availability (HA) solutions
  • Minimal redundancy of data
  • Independence from the physical data, since it can be stored using various storage engines, and can also be retrieved in the same way

Weaknesses

  • Poor representation of ‘real world’ entities, and requires substantial efforts by software engineers to transform the object-oriented data to an entity model
  • A nonlinear and complex scaling solution
  • High Database Administration (DBA) and maintenance costs
  • Difficult to represent complex data types, since the RDBMS imposes a rigid, tabular data structure
  • In large data stores it is difficult to change the table structure without causing major downtime, or a slowdown in the system
  • No support for recursive queries. This limits support for hierarchical data models such as organization structure

Common use cases

  • Financial transaction systems
  • Low volume inter-organization data entry systems

RDBMS and BI

For decades BI tools have been designed to work with RDBMS systems that support JOINs between tables and SQL; however, 'big data' RDBMS deployments often fail to scale, resulting in poor BI performance.

Tidbit

E. F. Codd, from IBM, coined the term "relational database" in 1970, introducing it in his paper "A Relational Model of Data for Large Shared Data Banks".

Document Data Stores

Short intro

Document data stores are at the heart of many modern information systems. Software engineers love their many useful features, such as built-in sharding, easy-to-set-up clusters, end-to-end support for JSON, and rapid deployment that avoids the need for a DBA in the initial phase.

Leading products in this category

The leading products in this category are MongoDB, Couchbase, and MarkLogic.

Key architecture

Document data stores save data as JSON-like documents in collections/tables. The document structure is flexible and every document can have different fields. For this reason, when you want to add new fields you do not need to make any change in existing documents or table structure. Data can be retrieved easily from the data store using predefined indexes.

Strengths

  • Support rapid development, since software engineers only need to serialize data for data persistence
  • Schema-less design enables support of new fields in a populated table without downtime or effects on the user experience.
  • Data is stored in the JSON format that is completely fit for full stack Javascript platforms
  • High insertion rates due to built-in shards
  • Very easy to setup and maintain. Operational cost is usually 10% of a RDBMS system of the same size

Weaknesses

  • Poor support of JOIN capabilities between entities
  • If the document structure is complex, it is difficult to use BI tools and structured queries
  • Unlike RDBMS systems that assign default values to fields and have data validation rules, when there is no schema these tasks must be managed at the application level.

Common use cases

  • Content Management
  • Catalog systems
  • Analytics systems
  • Billing systems where mediation can be avoided

Document Data Stores and BI

Most BI tools have only recently begun supporting document data stores, since most of the data stores did not support SQL. Even today, the lack of JOIN operations between entities still limits the type of analysis and dashboards you can create based on these document data stores.

Tidbit

MongoDB is considered the 4th most widespread and widely-used data store according to the db-engines.com knowledge base, just behind the 3 RDBMS products that lead the market (Oracle, MySQL and Microsoft SQL Server).

Columnar Databases

Short intro

Columnar databases store data in indexes and enable near real-time query response, and high insertion rates of 1 GB/s. The databases support SQL and are efficient in terms of storage and memory usage.

Leading products in this category

The leading products in this category are Cassandra/DataStax, HPE Vertica, Jethro, Sybase IQ, MonetDB, and SAP HANA.

Key architecture

Columnar databases store data as columns rather than as rows. They are designed to retrieve specific columns quickly without needing to retrieve the complete row and discard non-relevant fields. Data is stored in indexes rather than in data tables, so there is no need to retrieve data from the tables themselves.

Strengths

  • Storage and memory use is efficient as every column is encoded and values are stored in index nodes
  • Near real-time access to data (1-2 ms for read/write), based on data on index leaves
  • Standard SQL is nearly fully supported

Weaknesses

  • Data is modeled for specific queries
  • If multiple queries are needed from the same data, then the data may need to be stored twice
  • The SELECT * FROM pattern that is widely used in software programming should be avoided otherwise it may impact performance

Common use cases

  • Authentication systems that can verify a user in O(1)
  • Personalization systems that can return user data in O(1)
  • Real-time bidding systems and online exchanges
  • Real-time analytics where the insert rate is a key requirement
  • Large data sets that exceed 1 TB in a single table/entity
  • Retail systems where branch independence is key. Columnar databases allow users to read/write to each node even when the network between a retail branch and its headquarters is cut off
  • Financial transaction real-time risk management systems

Indexes

Columnar databases are based on indexes. The indexes enable fast access to data by creating a tree-like search pattern where every node serves as a decision point, avoiding searching half of the remaining items in the table. However, indexes best serve specific queries that fit the specific index, and this may result in poor performance for other queries.

Columnar Databases and BI

Most columnar databases support SQL, and some support JOIN operations. These features position them as the best candidates for BI on “Big Data.” However, since data must be designed for specific queries, it is difficult to provide flexible queries for BI analysts.

Tidbit

Jethro, which provides a columnar database based on Hadoop, has solved the low-flexibility challenge of columnar databases (and indexes in general) by integrating search indexing (see below) into its solution. This makes Jethro the best fit for 'big data' BI use cases.

MPP: Massively Parallel Processing Data Stores

Short intro

MPP provides rapid and powerful analytics on petabyte-scale data volumes. This category used to be known as the data warehouse appliance category, and it specializes in enabling ad-hoc queries on large-scale data repositories for long-term analytics.

Leading products in this category

The leading products in this category are Amazon Redshift, IBM Netezza, Pivotal Greenplum, Teradata, Snowflake, Google BigQuery, Hadoop Impala, Hadoop Hive, Hadoop Drill, Hadoop Hbase, Facebook Presto, Spark SQL, MemSQL, EXASolution, and Kognitio.

Key architecture

  • Shared-nothing architecture, where the entire query is executed on the nodes, with an emphasis on minimizing data movement
  • Data is stored on different nodes
  • Queries are split into different nodes, and only summarized and compacted results are returned from each

Strengths

  • The data architecture is flexible enough to support any query with a dedicated index, or table model design
  • Supports petabyte-level data requirements
  • Supports the requirements of the modern offline machine-learning model

Weaknesses

  • Poor support for real-time and near real-time use cases as query results can take between several seconds and several hours
  • BI queries require massive scanning of data, resulting in high utilization of server resources
  • Requires dedicated hardware, or large-size installations. MPP is not recommended for installations with less than 10 TB of data, or those with more than 100 TB of data.
  • Scalability is limited (a maximum of 100-200 nodes)

Common use cases

  • Large data warehouses with over 10 TB of data
  • Long-term analytics
  • Ad-hoc queries against that data
  • Enterprise level data lakes
  • Data store for training machine learning systems and models

Cubes

Cubes are multi-dimensional tables, with dimensions such as period, product, and city, where data can be easily sliced and summarized according to one or more of the dimensions. As cubes consist of large normalized tables, they are best suited for data warehouses.

MPP and BI

MPP provides high flexibility that enables BI analysts to create any query they want, and many MPP products support SQL. However, MPP data stores often scan massive amounts of data in order to return results, which may lead to high latency and extensive usage of server resources (CPU, RAM, and disk).

Tidbit

IBM Netezza is sold as a dedicated appliance that includes FPGA to augment CPUs and minimize network bus traffic.

Search Engines

Short intro

The search engine (also known as Google) is the heart of the internet. As enterprise software and SaaS become more content-aware and support larger data sets, the use of internal search engines to locate relevant data, or to create context- and geolocation-aware results, is becoming mainstream.

Leading products in this category

The leading products in this category are Splunk, Solr, and Attivio.

Key architecture

During the ETL (Extract, Transform, and Load) phase, search engines analyze the input data and create terms and frequencies. These frequencies serve as a basis for indexes and inverse indexes that enable searching of the data modeled as documents, including fields and attributes such as geolocation.

Strengths

  • Support for full-text searches, where users can query any field and get results in near real-time
  • Support for spelling corrections
  • Custom search models, where the order of results can be boosted based on given attributes (such as distance from a given location)

Weaknesses

  • Search engines are not designed to be a primary data source
  • Results may not be 100% accurate and may not be repeatable
  • Results may depend upon the hardware load (RAM, CPU and disk utilization)
  • Data load is CPU intensive

Common use cases

  • Log aggregation and analysis
  • Content management system internal search capabilities
  • Internal website search
  • Auto-completion and support for fuzzy searches (spelling mistakes)
  • Location-based systems that adjust search results and offers based on the user's location

Search Engines and BI

Much like document data stores, most search engines do not support SQL or JOIN operations. This limits the types of analysis and dashboards you can create based on these data stores.

Tidbit

Although Apache Lucene is supported by only a few BI vendors, it is actually a key component in the fast growing search products Solr and Elasticsearch.

Summary

Various business requirements have resulted in major changes in the database and data-store landscape over the past 10 years. From an industry with only a single technical solution, the glorified RDBMS (best fit for enterprises), we have moved to a wide range of products, each fit for a different business requirement. These products now match enterprise requirements and are supported by the leading BI providers.

Strata Hadoop NYC 2016


We’re proud to launch Jethro 2.0 at Strata + Hadoop NYC next week. You can read an article about it in Datanami.

Come by booth 736 to grab some Jethro swag, see live demos by our CTO Boaz Raufman, and chat BI on Hadoop with our CEO Eli Singer and the rest of the Jethro team.

In case you missed last week's Jethro 2.0 webinar, you can check it out here: view webinar

Contact us to schedule a specific meeting time.