Webinar: Enterprise Interactive BI on Hadoop with Data Realty and Tableau

Learn How Data Realty Delivers Analysts Business Intelligence at Interactive Speed on Big Data Applications.

Join Thomas Panozzo, CTO at Data Realty, together with Vaidy Krishnan, Product Marketing at Tableau and Mike Kravec from Jethro for an exclusive look into how Data Realty leverages Jethro and Tableau to deliver users healthcare insights at the speed of thought. 

Webinar: Under the BI on Big Data Hood with Symphony Health and Qlik

Learn How Symphony Health Delivers Analysts Business Intelligence at Interactive Speed on Big Data Applications.

Join Sue Davis, Director Application Development at Symphony Health, together with David Freriks, Emerging Technology Evangelist at Qlik and Eli Singer, Jethro CEO for a peek into how Symphony Health leverages Jethro and Qlik to deliver analysts business insights at the speed of thought. 

Define and Create External Tables in Jethro 3


Installation Prerequisites

Before starting to use external tables, ensure that the conditions specified in the following sections are met according to your environment:

Java Installation

  1. Check which Java version is installed on your machine by running the command:
    java -version
  2. If the installed Java version is lower than 1.8:
    1. Uninstall all previously installed versions of Java by using one of the following methods:
      1. Run the command:
        sudo yum remove {JAVA version}

        For example:

        sudo yum remove java-1.7.0
      2. Show all installed packages on the server and locate the relevant Java packages for removal:
        yum list installed

        Remove the located java packages:

        sudo rpm -e {package name}

        For example:

        sudo rpm -e java-1.7.0-openjdk.x86_64
    2. Install Java 1.8 by using one of the following methods:
      1. Run the command:
        sudo yum install {java version}

        For example:

        sudo yum install java-1.8.0
      2. Install Java from an RPM package with JRE 1.8 downloaded from Java’s site:
        1. The latest Java version can be found here:
        2. Copy the link of the Linux x64 RPM
        3.  wget {copied link}
        4.  mv {downloaded file} java1.8.rpm
        5.  rpm -Uvh java1.8.rpm
      3. Additional information regarding this installation process can be found here:
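To confirm the result before moving on, the installed version can be checked with a short shell snippet. A minimal sketch; the version string below is a hard-coded stand-in for real `java -version` output:

```shell
# Stand-in for: java -version 2>&1 | awk -F '"' '/version/ {print $2}'
ver="1.8.0_181"
# "1.8.0_181" -> field 2 when split on "." -> "8"
minor=$(echo "$ver" | cut -d. -f2)
if [ "$minor" -ge 8 ]; then
  echo "Java $ver meets the 1.8 minimum"
else
  echo "Java $ver is too old - install 1.8 first"
fi
```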

Hive client

The latest hive-jdbc version needs to be installed through yum:

 sudo yum install hive

For that purpose, make sure you have already retrieved the relevant installation repo file and installed the necessary Hadoop components for your Hadoop distribution, as specified in the Jethro Installation Guide – Installing Jethro on Hadoop (“Configuring Jethro Connection with Hadoop” section).

For example, for the Cloudera CDH 5.x distribution:

  1. Switch to user root.
  2. Retrieve the installation repo file by running the following commands:
     rpm --quiet --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
     wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
  3. Install Hadoop client component:
     yum -y localinstall cloudera-cdh-5-0.x86_64.rpm
  4. Install hive-jdbc:
     yum -y install hive
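Once the client is installed, the HiveServer2 JDBC URL it will use can be assembled and sanity-checked. A hedged sketch; the host, port, and database below are placeholders rather than values from this guide, and the commented `beeline` call is one way to test connectivity:

```shell
# Assemble the HiveServer2 JDBC URL from its parts (placeholder values)
build_hive_url() {
  printf 'jdbc:hive2://%s:%s/%s' "$1" "$2" "$3"
}
build_hive_url hive-host 10000 default
# A connectivity check could then be (not run here):
#   beeline -u "$(build_hive_url hive-host 10000 default)" -e 'SHOW TABLES;'
```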

Using External Tables

Create External Data Source in Jethro Client

Defining a connector between Jethro and a foreign Hive data source (schema).

Basic Syntax

CREATE EXTERNAL DATA SOURCE <new_data_source_name>

TYPE = HIVE

LOCATION = [sub-protocol://]<host>:<port>/<db>

credentials = <user name>/<password>

For example:

CREATE EXTERNAL DATA SOURCE hive_simple_test TYPE=HIVE LOCATION='jdbc:hive2://' credentials=''/'';


Note: A general connector can be created without specifying a database name. In this case, the reference to the schema will be made later, upon external table creation.

For example:

CREATE EXTERNAL DATA SOURCE hive_simple_test TYPE=HIVE LOCATION='jdbc:hive2://' credentials=''/'';

CREATE EXTERNAL TABLE hive_ext_jethro_simple_test
(<column list>)
DATA SOURCE= hive_simple_test


Additional information regarding External Data Source can be found here.

Create External Data Source With KERBEROS in Jethro Client

Defining a connector between Jethro and a foreign Hive data source (schema) with KERBEROS.

Basic Syntax

CREATE EXTERNAL DATA SOURCE <new_data_source_name>


LOCATION = [sub-protocol://]<host>:<port>/<db>;principal=<principal name>/<server name or IP address>@<KERBEROS domain> 

credentials=<user name>/<password>

For example:

CREATE EXTERNAL DATA SOURCE hive_simple_test TYPE=HIVE LOCATION='jdbc:hive2://;principal=hive/ip-10-1-1-116.us-west-2.compute.internal@JETHRO.COM' credentials=''/'';


  • An active Kerberos ticket for the jethro user is needed.
  • Use a principal which has access to Hive (in the example above: “hive”).
  • Use a server name/IP address (in the example above: “ip-10-1-1-116.us-west-2.compute.internal”) that is connected to the Kerberos domain (in the example above: “JETHRO.COM”).
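The LOCATION string for a Kerberized connector follows a fixed pattern, so it can be assembled from its parts. A sketch using the example values from this section:

```shell
# Build the Kerberized LOCATION value: principal, server, and realm
kerberos_location() {
  printf 'jdbc:hive2://;principal=%s/%s@%s' "$1" "$2" "$3"
}
kerberos_location hive ip-10-1-1-116.us-west-2.compute.internal JETHRO.COM
```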

Create Target Table in Jethro Client

Creating the target table in Jethro, which will be populated with the data drawn from the Hive data source table.

For example:

create table jethro_simple_test
(<column list>)
Create External Table in Jethro Client

Creating a table in Jethro which is mapped to a foreign Hive data source table. It will serve as an intermediary between the Hive source table and the Jethro target table, and will allow loading the Hive source table data into the Jethro target table using the INSERT INTO command.

Basic Syntax

CREATE EXTERNAL TABLE <new_external_table_name>
(
<column name> <column type>,
<column name> <column type>
)
DATA SOURCE = <data_source_name>

LOCATION= <table_name_from_Hive_data_source>

For example:

CREATE EXTERNAL TABLE hive_ext_jethro_simple_test
(<column list>)
DATA SOURCE=hive_simple_test



Additional information regarding External Table can be found here.

Loading Data Into Jethro Target Table in Jethro Client

Basic Syntax

Insert into <Target Table Name in Jethro> select * from <External Table in Jethro>

For example:

 insert into jethro_simple_test select * from hive_ext_jethro_simple_test;
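Putting the section together, the statements can be collected into one script and piped into the Jethro client. A hedged sketch; the table-creation statements are omitted here because their column lists are elided in the examples above, and the commented client invocation uses placeholder instance, host, and password values:

```shell
# Collect the flow's statements so they can be reviewed before running.
sql=$(cat <<'SQL'
CREATE EXTERNAL DATA SOURCE hive_simple_test TYPE=HIVE LOCATION='jdbc:hive2://' credentials=''/'';
INSERT INTO jethro_simple_test SELECT * FROM hive_ext_jethro_simple_test;
SQL
)
echo "$sql"
# To run against a Jethro instance (placeholders):
#   JethroClient <instance> <host>:<port> -p <password> <<< "$sql"
```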


Now Releasing Jethro Manager

Jethro Manager adds a Web UI to the Jethro Engine

Jethro Manager is part of Jethro 3.0.

Jethro is proud to release a new Web UI that will allow its users to manage and perform actions on their Jethro servers using a remote web browser.

The first version of the product focuses on the features that are most commonly used, the activities that are most time consuming, and the tasks that can benefit users the most through the help of a visual guiding mechanism:

  • Create & attach Instances
  • Load from CSV data files located on a Local/Network/Hadoop file system
  • Load directly from Hive files of any format
  • Monitor load statuses and logs in real-time

Having a Web UI not only reduces the need to use command lines on Linux machines, but also takes advantage of the new interface to prevent users from repeating commonly known mistakes. The UI guides users with on-screen tips and rules that will better optimize their Jethro Server performance.

Let’s see a few screenshots from each feature, to better understand how it’s done.

Create & attach Instances

The ability to create and attach instances in Jethro Manager (JM) is designed through a wizard with three simple screens. The first one identifies the type of storage platform. The second navigates to the location where the Jethro Instance files will be stored within that storage. The third one asks for the name of the instance and the location and size of its local cache storage.

If an unattached instance is already found on the storage path provided, JM will offer the user the ability to simply select it from a dropdown list.

[Screenshot: Attach Instance]

Upload data from files

The data upload flow consists of three screens. First a user chooses the type of data source. Then a user can navigate through folders, and select the files (or folders of files) to be loaded. On the third screen a user can map the source data to the target tables in Jethro, and to initiate a load process.
Users who are already familiar with Jethro will be glad to hear that no more editing of description files is required.

Going back to the second screen – once the file(s) have been chosen by the user, JM will try to identify the delimiter character automatically and preview the file's data in color, together with other parameters that can be selected and that affect how the file will be processed during the load.

[Screenshot: Data Upload]

On the third screen the user will be asked to verify the mapping of columns from the source file(s) chosen on the previous screen, to a target table in Jethro. It can be a new table, or an existing one.

When creating a new table, JM will try to suggest the data types it sees as the best fit for the source data, while considering Jethro’s optimal preferences. The user will still be able to change the recommendations according to his own preferences.

When choosing an existing table to load the source data into, JM will try to identify the best existing table based on the data types identified on both sides of the mapping, and will generate a match percentage score that appears next to each table name on the list. This score helps users with hundreds of tables identify their required target table and prevents mistakes such as loading data into the wrong table.

JM also provides indications of the match quality between the linked source and target columns, flagging mappings it thinks are wrong or could be better optimized. If a mapping between two columns doesn't match, JM notifies the user via colors, icons, and tooltips.

[Screenshot: Select Data]

The user will also have the option to select specific columns on both the target and the source, and to preview their data side by side. Again, this helps prevent commonly seen mistakes.

Loading data directly from Hive or Impala

The process of loading from Hive or Impala is practically the same as loading from files, which we've seen previously. The only difference is that you will query your Hive/Impala metastore for a table list instead of browsing the directory file tree.

Monitoring load statuses and logs in real-time

The load manager screen allows users to monitor the progress of load processes and, in case anything goes wrong, to access the logs.

JM offers the ability to repeat a load process with a single click. If a load process failed due to a simple mistake that can be fixed, the user can jump back to the historical load flow definition and change whatever is required to make the load process succeed.

[Screenshot: Loads Manager]

Solving Common BI on Hadoop Performance Pitfalls


Your company has recently made the jump to big data, and now that a few months, or even years, have passed, you think your team has a firm grip on getting data in and out of the environment, right? If you are thinking this, you are probably also discovering that you aren't really sure how to easily hook one of your Business Intelligence (BI) tools to this data, run it against the data you really want – i.e., a subset – and produce fast, rapid-response results. These three simple tasks are almost universal to any BI request, yet they can wreak havoc on teams in the big data age: so many services are available, but none provides a solution to all of them.

Oftentimes this will lead us elsewhere, purchasing expensive alternatives from third-party companies to manipulate, display, and provide meaningful access to the data for our users and ourselves – and in most cases, what are we left with? The sad truth is that we're left supporting multiple products or going back to what is familiar, like SQL Server or newer in-memory alternatives such as Microsoft's Analysis Services Tabular models; neither of which we want to do.

In this article, we’ll explore some of the various ways we can answer the question of how do we do BI from big data and compare alternatives with Jethro’s approach. Most of the alternatives mentioned in this piece have been explored and even implemented (sad to admit) by me before big data really grew into what is available now. So please enjoy and feel free to provide input or email comments and share experiences with me and the rest of the Jethro team so we can potentially advise and help through similar situations and avoid the pitfalls described here.

Core Hadoop

One of the most common, and frankly naïve, ways people approach BI with Hadoop, which generally ends in failure, is using core Hadoop to power BI reporting with tools such as Tableau, SSRS, or some other third party system like Business Objects, etc. When we talk about core Hadoop we are referring to the following services:

  • HDFS
  • MapReduce/MR2 YARN
  • Hive/HiveServer 2
  • Zookeeper

Now don't get me wrong, MR2/YARN and the others listed above are excellent at what they are meant to do, but one of these services is often misrepresented as a multipurpose solution that it is not: Hive. The purpose of Hive is simple – batch processing in and out, that's it. Working with a company a few years back, when there weren't many alternatives, I remember seeing demos with Hive tables and views as the source of a Tableau dashboard. At the time, we were looking for ways to present the data from Hive in Tableau more quickly and ultimately settled on SQL Server Analysis Services Tabular (more to come on this). I can clearly remember asking the presenter to manipulate filters via screen share, and then the awkward waiting as one, two, five minutes passed while a MapReduce job ran in the background, unbeknownst to the audience, to fetch the new data.

One of the major problems with this solution is the fact that Tableau, and other tools that rely on queries to refresh data, will cause MapReduce jobs to be launched as users explore and elect to change the initial set of data. There are clever ways around this, extracting the data all up front, etc., but oftentimes the cost for doing so is not worth it.

For example, take the connection below to a table with ~75M rows:

[Screenshot: table with ~75M rows]

Connecting Tableau to Hive + Spark

The screenshot illustrates how long the wait was for metadata generation (I killed it after two minutes). One could argue that limiting the initial import would fix this issue – i.e., with a LIMIT of 500,000 rows I could pull the data through – and that is true. But what do you do when your smallest dataset starts at 25M, 50M, or more rows?

You can spend many man-hours attempting to make this work with partitions, shrinking data into smaller tables, or other downstream tasks, but none of it will prove sustainable or provide the speed many of your customers expect.

Retro Solutioning

Retro Solutioning, which may be a coined term (not sure to be honest), is when you don’t know enough about the other Hadoop services outside of the core features and decide to stick with what’s “safe.” The resulting solution will utilize technologies your team has at hand and is comfortable with and is usually the second line of attack when initial plans don’t quite work out (see the previous section).

Although there are valid situations that may call for utilizing existing technology, and every company and situation is unique, more often than not it proves to be additional work and, by nature, additional processes for you to manage. Not only does this completely negate the purpose of Hadoop and big data, but it wastes your team's time and resources and has a high likelihood of becoming a band-aid solution instead of a scalable one.

Having been a Microsoft SQL developer for 10+ years prior to getting into big data, I felt this example, derived from my own experience, was very relevant to this topic. It involves utilizing Microsoft SQL Server and the Tabular Analysis Services features to make up for less-than-desired performance out of Hive. We'll start by outlining the hypothetical project and then go into the implemented solution and the results. As you read through it, think of how you'd approach this problem and whether it lines up or not (if it doesn't line up, then you may have done it right!)

The “Hypothetical” Project

Project: Reporting Dashboards
Original Systems:
Microsoft SQL Server (DB)
Microsoft .Net for dashboards – yikes, I know
Microsoft Reporting Services

Project Overview:

Take existing reports that have an MS SQL Server database source and upgrade them to Tableau, deprecating any use of .NET as a dashboarding interface and removing any reporting dependency on MS Reporting Services.


Project Requirements:

  • All reports should load in <10 seconds on initial browse
  • All reports should respond to filter changes within <10 seconds
  • Data source should hold at least 1 year for users to pull in active reports/dashboards
  • Solution should be able to handle a user base of 1,000 users, with 200 being concurrent

The total data size is 225 million rows

Seems straightforward enough, right? The answer: Maybe.

Knowing that plugging your reporting tool (Tableau, in this case) into HS2 is a no-no for BI when a performance SLA is involved, we needed to come up with a way to solve this quickly. Since we had SSAS talent and owned pretty beefy machines, we decided to aggregate where possible and push to memory using in-memory Tabular SSAS. We acknowledged that there would be overhead to develop the exports of data from Hive using ODBC and SQL Linked Servers, and that there would be duplication of data between what was stored in HDFS and Tabular, and we set off on our journey. What hurt us the most was the time it took to extract data from Hive into the in-memory area for SSAS; we were hit with almost the same overhead as if we had just connected Tableau directly to HS2.

Here is a breakdown of the different systems via SQL Linked Server and ODBC. Note: neither Impala nor Jethro was available to us at the time because we didn't even consider exploring them!


SQL Server ODBC Query Times


A couple of things to note from the above: 1. The query times are using OPENQUERY to import data into SSAS. 2. Each resulting set was against an 80M-row fact table unrelated to the one from our hypothetical project, and the testing scenario used a query that would generate a dimension with ~1,200 rows. 3. Hive is Hive + Spark, which definitely helped improve timing after a container was created.

So, with the time trials in mind, let's get back to the hypothetical project and import. The next screenshot (below) shows an import of one day, ~7.1M rows; although the time is not shown, this entire import took approximately 20 minutes. If you do the math, it would take, in theory, 632 minutes for the entire dataset (about 10 hours).

Table Import Wizard
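The extrapolation above can be sanity-checked with quick shell arithmetic; using the rounded figures quoted here (7.1M rows per 20 minutes, 225M rows total), the estimate lands within a couple of minutes of the 632 quoted above:

```shell
# ~7.1M rows take ~20 minutes; scale up to the full 225M-row dataset
awk 'BEGIN { printf "%.0f minutes (~%.1f hours)\n", 225 / 7.1 * 20, 225 / 7.1 * 20 / 60 }'
```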

Importing the data at an aggregate level for our reporting did help, knocking the row counts down to about half the total, but that is still well beyond the recommended size for in-memory Tabular Analysis Services solutions, and it showed when we hit it with Tableau. Query filtering took 10 seconds at best and could sometimes hit upwards of 30-60 seconds, depending on whether we were cascading filter boxes. We quickly learned that if we were seeing this kind of performance during development, while hitting these reports on Tableau Server, it was not going to work for our customers and would be a nightmare as we scaled up in size and user base.

Below shows the wait on opening the dashboard with Tableau Desktop. This would easily be prolonged when opening from the server, and it busts one of the SLA requirements.

The Pitfalls

We learned how to connect different systems to HS2; but we also learned that relying on it to feed a performance-driven solution was not going to work without one of the following:

  1. More hardware – faster and more expensive SSAS machines would certainly help – but where is the budget for that?
  2. Compromise on the solution: reduce the amount of data kept and displayed, assuming the customer would allow that – probably not

After reading this you may be saying, “well, there are many other solutions like Cognos, SAS, or AtScale that can be used as a layer to prepare data for BI,” and that is certainly true. However, in my humble opinion, those solutions are not retro, since most companies have not had them in-house for years like your SQL Server or Oracle databases (MS Access – don't even go there). The point of this section wasn't to bash older database technologies or discourage you from pulling data into them if that works for your customers' needs, but rather to illustrate how they may solve for the short term yet fail in the long run once we include the other aspects of what makes data “big.”

The Best of Both Worlds

How do we get the best of both worlds? That is the question of the day: what magical solution will allow us to get the benefits of Hadoop and power our BI while, at the same time, avoiding costly add-on tools that mimic familiar functionality? Sticking with your current BI tool, whether that is Tableau or Qlik, is the right start, but you also need to identify your organization's overall direction for BI and reporting. Oftentimes we jump into tools with punch lines such as “bring Hadoop to your analysts” or “data discovery at your fingertips,” and while these tools may be great at specific tasks, they ultimately fall short with BI.

Keeping in mind there may never be a “catch-all” solution, and why would there be – this is Hadoop and it’s an ecosystem, let’s break it down by some BI needs…

Power User Pandemonium

A situation derived from personal experience: an abnormal number of power users who have decent-to-good skills in SQL and databases and who really just want to get the data and manipulate it for their own reports. This BI nightmare – or dream, depending on how you look at it – often includes many ad-hoc queries tapping your source system and the use of easily accessible analyst tools like Microsoft Excel and Access to pull data in and do what is needed with it.

Report Stores

Having a bunch of reports that are maintained by a development team and utilized by analysts but are relatively trivial when inspected independently. We all have instances of this: “quick-hit” reports that were spun up to answer 20-30 users' constant inquiries and never went away.


Dashboards

The traditional dashboard requests that serve to answer questions from executive levels onward (depending on your organization). Dashboard development is the best part of BI, in my opinion, because it encompasses all the challenges associated with BI (learning the business, the data, and the dashboarding tool).

What’s the Answer?

Easy… there isn't one answer to handle it all, as I alluded to in the previous sections. But if you approach your BI challenges with an open mind to the different technologies, and to the combinations of them that can help you when solutioning, you are starting down the right path. For example, one of my favorite approaches to solving the BI needs we listed above (realizing there are more than those three) is using this stack:

  • Core Hadoop

Hive + Spark for batch processes and heavy lifting with HDFS, YARN, Zookeeper

  • Impala

Deployed to ad-hoc users and analysts who know their way around various SQL dialects and won't have the patience, or need, for Hive. Impala is a big win for organizations jumping into Hadoop that have a lot of power users, because it provides a happy medium between big data and what they [your advanced users] are used to working with.

  • Jethro

This is my SLA-saver. Impala works great and can be used for a lot of quick reports or outputs that a few users rely on, but it can easily fall short when it comes to speed and tight SLAs. If you are presented with a strict SLA, such as reports responding in <10 seconds or filters needing to be near-instant on change, Jethro is the answer. With access to HDFS and the new manager tool (Jethro 3.0), data can easily be imported and queried with many of the common BI tools; basically, any tool that can make ODBC connections can utilize the data within Jethro, making it very powerful.

  • Hue

Replaces many users' need for a tool similar to Microsoft's SQL Management Studio and makes the abilities of Hadoop visible. I find Hue is a great way to help users who are new to the Hadoop world understand its basic principles while still getting their data; i.e., they can browse HDFS, use the import table wizard, etc.

  • Sentry

Provides role-level security for Hive, Impala, and Jethro across your stack. This is a must when you are going to have advanced users playing around via Impala, because Impala shares the Hive metastore and you'll want to restrict your users so they don't accidentally drop objects or start creating duplicates.

The above listing, although high level, is an example of my favorite stack setup, which has proven successful for solving big data and BI needs because it focuses on ensuring there is a service for every type of scenario while avoiding getting backed into a technology corner, per se.

Related: Read our white paper on making interactive BI work on Big Data

Having worked as a database architect for 10+ years, beginning with Microsoft SQL Server 2000+, Analysis Services, and ultimately Big Data Hadoop, AJ Adams has come across many different challenges. He has worked with many different types of organizations covering a variety of industries including energy, government, healthcare, and sales/marketing; designing and implementing architecture for BI-related solutions using tools such as Tableau, Qlik, and others. With the growth of the size of data he moved into Hadoop, primarily Cloudera and services associated with the distribution specializing in administration, development, and overall stack setup. He believes one of the keys to success with making the move to big data is to approach it openly and focus on what sets of services can best handle the various needs versus battling to come up with a ‘one-solution-fits-all’ design.