Webinar: Under the BI on Big Data Hood with Symphony Health and Qlik

Learn How Symphony Health Delivers Business Intelligence to Analysts at Interactive Speed on Big Data Applications.

Join Sue Davis, Director of Application Development at Symphony Health, together with David Freriks, Emerging Technology Evangelist at Qlik, and Eli Singer, Jethro CEO, for a peek into how Symphony Health leverages Jethro and Qlik to deliver business insights to analysts at the speed of thought. 

Define and Create External Tables in Jethro 3

 

Installation Prerequisites

Before starting to use external tables, ensure that the conditions specified in the following sections are met according to your environment:

Java Installation

  1. Check which Java version is installed on your machine by running the command:
    java -version
  2. If the installed Java version is lower than 1.8:
    1. Uninstall all previously installed versions of Java by using one of the following methods:
      1. Run the command:
        sudo yum remove {JAVA version}

        For example:

        sudo yum remove java-1.7.0
      2. List all installed packages on the server and locate the relevant Java packages for removal:
        yum list installed

        Remove the located java packages:

        sudo rpm -e {package name}

        For example:

        sudo rpm -e java-1.7.0-openjdk.x86_64
    2. Install Java 1.8 by using one of the following methods:
      1. Run the command:
        sudo yum install {java version}

        For example:

        sudo yum install java-1.8.0
      2. Install Java from an RPM package with JRE 1.8 downloaded from the Java site:
        1. The latest Java version can be found here:
          https://java.com/en/download/linux_manual.jsp
        2. Copy the link of the Linux x64 RPM
        3.  wget {copied link}
        4.  mv {downloaded file} java1.8.rpm
        5.  rpm -Uvh java1.8.rpm
      3. Additional information regarding this installation process can be found here:
        https://docs.oracle.com/javase/8/docs/technotes/guides/install/linux_jre.html#CFHBHAGI

Hive client

The latest hive-jdbc version needs to be installed through yum:

 sudo yum install hive

Before doing so, make sure you have already retrieved the relevant installation repo file and installed the necessary Hadoop components for your Hadoop distribution, as specified in the Jethro Installation Guide – Installing Jethro on Hadoop (the “Configuring Jethro Connection with Hadoop” section).

For example:

For Cloudera CDH 5.x distribution:

  1. Switch to user root.
  2. Retrieve the installation repo file by running the following commands:
     rpm --quiet --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
    
     wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
  3. Install Hadoop client component:
     yum -y localinstall cloudera-cdh-5-0.x86_64.rpm
  4. Install hive-jdbc:
     yum -y install hive
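
To verify that the Hive client was installed successfully, you can check its version (the exact output depends on your distribution; this is only a sanity check):

 hive --version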

Using External Tables

Create External Data Source in Jethro Client

Defining a connector between Jethro and a foreign Hive data source (schema).

Basic Syntax

CREATE EXTERNAL DATA SOURCE <new_data_source_name>

TYPE = HIVE

LOCATION = [sub-protocol://]<host>:<port>/<db> 

credentials=<user name>/<password>

For example:

CREATE EXTERNAL DATA SOURCE hive_simple_test TYPE=HIVE LOCATION='jdbc:hive2://10.1.1.61:10000/simpletest' credentials=''/'';

 

Note: A general connector can be created without specifying a database name. In that case, the schema is referenced later, when the external table is created.

For example:

CREATE EXTERNAL DATA SOURCE hive_simple_test TYPE=HIVE LOCATION='jdbc:hive2://10.1.1.61:10000/default' credentials=''/'';

CREATE EXTERNAL TABLE hive_ext_jethro_simple_test

(

a STRING,

b STRING,

c STRING,

d STRING

)

DATA SOURCE= hive_simple_test

LOCATION='simpletest.simple_test';

Additional information regarding External Data Source can be found here.

Create External Data Source With KERBEROS in Jethro Client

Defining a connector between Jethro and a foreign Hive data source (schema) with KERBEROS.

Basic Syntax

CREATE EXTERNAL DATA SOURCE <new_data_source_name>

TYPE = HIVE

LOCATION = [sub-protocol://]<host>:<port>/<db>;principal=<principal name>/<server name or IP address>@<KERBEROS domain> 

credentials=<user name>/<password>

For example:

CREATE EXTERNAL DATA SOURCE hive_simple_test TYPE=HIVE LOCATION='jdbc:hive2://10.1.1.61:10000/simpletest;principal=hive/ip-10-1-1-116.us-west-2.compute.internal@JETHRO.COM' credentials=''/'';

Notes:

  • An active Kerberos ticket for the jethro user is needed (see the example after this list).
  • Use a principal which has access to Hive (in the example above: “hive”).
  • Use a server name/IP address (in the example above: “ip-10-1-1-116.us-west-2.compute.internal”) that is connected to the Kerberos domain (in the example above: “JETHRO.COM”).
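
For example, you could obtain and verify a ticket for the jethro user before creating the data source (the keytab path and principal below are illustrative; use the ones defined in your Kerberos setup):

# obtain a Kerberos ticket from a keytab (path and principal are examples only)
kinit -kt /etc/security/keytabs/jethro.keytab jethro@JETHRO.COM
# list the active tickets to confirm the ticket was granted
klist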

 Create Target Table in Jethro Client

Creating the target table in Jethro, which will be populated with the data loaded from the Hive data source table.

For example:

create table jethro_simple_test

(

j_a STRING,

j_b STRING,

j_c STRING,

j_d STRING

);

Create External Table in Jethro Client

Creating a table in Jethro which is mapped to a foreign Hive data source table. It serves as an intermediary between the Hive source table and the Jethro target table, and allows loading the Hive source table data into the Jethro target table using the INSERT INTO command.

Basic Syntax

CREATE EXTERNAL TABLE <new_external_table_name>

(
<column_name> <column_type>,

…

<column_name> <column_type>

)

DATA_SOURCE=<existing_external_data_source_name>

LOCATION= <table_name_from_Hive_data_source>

For example:

CREATE EXTERNAL TABLE hive_ext_jethro_simple_test

(

a STRING,

b STRING,

c STRING,

d STRING

)

DATA SOURCE=hive_simple_test

LOCATION='simple_test';

 

Additional information regarding External Table can be found here.

Loading Data Into Jethro Target Table in Jethro Client

Basic Syntax

Insert into <Target Table Name in Jethro> select * from <External Table in Jethro>

For example:

 insert into jethro_simple_test select * from hive_ext_jethro_simple_test;
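
As a quick sanity check after the load completes, you can count the rows in the Jethro target table and compare the result with the Hive source table (a simple illustrative query):

select count(*) from jethro_simple_test;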

 

Now Releasing Jethro Manager

Jethro Manager adds a Web UI to the Jethro Engine

Jethro Manager is part of Jethro 3.0.

Jethro is proud to release a new Web UI that will allow its users to manage and perform actions on their Jethro servers using a remote web browser.

The first version of the product focuses on the features that are most commonly used, the activities that are most time consuming, and the tasks that can benefit users the most through the help of a visual guiding mechanism:

  • Create & attach Instances
  • Load from CSV data files located on a Local/Network/Hadoop file system
  • Load directly from Hive files of any format
  • Monitor load statuses and logs in real-time

Having a Web UI not only reduces the need to use the command line on Linux machines, but also takes advantage of the new interface to keep users from repeating common mistakes. The UI guides users with on-screen tips and rules that help them get better performance out of their Jethro server.

Let’s see a few screenshots from each feature, to better understand how it’s done.

Create & attach Instances

The ability to create and attach instances in Jethro Manager (JM) is designed through a wizard with three simple screens. The first one identifies the type of storage platform. The second navigates to the location where the Jethro Instance files will be stored within that storage. The third one asks for the name of the instance and the location and size of its local cache storage.

If an unattached instance is already found on the storage path provided, JM will offer the user the ability to simply select it from a dropdown list.

[Screenshot: attach instance wizard]

Upload data from files

The data upload flow consists of three screens. First, the user chooses the type of data source. Then the user navigates through folders and selects the files (or folders of files) to be loaded. On the third screen, the user maps the source data to a target table in Jethro and initiates the load process.
Users who are already familiar with Jethro will be glad to hear that no more editing of description files is required.

Going back to the second screen—once the file(s) have been chosen, JM will try to identify the delimiter character automatically and preview the file’s data in colors, together with other parameters that can be selected and that affect the way the file will be processed during the load.

[Screenshot: data upload]

On the third screen the user will be asked to verify the mapping of columns from the source file(s) chosen on the previous screen, to a target table in Jethro. It can be a new table, or an existing one.

When creating a new table, JM will try to suggest the data types it sees as the best fit for the source data, while considering Jethro’s optimal preferences. The user will still be able to change the recommendations according to his own preferences.

When choosing an existing table to load the source data into, JM will try to identify the best existing table based on the data types identified on both sides of the mapping and will generate a match percentage score, which appears next to each table name on the list. This score helps users with hundreds of tables identify their required target table and prevents mistakes such as loading data into the wrong table.

JM also provides indications for the match quality between the linked source and target columns—if it thinks the mapping is wrong or could be better optimized. If a mapping between two columns doesn’t match, JM will notify the user via color, icons and tooltips.

[Screenshot: select data and column mapping]

The user also has the option to select specific columns on both the target and the source, and to get a preview of their data right next to each other. Again, this is meant to prevent commonly seen mistakes.

Loading data directly from Hive or Impala

The process of loading from Hive or Impala is practically the same as the file loading we’ve seen previously. The only difference is that you will query your Hive/Impala metastore for a table list instead of browsing the directory file tree.

Monitoring load statuses and logs in real-time

The load manager screen allows users to monitor the progress of load processes and, in case anything goes wrong, to access the logs.

JM offers the ability to repeat a load process with a single click. If a load process has failed due to a simple mistake that can be fixed, the user can jump back to the historical load flow definition and change whatever is required to make the load process succeed.

[Screenshot: loads manager]

Solving Common BI on Hadoop Performance Pitfalls

Introduction

Your company has recently made the jump to big data, and now that a few months, or even years, have passed, you think your team has a firm grip on getting data in and out of the environment, right? Most likely, if you are thinking this, you are also discovering that you aren’t really sure how to easily hook one of your Business Intelligence (BI) tools to this data, point it at the data you really want (i.e., a subset), and produce fast, responsive results. These three simple tasks are almost universal to any BI request, yet they can wreak havoc on teams in the big data age, with so many services available but none providing a solution to all three.

Oftentimes this leads us elsewhere, purchasing expensive alternatives from third-party companies to manipulate, display, and provide meaningful access to the data for our users and ourselves, and in most cases what are we left with? The sad truth is that we’re left supporting multiple products or going back to what is familiar, like SQL Server or newer in-memory alternatives such as Microsoft’s Analysis Services Tabular models; neither of which we want to do.

In this article, we’ll explore some of the various ways we can answer the question of how do we do BI from big data and compare alternatives with Jethro’s approach. Most of the alternatives mentioned in this piece have been explored and even implemented (sad to admit) by me before big data really grew into what is available now. So please enjoy and feel free to provide input or email comments and share experiences with me and the rest of the Jethro team so we can potentially advise and help through similar situations and avoid the pitfalls described here.

Core Hadoop

One of the most common, and frankly naïve, ways people approach BI with Hadoop, which generally ends in failure, is using core Hadoop to power BI reporting with tools such as Tableau, SSRS, or some other third party system like Business Objects, etc. When we talk about core Hadoop we are referring to the following services:

  • HDFS
  • MapReduce/MR2 YARN
  • Hive/HiveServer 2
  • Zookeeper

Now don’t get me wrong, MR2/YARN and the others listed above are excellent for what they are meant to do, but one of these services is often misrepresented as a multipurpose solution that it is not: HIVE. The purpose of HIVE is simple, batch processing in/out – that’s it. Working with a company a few years back, when there weren’t many alternatives, I remember seeing demos with HIVE tables and views as the source of a Tableau dashboard. At the time, we were looking for ways to present the data from HIVE into Tableau more quickly and ultimately ended up settling on SQL Server Analysis Services Tabular (more to come on this). I can clearly remember asking the presenter to manipulate filters via screen share, and then the awkward wait as one, two, five minutes passed while a MapReduce job, unbeknownst to the audience, ran in the background to fetch the new data.

One of the major problems with this solution is the fact that Tableau, and other tools that rely on queries to refresh data, will cause MapReduce jobs to be launched as users explore and elect to change the initial set of data. There are clever ways around this, extracting the data all up front, etc., but oftentimes the cost for doing so is not worth it.

For example, take the connection below to a table with ~75M rows:

[Screenshot: connecting Tableau to Hive + Spark on a ~75M-row table]

The screenshot illustrates how long the wait was for metadata generation (I killed it after 2:00 minutes). One could argue that limiting the general import would fix this issue, i.e., if I did a LIMIT 500,000 I could pull the data through, and that is true. But what do you do when your smallest dataset starts at 25M, 50M, or more rows?

You can spend many man-hours attempting to make this work with partitions, shrinking data into smaller tables, or other downstream tasks, but none will ever prove sustainable or provide the speed many of your customers expect.

Retro Solutioning

Retro Solutioning, which may be a coined term (not sure to be honest), is when you don’t know enough about the other Hadoop services outside of the core features and decide to stick with what’s “safe.” The resulting solution will utilize technologies your team has at hand and is comfortable with and is usually the second line of attack when initial plans don’t quite work out (see the previous section).

Although there are valid situations that may call for utilizing existing technology, and every company and situation is unique, more often than not it proves to be additional work and, by nature, additional processes for you to manage. Not only does this completely negate the purpose of Hadoop and big data, but it wastes the time and resources of your team and has a high likelihood of becoming a band-aid solution instead of a scalable one.

Being a Microsoft SQL developer for 10+ years prior to getting into big data, I felt this example, derived from my own experience, was very relevant to this topic; it involves utilizing Microsoft SQL Server and the Tabular Analysis Services features to make up for less-than-desired performance out of Hive. We’ll start by outlining the hypothetical project and then go into the implemented solution and the results afterwards. As you read through it, think of how you’d approach this problem and whether it lines up or not (if it doesn’t line up, then you may have done it right!)

The “Hypothetical” Project

Project: Reporting Dashboards
Original Systems:
Microsoft SQL Server (DB)
Microsoft .Net for dashboards – yikes, I know
Microsoft Reporting Services

Project Overview:

Take existing reports that have an MS SQL Server database source and upgrade them to Tableau, deprecating any use of .NET as a dashboarding interface and removing any reporting dependency on MS Reporting Services.

 

Project Requirements:

  • All reports should load in <10 seconds on initial browse
  • All reports should respond to filter changes within <10 seconds
  • Data source should hold at least 1 year of data for users to pull into active reports/dashboards
  • Solution should be able to handle a user base of 1,000 users, with 200 concurrent

The total data size is 225 million rows.

Seems straightforward enough, right? The answer: Maybe.

Knowing that plugging your reporting tool, Tableau in this case, into HS2 is a no-no for BI when a performance SLA is involved, we needed to come up with a way to solve this quickly. Since we had SSAS talent and owned pretty beefy machines, we decided to aggregate where possible and push to memory using in-memory Tabular SSAS. We acknowledged that there would be overhead to develop the exports of data from Hive using ODBC and SQL Linked Servers, and that there would be duplication of data between what was stored in HDFS and Tabular, and we set off on our journey. The thing that hurt us the most was the time it took to extract data from Hive into the in-memory area for SSAS; we were hit with almost the same overhead as if we had just connected Tableau directly to HS2.

Here is a breakdown of the different systems via SQL Linked Server and ODBC. Note: Neither Impala nor Jethro was available to us at the time because we didn’t even consider exploring them!

 

[Screenshot: SQL Server ODBC query times]

 

A couple of things to note from the above: 1. the query times are using OPENQUERY to import data into SSAS; 2. each resulting set was against an 80M-row fact table unrelated to the one from our hypothetical project, and the testing scenario used a query that would generate a dimension with ~1200 rows; and 3. Hive here is Hive + Spark, which definitely helped improve timing after a container was created.

So, with the time trials in mind, let’s get back to the hypothetical project and import. The next screenshot (below) shows an import of 1 day for ~7.1M rows; although the time is not shown, this entire import took approximately 20 minutes. If you do the math, it would take, in theory, about 632 minutes for the entire dataset (roughly 10 hours).

[Screenshot: Table Import Wizard]

Importing the data at an aggregate level for our reporting did help, knocking the row counts down to about half the total, but that is still well beyond the recommended size for in-memory Tabular Analysis Services solutions, and it showed when we hit it with Tableau. Query filtering took 10 seconds at best and could sometimes hit upwards of 30-60 seconds, depending on whether we were cascading filter boxes or not. We quickly learned that if we were seeing this type of performance during development, while hitting these reports on Tableau Server, it was not going to work for our customers and would be a nightmare as we scaled up in size and user base.

The screenshot below shows the wait on opening the dashboard with Tableau Desktop. This would easily be prolonged when opening from the server, and it busts one of the SLA requirements.

The Pitfalls

We learned how to connect different systems to HS2, but we also learned that relying on it to feed a performance-driven solution was not going to work without doing one of the following:

  1. More hardware – faster and more expensive SSAS machines would certainly help – but where is the budget for that?
  2. Compromise on the solution. Reduce the amount of data kept and displayed assuming the customer would allow that – probably not

After reading this you may be saying, “well, there are many other solutions like Cognos, SAS, or AtScale that can be used as a layer to prepare data for BI,” and that is certainly true. However, in my humble opinion, those solutions are not retro, since most companies have not had them in-house for years like your SQL Server or Oracle databases (MS Access – don’t even go there). The point of this section wasn’t to bash older database technologies or discourage you from pulling data into them if that works for your customers’ needs, but rather to illustrate how they may solve for the short term yet fail in the long run when we include the other aspects of what makes data “big.”

The Best of Both Worlds

How do we get the best of both worlds? That is the question of the day: what magical solution will allow us to get the benefits of Hadoop and power your BI while, at the same time, avoiding costly add-on tools that mimic familiar functionality? Sticking with your current BI tool, whether that is Tableau or Qlik, is the right start, but you also need to identify your overall direction for BI and reporting within your organization. Oftentimes we jump into tools that have punch lines such as “bring Hadoop to your analysts” or “data discovery at your fingertips,” and while these tools may be great at specific tasks, they ultimately fall short with BI.

Keeping in mind there may never be a “catch-all” solution (and why would there be? this is Hadoop and it’s an ecosystem), let’s break it down by some BI needs…

Power User Pandemonium

A situation derived from personal experience: having an abnormal number of power users who have decent-to-good skills in SQL and databases and who really just want to get the data and manipulate it for their own reports. This BI nightmare, or dream depending on how you look at it, often includes many ad-hoc queries tapping your source system and the use of easily accessible analyst tools like Microsoft Excel and Access to pull data in and do what is needed with it.

Report Stores

Having a bunch of reports that are maintained by a development team and utilized by analysts, but are relatively trivial when inspected independently. We all have instances of this: those “quick-hit” reports that were spun up to answer 20-30 users’ constant inquiries and never went away.

Dashboards

The traditional dashboard requests that serve to answer many questions from executive levels and onward (depending on your organization). Dashboard development is the best part of BI, in my opinion, because it encompasses all the challenges associated with BI (learning the business, the data, and the dashboarding tool).

What’s the Answer?

Easy… there isn’t one answer to handle it all, as I alluded to in the previous sections. But if you approach your BI challenges with an open mind to the different technologies, and to combinations of them, you are already starting down the right path when solutioning. For example, one of my favorite approaches to solving the BI needs we listed above (realizing there are more than those three) is using this stack:

  • Core Hadoop

Hive + Spark for batch processes and heavy lifting with HDFS, YARN, Zookeeper

  • Impala

Deployed to ad-hoc users and analysts who know their way around various SQL dialects and won’t have the patience, or need, for Hive. Impala is a big win for organizations jumping into Hadoop that have a lot of power users, because it provides a happy medium between big data and what they [your advanced users] are used to working with.

  • Jethro

This is my SLA-saver. Impala works great and can be used for a lot of quick reports or outputs that a few users rely on, but it can easily fall short when it comes to speed and tight SLAs. If you are presented with a strict SLA, such as reports responding in <10 seconds or filters needing to be near-instant on change, Jethro is the answer. With access to HDFS and the usage of the new manager tool (Jethro 3.0), data can easily be imported and queried with many of the common BI tools; basically, any tool that can make ODBC connections can utilize the data within Jethro, which makes it very powerful.

  • Hue

Replaces many users’ need for a tool similar to Microsoft’s SQL Server Management Studio and makes the abilities of Hadoop visible. I find Hue is a great way to help users who are new to the Hadoop world understand its basic principles while still getting their data, i.e., they can browse HDFS, use the import table wizard, etc.

  • Sentry

Provides role-level security for your stack, covering Hive, Impala, and Jethro. This is a must when you are going to have advanced users playing around via Impala, because it shares the Hive metastore and you’ll want to restrict your users so they don’t accidentally drop objects or start creating duplicate items.

The above listing, although high level, is an example of my favorite stack setup, which has proven successful for solving big data and BI needs because it is focused on ensuring there is a service for every type of scenario while avoiding getting backed into a technology corner, per se.

Related: Read our white paper on making interactive BI work on Big Data


Having worked as a database architect for 10+ years, beginning with Microsoft SQL Server 2000+, Analysis Services, and ultimately Big Data Hadoop, AJ Adams has come across many different challenges. He has worked with many different types of organizations covering a variety of industries, including energy, government, healthcare, and sales/marketing, designing and implementing architecture for BI-related solutions using tools such as Tableau, Qlik, and others. As data sizes grew, he moved into Hadoop, primarily Cloudera and the services associated with that distribution, specializing in administration, development, and overall stack setup. He believes one of the keys to success in making the move to big data is to approach it openly and focus on which sets of services can best handle the various needs, versus battling to come up with a ‘one-solution-fits-all’ design.

On-Demand Webinar: How Tata Communications Uses Jethro to Deliver Interactive BI on Hadoop

Learn how Tata Communications CDN overcame Hadoop latency pitfalls to deliver its end users a BI dashboard, on a shared Hadoop data lake with thousands of concurrent users, that works at interactive speed.

This on-demand webinar is hosted by Tata Communications CDN Head of Engineering, Samik Mukherjee and Jethro CEO Eli Singer. You will learn about Tata Communications CDN’s Hadoop deployment, performance requirements, and architectural considerations.

 

Easy Steps for Jethro Deployment

Install and Create Instance

Set up a server

Jethro runs on a Linux server. A typical deployment requires a RedHat/CentOS server with 128 GB of RAM, 16 CPU cores, and 500 GB of disk space.

Jethro stores its data on Hadoop, but can use any file system, either local or network, instead.

See the installation guide for server requirements and setup:

http://jethro.io/learn/documentation

If using Amazon AWS, you can view the following post on how to install Jethro on an EC2 instance:

http://jethro.io/blog/how-to-set-up-an-amazon-instance

You can also view this video that shows how to install Jethro on AWS:

https://www.youtube.com/watch?v=jdREOTFmcdk&t=691s

Download the software

Go to http://jethro.io/download1 to download the software.

Install Jethro

Download the installation rpm to the Jethro server and run the following command as root, to install it:

rpm -Uvh <rpm file>

See the installation guide for Jethro install:

http://jethro.io/learn/documentation

Create an Instance

Before creating an instance, you need to create a directory for the storage and a directory for the local cache, and to change the ownership of these directories to the “jethro” user, which was created as part of the installation process.

Then, log in as user “jethro” (the password is also “jethro”) and run the command:

JethroAdmin create-instance <instance name> -storage-path=<storage path> -cache-path=<local cache path> -cache-size=300G

 

Replace <instance name> with the name you want to give the instance, <storage path> with the full path of the storage directory on Hadoop, and <local cache path> with the full path of the local cache directory. Set the cache size according to the size of the local cache directory.
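
For example, with an instance named demo, storage on Hadoop under /user/jethro/instances, and a local cache directory at /mnt/jethro_cache (all of these names and paths are illustrative; adjust them to your environment), the command would look like:

JethroAdmin create-instance demo -storage-path=/user/jethro/instances -cache-path=/mnt/jethro_cache -cache-size=300G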

This video shows how to connect Jethro to Hadoop:

https://www.youtube.com/watch?v=4zOEpdiBQjA

 

For instances with storage NOT on Hadoop, add the following parameter to the end of the command:

-Dstorage.type=posix

Note: After creating the instance, you can start the Jethro service by running:

service jethro start

Analyze data

Analyzing the data will help you determine the correct data type for each column, and it will also create the table creation script and description file that are needed to load data.

Download the Data Analysis tool

Download the tool using this link:

https://github.com/JethroData/AnalyzeData/archive/master.zip

Unzip the file and then go to the AnalyzeData-master directory.

Analyze sample data

Use the AnalyzeData script to analyze the data.

It requires python to be installed on the server.

It also uses an optional python package called “tabulate” to generate the data analysis report. To install tabulate, use the command:

TABULATE_INSTALL=lib-only pip install tabulate

If pip is not installed, install it using the command:

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py

Note: Installing tabulate requires root or sudo access to the server. If you do not have root or sudo access, you can still use AnalyzeData. In this case, it will fall back to csv mode when displaying the results.

Run AnalyzeData with the following optional parameters:

AnalyzeData.py [-i <rows to read>] [-d <delimiter>] [-q <quote char>] [-n] [-c] [-g <table name>] [<input file>]

-i: Number of rows to read from the input. Default=unlimited.

-d: The input data delimiter. Default=”,” (comma).

-q: The input data quote character. This allows a delimiter character to appear inside quotes as part of the data. Default=” (double quote).

-n: If specified, treats the first row as headers row which contains the column names.

-c: CSV formatted output. Write the output report as a tab delimited file instead of a formatted table. Installing tabulate is not required in this mode.

-g: Generate a create table script and a description file using the given table name.

input file: The input file to read. If not specified, read from standard input.

It is recommended to limit the input to up to 1 million rows (-i 1000000).
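
For example, to analyze the first million rows of a pipe-delimited file that has a header row, and to generate the create table script and description file for a table named test (the input file name is illustrative):

AnalyzeData.py -i 1000000 -d '|' -n -g test test_data.csv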

This will generate a report such as the following:

 

Number: The column serial number.

Name: The column name, if the data contains headers. Otherwise it is c1..cN.

Rows: The number of rows for the column.

Type: The suggested type to use based on the data. A non-string type is suggested when non-string values are found and there are no more than 5 distinct exception values.

Category: For certain values, a category can be detected based on regular expressions. It also specifies “Primary Key” when a column has unique values and “High Cardinality” if it has many unique values.

Percent: The percentage of the values of the suggested type out of all values.

Exceptions: A list of up to 5 exception values. Exception values are values that do not match the suggested type.

Distinct: The number of distinct values.

Samples: Sample values of the suggested type.

In addition, if the -g parameter is specified with a table name, then a create table script and a description file are generated based on the data. For the above data, the following scripts are generated as test.ddl and test.desc when given the table name “test”:

 

create table test

(

name STRING,

age BIGINT,

birth_date TIMESTAMP,

balance DOUBLE,

phone STRING

);

table test

row format delimited

            fields terminated by '|'

            null defined as 'NULL'

OPTIONS

            SKIP 1

(

name,

age,

birth_date format='yyyy-M-d',

balance null defined as '-',

phone

)

You can view the report to determine the following:

  1. Is the suggested type acceptable?
  2. Is there more than one exception value? Jethro only supports one NULL value for each column, so if there is more than one exception value you either need to modify the values before loading or load this column as a string.
  3. How many distinct values are there? For a string column with many distinct values, you may choose not to load it or to truncate the values so they become less unique (in the case of URLs, for example).

Then, you can modify the table create script and description files if necessary:

  1. Assign column names if they were not provided as part of the input.
  2. Change types as required.
  3. Add a partition definition.

For example, to partition the above table by birth_date every month, add the following just before the create table ending semicolon:

Partition by range(birth_date) every (interval, '1', month)

 

  4. Change description file options, like rows or columns to skip.
  5. To learn more about partitions and description files, consult the reference guide:

https://jethrodownload.s3.amazonaws.com/JethroReferenceGuide.pdf

Load data

Once the data is analyzed and the scripts are created, you can start loading your data.

Create tables

Before loading data, you need to create tables in Jethro.

Start the JethroClient using the command:

JethroClient <instance name> localhost:9111 -p jethro

 

Copy the create table command and paste it at the command prompt.

Alternatively, you can run the JethroClient with the -i option specifying the ddl file of the table. For example:

JethroClient <instance name> localhost:9111 -p jethro -i test.ddl

Load data to tables

Once the tables are created, you can start loading data.

To load data to a table use the command:

JethroLoader <instance name> <description file> <input files or directories>

 

If the data is not in text format, you may need to make it accessible to Jethro first. In this case, you can use a pipe to forward the data to the Jethro loader, specifying “stdin” instead of files or directories. For example, to load data from a Hive table, you could run:

hive -S -e "select * from <db>.<table>" | JethroLoader <instance name> <description file> stdin

 

The load process may take some time, so it is recommended to run it in the background by specifying  “&” at the end of the command.

When the loader starts running, it will print the path to the log file. You can view this file using tail -f to monitor the load progress.
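
For example, an illustrative run that loads a file in the background and then follows the loader log (the instance and file names reuse the earlier examples and are not prescriptive):

JethroLoader demo test.desc test_data.csv &
# the loader prints its log file path when it starts; follow that path, for example:
tail -f <log file path printed by the loader>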

You can learn more about loading data from the reference guide and from the following video:

https://www.youtube.com/watch?v=Dol7TyTbgVU&t=27s

Connect BI tool and run queries

Once data is loaded, you can connect your BI tool and start working with the data.

Download connection drivers

Connection to Jethro is made through ODBC or JDBC. You first need to download the Jethro drivers from here:

http://jethro.io/driver-downloads

Install the ODBC driver

After downloading the installation file, double click on it to start the install process.

If Tableau is installed on your box, the installer automatically detects that and will ask you to copy the Jethro.tdc file to the proper location.

Create a connection

After installing the ODBC driver, you need to create a connection to Jethro.

Open “ODBC Data Sources” and go to the “System DSN” tab.

Click the “Add” button, select “JethroODBCDriver” from the list of drivers, and click “Finish”.

Fill in the details. Choose a data source name. In the Host field, specify all the Jethro servers connected to the instance in the format: <host>:<port>;<host>:<port>…<host>:<port>. This allows the ODBC driver to load balance between all the servers by sending each query to a different server. You can leave the Port field empty. In the Instance field, specify the Jethro instance name. The Login ID is “jethro” and the password is also “jethro” if you did not change it.
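
For example, an instance served by two query servers might use a Host value like the following (the host names and port are illustrative):

jethro-query1:9111;jethro-query2:9111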

Click OK to save the connection.

Install the JDBC driver

After downloading the JDBC zip file, uncompress it and put its content in a folder of your choice.

The zip file contains 2 jar files: JethroDataJDBC.jar and protobuf-java-2.4.1.jar.

It is important to use both when defining the driver for your tool.

For example, to configure SQL Workbench to connect to Jethro, first add the JDBC driver:

Then, create a new connection providing the connection details:

Connect BI tool

Once a connection is created, you can use it in order to connect your BI tool.

The following video shows how to connect Tableau to Jethro:

https://www.youtube.com/watch?v=secprIbz4tY&t=250s

Setup Automatic Microcubes Generation

Microcubes provide interactive speed data access for queries with a low number of filters by generating and caching the results for all value combinations of a given query pattern.

Microcubes can be generated automatically from the queries sent by users. It is recommended to set up a separate Jethro server dedicated to cube generation. A typical environment should include at least two Jethro servers: one server runs the maint process and is used to generate the microcubes (this server can also be used to load data), and any other server is used to query data.

We assume that the existing server will be used for microcube generation and that any new server will be used for queries.

Set up query servers

  1. Set up the new Jethro servers.
  2. On each server, attach the server to the instance by running the following command (a concrete example appears after this list):

JethroAdmin attach-instance <instance name> -storage-path=<storage path> -cache-path=<local cache path> -cache-size=300G

Replace <instance name> with the name of the instance, <storage path> with the full path of the storage directory on Hadoop, and <local cache path> with the full path of the local cache directory. Set the cache size according to the size of the local cache directory.

For instances with storage NOT on Hadoop, add the following parameter to the end of the command:

-Dstorage.type=posix

  3. Start the Jethro service on each server:

service jethro start

  4. Edit any ODBC/JDBC connection to include the list of query servers that you set up.
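
For example, reusing the illustrative names from the create-instance step (instance demo, Hadoop storage under /user/jethro/instances, local cache at /mnt/jethro_cache), the attach command on a query server would look like:

JethroAdmin attach-instance demo -storage-path=/user/jethro/instances -cache-path=/mnt/jethro_cache -cache-size=300G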

 

Configure autocube generation

  1. On the first server that you set up, where you created the instance, edit the file:

/opt/jethro/instances/<instance name>/local-conf.ini

  2. At the end of the file, add the following 3 lines:

dynamic.aggregation.auto.generate.enable=1

dynamic.aggregation.auto.update.incremental.cubes.enable=1

dynamic.aggregation.auto.generate.execution.hosts=localhost:9111

  3. Save the file.
  4. Restart the jethro service:

service jethro restart

 

Now, microcubes will be generated automatically based on the queries that are being sent by users.

 

Generate microcubes manually

If you do not want to allocate a server for microcubes generation, you can still generate microcubes manually for specific queries.

Connect to the server using JethroClient or any other SQL client and run the command:

generate cubes <cube name> from <sql query>

 

You should replace <cube name> with a cube name of your choice.

This command instructs the server to try to generate microcubes for the given SQL query. If microcubes can be generated for the query, they will be generated and you will get a response. Otherwise, you will see a response with the reason why the microcubes were not generated.
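
For example, to ask the server to generate microcubes for a simple aggregation pattern (the cube, table, and column names here are purely illustrative):

generate cubes sales_by_region_cube from select region, sum(amount) from sales group by region;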

To see the existing microcubes, run the following sql command:

show cubes