Jethro runs on one or few dedicated, higher-end hosts optimized for SQL processing – with extra memory and CPU cores, and local SSD for caching. The query hosts are stateless, and new ones can be dynamically added to support additional concurrent users.
Full Indexing, a Columnar Structure & Adaptive Caching
Jethro combines two engines: a columnar SQL database and a search index. Think Impala plus Elasticsearch, in one product. It looks and behaves like a SQL engine but leverages the full-indexing engine inside to power fast queries. It utilizes indexing, smart caching and auto micro-cubes to accelerate Business Intelligence (BI) queries.
What are the alternatives?
Many companies are forced to point their BI applications away from Hadoop, toward: (1) traditional EDWs such as Teradata or Vertica; (2) cloud services such as Redshift or BigQuery; or (3) full BI-as-a-Service offerings such as Domo or GoodData. This goes directly against Hadoop's promise of being THE data platform of the future.
Don’t we already have Hive, Impala & SparkSQL? Can’t they do the job?
These, and the many other existing SQL-on-Hadoop tools, are general-purpose SQL engines. They use a traditional MPP / full-scan architecture, which often requires scanning the entire dataset. For BI applications, where queries tend to be selective, an index-based architecture, as purpose-built by Jethro, is much more effective. Jethro is a complementary tool to Impala and is typically used side-by-side on the same cluster.
- Jethro compute servers run as edge nodes and are stateless
- All data – including index files, column files, cubes, and query results – is stored in HDFS
- Jethro communicates with the HDFS cluster via an HDFS client or NFS
When to use Jethro
- Need to run a BI tool such as Tableau over data in Hadoop
- Datasets typically range from 1B to 10B rows (largest customer has over 100B rows)
- Dozens of internal or external users expecting <10s response time
How BI Tools Access Your Indexed Data on Hadoop
Jethro works with BI tools such as Tableau, Qlik and MicroStrategy, as well as SaaS BI/analytics dashboards.
1. Getting Data Indexed by Jethro
Identify BI-ready datasets
- A one-time process is used to create a Jethro-indexed version of the dataset which is stored in Hadoop. Data is never moved from HDFS.
- As new data arrives, it is passed on to Jethro for incremental indexing, as frequently as every minute.
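The incremental flow above can be sketched conceptually. The following is a minimal, hypothetical Python model (all names are illustrative; Jethro's actual loader is a separate service with its own format): each batch of new rows is indexed into its own append-only segment, so previously written index data is never rewritten.

```python
# Conceptual sketch of incremental, append-only indexing.
# Class and method names here are illustrative, not Jethro's real API.
from collections import defaultdict

class IncrementalIndex:
    """Each load appends a new segment; earlier segments are never touched."""
    def __init__(self):
        self.segments = []      # list of {value: [row_ids]} dicts
        self.next_row_id = 0

    def load_batch(self, column_values):
        seg = defaultdict(list)
        for v in column_values:
            seg[v].append(self.next_row_id)
            self.next_row_id += 1
        self.segments.append(dict(seg))  # append-only: no locks, no rewrites

    def lookup(self, value):
        rows = []
        for seg in self.segments:
            rows.extend(seg.get(value, []))
        return rows

idx = IncrementalIndex()
idx.load_batch(["US", "UK", "US"])   # initial load
idx.load_batch(["FR", "US"])         # incremental load a minute later
print(idx.lookup("US"))              # -> [0, 2, 4]
```

Because each incremental load only appends a new segment, frequent small loads stay cheap and readers are never blocked by writers.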
2. Connecting to BI Tools
BI applications connect live to Jethro via ODBC/JDBC and send SQL queries for each user interaction.
- Jethro uses its indexes to identify the exact rows actually needed for the query and then fetches the relevant data
- Jethro also utilizes other features, such as a query result cache and auto micro-cubes, to further speed up typical BI queries
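The query-result-cache idea can be illustrated with a short, self-contained Python sketch. This is a conceptual model only (the class and normalization rule are assumptions, not Jethro's internals): repeated dashboard queries are answered from a cache keyed by the normalized SQL text, skipping re-execution entirely.

```python
# Illustrative sketch of a query result cache keyed by normalized SQL text.
# A conceptual model, not Jethro's actual implementation.

class ResultCache:
    def __init__(self):
        self.cache = {}
        self.hits = 0

    def execute(self, sql, run_query):
        key = " ".join(sql.split()).lower()   # normalize whitespace and case
        if key in self.cache:
            self.hits += 1                    # repeated dashboard query: no re-scan
            return self.cache[key]
        result = run_query(sql)               # only cold queries reach the engine
        self.cache[key] = result
        return result

cache = ResultCache()
engine = lambda sql: [("US", 42)]             # stand-in for the real query engine
cache.execute("SELECT country, COUNT(*) FROM t GROUP BY country", engine)
cache.execute("select country,  count(*) from t group by country", engine)
print(cache.hits)  # -> 1
```

Dashboards tend to re-issue the same handful of queries as users open and refresh them, which is why even a simple cache like this removes a large share of the load.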
The key to Jethro’s superior BI performance is its unique indexing technology. Jethro’s indexes are sorted, multi-hierarchy, compressed bitmaps. They are created automatically for every column, and are written in an efficient, append-only fashion – avoiding expensive random writes and locking. Queries use indexes to read only the data they need, instead of performing full scans, leading to faster response times.
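The index-driven read path above can be illustrated with a toy bitmap index in Python: one bitmap per column value marks which rows match, and a query ANDs bitmaps together to touch only the needed rows. This is a deliberately simplified model; Jethro's real indexes are additionally sorted, compressed and hierarchical.

```python
# Toy bitmap index: one bitmap (a Python int used as a bit set) per column value.
# Simplified model of the idea; production bitmap indexes compress these bitmaps.

def build_bitmap_index(column):
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps

def matching_rows(bitmap):
    row_id = 0
    while bitmap:
        if bitmap & 1:
            yield row_id
        bitmap >>= 1
        row_id += 1

country = ["US", "UK", "US", "FR", "US", "UK"]
status  = ["new", "new", "done", "new", "new", "done"]
by_country = build_bitmap_index(country)
by_status  = build_bitmap_index(status)

# WHERE country = 'US' AND status = 'new': AND the two bitmaps, then fetch
# only the matching rows instead of scanning the whole table.
hits = by_country["US"] & by_status["new"]
print(list(matching_rows(hits)))  # -> [0, 4]
```

Note that the bitmap intersection is a cheap bitwise AND regardless of table width, which is why selective BI predicates benefit so much from this layout.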
Data Loading & Indexes
A loader service processes input files and creates query-optimized column and index files, which are encoded, compressed and stored on HDFS or Amazon S3. This service can run on its own host or on one of the query processing hosts.
Jethro stores its files (e.g., indexes) in an existing Hadoop cluster or in an Amazon S3 bucket. With Hadoop, it uses a standard HDFS client (libhdfs) and is compatible with all common Hadoop distributions. Jethro generates only a light I/O load on HDFS – offloading SQL processing from Hadoop and allowing the cluster to be shared between online users and batch processing.