Performance is a central issue for SQL on Hadoop. The original Hive which relied on Hadoop’s MapReduce suffered from poor performance, making it mainly applicable for batch queries or ETL scenarios and less so for fast/interactive queries. More modern tools, including the later versions of Hive which run on Tez and Spark, have substantially improved query performance and latency.
Optimized Data Formats
The traditional way to improve the performance of SQL on Hadoop is to use an optimized data format. Raw data typically arrives as Text or JSON files, and converting it to a columnar binary format provides a significant performance boost. Popular open columnar formats are ORC and Parquet, used by many of the native Hadoop engines. Kudu, a new data format (and file system) was recently announced and supported by Impala and Drill, which performs even better.
A newer generation of commercial tools which are focused on high performance use their internal proprietary data formats for columnar storage – two examples are Actian and Jethro. These data formats are even faster than the generic formats like ORC and Parquet; the downside is that you’ll need to convert your data to the vendor’s proprietary format.
Besides optimized data formats, some of the SQL on Hadoop tools employ other techniques to boost performance – SQL query optimization, multi-threaded processing, and vectorization of queries.
Indexing – Order-of-Magnitude Performance Boost
A new innovation in SQL on Hadoop performance was introduced by Jethro (that’s us). Jethro’s big data platform fully indexes every single column of the dataset on Hadoop HDFS. It leverages these indexes to access only the data necessary for a query, instead of performing a full scan like all other SQL on Hadoop tools.
Because data is accessed directly instead of waiting for a full scan, responses to queries are faster by an order of magnitude, enabling real-time response to queries and interactive exploration of data.
Queries can leverage multiple indexes for better performance – the more a user drills down, the faster the query runs.
Non-index-based tools can also provide fast query response, but only given a physical pre-organization of the data (sorting, partitioning). This pre-organization is done for specific queries or scenarios (e.g. queries by date or geographical region) – queries using a different structure will be dramatically slower.