After just five short years in the spotlight, tech experts are calling for the death of Hadoop. For many, it seems like Hadoop is already going the way of legacy applications: it's not going away any time soon, but many business leaders see it as more of a necessary evil than the miracle it once was.
The frustration is justified. The Hadoop distributed computing platform has been oversold as a solution for your entire business. But in reality, Hadoop is designed for data scientists who can code in MapReduce—not business analysts who need interactive business intelligence functionality.
Sometimes, though, the grass isn’t greener on the other side. If you’re thinking of abandoning Hadoop for BI, you may want to reconsider.
It’s True—Hadoop Isn’t Meant for Business Intelligence
Despite what you may have thought when you dove headfirst into Hadoop, it isn’t built for visualization software that promises near real-time business intelligence.
Consider three main points that summarize why Hadoop isn’t made for business intelligence (and is “failing” you as a result):
- Data lakes aren’t meant for interactive queries—the lack of guaranteed response times makes latency too much of a challenge in these situations.
- Hadoop is best-suited for ETL batch workloads and machine learning because it offers a cheap storage repository.
- Data scientists can master Hadoop, but critics note that business users would have to learn Hive, Pig or Spark to actually make it work—which obviously isn't going to happen any time soon.
These “failures” of Hadoop lead to many workaround solutions to make BI work on the distributed computing platform—which is where even greater frustration often sets in.
Workaround Challenges for BI on Hadoop
Latency issues in Hadoop point to a larger problem: big data might just be too big for business intelligence in Hadoop. Business users need insights in near real time, and Hadoop and BI tools don't integrate seamlessly enough to make this a reality.
Your instinct might be to abandon the Hadoop ship and find a new distributed computing platform that meets the needs of both data scientists and business users—but a true replacement doesn't exist yet.
DataTorrent Co-Founder Phu Hoang summed up the problem best: “Hadoop is painful. But [business leaders] don’t see another solution. Until there may be other distributed computing platforms out there, our focus will be on making that one as easy to use as possible.”
This means finding workarounds to make BI work in Hadoop. But if you’ve ever tried the following workarounds, you know they aren’t without their own challenges:
- Implement a Generic, Out-of-the-box BI Solution: Trying to force a standard BI solution into Hadoop won't work. These solutions are often slow when connected to Hadoop, might not fit your specific use case, or might fail to flex to the querying needs of your users. Yet this additional software layer is often treated as the only way to create a BI dashboard for low-latency insights.
- Make Big Data Smaller with Extracts: BI tools like Tableau, Qlik and MicroStrategy are excellent for visualizing data ingested from Excel spreadsheets. But when you start working with 3 billion rows of big data, typical solutions will simply freeze or crash. Some companies extract smaller sets of data to work with in standard BI solutions. However, this negates the benefits of granular big data analytics when you're working with massive datasets. You end up with data silos across the organization and increasingly frustrated users.
- Adopting Any of the SQL-on-Hadoop Solutions: When you choose a SQL-on-Hadoop solution like Hive, it doesn't work the way you expect. While these solutions are a step up from Hadoop's lack of inherent SQL support, they aren't enough to meet your high-performance needs. On their own, data warehousing and massively parallel processing (MPP) solutions for SQL-on-Hadoop will only fuel the perceived downward trend of Hadoop for BI.
Hadoop Isn’t Going Anywhere—But BI on Hadoop Must Get Easier
Will Hadoop always be the primary (or only) distributed computing platform? Probably not. But regardless, so many companies have billions of rows of data in Hadoop and migrating it all won’t happen in the short term.
Instead, we have to start thinking about ways to satisfy business users as well as data scientists within Hadoop-powered organizations. The best way to do this—and to end the calls for Hadoop’s death—is to take advantage of a query acceleration engine. You can enjoy the cost efficiency of Hadoop without sacrificing high-performance business intelligence.
If you want to dig deeper into reaching peace between business intelligence and Hadoop, download our free white paper, How to Make Real-Time Business Intelligence Work in Hadoop.
Jethro delivers BI on Hadoop at interactive speed. Download a data sheet to find out more.
When is Hadoop Summit 2016 North America in San Jose?
The Hadoop Summit 2016 North America in San Jose will be held from June 28th – June 30th 2016.
Where is Hadoop Summit 2016?
The Hadoop Summit 2016 North America in San Jose will be in the San Jose Convention Center located at 150 W San Carlos St, San Jose, CA 95113.
About Hadoop Summit 2016
The 8th Annual Hadoop Summit in San Jose is a three-day-event that will kick off on Tuesday June 28th and bring together Apache community leaders and key players under one roof. Attendees will enjoy hearing about dev and admin tips and tricks, see presentations about successful Hadoop use cases, and get educated on ideal ways to leverage Apache Hadoop as an integral component in their enterprise data architecture.
Visit Jethro & Grab Some Data Swag
Jethro will be exhibiting its SQL-on-Hadoop acceleration solution. We will discuss ways to achieve interactive speed on BI dashboards and visualizations, such as Tableau or Qlik, with direct data connections to multi-billion-row datasets without impacting Hadoop cluster load. Data architects and data scientists are invited to Jethro's booth to see live benchmarks and demos.
We look forward to seeing you at the Hadoop Summit 2016 in San Jose!
The main themes of Jethro 1.6.0 are concurrency and new range-index features.
Concurrency features are:
- Reuse of results when the same "where" clause is used by multiple queries, reducing resource consumption and increasing concurrency.
- Enhanced locking infrastructure to protect against deadlocks under high load.
- An increased maximum number of threads allocated by the operating system for use by the Jethro services.
The new range-index feature allows you to create special indexes for ranges of values. This significantly improves the performance of queries whose filters span wide ranges of column values.
A full list of the issues addressed in this version can be found on our release notes page: http://jethro.io/release-notes/
Updated documentation: https://jethrodownload.s3.amazonaws.com/JethroReferenceGuide.pdf
You've already downloaded Jethro, and now you're ready to install it and start accelerating your database performance. Follow the tips below to get started on the right foot. You can always contact us with any problems or questions along the way.
Top 11 Tips for Optimizing Jethro Performance
- Optimal Queries – Use as Many Filters as Possible
- Jethro processes queries by first evaluating the WHERE clause and determining the rows needed for the query. It then fetches column data only for those rows. The narrower the query, the faster it performs.
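As a rough sketch of the difference (the table and column names here are made up for illustration):

```sql
-- Narrow query: the WHERE clause lets the engine pin down the qualifying
-- rows first, then fetch column data only for those rows.
SELECT customer_id, order_total
FROM   orders
WHERE  order_date >= '2016-01-01'
  AND  region = 'EMEA';

-- Broad query: with no filters, column data must be fetched for every row,
-- so the same SELECT list runs far slower on a large table.
SELECT customer_id, order_total
FROM   orders;
```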
- Optimal Data Types
- Use numeric formats (INT/BIGINT, FLOAT/DOUBLE) whenever possible – any string column that holds only numeric values should be converted. This is especially true for high cardinality columns.
- Use TIMESTAMP for Date/Time columns. Jethro creates multiple date-related indexes for such columns to improve performance of date-range queries.
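A fact table following these two guidelines might look like this (a generic SQL sketch with hypothetical names; the exact DDL accepted by your Jethro version may differ):

```sql
CREATE TABLE page_views (
    user_id    BIGINT,     -- numeric, not a string: a high-cardinality ID
    view_count INT,
    revenue    DOUBLE,
    viewed_at  TIMESTAMP   -- TIMESTAMP enables the date-related indexes
);
```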
- Partitions for Large FACT Tables
- A TIMESTAMP column is typically the best choice for partitioning, as it simplifies maintenance tasks like purging old data.
- Jethro recommends a total of 5-25 partitions, although it comfortably supports hundreds of partitions.
- A Jethro partition key can use range values. For example: PARTITION BY RANGE(ts_col) EVERY (INTERVAL '7' DAY)
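Putting the pieces together, a weekly-partitioned fact table could be declared along these lines (hypothetical table and column names, reusing the PARTITION BY RANGE fragment above; check the reference guide for the exact clause placement in your version):

```sql
CREATE TABLE events (
    event_id BIGINT,
    ts_col   TIMESTAMP,
    payload  VARCHAR(200)
)
PARTITION BY RANGE(ts_col) EVERY (INTERVAL '7' DAY);
```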
- Cache Space
- Jethro uses server-side caching for metadata and frequently used file fragments. The more space you allocate, the more data the cache can hold. Note that the benefit is realized gradually, since filling the cache takes time.
- Cache space can be defined when an instance is created or updated later on by editing the local-conf.ini file.
- Jethro automatically enables query-result cache. The query-result cache is stored in HDFS and does not require local disk space.
- Consolidate Tables When Possible
- While Jethro optimizes JOINs and automatically performs Star-Transformation, it is better to avoid them when not required.
- Jethro’s columnar format and effective compression minimize the storage impact of such denormalization.
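As a sketch of such denormalization (hypothetical tables, written in generic CREATE TABLE AS syntax; if your loading pipeline doesn't support it, the same join can be applied while preparing the input files):

```sql
-- Fold two dimension attributes into the fact table so routine queries
-- no longer need the join at all.
CREATE TABLE sales_denorm AS
SELECT f.*,
       d.region,
       d.segment
FROM   sales f
JOIN   customer_dim d ON d.customer_id = f.customer_id;
```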
- Hardware considerations: more is better!
- More CPU and RAM improves query speed, as Jethro takes advantage of multi-threading. It also improves concurrency, as more users/queries can be served in parallel.
- 10Gb network connectivity to cluster will speed up HDFS access.
- Local drives for caching – SSD is preferable.
- Trial servers can start with as little as 64GB and 8 cores.
- Use a Cluster of Jethro Servers
- Multiple servers linearly increase Jethro’s capacity for concurrent users and queries.
- When performing frequent incremental loads, it is recommended to run the JethroLoader on a different server.
- Data Sorting Can Improve Performance
- If a large number of queries filter by a specific column (that is not already a partition column), it can be beneficial to pre-sort the input data by that column before it's loaded into Jethro.
- Join Indexes
- When attributes of large dimensions are often used as a filter it is recommended to define them as a JOIN INDEX on the fact table. There is no limit to the number of JOIN INDEXES that can be defined.
- Jethro without Hadoop
- Jethro is capable of using storage systems other than Hadoop's HDFS, including a local filesystem, cloud storage (e.g., S3) or network storage (SAN/NAS).
- When the dataset used with Jethro can fit in a local filesystem it is often the best solution as it avoids Hadoop overhead.
- Load “Overwrite” for table update with no downtime
- When a dimension changes and needs to be reloaded, you can use Jethro's Load Overwrite feature. It loads the updated table, and the tables are swapped only once the process is complete.
- Use ALTER TABLE to add columns on the fly
- Thanks to its column-oriented design, Jethro can dynamically add (or drop) columns without having to reload the table. Existing rows will return NULL for the new column.
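For example (standard SQL, with a made-up table and column name):

```sql
-- Adds a column without a reload; existing rows read back NULL for it.
ALTER TABLE orders ADD COLUMN discount_pct DOUBLE;
```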
- Use Jethro's "SHOW" SQL Command to Learn About Jethro Internals
- SHOW [SESSION | GLOBAL] PARAM parameter | ALL (show parameter values)
- SHOW TABLES [EXTENDED | MAINT] (show all tables, size stats, fragmentation)
- SHOW TABLE PARTITIONS table_name (show table’s partition stats)
- SHOW TABLE COLUMNS [FULL] table_name (show column stats)
- SHOW VIEWS [EXTENDED] (show views)
- SHOW LOCAL CACHE (show local file cache usage)
- SHOW ADAPTIVE CACHE (show query result cache)
- SHOW ACTIVE QUERIES (show currently running and queued queries)
- SHOW SCHEMAS (show defined schemas)
- SHOW JOIN INDEXES (show all defined JOIN indexes)
The Jethro team is attending Tableau's user conference for the first time this week in Vegas. It has been incredible to see the genuine sense of collaboration and community that unites the more than 10,000 attendees.
Today at the conference we are excited to announce Jethro for Tableau to empower people using Tableau with visualizations at a natural interactive pace. Instead of waiting minutes for their big data visualizations to render, Tableau users can now gain crucial big data insights in seconds.
Jethro for Tableau combines Jethro's index-based performance acceleration with the ease of use and visualization of Tableau. The combination enables Tableau users to enjoy Tableau's great experience over big data tables that are simply too large to be extracted and loaded into Tableau's memory.
The product is now available for beta testing: Download Jethro for Tableau.
Check out our live benchmarks so you can see Tableau run at interactive speed while live-connected to a 2.9B row dataset!
To access the live benchmarks:
- Navigate to: http://tableau.jethrodata.com
- User: demo, Password: demo
- Choose Jethro workbook
- For performance comparison, choose the "Impala" or "Redshift" workbooks
Read more about Jethro for Tableau
We're excited to exhibit at Tableau Conference 2015 in Vegas and we hope to see you there. Stop by our booth to grab some swag and chat about accelerating your queries on Tableau. We'll be in town from October 19 – 23 and we would love to talk to you about big data performance: at the booth, over a beer or at the tables. Contact us to set up a time to meet.
Jethro Talks Interactive BI on Hadoop at Strata + Hadoop World NYC 2015
CTOs, data scientists, engineers and business leaders learned about the performance advantages of Jethro’s unique index-based SQL engine at the Strata + Hadoop show in NYC. The three-day convention at the Javits Center was the perfect stage for us to meet with key industry players and hear about their BI performance issues on Hadoop.
Of particular interest were the impressive live benchmarks that showed off how fast Jethro truly is. From our discussions at the show with both engineers and BI users, from large financial institutions to companies that provide BI tools, we witnessed a maturing of the entire Hadoop ecosystem. Companies that are using Hadoop now understand the limitations and performance issues surrounding BI on Hadoop.
If you missed the show, contact us to schedule a demo and see live benchmarks.
Attending Hadoop Summit 2015 in San Jose? We are too, and we’d love for you to come by our booth and chat about how Jethro can help your company. You can find us at booth G9, right next door to SnapLogic.
Give us your business card at the booth and participate in a raffle for an Apple Watch. Also, some lucky folks will receive our brand new “Big Data at Biblical Scale” t-shirt.
One last thing: on Tuesday, June 9, we’ll be making an exciting announcement about our company. Stay tuned…