Don’t Jump from the BI on Hadoop Ship Just Yet—You’ll Find Challenges Moving Business Intelligence Elsewhere

After just five short years in the spotlight, tech experts are calling for the death of Hadoop. For many, Hadoop already seems to be going the way of legacy applications: it isn't going away any time soon, but many business leaders now see it as more of a necessary evil than the miracle it once was.

The frustration is justified. The Hadoop distributed computing platform has been oversold as a solution for your entire business. But in reality, Hadoop is designed for data scientists who can code in MapReduce—not business analysts who need interactive business intelligence functionality.
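
To make that concrete, here is a rough sketch (not drawn from any particular production job) of what even a trivial aggregation, counting events per region, looks like as a Hadoop Streaming MapReduce job in Python. The column position, paths, and file names are invented for illustration; the point is that this is the kind of code a dashboard click would otherwise replace.

```python
#!/usr/bin/env python
# A toy Hadoop Streaming job: count events per region. In practice this would be
# split into mapper and reducer steps and launched with something like
# (paths and file names are illustrative):
#   hadoop jar hadoop-streaming.jar -input /logs -output /out \
#       -mapper "python agg.py map" -reducer "python agg.py reduce" -file agg.py
import sys


def mapper():
    # Emit "region\t1" for every record; the column position is an assumption.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 3:
            print(f"{fields[3]}\t1")


def reducer():
    # Streaming sorts map output by key, so equal keys arrive consecutively.
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```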

Sometimes, though, the grass isn’t greener on the other side. If you’re thinking of abandoning Hadoop for BI, you may want to reconsider.

It’s True—Hadoop Isn’t Meant for Business Intelligence

Despite what you may have thought when you dove headfirst into Hadoop, it isn’t built for visualization software that promises near real-time business intelligence.

Consider three main points that summarize why Hadoop isn’t made for business intelligence (and is “failing” you as a result):

  • Data lakes aren’t meant for interactive queries—the lack of guaranteed response times makes latency too much of a challenge in these situations.
  • Hadoop is best-suited for ETL batch workloads and machine learning because it offers a cheap storage repository.
  • Data scientists can master Hadoop, but business users would have to learn Hive, Pig, or Spark to actually make it work for them, and that isn't going to happen any time soon.

These “failures” of Hadoop lead to many workaround solutions to make BI work on the distributed computing platform—which is where even greater frustration often sets in.

Workaround Challenges for BI on Hadoop

Latency issues in Hadoop point to a larger problem: big data might just be too big for business intelligence in Hadoop. Business users need insights in near real time, and Hadoop and BI tools won't integrate seamlessly enough to make this a reality.

Your instinct might be to abandon the Hadoop ship and find a new distributed computing platform that meets the needs of both data scientists and business users, but you won't actually find such a solution today.

DataTorrent Co-Founder Phu Hoang summed up the problem best: “Hadoop is painful. But [business leaders] don’t see another solution. Until there may be other distributed computing platforms out there, our focus will be on making that one as easy to use as possible.”

This means finding workarounds to make BI work in Hadoop. But if you’ve ever tried the following workarounds, you know they aren’t without their own challenges:

  • Implement a Generic, Out-of-the-box BI Solution: Trying to force a standard BI solution onto Hadoop won't work. These solutions are often slow when connected to Hadoop, might not fit your specific use case, or might fail to flex to the querying needs of your users. And yet, some additional software layer is the only way to create a BI dashboard that delivers low-latency insights.
  • Make Big Data Smaller with Extracts: BI tools like Tableau, Qlik and MicroStrategy are excellent for visualizing data ingested from Excel spreadsheets. But when you start working with 3 billion rows of big data, you'll freeze or crash typical solutions. Some companies extract smaller sets of data to work with in standard BI solutions. However, this negates the benefits of granular big data analytics on massive datasets. You end up with data silos across the organization and increasingly frustrated users.
  • Adopt a SQL-on-Hadoop Solution: When you choose a SQL-on-Hadoop solution like Hive, it often doesn't work the way you expect (a minimal example of querying Hive from a client follows this list). While these solutions are a step up from Hadoop's lack of inherent SQL support, they aren't enough to meet your high-performance needs. On their own, data warehousing and massively parallel processing (MPP) SQL-on-Hadoop solutions will only fuel the perceived downward trend of Hadoop for BI.
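
As a point of reference, here is what that SQL-on-Hadoop layer looks like from the client side: a minimal, hedged sketch of querying Hive over its Thrift server with the PyHive library. The host, credentials, database, and table names are placeholders, not a recommended setup.

```python
# A minimal sketch of querying Hive from a BI-style client using PyHive
# (pip install "pyhive[hive]"); host, credentials, database and table
# are placeholders for illustration only.
from pyhive import hive

conn = hive.Connection(host="hadoop-edge.example.com", port=10000,
                       username="analyst", database="sales")
cursor = conn.cursor()

# The kind of aggregate a dashboard issues; on a batch-oriented engine this
# can take minutes rather than the seconds an interactive user expects.
cursor.execute("""
    SELECT region, SUM(amount) AS revenue
    FROM transactions
    WHERE txn_date >= '2016-01-01'
    GROUP BY region
""")
for region, revenue in cursor.fetchall():
    print(region, revenue)
```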

Hadoop Isn’t Going Anywhere—But BI on Hadoop Must Get Easier

Will Hadoop always be the primary (or only) distributed computing platform? Probably not. Regardless, many companies already have billions of rows of data in Hadoop, and migrating it all won't happen in the short term.

Instead, we have to start thinking about ways to satisfy business users as well as data scientists within Hadoop-powered organizations. The best way to do this—and to end the calls for Hadoop’s death—is to take advantage of a query acceleration engine. You can enjoy the cost efficiency of Hadoop without sacrificing high-performance business intelligence.

If you want to dig deeper into reaching peace between business intelligence and Hadoop, download our free white paper, How to Make Real-Time Business Intelligence Work in Hadoop.


Jethro delivers BI on Hadoop at interactive speed. Download a data sheet to find out more.

On-Demand Webinar: How Tata Communications Uses Jethro to Deliver Interactive BI on Hadoop

Learn how Tata Communications CDN overcame Hadoop latency pitfalls to deliver its end users a BI dashboard on a shared Hadoop data lake, with thousands of concurrent users, that works at interactive speed.

This on-demand webinar is hosted by Samik Mukherjee, Head of Engineering at Tata Communications CDN, and Jethro CEO Eli Singer. You will learn about Tata Communications CDN's Hadoop deployment, performance requirements, and architectural considerations.

 

5 Reasons Business Intelligence on Hadoop Projects Fail

Business Intelligence (BI) on Hadoop Projects Are Difficult to Execute Successfully

Hadoop has its own set of constraints and problems, and when you throw BI into the mix it's a whole other ballgame. As companies move their EDW to Hadoop, they expect to be able to migrate their analytical applications as well. While workloads such as ETL, predictive analytics and machine learning work great on Hadoop, the more performance-sensitive BI applications suffer from Hadoop's inherently high latency.

We'll discuss these five harbingers of Hadoop doom and gloom and look at what you can do to avoid those pitfalls and be successful when setting up your own BI on Hadoop endeavor. As is often the case, good planning is an excellent starting point. You need to ensure that you have buy-in from management and that your internal domain experts are actively involved. Your computer hardware and network also need to be able to handle big data, otherwise your system might be painfully slow or unreliable. Finally, in order to make efficient use of your big data you will need Business Intelligence (BI) capabilities, SQL support, and the ability to run complex queries.

1 – Good Planning

The first factor that can result in the failure of your BI on Hadoop project is a lack of good planning, in terms of project scope, budget, time frame, hardware (server storage and computing power), and network requirements. Another reason is a lack of buy-in from top management. Make sure management appreciates the value of your big data project, and your Business Intelligence (BI) framework, and integrates it into their decision-making process. Historically isolated data silos in your organization also pose a barrier and need to be taken into account during planning. “60% of big data projects will fail to make it into production either due to an inability to demonstrate value or because they cannot evolve into existing EIM processes,” concluded Roxane Edjlali, Chief Data Officer Research and Advisory at Gartner.

Understanding Hadoop Project Costs

Project costs are often underestimated, or not fully understood, resulting in poor scope and planning. A useful reference when calculating the cost of big data projects is the Forbes article ‘The Big Cost Of Big Data’. According to the authors, a petabyte Hadoop big-data project can cost around USD 1 million, including up to 250 nodes, compared to tens or even hundreds of millions of US Dollars for an enterprise data warehouse project.

Hardware

Begin with small, iterative steps, and let your organization's internal domain and data experts experiment with Hadoop using a flexible, open data infrastructure. Select server hardware and network infrastructure that is up to the task. Pay attention to your Java Virtual Machine (JVM) and garbage collector settings, memory settings, and the disk and network I/O speeds of your server hardware. Also consider investing in NameNode High Availability (HA, with a secondary NameNode) and Disaster Recovery (DR) hardware.

Appropriate Use Cases

Even if the costs are well understood, the use case may be incorrect. This can mislead you into building a solution that meets all the requirements of the plan, but does not fix the problem you intended to solve in the first place. To prevent this, focus on the goals, and document a number of detailed use cases that will help you configure your system to meet them.

CRISP-DM Process

Figure 1. This diagram shows the relationship between the different CRISP-DM phases in a data mining project. (Kenneth Jensen via Wikimedia Commons)

Your EIM Model and Knowledge Silos

Over-reliance on only data experts, without the active involvement of your organization’s own domain knowledge experts, is also cited as a common reason for failure. You may end up building an impressive project, but it also needs to fit into your organization’s EIM model from the organizational vision and strategy, to the existing data infrastructure. You will need to adjust existing business models, and the business culture in your organization to enable improved internal data sharing and utilization.

The CRISP-DM (Cross Industry Standard Process for Data Mining) process model is a useful standard for BI and covers the full process from understanding the business and data, to creating a technical solution that can provide new insights and improve your business.

 

2 – Performance and Stability

Another important factor that can lead to the failure of a big data project is the stability and performance of the system. Ready access to data, and the ability to process it within acceptable latency times is critical. Performance and stability depend partially on the selection of appropriate hardware and network infrastructure, but also on the architecture of the SQL-on-Hadoop solution. The tools and the architecture must be suitable to your specific use cases and your BI dashboard requirements.

Get Help from a Big Data Vendor

It is advisable to work with a reputable and experienced big data vendor, since building a distributed computing big data project is complex. The vendor can provide support, training, and certification for Hadoop, SQL-on-Hadoop solutions, and BI tools. Companies such as Cloudera, Pivotal, and Hortonworks can help you set up a Hadoop virtual machine, for example, so that you can begin exploring the options for your big data solution.

Hadoop Performance

Hadoop does not guarantee standard response times and is optimized for dealing with large batches of files. As such, an out-of-the-box Hadoop installation may not be suitable for interactive queries on your data. Hadoop is designed around Java and JVM technologies, and as such can suffer from performance problems, such as 'stop-the-world' garbage collection pauses.

3 – Business Intelligence on Hadoop

The third factor on which the success of a Hadoop big data project depends is its ability to provide useful, near real-time BI. In the majority of new big data projects, this is not the case.

An Additional Software Layer

To build a fast and usable BI dashboard that lets you access the information you need, when you need it, with near real-time latency, you will need an additional software layer that can query your data. Be careful with generic out-of-the-box BI solutions: they are often slow, may not be suitable for your specific use case, or are just not flexible enough to adjust to the specific needs of an organization.

A Query Acceleration Engine

Organizations use various SQL-on-Hadoop solutions, such as Hive and Apache Impala, to enable SQL queries on Hadoop, but often learn that they are not fast enough, or flexible enough, for the unique, resource-hungry needs of BI. Another alternative is to add a BI solution that integrates with Hadoop and, in addition, to consider using an acceleration server that sits on top of Hadoop.

An example of one such solution is that provided by Jethro. This solution has an efficient indexing architecture that allows data updates to be loaded incrementally, without locking the indexes. The solution uses intelligent caching and automatic micro-cubes, compresses and sorts data, and creates multi-hierarchical indexes for every column of a large dataset.

Jethro can process, index, and compress the data in a Hadoop system by up to a factor of ten. This not only saves on storage but also reduces the load on data storage hardware, and results in much faster and more efficient operation of the infrastructure.

Data Ingestion

For useful BI you also need to ensure efficient ingestion of new data into your database. When you set up your BI solution, check that it can meet the minimal required latency times for data ingestion. If not configured well, data ingestion can cause extensive system downtime, and result in users missing current data updates in their queries.

4 – SQL on Hadoop

Even though SQL is the default query language used by data analysts (over 30% use it, according to a KDnuggets poll), unfortunately Hadoop does not support SQL out of the box. SQL enables connectivity with BI solutions, but an additional software layer such as Apache Hive, plus an acceleration engine, is needed to ensure acceptable BI performance. Some organizations start off using SQL technologies but soon come to the conclusion that SQL-on-Hadoop solutions alone are not enough to meet their specific requirements or use cases.

SQL-on-Hadoop Solutions

SQL-on-Hadoop solutions that are used for querying big data include Data Warehouse (DWH) solutions such as Cloudera’s Impala – an open-source Massively Parallel Processing (MPP) query engine, or Pivotal’s HAWQ big data analytics solution. The Impala engine provides an SQL-like interface, but uses batch processing partitions instead of indexes, and does not provide multi-tenancy. Apache Hive also provides an SQL-like interface for queries on a Hadoop database, but may require the addition of an acceleration engine.
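
As a rough illustration of how such an engine is consumed by client code, the sketch below queries Impala from Python using the impyla client; the host, port, and table names are placeholders rather than a recommended configuration.

```python
# A rough sketch of querying Impala from Python with the impyla client
# (pip install impyla); host, port, and table names are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# A typical BI-style aggregation; Impala plans it as an MPP query across the cluster.
cur.execute("""
    SELECT product_line, COUNT(*) AS orders
    FROM warehouse.orders
    WHERE order_date >= '2016-01-01'
    GROUP BY product_line
""")
for product_line, orders in cur.fetchall():
    print(product_line, orders)
```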

To see a more detailed breakdown of the SQL-on-Hadoop solutions, view: Hadoop Hive & 11 other SQL-on-Hadoop Alternatives

5 – Fast Query Response Times

In order to successfully execute a big data project it is important to understand that Hadoop is a two-tier solution, which consists of the Hadoop Distributed File System (HDFS) and the YARN (Yet Another Resource Negotiator) resource manager. In fact, Hadoop is not a database, but ‘an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware’ (source: Wikipedia). Hadoop was designed to store and process huge amounts of structured, semi-structured and unstructured data on commodity hardware. Additional requirements such as SQL support, interactive response times, and indexing and caching of data were not the main focus of Hadoop's designers.

Query Response Times

To run queries on Hadoop partitions you will need to design a query-based solution, preferably one that uses indexes; otherwise the solution will need to scan the entire database, or all the data in a partition, each time, resulting in poor performance and unacceptable latency. Indexing is especially important for use cases where users need to perform selective queries and retrieve small subsets of data for further analysis. On top of that, an acceleration engine can improve query response times even further, from hours for complex queries down to minutes, or even seconds for simpler queries.
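
The toy Python sketch below illustrates the underlying principle on in-memory data: a selective lookup by full scan touches every row, while a sorted, binary-searchable "index" narrows the search logarithmically. The data and column names are invented and real engines are far more sophisticated, but the orders-of-magnitude gap is the point.

```python
# Toy comparison of a full scan versus an index lookup over in-memory rows.
# The data and column names are invented for illustration.
import bisect
import time

rows = [{"id": i, "amount": float(i % 97)} for i in range(2_000_000)]
target = 1_500_000

# Full scan: examine every row until the matching id is found.
t0 = time.perf_counter()
scan_hit = next(r for r in rows if r["id"] == target)
scan_ms = (time.perf_counter() - t0) * 1000

# "Index": keep the ids in a sorted list and binary-search it.
ids = [r["id"] for r in rows]            # already sorted by construction
t0 = time.perf_counter()
index_hit = rows[bisect.bisect_left(ids, target)]
index_ms = (time.perf_counter() - t0) * 1000

print(f"full scan: {scan_ms:.1f} ms, index lookup: {index_ms:.4f} ms")
```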

Advanced Queries for BI Dashboards

A good BI dashboard can provide advanced querying capabilities, including standard off-the-shelf reports, charts and widgets, or can enable users to drill down and extract information using their own specific queries. Additional querying options may include ad-hoc reports using many different filters such as region, time, or specific URL, and cross-tab reporting for comprehensive correlation analysis. An efficient query engine that uses indexes will result in a much faster interactive dashboard experience, with minimal load on the Hadoop cluster hardware and network infrastructure. Advanced query engines can even use more than one index at once, along with auto-cubes and columnar indexing, improving performance even further.
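
To make the reporting styles concrete, here is a small, self-contained pandas sketch of an ad-hoc filter and a cross-tab report; the dataset and the region/quarter dimensions are invented for illustration.

```python
# Sketch of an ad-hoc filter and a cross-tab report with pandas; the dataset
# and the region/quarter dimensions are invented for illustration.
import pandas as pd

data = pd.DataFrame({
    "region":  ["emea", "emea", "apac", "apac", "amer", "amer"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "revenue": [120, 150, 90, 130, 200, 210],
})

# Ad-hoc report: filter on a single dimension value.
print(data[data["quarter"] == "Q2"])

# Cross-tab report: revenue summed by region (rows) and quarter (columns).
print(data.pivot_table(values="revenue", index="region",
                       columns="quarter", aggfunc="sum"))
```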

To Summarize

In order to set up and run an effective big data Hadoop project that provides reliable BI, your organization will need to adopt a new mindset that addresses not only the technology, but also the organizational EIM. You will need to conduct a comprehensive analysis of your business with the help of analysts, internal domain experts, and strategists to come up with robust and relevant business use cases. You will also need buy-in from management, and you must take company politics into consideration.

Your big data project needs to work with your existing BI tools, and your security and monitoring systems. Data security needs to be addressed because standard Hadoop implementations have relatively poor security, and many organizations are wary of keeping all their data in one location.

 

 

Database Architectures for Business Intelligence Demystified: A Concise Comparison


Introduction

The year 2008 was one of drastic change in the world economy and in society. Major financial institutions collapsed, while Internet social networking service providers such as Facebook and Twitter (and later Uber and WhatsApp) grew exponentially.

This resulted in enormous changes in the database (DB) arena that had been based on the RDBMS (Relational Database Management System) paradigm since the late 1970s. This paradigm ensured high consistency and data integrity on the one hand, and high availability on the other hand, and was popular within the enterprise world, especially within financial institutions that could afford the high maintenance costs and effort required to scale up their databases.

However, the new social media companies required something drastically different. The new service providers preferred high availability and the ability to scale their networks with low CapEx and OpEx in terms of hardware and personnel costs.

Data stores cannot support availability, consistency and partition tolerance (scaling) simultaneously (according to the CAP theorem), and this provided an opportunity for new, radically different data storage products that began leading in many segments.

But are the new products really significant, and should you use them? And, most importantly, do they work with the leading Business Intelligence (BI) tools in the market? In this paper we explore these questions and provide clear guidelines as to which products we recommend, and for what purpose.

What data stores should you consider?

Data stores only become relevant when people and organizations use them. Important considerations when choosing a data store include the ease of report generation and analysis, and whether it is supported by Business Intelligence (BI) tools like Tableau, Qlik or Microstrategy. When a BI software vendor invests significant resources to support a data store then this is a good indication that the data store is worth considering.

Most of us are familiar with the most common RDBMS systems such as Oracle, MySQL, Microsoft SQL Server, IBM DB2, and Sybase ASE, which still lead the data store market (see db-engines.com ranking scores); however, new data stores are on the rise today.

According to our analysis, and data from the DB-Engines knowledge base, columnar databases (Cassandra, HPE Vertica, Sybase IQ, MonetDB, and SAP HANA), MPP (Massively Parallel Processing) databases (RedShift, IBM Netezza, Pivotal Greenplum, Teradata, Snowflake, and Hadoop Hive), search engines (Splunk, Solr, Attivio), and document data stores (MongoDB, Couchbase, MarkLogic) are gaining significant market share and going mainstream.

So what is causing this dramatic shift? Why have so many new data store categories been created in the last 10 years, and what business purposes do they serve? We will provide some possible answers to these questions below.

RDBMS: Relational Database Management System

Short intro

RDBMS systems are probably the most solid data store solution to date, and have been at the heart of the most common information storage solutions since the 1970s. The design of RDBMS systems best fits enterprise requirements where no transactions can afford to be lost, and high availability is a priority.

Leading products in this category

The leading products in this category are Oracle DB, MySQL, Microsoft SQL Server, IBM DB2, Sybase ASE, and Postgres.

Key architecture

  • Presents data to the user as relationships between entities. Each entity is presented in a tabular format and each table consists of a set of rows and columns.
  • Provides relational operators such as JOIN and UNION to manipulate the data in its tabular format (see the sketch after this list)
  • The most common standardized query language is ANSI SQL, although most vendors use extensions to this standard
  • Data is saved fully normalized
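
The sketch below, using Python's built-in sqlite3 module, shows these ideas in miniature: two normalized tables related by a key, and a JOIN that reassembles them into report-ready rows. The table and column names are illustrative.

```python
# Minimal sketch of the relational model with Python's built-in sqlite3:
# two normalized tables related by a key, queried with a JOIN.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER,
                            amount REAL,
                            FOREIGN KEY (customer_id) REFERENCES customers(id));
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 1, 99.0), (2, 1, 45.5), (3, 2, 150.0);
""")

-- comment placeholder
""")
```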

Strengths

  • Strong consistency of the data
  • Includes mature High Availability (HA) solutions
  • Minimal redundancy of data
  • Independence from the physical data, since it can be stored using various storage engines, and can also be retrieved in the same way

Weaknesses

  • Poor representation of ‘real world’ entities; substantial effort is required from software engineers to transform object-oriented data into an entity model
  • A nonlinear and complex scaling solution
  • High Database Administration (DBA) and maintenance costs
  • Difficult to represent complex data types in an RDBMS, since the table structure is fixed
  • In large data stores it is difficult to change the table structure without causing major downtime, or a slowdown in the system
  • No support for recursive queries. This limits support for hierarchical data models such as organization structure

Common use cases

  • Financial transaction systems
  • Low volume inter-organization data entry systems

RDBMS and BI

For decades BI tools have been designed to work with RDBMS systems that support JOINs between tables, and SQL; however, at ‘big data’ scale RDBMS systems often fail to scale, resulting in poor BI performance.

Tidbit

E. F. Codd, from IBM, coined the term “relational database” in 1970, introducing it in the paper he published: “A Relational Model of Data for Large Shared Data Banks”.

Document Data Stores

Short intro

Document data stores are at the heart of many modern information systems. Software engineers love the many useful features of document data stores, such as built-in sharding, easy-to-set-up clusters, end-to-end support for JSON, and rapid deployment that avoids the need for a DBA in the initial phase.

Leading products in this category

The leading products in this category are MongoDB, Couchbase, and MarkLogic.

Key architecture

Document data stores save data as JSON-like documents in collections/tables. The document structure is flexible and every document can have different fields. For this reason, when you want to add new fields you do not need to make any change in existing documents or table structure. Data can be retrieved easily from the data store using predefined indexes.
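
A minimal sketch of this model, using the pymongo client against a hypothetical local MongoDB, is shown below; the database, collection, and field names are placeholders.

```python
# Sketch of the document-store model with pymongo (pip install pymongo); the
# connection string, collection, and fields are placeholders assuming a local MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["catalog"]["products"]

# Two documents in the same collection with different fields: no schema change needed.
products.insert_one({"sku": "A-100", "name": "Router", "ports": 8})
products.insert_one({"sku": "B-200", "name": "Cable", "length_m": 3, "color": "blue"})

# Retrieval goes through a (predefined or ad-hoc) index on a field.
products.create_index("sku")
print(products.find_one({"sku": "B-200"}))
```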

Strengths

  • Support rapid development, since software engineers only need to serialize data for data persistence
  • Schema-less design enables support of new fields in a populated table without downtime or effects on the user experience.
  • Data is stored in the JSON format, which is a natural fit for full-stack JavaScript platforms
  • High insertion rates due to built-in sharding
  • Very easy to set up and maintain. Operational cost is usually 10% of an RDBMS system of the same size

Weaknesses

  • Poor support of JOIN capabilities between entities
  • If the document structure is complex, it is difficult to use BI tools and structured queries
  • Unlike RDBMS systems that assign default values to fields and have data validation rules, when there is no schema these tasks must be managed at the application level.

Common use cases

  • Content Management
  • Catalog systems
  • Analytics systems
  • Billing systems where mediation can be avoided

Document Data Stores and BI

Most BI tools have only recently begun supporting document data stores, since most of the data stores did not support SQL. Even today, the lack of JOIN operations between entities still limits the type of analysis and dashboards you can create based on these document data stores.

Tidbit

MongoDB is considered the 4th most widely used data store according to the db-engines.com knowledge base, just behind the 3 RDBMS products that lead the market (Oracle, MySQL and Microsoft SQL Server).

Columnar Databases

Short intro

Columnar databases store data in indexes and enable near real-time query responses and high insertion rates of 1 GB/s. These databases support SQL and are efficient in terms of storage and memory usage.

Leading products in this category

The leading products in this category are Cassandra/DataStax, HPE Vertica, Jethro, Sybase IQ, MonetDB, and SAP HANA.

Key architecture

Columnar databases store data as columns rather than as rows. They are designed to retrieve specific columns quickly without the need to retrieve the complete row and discard non-relevant fields. Data is stored in indexes rather than in data tables, and therefore there is no need to retrieve data from the tables themselves.
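
The toy Python sketch below contrasts the two layouts; the data is invented, and real columnar engines add encoding, compression, and indexing on top, but it shows why a per-column aggregate only has to touch one array.

```python
# Toy contrast between row-oriented and column-oriented layouts: summing one
# column only needs that column's array in a columnar layout. Names are illustrative.

# Row store: each record is kept together.
row_store = [
    {"user": "a", "country": "US", "amount": 10.0},
    {"user": "b", "country": "DE", "amount": 20.0},
    {"user": "c", "country": "US", "amount": 5.0},
]

# Column store: each column is kept (and encoded/compressed) separately.
column_store = {
    "user":    ["a", "b", "c"],
    "country": ["US", "DE", "US"],
    "amount":  [10.0, 20.0, 5.0],
}

# SELECT SUM(amount): the row store touches every field of every row,
# while the column store touches a single contiguous array.
print(sum(r["amount"] for r in row_store))
print(sum(column_store["amount"]))
```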

Strengths

  • Storage and memory use is efficient as every column is encoded and values are stored in index nodes
  • Near real-time access to data (1-2 ms for read/write), based on data on index leaves
  • Standard SQL is nearly fully supported

Weaknesses

  • Data is modeled for specific queries
  • If multiple queries are needed from the same data, then the data may need to be stored twice
  • The SELECT * FROM pattern that is widely used in software programming should be avoided, otherwise it may impact performance

Common use cases

  • Authentication systems that can verify a user in O(1)
  • Personalization systems that can return user data in O(1)
  • Real-time bidding systems and online exchanges
  • Real-time analytics where the insert rate is a key requirement
  • Large data sets that exceed 1 TB in a single table/entity
  • Retail systems where branch independence is key. Columnar databases allow users to read/write to each node even when the network between a retail branch and its headquarters is cut off
  • Financial transaction real-time risk management systems

Indexes

Columnar databases are based on indexes. The indexes enable fast access to data by creating a tree-like search pattern where every node serves as a decision point, eliminating half of the remaining items in the table from the search. However, indexes best serve specific queries that fit the specific index, and this may result in poor performance for other queries.

Columnar Databases and BI

Most columnar databases support SQL, and some support JOIN operations. These features position them as the best candidates for BI on “Big Data.” However, since data must be designed for specific queries, it is difficult to provide flexible queries for BI analysts.

Tidbit

Jethro, which provides a columnar database based on Hadoop, has solved the low-flexibility challenge of columnar databases (and indexes in general) by integrating search indexing (see below) into its solution. This makes Jethro the best fit for ‘big data’ and BI use cases.

MPP: Massively Parallel Processing Data Stores

Short intro

MPP provides rapid and powerful analytics on petabyte-scale data volumes. This category used to be known as the data warehouse appliance category, and specializes in enabling ad-hoc queries on large-scale data repositories for long-term analytics.

Leading products in this category

The leading products in this category are Amazon Redshift, IBM Netezza, Pivotal Greenplum, Teradata, Snowflake, Google BigQuery, Hadoop Impala, Hadoop Hive, Hadoop Drill, Hadoop Hbase, Facebook Presto, Spark SQL, MemSQL, EXASolution, and Kognitio.

Key architecture

  • Shared-nothing architecture, where the entire query is executed on the nodes, with an emphasis on minimizing data movement
  • Data is stored on different nodes
  • Queries are split across the nodes, and only summarized and compacted results are returned from each (see the sketch below)
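
Here is a toy scatter-gather sketch of that pattern using Python's multiprocessing module; the partitions and the aggregation are invented for illustration, but the shape (local aggregation per node, compact partial results merged by a coordinator) is the MPP idea.

```python
# Toy scatter-gather sketch of the MPP pattern: each "node" aggregates its own
# partition, and only the compact partial results travel back to be merged.
from multiprocessing import Pool
from collections import Counter

# Data is pre-distributed: one partition of (region, amount) pairs per node.
partitions = [
    [("emea", 10), ("apac", 5), ("emea", 7)],
    [("amer", 12), ("emea", 3)],
    [("apac", 8), ("amer", 1), ("amer", 4)],
]

def local_aggregate(partition):
    # Runs "on the node": sum amounts per region for this partition only.
    totals = Counter()
    for region, amount in partition:
        totals[region] += amount
    return totals

if __name__ == "__main__":
    with Pool(processes=len(partitions)) as pool:
        partials = pool.map(local_aggregate, partitions)
    # The coordinator merges the small partial results, not the raw rows.
    final = sum(partials, Counter())
    print(dict(final))   # {'emea': 20, 'apac': 13, 'amer': 17}
```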

Strengths

  • The data architecture is flexible enough to support any query with a dedicated index, or table model design
  • Supports petabyte-level data requirements
  • Supports the requirements of the modern offline machine-learning model

Weaknesses

  • Poor support for real-time and near real-time use cases as query results can take between several seconds and several hours
  • BI queries require massive scanning of data, and result in high utilization of server resources.
  • Requires dedicated hardware, or large-size installations. MPP is not recommended for installations with less than 10 TB of data, or those with more than 100 TB of data.
  • Scalability is limited (a maximum of 100-200 nodes)

Common use cases

  • Large data warehouses with over 10 TB of data
  • Long-term analytics
  • Ad-hoc queries against that data
  • Enterprise level data lakes
  • Data store for training machine learning systems and models

Cubes

Cubes are multi-dimensional tables, with dimensions such as period, product and city, where data can be easily sliced and summarized according to one or more of the dimensions. As cubes consist of large normalized tables, they are best suited for data warehouses.
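
A small pandas sketch of slicing and summarizing such a cube over period, product, and city dimensions follows; the figures are invented for illustration.

```python
# Sketch of slicing and summarizing a small cube (period x product x city)
# with pandas; the data is invented for illustration.
import pandas as pd

facts = pd.DataFrame({
    "period":  ["2016-Q1", "2016-Q1", "2016-Q2", "2016-Q2"],
    "product": ["widget", "gadget", "widget", "gadget"],
    "city":    ["Berlin", "Berlin", "Paris", "Paris"],
    "sales":   [100, 80, 120, 90],
})

# The "cube": sales summarized over every combination of the three dimensions.
cube = facts.groupby(["period", "product", "city"])["sales"].sum()

# A slice: fix one dimension (period) and summarize over the rest.
print(cube.loc["2016-Q2"])
print(facts.pivot_table(values="sales", index="product",
                        columns="period", aggfunc="sum"))
```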

MPP and BI

MPP provides high flexibility that enables BI analysts to create any query they want, and many MPP data stores support SQL. However, they often scan massive amounts of data in order to return results, and this may result in high latency and extensive usage of server resources (CPU, RAM and disk).

Tidbit

IBM Netezza is sold as a dedicated appliance that includes FPGA to augment CPUs and minimize network bus traffic.

Search Engines

Short intro

The search engine (also known as Google) is the heart of the internet. As enterprise software and SaaS become more content-aware and support larger data sets, the use of internal search engines to locate relevant data, or to create context- and geolocation-aware results, is going mainstream.

Leading products in this category

The leading products in this category are Splunk, Solr, and Attivio.

Key architecture

During the ETL (Extract, Transform, and Load) phase, search engines analyze the input data and create terms and frequencies. These frequencies serve as a basis for indexes and inverted indexes that enable searching of the data, modeled as documents, including fields and attributes such as geolocation.
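
The toy sketch below builds such an inverted index in Python from a handful of documents and answers a two-term query by intersecting posting lists; it is a simplification that ignores stemming, scoring, and term frequencies.

```python
# Toy sketch of the inverted index a search engine builds at ETL time:
# term -> set of document ids. Documents and terms are invented for illustration.
from collections import defaultdict

documents = {
    1: "hadoop cluster sizing and tuning",
    2: "business intelligence dashboards on hadoop",
    3: "tuning garbage collection for the jvm",
}

inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

# A query intersects the posting lists instead of scanning every document.
query = ["hadoop", "tuning"]
hits = set.intersection(*(inverted_index[t] for t in query))
print(hits)   # {1}
```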

Strengths

  • Support for full-text searches, where users can query any field and get results in near real-time
  • Support for spelling corrections
  • Custom search models, where the order of results can be boosted based on given attributes (such as distance from a given location)

Weaknesses

  • Search engines are not designed to be a primary data source
  • Results may not be 100% accurate and may not be repeatable
  • Results may depend upon the hardware load (RAM, CPU and disk utilization)
  • Data load is CPU intensive

Common use cases

  • Log aggregation and analysis
  • Content management system internal search capabilities
  • Internal website search
  • Auto-completion and support for fuzzy searches (spelling mistakes)
  • Location-based systems that adjust search results and offer results based on the user's location

Search Engines and BI

Much like document data stores, most search engines do not support SQL or JOIN operations. This results in limitations in the type of analysis and dashboards you can create based on these data stores.

Tidbit

Although Apache Lucene is supported by only a few BI vendors, it is actually a key component in the fast growing search products Solr and Elasticsearch.

Summary

Various business requirements have resulted in major changes in the database and data-store landscape in the past 10 years. The industry has moved from a single technical solution, the glorified RDBMS (best fit for enterprises), to a wide range of products, each fit for a different business requirement. These products now match enterprise requirements and are supported by leading BI providers.

Delivering a Consistent Interactive Business Intelligence Experience

Say you’re an analyst using Tableau with live-connect to Hadoop to visualize and analyze your glorious big data—somewhere in the neighborhood of 2,879,987,999 rows. A few clicks and drags later and you’re glued to your desk staring at the loading spinner of doom and gloom wondering if your dashboard will refresh before your jeans go out of style.

Now, you might be thinking to yourself, “Whoah, I can live-connect Tableau directly to Hadoop? Cool!” The answer is yes, you can. Or maybe you’re a more advanced user and think, “We already use Hive/Impala/SparkSQL to access our Hadoop data, but it’s far from an interactive experience. Sometimes it works, more often it’s slower than a snail on peanut butter.”

When you want your BI tool (Tableau, Qlik, MicroStrategy, etc.) or your own analytics dashboard to access data on Hadoop, you're going to need to add a SQL-on-Hadoop layer. Hive, Impala, Presto and SparkSQL are all classic examples of SQL-on-Hadoop engines. If you take a look under the hood, you'll learn that these are designed around an MPP (Massively Parallel Processing) architecture that typically requires a full scan of the dataset's rows for every query. This works well for certain use cases, like predictive analytics or machine learning jobs. However, full-scan solutions by nature will not deliver consistently interactive-speed BI, even if your company has already poured tons of coin and effort into trying to make a square peg fit into a round hole. We've seen many tech teams with the best intentions pave this perfectly executed road to BI-on-Hadoop project failure.

So, if you want to melt your frozen data lakes and unleash truly interactive BI on Hadoop—at any scale and without sacrificing any data freshness—you’re going to need to get a SQL-on-Hadoop engine that was designed for the specific use case of BI-on-Hadoop. Sounds logical enough, right? That’s exactly why we created Jethro 2.0.

Jethro 2.0 was carefully crafted to deliver a consistently interactive BI-on-Hadoop experience in almost any scenario. To pull off this tricky feat, the team at Jethro went to the drawing board and architected a product that combines a columnar SQL engine with search indexing. Instead of scanning all the rows with every query, Jethro uses indexing, micro-cubes and a query-result cache to deliver query results from Hadoop back to a BI dashboard at interactive speed. Jethro was designed to run as an edge node, and its data never leaves HDFS.

Jethro Query Processing

  • Indexes are used when queries are filtered, allowing Jethro to surgically access only the data needed for the query and stream results to the BI dashboard
  • NEW to Jethro 2.0 are “auto micro-cubes,” which further accelerate performance. These are small aggregations (similar to OLAP cubes) that are dynamically generated by Jethro based on actual query patterns that Jethro learns. Large cubes are avoided by using indexes to speed up queries with multiple filters or high-cardinality columns
  • A query-result cache is managed by Jethro, which captures the results of each query and decides whether to store them in HDFS based on result-set size and total execution time. When the same query repeats, as is typical for dashboards, it is served out of the cache with no processing (a generic illustration of the idea follows this list)
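
As a generic illustration of the caching idea only (not Jethro's actual implementation), the sketch below keys results on normalized SQL text and caches them when a query was expensive to run and its result set small; the thresholds and function names are invented.

```python
# Generic illustration of a query-result cache (not Jethro's implementation):
# results are keyed by normalized SQL text and reused when the same dashboard
# query repeats; the thresholds are invented for the example.
import hashlib
import time

cache = {}
MAX_CACHED_ROWS = 100_000          # only cache reasonably small result sets
MIN_EXECUTION_SECONDS = 1.0        # only cache queries that were expensive to run

def run_query(sql, execute):
    key = hashlib.sha256(" ".join(sql.lower().split()).encode()).hexdigest()
    if key in cache:
        return cache[key]              # repeated dashboard query: no processing
    start = time.perf_counter()
    rows = execute(sql)                # delegate to the real engine
    elapsed = time.perf_counter() - start
    if len(rows) <= MAX_CACHED_ROWS and elapsed >= MIN_EXECUTION_SECONDS:
        cache[key] = rows
    return rows

# Example (hypothetical engine function):
#   run_query("SELECT region, SUM(amount) FROM t GROUP BY region", engine_fn)
```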

Jethro's three performance features work in tandem and complement each other: the cubes complement Jethro's indexing and caching, ensuring that Jethro can consistently accelerate the query performance of BI tools and dashboards in almost any scenario.

When a BI tool sends a query to Hadoop, Jethro’s three performance features work in tandem to deliver the fastest query response time. The Jethro SQL Acceleration Engine categorizes the query and delegates it to the optimal process.


Whether the end user is an analyst visualizing 2,879,987,999 rows of data on Tableau, or a client of a large company looking at their own metrics in a SaaS BI/analytics dashboard, Jethro 2.0 is a must-have addition to the tech stack for a consistent, interactive-speed business intelligence experience.

Access a live demo to see how blazingly fast your BI tools and analytics dashboards can be.

 

Big Data Acronyms & Abbreviations Every Data Pro Should Know

Acronym Meaning
3DM Data Driven Decision Making
ACID Atomicity, Consistency, Isolation and Durability
ADAPA Adaptive Decision and Predictive Analytics
AOSD Aspect-Oriented Software Development
AQL Annotation Query Language
BDA Big Data Analytics
CART Classification and Regression Trees
CCA Canonical Correlational Analysis
CEP Complex Event Processing
COTS Commercial off-the-shelf
CQL Cypher Query Language
CQL Cassandra Query Language
CQL Contextual/Common Query Language
DAD Discover, Access, Distill
DAG Directed Acyclic Graph
DHSL Distributed Hadoop Storage Layer
ECL Enterprise Control Language
EDA Exploratory Data Analysis, Event Driven Architecture
EPN Event Processing Nodes
ETL Extract, transform and load
FUSE Filesystem in Userspace
GEOFF Graph Serialization Format
HAR Hadoop Archive
HDFS ACL Hadoop Distributed File System Access Control List
HPCC High Performance Computing Cluster
hPDL Hadoop Process Definition Language
HPIL Hadoop Physical Infrastructure Layer
HTAP Hybrid Transactional and Analytical Processing
IDA Initial Data Analysis
JAQL JSON Query Language
JSON JavaScript Object Notation
KFS Kosmos File System
LZO Lempel-Ziv-Oberhumer
MDM Master Data Management
MPP Massively Parallel Processing
NLP Natural Language Processing
OLAP Online Analytical Processing
OLTP Online Transactional Processing
PMML Predictive Model Markup Language
PXF Pivotal Xtension Framework
RDD Resilient Distributed Dataset
S4 Simple Scalable Streaming System
SOA Service Oriented Architecture
TDA Topological Data Analysis
UDAF User Defined Aggregate Function
UDTF User Defined Table-generating Function
UIMA Unstructured Information Management Architecture
UPPI Universal PMML Plug-in
VC Vapnik-Chervonenkis Dimension
W3C World Wide Web Consortium
WEKA Waikato Environment for Knowledge Analysis
XDW Extended Data Warehouse
XML Extensible Markup Language
YARN Yet Another Resource Negotiator
ZFS Zettabyte File System by Sun Microsystems

Hadoop Summit 2016 North America in San Jose

When is Hadoop Summit 2016 North America in San Jose?

The Hadoop Summit 2016 North America in San Jose will be held from June 28th – June 30th 2016.

Where is Hadoop Summit 2016?

The Hadoop Summit 2016 North America in San Jose will be in the San Jose Convention Center located at 150 W San Carlos St, San Jose, CA 95113.

About Hadoop Summit 2016

The 8th Annual Hadoop Summit in San Jose is a three-day-event that will kick off on Tuesday June 28th and bring together Apache community leaders and key players under one roof. Attendees will enjoy hearing about dev and admin tips and tricks, see presentations about successful Hadoop use cases, and get educated on ideal ways to leverage Apache Hadoop as an integral component in their enterprise data architecture.

Visit Jethro & Grab Some Data Swag

Jethro will be exhibiting its SQL-on-Hadoop acceleration solution. We will discuss ways to achieve interactive speed on BI dashboards and visualizations, such as Tableau or Qlik, with direct data connections to multi-billion-row data sets without impacting Hadoop cluster load. Data architects and data scientists are invited to Jethro's booth to see live benchmarks and demos.

We look forward to seeing you at the Hadoop Summit 2016 in San Jose!