AtScale Blog

Stay Open: Keeping Hadoop Accessible Means Limitless Possibilities

Posted by Ali Haeri on Aug 13, 2015

With the rise of Big Data, organizations are seeking new and better ways to leverage large sets of business data for competitive advantage. To meet their growing data demands, today’s businesses have two main options to choose from: proprietary software platforms licensed from commercial vendors, and the open source Hadoop big data analytics platform.

While both of these big data technologies have merit for business, a growing number of use cases suggest that the all-in-one approach of many proprietary software solutions is restrictive and problematic, as it locks business users into a single monolithic platform. In contrast, open source Hadoop delivers levels of flexibility, agility and scalability that can open up limitless possibilities for business analytics.

What follows is a look at the many benefits of keeping Hadoop open and accessible.     

Flexibility

First off, being open on the Hadoop layer makes a number of SQL and Hadoop engines available for use on the back end. By relying on these, rather than proprietary engines, organizations can take full advantage of Hadoop’s scale-out architecture, using the power of Hadoop as a cluster to do the heavy lifting.

There are a number of backend Hadoop “flavors”, and they are all compatible with the Hive Metastore. This is critical because the Hive Metastore is the standard way in Hadoop to describe how data is structured and where it lives, so most tools in the Hadoop ecosystem can take advantage of it. This is of special value to business users, as they can store all of their data once in HDFS, where it’s managed through the Hive Metastore, and then access it via multiple SQL query engines. Instead of being locked into one proprietary approach, business users can take full advantage of the latest SQL-on-Hadoop improvements and innovations.
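To make the "store once, query from many engines" idea concrete, here is a minimal toy sketch in Python. It is not the Hive Metastore API: the `METASTORE` and `STORAGE` dicts, the `TableDef` class, the HDFS path, and both "engine" functions are hypothetical stand-ins, assumed purely for illustration. The point is only that the table is defined in one shared catalog and every engine resolves it from there.

```python
# Toy model only: a dict stands in for the Hive Metastore, and two plain
# functions stand in for separate SQL engines. None of these names are real APIs.
from dataclasses import dataclass


@dataclass(frozen=True)
class TableDef:
    location: str   # where the files live (hypothetical HDFS path)
    columns: tuple  # column names, in file order


# The shared catalog: each table is defined exactly once.
METASTORE = {
    "sales": TableDef(location="hdfs:///warehouse/sales",
                      columns=("region", "amount")),
}

# Stand-in for the files stored at each location.
STORAGE = {
    "hdfs:///warehouse/sales": [("east", 100), ("west", 250), ("east", 50)],
}


def read_table(name):
    """Resolve a table through the shared catalog and return rows as dicts."""
    table = METASTORE[name]
    return [dict(zip(table.columns, row)) for row in STORAGE[table.location]]


def engine_a_total(name, column):
    """'Engine A': a plain sum over one column."""
    return sum(row[column] for row in read_table(name))


def engine_b_total_by(name, key, column):
    """'Engine B': a grouped sum over the same catalog entry."""
    totals = {}
    for row in read_table(name):
        totals[row[key]] = totals.get(row[key], 0) + row[column]
    return totals
```

Both functions read the same single copy of the data through the same definition, so swapping in or adding an engine does not require re-describing or re-loading the table.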

Businesses will find that each Hadoop flavor has its own advantages. For example, engines that take advantage of Hadoop YARN can integrate smoothly with other engines on Hadoop, whether those are running batch workloads or other interactive workloads. Other flavors take advantage of query federation, the ability to retrieve data that lives outside of Hadoop and make it available for query.

Security

It should be emphasized that each of these projects is backed by a different sponsor, which keeps them free, open source, and rapidly developing. This is not the case with proprietary software, and business users would do well to keep their options open. All-in-one proprietary tools that come with their own visualization layers, their own engines, and their own built-in ETL methodology are problematic by nature. By sticking with Hadoop and these open source engines, organizations will be safe and able to future-proof their infrastructure in an ever-evolving Hadoop ecosystem.

Versatility

The real power of open source Hadoop lies in the ability to leverage a single semantic layer, while staying open on both the back end and the front end. The same semantic layer, built once, works regardless of which BI tool is used to access it.

A few examples of leveraging a single semantic layer across different platforms are:   

Running Tableau on Hadoop puts all of the capabilities of the Tableau tool at the business user’s disposal.

Running Excel on Hadoop enables the user to build live pivot tables and drill through to the detail directly on the cluster.

Running Qlik Sense Desktop on Hadoop allows users to run live queries against the Hadoop cluster through AtScale.
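As a rough illustration of what "built once, used by every front end" can mean, the sketch below defines business terms in one place and generates the same query for two different callers. Everything here is an assumption made for the example: the `SEMANTIC_LAYER` structure, the table and column names, and the `build_query` helper are hypothetical and do not represent AtScale's product or any BI tool's interface.

```python
# Hypothetical semantic layer: names and structure are illustrative only.
SEMANTIC_LAYER = {
    "table": "sales",
    # Business-friendly names mapped to SQL expressions, defined once.
    "dimensions": {"Region": "region", "Order Year": "year(order_date)"},
    "measures": {"Revenue": "SUM(amount)", "Orders": "COUNT(*)"},
}


def build_query(dimension, measure, layer=SEMANTIC_LAYER):
    """Translate business terms into one SQL string using the shared layer."""
    dim_expr = layer["dimensions"][dimension]
    measure_expr = layer["measures"][measure]
    return (
        f"SELECT {dim_expr} AS dim, {measure_expr} AS value "
        f"FROM {layer['table']} GROUP BY {dim_expr}"
    )


def dashboard_request():
    """Stand-in for a dashboard tool asking for revenue by region."""
    return build_query("Region", "Revenue")


def spreadsheet_request():
    """Stand-in for a spreadsheet pivot asking the same question."""
    return build_query("Region", "Revenue")
```

Because both front ends go through the same definitions, "Revenue" cannot quietly mean one thing in a dashboard and another in a spreadsheet.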

When moving to open source, IT and business executives should keep the following considerations in mind:

  • Query data in place. Don’t move it. This gives you the ability to handle new data sets, and to handle them at scale.

  • Create a single semantic layer rather than having multiple definitions of reality.

  • Instead of using proprietary software, use the Hadoop production cluster as the single cluster to do your data processing. This will vastly simplify your data stack and allow you to scale out one resource: your Hadoop cluster. You can then add nodes, and with them capacity, at will.

  • Begin experimenting with and migrating to schema-less approaches to modeling data. Break away from the traditional star schema way of thinking. Start thinking about nested schemas, which provide additional agility and performance by letting you operate on the data in place, using a schema-on-demand approach.

  • Remember: openness on the front end and openness on the back end. Being open allows you to use open source engines and any BI tool you choose. On the back end, choose engines that are open source and free. On the front end, don’t get locked into a proprietary visualization stack. Leverage what you already have, choose the best visualization tool for you, and then stick with it.
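The "query in place" and "schema on demand" points above can be sketched in a few lines of Python. This is a toy under stated assumptions: the `RAW_EVENTS` records, the field names, and the `get_path`, `project`, and `total_quantity` helpers are all hypothetical, and a real cluster would do this with a SQL-on-Hadoop engine rather than the `json` module. What it shows is that the nested records are left as they are, and each query decides which fields it needs at read time.

```python
import json

# Hypothetical raw, nested records left in their original form (no upfront
# flattening into a star schema).
RAW_EVENTS = [
    '{"user": {"id": 1, "geo": {"country": "US"}}, "items": [{"sku": "a", "qty": 2}]}',
    '{"user": {"id": 2}, "items": [{"sku": "b", "qty": 1}, {"sku": "a", "qty": 3}]}',
]


def get_path(record, path, default=None):
    """Follow a dotted path such as 'user.geo.country' through nested dicts."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default  # a missing field is tolerated, not an error
        current = current[key]
    return current


def project(raw_lines, paths):
    """Apply a schema at read time: pull only the requested fields."""
    for line in raw_lines:
        record = json.loads(line)
        yield {path: get_path(record, path) for path in paths}


def total_quantity(raw_lines):
    """Aggregate over the nested 'items' arrays without reshaping the data."""
    return sum(item["qty"]
               for line in raw_lines
               for item in json.loads(line)["items"])
```

A new question, say one that needs `user.geo.country`, only changes the list of paths passed to `project`; the stored data is not remodeled or moved.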

The Big Data era is here. Those organizations that choose open source Hadoop as a business analytics solution over proprietary software stand to gain competitive advantage through the limitless possibilities that Hadoop’s agility, flexibility and scalability can provide.    

Topics: Hadoop
