With Hadoop Summit San Jose just around the corner, I thought it might be helpful to preview what to watch for at the conference. In some ways, not much has changed in the past few months - streaming data is a hot topic, more and more people are adopting adjacent technologies (like Spark), and “in memory” is “in vogue” in the world of big data. However, a quick tour of the Hadoop Summit website reveals a few more trends that deserve additional attention.
So, while you are strolling the halls and ballrooms of the San Jose Convention Center, you may want to keep the following topics in mind.
Apache Commitment Remains Key: While different vendors in the Hadoop space have chosen different paths towards delivering commercial Hadoop distributions, one thing that’s become clear is that there is both market and customer value in donating projects to the Apache Foundation. Clearly Hortonworks has been a leader here, consistently ensuring that all components of their distribution are Apache open source components. But the past few years have seen Cloudera following suit by launching Sentry in the Apache Incubator and by donating Impala and Kudu to the Apache Foundation. Initially developed by MapR, Drill is also a top-level Apache project and is receiving a lot of attention. Even Pivotal has joined the Apache crowd, recently launching HAWQ as an Apache Incubator project. Based on all of this activity, one thing is abundantly clear - the market has spoken, and Apache open source is the way to go for the core elements of the Hadoop ecosystem.
Spark Continues to Gain Momentum: Remember when Spark was just a toolkit for those few data scientists hidden in the back rooms of the data center? Well, things have changed, and Spark continues to penetrate a growing number of big data use cases. With the inclusion of Apache Zeppelin as a visualization and data sharing tool (also packaged by Hortonworks), more and more users are leveraging Spark as a platform for data analysis and visualization. Additionally, the fast pace of Spark development has led to significant performance and functionality improvements (as we detailed in our BI-on-Hadoop Benchmarks), leading to broader adoption of Spark as a component of an interactive analytic query strategy. Keep an eye out for how enterprises are beginning to adopt Spark for mission-critical functions, not just back-office data science.
Security First, Security Always: With the increasing adoption of the “Data Lake Mindset”, more and more enterprises are using Hadoop as the shared, singular location for all of their data. And while this approach holds the promise of delivering great value, it brings with it new and more stringent requirements in the area of data security and governance. If you haven’t become familiar with Kerberos, Apache Sentry, Apache Ranger, or Apache Atlas, then you may want to seek out sessions that expose you to the expanding and complex world of Hadoop data security and governance. You may also want to become familiar with concepts like Delegated Authorization and Impersonation. Similarly, the data sprawl invited by the Hadoop Data Lake has led to demands for a shared semantic layer on top of this data, which keeps data governance and metadata management top of mind.
Cloud Adoption Shows no Signs of Slowing: No matter how much you would like to believe that you can solve your business customers’ problems with a pure on-premises solution, the promise of the cloud has become a reality. While it may be the case that the core data assets remain behind the firewall (and even this is not true for many forward-looking enterprises), the ability to spin up and spin down resources in the cloud and easily connect BI tools to this data has changed how IT professionals in the data space need to view the cloud. You can either pretend that this shift isn’t happening, or fundamentally change your views and expand your data strategy to include the cloud. While you’re at Hadoop Summit, keep an eye out for best practices around how enterprises are navigating the new world of this type of hybrid data deployment.
Business Value is the Most Important Discussion to Have: At the end of the day, the marriage of what the Hadoop Summit website calls the “technologies and business drivers that are transforming big data” is intended to drive “Business Value”. And although the concept has been an undercurrent in previous years’ conferences, I do think that the next 18 months will usher in the ultimate realization of Hadoop as a platform that delivers the next generation of data-driven applications. With innovations like Hortonworks Data Flow, the maturing of Spark, the secure collection and provisioning of data, and hybrid deployments of cloud-based services, this year’s Hadoop Summit is a great place to learn about how innovative companies are assembling these pieces to drive real value from their big data investments.
My hope is that you found a few nuggets of value in here. I welcome your feedback and thoughts. See you at the show, or visit www.atscale.com to see where else we might meet in the future.