AtScale Blog

The Future of Hadoop: Spark

Posted by Bruno Aziza on Aug 28, 2015

Understand why Spark has been so widely adopted and learn about some of today's Spark use cases. Take a deep technical dive into its architecture and the vision for the Hadoop ecosystem, and see why Spark is the successor to MapReduce for Hadoop data processing.

Easy and Fast Big Data

Spark not only makes big data easy and fast to process, it is also easy to develop for. It offers rich APIs in Java, Scala, and Python. You can also open the interactive shell, start typing commands, and do all kinds of work right there. The end result is significantly less code, as much as 2-5x less. Spark is also fast to run because it uses memory very effectively. Its general execution graphs and in-memory storage make it up to 10x faster than MapReduce on disk and up to 100x faster in memory.

Take Advantage of Memory

Spark introduces the concept of resilient distributed datasets (RDDs), which provide a very easy way to store data in memory. Data is distributed across the cluster, and you don't have to worry about machine crashes, for example, because the system manages a distributed, fault-tolerant cache for you. When a data set does not fit in memory, Spark can fall back to disk. Fault tolerance comes from the concept of lineage: Spark records the transformations that produced each data set, so it can recompute lost partitions instead of relying on replicated copies.
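To make the lineage idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the `ToyDataset` class and its `lose_partition` method are hypothetical names invented for illustration. It shows only the core idea that a cached partition can be thrown away and rebuilt from the source data plus the recorded transformations.

```python
from typing import Callable, Dict, List, Optional


class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API).

    It keeps the source partitions, the ordered list of transformations
    (the lineage), and a cache of partitions computed so far.
    """

    def __init__(self, source: List[List[int]],
                 lineage: Optional[List[Callable[[int], int]]] = None) -> None:
        self._source = source                   # input partitions, never modified
        self._lineage = list(lineage or [])     # per-element transformations, in order
        self._cache: Dict[int, List[int]] = {}  # partition index -> computed values

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # A transformation only extends the lineage; nothing is computed yet.
        return ToyDataset(self._source, self._lineage + [fn])

    def _compute(self, index: int) -> List[int]:
        # Replay the whole lineage against one source partition.
        values = self._source[index]
        for fn in self._lineage:
            values = [fn(v) for v in values]
        return values

    def partition(self, index: int) -> List[int]:
        # Serve from the cache, recomputing from lineage on a miss.
        if index not in self._cache:
            self._cache[index] = self._compute(index)
        return self._cache[index]

    def lose_partition(self, index: int) -> None:
        # Simulate a machine crash that wipes one cached partition.
        self._cache.pop(index, None)

    def collect(self) -> List[int]:
        return [v for i in range(len(self._source)) for v in self.partition(i)]


base = ToyDataset([[1, 2], [3, 4]])
result = base.map(lambda x: x * 2).map(lambda x: x + 1)

before = result.collect()   # computes and caches both partitions
result.lose_partition(1)    # "crash": cached partition 1 is gone
after = result.collect()    # partition 1 is rebuilt from its lineage
print(before == after)
```

The real system is far more involved (scheduling, shuffles, spilling to disk), but the recovery principle is the same: keep the recipe, not a second copy of the data.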

Out of the Box Functionality

Spark is also well integrated with Hadoop and supports all the standard Hadoop data formats. It runs very well on YARN in mixed clusters and behaves properly in an environment where you're running both Spark and other workloads.

Spark's language and API support includes SparkR, Java 8, schema support in Spark's APIs, and SQL support in Spark Streaming. The project also keeps adding libraries, such as MLlib, GraphX, Spark Streaming, and Spark SQL.

Customer Use Cases

Organizations in a number of sectors use Spark to improve their processes, including financial services, genomics, data services, and healthcare. In financial services, Spark is used for portfolio risk analysis, for speeding up ETL pipelines, and for analyzing 20 years of stock data. There it replaces home-grown applications. In genomics, two use cases identify disease-causing genes in the full human genome, with Spark replacing a MySQL engine. In data services, Spark is used for trend analysis with statistical methods on large data sets, for document classification (LDA), and for fraud analytics. In those cases it either replaces Netezza or supports net-new workloads. Finally, in healthcare, Spark is used to calculate Jaccard scores on health care data sets, which is a net-new workload rather than a replacement for an existing system.
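For readers unfamiliar with the Jaccard score mentioned in the healthcare example, it measures the similarity of two sets as the size of their intersection divided by the size of their union. The sketch below is plain Python, not Spark code, and the patient records and diagnosis codes are made-up illustrative values. Returning 0.0 for two empty sets is a convention chosen here, not something the original use case specifies.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Return |A intersect B| / |A union B| for two collections of items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets score 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical diagnosis codes for two patients (illustrative data only).
patient_1 = {"E11", "I10", "J45"}
patient_2 = {"I10", "J45", "K21"}

# Two shared codes out of four distinct codes overall.
print(jaccard(patient_1, patient_2))  # 0.5
```

At scale, the expensive part is computing this score across many pairs of records, which is the kind of work a distributed engine is suited to.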

Limitations

Spark is great, but it is not perfect. It is still relatively new, so you should expect some challenges: its API limitations can be opaque, it is somewhat harder to debug and troubleshoot, and its configuration is fairly complex.


Topics: Hadoop
