Understand why Spark has experienced such wide adoption and learn about some Spark use cases today. Take a technical deep dive into the architecture, the vision for the Hadoop ecosystem, and why Spark is the successor to MapReduce for Hadoop data processing.
Easy and Fast Big Data
Spark not only makes big data processing fast, it is also easy to develop against. It offers rich APIs in Java, Scala, and Python, and it is very easy to open the interactive shell, start typing commands, and do real work right there. The end result is that you write significantly less code, often 2-5x less. Spark is also fast to run because it leverages memory very effectively: its general execution graphs and in-memory storage make it up to 10x faster than MapReduce on disk and up to 100x faster in memory.
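To see why chained transformations lead to so little code, here is word count written in the Spark style. Note that this is a plain-Python toy, not the real pyspark API: the `ToyRDD` class below is a hypothetical stand-in that mimics the `flatMap`/`map`/`reduceByKey` chaining you would use in a real Spark shell.

```python
# ToyRDD: a minimal, local stand-in for Spark's RDD chaining style.
# It exists only to show how few lines word count takes in this style.
class ToyRDD:
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # Apply f to each element and flatten the results.
        return ToyRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Merge values for each key with the combining function f.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())

    def collect(self):
        return self.data

lines = ToyRDD(["to be or", "not to be"])
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
# counts holds the pairs ("to", 2), ("be", 2), ("or", 1), ("not", 1)
```

In actual Spark the four chained lines at the bottom are essentially all you would write; the equivalent MapReduce job needs separate mapper and reducer classes plus driver boilerplate, which is where the 2-5x code reduction comes from.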
Take Advantage of Memory
Spark introduces the concept of resilient distributed datasets (RDDs) and provides a very easy way to keep data in memory. Data is distributed across a cluster, and you don't have to worry about machine crashes: the cache is managed for you by the system and is stored in a distributed, fault-tolerant way, spilling to disk when a dataset does not fit in memory. Fault tolerance itself comes from the concept of lineage, which records how each dataset was derived so that lost data can be recomputed rather than replicated.
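The lineage idea can be sketched in a few lines. This is a toy illustration in plain Python, not the real Spark API: the hypothetical `LineageRDD` class below just shows that when each dataset remembers its parent and the transformation that produced it, a lost in-memory copy can be rebuilt by recomputation.

```python
# Toy sketch of lineage-based fault tolerance (not the real Spark API):
# each dataset records how it was derived, so a lost copy can be
# recomputed from its parent instead of being restored from a replica.
class LineageRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # materialized values, if cached or a source
        self.parent = parent   # lineage: the dataset this one came from
        self.fn = fn           # lineage: the transformation applied to it

    def map(self, fn):
        # A transformation records lineage; nothing is computed yet.
        return LineageRDD(parent=self, fn=fn)

    def compute(self):
        # Use the in-memory copy if present; otherwise recompute from lineage.
        if self._data is not None:
            return self._data
        return [self.fn(x) for x in self.parent.compute()]

    def cache(self):
        self._data = self.compute()   # pin the result in memory
        return self

source = LineageRDD(data=[1, 2, 3])
doubled = source.map(lambda x: 2 * x).cache()
doubled._data = None            # simulate the cached copy being lost
recovered = doubled.compute()   # rebuilt from lineage: [2, 4, 6]
```

Real Spark applies the same principle per partition and per transformation, which is why it can survive machine crashes without keeping redundant copies of intermediate data in memory.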
Out of the Box Functionality
Spark is also well integrated with Hadoop and supports all the standard Hadoop data formats. It runs very well under YARN in mixed clusters and behaves properly in environments where Spark shares the cluster with other workloads.
Spark's language support continues to grow, with SparkR and Java 8 APIs, schema support in Spark's APIs, and SQL support in Spark Streaming. New libraries are also being added to the project, including MLlib, GraphX, Spark Streaming, and Spark SQL.
Customer Use Cases
Spark is used across a number of sectors to improve their processes, including financial services, genomics, data services, and healthcare. In financial services, Spark is used for portfolio risk analysis, ETL pipeline speedup, and analyzing 20 years of stock data, replacing home-grown applications. In genomics, two use cases identify disease-causing genes across the full human genome, replacing a MySQL-based engine. In data services, Spark performs trend analysis using statistical methods on large data sets, document classification (LDA), and fraud analytics, replacing Netezza in some cases and enabling net-new workloads in others. Finally, in healthcare, Spark is used to calculate Jaccard similarity scores on healthcare data sets, a net-new workload.
Spark is great, but it is certainly not perfect and comes with certain limitations. Spark is relatively new, and it's important to understand that there are going to be some challenges: parts of the API are opaque, it is a bit more difficult to debug and troubleshoot, and its configuration is fairly complex.