Geek Logbook

Tech sea log book

Matei Zaharia – Spark: The Definitive Guide – Life Cycle of a Spark Application

The Life Cycle of a Spark Application (Inside Spark)

The SparkSession

The first step of any Spark Application is creating a SparkSession. In many interactive modes, this is done for you, but in an application, you must do it manually.  (Chambers, 2017, 259)

Some of your legacy code might use the new SparkContext pattern. This should be avoided in favor of the builder method on the SparkSession. (Chambers, 2017, 260)
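For reference, a minimal PySpark sketch of the builder pattern (the application name and config value are illustrative, not from the book):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession via the builder method rather than new SparkContext.
spark = (
    SparkSession.builder
        .appName("definitive-guide-notes")                # hypothetical app name
        .config("spark.sql.shuffle.partitions", "50")     # example config, tune for your cluster
        .getOrCreate()
)

# The underlying SparkContext is still reachable when low-level APIs are needed.
sc = spark.sparkContext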

Logical Instructions

As you saw in the beginning of the book, Spark code essentially consists of transformations and actions. How you build these is up to you—whether it’s through SQL, low-level RDD manipulation, or machine learning algorithms. Understanding how we take declarative instructions like DataFrames and convert them into physical execution plans is an important step to understanding how Spark runs on a cluster. In this section, be sure to run this in a fresh environment (a new Spark shell) to follow along with the job, stage, and task numbers. (Chambers, 2017, 261)
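To see that conversion yourself, here is a small sketch (assuming the SparkSession created above) that prints the plans Spark builds from declarative DataFrame code:

# explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan.
df = spark.range(0, 1000000)
counts = df.selectExpr("id % 10 AS bucket").groupBy("bucket").count()
counts.explain(True)   # nothing has executed yet; only the plans are shown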

A Spark Job

In general, there should be one Spark job for one action. Actions always return results. Each job breaks down into a series of stages. (Chambers, 2017, 262)
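A quick sketch of that one-action-one-job rule (the DataFrame is illustrative): each action below should appear as a separate job in the Spark UI, while the transformation alone launches nothing.

df = spark.range(0, 100000).selectExpr("id * 2 AS doubled")   # transformation only, no job yet

df.count()     # first action -> first job
df.collect()   # second action -> a second, separate job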

Stages

Stages in Spark represent groups of tasks that can be executed together to compute the same operation on multiple machines. In general, Spark will try to pack as much work as possible (i.e., as many transformations as possible inside your job) into the same stage, but the engine starts new stages after operations called shuffles. A shuffle represents a physical repartitioning of the data—for example, sorting a DataFrame, or grouping data that was loaded from a file by key (which requires sending records with the same key to the same node). This type of repartitioning requires coordinating across executors to move data around. Spark starts a new stage after each shuffle, and keeps track of what order the stages must run in to compute the final result. (Chambers, 2017, 263)
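As a sketch of where a stage boundary appears (illustrative code, not from the book), a groupBy forces a shuffle, so Spark splits the work into a stage before the exchange and a stage after it:

df = spark.range(0, 1000000)
grouped = (
    df.selectExpr("id % 100 AS key")   # narrow transformation: stays in the first stage
      .groupBy("key")                  # wide transformation: requires a shuffle by key
      .count()
)
grouped.collect()   # running the job makes the stage boundary visible in the Spark UI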

Tasks

Stages in Spark consist of tasks. Each task corresponds to a combination of blocks of data and a set of transformations that will run on a single executor. If there is one big partition in our dataset, we will have one task. If there are 1,000 little partitions, we will have 1,000 tasks that can be executed in parallel. (Chambers, 2017, 264)
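A short sketch of that partition-to-task relationship (the numbers are illustrative):

df = spark.range(0, 1000000)
print(df.rdd.getNumPartitions())          # roughly the number of tasks in the scan stage

repartitioned = df.repartition(1000)      # 1,000 partitions -> 1,000 tasks that can run in parallel
print(repartitioned.rdd.getNumPartitions())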

Writing Spark Applications

Spark Applications are the combination of two things: a Spark cluster and your code. (Chambers, 2017, 266)

What to Monitor

After that brief tour of the monitoring landscape, let’s discuss how we can go about monitoring and debugging our Spark Applications. There are two main things you will want to monitor: the processes running your application (at the level of CPU usage, memory usage, etc.), and the query execution inside it (e.g., jobs and tasks). (Chambers, 2017, 295)

Tuning

There are a variety of different parts of Spark jobs that you might want to optimize, and it’s valuable to be specific. Following are some of the areas:

  • Code-level design choices (e.g., RDDs versus DataFrames)
  • Data at rest
  • Joins
  • Aggregations
  • Data in flight
  • Individual application properties
  • Inside of the Java Virtual Machine (JVM) of an executor
  • Worker nodes
  • Cluster and deployment properties

This list is by no means exhaustive, but it does at least ground the conversation and the topics that we cover in this chapter. (Chambers, 2017, 316)

DataFrames versus SQL versus Datasets versus RDDs 

This question also comes up frequently. The answer is simple. Across all languages, DataFrames, Datasets, and SQL are equivalent in speed. This means that if you’re using DataFrames in any of these languages, performance is equal. However, if you’re going to be defining UDFs, you’ll take a performance hit writing those in Python or R, and to some extent a lesser performance hit in Java and Scala. If you want to optimize for pure performance, it would behoove you to try and get back to DataFrames and SQL as quickly as possible. Although all DataFrame, SQL, and Dataset code compiles down to RDDs, Spark’s optimization engine will write “better” RDD code than you can manually and certainly do it with orders of magnitude less effort. Additionally, you will lose out on new optimizations that are added to Spark’s SQL engine every release.

Lastly, if you want to use RDDs, we definitely recommend using Scala or Java. If that’s not possible, we recommend that you restrict the “surface area” of RDDs in your application to the bare minimum. That’s because when Python runs RDD code, it serializes a lot of data to and from the Python process. This is very expensive to run over very big data and can also decrease stability.

Although it isn’t exactly relevant to performance tuning, it’s important to note that there are also some gaps in what functionality is supported in each of Spark’s languages. (Chambers, 2017, 318)
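A small sketch of the equivalence claim above: the same query written with the DataFrame API and with SQL should compile down to the same optimized plan (the view name is illustrative):

df = spark.range(0, 1000)
df.createOrReplaceTempView("nums")   # hypothetical temp view name

df_version = df.where("id % 2 = 0").selectExpr("sum(id)")
sql_version = spark.sql("SELECT sum(id) FROM nums WHERE id % 2 = 0")

df_version.explain()    # compare the physical plans: they should match
sql_version.explain()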

User-Defined Functions (UDFs)

In general, avoiding UDFs is a good optimization opportunity. UDFs are expensive because they force representing data as objects in the JVM and sometimes do this multiple times per record in a query.

You should try to use the Structured APIs as much as possible to perform your manipulations simply because they are going to perform the transformations in a much more efficient manner than you can do in a high-level language.  (Chambers, 2017, 325)
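To make the cost concrete, here is a hedged Python sketch contrasting a UDF with a built-in function (column names are illustrative): the UDF ships every row out to a Python worker and back, while the built-in stays inside the JVM and benefits from the optimizer.

from pyspark.sql.functions import col, pow, udf
from pyspark.sql.types import DoubleType

df = spark.range(0, 1000000)

cube_udf = udf(lambda x: float(x) ** 3, DoubleType())    # Python UDF: rows are serialized per record
slow = df.select(cube_udf(col("id")).alias("cubed"))

fast = df.select(pow(col("id"), 3).alias("cubed"))       # built-in function: runs entirely in the JVM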

Joins

Joins are a common area for optimization. The biggest weapon you have when it comes to optimizing joins is simply educating yourself about what each join does and how it’s performed. This will help you the most. Additionally, equi-joins are the easiest for Spark to optimize at this point and therefore should be preferred wherever possible. (Chambers, 2017, 328)
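As an illustration (hypothetical DataFrames and column names), an equi-join is simply a join on an equality condition, and broadcasting a small side is a common way to avoid shuffling the large side:

from pyspark.sql.functions import broadcast

orders = spark.createDataFrame([(1, 100), (2, 200)], ["customer_id", "amount"])
customers = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["customer_id", "name"])

joined = orders.join(customers, on="customer_id", how="inner")        # equi-join: equality on customer_id

# If one side is small enough, hint a broadcast join to skip shuffling the large side.
joined_broadcast = orders.join(broadcast(customers), on="customer_id")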

equi join vs inner join

There are many different ways to optimize the performance of your Spark Applications and make them run faster and at a lower cost. In general, the main things you’ll want to prioritize are (1) reading as little data as possible through partitioning and efficient binary formats, (2) making sure there is sufficient parallelism and no data skew on the cluster using partitioning, and (3) using high-level APIs such as the Structured APIs as much as possible to take advantage of already optimized code. (Chambers, 2017, 329)
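A brief sketch of points (1) and (3): write an efficient binary format partitioned by a column, then read back only the slice you need so partition pruning keeps the scan small (the path and column names are illustrative):

df = spark.createDataFrame(
    [("2017-01-01", "US", 10), ("2017-01-02", "DE", 7)],
    ["date", "country", "clicks"],
)

df.write.mode("overwrite").partitionBy("country").parquet("/tmp/clicks")

us_only = spark.read.parquet("/tmp/clicks").where("country = 'US'")   # only country=US directories are scanned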

Stream Processing Use Cases – (Chambers, 2017, 333)

  1. Notifications and alerting
  2. Real-time reporting
  3. Incremental ETL
  4. Update data to serve in real time
  5. Real-time decision making
  6. Online machine learning

Pandas Integration

One of the powers of PySpark is its ability to work across programming models. For instance, a common pattern is to perform very large-scale ETL work with Spark, collect the (single-machine-sized) result to the driver, and then leverage Pandas to manipulate it further. This allows you to use a best-in-class tool for the best task at hand: Spark for big data and Pandas for small data. (Chambers, 2017, 522)
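A minimal sketch of that pattern, assuming the aggregated result comfortably fits in driver memory:

summary = (
    spark.range(0, 1000000)
         .selectExpr("id % 7 AS bucket")
         .groupBy("bucket")
         .count()
)

pdf = summary.toPandas()   # collects the small result to the driver as a pandas DataFrame
print(pdf.describe())      # from here on, plain Pandas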
