
Matei Zaharia – Spark: The Definitive Guide. Common Operations

Define Schemas manually

When using Spark for production Extract, Transform, and Load (ETL), it is often a good idea to define your schemas manually, especially when working with untyped data sources like CSV and JSON, because schema inference can vary depending on the type of data that you read in. (Chambers, 2017, 66)
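A minimal Scala sketch of what a manually defined schema looks like; the column names and file path are placeholders rather than the book's exact example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("schema-example").getOrCreate()

// Define the schema up front instead of letting Spark infer it.
val manualSchema = StructType(Array(
  StructField("DEST_COUNTRY_NAME", StringType, nullable = true),
  StructField("ORIGIN_COUNTRY_NAME", StringType, nullable = true),
  StructField("count", LongType, nullable = false)
))

val df = spark.read
  .format("json")
  .schema(manualSchema)                               // no inference, no surprises
  .load("/data/flight-data/json/2015-summary.json")   // placeholder path
```

The later sketches in this post assume the same SparkSession named spark is already in scope.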

SQL Expressions Vs Spark Scala

This is an extremely important point to reinforce. Notice how the previous expression is actually valid SQL code, as well, just like you might put in a SELECT statement? That’s because this SQL expression and the previous DataFrame code compile to the same underlying logical tree prior to execution. This means that you can write your expressions as DataFrame code or as SQL expressions and get the exact same performance characteristics. This is discussed in Chapter 4. (Chambers, 2017, 71)
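A small sketch of that equivalence, assuming a hypothetical DataFrame df with numeric columns someCol and otherCol; comparing the two explain outputs shows the same logical plan:

```scala
import org.apache.spark.sql.functions.col

df.createOrReplaceTempView("dfTable")

// DataFrame code...
df.select(((col("someCol") + 5) * 200) - 6 < col("otherCol")).explain(true)

// ...and the same expression written as SQL; both print the same logical plan.
spark.sql("SELECT (((someCol + 5) * 200) - 6) < otherCol FROM dfTable").explain(true)
```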

Select Vs SQL Expr

select and selectExpr allow you to do the DataFrame equivalent of SQL queries on a table of data. (Chambers, 2017, 74) Because select followed by a series of expr is such a common pattern, Spark has a shorthand for doing this efficiently: selectExpr.
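A short sketch of the pattern, assuming a DataFrame df with the flight-data columns DEST_COUNTRY_NAME and ORIGIN_COUNTRY_NAME:

```scala
import org.apache.spark.sql.functions.expr

// select with explicit expr calls...
df.select(expr("DEST_COUNTRY_NAME"), expr("ORIGIN_COUNTRY_NAME")).show(2)

// ...and the selectExpr shorthand, which accepts any non-aggregating SQL expression:
df.selectExpr(
  "*",
  "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry"
).show(2)
```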

Differences Explode Vs Split
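The heading is easiest to unpack with a sketch: split turns a string column into an array column without changing the row count, while explode produces one output row per array element. df and its Description column are hypothetical:

```scala
import org.apache.spark.sql.functions.{col, explode, split}

// split: one input row -> one output row holding an array
val withArray = df.select(split(col("Description"), " ").alias("words"))

// explode: one input row -> one output row per element of the array
withArray.select(explode(col("words")).alias("word")).show()
```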

Maps

In addition to working with any type of values, Spark also allows us to create the following grouping types:

  • The simplest grouping is to just summarize a complete DataFrame by performing an aggregation in a select statement.
  • A “group by” allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns.
  • A “window” gives you the ability to specify one or more keys as well as one or more aggregation functions to transform the value columns. However, the rows input to the function are somehow related to the current row.
  • A “grouping set,” which you can use to aggregate at multiple different levels. Grouping sets are available as a primitive in SQL and via rollups and cubes in DataFrames.
  • A “rollup” makes it possible for you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized hierarchically.
  • A “cube” allows you to specify one or more keys as well as one or more aggregation functions to transform the value columns, which will be summarized across all combinations of columns. (Chambers, 2017, 124)
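A compact sketch of the most common of these groupings, using a hypothetical retail-style DataFrame df with columns Country, InvoiceNo and Quantity:

```scala
import org.apache.spark.sql.functions.{count, sum}

// Simplest grouping: aggregate the whole DataFrame inside a select
df.select(count("Quantity")).show()

// group by: keys plus aggregation functions
df.groupBy("InvoiceNo").agg(sum("Quantity").alias("total_qty")).show()

// rollup: hierarchical subtotals -> (Country, InvoiceNo), (Country), grand total
df.rollup("Country", "InvoiceNo").agg(sum("Quantity")).show()

// cube: subtotals for every combination of the keys
df.cube("Country", "InvoiceNo").agg(sum("Quantity")).show()
```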

  • countDistinct
  • sumDistinct
  • collect_set
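A quick sketch of these three, again on a hypothetical retail DataFrame df with made-up column names (newer Spark releases also expose the first two as count_distinct and sum_distinct):

```scala
import org.apache.spark.sql.functions.{collect_set, countDistinct, sumDistinct}

df.select(
  countDistinct("StockCode").alias("distinct_codes"),  // how many distinct values
  sumDistinct("Quantity").alias("distinct_sum")        // sum over distinct values only
).show()

// collect_set gathers the distinct values of each group into an array column
df.groupBy("Country").agg(collect_set("StockCode").alias("codes")).show()
```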

User-defined aggregation functions (UDAFs) are a way for users to define their own aggregation functions based on custom formulae or business rules. (Chambers, 2017, 142)
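A minimal sketch of the Spark 2.x UserDefinedAggregateFunction API, here a boolean AND aggregation; Spark 3 favors a typed Aggregator wrapped with functions.udaf, but the shape is the same:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Aggregates a boolean column with logical AND.
class BoolAnd extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", BooleanType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("result", BooleanType) :: Nil)
  def dataType: DataType = BooleanType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = true }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getAs[Boolean](0) && input.getAs[Boolean](0)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Boolean](0) && buffer2.getAs[Boolean](0)
  }
  def evaluate(buffer: Row): Any = buffer.getAs[Boolean](0)
}

// Register it so it can also be used from selectExpr or SQL:
spark.udf.register("booland", new BoolAnd)
```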

Join On complex types – JOIN WITH CONTAINS
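There is no prose under this heading, so here is a hedged sketch of the idea: join two hypothetical DataFrames, person (with an array column spark_status) and sparkStatus (with a scalar column id), on an “array contains value” condition expressed with array_contains:

```scala
import org.apache.spark.sql.functions.expr

person
  .withColumnRenamed("id", "personId")   // avoid an ambiguous column name after the join
  .join(sparkStatus, expr("array_contains(spark_status, id)"))
  .show()
```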

After we have a DataFrame reader, we specify several values:

  • The format
  • The schema
  • The read mode
  • A series of options. (Chambers, 2017, 161)
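Put together, the read pattern looks roughly like this; myManualSchema, the options and the path are placeholders:

```scala
val df = spark.read
  .format("csv")
  .schema(myManualSchema)       // a StructType defined elsewhere (placeholder)
  .option("mode", "FAILFAST")   // the read mode
  .option("header", "true")     // format-specific options
  .load("/data/some-file.csv")  // placeholder path
```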

Read Modes
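The read mode controls what Spark does with malformed records; a brief sketch, with a placeholder path:

```scala
// permissive    -> null out bad fields and keep the raw record in _corrupt_record (default)
// dropMalformed -> silently drop rows that cannot be parsed
// failFast      -> fail immediately on the first malformed record
val strict = spark.read
  .format("json")
  .option("mode", "failFast")
  .load("/data/some-file.json")   // placeholder path
```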

Save Modes
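Save modes decide what happens when data already exists at the destination; a brief sketch, with a placeholder path:

```scala
import org.apache.spark.sql.SaveMode

// append, overwrite, errorIfExists (default), ignore
df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)      // or the string form: .mode("overwrite")
  .save("/tmp/output-parquet")   // placeholder path
```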

Parquet is a file format that works exceptionally well with Apache Spark and is in fact the default file format. We recommend writing data out to Parquet for long-term storage because reading from a Parquet file will always be more efficient than JSON or CSV. Another advantage of Parquet is that it supports complex types. This means that if your column is an array (which would fail with a CSV file, for example), map, or struct, you’ll still be able to read and write that file without issue. (Chambers, 2017, 170)
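A small round-trip sketch of that last point, with a hypothetical df and placeholder paths: an array column survives Parquet untouched, whereas a CSV writer would reject it:

```scala
import org.apache.spark.sql.functions.{col, split}

val withComplex = df.withColumn("words", split(col("Description"), " "))

withComplex.write.mode("overwrite").parquet("/tmp/complex-types.parquet")

val roundTripped = spark.read.parquet("/tmp/complex-types.parquet")
roundTripped.printSchema()   // words: array<string>
```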

mergeSchema in parquet readers
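With mergeSchema, the Parquet reader reconciles the schemas of all part files instead of trusting a single file's footer; the path here is a placeholder:

```scala
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/data/parquet-table")   // placeholder path
```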

Bucketing
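Bucketing pre-partitions data by a key into a fixed number of buckets at write time, so later reads that group or join on that key can skip the shuffle. Bucketed output has to go through saveAsTable; the column, bucket count and table name below are made up:

```scala
df.write
  .format("parquet")
  .mode("overwrite")
  .bucketBy(10, "customerId")        // hypothetical key column
  .saveAsTable("bucketed_customers") // hypothetical table name
```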

Managing File Size – Not So Big, Not So Small
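Two levers worth knowing here, sketched with arbitrary numbers and paths: maxRecordsPerFile (a DataFrameWriter option since Spark 2.2) caps file size from above, and coalesce reduces the number of tiny files from below:

```scala
// Too big: cap how many records land in each output file
df.write
  .option("maxRecordsPerFile", "5000")
  .mode("overwrite")
  .parquet("/tmp/right-sized-output")

// Too small: shrink the partition count before writing
df.coalesce(8).write.mode("overwrite").parquet("/tmp/fewer-files")
```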

Datasets are the foundational type of the Structured APIs. We already worked with DataFrames, which are Datasets of type Row, and are available across Spark’s different languages. Datasets are a strictly Java Virtual Machine (JVM) language feature that work only with Scala and Java. Using Datasets, you can define the object that each row in your Dataset will consist of. In Scala, this will be a case class object that essentially defines a schema that you can use, and in Java, you will define a Java Bean. (Chambers, 2017, 204)
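A minimal Scala sketch; the case class fields mirror the flight-data columns and the path is a placeholder:

```scala
case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: Long)

import spark.implicits._   // supplies the Encoder for the case class

val flights = spark.read
  .parquet("/data/flight-data/parquet/2010-summary.parquet")   // placeholder path
  .as[Flight]

// Typed operations now take ordinary Scala functions over Flight objects:
flights.filter(f => f.ORIGIN_COUNTRY_NAME == "Canada").take(3)
```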

When to Use Datasets

You might ponder, if I am going to pay a performance penalty when I use Datasets, why should I use them at all? If we had to condense this down into a canonical list, here are a couple of reasons:

  • When the operation(s) you would like to perform cannot be expressed using DataFrame manipulations
  • When you want or need type-safety, and you’re willing to accept the cost of performance to achieve it (Chambers, 2017, 204)

What Are the Low-Level APIs?

There are two sets of low-level APIs: there is one for manipulating distributed data (RDDs), and another for distributing and manipulating distributed shared variables (broadcast variables and accumulators).
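A toy sketch of the two shared variable types, with made-up values: the broadcast ships a read-only lookup map to every executor once, while the accumulator is updated on the executors and read back on the driver:

```scala
val lookup  = spark.sparkContext.broadcast(Map("a" -> 1, "b" -> 2))
val matches = spark.sparkContext.longAccumulator("matches")

spark.sparkContext.parallelize(Seq("a", "b", "c")).foreach { key =>
  if (lookup.value.contains(key)) matches.add(1)
}

println(matches.value)   // 2
```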

When to Use the Low-Level APIs?

You should use the lower-level APIs in three situations:

  • You need some functionality that you cannot find in the higher-level APIs; for example, if you need very tight control over physical data placement across the cluster.
  • You need to maintain some legacy codebase written using RDDs.
  • You need to do some custom shared variable manipulation. We will discuss shared variables more in Chapter 14. (Chambers, 2017, 213)

In short, an RDD represents an immutable, partitioned collection of records that can be operated on in parallel. Unlike DataFrames though, where each record is a structured row containing fields with a known schema, in RDDs the records are just Java, Scala, or Python objects of the programmer’s choosing.

RDDs give you complete control because every record in an RDD is just a Java or Python object. You can store anything you want in these objects, in any format you want. This gives you great power, but not without potential issues. Every manipulation and interaction between values must be defined by hand, meaning that you must “reinvent the wheel” for whatever task you are trying to carry out. (Chambers, 2017, 214)
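A tiny illustration of that point: the records below are plain Scala strings, and any structure (here, a tuple acting as a hand-rolled schema) has to be built by hand:

```scala
val words = spark.sparkContext.parallelize(
  "Spark The Definitive Guide".split(" "), numSlices = 2)

val withLength = words.map(w => (w, w.length))   // structure defined by hand
withLength.collect().foreach(println)
```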

When to use RDDs?

In general, you should not manually create RDDs unless you have a very, very specific reason for doing so. They are a much lower-level API that provides a lot of power but also lacks a lot of the optimizations that are available in the Structured APIs. For the vast majority of use cases, DataFrames will be more efficient, more stable, and more expressive than RDDs. The most likely reason for why you’ll want to use RDDs is because you need fine-grained control over the physical distribution of data (custom partitioning of data). (Chambers, 2017, 216)
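For the custom-partitioning case, a sketch with an arbitrary key-value RDD and partition count:

```scala
import org.apache.spark.HashPartitioner

val pairs  = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val spread = pairs.partitionBy(new HashPartitioner(4))   // explicit physical placement

println(spread.getNumPartitions)   // 4
```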
