Tuesday, July 2, 2024

Simplify PySpark testing with DataFrame equality functions

The DataFrame equality test functions were introduced in Apache Spark™ 3.5 and Databricks Runtime 14.2 to simplify PySpark unit testing. The full set of functionality described in this blog post will be available starting with the upcoming Apache Spark 4.0 and Databricks Runtime 14.3.

Write more confident DataFrame transformations with DataFrame equality test functions

Working with data in PySpark involves applying transformations, aggregations, and manipulations to DataFrames. As transformations accumulate, how can you be confident that your code works as expected? PySpark equality test utility functions provide an efficient and effective way to check your data against expected results, helping you identify unexpected differences and catch errors early in the analysis process. What's more, they return intuitive information that pinpoints exactly where the differences are, so you can take action immediately without spending a lot of time debugging.

Using DataFrame equality test functions

Two equality test functions for PySpark DataFrames were introduced in Apache Spark 3.5: assertDataFrameEqual and assertSchemaEqual. Let's take a look at how to use each of them.

assertDataFrameEqual: This function allows you to compare two PySpark DataFrames for equality with a single line of code, checking whether the data and schemas match. It returns descriptive information when there are differences.

Let's walk through an example. First, we'll create two DataFrames, deliberately introducing a difference in the first row:

df_expected = spark.createDataFrame(
    data=[("Alfred", 1500), ("Alfred", 2500), ("Anna", 500), ("Anna", 3000)],
    schema=["name", "amount"])

df_actual = spark.createDataFrame(
    data=[("Alfred", 1200), ("Alfred", 2500), ("Anna", 500), ("Anna", 3000)],
    schema=["name", "amount"])

Then we'll call assertDataFrameEqual with the two DataFrames:

from pyspark.testing import assertDataFrameEqual

assertDataFrameEqual(df_actual, df_expected)

The function returns a descriptive message indicating that the first row in the two DataFrames is different. In this example, the amounts listed for Alfred in this row are not the same (expected: 1500, actual: 1200):

[Screenshot: assertDataFrameEqual error output highlighting the differing row in the actual and expected DataFrames]

With this information, you immediately know what the problem is with the DataFrame your code generated, and you can target your debugging based on that.

The function also has several options for controlling the strictness of the DataFrame comparison, so you can adjust it to fit your specific use case.
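
For example, the checkRowOrder flag and the rtol and atol tolerance parameters (documented options of assertDataFrameEqual) let you ignore row ordering and tolerate small floating-point differences. A minimal sketch:

from pyspark.testing import assertDataFrameEqual

df_expected = spark.createDataFrame(
    data=[("Alfred", 1500.0), ("Anna", 500.0)], schema=["name", "amount"])
# Same rows in a different order, with a tiny floating-point deviation
df_actual = spark.createDataFrame(
    data=[("Anna", 500.0), ("Alfred", 1500.0000001)], schema=["name", "amount"])

# checkRowOrder=False (the default) ignores row ordering;
# rtol sets the relative tolerance for comparing float values
assertDataFrameEqual(df_actual, df_expected, checkRowOrder=False, rtol=1e-5)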

assertSchemaEqual: This function compares only the schemas of two DataFrames; it does not compare row data. It lets you validate whether the column names, data types, and nullable properties are the same for two different DataFrames.

Let's look at an example. First, we'll create two DataFrames with different schemas:

schema_actual = "name STRING, amount DOUBLE"

data_expected = [["Alfred", 1500], ["Alfred", 2500], ["Anna", 500], ["Anna", 3000]]
data_actual = [["Alfred", 1500.0], ["Alfred", 2500.0], ["Anna", 500.0], ["Anna", 3000.0]]

df_expected = spark.createDataFrame(data=data_expected)
df_actual = spark.createDataFrame(data=data_actual, schema=schema_actual)

Now, let's call assertSchemaEqual with these two DataFrame schemas:

from pyspark.testing import assertSchemaEqual

assertSchemaEqual(df_actual.schema, df_expected.schema)

The function determines that the schemas of the two DataFrames are different, and the output indicates where they diverge:

[Screenshot: assertSchemaEqual error output pinpointing where the two schemas diverge]

In this example, there are two differences: the data type of the amount column is DOUBLE in the actual DataFrame but LONG in the expected DataFrame (Spark inferred LONG from the integer values), and because we created the expected DataFrame without specifying a schema, the column names are also different.

Both of these differences are highlighted in the function output, as illustrated above.

assertPandasOnSparkEqual is not covered in this blog post because it is deprecated as of Apache Spark 3.5.1 and scheduled for removal in the upcoming Apache Spark 4.0.0. For testing the Pandas API on Spark, see the Pandas API on Spark equality test functions section below.

Structured output for debugging differences in PySpark DataFrames

While the assertDataFrameEqual and assertSchemaEqual functions are aimed primarily at unit testing, where you typically use smaller datasets to test your PySpark functions, you might use them with DataFrames that have more than just a few rows and columns. In such scenarios, you can easily retrieve the row data for the rows that differ, making further debugging easier.

Let's take a look at how to do that. We'll use the same data we used earlier to create two DataFrames:

df_expected = spark.createDataFrame(
    data=[("Alfred", 1500), ("Alfred", 2500), ("Anna", 500), ("Anna", 3000)],
    schema=["name", "amount"])
df_actual = spark.createDataFrame(
    data=[("Alfred", 1200), ("Alfred", 2500), ("Anna", 500), ("Anna", 3000)],
    schema=["name", "amount"])

And now we'll grab the data that differs between the two DataFrames from the assertion error object after calling assertDataFrameEqual:

from pyspark.testing import assertDataFrameEqual
from pyspark.errors import PySparkAssertionError

try:
    assertDataFrameEqual(df_actual, df_expected, includeDiffRows=True)
except PySparkAssertionError as e:
    # `e.data` here looks like:
    # [(Row(name='Alfred', amount=1200), Row(name='Alfred', amount=1500))]
    spark.createDataFrame(e.data, schema=["Actual", "Expected"]).show()

Creating a DataFrame from the rows that differ and displaying it, as we've done in this example, illustrates how easy it is to access this information:

[Screenshot: the displayed DataFrame of actual vs. expected rows that differ]

As you can see, information about the rows that differ is immediately available for further analysis. You no longer have to write code to extract this information from the actual and expected DataFrames for debugging purposes.

This feature will be available starting with the upcoming Apache Spark 4.0 and DBR 14.3.

Pandas API on Spark equality test functions

In addition to the functions for testing the equality of PySpark DataFrames, Pandas API on Spark users will have access to the following DataFrame equality test functions:

  • assert_frame_equal
  • assert_series_equal
  • assert_index_equal

These functions provide options for controlling the strictness of comparisons and are great for unit testing your Pandas API on Spark DataFrames. They provide exactly the same API as the pandas test utility functions, so you can use them without changing existing pandas test code that you want to run using the Pandas API on Spark.

Here are a couple of examples demonstrating the use of assert_frame_equal with different parameters, comparing Pandas API on Spark DataFrames:

from pyspark.pandas.testing import assert_frame_equal
import pyspark.pandas as ps

# Create two slightly different Pandas API on Spark DataFrames
df1 = ps.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
df2 = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # 'b' column as integers

# Validate DataFrame equality with strict data type checking
assert_frame_equal(df1, df2, check_dtype=True)

In this example, the schemas of the two DataFrames are different. The function output lists the differences, as shown here:

[Screenshot: assert_frame_equal error output listing the dtype differences]

We can specify that we want the function to compare column data even when the columns don't have the same data type by using the check_dtype argument, as in this example:

# DataFrames are equal with check_dtype=False
assert_frame_equal(df1, df2, check_dtype=False)

Since we specified that assert_frame_equal should ignore column data types, it now considers the two DataFrames equal.
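
Because the API matches pandas' test utilities, the other strictness controls from pandas.testing.assert_frame_equal, such as check_exact, rtol, and atol, should carry over as well. A small sketch under that assumption:

# Two frames whose float values differ only within the tolerance
df3 = ps.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
df4 = ps.DataFrame({"a": [1, 2, 3], "b": [4.0000001, 5.0, 6.0]})

# check_exact=False with rtol allows small floating-point deviations,
# mirroring the pandas.testing.assert_frame_equal parameters
assert_frame_equal(df3, df4, check_exact=False, rtol=1e-5)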

These functions also allow comparisons between Pandas API on Spark objects and pandas objects, facilitating compatibility checks between different DataFrame libraries, as illustrated in this example:

import pandas as pd
from pyspark.pandas.testing import assert_series_equal, assert_index_equal

# pandas DataFrame
df_pandas = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Comparing a Pandas API on Spark DataFrame with the pandas DataFrame
assert_frame_equal(df1, df_pandas)

# Comparing a Pandas API on Spark Series with the pandas Series
assert_series_equal(df1.a, df_pandas.a)

# Comparing a Pandas API on Spark Index with the pandas Index
assert_index_equal(df1.index, df_pandas.index)

Using the new PySpark DataFrame and Pandas API on Spark equality test functions is a great way to make sure your PySpark code works as expected. These functions help you not only catch errors but also understand exactly what has gone wrong, enabling you to quickly and easily identify where the problem is. Check out the Testing PySpark page for more information.

These functions will be available starting with the upcoming Apache Spark 4.0; DBR 14.2 already supports them.
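
To close, here is a minimal sketch of how assertDataFrameEqual might fit into an ordinary unittest suite; the test class and the transformation under test are illustrative, not taken from the documentation:

import unittest

from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual


class SalesTransformationTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.appName("testing-example").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_keep_large_amounts(self):
        # Hypothetical transformation under test: keep amounts above 1000
        df_input = self.spark.createDataFrame(
            data=[("Alfred", 1500), ("Anna", 500)], schema=["name", "amount"])
        df_actual = df_input.filter(df_input.amount > 1000)

        df_expected = self.spark.createDataFrame(
            data=[("Alfred", 1500)], schema=["name", "amount"])
        assertDataFrameEqual(df_actual, df_expected)


if __name__ == "__main__":
    unittest.main()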
