Friday, November 22, 2024

Parameterized queries with PySpark | Databricks Blog

PySpark has always provided wonderful SQL and Python APIs for querying data. As of Databricks Runtime 12.1 and Apache Spark 3.4, parameterized queries support safe and expressive ways to query data with SQL using Pythonic programming paradigms.

This post explains how to write parameterized queries with PySpark and when doing so is a good design pattern for your code.

Parameters are helpful for making your Spark code easier to reuse and test. They also encourage good coding practices. This post demonstrates the two different ways to parameterize PySpark queries:

  1. PySpark custom string formatting
  2. Parameter markers

Let’s look at how to use both types of PySpark parameterized queries and explore why the built-in functionality is better than the alternatives.

Benefits of parameterized queries

Parameterized queries encourage the “don’t repeat yourself” (DRY) pattern, make unit testing easier, and make SQL easier to reuse. They also prevent SQL injection attacks, which can pose security vulnerabilities.

It can be tempting to copy and paste large chunks of SQL when writing similar queries. Parameterized queries encourage abstracting shared patterns and writing code that follows DRY.

Parameterized queries are also easier to test. You can parameterize a query so it is easy to run on both production and test datasets.
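
For example, here is a minimal sketch of how a parameterized query keeps the same logic runnable against either a production table or a tiny in-memory test DataFrame, using the DataFrame-as-parameter pattern shown later in this post. The function and column names here are hypothetical, and an active SparkSession named spark is assumed:

def total_by_group(df):
    # The same SQL runs on whatever DataFrame is passed in,
    # whether it is backed by a production table or built in a test.
    return spark.sql(
        "SELECT group_col, SUM(value_col) AS total FROM {df} GROUP BY group_col",
        df=df,
    )

# In a unit test, build a small DataFrame and assert on the result.
test_df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 5.0)], ["group_col", "value_col"]
)
assert total_by_group(test_df).count() == 2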

By contrast, manually parameterizing SQL queries with Python f-strings is a poor alternative. Consider the following disadvantages:

  1. Python f-strings do not protect against SQL injection attacks.
  2. Python f-strings do not understand Python native objects such as DataFrames, columns, and special characters.

Let’s look at how to parameterize queries with parameter markers, which protect your code from SQL injection vulnerabilities and support automatic conversion of common Python and PySpark objects to their SQL representations.

Parameterized queries with PySpark custom string formatting

Suppose you have the following data table called h20_1e9 with nine columns:

+-----+-----+------------+---+---+-----+---+---+---------+
|  id1|  id2|         id3|id4|id5|  id6| v1| v2|       v3|
+-----+-----+------------+---+---+-----+---+---+---------+
|id008|id052|id0000073659| 84| 89|82005|  5| 11|64.785802|
|id079|id037|id0000041462|  4| 35|28153|  1|  1|28.732545|
|id098|id031|id0000027269| 27| 38|13508|  5|  2|59.867875|
+-----+-----+------------+---+---+-----+---+---+---------+

You would like to parameterize the following SQL query:

SELECT id1, SUM(v1) AS v1 
FROM h20_1e9 
WHERE id1 = "id089"
GROUP BY id1

You’d like to make it easy to run this query with different values of id1. Here’s how to parameterize the query and run it with different id1 values.

query = """SELECT id1, SUM(v1) AS v1
FROM h20_1e9
WHERE id1 = {id1_val}
GROUP BY id1"""

spark.sql(query, id1_val="id016").show()

+-----+------+
|  id1|    v1|
+-----+------+
|id016|298268|
+-----+------+

Now rerun the query with a different argument:

spark.sql(query, id1_val="id089").show()

+-----+------+
|  id1|    v1|
+-----+------+
|id089|300446|
+-----+------+

The PySpark string formatter also lets you execute SQL queries directly on a DataFrame without explicitly defining any temporary views.

Suppose you have the following DataFrame called person_df:

+---------+--------+
|firstname| country|
+---------+--------+
|    frank|     usa|
|   sourav|   india|
|    rahul|   india|
|      sim|bulgaria|
+---------+--------+
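
If you want to follow along, here is a minimal sketch of how such a DataFrame could be built (the construction below is an assumption and is not part of the original example):

person_df = spark.createDataFrame(
    [
        ("frank", "usa"),
        ("sourav", "india"),
        ("rahul", "india"),
        ("sim", "bulgaria"),
    ],
    ["firstname", "country"],
)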

Here’s how to query the DataFrame with SQL.

spark.sql(
    "select country, count(*) as num_ppl from {person_df} group by country",
    person_df=person_df,
).show()

+--------+-------+
| country|num_ppl|
+--------+-------+
|     usa|      1|
|   india|      2|
|bulgaria|      1|
+--------+-------+

Running queries on a DataFrame using SQL syntax, without having to manually register a temporary view, is very nice!

Let’s now see how to parameterize queries using parameter markers and an arguments dictionary.

Parameterized queries with parameter markers

You can also use a dictionary of arguments to formulate a parameterized SQL query with parameter markers.

Suppose you have the following view named some_purchases:

+-------+------+-------------+
|   item|amount|purchase_date|
+-------+------+-------------+
|  socks|  7.55|   2022-05-15|
|handbag| 49.99|   2022-05-16|
| shorts|  25.0|   2023-01-05|
+-------+------+-------------+
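
As a minimal sketch (the construction below is an assumption; only the view itself appears in the original example), such a view could be registered like this:

import datetime

purchases_df = spark.createDataFrame(
    [
        ("socks", 7.55, datetime.date(2022, 5, 15)),
        ("handbag", 49.99, datetime.date(2022, 5, 16)),
        ("shorts", 25.0, datetime.date(2023, 1, 5)),
    ],
    ["item", "amount", "purchase_date"],
)
purchases_df.createOrReplaceTempView("some_purchases")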

Here’s how to write a parameterized query with named parameter markers that calculates the total amount spent on a given item.

query = "SELECT item, sum(amount) from some_purchases group by item having item = :item"

Compute the total amount spent on socks.

spark.sql(
    query,
    args={"item": "socks"},
).show()

+-----+-----------+
| item|sum(amount)|
+-----+-----------+
|socks|      32.55|
+-----+-----------+
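
Rerunning the query for a different item only requires changing the args dictionary, for example:

spark.sql(query, args={"item": "handbag"}).show()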

You can also parameterize queries with unnamed parameter markers; see the Spark documentation for more information.
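
As a rough sketch, unnamed parameter markers use ? placeholders and take the values as a list; this assumes a Spark release that supports positional markers (for example, Spark 3.5 or later):

spark.sql(
    "SELECT item, sum(amount) FROM some_purchases GROUP BY item HAVING item = ?",
    args=["socks"],
).show()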

Apache Spark sanitizes parameter markers, so this parameterization approach also protects you from SQL injection attacks.

How PySpark sanitizes parameterized queries

Here’s a high-level description of how Spark sanitizes named parameterized queries:

  • The SQL query arrives with an optional list of key/value parameters.
  • Apache Spark parses the SQL query and replaces the parameter references with corresponding parse tree nodes.
  • During analysis, a Catalyst rule runs to replace these references with the provided parameter values.
  • This approach protects against SQL injection attacks because it only supports literal values. Regular string interpolation applies substitution on the SQL string itself, which can be vulnerable to attacks if the string contains SQL syntax beyond the intended literal values.

As previously mentioned, there are two kinds of parameterized queries supported in PySpark:

The {} syntax does string substitution on the SQL query on the client side, for ease of use and better programmability. However, it does not protect against SQL injection attacks, since the query text is substituted before being sent to the Spark server.

Parameterization via the args argument of the sql() API passes the SQL text and the parameters to the server separately. The SQL text is parsed with the parameter placeholders in place, and the values specified in args are substituted into the analyzed query tree.
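
As a quick side-by-side sketch of the two API shapes, using the h20_1e9 table from earlier:

# Client-side {} formatting: the value is substituted into the SQL text
# before the query is sent to the server.
spark.sql(
    "SELECT id1, SUM(v1) AS v1 FROM h20_1e9 WHERE id1 = {id1_val} GROUP BY id1",
    id1_val="id016",
).show()

# Server-side named parameter marker: the SQL text and the value travel
# separately, and the value is bound as a literal during analysis.
spark.sql(
    "SELECT id1, SUM(v1) AS v1 FROM h20_1e9 WHERE id1 = :id1_val GROUP BY id1",
    args={"id1_val": "id016"},
).show()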

There are two flavors of server-side parameterized queries: named parameter markers and unnamed parameter markers. Named parameter markers use the :<param_name> syntax for placeholders. See the documentation for more information on how to use unnamed parameter markers.

Parameterized queries vs. string interpolation

You can also use regular Python string interpolation to parameterize queries, but it’s not as convenient.

Here’s how we would have to parameterize a query with Python f-strings:

some_df.createOrReplaceTempView("whatever")
the_date = "2021-01-01"
min_value = "4.0"
table_name = "whatever"

query = f"""SELECT * from {table_name}
WHERE the_date > '{the_date}' AND amount > {min_value}"""
spark.sql(query).show()

This isn’t as nice, for the following reasons (a parameter-marker alternative is sketched after the list):

  • It requires creating a temporary view.
  • We have to represent the date as a string, not a Python date.
  • We have to wrap the date in single quotes in the query to format the SQL string properly.
  • It doesn’t protect against SQL injection attacks.
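
For comparison, here is a sketch of the same query written with named parameter markers. It still uses the whatever temporary view, but the date can be passed as a real Python date and no manual quoting is needed:

import datetime

query = """SELECT * FROM whatever
WHERE the_date > :the_date AND amount > :min_value"""

spark.sql(
    query,
    args={
        # A real Python date, converted to a SQL literal automatically.
        "the_date": datetime.date(2021, 1, 1),
        # A real number rather than a numeric string.
        "min_value": 4.0,
    },
).show()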

In sum, the built-in query parameterization capabilities are safer and easier to use than string interpolation.

Conclusion

PySpark parameterized queries give you new capabilities to write clean code with familiar SQL syntax. They’re convenient when you want to query a Spark DataFrame with SQL. They let you use common Python data types like floating point values, strings, dates, and datetimes, which automatically convert to SQL values under the hood. In this way, you can leverage common Python idioms and write beautiful code.

Start using PySpark parameterized queries today, and you will immediately enjoy the benefits of a higher-quality codebase.
