January 14, 2019

Learn to Test Your PySpark Project with Pytest: An Example-Based Tutorial

In this tutorial, I will explain how to get started with writing tests for your Spark project.

[Image: comic strip about unit tests]

There is no doubt that testing is a crucial step in any software development project. However, when you are just getting started, writing tests can seem time-consuming and not very pleasant. For that reason, many developers skip tests in order to go faster, which degrades the quality of the delivered app. But once tests become part of your programming habits, they stop being so mind-wrecking and you start reaping their benefits.

Part 1: Basic Example

As an example, let us take a simple function that filters a Spark data frame by the value in a specific column, age. Here is the content of the file main.py that contains the function we would like to test:
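A minimal sketch of what such a main.py could look like. The function name filter_spark_data_frame is inferred from the test name used later in this tutorial; the parameter names and the default value 20 are assumptions made for illustration.

```python
# main.py (sketch)


def filter_spark_data_frame(data_frame, column_name="age", value=20):
    """Return the rows of a Spark data frame whose column equals value.

    The parameter names and defaults are assumptions for illustration.
    """
    # data_frame[column_name] == value builds a Spark column expression;
    # filter() keeps only the rows for which that expression is true.
    return data_frame.filter(data_frame[column_name] == value)
```

Keeping the function free of any Spark session handling means a test only has to build a small data frame and pass it in.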

The basic test for this function consists of the following parts: initializing the Spark context, creating the input and expected output data frames, asserting that the actual output equals the expected one, and closing the Spark context:

The major stumbling block arises when you assert the equality of the two data frames. This is quite complicated to do with PySpark methods alone, so it is pragmatic to convert both data frames to Pandas. However, Pandas takes the order of rows and columns into account when comparing two data frames. The function pandas.testing.assert_frame_equal accepts the parameter check_like=True to ignore the order of columns, but rows are still matched by their index labels, so there is no built-in way to ignore the order of rows. Therefore, to make the two data frames comparable, we sort both with the helper function get_sorted_data_frame before comparing them.
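To make the row-order problem concrete, here is a small Pandas-only sketch. The body of get_sorted_data_frame is my assumption of what such a helper might contain (sort on the given columns, then reset the index), and the two frames are made-up sample data.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def get_sorted_data_frame(data_frame, columns_list):
    """Sort rows by the given columns and reset the index to 0..n-1."""
    return data_frame.sort_values(columns_list).reset_index(drop=True)


expected = pd.DataFrame({"name": ["Alice", "Carol"], "age": [20, 20]})
# Same rows, but in a different row order and a different column order.
actual = pd.DataFrame({"age": [20, 20], "name": ["Carol", "Alice"]})

# check_like=True copes with the swapped columns, yet rows are still matched
# by index label, so the differing row order makes the comparison fail.
try:
    assert_frame_equal(expected, actual, check_like=True)
    unsorted_frames_match = True
except AssertionError:
    unsorted_frames_match = False

# After sorting both frames on the same keys, the comparison passes.
sort_keys = ["age", "name"]
assert_frame_equal(
    get_sorted_data_frame(expected, sort_keys),
    get_sorted_data_frame(actual, sort_keys),
    check_like=True,
)
```

Resetting the index matters as much as sorting: without it, each row would keep its old index label and the frames would still be aligned on those labels.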

To launch the example, simply type pytest in your terminal at the root of the project that contains main.py and test_main.py. Make sure you have set all the necessary environment variables. To run this tutorial on a Mac, you will need to set the PYSPARK_PYTHON and JAVA_HOME environment variables.
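If you are unsure whether those variables are set, a quick check like the one below can save some debugging time. It is only a sketch: the helper name find_missing_variables is hypothetical, and it checks that the variables exist, not that their values point to a valid Python interpreter or Java installation.

```python
import os

# The two variables named in this tutorial.
REQUIRED_VARIABLES = ("PYSPARK_PYTHON", "JAVA_HOME")


def find_missing_variables(environ=None, required=REQUIRED_VARIABLES):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = find_missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```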

Part 2: Refactoring of Spark Context

The example above demonstrates the basics of writing tests. However, your real project will probably contain more than one test, and you would not want to initialize the resource-intensive Spark context over and over again. For that reason, with Pytest you can create a conftest.py that launches a single Spark session for all of your tests and closes it once they have all run. To make the session available to the tests, decorate the function that creates it with @pytest.fixture (using scope="session" so that it is created only once). Here is the content of conftest.py:

Place conftest.py at the root of your project so that Pytest picks it up for all of your tests. Afterwards, you just need to add sql_context as a parameter of your test function.

Here is what test_filter_spark_data_frame looks like after the refactoring:

I hope you enjoyed this tutorial and happy test writing with Pytest and Spark!

Thanks to Pierre Marcenac, Nicolas Jean, Raphaël Meudec, and Louis Nicolle. 
