Apache Spark is a must for Big Data lovers. In a few words, Spark is a fast and powerful framework that provides an API to perform massive distributed processing over resilient sets of data.
Jupyter Notebook is a popular application that enables you to edit, run and share Python code in a web view. It allows you to modify and re-execute parts of your code in a very flexible way. That’s why Jupyter is a great tool for testing and prototyping programs.
I wrote this article for Linux users, but I am sure Mac OS users can benefit from it too.
While using Spark, most data engineers recommend developing either in Scala (which is the “native” Spark language) or in Python through the complete PySpark API.
Python for Spark is obviously slower than Scala. However, like many developers, I love Python because it’s flexible, robust, easy to learn, and benefits from all my favorite libraries. In my opinion, Python is the perfect language for prototyping in the Big Data/Machine Learning field.
If you prefer to develop in Scala, you will find many alternatives in the following GitHub repository: alexarchambault/jupyter-scala
To learn more about the pros and cons of Python vs. Scala in a Spark context, please refer to this interesting article: Scala vs. Python for Apache Spark.
Now, let’s get started.
Before installing PySpark, you must have Python and Spark installed. I am using Python 3 in the following examples, but you can easily adapt them to Python 2. Go to the Python official website to install it. I also encourage you to set up a virtualenv.
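For example, a dedicated environment could be set up like this (the name pyspark-env is just an illustration):

$ pip install virtualenv
$ virtualenv -p python3 pyspark-env
$ source pyspark-env/bin/activate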
To install Spark, make sure you have Java 8 or higher installed on your computer. Then, visit the Spark downloads page. Select the latest Spark release, a prebuilt package for Hadoop, and download it directly.
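If you are not sure which Java version you have, you can check it from your terminal (Java 8 reports itself as 1.8):

$ java -version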
Unzip it and move it to your /opt folder:
$ tar -xzf spark-2.1.0-bin-hadoop2.7.tgz
$ mv spark-2.1.0-bin-hadoop2.7 /opt/spark-2.1.0
Create a symbolic link:
$ ln -s /opt/spark-2.1.0 /opt/spark
This way, you will be able to download and use multiple Spark versions.
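For instance, when a new release comes out, you can install it next to the old one and simply re-point the symbolic link (the version number below is illustrative):

$ ln -sfn /opt/spark-2.2.0 /opt/spark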
Finally, tell your bash (or zsh, etc.) where to find Spark. To do so, configure your $PATH variables by adding the following lines to your ~/.bashrc (or ~/.zshrc) file:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
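Then reload your configuration so the changes take effect in the current session (assuming bash; adapt the file name to your shell):

$ source ~/.bashrc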
Install Jupyter notebook:
$ pip install jupyter
You can start a regular Jupyter Notebook by typing:
$ jupyter notebook
Let’s check if PySpark is properly installed without using Jupyter Notebook first.
You may need to restart your terminal to be able to run PySpark. Run:
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.5.2 (default, Jul 2 2016 17:53:06)
SparkSession available as 'spark'.
>>>
It seems to be a good start! Run the following program, which estimates π with a Monte Carlo method. (Note that in the PySpark shell, a SparkContext is already available as sc.)
import random

num_samples = 100000000

def inside(p):
    # Draw a random point in the unit square and test whether
    # it falls inside the quarter of the unit circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
The output will probably be around 3.14, close to the expected value of π.
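If the Monte Carlo trick is new to you, here is the same estimate in plain Python without Spark, on a smaller sample (a quick sketch for illustration only):

import random

# The fraction of random points in the unit square that fall inside
# the quarter disc x*x + y*y < 1 approximates pi/4
num_samples = 1000000
count = sum(1 for _ in range(num_samples)
            if random.random()**2 + random.random()**2 < 1)
print(4 * count / num_samples)  # prints something close to 3.14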
There are two ways to get PySpark available in a Jupyter Notebook:
1. Configure the PySpark driver to use Jupyter Notebook: running pyspark will then automatically open a Jupyter Notebook.
2. Load a regular Jupyter Notebook and load PySpark using the findspark package.
The first option is quicker but specific to Jupyter Notebook; the second is a broader approach that makes PySpark available in your favorite IDE as well.
Update the PySpark driver environment variables: add these lines to your ~/.bashrc (or ~/.zshrc) file:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
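If you also want to pin the Python interpreter that Spark uses, the PYSPARK_PYTHON variable can be set in the same file (python3 here is an example):

export PYSPARK_PYTHON=python3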
Restart your terminal and launch PySpark again:

$ pyspark
Now, this command should start a Jupyter Notebook in your web browser. Create a new notebook by clicking on ‘New’ > ‘Notebooks Python [default]’.
Copy and paste our Pi calculation script and run it by pressing Shift + Enter.
Done! You are now able to run PySpark in a Jupyter Notebook :)
There is another, more general way to use PySpark in a Jupyter Notebook: use the findspark package to make a SparkContext available in your code. The findspark package is not specific to Jupyter Notebook; you can use this trick in your favorite IDE too.
To install findspark:
$ pip install findspark
Launch a regular Jupyter Notebook:
$ jupyter notebook
Create a new Python [default] notebook and write the following script:
import findspark
findspark.init()

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
The output should again be a value close to 3.14.
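One last note on findspark: if SPARK_HOME is not set in your environment, findspark.init() accepts an optional path to your Spark installation, so you can point it there explicitly:

import findspark
findspark.init('/opt/spark')  # the symbolic link created earlier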
I hope this 3-minute guide helps you get started easily with Python and Spark. If you want to go the extra mile and tackle bigger challenges, don’t miss the more evolved JupyterLab environment or PyCharm’s integration of Jupyter notebooks.