Spark is the technology everyone talks about when it comes to Big Data analytics. It is closely related to Hadoop and makes distributed computing accessible.
More than that, in my opinion, it is a wonderful opportunity to build scalable jobs and tackle problems that were until now reserved for massive computers.
And what would Spark be without Databricks (the company created by the founders of Apache Spark)? More than 30% of the current committers are directly related to the company, which provides support and visibility to Apache Spark.
So why not choose Databricks certification to prove you have the skills to work with Spark?
You can choose between the Scala and Python (aka PySpark) developer certifications. I took the Python one.
Databricks uses a service provider named Kryterion to organize its exams. You will have to register here. I chose the online proctored version, which I found more comfortable, especially knowing that the exam lasts 3 hours. The exam itself takes place in an application called Sentinel Secure that, among other things, prevents you from accessing other resources on your computer during the exam.
Note 1: “the online proctored exam is supported by standard English keyboards only”. But all questions are multiple-choice, so don’t let that unsettle you. Fun story: I took this typing-pattern thing too seriously. Since I’m in France, I use a French keyboard. And because Sentinel verifies your identity through your typing pattern, I tried to artificially alter my typing to anticipate the difficulty I would have with an English keyboard during the exam. The result? I had to authenticate 10 times on exam day. Not the best way to start.
Note 2: “A detached web camera is required for the online proctored exam”. Your built-in webcam will do the job; there is no need to buy an external one. But note that someone will watch you throughout the exam. No paper, no pen, no leaving your chair… 3 hours is a long time.
All the details you need to prepare your machine for the exam are available in this document.
The exam in 3 points:
Only multiple-choice questions
Questions are mainly divided into 5 Spark fields. The following is a summary of what you will face:
Databricks certification main themes, with pyspark examples
Depending on your current knowledge about Spark, here are some materials that will help you get certified:
The best way to start with Spark is by letting someone else explain it to you. Interested in Scala? Coursera offers a MOOC as part of the Functional Programming in Scala program. If you are more into Python, you will also find good material on Coursera and edX. The former covers the basics of Spark (HDFS, MapReduce and RDD). For the latter, I recommend focusing on the first 2 weeks, since the following weeks are more ML-oriented and ML is not a large part of the certification. You can find other online courses on Udemy, in Scala or Python, but unlike the others, those are paid. They can be expensive (up to hundreds of euros), but you can take advantage of temporary discounts and pay less than 15 euros for rich material.
If you prefer good old books, O’Reilly published Learning Spark: Lightning-Fast Big Data Analysis. The authors are active Spark project contributors, including Matei Zaharia, the creator of Apache Spark. Even though the latest edition dates back to 2015, the book is still relevant.
I already know Spark basics
My personal advice is to focus on Spark SQL and DataFrame manipulation, since they represent a large part of the evaluated skills. To do so, Databricks offers free access to its platform through the Community Edition. You can test your own use cases on it.
If you want to sharpen your knowledge, you can follow the Databricks courses dedicated to DataFrames basics (SP800), Data Manipulation (SP820 & SP821) and Tuning and Troubleshooting (SP870). Unfortunately, they are not free. If you are already fluent in Spark, you can skip them, but if you have just discovered Spark and really want to get the certification, they are a good option.
I am a seasoned user
If Spark holds no more secrets for you, take a look at this document: 7 Steps for a Developer to Learn Apache Spark. It is a good wrap-up of what you need to know to be ready for the certification.
Once you are certified, check out this must-see resource: Advanced Apache Spark Training. It will give you an in-depth understanding of Spark architecture!
Good Luck! 🤞