December 12, 2018

Publish Data Outside Your Data Lake with a Spark Connector

Lessons learned from implementing a Spark connector for Tableau.

A data lake makes structured and unstructured data widely available within a company. Spark connectors allow users to retrieve data from the data lake.

A data lake is a reliable collection of transformed data that comes from, and is destined for, various business units. But a data lake is worth nothing if users cannot query the data it holds. Users want to get the data, crunch it and visualize it. That is why connectors from Spark to the formats used by data visualization tools, such as Tableau, are a necessary step in data engineering.

Connectors in Spark are interfaces used to write Resilient Distributed Datasets (RDDs, the distributed collections that underlie Spark dataframes) to external storage systems.

Spark comes with many pre-packaged connectors in the Spark SQL API. For instance, the write connector makes it easy to write a dataframe to CSV with a single line of code.

dataframe.write.csv('mycsv.csv')

Other standard formats supported by the same write interface include JSON, Parquet, ORC, plain text and JDBC.

In some cases, however, you will have to work with more exotic formats. Tableau, Alteryx, Microsoft: major data software companies have developed their own formats for big data. Tableau's corporate solution, Tableau Server, for instance uses either Tableau Data Extract (.tde) or, more recently, Hyper (.hyper) as the storage format for its data tables.

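A connector divides into three parts: converting the Spark dataframe to the target format, exporting the converted file to the cloud, and making the connector easy to use.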

Figure1 — Spark connector to cloud vendor

Let’s dive into each part of the connector!

1 — Convert a Spark dataframe to your target format

The proper way to convert a dataframe to your target format is to proceed partition by partition. RDDs are distributed across partitions that live on the executors, so they are not directly accessible from the cluster's driver, where the Python code runs.

For each partition of the RDD, collect the data in that partition to the driver. You can then iterate over the collected partition and insert it into your target file row by row.
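
The loop described above could be sketched as follows. This is a minimal sketch, not the exact code of my connector: export_by_partition and keep_partition are illustrative names, and convert_and_insert is assumed to be a callback that writes a single row to the target file. It relies only on the RDD methods getNumPartitions, mapPartitionsWithIndex and collect.

```python
from typing import Any, Callable, Iterator


def keep_partition(wanted: int) -> Callable[[int, Iterator[Any]], Iterator[Any]]:
    """Build a function for mapPartitionsWithIndex that keeps a single partition."""
    def _keep(index: int, rows: Iterator[Any]) -> Iterator[Any]:
        # Return the rows of the wanted partition, and nothing for the others.
        return rows if index == wanted else iter(())
    return _keep


def export_by_partition(rdd, convert_and_insert: Callable[[Any], None]) -> int:
    """Collect an RDD to the driver one partition at a time.

    convert_and_insert is a hypothetical callback that writes one row
    to the target file. Returns the number of rows inserted.
    """
    inserted = 0
    for index in range(rdd.getNumPartitions()):
        # Only the rows of partition `index` are brought back to the driver.
        rows = rdd.mapPartitionsWithIndex(keep_partition(index)).collect()
        for row in rows:
            convert_and_insert(row)
            inserted += 1
    return inserted
```

With a real dataframe, the call would look like export_by_partition(dataframe.rdd, convert_and_insert). Each iteration launches one Spark job, so persisting the dataframe beforehand may avoid recomputing it for every partition.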

The row-level conversion function (call it convert_and_insert) relies on the SDK provided by the data vendor. For Tableau formats, for instance, you can refer to the Tableau SDK (for .tde) or the Tableau Extract API 2.0 (for .hyper), both of which have C++, Java and Python APIs.

This method assumes that you have full control over:

  • the driver's memory (set through spark.driver.memory when the Spark session is launched), because every partition collected to the driver, one after another, has to fit in it, and

  • the partitioning of the source dataframe, because no partition should exceed the driver's memory, otherwise you will run into OutOfMemory errors.
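
To make the second constraint concrete, here is a rough sizing sketch. It assumes that you can estimate the in-memory size of the dataset, and that only a fraction of the driver's memory is really available for a collected partition. The min_partitions helper and the default fraction of 0.25 are illustrative assumptions, not values from Spark.

```python
import math


def min_partitions(dataset_bytes: int, driver_memory_bytes: int,
                   usable_fraction: float = 0.25) -> int:
    """Estimate how many partitions keep each one within the driver's budget.

    usable_fraction is an assumed safety margin: the share of the driver's
    memory that one collected partition is allowed to occupy.
    """
    budget = driver_memory_bytes * usable_fraction
    if budget <= 0:
        raise ValueError("driver memory budget must be positive")
    # At least one partition, even for an empty dataset.
    return max(1, math.ceil(dataset_bytes / budget))


GIB = 1024 ** 3
# A 10 GiB dataset with a 4 GiB driver: each partition may use 1 GiB.
partitions = min_partitions(10 * GIB, 4 * GIB)
```

The result can then be passed to dataframe.repartition(partitions) before the conversion starts. Partitions are rarely perfectly balanced, so a larger count than the estimate is the safer choice.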

2 — Export the source file to the cloud

Once the data is converted to the proper format, it can be exported to the cloud and made available to users. For instance, I use the Tableau REST API to publish Tableau files to Tableau Server. Most vendors provide developers with dedicated APIs to publish data to the cloud.

Not all APIs, however, are well documented, so my advice is to clone the project directly and dive into the code to see whether the feature you need is already implemented. If a feature is still missing, you can even submit a pull request.

3 — Make the connector easy to use for your users

I built a command-line interface in Python on top of my connector. The user chooses a source environment and a target environment (as represented in Figure 1 above). These options are parsed and mapped to a configuration file, and the two services described above in part 1 (convert) and part 2 (export) are triggered sequentially to publish the formatted data to the cloud.
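
A minimal sketch of such an interface, using argparse from the standard library, could look like this. The environment names, paths, URLs and the convert and export placeholders are all hypothetical; in a real connector the mapping would be read from the configuration file, and the two functions would call the services of parts 1 and 2.

```python
import argparse

# Hypothetical environments; a real connector would load this from a config file.
CONFIG = {
    "source": {"dev": "/data/dev/sales", "prod": "/data/prod/sales"},
    "target": {"dev": "https://tableau-dev.example.com",
               "prod": "https://tableau.example.com"},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="publish", description="Publish data lake tables to the cloud.")
    parser.add_argument("--source", choices=sorted(CONFIG["source"]), required=True)
    parser.add_argument("--target", choices=sorted(CONFIG["target"]), required=True)
    return parser


def convert(source_path: str) -> str:
    """Placeholder for part 1: returns the path of the converted file."""
    return source_path + ".hyper"


def export(file_path: str, server_url: str) -> str:
    """Placeholder for part 2: returns a description of what was published."""
    return f"{file_path} -> {server_url}"


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    # Map the parsed options to the configuration.
    source_path = CONFIG["source"][args.source]
    server_url = CONFIG["target"][args.target]
    # Trigger both services sequentially: convert, then export.
    return export(convert(source_path), server_url)
```

Calling main(["--source", "dev", "--target", "prod"]) runs the conversion on the dev source and then the export to the prod server. Restricting both options with choices lets argparse reject an unknown environment before any Spark job is started.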

Thanks to Arnaud, Irina Stolbova, Florian Carra, and Nicolas Jean. 