A datalake in 3 months

Challenge: a custom-made datalake

Thanks to its hypergrowth, this future French unicorn in the tourism sector saw its data volumes grow to 1.5 TB in 2019. Analysts took more than a week to deliver and present BI reports to decision boards, with no certainty that the numbers shown were accurate. We implemented a datalake that delivers aggregated data collections matching identified business needs, so that analysts can now deliver reports in 4 hours.

The startup's selling point is custom-made, unique trips for travellers. To deliver them, it relies on large data volumes from various sources: data collected from agent and traveller exchanges, payment data, user data, and website session data.







Our impact

BI reports' production time divided by 8
7 sources aggregated

Sicara is a driving force on technical issues: they helped us choose the technology stack that fit our business ambitions and shared their ETL best practices. Furthermore, Sicara's teams integrated perfectly with our in-house data and business teams and helped us implement lean methodology.

Johann S., Lead Data Engineer

We developed a custom-made datalake in 3 months

Delivery of raw data and aggregated data collections

We developed a datalake that delivers both historical and up-to-date data in real time. This data is either stored raw or aggregated from 7 different sources in response to pre-identified business needs. First, product and marketing teams use raw data (website user session data, client recommendations, sales data) to enhance the user experience. Second, BI teams rely on data aggregated according to specific business rules, which allows them to deliver BI reports to the executive team and internal teams in less than 4 hours (compared to 1 week previously).
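To illustrate the kind of business-rule aggregation described above, here is a minimal, self-contained Python sketch. The event shape and the rule (per-destination trip counts and revenue) are hypothetical examples, not the client's actual code: the point is that BI reads a small pre-aggregated collection instead of scanning raw events.

```python
from collections import defaultdict

def aggregate_sales(events):
    """Fold raw sale events into one aggregated document per destination.

    Hypothetical business rule: BI tracks, per destination, the number
    of trips sold and the total revenue.
    """
    collection = defaultdict(lambda: {"trips_sold": 0, "revenue": 0.0})
    for event in events:
        doc = collection[event["destination"]]
        doc["trips_sold"] += 1
        doc["revenue"] += event["amount"]
    return dict(collection)

# Illustrative raw events, as they might land from the sales source.
raw_events = [
    {"destination": "Peru", "amount": 2400.0},
    {"destination": "Japan", "amount": 3100.0},
    {"destination": "Peru", "amount": 1800.0},
]

print(aggregate_sales(raw_events))
# {'Peru': {'trips_sold': 2, 'revenue': 4200.0}, 'Japan': {'trips_sold': 1, 'revenue': 3100.0}}
```

A report query then touches 2 aggregated documents instead of 3 raw events; at the scale described here, that difference is what turns a week of report preparation into hours.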

What we did

A custom-made ETL

We implemented an EL-ETL that organized 1 TB of historical data: more than 100,000 million documents aggregated into 22 collections. Once aggregated, these collections are provisioned in real time by a dual system of 20 RabbitMQ workers that manage this data and 21 RabbitMQ workers that update it. Furthermore, the data engineering team reworked the PostgreSQL architecture to make it fluid and scalable, so it can adapt to the startup's ever-changing needs.
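The dual-worker pattern can be sketched as follows. This is a simplified, self-contained Python illustration, using queue.Queue in place of an actual RabbitMQ broker and hypothetical message shapes; production workers would consume channels through a RabbitMQ client such as pika.

```python
import queue

# Stand-ins for the two RabbitMQ queues.
insert_queue = queue.Queue()   # consumed by the "manage" workers
update_queue = queue.Queue()   # consumed by the "update" workers

collection = {}  # one aggregated collection, keyed by document id

def manage_worker(q, store):
    """Drain the insert queue: add new documents to the collection."""
    while not q.empty():
        msg = q.get()
        store[msg["id"]] = msg["doc"]

def update_worker(q, store):
    """Drain the update queue: apply partial updates to existing documents."""
    while not q.empty():
        msg = q.get()
        if msg["id"] in store:
            store[msg["id"]].update(msg["fields"])

# A new trip arrives, then its status changes.
insert_queue.put({"id": "trip-42", "doc": {"destination": "Peru", "status": "quoted"}})
update_queue.put({"id": "trip-42", "fields": {"status": "booked"}})

manage_worker(insert_queue, collection)
update_worker(update_queue, collection)
print(collection)  # {'trip-42': {'destination': 'Peru', 'status': 'booked'}}
```

Splitting inserts and updates across two worker pools lets each side scale independently with its own message rate, which is one plausible reason for the 20/21 worker split described above.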

Our ETL specialist team

In total integration with the client's team

To implement the ETL in less than 3 months, we integrated a team of 4 Sicara data engineers into the client's in-house data team, composed of 1 lead data engineer. Sicara also provided technical advice to the executive team on how to adapt the firm's data architecture. Moreover, a Sicara Product Manager gathered business and data needs, and a Sicara Agile Coach implemented lean methodology; together they tripled the delivery pace of the datalake.






