Developer to Data Scientist — Part I

Ritresh Girdhar
6 min read · Oct 29, 2022

As I transition my career from software development to data science, I am running into some interesting challenges, and I will try to share my experience in this article.

Before applying mathematical models, algorithms, or statistics, we need to understand the data and the problem statement first. In today's world, getting insight into the data is difficult because the data itself is distributed; these are the challenges we solve in data engineering.

Data Distribution — Microservices

In the era of microservices, data is distributed across multiple systems. Not only that, the data comes in a variety of formats, and applications generate complex data at high volume and high velocity.

Let's understand it with an example. Suppose you have an e-commerce application that comprises:

  • Inventory Service
  • Product Master Service
  • Cart Service
  • Payment Service
  • Customer Care Operation Service
  • Profile/Customer Manager
  • Authentication Service

And each year your company targets a ~10X increase in its customer base.

High-level architecture of an event-driven e-commerce application

When we say data is distributed, we mean that each module or microservice uses its own database, which might be relational or non-relational.

The more distributed the data, the harder it becomes to retrieve consolidated information.

Event-driven e-commerce application with the respective database types

Here are a few of the challenges you might face as a data engineer with the above architecture:

  • System Integration — Procure the data from heterogeneous systems
  • Data Management — Aggregate, transform, normalize and clean redundant data
  • Regulatory Compliance — Each microservice has its own business complexity and governance; deciding what data to keep, mask, or drop based on regulatory compliance gets tricky.

As data science students, we can see enormous possibilities to explore and to help grow the business by analyzing historical data and building prediction, regression, and pattern-mining models. For example:

  1. Customer segmentation — Identify customers who spend more than $X00 on groceries (see the sketch after this list)
  2. Predict the sales of a particular product based on previous seasons' patterns
  3. Determine the trend of a particular product segment and the correlation between products (explained in my previous article — Apriori Algorithm — determine Frequently Purchased Items)
  4. Predict the inventory needs of a particular warehouse based on customers' historical purchase behavior
  5. Help board members build a tactical understanding of the business
  6. Help the team A/B test particular features on particular customer segments.

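To make the first use case concrete, here is a minimal sketch of a grocery-spend segmentation in pandas. It assumes a hypothetical orders extract (`orders_extract.csv`) with `customer_id`, `category`, and `amount` columns, and uses 500 as a stand-in for the $X00 threshold; none of these names come from the original application.

```python
# Minimal sketch: flag customers who spend more than a threshold on groceries.
# The file name, column names, and threshold are illustrative assumptions.
import pandas as pd

GROCERY_SPEND_THRESHOLD = 500  # stand-in for the "$X00" figure above

orders = pd.read_csv("orders_extract.csv")  # hypothetical export of order data

# Total grocery spend per customer
grocery_spend = (
    orders[orders["category"] == "grocery"]
    .groupby("customer_id")["amount"]
    .sum()
    .reset_index(name="grocery_total")
)

# Keep only the high-spending segment
high_value_customers = grocery_spend[
    grocery_spend["grocery_total"] > GROCERY_SPEND_THRESHOLD
]
print(high_value_customers.head())
```
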
With the help of the above examples, we now understand that we need some kind of data warehouse to persist this aggregated data for analysis.

And the original challenge remains: "How do we get this data integrated?" There are two broad ways:

  1. Batch ETL
  2. Streaming ETL

You could come up with a number of possible solutions based on your requirements, timeline, and resources. For the sake of understanding, let's walk through some high-level solutions.

Possible Solution — 1

Event-driven ETL — Write a listener, start consuming broker messages, and dump them into the data warehouse. On top of that, write a custom query builder and integrate it with a visualization tool for creating storyboards/dashboards.
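
For illustration, here is a minimal sketch of such a listener, assuming a Kafka broker, a hypothetical `order-events` topic, and a BigQuery destination table; batching, retries, and schema handling are omitted, and all names are assumptions rather than the article's actual setup.

```python
# Minimal sketch of an event-driven ETL listener (Possible Solution 1).
# Topic name, table id, and message shape are illustrative only.
import json

from google.cloud import bigquery
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "order-events",                       # hypothetical inter-service topic
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
bq_client = bigquery.Client()
TABLE_ID = "my-project.analytics.order_events"  # hypothetical warehouse table

for message in consumer:
    event = message.value
    # Dump the raw event into the warehouse; the payload must match the table schema.
    errors = bq_client.insert_rows_json(TABLE_ID, [event])
    if errors:
        print("Failed to insert row:", errors)
```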

Challenges with this approach:

  • We would only be able to listen to inter-service communications, which are mainly action-based messages. What about internal events or state changes inside a microservice?

Possible Solution — 2

A sort of event-based ETL — Change each service to push entity updates (change data capture, or CDC) to an aggregator, which persists the changes in the data warehouse. Then write a custom query builder on top of it and integrate it with the visualization tool.
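
Below is a minimal sketch of the kind of change hook each service would need to add, assuming a hypothetical HTTP aggregator endpoint; the URL, payload shape, and entity names are illustrative only and mainly show how much code each service takes on.

```python
# Minimal sketch of the per-service change hook in Possible Solution 2.
# The aggregator URL, entity shape, and example values are assumptions.
import datetime

import requests  # pip install requests

AGGREGATOR_URL = "https://aggregator.internal/api/v1/entity-changes"  # hypothetical

def publish_entity_change(entity_type: str, entity_id: str, payload: dict) -> None:
    """Push a CDC-style entity update to the aggregator after every write."""
    event = {
        "entity_type": entity_type,
        "entity_id": entity_id,
        "payload": payload,
        "changed_at": datetime.datetime.utcnow().isoformat() + "Z",
    }
    # The aggregator is responsible for persisting this change into the warehouse.
    requests.post(AGGREGATOR_URL, json=event, timeout=5)

# Example: called from the Cart Service right after it updates a cart row.
publish_entity_change("cart", "cart-123", {"customer_id": "c-42", "total": 87.5})
```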

Challenges with this approach:

  • Push-based mechanism — Too much code change is required in the existing applications
  • Even if someone suggests setting up a CDC tool like Debezium, what if some microservice is using a NoSQL database?

How we solved it

We understood that these are not long-term solutions: as volume and velocity increase, it becomes tricky to keep these systems consistent. Another challenge is which data warehousing tool to choose to persist the aggregated data.

We chose BigQuery, mainly because our applications were already on GCP, and BigQuery also provides some other out-of-the-box (OOTB) features.

What is BigQuery?

BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence.

Pros

  • Real-time analytics
  • Predictive analytics
  • Log Analytics
  • Analyze and share data across clouds
  • Geospatial analysis
  • Automatic high availability
  • Standard SQL + Materialized Views (see the sketch after this list)
  • Data governance and security

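As a small illustration of the "Standard SQL + Materialized Views" point above, this sketch creates a materialized view over a raw events table using the BigQuery Python client; the project, dataset, table, and column names are hypothetical.

```python
# Minimal sketch: a standard-SQL materialized view over a raw events table.
# Project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE MATERIALIZED VIEW IF NOT EXISTS `my-project.analytics.daily_order_totals` AS
SELECT
  DATE(order_timestamp) AS order_date,
  SUM(order_amount) AS total_amount,
  COUNT(*) AS order_count
FROM `my-project.analytics.order_events`
GROUP BY order_date
"""

client.query(ddl).result()  # runs the DDL and waits for completion
```
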
GCP also provides seamless scheduled-query integration with BigQuery.

Approach 1: ETL with BigQuery

BigQuery Scheduled Query (~Batch ETL)

Schedule queries or transfer external data from SaaS applications to Google BigQuery on a regular basis.

The BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis. Your analytics team can lay the foundation for a BigQuery data warehouse without writing a single line of code.
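
As an illustration, a scheduled query can also be registered programmatically with the BigQuery Data Transfer Service Python client; the project, dataset, source table, and schedule below are assumptions, not our actual configuration.

```python
# Minimal sketch: register a scheduled query with the BigQuery Data Transfer
# Service. Project id, dataset, source table, and schedule are hypothetical.
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

project_id = "my-project"           # hypothetical GCP project
destination_dataset = "analytics"   # hypothetical warehouse dataset

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset,
    display_name="Nightly order snapshot",
    data_source_id="scheduled_query",
    params={
        # Pull from a replica/source table into the warehouse on a schedule.
        "query": "SELECT * FROM `my-project.replica.orders`",
        "destination_table_name_template": "orders_snapshot_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print("Created scheduled query:", transfer_config.name)
```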

Pros

  • Instead of relying on each service to push changes, a BigQuery scheduled query will pull the changes from the database (a read replica is recommended) at some frequency and dump them into a BigQuery table
  • This way we get the complete raw data; on top of it we might need to add a few views

Approach 2: ETL into BigQuery with Dataflow (Recommended)

Extract, transform, and load into BigQuery with a Dataflow pipeline. This is the recommended way when you are performing massive joins, that is, across roughly 500 to 5,000 columns over more than 10 TB of data.
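
Here is a minimal Apache Beam sketch of such a pipeline, reading change events from Pub/Sub, parsing them, and writing to BigQuery; the subscription, table, and schema are hypothetical, and on Dataflow you would pass `--runner=DataflowRunner` plus project, region, and temp location options.

```python
# Minimal Apache Beam sketch of a streaming ETL pipeline into BigQuery (Approach 2).
# Subscription, table id, and schema are illustrative assumptions.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/order-events-sub"
        )
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:analytics.order_events",
            schema="order_id:STRING,customer_id:STRING,amount:FLOAT,created_at:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```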

Pros

  • Dataflow enables fast, simplified streaming data pipeline development with lower data latency.

Cons

ETL from a relational database into BigQuery using Dataflow is seamless. For Firestore there is a `Firestore and Apache Beam` connector, but it is not as seamless.

Once we have the aggregated, clean data, the next step is to analyze it to get insights. For visual analytics I prefer Power BI or Tableau, but there are other tools as well. Most BI tools provide a Google BigQuery connector; some offer a web version, while others are limited to a desktop version only.

Google Data Studio — Web

Google Data Studio is a free data analytics tool that enables users to create branded reports with data visualizations to share with clients.

It is a dashboard and reporting tool that is simple to use, customize, and share, and it does not require a desktop app installation on the client device.

Tableau — App

Start working with data in Tableau, and select the Google BigQuery connector from the menu of servers.

Power BI — App

Limitation — The built-in Google BigQuery connector can be found only in the desktop version of Power BI
