Project Overview
A complete data pipeline that collects, processes, stores, and analyzes e-commerce sales data in real time using Kafka, MSSQL, Jupyter, and Docker. As always, practice makes progress, and one day it will make perfect.
Architecture Diagram
Main Components
Kafka
Receives incoming sales events and serves as a message broker.
Python Producer
Generates and sends fake sales data into Kafka to simulate orders.
Python Consumer
Reads data from Kafka and loads it into the MSSQL database.
MSSQL
Stores the sales data in a structured format for analysis.
Jupyter Notebook
Connects to MSSQL, runs queries, and plots sales KPIs.
Divolte Collector + Kafka vs. Supabase for Analytics
Should we use Divolte Collector + Kafka instead of Supabase? It is a powerful, enterprise-grade setup for event tracking and analytics, but it comes with tradeoffs.
Short answer: yes, if we're ready to manage more complexity.
🧱 What It Would Look Like
[User on Website] → [Divolte Collector] → [Kafka Topic: user-events] → [Kafka Consumers: Flink / Spark / Python ETL] → [Analytics Store: PostgreSQL / ClickHouse / BigQuery] → [Dashboard: Metabase / Superset / Tableau]
📦 Comparison
Feature | Supabase | Divolte + Kafka |
---|---|---|
Setup Time | Minutes | Hours to Days |
Skill Level | Beginner | Intermediate to Advanced |
Real-time Processing | ❌ No | ✅ Yes |
Event Scalability | 🚫 Basic | 🟢 Very High |
Free Tier | ✅ Yes | ✅ Yes (self-hosted) |
Flexibility | 🟡 Medium | 🟢 Very High |
Hosting | Cloud (managed) | Self-hosted / Confluent Cloud |
Data Modeling Simplicity | ✅ Easy | ❌ Manual |
✅ When to Use Divolte + Kafka
- We want to learn big data pipelines
- We need to process large volumes of events in real time
- We're building a portfolio project to show off data engineering skills
- We want to integrate with Apache Flink, Spark, or dbt
🛠️ How It Works (Simplified)
- Divolte Collector runs in the background and captures user browser events.
- We embed a JavaScript snippet on our e-commerce pages.
- It sends JSON events to Kafka.
- Kafka acts as the data pipeline.
- We'll need to create topics like `user-events`, as sketched below.
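For example, the topic can be created ahead of time with a small setup script. Below is a minimal sketch using the `kafka-python` admin client; the broker address and partition settings are assumptions for a local single-broker setup.

```python
# Minimal sketch: create the user-events topic with kafka-python.
# Broker address and partition/replication settings are assumptions.
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

admin = KafkaAdminClient(bootstrap_servers="localhost:9092", client_id="topic-setup")
try:
    admin.create_topics([NewTopic(name="user-events", num_partitions=3, replication_factor=1)])
except TopicAlreadyExistsError:
    pass  # the topic already exists from a previous run
finally:
    admin.close()
```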
📈 From Idea to Insight: Building Your First Data Pipeline & Analytics Sandbox
**Published:** June 1, 2025 | **Category:** Data Engineering, Analytics
Ever wondered how data flows from a user interaction on a website all the way to a meaningful chart on an analytics dashboard? Building a robust data pipeline can seem daunting. But by starting with a "simple project," we can demystify the core components and understand their roles in the larger data ecosystem.
Let's explore a practical example: collecting user feedback from a website, storing it, and then analyzing it to derive insights. This mini-project, while simple, touches upon essential concepts like ETL (Extract, Transform, Load) and EDA (Exploratory Data Analysis).
The "Simple Project" Unveiled: Feedback to Insight
Our sample project aims to gather user feedback on e-commerce features. Here's the toolkit we're using:
- **Streamlit (GUI):** A Python framework for rapidly building web applications. This acts as our user-friendly interface where feedback is submitted.
- **Microsoft SQL Server (MSSQL):** Our relational database of choice for storing the raw feedback data.
- **Kafka (Conceptual/Context):** While not directly used in the *feedback submission* flow of this simple project, Kafka was considered in the broader context of collecting high-volume clickstream data. It represents the robust streaming backbone for events.
- **Jupyter Notebook (Analysis & Dashboarding):** Our interactive environment for querying the stored data, performing analysis, and generating visualizations.
This setup creates a direct path: User submits feedback via **Streamlit** → Feedback is saved to **MSSQL** → Data is analyzed in a **Jupyter Notebook**.
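A minimal sketch of the first two hops might look like this; the `Feedback` table, its columns, and the connection string are assumptions for illustration:

```python
# feedback_app.py - minimal sketch of the Streamlit -> MSSQL feedback form.
# The connection string and the Feedback table/columns are assumptions.
import pyodbc
import streamlit as st

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost,1433;DATABASE=ecommerce;UID=sa;PWD=YourStrong!Passw0rd"
)

st.title("E-commerce Feature Feedback")
feature = st.selectbox("Which feature are you rating?", ["Search", "Checkout", "Recommendations"])
rating = st.slider("Rating", min_value=1, max_value=5, value=3)
comment = st.text_area("Comments (optional)")

if st.button("Submit"):
    # Extract: the values come straight from the form fields above.
    conn = pyodbc.connect(CONN_STR)
    cursor = conn.cursor()
    # Load: insert the raw feedback row into MSSQL with no transformation.
    cursor.execute(
        "INSERT INTO Feedback (feature, rating, comment) VALUES (?, ?, ?)",
        feature, rating, comment,
    )
    conn.commit()
    conn.close()
    st.success("Thanks! Your feedback was saved.")
```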
Demystifying the Data Flow: ETL and EDA in Action
When we talk about data pipelines, two acronyms frequently pop up: ETL (Extract, Transform, Load) and EDA (Exploratory Data Analysis). Our simple project demonstrates elements of both.
The EL (Extract & Load) Heartbeat
- **Extract (E):** In our project, the "extraction" happens right at the source: the **Streamlit GUI**. As a user types in feedback and clicks "Submit," this raw data is extracted from the web form's input fields by our Python backend code.
- **Load (L):** Immediately after extraction, this raw feedback data is loaded directly into the **MSSQL database** table using `pyodbc`. This is the "Load" phase.
- **Transform (T):** Noticeably, in this *simple* EL pipeline, the "Transform" step is minimal. The data isn't being cleaned, aggregated, or significantly reshaped *before* being loaded. It goes in pretty much as it comes from the GUI. In a production scenario, especially with raw clickstream, this 'T' phase is crucial for data quality and usability.
So, the Streamlit → MSSQL flow constitutes a fundamental Extract-Load (EL) segment of a data pipeline.
The EDA Workbench in Jupyter Notebook
Once the data resides in MSSQL, our focus shifts to understanding what the data tells us. This is where **Jupyter Notebook** shines as an **Exploratory Data Analysis (EDA)** workbench.
- We connect the Jupyter Notebook to MSSQL to **extract** the feedback data.
- Using Python libraries like Pandas, we perform **transformations** specific to analysis: calculating average ratings, counting submissions per feature, or perhaps preparing data for sentiment analysis on comments.
- Finally, we use visualization libraries like Matplotlib or Seaborn to create charts (e.g., rating distributions, feature popularity) to **explore and present** the insights.
This Jupyter phase is quintessential EDA: it's about asking questions of the data, finding patterns, and telling a story, rather than just moving data around.
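A typical notebook cell for this phase could look like the sketch below, reusing the assumed connection string and `Feedback` schema from the Streamlit example:

```python
# Notebook sketch: explore the Feedback table stored in MSSQL.
# Connection string and schema are assumptions.
import matplotlib.pyplot as plt
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost,1433;DATABASE=ecommerce;UID=sa;PWD=YourStrong!Passw0rd"
)

# Extract: pull the raw feedback rows into a DataFrame.
df = pd.read_sql("SELECT feature, rating, comment FROM Feedback", conn)

# Transform: average rating and number of submissions per feature.
summary = df.groupby("feature")["rating"].agg(["mean", "count"]).sort_values("mean")

# Explore/present: visualize the average rating per feature.
summary["mean"].plot(kind="barh", title="Average rating per feature")
plt.xlabel("Average rating (1-5)")
plt.tight_layout()
plt.show()
```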
Why This Project Matters (and Why It's "Simple")
This "simple project" is incredibly valuable for:
- **Learning Integration:** It effectively demonstrates how different tools (GUI, database, analysis environment) can be chained together.
- **Rapid Prototyping:** Streamlit allows for very quick development of data input forms and simple displays.
- **Understanding Data Flow:** It provides a clear, end-to-end view of data from source to insight.
However, it's crucial to acknowledge why this project is "simple" and **not suitable for a production-scale clickstream analytics pipeline**:
- **MSSQL's Scalability for Clickstream:** As previously discussed, MSSQL struggles with the sheer volume and velocity of raw clickstream data. Its OLTP (transactional) nature makes analytical queries on massive datasets inefficient and very costly at scale.
- **Jupyter's Production Readiness:** While great for analysis, Jupyter Notebooks are not designed for automated, real-time, multi-user production dashboards. They lack the robustness, user management, and seamless updates of dedicated Business Intelligence (BI) tools.
- **Lack of Robust Transformation:** A real clickstream pipeline requires sophisticated data cleansing, enrichment, and aggregation *before* storage for efficient querying.
Your Next Steps in Data Engineering
This simple project is a fantastic stepping stone. To truly build a scalable and cost-effective clickstream analytics solution for an e-commerce platform, the next evolution involves:
- Leveraging **Kafka** as the central, high-throughput event streaming backbone.
- Adopting a **specialized open-source columnar analytical database** (like ClickHouse or Apache Druid) for storing raw and aggregated clickstream data; these are engineered for lightning-fast analytical queries at massive scale, offering superior TCO.
- Deploying a dedicated **open-source BI tool like Metabase** (which you already have!) for building robust, user-friendly, and scalable production dashboards.
By understanding the "why" behind these tool choices, you're well on your way to designing and implementing robust data solutions that meet both performance and budget requirements.
📖 Data Engineering Strategy Report: Scalable and Cost-Optimized Clickstream Analytics
**Date:** June 1, 2025
**Prepared for:** E-commerce Platform Leadership / Data Team
1. Executive Summary
This report outlines a strategic approach to building a highly scalable and cost-effective clickstream analytics pipeline for our e-commerce platform, currently leveraging Kafka and Divolte. Through a detailed evaluation of various database technologies and analytical tools, we recommend moving beyond traditional RDBMS like MSSQL and even general-purpose open-source alternatives like PostgreSQL, for primary clickstream storage. Instead, we advocate for a specialized **open-source columnar analytical database (e.g., ClickHouse)** as the core data store, integrated with our existing Kafka streaming layer. This approach, complemented by Metabase for production dashboards and Jupyter Notebooks for ad-hoc analysis, minimizes **Total Cost of Ownership (TCO)** while ensuring robust performance and future scalability for high-volume, real-time data.
2. Current State & Core Challenge
Our e-commerce platform is effectively collecting clickstream data via Divolte, which is then streamed through Kafka. The immediate challenge is to establish a robust and performant analytics dashboard. The initial consideration of MSSQL for this purpose, and subsequent exploration of PostgreSQL, highlighted a fundamental architectural decision: how to efficiently store and query high-volume, high-velocity, semi-structured clickstream data for analytical insights while strictly adhering to a budget-conscious mindset.
3. Database Evaluation for Clickstream Data (TCO Focus)
We've assessed three primary database categories for their suitability as the core analytical store for clickstream data, with a strong emphasis on TCO over initial setup costs:
3.1. MSSQL (Proprietary RDBMS)
- **Pros (Limited):** Leverages existing infrastructure/licenses (if any), team familiarity.
- **Cons (Significant for Clickstream):**
  - **Prohibitive Scaling Costs:** While initially "free" if licenses exist, scaling MSSQL for billions or trillions of clickstream events becomes economically unsustainable. High-performance features often reside in expensive Enterprise Editions. Licensing costs for additional cores, memory, and high-IOPS storage escalate rapidly.
  - **Suboptimal Architecture:** As a row-oriented OLTP database, MSSQL is fundamentally ill-suited for the read-heavy, aggregative analytical queries typical of clickstream. Extensive indexing, partitioning (e.g., table partitioning, custom ETL to aggregate tables), and complex query tuning become necessary, leading to increased administrative burden and compute resource consumption.
  - **Schema Rigidity:** While JSON support exists, its performance for querying complex, nested, or evolving semi-structured clickstream events is inferior to specialized databases.
- **Example Upscaling Cost Pitfall:** Imagine a sudden traffic surge from a marketing campaign. With MSSQL, achieving sub-second query response times on raw clickstream would likely necessitate throwing more compute (e.g., Azure SQL DB higher tiers, AWS RDS SQL Server larger instances) and faster storage at the problem, leading to disproportionate cost increases (e.g., 5x compute for 2x performance gain), coupled with high administrative overhead for constant re-tuning.
3.2. PostgreSQL (Open-Source RDBMS)
- **Pros (Strong):** Zero licensing costs, mature, highly reliable, strong community, excellent JSONB support, and extensibility (e.g., `TimescaleDB` for time-series, `pg_partman` for automated partitioning).
- **Cons (Mitigated, but Present at Extreme Scale):**
  - **Analytical Performance Ceiling:** Despite its versatility, PostgreSQL, as an OLTP-first database, will still hit performance limits for ad-hoc analytical queries on truly massive, raw clickstream datasets compared to columnar stores. Strategies like heavy indexing, partitioning, materialized views, and pre-aggregated tables become *mandatory* to manage query response times.
  - **Operational Burden for Self-Managed:** While open-source, managing PostgreSQL at scale (e.g., horizontal scaling, replication, complex partitioning strategies for trillions of records) demands considerable data engineering and DBA expertise, which translates to staff cost. Managed cloud services mitigate this but introduce their own cloud infrastructure costs.
- **Example Upscaling Strategy (PostgreSQL):** For initial stages (hundreds of millions of events), PostgreSQL with `TimescaleDB` and time-based partitioning offers great value. To scale further, the strategy shifts to *creating aggregated tables* (e.g., daily/hourly user metrics, product view counts) and building dashboards on these summaries, offloading heavy computations from query time to ETL time. This minimizes runtime costs but increases ETL complexity and latency for fresh data.
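To make that strategy concrete, here is a hedged sketch of a daily rollup job using `psycopg2`; the `raw_click_events` source table, the `daily_product_views` target, and the connection details are all assumptions:

```python
# Hedged sketch of the "aggregate at ETL time" strategy for PostgreSQL.
# Table names, columns, and connection details are assumptions.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS daily_product_views (
    day        date   NOT NULL,
    product_id text   NOT NULL,
    view_count bigint NOT NULL,
    PRIMARY KEY (day, product_id)
);
"""

ROLLUP = """
INSERT INTO daily_product_views (day, product_id, view_count)
SELECT date_trunc('day', event_time)::date, product_id, count(*)
FROM raw_click_events
WHERE event_time >= now() - interval '1 day'
GROUP BY 1, 2
ON CONFLICT (day, product_id) DO UPDATE SET view_count = EXCLUDED.view_count;
"""

with psycopg2.connect("dbname=analytics user=etl password=secret host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(ROLLUP)  # run once per day from a scheduler (cron, Airflow, etc.)
```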
3.3. Open-Source Columnar Analytical Databases (Primary Recommendation)
- **Leading Candidates:** ClickHouse, Apache Druid, Apache Pinot.
- **Pros (Ideal for Clickstream & TCO):**
  - **Purpose-Built for OLAP:** These databases are designed from the ground up for analytical queries on large datasets. Their columnar storage, data compression, and parallel processing capabilities deliver orders of magnitude faster query performance for aggregates and time-series analysis on clickstream data.
  - **Exceptional Cost-Efficiency at Scale:** Due to high data compression and query efficiency, they require significantly less hardware (CPU, RAM, storage) to achieve comparable performance to row-oriented databases on analytical workloads. This directly translates to lower infrastructure costs, leading to a much lower TCO for high-volume clickstream.
  - **Real-time Capabilities:** Optimized for high-throughput ingestion from Kafka and low-latency queries, crucial for near real-time dashboards.
  - **Schema Flexibility:** Handle semi-structured clickstream JSON data natively and efficiently.
- **Example Upscaling Strategy (ClickHouse):** Scaling ClickHouse involves adding more nodes to its distributed cluster. This scales linearly. For 10x data volume, you might need just 2x nodes, leveraging its inherent efficiency. Kafka integration is often native (e.g., Kafka engine table in ClickHouse), simplifying ingestion and reducing custom code. Budget is allocated to managed cloud services (e.g., Altinity.Cloud for ClickHouse) for operational ease, allowing engineers to focus on data modeling and insights rather than infrastructure.
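As an illustration of the native Kafka integration mentioned above, here is a hedged sketch of the usual three-object pattern in ClickHouse (a Kafka engine table, a MergeTree storage table, and a materialized view), issued via the `clickhouse-driver` Python client; the column set, topic name, and broker address are assumptions:

```python
# Hedged sketch: native Kafka ingestion into ClickHouse.
# Columns, topic name, and hosts are assumptions.
from clickhouse_driver import Client

client = Client(host="localhost")

# 1) A Kafka engine table that reads JSON events from the user-events topic.
client.execute("""
CREATE TABLE IF NOT EXISTS user_events_queue (
    event_time DateTime,
    user_id    String,
    event_type String,
    page_url   String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'user-events',
         kafka_group_name  = 'clickhouse-ingest',
         kafka_format      = 'JSONEachRow'
""")

# 2) A MergeTree table that actually stores the events.
client.execute("""
CREATE TABLE IF NOT EXISTS user_events (
    event_time DateTime,
    user_id    String,
    event_type String,
    page_url   String
) ENGINE = MergeTree
ORDER BY (event_time, user_id)
""")

# 3) A materialized view that continuously moves rows from the queue into storage.
client.execute("""
CREATE MATERIALIZED VIEW IF NOT EXISTS user_events_mv TO user_events AS
SELECT * FROM user_events_queue
""")
```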
4. Tooling for Analytics and Dashboarding
4.1. Jupyter Notebook (Analysis & Prototyping)
- **Role:** Absolutely essential for data scientists and data engineers for **ad-hoc analysis, data exploration, cleaning, transformation, and prototyping new metrics or machine learning models**. Its interactivity and rich ecosystem of Python libraries (Pandas, Plotly, etc.) are invaluable.
- **Strategic Budget Mindset:** Using Jupyter for prototyping allows rapid iteration on data models and visualizations *before* committing to dashboard development resources. It minimizes "throwaway" dashboarding effort.
- **Upscaling Example:** A data scientist uses Jupyter to analyze raw clickstream in ClickHouse, discover a new user engagement pattern, build a compelling visual. This prototype then informs a new dashboard panel in Metabase, ensuring engineering effort is spent on validated insights.
4.2. Metabase (Production Dashboarding)
- **Role:** The **ideal solution for delivering production-grade, interactive dashboards to business users**. Metabase connects seamlessly to various databases, including PostgreSQL and, crucially, columnar analytical databases like ClickHouse.
- **Strategic Budget Mindset:** Being open-source, Metabase eliminates licensing costs common with commercial BI tools. Its intuitive interface empowers business users to self-serve queries (through its "Questions" feature), significantly reducing dependency on data engineers for routine reporting requests. This frees up engineering time for more complex, high-value tasks.
- **Upscaling Example:** As the number of business users grows, Metabase can be deployed on horizontally scalable infrastructure. Its efficient query generation against an optimized columnar database ensures dashboards remain performant even with increasing user load. Budget is allocated to optimizing the underlying data store and Metabase's infrastructure (e.g., dedicated server, robust caching), not per-user licensing.
5. Integrated Architecture and Strategic Upscaling
Based on this analysis, the recommended cost-optimized, scalable clickstream analytics architecture is:
- **Data Collection:** Divolte (for rich, raw event capture)
- **Event Streaming Backbone:** Kafka (for durable, scalable event transport)
- **Data Ingestion:** Kafka Connect or specialized connectors to load directly into the analytical store.
- **Primary Analytical Data Store:** **ClickHouse** (or another open-source columnar database like Apache Druid/Pinot), optimized for fast analytical queries on massive datasets.
  - **Budget Strategy for Upscaling:**
    - **Initial Stage (MVP):** Start with a single-node ClickHouse instance on a modest cloud VM (e.g., a General Purpose instance on AWS/Azure/GCP). Optimize table schemas and materialized views for common queries. Focus on ingesting a subset of critical clickstream events.
    - **Growth Stage 1 (Increased Volume/Users):** As data volume grows (e.g., billions of events) and dashboard users increase, vertically scale the existing ClickHouse instance (upgrade VM size: more CPU/RAM). Implement basic partitioning strategies. Consider a managed ClickHouse service (e.g., Altinity.Cloud, Tinybird, or self-managed on Kubernetes) to reduce operational burden.
    - **Growth Stage 2 (Massive Scale/Low Latency):** Transition to a multi-node ClickHouse cluster (distributed tables, sharding) for horizontal scalability and high availability. Implement advanced data retention policies (e.g., expiring raw data after N months, keeping only aggregates longer). Leverage Kafka Consumers with batching for highly efficient ingestion. Budget allocation shifts from large, monolithic servers to distributed, scalable cloud infrastructure where costs scale more linearly with actual usage.
- **Dashboarding & Reporting:** Metabase, connected directly to ClickHouse.
  - **Budget Strategy for Upscaling:**
    - **Initial Stage:** Metabase deployed on a small dedicated VM.
    - **Growth Stage:** Scale Metabase vertically first (more RAM/CPU for queries). As user concurrency grows, consider horizontally scaling Metabase instances behind a load balancer. Optimize Metabase caching.
- **Ad-Hoc Data Science & Deep Dives:** Jupyter Notebooks, connected to ClickHouse.
  - **Budget Strategy for Upscaling:** Utilize cloud-based Jupyter environments (e.g., AWS SageMaker, GCP Vertex AI Workbench, Azure Machine Learning workspaces) for elastic compute resources, allowing data scientists to spin up powerful instances only when needed, paying for compute by the hour, rather than maintaining expensive always-on workstations.
6. Conclusion
While traditional RDBMS like MSSQL and PostgreSQL have their merits, they are not the optimal foundational choice for high-volume, cost-effective clickstream analytics at scale. Our strategic recommendation is to build the core analytical layer on a specialized **open-source columnar database like ClickHouse**. This, combined with our existing Kafka/Divolte pipeline, the power of Metabase for production dashboards, and the analytical flexibility of Jupyter Notebooks, forms a robust, scalable, and budget-conscious data engineering architecture perfectly suited for our evolving e-commerce data needs. This approach minimizes TCO by leveraging open-source efficiency and cloud elasticity, allowing budget to be reallocated towards data product development and advanced analytics, rather than licensing and disproportionate infrastructure.
Tools Involved
Step-by-Step Build Plan
Virtual Machine Setup
Set up an Oracle virtual machine to deploy the pipeline.
Set Up Docker to Deploy Services
Instead of installing and running each service manually, we deploy them with Docker.
Using Streamlit to Create a Temporary GUI
Write a Python script that generates fake "sales" events and sends them to a Kafka topic called `sales_events`.
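A minimal sketch of such a producer script, using `kafka-python`, is shown below; the event fields and the one-event-per-second pacing are assumptions for illustration:

```python
# producer.py - minimal sketch of the fake sales generator.
# Event field names and pacing are assumptions.
import json
import random
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

PRODUCTS = ["laptop", "phone", "headphones", "monitor"]

while True:
    event = {
        "order_id": str(uuid.uuid4()),
        "product": random.choice(PRODUCTS),
        "quantity": random.randint(1, 5),
        "unit_price": round(random.uniform(10, 500), 2),
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    producer.send("sales_events", event)  # topic name from the build plan
    time.sleep(1)                          # roughly one fake order per second
```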
Set Up a Pipeline to Collect Clickstream Data
We use Divolte Collector together with Kafka and ZooKeeper to collect click events.
Dump the Data into a Database or Data Warehouse
Insert all events collected into the MSSQL sales table.
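A minimal consumer sketch could look like the following, assuming the `sales_events` topic from the producer step and a `sales` table in MSSQL (connection string and column names are illustrative):

```python
# consumer.py - minimal sketch of the Kafka -> MSSQL loader.
# Table name, columns, and connection string are assumptions.
import json

import pyodbc
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sales_events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="mssql-loader",
)

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost,1433;DATABASE=ecommerce;UID=sa;PWD=YourStrong!Passw0rd"
)
cursor = conn.cursor()

for message in consumer:
    e = message.value
    cursor.execute(
        "INSERT INTO sales (order_id, product, quantity, unit_price, ts) VALUES (?, ?, ?, ?, ?)",
        e["order_id"], e["product"], e["quantity"], e["unit_price"], e["ts"],
    )
    conn.commit()  # committing per event keeps the sketch simple; batch in practice
```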
MSSQL Database Setup
Configure the database schema and tables to store sales data efficiently.
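A hedged sketch of that schema, created once via `pyodbc`, might look like this; column names and types are assumptions chosen to match the fake events above:

```python
# create_schema.py - minimal sketch run once against MSSQL.
# Column names/types and the connection string are assumptions.
import pyodbc

DDL = """
IF OBJECT_ID('dbo.sales', 'U') IS NULL
CREATE TABLE dbo.sales (
    order_id   UNIQUEIDENTIFIER PRIMARY KEY,
    product    NVARCHAR(100)    NOT NULL,
    quantity   INT              NOT NULL,
    unit_price DECIMAL(10, 2)   NOT NULL,
    ts         DATETIME2        NOT NULL
);
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost,1433;DATABASE=ecommerce;UID=sa;PWD=YourStrong!Passw0rd",
    autocommit=True,
)
conn.execute(DDL)
conn.close()
```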
Analytics with Jupyter
Create Jupyter notebooks to connect to the database, analyze sales data, and generate visualizations.
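For example, a notebook cell that computes and plots a daily-revenue KPI could look like the sketch below (same assumed connection string and `sales` schema as above):

```python
# Notebook sketch: compute and plot a daily-revenue KPI from the MSSQL sales table.
# Connection string and schema are assumptions.
import matplotlib.pyplot as plt
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost,1433;DATABASE=ecommerce;UID=sa;PWD=YourStrong!Passw0rd"
)

# Aggregate revenue per day directly in SQL, then plot it with pandas.
df = pd.read_sql(
    """
    SELECT CAST(ts AS date) AS sales_day,
           SUM(quantity * unit_price) AS revenue
    FROM dbo.sales
    GROUP BY CAST(ts AS date)
    ORDER BY sales_day
    """,
    conn,
)

df.plot(x="sales_day", y="revenue", kind="line", title="Daily revenue")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()
```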