Project Clickstream: A Real-Time Sales Data Pipeline

A dockerized data pipeline for e-commerce sales analytics

Project Overview

A complete data pipeline that collects, processes, stores, and analyzes e-commerce sales data in real time using Kafka, MSSQL, Jupyter, and Docker.

Architecture Diagram

Sales Events -> Python Producer -> Kafka -> Python Consumer -> MSSQL -> Jupyter Notebook

Main Components

Kafka (Dockerized)

Receives incoming sales events and serves as the message broker.

Python Producer (Python)

Generates and sends fake sales data to Kafka to simulate orders.

Python Consumer (Python)

Reads events from Kafka and loads them into the MSSQL database.

MSSQL (Dockerized)

Stores the sales data in a structured format for analysis.

Jupyter Notebook (Python)

Connects to MSSQL, runs queries, and plots sales KPIs.

Tools Involved

Docker
Linux
Kafka + Zookeeper
MSSQL Server
Python
Jupyter Lab

Step-by-Step Build Plan

1. Virtual Machine Setup

Set up an Oracle virtual machine to host the pipeline.

Bonus: Set Up Docker to Deploy Services

Instead of manually deploying each microservice, we use Docker.
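As a sketch, a docker-compose file could wire the containerized services together. Service names, images, ports, and the SA password below are illustrative assumptions, not the project's actual configuration:

```yaml
# Illustrative docker-compose sketch; images and credentials are assumptions.
version: "3.8"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  mssql:
    image: mcr.microsoft.com/mssql/server:2022-latest
    ports: ["1433:1433"]
    environment:
      ACCEPT_EULA: "Y"
      MSSQL_SA_PASSWORD: "ChangeMe_123"
  jupyter:
    image: jupyter/base-notebook
    ports: ["8888:8888"]
```

With a file like this, `docker compose up -d` brings the whole stack up on one host instead of installing each service by hand.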

2. Using Streamlit to create a temporary GUI

Write a Python script that generates fake "sales" events and sends them to a Kafka topic called sales_events.
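A minimal producer sketch for this step, assuming the kafka-python package and a broker on localhost:9092; the event fields (order_id, product, quantity, unit_price, ts) are illustrative assumptions:

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone

def make_sale_event() -> dict:
    """Build one fake sales event (field names are illustrative assumptions)."""
    return {
        "order_id": str(uuid.uuid4()),
        "product": random.choice(["shirt", "mug", "poster"]),
        "quantity": random.randint(1, 5),
        "unit_price": round(random.uniform(5.0, 50.0), 2),
        "ts": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    # Requires `pip install kafka-python` and a Kafka broker on localhost:9092.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        producer.send("sales_events", make_sale_event())
        time.sleep(1)  # roughly one simulated order per second
```

The JSON serializer on the producer keeps the consumer side simple: every message on sales_events is a UTF-8 JSON document.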

3. Set up a pipeline to collect the clickstream

We use Divolte Collector together with Kafka and Zookeeper to collect click events.

4. Dump the data into a database or data warehouse

Insert all collected events into the MSSQL sales table.
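A consumer sketch along these lines, assuming kafka-python and pyodbc; the sales table layout, database name, and connection string are illustrative assumptions:

```python
import json

# Column order for the illustrative MSSQL `sales` table.
COLUMNS = ("order_id", "product", "quantity", "unit_price", "ts")

def event_to_row(raw: bytes) -> tuple:
    """Decode one Kafka message payload into a row tuple for INSERT."""
    event = json.loads(raw.decode("utf-8"))
    return tuple(event[col] for col in COLUMNS)

if __name__ == "__main__":
    # Requires kafka-python, pyodbc, a broker, and a reachable MSSQL instance.
    import pyodbc
    from kafka import KafkaConsumer

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost,1433;"
        "DATABASE=sales_db;UID=sa;PWD=ChangeMe_123;TrustServerCertificate=yes"
    )
    cursor = conn.cursor()
    consumer = KafkaConsumer("sales_events", bootstrap_servers="localhost:9092")
    for message in consumer:
        cursor.execute(
            "INSERT INTO sales (order_id, product, quantity, unit_price, ts) "
            "VALUES (?, ?, ?, ?, ?)",
            event_to_row(message.value),
        )
        conn.commit()  # commit per event for simplicity; batch in production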

5. MSSQL Database Setup

Configure the database schema and tables to store sales data efficiently.
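One possible schema, sketched as a Python snippet that creates an illustrative `sales` table via pyodbc; the table name, columns, and types are assumptions, not the project's actual schema:

```python
# Illustrative DDL for the sales table; names and types are assumptions.
SALES_DDL = """
IF OBJECT_ID('dbo.sales', 'U') IS NULL
CREATE TABLE dbo.sales (
    order_id   NVARCHAR(36)  NOT NULL PRIMARY KEY,
    product    NVARCHAR(100) NOT NULL,
    quantity   INT           NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    ts         DATETIME2     NOT NULL
);
"""

if __name__ == "__main__":
    # Requires pyodbc and a reachable MSSQL instance; connection string is an assumption.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost,1433;"
        "DATABASE=sales_db;UID=sa;PWD=ChangeMe_123;TrustServerCertificate=yes"
    )
    conn.cursor().execute(SALES_DDL)
    conn.commit()
```

The `IF OBJECT_ID(...) IS NULL` guard makes the script idempotent, so rerunning the setup does not fail once the table exists.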

6. Analytics with Jupyter

Create Jupyter notebooks to connect to the database, analyze sales data, and generate visualizations.
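A notebook cell for this step might compute a daily-revenue KPI with pandas. The helper below assumes the illustrative quantity/unit_price/ts columns; the query and connection string are assumptions:

```python
import pandas as pd

def daily_revenue(df: pd.DataFrame) -> pd.Series:
    """Total revenue per day, assuming quantity, unit_price, and ts columns."""
    out = df.copy()
    out["revenue"] = out["quantity"] * out["unit_price"]
    out["day"] = pd.to_datetime(out["ts"]).dt.date
    return out.groupby("day")["revenue"].sum()

if __name__ == "__main__":
    # In the notebook: pull the sales table and plot the KPI.
    # Requires pyodbc and matplotlib; connection details are assumptions.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost,1433;"
        "DATABASE=sales_db;UID=sa;PWD=ChangeMe_123;TrustServerCertificate=yes"
    )
    sales = pd.read_sql("SELECT order_id, quantity, unit_price, ts FROM sales", conn)
    daily_revenue(sales).plot(kind="bar", title="Daily revenue")
```

Keeping the aggregation in a plain function (rather than inline notebook code) makes the KPI easy to unit-test without a database connection.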