Project Overview
A complete data pipeline that collects, processes, stores, and analyzes e-commerce sales data in real time using Kafka, MSSQL, Jupyter, and Docker.
Architecture Diagram
Main Components
Kafka
Receives incoming sales events and serves as a message broker.
Python Producer
Generates and sends fake sales data into Kafka to simulate orders.
Python Consumer
Reads data from Kafka and loads it into the MSSQL database.
MSSQL
Stores the sales data in a structured format for analysis.
Jupyter Notebook
Connects to MSSQL, runs queries, and plots sales KPIs.
Tools Involved
An Oracle virtual machine, Docker, Kafka with ZooKeeper, Python (producer, consumer, and Streamlit GUI), Divolte Collector, MSSQL, and Jupyter Notebook.
Step-by-Step Build Plan
Virtual Machine Setup
Set up an Oracle virtual machine to host the pipeline.
Set Up Docker to Deploy Services
Instead of deploying each service manually, we use Docker to run them as containers; a possible Compose file is sketched below.
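A minimal Compose sketch, assuming the Confluent Kafka/ZooKeeper images and the Microsoft SQL Server image; the versions, ports, and SA password are placeholders to adapt:

```yaml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  mssql:
    image: mcr.microsoft.com/mssql/server:2019-latest
    ports: ["1433:1433"]
    environment:
      ACCEPT_EULA: "Y"
      SA_PASSWORD: "YourStrong!Passw0rd"
```

With this file in place, `docker compose up -d` brings up the broker and database on the VM; Jupyter can run in another container or directly on the host.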
Use Streamlit to Create a Temporary GUI
Write a Python script that generates fake "sales" events and sends them to a Kafka topic called sales_events; a minimal producer is sketched below, and Streamlit can wrap the same logic in a simple GUI for manual testing.
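A minimal producer sketch, assuming the kafka-python package and a broker on localhost:9092; the event fields (order_id, product, quantity, unit_price, created_at) are illustrative assumptions, not a fixed schema:

```python
# producer.py - sketch of the fake-sales generator.
import json
import random
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

PRODUCTS = ["laptop", "phone", "headphones", "monitor"]

while True:
    # Build one fake order; field names are assumptions for illustration.
    event = {
        "order_id": str(uuid.uuid4()),
        "product": random.choice(PRODUCTS),
        "quantity": random.randint(1, 5),
        "unit_price": round(random.uniform(10, 1500), 2),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("sales_events", event)  # topic name from the build plan
    time.sleep(1)
```

For the temporary GUI, the same send logic can sit behind a Streamlit widget (for example, an st.button call that fires one event per click).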
Set Up the Pipeline to Collect Clickstream Data
We use Divolte Collector together with Kafka and ZooKeeper to collect click events.
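A rough sketch of peeking at what the collector publishes, assuming Divolte is configured to write to a Kafka topic named clickstream (the topic name is set in Divolte's own configuration). Divolte emits Avro-encoded records, and decoding them requires the schema Divolte was configured with, so this sketch only counts raw messages:

```python
# inspect_clicks.py - count raw click events from the assumed clickstream topic.
from kafka import KafkaConsumer

consumer = KafkaConsumer("clickstream", bootstrap_servers="localhost:9092")
for i, message in enumerate(consumer, start=1):
    # message.value is raw Avro bytes; decoding needs Divolte's schema.
    print(f"event {i}: {len(message.value)} bytes from partition {message.partition}")
```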
Dump the Data into a Database or Data Warehouse
Insert all collected events into the MSSQL sales table; a consumer sketch follows this step.
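A minimal loader sketch, assuming kafka-python and pyodbc, the sales_events topic from the producer step, and a dbo.sales table matching the fields below (see the database setup step); the connection details are placeholders:

```python
# consumer.py - sketch of the Kafka-to-MSSQL loader.
import json

import pyodbc
from kafka import KafkaConsumer

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost,1433;DATABASE=ecommerce;"
    "UID=sa;PWD=YourStrong!Passw0rd"
)
cursor = conn.cursor()

consumer = KafkaConsumer(
    "sales_events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    e = message.value
    # Insert one row per event; column names mirror the producer sketch.
    cursor.execute(
        "INSERT INTO dbo.sales (order_id, product, quantity, unit_price, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        e["order_id"], e["product"], e["quantity"], e["unit_price"], e["created_at"],
    )
    conn.commit()
```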
MSSQL Database Setup
Configure the database schema and tables to store sales data efficiently.
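One possible schema, executed here through pyodbc so the whole pipeline stays in Python; the table and column names mirror the producer and consumer sketches and are assumptions, not a prescribed layout:

```python
# create_schema.py - sketch of the sales table DDL, run via pyodbc.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost,1433;DATABASE=ecommerce;"
    "UID=sa;PWD=YourStrong!Passw0rd"
)
# Create the table only if it does not already exist.
conn.execute("""
IF OBJECT_ID('dbo.sales', 'U') IS NULL
CREATE TABLE dbo.sales (
    order_id   UNIQUEIDENTIFIER PRIMARY KEY,
    product    NVARCHAR(100)    NOT NULL,
    quantity   INT              NOT NULL,
    unit_price DECIMAL(10, 2)   NOT NULL,
    created_at DATETIMEOFFSET   NOT NULL
);
""")
conn.commit()
```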
Analytics with Jupyter
Create Jupyter notebooks to connect to the database, analyze sales data, and generate visualizations.
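A sketch of one such notebook cell, assuming pandas, matplotlib, and pyodbc are installed and the dbo.sales table from the setup step is populated; it computes a daily-revenue KPI and plots it:

```python
# Notebook cell: query daily revenue from MSSQL and plot it.
import matplotlib.pyplot as plt
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost,1433;DATABASE=ecommerce;"
    "UID=sa;PWD=YourStrong!Passw0rd"
)

# Aggregate revenue per day directly in SQL, then load into a DataFrame.
df = pd.read_sql(
    """
    SELECT CAST(created_at AS DATE) AS sale_date,
           SUM(quantity * unit_price) AS revenue
    FROM dbo.sales
    GROUP BY CAST(created_at AS DATE)
    ORDER BY sale_date
    """,
    conn,
)

df.plot(x="sale_date", y="revenue", kind="line", title="Daily revenue")
plt.show()
```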