30 Data Science Projects

Spanning ML, deep learning, NLP, computer vision, data engineering, and MLOps.

🛰️
Geospatial

Satellite Image Analysis Platform

Deep learning system for land use classification and change detection from satellite imagery using PyTorch with ResNet/EfficientNet and interactive Folium maps.

→ Multi-temporal change detection
📈
Retail / Supply Chain

Time Series Demand Forecasting

Multi-model forecasting combining Prophet, LSTM, and XGBoost for retail demand prediction with seasonality detection, anomaly handling, and ensemble forecasts.

→ Ensemble forecast with backtesting
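
As a rough illustration of what "ensemble forecast with backtesting" can mean, the sketch below runs a rolling-origin backtest and weights each model by its inverse error. The two forecasters are deliberately simple stand-ins (not Prophet, LSTM, or XGBoost), and every function name and the sample series are assumptions made for this example.

```python
def naive(history, horizon):
    """Stand-in model: repeat the last observed value."""
    return [history[-1]] * horizon

def seasonal_naive(history, horizon, season=7):
    """Stand-in model: repeat the value from one season earlier."""
    return [history[-season + (h % season)] for h in range(horizon)]

def backtest_mae(series, forecaster, min_train, horizon=1):
    """Rolling-origin backtest: forecast from each origin, score against actuals."""
    errors = []
    for origin in range(min_train, len(series) - horizon + 1):
        preds = forecaster(series[:origin], horizon)
        actual = series[origin:origin + horizon]
        errors.extend(abs(p - a) for p, a in zip(preds, actual))
    return sum(errors) / len(errors)

def inverse_error_weights(maes, eps=1e-9):
    """Give lower-error models a larger share; weights sum to 1."""
    inv = {name: 1.0 / (mae + eps) for name, mae in maes.items()}
    total = sum(inv.values())
    return {name: w / total for name, w in inv.items()}

def ensemble_forecast(history, horizon, models, weights):
    """Weighted average of each model's forecast, step by step."""
    preds = {name: f(history, horizon) for name, f in models.items()}
    return [sum(weights[n] * preds[n][h] for n in models) for h in range(horizon)]

# Hypothetical daily demand with a clean weekly pattern.
demand = [10, 12, 11, 13, 20, 25, 18] * 4
models = {"naive": naive, "seasonal_naive": seasonal_naive}
maes = {name: backtest_mae(demand, f, min_train=14) for name, f in models.items()}
weights = inverse_error_weights(maes)
forecast = ensemble_forecast(demand, 7, models, weights)
```

On this toy series the seasonal model backtests perfectly, so it receives almost all of the weight. With real data, the weights would normally be fitted on one backtest window and checked on a later one.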
📊
Social Media Analytics

Sentiment Analysis Dashboard

Real-time sentiment monitoring using fine-tuned RoBERTa, BERTopic for topic modeling, entity-level sentiment with spaCy NER, and a Plotly Dash dashboard.

→ Real-time multi-model NLP pipeline
📊
Business Intelligence

BI Dashboard Suite

Interactive business intelligence dashboard with Plotly Dash and DuckDB featuring KPI cards, cross-filtering, drill-down charts, role-based access, and PDF/CSV export.

→ Multi-page BI dashboard with RBAC
🏥
Healthcare

Medical Image Segmentation Tool

U-Net architecture in TensorFlow for medical image segmentation with data augmentation, MC Dropout uncertainty, DICOM handling, and Grad-CAM visualization.

→ ONNX-exported production model
🔍
Finance

Real-time Fraud Detection System

Production-grade ML pipeline for credit card fraud detection with real-time streaming inference, SHAP explainability, A/B testing, and a Streamlit monitoring dashboard.

→ Multi-page real-time dashboard
💳
Finance

Credit Risk Scoring Model

Interpretable ML model for loan approval using LightGBM with SHAP/LIME explainability, Fairlearn bias testing, traditional WoE/IV scorecard, and regulatory compliance.

→ Fairness-audited production model
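
For readers unfamiliar with the WoE/IV scorecard mentioned above, here is a minimal sketch of Weight of Evidence and Information Value for one binned feature. It assumes the convention WoE = ln(share of goods / share of bads); the bin counts, the 0.5 smoothing for empty cells, and the function name are illustrative assumptions.

```python
import math

def woe_iv(bins):
    """bins: list of (label, n_good, n_bad) for one binned feature.

    Returns (rows, iv) where each row is (label, woe, iv_contribution).
    """
    total_good = sum(good for _, good, _ in bins)
    total_bad = sum(bad for _, _, bad in bins)
    rows, iv = [], 0.0
    for label, good, bad in bins:
        # Replace empty cells with 0.5 to avoid log(0); an assumption, not a standard.
        pct_good = (good or 0.5) / total_good
        pct_bad = (bad or 0.5) / total_bad
        woe = math.log(pct_good / pct_bad)
        contribution = (pct_good - pct_bad) * woe
        rows.append((label, woe, contribution))
        iv += contribution
    return rows, iv

# Hypothetical counts of repaid (good) and defaulted (bad) loans per income band.
income_bins = [("low", 100, 60), ("mid", 300, 30), ("high", 600, 10)]
rows, iv = woe_iv(income_bins)
```

A bin whose share of goods equals its share of bads gets a WoE of zero and adds nothing to the IV, which is why the "mid" band here carries no information.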
🔧
Manufacturing / IoT

IoT Anomaly Detection System

Unsupervised anomaly detection for manufacturing IoT sensors using Isolation Forest, PyTorch Autoencoders, and DBSCAN with real-time scoring and alerting.

→ Real-time anomaly scoring pipeline
🔄
Data Warehousing

ETL Orchestration Pipeline

Airflow-based ETL pipeline with multi-source extraction, dbt transformations on DuckDB warehouse, SCD Type 2 snapshots, data quality checks, and lineage tracking.

→ Automated ETL with data quality
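
The idea behind SCD Type 2 is that a changed record is closed and a new version appended, rather than overwritten. The plain-Python sketch below shows that row-versioning logic only; it is not how dbt snapshots are implemented, and the column names (`valid_from`, `valid_to`, `is_current`) and sample rows are assumptions.

```python
from datetime import date

def apply_scd2(history, incoming, key, tracked, as_of):
    """Return a new history list with SCD Type 2 versioning applied.

    history:  rows carrying valid_from, valid_to, and is_current fields.
    incoming: latest source rows (business columns only).
    tracked:  columns whose change should open a new version.
    """
    result = [dict(row) for row in history]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[col] == new[col] for col in tracked):
            continue  # unchanged: leave the current version open
        if old is not None:
            # Changed: close the old version instead of overwriting it.
            old["valid_to"] = as_of
            old["is_current"] = False
        result.append({**new, "valid_from": as_of, "valid_to": None, "is_current": True})
    return result

history = [{"customer_id": 1, "city": "Berlin", "valid_from": date(2024, 1, 1),
            "valid_to": None, "is_current": True}]
incoming = [{"customer_id": 1, "city": "Munich"}, {"customer_id": 2, "city": "Hamburg"}]
updated = apply_scd2(history, incoming, key="customer_id", tracked=["city"],
                     as_of=date(2024, 6, 1))
```

After the call, the Berlin row is closed as of 2024-06-01, and the Munich and Hamburg rows are the two current versions.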
📚
Enterprise Knowledge

RAG Knowledge Assistant

Retrieval-augmented generation chatbot that ingests PDF, DOCX, and Markdown files, embeds them into ChromaDB, and answers with citations using LangChain and the Claude/OpenAI APIs.

→ Multi-format document Q&A system
👤
Telecom

Customer Churn Prediction with AutoML

Automated ML pipeline for churn prediction using H2O.ai AutoML and Optuna-tuned LightGBM, with Boruta feature selection and business impact calculator.

→ AutoML + manual model comparison
👍
E-commerce

Hybrid Recommendation Engine

Production-ready recommendation system combining collaborative filtering (Surprise) and content-based filtering with cold-start handling, Redis caching, and A/B testing.

→ Hybrid model with cold-start support
📄
Content Classification

Text Classification API

FastAPI multi-class text classification service using fine-tuned DistilBERT and BERT on AG News, with model versioning and A/B testing infrastructure.

→ Production NLP API with versioning
🔎
Manufacturing

Manufacturing Defect Detection

Computer vision pipeline using ResNet-50 with transfer learning for product defect detection, Grad-CAM localization, active learning, and ONNX Runtime export.

→ Edge-deployable ONNX model
📑
Document Processing

Document Intelligence OCR System

End-to-end document processing that combines Tesseract OCR, OpenCV preprocessing, and transformer-based extraction to turn invoices and receipts into structured data.

→ Structured data from unstructured docs
👁️
HR / Security

Face Recognition Attendance System

Real-time attendance tracking using FaceNet embeddings and MTCNN detection with anti-spoofing liveness detection, privacy controls, and report generation.

→ Privacy-compliant face recognition
🌐
Customer Support

Multilingual Support Classifier

Zero-shot classification for customer support ticket routing across 20+ languages using XLM-RoBERTa, with language detection, urgency scoring, and response templates.

→ 20+ language zero-shot classification
⚙️
Manufacturing

Predictive Maintenance System

CNN-LSTM model in PyTorch that predicts equipment Remaining Useful Life (RUL) from NASA C-MAPSS sensor data, with MC Dropout uncertainty estimates and maintenance scheduling.

→ RUL prediction with uncertainty
👥
Marketing / CRM

Customer 360 Analytics Platform

Unified customer view combining CRM, transactional, web, and support data with entity resolution, RFM analysis, K-Means segmentation, and predictive CLV.

→ Unified customer view with CLV
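
RFM analysis summarizes each customer by recency (days since last purchase), frequency (number of purchases), and monetary value (total spend). The sketch below is one simple way to compute and rank those values; the rank-based 1 to 3 scoring, function names, and sample transactions are assumptions for illustration.

```python
from datetime import date

def rfm_table(transactions, today):
    """transactions: iterable of (customer_id, purchase_date, amount).

    Returns {customer_id: (recency_days, frequency, monetary)}.
    """
    stats = {}
    for customer, day, amount in transactions:
        last, freq, money = stats.get(customer, (day, 0, 0.0))
        stats[customer] = (max(last, day), freq + 1, money + amount)
    return {c: ((today - last).days, freq, money)
            for c, (last, freq, money) in stats.items()}

def rank_scores(values, higher_is_better=True, n_bins=3):
    """Rank-based scores from 1 (worst) to n_bins (best); ties follow sort order."""
    order = sorted(values, key=values.get, reverse=not higher_is_better)
    return {c: 1 + (i * n_bins) // len(order) for i, c in enumerate(order)}

# Hypothetical transactions.
transactions = [
    ("a", date(2024, 1, 5), 50.0),
    ("a", date(2024, 3, 1), 70.0),
    ("b", date(2023, 11, 20), 20.0),
    ("c", date(2024, 2, 10), 200.0),
]
rfm = rfm_table(transactions, today=date(2024, 3, 31))
recency_score = rank_scores({c: r for c, (r, _, _) in rfm.items()}, higher_is_better=False)
monetary_score = rank_scores({c: m for c, (_, _, m) in rfm.items()})
```

Recency is scored with `higher_is_better=False` because a smaller gap since the last purchase is the better outcome. The resulting scores could then feed a segmentation step such as K-Means.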
🛒
Retail Analytics

Retail Analytics with Object Detection

YOLOv8-based system for customer behavior analysis in retail: foot traffic tracking with ByteTrack, dwell time per zone, heatmap generation, and privacy-preserving face blurring.

→ Real-time foot traffic analytics
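
Dwell time per zone can be derived from tracker output by counting the frames in which a track's centroid falls inside each zone. The sketch below assumes the detector and tracker have already produced `(frame_idx, track_id, x, y)` tuples and that zones are non-overlapping axis-aligned rectangles; the names and coordinates are hypothetical.

```python
from collections import defaultdict

def dwell_seconds(track_points, zones, fps):
    """Return {track_id: {zone_name: seconds}}, counting each frame as 1/fps seconds.

    track_points: iterable of (frame_idx, track_id, x, y) centroids.
    zones: {name: (x_min, y_min, x_max, y_max)} rectangles in pixel coordinates.
    """
    frames = defaultdict(lambda: defaultdict(int))
    for _, track_id, x, y in track_points:
        for name, (x0, y0, x1, y1) in zones.items():
            if x0 <= x < x1 and y0 <= y < y1:
                frames[track_id][name] += 1
                break  # zones are assumed not to overlap
    return {track: {zone: n / fps for zone, n in per_zone.items()}
            for track, per_zone in frames.items()}

# Hypothetical track: 30 frames near the entrance, then 10 frames at checkout.
zones = {"entrance": (0, 0, 10, 10), "checkout": (40, 0, 60, 10)}
points = [(f, 1, 5.0, 5.0) for f in range(30)] + [(f, 1, 50.0, 5.0) for f in range(30, 40)]
dwell = dwell_seconds(points, zones, fps=10)
```

Counting frames rather than subtracting first and last timestamps keeps the total correct when a person leaves a zone and returns later.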
🏗️
Cloud Architecture

Data Lake Architecture

Serverless data lake simulation using MinIO (local S3), DuckDB, and medallion architecture with data cataloging, schema evolution, and Terraform templates.

→ Medallion architecture data lake
Real-time Processing

Real-Time Streaming Pipeline

Scalable streaming architecture using Kafka, PySpark Structured Streaming, and Delta Lake for e-commerce clickstream processing with exactly-once semantics.

→ Exactly-once streaming pipeline
🔁
ML Engineering

CI/CD Pipeline for ML

GitHub Actions workflows automating the full ML lifecycle: data validation, model training with performance gates, artifact storage, and multi-environment deployment.

→ Automated ML deployment pipeline
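
A performance gate is typically a small script that a CI job runs after training and that fails the job when metrics fall short. The sketch below shows one possible shape; the metric names, thresholds, and the `gate` function are assumptions, not the project's actual checks.

```python
def gate(candidate, baseline, min_values, max_regression=0.01):
    """Return a list of failure messages; an empty list means the gate passes.

    candidate / baseline: {metric_name: value} for the new and deployed models.
    min_values: absolute floor per metric.
    max_regression: largest tolerated drop relative to the baseline.
    """
    failures = []
    for name, floor in min_values.items():
        value = candidate[name]
        if value < floor:
            failures.append(f"{name}={value:.3f} is below the floor {floor:.3f}")
        elif baseline.get(name, value) - value > max_regression:
            failures.append(f"{name}={value:.3f} regressed against the baseline")
    return failures

# Hypothetical metrics from a training run.
candidate = {"f1": 0.82, "auc": 0.90}
baseline = {"f1": 0.85, "auc": 0.90}
failures = gate(candidate, baseline, min_values={"f1": 0.80, "auc": 0.88})
exit_code = 1 if failures else 0  # a CI step would pass this to sys.exit()
```

Because a non-zero exit code fails the workflow step, a later deployment job that depends on it would not run.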
🖥️
ML Infrastructure

ML Model Deployment Platform

MLOps pipeline using MLflow for experiment tracking and model registry, FastAPI serving with canary deployment, Kubernetes manifests, and Prometheus/Grafana monitoring.

→ Full MLOps platform with monitoring
👁️
ML Observability

ML Monitoring & Observability

Comprehensive ML monitoring with Evidently AI drift detection, CUSUM degradation detection, Prometheus metrics, Grafana dashboards, and automated drift reporting.

→ Full observability stack with alerting
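
CUSUM detects slow degradation by accumulating small shortfalls against a target until they cross a threshold. Below is a minimal one-sided sketch for a metric that should stay high, such as daily accuracy; the parameter values and sample series are assumptions chosen for illustration.

```python
def cusum_alarms(values, target, slack, threshold):
    """One-sided CUSUM for a downward shift.

    Returns the indices at which the cumulative shortfall exceeds `threshold`.
    """
    shortfall, alarms = 0.0, []
    for i, x in enumerate(values):
        # Only drops of more than `slack` below `target` accumulate.
        shortfall = max(0.0, shortfall + (target - x - slack))
        if shortfall > threshold:
            alarms.append(i)
            shortfall = 0.0  # restart accumulation after an alarm
    return alarms

# Hypothetical daily accuracy that drifts downward.
daily_accuracy = [0.92, 0.91, 0.92, 0.90, 0.88, 0.87, 0.86, 0.85]
alarms = cusum_alarms(daily_accuracy, target=0.91, slack=0.01, threshold=0.06)
```

No single day in this series looks alarming on its own, but the accumulated shortfall crosses the threshold on the seventh reading (index 6).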
🧮
Deep Learning Infrastructure

Distributed Training Framework

Multi-GPU training using PyTorch DDP and Horovod, with Ray Tune hyperparameter optimization, mixed precision training, and Weights & Biases experiment tracking.

→ Multi-GPU scaling benchmarks
🛡️
Insurance

Insurance Analytics Pipeline

End-to-end data science pipeline for insurance claim risk classification with Pandera validation, sklearn-compatible preprocessing, FastAPI inference, and Typer CLI.

→ Validated insurance risk classifier
📋
LLM Engineering

Prompt Evaluation Framework

Testing suite for LLM prompts with multi-model evaluation (GPT-4, Claude), A/B testing with statistical significance, cost optimization, and version control for prompts.

→ Multi-model prompt testing suite
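
One common way to check whether two prompt variants differ significantly in pass rate is a two-proportion z-test. The sketch below uses only the standard library; the pass counts are hypothetical, and the choice of this particular test is an assumption rather than a description of the framework's method.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two_sided_p_value) for the difference in pass rates B - A.

    Assumes the pooled pass rate is strictly between 0 and 1.
    """
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: prompt A passes 120 of 200 cases, prompt B passes 150 of 200.
z, p_value = two_proportion_z_test(120, 200, 150, 200)
is_significant = p_value < 0.05
```

The normal approximation is reasonable for a few hundred test cases per variant; with very small samples an exact test would be the safer choice.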
📜
Legal / Compliance

AI Contract Review System

Legal document analysis with PDF parsing, LLM-powered clause identification, per-clause risk scoring, GDPR compliance checking, and side-by-side contract comparison.

→ Automated clause risk analysis
💻
Developer Tools

Automated Code Documentation

LLM-powered tool that parses Python and JS/TS codebases via AST/tree-sitter and generates documentation (docstrings, READMEs, module docs) using the Claude API, with complexity analysis.

→ Auto-generated project documentation