Why a Lakehouse?
For my graduation thesis at VNU-HCM, I needed to handle two conflicting requirements:
1. Real-time dashboards: operations teams need sub-minute latency for anomaly alerts.
2. Historical analysis: analysts need months of clean, queryable data.
A traditional data warehouse handles (2) well but struggles with (1). A pure streaming system handles (1) but makes (2) painful. A lakehouse gives you both.
Architecture Overview
IoT Sensors → MQTT Broker → Kafka →
├── Stream Processor (Flink) → Real-time Dashboard
└── Batch Layer (Delta Lake) → Analytics / ML
The core idea: Kafka as the single source of truth. Everything downstream reads from it — the stream processor for live alerts, and the batch layer for historical storage.
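What makes this decoupling work is that each Kafka consumer tracks its own read offset, so the stream processor and the batch loader read the same log independently without interfering. A toy model of that idea (real deployments use Kafka consumer groups, not this class):

```python
class Log:
    """Append-only log where each consumer keeps its own offset,
    modeling how a real-time reader and a batch reader consume the
    same Kafka topic independently."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> next index to read

    def produce(self, record):
        self.records.append(record)

    def poll(self, consumer):
        """Return everything this consumer hasn't seen yet."""
        i = self.offsets.get(consumer, 0)
        batch = self.records[i:]
        self.offsets[consumer] = len(self.records)
        return batch

log = Log()
log.produce({"watts": 4200.0})
assert log.poll("flink") == [{"watts": 4200.0}]   # stream processor sees it...
assert log.poll("delta") == [{"watts": 4200.0}]   # ...and the batch layer does too
assert log.poll("flink") == []                    # offsets are per-consumer
```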
Key Decisions
Delta Lake over plain Parquet — ACID transactions matter when you have multiple writers. Concurrent writes to plain Parquet files can silently corrupt the dataset.
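Delta gets its ACID guarantee from an ordered transaction log: a commit atomically claims the next version file, so of two concurrent writers exactly one wins and the other must retry. A toy sketch of that optimistic-concurrency idea — not the real delta-rs or Spark API, just the put-if-absent mechanism it rests on:

```python
import os
import tempfile

def try_commit(log_dir: str, version: int, actions: str) -> bool:
    """Atomically claim log version N by creating N.json with O_EXCL.

    Mirrors the put-if-absent commit Delta Lake's transaction log
    relies on: only one writer can create a given log file, so
    concurrent commits never interleave silently -- the loser sees
    a conflict and retries at the next version.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another writer already committed this version
    with os.fdopen(fd, "w") as f:
        f.write(actions)
    return True

log_dir = tempfile.mkdtemp()
assert try_commit(log_dir, 0, '{"add": "part-0.parquet"}')      # writer A wins
assert not try_commit(log_dir, 0, '{"add": "part-1.parquet"}')  # writer B must retry
```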
Flink over Spark Streaming — true streaming rather than micro-batching. For solar monitoring, 30-second micro-batches would have been acceptable, but Flink's stateful processing made windowed aggregations much cleaner.
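A windowed aggregation boils down to bucketing events by event time. A minimal pure-Python sketch of a tumbling window — the core of what Flink's tumbling event-time windows compute, minus watermarks and state backends (function and field names are illustrative):

```python
from collections import defaultdict

def tumbling_avg(readings, window_s=60):
    """Average sensor power per tumbling event-time window.

    readings: iterable of (epoch_seconds, watts) pairs. Each event
    lands in exactly one window [k*window_s, (k+1)*window_s).
    Returns {window_start: average_watts}.
    """
    buckets = defaultdict(list)
    for ts, watts in readings:
        buckets[int(ts // window_s) * window_s].append(watts)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}

readings = [(0, 100.0), (30, 110.0), (65, 90.0)]
print(tumbling_avg(readings))  # {0: 105.0, 60: 90.0}
```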
Schema Registry — Learned this the hard way. Without enforcing schema at the Kafka producer level, one firmware update changed a field name and broke every downstream consumer.
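The fix is to reject malformed records at the producer, before anything reaches Kafka. In production that role is played by a schema registry with Avro compatibility checks; a minimal sketch of the producer-side guard (field names and types here are hypothetical, not from the thesis):

```python
# Hypothetical registered schema for a solar inverter reading.
EXPECTED_FIELDS = {"sensor_id": str, "timestamp": int, "watts": float}

def validate(record: dict) -> dict:
    """Reject records whose fields don't match the registered schema.

    A renamed field (e.g. firmware shipping 'power_w' instead of
    'watts') fails here, at the producer, instead of silently
    breaking every downstream consumer.
    """
    unknown = set(record) - set(EXPECTED_FIELDS)
    missing = set(EXPECTED_FIELDS) - set(record)
    if unknown or missing:
        raise ValueError(f"schema mismatch: unknown={unknown}, missing={missing}")
    for name, typ in EXPECTED_FIELDS.items():
        if not isinstance(record[name], typ):
            raise TypeError(f"{name}: expected {typ.__name__}")
    return record

validate({"sensor_id": "inv-01", "timestamp": 1700000000, "watts": 4200.0})  # passes
# validate({"sensor_id": "inv-01", "timestamp": 1700000000, "power_w": 4200.0})
# -> raises ValueError: the renamed field is caught at the producer
```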
Lessons
The hardest part wasn’t the tech — it was data quality at the edge. Sensors drop packets, send duplicate readings, and occasionally report physically impossible values. Build your ingestion layer to handle dirty data first, then worry about performance.
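Concretely, that means deduplication and sanity checks before anything else touches the data. A sketch of the kind of ingestion filter this implies — the thresholds and field names are illustrative, not taken from the thesis:

```python
def clean(readings, max_watts=10_000.0):
    """Drop duplicate and physically impossible sensor readings.

    readings: iterable of dicts with sensor_id, timestamp, watts.
    Dedup key is (sensor_id, timestamp); watts outside [0, max_watts]
    is discarded (the 10 kW cap is an illustrative inverter limit).
    """
    seen = set()
    for r in readings:
        key = (r["sensor_id"], r["timestamp"])
        if key in seen:
            continue  # duplicate packet
        if not (0.0 <= r["watts"] <= max_watts):
            continue  # physically impossible value
        seen.add(key)
        yield r

raw = [
    {"sensor_id": "inv-01", "timestamp": 1, "watts": 4200.0},
    {"sensor_id": "inv-01", "timestamp": 1, "watts": 4200.0},  # duplicate
    {"sensor_id": "inv-01", "timestamp": 2, "watts": -50.0},   # impossible
]
print(len(list(clean(raw))))  # 1
```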