What is Streaming Data?

BY TOOLS.FUN  ·  MARCH 28, 2026  ·  6 min read

Streaming data is data that is generated continuously and processed in real time or near-real time, as opposed to batch processing where data is collected over a period and processed all at once. Streaming enables immediate insights and actions — detecting fraud as it happens, updating dashboards in real time, and triggering automated responses to events.

Batch vs Streaming

Batch processing collects data over a time window (an hour, a day) and processes it all at once. It is simple, efficient, and well understood. Stream processing handles each event as it arrives, providing results with seconds or milliseconds of latency. The trade-off is complexity: streaming systems must handle out-of-order events, late data, exactly-once processing, and long-lived state management, problems that are absent or much easier to handle in batch. Use the Timestamp Converter to work with event timestamps in streaming pipelines.
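To make the contrast concrete, here is a minimal sketch in plain Python (no streaming library involved). The function names and the payment amounts are made up for illustration. The batch version waits for the full collection and answers once; the streaming version keeps a running state and emits an updated answer per event.

```python
from typing import Iterable, Iterator


def batch_total(events: list[float]) -> float:
    """Batch: the whole window is already collected, so compute once."""
    return sum(events)


def streaming_totals(events: Iterable[float]) -> Iterator[float]:
    """Streaming: update state per event and emit a fresh result each time."""
    running = 0.0  # state carried from one event to the next
    for amount in events:
        running += amount
        yield running


payments = [20.0, 5.0, 12.5]  # hypothetical payment amounts

batch_result = batch_total(payments)
stream_results = list(streaming_totals(payments))

print(batch_result)    # one answer, available after all data has arrived
print(stream_results)  # one answer per event, available immediately
```

Both approaches agree on the final total; the difference is when each intermediate answer becomes available and how much state the system has to keep alive in between.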

Apache Kafka

Kafka is the dominant streaming platform. Producers write events to topics, and consumers read from topics. Kafka stores events in a durable, distributed log that is ordered within each partition. Key concepts:

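Topics: Named streams of events.
Partitions: Topics are split into partitions for parallelism. Ordering is guaranteed only within a single partition.
Consumer Groups: Multiple consumers in a group share the work of reading a topic, with each partition assigned to one consumer in the group at a time.
Offsets: Each consumer group tracks its position (offset) in every partition it reads.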
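The concepts above can be modelled with a toy in-memory topic. This is only a sketch: the `crc32` hash is a stand-in for whatever partitioner a real Kafka client uses, the round-robin assignment is a simplification of real group rebalancing, and all names (`produce`, `assign`, `consumer-a`) are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3

# A topic modelled as a list of append-only partitions.
topic: list[list[tuple[str, str]]] = [[] for _ in range(NUM_PARTITIONS)]


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash: the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(key: str, value: str) -> tuple[int, int]:
    """Append an event and return its (partition, offset)."""
    p = partition_for(key)
    topic[p].append((key, value))
    return p, len(topic[p]) - 1


def assign(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions over the consumers of one group, round-robin."""
    assignment: dict[str, list[int]] = defaultdict(list)
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return dict(assignment)


first = produce("user-1", "login")
second = produce("user-1", "purchase")
group = assign(NUM_PARTITIONS, ["consumer-a", "consumer-b"])

print(first, second)  # same partition, consecutive offsets
print(group)
```

Because both events share the key `user-1`, they land in the same partition and keep their relative order, which is the usual reason for choosing a meaningful key.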

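Kafka retains events for a configurable period (or indefinitely, if retention limits are disabled), enabling consumers to replay history. Log compaction is a separate option that keeps at least the latest record for each key rather than the full history. Validate Kafka message payloads (typically JSON) with the JSON Formatter.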

Key point: Kafka is not just a message queue — it is a distributed commit log. Events are retained, not deleted after consumption. This enables multiple consumers to read the same data independently and replay events for recovery or reprocessing.
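A small sketch of why retention matters, using a plain Python list as a single-partition log. The `GroupReader` class and the group names are hypothetical and do not mirror any Kafka client API. Reading only advances a group's own offset, so a second group is unaffected and any group can rewind.

```python
log = ["OrderCreated", "ItemAdded", "PaymentProcessed"]  # one retained partition


class GroupReader:
    """Tracks one consumer group's offset into the shared log."""

    def __init__(self) -> None:
        self.offset = 0

    def poll(self, max_records: int = 1) -> list[str]:
        records = log[self.offset:self.offset + max_records]
        self.offset += len(records)  # reading moves the offset, not the data
        return records

    def seek(self, offset: int) -> None:
        self.offset = offset


billing = GroupReader()
analytics = GroupReader()

billing_seen = billing.poll(max_records=3)      # billing reads everything
analytics_seen = analytics.poll(max_records=1)  # analytics is still at the start

billing.seek(0)                                 # rewind to reprocess
replayed = billing.poll(max_records=3)

print(billing_seen, analytics_seen, replayed)
```

In a traditional queue the first read would typically remove the messages; here the log is unchanged after every read.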

Stream Processing Frameworks

Apache Flink is one of the most capable stream processing frameworks. It provides exactly-once state consistency, event-time windowing, complex event processing, and stateful computations. It is used at scale by Uber, Netflix, and Alibaba.

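Kafka Streams is a lightweight library (not a separate cluster) for stream processing within Java/Kotlin applications. It is simpler to deploy and operate than Flink, but it offers fewer features and is built around reading from and writing to Kafka topics.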

Apache Spark Structured Streaming provides stream processing with a batch-like API. It is a good fit for teams already using Spark for batch processing.

Event Time vs Processing Time

Event time is when the event actually occurred. Processing time is when the system processes it. These can differ significantly — an IoT sensor might generate an event at 3:00 PM but due to network delays, the system processes it at 3:05 PM. Stream processing frameworks must handle this distinction because windowing (grouping events into time buckets) based on processing time gives incorrect results when events arrive late.
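The following sketch replays the 3:00 PM / 3:05 PM example with 5-minute buckets. The three events and the `window_start` helper are invented for illustration. Each event carries both its event time and its processing time, and the same events are counted twice, once per clock.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 5-minute bucket."""
    return ts - (ts - EPOCH) % WINDOW


# (event_time, processing_time) pairs; the first event is delayed by 5 minutes.
events = [
    (datetime(2026, 3, 28, 15, 0), datetime(2026, 3, 28, 15, 5)),
    (datetime(2026, 3, 28, 15, 1), datetime(2026, 3, 28, 15, 1)),
    (datetime(2026, 3, 28, 15, 6), datetime(2026, 3, 28, 15, 6)),
]

by_event_time = Counter(window_start(e) for e, _ in events)
by_processing_time = Counter(window_start(p) for _, p in events)

for start in sorted(by_event_time):
    print(start.time(), by_event_time[start], by_processing_time[start])
```

Grouped by event time, the 15:00 bucket holds two events and the 15:05 bucket one. Grouped by processing time the counts flip, because the delayed event is attributed to the bucket in which it happened to arrive.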

Windowing

Windowing groups events into time-based buckets for aggregation. Types include: tumbling windows (fixed-size, non-overlapping: every 5 minutes), sliding windows (fixed-size, overlapping: 5-minute windows every 1 minute), session windows (gap-based: group events until there is a gap of inactivity). Windowing is essential for aggregations like "count events per minute" or "average value over the last hour." Use the Crontab Calculator to schedule companion batch jobs that reconcile streaming results.
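A minimal sketch of how each window type assigns events, using integer timestamps in seconds. The function names are illustrative rather than taken from any framework, and the choice to treat a gap exactly equal to the threshold as the same session is an assumption; real frameworks define that boundary in their own way.

```python
def tumbling(ts: int, size: int) -> tuple[int, int]:
    """The single fixed-size window [start, end) that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Every window [start, start + size) containing ts; starts are multiples of slide."""
    windows = []
    start = ts - ts % slide          # latest window start at or before ts
    while start + size > ts:         # walk back while the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def sessions(timestamps: list[int], gap: int) -> list[list[int]]:
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    result: list[list[int]] = []
    for ts in sorted(timestamps):
        if result and ts - result[-1][-1] <= gap:
            result[-1].append(ts)    # still within the current session
        else:
            result.append([ts])      # inactivity gap exceeded: new session
    return result


print(tumbling(400, size=300))
print(sliding(400, size=300, slide=60))
print(sessions([10, 20, 200, 215, 900], gap=60))
```

With 5-minute windows sliding every minute, each event belongs to five overlapping windows, which is why sliding aggregations cost more state than tumbling ones.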

Key point: Windowing with event time is the core challenge of stream processing. You must decide how long to wait for late events, which is usually expressed through a watermark (the system's estimate of how far event time has progressed), and what to do with events that arrive after the window closes (drop them, update the result, or trigger a correction).
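One simple way to picture this decision is a tumbling-window counter whose watermark trails the largest event time seen so far by a fixed delay. This is a sketch under that assumption, not how any particular framework implements watermarks; the `TumblingCounter` class, the `max_delay` parameter, and the sample timestamps are all hypothetical. Late events are only recorded here, which corresponds to the "drop them" option.

```python
from collections import defaultdict


class TumblingCounter:
    def __init__(self, size: int, max_delay: int) -> None:
        self.size = size
        self.max_delay = max_delay       # how long to wait for stragglers
        self.watermark = float("-inf")   # event time the stream is assumed to have reached
        self.open: dict[int, int] = defaultdict(int)  # window start -> running count
        self.closed: dict[int, int] = {}              # window start -> final count
        self.late: list[int] = []                     # events that missed their window

    def on_event(self, event_time: int) -> None:
        start = event_time - event_time % self.size
        if start + self.size <= self.watermark:
            self.late.append(event_time)  # window already finalized
        else:
            self.open[start] += 1
        # Advance the watermark, then finalize every window it has passed.
        self.watermark = max(self.watermark, event_time - self.max_delay)
        for s in sorted(self.open):
            if s + self.size <= self.watermark:
                self.closed[s] = self.open.pop(s)


counter = TumblingCounter(size=60, max_delay=10)
for t in [5, 50, 62, 58, 75, 40]:   # 58 is out of order but in time; 40 is too late
    counter.on_event(t)

print(counter.closed, dict(counter.open), counter.late)
```

The event at 58 arrives after 62 but is still counted, because the watermark (52 at that point) has not yet passed the end of its window. The event at 40 arrives after the watermark reached 65, so the first window is already closed. A larger `max_delay` would catch it at the cost of later results.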

Event Sourcing

Event sourcing stores every state change as an immutable event: OrderCreated, ItemAdded, PaymentProcessed. Current state is derived by replaying events. This provides a complete audit trail, enables time-travel debugging, and fits naturally with streaming architectures. The trade-off is complexity: you need projection logic, event versioning, and snapshotting for performance.

When to Use Streaming

Use streaming when freshness matters (fraud detection, real-time dashboards, operational alerts), when events must trigger immediate actions (stock trades, IoT responses), or when you need to process unbounded data sets. Do not use streaming when batch is sufficient, because streaming adds significant complexity. Most analytics use cases do not need sub-second freshness; hourly or daily batch processing is simpler and cheaper. Use the Code Diff tool to compare versions of a streaming event schema when the message format changes.

Key point: Start with batch processing and add streaming only where real-time is a genuine business requirement, not a nice-to-have. The complexity cost of streaming is high — justify it with a clear use case before investing in the infrastructure.