Streaming Data
Continuously generated data that is processed and analyzed in real time or near-real time as it arrives, rather than being stored first and processed in batches at scheduled intervals.
Streaming data architectures process events as they flow through the system, enabling sub-second reactions to new information. User clicks, IoT sensor readings, transaction events, and log entries are examples of streaming data. Platforms like Apache Kafka, Apache Flink, Apache Spark Streaming, and AWS Kinesis provide the infrastructure for ingesting and processing these continuous data flows.
The key architectural difference from batch processing is that streaming systems process events individually or in micro-batches (milliseconds to seconds) rather than large batches (minutes to hours). This enables use cases that require immediacy: real-time fraud detection, live dashboards, instant personalization, and alerting on anomalous patterns.
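The micro-batching idea above can be sketched in a few lines. This is a simplified, size-based grouping (real engines like Spark Streaming typically cut batches on a time trigger rather than a count); the function name `micro_batches` is illustrative, not from any library:

```python
def micro_batches(events, batch_size=3):
    """Group a stream of events into small micro-batches.

    Simplified sketch: batches are cut by count rather than by a
    time trigger, but the flow mirrors micro-batch engines, which
    process small groups continuously instead of one large batch.
    """
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any trailing partial batch
        yield batch

stream = iter(range(7))
print(list(micro_batches(stream)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Because the input is consumed lazily as a generator, each micro-batch can be processed as soon as it is cut, without waiting for the stream to end, which is the property that distinguishes this flow from batch processing.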
For AI products, streaming data enables real-time feature computation for model inference. Instead of relying on features computed hours ago in a batch pipeline, streaming pipelines compute up-to-the-second features like "number of page views in the last 5 minutes" or "running average session duration." This freshness can significantly improve model accuracy for time-sensitive predictions like fraud detection and real-time recommendations.
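A streaming feature like "page views in the last 5 minutes" can be maintained with a trailing time window. The sketch below uses an in-memory deque; production feature stores distribute this state, but the eviction logic is the same. The class name `SlidingWindowCounter` and the timestamps are illustrative:

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events in a trailing time window,
    e.g. page views in the last 5 minutes (300 s)."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def add(self, ts):
        """Record an event occurring at timestamp ts (seconds)."""
        self.events.append(ts)
        self._evict(ts)

    def count(self, now):
        """Return the number of events in (now - window, now]."""
        self._evict(now)
        return len(self.events)

    def _evict(self, now):
        # Drop timestamps that have fallen outside the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()


counter = SlidingWindowCounter(window_seconds=300)
for t in [0, 60, 240, 290, 310]:
    counter.add(t)
print(counter.count(now=310))  # 4: the event at t=0 has aged out of the window
```

At inference time the model reads `counter.count(now)` directly, so the feature reflects activity seconds ago rather than the state of the last batch run.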
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
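The definition above corresponds to the dot product of the two vectors divided by the product of their magnitudes. A minimal sketch using only the standard library (embedding libraries typically provide an optimized version):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b.

    Returns a value in [-1, 1]: 1 for vectors pointing the same
    direction, 0 for orthogonal vectors, -1 for opposite vectors.
    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))  # 1.0 (same direction, magnitude ignored)
```

Note that scaling a vector does not change the result, which is why cosine similarity compares the direction of embeddings rather than their length.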
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.