Morrisons Updates Data Infrastructure to Drive Real-Time Insights and Improve Customer Experience
Published: Tue, 22 Oct 2024

Morrisons, a leading UK-based supermarket chain, is modernizing its data infrastructure to support real-time insights and operational efficiency. By embracing advanced data integration capabilities, Morrisons is transitioning to a more agile, data-driven approach. This shift allows the company to optimize processes, enhance decision-making, and ultimately improve the overall customer experience across its stores and online platforms.

About Morrisons

Morrisons is one of the UK’s largest supermarket chains, with over 100 years of experience in the food retail industry. Proudly based in Yorkshire, it serves customers across the UK through a network of nearly 500 conveniently located supermarkets and various online home delivery channels. With a commitment to quality, Morrisons sources fresh produce directly from over 2,700 farmers and growers, ensuring customers receive the best products. Dedicated to sustainability and community engagement, Morrisons continually invests in innovative solutions to enhance operations and improve the shopping experience.

Challenge 

Morrisons set out to modernize its data infrastructure to achieve five key goals:

  1. Elevating Customer Experience: Creating a better shopping experience for customers.
  2. Loading to Google Cloud: Transitioning to Google Cloud and leveraging Looker for enhanced reporting capabilities.
  3. Accessing Real-Time Data: Shifting from batch processing to real-time data access, enabling faster decision-making and improved operational efficiency.
  4. Enhancing Picking Efficiency: Streamlining the online picking process by improving stock visibility across depots and warehouses. 
  5. Improving On-Shelf Availability: Ensuring products are consistently in stock and accessible to customers.

To meet these goals, the team needed to move away from their legacy Oracle Exadata data warehouse and strategically align on Google Cloud. This involved transitioning their data to Google BigQuery as the new centralized data warehouse, which required not only propagating data but also ensuring real-time access for better decision-making and operational efficiency. Moreover, prior to this transition, Morrisons had never had a centralized repository of real-time data; it had only ever received batch snapshots from its disparate systems.

“Retail is real-time. We have our online shop open 24/7, and we have products moving around our distribution network every minute of every day. It’s really important that we have a real-time view of how our business is operating,” shares Peter Laflin, Chief Data Officer at Morrisons. 

In order to accomplish this, Morrisons needed a tool that could connect their separate systems and seamlessly move data into Google Cloud. Striim was selected to ingest critical datasets, including the Retail Management System (RMS), which holds vast store transaction data and key reference tables, and the Warehouse Management Systems (WMS), which oversee operations across 14 distribution depots. The integration of these systems into BigQuery in real time provided critical visibility into product availability, stock levels, and core business metrics such as waste and shrinkage. Most importantly, Morrisons needed this mission-critical data delivered in real time. 

“We’ve moved from a world where we have batch-processing to a world where, within two minutes, we know what we sold and where we sold it,” shares Laflin. “That empowers senior leaders, colleagues in stores, colleagues across our logistics and manufacturing sites to understand where we are as a business right now. Real-time data is not a nice to have, real-time data is an absolute essential to run a business the scale and size of ours.” 

Morrisons sought to move away from their existing analytics suite and leverage Google Looker for their reporting and analytics needs. This meant they had to regenerate all existing reports that previously ran on the Exadata platform, aligning them with the new Google Cloud infrastructure. Striim played a critical role in centralizing their data in BigQuery and delivering it in real time, enabling Morrisons to power their reporting with fresh insights. This transformation is key to achieving their goal of a more agile, data-driven operation and supporting future business initiatives.

Solution 

Morrisons now leverages Striim to connect disparate systems and ingest critical datasets from their Oracle databases into Google Cloud, using BigQuery as their new centralized data warehouse. They required a solution that could seamlessly load data from multiple sources while providing real-time access through BigQuery, and Striim provides this. 

Striim plays a pivotal role in ingesting two core databases: the Retail Management System (RMS) and the Warehouse Management System (WMS). The RMS, a vast dataset containing store transaction tables and key reference data, requires efficient data transfer to minimize latency, and Striim ensures that this high volume of data is processed seamlessly.

Striim also ingests data from all 14 distribution depots, which are connected through 28 sources in the WMS. This integration provides real-time visibility into stock levels, enabling ‘live-pick’ decision-making by revealing what stock is available, where it is located, and at what time. Backed by real-time intelligence, this capability accelerates business processes that were previously reliant on periodic batch updates. As a result, Morrisons can optimize the replenishment process and ensure that shelves remain well-stocked, ultimately improving overall efficiency and increasing customer satisfaction.

Striim’s real-time data delivery powers Morrisons’ reporting transformation as they rebuild all reporting within Google Looker. By centralizing and accelerating the flow of data into BigQuery in real time, Striim enables faster, actionable insights that drive operational excellence and future business initiatives. “My team felt that Striim was the only tool that could deliver the requirements that we have,” shares Laflin.

Outcome 

By leveraging Striim to transition from batch processing to real-time data access, Morrisons has significantly enhanced their ability to track and manage three critical key performance indicators (KPIs): availability, waste, and shrinkage. With access to faster, real-time insights, executives can more effectively identify risks and implement strategies to mitigate them, ultimately leading to improved operational decision-making and better performance across the organization. This shift allows Morrisons to optimize their processes and drive positive outcomes related to these key metrics.

“Without Striim, we couldn’t create the real-time data that we then use to run the business,” shares Laflin. “It’s a very fundamental part of our architecture.” 

The move toward real-time data has shown Morrisons that their shelf availability has notably improved, ensuring that products are consistently in stock and accessible to customers. As a result, they are beginning to uncover the full range of benefits that this transformation can bring, including enhanced inventory management and reduced waste.

From the customer perspective, better shelf availability translates into happier shoppers, as they can find the products they want when they visit stores. This improvement not only fosters customer loyalty but also positions Morrisons to compete more effectively in the marketplace, ultimately driving growth and enhancing overall customer satisfaction.

Striim’s Multi-Node Deployments: Ensuring Scalability, High Availability, and Disaster Recovery
Published: Fri, 18 Oct 2024

In today’s enterprise landscape, ensuring high availability, scalability, and disaster recovery is paramount for businesses relying on continuous data flow and analytics. Striim, a leading platform for real-time data integration and streaming analytics, offers multi-node deployments that significantly enhance redundancy while delivering enterprise-grade capabilities for mission-critical workloads. This blog explores how Striim’s multi-node architecture supports these objectives, providing enterprises with a robust solution for high availability, scalability, and disaster recovery, both as a fully managed cloud service and as a platform deployed in your private cloud or on-premises environment.


Multi-Node Architecture: A Foundation for Enterprise Resilience

At the heart of Striim’s mission-critical platform is its multi-node architecture. Multi-node deployments allow Striim to operate across several interconnected servers or nodes, each handling data processing, streaming, and analytics in tandem. This distributed architecture introduces redundancy, ensuring that even if one node fails, other nodes can continue operations seamlessly. This approach is essential for disaster recovery, high availability, and fault tolerance.


1. Increasing Redundancy and Supporting Scalability

Redundancy is vital in distributed systems because it ensures that multiple copies of data and processing capabilities exist across nodes. Striim’s multi-node deployment increases redundancy by replicating workloads and data across several nodes. This means that in the event of a failure, another node can immediately take over, minimizing downtime and preventing data loss.

Additionally, Striim supports horizontal scalability. As data volumes grow—whether due to business expansion, increasing IoT devices, or heightened customer interactions—additional nodes can be added to the cluster to distribute the processing load. This ensures that the system can handle increasing demand without performance degradation, maintaining the ability to process millions of events per second across a distributed cluster.

2. High Availability Through Node Redundancy and Failover Mechanisms

For business-critical workloads, any downtime or data loss can have serious consequences. Striim addresses this concern by delivering high availability (HA) through node redundancy and automatic failover mechanisms. In a multi-node deployment, each node holds redundant copies of data and processing logic, ensuring that if one node fails, another can take over instantly without interrupting data flow.

Striim’s built-in failover automatically shifts workloads from a failed node to a functioning one, maintaining continuous service for real-time applications. This is critical for systems that demand high uptime, such as financial transactions, customer-facing dashboards, or logistics monitoring. Furthermore, Striim guarantees exactly-once processing, ensuring data integrity during node transitions and preventing duplicate or missed data events.

To provide a simple, declarative construct for node management and failover, Striim offers Deployment Groups; each represents one or more nodes with its own application and resource configurations. You can deploy Striim apps to a Deployment Group, and that group governs the runtime and resilience of the application. 
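The failover behavior described above can be sketched in miniature. The following toy Python model is an illustration only (the class, node names, and timeout are invented for this sketch, not Striim's actual API): a coordinator tracks node heartbeats and reassigns an application when its node's heartbeat goes stale.

```python
class Cluster:
    """Toy failover model: reassign an app when its node's heartbeat goes stale."""
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.heartbeats = {}   # node -> time of last heartbeat
        self.assignments = {}  # app -> node currently running it

    def heartbeat(self, node, now):
        self.heartbeats[node] = now

    def live_nodes(self, now):
        return [n for n, t in self.heartbeats.items() if now - t < self.timeout]

    def check_failover(self, now):
        live = self.live_nodes(now)
        for app, node in self.assignments.items():
            if node not in live and live:
                self.assignments[app] = live[0]  # move the app to a healthy node

cluster = Cluster(timeout=5.0)
cluster.heartbeat("node-1", now=0.0)
cluster.heartbeat("node-2", now=0.0)
cluster.assignments["orders-pipeline"] = "node-1"

# node-1 stops heartbeating; node-2 keeps reporting in
cluster.heartbeat("node-2", now=6.0)
cluster.check_failover(now=6.0)
print(cluster.assignments)  # → {'orders-pipeline': 'node-2'}
```

A real cluster would also replicate the app's state and checkpoint its position so the new node resumes where the failed one left off, which is what the next sections cover.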


3. Disaster Recovery with Multi-Region and Cross-Cloud Support

In addition to failover, Striim’s multi-node deployment enhances disaster recovery (DR) by replicating data and services across geographically distributed nodes or across clouds. Enterprises can configure active-active or active-passive DR setups to quickly recover from catastrophic failures. By distributing nodes across multiple regions or clouds, Striim ensures that if one region experiences an outage, another can take over seamlessly, ensuring business continuity.

Striim’s cross-cloud capabilities offer additional flexibility, allowing organizations to distribute their infrastructure across different cloud providers. This architecture ensures resilience even in the face of regional outages, ensuring rapid recovery and reducing the risk of data loss. Additionally, Striim’s Change Data Capture (CDC) ensures that data is continuously synchronized between nodes, keeping all data consistent and up-to-date across the entire system.

Integrating Multi-Node Capabilities with In-Memory Technology

To provide real-time data streaming and analytics efficiently, Striim relies heavily on in-memory technology. Striim’s architecture allows for data to be cached in an in-memory data grid, enabling rapid data access without the latency of disk I/O. However, ensuring all nodes can process this data without time-consuming remote calls requires a tightly integrated design.

Striim’s multi-node deployment ensures that all system components—data streaming, in-memory storage, and real-time analytics—operate in the same memory space. This eliminates the need for costly remote calls, allowing for rapid joins and analytics on streaming data. By leveraging in-memory processing across a distributed cluster, Striim ensures that the system remains both highly performant and scalable, even under high data loads.

Security Across Nodes and Clusters

As enterprises scale their data processing across multiple nodes and regions, maintaining security becomes increasingly important. Striim addresses this need by employing a holistic, role-based security model that spans the entire architecture. Whether it’s securing individual data streams, protecting sensitive data in motion, or managing access to management dashboards, Striim provides comprehensive security across all nodes and processes in both Striim Cloud and the on-premises Striim Platform.

This centralized approach to security simplifies the task of managing access controls, especially in distributed systems where data and processes are spread across multiple locations. Striim’s role-based model ensures that all security policies are consistently applied across the entire system, reducing the risk of vulnerabilities while maintaining compliance with industry regulations.

Conclusion: Simplifying Enterprise-Grade Data Streaming

Striim’s multi-node deployments provide enterprises with a powerful, scalable, and resilient platform for real-time data streaming and analytics. By increasing redundancy, ensuring high availability through failover mechanisms, and supporting disaster recovery with multi-region and cross-cloud configurations, Striim enables businesses to maintain continuous operations even in the face of unexpected failures.

With Striim, enterprises can focus on deriving insights from their data without the need to invest in complex infrastructures or develop intricate disaster recovery strategies. Striim’s platform takes care of the complexities of distributed processing, in-memory analytics, and security, ensuring that business-critical workloads run smoothly and efficiently at scale.

By offering a unified solution for real-time data integration and streaming analytics, Striim empowers businesses to meet the demands of today’s data-driven world while maintaining the resilience and agility necessary to thrive in a competitive environment.

 

What is a Data Pipeline (and 7 Must-Have Features of Modern Data Pipelines)
Published: Fri, 11 Oct 2024

A well-executed data pipeline can make or break your company’s ability to leverage real-time insights and stay competitive. Thriving in today’s world requires building modern data pipelines that make moving data and extracting valuable insights quick and simple.

Today, we’ll answer the question, “What is a data pipeline?” Then, we’ll explore a data pipeline example and dive deeper into the key differences between a traditional data pipeline and ETL. 

What is a Data Pipeline?

A data pipeline refers to a series of processes that transport data from one or more sources to a destination, such as a data warehouse, database, or application. These pipelines are essential for managing and optimizing the flow of data, ensuring it’s prepared and formatted for specific uses, such as analytics, reporting, or machine learning. 

Throughout the pipeline, data undergoes various transformations such as filtering, cleaning, aggregating, enriching, and even real-time analysis. These steps guarantee that data is accurate, reliable, and meaningful by the time it reaches its destination, making it possible for teams to generate insights and make data-driven decisions. 
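The transformations listed above can be illustrated with a small sketch. The Python below is a toy example, not Striim's API; the event fields and reference table are invented for illustration. It filters and cleans malformed records, enriches each event against reference data, and aggregates the result:

```python
from collections import defaultdict

# Hypothetical raw sales events; field names are illustrative only.
raw_events = [
    {"store": "leeds", "sku": "A1", "qty": "3", "price": 2.5},
    {"store": "leeds", "sku": "A1", "qty": "1", "price": 2.5},
    {"store": None,    "sku": "B2", "qty": "2", "price": 4.0},  # malformed: dropped
]

def clean(event):
    """Filtering + cleaning: drop malformed records, normalize types."""
    if not event.get("store") or not event.get("sku"):
        return None
    return {**event, "qty": int(event["qty"])}

def enrich(event, region_lookup):
    """Enrichment: join each event against reference data."""
    return {**event, "region": region_lookup.get(event["store"], "unknown")}

regions = {"leeds": "yorkshire"}

totals = defaultdict(float)  # aggregation: revenue per (region, sku)
for raw in raw_events:
    event = clean(raw)
    if event is None:
        continue
    event = enrich(event, regions)
    totals[(event["region"], event["sku"])] += event["qty"] * event["price"]

print(dict(totals))  # → {('yorkshire', 'A1'): 10.0}
```

In a streaming pipeline these same steps run continuously on each event as it arrives, rather than over a finished batch.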

In addition to the individual steps of a pipeline, data pipeline architecture refers to how the pipeline is designed to collect, flow, and deliver data effectively. This architecture can vary based on the needs of the organization and the type of data being processed. There are two primary approaches to moving data through a pipeline:

  • Batch processing: In batch processing, batches of data are moved from sources to targets on a one-time or regularly scheduled basis. Batch processing is the tried-and-true legacy approach to moving data, but it doesn’t allow for real-time analysis and insights, which is its primary shortcoming. 
  • Stream processing: Stream processing enables real-time data movement by continuously collecting and processing data as it flows, which is crucial for applications needing immediate insights like monitoring or fraud detection. Change Data Capture (CDC) plays a key role here by capturing and streaming only the changes (inserts, updates, deletes) in real time, ensuring efficient data handling and up-to-date information across systems. As a result, stream processing makes real-time business intelligence feasible. 
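The CDC idea in the stream-processing bullet above can be sketched concretely. This is a toy model, not Striim's event format: each change event carries an operation type (insert, update, delete), and applying the events in order keeps a target table synchronized with the source.

```python
# Hypothetical CDC change events; the event shape is invented for illustration.
change_stream = [
    {"op": "insert", "key": 1, "row": {"sku": "A1", "stock": 40}},
    {"op": "update", "key": 1, "row": {"sku": "A1", "stock": 35}},
    {"op": "insert", "key": 2, "row": {"sku": "B2", "stock": 12}},
    {"op": "delete", "key": 2, "row": None},
]

target = {}  # stand-in for the downstream table kept in sync

def apply_change(table, event):
    """Apply one change event so the target mirrors the source."""
    if event["op"] in ("insert", "update"):
        table[event["key"]] = event["row"]
    elif event["op"] == "delete":
        table.pop(event["key"], None)

for event in change_stream:
    apply_change(target, event)

print(target)  # → {1: {'sku': 'A1', 'stock': 35}}
```

Because only the changes move, the target stays current without re-copying the whole source table on a schedule, which is exactly the advantage over batch snapshots.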

Why are Data Pipelines Significant? 

Now that we’ve answered the question, “What is a data pipeline?”, we can dive deeper into the essential role pipelines play. Data pipelines are significant to businesses because they: 

  • Consolidate Data: Data pipelines are responsible for integrating and unifying data from diverse sources and formats, making it consistent and usable for analytics and business intelligence.
  • Enhance Accessibility: Thanks to data pipelines, you can provide team members with necessary data without granting direct access to sensitive production systems.
  • Support Decision-Making: When you ensure that clean, integrated data is readily available, you facilitate informed decision-making and boost operational efficiency.

What is a Data Pipeline Example? 

As you’ll see by taking a look at this data pipeline example, the complexity and design of a pipeline vary depending on its intended use. For instance, Macy’s streams change data from on-premises databases to Google Cloud. As a result, customers enjoy a unified experience whether they’re shopping in a brick-and-mortar store or online. 

[Figure: Data architecture for Macy’s]
Another excellent data pipeline example is American Airlines’ work with Striim. Striim supported American Airlines by implementing a comprehensive data pipeline solution to modernize and accelerate operations. 

To achieve this, the TechOps team implemented a real-time data hub using MongoDB, Striim, Azure, and Databricks to maintain seamless, large-scale operations. This setup uses change data capture from MongoDB to capture operational data in real time, then processes and models it for downstream systems. The data is streamed in real time to end users, delivering valuable insights to TechOps and business teams, allowing them to monitor and act on operational data to enhance the customer travel experience.

This data pipeline diagram illustrates how it works: 

[Figure: American Airlines real-time data pipeline diagram]

Data Pipeline vs ETL: What’s the Difference? 

You’re likely familiar with the term “ETL data pipeline” and may be curious about the difference between a traditional data pipeline and ETL. In actuality, ETL pipelines are simply one form of data pipeline. To understand an ETL data pipeline fully, it’s imperative to understand the process it entails. 

ETL stands for Extract, Transform, Load. This process involves: 

  • Extraction: Data is extracted from a source or multiple sources. 
  • Transformation: Data is processed and converted into the appropriate format for the target destination — often a data warehouse or lake. 
  • Loading: The loading phase involves transferring the transformed data into the target system where your team can access it for analysis. It’s now usable for various use cases, including for reporting, insights, and decision-making.

A traditional ETL data pipeline typically involves disk-based processing, which can lead to slower transformation times. This approach is suitable for batch processing where data is processed at scheduled intervals, but may not meet the needs of real-time data demands.

While legacy ETL has a slow transformation step, modern ETL platforms, like Striim, have evolved to replace disk-based processing with in-memory processing. This advancement allows for real-time data transformation, enrichment, and analysis, providing faster and more efficient data processing. Striim, for example, handles data in near real-time, enabling quicker insights and more agile decision-making.

Now, let’s dive into the seven must-have features of modern data pipelines. 

7 Must-Have Features of Modern Data Pipelines 

To create an effective modern data pipeline, incorporating these seven key features is essential. Though not an exhaustive list, these elements are crucial for helping your team make faster and more informed business decisions.

1. Real-Time Data Processing and Analytics

The number one requirement of a successful data pipeline is its ability to load, transform, and analyze data in near real time. This enables businesses to quickly act on insights. To begin, it’s essential that data is ingested without delay from multiple sources, which may include databases, IoT devices, messaging systems, and log files. For databases, log-based Change Data Capture (CDC) is the gold standard for producing a stream of real-time data.

Real-time, continuous data processing is superior to batch-based processing because the latter takes hours or even days to extract and transfer information. Because of this significant processing delay, businesses are unable to make timely decisions, as data is outdated by the time it’s finally transferred to the target. This can result in major consequences. For example, a lucrative social media trend may rise, peak, and fade before a company can spot it, or a security threat might be spotted too late, allowing malicious actors to execute on their plans. 

Real-time data pipelines equip business leaders with the knowledge necessary to make data-fueled decisions. Whether you’re in the healthcare industry or logistics, being data-driven is equally important. Here’s an example: Suppose your fleet management business uses batch processing to analyze vehicle data. The delay between data collection and processing means you only see updates every few hours, leading to slow responses to issues like engine failures or route inefficiencies. With real-time data processing, you can monitor vehicle performance and receive instant alerts, allowing for immediate action and improving overall fleet efficiency.

2. Scalable Cloud-Based Architecture

Modern data pipelines rely on scalable, cloud-based architecture to handle varying workloads efficiently. Unlike traditional pipelines, which struggle with parallel processing and fixed resources, cloud-based pipelines leverage the flexibility of the cloud to automatically scale compute and storage resources up or down based on demand.

In this architecture, compute resources are distributed across independent clusters, which can grow both in number and size quickly and infinitely while maintaining access to a shared dataset. This setup allows for predictable data processing times as additional resources can be provisioned instantly to accommodate spikes in data volume.

Cloud-based data pipelines offer agility and elasticity, enabling businesses to adapt to trends without extensive planning. For example, a company anticipating a summer sales surge can rapidly increase processing power to handle the increased data load, ensuring timely insights and operational efficiency. Without such elasticity, businesses would struggle to respond swiftly to changing trends and data demands.

3. Fault-Tolerant Architecture

It’s possible for data pipeline failure to occur while information is in transit. Thankfully, modern pipelines are designed to mitigate these risks and ensure high reliability. Today’s data pipelines feature a distributed architecture that offers immediate failover and robust alerts for node, application, and service failures. Because of this, we consider fault-tolerant architecture a must-have. 

[Figure: Fault-tolerant data architecture]

In a fault-tolerant setup, if one node fails, another node within the cluster seamlessly takes over, ensuring continuous operation without major disruptions. This distributed approach enhances the overall reliability and availability of data pipelines, minimizing the impact on mission-critical processes.

4. Exactly-Once Processing (E1P)

Data loss and duplication are critical issues in data pipelines that need to be addressed for reliable data processing. Modern pipelines incorporate Exactly-Once Processing (E1P) to ensure data integrity. This involves advanced checkpointing mechanisms that precisely track the status of events as they move through the pipeline.

Checkpointing records the processing progress and coordinates with data replay features from many data sources, enabling the pipeline to rewind and resume from the correct point in case of failures. For sources without native data replay capabilities, persistent messaging systems within the pipeline facilitate data replay and checkpointing, ensuring each event is processed exactly once. This technical approach is essential for maintaining data consistency and accuracy across the pipeline.
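The checkpoint-and-replay mechanism described above can be reduced to a small sketch. This toy Python loop is an illustration under simplified assumptions (an offset-addressable source and an atomic apply-and-commit step), not Striim's implementation: after a crash, processing resumes from the last committed checkpoint, so every event affects the target exactly once.

```python
# Source that supports replay from an offset (e.g. a log or persistent queue).
source = [("e1", 10), ("e2", 20), ("e3", 30), ("e4", 40)]

def run(checkpoint, target, fail_after=None):
    """Process events starting at `checkpoint`; return the last committed offset."""
    for offset in range(checkpoint, len(source)):
        if fail_after is not None and offset == fail_after:
            return checkpoint  # crash before committing this event
        name, value = source[offset]
        target[name] = value     # apply the event downstream...
        checkpoint = offset + 1  # ...and commit the checkpoint with it
    return checkpoint

target = {}
ckpt = run(0, target, fail_after=2)  # crash while processing e3
ckpt = run(ckpt, target)             # restart: replay resumes from offset 2
print(target)  # → {'e1': 10, 'e2': 20, 'e3': 30, 'e4': 40}
```

The key invariant is that the apply and the checkpoint advance together: an event is either fully applied and committed, or neither, so replay after a failure never duplicates or drops an event.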

5. Self-Service Management 

Modern data pipelines facilitate seamless integration between a wide range of tools, including data integration platforms, data warehouses, data lakes, and programming languages. This interconnected approach enables teams to create, manage, and automate data pipelines with ease and minimal intervention.

In contrast, traditional data pipelines often require significant manual effort to integrate various external tools for data ingestion, transfer, and analysis. This complexity can lead to bottlenecks when building the pipelines, as well as extended maintenance time. Additionally, legacy systems frequently struggle with diverse data types, such as structured, semi-structured, and unstructured data.

Contemporary pipelines simplify data management by supporting a wide array of data formats and automating many processes. This reduces the need for extensive in-house resources and enables businesses to more effectively leverage data with less effort.

6. Capable of Processing High Volumes of Data in Various Formats 

It’s predicted that the world will generate 181 zettabytes of data by 2025. To get a better understanding of how tremendous that is, consider this — one zettabyte alone is equal to about 1 trillion gigabytes. 

Since unstructured and semi-structured data account for 80% of the data collected by companies, modern data pipelines need to be capable of efficiently processing these diverse data types. This includes handling semi-structured formats such as JSON, HTML, and XML, as well as unstructured data like log files, sensor data, and weather data.

A robust big data pipeline must be adept at moving and unifying data from various sources, including applications, sensors, databases, and log files. The pipeline should support near-real-time processing, which involves standardizing, cleaning, enriching, filtering, and aggregating data. This ensures that disparate data sources are integrated and transformed into a cohesive format for accurate analysis and actionable insights.
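Unifying semi-structured formats into one schema, as described above, can be sketched with Python's standard library. The payloads and field names below are invented for illustration; the point is that JSON and XML inputs are parsed into identically shaped records before loading downstream.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative payloads; the sensor fields are assumptions for this sketch.
json_payload = '{"sensor": "dock-7", "temp_c": 4.2}'
xml_payload = "<reading><sensor>dock-9</sensor><temp_c>3.1</temp_c></reading>"

def from_json(payload):
    """Parse a JSON payload into the common record shape."""
    return json.loads(payload)

def from_xml(payload):
    """Parse an XML payload into the same record shape."""
    root = ET.fromstring(payload)
    return {"sensor": root.findtext("sensor"),
            "temp_c": float(root.findtext("temp_c"))}

# Both formats normalize to one schema for downstream analysis.
records = [from_json(json_payload), from_xml(xml_payload)]
print(records)
```

Once normalized, the cleaning, enrichment, and aggregation steps described earlier can treat every record identically, regardless of its original format.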

7. Prioritizes Efficient Data Pipeline Development 

Modern data pipelines are crafted with DataOps principles, which integrate diverse technologies and processes to accelerate development and delivery cycles. DataOps focuses on automating the entire lifecycle of data pipelines, ensuring timely data delivery to stakeholders.

By streamlining pipeline development and deployment, organizations can more easily adapt to new data sources and scale their pipelines as needed. Testing becomes more straightforward as pipelines are developed in the cloud, allowing engineers to quickly create test scenarios that mirror existing environments. This allows thorough testing and adjustments before final deployment, optimizing the efficiency of data pipeline development.

Gain a Competitive Edge with Striim 

Data pipelines are crucial for moving, transforming, and storing data, helping organizations gain key insights. Modernizing these pipelines is essential to handle increasing data complexity and size, ultimately enabling faster and better decision-making.

Striim provides a robust streaming data pipeline solution with integration across hundreds of sources and targets, including databases, message queues, log files, data lakes, and IoT. Plus, our platform features scalable in-memory streaming SQL for real-time data processing and analysis. Schedule a demo for a personalized walkthrough to experience Striim.

Unlocking Actionable Insights: Morrisons’ Digital Transformation with Striim and Google Cloud
Published: Thu, 03 Oct 2024

In the fast-paced world of retail, the ability to harness data effectively is crucial for staying ahead. On September 18, 2024, at Big Data London, Morrisons shared its digital transformation journey through the presentation, “Learn How Morrisons is Accelerating the Availability of Actionable Data at Scale with Google and Striim.”

Peter Laflin, Chief Data Officer at Morrisons, outlined the supermarket chain’s strategic partnership with Striim, a global leader in real-time data integration and streaming, and Google Cloud. This collaboration is pivotal in optimizing Morrisons’ supply chain, improving stock management, and enhancing customer satisfaction through the power of real-time data analytics.

By harnessing Striim’s advanced data platform alongside Google Cloud’s robust infrastructure, Morrisons has effectively integrated and streamlined data from its vast network of over 2,700 farmers and growers supplying raw materials to its manufacturing plants across the UK. This initiative has enabled seamless information flow and real-time visibility across its operations, allowing the supermarket to make quicker, data-driven decisions that directly impact customer experience. Tata Consultancy Services (TCS), Morrisons’ long-standing systems integration partner, has been instrumental in the success of this transformation. TCS worked closely with Morrisons’ teams to ensure the seamless implementation of Striim’s platform, facilitating smooth integration and alignment across operations.

The keynote featured insights from industry experts, including John Kutay, Head of Products at Striim, and Mike Reed, Retail Account Executive at Google, who underscored the transformative impact of innovative data strategies in the retail sector.

As Morrisons continues to embrace this data-driven approach, it sets a new standard for enhancing customer satisfaction and operational efficiency in the competitive retail environment.

Check out the Recap: 

]]>
https://www.striim.com/blog/morrisons-digital-transformation-striim-google/feed/ 0
Training and Calling SGDClassifier with Striim for Financial Fraud Detection https://www.striim.com/blog/training-and-calling-sgdclassifier-with-striim-for-financial-fraud-detection/ https://www.striim.com/blog/training-and-calling-sgdclassifier-with-striim-for-financial-fraud-detection/#respond Thu, 26 Sep 2024 13:42:58 +0000 https://www.striim.com/?p=79188

In today’s fast-paced financial landscape, detecting transaction fraud is essential for protecting institutions and their customers. This article explores how to leverage Striim and SGDClassifier to create a robust fraud detection system that utilizes real-time data streaming and machine learning.

Problem

Transaction fraud detection is a critical responsibility for the IT teams of financial institutions. According to the 2024 Global Financial Crime Report from Nasdaq, an estimated $485.6 billion was lost to fraud scams and bank fraud schemes globally in 2023. 

AI and ML help detect fraud, while real-time streaming frameworks like Striim play a key role in delivering financial data to reference and train classification models, enhancing customer protection.

Solution 

In this article, I will demonstrate how to use Striim to perform key tasks for fraud detection with machine learning: 

  • Ingest data in real time using a Change Data Capture (CDC) reader, call the model, and deliver alerts to a target such as email, Slack, Teams, or any other target Striim supports
  • Train the model using a Striim initial-load app, and automatically retrain it via REST APIs if its accuracy score decreases

Fraud Detection Approach

In typical credit card transactions, a financial institution’s data science team uses supervised learning to label data records as either fraudulent or legitimate. By carefully analyzing the data, engineers can extract key features that define a fraudulent user profile and behavior, such as personal information, number of orders, order content, payment history, geolocation, and network activity. 

For this example, I’m using a dataset from Kaggle, which contains credit card transactions collected from EU retailers approximately 10 years ago. The dataset is already labeled with two classes representing fraudulent and normal transactions. Although the dataset is imbalanced, it serves well for this demonstration. Key fields include purchase value, age, browser type, source, and the class parameter, which indicates normal versus fraudulent transactions.

Picking a Classification Model 

There are many possibilities for classification using ML. In this example, I evaluated logistic regression and SGDClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html. The main difference lies in how the model is fit: SGDClassifier optimizes a linear model with stochastic gradient descent, processing training samples incrementally, whereas scikit-learn's LogisticRegression fits the logistic model with batch solvers. SGD is widely considered the more scalable approach for larger datasets, which is why it was selected for this application.

Accuracy Measurement

The accuracy score is a metric that measures how often a model correctly predicts the desired outcome. It is calculated by dividing the total number of correct predictions by the total number of predictions. In an ideal scenario, the best possible accuracy is 100% (or 1). However, due to the challenges of obtaining and diagnosing a high-quality dataset, data scientists typically aim for an accuracy greater than 90% (or 0.9).
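As a quick worked example of that formula, consider a batch of ten labeled transactions (the labels here are made up for illustration):

```python
# Accuracy = correct predictions / total predictions.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]   # actual labels (1 = fraud)
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # model predictions (one miss)
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.9 -- right at the threshold data scientists aim for
```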

Training Step 

Striim provides the ability to read historical data from various sources, including databases, messaging systems, files, and more. In this case, the historical data is stored in a MySQL database, a highly popular data source in the FinTech industry. Here's what the architecture looks like with real-time data streaming augmented by ML model training:

 

You can achieve this in Striim with an Initial Load application that has a Database reader pointed at the transactions table in MySQL and a file target. With Striim’s flexible adapters, data can be loaded from virtually any database of choice into a local file system, ADLS, S3, or GCS. 

Once the data load is completed, the application changes its status from RUNNING to COMPLETED. A script, or in this case a PS-made Open Processor (OP), can capture the status change and call the training Python script. 

Additionally, I added a step with a CQ (Continuous Query) that allows data scientists to apply any transformations needed to prepare the data in a form suitable for the training process. This step can be easily implemented using Striim’s Flow Designer, which features a drag-and-drop interface along with the ability to code data modifications using a combination of a SQL-like language and utility function calls.

Training and Calling SGDClassifier with Striim for Financial Fraud Detection

Model Reference Step 

Once the model is trained, we can deploy it in a real-time data CDC application that streams user financial transactions from an operational database. The application calls the model’s predict method, and if fraud is detected, it generates and sends an alert. Additionally, it will check the model accuracy and, if needed, initiate the retraining step described above. 

Training and Calling SGDClassifier with Striim for Financial Fraud Detection

Model Reference App Structure 

The flow begins with Striim’s CDC reader, which streams financial transactions directly from the database binary log. It then invokes the classification model trained in the previous step via a REST call. In this case, I am using an OP that executes REST POST calls containing the parsed transaction values needed for predictions. The model service returns the prediction, which is parsed by a query. If fraud is detected, an alert is generated. At the same time, if the model accuracy dips below 90 percent, the Application Manager function can restart the training application, ILMySqlApp, using an internal management REST API. 

Final Thoughts on Leveraging SGDClassifier and Striim for Financial Fraud Detection

This example illustrates how a real-world data streaming application can detect fraud by interacting with a classification model. The application sends alerts when fraud is detected using various Striim alert adapters, including email, web, Slack, or database. Furthermore, if the model’s quality deteriorates, it can retrain the model for further evaluation.

For reference, the TQL sources:

				
CREATE OR REPLACE APPLICATION FraudDetectionApp;

CREATE OR REPLACE SOURCE TransactionsReader USING Global.MysqlReader ()
OUTPUT TO transactionsStream;

CREATE STREAM sgdOutput OF Global.JsonNodeEvent;

CREATE STREAM FraudAlertStream OF Global.AlertEvent;

CREATE CQ checkPrediction
INSERT INTO predStream
SELECT data.get("prediction").toString() as pred
FROM sgdOutput s;

CREATE OR REPLACE CQ checkModelAccuracy
INSERT INTO accuracyStream
SELECT data.get("accuracy").toString() as acc
FROM sgdOutput s;

CREATE OR REPLACE OPEN PROCESSOR CallSGDClassifier USING Global.RestCallerPOST ()
INSERT INTO sgdOutput
FROM transactionsStream;

CREATE SUBSCRIPTION AlertAdapter USING Global.WebAlertAdapter (
isSubscription: 'true' )
INPUT FROM FraudAlertStream;

CREATE OR REPLACE CQ generateFraudAlert
INSERT INTO FraudAlertStream
SELECT "Company XYZ", "Value", "warning", "raise", "fraud prediction alert on CC transaction"
FROM predStream p WHERE pred = "1.0";

CREATE OR REPLACE CQ CallTraining
INSERT INTO callOutput
SELECT com.striim.udf.app.ApplicationManager.startApplication("admin.ILMySqlApp")
FROM accuracyStream a
WHERE TO_FLOAT(acc) < 0.9;

END APPLICATION FraudDetectionApp;
				
			
				
CREATE OR REPLACE APPLICATION ILMySqlApp;

CREATE SOURCE ProcessorToStartTrainingStep USING Global.PrePostProcess ()
OUTPUT TO m;

CREATE OR REPLACE SOURCE MySqlInitLoad USING Global.DatabaseReader ()
OUTPUT TO myLoadOut;

CREATE CQ MyTransformationQuery
INSERT INTO myFileOutput
SELECT
to_string(data[0]) as age,
dnow() as curtime,
to_string(data[2]) as sourceOfdata,
to_string(data[0]) as browserType,
to_string(data[3]) as purchaseValue,
to_string(data[4]) as FraudClass….
FROM myLoadOut m;

CREATE TARGET TrainFileTarget USING Global.FileWriter ()
INPUT FROM myFileOutput;

END APPLICATION ILMySqlApp;
				
			
]]>
https://www.striim.com/blog/training-and-calling-sgdclassifier-with-striim-for-financial-fraud-detection/feed/ 0
A Guide to Data Pipelines (And How to Design One From Scratch) https://www.striim.com/blog/guide-to-data-pipelines/ https://www.striim.com/blog/guide-to-data-pipelines/#respond Wed, 11 Sep 2024 18:41:30 +0000 https://www.striim.com/?p=44203

Data pipelines are the backbone of your business’s data architecture. Implementing a robust and scalable pipeline ensures you can effectively manage, analyze, and organize your growing data. Most importantly, these pipelines enable your team to transform data into actionable insights, demonstrating tangible business value.

According to an IBM study, businesses expect that fast data will enable them to “make better informed decisions using insights from analytics (44%), improved data quality and consistency (39%), increased revenue (39%), and reduced operational costs (39%).” With data volumes and sources rapidly increasing, optimizing how you collect, transform, and extract data is more crucial than ever for staying competitive. That’s where real-time data and stream processing can help.

In this guide, we’ll dive into everything you need to know about data pipelines—whether you’re just getting started or looking to optimize your existing setup. We’ll answer the question, “What are data pipelines?” Then, we’ll dive deeper into how to build data pipelines and why it’s imperative to make your data pipelines work for you. 

What are Data Pipelines? 

A data pipeline is a systematic sequence of components designed to automate the extraction, organization, transfer, transformation, and processing of data from one or more sources to a designated destination. Dmitriy Rudakov, Director of Solutions Architecture at Striim, describes it as “a program that moves data from source to destination and provides transformations when data is inflight.” 

Benjamin Kennady, Cloud Solutions Architect at Striim, emphasizes the outcome-driven nature of data pipelines. “A data pipeline can be thought of as the flow of logic that results in an organization being able to answer a specific question or questions on that data,” he shares. “This question could be displayed in a dashboard for decision makers or just be a piece of the required puzzle to answer a larger question.” 

Because of this, data pipelines are vital when data is stored in formats or locations that hinder straightforward analysis. As Kennady notes, “The reason a pipeline must be used in many cases is because the data is stored in a format or location that does not allow the question to be answered.” The pipeline transforms the data during transfer, making it actionable and enabling your organization to answer critical questions.

AI and Data Pipelines

Another essential function of data pipelines is integrating artificial intelligence (AI) into organizational processes, enabling the seamless flow of data that powers AI-driven insights. Because AI models require vast amounts of data to learn, adapt, and make predictions, the efficiency and robustness of data pipelines directly impact the quality of your organization’s AI outcomes. 

A well-designed data pipeline ensures that data is not only transferred from source to destination but also properly cleaned, enriched, and transformed to meet the specific needs of AI algorithms.

Why are data pipelines important? 

Without well-engineered, scalable, and robust data pipelines, your organization risks accumulating large volumes of data in scattered locations, making it difficult to process or analyze effectively. Instead of being a valuable resource, this data becomes a bottleneck, hindering your ability to innovate and grow.

Kennady adds, “The capability of a company to make the best decisions is partly dictated by its data pipeline. The more accurate and timely the data pipelines are set up allows an organization to more quickly and accurately make the right decisions.” 

Data Pipeline Use Cases

Data pipelines are integral to virtually every industry today, serving a wide range of functions from straightforward data transfers to complex transformations required for advanced machine learning applications. Whether it’s moving data from a source to a destination or preparing it for sophisticated recommendation engines, data pipelines are the backbone of modern data architectures.

Some use cases where building data pipelines is crucial include:

  • Processing and storing transaction data to power reporting and analytics to enhance business products and services
  • Consolidating data from multiple sources (SaaS tools, databases) to a big data store (data warehouses, data lakes) to provide a single source of truth for the organization’s data
  • Improving overall backend system performance by migrating data to large data stores, reducing the load on operational databases
  • Ensuring data quality, reliability, and consistency for faster data access across business units

What are Six Key Data Pipeline Components? 

Understanding the essential components of data pipelines is crucial for designing efficient and effective data architectures. These components work in tandem to ensure data is accurately ingested, transformed, and delivered, supporting everything from real-time analytics to machine learning applications. Here are six key components that are fundamental to building and maintaining an effective data pipeline.

What are Six Key Data Pipeline Components?

Data Sources

The first component of a modern data pipeline is the data source, which is the origin of the data your business leverages. This can include any system or application that generates or collects data, such as:

  • Behavioral Data: User behavior data that provides insights into how customers interact with your products or services.
  • Transactional Data: Sales and product records that capture critical business transactions and operations.
  • Third-Party Data: External data sources that your company does not collect directly but integrates to enhance insights or support decision-making.

These data sources serve as the starting point for the pipeline, providing the raw data that will be ingested, processed, and analyzed.

Data Collection/Ingestion 

The next component in the data pipeline is the ingestion layer, which is responsible for collecting and bringing data into the pipeline. This critical step leverages data ingestion tools to interface with diverse data sources, both internal and external, using various protocols and formats.

The ingestion layer supports multiple data types and formats, including:

  • Batch Data: Data collected and processed in discrete chunks, typically from static sources such as databases or logs. Historically, batch processing was sufficient for many use cases. However, in today’s fast-paced environment, where real-time insights are crucial, batch data can become outdated by the time it is processed. This delay limits the ability to respond to immediate business needs or emerging trends. 
  • Streaming Data: Real-time data that continuously flows from sources such as IoT devices, sensors, or live transaction feeds. This data requires immediate processing to provide up-to-the-minute insights and enable timely decision-making, making it the ideal choice for modern businesses. 

“Data pipelines can be thought of as two different types: batch loading and continuous replication,” says Kennady. “Continuous replication via CDC is an event driven architecture. This is a more efficient data pipeline methodology because it only gets triggered when there is a change to the source.” 
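To illustrate the event-driven pattern Kennady describes, here is a toy sketch (not Striim code) that replays insert/update/delete change events against a dictionary standing in for the target; only changes are processed, and the source never needs an "updated at" or "deleted at" column.

```python
# Toy illustration of event-driven CDC replication: each change event
# triggers an apply, rather than re-reading the whole source table.
def apply_change(target: dict, event: dict) -> None:
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target[key] = event["row"]
    elif op == "delete":
        target.pop(key, None)  # no "deleted_at" column needed on the source

replica = {}
change_log = [
    {"op": "insert", "key": 1, "row": {"sku": "A", "qty": 3}},
    {"op": "update", "key": 1, "row": {"sku": "A", "qty": 5}},
    {"op": "insert", "key": 2, "row": {"sku": "B", "qty": 1}},
    {"op": "delete", "key": 2, "row": None},
]
for ev in change_log:
    apply_change(replica, ev)

print(replica)  # the replica mirrors the source after replaying the log
```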

Striim, for instance, facilitates the seamless integration of real-time streaming data from various sources, ensuring that it is continuously captured and delivered to big data storage targets. This real-time capability is essential in today’s environment, where immediate insights and rapid response are crucial for staying competitive and making timely decisions. Striim also supports real-time data replication and real-time analytics, both of which help your organization maintain up-to-date insights. By efficiently handling data ingestion, this component sets the stage for effective data processing and analysis. 

Data Processing 

That brings us to our next step: Data processing. The processing layer is responsible for transforming data into a consumable state through various operations, including validation, clean-up, normalization, transformation, and enrichment. The approach to this processing depends on the data pipeline architecture, specifically whether it employs ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes.

In an ETL-based architecture, data is first extracted from source systems, then transformed into a structured format, and finally loaded into data stores, typically data warehouses. This method is advantageous when dealing with structured data that requires pre-processing before storage.

Conversely, in an ELT-based architecture, data is initially loaded into storage systems such as data lakes in its raw form. Transformation occurs post-loading, allowing for flexible and scalable processing. This approach is beneficial for handling large volumes of diverse data types and enables on-demand transformation to meet various business use cases.

Both ETL and ELT architectures serve distinct needs, and the choice between them depends on the organization’s specific requirements for data storage, processing efficiency, and analytical flexibility.
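The difference in ordering can be sketched with a toy example, using plain Python collections as stand-ins for the source system, warehouse, and lake; the transform and field names are purely illustrative.

```python
# ETL vs ELT: the same transform, applied at different points.
source = [{"name": " Ada ", "amount": "100"},
          {"name": "Grace", "amount": "250"}]

def transform(row: dict) -> dict:
    """Clean up a raw row: trim strings, cast amounts to integers."""
    return {"name": row["name"].strip(), "amount": int(row["amount"])}

# ETL: transform first, then load only the cleaned rows into the warehouse.
warehouse = [transform(r) for r in source]

# ELT: load raw rows into the lake first; transform later, on demand.
lake = list(source)
on_demand = [transform(r) for r in lake]

print(warehouse == on_demand)  # True: same result, different ordering
```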

Data storage

Data storage follows. This component is responsible for providing durable, scalable, and secure storage solutions for the data pipeline. It typically includes large data repositories designed to handle varying types of data efficiently.

  • Data Warehouses: These are optimized for storing structured data, often organized in relational databases. They support complex querying and analytical processing, making them ideal for business intelligence and reporting. Data warehouses offer high performance and scalability, enabling organizations to manage large volumes of structured data efficiently.
  • Data Lakes:  Data lakes are designed to store structured, semi-structured, and unstructured data, providing a flexible and scalable solution. They retain raw data in its native format, facilitating extensive data ingestion and integration from various sources. This approach supports large volumes of diverse data, enabling advanced analytics, machine learning, and data exploration by transforming and analyzing data as needed.

Both data warehouses and data lakes play crucial roles in a data pipeline, providing the necessary infrastructure to store and manage data efficiently. They ensure that data is preserved with durability, protected with robust security measures, and scaled to meet the growing demands of modern data processing and analytics. Because of this, many organizations leverage both. 

Data Warehouses vs Data Lakes

Data Consumption 

The consumption layer is essential for extracting and leveraging data from storage systems. It offers scalable and high-performance tools that enable efficient data access and utilization. This layer integrates a variety of analytics tools tailored to different user needs and analytical methods. 

It supports SQL-based queries for precise data retrieval, batch analytics for processing large datasets, and reporting dashboards for visualizing key metrics and trends. Additionally, it facilitates machine learning applications, allowing for advanced data analysis and predictive insights. By providing these diverse tools and capabilities, the consumption layer ensures that all users—from data scientists to business analysts—can derive actionable insights and drive informed decision-making across the organization.

Data Governance

The security and governance layer ensures the protection and management of data throughout the entire pipeline. It includes:

  • Access Control: Restricts data access to authorized users through robust authentication and permissions management.
  • Encryption: Secures data both at rest and in transit to prevent unauthorized access.
  • Network Security: Utilizes firewalls, intrusion detection systems, and secure communication channels to safeguard data from cyber threats.
  • Usage Monitoring: Tracks data access and usage patterns to detect anomalies and enforce security policies.
  • Auditing Mechanisms: Maintains a detailed audit trail of all data operations and user activities for compliance and oversight.

This layer is integrated with all other pipeline components to ensure consistent application of security measures and governance practices across the data pipeline.

How to Build Data Pipelines in Eight Steps 

Designing data pipelines involves many considerations, and the decisions made early on can significantly impact future success. This section serves as a guide for asking the right questions during the initial design phase of a data pipeline.

In this guide, we’ll design a data pipeline for a hypothetical movie streaming service called “Strimmer.” Strimmer will offer a library of films and TV series accessible across Web, iOS, and Android platforms. Our goal is to create a data pipeline that supports a machine learning (ML) recommendation engine, enhancing movie recommendations for users.

How to Build Data Pipelines in Eight Steps

Step 1: Determine the goal in building data pipelines 

Your first step when building data pipelines is to identify the outcome or value it will offer your company or product. At this point, you’d ask questions like: 

  • What are our objectives for this data pipeline?
  • How do we measure the success of the data pipeline?
  • What use cases will the data pipeline serve (reporting, analytics, machine learning)?
  • Who are the end-users of the data that this pipeline will produce? How will that data help them meet their goals?

Strimmer: For our Strimmer application, the data pipeline will provide data for the ML recommendation engine, which will help Strimmer determine the best movies and series to recommend to users.

Step 2: Choose the data sources

In the next step, consider the possible data sources to enter the data pipeline. Ask questions such as: 

  • What are all the potential sources of data?
  • In what format will the data come in (flat files, JSON, XML)?
  • How will we connect to the data sources?

Strimmer: For our Strimmer data pipeline, sources would include: 

  • User historical data, such as previously watched movies and search behaviors stored in operational databases like SQL, NoSQL
  • User behavior data/analytics, such as when a user clicks a movie detail
  • 3rd party data from social media applications and movie rating sites like IMDB

Step 3: Determine the data ingestion strategy

Now that you understand your pipeline goals and have defined data sources, it’s time to ask questions about how the pipeline will collect the data. Ask questions including: 

  • Should we build our own data ingestion pipelines in-house with Python, Airflow, and other scripting tools?
  • Would we be utilizing third-party integration tools to ingest the data?
  • Are we going to be using intermediate data stores to store data as it flows to the destination?
  • Are we collecting data from the origin in predefined batches or in real time?

Strimmer: For our Strimmer data pipeline, we’ll leverage Striim, a unified real-time data integration and streaming platform, to ingest both batch and real-time data from the various data sources.

Step 4: Design the data processing plan

Once data is ingested, it must be processed and transformed for it to be valuable to downstream systems. At this stage, you’ll ask questions like: 

  • What data processing strategies are we utilizing on the data (ETL, ELT, cleaning, formatting)?
  • Are we going to be enriching the data with specific attributes?
  • Are we using all the data or just a subset?
  • How do we remove redundant data?

Strimmer: To build the data pipeline for our Strimmer service, we’ll use Striim’s streaming ETL data processing capabilities, allowing us to clean and format the data before it’s stored in the data store. Striim provides an intuitive interface to write streaming SQL queries to correct deficiencies in data quality, remove redundant data, and build a consistent data schema to enable consumption by the analytics service.

Step 5: Set up storage for the output of the pipeline

Once the data gets processed, we must determine the final storage destination for our data to serve various business use cases. Ask questions including: 

  • Are we going to be using big data stores like data warehouses or data lakes?
  • Would the data be stored on cloud or on-premises?
  • Which of the data stores will serve our top use cases?
  • In what format will the final data be stored?

Strimmer: Because we’ll be handling structured data sources in our Strimmer data pipeline, we could opt for a cloud-based data warehouse like Snowflake as our big data store.

Step 6: Plan the data workflow

Now, it’s time to design the sequencing of processes in the data pipeline. At this stage, we ask questions such as:

  • What downstream jobs are dependent on the completion of an upstream job?
  • Are there jobs that can run in parallel?
  • How do we handle failed jobs?

Strimmer: In our Strimmer pipeline, we’ll utilize a third-party workflow scheduler like Apache Airflow to help schedule and simplify the complex workflows between the different processes in our data pipeline via Striim’s REST API. For instance, we can define a workflow that independently reads data from our sources, joins the data using a specific key, and writes the transformation output to our data warehouse.
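The ordering constraints behind such a workflow can be sketched with the standard library's topological sorter; the task names are illustrative, and a real deployment would express the same dependencies as an Airflow DAG whose tasks call Striim's REST API.

```python
# Sketch of the "upstream before downstream" guarantee a DAG scheduler
# like Airflow enforces, using graphlib (Python 3.9+).
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must complete before it runs.
dag = {
    "read_user_history": set(),
    "read_behavior_events": set(),       # the two reads can run in parallel
    "join_sources": {"read_user_history", "read_behavior_events"},
    "write_to_warehouse": {"join_sources"},
}
order = list(TopologicalSorter(dag).static_order())
print(order)  # every task appears after all of its dependencies
```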

Step 7: Implement a data monitoring and governance framework

You’ve almost built an entire data pipeline! Our second to final step includes establishing a data monitoring and governance framework, which helps us observe the data pipeline to ensure a healthy and efficient channel that’s reliable, secure, and performs as required. In this step, we determine:

  • What needs to be monitored? Dropped records? Failed pipeline runs? Node outages?
  • How do we ensure data is secure and no sensitive data is exposed?
  • How do we secure the machines running the data pipelines?
  • Is the data pipeline meeting the delivery SLOs?
  • Who is in charge of data monitoring?

Strimmer: We need to ensure proper security and monitoring in our Strimmer data pipeline. We can do this by utilizing fine-grained permission-based access control from the cloud providers we use, encrypting data in the data warehouse using customer-managed encryption keys, storing detailed logs, and monitoring metrics for thresholds using tools like Datadog.

Step 8: Plan the data consumption layer

This final step determines the various services that’ll consume the processed data from our data pipeline. At the data consumption layer, we ask questions such as:

  • What’s the best way to harness and utilize our data?
  • Do we have all the data we need for our intended use case?
  • How do our consumption tools connect to our data stores?

Strimmer: The consumption layer in our Strimmer data pipeline can consist of an analytics service like Databricks that feeds from data in the warehouse to build, train, and deploy ML models using TensorFlow. The algorithm from this service then powers the recommendation engine to improve movie and series recommendations for all users.

Where does Striim Come into Play When Building Data Pipelines? 

Striim radically simplifies and manages the development, deployment, and management of real-time data pipelines. Historically, creating data pipelines involved manually stitching components together with scripts, a process that was often cumbersome, difficult to maintain, and prone to errors. Modern frameworks have improved this with visual design tools, but Striim takes it further by simplifying and automating the entire process.

As Dmitriy Rudakov notes, “The Striim platform contains all tools necessary for running a data pipeline: a multitude of sources and targets, schema evolution, a transformation layer called continuous query, and integration with UDFs. These capabilities are integrated into what’s called a Flow Designer, which provides a simple drag-and-drop interface and a monitoring framework that ensures smooth execution.” This comprehensive suite of features makes it easier to design, execute, and manage data pipelines with minimal complexity.

Striim also offers low-code and REST APIs so that data teams can automate the entire deployment, monitoring, and security of the data pipelines with CI/CD processes. 

CI/CD processes

Unlike traditional batch processing systems that rely on scheduled updates and often require additional logic to handle data changes, Striim offers a more streamlined approach. As Benjamin Kennady highlights, “Striim reads inserts, updates, and deletes as they occur and replicates them into the target. This methodology means that the source dataset does not require a field for capturing the updated time or when it was deleted. By not capturing when the last value was deleted, this saves on storage and processing requirements. This is also a more straightforward and lightweight way to work with a data pipeline.”

Striim’s real-time data integration ensures that changes are captured and processed instantly, eliminating the need for complex update schedules and reducing the overall workload. By connecting directly to the source database and table, Striim initiates the replication process with ease, thereby accelerating data pipelines and simplifying workflow management. “It simplifies development, deployment and management of real time data pipelines,” shares Dmitriy Rudakov. “In the past programmers used to stitch everything together with scripts which were hard to maintain and understand while modern frameworks tend to provide visual pipeline design studio that allow to automate running and monitoring of the user applications.” 

Flexibility and Scalability are the Keys to Sustainable Data Pipelines

Data pipelines enable companies to make faster, more informed decisions, gain a competitive edge, and derive substantial value from their growing data assets. Designing a scalable and adaptable data pipeline is crucial for managing increasing data volumes and evolving use cases.

With over 150 automated connectors, Striim integrates data from various sources—applications and databases—streaming trillions of events daily for diverse applications.

Schedule a demo today to discover how Striim can transform your data management strategy.

]]>
https://www.striim.com/blog/guide-to-data-pipelines/feed/ 0
Harnessing Continuous Data Streams: Unlocking the Potential of Online Machine Learning https://www.striim.com/blog/machine-learning-streaming-data/ https://www.striim.com/blog/machine-learning-streaming-data/#respond Wed, 04 Sep 2024 14:52:25 +0000 https://www.striim.com/?p=49766 The world is generating an astonishing amount of data every second of every day. It reached 64.2 zettabytes in 2020 and is projected to mushroom to over 180 zettabytes by 2025, according to Statista.

Modern problems require modern solutions — which is why businesses across industries are moving away from batch processing and towards real-time data streams, or streaming data. Moreover, the concept of ‘online machine learning’ has emerged as a potential solution for organizations working with data that arrives in a continuous stream or when the dataset is too large to fit into memory.

Today, we’ll walk you through the close connection between successful machine learning and streaming data. You’ll learn potential applications and why online machine learning is an excellent idea.

What is Online Machine Learning? 

Online machine learning is an approach that feeds data to the machine learning model in an incremental manner, which can leverage continuous streams. Instead of being trained on a complete data set all at once, online machine learning allows models to receive data points one at a time or in small batches. This method is especially helpful in scenarios where data is generated continuously, as this enables the model to learn and adapt in real time. 
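To make the idea concrete, here is a minimal sketch of an online learner in plain Python: a one-feature linear model that updates its weights from each observation as it arrives. The model, learning rate, and simulated stream are all invented for illustration, not drawn from any particular library.

```python
import random

class OnlineLinearModel:
    """A one-feature linear model trained by stochastic gradient descent,
    one observation at a time; no full dataset is ever held in memory."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        # Gradient of the squared error for this single observation.
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error

model = OnlineLinearModel()
random.seed(42)
# Simulate a stream whose true relationship is y = 2x + 1.
for _ in range(2000):
    x = random.uniform(0, 1)
    model.learn_one(x, 2 * x + 1)

print(round(model.w, 1), round(model.b, 1))  # converges near 2.0 and 1.0
```

The key property is that `learn_one` touches exactly one observation, so the same loop works whether the stream contains a thousand events or never ends.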

Applying machine learning to streaming data can help organizations with a wide range of applications. These include fraud detection from real-time financial transactions, real-time operations management (e.g., stock monitoring in the supply chain), or sentiment analysis over live social media trends on Facebook, Twitter, etc. 

“Online ML is the only way forward as old ways of using schedules to run batches do not fit with the growing data volumes and real time expectations,” shares Dmitriy Rudakov, Director of Solution Architecture at Striim. 

Simson Chow, Sr. Cloud Solutions Architect, adds, “Online machine learning allows models to continuously learn from new data and adapt in real-time. This will allow models to rapidly adjust to changing environments and produce accurate, up-to-date predictions. This dynamic approach is crucial in a constantly changing environment, where static models can quickly become outdated and ineffective.” 

What are Potential Use Cases for Online Machine Learning? 

Some instances where online machine learning is particularly impactful include: 

  • When your data has no end and is effectively continuous
  • When your training data is sensitive due to privacy issues, and you are unable to move it to an offline environment
  • When you can’t transfer training data to an offline environment due to device or network limitations
  • When the size of training datasets is too large, making it impossible to fit into the memory of a single machine at a specific time
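The "too large for memory" case above is usually handled by reading the stream in fixed-size mini-batches, so only one batch is materialized at a time. A minimal generator-based sketch in Python:

```python
from itertools import islice

def mini_batches(stream, batch_size):
    """Yield fixed-size batches from any iterable without materializing it.
    Only one batch lives in memory at a time, so the stream can be unbounded."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# A generator stands in for an endless event stream.
events = (f"event-{i}" for i in range(10))
sizes = [len(b) for b in mini_batches(events, 4)]
print(sizes)  # [4, 4, 2]
```

Because the source is consumed lazily, the same function works unchanged for a file, a socket, or a message queue wrapped in an iterator.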

Online vs Offline Machine Learning: Why Offline Machine Learning Is Not Ideal for Streaming Data

When it comes to effectively utilizing streaming data for machine learning, traditional batch processing methods fall short. 

These methods, usually referred to as offline or batch learning, can handle static datasets, processing them all at once. However, they’re not equipped to deal with the continuous flow of data in real time. Due to this, taking such an approach is not only resource-intensive but also time-consuming, making it unsuitable for dynamic environments where timely updates are crucial. Let’s dive deeper. 

Online vs Offline Machine Learning: Offline Learning Limitations 

Offline learning systems are limited by their inability to learn incrementally. Each time new data becomes available, the entire model must be retrained from scratch, incorporating both the old and new data into a single dataset. 

“Because traditional batch processing relies on frequently updating models with massive batches of data, it can result in redundant predictions and inadequate responses to new patterns, changes in the data, and more costs as a result of the model’s retraining and re-deployment, requiring significant infrastructure and compute resources,” says Chow. “This makes it unsuitable for various machine learning use cases. Because of this latency, it is not appropriate for real-time applications like online personalization, fraud detection, or autonomous systems where quick decisions are necessary.” 

This process consumes significant computational resources and can result in prolonged downtime as the model is retrained, re-evaluated, and redeployed. While automated tools can streamline this process, the delay in retraining limits the model’s responsiveness, particularly in time-sensitive applications such as financial forecasting.

“There are 2 main reasons traditional batch systems don’t work for customers anymore,” says Dmitriy Rudakov. “The first one is the growing need to act in real time. For example, can you imagine using Uber without a fast real-time response today?” Dmitriy Rudakov also adds that, while traditionally data administrators have tried to time this process to occur at night so it doesn’t interfere with daily operations, “Growing volumes of data [means] batch based training just doesn’t fit the time windows provided.” 

Online vs Offline Machine Learning: Online Learning Advantages

On the contrary, online machine learning can handle streaming data by feeding the model data incrementally. This approach allows the model to update itself in real time as new data arrives, making it highly adaptable to changes and reducing the latency associated with batch learning. For example, in stock price forecasting, where real-time data is crucial, an online learning model can continuously refine its predictions without the need for complete retraining, ensuring that forecasts are always based on the most current information.
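As a deliberately simple stand-in for a real forecasting model, the sketch below refines an exponentially weighted moving average on every price tick in constant time, never retraining on the full history. The prices and smoothing factor are invented for the example.

```python
class EwmaForecaster:
    """Forecasts the next value as an exponentially weighted moving average.
    Each new tick adjusts the estimate in O(1); no history is stored."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # weight given to the newest observation
        self.estimate = None

    def update(self, price):
        if self.estimate is None:
            self.estimate = price
        else:
            self.estimate += self.alpha * (price - self.estimate)
        return self.estimate

forecaster = EwmaForecaster(alpha=0.5)
for price in [100.0, 102.0, 101.0, 105.0]:
    forecast = forecaster.update(price)

print(forecast)  # 103.0
```

A production forecaster would be far richer, but the shape is the same: one cheap update per event, with the latest estimate always ready to serve.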


How Does Online Machine Learning Work? 

Now that you know why online machine learning is the better option, here’s how it works from a technical perspective — and how stream processing plays a role. 

Think of stream processing as the backbone that enables online machine learning to function effectively. It provides the infrastructure to ingest, process, and manage continuous data flows in real-time. This is where Striim comes into play, offering a robust platform designed to handle the complexities of stream processing and real-time data integration.

Striim also captures and processes real-time data from various sources, such as databases, IoT devices, and cloud environments. By leveraging the platform, organizations can seamlessly feed this real-time data into their online machine learning models, allowing them to learn and adapt continuously. Striim’s low-latency data streaming ensures that the online learning models are always working with the most current data, enabling timely and accurate decision-making.
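Conceptually, this ingest-then-consume loop looks like a producer pushing events onto a queue while a consumer drains it. The plain-Python sketch below illustrates the pattern only; it is not Striim's API, and the readings and sentinel convention are invented for the example.

```python
import queue
import threading

def producer(events):
    # Stands in for an ingestion layer pushing change events downstream.
    for reading in [3.0, 4.0, 100.0, 5.0]:
        events.put(reading)
    events.put(None)  # sentinel marking the end of the stream

def consumer(events, seen):
    # Stands in for the model side: pull each event and update state.
    while (reading := events.get()) is not None:
        seen.append(reading)

events, seen = queue.Queue(), []
thread = threading.Thread(target=producer, args=(events,))
thread.start()
consumer(events, seen)
thread.join()
print(seen)  # every event reaches the consumer in arrival order
```

In a real deployment the queue would be a durable stream and the consumer would call the model's incremental update, but the decoupling between producer and consumer is the essential idea.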

How Online Machine Learning Can Make a Difference

Online machine learning is an approach in which training occurs incrementally: the model is fed data continuously as it arrives from the source. Data from real-time streams is broken down into mini-batches and then fed to the model. Here’s how it can make a difference. 


Save Computing Resources 

Online learning is accessible regardless of computing resources. If you have minimal computing resources and a lack of space to store streaming data, you can still leverage it successfully. 

Once an online learning system has learned from a data stream, it can discard the data or move it to a storage medium, saving your business a significant amount of money and space. Online machine learning doesn’t require powerful, high-end hardware to process streaming data, because only one mini-batch is processed in memory at a time, unlike offline machine learning, where everything has to be processed at once. As a result, you can even perform online machine learning on affordable hardware such as a Raspberry Pi.

“ML can be applied with data streaming systems in two ways,” shares Dmitriy Rudakov. “Model inference, i.e., calling the model in real time, can be done via different CDC techniques. This process does not require a lot of computing resources as the model is already trained, and the real-time app is just accessing it to generate some useful insights. Incidentally, if there is a change of properties in time (drift), the real-time system can make calls to calculate model accuracy scores and initiate retraining via automation. 

Alternatively, training models can be done via the initial load phase, where, for a short period, the system can read and process all relevant data or subsets of data to train the model of choice. Training can also be done in real-time by sending event batches broken into chunks, according to use case needs, to the training modules, which will save computing resources and ensure freshness of models, thus addressing the drift problem.” 

Address Concept Drift

Online machine learning can also address concept drift — a known problem in machine learning. In machine learning, a ‘concept’ refers to a variable or a quantity that a machine learning model is trying to predict.

The term ‘concept drift’ refers to the phenomenon in which the target concept’s statistical properties change over time. This can be a sudden change in the variance, mean, or any other characteristic of the data. In online machine learning, the model computes one mini-batch of data at a time and can be updated on the fly. This helps the model keep pace with concept drift, since new streams of data continuously update the model as the underlying distribution changes.
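One simple way to operationalize drift monitoring is to compare a model’s recent average error against a long-run baseline and flag the moment the gap widens. The sketch below is a plain-Python illustration with invented window and threshold values, not a production drift detector.

```python
from collections import deque

class DriftMonitor:
    """Flags possible concept drift when the recent average of a model's
    error jumps well above its long-run baseline."""

    def __init__(self, window=10, threshold=2.0):
        self.recent = deque(maxlen=window)
        self.baseline = None
        self.threshold = threshold

    def observe(self, error):
        self.recent.append(abs(error))
        recent_mean = sum(self.recent) / len(self.recent)
        if self.baseline is None:
            # Freeze the baseline once the first full window is seen.
            if len(self.recent) == self.recent.maxlen:
                self.baseline = recent_mean
            return False
        return recent_mean > self.threshold * self.baseline

monitor = DriftMonitor(window=10, threshold=2.0)
flags = []
# 30 steps of small errors, then the data distribution shifts.
for step in range(40):
    error = 0.1 if step < 30 else 1.0
    flags.append(monitor.observe(error))

print(flags.index(True))  # first step at which drift is flagged
```

In practice, a flag like this would trigger the automated retraining Dmitriy Rudakov describes, rather than just being printed.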

Learning from large amounts of data streams can help with applications that deal with forecasting, spam filtering, and recommender systems. For example, if a user buys multiple products (e.g., a winter coat and gloves) within a space of minutes on an e-commerce website, an online machine learning model can use this real-time information to recommend products that can complement their purchase (e.g., a scarf). 

Online learning is closely connected to another concept called operationalizing machine learning, as both involve the continuous updating and adaptation of models with real-time data. Online learning enables models to refine their predictions on-the-fly, which is essential for maintaining accuracy in live environments. With this connection in mind, let’s explore how Striim supports these processes to enhance decision-making and operational efficiency.

Operationalizing Machine Learning with Striim

Operationalizing machine learning involves integrating models into live environments to leverage real-time data for continuous predictions and decision-making. This approach tackles challenges like handling high volumes of data, managing the speed at which data is generated and collected, and addressing the variety of data formats. For businesses, operationalizing machine learning translates into real-time insights, agility, improved accuracy, and enhanced operational efficiency.

Striim is an ideal platform for this task, offering comprehensive data movement capabilities crucial for digital transformation. It ingests and processes streaming data in real-time, performing essential transformations, filtering, and enrichment before the data is fed into online learning models. “The only way to keep the model fresh is leveraging data provided in real time,” shares Dmitriy Rudakov. By continuously feeding these models with fresh data, Striim ensures they can adapt in real-time, keeping predictions and decisions accurate as conditions change.

The connection between operationalizing machine learning and online machine learning is crucial. Online machine learning, which incrementally updates models with new data, ensures continuous learning and adaptation—exactly what’s needed for operationalizing machine learning in dynamic, real-world environments.

To address the challenges of data variety and ensure models stay current, Striim can help you with:

  • Event-driven data capture and processing to train models incrementally.
  • Capturing schema changes from source systems and managing data drift.
  • Handling large volumes of streaming data from multiple sources.
  • Performing filtering, enriching, and data preparation on streaming data.
  • Providing data-driven insights and predictions by integrating trained models with real-time data streams.
  • Tracking data evolution and assessing model performance, enabling automatic retraining with minimal human intervention.

With these capabilities, Striim provides a robust foundation for operationalizing machine learning, supporting continuous, real-time learning and adaptation. Learn more in our guide to operationalizing machine learning.

Leverage Striim for Online Machine Learning Use Cases

By combining the strengths of Striim’s real-time data integration with online machine learning, your organization can effectively tackle the challenges of modern data environments. Striim’s platform not only supports seamless data streaming but also enhances the accuracy and relevance of your machine learning models by providing continuous, up-to-date insights. Whether you need to adapt to shifting data patterns or optimize resource usage, Striim equips you with the tools to maintain a competitive edge. Get a demo today to learn how Striim can empower your online machine learning initiatives and drive smarter, faster decisions.

The Future of AI is Real-Time Data

Wed, 28 Aug 2024
To the data scientists pushing the boundaries of what’s possible, the AI experts and enthusiasts who see beyond the horizon, and the techies building tomorrow’s solutions today — this manifesto is for you. The key to unlocking AI’s full potential lies in real time data. Traditional methods no longer suffice in a world that demands instant insights and immediate action.

Real-Time AI as the New Competitive Battleground

AI and ML are more than just buzzwords; they are driving substantial economic growth, creating new job opportunities, and shaping the future. The AI market is projected to reach a staggering $407 billion by 2027. This exponential growth underscores the widespread adoption and integration of AI across various industries. Furthermore, AI is on track to boost the US GDP by 21% by 2030. This highlights the profound economic impact AI will have. By automating routine tasks, optimizing operations, and providing deep insights through data analysis, AI enables businesses to increase productivity while reducing costs. And contrary to common fears that AI will eliminate jobs, it is expected to create 97 million new jobs by 2025. These roles will span various sectors, including data science, AI ethics, machine learning engineering, and AI-related research and development.

Real-Time Data — The Missing Link

What is Real-Time Data?

In the realm of data processing, real-time data refers to information that is delivered and processed almost instantaneously as it is generated. Unlike batch processing, which involves collecting and processing data in bulk at scheduled intervals, real-time data ensures immediate availability and actionability. This immediacy allows for decisions and responses to be made in the moment, offering a dynamic edge over traditional methods.

The Death of Traditional Batch Processing

The shift from batch processing to real-time data marks a crucial technological evolution driven by the need for speed and efficiency. Batch processing resulted in significant delays between data generation and actionable insights. As the demand for faster decision-making grew, the limitations of traditional batch processing became glaringly apparent. Traditional methods introduced latency, making it impossible to act on data immediately, a critical issue in environments requiring timely decisions.

Furthermore, batch processing systems were rigid and inflexible, struggling to scale as data volumes grew and needing substantial reengineering to adapt to new data types or sources. The advent of real-time data processing revolutionized this paradigm, providing the means to analyze and act on data as it flows, thereby minimizing latency to sub-second and offering unparalleled scalability and adaptability to modern data streams. This transformation is responsible for enabling real-time decision-making and fostering innovation across industries, cementing real-time data as the cornerstone of AI algorithms and advancements.

Dispelling Misconceptions and Demonstrating Value

In the world of AI and ML, there are a few common objections to the adoption of real-time data processing. Let’s dive into these misconceptions and demonstrate the true value of real-time capabilities.

Misconception: Batch Processing Suffices

Objection: Many AI/ML tasks can be handled with batch processing. Models trained on historical data can make predictions without needing real-time updates. The necessity of real-time data is highly specific to certain use cases, and not all industries or applications benefit equally.

Reality Check: While batch processing works for some tasks, it falls short in dynamic environments requiring high responsiveness and timely decision-making. Real-time data integration allows models to process the most recent data points, reducing lag between data generation and actionable insights. This is crucial in fields like finance, where market conditions shift rapidly, or e-commerce, where user behavior and inventory status constantly change. For example, fraud detection models relying on batch data might miss real-time anomalies, whereas real-time data can detect and respond to fraud within milliseconds. In healthcare, real-time patient monitoring can provide immediate insights for timely interventions, improving patient outcomes. The notion that real-time data is only useful in specific cases is outdated as countless industries increasingly leverage real-time capabilities to stay competitive and responsive.

Misconception: Complexity and Cost

Objection: Implementing real-time data systems is complex and costly. The infrastructure required for real-time data ingestion, processing, and analysis can be significantly more expensive than batch processing systems.

Reality Check: While real-time systems require an investment, the ROI is substantial. Modern cloud-based architectures and scalable platforms like Striim and Apache Kafka have reduced the complexity and cost of real-time data processing. Real-time systems drive higher revenues and better customer experiences by enabling immediate responses to emerging trends and anomalies. For instance, real-time inventory management in retail can prevent stockouts and overstock, directly impacting sales and customer satisfaction. The initial investment in real-time capabilities is outweighed by the long-term gains in efficiency, responsiveness, and competitive advantage.

Misconception: Data Quality and Stability

Objection: Real-time data can be noisy and unstable, leading to potential inaccuracies in model predictions. Batch processing allows for more thorough data cleaning and preprocessing.

Reality Check: Real-time data does not mean compromising on quality. Advanced real-time analytics platforms incorporate robust data cleaning and anomaly detection, ensuring models receive high-quality, stable inputs. Tools like Apache Beam and Spark Streaming provide mechanisms for real-time data validation and cleansing. Real-time data pipelines can also integrate seamlessly with existing ETL processes to maintain data integrity. By leveraging these technologies, organizations can ensure that their real-time data is as reliable and accurate as batch-processed data, while gaining the added advantage of immediacy.
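The validation-and-cleansing stage these tools provide can be pictured, in miniature, as a filter applied to events in flight. This plain-Python sketch, with an invented event shape and range check, illustrates the idea rather than any particular platform’s API.

```python
def cleanse(events, valid_range=(0.0, 200.0)):
    """Validate and normalize raw events in flight, dropping malformed or
    out-of-range readings before they ever reach a model."""
    for event in events:
        try:
            value = float(event["reading"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed event: discard rather than poison the model
        low, high = valid_range
        if low <= value <= high:
            yield {**event, "reading": value}

raw = [
    {"sensor": "a", "reading": "21.5"},   # string number: coerced to float
    {"sensor": "b", "reading": None},     # malformed: dropped
    {"sensor": "c", "reading": 999.0},    # out of range: dropped
    {"sensor": "d", "reading": 19.0},
]
clean = list(cleanse(raw))
print([e["sensor"] for e in clean])  # ['a', 'd']
```

Because `cleanse` is a generator, it adds no buffering of its own and can sit directly in a streaming pipeline between ingestion and the model.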

Misconception: Model Retraining Frequency

Objection: Many models do not need to be retrained frequently. The insights gained from real-time data might not justify the cost and effort of constant retraining.

Reality Check: The pace of change in today’s world demands models that can adapt quickly. Real-time data enables continuous learning and incremental updates, ensuring models remain relevant and accurate. Techniques like online learning and incremental model updates allow models to evolve without the need for complete retraining. For example, recommendation systems can benefit from real-time user behavior data, continuously refining their suggestions to enhance user engagement. By integrating real-time data, organizations can maintain high model performance and accuracy, adapting swiftly to new patterns and trends.

Industry Disruption through Real-Time AI

Real-time AI is redefining how businesses operate by providing up-to-the-second information that enhances predictive accuracy, supports continuous learning, and automates complex decision-making processes. This integration allows AI to adapt instantly to new data, which is essential for applications where split-second decision-making is critical, including fraud detection, autonomous vehicles, and financial trading. It also powers real-time anomaly detection in cybersecurity and manufacturing, identifying threats and malfunctions as they occur. Additionally, real-time data empowers personalized customer experiences by analyzing interactions on the fly, delivering tailored recommendations and services. The scalability and adaptability of real-time data platforms ensure AI systems are always equipped with the most current information, driving innovation and efficiency across industries.

Real-Time AI & ML in the Real World

Predictive Maintenance in Manufacturing

ML algorithms, often powered by sensors and IoT devices, continuously monitor equipment health. Anticipating failures, predictive maintenance minimizes downtime and optimizes productivity by analyzing historical data and real-time sensor readings, enabling proactive scheduling and preventing disruptions in production.

Customer Churn Prediction in Telecom

ML models may consider factors such as customer demographics, usage patterns, customer service interactions, and billing history. By identifying customers at risk of churn, telecom companies can implement targeted retention strategies, such as personalized offers or improved customer support.

Fraud Detection in Finance

ML algorithms learn from historical data to identify patterns associated with fraudulent transactions. Real-time monitoring allows financial institutions to detect anomalies and trigger immediate alerts or interventions. This proactive approach helps prevent financial losses due to fraudulent activities.

Personalized Marketing in E-commerce

ML algorithms analyze not only purchase history but also browsing behavior and preferences. This enables e-commerce platforms to deliver personalized product recommendations through targeted advertisements, email campaigns, and website interfaces, enhancing the overall shopping experience.

Healthcare Diagnostics and Predictions

ML models, particularly in medical imaging, can assist healthcare providers by identifying subtle patterns indicative of diseases. Predictive analytics also help healthcare providers anticipate patient health deterioration, enabling early interventions and personalized treatment plans.

Dynamic Pricing in Retail

ML algorithms consider a multitude of factors, including competitor pricing, inventory levels, historical sales data, and customer behavior. By dynamically adjusting prices in real time, retailers can optimize revenue, respond to market changes, and maximize profitability.

Supply Chain Optimization

ML-driven demand forecasting considers historical data, seasonality, and external factors like economic trends and geopolitical events. This enables accurate inventory management, reduces excess stock, and ensures timely deliveries, ultimately improving the overall efficiency of the supply chain.

Human Resources and Talent Management

ML tools assist in resume screening by identifying relevant skills and qualifications. Predictive analytics can assess employee satisfaction, helping organizations identify areas for improvement and implement strategies to enhance employee retention and engagement.

UPS Success Story: Where Real-Time Data Supercharged Real-Time AI


Safeguarding shipments with AI and real-time data

UPS Capital® is leveraging Google’s Data Cloud and AI technologies to safeguard packages from porch piracy. With more than 300 million American consumers turning to online shopping, UPS Capital has witnessed the significant challenges customers face in securing their package delivery ecosystem. Now, the company is leveraging its digital capabilities and access to data to help customers rethink traditional approaches to combat shipping loss and deliver better customer experiences.

DeliveryDefense™ Address Confidence utilizes real-time data and machine learning algorithms to safeguard packages. By assigning a confidence score to potential delivery locations, it enhances the assessment of successful delivery probabilities while mitigating loss or theft risks. Every address is allocated a confidence score on a scale from 100 to 1000, with 1000 indicating the highest probability of delivery success. These scores are based on customer reports of package theft. Shippers can integrate this score into their shipping workflow through an API to take proactive, preventative actions on low-confidence addresses. For instance, if a package is destined for an address with a low confidence score, the merchant can proactively reroute the shipment to a secure UPS Access Point location. These locations typically have a confidence score of around 950 due to their high chain of custody security precautions.
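The score-based routing decision described above might look roughly like the following, where the function name, threshold, and return values are all hypothetical illustrations rather than the actual DeliveryDefense API.

```python
def route_shipment(address_score, threshold=600):
    """Decide routing from a delivery-confidence score on the 100-1000
    scale described above. The threshold and labels are illustrative,
    not the real DeliveryDefense API."""
    if address_score < threshold:
        # Low confidence: proactively reroute to a secure pickup location.
        return "reroute_to_access_point"
    return "deliver_to_address"

print(route_shipment(320))  # low-confidence address: rerouted
print(route_shipment(950))  # high-confidence address: delivered as addressed
```

A shipper integrating the real API would make this decision inside its shipping workflow, with the score fetched per address at label-creation time.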


Striim’s real-time data integration platform works in tandem with Google Cloud’s modern architecture by dynamically embedding vectors into streaming information, enhancing data representation, processing efficiency, and analytical accuracy. Striim also integrates structured and unstructured data pulled from diverse sources and applies a variety of AI models from OpenAI and Vertex AI to generate embeddings that establish similarity scores between data points to reveal possible relationships.

The solution brings UPS Capital significant operational rewards, evidenced by over 280,000 claims paid annually. With $236 billion in declared value and 690,000 shippers protected, its solutions offer robust protection for shippers, ensuring peace of mind and financial security in every shipment.

The Future of AI is Now — And It’s Real-Time

Real-time data and AI are significantly improving existing processes and impacting the bottom line across industries. From retail and finance to healthcare and beyond, the integration of real-time data is driving greater efficiency, more personalized customer experiences, and continuous innovation. This shift is creating new opportunities and setting higher standards.

Businesses are encouraged to embrace real-time data and AI to stay competitive in the future. By adopting these technologies, companies can fully leverage AI, stay ahead of the competition, and navigate the evolving technological landscape. The future of AI is real-time, and the time to act is now.

An In-Depth Guide to Real-Time Analytics

Thu, 22 Aug 2024

It’s increasingly necessary for businesses to make immediate decisions. More importantly, it’s crucial these decisions are backed up with data. That’s where real-time analytics can help. Whether you’re a SaaS company looking to release a new feature quickly, or a retail shop owner trying to better manage inventory, these insights empower businesses to assess and act on data quickly to make better decisions. As a result, you’ll enjoy empowered decision-making, know how to respond to the latest trends, and boost operational efficiency.

We’re here to walk you through everything you need to know about real-time analytics. Whether you want to learn more about the benefits of real-time analytics or dive deeper into the most significant characteristics of a real-time analytics system, we’ll ensure you have a robust understanding of how real-time analytics move your business forward. 

What is real-time analytics?

So, what exactly is real-time analytics? And more importantly, how does it work? 

Real-time analytics refers to pulling data from different sources as it is generated. The data is then analyzed and transformed into a format that’s digestible for target users, enabling them to draw conclusions or garner insights the moment the data enters a company’s system. Users can access this data through a dashboard, report, or another medium.

There are two forms of real-time analytics: 

On-demand real-time analytics

With on-demand real-time analytics, users send a request, such as with an SQL query, to deliver the analytics outcome. It relies on fresh data, but queries are run on an as-needed basis. 

The requesting user varies, and can be a data analyst or another team member within the organization who wants to gain insight into business activity. For instance, a marketing manager can leverage on-demand real-time analytics to identify how users on social media react to an online advertisement in real time. 
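As a self-contained illustration of the on-demand pattern, the snippet below uses Python’s built-in sqlite3 module as a stand-in for an analytics store; the table and campaign names are invented for the example.

```python
import sqlite3

# An in-memory table stands in for a continuously updated analytics store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ad_events (campaign TEXT, reaction TEXT)")
db.executemany(
    "INSERT INTO ad_events VALUES (?, ?)",
    [("spring", "like"), ("spring", "share"), ("spring", "hide"),
     ("summer", "like")],
)

# On demand: the analyst runs the query only when they want an answer,
# but it always reflects the freshest rows ingested so far.
positive = db.execute(
    "SELECT COUNT(*) FROM ad_events"
    " WHERE campaign = 'spring' AND reaction IN ('like', 'share')"
).fetchone()[0]
print(positive)  # 2
```

The defining trait is that computation happens at query time; new rows inserted between queries are simply picked up by the next request.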

Continuous real-time analytics

On the contrary, continuous real-time analytics takes a more proactive approach. It delivers analytics continuously in real time without requiring a user to make a request. You can view your data on a dashboard via charts or other visuals, so users can gain insight into what’s occurring down to the second.

One potential use case for continuous real-time analytics is within the cybersecurity industry. For instance, continuous real-time analytics can be leveraged to analyze streams of network security data flowing into an organization’s network. This makes threat detection a possibility. 
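A minimal sketch of this continuous pattern, in plain Python, is a rule evaluated against a sliding window of security events as they arrive; the event names and thresholds are invented for the example.

```python
from collections import deque

def monitor(events, window=5, threshold=3):
    """Continuously evaluate a sliding count of 'failed_login' events,
    recording an alert the moment the count in the window crosses the
    threshold."""
    recent = deque(maxlen=window)
    alerts = []
    for i, event in enumerate(events):
        recent.append(event)
        if sum(1 for e in recent if e == "failed_login") >= threshold:
            alerts.append(i)
    return alerts

stream = ["ok", "failed_login", "ok", "failed_login", "failed_login",
          "ok", "ok", "ok", "ok"]
print(monitor(stream))  # indices at which the alert condition held
```

Unlike the on-demand case, no one has to ask a question here: the rule runs on every event, which is what makes second-by-second threat detection possible.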

In addition to the main types of real-time analytics, streaming analytics also plays a crucial role in processing data as it flows in real-time. Let’s dive deeper into streaming analytics now. 

What’s the difference between real-time analytics and streaming analytics? 

Streaming analytics focuses on analyzing data in motion, unlike traditional analytics, which deals with data stored in databases or data warehouses. Streams of data are continuously queried with Streaming SQL, enabling correlation, anomaly detection, complex event processing, artificial intelligence/machine learning, and live visualization. Because of this, streaming analytics is especially impactful for fraud detection, log analysis, and sensor data processing use cases.

How does real-time analytics work?

To fully understand the impact of real-time analytics processing, it’s necessary to understand how it works. 

1. Collect data in real time

Every organization can leverage valuable real-time data. What exactly that looks like varies depending on your industry, but some examples include:

  • Enterprise resource planning (ERP) data: Analytical or transactional data
  • Website application data: Top source for traffic, bounce rate, or number of daily visitors
  • Customer relationship management (CRM) data: General interest, number of purchases, or customer’s personal details
  • Support system data: Customer’s ticket type or satisfaction level

Consider your business operations to decide which type of data is most impactful for your business. You’ll also need an efficient way of collecting it. For instance, say you work in a manufacturing plant and want to use real-time analytics to find faults in your machinery. You can use machine sensors to collect data and analyze it in real time to detect early signs of failure.

To collect this data, it’s imperative to have a real-time ingestion tool that can reliably gather data from your sources. 

2. Combine data from various sources

Typically, you’ll need data from multiple sources to get a complete picture. If you’re looking to analyze customer data, for instance, you’ll need to pull it from the operational systems for sales, marketing, and customer support. Only with all of those facets in view can you determine how to improve the customer experience. 

To achieve this, combine data from all of your sources. You can use ETL (extract, transform, and load) tools or build a custom data pipeline of your own, then send the aggregated data to a target system, such as a data warehouse. 
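As an illustration of the combine step, this hedged sketch joins records from two hypothetical operational sources on a shared customer key, the way an ETL job might before loading a warehouse table. All field names are invented.

```python
# Hypothetical source extracts: CRM purchase counts and support tickets.
crm = [
    {"customer_id": 1, "purchases": 12},
    {"customer_id": 2, "purchases": 3},
]
support = [
    {"customer_id": 1, "open_tickets": 0},
    {"customer_id": 2, "open_tickets": 2},
]

def combine(crm_rows, support_rows):
    """Enrich each CRM record with its matching support-system data."""
    tickets = {row["customer_id"]: row["open_tickets"] for row in support_rows}
    return [
        {**row, "open_tickets": tickets.get(row["customer_id"], 0)}
        for row in crm_rows
    ]

print(combine(crm, support))
```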

3. Extract insights by analyzing data

Finally, your team will extract actionable insights. To do this, use statistical methods and data visualizations to analyze data by identifying underlying patterns or correlations in the data. For example, you can use clustering to divide the data points into different groups based on their features and common properties. You can also use a model to make predictions based on the available data, making it easier for users to understand these insights.
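A minimal sketch of the grouping idea: segment customers into two clusters by a single feature. The threshold is invented; a production system would run a real clustering algorithm such as k-means over many features.

```python
# Illustrative only: split customers into "high" and "low" engagement
# groups by purchase count. Real clustering would learn the boundaries.
def cluster(customers, threshold=5):
    groups = {"high": [], "low": []}
    for name, purchases in customers:
        groups["high" if purchases >= threshold else "low"].append(name)
    return groups

data = [("ada", 12), ("bob", 2), ("cyd", 7)]
print(cluster(data))  # {'high': ['ada', 'cyd'], 'low': ['bob']}
```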

Now that you know how real-time analytics works, let’s discuss the difference between batch and real-time processing. 

Batch processing vs. real-time processing: What’s the difference? 

Real-time analytics is made possible by the way the data is processed. To understand this, it’s important to know the difference between batch and real-time processing.

Batch Processing

In data analytics, batch processing involves first storing large amounts of data for a period and then analyzing it as needed. This method is ideal when analyzing large aggregates or when waiting for results over hours or days is acceptable. For example, a payroll system processes salary data at the end of the month using batch processing.

“Sometimes there’s so much data that old batch processing (late at night once a day or once a week) just doesn’t have time to move all data and hence the only way to do it is trickle feed data via CDC,” says Dmitriy Rudakov, Director of Solution Architecture at Striim.

Real-time Processing

With real-time processing, data is analyzed immediately as it enters the system. Real-time analytics is crucial for scenarios where quick insights are needed. Examples include flight control systems and ATMs, where events must be generated, processed, and analyzed swiftly.
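The contrast can be shown in a few lines: the same total computed once at the end (batch) versus updated as each event arrives (real time). This is a toy illustration, not a production pattern.

```python
def batch_total(transactions):
    # Batch: wait until all records are collected, then process once.
    return sum(transactions)

def realtime_totals(transactions):
    # Real time: update the answer as each event enters the system.
    total = 0
    for amount in transactions:
        total += amount
        yield total  # an up-to-the-moment running total

txns = [100, 250, 75]
print(batch_total(txns))            # 425, available only at the end
print(list(realtime_totals(txns)))  # [100, 350, 425], available as events arrive
```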

“Real-time analytics gives businesses an immediate understanding of their operations, customer behavior, and market conditions, allowing them to avoid the delays that come with traditional reporting,” says Simson Chow, Sr. Cloud Solutions Architect at Striim. “This access to information is necessary because it enables businesses to react effectively and quickly, which improves their ability to take advantage of opportunities and address problems as they arise.” 

Real-Time Analytics Architecture

When implementing real-time analytics, you’ll need a different architecture and approach than you would with traditional batch-based data analytics. The streaming and processing of large volumes of data will also require a unique set of technologies.

With real-time analytics, raw source data is rarely what you want delivered to your target systems. More often than not, you need a data pipeline that begins with data integration and then lets you act on the data in flight before delivery to the target. This approach ensures that the data is cleaned, enriched, and formatted to your needs, improving its quality and usability for more accurate and actionable insights.

Data integration

The data integration layer is the backbone of any analytics architecture, as downstream reporting and analytics systems rely on consistent and accessible data. This layer provides capabilities for continuously ingesting data of varying formats and velocities from external sources or existing cloud storage.

It’s crucial that the integration channel can handle large volumes of data from a variety of sources with minimal impact on source systems and sub-second latency. This layer leverages data integration platforms like Striim to connect to various data sources, ingest streaming data, and deliver it to various targets.

For instance, consider how Striim enables the constant, continuous movement of unstructured, semi-structured, and structured data – extracting it from a wide variety of sources such as databases, log files, sensors, and message queues, and delivering it in real-time to targets such as Big Data, Cloud, Transactional Databases, Files, and Messaging Systems for immediate processing and usage.

Event/stream processing

The event processing layer provides the components necessary for handling data as it is ingested. Data coming into the system in real time is often referred to as streams or events, because each data point describes something that occurred at a given moment. These events typically require cleaning, enrichment, processing, and transformation in flight before they can be stored or used. 

Therefore, another essential component for real-time data analytics is the infrastructure to handle real-time event processing.

Event/stream processing with Striim

Some data integration platforms, like Striim, perform in-flight data processing. This includes filtering, transformations, aggregations, masking, and enrichment of streaming data. These platforms deliver processed data with sub-second latency to various environments, whether in the cloud or on-premises.

Additionally, Striim can deliver data to advanced stream processing platforms such as Apache Spark and Apache Flink. These platforms can handle and process large volumes of data while applying sophisticated business logic.
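To make the in-flight steps concrete, here is an illustrative sketch — not Striim’s actual API — of filtering, masking, and enriching events before delivery to a target. The lookup table and field names are invented for the example.

```python
# Invented reference data used for the enrichment step.
COUNTRY_BY_STORE = {"s1": "UK", "s2": "US"}

def process(event):
    # Filter: drop low-value events entirely.
    if event["amount"] < 10:
        return None
    # Mask: hide all but the last four digits of the card number.
    event["card"] = "****" + event["card"][-4:]
    # Enrich: add a field looked up from reference data.
    event["country"] = COUNTRY_BY_STORE.get(event["store"], "unknown")
    return event

stream = [
    {"store": "s1", "card": "4111111111111111", "amount": 42.0},
    {"store": "s2", "card": "5500000000000004", "amount": 3.5},
]
# Copy each event before processing so the source records stay untouched.
processed = [e for e in (process(dict(ev)) for ev in stream) if e]
print(processed)
```

In a streaming platform the same filter/mask/enrich logic would run continuously against an unbounded stream, with the processed events delivered to the target with sub-second latency.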

Data storage

A crucial element of real-time analytics infrastructure is a scalable, durable, and highly available storage service to handle the large volumes of data needed for various analytics use cases. The most common storage architectures for big data include data warehouses and lakes. Organizations seeking a mature, structured data solution that focuses on business intelligence and data analytics use cases may consider a data warehouse. Data lakes, by contrast, suit enterprises that want a flexible, low-cost big data solution to power machine learning and data science workloads on unstructured data.

It’s rare for all the data required for real-time analytics to be contained within the incoming stream. Applications deployed to devices or sensors are generally built to be very lightweight and intentionally designed to produce minimal network traffic. Therefore, the data store should be able to support data aggregations and joins for different data sources — and must be able to cater to a variety of data formats.

Presentation/consumption

At the core of a real-time analytics solution is a presentation layer to showcase the processed data in the data pipeline. When designing a real-time architecture, keep this step at the forefront as it’s ultimately the end goal of the real-time analytics pipeline. 

This layer provides analytics across the business for all users through purpose-built analytics tools that support analysis methodologies such as SQL, batch analytics, reporting dashboards, and machine learning. This layer is essentially responsible for:

  • Providing visualization of large volumes of data in real time
  • Directly querying data from big stores, like data lakes and warehouses 
  • Turning data into actionable insights using machine learning models that help businesses deliver quality brand experiences 

What Are the Key Characteristics of a Real-Time Analytics System? 

For a system to truly support real-time analytics, it must have specific characteristics, including: 

Low latency

In a real-time analytics system, latency refers to the time between when an event arrives in the system and when it is processed. This includes both computer processing latency and network latency. To ensure rapid data analysis, the system must operate with low latency. “Businesses can access the most accurate data since the system responds quickly and has minimal latency,” says Chow. 

High availability

Availability refers to a real-time analytics system’s ability to perform its function when needed. High availability is crucial because without it:

  • The system cannot instantly process data
  • The system will find it hard to store data or use a buffer for later processing, particularly with high-velocity streams 

Chow adds, “High availability guarantees uninterrupted operation.” 

Horizontal scalability 

Finally, a key characteristic of a successful real-time analytics system is horizontal scalability. This means the system can increase capacity or enhance performance by adding more servers to the existing pool. In cases where you cannot control the rate of data ingress, horizontal scalability becomes crucial, as it allows you to adjust the system’s size to handle incoming data effectively. “When the business adds more servers, the horizontal scalability feature of the system increases its flexibility even more by enabling it to handle more data and users,” shares Chow. “When combined, these characteristics ensure the system’s scalability, speed, and reliability as the business grows.” 

According to Rudakov, these three capabilities are crucial for several reasons. “[Low latency is important] because in order to move data for reasons above the operator needs data to get triggered ASAP with lowest latency possible,” he says. “Secondly, the system needs to be redundant with recovery support so that if it fails it comes back quickly and has no data loss. Finally, if the data is not moving fast enough, the operator needs to be able to easily scale the data moving system, i.e. add parallel components into the pipeline and add nodes into the cluster.” 

Rudakov adds that’s exactly why Striim is the right choice for a real-time analytics platform. “Striim provides all real time platform necessary elements described above: low latency pipeline controls such as CDC readers to read data in real time from database logs, recovery, batch policies, ability to run pipelines in parallel and finally multi-node cluster to support HA and scalability,” he says. “Additionally, it supports an easy drag and drop interface to create pipelines in a simple SQL based language (TQL).” 

Benefits of Real-Time Analytics

There are countless benefits of real-time analytics. Some include: 

To Optimize the Customer Experience 

According to an IBM/NRF report, post-pandemic customer expectations around online shopping have evolved considerably. Consumers now seek hybrid services that help them move seamlessly from one channel to another, such as buy online, pickup in-store (BOPIS), or order online with home delivery. One in four consumers, the report found, wants to shop this hybrid way. 

In order to enable this, however, retailers must access real-time analytics to move data from their supply chain to the relevant departments. Organizations today need to monitor their rapidly changing contexts 24/7. They need to process and analyze cross-channel data immediately. Just consider how Macy’s leveraged Striim to improve operational efficiency and create a seamless customer experience. “In many scenarios, businesses need to act in real time and if they don’t their revenue and customers get impacted,” says Rudakov. 

Real-time analytics also enhances personalization. It enables brands to deliver tailored content to consumers based on their actions on channels like websites, mobile apps, SMS, or email—instantly.

“Having access to real-time data allows a retail store to quickly respond to changes in demand for a certain item by adjusting inventory levels, launching focused marketing campaigns, or adjusting pricing techniques,” says Chow. “Similarly, companies may move quickly to address potential problems—like a drop in website performance or a decrease in consumer satisfaction—and mitigate negative consequences before they escalate.” 

To Stay Proactive and Act Quickly 

Another way businesses can leverage real-time analytics is to stay proactive and act quickly in case of an anomaly, such as with fraud detection. Unfortunately, fraud is a reality for innumerable businesses, regardless of size. However, real-time analytics can help organizations identify theft, fraud, and other types of malicious activities. Because of this, leveraging real-time analytics is a powerful way to ensure your business is staying proactive and able to move quickly if something goes wrong. 

This is especially important as malicious online activity has surged over the past few years. Consumers lost more than $10 billion to fraud in 2023, according to the Federal Trade Commission.

“At some point a major credit card company used our platform to read network access logs and call an ML model to detect hacker attempts on their network,” shares Rudakov. 

For example, companies can combine real-time analytics with machine learning and Markov modeling. Markov models identify unusual patterns and estimate the likelihood that a transaction is fraudulent; if a transaction shows signs of unusual behavior, it gets flagged. 
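A hedged sketch of the Markov idea: score a sequence of transaction categories against learned transition probabilities, so that rare transitions drive the score down. The probabilities below are invented for illustration; a real system would estimate them from historical data.

```python
# Invented transition probabilities between transaction categories.
P = {
    ("grocery", "grocery"): 0.7,
    ("grocery", "fuel"): 0.25,
    ("grocery", "wire_transfer"): 0.05,
    ("fuel", "grocery"): 0.9,
    ("fuel", "wire_transfer"): 0.1,
}

def sequence_likelihood(events, floor=0.01):
    """Multiply transition probabilities; unseen transitions get `floor`."""
    score = 1.0
    for prev, cur in zip(events, events[1:]):
        score *= P.get((prev, cur), floor)
    return score

normal = ["grocery", "fuel", "grocery", "grocery"]
odd = ["grocery", "wire_transfer", "wire_transfer"]
print(sequence_likelihood(normal) > sequence_likelihood(odd))  # True
```

A fraud system would flag sequences whose likelihood falls below a tuned threshold, then route them for review — all within the latency budget of the real-time pipeline.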

To Improve Decision-Making

Using up-to-date information allows organizations to see what they are doing well and build on it. Conversely, it allows them to identify pitfalls and determine how to fix them. 

For instance, if a piece of machinery isn’t working optimally in a manufacturing plant, real-time analytics can collect this data from sensors and generate data-driven insights that can help technicians resolve it. 

Real-time Use Cases in Different Industries

The benefits of real-time analytics vary just as the use cases do. Let’s walk through several use cases of real-time analytics platforms. 

Supply chain

Real-time analytics in supply chain management can enable better decision-making. Managers can view real-time dashboard data to oversee the supply chain and strategize demand and supply. “Management of the supply chain is another example [of a real-time analytics use case]. By monitoring shipments and inventory data, real-time analytics allow companies to quickly fix delays or shortages,” says Chow. 

Some of the other ways real-time analytics can help organizations include:

  • Feed live data to route-planning algorithms in the logistics industry. These algorithms analyze real-time data on roadway traffic patterns, weather conditions, and fuel consumption to optimize routes and save time. 
  • Use aggregation of real-time data from fuel-level sensors to resolve fuel issues faced by drivers. These sensors can provide data on fuel level volumes, consumption, and dates of refills. 
  • Collect real-time data from electronic logging devices (ELD) to study driver behavior and improve it. This data provides valuable insights into driving patterns, enabling fleet managers to implement targeted training and safety measures 

Finance

In certain industries, such as commodities trading, market fluctuations require organizations to be agile. Real-time analytics can help in these scenarios by intercepting changes and empowering organizations to adapt to rapid market fluctuations. Financial firms can use real-time analytics to analyze different types of financial data, such as trading data, market prices, and transactional data. 

Consider the case of Inspyrus (now MineralTree), a fintech company seeking to improve accounts payable operations for businesses. The company wanted to ensure its users could get a real-time view of their transactional data from invoicing reports. However, their existing stack was unable to support real-time analytics, which meant that it took a whole hour for data updates, whereas some operations could even take weeks. There were also technical issues with moving data from an online transaction processing (OLTP) database to Snowflake in real time. 

By utilizing Striim, Inspyrus ingested real-time data from an OLTP database, loaded it into Snowflake, and transformed it there. It then used an intelligence tool to visualize this data and create rich reports for users. As a result, Inspyrus users are able to view reports in real time and utilize insights immediately to fuel better decisions. 

Use Striim to power your real-time analytics infrastructure

Your real-time analytics infrastructure can only be as good as the tool you use to support it. Striim is a unified real-time data integration and streaming platform that enables real-time analytics and offers a range of benefits in this regard. It can help you:

  • Collect data non-intrusively, securely, and reliably, from operational sources (databases, data warehouses, IoT, log files, applications, and message queues) in real time
  • Stream data to your cloud analytics platform of choice, including Google BigQuery, Microsoft Azure Synapse, Databricks Delta Lake, and Snowflake
  • Offer data freshness SLAs to build trust among business users
  • Perform in-flight data processing such as filtering, transformations, aggregations, masking, enrichment, and correlations of data streams with an in-memory streaming SQL engine
  • Create custom alerts to respond to key business events in real time

When seeking a real-time analytics platform, look no further than Striim. Connecting clouds, data, and applications, Striim lets you tap hundreds of enterprise sources while supporting data enrichment, complex in-flight data transformations, and more. “Striim uses log-based Change Data Capture (CDC) technology to capture real-time changes from the source database and continuously replicate the data in-memory to multiple target systems, all without disrupting the source database’s operation,” says Chow. 

Ready to discover how Striim can help evolve how you process data? Sign up for a demo today.

Driving Retail Transformation: How Striim Powers Seamless Cloud Migration and Data Modernization

https://www.striim.com/blog/driving-retail-transformation-how-striim-powers-seamless-cloud-migration-and-data-modernization/ Tue, 20 Aug 2024 14:34:26 +0000

In today’s fast-paced retail environment, digital transformation is essential to stay competitive. One powerful way to achieve this transformation is by modernizing data architecture and migrating to the cloud. There are countless ways to leverage Striim, but this is one of the most exciting: the platform offers large retailers the tools they need to seamlessly transition from legacy systems to a more agile, cloud-based infrastructure.

Retailers often face the challenge of managing tremendous amounts of data, typically stored in cumbersome on-premises systems. Striim helps retailers liberate their data by tackling two significant areas: 

  • Enabling a data mesh for enhanced self-service analytics
  • Migrating from legacy systems, like Oracle Exadata, to Google Cloud

Let’s explore why these initiatives are imperative for retailers and how Striim plays a pivotal role in driving this transformation. 

Why Are These Initiatives Important?

For retailers, modernizing data architecture is not just about upgrading technology—it’s about empowering teams with better, faster access to data while future-proofing their infrastructure. Striim facilitates this transformation by enabling the implementation of a data mesh and supporting the migration to Google Cloud.

The data mesh approach decentralizes data management, making it easier for various teams across an organization to perform self-service analytics and derive actionable insights. This shift promotes a more collaborative and agile data culture, ultimately boosting business agility and responsiveness.

Migrating to Google Cloud, on the other hand, provides retailers with a scalable, flexible infrastructure that can handle increasing volumes of data. Striim’s real-time data integration ensures a smooth and seamless transition, minimizing disruptions and maintaining data integrity throughout the process.

Why Retailers Choose Striim

Many retailers are transitioning to Google Cloud, and managing real-time data migration presents a significant challenge across the industry. To address this, organizations require a robust, enterprise-grade solution for change data capture (CDC) to fill the gaps in their existing tools. After evaluating various options, many choose to move forward with proof of concept projects using Striim, confident in its ability to meet their needs and drive successful data transformation.

Striim is equipped to handle the complexities of modern retail environments, making it the leading choice for enterprises looking to enhance their data infrastructure. Whether it’s enabling a data mesh, supporting cloud migrations, or modernizing legacy systems, Striim provides the real-time data movement capabilities needed to drive successful digital transformation.

By leveraging Striim, retailers can ensure that their data transformation projects are not only effective but also aligned with their broader business goals.

Architecture and Striim’s Role 

Retailers transitioning to Google Cloud often require real-time data movement from their existing systems, such as Oracle databases, to cloud-based platforms like Google BigQuery. The typical architecture involves:

  • CDC Adapter: This component captures changes from source databases, ensuring that all data modifications are efficiently tracked and recorded. For instance, Striim’s Oracle CDC Adapter captures changes from source Oracle databases, including the Retail Management System, Warehouse Management System, and Warehouse Execution System.
  • Cloud Integration Writer: This component pushes captured data in real-time to cloud targets, making it available for analysis as soon as it is generated. An example of this is BigQuery Writer, which pushes the captured data to BigQuery targets in real time.
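The CDC flow described above can be sketched as follows — a hypothetical consumer that applies insert/update/delete change events to a target table held as a dict. The event shape is invented for illustration; Striim’s adapters and writers handle this declaratively.

```python
# Illustrative CDC apply loop: replay change events against a target.
def apply_change(table, event):
    op, key, row = event["op"], event["key"], event.get("row")
    if op in ("insert", "update"):
        table[key] = row      # upsert the new row image
    elif op == "delete":
        table.pop(key, None)  # remove the deleted key
    return table

target = {}
changes = [
    {"op": "insert", "key": 1, "row": {"sku": "A1", "qty": 10}},
    {"op": "update", "key": 1, "row": {"sku": "A1", "qty": 7}},
    {"op": "insert", "key": 2, "row": {"sku": "B2", "qty": 4}},
    {"op": "delete", "key": 2},
]
for ev in changes:
    apply_change(target, ev)
print(target)  # {1: {'sku': 'A1', 'qty': 7}}
```

Replaying the change log in order keeps the target in sync with the source without re-copying whole tables — the core reason CDC scales where nightly batch loads do not.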

This architecture supports key objectives for retailers:

  • Data Mesh Integration: By incorporating real-time data from operational systems into a data mesh, retailers ensure that stakeholders have access to up-to-date information, enhancing decision-making and analytics capabilities.
  • Cloud Migration Support: Continuous data movement from on-premises systems to cloud environments facilitates the transition to a scalable, flexible infrastructure capable of handling increasing volumes of data.

Striim’s advanced data integration capabilities streamline the migration process and improve data management efficiency, making it a valuable asset for retailers aiming to modernize their data architecture and migrate to the cloud.

Applicability to Other Use Cases

Striim’s capabilities highlight its value for various enterprise data transformation efforts, including:

  • Enabling Data Mesh Architectures: Striim provides the real-time data integration layer needed to populate domain-specific data products within a data mesh, ensuring that data is readily accessible across the organization.
  • Cloud Migrations: For organizations moving from on-premises databases to cloud data warehouses, Striim offers low-latency, continuous data replication to maintain synchronization between source and target systems.
  • Legacy System Modernization: Striim supports the transition from legacy systems by replicating data to modern cloud platforms in real time, facilitating a gradual and efficient modernization process.
  • Real-Time Analytics: By continuously streaming operational data to analytics platforms, Striim enables fresher insights and more timely decision-making.
  • Transformation Capabilities: By leveraging Striim, your team gains access to real-time transformation, allowing you to process and adapt data dynamically. Striim’s powerful transformation engine supports complex operations such as enrichment, filtering, and aggregation, ensuring your data is instantly optimized and ready for immediate use. 
  • Ease of Scalability: Striim was designed with scalability in mind, so regardless of how your team’s data volume increases, you can count on Striim for reliable performance. 

Striim’s real-time data integration is a crucial element for successful data transformation initiatives. Whether your organization is implementing a data mesh, migrating to the cloud, or modernizing its data stack, Striim provides the data movement capabilities essential for achieving effective digital transformation. Ready to discover how Striim can help drive your retail transformation? Request a demo to learn more. 
