Morrisons Updates Data Infrastructure to Drive Real-Time Insights and Improve Customer Experience
Tue, 22 Oct 2024

Morrisons, a leading UK-based supermarket chain, is modernizing its data infrastructure to support real-time insights and operational efficiency. By embracing advanced data integration capabilities, Morrisons is transitioning to a more agile, data-driven approach. This shift allows the company to optimize processes, enhance decision-making, and ultimately improve the overall customer experience across its stores and online platforms.

About Morrisons

Morrisons is one of the UK’s largest supermarket chains, with over 100 years of experience in the food retail industry. Proudly based in Yorkshire, it serves customers across the UK through a network of nearly 500 conveniently located supermarkets and various online home delivery channels. With a commitment to quality, Morrisons sources fresh produce directly from over 2,700 farmers and growers, ensuring customers receive the best products. Dedicated to sustainability and community engagement, Morrisons continually invests in innovative solutions to enhance operations and improve the shopping experience.

Challenge 

Morrisons set out to modernize its data infrastructure to achieve five key goals:

  1. Elevating Customer Experience: Creating a better shopping experience for customers.
  2. Loading to Google Cloud: Transitioning to Google Cloud and leveraging Looker for enhanced reporting capabilities.
  3. Accessing Real-Time Data: Shifting from batch processing to real-time data access, enabling faster decision-making and improved operational efficiency.
  4. Enhancing Picking Efficiency: Streamlining the online picking process by improving stock visibility across depots and warehouses. 
  5. Improving On-Shelf Availability: Ensuring products are consistently in stock and accessible to customers.

To meet these goals, the team needed to move away from their legacy Oracle Exadata data warehouse and strategically standardize on Google Cloud. This involved transitioning their data to Google BigQuery as the new centralized data warehouse, which required not only propagating data but also ensuring real-time access for better decision-making and operational efficiency. Prior to this transition, Morrisons had never had a centralized repository of real-time data; it had only ever received batch snapshots from its disparate systems.

“Retail is real-time. We have our online shop open 24/7, and we have products moving around our distribution network every minute of every day. It’s really important that we have a real-time view of how our business is operating,” shares Peter Laflin, Chief Data Officer at Morrisons. 

In order to accomplish this, Morrisons needed a tool that could connect their separate systems and seamlessly move data into Google Cloud. Striim was selected to ingest critical datasets, including the Retail Management System (RMS), which holds vast store transaction data and key reference tables, and the Warehouse Management Systems (WMS), which oversee operations across 14 distribution depots. The integration of these systems into BigQuery in real time provided critical visibility into product availability, stock levels, and core business metrics such as waste and shrinkage. Most importantly, Morrisons needed this mission-critical data delivered in real time. 

“We’ve moved from a world where we have batch-processing to a world where, within two minutes, we know what we sold and where we sold it,” shares Laflin. “That empowers senior leaders, colleagues in stores, colleagues across our logistics and manufacturing sites to understand where we are as a business right now. Real-time data is not a nice to have, real-time data is an absolute essential to run a business the scale and size of ours.” 

Morrisons sought to move away from their existing analytics suite and leverage Google Looker for their reporting and analytics needs. This meant they had to regenerate all existing reports that previously ran on the Exadata platform, aligning them with the new Google Cloud infrastructure. Striim played a critical role in centralizing their data in BigQuery and delivering it in real time, enabling Morrisons to power their reporting with fresh insights. This transformation is key to achieving their goal of a more agile, data-driven operation and supporting future business initiatives.

Solution 

Morrisons now leverages Striim to connect disparate systems and ingest critical datasets from their Oracle databases into Google Cloud, using BigQuery as their new centralized data warehouse. They required a solution that could seamlessly load data from multiple sources while providing real-time access through BigQuery, and Striim provides this. 

Striim plays a pivotal role in ingesting two core databases: the Retail Management System (RMS) and the Warehouse Management System (WMS). The RMS, a vast dataset containing store transaction tables and key reference data, requires efficient data transfer to minimize latency, and Striim ensures that this high volume of data is processed seamlessly.

Striim also ingests data from all 14 distribution depots, which are connected through 28 sources in the WMS. This integration provides real-time visibility into stock levels, enabling ‘live-pick’ decision-making by revealing what stock is available, where it is located, and at what time. Backed by real-time intelligence, this capability accelerates business processes that were previously reliant on periodic batch updates. As a result, Morrisons can optimize the replenishment process and ensure that shelves remain well-stocked, ultimately improving overall efficiency and increasing customer satisfaction.

Striim’s real-time data delivery powers Morrisons’ reporting transformation as they rebuild all reporting within Google Looker. By centralizing and accelerating the flow of data into BigQuery in real time, Striim enables faster, actionable insights that drive operational excellence and future business initiatives. “My team felt that Striim was the only tool that could deliver the requirements that we have,” shares Laflin.

Outcome 

By leveraging Striim to transition from batch processing to real-time data access, Morrisons has significantly enhanced their ability to track and manage three critical key performance indicators (KPIs): availability, waste, and shrinkage. With access to faster, real-time insights, executives can more effectively identify risks and implement strategies to mitigate them, ultimately leading to improved operational decision-making and better performance across the organization. This shift allows Morrisons to optimize their processes and drive positive outcomes related to these key metrics.

“Without Striim, we couldn’t create the real-time data that we then use to run the business,” shares Laflin. “It’s a very fundamental part of our architecture.” 

The move to real-time data has allowed Morrisons to confirm that shelf availability has notably improved, ensuring that products are consistently in stock and accessible to customers. As a result, they are beginning to uncover the full range of benefits that this transformation can bring, including enhanced inventory management and reduced waste.

From the customer perspective, better shelf availability translates into happier shoppers, as they can find the products they want when they visit stores. This improvement not only fosters customer loyalty but also positions Morrisons to compete more effectively in the marketplace, ultimately driving growth and enhancing overall customer satisfaction.

Striim’s Multi-Node Deployments: Ensuring Scalability, High Availability, and Disaster Recovery
Fri, 18 Oct 2024

In today’s enterprise landscape, ensuring high availability, scalability, and disaster recovery is paramount for businesses relying on continuous data flow and analytics. Striim, a leading platform for real-time data integration and streaming analytics, offers multi-node deployments that significantly enhance redundancy while delivering enterprise-grade capabilities for mission-critical workloads. This blog explores how Striim’s multi-node architecture supports these objectives, providing enterprises with a robust solution for high availability, scalability, and disaster recovery, available both as a fully managed cloud service and as a platform that can be deployed in your private cloud or on-premises environment.


Multi-Node Architecture: A Foundation for Enterprise Resilience

At the heart of Striim’s mission-critical platform is its multi-node architecture. Multi-node deployments allow Striim to operate across several interconnected servers or nodes, each handling data processing, streaming, and analytics in tandem. This distributed architecture introduces redundancy, ensuring that even if one node fails, other nodes can continue operations seamlessly. This approach is essential for disaster recovery, high availability, and fault tolerance.


1. Increasing Redundancy and Supporting Scalability

Redundancy is vital in distributed systems because it ensures that multiple copies of data and processing capabilities exist across nodes. Striim’s multi-node deployment increases redundancy by replicating workloads and data across several nodes. This means that in the event of a failure, another node can immediately take over, minimizing downtime and preventing data loss.

Additionally, Striim supports horizontal scalability. As data volumes grow—whether due to business expansion, increasing IoT devices, or heightened customer interactions—additional nodes can be added to the cluster to distribute the processing load. This ensures that the system can handle increasing demand without performance degradation, maintaining the ability to process millions of events per second across a distributed cluster.
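The core idea behind this horizontal scaling can be sketched in a few lines of Python. This is not Striim’s internal partitioning logic, just the general technique of hashing event keys across a growing pool of worker nodes (the node names and the `partition` helper are invented for illustration):

```python
import hashlib

def partition(event_key: str, nodes: list[str]) -> str:
    """Route an event to a node by hashing its key."""
    digest = int(hashlib.sha256(event_key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-1", "node-2", "node-3"]

# Events with the same key always land on the same node, which preserves
# per-key ordering while spreading load across the cluster.
assert partition("order-42", nodes) == partition("order-42", nodes)

# Scaling out is just adding a node; new events now spread over four workers.
nodes.append("node-4")
assert partition("order-42", nodes) in nodes
```

Production systems typically use consistent hashing so that adding a node remaps only a fraction of keys; plain modulo hashing is shown here for brevity.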

2. High Availability Through Node Redundancy and Failover Mechanisms

For business-critical workloads, any downtime or data loss can have serious consequences. Striim addresses this concern by delivering high availability (HA) through node redundancy and automatic failover mechanisms. In a multi-node deployment, each node holds redundant copies of data and processing logic, ensuring that if one node fails, another can take over instantly without interrupting data flow.

Striim’s built-in failover automatically shifts workloads from a failed node to a functioning one, maintaining continuous service for real-time applications. This is critical for systems that demand high uptime, such as financial transactions, customer-facing dashboards, or logistics monitoring. Furthermore, Striim guarantees exactly-once processing, ensuring data integrity during node transitions and preventing duplicate or missed data events.

To provide a simple, declarative construct for node management and failover, Striim offers Deployment Groups which represent a group of one or more nodes with its own application and resource configurations. You can deploy Striim Apps to a Deployment Group, and that Deployment Group governs the runtime and resilience of the application. 
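As a rough mental model of that construct (a toy sketch, not Striim’s actual API; the class and method names below are invented for illustration), a deployment group ties applications to a pool of nodes and reassigns them when a node is lost:

```python
class DeploymentGroup:
    """Toy model: apps are placed on member nodes and automatically
    reassigned to surviving nodes when a node fails."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.assignments = {}  # app name -> node name

    def deploy(self, app):
        # Place the app on the least-loaded node in the group.
        load = lambda n: sum(1 for v in self.assignments.values() if v == n)
        self.assignments[app] = min(self.nodes, key=load)

    def node_failed(self, node):
        # Failover: every app on the failed node moves to a survivor.
        self.nodes.remove(node)
        for app, assigned in list(self.assignments.items()):
            if assigned == node:
                self.deploy(app)

group = DeploymentGroup(["node-1", "node-2"])
group.deploy("oracle-to-bigquery")
group.node_failed(group.assignments["oracle-to-bigquery"])
assert group.assignments["oracle-to-bigquery"] in group.nodes
```

The real platform additionally restores in-flight state on the new node (see the exactly-once guarantees above), which this sketch omits.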


3. Disaster Recovery with Multi-Region and Cross-Cloud Support

In addition to failover, Striim’s multi-node deployment enhances disaster recovery (DR) by replicating data and services across geographically distributed nodes or across clouds. Enterprises can configure active-active or active-passive DR setups to quickly recover from catastrophic failures. By distributing nodes across multiple regions or clouds, Striim ensures that if one region experiences an outage, another can take over seamlessly, ensuring business continuity.

Striim’s cross-cloud capabilities offer additional flexibility, allowing organizations to distribute their infrastructure across different cloud providers. This architecture ensures resilience even in the face of regional outages, ensuring rapid recovery and reducing the risk of data loss. Additionally, Striim’s Change Data Capture (CDC) ensures that data is continuously synchronized between nodes, keeping all data consistent and up-to-date across the entire system.

Integrating Multi-Node Capabilities with In-Memory Technology

To provide real-time data streaming and analytics efficiently, Striim relies heavily on in-memory technology. Striim’s architecture allows for data to be cached in an in-memory data grid, enabling rapid data access without the latency of disk I/O. However, ensuring all nodes can process this data without time-consuming remote calls requires a tightly integrated design.

Striim’s multi-node deployment ensures that all system components—data streaming, in-memory storage, and real-time analytics—operate in the same memory space. This eliminates the need for costly remote calls, allowing for rapid joins and analytics on streaming data. By leveraging in-memory processing across a distributed cluster, Striim ensures that the system remains both highly performant and scalable, even under high data loads.

Security Across Nodes and Clusters

As enterprises scale their data processing across multiple nodes and regions, maintaining security becomes increasingly important. Striim addresses this need by employing a holistic, role-based security model that spans the entire architecture. Whether it’s securing individual data streams, protecting sensitive data in motion, or managing access to management dashboards, Striim provides comprehensive security across all nodes and processes in both Striim Cloud and the on-premises Striim Platform.

This centralized approach to security simplifies the task of managing access controls, especially in distributed systems where data and processes are spread across multiple locations. Striim’s role-based model ensures that all security policies are consistently applied across the entire system, reducing the risk of vulnerabilities while maintaining compliance with industry regulations.

Conclusion: Simplifying Enterprise-Grade Data Streaming

Striim’s multi-node deployments provide enterprises with a powerful, scalable, and resilient platform for real-time data streaming and analytics. By increasing redundancy, ensuring high availability through failover mechanisms, and supporting disaster recovery with multi-region and cross-cloud configurations, Striim enables businesses to maintain continuous operations even in the face of unexpected failures.

With Striim, enterprises can focus on deriving insights from their data without the need to invest in complex infrastructures or develop intricate disaster recovery strategies. Striim’s platform takes care of the complexities of distributed processing, in-memory analytics, and security, ensuring that business-critical workloads run smoothly and efficiently at scale.

By offering a unified solution for real-time data integration and streaming analytics, Striim empowers businesses to meet the demands of today’s data-driven world while maintaining the resilience and agility necessary to thrive in a competitive environment.

 

What is a Data Pipeline (and 7 Must-Have Features of Modern Data Pipelines)
Fri, 11 Oct 2024

A well-executed data pipeline can make or break your company’s ability to leverage real-time insights and stay competitive. Thriving in today’s world requires building modern data pipelines that make moving data and extracting valuable insights quick and simple.

Today, we’ll answer the question, “What is a data pipeline?” Then, we’ll explore a data pipeline example and dive deeper into the key differences between a traditional data pipeline vs ETL. 

What is a Data Pipeline?

A data pipeline refers to a series of processes that transport data from one or more sources to a destination, such as a data warehouse, database, or application. These pipelines are essential for managing and optimizing the flow of data, ensuring it’s prepared and formatted for specific uses, such as analytics, reporting, or machine learning. 

Throughout the pipeline, data undergoes various transformations such as filtering, cleaning, aggregating, enriching, and even real-time analysis. These steps guarantee that data is accurate, reliable, and meaningful by the time it reaches its destination, making it possible for teams to generate insights and make data-driven decisions. 

In addition to the individual steps of a pipeline, data pipeline architecture refers to how the pipeline is designed to collect, flow, and deliver data effectively. This architecture can vary based on the needs of the organization and the type of data being processed. There are two primary approaches to moving data through a pipeline:

  • Batch processing: In batch processing, batches of data are moved from sources to targets on a one-time or regularly scheduled basis. Batch processing is the tried-and-true legacy approach to moving data, but it doesn’t allow for real-time analysis and insights, which is its primary shortcoming. 
  • Stream processing: Stream processing enables real-time data movement by continuously collecting and processing data as it flows, which is crucial for applications needing immediate insights like monitoring or fraud detection. Change Data Capture (CDC) plays a key role here by capturing and streaming only the changes (inserts, updates, deletes) in real time, ensuring efficient data handling and up-to-date information across systems. As a result, stream processing makes real-time business intelligence feasible. 
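To make the contrast concrete, here is a minimal Python sketch of the CDC idea: rather than reloading a full batch snapshot, the target is kept current by applying each change event (insert, update, delete) as it arrives. The event shape is an assumption made for the example, not any particular CDC tool’s format:

```python
def apply_change(table: dict, event: dict) -> None:
    """Apply one CDC-style change event to an in-memory target table."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        table[key] = event["row"]
    elif op == "delete":
        table.pop(key, None)

table = {}
events = [
    {"op": "insert", "key": 1, "row": {"sku": "A", "qty": 10}},
    {"op": "update", "key": 1, "row": {"sku": "A", "qty": 7}},
    {"op": "insert", "key": 2, "row": {"sku": "B", "qty": 3}},
    {"op": "delete", "key": 2},
]
for e in events:  # in stream processing, each event is applied on arrival
    apply_change(table, e)

assert table == {1: {"sku": "A", "qty": 7}}
```

Because only the changes flow through the pipeline, the target stays continuously in sync without the cost or delay of re-extracting the whole source.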

Why are Data Pipelines Significant? 

Now that we’ve answered the question, ‘What is a data pipeline?’, we can dive deeper into the essential role pipelines play. Data pipelines are significant to businesses because they: 

  • Consolidate Data: Data pipelines are responsible for integrating and unifying data from diverse sources and formats, making it consistent and usable for analytics and business intelligence.
  • Enhance Accessibility: Thanks to data pipelines, you can provide team members with necessary data without granting direct access to sensitive production systems.
  • Support Decision-Making: When you ensure that clean, integrated data is readily available, you facilitate informed decision-making and boost operational efficiency.

What is a Data Pipeline Example? 

As this data pipeline example shows, the complexity and design of a pipeline vary depending on its intended use. For instance, Macy’s streams change data from on-premises databases to Google Cloud. As a result, customers enjoy a unified experience whether they’re shopping in a brick-and-mortar store or online. 

[Diagram: Data architecture for Macy’s]

Another excellent data pipeline example is American Airlines’ work with Striim. Striim supported American Airlines by implementing a comprehensive data pipeline solution to modernize and accelerate operations. 

To achieve this, the TechOps team implemented a real-time data hub using MongoDB, Striim, Azure, and Databricks to maintain seamless, large-scale operations. This setup uses change data capture from MongoDB to capture operational data in real time, then processes and models it for downstream systems. The data is streamed in real time to end users, delivering valuable insights to TechOps and business teams, allowing them to monitor and act on operational data to enhance the customer travel experience.

This data pipeline diagram illustrates how it works: 

[Data pipeline diagram]

Data Pipeline vs ETL: What’s the Difference? 

You’re likely familiar with the term ‘ETL data pipeline’ and may be curious to learn the difference between a traditional data pipeline vs ETL. In actuality, ETL pipelines are simply a form of data pipeline. To understand an ETL data pipeline fully, it’s imperative to understand the process that it entails. 

ETL stands for Extract, Transform, Load. This process involves: 

  • Extraction: Data is extracted from a source or multiple sources. 
  • Transformation: Data is processed and converted into the appropriate format for the target destination — often a data warehouse or lake. 
  • Loading: The loading phase involves transferring the transformed data into the target system where your team can access it for analysis. It’s now usable for various use cases, including for reporting, insights, and decision-making.
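The three steps above can be sketched as plain Python functions. This is a toy in-memory example to show the shape of the process, not any particular ETL tool:

```python
def extract(rows):
    """Extract: read raw records from a source (here, an in-memory list)."""
    yield from rows

def transform(rows):
    """Transform: clean and reshape records for the target schema."""
    for row in rows:
        yield {"name": row["name"].strip().title(), "amount": float(row["amount"])}

def load(rows, target):
    """Load: write transformed records into the target store."""
    target.extend(rows)

warehouse = []
source = [{"name": "  alice ", "amount": "12.50"}, {"name": "BOB", "amount": "3"}]
load(transform(extract(source)), warehouse)

assert warehouse == [{"name": "Alice", "amount": 12.5},
                     {"name": "Bob", "amount": 3.0}]
```

Because the steps are generators, records flow through one at a time; the same chaining pattern is how streaming platforms avoid staging the full dataset between steps.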

A traditional ETL data pipeline typically involves disk-based processing, which can lead to slower transformation times. This approach is suitable for batch processing where data is processed at scheduled intervals, but may not meet the needs of real-time data demands.

While legacy ETL has a slow transformation step, modern ETL platforms, like Striim, have evolved to replace disk-based processing with in-memory processing. This advancement allows for real-time data transformation, enrichment, and analysis, providing faster and more efficient data processing. Striim, for example, handles data in near real-time, enabling quicker insights and more agile decision-making.

Now, let’s dive into the seven must-have features of modern data pipelines. 

7 Must-Have Features of Modern Data Pipelines 

To create an effective modern data pipeline, incorporating these seven key features is essential. Though not an exhaustive list, these elements are crucial for helping your team make faster and more informed business decisions.

1. Real-Time Data Processing and Analytics

The number one requirement of a successful data pipeline is its ability to load, transform, and analyze data in near real time, which enables businesses to act quickly on insights. To begin, it’s essential that data is ingested without delay from multiple sources, ranging from databases and IoT devices to messaging systems and log files. For databases, log-based Change Data Capture (CDC) is the gold standard for producing a stream of real-time data.

Real-time, continuous data processing is superior to batch-based processing because the latter takes hours or even days to extract and transfer information. Because of this significant processing delay, businesses are unable to make timely decisions, as data is outdated by the time it’s finally transferred to the target. This can result in major consequences. For example, a lucrative social media trend may rise, peak, and fade before a company can spot it, or a security threat might be spotted too late, allowing malicious actors to execute on their plans. 

Real-time data pipelines equip business leaders with the knowledge necessary to make data-fueled decisions. Whether you’re in the healthcare industry or logistics, being data-driven is equally important. Here’s an example: Suppose your fleet management business uses batch processing to analyze vehicle data. The delay between data collection and processing means you only see updates every few hours, leading to slow responses to issues like engine failures or route inefficiencies. With real-time data processing, you can monitor vehicle performance and receive instant alerts, allowing for immediate action and improving overall fleet efficiency.

2. Scalable Cloud-Based Architecture

Modern data pipelines rely on scalable, cloud-based architecture to handle varying workloads efficiently. Unlike traditional pipelines, which struggle with parallel processing and fixed resources, cloud-based pipelines leverage the flexibility of the cloud to automatically scale compute and storage resources up or down based on demand.

In this architecture, compute resources are distributed across independent clusters, which can grow both in number and size quickly and infinitely while maintaining access to a shared dataset. This setup allows for predictable data processing times as additional resources can be provisioned instantly to accommodate spikes in data volume.

Cloud-based data pipelines offer agility and elasticity, enabling businesses to adapt to trends without extensive planning. For example, a company anticipating a summer sales surge can rapidly increase processing power to handle the increased data load, ensuring timely insights and operational efficiency. Without such elasticity, businesses would struggle to respond swiftly to changing trends and data demands.

3. Fault-Tolerant Architecture

It’s possible for data pipeline failure to occur while information is in transit. Thankfully, modern pipelines are designed to mitigate these risks and ensure high reliability. Today’s data pipelines feature a distributed architecture that offers immediate failover and robust alerts for node, application, and service failures. Because of this, we consider fault-tolerant architecture a must-have. 


In a fault-tolerant setup, if one node fails, another node within the cluster seamlessly takes over, ensuring continuous operation without major disruptions. This distributed approach enhances the overall reliability and availability of data pipelines, minimizing the impact on mission-critical processes.

4. Exactly-Once Processing (E1P)

Data loss and duplication are critical issues in data pipelines that need to be addressed for reliable data processing. Modern pipelines incorporate Exactly-Once Processing (E1P) to ensure data integrity. This involves advanced checkpointing mechanisms that precisely track the status of events as they move through the pipeline.

Checkpointing records the processing progress and coordinates with data replay features from many data sources, enabling the pipeline to rewind and resume from the correct point in case of failures. For sources without native data replay capabilities, persistent messaging systems within the pipeline facilitate data replay and checkpointing, ensuring each event is processed exactly once. This technical approach is essential for maintaining data consistency and accuracy across the pipeline.
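A minimal sketch of that checkpoint-and-replay pattern is shown below. It is illustrative only: a real system persists the offset durably and commits it atomically with the output, which this toy omits:

```python
class CheckpointedConsumer:
    """Resume from the last committed offset so that replaying the
    source after a failure never produces duplicate results."""

    def __init__(self):
        self.offset = 0   # would be a durable checkpoint in a real system
        self.output = []

    def process(self, log):
        # Only consume events past the checkpoint; advance it after each one.
        for i in range(self.offset, len(log)):
            self.output.append(log[i])
            self.offset = i + 1

log = ["e1", "e2", "e3"]
c = CheckpointedConsumer()
c.process(log[:2])   # crash after two events...
c.process(log)       # ...then replay the whole log on recovery

assert c.output == ["e1", "e2", "e3"]   # no duplicates, nothing missed
```

The checkpoint is what turns an at-least-once replay (every event redelivered) into exactly-once effects (every event counted once).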

5. Self-Service Management 

Modern data pipelines facilitate seamless integration between a wide range of tools, including data integration platforms, data warehouses, data lakes, and programming languages. This interconnected approach enables teams to create, manage, and automate data pipelines with ease and minimal intervention.

In contrast, traditional data pipelines often require significant manual effort to integrate various external tools for data ingestion, transfer, and analysis. This complexity can lead to bottlenecks when building the pipelines, as well as extended maintenance time. Additionally, legacy systems frequently struggle with diverse data types, such as structured, semi-structured, and unstructured data.

Contemporary pipelines simplify data management by supporting a wide array of data formats and automating many processes. This reduces the need for extensive in-house resources and enables businesses to more effectively leverage data with less effort.

6. Capable of Processing High Volumes of Data in Various Formats 

It’s predicted that the world will generate 181 zettabytes of data by 2025. To get a better understanding of how tremendous that is, consider this — one zettabyte alone is equal to about 1 trillion gigabytes. 

Since unstructured and semi-structured data account for 80% of the data collected by companies, modern data pipelines need to be capable of efficiently processing these diverse data types. This includes handling semi-structured formats such as JSON, HTML, and XML, as well as unstructured data like log files, sensor data, and weather data.

A robust big data pipeline must be adept at moving and unifying data from various sources, including applications, sensors, databases, and log files. The pipeline should support near-real-time processing, which involves standardizing, cleaning, enriching, filtering, and aggregating data. This ensures that disparate data sources are integrated and transformed into a cohesive format for accurate analysis and actionable insights.
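As a small illustration of that unification step, the sketch below maps a semi-structured JSON record and an unstructured log line onto one common schema. The field names and the log-line regex are assumptions made for the example:

```python
import json
import re

LOG_RE = re.compile(r"(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)")

def normalize(record: str, fmt: str) -> dict:
    """Map differently shaped inputs onto a single schema for analysis."""
    if fmt == "json":
        doc = json.loads(record)
        return {"ts": doc["timestamp"], "source": "api", "value": doc["value"]}
    if fmt == "log":
        m = LOG_RE.match(record)
        return {"ts": m["ts"], "source": "log", "value": m["msg"]}
    raise ValueError(f"unsupported format: {fmt}")

rows = [
    normalize('{"timestamp": "2024-10-11T09:00:00Z", "value": 21.5}', "json"),
    normalize("2024-10-11T09:00:01Z INFO sensor heartbeat", "log"),
]
# Every record now shares the same keys, regardless of its original format.
assert all(set(r) == {"ts", "source", "value"} for r in rows)
```

Once records share a schema, the downstream cleaning, enrichment, and aggregation steps can treat all sources uniformly.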

7. Prioritizes Efficient Data Pipeline Development 

Modern data pipelines are crafted with DataOps principles, which integrate diverse technologies and processes to accelerate development and delivery cycles. DataOps focuses on automating the entire lifecycle of data pipelines, ensuring timely data delivery to stakeholders.

By streamlining pipeline development and deployment, organizations can more easily adapt to new data sources and scale their pipelines as needed. Testing becomes more straightforward as pipelines are developed in the cloud, allowing engineers to quickly create test scenarios that mirror existing environments. This allows thorough testing and adjustments before final deployment, optimizing the efficiency of data pipeline development.

Gain a Competitive Edge with Striim 

Data pipelines are crucial for moving, transforming, and storing data, helping organizations gain key insights. Modernizing these pipelines is essential to handle increasing data complexity and size, ultimately enabling faster and better decision-making.

Striim provides a robust streaming data pipeline solution with integration across hundreds of sources and targets, including databases, message queues, log files, data lakes, and IoT. Plus, our platform features scalable in-memory streaming SQL for real-time data processing and analysis. Schedule a demo for a personalized walkthrough to experience Striim.

Joe Reis at Big Data LDN
Fri, 04 Oct 2024

Join us as we sit down with Joe Reis, live at Big Data LDN (London) 2024. Joe shares his partnership with DeepLearning.ai and AWS through his new course on Data Engineering. Joe’s new course promises to elevate your data skills with hands-on exercises that marry foundational knowledge with cutting-edge practices. We dive into how this course complements his seminal book, “Fundamentals of Data Engineering,” and why certification is valuable for those looking for foundational, hands-on knowledge to be a data practitioner.

But that’s not all; we also dissect the hurdles of adopting modern data architectures like data mesh in traditionally siloed companies. Using Conway’s Law as a lens, Joe discusses why businesses struggle to transition from outdated infrastructures to decentralized systems, and why cross-disciplinary skills are crucial in this endeavor, a concept inspired by mixed martial arts that he cleverly calls ‘Mixed Model Arts’.

Check out Joe’s Work:

Fundamentals of Data Engineering book on Amazon: https://a.co/d/8yvabfO
New Coursera courses by Joe Reis:
https://www.coursera.org/instructor/j…

What’s New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What’s New In Data hosts industry practitioners to discuss the latest trends, common patterns in real-world data architectures, and analytics success stories.

Unlocking Actionable Insights: Morrisons’ Digital Transformation with Striim and Google Cloud https://www.striim.com/blog/morrisons-digital-transformation-striim-google/ Thu, 03 Oct 2024 17:13:42 +0000

In the fast-paced world of retail, the ability to harness data effectively is crucial for staying ahead. On September 18, 2024, at Big Data London, Morrisons shared its digital transformation journey through the presentation, “Learn How Morrisons is Accelerating the Availability of Actionable Data at Scale with Google and Striim.”

Peter Laflin, Chief Data Officer at Morrisons, outlined the supermarket chain’s strategic partnership with Striim, a global leader in real-time data integration and streaming, and Google Cloud. This collaboration is pivotal in optimizing Morrisons’ supply chain, improving stock management, and enhancing customer satisfaction through the power of real-time data analytics.

By harnessing Striim’s advanced data platform alongside Google Cloud’s robust infrastructure, Morrisons has effectively integrated and streamlined data from its vast network of over 2,700 farmers and growers supplying raw materials to its manufacturing plants across the UK. This initiative has enabled seamless information flow and real-time visibility across its operations, allowing the supermarket to make quicker, data-driven decisions that directly impact customer experience. Tata Consultancy Services (TCS), Morrisons’ long-standing systems integration partner, has been instrumental in the success of this transformation. TCS worked closely with Morrisons’ teams to ensure the seamless implementation of Striim’s platform, facilitating smooth integration and alignment across operations.

The keynote featured insights from industry experts, including John Kutay, Head of Products at Striim, and Mike Reed, Retail Account Executive at Google, who underscored the transformative impact of innovative data strategies in the retail sector.

As Morrisons continues to embrace this data-driven approach, it sets a new standard for enhancing customer satisfaction and operational efficiency in the competitive retail environment.

Check out the Recap: 

Revolutionizing Data Queries with TextQL: Insights from Co-Founder Ethan Ding https://www.striim.com/blog/revolutionizing-data-queries-with-textql-insights-from-co-founder-ethan-ding/ Fri, 27 Sep 2024 22:16:52 +0000

Can AI really make your data analysis as easy as talking to a friend? Join us for an enlightening conversation with Ethan Ding, the co-founder and CEO of TextQL, as he shares his journey from Berkeley graduate to pioneering the text-to-SQL technology that’s transforming how businesses interact with their data. Discover how natural language queries are breaking down barriers, making data analysis accessible to everyone, regardless of technical skill. Ethan delves into the historical hurdles and the game-changing advancements that are pushing the boundaries of AI and large language models in data querying.

Ever wondered how the quest for full autonomy in self-driving cars relates to data querying? We draw fascinating parallels between these two cutting-edge fields, emphasizing the importance of structured systems over chaotic, AI-driven approaches. This segment reveals the often-overlooked limitations of current data management practices and underscores the critical need for high-quality data and robust modeling. Through a comparison of traditional business intelligence tools and advanced AI-driven solutions, we explore what truly makes data querying effective and insightful.

Hear from Ethan Ding, co-founder and CEO of TextQL, as he explains how their innovative tool integrates seamlessly with existing BI infrastructures, boosting productivity without the need for disruptive overhauls. Tune in to find out how TextQL is making data-driven decisions faster and smarter, paving the way for a future where data is everyone’s best friend.

Follow Ethan Ding and TextQL at:

Training and Calling SGDClassifier with Striim for Financial Fraud Detection https://www.striim.com/blog/training-and-calling-sgdclassifier-with-striim-for-financial-fraud-detection/ Thu, 26 Sep 2024 13:42:58 +0000

In today’s fast-paced financial landscape, detecting transaction fraud is essential for protecting institutions and their customers. This article explores how to leverage Striim and SGDClassifier to create a robust fraud detection system that utilizes real-time data streaming and machine learning.

Problem

Transaction fraud detection is a critical responsibility for the IT teams of financial institutions. According to the 2024 Global Financial Crime Report from Nasdaq, an estimated $485.6 billion was lost to fraud scams and bank fraud schemes globally in 2023. 

AI and ML help detect fraud, while real-time streaming frameworks like Striim play a key role in delivering financial data to reference and train classification models, enhancing customer protection.

Solution 

In this article, I will demonstrate how to use Striim to perform key tasks for fraud detection with machine learning: 

  • Ingest data in real time using a Change Data Capture (CDC) reader, call the model, and deliver alerts to a target such as email, Slack, Teams, or any other target supported by Striim
  • Train the model using a Striim initial load app, and retrain the model if its accuracy score decreases, using automation via REST APIs

Fraud Detection Approach

In typical credit card transactions, a financial institution’s data science team uses supervised learning to label data records as either fraudulent or legitimate. By carefully analyzing the data, engineers can extract key features that define a fraudulent user profile and behavior, such as personal information, number of orders, order content, payment history, geolocation, and network activity. 

For this example, I’m using a dataset from Kaggle, which contains credit card transactions collected from EU retailers approximately 10 years ago. The dataset is already labeled with two classes representing fraudulent and normal transactions. Although the dataset is imbalanced, it serves well for this demonstration. Key fields include purchase value, age, browser type, source, and the class parameter, which indicates normal versus fraudulent transactions.
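To make the shape of this data concrete, here is a minimal pandas sketch. The inline rows and column names are hypothetical stand-ins for the Kaggle file, which you would normally load with pd.read_csv:

```python
import pandas as pd

# Inline stand-in for a few rows of the labeled transaction data
# (column names are illustrative, not the exact Kaggle schema).
df = pd.DataFrame({
    "purchase_value": [34, 16, 220, 45],
    "age": [39, 53, 27, 41],
    "browser": ["Chrome", "Safari", "Chrome", "FireFox"],
    "source": ["SEO", "Ads", "Direct", "SEO"],
    "class": [0, 0, 1, 0],  # 1 = fraudulent, 0 = normal
})

# Fraud datasets are typically heavily imbalanced, so check the class mix first.
print(df["class"].value_counts(normalize=True))
```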

Picking Classification Model 

There are many possibilities for classification using ML. In this example, I evaluated logistic regression and SGDClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html. The main difference is the optimization method: SGDClassifier fits linear classifiers (including logistic regression, when using log loss) incrementally with stochastic gradient descent, whereas scikit-learn’s standard logistic regression uses batch solvers over the full dataset. Many experts consider SGD a more efficient approach for larger datasets, which is why it was selected for this application.

Accuracy Measurement

The accuracy score is a metric that measures how often a model correctly predicts the desired outcome. It is calculated by dividing the total number of correct predictions by the total number of predictions. In an ideal scenario, the best possible accuracy is 100% (or 1). However, due to the challenges of obtaining and diagnosing a high-quality dataset, data scientists typically aim for an accuracy greater than 90% (or 0.9).
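The calculation itself is one line with scikit-learn:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]  # one fraud case missed

# correct predictions / total predictions = 9 / 10
acc = accuracy_score(y_true, y_pred)
print(acc)
```

Keep in mind that on a heavily imbalanced dataset like this one, a model that always predicts "normal" would still score 0.7 on the sample above, so the accuracy target should be read against the class balance of the evaluation set.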

Training Step 

Striim provides the ability to read historical data from various sources, including databases, messaging systems, files, and more. In this case, the historical data is stored in a MySQL database, a highly popular data source in the FinTech industry. Here’s what the architecture looks like with real-time data streaming augmented by ML model training:

 

You can achieve this in Striim with an Initial Load application that has a Database Reader pointed to the transactions table in MySQL and a file target. With Striim’s flexible adapters, data can be loaded from virtually any database of choice into a local file system, ADLS, S3, or GCS.

Once the data load is complete, the application changes its status from RUNNING to COMPLETED. A script, or in this case a PS-made Open Processor (OP), can capture the status change and call the training Python script.

Additionally, I added a step with a CQ (Continuous Query) that allows data scientists to apply any transformations needed to prepare the data in a form suitable for the training process. This step can be easily implemented in Striim’s Flow Designer, which features a drag-and-drop interface along with the ability to code data modifications using a combination of a SQL-like language and utility function calls.

Training and Calling SGDClassifier with Striim for Financial Fraud Detection

Model Reference Step 

Once the model is trained, we can deploy it in a real-time data CDC application that streams user financial transactions from an operational database. The application calls the model’s predict method, and if fraud is detected, it generates and sends an alert. Additionally, it will check the model accuracy and, if needed, initiate the retraining step described above. 

Training and Calling SGDClassifier with Striim for Financial Fraud Detection

Model Reference App Structure 

The flow begins with Striim’s CDC reader, which streams financial transactions directly from the database binary log. It then invokes the classification model trained in the previous step via a REST call. In this case, I am using an OP that executes REST POST calls containing the parsed transaction values needed for prediction. The model service returns the prediction, which is parsed by a query; if fraud is detected, an alert is generated. At the same time, if the model accuracy dips below 90 percent, the Application Manager function can restart the training application, ILMySqlApp, using an internal management REST API.
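To make the contract between the OP and the model service concrete, here is a stdlib-only sketch of what such a service could look like. The stub predict rule, the port, and the JSON field names ("prediction", "accuracy") are assumptions chosen to match what the CQs in the TQL source parse out of sgdOutput; in production, predict would delegate to the persisted SGDClassifier:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_ACCURACY = 0.94  # in production, loaded from the persisted training artifact

def predict(features: dict) -> float:
    """Stub standing in for the trained SGDClassifier's predict()."""
    # Hypothetical rule so the sketch stays self-contained.
    return 1.0 if features.get("purchase_value", 0) > 200 else 0.0

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        # Field names match what the downstream CQs read from sgdOutput.
        body = json.dumps({"prediction": str(predict(features)),
                           "accuracy": str(MODEL_ACCURACY)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 8000) -> None:
    """Run the service the Striim Open Processor POSTs to (blocks forever)."""
    HTTPServer(("localhost", port), PredictHandler).serve_forever()
```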

Final Thoughts on Leveraging SGDClassifier and Striim for Financial Fraud Detection

This example illustrates how a real-world data streaming application can detect fraud by interacting with a classification model. The application sends alerts when fraud is detected using various Striim alert adapters, including email, web, Slack, or database. Furthermore, if the model’s quality deteriorates, it can retrain the model for continued accuracy.

For reference TQL sources:

    CREATE OR REPLACE APPLICATION FraudDetectionApp;

    CREATE OR REPLACE SOURCE TransactionsReader USING Global.MysqlReader ()
    OUTPUT TO transactionsStream;

    CREATE STREAM sgdOutput OF Global.JsonNodeEvent;

    CREATE STREAM FraudAlertStream OF Global.AlertEvent;

    CREATE CQ checkPrediction
    INSERT INTO predStream
    SELECT data.get("prediction").toString() as pred FROM sgdOutput s;

    CREATE OR REPLACE CQ checkModelAccuracy
    INSERT INTO accuracyStream
    SELECT
    data.get("accuracy").toString() as acc
    FROM sgdOutput s;

    CREATE OR REPLACE OPEN PROCESSOR CallSGDClassifier USING Global.RestCallerPOST ( )
    INSERT INTO sgdOutput
    FROM transactionsStream;

    CREATE SUBSCRIPTION AlertAdapter USING Global.WebAlertAdapter (
    isSubscription: 'true' )
    INPUT FROM FraudAlertStream;

    CREATE OR REPLACE CQ generateFraudAlert
    INSERT INTO FraudAlertStream
    SELECT "Company XYZ", "Value", "warning", "raise", "fraud prediction alert on CC transaction"
    FROM predStream p where pred = "1.0";

    CREATE OR REPLACE CQ CallTraining
    INSERT INTO callOutput
    SELECT com.striim.udf.app.ApplicationManager.startApplication("admin.ILMySqlApp")
    FROM accuracyStream a
    where TO_FLOAT(acc) < 0.9;

    END APPLICATION FraudDetectionApp;

    CREATE OR REPLACE APPLICATION ILMySqlApp;

    CREATE SOURCE ProcessorToStartTrainingStep USING Global.PrePostProcess ()
    OUTPUT TO m;

    CREATE OR REPLACE SOURCE MySqlInitLoad USING Global.DatabaseReader ()
    OUTPUT TO myLoadOut;

    CREATE CQ MyTransformationQuery
    INSERT INTO myFileOutput
    SELECT
    to_string(data[0]) as age,
    dnow() as curtime,
    to_string(data[2]) as sourceOfdata,
    to_string(data[0]) as browserType,
    to_string(data[3]) as purchaseValue,
    to_string(data[4]) as FraudClass….
    FROM myLoadOut m;

    CREATE TARGET TrainFileTarget USING Global.FileWriter ( )
    INPUT FROM myFileOutput;

    END APPLICATION ILMySqlApp;
Small Data, Big Impact: Insights from MotherDuck’s Jacob Matson https://www.striim.com/blog/small-data-big-impact-insights-from-motherducks-jacob-matson/ Thu, 19 Sep 2024 16:46:12 +0000

What makes MotherDuck and DuckDB a game-changer for data analytics? Join us as we sit down with Jacob Matson, a renowned expert in SQL Server, dbt, and Excel, who recently became a developer advocate at MotherDuck.

During this episode, Jacob shares his compelling journey to MotherDuck, driven by his frequent use of DuckDB for solving data challenges. We explore the unique attributes of DuckDB, comparing it to SQLite for analytics, and uncover its architectural benefits, such as utilizing multi-core machines for parallel query execution. Jacob also sheds light on how MotherDuck is pushing the envelope with their innovative concept of multiplayer analytics.

Our discussion takes a deep dive into MotherDuck’s innovative tenancy model and how it impacts database workloads, highlighting the use of DuckDB format in Wasm for enhanced data visualization. Jacob explains how this approach offers significant compression and faster query performance, making data visualization more interactive. We also touch on the potential and limitations of replacing traditional BI tools with Mosaic, and where MotherDuck stands in the modern data stack landscape, especially for organizations that don’t require the scale of BigQuery or Snowflake. Plus, get a sneak peek into the upcoming Small Data Conference in San Francisco on September 23rd, where we’ll explore how small data solutions can address significant problems without relying on big data. Don’t miss this episode packed with insights on DuckDB and MotherDuck innovations!

Small Data SF Signup
Discount Code: MATSON100

 
