Optimizing PyFlink for high-throughput machine learning: Streaming feature engineering in banking

SANDEEP PAMARTHI *

Principal Data Engineer, AI/ML Expert, CGI Inc.
 
Research Article
World Journal of Advanced Engineering Technology and Sciences, 2024, 13(02), 728-737
Article DOI: 10.30574/wjaets.2024.13.2.0549
Publication history: 
Received on 30 September 2024; revised on 11 November 2024; accepted on 13 November 2024
 
Abstract: 
Real-time feature engineering refers to transforming streaming data into meaningful features for machine learning models as events occur. This capability is critical in fraud detection for banking, where detecting anomalous transactions within seconds can prevent losses. Detecting fraud after hours or even minutes is often too late: by the time an offline system flags a fraudulent transaction, the funds may already be gone. Fraud detection systems must ingest transaction streams and compute features (e.g. recent transaction counts, spending velocity, geolocation patterns) continuously, enabling models to score each transaction on sub-second timescales. "Real-time data beats slow data" in this domain: a "too-late" architecture that relies on batch processing (e.g. daily reports or warehouse analytics) increases risk and can lead to revenue loss and poor customer experience. For example, if credit card fraud is only identified at day's end in a data lake, the bank and customer suffer unnecessary damage. This urgency drives modern payment platforms to adopt streaming pipelines for immediate analytics that catch fraud as it happens.
Another crucial application is underwriting decisioning for financial loans and credit. Here, streaming machine learning enables lenders to assess credit risk and make approval decisions in real time, rather than waiting on batch reports. By continuously updating features such as an applicant's transaction history, cash-flow patterns, or credit utilization, banks can generate up-to-the-moment risk scores. This enhances decision accuracy and customer experience: applicants receive faster responses and more dynamic risk-based pricing. A lagging, batch-oriented underwriting process might approve a loan based on outdated data or miss warning signals that appear in the interim. In high-volume commercial banking (new credit requests, renewals, modifications), streaming ML ensures that risk assessments and credit decisions reflect the latest information, improving both fraud prevention (catching fraudulent loan applications) and credit risk management (declining or adjusting terms for risky accounts in near real time).
Apache Flink, a distributed stream processing engine, has emerged as a leading platform for real-time analytics. PyFlink, Flink's Python API, allows data scientists to build streaming pipelines in Python on Flink's engine. This paper focuses on optimizing PyFlink for high-throughput ML, especially for streaming feature engineering in fraud detection and underwriting use cases. We present benchmarking studies comparing PyFlink with alternative frameworks, discuss how streaming ML improves fraud prevention and underwriting decisions, and outline an end-to-end architecture with implementation considerations. The goal is to offer empirical insights and best practices for financial institutions seeking low-latency, high-throughput streaming ML solutions.
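As an illustration of the per-account features named in the abstract (recent transaction counts and spending velocity), the following is a minimal sketch in plain Python rather than PyFlink. It is not the paper's implementation. The names `Transaction` and `SlidingWindowFeatures`, the 60-second window, and the definition of spending velocity as windowed spend divided by window length are all assumptions made for this example, and events are assumed to arrive in timestamp order for each account.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Transaction:
    """Hypothetical event shape; field names are assumptions for this sketch."""
    account_id: str
    timestamp: float  # event time in seconds
    amount: float


class SlidingWindowFeatures:
    """Per-account features over the most recent `window_seconds` of events.

    Assumes events for each account arrive in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        # account_id -> deque of (timestamp, amount), oldest first
        self._events = defaultdict(deque)

    def update(self, txn: Transaction) -> dict:
        window = self._events[txn.account_id]
        window.append((txn.timestamp, txn.amount))

        # Evict events that are at least `window_seconds` older than this one.
        cutoff = txn.timestamp - self.window_seconds
        while window and window[0][0] <= cutoff:
            window.popleft()

        total = sum(amount for _, amount in window)
        return {
            "txn_count": len(window),
            "spend_total": total,
            # Assumed definition: amount spent per second over the window.
            "spend_velocity": total / self.window_seconds,
        }
```

Each call to `update` returns the feature values as of that transaction, which is the shape a scoring model would consume per event. In a streaming engine the per-account deque would typically live in keyed state rather than in a Python dictionary.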
 
Keywords: 
Streaming Machine Learning; PyFlink; Fraud Detection; Underwriting; Feature Engineering; Real-Time Analytics; Financial Services; Apache Flink; Banking; Credit Risk Scoring
 