Tuesday, September 9, 2025

Zeta reduces banking incident response time by 80% with Amazon OpenSearch Service observability


This is a guest post co-written with Shashidhar Soppin, Manochandra Menni and Anchal Kansal from Zeta.

Zeta is a core banking technology provider that enables banks to rapidly launch extensible banking asset and liability products. Zeta’s primary products are Olympus and Tachyon. Olympus is a platform as a service (PaaS) that simplifies building and operating cloud-native, secure, and distributed multi-tenant software as a service (SaaS) products. It blends infrastructure as code and GitOps methodologies for efficient and consistent deployment of SaaS products. Its architecture prioritizes strong tenant isolation, real-time event processing, and comprehensive observability, supporting robust API integrations and seamless deployment. Zeta’s Tachyon is a full-stack, cloud-native, API-first digital banking SaaS service delivered through Olympus. The banking services of Tachyon include payment engines (for UPI, credit, debit, and prepaid cards), savings and checking account management, and so on. Tachyon is a modern debit processing product with personal finance management and card controls. It’s designed to increase usage, upsell credit, reduce fraud, and improve customer satisfaction. The Tachyon product offers comprehensive provisioning, payments, and account management APIs and SDKs, enabling seamless integration of financial products into third-party apps without compromising privacy and security. Zeta operates Tachyon as a multi-tenant SaaS product, serving customers who are configured as individual tenants within the system. Zeta’s technology stack is monitored by their Customer Service Navigator (CSN) product, which is part of Olympus.

As a global SaaS provider, Zeta needed a solution capable of monitoring tenants, measuring SLAs, meeting local regulatory requirements, and scaling efficiently with both new tenant onboarding and seasonal usage spikes. Zeta sought a cost-effective, scalable system that would provide a unified “single pane of glass” to monitor the application services, cloud infrastructure, open-source components, and third-party products.

Zeta faced a formidable challenge in orchestrating a cohesive monitoring system across a rapidly expanding multi-tenant environment, diverse domains, and numerous tools. As more tenants joined their system, the complexity grew exponentially, making Zeta’s monitoring solution increasingly difficult to maintain. The primary challenge stemmed from fragmented monitoring tools that made it difficult to quickly identify root causes across interconnected systems, leading to prolonged troubleshooting times and potential service degradation. When users reported issues, such as credit card payment problems, the Site Reliability Engineering (SRE) team had to navigate multiple disparate monitoring tools and siloed data, and the lack of integrated observability resulted in time-consuming manual correlation efforts. This multi-tenant, multi-solution landscape significantly complicated the ability to maintain consistent monitoring standards and service levels. The challenge was further complicated by the complex regulatory landscape, where global expansion required adherence to diverse local regulations, necessitating a flexible architecture capable of accommodating varying data retention policies and access controls across different jurisdictions. Each new tenant addition multiplied the complexity of balancing the monitoring needs of internal SRE teams and customers, requiring sophisticated data segregation and access management. Additionally, Zeta required comprehensive anomaly detection capabilities across systems, components, infrastructure, and operations, which called for a solution that could scale dynamically while establishing dynamic baselines and identifying subtle patterns that might indicate emerging issues. As the tenant base continued to grow, the need for a unified, scalable monitoring solution that could streamline these processes, enhance operational visibility, and maintain system integrity became critical.

Zeta’s goal was to streamline their processes and enhance operational visibility across the entire technology landscape. By addressing these challenges, Zeta aimed to create a unified observability solution that would significantly improve incident response times, enhance regulatory compliance posture, and ultimately deliver a more reliable and performant service to their global customer base.

In this post, we explain how Zeta built a more unified monitoring solution using Amazon OpenSearch Service that improved performance, reduced manual processes, and increased end-user satisfaction. Zeta has achieved over an 80% reduction in mean time to resolution (MTTR), with incident response times decreasing from 30+ minutes to under 5 minutes.

Solution overview

Zeta designed and built an observability system, CSN, to deliver comprehensive visibility across the service environment. CSN is part of the Olympus suite of products. CSN serves as the primary interface for the SRE team, offering real-time service health dashboards, infrastructure monitoring, SLA performance analytics, and an admin panel for user management. The system is equipped with single sign-on (SSO) integration and enforces role-based access control (RBAC) to enable secure, granular access. With CSN, SREs can efficiently monitor system health, receive actionable alerts and warnings, and manage operational workflows across critical services.

CSN is powered by OpenSearch Service to provide an integrated solution for DevOps and Site Reliability Engineers to help identify critical events and issues. Zeta chose OpenSearch Service because it offers a fully managed, open-source search and analytics engine that scales effortlessly to handle the increasing number of tenants, associated data growth, and analytics needs. Its seamless integration with AWS services, robust security features, and support for real-time data ingestion and querying make it ideal for powering the CSN dashboards and analytics workloads. The following diagram illustrates the CSN deployment architecture.

Zeta CSN Deployment Architecture

The OpenSearch Service domain uses the Multi-AZ with Standby deployment model, following AWS best practices for high availability and fault tolerance. Nodes, including dedicated cluster manager nodes, data nodes, and UltraWarm nodes, are distributed evenly across three Availability Zones in the same AWS Region. Availability Zones 1 and 2 handle active indexing and search traffic, and Availability Zone 3 contains standby nodes that remain passive during normal operations. If an Availability Zone failure occurs, OpenSearch Service automatically promotes standby nodes to active status, maintaining cluster operations with minimal disruption and no need for data redistribution.

The OpenSearch cluster consists of three dedicated cluster manager nodes and a multiple-of-three data node count to maintain quorum and balanced shard allocation. Each index uses at least two replicas, providing redundant copies of data across the Availability Zones. This Multi-AZ with Standby configuration delivers high resilience and rapid failover, supporting continuous service availability and robust disaster recovery for the observability workloads.
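The topology described above could be written down as the cluster configuration portion of a domain definition. The following is a minimal sketch, not Zeta's actual configuration: the key names follow the OpenSearch Service configuration API, while the data node count, the UltraWarm node count, and the cluster manager instance type are assumptions chosen purely for illustration.

```python
def build_cluster_config(data_nodes: int, warm_nodes: int) -> dict:
    """Return a ClusterConfig-style dict for a Multi-AZ with Standby domain."""
    # Nodes are spread evenly over three Availability Zones,
    # so the data node count has to be a multiple of three.
    if data_nodes < 3 or data_nodes % 3 != 0:
        raise ValueError("data node count must be a positive multiple of three")
    return {
        "InstanceType": "m6g.4xlarge.search",        # hot data nodes
        "InstanceCount": data_nodes,                 # assumption
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "m6g.large.search",   # assumption, not stated in the post
        "DedicatedMasterCount": 3,
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
        "MultiAZWithStandbyEnabled": True,
        "WarmEnabled": True,
        "WarmType": "ultrawarm1.large.search",
        "WarmCount": warm_nodes,                     # assumption
        "ColdStorageOptions": {"Enabled": True},
    }


config = build_cluster_config(data_nodes=6, warm_nodes=3)
```

A dict of this shape is what a domain creation or update call would receive; validating the multiple-of-three rule before the call is a small guard against an unbalanced layout.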

Data collection and ingestion

The observability strategy centers on a data collection and ingestion pipeline designed to handle the complexity and scale of Zeta’s environment. The architecture, as shown in the following diagram, addresses three critical data types: AWS resource logs, application logs, and distributed traces, with each data type using tailored collection and processing methods optimized for the workloads.

Zeta CSN Data Ingestion

AWS resource logs collection

The infrastructure spans multiple AWS services including Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Application Load Balancer, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Elastic Compute Cloud (Amazon EC2), and more. Zeta uses Amazon CloudWatch Logs as the primary collection point for AWS service logs, which provides native integration with these services.

AWS services send their logs directly to CloudWatch Logs, from which they are pulled by Fluentd running on the Amazon EKS cluster for centralized processing. This approach natively captures operational data from the AWS resources, including:

  • Database operational logs and audit trails from Amazon RDS instances
  • Data warehouse query execution logs from Amazon Redshift
  • Application Load Balancer access logs capturing traffic patterns and performance metrics
  • Kafka cluster operational logs from Amazon MSK
  • AWS API invocation audit trails from AWS CloudTrail
  • Container runtime and operating system logs from Amazon EC2

During log collection, personally identifiable information (PII) is filtered out. The solution adheres strictly to PCI-DSS guidelines throughout this process.
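The post does not describe how the PII filtering is implemented. As a rough illustration of the idea only, the following sketch masks card-number-like digit runs and email addresses in a log message before it is forwarded. The two patterns and the placeholder strings are assumptions; a real PCI-DSS control would need a vetted and much broader rule set.

```python
import re

# Illustrative patterns only (assumptions, not Zeta's rules).
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_NUMBER = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_pii(message: str) -> str:
    """Replace card-number-like digit runs and email addresses with placeholders."""
    message = CARD_NUMBER.sub("[REDACTED_PAN]", message)
    return EMAIL.sub("[REDACTED_EMAIL]", message)


clean = redact_pii("payment failed for card 4111 1111 1111 1111, user jane@example.com")
```

Applying the filter at collection time, before records reach the streaming layer, keeps sensitive values out of every downstream store rather than relying on each consumer to mask them.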

Zeta used Amazon MSK as a scalable and reliable backbone for collecting and streaming logs from various sources across the AWS resources. Logs are ingested into Amazon MSK, providing a durable and fault-tolerant buffer that decouples log producers from consumers. This architecture enables real-time log streaming and supports advanced processing pipelines before the logs are routed to OpenSearch Service. Integrating Amazon MSK into the logging workflow improves scalability, resilience, and flexibility, so that high log volumes are managed efficiently without impacting downstream systems. This approach, combined with native AWS integrations, minimizes operational complexity and maintains comprehensive, centralized log visibility across the cloud environment.

Fluentd processes these logs and routes them directly to OpenSearch Service, maintaining the benefits of AWS integration while providing centralized accessibility. This centralized logging approach with built-in buffering capabilities reduces the direct load on OpenSearch Service by batching and optimizing log delivery, helping to prevent potential ingestion bottlenecks during high-volume periods. The approach alleviates the need for custom log shipping agents on AWS resources, reducing operational overhead while maintaining comprehensive coverage of the cloud infrastructure.

Application logs processing

For application-level observability, a pipeline using Fluentd is deployed as a Kubernetes DaemonSet. Application microservices running on Amazon EKS generate logs that the Fluentd DaemonSets collect, parse, and enrich with metadata such as pod names, namespaces, and service identifiers. The processed logs then flow through Amazon MSK for reliable, high-throughput message streaming before final processing by Fluentd and indexing in OpenSearch Service.

This Kafka-based approach provides several advantages:

  • Decoupling – Producers and consumers operate independently, so that Zeta can scale ingestion and processing separately based on demand.
  • Backpressure handling – Kafka’s buffering capabilities absorb traffic spikes during peak banking hours and sudden increases in log volume, maintaining system stability during seasonal usage surges.
  • Durability of logs – Message persistence keeps logs durable, so that no log data is lost during system maintenance or unexpected failures.

The logs then pass through a second Fluentd layer for final processing and routing to OpenSearch Service, where they’re indexed across service-specific indexes (app-index, falco-index, kong-index).
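The enrichment and routing steps can be pictured with a small sketch. It is written in Python for readability and is not Fluentd configuration. The three index names come from the post, while the `source` field, its values, and the shape of the metadata are assumptions made for illustration.

```python
# Index names are from the post; the "source" field and its values are assumptions.
INDEX_BY_SOURCE = {
    "application": "app-index",
    "falco": "falco-index",
    "kong": "kong-index",
}


def enrich(record: dict, pod: str, namespace: str, service: str) -> dict:
    """Return a copy of the record with Kubernetes metadata attached."""
    return {
        **record,
        "kubernetes": {"pod_name": pod, "namespace": namespace},
        "service": service,
    }


def target_index(record: dict) -> str:
    """Pick the destination index for an enriched log record."""
    source = record.get("source", "application")
    # Unknown sources fall back to the application index instead of being dropped.
    return INDEX_BY_SOURCE.get(source, "app-index")


record = enrich({"source": "kong", "message": "GET /payments 200"},
                pod="kong-7d9f", namespace="gateway", service="kong")
index_name = target_index(record)
```

Keeping one index family per source lets retention, mappings, and access rules differ between application, security, and gateway logs.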

Distributed trace collection

To address the challenge of correlating issues across Zeta’s microservices architecture, the system uses distributed tracing with Jaeger, an open-source, end-to-end distributed tracing system. Jaeger enables monitoring and troubleshooting of transactions in complex distributed systems by tracking requests as they flow through multiple services. The application services and Kong API Gateway are instrumented with Jaeger client libraries that generate trace data including spans, which represent individual operations within a trace. Each span contains metadata such as operation names, start and finish timestamps, tags, and logs that provide context about the operation being performed. The Jaeger Collector aggregates these spans from multiple services, performing validation, indexing, and transformation before forwarding the data.
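To make the span model concrete, here is a minimal sketch of the data a span carries, as listed above, plus a helper that finds the slowest operation in a trace. The class is a simplified stand-in and not the Jaeger client API; the field names, operation names, and sample values are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified span: one operation inside a trace (not the Jaeger client class)."""
    trace_id: str
    span_id: str
    operation_name: str
    start_us: int                 # start timestamp in microseconds
    finish_us: int                # finish timestamp in microseconds
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)

    @property
    def duration_us(self) -> int:
        return self.finish_us - self.start_us


def slowest_span(spans: list) -> Span:
    """Return the span with the longest duration."""
    return max(spans, key=lambda s: s.duration_us)


spans = [
    Span("t1", "s1", "kong:route", start_us=0, finish_us=1200),
    Span("t1", "s2", "payments:authorize", start_us=100, finish_us=1100,
         tags={"tenant": "tenant-a"}),
]
slowest = slowest_span(spans)
```

Because every span shares the trace ID of the request that produced it, grouping by `trace_id` is what allows a slow payment request to be followed from the gateway into the service that caused the delay.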

The traces flow through Amazon MSK for the same reliability benefits as the logging pipeline: durability, decoupling, and backpressure handling during high-volume periods. Jaeger Ingester then consumes traces from Amazon MSK and processes them for storage in the jaeger-index within OpenSearch Service.

This data collection and ingestion strategy provides full end-to-end visibility and builds an observability system that enables SRE teams to monitor, troubleshoot, and optimize the services across the entire technology stack.

Storage tiering

To manage the log, metric, and trace data at scale (about 3 TB generated daily), the solution implemented OpenSearch Service storage tiering to balance performance, retention, and cost. Zeta requires near real-time search and retrieval for at least a week, while retaining logs and traces for up to 10 years. Keeping this data in active clusters would impact search performance and significantly increase costs, so the solution uses the OpenSearch Service hot, UltraWarm, and cold storage tiers to optimize the data lifecycle. The following diagram illustrates storage tiering in OpenSearch Service.

Zeta CSN Storage Tiering

Hot storage is used for the most recent and frequently accessed data, supporting real-time indexing and low-latency queries. This tier relies on high-performance storage attached to standard data nodes, making it ideal for powering live dashboards and analytics where speed is critical. The solution uses AWS Graviton2-powered m6g.4xlarge.search instance types to run the OpenSearch Service domain, which provides up to 40% lower cost compared to x86-based instances. Each hot data node has an attached gp3 EBS volume to store indexes. Zeta maintains data in hot storage for 1 week.

UltraWarm storage serves as a cost-effective layer for older, read-only data that’s queried less frequently but still needs to remain searchable. UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) as the backing store with an integrated caching mechanism, retaining large volumes of data at a fraction of the cost of hot storage while still supporting interactive queries for historical analysis. Zeta uses ultrawarm1.large.search instance types in the UltraWarm storage tier and maintains data in UltraWarm storage for 15 days.

Cold storage is designed for long-term archival of infrequently accessed or compliance-driven data. Data in cold storage is detached from active compute resources and resides in Amazon S3, incurring minimal cost. When historical data needs to be queried, the indexes are attached to the UltraWarm nodes using OpenSearch API calls. This supports extracting historical data for audits, periodic research, or forensic investigations without maintaining active compute for the entire retention period, thereby reducing storage cost.
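A back-of-the-envelope calculation shows why tiering matters at this volume. The sketch below uses figures from the post (3 TB per day, 7 days hot, 15 days UltraWarm, up to 10 years retention, two replicas) and several simplifying assumptions: indexed size equals raw size, daily volume stays constant, a year has 365 days, and replicas are counted only in the hot tier because UltraWarm and cold data is backed by Amazon S3.

```python
def tier_footprint_tb(daily_tb: float = 3.0, hot_days: int = 7, warm_days: int = 15,
                      retention_years: int = 10, replicas: int = 2) -> dict:
    """Estimate steady-state storage per tier in TB (rough sizing, see assumptions)."""
    # Hot tier keeps the primary copy plus each replica on EBS.
    hot = daily_tb * hot_days * (1 + replicas)
    # UltraWarm and cold tiers are S3-backed, so one copy is counted.
    warm = daily_tb * warm_days
    cold_days = retention_years * 365 - hot_days - warm_days
    cold = daily_tb * cold_days
    return {"hot": hot, "warm": warm, "cold": cold}


footprint = tier_footprint_tb()
```

Under these assumptions, the hot tier holds about 63 TB and UltraWarm about 45 TB, while cold storage grows toward roughly 10.9 PB over ten years, which is why keeping everything on active data nodes would be impractical.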

OpenSearch Service automates index transitions between hot, UltraWarm, and cold storage tiers using Index State Management (ISM) policies. ISM policies specify the conditions and actions for each state, such as transitioning based on index age, size, or document count. When an index qualifies for a transition, ISM jobs, which run every 5 to 8 minutes, evaluate the policy and move the index to the next tier. When indexes reach the UltraWarm threshold, they’re migrated to UltraWarm nodes backed by Amazon S3, which reduces storage costs while keeping data accessible for queries. After the UltraWarm retention period, ISM archives the indexes to cold storage, detaching them from compute resources but allowing reattachment for future queries or compliance needs. This automated lifecycle management reduces operational overhead, optimizes storage costs, and maintains performance for both recent and historical data.

For observability data, new indexes are created in the hot tier, where they remain for 7 days to support fast ingestion and low-latency queries. After this period, ISM transitions these indexes to UltraWarm storage, where they’re retained for an additional 15 days as read-only data, balancing cost with searchability.
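A lifecycle like the one above could be captured in an ISM policy along the following lines. This is a minimal sketch under stated assumptions, not Zeta's policy: the state names, the `app-index*` pattern, and the `@timestamp` field are illustrative, and because `min_index_age` is measured from index creation, the cold transition is set to the sum of the hot and UltraWarm periods.

```python
import json


def build_ism_policy(hot_days: int = 7, warm_days: int = 15) -> dict:
    """Build an ISM policy body: hot -> UltraWarm -> cold storage."""
    return {
        "policy": {
            "description": "Illustrative hot, UltraWarm, cold lifecycle",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {"state_name": "warm",
                         "conditions": {"min_index_age": f"{hot_days}d"}}
                    ],
                },
                {
                    "name": "warm",
                    "actions": [{"warm_migration": {}}],
                    "transitions": [
                        # Age counts from index creation, so add both periods.
                        {"state_name": "cold",
                         "conditions": {"min_index_age": f"{hot_days + warm_days}d"}}
                    ],
                },
                {
                    "name": "cold",
                    # "@timestamp" is an assumed field name.
                    "actions": [{"cold_migration": {"timestamp_field": "@timestamp"}}],
                    "transitions": [],
                },
            ],
            # Assumed pattern; attaches the policy to newly created matching indexes.
            "ism_template": {"index_patterns": ["app-index*"], "priority": 100},
        }
    }


policy_body = json.dumps(build_ism_policy(), indent=2)
```

The resulting JSON is the kind of body an ISM policy creation request would carry; a final deletion state after the retention period is left out because the post does not describe one.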

Security

Security is the most critical part of the architecture. Zeta’s observability system implements multiple layers of security for data confidentiality, integrity, and compliance with banking regulations, and is built using a zero-trust approach following the AWS shared responsibility model for OpenSearch Service:

  • Infrastructure security: The OpenSearch Service domain is deployed within a virtual private cloud (VPC) with private subnets, isolating it from direct internet access. Security groups enforce restrictive ingress rules, allowing access only from authorized sources. The OpenSearch Service domain uses encryption at rest through AWS Key Management Service (AWS KMS). Data in transit is secured using TLS 1.3 encryption, so that log data, traces, and search queries remain protected during transmission. Service-to-service communication uses AWS Identity and Access Management (IAM) roles and encrypted connections, alleviating the need for hardcoded credentials.
  • Access control and authentication: The solution uses Amazon OpenSearch Service fine-grained access control (FGAC) integrated with IAM, where IAM serves as the authentication provider and FGAC handles authorization by mapping IAM roles to OpenSearch backend roles. This approach helps Zeta control access permissions at the index and document level based on tenant requirements and user responsibilities. The data ingestion pipeline implements end-to-end security, with Fluentd authenticating to Amazon MSK using IAM roles over encrypted connections. Amazon MSK clusters use encryption in transit and at rest, protecting log data throughout the streaming pipeline. Kubernetes RBAC policies restrict pod-to-pod communication and limit service account permissions.
  • Data privacy and tenant isolation: Each tenant’s data is kept logically separate in OpenSearch Service using a tenant ID. CSN implements tenant-aware authentication and authorization with FGAC, restricting users to their authorized tenants’ dashboards and data. Every API endpoint validates tenant context, so that users can only access data within their authorized scope. Importantly, no customer data is captured in the logs; only system metrics are used to build the monitoring system, adhering to banking security standards and best practices. User actions are audited and logged for compliance purposes, with audit trails maintained in accordance with regulatory requirements.
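Document-level tenant isolation with FGAC can be sketched as a role definition in the format used by the OpenSearch security plugin. The example below is an assumption-laden illustration rather than Zeta's setup: the `tenant_id` field name, the index patterns, and the IAM role ARN are all hypothetical.

```python
import json


def tenant_read_role(tenant_id: str, index_patterns: list) -> dict:
    """Build a read-only role limited to one tenant's documents."""
    # Document-level security (DLS) query; "tenant_id" is an assumed field name.
    dls_query = {"term": {"tenant_id": tenant_id}}
    return {
        "cluster_permissions": [],
        "index_permissions": [
            {
                "index_patterns": index_patterns,
                "dls": json.dumps(dls_query),   # DLS is supplied as a JSON string
                "allowed_actions": ["read"],
            }
        ],
    }


def role_mapping(iam_role_arn: str) -> dict:
    """Map an IAM role (as a backend role) to the tenant role."""
    return {"backend_roles": [iam_role_arn]}


role = tenant_read_role("tenant-a", ["app-index*", "kong-index*"])
# Hypothetical ARN, for illustration only.
mapping = role_mapping("arn:aws:iam::123456789012:role/tenant-a-sre")
```

With a role like this, a user who authenticates through the mapped IAM role sees only documents whose tenant field matches, even when the query targets a shared index.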

This security framework enables the observability system to meet the security requirements of core banking operations while maintaining operational efficiency and regulatory compliance across global markets.

Customer Service Navigator

CSN gives SREs a powerful diagnostics interface engineered for high-efficiency monitoring, deep analysis, and rapid troubleshooting of system performance across distributed environments. The system ingests and processes telemetry data at sub-minute intervals, providing near-real-time metrics, traces, and logs from critical infrastructure components. Actionable, interactive visualizations, such as heatmaps, anomaly graphs, and dependency maps, help SREs quickly detect SLO breaches and drill down to granular root causes, often within a few minutes of an incident.

The following screenshot shows an example service health dashboard in CSN for an Olympus tenant.

Zeta CSN Service Health Dashboard

The following screenshot shows an example of the API performance insights dashboard in CSN.

Zeta CSN API Performance Dashboard

Business and technical benefits

The OpenSearch Service-based CSN system provides the following business and technical benefits:

  • Manual effort is reduced through automated Index State Management (ISM) and lifecycle policies, so that Zeta’s teams can focus on innovation
  • Automated lifecycle policies facilitate seamless retention and archiving of compliance data, reducing the risk of non-compliance
  • The system supports log retention for over 10 years to meet regulatory requirements for Zeta’s banking and financial services customers
  • Multiple layers of security, including encryption at rest and in transit, FGAC, and tenant isolation, protect customer data and support Zeta’s zero-trust architecture
  • By consolidating logs, traces, and metrics from disparate systems into OpenSearch, SRE teams can correlate events more effectively, thereby reducing troubleshooting efforts and achieving an 80% improvement in MTTR
  • Zeta achieved 99.999999999% data durability for archived logs stored in Amazon S3, providing long-term data integrity
  • Zstandard compression is being implemented to optimize long-term storage costs

Conclusion

CSN’s advanced correlation engine automatically associates related events across microservices, databases, network layers, and infrastructure, significantly streamlining root cause analysis. Integrated alerting and automated runbooks further reduce response times. Since implementing CSN, Zeta has achieved over an 80% reduction in MTTR, with incident response times decreasing from 30+ minutes to under 5 minutes. The service supports seamless multi-tenant monitoring, processes 3 TB of machine-generated data daily, and is architected for petabyte-scale growth. Additionally, CSN helps Zeta meet regulatory requirements for retaining historical logs over multiple years while keeping storage costs under control. This has significantly improved operational resilience, increased service availability, and empowered teams to proactively resolve issues before they affect end users.

Ready to take your organization’s observability capabilities to the next level? Dive into the technical details of OpenSearch Service in the Amazon OpenSearch Service Developer Guide. Visit our new migration hub page for more prescriptive guidance on moving your workloads to OpenSearch Service.


About the authors

Deepesh Dhapola is a Senior Solutions Architect at AWS India, where he architects high-performance, resilient cloud solutions for financial services and fintech organizations. He specializes in using advanced AI technologies, including generative AI, intelligent agents, and the Model Context Protocol (MCP), to design secure, scalable, and context-aware applications. With deep expertise in machine learning and a keen focus on emerging trends, Deepesh drives digital transformation by integrating cutting-edge AI capabilities to enhance operational efficiency and foster innovation for AWS customers. Beyond his technical pursuits, he enjoys quality time with his family and explores creative culinary techniques.

Shashidhar (Shashi) Soppin is an accomplished Enterprise Architect and cloud transformation leader with over 24 years of experience spanning regulated industries and high-growth technology environments. Currently steering strategic initiatives as Lead Architect at Zeta’s CTO office, Shashidhar has helped build and lead world-class engineering teams, driving innovation in cloud, security, and fintech domains. He has architected secure, scalable platforms, scaling user bases by 10x, enabling complex integrations for a major bank’s migration to Zeta’s platforms, and pioneering Zero Trust frameworks that achieved excellent regulatory compliance. A results-driven executive and former DMTS at Wipro, Shashidhar holds 25+ granted patents and has delivered multi-million dollar business deals across domains including AI/ML. Renowned as a published author (“Essentials of Deep Learning”), frequent industry speaker, and hands-on innovator, he combines technical expertise with business acumen, propelling organizations toward robust, future-ready cloud ecosystems and operational excellence. Prior to Wipro, he also worked at IBM-ISL.

Anchal Kansal is a Lead Site Reliability Engineer at Zeta, where she has spent the past 4 years building and scaling reliable, high-performance systems. With deep expertise in OpenSearch, observability platforms, and large-scale infrastructure, she focuses on ensuring uptime, performance, and operational efficiency. Anchal is passionate about solving complex reliability challenges and sharing practical insights with the engineering community.

Manochandra (Mano) is the Site Reliability Engineering (SRE) expert at Zeta, specializing in data management-oriented systems. With a deep understanding of large-scale distributed architectures, he has extensive experience designing, deploying, and maintaining resilient, production-grade OpenSearch systems. Mano is known for his proactive approach in optimizing infrastructure reliability and performance, as well as his ability to troubleshoot complex operational challenges. His expertise spans implementing automation, monitoring, and incident management best practices, making him a go-to resource for ensuring service availability and scalability at Zeta.

Hitesh Subnani is an FSI Solutions Architect at AWS India, where he works with customers to design and build architectures that deliver business value. He specializes in comprehensive observability and analytics systems, enabling organizations to gain deep insights from operational data. With expertise in search and analytics technologies, Hitesh focuses on scalable monitoring systems, real-time dashboards, and compliance-driven architectures for AWS customers in the financial sector.

Tarun Chakraborty is a Sr. Technical Account Manager (TAM) at AWS India, where he partners with leading banks and fintech organizations to accelerate their cloud transformation journeys. With over 15 years of experience in technology and financial services, he serves as a trusted advisor helping customers leverage AWS’s comprehensive suite of services to drive innovation and achieve their business objectives.
