Monday, September 8, 2025

How AppZen enhances operational efficiency, scalability, and security with Amazon OpenSearch Serverless


AppZen is a leading provider of AI-driven finance automation solutions. The company's core offering centers around an innovative AI platform designed for modern finance teams, featuring expense management, fraud detection, and autonomous accounts payable solutions. AppZen's technology stack uses computer vision, deep learning, and natural language processing (NLP) to automate financial processes and ensure compliance. With this comprehensive solution approach, AppZen has a well-established enterprise customer base that includes one-third of the Fortune 500 companies.

AppZen hosts all its workloads and application infrastructure on Amazon Web Services (AWS), continuously modernizing its technology stack to effectively operationalize and host its applications. Centralized logging, a critical component of this infrastructure, is essential for monitoring and managing operations across AppZen's diverse workloads. As the company experienced rapid growth, the legacy logging solution struggled to keep pace with expanding needs. Consequently, modernizing this system became one of AppZen's top priorities, prompting a comprehensive overhaul to enhance operational efficiency and scalability.

In this blog post, we show how AppZen modernized its central log analytics solution, moving from Elasticsearch to Amazon OpenSearch Serverless with an optimized architecture that meets these requirements.

Challenges with the legacy logging solution

With a growing number of enterprise applications and workloads, AppZen had an increasing need for comprehensive operational analytics using log data across its multi-account organization in AWS Organizations. AppZen's legacy logging solution created several key challenges. It lacked the flexibility and scalability to efficiently index logs and make them available for real-time analysis, which was crucial for monitoring anomalies, optimizing workloads, and ensuring efficient operations.

The legacy logging solution consisted of a 70-node Elasticsearch cluster (with 30 hot nodes and 40 warm nodes) that struggled to keep up with the growing volume of log data as AppZen's customer base expanded and new mission-critical workloads were added. This led to performance issues and increased operational complexity. Maintaining and managing the self-hosted Elasticsearch cluster required frequent software updates and infrastructure patching, resulting in system downtime, data loss, and added operational overhead for the AppZen CloudOps team.

Migrating the data to a patched node cluster took 7 days, far exceeding industry standards and AppZen's operational requirements. This extended downtime introduced data integrity risk and directly impacted the operational availability of the centralized logging system that teams rely on to troubleshoot critical workloads. The system also suffered frequent data loss that impacted real-time metrics monitoring, dashboarding, and alerting, because its application log-collecting agent, Fluent Bit, lacked essential features such as backoff and retry.
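To illustrate the kind of delivery resilience a collector needs, the following is a hypothetical snippet using retry and backoff options that Fluent Bit documents for its outputs and scheduler; the values shown are illustrative, not AppZen's actual settings:

```ini
[SERVICE]
    # Exponential backoff for failed chunk retries: base interval and cap (seconds)
    scheduler.base    5
    scheduler.cap     60

[OUTPUT]
    name              http
    match             *
    # Retry a failed chunk up to 10 times instead of dropping it
    Retry_Limit       10
```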

AppZen used an NGINX proxy instance to control authorized user access to data hosted on Elasticsearch. Upgrades and patching of this instance introduced frequent system downtime. All user requests were routed through this proxy layer, where each user's permission boundary was evaluated. This added operational overhead for administrators, who had to manage user and group mappings at the proxy layer.

Solution overview

AppZen re-platformed its central log analytics solution with Amazon OpenSearch Serverless and Amazon OpenSearch Ingestion. Amazon OpenSearch Serverless lets you run OpenSearch in the AWS Cloud, so you can run large workloads without configuring, managing, and scaling OpenSearch clusters. You can ingest, analyze, and visualize your time-series data without infrastructure provisioning. OpenSearch Ingestion is a fully managed data collector that simplifies data processing with built-in capabilities to filter, transform, and enrich your logs before analysis.

This new serverless architecture, shown in the following architecture diagram, is cost-optimized, secure, high-performing, and designed to scale efficiently for future business needs. It serves the following use cases:

  • Centrally monitor business operations and perform data analysis for deep insights
  • Monitor applications and troubleshoot infrastructure

Together, OpenSearch Ingestion and OpenSearch Serverless provide a serverless infrastructure capable of running large workloads without configuring, managing, and scaling a cluster. The solution provides data resilience with persistent buffers that support the current 2 TB per day pipeline data ingestion requirement. IAM Identity Center support for OpenSearch Serverless helped manage users and their access centrally, eliminating the need for the NGINX proxy layer.

The architecture diagram also shows how separate ingestion pipelines were deployed. This configuration choice improves deployment flexibility based on the workload's throughput and latency requirements. In this architecture, Flow-1 is a push-based data source (such as HTTP and OTel logs) where the workload's Fluent Bit DaemonSet is configured to ingest log messages into the OpenSearch Ingestion pipeline. These messages are retained in the pipeline's persistent buffer to provide data durability. After processing, each message is inserted into OpenSearch Serverless.
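A Flow-1 collector output might be configured along the following lines, a hypothetical sketch using the SigV4 options (`aws_auth`, `aws_region`, `aws_service`) that Fluent Bit's HTTP output documents for signing requests to an OpenSearch Ingestion endpoint; the host name and URI path here are placeholders:

```ini
[OUTPUT]
    name         http
    match        kube.*
    # Placeholder pipeline ingestion endpoint
    host         entry-pipeline-abc123.us-east-1.osis.amazonaws.com
    port         443
    uri          /log/ingest
    format       json
    tls          on
    # Sign requests with the pod's IAM role (SigV4)
    aws_auth     true
    aws_region   us-east-1
    aws_service  osis
```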

Flow-2 is a pull-based data source, such as Amazon Simple Storage Service (Amazon S3), for OpenSearch Ingestion, where the workload's Fluent Bit DaemonSets are configured to sync data to an S3 bucket. Using S3 Event Notifications, notifications of newly created log records are sent to Amazon Simple Queue Service (Amazon SQS). OpenSearch Ingestion consumes these notifications and processes the records for insertion into OpenSearch Serverless, delegating data durability to the data source. For both Flow-1 and Flow-2, the OpenSearch Ingestion pipelines are configured with a dead-letter queue that records failed ingestion messages to the S3 source, making them accessible for further analysis.
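The S3-to-SQS wiring that Flow-2 relies on can be sketched as the notification configuration passed to `aws s3api put-bucket-notification-configuration`; the queue ARN, account ID, and key prefix below are placeholders:

```json
{
  "QueueConfigurations": [
    {
      "Id": "k8-log-created",
      "QueueArn": "arn:aws:sqs:us-east-1:111122223333:us-s3-k8-log",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "k8/" }
          ]
        }
      }
    }
  ]
}
```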

AWS logging architecture with ingestion flows to OpenSearch Serverless

For service log analytics, AppZen adopted a pull-based approach, as shown in the following figure, where all service logs published to Amazon CloudWatch are migrated to an S3 bucket for further processing. An AWS Lambda processor is triggered whenever a new message arrives in the S3 bucket, and the processed message is then uploaded to the S3 bucket for OpenSearch Ingestion. The following diagram shows the OpenSearch Serverless architecture for the service log analytics pipeline.
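A minimal sketch of such a Lambda processor is shown below. The function and bucket names are assumptions, not AppZen's actual code: it reads a newly delivered log object from S3, decompresses it if needed, flattens the events into newline-delimited JSON, and writes the result to the bucket that the OpenSearch Ingestion pipeline polls.

```python
import gzip
import json

INGEST_BUCKET = "appzen-osi-ingest-logs"  # illustrative bucket name

def to_ndjson(events):
    """Flatten log events into newline-delimited JSON records."""
    lines = []
    for e in events:
        record = {"@timestamp": e.get("timestamp"), "message": e.get("message")}
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"

def handler(event, context):
    import boto3  # deferred so the module stays importable without the AWS SDK
    s3 = boto3.client("s3")
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        if key.endswith(".gz"):
            body = gzip.decompress(body)
        events = [json.loads(line) for line in body.splitlines() if line]
        s3.put_object(
            Bucket=INGEST_BUCKET,
            Key=key.removesuffix(".gz") + ".ndjson",
            Body=to_ndjson(events).encode(),
        )
```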

A log ingestion architecture for service log analytics

Workloads and infrastructure spread across multiple AWS accounts can securely send logs to the central log analytics platform over a private network using virtual private cloud (VPC) peering and AWS PrivateLink endpoints, as shown in the following figure. Both OpenSearch Ingestion and OpenSearch Serverless are provisioned in the same account and Region, with cross-account ingestion enabled for workloads in other member accounts of the AWS organization.
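Cross-account senders authorize against the pipeline with IAM. As a hedged sketch, a workload account's ingestion role could carry a permission like the following (the pipeline ARN and account ID are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "osis:Ingest",
      "Resource": "arn:aws:osis:us-east-1:111122223333:pipeline/entry-pipeline"
    }
  ]
}
```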

Cross-account AWS logging with secure centralized collection

Migration approach

The migration to OpenSearch Serverless and OpenSearch Ingestion involved performance evaluation and fine-tuning of the logging stack configuration, followed by migration of production traffic to the new platform. The first step was to configure and benchmark the infrastructure for cost-optimized performance.

Parallel ingestion to benchmark OCU capacity requirements

OpenSearch Ingestion scales elastically to meet throughput requirements during workload spikes. Enabling persistent buffering on ingestion pipelines with push-based data sources provided data durability and reliability. The data ingestion pipelines ingest at a rate of 2 TB per day. Because of AppZen's 90-day retention requirement for its ingested data, at any time there is approximately 200 TB of indexed historical data stored in OpenSearch Serverless. To evaluate performance and costs before deploying to production, data sources were configured to ingest data in parallel into the new OpenSearch Serverless environment alongside the existing setup already running in production with Elasticsearch.

To achieve parallel ingestion, AppZen installed another Fluent Bit DaemonSet configured to ingest into the new pipeline. This was done for two reasons: 1) to avoid interruptions caused by changes to the existing ingestion flow, and 2) new workflows are much more straightforward when the data preprocessing step is offloaded to OpenSearch Ingestion, eliminating the need for custom Lua scripts in Fluent Bit.

Pipeline configuration

The production pipeline configuration was implemented with different strategies based on data source types. Push-based data sources were configured with persistent buffering enabled for data durability and a minimum of three OpenSearch Compute Units (OCUs) to provide high availability across three Availability Zones. In contrast, pull-based data sources, which used Amazon S3 as their source, didn't require persistent buffering because of the inherent durability features of Amazon S3. Both pipeline types were initially configured with a minimum of 3 OCUs and a maximum of 50 OCUs to establish baseline performance metrics. This setup meant the team could monitor and analyze actual workload patterns and subsequently fine-tune worker configurations for optimal OCU utilization. Through continuous monitoring and adjustment, the pipeline configurations were modified and optimized to efficiently handle both daily average loads and peak traffic periods, providing cost-effective and reliable data processing operations.

For AppZen’s throughput requirement, within the pull-based method, they recognized six Amazon S3 staff within the OpenSearch Ingestion pipelines optimally processing 1 OCU at 80% effectivity. Following the finest practices suggestion, at this system.cpu.utilization.worth metrics threshold, the pipeline was configured to auto scale. With every employee able to processing 10 messages, AppZen recognized cost-optimized configuration of fifty OCUs as most OCU configuration for its pipelines that’s able to processing as much as 3,000 messages in parallel. This pipeline configuration proven under helps its peak throughput necessities

# This is an OpenSearch Ingestion pipeline configuration for processing Kubernetes logs and sending them to OpenSearch Serverless
# Data flow: S3 -> SQS -> OpenSearch Ingestion -> OpenSearch + S3 archive
# index_name here is kubernetes.namespace_name or the k8 service name
# If the k8 index name is dev: Service1-dev
# If the k8 index name is non-dev: Service1-allenv
version: "2"
entry-pipeline:
  # Source (S3 + SQS)
  # Reads logs from the S3 bucket via SQS notifications
  # 6 workers process JSON files; S3 objects are deleted after processing
  source:
    s3:
      workers: 6
      notification_type: "sqs"
      codec:
        ndjson:
      compression: "none"
      aws:
        region: "us-east-1"
        sts_role_arn: ""
      acknowledgments: true
      delete_s3_objects_on_read: true
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/********1234/us-s3-k8-log"
        visibility_duplication_protection: true
  # Processing pipeline
  # Timestamp: adds @timestamp from ingestion time
  # Index naming: sets index_name from the Kubernetes namespace
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
    - add_entries:
        entries:
        - key: "index_name"
          value_expression: "/kubernetes_namespace/name"
          add_when: "/index_name == null"
    - delete_entries:
        with_keys: [ "tmp" ]

    # JSON parsing: parses nested JSON in the log and message fields
    # Failed JSON parsing is skipped silently
    - parse_json:
        source: /log
        handle_failed_events: 'skip_silently'
    - parse_json:
        source: /message
        handle_failed_events: 'skip_silently'

    # Environment detection: uses grok patterns to extract the environment from namespace names
    - grok:
        grok_when: 'contains(/index_name, "prod-") or contains(/index_name, "prod-k1-") or contains(/index_name, "prod-k2-")'
        match:
          index_name:
            - '%{WORD:prefix}-%{GREEDYDATA:suffix}-%{INT:ignore}'
            - '%{WORD:prefix}-%{GREEDYDATA:suffix}'
    - add_entries:
        entries:
        - key: "/suffix"
          value_expression: "/index_name"
          add_when: "/suffix == null"
        - key: "/labels/environment"
          value_expression: "/prefix"
          add_when: "/prefix != null"
          overwrite_if_key_exists: true
        - key: "/labels/environment"
          value_expression: "/labels_environment"
          add_when: "/labels_environment != null"
          overwrite_if_key_exists: true
  # Routing logic
  # k8: normal Kubernetes logs
  # k8-debug: DEBUG-level logs (separate retention)
  # unknown: logs without proper metadata
  routes:
    - k8: '/kubernetes_namespace/name != null or /data_source == "kubernetes"'
    - k8-debug: '/data_source == "kubernetes" and /levelname == "DEBUG"'
    - unknown: '/kubernetes_namespace/name == null and /suffix == null and /log_group == null'
  # Sinks (3 destinations)
  # S3 archive: all logs stored in S3 with date partitioning
  # OpenSearch (normal): ${suffix}-v4-k8 index for regular logs
  # OpenSearch (debug): ${suffix}-v4-k8-debug index for debug logs
  sink:
    - s3:
        aws:
          region: "us-east-1"
          sts_role_arn: ""
        bucket: 
        object_key:
          path_prefix: 'us/${getMetadata("s3-prefix")}/%{yyyy}/%{MM}/%{dd}/'
        codec:
          json:
        compression: "none"
        threshold:
          maximum_size: 20mb
          event_collect_timeout: PT10M
    - opensearch:
        hosts: ["https://"]
        index: "${/suffix}-v4-k8"
        index_type: custom
        # Max 15 retries for OpenSearch operations
        max_retries: 15
        aws:
          # IAM role that the pipeline assumes to access the collection sink
          sts_role_arn: ""
          region: "us-east-1"
          serverless: true
          serverless_options:
            network_policy_name: "prod-logging-network"
        # Error handling:
        # Dead-letter queue (DLQ) to S3 for failed OpenSearch writes
        dlq:
          s3:
            bucket: ""
            key_path_prefix: "/k8/"
            region: "us-east-1"
            sts_role_arn: ""
        routes:
          - k8
    - opensearch:
        hosts: ["https://"]
        index: "${/suffix}-v4-k8-debug"
        index_type: custom
        max_retries: 15
        aws:
          # IAM role that the pipeline assumes to access the collection sink
          sts_role_arn: ""
          region: "us-east-1"
          serverless: true
          serverless_options:
            network_policy_name: "prod-logging-network"
        dlq:
          s3:
            bucket: ""
            key_path_prefix: "/k8-debug/"
            region: "us-east-1"
            sts_role_arn: ""
        routes:
          - k8-debug
    - opensearch:
        hosts: ["https://"]
        index: "unknown"
        index_type: custom
        max_retries: 15
        aws:
          # IAM role that the pipeline assumes to access the collection sink
          sts_role_arn: ""
          region: "us-east-1"
          serverless: true
          serverless_options:
            network_policy_name: "prod-logging-network"
        dlq:
          s3:
            bucket: ""
            key_path_prefix: "/unknown/"
            region: "us-east-1"
            sts_role_arn: ""
        routes:
          - unknown
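As a quick sanity check on the sizing described earlier (six S3 workers saturate one OCU, each worker handles 10 messages at a time, and the pipeline is capped at 50 OCUs), the peak parallelism works out as follows:

```python
workers_per_ocu = 6       # six S3 workers fully utilize one OCU
messages_per_worker = 10  # messages each worker processes at a time
max_ocus = 50             # configured pipeline maximum

parallel_messages = max_ocus * workers_per_ocu * messages_per_worker
print(parallel_messages)  # 3000
```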

Indexing strategy

When working with a search engine, understanding index and shard management is crucial. Indexes and their corresponding shards consume memory and CPU resources to maintain metadata. A key challenge emerges when a system has numerous small shards, because this leads to higher resource consumption and operational overhead. In the traditional approach, you typically create indices at the microservice level for each environment (prod, qa, and dev). For example, indices might be named prod-k1-service or prod-k2-service, where k1 and k2 represent different microservices. With hundreds of services and daily index rotation, this approach results in thousands of indices, making management complex and resource intensive.

When implementing OpenSearch Serverless, you should adopt a consolidated indexing strategy that moves away from microservice-level index creation. Rather than creating individual indices like prod-k1-service and prod-k2-service for each microservice and environment, you should consolidate the data into broader environment-based indices such as prod-service, which contains all service data for the production environment. This consolidation is important because OpenSearch Serverless scales based on resources and has specific limits on the number of shards per OCU, which means that a higher number of small shards leads to higher OCU consumption.
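The scale of the difference is easy to see with illustrative numbers (the service and environment counts below are assumptions, not AppZen's actual figures):

```python
services = 300        # assumed number of microservices
environments = 3      # prod, qa, dev
retention_days = 90   # daily rotation retained for 90 days

# One index per service per environment per day
per_service_daily = services * environments * retention_days

# One consolidated index per environment per day
consolidated = environments * retention_days

print(per_service_daily, consolidated)  # 81000 270
```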

However, although this consolidated approach can significantly reduce operational costs and simplify management through built-in data lifecycle policies, it presents a notable challenge for multi-tenant scenarios. Organizations with strict security requirements, where different teams need access to specific indices only, might find this consolidated approach difficult to implement. For such cases, a more granular indexing approach might be necessary to maintain proper access control, although it can result in higher resource consumption.

By carefully evaluating your security requirements and access control needs, you can choose between a consolidated approach for optimized resource utilization and a more granular approach that better supports fine-grained access control. Both approaches are supported in OpenSearch Serverless, so you can balance resource optimization with security requirements based on your specific use case.
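In the granular model, per-team access is expressed with OpenSearch Serverless data access policies. The following is a hedged sketch (collection name, index pattern, and role ARN are placeholders) granting one team read access to only its own indices:

```json
[
  {
    "Rules": [
      {
        "ResourceType": "index",
        "Resource": ["index/central-logs/prod-payments-*"],
        "Permission": ["aoss:ReadDocument", "aoss:DescribeIndex"]
      }
    ],
    "Principal": ["arn:aws:iam::111122223333:role/payments-team"]
  }
]
```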

Cost optimization

OpenSearch Ingestion allocates some OCUs from the configured pipeline capacity for persistent buffering, which provides data durability. While monitoring, AppZen observed higher OCU utilization for this persistent buffer when processing high-throughput workloads. To optimize the capacity configuration, AppZen decided to classify its workloads into push-based and pull-based categories depending on their throughput and latency requirements. Achieving this meant creating new pipelines to operate these flows in parallel, as shown in the architecture diagram earlier in this post. Fluent Bit agent collector configurations were modified accordingly based on the workload classification.

Depending on the cost and performance requirements for the workload, AppZen adopted the appropriate ingestion flow. For low-latency, low-throughput workload requirements, AppZen chose the push-based approach. For high-throughput workload requirements, AppZen adopted the pull-based approach, which helped lower persistent buffer OCU utilization by delegating durability to the data source. In the pull-based approach, AppZen further optimized storage cost by configuring the pipeline to automatically delete processed data from the S3 bucket after successful ingestion.

Monitoring and dashboards

One of the key design principles for operational excellence in the cloud is to implement observability for actionable insights. This helps you gain a comprehensive understanding of your workloads so you can improve performance, reliability, and cost. Both OpenSearch Serverless and OpenSearch Ingestion publish all metrics and log data to Amazon CloudWatch. After identifying key operational OpenSearch Serverless metrics and OpenSearch Ingestion pipeline metrics, AppZen set up CloudWatch alarms to send a notification when certain defined thresholds are met. The following screenshot shows the number of OCUs used to index and search collection data.
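One such alarm could be declared in CloudFormation along these lines. This is a hypothetical sketch: the namespace and metric name follow the OpenSearch Serverless CloudWatch metrics documentation, while the threshold and the referenced SNS topic are illustrative assumptions:

```yaml
IndexingOcuAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Indexing OCU usage above expected baseline
    Namespace: AWS/AOSS
    MetricName: IndexingOCU
    Statistic: Maximum
    Period: 300
    EvaluationPeriods: 3
    Threshold: 40
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref OpsNotificationTopic   # SNS topic assumed to be defined elsewhere
```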

OpenSearch Serverless capacity management dashboard showing OCU usage graphs

The following screenshot shows the number of ingestion OCUs in use by the pipeline.

The following screenshot shows the percentage of available CPU usage per OCU.

The following screenshot shows the percentage utilization of the buffer based on the number of records in the buffer.

Conclusion

AppZen successfully modernized its logging infrastructure by migrating to a serverless architecture using Amazon OpenSearch Serverless and OpenSearch Ingestion. By adopting this new serverless solution, AppZen eliminated the operational overhead of 7 days of data migration effort during each quarterly upgrade and patching cycle of the Kubernetes cluster hosting Elasticsearch nodes. With the serverless approach, AppZen was also able to avoid index mapping conflicts by using index templates and a new indexing strategy. This helped the team save an average of 5.2 hours per week of operational effort and instead use that time to focus on other priority business challenges. AppZen achieved a better security posture through centralized access controls with OpenSearch Serverless, eliminating the overhead of managing a duplicate set of user permissions at the proxy layer. The new solution helped AppZen handle growing data volume and build real-time operational analytics while optimizing cost and improving scalability and resiliency. AppZen optimized costs and performance by classifying workloads into push-based and pull-based flows, so it could choose the appropriate ingestion approach based on latency and throughput requirements.

With this modernized logging solution, AppZen is well positioned to efficiently monitor its business operations, perform in-depth data analysis, and effectively monitor and troubleshoot its applications as it continues to grow. Looking ahead, AppZen plans to use OpenSearch Serverless as a vector database, incorporating Amazon S3 Vectors, generative AI, and foundation models (FMs) to enhance operational tasks using natural language processing.

To implement a similar logging solution in your organization, begin by exploring the AWS documentation on migrating to Amazon OpenSearch Serverless and setting up OpenSearch Serverless. For guidance on creating ingestion pipelines, refer to the AWS guide on OpenSearch Ingestion to begin modernizing your logging infrastructure.


About the authors

Prashanth Dudipala is a DevOps Architect at AppZen, where he helps build scalable, secure, and automated cloud platforms on AWS. He is passionate about simplifying complex systems, enabling teams to move faster, and sharing practical insights with the cloud community.

Madhuri Andhale is a DevOps Engineer at AppZen, focused on building and optimizing cloud-native infrastructure. She is passionate about managing efficient CI/CD pipelines, streamlining infrastructure and deployments, modernizing systems, and enabling development teams to deliver faster and more reliably. Outside of work, Madhuri enjoys exploring emerging technologies, traveling to new places, experimenting with new recipes, and finding creative ways to solve everyday challenges.

Manoj Gupta is a Senior Solutions Architect at AWS, based in San Francisco. With over 4 years of experience at AWS, he works closely with customers like AppZen to build optimized cloud architectures. His primary focus areas are data, AI/ML, and security, helping organizations modernize their technology stacks. Outside of work, he enjoys outdoor activities and traveling with family.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.
