Scaling Up: How We Increased Availability Using a CDN and EC2 Auto-Scaling

General

The Signaloid Cloud Computing Engine (SCCE) is a cloud-based platform on which organizations can run mission-critical computing tasks. It implements Signaloid's UxHw technology, which enables orders of magnitude faster and more accurate execution than competing solutions for workloads such as Value at Risk (VaR) computations in finance. Beyond finance, its capabilities are valuable for any use case that traditionally employed Monte Carlo methods and that requires low execution latency and scalability to very large workload sets. SCCE provides a virtual processor abstraction to applications running over it: developers compile and run applications on this hardware abstraction from codebases implemented in any language supported by the frontends of the LLVM compiler infrastructure. Interaction with SCCE is via a REST API that allows programmatic launching of tasks, program recompilation, results retrieval, and more.

Because SCCE is intended to be permanently integrated into the execution stack of production systems such as quantitative finance libraries, achieving high availability and scalability with increasing workload size is paramount. This technology explainer describes how we architected and implemented SCCE's infrastructure to provide 99.9975% availability for task execution, achieving auto-scaling of resources using Amazon Web Services (AWS) auto-scaling groups (ASGs) located in multiple AWS geographic availability zones (AZs). The technology explainer also highlights how developers can improve the efficiency of their application frontends, as we did for our implementation of the Signaloid Cloud Developer Platform (SCDP), a first-party application that developers use for, e.g., administering their subscriptions to the paid SCCE platform, generating API tokens, debugging applications, or viewing the execution history of tasks they have launched on SCCE.
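
To give a flavor of that programmatic interaction, the minimal Python sketch below launches a task, polls for completion, and retrieves its results. The base URL, endpoint paths, and payload fields here are illustrative assumptions for the sketch, not the definitive SCCE API surface; consult Signaloid's API documentation for the actual interface.

    import time
    import requests

    # Assumed values for this sketch: the base URL, endpoint paths, and
    # payload fields are illustrative, not the definitive SCCE API.
    API_BASE = "https://api.signaloid.io"
    HEADERS = {"Authorization": "YOUR-API-TOKEN"}  # Token generated, e.g., via SCDP.

    # Launch an execution task.
    task = requests.post(f"{API_BASE}/tasks", headers=HEADERS,
                         json={"application": "my-app"}).json()

    # Poll until the task finishes, then retrieve its results.
    while True:
        status = requests.get(f"{API_BASE}/tasks/{task['id']}",
                              headers=HEADERS).json()
        if status["state"] in ("Completed", "Failed"):
            break
        time.sleep(2)

    results = requests.get(f"{API_BASE}/tasks/{task['id']}/results",
                           headers=HEADERS).json()
    print(results)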

Why It Matters

Organizations integrate the Signaloid Cloud Computing Engine (SCCE) into their software stacks to take advantage of Signaloid's UxHw technology for speeding up mission-critical workloads that they previously solved using Monte Carlo methods. The architectural design and implementation choices we have made in SCCE's infrastructure allow organizations that build on it to be assured of high availability and scalability, even under heavy load.

The Technical Details

Enhancing Backend Scalability with Auto Scaling Groups

Tasks which execute on SCCE are compiled with Signaloid's LLVM-based compilers and linked against Signaloid's runtime libraries. At execution time, SCCE gives these compiled applications the impression that they are running on a processor whose floating-point values (the double and float data types in the case of C/C++ programs) carry embedded probability distribution information, while execution obeys standard program semantics. The compilation and execution environment in cloud-based SCCE uses a combination of techniques, including binary rewriting, to implement this abstraction and the performance enhancements of Signaloid's UxHw technology, achieving end-to-end application speedups as high as 1000x in relevant use cases. On-premises installations of SCCE can provide additional speedup when coupled with Signaloid's PCIe hardware accelerator cards.

A naive implementation (Figure 1, top) of a cloud-based or on-premises task execution infrastructure might consist of a single compute server (e.g., a single AWS EC2 instance) that processes tasks from a queue (e.g., an AWS SQS queue). Such a deployment would, however, have at least two significant disadvantages:

  • The deployment of updates would result in outages. For an infrastructure with frequent updates, this would lead to a significant reduction in availability.

  • A single compute instance could fail, halting all processing.

A better approach (Figure 1, bottom) is to architect the infrastructure so that it can scale the number of compute servers up or down. For the cloud-based variant of our SCCE infrastructure implemented on top of AWS, we use the ASG facility mentioned above, with a minimum of two instances distributed across multiple geographic AZs, as the sketch below illustrates. (On-premises deployments of SCCE on AWS Outposts can similarly use this auto-scaling facility.)
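
As a concrete illustration, the following boto3 sketch creates such a group. The group name, launch template, capacity bounds, and subnet IDs are placeholder assumptions, not SCCE's actual configuration; the key points are the minimum of two instances and the subnets in distinct AZs.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Placeholder names and subnet IDs; the two subnets lie in different
    # availability zones, so the group survives the loss of any single AZ.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="scce-workers",
        LaunchTemplate={
            "LaunchTemplateName": "scce-worker-template",
            "Version": "$Latest",
        },
        MinSize=2,           # Never fewer than two compute instances.
        MaxSize=16,          # Upper bound for scale-out under load.
        DesiredCapacity=2,
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    )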

Designing SCCE's infrastructure to support auto scaling and multiple AZs also allowed us to make additional performance-enhancing architectural decisions:

  • Increased Fault Tolerance: With a multi-AZ configuration, tasks can continue to be processed even if one geographic AZ fails, e.g., due to a natural disaster. This allows SCCE to achieve an availability of 99.9975% (detailed further below), reducing the likelihood of SCCE service outages.

  • Dynamic Scaling: To enable dynamic scaling, SCCE implements the Backlog Per Instance (BPI) metric, which measures the average number of messages in the SQS queue (which we use for queueing execution tasks) per worker instance. The ASG automatically adds or removes instances based on the value of the BPI, efficiently absorbing workload changes without human involvement (see the sketch after this list).

  • Near-zero-downtime rollout: The ASG performs rolling updates, updating each instance separately. This ensures that at least one instance is always running during deployments, so software and configuration updates can be rolled out without interrupting SCCE service.

  • Reduced Latency with VPC Endpoints and NAT: In tandem with using ASGs and multiple AZs, the SCCE architectural design keeps network traffic inside the AWS infrastructure using AWS network address translation (NAT) gateways and AWS virtual private cloud (VPC) endpoints across multiple AZs. This allows data moving within the SCCE infrastructure to traverse fewer network hops, lowering network latency and improving the dependability and end-to-end performance of SCCE.
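
To illustrate the BPI-driven scaling mentioned above, the sketch below computes the backlog per in-service instance and publishes it as a custom CloudWatch metric; the queue URL, group name, and metric namespace are placeholder assumptions. A target-tracking scaling policy on this metric (e.g., via put_scaling_policy with a CustomizedMetricSpecification) can then hold BPI near a chosen target by adding or removing instances.

    import boto3

    sqs = boto3.client("sqs")
    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    # Placeholder queue URL and ASG name.
    QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/123456789012/scce-tasks"
    GROUP = "scce-workers"

    # Number of queued execution tasks waiting to be processed.
    backlog = int(sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]["ApproximateNumberOfMessages"])

    # Number of in-service worker instances in the ASG.
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[GROUP])["AutoScalingGroups"][0]
    in_service = max(1, sum(1 for i in group["Instances"]
                            if i["LifecycleState"] == "InService"))

    # Publish BPI as a custom CloudWatch metric; a target-tracking scaling
    # policy on this metric then adds or removes instances automatically.
    cloudwatch.put_metric_data(
        Namespace="SCCE",
        MetricData=[{
            "MetricName": "BacklogPerInstance",
            "Value": backlog / in_service,
            "Unit": "Count",
        }],
    )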

Because of the architectural decisions outlined above, organizations that integrate SCCE into their computing stacks can be assured of 99.9975% availability, with low latency, workload adaptability, and uninterrupted operation even during periods of high demand and even across updates to SCCE's software.

One example of an end-to-end application built on top of SCCE is the Signaloid Cloud Developer Platform (SCDP), a platform we provide to allow developers to perform tasks such as creating API keys for use with applications they build over SCCE, monitoring the execution state and history of tasks, interactively launching tasks over SCCE, and more. SCDP, which is available at https://signaloid.io, is a web-based application with a front end hosted on a content delivery network (CDN, see Figure 2) and an intermediate backend hosted on AWS, which in turn generates API requests to SCCE.

Figure 1: Some of the benefits of load-adaptive auto-scaling groups (ASGs).

Quantifying the Improved System Availability

SCDP uses Cloudflare as its CDN, spreading the SCDP web application's page assets across Cloudflare's global network of edge servers, which provides a theoretical availability of 99.99%. The complete SCDP software stack below the web front end, including the SCCE stack described above (with 99.9975% availability, its ASG configured with a minimum of two instances across multiple AZs), is as follows, with each layer's availability in parentheses:

  • Content delivery layer: Cloudflare CDN (99.99%)

  • API management layer: API Gateway (99.99%)

  • Request handling layer: Lambda function (99.99%)

  • Messaging layer: SQS (99.99%)

  • Compute layer (i.e., SCCE): EC2 Auto Scaling Group across multiple AZs (99.9975%)

  • First post-processing layer: Lambda function (99.99%)

  • Second post-processing layer: Lambda function (99.99%)

  • Storage layer: S3 and DynamoDB (both 99.99%)

Because these layers operate in series, the end-to-end availability of SCDP is the product of the individual layer availabilities. With this architecture, we achieve an availability of 99.91% for SCDP, corresponding to a worst-case unplanned downtime of (100% - 99.91%) × 8,760 hours = 0.09% × 8,760 hours, i.e., a maximum of 7.9 hours per year.
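
The arithmetic can be checked in a few lines of Python, using the layer availabilities listed above:

    from math import prod

    # Layer availabilities from the list above: eight layers at 99.99%
    # (CDN, API Gateway, request-handling Lambda, SQS, two post-processing
    # Lambdas, S3, DynamoDB) and the multi-AZ compute layer at 99.9975%.
    layers = [0.9999] * 8 + [0.999975]

    # Layers in series: end-to-end availability is the product.
    availability = prod(layers)

    print(f"{availability:.4%}")         # ~99.9175%, i.e., 99.91% truncated
                                         # to two decimal places.

    # Worst-case unplanned downtime per year (8,760 hours in a year).
    print(f"{(1 - 0.9991) * 8760:.1f}")  # ~7.9 hours.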

When considering planned downtime due to software updates, the rolling updates enabled by ASGs and our use of Cloudflare for the SCDP front end allow SCDP to achieve high availability even in the face of monthly software updates. Although ASG rolling updates enable near-zero-downtime deployments for individual components, we observed that synchronizing deployments across three interdependent components remains challenging; in practice, we achieve about two minutes of planned downtime per monthly update, or 0.4 hours of planned downtime per year.
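
One way to drive such rolling updates on AWS is EC2 Auto Scaling's instance-refresh mechanism, sketched below. The group name and preference values are placeholder assumptions rather than SCCE's actual deployment configuration.

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Replace instances in batches while keeping at least half of the group
    # in service, so task processing continues throughout the rollout.
    autoscaling.start_instance_refresh(
        AutoScalingGroupName="scce-workers",  # Placeholder group name.
        Strategy="Rolling",
        Preferences={
            "MinHealthyPercentage": 50,  # Never take the whole group offline.
            "InstanceWarmup": 120,       # Seconds before a new instance
                                         # counts as healthy.
        },
    )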

Figure 2: Some of the benefits of using a content delivery network (CDN).

The Takeaway

The Signaloid Cloud Computing Engine (SCCE) implements Signaloid's UxHw technology, which enables orders of magnitude faster and more accurate execution than competing solutions for workloads such as Value at Risk (VaR) computations in finance, among its many applications. Because SCCE is intended to be permanently integrated into the execution stack of production systems such as quantitative finance libraries, achieving high availability and scalability with increasing workload size is paramount. This technology explainer described how SCCE achieves 99.9975% availability through a combination of auto-scaling and geographic redundancy, and showed how end-to-end applications built on top of SCCE, such as the Signaloid Cloud Developer Platform, combine it with CDN caching to achieve 99.91% end-to-end availability.

Schedule a Demo Call
Request Whitepaper