From Crisis to Optimization: Migrating from DynamoDB to S3 Object Storage

The Signaloid Cloud Compute Engine (SCCE) is a platform we designed to execute applications that can benefit from Signaloid's UxHw technology and that span a wide variety of data processing characteristics. SCCE allows developers to access the outputs of their applications, as well as metadata generated on their behalf by the UxHw technology, both at runtime and after execution completes; this capability can, however, lead to SCCE generating large volumes of stored data from the execution of end-user applications. In 2024, we experienced a marked degradation in the performance of our infrastructure when the volume of this data led to throttling of writes from applications running on SCCE, as well as an increase in our costs for long-term storage of application output data and metadata. While assessing some of the internal data paths of our cloud platform architecture, we discovered that one particular kind of workload was consuming a sizable share of the throughput of the AWS DynamoDB key-value store that served as our primary data store, with the potential to completely overload DynamoDB and SCCE. By rethinking SCCE's data storage architecture, however, we were able to preserve the original intended functionality while achieving an overall improvement in throughput. This technology explainer details how we overcame these challenges.
Why It Matters
Efficient cloud infrastructure implementation is essential to a cloud service that delivers performance for customers while keeping operating expenses low. Reducing the sensitivity of a cloud infrastructure to service load also increases its responsiveness and improves the overall availability of the platform, even during periods of high demand. Insights such as these, which lead to continuous improvements, are therefore important.
The Technical Details
Saturation
When processing large datasets on SCCE, we encountered significant write throughput limitations. Consider a scenario in which a task generates 1 million DynamoDB items (200 MB in total) to be stored in the key-value store. Our initial strategy was to rate-limit the workload to avoid consuming all of the available write throughput. This solution, however, had significant drawbacks:
A standard allocation of 100 WCU (Write Capacity Units) per task results in a processing time of 2 hours and 47 minutes for the example workload described above.
Even at the maximum theoretical throughput (40,000 WCU, the default per-table write quota in DynamoDB), processing would still take 25 seconds. The short sketch below reproduces this arithmetic.
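The following is a minimal back-of-envelope sketch (in Python) of the arithmetic behind these figures. The variable names are ours, and the calculation assumes each item is under 1 KB, so that one standard write consumes exactly 1 WCU:

```python
# Back-of-envelope arithmetic for the DynamoDB write times quoted above.
# Assumption: each of the 1M items averages ~200 bytes (200 MB / 1M items),
# i.e. well under 1 KB, so each write consumes exactly 1 WCU.

TOTAL_ITEMS = 1_000_000   # items generated by the example task
WCU_PER_ITEM = 1          # standard writes: 1 WCU per item up to 1 KB

def write_time_seconds(provisioned_wcu: int) -> float:
    """Time to drain the workload at a given provisioned write throughput."""
    return TOTAL_ITEMS * WCU_PER_ITEM / provisioned_wcu

print(f"At    100 WCU: {write_time_seconds(100) / 3600:.2f} hours")  # ~2.78 h, i.e. 2 h 47 min
print(f"At 40,000 WCU: {write_time_seconds(40_000):.0f} seconds")    # 25 s
```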
This approach (Figure 1) would negatively impact user experience and, depending on the dedicated write throughput allocation, increase database infrastructure costs by 2- to 100-fold compared to an infrastructure designed to handle only a low write throughput.
Our design goal was to reduce database saturation from the demanding workloads while continuing to support all of the existing application access patterns.
Figure 1: A design that places an unnecessary burden on DynamoDB.
Object Storage to the Rescue
Our analysis revealed that the output data of the demanding workloads could be efficiently encoded as binary files consisting primarily of floating-point numbers with minimal metadata overhead. This approach enabled us to store a single 8 MB file on Amazon S3 instead of creating one million individual DynamoDB records. With our EC2 instances configured with a network throughput of 12.5 Gbps, we calculated that uploading an 8 MB file to S3 would theoretically take approximately 5 ms; to account for potential network fluctuations and processing overhead, we conservatively estimated 10 ms per file upload. By migrating to this new architecture, we eliminated the need for DynamoDB write capacity for these workloads, preserving those resources for other operations.

While this optimization (Figure 2) dramatically improved our write performance, it came with a trade-off in read latency: instead of the typical 1–10 ms average response time when reading from DynamoDB, customers now experience approximately 80–100 ms of latency when reading data from S3 buckets. Despite this increased read time, the overall benefit of reducing write operations from hours to milliseconds made this a worthwhile decision, especially since most workloads that exploit the UxHw technology generate access patterns, for both explicit data writes and metadata writes, that are write-intensive rather than read-intensive.
Figure 2: A better design that takes the best of both worlds, combining DynamoDB with S3.
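To make this pattern concrete, the following is a minimal sketch of such a write path using boto3 and NumPy. The bucket name, key layout, and plain float64 encoding are illustrative assumptions on our part, not SCCE's actual output format:

```python
# Sketch of the write path: pack a task's floating-point outputs into one
# binary object and upload it with a single S3 PUT, instead of performing
# ~1M individual DynamoDB writes. Names and encoding are illustrative only.
import boto3
import numpy as np

s3 = boto3.client("s3")

def store_task_outputs(task_id: str, samples: np.ndarray) -> None:
    # 1M float64 samples serialize to 8 MB (1e6 items x 8 bytes each).
    payload = samples.astype(np.float64).tobytes()
    # One PUT; at 12.5 Gbps, 8 MB is ~5 ms of wire time (64e6 b / 12.5e9 b/s).
    s3.put_object(
        Bucket="scce-task-outputs",          # hypothetical bucket name
        Key=f"tasks/{task_id}/outputs.bin",  # hypothetical key layout
        Body=payload,
    )

# Example: one task's 1M samples become a single ~8 MB object, zero WCU used.
store_task_outputs("task-0001", np.random.default_rng(0).normal(size=1_000_000))
```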
Quantifying the Performance Improvement
Our decision to use S3 storage resulted in quantifiable performance gains that transformed our solution's scalability. By improving our data storage architecture, we reduced 1 million individual DynamoDB writes (200 MB in total) to a single 8 MB compressed binary file, a 25-fold reduction in stored data volume. This architectural change reduced our post-execution data processing time for the example workload described above from approximately 2.7 hours to 10 ms, a nearly 1,000,000-fold improvement in the time to store the data and metadata after the completion of an execution task. Most importantly, this optimization freed DynamoDB write capacity without requiring additional provisioning. We estimate that, two to three times per month, these data-intensive workloads consumed 90% or more of our total DynamoDB write capacity; by offloading this traffic to S3, we reduced throttling events across the other SCCE services that shared the same DynamoDB resources. The modest increase in read latency (from 1–10 ms with DynamoDB to 80–100 ms with S3) is a reasonable trade-off given the benefits across our entire platform ecosystem.
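For completeness, here is a matching sketch of the read path, under the same illustrative assumptions as the write sketch above; the single GET request below is where the approximately 80–100 ms of S3 read latency is paid:

```python
# Sketch of the read path: one S3 GET (~80-100 ms observed) replaces many
# low-latency DynamoDB reads. Names mirror the write sketch above.
import boto3
import numpy as np

s3 = boto3.client("s3")

def load_task_outputs(task_id: str) -> np.ndarray:
    response = s3.get_object(
        Bucket="scce-task-outputs",          # hypothetical bucket name
        Key=f"tasks/{task_id}/outputs.bin",  # hypothetical key layout
    )
    # Decode the raw float64 payload back into an array of samples.
    return np.frombuffer(response["Body"].read(), dtype=np.float64)
```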
The Takeaway
By migrating the storage of the output of data-intensive workloads from DynamoDB to S3, we eliminated resource contention and reduced database write throttling. Reducing these incidents improved our platform's reliability and stability at no additional cost. What had appeared to be a reasonable DynamoDB-based design at the time of our initial implementation revealed its limitations at scale; by evaluating alternatives, we created a solution that eliminated performance bottlenecks and improved the performance, reliability, and operating costs of our cloud infrastructure. This example demonstrates how architectural changes can simultaneously improve performance, reliability, and costs. Together with Signaloid's underlying UxHw technology, the execution model abstractions we provide, and the carefully-engineered high-performance task execution of our cloud platform, the cloud architecture revisions described in this technology explainer help make the Signaloid Cloud Compute Engine the easiest way to engineer solutions traditionally implemented using Monte Carlo methods, as well as the fastest way to run those solutions.