How to Build Highly Scalable Cloud-Native Apps: 5 Essential Tips

In the realm of cloud-native app development, scalability, reliability, and cost-efficiency are paramount. The redesign of the Apache Kafka engine, resulting in the creation of Kora, offers valuable insights into achieving these objectives. Here are five key innovations that significantly improved the performance, availability, and cost-efficiency of the Kafka engine, along with their real-world impact.

Using Logical ‘Cells’ for Scalability and Isolation

To achieve high availability and horizontal scalability, the Kafka engine redesign introduced the concept of logical ‘cells.’ These cells, which cut across availability zones, offer both scalability and isolation. By isolating tenants within a single cell, replication and scalability issues are minimized, providing a more efficient and stable system.
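The idea of confining a tenant to one cell can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from Kora: the functions `build_cells`, `cell_for_tenant`, and `place_partitions`, the hash-based tenant-to-cell mapping, and the cell size of six are illustrative assumptions. The sketch also ignores availability zones, which a real cell layout would have to span.

```python
import hashlib


def build_cells(brokers, cell_size):
    """Split a list of broker ids into fixed-size cells (assumed layout)."""
    if cell_size <= 0 or len(brokers) % cell_size != 0:
        raise ValueError("broker count must be a multiple of cell_size")
    return [brokers[i:i + cell_size] for i in range(0, len(brokers), cell_size)]


def cell_for_tenant(tenant_id, cells):
    """Map a tenant to one cell with a stable hash (hypothetical policy)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % len(cells)


def place_partitions(tenant_id, num_partitions, cells, replication_factor=3):
    """Place every replica of a tenant's partitions inside the tenant's cell."""
    cell = cells[cell_for_tenant(tenant_id, cells)]
    if replication_factor > len(cell):
        raise ValueError("replication_factor cannot exceed the cell size")
    placement = {}
    for partition in range(num_partitions):
        # Rotate through the cell so replicas land on distinct brokers.
        placement[partition] = [
            cell[(partition + r) % len(cell)] for r in range(replication_factor)
        ]
    return placement


cells = build_cells(list(range(24)), cell_size=6)
placement = place_partitions("tenant-a", num_partitions=12, cells=cells)
brokers_used = {b for replicas in placement.values() for b in replicas}
print(f"tenant-a touches {len(brokers_used)} of 24 brokers")
```

Because a tenant's replicas never leave its cell, the brokers it connects to and replicates across are bounded by the cell size rather than by the size of the whole cluster.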

In an experimental 24-broker cluster with six broker cells, the cluster load decreased from 73% to 53% when cells were implemented. This 20-percentage-point reduction in load demonstrates the improvement in scalability and efficiency achieved with this innovation.

Furthermore, the implementation of cells not only reduced the overall cluster load but also improved the system’s ability to handle increased workloads. With the introduction of logical cells, the number of connections and replication overhead decreased significantly.

In detail, the pre-cell load of 73% indicates a high level of resource utilization in the 24-broker cluster. The drop to 53% is a relative reduction of about 27.4% (20 points out of the original 73), which points to a more efficient allocation of resources within the cluster. That headroom allows the Kafka engine to handle higher workloads while maintaining performance and stability, which is why logical cells proved to be a crucial innovation for scalability and efficiency.
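The two ways of stating the load reduction are easy to confuse, so here is a small worked calculation using the figures quoted above. The helper name `load_change` is only for illustration.

```python
def load_change(before_pct, after_pct):
    """Return (percentage-point drop, relative drop in percent)."""
    points = before_pct - after_pct
    relative = points / before_pct * 100.0
    return points, relative


points, relative = load_change(73.0, 53.0)
print(f"{points:.0f} percentage points, {relative:.1f}% relative reduction")
# prints "20 percentage points, 27.4% relative reduction"
```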

Balancing Storage Types to Optimize for Warm and Cold Data

The use of different storage types in cloud environments presents an opportunity to optimize cost and performance. By tiering the architecture and utilizing object storage for cold data and block storage for frequently accessed data, the Kafka engine achieved improved performance, reliability, and cost-effectiveness.

The tiered architecture allowed for significant cost reduction, as object storage proved to be a more affordable option for storing cold data. Additionally, the separation of storage from compute enhanced elasticity, making partition reassignment more efficient.

Performance also improved significantly, as warm data could be stored in highly performant storage volumes without incurring prohibitive costs. This tiered approach resulted in both cost and performance improvements.

For instance, in a comparative analysis between the traditional storage setup and the tiered architecture, significant cost savings were observed. When utilizing object storage for cold data and block storage for frequently accessed data, the cost of storage decreased by 40% compared to the traditional setup.

Moreover, the separation of storage from compute also led to improved elasticity. With the tiered architecture, partition reassignment became 50% more efficient, resulting in faster scaling and better resource utilization.

In terms of performance, the tiered approach allowed for more efficient data retrieval and processing. Warm data stored in highly performant storage volumes exhibited a 30% improvement in read and write speeds compared to the traditional setup.

Overall, the implementation of a tiered storage architecture resulted in a more cost-effective, reliable, and high-performing Kafka engine, enhancing its suitability for cloud-native applications.
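The tiering decision described in this section can be sketched in a few lines. This is an illustrative model, not Kora's implementation: the `Segment` type, the 24-hour warm window, and the per-gigabyte prices are all assumptions chosen to show the shape of the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    age_hours: float  # time since the segment was last written
    size_gb: float


def choose_tier(segment, warm_window_hours=24.0):
    """Keep recent (warm) data on block storage, move older data to object storage."""
    return "block" if segment.age_hours <= warm_window_hours else "object"


def plan_tiers(segments, warm_window_hours=24.0):
    """Group segments by the tier they should live on."""
    plan = {"block": [], "object": []}
    for segment in segments:
        plan[choose_tier(segment, warm_window_hours)].append(segment)
    return plan


def monthly_cost(plan, price_per_gb):
    """Sum storage cost per tier; prices are caller-supplied placeholders."""
    return sum(
        price_per_gb[tier] * sum(s.size_gb for s in segs)
        for tier, segs in plan.items()
    )


segments = [Segment("seg-0", 2.0, 10.0), Segment("seg-1", 72.0, 90.0)]
plan = plan_tiers(segments)
# Hypothetical prices in dollars per GB-month, for illustration only.
prices = {"block": 0.10, "object": 0.02}
print(f"tiered monthly cost: {monthly_cost(plan, prices):.2f}")
```

The point of the sketch is that only the small warm set pays block-storage prices, while the bulk of the retained data sits on the cheaper tier.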

Using Abstractions to Unify the Multicloud Experience

Operating on multiple cloud platforms requires a unified and consistent customer experience. To achieve this, the Kafka engine redesign introduced three abstractions: logical Kafka clusters, Confluent Kafka Units (CKUs), and a unified metric called cluster load. These abstractions shield customers from implementation details, providing a simpler and more consistent user experience.

The introduction of logical Kafka clusters, CKUs, and cluster load metrics significantly simplified the service for customers. They no longer needed to worry about low-level implementation details, allowing for a more seamless and consistent experience across different cloud platforms.

For example, the introduction of logical Kafka clusters enabled customers to manage their Kafka instances uniformly, whether in a multi-tenant or dedicated environment. This simplified access control and security management, streamlining the user experience.

Similarly, the introduction of Confluent Kafka Units (CKUs) provided customers with a standardized unit of capacity, making it easier to understand and manage their resource usage. CKUs were expressed in terms of customer-visible metrics such as ingress and egress throughput, and upper limits for request rate and connections. This abstraction allowed customers to easily scale their Kafka clusters without worrying about the underlying infrastructure details.

Furthermore, the unified metric called cluster load provided customers with a single, easy-to-understand metric to gauge the performance and health of their Kafka clusters. By abstracting away the complexity of monitoring and managing cluster performance, customers could make informed decisions about scaling up or down based on the cluster load metric.

Overall, these abstractions played a crucial role in simplifying the Kafka service for customers, providing them with a more consistent and user-friendly experience across multiple cloud platforms.
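To show how a capacity unit such as the CKU hides infrastructure details, the sketch below sizes a cluster from customer-visible demand alone. The `Capacity` type, the `ckus_needed` function, and the per-unit limits are hypothetical; they are not Confluent's published CKU limits.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    ingress_mbps: float
    egress_mbps: float
    requests_per_sec: float
    connections: int


def ckus_needed(demand, per_cku):
    """Units required so that every dimension of demand fits within its limit."""
    ratios = [
        demand.ingress_mbps / per_cku.ingress_mbps,
        demand.egress_mbps / per_cku.egress_mbps,
        demand.requests_per_sec / per_cku.requests_per_sec,
        demand.connections / per_cku.connections,
    ]
    # The most constrained dimension decides the size; never go below one unit.
    return max(1, math.ceil(max(ratios)))


# Assumed per-unit limits, for illustration only.
per_cku = Capacity(ingress_mbps=50.0, egress_mbps=150.0,
                   requests_per_sec=15000.0, connections=4500)
demand = Capacity(ingress_mbps=120.0, egress_mbps=200.0,
                  requests_per_sec=20000.0, connections=3000)
print(f"units needed: {ckus_needed(demand, per_cku)}")
```

The customer reasons only about throughput, request rate, and connections; which instance types or disks back each unit never enters the calculation.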

Automating Mitigation Loops to Combat Degradation

Maintaining reliability in a cloud environment requires swift detection and mitigation of failures. To address this challenge, the Kafka engine redesign implemented automated mitigation loops. These loops detect infrastructure degradation and take corrective actions, ensuring minimal impact on performance and user experience.

The automated mitigation loops were highly effective in addressing infrastructure issues. For instance, they detected and automatically mitigated an average of 3,000 partial degradations every month across all major cloud providers. These partial degradations included issues such as cloud-provider outages, software bugs, disk corruption, misconfigurations, and other causes of infrastructure degradation.

By automatically detecting and mitigating these issues, the Kafka engine ensured that any disruptions to service were minimized. This automated approach saved valuable operator time and ensured minimal disruption to customers. Without the need for manual intervention, the system could address infrastructure issues promptly, maintaining high levels of reliability and availability.

For example, during one month of operation, the automated mitigation loops successfully addressed 98% of detected partial degradations within five minutes of detection. This rapid response time significantly reduced the impact of infrastructure issues on the Kafka service, ensuring that customers experienced minimal disruption.

Overall, the implementation of automated mitigation loops was instrumental in maintaining the reliability and performance of the Kafka service in the cloud environment. By swiftly detecting and mitigating infrastructure issues, the system ensured uninterrupted service for customers across all major cloud providers.
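A detect-and-mitigate loop of the kind described here might look roughly like the following. This is a simplified sketch under stated assumptions: the `BrokerHealth` signals, the thresholds, and the "fence"/"restore" actions are hypothetical stand-ins, not the signals or remedies the actual system uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BrokerHealth:
    broker_id: int
    disk_latency_ms: float
    error_rate: float


@dataclass
class MitigationLoop:
    # Assumed thresholds; a real system would tune these per environment.
    max_disk_latency_ms: float = 50.0
    max_error_rate: float = 0.01
    fenced: set = field(default_factory=set)

    def is_degraded(self, health):
        return (health.disk_latency_ms > self.max_disk_latency_ms
                or health.error_rate > self.max_error_rate)

    def run_once(self, samples):
        """Evaluate one round of health samples and return the actions taken."""
        actions = []
        for health in samples:
            degraded = self.is_degraded(health)
            if degraded and health.broker_id not in self.fenced:
                # Mitigate: stop routing work to the degraded broker.
                self.fenced.add(health.broker_id)
                actions.append(("fence", health.broker_id))
            elif not degraded and health.broker_id in self.fenced:
                # Recover: bring the broker back once it looks healthy again.
                self.fenced.discard(health.broker_id)
                actions.append(("restore", health.broker_id))
        return actions


loop = MitigationLoop()
print(loop.run_once([BrokerHealth(1, 8.0, 0.0), BrokerHealth(2, 120.0, 0.0)]))
```

Tracking which brokers are already fenced keeps the loop idempotent: repeated samples of the same degradation do not trigger repeated mitigations.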

Balancing Stateful Services for Performance and Efficiency

Balancing load across servers in a stateful service is critical for ensuring optimal performance and efficiency. The Kafka engine achieves this through a balancing service that continuously reassigns partitions based on metrics and heuristics. This approach improves efficiency and utilization while minimizing performance disruptions.

The effectiveness of load balancing is evident in the tangible benefits experienced by Kafka engine users. For example, one customer saw a 25% reduction in their overall load after load balancing was enabled for their system. This reduction in load resulted in improved performance and cost-efficiency for the customer. By evenly distributing the workload across servers, the Kafka engine ensured that each server could handle its share of the workload effectively, leading to better performance and lower costs for the customer.

In addition to reducing overall load, effective load balancing also led to a significant reduction in latency for another customer. By optimizing the distribution of partitions across servers, the Kafka engine was able to minimize latency and ensure faster processing of data. As a result, the customer experienced improved responsiveness and faster data processing times, enhancing their overall user experience.

Overall, the implementation of effective load balancing mechanisms within the Kafka engine has proven to be highly beneficial for users. By continuously optimizing the distribution of workload across servers, the Kafka engine ensures optimal performance, efficiency, and reliability for its users, leading to improved user experiences and cost savings.
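As a rough illustration of metric-driven partition reassignment, the sketch below uses a simple greedy heuristic: repeatedly move one partition from the most loaded broker to the least loaded one, as long as the move narrows the gap. The heuristic, the `rebalance` function, and its parameters are assumptions for illustration; the source does not describe the actual balancing algorithm.

```python
def broker_loads(assignment, partition_load):
    """Total load per broker, given broker -> set of partitions."""
    return {b: sum(partition_load[p] for p in parts)
            for b, parts in assignment.items()}


def rebalance(assignment, partition_load, max_moves=10, tolerance=0.05):
    """Greedily move partitions from the hottest to the coldest broker.

    Mutates `assignment` in place and returns the list of moves as
    (partition, source_broker, target_broker) tuples.
    """
    moves = []
    for _ in range(max_moves):
        loads = broker_loads(assignment, partition_load)
        hot = max(loads, key=loads.get)
        cold = min(loads, key=loads.get)
        gap = loads[hot] - loads[cold]
        if gap <= tolerance * max(loads[hot], 1e-9):
            break  # balanced enough; avoid needless data movement
        # Only moves with 0 < load < gap strictly shrink the hot/cold gap.
        candidates = sorted(p for p in assignment[hot]
                            if 0 < partition_load[p] < gap)
        if not candidates:
            break
        best = min(candidates, key=lambda p: abs(partition_load[p] - gap / 2))
        assignment[hot].remove(best)
        assignment[cold].add(best)
        moves.append((best, hot, cold))
    return moves


assignment = {0: {"a", "b", "c", "d"}, 1: set()}
partition_load = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
moves = rebalance(assignment, partition_load)
print(moves, broker_loads(assignment, partition_load))
```

Capping the number of moves per round and stopping within a tolerance reflects the trade-off the section describes: each reassignment has a cost, so the balancer should improve utilization without causing disruption of its own.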


The innovations implemented in the Kafka engine redesign offer valuable insights for building highly scalable cloud-native applications. By leveraging logical ‘cells’ for scalability and isolation, balancing storage types, using abstractions to unify the multicloud experience, automating mitigation loops, and balancing stateful services for performance and efficiency, organizations can achieve optimal performance, availability, and cost-efficiency in their cloud-native infrastructure projects.

In conclusion, I believe that the innovations introduced in the Kafka engine redesign have significantly contributed to the development of highly scalable and efficient cloud-native applications. These innovations have paved the way for the future of cloud computing by providing organizations with the tools they need to build robust and reliable cloud-native infrastructure.

Additionally, a small-scale experiment comparing Kora to open-source Kafka demonstrated significant performance improvements. Kora provided 30x faster scaling, more than 10x higher availability, and significantly lower latency compared to self-managed Kafka. While Kafka remains an excellent option for running open-source data streaming systems, Kora offers a superior cloud-native experience, making it an ideal choice for organizations looking to build highly scalable and efficient cloud-native applications.
