In the previous post, we covered partition resizing: when to increase topic partitions, what happens to existing records, and how partition changes affect producers and consumers.
This follow-up focuses on the other side of Kafka scaling: Kafka cluster resizing. In this post, Kafka cluster resizing refers to supported capacity operations such as scaling out the cluster by adding brokers or scaling up supported broker resources. It does not imply customer-driven cluster scale-in, broker scale-down, or choosing specific brokers to remove.
If partitions are Kafka’s unit of parallelism, brokers are the infrastructure that stores and serves those partitions. In OCI Streaming with Apache Kafka, a broker is a Kafka server that stores data and processes client requests. You can create and update clusters and brokers through the OCI Console, API, or CLI, and clusters can have up to 30 brokers depending on the workload and cluster type. Reference: OCI Streaming with Apache Kafka concepts
Remember this mental model:
Partitions scale concurrency.
Brokers scale capacity.
Kafka cluster resizing is the right tool when the cluster needs more infrastructure capacity, better fault-tolerance headroom, or more room for growth. On OCI Streaming with Apache Kafka, hereafter OCI Managed Kafka, this primarily means scaling out the cluster or scaling up supported broker resources. Adding brokers does not automatically solve every Kafka scaling problem. In particular, adding brokers does not necessarily redistribute existing partition replicas across the new brokers without a balancing or reassignment step.
This post explains when to resize the Kafka cluster, what impact to expect, and how to approach the change safely in OCI Managed Kafka.
Kafka brokers: a quick refresher
A Kafka cluster is made up of brokers. Topics are split into partitions, and each partition has one or more replicas. Those replicas live on brokers.
Cluster
Broker 1
orders-0 leader
payments-1 follower
Broker 2
orders-1 leader
orders-0 follower
Broker 3
payments-1 leader
orders-1 follower
For each partition, one replica is the leader and the others are followers. Producers write to the leader replica. Followers replicate data from the leader. If a leader broker fails, Kafka can elect a new leader from an in-sync replica.
Replica counts add up quickly. For example:
12 partitions
replication factor 3
12 partition logs
36 total partition replicas (12 × 3)
Those replicas must be stored and served by brokers. As throughput, retention, and topic count grow, the broker layer eventually becomes the scaling boundary.
Kafka cluster resizing versus partition resizing
Partition resizing and Kafka cluster resizing solve different problems.
| Scaling action | Primary purpose | Typical reason |
| Increase partitions | Add application parallelism | More consumer or producer concurrency |
| Scale out the cluster | Add cluster capacity | More CPU, network, disk, storage, or broker-level headroom |
| Scale up supported broker resources | Increase per-broker compute | Existing broker count is appropriate, but brokers need more compute |
| Expand storage or adjust retention | Add storage headroom or reduce retention pressure | Retention or data volume is growing |
| Reassign replicas | Redistribute data and load | New brokers were added or existing brokers are imbalanced |
| Rebalance leaders | Redistribute request load | Some brokers host too many partition leaders |
Increasing partitions adds more logical lanes. Scaling out the cluster adds more infrastructure to run those lanes.
If the consumer group has idle instances because there are fewer partitions than consumers, that is a parallelism problem, and resizing the Kafka cluster probably will not help. If the cluster is running out of disk, CPU, or network capacity, that is a broker capacity problem, and adding partitions may make things worse.
When should you resize the Kafka cluster?
Consider resizing the Kafka cluster when the bottleneck is cluster capacity, not application parallelism.
- Broker CPU is consistently high.
- Broker network throughput is close to expected limits.
- Broker disk usage is growing too quickly.
- Broker disk I/O is saturated.
- Produce or fetch latency is increasing.
- Replication is falling behind.
- ISR shrink or expand events are frequent.
- Under-replicated partitions appear.
- Leader distribution is uneven.
- A broker is much hotter than the rest.
- Retention requirements are increasing.
- Topic count or partition count is growing.
OCI Streaming with Apache Kafka enforces broker disk quotas by default to help protect cluster stability. By default, producer operations are rate-limited when broker disk reaches 97% capacity, and producer operations are blocked when broker disk reaches 98% capacity while consumer operations can continue.
Reference: OCI Streaming with Apache Kafka concepts
That makes disk headroom a practical operational trigger. If brokers are approaching disk quota thresholds, scaling out the cluster, scaling up supported broker resources, or adjusting retention/storage strategy should be considered before producers are impacted.
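As a rough illustration, the default thresholds described above can be turned into a simple per-broker check. This is a minimal sketch: the 97% and 98% values come from the defaults stated in this post, while the function name and the 80% early-warning level are assumptions chosen for the example, not service settings.

```python
RATE_LIMIT_PCT = 97.0  # default: producer operations are rate-limited
BLOCK_PCT = 98.0       # default: producer operations are blocked


def producer_impact(disk_used_pct: float, warn_pct: float = 80.0) -> str:
    """Classify one broker's disk usage against the default quota thresholds.

    warn_pct is a hypothetical early-warning level, not a service default.
    """
    if disk_used_pct >= BLOCK_PCT:
        return "blocked"
    if disk_used_pct >= RATE_LIMIT_PCT:
        return "rate-limited"
    if disk_used_pct >= warn_pct:
        return "plan-capacity"
    return "ok"


for pct in (62.0, 85.5, 97.2, 98.0):
    print(pct, producer_impact(pct))
```

Acting at the early-warning level, rather than at 97%, leaves time to scale out, expand storage, or adjust retention before producers are affected.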
Scale up or scale out?
Scale out: Add more brokers to increase aggregate cluster capacity.
Scale up: Increase supported resources per broker, such as OCPU per broker.
Storage expansion: Increase broker storage or adjust retention/storage configuration where supported.
Scale-in or scale-down: Customer-driven broker removal, cluster scale-in, and scale-down are not currently supported in OCI Managed Kafka. Customers do not choose which broker to remove, so this blog focuses on supported scale-out and scale-up operations.
In OCI Managed Kafka, use the supported cluster update workflows in the OCI Console, CLI, or API to scale out the cluster or update supported broker resources such as OCPU per broker. Reference: Updating an OCI Streaming with Apache Kafka cluster
| Situation | Better fit |
| All brokers are moderately loaded, but total cluster throughput needs to grow | Scale out the cluster |
| A few brokers are hot because leaders or replicas are uneven | Rebalance leaders or replicas |
| Brokers are CPU-bound but storage and partition distribution are healthy | Scale up supported broker resources |
| Disk usage is high because retention or data volume is growing | Scale out, expand storage where supported, or adjust retention |
| You need more fault-domain or availability-domain spread | Scale out, if supported by the cluster design |
| Partition replica count per broker is too high | Scale out and redistribute replicas |
| Consumer lag is high but brokers are healthy and consumers are limited by partition count | Increase partitions instead |
Scale out the cluster when it needs more aggregate capacity. Add partitions when applications need more parallelism. Reassign replicas when capacity exists but load is not evenly using it.
What happens when you add brokers?
Adding brokers increases available cluster capacity, but it is important to understand what changes immediately and what may require follow-up work.
What changes
New brokers become available for Kafka to use. They can host partition replicas and leaders. They can also be used for new topics, new partitions, and future reassignment or balancing operations.
What may not change automatically
Existing partition replicas might not automatically move to the new brokers. Kafka’s partition reassignment tool is used to move partition replicas across brokers, including when expanding an existing cluster.
Reference: Apache Kafka basic operations: partition reassignment
Adding brokers creates capacity. Replica reassignment or balancing is what moves existing data and load onto that capacity.
Without redistribution, the new brokers may remain lightly used while existing brokers continue carrying most of the workload.
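To make this concrete, the sketch below counts replicas per broker for the small three-broker layout from the refresher section, after two brokers are added and nothing is reassigned. The broker IDs 4 and 5 and the function name are assumptions for illustration only.

```python
from collections import Counter


def replicas_per_broker(assignment, broker_ids):
    """Count partition replicas hosted on each broker, including empty brokers."""
    counts = Counter({broker: 0 for broker in broker_ids})
    for replicas in assignment.values():
        counts.update(replicas)
    return dict(sorted(counts.items()))


# Replica placement from the refresher example (leader listed first).
assignment = {
    "orders-0": [1, 2],
    "orders-1": [2, 3],
    "payments-1": [3, 1],
}

# Hypothetical scale-out from 3 to 5 brokers with no reassignment afterward.
print(replicas_per_broker(assignment, broker_ids=[1, 2, 3, 4, 5]))
# {1: 2, 2: 2, 3: 2, 4: 0, 5: 0}
```

The new brokers hold no replicas until new topics, new partitions, or a reassignment place data on them.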
Broker capacity scaling implications
Scaling broker capacity is generally an online operation, but it can still affect the workload. The impact depends on what is being changed and whether partition replicas are moved afterward.
| Area | Impact |
| Existing records | Remain available during normal operation |
| Existing topics | Continue to exist |
| Producers and consumers | Typically continue running, but may see latency changes during balancing or reassignment |
| New brokers | Add capacity, but may need replica/leader balancing to become useful |
| Replica reassignment | Moves data across brokers and can increase network, disk, and replication load; should be reviewed, throttled, and phased on busy clusters |
| Leader balancing | Can shift produce/fetch request load across brokers |
| Consumer groups | Usually not directly affected by adding brokers, but may be affected by related topic or partition changes |
| Monitoring | Should be watched closely during and after the operation |
The largest operational risk is usually not the act of adding brokers. It is the data movement that may follow. Replica reassignment copies partition data between brokers. That can consume network, disk I/O, and replication bandwidth. On busy clusters, review the assignment plan carefully, throttle the movement, and proceed slowly rather than applying a broad movement plan all at once.
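A back-of-the-envelope estimate helps when sizing a reassignment window. The sketch below divides the data to be copied by the replication throttle. The 600 GiB and 50 MiB/s figures are made-up inputs, and the result is only a lower bound, because the throttle is a ceiling and live produce traffic competes for the same bandwidth.

```python
def estimated_transfer_seconds(bytes_to_move: int, throttle_bytes_per_sec: int) -> float:
    """Lower-bound duration of a reassignment at a fixed replication throttle."""
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle must be a positive number of bytes per second")
    return bytes_to_move / throttle_bytes_per_sec


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical batch: 600 GiB of replica data moved at a 50 MiB/s throttle.
seconds = estimated_transfer_seconds(600 * GIB, 50 * MIB)
print(seconds)                    # 12288.0
print(round(seconds / 3600, 1))   # 3.4 hours
```

If the estimate does not fit the change window, a smaller batch is usually safer than a higher throttle on a busy cluster.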
Broker capacity scaling does not replace partition planning
Adding brokers does not automatically increase the maximum parallelism of a consumer group. Consumer group parallelism is still bounded by the number of partitions in the topics that the group consumes, because each partition is assigned to at most one consumer in the group at a time.
Topic: orders
Partitions: 8
Consumer group instances: 20
Brokers: 6
Even with 6 brokers, only up to 8 consumers in that group can actively consume from the 8 partitions.
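The arithmetic behind that example can be written as a short sketch. The function name is an assumption for illustration. Note that the broker count is not an input at all.

```python
def active_and_idle_consumers(partitions: int, consumers: int) -> tuple[int, int]:
    """Return (active, idle) consumers for one group reading one topic.

    Each partition is owned by at most one consumer in the group,
    so the active count can never exceed the partition count.
    """
    active = min(partitions, consumers)
    return active, consumers - active


# The orders example above: 8 partitions, 20 consumer instances.
print(active_and_idle_consumers(partitions=8, consumers=20))  # (8, 12)
```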
For many scaling events, you may need a combination of steps:
1. Add brokers to create more cluster capacity.
2. Reassign replicas or rebalance leaders to use the new capacity.
3. Increase partitions only if application parallelism is still constrained.
Best practices before resizing the Kafka cluster
1. Identify the bottleneck
Review whether the issue is CPU, network, disk capacity, disk I/O, replication lag, uneven leader placement, uneven replica placement, too many partition replicas per broker, or application-level consumer parallelism.
If the bottleneck is consumer parallelism, resizing the Kafka cluster may not solve it. If the bottleneck is broker capacity, increasing partitions may not solve it.
2. Capture the current cluster state
Broker count
OCPU per broker
Storage per broker
Topic count
Partition count
Replication factor
Total partition replica count
Leader count per broker
Replica count per broker
Disk usage per broker
CPU and network usage per broker
Under-replicated partitions
Offline partitions
ISR health
Produce and fetch latency
Consumer lag for critical groups
Total partition replicas = total partitions × replication factor
Example:
500 partitions × replication factor 3 = 1,500 partition replicas
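The same formula as a small sketch, extended to show how the average replica count per broker changes with broker count. The broker counts of 3 and 6 are assumptions for illustration, and the average only holds if replicas are evenly distributed.

```python
def total_partition_replicas(partitions: int, replication_factor: int) -> int:
    """Total replicas the brokers must store: partitions x replication factor."""
    return partitions * replication_factor


def average_replicas_per_broker(partitions: int, replication_factor: int, brokers: int) -> float:
    """Average replicas per broker, assuming an even distribution."""
    return total_partition_replicas(partitions, replication_factor) / brokers


print(total_partition_replicas(500, 3))  # 1500
for brokers in (3, 6):
    print(brokers, average_replicas_per_broker(500, 3, brokers))
# 3 500.0
# 6 250.0
```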
3. Review availability requirements
For production workloads, use a high availability cluster design. OCI Streaming with Apache Kafka high availability clusters are intended for production environments and use a minimum of 3 brokers distributed across multiple availability or fault domains for redundancy and fault tolerance. Reference: OCI Streaming with Apache Kafka concepts
Also review topic-level replication settings and in-sync replica requirements. Cluster configuration supports properties such as default.replication.factor, min.insync.replicas, num.replica.fetchers, and unclean.leader.election.enable. Reference: OCI Streaming with Apache Kafka cluster configuration
4. Check network and subnet readiness
If the update involves networking changes or subnets, make sure the target subnets have enough available IP addresses and that the required IAM policies are in place. When a cluster uses several subnets, subnet updates through the API and CLI require enough available IP addresses in those subnets. Subnet updates are also full replacement operations: the list you provide replaces the existing configuration, so it needs to include every subnet the cluster should keep. Reference: Updating an OCI Streaming with Apache Kafka cluster
5. Plan for post-resize balancing
Before adding brokers, decide how the new capacity will be used.
- Will new topics and new partitions naturally use the new brokers?
- Do existing topics need replica reassignment?
- Are leaders unevenly distributed?
- Are some brokers disk-heavy while others are light?
- Do specific high-volume topics need to be moved?
- Should balancing be phased topic by topic?
Do not assume that new brokers will immediately reduce load on existing brokers.
Steps to resize the Kafka cluster in OCI Streaming with Apache Kafka
The exact steps depend on whether you use the OCI Console, CLI, API, or automation. The following runbook gives a practical flow for supported scale-out and scale-up operations.
Step 1: Confirm the scaling goal
Start with the reason for the change. Examples include adding brokers to increase aggregate throughput capacity, increasing storage headroom, increasing OCPU per broker to address CPU saturation, adding brokers before increasing topic partition count, or adding brokers before onboarding new high-volume workloads. Also confirm success criteria such as lower broker CPU, disk usage below target threshold, reduced produce or fetch latency, cleared under-replicated partitions, and more even leader or replica distribution.
Step 2: Capture a pre-change baseline
Record broker count, OCPU per broker, storage per broker, topic count, total partitions, total partition replicas, per-broker CPU, per-broker network, per-broker disk usage, under-replicated partitions, offline partitions, ISR health, critical consumer group lag, application error rate, and produce/fetch latency.
Step 3: Review cluster and broker limits
Confirm the target broker count and shape are within service limits. Also check tenancy-level limits or quotas that apply to the environment before scheduling the change. Reference: OCI Streaming with Apache Kafka concepts
Step 4: Update the cluster
You can update the cluster through the OCI Console, CLI, or API. Use the supported update options for scaling out the cluster and updating broker resources, such as OCPU per broker and other supported settings. Customer-driven broker removal, cluster scale-in, and scale-down are not currently supported in OCI Managed Kafka. Reference: Updating an OCI Streaming with Apache Kafka cluster
oci kafka cluster update \
--kafka-cluster-id <kafka-cluster-ocid>
Step 5: Wait for the cluster update to complete
During the update, monitor cluster lifecycle state, broker availability, broker CPU, broker disk usage, broker network throughput, under-replicated partitions, offline partitions, ISR shrink or expand events, produce/fetch latency, and application error rate. Avoid stacking multiple major changes at once.
Step 6: Verify the new broker capacity
After the update completes, confirm the expected broker count, expected OCPU per broker, expected storage configuration, broker health, cluster health, no unexpected offline partitions, no persistent under-replicated partitions, and client connectivity.
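The partition checks in this step can be sketched as follows. The snapshot structure is hypothetical: it assumes you have already collected each partition's leader, replica list, and in-sync replica list from your own tooling, and the broker IDs are invented for the example.

```python
def partition_health(partitions):
    """Split partitions into under-replicated and offline lists.

    Offline: no current leader. Under-replicated: fewer in-sync
    replicas than assigned replicas.
    """
    under_replicated, offline = [], []
    for p in partitions:
        name = f'{p["topic"]}-{p["partition"]}'
        if p["leader"] is None:
            offline.append(name)
        elif len(p["isr"]) < len(p["replicas"]):
            under_replicated.append(name)
    return under_replicated, offline


# Hypothetical snapshot taken after a scale-out.
snapshot = [
    {"topic": "orders", "partition": 0, "leader": 1, "replicas": [1, 2, 4], "isr": [1, 2, 4]},
    {"topic": "orders", "partition": 1, "leader": 2, "replicas": [2, 3, 5], "isr": [2, 3]},
    {"topic": "payments", "partition": 1, "leader": None, "replicas": [3, 1, 4], "isr": []},
]
print(partition_health(snapshot))  # (['orders-1'], ['payments-1'])
```

A brief under-replicated state can be normal while replicas catch up, so the useful signal is whether the list stays non-empty across repeated checks.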
Step 7: Decide whether replica reassignment is needed
If the purpose of adding brokers is to relieve existing broker load, you may need to redistribute partition replicas. Apache Kafka’s partition reassignment tool can move partitions across brokers. Because generated reassignment plans can be broad, review the current partition assignment and the proposed movement before execution. Reference: Apache Kafka basic operations: partition reassignment
1. Identify topics or partitions to move.
2. Capture the current partition and replica assignment.
3. Generate or prepare a reassignment plan.
4. Review the plan for unnecessary or excessive movement.
5. Execute a small, controlled reassignment batch.
6. Throttle reassignment traffic based on cluster workload.
7. Verify completion.
8. Monitor broker and application health before proceeding with more moves.
# 1. Generate a candidate reassignment plan for review (nothing is moved yet)
kafka-reassign-partitions.sh \
  --bootstrap-server <bootstrap-endpoint> \
  --command-config <client.properties> \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "<target-broker-ids>" \
  --generate

# 2. Execute the reviewed plan saved in reassignment.json
kafka-reassign-partitions.sh \
  --bootstrap-server <bootstrap-endpoint> \
  --command-config <client.properties> \
  --reassignment-json-file reassignment.json \
  --execute

# 3. Verify that the reassignment has completed
kafka-reassign-partitions.sh \
  --bootstrap-server <bootstrap-endpoint> \
  --command-config <client.properties> \
  --reassignment-json-file reassignment.json \
  --verify
Important: Replica reassignment moves data. It can increase network traffic, disk I/O, and replication load. Kafka’s generic reassignment tooling can generate a broad movement plan, which may cause a cluster-wide shuffle of data if applied without review. For high-workload clusters, use a controlled window, throttle movement based on current traffic, and proceed slowly, moving only a few partitions at a time where possible.
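One way to keep movement small is to split a reviewed plan into batches before executing anything. The sketch below builds the topics-to-move input and splits a plan using the JSON layout that Apache Kafka's reassignment tool reads. The topic name, broker IDs, and batch size are hypothetical, and the helper names are not part of any Kafka tooling.

```python
import json


def topics_to_move(topics):
    """Build the content of topics-to-move.json for the generate step."""
    return {"version": 1, "topics": [{"topic": t} for t in topics]}


def split_plan(plan, batch_size):
    """Split a reassignment plan into smaller plans of at most batch_size partitions."""
    parts = plan["partitions"]
    return [
        {"version": plan["version"], "partitions": parts[i:i + batch_size]}
        for i in range(0, len(parts), batch_size)
    ]


# Hypothetical reviewed plan: five partitions of an "orders" topic moving to brokers 4-6.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "orders", "partition": n, "replicas": [4, 5, 6]} for n in range(5)
    ],
}

print(json.dumps(topics_to_move(["orders"])))
# {"version": 1, "topics": [{"topic": "orders"}]}

batches = split_plan(plan, batch_size=2)
print([len(b["partitions"]) for b in batches])  # [2, 2, 1]
```

Each batch would be written to its own file, executed, and verified before the next one starts, with broker and application metrics checked in between.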
Step 8: Review leader distribution
Even after replicas are distributed, partition leaders may still be unevenly placed. Since partition leaders handle most client request traffic for their partitions, leader imbalance can create hot brokers.
Leader count per broker
Produce request distribution
Fetch request distribution
CPU per broker
Network per broker
Hot topics and hot partitions
If leadership is uneven, use the appropriate Kafka leader election or balancing workflow for your environment.
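A quick way to see leader skew is to count leaders per broker from topic describe output. This sketch assumes the per-partition line layout printed by `kafka-topics.sh --describe` (fields such as `Partition:`, `Leader:`, `Replicas:`, `Isr:`). The sample text is invented, and partitions without a numeric leader are simply skipped.

```python
import re
from collections import Counter

# Matches per-partition lines only; the topic summary line uses "PartitionCount:".
LEADER_RE = re.compile(r"Partition:\s*\d+\s+Leader:\s*(\d+)")


def leaders_per_broker(describe_output: str) -> dict:
    """Count partition leaders per broker ID from describe output text."""
    counts = Counter()
    for line in describe_output.splitlines():
        match = LEADER_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return dict(sorted(counts.items()))


# Invented sample output for illustration.
sample = """\
Topic: orders\tPartitionCount: 3\tReplicationFactor: 2
\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2\tIsr: 1,2
\tTopic: orders\tPartition: 1\tLeader: 1\tReplicas: 1,3\tIsr: 1,3
\tTopic: orders\tPartition: 2\tLeader: 2\tReplicas: 2,3\tIsr: 2,3
Topic: payments\tPartitionCount: 1\tReplicationFactor: 2
\tTopic: payments\tPartition: 0\tLeader: 1\tReplicas: 1,2\tIsr: 1,2
"""
print(leaders_per_broker(sample))  # {1: 3, 2: 1}
```

In this sample, broker 3 hosts replicas but leads nothing, which is the kind of imbalance that leader count alone can reveal. Leader count is only a proxy, so weigh it against per-partition traffic for hot topics.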
Step 9: Monitor after scaling and balancing
Broker CPU
Broker network
Broker disk usage
Disk I/O
Produce latency
Fetch latency
Request error rate
Under-replicated partitions
Offline partitions
ISR shrink or expand events
Consumer lag
Application latency
Application error rate
OCI Monitoring and OCI Logging can be used with OCI Streaming with Apache Kafka for cluster metrics and cluster-level logs. Reference: OCI Streaming with Apache Kafka overview
Common Kafka cluster resizing patterns
Pattern 1: Add brokers before onboarding a large workload
Use this when a new application or tenant will significantly increase throughput or retention.
1. Estimate expected produce, fetch, and retention growth.
2. Add brokers or increase broker resources.
3. Verify new capacity is healthy.
4. Reassign replicas if existing load needs to be redistributed.
5. Create new topics or increase partitions as needed.
6. Onboard traffic gradually.
7. Monitor broker and application metrics.
Pattern 2: Add brokers after disk growth
Use this when broker disk usage is rising faster than expected.
1. Check retention settings and topic growth rate.
2. Confirm whether data growth is expected.
3. Add brokers or storage capacity.
4. Reassign replicas to spread storage.
5. Review retention and compaction policies.
6. Monitor disk usage trend after the change.
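For the first two steps of this pattern, a simple linear projection gives a sense of urgency. The sketch below estimates the days remaining before a broker reaches the default 97% rate-limit threshold. The capacity, usage, and growth figures are invented, and real growth is rarely linear once retention starts deleting old segments.

```python
def days_until_threshold(capacity_gb: float, used_gb: float,
                         growth_gb_per_day: float, threshold_pct: float = 97.0) -> float:
    """Days until disk usage reaches threshold_pct, assuming constant daily growth."""
    if growth_gb_per_day <= 0:
        return float("inf")  # usage is flat or shrinking
    remaining_gb = capacity_gb * threshold_pct / 100 - used_gb
    return max(remaining_gb, 0.0) / growth_gb_per_day


# Hypothetical broker: 1,000 GB disk, 700 GB used, growing 15 GB per day.
print(days_until_threshold(1000, 700, 15))  # 18.0
```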
Pattern 3: Add brokers before increasing partitions
Use this when applications need more partitions but the current brokers do not have enough headroom.
1. Add brokers or increase broker resources.
2. Rebalance replicas or leaders if needed.
3. Confirm broker headroom.
4. Increase topic partitions.
5. Monitor consumer groups, producer distribution, and broker health.
Pattern 4: Broker hot spot due to leader imbalance
Use this when one broker has much higher CPU or network traffic than peers.
1. Check leader count and high-volume partition leaders.
2. Confirm whether replica placement is balanced.
3. Rebalance leaders if replicas already exist on suitable brokers.
4. Reassign replicas only if leader movement alone is not enough.
5. Monitor request distribution and latency.
Mistakes to avoid
Mistake 1: Resizing the cluster but not redistributing load
Adding brokers creates capacity, but existing partitions may still live mostly on old brokers. Plan for replica or leader balancing when the goal is to relieve current load.
Mistake 2: Using broker capacity scaling to fix consumer parallelism
If a topic has 6 partitions and the consumer group has 20 consumers, adding brokers does not let all 20 consumers actively process that topic. Increase partitions if the workload can tolerate the ordering and rebalance implications.
Mistake 3: Moving too much data at once
Large reassignment operations can put heavy pressure on network and disk. Do not apply a broad generated reassignment plan blindly. Review the current assignment, throttle movement, and phase large moves gradually, especially on high-throughput clusters.
Mistake 4: Ignoring disk thresholds
Disk pressure can lead to producer throttling or blocking based on broker disk quota behavior. Plan capacity before reaching critical thresholds.
Mistake 5: Assuming scale-in or scale-down is supported
OCI Managed Kafka does not currently support customer-driven broker removal, cluster scale-in, or scale-down. Customers should plan capacity with expected growth, retention, replication factor, and operational headroom in mind.
Key takeaway
Kafka cluster resizing in OCI Streaming with Apache Kafka is primarily a capacity operation. Use it when the cluster needs more compute, network, disk, storage, or fault-tolerance headroom.
Adding brokers does not automatically increase consumer parallelism, and it may not automatically redistribute existing partition replicas. For many production scaling events, the complete workflow is:
1. Identify the bottleneck.
2. Scale out the cluster or scale up supported broker resources.
3. Verify cluster health.
4. Reassign replicas or rebalance leaders if needed.
5. Monitor broker, Kafka, and application metrics.
6. Increase partitions only if application parallelism still requires it.
Used carefully, broker capacity scaling gives OCI Managed Kafka users more room for growth and more operational headroom. The key is to scale for the right reason, plan any follow-up balancing work, and validate the result with metrics rather than assuming added capacity is automatically carrying traffic.
Final checklist
Before resizing the Kafka cluster
- Confirm the bottleneck is broker capacity, not partition-level parallelism.
- Capture broker CPU, network, disk, latency, ISR, and consumer lag baselines.
- Review topic count, partition count, and total replica count.
- Check service limits, quotas, and target broker count.
- Validate subnet and network readiness if applicable.
- Plan whether replica or leader balancing is needed after the resize.
- Choose a controlled window for critical workloads.
During the scaling operation
- Update broker count or broker resources through supported OCI Console, CLI, API, or automation workflows.
- Monitor cluster lifecycle state.
- Watch broker health, under-replicated partitions, offline partitions, and application errors.
- Avoid stacking unrelated major changes at the same time.
After resizing the Kafka cluster
- Verify expected broker count and broker health.
- Check whether new brokers are carrying replicas and leaders.
- Reassign replicas if existing load needs to move.
- Rebalance leaders if request load is uneven.
- Monitor CPU, network, disk, latency, ISR, and consumer lag.
- Validate that the original bottleneck improved.
Broker capacity scaling is one part of a broader Kafka scaling strategy. Combine it with thoughtful partition planning, retention management, replication design, and workload monitoring to keep streaming applications reliable as they grow.
