In the previous post, we covered partition resizing: when to increase topic partitions, what happens to existing records, and how partition changes affect producers and consumers.

This follow-up focuses on the other side of Kafka scaling: Kafka cluster resizing. In this post, Kafka cluster resizing refers to supported capacity operations such as scaling out the cluster by adding brokers or scaling up supported broker resources. It does not imply customer-driven cluster scale-in, broker scale-down, or choosing specific brokers to remove.

If partitions are Kafka’s unit of parallelism, brokers are the infrastructure that stores and serves those partitions. In OCI Streaming with Apache Kafka, a broker is a Kafka server that stores data and processes client requests. You can create and update clusters and brokers through the OCI Console, API, or CLI, and clusters can have up to 30 brokers depending on the workload and cluster type. Reference: OCI Streaming with Apache Kafka concepts

Remember this mental model:

Partitions scale concurrency.
Brokers scale capacity.

Kafka cluster resizing is the right tool when the cluster needs more infrastructure capacity, better fault-tolerance headroom, or more room for growth. On OCI Streaming with Apache Kafka, hereafter OCI Managed Kafka, this primarily means scaling out the cluster or scaling up supported broker resources. Adding brokers does not automatically solve every Kafka scaling problem. In particular, adding brokers does not necessarily redistribute existing partition replicas across the new brokers without a balancing or reassignment step.

This post explains when to resize the Kafka cluster, what impact to expect, and how to approach the change safely in OCI Managed Kafka.

Kafka brokers: a quick refresher

A Kafka cluster is made up of brokers. Topics are split into partitions, and each partition has one or more replicas. Those replicas live on brokers.

Cluster
  Broker 1
    orders-0 leader
    payments-1 follower

  Broker 2
    orders-1 leader
    orders-0 follower

  Broker 3
    payments-1 leader
    orders-1 follower

For each partition, one replica is the leader and the others are followers. Producers write to the leader replica. Followers replicate data from the leader. If a leader broker fails, Kafka can elect a new leader from an in-sync replica.

For example, a topic with 12 partitions and replication factor 3 has:

12 partition logs
36 total partition replicas (12 × 3)

Those replicas must be stored and served by brokers. As throughput, retention, and topic count grow, the broker layer eventually becomes the scaling boundary.

Kafka cluster resizing versus partition resizing

Partition resizing and Kafka cluster resizing solve different problems.

| Scaling action | Primary purpose | Typical reason |
| --- | --- | --- |
| Increase partitions | Add application parallelism | More consumer or producer concurrency |
| Scale out the cluster | Add cluster capacity | More CPU, network, disk, storage, or broker-level headroom |
| Scale up supported broker resources | Increase per-broker compute | Existing broker count is appropriate, but brokers need more compute |
| Expand storage or adjust retention | Add storage headroom or reduce retention pressure | Retention or data volume is growing |
| Reassign replicas | Redistribute data and load | New brokers were added or existing brokers are imbalanced |
| Rebalance leaders | Redistribute request load | Some brokers host too many partition leaders |

Increasing partitions adds more logical lanes. Scaling out the cluster adds more infrastructure to run those lanes.

If the consumer group has idle instances because there are fewer partitions than consumers, resizing the Kafka cluster probably will not help. That is a partition parallelism problem. If the cluster is running out of disk, CPU, or network capacity, that is a broker capacity problem, and adding partitions may make things worse.

When should you resize the Kafka cluster?

Consider resizing the Kafka cluster when the bottleneck is cluster capacity, not application parallelism.

  1. Broker CPU is consistently high.
  2. Broker network throughput is close to expected limits.
  3. Broker disk usage is growing too quickly.
  4. Broker disk I/O is saturated.
  5. Produce or fetch latency is increasing.
  6. Replication is falling behind.
  7. ISR shrink or expand events are frequent.
  8. Under-replicated partitions appear.
  9. Leader distribution is uneven.
  10. A broker is much hotter than the rest.
  11. Retention requirements are increasing.
  12. Topic count or partition count is growing.

OCI Streaming with Apache Kafka enforces broker disk quotas by default to help protect cluster stability. By default, producer operations are rate-limited when broker disk reaches 97% capacity, and producer operations are blocked when broker disk reaches 98% capacity while consumer operations can continue.

Reference: OCI Streaming with Apache Kafka concepts

That makes disk headroom a practical operational trigger. If brokers are approaching disk quota thresholds, scaling out the cluster, scaling up supported broker resources, or adjusting retention/storage strategy should be considered before producers are impacted.
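As a rough illustration of treating disk headroom as a trigger, the sketch below classifies per-broker disk usage against the default thresholds described above. The 97% and 98% values come from this post; the 80% planning threshold, the function name, and the sample broker figures are assumptions made for the example, not service defaults.

```python
# Default quota thresholds as described in this post.
RATE_LIMIT_PCT = 97.0   # producers are rate-limited at or above this usage
BLOCK_PCT = 98.0        # producers are blocked at or above this usage

# Hypothetical planning threshold: pick a value that gives you time to act.
HEADROOM_WARN_PCT = 80.0


def classify_disk_usage(used_pct: float) -> str:
    """Map a broker's disk usage percentage to an operational status."""
    if used_pct >= BLOCK_PCT:
        return "producers-blocked"
    if used_pct >= RATE_LIMIT_PCT:
        return "producers-rate-limited"
    if used_pct >= HEADROOM_WARN_PCT:
        return "plan-capacity"
    return "ok"


# Made-up per-broker disk usage, in percent.
brokers = {"broker-1": 62.5, "broker-2": 83.0, "broker-3": 97.4}
statuses = {name: classify_disk_usage(pct) for name, pct in brokers.items()}

for name, status in statuses.items():
    print(f"{name}: {status}")
```

The useful signal is the `plan-capacity` band: it is the window in which scaling or retention changes can still be made before producers are affected.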

Scale up or scale out?

Scale out: Add more brokers to increase aggregate cluster capacity.
Scale up: Increase supported resources per broker, such as OCPU per broker.
Storage expansion: Increase broker storage or adjust retention/storage configuration where supported.
Scale-in or scale-down: Customer-driven broker removal, cluster scale-in, and scale-down are not currently supported in OCI Managed Kafka. Customers do not choose which broker to remove, so this blog focuses on supported scale-out and scale-up operations.

In OCI Managed Kafka, use the supported cluster update workflows in the OCI Console, CLI, or API to scale out the cluster or update supported broker resources such as OCPU per broker. Reference: Updating an OCI Streaming with Apache Kafka cluster

| Situation | Better fit |
| --- | --- |
| All brokers are moderately loaded, but total cluster throughput needs to grow | Scale out the cluster |
| A few brokers are hot because leaders or replicas are uneven | Rebalance leaders or replicas |
| Brokers are CPU-bound but storage and partition distribution are healthy | Scale up supported broker resources |
| Disk usage is high because retention or data volume is growing | Scale out, expand storage where supported, or adjust retention |
| You need more fault-domain or availability-domain spread | Scale out, if supported by the cluster design |
| Partition replica count per broker is too high | Scale out and redistribute replicas |
| Consumer lag is high but brokers are healthy and consumers are limited by partition count | Increase partitions instead |

Scale out the cluster when it needs more aggregate capacity. Add partitions when applications need more parallelism. Reassign replicas when capacity exists but load is not evenly using it.
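The three rules above can be written down as a small triage function. This is a deliberately simplified sketch: the function name, the boolean inputs, and the order in which the checks run are assumptions for illustration, and a real decision would weigh the metrics behind each flag.

```python
def suggest_action(
    load_uneven: bool,
    brokers_saturated: bool,
    consumers_exceed_partitions: bool,
) -> str:
    """Return a first action to consider for a scaling problem.

    Uneven load is checked first because capacity that already exists
    should be put to use before more is added.
    """
    if load_uneven:
        return "reassign replicas or rebalance leaders"
    if brokers_saturated:
        return "scale out the cluster or scale up broker resources"
    if consumers_exceed_partitions:
        return "increase partitions"
    return "no resize needed"


# Healthy, evenly loaded brokers, but more consumers than partitions.
print(suggest_action(False, False, True))

# All brokers busy and evenly loaded.
print(suggest_action(False, True, False))
```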

What happens when you add brokers?

Adding brokers increases available cluster capacity, but it is important to understand what changes immediately and what may require follow-up work.

What changes

New brokers become available for Kafka to use. They can host partition replicas and leaders. They can also be used for new topics, new partitions, and future reassignment or balancing operations.

What may not change automatically

Existing partition replicas might not automatically move to the new brokers. Kafka’s partition reassignment tool is used to move partition replicas across brokers, including when expanding an existing cluster.

Reference: Apache Kafka basic operations: partition reassignment

Adding brokers creates capacity. Replica reassignment or balancing is what moves existing data and load onto that capacity.

Without redistribution, the new brokers may remain lightly used while existing brokers continue carrying most of the workload.
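One way to see this is to count replicas per broker before and after a scale-out. The sketch below assumes the partition assignment is available in the same shape as Kafka's reassignment JSON (a list of entries with `topic`, `partition`, and `replicas`); the function name and the sample data, which reuse the three-broker layout from the refresher with a hypothetical fourth broker added, are illustrative.

```python
def replicas_per_broker(assignment, broker_ids):
    """Count how many partition replicas each broker hosts.

    Brokers with no replicas are kept in the result with a count of 0,
    which is what makes an unused new broker visible.
    """
    counts = {broker: 0 for broker in broker_ids}
    for partition in assignment:
        for broker in partition["replicas"]:
            counts[broker] = counts.get(broker, 0) + 1
    return counts


# Existing layout on brokers 1 to 3; broker 4 was just added.
assignment = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2]},
    {"topic": "orders", "partition": 1, "replicas": [2, 3]},
    {"topic": "payments", "partition": 1, "replicas": [3, 1]},
]

counts = replicas_per_broker(assignment, broker_ids=[1, 2, 3, 4])
print(counts)
```

Broker 4 is part of the cluster but hosts no replicas until a new topic, a new partition, or a reassignment places data on it.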

Broker capacity scaling implications

Scaling broker capacity is generally an online operation, but it can still affect the workload. The impact depends on what is being changed and whether partition replicas are moved afterward.

| Area | Impact |
| --- | --- |
| Existing records | Remain available during normal operation |
| Existing topics | Continue to exist |
| Producers and consumers | Typically continue running, but may see latency changes during balancing or reassignment |
| New brokers | Add capacity, but may need replica/leader balancing to become useful |
| Replica reassignment | Moves data across brokers and can increase network, disk, and replication load; should be reviewed, throttled, and phased on busy clusters |
| Leader balancing | Can shift produce/fetch request load across brokers |
| Consumer groups | Usually not directly affected by adding brokers, but may be affected by related topic or partition changes |
| Monitoring | Should be watched closely during and after the operation |

The largest operational risk is usually not the act of adding brokers. It is the data movement that may follow. Replica reassignment copies partition data between brokers. That can consume network, disk I/O, and replication bandwidth. On busy clusters, review the assignment plan carefully, throttle the movement, and proceed slowly rather than applying a broad movement plan all at once.
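To get a feel for how long a throttled move might take, a back-of-the-envelope estimate is the data volume divided by the throttle rate. The sketch below is only that: it assumes the throttle is the limiting factor for the busiest destination broker and ignores new writes arriving during the move. The function name and figures are made up for the example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimated_hours(bytes_to_move: int, throttle_bytes_per_sec: int) -> float:
    """Rough lower bound on copy time for one broker at a given throttle."""
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle must be positive")
    return bytes_to_move / throttle_bytes_per_sec / 3600


# Hypothetical: 500 GiB of replicas landing on one new broker at 50 MiB/s.
hours = estimated_hours(500 * GIB, 50 * MIB)
print(f"about {hours:.1f} hours")
```

If the estimate runs to many hours, that is a signal to split the plan into smaller batches rather than to raise the throttle on a busy cluster.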

Broker capacity scaling does not replace partition planning

Adding brokers does not automatically increase the maximum parallelism of a consumer group. Consumer group parallelism is still bounded by the number of partitions in the topics that group consumes.

Topic: orders
Partitions: 8
Consumer group instances: 20
Brokers: 6

Even with 6 brokers, only up to 8 consumers in that group can actively consume from the 8 partitions.
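The arithmetic behind that example is a simple minimum. A minimal sketch, with a hypothetical function name:

```python
def consumer_utilization(partitions: int, consumers: int):
    """Return (active, idle) consumer counts for one topic and one group.

    Each partition is assigned to at most one consumer in a group, so the
    number of active consumers cannot exceed the partition count. Broker
    count does not appear in the calculation at all.
    """
    active = min(partitions, consumers)
    return active, consumers - active


active, idle = consumer_utilization(partitions=8, consumers=20)
print(f"active={active}, idle={idle}")
```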

For many scaling events, you may need a combination of these steps, in this order:

1. Add brokers to create more cluster capacity.
2. Reassign replicas or rebalance leaders to use the new capacity.
3. Increase partitions only if application parallelism is still constrained.

Best practices before resizing the Kafka cluster

1. Identify the bottleneck

Review whether the issue is CPU, network, disk capacity, disk I/O, replication lag, uneven leader placement, uneven replica placement, too many partition replicas per broker, or application-level consumer parallelism.

If the bottleneck is consumer parallelism, resizing the Kafka cluster may not solve it. If the bottleneck is broker capacity, increasing partitions may not solve it.

2. Capture the current cluster state

Record the following before the change:

Broker count
OCPU per broker
Storage per broker
Topic count
Partition count
Replication factor
Total partition replica count
Leader count per broker
Replica count per broker
Disk usage per broker
CPU and network usage per broker
Under-replicated partitions
Offline partitions
ISR health
Produce and fetch latency
Consumer lag for critical groups

Total partition replicas = total partitions × replication factor

Example:
500 partitions × replication factor 3 = 1,500 partition replicas
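When topics use different replication factors, the same calculation becomes a sum over topics. The sketch below also derives the average replica count per broker, which is a useful number to record in the baseline. The topic names, sizes, and function names are hypothetical.

```python
def total_partition_replicas(topics: dict) -> int:
    """Sum partitions × replication factor across topics.

    `topics` maps a topic name to a (partitions, replication_factor) pair.
    """
    return sum(partitions * rf for partitions, rf in topics.values())


def average_replicas_per_broker(topics: dict, broker_count: int) -> float:
    """Average replica count per broker if replicas were spread evenly."""
    return total_partition_replicas(topics) / broker_count


# Hypothetical topic inventory.
topics = {
    "orders": (12, 3),
    "payments": (6, 3),
    "audit": (4, 2),
}

total = total_partition_replicas(topics)
print(total)
print(round(average_replicas_per_broker(topics, broker_count=3), 1))
```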

3. Review availability requirements

For production workloads, use a high availability cluster design. OCI Streaming with Apache Kafka high availability clusters are intended for production environments and use a minimum of 3 brokers distributed across multiple availability or fault domains for redundancy and fault tolerance. Reference: OCI Streaming with Apache Kafka concepts

Also review topic-level replication settings and in-sync replica requirements. Cluster configuration supports properties such as default.replication.factor, min.insync.replicas, num.replica.fetchers, and unclean.leader.election.enable. Reference: OCI Streaming with Apache Kafka cluster configuration

4. Check network and subnet readiness

If the update involves networking changes or subnets, make sure the target subnets have enough available IPs and that required IAM policies are in place. When a cluster uses several subnets, subnet updates through API and CLI require enough available IPs, and subnet updates are full replacement operations where the provided list replaces the existing configuration. Reference: Updating an OCI Streaming with Apache Kafka cluster

5. Plan for post-resize balancing

Before adding brokers, decide how the new capacity will be used.

  1. Will new topics and new partitions naturally use the new brokers?
  2. Do existing topics need replica reassignment?
  3. Are leaders unevenly distributed?
  4. Are some brokers disk-heavy while others are light?
  5. Do we need to move specific high-volume topics?
  6. Should balancing be phased topic by topic?

Do not assume that new brokers will immediately reduce load on existing brokers.

Steps to resize the Kafka cluster in OCI Streaming with Apache Kafka

The exact steps depend on whether you use the OCI Console, CLI, API, or automation. The following runbook gives a practical flow for supported scale-out and scale-up operations.

Step 1: Confirm the scaling goal

Start with the reason for the change. Examples include adding brokers to increase aggregate throughput capacity, increasing storage headroom, increasing OCPU per broker to address CPU saturation, adding brokers before increasing topic partition count, or adding brokers before onboarding new high-volume workloads. Also confirm success criteria such as lower broker CPU, disk usage below target threshold, reduced produce or fetch latency, cleared under-replicated partitions, and more even leader or replica distribution.

Step 2: Capture a pre-change baseline

Record broker count, OCPU per broker, storage per broker, topic count, total partitions, total partition replicas, per-broker CPU, per-broker network, per-broker disk usage, under-replicated partitions, offline partitions, ISR health, critical consumer group lag, application error rate, and produce/fetch latency.

Step 3: Review cluster and broker limits

Confirm the target broker count and shape are within service limits. Also check tenancy-level limits or quotas that apply to the environment before scheduling the change. Reference: OCI Streaming with Apache Kafka concepts

Step 4: Update the cluster

You can update the cluster through the OCI Console, CLI, or API. Use the supported update options for scaling out the cluster and updating broker resources, such as OCPU per broker and other supported settings. Customer-driven broker removal, cluster scale-in, and scale-down are not currently supported in OCI Managed Kafka. Reference: Updating an OCI Streaming with Apache Kafka cluster

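# Skeleton only: add the update options for the change being made
# (run `oci kafka cluster update --help` for the supported parameters).
oci kafka cluster update \
  --kafka-cluster-id <kafka-cluster-ocid>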

Step 5: Wait for the cluster update to complete

During the update, monitor cluster lifecycle state, broker availability, broker CPU, broker disk usage, broker network throughput, under-replicated partitions, offline partitions, ISR shrink or expand events, produce/fetch latency, and application error rate. Avoid stacking multiple major changes at once.

Step 6: Verify the new broker capacity

After the update completes, confirm the expected broker count, expected OCPU per broker, expected storage configuration, broker health, cluster health, no unexpected offline partitions, no persistent under-replicated partitions, and client connectivity.

Step 7: Decide whether replica reassignment is needed

If the purpose of adding brokers is to relieve existing broker load, you may need to redistribute partition replicas. Apache Kafka’s partition reassignment tool can move partitions across brokers. Because generated reassignment plans can be broad, review the current partition assignment and the proposed movement before execution. Reference: Apache Kafka basic operations: partition reassignment

1. Identify topics or partitions to move.
2. Capture the current partition and replica assignment.
3. Generate or prepare a reassignment plan.
4. Review the plan for unnecessary or excessive movement.
5. Execute a small, controlled reassignment batch.
6. Throttle reassignment traffic based on cluster workload.
7. Verify completion.
8. Monitor broker and application health before proceeding with more moves.
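The `--generate` mode of the reassignment tool used in this step reads a small JSON file naming the topics to consider. As a sketch of step 1, the snippet below builds that document for a short, deliberately narrow topic list. The file layout follows the Apache Kafka documentation; the function name and topic names are placeholders.

```python
import json


def build_topics_to_move(topic_names):
    """Build the topics-to-move document for kafka-reassign-partitions.sh."""
    return {
        "version": 1,
        "topics": [{"topic": name} for name in topic_names],
    }


# Keep the list short so the generated plan stays small and reviewable.
doc = build_topics_to_move(["orders", "payments"])
print(json.dumps(doc, indent=2))
```

Saving the printed output as `topics-to-move.json` gives the input file for the generate command.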

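# 1. Generate a candidate plan. This prints the current assignment and a
#    proposed assignment; it does not move any data.
kafka-reassign-partitions.sh \
  --bootstrap-server <bootstrap-endpoint> \
  --command-config <client.properties> \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list "<target-broker-ids>" \
  --generate

# 2. Execute a reviewed plan. --throttle caps replication traffic
#    in bytes per second.
kafka-reassign-partitions.sh \
  --bootstrap-server <bootstrap-endpoint> \
  --command-config <client.properties> \
  --reassignment-json-file reassignment.json \
  --throttle <bytes-per-second> \
  --execute

# 3. Verify completion. Running --verify after the reassignment has
#    finished also removes the throttle.
kafka-reassign-partitions.sh \
  --bootstrap-server <bootstrap-endpoint> \
  --command-config <client.properties> \
  --reassignment-json-file reassignment.json \
  --verify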

Important: Replica reassignment moves data. It can increase network traffic, disk I/O, and replication load. Kafka’s generic reassignment tooling can generate a broad movement plan, which may cause a cluster-wide shuffle of data if applied without review. For high-workload clusters, use a controlled window, throttle movement based on current traffic, and proceed slowly, moving only a few partitions at a time where possible.
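One way to keep movement small is to post-process the generated plan before executing it: drop partitions whose replica list would not change, then split what remains into small batches. The sketch below assumes both the current and proposed assignments are in Kafka's reassignment JSON format. The helper names, batch size, and sample data are illustrative.

```python
def changed_partitions(current: dict, proposed: dict) -> list:
    """Keep only proposed entries whose replica list differs from today.

    Replica order is compared as-is, because the first replica is the
    preferred leader and a reordering is therefore a real change.
    """
    current_by_key = {
        (p["topic"], p["partition"]): p["replicas"]
        for p in current["partitions"]
    }
    return [
        p for p in proposed["partitions"]
        if current_by_key.get((p["topic"], p["partition"])) != p["replicas"]
    ]


def batches(partitions: list, size: int):
    """Yield reassignment documents containing at most `size` partitions."""
    for start in range(0, len(partitions), size):
        yield {"version": 1, "partitions": partitions[start:start + size]}


current = {"version": 1, "partitions": [
    {"topic": "orders", "partition": 0, "replicas": [1, 2]},
    {"topic": "orders", "partition": 1, "replicas": [2, 3]},
    {"topic": "payments", "partition": 1, "replicas": [3, 1]},
]}
proposed = {"version": 1, "partitions": [
    {"topic": "orders", "partition": 0, "replicas": [1, 2]},    # unchanged
    {"topic": "orders", "partition": 1, "replicas": [2, 4]},
    {"topic": "payments", "partition": 1, "replicas": [4, 1]},
]}

to_move = changed_partitions(current, proposed)
plan_batches = list(batches(to_move, size=1))
print(len(to_move), "partitions to move in", len(plan_batches), "batches")
```

Each batch can be written to its own file and run through execute and verify before the next one starts, which leaves room to check cluster health between moves.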

Step 8: Review leader distribution

Even after replicas are distributed, partition leaders may still be unevenly placed. Since partition leaders handle most client request traffic for their partitions, leader imbalance can create hot brokers. Check the following signals:

Leader count per broker
Produce request distribution
Fetch request distribution
CPU per broker
Network per broker
Hot topics and hot partitions

If leadership is uneven, use the appropriate Kafka leader election or balancing workflow for your environment.
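A quick way to quantify leader imbalance is to count leaders per broker and list the partitions whose current leader is not the preferred one, meaning the first broker in the replica list. The sketch below assumes partition metadata has already been collected into simple dictionaries. The field names, function name, and sample data are hypothetical.

```python
def leader_report(partitions, broker_ids):
    """Return (leaders per broker, partitions not led by their preferred replica)."""
    leaders = {broker: 0 for broker in broker_ids}
    not_preferred = []
    for p in partitions:
        leaders[p["leader"]] = leaders.get(p["leader"], 0) + 1
        # The preferred leader is the first broker in the replica list.
        if p["leader"] != p["replicas"][0]:
            not_preferred.append((p["topic"], p["partition"]))
    return leaders, not_preferred


# Made-up state after a broker restart: broker 1 leads everything.
partitions = [
    {"topic": "orders", "partition": 0, "leader": 1, "replicas": [1, 2]},
    {"topic": "orders", "partition": 1, "leader": 1, "replicas": [2, 1]},
    {"topic": "payments", "partition": 0, "leader": 1, "replicas": [3, 1]},
]

leaders, not_preferred = leader_report(partitions, broker_ids=[1, 2, 3])
print(leaders)
print(not_preferred)
```

When the replica lists themselves are balanced, as in this sample, moving leadership back to the preferred replicas is enough and no data has to be copied.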

Step 9: Monitor after scaling and balancing

Continue watching the following after the change:

Broker CPU
Broker network
Broker disk usage
Disk I/O
Produce latency
Fetch latency
Request error rate
Under-replicated partitions
Offline partitions
ISR shrink or expand events
Consumer lag
Application latency
Application error rate

OCI Monitoring and OCI Logging can be used with OCI Streaming with Apache Kafka for cluster metrics and cluster-level logs. Reference: OCI Streaming with Apache Kafka overview
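Comparing post-change metrics against the baseline and the success criteria from Step 1 can be as simple as the sketch below. The metric names, values, and targets are all made up for the example. In practice the numbers would come from whatever monitoring export you use.

```python
def evaluate(baseline: dict, after: dict, targets: dict) -> dict:
    """Compare each metric before and after against a maximum allowed value."""
    report = {}
    for metric, max_allowed in targets.items():
        report[metric] = {
            "before": baseline[metric],
            "after": after[metric],
            "met": after[metric] <= max_allowed,
        }
    return report


# Hypothetical values captured before and after the resize.
baseline = {"max_broker_cpu_pct": 88, "max_broker_disk_pct": 91, "under_replicated_partitions": 4}
after = {"max_broker_cpu_pct": 61, "max_broker_disk_pct": 72, "under_replicated_partitions": 0}

# Hypothetical success criteria: each metric must be at or below its target.
targets = {"max_broker_cpu_pct": 70, "max_broker_disk_pct": 75, "under_replicated_partitions": 0}

report = evaluate(baseline, after, targets)
for metric, result in report.items():
    print(metric, result)
```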

Common Kafka cluster resizing patterns

Pattern 1: Add brokers before onboarding a large workload

Use this when a new application or tenant will significantly increase throughput or retention.

1. Estimate expected produce, fetch, and retention growth.
2. Add brokers or increase broker resources.
3. Verify new capacity is healthy.
4. Reassign replicas if existing load needs to be redistributed.
5. Create new topics or increase partitions as needed.
6. Onboard traffic gradually.
7. Monitor broker and application metrics.
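For step 1, a first-order storage estimate is ingress rate × retention period × replication factor. The sketch below is a rough planning aid under simple assumptions: steady ingress, time-based retention, and no allowance for index overhead, compaction, or the headroom needed to stay under disk quota thresholds. The function name and workload numbers are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def required_storage_bytes(
    ingress_bytes_per_sec: int,
    retention_days: int,
    replication_factor: int,
) -> int:
    """Estimate cluster-wide storage for one workload at steady state."""
    return (
        ingress_bytes_per_sec
        * retention_days
        * SECONDS_PER_DAY
        * replication_factor
    )


# Hypothetical new workload: 20 MB/s, 7-day retention, replication factor 3.
total = required_storage_bytes(20_000_000, 7, 3)
per_broker = total / 6  # if spread evenly across 6 brokers

print(f"total: {total / 1e12:.2f} TB, per broker: {per_broker / 1e12:.2f} TB")
```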

Pattern 2: Add brokers after disk growth

Use this when broker disk usage is rising faster than expected.

1. Check retention settings and topic growth rate.
2. Confirm whether data growth is expected.
3. Add brokers or storage capacity.
4. Reassign replicas to spread storage.
5. Review retention and compaction policies.
6. Monitor disk usage trend after the change.
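To judge how urgent the situation is, a simple linear projection of days remaining before a threshold is reached can help. The sketch below assumes constant daily growth, which real workloads rarely follow exactly. The function name and figures are hypothetical, and the 97% threshold is the default rate-limit point mentioned earlier in this post.

```python
def days_until_threshold(
    capacity_gb: float,
    used_gb: float,
    daily_growth_gb: float,
    threshold_pct: float = 97.0,
):
    """Days until usage reaches the threshold, assuming linear growth.

    Returns None when usage is flat or shrinking, and 0.0 when the
    threshold has already been reached.
    """
    if daily_growth_gb <= 0:
        return None
    threshold_gb = capacity_gb * threshold_pct / 100
    remaining_gb = threshold_gb - used_gb
    return max(remaining_gb, 0.0) / daily_growth_gb


# Hypothetical broker: 1000 GB disk, 700 GB used, growing 15 GB per day.
days = days_until_threshold(capacity_gb=1000, used_gb=700, daily_growth_gb=15)
print(f"about {days:.0f} days until the threshold")
```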

Pattern 3: Add brokers before increasing partitions

Use this when applications need more partitions but the current brokers do not have enough headroom.

1. Add brokers or increase broker resources.
2. Rebalance replicas or leaders if needed.
3. Confirm broker headroom.
4. Increase topic partitions.
5. Monitor consumer groups, producer distribution, and broker health.

Pattern 4: Broker hot spot due to leader imbalance

Use this when one broker has much higher CPU or network traffic than peers.

1. Check leader count and high-volume partition leaders.
2. Confirm whether replica placement is balanced.
3. Rebalance leaders if replicas already exist on suitable brokers.
4. Reassign replicas only if leader movement alone is not enough.
5. Monitor request distribution and latency.

Mistakes to avoid

Mistake 1: Resizing the cluster but not redistributing load

Adding brokers creates capacity, but existing partitions may still live mostly on old brokers. Plan for replica or leader balancing when the goal is to relieve current load.

Mistake 2: Using broker capacity scaling to fix consumer parallelism

If a topic has 6 partitions and the consumer group has 20 consumers, adding brokers does not let all 20 consumers actively process that topic. Increase partitions if the workload can tolerate the ordering and rebalance implications.

Mistake 3: Moving too much data at once

Large reassignment operations can put heavy pressure on network and disk. Do not apply a broad generated reassignment plan blindly. Review the current assignment, throttle movement, and phase large moves gradually, especially on high-throughput clusters.

Mistake 4: Ignoring disk thresholds

Disk pressure can lead to producer throttling or blocking based on broker disk quota behavior. Plan capacity before reaching critical thresholds.

Mistake 5: Assuming scale-in or scale-down is supported

OCI Managed Kafka does not currently support customer-driven broker removal, cluster scale-in, or scale-down. Customers should plan capacity with expected growth, retention, replication factor, and operational headroom in mind.

Key takeaway

Kafka cluster resizing in OCI Streaming with Apache Kafka is primarily a capacity operation. Use it when the cluster needs more compute, network, disk, storage, or fault-tolerance headroom.

Adding brokers does not automatically increase consumer parallelism, and it may not automatically redistribute existing partition replicas. For many production scaling events, the complete workflow is:

1. Identify the bottleneck.
2. Scale out the cluster or scale up supported broker resources.
3. Verify cluster health.
4. Reassign replicas or rebalance leaders if needed.
5. Monitor broker, Kafka, and application metrics.
6. Increase partitions only if application parallelism still requires it.

Used carefully, broker capacity scaling gives OCI Managed Kafka users more room for growth and more operational headroom. The key is to scale for the right reason, plan any follow-up balancing work, and validate the result with metrics rather than assuming added capacity is automatically carrying traffic.

Final checklist

Before resizing the Kafka cluster

  • Confirm the bottleneck is broker capacity, not partition-level parallelism.
  • Capture broker CPU, network, disk, latency, ISR, and consumer lag baselines.
  • Review topic count, partition count, and total replica count.
  • Check service limits, quotas, and target broker count.
  • Validate subnet and network readiness if applicable.
  • Plan whether replica or leader balancing is needed after the resize.
  • Choose a controlled window for critical workloads.

During the scaling operation

  • Update broker count or broker resources through supported OCI Console, CLI, API, or automation workflows.
  • Monitor cluster lifecycle state.
  • Watch broker health, under-replicated partitions, offline partitions, and application errors.
  • Avoid stacking unrelated major changes at the same time.

After resizing the Kafka cluster

  • Verify expected broker count and broker health.
  • Check whether new brokers are carrying replicas and leaders.
  • Reassign replicas if existing load needs to move.
  • Rebalance leaders if request load is uneven.
  • Monitor CPU, network, disk, latency, ISR, and consumer lag.
  • Validate that the original bottleneck improved.

Broker capacity scaling is one part of a broader Kafka scaling strategy. Combine it with thoughtful partition planning, retention management, replication design, and workload monitoring to keep streaming applications reliable as they grow.