Kafka Cluster
1. Introduction
- Motivation for Kafka
  - Rise of event-driven architectures
  - Big data & real-time processing needs
 
- What is Apache Kafka?
  - History and evolution (LinkedIn → Apache project)
  - Core principles: distributed log, publish-subscribe, scalability
 
- Research Objectives
  - Understand Kafka cluster internals
  - Compare with alternatives (RabbitMQ, Redis Streams, etc.)
  - Explore use cases in modern data platforms
 
2. Kafka Fundamentals
Kafka as a Distributed Commit Log
Mental model: an append-only file, sharded into partitions, replicated across brokers. Producers append, consumers read by offset.
Sequence (high level)
Key invariants
- Ordering is per partition (not across a topic).
- Offsets are per partition and monotonic.
- Replication gives durability; leaders serve reads/writes.
Quick check (answer in one sentence): If you need strict ordering for all events of a user, what must be true about your partitioning?
Core components
Topics, Partitions, and Keys
- A topic is a named log.
- A topic has P partitions: T-0, T-1, ..., T-(P-1).
- Producer chooses the partition via:
  - Keyed: partition = hash(key) % P → sticky per key (great for user_id). With P=6 and key="user_42", all events for user_42 land on the same partition → ordering preserved for that user (see the sketch after this list).
  - Unkeyed: sticky/batch partitioner (Kafka ≥ 2.4) for throughput, but no per-key ordering.
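A minimal sketch of keyed producing with the Java client, assuming a local broker at localhost:9092 and a topic named user-events (both placeholders); all records with the same key hash to the same partition, so per-user ordering is preserved.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition (hash(key) % P), so events for user_42 stay ordered.
            producer.send(new ProducerRecord<>("user-events", "user_42", "logged_in"));
            producer.send(new ProducerRecord<>("user-events", "user_42", "added_to_cart"));
            producer.send(new ProducerRecord<>("user-events", "user_42", "checked_out"));
        }
    }
}
```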
Producers: Batching, Acks, Idempotence
Kafka Producer Batching
- linger.ms: the number of milliseconds a producer is willing to wait before sending a batch out. The default value is 0, which means “send the messages right away”.
- batch.size: the maximum number of bytes that will be included in a batch.

Kafka Producer Batching
Key Takeaways
- Increase linger.ms and the producer will wait a few milliseconds for the batches to fill up before sending them.
- If you are sending full batches and have memory to spare, you can increase batch.size and send larger batches.
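A hedged sketch of these two settings in producer-properties form (values are illustrative examples, not recommendations):

```
# producer batching sketch -- illustrative values
# wait up to 20 ms so batches can fill before sending
linger.ms=20
# allow 64 KB batches instead of the 16 KB default
batch.size=65536
```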
Kafka Message Compression
- Kafka supports two types of compression: producer-side and broker-side. The compression.type options are none, gzip, lz4, snappy, and zstd.

Message Compression
Kafka Producer Acks Deep Dive
- acks=0: When acks=0 producers consider messages as “written successfully” the moment the message was sent without waiting for the broker to accept it at all.

acks=0
- acks=1: When acks=1, producers consider messages as “written successfully” when the message was acknowledged by only the leader.

acks=1
- acks=all: When acks=all, producers consider messages as “written successfully” when the message is accepted by all in-sync replicas (ISR).

acks=all
The leader replica for a partition checks whether there are enough in-sync replicas to safely write the message (controlled by the broker setting min.insync.replicas). The request is held in a buffer until the leader observes that the follower replicas have replicated the message, at which point a successful acknowledgement is sent back to the client.
min.insync.replicas can be configured at both the topic and the broker level. The data is considered committed when it is written to all in-sync replicas, and min.insync.replicas sets the minimum number of replicas that must be in sync. A value of 2 implies that at least 2 in-sync brokers (including the leader) must respond that they have the data.
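A sketch of how these settings combine for durability, assuming a replication factor of 3 (values illustrative):

```
# producer side (sketch)
acks=all
enable.idempotence=true

# topic or broker side (sketch): with RF=3, require 2 in-sync replicas
# so that acks=all writes survive the loss of one broker
min.insync.replicas=2
```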

Kafka Topic Replication, ISR & Message Safety
Kafka Topic Durability and Availability
For a topic replication factor of 3, topic data durability can withstand the loss of 2 brokers. As a general rule, for a replication factor of N, you can permanently lose up to N-1 brokers and still recover your data.
acks=all and min.insync.replicas=2 is the most popular option for data durability and availability, and allows you to withstand at most the loss of one Kafka broker.
Idempotent Kafka Producer
What is an Idempotent Kafka producer? How do they help avoid duplication?
Problem with Retries
Retrying to send a failed message carries a small risk that both the original message and the retry are successfully written to the broker, leading to duplicates. This can happen as illustrated below.
- Kafka producer sends a message to Kafka 
- The message was successfully written and replicated 
- Network issues prevented the broker acknowledgment from reaching the producer 
- The producer will treat the lack of acknowledgment as a temporary network issue and will retry sending the message (since it can’t know that it was received). 
- In that case, the broker will end up having the same message twice. 

Duplicate Message
Kafka Idempotent Producer
Producer idempotence ensures that duplicates are not introduced due to unexpected retries.

Idempotent Producer
To get an idempotent producer, set enable.idempotence=true and acks=all (these are the defaults since Kafka 3.0; see KIP-679 for more details).
Producer Default Partitioner & Sticky Partitioner
A partitioner is the process that determines which partition a specific message is assigned to.
Partitioner when key!=null

Partitioner when key=null
When key=null, the producer has a default partitioner that varies:
- Round Robin: for Kafka 2.3 and below 
- Sticky Partitioner: for Kafka 2.4 and above 
Sticky Partitioner improves the performance of the producer especially with high throughput.
Round Robin Partitioner (Kafka ≤ v2.3)
With Kafka ≤ v2.3, when there’s no partition and no key specified, the default partitioner sends data in a round-robin fashion. This results in more batches (one batch per partition) and smaller batches (imagine with 100 partitions). And this is a problem because smaller batches lead to more requests as well as higher latency.
Round Robin Partitioner
Sticky Partitioner (Kafka ≥ 2.4)
To improve batching, it is a performance goal to have all the records of a batch sent to a single partition rather than spread across multiple partitions.
The producer sticky partitioner will:
- “stick” to a partition until the batch is full or linger.ms has elapsed
- After sending the batch, change the partition that is “sticky” 
This will lead to larger batches and reduced latency (because we have larger requests, and the batch.size is more likely to be reached). Over time, the records are still spread evenly across partitions, so the balance of the cluster is not affected.


And the latency is noticeably lower the more partitions you have.

Consumers & Consumer Groups
Delivery Semantics for Kafka Consumers
Deep dive into the delivery semantics available to Kafka consumers
At Most Once Delivery
In this case, offsets are committed as soon as a message batch is received after calling poll(). If the subsequent processing fails, the message will be lost. It will not be read again as the offsets of those messages have been committed already. This may be suitable for systems that can afford to lose data.
The sequence of steps is illustrated below.

At Most Once
At Least Once Delivery (usually preferred)
In at-least-once delivery, every event from the source system will reach its destination, but sometimes retries will cause duplicates. Here, offsets are committed after the message is processed. If the processing goes wrong, the message will be read again. This can result in duplicate processing of messages. This is suitable for consumers that cannot afford any data loss.
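A minimal at-least-once consumer sketch using the Java client; the topic user-events and group my-group are illustrative placeholders. Offsets are committed only after processing, so a crash mid-batch causes reprocessing rather than data loss.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumed broker address
        props.put("group.id", "my-group");                      // assumed group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");               // commit manually, after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                             // hypothetical processing step
                }
                consumer.commitSync();                           // commit after processing => at-least-once
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}
```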

Exactly Once Delivery
Some applications require not just at-least-once semantics (meaning no data loss), but also exactly-once semantics. Each message is delivered exactly once. This may be achieved in certain situations if Kafka and the consumer application cooperate to make exactly-once semantics happen.
This can only be achieved for Kafka topic to Kafka topic workflows using the transactions API. The Kafka Streams API simplifies the usage of that API and enables exactly-once with the setting processing.guarantee=exactly_once_v2 (formerly exactly_once).
For Kafka topic to External System workflows, to effectively achieve exactly once, you must use an idempotent consumer.
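A sketch of the corresponding Kafka Streams configuration (the application id and bootstrap address are placeholders):

```
# Kafka Streams exactly-once sketch
application.id=my-streams-app
bootstrap.servers=localhost:9092
processing.guarantee=exactly_once_v2
```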
Kafka Consumer Important Settings: Poll and Internal Threads Behavior
Kafka Consumer Poll Behavior
Kafka consumers poll the Kafka broker to receive batches of data. Once the consumer is subscribed to Kafka topics, the poll loop handles all details of coordination, partition rebalances, heartbeats, and data fetching, leaving the developer with a clean API that simply returns available data from the assigned partitions on each .poll() call.
Kafka Consumer Poll
Internal Poll Thread & Heartbeat Thread
The way consumers maintain membership in a consumer group and ownership of the partitions assigned to them is by sending heartbeats to a Kafka broker designated as the group coordinator. We also learned in earlier sections that consumers poll the broker for messages. These two activities are performed on separate consumer threads, illustrated below.
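A sketch of the consumer settings that govern this liveness protocol; the values shown are typical defaults for recent clients, so treat them as assumptions to verify against your client version.

```
# consumer liveness / poll settings (sketch; check the defaults for your client version)
# how often the heartbeat thread pings the group coordinator
heartbeat.interval.ms=3000
# no heartbeat within this window => the consumer is considered dead and a rebalance starts
session.timeout.ms=45000
# maximum time between poll() calls before the consumer is removed from the group
max.poll.interval.ms=300000
# maximum number of records returned by a single poll()
max.poll.records=500
```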

Kafka Consumer Liveliness
Consumer Auto Offsets Reset Behavior
A consumer is expected to read from a log continuously.

Kafka Consumer Offsets
Kafka consumers have a configuration for how to behave when they don’t have a previously committed offset. This can happen if the consumer application is down for a while (for example, because of a bug). If Kafka has a retention of 7 days and your consumer is down for more than 7 days, the offsets become “invalid” because they will have been deleted.
In this case, consumers can choose to start reading either from the beginning of the partition or from the end of the partition. This is controlled by the consumer configuration auto.offset.reset.
auto.offset.reset
Three possible values:
- latest (default): consumers will read messages from the tail of the partition
- earliest: read from the oldest offset in the partition
- none: throw an exception to the consumer if no previous offset is found for the consumer’s group
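For example, in the consumer configuration (sketch):

```
# start from the beginning of the partition when no committed offset exists for this group
auto.offset.reset=earliest
```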
offsets.retention.minutes
The default retention period for committed offsets in Kafka (version >= 2.0) is one week (7 days). It is a broker-level setting: the retention period for the __consumer_offsets topic, in minutes.
If a consumer is down for longer than this period, its committed offsets are deleted and the auto.offset.reset setting kicks in. If you would like to avoid that case, increase the value of offsets.retention.minutes to something like 1 month.
Consumer Read from Closest Replica
Kafka Consumers read by default from the broker that is the leader for a given partition.
In case you have multiple data centers, you run the risk of slightly higher latency, as well as high network charges if you’re being billed for cross-data-center network traffic (which is the case with cloud computing platforms such as AWS).

Since Kafka 2.4, it is possible to configure consumers to read from the closest replica. This may help improve latency, and also decrease network costs if using the cloud.

To set this up, you must configure the following settings:
Broker settings (must be Kafka v2.4+):
broker.rack must be set to the data centre ID (ex: AZ ID in AWS)
Example for AWS: AZ ID broker.rack=usw2-az1
replica.selector.class must be set to org.apache.kafka.common.replica.RackAwareReplicaSelector
Consumer Client setting (v2.4+):
Set client.rack to the data centre ID the consumer is launched on
Example for AWS client.rack=usw2-az1
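A combined sketch of the broker-side and consumer-side settings (the AZ ID usw2-az1 is just an example):

```
# broker side (Kafka >= 2.4), in server.properties
broker.rack=usw2-az1
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# consumer side, in the client configuration
client.rack=usw2-az1
```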
Consumer Incremental Rebalance & Static Group Membership
When partition assignments move between your consumers, we are dealing with consumer rebalances.
Rebalances can happen in the following conditions:
- a consumer leaves the group 
- a consumer joins a group 
- partitions are added to a topic 
You can have control over the rebalance strategy, as we will see on this page.

Rebalance after a consumer joins a group
Consumer Eager Rebalancing
By default, consumers perform eager rebalancing, which means that all consumers stop consuming from Apache Kafka and give up the membership of their partitions.
During this period of time, the entire consumer group stops processing; this is also called a “stop the world” event.
They will rejoin the consumer group and get a new partition assignment, but don’t necessarily “get back” the partitions that were previously assigned to them.

Consumer Cooperative Rebalance (Incremental Rebalance)
In this mode, only a subset of partitions is moved from one consumer to another. Other Kafka consumers that are not concerned by the rebalance can keep on processing data without interruptions. Your whole consumer group can go through several rebalances until finding a stable assignment, hence the name incremental rebalance.

How to use the cooperative rebalance?
Kafka Consumer: partition.assignment.strategy
You can set the configuration to several values, the last one enabling incremental cooperative rebalancing (see the sketch after this list):
- RangeAssignor: assigns partitions on a per-topic basis (can lead to imbalance)
- RoundRobin: assigns partitions across all topics in a round-robin fashion, optimal balance
- StickyAssignor: balanced like RoundRobin, then minimizes partition movements when consumers join/leave the group
- CooperativeStickyAssignor: same assignment logic as StickyAssignor, but supports cooperative (incremental) rebalances, so consumers can keep consuming from partitions that are not reassigned
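A consumer configuration sketch for cooperative rebalancing, plus static group membership to avoid rebalances on short restarts (the group.instance.id value is illustrative):

```
# cooperative (incremental) rebalancing
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

# static group membership: rejoining with the same instance id within
# session.timeout.ms does not trigger a rebalance
group.instance.id=consumer-pod-1
```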
3. Kafka Cluster Architecture
Objective: understand how Kafka clusters are built and behave in both ZooKeeper and KRaft (Kafka Raft) modes, then zoom into the moving parts that matter in production.
3.1 ZooKeeper-based Architecture
Role of ZooKeeper (metadata & coordination)
- Metadata store: topics, partitions, leaders, ISR, ACLs (older versions), configs. 
- Leader election: a Kafka Controller broker is elected via ZK ephemeral znodes. 
- Health & membership: brokers keep ZK sessions alive (heartbeats). Loss → session expiry → cluster reacts. 
- Watches: brokers and controller subscribe to znode changes (e.g., topic creation, broker join/leave). 
Controller election example (brokers B1, B2, B3):
- B1 connects, creates ephemeral /controller.
- B1 is now controller; it watches /brokers/* and /admin/*. 
- If B1’s ZK session expires → /controller disappears → B2/B3 race to become controller. 
Ops implications
ZK quorum health is critical path for metadata changes/rebalances.
Controller flaps often indicate ZK instability or network partitions.
Broker–ZooKeeper Interactions
Concrete examples
Topic creation: CLI writes ZK nodes → controller observes → assigns leaders/followers → brokers apply roles.
Broker join: new ephemeral node → controller rebalances leadership to spread load.
Limitations of ZooKeeper-based design
- Two systems to secure/operate/upgrade (Kafka + ZK).
- Metadata change storms (many topics/partitions) can saturate ZK.
- Recovery semantics cross system boundaries (session expiries, watches).
- More moving parts → more MTTR on bad days.
3.2 KRaft (Kafka Raft) Mode
Why KRaft
- Remove ZK dependency → single control plane.
- Metadata kept in a Raft-replicated log (controller quorum) → consistent, scalable.
- Faster elections, simpler upgrades, better large-scale partition handling.
Raft consensus for metadata quorum
- Small set of controller nodes form a Raft quorum.
- Metadata mutations append to a metadata log; committed on majority ACK.
- Brokers subscribe to committed metadata and update local state.
Latency clarity: controller quorum stays small (3–5); data brokers can be dozens+.
Broker roles in KRaft
- process.roles=controller → participates in Raft only.
- process.roles=broker → serves data partitions; subscribes to metadata.
- Combined role for dev/small clusters; separated for prod scale/isolation.
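A minimal single-node, combined-mode KRaft configuration sketch (node id, ports, listener names, and paths are illustrative and must match your environment):

```
# combined broker+controller node (dev/small-cluster sketch)
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
log.dirs=/var/lib/kafka/data
```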
3.3 Cluster Components in Detail
Brokers & Replication Factor (RF)
- Broker = process that hosts replicas and serves client I/O.
- RF = number of copies per partition (typ. 3).
- min.insync.replicas (MISR) + acks=all gate write availability vs durability.
Partition leader/follower election
- Happy path: controller assigns leaders; followers fetch from leader; ISR tracks replicas in sync.
- Leader loss: controller promotes a replica from ISR.
- Unclean elections: if ISR empty and unclean.leader.election.enable=true, a stale replica can be promoted → data loss risk. Keep false in prod.
Controller node(s)
- ZK mode: a single broker (at a time) is controller; elected via /controller.
- KRaft mode: controller quorum (Raft).
- Responsibilities: leader assignment, partition state transitions, broker membership, admin ops.
3.4 Scaling Considerations
Scaling Apache Kafka involves careful planning across multiple dimensions: the number of partitions per topic, the number of broker nodes, overall throughput capacity, and the size of consumer groups. Each of these scaling levers comes with trade-offs in performance, reliability, and operational complexity. In cloud-native Kubernetes environments, operators (like Strimzi or Confluent Operator) provide automation for tasks such as broker deployment and partition rebalancing, whereas on bare-metal or VM deployments these tasks often require manual orchestration or separate tools. This guide presents a comprehensive look at scaling Kafka clusters in both Kubernetes and traditional setups, covering best practices for partition and broker scaling, performance tuning, rebalancing strategies, and key configuration examples. The goal is to maximize high-throughput and low-latency streaming while maintaining fault tolerance and manageability.
Scaling Partition Count
Partitions are the primary mechanism for scaling throughput and consumer parallelism in Kafka. By increasing the number of partitions for a topic, you enable more concurrent consumption and higher aggregate throughput (since each partition is processed by at most one consumer in a group and handled by one broker thread at a time). However, choosing the partition count requires balancing trade-offs:
- Throughput vs. Overhead: More partitions can increase throughput and concurrency, but partitions aren’t free – each one carries overhead in memory and file handles on the brokers, and in metadata for clients. Each broker must maintain buffers, indexes, and file descriptors for each partition it hosts, so a very high partition count can strain broker resources (particularly RAM and CPU for managing thousands of log segments). A general guideline is to keep partition counts per broker below roughly 4,000 partitions per broker (a soft limit under which many deployments operate efficiently). Exceeding this can lead to high JVM garbage collection overhead, longer broker startup/shutdown times, and increased tail latencies once you go beyond ~10k partitions per broker. (Notably, improvements in newer Kafka versions and KRaft mode without ZooKeeper may raise these limits, but caution is still advised.) forum.confluent.io
- Parallelism vs. Utilization: The number of partitions should ideally be a bit higher than the maximum number of consumer instances you expect in a consuming group, to allow scale-out. For example, if you foresee up to 50 consumers, having ~50 partitions (or slightly more) for that topic ensures each consumer can get a partition. It’s common to slightly over-provision partitions to handle future growth, but avoid grossly overestimating. Unused partitions incur overhead on brokers and clients without providing value. As one practitioner noted, teams sometimes over-partition “just in case” (e.g., 60 partitions because they might deploy 60 consumers) and end up with far more partitions than needed. Plan based on throughput and consumer scaling needs, and remember that while you can increase partitions later, you cannot decrease them for a topic (short of creating a new topic and migrating data). Increasing partitions later also has implications: existing data won’t rebalance to new partitions (only new records go to new partitions) and ordering guarantees per key may be affected if not carefully managed.
- Even data distribution: If using key-based partitioning, some keys might be much more frequent than others, causing uneven partition load. In such cases, consider using a more fine-grained key or enabling random partitioning for writes to balance load. Random or round-robin partitioning can prevent any single “hot partition” from limiting throughput by spreading messages evenly. If certain partitions still grow disproportionately (in message rate or size), you might need to reshard the data (e.g., splitting a hot partition’s key range across multiple partitions), effectively increasing the partition count and redistributing the data for that key range. (statsig.com)
- Capacity planning: A good practice is to monitor broker metrics like partition count, memory usage, and file descriptors. If brokers are approaching resource limits (high heap usage, nearing open file descriptor limits, etc.) due to partition count, it may be time to add brokers or curb partition creation. Keep in mind that total partitions in the cluster (across all topics) matter, not just per topic. Kafka clusters in practice have been run with up to 200,000 partitions (e.g., 4k partitions × 50 brokers), but such extremes require strong hardware and careful tuning. Aim for the fewest partitions that still meet your throughput and consumer-parallelism requirements, and scale gradually while load-testing. As a rule of thumb, “less is better” for partitions unless needed – avoid excessive fine-partitioning of data if a coarser partitioning would suffice (aws.amazon.com).
If you do need to increase the partition count of a topic later on, Kafka allows it dynamically (see the sketch after this list), but be aware this is a one-way operation. Newly added partitions won’t have historical data, and consumers with state (like Kafka Streams or exactly-once processing) might need special handling when partitioning changes. For stateless consumers, increasing partitions will trigger a rebalance in consumer groups (consumers will redistribute to cover the new partitions). Always test such changes in a staging environment, as they can momentarily disrupt consumption during the rebalance. In summary, scale partitions thoughtfully: enough to parallelize and handle your peak throughput, but not so many that the cluster is overwhelmed by management overhead.
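A minimal Java AdminClient sketch for increasing a topic’s partition count; the topic name and target count are illustrative, and the operation cannot be reversed.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitionsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the (hypothetical) topic "user-events" to 12 partitions.
            // Existing data stays where it is; only new records use the new partitions.
            admin.createPartitions(Map.of("user-events", NewPartitions.increaseTo(12)))
                 .all()
                 .get();
        }
    }
}
```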
Scaling Broker Count (Horizontal Scaling)
Scaling out the number of brokers (adding more broker nodes to the cluster) is a direct way to increase cluster capacity for both storage and throughput. When the load on existing brokers nears their CPU, network, or disk throughput limits, or when you need to host more partitions than the current brokers can handle comfortably, it’s time to consider adding brokers. Here are key considerations and best practices for broker horizontal scaling:
- Adding brokers and data rebalancing: Simply starting a new broker process and joining it to the cluster is straightforward (in Kubernetes, scaling up the Kafka StatefulSet or deploying a new broker Pod; on bare metal/VMs, provisioning a new server with Kafka and configuring it to join the cluster). However, new brokers begin empty – Kafka does not automatically move existing partitions to them on join. Without intervention, new brokers will only take on data for newly created topics/partitions, leaving existing topic data skewed on the original brokers. To truly gain the benefits of a new broker, you must rebalance (reassign) partitions to migrate some load onto it. This is typically done using Kafka’s partition reassignment tool or automated rebalancing software (discussed in a later section). For example, if you have 3 brokers and add a 4th, you should move roughly 1/4 of the partitions (and their replicas) from the original brokers to the new one so that each broker ends up with about 25% of the partitions. This evens out disk utilization and traffic. The rebalancing can be performed gradually to minimize impact on clients (more on rebalancing strategies below). The bottom line is scaling out is a two-step process: add broker(s), then redistribute data. strimzi.io
- Removing brokers (scaling down): Similarly, scaling down requires care. You shouldn’t just shut off a broker that still hosts partitions – that would lead to “offline” partitions and reduced replication (or unavailability if the replication factor was 1). The safe method is to first migrate all partitions off the broker to be removed. Kafka’s partition reassignment tool or Cruise Control can move partition replicas from that broker onto others. Once the broker has no partitions (check via Kafka’s topic partition distribution or Cruise Control’s proposals), it can be cleanly removed from the cluster. In Kubernetes with Strimzi, the operator can coordinate this using a remove-brokers rebalancing mode. Note that scaling down reduces capacity and can increase load on the remaining brokers, so ensure the remaining cluster can handle the throughput and data volume. (strimzi.io)
- Scaling out vs. up: An architectural decision arises on whether to scale out (more brokers of similar size) or scale up (use fewer brokers with higher specs). More smaller brokers mean finer granularity of scaling and failure impact (losing one small broker is less load lost than one giant broker), but too many brokers can make management (and inter-broker communication) more complex. Fewer large brokers simplify topology but concentrate more risk per node. A common compromise is to use moderately powerful brokers and add nodes as needed. Cloud environments often make this choice based on instance types: e.g., instead of pushing an instance type to its limits, you might add another node. Keep in mind some limits like maximum partitions per broker when considering very large brokers – e.g., an m5.large instance might only comfortably handle ~1000 partitions, whereas an m5.4xlarge could handle more, up to a few thousand reddit.com. Test scaling strategies with your workload; as one AWS guideline suggests, use ~80% of theoretical capacity as a safe operating target aws.amazon.com, and add brokers or upgrade instances before hitting full utilization.
- Horizontal scaling in practice (example): Imagine a cluster with 3 brokers handling 300 MB/s ingress each (900 MB/s total) and you project growth to 1500 MB/s. You could add 2 more brokers (total 5) so each would handle ~300 MB/s (if evenly balanced). The following illustrates an initial 3-broker cluster and an expanded 4-broker cluster after rebalancing:
In Kubernetes, you would increase the replicas count in the Kafka cluster manifest (if using Strimzi, for instance). The operator will spin up a new broker Pod (Broker 4). Then an admin would invoke a partition reassignment (or use Strimzi’s KafkaRebalance API) to evenly spread partitions to the new broker (strimzi.io). On bare metal, you’d provision a new server, install Kafka, join it to the cluster (by pointing it to the same ZooKeeper or controller quorum), then use Kafka’s admin tools to move some partitions over. Always monitor broker health (CPU, disk, network) during rebalancing, as moving data can consume considerable bandwidth and I/O – it may be wise to throttle the reassignment (Kafka has configs like replication.throttled.rate) to avoid overloading the cluster during the migration.
- Broker indexing and client configs: When brokers are added or removed, clients (producers and consumers) get cluster metadata updates. This is generally automatic. Ensure your client applications handle the dynamic addition of partitions/brokers (most modern Kafka client libraries do). If you use manual partition assignments in producers or consumers, those might need updates to include the new brokers/partitions – but typically, relying on Kafka’s own partitioner and consumer group balancing is simpler.
In summary, horizontal scaling of brokers is a powerful way to grow capacity, but it goes hand-in-hand with partition rebalancing. Plan for the data movement and the temporary performance impact during reassignments. The advantage in Kubernetes is that an operator can automate much of this (including triggering Cruise Control for optimal reassignments on scale-up events) (strimzi.io). On bare metal, you might schedule maintenance windows to add brokers and use tooling to migrate load. Either way, once complete, the cluster will have lower utilization per broker, offering headroom for increased throughput or more partitions.
Scaling Consumer Groups
Kafka’s consumer group mechanism provides horizontal scalability on the consumption side: multiple consumers in the same group divide the partitions of a topic among themselves. Understanding this relationship is key to scaling out processing:
- Parallelism limited by partitions: The maximum parallelism for consuming a single topic is equal to the number of partitions for that topic, because each partition in a group is consumed by exactly one consumer. If you have more consumer instances in a group than partitions, the extra consumers will be idle (they won’t get assigned any partition). For example, with a topic of 10 partitions, a consumer group can effectively scale out to 10 active consumers (each handling one partition’s stream). An 11th consumer in that group would not improve throughput – it would simply remain unassigned (or one consumer would be assigned two partitions while one gets none, depending on the assignment strategy, but there’s no scenario where two consumers share one partition’s data). Therefore, to scale consumption beyond current limits, you may need to increase partition count in tandem with adding consumers, as discussed earlier. A good practice is to align partition counts with expected consumer counts (and some extra room). If you plan to run up to N consumer instances, use N or slightly more partitions so you’re not constrained.
- Consumer group rebalances: When you change the number of consumers (say, deploy additional consumer instances in Kubernetes or start new consumer processes on VMs), Kafka will trigger a rebalance in that consumer group. During a rebalance, partitions are reassigned among consumers, which temporarily stops consumption for that group. Frequent scaling events (autoscaling consumers up/down rapidly) can lead to churn and processing delays due to rebalancing overhead. Kafka’s newer cooperative sticky rebalancing (available in newer client versions) can mitigate impact by reassignment in an incremental fashion, but it’s still best to avoid extremely flappy consumer scaling. Design your applications to handle rebalances gracefully (idempotent processing, etc.). In Kubernetes, if you use something like an Horizontal Pod Autoscaler for consumers, you might want to debounce rapid scale changes.
- Multiple consumer groups (fan-out load): If you have multiple distinct consumer groups reading the same topic (for example, two different services interested in the same data stream), each group independently gets a copy of the data. From the cluster’s perspective, this multiplies the read throughput. Every consumer group adds read load on the brokers, since each group’s consumers collectively have to read every message of a topic once. With one group, messages are read once; with two groups, each message is read twice (once per group), doubling the total read throughput required. This can stress network and disk if many consumer groups are tailing a high-volume topic. As AWS notes, “the more consumer groups that are reading from the cluster, the more data that egresses over the network of the brokers”. In other words, each consumer group increases the BytesOut from brokers. If network interface or disk throughput on brokers is a bottleneck, many consumer groups will exacerbate it – in such cases, consider scaling up brokers to ones with higher network bandwidth or using compression to reduce per-message size. Also monitor broker outbound traffic (e.g., the BytesOutPerSec metric) to ensure it stays within limits as you add consumer groups. (aws.amazon.com)
- Throughput per consumer: Within a single consumer process, Kafka can fetch from multiple partitions in parallel (in separate threads) or sequentially in one thread, depending on your consumer implementation. If a single consumer isn’t able to keep up with the throughput of multiple partitions, adding more consumers (thus splitting partitions among them) will help. On the contrary, if consumers are under-utilized (e.g., each handling very low message rates), you might reduce consumer count to save resources – just don’t go below the partition count if you need all partitions read. Kafka consumers can also be multithreaded, but the simplest scaling is typically one thread per partition.
- Order and state: Remember that within a partition, order is maintained, and a consumer processes it sequentially. If you need ordering across a broader key space, you might be forced into fewer partitions or stick to one consumer (limiting scalability). Often, a strategy is to partition by a key that allows parallelism for different keys while preserving order per key. This way you can have many partitions (hence many consumers) without losing ordering guarantees for a given key scope.
- Consumer lag and scaling: If consumers fall behind (lag increasing), one solution is to add more consumers (if partitions allow) to catch up faster. Another is to increase the processing throughput of each consumer (optimize the consumer logic or increase its resources). If lag is due to slow processing (and not just volume), adding consumers will help only if the work can indeed be parallelized across partitions. If one partition’s data is the slow part (e.g., one partition has all the heavy messages), that’s a data skew issue – adding consumers won’t help unless you also repartition the topic to spread that hotspot. Monitor consumer lag on each partition to identify if scaling out consumers will resolve the problem or if a re-partitioning of the data is needed.
In summary, scale consumer groups by adding consumers up to the partition count. For multiple consumer groups, be aware of the multiplied load on brokers (you might need to scale the cluster or use larger brokers to handle many groups) aws.amazon.com. Use partition count and consumer count hand-in-hand as tuning knobs: neither alone can solve throughput issues without the other if you’re at the limit (i.e., too few partitions or not enough consumers). In Kubernetes, consumer scaling is typically done by adjusting the number of consumer pods (possibly automated by HPA based on lag metrics). On bare-metal, you might script the launch of additional consumer processes. Either way, coordinate this with partition planning. Finally, ensure that your client apps commit offsets appropriately and handle rebalances – a well-tuned consumer group can gracefully scale up or down with minimal message processing disruption.
Performance Guidelines and Resource Planning
Scaling Kafka isn’t just about adding partitions or brokers blindly – it also requires capacity planning for hardware and resources to meet performance targets. Here we outline key resource considerations and some rules-of-thumb thresholds:
- Partitions per Broker: As discussed, Kafka brokers can handle thousands of partitions, but there are practical limits. A commonly cited guideline is no more than ~4,000 partitions per broker for optimum performance (and about 200,000 partitions per cluster in total as an upper bound in traditional setups). Exceeding this can hurt metadata propagation and increase GC pressure. Each partition (and its segments) consume memory for index and OS page cache. If you have very small partitions (tiny throughput each), you might have more, but monitor the broker’s heap and page cache usage. Also note that more partitions means more open file handles and mmap() regions on the broker – you may need to raise the ulimit for file descriptors and the VM max map count on the OS if pushing high partition counts. If you find yourself needing, say, 10,000+ partitions per broker, consider whether you can distribute those across more brokers instead. The Kafka Definitive Guide suggests that beyond a certain point, adding brokers is preferable to endlessly adding partitions on the same broker. Also, a high partition count can slow down broker restarts and leader elections (a broker with 10k partitions will take longer to shut down and have its partitions picked up by others on failure). forum.confluent.io docs.cloudera.com
- Memory (RAM): Kafka is designed to leverage the OS page cache. It doesn’t need extremely large JVM heaps; in fact, the heap is usually kept around 6-16 GB to minimize GC pause times (often <32 GB to stay within the compressed-oops range). The rest of the machine’s memory is used by the OS for caching log file pages, which greatly speeds up reads. More RAM means more of the recent data stays in memory, which improves read latency and reduces disk I/O. It’s recommended to use machines with ample memory (e.g., 32 GB or more total) but not allocate it all to Kafka’s heap – leave a large portion for the OS cache. Monitor the broker’s heap usage via JMX (HeapMemoryUsage, or Kafka’s HeapMemoryAfterGC if using MSK) to ensure you’re not near OOM. High page cache hit rates are good – they mean disk reads are often served from memory. If your workload involves heavy random reads (e.g., consumers catching up on old data), more RAM helps significantly, since sequential disk I/O is fast but random seeks are slower; memory can serve those random reads if the data is cached. A best practice for sizing memory is to consider your ingestion rate and retention period: e.g., if you ingest 20 GB per hour per broker and want the last 3 hours readily available in memory to handle consumer catch-up, ~60 GB of OS page cache per broker would be ideal (a guideline mentioned in some Kafka ops discussions). In Kubernetes, ensure the Pod’s resource requests/limits include enough memory, and consider HugePages if applicable for page cache efficiency (though not commonly done for Kafka). (community.ibm.com, aws.amazon.com)
- Storage Throughput and IOPS: Kafka’s performance is often bound by disk throughput rather than raw IOPS, because it writes and reads large sequential logs. Fast disks are essential for a high-throughput Kafka cluster. SSD (Solid State Drives) or NVMe drives are strongly recommended over spinning disks, especially for production – SSDs provide high sequential write throughput and handle random I/O (like compaction or cache misses) far better due to low seek times. If on cloud, use high-performance volumes (e.g., Amazon EBS gp3 or io2 with provisioned IOPS/throughput). Monitor disk write and read bytes; a single broker can easily push hundreds of MB/s. Kafka writes are sequential and benefit from disk write buffers, but sustained throughput must be within the disk’s limits. A rule from AWS: keep actual throughput under ~80% of the disk’s rated capacity for safety. aws.amazon.com If using network-attached storage (like EBS), remember that it can be limited by the network pipe to storage. For example, an EC2 instance might allow 1000 MB/s to its EBS volume; if you need more, you either stripe multiple volumes or move to an instance with higher EBS bandwidth. aws.amazon.com Also consider using multiple disks per broker (and configure Kafka log dirs across them) to parallelize I/O – Kafka can round-robin new partitions across log directories. Do not put Kafka data on a shared NAS if you need high throughput – always prefer local or block storage attached to the broker. And avoid co-locating other heavy I/O workloads on the same disks. community.ibm.com
- Network throughput: Kafka clusters push a lot of network traffic – producers sending data in, brokers replicating data to each other, and consumers fetching data out. In a scale-out scenario, ensure your network is not a bottleneck. In the cloud, choose instance types with 10 GbE or higher network if you’re doing tens of MB/s or more per broker. On bare metal, ideally use 10 Gb Ethernet (or faster), especially for clusters with high throughput or many consumers. If encryption (TLS) is enabled, it can reduce effective throughput due to CPU overhead and slight latency, so factor that in or offload TLS if possible. Monitor metrics like NetworkProcessorAvgIdlePercent (a Kafka metric) or simply the OS network interface throughput. If brokers are saturating a 1 GbE link (~125 MB/s) but your disks can handle more, upgrading the network is necessary, or consider partitioning traffic across multiple NICs (some installations use separate NICs for client traffic vs. replication traffic). In Kubernetes, if using overlay networks, be mindful of the overhead – host networking or NodePort can be used to avoid an extra hop, or use fast CNI plugins. Also, client egress can become a bottleneck when many consumer groups are present, as mentioned. Larger instance types often come with higher baseline and burst network bandwidth, which is one reason scaling up a broker (to a bigger VM) can increase how many consumers it can serve. (aws.amazon.com, community.ibm.com)
- CPU and thread pool sizing: Kafka is generally not extremely CPU-intensive compared to I/O, but CPU usage grows with high message rates (due to compression, decompression, message format serialization, zero-copy transfer, etc.). Ensure brokers have modern CPUs and enough cores to handle the threading model. The Kafka broker uses several thread pools: I/O threads for disk, network threads for handling socket connections, etc. You can tune the counts (num.io.threads, num.network.threads) based on the cores available. For example, on a 16-core machine, setting num.io.threads=8 and num.network.threads=8 might be appropriate, leaving some cores for background tasks (replication, controller threads, etc.). If CPU is spiking, profile whether it’s due to garbage collection or the actual workload (encryption/compression costs can be high – LZ4 or Snappy compression is fast, but GZIP is CPU-heavy; similarly, TLS can eat CPU if AES-NI isn’t available or TLS isn’t offloaded). If needed, consider enabling compression on producers to reduce network and disk load (trading some CPU) – often a worthwhile trade. Also, ensure GC is configured properly (typically G1 GC for Kafka; set an appropriate heap and enable GC logging for visibility). If you have extremely high throughput with small messages, the broker can become CPU-bound on per-message overhead – in such cases, increasing batch sizes on producers (to reduce per-message overhead) or even packing multiple logical messages into one Kafka message can help. (community.ibm.com)
- Replication factor and overhead: Increasing the replication factor (say from 2 to 3) increases write traffic roughly by that factor (each message is sent to more brokers) and the storage footprint accordingly. A higher replication factor improves durability at the cost of throughput – not just for writes (each produce has to be replicated) but also for cluster recovery (more replicas to catch up). Most high-throughput clusters stick with RF=3 as a balance between safety and performance (RF=2 in less critical scenarios, but that only tolerates one failure). Make sure to configure at least min.insync.replicas=2 (if RF=3) and use acks=all on producers for strong durability – this adds slightly to latency but ensures at least 2 brokers have the data on write. If you need ultra-low latency and are willing to risk data loss, acks=1 can be used (only the leader waits for itself), but this is usually not recommended in production if data loss is unacceptable. (community.ibm.com)
- Monitoring and headroom: Always monitor key metrics as your cluster scales: CPU%, Memory usage, Disk used and I/O, Network throughput, GC pauses, Consumer lag, etc. Set up alerts when usage goes beyond a threshold (e.g., >70% CPU for sustained periods, >80% network, etc.). The AWS sizing formula and guidance suggest leaving about 20% headroom – running clusters too hot (near 100% utilization) leaves no room for traffic spikes or maintenance operations. Also consider failure scenarios: if you lose one broker out of 4, can the remaining 3 handle the entire load until recovery? It might be wise to size for N-1 brokers capacity (or N-2 in case of multi-AZ failure scenarios) such that the cluster remains stable during a broker outage. In Kubernetes, node auto-scaling might add new nodes if needed, but the Kafka load won’t magically balance unless Cruise Control intervenes. Thus, proactive capacity planning is better than reactive. aws.amazon.com
To summarize, treat Kafka like the I/O-intensive system it is: give it fast disks, fast networks, plenty of memory for caching, and enough CPU to handle its throughput (especially with compression or encryption). Follow the known limits (partitions per broker, etc.) but also test with your specific workloads, as message sizes and usage patterns influence what a “safe” limit is. When in doubt, scale out horizontally for more capacity rather than pushing a single broker to the moon. And use monitoring to drive when to scale – e.g., if BytesOutPerSec is nearing the NIC limits or disk write bytes are near capacity, that’s a sign to add brokers or upgrade hardware. (forum.confluent.io, docs.cloudera.com)
Rebalancing Strategies and Automation Tools
Rebalancing in Kafka refers to redistributing partitions across brokers to balance load. This becomes crucial after scaling events (adding or removing brokers) and also over time if certain brokers become overloaded (e.g., if some topics grow faster and concentrate on a subset of brokers). Kafka doesn’t automatically shuffle existing data just because a new broker joined – administrators must initiate a rebalance. Fortunately, there are tools and strategies to make this easier:
- Manual partition reassignment: Apache Kafka comes with command-line tools (or Admin API calls) to move partition replicas; see the sketch at the end of this section. You can specify a new replica assignment for partitions and instruct the cluster to carry out the data movement. This gives fine-grained control but can be tedious for large clusters. Typically one would generate a reassignment JSON (listing partitions and the brokers they should move to) – Kafka can generate a suggested reassignment for added brokers using the kafka-reassign-partitions script with a --broker-list option (for new brokers) or --additional-broker flag. After executing, Kafka will copy data for those partitions to the target brokers and then switch leadership. Manual reassignment works, but for more than a few partitions it’s error-prone and doesn’t consider cluster balance holistically (unless you painstakingly calculate it).
- LinkedIn Cruise Control: Cruise Control is an open-source tool by LinkedIn that automates the balancing of Kafka clusters. It continuously monitors broker metrics (CPU, disk, network, partition counts, leader counts, etc.) and can generate optimization proposals to rebalance load according to defined goals (e.g., balance disk usage within 10%, balance leader counts, minimize inter-rack traffic). Cruise Control can run as a service and either execute rebalances automatically or just suggest them for operator approval. Key uses of Cruise Control:
  - After adding brokers, run a rebalance to move some partitions to the new brokers until goals (like even disk usage) are met.
  - Before removing brokers, run a remove-broker optimization to move all data off specific nodes.
  - Continuous optimization: it can periodically tweak partition distribution to avoid skew (for example, if one topic’s partitions all became large and overloaded one broker’s disk, Cruise Control could move some to less-used brokers).
- Goal-based approach: You can prioritize certain goals like rack-awareness (never violate rack segregation), capacity goals (don’t overfill any broker disk or saturate network), even distribution of leaders, etc. Cruise Control will find moves that satisfy these.
- Throttle and safety: It will move partitions one at a time or in small batches and can be configured with concurrency limits and throttles to avoid too much impact.
 - Many organizations use Cruise Control for production Kafka clusters to maintain balance without manual effort. In a bare-metal scenario, you’d deploy Cruise Control separately and point it to the cluster. It has a REST API for triggering rebalances or getting cluster state. 
- Strimzi KafkaRebalance (Cruise Control on Kubernetes): In Kubernetes, the Strimzi operator integrates Cruise Control as an optional component. If enabled, Strimzi runs a Cruise Control pod alongside the cluster. As an admin you can then create a KafkaRebalance custom resource (CR) requesting an optimization. For instance, after scaling up brokers, you can post a CR with mode add-brokers listing the new broker IDs; Strimzi (via Cruise Control) will calculate a reassignment that moves some partitions to those brokers for balance. Once you approve, it executes the moves. Similarly, before scaling down, use mode remove-brokers in a CR to vacate certain brokers. As of Strimzi 0.44+, there’s also auto-rebalancing: the operator can automatically trigger a Cruise Control rebalance right after a scale-up, or right before a scale-down, so you don’t have to manually intervene. This is powerful in a fully automated environment (e.g., if brokers were scaled via an HPA or cluster autoscaler – after the new pod comes up, Strimzi can auto-run the rebalance to utilize it). To enable this, you configure the autoRebalance policy in the Kafka CR (as shown in the Strimzi docs, you can set it to run on add-brokers and/or remove-brokers). Under the hood, it uses Cruise Control’s capabilities. (strimzi.io)
- Other operator tools: Besides Strimzi, other Kafka operators (like Koperator/BanzaiCloud or Confluent for Kubernetes) also support rack-aware scheduling and perhaps some auto-balancing. Confluent Platform has a feature called Self-Balancing Cluster (SBC) which similarly reassigns partitions automatically in background; this might not be available outside Confluent’s distribution. But in open source, Cruise Control is the go-to solution. 
- Rebalance strategies: Regardless of tool, consider when to rebalance. A good strategy is to perform major rebalances during off-peak hours, since moving GBs of data will consume bandwidth and could impact latency (consumers might see increased lag while their partitions are being moved). One can mitigate the impact by throttling replication (the replication throttled-rate configs) so that partition moves don’t overwhelm normal traffic – though that means the rebalance takes longer. If using Cruise Control, you can configure it to move only one partition at a time per broker, for instance, which keeps the cluster stable. Another strategy is to prefer leader-only moves over moving full replicas when possible – sometimes balancing leadership (who is leader for each partition) can alleviate hotspots without physically moving data (since any replica can become leader). Kafka has a preferred leader election process to shuffle leaders back to “preferred” brokers; this is simpler than moving data and can be done periodically to ensure leaders are evenly spread (Kafka does this automatically by default, and you can trigger it via an admin command). In fact, if your cluster imbalance is mainly that one broker has too many leaders (and thus handles more client traffic), running a preferred leader election might balance it out if replicas are evenly placed.
- Rack-aware balancing: Ensure any rebalancing tool maintains rack-awareness (i.e., it shouldn’t put two replicas of same partition on the same rack). Cruise Control by default can honor rack constraints as a hard goal. The Kafka admin tools also allow specifying rack aware moves. This is especially important in Kubernetes if zone labels are used as rack – you wouldn’t want an automated move to inadvertently put partition replicas on pods in the same zone. 
In summary, plan for rebalancing as part of scaling operations. When you add brokers, budget time for data migration. Use automation tools like Cruise Control to simplify the process and optimize the cluster state according to your chosen policies (balance by disk usage, etc.). In Kubernetes, take advantage of your operator’s integration – for example, with Strimzi you might just increase the spec.kafka.replicas count, and if autoRebalance is on, it will take care of the rest, moving data to use the new pod. On bare-metal, you might run a Cruise Control rebalance manually after adding nodes. Neglecting rebalancing can leave new capacity underutilized and old brokers overburdened, defeating the purpose of scaling out. Conversely, after removing brokers, run a rebalance to ensure those brokers’ partitions have been safely relocated. Rebalancing is also a periodic maintenance task – even without scaling, running a rebalance every so often (or monitoring imbalance) keeps the cluster healthy and performing optimally. strimzi.io
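As a complement to the manual reassignment workflow described earlier, a minimal Java AdminClient sketch of moving one partition; the topic, partition, and target broker IDs are illustrative, and in practice a tool like Cruise Control computes the target assignment.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ManualReassignmentSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Move partition 0 of the (hypothetical) topic "user-events"
            // so its replicas live on brokers 2, 3, and 4 (newly added broker 4 included).
            TopicPartition tp = new TopicPartition("user-events", 0);
            NewPartitionReassignment target = new NewPartitionReassignment(List.of(2, 3, 4));

            admin.alterPartitionReassignments(Map.of(tp, Optional.of(target))).all().get();

            // Poll until the reassignment no longer appears in the ongoing set.
            while (!admin.listPartitionReassignments().reassignments().get().isEmpty()) {
                Thread.sleep(5_000);
            }
        }
    }
}
```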
Tuning for High Throughput and Low Latency
Beyond just adding partitions or brokers, achieving high throughput and low latency in Kafka requires tuning a variety of configuration parameters on brokers, producers, and consumers. Here are key tuning recommendations and examples for a scaled Kafka deployment:
- Broker Configuration: Tuning the broker ensures it can handle heavy I/O and many client connections:
  - I/O and Network Threads: Increase num.io.threads and num.network.threads on the broker to match the hardware capabilities. These control how many threads handle disk I/O and network socket processing, respectively. For example, on a machine with 16 cores, you might use 8 network threads and 8 I/O threads to allow concurrency in reads/writes. This prevents the broker from becoming single-threaded on disk or network operations when many partitions are active. (community.ibm.com)
  - Socket Buffers: Kafka by default may use relatively small TCP socket buffers. Increasing socket.send.buffer.bytes and socket.receive.buffer.bytes (for broker sockets) to a larger value (e.g., 1 MB = 1,048,576 bytes) can improve throughput, especially on high-latency networks or when transferring large batches. Larger buffers reduce the chance of socket send/receive stalls. Statsig notes that setting these to around 1 MB gave a significant throughput boost in their use case. Monitor the system memory impact, though – each connection uses some buffer. (statsig.com)
  - Log Segment Configs: Adjust log.segment.bytes – the size of each log file segment. Larger segments (e.g., 1 GB) mean fewer files and less frequent rollovers, which is good for throughput (less housekeeping), but too large can make recovery and compaction slower. For high throughput, 512 MB to 1 GB segments are common. Similarly, log.segment.ms (time-based rolling) can be tuned; many leave it at the default (weekly) or manage rolling by size. Ensure log.retention.bytes or log.retention.hours are set according to your storage limits and requirements so old data is purged promptly (not directly a performance setting, but a full disk can degrade performance).
  - Compression: Brokers simply persist what producers send (unless you set a broker- or topic-level compression codec, which is less common). Setting compression.type at the broker level mainly affects how the broker stores data; for general throughput, it’s the producers that decide whether to compress messages. Separately, the broker setting unclean.leader.election.enable, if set to false (recommended for safety), can impact availability (it won’t allow an out-of-sync replica to become leader, at the cost of waiting). We usually keep it false for critical topics to avoid data loss, even if that means a bit more downtime on failures.
  - Page Cache and Flush: Kafka relies on the OS to flush data to disk. Settings like flush.messages or flush.ms are usually left at their defaults (Kafka does not flush on every message; it lets the OS flush periodically). Changing these to flush more often can dramatically hurt throughput, so it’s rarely done. Better to trust the OS and rely on replication for durability (acks=all covers that).
  - Replication tuning: If replication traffic is saturating the network, you can tune num.replica.fetchers (the number of threads pulling data from leaders to replicate). More fetchers can improve replication throughput at the cost of some extra CPU. If using SSL for inter-broker traffic, ensure those connections are optimized (or consider plaintext within a secure network for a small performance gain).
  - Examples: Below is a snippet of a Kafka broker configuration tuned for throughput on a bare-metal deployment (as an illustration):

```
# High-level broker tuning
num.network.threads=8
num.io.threads=16
# 1 MB socket buffers
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
# allow ~1 MB messages (tune as needed)
message.max.bytes=1000000
# replica fetch size (match message.max.bytes or bigger)
replica.fetch.max.bytes=1048576
# parallel replication fetch threads
num.replica.fetchers=4
# producers compress; the broker keeps the producer's codec as-is
compression.type=producer
# 1 GB segments
log.segment.bytes=1073741824
# one week retention
log.retention.hours=168
# check for log deletion every 5 minutes
log.retention.check.interval.ms=300000
```

  In this example, network and I/O threads are increased, socket buffers are 1 MB, the segment size is 1 GB, and other values are tuned to handle large messages and sustained load. These settings go in the broker’s server.properties (for a bare-metal/VM Kafka). In Kubernetes (Strimzi), they can be set under the Kafka.spec.kafka.config section of the custom resource.
- Producer Configuration: Tuning producers can greatly increase throughput (at some cost to latency); the settings below are pulled together in a configuration sketch after this list. - Batching: Increase batch.size (the maximum number of bytes the producer tries to send in one batch per partition) and set a slight linger with linger.ms. By default the producer sends messages as soon as possible (linger.ms=0). Setting linger.ms=5 or linger.ms=20 means the producer will wait up to that long to accumulate a batch before sending, which coalesces more messages into one request. A larger batch.size (e.g., 64 KB or 128 KB instead of the default 16 KB) also allows more batching. This dramatically improves throughput by reducing per-message overhead and network calls, at the cost of a slight increase in latency (messages may sit a few milliseconds before being sent). A linger of 5-20 ms is often a good trade-off; if ultra-low latency isn't required, this is one of the first tuning knobs to turn for producers (docs.aws.amazon.com).
- Compression: Enable compression on producers (compression.type set to lz4 or snappy, typically). Compressed batches mean fewer bytes sent over the network and written to disk, increasing throughput at the cost of some CPU to compress. LZ4 is a popular choice because it is very fast with decent compression; Snappy is also common. GZIP gives higher compression but is usually too slow for real-time streams, so it is less common. If your messages are small and repetitive, compression can yield huge throughput wins by collapsing redundant data.
- Acknowledgments and Retries: For maximal throughput, some use acks=1 (leader only), which avoids the latency of waiting for replicas but sacrifices durability. In high-throughput yet critical systems, it is recommended to use acks=all along with min.insync.replicas on the broker to ensure at least two replicas have the data. You can still get good throughput with acks=all if batching is used, because the replication of many messages can be acknowledged together. Set retries to a high number (MAX_INT is the default in newer clients) so that transient errors don't drop messages, and consider delivery.timeout.ms to bound how long the producer retries before giving up.
- Idempotence: Enable enable.idempotence=true (it is the default in recent Kafka clients). This ensures retried sends aren't duplicated in the log. It has minimal overhead and is generally recommended; it enforces acks=all internally and limits in-flight requests per connection (at most 5) while still preserving ordering.
- Throughput vs Latency trade-off: If your goal is absolute lowest latency (few ms end-to-end), you’d use small batches and linger=0. But for high throughput (millions of messages per second cluster-wide), you will likely accept a bit higher end-to-end latency to leverage batching and compression. Find the sweet spot by testing – e.g., maybe 10 ms linger yields 5x throughput with only 5 ms extra latency, which is worth it.
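Pulling the producer knobs above together, here is a minimal sketch (assuming a broker at localhost:9092 and a hypothetical topic named events; the values are illustrative starting points, not prescriptions):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Batching: wait up to 10 ms to fill batches of up to 64 KB per partition.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);

        // Compression: fewer bytes on the wire and on disk, at some CPU cost.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        // Durability: wait for all in-sync replicas; idempotence prevents duplicates on retry.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // bound total retry time

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical topic and key; a keyed send keeps per-key ordering on one partition.
            producer.send(new ProducerRecord<>("events", "key-1", "example-payload"));
            producer.flush();
        }
    }
}
```

Start from values like these and benchmark: raising linger.ms and batch.size trades a few milliseconds of latency for substantially fewer, larger requests.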
 
- Consumer Configuration: Tuning consumers helps them keep up without overwhelming themselves: - Fetch Size and Wait: By default, a fetch returns as soon as any data is available (fetch.min.bytes=1), so consumers may receive relatively small amounts of data per request. Increasing fetch.min.bytes lets the broker wait until it has at least that many bytes to return (or until fetch.max.wait.ms expires). For throughput, you might set fetch.min.bytes to, say, 64 KB or 100 KB so that consumers get data in larger chunks rather than dribs and drabs. fetch.max.wait.ms (default 500 ms) caps how long the broker waits for that minimum to accumulate; a larger value lets more data build up when the incoming rate is low, while a smaller value (e.g., 100 ms) bounds the added latency. Be cautious: a high fetch.min.bytes adds latency when volume is low, but with steady high volume it improves efficiency.
- Max Records: max.poll.records controls how many records the consumer returns from one poll() call. If your consumer application can process records in batches effectively, a higher max.poll.records (say 1000 or 2000 instead of the default 500) can reduce the overhead of polling and increase throughput. The trade-off is that the consumer has to process that many messages before calling poll() again, which might need a longer max.poll.interval.ms. Ensure your consumer can handle the batch within the max poll interval, or the consumer could be considered dead by the group. If you increase max.poll.records, also consider increasing max.poll.interval.ms proportionally if needed (the default of 5 minutes is usually fine for reasonable batch sizes).
- Parallelism: Unlike the producer, a single consumer instance is single-threaded for message consumption by default (on the poll loop). If you want to parallelize within a consumer, you’d have to hand off messages to worker threads. Alternatively, just run more consumer instances (which is usually simpler due to Kafka’s partition parallelism). So scaling throughput for consumption is typically about adding consumers (up to partition count) and possibly optimizing each consumer’s processing logic (e.g., using vectorized operations or non-blocking I/O if writing to a database, etc.).
- Flow Control: If a consumer is slower than the producer, messages buffer in the broker (consumer lag). Kafka can tolerate quite a bit of buffering, but you should size consumer processing so that steady state keeps up with the produce rate. If necessary, add backpressure outside Kafka – for example, reactive frameworks or an external mechanism that slows producers when consumers lag (Kafka on its own doesn't tell producers to slow down, aside from eventually running out of disk space).
- Latency vs throughput: Similar to producers, if you need low latency (processing each message ASAP), you might favor lower fetch.min.bytes so you get data immediately. But if you want throughput and can tolerate a bit of buffering, set fetch.min.bytes > 1 and even a small fetch wait. If a consumer is reading near real-time (tailing the log), typically it will get data quickly anyway because new data is coming in. The settings matter more for catch-up reads or bursty traffic.
- Example Consumer Settings: For a high-throughput consumer in a group:
```java
props.put("fetch.min.bytes", 65536);        // 64 KB minimum per fetch
props.put("fetch.max.wait.ms", 100);        // wait up to 100 ms for that minimum to accumulate
props.put("max.poll.records", 500);         // process up to 500 records per poll
props.put("enable.auto.commit", false);     // commit manually after processing each batch
props.put("receive.buffer.bytes", 1048576); // 1 MB socket receive buffer
```

Ensure max.poll.interval.ms is large enough if processing 500 records takes time. By processing in batches of 500, you amortize the poll overhead and can insert/process records in bulk on the consumer side (for example, writing to a database in batches). This can substantially boost throughput compared to one-by-one processing; a sketch of such a poll loop follows below.

End-to-end latency considerations: If you have stringent latency requirements, tuning must be balanced. If 99th-percentile latency is critical, you might not use a producer linger or large fetch waits on the consumer; you would rely on raw speed and concurrency (more partitions/consumers) rather than buffering. Also note that fsyncing every message (which Kafka does not do by default) would kill throughput; Kafka's durability comes from replication rather than fsyncing each write (the OS flushes periodically). Transactional produce/consume (exactly-once semantics) adds some overhead and latency as well, since it involves extra coordination.
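A minimal sketch of such a batched poll-and-commit loop, assuming the settings above, a hypothetical topic named events, and a placeholder processBatch() step standing in for your bulk write:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class BatchedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("group.id", "throughput-demo");         // hypothetical group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("fetch.min.bytes", 65536);
        props.put("fetch.max.wait.ms", 100);
        props.put("max.poll.records", 500);
        props.put("enable.auto.commit", false);
        props.put("receive.buffer.bytes", 1048576);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // hypothetical topic
            while (true) {
                // Returns up to max.poll.records records per call.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) {
                    continue;
                }
                List<String> batch = new ArrayList<>();
                for (ConsumerRecord<String, String> record : records) {
                    batch.add(record.value());
                }
                processBatch(batch);   // e.g., one bulk insert instead of 500 single writes
                consumer.commitSync(); // commit offsets only after the batch is processed
            }
        }
    }

    // Placeholder for application-specific bulk processing.
    private static void processBatch(List<String> batch) {
        System.out.println("processed " + batch.size() + " records");
    }
}
```

Committing only after the batch is processed means a crash re-delivers the uncommitted batch rather than losing it, which is the usual at-least-once pattern with manual commits.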
- Operating System and JVM Tuning: At the OS level, ensure a proper file system choice (ext4 or XFS are standard for Kafka; both work well). Mount options like noatime can reduce overhead. Set vm.swappiness to a low value on Linux so the OS doesn't swap out Kafka's memory (swapping Kafka is very bad for latency). The JVM GC should be monitored – using G1GC with tuned pause targets (e.g., 20 ms) is common. If the heap is modest (e.g., 8 GB), G1 typically works well out of the box; if you push the heap larger, consider tuning region size or pause-time goals. Some users experiment with the newer ZGC for ultra-low pauses or with IBM's J9 JVM – these are advanced steps if GC becomes a pain point. Also, keep the JVM version updated for performance improvements.
 
In essence, tuning Kafka is about maximizing sequential I/O and network utilization while minimizing per-message overhead:
- Batch as much as possible (both in producers and consumers).
- Compress to trade CPU for fewer bytes.
- Give the broker ample threads and socket buffer to handle concurrent I/O.
- Use efficient data formats and avoid overly large messages when smaller ones would suffice (or vice versa – sometimes coalescing tiny messages into a bigger one can improve throughput).

Finally, monitor after tuning – sometimes a setting can have side effects, so observe metrics like request latency, GC times, etc., to ensure the tuning is helping. Kafka's built-in metrics and tools like JMX exporters plus Grafana dashboards are very useful here.