
Kubernetes Networking: What Happens Between the Load Balancer and Your Pod?

A backend engineer's guide to K8s networking. Learn about Services, ClusterIP, NodePort, Ingress Controllers, and the Container Network Interface (CNI).


Key Takeaways

  • **ClusterIP:** A stable internal IP that load balances traffic across a set of pods. It is only accessible within the cluster.
  • **NodePort:** Exposes the Service on the same static port on every Node's IP.
  • **The Ingress Controller:** A pod (like Nginx or Envoy) that actually implements the Ingress routing rules.
Recommended Prerequisites
Kubernetes Production Best Practices


For many backend engineers, their operational mental model of a web request stops once it hits the external cloud Load Balancer. In Kubernetes (K8s), that is where the most complex, dynamically programmed network orchestration actually begins.

When a client makes an HTTP call to your API, the packet traverses public routing tables, is received by a cloud load balancer, is routed to a worker node, and passes through multiple internal networking abstractions (Ingress Controllers, kube-proxy iptables or IPVS rules, CNI interfaces, and sidecar proxies) before finally reaching your application container's socket.

Understanding the network hop mechanics between the Ingress and your code is not just devops plumbing; it is a critical skill for debugging transient latencies, CoreDNS timeouts, connection starvation, and configuring a high-performance distributed microservices platform.


System Requirements and Goals

Before we trace the physical path of a packet, let's establish the design goals and operational constraints of a production-grade Kubernetes networking architecture.

1. Functional Networking Goals

  • Stable Internal Service Discovery: Ephemeral container pods die, restart, and reschedule continuously, gaining new random IP addresses. The networking system must provide stable Virtual IPs (VIPs) and DNS names that map to dynamically changing pod targets.
  • North-South Edge Traffic Ingestion: Efficiently route public client requests entering the cluster (North-South) to the correct target container replicas, handling TLS termination, path-based routing, and request transformations.
  • East-West Zero-Trust Microsegmentation: Enable secure, isolated pod-to-pod communications (East-West) while preventing unauthorized lateral movements using strict network firewalls.
  • Dynamic Config & Topology Routing: Intelligently direct packets to endpoints in the local zone whenever possible to avoid cross-Availability-Zone latency and data-transfer costs.

2. Non-Functional Capacity Benchmarks

  • Sub-Millisecond Routing Latency: Internal cluster routing rules (DNAT/SNAT packet rewrites) must execute in microseconds, minimizing tail latencies (P99).
  • High Scale & Throughput: Gracefully manage thousands of concurrent pods and millions of active connection tables without saturating host kernel limits.
  • Non-Blocking Fault Isolation: Networking failures or outages in CoreDNS or ingress controllers must remain isolated, preventing cascading collapses across unaffected namespaces.

API Design and Interface Contracts

In Kubernetes, networking behaviors are declared using YAML-based API contracts. Below are the production-grade manifests that establish our ingress gateways, stable service routing, and zero-trust firewall configurations.

1. Ingress & Service Interface Declarations (ingress-service.yaml)

This manifest establishes our external NGINX-backed Ingress router (note the nginx ingress class and annotations) and couples it to a stable internal payment-service running in a ClusterIP configuration.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dynamic-api-gateway
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "15"
spec:
  ingressClassName: nginx
  rules:
  - host: api.codesprintpro.com
    http:
      paths:
      - path: /v1/payments
        pathType: Prefix
        backend:
          service:
            name: payment-service
            port:
              number: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: production
spec:
  type: ClusterIP
  selector:
    app: payment-processor
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
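
The `pathType: Prefix` rule above matches on whole path elements rather than raw string prefixes. A minimal sketch of that documented matching behavior follows; the helper name `prefix_matches` is hypothetical, and a real ingress controller may apply additional normalization that is not modeled here.

```python
def prefix_matches(rule_path: str, request_path: str) -> bool:
    """Element-wise prefix match, as described for pathType: Prefix.

    Paths are split on "/" and compared element by element, so
    "/v1/payments" matches "/v1/payments/123" but not "/v1/paymentsX".
    """
    rule_parts = [p for p in rule_path.split("/") if p]
    request_parts = [p for p in request_path.split("/") if p]
    return request_parts[: len(rule_parts)] == rule_parts


if __name__ == "__main__":
    for path in ("/v1/payments", "/v1/payments/123", "/v1/paymentsX", "/v1"):
        print(f"{path:20} -> {prefix_matches('/v1/payments', path)}")
```

The practical consequence is that a sibling route such as /v1/paymentsX would not be captured by this rule and needs its own path entry.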

2. Zero-Trust East-West Firewalls (network-policy.yaml)

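By default, K8s pods are non-isolated: with no NetworkPolicy selecting them, any pod can talk to any pod. We enforce least-privilege zero-trust access: only pods labeled app: api-gateway in the production namespace are permitted to reach the payment-service pods on TCP port 8080. Note that enforcement depends on a CNI plugin that implements NetworkPolicy.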

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-payment-access
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-processor
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8080
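
To make the policy's effect concrete, here is a small sketch that evaluates the single ingress rule above against a few hypothetical callers. The class and function names are illustrative assumptions, and the model covers only what this manifest uses: one `podSelector` with `matchLabels` plus one port. A bare `podSelector` in a `from` clause selects pods in the policy's own namespace, which the sketch models explicitly.

```python
from dataclasses import dataclass


@dataclass
class IngressRule:
    from_pod_labels: dict  # models podSelector.matchLabels
    port: int
    protocol: str = "TCP"


POLICY_NAMESPACE = "production"
RULES = [IngressRule(from_pod_labels={"app": "api-gateway"}, port=8080)]


def labels_match(selector: dict, labels: dict) -> bool:
    """True when every selector key/value is present on the pod."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(src_namespace: str, src_labels: dict, port: int,
                    protocol: str = "TCP") -> bool:
    """Allow only if some rule matches namespace, labels, port, and protocol."""
    for rule in RULES:
        if (src_namespace == POLICY_NAMESPACE
                and labels_match(rule.from_pod_labels, src_labels)
                and rule.port == port
                and rule.protocol == protocol):
            return True
    return False


if __name__ == "__main__":
    print(ingress_allowed("production", {"app": "api-gateway"}, 8080))
    print(ingress_allowed("production", {"app": "reporting"}, 8080))
    print(ingress_allowed("ingress-nginx", {"app": "api-gateway"}, 8080))
```

One thing the sketch highlights: a caller running in a different namespace (for example, an ingress controller deployed outside production) would be denied by this policy unless a namespaceSelector is added to the rule.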

High-Level Design Architecture

Kubernetes networking divides traffic routing into North-South (external traffic entering the cluster) and East-West (internal service-to-service communication).

1. The North-South Packet Path: Load Balancer to Pod

When a client hits your web domain, the packet traverses through a highly optimized physical and virtual network graph before hitting your application code.

graph TD
    %% Public Traffic Path
    Client[Public Browser Client] -->|1. HTTPS Request| CloudLB[Cloud Load Balancer: NLB/ALB]
    
    subgraph "Kubernetes Worker Node Node A"
        CloudLB -->|2. NodePort / TargetGroup| IngressController[Ingress Pod: Nginx/Envoy]
        
        %% Service Virtual IP translation
        IngressController -->|3. Route to payment-service| ServiceVIP[ClusterIP VIP: 10.96.0.45]
        
        %% Kernel Table Gating
        ServiceVIP -->|4. kube-proxy IPTables DNAT| KernelTable[Host Kernel: iptables/IPVS]
        
        %% Physical Pod Selection
        KernelTable -->|5. Forward IP to Pod| TargetPod[Payment Pod A: 192.168.1.12]
    end

    subgraph "Kubernetes Worker Node Node B"
        KernelTable -.->|Alternative Route| TargetPodB[Payment Pod B: 192.168.2.14]
    end

    %% Colors
    style CloudLB fill:#1e1b4b,stroke:#4f46e5,stroke-width:2px,color:#fff
    style IngressController fill:#0f172a,stroke:#3b82f6,stroke-width:2px,color:#fff
    style TargetPod fill:#111827,stroke:#10b981,stroke-width:2px,color:#fff

2. East-West Packet Plumbings: IPTables vs. eBPF Routing

Standard Kubernetes clusters rely on kube-proxy programmed with iptables rules to handle Service VIP translations. When Pod A wants to talk to Pod B through a Service, the host kernel intercepts the packet and evaluates iptables rules sequentially.

Modern Container Network Interface (CNI) plugins like Cilium leverage eBPF (extended Berkeley Packet Filter). eBPF programs attach to hooks inside the Linux kernel, including the socket layer, so Service translation can happen without walking the iptables chains at all. This keeps the per-packet cost low even as the number of Services grows.

graph LR
    subgraph "Standard kube-proxy (IPTables)"
        PodA[Pod A] -->|1. TCP SYN| KubeProxy[Netfilter Hook: rules programmed by kube-proxy]
        KubeProxy -->|2. Sequential Scan| IPTablesTable[Sequential IPTables Rules]
        IPTablesTable -->|3. DNAT Rewrite| PodB[Pod B]
    end

    subgraph "Modern eBPF (Cilium CNI)"
        PodC[Pod C] -->|1. Kernel Sock Hook| eBPFProgram[eBPF Kernel Program]
        eBPFProgram -->|2. BPF Map Lookup| PodD[Pod D]
    end

    style IPTablesTable fill:#991b1b,stroke:#f87171,stroke-width:2px,color:#fff
    style eBPFProgram fill:#1e3a8a,stroke:#3b82f6,stroke-width:2px,color:#fff
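
The difference between the two datapaths can be illustrated with a toy model: a sequential rule list versus a hash-map lookup keyed by Service IP. This is only a sketch of the lookup-cost argument, not of either implementation; the rule names and IP layout are made up for illustration.

```python
def build_rules(n: int) -> list:
    """Hypothetical rule list: (service_ip, target_chain) pairs."""
    return [(f"10.96.{i // 256}.{i % 256}", f"KUBE-SVC-{i}") for i in range(n)]


def sequential_lookup(rules: list, dst_ip: str):
    """iptables-style scan: compare against each rule until one matches."""
    comparisons = 0
    for ip, chain in rules:
        comparisons += 1
        if ip == dst_ip:
            return chain, comparisons
    return None, comparisons


def hash_lookup(table: dict, dst_ip: str):
    """Map-style lookup: a single keyed access regardless of table size."""
    return table.get(dst_ip)


if __name__ == "__main__":
    rules = build_rules(5000)
    table = dict(rules)
    last_ip = rules[-1][0]
    chain, comparisons = sequential_lookup(rules, last_ip)
    print(f"sequential: {chain} after {comparisons} comparisons")
    print(f"hash map:   {hash_lookup(table, last_ip)} after 1 keyed lookup")
```

In the worst case the sequential scan touches every rule, so its cost grows with the number of Services, while the keyed lookup stays roughly constant.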

Low-Level Design & Component Mechanics

To understand exactly how Virtual IPs are materialized on a node, we trace the Linux kernel socket mechanics.

1. The ClusterIP Illusion & kube-proxy IPTables Mechanics

A Kubernetes Service ClusterIP (e.g., 10.96.0.45) is not associated with any physical network interface. It is a completely virtual IP, programmed solely into the host kernel's iptables rules.

When a container opens a connection to a Service IP:

  1. The packet enters the host node's network namespace.
  2. The Linux kernel Netfilter hook intercepts the packet in the PREROUTING chain.
  3. Netfilter evaluates the KUBE-SERVICES chain:
    -A KUBE-SERVICES -d 10.96.0.45/32 -p tcp -m comment --comment "production/payment-service" -j KUBE-SVC-PAYMENT
    
  4. It hops into the KUBE-SVC-PAYMENT chain, which selects a target backend pod using a random probability allocation:
    -A KUBE-SVC-PAYMENT -m statistic --mode random --probability 0.5000000000 -j KUBE-SEP-POD-A
    -A KUBE-SVC-PAYMENT -j KUBE-SEP-POD-B
    
  5. Netfilter executes Destination NAT (DNAT), rewriting the destination IP from the Service VIP 10.96.0.45 to the actual physical Pod IP 192.168.1.12, routing it down to the container network namespace via the veth pair.
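
The probability values in the KUBE-SVC chain are chosen so that every backend ends up with an equal share, even though rules are evaluated in order. A short sketch of that arithmetic, assuming n endpoints where rule i carries probability 1/(n - i) and the final rule is unconditional (as in the two-pod example above); the function names are illustrative.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list:
    """Per-rule match probability for n endpoints evaluated in order."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n: int) -> list:
    """Overall fraction of new connections each endpoint receives."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule matched
    for p in rule_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


if __name__ == "__main__":
    for n in (2, 3, 4):
        probs = [str(p) for p in rule_probabilities(n)]
        shares = [str(s) for s in effective_share(n)]
        print(f"n={n} rule probabilities={probs} effective shares={shares}")
```

With three pods the rules would carry 1/3, 1/2, and 1, which still gives each pod one third of new connections.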

2. Multi-Zone Topology-Aware Routing

To avoid cross-Availability-Zone egress costs and high tail latencies, we configure our Services to prioritize same-zone routing using Topology-Aware Hints.

apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: production
  annotations:
    service.kubernetes.io/topology-aware-hints: "auto"
spec:
  type: ClusterIP
  selector:
    app: order-processor
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080

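When this annotation is configured, the control plane adds zone hints to the Service's EndpointSlices, and kube-proxy filters the endpoint list so that traffic originating in a zone (e.g. us-east-1a) is directed to pods scheduled within that same Availability Zone, avoiding cross-zone hops. This is a preference rather than a hard guarantee: if hints cannot be assigned safely (for example, with too few endpoints per zone), kube-proxy falls back to cluster-wide routing. In Kubernetes 1.27 and later, the same behavior is requested with the service.kubernetes.io/topology-mode annotation.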
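
The zone filtering can be pictured as a simple selection step. The sketch below is a deliberate simplification: it filters by the endpoint's own zone and falls back to all endpoints when the local zone has none, whereas the real mechanism relies on hints computed by the control plane. The function name and data shape are assumptions for illustration.

```python
def endpoints_for_node(endpoints: list, node_zone: str) -> list:
    """Prefer endpoints in the node's zone; otherwise use every endpoint.

    endpoints: list of (pod_ip, zone) tuples (hypothetical shape).
    """
    local = [ip for ip, zone in endpoints if zone == node_zone]
    if local:
        return local
    return [ip for ip, _ in endpoints]


if __name__ == "__main__":
    endpoints = [
        ("192.168.1.12", "us-east-1a"),
        ("192.168.1.13", "us-east-1a"),
        ("192.168.2.14", "us-east-1b"),
    ]
    print(endpoints_for_node(endpoints, "us-east-1a"))
    print(endpoints_for_node(endpoints, "us-east-1c"))
```

The fallback branch matters operationally: a zone with no healthy replicas should still be able to reach the Service, at the price of a cross-zone hop.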


Scaling Challenges & Production Bottlenecks

Centralizing thousands of ephemeral microservice pods under a high-throughput workload inevitably hits physical Linux kernel networking boundaries:

1. Connection Tracking (conntrack) Table Exhaustion

Every time a Netfilter rule executes Destination NAT (DNAT) to route a Service request to a Pod, the Linux kernel creates an entry in its local conntrack (connection tracking) table. This state table records the original socket IP and the rewritten destination IP to ensure return packets are correctly mapped back.

The Bottleneck: Under spiky, high-throughput traffic with many short-lived connections, the conntrack table can fill up quickly. Once the entry count reaches the configured kernel limit (nf_conntrack_max), the kernel drops packets that would create new entries (logged as "nf_conntrack: table full, dropping packet"), resulting in connection timeouts and sudden 504 Gateway Timeout errors on your ingress gateways.

Mitigation:

  • Tune kernel boundaries on worker node instances:
    sysctl -w net.netfilter.nf_conntrack_max=1048576
    
  • Adopt eBPF-based CNI plugins (such as Cilium) that completely bypass the conntrack netfilter layer, replacing it with high-speed BPF hash maps.
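
Before raising nf_conntrack_max, it helps to estimate how many entries a node actually needs. A rough sketch using Little's law (entries in the table ≈ arrival rate × average time an entry lives) follows. The traffic rate and the 120-second entry lifetime are illustrative assumptions, not measurements; the live count on a node can be read from /proc/sys/net/netfilter/nf_conntrack_count.

```python
def estimated_entries(new_conns_per_sec: float, avg_entry_lifetime_sec: float) -> int:
    """Steady-state table size: arrival rate multiplied by entry lifetime."""
    return int(new_conns_per_sec * avg_entry_lifetime_sec)


def utilization(entries: int, nf_conntrack_max: int) -> float:
    """Fraction of the table in use; values >= 1.0 mean new flows are dropped."""
    return entries / nf_conntrack_max


if __name__ == "__main__":
    # Assumed workload: 5,000 new connections/s, entries lingering ~120 s.
    entries = estimated_entries(5_000, 120)
    for limit in (262_144, 1_048_576):
        print(f"limit={limit:>9} entries={entries} "
              f"utilization={utilization(entries, limit):.0%}")
```

Under these assumptions a limit of 262,144 would be exceeded, while 1,048,576 leaves headroom. Shortening entry lifetimes (for example, through connection reuse) reduces the required table size just as effectively as raising the limit.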

2. CoreDNS Query Latency Spikes

Kubernetes schedules a central CoreDNS service to handle internal hostname resolution (e.g., resolving payment-service.production.svc.cluster.local to its ClusterIP).

The Bottleneck: By default, a pod's /etc/resolv.conf carries a list of cluster search domains and the option ndots:5. Any name with fewer than five dots is tried against each search domain before being queried as-is. When a microservice attempts to resolve an external API address (e.g., api.stripe.com), it sequentially queries:

  1. api.stripe.com.production.svc.cluster.local (fails)
  2. api.stripe.com.svc.cluster.local (fails)
  3. api.stripe.com.cluster.local (fails)
  4. api.stripe.com (finally succeeds)

This default behavior amplifies a single DNS resolution into 4 separate lookups (and resolvers that request both A and AAAA records double that), loading CoreDNS and spiking P99 latency.

Mitigation:

  • Integrate NodeLocal DNSCache on every worker node. This schedules a lightweight local DNS caching agent on every node, capturing DNS queries locally via loopback interfaces and neutralizing CoreDNS saturation.
  • Configure the application's dnsConfig options to reduce search paths:
    dnsConfig:
      options:
      - name: ndots
        value: "2"
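
The effect of ndots can be reproduced with a few lines that mimic the resolver's ordering rule: names with fewer dots than ndots try the search domains first, otherwise the name is tried as-is first. This is a sketch of the ordering only (the function name is hypothetical, and it does not perform any network queries); the search list matches the example for a pod in the production namespace.

```python
SEARCH = [
    "production.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]


def candidate_queries(name: str, search: list, ndots: int) -> list:
    """Return the names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # fully qualified: no search expansion
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # try the name as-is first
    return expanded + [name]  # walk the search list first


if __name__ == "__main__":
    for ndots in (5, 2):
        print(f"ndots={ndots}")
        for query in candidate_queries("api.stripe.com", SEARCH, ndots):
            print(f"  {query}")
```

With ndots set to 5, api.stripe.com is the last of four candidates; with ndots set to 2 it is the first, so resolution completes after a single lookup. Writing the name with a trailing dot (api.stripe.com.) skips the search list regardless of ndots.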
    

Technical Trade-offs & Strategic Compromises

Managing cluster routing patterns requires prioritizing either low CPU overhead, strong isolation, or deployment flexibility.

| CNI Networking Model | Routing Latency | CPU Resource Cost | Multi-Tenant Isolation | Deployment Complexity |
| --- | --- | --- | --- | --- |
| Overlay (VXLAN / Geneve) | Medium (Packet encapsulation) | Medium (CPU encapsulation overhead) | High (Virtual isolated tunnels) | Low (Default setup) |
| Direct Routing (BGP / Calico) | Low (Native MTU speed) | Low | Medium | High (Requires router coordination) |
| Kernel Bypass (eBPF / Cilium) | Ultra-Low (<10µs overhead) | Ultra-Low | High (Strict security filters) | High (Requires modern kernel versions) |

Overlay vs. Direct Routing (BGP)

If you deploy an overlay VXLAN network, every packet sent between pods on different nodes is wrapped (encapsulated) in a standard UDP envelope. This adds 50 bytes of header overhead per packet (on an IPv4 underlay) and consumes CPU cycles for encapsulation and decapsulation.

For high-volume database workloads or sub-millisecond payment ingestion, overlays are an inefficient compromise. We opt for Direct Routing BGP or eBPF-based host routing to eliminate overlay packet wrapping, preserving maximum hardware throughput.
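
The 50-byte figure for VXLAN breaks down into four headers, and it directly reduces the MTU available to pods. A small worked calculation, assuming an IPv4 underlay with no additional outer headers (such as a VLAN tag); the underlay MTU values are examples only.

```python
# Header sizes in bytes for VXLAN over an IPv4 underlay.
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14

VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET


def pod_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that fits without fragmentation."""
    return underlay_mtu - VXLAN_OVERHEAD


def overhead_fraction(underlay_mtu: int) -> float:
    """Share of each full-size packet spent on encapsulation headers."""
    return VXLAN_OVERHEAD / underlay_mtu


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"underlay MTU {mtu}: pod MTU {pod_mtu(mtu)}, "
              f"overhead {overhead_fraction(mtu):.2%}")
```

On a standard 1500-byte underlay the pod-facing MTU drops to 1450 bytes. The relative cost is higher for small packets, which is one reason chatty, latency-sensitive workloads feel overlays more than bulk transfers do.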


Failure Scenarios and Fault Tolerance

Designing a resilient Kubernetes datapath means assuming your endpoints are unstable.

1. Long-Lived Keep-Alive Connection Pinning

HTTP/2 and gRPC rely on long-lived TCP connections to avoid the constant overhead of three-way handshakes.

The Failure Scenario: If you scale up your payment-service deployment from 2 to 20 pods during a traffic burst, you will notice that the 18 new pods remain completely idle while the original 2 pods continue to hit 100% CPU. Why? Because the existing API Gateway pods have persistent, long-lived TCP connections pinned to the original 2 pods. The ClusterIP iptables DNAT rules only evaluate during the initial connection handshake, not on every single HTTP/2 request.

Fault Tolerance Strategy:

  • Deploy a Layer-7 proxy (e.g., Envoy or Linkerd Service Mesh) between microservices. The Layer-7 proxy intercepts the long-lived TCP socket, parses the individual HTTP/2 streams, and load balances individual requests dynamically across all 20 replicas.
  • Set a strict maximum connection age (for example, gRPC's server-side MaxConnectionAge keepalive setting) and bounded lifetimes on HTTP client connection pools to periodically force connection recycling.
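
The pinning effect is easy to reproduce in a toy simulation. The sketch below compares per-connection balancing (requests ride connections that were established before the scale-up) with per-request balancing (what a Layer-7 proxy does). Pod names, request counts, and the round-robin choice are illustrative assumptions.

```python
from collections import Counter

PODS = [f"pod-{i}" for i in range(20)]  # after scaling from 2 to 20
PINNED_CONNECTIONS = ["pod-0", "pod-1"]  # opened before the scale-up


def per_connection(num_requests: int, connections: list) -> Counter:
    """Layer 4: each request reuses an existing connection to a fixed pod."""
    counts = Counter()
    for i in range(num_requests):
        counts[connections[i % len(connections)]] += 1
    return counts


def per_request(num_requests: int, pods: list) -> Counter:
    """Layer 7: each request is balanced independently (round-robin here)."""
    counts = Counter()
    for i in range(num_requests):
        counts[pods[i % len(pods)]] += 1
    return counts


if __name__ == "__main__":
    l4 = per_connection(10_000, PINNED_CONNECTIONS)
    l7 = per_request(10_000, PODS)
    print(f"L4: {len(l4)} pods busy, max load {max(l4.values())}")
    print(f"L7: {len(l7)} pods busy, max load {max(l7.values())}")
```

In the Layer-4 case only the two original pods receive traffic, at 5,000 requests each, while per-request balancing spreads the same load at 500 requests per pod. Connection recycling moves the Layer-4 case toward the second distribution over time, since each reconnect is a fresh DNAT decision.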

Staff Engineer Perspective


Verbal Script & Mock Interview

Mock Interview Dialogue

Interviewer: "Welcome! Let's explore how traffic flows in a Kubernetes environment. Walk me through the exact path a request takes from the moment it hits a public Cloud Load Balancer down to a containerized pod. What are the key bottlenecks at scale?"

Candidate: *"To detail the Kubernetes networking datapath, we must trace both the North-South edge ingestion path and the internal East-West routing layer.

First, the public client packet hits our cloud load balancer (a Layer-4 NLB or a Layer-7 ALB). An ALB terminates TLS at the edge, while an NLB can pass the TCP stream through for the Ingress Controller to terminate. Either way, traffic is forwarded to one of our worker nodes on a configured NodePort, or directly via IP routing to our Ingress Controller Pod, which we run as a high-performance Nginx/Envoy proxy fleet.

The Ingress Pod parses the request, matches the path (e.g., /v1/payments), and identifies the backend Service. The Service VIP (ClusterIP) is entirely virtual, programmed solely into each node's Linux kernel netfilter/iptables rules by kube-proxy.

As the packet exits the Ingress Pod, the host node's kernel Netfilter hook intercepts it during the PREROUTING phase. It sequentially scans the KUBE-SERVICES iptables chain, matches the Service destination IP, selects a target replica pod using a random probability rule, and performs Destination NAT (DNAT), rewriting the destination IP from the Service VIP to the actual Pod IP. The packet is then routed across the veth pair into the target container's network namespace."*

Interviewer: "Excellent. You mentioned that iptables uses random probability for load balancing. What bottlenecks occur when a cluster grows to thousands of services and pods?"

Candidate: *"At a scale of thousands of active endpoints, iptables becomes a massive CPU bottleneck. The reason is that iptables is designed as a sequential list of rules. To route a packet, the kernel must scan through this list sequentially ($O(N)$ lookup complexity). Every time a pod scales up, scales down, or is rescheduled, large parts of the rule set must be rewritten while holding the iptables lock.

To resolve this bottleneck, a Staff Engineer would migrate the cluster CNI to a modern eBPF-based datapath like Cilium. Cilium can replace kube-proxy and its iptables rules entirely. It runs eBPF programs inside the kernel, including at the socket layer. Instead of scanning sequential lists, Cilium uses high-speed BPF hash tables to execute direct $O(1)$ lookups and route packets straight to the container namespace, reducing CPU routing overhead by up to 80%."*

Interviewer: "That is a highly sophisticated mitigation. What about gRPC? If we use long-lived gRPC channels, how do you prevent load imbalance?"

Candidate: *"Right, because gRPC uses long-lived HTTP/2 streams over a single TCP connection, standard layer-4 ClusterIP routing rules fail. The connection NAT occurs only during the initial TCP handshake. Subsequent requests over that channel are pinned to a single pod, leading to severe load imbalance.

To solve this, we deploy a Service Mesh (Istio). Envoy proxies run as sidecars next to each pod. The sidecar intercepts the long-lived TCP socket, parses the individual HTTP/2 streams, and actively load balances individual gRPC calls across our backend pod pool. We also configure a strict maximum connection age of 5 minutes (gRPC's server-side MaxConnectionAge setting) to force periodic connection recycling and clean re-balancing."*

Interviewer: "Fantastic! That is an outstanding, complete answer. You clearly understand the deep operational realities of container networking."

