Understanding Network Failures on Android

The Black Box Problem

When a network request fails on Android, what do you actually know?

try {
    val response = apiService.getUser()
} catch (e: IOException) {
    Timber.e(e, "Network error")  // ...and that's it
}

You know it failed. You don't know where or why.

Was it DNS? TCP? TLS? The server? Your user's ISP? A flaky wifi connection?

Most Android apps treat the network as a black box. Request goes in, response (or error) comes out. Everything in between is invisible.


What Actually Happens During a Network Request

Before we can debug network issues, we need to understand what we're debugging. Here's the journey of a single HTTPS request:

graph TD
    A["1. DNS Lookup<br/>api.example.com → IP"] --> B["2. TCP Connect<br/>SYN → SYN-ACK → ACK"]
    B --> C["3. TLS Handshake<br/>Certificate + Keys"]
    C --> D["4. Request Send<br/>HTTP headers + body"]
    D --> E["5. Wait - TTFB<br/>Server processing"]
    E --> F["6. Response Receive<br/>Download body"]

Each phase can fail independently. Each has different failure modes and different timings.

The problem: Your try/catch block sees them all as one blob.


Why This Matters: A Real Debugging Story

Here's a common scenario:

Your crash reporting tool shows:

UnknownHostException: Unable to resolve host "api.example.com"

What does this tell you? Almost nothing useful.

UnknownHostException could mean:

  • DNS server is unreachable (network issue)
  • Domain doesn't exist (config bug)
  • DNS response was too slow (timeout)
  • User switched networks mid-request
  • ISP is having DNS issues
  • Corporate firewall blocking DNS

Without knowing when in the request lifecycle this happened and how long it took, you're guessing.


The Logging Interceptor Trap

This pattern exists in most Android codebases:

class LoggingInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val start = System.currentTimeMillis()
        val response = chain.proceed(chain.request())
        val duration = System.currentTimeMillis() - start

        Timber.d("${chain.request().url} took ${duration}ms")
        return response
    }
}

This tells you total time. It doesn't tell you:

  • How much was DNS vs TCP vs TLS vs server?
  • Did the connection get reused? (huge difference)
  • Were there retries hidden inside?
  • Which phase failed on errors?

You're measuring the outcome, not the journey.


OkHttp's Hidden API: EventListener

OkHttp provides EventListener, an abstract class whose callbacks fire at every phase of the request lifecycle. Most developers don't know it exists.

abstract class EventListener {
    open fun callStart(call: Call) {}

    open fun dnsStart(call: Call, domainName: String) {}
    open fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {}

    open fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {}
    open fun connectEnd(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy, protocol: Protocol?) {}

    open fun secureConnectStart(call: Call) {}
    open fun secureConnectEnd(call: Call, handshake: Handshake?) {}

    open fun requestHeadersStart(call: Call) {}
    open fun requestHeadersEnd(call: Call, request: Request) {}

    open fun responseHeadersStart(call: Call) {}
    open fun responseHeadersEnd(call: Call, response: Response) {}

    open fun callEnd(call: Call) {}
    open fun callFailed(call: Call, ioe: IOException) {}
}

With this, you can measure each phase independently.


Building a Metrics Listener

Here's a practical implementation:

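import okhttp3.Call
import okhttp3.EventListener
import okhttp3.Handshake
import okhttp3.Protocol
import java.io.IOException
import java.net.InetAddress
import java.net.InetSocketAddress
import java.net.Proxy
import java.util.concurrent.TimeUnit

class NetworkMetricsListener : EventListener() {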

    private var callStartNanos = 0L
    private var dnsStartNanos = 0L
    private var connectStartNanos = 0L
    private var secureConnectStartNanos = 0L
    private var requestStartNanos = 0L
    private var responseStartNanos = 0L

    override fun callStart(call: Call) {
        callStartNanos = System.nanoTime()
    }

    override fun dnsStart(call: Call, domainName: String) {
        dnsStartNanos = System.nanoTime()
    }

    override fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {
        val dnsMs = (System.nanoTime() - dnsStartNanos).nanosToMillis()
        // Log or record: DNS took $dnsMs ms
    }

    override fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {
        connectStartNanos = System.nanoTime()
    }

    override fun secureConnectStart(call: Call) {
        secureConnectStartNanos = System.nanoTime()
    }

    override fun secureConnectEnd(call: Call, handshake: Handshake?) {
        val tlsMs = (System.nanoTime() - secureConnectStartNanos).nanosToMillis()
        // Log or record: TLS handshake took $tlsMs ms
    }

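    override fun connectEnd(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy, protocol: Protocol?) {
        // connectEnd fires after the TLS handshake on HTTPS, so this span is TCP + TLS.
        // TCP alone is secureConnectStartNanos - connectStartNanos.
        val connectMs = (System.nanoTime() - connectStartNanos).nanosToMillis()
        // Log or record: connection setup (TCP + TLS) took $connectMs ms
    }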

    override fun requestHeadersStart(call: Call) {
        requestStartNanos = System.nanoTime()
    }

    override fun responseHeadersStart(call: Call) {
        responseStartNanos = System.nanoTime()
    }

    override fun callEnd(call: Call) {
        val totalMs = (System.nanoTime() - callStartNanos).nanosToMillis()
        // Now you have the full breakdown
    }

    override fun callFailed(call: Call, ioe: IOException) {
        // The latest phase that started is the one that was in progress
        // (or had just finished) when the call failed
        val failedAt = when {
            responseStartNanos > 0 -> "response"
            requestStartNanos > 0 -> "request"
            secureConnectStartNanos > 0 -> "tls"
            connectStartNanos > 0 -> "tcp"
            dnsStartNanos > 0 -> "dns"
            else -> "pre-dns"
        }
        // Log: Request failed during $failedAt phase
    }

    private fun Long.nanosToMillis() = TimeUnit.NANOSECONDS.toMillis(this)
}
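
The listener above logs each phase as it ends. If you would rather collect the raw timestamps and derive the breakdown once at the end, a minimal sketch could look like the following. It uses only the Kotlin/Java standard library; `PhaseTimestamps`, `PhaseBreakdown`, `spanMs`, and `toBreakdown` are hypothetical names, not OkHttp types. It assumes TCP ends where TLS begins, and it treats TTFB as the gap between `requestHeadersStart` and `responseHeadersStart`, which also includes the time spent uploading the request.

```kotlin
import java.util.concurrent.TimeUnit

/** Raw System.nanoTime() values captured by a listener; null means the event never fired. */
data class PhaseTimestamps(
    val callStart: Long,
    val dnsStart: Long? = null,
    val dnsEnd: Long? = null,
    val connectStart: Long? = null,
    val secureConnectStart: Long? = null,
    val secureConnectEnd: Long? = null,
    val connectEnd: Long? = null,
    val requestHeadersStart: Long? = null,
    val responseHeadersStart: Long? = null,
    val callEnd: Long? = null
)

/** Per-phase durations in milliseconds; null means the phase did not run or did not finish. */
data class PhaseBreakdown(
    val dnsMs: Long?,
    val tcpMs: Long?,
    val tlsMs: Long?,
    val ttfbMs: Long?,
    val totalMs: Long?
)

/** Duration between two timestamps, or null if either one is missing. */
fun spanMs(start: Long?, end: Long?): Long? =
    if (start == null || end == null) null else TimeUnit.NANOSECONDS.toMillis(end - start)

fun PhaseTimestamps.toBreakdown(): PhaseBreakdown = PhaseBreakdown(
    dnsMs = spanMs(dnsStart, dnsEnd),
    // Assumption: TCP ends where TLS begins; for plain HTTP it ends at connectEnd.
    tcpMs = spanMs(connectStart, secureConnectStart ?: connectEnd),
    tlsMs = spanMs(secureConnectStart, secureConnectEnd),
    // Approximation: includes request upload time as well as server processing.
    ttfbMs = spanMs(requestHeadersStart, responseHeadersStart),
    totalMs = spanMs(callStart, callEnd)
)
```

A call on a reused connection simply leaves the DNS, TCP, and TLS fields null, which makes reuse easy to spot downstream.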

Wiring It Up

val client = OkHttpClient.Builder()
    .eventListenerFactory { NetworkMetricsListener() }
    .build()

Note: Use eventListenerFactory (not eventListener) so each call gets its own listener instance.


What You Can Learn

Once you have phase-level metrics, you can answer questions like:

DNS Issues

  • Is DNS slow for certain carriers/regions?
  • Are users hitting DNS timeouts?
  • Would DNS caching help?

Connection Issues

  • Are TCP handshakes slow? (distance to server)
  • Are TLS handshakes slow? (certificate chain too long?)
  • Are connections being reused? (if connectStart never fires, the connection was reused)

Server Issues

  • Is TTFB (time to first byte) high? That's mostly server processing time, plus one network round trip.
  • Is response download slow? Could be payload size or bandwidth.

Connection Reuse: The Silent Optimization

One important insight from EventListener: connection reuse.

If dnsStart and connectStart never fire for a request, OkHttp reused an existing connection. This skips DNS + TCP + TLS entirely.

On a reused connection:

Request: DNS(0) + TCP(0) + TLS(0) + TTFB(50ms) = 50ms

On a fresh connection:

Request: DNS(30ms) + TCP(45ms) + TLS(120ms) + TTFB(50ms) = 245ms

If your metrics show lots of fresh connections, you might be:

  • Creating new OkHttpClient instances (don't do this)
  • Hitting connection pool limits
  • Having connections closed due to timeouts
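
To put a number on "lots of fresh connections", one option is to record per call whether `connectStart` fired and compute a reuse rate per host. The sketch below is stdlib-only and illustrative: `CallRecord` and `reuseRateByHost` are made-up names, and it assumes your listener already exports one record per finished call.

```kotlin
/** One finished call, reduced to the two facts needed here. Field names are illustrative. */
data class CallRecord(val host: String, val sawConnectStart: Boolean)

/** Fraction of calls per host (0.0 to 1.0) that ran on an already-open connection. */
fun reuseRateByHost(records: List<CallRecord>): Map<String, Double> =
    records.groupBy { it.host }
        .mapValues { (_, calls) ->
            // A call that never saw connectStart did not open a new connection.
            calls.count { !it.sawConnectStart }.toDouble() / calls.size
        }
```

A host whose rate stays near zero across a session is a good place to start looking for the causes listed above.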

Timeout Configuration

While we're here, let's clarify OkHttp's timeouts:

val client = OkHttpClient.Builder()
    .connectTimeout(15, TimeUnit.SECONDS)  // TCP socket connect; does not bound DNS
    .readTimeout(30, TimeUnit.SECONDS)     // Time between bytes (resets per chunk)
    .writeTimeout(30, TimeUnit.SECONDS)    // Time between bytes (resets per chunk)
    .callTimeout(60, TimeUnit.SECONDS)     // Total end-to-end timeout
    .build()

Important: readTimeout and writeTimeout reset on each chunk of data. A slow server trickling bytes can keep a request alive indefinitely.

callTimeout is the only one that represents total user-facing time.
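
To make the difference concrete, here is a small stdlib-only simulation (not OkHttp code) of a response that arrives in chunks. `firstTimeoutHit` is a hypothetical helper: it takes the silence before each chunk and reports which timeout would fire first under the semantics described above. A value of 0 means "disabled", mirroring OkHttp's convention for its timeouts.

```kotlin
/**
 * Simulates a response arriving in chunks. gapsMs[i] is the silence before chunk i.
 * Returns "readTimeout" or "callTimeout" for whichever fires first, or null if the
 * transfer completes. A timeout of 0 means disabled.
 */
fun firstTimeoutHit(gapsMs: List<Long>, readTimeoutMs: Long, callTimeoutMs: Long): String? {
    var elapsedMs = 0L
    val readLimit = if (readTimeoutMs > 0) readTimeoutMs else Long.MAX_VALUE
    for (gap in gapsMs) {
        // The call deadline counts from the start; the read limit restarts with every chunk.
        val callRemaining = if (callTimeoutMs > 0) callTimeoutMs - elapsedMs else Long.MAX_VALUE
        if (gap >= minOf(callRemaining, readLimit)) {
            return if (callRemaining <= readLimit) "callTimeout" else "readTimeout"
        }
        elapsedMs += gap
    }
    return null
}
```

With a 30-second read timeout and no call timeout, twenty chunks spaced 29 seconds apart complete without any timeout firing, even though the transfer takes almost ten minutes. Adding a 60-second call timeout cuts the same transfer off during the third gap.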


Next Steps

EventListener gives you visibility into what's happening. What you do with that data is up to you:

  1. Log to your analytics - track phase timings by carrier, region, app version
  2. Set up alerts - if DNS P95 spikes, you'll know
  3. Add to crash reports - attach phase info to network exceptions
  4. Distributed tracing - connect client spans to backend spans with OpenTelemetry
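
For the alerting idea in item 2, the aggregation itself is small. The following is a rough sketch of a nearest-rank P95 grouped by carrier, using only the standard library. `DnsSample`, `percentile`, and `dnsP95ByCarrier` are hypothetical names, and in practice this computation would more likely live in your analytics backend than in the app.

```kotlin
import kotlin.math.ceil

/** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
fun percentile(samples: List<Long>, p: Double): Long {
    require(samples.isNotEmpty()) { "need at least one sample" }
    require(p > 0.0 && p <= 100.0) { "p must be in (0, 100]" }
    val sorted = samples.sorted()
    val rank = ceil(p * sorted.size / 100.0).toInt() // 1-based rank
    return sorted[rank - 1]
}

/** One DNS timing tagged with the carrier it was measured on. */
data class DnsSample(val carrier: String, val dnsMs: Long)

/** P95 DNS time per carrier. */
fun dnsP95ByCarrier(samples: List<DnsSample>): Map<String, Long> =
    samples.groupBy({ it.carrier }, { it.dnsMs })
        .mapValues { (_, values) -> percentile(values, 95.0) }
```

Nearest-rank is only one of several percentile definitions; it is used here because it always returns a value that was actually observed.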

The network will always be unreliable. But at least now you can see where it's unreliable.


Further reading: OkHttp Events documentation

Real Device Testing: The Proof

Theory is nice. Here is what we actually measured using the EventListener on real devices:

Devices Tested

  • Moto Razr 40 Ultra — Airtel 4G (no VoLTE)
  • Pixel 9 Pro Fold — Jio 4G LTE
  • Android Emulator — WiFi simulation

Baseline Performance

Metric          Razr 40 Ultra   Pixel 9 Pro Fold
Cold DNS        5070ms          141ms
Cached DNS      3-7ms           10-12ms
TCP Connect     ~1300ms         ~1000ms
TLS Handshake   ~960ms          ~650ms
Total (warm)    ~2150ms         ~1900ms

The Razr took 5 seconds just for DNS on the first request. Without EventListener, you would just see "slow network" — no idea WHERE the 5 seconds went.

Post-Doze Recovery

After forcing Android Doze mode and waking the device:

Timing           Razr 40 Ultra       Pixel 9 Pro Fold
Immediate (0s)   ✅ Works (2082ms)   ❌ TCP timeout (15s)
After 5s         ❌ DNS fail         ❌ DNS fail
After 30s        ❌ DNS fail         ❌ DNS fail

Two flagship foldables, same network type, completely different behavior:

  • Razr: Slow cold DNS (5s!), but the first request right after doze succeeds
  • Pixel: Fast baseline, but the first request after doze dies in a 15s TCP timeout

What Crash Reports Would Show vs What EventListener Revealed

Device    Crash Report Says        EventListener Revealed
Razr 40   "Timeout"                Cold DNS = 5070ms (the real culprit)
Pixel 9   "Timeout"                DNS OK (97ms), TCP hung for 15s
Both      "UnknownHostException"   DNS resolver dies after doze

This is the point: IOException tells you nothing. EventListener tells you exactly which phase failed and why.
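
One way to get that phase information into crash reports is to wrap the original exception before it propagates. The sketch below is a stdlib-only illustration under that assumption. `Phase` and `PhaseAwareIOException` are hypothetical names, not part of OkHttp, and the phase and elapsed time are assumed to come from your own listener.

```kotlin
import java.io.IOException

/** Phases in the order they occur. Names are illustrative, not OkHttp types. */
enum class Phase { DNS, TCP, TLS, REQUEST, RESPONSE }

/** Wraps the original failure and records where in the request lifecycle it happened. */
class PhaseAwareIOException(
    val phase: Phase?,          // null means the call failed before DNS started
    val elapsedInPhaseMs: Long, // how long the failing phase had been running
    cause: IOException
) : IOException(describe(phase, elapsedInPhaseMs, cause), cause) {
    companion object {
        /** Builds a one-line summary suitable for a crash report title. */
        fun describe(phase: Phase?, elapsedMs: Long, cause: IOException): String {
            val where = phase?.name ?: "pre-dns"
            return "${cause.javaClass.simpleName} during $where after ${elapsedMs}ms: ${cause.message}"
        }
    }
}
```

The original exception stays available as `cause`, so existing `catch (e: IOException)` blocks and crash grouping keep working while the message gains the phase and duration.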

Run It Yourself

I built a test app to collect this data. Want to contribute your device's results?

Devices Tested:
├── Moto Razr 40 Ultra (Airtel 4G)
├── Pixel 9 Pro Fold (Jio 4G VoLTE)  
├── Android Emulator (WiFi simulation)
└── [Your device here]
     └── Run the app, share results!

Download the test app — run Baseline + Post-Doze, and DM me your results on Twitter. I'll add your device to this post.

The Network Journey (What EventListener Sees)

graph LR
    A[App calls API] --> B[DNS Lookup]
    B --> C[TCP Connect]
    C --> D[TLS Handshake]
    D --> E[Send Request]
    E --> F[Wait TTFB]
    F --> G[Receive Response]
    
    style B fill:#ff6b6b,color:#fff
    style C fill:#ffa502,color:#fff
    style D fill:#2ed573,color:#fff

Red = DNS (where Razr spent 5 seconds). Orange = TCP (where Pixel timed out). Green = TLS.

Device Comparison: Same API, Different Story

graph TD
    subgraph Razr["Moto Razr 40 Ultra"]
        R1[Cold DNS: 5070ms] --> R2[TCP: 1300ms]
        R2 --> R3[TLS: 960ms]
        R3 --> R4[Total: ~7s first call]
    end
    
    subgraph Pixel["Pixel 9 Pro Fold"]
        P1[Cold DNS: 141ms] --> P2[TCP: 1000ms]
        P2 --> P3[TLS: 650ms]
        P3 --> P4[Total: ~2s first call]
    end

Post-Doze Failure Points

flowchart TD
    A[Device exits Doze] --> B{Immediate request}
    B -->|Razr| C[✅ Success]
    B -->|Pixel| D[❌ TCP Timeout 15s]
    
    A --> E{After 5 seconds}
    E -->|Both devices| F[❌ DNS Fails]
    
    A --> G{After 30 seconds}  
    G -->|Both devices| H[❌ DNS Still Fails]
    
    F --> I[OS DNS resolver goes back to sleep]
    H --> I

Key insight: The DNS resolver itself enters a sleep state after doze. Not a server issue — an OS issue.


What This Means For You

EventListener transforms network debugging from guesswork into science. Instead of:

"Users are reporting slow network. Maybe it's the backend?"

You get:

"Airtel users on Motorola devices have 5s cold DNS resolution. Jio users on Pixel hit a 15s TCP timeout on the first request after doze, and DNS then fails on both devices for at least 30s. Backend is fine: it's carrier DNS and OS power management."

That's the difference between filing a ticket with your backend team and actually fixing the problem.

Your Action Items

  1. Add EventListener to your OkHttp client — 50 lines of code, infinite debugging value
  2. Log phase timings to your analytics — segment by carrier, device, network type
  3. Test doze recovery on YOUR devices — the behavior varies wildly
  4. Set callTimeout — it's the only timeout that reflects user experience

Try It Yourself

I open-sourced the test app: github.com/aldefy/okhttp-network-metrics

Run it on your devices. See what YOUR carrier's DNS looks like. Find out how YOUR devices recover from doze.

Because the network will always be unreliable — but now you can see exactly where and why.


The next time someone says "it's a network issue" you'll know which part of the network, on which carrier, on which device, under which conditions. That's not debugging — that's engineering.