The 45-Second Mystery

Last month, our Crashlytics lit up:

UnknownHostException: Unable to resolve host "api.example.com"
  Occurrences: 2,847
  Users affected: 1,203
  Context: ¯\_(ツ)_/¯

Backend team checked their dashboards: "API response time is 47ms p95. Not our problem."

They were right. The API was fast. But users were staring at spinners for 45 seconds before seeing "Something went wrong."

Where did those 45 seconds go?


The Visibility Gap

Here's what most Android apps measure:

Metric | Do You Track It?
Total request time | Yes
HTTP status code | Yes
DNS resolution time | No
TCP handshake time | No
TLS negotiation time | No
Time waiting for first byte | No
Which phase failed | No

You're measuring the destination, but you're blind to the journey.

That UnknownHostException? It could mean:

  • DNS server unreachable (2-30 second timeout)
  • Domain doesn't exist (instant failure)
  • Network switched mid-request (random timing)
  • DNS poisoning in certain regions (varies)

Without phase-level visibility, you're debugging with a blindfold.


Where Time Actually Goes

We instrumented 50,000 requests across different network conditions. Here's what we found:

Good Network (WiFi, 4G LTE)

Phase | P50 | P95 | P99
DNS Lookup | 5ms | 45ms | 120ms
TCP Connect | 23ms | 89ms | 156ms
TLS Handshake | 67ms | 142ms | 203ms
Time to First Byte | 52ms | 187ms | 412ms
Total | 147ms | 463ms | 891ms

Degraded Network (3G, Poor Signal)

Phase | P50 | P95 | P99
DNS Lookup | 234ms | 8,200ms | 29,000ms
TCP Connect | 456ms | 2,100ms | 5,600ms
TLS Handshake | 312ms | 890ms | 1,400ms
Time to First Byte | 178ms | 1,200ms | 3,400ms
Total | 1,180ms | 12,390ms | 39,400ms

The culprit in our 45-second mystery? DNS timeout on degraded networks.

But we only discovered this after adding proper instrumentation.
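
If DNS is your long pole, one mitigation (a sketch of mine, not something from the post's production setup) is to bound the lookup yourself. OkHttp lets you plug in a custom resolver through its Dns interface; the class name and the 4-second budget below are illustrative assumptions:

import java.net.InetAddress
import java.net.UnknownHostException
import java.util.concurrent.Callable
import java.util.concurrent.ExecutionException
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeoutException
import okhttp3.Dns

// Sketch: wrap the system resolver so a stuck DNS query fails fast instead of
// hanging for 30 seconds. The 4-second budget is an illustrative choice.
class BoundedDns(
    private val timeoutMs: Long = 4_000,
    private val executor: ExecutorService = Executors.newCachedThreadPool()
) : Dns {
    override fun lookup(hostname: String): List<InetAddress> {
        val future = executor.submit(Callable { Dns.SYSTEM.lookup(hostname) })
        return try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS)
        } catch (e: TimeoutException) {
            future.cancel(true)
            throw UnknownHostException("DNS lookup for $hostname exceeded ${timeoutMs}ms")
        } catch (e: ExecutionException) {
            throw (e.cause as? UnknownHostException) ?: UnknownHostException(hostname)
        }
    }
}

// Usage: OkHttpClient.Builder().dns(BoundedDns()).build()

Even with a bound in place, keep the instrumentation: you want to know how often that budget gets hit, and on which networks.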


The Logging Interceptor Trap

Some version of this interceptor lives in nearly every Android codebase:

class LoggingInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val start = System.currentTimeMillis()
        val response = chain.proceed(chain.request())
        val duration = System.currentTimeMillis() - start

        Timber.d("Request took ${duration}ms") // <-- This number lies
        return response
    }
}

Why it lies:

Scenario | What Happened | What You See
3 retries due to connection reset | 3 separate failures, then success | "Request took 12,000ms"
Cache hit | Instant response from disk | "Request took 2ms" (good!)
Redirect chain (3 hops) | 3 network round trips | A single timing
DNS timeout + success | 30s DNS, then a 200ms request | "Request took 30,200ms"

You're seeing the outcome, not the story.
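
To be fair, an interceptor can squeeze a little more context out of the Response object - OkHttp exposes cacheResponse, networkResponse, and priorResponse - but internal retries and the individual phases stay invisible at this layer. A rough sketch:

import okhttp3.Interceptor
import okhttp3.Response
import timber.log.Timber

// Sketch: disambiguate cache hits and redirect/auth chains from the Response.
// Phase timings and retries inside the client are still invisible here.
class SmarterLoggingInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val start = System.currentTimeMillis()
        val response = chain.proceed(chain.request())
        val duration = System.currentTimeMillis() - start

        // Pure cache hit: served from disk, no network response at all
        val fromCache = response.cacheResponse != null && response.networkResponse == null

        // Each redirect or auth challenge leaves a priorResponse behind
        var priorResponses = 0
        var prior = response.priorResponse
        while (prior != null) {
            priorResponses++
            prior = prior.priorResponse
        }

        Timber.d("Request took ${duration}ms (cacheHit=$fromCache, priorResponses=$priorResponses)")
        return response
    }
}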


The OkHttp Timeout Trap

Here's a "reasonable" timeout configuration:

val client = OkHttpClient.Builder()
    .connectTimeout(10, TimeUnit.SECONDS)
    .readTimeout(30, TimeUnit.SECONDS)
    .writeTimeout(30, TimeUnit.SECONDS)
    .build()

Pop quiz: What's the maximum time a user could wait?

If you said 70 seconds, you're wrong. It's potentially infinite.

The Timeout Truth Table

Timeout | What It Actually Controls | Resets?
connectTimeout | The TCP socket connect only (not DNS, not the TLS handshake) | No, but it applies per connection attempt
readTimeout | Max time between individual reads | Yes, on every byte read
writeTimeout | Max time between individual writes | Yes, on every byte written
callTimeout | The entire call end-to-end: DNS, connect, TLS, request, response | No

A server trickling 1 byte every 25 seconds will never trigger your 30-second readTimeout. Each byte resets the clock.

callTimeout is the only timeout that represents actual user experience.

val client = OkHttpClient.Builder()
    .connectTimeout(15, TimeUnit.SECONDS)
    .readTimeout(30, TimeUnit.SECONDS)
    .writeTimeout(30, TimeUnit.SECONDS)
    .callTimeout(45, TimeUnit.SECONDS)  // <-- The one that matters
    .build()

The Solution: EventListener

OkHttp ships an API that most developers never touch: EventListener. It gives you callbacks for every phase of the request lifecycle.

Phase | Events
Start | callStart
DNS | dnsStart, dnsEnd
TCP | connectStart, connectEnd
TLS | secureConnectStart, secureConnectEnd
Request | connectionAcquired, requestHeadersStart, requestHeadersEnd, requestBodyStart, requestBodyEnd
Response | responseHeadersStart, responseHeadersEnd, responseBodyStart, responseBodyEnd
Cleanup | connectionReleased, callEnd

Production Implementation

class NetworkMetricsListener(
    private val onMetrics: (NetworkMetrics) -> Unit
) : EventListener() {

    private var callStartNanos = 0L
    private var dnsStartNanos = 0L
    private var dnsEndNanos = 0L
    private var connectStartNanos = 0L
    private var connectEndNanos = 0L
    private var secureStartNanos = 0L
    private var secureEndNanos = 0L
    private var requestStartNanos = 0L
    private var responseStartNanos = 0L

    private var connectionReused = false

    override fun callStart(call: Call) {
        callStartNanos = System.nanoTime()
    }

    override fun dnsStart(call: Call, domainName: String) {
        dnsStartNanos = System.nanoTime()
    }

    override fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {
        dnsEndNanos = System.nanoTime()  // Never fires when a pooled connection is reused
    }

    override fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {
        connectStartNanos = System.nanoTime()
    }

    override fun connectEnd(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy, protocol: Protocol?) {
        connectEndNanos = System.nanoTime()
    }

    override fun secureConnectStart(call: Call) {
        secureStartNanos = System.nanoTime()
    }

    override fun secureConnectEnd(call: Call, handshake: Handshake?) {
        secureEndNanos = System.nanoTime()
    }

    override fun connectionAcquired(call: Call, connection: Connection) {
        connectionReused = (connectStartNanos == 0L)  // No connect phase = pooled connection
    }

    override fun requestHeadersStart(call: Call) {
        requestStartNanos = System.nanoTime()
    }

    override fun responseHeadersStart(call: Call) {
        responseStartNanos = System.nanoTime()
    }

    override fun callEnd(call: Call) {
        emitMetrics(success = true)
    }

    override fun callFailed(call: Call, ioe: IOException) {
        emitMetrics(success = false, error = ioe)
    }

    private fun emitMetrics(success: Boolean, error: IOException? = null) {
        val now = System.nanoTime()
        onMetrics(NetworkMetrics(
            dnsMs = elapsedMs(dnsStartNanos, dnsEndNanos),
            connectMs = elapsedMs(connectStartNanos, connectEndNanos),
            tlsMs = elapsedMs(secureStartNanos, secureEndNanos),
            ttfbMs = elapsedMs(requestStartNanos, responseStartNanos),
            totalMs = elapsedMs(callStartNanos, now),
            connectionReused = connectionReused,
            success = success,
            errorType = error?.javaClass?.simpleName
        ))
    }

    // 0 means the phase was skipped (connection reuse) or never finished (failure mid-phase)
    private fun elapsedMs(startNanos: Long, endNanos: Long): Long =
        if (startNanos > 0L && endNanos > startNanos) TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos) else 0L
}
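
Wiring it up is one builder call. eventListenerFactory creates a fresh listener for every call, which is what keeps the mutable timestamps above per-request. The analytics sink below is a placeholder - send the metrics wherever you already aggregate events:

val client = OkHttpClient.Builder()
    .eventListenerFactory(object : EventListener.Factory {
        override fun create(call: Call): EventListener =
            NetworkMetricsListener { m ->
                // `analytics` is a stand-in for your own telemetry pipeline
                analytics.track("network_request", mapOf(
                    "dns_ms" to m.dnsMs,
                    "connect_ms" to m.connectMs,
                    "tls_ms" to m.tlsMs,
                    "ttfb_ms" to m.ttfbMs,
                    "total_ms" to m.totalMs,
                    "reused" to m.connectionReused,
                    "error" to (m.errorType ?: "none")
                ))
            }
    })
    .callTimeout(45, TimeUnit.SECONDS)
    .build()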

What You Get

Before EventListener:

Field | Value
Duration | 32,450ms
Error | UnknownHostException
Context | ???

After EventListener:

Phase | Duration | Status
DNS Lookup | 30,120ms | TIMEOUT
TCP Connect | -- | --
TLS Handshake | -- | --
TTFB | -- | --
Total | 30,120ms | --
Error | UnknownHostException
Failed Phase | DNS

Now you know: The DNS resolver on this user's network is broken. Not your API. Not your code. Their ISP.
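
How do you get that "Failed Phase" row? One way (a sketch against the NetworkMetrics shape used above) is to look at which phase durations the listener managed to record - a zero means the phase was skipped or never finished:

// Rough classification of where a failed call died, based on recorded durations.
// Assumes HTTPS; a plain-HTTP call has no TLS phase and would need an extra check.
fun failedPhase(m: NetworkMetrics): String = when {
    m.success -> "none"
    m.connectionReused -> "request/response"   // pooled connection: DNS/TCP/TLS never ran
    m.dnsMs == 0L -> "DNS"
    m.connectMs == 0L -> "TCP connect"
    m.tlsMs == 0L -> "TLS handshake"
    m.ttfbMs == 0L -> "waiting for first byte"
    else -> "response body"
}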


Level Up: Distributed Tracing with OpenTelemetry

EventListener tells you what happened on the client. But what about the full journey?

Android App → CDN → API Gateway → Service → Database

Where is the slowness?

With OpenTelemetry, you can trace a request from button tap to database query:

class TracingEventListener(
    private val tracer: Tracer
) : EventListener() {

    private var rootSpan: Span? = null
    private var dnsSpan: Span? = null

    override fun callStart(call: Call) {
        rootSpan = tracer.spanBuilder("HTTP ${call.request().method}")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("http.url", call.request().url.toString())
            .startSpan()
    }

    override fun dnsStart(call: Call, domainName: String) {
        val parent = rootSpan ?: return
        dnsSpan = tracer.spanBuilder("DNS Lookup")
            .setParent(Context.current().with(parent))
            .startSpan()
    }

    override fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {
        dnsSpan?.end()
    }

    // ... create child spans for the TCP, TLS, request, and response phases the same way

    override fun callEnd(call: Call) {
        rootSpan?.setStatus(StatusCode.OK)
        rootSpan?.end()
    }

    override fun callFailed(call: Call, ioe: IOException) {
        rootSpan?.recordException(ioe)
        rootSpan?.setStatus(StatusCode.ERROR)
        rootSpan?.end()
    }
}
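
One missing piece in the sketch above: client spans only line up with backend spans if the trace context travels on the wire. A minimal way to do that (assuming openTelemetry is your configured SDK instance, and that the call runs while the root span is current, e.g. inside span.makeCurrent()) is an interceptor that injects the W3C traceparent header:

import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.context.Context
import io.opentelemetry.context.propagation.TextMapSetter
import okhttp3.Interceptor
import okhttp3.Request
import okhttp3.Response

// Sketch: inject W3C trace headers so the backend can join its spans to ours.
class TracePropagationInterceptor(
    private val openTelemetry: OpenTelemetry
) : Interceptor {

    private val setter = TextMapSetter<Request.Builder> { carrier, key, value ->
        carrier?.header(key, value)
    }

    override fun intercept(chain: Interceptor.Chain): Response {
        val builder = chain.request().newBuilder()
        openTelemetry.propagators.textMapPropagator
            .inject(Context.current(), builder, setter)
        return chain.proceed(builder.build())
    }
}

If you'd rather not hand-roll this, the OpenTelemetry Java instrumentation project also publishes ready-made OkHttp instrumentation you can drop in instead.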

The Trace Waterfall

Now you can answer: "Is it DNS, the network, or the backend?"

HTTP GET /api/users - Total: 1,247ms

Phase | Duration
DNS Lookup | 45ms
TCP Connect | 89ms
TLS Handshake | 156ms
Request Send | 12ms
Response Receive | 945ms

Observability Stack Options

Solution | Cost | Setup | Best For
Honeycomb | Paid (20M events/month free) | 5 min | Best query experience
Grafana Cloud | Free tier (50GB/month) | 10 min | Already using Grafana
Jaeger | Free (self-hosted) | 1-2 hrs | Full control
Datadog | Paid | 15 min | Enterprise, existing Datadog users
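
Whichever backend you pick, the client-side setup is roughly the same: build an SDK with an OTLP exporter and hand its Tracer to the TracingEventListener above. A minimal sketch - the endpoint is a placeholder and vendor-specific auth headers are omitted:

import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.api.common.AttributeKey
import io.opentelemetry.api.common.Attributes
import io.opentelemetry.api.trace.Tracer
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter
import io.opentelemetry.sdk.OpenTelemetrySdk
import io.opentelemetry.sdk.resources.Resource
import io.opentelemetry.sdk.trace.SdkTracerProvider
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor

// Ship spans over OTLP/gRPC; most of the backends above accept this directly.
val exporter = OtlpGrpcSpanExporter.builder()
    .setEndpoint("https://your-otlp-endpoint:4317")  // placeholder
    .build()

val openTelemetry: OpenTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(
        SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .setResource(
                Resource.getDefault().merge(
                    Resource.create(Attributes.of(AttributeKey.stringKey("service.name"), "android-app"))
                )
            )
            .build()
    )
    .build()

// Pass this tracer to TracingEventListener
val tracer: Tracer = openTelemetry.getTracer("okhttp-network-metrics")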

Results

After implementing EventListener + OpenTelemetry in our production app:

Metric | Before | After | Change
MTTR for network issues | 4.2 hours | 23 minutes | -91%
"Network error" bug reports | 847/week | 312/week | -63%
P95 false timeout errors | 2.3% | 0.4% | -83%

The biggest win? We stopped blaming the backend for DNS problems.


Real Device Testing: The Proof

Theory is nice. Data is better. I built a test app and ran it on real devices to see what actually happens.

Test App: github.com/aldefy/okhttp-network-metrics

EventListener: See What Interceptors Can't

What the Interceptor Sees | What EventListener Reveals
Request → Response, total: 7,362ms | DNS: 5,081ms, TCP: 1,313ms, TLS: 964ms, TTFB: 7,359ms

The Doze Mode Discovery

Doze mode recovery timeline (Pixel 9 Pro Fold, JIO 4G):

Attempt | Result
Immediately after doze exit | TCP TIMEOUT at 15,000ms
After a 5-second wait, then retry | 1,431ms ✓
After 30 seconds (fully recovered) | 1,777ms ✓

Key insight: the Pixel fails immediately after doze exit because the radio is still waking up. Wait 5 seconds and the connection succeeds.

💡 The Moto Razr behaves the opposite way: it works immediately, then loses the network after about 5 seconds. Same error, different root cause.

Same error message. Completely different root causes. This is why EventListener matters.

Baseline Performance

Device | Network | Cold DNS | TCP | TLS | TTFB | Total
Pixel 9 Pro Fold | JIO 4G | 229ms | 972ms | 665ms | 1,989ms | 1,991ms
Moto Razr 40 Ultra | Airtel | 5,081ms | 1,313ms | 964ms | 7,359ms | 7,362ms

That 5-second DNS on the Motorola? That's not a typo. Airtel's DNS is slow on the first lookup.

The post-doze behavior of the two devices was completely opposite:

  • Pixel: Fails immediately post-doze, then recovers after 5 seconds
  • Moto: Works immediately, then loses network entirely

Without EventListener, both would show the same error. With it, you can see the Pixel fails at TCP (SocketTimeoutException), while Moto loses the network interface completely.


TL;DR

  1. Your logging interceptor is lying. It shows outcomes, not phases.

  2. callTimeout is the only timeout that matters for user experience.

  3. EventListener exists. Use it. You'll finally understand your network failures.

  4. Add distributed tracing if you need end-to-end visibility across services.

  5. DNS is usually the culprit for those mysterious 30+ second timeouts.


The UnknownHostException you've been catching with a generic error message? It deserves better. Your users certainly do.


What This Means For You

EventListener transforms network debugging from guesswork into science. Instead of:

"Users are reporting slow network. Maybe its the backend?"

You get:

"Airtel users on Motorola devices have 5s DNS resolution. Jio users on Pixel fail TCP immediately after doze but recover in 5s. Backend is fine - its carrier DNS and OS power management."

That's the difference between filing a ticket with your backend team and actually fixing the problem.

Your Action Items

  1. Add EventListener to your OkHttp client - 50 lines of code, infinite debugging value
  2. Log phase timings to your analytics - segment by carrier, device, network type
  3. Test doze recovery on YOUR devices - the behavior varies wildly
  4. Set callTimeout - it's the only timeout that reflects user experience

Try It Yourself

I open-sourced the test app: github.com/aldefy/okhttp-network-metrics

Run it on your devices. See what YOUR carrier's DNS looks like. Find out how YOUR devices recover from doze.

Download the test app, run Baseline + Post-Doze, and DM me your results on Twitter. I'll add your device to this post.

Because the network will always be unreliable - but now you can see exactly where and why.


The next time someone says "it's a network issue," you'll know which part of the network, on which carrier, on which device, under which conditions. That's not debugging - that's engineering.


Tags: Android, OkHttp, Kotlin, Network, Observability, OpenTelemetry