The 45-Second Mystery
Last month, our Crashlytics lit up:
```
UnknownHostException: Unable to resolve host "api.example.com"
Occurrences: 2,847
Users affected: 1,203
Context: ¯\_(ツ)_/¯
```
Backend team checked their dashboards: "API response time is 47ms p95. Not our problem."
They were right. The API was fast. But users were staring at spinners for 45 seconds before seeing "Something went wrong."
Where did those 45 seconds go?
The Visibility Gap
Here's what most Android apps measure - and what they miss:
| Metric | Do You Track It? |
|---|---|
| Total request time | Yes |
| HTTP status code | Yes |
| DNS resolution time | No |
| TCP handshake time | No |
| TLS negotiation time | No |
| Time waiting for first byte | No |
| Which phase failed | No |
You're measuring the destination, but you're blind to the journey.
That UnknownHostException? It could mean:
- DNS server unreachable (2-30 second timeout)
- Domain doesn't exist (instant failure)
- Network switched mid-request (random timing)
- DNS poisoning in certain regions (varies)
Without phase-level visibility, you're debugging with a blindfold.
Where Time Actually Goes
We instrumented 50,000 requests across different network conditions. Here's what we found:
Good Network (WiFi, 4G LTE)
| Phase | P50 | P95 | P99 |
|---|---|---|---|
| DNS Lookup | 5ms | 45ms | 120ms |
| TCP Connect | 23ms | 89ms | 156ms |
| TLS Handshake | 67ms | 142ms | 203ms |
| Time to First Byte | 52ms | 187ms | 412ms |
| Total | 147ms | 463ms | 891ms |
Degraded Network (3G, Poor Signal)
| Phase | P50 | P95 | P99 |
|---|---|---|---|
| DNS Lookup | 234ms | 8,200ms | 29,000ms |
| TCP Connect | 456ms | 2,100ms | 5,600ms |
| TLS Handshake | 312ms | 890ms | 1,400ms |
| Time to First Byte | 178ms | 1,200ms | 3,400ms |
| Total | 1,180ms | 12,390ms | 39,400ms |
The culprit in our 45-second mystery? DNS timeout on degraded networks.
But we only discovered this after adding proper instrumentation.
The Logging Interceptor Trap
Some version of this lives in nearly every Android codebase:
```kotlin
class LoggingInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val start = System.currentTimeMillis()
        val response = chain.proceed(chain.request())
        val duration = System.currentTimeMillis() - start
        Timber.d("Request took ${duration}ms") // <-- This number lies
        return response
    }
}
```
Why it lies:
| Scenario | What Happened | What You See |
|---|---|---|
| 3 retries due to connection reset | 3 separate failures, then success | "Request took 12,000ms" |
| Cache hit | Instant response from disk | "Request took 2ms" (good!) |
| Redirect chain (3 hops) | 3 network round trips | Single timing |
| DNS timeout + success | 30s DNS, 200ms request | "Request took 30,200ms" |
You're seeing the outcome, not the story.
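Two of those distortions can at least be flagged from inside an interceptor - OkHttp's Response exposes cacheResponse and priorResponse - but retries, DNS, and connect time remain invisible. A minimal sketch (the class name is made up):

```kotlin
import okhttp3.Interceptor
import okhttp3.Response
import timber.log.Timber

class SlightlyMoreHonestInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val start = System.currentTimeMillis()
        val response = chain.proceed(chain.request())
        val duration = System.currentTimeMillis() - start

        // Pure cache hit: the body came from disk, no network round trip happened
        val fromCache = response.cacheResponse != null && response.networkResponse == null
        // A prior response means this timing spans at least one redirect or auth retry
        val hadRedirects = response.priorResponse != null

        Timber.d("Request took ${duration}ms (cache=$fromCache, redirected=$hadRedirects)")
        return response
    }
}
```

It's a band-aid: the total is still one opaque number, which is why the rest of this post moves to EventListener.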
The OkHttp Timeout Trap
Here's a "reasonable" timeout configuration:
```kotlin
val client = OkHttpClient.Builder()
    .connectTimeout(10, TimeUnit.SECONDS)
    .readTimeout(30, TimeUnit.SECONDS)
    .writeTimeout(30, TimeUnit.SECONDS)
    .build()
```
Pop quiz: What's the maximum time a user could wait?
If you said 70 seconds, you're wrong. It's potentially infinite.
The Timeout Truth Table
| Timeout | What It Actually Controls | Resets? |
|---|---|---|
| connectTimeout | DNS + TCP + TLS combined | No |
| readTimeout | Max time between bytes read | Yes, per read |
| writeTimeout | Max time between bytes written | Yes, per write |
| callTimeout | Entire operation end-to-end | No |
A server trickling 1 byte every 25 seconds will never trigger your 30-second readTimeout. Each byte resets the clock.
callTimeout is the only timeout that represents actual user experience.
```kotlin
val client = OkHttpClient.Builder()
    .connectTimeout(15, TimeUnit.SECONDS)
    .readTimeout(30, TimeUnit.SECONDS)
    .writeTimeout(30, TimeUnit.SECONDS)
    .callTimeout(45, TimeUnit.SECONDS) // <-- The one that matters
    .build()
```
The Solution: EventListener
OkHttp has an API hiding in plain sight that most developers never touch. EventListener gives you callbacks for every phase of the request lifecycle.
| Phase | Events |
|---|---|
| Start | callStart |
| DNS | dnsStart → dnsEnd |
| TCP | connectStart → connectEnd |
| TLS | secureConnectStart → secureConnectEnd |
| Request | connectionAcquired → requestHeadersStart → requestHeadersEnd → requestBodyStart → requestBodyEnd |
| Response | responseHeadersStart → responseHeadersEnd → responseBodyStart → responseBodyEnd |
| Cleanup | connectionReleased → callEnd |
Production Implementation
```kotlin
class NetworkMetricsListener(
    private val onMetrics: (NetworkMetrics) -> Unit
) : EventListener() {

    // Stateful: install one instance per call via OkHttpClient.Builder().eventListenerFactory(...)
    private var callStart = 0L
    private var dnsStart = 0L
    private var dnsEnd = 0L
    private var connectStart = 0L
    private var connectEnd = 0L
    private var secureConnectStart = 0L
    private var secureConnectEnd = 0L
    private var requestStart = 0L
    private var responseStart = 0L
    private var connectionReused = false

    override fun callStart(call: Call) {
        callStart = System.nanoTime()
    }

    override fun dnsStart(call: Call, domainName: String) {
        dnsStart = System.nanoTime()
    }

    override fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {
        dnsEnd = System.nanoTime() // Connection reuse skips DNS entirely, so this may never fire
    }

    override fun connectStart(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy) {
        connectStart = System.nanoTime()
    }

    override fun secureConnectStart(call: Call) {
        secureConnectStart = System.nanoTime()
    }

    override fun secureConnectEnd(call: Call, handshake: Handshake?) {
        secureConnectEnd = System.nanoTime()
    }

    override fun connectEnd(call: Call, inetSocketAddress: InetSocketAddress, proxy: Proxy, protocol: Protocol?) {
        connectEnd = System.nanoTime()
    }

    override fun connectionAcquired(call: Call, connection: Connection) {
        connectionReused = (connectStart == 0L) // No connect phase = reused
    }

    override fun requestHeadersStart(call: Call) {
        requestStart = System.nanoTime()
    }

    override fun responseHeadersStart(call: Call) {
        responseStart = System.nanoTime()
    }

    override fun callEnd(call: Call) {
        emitMetrics(success = true)
    }

    override fun callFailed(call: Call, ioe: IOException) {
        emitMetrics(success = false, error = ioe)
    }

    private fun emitMetrics(success: Boolean, error: IOException? = null) {
        val now = System.nanoTime()
        onMetrics(NetworkMetrics(
            dnsMs = phaseMs(dnsStart, dnsEnd, now),
            // TCP only: ends when TLS starts (or at connectEnd for cleartext)
            connectMs = phaseMs(connectStart, if (secureConnectStart > 0L) secureConnectStart else connectEnd, now),
            tlsMs = phaseMs(secureConnectStart, secureConnectEnd, now),
            ttfbMs = phaseMs(requestStart, responseStart, now),
            totalMs = phaseMs(callStart, now, now),
            connectionReused = connectionReused,
            success = success,
            errorType = error?.javaClass?.simpleName
        ))
    }

    // Skipped phases report 0; a phase that started but never finished
    // (e.g. a DNS timeout) is charged all the time up to the failure
    private fun phaseMs(start: Long, end: Long, fallbackEnd: Long): Long =
        if (start == 0L) 0L
        else TimeUnit.NANOSECONDS.toMillis((if (end > 0L) end else fallbackEnd) - start)
}
```
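The post doesn't show the NetworkMetrics container or the client wiring, so here's a minimal sketch; the field names mirror the listener above, and the Timber call stands in for whatever analytics pipeline you use. The one non-negotiable detail is eventListenerFactory: the listener is stateful, so every call needs a fresh instance.

```kotlin
import okhttp3.Call
import okhttp3.EventListener
import okhttp3.OkHttpClient
import timber.log.Timber
import java.util.concurrent.TimeUnit

data class NetworkMetrics(
    val dnsMs: Long,
    val connectMs: Long,
    val tlsMs: Long,
    val ttfbMs: Long,
    val totalMs: Long,
    val connectionReused: Boolean,
    val success: Boolean,
    val errorType: String?
)

val client = OkHttpClient.Builder()
    .callTimeout(45, TimeUnit.SECONDS)
    // One stateful listener per call - never share a single instance across calls
    .eventListenerFactory(object : EventListener.Factory {
        override fun create(call: Call): EventListener =
            NetworkMetricsListener { metrics ->
                Timber.d("phase timings: $metrics") // swap for your analytics client
            }
    })
    .build()
```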
What You Get
Before EventListener:
| Field | Value |
|---|---|
| Duration | 32,450ms |
| Error | UnknownHostException |
| Context | ??? |
After EventListener:
| Phase | Duration | Status |
|---|---|---|
| DNS Lookup | 30,120ms | TIMEOUT |
| TCP Connect | -- | -- |
| TLS Handshake | -- | -- |
| TTFB | -- | -- |
| Total | 30,120ms | |
| Error | UnknownHostException | |
| Failed Phase | DNS | |
Now you know: The DNS resolver on this user's network is broken. Not your API. Not your code. Their ISP.
Level Up: Distributed Tracing with OpenTelemetry
EventListener tells you what happened on the client. But what about the full journey?
Android App → CDN → API Gateway → Service → Database
Where is the slowness?
With OpenTelemetry, you can trace a request from button tap to database query:
```kotlin
class TracingEventListener(
    private val tracer: Tracer
) : EventListener() {

    private var rootSpan: Span? = null
    private var dnsSpan: Span? = null

    override fun callStart(call: Call) {
        rootSpan = tracer.spanBuilder("HTTP ${call.request().method}")
            .setSpanKind(SpanKind.CLIENT)
            .setAttribute("http.url", call.request().url.toString())
            .startSpan()
    }

    override fun dnsStart(call: Call, domainName: String) {
        dnsSpan = tracer.spanBuilder("DNS Lookup")
            .setParent(Context.current().with(rootSpan!!))
            .startSpan()
    }

    override fun dnsEnd(call: Call, domainName: String, inetAddressList: List<InetAddress>) {
        dnsSpan?.end()
    }

    // ... create child spans for the remaining phases (TCP, TLS, TTFB) the same way

    override fun callEnd(call: Call) {
        rootSpan?.setStatus(StatusCode.OK)
        rootSpan?.end()
    }

    override fun callFailed(call: Call, ioe: IOException) {
        rootSpan?.setStatus(StatusCode.ERROR, ioe.message ?: "call failed")
        rootSpan?.end()
    }
}
```
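Wiring this up needs a Tracer. Here's a minimal sketch using the OpenTelemetry SDK with an OTLP exporter; the endpoint (a local collector reached through the emulator's host alias) and the instrumentation scope name are assumptions, not part of the original post:

```kotlin
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter
import io.opentelemetry.sdk.OpenTelemetrySdk
import io.opentelemetry.sdk.trace.SdkTracerProvider
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor
import okhttp3.Call
import okhttp3.EventListener
import okhttp3.OkHttpClient

fun buildTracedClient(): OkHttpClient {
    // Export spans to an OTLP collector (assumed to be running locally)
    val exporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://10.0.2.2:4317")
        .build()

    val tracerProvider = SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
        .build()

    val tracer = OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .build()
        .getTracer("okhttp-client") // instrumentation scope name is arbitrary

    return OkHttpClient.Builder()
        // Fresh listener per call, since it holds per-call span state
        .eventListenerFactory(object : EventListener.Factory {
            override fun create(call: Call): EventListener = TracingEventListener(tracer)
        })
        .build()
}
```

To stitch these client spans onto your backend's spans you also need to propagate trace context (for example, injecting the W3C traceparent header with an interceptor); the sketch above only exports client-side spans.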
The Trace Waterfall
Now you can answer: "Is it DNS, the network, or the backend?"
Observability Stack Options
| Solution | Cost | Setup | Best For |
|---|---|---|---|
| Honeycomb | Paid (20M events free) | 5 min | Best query experience |
| Grafana Cloud | Free 50GB/mo | 10 min | Already using Grafana |
| Jaeger | Free (self-host) | 1-2 hrs | Full control |
| Datadog | Paid | 15 min | Enterprise, existing DD |
Results
After implementing EventListener + OpenTelemetry in our production app:
| Metric | Before | After | Change |
|---|---|---|---|
| MTTR for network issues | 4.2 hours | 23 minutes | -91% |
| "Network error" bug reports | 847/week | 312/week | -63% |
| P95 false timeout errors | 2.3% | 0.4% | -83% |
The biggest win? We stopped blaming the backend for DNS problems.
Real Device Testing: The Proof
Theory is nice. Data is better. I built a test app and ran it on real devices to see what actually happens.
Test App: github.com/aldefy/okhttp-network-metrics
EventListener: See What Interceptors Can't
| What Interceptor Sees | What EventListener Reveals |
|---|---|
| Request → Response | DNS: 5081ms |
| Total: 7362ms | TCP: 1313ms |
| | TLS: 964ms |
| | TTFB: 7359ms |
Baseline Performance
| Device | Network | Cold DNS | TCP | TLS | TTFB | Total |
|---|---|---|---|---|---|---|
| Pixel 9 Pro Fold | Jio 4G | 229ms | 972ms | 665ms | 1989ms | 1991ms |
| Moto Razr 40 Ultra | Airtel | 5081ms | 1313ms | 964ms | 7359ms | 7362ms |
That 5-second DNS on the Motorola? That's not a typo. Airtel's DNS is slow on the first lookup.
The Doze Mode Discovery
Same error message. Completely different root causes. This is why EventListener matters.
Running the same request right after each device came out of Doze made the split obvious:
- Pixel: Fails immediately post-doze, then recovers after 5 seconds
- Moto: Works immediately, then loses network entirely
Without EventListener, both would show the same error. With it, you can see the Pixel fails at TCP (SocketTimeoutException), while Moto loses the network interface completely.
TL;DR
- Your logging interceptor is lying. It shows outcomes, not phases.
- callTimeout is the only timeout that matters for user experience.
- EventListener exists. Use it. You'll finally understand your network failures.
- Add distributed tracing if you need end-to-end visibility across services.
- DNS is usually the culprit for those mysterious 30+ second timeouts.
The UnknownHostException you've been catching with a generic error message? It deserves better. Your users certainly do.
What This Means For You
EventListener transforms network debugging from guesswork into science. Instead of:
"Users are reporting slow network. Maybe its the backend?"
You get:
"Airtel users on Motorola devices have 5s DNS resolution. Jio users on Pixel fail TCP immediately after doze but recover in 5s. Backend is fine - its carrier DNS and OS power management."
Thats the difference between filing a ticket with your backend team and actually fixing the problem.
Your Action Items
- Add EventListener to your OkHttp client - 50 lines of code, infinite debugging value
- Log phase timings to your analytics - segment by carrier, device, network type (see the sketch after this list)
- Test doze recovery on YOUR devices - the behavior varies wildly
- Set callTimeout - it's the only timeout that reflects user experience
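For item 2, here's a sketch of how you might pull the carrier / device / network-type dimensions to attach to each metrics event. The function name and dimension keys are made up; it assumes you hold a Context and the ACCESS_NETWORK_STATE permission:

```kotlin
import android.content.Context
import android.net.ConnectivityManager
import android.net.NetworkCapabilities
import android.os.Build
import android.telephony.TelephonyManager

// Hypothetical helper: attach the returned map to every NetworkMetrics event you log
fun networkDimensions(context: Context): Map<String, String> {
    val cm = context.getSystemService(ConnectivityManager::class.java)
    val caps = cm?.getNetworkCapabilities(cm.activeNetwork)
    val transport = when {
        caps == null -> "offline"
        caps.hasTransport(NetworkCapabilities.TRANSPORT_WIFI) -> "wifi"
        caps.hasTransport(NetworkCapabilities.TRANSPORT_CELLULAR) -> "cellular"
        else -> "other"
    }
    val carrier = context.getSystemService(TelephonyManager::class.java)
        ?.networkOperatorName.orEmpty().ifEmpty { "unknown" }
    return mapOf(
        "transport" to transport,
        "carrier" to carrier,
        "device" to Build.MODEL
    )
}
```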
Try It Yourself
I open-sourced the test app: github.com/aldefy/okhttp-network-metrics
Run it on your devices. See what YOUR carrier's DNS looks like. Find out how YOUR devices recover from doze.
Download the test app - run Baseline + Post-Doze, and DM me your results on Twitter. I'll add your device to this post.
Because the network will always be unreliable - but now you can see exactly where and why.
The next time someone says "it's a network issue," you'll know which part of the network, on which carrier, on which device, under which conditions. That's not debugging - that's engineering.
Tags: Android, OkHttp, Kotlin, Network, Observability, OpenTelemetry