You are serving an LLM. The inference latency for a single request is a nonnegative random variable T, and you are told only:
E[T] = 1 minute
The interviewer asks a sequence of probability and restart-policy questions:
What can you say about P(T > 5)?
If you are allowed to restart once when an attempt has run too long, what success probability can you guarantee within 10 minutes?
If you are allowed to restart N times, what is the success probability?
Is there a restart strategy, meaning a choice of timeout length, that maximizes the success probability?
Is there a distribution where restarting does not change the success probability?
If different restart times are allowed, is there a better strategy?
Assume each restarted attempt draws an independent fresh copy of T, restart overhead is negligible, and the request succeeds as soon as any attempt finishes before its timeout.
This is a probability and stochastic-process reasoning question, not an implementation question. A strong Research Scientist answer should separate:
Exact probabilities when the latency distribution is known.
Worst-case bounds when only the mean is known.
How restart policies interact with the tail or hazard rate of the latency distribution.
With only E[T] = 1, you generally cannot compute exact probabilities. You can compute bounds.
For example, P(T > 5) = 1/5 is not guaranteed to be exact. The correct statement from Markov's inequality is:
P(T > 5) <= E[T] / 5 = 1/5
That bound is tight up to an arbitrarily small gap. For any small epsilon > 0, let:
P(T = 5 + epsilon) = 1 / (5 + epsilon)
P(T = 0) = 1 - 1 / (5 + epsilon)
Then E[T] = 1, and:
P(T > 5) = 1 / (5 + epsilon)
which approaches 1/5 as epsilon goes to 0.
Part 1: Bound P(T > 5)
Because T >= 0, Markov's inequality gives:
P(T > a) <= E[T] / a
For a = 5:
P(T > 5) <= 1/5
So any single attempt succeeds within 5 minutes with probability at least:
P(T <= 5) >= 4/5
Use a simple policy:
Run the first attempt for at most 5 minutes.
If it has not finished, restart.
Run the second attempt for at most the remaining 5 minutes.
The request fails only if both independent attempts exceed 5 minutes:
P(failure) = P(T_1 > 5 and T_2 > 5)
= P(T > 5)^2
<= (1/5)^2
= 1/25
Therefore the success probability is at least:
P(success) >= 1 - 1/25 = 24/25
If the interviewer informally says the answer is 1 - 1/5 * 1/5, interpret that as the Markov lower bound under independent restarts.
Part 3: N Restarts with Equal Timeouts
If N restarts are allowed, there are N + 1 total attempts.
If the latency survival function is known, define:
S(t) = P(T > t)
For independent attempts with timeouts t_1, t_2, ..., t_(N+1), the exact success probability is:
P(success) = 1 - S(t_1) S(t_2) ... S(t_(N+1))
Without knowing S, use Markov bounds.
If every attempt is given 5 minutes, the failure probability is bounded by:
P(failure) <= (1/5)^(N + 1)
So:
P(success) >= 1 - (1/5)^(N + 1)
This assumes the total budget is large enough to spend 5 minutes on every attempt.
If the total time budget is fixed at 10 minutes and split equally across N + 1 attempts, each attempt gets:
t = 10 / (N + 1)
For t >= 1, Markov gives the useful bound:
P(failure) <= (1/t)^(N + 1)
= ((N + 1) / 10)^(N + 1)
Therefore:
P(success) >= 1 - ((N + 1) / 10)^(N + 1)
When t < 1, Markov's bound is greater than 1, so after clipping probabilities at 1 it gives no useful guarantee.
Suppose you choose m independent attempts and split 10 minutes equally, so each attempt has timeout:
t = 10 / m
When t >= 1, the Markov failure bound is:
P(failure) <= (m / 10)^m
To minimize this bound over real-valued m, minimize:
log failure bound = m log(m / 10)
Differentiate:
d/dm [m log(m / 10)] = log(m / 10) + 1
Set it to zero:
log(m / 10) + 1 = 0
m / 10 = e^-1
m = 10 / e ~ 3.68
So the best integer number of attempts, under this mean-only worst-case objective, is around 4, meaning:
timeout ~ 10 / 4 = 2.5 minutes per attempt
The corresponding Markov guarantee is:
P(success) >= 1 - (4/10)^4
= 1 - 0.0256
= 0.9744
This is slightly better than the one-restart, two-attempt Markov guarantee:
1 - (1/5)^2 = 0.96
The exponential distribution is the canonical example.
If:
T ~ Exponential(rate = 1)
then E[T] = 1, and the distribution is memoryless:
P(T > s + t | T > s) = P(T > t)
Because of memorylessness, an in-progress attempt that has already run for s minutes is statistically as good as a freshly restarted attempt. Restarting does not improve the success probability for a fixed total time budget.
For a total budget B, the probability of finishing without restarts is:
P(T <= B) = 1 - e^-B
If you split B into timeouts t_1, t_2, ..., t_m, the failure probability is:
P(T_1 > t_1, T_2 > t_2, ..., T_m > t_m)
= e^-t_1 e^-t_2 ... e^-t_m
= e^-(t_1 + t_2 + ... + t_m)
= e^-B
So the success probability is still:
1 - e^-B
Restarting neither helps nor hurts in this idealized model.
For a known latency distribution, unequal timeouts can be better. The best policy depends on the shape of the hazard rate:
If the hazard rate decreases over time, old attempts become less promising, so restarting earlier can help.
If the hazard rate increases over time, old attempts become more promising, so restarting can hurt.
If the hazard rate is constant, as in the exponential distribution, restarting does not matter.
With only E[T] = 1 and no distributional knowledge, you can still optimize the Markov worst-case bound, but you should be explicit that this is a conservative guarantee rather than the true optimum for every distribution.
Suppose the total budget is B = 10 and you choose timeouts:
t_1, t_2, ..., t_m
where:
t_1 + t_2 + ... + t_m = 10
For each timeout, Markov gives:
P(T > t_i) <= min(1, 1 / t_i)
If all chosen timeouts are at least 1 minute, the joint failure bound becomes:
P(failure) <= 1 / (t_1 t_2 ... t_m)
For a fixed number of attempts m, making this bound as small as possible means maximizing the product:
t_1 t_2 ... t_m
subject to a fixed sum. By AM-GM, the product is maximized when all timeouts are equal:
t_1 = t_2 = ... = t_m = 10 / m
So under the Markov-only assumption, unequal restart times do not improve the worst-case bound. Unequal schedules become useful only when you know more about the actual latency distribution.