okhttp和http 2.0相遇引發的"血案"

01-28

1.起因

今天在做流量分析時，從服務端記錄的請求日誌發現流量排名前10的用戶里有一個用戶的某個網路請求竟然在很短的時間內請求了10000多次! 那段時間內的流量佔比如下圖:

2.分析

考慮到客戶端採用AOP的方式記錄了所有的網路請求信息，於是趕緊把客戶端日誌拉上來進行分析，看看是不是某些條件觸發的，卻意外地並沒有看到頻繁地網路請求! 我的內心是崩潰的!!!

既然客戶端日誌查不出什麼問題，於是只能繼續從服務端日誌入手，仔細看一下簡直嚇尿了，那一萬多個請求竟然都是同一個requestId，於是懷疑是服務端的日誌記錄有問題，但是可以看到各個請求雖然requestId一樣，但是時間戳是不一樣的，那就排除了服務端的日誌記錄出錯的可能!

那就只能是客戶端的網路請求的重試機制有問題了!!!

但是OkHttp的重試機制很簡單啊，就是添加攔截器，示例代碼如下:

client.interceptors().add(new Interceptor() {n @Overriden public Response intercept(Chain chain) throws IOException {n Request request = chain.request();nn // try the requestn Response response = chain.proceed(request);nn int tryCount = 0;n while (!response.isSuccessful() && tryCount < 3) {nn Log.d("intercept", "Request is not successful - " + tryCount);nn tryCount++;nn // retry the requestn response = chain.proceed(request);n }nn // otherwise just pass the original response onn return response;n }n });n

而且由於並不是一個很緊迫的配置項，所以當時並沒有給它添加重試的攔截器。

那就只可能是它自身的重試機制出問題了，由於我們App裡面用的是okhttp版本是3.4.2，其對應的重試代碼如下:

@Override public Response intercept(Chain chain) throws IOException {n Request request = chain.request();nn streamAllocation = new StreamAllocation(n client.connectionPool(), createAddress(request.url()));nn int followUpCount = 0;n Response priorResponse = null;n while (true) {n if (canceled) {n streamAllocation.release();n throw new IOException("Canceled");n }nn Response response = null;n boolean releaseConnection = true;n try {n response = ((RealInterceptorChain) chain).proceed(request, streamAllocation, null, null);n releaseConnection = false;n } catch (RouteException e) {n // The attempt to connect via a route failed. The request will not have been sent.n if (!recover(e.getLastConnectException(), true, request)) throw e.getLastConnectException();n releaseConnection = false;n continue;n } catch (IOException e) {n // An attempt to communicate with a server failed. The request may have been sent.n if (!recover(e, false, request)) throw e;n releaseConnection = false;n continue;n } finally {n // Were throwing an unchecked exception. Release any resources.n if (releaseConnection) {n streamAllocation.streamFailed(null);n streamAllocation.release();n }n }nn // Attach the prior response if it exists. Such responses never have a body.n if (priorResponse != null) {n response = response.newBuilder()n .priorResponse(priorResponse.newBuilder()n .body(null)n .build())n .build();n }nn Request followUp = followUpRequest(response);nn if (followUp == null) {n if (!forWebSocket) {n streamAllocation.release();n }n return response;n }nn closeQuietly(response.body());nn if (++followUpCount > MAX_FOLLOW_UPS) {n streamAllocation.release();n throw new ProtocolException("Too many follow-up requests: " + followUpCount);n }nn if (followUp.body() instanceof UnrepeatableRequestBody) {n throw new HttpRetryException("Cannot retry streamed HTTP body", response.code());n }nn if (!sameConnection(response, followUp.url())) {n streamAllocation.release();n streamAllocation = new StreamAllocation(n client.connectionPool(), createAddress(followUp.url()));n } else if (streamAllocation.stream() != null) {n throw new IllegalStateException("Closing the body of " + responsen + " didnt close its backing stream. Bad interceptor?");n }nn request = followUp;n priorResponse = response;n }n }n

這段代碼是在RetryAndFollowUpInterceptor中的intercept()方法中，這個類的目的很明確，就是用於重試和跳轉，但是redirect最大次數是20, 那怎麼會無限重試呢？

顯然只可能是在if (++followUpCount > MAX_FOLLOW_UPS) 這個條件判斷之前，繼續分析下去會發現在出現RouteException或者IOException時，會調用recover()方法，而如果每次recover()都返回true的話，就不會釋放連接，而是一直重試下去。

那有沒有可能我們這個請求是發生了RouteException或者IOException呢？

看到服務端有如下記錄:

注意其中的errorDomain和errorDescription,很清楚地可以看到是發生了shutdown的IOException,其實這種情況一般發生在伺服器處於較大壓力的情況下，再看一下recover()這個方法:

/**n * Report and attempt to recover from a failure to communicate with a server. Returns true ifn * {@code e} is recoverable, or false if the failure is permanent. Requests with a body can onlyn * be recovered if the body is buffered.n */n private boolean recover(IOException e, boolean routeException, Request userRequest) {n streamAllocation.streamFailed(e);nn // The application layer has forbidden retries.n if (!client.retryOnConnectionFailure()) return false;nn // We cant send the request body again.n if (!routeException && userRequest.body() instanceof UnrepeatableRequestBody) return false;nn // This exception is fatal.n if (!isRecoverable(e, routeException)) return false;nn // No more routes to attempt.n if (!streamAllocation.hasMoreRoutes()) return false;nn // For failure recovery, use the same route selector with a new connection.n return true;n }n

由於應用層並沒有禁止重試，Request的body()也並非UnrepeatableRequestBody,而且路由選擇也不為false, 所以最終返回true,從而判定為連接可以恢復，最終導致無限重連，耗費大量的流量。

之後在看okhttp的issues時發現其實這個bug早在2016.7.28(對應版本為3.4.1）就有了提了，如下圖所示:

只是這個bug的修復在3.4.2並沒有完成，而是一直到3.5.0才完成，詳細信息見鏈接[Gracefully recover from an HTTP/2 connection shutdown at start of request. #2992](Gracefully recover from an HTTP/2 connection shutdown at start of request. by dave-r12 · Pull Request #2992 · square/okhttp) 所以如果遇到類似問題的同學，請將okhttp升級到3.5.0以上。

3.Root Cause

不管是從okhttp的issue還是我們自己的日誌記錄來看，這種情況的出現都是非常小概率的，那到底是什麼情況下會出現shutdown這樣的IOException呢?

採用charles抓包，發現該網路請求採用的竟然是http 2.0協議，如下圖所示:

為了更好地理解h2的請求過程，簡單地了解一下h2與1.1請求的區別。

h2是基於Google的SPDY協議的，其實h2和1.1最大的區別就是多路復用，即1.1中必須一個連接對應一個stream，而h2中可以多個stream共用一個連接，如下圖所示:

其實現的方式就是將要傳輸的信息分割成一個個二進位幀，其中首部信息會放到HEADER Frame中，而request body會放到DATA Frame中，示意圖如下:

幀傳輸的示意圖如下:

而數據幀的格式其實在okhttp的注釋中就有，如下:

p.s:不得不佩服維護okhttp的團隊注釋寫得非常好。

由於TCP協議的滑動窗口，因而1.1中每次建立連接後都是需要慢啟動，即數據先是慢慢地傳，之後滑動窗口變大，才能高速傳遞，而h2由於共用連接，不僅提高於吞吐量，而且可以使平均傳輸速度增大，如下的動圖很好地表示了h2與1.1在傳輸相同圖片時的對比:

另外，由於h2進行了首部壓縮，所以其實它還可以節省流量。

了解了這些，知道h2是通過幀傳輸的，就能很快地梳理出h2協議的請求過程，如下圖所示:

其中重要的節點我都用紅色標出來了，如前面所說，h2協議是一幀一幀地讀取地，所以正常的請求下Http2.Reader.nextFrame()方法中的type是TYPE_HEADERS和TYPE_DATA,該方法如下:

@Override public boolean nextFrame(Handler handler) throws IOException {n try {n source.require(9); // Frame header sizen } catch (IOException e) {n return false; // This might be a normal socket close.n }nn /* 0 1 2 3n * 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1n * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+n * | Length (24) |n * +---------------+---------------+---------------+n * | Type (8) | Flags (8) |n * +-+-+-----------+---------------+-------------------------------+n * |R| Stream Identifier (31) |n * +=+=============================================================+n * | Frame Payload (0...) ...n * +---------------------------------------------------------------+n */n int length = readMedium(source);n if (length < 0 || length > INITIAL_MAX_FRAME_SIZE) {n throw ioException("FRAME_SIZE_ERROR: %s", length);n }n byte type = (byte) (source.readByte() & 0xff);n byte flags = (byte) (source.readByte() & 0xff);n int streamId = (source.readInt() & 0x7fffffff); // Ignore reserved bit.n if (logger.isLoggable(FINE)) logger.fine(formatHeader(true, streamId, length, type, flags));nn switch (type) {n case TYPE_DATA:n readData(handler, length, flags, streamId);n break;nn case TYPE_HEADERS:n readHeaders(handler, length, flags, streamId);n break;nn case TYPE_PRIORITY:n readPriority(handler, length, flags, streamId);n break;nn case TYPE_RST_STREAM:n readRstStream(handler, length, flags, streamId);n break;nn case TYPE_SETTINGS:n readSettings(handler, length, flags, streamId);n break;nn case TYPE_PUSH_PROMISE:n readPushPromise(handler, length, flags, streamId);n break;nn case TYPE_PING:n readPing(handler, length, flags, streamId);n break;nn case TYPE_GOAWAY:n readGoAway(handler, length, flags, streamId);n break;nn case TYPE_WINDOW_UPDATE:n readWindowUpdate(handler, length, flags, streamId);n break;nn default:n // Implementations MUST discard frames that have unknown or unsupported types.n source.skip(length);n }n return true;n }n

一直到服務端出現某種錯誤關閉連接，okhttp就會從數據幀中讀取到TYPE_GOAWAY的標誌位，然後進入readGoAway()的流程中，該方法如下:

private void readGoAway(Handler handler, int length, byte flags, int streamId)n throws IOException {n if (length < 8) throw ioException("TYPE_GOAWAY length < 8: %s", length);n if (streamId != 0) throw ioException("TYPE_GOAWAY streamId != 0");n int lastStreamId = source.readInt();n int errorCodeInt = source.readInt();n int opaqueDataLength = length - 8;n ErrorCode errorCode = ErrorCode.fromHttp2(errorCodeInt);n if (errorCode == null) {n throw ioException("TYPE_GOAWAY unexpected error code: %d", errorCodeInt);n }n ByteString debugData = EMPTY;n if (opaqueDataLength > 0) { // Must read debug data in order to not corrupt the connection.n debugData = source.readByteString(opaqueDataLength);n }n handler.goAway(lastStreamId, errorCode, debugData);n }n

注意到這個方法中的handler其實是實現了Handler介面的FramedConnection.Reader,其goAway()方法如下:

@Override public void goAway(int lastGoodStreamId, ErrorCode errorCode, ByteString debugData) {n if (debugData.size() > 0) { // TODO: log the debugDatan }nn // Copy the streams first. We dont want to hold a lock when we call receiveRstStream().n FramedStream[] streamsCopy;n synchronized (FramedConnection.this) {n streamsCopy = streams.values().toArray(new FramedStream[streams.size()]);n shutdown = true;n }nn // Fail all streams created after the last good stream ID.n for (FramedStream framedStream : streamsCopy) {n if (framedStream.getId() > lastGoodStreamId && framedStream.isLocallyInitiated()) {n framedStream.receiveRstStream(ErrorCode.REFUSED_STREAM);n removeStream(framedStream.getId());n }n }n }n

注意其中將shutdown設置為true了，之後，當StreamAllocation再次調用nextFrame()時，就會拋出名為"shutdown"的IOException, 主要代碼如下:

private FramedStream newStream(int associatedStreamId, List<Header> requestHeaders, boolean out,n boolean in) throws IOException {n boolean outFinished = !out;n boolean inFinished = !in;n boolean flushHeaders;n FramedStream stream;n int streamId;nn synchronized (frameWriter) {n synchronized (this) {n if (shutdown) {n throw new IOException("shutdown");n }n ...n return stream;n }n

由於中間沒有捕獲IOException的操作，所以這個異常就會一直往上傳遞到RetryAndFollowUpInterceptor的intercept()方法中，最終導致前面所說的死循環。

而至於服務端為何會在某些請求中shutdown, 查看流量統計就可以發現這個介面在一天時間內有高達幾億次請求，遠遠超過了當初設計的要求，而導致這個的原因是客戶端用戶的快速增長以及輪詢請求太過頻繁，從而使服務端不得不關閉一些連接以防止宕機。

4.為何客戶端日誌沒有記錄到重試的請求

這個其實就涉及到okhttp的執行流程了，這裡有很多的博客理解有很大的錯誤，其實各個Interceptor並不是順序執行的，而是遞歸執行的，有些Interceptor甚至會執行多次。

而客戶端進行AOP時選擇的切面是RealCall中的getResponseWithInterceptorChain()方法中，而重試發生在內部的RetryAndFollowUpInterceptor中，所以無法得知。

5.擴展問題

那如果這裡不採用AOP,而是採用為okhttpclient增加Interceptor的方法，能夠記錄下重試的請求嗎?

注意以下代碼:

private Response getResponseWithInterceptorChain() throws IOException {n // Build a full stack of interceptors.n List<Interceptor> interceptors = new ArrayList<>();n interceptors.addAll(client.interceptors());n interceptors.add(retryAndFollowUpInterceptor);n interceptors.add(new BridgeInterceptor(client.cookieJar()));n interceptors.add(new CacheInterceptor(client.internalCache()));n interceptors.add(new ConnectInterceptor(client));n if (!retryAndFollowUpInterceptor.isForWebSocket()) {n interceptors.addAll(client.networkInterceptors());n }n interceptors.add(new CallServerInterceptor(n retryAndFollowUpInterceptor.isForWebSocket()));nn Interceptor.Chain chain = new RealInterceptorChain(n interceptors, null, null, null, 0, originalRequest);n return chain.proceed(originalRequest);n }n

顯然，為client增加的Interceptor(確切地說是應用攔截器)在最外層，那它對於RetryAndFollowUpInterceptor的內部執行過程就一無所知了，所以即使採用為okhttpclient增加Interceptor的方法，也無法記錄下重試的過程。

但是如果為client增加networkInterceptors(這些被稱為網路攔截器),由於在最內層，所以是可以記錄到包括重試在內的所有網路請求的。

而如果要想採用AOP的方式記錄下重試的過程的話，還是只能對RetryAndFollowUpInterceptor內層的Interceptor進行切面，比如ConnectInterceptor或者CallServerInterceptor.

okhttp和http 2.0相遇引發的&quot;血案&quot;

okhttp和http 2.0相遇引發的"血案"