okhttp和http 2.0相遇引發的"血案"

1.起因

今天在做流量分析時,從服務端記錄的請求日誌發現流量排名前10的用戶里有一個用戶的某個網路請求竟然在很短的時間內請求了10000多次! 那段時間內的流量佔比如下圖:

2.分析

考慮到客戶端採用AOP的方式記錄了所有的網路請求信息,於是趕緊把客戶端日誌拉上來進行分析,看看是不是某些條件觸發的,卻意外地並沒有看到頻繁地網路請求! 我的內心是崩潰的!!!

既然客戶端日誌查不出什麼問題,於是只能繼續從服務端日誌入手,仔細看一下簡直嚇尿了,那一萬多個請求竟然都是同一個requestId,於是懷疑是服務端的日誌記錄有問題,但是可以看到各個請求雖然requestId一樣,但是時間戳是不一樣的,那就排除了服務端的日誌記錄出錯的可能!

那就只能是客戶端的網路請求的重試機制有問題了!!!

但是OkHttp的重試機制很簡單啊,就是添加攔截器,示例代碼如下:

client.interceptors().add(new Interceptor() {n @Overriden public Response intercept(Chain chain) throws IOException {n Request request = chain.request();nn // try the requestn Response response = chain.proceed(request);nn int tryCount = 0;n while (!response.isSuccessful() && tryCount < 3) {nn Log.d("intercept", "Request is not successful - " + tryCount);nn tryCount++;nn // retry the requestn response = chain.proceed(request);n }nn // otherwise just pass the original response onn return response;n }n });n

而且由於並不是一個很緊迫的配置項,所以當時並沒有給它添加重試的攔截器。

那就只可能是它自身的重試機制出問題了,由於我們App裡面用的是okhttp版本是3.4.2,其對應的重試代碼如下:

@Override public Response intercept(Chain chain) throws IOException {n Request request = chain.request();nn streamAllocation = new StreamAllocation(n client.connectionPool(), createAddress(request.url()));nn int followUpCount = 0;n Response priorResponse = null;n while (true) {n if (canceled) {n streamAllocation.release();n throw new IOException("Canceled");n }nn Response response = null;n boolean releaseConnection = true;n try {n response = ((RealInterceptorChain) chain).proceed(request, streamAllocation, null, null);n releaseConnection = false;n } catch (RouteException e) {n // The attempt to connect via a route failed. The request will not have been sent.n if (!recover(e.getLastConnectException(), true, request)) throw e.getLastConnectException();n releaseConnection = false;n continue;n } catch (IOException e) {n // An attempt to communicate with a server failed. The request may have been sent.n if (!recover(e, false, request)) throw e;n releaseConnection = false;n continue;n } finally {n // Were throwing an unchecked exception. Release any resources.n if (releaseConnection) {n streamAllocation.streamFailed(null);n streamAllocation.release();n }n }nn // Attach the prior response if it exists. Such responses never have a body.n if (priorResponse != null) {n response = response.newBuilder()n .priorResponse(priorResponse.newBuilder()n .body(null)n .build())n .build();n }nn Request followUp = followUpRequest(response);nn if (followUp == null) {n if (!forWebSocket) {n streamAllocation.release();n }n return response;n }nn closeQuietly(response.body());nn if (++followUpCount > MAX_FOLLOW_UPS) {n streamAllocation.release();n throw new ProtocolException("Too many follow-up requests: " + followUpCount);n }nn if (followUp.body() instanceof UnrepeatableRequestBody) {n throw new HttpRetryException("Cannot retry streamed HTTP body", response.code());n }nn if (!sameConnection(response, followUp.url())) {n streamAllocation.release();n streamAllocation = new StreamAllocation(n client.connectionPool(), createAddress(followUp.url()));n } else if (streamAllocation.stream() != null) {n throw new IllegalStateException("Closing the body of " + responsen + " didnt close its backing stream. Bad interceptor?");n }nn request = followUp;n priorResponse = response;n }n }n

這段代碼是在RetryAndFollowUpInterceptor中的intercept()方法中,這個類的目的很明確,就是用於重試和跳轉,但是redirect最大次數是20, 那怎麼會無限重試呢?

顯然只可能是在if (++followUpCount > MAX_FOLLOW_UPS) 這個條件判斷之前,繼續分析下去會發現在出現RouteException或者IOException時,會調用recover()方法,而如果每次recover()都返回true的話,就不會釋放連接,而是一直重試下去。

那有沒有可能我們這個請求是發生了RouteException或者IOException呢?

看到服務端有如下記錄:

注意其中的errorDomain和errorDescription,很清楚地可以看到是發生了shutdown的IOException,其實這種情況一般發生在伺服器處於較大壓力的情況下,再看一下recover()這個方法:

/**n * Report and attempt to recover from a failure to communicate with a server. Returns true ifn * {@code e} is recoverable, or false if the failure is permanent. Requests with a body can onlyn * be recovered if the body is buffered.n */n private boolean recover(IOException e, boolean routeException, Request userRequest) {n streamAllocation.streamFailed(e);nn // The application layer has forbidden retries.n if (!client.retryOnConnectionFailure()) return false;nn // We cant send the request body again.n if (!routeException && userRequest.body() instanceof UnrepeatableRequestBody) return false;nn // This exception is fatal.n if (!isRecoverable(e, routeException)) return false;nn // No more routes to attempt.n if (!streamAllocation.hasMoreRoutes()) return false;nn // For failure recovery, use the same route selector with a new connection.n return true;n }n

由於應用層並沒有禁止重試,Request的body()也並非UnrepeatableRequestBody,而且路由選擇也不為false, 所以最終返回true,從而判定為連接可以恢復,最終導致無限重連,耗費大量的流量

之後在看okhttp的issues時發現其實這個bug早在2016.7.28(對應版本為3.4.1)就有了提了,如下圖所示:

只是這個bug的修復在3.4.2並沒有完成,而是一直到3.5.0才完成,詳細信息見鏈接[Gracefully recover from an HTTP/2 connection shutdown at start of request. #2992](Gracefully recover from an HTTP/2 connection shutdown at start of request. by dave-r12 · Pull Request #2992 · square/okhttp) 所以如果遇到類似問題的同學,請將okhttp升級到3.5.0以上。

3.Root Cause

不管是從okhttp的issue還是我們自己的日誌記錄來看,這種情況的出現都是非常小概率的,那到底是什麼情況下會出現shutdown這樣的IOException呢?

採用charles抓包,發現該網路請求採用的竟然是http 2.0協議,如下圖所示:

為了更好地理解h2的請求過程,簡單地了解一下h2與1.1請求的區別。

h2是基於Google的SPDY協議的,其實h2和1.1最大的區別就是多路復用,即1.1中必須一個連接對應一個stream,而h2中可以多個stream共用一個連接,如下圖所示:

其實現的方式就是將要傳輸的信息分割成一個個二進位幀,其中首部信息會放到HEADER Frame中,而request body會放到DATA Frame中,示意圖如下:

幀傳輸的示意圖如下:

而數據幀的格式其實在okhttp的注釋中就有,如下:

p.s:不得不佩服維護okhttp的團隊注釋寫得非常好。

由於TCP協議的滑動窗口,因而1.1中每次建立連接後都是需要慢啟動,即數據先是慢慢地傳,之後滑動窗口變大,才能高速傳遞,而h2由於共用連接,不僅提高於吞吐量,而且可以使平均傳輸速度增大,如下的動圖很好地表示了h2與1.1在傳輸相同圖片時的對比:

另外,由於h2進行了首部壓縮,所以其實它還可以節省流量。

了解了這些,知道h2是通過幀傳輸的,就能很快地梳理出h2協議的請求過程,如下圖所示:

其中重要的節點我都用紅色標出來了,如前面所說,h2協議是一幀一幀地讀取地,所以正常的請求下Http2.Reader.nextFrame()方法中的type是TYPE_HEADERS和TYPE_DATA,該方法如下:

@Override public boolean nextFrame(Handler handler) throws IOException {n try {n source.require(9); // Frame header sizen } catch (IOException e) {n return false; // This might be a normal socket close.n }nn /* 0 1 2 3n * 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1n * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+n * | Length (24) |n * +---------------+---------------+---------------+n * | Type (8) | Flags (8) |n * +-+-+-----------+---------------+-------------------------------+n * |R| Stream Identifier (31) |n * +=+=============================================================+n * | Frame Payload (0...) ...n * +---------------------------------------------------------------+n */n int length = readMedium(source);n if (length < 0 || length > INITIAL_MAX_FRAME_SIZE) {n throw ioException("FRAME_SIZE_ERROR: %s", length);n }n byte type = (byte) (source.readByte() & 0xff);n byte flags = (byte) (source.readByte() & 0xff);n int streamId = (source.readInt() & 0x7fffffff); // Ignore reserved bit.n if (logger.isLoggable(FINE)) logger.fine(formatHeader(true, streamId, length, type, flags));nn switch (type) {n case TYPE_DATA:n readData(handler, length, flags, streamId);n break;nn case TYPE_HEADERS:n readHeaders(handler, length, flags, streamId);n break;nn case TYPE_PRIORITY:n readPriority(handler, length, flags, streamId);n break;nn case TYPE_RST_STREAM:n readRstStream(handler, length, flags, streamId);n break;nn case TYPE_SETTINGS:n readSettings(handler, length, flags, streamId);n break;nn case TYPE_PUSH_PROMISE:n readPushPromise(handler, length, flags, streamId);n break;nn case TYPE_PING:n readPing(handler, length, flags, streamId);n break;nn case TYPE_GOAWAY:n readGoAway(handler, length, flags, streamId);n break;nn case TYPE_WINDOW_UPDATE:n readWindowUpdate(handler, length, flags, streamId);n break;nn default:n // Implementations MUST discard frames that have unknown or unsupported types.n source.skip(length);n }n return true;n }n

一直到服務端出現某種錯誤關閉連接,okhttp就會從數據幀中讀取到TYPE_GOAWAY的標誌位,然後進入readGoAway()的流程中,該方法如下:

private void readGoAway(Handler handler, int length, byte flags, int streamId)n throws IOException {n if (length < 8) throw ioException("TYPE_GOAWAY length < 8: %s", length);n if (streamId != 0) throw ioException("TYPE_GOAWAY streamId != 0");n int lastStreamId = source.readInt();n int errorCodeInt = source.readInt();n int opaqueDataLength = length - 8;n ErrorCode errorCode = ErrorCode.fromHttp2(errorCodeInt);n if (errorCode == null) {n throw ioException("TYPE_GOAWAY unexpected error code: %d", errorCodeInt);n }n ByteString debugData = EMPTY;n if (opaqueDataLength > 0) { // Must read debug data in order to not corrupt the connection.n debugData = source.readByteString(opaqueDataLength);n }n handler.goAway(lastStreamId, errorCode, debugData);n }n

注意到這個方法中的handler其實是實現了Handler介面的FramedConnection.Reader,其goAway()方法如下:

@Override public void goAway(int lastGoodStreamId, ErrorCode errorCode, ByteString debugData) {n if (debugData.size() > 0) { // TODO: log the debugDatan }nn // Copy the streams first. We dont want to hold a lock when we call receiveRstStream().n FramedStream[] streamsCopy;n synchronized (FramedConnection.this) {n streamsCopy = streams.values().toArray(new FramedStream[streams.size()]);n shutdown = true;n }nn // Fail all streams created after the last good stream ID.n for (FramedStream framedStream : streamsCopy) {n if (framedStream.getId() > lastGoodStreamId && framedStream.isLocallyInitiated()) {n framedStream.receiveRstStream(ErrorCode.REFUSED_STREAM);n removeStream(framedStream.getId());n }n }n }n

注意其中將shutdown設置為true了,之後,當StreamAllocation再次調用nextFrame()時,就會拋出名為"shutdown"的IOException, 主要代碼如下:

private FramedStream newStream(int associatedStreamId, List<Header> requestHeaders, boolean out,n boolean in) throws IOException {n boolean outFinished = !out;n boolean inFinished = !in;n boolean flushHeaders;n FramedStream stream;n int streamId;nn synchronized (frameWriter) {n synchronized (this) {n if (shutdown) {n throw new IOException("shutdown");n }n ...n return stream;n }n

由於中間沒有捕獲IOException的操作,所以這個異常就會一直往上傳遞到RetryAndFollowUpInterceptor的intercept()方法中,最終導致前面所說的死循環。

而至於服務端為何會在某些請求中shutdown, 查看流量統計就可以發現這個介面在一天時間內有高達幾億次請求,遠遠超過了當初設計的要求,而導致這個的原因是客戶端用戶的快速增長以及輪詢請求太過頻繁,從而使服務端不得不關閉一些連接以防止宕機。

4.為何客戶端日誌沒有記錄到重試的請求

這個其實就涉及到okhttp的執行流程了,這裡有很多的博客理解有很大的錯誤,其實各個Interceptor並不是順序執行的,而是遞歸執行的,有些Interceptor甚至會執行多次。

而客戶端進行AOP時選擇的切面是RealCall中的getResponseWithInterceptorChain()方法中,而重試發生在內部的RetryAndFollowUpInterceptor中,所以無法得知。

5.擴展問題

那如果這裡不採用AOP,而是採用為okhttpclient增加Interceptor的方法,能夠記錄下重試的請求嗎?

注意以下代碼:

private Response getResponseWithInterceptorChain() throws IOException {n // Build a full stack of interceptors.n List<Interceptor> interceptors = new ArrayList<>();n interceptors.addAll(client.interceptors());n interceptors.add(retryAndFollowUpInterceptor);n interceptors.add(new BridgeInterceptor(client.cookieJar()));n interceptors.add(new CacheInterceptor(client.internalCache()));n interceptors.add(new ConnectInterceptor(client));n if (!retryAndFollowUpInterceptor.isForWebSocket()) {n interceptors.addAll(client.networkInterceptors());n }n interceptors.add(new CallServerInterceptor(n retryAndFollowUpInterceptor.isForWebSocket()));nn Interceptor.Chain chain = new RealInterceptorChain(n interceptors, null, null, null, 0, originalRequest);n return chain.proceed(originalRequest);n }n

顯然,為client增加的Interceptor(確切地說是應用攔截器)在最外層,那它對於RetryAndFollowUpInterceptor的內部執行過程就一無所知了,所以即使採用為okhttpclient增加Interceptor的方法,也無法記錄下重試的過程。

但是如果為client增加networkInterceptors(這些被稱為網路攔截器),由於在最內層,所以是可以記錄到包括重試在內的所有網路請求的。

而如果要想採用AOP的方式記錄下重試的過程的話,還是只能對RetryAndFollowUpInterceptor內層的Interceptor進行切面,比如ConnectInterceptor或者CallServerInterceptor.


推薦閱讀:

基於 Appium 的 Android UI 自動化測試
給力!據說裝了這些APP,就能讓你的安卓機徹底和卡頓說再見
SRC 對於音質會有怎樣的影響?
「合理」的修改簡訊時間,有時候是很必要的

TAG:Android | Android开发 | 计算机网络 |