[Quant Strategy] When Trading Meets Reinforcement Learning

Introduction

Reinforcement Learning differs from ordinary machine learning in that it is not a pure forecasting method. Instead, it learns from continual feedback between actions and outcomes, and in the end it hands the trainer a reasonable decision rather than just a prediction. This decision-oriented style of learning is a better fit for quantitative trading, because it means reinforcement learning can directly tell you whether to go long or short and how large a position to hold (a minimal sketch of this decision loop is given at the end of this introduction). Because of this property, many people have worked on using reinforcement learning to build automated trading systems and trade profitably with them. For more discussion of the feasibility of reinforcement learning in quantitative trading, see these questions and answers on Quora:

Can deep reinforcement learning be used to make automated trading better?

Is reinforcement learning popularly used in trade execution optimization?

Can reinforcement learning be used to forecast time series?

PS: With deep learning booming over the past two years, the arrival of deep reinforcement learning should bring a new wave. I have heard that some people in the US are already using Deep Reinforcement Learning for quantitative trading, reportedly with decent results.
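To make the point above concrete, here is a minimal, self-contained sketch (not from any of the papers or code referenced below) of what "the output is a decision, not a forecast" looks like: a tabular Q-learning agent whose actions map directly to positions (short, flat, long), with the reward being the position times the next-period return. The state bucketing, thresholds, and hyperparameters are arbitrary placeholders.

import numpy as np

# Minimal sketch: tabular Q-learning where the action IS the trading decision.
# All states, thresholds and hyperparameters here are illustrative placeholders.
ACTIONS = [-1, 0, 1]          # short, flat, long
N_STATES = 5                  # buckets for the most recent return
q = np.zeros((N_STATES, len(ACTIONS)))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def to_state(recent_return):
    # Bucket a recent return into one of N_STATES discrete states.
    edges = np.linspace(-0.01, 0.01, N_STATES - 1)
    return int(np.sum(edges < recent_return))

def step(state, next_return, rng):
    # Pick a position (the decision), observe reward = position * next return,
    # and update the Q-table with the usual one-step Q-learning rule.
    if rng.random() < epsilon:
        a = rng.integers(len(ACTIONS))
    else:
        a = int(np.argmax(q[state]))
    reward = ACTIONS[a] * next_return
    next_state = to_state(next_return)
    q[state, a] += alpha * (reward + gamma * q[next_state].max() - q[state, a])
    return next_state, reward

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.005, 1000)   # fake return series, for illustration only
state, pnl = to_state(0.0), 0.0
for r in returns:
    state, reward = step(state, r, rng)
    pnl += reward
print("cumulative reward on the fake series:", pnl)

After training, np.argmax(q[state]) directly gives the position to hold in each state, which is exactly the kind of output a trading system needs.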

Paper

There are plenty of papers and other resources on reinforcement-learning-based trading. Here is a short list for reference:

Reinforcement Learning for Trading

Reinforcement Learning for Optimized Trade Execution

Algorithm Trading using Q-Learning and Recurrent Reinforcement Learning

Reinforcement Learning for Trading Systems

Performance functions and reinforcement learning for trading systems and portfolios

A Multiagent Approach to Q-Learning for Daily Stock Trading

Adaptive stock trading with dynamic asset allocation using reinforcement learning

An automated FX trading system using adaptive reinforcement learning

Intraday FX trading: An evolutionary reinforcement learning approach

FX trading via recurrent reinforcement learning

Machine Learning for Market Microstructure and High Frequency Trading

Git & Code

There is a related GitHub project on this topic: deependersingla/deep_trader

In addition, someone abroad posted complete Python code for an FX (Forex) implementation (original source: Reinforcement Learning + FX Trading Strategy). The strategy discretizes the annualized Sharpe ratios of its own recent returns over a short and a long window into states, picks one of three actions (short, flat, long) with an epsilon-greedy policy, and averages the accumulated rewards over Monte Carlo episodes to estimate Q-values:

# coding: UTF-8
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import mpl_toolkits.mplot3d  # needed for the 3D Q-value surface plots

#------------------------------------------------------------------------------
# DEFINITION
#------------------------------------------------------------------------------
class RLMomentum():
    """Q-learning momentum strategy: states are discretized annualized Sharpe
    ratios of the strategy's own recent returns over a short and a long window,
    actions are short / flat / long positions, and Q-values are averaged over
    Monte Carlo episodes."""

    def __init__(self, datapath):
        # single-column, headerless CSV of prices
        self.data = pd.read_csv(datapath, header=None).iloc[:, 0]
        self.ret = self.data / self.data.shift(1) - 1
        self.ret = self.ret.fillna(0)
        self.window_short = 20
        self.window_long = 60
        self.samples = len(self.data)
        self.states = 6
        self.actions = 3            # short, flat, long
        self.epsilon = 0.1
        self.gamma = 0.9            # discount factor
        self.mc = 100               # number of Monte Carlo episodes
        self.q = np.zeros((self.states, self.states, self.actions))
        self.rewards = np.zeros((self.states, self.states, self.actions))
        self.count = np.zeros((self.states, self.states, self.actions), dtype=np.int16)
        self.isVisited = np.zeros((self.states, self.states, self.actions), dtype=bool)
        self.momentum = np.zeros(self.samples)

    def init(self):
        # reset per-episode bookkeeping
        self.count = np.zeros((self.states, self.states, self.actions), dtype=np.int16)
        self.isVisited = np.zeros((self.states, self.states, self.actions), dtype=bool)

    def currentState(self, signal):
        # map a Sharpe-ratio signal to one of self.states discrete buckets
        signal = float(signal)
        sep = np.linspace(-1, 1, self.states - 1)
        return sum(sep < signal)

    def selectAction(self, state_short, state_long):
        if (self.q[state_short, state_long, :] == 0).sum() == self.actions:
            # if all action-values are 0, pick a random action
            return np.random.randint(0, self.actions)
        else:
            # epsilon-greedy
            if np.random.random() < self.epsilon:
                return np.random.randint(0, self.actions)
            else:
                return np.argmax(self.q[state_short, state_long, :])

    def actionToPosition(self, action):
        if action == 0:
            return -1   # short
        elif action == 1:
            return 0    # flat
        elif action == 2:
            return 1    # long

    def updateRewards(self, reward, state_short, state_long, action):
        # accumulate discounted rewards for every visited state-action pair
        self.isVisited[state_short, state_long, action] = True
        self.rewards = self.rewards + reward * (self.gamma ** self.count)
        self.count = self.count + self.isVisited

    def updateQ(self, itr):
        # running average of accumulated rewards over episodes
        self.q = (self.q * itr + self.rewards) / (itr + 1)

    def episode(self):
        for i in range(self.samples - 1):
            if i <= self.window_long - 1:
                # warm-up: not enough history yet, just record the raw return
                self.momentum[i] = self.ret.iloc[i]
            else:
                sub_short = self.momentum[i - self.window_short : i - 1]
                sub_long = self.momentum[i - self.window_long : i - 1]
                # state = annualized Sharpe ratio of recent strategy returns
                state_short = self.currentState(np.mean(sub_short) / np.std(sub_short) * np.sqrt(252))
                state_long = self.currentState(np.mean(sub_long) / np.std(sub_long) * np.sqrt(252))
                action = self.selectAction(state_short, state_long)
                reward = self.ret.iloc[i + 1] * self.actionToPosition(action)
                self.updateRewards(reward, state_short, state_long, action)
                self.momentum[i] = reward

    def monteCarlo(self):
        for i in range(self.mc):
            self.init()
            self.episode()
            print("episode", i, "done. cumulative return is", sum(self.momentum))
            self.updateQ(i)
            #plt.plot(100 * (1 + self.momentum).cumprod(), label="RL-momentum " + str(i))
        #plt.plot(100 * (1 + self.ret).cumprod(), label="long-only")
        #plt.plot(100 * (1 + self.momentum).cumprod(), label="RL-momentum")
        #plt.legend(loc="best")
        #plt.show()

        # plot the Q-value matrix as one 3D surface per action
        x = np.linspace(0, 5, self.states)
        y = np.linspace(0, 5, self.states)
        x, y = np.meshgrid(x, y)
        for i in range(self.actions):
            if i == 0:
                position = "short"
            elif i == 1:
                position = "flat"
            elif i == 2:
                position = "long"
            fig = plt.figure()
            ax = fig.add_subplot(projection="3d")
            ax.set_xlabel("state_short")
            ax.set_ylabel("state_long")
            ax.set_zlabel("Q-value")
            ax.set_title("Q-value for " + position + " position")
            #ax.view_init(90, 90)
            surf = ax.plot_surface(x, y, self.q[:, :, i], rstride=1, cstride=1,
                                   cmap=cm.coolwarm, linewidth=0, antialiased=False)
            plt.show()

#------------------------------------------------------------------------------
# MAIN
#------------------------------------------------------------------------------
m = RLMomentum("usdjpy.csv")
m.monteCarlo()
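If you do not have a usdjpy.csv price file at hand, one quick way to try the class above is to generate a synthetic random-walk price series first. This snippet is not part of the original post; the file name and parameters are arbitrary, and it assumes RLMomentum from the code above is already defined in the same script or session.

import numpy as np
import pandas as pd

# Write a synthetic random-walk "price" series in the single-column,
# headerless CSV format that RLMomentum expects.
rng = np.random.default_rng(42)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 0.005, 2000)))
pd.Series(prices).to_csv("fake_fx.csv", header=False, index=False)

m = RLMomentum("fake_fx.csv")
m.monteCarlo()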

The Machine Learning Strategy Group Is Recruiting

I am currently working on A-share order data, high-frequency futures data (half-second level), and using machine learning to optimize strategies. Our Quantitative Investment Association has a deep-intelligence research group advised by a professor from the Department of Computer Science; its members mainly come from the Institute for Interdisciplinary Information Sciences, Computer Science, Automation, Electronic Engineering, the School of Economics and Management, and other departments. If you have some background in machine learning and are interested in applying it to quantitative finance, feel free to message me to join. (You do not have to be a Tsinghua student; master's and PhD students and strong undergraduates from good universities in China or abroad, as well as practitioners in machine learning or quantitative finance, are all welcome to apply.)

Links:

My Zhihu account: 溫如

Column: 清華大學量化投資協會成果集萃 (selected work from the Tsinghua University Quantitative Investment Association) - Zhihu Column

