
This article is adapted and reorganized from Auquan's Pairs Trading Strategy (it is not original work), with two main modifications:

  • The US stocks are replaced with A-shares (mainly bank stocks). (Although A-shares cannot in practice be sold short, this is not a problem for research purposes.)
  • An ADF test of cointegration is added, to help readers understand the definition of cointegration more deeply.

Together these form a fairly complete introduction to pairs trading. All code in this article has been run on JoinQuant.


Pairs trading is a nice example of a strategy based on mathematical analysis.

The principle is as follows:
Let's say you have a pair of securities X and Y that have some underlying economic link. An example might be two companies that manufacture the same product, for example Pepsi and Coca Cola. You expect the spread (ratio or difference in prices) between these two to remain constant with time. However, from time to time, there might be a divergence in the spread between these two pairs. The divergence within a pair can be caused by temporary supply/demand changes, large buy/sell orders for one security, reaction for important news about one of the companies, and so on. When there is a temporary divergence between the two securities, i.e. one stock moves up while the other moves down, the pairs trade would be to short the outperforming stock and to long the underperforming one, betting that the "spread" between the two would eventually converge.

Pairs trading is a market neutral trading strategy enabling traders to profit from virtually any market conditions: uptrend, downtrend, or sideways movement.

We'll start by constructing an artificial example.

1. Explaining the Concept: start by generating two fake securities

We model X's daily returns by drawing from a normal distribution. Then we perform a cumulative sum to get the value of X on each day.

import numpy as np
import pandas as pd
import statsmodels
from statsmodels.tsa.stattools import coint
import matplotlib.pyplot as plt
# just set the seed for the random number generator
np.random.seed(107)
# Generate the daily returns
Xreturns = np.random.normal(0, 1, 100) 
# sum them and shift all the prices up
X = pd.Series(np.cumsum(Xreturns), name='X') + 50
X.plot(figsize=(15,7))
plt.show()

Now we generate Y. Y is supposed to have a deep economic link to X, so the price of Y should vary pretty similarly. We model this by taking X, shifting it up and adding some random noise drawn from a normal distribution.

noise = np.random.normal(0, 1, 100)
Y = X + 5 + noise
Y.name = 'Y'
pd.concat([X, Y], axis=1).plot(figsize=(15,7))
plt.show()

2. Cointegration

Cointegration, very loosely speaking, is a "different" form of correlation. If two series are cointegrated, the ratio between them will vary around a mean. For pairs trading to work between two time series, the expected value of the ratio over time must converge to the mean, i.e. they should be cointegrated. The time series we constructed above are cointegrated.

Cointegration, as distinct from correlation, requires two conditions to hold:
1) The log-price series of stock X and stock Y are both integrated of order one, i.e. I(1).
If the log price of stock X is a non-stationary time series (which is common knowledge) and its simple single-period return series (the first difference of the log prices can be treated as the simple single-period return) is a stationary time series, then the log price of stock X is integrated of order one.
2) The residual series of the regression constructed from stock X and stock Y is stationary.
If both conditions are satisfied, stock X and stock Y can be said to be cointegrated.

High correlation does not necessarily imply cointegration: even when two pairs of stocks have similar correlations, the likelihood that they are actually cointegrated can differ greatly. A quick check on the synthetic X and Y above illustrates the difference.
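
As a minimal sketch (reusing the synthetic X and Y generated above), correlation and the cointegration test answer different questions: the first measures how the series move together day to day, while the second asks whether their spread is stationary.

# Correlation of the two synthetic series (close to 1 by construction)
print('correlation:', X.corr(Y))
# Cointegration test: a small p-value supports cointegration
c_score, c_pvalue, _ = coint(X, Y)
print('cointegration p-value:', c_pvalue)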

(Y/X).plot(figsize=(15,7)) 
plt.axhline((Y/X).mean(), color='red', linestyle='--') 
plt.xlabel('Time')
plt.legend(['Price Ratio', 'Mean'])
plt.show()

3. Testing for Cointegration

Method 1: test for cointegration directly from the definition.

First, verify that the log prices of stock X and stock Y are non-stationary series.

For the stationarity test we use the ADF (Augmented Dickey-Fuller) test:

  • Null hypothesis: the series contains a unit root, i.e. it is non-stationary
  • Alternative hypothesis: the series has no unit root, i.e. it is stationary

Notes on the ADF(y, lags, trend, max_lags, method) function:

(Reference: 《量化投资以Python为工具》, p. 338)

  • The lags parameter determines the test model; its value is the number of lagged first differences included.
  • Compare the Test Statistic in the output against the Critical Values to see which region it falls in; if it is below the critical value at the chosen significance level (1% or 5%), reject the null hypothesis. (Alternatively, just look at the p-value: if it is small enough, reject the null hypothesis.)
# Daily closing prices of 600015 (Hua Xia Bank) and 601169 (Bank of Beijing), via JoinQuant's get_price
stock_list = ['600015.XSHG', '601169.XSHG']
sh = get_price(stock_list, start_date="2019-01-01", end_date="2020-04-30", frequency="daily", fields=['close'])['close']
PAf = sh['600015.XSHG']
PBf = sh['601169.XSHG']
# Work with log prices
PAflog = np.log(PAf)
PBflog = np.log(PBf)
from arch.unitroot import ADF
# ADF test on the log-price series of 600015
adfA = ADF(PAflog)
print(adfA.summary().as_text())
   Augmented Dickey-Fuller Results   
=====================================
Test Statistic                 -0.972
P-value                         0.763
Lags                                0
-------------------------------------
Trend: Constant
Critical Values: -3.45 (1%), -2.87 (5%), -2.57 (10%)
Null Hypothesis: The process contains a unit root.
Alternative Hypothesis: The process is weakly stationary.

From these results, the Test Statistic = -0.972 is greater than the critical values at the 1%, 5% and 10% levels, so we cannot reject the null hypothesis; the log-price series of 600015 (Hua Xia Bank) is therefore non-stationary. Likewise, the log-price series of 601169 (Bank of Beijing) is also non-stationary (the original text omits this check; a sketch of it follows).
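
The omitted check simply repeats the same test on the other log-price series; a minimal sketch:

# ADF test on the log-price series of 601169 (Bank of Beijing);
# we expect not to reject the unit-root null here either
adfB = ADF(PBflog)
print(adfB.summary().as_text())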

Second, verify that the simple single-period return series of stock X and stock Y are stationary.

# First difference of the log prices = simple single-period returns of 600015
retA = PAflog.diff()[1:]
adfretA = ADF(retA)
print(adfretA.summary().as_text())
 Augmented Dickey-Fuller Results   
=====================================
Test Statistic                -17.403
P-value                         0.000
Lags                                0
-------------------------------------
Trend: Constant
Critical Values: -3.45 (1%), -2.87 (5%), -2.57 (10%)
Null Hypothesis: The process contains a unit root.
Alternative Hypothesis: The process is weakly stationary.

From these results, the first-difference series of the log prices of 600015 (Hua Xia Bank), i.e. its simple single-period returns, is stationary. Likewise, the simple single-period return series of 601169 (Bank of Beijing) is also stationary (again the check is omitted in the text; a sketch follows).
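
Again, the omitted check mirrors the one above; a minimal sketch:

# ADF test on the return series of 601169; we expect to reject the unit-root null
retB = PBflog.diff()[1:]
adfretB = ADF(retB)
print(adfretB.summary().as_text())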

Finally, verify that the residual series of the regression constructed from stock X and stock Y is stationary.

import statsmodels.api as sm
# Regress the log price of 601169 on the log price of 600015 (with an intercept)
model = sm.OLS(PBflog, sm.add_constant(PAflog))
results = model.fit()
print(results.summary())
# regression intercept
alpha = results.params[0]
# regression slope coefficient
beta = results.params[1]
# residual (spread) series
spread = PBflog - beta*PAflog - alpha
# plot the residual series over time
spread.plot()
plt.title('Spread series')
# stationarity (ADF) test on the residuals
adfSpread = ADF(spread, trend='nc')  # the residuals have zero mean, so trend is set to 'nc' (no constant)
print(adfSpread.summary().as_text())
Augmented Dickey-Fuller Results   
=====================================
Test Statistic                 -3.555
P-value                         0.000
Lags                                3
-------------------------------------
Trend: No Trend
Critical Values: -2.57 (1%), -1.94 (5%), -1.62 (10%)
Null Hypothesis: The process contains a unit root.
Alternative Hypothesis: The process is weakly stationary.

From these results, the residual series is stationary.

At this point we can say that stock X and stock Y are cointegrated.

Method 2: There is a convenient test that lives in statsmodels.tsa.stattools. We should see a very low p-value, as we've artificially created two series that are as cointegrated as physically possible.
# compute the p-value of the cointegration test
# will inform us as to whether the ratio between the 2 timeseries is stationary
# around its mean
score, pvalue, _ = coint(X,Y)

Just look at the p-value: if it is small enough, we can conclude that stock X and stock Y are cointegrated. (Method 2 is clearly much simpler than Method 1, but Method 1 is more helpful for understanding the definition of cointegration.) A sketch of this decision rule applied to the two bank stocks follows.
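
As a small sketch, the same convenient test can be applied to the two bank-stock log-price series from Method 1 (the 0.05 threshold is just the conventional significance level, not anything prescribed by the library):

# coint() tests the null hypothesis of no cointegration
score, pvalue, _ = coint(PAflog, PBflog)
print('p-value:', pvalue)
if pvalue < 0.05:
    print('reject the null of no cointegration -> treat the pair as cointegrated')
else:
    print('cannot reject the null -> no evidence of cointegration')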

4. How to make a pairs trade?

Because two cointegrated time series (such as 600015.XSHG and 601169.XSHG above) drift towards and apart from each other, there will be times when the spread is high and times when the spread is low. We make a pairs trade by buying one security and selling another. This way, if both securities go down together or go up together, we neither make nor lose money — we are market neutral.

Going back to X and Y above that follow Y = ⍺ X + e, such that the ratio Y/X moves around its mean value ⍺, we make money when the ratio reverts to the mean. To do this we watch for times when X and Y are far apart, i.e. the ratio is too high or too low:

  • Going Long the Ratio This is when the ratio is smaller than usual and we expect it to increase. In the above example, we place a bet on this by buying Y and selling X.
  • Going Short the Ratio This is when the ratio is larger than usual and we expect it to decrease. In the above example, we place a bet on this by selling Y and buying X.

Note that we always have a “hedged position”: a short position makes money if the security sold loses value, and a long position will make money if a security gains value, so we’re immune to overall market movement. We only make or lose money if securities X and Y move relative to each other.
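
A tiny numeric sketch of this point, with invented numbers and (unlike the backtest later in the article) dollar-neutral sizing:

# Hypothetical position: long 100 yuan of notional in Y, short 100 yuan of notional in X
long_notional, short_notional = 100.0, 100.0
# Case 1: the whole market rises 2%; both legs move together and the P&L nets to zero
print('market move only:', long_notional*0.02 - short_notional*0.02)
# Case 2: the spread converges; Y (the underperformer) gains 3% while X loses 1%
print('spread convergence:', long_notional*0.03 - short_notional*(-0.01))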

5. Using Data to find securities that behave like this

The best way to do this is to start with securities you suspect may be cointegrated and perform a statistical test. If you just run statistical tests over all pairs, you’ll fall prey to multiple comparison bias.

Multiple comparisons bias is simply the increased chance of incorrectly generating a significant p-value when many tests are run. If 100 tests are run on random data, we should expect to see about 5 p-values below 0.05. If you compare n securities for cointegration, you will perform n(n-1)/2 comparisons, and you should expect to see many spuriously significant p-values, a number that grows as n grows. To avoid this, pick a small number of pairs you have reason to suspect might be cointegrated and test each individually. This results in less exposure to multiple comparisons bias.
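
A quick simulation illustrates the effect. The series below are independent random walks, so none of the pairs is truly cointegrated and every "hit" is a false positive (the exact count depends on the random seed):

# Run coint() on every pair of 20 independent random walks and count p-values < 0.05
np.random.seed(1)
fake = pd.DataFrame(np.cumsum(np.random.normal(0, 1, (250, 20)), axis=0))
false_hits = 0
for i in range(20):
    for j in range(i + 1, 20):
        _, p, _ = coint(fake[i], fake[j])
        if p < 0.05:
            false_hits += 1
print(false_hits, 'of', 20*19//2, 'pairs look "cointegrated" by chance')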

So let's try to find some securities that display cointegration. Let's work with a basket of China's bank stocks, such as "600000.XSHG", "600015.XSHG", "600016.XSHG", "600036.XSHG", "601009.XSHG", etc. These stocks operate in a similar segment and could have cointegrated prices. The function below scans through a list of securities and tests for cointegration between all pairs. It returns a cointegration test score matrix, a p-value matrix, and all pairs for which the p-value was less than 0.05. This method is prone to multiple comparisons bias, and in practice the securities should be subject to a second verification step. Let's ignore this for the sake of this example.

def find_cointegrated_pairs(data):
    # number of securities (columns) in data
    n = data.shape[1]
    # initialize the score and p-value matrices
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    # extract the column names (security codes)
    keys = data.keys()
    # pairs whose cointegration p-value is below 0.05
    pairs = []
    for i in range(n):
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs
stock_list = ["600000.XSHG", "600015.XSHG", "600016.XSHG", "600036.XSHG", "601009.XSHG","601166.XSHG", "601169.XSHG", "601328.XSHG", "601398.XSHG", "601988.XSHG", "601998.XSHG"]
data = get_price(stock_list, start_date="2019-01-01", end_date="2020-04-30", frequency="daily", fields=["close"])["close"]
# Heatmap to show the p-values of the cointegration test
# between each pair of stocks
scores, pvalues, pairs = find_cointegrated_pairs(data)
import seaborn
m = [0,0.2,0.4,0.6,0.8,1]
seaborn.heatmap(pvalues, 
                xticklabels=stock_list, 
                yticklabels=stock_list, 
                cmap='RdYlGn_r' , 
                mask = (pvalues >= 0.98))
plt.show()
print(pairs)
[('600015.XSHG', '601169.XSHG'), ('600036.XSHG', '601328.XSHG'), ('601166.XSHG', '601398.XSHG')]

Looks like '600015.XSHG' and '601169.XSHG' are cointegrated. Let's take a look at the prices to make sure there's nothing weird going on.

S1 = data['600015.XSHG']
S2 = data['601169.XSHG']
score, pvalue, _ = coint(S1, S2)
print(pvalue)
ratios = S1 / S2
ratios.plot(figsize=(15,7))
plt.axhline(ratios.mean())
plt.legend(['Price Ratio'])
plt.show()


The ratio does look like it moved around a stable mean. The absolute ratio isn't very useful in statistical terms. It is more helpful to normalize our signal by treating it as a z-score. The z-score is defined as:

Z Score (Value) = (Value - Mean) / Standard Deviation

WARNING
In practice this is usually done to give some scale to the data, but it assumes an underlying distribution, usually normal. However, much financial data is not normally distributed, and we must be very careful not to simply assume normality, or any specific distribution, when generating statistics. The true distribution of ratios could be very fat-tailed and prone to extreme values that mess up our model and result in large losses.
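
As a rough sanity check along these lines (a sketch, not a substitute for proper distributional analysis), we can look at the skewness and excess kurtosis of the ratio series and run a normality test:

from scipy import stats
# Large excess kurtosis suggests fatter tails than a normal distribution,
# in which case fixed z-score thresholds deserve extra caution
print('skewness:', stats.skew(ratios))
print('excess kurtosis:', stats.kurtosis(ratios))
print('normality test p-value:', stats.normaltest(ratios)[1])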

def zscore(series):
    return (series - series.mean()) / np.std(series)
zscore(ratios).plot(figsize=(15,7))
plt.axhline(zscore(ratios).mean(), color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Ratio z-score', 'Mean', '+1', '-1'])
plt.show()

6. Simple Strategy:

  • Go "Long" the ratio whenever the z-score is below -1.0
  • Go "Short" the ratio when the z-score is above 1.0
  • Exit positions when the z-score approaches zero

This is just the tip of the iceberg, and only a very simplistic example to illustrate the concepts.

  • In practice you would want to compute a more optimal weighting for how many shares to hold for S1 and S2 (see the sketch after this list).
  • You would also want to trade using constantly updating statistics.
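
On the first point, one simple scheme (a sketch of just one possible weighting, not the "1 ratio" sizing used by the backtest below; the notional amount is hypothetical) is to make the two legs dollar-neutral:

# Hypothetical dollar-neutral sizing: equal notional on each leg,
# so a common market move affects both legs by the same amount
capital_per_leg = 100000.0
p1 = data['600015.XSHG'].iloc[-1]
p2 = data['601169.XSHG'].iloc[-1]
shares_s1 = int(capital_per_leg / p1)
shares_s2 = int(capital_per_leg / p2)
# Going long the ratio: buy shares_s1 of 600015 and sell shares_s2 of 601169
print(shares_s1, 'shares of 600015 against', shares_s2, 'shares of 601169')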

First, split the data into a 70% training set and a 30% test set.

ratios = data['600015.XSHG'] / data['601169.XSHG']
print(len(ratios))
le = int(len(ratios)*7/10)
train = ratios[:le]
test = ratios[le:]

In general taking a statistic over your whole sample size can be bad. For example, if the market is moving up, and both securities with it, then your average price over the last 3 years may not be representative of today. For this reason traders often use statistics that rely on rolling windows of the most recent data.

Instead of using the raw ratio values, let's use the 5-day moving average of the ratio to compute the z-score, with the 60-day moving average and 60-day standard deviation as the mean and standard deviation.

ratios_mavg5 = train.rolling(window=5,center=False).mean()
ratios_mavg60 = train.rolling(window=60,center=False).mean()
std_60 = train.rolling(window=60,center=False).std()
zscore_60_5 = (ratios_mavg5 - ratios_mavg60)/std_60
plt.figure(figsize=(15,7))
plt.plot(train.index, train.values)
plt.plot(ratios_mavg5.index, ratios_mavg5.values)
plt.plot(ratios_mavg60.index, ratios_mavg60.values)
plt.legend(['Ratio','5d Ratio MA', '60d Ratio MA'])
plt.ylabel('Ratio')
plt.show()

We can use the moving averages to compute the z-score of the ratio at each given time. This will tell us how extreme the ratio is and whether it's a good idea to enter a position at this time. Let's take a look at the z-score now.

# Take a rolling 60 day standard deviation
std_60 = train.rolling(window=60,center=False).std()
std_60.name = 'std 60d'
# Compute the z score for each day
zscore_60_5 = (ratios_mavg5 - ratios_mavg60)/std_60
zscore_60_5.name = 'z-score'
plt.figure(figsize=(15,7))
zscore_60_5.plot()
plt.axhline(0, color='black')
plt.axhline(1.0, color='red', linestyle='--')
plt.axhline(-1.0, color='green', linestyle='--')
plt.legend(['Rolling Ratio z-Score', 'Mean', '+1', '-1'])
plt.show()

The z-score doesn't mean much out of context; let's plot it next to the prices to get an idea of what it looks like.

# Plot the ratios and buy and sell signals from z score
plt.figure(figsize=(15,7))
train[60:].plot()
buy = train.copy()
sell = train.copy()
buy[zscore_60_5>-1] = 0
sell[zscore_60_5<1] = 0
buy[60:].plot(color='g', linestyle='None', marker='^')
sell[60:].plot(color='r', linestyle='None', marker='^')
x1,x2,y1,y2 = plt.axis()
plt.axis((x1,x2,ratios.min(),ratios.max()))
plt.legend(['Ratio', 'Buy Signal', 'Sell Signal'])
plt.show()

What does that mean for actual stocks that we are trading? Let’s take a look.

# Plot the prices and buy and sell signals from z score
plt.figure(figsize=(18,9))
S1 = data['600015.XSHG'].iloc[:le]
S2 = data['601169.XSHG'].iloc[:le]
S1[60:].plot(color='b')
S2[60:].plot(color='c')
buyR = 0*S1.copy()
sellR = 0*S1.copy()
# When buying the ratio, buy S1 and sell S2
buyR[buy!=0] = S1[buy!=0]
sellR[buy!=0] = S2[buy!=0]
# When selling the ratio, sell S1 and buy S2 
buyR[sell!=0] = S2[sell!=0]
sellR[sell!=0] = S1[sell!=0]
buyR[60:].plot(color='g', linestyle='None', marker='^')
sellR[60:].plot(color='r', linestyle='None', marker='^')
x1,x2,y1,y2 = plt.axis()
plt.axis((x1,x2,min(S1.min(),S2.min())*0.95,max(S1.max(),S2.max())*1.05))
plt.legend(['600015 (Hua Xia Bank)', '601169 (Bank of Beijing)', 'Buy Signal', 'Sell Signal'])
plt.show()

Notice how we sometimes make money on the short leg and sometimes on the long leg, and sometimes both.

Let's see what kind of profits this signal can generate. We write a simple backtester which buys 1 ratio (buy 1 share of 600015.XSHG and sell ratio x shares of 601169.XSHG) when the ratio is low, sells 1 ratio (sell 1 share of 600015.XSHG and buy ratio x shares of 601169.XSHG) when it is high, and calculates the PnL of these trades.

def trade(S1, S2, window1, window2):
    # If window length is 0, algorithm doesn't make sense, so exit
    if (window1 == 0) or (window2 == 0):
        return 0
    # Compute rolling mean and rolling standard deviation
    ratios = S1/S2
    ma1 = ratios.rolling(window=window1,center=False).mean()
    ma2 = ratios.rolling(window=window2,center=False).mean()
    std = ratios.rolling(window=window2,center=False).std()
    zscore = (ma1 - ma2)/std
    # Simulate trading
    # Start with no money and no positions
    money = 0
    countS1 = 0
    countS2 = 0
    for i in range(len(ratios)):
        # Sell short if the z-score is > 1
        if zscore[i] > 1:
            money += S1[i] - S2[i] * ratios[i]
            countS1 -= 1
            countS2 += ratios[i]
        # Buy long if the z-score is < -1
        elif zscore[i] < -1:
            money -= S1[i] - S2[i] * ratios[i]
            countS1 += 1
            countS2 -= ratios[i]
        # Clear positions if the z-score between -.5 and .5
        elif abs(zscore[i]) < 0.5:
            money += countS1*S1[i] + S2[i] * countS2
            countS1 = 0
            countS2 = 0
#         print('Z-score: '+ str(zscore[i]), countS1, countS2, S1[i] , S2[i])
    return money
trade(data['600015.XSHG'].iloc[:le], data['601169.XSHG'].iloc[:le], 5, 60)

The strategy seems profitable! (result = 12.716) Now we can optimize further by changing our moving-average windows, by changing the thresholds for buy/sell and exit positions, etc., and check for performance improvements on validation data. We could also try more sophisticated models such as Logistic Regression or SVM to make our +1/-1 predictions (a rough sketch of this idea follows the test-data check below).

Let's see how it does on the test data (result = 1.487):

trade(data['600015.XSHG'].iloc[le:], data['601169.XSHG'].iloc[le:], 5, 60)
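
As a rough illustration of the "more sophisticated models" idea mentioned above (the z-score feature, the 5-day-ahead label and the use of scikit-learn's LogisticRegression are all assumptions for illustration, not part of the original strategy):

from sklearn.linear_model import LogisticRegression
# Feature: the rolling z-score on the training period
# Label: whether the ratio rises over the next 5 days
feats = zscore_60_5.dropna()
future_ret = (train.shift(-5) / train - 1).reindex(feats.index)
valid = future_ret.notna()
X_feat = feats[valid].values.reshape(-1, 1)
y_label = (future_ret[valid] > 0).astype(int).values
clf = LogisticRegression()
clf.fit(X_feat, y_label)
print('in-sample accuracy:', clf.score(X_feat, y_label))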

7. Avoid Overfitting

Overfitting is the most dangerous pitfall of a trading strategy. In our model we used rolling parameter estimates, and we may wish to optimize the window length. We could simply iterate over all reasonable window lengths and pick the one for which our model performs best. Below we write a simple loop to score window lengths based on the PnL of the training data and find the best one; the result is 58.

# Find the window length 0-254 
# that gives the highest returns using this strategy
length_scores = [trade(data['600015.XSHG'].iloc[:le], 
                data['601169.XSHG'].iloc[:le], 5, l) 
                for l in range(255)]
best_length = np.argmax(length_scores)
print ('Best window length:', best_length)

Now we check the performance of our model on the test data, and we find that this window length is not the optimal one there! This is because our original choice was clearly overfitted to the in-sample data.

# Find the returns for test data
# using what we think is the best window length
length_scores2 = [trade(data['600015.XSHG'].iloc[le:], 
                  data['601169.XSHG'].iloc[le:],5, l) 
                  for l in range(255)]
# evaluate length_scores2 at the best_length found on the training data
print (best_length, 'day window:', length_scores2[best_length])
# Find the best window length based on this dataset, 
# and the returns using this window length
best_length2 = np.argmax(length_scores2)
print (best_length2, 'day window:', length_scores2[best_length2])

We can see that the first result above uses the window of 58 found on the training data, giving length_scores2 = 1.487, whereas the actual best window on the test data is best_length2 = 52, with a corresponding length_scores2 = 2.09.

We can see this if we also plot the PnL by window length separately for the training and test data.

plt.figure(figsize=(15,7))
plt.plot(length_scores)
plt.plot(length_scores2)
plt.xlabel('Window length')
plt.legend(['Training', 'Test'])
plt.show()