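Introduction¶
Research objective¶
Traditional factor mining concentrates on financial statements and on low-to-medium-frequency price and volume data for individual stocks. The incremental value left in these data has largely been exhausted, so new factors have to come from other data dimensions. This study follows the Haitong Securities research report 《高频量价因子在股票与期货中的表现》 (The Performance of High-Frequency Price-Volume Factors in Stocks and Futures). Starting from intraday high-frequency data for individual stocks, it focuses on intraday price-volume information and trading characteristics, builds a series of high-frequency factors from minute bars, and tests each factor empirically in the stock market. Only the stock part is covered here; the futures part of the original report is not replicated.
Research framework¶
High-frequency factors fall into a few main categories: return distribution, volume distribution, price-volume composite, money flow and intraday momentum. Each category can be refined further; for example, return-distribution factors include realized skewness, realized kurtosis and upside/downside volatility. The study proceeds along these lines: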
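1) Factor data acquisition:
- Set the stock universe, fetch minute bars for the chosen date range, build the factor functions following the report, and store the factor values in a dict keyed by date
2) Factor data processing:
- Iterate over the stored factor values by date and clean them as needed
- Attach the period T+1 return to the period T factor values to simplify the factor statistics
3) Factor statistics:
- Compute the factor IC on each cross-section
- Aggregate the ICs into a time series and plot them
4) Grouped backtest:
- On each rebalancing date, rank the stocks by the chosen factor and track the net value of each layer
- On each rebalancing date, rank the stocks by the chosen factor and analyze the return of the long portfolio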
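Parameter settings¶
1) Sample period:
- June 2014 to June 2019
2) Indices studied:
- SSE 50, CSI 300, CSI 500
3) Stock universe:
- Constituents of the SSE 50, CSI 300 and CSI 500, plus all A-shares
- Excludes ST stocks, suspended stocks, stocks at a price limit, and stocks listed for less than 6 months
- Rebalanced at the start of each month
4) Costs:
- No transaction costs are applied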
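Research content and conclusions¶
1) Return-distribution factors.
- The high-frequency skewness and kurtosis factors have some stock-selection power. Over the sample period their mean ICs reach 0.034 and 0.0198 in magnitude, with negative ICs in 69.5% and 71.2% of periods. The long-short annualized return spreads are 14.45% and 15%, and the best groups earn 18.02% and 20.9% annualized. Both factors show a reversal effect in stocks: stocks with lower high-frequency skewness and kurtosis perform better afterwards. The downside-volatility share factor does not perform well (the best group is taken from the head or tail group by default).
2) Volume-distribution factors.
- The distribution of volume across the trading day reflects investor behavior. The share of volume traded between 10:00 and 11:00 is significantly positively correlated with next-month returns, while the share traded in the last half hour before the close is significantly negatively correlated. The mean ICs reach 0.040 and 0.0565 in magnitude, with negative ICs in 35.6% and 72.9% of periods. Both factors layer clearly: the long-short annualized return spread is about 17% for each, and the best groups earn 22.33% and 25.21% annualized (head or tail group by default).
3) Price-volume composite factor.
- The high-frequency price-volume correlation factor has significant stock-selection power. Stocks whose intraday path shows price-volume divergence outperform stocks whose price and volume move together. The mean IC is -0.047, with negative ICs in 62.7% of periods. Layering is fairly clear (the best group sits in the middle), the long-short annualized return spread reaches 21%, and the best group earns 19.12% annualized.
4) Money-flow factor.
- Money flow is derived from the order information generated during trading and reflects supply and demand at the micro level. The mean IC is -0.0519, with negative ICs in 55.9% of periods. Layering is very clear, the long-short annualized return spread reaches 30%, and the best group earns 26.57% annualized.
5) Trend-strength factor.
- The trend-strength factor has only moderate stock-selection power. The mean IC is -0.009, with negative ICs in 49.1% of periods. Layering is fairly clear, the long-short annualized return spread is 13%, and the best group earns 18.77% annualized.
6) Improved reversal factor.
- Besides building factors directly from minute data, intraday information can also be used to enhance traditional factors. A one-month reversal factor that strips out the overnight return and the return in the first half hour after the open gives a long portfolio with an average daily return of 0.106%. Layering is clear, the long-short annualized return spread reaches 27.6%, and the best group earns 24.12% annualized.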
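# Import the required libraries and define the helper functions
# Helper functions
import time
from jqdata import *
import numpy as np
import pandas as pd
import math
from statsmodels import regression
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Import the datetime module only: the helpers below call datetime.datetime,
# datetime.date and datetime.timedelta, so the class must not shadow the module
import datetime
from scipy import stats
from jqfactor import *
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
# Given start and end dates ('YYYY-MM-DD', zero-padded), return every calendar day in between
def get_date_list(begin_date, end_date):
    dates = []
    dt = datetime.datetime.strptime(begin_date,"%Y-%m-%d")
    date = begin_date[:]
    # String comparison is valid only because the dates are zero-padded
    while date <= end_date:
        dates.append(date)
        dt += datetime.timedelta(days=1)
        date = dt.strftime("%Y-%m-%d")
    return dates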
# Return trading days: every day, or the first trading day of each month, quarter or half-year
def get_tradeday_list(start,end,frequency=None,count=None):
    if count is not None:
        df = get_price('000001.XSHG',end_date=end,count=count)
    else:
        df = get_price('000001.XSHG',start_date=start,end_date=end)
    if frequency is None or frequency =='day':
        return df.index
    else:
        df['year-month'] = [str(i)[0:7] for i in df.index]
        if frequency == 'month':
            return df.drop_duplicates('year-month').index
        elif frequency == 'quarter':
            df['month'] = [str(i)[5:7] for i in df.index]
            df = df[(df['month']=='01') | (df['month']=='04') | (df['month']=='07') | (df['month']=='10') ]
            return df.drop_duplicates('year-month').index
        elif frequency =='halfyear':
            df['month'] = [str(i)[5:7] for i in df.index]
            df = df[(df['month']=='01') | (df['month']=='06')]
            return df.drop_duplicates('year-month').index
# Given start and end dates, return the first trading day of each week
def get_weekday(s_date,d_date):
    df = get_price('000001.XSHG',start_date=s_date,end_date=d_date)
    dt_list = []
    for d1,d2 in zip(df.index[:-1],df.index[1:]):
        d_1 = datetime.datetime(int(str(d1)[:4]),int(str(d1)[5:7]),int(str(d1)[8:10]))
        d_2 = datetime.datetime(int(str(d2)[:4]),int(str(d2)[5:7]),int(str(d2)[8:10]))
        weekday1 = d_1.strftime("%w")
        weekday2 = d_2.strftime("%w")
        interday = (d_2 - d_1).days
        # A new week starts when the weekday number does not increase, or after a gap of more than 7 days
        if (int(weekday1) >= int(weekday2)) or interday>7:
            dt_list.append(str(d2)[:10])
    return dt_list
# Daily gross return (1 + return) of a portfolio between two dates
def ret_se(start_date='2018-6-1',end_date='2018-7-1',stock_pool=None,weight=0):
    # Treat a missing pool as empty so that len() below does not fail on None
    pool = stock_pool if stock_pool is not None else []
    if len(pool) != 0:
        # Historical close prices of the stocks in the pool
        df = get_price(list(pool),start_date=start_date,end_date=end_date,fields=['close']).close
        df = df.dropna(axis=1)
        # Log of the circulating market cap of each stock, used as the weight
        df_mkt = get_fundamentals(query(valuation.code,valuation.circulating_market_cap).filter(valuation.code.in_(df.columns)))
        df_mkt.index = df_mkt['code'].values
        fact_se =pd.Series(df_mkt['circulating_market_cap'].values,index = df_mkt['code'].values)
        fact_se = np.log(fact_se)
    else:
        # Empty pool: return a flat series of 1 on the index trading days (only valid with weight=0)
        df = get_price('000001.XSHG',start_date=start_date,end_date=end_date,fields=['close'])
        df['v'] = [1]*len(df)
        del df['close']
    # Gross return relative to the previous day
    pct = df.pct_change()+1
    pct.iloc[0,:] = 1
    if weight == 0:
        # Equal-weighted average across stocks
        se = pct.cumsum(axis=1).iloc[:,-1]/pct.shape[1]
        return se
    else:
        # Weighted average across stocks
        se = (pct*fact_se).cumsum(axis=1).iloc[:,-1]/sum(fact_se)
        return se
# Daily gross returns of every group, stacked over all rebalancing periods
def get_all_pct(pool_dict,trade_list,groups=5):
    num = 1
    for s,e in zip(trade_list[:-1],trade_list[1:]):
        stock_list = pool_dict[s]
        # Equal group size; stocks left over after integer division are dropped
        stock_num = len(stock_list)//groups
        if num == 0:
            pct_se_list = []
            for i in range(groups):
                pct_se_list.append(ret_se(start_date=s,end_date=e,stock_pool=stock_list[i*stock_num:(i+1)*stock_num]))
            pct_df1 = pd.concat(pct_se_list,axis=1)
            pct_df1.columns = range(groups)
            pct_df = pd.concat([pct_df,pct_df1],axis=0)
        else:
            # First period: initialize pct_df
            pct_se_list = []
            for i in range(groups):
                pct_se_list.append(ret_se(start_date=s,end_date=e,stock_pool=stock_list[i*stock_num:(i+1)*stock_num]))
            pct_df = pd.concat(pct_se_list,axis=1)
            pct_df.columns = range(groups)
            num = 0
    return pct_df
def tradedays_before(date,count):  # the trading day `count` trading days before `date`
    date = get_price('000001.XSHG',end_date=date,count=count+1).index[0]
    return date
def ShiftTradingDay(date,shift):
    # All trading days; the elements are datetime.date objects
    tradingday = get_all_trade_days()
    # Position in that list of the day `shift` trading days after `date`
    date = datetime.date(int(str(date)[:4]),int(str(date)[5:7]),int(str(date)[8:10]))
    shiftday_index = list(tradingday).index(date)+shift
    # Return the date at that position as a datetime.date
    return tradingday[shiftday_index]
# Filter out recent listings, ST stocks and suspended stocks; return the remaining stocks
def filter_stock(stockList,date,days=21*3,skip_paused=1,limit=0):  # limit=1 adds the open-at-limit filter for daily strategies
    # Drop stocks listed fewer than n calendar days before beginDate
    def delect_stop(stocks,beginDate,n=days):
        stockList=[]
        beginDate = datetime.datetime.strptime(beginDate, "%Y-%m-%d")
        for stock in stocks:
            start_date=get_security_info(stock).start_date
            if start_date<(beginDate-datetime.timedelta(days=n)).date():
                stockList.append(stock)
        return stockList
    # Drop ST stocks
    st_data=get_extras('is_st',stockList, count = 1,end_date=date)
    stockList = [stock for stock in stockList if not st_data[stock][0]]
    # Drop stocks suspended on the given day
    if skip_paused == 1:
        paused_df = get_price(stockList,end_date=date,count=1,fields=['paused'])['paused'].T
        paused_df.columns = ['paused']
        paused_df = paused_df[paused_df['paused']==0]
        stockList = paused_df.index
    # Drop new listings and delisted stocks
    stockList=delect_stop(stockList,date)
    # Drop stocks that open at the upper or lower price limit
    if limit == 1:
        # To filter on closing limits instead, change the 'open' field to 'close'
        df = get_price(stockList,end_date=date,fields=['open','high_limit','low_limit'],count=1).iloc[:,0,:]
        df['h_limit']=(df['open']==df['high_limit'])
        df['l_limit']=(df['open']==df['low_limit'])
        stockList = list(df.index[~(df['h_limit'] | df['l_limit'])])  # keep stocks not at a price limit
    return stockList
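Factor data acquisition¶
# Set the stock universe (index constituents)
index = '000905.XSHG'  # universe and benchmark; here the CSI 500
# Start and end of the sample period
date_start = '2014-06-01'
date_end = '2019-06-24'
# Trading days in the sample period, used to compute the factor data
date_list = get_tradeday_list(start=date_start,end=date_end,count=None)  # every trading day in the period
date_list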
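Based on existing research, high-frequency factors can be divided into a few main categories: return distribution, volume distribution, price-volume composite, money flow and intraday momentum.
Each category can be refined further; for example, return-distribution factors include realized skewness, realized kurtosis and upside/downside volatility.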
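For reference, the realized moments computed by the `factor3` function in this notebook can be written as follows. This is a transcription of what the code does, not a quotation from the Haitong report: $P_0$ is the opening price of the first minute bar, $P_1, \dots, P_N$ are the sampled minute closes, and the symbol names are chosen here for illustration.

```latex
\begin{aligned}
r_{i} &= \ln P_{i} - \ln P_{i-1}, \qquad i = 1, \dots, N \\
\mathrm{RVar} &= \sum_{i=1}^{N} r_{i}^{2} \\
\mathrm{RVar}^{-} &= \sum_{i=1}^{N} r_{i}^{2}\,\mathbf{1}\{r_{i} < 0\} \\
\mathrm{RSkew} &= \frac{\sqrt{N}\,\sum_{i=1}^{N} r_{i}^{3}}{\mathrm{RVar}^{3/2}} \\
\mathrm{RKurt} &= \frac{N\,\sum_{i=1}^{N} r_{i}^{4}}{\mathrm{RVar}^{2}}
\end{aligned}
```

In the code these correspond to the columns `RVar_t`, `RVar_t_short`, `RSkew_t` and `RKurt_t`.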
# Compute and store the factor values
# Factor 3: realized skewness, realized kurtosis and downside variance
def factor3(pool,date,days=1,freq = 1):  # freq: sampling interval in minutes
    # Realized moments for one day
    def ffactor3(pool,date,freq = 1):  # freq: sampling interval in minutes
        factor = pd.DataFrame(index = ['RVar_t','RSkew_t','RKurt_t','RVar_t_short'])
        for stock in pool:
            df = pd.DataFrame()
            price = get_price(stock,end_date=str(date)[:10]+' 15:00:00',count=240,frequency='1m',fields=['open','close'])
            price_list = list(price['close'].values)
            # Prepend the open of the first bar so that the first minute has a return
            price_list.insert(0,price.iloc[0,0])
            df['close'] = price_list
            df.index = range(0,len(df))
            # Keep every freq-th observation
            mark_l = [i for i in df.index if i%freq==0]
            df = df.loc[mark_l,:]
            df['ln_c'] = np.log(df['close'])
            df['r'] = df['ln_c']-df['ln_c'].shift(1)
            df = df.dropna()
            RDVar_t = sum([r**2 for r in df['r'].values])
            RDvar_t_s= sum([r**2 for r in df['r'].values if r <0])
            RDSkew_t= np.sqrt(len(df))*sum([r**3 for r in df['r'].values])/RDVar_t**(3/2)
            RDKurt_t= len(df)*sum([r**4 for r in df['r'].values])/RDVar_t**2
            factor[stock] = [RDVar_t,RDSkew_t,RDKurt_t,RDvar_t_s]
        return factor.T
    # Average the daily values over the last `days` trading days
    t = get_price('000001.XSHG',end_date=date,count=days).index
    mark = 1
    for d in t:
        if mark == 0:
            factor_temp = ffactor3(pool,d,freq = freq)
            factor = factor+factor_temp
        else:
            factor = ffactor3(pool,d,freq = freq)
            mark = 0
    return factor/len(t)
# Factor 4: volume distribution (share of daily volume in each of the eight 30-minute bars)
def factor4(stocks_list,date):
    min_df = get_price(stocks_list,end_date=str(date)[:10]+' 15:00:00',frequency='30m',count=8,fields=['volume'])['volume']
    df = min_df.T
    df.columns=[1,2,3,4,5,6,7,8]
    df['sum'] = df.sum(axis=1)
    far = df.div(df['sum'].values,axis=0)
    # Drop the helper 'sum' column
    return far.iloc[:,:-1]
# Factor 5: price-volume composite factor
def factor5(pool,date,days=1,freq = 1):  # freq is accepted for symmetry but not used here
    # Intraday correlation between minute close and minute volume
    def ffactor5(pool,date,freq = 1):
        factor = pd.DataFrame(index = ['corr'])
        for stock in pool:
            df = pd.DataFrame()
            price = get_price(stock,end_date=str(date)[:10]+' 15:00:00',count=240,frequency='1m',fields=['close','volume'])
            corr = price['close'].corr(price['volume'])
            factor[stock] = [corr]
        return factor.T
    # Average the daily values over the last `days` trading days
    t = get_price('000001.XSHG',end_date=date,count=days).index
    mark = 1
    for d in t:
        if mark == 0:
            factor_temp = ffactor5(pool,d,freq = freq)
            factor = factor+factor_temp
        else:
            factor = ffactor5(pool,d,freq = freq)
            mark = 0
    return factor/len(t)
# Factor 6: money-flow factor
def factor6(pool,date,days=1,freq = 1,op=1):  # op=1 also classifies the first minute
    # Share of turnover traded in minutes whose close is above the previous close
    def ffactor6(pool,date,freq = 1,op=op):
        factor = pd.DataFrame(index = ['flowinratio'])
        for stock in pool:
            df = pd.DataFrame()
            min_df = get_price(stock,end_date=str(date)[:10]+' 15:00:00',frequency='1m',count=240,fields=['open','close','volume','money'])
            # +1 for an up minute, -1 for a down minute, NaN when the close is unchanged
            min_df['b_s'] = (min_df['close']-min_df['close'].shift(1))/abs(min_df['close']-min_df['close'].shift(1))
            if op == 1:
                # First minute: compare its close with its own open
                min_df.iloc[0,4] = 1 if min_df.iloc[0,1] > min_df.iloc[0,0] else -1
            # The original wrapped the denominator in sum(sum(...)), which raises a TypeError
            flowinratio = sum(min_df[min_df['b_s']==1]['money'])/sum(min_df['money'])
            factor[stock] = [flowinratio]
        return factor.T
    # Average the daily values over the last `days` trading days
    t = get_price('000001.XSHG',end_date=date,count=days).index
    mark = 1
    for d in t:
        if mark == 0:
            factor_temp = ffactor6(pool,d,freq = freq)
            factor = factor+factor_temp
        else:
            factor = ffactor6(pool,d,freq = freq)
            mark = 0
    return factor/len(t)
# Factor 7: momentum-type factors
def factor7(stocks_list,date,n1=1,n2=30):
    # 'er': trend strength, the net price move divided by the sum of absolute minute moves
    min_df = get_price(stocks_list,end_date=date+' 15:00:00',frequency='1m',count=241-n1,fields=['close'])['close']
    er_s = min_df.values[-1]-min_df.values[0]
    er_ay = np.array([abs(j-i) for i,j in zip(min_df.values[:-1],min_df.values[1:])])
    er_x = er_ay.sum(axis=0)
    far1 = pd.DataFrame(er_s/er_x,index=stocks_list,columns=['er'])
    # 'r': intraday return over the last 241-n2 minute bars, which skips the start of the session
    min_df = get_price(stocks_list,end_date=date+' 15:00:00',frequency='1m',count=241-n2,fields=['close'])['close']
    r = min_df.iloc[-1,:]/min_df.iloc[0,:]-1
    far2 = pd.DataFrame(r,index=stocks_list,columns=['r'])
    return pd.concat([far1,far2],axis=1)
# Loop over the date list and record the factor values in a dict keyed by date
# Compute the factor values
factor_dict = {}
# Collect the raw daily factor data for every trading day
for end_date in date_list[:]:
    end_date=str(end_date)[:10]
    print('Computing factor data for {} ......'.format(end_date))
    stocks_list = get_index_stocks(index,date=end_date)  # index constituents on that date
    stocks_list = filter_stock(stocks_list,end_date,days=183,limit=1)  # apply the universe filters
    pool = stocks_list
    date = end_date
    factor_dict[end_date] = pd.concat([factor3(pool,date),factor4(pool,date),factor5(pool,date),factor6(pool,date),factor7(pool,date)],axis=1)  # compute and store the factor values
# Average the daily factor values over the trailing n trading days
factor1_dict = {}
n = 20
# Loop over the date list, skipping the first 20 days, which lack a full window
for end_date in date_list[20:]:
    date = str(end_date)[:10]
    t = get_price('000001.XSHG',end_date=date,count=n).index
    mark = 1
    for d in t:
        date_ = str(d)[:10]
        if mark == 0:
            factor_temp = factor_dict[date_]
            factor = factor+factor_temp
        else:
            factor = factor_dict[date_]
            mark = 0
    factor1_dict[date] = factor/len(t)
factor1_dict[date].head(3)
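The factor definitions above can be checked without JoinQuant data. The following is a minimal sketch on synthetic minute bars, using only NumPy; the function names (`realized_moments`, `volume_price_corr`, `inflow_ratio`, `efficiency_ratio`) and the random data are assumptions made for illustration and are not part of `jqdata` or of the original report.

```python
import numpy as np

def realized_moments(close, open_price):
    """Realized variance, skewness, kurtosis and downside variance for one day."""
    prices = np.concatenate(([open_price], np.asarray(close, dtype=float)))
    r = np.diff(np.log(prices))              # minute log returns
    n = r.size
    rvar = np.sum(r ** 2)
    rvar_down = np.sum(r[r < 0] ** 2)        # squared negative returns only
    rskew = np.sqrt(n) * np.sum(r ** 3) / rvar ** 1.5
    rkurt = n * np.sum(r ** 4) / rvar ** 2
    return rvar, rskew, rkurt, rvar_down

def volume_price_corr(close, volume):
    """Pearson correlation between minute closes and minute volumes."""
    return float(np.corrcoef(close, volume)[0, 1])

def inflow_ratio(close, money, open_price):
    """Share of turnover traded in minutes that closed above the previous close."""
    close = np.asarray(close, dtype=float)
    money = np.asarray(money, dtype=float)
    prev = np.concatenate(([open_price], close[:-1]))  # first minute compares with the open
    return float(money[close > prev].sum() / money.sum())

def efficiency_ratio(close):
    """Net price move divided by the sum of absolute minute-to-minute moves."""
    close = np.asarray(close, dtype=float)
    return float((close[-1] - close[0]) / np.abs(np.diff(close)).sum())

# Synthetic trading day: 240 minute bars of a random-walk price (illustrative only)
rng = np.random.default_rng(0)
open_price = 10.0
close = open_price * np.exp(np.cumsum(rng.normal(0.0, 0.001, 240)))
volume = rng.integers(100, 10_000, 240).astype(float)
money = volume * close

rvar, rskew, rkurt, rvar_down = realized_moments(close, open_price)
corr = volume_price_corr(close, volume)
flow = inflow_ratio(close, money, open_price)
er = efficiency_ratio(close)
print(rvar, rskew, rkurt, rvar_down, corr, flow, er)
```

The notebook then averages each daily value over the trailing 20 trading days before using it as a monthly factor.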
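Factor data processing¶
- Clean the factor values; winsorize, standardize and neutralize them if necessary
- Attach the returns: record the return from T to T+1 alongside the period T factor values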
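The three cleaning steps named above can be sketched generically as follows. This is not the `jqfactor` implementation of `winsorize_med`, `standardlize` or `neutralize`; it assumes a median plus or minus `scale` times MAD clipping rule, a plain z-score, and an OLS regression on a single exposure such as log market cap. The function names are hypothetical.

```python
import numpy as np

def winsorize_median(x, scale=3.0):
    """Clip values to median +/- scale * MAD (median absolute deviation)."""
    x = np.asarray(x, dtype=float)
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    return np.clip(x, med - scale * mad, med + scale * mad)

def standardize(x):
    """Cross-sectional z-score."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)

def neutralize_on(y, exposure):
    """Residual of an OLS regression of y on an intercept and one exposure."""
    y = np.asarray(y, dtype=float)
    exposure = np.asarray(exposure, dtype=float)
    X = np.column_stack([np.ones(len(exposure)), exposure])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

raw = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one obvious outlier
clipped = winsorize_median(raw)                # median 3, MAD 1, bounds [0, 6]
z = standardize(clipped)

log_cap = np.array([20.0, 21.0, 22.0, 23.0, 24.0])  # made-up log market caps
resid = neutralize_on(z, log_cap)
print(clipped, z, resid)
```

In the notebook itself the winsorizing and standardizing calls are commented out, and neutralization runs only when `neu` is set to 1.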
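# Parameter settings
# Whether to neutralize
neu = 0  # 1: neutralize; 0: do not neutralize
how_=['sw_l1', 'market_cap']  # neutralize against Shenwan level-1 industry and market cap
# Rebalancing calendar
s_date = '2014-7-1'
d_date = '2019-6-24'
# Rebalancing dates in the sample period
trade_list = get_tradeday_list(start=s_date,end=d_date,frequency='month',count=None)  # first trading day of each month
trade_list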
# Data cleaning (winsorizing, standardizing, neutralizing) and attaching the y values
import time
t1 = time.time()
factor_y_dict = {}
for date_1,date_2 in zip(trade_list[:-1],trade_list[1:]):
    d1 = ShiftTradingDay(date_1,1)  # one trading day after the rebalancing date
    d2 = ShiftTradingDay(date_2,1)
    print('Preparing data for {} ...'.format(str(date_1)[:10]))
    factor_df = factor1_dict[str(date_1)[:10]]  # adjust the key if the dict uses another date format
    pool = list(factor_df.index)
    # Index return, open to open
    df_1 = get_price(index,end_date=d1,fields=['open'],count = 1)['open']
    df_2 = get_price(index,end_date=d2,fields=['open'],count = 1)['open']
    index_pct = df_2.values[0]/df_1.values[0] - 1  # scalar
    # Stock returns, open to open
    df_1 = get_price(pool,end_date=d1,fields=['open'],count = 1)['open']
    df_2 = get_price(pool,end_date=d2,fields=['open'],count = 1)['open']
    df_3 = pd.concat([df_1,df_2],axis=0).T  # one row per stock, one column per date
    stock_pct = df_3.iloc[:,1]/df_3.iloc[:,0] - 1  # return as a Series
    # Optional cleaning: winsorize, standardize, neutralize
    #factor_df = winsorize_med(factor_df, scale=3, inclusive=True, inf2nan=True, axis=0)  # median winsorizing
    #factor_df = standardlize(factor_df, inf2nan=True, axis=0)  # standardize each column
    if neu == 1:
        factor_df = neutralize(factor_df, how=how_, date=date_1, axis=0,fillna='sw_l1')  # neutralize
    #factor_df['pct_alpha'] = stock_pct-index_pct
    factor_df['pct_'] = stock_pct
    factor_y_dict[str(date_1)[:10]] = factor_df
t2 = time.time()
print('Elapsed time: {0}'.format(t2-t1))
print(factor_y_dict[str(date_1)[:10]].shape)
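Factor analysis¶
- IC statistics
# Record the IC of each factor on each rebalancing date
ic_df = pd.DataFrame()
for d in trade_list[:-1]:
    d = str(d)[:10]
    # Correlation of every factor column with the last column ('pct_', the next-period return)
    ic_df[d] = (factor_y_dict[d].corr()).iloc[:-1,-1]
ic_df.head(3)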
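Below, we compute the IC statistics of every factor: the IC mean, IC standard deviation, IC minimum, IC maximum and the share of negative ICs.
The IC series and the cumulative IC are then plotted.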
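As a minimal sketch of this step, the example below builds a few synthetic cross-sections, computes the IC as the Pearson correlation between each factor and the next-period return (the same `DataFrame.corr()` default the notebook uses), and summarizes it. The dates, factor names and data are made up for illustration; `factor_a` is constructed to be negatively related to returns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = ["2019-01-02", "2019-02-01", "2019-03-01"]   # hypothetical rebalancing dates

ic_by_date = {}
for d in dates:
    n = 200                                          # stocks in the cross-section
    fwd_ret = rng.normal(0.0, 0.05, n)               # synthetic next-period returns
    cross = pd.DataFrame({
        "factor_a": -fwd_ret + rng.normal(0.0, 0.05, n),  # built to have a negative IC
        "factor_b": rng.normal(0.0, 1.0, n),              # pure noise
        "pct_": fwd_ret,
    })
    # IC of each factor: correlation with the return column, return column itself dropped
    ic_by_date[d] = cross.corr()["pct_"].drop("pct_")

ic_df = pd.DataFrame(ic_by_date)                     # rows: factors, columns: dates

summary = pd.DataFrame({
    "IC mean": ic_df.mean(axis=1),
    "IC std": ic_df.std(axis=1),
    "IC min": ic_df.min(axis=1),
    "IC max": ic_df.max(axis=1),
    "share of negative IC": (ic_df < 0).mean(axis=1),
})
cumulative_ic = ic_df.T.cumsum()                     # cumulative IC per factor over time
print(summary)
```

A rank IC could be obtained by passing `method='spearman'` to `corr`; the notebook uses the Pearson default for the IC and Spearman only for the factor correlation matrix.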
# IC statistics for every factor
for factor in ic_df.index:
    ic_ = ic_df.T
    ic_df_temp = ic_df.T[factor]
    tab_ic = pd.DataFrame()
    for year in range(2014,2020):
        # Yearly statistics
        ic_temp = ic_[(ic_.index>(str(year)+'-01-01')) & (ic_.index<(str(year+1)+'-01-01'))]
        tab_ic[str(year)] = [ic_temp[factor].mean(),ic_temp[factor].std(),ic_temp[factor].min(),ic_temp[factor].max(),round(sum(ic_temp[factor]<0)/len(ic_temp),4)]
    tab_ic['All years'] = [ic_[factor].mean(),ic_[factor].std(),ic_[factor].min(),ic_[factor].max(),round(sum(ic_[factor]<0)/len(ic_),4)]
    tab_ic.index=['IC mean','IC std','IC min','IC max','Negative IC share']
    print('======================== Factor: {} IC statistics ======================'.format(factor))
    print(tab_ic.T)
    # Plot the IC and the cumulative IC
    ic_df_temp1 = ic_df.T[[factor]]
    ic_df_temp1['ic_sum'] = ic_df_temp1[factor].cumsum()
    ic_df_temp1.plot(use_index=False,y=[factor,'ic_sum'],secondary_y=['ic_sum'],figsize=(9,5))
    plt.show()
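Grouped backtest¶
- Long-portfolio return
- Group returns
Next, we compute the group returns for every factor and record each group's total return, annualized return, Sharpe ratio, maximum drawdown and average daily return.
The annualized return of each group is then plotted.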
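The risk metrics listed above can be sketched compactly with NumPy. This is an illustrative variant rather than the notebook's `get_risk_index`: it takes the same input (daily gross returns, that is 1 + daily return) and the same conventions (250 trading days per year, a 2.5% risk-free rate), but computes the maximum drawdown from a running maximum instead of a nested loop, which should give the same figure. The function name `risk_metrics` is hypothetical.

```python
import numpy as np

def risk_metrics(gross, periods_per_year=250, risk_free=0.025):
    """Total return, annualized return, Sharpe ratio and max drawdown.

    gross: daily gross returns, e.g. 1.01 for a +1% day.
    """
    gross = np.asarray(gross, dtype=float)
    nav = np.cumprod(gross)                          # net value path
    total = nav[-1] - 1.0
    annual = (1.0 + total) ** (periods_per_year / len(gross)) - 1.0
    vol = np.std(gross) * np.sqrt(periods_per_year)  # annualized volatility
    sharpe = (annual - risk_free) / vol
    running_max = np.maximum.accumulate(nav)         # highest net value so far
    max_drawdown = np.max(1.0 - nav / running_max)   # largest peak-to-trough loss
    return total, annual, sharpe, max_drawdown

# Four made-up days: +10%, -10%, flat, +5%
total, annual, sharpe, max_dd = risk_metrics([1.10, 0.90, 1.00, 1.05])
print(total, annual, sharpe, max_dd)
```

The running-maximum form is linear in the number of days, whereas the loop in `get_risk_index` rescans the series at every step.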
# Group-return statistics for the factor backtest
# Number of groups
group = 10
factor_list = list(ic_df.index)  # all factors with IC statistics
def get_risk_index(se):  # se: daily gross returns (1 + daily return)
    return_se = se.cumprod()-1
    total_returns = return_se.iloc[-1]
    total_an_returns = ((1+total_returns)**(250/len(return_se))-1)
    sharpe = (total_an_returns-0.025)/(np.std(se)*np.sqrt(250))
    returns_mean = round(se.mean()-1,6)*100
    ret = return_se.dropna()
    ret = ret+1
    # Maximum drawdown: for each split point, highest value before versus lowest value after
    maxdown_list = []
    for i in range(1,len(ret)):
        low = min(ret[i:])
        high = max(ret[0:i])
        if high>low:
            #print(high,low)
            maxdown_list.append((high-low)/high)
            #print((high-low)/high)
        else:
            maxdown_list.append(0)
    max_drawdown = max(maxdown_list)
    #print('Backtest period: {} to {}'.format(str(return_se.index[0])[:10],str(return_se.index[-1])[:10]))
    total_returns = str(round(total_returns*100,2))+'%'
    total_an_returns = str(round(total_an_returns*100,2))+'%'
    sharpe = str(round(sharpe,2))
    max_drawdown = str(round(max_drawdown*100,2))+'%'
    '''
    print('Total return: %s'%round(total_returns*100,2)+'%')
    print('Annualized return: %s'%round(total_an_returns*100,2)+'%')
    print('Sharpe ratio: %s'%round(sharpe,2))
    print('Max drawdown: %s'%round(max_drawdown*100,2)+'%')
    '''
    return total_returns,total_an_returns,sharpe,max_drawdown,returns_mean
for factor in factor_list:
    factor_df = pd.DataFrame()
    for d in trade_list[:]:
        d = str(d)[:10]
        factor_df[d] = factor1_dict[d].loc[:,factor]#/factor_dict[d].loc[:,'turnover_ratio']
    factor_df =factor_df.T
    # Group-return statistics
    # Grouped backtest
    # Input: factor_df with dates as index, stock codes as columns and factor values as entries
    # Output: returns of each group of the universe
    pool_dict = {}
    for i in range(len(factor_df.index)):
        temp_se = factor_df.iloc[i,:].sort_values(ascending=False)  # sort from largest to smallest
        #pool = temp_se[temp_se>0].index  # drop values below 0
        temp_se = temp_se.dropna()  # drop missing values
        pool = temp_se.index  # negative values are kept
        num = int(len(pool)/group)
        #print('Period %s: %s stocks per group'%(i,num))
        pool_dict[factor_df.index[i]] = pool
    backtest_list = factor_df.index
    group_pct = get_all_pct(pool_dict,backtest_list,groups=group)
    group_pct.columns = ['group'+str(i) for i in range(len(group_pct.columns))]
    # Risk and return statistics per group
    risk_index = group_pct.apply(get_risk_index,axis=0)
    risk_tab = pd.DataFrame(index=["Total return","Annualized return","Sharpe ratio","Max drawdown","Daily return %"])
    for i in range(group):
        risk_tab['group'+str(i)] = list(risk_index.values[i])
    print('========================= Factor: {} group returns =========================='.format(factor))
    print(risk_tab.T)
    # Bar chart of the annualized return of each group
    risk_plt = pd.Series([float(str(i)[:-1]) for i in risk_tab.values[1]],index=risk_tab.T.index)
    risk_plt.plot(kind='bar',figsize=(9,5))
    plt.show()
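Here we pick one of the better-performing factors and plot the returns of its groups. The gap between the head and tail groups is large, so the grouping effect is clear.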
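The grouping rule used throughout the backtest can be sketched on toy data as follows: sort the stocks by factor value in descending order, cut them into equal-sized groups, and drop the stocks left over after integer division (as `get_all_pct` does). The helper name `split_into_groups`, the stock labels and the returns are hypothetical.

```python
import numpy as np
import pandas as pd

def split_into_groups(factor, groups=10):
    """Sort by factor (largest first) and cut into equal-sized groups.

    Stocks left over after integer division are dropped from the tail.
    """
    ranked = factor.dropna().sort_values(ascending=False)
    size = len(ranked) // groups
    return [list(ranked.index[i * size:(i + 1) * size]) for i in range(groups)]

# 23 made-up stocks with factor values 0..22
codes = ["s{:02d}".format(i) for i in range(23)]
factor = pd.Series(np.arange(23, dtype=float), index=codes)
# Toy next-period returns with a reversal effect: a higher factor means a lower return
fwd_ret = -0.001 * factor

buckets = split_into_groups(factor, groups=5)        # 4 stocks per group, 3 dropped
group_ret = [fwd_ret.loc[b].mean() for b in buckets] # equal-weighted return per group
spread = group_ret[-1] - group_ret[0]                # bottom group minus top group
print(buckets[0], buckets[-1], spread)
```

With a reversal factor the bottom group (lowest factor values) earns more than the top group, so the spread is positive in this toy example.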
# Net-value curves of the groups
# Factor to inspect
factor = 'r'
factor_df = pd.DataFrame()
for d in trade_list[:]:
    d = str(d)[:10]
    factor_df[d] = factor1_dict[d].loc[:,factor]#/factor_dict[d].loc[:,'turnover_ratio']
factor_df =factor_df.T
pool_dict = {}
for i in range(len(factor_df.index)):
    temp_se = factor_df.iloc[i,:].sort_values(ascending=False)  # sort from largest to smallest
    #pool = temp_se[temp_se>0].index  # drop values below 0
    temp_se = temp_se.dropna()  # drop missing values
    pool = temp_se.index  # negative values are kept
    num = int(len(pool)/group)
    #print('Period %s: %s stocks per group'%(i,num))
    pool_dict[factor_df.index[i]] = pool
backtest_list = factor_df.index
group_pct = get_all_pct(pool_dict,backtest_list,groups=group)
group_pct.columns = ['group'+str(i) for i in range(len(group_pct.columns))]
group_pct.cumprod().plot(figsize=(15,8))
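Factor correlation check¶
The rank (Spearman) correlation matrix of the high-frequency stock factors is shown below. Most factor pairs have low correlation, while the downside-volatility share and the high-frequency skewness factor are strongly negatively correlated.
- Among the volume-share factors, the 9:30-10:00 share is strongly negatively correlated with the share of every bucket after 11:00
- The high-frequency kurtosis factor is fairly strongly positively correlated with the high-frequency skewness factor and the money-inflow factor
- The trend-strength factor and the improved reversal factor describe the same thing; although they are processed differently, they remain strongly positively correlated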
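As a reminder of what the rank correlation measures, the following minimal sketch computes a Spearman matrix as the Pearson correlation of the ranks, on made-up data. Column `b` is a decreasing but nonlinear function of column `a`, so their rank correlation is -1 even though the relationship is not linear. The column names and data are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": np.exp(-a),              # strictly decreasing in a, but nonlinear
    "c": rng.normal(size=300),    # unrelated noise
})

# Spearman correlation = Pearson correlation computed on the ranks
rank_corr = toy.rank().corr()
pearson_corr = toy.corr()
print(rank_corr.round(3))
```

Rank correlation is a natural choice for comparing factors because it depends only on the ordering of the stocks, which is also what the grouped backtest uses.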
import seaborn as sns
# Spearman correlation on the most recent cross-section (factor columns plus 'pct_')
df = factor_y_dict[list(factor_y_dict.keys())[-1]].corr(method='spearman')
fig = plt.figure(figsize= (12,8))
ax = fig.add_subplot(111)
ax = sns.heatmap(df,annot=True,annot_kws={'size':9,'weight':'bold'})
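Conclusion¶
In this report we built a series of high-frequency factors from minute-level data and compared their performance in stocks. The detailed results for each factor are given in the conclusions section at the start of the study.
1) The high-frequency skewness factor, the price-volume correlation factor and the improved reversal factor all show significant stock-selection power. The volume-distribution factors behave differently in the morning and the afternoon: the larger the volume before the close, the worse the stock performs afterwards, while the larger the volume between 10:00 and 11:00, the better it performs.
2) Intraday information can be used to improve traditional factors. After the modification, the reversal factor layers clearly, the long-short annualized return spread reaches 27.6%, and the best group earns 24.12% annualized.
3) Most high-frequency factors show a reversal effect in stocks, which may be related to the trading mechanism and the investor structure. The stock market is mainly long-only. Compared with the futures market, which allows two-way T+0 trading, has high institutional participation and makes wide use of program trading, the stock market has a larger share of retail trading and is more prone to overreaction and mispricing.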