Research objective:
This article follows the Huatai Securities report "Artificial Intelligence Stock Selection: Support Vector Machine Models", analyzes its results, and studies the application of support vector machines (SVM) to multi-factor stock selection: building a nonlinear model between stock factors and returns, and evaluating the model along several dimensions.
Research content:
A multi-factor model is, at heart, a linear regression between current factor exposures and future returns. We want to bring machine-learning ideas into the traditional multi-factor framework, because financial markets are complex and nonlinear, so a purely linear model has substantial shortcomings. Linear SVMs solve linear classification problems, kernel SVMs target nonlinear classification, and support vector regression (SVR) handles regression. This report applies support vector machines to multi-factor stock selection, focusing on the following questions:
(1) Model selection: common kernels include the linear, polynomial, and Gaussian (RBF) kernels, and different kernels yield different models; how should the kernel be chosen to get the best model?
(2) Parameter tuning: an SVM has two key hyper-parameters, the penalty coefficient C and gamma; how should their optimal values be found?
(3) Portfolio construction: once the different SVM models have been evaluated, how should their predictions be used to build strategy portfolios for backtesting?
(4) Model evaluation: after backtesting the portfolios, how should the results be assessed to judge which model is better?
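Question (2) maps directly onto scikit-learn's grid search. As a minimal sketch, with synthetic data standing in for the monthly factor exposures and an illustrative grid of values, the C/gamma search for an RBF-kernel SVC looks like:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data standing in for the factor matrix
# (the real features are monthly factor exposures).
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0).astype(int)

# Grid over the two key SVM hyper-parameters: penalty C and RBF gamma.
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(X, y)

best_params = grid.best_params_  # the (C, gamma) pair with the best CV accuracy
best_score = grid.best_score_    # its mean cross-validated accuracy
```

The chosen `best_params_` are then frozen and the model refit on the full in-sample window before out-of-sample prediction.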
Research conclusions:
(1) Using HS300 constituents as the universe, with factors and next-period returns from 2010-2014 as the in-sample set and 2014-2018 data as the out-of-sample test set, the RBF-kernel SVM achieves 56.2% accuracy and an AUC of 0.561 on the cross-validation set, and an average accuracy of 55.1% with an average AUC of 0.550 on the out-of-sample test set.
(2) Using HS300 as the stock pool, a selection strategy is built with the RBF-kernel SVM, in both an equal-weight version and an industry-neutral version within the HS300 constituents. Overall, the RBF-kernel SVM produces clearly separated layers in return, Sharpe ratio, and maximum drawdown, indicating that the model is effective.
(3) Comparing SVMs with different kernels, the RBF kernel generally earns higher returns than the other kernel models.
(4) Comparing the RBF-kernel SVM (SVC) with support vector regression (SVR), the RBF-kernel SVC earns higher returns than the SVR model most of the time.
Runtime:
(1) Data preparation: about 8 hours, mostly spent on data collection. To avoid long runtimes the data has been precomputed; download: https://pan.baidu.com/s/1DSI7uc5yBNY3hzedu9L7mw
(2) Model testing: about 6 hours, mostly spent on cross-validated parameter search; this step is for reference only and can be skipped.
(3) Strategy construction: about 36 minutes; there are many portfolios to backtest and the industry-neutral version is relatively involved.
Factor data is extracted at the end of each month, so we need the list of month-end trading dates.
The inputs are peroid, start_date, and end_date: peroid selects the frequency, which can be weekly (W), monthly (M), or quarterly (Q), while start_date and end_date are the start and end dates.
The function returns the corresponding period-end dates. This article uses 2010-01-01 as the start date and 2018-01-01 as the end date.
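The month-end extraction can be sketched in plain pandas (illustrative dates; the real implementation below pulls JoinQuant's trading calendar via get_trade_days):

```python
import pandas as pd

# Group a daily trading calendar by month and keep each month's last
# trading day (business days stand in for actual trading days).
trade_days = pd.Series(pd.bdate_range('2010-01-01', '2010-03-31'))
month_ends = (trade_days.groupby(trade_days.dt.to_period('M'))
                        .max().dt.strftime('%Y-%m-%d').tolist())
# month_ends == ['2010-01-29', '2010-02-26', '2010-03-31']
```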
from jqdata import *
from jqlib.technical_analysis import *
from jqfactor import get_factor_values
from jqfactor import winsorize_med
from jqfactor import standardlize
from jqfactor import neutralize
import datetime
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels import regression
import pickle
from six import StringIO
# import PCA
from sklearn.decomposition import PCA
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns
import jqdatasdk
from jqdatasdk import *
auth('1866530448','')
ImportError: No module named 'jqdatasdk'
import numpy as np
import pandas as pd
import datetime
stock_data = get_price('000001.XSHE','2010-01-01', '2018-01-01','daily',fields=['close'])
stock_data['date']=stock_data.index
stock_data.head(50)
close | date | |
---|---|---|
2010-01-04 | 7.89 | 2010-01-04 |
2010-01-05 | 7.75 | 2010-01-05 |
2010-01-06 | 7.62 | 2010-01-06 |
2010-01-07 | 7.54 | 2010-01-07 |
2010-01-08 | 7.52 | 2010-01-08 |
2010-01-11 | 7.52 | 2010-01-11 |
2010-01-12 | 7.47 | 2010-01-12 |
2010-01-13 | 6.97 | 2010-01-13 |
2010-01-14 | 6.98 | 2010-01-14 |
2010-01-15 | 7.13 | 2010-01-15 |
2010-01-18 | 7.14 | 2010-01-18 |
2010-01-19 | 7.40 | 2010-01-19 |
2010-01-20 | 7.11 | 2010-01-20 |
2010-01-21 | 7.55 | 2010-01-21 |
2010-01-22 | 7.68 | 2010-01-22 |
2010-01-25 | 7.38 | 2010-01-25 |
2010-01-26 | 7.34 | 2010-01-26 |
2010-01-27 | 7.29 | 2010-01-27 |
2010-01-28 | 7.24 | 2010-01-28 |
2010-01-29 | 7.22 | 2010-01-29 |
2010-02-01 | 7.07 | 2010-02-01 |
2010-02-02 | 7.11 | 2010-02-02 |
2010-02-03 | 7.51 | 2010-02-03 |
2010-02-04 | 7.39 | 2010-02-04 |
2010-02-05 | 7.32 | 2010-02-05 |
2010-02-08 | 7.17 | 2010-02-08 |
2010-02-09 | 7.32 | 2010-02-09 |
2010-02-10 | 7.41 | 2010-02-10 |
2010-02-11 | 7.36 | 2010-02-11 |
2010-02-12 | 7.47 | 2010-02-12 |
2010-02-22 | 7.35 | 2010-02-22 |
2010-02-23 | 7.17 | 2010-02-23 |
2010-02-24 | 7.22 | 2010-02-24 |
2010-02-25 | 7.37 | 2010-02-25 |
2010-02-26 | 7.47 | 2010-02-26 |
2010-03-01 | 7.47 | 2010-03-01 |
2010-03-02 | 7.69 | 2010-03-02 |
2010-03-03 | 7.75 | 2010-03-03 |
2010-03-04 | 7.68 | 2010-03-04 |
2010-03-05 | 7.74 | 2010-03-05 |
2010-03-08 | 7.93 | 2010-03-08 |
2010-03-09 | 7.92 | 2010-03-09 |
2010-03-10 | 7.78 | 2010-03-10 |
2010-03-11 | 7.87 | 2010-03-11 |
2010-03-12 | 7.63 | 2010-03-12 |
2010-03-15 | 7.45 | 2010-03-15 |
2010-03-16 | 7.50 | 2010-03-16 |
2010-03-17 | 7.71 | 2010-03-17 |
2010-03-18 | 7.67 | 2010-03-18 |
2010-03-19 | 7.66 | 2010-03-19 |
stock_data.head(50).resample('W').mean()
close | |
---|---|
2010-01-10 | 7.664 |
2010-01-17 | 7.214 |
2010-01-24 | 7.376 |
2010-01-31 | 7.294 |
2010-02-07 | 7.280 |
2010-02-14 | 7.346 |
2010-02-21 | NaN |
2010-02-28 | 7.316 |
2010-03-07 | 7.666 |
2010-03-14 | 7.826 |
2010-03-21 | 7.598 |
stock_data.head(50).resample('W').last()
close | date | |
---|---|---|
2010-01-10 | 7.52 | 2010-01-08 |
2010-01-17 | 7.13 | 2010-01-15 |
2010-01-24 | 7.68 | 2010-01-22 |
2010-01-31 | 7.22 | 2010-01-29 |
2010-02-07 | 7.32 | 2010-02-05 |
2010-02-14 | 7.47 | 2010-02-12 |
2010-02-21 | NaN | NaT |
2010-02-28 | 7.47 | 2010-02-26 |
2010-03-07 | 7.74 | 2010-03-05 |
2010-03-14 | 7.63 | 2010-03-12 |
2010-03-21 | 7.66 | 2010-03-19 |
period_stock_data = stock_data.head(50).resample('M').last()
date=period_stock_data.index
date
DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31'], dtype='datetime64[ns]', freq='M')
date[0]
Timestamp('2010-01-31 00:00:00', freq='M')
date.to_pydatetime()
array([datetime.datetime(2010, 1, 31, 0, 0), datetime.datetime(2010, 2, 28, 0, 0), datetime.datetime(2010, 3, 31, 0, 0)], dtype=object)
np.vectorize(lambda x:x.strftime('%Y-%m-%d'))(date.to_pydatetime())
array(['2010-01-31', '2010-02-28', '2010-03-31'], dtype='<U10')
pd.Series(np.vectorize(lambda x:x.strftime('%Y-%m-%d'))(date.to_pydatetime())).values.tolist()
['2010-01-31', '2010-02-28', '2010-03-31']
from jqdata import *
stock_data.resample('M').last()
date | |
---|---|
date | |
2010-01-31 | 2010-01-29 |
2010-02-28 | 2010-02-26 |
2010-03-31 | 2010-03-31 |
2010-04-30 | 2010-04-30 |
2010-05-31 | 2010-05-31 |
2010-06-30 | 2010-06-30 |
2010-07-31 | 2010-07-30 |
2010-08-31 | 2010-08-31 |
2010-09-30 | 2010-09-30 |
2010-10-31 | 2010-10-29 |
2010-11-30 | 2010-11-30 |
2010-12-31 | 2010-12-31 |
2011-01-31 | 2011-01-31 |
2011-02-28 | 2011-02-28 |
2011-03-31 | 2011-03-31 |
2011-04-30 | 2011-04-29 |
2011-05-31 | 2011-05-31 |
2011-06-30 | 2011-06-30 |
2011-07-31 | 2011-07-29 |
2011-08-31 | 2011-08-31 |
2011-09-30 | 2011-09-30 |
2011-10-31 | 2011-10-31 |
2011-11-30 | 2011-11-30 |
2011-12-31 | 2011-12-30 |
2012-01-31 | 2012-01-31 |
2012-02-29 | 2012-02-29 |
2012-03-31 | 2012-03-30 |
2012-04-30 | 2012-04-27 |
2012-05-31 | 2012-05-31 |
2012-06-30 | 2012-06-29 |
2012-07-31 | 2012-07-31 |
2012-08-31 | 2012-08-31 |
2012-09-30 | 2012-09-28 |
2012-10-31 | 2012-10-31 |
2012-11-30 | 2012-11-30 |
2012-12-31 | 2012-12-31 |
'''
How the last trading day of each month is computed:
1. Get the trading calendar: either pull index price data and use its dates, or call get_trade_days directly for all trading days in a date range;
2. Aggregate the dates with resample into the desired bins and take the bin timestamps;
3. Use np.vectorize on the resulting datetime array to vectorize the formatting function, then pass the array in.
'''
from jqdata import *
import pandas as pd
# get the list of period-end dates for a given frequency: 'W', 'M', 'Q'
def get_period_date(peroid, start_date, end_date):
    # peroid: resampling frequency -- weekly 'W', monthly 'M', quarterly 'Q'
    # (pandas also accepts e.g. '5min' or '12D')
    # stock_data = get_price('000001.XSHE',start_date,end_date,'daily',fields=['close'])
    # # record the last trading day of each period
    # stock_data['date']=stock_data.index
    '''call get_trade_days directly for all trading days in the date range'''
    stock_data = pd.DataFrame()
    stock_data['date'] = get_trade_days(start_date=start_date, end_date=end_date, count=None)
    stock_data.index = stock_data['date'].apply(lambda x: pd.to_datetime(x))
    # resample: each period keeps the value of its last trading day
    period_stock_data = stock_data.resample(peroid).last()
    date = period_stock_data.index
    pydate_array = date.to_pydatetime()
    date_only_array = list(np.vectorize(lambda s: s.strftime('%Y-%m-%d'))(pydate_array))
    # date_only_series = pd.Series(date_only_array)
    # prepend the day before start_date so the first period has a prior anchor
    start_date = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    start_date = start_date - datetime.timedelta(days=1)
    start_date = start_date.strftime("%Y-%m-%d")
    date_only_array.insert(0, start_date)
    return date_only_array
np.array(get_period_date('M','2010-01-01', '2018-01-01'))
array(['2009-12-31', '2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30', '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31', '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31', '2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30', '2011-05-31', '2011-06-30', '2011-07-31', '2011-08-31', '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-31', '2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30', '2012-05-31', '2012-06-30', '2012-07-31', '2012-08-31', '2012-09-30', '2012-10-31', '2012-11-30', '2012-12-31', '2013-01-31', '2013-02-28', '2013-03-31', '2013-04-30', '2013-05-31', '2013-06-30', '2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31', '2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30', '2014-07-31', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30', '2014-12-31', '2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30', '2015-05-31', '2015-06-30', '2015-07-31', '2015-08-31', '2015-09-30', '2015-10-31', '2015-11-30', '2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31'], dtype='<U10')
# compute the first and last trading day of each month over a period
def calculate_FL(time_list):
    # time_list: trading days, obtainable via get_trade_days
time_list_df = pd.DataFrame(time_list,columns=['time'])
time_list_df['time_str'] = time_list_df['time'].apply(lambda x:datetime.datetime.strftime(x,'%Y-%m-%d'))
time_list_df['year'] = time_list_df['time_str'].apply(lambda x:int(x.split('-')[0]))
time_list_df['month'] = time_list_df['time_str'].apply(lambda x:int(x.split('-')[1]))
time_list_df['day'] = time_list_df['time_str'].apply(lambda x:int(x.split('-')[2]))
time_list_df['cum_year'] = time_list_df['year']-time_list_df['year'].iloc[0]
time_list_df['cum_month'] = time_list_df['cum_year']*12 + time_list_df['month']
time_list_df['diff_month'] = time_list_df['cum_month'].diff()
time_list_df['diff_shift_month'] = time_list_df['diff_month'].shift(-1)
trade_end = list(time_list_df[time_list_df['diff_shift_month']==1]['time_str'].values)
trade_start = list(time_list_df[time_list_df['diff_month'] == 1]['time_str'].values)
trade_start.append(time_list_df['time_str'].iloc[0])
trade_start = sorted(trade_start)
trade_end.append(time_list_df['time_str'].iloc[-1])
return trade_start,trade_end
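The intent of calculate_FL can be cross-checked on a synthetic calendar: the first and last business day of each month come out of a single groupby (business days stand in for the get_trade_days output):

```python
import pandas as pd

# Business-day calendar as a stand-in for actual trading days.
days = pd.Series(pd.bdate_range('2010-01-01', '2010-02-28'))
by_month = days.groupby(days.dt.to_period('M'))
trade_start = by_month.min().dt.strftime('%Y-%m-%d').tolist()
trade_end = by_month.max().dt.strftime('%Y-%m-%d').tolist()
# trade_start == ['2010-01-01', '2010-02-01']
# trade_end   == ['2010-01-29', '2010-02-26']
```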
Available stock pools: HS300, ZZ500, CSI 800, the ChiNext index, and all A shares.
Stock screening: drop ST stocks and stocks listed less than 3 months; each stock is one sample.
As an example, take the HS300 constituents as of 2017-06-01.
# drop stocks listed less than n days (default ~3 months) before beginDate
def delect_stop(stocks,beginDate,n=30*3):
stockList=[]
beginDate = datetime.datetime.strptime(beginDate, "%Y-%m-%d")
for stock in stocks:
start_date=get_security_info(stock).start_date
if start_date<(beginDate-datetime.timedelta(days=n)).date():
stockList.append(stock)
return stockList
# build the stock pool
def get_stock(stockPool,begin_date):
if stockPool=='HS300':
stockList=get_index_stocks('000300.XSHG',begin_date)
elif stockPool=='ZZ500':
stockList=get_index_stocks('399905.XSHE',begin_date)
elif stockPool=='ZZ800':
stockList=get_index_stocks('399906.XSHE',begin_date)
elif stockPool=='CYBZ':
stockList=get_index_stocks('399006.XSHE',begin_date)
elif stockPool=='ZXBZ':
stockList=get_index_stocks('399005.XSHE',begin_date)
elif stockPool=='A':
stockList=get_index_stocks('000002.XSHG',begin_date)+get_index_stocks('399107.XSHE',begin_date)
# stockList = list(get_all_securities('stock').index)
# drop ST stocks
st_data=get_extras('is_st',stockList, count = 1,end_date=begin_date)
stockList = [stock for stock in stockList if not st_data[stock][0]]
# drop suspended, newly listed, and delisted stocks
stockList=delect_stop(stockList,begin_date)
return stockList
get_stock('HS300','2017-06-01')
[u'000001.XSHE', u'000002.XSHE', u'000008.XSHE', u'000009.XSHE', u'000027.XSHE', u'000039.XSHE', u'000060.XSHE', u'000061.XSHE', u'000063.XSHE', u'000069.XSHE', u'000100.XSHE', u'000156.XSHE', u'000157.XSHE', u'000166.XSHE', u'000333.XSHE', u'000338.XSHE', u'000402.XSHE', u'000413.XSHE', u'000415.XSHE', u'000423.XSHE', u'000425.XSHE', u'000503.XSHE', u'000538.XSHE', u'000540.XSHE', u'000555.XSHE', u'000559.XSHE', u'000568.XSHE', u'000623.XSHE', u'000625.XSHE', u'000627.XSHE', u'000630.XSHE', u'000651.XSHE', u'000671.XSHE', u'000686.XSHE', u'000709.XSHE', u'000712.XSHE', u'000718.XSHE', u'000725.XSHE', u'000728.XSHE', u'000738.XSHE', u'000750.XSHE', u'000768.XSHE', u'000776.XSHE', u'000778.XSHE', u'000783.XSHE', u'000792.XSHE', u'000793.XSHE', u'000800.XSHE', u'000826.XSHE', u'000839.XSHE', u'000858.XSHE', u'000876.XSHE', u'000895.XSHE', u'000917.XSHE', u'000938.XSHE', u'000963.XSHE', u'000977.XSHE', u'000983.XSHE', u'001979.XSHE', u'002007.XSHE', u'002008.XSHE', u'002024.XSHE', u'002027.XSHE', u'002049.XSHE', u'002065.XSHE', u'002074.XSHE', u'002081.XSHE', u'002085.XSHE', u'002129.XSHE', u'002131.XSHE', u'002142.XSHE', u'002146.XSHE', u'002152.XSHE', u'002153.XSHE', u'002174.XSHE', u'002183.XSHE', u'002195.XSHE', u'002202.XSHE', u'002230.XSHE', u'002236.XSHE', u'002241.XSHE', u'002252.XSHE', u'002292.XSHE', u'002299.XSHE', u'002304.XSHE', u'002310.XSHE', u'002385.XSHE', u'002415.XSHE', u'002424.XSHE', u'002426.XSHE', u'002450.XSHE', u'002456.XSHE', u'002465.XSHE', u'002466.XSHE', u'002470.XSHE', u'002475.XSHE', u'002500.XSHE', u'002568.XSHE', u'002594.XSHE', u'002673.XSHE', u'002714.XSHE', u'002736.XSHE', u'002739.XSHE', u'002797.XSHE', u'300002.XSHE', u'300015.XSHE', u'300017.XSHE', u'300024.XSHE', u'300027.XSHE', u'300033.XSHE', u'300058.XSHE', u'300059.XSHE', u'300070.XSHE', u'300072.XSHE', u'300085.XSHE', u'300104.XSHE', u'300124.XSHE', u'300133.XSHE', u'300144.XSHE', u'300146.XSHE', u'300168.XSHE', u'300182.XSHE', u'300251.XSHE', u'300315.XSHE', 
u'600000.XSHG', u'600008.XSHG', u'600009.XSHG', u'600010.XSHG', u'600015.XSHG', u'600016.XSHG', u'600018.XSHG', u'600019.XSHG', u'600021.XSHG', u'600023.XSHG', u'600028.XSHG', u'600029.XSHG', u'600030.XSHG', u'600031.XSHG', u'600036.XSHG', u'600037.XSHG', u'600038.XSHG', u'600048.XSHG', u'600050.XSHG', u'600060.XSHG', u'600061.XSHG', u'600066.XSHG', u'600068.XSHG', u'600074.XSHG', u'600085.XSHG', u'600089.XSHG', u'600100.XSHG', u'600104.XSHG', u'600109.XSHG', u'600111.XSHG', u'600115.XSHG', u'600118.XSHG', u'600150.XSHG', u'600153.XSHG', u'600157.XSHG', u'600170.XSHG', u'600177.XSHG', u'600188.XSHG', u'600196.XSHG', u'600208.XSHG', u'600221.XSHG', u'600252.XSHG', u'600256.XSHG', u'600271.XSHG', u'600276.XSHG', u'600297.XSHG', u'600309.XSHG', u'600332.XSHG', u'600340.XSHG', u'600352.XSHG', u'600362.XSHG', u'600369.XSHG', u'600372.XSHG', u'600373.XSHG', u'600376.XSHG', u'600383.XSHG', u'600406.XSHG', u'600415.XSHG', u'600446.XSHG', u'600482.XSHG', u'600485.XSHG', u'600489.XSHG', u'600498.XSHG', u'600518.XSHG', u'600519.XSHG', u'600535.XSHG', u'600547.XSHG', u'600549.XSHG', u'600570.XSHG', u'600582.XSHG', u'600583.XSHG', u'600585.XSHG', u'600588.XSHG', u'600606.XSHG', u'600637.XSHG', u'600648.XSHG', u'600649.XSHG', u'600660.XSHG', u'600663.XSHG', u'600666.XSHG', u'600674.XSHG', u'600685.XSHG', u'600688.XSHG', u'600690.XSHG', u'600703.XSHG', u'600704.XSHG', u'600705.XSHG', u'600718.XSHG', u'600737.XSHG', u'600739.XSHG', u'600741.XSHG', u'600754.XSHG', u'600783.XSHG', u'600795.XSHG', u'600804.XSHG', u'600816.XSHG', u'600820.XSHG', u'600827.XSHG', u'600837.XSHG', u'600839.XSHG', u'600867.XSHG', u'600871.XSHG', u'600873.XSHG', u'600875.XSHG', u'600886.XSHG', u'600887.XSHG', u'600893.XSHG', u'600895.XSHG', u'600900.XSHG', u'600958.XSHG', u'600959.XSHG', u'600999.XSHG', u'601006.XSHG', u'601009.XSHG', u'601018.XSHG', u'601021.XSHG', u'601088.XSHG', u'601099.XSHG', u'601111.XSHG', u'601118.XSHG', u'601127.XSHG', u'601155.XSHG', u'601166.XSHG', u'601169.XSHG', u'601186.XSHG', 
u'601198.XSHG', u'601211.XSHG', u'601216.XSHG', u'601225.XSHG', u'601258.XSHG', u'601288.XSHG', u'601318.XSHG', u'601328.XSHG', u'601333.XSHG', u'601336.XSHG', u'601377.XSHG', u'601390.XSHG', u'601398.XSHG', u'601555.XSHG', u'601600.XSHG', u'601601.XSHG', u'601607.XSHG', u'601608.XSHG', u'601611.XSHG', u'601618.XSHG', u'601628.XSHG', u'601633.XSHG', u'601668.XSHG', u'601669.XSHG', u'601688.XSHG', u'601718.XSHG', u'601727.XSHG', u'601766.XSHG', u'601788.XSHG', u'601800.XSHG', u'601818.XSHG', u'601857.XSHG', u'601866.XSHG', u'601872.XSHG', u'601877.XSHG', u'601888.XSHG', u'601899.XSHG', u'601901.XSHG', u'601919.XSHG', u'601928.XSHG', u'601933.XSHG', u'601939.XSHG', u'601958.XSHG', u'601985.XSHG', u'601988.XSHG', u'601989.XSHG', u'601998.XSHG', u'603000.XSHG', u'603885.XSHG', u'603993.XSHG']
Feature extraction: on the last trading day of each natural month, compute the 66 factor exposures from the earlier report as the raw features of each sample (the Wind factors cannot be retrieved and the shareholder factors have too many gaps, so both are skipped).
Feature preprocessing:
(1) Median de-extreming: let Di be a factor's exposure series across stocks in period T, DM its median, and DM1 the median of |Di - DM|; all values above DM + 5*DM1 are reset to DM + 5*DM1, and all values below DM - 5*DM1 are reset to DM - 5*DM1. Implemented with JoinQuant's winsorize_med().
(2) Missing-value handling: after de-extreming, missing exposures are set to the mean of stocks in the same Shenwan level-1 industry, via replace_nan_indu().
(3) Industry and market-cap neutralization: regress the filled exposures on industry dummies and log market cap, keeping the residuals as the new exposures, via JoinQuant's neutralize().
(4) Standardization: subtract the neutralized series' mean and divide by its standard deviation to get an approximately N(0,1) series, via JoinQuant's standardlize().
Note: since this takes a long time to run, the data has been precomputed; download: https://pan.baidu.com/s/1DSI7uc5yBNY3hzedu9L7mw
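Steps (1) and (4) can be reproduced in plain pandas to see what winsorize_med() and standardlize() do (a simplified sketch; the industry fill and neutralization of steps (2)-(3) need JoinQuant data and are omitted):

```python
import pandas as pd

def winsorize_median(s, scale=5):
    # clip to median +/- scale * median absolute deviation
    dm = s.median()
    dm1 = (s - dm).abs().median()
    return s.clip(lower=dm - scale * dm1, upper=dm + scale * dm1)

def standardize(s):
    # zero mean, unit standard deviation
    return (s - s.mean()) / s.std()

raw = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an outlier
clipped = winsorize_median(raw)               # -> [1, 2, 3, 4, 8]
z = standardize(clipped)                      # mean 0, std 1
```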
def linreg(X, Y, columns=3):
    # OLS helper: returns the fitted params, or NaNs when there is only one sample
    X = sm.add_constant(np.array(X))
    Y = np.array(Y)
    if len(Y) > 1:
        results = regression.linear_model.OLS(Y, X).fit()
        return results.params
    else:
        return [float("nan")] * (columns + 1)
linreg([1],[1])
[nan, nan, nan, nan]
# map a stock back to its industry
def get_industry_name(i_Constituent_Stocks, value):
    return [k for k, v in i_Constituent_Stocks.items() if value in v]
# missing-value handling
def replace_nan_indu(factor_data, stockList, industry_code, date):
    # fill NaN with the industry mean; any remaining NaN gets the all-stock mean
    i_Constituent_Stocks = {}
    data_temp = pd.DataFrame(index=industry_code, columns=factor_data.columns)
    for i in industry_code:
        temp = get_industry_stocks(i, date)
        i_Constituent_Stocks[i] = list(set(temp).intersection(set(stockList)))
        data_temp.loc[i] = np.mean(factor_data.loc[i_Constituent_Stocks[i], :])
    for factor in data_temp.columns:
        # an industry with no data gets the mean across all industries
        null_industry = list(data_temp.loc[pd.isnull(data_temp[factor]), factor].keys())
        for i in null_industry:
            data_temp.loc[i, factor] = np.mean(data_temp[factor])
        null_stock = list(factor_data.loc[pd.isnull(factor_data[factor]), factor].keys())
        for i in null_stock:
            industry = get_industry_name(i_Constituent_Stocks, i)
            if industry:
                factor_data.loc[i, factor] = data_temp.loc[industry[0], factor]
            else:
                factor_data.loc[i, factor] = np.mean(factor_data[factor])
    return factor_data
'''Build a panel from the factor dict and take a slice'''
from jqfactor import get_factor_values
factors = get_factor_values(['000001.XSHE','000002.XSHE'],['roe_ttm','roa_ttm','total_asset_turnover_rate',\
'net_operate_cash_flow_ttm','net_profit_ttm',\
'cash_to_current_liability','current_ratio',\
'gross_income_ratio','non_recurring_gain_loss',\
'operating_revenue_ttm','net_profit_growth_rate'],end_date='2018-01-01',count=1)
df_ = pd.Panel(factors)
df_.iloc[:,0,:]
'''Loop over the factors, copying each factor's values into a DataFrame indexed by stock'''
factor_data=get_factor_values(['000001.XSHE','000002.XSHE'],['roe_ttm','roa_ttm','total_asset_turnover_rate',\
'net_operate_cash_flow_ttm','net_profit_ttm',\
'cash_to_current_liability','current_ratio',\
'gross_income_ratio','non_recurring_gain_loss',\
'operating_revenue_ttm','net_profit_growth_rate'],end_date='2018-01-01',count=1)
factor=pd.DataFrame(index=['000001.XSHE','000002.XSHE'])
for i in factor_data.keys():
factor[i]=factor_data[i].iloc[0,:]
cash_to_current_liability | current_ratio | gross_income_ratio | net_operate_cash_flow_ttm | net_profit_growth_rate | net_profit_ttm | non_recurring_gain_loss | operating_revenue_ttm | roa_ttm | roe_ttm | total_asset_turnover_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|
000001.XSHE | NaN | NaN | NaN | 5.944002e+09 | 0.008274 | 2.303300e+10 | -1.899981e+07 | 1.055800e+11 | 0.007341 | 0.105602 | 0.033651 |
000002.XSHE | 0.131772 | 1.210249 | 0.319474 | 1.328647e+10 | 0.150733 | 3.187653e+10 | 1.291532e+09 | 2.405229e+11 | 0.031301 | 0.190052 | 0.236181 |
q = query(valuation,balance,cash_flow,income,indicator).filter(valuation.code.in_(['000001.XSHE','000002.XSHE']))
df = get_fundamentals(q, '2018-01-01')
df['market_cap']=df['market_cap']*100000000
df
id | code | pe_ratio | turnover_ratio | pb_ratio | ps_ratio | pcf_ratio | capitalization | market_cap | circulating_cap | ... | inc_total_revenue_year_on_year | inc_total_revenue_annual | inc_revenue_year_on_year | inc_revenue_annual | inc_operation_profit_year_on_year | inc_operation_profit_annual | inc_net_profit_year_on_year | inc_net_profit_annual | inc_net_profit_to_shareholders_year_on_year | inc_net_profit_to_shareholders_annual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16041672 | 000001.XSHE | 9.9148 | 0.5810 | 1.1524 | 2.1630 | -23.2837 | 1717041.125 | 2.283665e+11 | 1691799.0 | ... | -5.2906 | -2.2799 | -5.2906 | -2.2799 | 1.7046 | 3.5437 | 2.6762 | 4.0852 | 2.6762 | 4.0852 |
1 | 16041673 | 000002.XSHE | 14.3756 | 0.3968 | 2.9463 | 1.4255 | 20.6124 | 1103915.250 | 3.363281e+11 | 970916.5 | ... | 11.9039 | -7.6750 | 11.9039 | -7.6750 | 19.4354 | -41.9464 | 13.5359 | -46.7393 | 30.1334 | -42.6654 |
2 rows × 237 columns
his_date = [pd.to_datetime('2018-01-01') - datetime.timedelta(90*i) for i in range(0, 4)]
his_date
[Timestamp('2018-01-01 00:00:00'), Timestamp('2017-10-03 00:00:00'), Timestamp('2017-07-05 00:00:00'), Timestamp('2017-04-06 00:00:00')]
stats.linregress([1,2,3], [2,3,4])
LinregressResult(slope=1.0, intercept=1.0, rvalue=1.0, pvalue=9.003163161571059e-11, stderr=0.0)
stats.linregress([1,2,3], [2,3,4])[:2]
(1.0, 1.0)
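These linregress calls preview the alpha/beta factor computed later in get_factor_data: regressing a stock's daily returns on the index returns gives beta as the slope and alpha as the intercept. A self-contained sketch on synthetic returns (the coefficients 0.0005 and 1.5 are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(42)
index_ret = rng.normal(0.0, 0.01, 250)  # synthetic index daily returns
stock_ret = 0.0005 + 1.5 * index_ret + rng.normal(0.0, 0.005, 250)

# slope = beta, intercept = alpha
beta, alpha = stats.linregress(index_ret, stock_ret)[:2]
```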
# fetch turnover-ratio data
data_turnover_ratio=pd.DataFrame()
data_turnover_ratio['code']=['000001.XSHE','000002.XSHE']
trade_days=list(get_trade_days(end_date='2018-01-01', count=240*2))
q = query(valuation.code,valuation.turnover_ratio).filter(valuation.code.in_(['000001.XSHE','000002.XSHE']))
temp = get_fundamentals(q, '2018-01-01')
# for i in trade_days:
# q = query(valuation.code,valuation.turnover_ratio).filter(valuation.code.in_(stock))
# temp = get_fundamentals(q, i)
# data_turnover_ratio=pd.merge(data_turnover_ratio, temp,how='left',on='code')
# data_turnover_ratio=data_turnover_ratio.rename(columns={'turnover_ratio':i})
# data_turnover_ratio=data_turnover_ratio.set_index('code').T
temp
code | turnover_ratio | |
---|---|---|
0 | 000001.XSHE | 0.5810 |
1 | 000002.XSHE | 0.3968 |
# temp2: a second turnover-ratio frame fetched the same way for another date (not shown above)
pd.merge(temp2, temp,how='left',on='code')
code | turnover_ratio_x | turnover_ratio_y | |
---|---|---|---|
0 | 000001.XSHE | 1.2304 | 1.2304 |
1 | 000002.XSHE | 0.7039 | 0.7039 |
# fetch turnover-ratio data for all trading days
data_turnover_ratio=pd.DataFrame()
data_turnover_ratio['code']=['000001.XSHE','000002.XSHE']
trade_days=list(get_trade_days(end_date='2018-01-01', count=240*2))
for i in trade_days:
q = query(valuation.code,valuation.turnover_ratio).filter(valuation.code.in_(['000001.XSHE','000002.XSHE']))
temp = get_fundamentals(q, i)
data_turnover_ratio=pd.merge(data_turnover_ratio, temp,how='left',on='code')
data_turnover_ratio=data_turnover_ratio.rename(columns={'turnover_ratio':i})
data_turnover_ratio.set_index('code')
2016-01-14 | 2016-01-15 | 2016-01-18 | 2016-01-19 | 2016-01-20 | 2016-01-21 | 2016-01-22 | 2016-01-25 | 2016-01-26 | 2016-01-27 | ... | 2017-12-18 | 2017-12-19 | 2017-12-20 | 2017-12-21 | 2017-12-22 | 2017-12-25 | 2017-12-26 | 2017-12-27 | 2017-12-28 | 2017-12-29 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
code | |||||||||||||||||||||
000001.XSHE | 0.5645 | 0.3797 | 0.3567 | 0.4245 | 0.5115 | 0.5135 | 0.3954 | 0.3189 | 0.5489 | 0.4821 | ... | 0.4761 | 1.4174 | 0.6539 | 0.8779 | 0.4391 | 0.9372 | 0.6642 | 0.8078 | 0.9180 | 0.5810 |
000002.XSHE | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | ... | 0.4339 | 0.3503 | 0.3389 | 0.6130 | 0.3525 | 0.5216 | 0.2872 | 0.4073 | 0.3783 | 0.3968 |
2 rows × 480 columns
data_turnover_ratio=data_turnover_ratio.set_index('code').T
data_turnover_ratio
code | 000001.XSHE | 000002.XSHE |
---|---|---|
2016-01-14 | 0.5645 | 0.0000 |
2016-01-15 | 0.3797 | 0.0000 |
2016-01-18 | 0.3567 | 0.0000 |
2016-01-19 | 0.4245 | 0.0000 |
2016-01-20 | 0.5115 | 0.0000 |
2016-01-21 | 0.5135 | 0.0000 |
2016-01-22 | 0.3954 | 0.0000 |
2016-01-25 | 0.3189 | 0.0000 |
2016-01-26 | 0.5489 | 0.0000 |
2016-01-27 | 0.4821 | 0.0000 |
2016-01-28 | 0.2563 | 0.0000 |
2016-01-29 | 0.4612 | 0.0000 |
2016-02-01 | 0.3539 | 0.0000 |
2016-02-02 | 0.3127 | 0.0000 |
2016-02-03 | 0.2326 | 0.0000 |
2016-02-04 | 0.3161 | 0.0000 |
2016-02-05 | 0.2295 | 0.0000 |
2016-02-15 | 0.2359 | 0.0000 |
2016-02-16 | 0.3629 | 0.0000 |
2016-02-17 | 0.4957 | 0.0000 |
2016-02-18 | 0.3441 | 0.0000 |
2016-02-19 | 0.2702 | 0.0000 |
2016-02-22 | 0.5233 | 0.0000 |
2016-02-23 | 0.3608 | 0.0000 |
2016-02-24 | 0.2542 | 0.0000 |
2016-02-25 | 0.5270 | 0.0000 |
2016-02-26 | 0.3322 | 0.0000 |
2016-02-29 | 0.4803 | 0.0000 |
2016-03-01 | 0.3202 | 0.0000 |
2016-03-02 | 0.5732 | 0.0000 |
... | ... | ... |
2017-11-20 | 1.6810 | 0.4069 |
2017-11-21 | 1.4750 | 1.1142 |
2017-11-22 | 1.5192 | 1.0845 |
2017-11-23 | 1.4358 | 0.5977 |
2017-11-24 | 1.5547 | 0.5400 |
2017-11-27 | 1.2201 | 0.5372 |
2017-11-28 | 1.0441 | 0.4179 |
2017-11-29 | 0.9245 | 0.9246 |
2017-11-30 | 0.8155 | 1.0306 |
2017-12-01 | 1.0551 | 0.5741 |
2017-12-04 | 0.8595 | 0.3913 |
2017-12-05 | 1.0188 | 0.5812 |
2017-12-06 | 0.9575 | 0.4015 |
2017-12-07 | 0.6986 | 0.6077 |
2017-12-08 | 0.7987 | 0.6058 |
2017-12-11 | 1.3370 | 0.5172 |
2017-12-12 | 1.0305 | 0.4816 |
2017-12-13 | 0.7621 | 0.3450 |
2017-12-14 | 0.5923 | 0.3935 |
2017-12-15 | 0.6502 | 0.4081 |
2017-12-18 | 0.4761 | 0.4339 |
2017-12-19 | 1.4174 | 0.3503 |
2017-12-20 | 0.6539 | 0.3389 |
2017-12-21 | 0.8779 | 0.6130 |
2017-12-22 | 0.4391 | 0.3525 |
2017-12-25 | 0.9372 | 0.5216 |
2017-12-26 | 0.6642 | 0.2872 |
2017-12-27 | 0.8078 | 0.4073 |
2017-12-28 | 0.9180 | 0.3783 |
2017-12-29 | 0.5810 | 0.3968 |
480 rows × 2 columns
# helper function for linear regression
def linreg(X, Y, columns=3):
    X = sm.add_constant(np.array(X))
    Y = np.array(Y)
    if len(Y) > 1:
        results = regression.linear_model.OLS(Y, X).fit()
        return results.params
    else:
        return [float("nan")] * (columns + 1)
# map a stock back to its industry
def get_industry_name(i_Constituent_Stocks, value):
    return [k for k, v in i_Constituent_Stocks.items() if value in v]
# missing-value handling
def replace_nan_indu(factor_data, stockList, industry_code, date):
    # fill NaN with the industry mean; any remaining NaN gets the all-stock mean
    i_Constituent_Stocks = {}
    data_temp = pd.DataFrame(index=industry_code, columns=factor_data.columns)
    for i in industry_code:
        temp = get_industry_stocks(i, date)
        i_Constituent_Stocks[i] = list(set(temp).intersection(set(stockList)))
        data_temp.loc[i] = np.mean(factor_data.loc[i_Constituent_Stocks[i], :])
    for factor in data_temp.columns:
        # an industry with no data gets the mean across all industries
        null_industry = list(data_temp.loc[pd.isnull(data_temp[factor]), factor].keys())
        for i in null_industry:
            data_temp.loc[i, factor] = np.mean(data_temp[factor])
        null_stock = list(factor_data.loc[pd.isnull(factor_data[factor]), factor].keys())
        for i in null_stock:
            industry = get_industry_name(i_Constituent_Stocks, i)
            if industry:
                factor_data.loc[i, factor] = data_temp.loc[industry[0], factor]
            else:
                factor_data.loc[i, factor] = np.mean(factor_data[factor])
    return factor_data
# data preprocessing
def data_preprocessing(factor_data, stockList, industry_code, date):
    # winsorize by median (de-extreme)
    factor_data = winsorize_med(factor_data, scale=5, inf2nan=False, axis=0)
    # fill missing values with industry means
    factor_data = replace_nan_indu(factor_data, stockList, industry_code, date)
    # industry and market-cap neutralization
    factor_data = neutralize(factor_data, how=['sw_l1', 'market_cap'], date=date, axis=0)
    # standardize
    factor_data = standardlize(factor_data, axis=0)
    return factor_data
# fetch all factor data as of `date`
def get_factor_data(stock,date):
data=pd.DataFrame(index=stock)
q = query(valuation,balance,cash_flow,income,indicator).filter(valuation.code.in_(stock))
df = get_fundamentals(q, date)
df['market_cap']=df['market_cap']*100000000
factor_data=get_factor_values(stock,['roe_ttm','roa_ttm','total_asset_turnover_rate',\
'net_operate_cash_flow_ttm','net_profit_ttm',\
'cash_to_current_liability','current_ratio',\
'gross_income_ratio','non_recurring_gain_loss',\
'operating_revenue_ttm','net_profit_growth_rate'],end_date=date,count=1)
factor=pd.DataFrame(index=stock)
for i in factor_data.keys():
factor[i]=factor_data[i].iloc[0,:]
df.index = df['code']
del df['code'],df['id']
    # merge into one wide table
df=pd.concat([df,factor],axis=1)
    # net profit (TTM) / market cap
    data['EP']=df['net_profit_ttm']/df['market_cap']
    # book value / market cap
    data['BP']=1/df['pb_ratio']
    # revenue (TTM) / market cap
    data['SP']=1/df['ps_ratio']
    # net cash flow (TTM) / market cap
    data['NCFP']=1/df['pcf_ratio']
    # operating cash flow (TTM) / market cap
    data['OCFP']=df['net_operate_cash_flow_ttm']/df['market_cap']
    # net profit (TTM) YoY growth / PE_TTM
    data['G/PE']=df['net_profit_growth_rate']/df['pe_ratio']
    # ROE_ttm
    data['roe_ttm']=df['roe_ttm']
    # ROE_YTD
    data['roe_q']=df['roe']
    # ROA_ttm
    data['roa_ttm']=df['roa_ttm']
    # ROA_YTD
    data['roa_q']=df['roa']
    # gross margin TTM
    data['grossprofitmargin_ttm']=df['gross_income_ratio']
    # gross margin YTD
    data['grossprofitmargin_q']=df['gross_profit_margin']
    # net margin after non-recurring items, YTD
    data['profitmargin_q']=df['adjusted_profit']/df['operating_revenue']
    # asset turnover TTM
    data['assetturnover_ttm']=df['total_asset_turnover_rate']
    # asset turnover YTD: revenue / total assets
    data['assetturnover_q']=df['operating_revenue']/df['total_assets']
    # operating cash flow / net profit, TTM
    data['operationcashflowratio_ttm']=df['net_operate_cash_flow_ttm']/df['net_profit_ttm']
    # operating cash flow / net profit, YTD
    data['operationcashflowratio_q']=df['net_operate_cash_flow']/df['net_profit']
    # net assets = total assets - total liabilities
    df['net_assets']=df['total_assets']-df['total_liability']
    # financial leverage: total assets / net assets
    data['financial_leverage']=df['total_assets']/df['net_assets']
    # non-current liabilities / net assets
    data['debtequityratio']=df['total_non_current_liability']/df['net_assets']
    # cash ratio = (cash + marketable securities) / current liabilities
    data['cashratio']=df['cash_to_current_liability']
    # current ratio = current assets / current liabilities
    data['currentratio']=df['current_ratio']
    # log of market cap
    data['ln_capital']=np.log(df['market_cap'])
#Quarter-end dates for the trailing four quarters (TTM aggregation)
his_date = [pd.to_datetime(date) - datetime.timedelta(90*i) for i in range(0, 4)]
tmp = pd.DataFrame()
tmp['code']=stock
for i in his_date:
    tmp_adjusted_dividend = get_fundamentals(query(indicator.code, indicator.adjusted_profit, \
                                                   cash_flow.dividend_interest_payment).
                                             filter(indicator.code.in_(stock)), date = i)
    tmp=pd.merge(tmp,tmp_adjusted_dividend,how='outer',on='code')
    tmp=tmp.rename(columns={'adjusted_profit':'adjusted_profit'+str(i.month), \
                            'dividend_interest_payment':'dividend_interest_payment'+str(i.month)})
tmp=tmp.set_index('code')
tmp_columns=tmp.columns.values.tolist()
#Sum the four quarterly columns row-wise to get TTM values
tmp_adjusted=tmp[[i for i in tmp_columns if 'adjusted_profit' in i]].sum(axis=1)
tmp_dividend=tmp[[i for i in tmp_columns if 'dividend_interest_payment' in i]].sum(axis=1)
#Net profit after non-recurring items (TTM) / market cap
data['EPcut']=tmp_adjusted/df['market_cap']
#Cash dividends over the past 12 months (by ex-dividend date) / market cap
data['DP']=tmp_dividend/df['market_cap']
#Net margin after non-recurring items, TTM
data['profitmargin_ttm']=tmp_adjusted/df['operating_revenue_ttm']
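The TTM construction above merges four quarterly snapshots and sums them row-wise. A minimal self-contained sketch of the same aggregation, using made-up tickers and values in place of `get_fundamentals` output:

```python
import pandas as pd

# Toy quarterly adjusted-profit columns for two hypothetical stocks,
# shaped like the merged `tmp` DataFrame in the notebook
quarters = pd.DataFrame({
    'adjusted_profit_q1': [10.0, 4.0],
    'adjusted_profit_q2': [12.0, 5.0],
    'adjusted_profit_q3': [11.0, 6.0],
    'adjusted_profit_q4': [13.0, 5.0],
}, index=['000001.XSHE', '000002.XSHE'])

# TTM = sum of the trailing four quarters, computed per stock (row-wise)
ttm = quarters.sum(axis=1)

# EPcut analogue: TTM adjusted profit / market cap (toy market caps)
market_cap = pd.Series([460.0, 200.0], index=quarters.index)
ep_cut = ttm / market_cap
```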
#YoY growth rates: current-period columns vs. columns suffixed 'last_year' (one year earlier)
his_date = pd.to_datetime(date) - datetime.timedelta(365)
name=['operating_revenue','net_profit','net_operate_cash_flow','roe']
temp_data=df[name]
his_temp_data = get_fundamentals(query(valuation.code, income.operating_revenue,income.net_profit,\
                                       cash_flow.net_operate_cash_flow,indicator.roe).
                                 filter(valuation.code.in_(stock)), date = his_date)
his_temp_data=his_temp_data.set_index('code')
#Rename his_temp_data columns with a 'last_year' suffix
for i in name:
    his_temp_data=his_temp_data.rename(columns={i:i+'last_year'})
temp_data =pd.concat([temp_data,his_temp_data],axis=1)
#Operating revenue (YTD) YoY growth
data['sales_g_q']=temp_data['operating_revenue']/temp_data['operating_revenuelast_year']-1
#Net profit (YTD) YoY growth
data['profit_g_q']=temp_data['net_profit']/temp_data['net_profitlast_year']-1
#Operating cash flow (YTD) YoY growth
data['ocf_g_q']=temp_data['net_operate_cash_flow']/temp_data['net_operate_cash_flowlast_year']-1
#ROE (YTD) YoY growth
data['roe_g_q']=temp_data['roe']/temp_data['roelast_year']-1
#Intercept (alpha) and BETA from regressing each stock's 60-month returns on the SSE Composite
stock_close=get_price(stock, count = 60*20+1, end_date=date, frequency='daily', fields=['close'])['close']
SZ_close=get_price('000001.XSHG', count = 60*20+1, end_date=date, frequency='daily', fields=['close'])['close']
stock_pchg=stock_close.pct_change().iloc[1:]
SZ_pchg=SZ_close.pct_change().iloc[1:]
beta=[]
stockalpha=[]
for i in stock:
    temp_beta, temp_stockalpha = stats.linregress(SZ_pchg, stock_pchg[i])[:2]
    beta.append(temp_beta)
    stockalpha.append(temp_stockalpha)
#alpha and beta are plain lists here
data['alpha']=stockalpha
data['beta']=beta
#Momentum (20 trading days per month)
data['return_1m']=stock_close.iloc[-1]/stock_close.iloc[-20]-1
data['return_3m']=stock_close.iloc[-1]/stock_close.iloc[-60]-1
data['return_6m']=stock_close.iloc[-1]/stock_close.iloc[-120]-1
data['return_12m']=stock_close.iloc[-1]/stock_close.iloc[-240]-1
#Turnover-ratio data
data_turnover_ratio=pd.DataFrame()
data_turnover_ratio['code']=stock
trade_days=list(get_trade_days(end_date=date, count=240*2))
for i in trade_days:
    q = query(valuation.code,valuation.turnover_ratio).filter(valuation.code.in_(stock))
    temp = get_fundamentals(q, i)
    data_turnover_ratio=pd.merge(data_turnover_ratio, temp,how='left',on='code')
    data_turnover_ratio=data_turnover_ratio.rename(columns={'turnover_ratio':i})
data_turnover_ratio=data_turnover_ratio.set_index('code').T
#Arithmetic mean over the last N months of (daily turnover ratio x daily return)
data['wgt_return_1m']=mean(stock_pchg.iloc[-20:]*data_turnover_ratio.iloc[-20:])
data['wgt_return_3m']=mean(stock_pchg.iloc[-60:]*data_turnover_ratio.iloc[-60:])
data['wgt_return_6m']=mean(stock_pchg.iloc[-120:]*data_turnover_ratio.iloc[-120:])
data['wgt_return_12m']=mean(stock_pchg.iloc[-240:]*data_turnover_ratio.iloc[-240:])
#Arithmetic mean over the last N months of (daily turnover ratio x exp(-x_i/N/4) x daily return)
temp_data=pd.DataFrame(index=data_turnover_ratio[-240:].index,columns=stock)
temp=[]
for i in range(240):
    if i/20<1:
        temp.append(exp(-i/1/4))
    elif i/20<3:
        temp.append(exp(-i/3/4))
    elif i/20<6:
        temp.append(exp(-i/6/4))
    elif i/20<12:
        temp.append(exp(-i/12/4))
temp.reverse()
for i in stock:
    temp_data[i]=temp
data['exp_wgt_return_1m']=mean(stock_pchg.iloc[-20:]*temp_data.iloc[-20:]*data_turnover_ratio.iloc[-20:])
data['exp_wgt_return_3m']=mean(stock_pchg.iloc[-60:]*temp_data.iloc[-60:]*data_turnover_ratio.iloc[-60:])
data['exp_wgt_return_6m']=mean(stock_pchg.iloc[-120:]*temp_data.iloc[-120:]*data_turnover_ratio.iloc[-120:])
data['exp_wgt_return_12m']=mean(stock_pchg.iloc[-240:]*temp_data.iloc[-240:]*data_turnover_ratio.iloc[-240:])
#Idiosyncratic volatility
#Residuals from a Fama-French three-factor regression
LoS=len(stock)
S=df.sort_values('market_cap')[:LoS//3].index
B=df.sort_values('market_cap')[LoS-LoS//3:].index
df['BTM']=df['total_owner_equities']/df['market_cap']
L=df.sort_values('BTM')[:LoS//3].index
H=df.sort_values('BTM')[LoS-LoS//3:].index
df_temp=stock_pchg.iloc[-240:]
#Factor returns: SMB (small minus big) and HML (high minus low book-to-market),
#as equal-weighted averages of each bucket's daily returns
SMB=df_temp[S].mean(axis=1)-df_temp[B].mean(axis=1)
HML=df_temp[H].mean(axis=1)-df_temp[L].mean(axis=1)
#CSI 300 as the market benchmark
dp=get_price('000300.XSHG',count=12*20+1,end_date=date,frequency='daily', fields=['close'])['close']
RM=dp.pct_change().iloc[1:]-0.04/252
#Collect the three factors
X=pd.DataFrame({"RM":RM,"SMB":SMB,"HML":HML})
resd=pd.DataFrame()
for i in stock:
    temp=df_temp[i]-0.04/252
    t_r=linreg(X,temp)
    resd[i]=list(temp-(t_r[0]+X.iloc[:,0]*t_r[1]+X.iloc[:,1]*t_r[2]+X.iloc[:,2]*t_r[3]))
data['std_FF3factor_1m']=resd[-1*20:].std()
data['std_FF3factor_3m']=resd[-3*20:].std()
data['std_FF3factor_6m']=resd[-6*20:].std()
data['std_FF3factor_12m']=resd[-12*20:].std()
#Volatility of daily returns
data['std_1m']=stock_pchg.iloc[-20:].std()
data['std_3m']=stock_pchg.iloc[-60:].std()
data['std_6m']=stock_pchg.iloc[-120:].std()
data['std_12m']=stock_pchg.iloc[-240:].std()
#Log of the latest close price
data['ln_price']=np.log(stock_close.iloc[-1])
#Average turnover ratio over the last N months
data['turn_1m']=mean(data_turnover_ratio.iloc[-20:])
data['turn_3m']=mean(data_turnover_ratio.iloc[-60:])
data['turn_6m']=mean(data_turnover_ratio.iloc[-120:])
data['turn_12m']=mean(data_turnover_ratio.iloc[-240:])
#Turnover bias: recent average turnover relative to the 2-year average
data['bias_turn_1m']=mean(data_turnover_ratio.iloc[-20:])/mean(data_turnover_ratio)-1
data['bias_turn_3m']=mean(data_turnover_ratio.iloc[-60:])/mean(data_turnover_ratio)-1
data['bias_turn_6m']=mean(data_turnover_ratio.iloc[-120:])/mean(data_turnover_ratio)-1
data['bias_turn_12m']=mean(data_turnover_ratio.iloc[-240:])/mean(data_turnover_ratio)-1
#Technical indicators
data['PSY']=pd.Series(PSY(stock, date, timeperiod=20))
data['RSI']=pd.Series(RSI(stock, date, N1=20))
data['BIAS']=pd.Series(BIAS(stock,date, N1=20)[0])
dif,dea,macd=MACD(stock, date, SHORT = 10, LONG = 30, MID = 15)
data['DIF']=pd.Series(dif)
data['DEA']=pd.Series(dea)
data['MACD']=pd.Series(macd)
return data
peroid = 'M'
start_date = '2010-01-01'
end_date = '2018-01-01'
industry_old_code = ['801010','801020','801030','801040','801050','801080','801110','801120','801130','801140','801150',\
'801160','801170','801180','801200','801210','801230']
industry_new_code = ['801010','801020','801030','801040','801050','801080','801110','801120','801130','801140','801150',\
'801160','801170','801180','801200','801210','801230','801710','801720','801730','801740','801750',\
'801760','801770','801780','801790','801880','801890']
dateList = get_period_date(peroid,start_date, end_date)
factor_origl_data = {}
factor_solve_data = {}
for date in dateList:
    #Choose the industry classification in force on that date
    if datetime.datetime.strptime(date,"%Y-%m-%d").date()<datetime.date(2014,2,21):
        industry_code=industry_old_code
    else:
        industry_code=industry_new_code
    stockList=get_stock('HS300',date)
    factor_origl_data[date] = get_factor_data(stockList,date)
    factor_solve_data[date] = data_preprocessing(factor_origl_data[date],stockList,industry_code,date)
content = pickle.dumps(factor_solve_data)
write_file('factor_solve_data.pkl', content, append=False)
After preprocessing, the feature data are saved as a pkl file so that later steps can work with them conveniently. Reading the file back gives the following format: the data type is a dict whose keys are dates and whose values are the factor data for each period, stored as a DataFrame with the stock list as index and the factors as columns.
Note: the pkl file can be uploaded directly to the research environment and read from there.
import pickle
#Load the saved data object from file with the pickle module
with open("factor_solve_data.pkl",'rb') as f:
    #pkl_file_read = read_file("factor_solve_data.pkl")
    # factor_data = pickle.loads(StringIO(pkl_file_read))
    factor_data = pickle.load(f,encoding='iso-8859-1')
factor_data['2009-12-31'].head()
EP | BP | SP | NCFP | OCFP | G/PE | roe_ttm | roe_q | roa_ttm | roa_q | ... | bias_turn_1m | bias_turn_3m | bias_turn_6m | bias_turn_12m | PSY | RSI | BIAS | DIF | DEA | MACD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
000001.XSHE | 0.028950 | 1.131116 | 0.197401 | -1.261756 | -2.181723 | -0.684842 | 0.025820 | 1.701937 | -0.272718 | -0.194391 | ... | -0.698740 | -0.666942 | -0.357962 | 0.349568 | 1.148705 | 1.291735 | -0.202865 | -0.362807 | -0.079355 | -0.788347 |
000002.XSHE | 0.246983 | -0.479406 | 0.331701 | 0.110568 | 0.679921 | -0.513610 | -0.181344 | -1.024590 | -0.225544 | -0.825526 | ... | 0.752829 | 0.393239 | 0.429813 | 0.190625 | -0.942532 | -0.768319 | 0.125205 | 0.148512 | 0.022209 | 0.357588 |
000009.XSHE | -0.262468 | -1.249222 | -0.017061 | 0.214158 | 0.451111 | -0.044448 | -0.110604 | -0.697583 | -0.042721 | -0.269572 | ... | -0.395701 | -0.566032 | -0.833514 | -0.112768 | -0.802378 | -0.713272 | -0.262686 | 0.283520 | 0.059975 | 0.630850 |
000012.XSHE | 0.467468 | 0.901849 | 0.214317 | -0.304663 | 0.498606 | -0.487198 | 0.518937 | 1.500490 | 0.641853 | 1.789205 | ... | -0.586362 | 0.393510 | 0.987122 | 1.588373 | 1.631216 | 1.555538 | 0.422617 | 0.208753 | 0.557065 | -0.721810 |
000021.XSHE | 0.574750 | 1.552436 | 2.686058 | -0.022736 | 0.135909 | -0.522318 | 0.279071 | 0.115994 | 0.730127 | 0.608440 | ... | -0.660390 | -1.038726 | -1.097826 | 1.312336 | 0.818352 | 1.350547 | 0.126273 | -0.598198 | -0.618886 | -0.184690 |
5 rows × 66 columns
For the principles behind the SVM model, see the JoinQuant tutorial 量化课堂-SVM原理入门; they are not repeated here.
This report mainly uses the classification form of the support vector machine, so the return data must be converted into labels: at each month-end cross-section, the stocks whose next-month returns rank in the top 30% are taken as positive examples (y = 1) and those in the bottom 30% as negative examples (y = −1).
To test the model, the data set is split: the samples from the first 4 years are merged into the training set, and the samples from the last 4 years serve as the test set.
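A minimal sketch of that labeling rule on synthetic returns (the notebook's real code applies it per month-end with `pchg` computed from prices; `pd.concat` is used here in place of the older `DataFrame.append`):

```python
import numpy as np
import pandas as pd

# Toy next-month returns for ten hypothetical stocks
pchg = pd.Series(np.arange(10) / 100.0,
                 index=['s%d' % i for i in range(10)])

df = pd.DataFrame({'pchg': pchg}).sort_values('pchg')
n = len(df)
k = int(n / 10 * 3)                               # 30% cutoff, as in the notebook
df = pd.concat([df.iloc[:k], df.iloc[n - k:]])    # keep bottom and top 30%, drop the middle
df['label'] = np.where(df['pchg'] > df['pchg'].mean(), 1, -1)
```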
peroid='M'
start_date='2010-01-01'
end_date='2018-01-01'
dateList=get_period_date(peroid,start_date, end_date)
dateList
['2009-12-31', '2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30', '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31', '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31', '2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30', '2011-05-31', '2011-06-30', '2011-07-31', '2011-08-31', '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-31', '2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30', '2012-05-31', '2012-06-30', '2012-07-31', '2012-08-31', '2012-09-30', '2012-10-31', '2012-11-30', '2012-12-31', '2013-01-31', '2013-02-28', '2013-03-31', '2013-04-30', '2013-05-31', '2013-06-30', '2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31', '2013-11-30', '2013-12-31', '2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30', '2014-07-31', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30', '2014-12-31', '2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30', '2015-05-31', '2015-06-30', '2015-07-31', '2015-08-31', '2015-09-30', '2015-10-31', '2015-11-30', '2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31']
dateList[4*12:-1]
['2013-12-31', '2014-01-31', '2014-02-28', '2014-03-31', '2014-04-30', '2014-05-31', '2014-06-30', '2014-07-31', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30', '2014-12-31', '2015-01-31', '2015-02-28', '2015-03-31', '2015-04-30', '2015-05-31', '2015-06-30', '2015-07-31', '2015-08-31', '2015-09-30', '2015-10-31', '2015-11-30', '2015-12-31', '2016-01-31', '2016-02-29', '2016-03-31', '2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30', '2017-10-31', '2017-11-30']
peroid='M'
start_date='2010-01-01'
end_date='2018-01-01'
industry_old_code=['801010','801020','801030','801040','801050','801080','801110','801120','801130','801140','801150',\
'801160','801170','801180','801200','801210','801230']
industry_new_code=['801010','801020','801030','801040','801050','801080','801110','801120','801130','801140','801150',\
'801160','801170','801180','801200','801210','801230','801710','801720','801730','801740','801750',\
'801760','801770','801780','801790','801880','801890']
dateList=get_period_date(peroid,start_date, end_date)
# Training-set data
train_data=pd.DataFrame()
for date in dateList[:4*12]:
    traindf=factor_data[date]
    stockList=list(traindf.index)
    #Next-period return data
    data_close=get_price(stockList,date,dateList[dateList.index(date)+1],'1d','close')['close']
    traindf['pchg']=data_close.iloc[-1]/data_close.iloc[0]-1
    #Drop missing values
    traindf=traindf.dropna()
    #Keep the top and bottom 30% by return, dropping the noisy middle
    traindf=traindf.sort_values(by=['pchg'])
    traindf=traindf.iloc[:int(len(traindf['pchg'])/10*3),:].append(traindf.iloc[int(len(traindf['pchg'])/10*7):,:])
    traindf['label']=list(traindf['pchg'].apply(lambda x:1 if x>np.mean(list(traindf['pchg'])) else -1))
    if train_data.empty:
        train_data=traindf
    else:
        train_data=train_data.append(traindf)
# Test-set data
test_data={}
for date in dateList[4*12:-1]:
    testdf=factor_data[date]
    stockList=list(testdf.index)
    #Next-period return data
    data_close=get_price(stockList,date,dateList[dateList.index(date)+1],'1d','close')['close']
    testdf['pchg']=data_close.iloc[-1]/data_close.iloc[0]-1
    #Drop missing values
    testdf=testdf.dropna()
    #Keep the top and bottom 30% by return, dropping the noisy middle
    testdf=testdf.sort_values(by=['pchg'])
    testdf=testdf.iloc[:int(len(testdf['pchg'])/10*3),:].append(testdf.iloc[int(len(testdf['pchg'])/10*7):,:])
    testdf['label']=list(testdf['pchg'].apply(lambda x:1 if x>np.mean(list(testdf['pchg'])) else -1))
    test_data[date]=testdf
train_data
EP | BP | SP | NCFP | OCFP | G/PE | roe_ttm | roe_q | roa_ttm | roa_q | ... | bias_turn_6m | bias_turn_12m | PSY | RSI | BIAS | DIF | DEA | MACD | pchg | label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
000898.XSHE | -1.339908 | -0.823496 | -1.553743 | -0.336050 | -0.851616 | 0.901782 | -0.771852 | -0.184476 | -0.898484 | 0.285664 | ... | -0.633324 | -1.247146 | -0.200697 | -0.492782 | 0.901998 | 1.187574 | 1.229229 | 0.335472 | -0.261379 | -1 |
600348.XSHG | 0.253238 | -0.709708 | 0.314576 | 0.811228 | 0.250254 | 2.415528 | 1.252358 | 1.103479 | 0.755359 | 0.671009 | ... | -1.384773 | -2.431529 | -0.407887 | -0.139634 | 1.312277 | 1.146901 | 0.426139 | 2.016046 | -0.237435 | -1 |
600104.XSHG | -0.204334 | 0.840664 | 1.827264 | 1.702015 | 0.651325 | -0.555469 | -0.157791 | 1.409792 | -0.058212 | 1.288909 | ... | -0.801224 | -0.777448 | 1.023477 | 1.353981 | -0.148693 | 1.551631 | 1.985120 | -0.419854 | -0.229304 | -1 |
601699.XSHG | 0.604046 | -0.584376 | -0.158244 | -0.084201 | 0.236129 | 1.474444 | 1.198756 | 0.311401 | 1.020940 | 0.241325 | ... | -0.735090 | 0.417693 | -0.632352 | -0.401398 | -0.448861 | 0.466879 | 0.787098 | -0.596676 | -0.220944 | -1 |
000983.XSHE | -0.609660 | -0.945398 | -0.696665 | -0.149154 | -0.316074 | -0.460934 | -0.025241 | 0.189220 | 0.104082 | 0.251961 | ... | -0.281636 | 0.369459 | -0.473663 | -0.442339 | -0.194350 | -0.497980 | -0.042027 | -1.263650 | -0.219931 | -1 |
000718.XSHE | -0.563044 | -1.551998 | -0.612245 | -0.651606 | -0.641480 | -0.756999 | 0.474626 | 2.170109 | 0.320176 | 1.700314 | ... | -0.880877 | -0.290465 | -0.609921 | -0.725955 | -1.224751 | -0.567860 | -0.102092 | -1.248264 | -0.218586 | -1 |
600019.XSHG | -0.280231 | -0.208837 | -0.612423 | -0.914817 | 0.927408 | -0.772553 | -0.188920 | -0.331904 | -0.191782 | 0.030894 | ... | -0.852683 | -2.015797 | -0.478094 | -0.260537 | 1.681595 | 1.102547 | 1.020271 | 0.583499 | -0.216066 | -1 |
600307.XSHG | 0.246073 | -1.547046 | -0.076332 | -0.064550 | 0.121182 | 0.049713 | -0.004440 | -0.212457 | -0.049140 | -0.108843 | ... | -1.608684 | -1.235297 | -0.296960 | -0.732394 | -0.860267 | -0.680652 | -0.553390 | -0.550320 | -0.215686 | -1 |
000800.XSHE | 0.645978 | 0.607839 | 1.395095 | 0.411715 | 0.600378 | 0.346006 | 1.096834 | 1.134758 | 1.199610 | 1.662074 | ... | -1.339557 | -0.307526 | 1.607137 | 1.595132 | 0.375372 | 2.855703 | 2.827991 | -1.548921 | -0.205201 | -1 |
600582.XSHG | 1.220935 | 0.161788 | 0.330084 | -0.074172 | 0.013625 | 0.475748 | 1.955091 | 1.553555 | 2.021309 | 2.160534 | ... | 0.769649 | 1.451860 | 1.009850 | 1.643493 | 0.645583 | 1.386249 | 1.525338 | 0.189815 | -0.205186 | -1 |
601166.XSHG | 1.628138 | 1.097977 | -0.069126 | -2.607464 | 2.469360 | -0.238659 | 1.363768 | 1.259437 | -0.165464 | -0.228330 | ... | -0.410905 | -0.162214 | 1.398375 | 1.219716 | 0.515697 | -0.040880 | -0.079556 | 0.058303 | -0.202107 | -1 |
600741.XSHG | 1.597167 | 2.126435 | 1.309559 | 2.473835 | 0.868861 | 2.448113 | 0.558987 | 1.170686 | 0.839599 | 2.343310 | ... | 0.395003 | 0.780068 | 1.830805 | 1.656058 | 1.410046 | 1.219544 | 1.117727 | 0.673861 | -0.201711 | -1 |
000932.XSHE | -0.432491 | 0.962227 | 0.044747 | 0.702973 | 1.556231 | 0.351090 | -0.103027 | -0.540683 | -0.066951 | -0.641312 | ... | 0.209975 | -0.186155 | -0.905888 | -0.577609 | -0.299507 | 0.005007 | -0.004132 | 0.013794 | -0.198175 | -1 |
000825.XSHE | -0.560819 | -1.161042 | 0.018134 | -0.397080 | -0.630818 | 0.568659 | -0.470182 | -0.139794 | -0.369891 | -0.224836 | ... | 0.438879 | 0.197424 | -0.534888 | -0.585478 | -0.141694 | -0.032672 | 0.156913 | -0.451530 | -0.198091 | -1 |
000937.XSHE | -0.357993 | 0.346295 | -0.115037 | 0.358325 | 0.117128 | -1.037641 | -0.622369 | -0.831234 | -0.905939 | -1.125563 | ... | -0.277565 | -0.145597 | -0.804577 | -0.340420 | -0.709074 | 0.287962 | 0.765937 | -1.020075 | -0.197540 | -1 |
600395.XSHG | -0.482925 | -0.461043 | -0.642086 | 0.048169 | -0.162035 | 0.495222 | -0.396606 | 0.563451 | 0.029798 | 1.538482 | ... | -0.223589 | -0.998191 | -0.389464 | -0.410738 | -0.433025 | -0.383082 | -0.404235 | -0.137799 | -0.197498 | -1 |
601169.XSHG | 1.259979 | 1.324686 | -0.157433 | -1.067213 | 1.206745 | -0.017963 | 0.905450 | 0.696879 | -0.108399 | -0.173770 | ... | -0.885053 | -1.486276 | 1.504733 | 1.409171 | 0.710609 | 0.232750 | 0.140371 | 0.283704 | -0.194399 | -1 |
000709.XSHE | 0.157932 | 0.954871 | 0.040987 | -0.831024 | 0.092326 | -1.025100 | 0.021088 | -0.595044 | -0.015583 | -0.577060 | ... | -0.189408 | 0.198160 | -0.704704 | -0.817105 | -1.586145 | 0.198305 | 0.319341 | -0.209632 | -0.191523 | -1 |
601001.XSHG | 0.970052 | -0.123389 | -0.341525 | -0.969352 | -0.225479 | 0.331777 | 0.276158 | -0.287747 | 1.065197 | 0.290460 | ... | -0.971557 | -0.568488 | -0.396240 | -0.459546 | -0.728115 | -0.487051 | -0.080269 | -1.146645 | -0.191341 | -1 |
000060.XSHE | -0.520340 | -0.746671 | -0.513973 | -0.269318 | -0.635668 | -0.061345 | -0.684667 | -0.468283 | -0.685499 | -0.437695 | ... | -0.204781 | -0.116877 | -0.662184 | -0.440292 | 1.365643 | 0.550537 | -0.156007 | 1.770859 | -0.190367 | -1 |
600028.XSHG | -0.445205 | -1.906611 | 0.606283 | -0.929019 | -0.693458 | 1.847548 | -0.538408 | -0.670272 | 0.125637 | 0.015970 | ... | -0.640204 | -1.910889 | -1.910722 | -1.908922 | 0.560512 | 1.024769 | 0.744422 | 0.978541 | -0.189959 | -1 |
600997.XSHG | -1.014771 | -0.482487 | -0.130176 | 0.351810 | -0.235677 | -0.779942 | -1.231717 | -0.289891 | -1.650070 | -0.873043 | ... | 0.372966 | 0.352345 | -0.180321 | -0.426032 | -0.704594 | -0.632989 | -0.159991 | -1.350788 | -0.189551 | -1 |
600325.XSHG | 0.323981 | -0.204300 | -0.103861 | 1.415510 | 1.612112 | 0.433534 | -0.058768 | -0.797225 | -0.243463 | -0.748118 | ... | 0.013669 | -0.426368 | -0.599389 | -0.666574 | -1.248069 | -0.299024 | 0.287645 | -1.422291 | -0.187717 | -1 |
000630.XSHE | -0.033863 | -0.349515 | 1.671445 | 0.158159 | 0.445587 | -0.975289 | -0.074184 | -0.462556 | -0.202445 | -0.742567 | ... | 0.829676 | 0.960215 | -0.454961 | -0.421783 | 0.684300 | 0.807681 | 0.639520 | 0.645877 | -0.187708 | -1 |
000717.XSHE | -1.413005 | -0.240964 | 0.052592 | -0.789472 | 1.573190 | 0.922596 | -1.764376 | 0.181048 | -2.029510 | -0.000950 | ... | 1.825339 | 0.469849 | -0.893000 | -0.681956 | -0.817644 | -0.386848 | -0.290421 | -0.370801 | -0.186846 | -1 |
600690.XSHG | -0.371019 | -0.971475 | -0.024128 | 0.813279 | 0.868150 | 0.005171 | -0.462605 | -0.519816 | -0.160385 | -0.128200 | ... | -0.669995 | 0.335240 | -0.450148 | -0.439052 | -0.634005 | 0.329262 | 0.584221 | -0.464916 | -0.186667 | -1 |
600585.XSHG | 0.824933 | 1.472747 | 0.480301 | -0.610967 | 1.244304 | -0.231784 | 0.531591 | 0.566293 | 1.011384 | 1.132269 | ... | -1.270613 | -1.195534 | 1.754538 | 1.533549 | 1.038025 | 1.287030 | 1.215400 | 0.629031 | -0.186502 | -1 |
000878.XSHE | -2.191582 | -0.989264 | -0.170909 | -1.137625 | 0.528887 | 1.906376 | -2.248274 | 0.536849 | -2.456989 | -0.594343 | ... | -0.961349 | -1.031000 | -0.882532 | -0.601176 | 0.314222 | -2.081045 | -2.358223 | 0.048406 | -0.184907 | -1 |
000933.XSHE | -1.071980 | -0.686538 | 0.060221 | 1.306697 | 0.665096 | -0.907191 | -1.236326 | 0.072424 | -1.882922 | -1.295668 | ... | -0.335325 | 0.277423 | -0.175778 | -0.057790 | 0.853452 | 1.960233 | 2.006145 | 0.580587 | -0.184879 | -1 |
002155.XSHE | -0.208642 | -0.757256 | -0.922840 | -0.468557 | -0.656105 | -0.679338 | -0.225711 | -0.508898 | -0.249386 | -0.060348 | ... | 1.204158 | 0.945165 | -0.850366 | -0.716255 | -0.634589 | -0.997182 | -1.082108 | -0.212759 | -0.184638 | -1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
002603.XSHE | -0.740411 | -0.630370 | -0.636225 | -0.524613 | -0.313795 | -0.629744 | -1.450573 | -1.227070 | -1.352312 | -1.097113 | ... | -0.286062 | 0.668399 | -0.154614 | -0.476288 | 0.439482 | 0.553265 | 0.158740 | 0.413281 | 0.069565 | 1 |
600703.XSHG | -0.567762 | -1.093158 | -0.687669 | -1.003212 | -0.539549 | -0.534213 | -0.067931 | 0.109176 | 0.019778 | 0.160198 | ... | -0.663827 | -0.733325 | 0.052621 | -0.321021 | 1.310997 | 0.777162 | 0.941886 | -0.138748 | 0.070718 | 1 |
002431.XSHE | 0.289960 | 0.106463 | 0.120107 | 0.047904 | -0.437662 | 0.208937 | 1.231325 | 0.800984 | 1.118577 | 0.849554 | ... | 0.026242 | 0.973704 | 1.462412 | 1.344508 | 0.455285 | 0.939335 | 0.740287 | 0.139487 | 0.073976 | 1 |
000651.XSHE | -0.419611 | -1.230606 | -1.425879 | 1.118061 | 0.307694 | 0.027272 | 0.762115 | 1.286906 | -0.008618 | 0.675532 | ... | -0.543453 | -0.098476 | -0.489475 | -0.727966 | -0.642664 | 0.796666 | 0.845183 | -0.330869 | 0.078773 | 1 |
002038.XSHE | -0.538867 | -0.949778 | -0.732675 | -0.053173 | -0.254428 | -0.252169 | 0.452339 | 0.548469 | 0.980556 | 1.095278 | ... | -0.227217 | -0.625242 | -0.949553 | -0.873113 | -1.117598 | -1.763377 | -1.502315 | -0.763865 | 0.081422 | 1 |
600859.XSHG | 0.174558 | 0.249184 | 0.709174 | -0.005873 | 0.631249 | -0.715380 | -0.417121 | -0.272449 | -0.387305 | -0.232204 | ... | 1.466069 | 0.859221 | -0.846247 | -0.867853 | -1.397736 | -1.568108 | -0.739348 | -1.260990 | 0.087922 | 1 |
002353.XSHE | -0.134379 | -0.418818 | -0.297253 | 0.093242 | -0.104265 | 0.393295 | 2.301361 | 2.666142 | 2.910887 | 3.185530 | ... | -0.194131 | -0.764993 | 1.775606 | 1.182167 | -0.520576 | 2.061677 | 2.245906 | -2.157168 | 0.090355 | 1 |
002344.XSHE | -0.412449 | -1.305909 | -2.005687 | -0.343428 | -0.194461 | 0.466977 | 1.539826 | 1.052538 | 1.963830 | 1.458686 | ... | -2.220716 | -2.164643 | -0.483827 | -0.769876 | -1.025165 | -1.264064 | -0.517739 | -1.160696 | 0.092265 | 1 |
600062.XSHG | 0.015265 | -0.145719 | -0.078995 | 0.190691 | 0.007572 | -0.336422 | -0.599105 | -0.476129 | -0.380900 | -0.202776 | ... | -0.740355 | -1.241207 | -0.731070 | -0.451970 | -0.189394 | 1.987875 | 1.345734 | 0.637467 | 0.092443 | 1 |
002051.XSHE | 0.662932 | 0.417339 | 0.448513 | 0.274389 | -0.021163 | -0.015847 | 1.196050 | 1.579642 | 0.764441 | 0.937565 | ... | 0.498952 | 0.817594 | 0.866168 | 0.792453 | 0.174041 | -2.305717 | -2.517494 | 2.258645 | 0.092629 | 1 |
600873.XSHG | -0.907022 | -0.120058 | -0.241906 | 0.248707 | 0.020900 | -0.438278 | -1.939781 | -1.726519 | -2.222759 | -2.098000 | ... | 0.128432 | -0.197061 | -0.650798 | -0.228685 | 1.694538 | 0.739207 | 0.847563 | -0.694428 | 0.092784 | 1 |
600875.XSHG | 1.280535 | 1.308062 | 1.607978 | 1.432229 | 0.851162 | -0.598115 | 0.971294 | 1.137612 | 0.377022 | 0.443275 | ... | -0.366564 | 0.113943 | 1.234419 | 1.316257 | 0.530772 | 0.470040 | 0.061041 | 0.626221 | 0.093190 | 1 |
600010.XSHG | -0.685397 | -2.128134 | -1.757104 | 0.036846 | 0.138641 | -0.304107 | -0.488046 | -0.385950 | -0.447053 | -0.392533 | ... | -0.979394 | -1.056849 | -0.874804 | -1.054628 | -1.635823 | -0.436518 | -0.085497 | -0.675888 | 0.100000 | 1 |
600383.XSHG | 0.109267 | 0.007560 | 0.325999 | -0.366492 | 0.448903 | -1.953387 | -0.735331 | -0.643937 | -0.349126 | -0.406657 | ... | -0.630525 | -0.118691 | -0.275433 | -0.466084 | 0.072946 | 0.390424 | 0.429765 | -0.132612 | 0.107570 | 1 |
002385.XSHE | -0.058794 | -0.763792 | -0.443226 | -0.176786 | -0.195929 | -0.524683 | 0.998812 | 0.083053 | 1.440451 | 0.321642 | ... | -1.108461 | -0.984246 | -0.688405 | -0.657357 | -0.260509 | -0.833264 | -0.617463 | -0.535587 | 0.115570 | 1 |
600085.XSHG | -0.320429 | -0.760672 | -0.417198 | 1.383557 | -0.102754 | -0.376286 | -0.652079 | -0.779401 | -0.573246 | -0.724716 | ... | -0.589341 | -0.838624 | -0.761572 | -0.851642 | -0.230226 | -1.766027 | -1.504032 | 1.161345 | 0.116658 | 1 |
600600.XSHG | -0.872722 | -0.753274 | -0.264229 | 0.472731 | 0.020361 | 0.039284 | -0.947307 | -0.045656 | -0.991207 | 0.039137 | ... | -0.405104 | -0.834415 | -1.140018 | -0.677486 | -0.465365 | 1.531872 | 2.004210 | -1.557847 | 0.118047 | 1 |
000876.XSHE | 1.100259 | 0.190652 | 1.993477 | 0.171924 | 0.409285 | -1.854602 | 0.836407 | 0.677926 | 0.931414 | 1.078163 | ... | -0.985522 | -0.052069 | -0.881026 | -0.637804 | -0.412150 | -0.773408 | -0.659267 | -0.348566 | 0.123862 | 1 |
600196.XSHG | -0.093481 | -0.313940 | -0.542384 | 0.061656 | -0.230569 | 0.249560 | -0.721332 | -1.055811 | -0.780183 | -1.146487 | ... | 2.026238 | 0.940875 | -0.987536 | -0.673922 | -1.545417 | 2.572600 | 3.268202 | -3.334972 | 0.126536 | 1 |
000883.XSHE | -0.942596 | -0.042609 | -0.637517 | -2.401748 | -0.383065 | 0.314144 | -0.859761 | -1.115513 | -0.591986 | -0.654318 | ... | -1.269979 | -1.566752 | -0.087526 | -0.596491 | -0.195732 | 0.007835 | 0.077862 | -0.204332 | 0.126923 | 1 |
002422.XSHE | 0.011247 | -0.038333 | -0.422562 | 0.596048 | 0.010152 | -0.250485 | -0.770060 | -0.632135 | -1.059059 | -0.842909 | ... | 0.289315 | -0.112635 | -0.553314 | -1.049855 | -0.954339 | -1.280748 | -1.501790 | 1.017469 | 0.131336 | 1 |
002007.XSHE | -0.503363 | -0.720398 | -0.700548 | 0.315512 | -0.017371 | -0.259966 | -0.675518 | -0.299809 | 0.179116 | 0.630590 | ... | -0.020485 | 0.177289 | 0.048873 | -0.775451 | -1.004383 | -0.679244 | -0.138067 | -1.178437 | 0.138740 | 1 |
600150.XSHG | -0.799910 | 0.893093 | 0.279119 | -2.269365 | -1.316620 | 0.838701 | -0.812839 | -0.508223 | -0.675602 | -0.592327 | ... | -0.028588 | 0.079554 | 1.810216 | 1.850153 | 2.880992 | 2.072110 | 2.252665 | 2.340815 | 0.139073 | 1 |
000963.XSHE | -0.245027 | -0.856743 | 0.319901 | -0.022001 | -0.393461 | -0.267133 | 0.706364 | 0.331815 | -0.468074 | -0.424644 | ... | 0.105847 | 0.275388 | -0.939725 | -0.918321 | -1.507453 | -0.958342 | -0.014469 | -1.909879 | 0.140575 | 1 |
000596.XSHE | -0.250955 | -0.343144 | -0.271222 | -1.967373 | -0.041948 | -0.446417 | -0.538039 | -0.932088 | 0.065171 | -0.770943 | ... | 1.522942 | 1.295109 | -0.436997 | -0.467456 | -0.400809 | 2.300280 | 2.827565 | -0.220746 | 0.141176 | 1 |
600276.XSHG | -0.603900 | -0.963671 | -0.675125 | 0.108870 | -0.160117 | -0.417082 | 0.005417 | -0.114954 | 0.953771 | 1.066472 | ... | 0.505386 | 0.631655 | -0.806648 | -0.555916 | 0.198815 | 1.703565 | 1.044202 | 0.710243 | 0.152260 | 1 |
000729.XSHE | -0.728202 | 0.258428 | 0.033015 | -0.000422 | 0.122282 | -0.461471 | -1.976899 | -0.823760 | -1.871123 | -0.509208 | ... | -0.733784 | -0.453398 | -0.852614 | -0.339944 | 0.354340 | 0.933502 | 1.102975 | -0.848701 | 0.155096 | 1 |
000581.XSHE | 0.261958 | 0.277667 | -0.129266 | -0.088535 | 0.091233 | -0.001918 | 0.826967 | 0.670129 | 1.761931 | 1.648232 | ... | 1.480982 | 1.478227 | 1.618169 | 1.387193 | 0.341498 | 2.073535 | 2.253588 | -0.277483 | 0.166022 | 1 |
600588.XSHG | 0.156879 | 0.007058 | 0.047764 | 0.449100 | 0.387652 | -0.360739 | 1.136623 | -1.186450 | 1.219026 | -1.192913 | ... | 1.531776 | 1.722551 | 1.849656 | 1.211272 | 0.872691 | -0.013752 | -0.490483 | 0.848518 | 0.172241 | 1 |
600690.XSHG | -0.102724 | -1.197982 | -0.569657 | 0.112731 | -0.419247 | -0.785047 | 0.628862 | 1.261226 | 0.307113 | 0.823273 | ... | -0.832264 | -0.753233 | -0.580476 | -0.606552 | 0.095736 | 0.513175 | 0.567618 | -0.274850 | 0.173797 | 1 |
8618 rows × 68 columns
The first 4 years of data serve as the validation set for cross-validation, which is used to set the parameters. We use K-fold cross-validation, splitting the data into K groups (K = 10), as follows:
(1) randomly hold out 1 group as the validation set (10%), with the remaining K−1 groups as the training set (90%);
(2) fit the model on the training set, then run the validation set through it to obtain predicted labels;
(3) compare predicted and actual labels to compute the model's evaluation metrics (accuracy, AUC, etc.);
(4) repeat steps 1-3 a total of n times and average the metrics as the final evaluation.
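The splitting in steps 1-3 can be sketched in pure NumPy, independent of the JoinQuant data (`kfold_indices` is an illustrative helper, not part of the notebook):

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, k)

folds = kfold_indices(100, k=10)
# One CV round: the first fold is the validation set, the rest form the training set
val = folds[0]
train_idx = np.concatenate(folds[1:])
```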
First we determine the kernel. The kernel parameters are set as below, so that the effect of different kernels on the model can be compared.
Kernel | Parameter settings |
---|---|
Linear | C = 1e-4 |
3rd-order polynomial | C = 0.003, gamma = 0.03 |
7th-order polynomial | C = 0.03, gamma = 0.01 |
Gaussian (RBF) | C = 1, gamma = 3e-5 |
Since the Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2) can map data of any dimension into an infinite-dimensional space, it is the most useful kernel in practice. Next come the model parameters: the penalty coefficient C and the gamma value are the two most important SVM parameters. We want to sweep C and gamma jointly to find the global optimum, and the most common method for parameter search is grid search. Below we demonstrate the grid-search procedure on the SVM.
We take C = (0.01, 0.03, 0.1, 0.3, 1, 3, 10) and gamma = (1e-4, 3e-4, 1e-3, 3e-3, 0.01, 0.03, 0.1, 0.3, 1).
Each (C, gamma) pair is tested and the cross-validation AUC recorded. (About 6 h.)
np.array(train)
array([[-1.33990756, -0.82349568, -1.55374297, ..., 1.18757404, 1.22922898, 0.33547193], [ 0.25323793, -0.70970837, 0.31457629, ..., 1.1469006 , 0.42613943, 2.01604604], [-0.2043338 , 0.84066448, 1.8272637 , ..., 1.55163125, 1.98512026, -0.41985356], ..., [ 0.26195801, 0.27766662, -0.12926636, ..., 2.073535 , 2.2535879 , -0.27748326], [ 0.15687879, 0.00705828, 0.04776438, ..., -0.01375164, -0.49048256, 0.84851754], [-0.10272446, -1.19798227, -0.56965663, ..., 0.51317528, 0.56761772, -0.27485043]])
np.array(target)
array([-1, -1, -1, ..., 1, 1, 1])
regressor.fit(np.array(train),np.array(target))
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
target=train_data['label']
train=train_data.copy()
del train['pchg']
del train['label']
regressor = svm.SVC()
# SVM parameter grid
# parameters = {'kernel':['rbf'],'C':[0.01,0.03,0.1,0.3,1,3,10],\
#               'gamma':[1e-4,3e-4,1e-3,3e-3,0.01,0.03,0.1,0.3,1]}
parameters = {'kernel':['rbf'],'C':[1,3,10],\
              'gamma':[0.01,0.03]}
# Build the grid search; scoring: evaluation metric, cv: N-fold cross-validation
clf = GridSearchCV(regressor,parameters,scoring='roc_auc',cv=10)
clf.fit(np.array(train),np.array(target))
# Print the cross-validation result statistics
print(clf.cv_results_)
# Print each model's result
print(clf.grid_scores_)
# Print the best score
print(clf.best_score_)
# Print the best parameters
print(clf.best_params_)
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-127-34c731808881> in <module>() 14 15 # 输出交叉验证的结果统计列表 ---> 16 print(clf.cv_results_) 17 # 输出每个模型的结果 18 print(clf.grid_scores_) AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'
'''
Method 1 (removed in scikit-learn 0.20):
grid_search.grid_scores_
Method 2 (the 0.20+ way):
means = grid_search.cv_results_['mean_test_score']
params = grid_search.cv_results_['params']
'''
import sklearn
print(sklearn.__version__)
0.18.1
# Print each model's result
print(clf.grid_scores_)
# Print the best score
print(clf.best_score_)
# Print the best parameters
print(clf.best_params_)
[mean: 0.54892, std: 0.03498, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.01}, mean: 0.55322, std: 0.03532, params: {'kernel': 'rbf', 'C': 1, 'gamma': 0.03}, mean: 0.55224, std: 0.03255, params: {'kernel': 'rbf', 'C': 3, 'gamma': 0.01}, mean: 0.54560, std: 0.03482, params: {'kernel': 'rbf', 'C': 3, 'gamma': 0.03}, mean: 0.55037, std: 0.03197, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.01}, mean: 0.54393, std: 0.03503, params: {'kernel': 'rbf', 'C': 10, 'gamma': 0.03}] 0.5532239884430333 {'kernel': 'rbf', 'C': 1, 'gamma': 0.03}
Based on these statistics, we choose C = 1 and gamma = 0.03 (the best parameters above), train on the first 4 years of data, and cross-validate in sample; the predictions are shown below. The results indicate that the Gaussian-kernel SVM reaches in-sample training and cross-validation accuracies of 81.9% and 56.2%, with AUCs of 0.819 and 0.561 respectively. (About 5 min.)
# 迭代次数
m=10
# 获取特征及标签
target=train_data['label']
train=train_data.copy()
del train['pchg']
del train['label']
train.head()
train_score=[]
test_score=[]
train_auc=[]
test_auc=[]
# 获取模型
clf = svm.SVC(C=1,gamma=0.03,kernel='rbf')
for i in range(m):
# 随机获取10%的数据作为交叉验证集
X_train,X_test, y_train, y_test =train_test_split(np.array(train),\
np.array(target),test_size=0.1)
# 模型训练
clf.fit(X_train,y_train)
# 模型预测
train_predict=clf.predict(X_train)
test_predict=clf.predict(X_test)
# 样本内训练集正确率
train_score.append(clf.score(X_train, y_train))
# 交叉验证集正确率
test_score.append(clf.score(X_test, y_test))
# 样本内训练集auc值
train_auc.append(metrics.roc_auc_score(y_train,train_predict))
# 交叉验证集auc值
test_auc.append(metrics.roc_auc_score(y_test,test_predict))
print('样本内训练集正确率:',np.mean(train_score))
print('交叉验证集正确率:',np.mean(test_score))
print('样本内训练集AUC:',np.mean(train_auc))
print('交叉验证集AUC:',np.mean(test_auc))
样本内训练集正确率: 0.8110624033006705
交叉验证集正确率: 0.5545243619489559
样本内训练集AUC: 0.8110425478617895
交叉验证集AUC: 0.5545502084708616
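注意上面正确率与 AUC 几乎相同:对 0/1 的硬预测值计算 `roc_auc_score` 时,AUC 退化为 (真正率 + 真负率)/2,当两类样本量接近时它与正确率非常接近。下面给出一个不依赖 sklearn 的示意实现,帮助理解这一点:

```python
def binary_auc(y_true, y_pred):
    """对 0/1 硬预测计算 AUC,等价于 (TPR + TNR) / 2,即平衡正确率。"""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]  # 正类样本的预测值
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]  # 负类样本的预测值
    tpr = sum(pos) / float(len(pos))       # 正类中被预测为 1 的比例
    tnr = 1 - sum(neg) / float(len(neg))   # 负类中被预测为 0 的比例
    return (tpr + tnr) / 2.0

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_auc(y_true, y_pred))  # ≈ 0.667
```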
在后 4 年,每月月末获取因子数据,放入模型中进行预测,统计得到样本外测试平均正确率为 52.2%,平均 AUC 为 0.523。样本外正确率及 AUC 每个月的变化曲线如下图所示。(约 1 min)
# 获取特征及标签
train_target=train_data['label']
train_feature=train_data.copy()
del train_feature['pchg']
del train_feature['label']
test_sample_predict={}
test_sample_score=[]
test_sample_auc=[]
test_sample_date=[]
# 获取模型
clf = svm.SVC(C=10,gamma=0.03,kernel='rbf')
# 模型训练
clf.fit(np.array(train_feature),np.array(train_target))
for date in dateList[4*12:-1]:
test_sample_date.append(date)
# 取样本外数据特征及标签
test_target=test_data[date]['label']
test_feature=test_data[date].copy()
del test_feature['pchg']
del test_feature['label']
test_target=np.array(test_target)
test_feature=np.array(test_feature)
# 模型预测
test_predict=clf.predict(test_feature)
# 样本外预测结果
test_sample_predict[date]=test_predict
# 样本外正确率
test_sample_score.append(clf.score(test_feature, test_target))
# 样本外auc值
test_sample_auc.append(metrics.roc_auc_score(test_target,test_predict))
print('测试集正确率:',np.mean(test_sample_score))
print('测试集AUC:',np.mean(test_sample_auc))
测试集正确率: 0.522491852886406
测试集AUC: 0.5225434512013857
xs_date = [datetime.datetime.strptime(d, '%Y-%m-%d').date() for d in test_sample_date]
ys_score = test_sample_score
# 配置横坐标
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.plot(xs_date, ys_score,'r')
# 自动旋转日期标记
plt.gcf().autofmt_xdate()
# 横坐标标记
plt.xlabel('date')
# 纵坐标标记
plt.ylabel("test accuracy")
plt.show()
xs_date = [datetime.datetime.strptime(d, '%Y-%m-%d').date() for d in test_sample_date]
ys_auc = test_sample_auc
# 配置横坐标
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.plot(xs_date, ys_auc,'r')
# 自动旋转日期标记
plt.gcf().autofmt_xdate()
# 横坐标标记
plt.xlabel('date')
# 纵坐标标记
plt.ylabel("test AUC")
plt.show()
在每个截面上,将高斯核 SVM 模型(C=1,γ=0.03)对全部个股下期涨跌的预测值与因子池中各个因子值计算相关系数,查看模型预测值与各个因子值之间“真实的”相关情况,如下图所示。我们发现,超额收益预测值与换手率、技术等交易类因子关联性较弱,与基本面类型因子关联性较强。
factor_predict_corr=pd.DataFrame()
for date in dateList[4*12:-1]:
test_feature=test_data[date].copy()
del test_feature['pchg']
del test_feature['label']
test_feature['predict']=list(test_sample_predict[date])
factor_predict_corr[date]=test_feature.corr()['predict']
factor_predict_corr=factor_predict_corr.iloc[:-1]
# 高斯核 SVM 模型对于下期涨跌预测值与本期因子值之间相关系数示意图
fig = plt.figure(figsize= (15,10))
ax = fig.add_subplot(111)
sns.set()
ax = sns.heatmap(factor_predict_corr)
本部分主要进行 SVM 模型的测试。首先,对核函数进行分类,根据需求将核函数分为线性核、3 阶多项式核、7 阶多项式核以及高斯核四类;然后设定一系列 C 和 gamma 值,通过交叉验证的方法确定高斯核 SVM 模型的最佳参数,经过验证发现 C=1,gamma=0.03 时模型最佳;紧接着,利用模型分别对样本内数据及样本外数据进行检验,根据计算结果可知,高斯核 SVM 模型样本内训练集和交叉验证集正确率分别为 81.9% 和 56.2%,AUC 分别为 0.819 和 0.561,样本外测试平均正确率为 52.2%,平均 AUC 为 0.523;最后,评估了模型预测收益率与因子之间的相关性,发现超额收益预测值与换手率、技术等交易类因子关联性较弱,与基本面类型因子关联性较强。
沪深 300 成份股:剔除 ST、停牌、上市时间 <3 个月的股票
回测时间:2014-01-01 至 2018-01-01
调仓期:每月第一个交易日
SVM 模型参数: 高斯核函数,C=10,γ=0.03
选股:
(1)利用 2010-2014 年数据建立 SVM 模型
(2)利用 SVM 模型预测 2014-2018 年下月的收益
(3)根据预测值选股投资
评价方法: 回测年化收益率、夏普比率、最大回撤、胜率等。
回测年化收益率: 年化收益率是把回测区间的总收益按复利折算到一年的收益率。由于回测区间长短不一,复利会使较长区间的总收益率偏大,直接比较总收益并不公平,因此用年化收益率来衡量模型的收益能力。
夏普比率: 夏普比率衡量组合每承担一单位波动所获得的超额收益。一般情况下,收益与风险相伴而生,模型在获取收益的同时也承担相应的风险,夏普比率越高,说明模型的风险调整后收益越好。
最大回撤: 最大回撤是指模型在过去的某一段时间可能出现的最大亏损程度,通常用来衡量模型的风险。在实际投资中,若是出现最大回撤较大的情况,往往会导致投资者对模型丧失信心,因此合理控制模型的最大回撤显得尤为重要。
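上述三个指标都可以由策略的净值序列直接计算。下面给出一个示意实现(假设一年按 252 个交易日年化、无风险利率取 0,净值序列仅为演示用,并非回测的真实数据):

```python
import numpy as np

def annual_return(nav, days_per_year=252):
    """年化收益率:把区间总收益按复利折算到一年。"""
    return (nav[-1] / nav[0]) ** (days_per_year / float(len(nav) - 1)) - 1

def sharpe_ratio(nav, days_per_year=252):
    """夏普比率:日收益均值除以日收益标准差,再按根号 252 年化(无风险利率取 0)。"""
    daily = np.diff(nav) / nav[:-1]
    return daily.mean() / daily.std(ddof=1) * np.sqrt(days_per_year)

def max_drawdown(nav):
    """最大回撤:净值相对历史最高点的最大跌幅。"""
    peak = np.maximum.accumulate(nav)
    return ((peak - nav) / peak).max()

nav = np.array([1.0, 1.02, 0.99, 1.05, 1.01, 1.08])  # 假想的净值序列
print(round(max_drawdown(nav), 4))  # 0.0381
```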
策略步骤:
(1)每月根据股票预测值进行排序
(2)排序后将股票分为 N 层 (N=5)
(3)按层级分别回测,每层股票等权重投资,得到 5 根回测曲线
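上述分层的核心操作,就是按预测值从大到小排序后把股票均分为 N 份。可以用如下示意函数完成(纯 Python 片段,股票代码与预测值均为假想数据,假设股票数能被 N 整除):

```python
def split_layers(scores, n_layers=5):
    """按预测值从大到小排序后把股票均分为 n_layers 层,返回每层的股票代码列表。"""
    ranked = sorted(scores, key=scores.get, reverse=True)  # 按预测值降序排列
    size = len(ranked) // n_layers
    return [ranked[i * size:(i + 1) * size] for i in range(n_layers)]

scores = {'s%02d' % i: i / 10.0 for i in range(10)}  # 假想的 10 只股票及其预测值
layers = split_layers(scores, n_layers=5)
print(layers[0])  # 预测值最高的一层:['s09', 's08']
```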
#1 先导入所需要的程序包
import datetime
import warnings
import numpy as np
import pandas as pd
import time
from jqdata import *
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import copy
import pickle
class parameter_analysis(object):
# 定义函数中不同的变量
def __init__(self, algorithm_id=None):
self.algorithm_id = algorithm_id # 回测id
self.params_df = pd.DataFrame() # 回测中所有调参备选值的内容,列名为对应的参数名称,对应回测中的 g.XXXX
self.results = {} # 回测结果的回报率,key 为 params_df 的行序号,value 为回报率 list
self.evaluations = {} # 回测结果的各项指标,key 为 params_df 的行序号,value 为一个 dataframe
self.backtest_ids = {} # 回测结果的 id
# 新加入的基准的回测结果 id,可以默认为空 '',则使用回测中设定的基准
self.benchmark_id = ''
self.benchmark_returns = [] # 新加入的基准的回测回报率
self.returns = {} # 记录所有回报率
self.excess_returns = {} # 记录超额收益率
self.log_returns = {} # 记录收益率的 log 值
self.log_excess_returns = {} # 记录超额收益的 log 值
self.dates = [] # 回测对应的所有日期
self.excess_max_drawdown = {} # 计算超额收益的最大回撤
self.excess_annual_return = {} # 计算超额收益率的年化指标
self.evaluations_df = pd.DataFrame() # 记录各项回测指标,除日回报率外
# 定义排队运行多参数回测函数
def run_backtest(self, #
algorithm_id=None, # 回测策略id
running_max=10, # 回测中同时运行的最大回测数量
start_date='2016-01-01', # 回测的起始日期
end_date='2016-04-30', # 回测的结束日期
frequency='day', # 回测的运行频率
initial_cash='1000000', # 回测的初始持仓金额
param_names=[], # 回测中调整参数涉及的变量
param_values=[] # 回测中每个变量的备选参数值
):
# 当此处回测策略的 id 没有给出时,调用类输入的策略 id
if algorithm_id == None: algorithm_id=self.algorithm_id
# 生成所有参数组合并加载到 df 中
# 包含了不同参数具体备选值的排列组合中一组参数的 tuple 的 list
param_combinations = list(itertools.product(*param_values))
# 生成一个 dataframe, 对应的列为每个调参的变量,每个值为调参对应的备选值
to_run_df = pd.DataFrame(param_combinations)
# 修改列名称为调参变量的名字
to_run_df.columns = param_names
to_run_df['backtestID']=''
to_run_df['state']='waiting'
to_run_df['times']=0
# 设定运行起始时间和保存格式
start = time.time()
# 记录结束的运行回测
finished_backtests = {}
# 记录运行中的回测
running_backtests = {}
failed_backtests={}
running_count=0
# 总运行回测数目,等于排列组合中的元素个数
total_backtest_num = len(param_combinations)
# 记录回测结果的回报率
all_results = {}
# 记录回测结果的各项指标
all_evaluations = {}
# 在运行开始时显示
print('【已完成|运行中|待运行||失败】:' )
# 当运行回测开始后,如果没有全部运行完全的话:
while len(to_run_df[(to_run_df.state=='waiting') | (to_run_df.state=='running')].index)>0:
# 显示运行、完成和待运行的回测个数
print('[%s|%s|%s||%s].' % (len(finished_backtests),
len(running_backtests),
(total_backtest_num-len(finished_backtests)-len(running_backtests)- len(failed_backtests)),
len(failed_backtests)
)),
# 把可用的空位进行跑回测
for index in (to_run_df[to_run_df.state=='waiting'].index):
# 备选的参数排列组合的 df 中第 i 行变成 dict,每个 key 为列名字,value 为 df 中对应的值
if running_count>=running_max:
continue
params = to_run_df.loc[index,param_names].to_dict()
# 记录策略回测结果的 id,调整参数 extras 使用 params 的内容
backtest = create_backtest(algorithm_id = algorithm_id,
start_date = start_date,
end_date = end_date,
frequency = frequency,
initial_cash = initial_cash,
extras = params,
# 在回测中给该组参数的结果命名,名字包含所有涉及的变量参数值
name =str( params).replace('{','').replace('}','').replace('\'','')
)
# 记录运行中 i 回测的回测 id
to_run_df.at[index,'backtestID'] = backtest
to_run_df.at[index,'state']='running'
to_run_df.at[index,'times']=to_run_df.at[index,'times']+1
running_count=running_count+1
# 获取回测结果
failed = []
finished = []
# 对于运行中的回测,key 为 to_run_df 中所有排列组合中的序数
for index in to_run_df[to_run_df.state=='running'].index:
# 研究调用回测的结果,running_backtests[key] 为运行中保存的结果 id
bt = get_backtest(to_run_df.at[index,'backtestID'])
# 获得运行回测结果的状态,成功和失败都需要运行结束后返回,如果没有返回则运行没有结束
status = bt.get_status()
# 当运行回测失败
if status in [ 'failed','canceled','deleted']:
# 失败 list 中记录对应的回测结果 id
failed.append(index)
# 当运行回测成功时
elif status == 'done':
# 成功 list 记录对应的回测结果 id,finish 仅记录运行成功的
finished.append(index)
# 回测回报率记录对应回测的回报率 dict, key to_run_df 中所有排列组合中的序数, value 为回报率的 dict
# 每个 value 一个 list 每个对象为一个包含时间、日回报率和基准回报率的 dict
all_results[index] = bt.get_results()
# 回测回报率记录对应回测结果指标 dict, key to_run_df 中所有排列组合中的序数, value 为回测结果指标的 dataframe
all_evaluations[index] = bt.get_risk()
# 记录运行中回测结果 id 的 list 中删除失败的运行
for index in failed:
if to_run_df.at[index,'times']<3:
to_run_df.at[index,'state']='waiting'
else:
to_run_df.at[index,'state']='failed'
# 在结束回测结果 dict 中记录运行成功的回测结果 id,同时在运行中的记录中删除该回测
for index in finished:
to_run_df.at[index,'state']='done'
running_count=len(to_run_df[to_run_df.state=='running'].index)
running_backtests=to_run_df[to_run_df.state=='running']['backtestID'].to_dict()
finished_backtests=to_run_df[to_run_df.state=='done']['backtestID'].to_dict()
failed_backtests=to_run_df[to_run_df.state=='failed']['backtestID'].to_dict()
# 当一组同时运行的回测结束时报告时间
if len(finished_backtests) != 0 and len(finished_backtests) % running_max == 0 :
# 记录当时时间
middle = time.time()
# 计算剩余时间,假设每个回测的耗时大致相等
remain_time = (middle - start) * (total_backtest_num - len(finished_backtests)) / len(finished_backtests)
# print 当前运行时间
print('[已用%s时,尚余%s时,请不要关闭浏览器].' % (str(round((middle - start) / 60.0 / 60.0,3)),
str(round(remain_time / 60.0 / 60.0,3)))),
# 5秒钟后再跑一下
time.sleep(5)
# 记录结束时间
end = time.time()
print('')
print('【回测完成】总用时:%s秒(即%s小时)。' % (str(int(end-start)),
str(round((end-start)/60.0/60.0,2)))),
# 对应修改类内部对应
self.params_df = to_run_df.loc[:,param_names]
self.results = all_results
self.evaluations = all_evaluations
self.backtest_ids = finished_backtests
#7 最大回撤计算方法
def find_max_drawdown(self, returns):
# 定义最大回撤的变量
result = 0
# 记录最高的回报率点
historical_return = 0
# 遍历所有日期
for i in range(len(returns)):
# 最高回报率记录
historical_return = max(historical_return, returns[i])
# 最大回撤记录
drawdown = 1-(returns[i] + 1) / (historical_return + 1)
# 记录最大回撤
result = max(drawdown, result)
# 返回最大回撤值
return result
# log 收益、新基准下超额收益和相对于新基准的最大回撤
def organize_backtest_results(self, benchmark_id=None):
# 若新基准的回测结果 id 没给出
if benchmark_id==None:
# 使用默认的基准回报率,默认的基准在回测策略中设定
self.benchmark_returns = [x['benchmark_returns'] for x in self.results[0]]
# 当新基准指标给出后
else:
# 基准使用新加入的基准回测结果
self.benchmark_returns = [x['returns'] for x in get_backtest(benchmark_id).get_results()]
# 回测日期为结果中记录的第一项对应的日期
self.dates = [x['time'] for x in self.results[0]]
# 对应每个回测在所有备选回测中的顺序 (key),生成新数据
# 由 {key:{u'benchmark_returns': 0.022480100091729405,
# u'returns': 0.03184566700000002,
# u'time': u'2006-02-14'}} 格式转化为:
# {key: []} 格式,其中 list 为对应 date 的一个回报率 list
for key in self.results.keys():
self.returns[key] = [x['returns'] for x in self.results[key]]
# 生成对于基准(或新基准)的超额收益率
for key in self.results.keys():
self.excess_returns[key] = [(x+1)/(y+1)-1 for (x,y) in zip(self.returns[key], self.benchmark_returns)]
# 生成 log 形式的收益率
for key in self.results.keys():
self.log_returns[key] = [np.log(x+1) for x in self.returns[key]]
# 生成超额收益率的 log 形式
for key in self.results.keys():
self.log_excess_returns[key] = [np.log(x+1) for x in self.excess_returns[key]]
# 生成超额收益率的最大回撤
for key in self.results.keys():
self.excess_max_drawdown[key] = self.find_max_drawdown(self.excess_returns[key])
# 生成年化超额收益率
for key in self.results.keys():
self.excess_annual_return[key] = (self.excess_returns[key][-1]+1)**(252./float(len(self.dates)))-1
# 把调参数据中的参数组合 df 与对应结果的 df 进行合并
self.evaluations_df = pd.concat([self.params_df, pd.DataFrame(self.evaluations).T], axis=1)
# 获取最终分析数据,调用排队回测函数和数据整理的函数
def get_backtest_data(self,
algorithm_id=None, # 回测策略id
benchmark_id=None, # 新基准回测结果id
file_name='results1.pkl', # 保存结果的 pickle 文件名字
running_max=10, # 最大同时运行回测数量
start_date='2006-01-01', # 回测开始时间
end_date='2016-11-30', # 回测结束日期
frequency='day', # 回测的运行频率
initial_cash='1000000', # 回测初始持仓资金
param_names=[], # 回测需要测试的变量
param_values=[] # 对应每个变量的备选参数
):
# 调运排队回测函数,传递对应参数
self.run_backtest(algorithm_id=algorithm_id,
running_max=running_max,
start_date=start_date,
end_date=end_date,
frequency=frequency,
initial_cash=initial_cash,
param_names=param_names,
param_values=param_values
)
# 回测结果指标中加入 log 收益率和超额收益率等指标
self.organize_backtest_results(benchmark_id)
# 生成 dict 保存所有结果。
results = {'returns':self.returns,
'excess_returns':self.excess_returns,
'log_returns':self.log_returns,
'log_excess_returns':self.log_excess_returns,
'dates':self.dates,
'benchmark_returns':self.benchmark_returns,
'evaluations':self.evaluations,
'params_df':self.params_df,
'backtest_ids':self.backtest_ids,
'excess_max_drawdown':self.excess_max_drawdown,
'excess_annual_return':self.excess_annual_return,
'evaluations_df':self.evaluations_df}
# 保存 pickle 文件
pickle_file = open(file_name, 'wb')
pickle.dump(results, pickle_file)
pickle_file.close()
# 读取保存的 pickle 文件,赋予类中的对象名对应的保存内容
def read_backtest_data(self, file_name='results.pkl'):
pickle_file = open(file_name, 'rb')
results = pickle.load(pickle_file)
self.returns = results['returns']
self.excess_returns = results['excess_returns']
self.log_returns = results['log_returns']
self.log_excess_returns = results['log_excess_returns']
self.dates = results['dates']
self.benchmark_returns = results['benchmark_returns']
self.evaluations = results['evaluations']
self.params_df = results['params_df']
self.backtest_ids = results['backtest_ids']
self.excess_max_drawdown = results['excess_max_drawdown']
self.excess_annual_return = results['excess_annual_return']
self.evaluations_df = results['evaluations_df']
# 回报率折线图
def plot_returns(self):
# 通过figsize参数可以指定绘图对象的宽度和高度,单位为英寸;
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111)
# 作图
for key in self.returns.keys():
ax.plot(range(len(self.returns[key])), self.returns[key], label=key)
# 设定benchmark曲线并标记
ax.plot(range(len(self.benchmark_returns)), self.benchmark_returns, label='benchmark', c='k', linestyle='--')
ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
plt.xticks(ticks, [self.dates[i] for i in ticks])
# 设置图例样式
ax.legend(loc = 2, fontsize = 10)
# 设置y标签样式
ax.set_ylabel('returns',fontsize=20)
# 设置y轴刻度为百分比格式
ax.set_yticklabels([str(x*100)+'% 'for x in ax.get_yticks()])
# 设置图片标题样式
ax.set_title("Strategy's performances with different parameters", fontsize=21)
plt.xlim(0, len(self.returns[0]))
# 多空组合图
def plot_long_short(self):
# 通过figsize参数可以指定绘图对象的宽度和高度,单位为英寸;
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111)
# 作图
a1 = [i+1 for i in self.returns[0]]
a2 = [i+1 for i in self.returns[4]]
a1.insert(0,1)
a2.insert(0,1)
b = []
for i in range(len(a1)-1):
b.append((a1[i+1]/a1[i]-a2[i+1]/a2[i])/2)
c = []
c.append(1)
for i in range(len(b)):
c.append(c[i]*(1+b[i]))
ax.plot(range(len(c)), c, label='long_short')
ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
plt.xticks(ticks, [self.dates[i] for i in ticks])
# 设置图例样式
ax.legend(loc = 2, fontsize = 10)
ax.set_title("Strategy's long_short performances",fontsize=20)
# 设置图片标题样式
plt.xlim(0, len(c))
# 获取不同年份的收益及排名分析
def get_profit_year(self):
profit_year = {}
for key in self.returns.keys():
temp = []
date_year = []
for i in range(len(self.dates)-1):
if self.dates[i][:4] != self.dates[i+1][:4]:
temp.append(self.returns[key][i])
date_year.append(self.dates[i][:4])
temp.append(self.returns[key][-1])
date_year.append(self.dates[-1][:4])
temp1 = []
temp1.append(temp[0])
for i in range(len(temp)-1):
temp1.append((temp[i+1]+1)/(temp[i]+1)-1)
profit_year[key] = temp1
result = pd.DataFrame(index = list(self.returns.keys()), columns = date_year)
for key in self.returns.keys():
result.loc[key,:] = profit_year[key]
return result
# 超额收益率图
def plot_excess_returns(self):
# 通过figsize参数可以指定绘图对象的宽度和高度,单位为英寸;
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111)
# 作图
for key in self.returns.keys():
ax.plot(range(len(self.excess_returns[key])), self.excess_returns[key], label=key)
# 设定benchmark曲线并标记
ax.plot(range(len(self.benchmark_returns)), [0]*len(self.benchmark_returns), label='benchmark', c='k', linestyle='--')
ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
plt.xticks(ticks, [self.dates[i] for i in ticks])
# 设置图例样式
ax.legend(loc = 2, fontsize = 10)
# 设置y标签样式
ax.set_ylabel('excess returns',fontsize=20)
# 设置y轴刻度为百分比格式
ax.set_yticklabels([str(x*100)+'% 'for x in ax.get_yticks()])
# 设置图片标题样式
ax.set_title("Strategy's performances with different parameters", fontsize=21)
plt.xlim(0, len(self.excess_returns[0]))
# log回报率图
def plot_log_returns(self):
# 通过figsize参数可以指定绘图对象的宽度和高度,单位为英寸;
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111)
# 作图
for key in self.returns.keys():
ax.plot(range(len(self.log_returns[key])), self.log_returns[key], label=key)
# 设定benchmark曲线并标记
ax.plot(range(len(self.benchmark_returns)), [log(x+1) for x in self.benchmark_returns], label='benchmark', c='k', linestyle='--')
ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
plt.xticks(ticks, [self.dates[i] for i in ticks])
# 设置图例样式
ax.legend(loc = 2, fontsize = 10)
# 设置y标签样式
ax.set_ylabel('log returns',fontsize=20)
# 设置图片标题样式
ax.set_title("Strategy's performances with different parameters", fontsize=21)
plt.xlim(0, len(self.log_returns[0]))
# 超额收益率的 log 图
def plot_log_excess_returns(self):
# 通过figsize参数可以指定绘图对象的宽度和高度,单位为英寸;
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111)
# 作图
for key in self.returns.keys():
ax.plot(range(len(self.log_excess_returns[key])), self.log_excess_returns[key], label=key)
# 设定benchmark曲线并标记
ax.plot(range(len(self.benchmark_returns)), [0]*len(self.benchmark_returns), label='benchmark', c='k', linestyle='--')
ticks = [int(x) for x in np.linspace(0, len(self.dates)-1, 11)]
plt.xticks(ticks, [self.dates[i] for i in ticks])
# 设置图例样式
ax.legend(loc = 2, fontsize = 10)
# 设置y标签样式
ax.set_ylabel('log excess returns',fontsize=20)
# 设置图片标题样式
ax.set_title("Strategy's performances with different parameters", fontsize=21)
plt.xlim(0, len(self.log_excess_returns[0]))
# 回测的4个主要指标,包括总回报率、最大回撤、夏普比率和波动率
def get_eval4_bar(self, sort_by=[]):
sorted_params = self.params_df
for by in sort_by:
sorted_params = sorted_params.sort_values(by)
indices = sorted_params.index
fig = plt.figure(figsize=(20,7))
# 定义位置
ax1 = fig.add_subplot(221)
# 设定横轴为对应分位,纵轴为对应指标
ax1.bar(range(len(indices)),
[self.evaluations[x]['algorithm_return'] for x in indices], 0.6, label = 'Algorithm_return')
plt.xticks([x+0.3 for x in range(len(indices))], indices)
# 设置图例样式
ax1.legend(loc='best',fontsize=15)
# 设置y标签样式
ax1.set_ylabel('Algorithm_return', fontsize=15)
# 设置y轴刻度为百分比格式
ax1.set_yticklabels([str(x*100)+'% 'for x in ax1.get_yticks()])
# 设置图片标题样式
ax1.set_title("Strategy's of Algorithm_return performances of different quantile", fontsize=15)
# x轴范围
plt.xlim(0, len(indices))
# 定义位置
ax2 = fig.add_subplot(224)
# 设定横轴为对应分位,纵轴为对应指标
ax2.bar(range(len(indices)),
[self.evaluations[x]['max_drawdown'] for x in indices], 0.6, label = 'Max_drawdown')
plt.xticks([x+0.3 for x in range(len(indices))], indices)
# 设置图例样式
ax2.legend(loc='best',fontsize=15)
# 设置y标签样式
ax2.set_ylabel('Max_drawdown', fontsize=15)
# 设置y轴刻度为百分比格式
ax2.set_yticklabels([str(x*100)+'% 'for x in ax2.get_yticks()])
# 设置图片标题样式
ax2.set_title("Strategy's of Max_drawdown performances of different quantile", fontsize=15)
# x轴范围
plt.xlim(0, len(indices))
# 定义位置
ax3 = fig.add_subplot(223)
# 设定横轴为对应分位,纵轴为对应指标
ax3.bar(range(len(indices)),
[self.evaluations[x]['sharpe'] for x in indices], 0.6, label = 'Sharpe')
plt.xticks([x+0.3 for x in range(len(indices))], indices)
# 设置图例样式
ax3.legend(loc='best',fontsize=15)
# 设置y标签样式
ax3.set_ylabel('Sharpe', fontsize=15)
# 设置y轴刻度为百分比格式
ax3.set_yticklabels([str(x*100)+'% 'for x in ax3.get_yticks()])
# 设置图片标题样式
ax3.set_title("Strategy's of Sharpe performances of different quantile", fontsize=15)
# x轴范围
plt.xlim(0, len(indices))
# 定义位置
ax4 = fig.add_subplot(222)
# 设定横轴为对应分位,纵轴为对应指标
ax4.bar(range(len(indices)),
[self.evaluations[x]['algorithm_volatility'] for x in indices], 0.6, label = 'Algorithm_volatility')
plt.xticks([x+0.3 for x in range(len(indices))], indices)
# 设置图例样式
ax4.legend(loc='best',fontsize=15)
# 设置y标签样式
ax4.set_ylabel('Algorithm_volatility', fontsize=15)
# 设置y轴刻度为百分比格式
ax4.set_yticklabels([str(x*100)+'% 'for x in ax4.get_yticks()])
# 设置图片标题样式
ax4.set_title("Strategy's of Algorithm_volatility performances of different quantile", fontsize=15)
# x轴范围
plt.xlim(0, len(indices))
#14 年化回报和最大回撤,正负双色表示
def get_eval(self, sort_by=[]):
sorted_params = self.params_df
for by in sort_by:
sorted_params = sorted_params.sort_values(by)
indices = sorted_params.index
# 大小
fig = plt.figure(figsize = (20, 8))
# 图1位置
ax = fig.add_subplot(111)
# 生成图超额收益率的最大回撤
ax.bar([x+0.3 for x in range(len(indices))],
[-self.evaluations[x]['max_drawdown'] for x in indices], color = '#32CD32',
width = 0.6, label = 'Max_drawdown', zorder=10)
# 图年化超额收益
ax.bar([x for x in range(len(indices))],
[self.evaluations[x]['annual_algo_return'] for x in indices], color = 'r',
width = 0.6, label = 'Annual_return')
plt.xticks([x+0.3 for x in range(len(indices))], indices)
# 设置图例样式
ax.legend(loc='best',fontsize=15)
# 基准线
plt.plot([0, len(indices)], [0, 0], c='k',
linestyle='--', label='zero')
# 设置图例样式
ax.legend(loc='best',fontsize=15)
# 设置y标签样式
ax.set_ylabel('Max_drawdown', fontsize=15)
# 设置y轴刻度为百分比格式
ax.set_yticklabels([str(x*100)+'% 'for x in ax.get_yticks()])
# 设置图片标题样式
ax.set_title("Strategy's performances of different quantile", fontsize=15)
# 设定x轴长度
plt.xlim(0, len(indices))
#14 超额收益的年化回报和最大回撤
# 加入新的 benchmark 后计算超额收益的年化回报与最大回撤
def get_excess_eval(self, sort_by=[]):
sorted_params = self.params_df
for by in sort_by:
sorted_params = sorted_params.sort_values(by)
indices = sorted_params.index
# 大小
fig = plt.figure(figsize = (20, 8))
# 图1位置
ax = fig.add_subplot(111)
# 生成图超额收益率的最大回撤
ax.bar([x+0.3 for x in range(len(indices))],
[-self.excess_max_drawdown[x] for x in indices], color = '#32CD32',
width = 0.6, label = 'Excess_max_drawdown')
# 图年化超额收益
ax.bar([x for x in range(len(indices))],
[self.excess_annual_return[x] for x in indices], color = 'r',
width = 0.6, label = 'Excess_annual_return')
plt.xticks([x+0.3 for x in range(len(indices))], indices)
# 设置图例样式
ax.legend(loc='best',fontsize=15)
# 基准线
plt.plot([0, len(indices)], [0, 0], c='k',
linestyle='--', label='zero')
# 设置图例样式
ax.legend(loc='best',fontsize=15)
# 设置y标签样式
ax.set_ylabel('Max_drawdown', fontsize=15)
# 设置y轴刻度为百分比格式
ax.set_yticklabels([str(x*100)+'% 'for x in ax.get_yticks()])
# 设置图片标题样式
ax.set_title("Strategy's performances of different quantile", fontsize=15)
# 设定x轴长度
plt.xlim(0, len(indices))
def group_backtest(start_date,end_date):
warnings.filterwarnings("ignore")
pa = parameter_analysis('78c119654ae4be5678a2e92ac6565b3f')
pa.get_backtest_data(file_name = 'results_1.pkl',
running_max = 5,
start_date=start_date,
end_date=end_date,
frequency = 'day',
initial_cash = '10000000',
param_names = ['factor', 'quantile'],#变量名,即在策略中的g.xxxx变量
param_values = [['svm'], tuple(zip(range(0,50,10), range(10,51,10)))]
)
start_date = '2014-01-01'
end_date = '2018-01-01'
group_backtest(start_date,end_date)
【已完成|运行中|待运行||失败】: [0|0|5||0]. [0|5|0||0]. ……(重复输出略)…… [1|4|0||0]. [2|3|0||0]. [2|3|0||0]. [3|2|0||0]. [已用0.051时,尚余0.0时,请不要关闭浏览器]. 【回测完成】总用时:188秒(即0.05小时)。
为了对模型的收益能力进行具体分析,将模型预测结果看成单因子,按照单因子有效性的测试方法,对模型的有效性进行测试,具体分析方法为:(1)根据预测结果按照从大到小的顺序进行排序;(2)将股票平均分为 5 等份,分别构成 5 个投资组合。具体每个组合的收益指标如下表所示。
pa = parameter_analysis()
pa.read_backtest_data('results_1.pkl')
pa.evaluations_df
factor | quantile | __version | algorithm_return | algorithm_volatility | alpha | annual_algo_return | annual_bm_return | avg_position_days | avg_trade_return | ... | max_drawdown_period | max_leverage | period_label | profit_loss_ratio | sharpe | sortino | trading_days | treasury_return | win_count | win_ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | svm | (0, 10) | 101 | 0.663323 | 0.269375 | -0.0138488 | 0.139055 | 0.150562 | NaN | NaN | ... | [2015-06-12, 2016-01-28] | 0 | 2017-12 | NaN | 0.367722 | 0.407257 | 977 | 0.159671 | NaN | NaN |
1 | svm | (10, 20) | 101 | 0.63249 | 0.266582 | -0.0188011 | 0.133614 | 0.150562 | 135.74 | 0.0164623 | ... | [2015-06-12, 2016-01-28] | 0 | 2017-12 | 1.30284 | 0.351166 | 0.394109 | 977 | 0.159671 | 1393 | 0.527851 |
2 | svm | (20, 30) | 101 | 0.509376 | 0.262645 | -0.0391229 | 0.111096 | 0.150562 | NaN | NaN | ... | [2015-06-12, 2016-01-28] | 0 | 2017-12 | NaN | 0.270693 | 0.304905 | 977 | 0.159671 | NaN | NaN |
3 | svm | (30, 40) | 101 | 0.136907 | 0.272248 | -0.121327 | 0.0333779 | 0.150562 | NaN | NaN | ... | [2015-06-12, 2016-01-28] | 0 | 2017-12 | NaN | -0.024324 | -0.0269605 | 977 | 0.159671 | NaN | NaN |
4 | svm | (40, 50) | 101 | 0.0971357 | 0.261152 | -0.124857 | 0.0240049 | 0.150562 | NaN | NaN | ... | [2015-06-12, 2016-01-28] | 0 | 2017-12 | NaN | -0.0612482 | -0.0683193 | 977 | 0.159671 | NaN | NaN |
5 rows × 28 columns
为了更直观地对 5 个组合进行分析,绘制了 5 个组合及 HS300 基准的净值收益曲线,如下图所示。由图可以看出,组合 1 能够明显跑赢组合 5,符合单因子有效性检验的要求,说明模型是有效的。
pa.plot_returns()
从分层组合回测净值曲线图来看,每个组合波动性较大,策略存在较大的风险,因此考虑建立多空组合。多空组合是买入组合 1、卖空组合 5 (月度调仓)的一个资产组合,为了方便统计,多空组合每日收益率为(组合 1 每日收益率 - 组合 5 每日收益率)/2,然后获得多空组合的净值收益曲线,如图所示,多空组合净值收益曲线明显比任何一个组合的波动性更低,能够获得更为稳定的收益,风险控制效果较好。
pa.plot_long_short()
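多空组合净值的计算逻辑与 plot_long_short 中一致:先由两个组合的累计收益率还原出各自的日收益率,取(组合 1 日收益率 − 组合 5 日收益率)/2 作为多空组合的日收益率,再复利累乘成净值。下面是一个脱离回测框架的示意片段(输入的累计收益率序列为假想数据):

```python
def long_short_nav(cum_ret_top, cum_ret_bottom):
    """由组合1、组合5的累计收益率序列构造多空组合净值(多头组合1、空头组合5各占一半)。"""
    a1 = [1.0] + [1 + r for r in cum_ret_top]     # 组合1净值序列
    a2 = [1.0] + [1 + r for r in cum_ret_bottom]  # 组合5净值序列
    nav = [1.0]
    for i in range(len(a1) - 1):
        # 多空日收益 = (组合1日收益 - 组合5日收益) / 2
        daily_ls = (a1[i + 1] / a1[i] - a2[i + 1] / a2[i]) / 2.0
        nav.append(nav[-1] * (1 + daily_ls))
    return nav

nav = long_short_nav([0.01, 0.03], [0.00, -0.01])
print(nav)
```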
为了进一步分析模型的稳定性,对每一年每个组合的收益能力进行分析。如表所示,组合 1 每一年在 5 个组合中均能够获得较高的收益,而组合 5 基本上每年收益能力都排在最后两名。
pa.get_profit_year()
2014 | 2015 | 2016 | 2017 | |
---|---|---|---|---|
0 | 0.410576 | 0.295119 | -0.0946605 | 0.00567782 |
1 | 0.473467 | 0.180543 | -0.134506 | 0.0843375 |
2 | 0.481693 | 0.262112 | -0.205667 | 0.0161062 |
3 | 0.374386 | 0.0572377 | -0.195955 | -0.026887 |
4 | 0.306113 | 0.0628772 | -0.210157 | 0.000589785 |
pa.get_eval4_bar()
pa.get_eval()
pa.get_excess_eval()
策略步骤:
(1)所有股票归属申万一级行业
(2)行业内股票按每月预测结果进行排序
(3)行业内股票均分成 N层 (N=5)
(4)每个分层组合的股票进行权重配置
股票权重配置方法:
在每个一级行业内部对所有个股按因子大小进行排序,每个行业内均分成 N 个分层组合。如图所示,黄色方块代表各行业内个股初始权重,可以相等也可以不等(我们直接取相等权重进行测试)。分层的具体操作方法为:N 等分行业内个股权重累加值。例如图示行业 1 中,5 只个股初始权重相等(不妨设每只个股权重为 0.2),假设我们欲分成 3 层,则分层组合 1 在权重累加值 1/3 处截断,即分层组合 1 包含个股 1 和个股 2,它们的权重配比为 0.2:(1/3-0.2)=3:2;同理,分层组合 2 包含个股 2、3、4,配比为 (0.4-1/3):0.2:(2/3-0.6)=1:3:1;分层组合 3 包含个股 4、5,配比为 2:3。以上方法用于计算各个一级行业内部的个股权重配比,行业间权重配比与基准组合(我们使用沪深 300)相同,也即行业中性。
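上面的权重累加截断方法可以写成一个通用的示意函数:把行业内个股的初始权重在累加轴上排开,在 k/N 处截断,落在边界上的个股按其权重区间与层区间的重叠长度拆分权重(纯 Python 片段,输入权重为假想数据):

```python
def layer_weights(weights, n_layers):
    """按权重累加值把行业内个股切成 n_layers 层。
    返回每层 {个股序号: 层内权重} 的列表,边界个股按区间重叠长度拆分。"""
    total = sum(weights)
    bounds = [k * total / n_layers for k in range(n_layers + 1)]  # 截断点
    # 每只个股占据累加轴上的一段区间 [lo, hi)
    cum, intervals = 0.0, []
    for w in weights:
        intervals.append((cum, cum + w))
        cum += w
    layers = []
    for k in range(n_layers):
        lo_b, hi_b = bounds[k], bounds[k + 1]
        layer = {}
        for i, (lo, hi) in enumerate(intervals):
            overlap = min(hi, hi_b) - max(lo, lo_b)  # 与本层区间的重叠长度
            if overlap > 1e-12:
                layer[i] = overlap
        layers.append(layer)
    return layers

# 正文示例:5 只个股等权 0.2,分 3 层
layers = layer_weights([0.2] * 5, 3)
print(layers[0])  # {0: 0.2, 1: 约 0.1333},即个股1、2 的权重配比为 3:2
```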
### 注意:这里需要将 num 参数转成字符串传入
def group_backtest(start_date,end_date,num):
warnings.filterwarnings("ignore")
pa = parameter_analysis()
pa.get_backtest_data(file_name = 'results_2.pkl',
running_max = 10,
algorithm_id = '794c29655a9b316c82b0cf87340631cf',
start_date=start_date,
end_date=end_date,
frequency = 'day',
initial_cash = '10000000',
param_names = ['num'],
param_values = [num]
)
start_date = '2014-01-01'
end_date = '2018-01-01'
num = ['1','2','3','4','5']
group_backtest(start_date,end_date,num)
【已完成|运行中|待运行||失败】: [0|0|5||0]. [0|5|0||0]. ……(重复输出略)…… [1|4|0||0]. [2|3|0||0]. ……(重复输出略)…… [4|1|0||0]. 【回测完成】总用时:545秒(即0.15小时)。
def group_backtest(start_date,end_date):
warnings.filterwarnings("ignore")
pa = parameter_analysis()
pa.get_backtest_data(file_name = 'results_2.pkl',
running_max = 10,
algorithm_id = 'aaece09901876fda726e8768b1d34e11',
start_date=start_date,
end_date=end_date,
frequency = 'day',
initial_cash = '10000000',
param_names = ['factor', 'quantile'],
param_values = [['svm'], tuple(zip(range(0,50,10), range(10,51,10)))]
)
start_date = '2014-01-01'
end_date = '2018-01-01'
group_backtest(start_date,end_date)
【已完成|运行中|待运行||失败】: [0|0|5||0]. [0|5|0||0]. ……(重复输出略)…… [1|4|0||0]. [2|3|0||0]. ……(重复输出略)…… [3|2|0||0]. ……(重复输出略)…… [4|1|0||0]. 【回测完成】总用时:876秒(即0.24小时)。
为了对行业中性策略的收益能力进行具体分析,同样将模型预测结果视为单因子,按照单因子有效性的测试方法对模型进行测试,具体分析方法为:(1)行业内根据预测结果按照从大到小的顺序进行排序;(2)行业内将股票均分为 5 层,分别构成 5 个投资组合。具体每个组合的收益指标如下表所示。
pa = parameter_analysis()
pa.read_backtest_data('results_2.pkl')
pa.evaluations_df
num | __version | algorithm_return | algorithm_volatility | alpha | annual_algo_return | annual_bm_return | avg_position_days | avg_trade_return | benchmark_return | ... | max_drawdown_period | max_leverage | period_label | profit_loss_ratio | sharpe | sortino | trading_days | treasury_return | win_count | win_ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 101 | 0.738469 | 0.255861 | 0.00123646 | 0.152007 | 0.150562 | NaN | NaN | 0.729959 | ... | [2015-06-12, 2016-01-28] | 0 | 2017-12 | NaN | 0.437766 | 0.492744 | 977 | 0.159671 | NaN | NaN |
1 | 2 | 101 | 0.455817 | 0.258532 | -0.0514965 | 0.100871 | 0.150562 | 186.156 | 0.0156864 | 0.729959 | ... | [2015-06-12, 2016-01-28] | 0 | 2017-12 | 1.24784 | 0.23545 | 0.261717 | 977 | 0.159671 | 1948 | 0.533114 |
2 | 3 | 101 | 0.633183 | 0.264279 | -0.0211756 | 0.133737 | 0.150562 | 179.59 | 0.0144208 | 0.729959 | ... | [2015-06-12, 2016-01-28] | 0 | 2017-12 | 1.32176 | 0.35469 | 0.403786 | 977 | 0.159671 | 1921 | 0.521303 |
3 | 4 | 101 | 0.212724 | 0.265138 | -0.104247 | 0.0505905 | 0.150562 | 176.945 | 0.00921821 | 0.729959 | ... | [2015-06-12, 2016-02-29] | 0 | 2017-12 | 1.14759 | 0.0399431 | 0.0445347 | 977 | 0.159671 | 1854 | 0.505728 |
4 | 5 | 101 | 0.237787 | 0.252249 | -0.0927294 | 0.0561041 | 0.150562 | 163.021 | 0.00871081 | 0.729959 | ... | [2015-06-12, 2017-05-23] | 0 | 2017-12 | 1.15374 | 0.0638424 | 0.0732078 | 977 | 0.159671 | 1588 | 0.499528 |
5 rows × 27 columns
为了更直观地对 5 个组合进行分析,绘制了 5 个组合及 HS300 基准的净值收益曲线,如下图所示。由图可以看出,组合 1 能够明显跑赢组合 5,符合单因子有效性检验的要求,说明模型是有效的。
pa.plot_returns()