This article introduces feature selection methods in machine learning, covering the filter, wrapper, and embedded approaches. It also borrows from the most widely referenced feature-selection implementation on GitHub (feature-selector) and makes some modifications on top of it.
The notebook is organized into three modules:
1. Data acquisition: technical factors provided by JoinQuant, used as-is without further processing.
2. Data processing: winsorization and standardization, plus PCA dimensionality reduction (PCA is not applied to the data in this article, but the code is provided).
3. Feature selection: the three basic approaches, plus correlation-based feature elimination and LightGBM-based feature selection.
1. Why do feature selection
With a limited number of samples, building a classifier on a large number of features is computationally expensive and tends to give poor classification performance.
2. What feature selection actually means
Samples in a high-dimensional space are first mapped or transformed into a lower-dimensional space to reduce dimensionality; feature selection then filters out redundant and irrelevant features to reduce the dimensionality further.
3. Principles of feature selection
Obtain the smallest possible feature subset that does not significantly reduce classification accuracy, does not distort the class distribution, and remains stable and adaptable.
1. Filter methods
Main idea: score each feature dimension individually, i.e. assign each feature a weight that reflects its importance, and then rank the features by weight. A minimal sketch follows the list below.
The main techniques are:
Chi-squared test
Information gain (see the article "简单易学的机器学习算法——决策树之ID3算法" on the ID3 decision-tree algorithm for details)
Correlation coefficient scores
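To make the filter idea concrete, here is a minimal, self-contained sketch using scikit-learn's univariate scoring on synthetic data; the data, the choice of f_classif, and k=3 are illustrative assumptions, not settings from this notebook.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
# Filter-style selection: score every feature independently, keep the k best.
rng = np.random.RandomState(0)
X = rng.randn(200, 10)                          # 200 samples, 10 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # label driven by features 0 and 1 only
selector = SelectKBest(score_func=f_classif, k=3)   # ANOVA F-test score per feature
X_selected = selector.fit_transform(X, y)
print(selector.scores_)        # one score per feature
print(selector.get_support())  # boolean mask of the k selected features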
2. Wrapper methods
Main idea: treat feature-subset selection as a search and optimization problem: generate different feature combinations, evaluate each combination with a model, and compare it against the other combinations.
The main representative is recursive feature elimination (RFE); a minimal sketch follows.
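As an illustration of the wrapper idea, a minimal RFE sketch on synthetic data follows; the estimator (LinearSVC) and n_features_to_select=4 are illustrative assumptions rather than the settings used later in this notebook.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
# Wrapper-style selection: repeatedly fit the estimator and drop the weakest
# features until only n_features_to_select remain.
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 2] - X[:, 3] > 0).astype(int)
rfe = RFE(estimator=LinearSVC(), n_features_to_select=4)
X_reduced = rfe.fit_transform(X, y)
print(rfe.support_)   # mask of the 4 retained features
print(rfe.ranking_)   # rank 1 = selected; larger ranks were eliminated earlier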
3. Embedded methods
Main idea: with the model already chosen, learn during training which attributes contribute most to the model's accuracy. The number of retained features is controlled by comparing each feature's importance against a threshold; for L1-regularized estimators SelectFromModel defaults the threshold to 1e-5, otherwise it defaults to the mean of the feature importances, and other values (for example "median" or a scaled mean) can be set explicitly. A minimal sketch follows.
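A minimal embedded-style sketch, assuming an L1-penalized LinearSVC as the base model; the synthetic data and C=0.01 are illustrative, not taken from this notebook.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
# Embedded selection: the L1 penalty drives most coefficients to zero, and
# SelectFromModel keeps the features whose importance exceeds the threshold
# (1e-5 by default for L1-penalized estimators).
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] - 2 * X[:, 5] > 0).astype(int)
l1_svc = LinearSVC(C=0.01, penalty='l1', dual=False)
sfm = SelectFromModel(l1_svc)
X_embedded = sfm.fit_transform(X, y)
print(sfm.get_support())   # which features survived
print(X_embedded.shape)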
GitHub link: https://github.com/WillKoehrsen/feature-selector
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest,SelectPercentile,SelectFromModel,chi2,f_classif,mutual_info_classif,RFE
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier
from sklearn.svm import SVC,LinearSVC,LinearSVR,SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA  #needed by pca_analysis below
import lightgbm as lgb
from sklearn.model_selection import train_test_split
import gc
#JoinQuant research-environment modules
from jqdata import *
from jqlib.technical_analysis import *
Data acquisition
start_date = '2016-01-01'
end_date = '2018-11-01'
trade_days = get_trade_days(start_date=start_date,end_date=end_date).tolist()
date = trade_days[0]
lookback = 5 #length of the time-axis window for the LSTM data
stocks = '000905.XSHG' #CSI 500 index
#Prepare the technical-factor data
def get_factors_one_stock(stocks,date):
    '''
    Get one day of factor data for a single stock
    input:
        stocks: a single stock code
        date: date
    output:
        df: dataframe, one day's value for each factor
    '''
    if type(date) != str:
        date = date.strftime('%Y-%m-%d')
    price = get_price(stocks,end_date=date,count=1)
    price.index = [date]
    accer = ACCER(stocks,check_date=date,N=5)
    accer_df = pd.DataFrame(list(accer.values()),columns=['ACCER'])
    #ADTM - dynamic buying/selling momentum
    adtm,maadtm = ADTM(stocks, date, N = 23, M = 8)
    adtm_df = pd.DataFrame(list(adtm.values()),columns=['ADTM'])
    maadtm_df = pd.DataFrame(list(maadtm.values()),columns=['MAADTM'])
    #ATR - average true range
    mtr,atr = ATR(stocks, date, timeperiod=14)
    mtr_df = pd.DataFrame(list(mtr.values()),columns=['MTR'])
    atr_df = pd.DataFrame(list(atr.values()),columns=['ATR'])
    #BIAS - bias ratio
    bias,bias_ma = BIAS_QL(stocks, date, M = 6)
    bias_df = pd.DataFrame(list(bias.values()),columns=['BIAS'])
    #CCI - commodity channel index
    cci = CCI(stocks, date, N=14)
    cci_df = pd.DataFrame(list(cci.values()),columns=['CCI'])
    #DKX - bull/bear line
    dkx,madkx = DKX(stocks, date, M = 10)
    dkx_df = pd.DataFrame(list(dkx.values()),columns=['DKX'])
    #SKDJ - stochastic indicator
    k,d = SKDJ(stocks, date, N = 9, M = 3)
    k_df = pd.DataFrame(list(k.values()),columns=['KBJ'])
    #CYE - market trend
    cye,_ = CYE(stocks, date)
    cye_df = pd.DataFrame(list(cye.values()),columns=['CYE'])
    #MFI - money flow index
    mfi = MFI(stocks, date, timeperiod=14)
    mfi_df = pd.DataFrame(list(mfi.values()),columns=['MFI'])
    #MTM - momentum line
    mtm = MTM(stocks, date, timeperiod=14)
    mtm_df = pd.DataFrame(list(mtm.values()),columns=['MTM'])
    #ROC - rate of change
    roc = ROC(stocks, date, timeperiod=12)
    roc_df = pd.DataFrame(list(roc.values()),columns=['ROC'])
    #RSI - relative strength index
    rsi = RSI(stocks, date, N1=6)
    rsi_df = pd.DataFrame(list(rsi.values()),columns=['RSI'])
    #MARSI - moving-average RSI
    rsi10,rsi6 = MARSI(stocks, date, M1 = 10, M2 = 6)
    rsi10_df = pd.DataFrame(list(rsi10.values()),columns=['RSI10'])
    rsi6_df = pd.DataFrame(list(rsi6.values()),columns=['RSI6'])
    #OSC - oscillator (rate-of-change line)
    osc,maosc = OSC(stocks, date, N = 20, M = 6)
    osc_df = pd.DataFrame(list(osc.values()),columns=['OSC'])
    maosc_df = pd.DataFrame(list(maosc.values()),columns=['MAOSC'])
    #UDL - gravity line
    udl,maudl = UDL(stocks, date, N1 = 3, N2 = 5, N3 = 10, N4 = 20, M = 6)
    udl_df = pd.DataFrame(list(udl.values()),columns=['UDL'])
    maudl_df = pd.DataFrame(list(maudl.values()),columns=['MAUDL'])
    #WR - Williams %R
    wr,mawr = WR(stocks, date, N = 10, N1 = 6)
    wr_df = pd.DataFrame(list(wr.values()),columns=['WR'])
    mawr_df = pd.DataFrame(list(mawr.values()),columns=['MAWR'])
    #FSL - watershed line
    fsl,mafsl = FSL(stocks, date)
    fsl_df = pd.DataFrame(list(fsl.values()),columns=['FSL'])
    mafsl_df = pd.DataFrame(list(mafsl.values()),columns=['MAFSL'])
    #Trend indicators
    cho,macho = CHO(stocks, date, N1 = 10, N2 = 20, M = 6)
    cho_df = pd.DataFrame(list(cho.values()),columns=['CHO'])
    macho_df = pd.DataFrame(list(macho.values()),columns=['MACHO'])
    dif,difma = DMA(stocks, date, N1 = 10, N2 = 50, M = 10)
    dif_df = pd.DataFrame(list(dif.values()),columns=['DIF'])
    difma_df = pd.DataFrame(list(difma.values()),columns=['DIFMA'])
    #EMV - ease of movement
    emv,maemv = EMV(stocks, date, N = 14, M = 9)
    emv_df = pd.DataFrame(list(emv.values()),columns=['EMV'])
    maemv_df = pd.DataFrame(list(maemv.values()),columns=['MAEMV'])
    #Energy indicators
    #BRAR - sentiment indicators
    br, ar = BRAR(stocks, date, N=26)
    br_df = pd.DataFrame(list(br.values()),columns=['BR'])
    ar_df = pd.DataFrame(list(ar.values()),columns=['AR'])
    cr,M1,M2,M3,M4 = CR(stocks, date, N=26, M1=10, M2=20, M3=40, M4=62)
    cr_df = pd.DataFrame(list(cr.values()),columns=['CR'])
    mass,mamass = MASS(stocks, date, N1=9, N2=25, M=6)
    mass_df = pd.DataFrame(list(mass.values()),columns=['MASS'])
    mamass_df = pd.DataFrame(list(mamass.values()),columns=['MAMASS'])
    #Volume indicators
    amo,amo1,amo2 = AMO(stocks, date, M1 = 5, M2 = 10)
    amo_df = pd.DataFrame(list(amo.values()),columns=['AMO'])
    amo1_df = pd.DataFrame(list(amo1.values()),columns=['AMO1'])
    amo2_df = pd.DataFrame(list(amo2.values()),columns=['AMO2'])
    df = pd.concat([accer_df,adtm_df,maadtm_df,mtr_df,atr_df,bias_df,cci_df,dkx_df,k_df,cye_df,
                    mfi_df,mtm_df,emv_df,roc_df,rsi_df,rsi10_df,rsi6_df,osc_df,udl_df,maudl_df,wr_df,
                    mawr_df,fsl_df,mafsl_df,cho_df,macho_df,dif_df,difma_df,cr_df,mass_df,mamass_df,
                    amo_df,amo1_df,amo2_df,
                    br_df,ar_df],axis=1)
    df.index = [date]
    df = pd.concat([price,df],axis=1)
    return df
def get_data_from_date(start_date,end_date,stocks):
    '''
    Build the factor time series by stacking one day of factors per trading day
    '''
    trade_date = get_trade_days(start_date=start_date,end_date=end_date)
    df = get_factors_one_stock(stocks,trade_date[0])
    for date in trade_date[1:]:
        df1 = get_factors_one_stock(stocks,date)
        df = pd.concat([df,df1])
    return df
data = get_data_from_date(start_date,end_date,stocks)
def get_day_profit(stocks,end_date,start_date=None,count=-1,pre_num=1):
    '''
    Get the daily return label
    input:
        stocks: stock code (str)
        start_date: start date
        end_date: end date
        count: number of bars to look back; use either count or start_date
        pre_num: int, number of days over which the return is computed
    output:
        profit: dataframe indexed by date; returns greater than 0 are labelled 1, otherwise 0
    '''
    if count == -1:
        price = get_price(stocks,start_date,end_date,fields=['close'])['close']
    else:
        price = get_price(stocks,end_date=end_date,count=count,fields=['close'])['close']
    profit = price.pct_change(periods=pre_num).dropna()
    profit[profit > 0] = 1
    profit[profit < 0] = 0
    profit = profit.to_frame()
    profit.columns=['profit_dis']
    return profit
profit_dis = get_day_profit(stocks,start_date=start_date,end_date=end_date)
def get_day_profit_data(stocks,end_date,start_date=None,count=-1,pre_num=1):
    '''
    Get the raw daily returns
    input:
        stocks: stock code (str)
        start_date: start date
        end_date: end date
        count: number of bars to look back; use either count or start_date
        pre_num: int, number of days over which the return is computed
    output:
        profit: dataframe indexed by date, values are the raw returns
    '''
    if count == -1:
        price = get_price(stocks,start_date,end_date,fields=['close'])['close']
    else:
        price = get_price(stocks,end_date=end_date,count=count,fields=['close'])['close']
    profit = price.pct_change(periods=pre_num).dropna()
    profit = profit.to_frame()
    profit.columns=['profit']
    return profit
profit = get_day_profit_data(stocks,start_date=start_date,end_date=end_date)
data_profit = pd.concat([profit,profit_dis],axis=1)
index = data_profit.index
index = [ind.date() for ind in index]
index = [ind.strftime('%Y-%m-%d') for ind in index]
data_profit.index = index
data_concat = pd.concat([data,data_profit],axis=1).dropna(axis=1,how='all').dropna()
data_concat
columns = data_concat.columns
data_x = data_concat[columns[:-2]]
data_y = data_concat[columns[-1]]
data_x
Data processing
def winsorize_and_standarlize(data,qrange=[0.05,0.95],axis=0):
    '''
    Winsorize (clip extremes at the given quantiles) and then z-score standardize
    input:
        data: DataFrame or Series, the input data
        qrange: list, qrange[0] is the lower quantile and qrange[1] the upper quantile;
                values outside this range are replaced by the quantile values
        axis: 0 to process column-wise, otherwise row-wise
    '''
    if isinstance(data,pd.DataFrame):
        if axis == 0:
            q_down = data.quantile(qrange[0])
            q_up = data.quantile(qrange[1])
            col = data.columns
            for n in col:
                data[n][data[n] > q_up[n]] = q_up[n]
                data[n][data[n] < q_down[n]] = q_down[n]
            data = (data - data.mean())/data.std()
            data = data.fillna(0)
        else:
            data = data.stack()
            data = data.unstack(0)
            q_down = data.quantile(qrange[0])
            q_up = data.quantile(qrange[1])
            col = data.columns
            for n in col:
                data[n][data[n] > q_up[n]] = q_up[n]
                data[n][data[n] < q_down[n]] = q_down[n]
            data = (data - data.mean())/data.std()
            data = data.stack().unstack(0)
            data = data.fillna(0)
    elif isinstance(data,pd.Series):
        q_down = data.quantile(qrange[0])
        q_up = data.quantile(qrange[1])
        data[data > q_up] = q_up
        data[data < q_down] = q_down
        data = (data - data.mean())/data.std()
    return data
datax_new = winsorize_and_standarlize(data_x)
#PCA dimensionality reduction (not applied to the data in this article, but provided for completeness)
def pca_analysis(data,n_components='mle'):
    index = data.index
    model = PCA(n_components=n_components)
    model.fit(data)
    data_pca = model.transform(data)
    df = pd.DataFrame(data_pca,index=index)
    return df
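PCA is not used in this article, but if one wanted to apply it the call could look like the sketch below; datax_pca and n_components=5 are hypothetical choices for illustration only.
# Hypothetical usage (not run in this notebook): compress the winsorized factor
# matrix to 5 principal components before feature selection.
datax_pca = pca_analysis(datax_new, n_components=5)
print(datax_pca.shape)   # (number of trading days, 5)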
Feature selection
class FeatureSelection():
    '''
    Feature selection:
    identify_collinear: based on the correlation coefficient; for every feature pair whose correlation exceeds correlation_threshold, drop one of the two
    identify_importance_lgbm: based on LightGBM feature importance; keep the top features whose cumulative importance stays below p_importance
    filter_select: univariate selection; if k is given, SelectKBest keeps the top k features under the scoring method, otherwise SelectPercentile keeps the top p percent
    wrapper_select: RFE, recursive feature elimination with the given estimator, keeping n_features_to_select features
    embedded_select: SelectFromModel, keep features whose importance exceeds threshold (1e-5 by default for L1-penalized estimators, otherwise the mean importance)
    '''
    def __init__(self):
        self.supports = None          #boolean mask: whether each feature was selected
        self.columns = None           #names of the selected features
        self.record_collinear = None  #pairs of features whose correlation exceeds the threshold
    def identify_collinear(self, data, correlation_threshold):
        """
        Finds collinear features based on the correlation coefficient between features.
        For each pair of features with a correlation coefficient greater than `correlation_threshold`,
        only one of the pair is identified for removal.
        Using code adapted from: https://gist.github.com/Swarchal/e29a3a1113403710b6850590641f046c
        Parameters
        --------
        data : dataframe
            Data observations in the rows and features in the columns
        correlation_threshold : float between 0 and 1
            Value of the Pearson correlation coefficient for identifying correlated features
        """
        columns = data.columns
        self.correlation_threshold = correlation_threshold
        # Calculate the correlations between every column
        corr_matrix = data.corr()
        self.corr_matrix = corr_matrix
        # Extract the upper triangle of the correlation matrix
        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))
        # Select the features with correlations above the threshold
        # Need to use the absolute value
        to_drop = [column for column in upper.columns if any(upper[column].abs() > correlation_threshold)]
        obtain_columns = [column for column in columns if column not in to_drop]
        self.columns = obtain_columns
        # Dataframe to hold correlated pairs
        record_collinear = pd.DataFrame(columns = ['drop_feature', 'corr_feature', 'corr_value'])
        # Iterate through the columns to drop
        for column in to_drop:
            # Find the correlated features
            corr_features = list(upper.index[upper[column].abs() > correlation_threshold])
            # Find the correlated values
            corr_values = list(upper[column][upper[column].abs() > correlation_threshold])
            drop_features = [column for _ in range(len(corr_features))]
            # Record the information (need a temp df for now)
            temp_df = pd.DataFrame.from_dict({'drop_feature': drop_features,
                                              'corr_feature': corr_features,
                                              'corr_value': corr_values})
            # Add to dataframe
            record_collinear = record_collinear.append(temp_df, ignore_index = True)
        self.record_collinear = record_collinear
        return data[obtain_columns]
    def identify_importance_lgbm(self, features, labels, p_importance=0.8, eval_metric='auc', task='classification',
                                 n_iterations=10, early_stopping = True):
        """
        Identify the most important features according to a gradient boosting machine.
        The gbm can be trained with early stopping using a validation set to prevent overfitting.
        The feature importances are averaged over n_iterations to reduce variance.
        Uses the LightGBM implementation (http://lightgbm.readthedocs.io/en/latest/index.html)
        Parameters
        --------
        features : dataframe
            Data for training the model with observations in the rows
            and features in the columns
        labels : array, shape = (n_samples,)
            Array of labels for training the model. These can be either binary
            (if task is 'classification') or continuous (if task is 'regression')
        p_importance : float, range [0, 1], default = 0.8
            Features are kept while their cumulative normalized importance stays below this value
        eval_metric : string
            Evaluation metric to use for the gradient boosting machine
        task : string, default = 'classification'
            The machine learning task, either 'classification' or 'regression'
        n_iterations : int, default = 10
            Number of iterations to train the gradient boosting machine
        early_stopping : boolean, default = True
            Whether or not to use early stopping with a validation set when training
        Notes
        --------
        - Features are one-hot encoded to handle the categorical variables before training.
        - The gbm is not optimized for any particular task and might need some hyperparameter tuning
        - Feature importances, including zero importance features, can change across runs
        """
        # One hot encoding
        data = features
        features = pd.get_dummies(features)
        # Extract feature names
        feature_names = list(features.columns)
        # Convert to np array
        features = np.array(features)
        labels = np.array(labels).reshape((-1, ))
        # Empty array for feature importances
        feature_importance_values = np.zeros(len(feature_names))
        print('Training Gradient Boosting Model\n')
        # Fit the model n_iterations times and average the feature importances
        for _ in range(n_iterations):
            if task == 'classification':
                model = lgb.LGBMClassifier(n_estimators=100, learning_rate = 0.05, verbose = -1)
            elif task == 'regression':
                model = lgb.LGBMRegressor(n_estimators=100, learning_rate = 0.05, verbose = -1)
            else:
                raise ValueError('Task must be either "classification" or "regression"')
            # If training using early stopping need a validation set
            if early_stopping:
                train_features, valid_features, train_labels, valid_labels = train_test_split(features, labels, test_size = 0.15)
                # Train the model with early stopping on the validation set
                model.fit(train_features, train_labels, eval_metric = eval_metric,
                          eval_set = [(valid_features, valid_labels)],
                          early_stopping_rounds = 100, verbose = -1)
                # Clean up memory
                gc.enable()
                del train_features, train_labels, valid_features, valid_labels
                gc.collect()
            else:
                model.fit(features, labels)
            # Record the feature importances
            feature_importance_values += model.feature_importances_ / n_iterations
        feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
        # Sort features according to importance
        feature_importances = feature_importances.sort_values('importance', ascending = False).reset_index(drop = True)
        # Normalize the feature importances to add up to one
        feature_importances['normalized_importance'] = feature_importances['importance'] / feature_importances['importance'].sum()
        feature_importances['cumulative_importance'] = np.cumsum(feature_importances['normalized_importance'])
        select_df = feature_importances[feature_importances['cumulative_importance']<=p_importance]
        select_columns = select_df['feature']
        self.columns = list(select_columns.values)
        res = data[self.columns]
        return res
    def filter_select(self, data_x, data_y, k=None, p=50, method=f_classif):
        # Univariate (filter) selection: SelectKBest when k is given, otherwise SelectPercentile
        columns = data_x.columns
        if k is not None:
            model = SelectKBest(method, k=k)
            res = model.fit_transform(data_x,data_y)
            supports = model.get_support()
        else:
            model = SelectPercentile(method, percentile=p)
            res = model.fit_transform(data_x,data_y)
            supports = model.get_support()
        self.supports = supports
        self.columns = columns[supports]
        return res
    def wrapper_select(self,data_x,data_y,n,estimator):
        # Recursive feature elimination (wrapper) around the given estimator
        columns = data_x.columns
        model = RFE(estimator=estimator,n_features_to_select=n)
        res = model.fit_transform(data_x,data_y)
        supports = model.get_support()  #marks the positions of the selected features in the original data
        self.supports = supports
        self.columns = columns[supports]
        return res
    def embedded_select(self,data_x,data_y,estimator,threshold=None):
        '''
        threshold : string, float, optional, default None
            The threshold value to use for feature selection. Features whose importance is greater or
            equal are kept while the others are discarded. If "median" (resp. "mean"), then the
            threshold value is the median (resp. the mean) of the feature importances.
            A scaling factor (e.g., "1.25*mean") may also be used. If None and if the estimator
            has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso),
            the threshold used is 1e-5. Otherwise, "mean" is used by default.
        '''
        columns = data_x.columns
        model = SelectFromModel(estimator=estimator,prefit=False,threshold=threshold)
        res = model.fit_transform(data_x,data_y)
        supports = model.get_support()
        self.supports = supports
        self.columns = columns[supports]
        return res
f = FeatureSelection()
lgbm_res = f.identify_importance_lgbm(data_x,data_y)
print(f.columns)
print(lgbm_res)
estimator = LinearSVC()
res = f.wrapper_select(data_x=data_x,data_y=data_y,n=5,estimator=estimator)
print(f.columns)
est = LinearSVC(C=0.01,penalty='l1',dual=False)
est1 = RandomForestClassifier()
e_res = f.embedded_select(data_x=data_x,data_y=data_y,estimator=est1)
print(f.columns)
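The calls above exercise the LightGBM, wrapper, and embedded paths; for completeness, here is a sketch of how the remaining two methods could be called on the same data (correlation_threshold=0.9 and k=10 are illustrative values, not the author's settings, and decorr_x / filter_res are hypothetical variable names).
# Correlation-based elimination followed by univariate filter selection.
decorr_x = f.identify_collinear(data_x, correlation_threshold=0.9)
print(f.record_collinear.head())   # feature pairs that triggered a drop
filter_res = f.filter_select(data_x=decorr_x, data_y=data_y, k=10, method=f_classif)
print(f.columns)                   # the 10 features kept by the F-test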