Wednesday, January 17, 2018

Feature Selection in R

library(Boruta)
library(mice)
library(missForest)
library(caret)
library(randomForest)

Boruta

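The Boruta algorithm runs through the following steps:
1. First, it adds randomness to the given dataset by creating shuffled copies of all features (the shadow features).
2. Then it trains a random forest on the extended dataset and applies a feature importance measure (mean decrease in accuracy by default) to score each feature; a higher score means a more important feature.
3. At every iteration it checks whether a real feature has a higher importance than the best shadow feature (that is, whether it scores above the maximum shadow score), and it keeps removing features it considers highly unimportant.
4. Finally, the algorithm stops when every feature has been confirmed or rejected, or when it reaches the specified limit on random forest runs.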
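The shadow-feature comparison in the steps above can be illustrated with a small base R sketch. This is not the Boruta implementation: the toy data, the variable names, and the use of absolute correlation as the importance score are all assumptions made for illustration, standing in for the random forest importance that Boruta actually computes.

```r
# Toy illustration of the shadow-feature idea (not the Boruta package itself).
# Assumption: absolute correlation with the response stands in for the
# random forest importance score that Boruta really uses.
set.seed(42)
n <- 200
x <- data.frame(
  signal = rnorm(n),  # truly related to the response
  noise  = rnorm(n)   # unrelated to the response
)
y <- 2 * x$signal + rnorm(n)

# Step 1: shadow features are column-wise shuffled copies of the real features.
shadow <- as.data.frame(lapply(x, sample))
names(shadow) <- paste0("shadow_", names(x))
extended <- cbind(x, shadow)

# Step 2 (stand-in): score every column of the extended dataset.
importance <- sapply(extended, function(col) abs(cor(col, y)))

# Step 3: a real feature scores a "hit" when it beats the best shadow feature.
max_shadow <- max(importance[names(shadow)])
hits <- importance[names(x)] > max_shadow
print(hits)
```

Boruta repeats this comparison over many random forest runs and applies a statistical test to the accumulated hits, which a single pass like this one does not attempt.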
# na.strings expects a character vector; "NA" is the default
traindata <- read.csv("/home/xuelfiang/PycharmProjects/titanic/titanic.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")
str(traindata)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(traindata)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##

Impute the missing values with missForest; the variables must be factors or numeric

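# missForest only handles numeric and factor columns, so convert the character columns it will use
traindata$Sex <- factor(traindata$Sex)
traindata$Embarked <- factor(traindata$Embarked)
# Drop Name (4), Ticket (9) and Cabin (11), then keep the imputed data frame
traintest <- missForest(traindata[, -c(4, 9, 11)])$ximp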
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
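Before imputing, it can help to check which columns actually contain missing values and whether every column has a type the imputation accepts. The sketch below uses a made-up toy data frame (its values are not from the Titanic file). Treating empty strings as missing is an optional assumption, not something the original code does.

```r
# Toy data frame mimicking a few Titanic columns (values are made up).
toy <- data.frame(
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, NA, 26, 35),
  Embarked = c("S", "", "C", "S"),
  stringsAsFactors = FALSE
)

# Optional assumption: treat empty strings as missing so they can be imputed.
toy$Embarked[toy$Embarked == ""] <- NA

# Convert every character column to a factor.
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], factor)

# Count missing values per column before imputing.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Without the empty-string step, `factor()` keeps `""` as an ordinary level, so those rows would not be imputed.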

Run Boruta and inspect its results

# Survived is numeric here, so the underlying random forest runs in regression mode
boruta_train <- Boruta(Survived ~ . - PassengerId, data = traintest)
# 7 attributes are confirmed
print(boruta_train)
## Boruta performed 10 iterations in 4.891767 secs.
## 7 attributes confirmed important: Age, Embarked, Fare, Parch,
## Pclass and 2 more;
## No attributes deemed unimportant.
# Plot the Boruta variable importance
plot(boruta_train, xlab = "", xaxt = "n")
# Collect the finite importance values of each attribute across runs
lz <- lapply(1:ncol(boruta_train$ImpHistory), function(i)
  boruta_train$ImpHistory[is.finite(boruta_train$ImpHistory[, i]), i])
names(lz) <- colnames(boruta_train$ImpHistory)
# Order the axis labels by median importance, matching the order of the boxes
Labels <- sort(sapply(lz, median))
axis(side = 1, las = 2, labels = names(Labels),
     at = 1:ncol(boruta_train$ImpHistory), cex.axis = 0.7)

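Resolve the tentative attributes. A tentative attribute is classified as confirmed or rejected by comparing its median Z-score with the median Z-score of the best shadow attribute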

final_boruta <- TentativeRoughFix(boruta_train)
## Warning in TentativeRoughFix(boruta_train): There are no Tentative
## attributes! Returning original object.
print(final_boruta)
## Boruta performed 10 iterations in 4.891767 secs.
## 7 attributes confirmed important: Age, Embarked, Fare, Parch,
## Pclass and 2 more;
## No attributes deemed unimportant.
getSelectedAttributes(final_boruta, withTentative = FALSE)
## [1] "Pclass" "Sex" "Age" "SibSp" "Parch" "Fare"
## [7] "Embarked"
boruta_df <- attStats(final_boruta)
print(boruta_df)
## meanImp medianImp minImp maxImp normHits decision
## Pclass 33.91001 34.06550 32.33488 35.51682 1 Confirmed
## Sex 76.18385 76.74179 73.12429 78.15511 1 Confirmed
## Age 30.85620 30.93894 27.39081 33.15044 1 Confirmed
## SibSp 18.52499 18.54277 16.96003 20.98996 1 Confirmed
## Parch 12.46269 12.59703 11.08369 14.48570 1 Confirmed
## Fare 30.27967 30.28331 28.69235 31.52378 1 Confirmed
## Embarked 12.58664 11.97207 10.65647 16.09177 1 Confirmed
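Since this run produced no tentative attributes, the median comparison described earlier never took effect. A minimal base R sketch of that rule follows. The importance history matrix and the attribute names `attrA` and `attrB` are made up for illustration, and the sketch only mirrors the comparison described in the text, not the package's internal code.

```r
# Made-up importance history: one row per run, one column per attribute,
# plus the best shadow importance observed in each run.
imp_history <- cbind(
  attrA     = c(5.1, 4.8, 5.5, 5.0, 4.9),
  attrB     = c(1.2, 0.8, 1.9, 1.1, 1.4),
  shadowMax = c(2.0, 2.4, 1.8, 2.2, 2.1)
)
tentative <- c("attrA", "attrB")  # hypothetical tentative attributes

# Median importance of each tentative attribute and of the best shadow.
median_tentative  <- apply(imp_history[, tentative, drop = FALSE], 2, median)
median_shadow_max <- median(imp_history[, "shadowMax"])

# Confirm an attribute when its median beats the best shadow's median.
decision <- ifelse(median_tentative > median_shadow_max, "Confirmed", "Rejected")
print(decision)
```

Here `attrA` (median 5.0) is confirmed and `attrB` (median 1.2) is rejected against a shadow median of 2.1.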

A traditional feature selection algorithm: caret

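# 10-fold cross-validated recursive feature elimination with random forest
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
# Columns 3:9 are the 7 predictors and column 2 is Survived, which is numeric, hence the regression warning
rfe_train <- rfe(traintest[, 3:9], traintest[, 2], sizes = 1:7, rfeControl = control)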
## Warning in randomForest.default(x, y, importance = (first | last), ...):
## The response has five or fewer unique values. Are you sure you want to do
## regression?

rfe_train
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 0.4086 0.2986 0.3342 0.02685 0.09779 0.02150
## 2 0.3922 0.3559 0.3243 0.02485 0.09674 0.02183
## 3 0.3719 0.4297 0.3084 0.02650 0.11046 0.02196
## 4 0.3627 0.4612 0.2998 0.02685 0.11105 0.02197
## 5 0.3635 0.4686 0.3069 0.02395 0.10605 0.01948
## 6 0.3452 0.4911 0.2488 0.02927 0.10422 0.02462 *
## 7 0.3471 0.4862 0.2549 0.02846 0.10176 0.02225
##
## The top 5 variables (out of 6):
## Sex, Pclass, Age, Fare, SibSp
plot(rfe_train, type=c("g", "o"), cex = 1.0, col = 1:11)
predictors(rfe_train)
## [1] "Sex" "Pclass" "Age" "Fare" "SibSp" "Embarked"
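The asterisk in the table marks the subset size with the lowest resampled RMSE. The short sketch below reproduces that choice from the RMSE column printed above. It assumes the default "best metric" selection rule and does not call caret itself.

```r
# Resampled RMSE per subset size, copied from the rfe output above.
rmse <- c(`1` = 0.4086, `2` = 0.3922, `3` = 0.3719, `4` = 0.3627,
          `5` = 0.3635, `6` = 0.3452, `7` = 0.3471)

# Pick the subset size with the smallest RMSE.
best_size <- as.integer(names(rmse)[which.min(rmse)])
print(best_size)  # 6
```

The gap between six and seven variables (0.3452 versus 0.3471) is well within the reported RMSESD of roughly 0.03, so the preference for six variables is weak.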

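Summary


Compared with the traditional feature selection algorithm, Boruta gives a better picture of variable importance. In this example it confirmed all seven candidate predictors and reported importance statistics for each of them, whereas rfe selected six (all but Parch).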