kaggle Top8% (681th of 8802) 🥉

2019. 10. 17. 15:13

Summary of Santander-Customer-Transaction-Prediction

kaggle Top8% (681th of 8802) 🥉

Smote+XGboost [Public Score= 0.75161]

XGboost [Public Score = 0.88411]

XGboost_tuning [Public Score = 0.89692]

max_depth(최대 트리깊이)= 3 -> 2, colsample_bytree(트리마다 변수선택비율)=1 -> 0.3, learning_rate(학습률)=0.05 ->0.02

Lgboost [Public Score = 0.89766]

Ensemble Models(XGboost + Lgboost) [Public Score = 0.90043]

Lgboost_oof [Public Score = 0.90043]

Lgboost_oof_augment [Public Score = 0.90060]

Lgboost_oof_frequency [Public Score = 0.90119]

1% solution[private = 0.92159]

Sum, std, mean, min, max, skew, kurtosis, etc for row
Augment
Random over sampling ,SMOTE over samplint
Round1, Round2, Round3
Frequency encoding – solutio이었지만 fake test set을 발견하지 못 함
var_0 – (var_1, var_2, … var_199), var_1 – (var_0, var_2, … var_199)을 반복하며 time series 변수 찾아봄
Binning 을 이용해 catecory 적용
Eda를 통해 useless해 보이는 컬럼제거
Feature importance top5변수의 상호작용(+, x, etc) 컬럼
PCA
Clustering
NN model ensamble

OOF(Out Of Fold)
Agument

It means that there is no interaction between variables (the variables are completely independent). Here are some examples. If the variables did have interactions, then you might need var_0=5 AND var_1=2.5 AND var_2=15 for target=1.If only one of those 3 occurred you would have target=0 but if all three occurred, you would have target=1. That's interaction. If there was interaction, you cannot train on shuffled columns because you would destroy the interactions that are present in the training data and your model could not learn them. But with this Santander data, there are no interactions, so you can shuffle columns and make new data.
Fake data

Removing fake samples is the key of this competition
Frequency Encoding

Frequency Encoding works even if continuous variable
how did adding these 200 new/duplicated features improve the score in a tree model (LightGBM)?

It turns out LGBM benefits from helping it find interactions too.Instead of waiting for a decision tree to send all the observations with count=1 to the left and count>1to the right. You just give it a new feature with all the count=1 values removed (converted to NAN). Instead of waiting for the decision tree to send count <= 2 to the left and count>2 to the right. You just give it a new feature with all the count<=2 values removed. (Additionally adding new columns like var_x times var_x_FE also helps LGBM find interactions)

Compare optimizer of efficientNet (2)	2019.11.06
Frequency Encoding이란? (0)	2019.10.17
kaggle Top6% (95th of 1836)🥉 (0)	2019.10.17
[kaggle] Adversarial validation part1 (0)	2019.06.11
make_classification(데이터 만들기) (0)	2019.06.11

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`