๋ฐ˜์‘ํ˜•

Summary of Santander-Customer-Transaction-Prediction

 

Kaggle Top 8% (681st of 8802) 🥉


useful

Smote+XGboost [Public Score = 0.75161]

  • The positive class makes up only about 10% of the target, so SMOTE oversampling was applied before XGBoost modeling (a minimal sketch follows below).

XGboost [Public Score = 0.88411]

  • Smote ์„ฑ๋Šฅ์ด ๋‚ฎ์•„ Xgboost ๋ชจ๋ธ๋ง๋งŒ ์‹œ๋„

XGboost_tuning [Public Score = 0.89692]

  • max_depth (maximum tree depth): 3 -> 2, colsample_bytree (fraction of features sampled per tree): 1 -> 0.3, learning_rate: 0.05 -> 0.02 (see the sketch below).
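In code, the change amounts to the following (only the three tuned values come from the bullet above; everything else is illustrative):

```python
from xgboost import XGBClassifier

# Before tuning
base = XGBClassifier(max_depth=3, colsample_bytree=1.0, learning_rate=0.05)

# After tuning: shallower trees, stronger per-tree feature subsampling, and a
# slower learning rate (which usually calls for more boosting rounds).
tuned = XGBClassifier(max_depth=2, colsample_bytree=0.3, learning_rate=0.02,
                      n_estimators=5000)
```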

Lgboost [Public Score = 0.89766]

  • LightGBM is much faster than XGBoost, which makes parameter tuning far easier (a minimal sketch follows below).
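A minimal LightGBM counterpart, reusing the X_tr/X_val split from the first sketch; all parameter values are assumptions, not the notebook's exact settings:

```python
from lightgbm import LGBMClassifier

# Shallow, heavily subsampled trees mirror the XGBoost tuning above.
lgb_model = LGBMClassifier(num_leaves=4, learning_rate=0.02,
                           colsample_bytree=0.3, n_estimators=10000)
lgb_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], eval_metric="auc")
```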

Ensemble Models (XGboost + Lgboost) [Public Score = 0.90043]

  • Applied StandardScaler to the nonlinear NuSVC model (a blend sketch follows below).
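One plausible reading of this step, sketched below: scale-then-fit NuSVC (SVMs are scale-sensitive, unlike tree models) and blend predicted probabilities by simple averaging. The 0.5/0.5 weights and the exact model mix are guesses, not the original recipe:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC

# SVMs need standardized inputs; XGBoost/LightGBM do not.
svc = make_pipeline(StandardScaler(), NuSVC(probability=True))
svc.fit(X_tr, y_tr)

# Simple probability blend of the two boosted models from the sketches above.
xgb_pred = model.predict_proba(X_val)[:, 1]
lgb_pred = lgb_model.predict_proba(X_val)[:, 1]
blend = 0.5 * xgb_pred + 0.5 * lgb_pred
```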

Lgboost_oof [Public Score = 0.90043]

  • Out-of-fold training raised the Public Score by 0.003 (sketch below).
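A sketch of the OOF setup, assuming X and y from the first sketch plus an X_test frame of the test features; fold count and parameters are assumptions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from lightgbm import LGBMClassifier

oof = np.zeros(len(X))           # out-of-fold prediction for every train row
test_pred = np.zeros(len(X_test))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for tr_idx, val_idx in skf.split(X, y):
    clf = LGBMClassifier(num_leaves=4, learning_rate=0.02, n_estimators=2000)
    clf.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    # Each row is predicted only by the model that never saw it.
    oof[val_idx] = clf.predict_proba(X.iloc[val_idx])[:, 1]
    test_pred += clf.predict_proba(X_test)[:, 1] / skf.n_splits

print("OOF CV AUC:", roc_auc_score(y, oof))
```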

Lgboost_oof_augment [Public Score = 0.90060]

  • Augment(๋ฐ์ดํ„ฐ ์ฆ์‹)์„ ํ†ตํ•ด Public Score 0.00017, cv score 0.00061 ์ƒ์Šน

Lgboost_oof_frequency [Public Score = 0.90119]

  • Adding frequency features raised the CV score by about 0.005, but the Public Score improved only marginally (sketch below).
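A sketch of the frequency features, assuming a `test` DataFrame alongside `train`. Counting over train and test together is one common variant (and what the 1% solution below does), not necessarily what this notebook did:

```python
import pandas as pd

feature_cols = [f"var_{i}" for i in range(200)]
full = pd.concat([train[feature_cols], test[feature_cols]])

for col in feature_cols:
    counts = full[col].value_counts()        # how often each exact value occurs
    train[f"{col}_FE"] = train[col].map(counts)
    test[f"{col}_FE"] = test[col].map(counts)
```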

1% solution [Private Score = 0.92159]

  • Remove fake samples from the test set (detection sketch after this list)
  • Concatenate train and test
  • Frequency encoding
  • Train 200 models and predict (LGB)
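A sketch of the fake-sample heuristic made public during the competition: a test row is treated as real if at least one of its 200 values is unique within its column over the whole test set; the synthetic rows have no unique values. This reuses `test` and `feature_cols` from the frequency sketch, and is a reconstruction, not the exact winning code:

```python
import numpy as np

vals = test[feature_cols].values             # (n_test, 200) float matrix
has_unique = np.zeros(len(vals), dtype=bool)

for j in range(vals.shape[1]):
    uvals, counts = np.unique(vals[:, j], return_counts=True)
    unique_vals = uvals[counts == 1]         # values appearing exactly once in this column
    has_unique |= np.isin(vals[:, j], unique_vals)

real_test = test[has_unique]                 # use only these rows for frequency counts
```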

try

  • Sum, std, mean, min, max, skew, kurtosis, etc. per row
  • Augment
  • Random oversampling, SMOTE oversampling
  • Round1, Round2, Round3
  • Frequency encoding – it was the winning idea, but we failed to discover the fake test samples
  • Looked for time-series structure by iterating over var_0 – (var_1, var_2, … var_199), var_1 – (var_0, var_2, … var_199), and so on
  • Applied categories using binning
  • Removed columns that looked useless based on EDA
  • Interaction columns (+, ×, etc.) of the top-5 features by feature importance
  • PCA
  • Clustering
  • NN model ensemble

Learning

  • OOF (Out Of Fold)

  • Augment

    It means that there is no interaction between variables (the variables are completely independent). Here is an example: if the variables did have interactions, you might need var_0=5 AND var_1=2.5 AND var_2=15 for target=1. If only one of those three occurred you would have target=0, but if all three occurred you would have target=1. That's interaction. If there were interactions, you could not train on shuffled columns, because you would destroy the interactions present in the training data and your model could not learn them. But with this Santander data there are no interactions, so you can shuffle columns and make new data (sketch below).
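A sketch of that shuffle augmentation (function name and copy counts are my own): shuffle each column independently within the rows of a single target class, which preserves every per-feature distribution while manufacturing new rows of that class:

```python
import numpy as np

def augment_class(X_cls, n_copies=1, seed=42):
    """X_cls: ndarray of rows that all share one target value, shape (n, 200)."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        Xc = X_cls.copy()
        for j in range(Xc.shape[1]):
            # Permute this column on its own: marginals survive, row identity doesn't.
            Xc[:, j] = Xc[rng.permutation(len(Xc)), j]
        copies.append(Xc)
    return np.vstack(copies)

# Oversample the rare positive class more aggressively than the negatives.
X_pos_aug = augment_class(X[y == 1].values, n_copies=2)
```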

  • Fake data

    Removing fake samples was the key to this competition.

  • Frequency Encoding

    Frequency encoding works even on continuous variables.

  • How did adding these 200 new/duplicated features improve the score in a tree model (LightGBM)?

    It turns out LGBM benefits from help finding interactions too. Instead of waiting for a decision tree to send all the observations with count=1 to the left and count>1 to the right, you just give it a new feature with all the count=1 values removed (converted to NaN). Instead of waiting for the decision tree to send count<=2 to the left and count>2 to the right, you just give it a new feature with all the count<=2 values removed. (Additionally, adding new columns like var_x times var_x_FE also helps LGBM find interactions; sketch below.)
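A sketch of those "pre-split" features, reusing `feature_cols` and the `_FE` frequency columns from the earlier sketch; the thresholds (1 and 2) come straight from the paragraph above, the column names are my own:

```python
import numpy as np

for col in feature_cols:
    fe = train[f"{col}_FE"]
    # Keep the raw value only where its frequency clears the threshold; NaN otherwise.
    train[f"{col}_not_unique"] = train[col].where(fe > 1, np.nan)
    train[f"{col}_not_rare"] = train[col].where(fe > 2, np.nan)
    # var_x * var_x_FE style interaction column, also mentioned above.
    train[f"{col}_times_FE"] = train[col] * fe
```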

๋ฐ˜์‘ํ˜•

'competition' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

Compare optimizer of efficientNet  (2) 2019.11.06
Frequency Encoding์ด๋ž€?  (0) 2019.10.17
kaggle Top6% (95th of 1836)๐Ÿฅ‰  (0) 2019.10.17
[kaggle] Adversarial validation part1  (0) 2019.06.11
make_classification(๋ฐ์ดํ„ฐ ๋งŒ๋“ค๊ธฐ)  (0) 2019.06.11

+ Recent posts