In the Kaggle 'Instant Gratification' competition, a technique called Adversarial Validation came up, so I decided to look into it.

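Background¶

Adversarial Validation was devised to solve a problem that arises in data analysis competitions: how should you predict when the provided training data and test data have different distributions?

A Kaggle competition typically provides training data and test data. The training data includes the features and the target values, whereas the test data comes without target values. Participants train a model on the training data, predict the target values for the test data, and submit those predictions.

Only part of the submitted predictions is used for the public score. For example, if you predicted the target for 100,000 customers, 50,000 of them might be scored and the result recorded on the Kaggle Public Leaderboard (Public LB). Because that score uses only a subset, you cannot know how the submission will score on the data used for the final Private LB. This scoring scheme is what causes shake-up/shake-down, where the Private LB result ends up higher or lower than the Public LB suggested. To guard against it, it is important to estimate generalization performance locally, for example with cross-validation (Local CV).

However, in some competitions the training data and test data provided by Kaggle have very different distributions. In that case Local CV and the Public LB/Private LB disagree: for example, Local CV shows a high score but the submission scores low.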
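To make the mismatch concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the `sample` helper, the two Gaussian classes, the shift of 2.0 applied to the test set, and the use of scikit-learn's `LogisticRegression`. It only shows the general effect, not any particular competition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)


def sample(n, shift=0.0):
    """Two Gaussian classes separated along feature 0.

    `shift` moves every point along feature 0 (hypothetical covariate shift).
    """
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2))
    X[:, 0] += np.where(y == 1, 1.5, -1.5) + shift
    return X, y


X_train, y_train = sample(2000)
X_test, y_test = sample(2000, shift=2.0)  # test set drawn from a shifted distribution

model = LogisticRegression()
# Local CV only ever sees unshifted training rows
local_cv = cross_val_score(model, X_train, y_train, cv=5).mean()
# Accuracy on the shifted test set (its labels are known here only because the data is synthetic)
test_acc = model.fit(X_train, y_train).score(X_test, y_test)

print(f"Local CV accuracy:     {local_cv:.3f}")
print(f"Shifted test accuracy: {test_acc:.3f}")
```

Because cross-validation is run on the training rows alone, Local CV has no way to notice the shift; the drop shows up only on the shifted set.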

Adversarial Validation is a technique devised to address this problem.

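Adversarial Validation¶

To perform Adversarial Validation, first give the training and test data sets a new target that replaces the original one: for example, 0 for every training row and 1 for every test row.

Then merge and shuffle the training and test data, and build a model that predicts which set each row belongs to. If the two sets came from different distributions, the prediction will be easy. If they came from the same distribution, telling them apart will be difficult. This shows how different the training and test data are.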
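The procedure above can be sketched as a small helper. This is an illustrative version under assumptions: the function name `adversarial_auc` is made up here, scikit-learn's `RandomForestClassifier` stands in for whatever classifier you prefer, and ROC AUC is used as the score (about 0.5 means the two sets are hard to tell apart, close to 1.0 means they are easy to separate). The Gaussian inputs are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def adversarial_auc(X_train, X_test, seed=42):
    """Cross-validated ROC AUC of a classifier that tries to tell train rows from test rows."""
    X = np.concatenate([X_train, X_test], axis=0)
    # New target: 0 = row came from train, 1 = row came from test
    is_test = np.concatenate([
        np.zeros(len(X_train), dtype=int),
        np.ones(len(X_test), dtype=int),
    ])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # shuffle=True mixes train and test rows across the folds
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, is_test, cv=cv, scoring="roc_auc").mean()


rng = np.random.default_rng(0)
train_like = rng.normal(size=(1000, 2))
same_dist = rng.normal(size=(1000, 2))           # same distribution as train_like
shifted = rng.normal(loc=3.0, size=(1000, 2))    # mean shifted by 3 in both features

auc_same = adversarial_auc(train_like, same_dist)
auc_shifted = adversarial_auc(train_like, shifted)
print(f"AUC, same distribution:    {auc_same:.3f}")
print(f"AUC, shifted distribution: {auc_shifted:.3f}")
```

ROC AUC is chosen here because it does not depend on a classification threshold; accuracy, as used in the rest of this post, works just as well when train and test have the same number of rows.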

When the data come from the same distribution¶

# Load packages
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
from matplotlib import pyplot as plt
from lightgbm import LGBMClassifier

# make_classification parameters
args = { 'n_samples': 5000,
        'n_features': 2,
        'n_informative': 2,
        'n_redundant': 0,
        'n_repeated': 0,
        'n_classes': 2,
        'n_clusters_per_class': 1,
        'flip_y': 0,
        'class_sep': 1.5,
        'weights': [0.5, 0.5],
        'random_state': 42,
        }

# Generate 5000 samples with 2 features (independent variables) and a binary target
X, y = make_classification(**args)

# Randomly split the data in half
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                        shuffle=True, random_state=42)
print("X_train :" ,X_train.shape)
print("y_train :" ,y_train.shape)
print("X_test :" ,X_test.shape)
print("y_test :" ,y_test.shape)

# Labels that mark which split each row came from
y_train[:] = 0  # train rows are arbitrarily labelled 0
y_test[:] = 1   # test rows are arbitrarily labelled 1

print("-"*50)
# Concatenate the data
X_concat = np.concatenate([X_train, X_test], axis=0)
y_concat = np.concatenate([y_train, y_test], axis=0)

print("X_concat :" ,X_concat.shape)
print("y_concat :" ,y_concat.shape)

X_train : (2500, 2)
y_train : (2500,)
X_test : (2500, 2)
y_test : (2500,)
--------------------------------------------------
X_concat : (5000, 2)
y_concat : (5000,)

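Because both halves come from the same distribution, the classifier cannot tell 1 from 0.

In fact, the accuracy comes out slightly below 0.5, the level expected from random guessing with a 50/50 class ratio.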

# Classify with LGBMClassifier
clf = LGBMClassifier(n_estimators=100,random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Both halves come from the same distribution, so they are hard to separate
score = cross_validate(clf, X_concat, y_concat, cv=skf)

# Mean accuracy over 5-fold CV
print('Accuracy:', score['test_score'].mean())

Accuracy: 0.4838
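
As a rough yardstick for how far 0.4838 is from chance, one can compare the gap to the standard error of a fair coin flipped 5000 times. This is only an approximation, since cross-validated predictions are not independent coin flips.

```python
import math

n_rows = 5000        # rows in X_concat
chance = 0.5         # expected accuracy of random guessing with a 50/50 split
observed = 0.4838    # mean CV accuracy reported above

# Standard error of an accuracy estimate under pure guessing
std_err = math.sqrt(chance * (1 - chance) / n_rows)
gap_in_std_errs = (observed - chance) / std_err

print(f"standard error: {std_err:.4f}")
print(f"gap from chance: {gap_in_std_errs:.1f} standard errors")
```

In absolute terms the gap is under two percentage points, so the classifier is doing no better than guessing.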

Plotting¶

As the plot shows, the train and test points come from the same distribution, so they overlap and are hard to tell apart.

X, y = make_classification(**args)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                        shuffle=True, random_state=42)

#plot
plt.scatter(X_train[y_train == 0, 0],
                X_train[y_train == 0, 1],
                alpha=0.5,
                label='Train (0)')
plt.scatter(X_train[y_train == 1, 0],
                X_train[y_train == 1, 1],
                alpha=0.5,
                label='Train (1)')
plt.scatter(X_test[y_test == 0, 0],
                X_test[y_test == 0, 1],
                alpha=0.5,
                label='Test (0)')
plt.scatter(X_test[y_test == 1, 0],
                X_test[y_test == 1, 1],
                alpha=0.5,
                label='Test (1)')
plt.legend()
plt.show()

When the data come from different distributions¶

# make_classification parameters for the two sets; only the random seed differs
args1 = { 'n_samples': 2500,
        'n_features': 2,
        'n_informative': 2,
        'n_redundant': 0,
        'n_repeated': 0,
        'n_classes': 2,
        'n_clusters_per_class': 1,
        'flip_y': 0,
        'class_sep': 1.5,
        'weights': [0.5, 0.5],
        'random_state': 42,
        }

args2 = { 'n_samples': 2500,
        'n_features': 2,
        'n_informative': 2,
        'n_redundant': 0,
        'n_repeated': 0,
        'n_classes': 2,
        'n_clusters_per_class': 1,
        'flip_y': 0,
        'class_sep': 1.5,
        'weights': [0.5, 0.5],
        'random_state': 12,
        }

# Generate two sets of 2500 samples each (2 features, binary target) from different seeds
X_train, y_train = make_classification(**args1)
X_test, y_test = make_classification(**args2)

print("X_train :" ,X_train.shape)
print("y_train :" ,y_train.shape)
print("X_test :" ,X_test.shape)
print("y_test :" ,y_test.shape)

# Labels that mark which set each row came from
y_train[:] = 0  # train rows are arbitrarily labelled 0
y_test[:] = 1   # test rows are arbitrarily labelled 1

print("-"*50)
# Concatenate the data
X_concat = np.concatenate([X_train, X_test], axis=0)
y_concat = np.concatenate([y_train, y_test], axis=0)

print("X_concat :" ,X_concat.shape)
print("y_concat :" ,y_concat.shape)

# Classify with LGBMClassifier
clf = LGBMClassifier(n_estimators=100,random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The two sets come from different distributions, so they are easy to separate
score = cross_validate(clf, X_concat, y_concat, cv=skf)

# Mean accuracy over 5-fold CV
print('Accuracy:', score['test_score'].mean())

X_train : (2500, 2)
y_train : (2500,)
X_test : (2500, 2)
y_test : (2500,)
--------------------------------------------------
X_concat : (5000, 2)
y_concat : (5000,)
Accuracy: 0.9780000000000001

When the distributions differ, the accuracy is very high, as shown above.
The plot also shows clearly different distributions.

X_train, y_train = make_classification(**args1)
X_test, y_test = make_classification(**args2)

#plot
plt.scatter(X_train[y_train == 0, 0],
                X_train[y_train == 0, 1],
                alpha=0.5,
                label='Train (0)')
plt.scatter(X_train[y_train == 1, 0],
                X_train[y_train == 1, 1],
                alpha=0.5,
                label='Train (1)')
plt.scatter(X_test[y_test == 0, 0],
                X_test[y_test == 0, 1],
                alpha=0.5,
                label='Test (0)')
plt.scatter(X_test[y_test == 1, 0],
                X_test[y_test == 1, 1],
                alpha=0.5,
                label='Test (1)')
plt.legend()
plt.show()

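Conclusion¶

Using data from the same distribution and data from different distributions, we saw how to check whether the training and test distributions differ.

If the accuracy is not high (close to 0.5), we can assume the two sets were drawn from the same distribution. If the accuracy is high, the training and test distributions differ, and a validation set taken from the training data will not give a reliable estimate of test performance.

In that case, how should we build a validation set suited to the model?

Let's try that in part 2.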


make_classification¶

make_classification is a scikit-learn function (in sklearn.datasets) that generates synthetic data for classification models.

Let's look at its parameters. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html

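Parameters¶

make_classification

n_samples : number of samples (default=100)
n_features : total number of features (independent variables) (default=20)
n_informative : number of features that are informative about the target (default=2)
n_redundant : number of features that are linear combinations of the informative features (default=2)
n_repeated : number of features duplicated from the informative and redundant features (default=0)
n_classes : number of classes in the target (default=2)
n_clusters_per_class : number of clusters per class (default=2)
weights : proportion of samples assigned to each class (default=None)
flip_y : fraction of samples whose class is assigned at random; this adds label noise and makes classification harder (default=0.01)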
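
A quick way to check what these parameters do is to inspect the shapes and class counts of the generated arrays. The sketch below uses arbitrary example values (1000 samples, 4 features, `random_state=0`); they are not taken from this post.

```python
import numpy as np
from sklearn.datasets import make_classification

# All defaults: 100 samples and 20 features
X_default, y_default = make_classification(random_state=0)
print(X_default.shape, y_default.shape)

# 4 features = 2 informative + 1 redundant + 1 repeated (no pure-noise features left).
# weights=[0.9, 0.1] asks for a 90/10 class split; flip_y=0 turns label noise off
# so the class counts reflect the weights directly.
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=4,
    n_informative=2,
    n_redundant=1,
    n_repeated=1,
    weights=[0.9, 0.1],
    flip_y=0,
    random_state=0,
)
class_counts = np.bincount(y_imb)
print(X_imb.shape, class_counts)
```

`np.bincount` is used here simply because the labels are small non-negative integers; any counting method would do.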

Hands-on¶

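# Load packages
from sklearn.datasets import make_classification
from matplotlib import rc
from matplotlib import pyplot as plt
%matplotlib inline
# Optional: a font with Korean glyphs (NanumGothic must be installed)
rc('font', family='NanumGothic')

plt.title("Synthetic data with one feature")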
X, y = make_classification(n_features=1, 
                           n_informative=1,
                           n_redundant=0, 
                           n_clusters_per_class=1, 
                           random_state=42)
plt.scatter(X, y, marker='v', c=y,
            s=200, edgecolor="k",)

plt.xlabel("X")
plt.ylabel("y")
plt.show()
print('X shape =' ,X.shape)
print('y shape =' ,y.shape)

X shape = (100, 1)
y shape = (100,)

plt.title("Synthetic data with two features (one informative)")
X, y = make_classification(n_features=2, 
                           n_informative=1,
                           n_redundant=0, 
                           n_clusters_per_class=1, 
                           random_state=42)
plt.scatter(X[:, 0], X[:, 1], marker='v', c=y,
            s=200, edgecolor="k",)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")
plt.show()
print('X shape =' ,X.shape)
print('y shape =' ,y.shape)

X shape = (100, 2)
y shape = (100,)

plt.title("Synthetic data with two features (both informative)")
X, y = make_classification(n_features=2, 
                           n_informative=2,
                           n_redundant=0, 
                           n_clusters_per_class=1, 
                           random_state=42)
plt.scatter(X[:, 0], X[:, 1], marker='v', c=y,
            s=200, edgecolor="k",)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")
plt.show()
print('X shape =' ,X.shape)
print('y shape =' ,y.shape)

X shape = (100, 2)
y shape = (100,)

plt.title("Imbalanced data with two features")
# weights can be used to create imbalanced classes
X, y = make_classification(n_features=2, 
                           n_informative=2,
                           n_redundant=0, 
                           weights=[0.9, 0.1],
                           n_clusters_per_class=1, 
                           random_state=42)
plt.scatter(X[:, 0], X[:, 1], marker='v', c=y,
            s=200, edgecolor="k",)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")
plt.show()
print('X shape =' ,X.shape)
print('y shape =' ,y.shape)

X shape = (100, 2)
y shape = (100,)

plt.title("Data with added noise")
# Add label noise with the flip_y parameter
X, y = make_classification(n_features=2, 
                           n_informative=2,
                           n_redundant=0, 
                           flip_y=0.1,
                           n_clusters_per_class=1, 
                           random_state=42)
plt.scatter(X[:, 0], X[:, 1], marker='v', c=y,
            s=200, edgecolor="k",)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")
plt.show()
print('X shape =' ,X.shape)
print('y shape =' ,y.shape)

X shape = (100, 2)
y shape = (100,)
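
To see how much label noise `flip_y` introduces, one option is to generate the labels twice, with and without noise, and count how many differ. The sketch below assumes `shuffle=False` so that both calls lay out the labels in the same order before any flipping; the sample size of 10000 is an arbitrary choice. Since the affected samples get a randomly drawn class, roughly half of them land on their original label again in the binary case, so the share of changed labels should come out near `flip_y / 2`.

```python
import numpy as np
from sklearn.datasets import make_classification

# Settings shared by both calls; shuffle=False keeps the row order fixed
common = dict(
    n_samples=10000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    shuffle=False,
    random_state=42,
)

_, y_clean = make_classification(flip_y=0.0, **common)  # no label noise
_, y_noisy = make_classification(flip_y=0.1, **common)  # 10% of labels re-drawn at random

# Fraction of rows whose label actually changed
changed = np.mean(y_clean != y_noisy)
print(f"share of changed labels: {changed:.3f}")
```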

plt.title("Data with two clusters per class")
# n_clusters_per_class=2 gives each class two clusters
X, y = make_classification(n_features=2, 
                           n_informative=2,
                           n_redundant=0, 
                           n_clusters_per_class=2, 
                           random_state=42)
plt.scatter(X[:, 0], X[:, 1], marker='v', c=y,
            s=200, edgecolor="k",)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")
plt.show()
print('X shape =' ,X.shape)
print('y shape =' ,y.shape)

X shape = (100, 2)
y shape = (100,)

plt.title("Multiple classes")
# n_classes=3 produces three classes
X, y = make_classification(n_features=2, 
                           n_informative=2,
                           n_redundant=0, 
                           n_clusters_per_class=1, 
                           random_state=42,
                           n_classes=3
                          )
plt.scatter(X[:, 0], X[:, 1], marker='v', c=y,
            s=200, edgecolor="k",)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")
plt.show()
print('X shape =' ,X.shape)
print('y shape =' ,y.shape)

X shape = (100, 2)
y shape = (100,)

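Common errors¶

When using make_classification, you may run into the following error:

n_classes * n_clusters_per_class must be smaller or equal 2 ** n_informative

The error is raised when the number of classes times the number of clusters per class exceeds 2 raised to the number of informative features (n_informative).

Here 3 * 2 = 6 > 2^2 = 4, so the error occurs.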

plt.title("Multiple classes")
# n_classes=3 with n_clusters_per_class=2 violates the constraint when n_informative=2
X, y = make_classification(n_features=2, 
                           n_informative=2,
                           n_redundant=0, 
                           n_clusters_per_class=2, 
                           random_state=42,
                           n_classes=3
                          )
plt.scatter(X[:, 0], X[:, 1], marker='v', c=y,
            s=200, edgecolor="k",)

plt.xlabel("$X_1$")
plt.ylabel("$X_2$")
plt.show()
print('X shape =' ,X.shape)
print('y shape =' ,y.shape)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-123-3df2ee10eaa0> in <module>
      6                            n_clusters_per_class=2,
      7                            random_state=42,
----> 8                            n_classes=3
      9                           )
     10 plt.scatter(X[:, 0], X[:, 1], marker='v', c=y,

~\Anaconda3\lib\site-packages\sklearn\datasets\samples_generator.py in make_classification(n_samples, n_features, n_informative, n_redundant, n_repeated, n_classes, n_clusters_per_class, weights, flip_y, class_sep, hypercube, shift, scale, shuffle, random_state)
    163     # Use log2 to avoid overflow errors
    164     if n_informative < np.log2(n_classes * n_clusters_per_class):
--> 165         raise ValueError("n_classes * n_clusters_per_class must"
    166                          " be smaller or equal 2 ** n_informative")
    167     if weights and len(weights) not in [n_classes, n_classes - 1]:

ValueError: n_classes * n_clusters_per_class must be smaller or equal 2 ** n_informative
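
One way to avoid the error is to compute the smallest `n_informative` that satisfies the constraint before calling `make_classification`. The helper names below are hypothetical and not part of scikit-learn; this is a sketch of the arithmetic only.

```python
def satisfies_constraint(n_classes, n_clusters_per_class, n_informative):
    """True if n_classes * n_clusters_per_class <= 2 ** n_informative."""
    return n_classes * n_clusters_per_class <= 2 ** n_informative


def min_n_informative(n_classes, n_clusters_per_class):
    """Smallest n_informative (at least 1) for which the constraint holds."""
    total_clusters = n_classes * n_clusters_per_class
    # (k - 1).bit_length() equals ceil(log2(k)) for integers k >= 1
    return max(1, (total_clusters - 1).bit_length())


# The failing call above: 3 classes * 2 clusters = 6 > 2 ** 2 = 4
print(satisfies_constraint(3, 2, 2))  # False
print(min_n_informative(3, 2))        # 3, since 2 ** 3 = 8 >= 6
```

For the failing call, raising `n_informative` to 3 (which also requires `n_features` of at least 3) should satisfy the check; lowering `n_clusters_per_class` to 1 is the other option, as in the earlier multi-class example.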


[kaggle] Adversarial validation part1 (0)	2019.06.11
make_classification (creating data) (0)	2019.06.11
Taegu