Digit recognizer

About the presenter
Chul Ju Hong (홍철주)
• http://blog.fegs.kr
• https://github.com/FeGs
• Machine Learning Newbie
• SW Maestro 5th

Contents
• Problem introduction
• k-NN, …
• Dimensionality Reduction
• k-NN again, …
• Neural Networks

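Problem introduction
Classifying handwritten digits
Data: images of handwritten digits (28px * 28px, grayscale)
Target variable: digit (0, 1, 2, …, 9)
[Sample digit images: 0, 1, 2, 3, 4, 5]
Training data: 42,000 samples. Test data: 28,000 samples.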

Example: MNIST (kaggle)
[…, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
23, 210, 254, 253, 159, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, …]
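Think of a 784-dimensional space:
Method 1: k-NN. Which labels do nearby points in the 784-dimensional space have?
Method 2: SVM. What if we split the 784-dimensional space with a 783-dimensional hyperplane?
Method 3: Random Forest, etc. What if there is a white pixel at this position but not at that one?
Do you remember these?
Which method is best?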

Feature?
[…, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
23, 210, 254, 253, 159, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, …]
Use the image's pixel values as features
feature0 feature1 feature2 … feature783
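As a rough illustration of how a 28 * 28 image becomes feature0 … feature783, here is a minimal NumPy sketch. The image contents are made up, and row-major pixel order is an assumption, not something the slide states.

```python
import numpy as np

# Hypothetical 28x28 grayscale image: all zeros except a short bright stroke.
image = np.zeros((28, 28), dtype=np.uint8)
image[10, 13:18] = [23, 210, 254, 253, 159]

# Row-major flattening turns the image into one 784-dimensional feature vector.
features = image.reshape(-1)

# Under row-major order, pixel (row, col) becomes feature index row * 28 + col.
row, col = 10, 14
feature_index = row * 28 + col
```

Each training sample is then one such vector, so the whole training set is a matrix with 784 columns.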

k-NN: which labels do nearby points in the 784-dimensional space have?
sklearn.neighbors.KNeighborsClassifier(n_neighbors=10, weights='distance',
algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None)
Accuracy = 96.65% (10-fold CV)
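To make the settings above concrete (n_neighbors=10, weights='distance', p=2), here is a plain NumPy sketch of a distance-weighted k-NN vote. It is not scikit-learn's implementation; the function name knn_predict and the toy data are made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=10, n_classes=10):
    """Distance-weighted k-NN vote for a single query vector x."""
    # Euclidean distance (Minkowski with p=2) from x to every training point.
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # If x coincides with training points, only those exact matches vote.
    exact = nearest[dists[nearest] == 0]
    if exact.size > 0:
        return int(np.bincount(y_train[exact], minlength=n_classes).argmax())
    # Otherwise each of the k neighbours votes with weight 1 / distance.
    votes = np.bincount(y_train[nearest], weights=1.0 / dists[nearest],
                        minlength=n_classes)
    return int(votes.argmax())

# Toy data (made up): two clusters in 784 dimensions standing in for two digits.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (20, 784)),
                     rng.normal(3.0, 1.0, (20, 784))])
y_train = np.array([0] * 20 + [1] * 20)

query = np.full(784, 3.0)  # sits at the centre of the second cluster
predicted = knn_predict(X_train, y_train, query)
```

Every prediction scans the whole training set, which is one reason k-NN gets slow in 784 dimensions with 42,000 samples.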

RF: what if there is a white pixel at this position but not at that one? (784 features)
sklearn.ensemble.RandomForestClassifier(n_estimators=100, criterion='gini',
max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='auto',
max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None,
verbose=0, min_density=None, compute_importances=None)
Accuracy = 90.28% (10-fold CV)
Takes a long time.
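A minimal sketch of how a 10-fold CV score like the one above might be computed with a current scikit-learn. Several things here are assumptions: the data is scikit-learn's bundled 8 * 8 digits rather than the Kaggle set, n_estimators is lowered to keep the run short, and min_density, compute_importances and max_features='auto' from the call above are omitted because, as far as I know, recent releases no longer accept them.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data: scikit-learn's bundled 8x8 digits, not the Kaggle 28x28 set.
X, y = load_digits(return_X_y=True)

# Fewer trees than the slide's 100 to keep the sketch quick; fixed seed for repeatability.
clf = RandomForestClassifier(n_estimators=50, criterion="gini", random_state=0)

# For a classifier, cv=10 runs stratified 10-fold cross-validation.
scores = cross_val_score(clf, X, y, cv=10)
mean_accuracy = scores.mean()
```

The accuracy on this stand-in data says nothing about the 90.28% reported above; the sketch only shows the evaluation procedure.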

We want to keep the accuracy
and also cut the running time

PCA Principal component analysis
To explain it again:
v = a1e1 + a2e2 + … + anen (n = dim(v))
v can be represented as t = [a1, a2, …, an], dim(t) = n
But if we use w1, w2, …, wm (m < n) instead of e1, e2, …, en, it turns out that
v ~= b1w1 + b2w2 + … + bmwm
so v can also be represented as u = [b1, b2, …, bm], dim(u) = m

PCA Principal component analysis
How are w1, w2, …, wm defined?
How to actually compute them -> http://www.stat.cmu.edu/~cshalizi/350/lectures/10/lecture-10.pdf
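One common way to obtain w1, …, wm is to take the directions of largest variance, i.e. the leading eigenvectors of the data covariance matrix. The sketch below gets them from an SVD of the centred data. The helper names pca_fit and pca_transform and the toy data are made up. Because PCA centres the data first, the approximation is really for v minus the mean.

```python
import numpy as np

def pca_fit(X, m):
    """Return the data mean and the top-m principal directions (as rows)."""
    mean = X.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:m]

def pca_transform(X, mean, W):
    """Coefficients u = [b1, ..., bm] of each centred sample."""
    return (X - mean) @ W.T

rng = np.random.default_rng(0)
# Made-up data: 200 samples in 5 dimensions that mostly vary along 2 directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

mean, W = pca_fit(X, m=2)
U = pca_transform(X, mean, W)          # shape (200, 2)
X_approx = mean + U @ W                # v ~= mean + b1 w1 + ... + bm wm
max_error = np.abs(X - X_approx).max()
```

For the digit data the same two functions would be called with 784-dimensional rows and, for example, m = 10.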

PCA + k-NN
What if we reduce the dimensionality and then train?
[Figure: digit images 5, 3, 8, 9]
Reduced to 10 dimensions: 92.6% (10-fold CV)
sklearn.neighbors.KNeighborsClassifier(n_neighbors=10, weights='distance',
algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None)
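A sketch of how PCA and k-NN might be chained so that the projection is re-fitted inside each CV fold. This is one way to set it up with scikit-learn, not necessarily how the numbers above were produced, and the bundled 8 * 8 digits are again only stand-in data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Stand-in data: 64 features per sample instead of 784.
X, y = load_digits(return_X_y=True)

# PCA to 10 dimensions, then the same k-NN settings as on the slide.
model = make_pipeline(
    PCA(n_components=10),
    KNeighborsClassifier(n_neighbors=10, weights="distance"),
)

# The pipeline is fitted from scratch on the training part of every fold.
scores = cross_val_score(model, X, y, cv=10)
mean_accuracy = scores.mean()

# Fit once on all data to inspect the reduced dimensionality.
model.fit(X, y)
reduced_dim = model.named_steps["pca"].n_components_
```

Putting PCA inside the pipeline keeps the held-out fold from influencing the principal directions.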

PCA + RF
[Figure: digit images 5, 3, 8, 9]
What if we reduce the dimensionality and then train?
10-fold CV
sklearn.ensemble.RandomForestClassifier(n_estimators=200, criterion='gini',
max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='auto',
max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None,
verbose=0, min_density=None, compute_importances=None)

Deep Learning
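• A brief history:
• Artificial neural network research, with the error backpropagation algorithm, …
• But training took a long time, and there was overfitting, …
• Then hardware improved, and big data helped with this?
• The local minima issue does not matter much in high-dimensional non-convex optimization..
• Skipping RBM and DBN; only NNs are covered here..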
http://www.slideshare.net/secondmath/deep-learning-by-jskim

Logistic function
Sigmoid function
Softmax function: a generalization of the logistic function
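A small NumPy sketch of both functions, showing in what sense softmax generalizes the logistic (sigmoid) function: with two classes and the second score fixed at 0, the first softmax output equals sigmoid(z). The input values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Maps a score vector to probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = 1.7
# Two-class softmax over scores [z, 0] reduces to the sigmoid of z.
two_class = softmax(np.array([z, 0.0]))

# Three-class example: the largest score gets the largest probability.
probs = softmax(np.array([2.0, 1.0, 0.1]))
```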

Gradient Descent: in search of a local optimum
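A minimal sketch of gradient descent on a made-up one-dimensional non-convex function, f(x) = x^4 - 3x^2 + x. The function, learning rate and step count are arbitrary choices for illustration. Starting on different sides of the hump leads to different local minima, which is why the slide says "local".

```python
def gradient_descent(grad, x0, learning_rate=0.01, steps=2000):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

def f(x):
    return x ** 4 - 3 * x ** 2 + x

def f_grad(x):
    # Derivative of f: 4x^3 - 6x + 1
    return 4 * x ** 3 - 6 * x + 1

# Two starting points end up in two different local minima.
x_right = gradient_descent(f_grad, x0=2.0)
x_left = gradient_descent(f_grad, x0=-2.0)
```

Here the minimum reached from the left start has the lower value of f, so the run from the right start stops at a local optimum that is not the global one.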

BP Algorithm: adjusting the weights and biases
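A sketch of one backpropagation step for a tiny network (2 inputs, 3 tanh hidden units, 2 softmax outputs) with cross-entropy loss. The sizes, the learning rate of 0.1 and all function names are assumptions for illustration; the slide does not specify a loss or an update rule.

```python
import numpy as np

def forward(params, x):
    """Hidden activations h and output probabilities p."""
    W0, b0, W1, b1 = params
    h = np.tanh(W0 @ x + b0)
    z = W1 @ h + b1
    e = np.exp(z - z.max())
    return h, e / e.sum()

def loss(params, x, target):
    """Cross-entropy of the predicted probabilities against the true class."""
    _, p = forward(params, x)
    return -np.log(p[target])

def backprop(params, x, target):
    """Gradients of the loss with respect to W0, b0, W1, b1."""
    W0, b0, W1, b1 = params
    h, p = forward(params, x)
    dz = p.copy()
    dz[target] -= 1.0             # softmax + cross-entropy: dL/dz = p - onehot
    dW1, db1 = np.outer(dz, h), dz
    dh = W1.T @ dz                # error propagated back to the hidden layer
    da = dh * (1.0 - h ** 2)      # tanh'(a) = 1 - tanh(a)^2
    dW0, db0 = np.outer(da, x), da
    return [dW0, db0, dW1, db1]

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), np.zeros(3),
          rng.normal(size=(2, 3)), np.zeros(2)]
x, target = np.array([0.5, -1.0]), 1

loss_before = loss(params, x, target)
grads = backprop(params, x, target)
# Adjust every weight and bias by a small step against its gradient.
params = [p - 0.1 * g for p, g in zip(params, grads)]
loss_after = loss(params, x, target)
```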

MLP Multi Layer Perceptron
Multiple hidden layers

[Diagram: Layer0 (input0, input1) -> Layer1 (hidden0, hidden1, hidden2) -> Layer2 (output0, output1). With input x, weights W and bias b0: z = Wx + b, y = tanh(z)]

[Diagram: Hidden Layer. Input x, weights W and bias b0: z = Wx + b, y = tanh(z)]

[Diagram: Layer1 -> Layer2. Hidden layer (bias b0): z = Wx + b, y = tanh(z). Output layer (bias b1): z = Wx + b -> softmax -> argmax -> y_pred]

[Diagram: Layer1 -> Layer2. The output layer (bias b1: z = Wx + b -> softmax -> argmax -> y_pred) is Logistic Regression]
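The diagrams can be read as the following forward pass: a tanh hidden layer followed by a logistic-regression output layer (softmax, then argmax). This is a sketch with the diagram's sizes (2 inputs, 3 hidden units, 2 outputs); the weights are random placeholders, not trained values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_predict(x, W0, b0, W1, b1):
    """Forward pass: hidden layer, then logistic-regression output layer."""
    h = np.tanh(W0 @ x + b0)      # Layer1: z = Wx + b, y = tanh(z)
    p = softmax(W1 @ h + b1)      # Layer2: z = Wx + b, then softmax
    return int(np.argmax(p)), p   # y_pred is the most probable class

rng = np.random.default_rng(0)
W0, b0 = rng.normal(size=(3, 2)), np.zeros(3)   # 2 inputs -> 3 hidden units
W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)   # 3 hidden units -> 2 outputs

y_pred, probs = mlp_predict(np.array([1.0, -0.5]), W0, b0, W1, b1)
```

For the digit task the same structure would take 784 inputs and produce 10 outputs.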

CNN Convolutional Neural Network

Convolution: Feature map?
[Figure: Image, Convolution]
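A feature map is what you get by sliding a small kernel over the image. Below is a minimal, unoptimized NumPy sketch of a "valid" 2D convolution (no padding, stride 1). The kernel and the image are made up, and the kernel is flipped as in the mathematical definition of convolution; some libraries skip the flip and compute cross-correlation instead.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D convolution without padding; output shrinks by kernel size - 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    flipped = kernel[::-1, ::-1]   # true convolution flips the kernel
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the patch under the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

# Made-up image: left half dark, right half bright.
image = np.zeros((28, 28))
image[:, 14:] = 1.0

# Made-up 3x3 kernel that responds to horizontal intensity changes.
kernel = np.array([[1.0, 0.0, -1.0]] * 3)

feature_map = conv2d_valid(image, kernel)   # shape (26, 26)
```

The feature map is zero in the flat regions and non-zero only along the vertical edge, which is the sense in which a kernel detects a feature.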

CNN Convolutional Neural Network
Convolution Layer -> MLP

Dropout/connect Avoid overfitting
[Diagram: network with input0, input1, hidden0, hidden1, hidden2, output0, output1 and bias b0. Dropout removes units; dropconnect removes individual connections]
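A sketch of the two masks applied during training to one tanh hidden layer. The keep probability of 0.5, the layer sizes and the "inverted" rescaling of dropout (dividing by keep_prob during training) are assumptions; other formulations rescale at test time instead.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5   # assumed probability of keeping a unit or a connection

W = rng.normal(size=(3, 2))   # 2 inputs -> 3 hidden units
x = np.array([1.0, -0.5])
h = np.tanh(W @ x)

# Dropout: zero out whole hidden units, rescale the survivors.
unit_mask = rng.random(h.shape) < keep_prob
h_dropout = h * unit_mask / keep_prob

# DropConnect: zero out individual weights before computing the activation.
weight_mask = rng.random(W.shape) < keep_prob
h_dropconnect = np.tanh((W * weight_mask) @ x)
```

A fresh mask is drawn for every training example or batch; at prediction time no mask is applied.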

CNN: thanks for making it this far
Convolutional neural networks
1. convolution layer (feature map = 4, 5*5)
2. max pooling layer (2*2)
3. convolution layer (feature map = 10, 5*5)
4. max pooling layer (2*2)
5. hidden layer (500 neurons, tanh activation)
6. output layer (500 -> 10, logistic regression)
Accuracy = 96.74% (10-fold CV)
After making the slides, I realized I had set up the feature maps incorrectly.
Whether 98+% is possible when done properly, even with fewer samples, is left for next time.
Besides, kaggle does not provide the full original dataset =_=
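As a sanity check on the six-layer architecture listed above, here is the size arithmetic from the 28 * 28 input down to the hidden layer. It assumes convolutions without padding at stride 1 and non-overlapping 2*2 pooling, which the slide does not state.

```python
def conv_out(size, kernel):
    """Output width of a no-padding, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output width of non-overlapping max pooling."""
    return size // pool

sizes = [28]                              # input image is 28 * 28
sizes.append(conv_out(sizes[-1], 5))      # 1. convolution 5*5  -> 24
sizes.append(pool_out(sizes[-1], 2))      # 2. max pooling 2*2  -> 12
sizes.append(conv_out(sizes[-1], 5))      # 3. convolution 5*5  -> 8
sizes.append(pool_out(sizes[-1], 2))      # 4. max pooling 2*2  -> 4

# 10 feature maps of 4 * 4 are flattened into the hidden layer's input.
hidden_inputs = 10 * sizes[-1] * sizes[-1]   # 160
```

Under these assumptions the 500-neuron hidden layer receives 160 inputs per image.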
