Learn to Implement Python Feature Selection in 30 Minutes
James CC Huang
Disclaimer
• Implementation only
• No math
• No statistics
Source: Internet
Warming Up
• I hear nobody asks questions at this kind of talk (pinning the speaker on stage)
• The original session was only 40 minutes, but today's talk was given a 2-hour slot
• A test of my memory and comprehension
• The original speaker covered a pile of terminology but no implementation (there was no time for it)
• I implemented the examples in Python
• Hopefully, if you are like me and do not do theory, math, or statistics, you can go home, copy and paste, and do feature selection with scikit-learn
Reinventing the Wheel?
Source: P.60 http://www.slideshare.net/tw_dsconf/ss-62245351
Doing Machine Learning and Deep Learning…
• Do we really need to understand the underlying math, statistics, and theory?
• Promoting and popularizing Machine Learning / Deep Learning
• Ease of use of the tools and speed of development
• There are opinions on both sides
• Pro example: on taking up the Master Algorithm, "… you might think this would require heavy mathematics and rigorous theoretical work. In fact it does not; what it requires is stepping back from deep mathematical theory so you can see the overall pattern of the learning phenomenon." (大演算 The Master Algorithm, Chinese edition, p. 40)
• Con example: Deep Neural Networks - A Developmental Perspective (slides, video)
2014 – 2016
Taiwan Data Science "Enthusiast" Conference (台灣資料科學年會)
What I will share:
1. Three consecutive years of experience eating the conference lunchboxes
2. What I dreamed up after hearing the 2016 Feature Engineering in Machine Learning talk
Three Years of Evolution
• More and more attendees
• [Unscientific eyeballing] The average age of attendees keeps going up XD
• More and more content, more and more sessions
• A shift in who the speakers are: more professors and people from research institutes
• The term "Deep Learning" shows up far more often
• $$ It keeps getting more expensive
• Moving toward a user-pays model
• Some of the paid courses will continue to be offered
• The lunchboxes have not evolved (still the same few vendors)
Source: http://datasci.tw/agenda.php
Feature Engineering in Machine Learning
Session (Speaker: 李俊良)
Source: http://www.slideshare.net/tw_dsconf/feature-engineering-in-machine-learning
Can Feature Engineering Identify Writing Style?
• J.K. Rowling wrote a novel under a pseudonym; sales soared once she was exposed
http://www.bbc.com/zhongwen/trad/uk_study/2013/07/130714_rowling_novel
• "One reviewer called the new book The Cuckoo's Calling a 'brilliant debut', and another praised the male author for describing women's clothing so masterfully."
• "… In the three months since publication the novel had sold 1,500 copies, but Amazon reported that after noon on Sunday its sales surged by as much as 500,000%."
• Original slides, p. 14 (Source:
http://www.slideshare.net/tw_dsconf/feature-engineering-in-machine-learning)
Find Word / Doc Similarity with
Deep Learning
Using word2vec / doc2vec with Gensim (Python)
Goal (or Problem to Solve)
• Problem: Tech Support engineers (TS) want to “precisely” categorize
support cases. The task is being performed manually by TS engineers.
• Goal: Automatically categorize support cases.
• What I have:
• 156 classified cases (with “so-called” correct issue categories)
• Support cases in database
• Challenges:
• Based on the data currently available, supervised classification algorithms can't be
applied.
• Clustering may not fully achieve the goal.
• What about Deep Learning?
Gensim (word2vec / doc2vec implementation in Python)
from os import listdir
import gensim

# LabeledSentence was the doc2vec input type in 2016-era gensim
# (newer releases rename it to TaggedDocument and use tags= instead of labels=)
LabeledSentence = gensim.models.doc2vec.LabeledSentence

# each .txt support case under ../corpora/2016/ becomes one document;
# the file name doubles as the document label
docLabels = [f for f in listdir("../corpora/2016/") if f.endswith(".txt")]

data = []
for doc in docLabels:
    data.append(open("../corpora/2016/" + doc, "r"))

class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            # note: a file object is exhausted after one read(), so repeated
            # passes over this iterator will see empty documents
            yield LabeledSentence(words=doc.read().split(),
                                  labels=[self.labels_list[idx]])
Gensim (Cont’d)
it = LabeledLineSentence(data, docLabels)

model = gensim.models.Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(it)

# manually decay the learning rate over 10 training passes
for epoch in range(10):
    model.train(it)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

# find the support cases most similar to case "00111105"
print(model.most_similar("00111105"))
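A hedged note on API drift: the snippet above follows the gensim API as it was around 2016. In newer gensim releases LabeledSentence is replaced by TaggedDocument (tags= instead of labels=), train() wants explicit total_examples and epochs arguments, and document-level similarity is queried from the document-vector store rather than the word vectors. A rough sketch of the equivalent under a recent gensim, reusing the docLabels list from the previous slide (treat the exact names as assumptions and check them against your installed version):
# rough equivalent for gensim 4.x (assumed API; adjust to the version you have installed)
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(words=open("../corpora/2016/" + name).read().split(),
                            tags=[name])
             for name in docLabels]
# passing the corpus to the constructor runs build_vocab and train in one go
model = Doc2Vec(documents, vector_size=100, alpha=0.025, min_alpha=0.0025, epochs=10)
# document similarity lives in model.dv (model.docvecs before gensim 4.0)
print(model.dv.most_similar(docLabels[0]))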
Rumor Has It
• With Deep Learning you do not need to do feature selection, because deep learning decides the features for you automatically
• From Wikipedia (https://en.wikipedia.org/wiki/Deep_learning):
• "One of the promises of deep learning is replacing handcrafted features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction."
• Is it really that magical?
Feature Selection for the Iris Dataset as an Example
• Iris dataset attributes
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
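Before selecting anything it helps to confirm what the dataset actually contains; a quick inspection with the same load_iris call the following slides use:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print(X.shape)
(150, 4)
>>> print(iris["feature_names"])
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> print(iris["target_names"])
['setosa' 'versicolor' 'virginica']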
Feature Selection - LASSO
>>> from sklearn.linear_model import Lasso
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print(X.shape)
(150, 4)
>>> clf = Lasso(alpha=0.01)
>>> sfm = SelectFromModel(clf, threshold=0.25)
>>> sfm.fit(X, y)
>>> n_features = sfm.transform(X).shape[1]
>>> print(n_features)
2
petal width & petal length
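To list the selected columns explicitly instead of inferring them from the column count, one option is the selector's get_support() mask; a small addition to the snippet above:
>>> import numpy as np
>>> mask = sfm.get_support()                      # boolean mask over the 4 features
>>> print(np.array(iris["feature_names"])[mask])  # expected: the two petal features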
Feature Selection - LASSO (Cont’d)
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X = scaler.fit_transform(X)
>>> names = iris["feature_names"]
>>> lasso = Lasso(alpha=0.01, positive=True)
>>> lasso.fit(X, y)
>>> print(sorted(zip(map(lambda x: round(x, 4), lasso.coef_), names), reverse=True))
[(0.47199999999999998, 'petal width (cm)'), (0.3105, 'petal length (cm)'), (0.0, 'sepal width (cm)'), (0.0, 'sepal length (cm)')]
Feature Selection – Random Forest
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestRegressor
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print (X.shape)
(150, 4)
>>> names = iris["feature_names"]
>>> rf = RandomForestRegressor()
>>> rf.fit(X, y)
>>> print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), reverse=True))
[(0.50729999999999997, 'petal width (cm)'), (0.47870000000000001, 'petal length (cm)'), (0.0091000000000000004, 'sepal width (cm)'), (0.0048999999999999998, 'sepal length (cm)')]
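Since the iris target is a class label rather than a continuous value, a RandomForestClassifier is arguably the more natural choice here; it exposes feature_importances_ in the same way. A hedged variant (the importances vary from run to run unless random_state is fixed):
>>> from sklearn.ensemble import RandomForestClassifier
>>> rfc = RandomForestClassifier(n_estimators=100, random_state=0)
>>> rfc.fit(X, y)
>>> print(sorted(zip(map(lambda x: round(x, 4), rfc.feature_importances_), names), reverse=True))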
Dimension Reduction - PCA
>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA as pca
>>> from sklearn.preprocessing import StandardScaler
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X = StandardScaler().fit_transform(X)
>>> sklearn_pca = pca(n_components=2)
>>> sklearn_pca.fit_transform(X)
>>> print (sklearn_pca.components_)
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]]
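components_ shows the loadings, but the explained variance is what tells you whether two components are enough; scikit-learn exposes it as explained_variance_ratio_. On the standardized iris data the first two components should cover roughly 96% of the variance (quick check, output omitted):
>>> print(sklearn_pca.explained_variance_ratio_)
>>> print(sklearn_pca.explained_variance_ratio_.sum())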
There are many others…
This talk simply puts into practice the methods the original speaker mentioned.
I have done the easy ones; the harder ones are left for you to explore (one more quick sketch follows below).
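As one example of those "many others", a minimal univariate selection sketch with SelectKBest and the chi-squared test, adapted from the scikit-learn feature selection guide listed in the references:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> print(X_new.shape)
(150, 2)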
References
scikit-learn
• Feature selection
http://scikit-learn.org/stable/modules/feature_selection.html
• sklearn.linear_model.Lasso
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
• sklearn.decomposition.PCA
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Gensim
• https://radimrehurek.com/gensim/index.html
HoG (Histogram of Oriented Gradients)
• Python code example
http://scikit-image.org/docs/dev/auto_examples/plot_hog.html
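For completeness, a minimal HOG sketch along the lines of the linked scikit-image example, using the built-in grayscale camera image (older scikit-image releases spell the keyword visualise rather than visualize, so adjust to your installed version):
>>> from skimage import data
>>> from skimage.feature import hog
>>> image = data.camera()
>>> fd, hog_image = hog(image, orientations=8, pixels_per_cell=(16, 16), cells_per_block=(1, 1), visualize=True)
>>> print(fd.shape)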