Learn to Implement Python Feature Selection in 30 Minutes
James CC Huang
Disclaimer
• Implementation only
• No math
• No statistics
Source: Internet
Warming Up
• I hear nobody asks questions at this kind of talk (pinning the speaker on stage)
• The original session was only 40 minutes, but today's talk was given a 2-hour slot
• A test of my memory and comprehension
• The original speaker covered a pile of terminology but no implementation (there was no time for it)
• I implemented the examples in Python
• Hopefully, if you are like me and do not do theory, math, or statistics, you can go home, copy and paste, and do feature selection with scikit-learn
Reinventing the Wheel?
Source: P.60 http://www.slideshare.net/tw_dsconf/ss-62245351
Doing Machine Learning and Deep Learning…
• Do we really need to understand the underlying math, statistics, and theory?
• Promoting and popularizing Machine Learning / Deep Learning
• Ease of use of the tools and speed of development
• There are opinions on both sides
• Pro example: on taking up the Master Algorithm, "… you might think this would require heavy mathematics and rigorous theoretical work. In fact it does not; what it requires is stepping back from deep mathematical theory so you can see the overall pattern of the learning phenomenon." (大演算 The Master Algorithm, Chinese edition, p. 40)
• Con example: Deep Neural Networks - A Developmental Perspective (slides, video)
2014 – 2016
Taiwan Data Science "Enthusiast" Conference (台灣資料科學年會)
What I will share:
1. Three consecutive years of experience eating the conference lunchboxes
2. What I dreamed up after hearing the 2016 Feature Engineering in Machine Learning talk
Three Years of Evolution
• More and more attendees
• [Unscientific eyeballing] The average age of attendees keeps going up XD
• More and more content, more and more sessions
• A shift in who the speakers are: more professors and people from research institutes
• The term "Deep Learning" shows up far more often
• $$ It keeps getting more expensive
• Moving toward a user-pays model
• Some of the paid courses will continue to be offered
• The lunchboxes have not evolved (still the same few vendors)
Source: http://datasci.tw/agenda.php
Feature Engineering in Machine Learning
Session (Speaker: 李俊良)
Source: http://www.slideshare.net/tw_dsconf/feature-engineering-in-machine-learning
Can Feature Engineering Identify Writing Style?
• J.K. Rowling wrote a novel under a pseudonym; sales soared once she was exposed
http://www.bbc.com/zhongwen/trad/uk_study/2013/07/130714_rowling_novel
• "One reviewer called the new book The Cuckoo's Calling a 'brilliant debut', and another praised the male author for describing women's clothing so masterfully."
• "… In the three months since publication the novel had sold 1,500 copies, but Amazon reported that after noon on Sunday its sales surged by as much as 500,000%."
• Original slides, p. 14 (Source:
http://www.slideshare.net/tw_dsconf/feature-engineering-in-machine-learning)
Find Word / Doc Similarity with
Deep Learning
Using word2vec / doc2vec with Gensim (Python)
Goal (or Problem to Solve)
• Problem: Tech Support engineers (TS) want to “precisely” categorize
support cases. The task is being performed manually by TS engineers.
• Goal: Automatically categorize support cases.
• What I have:
• 156 classified cases (with “so-called” correct issue categories)
• Support cases in database
• Challenges:
• Based on the data currently available, supervised classification algorithms can't be
applied.
• Clustering may not fully achieve the goal.
• What about Deep Learning?
Gensim (word2vec / doc2vec implementation in Python)
from os import listdir
import gensim

# LabeledSentence was the doc2vec input type in 2016-era gensim
# (newer releases rename it to TaggedDocument and use tags= instead of labels=)
LabeledSentence = gensim.models.doc2vec.LabeledSentence

# each .txt support case under ../corpora/2016/ becomes one document;
# the file name doubles as the document label
docLabels = [f for f in listdir("../corpora/2016/") if f.endswith(".txt")]

data = []
for doc in docLabels:
    data.append(open("../corpora/2016/" + doc, "r"))

class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            # note: a file object is exhausted after one read(), so repeated
            # passes over this iterator will see empty documents
            yield LabeledSentence(words=doc.read().split(),
                                  labels=[self.labels_list[idx]])
Gensim (Cont’d)
it = LabeledLineSentence(data, docLabels)

model = gensim.models.Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(it)

# manually decay the learning rate over 10 training passes
for epoch in range(10):
    model.train(it)
    model.alpha -= 0.002
    model.min_alpha = model.alpha

# find the support cases most similar to case "00111105"
print(model.most_similar("00111105"))
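A hedged note on API drift: the snippet above follows the gensim API as it was around 2016. In newer gensim releases LabeledSentence is replaced by TaggedDocument (tags= instead of labels=), train() wants explicit total_examples and epochs arguments, and document-level similarity is queried from the document-vector store rather than the word vectors. A rough sketch of the equivalent under a recent gensim, reusing the docLabels list from the previous slide (treat the exact names as assumptions and check them against your installed version):
# rough equivalent for gensim 4.x (assumed API; adjust to the version you have installed)
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(words=open("../corpora/2016/" + name).read().split(),
                            tags=[name])
             for name in docLabels]
# passing the corpus to the constructor runs build_vocab and train in one go
model = Doc2Vec(documents, vector_size=100, alpha=0.025, min_alpha=0.0025, epochs=10)
# document similarity lives in model.dv (model.docvecs before gensim 4.0)
print(model.dv.most_similar(docLabels[0]))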
Rumor Has It
• With Deep Learning you do not need to do feature selection, because deep learning decides the features for you automatically
• From Wikipedia (https://en.wikipedia.org/wiki/Deep_learning):
• "One of the promises of deep learning is replacing handcrafted features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction."
• Is it really that magical?
Feature Selection for the Iris Dataset as an Example
• Iris dataset attributes
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
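Before selecting anything it helps to confirm what the dataset actually contains; a quick inspection with the same load_iris call the following slides use:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print(X.shape)
(150, 4)
>>> print(iris["feature_names"])
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> print(iris["target_names"])
['setosa' 'versicolor' 'virginica']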
Feature Selection - LASSO
>>> from sklearn.linear_model import Lasso
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print(X.shape)
(150, 4)
>>> clf = Lasso(alpha=0.01)
>>> sfm = SelectFromModel(clf, threshold=0.25)
>>> sfm.fit(X, y)
>>> n_features = sfm.transform(X).shape[1]
>>> print(n_features)
2
petal width & petal length
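To list the selected columns explicitly instead of inferring them from the column count, one option is the selector's get_support() mask; a small addition to the snippet above:
>>> import numpy as np
>>> mask = sfm.get_support()                      # boolean mask over the 4 features
>>> print(np.array(iris["feature_names"])[mask])  # expected: the two petal features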
Feature Selection - LASSO (Cont’d)
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X = scaler.fit_transform(X)
>>> names = iris["feature_names"]
>>> lasso = Lasso(alpha=0.01, positive=True)
>>> lasso.fit(X, y)
>>> print(sorted(zip(map(lambda x: round(x, 4), lasso.coef_), names), reverse=True))
[(0.47199999999999998, 'petal width (cm)'), (0.3105, 'petal length (cm)'), (0.0, 'sepal width (cm)'), (0.0, 'sepal length (cm)')]
Feature Selection – Random Forest
>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestRegressor
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> print (X.shape)
(150, 4)
>>> names = iris["feature_names"]
>>> rf = RandomForestRegressor()
>>> rf.fit(X, y)
>>> print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), reverse=True))
[(0.50729999999999997, 'petal width (cm)'), (0.47870000000000001, 'petal length (cm)'), (0.0091000000000000004, 'sepal width (cm)'), (0.0048999999999999998, 'sepal length (cm)')]
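Since the iris target is a class label rather than a continuous value, a RandomForestClassifier is arguably the more natural choice here; it exposes feature_importances_ in the same way. A hedged variant (the importances vary from run to run unless random_state is fixed):
>>> from sklearn.ensemble import RandomForestClassifier
>>> rfc = RandomForestClassifier(n_estimators=100, random_state=0)
>>> rfc.fit(X, y)
>>> print(sorted(zip(map(lambda x: round(x, 4), rfc.feature_importances_), names), reverse=True))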
Dimension Reduction - PCA
>>> from sklearn.datasets import load_iris
>>> from sklearn.decomposition import PCA as pca
>>> from sklearn.preprocessing import StandardScaler
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X = StandardScaler().fit_transform(X)
>>> sklearn_pca = pca(n_components=2)
>>> sklearn_pca.fit_transform(X)
>>> print (sklearn_pca.components_)
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]]
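components_ shows the loadings, but the explained variance is what tells you whether two components are enough; scikit-learn exposes it as explained_variance_ratio_. On the standardized iris data the first two components should cover roughly 96% of the variance (quick check, output omitted):
>>> print(sklearn_pca.explained_variance_ratio_)
>>> print(sklearn_pca.explained_variance_ratio_.sum())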
There are many others…
This talk simply puts into practice the methods the original speaker mentioned.
I have done the easy ones; the harder ones are left for you to explore (one more quick sketch follows below).
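As one example of those "many others", a minimal univariate selection sketch with SelectKBest and the chi-squared test, adapted from the scikit-learn feature selection guide listed in the references:
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> print(X_new.shape)
(150, 2)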
References
scikit-learn
• Feature selection
http://scikit-learn.org/stable/modules/feature_selection.html
• sklearn.linear_model.Lasso
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
• sklearn.decomposition.PCA
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Gensim
• https://radimrehurek.com/gensim/index.html
HoG (Histogram of Oriented Gradients)
• Python code example
http://scikit-image.org/docs/dev/auto_examples/plot_hog.html
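For completeness, a minimal HOG sketch along the lines of the linked scikit-image example, using the built-in grayscale camera image (older scikit-image releases spell the keyword visualise rather than visualize, so adjust to your installed version):
>>> from skimage import data
>>> from skimage.feature import hog
>>> image = data.camera()
>>> fd, hog_image = hog(image, orientations=8, pixels_per_cell=(16, 16), cells_per_block=(1, 1), visualize=True)
>>> print(fd.shape)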