FeatureRank - An Insight Data Science Consulting Project
Kuhan Wang
October 8th, 2015
1 / 11
Consulting Scenario
Company X wishes to maximize user engagement through
optimal placement of advertisements on content URLs.
Ad Type: Tourism
  Keyword: Cuba
  Keyword: Package Tour
  Keyword: Airplane
Ad Type X
  Keyword 1
  Keyword 2
  Keyword 3
  ...
  Keyword N
Example: tourism ads are not ideal on a content URL about investment.
2 / 11
A Pipeline to Analyze Textual Features
Developed and implemented a pipeline to analyze the
importance of textual features on content URLs relative to
engagement.
[Flow diagram: Begin -> Scrape URL -> Process Text -> Model Features -> Extract Keywords -> Update Keywords -> Collect Data, Reiterate]
3 / 11
User Engagement Data
[Figure: Summary of Engagement Data. Axis labels: Occurrences, Counts. Categories: Page Loaded, Ad Viewed, Ad Clicked.]
4 / 11
Modeling
Attempted linear regression.
Classify engagement as yes/no.
- Features are bags of words from the content URL text.
[Figure: Logistic Classification Model. Axes: Word Count (0 to 10), Probability [%] (0 to 1). Classes: Ad Clicked, Ad Not Clicked.]
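The slide names the approach (bag-of-words features, logistic yes/no classification) but not the implementation. As a minimal sketch only, assuming a plain stochastic-gradient fit and using made-up page texts, the idea could look like this. The names `bag_of_words`, `fit_logistic` and `predict_proba` are hypothetical and are not taken from the delivered pipeline.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def sigmoid(z):
    # Clamp z so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, bag):
    """Modelled probability that the ad is clicked for this bag of words."""
    z = bias + sum(weights.get(word, 0.0) * count for word, count in bag.items())
    return sigmoid(z)


def fit_logistic(bags, labels, epochs=500, lr=0.1):
    """Fit one weight per word by stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for bag, y in zip(bags, labels):
            err = predict_proba(weights, bias, bag) - y
            bias -= lr * err
            for word, count in bag.items():
                weights[word] = weights.get(word, 0.0) - lr * err * count
    return weights, bias


# Toy, made-up page texts with engagement labels (1 = ad clicked).
pages = [
    ("mortgage rates and home loan advice", 1),
    ("refinance your mortgage and home loan", 1),
    ("football scores from the weekend", 0),
    ("weekend football match report", 0),
]
bags = [bag_of_words(text) for text, _ in pages]
labels = [label for _, label in pages]
weights, bias = fit_logistic(bags, labels)

p_finance = predict_proba(weights, bias, bag_of_words("home mortgage loan"))
p_sport = predict_proba(weights, bias, bag_of_words("football weekend"))
```

Words that appear mostly on clicked pages end up with positive weights, which is what makes the fitted model usable for keyword extraction later in the pipeline.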
5 / 11
Validation
Randomly split data into training/test sets.
- Generate distribution of validation scores.
[Figure: Distribution of Precision vs Recall. Axes: Precision (0.55 to 0.85), Recall (0.4 to 0.85). Color scale: Number of MC Toys (0 to 30). Marker: ⟨Precision, Recall⟩.]
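The validation step above (repeated random train/test splits, each giving one precision/recall pair) can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic, the threshold classifier is a stand-in for the real model, and `run_toys`, `fit_threshold` and the 30% test fraction are hypothetical choices, not details from the project.

```python
import random


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def fit_threshold(train):
    """Stand-in model: cut halfway between the two class means of x."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def run_toys(data, n_toys=200, test_frac=0.3, seed=0):
    """Repeat a random train/test split n_toys times; return all scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_toys):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        test, train = shuffled[:n_test], shuffled[n_test:]
        threshold = fit_threshold(train)
        y_true = [y for _, y in test]
        y_pred = [1 if x > threshold else 0 for x, _ in test]
        scores.append(precision_recall(y_true, y_pred))
    return scores


# Synthetic one-feature data: 100 clicked and 100 not-clicked examples.
data_rng = random.Random(42)
data = [(data_rng.gauss(1.0, 1.0), 1) for _ in range(100)]
data += [(data_rng.gauss(-1.0, 1.0), 0) for _ in range(100)]

scores = run_toys(data)
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
```

Histogramming `scores` in two dimensions gives a distribution of the kind shown on the slide, and the pair (`mean_precision`, `mean_recall`) corresponds to the ⟨Precision, Recall⟩ marker.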
6 / 11
Deliverables
Extracted keywords:
Rank  Ad Type 1  Ad Type 2       Ad Type 3    Ad Type 4
1     debt       coordinator     mortgage     gold
2     gift       administrative  home         0
3     profit     minimum         procurement  stock
4     check      minimum wage    loan         fund
5     balance    reports         trustee      event
The pipeline, written in Python, is delivered to the company for
implementation.
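The slides do not say how the ranked keyword lists are produced. One plausible reading, assumed here purely for illustration, is that ngrams are ordered by the weight a linear model assigns to them. The function `top_keywords` and the weights below are hypothetical and made up; they are not results from the project.

```python
def top_keywords(weights, n=5):
    """Return (rank, ngram) pairs for the n largest weights, best first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [(rank, ngram) for rank, (ngram, _) in enumerate(ranked[:n], start=1)]


# Made-up weights for a hypothetical tourism ad type.
example_weights = {
    "cuba": 1.4,
    "package tour": 0.9,
    "airplane": 0.6,
    "loan": -0.2,
    "stock": -0.7,
}
ranking = top_keywords(example_weights, n=3)
```

Running one such ranking per ad type would yield a table with the same shape as the one above.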
7 / 11
About Myself
PhD in Particle Physics, McGill University; researcher on the
Large Hadron Collider.
Led the search for black holes and string objects as part of
the ATLAS Collaboration.
More about the project and myself at http://kuhanw.zohosites.com/.
8 / 11
Backup
[Figure: Feature Frequency/Documents, Ad Type 1. Linear axis from 0 to 1; Relative Number of Documents [%] on a log axis from 10^-4 to 1.]
9 / 11
10 / 11
FeatureRank
Kuhan Wang, Insight Data Science
October 2, 2015
Abstract
FeatureRank is a software tool for extracting correlations between text
ngram features and user engagement, thereby optimizing the placement
of financial widgets on URL articles.
1 Directory Structure
• /
  processing.py
    Pre-processing to parse relevant information from engagement CSV files.
  crawl.py
    A simple web crawler that pulls the title and <p> tag text from URLs.
  FeatureRank.py
    Driver file to execute main functions.
  feature_extraction_model.py
    The core program that contains the machine learning algorithms.
  post_processing.py
    Post-processing to produce evaluation metrics and ngram rankings.
11 / 11
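The contents of crawl.py are not shown in the deck. As a rough sketch of what "pulls the title and <p> tag text" could mean, the following uses only the standard library's `html.parser`; the class `TitleAndParagraphs`, the function `extract_text` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # "title", "p", or None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "p":
            self._current = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in ("title", "p") and self._current == tag:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


def extract_text(html):
    """Return (title, list of non-empty paragraph strings) for an HTML page."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()
    paragraphs = [" ".join(p.split()) for p in parser.paragraphs]
    return parser.title.strip(), [p for p in paragraphs if p]


sample = """<html><head><title>Mortgage basics</title></head>
<body><h1>Ignored heading</h1>
<p>Fixed and <a href="#">variable</a> rates.</p>
<p>Compare lenders.</p></body></html>"""
title, paragraphs = extract_text(sample)
```

In a real crawler the HTML string would come from fetching each content URL, for example with `urllib.request.urlopen`, before being passed to `extract_text`.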
