Featurizing log data
before XGBoost
Xavier Conort
Thursday, August 20, 2015 @
The competition host
● XuetangX, a Chinese MOOC learning platform initiated
by Tsinghua University
● launched online on Oct 10th, 2013
● more than 100 Chinese courses and over 260
international courses
● high dropout rate
Problem to solve
● challenge: predict whether a user will drop a course
within the next 10 days based on his or her prior activities.
● data:
○ enrollment_train (120K rows) / enrollment_test (80K rows):
■ Columns: enrollment_id, username, course_id
○ log_train / log_test
■ Columns: enrollment_id, time, source, event, object
○ object
■ Columns: course_id, module_id, category, children, start
○ truth_train
■ Columns: enrollment_id, dropped_out
Log data 5890
objects
Team
Chief Product Officer Chief Data Scientist
Data Scientist Data Scientist
(O. Zhang)
How we worked as a Team
● worked separately on feature engineering. 90% of
our time was spent here.
● delegated the modeling part to DataRobot to:
○ find the best algorithm (with XGBoost as the winner!)
○ model text features
○ tune hyperparameters
○ experiment with different feature sets and blend 8 XGBoost
models built on different sets
○ communicate results
Feature engineering techniques used
● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text on which we ran
○ SVD on 3-grams
○ DataRobot text mining solution
● first 20 components of SVD on user x object
NB: removed duplicated log info and used training + test
sets to build most features
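The NB above can be illustrated with a minimal Python sketch (the deck itself worked in R). It assumes that "duplicated log info" means exact duplicate rows, which is one possible reading; the helper name and the sample rows are hypothetical.

```python
def combine_and_dedupe(*log_sets):
    """Concatenate log sets and drop exact duplicate rows, keeping first-seen order."""
    seen = set()
    combined = []
    for logs in log_sets:
        for row in logs:
            if row not in seen:
                seen.add(row)
                combined.append(row)
    return combined

# Hypothetical rows in the log_train / log_test column order:
# (enrollment_id, time, source, event, object)
log_train = [
    (1, "2014-06-14T09:38:29", "server", "nagivate", "obj_a"),
    (1, "2014-06-14T09:38:29", "server", "nagivate", "obj_a"),  # exact duplicate
    (1, "2014-06-14T09:40:02", "browser", "video", "obj_b"),
]
log_test = [
    (2, "2014-06-15T10:00:00", "browser", "problem", "obj_c"),
]

# Features are then computed on the union of training and test logs.
all_logs = combine_and_dedupe(log_train, log_test)
print(len(all_logs))  # 3
```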
How to build efficient features in R
Key course features
● course_id
● first log time
● enrollment counts
● unique log counts
● mean time interval
Key enrollment count features
● log counts
● unique log counts
● ratio of unique log counts to log counts
● unique log counts by event (nagivate, access,
problem, video, page_close, discussion, wiki)
● unique log counts before end of course (5 days, 10
days and 30 days before)
● sequence number of enrollment in that course
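A few of the count features above can be sketched in Python as follows. This is an illustrative sketch, not the team's R code: the function name, the feature names, and the sample rows are hypothetical, and the event name "nagivate" is spelled as it appears on the slide.

```python
from collections import Counter, defaultdict

EVENTS = ("nagivate", "access", "problem", "video",
          "page_close", "discussion", "wiki")

def enrollment_count_features(logs):
    """logs: iterable of (enrollment_id, time, source, event, object) tuples."""
    rows = defaultdict(list)
    for row in logs:
        rows[row[0]].append(row)

    features = {}
    for eid, items in rows.items():
        unique = set(items)                       # exact duplicates collapse here
        by_event = Counter(r[3] for r in unique)  # unique log counts per event
        feats = {
            "log_count": len(items),
            "unique_log_count": len(unique),
            "unique_ratio": len(unique) / len(items),
        }
        for event in EVENTS:
            feats["unique_" + event] = by_event.get(event, 0)
        features[eid] = feats
    return features

# Hypothetical sample logs.
sample_logs = [
    (1, "2014-06-14T09:38:29", "server", "nagivate", "obj_a"),
    (1, "2014-06-14T09:38:29", "server", "nagivate", "obj_a"),
    (1, "2014-06-14T09:40:02", "browser", "video", "obj_b"),
    (1, "2014-06-14T09:45:10", "browser", "video", "obj_b"),
    (2, "2014-06-15T10:00:00", "browser", "problem", "obj_c"),
]
count_features = enrollment_count_features(sample_logs)
print(count_features[1]["unique_ratio"])  # 0.75
```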
Key enrollment time stats
● log time stats (min, mean, max)
● gap between first and last log of enrollment
● gap between enrollment first log and course first log
● gap between enrollment last log and course last log
● difference between mean log time and mid point
between first and last log
● log interval stats (mean; 90th, 99th and 100th percentiles)
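The time statistics above could be computed along these lines. This is a Python sketch under assumptions: times are naive datetime objects, gaps are expressed in seconds, quantiles use linear interpolation between order statistics, and all function and feature names are hypothetical.

```python
from datetime import datetime, timedelta

def quantile(values, q):
    """Quantile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def enrollment_time_stats(times, course_first, course_last):
    """times: log datetimes of one enrollment; course_*: course first/last log."""
    ts = sorted(times)
    first, last = ts[0], ts[-1]
    # Seconds between consecutive logs; [0.0] guards single-log enrollments.
    intervals = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])] or [0.0]
    mean_time = first + sum((t - first for t in ts), timedelta()) / len(ts)
    midpoint = first + (last - first) / 2
    return {
        "span_s": (last - first).total_seconds(),
        "gap_to_course_first_s": (first - course_first).total_seconds(),
        "gap_to_course_last_s": (course_last - last).total_seconds(),
        "mean_minus_midpoint_s": (mean_time - midpoint).total_seconds(),
        "interval_mean_s": sum(intervals) / len(intervals),
        "interval_q90_s": quantile(intervals, 0.90),
        "interval_q99_s": quantile(intervals, 0.99),
        "interval_max_s": quantile(intervals, 1.00),
    }

# Hypothetical enrollment with three logs.
time_stats = enrollment_time_stats(
    [datetime(2014, 6, 1, 0), datetime(2014, 6, 1, 1), datetime(2014, 6, 3, 0)],
    course_first=datetime(2014, 5, 30),
    course_last=datetime(2014, 6, 29),
)
```

A negative mean_minus_midpoint_s means the activity is concentrated early in the enrollment's active period, which is the kind of shape information this feature is meant to capture.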
Enrollment entropy features
enrollment entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
● hours of the day for the last 1/3/7 days before the last
log
● object (when event == problem)
● chapter ids
Example of entropy feature
- log(weekday_log_count / enrollment_log_count) *
weekday_log_count / enrollment_log_count
Sum => weekday_entropy[enrollment_id==1]
1.589988
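The formula above is the Shannon entropy of an enrollment's logs across weekdays. A minimal Python sketch, assuming the natural logarithm (which is what `log` means in R) and using hypothetical timestamps:

```python
import math
from collections import Counter
from datetime import datetime

def entropy(labels):
    """Shannon entropy (natural log) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def weekday_entropy(times):
    """Entropy of one enrollment's logs over the days of the week."""
    return entropy(t.weekday() for t in times)

# Hypothetical enrollment: two logs on one weekday, one on each of two others.
times = [
    datetime(2014, 6, 2, 9), datetime(2014, 6, 2, 21),
    datetime(2014, 6, 3, 9), datetime(2014, 6, 4, 9),
]
value = weekday_entropy(times)  # proportions 0.5, 0.25, 0.25
```

The value is 0 when all logs fall on a single weekday and reaches log(7) when they are spread evenly over the week. The same `entropy` helper applies to the other groupings listed (days, hours of the day, objects, chapter ids) by swapping the label.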
Enrollment sequence features
● for each enrollment_id, built sequences of
○ weekdays
○ objects
■ all objects / 'problem' and 'video' objects only
○ events
● treated sequences as 4 text variables. Ran for each
○ SVD on 3-grams => first 10 components
○ DataRobot stacked predictions from logistic regr.
& Nystroem SVM on (tuned) n-grams
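Treating a sequence as text amounts to counting its n-grams. The Python sketch below builds a 3-gram count matrix from hypothetical event sequences; it covers only the counting step. The SVD that reduces such a matrix to its first 10 components is not shown, and the function names are assumptions.

```python
from collections import Counter

def ngrams(tokens, n=3):
    """Consecutive n-token windows of a sequence, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_count_matrix(sequences, n=3):
    """sequences: dict of enrollment_id -> token list. Returns (vocab, rows)."""
    counts = {key: Counter(ngrams(seq, n)) for key, seq in sequences.items()}
    vocab = sorted({gram for c in counts.values() for gram in c})
    rows = {key: [c.get(gram, 0) for gram in vocab] for key, c in counts.items()}
    return vocab, rows

# Hypothetical event sequences for two enrollments.
event_sequences = {
    1: ["access", "video", "problem", "video", "problem"],
    2: ["access", "video", "problem"],
}
vocab, rows = ngram_count_matrix(event_sequences)
print(vocab)
# ['access video problem', 'problem video problem', 'video problem video']
print(rows[1], rows[2])  # [1, 1, 1] [1, 0, 0]
```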
Extract of enrollment object sequences
1/2-grams from Object sequences
DataRobot on Object 1-2 grams
Key user count features and time
stats
● enrollment count
● binary indicator whether user signed up for each of
the 38 courses
● unique log count
● mean log time interval
● sequence number of enrollment for that user
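Some of the user-level features above can be sketched in Python as follows. The sketch assumes that the sequence number orders a user's enrollments by first log time, which the slide does not state; the function name, feature names, and sample data are hypothetical.

```python
from collections import defaultdict

def user_features(enrollments, first_log_time, courses):
    """enrollments: (enrollment_id, username, course_id) tuples.
    first_log_time: enrollment_id -> sortable first log time."""
    by_user = defaultdict(list)
    for eid, user, course in enrollments:
        by_user[user].append((first_log_time[eid], eid, course))

    features = {}
    for user, items in by_user.items():
        items.sort()  # assumed ordering: by first log time
        taken = {course for _, _, course in items}
        for rank, (_, eid, _) in enumerate(items, start=1):
            feats = {
                "user_enrollment_count": len(items),
                "user_enrollment_seq": rank,
            }
            for course in courses:  # one binary indicator per course
                feats["took_" + course] = int(course in taken)
            features[eid] = feats
    return features

# Hypothetical data: user u1 has two enrollments, u2 has one.
enrollments = [(10, "u1", "c_a"), (11, "u1", "c_b"), (12, "u2", "c_a")]
first_log_time = {10: "2014-06-05", 11: "2014-06-01", 12: "2014-06-03"}
user_feats = user_features(enrollments, first_log_time, ["c_a", "c_b", "c_c"])
```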
User entropy features
user entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
User sequence features
● for each user, built sequences of
○ weekdays
○ chapter_ids
○ events
● treated them as 3 text variables. Ran
○ SVD on 3-grams => first 10 components
○ DataRobot stacked predictions from logistic regr.
+ Nystroem SVM on (tuned) n-grams
How we got to the TOP3
● entropy features mentioned before: added ~0.001 to AUC (vs
less powerful features)
● exploited info in
○ log count in the 5 / 10 / 20 days after end of course
○ log counts by event, sign_up counts and day entropy in the next
10 days after end of course
○ time to sign up for new course
○ time until the next log for same user
added ~0.002 to AUC
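As an illustration of the last item, "time until the next log for same user", here is a Python sketch under one possible reading: for each enrollment, the delay between its last log and the same user's next log in any other enrollment. The function name and sample data are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime

def time_to_next_user_log(enrollment_user, enrollment_times):
    """enrollment_user: enrollment_id -> username.
    enrollment_times: enrollment_id -> list of log datetimes.
    Returns enrollment_id -> seconds to the user's next log, or None."""
    user_times = defaultdict(list)
    for eid, times in enrollment_times.items():
        user_times[enrollment_user[eid]].extend(times)
    for times in user_times.values():
        times.sort()

    result = {}
    for eid, times in enrollment_times.items():
        last = max(times)
        all_times = user_times[enrollment_user[eid]]
        # Any log strictly after `last` must come from another enrollment.
        idx = bisect_right(all_times, last)
        if idx < len(all_times):
            result[eid] = (all_times[idx] - last).total_seconds()
        else:
            result[eid] = None
    return result

# Hypothetical user with two enrollments.
next_log = time_to_next_user_log(
    {10: "u1", 11: "u1"},
    {
        10: [datetime(2014, 6, 1), datetime(2014, 6, 2)],
        11: [datetime(2014, 6, 1, 12), datetime(2014, 6, 5)],
    },
)
print(next_log)  # {10: 259200.0, 11: None}
```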
XGBoost
Thank you!

