Insights into Customer Behavior
from Clickstream Data
Ronald J. Nowling
Red Hat, Inc.
rnowling@redhat.com
http://rnowling.github.io/
Who Am I?
•  Software Engineer at Red Hat
•  Data Science Team, Emerging Technologies
–  Evaluate solutions in open-source Big Data space
–  Ensure software works for Red Hat customers
–  Promote data science internally through consulting projects
•  Apache Bigtop PMC
Clickstream Data
61 million page views
125,000 registered users
500,000 pages
125,000 knowledgebase articles
Potential Applications
•  Build customer profiles to aid sales teams
•  Recommendation system for knowledgebase
•  Improve customer portal search
•  Guide selection of new knowledgebase topics by content writers
Strip Formatting → Clean Words → Vectorize → Cluster
What are the different types of kernel packages in Red Hat
Enterprise Linux?
=============================================================
Issue
------
What are the different types of kernel packages in Red Hat
Enterprise Linux?
Environment
---------------
Red Hat Enterprise Linux
Resolution
------------
Red Hat Enterprise Linux contains the following kernel
packages:
Strip Formatting:
What are the different types of kernel packages in Red Hat
Enterprise Linux
Issue
What are the different types of kernel packages in Red Hat
Enterprise Linux
Environment
Red Hat Enterprise Linux
Resolution
Red Hat Enterprise Linux contains the following kernel
packages some may not apply to your architecture and not all
are available in all major releases kernel contains the
kernel and following key features
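A minimal sketch of the formatting-stripping step, assuming the articles use Markdown-style heading underlines as in the example above. The function name `strip_formatting` and the exact regular expressions are illustrative assumptions, not the code used in the talk.

```python
import re


def strip_formatting(text):
    """Drop heading-underline rows and punctuation, then collapse whitespace."""
    kept = []
    for line in text.splitlines():
        # Skip rows made only of '=' or '-' characters (heading underlines).
        if re.fullmatch(r"\s*[=\-]{3,}\s*", line):
            continue
        kept.append(line)
    # Replace every character that is not a word character or whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", " ".join(kept))
    return re.sub(r"\s+", " ", no_punct).strip()


raw = """What are the different types of kernel packages in Red Hat
Enterprise Linux?
=============================================================
Issue
------
What are the different types of kernel packages in Red Hat
Enterprise Linux?
"""
print(strip_formatting(raw))
```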
Clean Words (stemmed):
What are the different type of kernel package in Red Hat
Enterprise Linux
Issue
What are the different type of kernel package in Red Hat
Enterprise Linux
Environment
Red Hat Enterprise Linux
Resolution
Red Hat Enterprise Linux contain the follow kernel
package some may not apply to your architecture and not all
are available in all major release kernel contain the
kernel and follow key feature
Clean Words (common words removed):
different type kernel package Red Hat
Enterprise Linux
Issue
different type kernel package Red Hat
Enterprise Linux
Environment
Red Hat Enterprise Linux
Resolution
Red Hat Enterprise Linux contain kernel
package apply architecture
available major release kernel contain
kernel follow key feature
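The word-cleaning step could be sketched roughly as follows. The stop-word list and the toy suffix-stripping stemmer are assumptions made for illustration; a real pipeline would more likely use an established stemmer and a much larger stop-word list.

```python
from typing import List

# Hypothetical stop-word list, just large enough for the example sentence.
STOP_WORDS = {"what", "are", "the", "of", "in", "some", "may",
              "not", "to", "your", "and", "all"}


def naive_stem(word: str) -> str:
    """Toy stemmer: strips a trailing 'ing' or a plural 's'. Not Porter."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def clean_words(text: str) -> List[str]:
    """Lowercase, stem each token, then drop stop words."""
    stemmed = [naive_stem(token) for token in text.lower().split()]
    return [token for token in stemmed if token not in STOP_WORDS]


print(clean_words("What are the different types of kernel packages"))
```

A stemmer this crude will mangle some words (for example, it turns "this" into "thi"), which is one reason to check each cleaning step on your own data.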
Vectorize (word counts):
kernel: 5
red: 4
hat: 4
enterprise: 4
linux: 4
package: 3
contain: 3
different: 2
type: 2
intel: 2
environment: 1
resolution: 1
follow: 1
system: 1
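Word counts like these can be turned into a fixed-length vector once a vocabulary is chosen. The sketch below assumes a plain bag-of-words representation; the `vectorize` helper and the sample tokens are illustrative, not taken from the talk.

```python
from collections import Counter


def vectorize(tokens, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


tokens = "kernel package red hat kernel contain kernel".split()
vocabulary = sorted(set(tokens))
for term, count in zip(vocabulary, vectorize(tokens, vocabulary)):
    print(term, count)
```

Using one shared vocabulary for every article keeps the vectors the same length, which clustering requires.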
Vectorize (low-count words dropped):
kernel: 5
red: 4
hat: 4
enterprise: 4
linux: 4
package: 3
contain: 3
Topics
•  openshift gear cartridge online node broker
•  vm rhev virtualization disk
•  glusterfs storage volume brick rhs glusterd node client mount geo
•  rhel support driver hp hardware version firmware card intel
[Figure: Topic Article Counts]
Clickstream Processing
[Diagram: three parallel branches, each Raw Daily Page Views → Parse → Clean & Filter; remaining boxes: Accounts, Aggregate, Topic View Counts, Project onto Topics]
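The projection and aggregation stages might look roughly like this in plain Python. The record layout `(account_id, article_id)`, the `article_topic` mapping, and the function name are assumptions for illustration; the talk's pipeline ran on Spark over daily files.

```python
from collections import Counter, defaultdict


def topic_view_counts(page_views, article_topic):
    """Count, per account, how many viewed articles fall into each topic.

    page_views: iterable of (account_id, article_id) pairs (assumed layout).
    article_topic: dict mapping article_id to its cluster/topic label.
    """
    profiles = defaultdict(Counter)
    for account, article in page_views:
        topic = article_topic.get(article)
        if topic is None:
            # Filter out views of pages that are not clustered articles.
            continue
        profiles[account][topic] += 1
    return {account: dict(counts) for account, counts in profiles.items()}


views = [("a1", "k1"), ("a1", "k2"), ("a1", "k1"), ("a2", "k3"), ("a2", "home")]
topics = {"k1": "gluster", "k2": "rhev", "k3": "gluster"}
print(topic_view_counts(views, topics))
```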
Customer Profiles
•  Dominant topics
– JBoss
– Red Hat Enterprise Virtualization
– Hardware support
– Gluster
– Booting into rescue mode
– Packages
Customer Profiles
•  Supporting topics
– Logging
– LDAP
– Samba
– High resource usage
– File systems / LVM / block devices
– Networking
Customer Profiles
•  JBoss and RHEV appear in combination with a number of other products
•  Some products only appear by themselves with supporting topics (logging, networking, filesystems)
– OpenShift
– Gluster
[Figure: Topic Enrichments]
Malformed TSV Files
•  Gzip files need to be read sequentially
•  Tab-separated, no quoting (in theory!)
•  Escaped tabs and newlines within records
•  E.g., \n or \t
•  Improperly escaped tabs and newlines
•  E.g., \\t vs \t
•  Extraneous unmatched quote marks
•  E.g., ‘some_user
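A sketch of a tolerant record splitter for files like these, assuming backslash escapes (`\t`, `\n`, `\\`) inside fields and no quoting. The function name and the choice to pass unknown escapes through unchanged are illustrative assumptions.

```python
def split_record(line):
    """Split one tab-separated record and decode backslash escapes."""
    escapes = {"t": "\t", "n": "\n", "\\": "\\"}
    fields, current = [], []
    chars = iter(line.rstrip("\n"))
    for ch in chars:
        if ch == "\\":
            # Decode known escapes; keep unknown ones exactly as written.
            nxt = next(chars, "")
            current.append(escapes.get(nxt, "\\" + nxt))
        elif ch == "\t":
            # An unescaped tab ends the current field.
            fields.append("".join(current))
            current = []
        else:
            # Quote marks are ordinary characters, so a stray quote is harmless.
            current.append(ch)
    fields.append("".join(current))
    return fields


print(split_record("user\\tname\tpage\\nview\t'some_user\n"))
```

Treating quote characters as ordinary data avoids the failure mode where a CSV-style parser sees an unmatched quote and swallows the following records.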
Lessons Learned
•  Consider custom Hadoop input formats for tricky file formats
•  Verify everything – what works in general may not work for you
– Stemming
– Filtering most frequent words
– K-Means vs LDA
Lessons Learned
•  K-Means
– Improve accuracy: multiple runs, more iterations
•  Watch out for memory leaks
– Un-persist cached RDDs
– Un-persist broadcasted variables
•  Parquet for performance
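The "multiple runs" advice for K-Means can be illustrated with a small standard-library sketch: run the algorithm from several random starts and keep the result with the lowest within-cluster cost. This is a stand-in for illustration, not the Spark code used in the talk, and all names here are hypothetical.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def kmeans(points, k, iterations, rng):
    """One K-Means run from a random start; returns (centers, cost)."""
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost


def best_of(points, k, runs=10, iterations=20, seed=0):
    """Repeat K-Means `runs` times and keep the lowest-cost result."""
    rng = random.Random(seed)
    results = (kmeans(points, k, iterations, rng) for _ in range(runs))
    return min(results, key=lambda result: result[1])


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, cost = best_of(points, k=2)
print(sorted(centers), cost)
```

More runs and more iterations both cost compute time, so the counts are worth tuning against how much the final cost actually improves.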
Potential Applications
•  Build customer profiles to aid sales teams
•  Recommendation system for knowledgebase
•  Improve customer portal search
•  Guide selection of new knowledgebase topics for content writers
Resources
http://rnowling.github.io/
QUESTIONS