LIGHTNING EDITION

A practical data science project
● A CRM dataset (100k business accounts) belonging to a national energy supplier
● A knotty problem: multiple accounts per company, without any grouping IDs
● How can we find groups of accounts (larger company structures) using just the CRM data?
● Machine Learning (ML) and Natural Language Processing (NLP) tools and techniques in Python
● Import: scikit-learn and TextBlob (NLTK & Pattern)

Company ID  Account name    Contact name  Premises address (lines 1-4)  Billing address (lines 1-4)
----------  --------------  ------------  ----------------------------  ---------------------------
1           Bob’s Pizza     Big Bob       5 High St, Wexford            5 High St, Wexford
1           Bob’s Pizza     Big Bob       Temple Bar, D2                5 High St, Wexford
1           Mike’s Kebabs   Mad Mike      3 Upper St, Dublin            5 High St, Wexford
2           Mark’s Kebabs   Mild Mark     8 Upper St, Dublin            Main St, Waterford
3           Fred’s Falafel  Fat Fred      9 Henry St, Cork              9 Henry st, cork
3           Fred Fallafell  Freddie       Bridges St, Galway            Henrys St, Cork

The Company ID column is the crucial bit of info that groups the separately recorded accounts into companies… and it was missing from the dataset.

… x100,000
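
The example rows above differ only in case, punctuation and spelling ("9 Henry St, Cork" vs "9 Henry st, cork"). A minimal sketch of the kind of cleaning that could make such fields comparable follows. The function names and the dictionary keys are hypothetical and are not taken from the original project, which used NLTK-based tooling.

```python
import re

def normalise(text: str) -> str:
    """Lowercase, drop apostrophes, turn other punctuation into spaces."""
    text = text.lower()
    text = re.sub(r"[’']", "", text)          # "fred’s" -> "freds"
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remaining punctuation -> space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

def account_to_string(account: dict) -> str:
    """Join the free-text fields of one account into a single cleaned string.

    The field names are assumptions made for this illustration.
    """
    fields = ("account_name", "contact_name",
              "premises_address", "billing_address")
    return normalise(" ".join(account.get(f, "") for f in fields))

example = {
    "account_name": "Fred’s Falafel",
    "contact_name": "Fat Fred",
    "premises_address": "9 Henry St, Cork",
    "billing_address": "9 Henry st, cork",
}

# The two address spellings from the table collapse to the same string.
print(normalise("9 Henry St, Cork") == normalise("9 Henry st, cork"))
print(account_to_string(example))
```

Simple normalisation like this does not resolve misspellings such as "Falafel" vs "Fallafell"; those are left to the similarity measures applied later.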
Raw fields (x100,000): account holder, business name, premises address, billing address, ...
  → NLTK Punkt tokeniser → cleaned, parsed, tokenised text strings (x100,000)
  → sklearn TfidfVectorizer → TF-IDF 2-D matrix
  → sklearn TruncatedSVD → SVD N-D matrix
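
The vectorising stages of this pipeline can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the project's actual code: the six input strings are the example accounts already cleaned by hand, the vectoriser uses its default word tokenisation instead of the Punkt tokeniser named on the slide, and `n_components=3` is an arbitrary choice for a toy dataset.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# One cleaned text string per account (hand-cleaned versions of the examples).
docs = [
    "bobs pizza big bob 5 high st wexford 5 high st wexford",
    "bobs pizza big bob temple bar d2 5 high st wexford",
    "mikes kebabs mad mike 3 upper st dublin 5 high st wexford",
    "marks kebabs mild mark 8 upper st dublin main st waterford",
    "freds falafel fat fred 9 henry st cork 9 henry st cork",
    "fred fallafell freddie bridges st galway henrys st cork",
]

# TF-IDF: a sparse 2-D matrix with one row per account, one column per term.
vectoriser = TfidfVectorizer()
tfidf = vectoriser.fit_transform(docs)

# Truncated SVD: project the sparse matrix onto a few dense components.
# n_components=3 is an assumption; a real dataset would use far more.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape)    # (number of accounts, number of distinct terms)
print(reduced.shape)  # (number of accounts, n_components)
```

TruncatedSVD accepts the sparse TF-IDF matrix directly, which matters at 100,000 rows because the matrix never has to be converted to a dense array.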
1. Suggest similar accounts to be grouped
2. Human validation & verification
3. Incorporate & propagate valid groupings

Tools: sklearn.MiniBatchKMeans, sklearn.AffinityPropagation, sklearn.RadiusNeighborsClassifier
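
The three-step loop above might look roughly like the sketch below. Everything in it is assumed for illustration: the 2-D points stand in for SVD-reduced account vectors, the "confirmed" labels stand in for a human reviewer's decisions, and the slide does not say which estimator served which step. MiniBatchKMeans is used here for the suggestion step; AffinityPropagation is an alternative that does not need the number of clusters in advance.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.neighbors import RadiusNeighborsClassifier

# Hypothetical stand-in for reduced account vectors: three tight groups.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],     # accounts 0-2
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],     # accounts 3-5
    [10.0, 0.0], [10.1, 0.0], [10.0, 0.1],  # accounts 6-8
])

# Step 1: suggest similar accounts to be grouped.
# reassignment_ratio=0 switches off random re-seeding of small clusters,
# which only adds noise on a toy dataset of nine points.
km = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0,
                     reassignment_ratio=0)
suggested = km.fit_predict(X)

# Step 2: human validation. Assume a reviewer has confirmed company IDs
# for two accounts in each suggested group (hypothetical labels).
confirmed_idx = np.array([0, 1, 3, 4, 6, 7])
confirmed_company = np.array([1, 1, 2, 2, 3, 3])

# Step 3: propagate confirmed groupings to nearby unlabelled accounts.
# Accounts with no confirmed neighbour within the radius get -1 (no company).
clf = RadiusNeighborsClassifier(radius=0.5, outlier_label=-1)
clf.fit(X[confirmed_idx], confirmed_company)

unlabelled = np.vstack([X[[2, 5, 8]], [[20.0, 20.0]]])  # last point is isolated
propagated = clf.predict(unlabelled)

print(suggested)
print(propagated)
```

A radius-based classifier suits the propagation step because an account far from every confirmed group is left unassigned instead of being forced into the nearest company.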

● A very quick turnaround from raw data to tagged companies, at 93% accuracy
● ~40% of accounts found to belong to a company, with ~3.5 accounts per company
● NLP toolkits and scikit-learn allowed rapid development and testing of the solution
● Incorporated human identification at critical stages: no ML problem is an island
Any questions?

Text Mining to Correct Missing CRM Information by Jonathan Sedar
Text mining to correct missing CRM information: a practical data science project
