Data Science and Machine Learning
for eCommerce and Retail
Dr. Andrei Lopatenko
Director of Engineering,
Recruit Institute of Technology
Recruit Holdings
former Walmart Labs, Google (twice), Apple (twice)
andrei@recruit.ai
ML for eCommerce
• Search and browse for commerce sites and applications
• Help users find and discover items they will purchase
• Maximize revenue/profit per user session
Search
Search - ranking
Search - LHN (Left Hand Navigation)
Search spell correction
Search type ahead
Browse
Search data size
• Catalogue items
• 8 M items now, compared with ~400 M at Amazon / eBay
• ×10 in the near future
• 2 KB text description per item + images
• Several hundred structured attributes per catalog
Search – user searches
• Tens of millions per day
• Tens of billions of sessions per year
• Online sales: $13.2 B per year (http://fortune.com/2015/11/17/walmart-ecommerce/)
• Offline store sales: $500 B per year (8% of the USA economy) in ~11K stores
• Number of transactions: ~10 B (public data)
ML addressable problems
• Learning to rank
• Given a query, what is the list of items with the highest probability of conversion (purchase), ATC (add to cart), or page view?
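One way to picture the ranking objective above is a minimal sketch that orders candidate items by a weighted sum of predicted probabilities. The `Candidate` fields, the weights, and the idea of combining the three signals linearly are assumptions for illustration; in practice the probabilities would come from a trained model and the combination would itself be learned.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    item_id: str
    p_purchase: float  # predicted probability of purchase (assumed model output)
    p_atc: float       # predicted probability of add to cart
    p_view: float      # predicted probability of an item page view


# Hypothetical business weights: a purchase matters more than an ATC or a view.
WEIGHTS = (1.0, 0.3, 0.05)


def score(c: Candidate) -> float:
    """Weighted sum of the three predicted engagement probabilities."""
    return WEIGHTS[0] * c.p_purchase + WEIGHTS[1] * c.p_atc + WEIGHTS[2] * c.p_view


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    items = [
        Candidate("a", 0.02, 0.10, 0.50),
        Candidate("b", 0.05, 0.08, 0.30),
        Candidate("c", 0.01, 0.02, 0.90),
    ]
    print([c.item_id for c in rank(items)])
```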
ML addressable problems
• Typeahead
• Given a sequence of characters typed by the user, what are the most probable completions, and what are the most probable items the user wants to buy?
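As a rough illustration of the completion half of this problem, the sketch below suggests the most frequent past queries that start with the typed prefix. The query log and the function names are hypothetical, and a production system would use a trie or a dedicated suggester rather than a linear scan.

```python
from collections import Counter


def build_index(query_log: list[str]) -> Counter:
    """Count how often each normalized query appears in the log."""
    return Counter(q.strip().lower() for q in query_log)


def complete(prefix: str, counts: Counter, k: int = 3) -> list[str]:
    """Return up to k queries starting with prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically for a stable order.
    matches.sort(key=lambda qn: (-qn[1], qn[0]))
    return [q for q, _ in matches[:k]]


if __name__ == "__main__":
    log = ["samsung tv", "samsung tv", "samsung phone",
           "sandals", "samsung tv", "sandals"]
    print(complete("sam", build_index(log)))
```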
ML addressable problems
• Spell correction
• Given a user query, what is the query the user actually wanted to type?
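A common baseline for this task, not necessarily what the author used, is a frequency-weighted lookup over all strings one edit away from the typed word. The vocabulary counts below are made up for illustration; real systems typically add context, phonetic, and click-feedback signals.

```python
import string


def edits1(word: str) -> set[str]:
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, vocab_counts: dict[str, int]) -> str:
    """Return the most frequent known word within one edit, else the input."""
    if word in vocab_counts:
        return word
    candidates = [w for w in edits1(word) if w in vocab_counts]
    if not candidates:
        return word
    # Prefer higher frequency; break ties alphabetically for determinism.
    return max(candidates, key=lambda w: (vocab_counts[w], w))


if __name__ == "__main__":
    vocab = {"samsung": 120, "sandals": 40, "shoes": 300}  # hypothetical counts
    print(correct("samsnug", vocab), correct("shoez", vocab))
```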
ML addressable problems
• Cold start
• Given a new item with its set of attributes and no history of sales or exposure on site, predict the item's sales and its sales per query
ML addressable problems
• Prediction of LHN
• Given a user query, what is the best set of facets and facet values, i.e. the set with the highest probability of users interacting with them and finally buying an item?
ML addressable problems
• Query understanding
• Given a query, build a semantic parse of the query and tag tokens with attributes: blue tshirts for teenagers -> blue:color tshirts:type for:opt teenagers:agerestriction10-20
• Classification: blue tshirts for teenagers -> type:apparel, price preference: 10-30, releaseyearpreference: 2014-2016
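The token-tagging output on this slide can be reproduced with a toy dictionary lookup, shown below. The `LEXICON` table is a hypothetical stand-in containing only the slide's example; a real query-understanding system would use a trained sequence tagger rather than a fixed dictionary.

```python
# Hypothetical lexicon covering only the example query from the slide.
LEXICON = {
    "blue": "color",
    "tshirts": "type",
    "for": "opt",
    "teenagers": "agerestriction10-20",
}


def tag_query(query: str, lexicon: dict[str, str] = LEXICON) -> list[tuple[str, str]]:
    """Tag each whitespace-separated token; unseen tokens get 'unknown'."""
    return [(tok, lexicon.get(tok, "unknown")) for tok in query.lower().split()]


def format_parse(tagged: list[tuple[str, str]]) -> str:
    """Render tagged tokens in the token:attribute notation used on the slide."""
    return " ".join(f"{tok}:{tag}" for tok, tag in tagged)


if __name__ == "__main__":
    print(format_parse(tag_query("blue tshirts for teenagers")))
```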
ML addressable problems
• Related searches
• Given a query, which queries are either semantically close to it or represent coinciding user interests?
• Nike shoes -> adidas shoes, sport shoes
• Coffee mugs -> travel mugs, photo coffee mugs, cappuccino cups
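One simple way to approximate the "coinciding interests" signal is to count which queries appear together in the same user session. The sketch below assumes sessions are available as lists of query strings (the sample data is invented); it does not cover the semantic-similarity side of the problem.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_related(sessions: list[list[str]]) -> dict[str, Counter]:
    """For each query, count the other queries seen in the same session."""
    related: dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        # Use a set so a repeated query is counted once per session.
        for a, b in permutations(set(session), 2):
            related[a][b] += 1
    return related


def related_queries(query: str, related: dict[str, Counter], k: int = 3) -> list[str]:
    """Return up to k co-occurring queries, most frequent first."""
    pairs = sorted(related.get(query, {}).items(), key=lambda qn: (-qn[1], qn[0]))
    return [q for q, _ in pairs[:k]]


if __name__ == "__main__":
    sessions = [
        ["nike shoes", "adidas shoes"],
        ["nike shoes", "sport shoes", "adidas shoes"],
        ["coffee mugs", "travel mugs"],
    ]
    print(related_queries("nike shoes", build_related(sessions)))
```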
ML addressable problems
• Product discovery
• help users explore the product assortment
• drive users to diverse products
• reduce the risk of selecting irrelevant items
• help find price, quality, brand, etc. alternatives
• reduce pigeonhole risk
• provide relevant data to make a decision
ML addressable problems
• Image similarity
• Given images of an item, return other items whose images are visually appealing to users who like the original item (appealing by shape? color? texture?) -> causing high conversion in recommendations
ML addressable problems
• Voice search
• Given voice input, reply with a list of the
best items
• “what are the cheapest samsung tvs in the
store”
• “what is best deal on queen bed today?”
ML addressable problems
• extraction of item attributes
• Given an item: what are item attributes:
brand, color, size (wheel, screen, height,
S/M/XL, Queen/Twin/King/Full), Gender,
Pattern, Shape, Features
ML addressable problems
• Representations of users: actions on websites/apps -> searches, clicks, browsing behaviour; product -> purchase preferences, reviews, ratings, return rates
ML addressable problems
• Title generation: how to generate the title that will yield the maximum conversion rate
• Which product attributes to select for the title?
What makes a good title?
Limits
• Most models should be served in production
• 50 ms per prediction
• Part of a big system, memory limits ~10 GB
Retail
• Key directions which require machine
learning:
• discounting tools
• coupons and rewards
• loyalty
• inventory management
Inventory management
• Customers want to buy products
• Customers have diverse needs
• Products should be in stock, ideally in warehouses close to customers
• but it is expensive to store products
• Problem: how many products of each type should be stored, and when should product supply be refilled?
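The "when to refill" question is often introduced with the textbook reorder-point rule: reorder when stock falls to expected demand over the supplier lead time plus a safety stock. The sketch below uses that standard formula as an illustration only; the slides do not say which method was used, and the parameter names and default `z` (roughly a 95% service level under a normal-demand assumption) are assumptions.

```python
import math


def reorder_point(daily_demand_mean: float,
                  daily_demand_std: float,
                  lead_time_days: float,
                  z: float = 1.65) -> float:
    """Stock level at which a replenishment order should be placed.

    Assumes independent, normally distributed daily demand and a fixed lead time.
    """
    safety_stock = z * daily_demand_std * math.sqrt(lead_time_days)
    return daily_demand_mean * lead_time_days + safety_stock


def should_reorder(on_hand: float, rop: float) -> bool:
    """True when current stock has fallen to or below the reorder point."""
    return on_hand <= rop


if __name__ == "__main__":
    # Hypothetical item: 20 units/day on average, std 5, 4-day lead time.
    rop = reorder_point(20, 5, 4, z=2.0)
    print(rop, should_reorder(90, rop))
```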
Customer intelligence
• Retail
• analyze sales data, find anomalies, explain them
• low sales of umbrellas during the last month in Northern California stores
• No rain? (integration with external data about weather conditions)
• Seasonal / the same as last year / time series
• Competitors
Fraud detection
• identify fraudulent transactions online
• Hundreds of fraud schemes detected daily
• Global retail shrinkage was $119 billion in 2011, an average of 1.45% of retail sales.
• from stolen credit cards to replaced price tags and price discounts given by high-level managers to achieve personal goals
Propensity Modeling for Marketing
Campaigns
• build effective email/Facebook/Google ads campaigns addressing the proper customer at the proper time at the proper cost
• behavior-based customer segmentation and clustering with demographic, lifestyle, and attitudinal information
Online Grocery
• which items can be replaced by other items, and by which items?
• data are individual purchases in chain grocery stores, drug stores, and online grocery shopping
• the problem: find which items can replace an item that is not in store, in order to fulfill the order
Dynamic pricing
• define the best price
• continuously scrape competitors' prices, predict demand by price, know the expenses
• online commerce sites change prices
every 10 minutes
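The three inputs named above (demand as a function of price, known expenses, and a set of candidate prices) can be combined in a minimal sketch that picks the candidate price with the highest expected profit. The linear demand curve and all numbers are hypothetical; a real system would fit the demand model from data and would also factor in scraped competitor prices.

```python
from typing import Callable, Iterable


def expected_profit(price: float, unit_cost: float,
                    demand_fn: Callable[[float], float]) -> float:
    """Margin per unit times predicted units sold at this price."""
    return (price - unit_cost) * demand_fn(price)


def best_price(candidates: Iterable[float], unit_cost: float,
               demand_fn: Callable[[float], float]) -> float:
    """Return the candidate price with the highest expected profit."""
    return max(candidates, key=lambda p: expected_profit(p, unit_cost, demand_fn))


def linear_demand(price: float) -> float:
    """Hypothetical demand curve: 100 units at price 0, losing 4 units per dollar."""
    return max(0.0, 100.0 - 4.0 * price)


if __name__ == "__main__":
    prices = [10, 12, 14, 16, 18, 20]
    print(best_price(prices, unit_cost=8.0, demand_fn=linear_demand))
```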
Challenges
• Data volumes: transactions: Walmart: 10
Million per day
• Computations: complicated modeling
techniques
Hardware platform
• Needs:
• Data storage
• Data processing
• Serving online
Data storage
• Volumes of data:
• 10 M transactions per day over 5 years -> 18 billion transactions -> ~1 TB
• Catalog: 500 M items × 2 KB each -> ~1 TB
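The back-of-the-envelope numbers on this slide can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 10^12 bytes, 2K = 2,000 bytes) and 365-day years; the implied bytes-per-transaction figure is derived from the slide's totals, not stated in the source.

```python
TB = 10**12  # decimal terabyte, an assumption about the slide's units

# Transactions: 10 M per day for 5 years.
tx_per_day = 10_000_000
years = 5
transactions = tx_per_day * 365 * years

# If those transactions fit in ~1 TB, each record is only a few dozen bytes.
implied_bytes_per_tx = TB / transactions

# Catalog: 500 M items at ~2 KB of text each.
catalog_bytes = 500_000_000 * 2_000

if __name__ == "__main__":
    print(f"transactions over {years} years: {transactions:,}")
    print(f"implied size per transaction: {implied_bytes_per_tx:.1f} bytes")
    print(f"catalog size: {catalog_bytes / TB:.1f} TB")
```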
Data Storage
• but with video: petabytes of data; RetailNext: 75 PB per year from 30,000+ sensors
• Walmart: 500 PB
• eBay: 40 PB in 2013 (transactions + online behaviours)
Data processing
• Rebuild model over fresh data:
• typically daily: add daily data (millions of transactions, hundreds of millions of behavior units) to the yearly data store (billions of transactions, hundreds of billions to trillions of behavior units)
• build a model to serve in production the
next day
Data processing
• some models, such as fraud detection and dynamic pricing, should be almost online (10-15 minutes)
• built over data such as daily transactions or web crawls of competitors' sites
Serving online
• online commerce WML: thousands to tens of thousands of queries per second in peak times
• complicated ranking and recommendation algorithms
• 50ms limit
serving online
• price, in-store availability: millions of requests per second in peak times
• item information: millions of requests per second
• serving online: Solr/Lucene/Elasticsearch, Cassandra, MongoDB, Oracle, CouchDB, Node.js/Java solutions, etc.
Data processing
• Hadoop / Spark clusters
• a lot of I/O
• HDFS handles redundancy, so RAID is not necessary; RAID is slow to write, and Hadoop writes a lot
• SAN and NAS are not good either
• so bare metal with DAS (Direct-Attached Storage)
Data Processing
• more servers, cheaper servers
• more, smaller disks are better than fewer large disks
• allocate cluster 100% to Hadoop
Data processing
• Hadoop Masters vs Workers
• large clusters: Masters > 64 GB RAM, dual Ethernet NIC, dual quad-core CPU
• Workers: memory 64 GB+, SAS 6 Gb/s disk controllers, 2 Ethernet cards, 2 × 6-core processors, 15 MB cache, Intel's Hyper-Threading and QPI good to have
Data Processing
• big models, deep learning
• Nvidia DGX-1 and the like
• Pascal GPUs, NVLink interconnect
• Tesla K40, K80 work pretty well too
• may require a lot of tuning: http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/
• hard to buy: big data solutions are considered profit generators, HPC servers are not
Serving online
• Typically large memory, but not necessarily (for example, Elasticsearch/Solr degrades over 64 GB)
• CPUs: more cores rather than faster cores
• Disks: SSD, RAID 0, no NAS; in many conditions, optimize for how easy it is to change drives rather than for SSD endurance
ecommerce example
• Database servers
• Unified hardware platform : from HP
• HP DL line:
• 4 cpu sockets
• 256 GB RAM
• network interfaces
• not much HDD, data is in NAS
ecommerce example
• cloud servers:
• purchased by racks: 40 in a rack
• 2 CPU sockets
• 198G
• 18 core CPU
• SSD
network requirements
• 1 network card per server and 1 switch per rack is a big mistake
• 3 cards per server:
• typical three data flows:
• production
• “administrative” (Docker etc.)
• analytics
example
• application servers vs big data servers
• application servers (Java, Node.js apps):
• 1TB SSD, RAID 5
• Big data servers:
• 5T SAS
Questions?
Dr. Andrei Lopatenko
Director of Engineering,
Recruit Institute of Technology
Recruit Holdings
• andrei@recruit.ai
