Speeding Up Minwise Hashing for Weighted Sets
• Useful if objects can be represented as sets of features and Jaccard similarity is an appropriate similarity measure
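As a minimal sketch of the similarity measure named above, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The function name `jaccard` and the convention for two empty sets are assumptions made for illustration; the word sets are the two example sentences of this talk.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Word sets of the two example sentences.
doc1 = {"I", "hate", "the", "coronavirus"}
doc2 = {"I", "hate", "lockdowns"}

similarity = jaccard(doc1, doc2)  # 2 shared words out of 5 distinct words
print(similarity)
```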
Minwise hashing (used for deduplication of similar web pages)

Object → Set representation → (minwise hashing) → Signature → Similarity estimation
“I hate the coronavirus!” → {I, hate, the, coronavirus} → 25 21 18 41 98 12 15 41
“I hate lockdowns!” → {I, hate, lockdowns} → 25 32 18 11 98 56 33 72
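The similarity estimation step can be sketched as follows: with minwise hashing, the fraction of signature components on which two signatures agree serves as an estimate of the Jaccard similarity. The helper name `estimate_similarity` is hypothetical; the two signatures are the ones shown on the slide.

```python
def estimate_similarity(sig_a, sig_b):
    """Fraction of positions at which two equally long signatures agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig1 = [25, 21, 18, 41, 98, 12, 15, 41]  # "I hate the coronavirus!"
sig2 = [25, 32, 18, 11, 98, 56, 33, 72]  # "I hate lockdowns!"

estimate = estimate_similarity(sig1, sig2)  # 3 of 8 components agree
print(estimate)
```

Here 3 of 8 components match, giving an estimate of 0.375, while the exact Jaccard similarity of the two word sets is 2/5 = 0.4. Longer signatures reduce the estimation error.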
input set      independent hash functions
I              25  63  98
hate           67  41  18
the            79  34  35
coronavirus    36  21  52
signature      25  21  18

The minimum hash value of each hash function defines the corresponding signature component.
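A minimal sketch of this signature computation: each signature component is the minimum, over all elements of the input set, of one hash function. Using salted BLAKE2b from `hashlib` as a stand-in for independent hash functions is an assumption made for illustration, as are the names `hash_value` and `minhash_signature`. The last lines reproduce the slide's example from its table of hash values.

```python
import hashlib


def hash_value(element: str, index: int) -> int:
    """64-bit hash of `element` under the hash function numbered `index`."""
    digest = hashlib.blake2b(
        element.encode("utf-8"),
        digest_size=8,
        salt=index.to_bytes(8, "little"),  # a different salt per hash function
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(elements, m: int) -> list:
    """Signature of length m: component i is the minimum of hash function i."""
    elements = list(elements)
    if not elements:
        raise ValueError("input set must not be empty")
    return [min(hash_value(e, i) for e in elements) for i in range(m)]


# The slide's example: taking column-wise minima of the hash value table.
table = {
    "I": [25, 63, 98],
    "hate": [67, 41, 18],
    "the": [79, 34, 35],
    "coronavirus": [36, 21, 52],
}
slide_signature = [min(column) for column in zip(*table.values())]
print(slide_signature)
```

This direct approach evaluates m hash functions per element, so its cost grows with the product of set size and signature length.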
claims that Ioffe’s algorithm is wrong!
“BagMinHash - Minwise hashing algorithm for weighted sets” (Ertl, KDD 2018)
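For weighted sets, the similarity being estimated is the weighted Jaccard similarity: the sum of element-wise minimum weights divided by the sum of element-wise maximum weights. The sketch below computes it exactly as a reference point. The function name `weighted_jaccard`, the dictionary representation and the example weights are all assumptions made for illustration.

```python
def weighted_jaccard(w_a: dict, w_b: dict) -> float:
    """Weighted Jaccard similarity: sum of minima / sum of maxima of weights."""
    keys = set(w_a) | set(w_b)
    numerator = sum(min(w_a.get(k, 0.0), w_b.get(k, 0.0)) for k in keys)
    denominator = sum(max(w_a.get(k, 0.0), w_b.get(k, 0.0)) for k in keys)
    if denominator == 0.0:
        return 1.0  # assumed convention: two empty weighted sets are identical
    return numerator / denominator


# Hypothetical weights, e.g. word counts; missing elements have weight 0.
w1 = {"I": 1, "hate": 2, "the": 1, "coronavirus": 1}
w2 = {"I": 1, "hate": 3, "lockdowns": 1}

j_w = weighted_jaccard(w1, w2)  # (1 + 2) / (1 + 3 + 1 + 1 + 1) = 3/7
print(j_w)
```

With all weights equal to 0 or 1, this reduces to the ordinary Jaccard similarity of the underlying sets.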
“DartMinHash: Fast Sketching for Weighted Sets” (Christiani, 2020)
https://github.com/oertl/treeminhash
http://www.nrbook.com/devroye/Devroye_files/chapter_five.pdf
DartMinHash performs best if weights are normalized
Performance of DartMinHash depends on total weight
“Maximally consistent sampling and the Jaccard index of probability distributions” (Moulton & Jiang, ICDMW 2018)
“ProbMinHash–A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity” (Ertl, TKDE 2020)
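The probability Jaccard similarity J_P referred to in these two papers can be computed exactly with the sketch below. It follows the definition from Moulton & Jiang: J_P(A, B) is the sum, over all elements d with positive weight in both sets, of 1 / Σ_d' max(w_A(d') / w_A(d), w_B(d') / w_B(d)). The function name `probability_jaccard` and the dictionary representation are assumptions made for illustration. The double loop takes quadratic time, so this is meant only as a reference for checking sketch-based estimates.

```python
def probability_jaccard(w_a: dict, w_b: dict) -> float:
    """Exact probability Jaccard similarity J_P of two weighted sets."""
    keys = set(w_a) | set(w_b)
    total = 0.0
    for d in keys:
        a, b = w_a.get(d, 0.0), w_b.get(d, 0.0)
        if a <= 0.0 or b <= 0.0:
            continue  # only elements present in both sets contribute
        denominator = sum(
            max(w_a.get(e, 0.0) / a, w_b.get(e, 0.0) / b) for e in keys
        )
        total += 1.0 / denominator
    return total


# Hypothetical weights: the second set is the first scaled by a factor of 3.
w1 = {"I": 1.0, "hate": 2.0}
w2 = {"I": 3.0, "hate": 6.0}

j_p = probability_jaccard(w1, w2)
print(j_p)
```

Because only weight ratios within each set enter the formula, J_P does not change when all weights of one set are multiplied by the same positive factor, which is why the two proportional sets above have similarity 1 up to rounding.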
ProbMinHash variants:

                                Label sampling
                                with replacement    w/o replacement
Point sampling   uncorrelated   ProbMinHash1        ProbMinHash2
                 correlated     ProbMinHash3        ProbMinHash4
Correlated point generation of ProbMinHash3/4 may reduce estimation error for small sets!