Framework for a suite of Co-clustering
algorithms for predictive modeling on Hadoop




 Vaijanath N. Rao
 (vaijanath.rao@teamaol.com)
 Rohini Uppuluri
 (rohini.uppuluri@teamaol.com)
Agenda
• Introduction
   • Background
   • Some Approaches

• Co-Clustering
   • Introduction
   • Related Work
   • Why Hadoop?

• Goal
• Our Framework
• Conclusions and Future Work

                                Presentation for
                                       [CLIENT]
Background
Modeling for Prediction
   • Will user A like this movie?

   • Will user B like this camera?

   • Customer purchase decisions in an e-commerce setting

   And tons of other things…




Some Approaches
• Collaborative filtering
   • User-based, item-based, model-based, content-based, hybrid, etc. (see [1], [2])
• Latent Models
   • Probabilistic Latent Semantic Indexing [3,6]
   • Matrix Factorization [4,7,8]
   • Predictive Discrete Latent Factor models [5]

• Co-clustering
   • Clustering along multiple axes [9,10], etc.; survey in [16]




Co-clustering
[Figure: a Users × Products matrix of known (0/1) and unknown (?) entries, with User Attributes and Product Attributes as side information. Repeated rounds of Row Cluster Updation, Column Cluster Updation, and Global Model Updation reduce the error and yield a Clustered Users × Clustered Products matrix.]

Input (Users × Products):
   1   0   1   1   1
   0   ?   1   ?   0
   1   1   ?   0   0
   ?   0   0   ?   0

Output (Clustered Users × Clustered Products):
   0   0   1   1   1
   1   ?   1   ?   0
   1   1   ?   0   0
   ?   1   0   ?   0
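To make the idea on this slide concrete, here is a minimal sketch of how a finished co-clustering can be used for prediction: each unknown entry ("?") is estimated by the mean of the observed entries in its (row cluster, column cluster) block. The matrix is the Users × Products input from the slide; the cluster assignments, the block-mean model, and all function names are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

# Users x Products matrix from the slide; None marks an unknown ("?") entry.
M = [
    [1, 0, 1, 1, 1],
    [0, None, 1, None, 0],
    [1, 1, None, 0, 0],
    [None, 0, 0, None, 0],
]

# Hypothetical cluster assignments, assumed for illustration only.
row_cluster = [0, 0, 1, 1]      # user i -> row cluster
col_cluster = [0, 0, 1, 1, 1]   # product j -> column cluster


def block_means(matrix, rows, cols):
    """Mean of the observed entries in every (row cluster, column cluster) block."""
    total = defaultdict(float)
    count = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                key = (rows[i], cols[j])
                total[key] += value
                count[key] += 1
    return {key: total[key] / count[key] for key in total}


def predict(matrix, rows, cols, i, j):
    """Estimate entry (i, j) by the mean of its co-cluster block."""
    return block_means(matrix, rows, cols)[(rows[i], cols[j])]


means = block_means(M, row_cluster, col_cluster)
# User 1, product 3 is unknown; its block holds four 1s and one 0.
print(predict(M, row_cluster, col_cluster, 1, 3))  # prints 0.8
```

With these assumed assignments, the unknown entry for user 1 and product 3 falls in a block with five observed values, four of which are 1, so the estimate is 0.8.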
Some Approaches
• Bregman co-clustering framework [11]
   • Information-theoretic co-clustering [12]
   • Minimum sum-squared residue co-clustering [13]

• Scalable framework based on the Bregman framework [14]
• DisCo [15]




Why Hadoop?
• Real-world data is huge
• The matrix to operate on is large (millions of rows, millions of columns)
• A lot of computation




Goal
• Many approaches exist, so a common framework is needed
• Build a framework that accommodates multiple algorithms on Hadoop
• Make it easy for users to choose and use an algorithm




Overview
[Figure: the Input feeds a Row Cluster Updator Job, which produces Row Clusters, and a Column Cluster Updator Job, which produces Column Clusters. A Global Model Updator Job produces the Global Model.]
Overview : Core Interfaces
• Input vector (type, id, datavec, attributevec, cost, assignment)
• Cluster (vector, len)
   • Row Cluster
   • Column Cluster
• Distance/Error Function (vector1, vector2)
• Model (matrix)
   • Row Model
   • Column Model
   • Group Model
• Objective Function (Model1, Model2)
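The interfaces listed above could be sketched roughly as follows. The field and parameter names come from the slide; everything else (the Python types, treating a cluster's vector as a running sum of its members, squared error as the distance, and the helper names `add`, `centroid`, `squared_error`, `model_change`) is an assumption made for illustration and may differ from the actual framework.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


@dataclass
class InputVector:
    # Field names follow the slide; "type" is assumed to be "row" or "column".
    type: str
    id: int
    datavec: Vector
    attributevec: Vector
    cost: float = float("inf")   # error against the currently assigned cluster
    assignment: int = -1         # index of the assigned cluster, -1 if none


@dataclass
class Cluster:
    vector: Vector   # assumed: running sum of the member vectors
    len: int = 0     # number of members

    def add(self, member: Vector) -> None:
        self.vector = [a + b for a, b in zip(self.vector, member)]
        self.len += 1

    def centroid(self) -> Vector:
        return [a / self.len for a in self.vector] if self.len else list(self.vector)


@dataclass
class Model:
    matrix: Matrix


# Distance/Error Function (vector1, vector2)
ErrorFunction = Callable[[Vector, Vector], float]
# Objective Function (Model1, Model2)
ObjectiveFunction = Callable[[Model, Model], float]


def squared_error(vector1: Vector, vector2: Vector) -> float:
    """One possible error function: sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(vector1, vector2))


def model_change(model1: Model, model2: Model) -> float:
    """One possible objective: how much the model moved between two iterations."""
    return sum(squared_error(r1, r2) for r1, r2 in zip(model1.matrix, model2.matrix))


cluster = Cluster(vector=[0.0, 0.0])
cluster.add([1.0, 0.0])
cluster.add([1.0, 1.0])
print(cluster.centroid())  # prints [1.0, 0.5]
```

Keeping the error and objective functions as plain callables is one way to let different co-clustering algorithms plug in their own measures without changing the cluster or model types.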




Currently we have
• Graph-based bi-clustering
• DisCo




DisCo Algorithm
 1. Initialization
           1.1 Initialize row and column clusters
           1.2 Compute global model
 2. While the objective function keeps improving
           2.1 For each row in the data, pick the row group
                that minimizes error
           2.2 Update row clusters
           2.3 Update global model
           2.4 For each column in the data, pick the column
                group that minimizes error
           2.5 Update column clusters
           2.6 Update global model
 3. Return row and column clusters
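The alternating steps above can be sketched on a single machine as follows. This is a simplified illustration, not the DisCo implementation: it assumes a small dense matrix with no missing entries, uses block means as the global model and squared error as the objective, and all function names are hypothetical.

```python
def transpose(A):
    return [list(column) for column in zip(*A)]


def block_means(M, rows, cols, k, l):
    """Global model: mean of the entries in each (row group, column group) block."""
    total = [[0.0] * l for _ in range(k)]
    count = [[0] * l for _ in range(k)]
    for i, row in enumerate(M):
        for j, value in enumerate(row):
            total[rows[i]][cols[j]] += value
            count[rows[i]][cols[j]] += 1
    return [[total[g][h] / count[g][h] if count[g][h] else 0.0 for h in range(l)]
            for g in range(k)]


def objective(M, rows, cols, model):
    """Total squared error between the data and the global model."""
    return sum((value - model[rows[i]][cols[j]]) ** 2
               for i, row in enumerate(M) for j, value in enumerate(row))


def best_groups(vectors, other_assignment, model):
    """Assign each vector to the group whose model row gives the smallest error."""
    result = []
    for vector in vectors:
        errors = [sum((value - model[g][other_assignment[j]]) ** 2
                      for j, value in enumerate(vector))
                  for g in range(len(model))]
        result.append(errors.index(min(errors)))
    return result


def cocluster(M, rows, cols, k, l, max_iter=20):
    model = block_means(M, rows, cols, k, l)                       # step 1.2
    best = objective(M, rows, cols, model)
    for _ in range(max_iter):                                      # step 2
        rows = best_groups(M, cols, model)                         # steps 2.1, 2.2
        model = block_means(M, rows, cols, k, l)                   # step 2.3
        cols = best_groups(transpose(M), rows, transpose(model))   # steps 2.4, 2.5
        model = block_means(M, rows, cols, k, l)                   # step 2.6
        current = objective(M, rows, cols, model)
        if current >= best:                                        # no more improvement
            break
        best = current
    return rows, cols, model                                       # step 3


M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
# Deliberately imperfect starting assignments (step 1.1).
rows, cols, model = cocluster(M, [0, 0, 0, 1], [0, 1, 1, 1], k=2, l=2)
print(rows, cols, model)
```

On this toy matrix the loop recovers the two row groups and two column groups after one pass and stops on the second pass, when the error no longer decreases. The column step reuses the row step on the transposed matrix and transposed model, which is one way to keep the two updates symmetric.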

Pick the Best Row Group/Cluster




Example




RowCluster Updator Job




Example




BiClustering




Pick the Best Row Group/Cluster




Example




Row Updator Job
• RowCluster Mapper: pick the best row group (cluster), i.e. the one that minimizes cost or error
   • Input key: lineId
   • Input value: 0, rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError
   • Output (key type DATA): key = rowId; value = clickVector, attributeVector, bestRowClusterId, cost
   • Output (key type ROWCLUSTER): key = best row cluster id; value = clickVector
• RowCluster Reducer: acts on the key type
   • DATA: just emit rowId, clickVector, attributeVector, bestRowClusterId, cost
   • ROWCLUSTER: aggregate the row cluster and also write the updated row clusters
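The mapper and reducer on this slide can be imitated in memory to show the data flow. The sketch below is an assumption-laden stand-in, not Hadoop code: the shuffle is replaced by a dictionary, each cluster is represented by a centroid, the error is the squared distance between the click vector and that centroid (the attribute vector is carried along but not used), and the function names are hypothetical.

```python
from collections import defaultdict

DATA, ROWCLUSTER = "DATA", "ROWCLUSTER"


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def row_cluster_mapper(line_id, record, centroids):
    """Pick the row cluster that minimizes the error and emit two tagged records."""
    row_id, click_vector, attribute_vector, _cur_id, _cur_error = record
    costs = [squared_error(click_vector, c) for c in centroids]
    best_id = costs.index(min(costs))
    yield (DATA, row_id), (click_vector, attribute_vector, best_id, costs[best_id])
    yield (ROWCLUSTER, best_id), click_vector


def row_cluster_reducer(key, values):
    """DATA: pass through. ROWCLUSTER: aggregate members into an updated centroid."""
    key_type, _key_id = key
    if key_type == DATA:
        for value in values:
            yield key, value
    else:
        size = len(values)
        yield key, [sum(column) / size for column in zip(*values)]


def run_job(lines, centroids):
    """In-memory stand-in for the shuffle phase: group mapper output by key."""
    grouped = defaultdict(list)
    for line_id, record in lines:
        for key, value in row_cluster_mapper(line_id, record, centroids):
            grouped[key].append(value)
    output = {}
    for key in sorted(grouped):
        for out_key, out_value in row_cluster_reducer(key, grouped[key]):
            output[out_key] = out_value
    return output


# Each line: (lineId, (rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError))
lines = [
    (0, (10, [1, 1, 0], [0.5], 0, 0.0)),
    (1, (11, [1, 0, 0], [0.2], 0, 0.0)),
    (2, (12, [0, 0, 1], [0.9], 0, 0.0)),
]
centroids = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
output = run_job(lines, centroids)
print(output[(ROWCLUSTER, 0)])  # prints [1.0, 0.5, 0.0]
```

Rows 10 and 11 land in cluster 0 and row 12 in cluster 1, so the updated centroid of cluster 0 is the average of the first two click vectors.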
Example




          Presentation for
                 [CLIENT]
Conclusions and Future Work
• Implementing more algorithms
• Easy-to-use examples and more documentation




References
[1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for
     performing collaborative filtering. In SIGIR ’99, pages 230–237, 1999.
[2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML
     ’04, pages 65–72, 2004.
[3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI
     ’99, 1999.
[4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.
[5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale
     dyadic data. In Proc. KDD ’07, pages 26–35, 2007.
[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable
     online collaborative filtering. In WWW ’07, pages 271–280, 2007.
[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative
     filtering model. In KDD ’08, pages 426–434, 2008.
[8] H. Ma, H. Yang, M. Lyu, and I. King. SoRec: social recommendation using probabilistic
     matrix factorization. In CIKM ’08, pages 931–940, 2008.
[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages
     93–103, 2000.
[10] T. George and S. Merugu. A scalable collaborative filtering framework based on
     co-clustering. In ICDM ’05, pages 625–628, 2005.




References (contd.)
[11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum
    entropy approach to Bregman co-clustering and matrix approximation. JMLR,
    8:1919–1986, 2007.
[12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD
    ’03, pages 89–98, 2003.
[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue
    co-clustering of gene expression data. In Proc. SDM ’04, 2004.
[14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for
    discovering coherent co-clusters in noisy data. In ICML ’08, 2008.
[15] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with Map-Reduce: A case
    study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.
[16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A
    survey. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.




Thank you




Row Cluster Updator Job
• RowCluster Mapper: pick the best row group (cluster), i.e. the one that minimizes cost or error
   • Input key: rowId
   • Input value: clickVector, attributeVector, curRowClusterId, curRowClusterError
   • Output (value type DATA): key = rowId; value = clickVector, attributeVector, bestRowClusterId, cost
   • Output (value type ROWCLUSTER): key = rowId; value = updated rowCluster
   • Output (value type ROW GLOB MODEL): key = rowId; value = updated partial GlobalModel
• RowCluster Reducer: acts on the value type
   • DATA: just emit rowId, clickVector, attributeVector, bestRowClusterId, cost
   • ROW CLUSTER: aggregate the row cluster and also write the updated row clusters
   • ROW GLOB CLUSTER: aggregate the partial global model for the given row cluster and write the updated partial global model
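One plausible reading of the "Aggregate Partial Global Model for given row cluster" step is sketched below. It assumes that each row contributes per-column-cluster sums and counts, and that the reducer adds these up for one row cluster and takes block means. Both the representation and the function names are assumptions for illustration; the slide does not specify them.

```python
def partial_global_model(click_vector, col_cluster, num_col_clusters):
    """One row's contribution: per-column-cluster sums and counts."""
    sums = [0.0] * num_col_clusters
    counts = [0] * num_col_clusters
    for j, value in enumerate(click_vector):
        sums[col_cluster[j]] += value
        counts[col_cluster[j]] += 1
    return sums, counts


def aggregate_partials(partials):
    """Reducer side: add up the partials of one row cluster and take block means."""
    total_sums = [sum(s) for s in zip(*(p[0] for p in partials))]
    total_counts = [sum(c) for c in zip(*(p[1] for p in partials))]
    return [s / c if c else 0.0 for s, c in zip(total_sums, total_counts)]


col_cluster = [0, 0, 1]          # hypothetical column assignments
members = [[1, 1, 0], [1, 0, 0]]  # click vectors assigned to one row cluster
partials = [partial_global_model(v, col_cluster, 2) for v in members]
print(aggregate_partials(partials))  # prints [0.75, 0.0]
```

Because sums and counts are additive, partial results can be merged in any order, which is what makes this kind of aggregation suitable for a reducer (or a combiner) in a MapReduce job.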
Column Cluster Updator Job
• ColCluster Mapper: pick the best column group (cluster), i.e. the one that minimizes cost or error
   • Input key: colId
   • Input value: clickVector, attributeVector, curColClusterId, curColClusterError
   • Output (value type DATA): key = colId; value = clickVector, attributeVector, bestColClusterId, cost
   • Output (value type COLCLUSTER): key = colId; value = updated colCluster
   • Output (value type COL GLOB MODEL): key = colId; value = updated partial GlobalModel
• ColCluster Reducer: acts on the value type
   • DATA: just emit colId, clickVector, attributeVector, bestColClusterId, cost
   • COL CLUSTER: aggregate the column cluster and also write the updated column clusters
   • COL GLOB CLUSTER: aggregate the partial global model for the given column cluster and write the updated partial global model

Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for Predictive Modeling on Hadoop" by Vaijanath Rao and Rohini Uppuluri
