Demystifying Technology Assisted Review
Part 3: Deconstructing the Technology
Sonya L. Sigler
  
Agenda
  Review/Overview
  Underlying Search Technology
    dtSearch
    Lucene (open source)
    Others – MySQL, etc.
  Underlying Statistical Based Technology
    Rules Based Technology (Linguistic or Statistical)
    Bayesian Probabilistic Technologies
    Latent Semantic Indexing
  Q & A
  
Review/Overview - Search & Review Spectrum
  [Diagram: review approaches plotted on two axes, Cost Per Document and Organization Commitment]
  Linear Review
    Culling
    Iterative search
    Review
  Accelerated Review
    Email Threading
    Near Duplicate Detection
    CA - Clustering
    Categorization (Supervised)
  Automated Review
    Relevance Ranking
    Machine Learning
    Latent Semantic Indexing (statistical probability)
    Pattern Analysis
    Sampling Data for High Precision and Recall Rates
  
Underlying Technologies
  [Diagram: technologies grouped as linguistic or statistical]
  Rules Based Systems
  Linguistic – word based
    Key word Search
    dtSearch
    Lucene
    Other Search Engines
    Ontologies
  Statistical - #s based
    Bayesian Classification
    Support Vector Models
    Latent Semantic Indexing
  
Database Normalization
  From: Nuala Coogan Nuala@SFLData.com
  Subject: EDI Summit – Florida
  Date: October 3, 2012 10:11:21 AM PDT
  To: Sigler L. Sonya Sonya@sigler.name

  From: Nuala Coogan
  Subject: EDI Summit – Florida
  Date: 10/03/12
  To: Sonya Sigler
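
The two header blocks above show the same message before and after normalization. A minimal sketch of that step follows, assuming a hypothetical alias table (`ALIASES`) for name variants and a fixed "Month D, YYYY" date layout; real review platforms apply their own, usually far more elaborate, rules.

```python
import re
from datetime import datetime

# Hypothetical alias table mapping raw display names to one canonical form.
ALIASES = {"Sigler L. Sonya": "Sonya Sigler"}

def normalize_person(raw):
    """Drop a trailing e-mail address, then apply the alias table."""
    name = re.sub(r"\s*\S+@\S+\s*$", "", raw).strip()
    return ALIASES.get(name, name)

def normalize_date(raw):
    """Reduce a long timestamp to MM/DD/YY, ignoring time and zone."""
    date_part = " ".join(raw.split()[:3])  # e.g. "October 3, 2012"
    return datetime.strptime(date_part, "%B %d, %Y").strftime("%m/%d/%y")

header = {
    "From": "Nuala Coogan Nuala@SFLData.com",
    "Date": "October 3, 2012 10:11:21 AM PDT",
    "To": "Sigler L. Sonya Sonya@sigler.name",
}

normalized = {
    "From": normalize_person(header["From"]),
    "Date": normalize_date(header["Date"]),
    "To": normalize_person(header["To"]),
}
print(normalized)
```

Once every record uses the same name and date forms, searches and filters on those fields behave consistently across the collection.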
  
Tokenization
  Words, Phrases, Symbols
    Mostly at the word level
    Numbers
    Punctuation
  Meaningful Elements or Pieces –> Tokens
  Parsing and Text Mining
  Treatment of Contractions, Hyphenated words, Emoticons and Larger Constructs (like urls)
  Look-up tables
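
To make the tokenization choices above concrete, here is a small regex-based sketch. The pattern and the decisions it encodes (keep contractions and hyphenated words whole, keep URLs whole, recognize a handful of emoticons) are illustrative assumptions, not the behavior of dtSearch, Lucene, or any other engine.

```python
import re

# Alternatives are tried left to right, so larger constructs come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+          # URLs kept as a single token
  | [:;]-?[()DP]          # a few simple emoticons
  | \w+(?:['-]\w+)*       # words, numbers, contractions, hyphenated words
  | [^\w\s]               # any other punctuation mark as its own token
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Split text into tokens according to TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't e-mail me at http://example.com :-) ok?"))
```

A different engine might split "e-mail" into two tokens or drop the question mark entirely, which is why the same search can return different results in different tools.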
  
Linguistic Based Technologies
Keyword Sample
  Simple:
    "legal systems" OR legalsystems
    "Mike Custodian"
  Medium:
    mail(custodian@domain.com) AND "legal systems"
    (Custodian w/3 (Mike OR Michael OR M))
  Complex:
    (privilege OR privileged OR legally OR "work product") NOT w/35
    (((original OR intended OR designated OR named) w/3
    (recipient OR recipients OR addressee OR addressees OR solely))
    OR ("message in error") OR ("received in error") OR ("named above")
    OR ((electronic or email or e-mail) w/3 (message or transmission))
    OR ("confidentiality notice"))
Ontology Sample
  q ((+(std:%CapacityReports_% std:%DINCapacity_%) (std:%ACMEEPPlant_% std:%ProductName_%))
  (+(std:%ACMEPNPlant_% std:%ProductName_%) +(std:%ProductiveCapability_% std:%CapacityReports_%))
  (+(std:%CapacityCreep_% std:%OperationsImprovement_% std:%CapacityExpansion_% std:%CapacityRestoration_%) +(std:%ACMEPNPlant_% std:%ProductName_%))
  (+(std:%EquipmentReplacement_% std:%FinishingColumn_%) +(std:%ACMEPNPlant_% std:%ProductName_%))
  (std:%Audit_% actor:%Audit_%)
  (+(std:%SettlementNegotiations_% std:%ContractNegotiations_% ) +(actor:%ACMEOutsideCounsel_% std:%ACMEOutsideCounsel_% actor:%ACME UBOutsideCounsel_% std:%AcmeSubOutsideCounsel_% actor:%AcmeSub_% std:%AcmeSub_%))
  (std:%FTC_% actor:%FTC_%)
  ((+subject:%ProductName_% +(std:swap std:"supply agreement" std:"exchange agreement" std:"agree to exchange")) std:"name
  
Search Engines
  dtSearch
    dtSearch Corp., founded 1991
    Incorporated into Symantec’s Norton Navigator
    SDKs available, most license off the shelf
    http://support.dtsearch.com/faq/search.html
  Lucene
    Open source - http://lucene.apache.org/core/
    Doug Cutting, 1999; part of Apache projects in 2001
    APIs, Customizable
  Other – MySQL, SQL (DBMS, RDBMS)
  
dtSearch
  Relativity, Concordance, Viewpoint, others
  Single User desktop license $199
  Little Customization – more similarities across apps
  Includes Boolean operators
  Includes Proximity searching
  Includes Fuzzy Searching
    Alphabet -> Alphaqet, alpphabet, alpkaqet
  
Lucene
  Clearwell, Intella, Cataphora, SHIFT, others
  Open Source Tool – meant to be customized
  Little Similarities Across Apps
    Know your defaults!
  Includes Boolean Operators
  Includes Proximity Searching
  
dtSearch – Fuzzy Searching
  Degrees of Fuzziness
    1-10; dtSearch uses 1-3
    Marked by use of % symbol
    Insertion: co%t → coat
    Deletion: coat → co%t
    Substitution: coat → cost
    Transposition: cots → cost
  Fuzziness Degrees
    Alphabet – Alphaqet, Alpkaqet
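
The four edit types listed above (insertion, deletion, substitution, transposition) are the operations counted by an edit-distance measure. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance as a stand-in for "degree of fuzziness"; this is an illustrative assumption, since the slides do not say how dtSearch scores fuzzy matches internally.

```python
def edit_distance(a, b):
    """Count the insertions, deletions, substitutions and adjacent
    transpositions needed to turn a into b (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def fuzzy_matches(term, vocabulary, max_edits):
    """Return the vocabulary words within max_edits of term."""
    return [w for w in vocabulary
            if edit_distance(term.lower(), w.lower()) <= max_edits]

words = ["alphabet", "alphaqet", "alpphabet", "alpkaqet", "omega"]
print(fuzzy_matches("alphabet", words, 1))
print(fuzzy_matches("alphabet", words, 2))
```

Raising the allowed number of edits pulls in more misspellings and OCR errors, but also more false hits, which is the trade-off behind choosing a fuzziness level.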
  
Boolean Operators – AND, OR, NOT
dtSearch
  Search for
    Multiple words treated as a phrase
  ANY – treats word list as separated by OR
  ALL – treats word list as separated by AND
Lucene
  Depends on customization
    OR
    AND
  Know your defaults
  Spell out variations
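
Why "know your defaults" matters: the same two-word query returns different documents depending on whether the engine reads it as a phrase, as ANY (OR), or as ALL (AND). The sketch below shows the effect on a three-document toy collection; the documents and function names are hypothetical and do not use either engine's API.

```python
docs = {
    1: "legal systems review",
    2: "legal hold notice",
    3: "systems of legal record",
}

def tokens(text):
    return text.lower().split()

def match_any(query, text):
    """ANY: at least one query word appears (implicit OR)."""
    return any(w in tokens(text) for w in tokens(query))

def match_all(query, text):
    """ALL: every query word appears somewhere (implicit AND)."""
    return all(w in tokens(text) for w in tokens(query))

def match_phrase(query, text):
    """Phrase: the query words appear adjacent and in order."""
    q, t = tokens(query), tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))

MATCHERS = {"any": match_any, "all": match_all, "phrase": match_phrase}

def search(query, mode):
    return [doc_id for doc_id, text in docs.items() if MATCHERS[mode](query, text)]

for mode in ("phrase", "all", "any"):
    print(mode, search("legal systems", mode))
```

Spelling out the operators in the query itself removes the dependence on whichever default a given installation happens to use.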
  
Proximity
dtSearch
  Pre/post
  w/ order doesn’t matter
    House white
    White house
  Pre/ finds first word prior to second word
    White house
Lucene
  w/ order doesn’t matter
  No pre usage
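
The difference between an unordered w/ search and an ordered pre/ search can be sketched with word positions. The helper below is a hypothetical illustration; how each engine counts the distance between words (and whether noise words count) is an assumption here and should be checked against the tool in use.

```python
def positions(word, words):
    """Indexes at which word occurs in the word list."""
    return [i for i, w in enumerate(words) if w == word]

def within(first, second, n, text, ordered=False):
    """True if the two words occur within n positions of each other.
    With ordered=True, first must come before second (like pre/n)."""
    words = text.lower().split()
    for i in positions(first, words):
        for j in positions(second, words):
            distance = j - i
            if ordered and 0 < distance <= n:
                return True
            if not ordered and 0 < abs(distance) <= n:
                return True
    return False

print(within("white", "house", 1, "the white house"))              # either order
print(within("white", "house", 1, "house white paint"))            # either order
print(within("white", "house", 1, "house white paint", ordered=True))
```

The unordered form accepts both "white house" and "house white"; the ordered form accepts only the first.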
  
Punctuation
dtSearch
  Letters
  Space
  Ignored
  Hyphens
  % - fuzzy searching
  _ - ignored
Lucene
  All punctuation treated as a word break
  
dtSearch Hyphen Example
  
Noise Words – Unindexed, Ignored
dtSearch
  Unindexed; can create Custom Index
  Many, but a few examples: Do, not, for, your, only, under, made, way
  Know defaults
Lucene
  Ignores * in quotes
  (Quality Control*) = Quality Control but nothing else
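
"Unindexed" has a practical consequence: a search for a noise word finds nothing, because the word never entered the index. The sketch below builds a tiny inverted index that skips the example noise words from the slide; the documents and function names are hypothetical.

```python
# Example noise words taken from the slide; real default lists are much longer.
NOISE_WORDS = {"do", "not", "for", "your", "only", "under", "made", "way"}

def build_index(docs, noise_words=NOISE_WORDS):
    """Map each indexed word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in noise_words:
                continue  # noise words are never written to the index
            index.setdefault(word, set()).add(doc_id)
    return index

def lookup(index, word):
    return sorted(index.get(word.lower(), set()))

docs = {1: "Do not call your broker", 2: "Quality control made easy"}
index = build_index(docs)

print(lookup(index, "call"))  # indexed word
print(lookup(index, "not"))   # noise word, so no hits
```

Building a custom index with a shorter noise list is the usual remedy when a case turns on a word such as "not" or "under".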
  
Stemming v. Wild Cards
Stemming
  Syntactic Variations
    Regular Verbs
    Irregular Verbs
  dtSearch performs poorly with irregular verbs
Wild Cards
  Strings of characters
  Replacements for beginning, parts, or endings
  Lucene - *
  dtSearch - ? for single character, * for any # of characters
  Time consuming
  Spelling out recommended
  Wild cards in quotes
  
Stemming v. Wild Cards Example
Stemming
  Catch – Lucene
  Catch~ - dtSearch
    Catch
    Catches
    Catching
    Catcher
    Caught - not in dtSearch
Wild Cards
  Catch*
    Catch
    Catches
    Catching
    Catcher
    Catch1234 – not in stemming
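
The Catch example can be reproduced with a few lines of code. The wildcard side uses Python's standard `fnmatch` module, whose `*` and `?` mirror the operators named above; the stemmer is a deliberately naive suffix stripper written for this illustration and is not the stemming algorithm of dtSearch or Lucene.

```python
import fnmatch

vocabulary = ["catch", "catches", "catching", "catcher", "caught", "catch1234"]

def wildcard_search(pattern, words):
    """Match on raw character strings: * is any run, ? is one character."""
    return [w for w in words if fnmatch.fnmatchcase(w, pattern)]

SUFFIXES = ("ing", "es", "er", "s")

def naive_stem(word):
    """Strip one common suffix; irregular forms are left untouched."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_search(term, words):
    """Match words that reduce to the same stem as the search term."""
    target = naive_stem(term)
    return [w for w in words if naive_stem(w) == target]

print(wildcard_search("catch*", vocabulary))
print(stem_search("catch", vocabulary))
```

The wildcard picks up "catch1234" because it only looks at characters, while the rule-based stemmer misses the irregular form "caught", matching the two caveats in the example.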
  
Statistical Technologies
  Rules Based
  Bayesian Classification
  Vector Space Modeling
    Latent Semantic Indexing
  
Statistical Based Technologies
  Concept - Clustering
    Machine
    Unsupervised
    Quickly understand data
    Uncontrolled Clusters
  
Statistical Based Technologies
  Concept - Categorization
    User Created
    Supervised
    Control Topics
    Time Consuming
  
Statistical Based Technologies
Rules Based Systems
  If... Then...
    If email = person 1 to person 2 then return it
    If email = person 1 or person 2 then return it
  Artificial Intelligence Systems
    Entity extraction (& dictionaries)
    Time consuming
    Mirror human thinking
      Case, subject matter
  Transparent System
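
The two If/Then rules above translate almost directly into code. This is a minimal sketch with made-up e-mail records and hypothetical function names; it also shows why such systems are called transparent, since every returned document can be traced to the rule that selected it.

```python
emails = [
    {"id": 1, "from": "person1", "to": ["person2"]},
    {"id": 2, "from": "person2", "to": ["person3"]},
    {"id": 3, "from": "person3", "to": ["person4"]},
]

def sent_between(email, sender, recipient):
    """Rule 1: the email is from one named person to another."""
    return email["from"] == sender and recipient in email["to"]

def involves_either(email, a, b):
    """Rule 2: either named person appears as sender or recipient."""
    participants = {email["from"], *email["to"]}
    return a in participants or b in participants

def apply_rule(rule, *args):
    """Return the ids of all emails the rule selects."""
    return [e["id"] for e in emails if rule(e, *args)]

print(apply_rule(sent_between, "person1", "person2"))
print(apply_rule(involves_either, "person1", "person2"))
```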
  
Statistical Based Technologies
  Bayesian Classification
    Probabilistic
    Co-occurrence
    Frequency
  Spam Filters
    Viagra
    Concepts
    Words, phrases
  
Bayesian
  Bayesian illustration
    Baseball, glove, diamond, bats, hit, home run
    Diamond, pendant, jewelry
  Co-occurrence
    Local – within a document
    Global – across document population
  Frequency – how often does it appear
    Weighting – uniqueness counts
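
The diamond illustration can be turned into a very small naive Bayes classifier: the word "diamond" alone is ambiguous, and the words that co-occur with it tip the probability toward baseball or jewelry. The sketch below assumes a multinomial model with add-one smoothing and one training document per topic, which is far smaller than anything a real product would use.

```python
import math
from collections import Counter

# One tiny training document per topic, taken from the word lists above.
training = {
    "baseball": ["baseball glove diamond bats hit home run"],
    "jewelry": ["diamond pendant jewelry"],
}

word_counts = {label: Counter(" ".join(docs).split())
               for label, docs in training.items()}
vocabulary = set().union(*word_counts.values())
total_docs = sum(len(docs) for docs in training.values())

def log_score(text, label):
    """Log prior plus smoothed log likelihood of each word under the topic."""
    counts = word_counts[label]
    total = sum(counts.values())
    score = math.log(len(training[label]) / total_docs)
    for word in text.lower().split():
        # Add-one smoothing keeps unseen words from zeroing out a topic.
        score += math.log((counts[word] + 1) / (total + len(vocabulary)))
    return score

def classify(text):
    return max(training, key=lambda label: log_score(text, label))

print(classify("diamond pendant"))
print(classify("diamond bats hit"))
```

Spam filters work the same way, with "spam" and "not spam" as the two topics and word frequencies learned from labeled mail.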
  
Statistical Based Technologies
  Vector Space Modeling
  Latent Semantic Indexing/Analysis
    Words
    Phrases, Concepts
    Tables
    Algebraic equations representing docs
    Weighting Algorithms
  
Latent Semantic Indexing Example
  Exclude Noise Words
    The, and, or, etc.
  Vector Space Modeling
    Build Document Profile
  Diamond
    Base, ball
    Necklace, pendant
    Diamond Saw
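
The first two steps above (exclude noise words, build a document profile) can be sketched with plain word-count vectors compared by cosine similarity. Note that this is only the vector space part: full Latent Semantic Indexing additionally applies a matrix decomposition to the term-document table, which is omitted here. The sample documents and noise list are assumptions for illustration.

```python
import math
from collections import Counter

NOISE_WORDS = {"the", "and", "or", "a", "of"}

def profile(text):
    """Document profile: word counts with noise words excluded."""
    return Counter(w for w in text.lower().split() if w not in NOISE_WORDS)

def cosine(p, q):
    """Cosine of the angle between two profiles (1.0 means same direction)."""
    dot = sum(p[term] * q[term] for term in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

documents = {
    "sports": "the diamond and the base ball",
    "jewelry": "a diamond necklace or pendant",
    "tools": "the diamond saw",
}
profiles = {name: profile(text) for name, text in documents.items()}

query = profile("diamond pendant necklace")
scores = {name: cosine(query, p) for name, p in profiles.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

All three documents contain "diamond", but the surrounding words pull the query toward the jewelry document.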
  
Mathematical Foundation
  Tables built with 0s, 1s
  Yes it has that word or phrase
  No it doesn’t
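
A table of 0s and 1s like the one described above can be built in a few lines. The three sample documents are hypothetical; each row is a document, each column a term, and a 1 means "yes, it has that word".

```python
documents = {
    "doc1": "diamond pendant necklace",
    "doc2": "baseball diamond glove",
    "doc3": "diamond saw",
}

# Column headings: every distinct term, in alphabetical order.
terms = sorted({word for text in documents.values() for word in text.split()})

# One row per document: 1 if the term occurs in it, otherwise 0.
table = {
    doc_id: [1 if term in text.split() else 0 for term in terms]
    for doc_id, text in documents.items()
}

print(terms)
for doc_id, row in table.items():
    print(doc_id, row)
```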
  
Simple Matrix with Weighting
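
The matrix shown on this slide is not available in the text, so the following is only a sketch of one common weighting scheme, TF-IDF, chosen as an assumption because it captures the earlier point that "uniqueness counts": a term that appears in every document gets no weight, while a rare term that repeats within a document gets a high one.

```python
import math
from collections import Counter

documents = {
    "doc1": "diamond pendant",
    "doc2": "diamond bats",
    "doc3": "diamond saw saw",
}

counts = {doc_id: Counter(text.split()) for doc_id, text in documents.items()}
n_docs = len(documents)
# Number of documents in which each term occurs at least once.
doc_freq = Counter(term for c in counts.values() for term in c)

def weight(term, doc_id):
    """Term frequency in the document times log inverse document frequency."""
    tf = counts[doc_id][term]
    if tf == 0:
        return 0.0
    return tf * math.log(n_docs / doc_freq[term])

for doc_id in documents:
    print(doc_id, {t: round(weight(t, doc_id), 3) for t in counts[doc_id]})
```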
  
Weighted by Document (not just type)
  
Defensibility Report
  Document, Document, Document
  Transparency
  Workflow
  What Was Considered, By Whom?
  QC Process
  Metrics
  
Q&A - Thank you!
  Post your questions to the presenter in the chat section

  Sonya L. Sigler
  Vice President, Product Strategy & Consulting
  SFL Data
  415-321-8385
  sonya@sfldata.com
  www.sfldata.com
  

2012 11 7 TAR Webinar Part 3 Sigler
