Erik Nooijen,
            Boudewijn v. Dongen, Dirk Fahland


Process Mining for ERP Systems
Process Discovery


                             process
             event                            process
                            discovery
              log                              model
                            algorithm



 c1: A B C D E   assumptions
 c2: A C B D E   • case = sequence of events of this case
 c3: A F D E     • cases are isolated:
                   event A in c1 happens only in c1 (and not in c2)
 …
                 • cases of the same process

                 • one unique case id,
                 • each event associated to exactly one case id



                                                              PAGE 1
Typical Process in an ERP System

                             Manufacturer



                    Material A        Material B
          order
                    Material B        Material B
        product X                                   order
Alice                                              materials
                                                                ACME Inc.




                    Material B        Material A
          order
                    Material C        Material C
        product Y                                   order
Bob
                                                   materials
                       Build to Order                           Mega Corp.

                                                               PAGE 2
n-to-m relations → database


                                                  process
                                                                        process
                                                 discovery
                                                                         model
                                                 algorithm

ProductOrder (id attributes, time-stamp attributes)
poID    cust.   …   created        processed       built         shipped
po1     Alice       30-08 9:22     30-08 13:12     01-09 15:12   03-09 10:15
po2     Bob         30-08 10:15    30-08 13:14     01-09 16:13   03-09 17:18

Customer (data attributes)
cust.     address      …
Alice     …            …
Bob       …            …

OrderedMaterial (relations)
poID    moID    type    added
po1     mo3     B       30-08 13:13
po1     mo4     A       30-08 13:14
po2     mo3     B       30-08 13:15
po2     mo4     C       30-08 13:16

MaterialOrder (id attributes, time-stamp attributes)
moID    suppl.   …   completed     sent            received
mo3     ACME         30-08 13:15   30-08 14:15     01-09 9:05
mo4     MEGA         30-08 13:17   30-08 16:12     01-09 10:13
                                                                                                  PAGE 3
Process Discovery for ERP Systems


                                                          process
                                                                             process
                                                         discovery
                                                                              model
                                                         algorithm


                   0..*
                          Customer
                                                                   reality: data in a relational DB
ProductOrder              - cust
               1
                          -…                                       • events stored as time-stamped
- poID
- cust                                                               attributes in tables
- created                 OrderedMat.
                                                   MaterialOrder
- processed               - poID
- built        1
                          - moID
                                                   - moID          • multiple primary keys
- shipped                               1..*       - supplier       → multiple notions of case
                          - type
                   1..*                            - completed
                          - added              1
                                                   - sent
                                                   - received      • tables are related
                                                                    → one event related to
                                                                      multiple cases

                                                                                              PAGE 4
Outline


                                                   process
                                                    model


                                                                  related by
                                                              primary foreign-key
                                                                   relations

            decompose       by primary keys




                                                             model f.
                        log f.         discovery              PO
   log f.                                                                 model f.
                         MO
    PO                                                                     MO
                                       discovery
                                                                        PAGE 6
Find Artifact Schemas


                                                   process
                                                    model


                                                                  related by
                                                              primary foreign-key
                                                                   relations

            decompose       by primary keys




                                                             model f.
                        log f.         discovery              PO
   log f.                                                                 model f.
                         MO
    PO                                                                     MO
                                       discovery
                                                                        PAGE 7
Step 0: discover database schema

 document schema vs. actual schema → identify
 • column types (esp. time-stamped columns)
 • primary keys
 • foreign keys
 various (non-trivial) techniques available
 key discovery is NP-complete in the size of the
  table(s)
 result:




                                                PAGE 8
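A minimal sketch of data-driven key discovery, assuming a brute-force uniqueness check over small column combinations. The slides do not say which technique the tool uses. `candidate_keys` and `max_size` are hypothetical names. The rows are the OrderedMaterial records from the earlier slide.

```python
from itertools import combinations


def candidate_keys(rows, columns, max_size=2):
    """Return minimal column combinations whose values are unique across all rows."""
    keys = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # A superset of an already found key is unique too, but not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                keys.append(combo)
    return keys


ordered_material = [
    {"poID": "po1", "moID": "mo3", "type": "B"},
    {"poID": "po1", "moID": "mo4", "type": "A"},
    {"poID": "po2", "moID": "mo3", "type": "B"},
    {"poID": "po2", "moID": "mo4", "type": "C"},
]

keys = candidate_keys(ordered_material, ["poID", "moID", "type"])
```

On these four rows the check reports both `(poID, moID)` and `(poID, type)` as unique. The second is an accident of the sample, which illustrates why discovered keys may need review. The number of combinations also grows exponentially with the number of columns, in line with the complexity remark above.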
Step 1: decompose schema into processes

= schema summarization                  find:
                                        1. sets of
                                           corresponding
                                           tables
                                        2. links between
                                           those
         ProductOrder   MaterialOrder




                                                 PAGE 9
Automatic Schema Summarization

= group similar tables
  through clustering
 define a distance between
    any 2 tables
    •     by relations
    •     by information content


       tables that are close to
        each other
        → same cluster
       # of clusters: user input



                                    PAGE 10
Automatic Schema Summarization


1. structural distance between tables

   fanout ~ avg. # of child records related to the same parent record

   parent table A with records 1, 2; three example child tables with columns (A, B):

   child records (A B)         fanout
   1 X | 2 Y                   1
   1 X | 1 Y | 2 Z | 2 U       2
   1 X | 1 Y                   1 = (2+0)/2




                                                           PAGE 11
Automatic Schema Summarization


1. structural distance between tables

   fanout ~ avg. # of child records related to the same parent record

   matched fraction ~ 1 / (fraction of records in parent with matching child record)

   parent table A with records 1, 2; three example child tables with columns (A, B):

   child records (A B)         fanout     m.fr
   1 X | 2 Y                   1          1
   1 X | 1 Y | 2 Z | 2 U       2          1
   1 X | 1 Y                   1          2 = 1/ (1/2)




                                                                PAGE 12
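The two structural measures can be reproduced with a short Python sketch. The function names and the list-based representation of key columns are assumptions made for illustration. Returning infinity when no parent record is matched is also an assumption, since the slides do not cover that case.

```python
from collections import Counter


def fanout(parent_keys, child_fks):
    """Average number of child records per parent record (unmatched parents count as 0)."""
    per_parent = Counter(child_fks)
    return sum(per_parent[k] for k in parent_keys) / len(parent_keys)


def matched_fraction(parent_keys, child_fks):
    """1 / (fraction of parent records that have at least one matching child record)."""
    referenced = set(child_fks)
    matched = sum(1 for k in parent_keys if k in referenced)
    if matched == 0:
        return float("inf")  # assumption: no parent is matched at all
    return len(parent_keys) / matched


parent = [1, 2]          # values of column A in the parent table
child_1 = [1, 2]         # A values of child records (1 X), (2 Y)
child_2 = [1, 1, 2, 2]   # (1 X), (1 Y), (2 Z), (2 U)
child_3 = [1, 1]         # (1 X), (1 Y)

measures = [(fanout(parent, c), matched_fraction(parent, c))
            for c in (child_1, child_2, child_3)]
```

For the three example tables this gives fanout 1, 2, 1 and matched fraction 1, 1, 2, the values shown on the slide.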
Grouping by Clustering

1. structural distance
2. information distance
   importance of each table
   = entropy (is maximal if all
   records are different)
   distance: 2 tables with high
   entropies → large distance
3. weighted distance by
   structure + information
4. k-means clustering:
   k clusters based on
   weighted distance

most important table of cluster
= table with least distance to all
→ key attribute of the cluster
                                                            PAGE 13
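A minimal sketch of the entropy-based importance of a table, assuming Shannon entropy over the distribution of whole records. The slides state only that the entropy is maximal when all records differ and do not give the exact formula. `table_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def table_entropy(rows):
    """Shannon entropy (in bits) of the distribution of distinct records in a table."""
    counts = Counter(tuple(r) for r in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


all_different = [("po1", "mo3"), ("po1", "mo4"), ("po2", "mo3"), ("po2", "mo4")]
half_repeated = [("po1", "mo3"), ("po1", "mo3"), ("po2", "mo4"), ("po2", "mo4")]
all_equal = [("po1", "mo3")] * 4

entropies = [table_entropy(t) for t in (all_different, half_repeated, all_equal)]
```

With four records the maximum is log2(4) = 2 bits, reached only when every record is distinct. A table of identical records has entropy 0.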
Artifact Schema → Artifact Log


                                                   process
                                                    model


                                                                  related by
                                                              primary foreign-key
                                                                   relations

            decompose       by primary keys




                                                             model f.
                        log f.         discovery              PO
   log f.                                                                 model f.
                         MO
    PO                                                                     MO
                                       discovery
                                                                        PAGE 14
Log Extraction

                  cluster = set of related tables
                            + primary key of most important table

                                         case id




                poID   cust.   …   created       processed     built          shipped
       log f.
        PO      po1    Alice       30-08 9:22    30-08 13:12   01-09 15:12    03-09 10:15
                po2    Bob         30-08 10:15   30-08 13:14   01-09 16:13    03-09 17:18

                                                      poID     moID type added
                                                      po1      mo3     B      30-08 13:13
po1:                                                  po1      mo4     A      30-08 13:14
                                                      po2      mo3     B      30-08 13:15

po2:                                                  po2      mo4     C      30-08 13:16

                                                                             PAGE 15
Log Extraction

                           cluster = set of related tables
                                     + primary key of most important table

                                                 case id

                           time-stamped attribute → event


                        poID   cust.   …   created       processed     built          shipped
          log f.
           PO           po1    Alice       30-08 9:22    30-08 13:12   01-09 15:12    03-09 10:15
                        po2    Bob         30-08 10:15   30-08 13:14   01-09 16:13    03-09 17:18

                                                              poID     moID type added
                                                              po1      mo3     B      30-08 13:13
po1: (created, poID=po1, time=30-08 9:22, …)                  po1      mo4     A      30-08 13:14
                                                              po2      mo3     B      30-08 13:15
                                                              po2      mo4     C      30-08 13:16

                                                                                     PAGE 16
Log Extraction

                           cluster = set of related tables
                                     + primary key of most important table

                                                  case id

                           time-stamped attribute → event
                           related attributes → event attributes
                         poID   cust.   …   created       processed     built          shipped
           log f.
            PO           po1    Alice       30-08 9:22    30-08 13:12   01-09 15:12    03-09 10:15
                         po2    Bob         30-08 10:15   30-08 13:14   01-09 16:13    03-09 17:18

                                                               poID     moID type added
                                                               po1      mo3     B      30-08 13:13
po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)      po1      mo4     A      30-08 13:14
                                                               po2      mo3     B      30-08 13:15
                                                               po2      mo4     C      30-08 13:16

                                                                                      PAGE 17
Log Extraction

                           cluster = set of related tables
                                     + primary key of most important table

                                                  case id

                           time-stamped attribute → event
                           related attributes → event attributes
                         poID   cust.   …   created       processed     built          shipped
           log f.
            PO           po1    Alice       30-08 9:22    30-08 13:12   01-09 15:12    03-09 10:15
                         po2    Bob         30-08 10:15   30-08 13:14   01-09 16:13    03-09 17:18

                                                               poID     moID type added
                                                               po1      mo3     B      30-08 13:13
po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)      po1      mo4     A      30-08 13:14
    (processed, poID=po1, time=30-08 13:12, …)                 po2      mo3     B      30-08 13:15
                                                               po2      mo4     C      30-08 13:16

                                                                                      PAGE 18
Log Extraction

                           cluster = set of related tables
                                     + primary key of most important table

                                                    case id

                           time-stamped attribute → event
                           related attributes → event attributes
                         poID     cust.   …   created       processed     built          shipped
           log f.
            PO           po1      Alice       30-08 9:22    30-08 13:12   01-09 15:12    03-09 10:15
                         po2      Bob         30-08 10:15   30-08 13:14   01-09 16:13    03-09 17:18

                                                                 poID     moID type added
                                                                 po1      mo3     B      30-08 13:13
po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)        po1      mo4     A      30-08 13:14
    (processed, poID=po1, time=30-08 13:12, …)                   po2      mo3     B      30-08 13:15
    (added, poID=po1, time=30-08 13:13, moID=mo3, …)             po2      mo4     C      30-08 13:16
                               refers to artifact “MaterialOrder”
                                                                                        PAGE 19
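The extraction rule built up over the preceding slides can be sketched as follows: the primary key of the main table is the case id, every time-stamped attribute becomes an event, and the remaining attributes become event attributes. `parse_ts`, `extract_log` and the dictionary layout of an event are assumptions made for illustration, not the tool's actual interface. Timestamps are compared without a year because the slides show none.

```python
def parse_ts(text):
    """Parse 'DD-MM H:MM' (the format on the slides) into a sortable tuple."""
    date, clock = text.split()
    day, month = date.split("-")
    hour, minute = clock.split(":")
    return (int(month), int(day), int(hour), int(minute))


def extract_log(main_rows, case_id, ts_columns, related_rows=(), related_ts=None):
    """Build {case id: [event, ...]} with the events of each case ordered by time."""
    log = {}
    for row in main_rows:
        # Attributes without a time stamp are attached to every event of the row.
        attrs = {k: v for k, v in row.items() if k not in ts_columns}
        for col in ts_columns:
            event = {"activity": col, "time": row[col], **attrs}
            log.setdefault(row[case_id], []).append(event)
    for row in related_rows:
        # Events of a related table join the case through the shared key.
        attrs = {k: v for k, v in row.items() if k != related_ts}
        event = {"activity": related_ts, "time": row[related_ts], **attrs}
        log.setdefault(row[case_id], []).append(event)
    for events in log.values():
        events.sort(key=lambda e: parse_ts(e["time"]))
    return log


product_orders = [
    {"poID": "po1", "cust.": "Alice", "created": "30-08 9:22",
     "processed": "30-08 13:12", "built": "01-09 15:12", "shipped": "03-09 10:15"},
    {"poID": "po2", "cust.": "Bob", "created": "30-08 10:15",
     "processed": "30-08 13:14", "built": "01-09 16:13", "shipped": "03-09 17:18"},
]
ordered_material = [
    {"poID": "po1", "moID": "mo3", "type": "B", "added": "30-08 13:13"},
    {"poID": "po1", "moID": "mo4", "type": "A", "added": "30-08 13:14"},
    {"poID": "po2", "moID": "mo3", "type": "B", "added": "30-08 13:15"},
    {"poID": "po2", "moID": "mo4", "type": "C", "added": "30-08 13:16"},
]

log = extract_log(product_orders, "poID",
                  ["created", "processed", "built", "shipped"],
                  ordered_material, "added")
```

For case po1 the third event is `added` with `moID=mo3`, matching the trace on the slide. The `moID` attribute is what links this event to the MaterialOrder artifact.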
Outline


                                                   process
                                                    model


                                                                  compose by
                                                              primary foreign-key
                                                                    relations

            decompose       by primary keys




                                                             model f.
                        log f.         discovery              order
   log f.                                                                 model f.
                        order
   quote                                                                   quote
                                       discovery
                                                                        PAGE 20
Resulting Model(s)
                Product Order                         Material Order
                                       1..*
                                              added
       create

                                                       completed

      processed

                    added       1..*                      sent

        built

                                                        received

       shipped


                        (added, poID=po1, …, moID=mo3)
                                                                   PAGE 21
Implementation & Evaluation

 prototype tool
 • input: relational database (via JDBC), .csv tables
 • steps
   − discover database schema (types, keys, relations)
   − discover artifact schema
     − by k-means clustering
     − by user picking tables
   − extract logs → ProM




                                                     PAGE 22
Evaluation: SAP System of Sligro

 > 300 tables, > 40 GiB of data
 schema extraction time-stamp attributes: 15 hrs
                       primary keys:          4 hrs
                       foreign keys:          5 hrs (single col)/
                                              6 days (double col.)

 clustering           entropies:               17 hrs
                       table distances:         5 hrs
                       clustering:              a few seconds
                       ~20 different artifacts found
                       largest: 47 tables, 869 columns

 log extraction       extract 1000 traces of > 246,000 events
                       query database:          1 hr
                       write log file:          32 hrs

                                                             PAGE 23
Sligro: Artikel lifecycle model




                                  PAGE 24
Open issues

 performance
 •   key discovery: NP-complete in R (# of records)
 •   foreign key discovery: NP-complete in R²
 •   problem is in the “hard part” of NP
 •   → sampling of data, domain knowledge, semi-automatic
 requires good database structure
 •   proper relations, proper keys
 •   otherwise wrong clusters are formed
 •   events don’t get right attributes
 •   → semi-automatic approach
 events shared by multiple cases… working on it…
                                                    PAGE 25
Process Mining for ERP Systems

  • 1. Erik Nooijen, Boudewijn v. Dongen, Dirk Fahland Process Mining for ERP Systems
  • 2. Process Discovery process event process discovery log model algorithm c1: A B C D E assumptions c2: A C B D E • case = sequence of events of this case c3: A F D E • cases are isolated: event A in c1 happens only in c1 (and not in c2) … • cases of the same process • one unique case id, • each event associated to exactly one case id PAGE 1
  • 3. Typical Process in an ERP System Manufacturer Material A Material B order Material B Material B product X order Alice materials ACME Inc. Material B Material A order Material C Material C product Y order Bob materials Build to Order Mega Corp. PAGE 2
  • 4. n-to-m relations  database process process discovery model algorithm id attributes time-stamp attributes ProductOrder Customer poID cust. … created processed built shipped cust. address … po1 Alice 30-08 9:22 30-08 13:12 01-09 15:12 03-09 10:15 Alice … … po2 Bob 30-08 10:15 30-08 13:14 01-09 16:13 03-09 17:18 Bob … … relations data attributes OrderedMaterial id attributes MaterialOrder poID moID type added moID suppl. … completed sent received po1 mo3 B 30-08 13:13 mo3 ACME 30-08 13:15 30-08 14:15 01-09 9:05 po1 mo4 A 30-08 13:14 mo4 MEGA 30-08 13:17 30-08 16:12 01-09 10:13 po2 mo3 B 30-08 13:15 po2 mo4 C 30-08 13:16 relations PAGE 3
  • 5. Process Discovery for ERP Systems process process discovery model algorithm 0..* Customer reality: data in a relational DB ProductOrder - cust 1 -… • events stored as time-stamped - poID - cust attributes in tables - created OrderedMat. MaterialOrder - processed - poID - built 1 - moID - moID • multiple primary keys - shipped 1..* - supplier  multiple notions of case - type 1..* - completed - added 1 - sent - received • tables are related  one event related to multiple cases PAGE 4
  • 6. Process Discovery for ERP Systems process process discovery model algorithm 0..* Customer reality: data in a relational DB ProductOrder - cust 1 -… • events stored as time-stamped - poID - cust attributes in tables - created OrderedMat. MaterialOrder - processed - poID - built 1 - moID - moID • multiple primary keys - shipped 1..* - supplier  multiple notions of case - type 1..* - completed - added 1 - sent - received • tables are related  one event related to multiple cases PAGE 5
  • 7. Outline process model related by primary foreign-key relations decompose by primary keys model f. log f. discovery PO log f. model f. MO PO MO discovery PAGE 6
  • 8. Find Artifact Schemas process model related by primary foreign-key relations decompose by primary keys model f. log f. discovery PO log f. model f. MO PO MO discovery PAGE 7
  • 9. Step 0: discover database schema  document schema vs. actual schema  identify • column types (esp. time-stamped columns) • primary keys • foreign keys  various (non-trivial) techniques available  key discovery is NP-complete in the size of the table(s)  result: PAGE 8
  • 10. Step 1: decompose schema into processes = schema summarization find: 1. sets of corresponding tables 2. links between those ProductOrder MaterialOrder PAGE 9
  • 11. Automatic Schema Summarization = group similar tables through clustering  define a distance between any 2 tables • by relations • by information content  tables that are close to each other  same cluster  # of clusters: user input PAGE 10
  • 12. Automatic Schema Summarization 1. structural distance A between tables 1 2 fanout: 1 = (2+0)/2 fanout ~ avg. # of child fanout: 1 records related to the fanout: 2 same parent record A B A B A B 1 X 1 X 1 X 2 Y 1 Y 1 Y 2 Z 2 U PAGE 11
  • 13. Automatic Schema Summarization. 1. Structural distance between tables. Fanout ~ avg. # of child records related to the same parent record. Matched fraction (m.fr) ~ 1 / (fraction of records in parent with matching child record). Examples for parent table A with records 1, 2 and a child table with columns (A, B): rows (1, X), (2, Y) → fanout 1, m.fr 1; rows (1, X), (1, Y) → fanout 1, m.fr 2 = 1/(1/2); rows (1, X), (1, Y), (2, Z), (2, U) → fanout 2, m.fr 1.
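The two structural measures can be reproduced with a few lines of Python. This is a sketch that follows the slide's informal definitions; the function names and the list-based representation of tables are assumptions for illustration.

```python
from collections import Counter

def fanout(parent_ids, child_fks):
    """Average number of child records per parent record (parents without children count as 0)."""
    counts = Counter(child_fks)
    return sum(counts[p] for p in parent_ids) / len(parent_ids)

def matched_fraction(parent_ids, child_fks):
    """1 / (fraction of parent records that have at least one matching child record)."""
    referenced = set(child_fks)
    matched = sum(1 for p in parent_ids if p in referenced)
    if matched == 0:
        return float("inf")  # no parent is referenced at all
    return 1 / (matched / len(parent_ids))

parents = [1, 2]                # records of parent table A
examples = {
    "one child each": [1, 2],           # child rows (1,X), (2,Y)
    "only parent 1":  [1, 1],           # child rows (1,X), (1,Y)
    "two children each": [1, 1, 2, 2],  # child rows (1,X), (1,Y), (2,Z), (2,U)
}
results = {name: (fanout(parents, fks), matched_fraction(parents, fks))
           for name, fks in examples.items()}
print(results)
```

The second example shows why both measures are needed: its fanout equals that of the first example, and only the matched fraction reveals that half of the parent records have no child.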
  • 14. Grouping by Clustering. 1. Structural distance. 2. Information distance: importance of each table = entropy (maximal if all records are different); distance: 2 tables with high entropies → large distance. 3. Weighted distance by structure + information. 4. k-means clustering: k clusters based on the weighted distance; most important table of a cluster = table with least distance to all → key attribute of the cluster.
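As a rough illustration of two ingredients on this slide, the sketch below computes a Shannon entropy over the records of a table and picks the table with the least total distance to all others. The slide does not give the exact entropy or distance formulas, so both functions and all distance values here are assumptions, not the authors' definitions.

```python
import math
from collections import Counter

def entropy(records):
    """Shannon entropy (in bits) of the distribution of distinct records."""
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in Counter(records).values())

def most_important(distances):
    """Table with the least total distance to all other tables of the cluster."""
    return min(distances, key=lambda t: sum(distances[t].values()))

all_different = [("po1", "Alice"), ("po2", "Bob"), ("po3", "Carol"), ("po4", "Dave")]
half_half = [("A",), ("A",), ("B",), ("B",)]
all_same = [("A",)] * 4

h_max = entropy(all_different)   # log2(4) = 2 bits, the maximum for 4 records
h_mid = entropy(half_half)       # 1 bit
h_min = entropy(all_same)        # 0 bits

# Hypothetical weighted distances between three tables of one cluster.
distances = {
    "ProductOrder": {"OrderedMat": 0.2, "Customer": 0.3},
    "OrderedMat":   {"ProductOrder": 0.2, "Customer": 0.6},
    "Customer":     {"ProductOrder": 0.3, "OrderedMat": 0.6},
}
center = most_important(distances)
print(h_max, h_mid, h_min, center)
```

With these made-up distances ProductOrder is the central table, so its primary key would serve as the case id of the cluster.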
  • 15. Artifact Schema → Artifact Log. Outline diagram: decompose by primary keys into logs (log for PO, log for MO); discovery per log (model for PO, model for MO); models related by primary/foreign-key relations form the process model.
  • 16. Log Extraction. Cluster = set of related tables + primary key of most important table → case id. Log for PO, extracted from two tables. ProductOrder (poID, cust., …, created, processed, built, shipped): (po1, Alice, …, 30-08 9:22, 30-08 13:12, 01-09 15:12, 03-09 10:15); (po2, Bob, …, 30-08 10:15, 30-08 13:14, 01-09 16:13, 03-09 17:18). OrderedMat. (poID, moID, type, added): (po1, mo3, B, 30-08 13:13); (po1, mo4, A, 30-08 13:14); (po2, mo3, B, 30-08 13:15); (po2, mo4, C, 30-08 13:16). Cases: po1, po2.
  • 17. Log Extraction. Cluster = set of related tables + primary key of most important table → case id. Time-stamped attribute → event. Log for PO, extracted from two tables. ProductOrder (poID, cust., …, created, processed, built, shipped): (po1, Alice, …, 30-08 9:22, 30-08 13:12, 01-09 15:12, 03-09 10:15); (po2, Bob, …, 30-08 10:15, 30-08 13:14, 01-09 16:13, 03-09 17:18). OrderedMat. (poID, moID, type, added): (po1, mo3, B, 30-08 13:13); (po1, mo4, A, 30-08 13:14); (po2, mo3, B, 30-08 13:15); (po2, mo4, C, 30-08 13:16). Case po1: (created, poID=po1, time=30-08 9:22, …)
  • 18. Log Extraction. Cluster = set of related tables + primary key of most important table → case id. Time-stamped attribute → event. Related attributes → event attributes. Log for PO, extracted from two tables. ProductOrder (poID, cust., …, created, processed, built, shipped): (po1, Alice, …, 30-08 9:22, 30-08 13:12, 01-09 15:12, 03-09 10:15); (po2, Bob, …, 30-08 10:15, 30-08 13:14, 01-09 16:13, 03-09 17:18). OrderedMat. (poID, moID, type, added): (po1, mo3, B, 30-08 13:13); (po1, mo4, A, 30-08 13:14); (po2, mo3, B, 30-08 13:15); (po2, mo4, C, 30-08 13:16). Case po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)
  • 19. Log Extraction. Cluster = set of related tables + primary key of most important table → case id. Time-stamped attribute → event. Related attributes → event attributes. Log for PO, extracted from two tables. ProductOrder (poID, cust., …, created, processed, built, shipped): (po1, Alice, …, 30-08 9:22, 30-08 13:12, 01-09 15:12, 03-09 10:15); (po2, Bob, …, 30-08 10:15, 30-08 13:14, 01-09 16:13, 03-09 17:18). OrderedMat. (poID, moID, type, added): (po1, mo3, B, 30-08 13:13); (po1, mo4, A, 30-08 13:14); (po2, mo3, B, 30-08 13:15); (po2, mo4, C, 30-08 13:16). Case po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …), (processed, poID=po1, time=30-08 13:12, …)
  • 20. Log Extraction. Cluster = set of related tables + primary key of most important table → case id. Time-stamped attribute → event. Related attributes → event attributes. Log for PO, extracted from two tables. ProductOrder (poID, cust., …, created, processed, built, shipped): (po1, Alice, …, 30-08 9:22, 30-08 13:12, 01-09 15:12, 03-09 10:15); (po2, Bob, …, 30-08 10:15, 30-08 13:14, 01-09 16:13, 03-09 17:18). OrderedMat. (poID, moID, type, added): (po1, mo3, B, 30-08 13:13); (po1, mo4, A, 30-08 13:14); (po2, mo3, B, 30-08 13:15); (po2, mo4, C, 30-08 13:16). Case po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …), (processed, poID=po1, time=30-08 13:12, …), (added, poID=po1, time=30-08 13:13, moID=mo3, …); the moID attribute refers to artifact "MaterialOrder".
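The extraction rule from the slides (every time-stamped attribute becomes an event, the remaining attributes of the row become event attributes, and the primary key of the central table is the case id) can be sketched as follows. This is an illustration on the slide's sample rows, not the prototype's code; the function names are assumptions, and because the slides give no year, a placeholder year is prepended when parsing.

```python
from collections import defaultdict
from datetime import datetime

def parse(ts):
    # The slides show day-month and time only; the year 2000 is a placeholder assumption.
    return datetime.strptime(f"2000-{ts}", "%Y-%d-%m %H:%M")

def extract_events(rows, case_id, timestamp_cols):
    """Turn each time-stamped attribute of each row into one event."""
    events = []
    for row in rows:
        attrs = {k: v for k, v in row.items() if k not in timestamp_cols}
        for col in timestamp_cols:
            events.append({"case": row[case_id], "activity": col,
                           "time": parse(row[col]), **attrs})
    return events

product_order = [
    {"poID": "po1", "cust": "Alice", "created": "30-08 9:22", "processed": "30-08 13:12",
     "built": "01-09 15:12", "shipped": "03-09 10:15"},
    {"poID": "po2", "cust": "Bob", "created": "30-08 10:15", "processed": "30-08 13:14",
     "built": "01-09 16:13", "shipped": "03-09 17:18"},
]
ordered_mat = [
    {"poID": "po1", "moID": "mo3", "type": "B", "added": "30-08 13:13"},
    {"poID": "po1", "moID": "mo4", "type": "A", "added": "30-08 13:14"},
    {"poID": "po2", "moID": "mo3", "type": "B", "added": "30-08 13:15"},
    {"poID": "po2", "moID": "mo4", "type": "C", "added": "30-08 13:16"},
]

events = (extract_events(product_order, "poID", ["created", "processed", "built", "shipped"])
          + extract_events(ordered_mat, "poID", ["added"]))

# Group events by case id and order each trace by time stamp.
log = defaultdict(list)
for event in sorted(events, key=lambda e: e["time"]):
    log[event["case"]].append(event)

traces = {case: [e["activity"] for e in evs] for case, evs in log.items()}
print(traces["po1"])  # ['created', 'processed', 'added', 'added', 'built', 'shipped']
```

Each "added" event keeps its moID attribute, which is the reference to the MaterialOrder artifact mentioned on the slide.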
  • 21. Outline. Decompose by primary keys into logs (log for order, log for quote); discovery per log (model for order, model for quote); compose the models by primary/foreign-key relations into the process model.
  • 22. Resulting Model(s). Product Order: create, processed, added (1..*), built, shipped. Material Order: added (1..*), completed, sent, received. Example event: (added, poID=po1, …, moID=mo3)
  • 23. Implementation & Evaluation. Prototype tool. Input: relational database (via JDBC), .csv tables. Steps: discover database schema (types, keys, relations); discover artifact schema (by k-means clustering, or by user picking tables); extract logs → ProM.
  • 24. Evaluation: SAP System of Sligro. > 300 tables, > 40 GiB of data. Schema extraction: time-stamp attributes: 15 hrs; primary keys: 4 hrs; foreign keys: 5 hrs (single col.) / 6 days (double col.). Clustering: entropies: 17 hrs; table distances: 5 hrs; clustering: a few seconds; ~20 different artifacts found; largest: 47 tables, 869 columns. Log extraction: extract 1000 traces of > 246,000 events; query database: 1 hr; write log file: 32 hrs.
  • 25. Sligro: Artikel lifecycle model (figure)
  • 26. Open issues. Performance: key discovery is NP-complete in R (# of records); foreign key discovery is NP-complete in R^2; the problem is in the "hard part" of NP → sampling of data, domain knowledge, semi-automatic. Requires good database structure: proper relations, proper keys; otherwise wrong clusters are formed and events don't get the right attributes → semi-automatic approach. Events shared by multiple cases… working on it…
  • 27. Erik Nooijen, Boudewijn v. Dongen, Dirk Fahland Process Mining for ERP Systems