Python for Business
        Intelligence


Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012
python business intelligence




Results

Q/A and articles with Java
  solution references


               (not listed here)
Why?
Overview

■ Traditional Data Warehouse
■ Python and Data
■ Is Python Capable?
■ Conclusion
Business
Intelligence
people

technology processes
Data                                           Analysis and
          Extraction, Transformation, Loading
Sources                                         Presentation

                       Data Governance

                   Technologies and Utilities
Traditional Data
  Warehouse
■ Extracting data from the original sources

■ Quality assuring and cleaning data

■ Conforming the labels and measures
   in the data to achieve consistency across the original sources



■ Delivering data in a physical format that can be used by
   query tools, report writers, and dashboards.




                         Source: Ralph Kimball – The Data Warehouse ETL Toolkit
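Kimball's four steps can be sketched as a minimal pipeline. All names and sample data below are hypothetical, purely to make the flow concrete:

```python
# Minimal sketch of Kimball's four ETL steps, with made-up
# sample data and helper names -- not a real warehouse loader.

# 1. Extract: pull raw rows from the original sources
def extract():
    return [
        {"country": "usa", "amount": " 1200 "},
        {"country": "US",  "amount": "300"},
    ]

# 2. Quality-assure and clean: strip noise, drop unusable rows
def clean(rows):
    return [
        {"country": r["country"].strip(), "amount": int(r["amount"].strip())}
        for r in rows if r["amount"].strip()
    ]

# 3. Conform: map source-specific labels to one shared vocabulary
COUNTRY_CONFORM = {"usa": "US", "US": "US"}

def conform(rows):
    return [dict(r, country=COUNTRY_CONFORM[r["country"]]) for r in rows]

# 4. Deliver: write into a physical form queryable by reporting tools
def deliver(rows):
    warehouse = {}
    for r in rows:
        warehouse[r["country"]] = warehouse.get(r["country"], 0) + r["amount"]
    return warehouse

print(deliver(conform(clean(extract()))))  # {'US': 1500}
```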
Source               Staging Area     Operational Data Store   Datamarts
Systems



   structured
   documents




   databases

                Temporary
                Staging
                Area
      APIs




                            staging              relational        dimensional

                             L0                    L1                 L2
"real time" = daily
Multi-dimensional
    Modeling
aggregation browsing
     slicing and dicing
business / analyst’s
       point of view

regardless of physical schema implementation
Facts

                  measurable


     fact

                   fact data cell




most detailed information
location




type




              time



           dimensions
Dimension

■ provide context for facts
■ used to filter queries or reports
■ control scope of aggregation of facts
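In a Cubes-style logical model a dimension is described declaratively; the fragment below is illustrative only (the dimension and attribute names are made up, not taken from the talk):

```json
{
    "name": "date",
    "levels": [
        {"name": "year",  "attributes": ["year"]},
        {"name": "month", "attributes": ["month", "month_name"]}
    ],
    "hierarchies": [
        {"name": "ym", "levels": ["year", "month"]}
    ]
}
```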
Pentaho
Python and Data
   community perception*




                           *as of Oct 2012
Scientific & Financial
Python
Data                                           Analysis and
          Extraction, Transformation, Loading
Sources                                         Presentation

                       Data Governance

                   Technologies and Utilities
Scientific Data
       T1 [s]    T2 [s]    T3 [s]    T4 [s]
P1     112.68    941.67    171.01    660.48
P2      96.15    306.51    725.88    877.82
P3     313.39    189.31     41.81    428.68
P4     760.62    983.48    371.21    281.19
P5     838.56     39.27    389.42    231.12




     n-dimensional array of numbers
Assumptions

■ data is mostly numbers
■ data is neatly organized...
■ … in one multi-dimensional array
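The timing table above is exactly this shape: one dense 2-d array of numbers, the case the scientific stack is built for. A small NumPy sketch (values copied from the table):

```python
import numpy as np

# Process x trial timings from the table above, as one 2-d array
times = np.array([
    [112.68, 941.67, 171.01, 660.48],
    [ 96.15, 306.51, 725.88, 877.82],
    [313.39, 189.31,  41.81, 428.68],
    [760.62, 983.48, 371.21, 281.19],
    [838.56,  39.27, 389.42, 231.12],
])

print(times.shape)         # (5, 4)
print(times.mean(axis=1))  # per-process mean over the four trials
```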
Data                                           Analysis and
          Extraction, Transformation, Loading
Sources                                         Presentation

                       Data Governance

                   Technologies and Utilities
Business Data
multiple snapshots of one source




multiple representations of the same data

categories are changing
Is Python Capable?
     very basic examples
Data Pipes with
   SQLAlchemy

 Data                                           Analysis and
          Extraction, Transformation, Loading
Sources                                         Presentation

                       Data Governance

                   Technologies and Utilities
■ connection: create_engine
■ schema reflection: MetaData,   Table

■ expressions: select(),   insert()
src_engine = create_engine("sqlite:///data.sqlite")
src_metadata = MetaData(bind=src_engine)
src_table = Table('data', src_metadata, autoload=True)




target_engine = create_engine("postgresql://localhost/sandbox")
target_metadata = MetaData(bind=target_engine)
target_table = Table('data', target_metadata)
clone schema:

for column in src_table.columns:
    target_table.append_column(column.copy())

target_table.create()




copy data:

insert = target_table.insert()

for row in src_table.select().execute():
    insert.execute(row)
magic used:

metadata reflection
text file (CSV) to table:




reader = csv.reader(file_stream)

columns = next(reader)  # header row

for column in columns:
    table.append_column(Column(column, String))

table.create()

for row in reader:
    insert.execute(row)
Simple T from ETL

 Data                                           Analysis and
          Extraction, Transformation, Loading
Sources                                         Presentation

                       Data Governance

                   Technologies and Utilities
transformation = [

 ('fiscal_year',         {"function": int,
                          "field": "fiscal_year"}),
 ('region_code',         {"mapping": region_map,
                          "field": "region"}),
 ('borrower_country',    None),
 ('project_name',        None),
 ('procurement_type',    None),
 ('major_sector_code',   {"mapping": sector_code_map,
                          "field": "major_sector"}),
 ('major_sector',        None),
 ('supplier',            None),
 ('contract_amount',     {"function": currency_to_number,
                          "field": 'total_contract_amount'}),
 ]



     target fields        source transformations
Transformation

for row in source:
    result = transform(row, transformation)
    table.insert(result).execute()
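The `transform()` helper itself is not shown on the slide. A minimal sketch consistent with the spec above, assuming `None` means "copy the same-named source field":

```python
def transform(row, transformation):
    """Apply a (target_field, rule) spec to one source row (a dict).

    A rule of None copies the same-named source field; otherwise the
    rule may name a source "field", a value "mapping" and/or a
    conversion "function".
    """
    result = {}
    for target, rule in transformation:
        if rule is None:
            result[target] = row[target]
            continue
        value = row[rule.get("field", target)]
        if "mapping" in rule:
            value = rule["mapping"][value]
        if "function" in rule:
            value = rule["function"](value)
        result[target] = value
    return result

# Hypothetical usage with a two-field spec
spec = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("supplier", None),
]
print(transform({"fiscal_year": "2012", "supplier": "ACME"}, spec))
# {'fiscal_year': 2012, 'supplier': 'ACME'}
```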
OLAP with Cubes

 Data                                           Analysis and
          Extraction, Transformation, Loading
Sources                                         Presentation

                       Data Governance

                   Technologies and Utilities
Model
{
    "name": "My Model",
    "description": "...",

    "cubes": [...],
    "dimensions": [...]
}




cubes                          dimensions
measures                        levels, attributes, hierarchy
logical

physical (❄ snowflake schema)
Application → cubes Aggregation Browser → backend

1   load_model("model.json")

2   create_workspace("sql",
                     model,
                     url="sqlite:///data.sqlite")

3   model.cube("sales")

4   workspace.browser(cube)
browser.aggregate(cell,
                  drilldown=["sector"])

drill-down

for row in result.table_rows("sector"):
    row.label
    row.key
    row.record["amount_sum"]
whole cube

cell = Cell(cube)
browser.aggregate(cell)

Total

browser.aggregate(cell,
                  drilldown=["date"])

2006  2007  2008  2009  2010

cut = PointCut("date", [2010])
cell = cell.slice(cut)

browser.aggregate(cell,
                  drilldown=["date"])

Jan  Feb  Mar  Apr  May  ...
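Conceptually, aggregate, drill-down, and slice reduce to filtering and grouping fact rows and summing a measure. A library-free sketch of what the three calls above compute (the fact data is made up):

```python
from collections import defaultdict

# Toy fact table: (year, sector, amount) -- made-up numbers
facts = [
    (2009, "energy",    10),
    (2010, "energy",    20),
    (2010, "transport",  5),
]

def aggregate(facts, drilldown=None, cut=None):
    """Sum the amount measure, optionally filtered and/or grouped."""
    if cut is not None:                   # slice: keep one dimension value
        key, value = cut
        facts = [f for f in facts if f[key] == value]
    if drilldown is None:                 # whole cell: a single total
        return sum(amount for _, _, amount in facts)
    totals = defaultdict(int)             # drill-down: total per member
    for year, sector, amount in facts:
        totals[(year, sector)[drilldown]] += amount
    return dict(totals)

print(aggregate(facts))                              # 35  (Total)
print(aggregate(facts, drilldown=0))                 # {2009: 10, 2010: 25}
print(aggregate(facts, drilldown=1, cut=(0, 2010)))  # {'energy': 20, 'transport': 5}
```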
How can Python
  be Useful
just the   Language
 ■ saves maintenance resources
 ■ shortens development time
 ■ saves you from going insane
Source               Staging Area      Operational Data Store   Datamarts
Systems



   structured
   documents




   databases
                                      faster
                Temporary
                Staging
                Area
      APIs




                            staging               relational        dimensional

                             L0                     L1                 L2
faster                      advanced


 Data                                            Analysis and
          Extraction, Transformation, Loading
Sources                                          Presentation

                       Data Governance

                   Technologies and Utilities




    understandable, maintainable
Conclusion
BI is about…



       people

technology processes
don’t forget
 metadata
Future

who is going to fix your COBOL Java tool
 if you have only Python guys around?
Python is capable, let's start
Thank You


          Twitter:

        @Stiivi
     DataBrewery blog:

blog.databrewery.org
          Github:

  github.com/Stiivi
