WHAT IS DATA SCIENCE?

Jeffrey Stanton
School of Information Studies
Syracuse University

BIG   DATA
KILO, MEGA, GIGA, TERA, PETA, EXA
ZETTA = 10^21 BYTES
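
The prefix ladder above can be sketched as a small calculation. This is a minimal illustration, assuming the decimal (SI) meaning of each prefix, where every step is 1,000 times the previous one; the names `SI_PREFIXES` and `bytes_for` are hypothetical and do not come from the slides.

```python
# Decimal (SI) prefixes named on the slide, in increasing order.
# Assumption: each prefix is 1,000 times the previous one (powers of ten,
# not the binary 1,024-based units sometimes used for memory sizes).
SI_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta"]


def exponent_for(prefix: str) -> int:
    """Return the power of ten for a prefix, e.g. 'kilo' -> 3."""
    return 3 * (SI_PREFIXES.index(prefix) + 1)


def bytes_for(prefix: str) -> int:
    """Return the number of bytes in one <prefix>byte."""
    return 10 ** exponent_for(prefix)


for name in SI_PREFIXES:
    print(f"1 {name}byte = 10^{exponent_for(name)} bytes")
```

Under this convention one zettabyte equals a billion terabytes, which gives a sense of the scale the slide is pointing at.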
…An organization employing 1,000 knowledge workers loses $5.7 million annually just in time wasted having to reformat information as they move among applications. Not finding information costs that same organization an additional $5.3m a year.
Source: IDC

Over 95% of the digital universe is "unstructured data", meaning its content can't be truly represented by its field in a record, such as name, address, or date of last transaction. In organizations, unstructured data accounts for more than 80% of all information.
Source: IDC
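
The two cost figures quoted above can be combined with simple arithmetic. The sketch below is only an illustration: the dollar amounts are the ones on the slide (attributed there to IDC), while the even per-worker split is an assumption made here and is not stated in the source. All names are hypothetical.

```python
# Figures quoted on the slide, in US dollars per year, for an organization
# employing 1,000 knowledge workers.
KNOWLEDGE_WORKERS = 1_000
REFORMATTING_COST = 5_700_000    # time wasted reformatting information
SEARCH_FAILURE_COST = 5_300_000  # cost of not finding information


def combined_annual_cost() -> int:
    """Sum of the two annual costs quoted on the slide."""
    return REFORMATTING_COST + SEARCH_FAILURE_COST


def cost_per_worker() -> float:
    """Assumption (not from the slide): cost is spread evenly across workers."""
    return combined_annual_cost() / KNOWLEDGE_WORKERS


print(f"Combined: ${combined_annual_cost():,} per year")
print(f"Per worker: ${cost_per_worker():,.0f} per year")
```

Taken together, the two figures come to $11 million a year, or roughly $11,000 per knowledge worker under the even-split assumption.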
WHY DATA SCIENCE?

  • Available data on a scale millions of times larger than 20 years ago: customer transactions; environmental sensor outputs; genetic and epigenetic sequences; web documents; digital images and audio
  • Heterogeneous data sets, with different representations and formats; mixtures of structured and unstructured data; some, little, or no metadata; distributed across systems
  • Chaotic information life cycle, where little time and effort is spent on what should be kept and what can be discarded
  • Diverse and/or legacy infrastructure: mainframes running COBOL connected with high-speed networks to sensor arrays running Linux

CRITICAL QUESTIONS

  • How will global climate change affect sea levels in major coastal metropolitan areas worldwide?
  • Does genetic screening reduce cancer mortality for adults between the ages of 50 and 59?
  • What gene sequences in cereal grains are associated with greater crop yields in arid environments?
  • How can we reduce false positives in automated airline baggage scans without reducing accuracy?
  • What Internet data can be mined as predictive of firm creation among startups that provide new jobs?

“BIG DATA” PROVIDES ANSWERS

  • Water sustainability
  • Climate analysis and prediction
  • Energy through fusion
  • CO2 sequestration
  • Hazard analysis and management
  • Cancer detection and therapy
  • Drug design and development
  • Advanced materials analysis
  • New combustion systems
  • Virtual product design
  • In silico semiconductor design

Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf

“All grand challenges face barriers due to challenges in software, in data management and visualization, and in coordinating the work of diverse communities that must work together to develop new models and algorithms, and to evaluate outputs as a basis for critical decisions.”

Source: NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf

[Diagram] Knowledge Development for Industry, Education, Government, Research

Domain Experts
  • Expertise in specific subject areas
  • Limited opportunity to master technology skills
  • Proliferation of big data & new technology
  • Need for knowledge and information managers

Data Scientists (center of the diagram), surrounded by:
  • Information Organization & Visualization
  • Information Analysis
  • Solution Integration
  • Digital Curation

Infrastructure Professionals
  • Rapid pace of IT development
  • Limited expertise in domain areas
  • Specialized knowledge of HW, FW, MW, SW
  • Communication challenges

Data Scientists: Transforming Data Into Decisions

A DEFINITION OF A DATA SCIENTIST

  • A data scientist uses deep expertise in the management, transformation, and analysis of large, heterogeneous data sets to:
      • Help infrastructure experts with the architecture of hardware and software to manage big data challenges
      • Help domain experts and decision makers reduce the data deluge into usable knowledge, visualizations, and presentations
      • Help institutions and organizations control and curate data throughout the information lifecycle

Editor's Notes

  • #3: Facebook friend connections worldwide, a network diagram of the Enron email set, a comparison of similar gene sequences between humans, chimps, and macaques
  • #9: HW, FW, MW, SW: Hardware Firmware Middleware Software