2. Data All Around !!
• Lots of data is being collected and warehoused
– Scientific experiments
– Internet of Things
– Web data
– E-commerce
– Financial transactions
– Bank/credit transactions
– Online trading and purchasing
– Social networks
– ... and many more!
3. What To Do With These Data?
• Aggregation and Statistics
– Data warehousing and OLAP
• Indexing, Searching, and Querying
– Keyword-based search
– Pattern matching (XML/RDF)
• Knowledge Discovery
– Data mining
– Statistical modeling
• Data-Driven
– Predictive analytics
– Deep learning
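A minimal sketch of two of the items above, aggregation and keyword-based search, in plain Python. The sales records, field names, and the helpers `roll_up`, `build_index`, and `keyword_search` are made up for illustration; a real data warehouse would more likely use SQL or an OLAP engine.

```python
from collections import defaultdict

# Hypothetical transaction records (illustrative data only).
SALES = [
    {"region": "east", "product": "laptop", "amount": 1200.0, "note": "student discount applied"},
    {"region": "east", "product": "phone", "amount": 650.0, "note": "online order"},
    {"region": "west", "product": "laptop", "amount": 1100.0, "note": "online order, gift wrap"},
    {"region": "west", "product": "phone", "amount": 700.0, "note": "in-store pickup"},
]


def roll_up(records, key):
    """Sum `amount` per distinct value of `key` (an OLAP-style roll-up)."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["amount"]
    return dict(totals)


def build_index(records, field):
    """Build an inverted index: lowercase word -> set of record positions."""
    index = defaultdict(set)
    for pos, rec in enumerate(records):
        for word in rec[field].lower().replace(",", " ").split():
            index[word].add(pos)
    return index


def keyword_search(index, query):
    """Return positions of records that contain every word in `query`."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(w, set()) for w in words]
    return set.intersection(*hits)


if __name__ == "__main__":
    print(roll_up(SALES, "region"))    # totals per region
    print(roll_up(SALES, "product"))   # totals per product
    idx = build_index(SALES, "note")
    print(sorted(keyword_search(idx, "online order")))
```

The same `roll_up` call works along any dimension (`"region"`, `"product"`), which is the basic idea behind slicing a warehouse by different attributes.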
4. What is Data Science ??
• An area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
• Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges of big data
• Data science principles apply to all data – big and small
5. What is Data Science ??
• Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many areas such as science, engineering, economics, politics, finance, and education
– Computer Science
• Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
– Mathematics
• Mathematical modeling
– Statistics
• Statistical and stochastic modeling, probability
6. Component | Traditional Analysis | Traditional Software Delivery | Data Science
Tools | SAS, R, Excel, SQL, in-house tools | Java, source control, Linux, continuous integration, unit testing, bug reports and project management | R, Java, scientific Python libraries, Excel, SQL, Hadoop, Hive, machine learning libraries, GitHub for source control and issue management
Analytical Methods | Regressions, classifications, measuring prediction accuracy and coverage/error, sampling | N/A | Classification, clustering, similarity detection, recommenders, unsupervised and supervised learning, and others
Team Structure | Statisticians, Mathematicians, Scientists | Developers, Project Managers, Systems Engineers | Mathematicians, Statisticians, Scientists, Developers, Systems Engineers
Time Frame | Either: usually ongoing research and discovery within a team in the organization. Or: specific project | Regular software release cycle, continuous delivery, etc. | Either: discovery/learning phase leading to product delivery
7. Contrast : Machine Learning
Machine Learning | Data Science
Develop new (individual) models | Explore many models, build and tune hybrids
Prove mathematical properties of models | Understand empirical properties of models
Improve/validate on a few, relatively clean, small datasets | Develop/use tools that can handle massive datasets
Publish a paper | Take action!
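To make "explore many models" and "understand empirical properties" concrete, here is a small sketch that fits two candidate models and compares them on held-out data. It is only an illustration under stated assumptions: the data points, the two candidates (a mean baseline and a least-squares line), and the helper names are invented here and do not come from the slides.

```python
from statistics import mean

# Illustrative (x, y) pairs; y is roughly 2*x + 1 with small deviations.
DATA = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1), (5, 11.0), (6, 12.9), (7, 15.2)]


def fit_mean(train):
    """Baseline model: always predict the mean of the training targets."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_line(train):
    """Least-squares straight line y = slope * x + intercept."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in train) / sum((x - x_bar) ** 2 for x in xs)
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mse(model, rows):
    """Mean squared error of `model` on (x, y) rows."""
    return mean((model(x) - y) ** 2 for x, y in rows)


def compare(candidates, train, test):
    """Fit every candidate on `train` and score it on held-out `test`."""
    return {name: mse(fit(train), test) for name, fit in candidates.items()}


if __name__ == "__main__":
    train, test = DATA[::2], DATA[1::2]  # alternate rows: a simple holdout split
    scores = compare({"mean": fit_mean, "line": fit_line}, train, test)
    best = min(scores, key=scores.get)
    print(scores, "->", best)
```

Adding another candidate is a one-line change to the dictionary passed to `compare`, which is what makes this kind of empirical comparison cheap to repeat.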
8. Contrast : Data Engineering
Aspect | Data Science | Data Engineering
Approach | Scientific (Exploration) | Engineering (Development)
Problems | Unbounded | Bounded
Path to Solution | Iterative, exploratory, nonlinear | Mostly linear
Education | More is better (PhDs common) | BS and/or self-trained
Presentation Skills | Important | Not as important
Research Experience | Important | Not as important
Programming Skills | Not as important | Important
Data Skills | Important | Important
11. Data Science : Case Study
Cancer Research
• Cancer is an incredibly complex disease; a single tumor can have more than 100 billion cells, and each cell can acquire mutations individually. The disease is always changing, evolving, and adapting.
• Employ the power of big data analytics and high-performance computing.
• Leverage sophisticated pattern-recognition and machine learning algorithms to identify patterns that are potentially linked to cancer.
• Requires a huge amount of data processing and pattern recognition.