1. Velammal College of Engineering and Technology
(Autonomous)
Department of Information Technology
21IT401
BIG DATA ENGINEERING
2. Syllabus
UNIT I - INTRODUCTION
Big Data Overview, Evolution of Big Data,
Definition of Big Data, Challenges with Big Data,
State of practice in Analytics,
Key roles for New Big Data Ecosystem,
Data Analytics Lifecycle Overview,
Examples for Big Data Analytics.
21IT401- BIG DATA ENGINEERING UNIT-I 2
3. Understanding Big Data
• Devices and sensors automatically generate
diagnostic information that needs to be stored and
processed in real time.
• Credit card companies monitor every purchase
their customers make and can identify fraudulent
purchases.
• Mobile phone companies analyze subscribers’
calling patterns to determine, for example,
whether a caller’s frequent contacts are on a rival
network.
4. Three attributes stand out as defining
Big Data characteristics:
• Huge volume of data
• Complexity of data types and structures
• Speed of new data creation and growth
5. Definition of Big Data
• Big Data is data whose scale, distribution,
diversity, and/or timeliness require the use of
new technical architectures and analytics to
enable insights that unlock new sources of
business value.
6. Evolution of Big Data
The evolution of big data has been driven by technological
advancements, increasing data generation, and the growing
recognition of the value of data insights. Here’s a chronological
overview of the key phases and milestones in the evolution of big
data:
1. Early Data Processing (1960s - 1980s)
2. The Rise of the Internet and Digitalization (1990s)
3. Web 2.0 and the Explosion of User-Generated Content (2000s)
4. Advancements in Big Data Technologies (2010s)
5. Integration and Real-Time Analytics (Mid 2010s - Present)
6. Current Trends and Future Directions (2020s and Beyond)
7. Evolution of Big Data
1. Early Data Processing (1960s - 1980s)
• 1960s: The advent of databases and data management systems
began with the development of hierarchical and network
databases, such as IBM’s IMS (Information Management
System).
• 1970s: E.F. Codd at IBM introduced the relational model,
which led to the creation of the SQL language and the
development of the first relational database management
systems (RDBMS), such as Oracle.
• 1980s: Data warehousing concepts emerged, allowing
organizations to aggregate data from various sources for
analysis and reporting.
8. Evolution of Big Data
2. The Rise of the Internet and Digitalization (1990s)
• 1990s: The proliferation of the internet and the digitalization
of information led to exponential growth in data generation.
E-commerce, email, and early web applications contributed
significantly to data volumes.
• Data mining techniques were developed to extract patterns and
insights from large datasets.
9. Evolution of Big Data
3. Web 2.0 and the Explosion of User-Generated Content
(2000s)
• Early 2000s: The rise of Web 2.0 technologies, characterized
by user-generated content, social media, and multimedia,
resulted in an explosion of unstructured data.
• 2004: The term "big data" began to gain traction, emphasizing
the challenges associated with managing and processing vast
amounts of diverse and rapidly growing data.
• 2006: The introduction of Hadoop by Doug Cutting and Mike
Cafarella, inspired by Google’s MapReduce and Google File
System (GFS) papers, provided a scalable, distributed
framework for processing large datasets.
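The MapReduce model that inspired Hadoop can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for the example. In a real cluster the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input; in Hadoop each line could be stored on a different node.
sample_lines = ["big data needs big clusters", "data drives decisions"]
counts = word_count(sample_lines)
```

Because each map call looks at one line and each reduce call looks at one key, the work can be split across machines without the steps needing to share state.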
10. Evolution of Big Data
4. Advancements in Big Data Technologies (2010s)
• 2010s: Significant advancements in big data technologies and
frameworks emerged, including Apache Spark, whose in-memory
processing is often faster than Hadoop MapReduce.
• NoSQL databases, such as MongoDB, Cassandra, and HBase,
gained wide adoption for handling unstructured and
semi-structured data.
• The cloud computing revolution provided scalable and
cost-effective storage and processing solutions, with services like
Amazon Web Services (AWS), Microsoft Azure, and Google
Cloud Platform becoming popular.
• Machine learning and AI technologies advanced, enabling
more sophisticated data analytics and predictive modeling.
11. Evolution of Big Data
5. Integration and Real-Time Analytics
(Mid 2010s - Present)
• The focus shifted to integrating big data with traditional
enterprise data systems, leading to the rise of hybrid data
architectures.
• Real-time data processing and analytics became critical, with
technologies like Apache Kafka enabling real-time data
streaming and processing.
• Data lakes emerged as a way to store vast amounts of raw data
in its native format until needed for analysis.
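One common building block of real-time analytics is counting events per fixed time window as they arrive. The sketch below shows that idea in plain Python; it does not use Apache Kafka or any streaming library, and the function name `tumbling_window_counts`, the 10-second window, and the sensor events are assumptions made for the example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) for each fixed-size time window.

    `events` is an iterable of (timestamp_seconds, key) pairs that arrive
    in timestamp order, as they would from a streaming source.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = window_start
        if window_start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Emit the last, possibly partial, window when the stream ends.
        yield current_start, counts


# Hypothetical sensor events: (seconds since start, sensor id).
stream = [(1, "s1"), (4, "s2"), (7, "s1"), (11, "s1"), (12, "s2"), (25, "s2")]
windows = list(tumbling_window_counts(stream, window_seconds=10))
```

Results are produced window by window while the stream is still being read, rather than after all data has been stored, which is the main difference from batch processing.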
12. Evolution of Big Data
6. Current Trends and Future Directions (2020s and
Beyond)
• 2020s: The convergence of big data with AI and machine learning
has led to more intelligent and automated data analytics.
• Edge computing is gaining traction, processing data closer to where
it is generated to reduce latency and bandwidth usage.
• Privacy, security, and ethical considerations have become
paramount, driven by regulations like GDPR and CCPA.
• The rise of DataOps (Data Operations) emphasizes the need for
collaboration and automation in data management processes to
improve the quality and speed of data analytics.
• Quantum computing, though in its early stages, holds the potential
to revolutionize data processing capabilities in the future.
13. Challenges with Big Data
Big data refers to the vast volumes of structured and
unstructured data generated at high velocity from a wide
variety of sources. Working with it raises the following challenges:
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Complexity
6. Scalability
7. Storage
14. Challenges with Big Data
8. Data Governance
9. Security
10. Data Integration
11. Data Analysis
12. Talent Gap
13. Cost
15. Strategies to Address Big Data Challenges:
• Invest in scalable and flexible infrastructure: Use
cloud-based solutions to handle data storage and processing needs.
• Employ robust data governance frameworks: Implement
policies and practices to ensure data quality, security, and
compliance.
• Use advanced analytics tools: Leverage machine learning
and AI to derive actionable insights from big data.
• Foster talent development: Invest in training and
development programs to build a skilled workforce.
• Adopt data integration platforms: Utilize ETL (Extract,
Transform, Load) tools to streamline data integration from
diverse sources.
16. State of practice in Analytics
Current business problems provide many opportunities
for organizations to become more analytical and data-driven.
• BI Versus Data Science
• Current Analytical Architecture
• Drivers of Big Data
• Emerging Big Data Ecosystem and a New Approach to
Analytics
17. State of practice in Analytics
BI Versus Data Science
20. State of practice in Analytics
Current Analytical Architecture
21. State of practice in Analytics
Drivers of Big Data
The data now comes from multiple sources, such as these:
● Medical information, such as genomic sequencing and diagnostic imaging
● Photos and video footage uploaded to the World Wide Web
● Video surveillance, such as the thousands of video cameras spread across a
city
● Mobile devices, which provide geospatial location data of the users, as well
as metadata about text messages, phone calls, and application usage on
smartphones
● Smart devices, which provide sensor-based collection of information from
smart electric grids, smart buildings, and many other public and industry
infrastructures
● Nontraditional IT devices, including the use of radio-frequency
identification (RFID) readers, GPS navigation systems, and seismic
processing
23. State of practice in Analytics
Emerging Big Data Ecosystem and a New Approach to
Analytics
24. Key roles for New Big Data Ecosystem
25. Key roles for New Big Data Ecosystem
Deep Analytical Talent
• This role is technically savvy, with strong analytical
skills.
• Members possess a combination of skills to handle
raw, unstructured data and to apply complex analytical
techniques at massive scales.
• This group has advanced training in quantitative
disciplines, such as mathematics, statistics, and
machine learning.
• Examples of current professions fitting into this group
include statisticians, economists, mathematicians, and
the new role of the Data Scientist.
26. Key roles for New Big Data Ecosystem
Data Savvy Professionals
• This group has less technical depth but has a basic
knowledge of statistics or machine learning and can define
key questions that can be answered using advanced
analytics.
• These people tend to have a base knowledge of
working with data, or an appreciation for some of the
work being performed by data scientists and others
with deep analytical talent.
• Examples of data savvy professionals include financial
analysts, market research analysts, life scientists,
operations managers, and business and functional
managers.
27. Key roles for New Big Data Ecosystem
Technology and Data Enablers
• This group represents people providing
technical expertise to support analytical
projects.
• This role requires skills related to computer
engineering, programming, and database
administration.
28. Key roles for New Big Data Ecosystem
Profile of a Data Scientist
29. Data Analytics Lifecycle Overview:
• The Data Analytics Lifecycle defines analytics
process best practices spanning discovery to
project completion.
• Phase 1—Discovery
• Phase 2—Data preparation:
• Phase 3—Model planning:
• Phase 4—Model building:
• Phase 5—Communicate results:
• Phase 6—Operationalize:
31. Data Analytics Lifecycle Overview:
Phase 1—Discovery: In Phase 1, the team learns
the business domain, including relevant
history such as whether the organization or
business unit has attempted similar projects in
the past from which they can learn. The team
assesses the resources available to support
the project in terms of people, technology,
time, and data.
32. Data Analytics Lifecycle Overview:
Phase 2—Data preparation: Phase 2 requires
the presence of an analytic sandbox, in which
the team can work with data and perform
analytics for the duration of the project. The
team needs to execute extract, load, and
transform (ELT) or extract, transform and load
(ETL) to get data into the sandbox. ELT and ETL
are sometimes abbreviated together as ETLT.
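The difference between ETL and ELT is where the transform step runs. The sketch below contrasts the two, using an in-memory SQLite database as a stand-in for the analytic sandbox. The table names, the sample records, and the `parse_amount` helper are assumptions made for the example; a real project would extract from actual source systems.

```python
import sqlite3

# Hypothetical raw records, standing in for an extract from a source system.
raw_rows = [("alice", " 120.50 "), ("bob", "80"), ("carol", "n/a")]


def parse_amount(text):
    """Return the amount as a float, or None when the value is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


sandbox = sqlite3.connect(":memory:")  # in-memory stand-in for the sandbox

# ETL: transform in application code first, then load only the clean rows.
sandbox.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
parsed_rows = [(name, parse_amount(amount)) for name, amount in raw_rows]
clean_rows = [row for row in parsed_rows if row[1] is not None]
sandbox.executemany("INSERT INTO sales_etl VALUES (?, ?)", clean_rows)

# ELT: load the raw data unchanged, then transform inside the sandbox.
sandbox.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
sandbox.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw_rows)
sandbox.execute(
    """
    CREATE TABLE sales_elt AS
    SELECT customer, CAST(TRIM(amount) AS REAL) AS amount
    FROM sales_raw
    WHERE TRIM(amount) GLOB '[0-9]*'
    """
)

etl_total = sandbox.execute("SELECT SUM(amount) FROM sales_etl").fetchone()[0]
elt_total = sandbox.execute("SELECT SUM(amount) FROM sales_elt").fetchone()[0]
raw_count = sandbox.execute("SELECT COUNT(*) FROM sales_raw").fetchone()[0]
```

Both routes give the same cleaned totals here, but only the ELT route keeps the raw rows in the sandbox, so the team can still inspect the rejected record later.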
33. Data Analytics Lifecycle Overview:
Phase 3—Model planning: Phase 3 is model
planning, where the team determines the
methods, techniques, and workflow it intends
to follow for the subsequent model building
phase. The team explores the data to learn
about the relationships between variables and
subsequently selects key variables and the
most suitable models.
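One simple way to explore relationships between variables is to compute the correlation of each candidate input with the target. The sketch below ranks candidates by the absolute Pearson correlation coefficient; the variable names and values are made-up assumptions, and correlation captures only linear relationships, so it is a first screening step rather than a full selection method.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical dataset: candidate input variables and a target to predict.
candidates = {
    "ad_spend": [10, 20, 30, 40, 50],
    "store_visits": [5, 3, 6, 2, 4],
}
target_sales = [12, 24, 33, 41, 55]

# Rank candidates by the strength of their linear relationship to the target.
scores = {
    name: pearson(values, target_sales) for name, values in candidates.items()
}
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
```

In this made-up data `ad_spend` tracks sales closely while `store_visits` does not, so `ad_spend` would be the stronger candidate to carry into model building.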
34. Data Analytics Lifecycle Overview:
Phase 4—Model building: In Phase 4, the team
develops datasets for testing, training, and
production purposes. In addition, in this phase
the team builds and executes models based
on the work done in the model planning
phase.
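Developing separate datasets can be as simple as shuffling the records once and cutting them into parts. The sketch below is one possible approach, not a prescribed method: the 60/20/20 percentages, the fixed seed, and the name `split_dataset` are assumptions, and the third part is a holdout set standing in for production data the model has never seen.

```python
import random


def split_dataset(records, train_pct=60, test_pct=20, seed=42):
    """Shuffle records and split them into training, test, and holdout sets.

    Percentages are integers so the cut points use exact integer
    arithmetic; whatever remains after train and test is the holdout.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    train_end = len(shuffled) * train_pct // 100
    test_end = train_end + len(shuffled) * test_pct // 100
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


records = list(range(100))  # hypothetical record ids
train, test, holdout = split_dataset(records)
```

Keeping the three sets disjoint matters because a model evaluated on records it was trained on will look better than it really is.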
35. Data Analytics Lifecycle Overview:
Phase 5—Communicate results: In Phase 5, the
team, in collaboration with major
stakeholders, determines if the results of the
project are a success or a failure based on the
criteria developed in Phase 1.
36. Data Analytics Lifecycle Overview:
Phase 6—Operationalize: In Phase 6, the team
delivers final reports, briefings, code, and
technical documents. In addition, the team
may run a pilot project to implement the
models in a production environment.
37. Examples for Big Data Analytics
• Retail and E-commerce
– Customer behavior analysis
– Recommendation systems
– Inventory and supply chain optimization
• Banking and Finance
– Fraud detection
– Risk analytics and credit scoring
– Algorithmic trading
• Healthcare
– Patient diagnostics and treatment predictions
– Real-time monitoring using IoT
– Disease outbreak forecasting
38. Examples for Big Data Analytics
• Telecommunications
– Churn prediction
– Network optimization
– Customer segmentation
• Manufacturing
– Predictive maintenance
– Process optimization
– Quality control using sensor data
• Government
– Crime pattern analysis
– Traffic and transportation analytics
– Smart city planning
39. Examples for Big Data Analytics
• Energy
– Smart grid data analysis
– Forecasting energy demand
– Equipment failure prediction
• Social Media
– Sentiment analysis
– Trend analysis
– Targeted advertising