2. • What is data?
• Role of Databases in Data Analytics
• Difference between a database and a dataset
3. Data refers to symbols or signs that represent a measurement or a model of
reality. By itself, data has no meaning until it is interpreted according to
higher-level conventions and understandings (HLCUs).
• Why This Definition?
When preparing data, you need to consider the purpose (or HLCU) it’s being
prepared for:
1. If the data is for humans, it may need to be clear and visually understandable.
2. If the data is for machines or algorithms, it should be structured and
formatted appropriately.
3. Different algorithms may require different kinds of data preprocessing based
on their requirements.
4. • The DIKW pyramid explains the hierarchy of data
and its transformation:
1. Data: A collection of symbols – cannot answer any
questions.
2. Information: Processed data – can answer the questions
who, when, where, and what.
3. Knowledge: Descriptive application of Information – can
answer the question how.
4. Wisdom: Embodiment of Knowledge and appreciation of
why.
• The pyramid also shows the relationship between
abundance and value:
• Data is abundant but has low value until it is processed.
• Wisdom is rare but highly valuable as it involves
actionable insights.
5. • The definitions of all four elements of DDVW are
presented as follows:
o Data: All possible data from across all the data
resources
o Dataset: A relevant collection of data selected from all
the available data sources and organized for the next
step
o Visualization: The comprehensible presentation of
what has been found in the dataset (similar to
Knowledge in DIKW – descriptive application of
Information)
o Wisdom: Embodiment of Knowledge and appreciation
of why (the same as Wisdom in DIKW)
6. The most universal data structure – a table
At the end of successful data preprocessing, we want to
create a table that is ready to be mined, analyzed, or
visualized. We call this table a dataset.
• Data objects are known by many different names, such as data points, rows,
records, examples, samples, and tuples. Whatever the name, a table only makes
sense once you have a conceptual definition of what each data object represents.
• Data attributes also go by different names: columns, variables, features, and
dimensions may all be used instead of attributes.
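As a minimal sketch of the table idea above, the snippet below stores a dataset as a list of rows, where each row is one data object and each key is one attribute. The attribute names and all values are made up for illustration.

```python
# Hypothetical dataset: each row (data object) is one hour; values are made up.
dataset = [
    {"hour": "2024-01-01 00:00", "avg_temperature": 11.5, "avg_humidity": 62.0},
    {"hour": "2024-01-01 01:00", "avg_temperature": 10.9, "avg_humidity": 64.5},
    {"hour": "2024-01-01 02:00", "avg_temperature": 10.2, "avg_humidity": 66.0},
]

# Attributes (columns) are the keys shared by every data object.
attributes = list(dataset[0].keys())

# A table only makes sense if every row has the same attributes.
is_rectangular = all(list(row.keys()) == attributes for row in dataset)

# Reading one attribute across all data objects gives one column.
temperature_column = [row["avg_temperature"] for row in dataset]

print(attributes)
print(len(dataset), "data objects,", len(attributes), "attributes")
```

Stating the conceptual definition of the data object first ("one row is one hour") is what makes each attribute value interpretable.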
7. A database is a technology for storing and retrieving data in a way
that is both effective and efficient.
• Why Databases Are Not Analytics-Ready
Databases are primarily designed to:
1. Store data efficiently.
2. Retrieve data quickly.
However, they are not specifically organized for analytics. This means
that the structure of the data in a database might not match what is
needed for analysis.
8. 1. The first step in data analytics is to find
and gather data from databases and other
sources.
2. The next step is to reorganize the data
into a format or dataset that is useful for
answering analytical or decision-making
questions.
• In essence, databases store raw data, and
it’s up to analysts to prepare this data for
meaningful analysis.
9. • A database is a technological solution for storing and retrieving
data both effectively and efficiently.
• A dataset is a specific organization and presentation of some data
for a specific reason.
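The distinction can be sketched with Python's built-in `sqlite3` module. The two tables below play the role of the database, organized for storage, and the query result plays the role of a dataset, organized for one analytical question. The table names, column names, and values are hypothetical.

```python
import sqlite3

# The database: data split across tables for efficient storage and retrieval.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 15.0);
""")

# The dataset: one row per customer, built for the question
# "how much does each customer order?"
rows = conn.execute("""
    SELECT c.id, c.region, COUNT(o.id), SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.region
    ORDER BY c.id
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

A different question, such as order size by region, would call for a different dataset built from the same database.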
10. Goal: Predict hourly electricity consumption in the city of Redlands based on weather data.
Dataset Design:
• Data object: Each row represents an hour in the city of Redlands.
• Attributes (columns):
• Average temperature.
• Average humidity.
• Average wind speed.
• Electricity consumption.
• Data Sources:
• Weather data: Comes from five databases, each recording weather every 15 minutes for its specific location in
the city.
• Electricity data: Comes from one database managed by the city’s electricity supplier, recording consumption
every 5 minutes.
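One way to reorganize these sources into the hourly dataset is sketched below. The readings are synthetic, the field names are hypothetical, and the sketch assumes each 5-minute electricity record holds the consumption for that interval, so hourly consumption is a sum. If the supplier records something else, such as instantaneous load, a different aggregation would be needed.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

def hour_of(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

start = datetime(2024, 1, 1, 0, 0)

# Synthetic weather: 5 stations, one reading every 15 minutes for 2 hours.
weather = [
    {"station": s, "time": start + timedelta(minutes=15 * i),
     "temperature": 10.0 + s + (i // 4), "humidity": 50.0, "wind_speed": 3.0}
    for s in range(5) for i in range(8)
]

# Synthetic electricity: one reading every 5 minutes for 2 hours.
electricity = [
    {"time": start + timedelta(minutes=5 * i), "consumption": 1.0 + (i // 12)}
    for i in range(24)
]

# Group weather readings from all stations by hour.
weather_by_hour = defaultdict(list)
for reading in weather:
    weather_by_hour[hour_of(reading["time"])].append(reading)

# Assumption: each record is the consumption for its interval, so we sum.
consumption_by_hour = defaultdict(float)
for reading in electricity:
    consumption_by_hour[hour_of(reading["time"])] += reading["consumption"]

# Build the dataset: one row (data object) per hour.
dataset = []
for hour in sorted(weather_by_hour):
    readings = weather_by_hour[hour]
    dataset.append({
        "hour": hour,
        "avg_temperature": mean(r["temperature"] for r in readings),
        "avg_humidity": mean(r["humidity"] for r in readings),
        "avg_wind_speed": mean(r["wind_speed"] for r in readings),
        "electricity_consumption": consumption_by_hour[hour],
    })

for row in dataset:
    print(row)
```

Each hourly row averages 20 weather readings (5 stations with 4 readings each) and combines 12 electricity readings.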
11. • There are four main types of databases:
1. Relational databases, or structured databases, are an ecosystem of data collection and
management in which both the stored data and any incoming data must conform to a
pre-defined set of relationships between the data.
2. NoSQL, or unstructured, databases are the solution for storing data that we are unable
to structure, or are ambivalent about structuring. They can also serve as an interim home
for data we do not yet have the resources to structure.
3. A distributed database is a collection of databases (structured, unstructured, or a combination
of the two) whose data is physically stored in multiple locations.
4. Blockchain is a database alternative that provides data safety without a central
authority; Bitcoin is one example.
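The contrast between the first two types can be illustrated with a small sketch. Here `sqlite3` stands in for a relational database that rejects data violating its pre-defined relationships, and a plain list of JSON strings stands in for an unstructured store. Real NoSQL systems are far richer than this stand-in, and the table and field names are hypothetical.

```python
import json
import sqlite3

# Relational: incoming data must conform to the pre-defined relationships.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE stations (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE readings (
        station_id INTEGER NOT NULL REFERENCES stations(id),
        temperature REAL NOT NULL
    );
    INSERT INTO stations VALUES (1, 'north');
""")

conn.execute("INSERT INTO readings VALUES (1, 11.5)")  # conforms: station 1 exists

try:
    conn.execute("INSERT INTO readings VALUES (99, 12.0)")  # station 99 does not exist
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
conn.close()

# Unstructured stand-in: records with different shapes are accepted as they are.
documents = [
    json.dumps({"station": "north", "temperature": 11.5}),
    json.dumps({"note": "sensor offline", "tags": ["maintenance"]}),
]
parsed = [json.loads(doc) for doc in documents]

print("relational insert rejected:", rejected)
print("rows stored:", stored)
print("documents stored:", len(parsed))
```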
12. 1. Supervised Datasets
• Definition: These datasets contain input-output (feature-label) pairs. Each data
point has both independent variables (features) and a corresponding dependent
variable (label).
• Used For: Supervised learning tasks such as classification and regression.
• Examples:
• Classification: A dataset of images labeled as "cat" or "dog".
• Regression: A dataset of house prices based on square footage, number of bedrooms, etc.
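A minimal sketch of a supervised dataset for the regression example: each row pairs a feature with a label, and a least-squares line is fitted to them. The numbers are made up, and only one feature (square footage) is used to keep the arithmetic visible.

```python
# Hypothetical supervised dataset: feature (square_feet) and label (price).
dataset = [
    {"square_feet": 1000.0, "price": 200000.0},
    {"square_feet": 1500.0, "price": 300000.0},
    {"square_feet": 2000.0, "price": 400000.0},
]

# Separate the independent variable (feature) from the dependent variable (label).
X = [row["square_feet"] for row in dataset]
y = [row["price"] for row in dataset]

# Ordinary least squares for a single feature.
mean_x = sum(X) / len(X)
mean_y = sum(y) / len(y)
slope = (sum((x - mean_x) * (label - mean_y) for x, label in zip(X, y))
         / sum((x - mean_x) ** 2 for x in X))
intercept = mean_y - slope * mean_x

# Predict the label for a new, unlabeled data object.
predicted_price = intercept + slope * 1800.0
print(slope, intercept, predicted_price)
```

The label column is what makes this dataset supervised: without `price`, there would be nothing to fit the line against.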
13. 2. Unsupervised Datasets
• Definition: These datasets contain only independent variables (features)
without predefined labels.
• Used For: Unsupervised learning tasks such as clustering, dimensionality
reduction, and anomaly detection.
• Examples:
• Clustering: Grouping customers based on purchasing behavior without predefined
categories.
• Dimensionality Reduction: Principal Component Analysis (PCA) applied to high-
dimensional data.
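The clustering example can be sketched with a tiny one-dimensional k-means, written in plain Python. The spending values are made up, the number of groups (2) is chosen by hand, and the centers are started at the smallest and largest values to keep the run deterministic. This is an illustration, not a production implementation.

```python
# Hypothetical unsupervised dataset: monthly spend per customer, no labels.
spend = [12.0, 15.0, 14.0, 95.0, 102.0, 98.0]

# Assumption: two groups, with centers started at the extremes of the data.
centers = [min(spend), max(spend)]

for _ in range(10):
    # Assignment step: each customer joins the nearest center.
    clusters = [[], []]
    for value in spend:
        nearest = min(range(2), key=lambda k: abs(value - centers[k]))
        clusters[nearest].append(value)
    # Update step: each center moves to the mean of its cluster.
    centers = [sum(cluster) / len(cluster) for cluster in clusters]

# Final group index for each customer, in the original order.
assignments = [min(range(2), key=lambda k: abs(v - centers[k])) for v in spend]

print(centers)
print(assignments)
```

No category names were supplied; the two groups emerge from the feature values alone.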
14. Feature | Supervised Dataset | Unsupervised Dataset
Labels Available? | Yes (Labeled) | No (Unlabeled)
Task Examples | Classification, Regression | Clustering, Anomaly Detection
Algorithms Used | SVM, Decision Trees, Neural Networks, Linear Regression | K-Means, DBSCAN, PCA, Autoencoders
Output Type | Predicts specific values or categories | Finds patterns and structures
Complexity | Less | More
15. • Understanding the differences between structured and unstructured
databases
16. 1. What is a database, and what are its types?
2. In your own words, describe the difference between a dataset and a
database.