Data Pre-processing, Data
Exploration and Visualization
Data Exploration
• Understanding the data
– Having a closer look at attributes and data values
– Real-world data are typically
• Noisy
• Enormous in volume
• Drawn from heterogeneous sources
– Transactions
– Unstructured data
– Expert opinions (Validation)
– Data poolers
– Publicly available data
– Understanding these characteristics is important for data preprocessing
Typical questions about data
• What are the types of attributes or fields that make up
your data?
• What kind of values does each attribute have?
• Which attributes are discrete and which are continuous?
• What do the data look like?
• How are the values distributed?
• Are there better ways to visualize?
• Can we spot outliers?
• Can we measure the similarity of data objects against one another?
Data Objects
• A data object represents an entity
– Example: customers, store items and sales
• Typically described by attributes
• Also known as samples, examples, instances,
data points or objects
• Called data tuples when stored in a database
– Rows: data objects
– Columns: Attributes
Attributes
• Represents a characteristic or a feature of a data
object
– Also known as: dimension, feature, variable
• Observations: observed values for a given attribute
• Attribute vector: a set of attributes used to
describe a given object
• Univariate distribution: a distribution of data
involving one attribute
• Bivariate distribution: a distribution that involves
two attributes
Types of attributes
• Nominal attribute
– Takes symbols or names of things
– Example: Marital_Status can take single, married,
divorced and widowed
– Also known as categorical
– Can be made numerical by assigning a numerical code
• Example: 0 for single, 1 for married, etc.
• Binary attribute
– A nominal attribute that has only two states or categories
• Example: 0 or 1, True or False
Types of attributes
• Ordinal attribute
– An attribute with possible values that have a
meaningful order or ranking
– Example: drink_size can be small, medium or large
• Numeric attributes
– Represented as integer or real values
– Quantitative; that is, a measurable quantity (all four attribute types are sketched below)
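As a concrete illustration, here is a minimal pandas sketch of the four attribute types; the column names and values are invented, and pandas is just one way to represent them.

```python
# A minimal sketch (illustrative names/values): the four attribute types
# from these slides, represented in pandas.
import pandas as pd

df = pd.DataFrame({
    "marital_status": ["single", "married", "divorced"],  # nominal
    "is_member": [True, False, True],                     # binary
    "drink_size": ["small", "large", "medium"],           # ordinal
    "age": [23, 41, 35],                                  # numeric
})

# Ordinal attributes get an explicit, meaningful ordering of categories.
df["drink_size"] = pd.Categorical(
    df["drink_size"], categories=["small", "medium", "large"], ordered=True
)
print(df["drink_size"].min())  # 'small' -- ordering makes min/max meaningful
```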
Sampling
• Why is sampling needed?
– To reduce the data volume so that analysis completes within an acceptable time window
• Sampling techniques
– Random sampling
• Each member of the population has an equal chance of being selected
– Stratified sampling
• Stratification followed by random sampling within each stratum
• Stratification: identification of sub-populations (strata) within a population (both techniques are sketched below)
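A minimal sketch of both techniques, assuming pandas 1.1+ (for `DataFrameGroupBy.sample`); the `segment` column is an invented stratification variable.

```python
# A minimal sketch of random and stratified sampling with pandas.
import pandas as pd

df = pd.DataFrame({
    "customer_id": range(1, 11),
    "segment": ["retail"] * 6 + ["corporate"] * 4,
    "spend": [120, 80, 95, 60, 200, 150, 900, 1200, 700, 1100],
})

# Random sampling: every row has an equal chance of selection.
random_sample = df.sample(frac=0.5, random_state=42)

# Stratified sampling: identify sub-populations (strata), then sample
# randomly within each stratum so both segments stay represented.
stratified_sample = (
    df.groupby("segment", group_keys=False)
      .sample(frac=0.5, random_state=42)
)
print(stratified_sample)
```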
Statistical Descriptions
• Measures of Central Tendency
– Mean, Median and Mode
– Depict where the data are centred in a distribution
• A distribution may be symmetric, positively skewed or negatively skewed
• Measures of dispersion
– Range, quartiles and interquartile range (IQR)
– Quartiles: the three points that split a data distribution into four equal-sized, consecutive parts
– Percentiles: the 99 points that split a data distribution into one hundred equal-sized, consecutive parts
Five-number summary, box plots and outliers
• Outliers: data points more than 1.5 × IQR below Q1 or above Q3 (a common rule of thumb)
• Five-number summary: the median (Q2), the quartiles Q1 and Q3, and the smallest and largest observations
• Box plot: An effective way of visualizing a
distribution
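A minimal NumPy sketch of the five-number summary and the 1.5 × IQR outlier rule above; the values are illustrative.

```python
# Compute the five-number summary and flag outliers with 1.5 * IQR.
import numpy as np

values = np.array([3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 21])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("five-number summary:", values.min(), q1, median, q3, values.max())
print("outliers:", values[(values < lower) | (values > upper)])  # -> [21]
```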
A box plot example
Visualizing basic statistical data
• Histogram
Visualizing basic statistical data
• Scatter plot
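A minimal matplotlib sketch of both plot types on synthetic data, so the two slides above are reproducible; the numbers are made up.

```python
# Histogram of one attribute and scatter plot of two related attributes.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.normal(50, 10, 500)
y = 0.8 * x + rng.normal(0, 5, 500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(x, bins=20)        # histogram: distribution of one attribute
ax1.set_title("Histogram")
ax2.scatter(x, y, s=10)     # scatter plot: relationship of two attributes
ax2.set_title("Scatter plot")
plt.show()
```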
Variance and Standard Deviation
• Measure data dispersion
• Low standard deviation: observations lie close to the mean (see the sketch below)
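A minimal sketch of the definitions, using the population (not sample) variance: the mean squared deviation from the mean, with the standard deviation as its square root.

```python
# Population variance and standard deviation from first principles.
import numpy as np

values = np.array([48.0, 50.0, 52.0, 49.0, 51.0])
mean = values.mean()
variance = ((values - mean) ** 2).mean()  # same as values.var()
std = variance ** 0.5                     # same as values.std()
print(mean, variance, std)                # 50.0 2.0 1.414...
```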
Data Quality and Pre-processing
Data Quality
• An essential requirement for the reliability of decision
making
• Characteristics:
– Complete: All relevant data, such as accounts, addresses and relationships for a given customer, is linked.
– Accurate: Common data problems like misspellings, typos, and
random abbreviations have been cleaned up.
– Available: Required data is accessible on demand; users do not
need to search manually for the information.
– Timely: Up-to-date information is readily available to support
decisions.
– Believable: the extent to which users can trust the data
– Interpretable: how easily the data can be understood
What could happen
What can quality data bring in?
• Deliver high-quality data for a range of enterprise initiatives including business
intelligence, applications consolidation and retirement, and master data
management
• Reduce time and cost to implement CRM, data warehouse/BI, data governance,
and other strategic IT initiatives and maximize the return on investments
• Construct consolidated customer and household views, enabling more effective
cross-selling, up-selling, and customer retention
• Help improve customer service and identify a company's most profitable
customers
• Provide business intelligence on individuals and organizations for research,
fraud detection, and planning
• Reduce the time required for data cleansing, saving on average 5 million hours for an average company with 6.2 million records (Aberdeen Group research)
Data Pre-processing
• Why is it necessary?
– Noisy, missing and inconsistent data
• Caused by the size and heterogeneity of data sources
– Low-quality data lead to low-quality mining
• Several data pre-processing techniques
– Data cleaning
– Data integration
– Data reduction
– Data transformation
Forms of data preprocessing
Missing Data Example
How missing values occur
• Attrition due to social/natural processes
Example: School graduation, dropout, death
• Skip pattern in survey
Example: Certain questions only asked to
respondents who indicate they are married
• Intentional missing as part of data collection
process
• Random data collection issues
• Respondent refusal/Non-response
Characteristics of missing data
• Missing Completely At Random
– probability of a record having a missing value for an attribute does
not depend on either the observed data or the missing data.
– Example: a dropped lab sample, resulting in data loss
• Missing At Random
– probability of a record having a missing value for an attribute could
depend on the observed data, but not on the value of the missing
data itself.
– Example: respondents in service occupations are less likely to report income
• Not Missing At Random
– probability of a record having a missing value for an attribute could depend on the value of that attribute itself
– Example: a sensor not capturing data below a certain threshold, or people with high incomes declining to report them
Data cleaning
• Attempt to fill missing values, smooth out noise
and correct inconsistencies
• Dealing with missing values (a pandas sketch follows the list)
– Ignore the tuple
– Fill manually
– Use a global constant to fill (Ex. Unknown)
– Use a measure of central tendency (ex. Mean, median)
– Use the attribute mean or median of the same class
– Use the most probable value
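A minimal pandas sketch of three of the filling strategies above (global constant, overall mean, and mean of the same class); the DataFrame is illustrative.

```python
# Filling missing values with a constant, the overall mean, and the
# per-class mean.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "income": [50.0, np.nan, 40.0, np.nan, 60.0],
})

df["global_const"] = df["income"].fillna(-1)                   # global constant
df["overall_mean"] = df["income"].fillna(df["income"].mean())  # central tendency
df["class_mean"] = df["income"].fillna(                        # mean of same class
    df.groupby("class")["income"].transform("mean")
)
print(df)
```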
List-wise Deletion
• What are the advantages
and disadvantages?
• Advantages
– Simplicity
– Analysis could be compared
across multiple samples
• Disadvantages
– Reduces statistical power
– Doesn’t use all information
– Estimates could be biased if
data is not MCAR
Pair-wise deletion
• Analysis with all cases in
which the variables of
interest are present
• Advantages
– Keeps as many cases as
possible for each analysis
– Uses all information
possible with each analysis
• Disadvantage
– Can't compare analyses because the sample differs each time (see the sketch below)
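A minimal pandas sketch contrasting the two deletion strategies; note how each pair-wise analysis ends up with a different sample. The columns are invented.

```python
# List-wise vs pair-wise deletion on a small DataFrame with missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
    "z": [5.0, 6.0, 7.0, np.nan],
})

# List-wise deletion: drop any row with at least one missing value.
listwise = df.dropna()                 # only row 0 survives here

# Pair-wise deletion: each analysis keeps every row where *its* variables
# are present, so different analyses use different samples.
xy = df[["x", "y"]].dropna()           # rows 0 and 3
xz = df[["x", "z"]].dropna()           # rows 0 and 1 -- a different sample
print(len(listwise), len(xy), len(xz))  # 1 2 2
```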
Single Imputation Methods
• Mean/mode substitution
– Substitute missing values with sample mean/mode
• Dummy variable adjustment
– Create an indicator for missing values
– Use a constant to impute missing values
• Regression Imputation
– Replace missing values with a predicted value from
a regression equation
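A minimal sketch of the three methods, assuming scikit-learn is available; `SimpleImputer` handles mean substitution (`strategy="most_frequent"` would give the mode), and a plain `LinearRegression` stands in for regression imputation. The data are illustrative.

```python
# Mean substitution, dummy variable adjustment, and regression imputation.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [25, 32, 47, 51, 38],
                   "income": [30.0, 45.0, np.nan, 80.0, np.nan]})

# Mean substitution (strategy="most_frequent" would substitute the mode)
df["income_mean"] = SimpleImputer(strategy="mean").fit_transform(
    df[["income"]]).ravel()

# Dummy variable adjustment: missingness indicator plus a constant fill
df["income_missing"] = df["income"].isna().astype(int)
df["income_const"] = df["income"].fillna(0.0)

# Regression imputation: predict income from age for the missing rows
known = df[df["income"].notna()]
model = LinearRegression().fit(known[["age"]], known["income"])
df["income_reg"] = df["income"]
mask = df["income"].isna()
df.loc[mask, "income_reg"] = model.predict(df.loc[mask, ["age"]])
print(df)
```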
Dealing with noise
• Noise: a random error or variance in a measured variable
• Smoothing by binning: sort the values and distribute them into bins, then smooth by bin means, medians or boundaries (a minimal sketch follows)
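A minimal pandas sketch of smoothing by bin means on an illustrative price list; `pd.qcut` forms equal-frequency bins.

```python
# Smoothing by bin means: partition sorted values into equal-frequency
# bins and replace each value with its bin's mean.
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = pd.qcut(prices, q=3, labels=False)          # 3 equal-frequency bins
smoothed = prices.groupby(bins).transform("mean")  # replace by bin mean
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```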
Dealing with noise
• Smoothing by regression
– Linear regression involves finding the best line to
fit two variables
– One attribute could be predicted using the other
– (multiple linear regression moves to a
multidimensional plane)
• Smoothing by outlier analysis
– Values that fall outside clusters may be treated as outliers
Data integration
• Merging of data from multiple sources
– Example: into a data warehouse
– Helps reduce and avoid redundancy and inconsistency
– Challenge: How to match schema and objects from
different sources?
• “entity identification problem”
• Redundancy and correlation analysis
– An attribute is redundant if it can be derived from another attribute or set of attributes (see the correlation sketch after this list)
• Tuple duplication
– Ex. De-normalized tables
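A minimal NumPy sketch of correlation analysis for spotting redundancy between numeric attributes; the salary and age figures are invented.

```python
# A correlation near +/-1 suggests one attribute is largely derivable
# from the other, and hence redundant after integration.
import numpy as np

annual_salary = np.array([30_000, 45_000, 60_000, 75_000, 90_000])
monthly_salary = annual_salary / 12  # perfectly derivable
age = np.array([25, 40, 33, 51, 46])

print(np.corrcoef(annual_salary, monthly_salary)[0, 1])  # ~1.0 -> redundant
print(np.corrcoef(annual_salary, age)[0, 1])             # weaker correlation
```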
Data reduction
• Obtain a reduced representation of the dataset
• Reduction strategies
– Dimensionality reduction
• Reducing the number of random variables or attributes
– Numerosity reduction
• Replace the original data volume with an alternative, smaller representation of the data
– Data compression
• Transforming to obtain a reduced or compressed
representation of the original data
Data transformation
• Data are transformed or consolidated into forms appropriate
for mining
• Strategies:
– Smoothing: remove noise from data
– Attribute construction: new attributes are constructed from given
ones to help mining
– Aggregation: summary or aggregation applied to data
– Normalization: attribute data scaling to fall within a smaller range
– Discretization: replacing raw values of a numeric variable by interval labels or concept labels (Ex: age replaced by an age group or a label like ‘young’); see the sketch after this list
– Concept hierarchy generation for nominal data: example:
generalizing attribute “street” to “city” or “country”
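A minimal pandas sketch of two of the strategies above, min-max normalization and discretization; the bin edges and labels are illustrative choices.

```python
# Min-max normalization to [0, 1] and discretization of age into
# labelled intervals.
import pandas as pd

df = pd.DataFrame({"age": [13, 25, 37, 58, 70]})

# Normalization: rescale to a smaller range, here [0, 1]
a_min, a_max = df["age"].min(), df["age"].max()
df["age_norm"] = (df["age"] - a_min) / (a_max - a_min)

# Discretization: replace raw ages by interval/concept labels
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 40, 65, 120],
                         labels=["child", "young", "middle_aged", "senior"])
print(df)
```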
Data Visualization
• Aims to communicate data clearly and
effectively through graphical representation
• Examples:
– Reporting
– Managing business operations
– Tracking task progress
• Also used to discover data relationships that
are otherwise not easily observable
Basic concepts of visualization
• Multidimensional data: data stored in
relational databases
• Complex data and relations
• Representative approaches
– Pixel oriented techniques
– Geometric projection techniques
– Icon based techniques
– Hierarchical and graph based techniques
Pixel-oriented visualization
• Used to visualize the values of m dimensions using pixels
– The color of a pixel corresponds to the value of the dimension
– There are m windows on the screen, one for each dimension
– Example: An electronics shop maintains a customer
information table having four dimensions: income,
credit_limit, transaction_volume and age. Can we
analyze the correlation between income and the other
attributes using visualization?
Pixel-oriented visualization
• All customer records are sorted in ascending income order
• Four windows are laid out, one per dimension, and each record's value for a dimension is drawn as a colored pixel at the same position in that window
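A minimal matplotlib sketch of this layout on synthetic customer data; because records are sorted by income, attributes correlated with income produce similar pixel patterns across windows. All values are generated for illustration.

```python
# Pixel-oriented visualization: one window per dimension, pixel color
# encodes the value, records ordered by income in every window.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 2500
income = rng.gamma(shape=9, scale=5000, size=n)
data = {
    "income": income,
    "credit_limit": income * 0.5 + rng.normal(0, 4000, n),
    "transaction_volume": income * 0.02 + rng.normal(0, 40, n),
    "age": rng.integers(18, 80, n).astype(float),
}

order = np.argsort(data["income"])  # sort all records by income
side = int(np.sqrt(n))              # lay pixels out in a square window

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, (name, values) in zip(axes, data.items()):
    window = values[order][: side * side].reshape(side, side)
    ax.imshow(window, cmap="viridis")  # pixel color encodes the value
    ax.set_title(name)
    ax.axis("off")
plt.show()
```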
Geometric Projection Visualization
Techniques
• Pixel-oriented techniques are not well suited to understanding the distribution of data in a multidimensional space
• Geometric projection techniques help users find interesting projections of multidimensional data onto 2-D or 3-D views
3-D Scatter Plot
• http://upload.wikimedia.org/wikipedia/commons/c/c4/Scatter_plot.jpg
Scatter Plot Matrix
• Example: 5 dimensional data set (5x5 matrix)
http://support.sas.com/documentation/cdl/en/grstatproc/62603/HTML/default/images/gsgscmat.gif
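A minimal sketch with pandas' built-in `scatter_matrix`; the five columns are synthetic stand-ins for a 5-dimensional dataset.

```python
# Draw a 5x5 matrix of pairwise scatter plots (histograms on the diagonal).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 5)),
                  columns=["d1", "d2", "d3", "d4", "d5"])
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```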
Parallel Coordinates
(Figure: lines crossing between adjacent axes indicate a negative relationship)
Parallel Coordinates
• Used to handle higher dimensionality
• For an n-dimensional dataset, n equally spaced axes,
one for each dimension
• A data record is a polygonal line that intersects each
axis at the corresponding value
• Limitation: cannot effectively show a dataset of many
records
• What is the use?
– To compare profiles in order to find similarities
– Negative and positive correlations
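A minimal sketch with pandas' built-in `parallel_coordinates`, which draws each record as a polygonal line across the axes and colors it by a class column; the small iris-style rows are illustrative.

```python
# One axis per dimension; each row becomes a line through its values.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9, 6.4],
    "sepal_width":  [3.5, 3.2, 3.3, 3.0, 3.2],
    "petal_length": [1.4, 4.7, 6.0, 1.4, 4.5],
    "petal_width":  [0.2, 1.4, 2.5, 0.2, 1.5],
    "species": ["setosa", "versicolor", "virginica", "setosa", "versicolor"],
})
parallel_coordinates(df, class_column="species")
plt.show()
```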
Parallel Coordinates
Rendering data for icon-based visualization
• Involves two steps
– Association step
• Associate data dimensions/columns with visual
elements
– Transformation step
• Map dimension values to visual element’s features
Icon-based Visualization
• Two methods: Chernoff faces and stick figures
• Chernoff faces: display multidimensional data
of up to 18 variables as a cartoon human face
– Components of the face (Ex. Eyes) represent
dimensional values by their shape, size, placement
and orientation
– Make data easier to digest
– Facilitate presenting regularities and irregularities
in data
Chernoff faces
Stick Figures
• Maps multidimensional data to five-piece stick figures (four limbs and a body)
• Two dimensions mapped to the display (x and y) axes
• Rest of the dimensions mapped to the angle and/or
length of the limbs
• Example: Census data with age and income mapped to
display axes and remaining to the stick figures
– Where data are dense, the figures blend into texture patterns
http://slidewiki.org/upload/media/images/141/5530.png
Hierarchical Visualization Techniques
• Large datasets with high dimensionality make
visualization difficult
– Cannot visualize all dimensions at the same time
• Hierarchical visualization techniques make
subsets of dimensions
– Subsets are visualized hierarchically
– Example:
• “Worlds-within-Worlds”
• Tree maps
Worlds-within-Worlds
• Also known as n-Vision
• Example: Visualizing a 6-D dataset
– Plot three dimensions as the “outer world”
– Select a fixed point of the outer world as the
origin for the plot of the other three dimensions
(Inner World)
– Plot the inner world accordingly
– Interactively change the origin and view the
results
World-within-Worlds Example
http://graphics.cs.columbia.edu/projects/AutoVisual/images/1.dipstick.5.gif
Tree maps
• Displays hierarchical data as nested rectangles
– Example: A tree map visualizing Google news stories
http://www.cs.umd.edu/class/spring2005/cmsc838s/viz4all/ss/newsmap.png
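A minimal sketch using the third-party `squarify` package (an assumption, not part of the slides); note it draws a single-level treemap, so truly nested hierarchies would need recursive calls per subtree.

```python
# Rectangle areas proportional to illustrative story counts per category.
import matplotlib.pyplot as plt
import squarify  # pip install squarify

story_counts = {"World": 40, "Business": 25, "Sports": 20, "Tech": 15}
squarify.plot(sizes=list(story_counts.values()),
              label=list(story_counts.keys()))
plt.axis("off")
plt.show()
```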
Visualizing Complex Data
• Visualization also applies to non-numeric data
– Example: people on the Web tag various objects such as pictures and blog entries
– A tag cloud is a visualization technique for tag statistics
– Font size or color indicates the importance of a tag (a sketch follows)
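A minimal sketch with the third-party `wordcloud` package (an assumption, not part of the slides); invented tag frequencies drive the font size.

```python
# Render a tag cloud where more frequent tags appear larger.
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # pip install wordcloud

tags = {"python": 50, "data": 35, "visualization": 30, "mining": 20, "cloud": 10}
wc = WordCloud(width=600, height=300, background_color="white")
wc.generate_from_frequencies(tags)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```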
Tag Cloud
Visualizing Complex Relationships
• What is the relationship between your customers and the various products you offer?
• What events trigger similar responses?
• Can traditional visualizations effectively
represent these situations?
• Graph data visualization
Visualizing complex relations
