2. Data Exploration
• Understanding the data
– Having a closer look at attributes and data values
– Real-world data are typically
• Noisy
• Enormous in volume
• Drawn from heterogeneous sources
– Transactions
– Unstructured data
– Expert opinions (used for validation)
– Data poolers
– Publicly available data
– Understanding the data is important for data preprocessing
3. Typical questions about data
• What are the types of attributes or fields that make up
your data?
• What kind of values does each attribute have?
• Which attributes are discrete and which are continuous?
• What do the data look like?
• How are the values distributed?
• Are there better ways to visualize?
• Can we spot outliers?
• Can we measure the similarity of data objects against one another?
4. Data Objects
• A data object represents an entity
– Examples: customers, store items, and sales
• Typically described by attributes
• Also known as samples, examples, instances,
data points or objects
• Called data tuples when stored in a database
– Rows: data objects
– Columns: Attributes
5. Attributes
• Represents a characteristic or a feature of a data
object
– Also known as: dimension, feature, variable
• Observations: observed values for a given attribute
• Attribute vector: a set of attributes used to
describe a given object
• Univariate distribution: a distribution of data
involving one attribute
• Bivariate distribution: a distribution that involves
two attributes
6. Types of attributes
• Nominal attribute
– Takes symbols or names of things
– Example: Marital_Status can take the values single, married, divorced, or widowed
– Also known as categorical
– Can be made numerical by assigning a numerical code
• Example: 0 for single, 1 for married, etc.
• Binary attribute
– A nominal attribute that has only two states or categories
• Example: 0 or 1, True or False
7. Types of attributes
• Ordinal attribute
– An attribute with possible values that have a
meaningful order or ranking
– Example: drink_size can be small, medium or large
• Numeric attributes
– Represented by integer or real values
– Quantitative; that is, a measurable quantity
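A minimal pandas sketch of these attribute types; the table, column names, and values are invented for illustration:

```python
import pandas as pd

# Toy records; names and values are illustrative only.
df = pd.DataFrame({
    "marital_status": ["single", "married", "divorced", "widowed"],  # nominal
    "is_member":      [0, 1, 1, 0],                                  # binary
    "drink_size":     ["small", "large", "medium", "small"],         # ordinal
    "age":            [23, 41, 35, 68],                              # numeric
})

# Nominal -> numeric code (categories are coded in sorted order)
df["marital_code"] = df["marital_status"].astype("category").cat.codes

# Ordinal: preserve the meaningful order small < medium < large
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
df["drink_size"] = df["drink_size"].astype(size_type)
print(df.dtypes)
```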
8. Sampling
• Why is sampling needed?
– To reduce the data volume so that analysis completes within an acceptable time window
• Sampling techniques
– Random sampling
• Each member of the population has an equal chance of being selected
– Stratified sampling
• Stratification followed by random sampling
• Stratification: identification of sub-populations within a
population
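A short pandas sketch contrasting the two techniques; the population and the 'region' strata are made up:

```python
import pandas as pd

# Illustrative population of 1,000 customers with a 'region' sub-population.
pop = pd.DataFrame({
    "customer_id": range(1000),
    "region": ["north"] * 700 + ["south"] * 300,
})

# Random sampling: each member has an equal chance of being selected.
random_sample = pop.sample(n=100, random_state=0)

# Stratified sampling: identify the strata (here, 'region'),
# then sample randomly within each stratum.
stratified_sample = (
    pop.groupby("region", group_keys=False)
       .apply(lambda g: g.sample(frac=0.1, random_state=0))
)
print(stratified_sample["region"].value_counts())  # ~70 north, ~30 south
```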
9. Statistical Descriptions
• Measures of Central Tendency
– Mean, median, and mode
– Depict the centrality of data in a distribution
• A distribution may be symmetric, positively skewed, or negatively skewed
• Measures of dispersion
– Range, quartiles and interquartile range (IQR)
– Quartile: one of the three cut points that split a data distribution into four equal-sized, consecutive parts
– Percentile: one of the 99 cut points that split a data distribution into one hundred equal-sized, consecutive parts
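These measures are straightforward to compute; a small pandas sketch on toy values:

```python
import pandas as pd

x = pd.Series([4, 8, 8, 15, 16, 23, 42])  # toy attribute values

# Measures of central tendency
print("mean:", x.mean(), "median:", x.median(), "mode:", x.mode().tolist())

# Measures of dispersion
q1, q3 = x.quantile(0.25), x.quantile(0.75)
print("range:", x.max() - x.min())
print("IQR:", q3 - q1)
print("90th percentile:", x.quantile(0.90))
```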
10. Five-number summary, box plots and outliers
• Outliers: data points falling more than 1.5 × IQR below Q1 or above Q3 (a common rule of thumb)
• Five-number summary: the median (Q2), the quartiles Q1 and Q3, and the smallest and largest observations
• Box plot: An effective way of visualizing a
distribution
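A sketch of the five-number summary and the 1.5 × IQR rule on toy data; the final line assumes matplotlib is installed:

```python
import pandas as pd

x = pd.Series([2, 3, 4, 5, 5, 6, 7, 8, 30])  # 30 is a planted outlier

# Five-number summary: smallest, Q1, median (Q2), Q3, largest
q1, q2, q3 = x.quantile([0.25, 0.50, 0.75])
print(x.min(), q1, q2, q3, x.max())

# 1.5 x IQR rule of thumb for flagging outliers
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [30]

x.plot.box()  # a box plot draws the same summary
```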
16. Data Quality
• An essential requirement for the reliability of decision
making
• Characteristics:
– Complete: all relevant data, such as accounts, addresses, and relationships for a given customer, are linked.
– Accurate: Common data problems like misspellings, typos, and
random abbreviations have been cleaned up.
– Available: Required data is accessible on demand; users do not
need to search manually for the information.
– Timely: Up-to-date information is readily available to support
decisions.
– Believable: the extent to which users can trust the data
– Interpretable: how easy the data are to understand
18. What can quality data bring in?
• Deliver high-quality data for a range of enterprise initiatives including business
intelligence, applications consolidation and retirement, and master data
management
• Reduce time and cost to implement CRM, data warehouse/BI, data governance,
and other strategic IT initiatives and maximize the return on investments
• Construct consolidated customer and household views, enabling more effective
cross-selling, up-selling, and customer retention
• Help improve customer service and identify a company's most profitable
customers
• Provide business intelligence on individuals and organizations for research,
fraud detection, and planning
• Reduce the time required for data cleansing, saving on average 5 million hours for an average company with 6.2 million records (Aberdeen Group research)
19. Data Pre-processing
• Why is it necessary?
– Noisy, missing and inconsistent data
• Caused by the size and heterogeneity of data sources
– Low-quality data lead to low-quality mining
• Several data pre-processing techniques
– Data cleaning
– Data integration
– Data reduction
– Data transformation
22. How missing values occur
• Attrition due to social/natural processes
– Example: school graduation, dropout, death
• Skip patterns in surveys
– Example: certain questions asked only of respondents who indicate they are married
• Intentional missing as part of data collection
process
• Random data collection issues
• Respondent refusal/Non-response
23. Characteristics of missing data
• Missing Completely At Random
– The probability of a record having a missing value for an attribute does not depend on either the observed data or the missing data
– Example: a dropped lab sample, resulting in data loss
• Missing At Random
– The probability of a record having a missing value for an attribute could depend on the observed data, but not on the value of the missing data itself
– Example: respondents in service occupations are less likely to report income
• Not Missing At Random
– The probability of a record having a missing value for an attribute could depend on the value of the attribute itself
– Examples: a sensor not capturing data below a certain threshold, or people not reporting income when it is high
24. Data cleaning
• Attempts to fill in missing values, smooth out noise, and correct inconsistencies
• Dealing with missing values
– Ignore the tuple
– Fill manually
– Use a global constant to fill (e.g., "Unknown")
– Use a measure of central tendency (e.g., mean or median)
– Use the attribute mean or median of all samples in the same class
– Use the most probable value
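A pandas sketch of several of these strategies on an invented table:

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 70.0, None, 90.0],
})

# Ignore the tuple (list-wise deletion; see the next slides)
dropped = df.dropna()

# Fill with a global constant such as "Unknown"
const_filled = df["income"].astype(object).fillna("Unknown")

# Fill with a measure of central tendency (mean here; median is analogous)
mean_filled = df["income"].fillna(df["income"].mean())

# Fill with the mean of the same class
class_filled = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(class_filled.tolist())  # [50.0, 50.0, 70.0, 80.0, 90.0]
```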
25. List-wise Deletion
• Discard every record that has a missing value in any variable of interest
• What are the advantages and disadvantages?
• Advantages
– Simplicity
– Analyses can be compared across multiple samples
• Disadvantages
– Reduces statistical power
– Doesn’t use all information
– Estimates could be biased if data are not MCAR
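A minimal pandas illustration on an invented table:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 47, 31],
    "income": [50.0, 60.0, None, 80.0],
})

# List-wise deletion: drop every record with any missing value.
# Simple, but rows 1 and 2 are discarded entirely, reducing statistical power.
complete_cases = df.dropna()
print(len(df), "->", len(complete_cases))  # 4 -> 2
```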
26. Pair-wise deletion
• Analyses use all cases in which the variables of interest are present
• Advantages
– Keeps as many cases as
possible for each analysis
– Uses all information
possible with each analysis
• Disadvantage
– Analyses can't be compared, because the sample differs each time
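A pandas sketch of the contrast; DataFrame.corr() uses pairwise-complete observations by default, which is exactly pair-wise deletion:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 47, 31, 52],
    "income": [50.0, 60.0, None, 80.0, 95.0],
    "spend":  [5.0, 6.0, 7.0, None, 9.0],
})

# Pair-wise deletion: each pairwise statistic uses every case where both
# variables are present, so each correlation may rest on a different sample.
print(df.corr())

# Contrast with list-wise deletion, where one common sample is used:
print(df.dropna().corr())
```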
27. Single Imputation Methods
• Mean/mode substitution
– Substitute missing values with sample mean/mode
• Dummy variable adjustment
– Create an indicator for missing values
– Use a constant to impute missing values
• Regression Imputation
– Replace missing values with a predicted value from
a regression equation
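A sketch of the dummy-variable and regression-imputation approaches; it assumes scikit-learn is available, and the data are invented:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age":    [25, 30, 47, 31, 52, 40],
    "income": [50.0, 55.0, None, 62.0, None, 70.0],
})

# Dummy variable adjustment: keep an indicator for where values were missing
missing = df["income"].isna()
df["income_missing"] = missing.astype(int)

# Regression imputation: fit income ~ age on complete cases,
# then predict income where it is missing.
known = df[~missing]
model = LinearRegression().fit(known[["age"]], known["income"])
df.loc[missing, "income"] = model.predict(df.loc[missing, ["age"]])
print(df)
```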
28. Dealing with noise
• Noise: a random error or variance in a
measured variable
• Smoothing by binning: sort the values, partition them into (equal-frequency) bins, then replace each value with the bin mean, median, or closest bin boundary, as in the sketch below
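A NumPy sketch of smoothing by bin means, using equal-frequency bins of depth 3 on toy prices:

```python
import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# Equal-frequency bins of depth 3, then smoothing by bin means:
bins = prices.reshape(-1, 3)                # 3 bins x 3 values
by_means = np.repeat(bins.mean(axis=1), 3)  # each value -> its bin mean
print(by_means)  # [9. 9. 9. 22. 22. 22. 29. 29. 29.]
```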
29. Dealing with noise
• Smoothing by regression
– Linear regression involves finding the best line to
fit two variables
– One attribute could be predicted using the other
– (Multiple linear regression extends this to more than two attributes, fitting the data to a multidimensional surface)
• Smoothing by outlier analysis
– Values that fall outside clusters are outliers
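A minimal NumPy sketch of regression smoothing: fit the best line through two attributes and take the fitted values as the smoothed data (synthetic values):

```python
import numpy as np

# Two attributes; y is roughly linear in x plus noise.
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 2.0 * x + 5.0 + rng.normal(0, 4, size=50)

# Find the best line y ~ a*x + b and use the fitted values as smoothed data.
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b
print(f"fitted line: y = {a:.2f}x + {b:.2f}")
```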
30. Data integration
• Merging of data from multiple sources
– Example: into a data warehouse
– Helps reduce and avoid redundancy and inconsistency
– Challenge: How to match schema and objects from
different sources?
• “entity identification problem”
• Redundancy and correlation analysis
– An attribute is redundant if it can be derived from another attribute or a set of other attributes (see the sketch below)
• Tuple duplication
– Example: denormalized tables
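A tiny sketch of correlation analysis catching a derived, and hence redundant, attribute; the temperature columns are invented for illustration:

```python
import pandas as pd

# temp_f is derivable from temp_c, so keeping both after integration
# would be redundant.
df = pd.DataFrame({"temp_c": [10.0, 15.0, 20.0, 25.0]})
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Correlation analysis exposes the redundancy: |r| = 1 for a derived attribute.
print(df["temp_c"].corr(df["temp_f"]))  # 1.0
```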
31. Data reduction
• Obtain a reduced representation of the dataset
• Reduction strategies
– Dimensionality reduction
• Reducing the number of random variables or attributes
– Numerosity reduction
• Replace the original data volume with an alternative, smaller representation of the data
– Data compression
• Transforming to obtain a reduced or compressed
representation of the original data
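As one concrete dimensionality-reduction sketch (the slide names no specific method), principal components analysis via scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

# 200 records with 10 correlated attributes (synthetic).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 10)) + rng.normal(0, 0.05, size=(200, 10))

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # e.g. (200, 10) -> (200, 3)
```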
32. Data transformation
• Data are transformed or consolidated into forms appropriate
for mining
• Strategies:
– Smoothing: remove noise from data
– Attribute construction: new attributes are constructed from given
ones to help mining
– Aggregation: summary or aggregation applied to data
– Normalization: attribute data scaling to fall within a smaller range
– Discretization: replacing raw values of a numeric variable by interval labels or concept labels (e.g., replacing age with age groups or labels like 'young')
– Concept hierarchy generation for nominal data: e.g., generalizing the attribute "street" to "city" or "country"
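A pandas sketch of two of these strategies, normalization (min-max scaling, one common choice) and discretization; ages, bins, and labels are invented:

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 25, 35, 40, 52, 70])

# Normalization: min-max scaling into the smaller range [0, 1]
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw ages with interval/concept labels
age_group = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                   labels=["young", "adult", "middle_aged", "senior"])
print(pd.DataFrame({"age": ages,
                    "normalized": normalized.round(2),
                    "age_group": age_group}))
```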
33. Data Visualization
• Aims to communicate data clearly and
effectively through graphical representation
• Examples:
– Reporting
– Managing business operations
– Tracking task progress
• Also used to discover data relationships that
are otherwise not easily observable
34. Basic concepts of visualization
• Multidimensional data: data stored in
relational databases
• Complex data and relations
• Representative approaches
– Pixel-oriented techniques
– Geometric projection techniques
– Icon-based techniques
– Hierarchical and graph-based techniques
35. Pixel-oriented visualization
• Visualizes the values of m dimensions using pixels
– The color of a pixel corresponds to the value of the dimension
– There are m windows on the screen, one for each dimension
– Example: An electronics shop maintains a customer
information table having four dimensions: income,
credit_limit, transaction_volume and age. Can we
analyze the correlation between income and the other
attributes using visualization?
41. Parallel Coordinates
• Used to handle higher dimensionality
• For an n-dimensional dataset, n equally spaced axes,
one for each dimension
• A data record is a polygonal line that intersects each
axis at the corresponding value
• Limitation: cannot effectively show a dataset of many
records
• What is the use?
– To compare profiles in order to find similarities
– To spot negative and positive correlations between dimensions
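pandas ships a parallel-coordinates helper that draws exactly this: one polygonal line per record across the axes. A minimal sketch with invented records (assumes matplotlib):

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# One polygonal line per record, one vertical axis per dimension.
df = pd.DataFrame({
    "income":       [40, 85, 60, 90, 35],
    "credit_limit": [2, 8, 5, 9, 1],
    "volume":       [10, 40, 25, 45, 8],
    "age":          [25, 50, 37, 55, 22],
    "segment":      ["low", "high", "mid", "high", "low"],  # colors the lines
})
parallel_coordinates(df, class_column="segment")
plt.show()
```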
43. Rendering data for icon-based visualization
• Involves two steps
– Association step
• Associate data dimensions/columns with visual
elements
– Transformation step
• Map dimension values to visual element’s features
44. Icon-based Visualization
• Two methods: Chernoff faces and stick figures
• Chernoff faces: display multidimensional data
of up to 18 variables as a cartoon human face
– Components of the face (Ex. Eyes) represent
dimensional values by their shape, size, placement
and orientation
– Make data easier to digest
– Facilitate presenting regularities and irregularities
in data
46. Stick Figures
• Maps multidimensional data to five-piece stick figures (four limbs and a body)
• Two dimensions mapped to the display (x and y) axes
• The remaining dimensions are mapped to the angle and/or length of the limbs
• Example: census data with age and income mapped to the display axes and the remaining dimensions to the stick figures
– Where data are dense, characteristic texture patterns appear
48. Hierarchical Visualization Techniques
• Large datasets with high dimensionality make
visualization difficult
– Cannot visualize all dimensions at the same time
• Hierarchical visualization techniques partition the dimensions into subsets
– Subsets are visualized hierarchically
– Example:
• “Worlds-within-Worlds”
• Tree maps
49. Worlds-within-Worlds
• Also known as n-Vision
• Example: Visualizing a 6-D dataset
– Plot three dimensions as the “outer world”
– Select a fixed point of the outer world as the
origin for the plot of the other three dimensions
(Inner World)
– Plot the inner world accordingly
– Interactively change the origin and view the
results
51. Tree maps
• Displays hierarchical data as nested rectangles
– Example: A tree map visualizing Google news stories
http://www.cs.umd.edu/class/spring2005/cmsc838s/viz4all/ss/newsmap.png
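A minimal tree-map sketch in the same spirit, assuming the third-party squarify package (pip install squarify); the categories and counts are invented:

```python
import matplotlib.pyplot as plt
import squarify

# Nested-rectangle areas proportional to (made-up) story counts.
story_counts = {"World": 40, "Business": 25, "Tech": 20, "Sports": 15}
squarify.plot(sizes=list(story_counts.values()),
              label=list(story_counts.keys()))
plt.axis("off")
plt.show()
```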
52. Visualizing Complex Data
• Visualization also applies to non-numeric data
– Example: people on the Web tag various objects, such as pictures and blog entries
– Tag cloud is a visualization technique for tag
statistics
– Font size or color indicates the importance of a tag
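A minimal tag-cloud sketch, assuming the third-party wordcloud package (pip install wordcloud); the tags and frequencies are invented:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Font size reflects each tag's (made-up) frequency.
tag_counts = {"python": 120, "visualization": 80, "data": 200,
              "mining": 60, "pandas": 45}
cloud = WordCloud(width=600, height=300).generate_from_frequencies(tag_counts)
plt.imshow(cloud)
plt.axis("off")
plt.show()
```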
54. Visualizing Complex Relationships
• What is the relationship between your customers and the various products you offer?
• What events trigger similar responses?
• Can traditional visualizations effectively
represent these situations?
• Graph data visualization