WEB MINING
12/03/18
Businesses and customers are connected by a click
• The Web has shortened the distance between a
business and its customers. It is just a click away.
• These clicks drive the economic models that support
our Web search engines and provide the economic
fuel for an increasing number of businesses.
• The click is at the heart of an economic engine that is
changing the nature of commerce with the near
instantaneous, real-time recording of customer
decisions to buy or not to buy
12/03/18 Professor V. Nagadevara
User Behaviors during a Searching Process
• VIEW RESULTS: Behavior in which the user viewed or scrolled one or more pages from the results listing. If a results page was present and the user did not scroll, we counted this as a View Results Page.
– With Scrolling: User scrolled the results page
– Without Scrolling: User did not scroll the results page
• SELECTION: Behavior in which the user makes a selection in the results listing
– Click URL (in results listing): Interaction in which the user clicked on a URL of one of the results in the results page
– Next in Set of Results List: User moved to the Next results page
– Previous in Set of Results List: User moved to the Previous results page
The Basic Premise
• The user-generated Internet data can provide
insight to understanding these users better or
point to needed changes or improvements to
existing Web systems
• It tells us the “what.”
• It does not give the insights into the
motivations or decision processes of that user.
Internet or Web Data
• Web Traffic Data
– Traditionally mined out of web server logs. Page
tagging is a more recent phenomenon
• Web Transactional Data
– Information with respect to various transactions
– No. of customers, no. of orders, average order size,
etc. It may also include customer demographics
Internet or Web Data
• Web Server Performance Data
– Web pages have many parts: text, scripts, images,
multimedia…
– Reassembled at the user end
– The higher the “weight” of the page, the longer it
takes to download and reassemble.
– The 10-second rule becomes important
The 10-second rule
• One tenth of a second is the limit for a user to feel
that the system is responding “immediately”
• One second is the limit for the user’s flow to
remain uninterrupted, although he/she will
notice the delay
• Ten seconds is the upper limit for keeping users
focused on a single task
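
The three thresholds above can be read as a simple classification of measured response times. The following is a minimal sketch; the function name and band labels are illustrative assumptions, not part of the original slides.

```python
def classify_response_time(seconds: float) -> str:
    """Map a measured response time to the perception bands listed above.

    The band labels are hypothetical names chosen for this sketch.
    """
    if seconds < 0:
        raise ValueError("response time cannot be negative")
    if seconds <= 0.1:
        return "immediate"        # system feels like it responds at once
    if seconds <= 1.0:
        return "flow preserved"   # delay is noticed, flow is not broken
    if seconds <= 10.0:
        return "attention held"   # user stays focused on the task
    return "attention lost"       # beyond the 10-second limit


# Hypothetical page load times in seconds
for t in (0.05, 0.8, 4.2, 12.0):
    print(f"{t:5.2f} s -> {classify_response_time(t)}")
```

A page whose download and reassembly time lands in the last band is a candidate for reducing its "weight".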
Table 1: Actions Taken After Abandoning Online Search for Products

Segment        Did not buy   Bought at     Bought at a         Bought at        Bought from
                             brand store   different website   discount store   paper catalogue
All            34%           24%           14%                 13%              7%
Age < 25       27%           40%           13%                 13%              7%
Age 25 to 34   43%           20%           15%                 10%              2%
Age 35 to 44   27%           30%           14%                 13%              11%
Age 45 to 49   39%           18%           7%                  25%              11%
Age 50 to 54   31%           23%           15%                 12%              4%
Age 55+        33%           7%            20%                 13%              13%
Male           29%           24%           15%                 19%              5%
Female         41%           21%           13%                 7%               9%
Internet or Web Data
• User submitted data
– User entry forms (registration forms)
– Obtained from CRM, ERP systems (Customer
loyalty programs)
– Survey data
– Opinions and feedback
– Reviews
Data Collection
• Collect behavioral data using an application
that logs user behavior on the Website, along
with other associated measures
• However, this data is not fully accurate. Sources of error include:
– Anonymous logging
– Common-use (shared) computers
– Bots/spiders
– Cookies, internal visitors, caching servers, and
incorrect page tagging
– Error rates range from 5 to 10 percent
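
Two of the error sources listed above, bots/spiders and internal visitors, can be partly filtered during log cleaning. The sketch below assumes the server writes the widely used Combined Log Format; the marker list, the internal host address, and the sample lines are hypothetical and far shorter than anything a real site would need.

```python
import re

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

BOT_MARKERS = ("bot", "spider", "crawler")  # hypothetical, deliberately short
INTERNAL_HOSTS = {"10.0.0.5"}               # hypothetical office address


def is_human_hit(line: str) -> bool:
    """Return True if the log line parses and looks like an external human visit."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return False  # malformed lines are dropped rather than guessed at
    agent = match.group("agent").lower()
    if any(marker in agent for marker in BOT_MARKERS):
        return False
    return match.group("host") not in INTERNAL_HOSTS


SAMPLE_LOG = [
    '203.0.113.7 - - [12/Mar/2018:10:15:32 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '198.51.100.9 - - [12/Mar/2018:10:15:40 +0000] "GET /robots.txt HTTP/1.1" '
    '200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '10.0.0.5 - - [12/Mar/2018:10:16:02 +0000] "GET /admin HTTP/1.1" '
    '200 900 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

human_hits = [line for line in SAMPLE_LOG if is_human_hit(line)]
print(f"{len(human_hits)} of {len(SAMPLE_LOG)} hits kept")
```

User-agent matching only catches crawlers that identify themselves, so this reduces the error rate rather than eliminating it.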
Web Log Data
• Trace Data: Internet customer interaction data from
Web systems (traces left behind that indicate human
behaviors)
• Unobtrusive: collection of the data does not interfere
with the natural flow of behavior and events in the
given context
• Nonreactive: there is no observer present where the
behaviors occur to affect the participants’ actions
• Inexpensive to collect (with transaction logging
software)
Criteria
• Credibility
– How trustworthy or believable the data collection
method is.
– The Analyst must ensure that the data collection
approach records the data needed to address the
underlying business questions.
Criteria
• Validity
– Internal validity: the extent to which the contents of the test,
method, analysis, or procedure measure what they are
supposed to measure.
– Content or construct validity: the extent to which the content
of the test, method, analysis, or procedure adequately
represents all that is required for validation (i.e., are you
collecting and accounting for all that you should collect and
account for).
– External validity: the extent to which one can generalize the
results across populations, situations, environments, and
contexts of the test, method, analysis, or procedure.
Criteria
• Reliability:
– is a term used to describe the stability of the
measurement.
– Essentially, reliability addresses whether the
measurement assesses the same thing, in the
same way, in repeated tests.
Credibility, validity, and reliability – Six Questions (Holst)
1. Which data is analyzed? The format and
content of recorded trace data.
– With transaction log software, this is much easier
than in other forms of trace data, as logging
applications can be reverse engineered to
articulate exactly what behavioral data is
recorded.
Credibility, validity, and reliability – Six Questions (Holst)
2. How is this data defined? Define each
trace measure in a manner that permits
replication on other systems and with other
users.
– As transaction log analysis (TLA) has proliferated in a
variety of venues, more precise definitions of measures
are developing
Credibility, validity, and reliability – Six Questions (Holst)
3. What is the population from which the data
was drawn? Identify the actors, both people and
systems, that created the trace data.
– With transaction logs on the Web, this is sometimes
a difficult issue to address, unless the system
requires some type of logon and these profiles are
then available.
– In the absence of these profiles, the analyst must
rely on demographic surveys, studies of the system’s
user population, or general Web demographics.
Credibility, validity, and reliability – Six Questions (Holst)
4. What is the context in which data is
analyzed? Explain the environmental, situational, and
contextual factors.
– Include information about the temporal factors of the data
collection (i.e., the date and time the data was recorded) and
the make-up of the system at the time of the data recording
– Transaction logs have the significant advantage of time
sampling: observations are recorded at predefined points in time,
together with the action taking place, using the
classification of actions defined in the ethogram
Credibility, validity, and reliability – Six Questions (Holst)
5. What are the boundaries of the analysis?
Do not overreach with the business questions
and findings.
– The implications of the research are confined by
the data and the method of data collection.
– Transaction log data can clearly state whether or
not a user clicked on a link.
– It cannot tell us why the user clicked on a
link. Was it intentional? Was it a mistake? Did the
user become sidetracked?
Credibility, validity, and reliability – Six Questions (Holst)
6. What is the target of the inferences?
Articulate the relationship among the separate
measures either to inform or to make inferences.
– Trace data can be used for both descriptive and predictive
purposes in terms of making inferences. These descriptions and
inferences can be at any level of granularity (i.e., individual,
collection of individuals, organization, etc.).
– Transaction log data is best used for aggregate-level analysis. But,
with enough data at the individual level, one can tell a lot from log
data.
Analysis
• Generate proper metrics and KPIs
– Commercial sites: overall purchase conversions,
average order size, and items per order.
– Lead generation sites: Overall conversions, conversion
by campaigns, dropouts, and conversions of leads to
actual customers.
– Customer service sites: reducing expenses and
improving customer experiences.
– Advertising on content sites: visits per week, pages
viewed per visit, visit length, advertising click ratio,
and ratio of new to returning visitors.
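
As an illustration of the commercial-site KPIs named above, the sketch below derives purchase conversion, average order size, and items per order from a visit count and a list of orders. The `Order` record, the function name, and the sample figures are hypothetical and only show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    revenue: float  # order value in any single currency
    items: int      # number of items in the order


def commerce_kpis(visits: int, orders: list[Order]) -> dict[str, float]:
    """Compute three commercial-site KPIs from raw counts."""
    if visits <= 0:
        raise ValueError("visits must be positive")
    n = len(orders)
    return {
        # share of visits that ended in a purchase
        "conversion_rate": n / visits,
        # mean revenue per order
        "average_order_size": sum(o.revenue for o in orders) / n if n else 0.0,
        # mean number of items per order
        "items_per_order": sum(o.items for o in orders) / n if n else 0.0,
    }


# Hypothetical week of data: 200 visits, 2 orders
sample_orders = [Order("A-1", 40.0, 2), Order("A-2", 60.0, 4)]
kpis = commerce_kpis(200, sample_orders)
print(kpis)
```

The same pattern extends to lead-generation sites by counting leads instead of orders.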
Analysis
• Indirect analysis: The analyst is able to collect
the data without introducing any formal
measurement procedure.
• TLA typically focuses on the interaction
behaviors occurring among the users, system,
and information. There are several examples
of utilizing transaction analysis as an indirect
approach
Analysis
• Context analysis: Analysis of text documents.
It can be quantitative, qualitative, or a mixed-methods
approach.
– Its purpose is to identify patterns in text. It is
unobtrusive and can be a relatively rapid method
for analyzing large amounts of text. In Web
analytics, it typically focuses on search queries or
analysis of retrieved results. A variety of examples
are available in this area of transaction log
research.
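
One simple quantitative way to look for patterns in search queries is to count how often each term occurs. The sketch below is only one possible approach; the tokenization rule and the sample queries are assumptions made for illustration.

```python
import re
from collections import Counter


def term_frequencies(queries):
    """Count lower-cased alphanumeric terms across a list of query strings."""
    counts = Counter()
    for query in queries:
        counts.update(re.findall(r"[a-z0-9]+", query.lower()))
    return counts


# Hypothetical search queries taken from a transaction log
SAMPLE_QUERIES = [
    "Prague Oklahoma",
    "prague lake",
    "Prague police department",
    "city of Prague OK",
]

frequencies = term_frequencies(SAMPLE_QUERIES)
for term, count in frequencies.most_common(3):
    print(term, count)
```

Term counts say nothing about intent, so a qualitative reading of the most frequent queries usually accompanies them.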
Analysis
• Secondary analysis: Makes use of already
existing sources of data.
– Refers to the re-analysis of quantitative data rather
than text analysis. Uses data that was collected by
others to address different research questions or
uses different methods of analysis
– Websites collect transaction log data for system
performance analysis. This can be used to address
other questions.
Actionable
• Action driven by the data that is in line with
the established KPI. (actionable outcomes)
– Publications that shed insight on user behavior, or
changes to some methods or system.
– In a business, calculated change to improve the
Website or business process that is directly
dependent on the KPI selected.
– Generating additional revenue, reducing costs, or
improving the user experience
Web Mining
• Web Structure Mining
– Discovers useful knowledge from hyperlinks
– Discovers important web pages
– Discovers communities that have common
interests
– Traditional data mining cannot perform this task!
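
The slides do not name a specific algorithm for finding important pages from hyperlinks. As one well-known example of link analysis, the sketch below runs a basic PageRank power iteration over a tiny link graph; the damping factor of 0.85, the iteration count, and the page names are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages without outgoing links spread their rank evenly over all pages.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical site structure: every page links back to "home"
SAMPLE_LINKS = {
    "home": ["about", "reports"],
    "about": ["home"],
    "reports": ["home"],
    "staff": ["home"],
}

ranks = pagerank(SAMPLE_LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:8s} {score:.3f}")
```

The scores depend only on the link structure, not on page content or visit counts, which is what separates this task from content and usage mining.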
Web Mining
• Web Content Mining
– Mining useful knowledge from web page contents
– Classifies or clusters similar web pages based on
content
– Extracts information about products, forum
postings, and customer reviews; discovers customer
sentiment
– Traditional data mining can do this well
Web Mining
• Web Usage Mining
– Discovers user access patterns
– Uses web log data, click stream data, page tags
– Requires a large amount of pre-processing
– Uses many traditional data mining techniques
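
A typical pre-processing step in usage mining is grouping raw hits into visits (sessions). The sketch below assumes each hit is a `(visitor_id, timestamp, url)` tuple and uses a 30-minute inactivity timeout, which is a common convention rather than something stated in the slides; the sample hits are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (visitor_id, timestamp, url) hits into sessions.

    A new session starts when the visitor changes or when the gap
    since that visitor's previous hit exceeds the timeout.
    """
    sessions = []
    previous = None
    for hit in sorted(hits, key=lambda h: (h[0], h[1])):
        visitor, timestamp, _url = hit
        if (
            previous is None
            or visitor != previous[0]
            or timestamp - previous[1] > timeout
        ):
            sessions.append([])
        sessions[-1].append(hit)
        previous = hit
    return sessions


# Hypothetical click stream
SAMPLE_HITS = [
    ("A", datetime(2018, 3, 12, 10, 0), "/"),
    ("B", datetime(2018, 3, 12, 10, 2), "/news"),
    ("A", datetime(2018, 3, 12, 10, 5), "/about"),
    ("A", datetime(2018, 3, 12, 11, 0), "/"),
]

sessions = sessionize(SAMPLE_HITS)
print([len(session) for session in sessions])
```

Once hits are grouped this way, measures such as pages per visit and visit length follow directly from each session.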
Case Study 1 – Institute for Policy Studies
www.ips-dc.org
• Twelve months of data are used (2011)
• During that period, the IPS received 292,000
visits, 202,000 (69%) of them from new visitors.
• 16,000 (5.5%) of the visitors were repeat
visitors.
• Visitors came from around the world: 201
countries and territories. The United States
contributed the most traffic, accounting for
78% of the visitors.
Issues
• IPS conducts “ad campaigns” nearly every
week with limited success
• Devotes little attention to its loyal visitors
• Too much emphasis on social websites
– On an average day, of the 4711 Facebook users
who “like” the IPS’s site, only 4 people have
checked in. That same day, 16,000 people visited
the organization’s website on their own
Time Spent
• Returning visitors averaged 3.38 pages per visit,
compared with 2.33 pages per visit for new visitors.
• The average length of a visit is 8 min. Returning
visitors spent 15 min on average, compared with
5 min for new visitors.
• 140,000 new visitors leave immediately.
Implications
• The visitor loyalty data suggest that more than
10% (12.16%) of the site’s 292,000 visitors
return more than once per month, and 5.47%
of the site’s visitors (16,025 visitors) return
weekly or daily.
• From a strategic communication standpoint,
the number of loyal visitors suggests that
specific message content addressing the
interests of this segment should be developed
Implications
• Campaign traffic associated with Google AdWords
accounts for only 1.5% of the total.
• Overall, the weekly campaigns have boosted
traffic from “new visitors.” However, the bounce rate
of the “new visitors” generated by these
campaigns approaches 80–100% most weeks.
• Campaigns or the strategic messages need to
be reconfigured.
Implications
• The bounce rate of AdWords visitors is 7.4%
higher than the site average, and their
time on site is 27% lower. These visitors are not
the IPS’s target audience.
• Other referrals: Wordpress.org (19.48% bounce
rate), Hotsalsa.org (27.19% bounce rate),
Netvibes (34.51% bounce rate), and Wikipedia
(39.73% bounce rate)
Top Landing Pages
• Top landing pages were the front page (108,000 visits),
about/join us (19,000), reports/executive (13,000),
staff/Phyllis (5000), and staff/Bob (3000).
• Both “join us” and “executive” had bounce rates
exceeding 80%.
• The home page had only a 44% bounce rate, while staff/
Phyllis and staff/Bob had 56% and 59%. The site average
is 62.5%.
• Phyllis, Bob, and the home page are highly desirable
places to visit
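
The slides use bounce rate without defining it. The sketch below assumes the usual definition, the share of visits that view only the landing page, and computes it per landing page; the page paths and sessions are hypothetical and do not reproduce the IPS figures.

```python
from collections import defaultdict


def bounce_rates(sessions):
    """Return {landing_page: bounce_rate} for a list of sessions.

    Each session is the ordered list of pages viewed in one visit.
    A bounce is a session containing exactly one page view.
    """
    entries = defaultdict(int)
    bounces = defaultdict(int)
    for pages in sessions:
        if not pages:
            continue  # ignore empty sessions
        landing = pages[0]
        entries[landing] += 1
        if len(pages) == 1:
            bounces[landing] += 1
    return {page: bounces[page] / entries[page] for page in entries}


# Hypothetical sessions
SAMPLE_SESSIONS = [
    ["/", "/about"],
    ["/"],
    ["/about/join-us"],
    ["/about/join-us"],
    ["/", "/reports"],
    ["/about/join-us", "/"],
]

rates = bounce_rates(SAMPLE_SESSIONS)
for page, rate in sorted(rates.items()):
    print(f"{page:16s} {rate:.1%}")
```

Comparing each landing page's rate with the site-wide average is what singles out pages such as those discussed above.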
Case Study 2 - City of Prague (Oklahoma)
www.CityofPragueOK.org
• A small town of approximately 2400 people
• Immigrants from the Czech Republic (former
Czechoslovakia) settled the town.
• Avatar Meher Baba Heartland Center
http://www.ambhc.org/
• The site provides space for multiple story stubs, along
with photos
Visitors
• Data for six months
• Received 2559 visits, 2123 of them from unique visitors; i.e.,
436 visits (17%) were return visits.
• Averaged 12.01 visits per day, with a high of 32 and a low of just 2.
• The overall bounce rate is 45.37%. For Oklahoma viewers, the
bounce rate was only 39.71%.
• The bounce rate from the Czech Republic is 60.87%.
• More than half of the site’s visitors spend some time on
the site. 20% of the town’s residents come back more
than once.
Visitors
• Half of the site’s traffic comes from within the state of
Oklahoma (50.68%). An additional 13.13% visited the
site from California, Texas, and Kansas. Ninety percent
were from the US.
• The Prague site also had visits from 36 countries.
• Oklahoma visitors spent an average of more than 2
min and viewed an average of 3.3 pages per visit.
Prague is a small enough city that the IP addresses of
nearby residents do not map to their town. Thus,
saying how many Oklahoma visitors visited the site
from in or near Prague is not possible.
Key Words
• Some combination of “Prague” and “Oklahoma”
(or “OK”) directed 479 (16%) of the visitors to
the site.
• Visitors also searched specifically for the city of
Prague Oklahoma (332 visitors), Prague lake
(102 visitors), and the Prague police department
(113 visitors). Thus, 40% of the key word
searches were by people who were interested in
information related to Prague, OK.
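
The 40% figure can be checked with a short calculation. The sketch below sums the four keyword groups above and divides by the 2559 visits reported on the first Visitors slide; using that number as the denominator is an assumption, since the slide does not state what the percentage is taken from.

```python
# Visitor counts per keyword group, as listed on the slide
keyword_visits = {
    "Prague + Oklahoma/OK combinations": 479,
    "city of Prague Oklahoma": 332,
    "Prague lake": 102,
    "Prague police department": 113,
}

TOTAL_VISITS = 2559  # assumed denominator, taken from the Visitors slide

prague_related = sum(keyword_visits.values())  # 1026
share = prague_related / TOTAL_VISITS

print(f"{prague_related} of {TOTAL_VISITS} visits = {share:.1%}")
```

Under that assumption the four groups together account for roughly 40% of visits, which agrees with the slide.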
Top Pages
• The most popular page is the home page (2331
visits), followed by the PragueOK news page (506 visits),
the directory page (346 visits), the city’s contact information
page (296 visits), the police department’s page
(287 visits), the library’s page (262 visits), and
the calendar page (198 visits).
• The Prague site clearly serves a role in providing
information about a majority of city services.
Suggestions
• Creating a system whereby local agencies can
contribute content each week on their own to increase
content in general without increasing the workload of
the webmaster.
• Publicizing the website - add the URL to city stationery,
business cards, other agencies, the local paper, etc.
• Adding links on the home page to all city departments
and services.
• Adding links to local attractions (not many!).
• Clarifying the names of some of the site’s pages to better
reflect the content.
THANK YOU


Web mining

  • 2. Businesses and customers are connected by a click • The Web has shortened the distance between a business and its customers. It is just a click away. • These clicks drive the economic models that support our Web search engines and provide the economic fuel for an increasing number of businesses. • The click is at the heart of an economic engine that is changing the nature of commerce with the near instantaneous, real-time recording of customer decisions to buy or not to buy 12/03/18 Professor V. Nagadevara
  • 4. User Behaviors during a Searching Process VIEW RESULTS Behavior in which the user viewed or scrolled one or more pages from the results listing. If a results page was present and the user did not scroll, we counted this as a View Results Page With Scrolling User scrolled the results page Without Scrolling User did not scroll the results page SELECTION Behavior in which the user makes a selection in the results listing Click URL (in results listing) Interaction in which the user clicked on a URL of one of the results in the results page Next in Set of Results List User moved to the Next results page Previous in Set of Results List User moved to the Previous results page. 12/03/18 Professor V. Nagadevara
  • 5. The Basic Premise • The user-generated Internet data can provide insight to understanding these users better or point to needed changes or improvements to existing Web systems • It tells us the “what.” • It does not give the insights into the motivations or decision processes of that user. 12/03/18 Professor V. Nagadevara
  • 6. Internet or Web Data • Web Traffic Data – Traditionally mined out of web server logs. Page tags is recent phenomenon • Web Transactional Data – Information with respect to various transactions – No. of customers, no. of orders, average order size etc. It may also include customer demoraphics 12/03/18 Professor V. Nagadevara
  • 7. Internet or Web Data • Web Server Performance Data – Web pages have many parts-text, scripts, images, multimedia… – Reassembled at the user end – Higher the “weight” of the page, more is the time to download and reassemble. – The 10-second rule becomes important 12/03/18 Professor V. Nagadevara
  • 8. The 10-second rule • One tenth of second is the limit for a user to feel that the system is responding “immediately” • One second is the limit for the user’s flow to remain uninterrupted, allowing him/her to notice a delay • Ten seconds is the upper limit for keeping users focused on a single task 12/03/18 Professor V. Nagadevara
  • 9. Table 1: Actions Taken After Abandoning Online Search for Products Did not Buy Brought at Brand Store Bought at different website Bought at Discount store Bought from paper catelogue All 34% 24% 14% 13% 7% Age < 25 27% 40% 13% 13% 7% Age 25 to 34 43% 20% 15% 10% 2% Age 35 to 44 27% 30% 14% 13% 11% Age 45 to 49 39% 18% 7% 25% 11% Age 50 to 54 31% 23% 15% 12% 4% Age 55+ 33% 7% 20% 13% 13% Male 29% 24% 15% 19% 5% Female 41% 21% 13% 7% 9% 12/03/18 Professor V. Nagadevara
  • 10. Internet or Web Data • User submitted data – User entry forms (registration forms) – Obtained from CRM, ERP systems (Customer loyalty programs) – Survey data – Opinions and feedback – Reviews 12/03/18 Professor V. Nagadevara
  • 11. Data Collection • Collect behavioral data using an application that logs user behavior on the Website, along with other associated measures • But, this data is inaccurate – Anonymous logging – Common use computers – Bots/Spiders – cookies, internal visitors, caching servers, and incorrect page tagging – Error rates range from 5 to 10 percent 12/03/18 Professor V. Nagadevara
  • 12. Web Log Data • Trace Data: Internet customer interaction data from Web systems (traces left behind that indicate human behaviors) • Unobtrusive: collection of the data does not interfere with the natural flow of behavior and events in the given context • Nonreactive: there is no observer present where the behaviors occur to affect the participants’ actions • Inexpensive to collect (with transaction logging software) 12/03/18 Professor V. Nagadevara
  • 13. Criteria • Credibility – How trustworthy or believable the data collection method is. – The Analyst must ensure that the data collection approach records the data needed to address the underlying business questions. 12/03/18 Professor V. Nagadevara
  • 14. Criteria • Validity – Internal validity: the extent to which the contents of the test, method, analysis, or procedure measure what they are supposed to measure. – Content or construct validity: the extent to which the content of the test, method, analysis, or procedure adequately represents all that is required for validation (i.e., are you collecting and accounting for all that you should collect and account for). – External validity: the extent to which one can generalize the results across populations, situations, environments, and contexts of the test, method, analysis, or procedure. 12/03/18 Professor V. Nagadevara
  • 15. Criteria • Reliability: – is a term used to describe the stability of the measurement. – Essentially, reliability addresses whether the measurement assesses the same thing, in the same way, in repeated tests. 12/03/18 Professor V. Nagadevara
  • 16. Credibility, validity, and reliability – Six Questions (Holst) 1. Which data is analyzed? The format and content of recorded trace data. – With transaction log software, this is much easier than in other forms of trace data, as logging applications can be reverse engineered to articulate exactly what behavioral data is recorded. 12/03/18 Professor V. Nagadevara
  • 17. Credibility, validity, and reliability – Six Questions (Holst) 2. How is this data defined? Define each trace measure in a manner that permits replication on other systems and with other users. – As TLA has proliferated in a variety of venues, more precise definitions of measures are developing 12/03/18 Professor V. Nagadevara
  • 18. Credibility, validity, and reliability – Six Questions (Holst) 3. What is the population from which the data was drawn ? Identify the actors, both people and systems, that created the trace data. – With transaction logs on the Web, this is sometimes a difficult issue to address, unless the system requires some type of logon and these profiles are then available. – In the absence of these profiles, the analyst must rely on demographic surveys, studies of the system’s user population, or general Web demographics. 12/03/18 Professor V. Nagadevara
  • 19. Credibility, validity, and reliability – Six Questions (Holst) 4. What is the context in which data is analyzed? Explain the environmental, situational, and contextual factors. – Include information about the temporal factors of the data collection (i.e., the date and time the data was recorded) and the make-up of the system at the time of the data recording – Transaction logs have the significant advantage of time sampling. Record observations at predefined points of time and then record the action that is taking place, using the classification of action defined in the ethogram 12/03/18 Professor V. Nagadevara
  • 20. Credibility, validity, and reliability – Six Questions (Holst) 5. What are the boundaries of the analysis? Do not overreach with the business questions and findings. – The implications of the research are confined by the data and the method of data collection. – Transaction log data can clearly state whether or not a user clicked on a link. – Will not inform us as to why the user clicked on a link. Was it intentional? Was it a mistake? Did the user become sidetracked? 12/03/18 Professor V. Nagadevara
  • 21. Credibility, validity, and reliability – Six Questions (Holst) 6. What is the target of the inferences? Articulate the relationship among the separate measures either to inform or to make inferences. •Trace data can be used for both descriptive and predictive purposes in terms of making inferences. These descriptions and inferences can be at any level of granularity (i.e., individual, collection of individuals, organization, etc.). •Transaction log data is best used for aggregate level analysis. But, with enough data at the individual level, one can tell a lot from log data. 12/03/18 Professor V. Nagadevara
  • 22. Analysis • Generate proper metrics and KPIs – Commercial sites: overall purchase conversions, average order size, and items per order. – Lead generation sites: Overall conversions, conversion by campaigns, dropouts, and conversions of leads to actual customers. – Customer service sites: reducing expenses and improving customer experiences. – Advertising on content sites: visits per week, page viewed per visit, visit length, advertising click ratio, and ratio of new to returning visitors. 12/03/18 Professor V. Nagadevara
  • 23. Analysis • indirect analysis: The Analyst is able to collect the data without introducing any formal measurement procedure. • TLA typically focuses on the interaction behaviors occurring among the users, system, and information. There are several examples of utilizing transaction analysis as an indirect approach 12/03/18 Professor V. Nagadevara
  • 24. Analysis • Context analysis: Analysis of text documents. It can be quantitative, qualitative, or a mixed methods. – Purpose is to identify patterns in text. It is unobtrusive and, can be a relatively rapid method for analyzing large amounts of text. In Web analytics, it typically focuses on search queries or analysis of retrieved results. A variety of examples are available in this area of transaction log research 12/03/18 Professor V. Nagadevara
  • 25. Analysis • Secondary analysis: Makes use of already existing sources of data. – Refers to the re-analysis of quantitative data rather than text analysis. It uses data that was collected by others to address different research questions, or applies different methods of analysis. – Websites collect transaction log data for system performance analysis. The same data can be used to address other questions.
  • 26. Actionable • Action driven by the data that is in line with the established KPI (actionable outcomes). – Publications that shed light on user behavior, or changes to some methods or system. – In a business, a calculated change to improve the Website or business process that is directly dependent on the KPI selected. – Generating additional revenue, reducing costs, or improving the user experience.
  • 27. Web Mining • Web Structure Mining – Discovers useful knowledge from hyperlinks. – Discovers important web pages. – Discovers communities which have common interests. – Traditional data mining cannot perform this task!
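One well-known way to "discover important web pages" from hyperlinks alone is a PageRank-style power iteration. The slides do not name a specific algorithm, so the sketch below is an assumed illustration of structure mining, not the deck's method; the page names and the damping factor of 0.85 are placeholders.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by link structure.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site in which every page links back to "home".
site_links = {
    "home": ["about", "reports"],
    "about": ["home"],
    "reports": ["home"],
    "staff": ["home"],
}
scores = pagerank(site_links)
```

The scores depend only on who links to whom, which is why this kind of analysis needs the link graph rather than a conventional table of records.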
  • 28. Web Mining • Web Content Mining – Mines useful knowledge from web page contents. – Classifies or clusters similar web pages based on content. – Extracts information about products, postings in fora, and customer reviews, and discovers customer sentiment. – Traditional data mining can do this well.
  • 29. Web Mining • Web Usage Mining – Discovers user access patterns. – Uses web log data, click stream data, and page tags. – Requires a large amount of pre-processing. – Uses many traditional data mining techniques.
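Much of the pre-processing in web usage mining consists of turning raw hits into visits (sessions). A minimal sketch follows, assuming hits have already been parsed into `(visitor_id, timestamp, url)` tuples and using a 30-minute inactivity timeout; both the tuple layout and the timeout are assumptions chosen for illustration, not values given in the slides.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(hits, timeout=timedelta(minutes=30)):
    """Group (visitor_id, timestamp, url) hits into sessions.

    A new session starts when a visitor has been inactive for longer than
    `timeout`. Returns a list of (visitor_id, [urls in visit order]).
    """
    by_visitor = defaultdict(list)
    for visitor, ts, url in hits:
        by_visitor[visitor].append((ts, url))

    sessions = []
    for visitor, events in by_visitor.items():
        events.sort()  # chronological order within each visitor
        current, last_ts = [], None
        for ts, url in events:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((visitor, current))
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append((visitor, current))
    return sessions


# Hypothetical parsed log records.
hits = [
    ("a", datetime(2011, 5, 1, 10, 0), "/"),
    ("b", datetime(2011, 5, 1, 10, 1), "/"),
    ("a", datetime(2011, 5, 1, 10, 5), "/about"),
    ("a", datetime(2011, 5, 1, 11, 0), "/reports"),
]
sessions = sessionize(hits)
```

Metrics such as pages per visit, visit length, and bounce rate can then be computed per session instead of per hit.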
  • 30. Case Study 1 – Institute for Policy Studies www.ips-dc.org • Twelve months of data (2011) are used. • During that period, the IPS received 292,000 visits, 202,000 (69%) of them from new visitors. • 16,000 (5.5%) of the visitors were repeat visitors. • Visitors came from around the world: 201 countries and territories. The United States contributed the most traffic, accounting for 78% of the visitors.
  • 31. Issues • IPS conducts “ad campaigns” nearly every week with limited success. • It devotes little attention to its loyal visitors. • Too much emphasis on social websites – On an average day, of the 4711 Facebook users who “like” the IPS’s site, only 4 people have checked in. That same day, 16,000 people visited the organization’s website on their own.
  • 32. Time Spent • Returning visitors averaged 3.38 pages per visit, compared with 2.33 pages per visit for new visitors. • The average length of a visit is 8 min. Returning visitors spent 15 min on average, compared with 5 min for new visitors. • 140,000 new visitors leave immediately.
  • 33. Implications • The visitor loyalty data suggest that more than 10% (12.16%) of the site’s 292,000 visitors return more than once per month, and 5.47% of the site’s visitors (16,025 visitors) return weekly or daily. • From a strategic communication standpoint, the number of loyal visitors suggests that specific message content addressing the interests of this segment should be developed.
  • 34. Implications • Campaign traffic associated with Google AdWords accounts for only 1.5% of the total. • Overall, the weekly campaigns have boosted traffic from “new visitors,” but the bounce rate of the “new visitors” generated by these campaigns approaches 80–100% most weeks. • The campaigns or the strategic messages need to be reconfigured.
  • 35. Implications • The bounce rate of AdWords visitors is 7.4% higher than the site average, and their time on site is 27% lower. These visitors are not IPS’s target audience. • Other referrals: Wordpress.org (19.48% bounce rate), Hotsalsa.org (27.19% bounce rate), Netvibes (34.51% bounce rate), and Wikipedia (39.73% bounce rate).
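Bounce rates per traffic source, like those compared above, can be derived from visit-level data. The sketch below assumes the common definition of a bounce as a visit that views only one page; the slides do not state their definition, and the source labels and sample visits are made up for illustration.

```python
from collections import defaultdict


def bounce_rates(visits):
    """Compute the bounce rate per traffic source.

    visits is an iterable of (source, pages_viewed) pairs.
    Assumption: a bounce is a visit with at most one page viewed.
    """
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for source, pages_viewed in visits:
        totals[source] += 1
        if pages_viewed <= 1:
            bounces[source] += 1
    return {source: bounces[source] / totals[source] for source in totals}


# Hypothetical visits, not data from the case study.
sample_visits = [
    ("adwords", 1), ("adwords", 1), ("adwords", 3),
    ("wikipedia", 1), ("wikipedia", 4),
]
rates = bounce_rates(sample_visits)
```

Comparing each source's rate with the site-wide average is what separates referrers that bring engaged readers from campaigns that bring visitors who leave at once.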
  • 36. Top Landing Pages • Top landing pages were the front page (108,000 visits), about/join us (19,000), reports/executive (13,000), staff/Phyllis (5000), and staff/Bob (3000). • Both “join us” and “executive” had bounce rates exceeding 80%. • The home page had only a 44% bounce rate, while staff/Phyllis and staff/Bob had 56% and 59%. The site average is 62.5%. • Phyllis, Bob, and the home page are highly desirable places to visit.
  • 37. Case Study 2 – City of Prague (Oklahoma) www.CityofPragueOK.org • A small town of approximately 2400 people. • Immigrants from the Czech Republic (former Czechoslovakia) settled the town. • Avatar Meher Baba Heartland Center http://www.ambhc.org/ • The website provides space for multiple story stubs, along with photos.
  • 38. Visitors • Data cover six months. • The site received 2559 visits, 2123 of them from unique visitors; i.e., 17% (436) returned to the site more than once. • It averaged 12.01 visits per day, with a high of 32 and a low of just 2. • The overall bounce rate is 45.37%. For Oklahoma viewers, the bounce rate was only 39.71%. • The bounce rate from the Czech Republic is 60.87%. • More than half of the site’s visitors spend some time on the site. 20% of the town’s residents come back more than once.
  • 39. Visitors • Half of the site’s traffic comes from within the state of Oklahoma (50.68%). An additional 13.13% visited the site from California, Texas, and Kansas. Ninety percent were from the US. • The Prague site also had visits from 36 countries. • Oklahoma visitors spent an average of more than 2 min and viewed an average of 3.3 pages per visit. Prague is a small enough city that the IP addresses of nearby residents do not map to their town. Thus, it is not possible to say how many Oklahoma visitors visited the site from in or near Prague.
  • 40. Key Words • Some combination of “Prague” and “Oklahoma” (or “OK”) directed 479 (16%) of the visitors to the site. • Visitors also searched specifically for the city of Prague Oklahoma (332 visitors), Prague lake (102 visitors), and the Prague police department (113 visitors). Thus, 40% of the key word searches were by people who were interested in information related to Prague, OK.
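The 40% figure can be checked by summing the four keyword groups. The slide does not state its denominator; the sketch below assumes it is the 2559 total visits reported on the Visitors slide, which is only one possible reading.

```python
# Visitor counts per keyword group, as reported on the slide.
keyword_visitors = {
    "Prague + Oklahoma/OK": 479,
    "city of Prague Oklahoma": 332,
    "Prague lake": 102,
    "Prague police department": 113,
}

# Assumption: the denominator is the 2559 total visits from the Visitors slide.
total_visits = 2559

prague_related = sum(keyword_visitors.values())
share = prague_related / total_visits
print(f"{prague_related} of {total_visits} visits = {share:.1%}")
```

Under that assumption the four groups together account for roughly two in five visits, which matches the slide's 40%.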
  • 41. Top Pages • The most popular page is the home page (2331 visits), followed by the PragueOK news page (506 visits), the directory page (346 visits), the city’s contact information page (296 visits), the police department’s page (287 visits), the library’s page (262 visits), and the calendar page (198 visits). • The site clearly serves a role in providing information about a majority of city services.
  • 42. Suggestions • Creating a system whereby local agencies can contribute content each week on their own, to increase content in general without increasing the workload of the webmaster. • Publicizing the website: add the URL to city stationery, business cards, other agencies, the local paper, etc. • Adding links on the home page to all city departments and services. • Adding links to local attractions (not many!). • Clarifying the names of some of the site’s pages to better reflect the content.