Web User Behavior Analysis. Doctorate in Engineering Systems (Doctorado en Sistemas de Ingeniería), Universidad de Chile. Advisor: Juan D. Velásquez. Pablo E. Román [email_address]
Outline Motivation, Hypothesis, Achievement The problems & solutions Pre-processing Simulation Calibration Conclusions & Future Work
Motivation, Hypothesis, Achievement
The best-known web companies analyze web user browsing behavior. Google 2009 net profit: US$ 6,520 million. Amazon: US$ 902 million. Netflix: US$ 116 million. (Codelco net profit: 1,262 million.) Adaptive Web Sites
Why do we study web user browsing behavior? A web user needs information that is fast to reach and complete. To enhance a web site, administrators/owners can only modify: web page contents; web site links. Hopefully, the modifications please the target group of users!
The main problem: only heuristics are available for analyzing web user browsing behavior in order to enhance the contents and structure of a web site. We think we can do better…
Research hypothesis It is possible to apply neurophysiology’s decision making theories to explain web user navigational behavior by using web data.
The Thesis Proposal. [Diagram labels] Web Intelligence: A.I. in the Web; Web Mining; Knowledge Representation; Advanced Inf. Tech. in the Web; Agent; Ubiquitous Sys.; Wireless Sys.; Grid & Cloud Sys.; Social Network. Web Mining: Web Structure Mining; Web Content Mining; Web Usage Mining. Web user neurocomputing: neurophysiology model for the analysis of behavior, discovering patterns of web user navigational behavior from the set of users’ trails.
Web user neurocomputing in Brief We use a brain model  of decision making to study  how people browse  a web site. Based on neurophysiology  first principles .
Machine learning vs. first-principle models. Traditional web mining: Machine Learning (ML), generic algorithms that can find, or be trained to reproduce, regularities in data. First-principle models (FPP): e.g. Newton’s laws. Can we use ML or FPP to build the trajectories of the Apollo mission? The one-million-dollar Netflix contest asked for a 10% improvement in the accuracy of predicted customer movie preferences: it took almost three years (2006 to 2009) to produce a winner. If the conditions of the problem change, then ML systems must be recalibrated. Proposed Solution
Thesis dissertation: Main Contributions. (1) Novel mechanism for web session extraction from web logs based on Integer Programming: 2008, WI-IAT Int. Conf., R. Dell, P. Román, J. Velásquez (using a linear objective function); 2010, submitted to IDA Journal, P. Román, R. Dell, J. Velásquez (using a network model). (2) Application of a psychology model for describing web user navigation: 2009, AWIC Int. Conf., P. Román, J. Velásquez (simulation of decision-making neurophysiology). (3) Calibration and simulation of a psychology-based stochastic model: 2009, AAAI BICA Symposium, P. Román, J. Velásquez; 2010, WI-IAT Int. Conf., P. Román, J. Velásquez.
First problem: Pre-processing
Web basic operation
Web data: Web logs (Usage)
Web data: Content (text, object,..) You can put anything you want on a Web page, from family to business info…. Hyperlink structure
Web Data: Hyperlink structure
Proposal: Data sources  Neurophysiology commonly uses data obtained from neural-cabled subjects or psychological tests (surveys). I use web data for the study of human behavior using the web
Problem: Web data pre-processing. Data: hyperlink graph, web page content, web user sessions (sequences of pages). Web logs do not directly capture sessions. How to reconstruct sessions? SESSIONIZATION: the process of obtaining sessions. If invasive methods are used, privacy rights are violated (forbidden by law in several countries): cookies, spyware, tracking applications.
Traditional approach for sessionization. Proactive: direct tracking of the web user. The most exact, but raises privacy issues. Reactive: heuristic reconstruction of the web user’s page sequence. Only an approximation (40% noise). Uses anonymous activity data sources such as web logs. Models of behavior are sensitive to noise in the data.
Traditional heuristic for sessionization. How to identify individual web users? Filtering: IP + browser (agent). Timeout of 30 minutes. Path completion: shortest path backward.
Sessionization: The proposal. Incorporate all restrictions into a combinatorial optimization problem. Two formulations: maximization of a linear reward, and a network flow model.
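The traditional heuristic can be sketched in a few lines. The version below is a minimal illustration, not the implementation evaluated in the thesis: it assumes log records are `(ip, agent, timestamp_in_seconds, page)` tuples, that the 30-minute timeout applies to the gap between consecutive requests of the same IP + agent pair, and it leaves out path completion. The function and field names are hypothetical.

```python
from collections import defaultdict

TIMEOUT_SECONDS = 30 * 60  # the 30-minute timeout named on the slide


def sessionize(log_records, timeout=TIMEOUT_SECONDS):
    """Group (ip, agent, timestamp, page) records into sessions.

    Records from the same IP + browser agent are treated as one user;
    a gap longer than `timeout` seconds starts a new session.
    """
    by_user = defaultdict(list)
    for ip, agent, timestamp, page in log_records:
        by_user[(ip, agent)].append((timestamp, page))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0][1]]
        for (prev_time, _), (time, page) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions


# Hypothetical log: one user returns after more than 30 minutes.
records = [
    ("10.0.0.1", "Firefox", 0, "/home"),
    ("10.0.0.1", "Firefox", 60, "/programs"),
    ("10.0.0.2", "Chrome", 90, "/home"),
    ("10.0.0.1", "Firefox", 4000, "/home"),
]
result = sessionize(records)
```

Two users behind the same IP and browser are merged by this rule, which is one source of the noise attributed to reactive methods.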
Integer Programming for sessionization (WI-IAT08, R. Dell, P. Román, J. Velásquez). $X_{ros}$: 1 if log register r is assigned as the o-th request during session s, and 0 otherwise. It is a labeling problem! [Figure: log registers mapped to sessions.]
Integer Program ~ Maximize the number of sessions (WI-IAT08, KES09, P. Román et al.). Constraint groups: register used once; one register on o; structure and time.
Network model: minimize the number of sessions (IDA10, P. Román et al.). Each node is a register (duplicated), each edge indicates register precedence, and the flow equals the number of sessions. [Figure: example network from source to sink through register nodes 1/1’ to 4/4’ and N/N’, arcs labeled (1,1), with Z = 3; 1 marks the flow of a session.]
Experiment: large scale (15 months). DII departmental web site: ~4000 pages, ~17000 links, ~15000 visits per month. Simple: precise information. Content mainly based on text. Objective: academics, study programs, projects, … http://www.dii.uchile.cl/
A large scale experiment evaluation: F-Score over cookie-retrieved sessions (IDA10, P. Román et al.). 0 < F < 1; higher F is better. [Figure: F-Score of traditional sessionization versus both proposals.]
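One way to see why a network model can minimize the number of sessions: if registers are nodes and an edge means "this register may directly follow that one in the same session", then sessions are vertex-disjoint paths, and the fewest paths covering all nodes equals the number of registers minus a maximum bipartite matching between "out" and "in" copies of each register. The sketch below uses that standard minimum-path-cover argument as a stand-in; it is not the flow formulation of the IDA10 paper, and the precedence data is made up.

```python
def min_sessions(n_registers, can_follow):
    """Minimum number of paths covering all registers of a precedence DAG.

    can_follow[r] lists the registers allowed to come right after r in the
    same session. Each register is split into an "out" copy and an "in" copy;
    a maximum bipartite matching between the copies links registers into
    paths, and every unmatched "in" copy starts a new session.
    """
    match_in = [None] * n_registers  # match_in[v] = u if edge u -> v is used

    def try_augment(u, seen):
        for v in can_follow.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            if match_in[v] is None or try_augment(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    matched = sum(try_augment(u, set()) for u in range(n_registers))

    # Rebuild sessions by following matched edges from each session start.
    next_of = {u: v for v, u in enumerate(match_in) if u is not None}
    starts = [v for v in range(n_registers) if match_in[v] is None]
    sessions = []
    for start in starts:
        path = [start]
        while path[-1] in next_of:
            path.append(next_of[path[-1]])
        sessions.append(path)
    return n_registers - matched, sessions


# Hypothetical example: 0 -> 1 -> 3 and 0 -> 2 are feasible continuations.
precedence = {0: [1, 2], 1: [3], 2: []}
count, sessions = min_sessions(4, precedence)
```

In a real log the `can_follow` relation would encode the same user filter, time window and hyperlink constraints that the integer program uses.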
A large scale experiment evaluation: F-Score over cookie-retrieved sessions, compared with 15 months of cookie retrieval.
Method                                      Precision  Recall  F-Score  Time
Sessionization Integer Programming (SIP)    0.7788     0.6696  0.7201   6 hours
Network Flow (BCM)                          0.7777     0.6671  0.7182   4 min
Canonical Sessionization                    0.5091     0.6996  0.5993   1 min
Summary: Pre-processing. It is possible to ensure data quality using optimization, even in the worst scenario, when only web logs are available. Main achievement: F = 0.72 in an acceptable processing time (4 min/month). Ready for neurocomputing!
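The F-Score column can be checked from the precision and recall columns, assuming the balanced F-score (harmonic mean of precision and recall) is the measure used. With that assumption the SIP and BCM rows reproduce exactly, while the canonical row evaluates to about 0.589 rather than the listed 0.5993, so one entry in that row may carry a transcription error.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation table.
table = {
    "SIP": (0.7788, 0.6696),
    "BCM": (0.7777, 0.6671),
    "Canonical": (0.5091, 0.6996),
}
scores = {name: round(f_score(p, r), 4) for name, (p, r) in table.items()}
```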
Second problem: Simulation
Strong Regularities: distribution of sessions (WI-IAT08, P. Román et al.). An empirical power law for session size has been observed in the literature [Huberman et al. 1998, Science]: the Web Surfer Law. The correlation coefficient and standard error of the fit to a power law give us a sense of the quality of the sessions. Our correlation coefficient is 0.94 and our standard error is 0.3817. A common heuristic has a correlation coefficient of 0.91 and a standard error of 0.64.
Regularity → presence of an internal rule (2008, CLAIO, P. Román et al.). Law of surfing. Machine learning algorithms have been applied in order to capture such regularities. Today, new directions based on brain informatics are used to explain navigation. What we need is a theory for explaining such regularities!
Proposal: adapt psychological theory to web navigation, using web data. Human behavior on the web is the result of brain neural network processing. Requires historical data of individuals’ trajectories on a web site. It is difficult to compute or predict the activity of $10^{11}$ neurons and $10^{14}$ interconnections. Diffusion process -> average at the mesoscopic level. This is the point of view of this thesis.
Biological experiment (1970-2005). Rhesus monkey with a sensor placed on the lateral intra-parietal (LIP) cortex (2002-2008). Screen with moving dots; the decision is to select the correct direction of motion. Monkeys are trained to receive a reward if they answer correctly. Possible options map onto the LIP cortex, and the point with the highest neural activity corresponds to the decision of the subject.
Neurophysiology of decision making: First Principles. First hitting time -> time to decide. First hitting coordinate -> the choice. [Figure: two options, with neural activities $X_1$ and $X_2$ starting at 0; it decides option 1.]
LCA Model (Leaky Competing Accumulator) [M. Usher et al., 2001]. X > 0 -> biological condition: neural activity is positive. I is considered exogenous and constant. The other parameters (κ, λ, σ) in the model are positive. The stochastic equation relates the activity X to the evidence I. $I_j$: likelihood of making choice j. It drives the decision! It results from processing in other areas (e.g. visual cortex). Important parameter!!
Application: The browsing process. Arrivals (first page) are exogenous to this model. Based on historic sessions, the model predicts the probability of following a link. Web users are information seekers and respond according to text.
Modeling the likelihood of choosing each option (vector I). $I_j$ is considered a probability of choosing option j. Discrete choice theory. Text must be represented as numeric entities -> bag-of-words model with TF-IDF (~ vector of word frequencies).
Likelihood of a decision and web user utility. Random Utility Model (Economics): individuals decide among discrete options {j}, each with utility $V_j$, choosing j with probability $P_j$. The likelihood of taking decision j should be proportional to $P_j$. Web user objectives are modeled as a text vector µ. Web users are information (TEXT) seekers. Similarity between texts is measured as the cosine between both vectors.
Assumptions & Approximations. Web browsing is characterized only by jumps. Independence of available choices. Utility depends only on text. Independent of the previously visited trail. No information satiety. Rational web user. Correctness of web site information. Web pages with little content. Web pages with simple content. Web user information processing time is negligible.
Adaptation of the LCA model (WI-IAT10, P. Román et al.). It is a Langevin equation: a force interpretation of the stochastic evolution of neural activity. This opens the way to improving the dynamical system by adding forces. Terms: evidence, inhibition, dissipation, noise.
The Fokker-Planck equation: probability density of not reaching a decision (AWIC09, P. Román et al.). Conditions on the probability density: a decision is never reached at t’ < t; neural activity is positive; neural activity is initially near 0.
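The fit quality figures quoted here can be obtained from an ordinary least-squares line in log-log space. The sketch below assumes that is how the correlation coefficient and standard error were computed, which the slide does not state, and it runs on synthetic data rather than on the thesis sessions.

```python
import math


def fit_power_law(sizes, counts):
    """Least-squares line through (log size, log count).

    Returns (slope, intercept, correlation coefficient, standard error of
    the regression), all computed in log-log space.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_error = math.sqrt(residual / (n - 2))
    return slope, intercept, r, std_error


# Synthetic, hypothetical data: counts that decay exactly as size^-1.5.
sizes = [1, 2, 3, 5, 8, 13, 21]
counts = [1000 * s ** -1.5 for s in sizes]
slope, intercept, r, std_error = fit_power_law(sizes, counts)
```

For a decaying law the coefficient `r` is negative, so its magnitude is the quantity comparable to values such as 0.94 and 0.91.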
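The equation itself is not present in this transcript. A commonly cited linear form of the leaky competing accumulator, written with the slide's symbols and with the four terms the talk names later (evidence, dissipation, inhibition, noise), is sketched below. The exact form used in the thesis, including any nonlinearity applied to the inhibition term, is an assumption here.

```latex
dX_i = \Big( \underbrace{I_i}_{\text{evidence}}
           - \underbrace{\kappa X_i}_{\text{dissipation}}
           - \underbrace{\lambda \sum_{j \neq i} X_j}_{\text{inhibition}} \Big)\, dt
       + \underbrace{\sigma\, dW_i}_{\text{noise}},
\qquad X_i \ge 0
```

Here $W_i$ are independent Wiener processes, and the decision is the first coordinate $i$ whose activity reaches the decision threshold.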
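A bag-of-words TF-IDF representation can be built with the standard library alone. The weighting below (relative term frequency times the log of inverse document frequency) is one common variant and is an assumption; the slide does not say which variant the thesis uses. The page texts are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Weight = term frequency in the document * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vocabulary = sorted(doc_freq)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical page texts, already lower-cased and split into words.
pages = [
    "mba program syllabus".split(),
    "research project program".split(),
    "mba admission".split(),
]
vocabulary, vectors = tf_idf_vectors(pages)
```

Words that appear on few pages, such as "syllabus" here, receive a higher weight than words shared by many pages.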
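The chain from text similarity to choice probability can be illustrated as follows. The sketch takes the utility $V_j$ to be the cosine between the user vector µ and the text vector of the page behind link j, and turns utilities into probabilities with a multinomial logit, $P_j \propto e^{V_j}$. The logit form is an assumption; the slide only names the random utility model. The vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choice_probabilities(user_vector, link_vectors):
    """Logit probabilities P_j with utility V_j = cosine(user, page_j)."""
    utilities = [cosine(user_vector, page) for page in link_vectors]
    weights = [math.exp(v) for v in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 3-word vocabulary; mu is the user's text preference vector.
mu = [1.0, 0.0, 1.0]
links = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
probabilities = choice_probabilities(mu, links)
```

The link whose text matches µ exactly gets the largest probability, the orthogonal one the smallest.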
The probability of reaching  a decision in time t. The probability of deciding option “j” in time “t”
Unconstrained exact solution. Hermite polynomials. Exact solution (Ornstein-Uhlenbeck).
Evolution of the exact unconstrained solution (Ornstein-Uhlenbeck). Nearly a delta at t = 0, X = 0. Large-time solution → 0. No border condition, but at t = 0 the delta values on the border are nearly 0.
This approach is threefold. (1) The stochastic equation allows simulations for finding probabilities given a web site, but parameters need to be calibrated. Approximation: constant for all users. (2) Calibration of the model is performed by maximum likelihood, but it requires web data (a session set) and an approximation of the density φ. (3) Sessions need to be obtained with higher accuracy.
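For orientation, the one-dimensional Ornstein-Uhlenbeck case has a closed-form transition density. The block below states it for a single accumulator with no inhibition and no boundary, $dX = (I - \kappa X)\,dt + \sigma\,dW$ with $X(0) = x_0$; restricting to one dimension is a simplifying assumption made here for illustration, not the multi-option solution of the thesis.

```latex
\phi(x, t) = \frac{1}{\sqrt{2\pi\, s^2(t)}}
             \exp\!\left( -\frac{\big(x - m(t)\big)^2}{2\, s^2(t)} \right),
\qquad
m(t) = x_0\, e^{-\kappa t} + \frac{I}{\kappa}\left(1 - e^{-\kappa t}\right),
\qquad
s^2(t) = \frac{\sigma^2}{2\kappa}\left(1 - e^{-2\kappa t}\right)
```

As $t \to 0$ the variance $s^2(t)$ vanishes and the density concentrates at $x_0$, which matches the delta-like initial condition; the eigenfunctions of the corresponding operator are Hermite polynomials times a Gaussian weight.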
Simulation: Monte Carlo simulation Euler approximation Exact simulation
Simulation algorithm: deciding which link to follow.
Results: simulated session length distribution (BICA08, AWIC09, BAO10). Empirical result: the session length [1] distribution follows a power law [4,5]. Kind of average web user: its text vector contains all the text in the web site. Sessions with L > 20 diverge: users that perform more elaborate processing? Sessions with L = 1: users that have other text interests?
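A single link decision can be simulated with the Euler approximation of the stochastic equation. The sketch below assumes the linear LCA drift (evidence minus dissipation minus inhibition), reflection at zero to keep activity positive, and a decision threshold of 1; the threshold, the step size and the parameter values in the example are illustrative assumptions, not calibrated values.

```python
import math
import random


def simulate_decision(evidence, kappa, lam, sigma, threshold=1.0,
                      dt=0.01, max_steps=100_000, rng=None):
    """Euler steps of the LCA until one activity first reaches `threshold`.

    Returns (chosen option index, decision time), or (None, elapsed time)
    if no option reaches the threshold within `max_steps`.
    """
    rng = rng or random.Random()
    x = [0.0] * len(evidence)  # neural activity starts at 0
    for step in range(1, max_steps + 1):
        total = sum(x)
        new_x = []
        for i, x_i in enumerate(x):
            drift = evidence[i] - kappa * x_i - lam * (total - x_i)
            noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            new_x.append(max(0.0, x_i + drift * dt + noise))  # keep X >= 0
        x = new_x
        winner = max(range(len(x)), key=x.__getitem__)
        if x[winner] >= threshold:
            return winner, step * dt  # first hitting coordinate and time
    return None, max_steps * dt


# Illustrative values (assumptions): two links, noise switched off.
choice, decision_time = simulate_decision(
    evidence=[1.0, 0.2], kappa=0.4, lam=0.2, sigma=0.0)
```

Repeating the call with noise enabled and counting the winners gives a Monte Carlo estimate of the probability of following each link; chaining decisions page by page yields a simulated session.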
Results: Number of visits per page. Fuzzy, but averages remain similar.
Adjustment of the distribution of time used per session. Simulated sessions follow the same power law as the real case. Shift in time: a change of time scale, used for adjusting the white noise variance. The slope represents more structural behavior, intended for adjusting the other scalar parameters.
In Summary. With only an estimate of the parameters, the simulation shows results that are close to the real data. Calibrating the model should produce a better simulation.
Third problem: Calibration
Calibration (WI-IAT 2010, P. Román et al.). Parameters (κ, λ, σ): should correspond to properties of neural tissue. Approximation: constant for all users. Parameter: the evidence vector I. It corresponds to the intention of the web user. It follows a distribution whose density must be approximated!!!
Parameter Inference. SESSION DATA: (i,j): hyperlink from i to j. k: indexes the time distribution. $n_{ijk}$: the number of observed transitions. $t_{ijk}$: the observed time used on this observation. Maximum log-likelihood: approximate P by a linear combination of unconstrained exact solutions. The approximated probability function must satisfy the restrictions of the LCA model.
Curse of dimensionality. Many numerical methods for solving differential equations require a partition of the space. Discretization involves partitioning each coordinate into 100 intervals. A typical number of links on a page is 20, so the total number of discretization points is about $100^{20} = 10^{40}$ → unmanageable.
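With the session data defined this way, a log-likelihood of the following shape is the natural objective. It is written under the assumption that observed transitions are independent given the parameters, with $\theta$ standing for all model parameters and $P_{ij}(t;\theta)$ for the model probability of following link $(i,j)$ at time $t$; the exact expression used in the thesis is not recoverable from the slide.

```latex
\log \mathcal{L}(\theta) = \sum_{i,j,k} n_{ijk}\, \log P_{ij}\!\left(t_{ijk};\, \theta\right),
\qquad
\hat{\theta} = \arg\max_{\theta}\, \log \mathcal{L}(\theta)
```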
Distribution of number of links per page.
Proposal (1): use symbolic processing. Explicit expressions are not manageable by hand. Operations involved: integration, differentiation, product, … Φ is based on polynomials → instead of evaluating at each step, it is better to perform symbolic manipulation until evaluation is needed. A grid is not necessary for intermediate steps. [Figure: expression tree of a symbolic formula.]
Proposal (2): use the time propagator of the Cauchy problem. The initial condition is concentrated at 0. But L must ensure the border condition!!!!
Proposal (3): penalization method for ensuring the border condition. A force $F_P$ on the boundary is added to ensure reflection and absorption: $F_P(x) = (1-x)^{2n} + x^{2n}$.
Approximating the probability distribution Φ. The unconstrained case involves a polynomial solution. The propagator takes Φ from a small t to a t’. The propagator involves only derivatives. Symbolic processing could therefore be used to build solutions for the required time t’. The probability P is built from the derivative of a definite integral, which is easily calculated by symbolic processing. A solution for the dimensionality problem!!!
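Evaluating the penalization expression shows why it acts only near the border. The sketch assumes the activity domain is the interval [0, 1], which the form of the expression suggests but the slide does not state.

```python
def penalty_force(x, n):
    """F_P(x) = (1 - x)^(2n) + x^(2n), the expression given on the slide."""
    return (1 - x) ** (2 * n) + x ** (2 * n)


# Larger n flattens the interior while the value at each border stays 1.
profile = {n: [penalty_force(x, n) for x in (0.0, 0.5, 1.0)] for n in (1, 4, 16)}
```

For n = 16 the value at the midpoint is below one billionth, so the interior dynamics are left essentially untouched.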
Experiment:  DII departmental web site. ~4000 pages ~17000 links ~15000 visits per month Simple: concise and precise information. Content mainly based on text. Objective: Academics, Study programs, Projects, …
Calibration of parameters. Neurophysiology: values 0.4, 0.2 and 0.03 for the parameters λ, κ and σ (values and symbols listed in the order each group appears on the slide). Text vector preference: 1 vector with the top-ranked words (MBA, syllabus, project); a distribution of Gaussian vectors with 3 main clusters related to study programs, academics and economics.
Simulation on the DII site. Average error of only 5% in the distribution of session size. Precision 0.8 and recall 0.74, measured by the number of specific sessions.
Comparing with the ML approach. ML algorithm based on clustering sessions with a text-based measure. The simulation approximates 70% of reality; ML reaches 60% [J. Borges, 2007, IEEE Trans.] [Ghorbani, 2007, WI-IAT] [J. Velásquez et al., 2007, International Journal of Artificial Intelligence Tools] [J. Velásquez et al., 2007, Journal of Knowledge-Based Systems (Elsevier)].
Situation after 1 month: stability of the calibration. 5% of links were modified. 2% of pages are new or deleted. 30% of words in documents have changed. The simulation reaches an F-score of 0.7.
In Summary. In spite of changes in the web site configuration (after 1 month), the simulation returns a similar session distribution. The complexity of the calibration process is reduced by symbolic calculation. The density of session length is matched to 95%, but the distribution of sessions is matched to 70%.
Conclusions & Future Work
Conclusion (1). In spite of the anonymous character of a web log, it is possible to extract sessions in good agreement with an empirical statistical law (~70% F-score). Quick pre-processing can be obtained with the use of network models. Further exploration using the combinatorial model leads us to retrieve other likely values.
Conclusions (2). Simulation shows web users behaving like text information seekers. Simulation of a web user is a straightforward algorithm if the parameters are known. The distribution of web user sessions is obtained with notable precision. Calibration is notably difficult due to the dimensionality. A method based on symbolic manipulation and semi-group propagation was proposed for density estimation.
Conclusion (3). The model is robust to changes in the web site, maintaining 70% accuracy in predicting the distribution of sessions, compared with traditional data mining methods that reach only 60-70% for one-step prediction.
Future Work. Web personalization. Simulation is cheap and parallelizable once the model is trained (expansion coefficients are fitted). Small changes (same semantics) in the web site (hyperlinks and structure) produce changes in web user trails on the web site. Simulation predicts web usage, assuming that the same users with the same fitted behaviour will visit the web site. Iterating on changes and simulation could find better changes given a measure of quality.
Publications: Book Chapters 2010 . Web Usage Mining, P. Román, G. L’Huillier, J. Velásquez, in Advanced Techniques in Web Intelligence – 1. J. Velásquez, L. Jain,  Springer . 2010 . Advanced Techniques in Web Data Pre-Processing and Cleaning, P. Román, R. F. Dell, J. Velásquez, in Advanced Techniques in Web Intelligence – 1. J. Velásquez, L. Jain,  Springer . Publications: International Journal 2010 , Optimization Models For Sessionization, Submitted to  Journal of Intelligent Data Analysis . 2011, Simulation of web user navigation. In preparation.
International Conferences. 2006. Improving a Web Site using Keywords, P. Román, J. Velásquez, CLAIO XIII, Int. Conf. Uruguay. 2008. Markov Chain for modeling the Web User Behavior, P. Román, J. Velásquez, INFORMS, CLAIO XIV, Int. Conf. Colombia. 2008. Identifying Web User Session using an Integer programming Approach, R. Dell, P. Román, J. Velásquez, CLAIO XIV, Int. Conf. Colombia. 2008. Web User Session Reconstruction Using Integer Programming, R. Dell, P. Román, J. Velásquez, IEEE/ACM, WI-IAT Int. Conf. Australia. 2009. A Dynamic Stochastic Model Applied to the Analysis of the Web User Behavior, P. Román, J. Velásquez, IEEE, AWIC Int. Conf. Czech Republic. 2009. Fast Combinatorial Algorithm for Web User Session Reconstruction, R. Dell, P. Román, J. Velásquez, the 24th IFIP TC7 Int. Conf., Argentina. 2009. Analysis of the Web User Behavior with a Psychologically-Based Diffusion Model, P. Román, J. Velásquez, AAAI BICA Int. Conf., USA. 2009. Web User Session Reconstruction with Back Button Browsing, P. Román, R. Dell, J. Velásquez, IEEE LNAI 5711, KES Int. Conf. Chile. 2010. Stochastic Simulation of Web Users, P. Román, J. Velásquez, IEEE/ACM, WI-IAT Int. Conf. Canada.
Publications: National Conferences. 2010. Ant Colony Surfer: Discovering the Distribution of Text Preferences from Web Usage, P. Loyola, P.E. Román and J.D. Velásquez, BAO. 2010. Best Web Site Structure for Users Based on a Genetic Algorithm Approach, E. Andaur, S. Rios, P.E. Román and J.D. Velásquez, BAO. 2010. Artificial Web User Simulation and Web Usage Mining, P.E. Román and J.D. Velásquez, BAO. 2010. Time Course of the Web User, P.E. Román and J.D. Velásquez, TUO2. Publications: National review. 2009. Un método de optimización lineal entera para el análisis de sesiones de usuarios web, Revista de Ingenieria de Sistemas, Vol. 23.
Thank you for your attention.

  • 1. Web User Behavior Analysis Doctorado en Sistemas de Ingeniería, Universidad de Chile. Prof. Guía: Juan D. Velásquez Pablo E. Román [email_address]
  • 2. Outline Motivation, Hypothesis, Achievement The problems & solutions Pre-processing Simulation Calibration Conclusions & Future Work
  • 4. Most famous web companies are analyzing the web user browsing behavior. Google 2009 net profit: 6,520 Millions US$ Amazon: 902 Millions US$. NetFlix: 116 Millions US$. (Codelco net profit: 1,262 Millions ) Adaptive Web Sites
  • 5. Why we study the web user browsing behavior? A web user need to fast information fast and complete. To enhance a web site Administrators/owners can only modify: Web Pages’ Contents Web Site Links Hopefully, the modification likes to objective group of members!
  • 6. The main Problem There are only Heuristics in order to analyze the web user browsing behavior to enhance the contents and structure of a web site We think we can do it better…
  • 7. Research hypothesis It is possible to apply neurophysiology’s decision making theories to explain web user navigational behavior by using web data.
  • 8. The Thesis Proposal Web Intelligence A.I. in the Web Web Mining Knowledge Representation Advanced Inf Tech. in the Web Agent Ubiquitous Sys. Wireless Sys. Grid & Cloud Sys. Social Network Web Structure Mining Web Content Mining Web Usage Mining Web user neurocomputing Neurophysiology model for the analysis of the behavior discovering pattern of web user navigational behavior from the set of user’ trails
  • 9. Web user neurocomputing in Brief We use a brain model of decision making to study how people browse a web site. Based on neurophysiology first principles .
  • 10. Machine learning vs. First principle model Traditional Web Mining: Machine Learning (ML) Generic algorithm that can found or be trained to reproduce data regularities. First principle models (FPP): e.g. Newton’s Law. Can we use ML or FPP to build trajectories of the Apollo mission ? One million dollar Netflix contest : achieve a 10% improvement to the accuracy of customer movie preference.  4 years without a winner!!! If conditions of the problem change, then ML system’s must be recalibrated . Proposed Solution
  • 11. Thesis dissertation: Main Contributions Novel mechanism for web session extraction from web log based on Integer Programming. 2008, WI-IAT Int. Conf. R. Dell, P. Román, J. Velásquez.  Using a linear objective function. 2010, Submitted to IDA Journal. P. Román, R. Dell, J. Velásquez.  Using a network model. Application of a Psychology model for describing web user navigation. 2009, AWIC Int. Conf. P. Román, J. Velásquez.  Simulation of decision’s making Neurophysiology. Calibration and simulation of Psychology based stochastic model. 2009, IAAA BICA Symposium, P. Román, J. Velásquez. 2010, WI-IAT Int. Conf. , P. Román, J. Velásquez.
  • 14. Web data: Web logs (Usage)
  • 15. Web data: Content (text, object,..) You can put anything you want on a Web page, from family to business info…. Hyperlink structure
  • 16. Web Data: Hyperlink structure
  • 17. Proposal: Data sources Neurophysiology commonly uses data obtained from neural-cabled subjects or psychological tests (surveys). I use web data for the study of human behavior using the web
  • 18. Problem: Web data pre-processing Hyperlink graph, Web page content, Web user session (sequence of pages). Web Logs do not directly capture sessions How to reconstruct sessions ? SESSIONIZATION : process for obtaining sessions. If invasive methods are used privacy right are violated (forbidden by law in several countries). Cookies Spyware Tracking applications
  • 19. Traditional approach for sessionization Proactive : direct tracking of the web user Privacy issue The most exact Reactive : reconstruction of web user’s page sequence heuristically . Only an approximation (40% noise) Use anonymous activity data sources like web logs. Models of behavior are sensitive to noise in data.
  • 20. Traditional heuristic for sessionization How to identify individual web users? Filtering: IP+Browser(Agent) Timeout of 30 minute Path completion: shortest path backward
  • 21. Sessionization: The proposal Incorporate all restrictions as a combinatorial optimization problem. Two formulation: Maximization of a linear reward, network flow model.
  • 22. Integer Programming for sessionization ( WI-IAT08 R. Dell, P.Roman, J. Velasquez ) X ros : 1 if log register “ r ” is assigned as the “ o-th ” request during session “ s ” and zero otherwise. It is a labeling problem! Log register Sessions
  • 23. Integer Program ~ Maximize the number of sessions. ( WI-IAT08, KES09 P. Roman et al ) Register used once One register on o Structure and time
  • 24. Network model: Minimize number of session. (IDA10 P. Roman et al) Edge indicates register precedence Node is a register (duplicated) Flow = Number of sessions Source Sink … Z=3 1 0 0 (1,1) 0 0 0 1 : flow of a session 1 (1,1) 1’ 2 (1,1) 2’ 3 (1,1) 3’ 1 0 Now is feasible N N’ (1,1) 4 4’ 1 1 1 0 0
  • 25. Experiment: Large scale (15 month) DII departmental web site. ~4000 pages ~17000 links ~15000 visits per month Simple: precise information Content mainly based on text Objective: Academics, Study programs, Projects, … http://www.dii.uchile.cl/
  • 26. A large scale experiment evaluation: F-Score over cookie retrieved sessions. (IDA10 P. Roman et al) 0<F<1 Higher F is better Traditional sessionization Both proposal
  • 27. A large scale experiment evaluation: F-Score over cookie retrieved sessions. Compared with 15 months of cookie retrieval Method Precision Recall F-Score Time Sessionization Integer Programming (SIP) 0.7788 0.6696 0.7201 6 Hour Network Flow (BCM) 0.7777 0.6671 0.7182 4 Min Canonical Sessionization 0.5091 0.6996 0.5993 1 Min
  • 28. Summary: Pre-processing It is possible to ensure data quality using optimality Even in the worst scenario when only web logs are available. Main Achievement: F=0.72 In acceptable processing time 4min/month Ready for Neurocomputing!
  • 30. Strong Regularities: Distribution of sessions (WI-IAT08 P. Roman et al) Empirical power rule for session size has been observed in the literature [Huberman et al. 1998, Science]. Web Surfer Law. The correlation coefficient and standard error of fitting to a power law gives us a sense of the quality of the sessions. Our correlation coefficient is 0.94 and our standard error is 0.3817. A common heuristic has a correlation coefficient of 0.91 and a standard error of 0.64.
  • 31. Regularity  presence of internal rule (2008, CLAIO P. Roman et al.) Law of surfing Machine learning algorithm has been applied in order to capture such regularities. Today new directions based on the brain’s informatics are used to explains navigation. What we need is a theory for explaining such regularities!
  • 32. Proposal: adapt psychological theory to web navigation, using web data. Human behavior on the web is the result of the brain's neural network processing. It requires historical data of individuals' trajectories on a web site. It is difficult to calculate or predict the activity of 10^11 neurons and 10^14 interconnections. Diffusion process -> average at the mesoscopic level. This is the point of view of this thesis.
  • 33. Biological experiment (1970-2005). Rhesus monkeys with sensors placed on the lateral intra-parietal (LIP) cortex (2002-2008). A screen shows moving dots; the decision is to select the correct direction of motion. Monkeys are trained to receive a reward if they answer correctly. The possible options map onto the LIP cortex, and the point with the highest neural activity corresponds to the subject's decision.
  • 34. Neurophysiology of decision making: first principles. First hitting time -> time to decide. First hitting coordinate -> the choice. [Figure: two options, with axes X_1 and X_2 starting at 0; the subject decides option 1.]
  • 35. LCA model (Leaky Competing Accumulator) [M. Usher et al., 2001]. A stochastic equation describes the neural activity X driven by the evidence I. X>0 -> biological condition: neural activity is positive. I is considered exogenous and constant. The other parameters (κ, λ, σ) in the model are positive. I_j: likelihood of making choice j. It drives the decision! It results from processing in other areas (e.g. visual cortex). Important parameter!!
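For reference, LCA dynamics are commonly written in the form below. This is a sketch recalled from the general LCA literature and is not copied from the slide: the assignment of κ to dissipation and λ to lateral inhibition, and the inhibition function f, are assumptions.

```latex
dX_i = \Bigl( I_i - \kappa X_i - \lambda \sum_{j \neq i} f(X_j) \Bigr)\, dt + \sigma\, dW_i, \qquad X_i \ge 0
```

Here W_i are independent Wiener processes, so σ scales the noise and I_i is the constant evidence in favour of option i.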
  • 36. Application: the browsing process. Arrivals (first page) are exogenous to this model. Based on historic sessions, the model predicts the probability of following a link. Web users are information seekers and respond according to text.
  • 37. Modeling the likelihood of choosing each option (vector I). I_j is considered a probability of choosing option j. Discrete choice theory. Text must be represented as numeric entities -> bag-of-words model with TF-IDF (~ vector of word frequencies).
  • 38. Likelihood of a decision and web user utility. Random Utility Model (economics): individuals decide among discrete options {j} with utility V_j, choosing option j with probability P_j. The likelihood of taking decision j should be proportional to P_j. Web user objectives are modeled as a text vector µ. Web users are information (TEXT) seekers. Similarity between texts is measured as the cosine between both vectors.
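The link between text similarity and choice probability can be sketched as follows. The cosine utility comes from the slide; the multinomial-logit form and the sensitivity parameter `beta` are assumptions made for illustration, as are the function names and the toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def choice_probabilities(mu, link_vectors, beta=1.0):
    """Multinomial-logit probabilities with utility V_j = cosine(mu, L_j).

    beta is a hypothetical sensitivity parameter, not taken from the slides.
    """
    utilities = [cosine(mu, lv) for lv in link_vectors]
    weights = [math.exp(beta * v) for v in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up TF-IDF-like vectors over a three-word vocabulary.
mu = [1.0, 0.0, 0.0]  # user's text preference
links = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
probs = choice_probabilities(mu, links, beta=2.0)
```

In this toy case the link whose text matches µ exactly gets the highest probability and the orthogonal one the lowest; the resulting vector could then play the role of the evidence I.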
  • 39. Assumption & Approximation Web browsing is characterized only by jumps. Independence of available choices. Utility only depends on text. Independent of the past visited trail. No information Satiety Rational web user Correctness of web site information Web pages with little content. Web page with simple content. Web user information processing time is negligible.
  • 40. Adaptation of the LCA model (WI-IAT10, P. Román et al.). It is a Langevin equation: a force interpretation of the stochastic evolution of neural activity. This opens the way for improving the dynamic system by adding forces. Terms: evidence, inhibition, dissipation, noise.
  • 41. The Fokker-Planck equation: probability density of not having reached a decision (AWIC09, P. Román et al.). Conditions: no decision is reached at any t'<t; neural activity is positive; neural activity is initially near 0.
  • 42. The probability of reaching a decision in time t. The probability of deciding option “j” in time “t”
  • 43. Unconstrained exact solution. Hermite polynomials. Exact solution (Ornstein-Uhlenbeck).
  • 44. Exact unconstrained solution evolution. Nearly a delta at t=0, X=0. Large-time solution → 0. No border condition, but at t=0 the delta values on the border are nearly 0 (Ornstein-Uhlenbeck).
  • 45. This approach is threefold Stochastic equation allows simulations for finding probabilities given a web site. But parameters need to be calibrated. Approximation: constant for all users. Calibration of the model is performed by maximum likelihood. But requires web data (session set). Requires approximation of the density φ Session needs to be obtained with higher accuracy.
  • 46. Simulation: Monte Carlo simulation Euler approximation Exact simulation
  • 47. Simulation algorithm: deciding which link to follow.
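A single link decision could be simulated with an Euler scheme along the following lines. This is a sketch under stated assumptions, not the authors' algorithm: the drift form, the identity inhibition function, clipping at 0 to keep activity positive, the decision threshold of 1, and all numeric values are illustrative.

```python
import math
import random

def simulate_decision(evidence, kappa, lam, sigma, dt=0.01,
                      threshold=1.0, max_steps=100_000, rng=None):
    """Euler simulation of an LCA-style race; returns (choice, time).

    Assumed dynamics (not copied from the slides):
        dX_i = (I_i - kappa*X_i - lam*sum_{j != i} X_j) dt + sigma dW_i
    Activity is clipped at 0; the first unit to reach `threshold` wins.
    Returns (None, elapsed_time) if no unit reaches it within max_steps.
    """
    rng = rng or random.Random()
    x = [0.0] * len(evidence)
    sqrt_dt = math.sqrt(dt)
    for step in range(1, max_steps + 1):
        total = sum(x)
        new_x = []
        for i, xi in enumerate(x):
            drift = evidence[i] - kappa * xi - lam * (total - xi)
            xi_next = xi + drift * dt + sigma * sqrt_dt * rng.gauss(0.0, 1.0)
            new_x.append(max(xi_next, 0.0))
        x = new_x
        best = max(range(len(x)), key=lambda i: x[i])
        if x[best] >= threshold:
            return best, step * dt
    return None, max_steps * dt

# Illustrative values only: three links, the first with the strongest evidence.
rng = random.Random(0)
choices = [simulate_decision([0.6, 0.2, 0.2], 0.2, 0.4, 0.03, rng=rng)[0]
           for _ in range(20)]
```

The first hitting coordinate gives the chosen link and the first hitting time gives the time spent deciding; repeating the step page after page would yield a simulated session.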
  • 48. Results: simulated session length distribution (BICA08, AWIC09, BAO10). Empirical result: the session length [1] distribution follows a power law [4,5]. A kind of average web user: µ contains all text in the web site. Sessions with L>20 diverge: users that perform more elaborate processing? Sessions with L=1: users that have other text interests?
  • 49. Results: Number of visits per page. Fuzzy, but averages remain similar.
  • 50. Adjustment of the distribution of time used per session. Simulated sessions follow the same power law as the real case. A shift in time changes the time scale, which is used for adjusting the white noise variance. The slope represents more structural behavior; it is intended for adjusting the other scalar parameters.
  • 51. In summary: with only an estimate of the parameters, simulation shows results that are close to the real ones. Calibrating the model should produce better simulations.
  • 53. Calibration (WI-IAT 2010, P. Román et al.). Parameters: should correspond to properties of neural tissue. Approximation: constant for all users. Parameter: the evidence vector I. It corresponds to the intention of the web user. It is distributed. The density must be approximated!!!
  • 54. Parameter inference. SESSION DATA: (i,j): hyperlink from i to j. k: enumerates the time distribution. n_ijk: the number of observed transitions. t_ijk: the observed time used on this observation. Maximum log-likelihood: approximate P by a linear combination of unconstrained exact solutions. The approximated probability function must satisfy the restrictions of the LCA model.
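With the notation above, a natural objective is the sum of n_ijk * log P over the observed transitions. The sketch below assumes that form, since the slide's own formula is not reproduced in this transcript; `prob` is a stand-in for the model's approximated P, and the data are made up.

```python
import math

def log_likelihood(observations, prob):
    """Sum of n * log P(i, j, t) over observed transitions.

    observations: iterable of (i, j, t, n) tuples, where n transitions
    from page i to page j were observed with decision time t.
    prob: callable (i, j, t) -> probability in (0, 1]; a stand-in for the
    model's approximated P, which is not reproduced here.
    """
    return sum(n * math.log(prob(i, j, t)) for i, j, t, n in observations)

# Made-up data and a toy probability function, for illustration only.
obs = [(0, 1, 2.0, 3), (0, 2, 1.0, 1)]
toy_prob = lambda i, j, t: 0.5
value = log_likelihood(obs, toy_prob)
```

Calibration would then search for the parameters that maximize this value.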
  • 55. Curse of dimensionality. Many numerical methods for solving differential equations require a partition of the space. Discretization involves: each coordinate partitioned into 100 intervals. A typical number of links on a page is 20. Then the total number of points of the discretization is about 10^40 -> unmanageable.
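The count follows directly from the two figures on the slide; the short check below also estimates storage under the added assumption of 8 bytes per grid value, which is not from the source.

```python
points_per_axis = 100   # partitions per coordinate (from the slide)
links_per_page = 20     # typical number of links, i.e. dimensions
grid_points = points_per_axis ** links_per_page

# Assumption for scale only: 8 bytes stored per grid value.
bytes_needed = grid_points * 8
```

Since 100^20 = (10^2)^20, the grid has 10^40 points, far beyond any feasible memory.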
  • 56. Distribution of number of links per page.
  • 57. Proposal (1): use symbolic processing. Explicit expressions are not manageable by hand. Operations involved: integration, differentiation, product, … Φ is based on polynomials -> instead of evaluating at each step, it is better to perform symbolic manipulation until evaluation is needed -> a grid is not necessary for intermediate steps. [Figure: symbolic expression tree.]
  • 58. Proposal (2): use the time propagator of the Cauchy problem. The initial condition is concentrated on 0. But L must ensure the border condition!!!!
  • 59. Proposal (3): penalization method for ensuring the border condition. A force F_P is added on the boundary to ensure reflection and absorption: F_P(x) = (1-x)^{2n} + x^{2n}
  • 60. Approximating the probability distribution Φ. The unconstrained case involves a polynomial solution. The propagator takes Φ from a small t to t'. The propagator involves only derivatives. Symbolic processing can be used to build solutions for the required time t'. The probability P is built from the derivative of a definite integral, which is easily calculated by symbolic processing. A solution for the dimensionality problem!!!
  • 61. Experiment: DII departmental web site. ~4000 pages ~17000 links ~15000 visits per month Simple: concise and precise information. Content mainly based on text. Objective: Academics, Study programs, Projects, …
  • 62. Calibration of parameters. Neurophysiology: λ = 0.4, κ = 0.2, σ = 0.03. Text vector preference: 1 vector, most ranked words: MBA, syllabus, project. A distribution of Gaussian vectors: 3 main clusters related to study programs, academics, economics.
  • 63. Simulation in the DII site. Average error of only 5% in the distribution of session size. Precision: 0.8, recall: 0.74, by number of specific sessions.
  • 64. Comparing with the ML approach. ML algorithm based on clustering sessions using text measures. Simulation approximates 70% of reality. ML reaches 60% [J. Borges, 2007, IEEE Trans.] [Ghorbani, 2007, WI-IAT] [J. Velasquez et al., 2007, International Journal of Artificial Intelligence Tools] [J. Velasquez et al., 2007, Journal of Knowledge-Based Systems (Elsevier)]
  • 65. Situation after 1 month: stability of the calibration. 5% of links were modified. 2% of pages are new or deleted. 30% of words in documents have changed. Simulation reaches an F-score of 0.7.
  • 66. In summary: in spite of the changing web site configuration (after 1 month), simulation returns a similar session distribution. The complexity of the calibration process is improved by symbolic calculation. The density of session length is matched at 95%, but the distribution of sessions is matched at 70%.
  • 68. Conclusion (1). In spite of the anonymous character of a web log, it is possible to extract sessions in good agreement with an empirical statistical law (~70% F-score). Quick pre-processing can be obtained with the use of network models. Further exploration using combinatorial models leads us to retrieve other likely values.
  • 69. Conclusions (2). Simulation shows that web users behave like text information seekers. Simulating a web user is a straightforward algorithm if the parameters are known. The distribution of web user sessions is obtained with notable precision. Calibration is notably difficult due to the dimensionality. A method based on symbolic manipulation and semi-group propagation was proposed for density estimation.
  • 70. Conclusion (3). The model is robust to changes to the web site, maintaining 70% accuracy in predicting the distribution of sessions. By comparison, traditional data mining methods reach only 60-70% on one-step prediction.
  • 71. Future work: web personalization. Simulation is cheap and parallelizable once it is trained (expansion coefficients are fitted). Small changes (same semantics) in the web site (hyperlinks and structure) produce changes in web user trails on the web site. Simulation predicts web usage, assuming that the same users with the same fitted behaviour will visit the web site. Iterating on changes and simulation could find better changes given a measure of quality.
  • 72. Publications: Book Chapters
      2010. Web Usage Mining, P. Román, G. L’Huillier, J. Velásquez, in Advanced Techniques in Web Intelligence – 1, J. Velásquez, L. Jain, Springer.
      2010. Advanced Techniques in Web Data Pre-Processing and Cleaning, P. Román, R. F. Dell, J. Velásquez, in Advanced Techniques in Web Intelligence – 1, J. Velásquez, L. Jain, Springer.
    Publications: International Journal
      2010. Optimization Models For Sessionization, submitted to Journal of Intelligent Data Analysis.
      2011. Simulation of web user navigation. In preparation.
  • 73. International Conferences
      2006. Improving a Web Site using Keywords, P. Román, J. Velásquez, CLAIO XIII, Int. Conf. Uruguay.
      2008. Markov Chain for modeling the Web User Behavior, P. Román, J. Velásquez, Informs, CLAIO XIV, Int. Conf. Colombia.
      2008. Identifying Web User Session using an Integer programming Approach, R. Dell, P. Román, J. Velásquez, CLAIO XIV, Int. Conf. Colombia.
      2008. Web User Session Reconstruction Using Integer Programming, R. Dell, P. Román, J. Velásquez, IEEE/ACM, WI-IAT Int. Conf. Australia.
      2009. A Dynamic Stochastic Model Applied to the Analysis of the Web User Behavior, P. Román, J. Velásquez, IEEE, AWIC Int. Conf. Czech Republic.
      2009. Fast Combinatorial Algorithm for Web User Session Reconstruction, R. Dell, P. Román, J. Velásquez, the 24th IFIP TC7 Int. Conf., Argentina.
      2009. Analysis of the Web User Behavior with a Psychologically-Based Diffusion Model, P. Román, J. Velásquez, AAAI BICA Int. Conf., USA.
      2009. Web User Session Reconstruction with Back Button Browsing, P. Román, R. Dell, J. Velásquez, IEEE LNAI 5711, KES Int. Conf. Chile.
      2010. Stochastic Simulation of Web Users, P. Román, J. Velásquez, IEEE/ACM, WI-IAT Int. Conf. Canada.
  • 74. Publications: National Conferences
      2010. Ant Colony Surfer: Discovering the Distribution of Text Preferences from Web Usage, P. Loyola, P.E. Román and J.D. Velásquez, BAO.
      2010. Best Web Site Structure for Users Based on a Genetic Algorithm Approach, E. Andaur, S. Rios, P.E. Román and J.D. Velásquez, BAO.
      2010. Artificial Web User Simulation and Web Usage Mining, P.E. Román and J.D. Velásquez, BAO.
      2010. Time Course of the Web User, P.E. Román and J.D. Velásquez, TUO2.
    Publications: National review
      2009. Un método de optimización lineal entera para el análisis de sesiones de usuarios web, Revista de Ingenieria de Sistemas, Vol. 23.
  • 75. Thank you for your attention.

Editor's Notes

  • #2: Take 1 minute to explain what web user analysis is.
  • #23: Sessions in no more than 6 slides.