Defensa.V11

Web User Behavior Analysis. Doctorate in Engineering Systems (Doctorado en Sistemas de Ingeniería), Universidad de Chile. Pablo E. Román [email_address]. Advisor: Juan D. Velásquez.

Outline: Motivation, Hypothesis, Achievement. The problems & solutions: Pre-processing, Simulation, Calibration. Conclusions & Future Work.

Motivation, Hypothesis, Achievement

The most famous web companies analyze web user browsing behavior. 2009 net profit: Google, 6,520 million US$; Amazon, 902 million US$; Netflix, 116 million US$. (Codelco net profit: 1,262 million.) Adaptive Web Sites.

Why do we study web user browsing behavior? A web user needs information that is fast to reach and complete. To enhance a web site, administrators/owners can only modify: web page contents; web site links. Hopefully, the modifications appeal to the target group of users!

The main problem: there are only heuristics for analyzing web user browsing behavior in order to enhance the contents and structure of a web site. We think we can do better…

Research hypothesis It is possible to apply neurophysiology’s decision making theories to explain web user navigational behavior by using web data.

The Thesis Proposal. Web Intelligence: A.I. in the Web (Web Mining, Knowledge Representation); Advanced Inf. Tech. in the Web (Agents, Ubiquitous Sys., Wireless Sys., Grid & Cloud Sys., Social Networks). Web Mining: Web Structure Mining, Web Content Mining, Web Usage Mining. Web user neurocomputing: a neurophysiology model for the analysis of behavior, discovering patterns of web user navigational behavior from the set of users' trails.

Web user neurocomputing in brief: we use a brain model of decision making to study how people browse a web site. It is based on neurophysiology first principles.

Machine learning vs. first principle models. Traditional web mining uses Machine Learning (ML): generic algorithms that can find, or be trained to reproduce, regularities in data. First principle models (FPP), the proposed solution: e.g. Newton's laws. Can we use ML or FPP to build the trajectories of the Apollo mission? One million dollar Netflix contest: achieve a 10% improvement in the accuracy of customer movie preference prediction. 4 years without a winner!!! If the conditions of the problem change, then ML systems must be recalibrated.

Thesis dissertation: main contributions. (1) Novel mechanism for web session extraction from web logs based on Integer Programming: 2008, WI-IAT Int. Conf., R. Dell, P. Román, J. Velásquez, using a linear objective function; 2010, submitted to IDA Journal, P. Román, R. Dell, J. Velásquez, using a network model. (2) Application of a psychology model for describing web user navigation: 2009, AWIC Int. Conf., P. Román, J. Velásquez, simulation of decision-making neurophysiology. (3) Calibration and simulation of a psychology-based stochastic model: 2009, AAAI BICA Symposium, P. Román, J. Velásquez; 2010, WI-IAT Int. Conf., P. Román, J. Velásquez.

Web data: content (text, objects, ...): you can put anything you want on a web page, from family to business info. Hyperlink structure.

Proposal: data sources. Neurophysiology commonly uses data obtained from subjects wired with neural sensors or from psychological tests (surveys). I use web data to study human behavior on the web.

Problem: web data pre-processing. Inputs: hyperlink graph, web page content, web user sessions (sequences of pages). Web logs do not directly capture sessions. How to reconstruct sessions? SESSIONIZATION: the process of obtaining sessions. If invasive methods are used (cookies, spyware, tracking applications), privacy rights are violated (forbidden by law in several countries).

Traditional approaches to sessionization. Proactive: direct tracking of the web user. The most exact, but raises privacy issues. Reactive: heuristic reconstruction of the web user's page sequence. Only an approximation (40% noise). Uses anonymous activity data sources such as web logs. Models of behavior are sensitive to noise in data.

Traditional heuristic for sessionization. How to identify individual web users? Filtering: IP + browser (agent). Timeout of 30 minutes. Path completion: shortest path backward.
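The traditional heuristic can be sketched in a few lines. This is only an illustration, not the code used in the thesis: the record layout (timestamp, IP, agent, URL) is hypothetical, the 30-minute timeout is interpreted here as a maximum gap between consecutive requests, and the path-completion step is left out.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # 30-minute timeout, in seconds (interpreted as a max gap)


def sessionize(records, timeout=TIMEOUT):
    """Group log records into sessions by (ip, agent) and a time gap.

    records: iterable of (timestamp_seconds, ip, agent, url) tuples
             (a hypothetical layout chosen for this sketch).
    Returns a list of sessions, each a list of urls in time order.
    """
    by_user = defaultdict(list)
    for ts, ip, agent, url in records:
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for key in sorted(by_user):
        hits = sorted(by_user[key])
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # Gap too long: close the current session, start a new one.
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions
```

Two users sharing one IP and agent end up in the same session here, which is one source of the noise attributed to reactive methods.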

Sessionization: the proposal. Incorporate all restrictions into a combinatorial optimization problem. Two formulations: maximization of a linear reward, and a network flow model.

Integer Programming for sessionization (WI-IAT08 R. Dell, P. Román, J. Velásquez). $X_{ros}$: 1 if log register $r$ is assigned as the $o$-th request during session $s$, and zero otherwise. It is a labeling problem! (Figure: log registers assigned to sessions.)

Integer Program: maximize the number of sessions. (WI-IAT08, KES09 P. Román et al.) Constraints: each register is used once; one register at each position o of a session; structure and time.
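The equations of this slide did not survive extraction. As a hedged sketch only, an integer program consistent with the three constraint labels could be written as follows. The reward coefficients $C_{ro}$ and the set $B_r$ (registers that may immediately precede $r$: same IP and agent, a hyperlink between the two pages, and a compatible time difference) are notation assumed for this illustration and may differ from the cited papers.

```latex
\begin{aligned}
\max \quad & \sum_{r,o,s} C_{ro}\, X_{ros} \\
\text{s.t.} \quad
  & \sum_{o,s} X_{ros} = 1 \quad \forall r
    \quad \text{(register used once)} \\
  & \sum_{r} X_{ros} \le 1 \quad \forall o, s
    \quad \text{(one register on position } o\text{)} \\
  & X_{r,o+1,s} \le \sum_{r' \in B_r} X_{r',o,s} \quad \forall r, o, s
    \quad \text{(structure and time)} \\
  & X_{ros} \in \{0, 1\}
\end{aligned}
```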

Network model: minimize the number of sessions. (IDA10 P. Román et al.) Each node is a register (duplicated as r and r'); each edge indicates register precedence; the flow equals the number of sessions, and one unit of flow is one session. (Figure: flow network from source to sink over registers 1 to 4 and their duplicates 1' to 4', arcs labelled (1,1); example with Z = 3.)
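One way to see why a network model can minimize the number of sessions: if each register may be followed by at most one register and preceded by at most one, sessions are vertex-disjoint chains in the precedence graph, and the minimum number of chains equals the number of registers minus a maximum bipartite matching between registers and their duplicates. The sketch below uses that standard identity with a simple augmenting-path matching. It is a stand-in for illustration, not the formulation of the cited paper, and `can_follow` is a hypothetical input.

```python
def min_sessions(n, can_follow):
    """Minimum number of vertex-disjoint chains covering registers 0..n-1.

    can_follow: dict mapping register r to the registers that may come
                immediately after r (assumed acyclic, e.g. ordered by time).
    Uses: minimum path cover = n - maximum bipartite matching.
    """
    match_of = [-1] * n  # match_of[b] = register a whose successor is b

    def try_assign(a, seen):
        # Try to give register a a successor, re-routing earlier choices.
        for b in can_follow.get(a, ()):
            if b in seen:
                continue
            seen.add(b)
            if match_of[b] == -1 or try_assign(match_of[b], seen):
                match_of[b] = a
                return True
        return False

    matched = sum(try_assign(a, set()) for a in range(n))
    return n - matched
```

For example, with four registers where 0 can be followed by 1 or 2, and both 1 and 2 by 3, two sessions are needed, since registers 1 and 2 cannot both precede 3.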

Experiment: large scale (15 months). DII departmental web site: ~4000 pages, ~17000 links, ~15000 visits per month. Simple: precise information. Content mainly based on text. Objective: academics, study programs, projects, … http://www.dii.uchile.cl/

A large scale experiment evaluation: F-score over cookie-retrieved sessions. (IDA10 P. Román et al.) 0 < F < 1; higher F is better. (Figure: F-score of traditional sessionization compared with both proposals.)

A large scale experiment evaluation: F-score over cookie-retrieved sessions. Compared with 15 months of cookie retrieval.
Method                                      Precision  Recall  F-Score  Time
Sessionization Integer Programming (SIP)    0.7788     0.6696  0.7201   6 hours
Network Flow (BCM)                          0.7777     0.6671  0.7182   4 min
Canonical Sessionization                    0.5091     0.6996  0.5993   1 min
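For reference, a minimal sketch of the metric, assuming the reported F-Score is the usual harmonic mean of precision and recall (the slide does not state the exact definition):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under that assumption, `f_score(0.7788, 0.6696)` rounds to the 0.7201 reported for the integer programming method.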

Summary: pre-processing. It is possible to ensure data quality using optimization, even in the worst scenario when only web logs are available. Main achievement: F = 0.72 in an acceptable processing time of 4 min/month. Ready for neurocomputing!

Strong regularities: distribution of sessions (WI-IAT08 P. Román et al.). An empirical power law for session size has been observed in the literature [Huberman et al. 1998, Science]: the Web Surfer Law. The correlation coefficient and standard error of the fit to a power law give a sense of the quality of the sessions. Our correlation coefficient is 0.94 and our standard error is 0.3817. A common heuristic has a correlation coefficient of 0.91 and a standard error of 0.64.
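A power-law fit of this kind is commonly done by linear regression on log-log data. The sketch below is one such procedure, not necessarily the one used for the reported numbers. In particular, the "standard error" is taken here as the residual standard error of the regression, which is an assumption, and the correlation comes out negative for a decaying law, so a figure such as 0.94 would correspond to its absolute value.

```python
import math


def fit_power_law(sizes, counts):
    """Least-squares fit of log(count) = intercept + slope * log(size).

    Returns (slope, intercept, r, std_err): r is the Pearson correlation
    of the log-log data, std_err the residual standard error.
    Requires at least three points with positive sizes and counts.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(resid / (n - 2))
    return slope, intercept, r, std_err
```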

Regularity -> presence of an internal rule (2008, CLAIO P. Román et al.). Law of surfing. Machine learning algorithms have been applied to capture such regularities. Today, new directions based on brain informatics are used to explain navigation. What we need is a theory that explains such regularities!

Proposal: adapt psychological theory to web navigation, using web data. Human behavior on the web is the result of the brain's neural network processing. Requires historical data of individuals' trajectories on a web site. It is difficult to calculate or predict the activity of $10^{11}$ neurons and $10^{14}$ interconnections. Diffusion process -> average at the mesoscopic level. This is the point of view of this thesis.

Biological experiment (1970-2005). Rhesus monkey with sensors placed on the lateral intraparietal (LIP) cortex (2002-2008). Screen with moving dots; the decision is to select the correct direction of motion. Monkeys are trained to receive a reward if they answer correctly. The possible options map onto the LIP cortex, and the point with the highest neural activity corresponds to the decision of the subject.

Neurophysiology of decision making: first principles. First hitting time -> time to decide. First hitting coordinate -> the choice. (Figure: two options with neural activities $X_1$ and $X_2$ starting at 0; it decides option 1.)

LCA Model (Leaky Competing Accumulator) [M. Usher et al., 2001]. X > 0 -> biological condition: neural activity is positive. I is considered exogenous and constant. The other parameters (κ, λ, σ) in the model are positive. In the stochastic equation, $I_j$ is the likelihood of making choice j. It drives the decision! It results from processing in other areas (e.g. visual cortex). Important parameter!!
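The stochastic equation itself is not recoverable from the slide text. For orientation, the LCA model is usually written in the following form. Pairing κ with the leak (dissipation) term and λ with lateral inhibition, and leaving the inhibition function $f$ generic (the identity in the simplest linear case), are assumptions made here and not taken from the slide.

```latex
dX_i = \Big[\, I_i - \kappa X_i - \lambda \sum_{j \neq i} f(X_j) \Big]\, dt
       + \sigma \, dW_i , \qquad X_i \ge 0 ,
```

where $W_i$ are independent Wiener processes, and the decision is taken when some $X_i$ first reaches the decision threshold.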

Application: the browsing process. Arrivals (first page) are exogenous to this model. Based on historic sessions, the model predicts the probability of following a link. Web users are information seekers and respond according to text.

Modeling the likelihood of choosing each option (vector I). $I_j$ is considered a probability of choosing option j. Discrete choice theory. Text must be represented as numeric entities -> bag-of-words model with TF-IDF (~ vector of word frequencies).
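As an illustration of the bag-of-words step, here is a minimal TF-IDF sketch. The weighting variant (term frequency normalized by document length, times the log of the inverse document frequency) is one common choice and is assumed here; the slide does not say which variant was used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns (vocabulary, list of vectors).

    Each vector has one TF-IDF weight per vocabulary word, in sorted
    vocabulary order.
    """
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([
            (tf[w] / len(d)) * math.log(n / df[w]) if d else 0.0
            for w in vocab
        ])
    return vocab, vectors
```

A word that appears in every page gets weight zero, so only distinguishing words contribute to the similarity between texts.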

Likelihood of a decision and web user utility. Random Utility Model (economics): individuals decide among discrete options {j} with utility $V_j$, choosing j with probability $P_j$. The likelihood of taking decision j should be proportional to $P_j$. Web user objectives are modeled as a text vector µ. Web users are information (TEXT) seekers. Similarity between texts is measured as the cosine between both vectors.
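A small sketch of how link probabilities could be derived from text similarity. The multinomial logit form and the scale parameter `beta` are assumptions of this example (a common choice in discrete choice theory); the slide only states that the utility depends on the cosine similarity between the user's text vector µ and the text behind each link.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def choice_probabilities(mu, link_texts, beta=1.0):
    """Logit probabilities P_j with utility V_j = beta * cosine(mu, text_j).

    mu: text vector of the user's objective.
    link_texts: one text vector per available link.
    beta: hypothetical scale parameter of the logit model.
    """
    utilities = [beta * cosine(mu, t) for t in link_texts]
    top = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in utilities]
    total = sum(weights)
    return [w / total for w in weights]
```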

Assumptions & approximations. Web browsing is characterized only by jumps. Independence of available choices. Utility depends only on text. Independence from the previously visited trail. No information satiety. Rational web user. Correctness of web site information. Web pages with little content. Web pages with simple content. Web user information processing time is negligible.

Adaptation of the LCA model (WI-IAT10, P. Román et al.). It is a Langevin equation: a force interpretation of the stochastic evolution of neural activity. This opens the way for improving the dynamic system by adding forces. Terms: evidence, inhibition, dissipation, noise.

The Fokker-Planck equation: probability density of not reaching a decision (AWIC09 P. Román et al.). Conditions on the probability density: never reach a decision in t' < t; neural activity is positive; neural activity is initially near 0.
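The equation on this slide was lost in extraction. As a hedged reconstruction, the Fokker-Planck equation associated with a Langevin equation with drift $F_i(x)$ and noise amplitude $\sigma$ has the standard form below. The specific LCA drift, the threshold normalized to 1, and the way the three stated conditions are written are assumptions of this sketch.

```latex
\frac{\partial \phi}{\partial t}
  = \sum_i \frac{\partial}{\partial x_i}
    \Big[ -F_i(x)\, \phi + \frac{\sigma^2}{2} \frac{\partial \phi}{\partial x_i} \Big],
\qquad
F_i(x) = I_i - \kappa x_i - \lambda \sum_{j \neq i} f(x_j)

\phi(x, 0) = \delta(x)
  \quad \text{(activity initially near 0)}

F_i\, \phi - \frac{\sigma^2}{2} \frac{\partial \phi}{\partial x_i} = 0
  \ \text{ at } x_i = 0
  \quad \text{(reflection: activity stays positive)}

\phi = 0 \ \text{ at } x_i = 1
  \quad \text{(absorption: no decision reached before } t\text{)}
```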

The probability of reaching a decision in time t. The probability of deciding option “j” in time “t”

Unconstrained exact solution: Hermite polynomials. Exact solution (Ornstein-Uhlenbeck).

Evolution of the exact unconstrained solution (Ornstein-Uhlenbeck). Nearly a delta at t = 0, X = 0. Large-time solution -> 0. No border condition, but at t = 0 the values of the delta on the border are nearly 0.

This approach is threefold. The stochastic equation allows simulations for finding probabilities given a web site, but parameters need to be calibrated (approximation: constant for all users). Calibration of the model is performed by maximum likelihood, but it requires web data (a session set) and an approximation of the density φ. Sessions need to be obtained with higher accuracy.

Simulation: Monte Carlo simulation. Euler approximation. Exact simulation.

Simulation algorithm: deciding which link to follow.
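A minimal sketch of an Euler (Euler-Maruyama) simulation of one link decision. It assumes an LCA-type equation with linear inhibition, dX_i = (I_i - κ X_i - λ Σ_{j≠i} X_j) dt + σ dW_i, a threshold normalized to 1, and reflection at 0 implemented by clipping. These modelling details, and all names and default values, are illustrative assumptions and not the thesis code.

```python
import math
import random


def simulate_decision(evidence, kappa, lam, sigma, dt=0.01, threshold=1.0,
                      max_steps=100_000, rng=None):
    """Run the accumulators until one of them hits the threshold.

    evidence: list of I_j values, one per available link.
    Returns (chosen_index, decision_time), or (None, elapsed_time) if no
    accumulator reaches the threshold within max_steps.
    """
    rng = rng or random.Random()
    n = len(evidence)
    x = [0.0] * n  # neural activity starts at 0
    sqrt_dt = math.sqrt(dt)
    for step in range(1, max_steps + 1):
        total = sum(x)
        new_x = []
        for i in range(n):
            # Evidence minus dissipation minus inhibition from the others.
            drift = evidence[i] - kappa * x[i] - lam * (total - x[i])
            xi = x[i] + drift * dt + sigma * sqrt_dt * rng.gauss(0.0, 1.0)
            new_x.append(max(xi, 0.0))  # activity stays non-negative
        x = new_x
        winner = max(range(n), key=x.__getitem__)
        if x[winner] >= threshold:
            return winner, step * dt  # first hitting coordinate and time
    return None, max_steps * dt
```

Repeating the call many times for the links of a page gives a Monte Carlo estimate of the probability of following each link and of the time spent before the jump.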

Results: simulated session length distribution (BICA08, AWIC09, BAO10). Empirical result: the session length [1] distribution follows a power law [4,5]. A kind of average web user: µ contains all text in the web site. Sessions with L > 20 diverge: users that perform more elaborate processing? Sessions with L = 1: users that have other text interests?

Results: Number of visits per page. Fuzzy, but averages remain similar.

Adjustment of the distribution of time used per session. Simulated sessions follow the same power law as the real case. The shift in time changes the time scale, which is used for adjusting the white noise variance. The slope represents more structural behavior, and is intended for adjusting the other scalar parameters.

In summary: with only an estimate of the parameters, simulation shows results that are close to reality. Calibrating the model should produce better simulations.

Calibration (WI-IAT 2010, P. Román et al.). Parameters: should correspond to properties of neural tissue. Approximation: constant for all users. Parameter: the evidence vector I. It corresponds to the intention of the web user. It is distributed. The density must be approximated!!!

Parameter inference. SESSION DATA: (i, j): hyperlink from i to j. k: indexes the observed times. $n_{ijk}$: the number of observed transitions. $t_{ijk}$: the observed time used on this observation. Maximum log-likelihood: approximate P by a linear combination of unconstrained exact solutions. The approximated probability function must satisfy the restrictions of the LCA model.
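The log-likelihood formula is missing from the extracted slide. Given the session data defined above, a natural form is sketched below. The symbol $\theta$ for the set of model parameters and the notation $P_{ij}(t;\theta)$ for the probability density of following link $(i,j)$ after time $t$ are assumptions of this sketch.

```latex
\hat{\theta} = \arg\max_{\theta} \; \mathcal{L}(\theta),
\qquad
\mathcal{L}(\theta) = \sum_{(i,j)} \sum_{k} n_{ijk}\, \log P_{ij}(t_{ijk};\, \theta)
```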

Curse of dimensionality. Many numerical methods for solving differential equations require a partition of the space. Discretization involves partitioning each coordinate into 100 points. A typical number of links on a page is 20. Then the total number of discretization points is about $100^{20} = 10^{40}$ -> unmanageable.

Distribution of number of links per page.

Proposal (1): use symbolic processing. Explicit expressions are not manageable by hand. Operations involved: integration, differentiation, product, … Φ is based on polynomials -> instead of evaluating at each step, it is better to perform symbolic manipulation until evaluation is needed. -> A grid is not necessary for intermediate steps. (Figure: expression tree of a symbolic expression.)
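To illustrate the idea of manipulating polynomials exactly and evaluating only at the end, here is a toy sketch with univariate polynomials stored as coefficient lists. It is a simplified stand-in: the density Φ of the thesis is multivariate, and the representation and function names below are hypothetical.

```python
from fractions import Fraction

# A polynomial is a list of coefficients: [c0, c1, c2] is c0 + c1*x + c2*x**2.


def derivative(p):
    """Coefficient list of dp/dx."""
    return [k * c for k, c in enumerate(p)][1:] or [Fraction(0)]


def product(p, q):
    """Coefficient list of p*q."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def evaluate(p, x):
    """Value of p at x, using Horner's rule."""
    result = Fraction(0)
    for c in reversed(p):
        result = result * x + c
    return result


def definite_integral(p, lo, hi):
    """Exact integral of p from lo to hi via the antiderivative."""
    anti = [Fraction(0)] + [Fraction(c) / (k + 1) for k, c in enumerate(p)]
    return evaluate(anti, Fraction(hi)) - evaluate(anti, Fraction(lo))
```

All intermediate results stay exact, and no grid over x is needed until `evaluate` or `definite_integral` is called.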

Proposal (2): use the time propagator of the Cauchy problem. The initial condition is concentrated on 0. But L must ensure the border condition!!!!

Proposal (3): penalization method for ensuring the border condition. A force $F_P$ on the boundary is added to ensure reflection and absorption: $F_P(x) = (1-x)^{2n} + x^{2n}$.

Approximating the probability distribution Φ. The unconstrained case involves a polynomial solution. The propagator takes Φ from a small t to a t'. The propagator involves only derivatives. Symbolic processing could be used to build solutions for the required time t'. The probability P is built from a derivative of a definite integral, which is easily calculated by symbolic processing. A solution for the dimensionality problem!!!

Experiment: DII departmental web site. ~4000 pages ~17000 links ~15000 visits per month Simple: concise and precise information. Content mainly based on text. Objective: Academics, Study programs, Projects, …

Calibration of parameters. Neurophysiology: λ = 0.4, κ = 0.2, σ = 0.03. Text vector preference: 1 vector: most ranked words (MBA, syllabus, project). A distribution of Gaussian vectors: 3 main clusters related to study programs, academics, economics.

Simulation in the DII site. Average error of only 5% in the distribution of session size. Precision: 0.8, recall: 0.74 by number of specific sessions.

Comparing with the ML approach. ML algorithm based on clustering sessions with text measures. Simulation approximates 70% of reality. ML reaches 60% [J. Borges, 2007, IEEE Trans.] [Ghorbani, 2007, WI-IAT] [J. Velásquez et al., 2007, International Journal of Artificial Intelligence Tools] [J. Velásquez et al., 2007, Journal of Knowledge-Based Systems (Elsevier)].

Situation after 1 month: stability of the calibration. 5% of links were modified. 2% of pages are new or deleted. 30% of words in documents have changed. Simulation reaches an F-score of 0.7.

In summary: in spite of changes in the web site configuration (after 1 month), simulation returns a similar session distribution. The complexity of the calibration process is reduced by symbolic calculation. The density of session length is matched at 95%, but the distribution of sessions is matched at 70%.

Conclusion (1). In spite of the anonymous character of a web log, it is possible to extract sessions in good agreement with an empirical statistical law (~70% F-score). Quick pre-processing can be obtained with the use of network models. Further exploration using combinatorial models leads us to retrieve other likely values.

Conclusions (2). Simulation shows that web users behave like text information seekers. Simulation of a web user is a straightforward algorithm if parameters are known. The distribution of web user sessions is obtained with notable precision. Calibration is notably difficult due to the dimensionality. A method based on symbolic manipulation and semi-group propagation was proposed for density estimation.

Conclusion (3). The model is robust to changes in the web site, maintaining 70% accuracy in predicting the distribution of sessions. By comparison, traditional data mining methods reach only 60-70%, and only for one-step prediction.

Future work. Web personalization. Simulation is cheap and parallelizable once it is trained (expansion coefficients are fitted). Small changes (same semantics) in the web site (hyperlinks and structure) produce changes in web user trails on the web site. Simulation predicts web usage, assuming that the same users with the same fitted behaviour will visit the web site. Iterating on changes and simulation could find better changes given a measure of quality.

Publications: book chapters. 2010. Web Usage Mining, P. Román, G. L'Huillier, J. Velásquez, in Advanced Techniques in Web Intelligence - 1, J. Velásquez, L. Jain, Springer. 2010. Advanced Techniques in Web Data Pre-Processing and Cleaning, P. Román, R. F. Dell, J. Velásquez, in Advanced Techniques in Web Intelligence - 1, J. Velásquez, L. Jain, Springer. Publications: international journals. 2010. Optimization Models for Sessionization, submitted to Journal of Intelligent Data Analysis. 2011. Simulation of web user navigation, in preparation.

International conferences. 2006. Improving a Web Site using Keywords, P. Román, J. Velásquez, CLAIO XIII, Int. Conf., Uruguay. 2008. Markov Chain for modeling the Web User Behavior, P. Román, J. Velásquez, INFORMS, CLAIO XIV, Int. Conf., Colombia. 2008. Identifying Web User Session using an Integer Programming Approach, R. Dell, P. Román, J. Velásquez, CLAIO XIV, Int. Conf., Colombia. 2008. Web User Session Reconstruction Using Integer Programming, R. Dell, P. Román, J. Velásquez, IEEE/ACM, WI-IAT Int. Conf., Australia. 2009. A Dynamic Stochastic Model Applied to the Analysis of the Web User Behavior, P. Román, J. Velásquez, IEEE, AWIC Int. Conf., Czech Republic. 2009. Fast Combinatorial Algorithm for Web User Session Reconstruction, R. Dell, P. Román, J. Velásquez, the 24th IFIP TC7 Int. Conf., Argentina. 2009. Analysis of the Web User Behavior with a Psychologically-Based Diffusion Model, P. Román, J. Velásquez, AAAI BICA Int. Conf., USA. 2009. Web User Session Reconstruction with Back Button Browsing, P. Román, R. Dell, J. Velásquez, IEEE LNAI 5711, KES Int. Conf., Chile. 2010. Stochastic Simulation of Web Users, P. Román, J. Velásquez, IEEE/ACM, WI-IAT Int. Conf., Canada.

Publications: national conferences. 2010. Ant Colony Surfer: Discovering the Distribution of Text Preferences from Web Usage, P. Loyola, P.E. Román and J.D. Velásquez, BAO. 2010. Best Web Site Structure for Users Based on a Genetic Algorithm Approach, E. Andaur, S. Rios, P.E. Román and J.D. Velásquez, BAO. 2010. Artificial Web User Simulation and Web Usage Mining, P.E. Román and J.D. Velásquez, BAO. 2010. Time Course of the Web User, P.E. Román and J.D. Velásquez, TUO2. Publications: national review. 2009. Un método de optimización lineal entera para el análisis de sesiones de usuarios web (An integer linear optimization method for the analysis of web user sessions), Revista de Ingeniería de Sistemas, Vol. 23.

Thank you for your attention.
