arXiv:math/0609173v1 [math.ST] 6 Sep 2006
Statistical Science
2006, Vol. 21, No. 2, 155–166
DOI: 10.1214/088342306000000132
© Institute of Mathematical Statistics, 2006
Functional Data Analysis in Electronic
Commerce Research
Wolfgang Jank and Galit Shmueli
Abstract. This paper describes opportunities and challenges of using
functional data analysis (FDA) for the exploration and analysis of data
originating from electronic commerce (eCommerce). We discuss the
special data structures that arise in the online environment and why
FDA is a natural approach for representing and analyzing such data.
The paper reviews several FDA methods and motivates their useful-
ness in eCommerce research by providing a glimpse into new domain
insights that they allow. We argue that the wedding of eCommerce
with FDA leads to innovations both in statistical methodology, due to
the challenges and complications that arise in eCommerce data, and in
online research, by being able to ask (and subsequently answer) new
research questions that classical statistical methods are not able to ad-
dress, and also by expanding on research questions beyond the ones
traditionally asked in the offline environment. We describe several ap-
plications originating from online transactions which are new to the
statistics literature, and point out statistical challenges accompanied
by some solutions. We also discuss some promising future directions
for joint research efforts between researchers in eCommerce and statis-
tics.
Key words and phrases: Process dynamics, special data structures,
online auctions.
Wolfgang Jank is Assistant Professor of Management
Science and Statistics, Department of Decision and
Information Technologies, Robert H. Smith School of
Business, University of Maryland, College Park,
Maryland 20742, USA e-mail: wjank@rhsmith.umd.edu.
Galit Shmueli is Assistant Professor of Management
Science and Statistics, Department of Decision and
Information Technologies, Robert H. Smith School of
Business, University of Maryland, College Park,
Maryland 20742, USA e-mail:
gshmueli@rhsmith.umd.edu.
This is an electronic reprint of the original article
published by the Institute of Mathematical Statistics in
Statistical Science, 2006, Vol. 21, No. 2, 155–166. This
reprint differs from the original in pagination and
typographic detail.
1. INTRODUCTION
Functional data analysis (FDA) has been gaining
momentum in many fields. While many of the
methodological advances have been made within the
statistics literature, FDA has found many useful ap-
plications in the agricultural sciences (Ogden et al.,
2002), the behavioral sciences (Rossi, Wang and Ram-
say, 2002), in medical research (Pfeiffer et al., 2002)
and many more. One reason for this momentum is
the technological advancement in computer storage
and computing power. Today’s researchers gather
more and more data, often automatically, and store
them in large databases. However, these new capa-
bilities for data generation and data storage have
also led to new data structures which do not nec-
essarily fit into the classical statistical concept. Re-
searchers measure characteristics of customers over
time, store digitalized two- or three-dimensional im-
ages of the brain, and record three- or even four-
dimensional movements of objects through space and
time. Many of these new data structures call for new
statistical methods in order to unveil the informa-
tion that they carry. Data can contain trends that
vary in longitudinal or spatial aspects, that vary
across different groups of customers or objects, or
that show different magnitudes of dynamics.
FDA is a tool-set that, although based on the
ideas of classical statistics, differs from it (and, in a
sense, generalizes it), especially with respect to the
type of data structures that it encompasses. While
the underlying ideas for FDA have been around for a
long time, the surge in associated research can be
attributed to the monograph of Ramsay and Silver-
man (1997). In FDA, the object of interest is a set of
curves, shapes, objects, or, more generally, a set of
functional observations. This is in contrast to clas-
sical statistics where the interest centers around a
set of data vectors. In recent years, a range of clas-
sical statistical methods have been generalized to
the functional framework; James, Hastie and Sugar
(2000) developed a principal components approach
for a set of sparsely sampled curves. Other exploratory
tools include curve clustering (see Abraham et al.,
2003; James and Sugar, 2003; Tarpey and Kinateder,
2003) and curve classification (see Hall, Poskitt and
Presnell, 2001; James and Hastie, 2001). Classical
linear models have also been generalized to func-
tional ANOVA (Fan and Lin, 1998; Guo, 2002),
functional regression (Faraway, 1997; Cuevas, Febrero
and Fraiman, 2002; Ratcliffe, Leader and Heller, 2002)
and the functional generalized linear model (Rat-
cliffe, Heller and Leader, 2002; James, 2002). More-
over, Ramsay (2000b) and Ramsay and Ramsay (2002)
suggest differential equations for data of functional
form. While this list is far from complete, it shows
some of the current methodological efforts in this
emerging field.
Electronic commerce (eCommerce) is a growing
field of scholarly research especially in information
systems, economics and marketing, but it has re-
ceived little to no attention in statistics. This is sur-
prising because it arrives with an enormous amount
of data and data-related questions and problems.
Like other web-based data, eCommerce data tend
to be very rich, clean and structurally different from
offline data. eCommerce research arrives with many
new data- and model-related challenges that promise
new ideas and motivation for further methodologi-
cal advancements of FDA. One of the main char-
acteristics of eCommerce data is the combination
of longitudinal information (time-series data) with
cross-sectional information (attribute data). A sample
of n records typically comprises n time series,
each linked with a set of attributes. Take eBay’s
online auctions as an example. There, each auction
is characterized by a time series of bids placed over
time. This information is coupled with additional
auction attributes such as a seller’s rating, the auc-
tion duration and the currency used. Another ex-
ample is online product ratings on Amazon.com or
movie ratings on Yahoo! Movies (see Dellarocas and
Narayan, 2006). Yahoo! Movies allows users to rate
any movie according to different measures. This re-
sults in a time series that describes the average daily
rating or the number of daily postings (or both)
from the date of the movie release until the time
of data collection. This information is coupled with
attribute data about the movie such as the movie
genre and critics’ rating. A third example, described
in Stewart, Darcy and Daniel (2006), is the evolution
of open-source software projects that is monitored
by websites such as SourceForge.net. Here an obser-
vation is a certain project, and it is characterized
by a time series that describes project complexity
from its first release until the time of data collection.
Each project also has associated attributes such as
the number of developers, the operating system used
and the programming language.
The combination of longitudinal and cross-sectional
information is only one typical aspect of eCommerce
data. Another aspect is the uneven spacing between
events. In many cases, the observed time series is
composed of events influenced by multiple users or
agents who access the web at different points in time
(and from different geographical locations). Conse-
quently, the resulting times when new events arrive
are extremely unevenly spaced. This is in contrast to
traditional time series, which are typically recorded
at predefined and equidistant time-points, such as
daily, monthly or quarterly scales. Furthermore, be-
cause of psychological, economic or other reasons,
eCommerce time series tend to feature very sparse
areas at some times, followed by extremely dense
areas at other times. For instance, bidding in eBay
auctions tends to be concentrated at the end, re-
sulting in very sparse bid-arrivals during most of
the auction except for its final moments, where the
bidding volume can be extremely high.
eCommerce not only creates new data challenges,
it also motivates the need for innovative models.
While the field of economics has created many the-
ories for understanding economic behavior at the
individual and market level, many of these theories
were developed before the emergence of the World
Wide Web. The existence of the web now allows re-
searchers, for the first time, to observe and record
data about economic behavior on a large-scale ba-
sis. As it turns out, however, observed data often do
not support classical economic theories. As a result,
empirical research is thriving. In fact, the empirical
literature has continuously shown that online be-
havior deviates in many ways from offline behavior
and from what is expected by economic theory. This
calls for new economic models that can be validated
empirically. In addition, the availability of eCom-
merce data allows researchers to ask new types of
questions. One major enhancement is the ability to
study not only the evolution of a process, but also
its dynamics: how fast it moves and how suddenly
it changes, its rate of change and how this rate dif-
fers at different time-points. Studying dynamics of
processes can be very relevant in the online world,
because it allows new approaches for characterizing
eCommerce processes (and thus distinguishing be-
tween diverse processes), and even forecasting them
(Wang, Jank and Shmueli, 2006). Changing dynam-
ics are inherent in a fast-moving environment like
the online world. Fast movements and change im-
ply nonstationarity which poses challenges to tra-
ditional time series modeling. And finally, it is im-
portant to point out that for any one process that
we observe in the online world, there typically exist
many, many replicates of the same (or at least very
similar) process. On eBay, for instance, if we think
of the formation of price between the start and the
end of an auction as a process of interest, then there
exist several million similar processes of that form,
taking place at any given day on eBay. The replica-
tion of processes, or time series, fits naturally within
the FDA framework and makes this an ideal ground
for the advancement of new functional methodology.
Finally, eCommerce typically arrives with huge
databases which can put a computational burden
on users’ storage and processing facilities. This bur-
den is often increased by the complicated structure
of eCommerce data. Taking a functional data ap-
proach, one can relieve some of that burden. FDA
operates on functional objects which can be more
compactly represented than the original data. Tak-
ing a functional approach may therefore be advan-
tageous also from a resource point of view.
The process of studying a set of data via func-
tional methods consists of two principal steps: First,
the functional object is “recovered,” typically by
means of smoothing. There are multiple different
ways in which this smoothing step can be executed,
and there are many challenges during that step. Sec-
ond, the resulting functional object is used for data
exploration and analysis. Exploratory data analysis
(including data visualization and summary) is per-
formed in order to learn about general characteris-
tics as well as unusual features and anomalies in the
data. Analysis includes explanatory and predictive
modeling and inference, just as in classical statis-
tics. In the next sections we focus on the challenges
and problems that arise during these steps within
the eCommerce context. We would like to note that
our point of view of the functional approach and its
application to eCommerce has been forged during
the teaching of so-called Research Interaction Teams
(www.amsc.umd.edu/Courses/RITDescrips/HowAndWhy.html)
which are research classes that involve graduate stu-
dents from the Statistics and the Applied Mathe-
matics and Scientific Computation programs at the
University of Maryland. Several of our studies per-
formed during these classes have led to new method-
ological and practical insights.
2. RECOVERING FUNCTIONAL OBJECTS
The first step in any functional data analysis con-
sists of recovering, from the observed data, the un-
derlying functional object. There exist a variety of
methods for recovering functional objects from a set
of data, all of which are typically based on some kind
of smoothing. As a result of the smoothing, and of
characterizing the smooth object by its smoothing
parameters only, we obtain a low-dimensional func-
tional object. We focus here on objects, and in par-
ticular curves, that are based on unevenly spaced
time series and of which we have multiple replica-
tions. An example is a set of bid histories from eBay
auctions, as shown in Figure 1. The four panels cor-
respond to four separate seven-day auctions for a
new Palm PDA. Each consists of the bids (in $)
placed at different times during the auction.
2.1 Challenges in Choosing the Right Smoother
The first step in recovering the functional object
is to choose a family of basis functions. The choice of
the basis function depends on the nature of the data,
on the level of smoothness that the application war-
rants, on what aspects of the data we want to study,
on the size of the data and on the types of analy-
ses that we plan to perform. For example, to repre-
sent the price path of an online auction for the pur-
pose of, say, studying price dynamics, one could use
monotone smoothing splines (Ramsay, 1998) since
prices in auctions increase monotonically. Besides
maintaining the price monotonicity, this approach
also permits the computation of derivatives which
lend themselves to price dynamics. However, fitting
monotone splines is computationally more intensive
than fitting ordinary polynomial smoothing splines.
In that sense, it may prove impractical to compute
monotone splines for very large databases if time
and memory restrictions exist. In addition, polyno-
mial smoothing splines can be represented as a linear
combination of basis functions. The practical mean-
ing of this is that if we use polynomial smoothing
splines and the intended analysis is based on a lin-
ear operation (such as computing average curves,
fitting functional linear regression models, or per-
forming functional principal components analysis),
we can operate directly on the basis function coef-
ficients without any loss of information. The same
operation using monotone splines would require an
approximation step due to the need to first represent
the continuous curve in a finite-dimensional manner
by evaluating it on a grid. Conversely, if the type
of operation is nonlinear, then one would have to
perform a grid-based computation for either type
of spline and the choice would therefore not matter
from this point of view. Thus, the way we recover the
functional object is strongly influenced by a variety
of different objectives all of which might compete
with one another.
Fig. 1. Scatterplots describing the bid history in each of four
eBay auctions, each lasting seven days.
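The remark about linear operations on basis function coefficients can be illustrated with a minimal sketch. It assumes, purely for illustration, a small monomial basis standing in for a spline basis, and made-up coefficient vectors; the names basis, evaluate and mean_coefficients are hypothetical. The sketch checks that averaging the coefficients gives the same curve as averaging the curves pointwise.

```python
# Sketch: a linear operation (the mean curve) computed directly on basis coefficients.
# The monomial basis is a stand-in for a spline basis; all numbers are made up.

def basis(t):
    """Evaluate the illustrative basis {1, t, t^2, t^3} at time t."""
    return [1.0, t, t * t, t ** 3]

def evaluate(coefs, t):
    """Value at time t of the curve with the given basis coefficients."""
    return sum(c * b for c, b in zip(coefs, basis(t)))

def mean_coefficients(all_coefs):
    """Average the coefficient vectors of several curves, one basis function at a time."""
    n = len(all_coefs)
    return [sum(column) / n for column in zip(*all_coefs)]

# Hypothetical coefficient vectors for three price curves over a seven-day auction.
curves = [
    [1.0, 0.5, 0.00, 0.010],
    [5.0, 0.1, 0.20, 0.000],
    [0.5, 1.0, 0.05, 0.020],
]
mean_curve = mean_coefficients(curves)

# Compare with the pointwise average of the three curves on an evenly spaced grid.
grid = [7.0 * i / 20 for i in range(21)]
max_gap = max(
    abs(evaluate(mean_curve, t) - sum(evaluate(c, t) for c in curves) / len(curves))
    for t in grid
)
print(max_gap)  # zero up to floating-point rounding
```

The same shortcut is not available when the curve is not linear in its coefficients, which is the case the text describes for monotone splines.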
Recovering functional objects often involves more
than deciding on the appropriate type of smoother.
This can include a preprocessing step via interpola-
tion, thereby creating a raw functional (e.g., Ram-
say and Silverman, 2002, page 21). This alleviates
the problem of the unevenly spaced series that are
common in eCommerce. An important aspect in any
functional data analysis is the robustness of analy-
sis results with respect to the choice and level of
smoothing. A general study of this sort was carried
out by our research interaction team, comparing the
effects of smoothing splines versus monotone splines
on the conclusions derived from a functional regres-
sion on the price path in online auctions (the func-
tional object) as a function of explanatory variables
such as the seller rating and the opening bid (both
scalar) and current number of bids (a functional ex-
planatory variable). The study indicates that both
smoothers lead to similar conclusions (Alford and
Urimi, 2004).
Another example of the challenges in choosing an
appropriate smoothing method is the functional rep-
resentation of online movie ratings. By that we mean
the series of user movie ratings on online services
such as Yahoo.com. The volume of user postings is
highly periodic, with heavier activity on weekends
(when people tend to watch movies in the theaters).
Fourier basis functions were found to capture the
cyclical posting patterns better than the alternatives
considered (Wu, 2005).
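As a rough illustration of why a Fourier basis suits weekly posting cycles, the sketch below fits a few sine and cosine terms with a seven-day period to daily posting counts. It assumes one observation per day over a whole number of weeks, in which case least squares reduces to simple projections; the function names and the synthetic counts are hypothetical and are not taken from Wu (2005).

```python
import math

def fit_weekly_fourier(counts, harmonics=2, period=7):
    """Least-squares Fourier coefficients for evenly spaced daily counts.

    Assumes one value per day and a whole number of periods, so the basis
    functions are orthogonal on the grid and each coefficient is a projection.
    """
    n = len(counts)
    if n == 0 or n % period != 0:
        raise ValueError("series length must be a positive multiple of the period")
    if not 1 <= harmonics < period / 2:
        raise ValueError("harmonics must be at least 1 and below period / 2")
    intercept = sum(counts) / n
    cos_coefs, sin_coefs = [], []
    for k in range(1, harmonics + 1):
        w = 2 * math.pi * k / period
        cos_coefs.append(2 / n * sum(y * math.cos(w * t) for t, y in enumerate(counts)))
        sin_coefs.append(2 / n * sum(y * math.sin(w * t) for t, y in enumerate(counts)))
    return intercept, cos_coefs, sin_coefs

def evaluate_fourier(fit, t, period=7):
    """Evaluate the fitted periodic curve at time t (in days)."""
    intercept, cos_coefs, sin_coefs = fit
    value = intercept
    for k, (a, b) in enumerate(zip(cos_coefs, sin_coefs), start=1):
        w = 2 * math.pi * k / period
        value += a * math.cos(w * t) + b * math.sin(w * t)
    return value

# Synthetic posting volume over four weeks with a weekly cycle (made-up numbers).
postings = [
    20 + 8 * math.cos(2 * math.pi * d / 7) + 3 * math.sin(2 * math.pi * d / 7)
    for d in range(28)
]
fit = fit_weekly_fourier(postings)
print(fit)
```

For unevenly spaced or incomplete series the projection shortcut no longer applies, and the coefficients would have to come from a general least-squares fit.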
2.2 Additional Data Challenges
Our different studies using eCommerce data have
raised further challenges in the functional object
recovery stage that have previously not been ad-
dressed in the literature. The first such challenge is
handling the extremely unevenly distributed mea-
surements in eCommerce data. That is, the num-
ber and location of events vary drastically from one
functional object to another. One typical example
is the bid arrival in eBay auctions. Returning to
Figure 1, it can be seen that some bid histories
are very dense at the auction end, while others are
much sparser, and in addition the overall number of
bids per auction can vary widely between none in
some auctions, and more than 100 in others. And
yet, while the varying number of bids per auction
may suggest the use of a varying set of smooth-
ing methods, we prefer the use of a single family
of smoothers. The reason for this is that, in the end,
the choice of the smoother is merely a means to the
end of arriving at a unifying functional object and it
is not the direct object of our interest. Coming back
to the example of online auctions with sometimes
few and sometimes many bids, this motivates the
need for new methodological advances in creating
functional objects that naturally incorporate all ex-
tremes under one hat. Some promising approaches
in that direction can be found in James and Sugar
(2003) and James, Hastie and Sugar (2000).
The extreme structure of eCommerce data is chal-
lenging even for very basic visualization tasks: stan-
dard time-series visualization tools typically require
evenly spaced events! Because of this restrictive re-
quirement, we collaborated with colleagues at the
Human–Computer Interaction Lab at the University
of Maryland to develop new interactive visualiza-
tion tools that can accommodate this special data
structure. We evaluated different approaches for rep-
resenting eBay bid histories by an evenly spaced
equivalent without losing important information with
respect to order, magnitude and distance between
the bids (see Aris et al., 2005). Interestingly, the
final choice was to use a functional approach by
first smoothing the bid history, and then feeding an
evenly spaced grid of the smooth curves (and their
derivatives) into a standard visualization tool.
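The two-stage idea just described (smooth first, then evaluate the curve and its derivatives on an even grid) might be sketched as follows. This is only an illustration under stated assumptions: a Gaussian kernel smoother is swapped in for the spline smoothers discussed earlier, derivatives are approximated by finite differences, and the bid history, bandwidth and function names are hypothetical.

```python
import math

def kernel_smooth(times, values, t, bandwidth):
    """Kernel-weighted average of the bids at time t (Gaussian kernel).

    The bandwidth must be wide enough that at least one weight is nonzero.
    """
    weights = [math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in times]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def curve_on_grid(times, values, start, end, n_points, bandwidth):
    """Evaluate the smoothed curve on an evenly spaced grid from start to end."""
    step = (end - start) / (n_points - 1)
    grid = [start + i * step for i in range(n_points)]
    return grid, [kernel_smooth(times, values, t, bandwidth) for t in grid]

def first_derivative(grid, curve):
    """Finite differences: central inside the grid, one-sided at the two ends."""
    n = len(grid)
    deriv = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        deriv.append((curve[hi] - curve[lo]) / (grid[hi] - grid[lo]))
    return deriv

# Made-up bid history for a seven-day auction: sparse early, dense at the end.
bid_times = [0.3, 1.9, 4.5, 6.2, 6.8, 6.9, 6.95, 6.99]   # days since auction start
bid_values = [10.0, 15.0, 22.0, 40.0, 55.0, 61.0, 70.0, 75.0]  # dollars

grid, smooth = curve_on_grid(bid_times, bid_values, 0.0, 7.0, 15, bandwidth=1.0)
velocity = first_derivative(grid, smooth)        # approximate first derivative
acceleration = first_derivative(grid, velocity)  # approximate second derivative
```

The three evenly spaced sequences can then be passed to any tool that expects regular time series.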
Another challenge typical of eCommerce data is
defining meaningful start and end points of the func-
tional objects in order to align the curves. The prob-
lem of aligning functional objects is related to the
problem of registration (Ramsay and Silverman, 2005,
Chapter 7), but there are several additional compli-
cations here: Many web-based events do not start
and end at the same time. For instance, online prod-
uct ratings over time have different starting points,
depending on when the product was first released
to the market, when the first rating was placed,
etc. They also often have different ending points,
for instance, if one product is prematurely taken off
the market, if it is replaced by a product-upgrade,
and so on. Another issue that complicates object
alignment is that the data collection process itself
may act as a censoring mechanism. Therefore, it is
not obvious how methods such as landmark regis-
tration, where curves are aligned according to one
particular feature of the curve such as its peak, can
be adapted to handle this situation. Finally, select-
ing the units for the time axis can be challenging.
In some applications calendar time (e.g., the date
and time a transaction took place) is reasonable,
whereas in other applications the event index (i.e.,
the order of the event arrival) might make more
sense. And yet in other cases an entirely different
“clock” would be even more suitable. For instance,
we pointed out that eBay auctions typically exhibit
very low bidding activity during most of the auction
and then extremely high activity near the end. For
this reason an auction might be better represented
by a clock that “shrinks” the low-activity period and
“stretches” the high-activity period, thereby putting
more emphasis on the part that matters more. All
these issues are illustrated and discussed further in
the paper by Stewart, Darcy and Daniel (2006).
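One way to make the idea of an alternative “clock” concrete is the minimal sketch below. It assumes, purely for illustration, that the clock is the empirical distribution function of pooled bid times, so that periods with many bids occupy more of the transformed axis. The name activity_clock and the bid times are hypothetical, and this is not presented as the construction used in any of the cited studies.

```python
from bisect import bisect_right

def activity_clock(event_times):
    """Return a function mapping calendar time to the share of events seen so far."""
    ordered = sorted(event_times)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one event time")

    def clock(t):
        # Number of events at or before t, as a fraction of all events.
        return bisect_right(ordered, t) / n

    return clock

# Made-up bid times (in days) pooled from seven-day auctions.
pooled_bid_times = [0.5, 2.0, 4.0, 6.5, 6.8, 6.9, 6.95, 6.97, 6.99, 7.0]
clock = activity_clock(pooled_bid_times)

print(clock(6.0))  # 0.3: the first six days take up 30% of the new axis
print(clock(7.0))  # 1.0: the final day takes up the remaining 70%
```

On this transformed axis the low-activity period is compressed and the final day is stretched, which is the effect described above.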
3. FUNCTIONAL EDA
After the data are represented by functional ob-
jects, the analysis steps follow the same process as
in classical statistics, with the first step being ex-
ploratory data analysis (EDA). EDA includes data
summaries, visualization, dimension reduction, out-
lier detection, and more. The main difference be-
tween FDA and classical statistics is the way in
which the methods are applied and especially how
they are interpreted.
3.1 Static vs. Interactive Visualization
Starting with visualization, our goal is to:
1. Visualize a sample of curves.
2. Inspect summaries of these curves.
3. Explore conditional curves, using various relevant
predictor variables.
To that end, one solution is to create static graphs.
For instance, Figure 2 shows the price evolution in
34 eBay auctions for various magazines. We can see
large heterogeneity across the price formation pro-
cess at different times of the auction. We refer to this
approach as static, since once the graph is generated
it can no longer be modified by the user without run-
ning the software code again. This static approach
is useful for differentiating subsets of curves by at-
tributes (e.g., by using color), or for spotting out-
liers. However, a static approach does not allow for
an interactive exploration of the data. By interac-
tive we mean that the user can perform operations
such as zooming in and out, filtering the data and
obtaining details for the filtered data, and do all
of this from within the graphical interface. Interac-
tive visualizations for the special structure of eCom-
merce data are not straightforward, and solutions
have only been proposed recently (Aris et al., 2005;
Shmueli et al., 2006). One such solution is Auction-
Explorer (www.cs.umd.edu/hcil/timesearcher),
which is tailored to handle the special structure of
online auction data. A snapshot of its user interface
is shown in Figure 3. The interface includes several
panels, which correspond to the price curves (top
left), their dynamics (not shown in this view) and
the corresponding attribute data (top right). The
curves can be filtered to display subsets according to
a selection of attribute values, according to a selec-
tion of curves, and one can do pattern search. Sum-
marization is achieved through on-the-fly summary
statistics for attributes, and a “streaming boxplot”
called a “river plot” of the curves (bottom panel of
Figure 3). This is yet another attempt to general-
ize classical visualization methods to the functional
environment.
3.2 Data Reduction
Another goal of EDA is data reduction. Two of
the methods that are useful in this context are curve
clustering and functional principal components anal-
ysis. Curve clustering partitions the set of curves
into a few clusters, thereby reducing the space of
observations, and attempts to derive insight from
the resulting clusters. The clustering can be applied
to the curves themselves or to their derivatives. Jank
and Shmueli (2005) apply curve clustering to bid his-
tories of eBay auctions and find three main clusters.
Fig. 2. Static plot of the price progression in 34 eBay auctions
for magazines.
Fig. 3. Snapshot of user interface for AuctionExplorer
(www.cs.umd.edu/hcil/timesearcher).
Linking the curve information with attribute information,
they find that the different clusters correspond
to three types of auctions: “greedy sellers,”
“bazaar auctions” and “experienced seller/buyer”
auctions. Each of these types characterizes a different
auction profile, combining static and dynamic
information. For instance, “greedy seller” auctions
have the highest average opening price and the low-
est closing price. Unsurprisingly, they do not attract
much competition, since unjustified high opening
prices tend to deter users from bidding. Low compe-
tition is also known to lead to lower prices. Sellers
in these auctions are, on average, less experienced
than those in other clusters (as can be measured
by their eBay rating), scheduling most auctions to
end on a weekday. These auctions also attract ex-
perienced winners who take advantage of the poorer
auction design and resulting lower prices. The price
dynamics of this cluster reflect this setting: price
starts accelerating late in the auction, not allowing
it to achieve its full impact by the time the auction
closes. This mix of insight into static seller and bid-
der characteristics coupled with the price dynamics
is only available with a functional approach.
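The basic mechanics of curve clustering can be sketched as follows. The sketch assumes each curve has already been evaluated on a common grid and uses plain k-means with fixed starting centers. This is a simplification chosen for illustration and not necessarily the procedure of Jank and Shmueli (2005); the function names and price values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two curves on the same grid."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_curve(curves):
    """Pointwise average of a non-empty list of curves."""
    return [sum(values) / len(curves) for values in zip(*curves)]

def cluster_curves(curves, initial_centers, n_iter=20):
    """Plain k-means on grid evaluations; returns cluster labels and center curves."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(n_iter):
        # Assign each curve to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(curve, centers[k]))
            for curve in curves
        ]
        # Recompute each center; an empty cluster keeps its previous center.
        new_centers = []
        for k in range(len(centers)):
            members = [c for c, label in zip(curves, labels) if label == k]
            new_centers.append(mean_curve(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Made-up price curves on a five-point grid: two rise early, two rise late.
price_curves = [
    [10.0, 30.0, 45.0, 50.0, 52.0],
    [12.0, 28.0, 44.0, 49.0, 53.0],
    [5.0, 6.0, 8.0, 20.0, 50.0],
    [4.0, 7.0, 9.0, 22.0, 48.0],
]
labels, centers = cluster_curves(price_curves, [price_curves[0], price_curves[2]])
print(labels)  # [0, 0, 1, 1]
```

Applying the same function to derivative curves instead of price curves would cluster auctions by their dynamics.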
Another popular method is functional principal
components analysis (f-PCA). The method uses stan-
dard PCA to find principal sources of variability in
curves (or other functional objects). If we consider
curves that represent a process over time, then f-
PCA can help us find “within-curve” (or more gen-
erally, “within-process”) variation, thereby condens-
ing the time axis. This is done by selecting a dis-
crete grid of time-points and treating the points as
the variables in ordinary PCA. A preliminary study
by Hyde, Moore and Hodge (2004) applied f-PCA
to price curves and derivative curves of a sample
of eBay auctions for premium wristwatches. They
found that two or three principal components cap-
tured most of the within-curve variation: One source
is price variation during the middle of the auction
and the other distinguishes the price variability be-
tween the beginning and end of the auction. Similar
results were obtained when using the price-dynamics
curves. Hyde, Moore and Hodge (2004) also used
f-PCA to compare sources of “within-process” vari-
ation across different brands for the same product
category as well as for different product categories.
It was found that price is most uncertain during
mid-auction. As the auction approaches its end, though,
the price becomes more predictable, especially in
common-value auctions (see also Wang, Jank and
Shmueli, 2006).
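The grid-based version of f-PCA described above, in which the grid time-points play the role of variables in ordinary PCA, can be sketched in a few lines. The sketch extracts only the leading component by power iteration and uses synthetic curves that vary most in mid-auction; all names and numbers are hypothetical, and in practice a linear algebra library would normally replace the hand-written routines.

```python
import math

def center_curves(curves):
    """Subtract the pointwise mean curve from every curve."""
    n = len(curves)
    mean = [sum(column) / n for column in zip(*curves)]
    return mean, [[v - m for v, m in zip(curve, mean)] for curve in curves]

def covariance_matrix(centered):
    """Sample covariance between grid points, treated as variables."""
    n, g = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(g)]
        for i in range(g)
    ]

def leading_component(cov, n_iter=500):
    """Power iteration for the largest eigenvalue and its eigenvector."""
    g = len(cov)
    vec = [1.0 / math.sqrt(g)] * g
    for _ in range(n_iter):
        new = [sum(cov[i][j] * vec[j] for j in range(g)) for i in range(g)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break
        vec = [x / norm for x in new]
    eigenvalue = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(g)) for i in range(g)
    )
    return eigenvalue, vec

# Synthetic price curves on a five-point grid that differ only in mid-auction.
baseline = [5.0, 10.0, 20.0, 35.0, 50.0]
bump = [0.0, 1.0, 2.0, 1.0, 0.0]
scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
curves = [[b + s * p for b, p in zip(baseline, bump)] for s in scores]

mean, centered = center_curves(curves)
cov = covariance_matrix(centered)
eigenvalue, component = leading_component(cov)
share = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(component, share)
```

Here the leading component loads on the middle grid points, mirroring a mid-auction source of variation.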
An alternative approach to principal components
analysis of functional data that, to the best of our
knowledge, has not been explored would be to treat
the observations as the dimension to be transformed.
The idea is to find main sources of variation across
curves (instead of within curves), achieving a goal
similar to curve clustering, where main features of
the curves are highlighted. The exact meaning and
interpretation of this variation deserve further at-
tention.
4. FUNCTIONAL MODELING, INFERENCE
AND PREDICTION
There is quite a lot of ongoing research on gener-
alizing classical regression models to the functional
setting. Examples include linear regression with func-
tional predictors (Ratcliffe, Leader and Heller, 2002)
or a functional response (Faraway, 1997), logistic re-
gression (Ratcliffe, Heller and Leader, 2002), func-
tional linear discriminant analysis (James and Hastie,
2001) and general linear models with functional pre-
dictors (James, 2002).
4.1 Information in eCommerce Processes
Current empirical research in eCommerce relies
on the use of very standard statistical tools such as
least-squares regression. These tools are used to in-
vestigate how, say, the closing price in an online auc-
tion relates to other auction-specific information. To
that end, one sets the closing price as the response
variable, and regresses it on potential explanatory
variables such as the opening bid, the auction du-
ration, the seller rating, etc. (see, e.g., Bajari and
Hortaçsu, 2003, 2004; Lucking-Reiley et al., 2000).
While this approach is certainly useful for under-
standing some of the variation in closing prices, it
also leads to loss of a large amount of potentially
useful information about everything that happened
between the start and end of the auction. More gen-
erally, current research uses a response variable that
is an aggregation of the process of interest: the maxi-
mum bid in online auctions, the average product rat-
ing, etc. This (direct or indirect) choice is guided by
economic importance but is also likely done so that
standard models can be applied. Furthermore, the
choice of independent variables is limited to static
“snapshot” information. The existence of more de-
tailed data, however, can potentially shed more light
on the entire process rather than only its aggregated
form.
In the online auction example, variables like the
opening bid, the auction duration and the seller’s
rating are determined before the auction starts and
thus do not capture any of the information that
arrives after that. However, it is well-known that
events that occur during the auction can also affect
the final price. For instance, the number of com-
peting bidders, the bidders’ experience and the bid
timings can influence the final price. These three
variables are available only after the auction starts
and in fact the information they carry changes as
the auction progresses.
While it is possible to include time-varying ex-
planatory variables like the number of bidders into
a regression model, such a model would no longer be
considered “standard” in the classical least-squares
sense since it would have to account for time-depen-
dence between the explanatory variable and the re-
sponse, and also within the explanatory variable it-
self.
Furthermore, there is additional information revealed
during ongoing processes that cannot be captured
easily by such models. An example is concurrency
and the effect that new events have on future events.
In the online auction context, incoming bids can in-
fluence bidders in different ways: A new bid placed
in an auction can result in an immediate response
by other bidders or can be completely ignored. Bid-
ders also learn from each other: they adopt bid-
ding strategies of other bidders and they learn about
an item’s value from bids that were placed. Many
items sold in online auctions do not have a com-
monly known value (such as collectibles, antiques,
rare pieces of art, etc.), and therefore bidders of-
ten try to infer the item’s value from other people’s
bids. In short, while the final price is certainly af-
fected by directly observable phenomena (such as
the number of competing bidders), it is also depen-
dent on indirect actions, reactions and interactions
among bidders.
4.2 Process Dynamics and FDA
Modeling the effects of user interactions with clas-
sical regression models is challenging, to say the
least. An alternative approach is to capture some of
this dynamic information via evolution curves and
their dynamics. In the auction context this would be
the price evolution, which is the progression of bids
throughout an auction. The evolution curve and its
dynamics can reflect these bidder interactions: High
competition in an auction will manifest itself as a
steep price curve with increasing dynamics. Price
will also increase, albeit at a slower rate, if bidders
merely use the new bid to update their own valua-
tion about the product. The price increase will slow
down if bidders drop out of the auction due to a
newly placed bid or for some other reason. There-
fore, the price-evolution curve, and in particular its
dynamics, has the ability to capture much of the
auction information that would otherwise not be
integrated into the model. By price dynamics we
mean, for example, the price velocity and acceleration,
which measure the rate of change in price and the
rate at which that change itself changes. The ability
to measure dynamics is one of the most noteworthy
features of functional data analysis. FDA recovers
the price evolution via a smooth curve through the
auction’s bid history and yields the price dynamics
via the derivatives of this curve. Examples of explor-
ing process dynamics via FDA in eCommerce are the
price dynamics in eBay auctions (Jank and Shmueli,
2005) and bid dynamics in auctions for modern In-
dian art by Reddy and Dass (2006). In these two
examples the price curves themselves are not very
illuminating, but their dynamics reveal interesting
patterns and sources of heterogeneity across records.
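The step from a bid history to price dynamics can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a low-degree least-squares polynomial as a stand-in for the smoothing spline typically used in FDA, and the function name, bid times and bid amounts are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial


def price_dynamics(bid_times, bid_amounts, grid, degree=3):
    """Fit a smooth curve to one auction's bids and evaluate its derivatives.

    Assumption: a least-squares polynomial stands in for a smoothing spline.
    Returns (price, velocity, acceleration), each evaluated on `grid`.
    """
    curve = Polynomial.fit(bid_times, bid_amounts, deg=degree)
    velocity = curve.deriv(1)      # first derivative: price velocity
    acceleration = curve.deriv(2)  # second derivative: price acceleration
    return curve(grid), velocity(grid), acceleration(grid)


# Hypothetical 7-day auction: sparse early bids, dense bidding near the end.
times = np.array([0.5, 2.0, 4.5, 6.0, 6.7, 6.9, 7.0])
bids = np.array([5.0, 8.0, 12.0, 20.0, 31.0, 38.0, 42.0])
grid = np.linspace(0.5, 7.0, 5)
price, velocity, acceleration = price_dynamics(times, bids, grid)
```

Because the derivatives are taken from the fitted curve rather than from raw bid differences, the unevenly spaced bid arrivals do not need to be aligned to a regular grid first.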
One can model the relationship between the process
evolution (or its dynamics) and other predictors
via functional regression analysis. For example,
in several studies of price formation in eBay auctions
(Shmueli and Jank, 2006; Bapna, Jank and Shmueli,
2004; Alford and Urimi, 2004) and other online auctions
(Reddy and Dass, 2006), functional regression
models were fit to price-evolution curves
(the response) with static predictors (such
as the seller rating) and functional predictors (such
as the cumulative number of bids). One interesting
finding is that the impact of the opening bid on the
current price starts high, and slowly decreases as the
auction progresses. This reflects the shift in informa-
tion about the item’s value due to bidding: At first
there is not much information available and so the
opening bid gives a sense of the item’s value. But as
the auction progresses new bids add more informa-
tion about the value of the item, thereby reducing
the usefulness of the information contained in the
opening bid.
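One simple way to obtain such a time-varying coefficient is to run an ordinary least-squares regression separately at each point of a common time grid. The sketch below assumes that the price curves have already been evaluated on that grid and that there is a single static predictor (the opening bid). The function name and the data are hypothetical, and the cited studies may well use basis expansions and roughness penalties instead.

```python
import numpy as np


def coefficient_curves(curves, predictor):
    """Pointwise least squares for a functional response.

    curves:    array of shape (n_auctions, n_grid), one price curve per row
    predictor: array of shape (n_auctions,), e.g. the opening bid
    Returns (intercept(t), slope(t)), each of length n_grid.
    """
    predictor = np.asarray(predictor, dtype=float)
    design = np.column_stack([np.ones_like(predictor), predictor])
    # lstsq solves all grid points at once: one column of `curves` per time point.
    coef, *_ = np.linalg.lstsq(design, np.asarray(curves, dtype=float), rcond=None)
    return coef[0], coef[1]


# Hypothetical data: the effect of the opening bid fades as the auction proceeds.
grid = np.linspace(0.0, 1.0, 5)
opening_bid = np.array([1.0, 2.0, 4.0, 8.0])
true_slope = 1.0 - 0.5 * grid
price_curves = 10.0 * grid + np.outer(opening_bid, true_slope)
intercept, slope = coefficient_curves(price_curves, opening_bid)
```

Plotting `slope` against `grid` gives the kind of coefficient curve discussed here, with time on the x-axis.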
One of the challenges in functional regression analysis
is the interpretation of the results. Instead of
scalar coefficient estimates, we obtain estimated
coefficient curves. When these curves are plotted, the
x-axis is time rather than the value of the predictor.
For example, Figure 4 shows the estimated coeffi-
cient (and a 95% confidence band) for a regression
model with a functional response. In the top panel
the response is the price evolution. The coefficient
is positive throughout the auction, signifying that
the current price is positively associated with the
opening bid throughout the auction. However, this
relationship decreases in magnitude as the auction
proceeds. This is reasonable, because bidders gain
more and more information as the auction proceeds
and therefore derive less utility from the value of
the opening price. The middle and bottom panels in
Figure 4 describe another useful information source:
the relationship between the opening price and the
price dynamics. If we are interested in relationships
between various independent variables and the pro-
cess dynamics, we can use the derivative curves as
the functional response. In this example we set the
price velocity (middle) and price acceleration (bot-
tom) as the responses. We see that the price accel-
eration is positively associated with the opening bid
at the auction start, but then this relationship loses
momentum and even becomes negative as the auction
comes to a close.
Fig. 4. Estimated coefficient for opening price, in three regression models with a functional response: price evolution (top),
price velocity (middle) and price acceleration (bottom).
The capability of studying process dynamics could
have a huge impact on eCommerce research. Though
the concepts of dynamics are well grounded in physics
and engineering, their exact economic impact re-
quires more thought. However, there is an opportu-
nity to create new economic measures with the help
of FDA. For instance, we can develop concepts such
as “auction energy” using the definition of kinetic
energy ($\text{energy} = \text{mass} \times \text{velocity}^2/2$) to arrive at
\[
\text{auction energy} = (\text{current price}) \times (\text{price velocity})^2 / 2. \tag{1}
\]
A major challenge is to find a theoretical foundation
in economics for such concepts. This is only one
example where interdisciplinary collaboration could
have a large impact on the field.
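As a small worked instance of equation (1), the function below evaluates auction energy from a current price and a price velocity. The function name and the numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
def auction_energy(current_price, price_velocity):
    """Auction energy as in equation (1): price times squared velocity, halved."""
    return current_price * price_velocity ** 2 / 2


# Hypothetical auction at $10 whose price is rising by $2 per day.
energy = auction_energy(10.0, 2.0)
```

By analogy with kinetic energy, the current price plays the role of mass, so two auctions with the same velocity differ in energy in proportion to their price levels.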
4.3 Other Functional Models for eCommerce
Another level of flexibility, but also complexity,
is to incorporate interaction terms into functional
regression models. Since interactions (in ordinary
linear regression models) are widely used in eCom-
merce studies, it could be useful to measure simi-
lar effects in functional objects. The literature on
interactions in functional linear regression models
appears to be scant, although this seems like an im-
portant extension.
Another important direction for modeling the dy-
namic nature of web content in general, and eCom-
merce in particular, is the use of differential equa-
tions. The use of differential equations in the func-
tional literature is still in its infancy. Ramsay (2000a)
gives an introduction to the use of differential equa-
tions in statistics and several examples of functional
estimation problems such as simultaneous estima-
tion of a regression model and residual density, mono-
tone smoothing, specification of a link function, dif-
ferential equation models of data, and smoothing
over complicated multidimensional domains. Ram-
say calls this “principal differential analysis” (PDA)
because of the similarities that it shares with prin-
cipal components analysis.
PDA is a natural formalization of the exploration
of curve dynamics. Through a differential equation
we can relate, for instance, the price during an auction
to its rate of increase and acceleration. It is possi-
ble that such relations exist in the dynamic, ever-
changing eCommerce world. These relationships need
to be more formally integrated with economic theory
to create a solid foundation for the empirical findings
that have been observed. For example, PDA was
used to study the relationship between price curves
in online auctions and their derivatives by Jank and
Shmueli (2005) and Wang (2005). The main finding
is that relationships between the price curve and its
acceleration are present in some types of auctions,
but not in others, suggesting that dynamics can vary
broadly in eCommerce processes.
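The idea of relating a curve to its own derivatives can be sketched under a strong simplifying assumption: a second-order linear differential equation with constant coefficients, D²x(t) = −β₀ x(t) − β₁ Dx(t), estimated by least squares on a grid. PDA as described by Ramsay allows the coefficients to vary with time; the constant-coefficient form, the function name and the test curve below are assumptions made for illustration.

```python
import numpy as np


def fit_constant_pda(x, dx, d2x):
    """Least-squares estimate of (beta0, beta1) in D2x = -beta0 * x - beta1 * Dx.

    x, dx, d2x: the curve and its first two derivatives on a common grid.
    Assumption: coefficients are constant over time, unlike general PDA.
    """
    design = np.column_stack([x, dx])
    beta, *_ = np.linalg.lstsq(design, -np.asarray(d2x, dtype=float), rcond=None)
    return beta


# Hypothetical curve x(t) = sin(2t), which satisfies D2x = -4x exactly.
t = np.linspace(0.0, 3.0, 50)
beta0, beta1 = fit_constant_pda(np.sin(2 * t), 2 * np.cos(2 * t), -4 * np.sin(2 * t))
```

Coefficient estimates near zero would indicate no such relationship between the curve and its acceleration, which is one way to read the finding that the relationship is present in some types of auctions but not in others.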
5. FUTURE TRENDS IN FUNCTIONAL
MODELING
In the previous sections we have shown multi-
ple facets of FDA that make it a natural approach
in eCommerce empirical research. Unlike currently
used static models, FDA can capture processes and
dynamic information which are inherent in the eCom-
merce environment. In the following we describe a
few important areas that are still undeveloped both
in the eCommerce research world and in the FDA
domain, and in our opinion have the potential to
make a contribution to both.
The first area is related to concurrency of events.
In almost every eCommerce study, the events of in-
terest occur concurrently or have at least some over-
lap. This means that there is a dependence structure
between records, with some events influencing oth-
ers. The most obvious example is the stock mar-
ket with stock prices influencing each other. Some
eCommerce examples are concurrent auctions on eBay
for the same item or even for competing items, and
prices of a certain book at different online vendors
(and perhaps even brick-and-mortar stores) over time.
Although researchers acknowledge such relationships,
nearly all studies make the simplifying assumption
of independent observations. Ignoring the effects of
concurrency can lead to invalid results. A first step
is therefore to find ways to evaluate the degree of
concurrency and its effect on the measure of inter-
est. Shmueli and Jank (2005) introduce and evalu-
ate several data displays for exploring the effect of
concurrency in online auctions on final price. Hyde,
Jank and Shmueli (2006) expand this work to visualize
concurrency of the functional objects, using
curves to represent price evolution and its dynamics.
In addition to data displays, there is a need for defin-
ing measures of concurrency, and finally, for devel-
oping models that can incorporate and account for
relationships between processes that are represented
as functional objects (see, e.g., Jank and Shmueli,
2006).
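To make the notion of a concurrency measure concrete, the sketch below scores each auction by how much of its duration overlaps with other auctions. This particular measure is a hypothetical illustration, not one proposed in the cited work, and it treats each auction simply as a (start, end) interval on a common time scale.

```python
def overlap_fraction(auction, other):
    """Fraction of `auction`'s duration during which `other` is also open.

    Each auction is a (start, end) pair on a common time scale with end > start.
    """
    start, end = auction
    other_start, other_end = other
    shared = min(end, other_end) - max(start, other_start)
    return max(shared, 0.0) / (end - start)


def concurrency_scores(auctions):
    """Sum of overlap fractions with every other auction, one score per auction."""
    return [
        sum(overlap_fraction(a, b) for j, b in enumerate(auctions) if j != i)
        for i, a in enumerate(auctions)
    ]


# Hypothetical auctions: two fully concurrent, one held later in isolation.
scores = concurrency_scores([(0, 10), (0, 10), (20, 30)])
```

A score of this kind could serve as a covariate when checking how far the independence assumption is violated, although it ignores how similar the competing items are.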
Another important enhancement to FDA that would
greatly benefit eCommerce research is the incorpo-
ration of change into the functional objects over
time. As pointed out earlier, eCommerce experiences
constant change. Frequent technological advancement,
new website formats, changes in the global economy,
etc., can have a large influence on what we observe
in the eCommerce world. If we use functional ob-
jects to represent observations which are themselves
longitudinal, we need ways to incorporate an addi-
tional temporal dimension that compares functional
objects over time.
FDA research focuses more and more on p-dimensi-
onal functional objects (e.g., Yushkevich et al., 2001).
In many eCommerce applications such representa-
tions could be very useful. One example is com-
petition in online auctions, where each auction is
represented by its price curve coupled with the cu-
mulative number of bidders, thus yielding a bivari-
ate functional representation (Wang and Wu, 2004).
There is also related work on symbolic data analysis
(SDA) (see Bock and Diday, 2000), which provides
tools for managing complex, aggregated, relational
and higher-level data described by multivalued variables.
Combining SDA with FDA methods could be another
fruitful pairing.
Finally, in many cases, and especially in economics,
the objects of interest are individual users. Economists
are typically interested in how individuals strategize
and react to others. The problem with formulating
functional objects that represent individuals is sparsity
of data. That is, individuals typically do not
leave many traces during one eCommerce transaction.
For example, in eBay auctions if we treat an
individual bidder as our object of interest, then bidders
will leave very sparse data (1–2 bids per bidder
is the norm). For similarly sparse data, James, Hastie
and Sugar (2000) and James and Sugar (2003) use a
semiparametric approach in which information from data
aggregated across individuals is used to supplement
the information at the individual level. An alter-
native model would pool information from previous
records that the individual was involved in and use it
to supplement the current record. Approaches that
enable the functional representation of sparse data
can prove very useful in tying economic theories to
empirical results. This would further strengthen the
eCommerce research area.
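The idea of supplementing sparse individual data with pooled information can be sketched with a ridge-type shrinkage: a mean curve is estimated from all bids, and each bidder's curve is the mean plus a penalized deviation fitted to that bidder's few bids. This is a simplified stand-in, not the model of James, Hastie and Sugar (2000); the polynomial basis, the penalty and all names are assumptions.

```python
import numpy as np


def basis(t, degree=2):
    """Polynomial basis matrix with columns 1, t, ..., t**degree."""
    return np.vander(np.asarray(t, dtype=float), degree + 1, increasing=True)


def pooled_mean_coefs(all_times, all_values, degree=2):
    """Least-squares mean curve fitted to bids pooled across all individuals."""
    values = np.asarray(all_values, dtype=float)
    coef, *_ = np.linalg.lstsq(basis(all_times, degree), values, rcond=None)
    return coef


def individual_coefs(times, values, mean_coefs, penalty=1.0):
    """Mean curve plus a ridge-penalized deviation fitted to one individual's bids."""
    B = basis(times, len(mean_coefs) - 1)
    resid = np.asarray(values, dtype=float) - B @ mean_coefs
    # The penalty keeps the system solvable even with only one or two bids.
    ridge = B.T @ B + penalty * np.eye(B.shape[1])
    deviation = np.linalg.solve(ridge, B.T @ resid)
    return mean_coefs + deviation
```

With a large penalty the individual curve collapses to the pooled mean, and with a small penalty it follows the bidder's own bids more closely, so the penalty controls how much is borrowed from other individuals.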
6. CONCLUSIONS
The emerging field of empirical eCommerce re-
search is growing fast with many data-related chal-
lenges. In light of the special data structures and
the types of research questions of interest, we be-
lieve that functional data analysis can play a ma-
jor role in this field. On the one hand, this requires
more involvement by statisticians to further explore
statistical issues involved and to develop functional
methods and models that are called for in these ap-
plications. On the other hand, collaborative work
has proven to be extremely fruitful for the multiple
disciplines involved. In that respect, more outreach
should be done to make these tools more popular.
Wider adoption of functional tools by nonstatisti-
cians requires software accessibility. Currently
FDA packages exist for Matlab, S-PLUS and R
(ego.psych.mcgill.ca/misc/fda/software.html). The number
of specialized programs for particular applications
is expected to grow, and we encourage researchers
to make such code and data freely available.
In addition, making sample datasets freely
available will make this field more accessible
and attractive to statisticians. Our website
(www.smith.umd.edu/ceme/statistics/) contains
some eBay data and auction-specific FDA code.
Another important front in further developing this
exciting new interdisciplinary field is the involvement
and training of graduate students. This includes
educating statistics students about both the
eCommerce domain and FDA. From our own experience
through Research Interaction Teams, we found
this to be a very exciting ground for advancing statistical
research.
ACKNOWLEDGMENTS
We thank Professor Steve Marron from UNC for
fruitful conversations, Professor Jim Ramsay from
McGill University for continuous advice and support
with FDA software, and the three referees for their
constructive comments.
REFERENCES
Abraham, C., Cornillon, P. A., Matzner-Løber, E. and
Molinari, N. (2003). Unsupervised curve-clustering using
B-splines. Scand. J. Statist. 30 581–595. MR2002229
Alford, B. and Urimi, L. (2004). An analysis of various
spline smoothing techniques for online auctions. Term pa-
per, Research Interaction Team, VIGRE program, Univ.
Maryland.
Aris, A., Shneiderman, B., Plaisant, C., Shmueli, G.
and Jank, W. (2005). Representing unevenly-spaced time
series data for visualization and interactive exploration.
Human–Computer Interaction—INTERACT 2005: IFIP
TC13 International Conference. Lecture Notes in Comput.
Sci. 3585 835–846. Springer, Berlin.
Bajari, P. and Hortaçsu, A. (2003). The winner’s curse, reserve
prices and endogenous entry: Empirical insights from
eBay auctions. RAND J. Economics 34 329–355.
Bajari, P. and Hortaçsu, A. (2004). Economic insights
from Internet auctions. J. Economic Literature 42 457–486.
Bapna, R., Jank, W. and Shmueli, G. (2004). Price forma-
tion and its dynamics in online auctions. Working paper
RHS-06-003, Smith School of Business, Univ. Maryland.
Available at ssrn.com/abstract=902887.
Bock, H. H. and Diday, E., eds. (2000). Analysis of Sym-
bolic Data: Exploratory Methods for Extracting Statisti-
cal Information from Complex Data. Springer, Heidelberg.
MR1792132
Cuevas, A., Febrero, M. and Fraiman, R. (2002). Linear
functional regression: The case of fixed design and func-
tional response. Canad. J. Statist. 30 285–300. MR1926066
Dellarocas, C. and Narayan, R. (2006). A statistical
measure of a population’s propensity to engage in post-
purchase online word-of-mouth. Statist. Sci. 21 277–285.
Fan, J. and Lin, S.-K. (1998). Test of significance when
data are curves. J. Amer. Statist. Assoc. 93 1007–1021.
MR1649196
Faraway, J. J. (1997). Regression analysis for a functional
response. Technometrics 39 254–261. MR1462586
Guo, W. (2002). Inference in smoothing spline analysis of
variance. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 887–
898. MR1979393
Hall, P., Poskitt, D. S. and Presnell, B. (2001). A
functional data-analytic approach to signal discrimination.
Technometrics 43 1–9. MR1847775
Hyde, V., Jank, W. and Shmueli, G. (2006). Investigat-
ing concurrency in online auctions through visualization.
Amer. Statist. To appear.
Hyde, V., Moore, E. and Hodge, A. (2004). Functional
PCA for exploring bidding activity times for online auc-
tions. Term paper, Research Interaction Team, VIGRE
program, Univ. Maryland.
James, G. M. (2002). Generalized linear models with func-
tional predictors. J. R. Stat. Soc. Ser. B Stat. Methodol.
64 411–432. MR1924298
James, G. M. and Hastie, T. J. (2001). Functional linear
discriminant analysis for irregularly sampled curves. J. R.
Stat. Soc. Ser. B Stat. Methodol. 63 533–550. MR1858401
James, G. M., Hastie, T. J. and Sugar, C. A. (2000).
Principal component models for sparse functional data.
Biometrika 87 587–602. MR1789811
James, G. M. and Sugar, C. A. (2003). Clustering for
sparsely sampled functional data. J. Amer. Statist. Assoc.
98 397–408. MR1995716
Jank, W. and Shmueli, G. (2005). Profiling price dynamics
in online auctions using curve clustering. Working paper
RHS-06-004, Smith School of Business, Univ. Maryland.
Available at ssrn.com/abstract=902893.
Jank, W. and Shmueli, G. (2006). Modeling concurrency
of events in online auctions via spatio-temporal semipara-
metric models. Working paper, Smith School of Business,
Univ. Maryland.
Lucking-Reiley, D., Bryan, D., Prasad, N. and
Reeves, D. (2000). Pennies from eBay: The determinants
of price in online auctions. Technical report, Dept. Eco-
nomics, Univ. Arizona.
Ogden, R. T., Miller, C. E., Takezawa, K. and
Ninomiya, S. (2002). Functional regression in crop lodg-
ing assessment with digital images. J. Agric. Biol. Environ.
Stat. 7 389–402.
Pfeiffer, R. M., Bura, E., Smith, A. and Rutter, J. L.
(2002). Two approaches to mutation detection based on
functional data. Stat. Med. 21 3447–3464.
Ramsay, J. O. (1998). Estimating smooth monotone func-
tions. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 365–375.
MR1616049
Ramsay, J. O. (2000a). Differential equation models for
statistical functions. Canad. J. Statist. 28 225–240.
MR1777224
Ramsay, J. O. (2000b). Functional components of variation
in handwriting. J. Amer. Statist. Assoc. 95 9–15.
Ramsay, J. O. and Ramsey, J. B. (2002). Functional data
analysis of the dynamics of the monthly index of non-
durable goods production. J. Econometrics 107 327–344.
MR1889966
Ramsay, J. O. and Silverman, B. W. (1997). Functional
Data Analysis. Springer, New York.
Ramsay, J. O. and Silverman, B. W. (2002). Applied Func-
tional Data Analysis: Methods and Case Studies. Springer,
New York. MR1910407
Ramsay, J. O. and Silverman, B. W. (2005). Functional
Data Analysis, 2nd ed. Springer, New York. MR2168993
Ratcliffe, S. J., Heller, G. Z. and Leader, L. R. (2002).
Functional data analysis with application to periodically
stimulated foetal heart rate data. II: Functional logistic
regression. Stat. Med. 21 1115–1127.
Ratcliffe, S. J., Leader, L. R. and Heller, G. Z. (2002).
Functional data analysis with application to periodically
stimulated foetal heart rate data. I: Functional regression.
Stat. Med. 21 1103–1114.
Reddy, S. K. and Dass, M. (2006). Modeling on-line art
auction dynamics using functional data analysis. Statist.
Sci. 21 179–193.
Rossi, N., Wang, X. and Ramsay, J. O. (2002). Nonpara-
metric item response function estimates with the EM algo-
rithm. J. Educational and Behavioral Statistics 27 291–317.
Shmueli, G. and Jank, W. (2005). Visualizing online auc-
tions. J. Comput. Graph. Statist. 14 299–319. MR2160815
Shmueli, G. and Jank, W. (2006). Modeling the dynamics
of online auctions: A modern statistical approach. In Eco-
nomics, Information Systems and E-Commerce Research
II : Advanced Empirical Methods 1 (R. Kauffman and P.
Tallon, eds.). Sharpe, Armonk, NY. To appear.
Shmueli, G., Jank, W., Aris, A., Plaisant, C. and Shnei-
derman, B. (2006). Exploring auction databases through
interactive visualization. Decision Support Systems. To ap-
pear.
Stewart, K., Darcy, D. and Daniel, S. (2006). Opportu-
nities and challenges applying functional data analysis to
the study of open source software evolution. Statist. Sci.
21 167–178.
Tarpey, T. and Kinateder, K. K. J. (2003). Clustering
functional data. J. Classification 20 93–114. MR1983123
Wang, S. (2005). Principal differential analysis of online auc-
tions. Term paper, Research Interaction Team, VIGRE
program, Univ. Maryland.
Wang, S., Jank, W. and Shmueli, G. (2006). Forecasting
eBay’s online auction prices using functional data analysis.
J. Bus. Econom. Statist. To appear.
Wang, S. and Wu, O. (2004). Bivariate functional modelling
of the bid amounts and number of bids in online auctions.
Term paper, Research Interaction Team, VIGRE program,
Univ. Maryland.
Wu, O. (2005). Dynamics of online movie ratings. Term pa-
per, Research Interaction Team, VIGRE program, Univ.
Maryland.
Yushkevich, P., Pizer, S., Joshi, S. and Marron, J. S.
(2001). Intuitive, localized analysis of shape variability. In-
formation Processing in Medical Imaging. Lecture Notes in
Comput. Sci. 2082 402–408. Springer, Berlin.

  • 1. arXiv:math/0609173v1[math.ST]6Sep2006 Statistical Science 2006, Vol. 21, No. 2, 155–166 DOI: 10.1214/088342306000000132 c Institute of Mathematical Statistics, 2006 Functional Data Analysis in Electronic Commerce Research Wolfgang Jank and Galit Shmueli Abstract. This paper describes opportunities and challenges of using functional data analysis (FDA) for the exploration and analysis of data originating from electronic commerce (eCommerce). We discuss the special data structures that arise in the online environment and why FDA is a natural approach for representing and analyzing such data. The paper reviews several FDA methods and motivates their useful- ness in eCommerce research by providing a glimpse into new domain insights that they allow. We argue that the wedding of eCommerce with FDA leads to innovations both in statistical methodology, due to the challenges and complications that arise in eCommerce data, and in online research, by being able to ask (and subsequently answer) new research questions that classical statistical methods are not able to ad- dress, and also by expanding on research questions beyond the ones traditionally asked in the offline environment. We describe several ap- plications originating from online transactions which are new to the statistics literature, and point out statistical challenges accompanied by some solutions. We also discuss some promising future directions for joint research efforts between researchers in eCommerce and statis- tics. Key words and phrases: Process dynamics, special data structures, online auctions. Wolfgang Jank is Assistant Professor of Management Science and Statistics, Department of Decision and Information Technologies, Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742, USA e-mail: wjank@rhsmith.umd.edu. Galit Shmueli is Assistant Professor of Management Science and Statistics, Department of Decision and Information Technologies, Robert H. 
Smith School of Business, University of Maryland, College Park, Maryland 20742, USA e-mail: gshmueli@rhsmith.umd.edu. This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2006, Vol. 21, No. 2, 155–166. This reprint differs from the original in pagination and typographic detail. 1. INTRODUCTION Functional data analysis (FDA) has been gain- ing momentum in many fields. While much of the methodological advances have been made within the statistics literature, FDA has found many useful ap- plications in the agricultural sciences (Ogden et al., 2002), the behavioral sciences (Rossi, Wang and Ram- say, 2002), in medical research (Pfeiffer et al., 2002) and many more. One reason for this momentum is the technological advancement in computer storage and computing power. Today’s researchers gather more and more data, often automatically, and store them in large databases. However, these new capa- bilities for data generation and data storage have also led to new data structures which do not nec- essarily fit into the classical statistical concept. Re- searchers measure characteristics of customers over time, store digitalized two- or three-dimensional im- ages of the brain, and record three- or even four- 1
  • 2. 2 W. JANK AND G. SHMUELI dimensional movements of objects through space and time. Many of these new data structures call for new statistical methods in order to unveil the informa- tion that they carry. Data can contain trends that vary in longitudinal or spatial aspects, that vary across different groups of customers or objects, or that show different magnitudes of dynamics. FDA is a tool-set that, although based on the ideas of classical statistics, differs from it (and, in a sense, generalizes it), especially with respect to the type of data structures that it encompasses. While the underlying ideas for FDA have been around for a longer time, the surge in associated research can be attributed to the monograph of Ramsay and Silver- man (1997). In FDA, the object of interest is a set of curves, shapes, objects, or, more generally, a set of functional observations. This is in contrast to clas- sical statistics where the interest centers around a set of data vectors. In recent years, a range of clas- sical statistical methods have been generalized to the functional framework; James, Hastie and Sugar (2000) developed a principal components approach for a set of sparsely sampled curves. Other exploratory tools include curve clustering (see Abraham et al., 2003; James and Sugar, 2003; Tarpey and Kinateder, 2003) and curve classification (see Hall, Poskitt and Presnell, 2001; James and Hastie, 2001). Classical linear models have also been generalized to func- tional ANOVA (Fan and Lin, 1998; Guo, 2002), functional regression (Faraway, 1997; Cuevas, Febrero and Fraiman, 2002; Ratcliffe, Leader and Heller, 2002) and the functional generalized linear model (Rat- cliffe, Heller and Leader, 2002; James, 2002). More- over, Ramsay (2000b) and Ramsay and Ramsay (2002) suggest differential equations for data of functional form. While this list is far from complete, it shows some of the current methodological efforts in this emerging field. 
Electronic commerce (eCommerce) is a growing field of scholarly research especially in information systems, economics and marketing, but it has re- ceived little to no attention in statistics. This is sur- prising because it arrives with an enormous amount of data and data-related questions and problems. Like other web-based data, eCommerce data tend to be very rich, clean and structurally different from offline data. eCommerce research arrives with many new data- and model-related challenges that promise new ideas and motivation for further methodologi- cal advancements of FDA. One of the main char- acteristics of eCommerce data is the combination of longitudinal information (time-series data) with cross-sectional information (attribute data). A sam- ple of n records typically comprises n time series, each linked with a set of n attributes. Take eBay’s online auctions as an example. There, each auction is characterized by a time series of bids placed over time. This information is coupled with additional auction attributes such as a seller’s rating, the auc- tion duration and the currency used. Another ex- ample is online product ratings on Amazon.com or movie ratings on Yahoo! Movies (see Dellarocas and Narayan, 2006). Yahoo! Movies allows users to rate any movie according to different measures. This re- sults in a time series that describes the average daily rating or the number of daily postings (or both) from the date of the movie release until the time of data collection. This information is coupled with attribute data about the movie such as the movie genre and critics’ rating. A third example, described in Stewart, Darcy and Daniel (2006), is the evolution of open-source software projects that is monitored by websites such as SourceForge.net. Here an obser- vation is a certain project, and it is characterized by a time series that describes project complexity from its first release until the time of data collection. 
Each project also has associated attributes such as the number of developers, the operating system used and the programming language. The combination of longitudinal and cross-sectional information is only one typical aspect of eCommerce data. Another aspect is the uneven spacing between events. In many cases, the observed time series is composed of events influenced by multiple users or agents who access the web at different points in time (and from different geographical locations). Consequently, the resulting times when new events arrive are extremely unevenly spaced. This is in contrast to traditional time series, which are typically recorded at predefined and equidistant time-points, such as daily, monthly or quarterly scales. Furthermore, because of psychological, economic or other reasons, eCommerce time series tend to feature very sparse areas at some times, followed by extremely dense areas at other times. For instance, bidding in eBay auctions tends to be concentrated at the end, resulting in very sparse bid-arrivals during most of the auction except for its final moments, where the bidding volume can be extremely high. eCommerce not only creates new data challenges, it also motivates the need for innovative models.
  • 3. FDA IN ECOMMERCE 3 While the field of economics has created many theories for understanding economic behavior at the individual and market level, many of these theories were developed before the emergence of the World Wide Web. The existence of the web now allows researchers, for the first time, to observe and record data about economic behavior on a large-scale basis. As it turns out, however, observed data often do not support classical economic theories. As a result, empirical research is thriving. In fact, the empirical literature has continuously shown that online behavior deviates in many ways from offline behavior and from what is expected by economic theory. This calls for new economic models that can be validated empirically. In addition, the availability of eCommerce data allows researchers to ask new types of questions. One major enhancement is the ability to study not only the evolution of a process, but also its dynamics: how fast it moves and how suddenly it changes, its rate of change and how this rate differs at different time-points. Studying dynamics of processes can be very relevant in the online world, because it allows new approaches for characterizing eCommerce processes (and thus distinguishing between diverse processes), and even forecasting them (Wang, Jank and Shmueli, 2006). Changing dynamics are inherent in a fast-moving environment like the online world. Fast movements and change imply nonstationarity, which poses challenges to traditional time series modeling. And finally, it is important to point out that for any one process that we observe in the online world, there typically exist many, many replicates of the same (or at least very similar) process. On eBay, for instance, if we think of the formation of price between the start and the end of an auction as a process of interest, then there exist several million similar processes of that form, taking place on any given day on eBay.
The replication of processes, or time series, fits naturally within the FDA framework and makes this an ideal ground for the advancement of new functional methodology. Finally, eCommerce typically arrives with huge databases which can put a computational burden on users’ storage and processing facilities. This burden is often increased by the complicated structure of eCommerce data. Taking a functional data approach, one can relieve some of that burden. FDA operates on functional objects which can be more compactly represented than the original data. Taking a functional approach may therefore be advantageous also from a resource point of view. The process of studying a set of data via functional methods consists of two principal steps: First, the functional object is “recovered,” typically by means of smoothing. There are multiple different ways in which this smoothing step can be executed, and there are many challenges during that step. Second, the resulting functional object is used for data exploration and analysis. Exploratory data analysis (including data visualization and summary) is performed in order to learn about general characteristics as well as unusual features and anomalies in the data. Analysis includes explanatory and predictive modeling and inference, just as in classical statistics. In the next sections we focus on the challenges and problems that arise during these steps within the eCommerce context. We would like to note that our point of view of the functional approach and its application to eCommerce has been forged during the teaching of so-called Research Interaction Teams (www.amsc.umd.edu/Courses/RITDescrips/HowAndWhy.html) which are research classes that involve graduate students from the Statistics and the Applied Mathematics and Scientific Computation programs at the University of Maryland. Several of our studies performed during these classes have led to new methodological and practical insights.
2.
RECOVERING FUNCTIONAL OBJECTS
The first step in any functional data analysis consists of recovering, from the observed data, the underlying functional object. There exist a variety of methods for recovering functional objects from a set of data, all of which are typically based on some kind of smoothing. As a result of the smoothing, and of characterizing the smooth object by its smoothing parameters only, we obtain a low-dimensional functional object. We focus here on objects, and in particular curves, that are based on unevenly spaced time series and of which we have multiple replications. An example is a set of bid histories from eBay auctions, as shown in Figure 1. The four panels correspond to four separate seven-day auctions for a new Palm PDA. Each consists of the bids (in $) placed at different times during the auction.
2.1 Challenges in Choosing the Right Smoother
The first step in recovering the functional object is to choose a family of basis functions. The choice of the basis function depends on the nature of the data, on the level of smoothness that the application warrants, on what aspects of the data we want to study,
  • 4. on the size of the data and on the types of analyses that we plan to perform. For example, to represent the price path of an online auction for the purpose of, say, studying price dynamics, one could use monotone smoothing splines (Ramsay, 1998) since prices in auctions increase monotonically. Besides maintaining the price monotonicity, this approach also permits the computation of derivatives which lend themselves to price dynamics. However, fitting monotone splines is computationally more intensive than fitting ordinary polynomial smoothing splines. In that sense, it may prove impractical to compute monotone splines for very large databases if time and memory restrictions exist. In addition, polynomial smoothing splines can be represented as a linear combination of basis functions. The practical meaning of this is that if we use polynomial smoothing splines and the intended analysis is based on a linear operation (such as computing average curves, fitting functional linear regression models, or performing functional principal components analysis), we can operate directly on the basis function coefficients without any loss of information. The same operation using monotone splines would require an approximation step due to the need to first represent the continuous curve in a finite-dimensional manner by evaluating it on a grid. Conversely, if the type of operation is nonlinear, then one would have to perform a grid-based computation for either type of spline and the choice would therefore not matter from this point of view. Thus, the way we recover the functional object is strongly influenced by a variety of different objectives all of which might compete with one another.
Fig. 1. Scatterplots describing the bid history in each of four eBay auctions, each lasting seven days.
Recovering functional objects often involves more than deciding on the appropriate type of smoother.
This can include a preprocessing step via interpolation, thereby creating a raw functional (e.g., Ramsay and Silverman, 2002, page 21). This alleviates the problem of the unevenly spaced series that are common in eCommerce. An important aspect in any functional data analysis is the robustness of analysis results with respect to the choice and level of smoothing. A general study of this sort was carried out by our research interaction team, comparing the effects of smoothing splines versus monotone splines on the conclusions derived from a functional regression on the price path in online auctions (the functional object) as a function of explanatory variables such as the seller rating and the opening bid (both scalar) and current number of bids (a functional explanatory variable). The study indicates that both smoothers lead to similar conclusions (Alford and Urimi, 2004). Another example of the challenges in choosing an appropriate smoothing method is the functional representation of online movie ratings. By that we mean the series of user movie ratings on online services such as Yahoo.com. The volume of user postings is highly periodic, with heavier activity on weekends (when people tend to watch movies in the theaters). Fourier basis functions were found to be a better choice among different alternatives for capturing the cyclical posting patterns (Wu, 2005).
2.2 Additional Data Challenges
Our different studies using eCommerce data have raised further challenges in the functional object recovery stage that have previously not been addressed in the literature. The first such challenge is handling the extremely unevenly distributed measurements in eCommerce data. That is, the number and location of events vary drastically from one functional object to another. One typical example is the bid arrival in eBay auctions.
Returning to Figure 1, it can be seen that some bid histories are very dense at the auction end, while others are much sparser, and in addition the overall number of bids per auction can vary widely between none in some auctions, and more than 100 in others. And yet, while the varying number of bids per auction may suggest the use of a varying set of smoothing methods, we prefer the use of a single family
  • 5. of smoothers. The reason for this is that, in the end, the choice of the smoother is merely a means to the end of arriving at a unifying functional object and it is not the direct object of our interest. Coming back to the example of online auctions with sometimes few and sometimes many bids, this motivates the need for new methodological advances in creating functional objects that naturally incorporate all extremes under one hat. Some promising approaches in that direction can be found in James and Sugar (2003) and James, Hastie and Sugar (2000). The extreme structure of eCommerce data is challenging even for very basic visualization tasks: standard time-series visualization tools typically require evenly spaced events. Because of this restrictive requirement, we collaborated with colleagues at the Human–Computer Interaction Lab at the University of Maryland to develop new interactive visualization tools that can accommodate this special data structure. We evaluated different approaches for representing eBay bid histories by an evenly spaced equivalent without losing important information with respect to order, magnitude and distance between the bids (see Aris et al., 2005). Interestingly, the final choice was to use a functional approach by first smoothing the bid history, and then feeding an evenly spaced grid of the smooth curves (and their derivatives) into a standard visualization tool. Another challenge typical of eCommerce data is defining meaningful start and end points of the functional objects in order to align the curves. The problem of aligning functional objects is related to the problem of registration (Ramsay and Silverman, 2005, Chapter 7), but there are several additional complications here: Many web-based events do not start and end at the same time.
For instance, online product ratings over time have different starting points, depending on when the product was first released to the market, when the first rating was placed, etc. They also often have different ending points, for instance, if one product is prematurely taken off the market, if it is replaced by a product-upgrade, and so on. Another issue that complicates object alignment is that the data collection process itself may act as a censoring mechanism. Therefore, it is not obvious how methods such as landmark registration, where curves are aligned according to one particular feature of the curve such as its peak, can be adapted to handle this situation. Finally, selecting the units for the time axis can be challenging. In some applications calendar time (e.g., the date and time a transaction took place) is reasonable, whereas in other applications the event index (i.e., the order of the event arrival) might make more sense. And yet in other cases an entirely different “clock” would be even more suitable. For instance, we pointed out that eBay auctions typically exhibit very low bidding activity during most of the auction and then extremely high activity near the end. For this reason an auction might be better represented by a clock that “shrinks” the low-activity period and “stretches” the high-activity period, thereby putting more emphasis on the part that matters more. All these issues are illustrated and discussed further in the paper by Stewart, Darcy and Daniel (2006).
3. FUNCTIONAL EDA
After the data are represented by functional objects, the analysis steps follow the same process as in classical statistics, with the first step being exploratory data analysis (EDA). EDA includes data summaries, visualization, dimension reduction, outlier detection, and more. The main difference between FDA and classical statistics is the way in which the methods are applied and especially how they are interpreted.
3.1 Static vs.
Interactive Visualization
Starting with visualization, our goal is to:
1. Visualize a sample of curves.
2. Inspect summaries of these curves.
3. Explore conditional curves, using various relevant predictor variables.
To that end, one solution is to create static graphs. For instance, Figure 2 shows the price evolution in 34 eBay auctions for various magazines. We can see large heterogeneity across the price formation process at different times of the auction. We refer to this approach as static, since once the graph is generated it can no longer be modified by the user without running the software code again. This static approach is useful for differentiating subsets of curves by attributes (e.g., by using color), or for spotting outliers. However, a static approach does not allow for an interactive exploration of the data. By interactive we mean that the user can perform operations such as zooming in and out, filtering the data and obtaining details for the filtered data, and do all of this from within the graphical interface. Interactive visualizations for the special structure of eCommerce data are not straightforward, and solutions
  • 6. have only been proposed recently (Aris et al., 2005; Shmueli et al., 2006). One such solution is AuctionExplorer (www.cs.umd.edu/hcil/timesearcher), which is tailored to handle the special structure of online auction data. A snapshot of its user interface is shown in Figure 3. The interface includes several panels, which correspond to the price curves (top left), their dynamics (not shown in this view) and the corresponding attribute data (top right). The curves can be filtered to display subsets according to a selection of attribute values, according to a selection of curves, and one can do pattern search. Summarization is achieved through on-the-fly summary statistics for attributes, and a “streaming boxplot” called a “river plot” of the curves (bottom panel of Figure 3). This is yet another attempt to generalize classical visualization methods to the functional environment.
Fig. 2. Static plot of the price progression in 34 eBay auctions for magazines.
Fig. 3. Snapshot of user interface for AuctionExplorer (www.cs.umd.edu/hcil/timesearcher).
3.2 Data Reduction
Another goal of EDA is data reduction. Two of the methods that are useful in this context are curve clustering and functional principal components analysis. Curve clustering partitions the set of curves into a few clusters, thereby reducing the space of observations, and attempts to derive insight from the resulting clusters. The clustering can be applied to the curves themselves or to their derivatives. Jank and Shmueli (2005) apply curve clustering to bid histories of eBay auctions and find three main clusters. Linking the curve information with attribute information, they find that the different clusters correspond to three types of auctions: “greedy sellers,” “bazaar auctions” and “experienced seller/buyer” auctions. Each of these types characterizes a different auction profile, combining static and dynamic information.
For instance, “greedy seller” auctions have the highest average opening price and the lowest closing price. Unsurprisingly, they do not attract much competition, since unjustified high opening prices tend to deter users from bidding. Low competition is also known to lead to lower prices. Sellers in these auctions are, on average, less experienced than those in other clusters (as can be measured by their eBay rating), scheduling most auctions to end on a weekday. These auctions also attract experienced winners who take advantage of the poorer auction design and resulting lower prices. The price dynamics of this cluster reflect this setting: price starts accelerating late in the auction, not allowing it to achieve its full impact by the time the auction closes. This mix of insight into static seller and bidder characteristics coupled with the price dynamics is only available with a functional approach.
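The curve-clustering idea described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the procedure of Jank and Shmueli (2005): it assumes every curve has already been smoothed and evaluated on a shared, evenly spaced grid, and it uses plain k-means with hand-picked starting centers. The function name `kmeans_curves` and the toy price curves are hypothetical.

```python
# Illustrative sketch only: plain k-means on curves sampled on a shared grid.
# The function name, the starting centers and the toy data are assumptions.

def kmeans_curves(curves, init_centers, n_iter=20):
    """Cluster equal-length curves; returns (labels, centers)."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(curves)
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        for i, curve in enumerate(curves):
            dists = [sum((a - b) ** 2 for a, b in zip(curve, c)) for c in centers]
            labels[i] = dists.index(min(dists))
        # Update step: each center becomes the pointwise mean of its members.
        for k in range(len(centers)):
            members = [curves[i] for i in range(len(curves)) if labels[i] == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centers


# Toy price curves on a 5-point grid: two rise early, two rise late.
curves = [
    [1.0, 6.0, 8.0, 9.0, 10.0],
    [2.0, 7.0, 8.5, 9.5, 10.0],
    [1.0, 1.5, 2.0, 3.0, 10.0],
    [1.0, 1.0, 2.5, 3.5, 9.0],
]
labels, centers = kmeans_curves(curves, init_centers=[curves[0], curves[2]])
print(labels)
```

The same function could be applied to derivative curves instead of price curves, which is the variant the text mentions; linking the resulting labels to auction attributes is a separate step not shown here.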
  • 7. Another popular method is functional principal components analysis (f-PCA). The method uses standard PCA to find principal sources of variability in curves (or other functional objects). If we consider curves that represent a process over time, then f-PCA can help us find “within-curve” (or more generally, “within-process”) variation, thereby condensing the time axis. This is done by selecting a discrete grid of time-points and treating the points as the variables in ordinary PCA. A preliminary study by Hyde, Moore and Hodge (2004) applied f-PCA to price curves and derivative curves of a sample of eBay auctions for premium wristwatches. They found that two or three principal components captured most of the within-curve variation: One source is price variation during the middle of the auction and the other distinguishes the price variability between the beginning and end of the auction. Similar results were obtained when using the price-dynamics curves. Hyde, Moore and Hodge (2004) also used f-PCA to compare sources of “within-process” variation across different brands for the same product category as well as for different product categories. It was found that price is most uncertain during mid-auction. As the auction approaches its end, though, the price becomes more predictable, especially in common-value auctions (see also Wang, Jank and Shmueli, 2006). An alternative approach of principal components analysis to functional data that, to the best of our knowledge, has not been explored would be to treat the observations as the dimension to be transformed. The idea is to find main sources of variation across curves (instead of within curves), achieving a goal similar to curve clustering, where main features of the curves are highlighted. The exact meaning and interpretation of this variation deserve further attention.
4.
FUNCTIONAL MODELING, INFERENCE AND PREDICTION
There is quite a lot of ongoing research on generalizing classical regression models to the functional setting. Examples include linear regression with functional predictors (Ratcliffe, Leader and Heller, 2002) or a functional response (Faraway, 1997), logistic regression (Ratcliffe, Heller and Leader, 2002), functional linear discriminant analysis (James and Hastie, 2001) and general linear models with functional predictors (James, 2002).
4.1 Information in eCommerce Processes
Current empirical research in eCommerce relies on the use of very standard statistical tools such as least-squares regression. These tools are used to investigate how, say, the closing price in an online auction relates to other auction-specific information. To that end, one sets the closing price as the response variable, and regresses it on potential explanatory variables such as the opening bid, the auction duration, the seller rating, etc. (see, e.g., Bajari and Hortaçsu, 2003, 2004; Lucking-Reiley et al., 2000). While this approach is certainly useful for understanding some of the variation in closing prices, it also leads to loss of a large amount of potentially useful information about everything that happened between the start and end of the auction. More generally, current research uses a response variable that is an aggregation of the process of interest: the maximum bid in online auctions, the average product rating, etc. This (direct or indirect) choice is guided by economic importance but is also likely done so that standard models can be applied. Furthermore, the choice of independent variables is limited to static “snapshot” information. The existence of more detailed data, however, can potentially shed more light on the entire process rather than only its aggregated form.
In the online auction example, variables like the opening bid, the auction duration and the seller’s rating are determined before the auction start and thus do not capture any of the information that arrives after that. However, it is well-known that events that occur during the auction can also affect the final price. For instance, the number of competing bidders, the bidders’ experience and the bid timings can influence the final price. These three variables are available only after the auction starts and in fact the information they carry changes as the auction progresses. While it is possible to include time-varying explanatory variables like the number of bidders into a regression model, such a model would no longer be considered “standard” in the classical least-squares sense since it would have to account for time-dependence between the explanatory variable and the response, and also within the explanatory variable itself. Furthermore, there is additional information revealed during ongoing processes that cannot be captured easily by such models. An example is concurrency
  • 8. and the effect that new events have on future events. In the online auction context, incoming bids can influence bidders in different ways: A new bid placed in an auction can result in an immediate response by other bidders or can be completely ignored. Bidders also learn from each other: they adopt bidding strategies of other bidders and they learn about an item’s value from bids that were placed. Many items sold in online auctions do not have a commonly known value (such as collectibles, antiques, rare pieces of art, etc.), and therefore bidders often try to infer the item’s value from other people’s bids. In short, while the final price is certainly affected by directly observable phenomena (such as the number of competing bidders), it is also dependent on indirect actions, reactions and interactions among bidders.
4.2 Process Dynamics and FDA
Modeling the effects of user interactions with classical regression models is challenging, to say the least. An alternative approach is to capture some of this dynamic information via evolution curves and their dynamics. In the auction context this would be the price evolution, which is the progression of bids throughout an auction. The evolution curve and its dynamics can reflect these bidder interactions: High competition in an auction will manifest itself as a steep price curve with increasing dynamics. Price will also increase, albeit at a slower rate, if bidders merely use the new bid to update their own valuation about the product. The price increase will slow down if bidders drop out of the auction due to a newly placed bid or for some other reason. Therefore, the price-evolution curve, and in particular its dynamics, has the ability to capture much of the auction information that would otherwise not be integrated into the model.
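To make the notion of an evolution curve and its dynamics concrete, here the first and second derivatives of the price curve, the following minimal sketch may help. It assumes the price curve has already been smoothed and evaluated on an evenly spaced grid, and it swaps in central finite differences for the analytic derivatives of a fitted spline, so it is an approximation rather than the FDA procedure itself. The function name `derivatives_on_grid` and the toy curve are hypothetical.

```python
# Illustrative sketch only: finite-difference "dynamics" of a smoothed curve.
# Assumes an evenly spaced grid; names and toy data are hypothetical.

def derivatives_on_grid(values, step):
    """Return (velocity, acceleration) at the interior grid points."""
    interior = range(1, len(values) - 1)
    # Central difference for the first derivative (rate of price change).
    velocity = [(values[j + 1] - values[j - 1]) / (2 * step) for j in interior]
    # Second central difference for the second derivative (change of that rate).
    acceleration = [
        (values[j + 1] - 2 * values[j] + values[j - 1]) / step ** 2
        for j in interior
    ]
    return velocity, acceleration


# Toy smoothed price curve over a 7-day auction, evaluated once per day:
# nearly flat early on and steep near the close.
days = list(range(8))
price = [10 + 0.5 * d ** 2 for d in days]
velocity, acceleration = derivatives_on_grid(price, step=1.0)
print(velocity)
print(acceleration)
```

With a spline representation one would normally differentiate the fitted basis expansion directly; the grid-based version here is only meant to show what the two derivative curves measure.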
By price dynamics we mean, for example, the price velocity and acceleration which measure the change in price and the rate at which this change is occurring. The ability to measure dynamics is one of the most noteworthy features of functional data analysis. FDA recovers the price evolution via a smooth curve through the auction’s bid history and yields the price dynamics via the derivatives of this curve. Examples of exploring process dynamics via FDA in eCommerce are the price dynamics in eBay auctions (Jank and Shmueli, 2005) and bid dynamics in auctions for modern Indian art by Reddy and Dass (2006). In these two examples the price curves themselves are not very illuminating, but their dynamics reveal interesting patterns and sources of heterogeneity across records. One can model the relationship between the process evolution (or its dynamics) and other predictors via functional regression analysis. For example, in a few studies of price formation in eBay auctions (Shmueli and Jank, 2006; Bapna, Jank and Shmueli, 2004; Alford and Urimi, 2004) and other online auctions (Reddy and Dass, 2006) a functional regression model was fit to price-evolution curves from eBay auctions (the response) with static predictors (such as the seller rating) and functional predictors (such as the cumulative number of bids). One interesting finding is that the impact of the opening bid on the current price starts high, and slowly decreases as the auction progresses. This reflects the shift in information about the item’s value due to bidding: At first there is not much information available and so the opening bid gives a sense of the item’s value. But as the auction progresses new bids add more information about the value of the item, thereby reducing the usefulness of the information contained in the opening bid. One of the challenges in functional regression analysis is the interpretation of the results.
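As a rough illustration of what a regression with a functional response produces, the sketch below fits an ordinary simple regression separately at every grid point, so that the slope becomes a function of time. This pointwise scheme is an assumption made for illustration and is not claimed to be the estimation method of the studies cited above; it also handles only one scalar predictor. The function name `pointwise_regression` and the toy data, in which the effect of the opening bid shrinks over the auction, are hypothetical.

```python
# Illustrative sketch only: pointwise least squares with a functional response.
# One scalar predictor per record; names and toy data are hypothetical.

def pointwise_regression(x, curves):
    """Fit y_i(t_j) = a(t_j) + b(t_j) * x_i separately at each grid point.

    x: one scalar predictor value per record.
    curves: one response curve per record, all on the same grid.
    Returns (intercept_curve, slope_curve).
    """
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    intercepts, slopes = [], []
    for column in zip(*curves):  # all records at one time point
        y_mean = sum(column) / n
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, column))
        slope = sxy / sxx
        slopes.append(slope)
        intercepts.append(y_mean - slope * x_mean)
    return intercepts, slopes


# Toy data: three auctions, price observed at the start, middle and end.
opening_bids = [1.0, 5.0, 9.0]
grid = [0.0, 0.5, 1.0]
price_curves = [[20 * t + (1 - 0.5 * t) * x for t in grid] for x in opening_bids]
intercepts, slopes = pointwise_regression(opening_bids, price_curves)
print(slopes)
```

The list `slopes` plays the role of an estimated coefficient curve evaluated on the grid: one value per time point rather than a single number.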
Instead of scalar coefficient estimates, we obtain estimated coefficient curves. Plotting these curves means that the x-axis is time, and not the ordinary predictor value. For example, Figure 4 shows the estimated coefficient (and a 95% confidence band) for a regression model with a functional response. In the top panel the response is the price evolution. The coefficient is positive throughout the auction, signifying that the current price is positively associated with the opening bid throughout the auction. However, this relationship decreases in magnitude as the auction proceeds. This is reasonable, because bidders gain more and more information as the auction proceeds and therefore derive less utility from the value of the opening price. The middle and bottom panels in Figure 4 describe another useful information source: the relationship between the opening price and the price dynamics. If we are interested in relationships between various independent variables and the process dynamics, we can use the derivative curves as the functional response. In this example we set the price velocity (middle) and price acceleration (bottom) as the responses. We see that the price acceleration is positively associated with the opening bid at the auction start, but then this relationship loses
  • 9. momentum and even becomes negative as the auction comes to a close.
Fig. 4. Estimated coefficient for opening price, in three regression models with a functional response: price evolution (top), price velocity (middle) and price acceleration (bottom).
The capability of studying process dynamics could have a huge impact on eCommerce research. Though the concepts of dynamics are well grounded in physics and engineering, their exact economic impact requires more thought. However, there is an opportunity to create new economic measures with the help of FDA. For instance, we can develop concepts such as “auction energy” using the definition of kinetic energy ($\text{energy} = \text{mass} \times \text{velocity}^2/2$) to arrive at
\[ \text{auction energy} = (\text{current price}) \times (\text{price velocity})^2/2. \tag{1} \]
  • 10. A major challenge is to find a theoretical foundation in economics of such concepts. This is only one example where collaboration could have a large impact on the field.
4.3 Other Functional Models for eCommerce
Another level of flexibility, but also complexity, is to incorporate interaction terms into functional regression models. Since interactions (in ordinary linear regression models) are widely used in eCommerce studies, it could be useful to measure similar effects in functional objects. The literature on interactions in functional linear regression models appears to be scant, although this seems like an important extension. Another important direction for modeling the dynamic nature of web content in general, and eCommerce in particular, is the use of differential equations. The use of differential equations in the functional literature is still in its infancy. Ramsay (2000a) gives an introduction to the use of differential equations in statistics and several examples of functional estimation problems such as simultaneous estimation of a regression model and residual density, monotone smoothing, specification of a link function, differential equation models of data, and smoothing over complicated multidimensional domains. Ramsay calls this “principal differential analysis” (PDA) because of the similarities that it shares with principal components analysis. PDA is a natural formalization of the exploration of curve dynamics. Through a differential equation we can relate, for instance, the price during an auction to its rate of increase and acceleration. It is possible that such relations exist in the dynamic, ever-changing eCommerce world. These relationships need to be more formally integrated with economic theory to create a solid foundation for the empirical findings that have been observed.
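The idea of relating a curve to its own derivatives through a differential equation can be sketched as follows. This is a deliberately simplified, pointwise least-squares version under stated assumptions, not Ramsay's PDA estimator: it assumes a second-order homogeneous linear equation of the form x''(t) = w0(t) x(t) + w1(t) x'(t), it assumes the curves and their first two derivatives are already available on a shared grid, and it estimates the weights separately at each time point across the replicated curves. The function name `pda_weights`, the weight names `w0` and `w1`, and the toy sinusoidal curves are hypothetical.

```python
# Illustrative sketch only: pointwise estimate of the weight functions in
# x''(t) = w0(t) * x(t) + w1(t) * x'(t). Names and toy data are hypothetical.
import math


def pda_weights(x, v, a):
    """x, v, a: lists of curves (value, velocity, acceleration) on one grid."""
    w0, w1 = [], []
    for xs, vs, acc in zip(zip(*x), zip(*v), zip(*a)):
        # Entries of the 2x2 normal equations at this time point.
        sxx = sum(p * p for p in xs)
        sxv = sum(p * q for p, q in zip(xs, vs))
        svv = sum(q * q for q in vs)
        sxa = sum(p * r for p, r in zip(xs, acc))
        sva = sum(q * r for q, r in zip(vs, acc))
        det = sxx * svv - sxv ** 2  # assumed nonzero (curves not collinear)
        w0.append((sxa * svv - sva * sxv) / det)
        w1.append((sva * sxx - sxa * sxv) / det)
    return w0, w1


# Toy replicates that all satisfy x'' = -omega^2 * x exactly.
omega = 2.0
grid = [0.25 * j for j in range(5)]
coefs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
x = [[c * math.cos(omega * t) + s * math.sin(omega * t) for t in grid]
     for c, s in coefs]
v = [[omega * (-c * math.sin(omega * t) + s * math.cos(omega * t)) for t in grid]
     for c, s in coefs]
a = [[-omega ** 2 * value for value in curve] for curve in x]
w0, w1 = pda_weights(x, v, a)
print(w0, w1)
```

Weight curves that are clearly nonzero for one group of auctions and close to zero for another would be one way to read a finding that such relationships hold in some types of auctions but not in others.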
For example, PDA was used to study the relationship between price curves in online auctions and their derivatives by Jank and Shmueli (2005) and Wang (2005). The main finding is that relationships between the price curve and its acceleration are present in some types of auctions, but not in others, suggesting that dynamics can vary broadly in eCommerce processes.
5. FUTURE TRENDS IN FUNCTIONAL MODELING
In the previous sections we have shown multiple facets of FDA that make it a natural approach in eCommerce empirical research. Unlike currently used static models, FDA can capture processes and dynamic information which are inherent in the eCommerce environment. In the following we describe a few important areas that are still undeveloped both in the eCommerce research world and in the FDA domain, and in our opinion have the potential to make a contribution to both. The first area is related to concurrency of events. In almost every eCommerce study, the events of interest occur concurrently or have at least some overlap. This means that there is a dependence structure between records, with some events influencing others. The most obvious example is the stock market with stock prices influencing each other. Some eCommerce examples are concurrent auctions on eBay for the same item or even for competing items, and prices of a certain book at different online vendors (and perhaps even brick-and-mortar stores) over time. Although researchers acknowledge such relationships, nearly all studies make the simplifying assumption of independent observations. Ignoring the effects of concurrency can lead to invalid results. A first step is therefore to find ways to evaluate the degree of concurrency and its effect on the measure of interest. Shmueli and Jank (2005) introduce and evaluate several data displays for exploring the effect of concurrency in online auctions on final price.
Hyde, Jank and Shmueli (2006) expands this work to vi- sualize concurrency of the functional objects, using curves to represent price evolution and its dynamics. In addition to data displays, there is a need for defin- ing measures of concurrency, and finally, for devel- oping models that can incorporate and account for relationships between processes that are represented as functional objects (see, e.g., Jank and Shmueli, 2006). Another important enhancement to FDA that would greatly benefit eCommerce research is the incorpo- ration of change into the functional objects over time. As pointed out earlier, eCommerce experiences constant change. Frequent technological advancement, new website formats, changes in the global economy, etc., can have a large influence on what we observe in the eCommerce world. If we use functional ob- jects to represent observations which are themselves
longitudinal, we need ways to incorporate an additional temporal dimension that compares functional objects over time.

FDA research focuses more and more on p-dimensional functional objects (e.g., Yushkevich et al., 2001). In many eCommerce applications such representations could be very useful. One example is competition in online auctions, where each auction is represented by its price curve coupled with the cumulative number of bidders, thus yielding a bivariate functional representation (Wang and Wu, 2004). There is also related work on symbolic data analysis (SDA) (see Bock and Diday, 2000), which provides tools for managing complex, aggregated, relational and higher-level data described by multivalued variables. This could be a new successful wedding with FDA methods.

Finally, in many cases, and especially in economics, the objects of interest are individual users. Economists are typically interested in how individuals strategize and react to others. The problem with formulating functional objects that represent individuals is sparsity of data. That is, individuals typically do not leave many traces during one eCommerce transaction. For example, in eBay auctions if we treat an individual bidder as our object of interest, then bidders will leave very sparse data (1–2 bids per bidder is the norm). In a similar setting, James, Hastie and Sugar (2000) and James and Sugar (2003) use a semiparametric setting where information from data aggregated across individuals is used to supplement the information at the individual level. An alternative model would pool information from previous records that the individual was involved in and use it to supplement the current record. Approaches that enable the functional representation of sparse data can prove very useful in tying economic theories to empirical results. This would further strengthen the eCommerce research area.

6. CONCLUSIONS

The emerging field of empirical eCommerce research is growing fast with many data-related challenges. In light of the special data structures and the types of research questions of interest, we believe that functional data analysis can play a major role in this field. On the one hand, this requires more involvement by statisticians to further explore statistical issues involved and to develop functional methods and models that are called for in these applications. On the other hand, collaborative work has proven to be extremely fruitful for the multiple disciplines involved. In that respect, more outreach should be done to make these tools more popular.

Wider adoption of functional tools by nonstatisticians requires software accessibility. Currently FDA packages exist for Matlab, S-PLUS and R (ego.psych.mcgill.ca/misc/fda/software.html). Specialized programs for particular applications are anticipated to grow, and we encourage researchers to make such code and data freely available. In addition, making sample datasets freely available will make this field more accessible and attractive to statisticians. Our website (www.smith.umd.edu/ceme/statistics/) contains some eBay data and auction-specific FDA code.

Another important front in further developing this exciting new interdisciplinary field is the involvement and training of graduate students. This includes educating statistics students about both the eCommerce domain and FDA. From our own experience through Interactive Research Teams, we found this to be a very exciting ground for advancing statistical research.

ACKNOWLEDGMENTS

We thank Professor Steve Marron from UNC for fruitful conversations, Professor Jim Ramsay from McGill University for continuous advice and support with FDA software, and the three referees for their constructive comments.

REFERENCES

Abraham, C., Cornillon, P. A., Matzner-Løber, E. and Molinari, N. (2003). Unsupervised curve-clustering using B-splines.
Scand. J. Statist. 30 581–595. MR2002229
Alford, B. and Urimi, L. (2004). An analysis of various spline smoothing techniques for online auctions. Term paper, Research Interaction Team, VIGRE program, Univ. Maryland.
Aris, A., Shneiderman, B., Plaisant, C., Shmueli, G. and Jank, W. (2005). Representing unevenly-spaced time series data for visualization and interactive exploration. Human–Computer Interaction - INTERACT 2005: IFIP TC13 International Conference. Lecture Notes in Comput. Sci. 3585 835–846. Springer, Berlin.
Bajari, P. and Hortaçsu, A. (2003). The winner’s curse, reserve prices and endogenous entry: Empirical insights from eBay auctions. RAND J. Economics 34 329–355.
Bajari, P. and Hortaçsu, A. (2004). Economic insights from Internet auctions. J. Economic Literature 42 457–486.
Bapna, R., Jank, W. and Shmueli, G. (2004). Price formation and its dynamics in online auctions. Working paper RHS-06-003, Smith School of Business, Univ. Maryland. Available at ssrn.com/abstract=902887.
Bock, H. H. and Diday, E., eds. (2000). Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Heidelberg. MR1792132
Cuevas, A., Febrero, M. and Fraiman, R. (2002). Linear functional regression: The case of fixed design and functional response. Canad. J. Statist. 30 285–300. MR1926066
Dellarocas, C. and Narayan, R. (2006). A statistical measure of a population’s propensity to engage in post-purchase online word-of-mouth. Statist. Sci. 21 277–285.
Fan, J. and Lin, S.-K. (1998). Test of significance when data are curves. J. Amer. Statist. Assoc. 93 1007–1021. MR1649196
Faraway, J. J. (1997). Regression analysis for a functional response. Technometrics 39 254–261. MR1462586
Guo, W. (2002). Inference in smoothing spline analysis of variance. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 887–898. MR1979393
Hall, P., Poskitt, D. S. and Presnell, B. (2001). A functional data-analytic approach to signal discrimination. Technometrics 43 1–9. MR1847775
Hyde, V., Jank, W. and Shmueli, G. (2006). Investigating concurrency in online auctions through visualization. Amer. Statist. To appear.
Hyde, V., Moore, E. and Hodge, A. (2004). Functional PCA for exploring bidding activity times for online auctions. Term paper, Research Interaction Team, VIGRE program, Univ. Maryland.
James, G. M. (2002). Generalized linear models with functional predictors. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 411–432. MR1924298
James, G. M. and Hastie, T. J. (2001). Functional linear discriminant analysis for irregularly sampled curves. J. R. Stat. Soc. Ser. B Stat. Methodol. 63 533–550. MR1858401
James, G. M., Hastie, T. J. and Sugar, C. A. (2000). Principal component models for sparse functional data. Biometrika 87 587–602. MR1789811
James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data. J. Amer. Statist. Assoc. 98 397–408. MR1995716
Jank, W. and Shmueli, G. (2005). Profiling price dynamics in online auctions using curve clustering. Working paper RHS-06-004, Smith School of Business, Univ. Maryland. Available at ssrn.com/abstract=902893.
Jank, W. and Shmueli, G. (2006). Modeling concurrency of events in online auctions via spatio-temporal semiparametric models. Working paper, Smith School of Business, Univ. Maryland.
Lucking-Reiley, D., Bryan, D., Prasad, N. and Reeves, D. (2000). Pennies from eBay: The determinants of price in online auctions. Technical report, Dept. Economics, Univ. Arizona.
Ogden, R. T., Miller, C. E., Takezawa, K. and Ninomiya, S. (2002). Functional regression in crop lodging assessment with digital images. J. Agric. Biol. Environ. Stat. 7 389–402.
Pfeiffer, R. M., Bura, E., Smith, A. and Rutter, J. L. (2002). Two approaches to mutation detection based on functional data. Stat. Med. 21 3447–3464.
Ramsay, J. O. (1998). Estimating smooth monotone functions. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 365–375. MR1616049
Ramsay, J. O. (2000a). Differential equation models for statistical functions. Canad. J. Statist. 28 225–240. MR1777224
Ramsay, J. O. (2000b). Functional components of variation in handwriting. J. Amer. Statist. Assoc. 95 9–15.
Ramsay, J. O. and Ramsey, J. B. (2002). Functional data analysis of the dynamics of the monthly index of nondurable goods production. J. Econometrics 107 327–344. MR1889966
Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis. Springer, New York.
Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies. Springer, New York. MR1910407
Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, 2nd ed. Springer, New York. MR2168993
Ratcliffe, S. J., Heller, G. Z. and Leader, L. R. (2002). Functional data analysis with application to periodically stimulated foetal heart rate data. II: Functional logistic regression. Stat. Med. 21 1115–1127.
Ratcliffe, S. J., Leader, L. R. and Heller, G. Z. (2002). Functional data analysis with application to periodically stimulated foetal heart rate data. I: Functional regression. Stat. Med. 21 1103–1114.
Reddy, S. K. and Dass, M. (2006). Modeling on-line art auction dynamics using functional data analysis. Statist. Sci. 21 179–193.
Rossi, N., Wang, X. and Ramsay, J. O. (2002). Nonparametric item response function estimates with the EM algorithm. J. Educational and Behavioral Statistics 27 291–317.
Shmueli, G. and Jank, W. (2005). Visualizing online auctions. J. Comput. Graph. Statist. 14 299–319. MR2160815
Shmueli, G. and Jank, W. (2006). Modeling the dynamics of online auctions: A modern statistical approach. In Economics, Information Systems and E-Commerce Research II: Advanced Empirical Methods 1 (R. Kauffman and P. Tallon, eds.). Sharpe, Armonk, NY. To appear.
Shmueli, G., Jank, W., Aris, A., Plaisant, C. and Shneiderman, B. (2006). Exploring auction databases through interactive visualization. Decision Support Systems. To appear.
Stewart, K., Darcy, D. and Daniel, S. (2006). Opportunities and challenges applying functional data analysis to the study of open source software evolution. Statist. Sci. 21 167–178.
Tarpey, T. and Kinateder, K. K. J. (2003). Clustering functional data. J. Classification 20 93–114. MR1983123
Wang, S. (2005). Principal differential analysis of online auctions. Term paper, Research Interaction Team, VIGRE program, Univ. Maryland.
Wang, S., Jank, W. and Shmueli, G. (2006). Forecasting eBay’s online auction prices using functional data analysis. J. Bus. Econom. Statist. To appear.
Wang, S. and Wu, O. (2004). Bivariate functional modelling of the bid amounts and number of bids in online auctions. Term paper, Research Interaction Team, VIGRE program, Univ. Maryland.
Wu, O. (2005). Dynamics of online movie ratings. Term paper, Research Interaction Team, VIGRE program, Univ. Maryland.
Yushkevich, P., Pizer, S., Joshi, S. and Marron, J. S. (2001). Intuitive, localized analysis of shape variability. Information Processing in Medical Imaging. Lecture Notes in Comput. Sci. 2082 402–408. Springer, Berlin.