Abstract
This paper discusses adaptation policies for information systems that are subject to dynamic and stochastic contexts, such as mobile access to multimedia web sites. In our approach, adaptation agents apply sequential decisional policies under uncertainty. We focus on the modeling of such decisional processes depending on whether the context is fully or partially observable. Our case study is a movie browsing service in a mobile environment that we model using Markov decision processes (MDPs) and partially observable MDPs (POMDPs). We derive adaptation policies for this service that take into account limited resources such as the network bandwidth. We further refine these policies according to the partially observable users' interest level estimated from implicit feedback. Our theoretical models are validated through numerous simulations.
1. Introduction
Access alternatives to computer services continue to multiply, facilitating our interactions with family, friends, or the workplace. These access alternatives encompass such a wide range of mobile and distributed devices that our technological environment becomes truly pervasive. The execution contexts in which these devices operate are naturally heterogeneous. The resources offered by wireless networks vary with the number and the position of connected users. The available memory and processing power also fluctuate dynamically. Last but not least, the needs and expectations of users can change at any instant. As a consequence, numerous research projects aim to provide modern information systems with adaptation capabilities that cope with context variability.
In order to handle highly dynamic contexts, the approach that we propose in this paper is based on an adaptation agent. The agent perceives the successive states of the context through observations and carries out adaptation actions. The adaptation approaches proposed in the literature often suppose that the contextual data is easy to perceive, or at least that there is no ambiguity in identifying the current context state. One calls this an observable context. In this work, we relax this hypothesis and therefore deal with partially observable contexts.
Our case study is an information system for browsing multimedia descriptions of movies on mobile devices. The key idea is to show how a given adaptation strategy can be refined according to the estimation of user interest. User interest is clearly not directly observable by the system.
We build upon research on “implicit feedback” in order to allow the adaptation agent to estimate the user interest level while interacting with the context [1, 2]. The first section of this paper reviews important elements of the state of the art and details our adaptation approach. Next, we introduce the two formalisms used by our model: the Markov decision processes (MDPs) and the partially observable MDP (POMDP). The following section presents our case study and establishes the operational principles of this information system. Thanks to an MDP, we formalize an adaptation policy for our information system seen as an observable context. Then we show how to refine this policy according to user interest using a POMDP (refined itself from an MDP). Various experiments validate this approach and give a practical view of the behavior of an adaptation agent. We conclude this paper with some perspectives on this work.
2. Related Work
This section reviews the current literature on adaptation to dynamic execution contexts and positions our adaptation approach within it. Adaptive systems provide adaptation capabilities and can be categorized according to what they adapt to: available resources, user preferences, or, more generally, the context.
2.1. Resource-Based Adaptation
Given the heterogeneous nature of modern networks and mobile devices, there is an obvious need for adaptation to limited resources. Networks' QoS parameters vary in terms of available bandwidth, loss rate, or latency. The capabilities of the terminal are also very heterogeneous in terms of memory size, processing power, and display area.
To manage these limitations, one can adapt the content to be displayed or the access/distribution modalities. When considering content adaptation, several authors propose classifications [3] in which either the elementary components of the content (e.g., a single medium) or the entire document structure is transformed. A medium can thus be transcoded [4], converted into another modality [5], or summarized [6]. The distribution or the access can also be adapted, for example, by optimizing the streaming [7] or by modifying the degree of interactivity of the service.
2.2. User-Aware Adaptation
In addition to adaptation to the available resources, one should also consider an application's adaptation to human factors, which are a matter of user preferences and satisfaction. In the following, we describe three main research directions identified in the literature.
The first research direction consists of selecting the adaptation mechanisms that maximize the quality of service perceived by the user. A typical scenario is the choice of the transcoding strategy of a stream (e.g., a video stream) in order to maximize the perceptual quality given a limited bandwidth [8]. What is the best parameter to adapt: the size of the video, its chromatic resolution, or the frame rate? Models have been proposed [9, 10] to assess quality variation from both technical and user perspectives. They are organized on three distinct levels: network, media, and content. For this line of research, the key question is how variations in objective multimedia quality impact user perception.
A second active direction is related to user modeling. Here, the idea is to customize an application by modeling user profiles in order to recognize them later. For example, adaptive hypermedia contents or services [11] provide a user with navigation support for “easier/better learning using an on-line educational service” or support for “more efficient selling on an e-commerce site” according to the user profile. Very often, these systems use data mining techniques to analyze access patterns and discover interesting relations in usage data [12]. Such knowledge may be useful to recognize profiles and select the most appropriate modifications to improve content effectiveness.
The third research direction finds its motivation in the first two. In order to learn a user model or to evaluate the perceptual impact of a content adaptation solution, it is necessary either to explicitly ask users for evaluations or to obtain implicit feedback information. Research aiming to evaluate “implicit feedback” (IF) is experiencing growing interest, since it avoids gathering large collections of explicit judgments (which is intrusive and expensive) [1]. These IF methods are used in particular to decode user reactions in information search systems [2]. The idea is to measure the user interest for a list of query results, in order to adapt the search function. Among the studied implicit feedback signals one can consider the total browsing time, the number of clicks, the scrolling interactions, and some characteristic sequences of interactions. In our work, we estimate user interest using IF by interpreting interaction sequences [2, 13]. Moreover, from a metadata perspective, IF can provide implicit descriptors such as a user interest descriptor, as shown in [14].
2.3. Mixing Resources and User-Aware Adaptation
More general adaptation mechanisms can be obtained by combining resource-based with user-based adaptation. The characteristics of users and resources are mixed to design an adaptation strategy for a given context. For example, streaming of a heavy media content can be adapted by prefetching while considering both user characteristics and resource constraints [15].
For mobile and pervasive systems, the link between resources and users starts by taking into account the geolocalization of the user, that can be traced in time and even predicted [16].
In the MPEG-21 digital item adaptation (DIA) standard, the context descriptors group the network’s and the terminal’s capabilities together with the user’s preferences and the authors’ recommendations to adapt multimedia productions. Given this complexity, the normative work only proposes tools for describing the running context as a set of carefully chosen and extensible descriptors [17]. This is a metadata-based approach that leaves the design of adaptation components free while allowing a high level of interoperability [18].
Naturally, the elements of the context vary in time. Therefore, one speaks of a dynamic context and, by extension, of a dynamic adaptation. It is important to note that static adaptation to static context elements is possible as well: one can negotiate, once and for all and always in the same manner, the preferred language of a user at the moment of access to a multilingual service. Conversely, the adaptation algorithm itself and/or its parameters can be dynamically changed according to the context state [19]. Our adaptation approach is in line with the latter case.
An important element of research in context adaptation is also the distinction between the adaptation decision and its effective implementation [18]. In a pervasive system, one can decide that a document must be transcoded into another format, but some questions still need to be answered. Is a transcoding component available? Where can it be found? Should one compose the transcoding service? In order to find solutions to these questions, many authors propose to use artificial learning techniques to select the right decision and/or the appropriate implementation of adaptation mechanisms (see [20] for a review). In this case, a description of the running context is given as input to a decision-making agent that predicts the best adaptation actions according to what it has previously learned. We extend this idea in line with a reinforcement learning principle.
We model the context dynamics by a Markov decision process whose states are completely or partially observable. This approach provides the means to find the optimal decision (adaptation action) according to the current context. The next section introduces our MDP-based adaptation approach.
3. Markov Decision Processes: Our Formal Approach
Figure 1 summarizes our adaptation approach that has been introduced in [21] and is further refined in this article. In this paper, an adaptation strategy for dynamic contexts is applied by an adaptation agent. This agent perceives sequentially, over a discrete temporal axis, the variations of the context through observations.
From its observations, the agent will compute the context state in order to apply an adaptation policy. Such a policy is simply a function that maps context states to adaptation decisions. The agent therefore acts on the context when deciding an adaptation action: it consumes bandwidth, influences the user's future interactions, and increases or reduces the user's interest. It is therefore useful to measure its effect by associating a reward (immediate or delayed) with the adaptation action decided in a given context state. The agent can thus learn from its interaction with the context and perform “trial-and-error” learning, called reinforcement learning [22]. It attempts to reinforce the actions resulting in a good accumulation of rewards and, conversely, avoids renewing fruitless decisions. This process represents a continuous improvement of its “decision policy.”
This dynamic adaptation approach is common to frameworks of sequential decisional policies under uncertainty. In these frameworks, the uncertainty comes from two sources. On the one hand, the dynamic of the context can be random as a consequence of available resources' variability (e.g., the bandwidth); on the other hand, the effect of an agent's decision can be itself random. For example, if an adaptation action aims to anticipate user interactions, the prediction quality is obviously uncertain and subject to the user's behavior variations.
In this situation, by adopting a Markov definition of the context state, the agent's dynamics can be modeled as a Markov decision process (MDP). This section introduces this formalism.
We initially assume that the context state variables are observable by the agent, which is sufficient to identify the decision state without ambiguity. This paper takes a step forward by refining adaptation policies according to user interest. We estimate this hidden information sequentially from user behavior, as suggested by research on the evaluation of “implicit feedback.” Therefore, the new decision-making state contains observable variables as well as a hidden element associated with user interest.
We then move on from an MDP to a partially observable Markov decision process (POMDP). To the best of our knowledge, the application of the POMDP to the adaptation problem in partially observable contexts has not been studied before. To give concrete expression to this original idea, a case study will be presented in Section 4.
3.1. MDP Definition
An MDP is a stochastic controlled process that assigns rewards to transitions between states [23]. It is defined as a quintuple $(S, A, T, p, r)$, where $S$ is the state space, $A$ is the action space, $T$ is the discrete temporal axis of instants when actions are taken, $p$ denotes the probability distributions of the transitions between states, and $r$ is a reward function on the transitions. We rediscover in a formal way the ingredients necessary to understand Figure 1: at each instant $t \in T$, the agent observes its state $s_t \in S$, applies the action $a_t \in A$ that brings the system (randomly, according to $p$) to a new state $s_{t+1}$, and receives a reward $r_t$.
As previously mentioned, we are looking for the best policy with respect to the accumulated rewards. A policy $\pi$ is a function that associates an action $\pi(s)$ with each state $s$. Our aim is to find the best one, $\pi^*$.
The MDP theoretical framework assigns a value function $V^\pi$ to each policy $\pi$. This value function associates each state $s$ with a global reward $V^\pi(s)$, obtained by applying $\pi$ beginning with $s$. Such a value function allows policies to be compared: a policy $\pi$ outperforms another policy $\pi'$ if
$$V^\pi(s) \geq V^{\pi'}(s), \quad \forall s \in S.$$
The expected sum of rewards obtained by applying $\pi$ starting from $s$ is weighted by a parameter $\gamma \in [0, 1)$ in order to limit the influence of infinitely distant rewards:
$$V^\pi(s) = E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s, \pi\right].$$
In brief, for each state, this value function gives the expected sum of future rewards that can be obtained if the policy is applied from this state on. It allows us to formalize the search for the optimal policy $\pi^*$, which is the one associated with the best value function $V^* = V^{\pi^*}$.
Bellman's optimality equations characterize the optimal value function and an optimal policy that can be obtained from it. In the case of the $\gamma$-weighted criterion and stationary rewards, they can be written as follows:
$$V^*(s) = \max_{a \in A}\Big[r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^*(s')\Big], \qquad
\pi^*(s) = \arg\max_{a \in A}\Big[r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^*(s')\Big].$$
3.2. Resolution and Reinforcement Learning
When solving an MDP, we can distinguish between two cases, according to whether the model is known or unknown. When the model (the probabilities $p$) and the rewards $r$ are known, a dynamic programming solution can be found.
The operator $L$ defined by $(LV)(s) = \max_{a \in A}\big[r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s')\big]$ is a contraction. The Bellman equation in $V^*$ can be solved by a fixed-point iterative method: choose $V_0$ randomly, then repeatedly apply the operator $L$ ($V_{n+1} = L V_n$), which improves the policy associated with the current $V_n$. If the rewards are bounded, the sequence $(V_n)$ converges to $V^*$ and allows $\pi^*$ to be computed.
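To make the fixed-point iteration concrete, here is a minimal value-iteration sketch on a generic finite MDP; the tiny transition and reward tables are illustrative placeholders, not the movie-browsing model of Section 4.

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, eps=1e-6):
    """Fixed-point iteration on the Bellman operator L.

    p[s, a, s2]: transition probabilities; r[s, a]: immediate rewards.
    Returns the optimal value function V* and a greedy optimal policy.
    """
    v = np.zeros(p.shape[0])                 # arbitrary V_0
    while True:
        # (L V)(s) = max_a [ r(s, a) + gamma * sum_s2 p(s2 | s, a) V(s2) ]
        q = r + gamma * (p @ v)              # shape: (n_states, n_actions)
        v_next = q.max(axis=1)
        if np.max(np.abs(v_next - v)) < eps: # contraction => convergence
            return v_next, q.argmax(axis=1)
        v = v_next

# Tiny illustrative 2-state / 2-action MDP (placeholder values).
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
v_star, pi_star = value_iteration(p, r)
print(v_star, pi_star)
```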
If the model is unknown, we can solve the MDP using a reinforcement learning algorithm [22]. The reinforcement learning approach aims to find an optimal policy through iterative estimations of the optimal value function. The Q-learning algorithm is a reinforcement learning method that is able to solve the Bellman equations for the $\gamma$-weighted criterion. It uses simulations to iteratively estimate the value function, based on observations of instantaneous transitions and their associated rewards. For this purpose, Puterman [23] introduced a function $Q^\pi$, which carries a meaning similar to that of $V^\pi$ but makes it easier to extract the associated policy because it no longer requires the transition probabilities. We can express the “Q-value” as a function of a given policy $\pi$ and its value function,
$$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^\pi(s').$$
Therefore, it is easy to see that, in spite of the lack of transition probabilities, we can trace back to the optimal policy,
$$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a).$$
The principle of the Q-learning algorithm (Algorithm 1) is that, after each observed transition, the current value $Q(s, a)$ of the couple $(s, a)$ is updated according to
$$Q(s, a) \leftarrow Q(s, a) + \alpha\big[r + \gamma \max_{a' \in A} Q(s', a') - Q(s, a)\big],$$
where $s$ represents the current state, $a$ the chosen and applied action, $s'$ the resulting state, and $r$ the immediate reward.
In this algorithm, the number of iterations is an initial parameter. The learning rate $\alpha$ is specific to each state-action pair and decreases toward 0 over the iterations. A simulation function returns a new state and its associated reward according to the dynamics of the system, dedicated functions choose the current state and the action to execute, and an initialization function sets the initial values of $Q$.
The convergence of this algorithm has been thoroughly studied and is now well established. We assume the following.
(i) $S$ and $A$ are finite, $\gamma \in [0, 1)$. (ii) Each pair $(s, a)$ is visited an infinite number of times. (iii) $\sum_{n} \alpha_n(s, a) = \infty$ and $\sum_{n} \alpha_n^2(s, a) < \infty$. Under these hypotheses, the function $Q_n$ converges almost surely to $Q^*$. Let us recall that almost-sure convergence means that, for all $(s, a)$, the sequence $Q_n(s, a)$ converges to $Q^*(s, a)$ with a probability equal to 1. In practice, the sequence $\alpha_n$ is often defined as
$$\alpha_n(s, a) = \frac{1}{1 + \mathrm{visits}_n(s, a)},$$
where $\mathrm{visits}_n(s, a)$ represents the number of times the state $s$ was visited and the decision $a$ was made.
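As a complement to Algorithm 1 (not reproduced here), the sketch below shows a tabular Q-learning loop with the decreasing learning rate discussed above; the `simulate` function and the state and action sets are hypothetical stand-ins for the paper's simulation and selection functions.

```python
import random
from collections import defaultdict

def q_learning(simulate, states, actions, n_iter=100_000, gamma=0.9):
    """Tabular Q-learning sketch (in the spirit of Algorithm 1).

    `simulate(s, a)` is a hypothetical stand-in for the simulation function:
    it must return (next_state, reward) according to the context dynamics.
    States and actions are assumed finite and hashable.
    """
    q = defaultdict(float)                   # Q initialized to 0
    visits = defaultdict(int)                # visit counts per (s, a)
    for _ in range(n_iter):
        s = random.choice(states)            # choose a current state
        a = random.choice(actions)           # choose an action to execute
        s_next, reward = simulate(s, a)      # observe one transition
        visits[(s, a)] += 1
        alpha = 1.0 / (1 + visits[(s, a)])   # decreasing learning rate
        best_next = max(q[(s_next, b)] for b in actions)
        # Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max_b Q(s', b) - Q(s, a) ]
        q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
    policy = {s: max(actions, key=lambda b: q[(s, b)]) for s in states}
    return q, policy
```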
3.3. Partial Observation and POMDP Definition
In many cases, the observations that a decision agent is able to capture (see Figure 1) are only partial and do not allow the identification of the context state without ambiguity. Therefore, a new class of problems needs to be solved: partially observable Markov decision processes. The states of the underlying MDP are hidden and only the observation process will help to rediscover the running state of the process.
A partially observable Markov decision process is defined by:
(i) the underlying MDP $(S, A, T, p, r)$; (ii) a set of observations $\Omega$; (iii) an observation function that maps every state to a probability distribution over the observation space. The probability of observing $o \in \Omega$ knowing the agent's state $s$ will be referred to as $O(o \mid s)$.
Non-Markovian Behavior
It is worth noting that, in this new model, we lose a property widely used for the resolution of MDPs, namely that the observation process is Markovian. The probability of the next observation may depend not only on the current observation and action taken, but also on previous observations and actions:
$$P(o_{t+1} \mid o_t, a_t) \neq P(o_{t+1} \mid o_t, a_t, o_{t-1}, a_{t-1}, \ldots, o_0, a_0).$$
Stochastic Policy
It has been proved that the convergence results obtained for $V_n$ and $Q_n$ with the MDP resolution algorithms are not applicable anymore. POMDPs require the use of stochastic policies and not deterministic ones, as for MDPs [24].
3.4. Resolution
The classic POMDP methods attempt to bring the resolution problem back to the underlying MDP. Two situations are possible. If the MDP model is known, one cannot determine the exact state of the system but only a probability distribution over the set of possible states (a belief state). In the second situation, without knowing the model parameters, the agent attempts to construct the MDP model relying only on the observation history.
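The belief-state bookkeeping mentioned above can be sketched as a standard Bayes update; this is a generic illustration under assumed dictionary-based representations of $p$ and $O$, not the implementation used in our experiments.

```python
def update_belief(belief, action, observation, trans, obs_fn):
    """One Bayes update of a POMDP belief state.

    belief: dict state -> probability of currently being in that state.
    trans[(s, a)]: dict next_state -> transition probability p(s' | s, a).
    obs_fn[s]: dict observation -> probability O(o | s).
    """
    new_belief = {}
    for s_next in obs_fn:
        # b'(s') is proportional to O(o | s') * sum_s p(s' | s, a) * b(s)
        predicted = sum(trans[(s, action)].get(s_next, 0.0) * belief[s]
                        for s in belief)
        new_belief[s_next] = obs_fn[s_next].get(observation, 0.0) * predicted
    total = sum(new_belief.values())
    if total == 0.0:
        raise ValueError("observation impossible under the current belief")
    return {s: prob / total for s, prob in new_belief.items()}
```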
Our experimental test bed uses the resolution software package provided by Cassandra et al. [25] that works in the potentially infinite space of belief states using linear programming methods.
4. Case Study: a Movie Presentation System for Mobile Terminals
We introduce here a system for browsing movie descriptions on mobile devices. For this system, our strategy aims to adapt the presentation of a multimedia content (i.e., movie description) and not to transform the media itself. This case study is intended to be both simple and pedagogical, while integrating a degree of realistic interactivity.
4.1. Interactive Access to a Movie Database
Figure 2 introduces an information system accessible from mobile terminals such as PDAs. A keyword search allows the user to obtain an ordered list of links to various movie descriptions. Within this list, the user can follow a link toward an interesting movie (the associated interaction will be referred to as clickMovie); then, he or she can consult details regarding the movie in question. This consultation calls on a full-screen interactive presentation and a navigation scenario detailed below. Having browsed the details for one movie, the user is able to come back to the list of query results (interaction back in Figure 2). It is then possible to access the description of a second interesting film. The index of the accessed movie description will be referred to as $n$.
To simplify the context modeling, we choose to consider the browsing sequence indexed by $n$. Our problem becomes one of adapting the content (movie descriptions) presented during this sequence. Our execution environment is dynamic because of the variability of the bandwidth ($BW$), a very frequent problem in mobile networks. For simplicity, we do not take into account other important parameters of mobile terminals such as signal strength, user mobility, and power constraints.
As we consider the browsing session at a high level, we do not need to provide special specifications for the final goal of the service that can be renting/buying a DVD, downloading a media, and so forth. Properly managing the download or the streaming of the whole media is a separate problem and is not considered here.
4.2. From the Simplest to the Richest Descriptions
To present the details of a movie, three forms of descriptions are possible (see Figure 3). The poor “textual” version (referred to as T) groups together a small poster image, a short text description, and links pointing to more production photos as well as to the video trailer. The intermediary version (I) provides a slideshow of still photos and a link to the trailer. The richest version (V) includes, in addition, the video trailer.
As the available bandwidth ($BW$) is variable, the usage of the three versions is not equivalent. The bandwidth required to download the content increases with the complexity of the version (from T to I to V). In other words, for a given bandwidth, the latencies perceived by the user during the download of the different versions grow proportionally with the size of the content.
More precisely, we now point out two problems generated by the absence of dynamic content adaptation when the available bandwidth varies. The adaptation strategy could systematically select only one of the three possible alternatives mentioned above. If it always selects the richest version (V), this impacts the behavior of a user who experiences bad network conditions (low bandwidth). Although strong latencies could be tolerated while browsing the first query results (small index $n$), they quickly become unacceptable as $n$ grows. If the adaptation strategy systematically selects the simplest version (T), this also has a harmful impact on the behavior of the user. Despite the links toward the other resources, (I)mages and (V)ideo, the lack of these visual components, which normally stimulate interest, will not encourage further browsing. An important and legitimate question is what can be called an “appropriate” adaptation policy.
4.3. Properties of Appropriate Adaptation Policies
The aforementioned two examples of policies (one “too ambitious,” the other “too modest”) show how complex the relationships are among the versions, the number of browsed films, the time spent on the service, the quality of service, the available bandwidth, and the user interest. An in-depth analysis of these relationships could represent a research project in itself. We do not claim to deliver such an analysis in this paper; we simply want to show how a policy and an adaptation agent can be generated automatically from a model where the context state is observable or partially observable.
Three properties of a good adaptation policy can be identified as follows.
(1) The version chosen for presenting the content must be simplified if the available bandwidth decreases (T is simpler than I, itself simpler than V). (2) The version must be simplified if $n$ increases: it is straightforward to choose rich versions for the first browsed movie descriptions, which are normally the most pertinent ones (as we have already mentioned, we should avoid large latencies for big values of $n$ and small $BW$). (3) The version must be enriched if the user shows a high interest in the query results. The simple underlying idea is that a very interested user is more likely to be patient and to tolerate large downloading latencies more easily. The first two properties are related to the variation of the context parameters that we consider observable ($n$ and $BW$), while the third one is related to a hidden element, namely, user interest. At this stage, given these three properties, an adaptation policy for our case study can be expressed as the selection of the version (T, I, or V) knowing $n$ and $BW$ and having a way to estimate the interest.
4.4. On Navigation Scenarios
This paragraph introduces by examples some possible navigation scenarios. Figure 4 illustrates different possible steps during navigation and introduces different events that are tracked. In this figure, the user chooses a film (event clickMovie), the presentation in version T is downloaded (event pageLoad) without the user interrupting this download. Interested in this film, the user requests the production photos, following the link toward the pictures (event linkI). In the one case, the downloading seems too long and the user interrupts it (event stopDwl means stopDownload) then returns to the movie list (event back). In the other case, the user waits for the downloading of the pictures to finish, then starts viewing the slideshow (event startSlide). Either this slideshow is shown completely and then an event EI (short for EndImages) is raised, or the visualization is incomplete, leading to the event stopSlide (not represented in the figure). Next, the link to the trailer can be followed (event linkV); here again an impatient user can interrupt the downloading (stopDwl) or start playing the video (play). Then the video can be watched completely (event EV for EndVideo) or stopped (stopVideo), before a return (event back).
Obviously, this example does not introduce all the possibilities, especially if the video is not downloaded but streamed. Streaming scenarios introduce different challenges and require a playout buffer that enriches the set of possible interactions (e.g., stopBuffering). Meanwhile, the user may choose not to interact with the proposed media: we introduce a sequence of events pageLoad, noInt (no interaction), back. Similarly, a back is possible just after a pageLoad, a stopDwl may occur immediately after the event clickMovie, and watching the video before the pictures is also possible.
5. Problem Statement
5.1. Rewards for Well-Chosen Adaptation Policies
From the previous example and the definitions of the associated interactions, it is possible to propose a simple mechanism aiming at rewarding a pertinent adaptation policy. A version (T, I, or V) is considered well chosen in a given context if it is not questioned by the user. The reassessment of a version as being too simple is suggested, for example, by the full consumption of the pictures. In the same way, the reassessment of a version as being too rich is indicated by a partial consumption of the downloaded video. Four simple principles guide our rewarding system.
(i) We reward the event EI for versions I and V. (ii) We reward the event EV if the chosen version was V. (iii) We penalize upon arrival of interruption events (“stops”). (iv) We favor the simpler versions when there is no or little interaction. Thus, a version T is sufficient if the user does not request (or at least does not completely consume) the pictures. A version I is preferable if the user is interested enough and has access to enough resources to download and view the set of pictures (reward EI). Similarly, a version I is adopted if the user views all the pictures (reward EI) and, trying to download the video, is forced to interrupt it because of limited bandwidth. Finally, a rich version V is adopted if the user is in good condition to consume the video completely (reward EV). The following decision-making models formalize these principles.
5.2. Toward an Implicit Measure of the Interest
The previously introduced navigations and interactions make it possible to estimate the interest of the user. We proceed by evaluating “implicit feedback” and use the sequences of events to estimate the user's interest level. Our approach is inspired by [26] and is based on the two following ideas.
The first idea is to identify two types of interactions according to what they suggest: either an increasing interest (linkI, linkV, startSlide, play, EI, EV) or a decreasing interest (stopSlide, stopVideo, stopDwl, noInt). Therefore, the event distribution (seen as the probability of occurrence) depends on the user's interest in the browsed movie.
The second idea is to consider not only a single running event to update the estimation of user interest but also to regard an entire sequence of events as being more significant. In fact, it has been recently established that the user actions on a response page to a search (e.g., on Google) depend not only on the relevance of the current response but also on the global relevance of the set of the query results [2].
Following the work of [26], it is natural to model the sequences of events, or observations, as produced by a hidden Markov model (HMM), whose definition we do not detail here (e.g., see [27]). One can simply translate the two previous ideas by using an HMM with several (hidden) states of interest. The three states of interest shown in Figure 5 are referred to as S, M, and B, respectively, for a small, medium, or big interest. The three distributions of observable events in every state are different, as stressed by the first idea mentioned above. These differences explain the occurrences of different sequences of observations in terms of sequential interest evolutions (second idea). These evolutions are encoded thanks to the transition probabilities (stippled) between hidden states of interest. Given a sequence of observations, an HMM can thus provide the most likely underlying sequence of hidden states or the most likely running hidden state. At this point, the characteristics of our information system are rich enough to define an adaptation agent applying decision policies under uncertainty. These policies can be formalized in the framework presented in Section 3.1.
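For illustration, the following sketch runs the forward (filtering) recursion of such a three-state interest HMM on an event sequence; the transition and emission probabilities are placeholder values chosen to reflect the two ideas above, not the parameters used in our experiments.

```python
INTEREST = ["S", "M", "B"]   # small, medium, big interest

# Placeholder parameters (illustrative only, not learned values).
TRANS = {"S": {"S": 0.7, "M": 0.2, "B": 0.1},
         "M": {"S": 0.2, "M": 0.6, "B": 0.2},
         "B": {"S": 0.1, "M": 0.2, "B": 0.7}}
# Events suggesting a rising interest are more likely in state B, and so on.
EMIT = {"S": {"noInt": 0.4, "stopDwl": 0.3, "linkI": 0.2, "EI": 0.1},
        "M": {"noInt": 0.2, "stopDwl": 0.2, "linkI": 0.4, "EI": 0.2},
        "B": {"noInt": 0.1, "stopDwl": 0.1, "linkI": 0.4, "EI": 0.4}}

def filter_interest(events, prior=None):
    """Forward recursion: P(interest_t | events_1..t), updated event by event."""
    belief = prior or {s: 1.0 / len(INTEREST) for s in INTEREST}
    for e in events:
        predicted = {s2: sum(TRANS[s1][s2] * belief[s1] for s1 in INTEREST)
                     for s2 in INTEREST}
        unnorm = {s: EMIT[s].get(e, 1e-6) * predicted[s] for s in INTEREST}
        total = sum(unnorm.values())
        belief = {s: v / total for s, v in unnorm.items()}
    return belief

print(filter_interest(["linkI", "EI", "EI"]))   # probability mass shifts toward "B"
```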
6. Modeling Content Delivery Policies
In this section, we model the dynamic context of our browsing system (Section 4) in order to obtain the appropriate adaptation agents. Our goal is to characterize the adaptation policies in terms of Markov decision processes (MDPs or POMDP).
6.1. MDP Modeling
Firstly, an observable context is considered. Let us introduce the proposed MDP that models it. The aim is to characterize adaptation policies which verify properties 1 and 2 described in Section 4.3: the presented movie description must be simplified if the available bandwidth decreases or if $n$ increases.
A state (observable) of the context is a tuple $(n, BW, v, e)$, with $n$ being the rank of the consulted film, $BW$ the available bandwidth, $v$ the version proposed by the adaptation agent, and $e$ the running event (see Figure 6), with $v \in \{T, I, V\}$ and $e$ belonging to the set of events {clickMovie, stopDwl, pageLoad, noInt, linkI, startSlide, stopSlide, EI, linkV, play, stopVideo, EV, back}.
To obtain a finite and reasonable number of such states (thus limiting the MDP size), we quantize the variables according to our needs. Thus $n$ (resp., $BW$) can be quantized according to three levels, meaning begin, middle, and end (resp., low, average, and high), by segmenting its interval of variation into three regions.
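A possible discretization is sketched below; the cut points and ranges are hypothetical, since only the number of levels is fixed here.

```python
def quantize(value, cuts, labels):
    """Map a continuous value to a discrete level (len(labels) == len(cuts) + 1)."""
    for cut, label in zip(cuts, labels):
        if value <= cut:
            return label
    return labels[-1]

# Hypothetical ranges: up to 30 browsed movies, up to 2000 kbps of bandwidth.
def quantize_state(n, bw, n_max=30, bw_max=2000):
    n_level = quantize(n, [n_max / 3, 2 * n_max / 3], ["begin", "middle", "end"])
    bw_level = quantize(bw, [bw_max / 3, 2 * bw_max / 3], ["low", "average", "high"])
    return n_level, bw_level

print(quantize_state(n=4, bw=1500))   # ('begin', 'high')
```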
The temporal axis of MDP is naturally represented by the sequence of events, every event implying a change of state.
The dynamics of our MDP is constrained by the dynamics of the context, especially by the user navigation. Thus, a transition from a movie index $n$ to an index other than $n$ or $n+1$ is not possible. Similarly, every back event is followed by a clickMovie event. The bandwidth's own dynamics also has an impact (according to the quantized levels) on the dynamics between the states of the MDP.
The choice of the movie description version (T, I, or V) proposed by the adaptation agent is made when the user follows the link to the film. This is encoded in the model by the event clickMovie. The states of the MDP can be classified into:
(i) decision states (where the running event is clickMovie) in which the agent executes a real action (it effectively chooses among T, I, or V); (ii) nondecision or intermediary states where the agent does not execute any action. In an MDP framework, the agent decides an action in every single state. Therefore, the model needs to be enriched with an artificial “no-action” as well as an absorbent state with a strong penalty. Thus, any valid action chosen in an intermediary state brings the agent into the absorbent state where it is strongly penalized. Similarly, the agent will avoid deciding the artificial no-action in a decision state where a valid action is desired. Thus, the valid actions mark out the visits of the decision states, while the dynamics of the context (subject to user navigation and bandwidth variability) are captured by the transitions between intermediary states, for which the no-action is carried out. These properties are illustrated in Figure 6.
In other words, there is no change of version during the transitions between intermediary states. The action (representing the proposed version) chosen in a decision state is therefore memorized in the state variable $v$ of all the following intermediary states, until the next decision state. Thus, the MDP captures the variation of the context dynamics according to the chosen version. It will therefore be able to identify which version choices are good (to reproduce later in similar conditions), provided it is rewarded for them.
The rewards are associated with the decision states according to the chosen action. Intermediary states corresponding to the occurrences of the events EI and EV are rewarded as well, according to Section 5.1. The rewards (other formulations are possible, including, e.g., negative rewards for interruption events) are defined as follows:
$$r(s, a) = \begin{cases}
r_T, & \text{for action } T \text{ in a decision state},\\
r_I, & \text{for action } I \text{ in a decision state},\\
r_V, & \text{for action } V \text{ in a decision state},\\
r_{EI}, & \text{in an intermediary state with event EI and } v \in \{I, V\},\\
r_{EV}, & \text{in an intermediary state with event EV and } v = V,\\
0, & \text{otherwise}.
\end{cases}$$
To favor simpler versions for users who do not interact with the content and do not view any media (cf. Section 5.1), let us choose $r_T > r_I > r_V$. To summarize, the model behaves in the following manner: the agent starts in a decision state, where it decides a valid action for which it receives an “initial” reward; the simpler the version, the bigger the reward. According to the transition probabilities based on the context dynamics, the model goes through intermediary states where it can receive the new rewards $r_{EI}$ or $r_{EV}$ at the occurrences of EI (resp., EV), if the taken action was $I$ or $V$ (resp., $V$). As these occurrences are more frequent for small $n$ and high $BW$, while the absence of interactions is more likely if $n$ is big and $BW$ is low, the MDP
(i) will favor the richest version (V) for small $n$ and high $BW$; (ii) will favor the simplest version (T) for big $n$ and low $BW$; (iii) will establish a tradeoff (optimal according to the rewards) for all the other cases. The best policy given by the model is obviously related to the chosen values of $r_T$, $r_I$, $r_V$, $r_{EI}$, and $r_{EV}$. In order to control this choice in the experimental section, a simplified version of the MDP will be defined.
A simplified MDP can be obtained by memorizing the occurrences of the events EI and EV during the navigation between two clickMovie events. Thus, we can delay the rewards $r_{EI}$ or $r_{EV}$. This simplified model does not contain nondecision states if two booleans ($b_{EI}$ and $b_{EV}$) are added to the state structure (Figure 7). The boolean $b_{EI}$ (resp., $b_{EV}$) passes to 1 if the event EI (resp., EV) is observed between two decision states. The simplified MDP is defined by its states $(n, BW, v, b_{EI}, b_{EV})$, the actions $\{T, I, V\}$, the temporal axis given by the sequence of clickMovie events, and the rewards redefined as
$$r\big((n, BW, v, b_{EI}, b_{EV}), a\big) = r_a + b_{EI}\, r_{EI} + b_{EV}\, r_{EV}, \quad a \in \{T, I, V\},$$
where the delayed rewards $r_{EI}$ and $r_{EV}$ are granted only when the memorized version $v$ allows them (cf. Section 5.1). This ends the presentation of our observable model and we continue by integrating user interest in a richer POMDP model.
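To fix ideas, here is a sketch of the simplified decision state and its reward; the numeric reward values are arbitrary placeholders (the actual values are derived in Section 7) and the field names are ours.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimplifiedState:
    n_level: str      # quantized movie rank: "begin" / "middle" / "end"
    bw_level: str     # quantized bandwidth:  "low" / "average" / "high"
    version: str      # version memorized since the previous decision: "T"/"I"/"V"
    b_ei: bool        # EI observed since the previous decision
    b_ev: bool        # EV observed since the previous decision

# Placeholder reward values satisfying r_T > r_I > r_V.
R_VERSION = {"T": 3.0, "I": 2.0, "V": 1.0}
R_EI, R_EV = 4.0, 8.0

def reward(state: SimplifiedState, action: str) -> float:
    """Reward collected at a decision state: the 'initial' reward of the new
    action plus the delayed rewards earned by the memorized version (Section 5.1)."""
    delayed = 0.0
    if state.b_ei and state.version in ("I", "V"):
        delayed += R_EI
    if state.b_ev and state.version == "V":
        delayed += R_EV
    return R_VERSION[action] + delayed
```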
6.2. POMDP Modeling
The new partially observable model adds a hidden variable (It) to the state. The value of It represents the user's interest quantized on three levels (Small, Average, Big). To be able to estimate user interest, we follow the principles described in Section 5.2 and Figure 5. The events (interactions) are taken out from the previous MDP state to become observations in the POMDP model. These observations are distributed according to It (the interest level). A sequence of observations provides an implicit measure of It, following the same principle described for the HMM in Figure 5. Therefore, it becomes possible for the adaptation agent to refine its decisions according to the probability of the running user's interest: small, average, big. In other words, this refinement is done according to a belief state. The principle of this POMDP is illustrated in Figure 8.
A hidden state of our POMDP becomes a tuple $(n, BW, v, b_{EI}, b_{EV}, It)$. The notations are unchanged, including the booleans $b_{EI}$ and $b_{EV}$.
The temporal axis and the actions are unchanged.
The dynamics of the model. When a clickMovie event occurs, the adaptation agent is in a decision state. It chooses a valid action and moves, according to the model's random transitions, to an intermediary state where $b_{EI}$ and $b_{EV}$ are equal to 0. The version proposed by the agent is memorized in the intermediary states during the browsing of the current film. The booleans $b_{EI}$ and $b_{EV}$ become 1 if the events EI or, respectively, EV are observed, and they preserve this value until the next decision state. During the browsing of the running film, $n$ and $v$ remain constant while the other factors ($BW$, $It$, and the booleans) can change.
The observations are the occurred events: $\Omega = E$. They are distributed according to the states. In Figure 8, for instance, each represented event is observed with probability 1.0 in the states where it appears and with probability 0.0 elsewhere.
In every intermediary state, the event distribution characterizes the value of the interest. Thus, just as with the HMM of Figure 5, the POMDP will know how to evaluate, from the sequence of events, the current belief state. The most likely interest value will therefore evolve along with the occurring events: it increases for events such as linkI, linkV, startSlide, play, EI, and EV, and decreases for events such as stopSlide, stopVideo, stopDwl, and noInt (cf. Section 5.2). To preserve the interest level across the decision states, the interest of the current decision state receives the value corresponding to the last intermediary state (Figure 8).
The rewards associated with the actions taken in a decision-making state are collected in the following decision-making state, where we have all the necessary information: $v$, $b_{EI}$, and $b_{EV}$.
7. Experimental Results
Simulations are used to experimentally validate the models. The developed software simulates navigations such as the one illustrated in Figure 4. Every transition probability between two successive states of navigation is a stochastic function of three parameters: $BW$, $It$, and $v$. The bandwidth $BW$ is simulated as a random variable uniformly distributed in an interval compatible with today's mobile networks. $It$ represents a family of random variables whose expectation decreases with $n$. The parameter $v$ is the movie version proposed to the user. Other experimental setups, involving different distribution laws (e.g., a normal distribution) for the bandwidth dynamics or the user's interest, lead to similar results.
7.1. MDP Validation for Observable Contexts
To validate the MDP model of Section 6.1, let us choose a problem with given ranges for $n$ and $BW$. Initially, the intervals of $n$ and $BW$ are quantized on 2 granularity levels each. Rather than proceeding to an arbitrary choice of the values $r_T$, $r_I$, $r_V$, $r_{EI}$, $r_{EV}$ that define the rewards, we can look for the ones leading to the optimal policy shown in Table 1. In fact, this policy respects the principles formulated in Section 4.3 and could have been proposed beforehand by an expert (Table 1 gives the decision only for the pairs $(n, BW)$, since the other state components do not influence it).
The value functions corresponding to the simplified MDP, estimated over a horizon of length 1 (between two decision states), can be written as
$$Q\big((n, BW), a\big) = r_a + P_{EI}(a)\, r_{EI} + P_{EV}(a)\, r_{EV}, \quad a \in \{T, I, V\},$$
because, over this horizon, the value of the successor state does not depend on the action $a$. Here $P_{EI}(v)$ and $P_{EV}(v)$ represent the probabilities of observing the (rewarded) events EI, respectively EV, knowing the version $v$.
For every pair $(n, BW)$ we have computed, based on simulations, the probabilities $P_{EI}$ and $P_{EV}$. The respect of the policy of Table 1 is assured if and only if, in every state, the Q-value of the prescribed action exceeds those of the two other actions. Writing these inequalities for the 4 pairs of Table 1 and using the estimations of the probabilities, we obtain a system of 12 linear inequations in the variables $r_T$, $r_I$, $r_V$, $r_{EI}$, $r_{EV}$. Two solutions of the system, among an infinity, were retained for the experiments (hereafter the first and the second set of rewards). Starting from these values, we can experimentally check the correct behavior of our MDP model. Table 2 shows the policy obtained automatically by dynamic programming or by the Q-learning algorithm, with 4 granularity levels for $n$ and $BW$ and the first set of rewards. This table refines the previous coarse-grained policy; it is not a simple copy of actions (the version selected for several pairs $(n, BW)$ changes). This new policy is optimal with respect to the first set of rewards, for this finer granularity level.
Solving the MDP for the second set of rewards gives a different refinement (Table 3) that shows richer versions (underlined) compared to Table 2. The explanation lies in the growth of the rewards associated with the events EI and EV, which induces the choice of more complex versions for a longer time (a richer version lasts over 3 classes of $n$, for a given bandwidth level).
7.2. POMDP Validation: Interest-Refined Policies
Once MDPs are calibrated and return appropriate adaptation policies, their rewards can be reused to solve the POMDP models. The goal is to refine the MDP policies for the observable case by estimating user interest.
Two experimental steps are necessary. The first step consists of learning the POMDP model and the second in solving the decision-making problem.
For the learning process, the simplest method consists of empirically estimating the transition and observation probabilities from the simulator's traces. Starting from these traces, the probabilities are obtained by computing frequencies:
$$\hat{p}(s' \mid s, a) = \frac{N(s, a, s')}{N(s, a)}, \qquad \hat{O}(o \mid s) = \frac{N(s, o)}{N(s)},$$
where $N(\cdot)$ counts the corresponding occurrences in the traces. Having a POMDP model, the resolution is the next step. Solving a POMDP is notoriously delicate and computationally intensive (e.g., see the tutorial proposed at www.pomdp.org). We used the software package pomdp-solve 5.3 in combination with CPLEX (with the more recent strategy called finite grid).
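A sketch of this frequency-based estimation is given below; the trace format (state, action, observation, next state) is an assumption about how the simulator logs could be organized.

```python
from collections import Counter

def estimate_model(traces):
    """Empirical POMDP parameters estimated from simulator traces.

    Each trace is assumed to be a list of (state, action, observation,
    next_state) tuples; hidden states are available here only because the
    traces come from our own simulator.
    """
    n_sa, n_sas = Counter(), Counter()
    n_s, n_so = Counter(), Counter()
    for trace in traces:
        for s, a, o, s_next in trace:
            n_sa[(s, a)] += 1
            n_sas[(s, a, s_next)] += 1
            n_s[s_next] += 1
            n_so[(s_next, o)] += 1
    trans = {(s, a, s2): n / n_sa[(s, a)] for (s, a, s2), n in n_sas.items()}
    obs = {(s, o): n / n_s[s] for (s, o), n in n_so.items()}
    return trans, obs
```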
The result returned by pomdp-solve is an automaton that implements a “near optimal” deterministic policy, represented by a decision-making graph (policy graph). The nodes of the graph contain the actions (T, I, or V), while the transitions are made according to the observations. Only the transitions made possible by the navigation process are to be exploited.
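Online, such a policy graph is trivial to execute; the sketch below assumes a simple dictionary encoding of the automaton (the actual pomdp-solve output format is not reproduced here) and a toy two-node graph for illustration.

```python
def run_policy_graph(node_actions, node_transitions, start_node, events):
    """Execute a policy graph online.

    node_actions[i]          : version ("T"/"I"/"V") decided in node i.
    node_transitions[i][obs] : successor node after observing `obs`.
    Returns the sequence of decided versions along the observation stream.
    """
    node, decisions = start_node, [node_actions[start_node]]
    for obs in events:
        node = node_transitions[node][obs]    # follow the observed event
        decisions.append(node_actions[node])  # apply the action of the new node
    return decisions

# Toy two-node graph: a fully watched slideshow (EI) upgrades T to I (illustrative).
actions = {0: "T", 1: "I"}
transitions = {0: {"EI": 1, "noInt": 0, "back": 0},
               1: {"EI": 1, "noInt": 0, "back": 1}}
print(run_policy_graph(actions, transitions, 0, ["EI", "back"]))  # ['T', 'I', 'I']
```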
To illustrate this form of result, let us show one of the automata that is small enough to be displayed on an A4 page (Figure 9). We choose a single granularity level for $n$ and $BW$ and three levels for $It$. Additionally, we consider that the consumption of the slideshow precedes the consumption of the video. The obtained adaptation policy therefore takes into account only the variation of the estimated user interest ($n$ and $BW$ do not play any role).
Figure 9 shows that the POMDP agent learns to react in a coherent way. For example, starting from a version T and observing pageLoad, linkI, startSlide, EI, noInt, back, the following version decided by the POMDP agent is I, which translates the sequence into an interest rise. This rise is even stronger if, after the event EI, the user follows the link linkV. This is enough to make the agent select the version V from then on.
Conversely, starting from version V, an important decrease in interest can be observed on the sequence startSlide, stopSlide, play, stopVideo, back, so the system decides T. A smaller decrease in interest can be associated with the sequence startSlide, stopSlide, play, EV, back, the next version selected being I. These examples show that there exists a natural correlation between the richness of the selected versions and the implicit user interest. For this problem, where $n$ and $BW$ are not involved, the version given by the policy graph translates the estimation of the running interest (the richness growing with $It$). For each movie, the choice of version is therefore based only on the events observed while browsing the previous movies.
Other sequences lead to decisions that are less intuitive or harder to interpret. For example, the sequence pageLoad, linkI, startSlide, stopSlide, noInt, back leaving version I leads to the decision I. In this sequence, a compromise between an interest rise (suggested by linkI, startSlide) and a decrease (suggested by stopSlide, noInt) must be established. Thus, a decision T would not be illegitimate. The POMDP trades off this decision according to its dynamics and its rewards. To obtain a modified graph leading to a decision T for this sequence, it would be sufficient that the product $P_{EI}\, r_{EI}$ decreases, where $P_{EI}$ here represents the probability of observing EI in the version I for a medium interest. In this case, stopSlide, instead of provoking a loopback on node 5, would bring the agent to node 1. Then the agent would decide T, since the expectation of the gains associated with I would be smaller.
In general, the decision-making automaton depends on $n$ and $BW$. When $n$, $BW$, and $It$ vary, the automaton becomes too complex to be displayed, and the results of the POMDP require a different presentation. Henceforth, working with 3 granularity levels on $n$, 2 on $BW$, 3 on $It$, and the previously calibrated rewards leads to a policy graph of more than 100 nodes. We apply it during numerous sequences of simulated navigations. Table 4 gives the statistics on the decisions that have been taken. For every triplet ($n$, $BW$, $It$), the decisions (the agent not knowing $It$) are counted and translated into percentages.
We notice that the proposed content becomes statistically richer when the interest increases, proving again that the interest estimation from the previous observations behaves as expected. Let us take an example and consider the bottom-right part of Table 4 (corresponding to the last classes of $n$ and $BW$). The probability of the policy proposing version V increases with the interest: from 0% (small interest) to 2% (average interest) and then 10% (big interest).
Moreover, when $n$ and/or $BW$ increase, the trend remains correct. For example, for given values of $n$ and $It$, the proposed version becomes richer as the bandwidth increases, from (1%T, 99%I, 0%V) to (0%T, 51%I, 49%V).
The POMDP capacity to refine adaptation policies according to the user interest is thus validated. Once the POMDP model is solved (offline resolution), the obtained automaton is easily put into practice online by encoding it into an adaptation agent.
8. Conclusion
This paper has shown that sequential decision processes under uncertainty are well suited for defining adaptation mechanisms for dynamic contexts. According to the type of the context state (observable or partially observable), we have shown how to characterize adaptation policies by solving Markov decision processes (MDPs) or partially observable MDP (POMDP). These ideas have been applied to adapt a movie browsing service. In particular, we have proposed a method for refining a given adaptation policy according to user interest. The perspectives of this work are manifold. Our approach can be applied to cases where rewards are explicitly related to the service (e.g., to maximize the number of rented DVDs). It will also be interesting to extend our model by coupling it with functionalities from recommendation systems and/or from multimedia search systems. In the latter case, we would benefit a lot from a collection of real data, that is, navigation logs. These are the research directions that will guide our future work.