Thursday Nov 17, 2011

Short Season, Long Models - Dealing with Seasonality

Accounting for seasonality presents a challenge for the accurate prediction of events. Examples of seasonality include:

·         Boxed cosmetics sets are more popular during Christmas. They sell at other times of the year, but they rise higher than other products during the holiday season.

·         Interest in a promotion rises around the time advertising on TV airs

·         Interest in the Sports section of a newspaper rises when there is a big football match

There are several ways of dealing with seasonality in predictions.

Time Windows

If the length of the model time windows is short enough relative to the seasonality effect, then the models will see only seasonal data, and therefore will be accurate in their predictions. For example, a model with a weekly time window may be quick enough to adapt during the holiday season.

In order for time windows to be useful in dealing with seasonality it is necessary that:

1. The time window is significantly shorter than the season changes
2. There is enough volume of data in the short time windows to produce an accurate model

An additional issue to consider is that sometimes the season may have an abrupt end, for example the day after Christmas.

Input Data

If available, it is possible to include the seasonality effect in the input data for the model. For example the customer record may include a list of all the promotions advertised in the area of residence.

A model with these inputs will have to learn the effect of the input. It is possible to learn it specific to the promotion – and by the way learn about inter-promotion cross feeding – by leaving the list of ads as it is; or it is possible to learn the general effect by having a flag that indicates if the promotion is being advertised.

For inputs to properly represent the effect in the model it is necessary that:

1. The model sees enough events with the input present. For example, by virtue of the model lifetime (or time window) being long enough to see several “seasons” or by having enough volume for the model to learn seasonality quickly.

Proportional Frequency

If we create a model that ignores seasonality it is possible to use that model to predict how the specific person likelihood differs from average. If we have a divergence from average then we can transfer that divergence proportionally to the observed frequency at the time of the prediction.

Definitions:

Ft = trailing average frequency of the event at time “t”. The average is done over a suitable period of to achieve a statistical significant estimate.

F = average frequency as seen by the model.

L = likelihood predicted by the model for a specific person

Lt = predicted likelihood proportionally scaled for time “t”.

If the model is good at predicting deviation from average, and this holds over the interesting range of seasons, then we can estimate Lt as:

Lt = L * (Ft / F)

Considering that:

L = (L – F) + F

Substituting we get:

Lt = [(L – F) + F] * (Ft / F)

Which simplifies to:

(i)                  Lt = (L – F) * (Ft / F)  +  Ft

This latest expression can be understood as “The adjusted likelihood at time t is the average likelihood at time t plus the effect from the model, which is calculated as the difference from average time the proportion of frequencies”.

The formula above assumes a linear translation of the proportion. It is possible to generalize the formula using a factor which we will call “a” as follows:

(ii)                Lt = (L – F) * (Ft / F) * a  +  Ft

It is also possible to use a formula that does not scale the difference, like:

(iii)               Lt = (L – F) * a  +  Ft

While these formulas seem reasonable, they should be taken as hypothesis to be proven with empirical data. A theoretical analysis provides the following insights:

1. The Cumulative Gains Chart (lift) should stay the same, as at any given time the order of the likelihood for different customers is preserved
2. If F is equal to Ft then the formula reverts to “L”
3. If (Ft = 0) then Lt in (i) and (ii) is 0
4. It is possible for Lt to be above 1.

If it is desired to avoid going over 1, for relatively high base frequencies it is possible to use a relative interpretation of the multiplicative factor.

For example, if we say that Y is twice as likely as X, then we can interpret this sentence as:

• If X is 3%, then Y is 6%
• If X is 11%, then Y is 22%
• If X is 70%, then Y is 85% - in this case we interpret “twice as likely” as “half as likely to not happen”

Applying this reasoning to (i) for example we would get:

If (L < F) or (Ft < (1 / ((L/F) + 1))

Then  Lt = L * (Ft / F)

Else

Lt = 1 – (F / L) + (Ft * F / L)

Thursday May 27, 2010

Tips on ensuring Model Quality

Given enough data that represents well the domain and models that reflect exactly the decision being optimized, models usually provide good predictions that ensure lift. Nevertheless, sometimes the modeling situation is less than ideal. In this blog entry we explore the problems found in a few such situations and how to avoid them.

1 - The Model does not reflect the problem you are trying to solve

For example, you may be trying to solve the problem: "What product should I recommend to this customer" but your model learns on the problem: "Given that a customer has acquired our products, what is the likelihood for each product". In this case the model you built may be too far of a proxy for the problem you are really trying to solve. What you could do in this case is try to build a model based on the result from recommendations of products to customers. If there is not enough data from actual recommendations, you could use a hybrid approach in which you would use the [bad] proxy model until the recommendation model converges.

2 - Data is not predictive enough

If the inputs are not correlated with the output then the models may be unable to provide good predictions. For example, if the input is the phase of the moon and the weather and the output is what car did the customer buy, there may be no correlations found. In this case you should see a low quality model.

The solution in this case is to include more relevant inputs.

3 - Not enough cases seen

If the data learned does not include enough cases, at least 200 positive examples for each output, then the quality of recommendations may be low.

The obvious solution is to include more data records. If this is not possible, then it may be possible to build a model based on the characteristics of the output choices rather than the choices themselves. For example, instead of using products as output, use the product category, price and brand name, and then combine these models.

4 - Output leaking into input giving the false impression of good quality models

If the input data in the training includes values that have changed or are available only because the output happened, then you will find some strong correlations between the input and the output, but these strong correlations do not reflect the data that you will have available at decision (prediction) time. For example, if you are building a model to predict whether a web site visitor will succeed in registering, and the input includes the variable DaysSinceRegistration, and you learn when this variable has already been set, you will probably see a big correlation between having a Zero (or one) in this variable and the fact that registration was successful.

The solution is to remove these variables from the input or make sure they reflect the value as of the time of decision and not after the result is known.

Saturday Dec 12, 2009

Evaluating Models and Prediction Schemes

It has become quite common in RTD implementation to utilize different models to predict the same kind of value in different situations. For example, if the RTD application is used for optimizing the presentation of Creatives, where the creatives belong to a Offers which in turn belong to Campaigns which belong to Products which belong to Product Lines; it may be desirable to be able to predict at the different levels and use the models in a waterfall fashion as they converge and become more precise.

Another example is when using more than one Model or algorithm, whether internal to RTD or external.

In all these cases it is interesting to determine which of the models or algorithms is better at predicting the output. While RTD Decision Center reports provide good Model Quality reports that can be used to evaluate the RTD internal models, the same may not exist for external model. Furthermore, it may be desired to evaluate the different models in a level playing field, utilizing just one metric that can be used to select the "best" algorithm.

One method of achieving this goal is to use an RTD model to perform the evaluation. This pattern is commonly used in Data Mining to "blend" models or create an "ensemble" of models. The idea is to have the predictors as input and the normal positive event as output. When doing this in RTD, the Decision Center Predictiveness report provides the sorting of the different predictors by their predictiveness.

To demonstrate this I have created an Inline Service (ILS) whose sole purpose is to evaluate predictors which represent different levels of noise over a basic "perfect" predictor. The attached image represents the result of this ILS.

The "Perfect" predictor is just a normally distributed variable centered at 3% with a standard deviation of 7%, limited to the range 0 to 1. The output variable follows exactly the probability given by the predictor. For example, if the predictor is 13% there is a 13% probability of the positive output.

The other predictors are defined by taking the perfect predictor and adding a noise component. The noise is also normally distributed and has a standard deviation that determines the amount of noise.

For example, the "Noise 1/5" predictor has noise with a standard deviation of 20% (1/5) of the value of the prefect predictor.

You can see that the RTD blended model nicely discovers that the more noise there is in the predictor, the less predictive it is.

This kind of blended model can also be used to create a combined model that has the potential of being better than each of the individual models. This is particularly interesting when the different models are really different, for example because of the inputs they use or because of the algorithms used to develop the models.

If you want a copy of the ILS send me an email.

About

Issues related to Oracle Real-Time Decisions (RTD). Entries include implementation tips, technology descriptions and items of general interest to the RTD community.

Archives
Sun Mon Tue Wed Thu Fri Sat « June 2016 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Today
RSS Atom