By Michel Adar on May 27, 2010
1 - The Model does not reflect the problem you are trying to solve
For example, you may be trying to solve the problem: "What product should I recommend to this customer" but your model learns on the problem: "Given that a customer has acquired our products, what is the likelihood for each product". In this case the model you built may be too far of a proxy for the problem you are really trying to solve. What you could do in this case is try to build a model based on the result from recommendations of products to customers. If there is not enough data from actual recommendations, you could use a hybrid approach in which you would use the [bad] proxy model until the recommendation model converges.
2 - Data is not predictive enough
If the inputs are not correlated with the output then the models may be unable to provide good predictions. For example, if the input is the phase of the moon and the weather and the output is what car did the customer buy, there may be no correlations found. In this case you should see a low quality model.
The solution in this case is to include more relevant inputs.
3 - Not enough cases seen
If the data learned does not include enough cases, at least 200 positive examples for each output, then the quality of recommendations may be low.
The obvious solution is to include more data records. If this is not possible, then it may be possible to build a model based on the characteristics of the output choices rather than the choices themselves. For example, instead of using products as output, use the product category, price and brand name, and then combine these models.
4 - Output leaking into input giving the false impression of good quality models
If the input data in the training includes values that have changed or are available only because the output happened, then you will find some strong correlations between the input and the output, but these strong correlations do not reflect the data that you will have available at decision (prediction) time. For example, if you are building a model to predict whether a web site visitor will succeed in registering, and the input includes the variable DaysSinceRegistration, and you learn when this variable has already been set, you will probably see a big correlation between having a Zero (or one) in this variable and the fact that registration was successful.
The solution is to remove these variables from the input or make sure they reflect the value as of the time of decision and not after the result is known.