Ignoring Robots - Or Better Yet, Counting Them Separately
By Michel Adar on Mar 22, 2010
One easy way to deal with robot sessions is to define, for all the models, a partitioning variable that flags whether the session is "Normal" or "Robot". All the reports and predictions can then use the "Normal" partition, while the counts and statistics for robots remain available.
For this to work, though, two conditions must hold:
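To make the idea concrete, here is a minimal sketch in Python of counts partitioned by session type. It is not the RTD API; the class PartitionedStats and its methods are hypothetical names used only for illustration.

```python
from collections import Counter, defaultdict


class PartitionedStats:
    """Hypothetical stand-in for a model partitioned by session type."""

    def __init__(self):
        # One independent set of event counts per partition value.
        self._counts = defaultdict(Counter)

    def learn(self, session_type, event):
        """Record one event in the partition for this session type."""
        self._counts[session_type][event] += 1

    def count(self, event, session_type="Normal"):
        """Read a count; reports default to the "Normal" partition."""
        return self._counts[session_type][event]


stats = PartitionedStats()
stats.learn("Normal", "click")
stats.learn("Robot", "click")
stats.learn("Robot", "click")

print(stats.count("click"))           # 1: robots do not pollute the report
print(stats.count("click", "Robot"))  # 2: robot counts are still available
```

Because the robot counts live in their own partition, nothing is thrown away; the reports simply never read from it.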
1. It is possible to identify the Robotic sessions.
2. No learning happens before the identification of the session as a robot.
The first point is obvious, but the second may require some explanation. While the default in RTD is to learn at the end of the session, it is possible to learn at any entry point. This is a setting for each model. There are various reasons to learn at a specific entry point, for example to capture the session data exactly as it was when the event happened, rather than including changes made later in the session.
In any case, if RTD has already learned on the session before it was identified as a robot, there is no way to retract that learning.
Identifying the robotic sessions can be done through the use of rules and heuristics. For example, we may use some of the following:
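The effect of the second condition can be sketched as follows. This is only an illustration in Python under assumed names (DeferredLearner, record, close_session are all hypothetical, not RTD calls): events are held until the session closes, so the Normal/Robot flag is known before anything is learned.

```python
class DeferredLearner:
    """Hypothetical sketch: learn only once the session type is known."""

    def __init__(self):
        self.pending = {}   # session_id -> events seen so far, not yet learned
        self.learned = []   # (session_type, event) pairs; cannot be retracted

    def record(self, session_id, event):
        """Buffer an event instead of learning on it immediately."""
        self.pending.setdefault(session_id, []).append(event)

    def close_session(self, session_id, is_robot):
        """At session end, learn all buffered events in the right partition."""
        session_type = "Robot" if is_robot else "Normal"
        for event in self.pending.pop(session_id, []):
            self.learned.append((session_type, event))


learner = DeferredLearner()
learner.record("s1", "view-offer")
learner.record("s1", "click-offer")
learner.close_session("s1", is_robot=True)
print(learner.learned)
```

Had the first event been learned at its entry point, it would already sit in the "Normal" partition by the time the session was flagged.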
- Maintain a list of known robotic IPs or domains
- Detect very long sessions, lasting more than a few hours or visiting more than 500 pages
- Detect "robotic" behaviors, like methodically clicking every link on every page
- Detect a session with 10 pages clicked at exactly 20-second intervals
- Detect extensive non-linear navigation
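A few of the heuristics above can be sketched in Python as follows. This is a rough illustration, not a definitive detector: the Session class, the function names, the 3-hour reading of "a few hours", and the half-second tolerance are all assumptions, and the interval check is generalized from "exactly 20 seconds" to any constant interval.

```python
from dataclasses import dataclass
from typing import List, Set

MAX_SESSION_SECONDS = 3 * 60 * 60  # assumption: "a few hours" taken as 3
MAX_PAGES = 500                    # from the list above
REGULAR_RUN_PAGES = 10             # pages in a row at a constant interval
TOLERANCE_SECONDS = 0.5            # assumption: allowed jitter between gaps


@dataclass
class Session:
    ip: str
    page_times: List[float]  # request timestamps in seconds, ascending


def has_regular_run(times: List[float]) -> bool:
    """True if REGULAR_RUN_PAGES consecutive pages are evenly spaced."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    needed = REGULAR_RUN_PAGES - 1  # 10 pages have 9 gaps between them
    for start in range(len(gaps) - needed + 1):
        window = gaps[start:start + needed]
        if max(window) - min(window) <= TOLERANCE_SECONDS:
            return True
    return False


def classify(session: Session, known_robot_ips: Set[str]) -> str:
    """Return the partition value for this session: "Robot" or "Normal"."""
    times = session.page_times
    if session.ip in known_robot_ips:
        return "Robot"
    if len(times) > MAX_PAGES:
        return "Robot"
    if times and times[-1] - times[0] > MAX_SESSION_SECONDS:
        return "Robot"
    if has_regular_run(times):
        return "Robot"
    return "Normal"


crawler = Session("198.51.100.7", [i * 20.0 for i in range(10)])
visitor = Session("203.0.113.9", [0.0, 5.0, 40.0, 47.0, 120.0])
print(classify(crawler, set()), classify(visitor, set()))
```

The link-coverage and non-linear-navigation heuristics are left out because they need the site's link structure, which this sketch does not model.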
In any case, the basic technique of partitioning the models by the type of session is simple to implement and has two clear advantages: reports and predictions stay clean, and the robot statistics are not lost.