Wednesday Nov 17, 2010

Performance Tips

As RTD implementations become more sophisticated and applications extend the reach of decisions far beyond selecting the next best offer, we have been recommending a number of design decisions to ensure the desired level of performance.

By far the most significant factor affecting performance is external system access, in particular database access, for reads as well as writes. Here are a few tips that are easy to implement and good to keep in mind when designing a configuration:

  1. Data that is repeatedly used in decisions by different sessions should be cached. Examples include Offer Metadata, Product Catalog, Content Catalog, and Zip Code Demographics.
  2. If possible, data that will be needed in decisions should be pre-fetched. For example, customer profiles could be loaded at the very beginning of a session.
  3. A good storage system for this data is an Oracle Coherence cache, particularly if it is configured with local storage in the same application server as RTD. The data can include the customer profile, event history, etc.
  4. When writing to the database, if the write does not need to be transactional and synchronous, use RTD's batch writing capabilities. This can increase write performance by an order of magnitude.
  5. Avoid unnecessary writes and writing unnecessary data. For example, avoid writing metadata together with event data if the metadata can be linked.
  6. Consider using stored procedures when updating several tables, to minimize roundtrips to the database.
  7. If a result set is potentially very large, consider wrapping the query with a stored procedure that limits the number of rows returned. For example, if the application calls for loading the purchase history of a customer, and the median length of the list is 3 purchases, but there are 15 customers with 10,000 purchases or more, processing these [good] customers will take a long time. It may be acceptable, from the point of view of the application logic, to load only the latest 100 purchases.
  8. When loading metadata, avoid loading data that will not be used. For example, if there are 500k products in the catalog, but realistically only 90k have any real chance of being selected for a recommendation, do the filtering when loading the data and avoid loading the full list.
  9. Asynchronous processing is not free - avoid unnecessary processing. For example, a decision may take 5 ms of actual CPU processing, which limits the theoretical throughput of a single CPU to 200 decisions per second. If we add 15 ms of asynchronous processing per decision, response time is not affected, but throughput is - the theoretical throughput drops to 50 decisions per second (see the sketch after this list).
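
To make tip 9 concrete, here is a minimal back-of-the-envelope calculation using the numbers from the example (they are illustrative, not measurements):

    // Rough throughput arithmetic for tip 9; example numbers only.
    public class ThroughputEstimate {
        public static void main(String[] args) {
            double decisionCpuMs = 5.0;   // synchronous CPU time per decision
            double asyncCpuMs = 15.0;     // additional asynchronous CPU work per decision

            double syncOnly = 1000.0 / decisionCpuMs;                  // 200 decisions/sec per CPU
            double withAsync = 1000.0 / (decisionCpuMs + asyncCpuMs);  // 50 decisions/sec per CPU

            System.out.printf("Sync only: %.0f/sec, with async work: %.0f/sec%n", syncOnly, withAsync);
        }
    }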

In addition to these tips, it is also important for the environment to be properly set up to achieve peak performance. Some tips include:

  1. Prefer physical servers to virtualized ones
  2. Always plan for at least two CPU cores per JVM
  3. Make sure the memory requirements and JVM settings match the available physical memory, so that JVM memory is never swapped (see the example after this list)
  4. If using virtualized servers, make sure that CPUs are not overallocated. That is, do not run 5 virtual machines configured for 2 CPUs each on an 8-core system. While such a setup may be acceptable for some applications, with throughput-intensive applications like RTD it will certainly cause performance problems, and these problems are difficult to diagnose.
  5. If using virtualized servers, make sure the virtual machine's configured memory will be resident in physical memory. If the guest believes it has 4 GB of memory but the host needs to use swap to provide that amount, performance and availability will suffer. Problems in this area are very difficult to diagnose because, to the guest OS, it looks as if CPU cycles were stolen.
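
As an illustration of tip 3, a common approach is to pin the JVM heap to a fixed size that fits comfortably in physical memory; the sizes below are placeholders, not recommendations, and where these settings go depends on the application server used to run RTD:

    # Illustrative only: fix min and max heap to the same value, sized so that
    # heap + JVM overhead + OS fit in physical RAM and are never swapped.
    java -Xms4g -Xmx4g ...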

Thursday Jun 10, 2010

Is RTD Stateless or Stateful?


A stateless service is one where each request is an independent transaction that can be processed by any of the servers in a cluster. A stateful service is one where state is kept in a server's memory from transaction to transaction, thus necessitating the proper routing of requests to the right server. The main advantage of stateless systems is simplicity of design. The main advantage of stateful systems is performance.

I'm often asked whether RTD is a stateless or stateful service, so I wanted to clarify this issue in depth so that RTD's architecture will be properly understood.

The short answer is: "RTD can be configured as a stateless or stateful service."

The performance difference between stateless and stateful systems can be very significant, and while in a call center implementation it may be reasonable to use a pure stateless configuration, a web implementation that produces thousands of requests per second is practically impossible with a stateless configuration.

RTD's performance is orders of magnitude better than that of most competing systems. RTD was architected from the ground up to achieve this performance. Features like automatic and dynamic compression of prediction models, automatic translation of metadata to machine code, the absence of interpreted languages, and the separation of model building from decisioning all contribute to this performance level. Because of this focus on performance we decided to have RTD's default configuration work in a stateful manner. Because it is stateful, RTD typically handles requests in a few milliseconds when repeated requests come to the same session.

Now, those readers that have participated in implementations of RTD know that RTD's architecture is also focused on reducing Total Cost of Ownership (TCO) with features like automatic model building, automatic time windows, automatic maintenance of database tables, automatic evaluation of data mining models, automatic management of models partitioned by channel, geography, etcetera, and hot swapping of configurations.

How do you reconcile the need for a low TCO and the need for performance? How do you get the performance of a stateful system with the simplicity of a stateless system? The answer is that you make the system behave like a stateless system to the exterior, but you let it automatically take advantage of situations where being stateful is better.

For example, one of the advantages of stateless systems is that you can route a message to any server in a cluster, without worrying about sending it to the same server that handled the session's previous messages. With an RTD stateful configuration you can still route the message to any server in the cluster, so from the point of view of configuring other systems it is the same as a stateless service. The difference comes in performance: if the message arrives at the right server, RTD can serve it without any external access to the session's state, tremendously reducing processing time. In typical implementations it is not rare to have a high percentage of messages routed directly to the right server, while those that are not are easily handled by forwarding them to the right server. This architecture usually provides the best of both worlds: the performance of a stateful system with the simplicity of a stateless configuration.

Configuring RTD as a pure stateless service

A pure stateless configuration requires session data to be persisted at the end of handling each and every message and reloaded at the beginning of handling any new message. This is, of course, the root of the inefficiency of these configurations. It is also the reason why many "stateless" implementations actually do keep state, to take advantage of a request coming back to the same server. Nevertheless, if the implementation requires a pure stateless decision service, this is easy to configure in RTD. The way to do it is (a rough sketch follows the list):

  1. Mark every Integration Point to Close the session at the end of processing the message
  2. In the Session entity, persist the session data on closing the session
  3. In the Session entity, check whether a persisted version exists and, if so, load it
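
The persistence logic in steps 2 and 3 amounts to a save-on-close / load-on-open pattern. The sketch below is purely illustrative - the class and method names are hypothetical and are not RTD's session API - and it uses a generic key-value store standing in for a distributed cache or a database table:

    // Illustrative sketch only; hypothetical names, not RTD's session API.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class StatelessSessionStore {
        // Stand-in for a distributed cache or a database table keyed by session key.
        private final Map<String, byte[]> store = new ConcurrentHashMap<>();

        // Step 2: on closing the session, persist its serialized state.
        public void saveOnClose(String sessionKey, byte[] serializedSession) {
            store.put(sessionKey, serializedSession);
        }

        // Step 3: at the start of message handling, load the persisted state if it exists.
        public byte[] loadIfPresent(String sessionKey) {
            return store.get(sessionKey); // null means start with a fresh session
        }
    }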

An excellent solution for persisting the session data is Oracle Coherence, which provides a high performance, distributed cache that minimizes the performance impact of persisting and reloading the session. Alternatively, the session can be persisted to a local database.

An interesting feature of the RTD stateless configuration is that it can serialize concurrent requests for the same session. For example, if a web page produces two requests to the decision service, these requests could arrive concurrently and be handled by different servers. Most stateless implementations would have the two requests step on each other when saving the state, or fail one of the messages. When properly configured, RTD will make one message wait for the other before processing.

A Word on Context

Using the context of a customer interaction typically increases lift significantly. For example, offer success in a call center could double if the context of the call is taken into account. For this reason, it is important to utilize contextual information in decision making. To make the contextual information available throughout a session it needs to be persisted. When there is a well-defined owner for the information there is no problem, because in the case of a session restart the information can easily be retrieved. If there is no official owner of the information, then RTD can be configured to persist it.

Once again, RTD provides flexibility to ensure high performance when it is adequate to allow some loss of state in the rare case of server failure. For example, in a heavily used web site that serves 1000 pages per second, the navigation history may be stored in the in-memory session. In such sites it is typical that there is no OLTP system storing all the navigation events; therefore, if an RTD server were to fail, the navigation up to that point could be lost (note that a new session would immediately be established on one of the other servers). In most cases the loss of this navigation information would be acceptable, as it would happen rarely. If it is desired to save this information, RTD would persist it every time the visitor navigates to a new page.

Note that this practice is preferred whether RTD is configured in a stateless or stateful manner.

Thursday May 27, 2010

Tips on ensuring Model Quality

Given enough data that represents the domain well, and models that exactly reflect the decision being optimized, models usually provide good predictions that ensure lift. Nevertheless, sometimes the modeling situation is less than ideal. In this blog entry we explore the problems found in a few such situations and how to avoid them.

1 - The Model does not reflect the problem you are trying to solve

For example, you may be trying to solve the problem: "What product should I recommend to this customer?" but your model learns on the problem: "Given that a customer has acquired our products, what is the likelihood for each product?". In this case the model you built may be too distant a proxy for the problem you are really trying to solve. What you could do in this case is try to build a model based on the results of actual recommendations of products to customers. If there is not enough data from actual recommendations, you could use a hybrid approach in which you use the [bad] proxy model until the recommendation model converges.

2 - Data is not predictive enough

If the inputs are not correlated with the output then the models may be unable to provide good predictions. For example, if the inputs are the phase of the moon and the weather, and the output is which car the customer bought, no correlation may be found. In this case you should see a low-quality model.

The solution in this case is to include more relevant inputs.

3 - Not enough cases seen

If the data used for learning does not include enough cases - at least 200 positive examples for each output - then the quality of recommendations may be low.

The obvious solution is to include more data records. If this is not possible, then it may be possible to build a model based on the characteristics of the output choices rather than the choices themselves. For example, instead of using products as output, use the product category, price and brand name, and then combine these models.

4 - Output leaking into input giving the false impression of good quality models

If the input data used in training includes values that have changed, or that are available only because the output happened, then you will find some strong correlations between the input and the output, but these strong correlations do not reflect the data that will be available at decision (prediction) time. For example, if you are building a model to predict whether a web site visitor will succeed in registering, and the input includes the variable DaysSinceRegistration, and you learn when this variable has already been set, you will probably see a strong correlation between having a zero (or a one) in this variable and the fact that the registration was successful.

The solution is to remove these variables from the input or make sure they reflect the value as of the time of decision and not after the result is known.

Wednesday Apr 07, 2010

The softer side of BPM

BPM and RTD are great complementary technologies that together provide a much higher benefit than each of them separately. BPM covers the need for automating processes, making sure that there is uniformity, that rules and regulations are complied with, and that the process runs smoothly and quickly for the units flowing through it.

By nature, this automation and unification can lead to a stricter, less flexible process. To avoid this problem it is common to encounter process definitions that include multiple conditional branches and human input to help direct processing in the direction that best applies to the current situation. This is where RTD comes into play. The selection of branches and conditions, and the optimization of decisions, is better left in the hands of a system that can measure the results of its decisions in a closed-loop fashion and make decisions based on the empirical knowledge accumulated by observing the running process.

When designing a business process there are key places in which it may be beneficial to introduce RTD decisions. These are:

  • Thresholds - whenever a threshold is used to determine the processing of a unit, there may be an opportunity to make the threshold "softer" by introducing an RTD decision based on predicted results. For example an insurance company process may have a total claim threshold to initiate an investigation. Instead of having that threshold, RTD could be used to help determine what claims to investigate based on the likelihood they are fraudulent, cost of investigation and effect on processing time.
  • Human decisions - sometimes a process will let the human participants make flow decisions. For example, a call center process may leave the escalation decision to the agent. While this provides flexibility, it may produce undesired results and asymmetry in customer treatment that is based not on objective functions but on the subjective reasoning of the agent. Instead, an RTD decision may be introduced to recommend escalation or other kinds of treatment.
  • Content Selection - a process may include the use of messaging with customers. The selection of the most appropriate message to the customer given the content can be optimized with RTD.
  • A/B Testing - a process may have optional paths for which it is not clear which populations they work better for. Rather than making an arbitrary selection, or a selection by committee, of the option deemed best, RTD can be introduced to dynamically determine the best path for each unit.

In summary, RTD can be used to make BPM-based process automation more dynamic and adaptable to the different situations encountered in processing, effectively making the automation softer and less rigid. In order for this to work, the people responsible for the process need to understand which KPIs the business is really interested in optimizing, and make a concerted effort to measure against those KPIs and optimize the process to achieve better results. The benefit of making better decisions in a process flow can be tremendous, as exemplified by many current RTD implementations.

Monday Mar 22, 2010

Ignoring Robots - Or Better Yet, Counting Them Separately

It is quite common to have web sessions that are undesirable from the point of view of analytics - for example, when internal or external robots check the site's health, index it or just extract information from it. These robotic sessions do not behave like humans, and if their volume is high enough they can sway the statistics and models.

One easy way to deal with these sessions is to define a partitioning variable for all the models that is a flag indicating whether the session is "Normal" or "Robot". Then all the reports and the predictions can use the "Normal" partition, while the counts and statistics for Robots are still available.

In order for this to work, though, it is necessary to have two conditions:

1. It is possible to identify the Robotic sessions.
2. No learning happens before the identification of the session as a robot.

The first point is obvious, but the second may require some explanation. While the default in RTD is to learn at the end of the session, it is possible to learn at any entry point; this is a setting for each model. There are various reasons to learn at a specific entry point, for example when there is a desire to capture precisely the data in the session at the time the event happened, as opposed to including changes up to the end of the session.

In any case, if RTD has already learned on the session before it was identified as robotic, there is no way to retract this learning.

Identifying the robotic sessions can be done through the use of rules and heuristics. For example, we may use some of the following (a rough sketch follows the list):

  1. Maintain a list of known robotic IPs or domains
  2. Detect very long sessions, lasting more than a few hours or visiting more than 500 pages
  3. Detect "robotic" behaviors like a methodic click on all the link of every page
  4. Detect a session with 10 pages clicked at exactly 20 second intervals
  5. Detect extensive non-linear navigation
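
As a rough illustration, a few of these heuristics could be combined into a simple flagging function; the thresholds and field names below are invented for the example:

    // Illustrative robot-flagging heuristics; thresholds and fields are hypothetical.
    import java.util.Set;

    public class RobotHeuristics {
        private static final Set<String> KNOWN_ROBOT_DOMAINS =
                Set.of("crawler.example.com", "monitor.example.net");

        public static boolean looksLikeRobot(String clientDomain,
                                             long sessionMinutes,
                                             int pagesVisited,
                                             double avgSecondsBetweenClicks) {
            if (KNOWN_ROBOT_DOMAINS.contains(clientDomain)) return true;           // rule 1
            if (sessionMinutes > 180 || pagesVisited > 500) return true;           // rule 2
            if (pagesVisited >= 10
                    && Math.abs(avgSecondsBetweenClicks - 20.0) < 0.5) return true; // rule 4
            return false;
        }
    }
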
Now, an interesting experiment would be to use the flag above as an output of a model to see if there are more subtle characteristics of robots such that a model can be used to detect robots, even if they fall through the cracks of rules and heuristics.

In any case, the basic and simple technique of partitioning the models by the type of session is simple to implement and provides a lot of advantages.

Monday Feb 22, 2010

The problem with Process Automation is Automation itself

Automation - (Noun) the use of machines to do work that was previously done by people

Replacing people with machines makes it possible to tremendously increase the capacity of a process, which has obvious economic advantages. Automation has been successful in replacing people's work and improving many aspects of the process in addition to capacity. For example, automated processes provide much more uniform processing of units.

So what is wrong with Automation? Nothing really, except that there are a few things that people do better than machines. My two favorite human characteristics that tend to be lost with automation are:

  1. The capability of the process to learn
  2. The capability of people to discern between different cases

With automation we are able to run the same process, again and again, sometimes repeating the same mistake, again and again. With automation we tend to treat every unit the same way.

Let's take the simple example of automating the answering of the phone. Most companies today use IVR software to answer the phone, but how many differentiate between callers? If a valuable bank customer who is approaching retirement age calls the bank after not calling for 5 years, how many banks will actually do the right thing with this customer, which is to kidnap the customer from the IVR and connect them directly with the best agent? How many companies are set up to discover that a problem affecting 1% of their callers cannot be solved in the IVR, yet these customers still have to go through a frustrating tree of options to get to talk with a person who can actually help them?

If there were an actual human capable of watching all the interactions in the IVR, seeing the short- and long-term results of these calls, and affecting the way decisions are made in the IVR, the results from automation would be much better.

RTD was designed to infuse these missing elements - learning and differentiation (sometimes called "personalization") - into business processes, taking us a step further toward better automation of business processes; not yet matching all the capabilities of humans, but at least bringing some "common sense" into it.

Thursday Jan 28, 2010

Precomputed List of Next Best Offers = Bad Idea

What is the difference between having a batch process that computes the Next Best Offer for every customer every night and computing the best offer in real time?

It is all about context. Any precomputed offer list cannot possibly take into account the context of the interaction between the customer and the company. Examples of attributes that cannot be taken into account in a prebuilt list:

  • Call Reason
  • Recent and Last Transaction
  • Exact state of the account
  • Time of the interaction
  • User Agent (iPhone, Computer, Phone, etc.)
  • Call center agent answering the call

Without utilizing this kind of information you are certain to make the wrong decision in many cases. For example, a customer may be amenable to listening to and accepting an offer if they are calling the service call center in the evening and have received a satisfactory resolution to a service call, while the same customer accessing the site at 10:30 in the morning with the iPhone browser would most likely not be open to any offers at that time.

It has been my experience that in Real Time Marketing implementations in call centers the actual agent answering the call is always in the top 5 predictors that influence the selection of the best offer. Similarly, the call reason and the time of the call tend to be very good predictors.

It is important to understand the difference between inbound and outbound marketing. In addition to the obvious difference in the attitude of the customer and their openness to interact with the company, there is a fundamental difference from the point of view of the customer data. In outbound marketing I can compute the best offer for a customer and then call them a few hours or days later, and there is no reason to assume the customer's data will have changed significantly in most cases - only the statistically regular changes apply. In contrast, in inbound marketing I am assured that the customer's data will have changed by the time I am ready to make an offer at the tail end of a call - after all, 100% of those callers decided to call the company for some reason.

Sunday Jan 17, 2010

It's not all about offers

Unfortunately, for too many people, managing their company's relationship with its customers is all about offers. This narrow view of the customer fails to recognize that there are many decisions the company makes day to day that affect its relationship with the customer. These decisions include:

  • Selection of content to present to the customer
  • Selection of process and process alternatives
  • Product offerings
  • Offers
  • Solution to product or service issues
  • Proactive notifications
  • Fraud detection and avoidance

These decisions are made in the context of different business goals. Like:

  • Increasing revenue
  • Reducing cost
  • Enhancing customers' wallet share
  • Increasing brand recognition
  • Fulfilling partner commitments
  • Providing good customer service
  • Increasing loyalty
  • Controlling fraud

The catalog of possible selections for each decision can come from many sources, including:

  • Campaign management
  • Content management
  • Product catalog
  • Risk Rules
  • Process actions

RTD was designed from the ground up to optimize this variety of decisions, balancing the many competing business goals and selecting items from many different sources, without necessarily owning the metadata for those items. This is in contrast with the view of the world where everything looks like an offer and the only goal that matters is an immediate increase in revenue.

Monday Jan 04, 2010

Measuring Reality is much easier than Reconstructing it

When asked about the accuracy of RTD's data mining algorithms I often find myself explaining the reasons behind my belief that as a system RTD is much more accurate than any offline data mining system in most cases. One of the reasons for the enhanced accuracy is the capability of directly measuring reality rather than trying to reconstruct it from disconnected data sources.

For example, assume that you are studying the acceptance of offers in a call center. One of the inputs that may be interesting is the length of the queue at the time of the call. In an offline exercise you would have to obtain the logs from the telephony queue, hope that they are kept at sufficient accuracy, hope that the clocks in the systems are synchronized, and then query the log with a time-based query to sort the log records. The same thing in RTD is accomplished by simply querying the telephony queue for its current length at the time of the call. There is no need to hope that data was collected properly, at the right granularity and with synchronized clocks. As we are dealing with reality as it happens, we do not care if the clocks are all wrong.

The end result of the difficulty in reconstructing reality is that typical offline data mining studies have much narrower inputs than those typically seen in RTD implementations. The difference in data availability in many cases more than makes up for possible accuracy improvements gained from a manually crafted data mining model.

Just to complete the picture I have to point out that I said "many cases" or "most cases" but not "all cases". The reason for that is that there are many good reasons to perform off-line data mining and it is worth investing in getting the data and complex queries right. Examples include retention, life-time value and in some cases product affinity models. There are also many areas for which RTD algorithms are not applicable, like data exploration, visualization and clustering.

Nevertheless, for predictive data mining applied to process improvement it is hard to beat the real-time data collection capabilities of real-time analytics systems.

Friday Dec 18, 2009

Learning and predicting for short and long term events

In many RTD deployments we see that the business wants to optimize decisions based on the long-term effect of the decision. For example, selecting a retention offer to display to a customer on the web site should not be driven by the likelihood that the customer will click on the offer, but by the likelihood that the customer will have been retained, say, after 3 months.

Another simpler example is the decision by a bank to offer a credit card to a customer. The events in this situation may be:

  1. Offer Extended
  2. Clicked
  3. Applied for card
  4. Used card

The goal of the bank is to have the customer use the card. The problem is that the feedback on whether the card is used will come weeks after the initial offering. This not only requires the capability of closing the loop at a later time (the subject of a future entry in this blog), but it also leaves us for a long time without reliable models.

RTD provides built-in functionality to handle these cases gracefully, utilizing the maximum of available information. This is why an RTD model can have more than one positive event, and why their order matters. The idea behind this feature is that the events are naturally ordered: the use of the card comes after the application and after the click. Therefore, a model for the "Click" event can be used as a proxy for the deeper events for as long as we do not have a good model for them.

Using a closer event as a proxy for the farther one is a good strategy, but it requires management of events, levels of conversion, etc. and it gets even more complicated when you think that the different offers can be at different levels of conversion. RTD does all this management automatically.

Before we describe how RTD makes this all work, there is one more consideration. When comparing offers it is not fair to compare the likelihood of click for one offer with the likelihood of Card Use for another offer.

The way that RTD works is as follows. When computing the likelihood for a choice:

  1. Compute the likelihood for the deepest event for which we have a converged model
  2. If the event is the desired one (usually the deepest) stop here and use this likelihood
  3. Compute the average likelihood across all choices for all the events that are deeper than the one we used in step 1
  4. Using the average likelihoods compute the proportion between the different events and apply that proportion to the likelihood we got in step 1.
For example, if the only likelihood that can be computed for a specific choice is Click, and it is 10%, and the averages across all other choices are:

  • Click : 20%
  • Apply : 12%
  • Use: 8%
Then for our choice we take the 10% and multiply it by 8/20 to get the likelihood of use, which gives 4% if I am not mistaken. The likelihood of Apply for the same choice (and customer) would be 6%.
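
A minimal sketch of this proportional scaling, using the numbers above (the method is illustrative, not RTD's internal code):

    // Scale the likelihood of a proxy event to a deeper event using average likelihoods.
    public class ProxyLikelihood {
        static double scale(double proxyLikelihood, double avgProxy, double avgTarget) {
            return proxyLikelihood * (avgTarget / avgProxy);
        }

        public static void main(String[] args) {
            double click = 0.10;                                    // only converged model for this choice
            double avgClick = 0.20, avgApply = 0.12, avgUse = 0.08; // averages across all choices
            System.out.println(scale(click, avgClick, avgUse));     // ~0.04, likelihood of Use
            System.out.println(scale(click, avgClick, avgApply));   // ~0.06, likelihood of Apply
        }
    }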

As mentioned before, if you define the events in the proper order as the positive events for the model RTD will take care of the logistics for you, following the algorithm described above.

Happy Holidays.

Saturday Dec 12, 2009

Evaluating Models and Prediction Schemes

It has become quite common in RTD implementations to utilize different models to predict the same kind of value in different situations. For example, if the RTD application is used for optimizing the presentation of Creatives, where the Creatives belong to Offers, which in turn belong to Campaigns, which belong to Products, which belong to Product Lines, it may be desirable to predict at the different levels and use the models in a waterfall fashion as they converge and become more precise.

Another example is when using more than one Model or algorithm, whether internal to RTD or external.

In all these cases it is interesting to determine which of the models or algorithms is better at predicting the output. While RTD's Decision Center provides good Model Quality reports that can be used to evaluate the internal RTD models, the same may not exist for external models. Furthermore, it may be desirable to evaluate the different models on a level playing field, utilizing just one metric that can be used to select the "best" algorithm.

One method of achieving this goal is to use an RTD model to perform the evaluation. This pattern is commonly used in Data Mining to "blend" models or create an "ensemble" of models. The idea is to have the predictors as input and the normal positive event as output. When doing this in RTD, the Decision Center Predictiveness report provides the sorting of the different predictors by their predictiveness.

To demonstrate this I have created an Inline Service (ILS) whose sole purpose is to evaluate predictors which represent different levels of noise over a basic "perfect" predictor. The attached image represents the result of this ILS.

The "Perfect" predictor is just a normally distributed variable centered at 3% with a standard deviation of 7%, limited to the range 0 to 1. The output variable follows exactly the probability given by the predictor. For example, if the predictor is 13% there is a 13% probability of the positive output.

The other predictors are defined by taking the perfect predictor and adding a noise component. The noise is also normally distributed and has a standard deviation that determines the amount of noise.


 For example, the "Noise 1/5" predictor has noise with a standard deviation of 20% (1/5) of the value of the prefect predictor.

You can see that the RTD blended model nicely discovers that the more noise there is in the predictor, the less predictive it is.

This kind of blended model can also be used to create a combined model that has the potential of being better than each of the individual models. This is particularly interesting when the different models are really different, for example because of the inputs they use or because of the algorithms used to develop the models.

If you want a copy of the ILS send me an email.

Monday Dec 07, 2009

Using RTD for recommendations from large number of items

I am often asked whether RTD can be used to recommend items when the number of available items is extremely large, from tens of thousands to a couple of million. These situations can be encountered in a number of different industries, including retail, media outlets and portals, and news organizations.

Traditional approaches to these situations include Market Basket Analysis and Collaborative Filtering. Collaborative Filtering has its strength in extracting affinity information from ratings, and a good CF algorithm can exploit ratings data to extract every last bit of information from it. So these traditional approaches do have their advantages; nevertheless, they are clearly limited in the following ways:

  1. They cannot recommend new items
  2. They cannot issue recommendations to new users
  3. They require vast numbers of baskets or ratings to cover the space with statistically significant data
  4. They do not provide flexibility in selecting recommendations to optimize for varying and conflicting business goals

With RTD we are capable of overcoming these limitations by using a technique that does not necessarily involve clustering of items or users, and does not start from scratch for every new item.

Intuitively, it should be clear that recommendations of the movie "Terminator 3" will follow similar patterns to "Terminator 2", so when T3 appears, the knowledge about T2 can be used as a good approximation. Similarly, the demographic and behavioral data about a user, together with the context of an interaction, can give us big clues about what the person will be interested in, even if we have not seen any purchases or ratings from that person.

The way we do item recommendations with RTD in this context is to compute the likelihood that an item will be of interest by dividing the computation into a two-layer model network, where the base layer computes the affinity of the user with the characteristics of the item, and the second layer uses one model to blend the results of the first layer into one final prediction.
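
Conceptually, the two layers can be sketched as follows; the interfaces and characteristic names are invented for illustration and do not correspond to RTD metadata:

    // Conceptual two-layer scoring sketch; all names are illustrative only.
    import java.util.Map;
    import java.util.function.ToDoubleFunction;

    public class TwoLayerRecommender {
        // Base layer: one affinity model per item characteristic (e.g. genre, brand, price band).
        private final Map<String, ToDoubleFunction<String>> affinityModels;
        // Second layer: a single model that blends the base-layer scores into one likelihood.
        private final ToDoubleFunction<double[]> blender;

        public TwoLayerRecommender(Map<String, ToDoubleFunction<String>> affinityModels,
                                   ToDoubleFunction<double[]> blender) {
            this.affinityModels = affinityModels;
            this.blender = blender;
        }

        // item maps characteristic name -> value for the candidate item.
        public double likelihood(Map<String, String> item) {
            double[] scores = item.entrySet().stream()
                    .filter(e -> affinityModels.containsKey(e.getKey()))
                    .mapToDouble(e -> affinityModels.get(e.getKey()).applyAsDouble(e.getValue()))
                    .toArray();
            return blender.applyAsDouble(scores);
        }
    }

Because the base layer works on item characteristics rather than item identities, a brand new item gets meaningful scores from day one, which is exactly the limitation of pure collaborative filtering that this approach avoids.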

Wednesday Nov 25, 2009

Measuring reality is much easier than reconstructing it

The title of this entry says it all. When it comes to collecting data for any analytic work, it is much easier to measure the current data than attempting to reconstruct it from historical databases.

For example, assume you need to analyze the factors that affect cross selling success in the call center and you want to include data like the wait time in the queue or the number of calls the agent answered in the current shift before the call where cross selling was attempted. Collecting this data from history is very complex because:

  1. Not all data is collected all the time
  2. Data from different systems may end up in very disparate historical databases
  3. Different data may have different retention periods and granularity
  4. Different systems may have uncoordinated clocks
  5. Queries become very complex when trying to pinpoint the state of a data record at a specific time
  6. Queries become complex in order to include only events that happened before the point in time in question

For all these reasons and more, it is much easier to perform analytics in real time, when reality can be measured by directly connecting to other systems. For example, it does not matter if the clocks in the different systems are totally uncoordinated or work in different time zones; all I need to worry about is retrieving the latest data. Similarly, if I need to know the city a person lives in, I just retrieve it from the DB; there is no need to go through the list of address changes.

This is one of the reasons I believe that even if you can hand-craft very accurate models, the real time models automatically generated by a self learning system can, in many cases, end up being much more accurate because they can take advantage of more data that is also more accurate.

Sunday Nov 22, 2009

Sizing an RTD installation - Part 3 (final)

Now that we know how to compute the number of requests per second and we have seen other things that need to be considered, we can finally compute the number of CPUs to cope with the desired load. This number is actually quite easy to compute. For planning purposes we usually account for 100 requests/second/CPU. This leaves enough room for higher peak loads or other underestimations in the process. In typical cases we see a higher throughput per CPU.

For example, if we need to support 300 requests per second we can plan for 3 CPUs for the Decision Service. The other processes, Learning Server and Workbench Server, can usually run either on one of the Decision Service CPUs or on their own.

Now, let's say that there is a desire to use standard servers with 2 CPUs, each CPU with 4 cores. In this case, one server would have more than enough computing power to cope with the number of requests per second. Nevertheless, we may choose to have 2 of these servers - that is, 16 cores in total - to provide high availability.

If this same configuration was used with Disaster Recovery then we may end up running two servers in two sites with a total of 32 CPU cores. That, of course, is more computing power than necessary to cope with the load.
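
Putting the arithmetic of this example in one place (all figures are the planning numbers used above, not measurements):

    // Sizing arithmetic for the example above.
    public class SizingExample {
        public static void main(String[] args) {
            int requestsPerSecond = 300;
            int plannedRequestsPerCpu = 100;       // conservative planning figure
            int cpusForLoad = (int) Math.ceil(requestsPerSecond / (double) plannedRequestsPerCpu); // 3

            int coresPerServer = 2 * 4;            // 2 CPUs x 4 cores each
            int coresWithHa = 2 * coresPerServer;  // 2 servers for HA -> 16 cores
            int coresWithDr = 2 * coresWithHa;     // 2 sites for DR -> 32 cores

            System.out.printf("Load: %d CPUs; HA: %d cores; HA + DR: %d cores%n",
                    cpusForLoad, coresWithHa, coresWithDr);
        }
    }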

An alternative that is counterintuitive for people running transactional applications is to have RTD running on just one server and pay the price of non-availability. This may be acceptable depending on the application. For example, in offer optimization, if the expected downtime of a single server is just a couple of hours per year, then the cost of having non-redundant servers may be better than the cost of an HA setup.

In any case, the numbers above are for basic planning purposes. If there are many sessions being initialized and not so many other kinds of events then the equations may look different as a session initialization usually takes more resources. Additionally, the load balancing strategy in front of the RTD servers also affects performance. Maximum speed is attained when the load balancing scheme is capable of maintaining session affinity.

Finally, for really high throughput in the thousands of requests per second, the strategy is to partition the servers along some strict lines. This partitioning strategy can be taken all the way into the database.

Sunday Nov 15, 2009

Sizing an RTD installation - Part 2

Now that we have the expected throughput in terms of the number of requests per second, lets look at other sizing factors.

Response time  - sometimes the volume of requests is not smoothly distributed and there may be peaks of requests coming at the same time. If there are strict response time requirements, like having an average below 30ms with a maximum of 60ms for 99% of the requests, then we need to consider the maximum number of requests that are going to be processed in parallel. To achieve the highest performance for the highest number of requests we will design for between 3 and 6 requests being processed in parallel per CPU core or hardware hyperthread.
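
As a rough illustration of how this guideline translates into numbers (the peak figure is hypothetical):

    // Illustrative core count from the 3-6 concurrent requests per core guideline.
    public class ConcurrencySizing {
        public static void main(String[] args) {
            int peakConcurrentRequests = 120;  // hypothetical peak from the traffic profile
            int minPerCore = 3, maxPerCore = 6;
            int coresUpper = (int) Math.ceil(peakConcurrentRequests / (double) minPerCore); // 40
            int coresLower = (int) Math.ceil(peakConcurrentRequests / (double) maxPerCore); // 20
            System.out.println("Plan for roughly " + coresLower + " to " + coresUpper + " cores");
        }
    }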

Session Initializations - when a session is initialized, a few extra things happen compared with requests that come after initialization. First, depending on whether the RTD server manages session affinity, a new entry is created in the sessions table, which typically requires at least one database write. Additionally, the in-memory session is typically filled from the configured data sources. The speed of these operations is driven entirely by the performance of the source databases. If an application has many more session initializations than other types of messages, then throughput may be affected even though the total number of requests is not too high for the configuration.

Single Point of Failure and High Availability - in most cases the system is configured to provide High Availability (HA) and resiliency to server failure or lack of availability (for rolling maintenance for example). RTD is typically configured with a number of servers to avoid the single point of failure. Sometimes it is also configured with multiple sites for HA and Disaster Recovery (DR). In this context it is important to consider the option of relying on default responses to cope with outages of the RTD servers. I know of one RTD server that has been working since 2005 and has been down for maintenance only for a few hours total since it started.

In the next entry we will finally talk about the sizing of the servers.

