Tuesday Nov 20, 2012

Video testimonial from Dell on their use of RTD

Dell Director Mark Sucrese describes how Oracle’s Real-Time Decisions dynamically personalizes products and services through predictive and optimization analytics across many customer-facing vehicles to improve revenue and the customer experience.


Wednesday May 30, 2012

A conversation with world experts in Customer Experience Management in Rome, Italy - Wed, June 20, 2012

It is my pleasure to share the registration link below for your chance to meet active members of the Oracle Real-Time Decisions Customer Advisory Board.

Join us to hear how leading brands across the world have achieved tremendous return on investment through their Oracle Real-Time Decisions deployments, and do not miss this unique opportunity to ask them specific questions directly during our customer roundtable.

Please share this information with anyone interested in real-time decision management and cross-channel predictive process optimization: http://www.oracle.com/goto/RealTimeDecisions

Thursday Mar 08, 2012

The Era of the Decision Graph

Gone are the days when “electronic billboards” for targeted merchandizing programs were leading edge.

Over the course of the last few years we have observed a dramatic qualitative shift in how companies apply Analytical Decision Management techniques to drive Customer Experience Optimization programs. It used to be the case that marketers were happy when they were allocated a dedicated piece of real estate on their company’s web site (or contact center, or any interaction channel for that matter) that they could use at their discretion for one-off targeting programs. What companies now want is granular control over the whole user experience, so that the various elements composing this cross-channel dialog can be targeted and relevant. Such a shift requires a new approach to analytics, based on understanding how the various elements of the user interaction relate to one another.

Let me introduce the concept of Decision Graph in support of this idea.

To move from electronic billboard / product spotlight optimization to customer experience optimization, analytics must shift away from focusing on “the right offer for the right customer”. The focus of the Decision Graph is instead to identify “the right user experience for the right customer”. This change of focus has a critical impact on your analytics requirements, as a one-dimensional targeting approach for matching customers with offers won’t address the need to optimize multiple dimensions at once.

This is where the Decision Graph comes in. Let’s consider the following graph, which depicts the relationships between the various facets of the user experience to be optimized in the context of a Marketing Optimization use case.

Now imagine that for every offer presentation on any interaction channel, your analytical engine can record and identify the characteristics of the customer interactions that are associated with success (say, a click or offer acceptance) across all those dimensions.

Let’s take an example.

  • You see a nice picture of bear cubs on a forest background, with a punchy banner stating “please give us back your share of the 20,000 tons of annual account statements” as a call to action to sign up for electronic bill payment, in the “Recommended for you” section of the login page of your financial services web site and … you decide to click on the “one click wildlife donation” link.
  • Our Decision Graph can then record the fact that your customer profile is positively associated with “positive responses” to marketing messages in the following context: Channel (Web), Offer / Product (Electronic Bill Payment), Creative (the bear cub image), Tags (Environmental, Wildlife, Donation, Provocative), Slot Type (Image), Slot (Recommended for you), Placement (Login Page). As predictive models are attached to the Decision Graph, such a business event updates 10 predictive models that marketers can now use for reporting and decision management purposes.
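As an illustration, here is a minimal Python sketch of how one response event could update every facet of such a graph. Simple success/trial counters stand in for the attached predictive models; the facet names mirror the example above and are purely illustrative:

```python
from collections import defaultdict

# (facet, value) -> [successes, trials]; each counter stands in for one
# simple per-facet model attached to the Decision Graph.
counts = defaultdict(lambda: [0, 0])

def record_event(facets, clicked):
    """Update every (facet, value) pair touched by one offer presentation."""
    for facet, values in facets.items():
        for value in (values if isinstance(values, list) else [values]):
            counts[(facet, value)][1] += 1
            if clicked:
                counts[(facet, value)][0] += 1

# The bear-cub example: one click updates 10 (facet, value) pairs.
event = {
    "Channel": "Web",
    "Offer": "Electronic Bill Payment",
    "Creative": "Bear Cub image",
    "Tags": ["Environmental", "Wildlife", "Donation", "Provocative"],
    "Slot Type": "Image",
    "Slot": "Recommended for you",
    "Placement": "Login Page",
}
record_event(event, clicked=True)

updated = sum(1 for (s, t) in counts.values() if t > 0)
print(updated)  # 10, matching the 10 models mentioned in the text
```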

You can now generalize the idea and imagine that this graph collects information about all marketing events across all channels. You end up with an analytical system that lets all the actors of customer experience optimization discover the relationships between the different facets of user interactions.

With the Decision Graph:

  • Marketing stakeholders will learn which customer segments are receptive to eco-centric marketing messages, and which customers, in the right context, will step out of their standard routine (why they came to the web site in the first place) to subscribe to specific causes.
  • Web user experience stakeholders will learn which types of marketing messages are appropriate, and for whom, at the start, at the end, or throughout a logged-in web session.
  • Content owners can focus their digital agencies on the most effective creative themes, as they will be able to correlate response rates with associated tags.
  • And the company as a whole will have learned who is receptive to eco-centric marketing messages when they are displayed in a given context of a secured dialog, from which it will be in a position to dynamically tailor user experiences across channels based on such empirical evidence.

Now contrast this with a system that would only record the fact that you’ve subscribed to the Electronic Bill Payment option as part of the “Go Green” Campaign, and you will get a sense of the power of the Decision Graph. The bottom line is that companies need analytical systems that operate at multiple levels of the Decision Graph if they want to delight their customers with relevant customer experiences.

My next post will be on how the Oracle RTD Decision Manager product enables you to create and configure such graphs and to automatically identify the predictive drivers of response across the whole spectrum of the user experience.

Friday Mar 02, 2012

Announcing RTD

It is our pleasure to let you know that Oracle just released a new version of the RTD platform addressing some important scalability requirements for high end deployments.

The driver for this new version, released as a patch on top of the RTD platform, was the need to address increasing volumes of “learnings” generated by ever-increasing volumes of “decisions”. This stems from the fact that several of our high-end customers now have production environments with multiple cross-channel Inline Services supporting multiple “decisions”, which make use of multiple predictive models using hundreds of data attributes to select from potentially thousands of “choices”. Addressing those high-end business requirements required more RTD Learning Server capacity than earlier releases provided.

To address those needs, Oracle re-architected its RTD Learning Server engine to enable some level of parallelization. This new architecture relies on multi-threaded model updating and asynchronous learning-record read/delete operations. This change provides a 150% improvement in learning record processing rates, which enables RTD to process more than 58M incremental learning records per day with a deployment configuration consisting of 3 concurrently active Inline Services, each with 900 choices, 200 data attributes, and 4 choice event predictive/learning models. This was achieved on a machine with 4 cores / 6GB RAM allocated to the RTD Learning Server.

This new version of RTD is an important release for companies setting up Big Data Analysis & Decision platforms in support of real-time and batch targeted customer experience enterprise deployments.

For complete details on this patch please refer to http://www.oracle.com/technetwork/middleware/real-time-decisions/psu11-1532856.html

Monday Dec 12, 2011

Multiple Presentations - Part 2

In the first part of this series we explored the problem of predicting likelihood with multiple presentations of the same content. In this entry we will explore a few more options.

Modeling Options (continued)

Presentation Cap

With presentation caps, a person decides on a threshold: the maximum number of times a specific piece of content is to be presented to any given person. For example, "this offer should not be presented more than 5 times to the same customer."

This scheme avoids the problem of wasting presentations in the long tail but, like any arbitrary threshold, it is not optimal. If the number is too large, then there are going to be wasted presentations, and if it is too small, then the effort will quit too early, leaving potential customers "unimpressed."

An additional problem with this scheme stems from the fact that the customer's situation may change. That is, maybe this offer was not relevant a month ago, but now, with changed circumstances, it may be relevant. A way to solve this problem is to set an expiration date for the cap.

In summary, for each choice two numbers are given: the maximum number of times to present the same choice to the same person, and the number of days after which the count is reset.
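A sketch of that bookkeeping (illustrative Python, not an RTD API; the class and parameter names are invented for this example):

```python
from datetime import date, timedelta

class PresentationCap:
    """Per (customer, choice) counter with a reset after `reset_days`."""
    def __init__(self, max_presentations=5, reset_days=30):
        self.max_presentations = max_presentations
        self.reset_days = reset_days
        self.state = {}  # (customer, choice) -> (count, first_shown)

    def allow(self, customer, choice, today=None):
        today = today or date.today()
        count, first = self.state.get((customer, choice), (0, today))
        if today - first >= timedelta(days=self.reset_days):
            count, first = 0, today       # expiration date: reset the counter
        if count >= self.max_presentations:
            return False                  # cap reached: suppress the choice
        self.state[(customer, choice)] = (count + 1, first)
        return True

cap = PresentationCap(max_presentations=2, reset_days=30)
d = date(2011, 12, 1)
print([cap.allow("c1", "offer", d) for _ in range(3)])  # [True, True, False]
print(cap.allow("c1", "offer", d + timedelta(days=31)))  # True (count was reset)
```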

Chart representing clicks against the number of presentations. A horizontal purple line represents the believed likelihood until the cap is reached; the yellow area between the curve and the line indicates overestimation of the likelihood, and the red area after the cap indicates underestimation.

It is also clear that if the cap is too low, then the untapped potential - the red area - grows tremendously. Conversely, if the cap is too large, then the wasted presentations - the yellow area - become more and more significant. Therefore, using a presentation cap is like fitting a square peg into a round hole.

A further problem with this approach is one of modeling. Should you train the model with each presentation or only when the cap is reached?

In the next installment we will explore a possible better approach.

Wednesday Nov 30, 2011

Predicting Likelihood of Click with Multiple Presentations

When using predictive models to predict the likelihood that an ad or a banner will be clicked, it is common to ignore the fact that the same content may have been presented in the past to the same visitor. While the error may be small for sites whose visitors rarely see repeated content, it may be very significant for sites where visitors come back repeatedly.

This is a well recognized problem that usually gets handled with presentation thresholds – do not present the same content more than 6 times.

Observations and measurements of visitor behavior provide evidence that something better is needed.


For a specific visitor, during a single session, for a banner in a not too prominent space, the second presentation of the same content is more likely to be clicked on than the first presentation. The difference can be 30% to 100% higher likelihood for the second presentation when compared to the first.

That is, for example, if the first presentation has an average click-through rate (CTR) of 1%, the second presentation may have an average CTR of between 1.3% and 2%.

After the second presentation the CTR stays more or less the same for a few more presentations. The number of presentations in this plateau seems to vary by the location of the content in the page and by the visual attraction of the content.

After these few presentations the CTR starts decaying with a curve that is very well approximated by an exponential decay. For example, the 13th presentation may have 90% the likelihood of the 12th, and the 14th has 90% the likelihood of the 13th. The decay constant seems also to depend on the visibility of the content.
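The empirical shape described above (a boost on the second presentation, a plateau, then exponential decay) can be sketched as a simple function. The specific numbers below are illustrative, chosen within the ranges the text gives: a 1% base rate, a 1.5x second-presentation boost, a plateau through the sixth presentation, and 10% decay per presentation afterwards:

```python
def expected_ctr(n, base=0.01, boost=1.5, plateau_end=6, decay=0.9):
    """Approximate CTR for the n-th presentation (n >= 1)."""
    if n == 1:
        return base
    if n <= plateau_end:
        return base * boost                           # plateau at boosted rate
    return base * boost * decay ** (n - plateau_end)  # exponential decay

curve = [round(expected_ctr(n), 5) for n in range(1, 9)]
print(curve)  # [0.01, 0.015, 0.015, 0.015, 0.015, 0.015, 0.0135, 0.01215]
```

Note that, as in the text, each presentation past the plateau has 90% the likelihood of the previous one.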

Chart representing click likelihood as a function of the presentation number. We can see that the first presentation has less likelihood than the second. Then it plateaus and after the sixth presentation it starts an exponential decay.

Modeling Options

Now that we know the empirical data, we can propose modeling techniques that will correctly predict the likelihood of a click.

Use presentation number as an input to the predictive model

Probably the most straightforward approach is to add the presentation number as an input to the predictive model. While this is certainly a simple solution, it carries several problems with it, among them:

  1. If the model learns on each case, repeated non-clicks for the same content will disproportionately reinforce the model's belief about the non-clicker. That is, the weight of one person who does not click through 200 presentations of an offer may be the same as that of 100 other people who, on average, click on the second presentation.
  2. The effect of the presentation number is not a customer characteristic or a piece of contextual data about the interaction with the customer, but it is contextual data about the content presented.
  3. Models tend to underestimate the effect of the presentation number.

For these reasons it is not advisable to use this approach when the average number of presentations of the same content to the same person is above 3, or when the presentation number can grow very large, into the tens or hundreds.

Use presentation number as a partitioning attribute to the predictive model

In this approach we essentially build a separate predictive model for each presentation number. This approach overcomes all of the problems of the previous one; nevertheless, it can be applied only when the volume of data is large enough for these very specific sub-models to converge.
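The partitioning idea can be sketched as follows. The `train`/`predict` interface and the stand-in model are hypothetical; any real classifier with the same shape could be plugged in per partition:

```python
from collections import defaultdict

class PartitionedModel:
    """One sub-model per presentation number; each learns only its own cases."""
    def __init__(self, model_factory):
        self.models = defaultdict(model_factory)

    def learn(self, presentation_number, features, clicked):
        self.models[presentation_number].train(features, clicked)

    def predict(self, presentation_number, features):
        return self.models[presentation_number].predict(features)

# A trivial stand-in model: predicts the observed click rate of its partition.
class RateModel:
    def __init__(self):
        self.clicks = 0
        self.trials = 0
    def train(self, features, clicked):
        self.trials += 1
        self.clicks += int(clicked)
    def predict(self, features):
        return self.clicks / self.trials if self.trials else 0.0

pm = PartitionedModel(RateModel)
pm.learn(1, {}, False); pm.learn(1, {}, True)   # presentation 1: 50% CTR
pm.learn(2, {}, True)                            # presentation 2: 100% CTR
print(pm.predict(1, {}), pm.predict(2, {}))      # 0.5 1.0
```

The convergence caveat in the text shows up directly here: each `RateModel` sees only the events for its own presentation number, so sparse partitions produce unreliable estimates.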

In the next couple of entries we will explore other solutions and a proposed modeling framework.

Tuesday Nov 29, 2011

Customer retention - why most companies have it wrong

At least in the US market, it is quite common for service companies to offer an initially discounted price to new customers. While this may attract new customers and lure customers away from competitors, my argument is that it is a bad strategy for the company. This strategy provides an incentive to change companies and a disincentive to stay with one. From the point of view of the customer, after 6 months of being a customer, the company rewards that loyalty by raising the price.

A better strategy would be to reward customers for staying with the company. For example, by lowering the cost by 5% every year (a compound discount, so it never reaches zero). This is a very rational thing for the company to do. Acquiring new customers and setting up their service is expensive, and new customers also tend to use more of the common resources, like customer service channels. It is probably true for most companies that the cost of providing service to a customer of 10 years is lower than providing the same service in the first year of a customer's tenure. It is only logical to pass these savings on to the customer.

From the customer point of view, the competition would have to offer something very attractive, whether in terms of price or service, in order for the customer to switch.

Such a policy would give an advantage to the first mover, but would probably force the competitors to follow suit. Overall, I would expect that this would reduce the mobility in the market, increase loyalty, increase the investment of companies in loyal customers and ultimately, increase competition for providing a better service.

Competitors may even try to break the scheme by offering customers the porting of their tenure, but that would not work well because it would disenchant existing customers and would be costly, assuming that it is costlier to serve a customer through installation and the first year.

What do you think? Is this better than using "save offers" to retain flip-floppers?

Analyst Report on RTD

An interesting analyst report on RTD has been published by MWD; a reference and description can be found in this blog entry.

Thursday Nov 17, 2011

Short Season, Long Models - Dealing with Seasonality

Accounting for seasonality presents a challenge for the accurate prediction of events. Examples of seasonality include: 

  • Boxed cosmetics sets are more popular during Christmas. They sell at other times of the year, but they rise higher than other products during the holiday season.
  • Interest in a promotion rises around the time its TV advertising airs.
  • Interest in the sports section of a newspaper rises when there is a big football match.

There are several ways of dealing with seasonality in predictions.

Time Windows

If the length of the model time windows is short enough relative to the seasonality effect, then the models will see only seasonal data, and therefore will be accurate in their predictions. For example, a model with a weekly time window may be quick enough to adapt during the holiday season.

In order for time windows to be useful in dealing with seasonality it is necessary that:

  1. The time window is significantly shorter than the seasonal cycle
  2. There is enough volume of data in the short time windows to produce an accurate model

An additional issue to consider is that sometimes the season may have an abrupt end, for example the day after Christmas.

Input Data

If available, it is possible to include the seasonality effect in the input data for the model. For example the customer record may include a list of all the promotions advertised in the area of residence.

A model with these inputs will have to learn the effect of the input. It is possible to learn it specific to each promotion - and, along the way, learn about inter-promotion cross-feeding - by leaving the list of ads as it is; or it is possible to learn the general effect by having a flag that indicates whether any promotion is being advertised.

For inputs to properly represent the effect in the model it is necessary that:

  1. The model sees enough events with the input present. For example, by virtue of the model lifetime (or time window) being long enough to see several “seasons” or by having enough volume for the model to learn seasonality quickly.
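As a sketch, the two encodings described above might look like this (illustrative Python; the field names are hypothetical):

```python
def build_features(customer, active_promotions):
    """Combine customer data with two alternative seasonality encodings."""
    # Specific encoding: one indicator per promotion, so the model can learn
    # per-promotion effects (and inter-promotion cross-feeding).
    specific = {f"ad_{p}": 1 for p in active_promotions}
    # Generic encoding: a single flag that some promotion is being advertised.
    generic = {"any_ad_running": int(bool(active_promotions))}
    return {**customer, **specific, **generic}

features = build_features({"zip": "94065"}, ["spring_sale", "tv_spot"])
print(features)  # customer data plus ad_spring_sale, ad_tv_spot, any_ad_running
```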

Proportional Frequency

If we create a model that ignores seasonality, it is possible to use that model to predict how a specific person's likelihood differs from the average. If we have a divergence from the average, then we can transfer that divergence proportionally to the observed frequency at the time of the prediction.


Ft = trailing average frequency of the event at time “t”. The average is taken over a period long enough to achieve a statistically significant estimate.

F = average frequency as seen by the model.

L = likelihood predicted by the model for a specific person

Lt = predicted likelihood proportionally scaled for time “t”.

If the model is good at predicting deviation from average, and this holds over the interesting range of seasons, then we can estimate Lt as:

Lt = L * (Ft / F)

Considering that:

L = (L – F) + F

Substituting we get:

Lt = [(L – F) + F] * (Ft / F)

Which simplifies to:

(i) Lt = (L – F) * (Ft / F) + Ft

This latest expression can be understood as: “the adjusted likelihood at time t is the average likelihood at time t plus the effect from the model, which is calculated as the difference from average times the proportion of frequencies”.

The formula above assumes a linear translation of the proportion. It is possible to generalize the formula using a factor which we will call “a” as follows:

(ii) Lt = (L – F) * (Ft / F) * a + Ft

It is also possible to use a formula that does not scale the difference, like:

(iii) Lt = (L – F) * a + Ft
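The three variants can be written out directly. The following Python sketch uses an illustrative worked example with F = 2%, Ft = 4% (the event is twice as frequent in season) and a model prediction L = 3%:

```python
def lt_i(L, F, Ft):
    """(i)  Lt = (L - F) * (Ft / F) + Ft  -- algebraically equal to L * Ft / F."""
    return (L - F) * (Ft / F) + Ft

def lt_ii(L, F, Ft, a=1.0):
    """(ii) like (i), but the difference term is scaled by a factor a."""
    return (L - F) * (Ft / F) * a + Ft

def lt_iii(L, F, Ft, a=1.0):
    """(iii) shift the unscaled difference onto the seasonal frequency."""
    return (L - F) * a + Ft

L, F, Ft = 0.03, 0.02, 0.04
print(lt_i(L, F, Ft))        # approximately 0.06: the +50% deviation from
                             # average is preserved at the higher frequency
print(lt_ii(L, F, Ft, 0.5))  # dampened version of (i)
print(lt_iii(L, F, Ft))      # adds the raw difference to Ft
```

Note the properties listed below hold for this implementation: with Ft = F, `lt_i` returns L, and with Ft = 0 both `lt_i` and `lt_ii` return 0.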

While these formulas seem reasonable, they should be taken as hypothesis to be proven with empirical data. A theoretical analysis provides the following insights:

  1. The Cumulative Gains Chart (lift) should stay the same, as at any given time the order of the likelihood for different customers is preserved
  2. If F is equal to Ft then the formula reverts to “L”
  3. If (Ft = 0) then Lt in (i) and (ii) is 0
  4. It is possible for Lt to be above 1.

If it is desired to avoid going over 1, for relatively high base frequencies it is possible to use a relative interpretation of the multiplicative factor.

For example, if we say that Y is twice as likely as X, then we can interpret this sentence as:

  • If X is 3%, then Y is 6%
  • If X is 11%, then Y is 22%
  • If X is 70%, then Y is 85% - in this case we interpret “twice as likely” as “half as likely to not happen”

Applying this reasoning to (i) for example we would get:

If (L < F) or (Ft < 1 / ((L/F) + 1))

Then Lt = L * (Ft / F)

Else Lt = 1 – (F / L) + (Ft * F / L)
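Putting the two branches together in a small Python sketch (the second expression applies when the condition does not hold; note the result reduces to L when Ft = F and never exceeds 1):

```python
def scaled_likelihood(L, F, Ft):
    """Proportional scaling with a relative interpretation near 1."""
    if L < F or Ft < 1 / (L / F + 1):
        return L * (Ft / F)                # plain proportional scaling
    return 1 - (F / L) + (Ft * F / L)      # "less likely to not happen" branch

print(scaled_likelihood(0.03, 0.02, 0.02))  # Ft = F: returns L
print(scaled_likelihood(0.6, 0.3, 0.9))     # second branch: stays below 1
```

One design point worth noting: the two branches agree exactly at the switchover point Ft = 1 / ((L/F) + 1), so the adjusted likelihood is continuous in Ft.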


Monday Aug 22, 2011

Performance Goals for Human Decisions

I've often asked myself whether we, humans, make decisions in a similar way to what RTD does. Do we balance different KPIs? Do we evaluate the different options and choose the one that maximizes our performance goals?

To answer this question, one would have to ask what your performance goals are and how much you value each one of them. It would seem logical that our decisions are made in such a rational way that they are totally driven by the evaluation of each alternative and the selection of the best one.

Following this logic, one could surmise that if we were able to discover the performance goals that are relevant for a specific person, and the weights for each one, we could be very good at predicting human behavior. Instead of using the inductive prediction models like the ones we have today in RTD, we could use deductive models that mimic the logic of the person to arrive at the predicted behavior.

Fortunately, as learned by modern economists and brilliantly put by Dan Ariely in his book Predictably Irrational, human decisions are typically not the result of rational optimization, but heavily influenced by emotions and instinct.

This is one of the reasons that rule systems perform so poorly in trying to predict human behavior. A rule system would try to detect the reason behind the behavior. Empirical, inductive models work much better because they do not try to discover the pattern behind a behavior, but the common characteristics of people; and while we cannot rationally explain many of our behaviors, we do see a lot of commonality. While each of us is a unique individual, it is possible to predict our behavior by generalizing from what is observed about people similar to us.

I was recently on a Southwest Airlines flight. As usual, travelers had optimized their seat selection according to what was available and convenient, mostly preferring seats toward the front of the plane, and window and aisle seats. Can we predict whether a person will prefer a window or an aisle? Absolutely: just look at the history of the seats that person has chosen in the past. While you may claim that such a model is obvious, it is a good model based on generalization of past experience. I still cannot answer the question of what motivates a person to prefer a window seat, but I can predict with great accuracy which one a given person will prefer on a specific flight.

Once everyone had selected their seat there were about 20 middle seats left in the plane. Just before the door closed, a mother with a child entered the plane in a hurry. She evaluated the situation, and for her the Performance Goal of being beside her child was the most important. Since none of the "eligible" choices was good, she tried to create a new choice by asking the flight attendants for help.

As the attendant was starting to make an announcement asking for someone to give up their aisle or window seat, I saw the situation and immediately offered my aisle seat. Then, of course, I tried to figure out why I decided to do that. Why didn't others? Was it purely emotional, or was there a Performance Goal mechanism involved?

Sure, there are benefits to giving up your seat in a situation like this. For example, I got preferential treatment and free premium drinks from the flight attendants for the rest of the flight, but I did not know about those benefits beforehand. Are there other hidden KPIs?

I would like to hear from you. What are the Performance Goals that motivate people to action? Is there a moral framework within which decisions do follow KPI optimization?

Wednesday Nov 17, 2010

Performance Tips

As RTD implementations become more and more sophisticated, and the applications extend the reach of decisions far beyond selecting the next best offer, we have been recommending some design decisions to ensure a desired level of performance.

By far, the most significant factor affecting performance is external system access, in particular database access. This goes for reads as well as writes. Here are a few tips that are easy to implement and good to keep in mind when designing a configuration:

  1. Data that is repeatedly used in decisions by different sessions should be cached. Examples include Offer Metadata, Product Catalog, Content Catalog, and Zip Code Demographics.
  2. If possible, data that will be needed in decisions should be pre-fetched. For example, customer profiles could be loaded at the very beginning of a session.
  3. A good storage system for data is an Oracle Coherence Cache. Particularly if it is configured with local storage in the same app server as RTD. Data can include customer profile, event history, etc.
  4. When writing to the database and there is no need to be transactional and synchronous, use RTD's batch writing capabilities. This can increase write performance by an order of magnitude.
  5. Avoid unnecessary writes and writing unnecessary data. For example, avoid writing metadata together with event data if the metadata can be linked.
  6. Consider using stored procedures when updating several tables to minimize roundtrips to the database
  7. If a result set is potentially very large, consider wrapping the query with a stored procedure that limits the number of rows returned. For example, if the application calls for loading the purchase history of a customer, and the median length of the list is 3 purchases, but there are 15 customers with 10000 purchases or more, processing these [good] customers will take a long time - it may be acceptable from the point of view of the application logic to just load a maximum of the latest 100 purchases.
  8. When loading metadata, avoid loading data that will not be used. For example, if there are 500k products in the catalog, but realistically only 90k have any real chance of being selected for a recommendation, do the filtering when loading the data and avoid loading the full list.
  9. Asynchronous processing is not free - avoid unnecessary processing. For example, a decision may take 5 ms of actual CPU processing. That would limit the theoretical throughput of a single CPU to 200 decisions per second. If we add 15 ms of asynchronous processing per decision, we will not be affecting response time, but throughput will be affected - the theoretical throughput being reduced to 50 per second.
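Tip 4 (batch writing) can be illustrated with a small sketch: buffering rows and flushing them in one call per batch cuts database roundtrips by the batch size. This is illustrative Python, not the RTD API; `flush_fn` stands in for something like an `executemany()` call:

```python
class BatchWriter:
    """Buffer rows and flush them in one call per batch, instead of
    issuing one database roundtrip per row."""
    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []
        self.roundtrips = 0

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)   # one roundtrip for the whole batch
            self.roundtrips += 1
            self.buffer = []

writer = BatchWriter(flush_fn=lambda rows: None, batch_size=100)
for i in range(1000):
    writer.write({"event": i})
writer.flush()
print(writer.roundtrips)  # 10 roundtrips instead of 1000
```

The same order-of-magnitude reduction is what makes batch writing worthwhile whenever the write does not need to be transactional and synchronous.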

In addition to these tips, it is also important for the environment to be properly set up to achieve peak performance. Some tips include:

  1. Prefer physical servers to virtualized ones
  2. Always calculate to have at least two CPU cores per JVM
  3. Make sure the memory requirements and settings match the available memory to avoid swapping JVM memory
  4. If using virtualized servers, make sure that CPUs are not overallocated. That is, do not run 5 virtual machines configured for 2 CPUs each on an 8-core system. While such a setup may be acceptable for some applications, with throughput-intensive applications like RTD it would certainly cause performance problems, and these problems are difficult to diagnose.
  5. If using virtualized servers, make sure the virtual machine's configured memory will be resident in physical memory. If the guest believes that it has 4GB of memory, but the host needs to use swap to fulfill that amount of memory, performance and availability will suffer. Problems in this area are very difficult to diagnose because to the guest OS it looks as if CPU cycles were stolen.

Thursday Jun 10, 2010

Is RTD Stateless or Stateful?


A stateless service is one where each request is an independent transaction that can be processed by any of the servers in a cluster. A stateful service is one where state is kept in a server's memory from transaction to transaction, thus necessitating the proper routing of requests to the right server. The main advantage of stateless systems is simplicity of design. The main advantage of stateful systems is performance.

I'm often asked whether RTD is a stateless or stateful service, so I wanted to clarify this issue in depth so that RTD's architecture will be properly understood.

The short answer is: "RTD can be configured as a stateless or stateful service."

The performance difference between stateless and stateful systems can be very significant, and while in a call center implementation it may be reasonable to use a pure stateless configuration, a web implementation that produces thousands of requests per second is practically impossible with a stateless configuration.

RTD's performance is orders of magnitude better than most competing systems. RTD was architected from the ground up to achieve this performance. Features like automatic and dynamic compression of prediction models, automatic translation of metadata to machine code, lack of interpreted languages, and separation of model building from decisioning contribute to achieving this performance level. Because of this focus on performance we decided to have RTD's default configuration work in a stateful manner. By being stateful RTD requests are typically handled in a few milliseconds when repeated requests come to the same session.

Now, those readers that have participated in implementations of RTD know that RTD's architecture is also focused on reducing Total Cost of Ownership (TCO) with features like automatic model building, automatic time windows, automatic maintenance of database tables, automatic evaluation of data mining models, automatic management of models partitioned by channel, geography, etcetera, and hot swapping of configurations.

How do you reconcile the need for a low TCO and the need for performance? How do you get the performance of a stateful system with the simplicity of a stateless system? The answer is that you make the system behave like a stateless system to the exterior, but you let it automatically take advantage of situations where being stateful is better.

For example, one of the advantages of stateless systems is that you can route a message to any server in a cluster, without worrying about sending it to the same server that handled the session's previous messages. With an RTD stateful configuration you can still route the message to any server in the cluster, so from the point of view of the configuration of other systems, it is the same as a stateless service. The difference, though, comes in performance: if the message arrives at the right server, RTD can serve it without any external access to the session's state, tremendously reducing processing time. In typical implementations it is not rare to have a high percentage of messages routed directly to the right server, while those that are not are easily handled by forwarding them to the right server. This architecture usually provides the best of both worlds: performance and simplicity of configuration.

Configuring RTD as a pure stateless service

A pure stateless configuration requires session data to be persisted at the end of handling each and every message, and reloaded at the beginning of handling any new message. This is, of course, the root of the inefficiency of these configurations. It is also the reason why many "stateless" implementations actually do keep state, to take advantage of a request coming back to the same server. Nevertheless, if the implementation requires a pure stateless decision service, it is easy to configure in RTD. The way to do it is:

  1. Mark every Integration Point to close the session at the end of processing the message
  2. In the Session entity, persist the session data on closing the session
  3. In the Session entity, check whether a persisted version exists and load it
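The three steps above can be sketched as follows. The store and function names are illustrative, not RTD configuration syntax, and a plain dictionary stands in for the external store:

```python
import json

# Stand-in for an external store such as Oracle Coherence or a
# database table (illustrative only).
session_store = {}

def load_session(session_id):
    # Step 3: check whether a persisted version exists and load it.
    persisted = session_store.get(session_id)
    return json.loads(persisted) if persisted is not None else {}

def close_session(session_id, session_data):
    # Steps 1-2: close the session after every message and persist its data.
    session_store[session_id] = json.dumps(session_data)

def handle_message(session_id, message):
    # Pure stateless handling: load, process, persist, close.
    session = load_session(session_id)
    session.setdefault("events", []).append(message)
    close_session(session_id, session)

handle_message("web-7", "offer_request")
handle_message("web-7", "offer_click")
```

Every message pays the load/persist round trip, which is exactly the overhead the quasi-stateless configuration avoids.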

An excellent solution for persisting the session data is Oracle Coherence, which provides a high-performance, distributed cache that minimizes the performance impact of persisting and reloading the session. Alternatively, the session can be persisted to a local database.

An interesting feature of the RTD stateless configuration is that it can serialize concurrent requests for the same session. For example, if a web page produces two requests to the decision service, these requests could arrive concurrently and be handled by different servers. Most stateless implementations would have the two requests step on each other when saving state, or would fail one of the messages. When properly configured, RTD will make one message wait for the other before processing it.
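The serialization problem and its fix can be illustrated with a generic per-session lock. This is plain Python threading, not RTD internals; without the lock, the read-modify-write cycle could let one request's update overwrite the other's:

```python
import threading
from collections import defaultdict

# One lock per session: two concurrent requests for the same session
# are serialized instead of overwriting each other's saved state.
session_locks = defaultdict(threading.Lock)
session_state = defaultdict(list)

def handle_request(session_id, update):
    with session_locks[session_id]:  # the second request waits here
        events = list(session_state[session_id])  # load persisted state
        events.append(update)                     # process the message
        session_state[session_id] = events        # persist the new state

# Two concurrent requests produced by the same web page.
threads = [threading.Thread(target=handle_request, args=("page-1", i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After both threads finish, both updates survive; neither request's save clobbered the other's.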

A Word on Context

Using the context of a customer interaction typically increases lift significantly. For example, offer success in a call center can double when the context of the call is taken into account. For this reason, it is important to utilize contextual information in decision making. To make contextual information available throughout a session, it needs to be persisted. When there is a well-defined owner for the information there is no problem, because in case of a session restart the information can easily be retrieved. If there is no official owner of the information, then RTD can be configured to persist it.

Once again, RTD provides the flexibility to favor high performance when some loss of state is acceptable in the rare case of server failure. For example, in a heavily used web site serving 1000 pages per second, the navigation history may be stored in the in-memory session. Such sites typically have no OLTP system recording all navigation events, so if an RTD server were to fail, the navigation history up to that point could be lost (note that a new session would be immediately established on one of the other servers). In most cases the loss of this navigation information is acceptable because it happens rarely. If the information must be preserved, RTD can persist it every time the visitor navigates to a new page.

Note that this practice is preferred whether RTD is configured in a stateless or stateful manner.

Wednesday Apr 07, 2010

The softer side of BPM

BPM and RTD are great complementary technologies that together provide a much higher benefit than each of them separately. BPM covers the need for automating processes, making sure that there is uniformity, that rules and regulations are complied with, and that the units flowing through the process are handled smoothly and quickly.

By nature, this automation and unification can lead to a stricter, less flexible process. To avoid this problem it is common to encounter process definitions that include multiple conditional branches and human input to help direct processing in the direction that best applies to the current situation. This is where RTD comes into play. The selection of branches and conditions, and the optimization of decisions, are better left in the hands of a system that can measure the results of its decisions in a closed-loop fashion and decide based on the empirical knowledge accumulated by observing the running process.

When designing a business process there are key places in which it may be beneficial to introduce RTD decisions. These are:

  • Thresholds - whenever a threshold is used to determine the processing of a unit, there may be an opportunity to make the threshold "softer" by introducing an RTD decision based on predicted results. For example, an insurance company's process may have a total claim threshold to initiate an investigation. Instead of that fixed threshold, RTD could help determine which claims to investigate based on the likelihood they are fraudulent, the cost of investigation, and the effect on processing time.
  • Human decisions - sometimes a process will let the human participants make flow decisions. For example, a call center process may leave the escalation decision to the agent. While this has flexibility, it may produce undesired results and asymmetry in customer treatment driven by the agent's subjective reasoning rather than objective functions. Instead, an RTD decision may be introduced to recommend escalation or other kinds of treatments.
  • Content Selection - a process may include the use of messaging with customers. The selection of the most appropriate message to the customer, given the context, can be optimized with RTD.
  • A/B Testing - a process may have optional paths where it is unclear which populations each works better for. Rather than selecting the option deemed best arbitrarily or by committee, RTD can be introduced to dynamically determine the best path for each unit.
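The insurance threshold example in the first bullet can be made concrete with a simple expected-value rule. The figures and the idea that a fraud probability is available from a predictive model are illustrative assumptions:

```python
def should_investigate(claim_amount, fraud_probability,
                       investigation_cost=500.0):
    """Soft threshold: investigate when the expected recovery from a
    fraudulent claim exceeds the cost of investigating it, instead of
    when the claim amount crosses a fixed cutoff."""
    expected_recovery = fraud_probability * claim_amount
    return expected_recovery > investigation_cost

# A $20,000 claim under a hard $25,000 threshold would never be
# reviewed, yet a 10% fraud likelihood makes it worth investigating.
flag_small_risky = should_investigate(20_000, 0.10)   # True

# A $30,000 claim with a negligible fraud score is left alone.
flag_large_safe = should_investigate(30_000, 0.01)    # False
```

A closed-loop system can then refine `fraud_probability` from the outcomes of past investigations, which is precisely what a fixed threshold cannot do.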
In summary, RTD can be used to make BPM-based process automation more dynamic and adaptable to the different situations encountered in processing, effectively making the automation softer and less rigid. For this to work, the people responsible for the process need to understand which KPIs the business is really interested in optimizing, make a concerted effort to measure against those KPIs, and tune the process to achieve better results. The benefit of making better decisions in a process flow can be tremendous, as exemplified by many current RTD implementations.

Sunday Nov 15, 2009

Sizing an RTD installation - Part 2

Now that we have the expected throughput in terms of the number of requests per second, let's look at other sizing factors.

Response time - the volume of requests is not always smoothly distributed, and there may be peaks of requests arriving at the same time. If there are strict response time requirements, like having an average below 30ms with a maximum of 60ms for 99% of the requests, then we need to consider the maximum number of requests that will be processed in parallel. To achieve the highest performance for the highest number of requests, we design for between 3 and 6 requests being processed in parallel per CPU core or hardware hyperthread.

Session Initializations - when a session is initialized, a few extra things happen compared with requests that come after initialization. First, depending on whether the RTD server manages session affinity, a new entry is created in the sessions table, which typically requires at least one database write. Additionally, the in-memory session is typically filled from the configured data sources. The speed of these operations is entirely driven by the performance of the source databases. If an application has many more session initializations than other types of messages, then throughput may be affected even though the total number of requests is not too high for the configuration.

Single Point of Failure and High Availability - in most cases the system is configured to provide High Availability (HA) and resiliency to server failure or unavailability (for example, during rolling maintenance). RTD is typically configured with a number of servers to avoid a single point of failure. Sometimes it is also configured with multiple sites for HA and Disaster Recovery (DR). In this context it is important to consider the option of relying on default responses to cope with outages of the RTD servers. I know of one RTD server that has been working since 2005 and has been down for maintenance only a few hours total since it started.

In the next entry we will finally talk about the sizing of the servers.

Friday Nov 13, 2009

Sizing an RTD installation - Part 1

In every implementation of RTD it is necessary to determine the hardware configuration to support the expected loads of RTD applications. While we try to provide guidelines and generalizations, it helps to understand the most significant factors that affect the desired hardware configuration. In a series of blog entries we describe the different factors that need to be considered.


The first factor to consider is the expected load, in terms of the number of events per second, that the servers will need to handle. These events have different types and therefore may place different loads on the servers.

Estimating the number of events per second usually begins at some given metrics. Examples of typical metrics include:

  • Web site pages served per second/day/month
  • Web site [unique] visitors per month
  • Web site visits/sessions per day
  • Call Center calls per day
  • Average call length
  • Maximum number of concurrent agents
  • IVR calls handled per day

The first thing to do with these metrics is to translate them to "per second" numbers. The translation from large time periods, like months, cannot be done by directly dividing by the number of seconds in a month, as there are typically busier days and busier hours of the day.

Some rules of thumb that I have found to result in numbers that are pretty close to reality for a wide variety of situations are as follows:

  • Monthly numbers can be divided by 10 to produce the numbers for a busy day
  • Daily numbers can be divided by 10 to produce the numbers on a busy hour
  • Hourly numbers can be divided by 3000 (or sometimes 2000) to produce the number per second
  • If number of pages per visit is unknown, 10 to 15 can be assumed for many sites
  • If call length is unknown, 5 minutes can be assumed
  • Dividing the number of concurrently active agents by the length of a call (in seconds) gives the number of call starts per second

From these we can compute the expected number of requests per second. Let's look at some examples.

Web example: a bank. Only the following information is available: "The bank has 5M customers, of them 2M have signed up for online banking. They are planning to use RTD to determine content and promotions in several places in most online banking pages."

Since this is all the information we have, we will do a calculation based on many assumptions. Later on we can confirm or adjust our assumptions based on any additional information we are given.

Assuming 1/2 of the signed-up customers are active, with an average of 4 visits per month, we have 4M visits per month. Using the rules of thumb above, we can assume 400k visits on a busy day, and 40k in a busy hour. Dividing by 2000 seconds in an hour gives us about 20 visits started per second. Assuming 10 pages per visit and 3 requests per page, we have 30 requests per visit and 600 requests per second.

Call Center example: "A telco has 5000 agents in the call center. They are interested in implementing RTD for offer recommendations at the end of service calls."

Let's assume that the maximum number of agents active at any given time is about 2/3 of the agents, say 3500. Assuming 5-minute calls, which is 300 seconds, we have an average of about 12 call initializations per second. Assuming 4 requests per call, we have about 48 requests per second.
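The two worked examples follow the same rules of thumb, which can be captured in a small helper. The function names and default parameters are illustrative, not part of any RTD tooling:

```python
def busy_second_from_monthly(monthly_volume, seconds_per_hour=2000):
    """Rules of thumb: month -> busy day (/10) -> busy hour (/10)
    -> busy second (/2000, the conservative divisor)."""
    return monthly_volume / 10 / 10 / seconds_per_hour

def web_requests_per_second(visits_per_month, pages_per_visit=10,
                            requests_per_page=1):
    visits_per_second = busy_second_from_monthly(visits_per_month)
    return visits_per_second * pages_per_visit * requests_per_page

def call_center_requests_per_second(concurrent_agents,
                                    call_seconds=300,
                                    requests_per_call=1):
    call_starts_per_second = concurrent_agents / call_seconds
    return call_starts_per_second * requests_per_call

# Bank example: 4M visits/month, 10 pages/visit, 3 requests/page.
web_rps = web_requests_per_second(4_000_000, 10, 3)       # 600.0

# Telco example: ~3500 concurrent agents, 5-minute calls, 4 requests/call.
call_rps = call_center_requests_per_second(3500, 300, 4)  # ~46.7
```

The call center figure comes out near 47 exactly; the text rounds the 11.7 call starts per second up to about 12 first, giving the quoted 48 requests per second.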

In upcoming posts we will explore other considerations that come into play when selecting a configuration.


Issues related to Oracle Real-Time Decisions (RTD). Entries include implementation tips, technology descriptions and items of general interest to the RTD community.
