The following is based on true events. The names have been changed to protect people and intellectual property.
I recently spent a few days with a team of developers at extremememe.info, an up-and-coming Web 3.0 paradigm-shifter that analyzes internet memes, tracks them to their source, and—best of all—predicts new and future memes.
The developers at extremememe.info are very knowledgeable about Oracle Database and PL/SQL, but as is the case with many other developers, they have little time to explore new technology features. They struggle to keep up with the demands of their users, and even when they know about some great new feature, it can be a challenge to convince management to commit the resources to apply those features to stable production code.
We interspersed training on PL/SQL techniques with reviews of their code and, in the process, came across a program that runs in a daily batch process and updates the status of all the memes. Unfortunately, as extremememe.info’s data volumes grew, the process was taking longer and longer, approaching 23 hours. Given how uncomfortably close that was to a full day, we decided to take a closer look.
The meme update process uses the following three tables: em_memes, em_mentions, and em_incoming. (You can create these tables by using the emmemes_setup.sql script in the download for this article.) Note that the columns of these tables have been simplified for the purposes of this article.
The developers at extremememe.info use the em_memes table for all the memes of which they are aware and whose status they track. (Note that setting the default value of the primary key to the next sequential value is an Oracle Database 12c feature; in Oracle Database 11g Release 2 and earlier, you would use a trigger to achieve the same result. The em_memes_setup.sql script includes this code.)
CREATE TABLE em_memes
meme_name VARCHAR2 (1000)
meme_status VARCHAR2 (100)
The meme_status value can be VIRAL, DORMANT, HOAX, and so on. extremememe.info has a table for all the mentions of, or references to, a meme:
CREATE SEQUENCE em_mentions_seq
CREATE TABLE em_mentions
source_name VARCHAR2 (100),
Meme mentions are loaded from many different sources, and extremememe.info relies on a staging table to collect data from all sources:
CREATE TABLE em_incoming
meme_name VARCHAR2 (1000),
source_name VARCHAR2 (100),
The developers at extremememe.info also use the following em_memes_pkg package to log errors and to perform some critical proprietary computations, based on the content of those tables:
CREATE OR REPLACE PACKAGE em_memes_pkg
TYPE incoming_t IS TABLE OF
TYPE mentions_t IS TABLE OF
TYPE meme_ids_t IS TABLE OF
PROCEDURE log_error (
IN INTEGER DEFAULT SQLCODE,
FUNCTION unpacked_incoming (
PROCEDURE reset_meme_status (
For space reasons, I will not provide the package body code for em_memes_pkg, but Table 1 lists what the elements in the specification provide.
|em_memes_pkg.incoming_t||This is a nested table type that contains rows of data from the em_incoming table.|
|em_memes_pkg.mentions_t||This is a nested table type that contains rows of data from the em_mentions table.|
|em_memes_pkg.meme_ids_t||This is a nested table type that contains primary keys from the em_memes table.|
|em_memes_pkg.log_error||This is a generic (and typical) error-logging mechanism. Defined as an autonomous transaction, it writes out the basic error information available from the DBMS_UTILITY functions (and more) to a log table and then commits just that insert.|
|em_memes_pkg.unpacked_incoming||This is a function that converts a row of incoming data (whose mention details column is an XML document that holds the different formats of source information) into a record that can be inserted into the em_mentions table.|
|em_memes_pkg.reset_meme_status||The heart and soul of the extremememe.info proprietary process, this procedure analyzes the contents of the mentions table and ascertains the status of the meme.|
Table 1: em_memes_pkg elements and descriptions
Listing 1 shows the code for the status update process in the em_update_status procedure. (Listing 1 is available as em_update_status_old.sql in the download for this article.) Descriptions of the key parts of the em_update _status package appear at the end of Listing 1.
Code Listing 1: Meme status updater, version 1
1 CREATE OR REPLACE PROCEDURE em_update_status
3 CURSOR incoming_cur
5 SELECT * FROM em_incoming;
7 l_mention em_mentions%ROWTYPE;
8 l_status em_memes.meme_status%TYPE;
10 FOR incoming_r IN incoming_cur
13 SAVEPOINT new_transaction;
15 l_mention := em_memes_pkg.unpacked_incoming (incoming_r);
17 INSERT INTO em_mentions (meme_id,
21 VALUES (l_mention.meme_id,
26 em_memes_pkg.reset_meme_status (l_mention.meme_id,
29 IF l_status IS NOT NULL
31 UPDATE em_memes
32 SET meme_status = l_status
33 WHERE meme_id = l_mention.meme_id;
34 END IF;
36 WHEN OTHERS
39 ROLLBACK TO new_transaction;
41 END LOOP;
|10||For each row in the staging table (em_incoming)…|
|12, 40||Put the entire body of the loop inside its own nested block so that any exception can be trapped and logged.|
|13||Set a savepoint for this new transaction.|
|15||Convert the incoming row to a record that can be inserted into the em_mentions table.|
|17–24||Insert a single row into the em_mentions table.|
|26–27||Compute the new status for the meme, based on the new mention.|
|29–34||If the status isn’t NULL, update the em_memes table with that status.|
|35–39||If anything goes wrong, log the error and then erase the effects of the insert and/or the update by rolling back to the savepoint.|
I was dismayed when I saw extremememe.info’s implementation of em_memes_pkg.reset_meme_status, because I found in that procedure a cursor FOR loop that contained two nonquery data manipulation language (DML) statements. That is a classic anti-pattern, a general pattern of coding that should be avoided. It should be avoided because the inserts and updates are changing the tables on a row-by-row basis, which maximizes the number of context switches (between SQL and PL/SQL) and consequently greatly slows the performance of the code. Fortunately, this classic antipattern has a classic, well-defined solution: use BULK COLLECT and FORALL to switch from row-by-row processing to bulk processing.
The presence of this antipattern was also a source of delight for me. The developers needed to improve the performance of the procedure significantly. That would have been very hard to do if they had already taken full advantage of SQL and PL/SQL performance optimization features. Because they had not done so, they were looking at a procedure full of low-hanging fruit. That is, they were looking at the possibility of solving their problem with a relatively straightforward change to their existing code.
I shared my dismay and delight with the extremememe.info developer team. They were excited about the potential improvements and were eager to get started. Then I noticed something odd about the algorithm they’d presented. A brief Q&A session soon clarified matters:
“Can the table of incoming data contain more than one reference to the same meme?”
Heads nodded vigorously. They told me that the more viral the behavior of a meme, the more mentions there would be. In the case of some very popular memes, the daily incoming could contain millions of references to the same meme. That led directly to my second question:
“Then why are you updating the meme status after each insert of a mention?”
Now eyebrows shot up and heads were turned. “Wow,” said one developer. “Oh. My. Gosh.” said another.
The team lead turned to me. “We never thought of that before. We’ve been doing an enormous amount of unnecessary processing. That’s kind of embarrassing. It sure helps to have a fresh pair of eyes looking at our code.”
Yes, it does. Never underestimate the power of your brain to hide the obvious from you. Once you’ve spent a lot of time writing or working with a program, you don’t really see the code anymore. You see what you think the code should be. When someone else looks at it, they come at it without any preconceived notions and perceive very different aspects of the program.
If you’re the only one who’s ever looked at your code, you can be certain it contains bugs and perhaps even substantial algorithm errors.
Members of the extremememe.info developer team quickly agreed that the statuses of the memes should be updated only after all mentions were inserted. Furthermore, only those memes with new mentions should be updated.
In the remainder of this article, I show you the end result of the rewrite of extremememe.info’s em_memes_pkg.reset_meme_status procedure into a program that runs significantly faster and made it possible for data volumes to grow substantially while still allowing the batch process to be finished in a single day. My stevenfeuersteinonplsql.blogspot.com blog expands on this article to take you through those changes in a step-by-step process.
Although row-by-row processing can be unacceptably slow for large volumes of data, it offers a significant advantage for execution of nonquery DML: it is easy to define and manage a transaction.In the reset_meme_status procedure, the transaction consists of two DML processes:
With the switch to bulk processing, the procedure moves from a row-by-row approach to a phased approach. For reset_meme_status, this means that the procedure will perform all the inserts first and then execute all the updates.
Generally, you can think of bulk processing as having three distinct phases:
Figure 1 shows these three phases.
Figure 1: Bulk processing phases
In the real world, life is complicated and there are always exceptions. Sometimes you will not need Phase 1, because the collections are already filled and being passed to your program, and sometimes you can skip Phase 2, because the collection is ready to go, to be used immediately in the FORALL statement.
The code in Listing 2 is the rewrite of the too slow em_update_status procedure. This new version takes advantage of nested procedures (procedures declared within another procedure or function). The code for these nested procedures—get_next_set_of_incoming, unpack_and_record_mentions, and reset_meme_statuses—is presented and analyzed later in this article. Descriptions of the key parts of the revised em_update _status package appear at the end of Listing 2.
Code Listing 2: Meme status updater, version 2
1 CREATE OR REPLACE PROCEDURE em_update_status (
2 bulk_collect_limit_in IN PLS_INTEGER DEFAULT 100)
4 bulk_errors EXCEPTION;
5 PRAGMA EXCEPTION_INIT (bulk_errors, -24381);
7 CURSOR incoming_cur
9 SELECT * FROM em_incoming;
11 l_incoming em_memes_pkg.incoming_t;
13 l_affected_meme_ids em_memes_pkg.meme_ids_t;
14 := em_memes_pkg.meme_ids_t ();
16 PROCEDURE get_next_set_of_incoming ....
17 PROCEDURE unpack_and_record_mentions ....
18 PROCEDURE reset_meme_statuses ....
21 OPEN incoming_cur;
24 get_next_set_of_incoming (l_incoming);
25 EXIT WHEN l_incoming.COUNT = 0;
27 unpack_and_record_mentions (l_incoming, l_affected_meme_ids);
28 END LOOP;
30 CLOSE incoming_cur;
32 reset_meme_statuses (l_affected_meme_ids);
|2||BULK COLLECT returns multiple rows with each fetch. This parameter controls the number of rows fetched.|
|4–5||Because the original procedure traps exceptions, logs the errors, and continues processing, use SAVE EXCEPTIONS in the FORALL statements and include an exception handler for ORA-24381.|
|11–13||Declare nested tables based on the packaged types to be used by the nested procedures.|
|16–18||The nested procedures.|
|21–32||The now small and very readable executable section. Here’s the narrative version of the code: Open the cursor that identifies all the incoming rows to be processed. |
Get the next N rows of incoming data. Stop when the collection is empty.
Unpack each incoming row, and insert it into the mentions table. Add all IDs of memes that were mentioned into the collection of affected memes.
After inserting all mentions, reset the statuses of only those memes that were affected.
Now let’s take a closer look at each of the nested procedures, correlating each of these with the phases I described earlier.
Phase 1: Get incoming rows. When you’re switching to BULK COLLECT to fetch data, there are a few key guidelines to keep in mind:
With all that in mind, let’s now implement the get_next_set_of_incoming procedure (called in the new em_update_status procedure). The get_next_set_of_incoming procedure passes back a collection of incoming records; BULK COLLECT will improve the performance, and the LIMIT parameter will help control PGA consumption:
PROCEDURE get_next_set_of_incoming (
OUT NOCOPY em_memes_pkg.incoming_t)
BULK COLLECT INTO incoming_out
First, note that I use the NOCOPY hint in my parameter definition. By using NOCOPY, I’m asking Oracle Database to not make a local copy of my collection inside the procedure. The result will be lower PGA consumption and improved performance.
Next, because I already opened the cursor on line 21 of the new em_update_status procedure, all I need to do in this get_next_set_of_incoming procedure is fetch. And I want to retrieve multiple rows with each fetch, so I put the keywords BULK COLLECT in front of INTO and then provide a collection after INTO.
I then add the LIMIT clause, which tells Oracle Database to retrieve up to that number of rows with each FETCH. I could have provided a literal instead of a parameter, as in
In fact, 100 is a good default value. You get a nice boost in performance, and you don’t consume enormous amounts of PGA.
Yet I hesitate to hard-code the value. If over time this procedure needs to work with more and more data, I may want to experiment with different limit values to see if I can coax more performance out of the process without breaking the PGA bank.
By passing the limit value as a parameter, I can change the value whenever I want without having to recompile the code.
OK, I’ve got the next N rows of incoming data. Now it’s time to move them to the em_mentions table.
Phases 2 and 3: Unpack and record mentions. With phased processing, you do not unpack a single row and then insert it into a table. Instead, you unpack all the rows fetched and then push them all into a table by using FORALL. In this case, all the rows fetched by get_next_set_of_incoming will be unpacked and pushed into the em_mentions table with FORALL.
Listing 3 shows the implementation of the unpack_and_record_mentions procedure. As you can see, it is much more involved than the code required to fetch data. That’s because it contains an INSERT, which means that you have to pay a lot of attention to error processing.
Code Listing 3: Unpack and record mentions
1 PROCEDURE unpack_and_record_mentions (
2 incoming_in IN em_memes_pkg.incoming_t,
3 affected_meme_ids_out IN OUT em_memes_pkg.meme_ids_t)
5 l_affected_this_time em_memes_pkg.meme_ids_t;
6 l_mentions em_memes_pkg.mentions_t := em_memes_pkg.meme_ids_t();
7 l_status em_memes.meme_status%TYPE;
9 PROCEDURE add_to_affected_memes
12 affected_meme_ids_out :=
14 MULTISET UNION DISTINCT l_affected_this_time;
17 PROCEDURE record_errors
19 l_bad_index PLS_INTEGER;
21 FOR indx IN 1 .. SQL%BULK_EXCEPTIONS.COUNT
23 l_bad_index := SQL%BULK_EXCEPTIONS (indx).ERROR_INDEX;
25 em_memes_pkg.log_error (
26 error_code_in => SQL%BULK_EXCEPTIONS (indx).ERROR_CODE,
27 error_msg_in => 'Unable to insert mention for incoming mention '
28 || incoming_in (l_bad_index).meme_name
29 || '-'
30 || incoming_in (l_bad_index).source_name
31 || '-'
32 || TO_CHAR (
33 incoming_in (l_bad_index).occurred_on,
34 'YYYY-MM-DD HH24:MI:SS'));
35 END LOOP;
38 l_mentions.EXTEND (incoming_in.COUNT);
40 FOR indx IN 1 .. incoming_in.COUNT
42 l_mentions (indx) :=
43 em_memes_pkg.unpacked_incoming (incoming_in (indx));
44 END LOOP;
46 FORALL indx IN 1 .. l_mentions.COUNT SAVE EXCEPTIONS
47 INSERT INTO em_mentions (meme_id,
51 VALUES (l_mentions (indx).meme_id,
52 l_mentions (indx).source_name,
53 l_mentions (indx).source_details,
55 RETURNING meme_id
56 BULK COLLECT INTO l_affected_this_time;
60 WHEN bulk_errors
Table 2 includes descriptions of the significant lines of code in Listing 3. These descriptions are out of order (not simply listed from the first to last lines), because I am again using nested subprograms to improve readability. And that usually means that you read from the bottom up: start with the main executable section and then jump to subprograms as needed.
|2 and 3||The next set of incoming rows is passed to the procedure, which passes back an updated list of all those meme IDs that have received at least one mention. This collection will drive the final “reset meme status” step.|
|38||Start of the executable section: make room in the local nested table of mentions to hold all the incoming data.|
|40–44||A classic Phase 2: for each of the incoming rows, convert it to the mentions format and load it into the array. Notice the use of the same index value for the l_mentions collection as in the incoming_in collection. Because incoming_in was populated with BULK COLLECT, it is either empty or filled sequentially from 1.|
|46–56||Time to push all that nicely prepared data into the em_mentions table with FORALL. Because the collection is sequentially filled from 1, use the simplest form for the FORALL header: |
FORALL indx IN 1 .. l_mentions.COUNT
If I used this form (very similar to a numeric FOR loop) and the collection had gaps (index values between FIRST and LAST that are not defined), the procedure would fail with the “ORA-22160: element at index [N] does not exist” error. To avoid this problem, you can use INDICES OF or VALUES OF; the former is discussed in the “Phases 2 and 3: Reset meme statuses” section. Next I use the RETURNING clause to capture the IDs of all mentioned memes. I need to use BULK COLLECT for my RETURNING clause, because I am likely to insert more than one row and therefore will return more than one ID.
|58||Call the add_to_affected_memes procedure. This is a procedure declared within the parent procedure. Its implementation is very simple: |
9 PROCEDURE add_to_affected_memes 10 IS 11 BEGIN 12 affected_meme_ids_out := 13 affected_meme_ids_out 14 MULTISET UNION DISTINCT l_affected_this_time; 15 END;
It uses the MULTISET UNION operation to add the ID of the newly affected meme to the existing list of previously affected memes. By adding the DISTINCT keyword, I ensure that there will not be duplicates in the resulting nested table.
|59–64||Time to handle any errors during insertion into the em_mentions table. The WHEN clause references the bulk_errors exception, which is declared at the top of the procedure. Within the handler, I call the add_to_affected_memes procedure.|
|17–36||The record_errors procedure iterates through the contents of SQL%BULK_EXCEPTIONS, a pseudocollection of records that contains the index in the collection for which an error was returned by the SQL engine and the error code. It is called a pseudocollection because it is only a partial implementation of an associative array; specifically, only the COUNT method is defined for this collection.|
Table 2: unpack_and_record_mentions procedure line descriptions
Phases 2 and 3: Reset meme statuses. At this point, all meme mentions have been inserted that can be inserted. In the process, the unpack_and_record_mentions procedure filled up the l_affected_meme_ids collection, so now it’s time to use that collection to drive the second FORALL statement, performing the updates against the em_memes table.
In the original program, for each mention, the new meme status was computed. If it was NULL, no update was performed. The new program will need to incorporate these steps. The code for reset_meme_statuses is shown in Listing 4.
Code Listing 4: Reset meme statuses
1 PROCEDURE reset_meme_statuses (
2 affected_meme_ids_in IN em_memes_pkg.meme_ids_t)
4 l_affected_meme_ids em_memes_pkg.meme_ids_t
5 := affected_meme_ids_in;
7 TYPE meme_statuses_t IS TABLE OF em_memes.meme_status%TYPE
8 INDEX BY PLS_INTEGER;
10 l_statuses meme_statuses_t;
12 FOR indx IN 1 .. affected_meme_ids_in.COUNT
14 em_memes_pkg.reset_meme_status (
15 meme_id_in => affected_meme_ids_in (indx),
16 new_status_out => l_statuses (indx));
18 IF l_statuses (indx) IS NULL
20 /* No update */
21 l_affected_meme_ids.delete (indx);
22 END IF;
23 END LOOP;
25 FORALL indx IN INDICES OF l_affected_meme_ids
26 SAVE EXCEPTIONS
27 UPDATE em_memes
28 SET meme_status = l_statuses (indx)
29 WHERE meme_id = l_affected_meme_ids (indx);
31 WHEN bulk_errors
34 l_index PLS_INTEGER;
36 FUNCTION bind_array_index_for (
37 bind_array_in IN em_memes_pkg.meme_ids_t,
38 error_index_in IN PLS_INTEGER)
39 RETURN PLS_INTEGER
41 l_index PLS_INTEGER := bind_array_in.FIRST;
43 FOR indx IN 1 .. error_index_in - 1
45 l_index := bind_array_in.NEXT (l_index);
46 END LOOP;
48 RETURN l_index;
51 FOR indx IN 1 .. SQL%BULK_EXCEPTIONS.COUNT
53 l_index :=
54 bind_array_index_for (
56 SQL%BULK_EXCEPTIONS (indx).ERROR_INDEX);
58 em_memes_pkg.log_error (
59 error_code_in => SQL%BULK_EXCEPTIONS (indx).ERROR_CODE,
60 error_msg_in => 'Unable to reset meme status '
61 || l_affected_meme_ids (
64 /* Mimic SAVEPOINT behavior, removing mentions for this meme */
65 DELETE FROM em_mentions
66 WHERE meme_id =
67 l_affected_meme_ids (l_index)
68 AND occurred_on >= TRUNC (SYSDATE);
69 END LOOP;
|4||Copy the collection passed into the procedure into a local variable.|
|7–10||Calculate all the meme statuses (Phase 2). Declare a collection to hold all those new statuses.|
|12–23||Phase 2 preparation of collection: for each affected meme ID, calculate the status and then either put that status into the l_statuses collection or remove the element from the affected meme IDs list.|
|25–29||Update the rows in the em_memes table by using a FORALL statement. But now the header of this statement looks a bit different: FORALL indx IN INDICES OF l_affected_meme_ids |
INDICES OF offers an alternative to the more common “IN low .. high” syntax and is used when the collection referenced inside the DML statement might be sparse (at least one index value between FIRST and LAST is undefined). Using INDICES OF, I tell the PL/SQL engine to use only index values that are defined in the l_affected_meme_ids, thereby skipping over undefined values.
|31–70||Handle the bulk_errors exception, and loop through the contents of the SQL%BULK_EXCEPTIONS collection.|
Table 3: Reset_meme_statuses procedure line descriptions
(The code in Listing 4 is also in the em_update_status_new.sql file in the download for this article). Table 3, also available in the Oracle-hosted online article, presents descriptions of the significant lines of code in Listing 4.
It might be lots of fun to completely reorganize one’s program in hopes of improving performance, but we can’t just assume that the resulting code actually does run quickly.
The results the extremememe.info team and I found when we put together and ran a test script (em_test.sql in the download for this article) were unambiguous: the new em_update_status runs between two and three times as fast as the original procedure.
Not too shabby.
After completing the rewrite, and verifying the performance, of the new em_update_status procedure, the developer team at extremememe.info and I sat down to review what we’d learned.
The team lead made the first point:
“No matter how much better the performance is, the most important outcome of this process was the identification of the flaw in our original algorithm. I still can’t believe we didn’t realize how unnecessary it was to recalculate the meme status with each new mention.”
Because it’s important to take these learning moments and extend them into the future, I suggested they start up a regular monthly code review meeting. “Pick another program, and walk-and-talk your way through it. In the process of explaining the code, you are sure to identify lots of possibilities for improvement—and maybe another bug or two.” Then we turned our attention to the impact of the bulk processing features on the code.
One big change stood out to everyone: the length of the procedure had increased from 42 to 157 lines. If it was any sort of reassurance, I told the team that this is what everyone experiences. And, as one of the junior programmers on the team pointed out, the new version is not only longer but also much more complicated. The switch from integrated, row-level processing to phased, bulk processing has ripple effects in terms of transaction management and error handling.
Everyone agreed, though, that as long as nested subprograms were used to hide that complexity and improve readability, the performance benefits more than justified the increased cost of maintenance of that code.
A stevenfeuersteinonplsql.blogspot.com blog post expands on this article to take you through the extremememe.info code changes in a step-by-step process.
Each PL/SQL article offers a quiz to test your knowledge of the information provided in it. The quiz appears below and also at PL/SQL Challenge (plsqlchallenge.com), a website that offers online quizzes on the PL/SQL language as well as SQL, Oracle Application Express, database design, and deductive logic.
Here is your quiz for this article:
Which of the following statements about BULK COLLECT and FORALL are correct?
a. System global area (SGA) is the only type of memory affected by a switch from row-by-row processing to bulk processing.
TEST your PL/SQL knowledge
Photography by Dmitri Popov, Unsplash