By Chris Saxon
August 23, 2019
You slip into your Monday morning team meeting just in time to hear your boss’s next “brilliant” idea: You are going to build a generic comma-separated values (CSV) parser! It must be able to take any CSV file as input and split out the results using SQL. Each record and field in the file will become a new row and column in the output. And you need to build the parser by this Friday!
Your mind starts racing. You’ll have to parse the file to read the header, see how many fields it returns, and then construct a SQL query with the same number of columns in the output.
Sounds like you’ll need some dynamic SQL. This could be a very long week.
Luckily you’re using Oracle Database 18c, so there's another option available. It’s one of the coolest new SQL extensions: polymorphic table functions (PTFs).
Oracle Database has included table functions for a long time. You can use them to generate rows in your result set.
But what if you want to change the columns at runtime based on input parameters? You have to fall back on some messy dynamic SQL or arcane custom aggregate functions.
PTFs offer a better solution. With PTFs you can, at runtime, add or remove columns from a table’s result set, based on values you supply.
Let’s find out how.
A PTF includes two key components:
A package, which defines the columns that will (or won’t) appear in the output and row values for any new columns
The function you’ll call to run the PTF, which can be a standalone function or a function nested within the package
The package contains the PTF’s implementation. It includes two key parts:
A describe function, which tells the database which columns to include in the output
A fetch_rows procedure, which assigns values to each row for these new columns
Let’s start with a simple example. I’ll pass a list of columns to remove from the result set and another list of columns to add to the output.
If you want to follow along, you can find the code to build and run this polymorphic table function on Oracle Live SQL.
The following code creates a standalone PTF:
create or replace function hide_existing_add_new_cols (
  tab       table,
  add_cols  columns,
  hide_cols columns
) return table
pipelined row polymorphic
using hide_and_add_cols_pkg;
/
This code includes some new syntax:
tab table specifies which table the PTF operates on. Every PTF must have exactly one TABLE argument.
add_cols columns and hide_cols columns list the column identifiers to add and remove, respectively, from the result set. These parameters use the new pseudo-operator, columns, which allows you to pass one or more column identifiers to the PTF.
row polymorphic means the PTF processes rows independently. You can also define table semantics PTFs, which allow you to operate on a set of rows. As you’ll see later, table semantics also enable powerful extensions when you’re executing your functions.
using hide_and_add_cols_pkg binds the PTF to the package that implements it.
As you can see in the last line of code for the function, to implement this PTF, I need to create the hide_and_add_cols_pkg package.
The following is the header for this package:
create or replace package hide_and_add_cols_pkg as

  function describe (
    tab       in out dbms_tf.table_t,
    add_cols  dbms_tf.columns_t,
    hide_cols dbms_tf.columns_t
  ) return dbms_tf.describe_t;

  procedure fetch_rows;

end hide_and_add_cols_pkg;
/
The describe function must have the same parameters as the PTF itself, with one subtle difference. You must replace the TABLE argument with the DBMS_TF.table_t type and any COLUMNS arguments with DBMS_TF.columns_t. The describe function must also have a return type of DBMS_TF.describe_t. The fetch_rows procedure must also have the same parameters as the PTF, minus any TABLE and COLUMNS arguments. In this example, the PTF includes only arguments of type TABLE and COLUMNS, so fetch_rows has no parameters.
So what do you do inside these program units?
The describe function defines the shape of the result set. That is, it returns the names and data types of the columns to display in the query results. To get hide_existing_add_new_cols to do what it describes, there are two things you need to do:
Hide any existing columns listed on the columns-to-suppress list
Add the list of new columns
You can do these things with the following describe function in the body of the hide_and_add_cols_pkg package:
function describe (
  tab       in out dbms_tf.table_t,
  add_cols  dbms_tf.columns_t,
  hide_cols dbms_tf.columns_t
) return dbms_tf.describe_t as
  new_cols dbms_tf.columns_new_t;
  col_list dbms_tf.columns_t := add_cols;
begin

  for i in 1 .. tab.column.count loop
    if tab.column(i).description.name member of hide_cols then
      tab.column(i).for_read     := false;
      tab.column(i).pass_through := false;
    end if;
  end loop;

  for i in 1 .. col_list.count loop
    new_cols(i) := dbms_tf.column_metadata_t (
      name => col_list(i),
      type => dbms_tf.type_number
    );
  end loop;

  return dbms_tf.describe_t (
    new_columns => new_cols
  );

end describe;
The first loop in the describe function removes any of the table’s columns that are in the hide_cols list. The code does this by setting the PASS_THROUGH and FOR_READ properties of those columns.
The properties work as follows:
PASS_THROUGH determines whether a column appears in the output.
FOR_READ determines whether the fetch_rows procedure can reference the column values.
You can set these values independently. So, for example, you can set the PASS_THROUGH column value to TRUE and the FOR_READ column value to FALSE, and vice versa. This PTF removes columns from the results, and it doesn’t need those column values to define new columns, so the code sets both properties to FALSE.
Note that the table parameter also includes an array of its columns, so you can loop through these as the first loop shows. The if statement checks whether the name of any existing column in the table matches a column in the list to exclude. If a column name matches, the code sets the PASS_THROUGH and FOR_READ properties for it to FALSE.
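To see why you might split the two properties, here is a sketch of a describe fragment that hides a column from the output while keeping its values readable, so fetch_rows could derive a new column from them. The SALARY column name is hypothetical, and note that DBMS_TF stores column names as quoted identifiers:

```sql
-- Sketch: hide a column from the results but keep it readable.
-- "SALARY" is a made-up column name for illustration only.
for i in 1 .. tab.column.count loop
  if tab.column(i).description.name = '"SALARY"' then
    tab.column(i).pass_through := false;  -- do not display the column
    tab.column(i).for_read     := true;   -- but let fetch_rows read its values
  end if;
end loop;
```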
Next, the second loop defines the new columns. The code loops through the array of columns to add, and for each value, it uses DBMS_TF.column_metadata_t to define a new column of type NUMBER. Finally, the function returns the new column array to the client. This adds the new columns to the query’s output.
Now that I have defined the new columns, it’s time to set their values.
To keep it simple, the fetch_rows procedure will assign a value to each new column according to the following formula:
ROW NUMBER * NEW COLUMN POSITION
So for the first column, the values for rows 1, 2, and 3 will be 1, 2, and 3, respectively. The second column will have the values 2, 4, and 6; the third column will have the values 3, 6, and 9; and so on.
To set these values, I need to loop through the rows and new columns in the result set. I can access these by fetching the current environment, which returns a record that includes two column arrays: a GET_COLUMNS list and a PUT_COLUMNS list.
The columns in the GET_COLUMNS list are the existing columns in the table for which the FOR_READ property is set to TRUE. The columns in the PUT_COLUMNS list include the new columns I defined in the describe function.
The following code sets the new column values:
procedure fetch_rows as
  env      dbms_tf.env_t;
  col      dbms_tf.tab_number_t;
  last_row pls_integer := 0;
begin

  env := dbms_tf.get_env();

  for cols in 1 .. env.put_columns.count loop

    dbms_tf.xstore_get ( 'col' || cols, last_row );

    for rws in 1 .. env.row_count loop
      col ( rws ) := ( rws + last_row ) * cols;
    end loop;

    dbms_tf.put_col ( cols, col );

    dbms_tf.xstore_set ( 'col' || cols, last_row + env.row_count );

  end loop;

end fetch_rows;
The first statement gets the current state of the PTF by calling DBMS_TF.get_env. The env variable holds the current environment.
I can now use this to loop through the PUT columns (those added to the result set) and define a value for each row. The outer loop iterates through these columns. The inner loop works through the current rows, building up an array of values. To return these to the client, the code calls DBMS_TF.put_col. This assigns the values in the array to the new column in position N. The column positions match the order in which the columns were added in the describe function.
Now you might be wondering this: What’s the purpose of the xstore calls?
PTFs process rows in batches, and each new batch runs fetch_rows again. So one call to a PTF may execute fetch_rows many times.
Each call to DBMS_TF.get_env returns the number of rows in the current batch, not the total number of rows processed. If the row loop simply set the column value to BATCH ROW NUMBER * COLUMN NUMBER, a PTF processing many rows would reset the counters partway through your results. It’s unlikely you want this to happen.
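To see the problem concretely, here is a deliberately broken sketch of the inner row loop with the execution-store bookkeeping removed (the batch size of 1,000 is illustrative; actual batch sizes vary):

```sql
-- Broken sketch: without xstore, rws restarts at 1 for every batch.
-- Fed 2,000 rows in two 1,000-row batches, column 1 would contain
-- 1, 2, ..., 1000, 1, 2, ..., 1000 instead of 1 through 2000.
for rws in 1 .. env.row_count loop
  col ( rws ) := rws * cols;  -- counter resets at every fetch_rows call
end loop;
```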
To avoid this, you can save the state in the execution store. This is a series of key-value pairs that you access with the DBMS_TF.xstore_get and DBMS_TF.xstore_set subprograms.
Before assigning the row values, I get the number of the last row processed by calling DBMS_TF.xstore_get.
DBMS_TF.xstore_get looks for the key passed as the first argument. The second argument is an in-out parameter: if the key doesn’t exist, DBMS_TF.xstore_get returns the value passed in for this parameter.
DBMS_TF.xstore_get('col' || cols, last_row) searches for the keys col1, col2, and so on. For the first set of rows, these keys don’t exist, so the code returns the initialized value for last_row, which is zero. Otherwise, it returns the position of the last row processed.
Then, after processing all the rows in the current batch, I update the execution store to hold the position of the current row in the results. Calling DBMS_TF.xstore_set overwrites the current value for any key that already exists.
So that builds the PTF. Now how do you use it?
To use the PTF, simply call it in the FROM clause of your query, passing the name of the table you’re operating on and the lists of columns to add and remove.
To replace the DUMMY column in the DUAL table with one named C1, I run the following:
select *
from   hide_existing_add_new_cols (
         dual, columns ( c1 ), columns ( dummy )
       );

C1
----------
1
I can also pass in a subquery to the PTF by defining a common-table expression (CTE):
with rws as (
  select dummy from dual
  connect by level <= 4
)
select *
from   hide_existing_add_new_cols (
         rws, columns ( c1, c2, c3, c4 ), columns ( not_here )
       );

DUMMY   C1   C2   C3   C4
X        1    2    3    4
X        2    4    6    8
X        3    6    9   12
X        4    8   12   16
This demonstrates a key feature of SQL: Everything is a table. You can pass views or named CTEs as table inputs to a PTF.
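For instance, you could wrap the generator query above in a view and pass that instead. This is a sketch only; dual_v is a name I’ve made up for illustration:

```sql
-- Hypothetical view wrapping the same four-row generator query
create or replace view dual_v as
  select dummy from dual connect by level <= 4;

select *
from   hide_existing_add_new_cols (
         dual_v, columns ( c1 ), columns ( dummy )
       );
```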
But you can’t pass a subquery directly. The following code will fail:
select *
from   hide_existing_add_new_cols (
         ( select dummy from dual ),
         columns ( c1 ),
         columns ( dummy )
       );

SQL Error: ORA-00904: : invalid identifier
To implement this PTF, you need to know only the position of the row in the result set—not the values from other rows. While row semantics are good enough for this PTF implementation, if you’re writing a PTF that calculates results across rows—say, aggregations—you need to use table semantics.
Switching to table semantics PTFs also enables you to define the order in which you process the dataset. So let’s re-create the PTF with table semantics:
create or replace function hide_existing_add_new_cols (
  tab       table,
  add_cols  columns,
  hide_cols columns
) return table
pipelined table polymorphic
using hide_and_add_cols_pkg;
/
This enables you to use the PARTITION BY and ORDER BY clauses when you call the PTF.
To demonstrate this, let’s create a table storing the integers 1 to 6, identifying whether they’re even and including their negative values:
create table integers as
  select level number_value,
         case mod ( level, 2 )
           when 0 then 'Y'
           when 1 then 'N'
         end is_even,
         -level negated_value
  from   dual
  connect by level <= 6;
Now I can define the order in which the PTF processes the rows. To do this, I add the ORDER BY clause after the table name:
select *
from   hide_existing_add_new_cols (
         integers order by negated_value,
         columns ( c1, c2 ),
         columns ( is_even )
       )
order by number_value;

NUMBER_VALUE   NEGATED_VALUE   C1   C2
           1              -1    6   12
           2              -2    5   10
           3              -3    4    8
           4              -4    3    6
           5              -5    2    4
           6              -6    1    2
This sorts the input to the PTF by the negated values, so you have descending counters relative to the positive values. Also note that the final ORDER BY for the query is different from the sort order passed to the PTF.
As with analytic functions, the PARTITION BY clause carves the input dataset into groups for each value in the columns. So you can have separate counters for the odd and even values by using PARTITION BY is_even:
select *
from   hide_existing_add_new_cols (
         integers partition by is_even order by number_value,
         columns ( c1, c2 ),
         columns ( negated_value )
       )
order by is_even, number_value;

NUMBER_VALUE   IS_EVEN   C1   C2
           1   N          1    2
           3   N          2    4
           5   N          3    6
           2   Y          1    2
           4   Y          2    4
           6   Y          3    6
Partitioning and sorting your dataset this way uses standard functionality. All you need to do is define the PTF with table semantics.
Clearly PTFs are a powerful feature. With them you can change the shape of a result set at runtime.
But you may be wondering: Why bother building one to add row counters? This is much easier to do with the analytic function ROW_NUMBER.
True; my example is a little contrived, but it demonstrates the key features of PTFs. Namely, it demonstrates how to
Remove existing columns from a result set in the describe function
Add new columns in this function
Assign values to new columns in the fetch_rows procedure
Use table semantics PTFs to split and sort the input rows
Manage execution state using DBMS_TF.xstore_get and DBMS_TF.xstore_set
But what are the real-world use cases for PTFs?
Let’s return to your original requirement: to read a CSV file that has an unknown number of fields and split the values out into separate columns. Doing this without PTFs requires complex dynamic SQL.
But with PTFs in place, you could read the first row of the CSV file and pass it as an argument to the function, use it to define the columns in your results in the describe function, and split the input rows into those columns in the fetch_rows procedure. Here’s an example of a dynamic CSV-to-column splitter PTF.
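If you were to build such a splitter yourself, the describe step might define one VARCHAR2 column per header field, along these lines. This is a sketch only: the header_line parameter, the quoting, and the regular expressions are my assumptions, not code from the linked example:

```sql
-- Sketch: create a VARCHAR2 column for each comma-separated header field.
-- header_line is a hypothetical parameter holding the CSV's first row,
-- e.g. 'name,age,city'.
for i in 1 .. regexp_count ( header_line, ',' ) + 1 loop
  new_cols(i) := dbms_tf.column_metadata_t (
    name => '"' || upper ( regexp_substr ( header_line, '[^,]+', 1, i ) ) || '"',
    type => dbms_tf.type_varchar2
  );
end loop;

return dbms_tf.describe_t ( new_columns => new_cols );
```

The fetch_rows procedure would then split each data row on commas and assign the pieces to these columns with DBMS_TF.put_col.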
Other cases where PTFs may be appropriate include the following:
Dynamic PIVOT and UNPIVOT operations.
Top-N Plus queries. (Use PTFs in place of a traditional Top-N query, but with an extra summary row, giving totals for all the remaining rows in the dataset.)
Executing R scripts against a table, returning the results of the R script.
I suspect there will be relatively few general-purpose PTF use cases. But I’m also sure many developers will find more weird and wacky ways of using them!
What do you think of PTFs: awesome or awful? Can you come up with other general-purpose use cases? If you have any comments or questions about PTFs, reach out to Chris Saxon on Twitter.
Illustration by Wes Rowell
Chris Saxon is an Oracle Developer Advocate for SQL. His job is to help you get the best out of the Oracle Database and have fun with SQL!