Announcing Select AI for Synthetic Data Generation

September 10, 2024 | 13 minute read
Mark Hornick
Senior Director, Machine Learning and AI Product Management
Text Size 100%:

Synthetic, or artificially generated, data can have many of the characteristics of real data and can be a powerful tool when developing and testing solutions, especially when actual data doesn’t yet exist or isn’t allowed to be used. Synthetic data is typically created using random generators, algorithms, or statistical models to simulate the characteristics and distributions of real data. However, historically this has been complex to produce or needed to rely on tools with features of varying sophistication. With the availability of Large Language Models (LLMs), more relevant and schema-specific data may be generated that considers characteristics expressed in natural language. Having more relevant synthetic data supports better testing of your applications in the absence of real data, leading to overall quality of your application.  

In support of generating synthetic data, we’re excited to announce Select AI Synthetic Data Generation (SDG) – making it easier to use LLMs to generate data conforming to your schema for solution testing, proofs of concept, and other uses.

Select AI already simplifies and automates using generative AI, whether generating, running, or explaining SQL from a natural language prompt, chatting with your LLM, or leveraging vector stores to enable Retrieval Augmented Generation (RAG). Now with Synthetic Data Generation, you can take your application development even further with generative AI, conveniently from SQL.

Use cases

You can apply synthetic data generation to a wide range of use cases. Here are just a few: 

Populating a database metadata clone: Database metadata refers to the data that describes the structure, organization, and properties of the database itself. A metadata clone refers to the process of creating a clone (or copy) of a database or schema that includes only the metadata (the structure) and not the actual data. This clone contains all the database objects such as tables, indexes, views, statistics, procedures, and triggers but without any rows of data. Select AI’s support for synthetic data generation enables populating these cloned tables with synthetic data, avoiding the use of data from original tables that may contain sensitive data.

Such clones can be useful for development, testing, or creating a template for new databases or schemas – protecting the data in the original tables. Generating synthetic data in the cloned tables can be used for development and testing purposes, including performance and scalability.

Starting a new project: When proposing a new project, you may lack actual data, especially in the early phases of a project. It’s helpful to have data that can help make a concept “come to life” and convince others to move forward with a given project.

Validate the user experience: User interface design can be challenging. Having data with a range of dimensions and data values helps to discover design flaws and performance and scalability issues. Synthetic data generation can assist development teams work with “realistic” data to validate interfaces and the user experience.

Machine learning and AI projects: New AI and ML projects that involve training models may also need sample data to explore model training, either because real data doesn’t exist, or the product team doesn’t have permission to use actual data. LLMs can be surprisingly helpful at generating data with certain types of patterns.

Using Select AI for Synthetic Data Generation

Select AI introduces a new API, generate_synthetic_data() in the DBMS_CLOUD_AI package, that leverages AI services to generate synthetic data for single tables or a full schema with table dependency relationships. The new API includes two overloaded procedures: one for single table data generation and another for multiple tables.

In the single table case, you specify the AI profile name, the object (table) name, schema owner, and the number of records you want along with an optional user prompt and parameters that further control the data generation.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
PROCEDURE generate_synthetic_data(
  profile_name        IN  VARCHAR2,
  object_name         IN  DBMS_ID,
  owner_name          IN  DBMS_ID,
  record_count        IN  NUMBER,
  user_prompt         IN  CLOB DEFAULT NULL,
  params              IN  CLOB DEFAULT NULL
);

Here’s an example that highlights using existing data in the target table as examples to guide data generation – chosen at random. We specify choosing 5 sample rows as the basis for generating an additional 10 rows.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
BEGIN
  DBMS_CLOUD_AI.generate_synthetic_data(
    profile_name => 'GENAI',
    object_name  => 'Transactions',
    owner_name   => 'SELECT_AI_USER',
    record_count => 10,
    params       => '{"sample_rows":5}');
END;

In the multiple tables case, you specify the AI profile name, an object (table) list along with optional parameters that further control the data generation. The object list specifies the list of table information, which includes arguments provided in the procedure for single table case.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
PROCEDURE generate_synthetic_data(
  profile_name        IN  VARCHAR2,
  object_list         IN  CLOB,
  params              IN  CLOB DEFAULT NULL
);

Here’s an example using the multiple tables case that highlights the object list content. Notice that the JSON string lists the owner for each table, the table name, and the number of records you want. You can also specify a table-specific prompt to guide the LLM. Here, we’re generating content for movies that were released in 2009. Note that while the prompt can help tailor the content generated by the LLM, it does not guarantee the result because LLMs have been known to hallucinate. For example, if you ask for 10,000 directors, the LLM may not know of that many directors (or that many may not exist). So, the LLM will do its best to fulfil your request.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
BEGIN
  DBMS_CLOUD_AI.generate_synthetic_data(
    profile_name => 'GENAI',
    object_list => '[{"owner": "SELECT_AI_USER", "name": "Director","record_count":5},
                     {"owner": "SELECT_AI_USER", "name": "Movie_Actor","record_count":5},
                     {"owner": "SELECT_AI_USER", "name": "Actor","record_count":10},
                     {"owner": "SELECT_AI_USER", "name": "Movie","record_count":5,"user_prompt":"all movies released in 2009"}]');
END;

Getting more from synthetic data generation

There are several ways you can further affect how Select AI generates data. Here are a few to consider.

Column statistics

If the table has column statistics or is cloned from other tables with metadata, Select AI supports adding the statistics for each column in request to make data more consistent with the original data based on these statistics. This can enhance data quality.

For NUMBER columns, the high and low values from the statistics can be used to guide the value range. For example, if the values for a SALARY column in original EMPLOYEES table is between 1000 and 10000, then the data generation for this column in the cloned table should also be limited to this range.
For columns that have a reasonable number of distinct values, we can provide the list of desired values. For example, if the values for a STATE column in original EMPLOYEES table are only 'CA', 'WA' and 'TX', then the values in a cloned table should also be 'CA', 'WA' or 'TX'. You can enable column statistics by including this in the params argument as shown below. By default, column statistics are enabled.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
BEGIN
  DBMS_CLOUD_AI.generate_synthetic_data(
    profile_name  => 'GENAI',
    object_name   => 'Employees',
    owner_name    => 'SELECT_AI_USER',
    record_count  => 10,
    user_prompt   => 'all movies are released in 2009',
    params        => '{"sample_rows":5,"column_statistics":true}' );
END;

Comments on columns

Column comments, if they exist, are automatically included to provide more information about columns to give the LLM for data generation. For example, for a string column 'Status' in a Transaction table, you might have a comment that lists the allowed values, for example: successful, failed, pending, canceled, need manual check.

Distinct values

By using LLMs to generate data, it is likely that duplicate values will be produced, especially when the record count is large. You can address this by setting up a unique constraint on a given column. This causes rows with duplicate values to be ignored by Select AI.

Sample rows

If your table already contains records, you can instruct Select AI to randomly select a specific number of them to guide the LLM’s data generation. To use this feature, specify in the params argument {"sample_row":5}  for an individual table or add for the individual table in the object list for multiple table invocation. For example, Select AI will sample 5 rows from table Transactions and send them to the LLM to generate 10 more rows:

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
BEGIN
  DBMS_CLOUD_AI.generate_synthetic_data(
    profile_name => 'GENAI',
    object_name  => 'Transactions',
    owner_name   => 'SELECT_AI_USER',
    record_count => 10,
    params       => '{"sample_rows":5}'
  );
END;

User prompt

A user prompt can be specified per table to provide general or specific guidance to the LLM when generating data. For example, instead of column comments, you may specify the values allowed for certain columns. For example, if you want to ask for only 'CA', 'WA' and 'TX' for a STATE column:

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
BEGIN
    DBMS_CLOUD_AI.generate_synthetic_data(
        profile_name  => 'GENAI',
        object_name   => 'employees',
        owner_name    => 'admin',
        user_prompt   => 'the value for state should either be CA, WA, or TX',
        record_count  => 10    );
END;

Parallelism

To help reduce data generation runtime, Select AI splits synthetic data generation tasks for appropriate tables to be run in parallel. By generating data in smaller chunks and running LLM tasks in parallel, runtime performance can be enhanced. Your Autonomous Database service level and ECPU/OCPU setting will determine the degree of parallelism allowed. While many factors affect runtime, generating data for many tables with a large number of records in parallel can improve overall performance.

Select AI Synthetic Data Generation Example

In this example we’ll create four tables involving movie-related data: directors, movies, actors, and which actors appeared in which movies. There are various constraints like primary and foreign keys specified as well.

First, we create tables explicitly, or we could use cloned tables.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
CREATE TABLE SELECT_AI_USER.Director (
    director_id     INT PRIMARY KEY,
    name            VARCHAR(100)
);
CREATE TABLE SELECT_AI_USER.Movie (
    movie_id        INT PRIMARY KEY,
    title           VARCHAR(100),
    release_date    DATE,
    genre           VARCHAR(50),
    director_id     INT,
    FOREIGN KEY (director_id) REFERENCES SELECT_AI_USER.Director(director_id)
);
CREATE TABLE SELECT_AI_USER.Actor (
    actor_id        INT PRIMARY KEY,
    name            VARCHAR(100)
);
CREATE TABLE SELECT_AI_USER.Movie_Actor (
    movie_id        INT,
    actor_id        INT,
    PRIMARY KEY (movie_id, actor_id),
    FOREIGN KEY (movie_id) REFERENCES SELECT_AI_USER.Movie(movie_id),
    FOREIGN KEY (actor_id) REFERENCES SELECT_AI_USER.Actor(actor_id)
);

As for any Setup AI use, we create our credential to access the AI provider, here OCI Generative AI Service. Then, we specify an AI profile using this credential and the object list, which simply points to the user’s schema – in this case SELECT_AI_USER.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
BEGIN                                                                       
  DBMS_CLOUD.create_credential( 
    credential_name => 'GENAI_CRED', 
    user_ocid       => 'ocid1.user.oc1....',
    tenancy_ocid    => 'ocid1.tenancy.oc1....',
    private_key     => 'vZ6cO...',
    fingerprint     => '86:7d:...' 
  ); 
END; 
 
BEGIN  
  DBMS_CLOUD_AI.create_profile(  
      profile_name => 'GENAI',  
      attributes   =>'{"provider":        "oci", 
                       "credential_name": "GENAI_CRED", 
                       "object_list":     [{"owner": "SELECT_AI_USER"}]}');  
END;  
 
BEGIN  
  DBMS_CLOUD_AI.set_attribute( 
      'GENAI', 'oci_compartment_id', 'ocid1.compartment.oc1....');
END;  
 
EXEC DBMS_CLOUD_AI.set_profile(profile_name => 'GENAI');

Now, we’re ready to run the synthetic_data_generation procedure for single table.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
BEGIN
    DBMS_CLOUD_AI.generate_synthetic_data(
        profile_name => 'GENAI',
        object_name  => 'Director',
        owner_name   => 'SELECT_AI_USER',
        record_count => 5 );
END;

PL/SQL procedure successfully completed.

We can query the table explicitly to see the generated director names.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
SQL> SELECT * FROM SELECT_AI_USER.Director

DIRECTOR_ID NAME
----------- -----------------
          1 John Smith
          2 Emily Chen
          3 Michael Brown
          4 Sarah Taylor
          5 David Lee

Or, we could ask Select AI to tell us how many directors there are.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
SQL> SELECT AI how many directors are there

NUMBER_OF_DIRECTORS
-------------------
                  5

We have, however, more tables in our schema, so we’ll run synthetic_data_generation for multiple tables. Since we already have records in our director table, we’ll get 5 more. Note that in our Movie table, we’ve provided a prompt to limit our movies to those released in in 2009.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
BEGIN
  DBMS_CLOUD_AI.generate_synthetic_data(
    profile_name => 'GENAI',
    object_list => '[{"owner": "SELECT_AI_USER", "name": "Director","record_count":5},
                     {"owner": "SELECT_AI_USER", "name": "Movie_Actor","record_count":5},
                     {"owner": "SELECT_AI_USER", "name": "Actor","record_count":10},
                     {"owner": "SELECT_AI_USER", "name": "Movie","record_count":5,"user_prompt":"all movies released in 2009"}]' );
END;

PL/SQL procedure successfully completed.

Let’s query the movie table to view the results…

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
SQL> select * from SELECT_AI_USER.Movie
 
  MOVIE_ID TITLE                   RELEASE_DATE GENRE            DIRECTOR_ID
---------- ----------------------- ------------ ---------------- -----------
         1 The Dark Knight         15-JUL-09    Action                     8
         2 Inglourious Basterds    21-AUG-09    War                        3
         3 Up in the Air           04-SEP-09    Drama                      6
         4 The Hangover            05-JUN-09    Comedy                     1
         5 District 9              14-AUG-09    Science Fiction           10

…and ask Select AI to tell us how many actors we have.

Copied to Clipboard
Error: Could not Copy
Copied to Clipboard
Error: Could not Copy
SQL> SELECT AI how many actors are there
 
Number of Actors
----------------
              10

Resources

For more information about Select AI, see the following and try Select AI RAG through Oracle LiveLabs today!

Blogs

Documentation

LiveLabs

 

Mark Hornick

Senior Director, Machine Learning and AI Product Management

Mark Hornick is Senior Director, Machine Learning and AI Product Management. Mark has more than 20 years of experience integrating and leveraging machine learning with Oracle software as well as working with internal and external customers to apply Oracle’s machine learning technologies. Mark is Oracle’s representative to the R Consortium and is an Oracle Adviser and founding member of the Analytics and Data Oracle User Community. He has been issued seven US patents. Mark holds a bachelor’s degree from Rutgers University and a master’s degree from Brown University, both in computer science. Follow him on Twitter and connect on LinkedIn.

Show more

Previous Post

Enhance Your Semantic Similarity Search with Multilingual Support

Sherry LaMonica | 6 min read

Next Post


Enhance your AI/ML applications with flexible Bring Your Own Model options

Mark Hornick | 16 min read