We’re pleased to announce the availability of a downloadable, augmented* version of Hugging Face’s intfloat/multilingual-e5-small embedding model. This model, distributed by Hugging Face, can be loaded directly into the database using DBMS_VECTOR.LOAD_ONNX_MODEL or DBMS_DATA_MINING.IMPORT_ONNX_MODEL. This new model enhances our semantic similarity search capabilities with robust multilingual support.
About the intfloat/multilingual-e5-small Model
The intfloat/multilingual-e5-small model generates vector embeddings for text in multiple languages. Its compact design delivers effective performance across a wide range of languages, making it well suited to diverse datasets and multilingual scenarios.
In our previous blog post, we introduced the all-MiniLM-L12-v2 ONNX model. Both all-MiniLM-L12-v2 and multilingual-e5-small are 12-layer sentence transformer models that produce 384-dimensional embeddings. While all-MiniLM-L12-v2 focuses on English, multilingual-e5-small supports roughly 100 languages, making it well suited to cross-lingual tasks. The latter also underwent additional supervised fine-tuning, which improves its performance in multilingual scenarios.
This model can be loaded into Oracle Autonomous Database, as well as on-premises and cloud databases. Detailed instructions for both scenarios are provided below.
Load the Augmented Model into Oracle Autonomous Database
Let’s walk through how to enable the augmented Hugging Face embedding model in Autonomous Database using the following steps:
- Load the model to the database from Object Storage
- Generate vector embeddings
Load the model into the database from Object Storage. You must specify the ONNX model file name and its location URI in Object Storage.
DECLARE
    ONNX_MOD_FILE VARCHAR2(100) := 'multilingual_e5_small.onnx';
    MODNAME       VARCHAR2(500);
    LOCATION_URI  VARCHAR2(200) := 'https://adwc4pm.objectstorage.us-ashburn-1.oci.customer-oci.com/p/mbFT6Y4-cDFZr86_BlvZJA8CUiIzFmOCxN7m627gr3DWbksfgTzxf9HBREVgTvn1/n/adwc4pm/b/OML-Resources/o/';
BEGIN
    DBMS_OUTPUT.PUT_LINE('ONNX model file name in Object Storage is: '||ONNX_MOD_FILE);

    --------------------------------------------
    -- Define a model name for the loaded model
    --------------------------------------------
    SELECT UPPER(REGEXP_SUBSTR(ONNX_MOD_FILE, '[^.]+')) INTO MODNAME FROM dual;
    DBMS_OUTPUT.PUT_LINE('Model will be loaded and saved with name: '||MODNAME);

    ----------------------------------------------------
    -- Drop any existing model with the same name, then
    -- read the ONNX model file from Object Storage into
    -- the Autonomous Database data pump directory
    ----------------------------------------------------
    BEGIN
        DBMS_DATA_MINING.DROP_MODEL(model_name => MODNAME);
    EXCEPTION WHEN OTHERS THEN NULL;
    END;
    DBMS_CLOUD.GET_OBJECT(
        credential_name => NULL,
        directory_name  => 'DATA_PUMP_DIR',
        object_uri      => LOCATION_URI||ONNX_MOD_FILE);

    -----------------------------------------
    -- Load the ONNX model into the database
    -----------------------------------------
    DBMS_VECTOR.LOAD_ONNX_MODEL(
        directory  => 'DATA_PUMP_DIR',
        file_name  => ONNX_MOD_FILE,
        model_name => MODNAME);
    DBMS_OUTPUT.PUT_LINE('New model successfully loaded with name: '||MODNAME);
END;
/
ONNX model file name in Object Storage is: multilingual_e5_small.onnx
Model will be loaded and saved with name: MULTILINGUAL_E5_SMALL
New model successfully loaded with name: MULTILINGUAL_E5_SMALL
Let’s verify the model was loaded to the database.
SELECT model_name, algorithm, mining_function from user_mining_models WHERE model_name='MULTILINGUAL_E5_SMALL';
MODEL_NAME             ALGORITHM  MINING_FUNCTION
---------------------  ---------  ---------------
MULTILINGUAL_E5_SMALL  ONNX       EMBEDDING
You are now ready to start generating vector embeddings.
Generate vector embeddings
Let’s generate embedding vectors using the VECTOR_EMBEDDING SQL scoring function. The VECTOR_EMBEDDING function has two input parameters:
- The vector embedding model
- An expression or column name as the data to embed
The output of the VECTOR_EMBEDDING SQL function is a vector, capturing the semantic features of the input data for use in similarity searches within the database.
Here, we generate embedding vectors using the Italian translation of “the quick brown fox,” which is “la veloce volpe marrone.” Partial output is shown below.
SELECT VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'la veloce volpe marrone' as DATA) AS embedding;
EMBEDDING
--------------------------------------------------------------------------------
[5.99516742E-002,-1.65775437E-002,-4.30925973E-002,-1.02978349E-001,9.32177454E-002,
-2.73556169E-002,3.29738595E-002,-2.38687661E-003,3.59498225E-002,-1.24295689E-002,
5.77632338E-002,3.75730284E-002,5.16409054E-002,-2.31190119E-002,-3.18786874E-002,
...
...
First, create a table with some data in Italian that will be vectorized.
CREATE TABLE vec3 (
id NUMBER PRIMARY KEY,
str VARCHAR2(128),
v VECTOR
);
INSERT INTO vec3 (id, str) VALUES (1, 'Brilla, brilla, stellina');
INSERT INTO vec3 (id, str) VALUES (2, 'Baa baa pecorella nera');
INSERT INTO vec3 (id, str) VALUES (3, 'Mary had a little lamb');
INSERT INTO vec3 (id, str) VALUES (4, 'Jack e Jill sono saliti sulla collina');
INSERT INTO vec3 (id, str) VALUES (5, 'Humpty Dumpty had a great fall');
COMMIT; -- omit commit for OML notebooks; commit is automatic.
Next, create a vector for the str value in the row with id = 2. Partial output is shown below.
SELECT VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING str as data) AS embedding
FROM vec3
WHERE id = 2;
EMBEDDING
--------------------------------------------------------------------------------
[-5.06732985E-002,2.79495809E-002,-2.49498505E-002,-1.40845561E-002,-4.07454036E-002,
1.19132325E-002,4.41188365E-002,-1.03995301E-001,9.50366631E-003,4.92134318E-003,
-4.72284891E-002,-8.59436914E-002,-4.93300222E-002,1.57973133E-002,-5.0117...
...
...
Continue by creating vectors for all of the str values. Partial output is shown below.
SELECT VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING str as data) AS embedding
FROM vec3;
EMBEDDING
--------------------------------------------------------------------------------
[7.73708373E-002,-1.28598781E-002,-1.62120573E-002,-7.61429667E-002,6.15925454E-002,
-4.64253426E-002,1.80853643E-002,2.14378778E-002,6.7779664E-003,1.08906506E-002,
2.8271025E-002,-2.15901323E-002,6.99522421E-002,-2.06865221E-002,-2.46137138E-002,
...
...
Now create multiple vectors: one for each str value, plus one for the string 'testo casuale' (Italian for 'random text').
SELECT
VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING str as data) AS str_embedding,
VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'testo casuale' as data) AS random_embedding
FROM vec3;
STR_EMBEDDING
--------------------------------------------------------------------------------
[7.73708373E-002,-1.28598781E-002,-1.62120573E-002,-7.61429667E-002,6.15925454E-002,
-4.64253426E-002,1.80853643E-002,2.14378778E-002,6.7779664E-003,1.08906506E-002,
2.8271025E-002,-2.15901323E-002,6.99522421E-002,-2.06865221E-002,-2.46137138E-002,
...
...
RANDOM_EMBEDDING
--------------------------------------------------------------------------------
[5.37178144E-002,-1.72779746E-002,-2.74479631E-002,-1.24437563E-001,3.56760323E-002,
-1.83372796E-002,2.43975483E-002,1.8736871E-002,2.33881213E-002,1.51813021E-002,
4.40814905E-002,2.66225971E-002,5.5223234E-002,-3.82895581E-002,-1.87106878E-002,
...
...
You can also create a table based on an existing table with the vectors created on the fly.
CREATE TABLE vec4 as
SELECT
id,
str,
VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING str as data) as v
FROM vec3;
Table created.
Insert a new row, creating the vector on the fly.
INSERT INTO vec4 VALUES (11, 'Rema rema rema la tua barca',
VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'Rema rema rema la tua barca' as data));
SELECT * from vec4;
1  Brilla, brilla, stellina               [7.73708373E-002,-1.28598781E-002,-1.62120573E-002,...]
2  Baa baa pecorella nera                 [5.88322766E-002,-7.08066951E-003,-3.22654694E-002,...]
3  Mary had a little lamb                 [6.68142065E-002,-1.72401755E-003,-4.2377286E-002,...]
4  Jack e Jill sono saliti sulla collina  [9.11431164E-002,-5.4657223E-003,-3.17931212E-002,...]
5  Humpty Dumpty had a great fall         [9.22596604E-002,-1.16861274E-003,-4.5548439E-002,...]
11 Rema rema rema la tua barca            [8.53356794E-002,3.74443494E-002,-3.76751497E-002,...]
You are now ready to perform similarity searches using these vector embeddings in Autonomous Database.
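As an illustrative sketch (not part of the walkthrough above), a similarity search over the vec4 table could use the VECTOR_DISTANCE SQL function with cosine distance, embedding an English query string on the fly. Because the model is cross-lingual, the Italian row 'Brilla, brilla, stellina' should rank near the top for an English query about the same rhyme. The query text here is a hypothetical example.

```sql
-- Rank rows in vec4 by cosine distance to an English query string;
-- the cross-lingual model lets English queries match Italian rows.
SELECT id, str,
       VECTOR_DISTANCE(
           v,
           VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'Twinkle, twinkle, little star' as data),
           COSINE) AS dist
FROM   vec4
ORDER  BY dist
FETCH FIRST 3 ROWS ONLY;
```

Smaller distances indicate greater semantic similarity, so the closest matches appear first.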
Load the Augmented Model into Oracle Database 23ai On-Premises or Cloud Instances
First, download the ONNX model file from Object Storage. Next, set up the necessary directories to prepare for the import process. Then, use the DBMS_VECTOR.LOAD_ONNX_MODEL procedure to load the model into your schema. This enables the model’s vector embedding generation capabilities for similarity search.
Download and unzip the file containing the ONNX model (link)
$ unzip multilingual_e5_small_augmented.zip
Archive: multilingual_e5_small_augmented.zip
creating: multilingual_e5_small_augmented/
inflating: multilingual_e5_small_augmented/multilingual_e5_small.onnx
inflating: multilingual_e5_small_augmented/README_MULTILINGUAL_E5_SMALL_augmented.txt
Log into your database instance as sysdba. In this example, we are logging into a pluggable database named ORCLPDB. Replace ORCLPDB with your database name.
$ sqlplus / as sysdba
SQL> alter session set container=ORCLPDB;
Grant the required privileges and create a directory object pointing to the path where the ONNX model was unzipped.
Note, in this example, we are using the OMLUSER schema. Replace OMLUSER with your schema name.
SQL> GRANT DB_DEVELOPER_ROLE, CREATE MINING MODEL TO OMLUSER;
SQL> CREATE OR REPLACE DIRECTORY DM_DUMP AS '<path to ONNX model>';
SQL> GRANT READ ON DIRECTORY DM_DUMP TO OMLUSER;
SQL> GRANT WRITE ON DIRECTORY DM_DUMP TO OMLUSER;
SQL> exit
Log into your schema.
$ sqlplus omluser/omluser@ORCLPDB
Load the ONNX model. Optionally drop the model first if a model with the same name already exists in the database.
SQL> exec DBMS_VECTOR.DROP_ONNX_MODEL(model_name => 'MULTILINGUAL_E5_SMALL', force => true);
BEGIN
DBMS_VECTOR.LOAD_ONNX_MODEL(
directory => 'DM_DUMP',
file_name => 'multilingual_e5_small.onnx',
model_name => 'MULTILINGUAL_E5_SMALL',
metadata => JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input": {"input": ["DATA"]}}'));
END;
/
Once imported, the model can generate embeddings as shown above.
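As a quick sanity check (a hypothetical verification query, not one of the required steps), you can confirm the model is registered in your schema and produces an embedding:

```sql
-- Confirm the model is registered as an ONNX embedding model
SELECT model_name, algorithm, mining_function
FROM   user_mining_models
WHERE  model_name = 'MULTILINGUAL_E5_SMALL';

-- Generate a test embedding for an Italian phrase
SELECT VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'ciao mondo' as data) AS embedding;
```

If both queries succeed, the model is ready for use with VECTOR_EMBEDDING and similarity search.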
Learn more
The intfloat/multilingual-e5-small model provides enhanced multilingual capabilities, enabling similarity searches with higher accuracy across a broader range of languages. By loading this model into Oracle Database 23ai, you can improve the effectiveness of your semantic similarity searches across diverse linguistic contexts. For further details, refer to the Oracle AI Vector Search User’s Guide.
* ONNX models must be augmented through a specialized pipeline to work with Oracle Database 23ai’s AI Vector Search capabilities. This pipeline includes preprocessing steps for both text (tokenization) and images, along with post-processing steps like normalization and pooling. The pipeline allows users to directly input raw text and image data since all preprocessing is internalized, eliminating the need to manually convert inputs to tensors. Starting with OML4Py 2.0, the OML4Py client streamlines this process by augmenting pre-trained transformer models from Hugging Face and converting them to ONNX format. After augmentation, these models can be imported into Oracle Database 23ai to generate embeddings for similarity search.