We’re pleased to announce the availability of a downloadable, augmented* version of Hugging Face’s intfloat/multilingual-e5-small embedding model. This model, distributed by Hugging Face, can be loaded directly into the database using DBMS_VECTOR.LOAD_ONNX_MODEL or DBMS_DATA_MINING.IMPORT_ONNX_MODEL. This new model enhances our semantic similarity search capabilities with robust multilingual support.
About the intfloat/multilingual-e5-small Model
The intfloat/multilingual-e5-small model generates vector embeddings for text in multiple languages. Its compact design delivers effective performance across a wide range of languages, making it well suited to diverse datasets and multilingual scenarios.
In our previous blog post, we introduced the all-MiniLM-L12-v2 ONNX model. Both all-MiniLM-L12-v2 and multilingual-e5-small are 12-layer sentence transformer models that produce 384-dimensional embeddings. While all-MiniLM-L12-v2 focuses on English, multilingual-e5-small supports roughly 100 languages, making it well suited to cross-lingual tasks. The latter also underwent additional supervised fine-tuning, which improves its performance in multilingual scenarios.
This model can be loaded into Oracle Autonomous Database, as well as on-premises and cloud databases. Detailed instructions for both scenarios are provided below.
Load the Augmented Model into Oracle Autonomous Database
Let’s walk through how to enable the augmented Hugging Face embedding model in Autonomous Database using the following steps:
- Load the model to the database from Object Storage
- Generate vector embeddings
Load the model into the database from Object Storage. You must specify the ONNX model file name and its location URI in Object Storage.
DECLARE
    ONNX_MOD_FILE VARCHAR2(100) := 'multilingual_e5_small.onnx';
    MODNAME       VARCHAR2(500);
    LOCATION_URI  VARCHAR2(200) := 'https://adwc4pm.objectstorage.us-ashburn-1.oci.customer-oci.com/p/mbFT6Y4-cDFZr86_BlvZJA8CUiIzFmOCxN7m627gr3DWbksfgTzxf9HBREVgTvn1/n/adwc4pm/b/OML-Resources/o/';
BEGIN
    DBMS_OUTPUT.PUT_LINE('ONNX model file name in Object Storage is: '||ONNX_MOD_FILE);

    --------------------------------------------
    -- Define a model name for the loaded model
    --------------------------------------------
    SELECT UPPER(REGEXP_SUBSTR(ONNX_MOD_FILE, '[^.]+')) INTO MODNAME FROM dual;
    DBMS_OUTPUT.PUT_LINE('Model will be loaded and saved with name: '||MODNAME);

    ----------------------------------------------------
    -- Drop any existing model with the same name, then
    -- read the ONNX model file from Object Storage into
    -- the Autonomous Database data pump directory
    ----------------------------------------------------
    BEGIN
        DBMS_DATA_MINING.DROP_MODEL(model_name => MODNAME);
    EXCEPTION WHEN OTHERS THEN NULL;
    END;
    DBMS_CLOUD.GET_OBJECT(
        credential_name => NULL,
        directory_name  => 'DATA_PUMP_DIR',
        object_uri      => LOCATION_URI||ONNX_MOD_FILE);

    -----------------------------------------
    -- Load the ONNX model into the database
    -----------------------------------------
    DBMS_VECTOR.LOAD_ONNX_MODEL(
        directory  => 'DATA_PUMP_DIR',
        file_name  => ONNX_MOD_FILE,
        model_name => MODNAME);
    DBMS_OUTPUT.PUT_LINE('New model successfully loaded with name: '||MODNAME);
END;
/
ONNX model file name in Object Storage is: multilingual_e5_small.onnx
Model will be loaded and saved with name: MULTILINGUAL_E5_SMALL
New model successfully loaded with name: MULTILINGUAL_E5_SMALL
Let’s verify the model was loaded to the database.
SELECT model_name, algorithm, mining_function from user_mining_models WHERE model_name='MULTILINGUAL_E5_SMALL';
MODEL_NAME             ALGORITHM  MINING_FUNCTION
---------------------  ---------  ---------------
MULTILINGUAL_E5_SMALL  ONNX       EMBEDDING
You are now ready to start generating vector embeddings.
Generate vector embeddings
Let’s generate embedding vectors using the VECTOR_EMBEDDING SQL scoring function. The VECTOR_EMBEDDING function has two input parameters:
- The vector embedding model
- An expression or column name as the data to embed
The output of the VECTOR_EMBEDDING SQL function is a vector, capturing the semantic features of the input data for use in similarity searches within the database.
Here, we generate embedding vectors using the Italian translation of “the quick brown fox,” which is “la veloce volpe marrone.” Partial output is shown below.
SELECT VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'la veloce volpe marrone' as DATA) AS embedding;
EMBEDDING
--------------------------------------------------------------------------------
[5.99516742E-002,-1.65775437E-002,-4.30925973E-002,-1.02978349E-001,9.32177454E-002,
-2.73556169E-002,3.29738595E-002,-2.38687661E-003,3.59498225E-002,-1.24295689E-002,
5.77632338E-002,3.75730284E-002,5.16409054E-002,-2.31190119E-002,-3.18786874E-002,
...
...
First, create a table with some data in Italian that will be vectorized.
CREATE TABLE vec3 (
id NUMBER PRIMARY KEY,
str VARCHAR2(128),
v VECTOR
);
INSERT INTO vec3 (id, str) VALUES (1, 'Brilla, brilla, stellina');
INSERT INTO vec3 (id, str) VALUES (2, 'Baa baa pecorella nera');
INSERT INTO vec3 (id, str) VALUES (3, 'Mary had a little lamb');
INSERT INTO vec3 (id, str) VALUES (4, 'Jack e Jill sono saliti sulla collina');
INSERT INTO vec3 (id, str) VALUES (5, 'Humpty Dumpty had a great fall');
COMMIT; -- omit commit for OML notebooks; commit is automatic.
Next, create a vector for the str value in the row with id = 2. Partial output is shown below.
SELECT VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING str as data) AS embedding
FROM vec3
WHERE id = 2;
EMBEDDING
--------------------------------------------------------------------------------
[-5.06732985E-002,2.79495809E-002,-2.49498505E-002,-1.40845561E-002,-4.07454036E-002,
1.19132325E-002,4.41188365E-002,-1.03995301E-001,9.50366631E-003,4.92134318E-003,
-4.72284891E-002,-8.59436914E-002,-4.93300222E-002,1.57973133E-002,-5.0117...
...
...
Continue by creating vectors for all of the str values. Partial output is shown below.
SELECT VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING str as data) AS embedding
FROM vec3;
EMBEDDING
--------------------------------------------------------------------------------
[7.73708373E-002,-1.28598781E-002,-1.62120573E-002,-7.61429667E-002,6.15925454E-002,
-4.64253426E-002,1.80853643E-002,2.14378778E-002,6.7779664E-003,1.08906506E-002,
2.8271025E-002,-2.15901323E-002,6.99522421E-002,-2.06865221E-002,-2.46137138E-002,
...
...
Now create multiple vectors: one for each str value, plus one for the string 'testo casuale' (Italian for 'random text').
SELECT
VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING str as data) AS str_embedding,
VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'testo casuale' as data) AS random_embedding
FROM vec3;
STR_EMBEDDING
--------------------------------------------------------------------------------
[7.73708373E-002,-1.28598781E-002,-1.62120573E-002,-7.61429667E-002,6.15925454E-002,
-4.64253426E-002,1.80853643E-002,2.14378778E-002,6.7779664E-003,1.08906506E-002,
2.8271025E-002,-2.15901323E-002,6.99522421E-002,-2.06865221E-002,-2.46137138E-002,
...
...
RANDOM_EMBEDDING
--------------------------------------------------------------------------------
[5.37178144E-002,-1.72779746E-002,-2.74479631E-002,-1.24437563E-001,3.56760323E-002,
-1.83372796E-002,2.43975483E-002,1.8736871E-002,2.33881213E-002,1.51813021E-002,
4.40814905E-002,2.66225971E-002,5.5223234E-002,-3.82895581E-002,-1.87106878E-002,
...
...
You can also create a table based on an existing table with the vectors created on the fly.
CREATE TABLE vec4 as
SELECT
id,
str,
VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING str as data) as v
FROM vec3;
Table created.
Insert a new row, creating the vector on the fly.
INSERT INTO vec4 VALUES (11, 'Rema rema rema la tua barca',
VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'Rema rema rema la tua barca' as data));
SELECT * from vec4;
1  Brilla, brilla, stellina               [7.73708373E-002,-1.28598781E-002,-1.62120573E-002,...]
2  Baa baa pecorella nera                 [5.88322766E-002,-7.08066951E-003,-3.22654694E-002,...]
3  Mary had a little lamb                 [6.68142065E-002,-1.72401755E-003,-4.2377286E-002,...]
4  Jack e Jill sono saliti sulla collina  [9.11431164E-002,-5.4657223E-003,-3.17931212E-002,...]
5  Humpty Dumpty had a great fall         [9.22596604E-002,-1.16861274E-003,-4.5548439E-002,...]
11 Rema rema rema la tua barca            [8.53356794E-002,3.74443494E-002,-3.76751497E-002,...]
You are now ready to perform similarity searches using these vector embeddings in Autonomous Database.
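As an illustrative sketch (not part of the walkthrough above), a similarity search over the vec4 table could use the VECTOR_DISTANCE SQL function with cosine distance, embedding an English query string on the fly. Because the model is cross-lingual, the Italian row 'Brilla, brilla, stellina' should rank near the top for an English query about the same rhyme. The query text here is a hypothetical example.

```sql
-- Rank rows in vec4 by cosine distance to an English query string;
-- the cross-lingual model lets English queries match Italian rows.
SELECT id, str,
       VECTOR_DISTANCE(
           v,
           VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'Twinkle, twinkle, little star' as data),
           COSINE) AS dist
FROM   vec4
ORDER  BY dist
FETCH FIRST 3 ROWS ONLY;
```

Smaller distances indicate greater semantic similarity, so the closest matches appear first.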
Load the Augmented Model into Oracle Database 23ai On-Premises or Cloud Instances
First, download the ONNX model file from Object Storage. Next, set up the necessary directories to prepare for the import process. Then, use the DBMS_VECTOR.LOAD_ONNX_MODEL procedure to load the model into your schema. This enables the model’s vector embedding generation capabilities for similarity search.
Download and unzip the file containing the ONNX model (link)
$ unzip multilingual_e5_small_augmented.zip
Archive: multilingual_e5_small_augmented.zip
creating: multilingual_e5_small_augmented/
inflating: multilingual_e5_small_augmented/multilingual_e5_small.onnx
inflating: multilingual_e5_small_augmented/README_MULTILINGUAL_E5_SMALL_augmented.txt
Log into your database instance as sysdba. In this example, we are logging into a pluggable database named ORCLPDB. Replace ORCLPDB with your database name.
$ sqlplus / as sysdba
SQL> alter session set container=ORCLPDB;
Grant the required privileges and create a directory object pointing to the path where the ONNX model was unzipped.
Note, in this example, we are using the OMLUSER schema. Replace OMLUSER with your schema name.
SQL> GRANT DB_DEVELOPER_ROLE, CREATE MINING MODEL TO OMLUSER;
SQL> CREATE OR REPLACE DIRECTORY DM_DUMP AS '<path to ONNX model>';
SQL> GRANT READ ON DIRECTORY DM_DUMP TO OMLUSER;
SQL> GRANT WRITE ON DIRECTORY DM_DUMP TO OMLUSER;
SQL> exit
Log into your schema.
$ sqlplus omluser/omluser@ORCLPDB
Load the ONNX model. Optionally drop the model first if a model with the same name already exists in the database.
SQL> exec DBMS_VECTOR.DROP_ONNX_MODEL(model_name => 'MULTILINGUAL_E5_SMALL', force => true);
BEGIN
DBMS_VECTOR.LOAD_ONNX_MODEL(
directory => 'DM_DUMP',
file_name => 'multilingual_e5_small.onnx',
model_name => 'MULTILINGUAL_E5_SMALL',
metadata => JSON('{"function" : "embedding", "embeddingOutput" : "embedding", "input": {"input": ["DATA"]}}'));
END;
/
Once imported, the model can generate embeddings as shown above.
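As a quick sanity check (a hypothetical verification query, not one of the required steps), you can confirm the model is registered in your schema and produces an embedding:

```sql
-- Confirm the model is registered as an ONNX embedding model
SELECT model_name, algorithm, mining_function
FROM   user_mining_models
WHERE  model_name = 'MULTILINGUAL_E5_SMALL';

-- Generate a test embedding for an Italian phrase
SELECT VECTOR_EMBEDDING(MULTILINGUAL_E5_SMALL USING 'ciao mondo' as data) AS embedding;
```

If both queries succeed, the model is ready for use with VECTOR_EMBEDDING and similarity search.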
Learn more
The intfloat/multilingual-e5-small model provides enhanced multilingual capabilities, enabling similarity searches with higher accuracy across a broader range of languages. By loading this model into Oracle Database 23ai, you can improve the effectiveness of your semantic similarity searches across diverse linguistic contexts. For further details, refer to the Oracle AI Vector Search User’s Guide.
* ONNX models must be augmented through a specialized pipeline to work with Oracle Database 23ai’s AI Vector Search capabilities. This pipeline includes preprocessing steps for both text (tokenization) and images, along with post-processing steps like normalization and pooling. The pipeline allows users to directly input raw text and image data since all preprocessing is internalized, eliminating the need to manually convert inputs to tensors. Starting with OML4Py 2.0, the OML4Py client streamlines this process by augmenting pre-trained transformer models from Hugging Face and converting them to ONNX format. After augmentation, these models can be imported into Oracle Database 23ai to generate embeddings for similarity search.