ETIQMEDIA certifies their automated video transcription and indexation solution on OCI

April 17, 2024 | 5 minute read
María Perolet
Lead speech technologies expert in Etiqmedia
Antonio Carpio
CEO in Etiqmedia
Adolfo Arguedas
COO in Etiqmedia
Nacho Herrera
Senior Speech Technologies Specialist in Etiqmedia
Alessio Comisso
OCI Senior Big Compute Specialist

In this article, we aim to evaluate the efficiency of ETIQMEDIA’s automatic transcription system on the Oracle Cloud Infrastructure (OCI).

Adoption of artificial intelligence (AI) is growing in every vertical. For media professionals, AI is proving to be a game-changer, offering applications that improve efficiency, raise content quality, and transform user experiences. The range of applications is broad: content recommendation systems, automated content creation with modern generative AI technologies, automated transcription, translation, sentiment analysis, personalized advertising, and enhanced post-production processes.

TSA, OCI and ETIQMEDIA

Telefónica Servicios Audiovisuales (TSA) has promoted a pioneering proof of concept (POC) in the audiovisual sector, using OCI and ETIQMEDIA technology to analyze multimedia content with AI. Through LABTSA, its innovation laboratory, TSA is establishing itself as a leading company in the exploration and application of AI solutions for media, audio, and video analysis within the scope of Telefónica’s audiovisual services. ETIQMEDIA is a Spanish company that specializes in developing AI-based solutions and is present in most of the national audiovisual sector. Its solutions include computer vision, multimedia content indexing, and automatic voice transcription.

Two major transcription technologies were tested: one based on Convolutional Neural Networks and the other based on Transformers. The Convolutional Neural Networks-based approach is more flexible to deploy because it’s designed to run on a CPU and is modular and customizable for each domain, in both the acoustic and contextual aspects. It also enables real-time processing.

The Transformers-based technology, on the other hand, offers much greater acoustic processing power and is more robust with poorer recording conditions and with more spontaneous speech, such as in television series and movies. Tests were conducted on different types of content (institutional, news, and fiction), providing a comprehensive verification of the computing speed of both systems.

OCI offers a rich selection of NVIDIA GPU-powered compute shapes that are well-suited for machine learning (ML) workloads, like inference and model training.

For example, the shapes based on the NVIDIA A10 Tensor Core GPU can offer economic benefits. The OCI offering includes virtual machines (VMs) with one or two GPUs and a bare metal shape with four A10 GPUs. A10-based VMs can start and stop quickly to elastically adapt resources to variable demand.

For medium- and large-scale model training and inference, consider the shapes based on the NVIDIA A100 Tensor Core GPU and NVIDIA H100 Tensor Core GPU and the newly announced shape based on the NVIDIA L40S GPU. They can scale training jobs to thousands of GPUs by using the OCI high-performance network, which is based on the accelerated RoCE protocol and NVIDIA ConnectX NICs. For small-scale model training, A10-based shapes can still prove valuable.

Test development

We evaluated both systems with the aim of maximizing throughput, measured as the number of hours of video processed per day (hours/day). For this purpose, we ran the tests on OCI, making the most of the resources of the chosen shape.
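As a rough illustration of the metric, throughput in hours of video per day can be derived from the duration of the processed media and the wall-clock time the run took. The sketch below is not ETIQMEDIA's measurement code: the function name `hours_per_day` and the sample numbers are hypothetical and only show the arithmetic.

```python
def hours_per_day(media_hours: float, wall_clock_hours: float) -> float:
    """Hours of video transcribed per 24 hours of continuous processing."""
    if wall_clock_hours <= 0:
        raise ValueError("wall_clock_hours must be positive")
    # Ratio of media time to processing time, scaled to a full day.
    return media_hours / wall_clock_hours * 24.0


# Hypothetical run: a 2-hour video transcribed in 15 minutes of wall-clock time.
example = hours_per_day(media_hours=2.0, wall_clock_hours=0.25)
print(f"{example:.2f} hours/day")
```

With these made-up inputs the system processes video 8 times faster than real time, which corresponds to 192 hours of video per day.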

For these tests, we used the VM.GPU.A10.1 shape, which includes a single A10 GPU, 15 physical CPU cores (OCPUs), and 240 GB of RAM. This setup provides a balanced VM for running both GPU- and CPU-based inference. Having both types of processing, CPU and GPU, in the same shape allows efficient execution of complex tasks involving audio streams from different sources and with different characteristics. It also simplifies the management of cloud infrastructure by eliminating the need to maintain multiple instances for different types of processing, which reduces operating costs.

The videos used for these tests have an average duration of approximately 2 hours. For Convolutional ASR on the CPU, we used three configurations: one application instance with two threads (Conv-1), five application instances with five threads each (Conv-5), and 10 application instances with two threads each (Conv-10). The following table shows the throughput obtained with each configuration.

Test (thread count)    Hours/day of video    Improvement over Conv-1
Conv-1 (2)             144.93                (baseline)
Conv-5 (25)            449.18                3.09 times
Conv-10 (20)           440.29                3.03 times

By increasing the number of application instances running in parallel, we can maximize CPU usage, leading to a significant increase in throughput. Specifically, Conv-5 and Conv-10 achieve throughput improvements of 3.09x and 3.03x, respectively, compared to Conv-1. We also observed that running fewer application instances in parallel (Conv-5) but giving each one more resources (threads) yields about a 2% improvement compared to Conv-10.
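The comparison can be reproduced from the reported figures with a few lines of Python. This is only an illustrative sketch: the `ConvRun` class and the per-thread metric are our own additions, and the throughput values are the ones reported for the three configurations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvRun:
    name: str
    instances: int
    threads_per_instance: int
    hours_per_day: float  # reported throughput for this configuration

    @property
    def total_threads(self) -> int:
        return self.instances * self.threads_per_instance


RUNS = [
    ConvRun("Conv-1", 1, 2, 144.93),
    ConvRun("Conv-5", 5, 5, 449.18),
    ConvRun("Conv-10", 10, 2, 440.29),
]


def speedup(run: ConvRun, baseline: ConvRun) -> float:
    """Throughput of `run` relative to `baseline`."""
    return run.hours_per_day / baseline.hours_per_day


baseline = RUNS[0]
for run in RUNS:
    print(
        f"{run.name}: {run.total_threads} threads, "
        f"{speedup(run, baseline):.3f}x over {baseline.name}, "
        f"{run.hours_per_day / run.total_threads:.2f} hours/day per thread"
    )
```

The computed ratios are approximately 3.099 and 3.038, so the reported 3.09 and 3.03 appear to be truncated rather than rounded. Throughput per thread drops from roughly 72 hours/day with 2 threads to roughly 18 to 22 hours/day with 20 to 25 threads, which suggests that the gain comes from keeping more of the CPU busy rather than from linear scaling.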

For Transformers, a single A10 GPU supports at most three application instances in parallel, although this limit depends on the model size and the compute shape. We therefore compared a single application instance (TF-1), two application instances (TF-2), and three application instances (TF-3). The configurations obtained the following throughput in hours/day of video, with TF-3 achieving a 2.04-times improvement over a single application instance.

  • TF-1: 401.67
  • TF-2: 730.01
  • TF-3: 818.45 (2.04-times improvement)
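To see how well the GPU scales as instances are added, the figures above can be turned into a speedup and a parallel efficiency (speedup divided by the number of instances). The following is a minimal sketch; the `scaling` helper and the efficiency metric are our own illustration and not part of the original tests.

```python
# Throughput in hours of video per day, as listed above, keyed by the number
# of application instances sharing the single A10 GPU.
TF_THROUGHPUT = {1: 401.67, 2: 730.01, 3: 818.45}


def scaling(throughput: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Return {instances: (speedup, efficiency)} relative to one instance."""
    base = throughput[1]
    result = {}
    for instances, hours in sorted(throughput.items()):
        s = hours / base
        # Efficiency is 1.0 when throughput grows linearly with instances.
        result[instances] = (s, s / instances)
    return result


for instances, (s, eff) in scaling(TF_THROUGHPUT).items():
    print(f"TF-{instances}: speedup {s:.2f}x, efficiency {eff:.0%}")
```

Two instances retain about 91% efficiency (a 1.82x speedup), while three instances reach the reported 2.04x at about 68% efficiency, so the third instance still adds throughput but with diminishing returns.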

We observe how OCI can easily adapt to the needs of each system, optimizing the use of CPU or GPU as necessary.

Conclusion

The tests performed on OCI yielded positive throughput results for both Convolutional ASR and Transformers. For Convolutional ASR, the configuration that best suits the machine (15 OCPUs, 240 GB of RAM) is five application instances with five threads per instance, resulting in a video transcription throughput of 449 hours/day. For Transformers, processing capacity is closely tied to the available GPU. In this shape (NVIDIA A10, 24 GB), the system can run three application instances in parallel, achieving a throughput of 818.45 hours/day, a 2.04x improvement over a single application instance. Additionally, OCI has allowed ETIQMEDIA to achieve greater throughput with its algorithms than in previous tests run on other clouds on the market using similar hardware resources.

Free learning resources are available at the AI workshops and training to help you get the most out of your Oracle AI development and deployment experience. For more information on Oracle Cloud Infrastructure’s capabilities, visit us at GPU compute and AI infrastructure.
