New NVIDIA Nemotron Vision Language Model power video capabilities in Oracle Media & Entertainment
Enterprises across industries are transforming how they interact with video, images, and documents. From media intelligence to business automation, multimodal AI models enable a new way to understand and generate rich content. Oracle is bringing NVIDIA Nemotron Nano 2 VL, an enterprise-ready multimodal reasoning model for document intelligence and video understanding to power the next wave of enterprise applications on Oracle Cloud Infrastructure (OCI).
Powering Smarter Video Understanding and Generation
Nemotron Nano 2 VL, announced at GTC DC, is designed to interpret complex video content including visual frames, dense captioning, and text overlays in a unified context. Its innovative Efficient Video Sampling (EVS) identifies and prunes temporally static patches within video sequences, reducing redundant tokens by up to 4x while preserving essential semantics and accuracy. OCI Generative AI will leverage Nemotron Nano 2 VL for dense captioning of huge volumes of videos at reduced cost using NVIDIA GB200 NVL72.
In the Oracle Media & Entertainment vertical, video understanding models can automatically summarize enterprise recordings such as conferences, training sessions, broadcasts, etc., accelerating information discovery and knowledge retrieval. Leveraging these models translates to substantial benefits for media organizations (news, sports, streaming platforms, studios, and archival houses):
- Targeted search and interactive Q&A, enabling teams to query long-form video with natural language and retrieve precise moments, transcripts, and entities.
- Faster indexing where the model quickly analyzes and labels video content, making it easier to catalog and organize vast collections of videos
- Smarter video curation, where the model effectively identifies and extracts key highlights from long form videos based on natural language, saving significant resources—time, compute, and manual effort.
- Scalable content analysis, where the model can analyze and summarize videos, deriving insights from tens of thousands of hours of footage, helping media companies improve content recommendations, assess the quality of AI datasets, filter out undesirable or unsafe content, and surface trends that inform content creation and editorial planning.
Large-Scale Video Dataset Curation for Generative AI Modeling
Large-scale video datasets are the fuel for modern vision-language research, but raw footage alone doesn’t move the needle. What matters is turning that unruly ocean of clips into trustworthy, richly described material from which models can actually learn. The goal isn’t just “more data”; it’s the right data—clean, well-labeled, and searchable—so teams can iterate quickly and see real gains in accuracy and robustness.
Our approach centers on building a data engine that scales effortlessly while staying focused on quality and context. We start by ensuring only high-signal content makes it into the curated corpus, then concentrate on describing what’s truly happening on screen. Long videos rarely fit a single caption, so we capture both the big picture and scene-by-scene nuance, blending what’s seen with what’s said to produce concise, clip-level summaries that are easy to index and evaluate.
Nemotron Nano 2 VL plays a pivotal role here. It helps generate detailed, fine-grained descriptions and then distill them into coherent captions that reflect the full story of a video segment. Those stronger captions unlock better search, more faithful evaluations, and ultimately better dataset curation—without prohibitive manual labeling.
The result is a faster path from messy raw data to model-ready corpora. Researchers get a reliable backbone for experiments, product teams gain confidence in the datasets behind their features, and the organization benefits from a repeatable curation loop that supports both rapid iteration and steady improvement. In short, by investing in scalable, context-rich curation with NVIDIA Nemotron Nano 2 VL at the core, we create the conditions for vision-language systems to reach their full potential.
Unlocking Document Intelligence in Enterprise Applications
For enterprises, Nemotron Parse brings advanced document intelligence to Oracle Fusion Cloud applications. The model understands structured and unstructured content, enabling intelligent assistants to retrieve answers, summarize data, and streamline decision making.
With NVIDIA Nemotron Parse, organizations in customer service, IT, finance, insurance, and healthcare can interpret complex documents with precision and confidence, improving operational efficiency.
Efficiency and Flexibility of Open Models
The NVIDIA Nemotron vision language models combine architectural efficiency with democratized innovation. Based on the hybrid transformer-Mamba architecture, the Nemotron Nano 2 VL model is trained on over 11 million high-quality samples covering several tasks such as Image QA, OCR, Captioning, Video QA, and Image reasoning, and delivers high token throughput and low latency achieving exceptional efficiency for large-scale text or visual reasoning tasks. The model is supported by vLLM and is quantized for FP4, FP8 and BF16 precisions, further boosting performance.
With open weights and open training datasets, developers have full transparency and flexibility to deploy Nemotron models across Oracle Fusion Applications, empowering organizations to build their custom models on top of their preferred foundation models.
Enabling the Future of Multimodal AI on Oracle
Oracle’s integration of NVIDIA Nemotron brings the power and flexibility of multimodal AI directly to enterprise workloads, from document intelligence for vendor invoices and orders, to driving image-based reasoning for retail catalogs, and delivering dense video captioning for faster search, ad placement, and interactive Q&A. With NVIDIA AI Enterprise natively integrated within OCI Console, this foundation enables enterprises to build future-ready AI agents that can understand and act on their critical business data.
Check out the latest NVIDIA Nemotron announcement at GTC DC.
We would like to acknowledge the Oracle AI for Fusion Applications team (Ashok Manthina, Kaushal Kurapati), and the OCI AI Science team (Graham Horwood, Vasudev Lal, Sujeeth Bharadwaj) teams for their contributions to this blog.
Future Product Disclaimer
The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation.
