For the last decade, computer vision has focused on one goal: perception. Researchers taught machines to classify images, detect objects, and segment pixels. But in the enterprise, as in the real world, "seeing" is only the first step.
To drive value, AI must move from observation to action. We invite you to join us at GRAIL-V (Grounded Retrieval & Agentic Intelligence for Vision-Language) at CVPR 2026. This workshop brings researchers and practitioners together to tackle one of the hardest challenges in AI today: building multimodal agents that work in production.
From Passive Observation to Active Agents
Real-world data is messy. Critical context lives in fragmented PDFs, dashboards, images, spreadsheets, and videos. A reliable multimodal agent must connect these distributed sources to make accurate decisions and solve user tasks. The question is no longer "Can the model interpret the input?" It is "Can you trust the system to act?"
To bridge the gap between impressive demos and deployable reliability, agents must master the following capabilities (sketched in code after this list):
- Plan and route intelligently across tools and modalities.
- Generate and edit content only when necessary.
- Ground decisions in evidence with precise citations.
- Handle uncertainty responsibly when evidence conflicts.
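To make these capabilities concrete, here is a minimal Python sketch of what a single evidence-gated agent step might look like. Everything in it is hypothetical and for illustration only: the `Evidence` type, the `answer_with_citations` function, the `retrieve` and `generate` callables, and the `min_score` threshold are assumptions, not part of any specific framework or system presented at the workshop.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical types and names for illustration only.
@dataclass
class Evidence:
    source_id: str   # e.g. a PDF page, dashboard panel, or video timestamp
    content: str
    score: float     # retriever confidence in [0, 1]

def answer_with_citations(
    query: str,
    retrieve: Callable[[str], Iterable[Evidence]],
    generate: Callable[[str, list], str],
    min_score: float = 0.6,
) -> dict:
    """One evidence-gated agent step: retrieve, filter, act or abstain."""
    evidence = [e for e in retrieve(query) if e.score >= min_score]

    # Handle uncertainty responsibly: abstain when no evidence clears the
    # threshold. A production system would also detect conflicting sources
    # here and escalate instead of guessing.
    if not evidence:
        return {"answer": None, "citations": [], "status": "abstained"}

    # Generate only from retrieved context, and cite exactly what was used.
    answer = generate(query, evidence)
    return {
        "answer": answer,
        "citations": [e.source_id for e in evidence],
        "status": "ok",
    }
```

The key design choice in this sketch is that the generator only ever sees retrieved evidence, and the agent abstains rather than guesses when that evidence is missing; this is what makes its answers auditable and its citations meaningful.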

Why You Should Attend
Oracle AI is proud to sponsor and organize GRAIL-V. We believe the next wave of enterprise AI is defined by evidence-driven systems that behave predictably in complex environments.
This half-day workshop features a mix of invited talks, peer-reviewed research, and expert panels. You will hear from industry leaders and distinguished professors, including:
- Kristen Grauman (UT Austin)
- Mohit Bansal (UNC Chapel Hill)
- Dan Roth (University of Pennsylvania)
- Scott Yih (FAIR, Meta)
- Sujith Ravi (Oracle AI)
Call for Papers
We want to see your work. We are looking for submissions that advance the core mechanisms behind multimodal agents.
Topics of Interest (non-exhaustive list):
- Multimodal Retrieval: Scaling search across images, video, and UI.
- Image/Video Understanding: Deep interpretation of visual data.
- Generative Tools: How multimodal agents use generation across images, videos, and text.
- Benchmarks & Evaluation: Reproducible methods for measuring success.
- Grounding: Evidence, citation provenance, and audit-ready faithfulness.
Important Dates
Mark your calendars for the following deadlines:
- Submission Deadline: March 5, 2026
- Author Notification: March 18, 2026
- Conference: June 3–7, 2026 (Denver, USA)
To learn more about the submission guidelines and speakers, visit the GRAIL-V Workshop Website.
We look forward to seeing you in Denver to advance the state of grounded multimodal agents.

