In this article, we continue to share our experiences in the field of Large Language Models (LLMs), focusing in particular on how Visual Language Models (VLMs) are revolutionizing document pre-processing in RAG systems.
VLMs are the meeting point between vision and language, between visual content and text.
They represent the next step in the evolution of AI for document understanding and are an essential component of next-generation RAG systems.
With VLMs, document pre-processing in RAG systems is no longer a simple technical step, but becomes a phase of intelligent data interpretation.
From the era of LLMs to multimodal understanding
In recent years, Large Language Models (LLMs) — such as GPT-4o, Claude 3.5, or Llama 3.1 — have transformed the way companies manage and interpret textual data.
From the automatic generation of intelligent responses in support systems to the semantic analysis of business reports and logs, LLMs have become essential tools for improving efficiency and decision quality.
However, most business documents are not made up of text alone.
Technical projects, reports, engineering drawings, or functional specifications contain images, diagrams, charts, and tables that convey crucial information but are difficult to interpret with purely linguistic models.
This is where a new generation of models comes into play: Visual Language Models (VLMs).
What are VLMs and why are they a fundamental component of RAG systems
VLMs combine the visual capabilities of computer vision models with the semantic understanding typical of LLMs. In other words, a VLM is able to “see” and “read” simultaneously, interpreting images, text, and graphic structure as a single coherent language.
In a Retrieval-Augmented Generation (RAG) architecture, the use of VLMs represents a turning point in the data preparation phase, which includes document pre-processing, chunking, data enrichment, embedding, and indexing in a vector store. More specifically, for the pre-processing phase, a VLM can analyze documents at a visual and semantic level, returning structured representations enriched with metadata.
In practice, while traditional OCR extracts only text, a VLM can understand diagrams, legends, tables, and visual relationships, providing a richer knowledge base for retrieval and for generating high-quality responses.
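As a minimal sketch of what this looks like in practice, the snippet below sends a single page image to a multimodal model and asks for a structured, metadata-rich description rather than plain OCR text. It assumes GPT-4o accessed through the OpenAI Python SDK; the prompt wording and the element fields it requests (type, bounding box, text, summary) are illustrative choices, not a standard schema.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PAGE_ANALYSIS_PROMPT = (
    "Describe this document page as JSON with a list of elements. "
    "For each element give: type (paragraph, table, figure, diagram, caption), "
    "an approximate bounding box, the extracted text, and a short semantic summary."
)

def analyze_page(image_path: str) -> str:
    """Send one page image to a VLM and return its structured JSON description."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any multimodal model with an equivalent API would work
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PAGE_ANALYSIS_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content  # JSON string describing the page
```

Any VLM with an equivalent multimodal chat interface could be substituted; the important point is that the output is structured data about the page, not a flat stream of characters.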
An example of a VLM pipeline for pre-processing in RAG systems
To understand the potential of VLMs in document understanding, let’s imagine a pre-processing pipeline designed to process complex documents—for example, PDFs containing technical diagrams, tables, and illustrations. These are the main stages of the pipeline:
- Multimodal conversion and analysis
Each page of the document is converted into an image and sent to an advanced VLM (such as Gemini 2.5 Pro, GPT-4o, or an open-source model such as LLaVA-NeXT). The model simultaneously interprets text, layout, and visual components, returning a semantic understanding at the page level.
- Structured extraction
The result of the analysis is translated into structured data, such as a JSON file describing text, coordinates, types of visual elements, and spatial relationships. This step provides a unified view of the document, which is useful for subsequent segmentation or intelligent chunking operations.
- Synthetic data generation and fine-tuning
In the absence of labeled datasets, the pipeline can generate synthetic data from public documents or controlled internal repositories. This data is used to optimize the model's behavior through fine-tuning or prompt optimization, improving its accuracy in recognizing specific patterns.
- Indexing and integration with RAG
The results are then enriched with metadata and sent to the embedding phase to be indexed in a vector database. In this way, the RAG system can subsequently retrieve both textual and visual information, ensuring more relevant responses based on multimodal understanding.
- Pipeline automation
The pipeline can be executed asynchronously: for example, a service monitors an S3 bucket or a shared directory, automatically processes each new document, and updates the knowledge base index in real time (a minimal version of this flow is sketched in code after this list).
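To show how these stages can fit together, here is a simplified orchestration sketch. It assumes pages are rasterized with the pdf2image library, that each page image is analyzed by a function like the analyze_page shown earlier (imported here from a hypothetical module), and that embeddings come from the OpenAI embeddings endpoint; the vector_store object and the polling loop are placeholders for whatever database and scheduler the real system uses.

```python
import json
import time
from pathlib import Path

from openai import OpenAI
from pdf2image import convert_from_path  # requires the poppler utilities to be installed

from vlm_page_analysis import analyze_page  # hypothetical module holding the VLM call sketched earlier

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def process_document(pdf_path: Path) -> list[dict]:
    """Rasterize each PDF page, analyze it with the VLM, and return metadata-enriched chunks."""
    chunks = []
    for page_number, image in enumerate(convert_from_path(str(pdf_path), dpi=200), start=1):
        image_file = f"/tmp/{pdf_path.stem}_p{page_number}.png"
        image.save(image_file, "PNG")
        page = json.loads(analyze_page(image_file))  # structured JSON description of the page
        for element in page.get("elements", []):
            chunks.append({
                "text": element.get("text") or element.get("summary", ""),
                "metadata": {
                    "source": pdf_path.name,
                    "page": page_number,
                    "element_type": element.get("type"),
                },
            })
    return chunks


def embed_and_index(chunks: list[dict], vector_store) -> None:
    """Embed chunk texts and upsert them; the vector_store interface is a placeholder."""
    texts = [chunk["text"] for chunk in chunks]
    embeddings = client.embeddings.create(model="text-embedding-3-small", input=texts)
    for chunk, item in zip(chunks, embeddings.data):
        vector_store.upsert(
            vector=item.embedding,
            payload={**chunk["metadata"], "text": chunk["text"]},
        )


def watch_directory(folder: str, vector_store, interval_seconds: int = 30) -> None:
    """Naive polling loop: process and index every new PDF that appears in the folder."""
    seen: set[Path] = set()
    while True:
        for pdf_path in Path(folder).glob("*.pdf"):
            if pdf_path not in seen:
                embed_and_index(process_document(pdf_path), vector_store)
                seen.add(pdf_path)
        time.sleep(interval_seconds)
```

In production, the polling loop would typically be replaced by event-driven triggers (for example S3 notifications or a message queue), but the per-document flow stays the same: rasterize, analyze, enrich, embed, index.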
The main Visual Language Models available today
Thanks to public benchmarks from platforms such as Hugging Face, it is possible to compare the best-performing VLMs on the market today. Below is an updated summary:
| Model | Type | State of the art | Production Ready | Notes |
| --- | --- | --- | --- | --- |
| Gemini 2.5 Pro | Proprietary (Google DeepMind) | High | YES | Full multimodal (text, images, video). Excellent for technical documents. |
| GPT-4o | Proprietary (OpenAI) | High | YES | High multimodal performance; already used in production environments. |
| Claude 3.5 Sonnet | Proprietary (Anthropic) | High | YES | Strong in diagrams and complex visual comprehension. |
| LLaVA-NEXT | Open Source | Medium-high | NO | Good balance between performance and openness; still evolving. |
| Qwen-VL-Max | Open Source (Alibaba) | Medium-high | NO | Excellent balance between visual accuracy and speed. |
| InternVL 2.0 | Open Source | Medium | NO | Interesting for PDFs and complex diagrams; experimental phase. |
| Kosmos-2 | Open Source (Microsoft) | Low | NO | Solid multimodal OCR, but less effective in deep semantics. |
| Fuyu 8B | Open Source (Adept AI) | Low | NO | Excellent speed, ideal for prototyping and testing. |
Sources: OpenCompass public benchmarks – Hugging Face
VLM as the key to document understanding in RAG systems
In conclusion, Visual Language Models represent the natural evolution of LLMs, paving the way for systems capable of understanding multimodal documents in a truly intelligent way.
Within RAG pipelines, their contribution is decisive:
- they make pre-processing more accurate,
- they enable truly contextual semantic chunking,
- and they allow for the automatic enrichment of visual and textual metadata.
For organizations that handle large quantities of technical documents, reports, or project diagrams, VLMs offer a concrete advantage: transforming every document — even the most complex ones — into knowledge that can be used by artificial intelligence.
Finally, Humanativa continues to invest in RAG/LLM: based on feedback from the first advanced RAG projects, Humanativa's Competence Center will include VLM-based pre-processing modules in the next version of the core system of its LLM solutions.