DeepSeek revolutionizes AI with text-in-image compression that achieves 97% fidelity

Chinese technology company DeepSeek announced a significant innovation in the field of artificial intelligence with the release of DeepSeek-OCR, a model designed to overcome one of the biggest barriers of large language models (LLMs): the context window limitation. The new approach transforms text into a visual representation, allowing for up to ten times greater data compression without substantial loss of information.

This technique allows AI systems to process massive volumes of documents more quickly and cost-effectively, while maintaining 97% accuracy in retrieving original content. The development, detailed in a technical article dated October 20, 2025, directly responds to the growing demand for large-scale data processing without the consequent increase in computational costs.

The main problem that DeepSeek-OCR aims to solve is the finite ability of LLMs to “remember” or process information in a single interaction. By converting text into compact images, the technology bypasses the need to process long sequences of text tokens, which are the basic unit of information for these models, optimizing the use of resources and opening up new possibilities for analyzing complex documents.
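To make the compression claim concrete, the ratio can be expressed as the number of text tokens divided by the number of visual tokens that replace them. The sketch below is illustrative only: the function name `compression_ratio` and the sample token counts are assumptions chosen for the example, not figures taken from DeepSeek's paper.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return how many text tokens each visual token stands in for."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical page: 1,000 text tokens rendered as an image
# that the encoder represents with 100 visual tokens.
ratio = compression_ratio(1000, 100)
print(f"{ratio:.1f}x")  # 10.0x
```

A ratio of 10 corresponds to the "up to ten times" compression described above.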

Intelligence Artificial – Foto: Owlie Productions/ Shutterstock.com

Innovation behind visual compression

DeepSeek-OCR operates with a two-step process that changes how AI systems handle textual information. First, the model receives the input text and internally converts it into two-dimensional images, as if it were “printing” the content on a digital screen. Specialized visual encoders then analyze these images and compress them into a much smaller number of visual tokens. This strategy is fundamental to the system’s efficiency, as it drastically reduces the computational load required for processing.

One of the most sophisticated aspects of the technology is a variable compression system that mimics the functioning of human memory. The model assigns greater resolution and, consequently, more tokens to the most recent and relevant contexts, while older or lower-priority information is stored with less detail and fewer tokens. This dynamic resource allocation ensures that accuracy is maintained where it is needed most, while optimizing long-term storage.

The model’s ability to handle approximately 100 different languages and process non-textual elements such as graphs, complex tables and chemical formulas further expands its applicability in real-world scenarios, making it a versatile tool for digitizing and analyzing knowledge on a global scale.
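The memory-like variable compression described in this section can be pictured as a token budget that shrinks as a piece of context gets older. The article does not specify how the budget decays, so the exponential schedule, the `half_life` parameter and the function name `token_budget` below are all assumptions made for illustration; the 64 and 400 limits reuse the per-image token range reported for the model.

```python
def token_budget(age: int, max_tokens: int = 400, min_tokens: int = 64,
                 half_life: float = 2.0) -> int:
    """Visual tokens granted to a context chunk that is `age` turns old.

    Assumed schedule: the budget halves every `half_life` turns and
    never drops below `min_tokens`.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    decayed = int(max_tokens * 0.5 ** (age / half_life))
    return max(min_tokens, decayed)


# Budgets for chunks that are 0, 2, 4 and 6 turns old.
budgets = [token_budget(age) for age in range(0, 8, 2)]
print(budgets)  # [400, 200, 100, 64]
```

Recent context keeps the full budget, while older chunks settle at the minimum, which mirrors the idea of storing distant information in less detail.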

Efficiency and performance in numbers

DeepSeek-OCR’s superiority has been validated in rigorous benchmark tests such as OmniDocBench, where it significantly outperformed state-of-the-art models. In comparative tests, the model was shown to be capable of generating more than 200,000 pages of data per day using a single Nvidia A100 GPU, setting a new standard of performance in optical character recognition (OCR) and document processing tasks.
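For a sense of scale, the reported throughput can be turned into a per-second rate and a rough capacity estimate. This is back-of-the-envelope arithmetic, not a benchmark: it assumes the 200,000 pages per day figure holds for your documents and that throughput scales linearly with the number of GPUs. The function name `gpus_needed` and the example workload are hypothetical.

```python
import math

PAGES_PER_GPU_DAY = 200_000  # figure reported for a single Nvidia A100
SECONDS_PER_DAY = 24 * 60 * 60


def gpus_needed(total_pages: int, days: int,
                pages_per_gpu_day: int = PAGES_PER_GPU_DAY) -> int:
    """GPUs required to finish `total_pages` in `days`, assuming linear scaling."""
    if days <= 0:
        raise ValueError("days must be positive")
    return math.ceil(total_pages / (pages_per_gpu_day * days))


pages_per_second = PAGES_PER_GPU_DAY / SECONDS_PER_DAY
print(f"{pages_per_second:.2f} pages/s per GPU")  # 2.31 pages/s per GPU

# Hypothetical workload: 30 million pages to be processed within one week.
print(gpus_needed(30_000_000, days=7))  # 22
```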

This efficiency not only speeds up processing but also cuts operational costs, with savings that can reach 90%, according to production analyses. The model’s versatility is another strong point: it can process documents with irregular layouts, such as financial reports, invoices and even handwritten notes, and it can generate high-quality synthetic data for training other LLMs, expanding the available data sets. Support for different resolution settings, from 64 to 400 visual tokens per image, provides flexibility for different application needs.
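One way to picture the 64-to-400 token range is as a set of budgets from which the smallest one that keeps compression within a target ratio is chosen. The sketch below is a hypothetical selection rule, not DeepSeek's own logic: the article only gives the two endpoints, so the intermediate budgets 100 and 256, the 10x ceiling as a default, and the name `pick_budget` are assumptions.

```python
from typing import Optional

# 64 and 400 are the endpoints given in the article;
# 100 and 256 are assumed intermediate settings.
BUDGETS = (64, 100, 256, 400)


def pick_budget(text_tokens: int, max_ratio: float = 10.0) -> Optional[int]:
    """Smallest visual-token budget keeping text_tokens / budget <= max_ratio."""
    for budget in BUDGETS:
        if text_tokens / budget <= max_ratio:
            return budget
    return None  # page is too dense for one image at this ratio


print(pick_budget(600))   # 64   (600 / 64 = 9.375)
print(pick_budget(900))   # 100  (900 / 64 exceeds 10, 900 / 100 = 9)
print(pick_budget(5000))  # None (5000 / 400 = 12.5)
```

A page that returns `None` would need to be split across several images or compressed beyond the chosen ratio.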

The technical mechanism of DeepEncoder

The architecture behind DeepSeek-OCR’s performance is centered on the DeepEncoder component. This design integrates existing models to perform specific tasks in a highly optimized way. First, a model such as the Segment Anything Model (SAM) segments the layout and image elements of the document. In parallel, the CLIP model (Contrastive Language–Image Pre-training) provides an understanding of the global context of the page. After this initial analysis, a compressor reduces the number of tokens generated by up to 16 times, which underpins the system’s efficiency. The result is a framework that, during inference, activates just 570 million parameters, thanks to a MoE (Mixture of Experts) decoder that dynamically selects the most appropriate neural “experts” for each task.
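The effect of the 16x compressor can be illustrated with simple token arithmetic. The sketch below assumes square input images cut into 16-by-16 pixel patches, one token per patch, before compression; the patch size, the image sizes and the function name `vision_tokens` are assumptions for illustration and are not stated in the article.

```python
def vision_tokens(image_side: int, patch_size: int = 16,
                  compression: int = 16) -> int:
    """Visual tokens left after patching a square image and compressing.

    Assumes one token per patch, then a fixed `compression` factor.
    """
    patches_per_side = image_side // patch_size
    patch_tokens = patches_per_side ** 2
    return patch_tokens // compression


for side in (512, 1280):
    print(side, vision_tokens(side))
# 512 64
# 1280 400
```

Under these assumptions, a 512-pixel image yields 64 tokens and a 1280-pixel image yields 400, which is consistent with the 64-to-400 range quoted for the model.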

Repercussions in the artificial intelligence community

The launch of DeepSeek-OCR generated immediate and positive reactions from prominent figures in the AI community. Andrej Karpathy, co-founder of OpenAI, publicly praised the study.

In his analysis, Karpathy raised the fundamental question of whether pixels could become a more efficient input tool than text tokens for LLMs.

His post sparked an intense debate among developers and researchers in specialized forums about the feasibility of extending this technique to fully train language models.

Practical applications and business impact

The implications of DeepSeek-OCR for the enterprise environment are vast and transformative. With this technology, companies can overcome the limitations of fragmented prompts.

This allows you to load entire knowledge bases, such as technical documentation, product manuals, or source code repositories, in a single interaction with AI.

Jeffrey Emanuel, a former quantitative investor, highlighted the technology’s potential to quickly create caches containing millions of tokens, which would drastically reduce latency for complex enterprise queries.

The ability to process nine different types of PDF files, including academic articles, newspapers, and annual reports, speeds up analyses that would previously have required weeks of manual work.

Technical challenges and the future of technology

Despite remarkable performance in data storage and reconstruction, DeepSeek-OCR still faces limitations. Currently, the technology focuses more on faithful information retrieval than on advanced reasoning about visually compressed content.

Practical challenges such as variations in resolution, color and scan quality in real-world documents can impact accuracy and require further research to fully overcome.

Multilingual support and document versatility

One of DeepSeek-OCR’s competitive differentiators is its broad linguistic capabilities, offering support for approximately 100 languages. This makes it a global tool, capable of serving international organizations and multinational research projects.

The model was trained on a vast dataset containing 30 million pages in Chinese and English, which supports robustness and accuracy in the languages most used in the world of business and science. This universality allows the technology to be applied to a diverse range of documents, accelerating the analysis of large knowledge repositories, regardless of the original language or format.