
GreenBoost: Linux module transforms RAM into CUDA memory and revolutionizes the use of LLMs with NVIDIA


The landscape of local artificial intelligence development is shifting significantly with the arrival of GreenBoost. This module for the Linux kernel promises to overcome one of the main barriers faced by developers and researchers: the limited video memory (VRAM) on consumer NVIDIA cards. By making system RAM usable by the CUDA architecture, GreenBoost opens the door to running large language models (LLMs) directly on commodity PCs.

The initiative, developed by independent programmer Ferran Duarri, represents a crucial advancement in an environment where high-capacity hardware, such as enterprise-grade GPUs with abundant VRAM, is inaccessible to most. The solution focuses on optimizing the use of existing resources, allowing the computational power of NVIDIA GPUs to be fully exploited even with VRAM constraints, boosting research and development in open source AI.

Running models that require tens of gigabytes of memory, such as “glm-4.7-flash:q8_0” at 31.8 GB, on consumer equipment used to be an almost insurmountable challenge. Traditional approaches often resulted in performance bottlenecks or degraded inference quality, making practical interaction with these models unfeasible for many enthusiasts and small developers.

Overcoming Traditional VRAM Barriers

Historically, strategies for dealing with VRAM shortages in consumer GPUs have been limited. One of the most common solutions was to offload the surplus layers of the neural network to the CPU system memory. However, this approach suffered from a serious performance problem. The lack of CUDA coherence in CPU memory required massive and complex data transfers between the GPU and CPU, creating a bottleneck that could reduce token generation speeds by up to ten times.

Another alternative was to quantize the model far more aggressively. Although this reduced the demand for memory, it came with a significant degradation in the LLM’s inference and logical reasoning capabilities. To maintain quality, the only viable option was to invest in enterprise-grade GPUs with 48 GB or more of VRAM, an expense that exceeds the cost of a full workstation and is out of reach for individual developers and startups with limited budgets.

GreenBoost’s innovative 3-tier architecture

GreenBoost is not merely a driver tweak or stopgap solution; it is a carefully designed Linux kernel module licensed under the GPLv2. It acts independently of, and in parallel with, the official NVIDIA drivers, intervening directly in the CUDA memory allocation layer. This intervention allows the GPU driver to recognize system RAM as “external memory”, creating a memory expansion architecture that operates at three distinct levels to balance performance and capacity.

The first tier, known as T1, is the GPU’s own VRAM. In a test environment using a GeForce RTX 5070, with its 12 GB of capacity and bandwidth of approximately 336 GB/s, this tier is the critical path for computation. It stores the active layers accessed most often during inference, ensuring maximum speed for the most demanding operations.

The second tier, T2, is the system’s DDR4 or DDR5 RAM. Connected to the GPU via a PCIe 4.0 x16 link, it offers a speed of approximately 32 GB/s. This tier serves as a storage area for static model weight data and a substantial key-value (KV) cache, which is critical for LLMs to maintain and reference large contexts.

Finally, the third tier, T3, is NVMe storage, which acts as a safety net. Allocated as swap space with a considerably slower speed of about 1.8 GB/s, it is mapped to absorb any memory overflow. This tier comes into play only in exceptional situations, when both VRAM and system RAM are completely exhausted, offering a safeguard against system failures in extreme usage scenarios.
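To put the three bandwidth figures side by side, the short Python sketch below computes the ideal time to move one gigabyte through each tier. It uses only the approximate speeds quoted above; the function name and the choice of a 1 GB payload are illustrative assumptions, and real transfers will be slower because latency and protocol overhead are ignored.

```python
# Approximate bandwidths quoted in the article, in GB/s.
TIER_BANDWIDTH_GBPS = {
    "T1 (VRAM)": 336.0,
    "T2 (system RAM over PCIe 4.0 x16)": 32.0,
    "T3 (NVMe swap)": 1.8,
}


def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Ideal time to move size_gb at a sustained bandwidth (latency ignored)."""
    return size_gb / bandwidth_gbps


# Print the idealized cost of moving 1 GB through each tier.
for tier, bandwidth in TIER_BANDWIDTH_GBPS.items():
    millis = transfer_seconds(1.0, bandwidth) * 1000
    print(f"{tier}: {millis:.1f} ms per GB")
```

Even in this idealized model, each step down the hierarchy costs roughly an order of magnitude, which is why the placement of data across tiers matters so much.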

The sophistication behind integration

GreenBoost’s technical sophistication lies in the way its kernel and user-space components work together. The kernel module (`greenboost.ko`) uses an optimized memory allocator to reserve a region of large pages in DDR4, reducing paging overhead and fragmentation. These regions are exported as DMA-BUF file descriptors, allowing direct memory access.

The GPU then imports these operating system pages as CUDA external memory via the `cudaImportExternalMemory` API. This causes the CUDA platform to treat the DDR4 physical pages as if they were memory directly attached to the graphics card, hiding the motherboard topology. Data movement is then handled as DMA transfers over the PCI Express 4.0 bus, eliminating unnecessary copy cycles by the CPU.

In user space, the `libgreenboost_cuda.so` library acts as an interceptor. Loaded dynamically via `LD_PRELOAD`, it intercepts API calls such as `cudaMalloc` and `cudaFree`. Small allocation requests are forwarded directly to the original VRAM without added latency. Large requests that exceed VRAM limits, however, are redirected to the GreenBoost kernel module, which allocates the necessary memory from system RAM and returns it to the application as a legitimate CUDA device pointer. For inference engines that use `dlopen` and `dlsym`, GreenBoost has countermeasures: it intercepts the `dlsym` function itself and even changes the reported VRAM capacity to force offloading to RAM.
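The routing decision described above can be illustrated with a toy model in Python. This is not GreenBoost’s code: the class name `TieredAllocator`, the tier labels, the rule "use VRAM while the request still fits", and the 64 GiB of system RAM in the usage example are all assumptions made for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class TieredAllocator:
    """Toy model of the routing decision; hypothetical, not GreenBoost's code."""

    vram_free: int  # bytes still available in T1
    ram_free: int   # bytes still available in T2

    def alloc(self, size: int) -> str:
        # Requests that still fit in VRAM go to the regular device allocator.
        if size <= self.vram_free:
            self.vram_free -= size
            return "T1-VRAM"
        # Otherwise fall back to system RAM exposed as CUDA external memory.
        if size <= self.ram_free:
            self.ram_free -= size
            return "T2-RAM"
        raise MemoryError("request does not fit in VRAM or system RAM")


# Usage: a 12 GiB card (as in the article) with an assumed 64 GiB of RAM.
allocator = TieredAllocator(vram_free=12 * GIB, ram_free=64 * GIB)
first = allocator.alloc(8 * GIB)   # fits in VRAM
second = allocator.alloc(8 * GIB)  # only 4 GiB of VRAM left, spills to RAM
print(first, second)
```

In the real module the caller receives a CUDA device pointer either way; the point of the sketch is only that the application never has to know which tier served the request.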

Synergy with optimizers and practical performance

GreenBoost is designed to work alongside the latest inference approaches, offering a multi-faceted optimization toolset. One example is its integration with `ExLlamaV3`, an inference engine that natively supports the KV cache path provided by GreenBoost. This allows the model’s KV tensors to be mapped directly from `/dev/greenboost` into Python via `mmap` without copying, eliminating I/O overhead and improving performance.
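The zero-copy idea behind `mmap` access can be shown with Python’s standard library alone. The sketch below maps an ordinary temporary file, used here as a stand-in because the interface of `/dev/greenboost` is not described in this article, and writes a float through a typed view. The function name `roundtrip_via_mmap` is hypothetical.

```python
import mmap
import struct
import tempfile


def roundtrip_via_mmap(value: float, num_floats: int = 1024) -> float:
    """Write a float32 through a typed view of a mapping and read it back."""
    nbytes = num_floats * 4  # 4 bytes per float32
    with tempfile.TemporaryFile() as backing:
        backing.truncate(nbytes)  # reserve the backing storage
        with mmap.mmap(backing.fileno(), nbytes) as mapping:
            raw = memoryview(mapping)
            floats = raw.cast("f")  # typed view over the same bytes, no copy
            floats[0] = value       # the write lands in the mapping itself
            (read_back,) = struct.unpack_from("f", mapping, 0)
            # Views must be released before the mapping can be closed.
            floats.release()
            raw.release()
    return read_back


print(roundtrip_via_mmap(1.5))
```

Because the typed view and the mapping share the same pages, nothing is copied between them; an inference engine mapping a device node would rely on the same property.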

For long contexts exceeding 100,000 tokens, the `kvpress` tool can be used alongside it to reduce the load on system RAM bandwidth. More crucially, integration with NVIDIA ModelOpt, NVIDIA’s official optimization tool, allows 31.8GB models to be converted to the efficient FP8 format without retraining, reducing the size to less than 16GB. This combination, which allocates VRAM to model weights and system RAM to the KV cache, has demonstrated average inference speeds of 10 to 25 tokens per second (tok/s) on the GeForce RTX 5070, a significant increase over the reference environment (2 to 5 tok/s).

The PCIe 4.0 bus challenge

Despite its ambitious approach, GreenBoost does not eliminate the fundamental physical limitations of the hardware. Ferran Duarri, the developer, is transparent about the biggest bottleneck: the PCIe 4.0 x16 bus has a maximum transfer bandwidth of approximately 32 GB/s. While the on-board VRAM of modern GPUs offers hundreds of GB/s, or even more than 1 TB/s in high-end models, access to system RAM via PCIe is significantly slower, often less than a tenth of that speed.

If frequently accessed model weight data is transferred between VRAM and system RAM repeatedly, this “thrashing” will cause considerable delay in the pipeline. Likewise, although NVMe drives are efficient for sequential access, performance at the swap tier can degrade dramatically when inference generates millions of small random-access operations. Getting the most out of GreenBoost therefore depends not on the module alone but on partitioning the workload intelligently: using recent parameter quantization formats such as FP8 and INT4-AWQ to keep the weights in VRAM (T1) as small as possible, and moving the KV cache, which grows over time, to DDR4 RAM (T2).
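A back-of-the-envelope model shows why this partitioning matters. The Python sketch below assumes, as a deliberate simplification, that generating one token reads a given number of gigabytes from each tier exactly once and that the reads do not overlap. Only the two bandwidth figures come from the article; the function name and the example splits are hypothetical, and the result is a rough upper bound rather than a prediction.

```python
T1_GBPS = 336.0  # VRAM bandwidth quoted for the GeForce RTX 5070
T2_GBPS = 32.0   # PCIe 4.0 x16 bandwidth quoted for system RAM access


def tokens_per_second_bound(gb_from_t1: float, gb_from_t2: float) -> float:
    """Rough upper bound on tok/s if each token reads these amounts once."""
    seconds_per_token = gb_from_t1 / T1_GBPS + gb_from_t2 / T2_GBPS
    return 1.0 / seconds_per_token


# Hypothetical split A: a 31.8 GB model with 12 GB in VRAM, the rest in RAM.
spilled = tokens_per_second_bound(12.0, 31.8 - 12.0)

# Hypothetical split B: 10 GB of weights in VRAM, 2 GB of KV cache in RAM.
partitioned = tokens_per_second_bound(10.0, 2.0)

print(f"weights spilling over PCIe: {spilled:.1f} tok/s")
print(f"weights in VRAM, KV cache in RAM: {partitioned:.1f} tok/s")
```

Under these assumptions the PCIe term dominates as soon as a large share of the weights lives in system RAM, which matches the article’s advice to keep weights in T1 and send only the KV cache to T2.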

Implications for AI infrastructure

The release of GreenBoost as open source represents a strong response from the developer community against the artificial limitations imposed by the consumer GPU market, where computational power is high but restricted VRAM limits industrial use. It is an attempt to emulate, via software, the unified memory experience seen in the Apple M-series architecture, which enables massive AI inference without the need for expensive HBM modules, by integrating this technology into existing PC platforms.

This implementation method offers individual researchers and small to medium-sized AI development teams a powerful countermeasure against the rising costs of enterprise-grade AI accelerators. It has so far been demonstrated on the GeForce RTX 5070, and with the source code available, a wide range of users with cards based on the Ada Lovelace and Ampere architectures are expected to verify and adapt the solution. At a time when hardware-driven scaling has reached a plateau, Duarri’s approach, which cuts across layers from kernel memory management to the PCI Express interface and the CUDA environment, points to the memory management challenges that future distributed AI infrastructures will need to address. Developers around the world continue to create alternatives to get around this barrier.