Introduction
Running large language models like Llama 8B (8 billion parameters) on consumer hardware is constrained first and foremost by GPU memory. The RTX 4090, despite its 24GB of VRAM, requires careful optimization to run these models effectively while maintaining good performance and quality.
This experiment examines different quantization techniques and memory optimization strategies, providing concrete benchmarks and insights for running 8B models on a single RTX 4090.
The Memory Challenge
Understanding Model Memory Requirements
A typical Llama 8B model in FP16 precision requires approximately:
- Model Weights: 8B × 2 bytes = 16GB
- KV Cache: Variable, depends on context length and batch size
- Activation Memory: ~2-4GB during inference
- Framework Overhead: ~1-2GB
This puts us right at the edge of what a 24GB RTX 4090 can handle, leaving little room for longer contexts or batch processing.
Note: The actual 22.6GB usage measured in my experiments includes KV cache, activations, and framework overhead, which are larger than the estimated 3-5GB due to vLLM’s memory management and pre-allocation strategies.
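As a rough sanity check on the numbers above, here is a small back-of-the-envelope sketch. The architecture constants are assumptions based on Llama 3.1 8B's published config (32 layers, 8 KV heads, head dimension 128), and the KV cache figure is an estimate, not what vLLM actually reserves.

```python
# Back-of-the-envelope memory estimate for Llama 3.1 8B in FP16.
# Assumed architecture constants (from the published model config):
N_PARAMS = 8e9
BYTES_FP16 = 2
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

weights_gb = N_PARAMS * BYTES_FP16 / 1e9
# KV cache: 2 tensors (K and V) per layer, per KV head, per token, in FP16.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16
kv_gb = kv_bytes_per_token * 4096 / 1e9  # 4K-token context, batch size 1

print(f"Weights: ~{weights_gb:.0f}GB, KV cache @ 4K context: ~{kv_gb:.2f}GB")
# Weights: ~16GB, KV cache @ 4K context: ~0.54GB
```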
Experimental Setup
Hardware Configuration
- GPU: NVIDIA RTX 4090 (24GB VRAM, Ada Lovelace architecture)
- CPU: AMD/Intel CPU with sufficient cores
- RAM: 32GB DDR4/DDR5 system memory
- CUDA: 13.0
- Driver: NVIDIA driver (581.15)
Software Stack
- vLLM: v0.10.1.1 (for FP16, AWQ, GPTQ experiments)
- Transformers: v4.56.1 (for BitsAndBytes experiments)
- BitsAndBytes: v0.47.0 (for 4-bit quantization)
- AutoAWQ: v0.2.9 (for AWQ quantization)
- AutoGPTQ: v0.7.1 (for GPTQ quantization)
Benchmarking Methodology
Each experiment follows a standardized protocol:
- Memory Baseline: Record initial GPU memory state
- Model Loading: Load model and measure memory increase
- Inference Test: Run 7 representative prompts (256 tokens each)
- Performance Metrics: Measure tokens/second, memory usage, load time
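The exact benchmark script isn't reproduced here, but the protocol above can be sketched roughly as follows; the prompt text, pynvml-based memory readings, and token counting are illustrative assumptions rather than the actual harness.

```python
import time
import pynvml
from vllm import LLM, SamplingParams

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
used_mb = lambda: pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2

baseline = used_mb()                                        # 1. memory baseline
t0 = time.time()
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")
load_s, model_mb = time.time() - t0, used_mb() - baseline   # 2. model loading

prompts = ["Explain KV caching in one paragraph."] * 7      # 3. inference test
t0 = time.time()
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
elapsed = time.time() - t0

tokens = sum(len(o.outputs[0].token_ids) for o in outputs)  # 4. metrics
print(f"load {load_s:.1f}s | +{model_mb:.0f}MB | {tokens / elapsed:.1f} tok/s")
```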
Experiment 1: FP16 Baseline
Establishing performance and memory baseline with standard FP16 precision.
Configuration
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    dtype="float16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    enforce_eager=True,  # Disable CUDA graphs for more consistent memory measurement
)
```
Results
| Metric | Value |
|---|---|
| GPU Memory Used | 22,631MB (92.1% of VRAM) |
| Load Time | 16.1s |
| Tokens/Second | 339.6 |
| Peak GPU Utilization | 99.2% |
| Model Quality | Baseline |
Analysis
The FP16 baseline establishes the performance reference point. As expected, the full-precision model consumes nearly all available VRAM on the RTX 4090, leaving only ~200MB free. The model loads in a reasonable 16 seconds and delivers solid inference performance at 339.6 tokens/second across my 7-prompt test suite.
This near-maximum VRAM usage (99.2%) demonstrates why quantization is essential for practical deployment - there’s virtually no headroom for longer contexts, batch processing, or other optimizations.
Experiment 2: BitsAndBytes 4-bit Quantization
Testing the most accessible quantization method with up-to-date tooling support.
Why BitsAndBytes?
BitsAndBytes offers several advantages:
- Easy Integration: Works directly with Transformers library
- Quality Preservation: Uses advanced quantization schemes (NF4, double quantization)
- Flexible: Supports mixed precision and selective quantization
Configuration
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,  # Nested quantization for additional memory savings
    bnb_4bit_quant_type="nf4",       # Normal Float 4-bit quantization
)
```
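The loading call isn't shown in the original configuration, but with Transformers it would typically look like the sketch below (illustrative, using the config defined above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # 4-bit NF4 config defined above
    device_map="auto",               # place the quantized weights on the GPU
)
```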
Results
| Metric | Value | vs FP16 |
|---|---|---|
| GPU Memory Used | 7,711MB (31.4% of VRAM) | 66% reduction |
| Load Time | 6.6s | 59% faster |
| Tokens/Second | 42.4 | 87% slower |
| Peak GPU Utilization | 38.5% | 61% lower |
| Memory Savings | 14.9GB | - |
Memory Usage Breakdown
BitsAndBytes delivers exceptional memory optimization results:
- Model footprint: Reduced from 22.6GB to 7.7GB
- Free VRAM: 15.1GB available (vs 0.2GB with FP16)
- Memory efficiency: 66% reduction in GPU memory usage
This dramatic improvement opens up possibilities for:
- Longer context lengths (up to ~32K tokens estimated)
- Batch processing multiple requests
- Running additional models simultaneously
Quality Impact
While BitsAndBytes achieves outstanding memory savings, it comes with a significant performance trade-off. Inference speed drops to 42.4 tokens/second (87% slower than FP16). However, the quality of outputs remains good thanks to:
- NF4 quantization: Optimized for neural network weights
- Double quantization: Further compression without major quality loss
- FP16 compute: Maintains precision during computation
Experiment 3: AWQ (Activation-aware Weight Quantization)
Exploring hardware-optimized quantization designed for inference speed.
AWQ Advantages
AWQ offers unique benefits:
- Hardware Optimized: Designed for efficient GPU inference
- Activation Aware: Considers activation patterns during quantization
- Speed Focused: Minimal performance overhead
- vLLM Integration: Out-of-the-box support in production inference engines
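The launch code for this experiment isn't shown above; a minimal sketch of loading a pre-quantized AWQ checkpoint in vLLM might look like this (the repository name is a placeholder, substitute the checkpoint you actually use):

```python
from vllm import LLM

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # placeholder repo
    quantization="awq",            # select vLLM's AWQ kernels
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)
```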
Results
| Metric | Value | vs FP16 | vs BnB 4-bit |
|---|---|---|---|
| GPU Memory Used | 22,686MB (92.4% of VRAM) | +0.2% | +194% |
| Load Time | 10.4s | 35% faster | -58% |
| Tokens/Second | 579.1 | +70% | +1265% |
| Peak GPU Utilization | 99.5% | +0.3% | +159% |
Performance Analysis
AWQ delivers surprising results that challenge conventional assumptions about quantization:
Unexpected Memory Usage: Despite being 4-bit quantized, AWQ uses nearly identical memory to FP16 (22.7GB vs 22.6GB). This suggests the pre-quantized model includes additional metadata or the quantization benefits are offset by implementation overhead.
Superior Performance: AWQ excels in inference speed, delivering 579.1 tokens/second - a 70% improvement over FP16 and a massive 1265% improvement over BitsAndBytes. This demonstrates the hardware optimization focus of AWQ.
Fast Loading: Model loading is 35% faster than FP16, indicating efficient initialization of the pre-quantized weights.
Experiment 4: GPTQ Quantization
Testing the established post-training quantization method.
GPTQ Characteristics
GPTQ provides:
- Mature Implementation: Well-established quantization method
- Good Compression: Effective weight compression
- Quality Trade-offs: Balanced quality preservation
- Broad Support: Available across multiple frameworks
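As with AWQ, the launch code isn't shown; a minimal vLLM sketch with a pre-quantized GPTQ checkpoint (again, the repository name is a placeholder) would be:

```python
from vllm import LLM

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",  # placeholder repo
    quantization="gptq",           # select vLLM's GPTQ kernels
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)
```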
Results
| Metric | Value | vs FP16 | vs AWQ | vs BnB 4-bit |
|---|---|---|---|---|
| GPU Memory Used | 22,689MB (92.4% of VRAM) | +0.3% | +0.0% | +194% |
| Load Time | 21.0s | -31% | -101% | -219% |
| Tokens/Second | 598.7 | +76% | +3% | +1312% |
| Peak GPU Utilization | 99.5% | +0.3% | +0.0% | +159% |
Key Findings:
GPTQ shows similar patterns to AWQ with some notable differences:
Memory Usage: Like AWQ, GPTQ uses nearly full VRAM (22.7GB) despite 4-bit quantization, suggesting these pre-quantized models don’t deliver the expected memory savings in vLLM.
Peak Performance: GPTQ achieves the highest inference speed at 598.7 tokens/second, slightly outperforming AWQ and delivering 76% better performance than FP16.
Slower Loading: GPTQ takes significantly longer to load (21.0s vs 16.1s for FP16), likely due to model preprocessing or initialization overhead.
Comparative Analysis
Performance Trade-offs
| Method | Memory Usage | Memory Savings | Speed | Load Time | Ease of Use |
|---|---|---|---|---|---|
| FP16 Baseline | 22.6GB (92%) | - | 339 t/s | 16.1s | ⭐⭐⭐⭐⭐ |
| BitsAndBytes 4-bit | 7.7GB (31%) | 66% | 42 t/s | 6.6s | ⭐⭐⭐⭐⭐ |
| AWQ | 22.7GB (92%) | 0% | 579 t/s | 10.4s | ⭐⭐⭐⭐ |
| GPTQ | 22.7GB (92%) | 0% | 599 t/s | 21.0s | ⭐⭐⭐ |
Real-World Scenarios
Scenario 1: Interactive Chat Application
Requirements: Low latency, moderate context length, good quality
Recommended: GPTQ - Delivers the highest inference speed (599 t/s) for responsive user interactions. While it uses full VRAM, the performance benefit justifies the memory cost for single-user scenarios.
Scenario 2: Batch Processing
Requirements: High throughput, cost efficiency, acceptable quality
Recommended: AWQ - Offers excellent throughput (579 t/s) with faster loading than GPTQ. The slight speed difference is offset by better operational characteristics for batch workloads.
Scenario 3: Long Document Analysis
Requirements: Extended context, memory efficiency, quality preservation
Recommended: BitsAndBytes 4-bit - Despite slower inference, the 66% memory reduction enables processing much longer documents (32K+ tokens) that wouldn’t fit with other methods. Quality remains excellent for analytical tasks.
Advanced Optimization Techniques
vLLM Configuration Tuning
Beyond quantization, several vLLM parameters significantly impact memory usage:
```yaml
# Memory-optimized configuration
memory_optimized:
  gpu_memory_utilization: 0.95
  max_model_len: 2048
  swap_space: 8
  cpu_offload_gb: 2
  block_size: 8
  max_num_seqs: 64
```
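These keys correspond to arguments of vLLM's LLM constructor; a sketch of the same memory-optimized setup expressed directly in Python (values as above, not tuned recommendations):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.95,  # let vLLM claim most of the 24GB
    max_model_len=2048,           # shorter context -> smaller KV cache
    swap_space=8,                 # GB of CPU swap space for preempted sequences
    cpu_offload_gb=2,             # offload part of the weights to system RAM
    block_size=8,                 # smaller KV cache blocks
    max_num_seqs=64,              # cap the number of concurrent sequences
)
```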
System-Level Optimizations
- CUDA Memory Management (see the sketch after this list)
  - Pre-allocate GPU memory
  - Disable memory fragmentation
  - Optimize CUDA context switching
- Operating System Tuning
  - Increase virtual memory
  - Optimize page file settings
  - Configure GPU scheduling mode
- Hardware Considerations
  - PCIe bandwidth optimization
  - CPU-GPU data transfer minimization
  - Thermal management
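As a concrete example of the CUDA memory management item, PyTorch's caching allocator can be tuned through an environment variable; this is a general PyTorch knob rather than something specific to this benchmark, and the values below are starting points only.

```python
import os

# Must be set before torch / vLLM initialize CUDA in this process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Alternative: limit block splitting to reduce fragmentation.
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```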
My learnings
Key Findings
- Memory vs Quality Trade-offs: BitsAndBytes is the only method that delivers significant memory savings (66% reduction), making it essential for memory-constrained scenarios despite the performance penalty.
- Performance vs Memory Trade-off: AWQ and GPTQ deliver 70-76% better performance than FP16 but use identical memory (22.7GB vs 22.6GB). This counterintuitive result suggests these “4-bit” models are optimized for speed in vLLM’s implementation, trading memory savings for performance gains.
- Practical Considerations: The choice between methods depends heavily on your bottleneck - memory constraints favor BitsAndBytes, while performance requirements favor AWQ/GPTQ.
Surprising Results
The Quantization Paradox: AWQ and GPTQ 4-bit models used identical memory to FP16 (22.7GB vs 22.6GB) while delivering superior performance. This challenges the assumption that quantization always reduces memory usage and suggests:
- Pre-quantized models may include additional metadata
- vLLM’s implementation optimizes for speed over memory savings
- The “4-bit” designation refers to weight storage, not runtime memory
UPD: I received a couple of comments after publishing this post:
- Measuring the memory used by the vLLM engine doesn’t reflect the real “model memory”. The engine will claim the amount of memory defined by the memory-utilization percentage either way: it always loads the model and uses the rest for KV cache pre-allocation (that’s why every run landed around 90% of VRAM), whether the model is big or small, quantized or not. So it only looks like FP16 uses as much as AWQ/GPTQ; in reality vLLM simply pre-allocated more KV cache in the quantized scenarios.
- Also, the speedup observed for AWQ/GPTQ vs BF16 is mostly due to their use of fused kernels, while the FP16 result was obtained with enforce_eager=True, which specifically disabled torch.compile, used in vLLM to reduce the overhead of the entire model. enforce_eager doesn’t only disable CUDA graphs, it disables compilation altogether.
That’s definitely something I missed and will dig deeper into!
Possible recommendations
For Different Use Cases
Development and Experimentation:
- Start with BitsAndBytes 4-bit for maximum memory headroom and experimentation flexibility
- Use conservative memory settings (gpu_memory_utilization=0.7)
- Monitor GPU temperature and usage with the provided monitoring tools
Production Deployment:
- GPTQ for speed-critical applications - 599 tokens/sec with proven stability
- AWQ for throughput-focused workloads - 579 tokens/sec with faster loading
- BitsAndBytes for memory-critical deployments - only option for extended context lengths
- Implement proper monitoring, error handling, and graceful degradation
Resource-Constrained Scenarios:
- BitsAndBytes is your only viable option for significant memory reduction
- Consider CPU offloading configurations for extreme cases
- Optimize context length based on actual requirements rather than maximums
Conclusion
Running Llama 8B models on a single RTX 4090 reveals surprising insights about modern quantization techniques. My benchmarking shows that the choice of method depends critically on your primary constraint:
For Memory-Constrained Scenarios: BitsAndBytes 4-bit is the clear winner, delivering 66% memory reduction (22.6GB → 7.7GB) with acceptable quality preservation, despite 87% slower inference.
For Performance-Critical Applications: GPTQ achieves the highest throughput at 599 tokens/second, while AWQ offers similar performance (579 t/s) with better loading characteristics.
The Quantization Paradox: My most surprising finding is that pre-quantized AWQ and GPTQ models use identical memory to FP16 while delivering 70-76% better performance. This challenges conventional wisdom about quantization being primarily a memory optimization technique.
Key Takeaway: Modern quantization is more nuanced than expected. BitsAndBytes remains essential for memory optimization, while AWQ/GPTQ excel as performance optimizations rather than memory savers. Understanding these trade-offs enables informed decisions for your specific deployment requirements.
Appendix
Complete Benchmark Results
| Metric | FP16 Baseline | BitsAndBytes 4-bit | AWQ | GPTQ |
|---|---|---|---|---|
| Memory Performance | | | | |
| GPU Memory Used | 22,631MB | 7,711MB | 22,686MB | 22,689MB |
| GPU Utilization | 99.2% | 38.5% | 99.5% | 99.5% |
| Memory Savings vs FP16 | - | 66.0% | 0.0% | 0.0% |
| Free VRAM | 196MB | 15,110MB | 124MB | 121MB |
| Performance Metrics | | | | |
| Model Load Time | 16.1s | 6.6s | 10.4s | 21.0s |
| Inference Speed | 339.6 t/s | 42.4 t/s | 579.1 t/s | 598.7 t/s |
| Speed vs FP16 | - | -87.5% | +70.5% | +76.3% |
| Total Tokens Generated | 1,792 | 1,799 | 1,792 | 1,792 |
| System Resources | | | | |
| System RAM Increase | 1,024MB | 163MB | 1,062MB | 1,385MB |
| System RAM % | 9.6% | 9.0% | 9.7% | 10.6% |
| Configuration | | | | |
| Model Source | HuggingFace | HuggingFace | Pre-quantized | Pre-quantized |
| Quantization Bits | 16 | 4 (NF4) | 4 | 4 |
| Framework | vLLM | Transformers+BnB | vLLM | vLLM |
Reproducibility
All experiments in this analysis are fully reproducible using the provided code.
This analysis was conducted using automated benchmarking tools. All performance measurements are specific to the tested hardware configuration and may vary on different systems.