Our Large Language Model as a Service (LLMaaS) offering gives you access to cutting-edge language models. Inference runs on SecNumCloud-qualified infrastructure that is HDS-certified for healthcare data hosting, so the service is sovereign and all computation takes place in France. Benefit from high performance and optimal security for your AI applications. Your data remains strictly confidential and is neither used nor stored after processing.

Simple, transparent pricing
1.8 € per million input tokens
8 € per million output tokens
8 € per million reasoning tokens
4 € per million reranking tokens
0.01 € per minute of transcribed audio *
Computed on infrastructure based in France, SecNumCloud-qualified and HDS-certified.
Note on the "Reasoning" price: this price applies only to models classified as "reasoners" or "hybrids" (models with the "Reasoning" capability), and only to the tokens generated while reasoning is active.
* Any started minute is billed in full.
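
As an illustration of how these rates combine, here is a minimal sketch of a cost estimator. The function name and its parameters are hypothetical, and it assumes that the input, output, reasoning and reranking token counts are reported separately and do not overlap; audio is rounded up to whole minutes, as the footnote describes.

```python
import math

# Prices in euros, copied from the price list above.
PRICE_PER_MILLION_TOKENS = {
    "input": 1.8,
    "output": 8.0,
    "reasoning": 8.0,
    "reranking": 4.0,
}
PRICE_PER_AUDIO_MINUTE = 0.01


def estimate_cost(input_tokens=0, output_tokens=0, reasoning_tokens=0,
                  reranking_tokens=0, audio_seconds=0.0):
    """Estimate the cost of a workload in euros (hypothetical helper)."""
    counts = {
        "input": input_tokens,
        "output": output_tokens,
        "reasoning": reasoning_tokens,
        "reranking": reranking_tokens,
    }
    token_cost = sum(
        counts[kind] / 1_000_000 * PRICE_PER_MILLION_TOKENS[kind]
        for kind in counts
    )
    # Any started minute is billed, so round the duration up.
    billed_minutes = math.ceil(audio_seconds / 60)
    return token_cost + billed_minutes * PRICE_PER_AUDIO_MINUTE


# 2M input tokens, 0.5M output tokens and 61 seconds of audio (2 billed minutes).
print(f"{estimate_cost(input_tokens=2_000_000, output_tokens=500_000, audio_seconds=61):.2f} EUR")
# prints "7.62 EUR"
```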

Chat & Reasoning

Our large models offer state-of-the-art performance for the most demanding tasks. They are particularly well-suited to applications requiring a deep understanding of language, complex reasoning or the processing of long documents.

80 tokens/second

qwen3.6:27b

Generalist reference model with a native context of 1M tokens. Excels at reasoning, instruction following and multilingual tasks.
Significant improvements in instruction following, reasoning, reading comprehension, mathematics, coding and tool use. Its context of 1M tokens enables the analysis of entire documents without truncation.
94 tokens/second

gpt-oss:120b

OpenAI's state-of-the-art open-weight model with configurable reasoning and transparent chain of thought.
Mixture-of-Experts model with 120 billion parameters offering configurable reasoning and full access to the chain of thought. Ideal for scenarios requiring a permissive licence (Apache 2.0).
41 tokens/second

gpt-oss:20b

Compact version of the OpenAI model, optimised for rapid inference with good reasoning capabilities.
Mixture-of-Experts model with 21 billion parameters and 3.6 billion active parameters. Configurable reasoning and full agent capabilities.
10 tokens/second

llama3.3:70b

Multilingual Meta model, excellent at natural dialogue and nuanced understanding in 8 languages.
Supports English, French, German, Spanish, Italian, Portuguese, Hindi and Thai. Its 132K-token context window enables analysis of complex documents and long conversations.
23 tokens/second

gemma3:27b

Google multimodal model with integrated vision and support for 140+ languages. Context of 120K tokens.
Includes native multimodal capabilities (text + image) and excels in over 140 languages. Ideal for analysing large documents and for document retrieval.
72 tokens/second

nemotron-3-super:120b

NVIDIA model optimised for collaborative agents, long reasoning and high-volume workloads. Context of 1M tokens.
Ideal for agentic workflows, long-context reasoning, high-volume automation (support tickets, mass analyses), tool use and RAG.
160 tokens/second

nemotron3-nano:30b

Ultra-fast NVIDIA model (160 t/s) with reasoning and function calling. Context of 1M tokens.
Excels at function calling, structured reasoning and analysing long contexts. Rare combination of high speed and very long context.
130 tokens/second

nemotron-cascade:30b

NVIDIA model specialising in mathematics (IMO 2025 gold medal) and problem decomposition. Context of 1M tokens.
Excels at structured reasoning, solving complex mathematical problems and analysing long contexts.
88 tokens/second

glm-4.7-flash:30b

Fast model with an excellent performance/latency balance for reasoning and analysis.
Offers fast inference (88 t/s) with a context of 120k tokens. Particularly suited to conversational assistants requiring low latency.
21 tokens/second

cogito:32b

Advanced analytical reasoning model, designed for complex problem decomposition and logic verification.
Excels in multi-factor analysis, formal proof and hallucination minimisation thanks to built-in logic verification mechanisms.
22 tokens/second

olmo-3:32b

The first fully open reasoning model at this scale. Total transparency (data, code, weights).
Competes with the best proprietary models on complex benchmarks (MATH, HumanEval+). Able to expose its thought process. Preferred choice for transparency and auditability.
35 tokens/second

olmo-3:7b

Fully open and efficient model, excellent in mathematics and programming with total transparency.
Optimised for efficiency (2.5x fewer resources than Llama 3.1 8B). Ideal for tasks requiring complete reproducibility and auditability.
56 tokens/second

qwen3-2507:235b

The most powerful model in the catalogue (235B parameters, 22B active). Excels in maths, coding and logical reasoning.
Ultra-sparse Mixture-of-Experts architecture combining the power of a very large model with the efficiency of a smaller model.
28 tokens/second

mistral-small3.2:24b

Mistral model with improved instruction following, robust function calling and vision capabilities. Built-in detection of problematic content.
Excellent instruction following, fewer repetitions, reliable function calling. Supports vision (image analysis) and native security filters.
100 tokens/second

mistral-small4:119b

High-performance Mistral model (119B) with vision, integrated security and context of 262K tokens. Fast (100 t/s).
Large version of the Mistral Small family. Combines power, speed and reliability with an extended context. Native security filters.
28 tokens/second

ministral-3:14b

The most powerful of the Ministral family, with advanced reasoning and coding. Context of 250K tokens.
Excels at complex reasoning and coding while remaining efficient.
40 tokens/second

ministral-3:8b

Intermediate Ministral model with an excellent performance/speed ratio. Context of 250K tokens.
Capable of complex reasoning while remaining fast. Ideal for assistants requiring responsiveness and quality.
22 tokens/second

ministral-3:3b

Compact Mistral model, high performance despite its small size. Context of 250K tokens.
Surprising performance for conversational tasks and simple reasoning despite only 3B parameters.
32 tokens/second

qwen3.5:9b

Qwen3.5 intermediate model with solid reasoning and context extended to 250K tokens.
Good balance between generation quality and inference speed.
37 tokens/second

qwen3.5:4b

Compact Qwen3.5 model with a good performance/efficiency ratio and a 250K token context.
Good candidate for assistants and light reasoning tasks.
16 tokens/second

qwen3.5:0.8b

Ultra-light model with an exceptional context of 250K tokens, remarkable for a model of this size.
Ideal for fast conversational tasks requiring a very long history or analysis of large documents with a small footprint.
46 tokens/second

qwen3:0.6b

Ultra-fast micro-model for simple tasks and routing. Context of 40K tokens.
Ideal as the first level of processing in complex workflows or for rapid classification tasks.
55 tokens/second

qwen3-2507-think:4b

Compact model optimised for deep reasoning (logic, maths, science, code). Context of 250K tokens.
"Thinking" version with enhanced reasoning capability. Combines compactness, speed and advanced reasoning.
19 tokens/second

qwen3-omni:30b

Native omnimodal model that understands text, image, video and audio simultaneously.
Supports multimodal input (text, image, audio, video) with advanced reasoning capabilities. Note: audio output via the API is not yet enabled.

Programming & Agents

Our programming and agent models are specially optimised for agentic software engineering, large-scale code generation and development workflow automation.

121 tokens/second

qwen3.6:35b

Leader in agentic software engineering (SWE-bench 73.4%). Context of 1M tokens, integrated vision and tool calling.
Can take in entire code repositories thanks to its 1M-token context. Supports multi-step reasoning and vision (screenshots, diagrams). Optimised for IDEs and CI/CD pipelines.
97 tokens/second

qwen-coder-next:80b

State-of-the-art model for complex code and reasoning. Context of 250K tokens.
Excels at large-scale code generation and analysis. Designed for advanced software engineering tasks.
67 tokens/second

qwen3-next:80b

Versatile 80B model optimised for large contexts, function calling and structured reasoning.
Context of 250K tokens with support for function calling and guided decoding.
33 tokens/second

devstral-small-2:24b

State-of-the-art agentic model for software engineering. Code performance close to that of models with more than 100B parameters. Integrated vision.
Optimised for codebase exploration, multi-file editing and tool use. Native vision support. Context of 200K tokens.
23 tokens/second

rnj-1:8b

Specialised STEM model - excels in code (83.5% HumanEval+), maths and science.
Dense model trained on 8.4T tokens. Often outperforms much larger models on code and mathematical reasoning tasks.
40 tokens/second

functiongemma:270m

Micro-model specialising in function call detection. Ideal as a router in an agentic architecture.
Ultra-compact, optimised for identifying and formatting function calls quickly.

Vision & Multimodal

Our Vision & Multimodal models can analyse images, videos and visual documents. They excel in OCR, object detection, structured extraction and spatio-temporal reasoning.

24 tokens/second

qwen3-vl:235b

The most powerful multimodal model in the catalogue. Advanced visual understanding and exceptional reasoning.
Excels in complex document analysis, multilingual OCR, 3D spatial reasoning and video understanding.
17 tokens/second

qwen3-vl:32b

High-performance variant for the most demanding vision tasks. Context of 250K tokens.
Fine analysis of high-resolution images, understanding of dynamic scenes and text-timestamp alignment for video.
39 tokens/second

qwen3-vl:30b

High-performance multimodal model for OCR, object detection, video analysis and spatio-temporal reasoning.
Incorporates innovations in image and video analysis. Excels in complex OCR, charts and structured extraction (JSON).
39 tokens/second

qwen3-vl:8b

Intermediate vision model offering a good compromise between performance and footprint. Context of 250K tokens.
Capable of analysing complex documents, charts and videos with a high degree of accuracy.
57 tokens/second

qwen3-vl:4b

Compact, fast vision model for document analysis and video comprehension.
Excellent compromise between performance and footprint. Supports structured extraction and visual reasoning.
64 tokens/second

qwen3-vl:2b

Ultra-compact vision model for rapid OCR, object detection and embedded applications.
Despite its small size, it offers amazing image and video analysis. Ideal for mobile or embedded applications.
59 tokens/second

gemma4:31b

Google's dense multimodal model, ranked 3rd in the world on Arena AI. Advanced vision, reasoning and coding. Context of 250K tokens.
Google's most powerful open-source model. Native function calling, advanced visual understanding (OCR, charts, documents, UI). Multilingual (35+ languages).
125 tokens/second

gemma4:e2b

Ultra-fast variant (125 t/s) of Gemma 4 with vision. Excellent energy efficiency.
Offers an exceptional performance-to-footprint ratio. Context of 128K tokens with full vision capabilities.
85 tokens/second

gemma4:e4b

Gemma 4 variant with a better quality/speed compromise than the E2B version. Integrated vision.
Better fidelity than the E2B version, while remaining fast. Context of 128K tokens.
49 tokens/second

granite3.2-vision:2b

IBM Granite compact vision model for rapid OCR and data extraction from scanned documents.
Lightweight yet powerful for low-latency OCR and image analysis.
84 tokens/second

deepseek-ocr

Specialised OCR model for high-precision text extraction with preserved formatting (tables, formulas).
Optimised for converting documents into structured Markdown. Excels with complex tables and formulas.

Embedding

Our embedding models transform text into vector representations for semantic search, clustering and RAG (Retrieval-Augmented Generation) pipelines.
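
To make the semantic search idea concrete, here is a minimal sketch that ranks documents by cosine similarity to a query vector. The three-dimensional vectors are invented toy values; real embeddings would be returned by one of the models below and would have far more dimensions. The helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_by_similarity(query, documents)
print([index for index, _ in ranking])  # prints [1, 2, 0]
```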

171 tokens/second

bge-m3:567m

State-of-the-art multilingual embedding (100+ languages). Supports dense, sparse and multi-vector searches.
Context of 8192 tokens with three complementary search methods.

qwen3-embedding:4b

High-performance embedding with deep semantic understanding and extended context (40K tokens).
Ideal for processing large documents in RAG pipelines.

qwen3-embedding:8b

High-capacity embedding with the best semantic understanding of the Qwen3 family. Extended context (40K tokens).
The most powerful version of the Qwen3 embedding family. Ideal for tasks requiring contextual understanding.

qwen3-embedding:0.6b

Ultra-light and fast embedding for low-latency semantic search.
Excellent compromise between semantic performance and speed of execution.
196.3 tokens/second

granite-embedding:278m

Ultra-compact IBM embedding for semantic search with minimal latency.
The fastest embedding model in the catalogue. Ideal for clustering and high-frequency searching.
175 tokens/second

embeddinggemma:300m

Multilingual Google embedding (100+ languages), optimised for search and semantic retrieval.
Produces vector representations of text for classification, clustering and similarity search.

Reranking

Our reranking models reorder search results by relevance to refine the quality of RAG pipelines. Compatible with the Cohere API.

nvidia/llama-nemotron-rerank-vl-1b-v2

Cohere API-compatible reranking model (/v1/rerank and /v2/rerank). Orders documents by relevance to a query.
Cohere v1/v2 SDK compatible. The relevance score is a raw logit (relative order is guaranteed). Ideal as a complement to the RAG stack (embedding + retrieval + rerank).
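
A minimal sketch of what this implies for client code, assuming the usual Cohere rerank request and response shapes (`query`, `documents`, `top_n`, and `results` entries carrying `index` and `relevance_score`). The helper names are hypothetical, and the sample response is invented for illustration rather than produced by the service. Because the score is a raw logit, it can be negative and should only be used to compare documents within one response, not read as a probability.

```python
def build_rerank_request(query, documents,
                         model="nvidia/llama-nemotron-rerank-vl-1b-v2",
                         top_n=None):
    """Build a Cohere-style JSON body for a rerank call (hypothetical helper)."""
    payload = {"model": model, "query": query, "documents": list(documents)}
    if top_n is not None:
        payload["top_n"] = top_n
    return payload


def order_documents(documents, response):
    """Return the documents sorted by descending relevance score."""
    results = sorted(
        response["results"],
        key=lambda result: result["relevance_score"],
        reverse=True,
    )
    return [documents[result["index"]] for result in results]


docs = [
    "Paris is the capital of France.",
    "Bananas are yellow.",
    "France is a country in Europe.",
]
request_body = build_rerank_request("What is the capital of France?", docs, top_n=3)

# Invented response with raw-logit scores; only their relative order matters.
sample_response = {
    "results": [
        {"index": 0, "relevance_score": 4.2},
        {"index": 1, "relevance_score": -3.1},
        {"index": 2, "relevance_score": 0.7},
    ]
}
for doc in order_documents(docs, sample_response):
    print(doc)
```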

qwen3-reranker:4b

Powerful reranking model with a high level of contextual understanding.
Excellent reordering quality thanks to its 4B parameters. Ideal for demanding RAG pipelines.

qwen3-reranker:0.6b

Compact and efficient reranking model for rapid reordering.
Lightweight version for use cases requiring low reranking latency.

bge-reranker-large

High-performance multilingual reranking model from the BGE family.
Complementary to the BGE-M3 embedding model for complete RAG pipelines.

Security

Our security models specialise in detecting problematic content, preventing jailbreaks and supporting regulatory compliance (GDPR, HDS). They can be used as pre-filters or post-filters in your workflows.
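
The pre-filter and post-filter pattern can be sketched as follows. Both `generate` and `is_flagged` are hypothetical stand-ins, for a call to a chat model and to a security model respectively; the way a security model reports its verdict is not described here, so the sketch simply treats it as a boolean.

```python
REFUSAL = "Request blocked by the content filter."


def guarded_completion(prompt, generate, is_flagged, refusal=REFUSAL):
    """Run a generation step between a pre-filter and a post-filter."""
    # Pre-filter: check the user's prompt before it reaches the main model.
    if is_flagged(prompt):
        return refusal
    answer = generate(prompt)
    # Post-filter: check the model's answer before returning it.
    if is_flagged(answer):
        return refusal
    return answer


# Toy stand-ins so the sketch runs without any network call.
def toy_is_flagged(text):
    return "forbidden" in text.lower()


def toy_generate(prompt):
    return "echo: " + prompt


print(guarded_completion("hello", toy_generate, toy_is_flagged))
print(guarded_completion("a forbidden request", toy_generate, toy_is_flagged))
```

Running the same check on both sides costs two security-model calls per request, which is where the smaller, lower-latency variant can help.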

45 tokens/second

granite3-guardian:8b

Security model specialising in detecting problematic content and jailbreak attempts, and in supporting regulatory compliance.
Designed to filter sensitive content and support GDPR/HDS compliance. Can be used as a pre-filter or post-filter in your workflows.
60 tokens/second

granite3-guardian:2b

Compact version of the Granite Guardian security model for low-latency filtering.
Same filtering capabilities as the 8B version, but with a smaller footprint. Ideal for high-frequency workflows.

Translation

Our translation models offer high fidelity in 55 languages, respecting the grammar, cultural nuances and technical specificities of the documents.

17 tokens/second

translategemma:27b

High-performance translation for 55 languages. Superior quality for complex and technical content.
Captures literary and cultural nuances with exceptional fidelity.
27 tokens/second

translategemma:12b

High-fidelity translation for 55 languages with a context of 128K tokens.
Respects grammar and cultural nuances. Ideal for long documents.
31 tokens/second

translategemma:4b

Fast, efficient translation for 55 languages. Ideal for real-time localisation.
Compact version with an excellent speed/quality ratio. Context of 128K tokens.

Audio & Image

Our Audio & Image models enable real-time voice transcription (ASR streaming) and image generation from text descriptions, compatible with the OpenAI API.

voxtral

Real-time audio transcription via WebSocket. Low-latency streaming speech recognition.
Operates in Realtime mode via the /v1/realtime endpoint (WebSocket). Transcribes streaming audio.

z-image:16b

Image generation from text prompts, compatible with the OpenAI API (/v1/images/generations).
Supports setting the image size and the number of images. Compatible with the OpenAI ecosystem.
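
A minimal sketch of a request body for this endpoint, assuming the field names of the OpenAI Images API (`model`, `prompt`, `n`, `size`). The helper name is hypothetical, and the `1024x1024` default is an assumption, since the sizes accepted by z-image are not listed here.

```python
import json


def build_image_request(prompt, n=1, size="1024x1024", model="z-image:16b"):
    """Build an OpenAI-style JSON body for /v1/images/generations."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return {"model": model, "prompt": prompt, "n": n, "size": size}


body = build_image_request("A lighthouse at dawn, watercolour style", n=2)
print(json.dumps(body, indent=2))
```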

Model comparison

This comparison table will help you choose the model best suited to your needs, based on various criteria such as context size, performance and specific use cases.

Table comparing the characteristics and performance of the different AI models available, grouped by category.
Model Publisher Parameters Context (tokens) Vision Agent Reasoning Security Fast * Energy efficiency *
Chat & Reasoning
qwen3.6:27b Qwen Team 27B 1000000
gpt-oss:120b OpenAI 120B 120000
gpt-oss:20b OpenAI 20B 120000
llama3.3:70b Meta 70B 132000
gemma3:27b Google 27B 120000
nemotron-3-super:120b NVIDIA 120B 1000000
nemotron3-nano:30b NVIDIA 30B 1000000
nemotron-cascade:30b NVIDIA 30B 1000000
glm-4.7-flash:30b Zhipu AI 30B 120000
cogito:32b Deep Cogito 32B 32000
olmo-3:32b AllenAI 32B 65536
olmo-3:7b AllenAI 7B 65536
qwen3-2507:235b Qwen Team 235B 200000
mistral-small3.2:24b Mistral AI 24B 128000
mistral-small4:119b Mistral AI 119B 262144
ministral-3:14b Mistral AI 14B 250000
ministral-3:8b Mistral AI 8B 250000
ministral-3:3b Mistral AI 3B 250000
qwen3.5:9b Qwen Team 9B 250000
qwen3.5:4b Qwen Team 4B 250000
qwen3.5:0.8b Qwen Team 0.8B 250000
qwen3:0.6b Qwen Team 0.6B 40000
qwen3-2507-think:4b Qwen Team 4B 250000
qwen3-omni:30b Qwen Team 30B 32768
Programming & Agents
qwen3.6:35b Qwen Team 35B 1000000
qwen-coder-next:80b Qwen Team 80B 250000
qwen3-next:80b Qwen Team 80B 250000
devstral-small-2:24b Mistral AI & All Hands AI 24B 200000
rnj-1:8b Essential AI 8B 32000
functiongemma:270m Google 270M 32768
Vision & Multimodal
qwen3-vl:235b Qwen Team 235B 200000
qwen3-vl:32b Qwen Team 32B 250000
qwen3-vl:30b Qwen Team 30B 250000
qwen3-vl:8b Qwen Team 8B 250000
qwen3-vl:4b Qwen Team 4B 250000
qwen3-vl:2b Qwen Team 2B 250000
gemma4:31b Google 31B 250000
gemma4:e2b Google 31B (E2B) 128000
gemma4:e4b Google 31B (E4B) 128000
granite3.2-vision:2b IBM 2B 16384
deepseek-ocr DeepSeek AI 3B 8192
Embedding
bge-m3:567m BAAI 567M 8192
qwen3-embedding:4b Qwen Team 4B 40000
qwen3-embedding:8b Qwen Team 8B 40000
qwen3-embedding:0.6b Qwen Team 0.6B 32768
granite-embedding:278m IBM 278M 512
embeddinggemma:300m Google 300M 2048
Reranking
nvidia/llama-nemotron-rerank-vl-1b-v2 NVIDIA 1B 4096 N.C.
qwen3-reranker:4b Qwen Team 4B 4096 N.C.
qwen3-reranker:0.6b Qwen Team 0.6B 4096 N.C.
bge-reranker-large BAAI 335M 512 N.C.
Security
granite3-guardian:8b IBM 8B 8192
granite3-guardian:2b IBM 2B 8192
Translation
translategemma:27b Google 27B 120000
translategemma:12b Google 12B 128000
translategemma:4b Google 4B 128000
Audio & Image
voxtral Mistral AI 4B 32768 N.C.
z-image:16b Community 16B N.C.
Legend and explanation
Functionality or capacity supported by the model
Functionality or capability not supported by the model
* Energy efficiency: indicates particularly low energy consumption (< 2.0 kWh/Mtoken)
* Fast: model capable of generating more than 50 tokens per second
Note on performance measures
The speed values (tokens/s) represent performance targets in real-life conditions. Energy consumption (kWh/Mtoken) is calculated by dividing the estimated power of the inference server (in watts) by the measured speed of the model (in tokens/second), which gives joules per token, then converting to kilowatt-hours per million tokens (dividing by 3.6). This method offers a practical way to compare the energy efficiency of different models and should be read as a relative indicator rather than an absolute measure of power consumption.
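
The calculation described in this note can be sketched as below. Watts divided by tokens per second gives joules per token; a million tokens multiplies that by 10^6, and one kilowatt-hour is 3.6 x 10^6 joules, which leaves a division by 3.6. The 400 W server power in the example is a made-up figure, not a value from the catalogue, and the function names are hypothetical.

```python
def energy_kwh_per_mtoken(server_power_watts, tokens_per_second):
    """Energy per million tokens, following the method in the note above."""
    joules_per_token = server_power_watts / tokens_per_second
    # 1e6 tokens per Mtoken divided by 3.6e6 joules per kWh leaves 1 / 3.6.
    return joules_per_token / 3.6


def is_fast(tokens_per_second):
    """'Fast' label: more than 50 tokens per second."""
    return tokens_per_second > 50


def is_energy_efficient(kwh_per_mtoken):
    """'Energy efficiency' label: under 2.0 kWh per million tokens."""
    return kwh_per_mtoken < 2.0


# Hypothetical 400 W inference server running a model at 80 tokens/second.
energy = energy_kwh_per_mtoken(400, 80)
print(f"{energy:.2f} kWh/Mtoken")  # prints "1.39 kWh/Mtoken"
print(is_fast(80), is_energy_efficient(energy))  # prints "True True"
```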

Recommended use cases

Here are some common use cases and the most suitable models for each. These recommendations are based on the specific performance and capabilities of each model.

Multilingual dialogue

Chatbots and assistants able to communicate in several languages with automatic detection and context maintenance
Recommended models
  • nemotron-3-super:120b
  • qwen3.6:27b
  • nemotron3-nano:30b
  • gpt-oss:120b

Analysis of long documents

Processing of large documents (>100 pages) with extraction of key information, summaries and answers to questions
Recommended models
  • nemotron-3-super:120b
  • qwen3.6:27b
  • qwen3-2507:235b

Programming and development

Code generation, optimisation and debugging in multiple languages, refactoring and test creation
Recommended models
  • qwen3.6:35b
  • qwen-coder-next:80b
  • devstral-small-2:24b
  • nemotron-3-super:120b

Visual analysis

Image and visual document processing, OCR, interpretation of graphs and tables
Recommended models
  • qwen3-vl:235b
  • gemma4:31b
  • deepseek-ocr
  • qwen3-vl:30b

Safety and compliance

Sensitive content filtering, jailbreak detection, GDPR/HDS compliance
Recommended models
  • granite3-guardian:8b
  • granite3-guardian:2b
  • mistral-small4:119b

Light deployments

Applications requiring a minimal footprint, low latency and low power consumption
Recommended models
  • qwen3.5:0.8b
  • qwen3-vl:2b
  • ministral-3:3b

RAG (Retrieval-Augmented Generation)

Complete pipelines for semantic search, reranking and retrieval-augmented generation
Recommended models
  • bge-m3:567m
  • nvidia/llama-nemotron-rerank-vl-1b-v2
  • qwen3.6:27b

 
