Enterprise AI Model Development to Deployment: NVIDIA NeMo Platform Guide #1

NeMo Framework and NeMo Microservices: A Comprehensive Overview

Introduction

The rapid advancement of AI technology has made developing and operating large language models (LLMs) and multimodal AI models a core priority for enterprises and research institutions. However, deploying these models reliably in production environments requires robust, enterprise-grade platforms.

NVIDIA has addressed this need with NeMo, a comprehensive AI platform that serves users across the entire spectrum—from researchers to enterprise development teams. NeMo consists of two core components:

  • NeMo Framework: An open-source development environment for model research and experimentation
  • NeMo Microservices: An enterprise platform for production deployment and operations

This series will cover everything from an introduction to the NeMo platform to practical implementation examples on the PAASUP DIP platform.

NeMo Framework vs NeMo Microservices: Key Differences

Here's a side-by-side comparison of the main features of both platforms:

Category NeMo Framework NeMo Microservices
Purpose Model research, development & custom architecture implementation Enterprise-grade model customization & operations
Primary Users Researchers, ML Engineers Application developers, Enterprise deployment teams
Development Approach Python-based library REST/gRPC-based microservices
Runtime Environment Jupyter, Python scripts Docker, Kubernetes
Use Case Examples Training script execution, experimentation Chatbot /chat, Document summarization /summarize APIs
Release Status Open source (GitHub) Generally Available (April 2025)

NeMo Framework: Custom AI Model Development for Researchers and Developers

Platform Overview

NeMo Framework is a PyTorch-based open-source development platform designed for custom AI model development across various domains. Available freely on GitHub, it supports everything from research projects to initial prototyping.

Image source: NVIDIA NeMo Framework

NeMo Framework 2.0: Major 2025 Updates

The NeMo 2.0 release introduced significant API changes and unveiled NeMo Run, a new workflow management library. Key updates include:

NVIDIA Cosmos Integration

NeMo Framework now supports training and customization of world foundation models from the NVIDIA Cosmos collection. Cosmos enables multimodal generative models that can generate various modalities (video, audio, images) from text inputs.

Unified Container Environment

NeMo Framework now consolidates Large Language Models (LLM), Multimodal (MM), Automatic Speech Recognition (ASR), and Text-to-Speech (TTS) capabilities within a single unified container.

Performance Optimizations

The framework delivers optimal performance for advanced generative AI model training through Megatron-style tensor/sequence/pipeline parallelization and FlashAttention-based attention optimizations.

Supported Model Collections

NeMo Framework supports various AI models through its collection-based architecture:

NLP (Natural Language Processing)

  • Large Language Models (LLMs)
  • Text classification and sentiment analysis
  • Question-answering systems
  • Llama, Mistral, Gemma series models

ASR (Automatic Speech Recognition)

  • Multilingual speech recognition
  • Real-time Speech-to-Text (STT)
  • Noise reduction capabilities
  • AED multitask models (Canary)

TTS (Text-to-Speech)

  • Natural speech synthesis
  • Multiple vocoder support
  • Speaker adaptation features

Multimodal

  • Image-text integration models
  • Video-text processing
  • Multimodal generative AI
  • Cosmos video foundation models

Advanced Technology Support

NeMo Framework incorporates cutting-edge technologies for efficient large-scale AI model training:

FSDP (Fully Sharded Data Parallel) enables memory-efficient distributed learning, improving training efficiency for large-scale AI models.

MoE (Mixture of Experts) supports expert mixture-based LLM architectures to enhance model performance.

High-Performance Data Processing through NeMo Curator, which includes high-throughput data curation, efficient multimodal data loading, scalable model training, and parallelized inference infrastructure.

Practical Implementation Examples

LLM Model Fine-tuning

# Fine-tuning mistral_7b_instruct model with custom data
import nemo_run as run
from nemo.collections.llm import api

data_config = {
    "train_ds": {"file_path": "train.jsonl"},
    "validation_ds": {"file_path": "val.jsonl"}
}

trainer_config = {
    "max_steps": 1000,
    "val_check_interval": 100,
    "devices": 1,
    "accelerator": "gpu"
}

# Recipe: Define fine-tuning configuration
recipe = run.Partial(
    api.finetune,
    model="mistral_7b_instruct",
    data=data_config,
    trainer=trainer_config,
    optim={"lr": 1e-5}
)

# Execute the recipe
run.run(recipe)

Automatic Speech Recognition (ASR) Implementation

# Converting audio files to text using pre-trained QuartzNet model
import nemo_run as run
from nemo.collections.asr import api

# Define speech recognition recipe
asr_recipe = run.Partial(
    api.transcribe,
    model="stt_en_quartznet15x5",
    audio_files=["audio_file.wav"],
    output_dir="./transcriptions"
)

# Execute speech recognition
transcription_results = run.run(asr_recipe)

# Alternative: Direct API usage without NeMo Run
from nemo.collections.asr.api import transcribe

results = transcribe(
    model="stt_en_quartznet15x5",
    audio_files=["audio_file.wav"]
)

Text-to-Speech (TTS) Implementation

# Converting text to audio using pre-trained FastPitch model
import nemo_run as run
from nemo.collections.tts import api

# Define speech synthesis recipe
tts_recipe = run.Partial(
    api.synthesize,
    model="tts_en_fastpitch",
    text="Hello, this is NeMo TTS!",
    output_dir="./audio_output"
)

# Execute speech synthesis
audio_output = run.run(tts_recipe)

# Alternative: Direct API usage
from nemo.collections.tts.api import synthesize

audio = synthesize(
    model="tts_en_fastpitch",
    text="Hello, this is NeMo TTS!",
    output_format="wav"
)

NeMo Microservices: Enterprise-Grade AI Model Operations Platform

Platform Overview

Released as Generally Available in April 2025, NeMo Microservices is designed for stable deployment and operation of large-scale AI models in enterprise environments. It provides scalability and management capabilities in Kubernetes through a REST API-based microservices architecture.

Unlike traditional approaches, NeMo Microservices doesn't deploy models directly from NeMo Framework or Hugging Face training. Instead, it follows this workflow:

  1. Selects NIM-based pre-optimized models from the NVIDIA API Catalog
  2. Customizes these models for specific user domains (e.g., SFT, LoRA)
  3. Repackages the results as NIM format for deployment

This approach enables enterprises to leverage NVIDIA's high-performance models while adapting them to domain-specific requirements.

Core Components

NeMo Microservices is a modular, production-focused platform that enables enterprises to fine-tune, evaluate, secure, and deploy AI models at scale. The three core components are:

NeMo Customizer

  • AI model fine-tuning and customization
  • Support for efficient training techniques (LoRA, QLoRA)
  • Automated hyperparameter optimization

NeMo Evaluator

  • Systematic model performance evaluation
  • Benchmarking and A/B testing capabilities
  • Continuous model improvement workflows

NeMo Guardrails

  • AI model safety and security
  • Ethical AI usage frameworks
  • Governance and compliance management

Additional Platform Components

NVIDIA NeMo Data Store

  • Primary file storage solution for the NeMo microservices platform
  • Hugging Face Hub client (HfApi) compatible API

NVIDIA NeMo Entity Store

  • Management tool for namespaces, projects, datasets, models, and other entities

NVIDIA NeMo Deployment Management

  • NIM deployment API for Kubernetes clusters
  • Management through NIM Operator microservice

NVIDIA NeMo NIM Proxy

  • Unified endpoint for accessing inference tasks across all deployed NIMs

Data Flywheel Architecture

One of NeMo Microservices' key strengths is its data flywheel architecture, which enables continuous optimization through this cyclical process:

  1. Real-time Data Collection → Gather user feedback and model performance metrics
  2. Data Analysis → Analyze model performance using collected data
  3. Model Improvement → Retrain models based on analysis insights
  4. Deployment and Monitoring → Deploy improved models to production

Image source: NVIDIA NeMo Microservices

Practical Implementation Examples

Model Fine-tuning Job Request

import requests

# Configuration
NMS_NAMESPACE = "your_org_namespace"
BASE_MODEL = "llama-3-8b-instruct"
DATASET_NAME = "your_custom_dataset"
NEMO_URL = "YOUR_NEMO_MICROSERVICES_ENDPOINT"

# Fine-tuning parameters
training_params = {
    "name": "llama-3-8b-instruct-sft-lora",
    "output_model": f"{NMS_NAMESPACE}/llama-3-8b-instruct-sft-lora-model",
    "config": BASE_MODEL,
    "dataset": {"name": DATASET_NAME, "namespace": NMS_NAMESPACE},
    "hyperparameters": {
        "training_type": "sft",  # Supervised Fine-Tuning
        "finetuning_type": "lora",  # LoRA method
        "epochs": 2,
        "batch_size": 16,
        "learning_rate": 0.0001,
        "lora": {
            "adapter_dim": 32,
            "adapter_dropout": 0.1
        }
    }
}

# Submit API request
response = requests.post(
    f"{NEMO_URL}/v1/customization/jobs", 
    json=training_params
)

Model Evaluation Job

# Model performance evaluation request
evaluation_params = {
    "model_name": "llama-3-8b-instruct-sft-lora",
    "evaluation_dataset": "benchmark_dataset",
    "metrics": ["accuracy", "bleu", "rouge"],
    "evaluation_type": "automated"
}

eval_response = requests.post(
    f"{NEMO_URL}/v1/evaluation/jobs",
    json=evaluation_params
)

Integration with NIM

NeMo Microservices works complementarily with NIM (NVIDIA Inference Microservices):

NIM's Role

  • Inference Optimization: Specialized for optimized model serving
  • Performance Maximization: Leverages TensorRT for optimal inference performance
  • Rapid Deployment: Immediate deployment of pre-optimized models

NeMo Microservices' Role

  • Model Enhancement: Focuses on data preparation, training, and evaluation
  • Customization: Domain-specific adaptation of existing models
  • Safety Management: Comprehensive model governance and security

Currently, NeMo Microservices is optimized for integration with NIM-based models from the NVIDIA API Catalog, following this workflow:

  1. Base Model Selection → Choose NIM-supported models from NVIDIA API Catalog
  2. Customization → Domain-specific fine-tuning via NeMo Microservices
  3. Deployment → NIM-based production deployment of customized models
  4. Operations → Ongoing monitoring and iterative improvement

NeMo Integration with PAASUP DIP Platform

NeMo Support in DIP Platform

The PAASUP DIP (Data Intelligence Platform) provides comprehensive support for both NeMo Framework and NeMo Microservices.

NeMo Framework Integration

  • Access to NeMo Framework 2.0 library within DIP platform's Jupyter environment
  • Recipe-based approach for fine-tuning large language models, speech recognition, and speech synthesis models
  • Support for custom AI model development and prototyping

NeMo Microservices Integration

  • Native access to NeMo Microservices through the DIP platform
  • Full model customization (fine-tuning), evaluation, and governance capabilities
  • Seamless production deployment of customized models through NIM

Real-World Implementation Scenarios

Scenario 1: Customer Service Chatbot Development

# Model fine-tuning with NeMo Framework 2.0 on DIP platform
import nemo_run as run
from nemo.collections.llm import api

data_config = {
    "train_ds": {"file_path": "train.jsonl"},
    "validation_ds": {"file_path": "val.jsonl"}
}

trainer_config = {
    "max_steps": 1000,
    "val_check_interval": 100,
    "devices": 1,
    "accelerator": "gpu"
}

# Recipe-based fine-tuning setup
recipe = run.Partial(
    api.finetune,
    model="mistral_7b_instruct",
    data=data_config,
    trainer=trainer_config
)

# Execute fine-tuning
run.run(recipe)
# Production fine-tuning with NeMo Microservices
import requests

training_params = {
    "name": "customer-service-bot",
    "config": "llama-3-8b-instruct",
    "dataset": {"name": "customer_service_data", "namespace": "company_ns"},
    "hyperparameters": {
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 2
    }
}

# Execute fine-tuning job
response = requests.post(f"{NEMO_URL}/v1/customization/jobs", json=training_params)

# Customized model deployed via NIM

Scenario 2: Speech Recognition System

# ASR model fine-tuning with NeMo Framework 2.0 on DIP platform
import nemo_run as run
from nemo.collections.asr import api

data_config = {
    "train_ds": {"file_path": "train.jsonl"},
    "validation_ds": {"file_path": "val.jsonl"}
}

trainer_config = {
    "max_steps": 1000,
    "val_check_interval": 100,
    "devices": 1,
    "accelerator": "gpu"
}

# ASR model fine-tuning recipe setup
asr_recipe = run.Partial(
    api.finetune,
    model="stt_en_conformer_ctc_small",
    data=data_config,
    trainer=trainer_config
)

# Execute fine-tuning
run.run(asr_recipe)

# Follow-up customization via NeMo Microservices based on fine-tuning results

Enterprise Adoption Considerations

Development Environment Strategy

  • Research and Prototyping: Leverage NeMo Framework for flexibility and experimentation
  • Production Deployment: Use NeMo Microservices + NIM combination for enterprise-grade operations

Operational Excellence

  • Unified workflow from development to deployment within DIP platform's integrated environment
  • Scalable operations through container-based microservices architecture
  • Comprehensive automated MLOps pipeline capabilities

Conclusion

This article explored the NVIDIA NeMo platform, which spans the entire AI model lifecycle from development to operations. NeMo provides two complementary solutions, each optimized for different aspects of AI model management:

NeMo Framework offers researchers and ML engineers a powerful Python library for flexible model experimentation and customization in local environments.

NeMo Microservices provides enterprise development teams with an integrated platform for stable production deployment, systematic evaluation, and comprehensive governance of AI models.

PAASUP DIP (Data Intelligence Platform) creates a unified environment that seamlessly integrates both solutions. Users can choose between NeMo Framework in Jupyter environments for development and experimentation, or leverage NeMo Microservices through the platform catalog for production deployment.

As of 2025, the powerful capabilities of NeMo Framework 2.0 combined with the production-ready NeMo Microservices platform represent one of the most comprehensive solutions for successful enterprise AI adoption.

✅ Key Takeaways

  • 🧪 NeMo Framework: Research and experimentation-focused model development and fine-tuning
  • 🏭 NeMo Microservices: Enterprise-grade model customization, evaluation, and operations management
  • 🔗 PAASUP DIP: Unified end-to-end AI operations environment integrating both platforms

In the next article, we'll examine detailed examples of model fine-tuning using NeMo Framework and model customization and evaluation using NeMo Microservices within the PAASUP DIP environment.