Deploying NVIDIA NIM Microservice in a Private Environment
This guide explains how to securely and efficiently deploy LLMs such as llama-3.2-3b-instruct in a private environment using the NVIDIA NIM microservice on PAASUP DIP. The setup supports both Python code integration and no-code workflows with Flowise, making it well suited to diverse enterprise AI applications.
Introduction
With the rapid expansion of AI model usage, safely and efficiently deploying and operating AI models in enterprise environments has become a critical challenge. NVIDIA NIM (NVIDIA Inference Microservice) is a container-based microservice designed to meet these requirements.
In this post, we'll walk through the step-by-step process of building and running the llama-3.2-3b-instruct model in a private environment using the NVIDIA NIM catalog of PAASUP DIP (Data Intelligence Platform).
What is NVIDIA NIM?
NVIDIA NIM is a set of container-based inference microservices optimized for NVIDIA GPUs, designed to simplify the AI inference process and maximize performance. By standardizing the complex AI model deployment process behind an OpenAI-compatible API, it helps developers build AI services more easily.
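Concretely, each NIM LLM container exposes an OpenAI-compatible HTTP interface. As a minimal sketch (the address below is a hypothetical placeholder), you can probe a running NIM service and list the models it serves:

import requests

# Hypothetical address of a running NIM service; replace with your own.
NIM_URL = "http://localhost:8000"

# NIM LLM containers expose OpenAI-compatible endpoints such as /v1/models.
models = requests.get(f"{NIM_URL}/v1/models", timeout=5).json()
print(models)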
Deployment Process
Step 1: Copy NIM Microservice Container Image Pull Path
First, locate the container image for the required model in the NVIDIA NGC catalog and copy its pull path.
Access NGC Catalog and Select Model
- Visit https://catalog.ngc.nvidia.com
- Click on Containers menu
- Check the NVIDIA NIM filter to display only NIM-related containers
- Search for and select the desired model (here, llama-3.2-3b-instruct)
- Navigate to the Tags tab in the model card and copy the container image pull path
Step 2: Create NIM Catalog in PAASUP DIP
Configure a catalog to create the NIM service in the DIP environment.
Basic Configuration
- Catalog Version: Enter desired version
- Catalog Name: Enter an identifiable name
Service Model Configuration
CASE 1: When the catalog deployment environment is connected to the internet
Configure with the container image pull path of the model to be served copied from Step 1:
- Model repository: nvcr.io/nim/meta/llama-3.2-3b-instruct (the pull path excluding the version)
- Model repository tag: 1.8.3 (the version portion of the pull path)
- NGC API KEY: Enter your NVIDIA NGC API key
- Model name: meta/llama-3.2-3b-instruct (the model name from the pull path)
CASE 2: When the catalog deployment environment is not connected to the internet
Download the container image in an internet-connected environment, then import it into the DIP environment before deploying the catalog:
- Model repository: nvcr.io/nim/meta/llama-3.2-3b-instruct
- Model repository tag: 1.8.3
- NGC API KEY: Enter any value; it is not used in this case
- Model name: meta/llama-3.2-3b-instruct
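For reference, here is a minimal sketch of the image transfer for this air-gapped case, assuming a Docker-based workflow (the archive file name is illustrative; NGC registry logins use the literal username $oauthtoken with your NGC API key as the password):

import subprocess

IMAGE = "nvcr.io/nim/meta/llama-3.2-3b-instruct:1.8.3"
ARCHIVE = "llama-3.2-3b-instruct_1.8.3.tar"  # illustrative file name

# On the internet-connected machine: log in to nvcr.io, pull, and export the image.
subprocess.run(["docker", "login", "nvcr.io", "-u", "$oauthtoken", "--password-stdin"],
               input=b"YOUR_NGC_API_KEY", check=True)
subprocess.run(["docker", "pull", IMAGE], check=True)
subprocess.run(["docker", "save", "-o", ARCHIVE, IMAGE], check=True)

# On the DIP side, after moving the archive into the private environment:
subprocess.run(["docker", "load", "-i", ARCHIVE], check=True)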
After completing the configuration, click the Create button to create the catalog.
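Once the catalog is deployed, you can check that the model has finished loading before calling it. A small sketch, using the in-cluster service address that also appears in Step 3 (NIM LLM containers expose a readiness endpoint; an HTTP 200 means the model is ready):

import requests

# In-cluster address of the NIM service created by the catalog above.
BASE_URL = "http://demo01-nimtest01.demo01-nimtest01.svc.cluster.local:8000"

# Readiness probe: HTTP 200 indicates the model is loaded and ready to serve.
resp = requests.get(f"{BASE_URL}/v1/health/ready", timeout=5)
print(resp.status_code, resp.text)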
Step 3: Call NIM Microservice from Jupyter Lab
Now you can call and use the created NIM service with Python code.
Access Jupyter Lab
- Click the Kubeflow link in the PAASUP DIP portal
- Select Notebooks in Kubeflow
- Open Jupyter Lab
Write and Execute Python Code
In the Python code, set base_url to the NIM catalog's domain information. The api_key is not used in a private environment, but it is a required parameter of the OpenAI client.
from openai import OpenAI

# --- Client Setup ---
# Configure the NIM microservice endpoint
client = OpenAI(
    base_url="http://demo01-nimtest01.demo01-nimtest01.svc.cluster.local:8000/v1",
    api_key="[YOUR-API-KEY]"  # Not actually used in a private environment
)

# --- Conversation History ---
messages = [
    {"role": "system", "content": "You are a helpful assistant."}
]

print("Chatbot initialized. Type 'exit' to end the conversation.")
print("-" * 20)

# --- Main Chat Loop ---
while True:
    try:
        # 1. Get user input
        user_input = input("You: ")

        # 2. Check for exit command
        if user_input.lower() in ["exit", "quit"]:
            print("Exiting chat. Goodbye!")
            break

        # 3. Add user message to conversation history
        messages.append({"role": "user", "content": user_input})

        # 4. Send the entire conversation history to the model
        completion = client.chat.completions.create(
            model="meta/llama-3.2-3b-instruct",
            messages=messages,
            temperature=0.5,
            top_p=1,
            max_tokens=1024,
            stream=True
        )

        # 5. Stream the model's response
        print("Assistant: ", end="")
        assistant_response = ""
        for chunk in completion:
            if chunk.choices[0].delta.content is not None:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                assistant_response += content

        # 6. Add the model's response to the history
        messages.append({"role": "assistant", "content": assistant_response})
        print()  # Line break for clean formatting

    except KeyboardInterrupt:
        print("\nExiting chat. Goodbye!")
        break
    except Exception as e:
        print(f"\nAn error occurred: {e}")
        break
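Before starting the interactive loop, a one-shot, non-streaming request makes a quick sanity check against the same endpoint and model:

from openai import OpenAI

client = OpenAI(
    base_url="http://demo01-nimtest01.demo01-nimtest01.svc.cluster.local:8000/v1",
    api_key="not-used",  # placeholder; ignored in a private environment
)

completion = client.chat.completions.create(
    model="meta/llama-3.2-3b-instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)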
Execute the chat loop code to start chatting.
Step 4: Call NIM Microservice from Flowise
To visually configure AI flows without writing code, you can utilize Flowise.
What is Flowise?
Flowise is a no-code based tool that helps quickly build chatbots and AI agents by visually configuring LangChain flows.
Flowise is provided as a catalog in PAASUP DIP.
Flowise Configuration Process
1. Click the Flowise link in the DIP Catalog
2. Select Chatflows > + Add New
3. Configure the ChatLocalAI component:
   - Base Path: Enter the NIM catalog's domain information
   - Model Name: Enter the model name to use
4. Add a Buffer Memory component and a Conversation Chain component
5. Connect each component to the Conversation Chain
6. Save and click the chat icon to start chatting
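A saved chatflow can also be called programmatically. As a sketch, Flowise exposes a REST prediction endpoint per chatflow (the host and chatflow ID below are hypothetical placeholders; the ID for a saved flow is shown in the Flowise UI):

import requests

# Hypothetical Flowise address and chatflow ID; substitute your own values.
FLOWISE_URL = "http://flowise.example.internal:3000"
CHATFLOW_ID = "your-chatflow-id"

# Flowise serves each chatflow at /api/v1/prediction/<chatflow-id>.
resp = requests.post(
    f"{FLOWISE_URL}/api/v1/prediction/{CHATFLOW_ID}",
    json={"question": "What can you tell me about NVIDIA NIM?"},
    timeout=60,
)
print(resp.json())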
Key Advantages
1. Security in Private Environment
- Runs fully isolated within the internal environment, with no dependence on external APIs
- No risk of sensitive data leakage to external sources
2. Performance Optimization
- Provides high inference performance with optimizations specialized for NVIDIA GPUs
- Ensures consistent performance and stability through container-based approach
3. Development Convenience
- Enables reuse of existing code with OpenAI API-compatible interface
- Rapid prototyping through no-code tools like Flowise
4. Scalability
- Supports automatic scaling in Kubernetes environment
- Flexible system configuration with microservice architecture
Conclusion
Through the process of building NVIDIA NIM Microservice in the PAASUP DIP environment, we confirmed that the latest AI models can be safely and efficiently utilized even in private environments.
In particular, the two approaches presented here, direct API calls through Python code and no-code usage through Flowise, offer the flexibility to meet various user requirements.
As AI technology continues to advance and security requirements strengthen, such private AI deployment methods are expected to become essential technologies in enterprise environments. We encourage you to build AI services that satisfy both security and performance using optimized solutions like NVIDIA NIM.