Deploying Serverless VLLM in a Private Environment
This is a practical guide for building an enterprise LLM serving environment using KServe with a vLLM backend on the PAASUP DIP platform. It covers several ways to operate the resulting AI service, from programmatic access in a Jupyter Notebook to no-code web interfaces built with OpenWebUI and Flowise.
Table of Contents
- Overview
- Preparing the Test Environment: Deploying Kubeflow Jupyter Notebook
- Deploying KServe
- Verifying and Configuring Endpoints
- Inference Testing with Jupyter Notebook
- Building a Web Interface with OpenWebUI
- No-Code AI Workflow with Flowise
- Conclusion
1. Overview
To operate a stable LLM service in an enterprise environment, many technical challenges must be overcome, such as complex Kubernetes configurations, MLOps pipeline construction, and the development of various interfaces.
PAASUP DIP (Data Intelligence Platform) abstracts away these complexities, allowing you to build production-level AI/ML workflows with just a few clicks.
This article explores how to build a KServe-based LLM serving environment with DIP and use it with various interfaces:
Components
- KServe: A Kubernetes-native serverless model serving platform for stable LLM deployment.
- Jupyter Notebook: An environment for API calls and testing using Python code.
- OpenWebUI: A user-friendly, ChatGPT-style web chat interface.
- Flowise: A no-code AI workflow building tool with a drag-and-drop interface.
Key Advantage
OpenAI-Compatible API: Thanks to the OpenAI-compatible endpoint provided by KServe, you can use existing OpenAI SDKs and various tools without any modifications.
2. Preparing the Test Environment: Deploying Kubeflow Jupyter Notebook
To verify the operation of the model deployed with KServe, we will first prepare an API call environment. By using DIP's Kubeflow Catalog, you can easily configure a Jupyter Notebook environment without complex settings.
2.1 Creating a Jupyter Notebook Environment
Step 1: Access DIP and Select a Project
- Log in to PAASUP DIP with a manager account.
- Select the target project from the project list.
- Go to the Create Catalog menu.
Step 2: Creating the Kubeflow Catalog
- Select Kubeflow from the Create Catalog menu.

- Enter a catalog version and a distinguishable name.
- Click the "Create" button.

Step 3: Checking the Catalog Status
- Monitor the status on the catalog list page until the deployment is complete.
- Wait until the "Deployment Status" and "Service Status" change to Normal.

Step 4: Creating a Jupyter Notebook Instance
- Click the deployed Kubeflow service link to access the dashboard.
- Create a new notebook instance in the Notebooks menu.


- Once created, click the Connect button to access the environment.

※ Note: This Notebook is used for API testing after KServe deployment is complete.
3. Deploying KServe
3.1 KServe Architecture Overview
KServe is a Kubernetes-native model serving platform that supports various inference backends. PAASUP DIP uses vLLM as the default backend for LLM serving.
Advantages of the vLLM Backend:
- Memory Efficiency: Optimizes memory usage with PagedAttention technology.
- High Throughput: Efficiently handles multiple requests with dynamic batching (a simple way to observe this is sketched below).
- OpenAI Compatibility: Provides a consistent interface with the standard OpenAI API format.
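Once the model is deployed and you have an endpoint (see sections 3.2 and 4), you can get a feel for the throughput benefit by sending several requests concurrently: vLLM's scheduler batches them instead of processing them one at a time. The snippet below is a minimal sketch; the base_url is a placeholder and the model name assumes the example used throughout this guide.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Placeholder endpoint -- replace with the KServe URL from section 4.
client = OpenAI(base_url="http://<kserve-endpoint>/openai/v1", api_key="dummy-key")

prompts = [f"Explain the number {i} in one sentence." for i in range(8)]

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="llama32_3b_it",  # model name chosen at deployment time
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

# Send all prompts in parallel; vLLM batches the concurrent requests on the GPU.
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)
```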
3.2 Creating the KServe Catalog
Step 1: Selecting the Catalog
- Select the KServe catalog from the Create Catalog menu.
- Click the 'Create' button.
Step 2: Entering Basic Information
- Catalog Version: Select the version.
- Catalog Name: Enter a distinguishable name (e.g., llama32-serving).
Step 3: Configuring Model Serving
Enter the following required values:
- Kubeflow name: Select the Kubeflow instance created earlier.
- Model name: Specify the model name to be served (e.g., llama32_3b_it). This value will be used as the model parameter in API calls later.
- Model volume: Specify the path where the model is stored (e.g., pvc://model-volume/llama32_3b_it). The model files must be pre-uploaded to this path (one way to do this is shown in the sketch below).
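If the model files are not yet on the volume, one convenient option is to download them from a notebook or pod that mounts the same PVC. The sketch below uses the huggingface_hub package; the local mount path is an assumption that depends on how the volume is attached, and gated models such as Llama 3.2 require a Hugging Face access token.

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Assumed mount point of the "model-volume" PVC inside the notebook pod.
snapshot_download(
    repo_id="meta-llama/Llama-3.2-3B-Instruct",   # example source repository
    local_dir="/home/jovyan/model-volume/llama32_3b_it",
    token="hf_...",                               # needed for gated models
)
```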

3.3 Checking Deployment Status
After creating the catalog, monitor the deployment progress.
- Check the created KServe catalog in the catalog list.
- Wait until the "Deployment Status" and "Service Status" change to Normal.

4. Verifying and Configuring Endpoints
Once KServe is deployed, DIP provides two access methods: an External URL for access from outside the cluster and a Cluster-local URL for access from within the cluster.
4.1 Checking KServe Endpoint Information
Method 1: Check in DIP Catalog Details
- Access the KServe catalog details page.
- Check the Catalog Address or Internal Catalog Address.
- Example URLs:
  - External access: https://demo01-kserve-kserve.demo01-test1.hopt.paasup.io
  - Internal access: http://demo01-kserve-predictor-00001.demo01-test1.svc.cluster.local:80

Method 2: Check in Kubeflow Dashboard
- Kubeflow Dashboard → KServe menu.
- Check the internal URL of the deployed model.
- Internal cluster access: http://demo01-kserve.demo01-test1.svc.cluster.local


※ Note: All endpoints provide the same OpenAI-compatible API (/openai/v1), so choose whichever URL is accessible from your network environment or proxy settings.
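To confirm that an endpoint is reachable and the model is registered before wiring up any client, you can query the OpenAI-compatible model list. A minimal check from any Python environment that can reach the chosen URL (the requests package is assumed to be available):

```python
import requests

# Use whichever endpoint is reachable from your environment (see the examples above).
base_url = "http://demo01-kserve-predictor-00001.demo01-test1.svc.cluster.local:80/openai/v1"

resp = requests.get(f"{base_url}/models", timeout=10)
resp.raise_for_status()

# The served model name (e.g., llama32_3b_it) should appear in the "data" list.
for model in resp.json().get("data", []):
    print(model["id"])
```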
5. Inference Testing with Jupyter Notebook
5.1 Accessing JupyterLab
- Click the Connect button of the Jupyter Notebook instance you created earlier.
- After accessing the JupyterLab environment, create a new Python notebook.
- Install the necessary library:
!pip install openai

5.2 Inference Testing via OpenAI-Compatible API
One of the biggest advantages of KServe is that it provides an endpoint compatible with the OpenAI API. This allows you to use widely-used client code, such as the openai library, without any modifications.
- Example Code:
from openai import OpenAI

# Set up the KServe endpoint
# For external cluster access (HTTPS):
# base_url = "https://demo01-kserve-kserve.demo01-test1.hopt.paasup.io/openai/v1"
# For internal cluster access:
base_url = "http://demo01-kserve-predictor-00001.demo01-test1.svc.cluster.local:80/openai/v1"

client = OpenAI(
    base_url=base_url,
    api_key="dummy-key"  # Use a dummy value if authentication is disabled
)

# Compose messages to send to the LLM
messages = [
    {"role": "system", "content": "Please answer concisely and clearly."},
    {"role": "user", "content": "Express the area of a circle with a diameter of 10 in terms of π."}
]

# Generate a response
completion = client.chat.completions.create(
    model="llama32_3b_it",  # Model name
    messages=messages,
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True  # Enable streaming to receive responses in real-time
)

# Print the response as it streams in
for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
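If you prefer to receive the complete answer in a single response object instead of streaming chunks, the same call works with streaming disabled. A small variation on the example above, reusing the client and messages defined there:

```python
# Non-streaming variant: the full answer is returned in one response object.
completion = client.chat.completions.create(
    model="llama32_3b_it",
    messages=messages,
    temperature=0.5,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
```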

6. Building a Web Interface with OpenWebUI
OpenWebUI is an open-source web interface that provides a ChatGPT-like user experience (UX), making it easy for non-developers to use LLMs.
6.1 Deploying the OpenWebUI Catalog
Step 1: Creating the OpenWebUI Catalog
- From the Create Catalog menu, select OpenWebUI.
- Enter the version and a name, then click "Create."

Step 2: Checking Deployment Status
- Wait until the catalog's "Deployment Status" and "Service Status" change to Normal.

6.2 OpenWebUI Initial Setup
Step 1: First Access and Account Creation
- Access the OpenWebUI service link.
- The first time you access it, a screen for creating an administrator account will appear.
- Set a name, email address, and password to create the administrator account.

Step 2: Accessing the Admin Panel
- Click the user profile icon in the top right corner.
- Select the Admin Panel menu.

6.3 Configuring KServe Endpoint Connection
Step 1: Accessing the Connection Settings Menu
- From the top Settings tab, select Connections from the left-hand menu.

Step 2: Adding an OpenAI API Connection
- Click the '+' button in the OpenAI API section.
- Enter the connection information:
- URL: {KServe endpoint}/openai/v1
  - External access: https://demo01-kserve-kserve.demo01-test1.hopt.paasup.io/openai/v1
  - Internal access: http://demo01-kserve-predictor-00001.demo01-test1.svc.cluster.local:80/openai/v1
  - Internal access: http://demo01-kserve.demo01-test1.svc.cluster.local/openai/v1
- API Key: dummy-key (enter any value in an environment with authentication disabled)
- Model: The model name set during KServe deployment (e.g., llama32_3b_it)
- Click the Save button to complete the setup.

6.4 Testing the Model via the Web Interface
Step 1: Selecting the Model
- Select the connected model on the OpenWebUI main screen.

Step 2: Conversation Test
- Enter a question in the chat window to see the response.
Example Input: Find three consecutive even numbers whose sum is 48.

This way, you can instantly build a production-level AI chatbot with features like streaming responses, conversation history management, and multi-user access, all without coding.
7. No-Code AI Workflow with Flowise
Flowise is a tool that allows you to build complex AI workflows without coding, using a visual node-based interface. It is particularly useful for building RAG (Retrieval-Augmented Generation) systems or multi-step AI processes.
7.1 Deploying the Flowise Catalog
Step 1: Creating the Flowise Catalog
- From the Create Catalog menu, select Flowise.
- Enter the version and a name, then click "Create."

Step 2: Checking Deployment Status
- Wait until the catalog's "Deployment Status" and "Service Status" change to Normal.

7.2 Building a Basic Chatflow
Step 1: Creating a New Chatflow
- From the left-hand Chatflows menu, click the Add New button.

Step 2: Adding Nodes
Click the '+' button on the canvas to add the following nodes:
- ChatLocalAI: The core node responsible for connecting to the LLM.
- Buffer Memory: Manages memory to maintain conversation context.
- Conversation Chain: Manages the overall conversation flow.

Step 3: Connecting Nodes
Connect the nodes logically using drag and drop:
- ChatLocalAI ↔ Conversation Chain (as the Chat Model)
- Buffer Memory ↔ Conversation Chain (as Memory)

Step 4: Configuring the ChatLocalAI Node
- Base Path: {KServe endpoint}/openai/v1
  - External access: https://demo01-kserve-kserve.demo01-test1.hopt.paasup.io/openai/v1
  - Internal access: http://demo01-kserve-predictor-00001.demo01-test1.svc.cluster.local:80/openai/v1
  - Internal access: http://demo01-kserve.demo01-test1.svc.cluster.local/openai/v1
- Model Name: The model name set during KServe deployment (e.g., llama32_3b_it)

7.3 Testing and Using the Workflow
Basic Conversation Test:
Example Input: A train traveled at 60 km/h for 2 hours and then at 90 km/h for 1 hour. What was the average speed of the train?
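Beyond the built-in chat panel, a saved chatflow can also be called over Flowise's prediction API, which makes it easy to embed the workflow in other applications. A rough sketch is shown below; the Flowise base URL and chatflow ID are placeholders, and depending on your Flowise settings an API key header may also be required.

```python
import requests

# Placeholders: use your Flowise service URL and the ID of the saved chatflow.
FLOWISE_URL = "http://<flowise-service>"
CHATFLOW_ID = "<chatflow-id>"

resp = requests.post(
    f"{FLOWISE_URL}/api/v1/prediction/{CHATFLOW_ID}",
    json={"question": "A train traveled at 60 km/h for 2 hours and then at 90 km/h for 1 hour. What was its average speed?"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```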

8. Conclusion
By using PAASUP DIP, you can quickly build enterprise-grade LLM services without the burden of complex Kubernetes configurations or MLOps pipeline construction.
Features by Access Method
| Interface | Advantages | Suitable Users/Use Cases |
|---|---|---|
| Jupyter Notebook | - Granular control via programming<br>- Direct API calls and customization<br>- Automation possible | - Developers<br>- Data Scientists |
| OpenWebUI | - ChatGPT-style UX<br>- Multi-user support | - General users<br>- Team AI chatbots |
| Flowise | - Drag-and-drop interface<br>- No-code workflow<br>- RAG/multi-step process configuration | - Non-developers<br>- Business automation |
Usage Tips
- Optimize Model Performance: After deployment, adjust hyperparameters such as temperature and top_p to find the best response quality.
- Monitor Resources: Use DIP's monitoring features to continuously check CPU and memory usage.
- Use Autoscaling: When traffic increases, you can enable KServe's autoscaling feature to automatically adjust the number of pods based on demand.
The PAASUP DIP catalog system allows you to efficiently build and operate LLM services tailored to your organization. We hope this guide serves as a great starting point for your AI projects.