【Long Read】【InternVL】InternVL2



1. LMDeploy Overview (Model Deployment Framework)

1-1. About LMDeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. Its core features are:

  • Efficient inference engine (TurboMind): implements persistent batching (a.k.a. continuous batching), blocked KV cache, dynamic split-and-fuse, tensor parallelism, high-performance CUDA kernels, and other key features to deliver high-throughput, low-latency LLM inference.
  • Interactive inference mode: by caching the attention K/V across the turns of a conversation, the engine remembers the dialogue history and avoids reprocessing earlier turns.
  • Quantization: LMDeploy supports multiple quantization methods and efficient inference of quantized models, and the reliability of quantization has been verified on models of different scales (a minimal CLI sketch follows below).
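As a concrete illustration of the quantization workflow, the W4A16 (AWQ) path can be driven from the command line. A minimal sketch, assuming LMDeploy's lite auto_awq subcommand; the model name and output directory below are placeholders:

# Quantize a chat model to 4-bit AWQ weights; the result is written to --work-dir
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-4bit

The resulting folder can then be served with --model-format awq, as done for InternVL2-26B-AWQ in section 3-2.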

TurboMind on the CUDA platform supports the mainstream LLM and VLM families; see the LMDeploy documentation for the full support matrix.

1-2. LLM Inference
import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2_5-7b-chat")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

When constructing a pipeline, if you do not specify whether to use the TurboMind engine or the PyTorch engine, LMDeploy picks one automatically based on their respective capabilities, preferring the TurboMind engine by default. You can also select and configure the engine explicitly, for example TurboMind:

from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=TurbomindEngineConfig(
                    max_batch_size=2,
                    enable_prefix_caching=True,
                    cache_max_entry_count=0.8,
                    session_len=8192,
                ))

or the PyTorch engine:
from lmdeploy import pipeline, PytorchEngineConfig
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=PytorchEngineConfig(
                    max_batch_size=2,
                    enable_prefix_caching=True,
                    cache_max_entry_count=0.8,
                    session_len=8192,
                ))
1-3. VLM Inference

The VLM inference pipeline is similar to the LLM one, with the added ability to feed image data through the pipeline. For example, the following snippet runs inference with an InternVL model:

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('.jpeg')  # the image URL was truncated in the original; point this at a real image
response = pipe(('describe this image', image))
print(response)
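Generation parameters for either pipeline can be tuned per call. A minimal sketch, assuming LMDeploy's GenerationConfig (exact field names may differ slightly across versions):

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)
response = pipe(['Hi, pls intro yourself'], gen_config=gen_config)
print(response)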

2. InternVL2-26B (Introduction, Loading & Inference)

2-1. About InternVL 2.0

InternVL 2.0 is the latest version in the InternVL series of multimodal large language models. It provides a range of instruction-tuned models, from 1 billion to 108 billion parameters.

Key characteristics:

  • Compared with state-of-the-art open-source multimodal LLMs, InternVL 2.0 surpasses most open models. It is competitive with closed-source commercial models across a wide range of capabilities, including document and chart understanding, infographic QA, scene-text understanding and OCR, scientific and mathematical problem solving, as well as cultural understanding and integrated multimodal abilities.
  • InternVL 2.0 is trained with an 8k context window on data that includes long text, multi-image, and video samples, so its ability to handle these kinds of inputs is substantially better than InternVL 1.5.

The InternVL 2.0 family covers InternVL2-1B, 2B, 4B, 8B, 26B, 40B, and InternVL2-Llama3-76B.

Compared with other models of similar scale, InternVL2-26B is highly competitive on public multimodal benchmarks.

2-2. 16-bit and 8-bit Loading

Install dependencies:

pip install transformers==4.37.2

bf16 (16-bit) loading code:

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2-26B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

8-bit quantized loading code:

import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2-26B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

Notice: replace path here with the local directory the model was downloaded to. On Linux, ModelScope downloads typically live under:

/root/.cache/modelscope/hub/
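If the model has not been downloaded yet, one way to fetch it and obtain that local path is ModelScope's snapshot_download (a minimal sketch, assuming the modelscope package is installed):

from modelscope import snapshot_download

# Downloads the weights if they are missing and returns the local directory to use as `path`
path = snapshot_download('OpenGVLab/InternVL2-26B')
print(path)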

2-3. Multi-GPU Loading

The multi-GPU loading code is shown below:

  • split_model: distributes the model's layers across the available GPUs according to the model size, to balance memory usage; world_size is the number of visible GPUs.
  • num_layers_per_gpu: the number of language-model layers assigned to each GPU.
  • num_layers_per_gpu[0]: GPU 0 also hosts the ViT (the vision part of the model), so its share of language layers is deliberately reduced to lighten its load.
  • device_map entries: note this part. By default the vision model and several other components are pinned to GPU 0; if memory is tight they can instead be spread across the other GPUs to relieve GPU 0, as done in the demo in section 2-5.
import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
        'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL2-26B"
device_map = split_model('InternVL2-26B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
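After loading, it is worth sanity-checking the placement before running inference. A minimal sketch, relying on the hf_device_map attribute that transformers populates whenever device_map is passed:

from collections import Counter

# How many modules ended up on each device
print(Counter(model.hf_device_map.values()))
# Rough per-GPU memory footprint after loading the weights
for i in range(torch.cuda.device_count()):
    print(f'cuda:{i}: {torch.cuda.memory_allocated(i) / 1024**3:.1f} GiB allocated')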
2-4. Inference with Transformers

The build_transform function builds a chain of image transforms that:

  • converts the image to RGB;
  • resizes it to the specified size;
  • converts it to a tensor and normalizes it.

The find_closest_aspect_ratio function finds, within a set of target aspect ratios, the one closest to the given image's aspect ratio and returns it.

The dynamic_preprocess function dynamically preprocesses an image, splitting it into multiple tiles to match the model's input requirements. Its main steps are:

  • compute the image's aspect ratio and find the closest target aspect ratio;
  • resize the image and split it into the computed number of tiles;
  • optionally append a thumbnail to supplement global image information.

The load_image function:

  • opens the image file and converts it to RGB;
  • preprocesses it with build_transform and dynamic_preprocess;
  • converts the tiles to tensors and stacks them, ready to be fed to the model.

The full code is shown below:

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
# Otherwise, you need to load the model across multiple GPUs; please refer to the `Multiple GPUs` section (2-3) above.
path = 'OpenGVLab/InternVL2-26B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail. Don\'t repeat.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
2-5. Demo

Notice: this is a small demo I put together; feel free to take a look if you are interested.

import math
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    """
    Unchanged from the official preprocessing code above.
    """
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_name):
    """
    Modified from the original split_model above: components are spread across all GPUs.
    """
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
        'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    
    # GPU 0 also has to host the ViT, so it is given far fewer language layers here.
    # num_layers_per_gpu lists the layers assigned to each GPU; GPU 0 only gets a small share (rounded up).
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.1)
    #print(num_layers_per_gpu)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
            
    # Assign the remaining components across the GPUs one by one; vision_model uses the most memory, so place it with care.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 1
    device_map['language_model.model.tok_embeddings'] = 2
    device_map['language_model.model.embed_tokens'] = 3
    device_map['language_model.output'] = 4
    device_map['language_model.model.norm'] = 5
    device_map['language_model.lm_head'] = 6
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 7

    return device_map

def main():
    path = "/root/.cache/modelscope/hub/OpenGVLab/InternVL2-26B"
    device_map = split_model('InternVL2-26B')
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
#    load_in_8bit=True,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        device_map=device_map).eval()
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
    pixel_values = load_image('./1.png', max_num=12).to(torch.bfloat16).cuda()
    generation_config = dict(max_new_tokens=1024, do_sample=False)

    question = """<image>\n
    [任务] 给出图中衣服或者其他物品的:
    1、详细描述。
    2、该物品的风格。
    3、适合的场景。
    4、搭配建议。
    [注意] 风格和场景参考以下给出的列表。物品适合的风格和场景可以有多个。
    [风格列表]极简风 (Minimalist), 波西米亚风 (Bohemian), 街头风 (Streetwear), 复古风 (Vintage), 中性风 (Androgynous), 运动休闲风 (Athleisure), 雅痞风 (Preppy), 学院风 (Collegiate), 朋克风 (Punk), 哥特风 (Gothic), 奢华风 (Luxurious), 优雅风 (Elegant), 商务休闲风 (Business Casual), 高街风 (High Street), 工装风 (Utility/Workwear), 军事风 (Military), 户外风 (Outdoor), 田园风 (Cottagecore), 摩登风 (Modern), 法式风 (French Chic), 英伦风 (British), 乡村风 (Country Style), 嘻哈风 (Hip Hop), 简约优雅风 (Sophisticated Minimalism), 浪漫风 (Romantic), 海军风 (Nautical), 洛丽塔风 (Lolita), 未来风 (Futuristic), 摩托风 (Biker), 朋克洛丽塔风 (Punk Lolita), 宫廷风 (Baroque/Rococo), 东方风 (Oriental/Asian), 性感风 (Sexy), 度假风 (Resort), 摇滚风 (Rock), 艺术家风 (Artsy), 超模风 (Model-Off-Duty), 探险风 (Explorer), 丛林风 (Safari), 热带风 (Tropical), 工艺风 (Artisan), 环保风 (Sustainable), 日本原宿风 (Harajuku), 时尚运动风 (Sportswear Chic), 街头时尚风 (Urban Fashion), 经典正式风 (Classic Formal), 沙滩风 (Beachwear), 俱乐部风 (Clubwear), 黑暗童话风 (Dark Fairytale), 复古迪斯科风 (Retro Disco) 
    [场景列表]办公室 (Office), 商务会议 (Business Meeting), 正式宴会 (Formal Dinner), 婚礼 (Wedding), 鸡尾酒会 (Cocktail Party), 面试 (Job Interview), 约会 (Date), 度假 (Vacation), 沙滩派对 (Beach Party), 音乐节 (Music Festival), 音乐会 (Concert), 户外野餐 (Outdoor Picnic), 健身房 (Gym), 瑜伽课 (Yoga Class), 婚礼伴娘 / 伴郎 (Bridesmaid/Groomsman at Wedding), 毕业典礼 (Graduation Ceremony), 生日派对 (Birthday Party), 家庭聚会 (Family Gathering), 教堂 / 宗教仪式 (Church/Religious Ceremony), 朋友聚会 (Friends’ Get-together), 高尔夫场 (Golf Course), 剧院 / 歌剧院 (Theatre/Opera), 飞机旅行 (Airplane Travel), 购物逛街 (Shopping), 商务午餐 (Business Lunch), 下午茶 (Afternoon Tea), 红毯活动 (Red Carpet Event), 舞会 (Ball/Gala), 晚宴 (Dinner Party), 滑雪度假 (Ski Resort), 新年派对 (New Year’s Eve Party), 宠物聚会 (Pet Party), 音乐节日巡游 (Festival Parade), 公司年会 (Corporate Annual Meeting), 主题派对 (Theme Party), 开幕酒会 (Art Gallery Opening), 慈善晚宴 (Charity Dinner), 约见客户 (Client Meeting), 产后派对 (Baby Shower), 运动赛事观赛 (Sporting Event Spectator), 小型私人派对 (Intimate House Party), 夜店 (Nightclub), 商务展会 (Business Expo), 户外露营 (Outdoor Camping), 游艇派对 (Yacht Party), 时尚发布会 (Fashion Show), 博物馆参观 (Museum Visit), 酒吧聚会 (Bar Gathering), 读书会 (Book Club), 出差旅行 (Business Travel)"""
    response = model.chat(tokenizer, pixel_values, question, generation_config)
    print(f'User: {question}\nAssistant: {response}')

if __name__ == "__main__":
    main()

3. Model Deployment

3-1. Installation
pip install lmdeploy==0.5.3

Other dependencies required to deploy this model with LMDeploy:

pip install timm
# Recommended: install a prebuilt flash-attn wheel that matches your CUDA/PyTorch versions
pip install flash-attn
3-2. api_server

Parameters:

  • backend: the inference engine to use (turbomind here)
  • server-port: the port the server listens on
  • tp: tensor-parallel degree, i.e. the number of GPUs to use on a multi-GPU machine
lmdeploy serve api_server OpenGVLab/InternVL2-26B --backend turbomind --server-port 23333

GPU memory usage after deployment: roughly 200 GB in total is needed; if concurrency is high or multi-turn history must be kept, budget at least another 50 GB (see the sketch below for ways to rein this in).
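If that is more memory than is available, the preallocated KV cache and the context length can be dialed down at serving time. A sketch of the relevant flags (names follow lmdeploy 0.5.x; check lmdeploy serve api_server --help for your version):

# Two GPUs, KV cache capped at 50% of the memory left after the weights, shorter context window
lmdeploy serve api_server OpenGVLab/InternVL2-26B \
    --backend turbomind \
    --server-port 23333 \
    --tp 2 \
    --cache-max-entry-count 0.5 \
    --session-len 8192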

API documentation page: the running server exposes an interactive Swagger UI listing its endpoints.

Deploying the 4-bit AWQ-quantized model:

lmdeploy serve api_server OpenGVLab/InternVL2-26B-AWQ --backend turbomind --server-port 23333 --model-format awq
3-3. Client Calls

Demo 1: use the official client API (lmdeploy.api_client) to call the /v1/chat/completions endpoint.

from lmdeploy.api_client import APIClient

api_client = APIClient('http://0.0.0.0:23333')
model_name = api_client.available_models[0]
messages = [{
    'role':
    'user',
    'content': [{
        'type': 'text',
        'text': 'Describe the image please',
    }, {
        'type': 'image_url',
        'image_url': {
            'url':
            '.jpeg',
        },
    }]
}]
for item in api_client.chat_completions_v1(model=model_name,
                                           messages=messages):
    print(item)

Demo 2: build the request yourself with requests and call the /v1/chat/completions endpoint.

import requests
import json
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, filename='app.log', filemode='a',
                    format='%(name)s - %(levelname)s - %(message)s')

# URL of the api_server's chat completions endpoint
url = 'http://0.0.0.0:23333/v1/chat/completions'

# JSON payload for the request
data = {
    "model": "/root/.cache/modelscope/hub/OpenGVLab/InternVL2-26B/",  # replace with your model name or path
    "messages": [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "描述一下这张图片的内容"
            }, 
            {
            "type": "image_url",
            "image_url": {
                "url": ""
            }
        }]
    }],
    "temperature": 0.8,
    "top_p": 0.8
}

# Send the payload as JSON
headers = {'Content-Type': 'application/json'}

try:
    for i in range(1):
        # Send the POST request
        response = requests.post(url, data=json.dumps(data), headers=headers)

        # Check whether the request succeeded
        if response.status_code == 200:
            logging.info("请求成功!响应内容:%s", response.json())
            print(f"请求成功!响应内容:\n{i}\n{response.json()}")
        else:
            logging.error("请求失败,状态码:%s", response.status_code)
            print(f"请求失败,状态码:{response.status_code}")
except requests.exceptions.RequestException as e:
    logging.error("请求异常:%s", str(e))
    print(f"请求异常:{e}")
except Exception as e:
    logging.error("发生错误:%s", str(e))
    print(f"发生错误:{e}")
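Since the api_server is OpenAI-compatible, a third option is the official openai client. A sketch, assuming openai>=1.0 is installed and the server from section 3-2 is running (the model name is whatever /v1/models reports):

from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='not-needed')
model_name = client.models.list().data[0].id
resp = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Describe InternVL2 in one sentence.'}],
    temperature=0.8,
)
print(resp.choices[0].message.content)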

Appendix

1. Commands for Checking GPU Memory

Basic check:

nvidia-smi

Continuously refreshing view:

watch -n 0.5 nvidia-smi


2. Notes on the /v1/chat/interactive Endpoint

Notice: be very careful with this interactive endpoint. Although the documentation explicitly states:

  • In interactive mode, the chat history is kept on the server; set interactive_mode = True.
  • In normal mode, no chat history is kept on the server; set interactive_mode = False.

In practice, however, even with interactive_mode = False, server-side GPU memory keeps growing as API calls accumulate and eventually blows up, so use it with caution! The /v1/chat/completions endpoint does not have this problem.
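For reference, a raw call to the interactive endpoint looks roughly like the sketch below; the field names follow my reading of the LMDeploy RESTful API docs, so verify them against the Swagger page served by your api_server before relying on them:

import requests

payload = {
    'prompt': 'Hello, who are you?',
    'session_id': 1,            # reuse the same id to continue a server-side session
    'interactive_mode': True,   # False is documented as stateless, but see the caveat above
    'stream': False,
}
r = requests.post('http://0.0.0.0:23333/v1/chat/interactive', json=payload)
print(r.json())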
