Survey: huggingface-diffusers


1. What Diffusers Offers

1.1 Overview

Diffusers is a library that collects state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures.

The Diffusers library prioritizes usability over raw performance.

Diffusers provides three main capabilities:

  • State-of-the-art diffusion pipelines for low-code inference.
  • Interchangeable noise schedulers for balancing generation speed and quality (see the sketch after this list).
  • Pretrained models to use as building blocks for your own diffusion systems.
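For example, a pipeline's scheduler can be swapped in a single line. A minimal sketch, assuming Stable Diffusion v1.5 as the example checkpoint:

from diffusers import DiffusionPipeline, EulerDiscreteScheduler

# Load a complete pipeline; from_pretrained resolves the right model and scheduler classes.
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Swap the default scheduler for Euler, reusing the existing scheduler's config.
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

Schedulers that converge in fewer steps, such as Euler, typically trade a little quality for much faster sampling, which is exactly the speed/quality balance the scheduler abstraction exposes.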

1.2 Supported Pipelines

Pipeline | Paper / Project | Task
alt_diffusion | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | Image-to-Image Text-Guided Generation
audio_diffusion | Audio Diffusion | Unconditional Audio Generation
controlnet | Adding Conditional Control to Text-to-Image Diffusion Models | Image-to-Image Text-Guided Generation
cycle_diffusion | Unifying Diffusion Models’ Latent Space, with Applications to CycleDiffusion and Guidance | Image-to-Image Text-Guided Generation
dance_diffusion | Dance Diffusion | Unconditional Audio Generation
ddpm | Denoising Diffusion Probabilistic Models | Unconditional Image Generation
ddim | Denoising Diffusion Implicit Models | Unconditional Image Generation
if | IF | Image Generation
if_img2img | IF | Image-to-Image Generation
if_inpainting | IF | Image-to-Image Generation
latent_diffusion | High-Resolution Image Synthesis with Latent Diffusion Models | Text-to-Image Generation
latent_diffusion | High-Resolution Image Synthesis with Latent Diffusion Models | Super Resolution Image-to-Image
latent_diffusion_uncond | High-Resolution Image Synthesis with Latent Diffusion Models | Unconditional Image Generation
paint_by_example | Paint by Example: Exemplar-based Image Editing with Diffusion Models | Image-Guided Image Inpainting
pndm | Pseudo Numerical Methods for Diffusion Models on Manifolds | Unconditional Image Generation
score_sde_ve | Score-Based Generative Modeling through Stochastic Differential Equations | Unconditional Image Generation
score_sde_vp | Score-Based Generative Modeling through Stochastic Differential Equations | Unconditional Image Generation
semantic_stable_diffusion | Semantic Guidance | Text-Guided Generation
stable_diffusion_text2img | Stable Diffusion | Text-to-Image Generation
stable_diffusion_img2img | Stable Diffusion | Image-to-Image Text-Guided Generation
stable_diffusion_inpaint | Stable Diffusion | Text-Guided Image Inpainting
stable_diffusion_panorama | MultiDiffusion | Text-to-Panorama Generation
stable_diffusion_pix2pix | InstructPix2Pix: Learning to Follow Image Editing Instructions | Text-Guided Image Editing
stable_diffusion_pix2pix_zero | Zero-shot Image-to-Image Translation | Text-Guided Image Editing
stable_diffusion_attend_and_excite | Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models | Text-to-Image Generation
stable_diffusion_self_attention_guidance | Improving Sample Quality of Diffusion Models Using Self-Attention Guidance | Text-to-Image Generation, Unconditional Image Generation
stable_diffusion_image_variation | Stable Diffusion Image Variations | Image-to-Image Generation
stable_diffusion_latent_upscale | Stable Diffusion Latent Upscaler | Text-Guided Super Resolution Image-to-Image
stable_diffusion_model_editing | Editing Implicit Assumptions in Text-to-Image Diffusion Models | Text-to-Image Model Editing
stable_diffusion_2 | Stable Diffusion 2 | Text-to-Image Generation
stable_diffusion_2 | Stable Diffusion 2 | Text-Guided Image Inpainting
stable_diffusion_2 | Depth-Conditional Stable Diffusion | Depth-to-Image Generation
stable_diffusion_2 | Stable Diffusion 2 | Text-Guided Super Resolution Image-to-Image
stable_diffusion_safe | Safe Stable Diffusion | Text-Guided Generation
stable_unclip | Stable unCLIP | Text-to-Image Generation
stable_unclip | Stable unCLIP | Image-to-Image Text-Guided Generation
stochastic_karras_ve | Elucidating the Design Space of Diffusion-Based Generative Models | Unconditional Image Generation
text_to_video_sd | Modelscope’s Text-to-video-synthesis Model in Open Domain | Text-to-Video Generation
unclip | Hierarchical Text-Conditional Image Generation with CLIP Latents (implementation by kakaobrain) | Text-to-Image Generation
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Text-to-Image Generation
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Image Variations Generation
versatile_diffusion | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model | Dual Image and Text Guided Generation
vq_diffusion | Vector Quantized Diffusion Model for Text-to-Image Synthesis | Text-to-Image Generation

1.3 DiffusionPipeline

DiffusionPipeline is a highly abstracted end-to-end interface that wraps all of the models and schedulers in huggingface-diffusers, making it easy to run inference.

Task | Description | Pipeline
Unconditional Image Generation | generate an image from Gaussian noise | unconditional_image_generation
Text-Guided Image Generation | generate an image given a text prompt | conditional_image_generation
Text-Guided Image-to-Image Translation | adapt an image guided by a text prompt | img2img
Text-Guided Image-Inpainting | fill the masked part of an image given the image, the mask and a text prompt | inpaint
Text-Guided Depth-to-Image Translation | adapt parts of an image guided by a text prompt while preserving structure via depth estimation | depth2img

2. Task Pipelines

An important point that deserves special attention when using diffusers is distinguishing between inference pipelines and training pipelines.

2.1 Direct Inference Pipelines

2.1.1 Unconditional image generation

Unconditional image generation is comparatively simple: the model generates images without any additional context (text, images, etc.), so the output depends only on the training data.

Pipeline: DiffusionPipeline
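A minimal inference sketch; google/ddpm-cat-256 is just one example checkpoint:

from diffusers import DiffusionPipeline

# No prompt or other conditioning is passed; sampling starts from pure Gaussian noise.
pipe = DiffusionPipeline.from_pretrained("google/ddpm-cat-256")
image = pipe().images[0]
image.save("generated.png")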

2.1.2 Text-to-image generation

Text-to-image generation, i.e. conditional image generation, produces an image from a text prompt. The text is converted into embeddings that condition the model to generate an image from noise.

Pipeline: DiffusionPipeline
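A minimal sketch, assuming a CUDA device and Stable Diffusion v1.5 as the example checkpoint:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt is encoded to text embeddings that condition the denoising process.
image = pipe("an astronaut riding a horse on mars").images[0]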

2.1.3 Text-guided image-to-image generation

Text-guided image-to-image generation produces a new image conditioned on both a text prompt and an initial image.

Pipeline: StableDiffusionImg2ImgPipeline
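A minimal sketch; the input file name is hypothetical:

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("sketch.png")  # hypothetical local file
# strength in [0, 1] controls how far the output may deviate from init_image.
image = pipe("a fantasy landscape", image=init_image, strength=0.75).images[0]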

2.1.4 Text-guided image-inpainting

Text-guided image inpainting edits a specific region of an image, given a mask and a text prompt.

Pipeline: StableDiffusionInpaintPipeline
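A minimal sketch; the image and mask file names are hypothetical:

import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("photo.png")  # hypothetical input image
mask_image = load_image("mask.png")   # hypothetical mask; white = region to repaint
image = pipe("a white cat", image=init_image, mask_image=mask_image).images[0]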

2.1.5 Text-guided depth-to-image generation

Text-guided depth-to-image generation produces a new image conditioned on a text prompt and an initial image. The image's depth structure can be preserved via the depth_map parameter; if no depth_map is passed, depth is estimated with a depth-estimation model.

Pipeline: StableDiffusionDepth2ImgPipeline
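A minimal sketch; the input file name is hypothetical, and no depth_map is passed, so depth is estimated internally:

import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("room.png")  # hypothetical input image
# Omitting depth_map triggers automatic depth estimation.
image = pipe("a cozy library", image=init_image).images[0]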

2.2 Training Pipelines

2.2.1 Overview

Task | Accelerate support | Datasets provided
Unconditional Image Generation | ✅ | ✅
Text-to-Image fine-tuning | ✅ | ✅
Textual Inversion | ✅ | -
Dreambooth | ✅ | -
Training with LoRA | ✅ | -
ControlNet | ✅ | ✅
InstructPix2Pix | ✅ | ✅
Custom Diffusion | ✅ | ✅

2.2.2 Unconditional Image Generation

Not conditioned on any text or image; it simply generates images that resemble the distribution of its training data.

accelerate launch train_unconditional.py \
  --dataset_name="huggan/flowers-102-categories" \
  --resolution=64 \
  --output_dir="ddpm-ema-flowers-64" \
  --train_batch_size=16 \
  --num_epochs=100 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision=no \
  --push_to_hub
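After training, the saved pipeline can be reloaded for sampling; a sketch using the output directory from the command above:

from diffusers import DDPMPipeline

pipe = DDPMPipeline.from_pretrained("ddpm-ema-flowers-64")
image = pipe().images[0]  # unconditional sample from the fine-tuned model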

2.2.3 Text-to-Image fine-tuning

The training flow for models that generate images from a text prompt, such as Stable Diffusion.

accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"
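The fine-tuned model is saved as a regular pipeline and loads like any other; a sketch using the output directory above ("yoda" is an example prompt):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "sd-pokemon-model", torch_dtype=torch.float16
).to("cuda")
image = pipe("yoda").images[0]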

2.2.4 Textual Inversion

Textual Inversion captures novel concepts from a small number of example images. The learned concepts can then be used in prompts to personalize image generation and give finer control over the output.

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat"
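On recent diffusers releases, the learned embedding can be attached to a pipeline with load_textual_inversion; a sketch using the output directory above:

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_textual_inversion("textual_inversion_cat")  # loads the learned <cat-toy> embedding
image = pipe("a <cat-toy> floating in space").images[0]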

2.2.5 Dreambooth

DreamBooth is a method for personalizing a text-to-image model such as Stable Diffusion, given only a few (3-5) images of a subject. It lets the model generate contextualized images of that subject in different scenes, poses, and views.

python train_dreambooth_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --learning_rate=5e-6 \
  --max_train_steps=400
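The command above is the Flax variant; assuming a run of the PyTorch trainer (train_dreambooth.py) instead, the resulting pipeline loads like any other, with the rare token from the instance prompt triggering the learned subject:

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("path/to/OUTPUT_DIR").to("cuda")
image = pipe("a photo of sks dog in a bucket").images[0]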

2.2.6 LoRA

LoRA (Low-Rank Adaptation of Large Language Models) is a training method that speeds up the training of large models while consuming less memory. It adds pairs of rank-decomposition weight matrices (called update matrices) to the existing weights and trains only those newly added weights.

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --dataloader_num_workers=8 \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --push_to_hub \
  --hub_model_id=${HUB_MODEL_ID} \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with blue eyes." \
  --seed=1337
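The script stores only the LoRA update matrices; one way to attach them at inference time, following the pattern documented for this script, is load_attn_procs on the base model's UNet:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.unet.load_attn_procs("path/to/OUTPUT_DIR")  # attach the trained LoRA weights
image = pipe("A pokemon with blue eyes.").images[0]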

2.2.7 ControlNet

Compared with plain img2img, ControlNet is more precise and effective: it can directly extract an image's composition, a subject's pose, or the image's depth information, and use these as conditions to constrain generation.

accelerate launch train_controlnet.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --dataset_name=fusing/fill50k \
  --resolution=512 \
  --learning_rate=1e-5 \
  --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
  --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
  --train_batch_size=4
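A sketch of inference with the trained ControlNet, reusing one of the validation conditioning images from the command above:

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained("path/to/OUTPUT_DIR")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
).to("cuda")

cond = load_image("./conditioning_image_1.png")
image = pipe("red circle with blue background", image=cond).images[0]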

2.2.8 InstructPix2Pix

Using InstructPix2Pix: given an input image and an editing instruction that tells the model what to do, the model follows the instruction to edit the image.

accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_ID \
  --enable_xformers_memory_efficient_attention \
  --resolution=256 --random_flip \
  --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=15000 \
  --checkpointing_steps=5000 --checkpoints_total_limit=1 \
  --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
  --conditioning_dropout_prob=0.05 \
  --mixed_precision=fp16 \
  --seed=42
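A sketch of inference, using the authors' released timbrooks/instruct-pix2pix checkpoint as an example; the input file name is hypothetical:

import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("portrait.png")  # hypothetical input image
# image_guidance_scale trades faithfulness to the input against edit strength.
edited = pipe("turn him into a cyborg", image=image, image_guidance_scale=1.5).images[0]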

2.2.9 Custom Diffusion

Custom Diffusion learns new concepts efficiently by optimizing only the parameters in the cross-attention layers of a text-to-image diffusion model. When combining multiple concepts, each concept model can be trained separately and the fine-tuned models then merged into one via constrained optimization.

accelerate launch train_custom_diffusion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --class_data_dir=./real_reg/samples_cat/ \
  --with_prior_preservation --real_prior --prior_loss_weight=1.0 \
  --class_prompt="cat" --num_class_images=200 \
  --instance_prompt="photo of a <new1> cat" \
  --resolution=512 \
  --train_batch_size=2 \
  --learning_rate=1e-5 \
  --lr_warmup_steps=0 \
  --max_train_steps=250 \
  --scale_lr --hflip \
  --modifier_token "<new1>" \
  --validation_prompt="<new1> cat sitting in a bucket" \
  --report_to="wandb"
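A sketch of loading the learned weights at inference time, following the pattern in the diffusers docs for this script; the weight file names are the script's defaults and should be treated as assumptions:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# Attach the fine-tuned cross-attention weights and the learned <new1> token.
pipe.unet.load_attn_procs("path/to/OUTPUT_DIR", weight_name="pytorch_custom_diffusion_weights.bin")
pipe.load_textual_inversion("path/to/OUTPUT_DIR", weight_name="<new1>.bin")
image = pipe("<new1> cat sitting in a bucket").images[0]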

3. Prompt Engineering

Weighting prompts

At their core, the features diffusers provides are mostly text2img: generating an image from a given prompt. A prompt should contain the various concepts the model is expected to render, but the model rarely balances them as intended, so in practice parts of the prompt usually need to be weighted up or down to add or remove emphasis.
Diffusion models work by conditioning the model's cross-attention layers on contextualized text embeddings. A simple way to emphasize (or de-emphasize) certain parts of a prompt is therefore to scale up or down the text-embedding vectors that correspond to those parts.

Prompt weighting supports going from

prompt = "a red cat playing with a ball"

to

prompt = "a red cat playing with a ball++"

to emphasize a particular part of the prompt.
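The "++" syntax comes from the compel library, which the diffusers documentation recommends for prompt weighting; a minimal sketch:

import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
# "++" upweights the preceding word; "--" would downweight it.
conditioning = compel.build_conditioning_tensor("a red cat playing with a ball++")
image = pipe(prompt_embeds=conditioning).images[0]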

