Adding Conditional Control to Text-to-Image Diffusion Models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala
Stanford University
{lvmin, anyirao, maneesh}@cs.stanford.edu
[Figure 1 panels: (top) input Canny edge with the default prompt and “masterpiece of fairy tale, giant deer, golden antlers”; (bottom) input human pose with the default prompt, “chef in kitchen”, “…, quaint city Galic”, and “Lincoln statue”]
Figure 1: Controlling Stable Diffusion with learned conditions. ControlNet allows users to add conditions like Canny edges
(top), human pose (bottom), etc., to control the image generation of large pretrained diffusion models. The default results use
the prompt “a high-quality, detailed, and professional image”. Users can optionally give prompts like “chef in kitchen”.
Abstract
We present ControlNet, a neural network architecture to
add spatial conditioning controls to large, pretrained text-
to-image diffusion models. ControlNet locks the production-
ready large diffusion models, and reuses their deep and ro-
bust encoding layers pretrained with billions of images as a
strong backbone to learn a diverse set of conditional controls.
The neural architecture is connected with “zero convolutions”
(zero-initialized convolution layers) that progressively grow
the parameters from zero and ensure that no harmful noise
could affect the finetuning. We test various conditioning con-
trols, e.g., edges, depth, segmentation, human pose, etc., with
Stable Diffusion, using single or multiple conditions, with
or without prompts. We show that the training of Control-
Nets is robust with small (<50k) and large (>1m) datasets.
Extensive results show that ControlNet may facilitate wider
applications to control image diffusion models.
1. Introduction
Many of us have experienced flashes of visual inspiration
that we wish to capture in a unique image. With the advent
of text-to-image diffusion models [ 54 , 61 , 71 ], we can now
create visually stunning images by typing in a text prompt.
Yet, text-to-image models are limited in the control they
provide over the spatial composition of the image; precisely
expressing complex layouts, poses, shapes and forms can be
difficult via text prompts alone. Generating an image that
accurately matches our mental imagery often requires nu-
merous trial-and-error cycles of editing a prompt, inspecting
the resulting images and then re-editing the prompt.
Can we enable finer grained spatial control by letting
users provide additional images that directly specify their
desired image composition? In computer vision and machine
learning, these additional images (e.g., edge maps, human
pose skeletons, segmentation maps, depth, normals, etc.)
are often treated as conditioning on the image generation
process. Image-to-image translation models [ 34 , 97 ] learn
the mapping from conditioning images to target images. The
research community has also taken steps to control text-
to-image models with spatial masks [ 6 , 20 ], image editing
instructions [ 10 ], personalization via finetuning [ 21 , 74 ], etc.
While a few problems (e.g., generating image variations,
inpainting) can be resolved with training-free techniques
like constraining the denoising diffusion process or edit-
ing attention layer activations, a wider variety of problems
like depth-to-image, pose-to-image, etc., require end-to-end
learning and data-driven solutions.
Learning conditional controls for large text-to-image dif-
fusion models in an end-to-end way is challenging. The
amount of training data for a specific condition may be sig-
nificantly smaller than the data available for general text-to-
image training. For instance, the largest datasets for various
specific problems (e.g., object shape/normal, human pose
extraction, etc.) are usually about 100K in size, which is
50,000 times smaller than the LAION-5B [ 78 ] dataset that
was used to train Stable Diffusion [ 81 ]. The direct finetun-
ing or continued training of a large pretrained model with
limited data may cause overfitting and catastrophic forget-
ting [ 31 , 74 ]. Researchers have shown that such forgetting
can be alleviated by restricting the number or rank of train-
able parameters [ 14 , 25 , 31 , 91 ]. For our problem, designing
deeper or more customized neural architectures might be
necessary for handling in-the-wild conditioning images with
complex shapes and diverse high-level semantics.
This paper presents ControlNet, an end-to-end neural
network architecture that learns conditional controls for large
pretrained text-to-image diffusion models (Stable Diffusion
in our implementation). ControlNet preserves the quality
and capabilities of the large model by locking its parameters,
and also making a trainable copy of its encoding layers.
This architecture treats the large pretrained model as a strong
backbone for learning diverse conditional controls. The
trainable copy and the original, locked model are connected
with zero convolution layers, with weights initialized to zeros
so that they progressively grow during the training. This
architecture ensures that harmful noise is not added to the
deep features of the large diffusion model at the beginning
of training, and protects the large-scale pretrained backbone
in the trainable copy from being damaged by such noise.
Our experiments show that ControlNet can control Sta-
ble Diffusion with various conditioning inputs, including
Canny edges, Hough lines, user scribbles, human key points,
segmentation maps, shape normals, depths, etc. (Figure 1 ).
We test our approach using a single conditioning image,
with or without text prompts, and we demonstrate how our
approach supports the composition of multiple conditions.
Additionally, we report that the training of ControlNet is
robust and scalable on datasets of different sizes, and that for
some tasks like depth-to-image conditioning, training Con-
trolNets on a single NVIDIA RTX 3090Ti GPU can achieve
results competitive with industrial models trained on large
computation clusters. Finally, we conduct ablative studies to
investigate the contribution of each component of our model,
and compare our models to several strong conditional image
generation baselines with user studies.
In summary, (1) we propose ControlNet, a neural network
architecture that can add spatially localized input conditions
to a pretrained text-to-image diffusion model via efficient
finetuning, (2) we present pretrained ControlNets to control
Stable Diffusion, conditioned on Canny edges, Hough lines,
user scribbles, human key points, segmentation maps, shape
normals, depths, and cartoon line drawings, and (3) we val-
idate the method with ablative experiments comparing to
several alternative architectures, and conduct user studies
focused on several previous baselines across different tasks.
2. Related Work
2.1. Finetuning Neural Networks
One way to finetune a neural network is to directly continue
training it with the additional training data. But this approach
can lead to overfitting, mode collapse, and catastrophic for-
getting. Extensive research has focused on developing fine-
tuning strategies that avoid such issues.
HyperNetwork is an approach that originated in the Natural
Language Processing (NLP) community [ 25 ], with the aim
of training a small recurrent neural network to influence the
weights of a larger one. It has been applied to image gener-
ation with generative adversarial networks (GANs) [ 4 , 18 ].
Heathen et al. [ 26 ] and Kurumuz [ 43 ] implement HyperNet-
works for Stable Diffusion [ 71 ] to change the artistic style
of its output images.
Adapter methods are widely used in NLP for customiz-
ing a pretrained transformer model to other tasks by em-
bedding new module layers into it [ 30 , 83 ]. In computer
vision, adapters are used for incremental learning [ 73 ] and
domain adaptation [ 69 ]. This technique is often used with
CLIP [ 65 ] for transferring pretrained backbone models to
different tasks [ 23 , 65 , 84 , 93 ]. More recently, adapters have
yielded successful results in vision transformers [ 49 , 50 ]
and ViT-Adapter [ 14 ]. In concurrent work with ours, T2I-
Adapter [ 56 ] adapts Stable Diffusion to external conditions.
Additive Learning circumvents forgetting by freezing the
original model weights and adding a small number of new pa-
rameters using learned weight masks [ 51 , 73 ], pruning [ 52 ],
or hard attention [ 79 ]. Side-Tuning [ 91 ] uses a side branch
model to learn extra functionality by linearly blending the
outputs of a frozen model and an added network, with a
predefined blending weight schedule.
Low-Rank Adaptation (LoRA) prevents catastrophic for-
getting [ 31 ] by learning the offset of parameters with low-
rank matrices, based on the observation that many over-
parameterized models reside in a low intrinsic dimension
subspace [ 2 , 47 ].
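For illustration only, here is a minimal PyTorch sketch of this idea (not LoRA's released code; the class name, rank, and initialization choices are ours): a frozen linear layer plus a trainable low-rank offset.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank offset (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # low-rank factor A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # low-rank factor B
        nn.init.normal_(self.down.weight, std=1e-3)
        nn.init.zeros_(self.up.weight)            # the offset starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```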
Zero-Initialized Layers are used by ControlNet for con-
necting network blocks. Research on neural networks has
extensively discussed the initialization and manipulation of
network weights [ 36 , 37 , 44 , 45 , 46 , 75 , 82 , 94 ]. For exam-
ple, Gaussian initialization of weights can be less risky than
initializing with zeros [ 1 ]. More recently, Nichol et al. [ 58 ]
discussed how to scale the initial weight of convolution lay-
ers in a diffusion model to improve the training, and their
implementation of “zero module” is an extreme case of scaling
weights to zero. Stability’s model cards [ 82 ] also mention
the use of zero weights in neural layers. Manipulating the
initial convolution weights is also discussed in ProGAN [ 36 ],
StyleGAN [ 37 ], and Noise2Noise [ 46 ].
2.2. Image Diffusion
Image Diffusion Models were first introduced by Sohl-
Dickstein et al. [ 80 ] and have recently been applied to image generation [ 17 , 42 ]. Latent Diffusion Models (LDM) [ 71 ] perform the diffusion steps in the latent image space [ 19 ], which reduces the computation cost. Text-to-image diffusion models achieve state-of-the-art image generation results by encoding text inputs into latent vectors via pretrained language models like CLIP [ 65 ]. Glide [ 57 ] is a text-guided diffusion model supporting image generation and editing. Disco Diffusion [ 5 ] processes text prompts with CLIP guidance. Stable Diffusion [ 81 ] is a large-scale
implementation of latent diffusion [ 71 ]. Imagen [ 77 ] directly
diffuses pixels using a pyramid structure without using latent
images. Commercial products include DALL-E2 [ 61 ] and
Midjourney [ 54 ].
Controlling Image Diffusion Models facilitates personalization, customization, or task-specific image generation.
The image diffusion process directly provides some control
over color variation [ 53 ] and inpainting [ 66 , 7 ]. Text-guided
control methods focus on adjusting prompts, manipulating
CLIP features, and modifying cross-attention [ 7 , 10 , 20 , 27 ,
40 , 41 , 57 , 63 , 66 ]. MakeAScene [ 20 ] encodes segmentation
masks into tokens to control image generation. SpaText [ 6 ]
maps segmentation masks into localized token embeddings.
GLIGEN [ 48 ] learns new parameters in attention layers of
diffusion models for grounded generation. Textual Inver-
sion [ 21 ] and DreamBooth [ 74 ] can personalize content in
the generated image by finetuning the image diffusion model
using a small set of user-provided example images. Prompt-
based image editing [ 10 , 33 , 85 ] provides practical tools to
manipulate images with prompts. Voynov et al. [ 87 ] propose
an optimization method that fits the diffusion process with
sketches. Concurrent works [ 8 , 9 , 32 , 56 ] examine a wide
variety of ways to control diffusion models.
Figure 2: A neural block takes a feature map x as input and
outputs another feature map y, as shown in (a). To add a
ControlNet to such a block we lock the original block and
create a trainable copy and connect them together using zero
convolution layers, i.e., 1 × 1 convolution with both weight
and bias initialized to zero. Here c is a conditioning vector
that we wish to add to the network, as shown in (b).
2.3. Image-to-Image Translation
Conditional GANs [ 15 , 34 , 62 , 89 , 92 , 96 , 97 , 98 ] and trans-
formers [ 13 , 19 , 67 ] can learn the mapping between different
image domains, e.g., Taming Transformer [ 19 ] is a vision
transformer approach; Palette [ 76 ] is a conditional diffu-
sion model trained from scratch; PITI [ 88 ] is a pretraining-
based conditional diffusion model for image-to-image trans-
lation. Manipulating pretrained GANs can handle specific
image-to-image tasks, e.g., StyleGANs can be controlled
by extra encoders [ 70 ], with more applications studied in
[ 3 , 22 , 38 , 39 , 55 , 59 , 64 , 70 ].
3. Method
ControlNet is a neural network architecture that can en-
hance large pretrained text-to-image diffusion models with
spatially localized, task-specific image conditions. We first
introduce the basic structure of a ControlNet in Section 3.1
and then describe how we apply a ControlNet to the image
diffusion model Stable Diffusion [ 71 ] in Section 3.2. We
elaborate on our training in Section 3.3 and detail several
extra considerations during inference such as composing
multiple ControlNets in Section 3.4 .
3.1. ControlNet
ControlNet injects additional conditions into the blocks of
a neural network (Figure 2 ). Herein, we use the term network
block to refer to a set of neural layers that are commonly
put together to form a single unit of a neural network, e.g.,
resnet block, conv-bn-relu block, multi-head attention block,
transformer block, etc. Suppose F(·; Θ) is such a trained
neural block, with parameters Θ, that transforms an input
feature map x into another feature map y as

y = F(x; Θ).    (1)
In our setting, x and y are usually 2D feature maps, i.e., x ∈ R^{h×w×c}, with {h, w, c} the height, width, and number of channels of the map, respectively (Figure 2a).
To add a ControlNet to such a pre-trained neural block,
we lock (freeze) the parameters Θ of the original block and
simultaneously clone the block to a trainable copy with
parameters Θ c (Figure 2 b). The trainable copy takes an
external conditioning vector c as input. When this structure
is applied to large models like Stable Diffusion, the locked
parameters preserve the production-ready model trained with
billions of images, while the trainable copy reuses such large-
scale pretrained model to establish a deep, robust, and strong
backbone for handling diverse input conditions.
The trainable copy is connected to the locked model with
zero convolution layers, denoted Z(·; ·). Specifically, Z(·; ·)
is a 1 × 1 convolution layer with both weight and bias ini-
tialized to zeros. To build up a ControlNet, we use two
instances of zero convolutions with parameters Θ_z1 and Θ_z2, respectively. The complete ControlNet then computes

y_c = F(x; Θ) + Z(F(x + Z(c; Θ_z1); Θ_c); Θ_z2),    (2)
where y_c is the output of the ControlNet block. In the first training step, since both the weight and bias parameters of a zero convolution layer are initialized to zero, both of the Z(·; ·) terms in Equation (2) evaluate to zero, and

y_c = y.    (3)
In this way, harmful noise cannot influence the hidden states
of the neural network layers in the trainable copy when the
training starts. Moreover, since Z(c; Θ_z1) = 0 and the trainable copy also receives the input image x, the trainable copy is fully functional and retains the capabilities of the large, pretrained model, allowing it to serve as a strong backbone
for further learning. Zero convolutions protect this back-
bone by eliminating random noise as gradients in the initial
training steps. We detail the gradient calculation for zero
convolutions in supplementary materials.
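As a concrete reading of Equations (2) and (3), the following is a minimal PyTorch sketch, not the authors' released code, of a zero convolution and of wrapping a locked block with its trainable copy; the class names, and the assumption that the condition c has the same shape as x, are ours.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with both weight and bias initialized to zero (Z in Eq. 2)."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """y_c = F(x; Θ) + Z(F(x + Z(c; Θ_z1); Θ_c); Θ_z2), as in Eq. (2)."""
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.copy = copy.deepcopy(block)        # trainable copy with parameters Θ_c
        self.locked = block
        for p in self.locked.parameters():      # lock (freeze) the original block Θ
            p.requires_grad_(False)
        self.zero_in = zero_conv(channels)      # Θ_z1
        self.zero_out = zero_conv(channels)     # Θ_z2

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        y = self.locked(x)
        # At initialization both zero convolutions output zero, so y_c = y (Eq. 3).
        return y + self.zero_out(self.copy(x + self.zero_in(c)))
```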
3.2. ControlNet for Text-to-Image Diffusion
We use Stable Diffusion [ 71 ] as an example to show how
ControlNet can add conditional control to a large pretrained
diffusion model. Stable Diffusion is essentially a U-Net [ 72 ]
with an encoder, a middle block, and a skip-connected de-
coder. Both the encoder and decoder contain 12 blocks,
and the full model contains 25 blocks, including the middle
block. Of the 25 blocks, 8 blocks are down-sampling or
up-sampling convolution layers, while the other 17 blocks
are main blocks that each contain 4 resnet layers and 2 Vi-
sion Transformers (ViTs). Each ViT contains several cross-
attention and self-attention mechanisms. For example, in
Figure 3 a, the “SD Encoder Block A” contains 4 resnet lay-
ers and 2 ViTs, while the “×3” indicates that this block is
repeated three times.

Figure 3: Stable Diffusion’s U-net architecture connected with a ControlNet on the encoder blocks and middle block. The locked, gray blocks show the structure of Stable Diffusion V1.5 (or V2.1, as they use the same U-net architecture). The trainable blue blocks and the white zero convolution layers are added to build a ControlNet.

Text prompts are encoded using the
CLIP text encoder [ 65 ], and diffusion timesteps are encoded
with a time encoder using positional encoding.
The ControlNet structure is applied to each encoder level
of the U-net (Figure 3 b). In particular, we use ControlNet
to create a trainable copy of the 12 encoding blocks and 1
middle block of Stable Diffusion. The 12 encoding blocks
are in 4 resolutions (64 × 64, 32 × 32, 16 × 16, 8 × 8) with
each one replicated 3 times. The outputs are added to the
12 skip-connections and 1 middle block of the U-net. Since
Stable Diffusion is a typical U-net structure, this ControlNet
architecture is likely to be applicable to other models.
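At a higher level, the wiring described above could be sketched as follows. This is pseudocode-style Python under assumed interfaces (encoder_blocks, encoder_copies, zero_convs, etc. are not attributes of any real library); it is meant only to show where the 12 + 1 ControlNet outputs enter the locked U-net.

```python
import torch

def unet_with_controlnet(sd_unet, controlnet, z_t, t_emb, text_emb, c_f):
    """Illustrative forward pass; sd_unet / controlnet are assumed to expose lists of
    blocks (encoder_blocks, middle_block, decoder_blocks, encoder_copies, zero_convs)."""
    # Locked Stable Diffusion encoder: run the 12 encoder blocks and collect skip connections.
    h, skips = z_t, []
    for block in sd_unet.encoder_blocks:
        h = block(h, t_emb, text_emb)
        skips.append(h)
    h = sd_unet.middle_block(h, t_emb, text_emb)

    # Trainable copies, with the conditioning feature c_f injected through a zero convolution.
    hc, residuals = z_t + controlnet.zero_in(c_f), []
    for block, zero_out in zip(controlnet.encoder_copies, controlnet.zero_convs):
        hc = block(hc, t_emb, text_emb)
        residuals.append(zero_out(hc))
    h = h + controlnet.zero_mid(controlnet.middle_copy(hc, t_emb, text_emb))

    # Add the 12 residuals to the skip connections; the locked decoder consumes the result.
    skips = [s + r for s, r in zip(skips, residuals)]
    for block in sd_unet.decoder_blocks:
        h = block(torch.cat([h, skips.pop()], dim=1), t_emb, text_emb)
    return h
```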
The way we connect the ControlNet is computationally
efficient: since the locked copy parameters are frozen, no
gradient computation is required in the originally locked
encoder for the finetuning. This approach speeds up train-
ing and saves GPU memory. As tested on a single NVIDIA
A100 PCIE 40GB, optimizing Stable Diffusion with Control-
Net requires only about 23% more GPU memory and 34%
more time in each training iteration, compared to optimizing
Stable Diffusion without ControlNet.
Image diffusion models learn to progressively denoise
images and generate samples from the training domain. The
denoising process can occur in pixel space or in a latent
space encoded from training data. Stable Diffusion uses
latent images as the training domain, since working in this space
has been shown to stabilize the training process [ 71 ]. Specif-
ically, Stable Diffusion uses a pre-processing method similar
to VQ-GAN [ 19 ] to convert 512 × 512 pixel-space images
into smaller 64 × 64 latent images. To add ControlNet to
Stable Diffusion, we first convert each input conditioning
image (e.g., edge, pose, depth, etc.) from an input size of
512 × 512 into a 64 × 64 feature space vector that matches
the size of Stable Diffusion. In particular, we use a tiny
network E(·) of four convolution layers with 4 × 4 kernels
and 2 × 2 strides (activated by ReLU, using 16, 32, 64, and 128 channels respectively, initialized with Gaussian weights, and trained jointly with the full model) to encode an image-space condition c_i into a feature-space conditioning vector c_f as

c_f = E(c_i).    (4)

The conditioning vector c_f is passed into the ControlNet.
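Following this description, one way to sketch such an encoder E(·) in PyTorch is shown below. This is not the released implementation: the stride placement of the last layer, its 3 × 3 kernel, the padding, and the final 1 × 1 projection to the U-net channel width are assumptions made so that a 512 × 512 input lands on the stated 64 × 64 feature map.

```python
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Tiny encoder E(.) of Eq. (4): maps a 512x512 image-space condition c_i to a
    64x64 feature map c_f. Channel widths (16, 32, 64, 128) follow the text; the
    last-layer kernel/stride and the 1x1 projection are assumptions of this sketch."""
    def __init__(self, in_channels: int = 3, model_channels: int = 320):  # 320 is assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 4, stride=2, padding=1), nn.ReLU(),  # 512 -> 256
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),           # 256 -> 128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),           # 128 -> 64
            nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.ReLU(),          # stays 64x64
        )
        self.proj = nn.Conv2d(128, model_channels, 1)  # match the U-net's channel width

    def forward(self, c_i: torch.Tensor) -> torch.Tensor:
        return self.proj(self.features(c_i))           # c_f = E(c_i)

# e.g., ConditionEncoder()(torch.randn(1, 3, 512, 512)).shape == (1, 320, 64, 64)
```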
3.3. Training
Given an input image z_0, image diffusion algorithms progressively add noise to the image and produce a noisy image z_t, where t represents the number of times noise is added. Given a set of conditions including the time step t, text prompts c_t, as well as a task-specific condition c_f, image diffusion algorithms learn a network ε_θ to predict the noise added to the noisy image z_t with

L = E_{z_0, t, c_t, c_f, ε∼N(0,1)} [ ‖ε − ε_θ(z_t, t, c_t, c_f)‖_2^2 ],    (5)
where L is the overall learning objective of the entire dif-
fusion model. This learning objective is directly used in
finetuning diffusion models with ControlNet.
In the training process, we randomly replace 50% of the text prompts c_t with empty strings. This approach increases
ControlNet’s ability to directly recognize semantics in the
input conditioning images (e.g., edges, poses, depth, etc.) as
a replacement for the prompt.
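Putting Eq. (5) and the 50% prompt dropping together, a training step could be sketched as below. This is our simplification: unet, controlnet, encode_text, and the noise-scheduler interface are placeholders rather than a specific library's API.

```python
import random
import torch
import torch.nn.functional as F

def training_step(unet, controlnet, encode_text, scheduler, z_0, prompts, c_i):
    """One ControlNet finetuning step: noise-prediction loss of Eq. (5) with prompt dropping."""
    # Randomly replace half of the text prompts with the empty string.
    prompts = ["" if random.random() < 0.5 else p for p in prompts]
    c_t = encode_text(prompts)                                  # text conditioning
    c_f = controlnet.condition_encoder(c_i)                     # task-specific conditioning (Eq. 4)

    # Sample a timestep and add noise to the latents (placeholder scheduler interface).
    t = torch.randint(0, scheduler.num_train_timesteps, (z_0.shape[0],), device=z_0.device)
    eps = torch.randn_like(z_0)
    z_t = scheduler.add_noise(z_0, eps, t)

    # Predict the added noise with the ControlNet-augmented U-net and regress it (Eq. 5).
    eps_pred = unet(z_t, t, c_t, controlnet(z_t, t, c_t, c_f))
    return F.mse_loss(eps_pred, eps)
```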
During the training process, since zero convolutions do
not add noise to the network, the model should always be
able to predict high-quality images. We observe that the
model does not gradually learn the control conditions but
abruptly succeeds in following the input conditioning image, usually in fewer than 10K optimization steps. As shown in Fig-
ure 4 , we call this the “sudden convergence phenomenon”.
3.4. Inference
We can further control how the extra conditions of Con-
trolNet affect the denoising diffusion process in several ways.
[Figure 4 panels: test input and outputs at training steps 100, 1000, 2000, 6100, 6133, 8000, and 12000]
Figure 4: The sudden convergence phenomenon. Due to the
zero convolutions, ControlNet always predicts high-quality
images during the entire training. At a certain step in the
training process (e.g., step 6133, marked in bold), the
model suddenly learns to follow the input condition.
Figure 5: Effect of Classifier-Free Guidance (CFG) and the proposed CFG Resolution Weighting (CFG-RW). Panels: (a) input Canny map, (b) without CFG, (c) without CFG-RW, (d) full method (without prompt).
Figure 6: Composition of multiple conditions (pose & depth, with the prompts “boy” and “astronaut”). We present the application of using depth and pose simultaneously.
Classifier-free guidance resolution weighting. Stable Dif-
fusion depends on a technique called Classifier-Free Guid-
ance (CFG) [ 29 ] to generate high-quality images. CFG is formulated as ε_prd = ε_uc + β_cfg(ε_c − ε_uc), where ε_prd, ε_uc, ε_c, and β_cfg are the model’s final output, unconditional output, conditional output, and a user-specified weight, respectively. When a conditioning image is added via ControlNet, it can be added to both ε_uc and ε_c, or only to ε_c. In challenging cases, e.g., when no prompts are given, adding it to both ε_uc and ε_c will completely remove CFG guidance (Figure 5b); using only ε_c will make the guidance very strong (Figure 5c). Our solution is to first add the conditioning image to ε_c and
Figure 7: Controlling Stable Diffusion with various conditions (sketch, normal map, depth map, Canny [ 11 ] edge, M-LSD [ 24 ] line, HED [ 90 ] edge, ADE20K [ 95 ] segmentation, human pose) without prompts. The top row shows the input conditions, while all other rows are outputs. We use the empty string as the input prompt. All models are trained with general-domain data. The model has to recognize semantic contents in the input condition images to generate images.
Method                          Result Quality ↑   Condition Fidelity ↑
PITI [ 88 ] (sketch)            1.10 ± 0.05        1.02 ± 0.01
Sketch-Guided [ 87 ] (β = 1.6)  3.21 ± 0.62        2.31 ± 0.57
Sketch-Guided [ 87 ] (β = 3.2)  2.52 ± 0.44        3.28 ± 0.72
ControlNet-lite                 3.93 ± 0.59        4.09 ± 0.46
ControlNet                      4.22 ± 0.43        4.28 ± 0.45

Table 1: Average User Ranking (AUR) of result quality and condition fidelity. We report the user preference ranking (1 to 5 indicates worst to best) of different methods.
then multiply a weight w_i to each connection between Stable Diffusion and ControlNet according to the resolution of each block, w_i = 64/h_i, where h_i is the size of the i-th block, e.g., h_1 = 8, h_2 = 16, ..., h_13 = 64. By reducing the CFG guidance strength, we can achieve the result shown in Figure 5d, and we call this CFG Resolution Weighting.
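A hedged sketch of CFG with the resolution weighting just described is shown below; the per-connection residual list, the ordering of the block sizes h_i, the unet/controlnet call signatures, and the β_cfg value are assumptions of this sketch.

```python
import torch

def cfg_with_resolution_weighting(unet, controlnet, z_t, t, c_t_cond, c_t_uncond, c_f,
                                  beta_cfg: float = 7.5):      # beta value is illustrative
    """eps_prd = eps_uc + beta_cfg * (eps_c - eps_uc), with ControlNet residuals added
    only to the conditional branch and each residual scaled by w_i = 64 / h_i."""
    residuals = controlnet(z_t, t, c_t_cond, c_f)               # 13 residuals, one per connection
    sizes = [8, 8, 8, 8, 16, 16, 16, 32, 32, 32, 64, 64, 64]    # assumed h_i ordering (middle + encoder)
    weighted = [r * (64.0 / h) for r, h in zip(residuals, sizes)]

    eps_uc = unet(z_t, t, c_t_uncond, control=None)             # unconditional pass: no conditioning image
    eps_c = unet(z_t, t, c_t_cond, control=weighted)            # conditional pass: weighted residuals
    return eps_uc + beta_cfg * (eps_c - eps_uc)
```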
Composing multiple ControlNets. To apply multiple con-
ditioning images (e.g., Canny edges and pose) to a single
instance of Stable Diffusion, we can directly add the outputs
of the corresponding ControlNets to the Stable Diffusion
model (Figure 6 ). No extra weighting or linear interpolation
is necessary for such composition.
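With the same assumed interfaces as above, composing multiple ControlNets then amounts to summing their per-connection residuals before adding them to Stable Diffusion:

```python
def compose_controlnets(controlnets, conditions, z_t, t, c_t):
    """Directly add the outputs of several ControlNets (e.g., pose + depth), block by block."""
    per_net = [net(z_t, t, c_t, c_f) for net, c_f in zip(controlnets, conditions)]
    # Sum the i-th residual across ControlNets; no extra weighting or interpolation is used.
    return [sum(block_residuals) for block_residuals in zip(*per_net)]
```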
4. Experiments
We implement ControlNets with Stable Diffusion to
test various conditions, including Canny Edge [ 11 ], Depth
Map [ 68 ], Normal Map [ 86 ], M-LSD lines [ 24 ], HED soft
edge [ 90 ], ADE20K segmentation [ 95 ], Openpose [ 12 ], and
user sketches. See also the supplementary material for ex-
amples of each conditioning along with detailed training and
inference parameters.
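As one concrete example of preparing a conditioning image, a Canny edge map can be extracted with OpenCV as sketched below; the file paths and thresholds are illustrative, not the settings used for training.

```python
import cv2
import numpy as np

# Load an image and extract a Canny edge map to serve as the conditioning image c_i.
image = cv2.imread("input.png")                      # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)                    # low/high thresholds are illustrative
condition = np.stack([edges] * 3, axis=-1)           # replicate to 3 channels for an RGB-style input
cv2.imwrite("canny_condition.png", condition)
```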
4.1. Qualitative Results
Figure 1 shows the generated images in several prompt
settings. Figure 7 shows our results with various conditions
without prompts, where the ControlNet robustly interprets
content semantics in diverse input conditioning images.
4.2. Ablative Study
We study alternative structures of ControlNets by (1)
replacing the zero convolutions with standard convolution
layers initialized with Gaussian weights, and (2) replacing
each block’s trainable copy with one single convolution layer,
which we call ControlNet-lite. See also the supplementary
material for the full details of these ablative structures.
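As described above, the ablations only change how the lightweight connection layers are built; a minimal sketch of the two initialization variants follows (our naming, with the 1 × 1 shape and the 0.02 standard deviation as assumptions).

```python
import torch.nn as nn

def connection_conv(channels: int, variant: str = "zero") -> nn.Conv2d:
    """1x1 connection layer between the trainable copy and the locked model."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    if variant == "zero":                 # proposed: weight and bias start at zero
        nn.init.zeros_(conv.weight)
        nn.init.zeros_(conv.bias)
    elif variant == "gaussian":           # ablation (1): standard Gaussian-initialized convolution
        nn.init.normal_(conv.weight, std=0.02)
        nn.init.zeros_(conv.bias)
    return conv

# Ablation (2), "ControlNet-lite", instead replaces each block's trainable copy with a
# single convolution layer (see the supplementary material for the exact structure).
```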
We present 4 prompt settings to test possible behaviors of real-world users: (1) no prompt; (2) insufficient
prompts that do not fully cover objects in conditioning im-
ages, e.g., the default prompt of this paper “a high-quality,
detailed, and professional image”; (3) conflicting prompts
that change the semantics of conditioning images; (4) perfect
prompts that describe necessary content semantics, e.g., “a
nice house”. Figure 8 a shows that ControlNet succeeds in
Figure 8: Ablative study of different architectures on a sketch condition and different prompt settings: (a) the proposed ControlNet with zero convolutions, (b) ControlNet with the zero convolutions replaced by standard convolution layers, and (c) ControlNet-lite with lightweight layers initialized from scratch. The prompt settings are: no prompt; an insufficient prompt that does not mention “house” (“high-quality and detailed masterpiece”); a conflicting prompt (“delicious cake”); and a perfect prompt (“a house, high-quality, extremely detailed, 4K, HQ”). For each setting, we show a random batch of 6 samples without cherry-picking. Images are at 512 × 512 and best viewed when zoomed in. The green “conv” blocks on the left are standard convolution layers initialized with Gaussian weights.
Method            IoU ↑
ADE20K (GT)       0.58 ± 0.10
VQGAN [ 19 ]      0.21 ± 0.15
LDM [ 71 ]        0.31 ± 0.09
PITI [ 88 ]       0.26 ± 0.16
ControlNet-lite   0.32 ± 0.12
ControlNet        0.35 ± 0.14

Table 2: Evaluation of semantic segmentation label reconstruction (ADE20K) with Intersection over Union (IoU ↑).
all 4 settings. The lightweight ControlNet-lite (Figure 8 c) is
not strong enough to interpret the conditioning images and
fails in the insufficient and no prompt conditions. When zero
convolutions are replaced, the performance of ControlNet
drops to about the same as ControlNet-lite, indicating that
the pretrained backbone of the trainable copy is destroyed
during finetuning (Figure 8 b).
4.3. Quantitative Evaluation
User study. We sample 20 unseen hand-drawn sketches, and
then assign each sketch to 5 methods: PITI [ 88 ]’s sketch
model, Sketch-Guided Diffusion (SGD) [ 87 ] with default
edge-guidance scale (β = 1.6), SGD [ 87 ] with relatively
high edge-guidance scale (β = 3.2), the aforementioned
ControlNet-lite, and ControlNet. We invited 12 users to rank
these 20 groups of 5 results individually in terms of “the
quality of displayed images” and “the fidelity to the sketch”.
In this way, we obtain 100 rankings for result quality and 100
for condition fidelity. We use the Average Human Ranking
(AHR) as a preference metric where users rank each result
on a scale of 1 to 5 (lower is worse). The average rankings
are shown in Table 1 .
Method                 FID ↓   CLIP-score ↑   CLIP-aes. ↑
Stable Diffusion       6.09    0.26           6.32
VQGAN [ 19 ] (seg.)*   26.28   0.17           5.14
LDM [ 71 ] (seg.)*     25.35   0.18           5.15
PITI [ 88 ] (seg.)     19.74   0.20           5.77
ControlNet-lite        17.92   0.26           6.30
ControlNet             15.27   0.26           6.31

Table 3: Evaluation for image generation conditioned by semantic segmentation. We report FID, CLIP text-image score, and CLIP aesthetic scores for our method and other baselines. We also report the performance of Stable Diffusion without segmentation conditions. Methods marked with “*” are trained from scratch.

Comparison to industrial models. Stable Diffusion V2 Depth-to-Image (SDv2-D2I) [ 82 ] is trained with a large-
scale NVIDIA A100 cluster, thousands of GPU hours, and
more than 12M training images. We train a ControlNet for
the SD V2 with the same depth conditioning but only use
200k training samples, one single NVIDIA RTX 3090Ti, and
5 days of training. We use 100 images generated by each
SDv2-D2I and ControlNet to teach 12 users to distinguish
the two methods. Afterwards, we generate 200 images and
ask the users to tell which model generated each image. The
average precision of the users is 0.52 ± 0.17, indicating that the two methods yield almost indistinguishable results.
Figure 9: Comparison to previous methods. We present the qualitative comparisons to PITI [ 88 ], Sketch-Guided Diffusion [ 87 ], and Taming Transformers [ 19 ]. Inputs include a sketch, a segmentation map, and a Canny edge map; our results are shown with the default prompt and with prompts such as “electric fan”, “golden retriever”, and “white helmet on table”.

Condition reconstruction and FID score. We use the test set of ADE20K [ 95 ] to evaluate the conditioning fidelity. The state-of-the-art segmentation method OneFormer [ 35 ] achieves an Intersection-over-Union (IoU) of 0.58 on the ground-truth set. We use different methods to generate images with ADE20K segmentations and then apply OneFormer to detect the segmentations again to compute the reconstructed IoUs (Table 2). Besides, we use the Frechet Inception Distance (FID) [ 28 ] to measure the distribution distance over randomly generated 512×512 image sets using different segmentation-conditioned methods, as well as the text-image CLIP score [ 65 ] and CLIP aesthetic score [ 78 ] in Table 3.
See also the supplementary material for detailed settings.
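For reference, the reconstructed IoU in Table 2 can be computed from the re-detected and ground-truth label maps; below is a minimal sketch (our helper, assuming both maps are integer arrays over the 150 ADE20K classes).

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 150) -> float:
    """Mean Intersection-over-Union between a predicted and a ground-truth label map."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue                      # class absent in both maps: skip it
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```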
4.4. Comparison to Previous Methods
Figure 9 presents a visual comparison of baselines and our
method (Stable Diffusion + ControlNet). Specifically, we
show the results of PITI [ 88 ], Sketch-Guided Diffusion [ 87 ], and Taming Transformers [ 19 ]. We observe that Control-
Net can robustly handle diverse conditioning images and
achieves sharp and clean results.
4.5. Discussion
Influence of training dataset sizes. We demonstrate the
robustness of the ControlNet training in Figure 10 . The
training does not collapse with a limited set of 1k images, and allows the model to generate a recognizable lion. The learning is
scalable when more data is provided.
Figure 10: The influence of different training dataset sizes (prompt “Lion”; models trained with 1k, 50k, and 3m images). See also the supplementary material for extended examples.
Figure 11: Interpreting contents (prompt: “a high-quality and extremely detailed image”). If the input is ambiguous and the user does not mention object contents in prompts, the results look like the model tries to interpret input shapes.
Figure 12: Transfer pretrained ControlNets to community models [ 16 , 60 ] (Comic Diffusion, Protogen 3.4, and SD 1.5; prompt: “house”) without training the neural networks again.
Capability to interpret contents. We showcase Control-
Net’s capability to capture the semantics from input condi-
tioning images in Figure 11 .
Transferring to community models. Since ControlNets do not change the network topology of pretrained SD models, they can be directly applied to various models in the Stable Diffusion community, such as Comic Diffusion [ 60 ] and Protogen 3.4 [ 16 ], as shown in Figure 12.
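In practice, this kind of transfer can be reproduced by pairing a pretrained ControlNet with a different base checkpoint; the sketch below uses the Hugging Face diffusers library rather than the paper's own release, and the model IDs are illustrative.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# A ControlNet trained against SD 1.5 (model IDs are illustrative and may change).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)

# Attach it, without retraining, to a community checkpoint that shares SD 1.5's topology.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "ogkalu/Comic-Diffusion", controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

canny_condition = Image.open("canny_condition.png")   # e.g., the edge map prepared earlier
image = pipe("house", image=canny_condition).images[0]
image.save("house_comic.png")
```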
5. Conclusion
ControlNet is a neural network structure that learns con-
ditional control for large pretrained text-to-image diffusion
models. It reuses the large-scale pretrained layers of source
models to build a deep and strong encoder to learn specific
conditions. The original model and trainable copy are con-
nected via “zero convolution” layers that eliminate harmful
noise during training. Extensive experiments verify that Con-
trolNet can effectively control Stable Diffusion with single
or multiple conditions, with or without prompts. Results on
diverse conditioning datasets show that the ControlNet struc-
ture is likely to be applicable to a wider range of conditions and to facilitate relevant applications.
Acknowledgment
This work was partially supported by the Stanford In-
stitute for Human-Centered AI and the Brown Institute for
Media Innovation.
References
[1] Sadia Afrin. Weight initialization in neural network, inspired
by andrew ng, https://medium.com/@safrin1128/weight-
initialization-in-neural-network-inspired-by-andrew-ng-
e0066dc4a566, 2020. 3
[2] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. In-
trinsic dimensionality explains the effectiveness of language
model fine-tuning. In Proceedings of the 59th Annual Meeting
of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Process-
ing, pages 7319?7328, Online, Aug. 2021. Association for
Computational Linguistics. 3
[3] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Only
a matter of style: Age transformation using a style-based
regression model. ACM Transactions on Graphics (TOG),
40(4), 2021. 3
[4] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit
Bermano. Hyperstyle: Stylegan inversion with hypernetworks
for real image editing. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 18511?18521, 2022. 2
[5] Alembics. Disco diffusion, https://github.com/alembics/disco-
diffusion, 2022. 3
[6] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta,
Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried,
and Xi Yin. Spatext: Spatio-textual representation for con-
trollable image generation. arXiv preprint arXiv:2211.14305,
2022. 2 , 3
[7] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended
diffusion for text-driven editing of natural images. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 18208?18218, 2022. 3
[8] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.
Multidiffusion: Fusing diffusion paths for controlled image
generation. arXiv preprint arXiv:2302.08113, 2023. 3
[9] Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko,
and Irfan Essa. Masksketch: Unpaired structure-guided
masked image generation. arXiv preprint arXiv:2302.05496,
2023. 3
[10] Tim Brooks, Aleksander Holynski, and Alexei A Efros. In-
structpix2pix: Learning to follow image editing instructions.
arXiv preprint arXiv:2211.09800, 2022. 2 , 3
[11] John Canny. A computational approach to edge detection.
IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, (6):679?698, 1986. 6
[12] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A.
Sheikh. Openpose: Realtime multi-person 2d pose estima-
tion using part affinity fields. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2019. 6
[13] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping
Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and
Wen Gao. Pre-trained image processing transformer. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 12299?12310, 2021. 3
[14] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong
Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter
for dense predictions. International Conference on Learning
Representations, 2023. 2
[15] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha,
Sunghun Kim, and Jaegul Choo. Stargan: Unified genera-
tive adversarial networks for multi-domain image-to-image
translation. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 8789?8797,
2018. 3
[16] darkstorm2150.
Protogen x3.4 (photorealism) offi-
cial release, https://civitai.com/models/3666/protogen-x34-
photorealism-official-release, 2022. 8
[17] Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat gans on image synthesis. Advances in Neural Information
Processing Systems, 34:8780?8794, 2021. 3
[18] Tan M. Dinh, Anh Tuan Tran, Rang Nguyen, and Binh-Son
Hua. Hyperinverter: Improving stylegan inversion via hy-
pernetwork. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 11389?
11398, 2022. 2
[19] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming
transformers for high-resolution image synthesis. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 12873?12883, 2021. 3 , 5 , 7 , 8
[20] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin,
Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-
based text-to-image generation with human priors. In Euro-
pean Conference on Computer Vision (ECCV), pages 89?106.
Springer, 2022. 2 , 3
[21] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik,
Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An
image is worth one word: Personalizing text-to-image genera-
tion using textual inversion. arXiv preprint arXiv:2208.01618,
2022. 2 , 3
[22] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano,
Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-
guided domain adaptation of image generators. ACM Trans-
actions on Graphics (TOG), 41(4):1?13, 2022. 3
[23] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao
Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-
adapter: Better vision-language models with feature adapters.
arXiv preprint arXiv:2110.04544, 2021. 2
[24] Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun
Lee, Jingeun Lee, and Minchul Shin. Towards light-weight
and real-time line segment detection. In Proceedings of the
AAAI Conference on Artificial Intelligence, 2022. 6
[25] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks.
In International Conference on Learning Representations,
2017. 2
[26] Heathen. Hypernetwork style training, a tiny guide, stable-
diffusion-webui, https://github.com/automatic1111/stable-
diffusion-webui/discussions/2670, 2022. 2
[27] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,
Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im-
age editing with cross attention control. arXiv preprint
arXiv:2208.01626, 2022. 3
[28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern-
hard Nessler, and Sepp Hochreiter. Gans trained by a two
time-scale update rule converge to a local nash equilibrium.
In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fer-
gus, S. Vishwanathan, and R. Garnett, editors, Advances in
Neural Information Processing Systems, volume 30. Curran
Associates, Inc., 2017. 8
[29] Jonathan Ho and Tim Salimans. Classifier-free diffusion
guidance, 2022. 5
[30] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna
Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona
Attariyan, and Sylvain Gelly. Parameter-efficient transfer
learning for nlp. In International Conference on Machine
Learning, pages 2790?2799, 2019. 2
[31] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu,
Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora:
Low-rank adaptation of large language models. arXiv preprint
arXiv:2106.09685, 2021. 2
[32] Lianghua Huang, Di Chen, Yu Liu, Shen Yujun, Deli Zhao,
and Zhou Jingren. Composer: Creative and controllable
image synthesis with composable conditions. 2023. 3
[33] Nisha Huang, Fan Tang, Weiming Dong, Tong-Yee Lee, and
Changsheng Xu. Region-aware diffusion for zero-shot text-
driven image editing. arXiv preprint arXiv:2302.11797, 2023.
3
[34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros.
Image-to-image translation with conditional adversarial net-
works. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 1125?1134, 2017. 1 , 3
[35] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita
Orlov, and Humphrey Shi. OneFormer: One Transformer to
Rule Universal Image Segmentation. 2023. 7
[36] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
Progressive growing of gans for improved quality, stability,
and variation. International Conference on Learning Repre-
sentations, 2018. 3
[37] Tero Karras, Samuli Laine, and Timo Aila. A style-based
generator architecture for generative adversarial networks.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 4401?4410, 2019. 3
[38] Tero Karras, Samuli Laine, and Timo Aila. A style-based
generator architecture for generative adversarial networks.
IEEE Transactions on Pattern Analysis, 2021. 3
[39] Oren Katzir, Vicky Perepelook, Dani Lischinski, and Daniel
Cohen-Or. Multi-level latent space structuring for generative
control. arXiv preprint arXiv:2202.05910, 2022. 3
[40] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen
Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic:
Text-based real image editing with diffusion models. arXiv
preprint arXiv:2210.09276, 2022. 3
[41] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Dif-
fusionclip: Text-guided diffusion models for robust image
manipulation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 2426?
2435, 2022. 3
[42] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho.
Variational diffusion models. Advances in Neural Information
Processing Systems, 34:21696?21707, 2021. 3
[43] Kurumuz. Novelai improvements on stable diffusion,
https://blog.novelai.net/novelai-improvements-on-stable-
diffusion-e10d38db82ac, 2022. 2
[44] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep
learning. Nature, 521(7553):436?444, May 2015. 3
[45] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-
based learning applied to document recognition. Proceedings
of the IEEE, 86(11):2278?2324, 1998. 3
[46] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli
Laine, Tero Karras, Miika Aittala, and Timo Aila.
Noise2noise: Learning image restoration without clean data.
Proceedings of the 35th International Conference on Machine
Learning, 2018. 3
[47] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason
Yosinski. Measuring the intrinsic dimension of objective
landscapes. International Conference on Learning Represen-
tations, 2018. 3
[48] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian-
wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee.
Gligen: Open-set grounded text-to-image generation. 2023.
3
[49] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He.
Exploring plain vision transformer backbones for object de-
tection. arXiv preprint arXiv:2203.16527, 2022. 2
[50] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaim-
ing He, and Ross Girshick. Benchmarking detection
transfer learning with vision transformers. arXiv preprint
arXiv:2111.11429, 2021. 2
[51] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggy-
back: Adapting a single network to multiple tasks by learning
to mask weights. In European Conference on Computer Vi-
sion (ECCV), pages 67?82, 2018. 2
[52] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multi-
ple tasks to a single network by iterative pruning. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 7765?7773, 2018. 2
[53] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun
Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image
synthesis and editing with stochastic differential equations. In
International Conference on Learning Representations, 2021.
3
[54] Midjourney. https://www.midjourney.com/, 2023. 1 , 3
[55] Ron Mokady, Omer Tov, Michal Yarom, Oran Lang, Inbar
Mosseri, Tali Dekel, Daniel Cohen-Or, and Michal Irani. Self-
distilled stylegan: Towards generation from internet photos.
In ACM SIGGRAPH 2022 Conference Proceedings, pages
1?9, 2022. 3
[56] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhon-
gang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning
adapters to dig out more controllable ability for text-to-image
diffusion models. arXiv preprint arXiv:2302.08453, 2023. 2 ,
3
[57] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav
Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and
Mark Chen. Glide: Towards photorealistic image generation
and editing with text-guided diffusion models. 2022. 3
[58] Alexander Quinn Nichol and Prafulla Dhariwal. Improved
denoising diffusion probabilistic models. In International
Conference on Machine Learning, pages 8162?8171. PMLR,
2021. 3
[59] Yotam Nitzan, Kfir Aberman, Qiurui He, Orly Liba, Michal
Yarom, Yossi Gandelsman, Inbar Mosseri, Yael Pritch, and
Daniel Cohen-Or. Mystyle: A personalized generative prior.
arXiv preprint arXiv:2203.17272, 2022. 3
[60] ogkalu. Comic-diffusion v2, trained on 6 styles at once,
https://huggingface.co/ogkalu/comic-diffusion, 2022. 8
[61] OpenAI. Dall-e-2, https://openai.com/product/dall-e-2, 2023.
1 , 3
[62] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan
Zhu. Semantic image synthesis with spatially-adaptive nor-
malization. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 2337?2346,
2019. 3
[63] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun
Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image
translation. arXiv preprint arXiv:2302.03027, 2023. 3
[64] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or,
and Dani Lischinski. Styleclip: Text-driven manipulation
of stylegan imagery. In Proceedings of the IEEE/CVF In-
ternational Conference on Computer Vision (ICCV), pages
2085?2094, October 2021. 3
[65] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
transferable visual models from natural language supervision.
In International Conference on Machine Learning, pages
8748?8763. PMLR, 2021. 2 , 3 , 4 , 8
[66] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,
and Mark Chen. Hierarchical text-conditional image genera-
tion with clip latents. arXiv preprint arXiv:2204.06125, 2022.
3
[67] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray,
Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.
Zero-shot text-to-image generation. In International Confer-
ence on Machine Learning, pages 8821?8831. PMLR, 2021.
3
[68] Rene Ranftl, Katrin Lasinger, David Hafner, Konrad
Schindler, and Vladlen Koltun. Towards robust monocular
depth estimation: Mixing datasets for zero-shot cross-dataset
transfer. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 44(3):1623?1637, 2020. 6
[69] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi.
Efficient parametrization of multi-domain deep neural net-
works. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 8119?8127, 2018.
2
[70] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan,
Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding
in style: a stylegan encoder for image-to-image translation.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2021. 3
[71] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Bjorn Ommer. High-resolution image
synthesis with latent diffusion models. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 10684?10695, 2022. 1 , 2 , 3 , 4 , 5 , 7
[72] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net:
Convolutional networks for biomedical image segmentation.
In Medical Image Computing and Computer-Assisted Inter-
vention MICCAI International Conference, pages 234?241,
2015. 4
[73] Amir Rosenfeld and John K Tsotsos. Incremental learning
through deep adaptation. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 42(3):651?663, 2018. 2
[74] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch,
Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine
tuning text-to-image diffusion models for subject-driven gen-
eration. arXiv preprint arXiv:2208.12242, 2022. 2 , 3
[75] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J.
Williams. Learning representations by back-propagating er-
rors. Nature, 323(6088):533?536, Oct. 1986. 3
[76] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee,
Jonathan Ho, Tim Salimans, David Fleet, and Mohammad
Norouzi. Palette: Image-to-image diffusion models. In ACM
SIGGRAPH 2022 Conference Proceedings, SIGGRAPH ’22,
New York, NY, USA, 2022. Association for Computing Ma-
chinery. 3
[77] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay
Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour,
Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes,
et al. Photorealistic text-to-image diffusion models with deep
language understanding. arXiv preprint arXiv:2205.11487,
2022. 3
[78] Christoph Schuhmann, Romain Beaumont, Richard Vencu,
Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo
Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts-
man, Patrick Schramowski, Srivatsa R Kundurthy, Katherine
Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia
Jitsev. LAION-5b: An open large-scale dataset for training
next generation image-text models. In Thirty-sixth Confer-
ence on Neural Information Processing Systems Datasets and
Benchmarks Track, 2022. 2 , 8
[79] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karat-
zoglou. Overcoming catastrophic forgetting with hard atten-
tion to the task. In International Conference on Machine
Learning, pages 4548?4557. PMLR, 2018. 2
[80] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In International Confer-
ence on Machine Learning, pages 2256?2265. PMLR, 2015.
3
[81] Stability.
Stable diffusion v1.5 model card,
https://huggingface.co/runwayml/stable-diffusion-v1-5,
2022. 2 , 3
[82] Stability. Stable diffusion v2 model card, stable-diffusion-
2-depth, https://huggingface.co/stabilityai/stable-diffusion-2-
depth, 2022. 3 , 7
[83] Asa Cooper Stickland and Iain Murray. Bert and pals: Pro-
jected attention layers for efficient adaptation in multi-task
learning. In International Conference on Machine Learning,
pages 5986?5995, 2019. 2
[84] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter:
Parameter-efficient transfer learning for vision-and-language
tasks. arXiv preprint arXiv:2112.06825, 2021. 2
[85] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel.
Plug-and-play diffusion features for text-driven image-to-
image translation. arXiv preprint arXiv:2211.12572, 2022.
3
[86] Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo,
Haochen Wang, Falcon Z Dai, Andrea F Daniele, Moham-
madreza Mostajabi, Steven Basart, Matthew R Walter, et al.
Diode: A dense indoor and outdoor depth dataset. arXiv
preprint arXiv:1908.00463, 2019. 6
[87] Andrey Voynov, Kfir Abernan, and Daniel Cohen-Or. Sketch-
guided text-to-image diffusion models. 2022. 3 , 6 , 7 , 8
[88] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong
Chen, Qifeng Chen, and Fang Wen. Pretraining is all you
need for image-to-image translation. 2022. 3 , 6 , 7 , 8
[89] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao,
Jan Kautz, and Bryan Catanzaro. High-resolution image
synthesis and semantic manipulation with conditional gans.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 8798?8807, 2018. 3
[90] Saining Xie and Zhuowen Tu. Holistically-nested edge detec-
tion. In Proceedings of the IEEE International Conference
on Computer Vision (ICCV), pages 1395?1403, 2015. 6
[91] Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas J.
Guibas, and Jitendra Malik. Side-tuning: Network adapta-
tion via additive side networks. In European Conference on
Computer Vision (ECCV), pages 698?714. Springer, 2020. 2
[92] Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, and Fang Wen.
Cross-domain correspondence learning for exemplar-based
image translation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
5143?5153, 2020. 3
[93] Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kun-
chang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-
adapter: Training-free clip-adapter for better vision-language
modeling. arXiv preprint arXiv:2111.03930, 2021. 2
[94] Jiawei Zhao, Florian Schafer, and Anima Anandkumar. Zero
initialization: Initializing residual networks with only zeros
and ones. arXiv, 2021. 3
[95] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Bar-
riuso, and Antonio Torralba. Scene parsing through ade20k
dataset. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 633?641, 2017. 6 , 7
[96] Xingran Zhou, Bo Zhang, Ting Zhang, Pan Zhang, Jianmin
Bao, Dong Chen, Zhongfei Zhang, and Fang Wen. Cocos-
net v2: Full-resolution correspondence learning for image
translation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 11465?
11475, 2021. 3
[97] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros.
Unpaired image-to-image translation using cycle-consistent
adversarial networks. In Computer Vision (ICCV), 2017 IEEE
International Conference on, 2017. 1 , 3
[98] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell,
Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward
multimodal image-to-image translation. Advances in Neural
Information Processing Systems, 30, 2017. 3