DINOv2 for Semantic Segmentation

Training semantic segmentation models is often time-consuming and compute-intensive. However, with the powerful self-supervised DINOv2 backbones, we can drastically reduce the training compute and time. Using DINOv2, we can simply add a semantic segmentation head on top of the pretrained backbone and train only a few thousand parameters for good performance. This is exactly what we are going to cover in this article. We will modify the DINOv2 backbone, add a simple pixel classifier on top of it, and train DINOv2 for semantic segmentation.

Figure 1. Result from training a DINOv2 model for semantic segmentation.

What are we going to cover while training DINOv2 for semantic segmentation?

We will primarily answer the following questions.

  • Why do we need this article?
  • How do we modify the DINOv2 backbone for semantic segmentation?
  • What is the process for training the model?
  • What results can we expect after training DINOv2 for semantic segmentation?
  • What are the limitations of our approach?

Note: This is not the cleanest possible code for converting a DINOv2 backbone into a semantic segmentation model. It is an early-phase codebase for a larger project that we will work on at DebuggerCafe very soon. A lot of “patching” has gone into the modeling part, and a few questions remain to be answered.

Why Do We Need This Article?

The official DINOv2 repository provides two notebooks for modified DINOv2 models, one for semantic segmentation and another for depth estimation. We will keep our focus entirely on semantic segmentation here.

The repository only provides inference code for semantic segmentation with MMSegmentation integration. As it stands, it is quite difficult to set up the repository with the MMSegmentation requirements because of several dependency issues. Furthermore, it is not clear how to train the model for semantic segmentation and what hyperparameters to choose. Additionally, the repository provides segmentation heads trained on Pascal VOC and ADE20K, along with a completely re-trained DINOv2 model that uses a Mask2Former segmentation head.

To mitigate some of the above issues, this article aims to completely discard the MMSegmentation dependency. We rely on pure PyTorch backbone code from the DINOv2 repository while adding the code needed to build the semantic segmentation version of it.

Because the project is in its early stages, the code is not yet completely clean, as mentioned above. Some of the hyperparameters and approaches are yet to be finalized. However, with this article, the community also gets its hands on something tangible that it can play around with, suggest improvements to, and build upon.

The Human Segmentation Dataset

In this article, we will experiment with a very simple dataset, human segmentation to be precise. Our attempts at adapting DINOv2 to different tasks (starting with segmentation here) may not produce very accurate results in the first pass, but we will get there eventually.

We will train the model on the Penn-Fudan Pedestrian Segmentation dataset. It is a small dataset with only 146 training samples and 24 validation samples. This is just perfect for experimenting with such a large code change so that we can iterate quickly.

The masks contain a different label value for each person because the dataset has been adapted from an instance segmentation dataset. During training, we will convert each person’s mask to a value of 255 and the background to 0.

You should see the following structure after downloading and extracting the data from the above link.

PennFudanPed/
├── train_images
├── train_masks
├── valid_images
└── valid_masks

The following are some ground truth images and masks (pedestrian mask labels have been converted to a value of 255).

Figure 2. Ground truth segmentation maps from the Penn-Fudan Pedestrian dataset.

Project Directory Structure

Let’s take a look at the project directory structure before moving to the coding part.

├── input
│   ├── inference_data
│   └── PennFudanPed
├── outputs
│   ├── valid_preds
│   ├── best_model_iou.pth
│   └── best_model_loss.pth
├── config.py
├── datasets.py
├── engine.py
├── infer_image.py
├── infer_video.py
├── metrics.py
├── model.py
├── requirements.txt
├── temp.py
├── train.py
└── utils.py
  • The input directory contains the Penn-Fudan pedestrian segmentation dataset that we downloaded above.
  • The outputs directory will contain the training results, the trained model weights, as well as the inference results.
  • In the root project directory, we have all the Python files that we will deal with.

All the code files, inference dataset, and trained weights will be available via the “Download Code” section.

Download Code

Installing Dependencies

We can install all the necessary dependencies using the requirements.txt file.

pip install -r requirements.txt

If you face OpenCV related errors (because of Albumentations), please execute the following command after installing the above requirements.

pip uninstall opencv-python opencv-python-headless
pip install opencv-python

Semantic Segmentation using DINOv2

Starting from this section, we will cover the important coding parts of the project. As discussed earlier, a lot of code components are strictly “patched in” right now, and better coding practices exist for them. We will point these out while covering the model code.

How to Modify the DINOv2 Backbone for Semantic Segmentation?

The code for creating the DINOv2 based segmentation model resides in the model.py file. Let’s start with the import statements.

import torch
import urllib
import warnings
import torch.nn.functional as F
import torch.nn as nn

from functools import partial
from collections import OrderedDict
from torchinfo import summary

Defining the DINOv2 Backbone

The first step for creating the model is defining the model architecture and downloading the pretrained backbone.

BACKBONE_SIZE = "small" # in ("small", "base", "large" or "giant")

backbone_archs = {
    "small": "vits14",
    "base": "vitb14",
    "large": "vitl14",
    "giant": "vitg14",
}
backbone_arch = backbone_archs[BACKBONE_SIZE]
backbone_name = f"dinov2_{backbone_arch}"
    
HEAD_SCALE_COUNT = 3 # more scales: slower but better results, in (1,2,3,4,5)
HEAD_DATASET = "voc2012" # in ("ade20k", "voc2012")
HEAD_TYPE = "ms" # in ("ms, "linear")

DINOV2_BASE_URL = "https://dl.fbaipublicfiles.com/dinov2"
head_config_url = f"{DINOV2_BASE_URL}/{backbone_name}/{backbone_name}_{HEAD_DATASET}_{HEAD_TYPE}_config.py"
head_checkpoint_url = f"{DINOV2_BASE_URL}/{backbone_name}/{backbone_name}_{HEAD_DATASET}_{HEAD_TYPE}_head.pth"

backbone_model = torch.hub.load(repo_or_dir="facebookresearch/dinov2", model=backbone_name)
backbone_model.cuda()

We are choosing the small backbone from DINOv2. Each backbone name is coupled with a different ViT model, as we can see in backbone_archs. For the DINOv2-Small model, it is vits14, a small Vision Transformer that operates on 14×14 pixel patches.

From this, we define backbone_arch and backbone_name, which in turn define the head configuration and head checkpoint URLs.

We also have HEAD_SCALE_COUNT, HEAD_DATASET, and HEAD_TYPE. The HEAD_DATASET defines the dataset on which the segmentation head has been trained. In this case, it is the Pascal VOC dataset. The HEAD_TYPE is multi-scale (ms), which tells the model to aggregate features from different layers before passing them to the pixel classification layer.

At the moment, although we have defined HEAD_SCALE_COUNT, we are not using it. It is part of the MMCV configuration file, and MMSegmentation uses it automatically during training/inference. I have not yet figured out how to use it for training or inference without the MMSegmentation configuration file. One speculation is that we should feed the model different image scales for each batch so that it trains at different resolutions. However, a closer look at the internal codebase is needed for a proper implementation.

Downloading Backbone Configuration

Now coming to the messy part of the code. We need to download the configuration file to define the model backbone properly. The authors originally used this configuration for MMSegmentation training, which is the workflow that library expects.

# TODO: This part needs cleaning #
def load_config_from_url(url: str) -> str:
    with urllib.request.urlopen(url) as f:
        return f.read().decode()
    
cfg_str = load_config_from_url(head_config_url)
with open('temp.py', 'w') as f:
    f.write(cfg_str)

from temp import model as model_dict
##################################

backbone_model.forward = partial(
    backbone_model.get_intermediate_layers,
    n=model_dict['backbone']['out_indices'],
    reshape=True,
)

for name, param in backbone_model.named_parameters():
    param.requires_grad = False

We download the model configuration using the head_config_url and save it as temp.py. This file contains several dictionaries defining important training, inference, and test configurations along with model configs.

We import the model configuration dictionary as model_dict and use it to patch the backbone’s forward method with a partial of get_intermediate_layers, passing the layer indices (n) from which to extract features. For segmentation, we will concatenate features from the 4 layers defined by out_indices in the configuration (temp.py) file.

The following shows the model configuration from temp.py.

model = dict(
    type="EncoderDecoder",
    pretrained=None,
    backbone=dict(type="DinoVisionTransformer", out_indices=[8, 9, 10, 11]),
    decode_head=dict(
        type="BNHead",
        in_channels=[384, 384, 384, 384],
        in_index=[0, 1, 2, 3],
        input_transform='resize_concat',
        channels=1536,
        dropout_ratio=0,
        num_classes=21,
        norm_cfg=dict(type="SyncBN", requires_grad=True),
        align_corners=False,
        loss_decode=dict(
            type="CrossEntropyLoss", use_sigmoid=False, loss_weight=1.0)),
    test_cfg=dict(mode="slide", crop_size=(640, 640), stride=(320, 320)))

As we can see, the configuration instructs the partial method to extract features from layers 8, 9, 10, and 11. It is important to remember that this configuration is only valid for the ViT-Small model and will differ for each backbone.
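To verify that the patched forward pass behaves as expected, we can run a quick sanity check. The snippet below is a small sketch of ours (not part of model.py); it assumes the ViT-S/14 backbone on a CUDA device and a 644×644 input.

# Sanity check for the patched forward pass (sketch; assumes ViT-S/14 and CUDA).
dummy = torch.randn(1, 3, 644, 644).cuda()
with torch.no_grad():
    # The patched forward returns one feature map per entry in out_indices.
    features = backbone_model(dummy)
for feat in features:
    print(feat.shape)  # expected: torch.Size([1, 384, 46, 46]) for each of the 4 layers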

Custom Resize Function

Next, we have a custom resize function used in the model’s segmentation head.

def resize(input_data,
       size=None,
       scale_factor=None,
       mode="nearest",
       align_corners=None,
       warning=True):
    if warning:
        if size is not None and align_corners:
            input_h, input_w = tuple(int(x) for x in input_data.shape[2:])
            output_h, output_w = tuple(int(x) for x in size)
            if output_h > input_h or output_w > input_w:
                if ((output_h > 1 and output_w > 1 and input_h > 1
                     and input_w > 1) and (output_h - 1) % (input_h - 1)
                        and (output_w - 1) % (input_w - 1)):
                    warnings.warn(
                        f'When align_corners={align_corners}, '
                        'the output would more aligned if '
                        f'input size {(input_h, input_w)} is `x+1` and '
                        f'out size {(output_h, output_w)} is `nx+1`')
    return F.interpolate(input_data, size, scale_factor, mode, align_corners)

The BNHead for Pixel Classification

We use the BNHead class from the DINOv2 repository to concatenate the extracted features, define the pixel classification head for segmentation, and carry out the forward pass.

class BNHead(nn.Module):
    """Just a batchnorm."""

    def __init__(self, resize_factors=None, **kwargs):
        super().__init__(**kwargs)
        # HARDCODED IN_CHANNELS FOR NOW.
        self.in_channels = 1536
        self.bn = nn.SyncBatchNorm(self.in_channels)
        self.resize_factors = resize_factors
        self.in_index = [0, 1, 2, 3]
        self.input_transform = 'resize_concat'
        self.align_corners = False

        self.conv_seg = nn.Conv2d(self.in_channels, 21, kernel_size=1)

    def _forward_feature(self, inputs):
        """Forward function for feature maps before classifying each pixel with
        ``self.cls_seg`` fc.

        Args:
            inputs (list[Tensor]): List of multi-level img features.

        Returns:
            feats (Tensor): A tensor of shape (batch_size, self.channels,
                H, W) which is feature map for last layer of decoder head.
        """
        x = self._transform_inputs(inputs)
        feats = self.bn(x)
        return feats

    def _transform_inputs(self, inputs):
        """Transform inputs for decoder.
        Args:
            inputs (list[Tensor]): List of multi-level img features.
        Returns:
            Tensor: The transformed inputs
        """

        if self.input_transform == "resize_concat":
            # accept lists (for cls token)
            input_list = []
            for x in inputs:
                if isinstance(x, list):
                    input_list.extend(x)
                else:
                    input_list.append(x)
            inputs = input_list
            # an image descriptor can be a local descriptor with resolution 1x1
            for i, x in enumerate(inputs):
                if len(x.shape) == 2:
                    inputs[i] = x[:, :, None, None]
            # select indices
            inputs = [inputs[i] for i in self.in_index]
            # Resizing shenanigans
            # print("before", *(x.shape for x in inputs))
            if self.resize_factors is not None:
                assert len(self.resize_factors) == len(inputs), (len(self.resize_factors), len(inputs))
                inputs = [
                    resize(input_data=x, scale_factor=f, mode="bilinear" if f >= 1 else "area")
                    for x, f in zip(inputs, self.resize_factors)
                ]
                # print("after", *(x.shape for x in inputs))
            upsampled_inputs = [
                resize(input_data=x, size=inputs[0].shape[2:], mode="bilinear", align_corners=self.align_corners)
                for x in inputs
            ]
            inputs = torch.cat(upsampled_inputs, dim=1)
        elif self.input_transform == "multiple_select":
            inputs = [inputs[i] for i in self.in_index]
        else:
            inputs = inputs[self.in_index]

        return inputs

    def cls_seg(self, feat):
        """Classify each pixel."""
        output = self.conv_seg(feat)
        return output
        
    
    def forward(self, inputs):
        """Forward function."""
        output = self._forward_feature(inputs)
        output = self.cls_seg(output)
        return output

The pixel classification layer is just a Conv2d layer with 21 output channels, one per Pascal VOC class (including background). We will change the number of output channels to match our own number of classes later. The forward method transforms the multi-scale inputs (resizing and concatenating them), applies batch normalization, and passes the result through the classifier head.

One factor to note here is the in_channels value for the segmentation head, which we hardcode as 1536. This is the final number of channels after concatenating the 384-dimensional features from the 4 chosen layers. Later, it will be much better to make this calculation dynamic, as sketched below.
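For reference, one way to make this dynamic is to derive the value from the backbone itself. This is a minimal sketch on our part (it assumes the backbone exposes an embed_dim attribute, which the DINOv2 ViT models do):

# Derive in_channels instead of hardcoding 1536 (sketch, not part of model.py).
out_indices = model_dict['backbone']['out_indices']        # [8, 9, 10, 11]
in_channels = backbone_model.embed_dim * len(out_indices)  # 384 * 4 = 1536 for ViT-S/14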

Define the Final DINOv2 Segmentation Model

Finally, we define the DINOv2Segmentation model, wrapping the backbone and the decode_head in an nn.Sequential built from an OrderedDict.

class DINOv2Segmentation(nn.Module):
    def __init__(self):
        super(DINOv2Segmentation, self).__init__()

        self.backbone_model = backbone_model
        self.decode_head = BNHead()

        self.model = nn.Sequential(OrderedDict([
            ('backbone', self.backbone_model),
            ('decode_head', self.decode_head)
        ]))

    def forward(self, x):
        outputs = self.model(x)
        return outputs
    
if __name__ == '__main__':
    model = DINOv2Segmentation()
    summary(
        model, 
        (1, 3, 644, 644),
        col_names=('input_size', 'output_size', 'num_params'),
        row_settings=['var_names']
    )

Executing this file with the command python model.py will give the following output.

=============================================================================================================================
Layer (type (var_name))                            Input Shape               Output Shape              Param #
=============================================================================================================================
DINOv2Segmentation (DINOv2Segmentation)            [1, 3, 644, 644]          [1, 21, 46, 46]           --
├─Sequential (model)                               [1, 3, 644, 644]          [1, 21, 46, 46]           --
│    └─DinoVisionTransformer (backbone)            [1, 3, 644, 644]          [1, 384, 46, 46]          526,848
│    │    └─PatchEmbed (patch_embed)               [1, 3, 644, 644]          [1, 2116, 384]            (226,176)
│    │    └─ModuleList (blocks)                    --                        --                        (21,302,784)
│    │    └─LayerNorm (norm)                       [1, 2117, 384]            [1, 2117, 384]            (768)
│    │    └─LayerNorm (norm)                       [1, 2117, 384]            [1, 2117, 384]            (recursive)
│    │    └─LayerNorm (norm)                       [1, 2117, 384]            [1, 2117, 384]            (recursive)
│    │    └─LayerNorm (norm)                       [1, 2117, 384]            [1, 2117, 384]            (recursive)
│    └─BNHead (decode_head)                        [1, 384, 46, 46]          [1, 21, 46, 46]           --
│    │    └─SyncBatchNorm (bn)                     [1, 1536, 46, 46]         [1, 1536, 46, 46]         3,072
│    │    └─Conv2d (conv_seg)                      [1, 1536, 46, 46]         [1, 21, 46, 46]           32,277
=============================================================================================================================
Total params: 22,091,925
Trainable params: 35,349
Non-trainable params: 22,056,576
Total mult-adds (Units.MEGABYTES): 568.19
=============================================================================================================================
Input size (MB): 4.98
Forward/backward pass size (MB): 1073.41
Params size (MB): 86.26
Estimated Total Size (MB): 1164.64
=============================================================================================================================

As we are freezing the backbone, only the decode head is trainable, with 35,349 parameters for 21 classes. This will reduce even further as we will train on just two classes: background and person.

One final point to note is the image size of 644×644 that we used in the above dummy forward pass. Because the backbone creates 14×14 patches, a resolution of 640×640 throws an error, as 640 is not divisible by the patch size of 14, whereas 644 is (644 = 46×14). During training, we handle this in the dataloader image transforms so that we can still pass a 640×640 resolution in the command line arguments to avoid confusion.

Important Factors About Dataset Preparation

In this section, we will explore some important code snippets from the datasets.py file which prepares the dataset.

The first one is the normalization values. We do not have the normalization values for the LVD-142M dataset. So, we use the ImageNet normalization values.

MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

The second one is the training and validation transforms.

# TODO: Batchwise rescaling with different image ratios.
def train_transforms(img_size):
    """
    Transforms/augmentations for training images and masks.

    :param img_size: List/tuple as (width, height) for image resize.
    """
    train_image_transform = A.Compose([
        A.Resize(img_size[1], img_size[0], always_apply=True),
        A.PadIfNeeded(
            min_height=img_size[1]+4, 
            min_width=img_size[0]+4,
            position='center',
            value=0,
            mask_value=0
        ),
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
        A.Rotate(limit=25),
        A.Normalize(mean=MEAN, std=STD, max_pixel_value=255.)
    ], is_check_shapes=False)
    return train_image_transform

# TODO: Batchwise rescaling with different image ratios.
def valid_transforms(img_size):
    """
    Transforms/augmentations for validation images and masks.

    :param img_size: List/tuple as (width, height) for image resize.
    """
    valid_image_transform = A.Compose([
        A.Resize(img_size[1], img_size[0], always_apply=True),
        A.PadIfNeeded(
            min_height=img_size[1]+4, 
            min_width=img_size[0]+4,
            position='center',
            value=0,
            mask_value=0
        ),
        A.Normalize(mean=MEAN, std=STD, max_pixel_value=255.)
    ], is_check_shapes=False)
    return valid_image_transform

The following are the common transforms for training and validation:

  • First, we resize the images to the resize values passed through the command line during the training script execution. For this project, it is going to be 640×640.
  • Then we center pad each image and mask with 4 black pixels. This brings the final resolution to 644×644. The original implementation from the authors does this on the fly and is part of the model pipeline. Furthermore, the authors implement dynamic padding so that the model never throws the patch division error. In our case, the 4 pixel center padding will work for 640×640 images. It is hardcoded for now. Generally, it needs to be image_size + (image_size/160). So, for 1280×1280 image size, we should have a center padding of 8 pixels.

We apply horizontal flipping, random brightness contrast, and rotation augmentations to the training set to avoid overfitting.
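The hardcoded 4-pixel padding above can also be derived programmatically by padding the resized image up to the next multiple of the 14-pixel patch size. The helper below is a sketch of that idea (an assumption on our part, not the authors' original dynamic padding code); it agrees with the image_size + (image_size/160) rule for 640 and 1280.

PATCH_SIZE = 14

def padded_size(img_size, patch_size=PATCH_SIZE):
    """Smallest size >= img_size that is divisible by the patch size."""
    return ((img_size + patch_size - 1) // patch_size) * patch_size

print(padded_size(640))   # 644 -> 4 extra pixels in total (2 per side)
print(padded_size(1280))  # 1288 -> 8 extra pixels in total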

Furthermore, the file also contains the SegmentationDataset class that prepares the custom dataset. As each person instance has a different label value, we set every pixel with a value greater than 0 to 255. The following code snippet in the __getitem__ method of the dataset class does the job.

# Make all pixel > 0 as 255.
im = mask > 0
mask[im] = 255
mask[np.logical_not(im)] = 0
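Note that before the loss computation, the 0/255 mask still has to be mapped to class indices. The line below is an illustrative sketch of that mapping (the exact helper in datasets.py may differ), assuming 0 = background and 1 = person.

# Map 0/255 mask values to class indices: 0 = background, 1 = person (sketch).
label_mask = (mask == 255).astype('int64')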

The Logits to Upsampled Pixel Map Logic

The model returns raw logits with a 46×46 spatial resolution, which is 14 times downsampled compared to the 644×644 input that we pass. To calculate the loss, and also for inference visualization, we need the upsampled logits, which are then converted into segmentation maps.

To achieve this, in the train and validate method of engine.py, we have the following logic.

upsampled_logits = nn.functional.interpolate(
    outputs, size=target.shape[-2:], 
    mode="bilinear", 
    align_corners=False
)

We apply a bilinear interpolation to resize the logits to the original input image shape. The same logic is used during inference.
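For completeness, here is a minimal sketch of how the upsampled logits could feed into the loss, assuming the masks have already been mapped to class indices and a standard nn.CrossEntropyLoss criterion (the exact criterion used in engine.py may differ).

criterion = nn.CrossEntropyLoss()

# upsampled_logits: [B, num_classes, H, W], target: [B, H, W] with class indices.
loss = criterion(upsampled_logits, target.long())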

This completes all the important code snippets that we needed to cover.

Training the DINOv2 Model for Semantic Segmentation

In the train.py script, we need to modify the number of classes for the final segmentation head after initializing the model.

model = DINOv2Segmentation()
model.decode_head.conv_seg = nn.Conv2d(1536, len(ALL_CLASSES), kernel_size=(1, 1), stride=(1, 1))

We are all set to start the training process. Execute the following command in the terminal within the parent project directory.

python train.py --lr 0.001 --batch 20 --imgsz 640 640 --epochs 65 --scheduler --scheduler-epochs 50

We use a higher learning rate than usual as we train just the new segmentation head and the backbone weights are frozen. Furthermore, this follows the same pipeline as the original configuration file. We train with 640×640 resolution images for 65 epochs and apply a learning rate scheduler to reduce the learning rate by a factor of 10 after 50 epochs.
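The optimizer and scheduler setup implied by the command above looks roughly like the following sketch. The choice of SGD with momentum is our assumption here; only the learning rate, the factor-of-10 drop, and the epoch-50 milestone come from the command line arguments.

import torch.optim as optim

# Only the decode head parameters are trainable; the backbone is frozen.
optimizer = optim.SGD(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.001,
    momentum=0.9
)
# Reduce the learning rate by a factor of 10 after epoch 50 (--scheduler-epochs 50).
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)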

The following are the truncated results.

=============================================================================================================================
Layer (type (var_name))                            Input Shape               Output Shape              Param #
=============================================================================================================================
DINOv2Segmentation (DINOv2Segmentation)            [1, 3, 644, 644]          [1, 2, 46, 46]            --
├─Sequential (model)                               [1, 3, 644, 644]          [1, 2, 46, 46]            --
│    └─DinoVisionTransformer (backbone)            [1, 3, 644, 644]          [1, 384, 46, 46]          526,848
│    │    └─PatchEmbed (patch_embed)               [1, 3, 644, 644]          [1, 2116, 384]            (226,176)
│    │    └─ModuleList (blocks)                    --                        --                        (21,302,784)
│    │    └─LayerNorm (norm)                       [1, 2117, 384]            [1, 2117, 384]            (768)
│    │    └─LayerNorm (norm)                       [1, 2117, 384]            [1, 2117, 384]            (recursive)
│    │    └─LayerNorm (norm)                       [1, 2117, 384]            [1, 2117, 384]            (recursive)
│    │    └─LayerNorm (norm)                       [1, 2117, 384]            [1, 2117, 384]            (recursive)
│    └─BNHead (decode_head)                        [1, 384, 46, 46]          [1, 2, 46, 46]            --
│    │    └─SyncBatchNorm (bn)                     [1, 1536, 46, 46]         [1, 1536, 46, 46]         3,072
│    │    └─Conv2d (conv_seg)                      [1, 1536, 46, 46]         [1, 2, 46, 46]            3,074
=============================================================================================================================
Total params: 22,062,722
Trainable params: 6,146
Non-trainable params: 22,056,576
Total mult-adds (Units.MEGABYTES): 506.39
=============================================================================================================================
Input size (MB): 4.98
Forward/backward pass size (MB): 1073.08
Params size (MB): 86.14
Estimated Total Size (MB): 1164.20
=============================================================================================================================
EPOCH: 1
Training
100%|████████████████████| 8/8 [00:10<00:00,  1.34s/it]                                                                                                                                                                                                                                   
Validating
100%|████████████████████| 2/2 [00:02<00:00,  1.15s/it]                                                                                                                                                                                                                                   

Best validation loss: 0.3740081340074539

Saving best model for epoch: 1


Best validation IoU: 0.4842417537368489

Saving best model for epoch: 1

Train Epoch Loss: 0.5106, Train Epoch PixAcc: 0.6750, Train Epoch mIOU: 0.504861
Valid Epoch Loss: 0.3740, Valid Epoch PixAcc: 0.5870 Valid Epoch mIOU: 0.484242
LR for next epoch: [0.001]
--------------------------------------------------
EPOCH: 2
Training
100%|████████████████████| 8/8 [00:10<00:00,  1.29s/it]                                                                                                                                                                                                                                   
Validating
100%|████████████████████| 2/2 [00:02<00:00,  1.13s/it]                                                                                                                                                                                                                                   

Best validation loss: 0.2752937972545624

Saving best model for epoch: 2


Best validation IoU: 0.54755631614711

Saving best model for epoch: 2

Train Epoch Loss: 0.3536, Train Epoch PixAcc: 0.7934, Train Epoch mIOU: 0.652029
Valid Epoch Loss: 0.2753, Valid Epoch PixAcc: 0.6221 Valid Epoch mIOU: 0.547556
LR for next epoch: [0.001]
--------------------------------------------------
.
.
.
--------------------------------------------------
EPOCH: 49
Training
100%|████████████████████| 8/8 [00:11<00:00,  1.40s/it]                                                                                     
Validating
100%|████████████████████| 2/2 [00:02<00:00,  1.26s/it]                                                                                     

Best validation loss: 0.07244893908500671

Saving best model for epoch: 49


Best validation IoU: 0.6051037135003753

Saving best model for epoch: 49

Train Epoch Loss: 0.1237, Train Epoch PixAcc: 0.8418, Train Epoch mIOU: 0.733629
Valid Epoch Loss: 0.0724, Valid Epoch PixAcc: 0.6472 Valid Epoch mIOU: 0.605104
LR for next epoch: [0.001]
--------------------------------------------------
.
.
.
EPOCH: 65
Training
100%|████████████████████| 8/8 [00:10<00:00,  1.37s/it]                                                                                     
Validating
100%|████████████████████| 2/2 [00:02<00:00,  1.23s/it]                                                                                     
Train Epoch Loss: 0.1205, Train Epoch PixAcc: 0.8438, Train Epoch mIOU: 0.741549
Valid Epoch Loss: 0.0727, Valid Epoch PixAcc: 0.6472 Valid Epoch mIOU: 0.604787
LR for next epoch: [0.0001]
--------------------------------------------------
TRAINING COMPLETE

We obtain the best validation IoU of 60.5% on the 49th epoch.

Let’s take a look at the pixel accuracy, mean IoU, and loss graphs.

Figure 3. Pixel accuracy graph after training the DINOv2 model for semantic segmentation on the Penn-Fudan Pedestrian dataset.
Figure 4. Loss graph after training the DINOv2 model for semantic segmentation on the Penn-Fudan Pedestrian dataset.
Figure 5. Mean IoU graph after training the DINOv2 model for semantic segmentation on the Penn-Fudan Pedestrian dataset.

The validation plots plateaued after applying the learning rate scheduler. Although it is subject to experiment, we could train the model for a few more epochs without the scheduler to obtain a slightly higher mean IoU.

Inference on Images

It is quite straightforward to run image inference. All the code for running inference on images is present in the infer_image.py file. The script accepts a path to a directory containing images, along with other arguments.

python infer_image.py --imgsz 640 640 --model outputs/best_model_iou.pth

The image size is 640×640, the same as during training, and we use the model weights saved with the highest IoU.
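Under the hood, the per-image steps look roughly like the sketch below. The file path and variable names are illustrative (the exact code lives in infer_image.py), and we reuse the valid_transforms function from datasets.py.

import cv2
import numpy as np

image = cv2.imread('input/inference_data/images/example.jpg')  # hypothetical path
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Resize + pad + normalize exactly as during validation.
transformed = valid_transforms((640, 640))(image=image)['image']
tensor = torch.tensor(transformed).permute(2, 0, 1).unsqueeze(0).float().cuda()

with torch.no_grad():
    logits = model(tensor)  # [1, 2, 46, 46]

# Upsample the logits and take the per-pixel argmax to get the segmentation map.
logits = nn.functional.interpolate(
    logits, size=tensor.shape[-2:], mode='bilinear', align_corners=False
)
seg_map = torch.argmax(logits, dim=1).squeeze(0).cpu().numpy().astype(np.uint8)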

Following are the results that we get.

Figure 6. Image inference results after fine-tuning DINOv2 for semantic segmentation.

The results are not exceptional, just good enough for a first pass of the experiment. In fact, for crowded scenes and smaller persons, the segmentation maps are not good at all.

Inference on Videos

The infer_video.py script contains code to run inference on videos with similar arguments.

python infer_video.py --imgsz 640 640 --model outputs/best_model_iou.pth --input input/inference_data/videos/video_1.mp4

The only difference is that we need to provide the path to the specific video.

Running on an RTX 3070 Ti laptop GPU, we get an average of 97 FPS. We can see that the segmentation maps often overlap when two persons are close to each other.

Here are some other results.

As we can see, the model struggles primarily with thin structures, such as the persons’ hands and legs.

Let’s take a look at a final difficult scenario.

Just like in images, for crowded and difficult scenarios, the results are not good.

Further Improvements and Analysis

From the above results, we can infer that the current approach “just works OK”. Further, resizing the low-resolution logits to the original image/frame size makes the segmentation maps bleed well beyond the object boundaries.

We can improve the results with the following changes:

  • The simplest one is training the model on more images. This will lead to a slightly higher mean IoU and somewhat better segmentation maps. However, that alone will not be enough to match the results shown in the demo notebooks of the official repository.
  • Applying varying image scale resizing during training will surely lead to better segmentation maps and will solve the issues with finer structures that the model currently faces. In fact, the authors use this approach in the original training regime. A minimal sketch of this idea follows the list.
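The following sketch shows how per-batch multi-scale training could be approximated with a custom collate function. This is our assumption of the idea, not the authors' exact pipeline; the scale choices are arbitrary, and the resized side is rounded down to a multiple of the 14-pixel patch size.

import random
import torch
import torch.nn.functional as F

def multiscale_collate(batch, scales=(0.75, 1.0, 1.25), patch_size=14):
    images = torch.stack([item[0] for item in batch])  # [B, 3, H, W]
    masks = torch.stack([item[1] for item in batch])   # [B, H, W]
    # Pick one random scale per batch and keep the size divisible by the patch size.
    scale = random.choice(scales)
    new_size = int(images.shape[-1] * scale) // patch_size * patch_size
    images = F.interpolate(images, size=(new_size, new_size), mode='bilinear', align_corners=False)
    masks = F.interpolate(masks[:, None].float(), size=(new_size, new_size), mode='nearest')[:, 0].long()
    return images, masks

# Usage (hypothetical): DataLoader(train_dataset, batch_size=20, collate_fn=multiscale_collate)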

Summary and Conclusion

In this article, we attempted to train the DINOv2 model for the semantic segmentation task. We started by modifying the DINOv2 backbone and adding a pixel classification head on top of it. Furthermore, we explored some of the important dataset preparation code snippets, along with the mask preparation and image resizing logic, training, inference, and an analysis of the results. I hope this article was worth your time.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn and Twitter.
