[Computer Vision | Object Detection] Grounding DINO deep learning environment configuration (with examples)

Official PyTorch implementation of "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection": the SoTA open-set object detector.

1. Helpful Tutorial

Paper address:

https://arxiv.org/abs/2303.05499

Watch the introductory video on YouTube:

https://www.youtube.com/watch?v=wxWDt5UiwY8&feature=youtu.be

Try the Colab Demo:

https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb

Try Official Huggingface Demo:

https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo

2. Related work

2.1 Related projects


  • Grounded-SAM: Marrying Grounding DINO with Segment Anything
  • Grounding DINO with Stable Diffusion
  • Grounding DINO with GLIGEN for Controllable Image Editing
  • OpenSeeD: A Simple and Strong Openset Segmentation Model
  • SEEM: Segment Everything Everywhere All at Once
  • X-GPT: Conversational Visual Agent supported by X-Decoder
  • GLIGEN: Open-Set Grounded Text-to-Image Generation
  • LLaVA: Large Language and Vision Assistant

2.2 Highlights of the paper

Highlights of this work:

  1. Open-Set Detection. Detect everything with language!
  2. High Performance. COCO zero-shot 52.5 AP (trained without any COCO data!). COCO fine-tuned 63.0 AP.
  3. Flexible. Collaborates with Stable Diffusion for image editing.

2.3 Introduction to the paper

(figure: overview of the paper, omitted)

2.4 Marrying Grounding DINO and GLIGEN

(figure: Grounding DINO combined with GLIGEN, omitted)

2.5 Notes/tips on inputs and outputs

  • Grounding DINO accepts an (image, text) pair as input.
  • It outputs 900 object boxes by default. Each box has similarity scores across all input words.
  • By default, we keep the boxes whose highest similarity is above a box_threshold.
  • We extract the words whose similarities are above the text_threshold as the predicted labels (see the sketch after this list).
  • If you want objects matching a specific phrase, like the dogs in the sentence "two dogs with a stick.", you can select the boxes with the highest text similarity to "dogs" as the final outputs.
  • Note that each word can be split into more than one token by different tokenizers, so the number of words in a sentence may not equal the number of text tokens.
  • We suggest separating different category names with . for Grounding DINO.
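To make the two thresholds concrete, here is a minimal sketch of the filtering described above. The similarity matrix, token list, and filter_outputs helper are illustrative assumptions for this post, not part of the GroundingDINO API:

import torch

# Hypothetical helper: filter a (num_boxes, num_words) similarity matrix.
def filter_outputs(similarity, tokens, box_threshold=0.35, text_threshold=0.25):
    # Keep boxes whose best word similarity exceeds box_threshold.
    keep = similarity.max(dim=1).values > box_threshold
    # Build each kept box's label from the words above text_threshold.
    labels = [
        " ".join(tok for tok, s in zip(tokens, row) if s > text_threshold)
        for row in similarity[keep]
    ]
    return keep.nonzero(as_tuple=True)[0], labels

# Toy example: 3 candidate boxes scored against 4 words.
sim = torch.tensor([[0.70, 0.10, 0.05, 0.20],
                    [0.10, 0.20, 0.10, 0.10],
                    [0.20, 0.60, 0.40, 0.10]])
idx, labels = filter_outputs(sim, ["two", "dogs", "with", "stick"])
print(idx.tolist(), labels)  # [0, 2] ['two', 'dogs with']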

3. Environment configuration process

3.1 My environment

System: Ubuntu (a recent release)

GPU: NVIDIA RTX 3090

CUDA: 11.3

If you have a CUDA environment, make sure the environment variable CUDA_HOME is set; if no CUDA is available, GroundingDINO will be compiled in CPU-only mode.
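Before installing, you can quickly verify the variable; the /usr/local/cuda-11.3 path below is only an example for my CUDA 11.3 setup, so adjust it to wherever your toolkit lives:

echo $CUDA_HOME
# If it prints nothing, point CUDA_HOME at your CUDA install (example path):
export CUDA_HOME=/usr/local/cuda-11.3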

3.2 Configuration process

3.2.1 Clone the GroundingDINO repository from GitHub

git clone https://github.com/IDEA-Research/GroundingDINO.git

After cloning, you will see the GroundingDINO folder in the current directory.


3.2.2 Change the current directory to the GroundingDINO folder

cd GroundingDINO/

3.2.3 Install the required dependencies in the current directory

pip3 install -q -e .

For reasons I never tracked down, this command kept failing with errors for me, so I switched to an alternative installation method:

python setup.py install


The output may still be full of red error text!

Don't panic: whenever a dependency package fails, install it directly with pip, then re-run the installation command above; with a little patience it will eventually succeed.
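Once the installation finally succeeds, a quick import check (just a sanity check I use, not an official step) confirms the package is importable:

python -c "from groundingdino.util.inference import load_model; print('GroundingDINO OK')"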


3.2.4 Create a new directory called “weights” to store the model weights

mkdir weights

Change the current directory to the “weights” folder:

cd weights

Download the model weights file:

wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
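Since wget -q downloads silently, it is worth confirming the checkpoint actually arrived; the file should be several hundred megabytes:

ls -lh groundingdino_swint_ogc.pth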

4. Test

Check your GPU ID (only if you’re using a GPU):

nvidia-smi


Replace {GPU ID}, image_you_want_to_detect.jpg, and "dir you want to save the output" with appropriate values in the following command:

CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c /GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p /GroundingDINO/weights/groundingdino_swint_ogc.pth \
-i image_you_want_to_detect.jpg \
-o "dir you want to save the output" \
-t "chair" \
 [--cpu-only] # add this flag to run in CPU-only mode

Of course, we can also use Python for testing:

from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

# Load the SwinT config and the downloaded checkpoint.
model = load_model("./GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py", "./GroundingDINO/weights/groundingdino_swint_ogc.pth")
IMAGE_PATH = "./GroundingDINO/weights/1.png"
TEXT_PROMPT = "person . bike . bottle ."  # separate category names with "."
BOX_THRESHOLD = 0.35   # keep boxes whose best word similarity exceeds this
TEXT_THRESHOLD = 0.25  # keep words above this similarity as labels

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)

# Draw the predicted boxes and labels on the original image and save it.
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("./GroundingDINO/weights/annotated_image.jpg", annotated_frame)
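One note on the outputs: predict returns boxes as normalized (cx, cy, w, h) coordinates, which annotate scales internally. If you want pixel-space corner coordinates yourself, a small conversion sketch (my own addition, using torchvision's box_convert) looks like this:

import torch
from torchvision.ops import box_convert

h, w, _ = image_source.shape  # original image size from load_image
# Scale normalized (cx, cy, w, h) boxes to pixels, then convert to (x1, y1, x2, y2).
xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), in_fmt="cxcywh", out_fmt="xyxy").numpy()
print(xyxy)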

Our original test image: (figure omitted)

The annotated result: (figure omitted)

Source: blog.csdn.net/wzk4869/article/details/130582034