Official PyTorch implementation of "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", a SoTA open-set object detector.
1. Helpful Tutorials
Paper address:
https://arxiv.org/abs/2303.05499
Watch the introductory video on YouTube:
https://www.youtube.com/watch?v=wxWDt5UiwY8&feature=youtu.be
Try the Colab Demo:
https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb
Try the official Hugging Face demo:
https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo
2. Related work
2.1 Related papers
- Grounded-SAM: Marrying Grounding DINO with Segment Anything
- Grounding DINO with Stable Diffusion
- Grounding DINO with GLIGEN for Controllable Image Editing
- OpenSeeD: A Simple and Strong Openset Segmentation Model
- SEEM: Segment Everything Everywhere All at Once
- X-GPT: Conversational Visual Agent supported by X-Decoder
- GLIGEN: Open-Set Grounded Text-to-Image Generation
- LLaVA: Large Language and Vision Assistant
2.2 Highlights of the paper
Highlights of this work:
- Open-Set Detection. Detect everything with language!
- High Performance. COCO zero-shot 52.5 AP (trained without COCO data!). COCO fine-tune 63.0 AP.
- Flexible. Collaboration with Stable Diffusion for Image Editing.
2.3 Introduction to the paper
2.4 Marrying Grounding DINO and GLIGEN
2.5 Notes/tips on inputs and outputs
- Grounding DINO accepts an (image, text) pair as inputs.
- It outputs 900 object boxes by default. Each box has a similarity score for every input word (as shown in the figures below).
- By default, we keep the boxes whose highest similarity exceeds box_threshold.
- We extract the words whose similarities exceed text_threshold as the predicted labels.
- If you want the objects for a specific phrase, such as dogs in the sentence "two dogs with a stick.", you can select the boxes with the highest text similarity to dogs as the final outputs.
- Note that a word can be split into more than one token, depending on the tokenizer. The number of words in a sentence may not equal the number of text tokens.
- We suggest separating different category names with a period (.) for Grounding DINO.
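The selection rule described in the tips above can be sketched in plain Python. This is only an illustration under assumptions, not the library's own code: the function names select_boxes and build_prompt, the token list, and the score values are all made up for the example, and real tokenization (sub-word pieces, special tokens) is more involved.

```python
def select_boxes(scores, tokens, box_threshold=0.35, text_threshold=0.25):
    """Return (box_index, label) pairs for boxes that pass box_threshold.

    scores: one row per box, one similarity value per token (hypothetical data).
    tokens: the decoded tokens, in the same order as the columns of scores.
    """
    results = []
    for index, row in enumerate(scores):
        # Keep a box only if its best similarity exceeds box_threshold.
        if max(row) <= box_threshold:
            continue
        # The label collects every token whose similarity exceeds text_threshold.
        label = " ".join(t for t, s in zip(tokens, row) if s > text_threshold)
        results.append((index, label))
    return results


def build_prompt(categories):
    """Join category names with periods, as the last tip suggests."""
    return " . ".join(c.strip().lower() for c in categories) + " ."


tokens = ["two", "dogs", "with", "a", "stick"]
scores = [
    [0.10, 0.80, 0.05, 0.02, 0.03],  # best match is "dogs"
    [0.05, 0.10, 0.04, 0.30, 0.60],  # best match is "stick"; "a" also passes
    [0.10, 0.20, 0.10, 0.10, 0.10],  # best score is below box_threshold
]
print(select_boxes(scores, tokens))   # [(0, 'dogs'), (1, 'a stick')]
print(build_prompt(["person", "bike", "bottle"]))  # person . bike . bottle .
```

Raising box_threshold drops more boxes; raising text_threshold shortens the labels of the boxes that remain.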
3. Environment configuration process
3.1 My environment
System: Ubuntu (latest release)
Graphics card: RTX 3090
CUDA: 11.3
If you have a CUDA environment, make sure the environment variable CUDA_HOME is set. If no CUDA is available, it will compile in CPU-only mode.
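Before building, it can help to check what the build will see. The snippet below is a small sketch using only the Python standard library; the helper name describe_cuda_setup is hypothetical, and looking for nvcc on PATH is just a convenient hint, not something the installer is known to do.

```python
import os
import shutil


def describe_cuda_setup(env=None):
    """Report whether CUDA_HOME is set and whether nvcc can be found on PATH."""
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if cuda_home and os.path.isdir(cuda_home):
        return f"CUDA_HOME is set to {cuda_home}"
    if cuda_home:
        return f"CUDA_HOME points to a missing directory: {cuda_home}"
    # CUDA_HOME is unset: look for the CUDA compiler as a hint.
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    if nvcc:
        return f"CUDA_HOME is not set, but nvcc was found at {nvcc}"
    return "CUDA_HOME is not set and nvcc was not found; expect a CPU-only build"


print(describe_cuda_setup())
```

If the message says CUDA_HOME is not set but nvcc was found, exporting CUDA_HOME to the CUDA installation directory before installing is the usual fix.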
3.2 Configuration process
3.2.1 Clone the GroundingDINO repository from GitHub
git clone https://github.com/IDEA-Research/GroundingDINO.git
After cloning, a GroundingDINO folder appears in the current directory.
3.2.2 Change the current directory to the GroundingDINO folder
cd GroundingDINO/
3.2.3 Install the required dependencies in the current directory
pip3 install -q -e .
For reasons I could not identify, this command kept failing for me, so I switched to a different install method:
python setup.py install
This may also print errors in red.
Don't panic if that happens. When a package fails to install, install it directly with pip, then rerun the installation command above. With a little patience, the installation will succeed.
3.2.4 Create a new directory called “weights” to store the model weights
mkdir weights
Change the current directory to the “weights” folder:
cd weights
Download the model weights file:
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
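If wget is not available, the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name ensure_weights is hypothetical, and it skips the download when a non-empty file is already present.

```python
import urllib.request
from pathlib import Path

WEIGHTS_URL = (
    "https://github.com/IDEA-Research/GroundingDINO/releases/download/"
    "v0.1.0-alpha/groundingdino_swint_ogc.pth"
)


def ensure_weights(url=WEIGHTS_URL, weights_dir="weights"):
    """Download the checkpoint into weights_dir unless it is already there."""
    target = Path(weights_dir) / url.rsplit("/", 1)[-1]
    if target.exists() and target.stat().st_size > 0:
        return target  # already downloaded, nothing to do
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(target))
    return target
```

Calling ensure_weights() from the GroundingDINO folder returns the path to weights/groundingdino_swint_ogc.pth, downloading the file first if needed.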
4. Test
Check your GPU ID (only if you’re using a GPU):
nvidia-smi
Replace {GPU ID}, image_you_want_to_detect.jpg, and "dir you want to save the output" with appropriate values in the following command:
CUDA_VISIBLE_DEVICES={GPU ID} python demo/inference_on_a_image.py \
-c /GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
-p /GroundingDINO/weights/groundingdino_swint_ogc.pth \
-i image_you_want_to_detect.jpg \
-o "dir you want to save the output" \
-t "chair"
# Optional: append --cpu-only to the command above to run in CPU mode.
You can also run inference from Python:
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

# Load the model from its config file and checkpoint.
model = load_model("./GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py", "./GroundingDINO/weights/groundingdino_swint_ogc.pth")
IMAGE_PATH = "./GroundingDINO/weights/1.png"
TEXT_PROMPT = "person . bike . bottle ."  # category names separated by periods
BOX_THRESHOLD = 0.35   # a box is kept if its highest similarity exceeds this
TEXT_THRESHOLD = 0.25  # a word enters the label if its similarity exceeds this

# load_image returns the original image and the preprocessed model input.
image_source, image = load_image(IMAGE_PATH)
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)

# Draw the predicted boxes and labels on the original image, then save it.
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("./GroundingDINO/weights/annotated_image.jpg", annotated_frame)
The original test image:
The annotated result:
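To use the boxes returned by predict for something other than drawing, for example cropping, they usually need to be in pixel coordinates. The sketch below assumes the boxes are normalized (cx, cy, w, h) values in the range 0 to 1, which is how the annotate helper appears to treat them; verify this against your installed version. The function name cxcywh_to_xyxy_pixels is hypothetical, and the code is plain Python so it runs without the model.

```python
def cxcywh_to_xyxy_pixels(boxes, image_width, image_height):
    """Convert normalized (cx, cy, w, h) boxes to pixel (x1, y1, x2, y2)."""
    converted = []
    for cx, cy, w, h in boxes:
        x1 = (cx - w / 2) * image_width
        y1 = (cy - h / 2) * image_height
        x2 = (cx + w / 2) * image_width
        y2 = (cy + h / 2) * image_height
        converted.append((x1, y1, x2, y2))
    return converted


# One box centered in a 640x480 image, a quarter as wide and half as tall.
print(cxcywh_to_xyxy_pixels([(0.5, 0.5, 0.25, 0.5)], image_width=640, image_height=480))
# [(240.0, 120.0, 400.0, 360.0)]
```

With the variables from the inference script, and assuming image_source is a height-by-width-by-channels array, the call would look like cxcywh_to_xyxy_pixels(boxes.tolist(), image_source.shape[1], image_source.shape[0]).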