MovieNet dataset in detail

0. Preface

  • Relevant information
  • One-sentence summary: A movie-based video understanding dataset that includes character bboxes/IDs, scene boundaries, shot types, and place/action tags for each scene, among other annotations.
  • Access: Download directly from the official website; there are no hurdles.

1. Basic information

  • How a movie is broken down, level by level
    • Movie structure: frame -> shot -> thread -> scene -> movie
    • My very unprofessional understanding:
      • frame: nothing to explain, a single image frame.
      • shot: as I understand it, a video clip captured continuously by one camera.
        • A shot is a series of frames that runs for an uninterrupted period of time. It is also the minimal visual unit of a movie. A movie usually contains hundreds of shots.
      • scene: as I understand it, a video clip made up of several shots filmed in one place.
        • A scene is a sequence of consecutive shots that are semantically related. Usually a scene tells about one event in the movie. A movie contains tens of scenes.
      • thread: I don't really understand what this means.
        • A thread shows the pattern of the shot arrangement in a scene. Note that not all scenes contain threads.
        • Take a typical dialog scene as an example. Suppose there are two persons, A and B, in the scene; they are shown alternately, a pattern that can be written as "ABABAB…". So there are two threads in this dialog scene, namely A and B. Capturing the hierarchical structure of a movie is important for movie understanding.
  • Annotation categories provided
    • Person annotations:
      • 1.3M person bboxes hand-annotated in 758K images from more than 300 movies.
      • Character identities are annotated for 573 movies. For movies without manual bboxes, a SOTA person detector is used instead. To reduce the workload, only the top 10 cast members listed on IMDb for each movie are considered. In the end, 763K samples belonging to 3,087 credited cast members and 364K other samples were obtained.
    • Scene boundaries:
      • The temporal segmentation of a movie into scenes.
      • There are 42K scenes in total.
    • Place/action labels:
      • The actions and places of each scene are labeled manually.
      • Each scene may carry multiple place tags.
      • For action labels, the scene is first divided into sub-clips, and each sub-clip is then marked with multiple action labels.
      • To make the labels more diverse and informative, annotators were encouraged to create new tags, and actions that contribute little to story understanding (such as standing and talking) were removed. In the end, 80 action classes and 90 place classes were settled on.
      • In total, 19.6K place tags, 41.3K action clips, and 45K action tags were obtained.
    • Description Alignment
    • Cinematic style
      • Annotated along two dimensions:
      • View scale: long shot, full shot, medium shot, close-up shot, and extreme close-up shot
      • Camera movement: static shot, pans and tilts shot, zoom in, and zoom out
  • Provided data
    • ID: the movie's IMDb ID; TMDb and Douban IDs are also provided.
    • Movie: the movies themselves. There are 1,100 movies at 720p with a 16:9 aspect ratio, possibly with black borders. For copyright reasons only key frames are released; since adjacent frames are very similar, key frames are sufficient. Also to avoid copyright issues, audio is provided only as features computed at a 16 kHz sampling rate with a window length of 512.
    • Trailer: promotional previews, 33K distinct trailers in total, likewise provided as key frames with the corresponding audio features.
    • Subtitle: English subtitles, either extracted from the embedded track or downloaded from YIFY.
    • Script: the movie script.
    • Synopsis: plot synopses written by viewers, obtained from IMDb.
    • Meta data: metadata for each movie.
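
To make the thread idea from the movie-structure notes above a bit more concrete, here is a minimal Python sketch. It assumes each shot in a scene has already been given a thread label such as "A" or "B"; both the labels and the function name are hypothetical and only illustrate the "ABABAB…" pattern, not how MovieNet itself stores or computes threads.

```python
from collections import defaultdict


def group_shots_into_threads(shot_labels):
    """Map each thread label to the indices of the shots that carry it."""
    threads = defaultdict(list)
    for shot_idx, label in enumerate(shot_labels):
        threads[label].append(shot_idx)
    return dict(threads)


# A dialog scene with six shots alternating between persons A and B.
dialog_scene = list("ABABAB")
threads = group_shots_into_threads(dialog_scene)
print(threads)  # {'A': [0, 2, 4], 'B': [1, 3, 5]}
```

The two keys of the result correspond to the two threads of the dialog scene; a scene whose shots all carry distinct labels would have no repeating pattern, which matches the remark that not every scene contains threads.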

2. Details

2.1. Label details

  • All annotations are JSON files, named after the movie's IMDb ID.

  • Each annotation file is a single dictionary with the following keys

    • imdb_id: the IMDb ID of the movie
    • cast: person annotations, including each bbox and the corresponding pid (person ID)
    • scene: scene information, including the shot range, frame range, place tags, and action tags of each scene
    • story: story segments (hard to translate). Each has an id, shot range, frame range, time range, consistency (I don't know what this is), a text description, and subtitles
    • cinematic_style: shot classification, i.e. the scale and movement of each shot, plus trailer information.
  • An example annotation:

{
  "imdb_id": "tt1210166",
  "cast": [
    {
      "id": "tt1210166_000001",
      "frame_idx": null,
      "resolution": [
        1280,
        694
      ],
      "shot_idx": 1,
      "img_idx": 0,
      "body": {
        "type": "detected",
        "bbox": [
          22,
          27,
          1148,
          675
        ]
      },
      "pid": "others",
      "possible_pids": [
        "others"
      ]
    },
    ...
  ],
  "scene": [
    {
      "id": "tt1210166_0000",
      "shot": [
        0,
        1
      ],
      "frame": [
        0,
        841
      ],
      "place_tag": null,
      "action_tag": null
    },
    ...
  ],
  "story": [
    {
      "id": "tt1210166_0000",
      "shot": [
        60,
        424
      ],
      "frame": [
        6257,
        44851
      ],
      "duration": [
        260.97997833333335,
        1870.6211273333333
      ],
      "consistency": 0.963081028938084,
      "description": "Oakland Athletics general manager Billy Beane is upset by his team's loss to the New York Yankees in the 2001 postseason ...",
      "subtitle": [
        {
          "shot": 60,
          "duration": [
            260.26,
            262.51225
          ],
          "sentences": [
            "You gotta give the Yankees--"
          ]
        },
        ...
      ]
    },
    ...
  ],
  "cinematic_style": {
    "movie": [
      {
        "shot": 1,
        "scale": "closeup",
        "movement": "static"
      },
      {
        "shot": 2,
        "scale": "full",
        "movement": "static"
      },
      {
        "shot": 3,
        "scale": "closeup",
        "movement": "moving"
      },
      ...
    ],
    "trailer": null
  }
}
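
As a rough illustration of how such a file could be read, the Python sketch below parses a trimmed copy of the example above and pulls a few numbers out of it. Several things here are assumptions rather than documented facts: that frame ranges can be subtracted to get a span, that `duration` is in seconds, and that null fields need guarding. The variable names are illustrative only.

```python
import json
from collections import Counter

# Trimmed copy of the example annotation; a real file has many more entries.
SAMPLE = """
{
  "imdb_id": "tt1210166",
  "cast": [
    {"id": "tt1210166_000001", "shot_idx": 1, "img_idx": 0,
     "body": {"type": "detected", "bbox": [22, 27, 1148, 675]},
     "pid": "others"}
  ],
  "scene": [
    {"id": "tt1210166_0000", "shot": [0, 1], "frame": [0, 841],
     "place_tag": null, "action_tag": null}
  ],
  "story": [
    {"id": "tt1210166_0000", "frame": [6257, 44851],
     "duration": [260.97997833333335, 1870.6211273333333]}
  ],
  "cinematic_style": {
    "movie": [
      {"shot": 1, "scale": "closeup", "movement": "static"},
      {"shot": 2, "scale": "full", "movement": "static"},
      {"shot": 3, "scale": "closeup", "movement": "moving"}
    ],
    "trailer": null
  }
}
"""

anno = json.loads(SAMPLE)

# Number of cast boxes per person ID.
boxes_per_pid = Counter(item["pid"] for item in anno["cast"])

# Frame span of each scene (end minus start; whether the end is inclusive
# is an assumption, not something the example makes clear).
scene_spans = {s["id"]: s["frame"][1] - s["frame"][0] for s in anno["scene"]}

# Frame rate implied by one story segment, assuming "duration" is in seconds.
seg = anno["story"][0]
fps = (seg["frame"][1] - seg["frame"][0]) / (seg["duration"][1] - seg["duration"][0])

# Tally of shot scales; the list may be null in some files, so guard for it.
movie_shots = anno["cinematic_style"]["movie"] or []
scale_counts = Counter(shot["scale"] for shot in movie_shots)

print(boxes_per_pid, scene_spans, round(fps, 3), scale_counts)
```

For a real annotation file, `json.loads(SAMPLE)` would be replaced by `json.load` on an opened file whose name is the IMDb ID. The implied frame rate of about 23.98 suggests the frame indices and durations in this example refer to the same timeline, though that is an inference from one sample.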

2.2. Toolkit introduction

  • The toolkit homepage says it has four parts, but only one of them is currently available

  • It ships a lot of ready-made tools, built on top of some existing libraries


Origin blog.csdn.net/irving512/article/details/112503571