The convention in BERT is:
(a) For sequence pairs:
tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
(b) For single sequences:
tokens: [CLS] the dog is hairy . [SEP]
type_ids: 0 0 0 0 0 0 0
Where "type_ids" are used to indicate whether this is the first sequence or the second sequence.
The embedding vectors for `type=0` and `type=1` were learned during pre-training and are added to the wordpiece
embedding vector (and position vector).
This is not *strictly* necessary since the [SEP] token unambiguously separates the sequences, but it makes
it easier for the model to learn the concept of sequences.
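As an illustrative sketch of how the three embeddings combine (array names and sizes here are
assumptions, not the real model variables), each input position is the element-wise sum of three
learned tables:

import numpy as np

hidden = 768
word_emb = np.random.randn(30522, hidden)  # learned WordPiece embeddings (vocab size 30522)
pos_emb = np.random.randn(512, hidden)     # learned position embeddings (max length 512)
type_emb = np.random.randn(2, hidden)      # learned segment embeddings for type 0 and type 1

def embed(token_ids, type_ids):
    # Each position i gets word_emb[token_ids[i]] + pos_emb[i] + type_emb[type_ids[i]].
    positions = np.arange(len(token_ids))
    return word_emb[token_ids] + pos_emb[positions] + type_emb[type_ids]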
For classification tasks, the first vector (corresponding to [CLS]) is used as the "sentence vector".
Note that this only makes sense because the entire model is fine-tuned.
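For example, a minimal classification head might look like the following sketch
(`sequence_output`, `W`, and `b` are hypothetical names); the projection weights are fine-tuned
together with the rest of the encoder:

import numpy as np

def classify(sequence_output, W, b):
    # sequence_output: [batch, seq_len, hidden]; position 0 is the [CLS] token.
    cls_vector = sequence_output[:, 0, :]  # [batch, hidden] "sentence vector"
    return cls_vector @ W + b              # logits, shape [batch, num_labels]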