What are UUIDs and why are they useful?

A Universally Unique Identifier (UUID) is a specific form of identifier that can safely be considered unique for most practical purposes. The probability that two correctly generated UUIDs are the same is almost negligible, even if they were created by different parties in two different environments. That's why UUIDs are said to be universally unique.

In this article, we'll learn about the characteristics of UUIDs, how their uniqueness works, and the scenarios in which they can simplify resource identification. Although we will approach UUIDs from the general perspective of software that interacts with database records, they are broadly applicable to any use case that requires the generation of decentralized unique IDs.

What exactly is a UUID?

A UUID is simply a value that you can safely treat as unique. The risk of a collision is so low that you can reasonably choose to ignore it entirely. You may see UUIDs referred to by other names (GUID, or Globally Unique Identifier, is Microsoft's preferred term), but the meaning and effect remain the same.

A true UUID is a unique identifier that is generated and represented in a standardized format. Valid UUIDs are defined by RFC 4122 (superseded in 2024 by RFC 9562, which retains these definitions); the specification describes algorithms that can be used to generate UUIDs that remain unique across implementations without the need for a central issuing authority.

The RFC includes five different algorithms, each using a different mechanism to produce values. Here is a brief summary of the "versions" available:

Version 1 - Time Based - Combines a timestamp, a clock sequence, and a value specific to the generating device (usually its MAC address) to generate a unique output for that host at that point in time.
Version 2 - DCE Security - This version was developed as an evolution of Version 1 for use in Distributed Computing Environments (DCE). It's not widely used.
Version 3 - Name Based (MD5) - Hashes a "namespace" and a "name" with MD5 to create a value that is unique within that namespace. Generating another UUID with the same namespace and name will produce the same output, so this method provides reproducible results.
Version 4 - Random - Most modern systems tend to choose UUID v4 because it uses the host's source of random or pseudo-random numbers to produce its value. The chance of generating the same UUID twice is negligible.
Version 5 - Name Based (SHA-1) - This is similar to Version 3, but it uses the stronger SHA-1 algorithm to hash the input namespace and name.

Although the RFC calls these algorithms "versions", the numbers are not a ranking: version 5 is not automatically the best choice just because it has the highest number. Which one you choose depends on your use case; in many cases, v4 is chosen because of its randomness. This makes it ideal for simple "give me a new identifier" scenarios.
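
To make the difference between the random and name-based versions concrete, here is a minimal PHP sketch of a version 4 and a version 5 generator. The function names (formatUuid, stampVersion, uuidV4, uuidV5) are illustrative and not part of any library; the code assumes PHP 7 or later for random_bytes(), and in production you would more likely rely on a maintained UUID library.

```php
<?php
// Format 16 raw bytes as the canonical 8-4-4-4-12 hexadecimal string.
function formatUuid(string $bytes): string
{
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}

// Overwrite the version nibble (byte 6) and the variant bits (byte 8).
function stampVersion(string $bytes, int $version): string
{
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | ($version << 4));
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);
    return $bytes;
}

// Version 4: 16 bytes from the system's secure random source.
function uuidV4(): string
{
    return formatUuid(stampVersion(random_bytes(16), 4));
}

// Version 5: SHA-1 over the namespace's 16 bytes followed by the name,
// truncated to the first 16 bytes of the digest.
function uuidV5(string $namespaceUuid, string $name): string
{
    $namespaceBytes = hex2bin(str_replace('-', '', $namespaceUuid));
    $hash = sha1($namespaceBytes . $name, true);
    return formatUuid(stampVersion(substr($hash, 0, 16), 5));
}

// The DNS namespace UUID predefined by the RFC.
const NAMESPACE_DNS = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';

echo uuidV4(), PHP_EOL;                            // different on every call
echo uuidV5(NAMESPACE_DNS, 'example.com'), PHP_EOL; // same on every call
```

Calling uuidV4() twice yields two unrelated values, while uuidV5() with the same namespace and name always reproduces the same identifier.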

The generation algorithms emit a 128-bit unsigned integer. However, UUIDs are more commonly written as hexadecimal strings, and they can also be stored as binary sequences of 16 bytes. Here is an example of a UUID string:

16763be4-6022-406e-a950-fcd5018633ca

The value is represented as five groups of hexadecimal digits (8, 4, 4, 4, and 12 characters) separated by dashes. The dashes carry no information of their own, and many systems also accept the 32-digit form without them; their positions reflect the field layout of the original time-based UUID format. They also make identifiers easier for the human eye to parse.
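
As a small illustration of this layout, the PHP sketch below splits the example string into its groups and reads the version from the first digit of the third group. The helper name uuidVersion is hypothetical, and the regular expression only checks the shape of the string, not how the value was generated.

```php
<?php
// Return the version digit of a canonical UUID string, or null if malformed.
function uuidVersion(string $uuid): ?int
{
    $pattern = '/^[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f])[0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}$/i';
    if (preg_match($pattern, $uuid, $matches) !== 1) {
        return null;
    }
    return (int) hexdec($matches[1]);
}

$uuid = '16763be4-6022-406e-a950-fcd5018633ca';

// Lengths of the five dash-separated groups.
echo implode(',', array_map('strlen', explode('-', $uuid))), PHP_EOL; // 8,4,4,4,12

// The third group starts with "4", so this is a version 4 (random) UUID.
echo uuidVersion($uuid), PHP_EOL; // 4
```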

UUID use cases

The primary use case for UUIDs is the decentralized generation of unique identifiers. You can generate a UUID anywhere and safely consider it unique, whether it comes from your backend code, client device, or your database engine.

UUIDs simplify determining and maintaining object identities in disconnected environments. Historically, most applications have used auto-incrementing integer fields as primary keys. When you create a new object, you don't know its ID until it is inserted into the database. UUIDs let you establish an object's identity much earlier in your application.

Here is a basic PHP example that demonstrates the difference. Let's first look at an integer-based system:

class BlogPost {
    public function __construct(
        public readonly ?int $Id,
        public readonly string $Headline,
        public readonly ?AuthorCollection $Authors = null) {}
}

#[POST("/posts")]
function createBlogPost(HttpRequest $Request): void {
    $headline = $Request->getField("Headline");
    // The ID is unknown until the database assigns one, so it starts as null.
    $blogPost = new BlogPost(null, $headline);
}

We have to initialize the $Id property with null because we don't know its actual ID until it's persisted to the database. This is not ideal: $Id shouldn't really be nullable, and it allows BlogPost instances to exist in an incomplete state.

Switching to UUIDs solves the problem:

class BlogPost {
    public function __construct(
        public readonly string $Uuid,
        public readonly string $Headline,
        public readonly ?AuthorCollection $Authors = null) {}
}

#[POST("/posts")]
function createBlogPost(HttpRequest $Request): void {
    $headline = $Request->getField("Headline");
    // The identifier is known at construction time, before any database call.
    $blogPost = new BlogPost("16763be4-...", $headline);
}

Post identifiers can now be generated in the app without risking duplicate values. This ensures that object instances always represent a valid state and removes the need for awkward nullable ID properties. The model also makes transactional logic easier to handle; child records that need to reference their parent (such as our Post's Author association) can be inserted immediately, without a database round-trip to get the ID assigned to the parent.

In the future, your blog application may move more logic to the client. Perhaps the front end gains support for fully offline draft creation, effectively creating an instance of BlogPost that is temporarily saved on the user's device. Now the client can generate the UUID of the post and transmit it to the server when it reconnects to the network. If the client later retrieves the server's copy of the draft, it can match it against any remaining local state, since the UUID is already known.

UUIDs can also help you combine data from various sources. Merging database tables and caches that use integer keys can be tedious and error-prone, because the same number may identify different records in each source. UUIDs are unique not only within a table but across every table, database, and system. This makes them better candidates for replicated structures and data that is often moved between different storage systems.
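
The merging problem can be sketched with two plain PHP arrays standing in for two independent data stores. The store names, record values, and UUID strings below are made up for illustration.

```php
<?php
// Two independent stores that both started their integer IDs at 1.
$intStoreA = [1 => 'Alice', 2 => 'Bob'];
$intStoreB = [1 => 'Carol', 2 => 'Dave'];

// The array union operator keeps the left-hand entry for duplicate keys,
// so both of store B's records are silently dropped.
$mergedByInt = $intStoreA + $intStoreB;
echo count($mergedByInt), PHP_EOL; // 2

// The same records keyed by (illustrative) UUIDs never share a key.
$uuidStoreA = [
    '16763be4-6022-406e-a950-fcd5018633ca' => 'Alice',
    'f47ac10b-58cc-4372-a567-0e02b2c3d479' => 'Bob',
];
$uuidStoreB = [
    'c9bf9e57-1685-4c89-bafb-ff5af830be8a' => 'Carol',
    '3f2504e0-4f89-41d3-9a0c-0305e82c3301' => 'Dave',
];

$mergedByUuid = $uuidStoreA + $uuidStoreB;
echo count($mergedByUuid), PHP_EOL; // 4
```

With integer keys you would have to renumber one side and update every reference to it before merging; with UUID keys the union is already correct.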

Considerations when using UUIDs with a database

The benefits of UUIDs are compelling. However, there are several issues to be aware of when using them in real systems. A big point in favor of integer IDs is that they are easy to scale and optimize. Database engines can efficiently index, sort, and filter a list of numbers that only ever grows in one direction.

The same cannot be said for UUIDs. First, UUIDs are larger than integers: 16 bytes in binary form, which is four times the size of a 4-byte integer, and 36 bytes when stored as a hyphenated string. For large datasets, this in itself can be an important consideration. Sorting and indexing these values is also trickier, especially in the most common case of random UUIDs. Their randomness means they have no natural order, which will hurt indexing performance if you use a UUID as the primary key.

These issues can be compounded in normalized databases that use foreign keys heavily. Now you may have many relational tables, each containing a reference to your 36-byte UUID. Ultimately, the additional memory required to perform joins and sorts can have a significant impact on system performance.

You can partially alleviate these problems by storing UUIDs as binary data. This means a BINARY(16) column instead of VARCHAR(36). Some databases (e.g. PostgreSQL) include a built-in UUID data type; others, such as MySQL, provide functions that convert a UUID string to its binary representation and back. This method is more efficient, but keep in mind that you will still use extra resources to store and select data compared with integers.
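
If the conversion has to happen in application code rather than in the database, it might look like the following PHP sketch. The function names uuidToBinary and binaryToUuid are hypothetical helpers, not part of PHP or of any database driver.

```php
<?php
// Convert a hyphenated (or bare 32-digit) UUID string into 16 raw bytes.
function uuidToBinary(string $uuid): string
{
    $hex = str_replace('-', '', $uuid);
    if (strlen($hex) !== 32 || !ctype_xdigit($hex)) {
        throw new InvalidArgumentException("Not a valid UUID string: $uuid");
    }
    return hex2bin($hex);
}

// Convert 16 raw bytes back into the canonical 8-4-4-4-12 string.
function binaryToUuid(string $binary): string
{
    if (strlen($binary) !== 16) {
        throw new InvalidArgumentException('A binary UUID must be exactly 16 bytes.');
    }
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($binary), 4));
}

$uuid = '16763be4-6022-406e-a950-fcd5018633ca';
$binary = uuidToBinary($uuid);

echo strlen($uuid), PHP_EOL;        // 36 bytes as text
echo strlen($binary), PHP_EOL;      // 16 bytes as binary
echo binaryToUuid($binary), PHP_EOL; // 16763be4-6022-406e-a950-fcd5018633ca
```

The binary form is what you would bind to a BINARY(16) column, while the string form remains convenient for logs, URLs, and API payloads.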

An effective strategy is to keep integers as primary keys, but add an extra UUID field for application reference. Your code then uses UUIDs to fetch and insert top-level objects, while join tables and foreign keys use the integer IDs for better performance. It all depends on your system, its size, and your priorities: UUIDs are the best choice when you need decentralized ID generation and straightforward data merging, but they come with trade-offs.

Summary

UUIDs are unique values that you can safely use for decentralized identity generation. A collision is possible, but is so rare that it can be dismissed from consideration. If you generated 1 billion version 4 UUIDs per second for about 86 years, assuming enough entropy is available, the probability of encountering at least one duplicate would be about 50%.
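
That figure can be checked with the standard birthday-problem approximation, p ≈ 1 - exp(-n² / (2N)), where n is the number of UUIDs generated and N is the number of possible values. The PHP sketch below assumes version 4 UUIDs, which carry 122 random bits (128 bits minus 4 version bits and 2 variant bits); the function name collisionProbability is illustrative.

```php
<?php
// Approximate probability of at least one collision among $count values
// drawn uniformly at random from a space of 2^$randomBits possibilities.
function collisionProbability(float $count, int $randomBits = 122): float
{
    $space = 2.0 ** $randomBits;
    // -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -expm1(-($count * $count) / (2.0 * $space));
}

$secondsPerYear = 365.25 * 24 * 60 * 60;

foreach ([1, 86, 100] as $years) {
    // One billion UUIDs per second for the given number of years.
    $count = 1e9 * $secondsPerYear * $years;
    printf("%3d years: %.4f%s", $years, collisionProbability($count), PHP_EOL);
}
```

Under these assumptions the probability reaches roughly 0.5 at 86 years and roughly 0.61 at a full century, while at realistic generation rates it stays vanishingly small.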

You can use a UUID to establish a database-independent identity before the insert takes place. This simplifies application-level code and prevents objects without a proper identity from existing in your system. Unlike traditional integer keys, which are only unique within a table, UUIDs also aid data replication by guaranteeing uniqueness independent of the data store, device, or environment.

While UUIDs are now ubiquitous in software development, they are not a perfect solution. Novices tend to focus on the possibility of collisions, but this shouldn't be your main consideration unless your system is so sensitive that uniqueness must be strictly guaranteed.

The more pressing challenge for most developers is the storage and retrieval of the generated UUIDs. Naively using a VARCHAR(36) column (or stripping the hyphens and using VARCHAR(32)) may cripple your application over time, because most database index optimizations will be ineffective. Investigate your database system's built-in UUID handling to ensure you get the best performance out of your solution.