Target URLs with special characters

Henrik Petterson :

I have a string with HTML, and I target image URLs like this:

$regex = '#([a-z,:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#i';

Works fine with:

https://example.com/image.jpg

But when a URL has a special character, like:

https://example.com/ストスト.jpg

It doesn't match.

How do I alter the regex so that it also matches URLs containing these special characters?

The fourth bird :

In the character class you don't have to escape the , and the :. You also don't have to escape the / if you use a different delimiter like #.
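As a quick check of the point about escaping, the sketch below compares the character class from the question with a version that drops the redundant backslashes. The sample strings are made up for illustration, and both patterns are anchored only so that the whole string is tested against the class.

```php
<?php
// Character class from the question, with the redundant escapes kept.
$escaped   = '#^[a-z,:=\-_0-9\/\:\.]+$#i';
// Same set of characters without the escapes; the hyphen is moved to the
// end of the class so it is read as a literal rather than a range.
$unescaped = '#^[a-z,:=_0-9/:.-]+$#i';

// Hypothetical sample strings, chosen only to exercise the class.
$samples = [
    'https://example.com/image.jpg',
    'a,b:c=d-e_f/g.h',
    'has space',
];

foreach ($samples as $s) {
    $a = preg_match($escaped, $s);
    $b = preg_match($unescaped, $s);
    echo $s, ' => escaped: ', $a, ', unescaped: ', $b, PHP_EOL;
}
```

Both patterns should agree on every sample, since the two classes describe the same set of characters.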

You could shorten the pattern to

[\w,=/:.-]+\.(?:jpe?g|png|gif)


If you want to find the href from the anchors, I suggest using a parser instead.

The pattern including the u unicode flag:

$regex = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';
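To illustrate what the u flag changes, here is a minimal sketch that runs the same pattern with and without it over a small HTML string. The HTML snippet is a made-up sample built from the two URLs in the question; the explanation in the comments assumes PHP's default PCRE behaviour.

```php
<?php
// Hypothetical sample input using the two URLs from the question.
$html = '<img src="https://example.com/image.jpg"> '
      . '<img src="https://example.com/ストスト.jpg">';

$pattern = '[\w,=/:.-]+\.(?:jpe?g|png|gif)';

// Without u: the subject is processed byte by byte and \w covers only
// ASCII letters, digits and underscore, so the multibyte file name breaks the match.
preg_match_all('#' . $pattern . '#i', $html, $withoutU);

// With u: pattern and subject are treated as UTF-8, and \w also matches
// Unicode letters such as the katakana in the second URL.
preg_match_all('#' . $pattern . '#iu', $html, $withU);

print_r($withoutU[0]);
print_r($withU[0]);
```

With the flag, the second array should contain both URLs in full; without it, only the ASCII-only URL is expected to be found.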

For example (using anchors ^ and $ to prevent getting partial matches)

$input = <<<HTML
<a href="https://e...content-available-to-author-only...e.com/example1.jpg">
<a href="https://e...content-available-to-author-only...e.com/ストスト.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.bak">
HTML;

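$dom = new DOMDocument();
// Convert to HTML entities so loadHTML() does not misread the UTF-8 characters.
// Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
$dom->loadHTML(mb_convert_encoding($input, 'HTML-ENTITIES', "UTF-8"));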

$anchors = $dom->getElementsByTagName("a");
$regex = '#^[\w,=/:.-]+\.(?:jpe?g|png|gif)$#iu';

foreach ($anchors as $anchor) {
    $res = $anchor->getAttribute("href");
    if (preg_match($regex, $res)) {
        echo "Valid url: $res" . PHP_EOL;
    } else {
        echo "Invalid url: $res" . PHP_EOL;
    }
}

Output

Valid url: https://e...content-available-to-author-only...e.com/example1.jpg
Valid url: https://e...content-available-to-author-only...e.com/ストスト.jpg
Valid url: https://e...content-available-to-author-only...e.com/example3.jpg
Invalid url: https://e...content-available-to-author-only...e.com/example3.bak
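
To see why the anchors matter when validating a whole attribute value, the following sketch compares the anchored and unanchored patterns on a value that merely contains an image URL. The href value is hypothetical.

```php
<?php
$unanchored = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';
$anchored   = '#^[\w,=/:.-]+\.(?:jpe?g|png|gif)$#iu';

// Hypothetical href: an image URL followed by an extra ".bak" suffix.
$href = 'https://example.com/photo.jpg.bak';

// The unanchored pattern finds "https://example.com/photo.jpg" inside the value.
$partial = preg_match($unanchored, $href);

// The anchored pattern requires the whole value to end in an image extension.
$whole = preg_match($anchored, $href);

echo "unanchored: $partial, anchored: $whole", PHP_EOL;
```

The unanchored pattern reports a match because part of the string looks like an image URL, while the anchored one rejects the value as a whole.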
