Java url domain parsing with regex

abidinberkay :

I want to parse a URL's domain (without 'www') with regex and return it. There are many examples of this on Stack Overflow, but they either do not cover all of the cases below or include unnecessary features. My cases are:

http://www.google.co.uk      pass
http://www.google.co.uk      pass
http://google.com.co.uk      pass
same for https               pass
google.co.uk                 pass
www.google.co.uk             pass

All of them must return only the domain part, google.co.uk. There is no need to handle links like 101.34.24.. or ones starting with ftp etc. Only the input formats above are allowed. I validate the URL with the regex ^(https?:\/\/)?(www\.)?([\w]+\.)+[\w]{2,63}\/?$ and it works well, but I do not know how to extract the domain with it.

Note: Please do not recommend the URI or URL classes and their methods for parsing the domain automatically, like:

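private String parseUrl(String url) throws URISyntaxException {
    if (url.startsWith("http:/")) {
        // Repair a single-slash scheme such as "http:/host"
        if (!url.contains("http://")) {
            url = url.replaceAll("http:/", "http://");
        }
    } else if (url.startsWith("https:/")) {
        // Same repair for https, so "https:/host" also gets a valid scheme
        if (!url.contains("https://")) {
            url = url.replaceAll("https:/", "https://");
        }
    } else {
        // No scheme given: add one so URI can find the host
        url = "http://" + url;
    }
    URI uri = new URI(url);
    String domain = uri.getHost();
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}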

This code works perfectly as well, but I need a regex solution, not this one.

Pushpesh Kumar Rajwanshi :

Your regex,

^(https?:\/\/)?(www\.)?([\w]+\.)+[\w]{2,63}\/?$

matches the input but does not capture the intended domain in a single group: the repeated group ([\w]+\.)+ keeps only its last iteration, and the final label is not inside any group. You can simplify it like this:

^(?:https?:\/\/)?(?:www\.)?((?:[\w]+\.)+\w+)

which captures the intended domain in group 1.
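
As a rough sketch of how that group could be wrapped in a reusable method: the class name DomainExtractor and the method extractDomain are made up for illustration, and appending the optional trailing slash and end anchor from the question's validation regex is an assumption, not part of the pattern above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainExtractor {
    // The pattern above, plus "/?$" borrowed from the question's validation
    // regex (an assumed combination) so a trailing slash is accepted.
    private static final Pattern DOMAIN = Pattern.compile(
            "^(?:https?://)?(?:www\\.)?((?:\\w+\\.)+\\w+)/?$");

    // Returns the domain without scheme or "www.", or empty if the input does not match.
    public static Optional<String> extractDomain(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {"google.co.uk", "www.google.co.uk", "https://www.google.co.uk/"};
        for (String url : inputs) {
            System.out.println(url + " --> " + extractDomain(url).orElse("(no match)"));
        }
    }
}
```

Returning an Optional rather than null makes the "input did not match" case explicit for the caller; that is a design choice of this sketch only.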

Here is sample Java code that extracts and prints the domain name:

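import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainDemo {
    public static void main(String[] args) {
        // Group 1 captures the host without the optional scheme and "www." prefix.
        Pattern p = Pattern.compile("^(?:https?:\\/\\/)?(?:www\\.)?((?:[\\w]+\\.)+\\w+)");
        List<String> list = Arrays.asList("http://www.google.co.uk", "http://www.google.co.uk",
                "http://google.com.co.uk", "https://www.google.co.uk", "https://www.google.co.uk",
                "https://google.com.co.uk");

        list.forEach(x -> {
            Matcher m = p.matcher(x);
            // matches() requires the whole input to match the pattern.
            if (m.matches()) {
                System.out.println(x + " --> " + m.group(1));
            }
        });
    }
}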

Prints,

http://www.google.co.uk --> google.co.uk
http://www.google.co.uk --> google.co.uk
http://google.com.co.uk --> google.com.co.uk
https://www.google.co.uk --> google.co.uk
https://www.google.co.uk --> google.co.uk
https://google.com.co.uk --> google.com.co.uk
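
One thing to watch in the demo above: matches() only succeeds when the whole input is consumed, and this pattern has no trailing \/?, so an input such as http://www.google.co.uk/ (which the question's validation regex allows) would print nothing. A minimal sketch of the difference, using find() instead; the class name FindVsMatches and the helper domainOrNull are hypothetical names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatches {
    // Same pattern as in the demo, written without the redundant escapes.
    static final Pattern P = Pattern.compile("^(?:https?://)?(?:www\\.)?((?:\\w+\\.)+\\w+)");

    // Uses find(), so text after the domain (such as a trailing slash) is ignored.
    public static String domainOrNull(String url) {
        Matcher m = P.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://www.google.co.uk/";
        System.out.println(P.matcher(url).matches()); // false: the trailing "/" is left over
        System.out.println(domainOrNull(url));        // google.co.uk
    }
}
```

The trade-off is that find() is more permissive: it also returns a domain when arbitrary text follows the host, so it is only a reasonable choice if the input has already been validated separately.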
