I'm trying to get more familiar with Python web scraping, and for one specific function I want to find external links only. In the book I'm reading, the author implements this by simply removing the "http://" from the string and then checking whether each link contains the resulting string (which is the domain name without the preceding "http://").
I can see how this code might fail, and although I can simply write an if statement, it does make me wonder - is there any way to match all links that start with "http" but not with "http(s)://domain.com"? I tried many different regex solutions that I thought would work, but they haven't. For example, with the variable "site" containing the link address:
re.compile("^((?!"+site+").)^http|www*$"))
re.compile("^http|www((?!"+site+").)*$"))
The results I get are simply all links that start with http or www, and that's not what I intend. Again, I can implement this just fine with an if statement and filter the results, so this isn't a complete blocker, but I'm curious whether such a regex is possible.
Any help would be appreciated. I looked around the web but couldn't find anything that matches my use case.
To match a string that starts with one string but not with another, you should use a negative lookahead anchored at the start of the pattern:
^(?!stringyoudontwant)stringyouwant.*
So in your case, this would be:
^(?!https?:\/\/domain\.com)http.*
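As a rough sketch of how that pattern could be built in Python from your "site" variable: this assumes "site" holds a bare domain such as "domain.com" with no scheme in front of it. The helper name external_link_pattern and the sample links are made up for illustration. re.escape is used so that the dot in the domain is matched literally.

```python
import re

def external_link_pattern(site):
    # Hypothetical helper. Assumes `site` is a bare domain like "domain.com".
    # The negative lookahead rejects links on that domain (http or https);
    # everything else that starts with "http" is accepted.
    return re.compile(r"^(?!https?://" + re.escape(site) + r")http.*")

site = "domain.com"  # assumed value, for illustration only
pattern = external_link_pattern(site)

# Made-up sample links
links = [
    "http://domain.com/page",
    "https://domain.com",
    "http://example.org/a",
    "https://other.net",
    "/relative/path",
    "www.domain.com",
]

# Keep only the links the pattern matches
external = [link for link in links if pattern.match(link)]
print(external)
```

Note that, like the pattern above, this only accepts links starting with "http", so scheme-less links such as "www.example.org" are not matched, and it rejects any host that merely begins with the domain string (for example "domain.com.example.org").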
For this kind of thing, you can check out https://regex101.com, which is a convenient interface for experimenting with complicated regexes.