Sibgha :
I have two directories on my Linux system, /dir and /dir2.
Both contain more than 4000 JSON files. The content of each file looks like this:
{
  "someattribute": "someValue",
  "url": [
    "https://www.someUrl.com/xyz"
  ],
  "someotherattribute": "someOtherValue"
}
Note that url is an array, but it always contains exactly one element (the URL). The URL makes a file unique: if a file with the same URL exists in both /dir and /dir2, it's a duplicate and needs to be deleted.
I want to automate this operation, preferably with a shell command. Any opinion on how I should go about it?
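For reference, the URL that serves as the key can be read from a single file with jq; a minimal sketch (the filename is just a placeholder):
jq -r '.url[0]' /dir/example.json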
oguz ismail :
Use jq to get a list of duplicates:
jq -nr 'foreach inputs.url[0] as $u (
  {};
  .[$u] += 1;
  if .[$u] > 1 then input_filename else empty end
)' /dir/*.json /dir2/*.json
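Because /dir/*.json is listed first, the first file seen with a given URL is kept and every later file with the same URL is printed; assuming neither directory contains internal duplicates, the reported files are the copies in /dir2. To spot-check a flagged file before deleting anything, you can confirm its URL also appears under /dir; a rough sketch (the /dir2 filename is a placeholder):
url=$(jq -r '.url[0]' /dir2/some-flagged-file.json)
grep -rlF -- "$url" /dir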
To delete them, pipe the above command's output to xargs:
xargs -d $'\n' rm --
or, for compatibility with a non-GNU xargs that has -0 but not -d:
tr '\n' '\0' | xargs -0 rm --
Note that, either way, the filenames must not contain line feeds.
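Putting the two steps together, a minimal end-to-end sketch (assuming GNU xargs, i.e. the -d variant, and the paths from the question):
jq -nr 'foreach inputs.url[0] as $u (
  {};
  .[$u] += 1;
  if .[$u] > 1 then input_filename else empty end
)' /dir/*.json /dir2/*.json | xargs -d $'\n' rm --
With GNU xargs you could also add -r (--no-run-if-empty) so that rm is not invoked at all when no duplicates are found.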