linux - How to check a folder of text files for duplicate URLs
I have a folder of *.txt files and I want to regularly check these files for duplicate URLs.
Actually, I save my bookmarks in these files, each entry taking at least 2 lines, such as:

www.domain.com
A quite popular domain name

As it happens, I sometimes save the same URL with a different description, such as:

www.domain.com
I should buy this domain whenever I happen to have enough money

All entries are separated by single blank lines. Also, the URLs come in Markdown format:
[domain.com](www.domain.com)
How can I crawl the folder for duplicate URLs?
The only solution I have found so far is cat in combination with a sort and uniq pipe:

cat folder/* | sort | uniq > dupefree.txt
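I guess a variant with uniq -d (from GNU coreutils; it prints only the repeated lines) would at least report the duplicates instead of writing a cleaned list:

cat folder/* | sort | uniq -d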
The problem with either pipe is:
- It only checks for completely identical lines: the Markdown URLs are ignored and the connected comments get lost.
- I don't want a cleaned-up text file as output; I need a hint about which URLs are duplicates.

How can I do a proper duplicate check?
Here is a source file I made from your description:

cat file
www.domain.com
A quite popular domain name

www.domain.com
I should buy this domain whenever I happen to have enough money

[domain.com](www.domain.com)
Using awk to extract the duplicated domain names:
awk 'BEGIN{FS="\n";RS=""}
{
    if ($1~/\[/) { split($1,a,"[)(]"); domain[a[2]]++ }
    else         { domain[$1]++ }
}
END{ for (i in domain) if (domain[i]>1) print "duplicate domain found: ",i }' file
duplicate domain found:  www.domain.com
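To check the whole folder instead of a single file, the same idea extends naturally. This is a minimal sketch, assuming your bookmark files match folder/*.txt and that the URL is always the first line of an entry; it additionally records awk's FILENAME variable so you can see where each duplicate lives:

awk 'BEGIN{FS="\n";RS=""}
{
    # take the domain out of a Markdown link, otherwise use the first line as-is
    if ($1~/\[/) { split($1,a,"[)(]"); d=a[2] } else { d=$1 }
    count[d]++                        # occurrences of this domain
    files[d]=files[d] " " FILENAME    # files it was seen in
}
END{
    for (i in count)
        if (count[i]>1) print "duplicate domain found: " i " in" files[i]
}' folder/*.txt

Note that a file name is listed once per occurrence, so a file that contains the same URL twice will show up twice in the list.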