linux - How to check a folder of text files for duplicate URLs


I have a folder of *.txt files and want to regularly check these files for duplicate URLs.

Actually, I save bookmarks in these files, each with at least 2 lines, such as:

www.domain.com
quite popular domain name

As it happens, I sometimes save the same URL again with a different description, such as:

www.domain.com
should buy this domain whenever I happen to have enough money

All entries are separated by single blank lines. Sometimes the URLs are in markdown format:

[domain.com](www.domain.com) 

How can I crawl the folder for duplicate URLs?

The only solution I have found so far is cat in combination with a sort | uniq pipe:

cat folder/* | sort | uniq | less > dupefree.txt

The problems are:

  1. This only checks for fully identical lines, so markdown URLs are ignored and the connected comments are lost.
  2. I don't want a cleaned output text file; I only need a hint about which URLs are duplicates.

How can I do a proper duplicate check?
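For illustration, something along these lines would already give the kind of hint I am looking for: pull just the URLs (plain or markdown) out of every file and print only those that occur more than once. This is a rough sketch, assuming GNU grep, that plain URLs start with www., and that the markdown URL is the only parenthesised part of its line:

grep -hoE '\(www\.[^)]+\)|^www\.[^ ]+' folder/*.txt \
    | tr -d '()' \
    | sort | uniq -d

Replacing uniq -d with uniq -cd would additionally show how often each duplicate URL occurs.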

Here is the source file I made from the description:

cat file
www.domain.com
quite popular domain name

www.domain.com
should buy this domain whenever I happen to have enough money

All entries are separated by single blank lines. Sometimes the URLs are in markdown format:

[domain.com](www.domain.com)

How can I crawl the folder for duplicate URLs?

Using awk to extract the duplicate domain names:

awk 'BEGIN{FS="\n";RS=""}
{
    if ($1~/\[/) { split($1,a,"[)(]"); domain[a[2]]++ }
    else { domain[$1]++ }
}
END{
    for (i in domain)
        if (domain[i]>1) print "duplicate domain found: ", i
}' file

duplicate domain found:  www.domain.com
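To cover the whole folder instead of a single file, the same awk can be pointed at all of the text files and FILENAME collected along the way, so the report also says where each duplicate lives. A sketch, assuming the bookmark files are folder/*.txt (a filename is listed twice if the same URL appears twice in that file):

awk 'BEGIN{FS="\n";RS=""}
{
    # the URL is either inside a markdown link or the whole first line of the record
    if ($1~/\[/) { split($1,a,"[)(]"); url=a[2] } else { url=$1 }
    count[url]++
    files[url] = files[url] " " FILENAME
}
END{
    for (u in count)
        if (count[u]>1) print "duplicate domain found: " u " in" files[u]
}' folder/*.txt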
