linux - How to check a folder of text files for duplicate URLs
I have a folder of *.txt files and I want to regularly check these files for duplicate URLs.
Actually, I save my bookmarks in these files, each entry taking at least 2 lines, such as:

www.domain.com
A quite popular domain name

As it happens, I sometimes save the same URL with a different description, such as:

www.domain.com
I should buy this domain whenever I happen to have enough money

All entries are separated by single blank lines. Also, the URLs come in Markdown format:
[domain.com](www.domain.com)
How can I crawl the folder for duplicate URLs?
The only solution I have found so far is cat in combination with a sort and uniq pipe:

cat folder/* | sort | uniq > dupefree.txt
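I guess a variant with uniq -d (from GNU coreutils; it prints only the repeated lines) would at least report the duplicates instead of writing a cleaned list:

cat folder/* | sort | uniq -d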
The problem with either pipe is:
- It only checks for completely identical lines: the Markdown URLs are ignored and the connected comments get lost.
- I don't want a cleaned-up text file as output; I need a hint about which URLs are duplicates.

How can I do a proper duplicate check?
Here is a source file I made from your description:

cat file
www.domain.com
A quite popular domain name

www.domain.com
I should buy this domain whenever I happen to have enough money

[domain.com](www.domain.com)
Using awk to extract the duplicated domain names:
awk 'BEGIN{FS="\n";RS=""}
{
    if ($1~/\[/) { split($1,a,"[)(]"); domain[a[2]]++ }
    else         { domain[$1]++ }
}
END{ for (i in domain) if (domain[i]>1) print "duplicate domain found: ",i }' file
duplicate domain found:  www.domain.com
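To check the whole folder instead of a single file, the same idea extends naturally. This is a minimal sketch, assuming your bookmark files match folder/*.txt and that the URL is always the first line of an entry; it additionally records awk's FILENAME variable so you can see where each duplicate lives:

awk 'BEGIN{FS="\n";RS=""}
{
    # take the domain out of a Markdown link, otherwise use the first line as-is
    if ($1~/\[/) { split($1,a,"[)(]"); d=a[2] } else { d=$1 }
    count[d]++                        # occurrences of this domain
    files[d]=files[d] " " FILENAME    # files it was seen in
}
END{
    for (i in count)
        if (count[i]>1) print "duplicate domain found: " i " in" files[i]
}' folder/*.txt

Note that a file name is listed once per occurrence, so a file that contains the same URL twice will show up twice in the list.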