Posted by nick @ 23:48 on June 25th 2009

A perl one-liner to extract all URLs from an HTML document

Just in case I ever forget how I did it… I was trying to download some 40 page PDF brochure from a government web site – I wanted to print it out and read it off-line. However, it was cleverly split into 20 different PDFs – no doubt for convenience. Instead of spending 20 minutes clicking on those various links and printing 20 document fragments, I chose to spend twice that time trying to automate the process. And here it is, in all its glory:

curl -s "http://www.datori.org" \
| perl -n -e 'chomp;s/.*?(?:(?i)href)="([^"]+)".*?(?:$|(?=(?i)href))/$1\n/xg and print'

The “thing” downloads the specified page and extracts all linked URLs from it, as indicated by the “href” tags. You’ve got to appreciate the enormity of perl…

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a comment