datori

Database administration for fun and profit

A perl one-liner to extract all URLs from an HTML document

Just in case I ever forget how I did it… I was trying to download a 40-page PDF brochure from a government web site so that I could print it out and read it off-line. However, it was cleverly split into 20 separate PDFs, no doubt for convenience. Instead of spending 20 minutes clicking on each of those links and printing 20 document fragments, I chose to spend twice that time automating the process. And here it is, in all its glory:

curl -s "http://www.datori.org" \
| perl -n -e 'chomp;s/.*?(?:(?i)href)="([^"]+)".*?(?:$|(?=(?i)href))/$1\n/xg and print'

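The “thing” downloads the specified page and extracts every linked URL from it, as indicated by the “href” attributes. It only catches values written as href="..." with double quotes and no spaces around the equals sign, which was good enough for my brochure. You’ve got to appreciate the enormity of perl…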

2 Responses to A perl one-liner to extract all URLs from an HTML document

  1. Pingback: links for 2010-01-16 | nrvous, org.

  2. bcarroll says:

    Great tip, Thanks!
