Posted by nick @ 23:48 on June 25th 2009

A perl one-liner to extract all URLs from an HTML document

Just in case I ever forget how I did it… I was trying to download a 40-page PDF brochure from a government web site – I wanted to print it out and read it off-line. However, it was cleverly split into 20 different PDFs – no doubt for convenience. Instead of spending 20 minutes clicking on the various links and printing 20 document fragments, I chose to spend twice that time trying to automate the process. And here it is, in all its glory:

curl -s "http://www.datori.org" \
| perl -n -e 'chomp;s/.*?(?:(?i)href)="([^"]+)".*?(?:$|(?=(?i)href))/$1\n/xg and print'

The “thing” downloads the specified page and prints every linked URL it finds, one per line, by matching the double-quoted value of each “href” attribute. You’ve got to appreciate the enormity of perl…

2 Comments »

  2. Great tip, Thanks!

    Comment by bcarroll — March 19, 2013 @ 09:53
