A perl one-liner to extract all URLs from an HTML document
Just in case I ever forget how I did it… I was trying to download a 40-page PDF brochure from a government web site – I wanted to print it out and read it offline. However, it was cleverly split into 20 separate PDFs – no doubt for convenience. Instead of spending 20 minutes clicking on each of those links and printing 20 document fragments, I chose to spend twice that time automating the process. And here it is, in all its glory:
curl -s "http://www.datori.org" \
| perl -n -e 'chomp;s/.*?(?:(?i)href)="([^"]+)".*?(?:$|(?=(?i)href))/$1\n/xg and print'
The “thing” downloads the specified page and extracts every linked URL from it, as indicated by the “href” attributes. Note that it only picks up href values written in double quotes. You’ve got to appreciate the enormity of perl…
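For comparison, here is a shorter variant of the same idea. It is a rough sketch, not a replacement for a real HTML parser. The function name extract_hrefs is made up for illustration, and the regex assumes each href value is quoted (single or double) and sits on one line.

```shell
# Hypothetical helper: reads HTML on stdin, prints one href value per line.
# (["\x27]) captures the opening quote (\x27 is a single quote),
# (.*?) lazily captures the URL, and \1 requires the same quote to close it.
# /g with the while loop walks through every match on a line; /i ignores case.
extract_hrefs() {
  perl -ne 'print "$2\n" while /href\s*=\s*(["\x27])(.*?)\1/gi'
}

# Try it on an inline snippet (no network needed):
extract_hrefs <<'EOF'
<a href="part1.pdf">Part 1</a> <A HREF='part2.pdf'>Part 2</A>
<p>no links here</p>
EOF
```

The snippet should print part1.pdf and part2.pdf on separate lines. To use it on a live page, pipe curl into it the same way as above: curl -s "http://www.datori.org" | extract_hrefs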
Great tip, Thanks!
Comment by bcarroll — March 19, 2013 @ 09:53