De-internet-archive-scripting webpages

Post by Dutch_Master » Sun Mar 25, 2012 9:18 pm

Some time ago I downloaded a website from the Internet Archive pages, as it turned out the owner had discontinued it. The archive inserts some ***** scripting that I greatly dislike, and it keeps linking back to the archive. I could remove all instances by hand in the HTML code, but with 100+ pages I'd think there's a better (quicker!) solution. I assume sed or awk is required but, knowing nought about either, what's the best one-liner (or script, that's fine) that gets me running? (s'cuse the pun ;))
Dutch_Master
LXF regular
 
Posts: 2445
Joined: Tue Mar 27, 2007 1:49 am

Post by nelz » Mon Mar 26, 2012 12:55 am

If the code is in the same place in each file, you can remove a range of lines with

Code:
for i in *.html; do
  sed -i x-yd $i
done


Where x and y are the first and last lines of the script. Otherwise we'd need to see an example to see how to identify the lines to be deleted.
"Insanity: doing the same thing over and over again and expecting different results." (Albert Einstein)
nelz
Site admin
 
Posts: 8498
Joined: Mon Apr 04, 2005 11:52 am
Location: Warrington, UK

Post by Dutch_Master » Mon Mar 26, 2012 1:32 am

The problem is that although the archive script puts a lot of files in the same place, it also rewrites every link in a page as an absolute link (complete with the http:// prefix). However, when storing the files I also introduced some issues of my own, which only became apparent when I opened the HTML code.... Right now, I think your script will remove the bulk of the added code, and I'd have to manually edit the absolute links back to relative links. Thanks again Nelz!

[edit: cried victory too soon. After replacing the x and y with numerical values I got the error
Code:
sed: -e expression #1, char 3: unknown command: `-'
I've got as far as
Code:
for i in *.html; do sed i\ {14-216}d $i; done
This clears the files completely. Luckily I've got a backup.... ;)]

[edit2: here's a simple sample]

Post by nelz » Mon Mar 26, 2012 9:34 am

My bad, I was working from failing memory: it is x,yd, not x-yd. sed separates the two line numbers of a range with a comma and does not accept a hyphen there, hence the error.
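As a rough illustration of the comma form, here is a minimal sketch that deletes a line range in place. It assumes GNU sed (whose -i takes no separate argument); the scratch directory, the sample file and the line numbers 2 and 4 are made up for the demo and are not taken from the real pages.

```shell
#!/bin/sh
# Work in a scratch directory so no real pages are touched (hypothetical demo data).
demo_dir=$(mktemp -d)

# Six-line sample file; lines 2-4 stand in for the injected script.
printf '%s\n' '<html>' 'junk 1' 'junk 2' 'junk 3' '<body>' '</html>' > "$demo_dir/sample.html"

for i in "$demo_dir"/*.html; do
  # Delete lines 2 through 4 in place; the comma separates first and last line.
  sed -i '2,4d' "$i"
done

# Show what is left of the sample file.
range_result=$(cat "$demo_dir/sample.html")
printf '%s\n' "$range_result"
```

Quoting "$i" keeps the loop safe for file names containing spaces, such as "Start page.html".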

You can also give an extension to -i and sed will create backups of the original files with that extension. This will remove the toolbar stuff and save a backup:

Code:
sed -i.bak '/BEGIN WAYBACK TOOLBAR/,/END WAYBACK TOOLBAR/d' 'Start page.html'
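To run the same deletion over every page rather than a single file, something along these lines might do. It is only a sketch: it assumes GNU sed, and the sample page (including the exact wording of the marker comments) is invented here so the example is self-contained.

```shell
#!/bin/sh
# Scratch directory with one made-up page (assumed content, not a real archive page).
work_dir=$(mktemp -d)
cat > "$work_dir/Start page.html" <<'EOF'
<html><body>
<!-- BEGIN WAYBACK TOOLBAR INSERT -->
<script>toolbar()</script>
<!-- END WAYBACK TOOLBAR INSERT -->
<p>Original content</p>
</body></html>
EOF

for page in "$work_dir"/*.html; do
  # Delete from the line matching the first pattern through the line matching
  # the second, keeping the untouched original as <name>.bak.
  sed -i.bak '/BEGIN WAYBACK TOOLBAR/,/END WAYBACK TOOLBAR/d' "$page"
done

# The cleaned page, for inspection.
toolbar_result=$(cat "$work_dir/Start page.html")
printf '%s\n' "$toolbar_result"
```

Because the range is defined by patterns rather than line numbers, it does not matter if the toolbar sits on different lines in different files.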


You can change the links to relative with something like

Code:
sed 's|http://web\.archive\.org/web/20041103050546/http://web\.utanet\.at/smiderkr/asr/||g'


on all files in the same directory, but it gets messy if the pages are stored in multiple subdirectories.
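For the subdirectory case, one possible approach is to replace the archive prefix with one "../" per directory level of the page being edited. The sketch below assumes GNU sed, that all pages live under a single site root, and that no file name contains a newline; the directory layout and file names (pics/gallery.html, index.html) are hypothetical.

```shell
#!/bin/sh
# Hypothetical layout: a scratch copy of the site with one page in a subdirectory.
site_dir=$(mktemp -d)
mkdir -p "$site_dir/pics"
prefix='http://web.archive.org/web/20041103050546/http://web.utanet.at/smiderkr/asr/'
printf '<a href="%sindex.html">Home</a>\n' "$prefix" > "$site_dir/pics/gallery.html"
printf '<a href="%spics/gallery.html">Pics</a>\n' "$prefix" > "$site_dir/index.html"

# Escape the dots so they match literally in the sed pattern.
pattern=$(printf '%s' "$prefix" | sed 's/\./\\./g')

find "$site_dir" -name '*.html' | while IFS= read -r page; do
  # Path of the page relative to the site root, e.g. "pics/gallery.html".
  rel=${page#"$site_dir"/}
  # Drop the file name, then turn each remaining directory into "../".
  up=$(printf '%s' "$rel" | sed 's|[^/]*$||; s|[^/]*/|../|g')
  # Swap the archive prefix for the relative path back to the site root.
  sed -i.bak "s|$pattern|$up|g" "$page"
done

root_link=$(cat "$site_dir/index.html")
sub_link=$(cat "$site_dir/pics/gallery.html")
printf '%s\n' "$root_link" "$sub_link"
```

A page in the site root gets an empty replacement, so this reduces to the single-directory command for a flat site.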

Post by Dutch_Master » Mon Mar 26, 2012 1:06 pm

Thanks Nelz, I'll give it a try later. :)

Post by Dutch_Master » Thu Jun 21, 2012 2:07 pm

And again I find a familiar question in LXF159 this time :D Thx guys!
(haven't pursued it yet, I had some RSI complaints in my wrist back then... :()