Linux Format forums
Help, discussion, magazine feedback and more

De-internet-archive-scripting webpages

Dutch_Master
LXF regular


Joined: Tue Mar 27, 2007 2:49 am
Posts: 2435

Posted: Sun Mar 25, 2012 10:18 pm    Post subject: De-internet-archive-scripting webpages

Some time ago I downloaded a website from the Internet Archive, as it turned out the owner had discontinued it. The archive inserts some ***** scripting that I greatly dislike, and the pages keep linking back to the archive. I could remove all instances by hand in the HTML code, but with 100+ pages I'd think there's a better (quicker!) solution. I assume sed or awk is required but, knowing nought about either, what's the best one-liner (or script, that's fine) that gets me running? (s'cuse the pun ;) )
nelz
Site admin


Joined: Mon Apr 04, 2005 12:52 pm
Posts: 8464
Location: Warrington, UK

Posted: Mon Mar 26, 2012 1:55 am

If the code is in the same place in each file, you can remove a range of lines with

Code:
for i in *.html; do
  sed -i x-yd $i
done


Where x and y are the first and last lines of the script. Otherwise we'd need to see an example to see how to identify the lines to be deleted.
_________________
"Insanity: doing the same thing over and over again and expecting different results." (Albert Einstein)
Dutch_Master
LXF regular


Joined: Tue Mar 27, 2007 2:49 am
Posts: 2435

Posted: Mon Mar 26, 2012 2:32 am

The problem is that although the archive script puts a lot of files in the same place, it also hard-codes every link in a page as an absolute link (including the http:// prefix). On top of that, I introduced some issues of my own when storing the files, which only became apparent when I opened the HTML code... Right now, I think your script will remove the bulk of the added code and I'd have to manually edit the absolute links back to relative links. Thanks again Nelz!

[edit: cried victory too soon. After replacing the x and y with numerical values I got the error
Code:
sed: -e expression #1, char 3: unknown command: `-'
I've got as far as
Code:
for i in *.html; do sed i\ {14-216}d $i; done
This clears the files completely. Luckily I've got a backup... ;)]

[edit2: here's a simple sample]
nelz
Site admin


Joined: Mon Apr 04, 2005 12:52 pm
Posts: 8464
Location: Warrington, UK

Posted: Mon Mar 26, 2012 10:34 am

My bad, I was working from failing memory: it is x,yd, not x-yd, which means something completely different.
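
With that correction, the earlier loop might look like the sketch below. The function name strip_range is made up for illustration, and it assumes the unwanted block sits on the same line numbers in every file (for example 14 and 216, the values tried above), which should be checked before running it on real pages. File names are quoted so that names containing spaces, such as "Start page.html", survive.

```shell
#!/bin/sh
# Hypothetical helper: delete lines FIRST..LAST from every .html file in DIR,
# leaving the original of each file alongside it with a .bak extension.
strip_range() {
    first=$1
    last=$2
    dir=${3:-.}
    for f in "$dir"/*.html; do
        [ -e "$f" ] || continue                 # skip if the glob matched nothing
        sed -i.bak "${first},${last}d" "$f"     # note the comma, not a dash
    done
}
```

Calling it as strip_range 14 216 . would process the current directory.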

You can also give an extension to -i and sed will create backups of the original files with that extension. This will remove the toolbar stuff and save a backup:

Code:
sed -i.bak '/BEGIN WAYBACK TOOLBAR/,/END WAYBACK TOOLBAR/d' 'Start page.html'
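
To run the same deletion over every page rather than one file, something along these lines may do. It is only a sketch: the function name strip_toolbar is invented here, and it assumes each page contains both a line matching BEGIN WAYBACK TOOLBAR and a later one matching END WAYBACK TOOLBAR. If the end marker is missing from a file, sed deletes everything from the start marker to the end of that file, so the .bak copies are worth keeping until the results have been checked.

```shell
#!/bin/sh
# Hypothetical helper: remove the Wayback toolbar block from every .html file
# under DIR (subdirectories included), keeping a .bak copy of each original.
strip_toolbar() {
    # One sed run per file, so an unmatched range cannot spill into the next file.
    find "${1:-.}" -type f -name '*.html' \
        -exec sed -i.bak '/BEGIN WAYBACK TOOLBAR/,/END WAYBACK TOOLBAR/d' {} \;
}
```

Run it as strip_toolbar . from the top of the downloaded site.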


You can change the links to relative with something like

Code:
sed 's|http://web\.archive\.org/web/20041103050546/http://web\.utanet\.at/smiderkr/asr/||g'


on all files in the same directory, but it gets messy if the pages are stored in multiple subdirectories.
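
For the subdirectory case, one possible approach is to replace the archive prefix with as many "../" steps as the file is deep. The sketch below is untested against the real site and rests on several assumptions: the function name relativize_links is invented, every link uses exactly the prefix and timestamp quoted above, and the local directory tree mirrors the original site's layout below asr/.

```shell
#!/bin/sh
# Hypothetical helper: turn archived absolute links into relative ones,
# matching the "../" prefix to each file's depth below the site root.
relativize_links() {
    root=${1:-.}
    root=${root%/}      # drop a trailing slash so the prefix strip below works
    # Archive prefix as quoted in the post; dots escaped for the regex.
    prefix='http://web\.archive\.org/web/20041103050546/http://web\.utanet\.at/smiderkr/asr/'
    find "$root" -type f -name '*.html' | while IFS= read -r f; do
        rel=${f#"$root"/}               # path relative to the site root
        up=''
        # One "../" for each directory level between the file and the root.
        while [ "$rel" != "${rel#*/}" ]; do
            up="../$up"
            rel=${rel#*/}
        done
        sed -i.bak "s|$prefix|$up|g" "$f"
    done
}
```

A link in a top-level page then loses the prefix entirely, while the same link in a page one directory down starts with "../".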
Dutch_Master
LXF regular


Joined: Tue Mar 27, 2007 2:49 am
Posts: 2435

Posted: Mon Mar 26, 2012 2:06 pm

Thanks Nelz, I'll give it a try later. :)
Dutch_Master
LXF regular


Joined: Tue Mar 27, 2007 2:49 am
Posts: 2435

Posted: Thu Jun 21, 2012 3:07 pm

And again I find a familiar question, in LXF159 this time. :D Thx guys!
(I haven't pursued it yet; I had some RSI complaints in my wrist back then... :( )
All times are GMT
Copyright 2011 Future Publishing, all rights reserved.