Scraping ascii file with redirected URL using Perl

Published by: jack 2009-01-07

  • I am trying to scrape a page that is generated on the fly by a web server. I submit consistent data to the CGI on the target machine, a report is generated on the fly, and a redirect is issued. I need the script to obtain and follow this redirect, match an "ascii download" link on the target page, and "click" this link to download the file. I was thinking of using WWW::Mechanize to achieve this. A full code example would be required.


  • Hello grabby-ga,

    There are few exact details in your question, so I will give a generic Perl script that you should be able to modify into the solution you require. This script will:

    1) Go to the page and retrieve the redirect
    2) Download the redirect page
    3) Match the text link
    4) Download the text link into a file

        #BEGIN
        #!/usr/bin/perl

        # modules to use
        use LWP::UserAgent;

        # this is the url of the page that gets redirected
        $url = "http://redirectedurl.com";

        # get this redirected page and find out where it is redirected to
        $ua = new LWP::UserAgent;
        $request = new HTTP::Request HEAD => $url;
        $response = $ua->request($request);
        $url = $response->request->uri;

        # download the redirected page
        $browser = LWP::UserAgent->new();
        $response = $browser->get($url);
        $page_content = $response->content;

        # use a regular expression to match the text file
        # this will need to be nailed down more to ensure the link is correct
        # but this is difficult to do without seeing a copy of the page it is on
        $page_content =~ m/http:\/\/(.*)\.txt/ ;
        $text_link = "http://" . $1 . ".txt";

        # download the text link page
        $browser = LWP::UserAgent->new();
        $response = $browser->get($text_link);

        # save the output
        open(OUT,">config.txt") || die $!;
        print OUT $response->content;
        close(OUT);

        # end the program
        exit(0);
        #END

    If you have any questions or need some more help in adapting this to your situation, please ask for clarification and give as much further information as you can about your exact requirements.


  • This is a great start. The actual file that I am trying to download has a name in the following format (it is in a frame, BTW): ENVOY1_SG1.NEWARK-RH-VNNN_2004_07_27_17_30_29_556.ascii The name and date change daily, but this should illustrate it somewhat.


  • How much of the name changes every day? Do you have a link to the page that you are trying to scrape? If the page you are trying to scrape is in a frame, you should be able to discover the name of the page by looking at the frameset of the pages...


  • The format of the name stays the same; only the date and the unique numeric identifier directly before .ascii change. Can you update the regex to scrape this?


  • Another way to match the ASCII file to download would be to parse the HTML in the downloaded file. This can be done something like this:

        use HTML::TokeParser;
        $stream = HTML::TokeParser->new( $response->content_ref );

        # process the tags
        while ( my $tag = $stream->get_tag('a') ) {
            # get the href from the tag
            $text_link = $tag->[1]{'href'};
            # if the href contains .ascii...
            if ( defined $text_link && $text_link =~ m/\.ascii/ ) {
                # it is the link we are looking for, so leave the while loop
                last;
            }
        }


  • I was writing my last comment as you posted yours :-) The updated regex part would require you to change these lines:

        $page_content =~ m/http:\/\/(.*)\.txt/ ;
        $text_link = "http://" . $1 . ".txt";

    to:

        $page_content =~ m/http:\/\/(.*)ENVOY1_SG1\.NEWARK-RH-VNNN_(.*)\.ascii/ ;
        $text_link = "http://" . $1 . "ENVOY1_SG1.NEWARK-RH-VNNN_" . $2 . ".ascii";

    This could be simplified further still by removing the first (.*) in the regex if the domain name and directory information also do not change.
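
A tighter variant of the regex above can be sketched by spelling out the date and identifier instead of using (.*). This is only a sketch: the helper name find_ascii_link, the sample markup, and the host.example host are made up for illustration, and the six underscore-separated date fields are an assumption based on the one file name quoted in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the first link in $html
# whose file name looks like ENVOY1_SG1.NEWARK-RH-VNNN_<date>_<id>.ascii.
sub find_ascii_link {
    my ($html) = @_;
    # Any URL prefix without quotes or whitespace, then the fixed name,
    # then YYYY_MM_DD_HH_MM_SS, then a numeric identifier, then .ascii.
    # The /i flag tolerates case differences in the report name.
    if ($html =~ m{([^\s"'<>]*ENVOY1_SG1\.NEWARK-RH-VNNN_\d{4}(?:_\d{2}){5}_\d+\.ascii)}i) {
        return $1;
    }
    return;    # no matching link
}

# Made-up sample markup; host.example is a placeholder host.
my $sample = '<a href="http://host.example/reports/'
           . 'ENVOY1_SG1.NEWARK-RH-VNNN_2004_07_27_17_30_29_556.ascii">ascii download</a>';

my $link = find_ascii_link($sample);
print defined $link ? "$link\n" : "no link found\n";
```

Because the character class stops at quotes and whitespace, the match cannot run past the end of the href the way a greedy (.*) can when several links share a line.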


  • Eek! "501 Protocol scheme 'javascript' is not supported". Anything I can do here? I tip well ;-)


  • I don't think I can help much with this error without being able to see exactly what you are doing. My guess is that the site is testing whether your script supports JavaScript when it fetches the page with LWP. I don't know if this will work, but it could be worth a try. After this line:

        $ua = new LWP::UserAgent;

    add this:

        @pretend_to_be_netscape = (
            'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
            'Accept-Language' => 'en-US',
            'Accept-Charset'  => 'iso-8859-1,*,utf-8',
            'Accept-Encoding' => 'gzip',
            'Accept'          => "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
            'Referer'         => "://www.google.com"
        );

    Then change this line:

        $response = $browser->get($url);

    to:

        $response = $browser->get($url, @pretend_to_be_netscape);

    And this line:

        $response = $browser->get($text_link);

    to:

        $response = $browser->get($text_link, @pretend_to_be_netscape);

    This should make the website think it is being queried by a Netscape browser. If this doesn't help, I am afraid I am not sure how to solve the problem.
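
LWP returns this 501 response when it is asked to fetch a URL whose scheme it has no handler for. One possible cause, which the thread does not confirm, is that the link matched on the page was a javascript: href rather than an http:// one. A minimal sketch of a check that could run before the download follows; the helper name is_fetchable and the sample hrefs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true if $href is something
# an HTTP client could fetch, false for schemes such as javascript:.
sub is_fetchable {
    my ($href) = @_;
    return 0 unless defined $href && length $href;
    # A URL scheme is a letter followed by letters, digits, "+", "." or "-",
    # terminated by a colon.
    if ($href =~ m{^\s*([A-Za-z][A-Za-z0-9+.\-]*):}) {
        my $scheme = lc $1;
        return ($scheme eq 'http' || $scheme eq 'https') ? 1 : 0;
    }
    return 1;    # no scheme: treat as a relative link
}

# Made-up hrefs for illustration.
for my $href ("javascript:openReport('x.ascii')", "http://host.example/x.ascii", "reports/x.ascii") {
    printf "%-40s %s\n", $href, is_fetchable($href) ? "fetch" : "skip";
}
```

If the matched href does turn out to be a javascript: link, the real URL usually has to be recovered from the script it calls instead of being requested directly.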


  • Thanks for the tip and rating! I'm sorry I couldn't work out the last error part for you.


  • I think I have worked out why this is barfing. When I submit the precomposed URL to the server's CGI, the page that is returned is a "Processing" holding page, and a redirect is issued 30-40 seconds later. Is this why the "$request = new HTTP::Request HEAD => $url;" doesn't work? Is there a way of doing this with WWW::Mechanize? Any other ideas?


  • If the redirect does not happen for a period of time, try looking at the source of the "processing" page (you can print it out using print $response->content;). The processing page may have a redirect command in it which will include the URL you need.
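
One common form such an in-page redirect can take is a meta refresh tag. The sketch below assumes that form, which the thread does not confirm for this server, and also assumes the http-equiv attribute appears before content. The helper name find_meta_refresh and the sample page are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the target of a
# <meta http-equiv="refresh" content="N; url=..."> tag, or nothing.
sub find_meta_refresh {
    my ($html) = @_;
    if ($html =~ m{<meta[^>]+http-equiv\s*=\s*["']?refresh["']?[^>]*content\s*=\s*["']\s*\d+\s*;\s*url\s*=\s*([^"'>\s]+)}i) {
        return $1;
    }
    return;    # no meta refresh found
}

# Made-up holding page; in the script above this would be $response->content.
my $holding_page = <<'END_HTML';
<html><head>
<meta http-equiv="refresh" content="30; url=/cgi/report?id=1">
</head><body>Processing...</body></html>
END_HTML

my $target = find_meta_refresh($holding_page);
print defined $target ? "redirect target: $target\n" : "no meta refresh in page\n";
```

If nothing is found, printing the page source as suggested above is still the quickest way to see which mechanism the server uses.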


  • Excellent, I'm almost there... This URL, when pasted in, gives me the result I am looking for! Is there a way that I can pull the following out of the HTML and assign a variable with anything after the "/cgi/" up to the apostrophe/closing parenthesis? I'm thinking regex trickery ;-)

        function reloadPage() {
            location.replace ('/cgi/nhWeb?func=viewProgress&rptHtml=users/ENVOY1_SG1/mark/stats/ENVOY1_SG1.Newark-RH-VNNN_2004_07_27_21_37_14_217/&rptPid=6232&rptTimeStamp=1090960641&rptET=routerSwitch&subjectType=element&subject=ENVOY_1_SG1.Newark-RH-VNNN&report=Standard&isShortCut=No&includeNav=');
        }


  • The regex should be something like this:

        $page_content =~ m/\/cgi\/(.*)includeNav='/ ;
        $text_link = $1 . "includeNav=";
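
The regex above relies on includeNav being the last parameter. A variant that stops at the closing apostrophe instead can be sketched as follows; the helper name extract_reload_path is hypothetical, and the sample script is a shortened, made-up version of the page quoted in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): pull the text that follows
# "/cgi/" out of a location.replace('...') call, up to the closing quote.
sub extract_reload_path {
    my ($html) = @_;
    if ($html =~ m{location\.replace\s*\(\s*'/cgi/([^']*)'}) {
        return $1;
    }
    return;    # no location.replace call found
}

# Shortened, made-up version of the holding page's script.
my $page_content = <<'END_HTML';
function reloadPage() {
    location.replace ('/cgi/nhWeb?func=viewProgress&rptPid=6232&includeNav=');
}
END_HTML

my $path = extract_reload_path($page_content);
print defined $path ? "/cgi/$path\n" : "no redirect found\n";
```

Using [^']* rather than (.*) keeps the match inside the quoted string even if the server later appends or reorders parameters.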




