I want to extract a URL from some HTML into a variable using a Perl
regex. The URL *always* ends in page0001.html,
e.g. http://some.long.protracted.url.domain/and/some/other/stuff/page0001.html
By the way, I just need the regex, not the whole script.
After greyknight's comments, do you require any further information on
this subject? If you do I would be happy to help in any way that I
can.
Assuming you have the HTML page in a variable called $html, the
following few lines will extract the URL you require:
$html =~ m/href=(.*)page0001\.html/ ;
$url_matched = $1 ;
The first line looks for an href reference in the HTML code and captures
everything between "href=" and "page0001.html" (the dot is escaped so
that it only matches a literal "."). The second line stores the captured
text in $url_matched.
$url_matched may start with a ' or " character, depending on how the
HTML code was written. If it does, the quote can be left out of the
capture by adding ' or " to the regex immediately before (.*).
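The quote handling described above can be sketched as follows. This is only an illustration: the sample link in $html is made up, since the real page is not shown in the thread. It also swaps (.*) for a character class that stops at quotes, ">" and whitespace, so the capture cannot run past the end of the attribute.

```perl
use strict;
use warnings;

# Hypothetical sample link; the real page content is not shown in the thread.
my $html = q{<a href="http://foo.bar.com/something/here/blah/page0001.html">first page</a>};

my $url_matched;
# ["']? skips an optional opening quote after href=.
# [^"'>\s]*? captures up to page0001.html without leaving the attribute value.
if ($html =~ m/href=["']?([^"'>\s]*?)page0001\.html/) {
    $url_matched = $1;
}

print defined $url_matched ? "$url_matched\n" : "no match\n";
# prints: http://foo.bar.com/something/here/blah/
```

Checking the result of the match before reading $1 matters here, because $1 keeps its value from an earlier successful match when the current one fails.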
If this is sufficient for your answer please let me know and I will
post it as an official answer.
Thanks greyknight! This method from the O'Reilly book seems a little
overboard really. I simply need to match the
http://---->/page0001.html. I'd need a code fragment to put this into a
variable. (This is a continuation of the question from yesterday!)
Palitoy, can you help?
Hello grabby
This is the regex you require.
Assuming that the above text is in a variable called $html after you
have scraped the page:
$html =~ m/replace\('(.*)page0001\.html/ ;
$url_matched = $1 ;
For your example above $url_matched will now be equal to:
http://foo.bar.com/something/here/blah/
If you need any more information on this please ask for clarification
and I will do my best to help. Similarly if you would like this
explained more fully I would be glad to help.
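If the whole URL is wanted, including the trailing page0001.html, one possible variation is to put the file name inside the capture. This is a sketch under assumptions: the sample line in $html is invented to resemble a replace('...') call, and $full_url is just an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical sample line; the page from the thread is not reproduced here.
my $html = q{<script>location.replace('http://foo.bar.com/something/here/blah/page0001.html');</script>};

my $full_url;
# Capture from "http://" up to and including page0001.html,
# without crossing quotes, angle brackets or whitespace.
if ($html =~ m{(http://[^\s'"<>]*?page0001\.html)}) {
    $full_url = $1;
}

print defined $full_url ? "$full_url\n" : "no match\n";
# prints: http://foo.bar.com/something/here/blah/page0001.html
```

Using m{...} as the delimiter avoids having to escape the slashes in "http://".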
Hmmm, this doesn't seem to work for me. Any chance of a standalone
Perl script that I can pipe the stdout of my script into so that I can
check the regex? (and my own sanity!) :-)
Again, I'll be nice with the tips :-)
Does this suit your needs?
#!/usr/bin/perl
use strict;
use warnings;
# so what are we using?
use LWP::UserAgent;
# start the browser
my $browser = LWP::UserAgent->new();
# the url of the file to fetch
# for my example I saved the html code you supplied in the clarification
# as test.html; it can, though, be anything on the internet
my $url = "http://localhost/test.html";
# get the file and stop if the request failed
my $response = $browser->get($url);
die "Could not fetch $url: ", $response->status_line, "\n"
    unless $response->is_success;
# put the page into a variable
my $html = $response->content;
# find the section you require with the regex
# (the literal "(" and "." are escaped so they are not read as regex syntax)
if ($html =~ m/replace\('(.*)page0001\.html/) {
    my $url_matched = $1;
    # print the matched url onto the screen
    print "$url_matched\n";
} else {
    print "no match\n";
}
# quit the program
exit(0);
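For checking the regex against piped output, as asked earlier in the thread, a filter that reads standard input is another option. The following is a minimal sketch, not the script from the answer: extract_prefix is a hypothetical helper name, and it assumes the page contains a replace('...page0001.html call like the one the regex above targets.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the text between replace(' and page0001.html,
# or undef when the pattern is not found.
sub extract_prefix {
    my ($html) = @_;
    # /s lets "." match newlines, so the match may span lines;
    # .*? stops at the first page0001.html after replace('
    return $html =~ m/replace\('(.*?)page0001\.html/s ? $1 : undef;
}

if (-t STDIN) {
    # Nothing was piped in, so show how the script is meant to be called.
    print STDERR "usage: your_script | perl $0\n";
} else {
    # Slurp everything from standard input into a single string.
    my $html = do { local $/; <STDIN> };
    my $prefix = extract_prefix(defined $html ? $html : '');
    print defined $prefix ? "$prefix\n" : "no match\n";
}
```

Saved under any name, it would be run as the last stage of a pipeline, with the other script's output on its standard input.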
Thanks for the 5-star rating and generous tip. If you need any more
information on regular expressions let me know.