Retrieving patent data in XML format
Published by: admin 2009-01-07
I would like to find a way to retrieve patent information from the
USPTO (United States Patent and Trademark Office) in XML format rather
than HTML format.
That is to say, the patent information available as HTML from
http://uspto.gov. I want to be able either to extract specific field
elements from XML output, or to get clean text (with no tags) from the
HTML output. I would prefer to work with XML, since its fields are
already explicitly marked up.
Hello Chesketh
This sounds like the kind of thing I would enjoy doing. Can you
clarify your question slightly?
1) How do you want to extract this information? Via a Perl or PHP script?
2) Can you please give a link to an example page of an item you wish to extract?
3) Is it correct that you wish to parse the HTML page of the result
and produce an XML file? If so which elements do you require in the
XML file?
Hi, thanks for the response,
1) I would like to extract the information with PHP script.
2) http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,670,149.WKU.&OS=PN/6,670,149&RS=PN/6,670,149
This HTML output contains particular "fields" such as inventor,
assignee, patent number, abstract, etc., which I would like to cleanly
extract.
3) I would rather not have to deal with HTML if it is possible to get
XML directly from the USPTO, but I think this is only possible with
European patent services. That said, yes, I would like to parse the
HTML, extracting several fields. Five examples are:
a) Patent #
b) Title
c) inventor(s)
d) assignee(s)
e) abstract
Best regards,
Christian
Hello again
I have written this small script for you to parse the pages on the
USPTO website to an "XML-style" page that can be copied and pasted
into Notepad or similar software.
In the print statements I have escaped the < and > characters so that
they appear in your web browser when the program is run. Should you
not require this, simply change each &lt; to < and each &gt; to >.
I have not had too much time to make the script pretty but please take
a look over it and post any clarifications or changes that you require
to be done here and I will look at them in the morning when I am
fresher.
<?php
$siteurl = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,670,149.WKU.&OS=PN/6,670,149&RS=PN/6,670,149";
$siteinfo = get_file($siteurl);
// patent number: taken from the <title> of the page
preg_match ('/<title>United\sStates\sPatent:\s(.*)<\/title>/imU',$siteinfo,$out);
print "&lt;item&gt; &lt;patent&gt;" . $out[1] . "&lt;/patent&gt; ";
// patent title: assumed here to be the first <font size="+1"> element
preg_match ('/<font size="\+1">(.*)<\/font>/imU',$siteinfo,$out);
print "&lt;title&gt;" . $out[1] . "&lt;/title&gt; ";
// abstract: assumed here to be the first <p> paragraph after the word Abstract
preg_match ('/Abstract.*<p>(.*)<\/p>/imU',$siteinfo,$out);
print "&lt;abstract&gt;" . $out[1] . "&lt;/abstract&gt; ";
// inventors and assignee: the first and second matching table cells
preg_match_all ('/<TD ALIGN="LEFT" WIDTH="90%">(.*)<\/TD>/imU',$siteinfo,$out, PREG_PATTERN_ORDER);
print "&lt;inventors&gt;" . strip_tags($out[0][0]) . "&lt;/inventors&gt; ";
print "&lt;assignee&gt;" . strip_tags($out[0][1]) .
"&lt;/assignee&gt; &lt;/item&gt;";
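function get_file($filename) {
$file = file ($filename);
// remove carriage returns, tabs and line feeds
$lines = preg_replace('/[\r\t\n]/',"",join("",$file));
// collapse runs of whitespace into a single space
$lines = preg_replace('/\s{1,}/'," ",$lines);
return $lines;
}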
?>
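To see how these patterns behave, here is a minimal sketch run against a small hand-written HTML fragment rather than the live USPTO page. The sample markup and the helper name `extract_first` are assumptions made for illustration; the real page layout may differ.

```php
<?php
// Hypothetical sample standing in for a fetched patent page (not real USPTO output).
$sample = '<html><head><title>United States Patent: 1,234,567</title></head>'
        . '<body><TABLE><TR><TD ALIGN="LEFT" WIDTH="90%"><B>Doe; Jane</B></TD></TR>'
        . '<TR><TD ALIGN="LEFT" WIDTH="90%"><B>Example Corp.</B></TD></TR></TABLE></body></html>';

// Hypothetical helper: return the first capture group of $pattern, or null if no match.
function extract_first($pattern, $html) {
    if (preg_match($pattern, $html, $out) === 1) {
        return trim(strip_tags($out[1]));
    }
    return null;
}

// i = case-insensitive, m = multi-line anchors, U = ungreedy quantifiers,
// so (.*) stops at the first closing tag instead of the last one.
$number = extract_first('/<title>United\sStates\sPatent:\s(.*)<\/title>/imU', $sample);

// preg_match_all collects every matching cell; index 1 holds the captured text.
preg_match_all('/<TD ALIGN="LEFT" WIDTH="90%">(.*)<\/TD>/imU', $sample, $cells, PREG_PATTERN_ORDER);
$inventors = trim(strip_tags($cells[1][0]));
$assignee  = trim(strip_tags($cells[1][1]));

print "$number | $inventors | $assignee\n";
```

Without the U modifier, the cell pattern would run from the first opening tag to the last closing tag and return both cells as a single match.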
Thanks Palitoy. My only other requirements are that the script is
commented so I can understand and expand it, and that each element is
written to a variable so that it can be written to MySQL. Also, the
XML file should be saved with the filename uspto-[patent #].
Thanks again,
Christian
Hi Christian
I have updated the script with the other items you requested. I have
not added the MySQL database section as this would require knowing the
structure, username/password and elements of your databases. This
should be fairly easy to add on as all the patent elements are now
stored in variables for your use.
To run the script you now also need to pass it a URL in the browser -
this is so you can use the same script for every patent you wish to
look at (rather than saving multiple copies of the same script with
just the URL changed). The syntax of this is:
http://nameofyourscript.php?url=http://whichpatent
If you have any further questions please ask and I will do my best to help again.
<?php
// retrieve the URL for the patent from the browser address line
$siteurl = $_GET['url'];
// if no URL is given...
if ( $siteurl == "" ) {
// print out a message and quit the program here
print "No URL was given. To give a URL add
'?url=http://site.com' to the end of your URL (substituting the URL of
your choice for site.com). ";
exit(0);
};
// get the patent information and store it in a variable called $siteinfo
$siteinfo = get_file($siteurl);
// find the patent number - it is always in the <title> of the page
preg_match ('/<title>United\sStates\sPatent:\s(.*)<\/title>/imU',$siteinfo,$out);
// store the patent number in a variable and clean it up
$patent_number = clean($out[1]);
// find the title of the patent - assumed here to be the first <font size="+1"> element
preg_match ('/<font size="\+1">(.*)<\/font>/imU',$siteinfo,$out);
$patent_title = clean($out[1]);
// find the abstract of the patent - it is always the first paragraph after the word abstract
// (assumed here to be wrapped in <p> tags)
preg_match ('/Abstract.*<p>(.*)<\/p>/imU',$siteinfo,$out);
$patent_abstract = clean($out[1]);
/* find the inventor and assignee data - these are stored in a table
with this html cell information,
the first match is the inventor data and the second the assignee data */
preg_match_all ('/<TD ALIGN="LEFT" WIDTH="90%">(.*)<\/TD>/imU',$siteinfo,$out, PREG_PATTERN_ORDER);
$patent_inventors = clean($out[0][0]);
$patent_assignees = clean($out[0][1]);
// create a variable for the XML file including the data found
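$xmlData = "<item><patent>$patent_number</patent><title>$patent_title</title><abstract>$patent_abstract</abstract><inventors>$patent_inventors</inventors><assignee>$patent_assignees</assignee></item>
";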
// save the file to a location using the patent number (with the commas removed)
if($file=fopen("uspto-" . str_replace(",","",$patent_number) . ".xml", "w")) { // open file for writing
fwrite($file, $xmlData); // write to file
fclose($file); // close the file, only if it was opened successfully
};
/* add mysql data storage here using the following variables in the
SQL INSERT statement
patent number = $patent_number
patent title = $patent_title
patent abstract = $patent_abstract
patent inventors = $patent_inventors
patent assignees = $patent_assignees */
// print something out on the screen so you know the process has finished.
print "Process completed. " . htmlentities($xmlData) . " ";
// a function to retrieve the patent from the internet
function get_file($filename) {
// get the info and store it in $file
$file = file ($filename);
// tidy up the information by removing unwanted line feeds, multiple spaces etc
$lines = preg_replace('/[\r\t\n]/',"",join("",$file));
$lines = preg_replace('/\s{1,}/'," ",$lines);
// give the cleaned up information back to the variable calling this function
return $lines;
}
// a function to clean the information passed to it
function clean($str) {
// remove any html tags from the data given
$str = strip_tags($str);
// remove any whitespace at the beginning and end of the data given
$str = ltrim($str);
$str = rtrim($str);
// give the cleaned up info back to the variable calling this function
return $str;
}
?>
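Because the fields are placed straight into the XML string, a title or assignee containing & or < would produce a file that XML parsers reject. Here is a minimal sketch of one way to escape the values first. The function name `build_patent_xml` is hypothetical, the element names are assumed to match the ones used in the script, and the field values are made up.

```php
<?php
// Hypothetical helper: build the <item> record with XML-escaped field values.
function build_patent_xml(array $fields) {
    $xml = "<item>";
    foreach ($fields as $name => $value) {
        // ENT_XML1 escapes &, <, > and quotes in a form valid for XML
        $safe = htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $xml .= "<$name>$safe</$name>";
    }
    return $xml . "</item>\n";
}

// Made-up values, for illustration only.
$patent_number = '1,234,567';
$xmlData = build_patent_xml(array(
    'patent'   => $patent_number,
    'title'    => 'Widgets & gadgets',
    'assignee' => 'Example Corp.',
));

// File name as requested in the thread: uspto-[patent #], with the commas removed.
$filename = "uspto-" . str_replace(",", "", $patent_number) . ".xml";
print $filename . "\n" . $xmlData;
```

The same escaped values could then be bound to the parameters of an SQL INSERT statement when the MySQL step is added.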
Palitoy, I'm having a few problems with the script. When I run it, I
get the following output:
Warning: file(http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1):
failed to open stream: HTTP request failed! in
/home/chesketh/public_html/getpatent.php on line 64
Warning: join(): Bad arguments. in
/home/chesketh/public_html/getpatent.php on line 66
I have chmodded the directory to 777 which cleared up some issues, but
I'm still left with these puzzling statements. I'm not sure if the
'&' character in the URL is causing problems...
Thanks in advance
Process completed.
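The second warning (join(): Bad arguments) is a knock-on effect of the first: file() returns false when the request fails, and that value is then handed to join(). Here is a minimal sketch of a more defensive variant, split so the clean-up step can be tried without a network connection. The names `normalize_whitespace` and `get_file_or_null` are hypothetical, and the path is deliberately one that does not exist.

```php
<?php
// Hypothetical helper: strip CR/tab/LF characters, then collapse runs of whitespace.
function normalize_whitespace($text) {
    $text = preg_replace('/[\r\t\n]/', '', $text);
    return preg_replace('/\s{1,}/', ' ', $text);
}

// Hypothetical variant of get_file(): returns null instead of passing false on to join().
function get_file_or_null($filename) {
    // @ suppresses the PHP warning; the false return value is checked instead
    $lines = @file($filename);
    if ($lines === false) {
        return null;
    }
    return normalize_whitespace(join("", $lines));
}

$page = get_file_or_null('/no/such/file-for-this-example');
if ($page === null) {
    print "Could not fetch the page.\n";
}
print normalize_whitespace("a\r\n  b\t c") . "\n";
```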
Thank-you for the 5-star rating and tip!
With regard to your problem, I think you are right: the & is causing
the problem. I forgot those were there. There is an easy remedy to
this...
Change:
// retrieve the URL for the patent from the browser address line
$siteurl = $_GET['url'];
To:
// retrieve the URL for the patent from the browser address line
$siteurl = "";
// get each part of the URL and then stitch it back together
foreach ($_GET as $key => $val) {
if ( $siteurl == "" ) { $siteurl = $val; }
else { $siteurl = $siteurl . "&" . $key . "=" . $val; }
}
This should fix the problem, sorry about that...
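An alternative that avoids stitching the pieces back together is to URL-encode the patent URL before placing it in the query string, so that its own & characters are not read as separators. Here is a minimal sketch. The script name getpatent.php is taken from the warnings in this thread, the host example.com is a placeholder, and the patent URL is shortened for readability.

```php
<?php
// Patent URL containing its own & separators (shortened here for readability).
$patentUrl = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL";

// urlencode() turns & into %26, ? into %3F, and so on.
$requestUrl = "http://example.com/getpatent.php?url=" . urlencode($patentUrl);
print $requestUrl . "\n";

// What the receiving script would see: parse_str() decodes a query string
// in the same way PHP does when it fills $_GET.
$query = parse_url($requestUrl, PHP_URL_QUERY);
parse_str($query, $params);
print $params['url'] . "\n";
```

With the URL encoded this way, the original `$siteurl = $_GET['url'];` line would receive the whole patent URL in one piece.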
Thanks again, just one more problem. I get the following output (also
note that the inventor and assignee fields are empty):
Warning: fopen(uspto-6727353.xml): failed to open stream: Permission
denied in /home/chesketh/public_html/getpatent.php on line 53
Warning: fclose(): supplied argument is not a valid stream resource in
/home/chesketh/public_html/getpatent.php on line 56
Process completed.
<item><patent>6,727,353</patent><title>Nucleic acid encoding
Kv10.1, a voltage-gated potassium channel from human
brain</title><abstract>The invention provides isolated nucleic acid
and amino acid sequences of Slo potassium family members such as,
antibodies to Kv10 subfamily members such as Kv10.1, methods of
detecting Kv10, subfamily members such as Kv10.1, methods of screening
for potassium channel activators and inhibitors using biologically
active Kv10 subfamily members such as Kv10.1, and kits for screening
for activators and inhibitors of voltage-gated potassium channels
comprising Kv10 subfamily members such as
Kv10.1.</abstract><inventors></inventors><assignee></assignee></item>
Sorry for the trouble, I have solved the file writing problem (just a
chmod oversight), but the inventor and assignee fields remain empty. It
doesn't seem to be pulling the information properly from the table...
I have just tried to pull that patent myself here and it seems to
work... I used this URL:
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,727,353.WKU.&OS=PN/6,727,353&RS=PN/6,727,353
In your script, is this line all on one line?
preg_match_all ('/<TD ALIGN="LEFT" WIDTH="90%">(.*)<\/TD>/imU',$siteinfo,$out, PREG_PATTERN_ORDER);
If not, it should be :-)
I forgot to mention there should also be a single space between the
ALIGN="LEFT" and the WIDTH="90%" parts.
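Since the match depends on the exact spacing between the two attributes, one option is to let the pattern accept any run of whitespace there. Here is a minimal sketch against two made-up table cells. The sample markup is an assumption for illustration, not real USPTO output.

```php
<?php
// Two hypothetical cells: one with a single space, one broken across a line.
$html = '<TD ALIGN="LEFT" WIDTH="90%">Doe; Jane</TD>'
      . "<TD ALIGN=\"LEFT\"\n   WIDTH=\"90%\">Example Corp.</TD>";

// Strict pattern: exactly one space between the attributes.
$strict = '/<TD ALIGN="LEFT" WIDTH="90%">(.*)<\/TD>/imU';
// Tolerant pattern: \s+ accepts spaces, tabs and line breaks.
$tolerant = '/<TD\s+ALIGN="LEFT"\s+WIDTH="90%">(.*)<\/TD>/imU';

$strictCount   = preg_match_all($strict, $html, $a);
$tolerantCount = preg_match_all($tolerant, $html, $b);

print "strict: $strictCount, tolerant: $tolerantCount\n";
```

The strict pattern finds only the first cell, while the tolerant one finds both, which is why a stray line break in the pasted script leaves the inventor and assignee fields empty.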
Thanks palitoy, that did it, the script works perfectly, thanks again,
I'll keep you in mind for my next project. All the best...
I'm glad I could help! If you do want to hire me again through Google
Answers simply ask for palitoy-ga in the question title and I will
probably see it (I usually check the site at least a couple of times a
day).