Retrieving patent data in XML format
Published by: admin 2009-01-07
imU",$siteinfo,$out, PREG_PATTERN_ORDER);
print "&lt;inventors&gt;" . strip_tags($out[0][0]) . "&lt;/inventors&gt;<br>";
print "&lt;assignee&gt;" . strip_tags($out[0][1]) . "&lt;/assignee&gt;<br>&lt;/item&gt;";

function get_file($filename) {
  $file = file ($filename);
  $lines = preg_replace("/[\r\t\n]/","",join("",$file));
  $lines = preg_replace("/\s{1,}/"," ",$lines);
  return $lines;
}
?>

  • Thanks Palitoy. The only other requirements I have are that the script is commented so I can understand and expand it, that each element is written to a variable so it can be written to MySQL, and that the XML file is saved with the filename uspto-[patent #]. Thanks again, Christian


  • Hi Christian, I have updated the script with the other items you requested. I have not added the MySQL database section, as this would require knowing the structure, username/password and elements of your database. It should be fairly easy to add, as all the patent elements are now stored in variables for your use. To run the script you now also need to pass it a URL in the browser. This is so you can use the same script for every patent you wish to look at (rather than saving multiple copies of the same script with just the URL changed). The syntax is: http://nameofyourscript.php?url=http://whichpatent If you have any further questions please ask and I will do my best to help again.

    <?php
    // retrieve the URL for the patent from the browser address line
    $siteurl = $_GET['url'];
    // stop with a message if no URL was given
    if ( $siteurl == "" ) {
      print "<p>No URL was given.</p><p>To give a URL add '?url=http://site.com' to the end of your URL (substituting the URL of your choice for site.com).</p>";
      exit(0);
    };
    // get the patent information and store it in a variable called $siteinfo
    $siteinfo = get_file($siteurl);
    // find the patent number - it is always in the title of the patent
    preg_match ("/<TITLE>United\sStates\sPatent:\s(.*)<\/TITLE>/imU",$siteinfo,$out);
    // store the patent number in a variable and clean it up
    $patent_number = clean($out[1]);
    // find the title of the patent - it is always the first line with these html tags
    // (the html tags around (.*) are missing from this copy of the post)
    preg_match ("(.*)imU",$siteinfo,$out);
    $patent_title = clean($out[1]);
    // find the abstract of the patent - it is always the first paragraph after the word abstract
    // (the html tags in this pattern are missing from this copy of the post)
    preg_match ("Abstract(.*)imU",$siteinfo,$out);
    $patent_abstract = clean($out[1]);
    /* find the inventor and assignee data - these are stored in a table with
       this html cell information, the first match is the inventor data and
       the second the assignee data
       (the html cell pattern is missing from this copy of the post) */
    preg_match_all ("imU",$siteinfo,$out, PREG_PATTERN_ORDER);
    $patent_inventors = clean($out[0][0]);
    $patent_assignees = clean($out[0][1]);
    // create a variable for the XML file including the data found
    $xmlData = "<item><patent>$patent_number</patent><title>$patent_title</title><abstract>$patent_abstract</abstract><inventors>$patent_inventors</inventors><assignee>$patent_assignees</assignee></item>";
    // save the file to a location using the patent number (with the commas removed)
    if($file=fopen("uspto-" . str_replace(",","",$patent_number) . ".xml", "w")) { // open file for writing
      fwrite($file, $xmlData); // write to file
    };
    fclose($file);
    /* add mysql data storage here using the following variables in the SQL INSERT statement
       patent number    = $patent_number
       patent title     = $patent_title
       patent abstract  = $patent_abstract
       patent inventors = $patent_inventors
       patent assignees = $patent_assignees
    */
    // print something out on the screen so you know the process has finished.
    print "<p>Process completed.</p><p>" . htmlentities($xmlData) . "</p>";

    // a function to retrieve the patent from the internet
    function get_file($filename) {
      // get the info and store it in $file
      $file = file ($filename);
      // tidy up the information by removing unwanted line feeds, multiple spaces etc
      $lines = preg_replace("/[\r\t\n]/","",join("",$file));
      $lines = preg_replace("/\s{1,}/"," ",$lines);
      // give the cleaned up information back to the variable calling this function
      return $lines;
    }

    // a function to clean the information passed to it
    function clean($str) {
      // remove any html tags from the data given
      $str = strip_tags($str);
      // remove any whitespace at the beginning and end of the data given
      $str = ltrim($str);
      $str = rtrim($str);
      // give the cleaned up info back to the variable calling this function
      return $str;
    }
    ?>

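Patent titles and abstracts can contain characters such as & or <, which would make the saved file invalid XML if written out unescaped. A minimal sketch of building the document with escaping, assuming PHP 5.4 or later for ENT_XML1. The helpers build_patent_xml and patent_filename are hypothetical names, not part of Palitoy's script; the element names follow the print statements in the first script, and the abstract text is a placeholder.

```php
<?php
// Hypothetical helper: wraps already-extracted fields in XML elements,
// escaping &, < and > in each value so the result stays well-formed.
function build_patent_xml(array $fields)
{
    $xml = "<item>";
    foreach ($fields as $name => $value) {
        $xml .= "<$name>" . htmlspecialchars($value, ENT_XML1, "UTF-8") . "</$name>";
    }
    return $xml . "</item>";
}

// Hypothetical helper: uspto-[patent #].xml with the commas removed,
// the same naming the script uses.
function patent_filename($patent_number)
{
    return "uspto-" . str_replace(",", "", $patent_number) . ".xml";
}

$fields = array(
    "patent"   => "6,727,353",
    "title"    => "Nucleic acid encoding Kv10.1, a voltage-gated potassium channel from human brain",
    "abstract" => "Placeholder text with activators & inhibitors", // placeholder, not the real abstract
);

print patent_filename($fields["patent"]) . "\n";
print build_patent_xml($fields) . "\n";
```

In the script, the result of build_patent_xml could be passed to fwrite in place of $xmlData.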
  • Palitoy, I'm having a few problems with the script. When I run it, I get the following output:

    Warning: file(http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1): failed to open stream: HTTP request failed! in /home/chesketh/public_html/getpatent.php on line 64
    Warning: join(): Bad arguments. in /home/chesketh/public_html/getpatent.php on line 66
    Process completed.

    I have chmodded the directory to 777, which cleared up some issues, but I'm still left with these puzzling statements. I'm not sure if the '&' character in the URL is causing problems... Thanks in advance
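The second warning follows from the first: file() returns false when the request fails, and join() is then handed false instead of an array. A minimal sketch of a more defensive fetch, assuming PHP 7.1 or later. The names tidy_whitespace and get_file_checked are hypothetical and not part of Palitoy's script, and file_get_contents is swapped in for file() plus join().

```php
<?php
// Hypothetical helper: the same clean-up get_file() performs.
function tidy_whitespace(string $text): string
{
    // remove carriage returns, tabs and line feeds
    $text = preg_replace("/[\r\t\n]/", "", $text);
    // collapse any remaining run of whitespace into a single space
    return preg_replace("/\s{1,}/", " ", $text);
}

// Hypothetical variant of get_file(): returns null when the page
// cannot be read instead of passing false on to later calls.
function get_file_checked(string $filename): ?string
{
    // @ silences the PHP warning; the false return value is checked instead
    $contents = @file_get_contents($filename);
    if ($contents === false) {
        return null;
    }
    return tidy_whitespace($contents);
}

// Small demonstration of the clean-up step on a literal string
print tidy_whitespace("line one\r\n   line two") . "\n";
```

In the script, calling get_file_checked($siteurl) and testing the result for null would allow it to print a clear message and exit before any pattern matching runs.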


  • Thank you for the 5-star rating and tip! With regards to your problem, I think you are right: the & is causing a problem. I forgot those were there. There is an easy remedy to this...

    Change:

    // retrieve the URL for the patent from the browser address line
    $siteurl = $_GET['url'];

    To:

    // retrieve the URL for the patent from the browser address line
    $siteurl = "";
    // get each part of the URL and then stitch it back together
    while(list($key, $val) = each($HTTP_GET_VARS)){
      if ( $siteurl == "" ) {
        $siteurl = $val;
      } else {
        $siteurl = $siteurl . "&" . $key . "=" . $val;
      };
    }

    This should fix the problem, sorry about that...
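On current PHP versions the loop above needs adjusting, since $HTTP_GET_VARS and each() are no longer available (removed in PHP 5.4 and PHP 8.0 respectively). A sketch of the same stitching logic with foreach follows. The helper name rebuild_patent_url is hypothetical, and the shortened URL is only an illustration of what $_GET would hold.

```php
<?php
// Hypothetical helper: rebuilds the patent URL from the query parameters
// that PHP split apart at each "&" in "?url=...&Sect2=...".
function rebuild_patent_url(array $params)
{
    $siteurl = "";
    foreach ($params as $key => $val) {
        if ($siteurl === "") {
            // the first parameter is "url"; its value is the start of the patent URL
            $siteurl = $val;
        } else {
            // every later parameter was cut off at an "&", so stitch it back on
            $siteurl .= "&" . $key . "=" . $val;
        }
    }
    return $siteurl;
}

// Illustration: roughly what $_GET contains for a shortened patent URL
$params = array(
    "url"   => "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1",
    "Sect2" => "HITOFF",
    "d"     => "PALL",
);

print rebuild_patent_url($params) . "\n";
```

In the script, `$siteurl = rebuild_patent_url($_GET);` would take the place of the while loop.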


  • Thanks again. Just one more problem: I get the following output (also note that the inventor and assignee fields are empty):

    Warning: fopen(uspto-6727353.xml): failed to open stream: Permission denied in /home/chesketh/public_html/getpatent.php on line 53
    Warning: fclose(): supplied argument is not a valid stream resource in /home/chesketh/public_html/getpatent.php on line 56
    Process completed.
    6,727,353Nucleic acid encoding Kv10.1, a voltage-gated potassium channel from human brainThe invention provides isolated nucleic acid and amino acid sequences of Slo potassium family members such as, antibodies to Kv10 subfamily members such as Kv10.1, methods of detecting Kv10, subfamily members such as Kv10.1, methods of screening for potassium channel activators and inhibitors using biologically active Kv10 subfamily members such as Kv10.1, and kits for screening for activators and inhibitors of voltage-gated potassium channels comprising Kv10 subfamily members such as Kv10.1.


  • Sorry for the trouble. I have solved the file writing problem (just a chmod oversight), but the inventor and assignee fields remain empty. It doesn't seem to be pulling the information properly from the table...


  • I have just tried to pull that patent myself here and it seems to work... I used this URL:

    http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,727,353.WKU.&OS=PN/6,727,353&RS=PN/6,727,353

    In your script, is this line all on one line?

    preg_match_all ("[...]imU",$siteinfo,$out, PREG_PATTERN_ORDER);

    If not, it should be :-)

  • I forgot to mention there should also be a single space between the ALIGN="LEFT" and the WIDTH="90%" parts.
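Because the pattern has to match the page's HTML character for character, a spacing difference like this one makes it fail without any error. One way to make it less brittle is sketched below under stated assumptions: the cell is taken to be a TD tag, the sample markup and the two names in it are invented placeholders rather than real USPTO output, and \s+ accepts any run of whitespace between the attributes.

```php
<?php
// Invented sample markup (an assumption); note the two spaces between
// the attributes in the first cell and the single space in the second.
$siteinfo = '<TR><TD ALIGN="LEFT"  WIDTH="90%"><B>Inventor Name</B></TD></TR>'
          . '<TR><TD ALIGN="LEFT" WIDTH="90%"><B>Assignee Name</B></TD></TR>';

// \s+ matches one or more whitespace characters, so the pattern no longer
// depends on exactly one space between ALIGN and WIDTH.
// i = ignore case, m = multiline, s = dot matches newlines, U = ungreedy.
preg_match_all('/<TD\s+ALIGN="LEFT"\s+WIDTH="90%">(.*)<\/TD>/imsU', $siteinfo, $out, PREG_PATTERN_ORDER);

// $out[1] holds the captured cell contents; the first match is treated as
// the inventor data and the second as the assignee data, as in the script.
$patent_inventors = trim(strip_tags($out[1][0]));
$patent_assignees = trim(strip_tags($out[1][1]));

print $patent_inventors . "\n";
print $patent_assignees . "\n";
```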


  • Thanks palitoy, that did it, the script works perfectly, thanks again, I'll keep you in mind for my next project. All the best...


  • I'm glad I could help! If you do want to hire me again through Google Answers simply ask for palitoy-ga in the question title and I will probably see it (I usually check the site at least a couple of times a day).





  • I would like to find a way to retrieve patent information from the USPTO (US Patent and Trademark Office) in XML format rather than HTML format.


  • That is to say, given the HTML patent information available from http://uspto.gov, I want to be able to extract specific field elements either from XML output or, as clean text with no tags, from the HTML output. I would prefer to work with XML for obvious reasons.


  • Hello Chesketh, this sounds like the kind of thing I would enjoy doing. Can you clarify your question slightly?

    1) How do you want to extract this information? Via a Perl or PHP script?
    2) Can you please give a link to an example page of an item you wish to extract?
    3) Is it correct that you wish to parse the HTML page of the result and produce an XML file? If so, which elements do you require in the XML file?


  • Hi, thanks for the response.

    1) I would like to extract the information with a PHP script.
    2) http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,670,149.WKU.&OS=PN/6,670,149&RS=PN/6,670,149 This HTML output contains particular "fields" such as inventor, assignee, patent number, abstract etc., which I would like to cleanly extract.
    3) I would rather not have to deal with HTML if it is possible to get XML directly from the USPTO, but I think this is only possible with European patent services. That said, yes, I would like to parse the HTML, extracting several fields. Five examples are:
       a) Patent #
       b) Title
       c) Inventor(s)
       d) Assignee(s)
       e) Abstract

    Best regards, Christian


  • Hello again, I have written this small script for you to parse the pages on the USPTO website to an "XML-style" page that can be copied and pasted into Notepad or similar software. In the print statements I have escaped the < and > characters so that they appear in your web browser when the program is run. Should you not require this, simply alter the &lt; to < and &gt; to >. I have not had too much time to make the script pretty, but please take a look over it and post any clarifications or changes that you require here, and I will look at them in the morning when I am fresher.

    // (the opening lines of the script and the html tags inside the later patterns are missing from this copy of the post)
    preg_match ("/<TITLE>United\sStates\sPatent:\s(.*)<\/TITLE>/imU",$siteinfo,$out);
    print "&lt;item&gt;<br>&lt;patent&gt;" . $out[1] . "&lt;/patent&gt;<br>";
    preg_match ("(.*)imU",$siteinfo,$out);
    print "&lt;title&gt;" . $out[1] . "&lt;/title&gt;<br>";
    preg_match ("Abstract(.*)imU",$siteinfo,$out);
    print "&lt;abstract&gt;" . $out[1] . "&lt;/abstract&gt;<br>";
    preg_match_all ("(.*)(.*)(.*)