<rdf:RDF xmlns:burst="http://xmlns.com/burst/0.1/" xmlns:admin="http://webns.net/mvcb/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:cc="http://web.resource.org/cc/" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:swrc="http://swrc.ontoware.org/ontology#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><channel rdf:about="http://www.bibsonomy.org/burst/user/neilernst/extraction"><title>BibSonomy publications for /user/neilernst/extraction</title><link>http://www.bibsonomy.org/burst/user/neilernst/extraction</link><description>BibSonomy BuRST Feed for /user/neilernst/extraction</description><dc:date>2008-10-16T06:50:58+02:00</dc:date><items><rdf:Seq><rdf:li rdf:resource="http://www.bibsonomy.org/bibtex/2ee6de663fac26a3777328c769ca3cc70/neilernst"/></rdf:Seq></items></channel><item rdf:about="http://www.bibsonomy.org/bibtex/2ee6de663fac26a3777328c769ca3cc70/neilernst"><title>Information extraction: distilling structured data from unstructured text</title><description>An overview of "Information extraction"</description><link>http://www.bibsonomy.org/bibtex/2ee6de663fac26a3777328c769ca3cc70/neilernst</link><dc:creator>neilernst</dc:creator><dc:date>2008-02-18T18:11:47+01:00</dc:date><dc:subject>information unstructured extraction </dc:subject><content:encoded>&lt;span style=&#034;color:#555555;&#034;&gt;A. 
&lt;a href=&#034;http://www.bibsonomy.org/author/McCallum&#034;&gt;McCallum&lt;/a&gt;  &lt;/span&gt;&lt;em&gt;Queue&lt;/em&gt;&lt;em&gt;3(9):48--57&lt;/em&gt;(&lt;em&gt;2005&lt;/em&gt;)</content:encoded><taxo:topics><rdf:Bag><rdf:li rdf:resource="http://www.bibsonomy.org/tag/information"/><rdf:li rdf:resource="http://www.bibsonomy.org/tag/unstructured"/><rdf:li rdf:resource="http://www.bibsonomy.org/tag/extraction"/></rdf:Bag></taxo:topics><burst:publication><rdf:Description rdf:about="http://www.bibsonomy.org/bibtex/2ee6de663fac26a3777328c769ca3cc70/neilernst"><owl:sameAs rdf:resource="http://www.bibsonomy.org/uri/bibtex/2ee6de663fac26a3777328c769ca3cc70/neilernst"/><rdf:type rdf:resource="http://swrc.ontoware.org/ontology#Article"/><owl:sameAs rdf:resource="http://portal.acm.org/citation.cfm?id=1105679"/><swrc:date>Mon Feb 18 18:11:47 CET 2008</swrc:date><swrc:journal>Queue</swrc:journal><swrc:number>9</swrc:number><swrc:pages>48--57</swrc:pages><swrc:publisher><swrc:Organization swrc:name="ACM"/></swrc:publisher><swrc:title>Information extraction: distilling structured data from unstructured text</swrc:title><swrc:volume>3</swrc:volume><swrc:year>2005</swrc:year><swrc:keywords>information unstructured extraction </swrc:keywords><swrc:abstract>In 2001 the U.S. Department of Labor was tasked with building a Web site that would help people find continuing education opportunities at community colleges, universities, and organizations across the country. The department wanted its Web site to support fielded Boolean searches over locations, dates, times, prerequisites, instructors, topic areas, and course descriptions. Ultimately it was also interested in mining its new database for patterns and educational trends. 
This was a major data-integration project, aiming to automatically gather detailed, structured information from tens of thousands of individual institutions every three months. The first and biggest problem was that much of the data wasn&#039;t available even in semi-structured form, much less normalized, structured form. Although some of the larger organizations had internal databases of their course listings, almost none of them had publicly available interfaces to their databases. The only universally available public interfaces were Web pages designed for human browsing. Unfortunately, but as expected, each organization used different text formatting. Some of these Web pages contained two-dimensional text tables; many others used a stylized collection of paragraphs for each course offering; still others had a single paragraph of English prose containing all the information about each course. The task thus required extracting structured information from English that had been formatted in a mixture of two-dimensional layout and free-running prose--a daunting technical challenge, but one that was ultimately solved successfully. More details about the solution follow, but first, let&#039;s place this problem in context.</swrc:abstract><swrc:hasExtraField><swrc:Field swrc:value="1542-7730" swrc:key="issn"/></swrc:hasExtraField><swrc:hasExtraField><swrc:Field swrc:value="http://doi.acm.org/10.1145/1105664.1105679" swrc:key="doi"/></swrc:hasExtraField><swrc:author><rdf:Seq><rdf:_1><swrc:Person swrc:name="A. McCallum"/></rdf:_1></rdf:Seq></swrc:author></rdf:Description></burst:publication></item></rdf:RDF>