OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. This server allows you to use the system through your web browser.
hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.
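To make the idea of OCR data embedded invisibly in HTML concrete, here is a minimal sketch in Python that reads word bounding boxes back out of an hOCR fragment. It assumes the common hOCR convention of a `class` such as `ocrx_word` plus a `title` attribute holding `bbox x0 y0 x1 y1`; the sample markup and the names `parse_bbox`, `HocrWordParser`, and `extract_words` are made up for illustration and are not part of any OCRopus API.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title such as 'bbox 10 20 120 50; x_wconf 96'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from <span class='ocrx_word'> elements.

    Simplification: assumes word spans contain no nested <span> elements.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment, invented for this example.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 10 20 300 50'>
    <span class='ocrx_word' title='bbox 10 20 120 50; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 130 20 300 50; x_wconf 91'>world</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Because the layout data sits in ordinary attributes, a browser renders the fragment as plain text while a script like this one can still recover the geometry.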
The purpose of this document is to define an open standard for representing OCR results. The goal is to reuse as much existing technology as possible, and to arrive at a representation that makes it easy to reuse OCR results.
TopQuadrant's TopBraid™ products support RDFa both as an input format (via third-party RDFa readers) and as an output format, using an RDFa editor. Input supports HTML via JTidy. We are currently upgrading our RDFa editor to have better WYSIWYG capabilities.
Microformats are a way to embed specific semantic data into the HTML that we use today. One of the first questions an XML guru might ask is "Why use HTML when XML lets you create the same semantics?" I won't go into all the reasons XML might be a better or worse choice for encoding data or why microformats have chosen to use HTML as their encoding base. This article will focus more on how to extract microformats data from the HTML, how the basic parsing rules work, and how they differ from XML.
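As a rough illustration of what extracting microformats data from HTML can look like, the following Python sketch pulls a few hCard properties out of a page by class name. It is not the parsing algorithm from the article: it assumes well-nested markup, handles only the `fn`, `org`, and `url` properties, and the names `HCardParser` and `extract_hcards` and the sample markup are invented for this example.

```python
from html.parser import HTMLParser

# Elements that never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Small subset of hCard property class names handled by this sketch.
PROPERTIES = {"fn", "org", "url"}


class HCardParser(HTMLParser):
    """Collect one dict per element whose class list contains 'vcard'."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._card = None   # the hCard currently being read, if any
        self._stack = []    # one (properties, text chunks) entry per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        # The class attribute is a space-separated set of names.
        classes = set((attrs.get("class") or "").split())
        if self._card is None:
            if "vcard" not in classes:
                return
            self._card = {}
        props = classes & PROPERTIES
        # For links, the url value comes from href rather than the text.
        if "url" in props and tag == "a" and attrs.get("href"):
            self._card.setdefault("url", attrs["href"])
            props = props - {"url"}
        self._stack.append((props, []))

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, chunks in self._stack:
            chunks.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        props, chunks = self._stack.pop()
        text = " ".join("".join(chunks).split())  # collapse whitespace
        for name in props:
            self._card.setdefault(name, text)
        if not self._stack:  # the vcard root element just closed
            self.cards.append(self._card)
            self._card = None


def extract_hcards(html):
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


# Hypothetical markup, invented for this example.
SAMPLE = """
<div class="vcard">
  <a class="url fn" href="http://example.org/">Jane   Doe</a>,
  <span class="org">Example Corp</span>
</div>
"""

if __name__ == "__main__":
    print(extract_hcards(SAMPLE))
```

Two details differ from typical XML processing: one element can carry several properties through a single space-separated `class` attribute, and the value of a property may come from an attribute such as `href` instead of the element's text.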
While HTML 4.01 is formally an SGML-based document format, the only clients actually treating HTML that way are validators. Browsers, on the other hand, treat HTML documents as tag soup—they try to make sense out of, and display, even the most horridly
Surely by now you've heard or seen the term semantics being bandied about by web standards evangelists and document purists. But what does the term really signify in the context of markup, and what do you need to know about semantics to improve your marku
The term “Semantic Markup” is bandied about freely, and with every year that passes, it makes me more and more nervous. Herewith an exploration of what, if anything, those two terms mean when placed side by side. (Warning: way too long.)
The hard work that RSS does allows us to keep up with many more Web sites in much less time. But it also allows us to share information that we weren’t sharing before, allowing others to remix our content in new, useful ways.
In these Web 2.0-3.0 days, there is a lot of expectation for data publishers to offer their data through APIs, but there is no clear way to encode and query this data in a universal way. There are many ways of encoding information structure and semantics
This is a BETA implementation of an XSLT file to transform an hCa*-encoded XHTML file into the corresponding vCard/iCalendar file. The DRAFT specification for hCa* encodings can be found at the Technorati Developer Wiki.
Being a more generalized, scalable solution, RDFa can do a lot more than microformats, and with many of those other applications having more commercial potential, I see them as the best growth area for the format.
The Web is designed to support flexible exploration of information, by human users and by automated agents. For such exploration to be productive, information published by many different sources and for a wide variety of purposes must be comprehensible to