The objective of the ACE Program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data. Thus the ACE program requires the development of technologies that automatically detect and characterize this meaning. The ACE research objectives are viewed as the detection and characterization of Entities, Relations, and Events.
Trying to find a name for a company, project, algorithm, product? Acronym Creator helps you generate a name that is an acronym or abbreviation. With this acronym builder, abbreviation maker, name generator, label finder - whatever you call it - you can make your own acronyms and have fun!
This workshop will gather researchers in a variety of fields that contribute to the automated construction of knowledge bases. It will be held at Xerox Research Centre Europe, near Grenoble (France), May 17-19, 2010.
andLinux runs Linux natively inside Windows. It is a complete Ubuntu Linux system running seamlessly in Windows 2000 based systems (2000, XP, 2003, Vista, 7; 32-bit versions only).
The POI project consists of APIs for manipulating various file formats based upon Microsoft's OLE 2 Compound Document format and the Office Open XML format, using pure Java. In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java.
Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.
With proper mark-up/logic separation, a POJO data model, and a refreshing lack of XML, Apache Wicket makes developing web-apps simple and enjoyable again.
The professional, open source development tool for the open web. Develop and test your entire web application using a single environment, with support for the latest browser technology specs such as HTML5, CSS3, and JavaScript, as well as Ruby, Rails, PHP, and Python on the server side. We've got you covered!
ASV Toolbox is a modular collection of tools for the exploration of written language data. They work on either word lists or text and solve several linguistic classification and clustering tasks. The topics covered include language detection, POS tagging, base-form reduction, named entity recognition, and terminology extraction.
AWS Elastic Beanstalk is an even easier way for developers to quickly deploy and manage applications in the AWS cloud without having to worry about the physical infrastructure or the resource configuration that accompanies setting up that infrastructure. You simply upload your application and AWS Elastic Beanstalk automatically handles the deployment details of capacity provisioning, load balancing, auto-scaling, and application health monitoring, while allowing you to change configuration settings and deploy new versions.
Cibyl is a programming environment and binary translator that allows compiled C programs to execute on J2ME-capable phones. Cibyl uses GCC to compile the C programs to MIPS binaries, and these are then recompiled into Java bytecode.
CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus, for linguistic and language technology research and development.
ConceptNet represents data in the form of a semantic network, and makes it available to be used in natural language processing and intelligent user interfaces.
Das Fußball Studio is a freeware application for managing and evaluating football leagues and tournaments. It comes with the Bundesliga database, containing complete data for the 1st and 2nd Bundesliga.
The Mozenda Scraper provides web data extraction software and web screen scraping tools that make it easy to capture nearly any content from the web. See how you can start getting data from the web in minutes.
In this paper, we present recent research on internet threats aiming at fraud or at hampering critical information infrastructure. One approach concentrates on the rapid detection of phishing email, designed to make it next to impossible for attackers to obtain financial resources or commit identity theft in this way. Then we address how another type of internet fraud, the violation of the rights of trademark owners by faked merchandise, can be semi-automatically addressed with text mining methods. Thirdly, we report on two projects that are designed to prevent fraud in business processes in public administrations, namely in the healthcare sector and in customs administrations. Finally, we focus on the issue of critical infrastructures and describe our approach towards protecting them using a specific middleware architecture.
We have developed a system that enables the detection of certain common salting tricks that are employed by criminals. Salting is the intentional addition or distortion of content. In this paper we describe a framework to identify email messages that might contain new, previously unseen tricks. To this end, we compare the simulated perceived email message text generated by our hidden salting simulation system to the OCRed text we obtain from the rendered email message. We present robust text comparison techniques and train a classifier based on the differences between these two texts. In simulations we show that we can detect suspicious emails with a high level of accuracy.
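The comparison step in the abstract above can be illustrated in miniature. The sketch below (function names and the similarity threshold are my own assumptions, not the paper's actual method) aligns the two text versions of an email and flags a message when they diverge strongly:

```python
import difflib

def similarity(simulated_text: str, ocr_text: str) -> float:
    """Ratio in [0, 1] of how similar the two token sequences are."""
    return difflib.SequenceMatcher(
        None, simulated_text.split(), ocr_text.split()
    ).ratio()

def looks_salted(simulated_text: str, ocr_text: str,
                 threshold: float = 0.8) -> bool:
    """Flag an email as suspicious when the text the user would perceive
    diverges strongly from the OCRed rendering."""
    return similarity(simulated_text, ocr_text) < threshold

# A clean email: both text versions agree.
print(looks_salted("cheap watches buy now",
                   "cheap watches buy now"))                    # False
# A salted email: hidden content distorts one of the versions.
print(looks_salted("cheap watches buy now",
                   "ch3ap w4tches xq buy zz now qq"))           # True
```

The paper's robust comparison techniques are more involved than a single similarity ratio; this only shows where the classifier's feature (the difference between the two texts) comes from.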
This DVD-ROM from the German National Library (Deutsche Nationalbibliothek) contains the name authority file for persons (Personennamendatei, PND) as well as the subject headings authority file (Schlagwortnormdatei, SWD) and the corporate bodies authority file (Gemeinsame Körperschaftsdatei, GKD), and can be obtained directly from the Deutsche Nationalbibliothek.
Enunciate is a Web service deployment framework. It is not another Web service stack implementation; rather, Enunciate leverages existing Web service technologies to provide a mechanism to build, package, deploy, and clearly and accurately deliver your Web service API on the Java platform.
Lately I’ve been working on evaluating and comparing algorithms capable of extracting useful content from arbitrary HTML documents. I have made a feature-wise comparison of related software and APIs.
Extensible Dependency Grammar (XDG) is a general framework for dependency grammar, with multiple levels of linguistic representations called dimensions, e.g. grammatical function, word order, predicate-argument structure, scope structure, information structure and prosodic structure. It is articulated around a graph description language for multi-dimensional attributed labeled graphs.
An XDG grammar is a constraint that describes the valid linguistic signs as n-dimensional attributed labeled graphs, i.e. n-tuples of graphs sharing the same set of attributed nodes, but having different sets of labeled edges. All aspects of these signs are stipulated explicitly by principles: the class of models for each dimension, additional properties that they must satisfy, how one dimension must relate to another, and even lexicalization.
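The notion of an n-dimensional attributed labeled graph — one shared set of attributed nodes with a separate labeled edge set per dimension — can be made concrete with a minimal data structure. This is a hypothetical illustration of the representation only, not of XDG's constraint-based machinery:

```python
class MultiDimGraph:
    """Shared attributed nodes; one labeled edge set per dimension."""
    def __init__(self, nodes):
        self.nodes = dict(nodes)      # node id -> attribute dict
        self.dimensions = {}          # dimension -> set of (head, dep, label)

    def add_edge(self, dimension, head, dep, label):
        assert head in self.nodes and dep in self.nodes
        self.dimensions.setdefault(dimension, set()).add((head, dep, label))

# The same two words carry edges on several dimensions at once.
g = MultiDimGraph({1: {"word": "Mary"}, 2: {"word": "sleeps"}})
g.add_edge("grammatical_function", 2, 1, "subject")
g.add_edge("word_order", 2, 1, "left")
print(sorted(g.dimensions))  # ['grammatical_function', 'word_order']
```

In XDG, principles would then constrain each dimension's edge set and the relations between dimensions; here the structure is merely stored.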
Freebase is an open database with an API that holds all kinds of data. Because it is open, anyone can enter new data into Freebase. An example page in the Freebase database looks quite similar to a Wikipedia page. When you enter new data, the app can make suggestions about content. The topics in Freebase are organized by type, and you can connect pages with links and semantic tagging. In summary, Freebase is all about shared data and what you can do with it.
Quantitative funds work with sophisticated computational models and dispense with subjective stock picking by human managers. But these products have their pitfalls, as the recent crisis demonstrates.
The person database of the Munzinger-Archiv comprises more than 20,000 biographies of prominent people and is continuously updated. It contains portraits of politicians and business leaders, but also of artists and scientists.
Emacs is the extensible, customizable, self-documenting real-time display editor. This Info file describes how to edit with Emacs and some of how to customize it; it corresponds to GNU Emacs version 23.1.
SmartGWT is a GWT based framework that allows you to not only utilize its comprehensive widget library for your application UI, but also tie these widgets in with your server-side for data management. SmartGWT is based on the powerful and mature SmartClient library.
Despite the many JavaScript libraries that are available today, I cannot find one that makes it easy to add keyboard shortcuts (or accelerators) to your JavaScript app. This is because keyboard shortcuts were only used in JavaScript games: no serious web application used keyboard shortcuts to navigate around its interface. But Google apps like Google Reader and Gmail changed that. So, I have created a function to make adding shortcuts to your application much easier.
An example of a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second, in less than a page of Python code.
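The description matches Peter Norvig's well-known essay on spelling correction. Its core idea — generate all candidate strings one edit away and rank the known ones by corpus frequency — compresses to the sketch below (variable names are mine, and the tiny inline word counts stand in for a real corpus):

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"
# A real corrector counts words in a large corpus; this stands in for it.
WORD_COUNTS = Counter({"spelling": 10, "corrected": 5, "the": 100, "of": 80})

def edits1(word: str) -> set:
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word: str) -> str:
    """Prefer the word itself, then known words one edit away, by frequency."""
    if word in WORD_COUNTS:
        return word
    candidates = edits1(word) & WORD_COUNTS.keys()
    return max(candidates, key=WORD_COUNTS.__getitem__) if candidates else word

print(correct("speling"))   # -> "spelling"
```

The full essay also handles words two edits away and weighs the error model against the language model; this keeps only the single-edit core.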
HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy to use JavaBeans. It is a fast, robust and well tested package.
HTMLParser is a fast, real-time parser for real-world HTML. What has attracted most developers to it has been its simplicity of design, its speed, and its ability to handle streaming real-world HTML.
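The "linear" parsing style mentioned above — reacting to tags as they stream past instead of building a nested tree — can be illustrated with Python's stdlib html.parser. This is a generic sketch of the event-driven style, not the Java HTMLParser API itself:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values as start tags stream past; no tree is built."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkExtractor()
# Tolerates sloppy real-world markup such as the unclosed <b> here.
parser.feed('<p><b>See <a href="http://example.org">this</a> page.</p>')
print(parser.links)  # ['http://example.org']
```

A nested-fashion parser would instead hand back a document tree to walk; the linear style is what makes streaming over large, messy pages cheap.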
We investigate the statistical filtering of phishing emails, where a classifier is trained on characteristic features of existing emails and subsequently is able to identify new phishing emails with different contents. We propose advanced email features generated by adaptively trained Dynamic Markov Chains and by novel latent Class-Topic Models. On a publicly available test corpus, classifiers using these features are able to reduce the number of misclassified emails by two thirds compared to previous work. Using a recently proposed, more expressive evaluation method we show that these results are statistically significant. In addition, we successfully tested our approach on a non-public email corpus with a real-life composition.
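As a toy illustration of the Markov-chain idea behind the features above, the sketch below trains a fixed-order character bigram model per class and scores a message by log-likelihood under each. This is far simpler than the adaptively trained Dynamic Markov Chains of the paper, and the training strings and add-one smoothing are made up for the example:

```python
import math
from collections import Counter

def train_bigram_model(texts):
    """Character bigram and unigram counts over a class's training texts."""
    pairs, singles = Counter(), Counter()
    for t in texts:
        for a, b in zip(t, t[1:]):
            pairs[(a, b)] += 1
            singles[a] += 1
    return pairs, singles

def log_likelihood(model, text, vocab=256):
    """Log-probability of text under the bigram model, add-one smoothed."""
    pairs, singles = model
    return sum(math.log((pairs[(a, b)] + 1) / (singles[a] + vocab))
               for a, b in zip(text, text[1:]))

phish = train_bigram_model(["verify your account now",
                            "confirm your password"])
ham = train_bigram_model(["meeting moved to friday",
                          "lunch at noon today"])

msg = "please verify your account"
# The message is more plausible under the phishing model.
print(log_likelihood(phish, msg) > log_likelihood(ham, msg))  # True
```

In the paper, such per-class likelihoods (from adaptive, variable-order models) become features for a trained classifier rather than a decision rule on their own.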
Semantic MediaWiki (SMW) is a free extension of MediaWiki that helps to search, organise, tag, browse, evaluate, and share the wiki's content. While traditional wikis contain only texts which computers can neither understand nor evaluate, SMW adds semantic annotations that bring the power of the Semantic Web to the wiki.
JADE (Java Agent DEvelopment Framework) is a software framework fully implemented in the Java language. It simplifies the implementation of multi-agent systems through a middleware that complies with the FIPA specifications and through a set of graphical tools that support the debugging and deployment phases.
This is an overview of open source NLP and machine learning tools for text mining, information extraction, text classification, clustering, approximate string matching, language parsing and tagging, and more.
PojoCache is an in-memory, transactional, and replicated POJO (plain old Java object) cache system that allows users to operate on a POJO transparently, without active user management of either replication or persistence aspects. This tutorial focuses on the usage of the PojoCache API.
Jericho HTML Parser is a java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML.
This software is an extension of the SVMlight software. It provides an interface to kernel functions that are implemented in Java by means of the Java Native Interface (JNI) Invocation API.
Joda-Time provides a quality replacement for the Java date and time classes. The design allows for multiple calendar systems, while still providing a simple API. The 'default' calendar is the ISO8601 standard which is used by XML. The Gregorian, Julian, Buddhist, Coptic, Ethiopic and Islamic systems are also included, and we welcome further additions. Supporting classes include time zone, duration, format and parsing.
It contains a web crawler, an HTML parser, and ("in the near future") NER and REX components. It also includes JWikiDocs, a Java tool for crawling and downloading Wikipedia documents.
jWebSocket is a pure Java/JavaScript high-speed bidirectional communication solution for the Web - secure, reliable and fast. It provides easy integration into existing Tomcat web applications.
This web page provides information and errata, as well as about a third of the chapters, for the book Learning with Kernels, written by Bernhard Schölkopf and Alex Smola (MIT Press, Cambridge, MA, 2002).
LIBSVM is an integrated software package for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR), and distribution estimation (one-class SVM). It supports multi-class classification.
All programs and resources on the list are free, i.e. available at no cost (for research purposes), applicable to German-language texts, and ready to run out of the box, i.e. they do not first have to be trained on, for example, annotated corpora. The list is, of course, incomplete (as of 22 May 2007).
MegaMap is a Java implementation of a map (or hashtable) that can store an unbounded amount of data, limited only by the amount of disk space available. Objects stored in the map are persisted to disk. Good performance is achieved by an in-memory cache. The MegaMap can, for all practical purposes, be thought of as a map implementation with unlimited storage space.
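The design described — every write persisted to disk, with a small in-memory cache in front for speed — is easy to sketch in miniature. This uses Python's stdlib `shelve` and a tiny LRU as illustrative stand-ins for MegaMap's own Java persistence layer:

```python
import os
import shelve
import tempfile
from collections import OrderedDict

class DiskBackedMap:
    """Unbounded map: a small in-memory LRU cache over an on-disk store."""
    def __init__(self, path, cache_size=2):
        self.store = shelve.open(path)
        self.cache = OrderedDict()
        self.cache_size = cache_size

    def __setitem__(self, key, value):
        self.store[key] = value              # persist every write to disk
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)   # evict least recently used

    def __getitem__(self, key):
        if key in self.cache:                # fast path: served from memory
            self.cache.move_to_end(key)
            return self.cache[key]
        return self.store[key]               # slow path: read from disk

    def close(self):
        self.store.close()

m = DiskBackedMap(os.path.join(tempfile.mkdtemp(), "mega"))
for i in range(5):
    m[str(i)] = i * i
print(m["0"], m["4"])   # 0 16  ("0" was evicted from the cache, read from disk)
m.close()
```

The map's capacity is bounded only by disk space, while recently touched entries stay cheap to read; MegaMap packages the same trade-off behind a plain map interface.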
MSTParser is a non-projective dependency parser that searches for maximum spanning trees over directed graphs. Models of dependency structure are based on large-margin discriminative training methods. Projective parsing is also supported.
MuNPEx is a multi-lingual noun phrase (NP) extraction component developed for the GATE architecture, implemented in JAPE. It currently supports English, German, French, and Spanish (in beta).
MuNPEx requires a part-of-speech (POS) tagger to work and can additionally use detected named entities (NEs) to improve chunking performance. Please read the documentation (or source code) for more details.