Speech technology potentially allows everyone to participate in today's information revolution and can help bridge language barriers. Unfortunately, building speech processing systems requires significant resources: with some 6,900 languages in the world, speech processing has traditionally been affordable only for the most economically viable languages. Despite recent improvements in speech processing, supporting a new language remains a skilled job requiring significant effort from trained individuals. SPICE aims to overcome both limitations by providing an interactive language creation and evaluation toolkit that allows everyone to develop speech processing models, to collect appropriate data for model building, and to evaluate the results, enabling iterative improvement.
ConceptNet is a freely available common-sense knowledge base and natural-language-processing toolkit that supports many practical textual-reasoning tasks over real-world documents right out of the box, without additional statistical training.
Finding important information in unstructured text
From Language and Information Technologies
A vast majority of the information we deal with in everyday life consists of raw, unstructured text, where the most important facts or concepts are not always readily available, but hidden in the myriad of details that accompany them. To handle and digest the sheer amount of information we are exposed to in this information age, more sophisticated procedures are required to unveil the important parts of a text, and to allow us to process more information in less time. The goal of this project is to develop robust and accurate techniques to automatically extract important information from unstructured text, in the form of keyphrases (keyphrase extraction) or entire sentences (extractive summarization).
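A frequency-based sketch illustrates the extractive-summarization side of this task. This is a generic baseline, not the project's own method; the `summarize` function, the stopword list, and the scoring rule are all illustrative assumptions:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
             "we", "that", "for", "with", "on", "as", "it", "by"}

def summarize(text, k=2):
    """Score each sentence by the average corpus frequency of its
    content words, then return the top-k sentences in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)

    def score(s):
        toks = [w for w in re.findall(r"[a-z']+", s.lower())
                if w not in STOPWORDS]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    top = sorted(sorted(sentences, key=score, reverse=True)[:k],
                 key=sentences.index)
    return " ".join(top)
```

Sentences about the document's dominant topic accumulate high-frequency words and therefore outrank incidental details, which is the intuition the paragraph above describes.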
Funded by Google
Libtextcat is a library with functions that implement the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization" [1]. It was primarily developed for language guessing, a task on which it is known to perform with near-perfect accuracy.
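The Cavnar & Trenkle technique can be sketched in a few lines: build a ranked profile of the most frequent character n-grams per category, then classify a document by the "out-of-place" distance between its profile and each category's. The function names and tiny training texts below are illustrative, not libtextcat's API:

```python
from collections import Counter

def profile(text, max_n=3, top=300):
    """Rank the most frequent character n-grams (1..max_n), highest first,
    and map each n-gram to its rank."""
    counts = Counter()
    padded = " " + text.lower() + " "
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [g for g, _ in counts.most_common()][:top]
    return {g: r for r, g in enumerate(ranked)}

def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams absent from the category
    profile get a fixed maximum penalty."""
    max_penalty = len(cat_profile)
    return sum(abs(r - cat_profile[g]) if g in cat_profile else max_penalty
               for g, r in doc_profile.items())

def guess(text, category_profiles):
    """Pick the category whose profile is closest to the document's."""
    doc = profile(text)
    return min(category_profiles,
               key=lambda c: out_of_place(doc, category_profiles[c]))
```

In practice the category profiles are trained once on substantial per-language corpora; character n-grams make the method robust to unseen words and spelling errors.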
The Natural Programming Project is working on making programming languages and environments easier to learn, more effective, and less error prone. We are taking a human-centered approach, first studying how people perform their tasks and then designing languages and environments around people's natural tendencies. We focus on all kinds of programmers, including professional programmers, novice programmers who are trying to learn to be experts, and end users, who program to support other jobs or hobbies, such as multimedia authoring, simulations, teaching, prototyping, and other activities supported by computing.
NGramJ is a Java-based library containing two types of n-gram-based applications. Its major focus is to provide robust, state-of-the-art language recognition.
Chomsky bot written in Ruby. A funny little thing that generates random paragraphs of text from a set of sentence building blocks. It combines four kinds of phrases (introduction phrases, subject phrases, verb phrases and object phrases) into a sentence. The sentences this simple construction can create are amazing: they are syntactically correct and "hover on the edge of understandability".
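The construction is easy to sketch. The phrase inventories below are invented stand-ins for the bot's actual building blocks (the original is in Ruby; this sketch uses Python):

```python
import random

# Hypothetical phrase lists; the real bot ships its own inventories.
INTROS = ["Clearly,", "It is obvious that", "In this light,"]
SUBJECTS = ["the notion of linguistic competence",
            "any formal theory of grammar",
            "the distinction between deep and surface structure"]
VERBS = ["cannot be reconciled with", "necessarily presupposes",
         "fails to account for"]
OBJECTS = ["the generative capacity of the lexicon.",
           "an abstract system of rules.",
           "the speaker-hearer's intuition."]

def sentence(rng=random):
    """Pick one phrase of each kind and join them, as the bot does."""
    return " ".join(rng.choice(p) for p in (INTROS, SUBJECTS, VERBS, OBJECTS))
```

Because every combination is grammatical by construction, the output is always well-formed, which is exactly why it hovers on the edge of understandability.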
Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.
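CoreNLP is a Java library, but it also ships an HTTP server that accepts annotation properties in the URL and raw text in the POST body. A hedged sketch of calling a locally running server (the port, annotator list, and helper names below are assumptions; the server must be started separately with `edu.stanford.nlp.pipeline.StanfordCoreNLPServer`):

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

SERVER = "http://localhost:9000"  # assumed local CoreNLP server address

def build_request(text, annotators="tokenize,pos,lemma,ner"):
    """Build the HTTP request the CoreNLP server expects: annotation
    properties go in the URL query, the raw text goes in the POST body."""
    props = {"annotators": annotators, "outputFormat": "json"}
    url = SERVER + "/?properties=" + quote(json.dumps(props))
    return Request(url, data=text.encode("utf-8"))

def annotate(text, annotators="tokenize,pos,lemma,ner"):
    """POST the text and decode the JSON annotation the server returns."""
    with urlopen(build_request(text, annotators)) as resp:
        return json.load(resp)
```

The JSON response contains one entry per sentence, each listing the tokens with the fields the chosen annotators produce (part of speech, lemma, named-entity tag, and so on).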
Features
* (Jointly) visualize
o syntactic dependency graphs
o semantic dependency graphs (a la CoNLL 2008)
          o chunks (such as syntactic chunks, NER chunks, SRL chunks, etc.)
    * Compare gold standard trees to your generated trees (e.g. highlight false-positive and false-negative dependency edges)
* Filter trees and visualize only what's necessary, for example
o only dependency edges with certain labels
o only the edges between certain tokens
* Search corpora for sentences with certain attributes using powerful search expressions, for example
o search for all sentences that contain the word "vantage" and the pos tag sequence DT NN
o search for all sentences that contain false positive edges and the word "vantage"
* Reads
          o CoNLL 2000, 2002, 2003, 2004, 2006 and 2008 formats
o Lisp S-Expressions
o Malt-Tab format
o markov thebeast format
* Export to EPS