Anne O'Tate: A tool to support user-driven summarization, drill-down and browsing of PubMed search results. Journal of Biomedical Discovery and Collaboration
David D. Lewis, Ph.D.
858 W. Armitage Ave., #296
Chicago, IL 60614 U.S.A.
phone: 773-975-0304
fax: 773-289-0507
Services: I work with clients to make the most effective use possible of textual data. Applications I have worked on include search engines, text categorization, filtering of email and web pages, mining of customer data, and a variety of others. My clients have included both vendors and users of text processing software, and my work with them has included mining data sets, analyzing manual and automated text processing procedures, designing algorithms and system architectures, performing competitive and strategic analysis, training, and ongoing advisory relationships. Contact me to see how we can work together.
The Open Text Mining Interface (OTMI) is an initiative from Nature Publishing Group (NPG). It aims to enable scholarly publishers, among others, to disclose their full text for indexing and text-mining purposes but without giving it away in a form that is
The eXtensible Text Framework (XTF) is a flexible indexing and query tool that supports searching across collections of heterogeneous data and presents results in a highly configurable manner. The highlights of the XTF system are described in an online brochure
Q: How is the indexing performed?
A: Indexing is the process of creating a Conceptual Fingerprint from a text. In Collexis, automated indexing performs the following steps on the text: removing stop words, normalizing the text, selecting concepts by comparison with the thesaurus, clustering the concepts, and assigning each concept a relative weight using a set of algorithms that measure the specificity, similarity, and frequency of the concepts.
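The indexing steps described in this answer can be illustrated with a minimal sketch. It is not Collexis code: the stop-word list, the toy thesaurus, the concept identifiers, and the function name `fingerprint` are all hypothetical, and the clustering and specificity steps are left out, so the weight is simply each concept's relative frequency.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "to"}

# Hypothetical toy thesaurus: surface term -> concept identifier.
# Synonyms map to the same concept.
THESAURUS = {
    "heart": "C_HEART",
    "cardiac": "C_HEART",
    "attack": "C_INFARCTION",
    "infarction": "C_INFARCTION",
}


def fingerprint(text, thesaurus=THESAURUS, stop_words=STOP_WORDS):
    """Return a dict mapping concept id -> relative weight (weights sum to 1)."""
    # Normalize: lowercase and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words.
    tokens = [t for t in tokens if t not in stop_words]
    # Select concepts by looking each remaining term up in the thesaurus.
    concepts = Counter(thesaurus[t] for t in tokens if t in thesaurus)
    total = sum(concepts.values())
    if total == 0:
        return {}
    # Weight each concept by its share of all matched terms (an assumption).
    return {concept: count / total for concept, count in concepts.items()}
```

Calling `fingerprint("The cardiac infarction and the heart")` maps both "cardiac" and "heart" to the same concept, so that concept receives twice the weight of the infarction concept.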
Q: How does Collexis generate its search results?
A: Collexis employs vector matching: it compares a search query with the Fingerprints of the records in a Collexion. The outcome is a list of records representing the most relevant content items and/or experts. A query can also be over-specified (i.e., submitted as a considerable piece of text), which adds context to it. This context helps the system improve the accuracy of the query and return references to content items that are contextually related. The system administrator can enlarge or reduce the set of returned documents by entering a threshold that indicates the minimum “distance” between the records returned and the query. A search query can be matched against the records of multiple Collexions at the same time.
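As a rough illustration of vector matching with a threshold, the sketch below compares a query fingerprint with record fingerprints from several collections. The source does not say which measure Collexis uses, so cosine similarity is an assumption here, the threshold is interpreted as a minimum similarity score, and the names `cosine` and `match` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match(query, collections, threshold=0.5):
    """Score every record in every collection against the query.

    collections: dict of collection name -> dict of record id -> fingerprint.
    Returns (score, collection name, record id) tuples with score >= threshold,
    best match first.
    """
    hits = []
    for name, records in collections.items():
        for record_id, record_fp in records.items():
            score = cosine(query, record_fp)
            if score >= threshold:
                hits.append((score, name, record_id))
    hits.sort(reverse=True)
    return hits
```

Raising `threshold` shrinks the result set and lowering it enlarges the set, which mirrors the administrator setting described in the answer.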
Q: What makes Collexis different?
A: First, Collexis differentiates itself from full-text search engines by using thesauri for information retrieval. Its search is based on semantics defined in a thesaurus or ontology: synonymous terms and terms in different languages are linked to a single concept. Hierarchical relations between concepts, links between definitions and terms, and other semantic relationships are used in the search applications. This process helps to highlight the terms most relevant to the searcher’s query.
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book aims to help the reader "think in MapReduce" and also discusses the limitations of the programming model.
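To make the programming model concrete, here is a minimal single-process sketch of the classic word-count computation. It is not taken from the book and does not use any real MapReduce framework: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the in-memory grouping stands in for the shuffle step that a real execution framework performs across a cluster.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Mapper: emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reducer: sum all partial counts for one word."""
    yield word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Run mapper and reducer locally over (key, value) input pairs."""
    # Map phase, with intermediate values grouped by key (the "shuffle").
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct intermediate key.
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output


if __name__ == "__main__":
    docs = [("d1", "a b a"), ("d2", "b c")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
```

The algorithm designer writes only the mapper and the reducer; scheduling, grouping, and fault tolerance belong to the framework, which is the division of labor the abstract describes.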
A. Hotho, S. Staab, and G. Stumme. Proceedings of the 2003 IEEE International Conference on Data Mining, pp. 541-544 (Poster). Melbourne, Florida, IEEE Computer Society, (November 2003)
M. Hearst. Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 3--10. Morristown, NJ, USA, Association for Computational Linguistics, (1999)
M. Li, Y. Cheng, and H. Zhao. CGIV '04: Proceedings of the International Conference on Computer Graphics, Imaging and Visualization, pp. 183--186. Washington, DC, USA, IEEE Computer Society, (2004)
P. Kluegl, M. Atzmueller, and F. Puppe. Proc. Unstructured Information Management Architecture (UIMA), 2nd UIMA@GSCL Workshop, 2009 Conference of the GSCL (Gesellschaft für Sprachtechnologie und Computerlinguistik), (2009)
P. Kluegl, M. Atzmueller, and F. Puppe. Proc. LWA 2009, Knowledge Discovery and Machine Learning Track, Darmstadt, Germany, University of Darmstadt, (2009)
D. Rusu, B. Fortuna, and D. Mladenic. 4th Linked Data on the Web Workshop (LDOW 2011), 20th World Wide Web Conference (WWW 2011), Hyderabad, India, (2011)
J. Kroeze, M. Matthee, and T. Bothma. Proceedings of the 2003 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on Enablement Through Technology, pp. 93-101. South African Institute for Computer Scientists and Information Technologists, (2003)
K. Nishida, R. Banno, K. Fujimura, and T. Hoshide. Proceedings of the 2011 international workshop on DETecting and Exploiting Cultural diversiTy on the social web, pp. 29--34. New York, NY, USA, ACM, (2011)
B. Martins, H. Manguinhas, and J. Borbinha. Proceedings of the International Conference on Semantic Computing, pp. 1--9. IEEE Computer Society, (August 2008)
X. Li, B. Liu, and S. Ng. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 218--228. Stroudsburg, PA, USA, Association for Computational Linguistics, (2010)
J. Verma, S. Agrawal, B. Patel, and A. Patel. International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), 5 (1): 11 (February 2016)
J. Verma, S. Agrawal, B. Patel, and A. Patel. International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), 5 (1): 41-51 (February 2016)