Introduction
On several occasions developing database-driven web applications, I've been approached by clients who want Google-style search implemented at the last minute of the development cycle. Usually this leads to using some canned script that crawls the website, or a hacked up search function that uses the database but either returns too many results or none at all. On top of that, the queries performed are too many or too slow.
Until now, most developers have been forced to use relational databases to power search, install extra component packages, or seek out other non-php solutions. The problem with using a relational database, such as MySql's fulltext indexing, is that scalability problems crop up as your search criteria becomes more complicated.
One of the features that sets the Zend Framework apart from the others is the inclusion of a decent search module. Zend_Search_Lucene is a php port of the Apache Lucene project, a full-text search engine framework. Zend_Search_Lucene promises a simple way to add search functionality to an application without requiring additional php extensions or even a database.
Zend_Search_Lucene overcomes the usual limitations of relational databases with features such as fast indexing, ranked result sets, a powerful but simple query syntax, and the ability to index multiple fields. Better still, a Zend_Search_Lucene index can live happily alongside your relational database to provide fast searching but without duplicating the effort of storing all of your data twice. In this tutorial, I'll show you how to use Zend_Search_Lucene to index and search some RSS feeds.
<property> <name>http.agent.name</name> <value></value> <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: http.robots.agents http.agent.description http.agent.url http.agent.email http.agent.version and set their values appropriately. </description> </property> <property> <name>http.agent.description</name> <value></value> <description>Further description of our bot- this text is used in the User-Agent header. It appears in parenthesis after the agent name. </description> </property> <property> <name>http.agent.url</name> <value></value> <description>A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name. Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler. </description> </property> <property> <name>http.agent.email</name> <value></value> <description>An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming. </description> </property>
"Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as
The LIRE (Lucene Image REtrieval) library a simple way to create a Lucene index of image features for content based image retrieval (CBIR). The used features are taken from the MPEG-7 Standard: ScalableColor, ColorLayout and EdgeHistogram. Furthermore met
PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities.
Features
* PDF to text extraction
* Merge PDF Documents
* PDF Document Encryption/Decryption
* Lucene Search Engine Integration
* Fill in form data FDF and XFDF
* Create a PDF from a text file
* Create images from PDF pages
* Print a PDF
While Lucene has long offered these capabilities, its native capabilities are not intended for large semi-structured document collections (or documents with very different schemas). For this reason we developed SIREn - Semantic Information Retrieval Engine - a Lucene plugin to overcome these shortcomings and efficiently index and query RDF, as well as any textual document with an arbitrary amount of metadata fields.
U. Schindler, и I. Drost. Java Magazin, (2010)Zusätzlich interessante Punkte die im Artikel erwähnt werden:
1) Die Häufigkeit einzelner Suchanfragen ist meist zipf-verteilt.
2) Abstandsberechnung bei Geodaten über Haversinus.
3) Cartesian Tiers
4) Wissenschaftliches Infosystem PANGAEA
5) KML Regionen Dokumentation von Google
6) Geohshes.