Symfony and Search: Lucene, Solr and Elasticsearch
The Apache Software Foundation's Lucene Core is the most widely used Open Source search technology today. The core product, written in Java and running on the Java Virtual Machine (JVM), provides a high performance platform for full text search. In addition, it provides services such as spellchecking, result highlighting and advanced analysis capabilities.
Lucene is a low level technology and is not very practical to use directly in higher level applications such as web apps. A pure PHP port of the full text search engine exists in the form of Zend_Search_Lucene, but by far the most popular way of using Lucene together with PHP is through a higher level search engine.
The two most popular implementations today are Solr and Elasticsearch. Both remain relevant and neither has any significant advantage over the other in common use cases. They build on Lucene by adding a layer to provide additional functionality such as caching, replication, a web administration interface and, most importantly, a RESTful API for interaction.
PHP applications interact with a separate Java application over HTTP, exchanging messages in JSON or XML form. The results can be weighted and filtered in a multitude of ways, making for a very versatile tool for sorting and fetching large data listings.
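As a sketch of that JSON interchange, the snippet below builds a simple Elasticsearch-style query body in PHP and encodes it for an HTTP request. The field name, boost value and endpoint URL are made-up illustrations; in practice a client library such as Elastica handles the transport.

```php
<?php
// Build a simple full text query as a PHP array. The "title" field and
// the boost value are hypothetical examples, not a required schema.
$body = [
    'query' => [
        'match' => [
            'title' => [
                'query' => 'symfony search',
                'boost' => 2.0, // weight title matches higher in the results
            ],
        ],
    ],
    'size' => 10, // limit the result listing
];

// The search engine receives this as a JSON document over HTTP,
// e.g. POST http://localhost:9200/content/_search (hypothetical index).
$json = json_encode($body, JSON_PRETTY_PRINT);
echo $json;
```

The same array-to-JSON approach applies to filtering and sorting clauses; the engine replies with a JSON result set that the PHP application decodes.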
The Symfony framework offers a number of different ways of using Lucene based search with Solr and Elasticsearch.
The Doctrine ORM and Search
Object-relational mappers (ORMs) like Doctrine and Propel abstract the storage of application objects to SQL databases. Both are commonly used in Symfony framework projects to handle routine data persistence. Relational databases provide a wide array of features and solid storage for data.
ORMs, and many other applications such as Content Management Systems, generate SQL queries dynamically. This can lead to performance issues when sorting by certain criteria. SQL queries can be crafted by hand for optimal performance, but sometimes the required performance or features are simply out of reach.
Lucene based search engines are fast by nature and allow extensive sorting and querying. The Doctrine ORM can be tied to both Elasticsearch and Solr using ready made bundles when working with the Symfony framework. These bundles provide transparent synchronisation of the SQL data to the Lucene based search engines.
The FOSElasticaBundle integrates the Elastica library into the Symfony2 environment. For Solr, similar integration is available in the form of SolrBundle. Both bundles hook into Doctrine listeners to trigger automatic indexing in the supported search engine on content creation, update and delete events.
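To give an idea of the approach, a minimal FOSElasticaBundle configuration might look roughly like the fragment below. The entity class, index name and field names are hypothetical examples:

```yaml
# app/config/config.yml — a minimal FOSElasticaBundle sketch.
# Entity, index and field names are made up for illustration.
fos_elastica:
    clients:
        default: { host: localhost, port: 9200 }
    indexes:
        app:
            types:
                post:
                    mappings:
                        title: ~
                        body: ~
                    persistence:
                        driver: orm
                        model: AppBundle\Entity\Post
                        listener: ~   # index on Doctrine create/update/delete
                        finder: ~
```

The `listener` key is what enables the automatic synchronisation described above: the bundle registers Doctrine event listeners and mirrors entity changes into the Elasticsearch index.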
Taking SolrBundle or FOSElasticaBundle into use in an existing SQL based ORM setup is easy and allows leveraging the power of Lucene in a very approachable manner. For users of the Doctrine MongoDB ODM, similar functionality exists in working prototype form.
Solr and the eZ Platform Content Repository
eZ Platform is a CMS built using the Symfony framework. The previous eZ Publish system has had integrated Solr support since 2007, and the new product continues this tradition. The system provides a content repository which emits signals on events (such as publish, remove, etc.), and these signals are used by the Solr Search Engine for eZ Platform.
In addition to simple indexing and retrieval of content, the Solr integration also takes into account user permissions, languages and other metadata. In eZ Platform, developers write queries against the content tree contained in the repository.
Below is an example that lists articles and blog posts in English and Finnish from a certain branch of the content tree:
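Using the repository's SearchService and the Query API, such a query can be sketched roughly as follows. The content type identifiers, the subtree location path and the language codes are illustrative values to adapt to your own repository:

```php
<?php
// A sketch of a content query with the eZ Platform Search API.
// Identifiers, the subtree path and language codes are examples.
use eZ\Publish\API\Repository\Values\Content\Query;
use eZ\Publish\API\Repository\Values\Content\Query\Criterion;

$query = new Query();
$query->filter = new Criterion\LogicalAnd([
    new Criterion\ContentTypeIdentifier(['article', 'blog_post']),
    new Criterion\Subtree('/1/2/42/'),             // a branch of the content tree
    new Criterion\LanguageCode(['eng-GB', 'fin-FI']),
]);

// $searchService is obtained from the repository.
$results = $searchService->findContent($query);

foreach ($results->searchHits as $hit) {
    echo $hit->valueObject->contentInfo->name . PHP_EOL;
}
```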
This will generate complex SQL queries on the database. If we have a large number of content items, or want to sort results by distance from a location, the generated queries get more complex and likely quite a bit slower.
When the Solr integration is configured and enabled, the same API calls will be forwarded to the Solr search daemon in the backend. Switching from SQL queries to Solr queries requires zero code changes.
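Enabling the integration is a matter of configuration rather than code. Assuming the Solr search engine bundle is installed, the repository configuration might look roughly like this sketch (endpoint and connection names are examples):

```yaml
# ezplatform.yml — a hypothetical sketch of switching the repository
# from the default SQL-based search to the Solr engine.
ezpublish:
    repositories:
        default:
            search:
                engine: solr
                connection: default

ez_search_engine_solr:
    endpoints:
        endpoint0:
            dsn: http://localhost:8983/solr
            core: collection1
    connections:
        default:
            entry_endpoints:
                - endpoint0
```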
You can also import data from other sources into the same Solr index that holds the content, making the search index a viable integration point for applications.
You need tons of data to create relevancy
Internal search in web sites and web applications has advanced greatly during the last five years, thanks largely to the adoption of Lucene based software. Geospatial search is now a standard feature of Solr, and indexing the text content of binary documents is nowadays quite straightforward with Apache Tika.
Technology wise we're now a light year away from older indexing search engines such as htDig and mnoGoSearch, but the requirement of having data to provide relevant results remains. Bing and other popular search engines are able to return relevant results thanks to heavy investment in search algorithms, but also to the large amounts of usage data and hypertext linking in the pages their robots crawl and index.
Lucene based search was a hot topic in 2008. The buzz has since levelled off and the software has improved, so in 2015 faceted search is about as exciting a feature as responsive design. Regardless of the commoditisation of powerful search, the underlying task of keeping two data stores in sync remains complex.
Search has passed peak hype, but it has never been as relevant as it is today.