In this article we will look at the main features of the search engine Solr. Apache Solr is a powerful tool with an amazing search capability.
Introduction
The need to use full-text search today is conditioned by the need to search information in the ever-increasing number of electronic documents. It is very important to be able to quickly find the necessary information among them, so many applications today use various full-text search engines.
Solr is an open source full-text search platform based on the Apache Lucene project.
The main features of this powerful search server are:
- Faceted Search;
- Highlighting results;
- Integration with databases;
- Filtering;
- Processing of documents with a complex format;
- Search by synonyms;
- Spellchecking;
- Result Grouping;
- More Like This;
- Phonetic Matching;
- Highly scalable and fault tolerant.
Also, Solr supports REST like API.
Objectives
Let's imagine that we are creating an online store. We need a quick search for goods. The search allows you to find the product by entering characters in the search field that match the contents of one of the fields listed below:
- Name;
- Barcode;
- Categories;
- Manufacturer.
There should also be options:
- Limiting the number of results;
- String searching with errors;
- Grouping the results according to the position of the typed character set (at the beginning of the field, in the middle, at the end);
- Filter by category, manufacturer;
- Sorting by name, price (it should be noted that the price for the same product may be different for different stores, so we will take as the price of the product the minimum of all the proposed prices).
Before You Begin
- Download an Apache Solr release from this site. We will work with Apache Solr 7.3.
- Unzip the Solr release and change your working directory to the subdirectory where Solr was installed.
- The bin folder contains the scripts to start and stop the server. To launch Solr, run: bin/solr start . This command will run Solr server on port 8983.
Solr has a web interface, which is available after launching on port 8983, where you can observe statistics, errors and some other things.
Basic concepts
Field – content for indexing/searching, along with metadata that define how content should be processed by Solr.
Document – a unit of storage in Solr – a group of fields and their values.
Schema – defines the fields to be indexed and their type.
Collection – in Solr, one or more documents are grouped into one logical index using a single configuration and schema. A collection can be a single core.
Core – Individual Solr instance (represents a logical index). Multiple cores can work on one node.
Node – A JVM instance that launches Solr. Also known as the Solr server.
To understand how Solr works with the text data, the following three concepts are important:
Analyzers – they analyze the text and return the stream of tokens. It can be a single class, or it can consist of a series of classes of a tokenizer and a filter. Tokenizers – describe how Solr breaks the text into lexical units – tokens. Filters – describe the processing of tokens that are output from the tokenizer (for example, converting all words to lowercase).
Data processing Data → Token Stream → Additional Processing → Index
Query Processing Search query → Token stream → Additional processing → Dictionary search
Creating a new core and configuring it
First, create a new core: bin/solr create -c product
This command will create a collection named product with the default settings and schema. It will be located in the *server/solr/product *directory.
Immediately after the creation, the core information is available in the web interface in the "Core Selector" section.
Now you need to define the core schema. Solr itself can generate a schema based on the data, automatically defining fields and types. However, to ensure the correct indexing and the corresponding type semantics, it is recommended to specify the schema explicitly. The core schema is stored in the server/solr/product/conf/managed-schema file.
Suppose you want to save the following object:
{ "id": "e55b6300-b016-11e8-8d04-d4619d2d8b56", "name": "product_name", "manufacturer": "product_manufacturer", "barcode": "0123456789", "categories": [ "category_1", "category_2" ], "prices": [ { "store_id": "1", "price": 244.5 }, { "store_id": "4", "price": 301.5 } ] }
Add the following fields:
<field name="name" type="string_ci" indexed="true" stored="true" /> <field name="manufacturer" type="string_ci" indexed="true" stored="true" /> <field name="barcode" type="string_ci" indexed="true" stored="true" /> <field name="categories" type="string_ci" indexed="true" stored="true" multiValued="true" /> <field name="price" type="pdouble" indexed="true" stored="true" /> <field name="content_type" type="string" indexed="true" stored="true" required="true"/>
For the price we will use the pdouble type, already declared below in the same file. For the fields name, manufacturer, etc. you could use an existing string type or text_generals, but since we want the case insensitive search, we need to convert the contents of these fields to lowercase. To do this, we'll create our own type string_ci:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory" /> <filter class="solr.LowerCaseFilterFactory" /> </analyzer> </fieldType>
Also create the type UUYD and change the type of the fields id and root:
<field name="id" type="<strong>uuid</strong>" indexed="true" stored="true" required="true" multiValued="false" /> <field name="_root_" type="<strong>uuid</strong>" indexed="true" stored="false" docValues="false" /> ... <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
More information about types and properties of fields can be found in the official documentation.
Let's consider the storage of prices. It is necessary to store information about the price of goods in different stores – a pair (store_id, price).
Solr does not support nested documents directly, because Lucene has a flat object model. An object with nested objects is stored not as one document in the index, but as a sequence of documents, the so-called block. Thus, the above object will be saved as a parent document with 2 children, these appear in the index as child1, child2, parent.
The content_type field has been added to distinguish between parent and child documents.
To make changes appear in the web interface, after saving the schema, you need to restart Solr with bin/solr restart command or reload this core in the "Core Admin" section of the web interface.
Adding content to a Solr index
There are several ways to add data to the search index.
You can index documents of different types (xml, json, csv, pdf, doc, html, txt, etc.) using bin/post tool (it is a Unix shell script).
However, we need to import data from the database. For this we can use Data Import Handler (DIH). It works not only with relational databases, but also with HTTP based data sources, e-mail repositories, and structured XML.
First, you need to configure SolrRequestHandler to bind DIH to Solr. To do this, add the necessary library and configure the requestHandler (A request handler describes the request processing logic. Some of them process search requests, others manage various tasks) in the server/solr/product/conf/solrconfig.xml file:
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-\d.*\.jar" /> ... <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">db-data-config.xml</str> </lst> </requestHandler>
<updateRequestProcessorChain name="uuid"> <processor class="solr.UUIDUpdateProcessorFactory"> <str name="fieldName">id</str> </processor> <processor class="solr.RunUpdateProcessorFactory" /> </updateRequestProcessorChain>
Also, you need to add a driver to work with the database, for this you can create a directory server/solr/product/lib and put the driver there.
db-data-config.xml – the name of the file with the import configuration. It contains:
<dataConfig> <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/test" user="postgres" password="postgres"/> <document> <entity name="product" pk="product_id" transformer="com.test.solr.transformer.UuidTransformer, TemplateTransformer" query="select product_id, name, manufacturer_id from product" deltaQuery="select product_id from product where update_time > '${dataimporter.last_index_time}'" deltaImportQuery="select product_id, manufacturer_id, name from product where product_id = '${dataimporter.delta.product_id}'"> <field column="product_id" name="id" uuid="true" /> <field column="name" name="name" /> <field column="content_type" template="product" /> <field column="manufacturer_id" name="manufacturer_id" uuid="true" /> <entity name="manufacturer" pk="manufacturer_id" query="select name, manufacturer_id from manufacturer where manufacturer_id = '${product.manufacturer_id}'" deltaQuery="select manufacturer_id from manufacturer where update_time > '${dataimporter.last_index_time}'" parentDeltaQuery="select product_id from product where manufacturer_id = '${manufacturer.manufacturer_id}'" deltaImportQuery="select manufacturer_id, name from manufacturer where manufacturer_id = '${dataimporter.delta.manufacturer_id}'"> <field column="name" name="manufacturer" /> </entity> <entity name="product_category" pk="product_category_id" transformer="com.test.solr.transformer.UuidTransformer" query="select category_id from product_category where product_id = '${product.product_id}'" deltaQuery="select product_category_id, product_id from product_category where update_time > '${dataimporter.last_index_time}'" parentDeltaQuery="select product_id from product where product_id = '${product_category.product_id}'"> <field column="category_id" name="category_id" uuid="true" /> <entity name="category" pk="category_id" query="select name from category where category_id = '${product_category.category_id}'" deltaQuery="select category_id from category where update_time > '${dataimporter.last_index_time}'" deltaImportQuery="select name from category where category_id = '${dataimporter.delta.category_id}'" parentDeltaQuery="select product_id from product_category where category_id = '${category.category_id}'"> <field column="name" name="categories" /> </entity> </entity> <entity name="price" child="true" pk="price_id" transformer="com.test.solr.transformer.UuidTransformer, TemplateTransformer" query="select price_id, store_id, price from price where product_id = '${product.product_id}'" deltaQuery="select price_id, product_id from price where update_time > '${dataimporter.last_index_time}'" deltaImportQuery="select store_id, price from price where price_id = ${dataimporter.delta.price_id}" parentDeltaQuery="select product_id from product where product_id = '${price.product_id}'"> <field column="price_id" name="id" uuid="true" /> <field column="price" name="price" /> <field column="store_id" name="store_id" uuid="true" /> <field column="content_type" template="price" /> </entity> </entity> </document> </dataConfig>
Let's consider some parameters:
- query — Required. The SQL query used to select rows.
- deltaQuery – The SQL query for delta import (delta import – incremental import and change detection), it retrieves the primary keys of the corresponding rows. Most often these rows are retrieved using the condition where update_time> '${dataimporter.last_index_time}' (the value of dataimporter.last_index_time can be found in the dataimporter.properties file, it is updated with each delta import)
- deltaImportQuery — The SQL query for delta import. The pks from the deltaQuery are available to the deltaImportQuery through the variable ${dataimporter.delta.<column-name>}.
- parentDeltaQuery – The SQL query to retrieve the primary keys of the parent entity rows that will be changed with delta import.
Note that the content_type field is filled with a TemplateTansformer. Transformers are used to modify existing fields or to create new ones. You can use existing transformers or create custom ones. We also use our custom transformer to work with fields of the UUID format (We need to use it, because there is a problem with importing UUID format fields from the database into the index – you will get the error org.apache.solr.common.SolrException: TransactionLog doesn't know how to serialize class java.util.UUID; try implementing ObjectResolver?):
package com.test.solr.transformer; import org.apache.solr.handler.dataimport.Context; import org.apache.solr.handler.dataimport.DataImporter; import org.apache.solr.handler.dataimport.Transformer; import java.util.List; import java.util.Map; public class UuidTransformer extends Transformer { @Override public Object transformRow(Map<String, Object> map, Context context) { final List<Map<String, String>> fields = context.getAllEntityFields(); for (Map<String, String> field : fields) { final String uuid = field.get("uuid"); if ("true".equals(uuid)) { final String columnName = field.get(DataImporter.COLUMN); final Object value = map.get(columnName); if (value != null) { map.put(columnName, value.toString()); } } } return map; } }
Create a JAR file and put it in the directory server/solr/product/lib.
For full import use:
http://localhost:8983/solr/product/dataimport?command=full-import
For delta import use:
http://localhost:8983/solr/product/dataimport?command=delta-import&clean=false
However, for clarity, you can use the "DataImport" section in the web interface.
After having been added to the index, the content becomes searchable.
Work with search queries
The search query is processed by a request handler, which refers to the query parser. There are several parsers that support each a different query syntax. We use the default Solr query parser – the Standard query parser or the "lucene" query parser.
Standard Query Parser Parameters
There is only one required parameter q, that defines a query using the standard query syntax.
Thus, the simplest query will return the contents of the first 10 documents: http://localhost:8983/solr/products/select?q=*:*
Basic query parameters
- sort – Defines the field to sort and its direction;
- start – Offset;
- rows – Limit;
- fq (Filter Query) – Defines the query for filtering;
- fl (Field List) – Defines the fields displayed in the response;
- wt – Defines the format of the response.
Sample Queries
Let's start with simple queries:
#TODO: fix the table
<div class="mks_col "> <div class="mks_one_third ">Equal</div> <div class="mks_two_thirds ">name:product / name:"product name"</div> </div> <div class="mks_col "> <div class="mks_one_third ">Not Equal</div> <div class="mks_two_thirds ">-name:product</div> </div> <div class="mks_col "> <div class="mks_one_third ">In Set</div> <div class="mks_two_thirds ">id:(100 OR 200 OR 300)</div> </div> <div class="mks_col "> <div class="mks_one_third ">Not In Set</div> <div class="mks_two_thirds ">-id:(100 OR 200 OR 300)</div> </div> **String Data Type** <div class="mks_col "> <div class="mks_one_third ">Starts With</div> <div class="mks_two_thirds ">name:pro*</div> </div> <div class="mks_col "> <div class="mks_one_third ">Contains</div> <div class="mks_two_thirds ">name:*oduc*</div> </div> <div class="mks_col "> <div class="mks_one_third ">Ends With</div> <div class="mks_two_thirds ">name:*uct</div> </div> <div class="mks_col "> <div class="mks_one_third ">Fuzzy Search (allows two errors)</div> <div class="mks_two_thirds ">name:prdut~</div> </div> <div class="mks_col "> <div class="mks_one_third ">Fuzzy Search (allows one error)</div> <div class="mks_two_thirds ">name:prduct~1</div> </div> **Numeric Data Type** <div class="mks_col "> <div class="mks_one_third ">Greater Than</div> <div class="mks_two_thirds ">price:[100 TO*]</div> </div> <div class="mks_col "> <div class="mks_one_third ">Less Than</div> <div class="mks_two_thirds ">price:[* TO 100]</div> </div> <div class="mks_col "> <div class="mks_one_third ">Between</div> <div class="mks_two_thirds ">price:[100 TO 500]</div> </div> <div class="mks_col "> <div class="mks_one_third ">Not Between</div> <div class="mks_two_thirds ">-price:[100 TO 500]</div> </div>Now add a filter by category and manufacturer, and also limit the number of results returned:q=name:prod* & fq=manufacturer:manufacturer_1 AND categories:cat_1_1 & rows=5
Consider more complex queries. For example, we need to find products with a price lower than 500. We cannot just set the price condition: [* TO 100], because this query will return not the products, but the child documents – the pair (store_id, price). In order to search by child documents and return the parent, use Block Join Parent Query Parser. The resulting query:
q={!parent which= content_type:product 22}price:[* TO 500]
Now we need to sort the products by price. To do this, in addition to Block Join Parent Query Parser, we will use Function Query:
q={!parent which="+content_type:product +name:*" score=max v='+content_type:price +{!func}price'}} & sort=score desc
Conclusion
Solr is a powerful tool with many features. In addition, it should be noted that there are client libraries available for Java, C #, PHP, Python, Ruby and most other popular programming languages.