From zero to Solr: A hands-on tutorial

This tutorial is a quick guide to setting up Apache Solr and querying xml-based data, without digging too deeply into the details. As always, there are several ways to accomplish this goal. The steps below are therefore just one example and don’t claim to be “best practice”.

  1. Install Solr 4 on Ubuntu 12.04
  2. Prepare your data for being indexed
  3. Set up the schema
  4. Index some data
  5. Customize the default query handler

Install Solr 4 on Ubuntu 12.04

This section is based on a blog post by Tomasz Muras.

The following instructions refer to a vanilla installation of Ubuntu LTS server (12.04) i386, with only the “OpenSSH server” package set selected during install. In order to install and run Solr, we need Tomcat and curl.

  • sudo apt-get install tomcat6 curl

Next, download Solr from http://lucene.apache.org/solr. At the time of writing, the current version is 4.0.0. The following command downloads Solr to your home directory from one of the many mirrors available. Adapt the url if needed.

  • cd ~ && curl -O http://mirror.netcologne.de/apache.org/lucene/solr/4.0.0/apache-solr-4.0.0.tgz

Now we have the Solr tgz in our home directory. Let’s unpack it somewhere, e.g. below /opt.

  • sudo mkdir -p /opt
  • sudo tar -xvzf ~/apache-solr-4.0.0.tgz -C /opt

In order to keep the following steps independent of the actual Solr version, let’s create a symbolic link named solr in /opt. Adapt this to your actual Solr version.

  • cd /opt
  • sudo ln -s apache-solr-4.0.0 solr

Solr comes with example configurations which make it easy to get started. To use them, copy the appropriate files to Solr’s home directory.

  • cd /opt/solr
  • sudo cp -r example/solr/* ./
  • sudo cp example/webapps/solr.war ./

The example shipped with Solr uses a single “core” named collection1. Without trying to explain what Solr cores are in detail, think of them as a way to host multiple indices within a single Solr instance.

Let’s change collection1’s name to something more friendly, e.g., catalog.

  • cd /opt/solr
  • sudo mv collection1 catalog

But that’s not all. You have to modify the copied example config /opt/solr/solr.xml as well. Simply change every occurrence of “collection1” to “catalog” below the “cores” element. Afterwards, it should look like this:

https://gist.github.com/3972338
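
If you would rather not edit the file by hand, the rename can be sketched with sed. This is only a sketch: rename_core is a hypothetical helper, and it assumes that the core name appears in solr.xml only as a complete quoted attribute value (such as name="collection1"). It prints the result instead of changing the file, so you can review it first.

```shell
# Hypothetical helper (not part of Solr): print FILE with every quoted
# attribute value equal to OLD replaced by NEW.
# $1 = old core name, $2 = new core name, $3 = path to solr.xml
# Assumes the names contain no characters that are special to sed.
rename_core() {
    sed "s/\"$1\"/\"$2\"/g" "$3"
}
```

For example, the following writes the edited copy to /tmp/solr.xml, which you can then inspect and copy back with sudo cp.

  • rename_core collection1 catalog /opt/solr/solr.xml > /tmp/solr.xml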

Solr needs a directory to store its data. Let’s create the directory and make the Tomcat user its owner, so that Solr is able to write to it.

  • sudo mkdir /opt/solr/data
  • sudo chown tomcat6 /opt/solr/data

Now tell Solr about the data directory by adding or editing the “dataDir” element in your core’s main config file, which is in this case /opt/solr/catalog/conf/solrconfig.xml. The element should look like this:

https://gist.github.com/3972377

Last but not least, you have to tell Tomcat about your new Solr instance. To do so, create a file named /etc/tomcat6/Catalina/localhost/solr.xml with the following content.

https://gist.github.com/3972383
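
For orientation, a context file for this setup could be generated as sketched below. Treat it as an assumption rather than a copy of the gist: it follows the commonly used Solr-on-Tomcat pattern (a docBase pointing to the war file plus a solr/home environment entry), and print_solr_context is a hypothetical helper name.

```shell
# Print a Tomcat context descriptor for Solr.
# $1 = path to solr.war, $2 = Solr home directory.
# Assumption: docBase plus a solr/home JNDI entry is all that is needed;
# the original gist may contain additional attributes.
print_solr_context() {
    cat <<EOF
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="$1" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="$2" override="true"/>
</Context>
EOF
}
```

With the paths used in this tutorial, the file could be written like this:

  • print_solr_context /opt/solr/solr.war /opt/solr | sudo tee /etc/tomcat6/Catalina/localhost/solr.xml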

Restart Tomcat, and Solr should be waiting for you on port 8080.

  • sudo /etc/init.d/tomcat6 restart
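
A quick way to check whether the core came up is to ask its ping handler. The sketch below assumes that the core’s solrconfig.xml still defines /admin/ping, as the example configuration shipped with Solr does; both function names are hypothetical.

```shell
# Build the URL of a core's ping handler.
# $1 = host:port, $2 = core name.
solr_ping_url() {
    printf 'http://%s/solr/%s/admin/ping\n' "$1" "$2"
}

# Print only the HTTP status code returned by the ping handler.
solr_ping_status() {
    curl -s -o /dev/null -w '%{http_code}\n' "$(solr_ping_url "$1" "$2")"
}
```

If the core loaded correctly, the following should print 200. Any other code is a hint to look into Tomcat’s log files.

  • solr_ping_status localhost:8080 catalog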

Prepare your data for being indexed

The following steps assume that the data to be indexed is stored as xml files, each file representing a single item.

Now that you have Solr up and running, let’s index some data. But beware, Solr expects data to be in a special format. You cannot simply push your own xml files into it without preprocessing them first. So what does Solr expect? Have a look at the xml files in the exampledocs directory (which is in our case located at /opt/solr/example/exampledocs), for example monitor.xml

  • cat /opt/solr/example/exampledocs/monitor.xml

which will show the following xml file.

https://gist.github.com/3972412

As you can see, there is an “add” element, which has one or more “doc” elements, each of which has one or more “field” elements with an attribute called “name” and a value. A minimal xml file to be indexed by Solr would look like this.

https://gist.github.com/3972606
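
The add/doc/field structure is simple enough to produce by hand. As a rough sketch, the hypothetical helper below prints one “doc” from name=value arguments. Field values are inserted as they are, so they must not contain &, < or >.

```shell
# Print a Solr <add> message containing a single <doc>.
# Each argument has the form name=value; repeat a name to get
# several values for the same field.
print_add_doc() {
    echo '<add>'
    echo '  <doc>'
    for pair in "$@"; do
        # Split "name=value" at the first "=".
        printf '    <field name="%s">%s</field>\n' "${pair%%=*}" "${pair#*=}"
    done
    echo '  </doc>'
    echo '</add>'
}
```

For example, the following prints a document with one id field and two subject fields (the values are made up):

  • print_add_doc id=1 subject=foo subject=bar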

The problem is that your data is probably not in that format, so you have to convert it prior to loading it into Solr. When the input data is xml, one way to accomplish this is by using XSLT. Doing complex transformations using XSLT is a topic of its own, so let’s assume the following simple xml input.

https://gist.github.com/3972615

The following XSLT will transform this xml to an xml file which can be loaded into Solr.

https://gist.github.com/3972619
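
If your input is flat, a generic stylesheet may already be enough. The sketch below is not the stylesheet from the gist: it assumes a hypothetical input layout in which every child of the root element is one field, e.g. <item><id>1</id><status>foo_status</status></item>, and it simply reuses the element names as field names.

```shell
# Print a generic XSLT 1.0 stylesheet that turns every child element of
# the root element into a Solr <field> named after that element.
# Assumption: the input is flat (no nested elements, no attributes).
print_flat_stylesheet() {
    cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/*">
    <add>
      <doc>
        <xsl:for-each select="*">
          <field name="{local-name()}">
            <xsl:value-of select="."/>
          </field>
        </xsl:for-each>
      </doc>
    </add>
  </xsl:template>
</xsl:stylesheet>
EOF
}
```

To try it out, write the stylesheet to a file of your choice (the path below is just an example) and run it through an XSLT processor.

  • print_flat_stylesheet > /tmp/flat_to_solr.xsl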

By the way, if you want to hack some xsl, try http://www.xmlper.com, an online xsl/xml editor with live preview of your transformed xml.

Ok, now put your xsl file somewhere we can use it later on. There is already a directory for xslt files in our example core (which is named catalog), located at /opt/solr/catalog/conf/xslt, so save the xslt file there and give it an expressive name, like input_to_solr.xsl.

To check if your stylesheet works as expected you can use xsltproc. Install the package and do some transformation on your sample input xml located at ~/input.xml.

  • sudo apt-get install xsltproc
  • xsltproc /opt/solr/catalog/conf/xslt/input_to_solr.xsl ~/input.xml

This should give the following xml file, which matches what Solr expects.

https://gist.github.com/3972643

Set up the schema

Before trying to index some data, we have to tell Solr which fields we are using. The corresponding schema configuration file is located at /opt/solr/catalog/conf/schema.xml. Remember that we have a core named “catalog”, and the schema file is located in the conf directory below that core’s root.

Solr already knows certain fields, e.g. id or subject, but some fields are missing in the default schema, like “type” or “format”. And even the fields which the default schema defines may not match our input data, so let’s alter the default schema to match our input.

In /opt/solr/catalog/conf/schema.xml, add the following lines inside the “fields” element.

https://gist.github.com/3972656

But we are not finished. The Solr default schema defines a field named “subject”, but does not declare this field to be “multi valued”, although it occurs more than once in our input xml. So we need to alter the existing field definition and add the “multiValued” attribute set to “true”. After editing, the line should look like this:

https://gist.github.com/3972660
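
Field definitions all follow the same pattern, which the hypothetical helper below prints. It is a sketch under two assumptions: every field is both indexed and stored, and the type names you pass (such as string or text_general) are declared in your schema.xml, as they are in the example schema.

```shell
# Print a schema.xml <field> definition.
# $1 = field name, $2 = field type,
# $3 = "true" for a multi-valued field (optional, defaults to "false").
print_field_def() {
    printf '<field name="%s" type="%s" indexed="true" stored="true" multiValued="%s"/>\n' \
        "$1" "$2" "${3:-false}"
}
```

For example, a single-valued and a multi-valued definition:

  • print_field_def format string
  • print_field_def subject text_general true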

Now that the schema file corresponds to our input, restart Tomcat to ensure that the new schema is loaded.

  • sudo /etc/init.d/tomcat6 restart

Index some data

Now that you are able to transform your input into something Solr understands, you could apply this transformation to all input files and POST them to Solr. Let’s do this for our input.xml.

  • xsltproc /opt/solr/catalog/conf/xslt/input_to_solr.xsl ~/input.xml|curl "http://localhost:8080/solr/update?commit=true" --data-binary @- -H 'Content-type:application/xml'

Let’s look at this command. We transform our ~/input.xml file using xsltproc and the stylesheet located at /opt/solr/catalog/conf/xslt/input_to_solr.xsl and pipe the result to curl.

Curl does a POST (with the “Content-Type” header set to “application/xml”) to http://localhost:8080/solr/update with “commit=true”, taking its data from STDIN via the pipe as --data-binary (“@-” tells curl to read the data from STDIN).

The result should be as follows (except QTime, which may be different for you).

https://gist.github.com/3972685
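
The same pipeline extends to a whole directory of input files. The following is a minimal sketch, assuming one item per *.xml file: index_dir is a hypothetical helper that posts each transformed file without committing and then sends a single <commit/> message at the end.

```shell
# Transform every *.xml file in a directory and POST it to Solr,
# then commit once at the end instead of once per file.
# $1 = stylesheet, $2 = input directory, $3 = update URL.
index_dir() {
    for f in "$2"/*.xml; do
        [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
        xsltproc "$1" "$f" |
            curl -s "$3" --data-binary @- -H 'Content-type:application/xml'
    done
    # A <commit/> message makes the posted documents searchable.
    curl -s "$3" --data-binary '<commit/>' -H 'Content-type:application/xml'
}
```

Assuming your input files live in ~/items (a made-up location), the call would be:

  • index_dir /opt/solr/catalog/conf/xslt/input_to_solr.xsl ~/items http://localhost:8080/solr/update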

Congratulations, you have indexed your first file. Let’s search for it using curl.

  • curl "http://localhost:8080/solr/catalog/query?q=status:foo_status"

This should give you a nice JSON representation of the input file, like this

https://gist.github.com/3972691

Customize the default query handler

At the moment, you can query items by specifying field:value pairs like “status:foo_status”. But what you probably want is to query for terms in multiple (or all) fields, without naming them. This can be accomplished by setting some smart defaults for a query handler in your core’s Solr config. In this tutorial, we have a single core named “catalog”, so the config is /opt/solr/catalog/conf/solrconfig.xml.

Search for the definition of the “requestHandler” with name="/query". This element has a “lst” child named “defaults”. Here you can define query parameters which are assumed if they are not given in the request.

Let’s combine the ability to define defaults with a different query mode like (e)dismax. Change the “requestHandler” element for /query to look like this.

https://gist.github.com/3972892
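
As a rough idea of what such a handler definition can look like, the hypothetical helper below prints one for a given list of fields. It is a sketch and not the content of the gist: it assumes the handler class solr.SearchHandler and the echoParams, wt and indent defaults of the example solrconfig.xml, and only adds defType and qf.

```shell
# Print a /query requestHandler that defaults to the edismax parser.
# $1 = space-separated list of fields to search (the value of qf).
print_query_handler() {
    cat <<EOF
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="defType">edismax</str>
    <str name="qf">$1</str>
  </lst>
</requestHandler>
EOF
}
```

For example, to search the fields mentioned in this tutorial (adapt the list to your schema):

  • print_query_handler "id title subject type format status"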

Now, if you issue a query, the (e)dismax query mode is used. This mode provides the qf parameter, which lists the fields Solr should search for the query term. In this example all fields should be searched. With this request handler, you can query items by simply running

  • curl "http://localhost:8080/solr/catalog/query?q=foo_status"

A last note on the qf parameter. You can define “boost” values for each field, which will make some fields more relevant than others when doing a query. Boost values can be written by using ^, e.g. title^2.
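
Since qf is an ordinary query parameter, boosts can also be tried out per request before you write them into the config. The sketch below uses a hypothetical helper, build_qf, which joins field=boost arguments into a qf value; the boost values shown are arbitrary.

```shell
# Join "field" or "field=boost" arguments into a qf value,
# e.g. title=2 subject -> "title^2 subject".
build_qf() {
    qf=""
    for spec in "$@"; do
        case "$spec" in
            *=*) term="${spec%%=*}^${spec#*=}" ;;   # field=boost -> field^boost
            *)   term="$spec" ;;                    # no boost given
        esac
        qf="${qf:+$qf }$term"
    done
    printf '%s\n' "$qf"
}
```

Passing the value with curl’s -G and --data-urlencode options takes care of encoding the ^ and the spaces:

  • curl -G "http://localhost:8080/solr/catalog/query" --data-urlencode "q=foo_status" --data-urlencode "qf=$(build_qf title=2 subject status)"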
