This tutorial is a quick guide to setting up Apache Solr and querying XML-based data, without digging too deeply into the details. As always, there are several ways to accomplish this goal. The steps below are therefore just one example and don’t claim to be “best practice”.
- Install Solr 4 on Ubuntu 12.04
- Prepare your data for being indexed
- Set up the schema
- Index some data
- Customize the default query handler
Install Solr 4 on Ubuntu 12.04
This section is based on a blog post by Tomasz Muras.
The following instructions refer to a vanilla installation of Ubuntu LTS server (12.04, i386), with only the “OpenSSH server” package set selected during installation. In order to install and run Solr, we need Tomcat and curl.
sudo apt-get install tomcat6 curl
Next, download Solr from http://lucene.apache.org/solr. At the time of writing, the current version is 4.0.0. The following command will download Solr to your home directory from one of the many mirrors available. Adapt the URL if needed.
Now we have the Solr tgz in our home directory. Let’s put it somewhere, e.g. to /opt/solr
sudo mkdir -p /opt
sudo tar -xvzf ~/apache-solr-4.0.0.tgz -C /opt
In order to keep the following steps independent of the actual Solr version, let’s create a symbolic link in /opt. Adapt this to your actual Solr version.
sudo ln -s /opt/apache-solr-4.0.0 /opt/solr
Solr comes with example configurations which make it easy to get started. To use them, we need to copy the appropriate files to Solr’s home directory, /opt/solr.
sudo cp -r /opt/solr/example/solr/* /opt/solr/
sudo cp /opt/solr/example/webapps/solr.war /opt/solr/
The example shipped with Solr uses a single “core” named collection1. Without trying to explain what Solr cores are in detail, think of them as a way to host multiple indices within a single Solr instance.
Let’s change collection1’s name to something more friendly, e.g., catalog.
sudo mv /opt/solr/collection1 /opt/solr/catalog
But that’s not all. You have to modify the copied example config /opt/solr/solr.xml as well. Simply change every occurrence of “collection1” to “catalog” below the “cores” element. Afterwards, it should look like this
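The rename can also be scripted with sed. The following is a minimal sketch that works on a scratch copy; the file contents in the heredoc are an assumption modelled on the Solr 4.0 example solr.xml, and your real file may carry additional attributes. To apply it for real, point the sed command at /opt/solr/solr.xml and run it with sudo.

```shell
# Scratch copy standing in for /opt/solr/solr.xml. The contents are an
# assumption based on the Solr 4.0 example; compare with your own file.
workdir=$(mktemp -d)
cat > "$workdir/solr.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
EOF

# Replace every occurrence of collection1 with catalog; keep a .bak backup.
sed -i.bak 's/collection1/catalog/g' "$workdir/solr.xml"

# Show the result: the core name, instanceDir and defaultCoreName now say catalog.
cat "$workdir/solr.xml"
```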
Solr needs a directory to store its data. Let’s create the directory and set appropriate rights for Solr to be able to access it.
sudo mkdir /opt/solr/data
sudo chown tomcat6 /opt/solr/data
Now tell Solr about the data directory by adding or editing the “dataDir” element in your core’s main config file, which in this case is /opt/solr/catalog/conf/solrconfig.xml. The element should look like this.
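One possible way to make this edit from the command line is sketched below, again on a scratch copy. It assumes that the example config contains a dataDir element with an empty default for the solr.data.dir property; if your file differs, edit the element by hand instead.

```shell
# Scratch copy standing in for /opt/solr/catalog/conf/solrconfig.xml.
# The dataDir line is assumed to match the Solr 4.0 example config.
workdir=$(mktemp -d)
cat > "$workdir/solrconfig.xml" <<'EOF'
<config>
  <dataDir>${solr.data.dir:}</dataDir>
</config>
EOF

# Use /opt/solr/data as the default data directory. The solr.data.dir
# system property can still override it at startup.
sed -i.bak 's|<dataDir>.*</dataDir>|<dataDir>${solr.data.dir:/opt/solr/data}</dataDir>|' \
  "$workdir/solrconfig.xml"

grep '<dataDir>' "$workdir/solrconfig.xml"
```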
Last but not least, you have to tell Tomcat about your new Solr instance. To do so, create a file named /etc/tomcat6/Catalina/localhost/solr.xml with the following content.
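A typical Tomcat context fragment for this layout is sketched below. It assumes the war file and Solr home used in the steps above (/opt/solr/solr.war and /opt/solr). The sketch writes to a scratch directory; copy the file to /etc/tomcat6/Catalina/localhost/solr.xml with sudo once it looks right.

```shell
# Write the context fragment to a scratch file first.
workdir=$(mktemp -d)
cat > "$workdir/solr.xml" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<!-- docBase points at the war copied earlier; solr/home is Solr's home directory -->
<Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/opt/solr" override="true"/>
</Context>
EOF

cat "$workdir/solr.xml"
# To install it (not run here):
#   sudo cp "$workdir/solr.xml" /etc/tomcat6/Catalina/localhost/solr.xml
```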
Restart Tomcat, and Solr should be waiting for you on port 8080.
sudo /etc/init.d/tomcat6 restart
Prepare your data for being indexed
The following steps assume that the data to be indexed is stored as xml files, each file representing a single item.
Now that you have Solr up and running, let’s index some data. But beware, Solr expects data to be in a special format. You cannot simply push your own xml files into it without preprocessing them first. So what does Solr expect? Have a look at the xml files in the exampledocs directory (in our case located at /opt/solr/example/exampledocs), for example
which will show the following xml file.
As you can see, you have an “add” element, which has one or many “doc” elements, which have one or many “field” elements with an attribute called “name” and a value. A minimal xml file to be indexed by Solr would look like this.
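As an illustration of that structure, the sketch below writes a minimal add document with a single doc and a single field. The file name and the id value are placeholders chosen for this example.

```shell
# Minimal update message: one <add>, one <doc>, one <field>.
# minimal.xml and the id value 1 are placeholders.
workdir=$(mktemp -d)
cat > "$workdir/minimal.xml" <<'EOF'
<add>
  <doc>
    <field name="id">1</field>
  </doc>
</add>
EOF

cat "$workdir/minimal.xml"
```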
The problem is that your data is probably not in that format, so you have to convert it prior to loading it into Solr. When the input data is xml, one way to accomplish this is by using XSLT. Doing complex transformations using XSLT is a topic of its own, so let’s assume the following simple xml input.
The following XSLT will transform this xml to an xml file which can be loaded into Solr.
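To make the idea concrete, here is a self-contained sketch of such a transformation. The input document is hypothetical: the element names item, id, subject, type, format and status are assumptions chosen to match the field names used later in this tutorial, not a prescribed format. The stylesheet simply turns every child element of item into a Solr field named after that element.

```shell
workdir=$(mktemp -d)

# Hypothetical input: one item per file, repeated <subject> elements.
cat > "$workdir/input.xml" <<'EOF'
<item>
  <id>1</id>
  <subject>solr</subject>
  <subject>search</subject>
  <type>book</type>
  <format>print</format>
  <status>foo_status</status>
</item>
EOF

# Stylesheet: wrap the item in <add><doc> and emit one <field> per child.
cat > "$workdir/input_to_solr.xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/item">
    <add>
      <doc>
        <xsl:for-each select="*">
          <!-- field name = element name, field value = element text -->
          <field name="{local-name()}"><xsl:value-of select="."/></field>
        </xsl:for-each>
      </doc>
    </add>
  </xsl:template>
</xsl:stylesheet>
EOF

# Run the transformation only if xsltproc is installed.
if command -v xsltproc >/dev/null 2>&1; then
  xsltproc "$workdir/input_to_solr.xsl" "$workdir/input.xml" > "$workdir/solr_input.xml"
  cat "$workdir/solr_input.xml"
fi
```

A generic copy-every-child stylesheet like this only works while element names and Solr field names line up one to one; real data usually needs explicit templates per field.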
By the way, if you want to hack some xsl, try http://www.xmlper.com, an online xsl/xml editor with live preview of your transformed xml.
OK, now put your xsl file somewhere we can use it later on. There is already a directory for xslt files in our example core (which is named catalog), located at /opt/solr/catalog/conf/xslt, so save the xslt file there and give it an expressive name, like input_to_solr.xsl.
To check if your stylesheet works as expected, you can use xsltproc. Install the package and run the transformation on your sample input xml located at ~/input.xml.
sudo apt-get install xsltproc
xsltproc /opt/solr/catalog/conf/xslt/input_to_solr.xsl ~/input.xml
This should give the following xml file, which matches what Solr expects.
Set up the schema
Before trying to index some data, we have to tell Solr which fields we are using. The corresponding schema configuration file is located at /opt/solr/catalog/conf/schema.xml. Remember that we have a core named “catalog”, and the schema file is located in the conf directory below that core’s root.
Solr already knows certain fields, e.g. id or subject, but some fields are missing in the default schema, like “type” or “format”. Even the fields which the default schema defines may not match our input data, so let’s alter the default schema to match our input.
In /opt/solr/catalog/conf/schema.xml, add the following lines inside the “fields” element.
But we are not finished. The Solr default schema defines a field named “subject”, but does not declare this field to be “multi valued”, as in our input xml. So we need to alter the existing field definition and add the “multiValued” attribute set to “true”. After editing, the line should look like this
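The schema edits can be sketched on a scratch copy as follows. The field definitions in the heredoc are assumptions: the subject line is modelled on the Solr 4.0 example schema, and the type and format lines show one plausible way to declare the missing fields as plain strings. Adjust the field types to your data.

```shell
# Scratch copy standing in for /opt/solr/catalog/conf/schema.xml.
# All field definitions here are assumptions for illustration.
workdir=$(mktemp -d)
cat > "$workdir/schema.xml" <<'EOF'
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="subject" type="text_general" indexed="true" stored="true"/>
  <field name="type" type="string" indexed="true" stored="true"/>
  <field name="format" type="string" indexed="true" stored="true"/>
</fields>
EOF

# Append multiValued="true" to the existing subject definition only.
sed -i.bak 's|\(<field name="subject"[^>]*stored="true"\)/>|\1 multiValued="true"/>|' \
  "$workdir/schema.xml"

grep 'name="subject"' "$workdir/schema.xml"
```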
Now that the schema file corresponds to our input, restart Tomcat to ensure the new schema is loaded.
sudo /etc/init.d/tomcat6 restart
Index some data
Now that you are able to transform your input to something Solr understands, you could apply this transformation to all input files and POST them to Solr. Let’s do this for our input.xml.
xsltproc /opt/solr/catalog/conf/xslt/input_to_solr.xsl ~/input.xml|curl "http://localhost:8080/solr/update?commit=true" --data-binary @- -H 'Content-type:application/xml'
Let’s look at this command. We transform our ~/input.xml file using xsltproc and the stylesheet located at /opt/solr/catalog/conf/xslt/input_to_solr.xsl, and pipe the result to curl. Curl does a POST (with the “Content-Type” header set to “application/xml”) to http://localhost:8080/solr/update with “commit=true”, taking the data from STDIN via --data-binary (“@-” means: read the file from STDIN).
The result should be as follows (except QTime, which may be different for you).
Congratulations, you have indexed your first file. Let’s search it using curl.
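A query along these lines could look like the sketch below. It assumes that the /query request handler from the Solr example config is still in place and that the index contains a document whose status field is foo_status; both the handler path and the query value are assumptions to adapt.

```shell
# Base URL as used for the update request; /query is assumed to be the
# JSON-returning handler from the example solrconfig.xml.
solr_base="http://localhost:8080/solr"
query="status:foo_status"   # field:value pair; the value is a placeholder

# -G sends the --data-urlencode pairs as a URL-encoded query string.
# --max-time keeps the call from hanging when Solr is not reachable.
curl -sG --max-time 5 "$solr_base/query" --data-urlencode "q=$query" \
  || echo "Solr not reachable at $solr_base" >&2
```

Letting curl do the URL encoding avoids surprises with colons, spaces and quotes in the query string.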
This should give you a nice JSON representation of the input file, like this
Customize the default query handler
At the moment, you can query items by specifying field:value pairs like “status:foo_status”. But what you probably want is to query for terms in multiple (or all) fields, without naming them. This can be accomplished by setting some smart defaults for a query handler in your core’s Solr config. In this tutorial, we have a single core named “catalog”, so the config would be /opt/solr/catalog/conf/solrconfig.xml.
Search for the definition of the “requestHandler” with name="/query". This element has a lst child named “defaults”. Here you can define query parameters that are assumed if they are not given in the request.
Let’s combine the ability to define defaults with a different query mode like (e)dismax. Change the “requestHandler” element for /query to look like this.
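One way such a handler definition could look is sketched below. The defType and qf parameters are standard (e)dismax parameters; the list of fields in qf is an assumption based on the field names used in this tutorial, and the remaining defaults are modelled on the example config. The fragment is written to a scratch file so that it can be reviewed before being merged into solrconfig.xml.

```shell
# Write the handler fragment to a scratch file for review.
workdir=$(mktemp -d)
cat > "$workdir/query_handler.xml" <<'EOF'
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <!-- use the extended dismax query parser by default -->
    <str name="defType">edismax</str>
    <!-- fields searched when the query names none (assumed field list) -->
    <str name="qf">id subject type format status</str>
  </lst>
</requestHandler>
EOF

cat "$workdir/query_handler.xml"
```

Because these are only defaults, any of them can still be overridden per request by passing the same parameter in the URL.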
Now, if you issue a query, the dismax query mode is used. This mode provides the qf parameter, where you can specify the fields in which Solr should search for the query term. In this example all fields should be searched. With this request handler, you can query items by simply doing
A last note on the qf parameter. You can define “boost” values for each field, which will make some fields more relevant than others when doing a query. Boost values can be written by using
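For reference, the (e)dismax qf parameter accepts entries of the form field^boost. The sketch below passes such a value at request time, which overrides the handler default; the chosen fields, the boost factor and the search term are illustrative assumptions.

```shell
solr_base="http://localhost:8080/solr"

# field^boost: matches in subject are weighted higher than matches in status.
qf="subject^2 status"
term="foo"   # placeholder search term

# qf given in the request takes precedence over the handler's default qf.
curl -sG --max-time 5 "$solr_base/query" \
  --data-urlencode "q=$term" \
  --data-urlencode "qf=$qf" \
  || echo "Solr not reachable at $solr_base" >&2
```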