A Beginner's Guide to Searching With Lucene
Written by Andrew Lalis.
Nowadays, if you want to build the next fancy new web app, chances are pretty good that you'll need a search bar in it, and for that you've probably heard of ElasticSearch or some other fancy, all-in-one solution. In this article, I'd like to try to convince you that you don't need any of that: instead, you can brew up your own homemade search feature using Apache Lucene.
Hopefully you'll be surprised by how easy it is.
The Use Case
Before we dive into the code, it's important to make sure that you actually need an indexing and searching tool that goes beyond simple SQL queries.
If you can answer "yes" to any of these, then continue right along:
- I want to search over multiple different types of entities.
- I want to prioritize matching certain fields from entities over other fields. (For example, a user's name should be more important than their nickname.)
- I'm okay with search results being eventually consistent (that is, it might take a moment for new data to appear in results).
- I want to search for results that match a wildcard search. (For example, "find all animals whose name matches tig*".)
Indexing and Searching Basics
No matter what searching solution you end up choosing, they all generally follow the same approach:
- Ingest data and produce an index.
- Search for data quickly using the index.
In most situations, ingesting data roughly translates to scraping content from a database, a message queue, or even CSV files. The contents of each entity are analyzed, and the important bits are extracted and stored in a compressed format that's optimized for high-speed searching. The exact implementation depends on what sort of solution you choose: relational databases typically build their indexes on B-tree variants, while Lucene builds an inverted index that maps each term to the documents containing it.
Searching over your index involves parsing a user's query (and sanitizing it, if necessary), and then constructing a well-formed query that's accepted by your searching solution, possibly with different weights or criteria applied to different fields.
This is no different for Lucene, and in this guide, we'll go through how to create an index and search through it.
Setting Up a New Project
In this guide, I'll be creating a small Java program for searching over a huge set of airports which is available for free here: https://ourairports.com/data/. The full source code for this project is available on GitHub, if you'd like to take a look.
I'll be using Maven as the build tool of choice, but feel free to use whatever you'd like.
We start by creating a new project and adding the apache-lucene dependency, along with the Apache Commons CSV library for parsing the CSV dataset.
Parsing the Data
First of all, we need to parse the CSV data into a programming construct that we can use elsewhere in our code. In this case, I've defined the Airport record like so:
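(A trimmed-down version is sketched below; the record in my sample project follows the full set of columns in the ourairports.com dataset, so treat the exact fields here as my own choice for this guide.)

```java
// A trimmed-down Airport record; the real dataset has many more columns.
public record Airport(
        int id,
        String ident,
        String type,
        String name,
        String municipality,
        String wikipediaLink
) {}
```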
And a simple AirportParser class that just reads in a CSV file and returns a List<Airport> (check the source code to see exactly how I did that).
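For the curious, a rough sketch of such a parser built on Commons CSV might look like this. The column names are the ones used by the ourairports.com dataset, and the Airport fields are the ones from the trimmed-down record above, so adjust both to taste:

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class AirportParser {
    public static List<Airport> parse(Path csvFile) throws IOException {
        // Treat the first row as the header so we can look up columns by name.
        CSVFormat format = CSVFormat.DEFAULT.builder()
                .setHeader()
                .setSkipHeaderRecord(true)
                .build();
        List<Airport> airports = new ArrayList<>();
        try (Reader reader = Files.newBufferedReader(csvFile);
             CSVParser parser = format.parse(reader)) {
            for (CSVRecord record : parser) {
                airports.add(new Airport(
                        Integer.parseInt(record.get("id")),
                        record.get("ident"),
                        record.get("type"),
                        record.get("name"),
                        record.get("municipality"),
                        record.get("wikipedia_link")
                ));
            }
        }
        return airports;
    }
}
```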
Now that we've got our list of entities, we can build an index from them.
Indexing
In order to efficiently search over a massive set of data, we need to prepare a special set of index files that Lucene can read during searches. To do that, we need to create a new directory for the index to live in, construct a new IndexWriter, and create a Document for each airport we're indexing.
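Here's a minimal sketch of what that can look like, assuming the trimmed-down Airport record from earlier and a recent Lucene 9.x. The index directory and the choice of which fields to index are just my picks for this example:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AirportIndexer {
    public static void buildIndex(List<Airport> airports, Path indexDir) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
                // Overwrite any existing index so we always start fresh.
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        try (Directory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir, config)) {
            for (Airport airport : airports) {
                Document doc = new Document();
                // Plain stored data: retrievable from results, but not searchable.
                doc.add(new StoredField("id", airport.id()));
                doc.add(new StoredField("wikipedia_link", airport.wikipediaLink()));
                // Full-text searchable fields, stored so we can show them in results.
                doc.add(new TextField("name", airport.name(), Field.Store.YES));
                doc.add(new TextField("municipality", airport.municipality(), Field.Store.YES));
                doc.add(new TextField("type", airport.type(), Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }
}
```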
An important takeaway here is the construction of the Document. There are a variety of fields that you could add to your document, which have different effects on the search.
- StoredFields are fields that just store plain data, but can't be searched on. In the above code, we store the id and wikipedia link, since they might be nice to have when fetching results, but nobody is going to want to search for airports by our internal id.
- TextFields are fields that allow for a full-text search of their value. This is generally the most popular "searchable" field type. A TextField also lets us specify whether or not we want to store its value, just like with a StoredField. In our case, we do want to store all our fields.
For more information about the types of fields that you can use, check the Lucene documentation. It's very well-written.
Also important to note: once a document is added, it stays in the index until the index is removed or overwritten, or the document is deleted through another IndexWriter method. I'd suggest reading the documentation if you'd like to learn more about how to dynamically update a living index that grows with your data, but for 95% of use cases, regenerating the search index occasionally is just fine.
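For completeness, an in-place update or delete can look roughly like the sketch below. One caveat I've assumed: to use the airport's id as a key, it must also be indexed (not just stored), for example as a StringField named "id_key", which is an extra field beyond what the indexing sketch above adds.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class AirportIndexUpdater {
    // Both methods assume the airport's id was also indexed as an exact-match
    // StringField named "id_key" (a StoredField alone can't serve as a delete key).
    public static void replaceAirport(IndexWriter writer, int id, Document newDoc) throws IOException {
        writer.updateDocument(new Term("id_key", String.valueOf(id)), newDoc);
    }

    public static void removeAirport(IndexWriter writer, int id) throws IOException {
        writer.deleteDocuments(new Term("id_key", String.valueOf(id)));
    }
}
```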
Searching
Now that we've built an index from our dataset, we can search over it to find the most relevant results for a user's query.
The following code might look a bit daunting, but I've added some comments to explain what's going on, and I'll walk you through the process below.
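Here's a sketch of what that search method can look like. The field names and weights match the indexing sketch from earlier, the 10-result limit is arbitrary, and searcher.storedFields() assumes a reasonably recent Lucene 9.x (older versions use searcher.doc(...) instead):

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AirportSearcher {
    public static List<String> search(String rawQuery, Path indexDir) throws IOException {
        // 1. Bail out early on empty input.
        if (rawQuery == null || rawQuery.isBlank()) return List.of();

        // 2. Give some fields more influence on the score than others.
        Map<String, Float> fieldWeights = Map.of(
                "name", 3f,
                "municipality", 2f,
                "type", 1f
        );

        // 3. Build one big OR query: every term may match any weighted field as a
        //    prefix wildcard. Terms are lowercased to match what the StandardAnalyzer
        //    put into the index.
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String term : rawQuery.toLowerCase().split("\\s+")) {
            for (var entry : fieldWeights.entrySet()) {
                Query wildcard = new WildcardQuery(new Term(entry.getKey(), term + "*"));
                builder.add(new BoostQuery(wildcard, entry.getValue()), BooleanClause.Occur.SHOULD);
            }
        }
        Query query = builder.build();

        // 4. Search the index and collect the stored "name" of each hit.
        List<String> results = new ArrayList<>();
        try (Directory dir = FSDirectory.open(indexDir);
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs topDocs = searcher.search(query, 10);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                Document doc = searcher.storedFields().document(scoreDoc.doc);
                results.add(doc.get("name"));
            }
        }
        return results;
    }
}
```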
- We check to make sure that the user's query is legitimate. If it's just empty or null, we can exit right away and return an empty result.
- Since we want to make some fields have a greater effect than others, we prepare a mapping that specifies a weight for each field.
- In Lucene, the Query object is passed to an index searcher to do the searching. But first, we need to build such a query. In our case, we want to match each term the user enters against any of the fields we've added a weight for. By using a BooleanQuery, we can construct this as a big OR clause, where each term is a wildcard query that's boosted by the weight of the field it applies to.
- Finally, we open up a DirectoryReader on the index directory, create an IndexSearcher, and get our results. The searcher produces a TopDocs object that has a scoreDocs property containing the list of document ids that appear in the results. We can use the searcher to look up the stored fields of each document in the result set, and in this case, we just fetch the name of the airport.
That's it! In my sample project, the whole Lucene implementation for indexing and searching, including imports and comments, is less than 150 lines of pure Java! It's so simple that it can just be tucked away into a single class.
Now, with your newfound knowledge, go forth and build advanced search features into your apps, and be content that you've built your solution from the ground up, without reinventing the wheel or getting roped into a complex cloud solution.
Once again, my sample code is available on GitHub here.