Simple Vagrant Box for Elasticsearch

Elasticsearch is a search server written in Java, providing an advanced JSON interface. It is scalable, distributed and open source.
There are many similar products on the market, most notably Solr – its greatest competitor. Almost all of them, including Elasticsearch, use the Apache Lucene indexing engine behind the scenes. The difference is in the front-end API layer, the technology used and, of course, the architectural choices made by their authors.

Why Elasticsearch?

My first contact with Elasticsearch was when a client – a large international company running their website on EPiServer – asked for their existing search solution to be replaced. The first and obvious choice for me was EPiServer Find, a search engine offered by EPiServer both as SaaS and on premises. After some research and analysis, however, the client chose to host the search service themselves and to build a fully customized page indexer and a C# client library. The final choice was between Elasticsearch and Solr.

Personally, I find Elasticsearch more appealing than Solr for its simple installation and configuration, self-containment (self-hosting), distributed architecture and cluster capabilities. Even though it is Java-based software, or maybe because of that, it is equally easy to set up on both Windows and Linux hosts.

Being a search server means it is not intended as the primary data store for an application, even though I have seen that idea implemented in enterprise-grade websites. Keeping all the raw data in a primary data store (a CMS, a relational database etc.) and providing full re-index functionality gives an additional perk: the ability to upgrade and reconfigure the search engine even in a production environment under heavy load. For developers, the same approach makes their search servers disposable, letting them avoid cluttering their hard drives with loads of test data.

Virtualization

Even though Elasticsearch is a self-hosted solution and does not have many prerequisites (in fact, it only requires an installed JRE), I chose to install it on a virtual machine. The main reason was to get an environment I could use for both work and personal projects without the need to install and configure it multiple times. I was also interested in experimenting with its cluster functions.

I decided to use Vagrant – a set of tools built in Ruby and a great helper for building and distributing virtualized environments. It relies on a hypervisor behind the scenes, such as VirtualBox (the default), VMware or Hyper-V. It lets me automate the environment creation process by starting from so-called “base boxes” – usually clean Linux images – and running configuration tools or scripts on them, a process called provisioning.

First, I tried to use the Hyper-V provider, as Hyper-V was already installed on my machine. But due to the limited number of Hyper-V base boxes available, I decided to switch over to VirtualBox. I think it was a smart choice, considering that VirtualBox is available on Windows, Mac and Linux. As a result, I can now share my environments with front-end developers working on Macs.

Vagrant supports multiple provisioners, including Chef Solo, Puppet and of course shell commands. For the sake of simplicity, I chose shell scripts.
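To give an idea of what shell provisioning involves, here is a minimal sketch of such a script. It is an illustration under stated assumptions, not necessarily the script from the repository: the file name provision.sh, the ES_DEB_URL variable and the function names are hypothetical, and it assumes a Debian or Ubuntu base box and an Elasticsearch .deb package whose URL you supply yourself.

```shell
# provision.sh - hypothetical sketch of a Vagrant shell provisioner.
# Assumes a Debian/Ubuntu guest. Vagrant runs shell provisioners as root
# by default, so no sudo is used here.

# Packages installed before Elasticsearch: a headless JRE, plus curl for the download.
required_packages() {
    echo "default-jre-headless curl"
}

# Install the prerequisites and the Elasticsearch package, then start the service.
# The body runs in a subshell with `set -e`, so the first failing step aborts it.
provision() (
    set -e
    deb_url="$1"    # URL of an Elasticsearch .deb package, supplied by the caller
    apt-get update
    # Left unquoted on purpose so each package name becomes its own argument.
    apt-get install -y $(required_packages)
    curl -fsSL -o /tmp/elasticsearch.deb "$deb_url"
    dpkg -i /tmp/elasticsearch.deb
    service elasticsearch start
)

# Provision only when a package URL is given, so sourcing this file has no side effects.
if [ -n "${ES_DEB_URL:-}" ]; then
    provision "$ES_DEB_URL"
fi
```

A Vagrantfile would typically point at a script like this with the shell provisioner's path option and pass the variable through its env option. Keeping the package URL outside the script means that switching to another Elasticsearch version does not require touching the provisioning logic.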

The repository with my Elasticsearch server can be found here: https://github.com/malpaw/vagrant-elasticsearch. After downloading, it only requires running

vagrant up

and it is ready.
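A quick way to confirm that the server is actually answering is to poll its HTTP API from the host. The helper below is a rough sketch: the function names are hypothetical, and it assumes that the box forwards Elasticsearch's default HTTP port 9200 to localhost, which you should check against the Vagrantfile you are using.

```shell
# Build the base URL of the Elasticsearch HTTP API.
# Defaults are assumptions: localhost and port 9200 (Elasticsearch's default HTTP port).
es_url() {
    printf 'http://%s:%s/' "${1:-localhost}" "${2:-9200}"
}

# Poll the given URL once per second until it answers or the attempts run out.
# Returns 0 as soon as the server responds successfully, 1 otherwise.
wait_for_es() {
    url="$1"
    attempts="${2:-30}"
    while [ "$attempts" -gt 0 ]; do
        if curl -fsS "$url" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Example (run manually once the VM is up):
#   wait_for_es "$(es_url)" && curl -s "$(es_url)_cluster/health?pretty"
```

Polling instead of a single request is useful because Elasticsearch may need a few seconds to start after the VM itself has booted.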

What it is not

The environment I created has some severe limitations. First of all, it is not intended for production use – it is not secured in any way and is not configured to work in distributed scenarios. Also, with its limited memory and storage settings, it cannot handle large amounts of data very well; rather than raising those limits, I would suggest building a cluster.

I am aware there are many better Vagrant Elasticsearch projects available on GitHub. This was, however, my first attempt at making Elasticsearch available locally and it went surprisingly well. Being a “disposable” environment (i.e. being only a secondary data store), such a VM can easily be replaced by a more advanced version in the future. For now it fits nicely into my workflow and I am happy I can spawn another development search engine by simply typing “vagrant up”.