MeiliSearch: A Minimalist Full-Text Search Engine

MeiliSearch is a fast, feature-rich full-text search engine. It's built on top of the LMDB key-value store and lives as a 35 MB binary when installed on Ubuntu or macOS. It comes with a built-in client, server and WebUI. Features such as stemming, stop words, synonyms, ranking, filters and faceting all work out of the box, use sensible defaults and can be easily customised.

The MeiliSearch project began in 2018 with much of the early code written by Clément Renault. Clément later went on to found Meili, a Paris-based firm that provides services around this offering. They currently employ 18 staff according to LinkedIn.

MeiliSearch’s Source Code

The first thing about the MeiliSearch codebase that caught my eye was that, excluding unit tests, it's made up of 7,600 lines of Rust. Competing offerings do often have more features, but they are also much larger: Elasticsearch is made up of nearly 2 million lines of Java, Apache Solr, which uses Apache Lucene, comes in at around 1.3M lines of Java, Groonga is made up of 600K+ lines of C, Manticore Search is made up of 150K lines of C++, Sphinx is made up of 100K lines of C++, Typesense is made up of 50K lines of C++ including headers, and Tantivy, which is also written in Rust, has 40K lines of code. Bleve, which is just a text indexing library for Go that doesn't include any client or server interfaces, is made up of 83K+ lines of code. That being said, there is a standalone search engine that sits on top of it called SRCHX that comes in at a few hundred lines of Go.

One of the reasons behind this seemingly small codebase is that areas of concern have been broken out into separate repositories. The indexer milli sits at 17K lines of Rust, the tokenizer is 1,200 lines long and the WebUI dashboard is a 3.5K-line React app.

The developers have gone out of their way not to reinvent the wheel with their use of third-party libraries. These include heed, which they use to wrap LMDB; Tokio is used for networking, Actix is their web framework, futures is used for streaming functionality and parking_lot handles locking and synchronisation.

Both Sled and RocksDB were considered as candidates for the embedded database backend before the team settled on LMDB. They cited LMDB as having the best combination of performance and stability for their use case.

The choice of using Rust, with its rich ecosystem of libraries, non-verbose syntax and ability to produce performant binaries, appears to have paid off well. Rust started off as a personal project of Mozilla staffer Graydon Hoare back in 2006. In 2020, a StackOverflow survey found Rust to be the most loved programming language among its respondents.

MeiliSearch Up & Running

I ran the following on a 2020 MacBook Pro with a 1.4 GHz quad-core Intel Core i5, 8 GB of RAM and an external SSD connected via Thunderbolt. I'll be using Homebrew as my package manager. The following will install version 0.20.0 of MeiliSearch as well as Python, jq and curl, which will be used throughout this post.

$ brew update
$ brew install \
    curl \
    jq \
    meilisearch \
    virtualenv

If you're running this on Ubuntu 20.04 LTS, the following will install the above.

$ echo "deb [trusted=yes] https://real.fury.io/meilisearch/ /" | 
    sudo tee /and a lot of others/real/sources.list.d/fury.list
$ sudo real update
$ sudo real set up 
    curl 
    jq 
    meilisearch-http 
    python3-pip 
    python3-virtualenv

I'll start MeiliSearch within a screen session so it will stay running in the background. I'll be launching it from a working folder on an external SSD to avoid any excess wear on my computer's primary drive.

$ screen
$ MEILI_NO_ANALYTICS=1 \
  MEILI_HTTP_PAYLOAD_SIZE_LIMIT=1073741824 \
    meilisearch \
        --db-path ./meilifiles \
        --http-addr '127.0.0.1:7700'

Type CTRL-A and then CTRL-D to detach the screen.

Importing Data

MeiliSearch has indices that contain documents. Each index has its own settings. Documents are made up of fields which have a name and a value that can be either a string, integer, float, boolean, array, dictionary or NULL. Dates and times have no native representation. I've seen timestamps converted into integers to get around this limitation, so midnight on March 18th, 1995 would be expressed as the UNIX timestamp 795484800.
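As a quick illustration of that workaround, the following Python snippet (not part of the conversion script used later in this post) produces the integer value that would be stored.

from datetime import datetime, timezone

# Midnight UTC on March 18th, 1995, expressed as seconds since the UNIX epoch.
print(int(datetime(1995, 3, 18, tzinfo=timezone.utc).timestamp()))  # 795484800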

Wikipedia produces a dump of their site's contents every few days. I'll pull down one of the dump's 239 MB, bzip2-compressed segments.

$ wget -c https://dumps.wikimedia.org/enwiki/20210801/enwiki-20210801-pages-articles1.xml-p1p41242.bz2

The archive contains 879 MB of XML. I'll put together a Python script that will extract a portion of its contents and convert it into a JSON format that MeiliSearch is compatible with.

The following will set up a virtual environment where I'll install lxml, an XML library for Python.

$ virtualenv ~/.meilisearch
$ source ~/.meilisearch/bin/activate
$ python3 -m pip install lxml

The following is the conversion script I've put together.

import bz2
import io
import json
import sys

from lxml import etree


def get_parser(bz2_data): 
    prefix          = '{http://www.mediawiki.org/xml/export-0.10/}'
    ns_token        = prefix + 'ns'
    id_token        = prefix + 'id'
    title_token     = prefix + 'title'
    revision_token  = prefix + 'revision'
    text_token      = prefix + 'text'

    with bz2.BZ2File(io.BytesIO(bz2_data), 'r') as bz2_file:
        # Stream through the XML, handling each element as its closing tag is seen.
        for event, element in etree.iterparse(bz2_file, events=('end',)):
            if element.tag.endswith('page'):
                namespace_tag = element.find(ns_token)

                # Namespace 0 is the main article namespace.
                if namespace_tag.text == '0':
                    id_tag = element.find(id_token)
                    title_tag = element.find(title_token)
                    text_tag = element.find(revision_token).find(text_token)
                    yield id_tag.text, title_tag.text, text_tag.text

                element.clear()


parser = get_parser(sys.stdin.buffer.read())
print(json.dumps([{'id':    id_,
                   'title': title,
                   'text':  text}
                  for id_, title, text in parser],
                 ensure_ascii=False))

The following took 102 seconds to complete. It converted 27,199 documents into JSON at a rate of ~8.6 MB/s and 267 documents/s. The resulting JSON file is 842 MB uncompressed.

$ cat enwiki-20210801-pages-articles1.xml-p1p41242.bz2 \
    | python3 convert.py \
    > enwiki-20210801-pages-articles1.json-p1p41242

Below is what the first dictionary, within the only list in the above JSON file, looks like.

$ jq 'nth(0)' enwiki-20210801-pages-articles1.json-p1p41242
{
  "id": "10",
  "title": "AccessibleComputing",
  "text": "#REDIRECT [[Computer accessibility]]\n\n{{rcat shell|\n{{R from move}}\n{{R from CamelCase}}\n{{R unprintworthy}}\n}}"
}

The following POST took 3 seconds to complete. Indexing happens in the background so this is just the delivery time of the raw JSON to MeiliSearch.

$ curl \
      -X POST 'http://127.0.0.1:7700/indexes/articles/documents' \
      --data @enwiki-20210801-pages-articles1.json-p1p41242 \
      | jq

Here is MeiliSearch's response with the update identifier.
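The acknowledgement is a small JSON object. On a freshly created index it looks something like the following; the exact updateId value depends on how many updates have already been submitted to the index.

{
  "updateId": 0
}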

Searching and Ranking

Opening http://127.0.0.1:7700/ in a web browser will bring up a WebUI with a live search field. Results will appear as you type out and refine your query.

Querying via curl on the CLI is also supported. Below is an example where I've searched for “programming languages” and asked for 10 results, each containing the title field, to be returned. I've run the results through jq so that they're more concisely formatted.

$ curl 'http://127.0.0.1:7700/indexes/articles/search' \
        --data '{"q": "programming languages",
                 "attributesToRetrieve": ["title"],
                 "limit": 10}' \
    | jq '.hits | map(.title)'
[
  "ProgrammingLanguages",
  "Timeline of programming languages",
  "List of object-oriented programming languages",
  "Programming language/Timeline",
  "Fourth-generation programming language",
  "Logic programming",
  "Lynx (programming language)",
  "Procedural programming",
  "Class (computer programming)",
  "Structured programming"
]

There are six default ranking rules in MeiliSearch.

  1. typo prioritises documents matching the query terms with the fewest number of typos.
  2. words prioritises documents containing all of the query terms ahead of documents only matching some of them.
  3. proximity prioritises documents where the query terms appear in closest proximity to one another.
  4. attribute prioritises matches based on which field the query matched in. If you have a title, description and author attribute ranking order then matches in the title field carry a higher weight than those in the description or author fields.
  5. words position prioritises documents where the query terms appear closest to the beginning of the field.
  6. exactness prioritises documents that match the query the closest.
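The rules an index is currently using can be read back from the same settings area, which is handy for confirming their order. The path below assumes the articles index created earlier.

$ curl 'http://127.0.0.1:7700/indexes/articles/settings/ranking-rules' | jq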

These can be extended, removed and/or rearranged by posting to the ranking-rules endpoint. This setting is index-specific. Below is an example where hypothetical release_date and rank fields are taken into account when deciding how relevant any matching document is for any given query.

$ curl \
  -X POST 'http://127.0.0.1:7700/indexes/articles/settings/ranking-rules' \
  --data '[
      "typo",
      "words",
      "proximity",
      "attribute",
      "wordsPosition",
      "exactness",
      "asc(release_date)",
      "desc(rank)"
  ]' | jq

The following is a response from the server with the update identifier.
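Like document imports, settings changes are applied asynchronously. Assuming the response reported an update identifier of 1, its progress could be polled via the updates endpoint; substitute whichever identifier your instance actually returned.

$ curl 'http://127.0.0.1:7700/indexes/articles/updates/1' | jq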

Dumping MeiliSearch Indices

The following will instruct MeiliSearch to begin producing a dump of its contents.

$ curl -X POST 'http://127.0.0.1:7700/dumps' | jq
{
  "uid": "20210813-193103820",
  "status": "in_progress"
}

This process can be monitored via a status call.

$ curl 'http://127.0.0.1:7700/dumps/20210813-193103820/status' | jq
{
  "uid": "20210813-193103820",
  "status": "done"
}

Once done, a .dump file will appear within the dumps folder of MeiliSearch's working folder.

$ ls -lh dumps/20210813-193103820.dump
-rwxrwxrwx  1 mark  staff   291M Aug 13 22:32 dumps/20210813-193103820.dump

The dump file is GZIP-compressed. It contains a header section with its metadata before line-delimited JSON is used to serialise the documents.

$ gunzip -c dumps/20210813-193103820.dump \
    | head -c1500 \
    | hexdump -C
00000000  2e 2f 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |./..............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 30 30 34 30  37 35 35 00 30 30 30 30  |....0040755.0000|
00000070  37 36 35 00 30 30 30 30  30 32 34 00 30 30 30 30  |765.0000024.0000|
00000080  30 30 30 30 30 30 30 00  31 34 31 30 35 35 34 34  |0000000.14105544|
00000090  31 36 37 00 30 30 30 37  34 30 34 00 35 00 00 00  |167.0007404.5...|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000100  00 75 73 74 61 72 20 20  00 00 00 00 00 00 00 00  |.ustar  ........|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000140  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00000150  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00000160  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000200  6d 65 74 61 64 61 74 61  2e 6a 73 6f 6e 00 00 00  |metadata.json...|
00000210  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000260  00 00 00 00 30 31 30 30  36 34 34 00 30 30 30 30  |....0100644.0000|
00000270  37 36 35 00 30 30 30 30  30 32 34 00 30 30 30 30  |765.0000024.0000|
00000280  30 30 30 30 33 30 30 00  31 34 31 30 35 35 34 34  |0000300.14105544|
00000290  31 36 37 00 30 30 31 31  37 31 30 00 30 00 00 00  |167.0011710.0...|
000002a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000300  00 75 73 74 61 72 20 20  00 00 00 00 00 00 00 00  |.ustar  ........|
00000310  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000340  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00000350  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00000360  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000400  7b 22 69 6e 64 65 78 65  73 22 3a 5b 7b 22 6e 61  |{"indexes": [{"na|
00000410  6d 65 22 3a 22 61 72 74  69 63 6c 65 73 22 2c 22  |me":"articles","|
00000420  75 69 64 22 3a 22 61 72  74 69 63 6c 65 73 22 2c  |uid":"articles",|
00000430  22 63 72 65 61 74 65 64  41 74 22 3a 22 32 30 32  |"createdAt":"202|
00000440  31 2d 30 38 2d 31 33 54  31 32 3a 32 33 3a 32 32  |1-08-13T12:23:22|
00000450  2e 33 33 30 35 39 36 5a  22 2c 22 75 70 64 61 74  |.330596Z","updat|
00000460  65 64 41 74 22 3a 22 32  30 32 31 2d 30 38 2d 31  |edAt":"2021-08-1|
00000470  33 54 31 32 3a 32 33 3a  32 32 2e 33 33 33 32 34  |3T12:23:22.33324|
00000480  38 5a 22 2c 22 70 72 69  6d 61 72 79 4b 65 79 22  |8Z","primaryKey"|
00000490  3a 22 69 64 22 7d 5d 2c  22 64 62 56 65 72 73 69  |:"id"}],"dbVersi|
000004a0  6f 6e 22 3a 22 30 2e 32  30 2e 30 22 2c 22 64 75  |on":"0.20.0","du|
000004b0  6d 70 56 65 72 73 69 6f  6e 22 3a 22 56 31 22 7d  |mpVersion":"V1"}|
000004c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000005d0  00 00 00 00 00 00 00 00  00 00 00 00              |............|
000005dc

The following is the first JSON-based section of metadata within the dump file.

$ gunzip -c dumps/20210813-193103820.dump \
    | strings -50 \
    | head -n1 \
    | jq
{
  "indexes":  [
    {
      "name": "articles",
      "uid": "articles",
      "createdAt": "2021-08-13T12:23:22.330596Z",
      "updatedAt": "2021-08-13T12:23:22.333248Z",
      "primaryKey": "id"
    }
  ],
  "dbVersion":  "0.20.0",
  "dumpVersion":  "V1"
}

The following is the first document within the dump.

$ gunzip -c dumps/20210813-193103820.dump \
    | strings -50 \
    | grep '^{"id' \
    | head -n1 \
    | jq
{
  "id": "10",
  "title": "AccessibleComputing",
  "text": "#REDIRECT [[Computer accessibility]]\n\n{{rcat shell|\n{{R from move}}\n{{R from CamelCase}}\n{{R unprintworthy}}\n}}"
}

MeiliSearch dumps are compatible between different versions of the software whereas snapshots, which are more performant to produce, are only compatible with the same version of the software they were produced with.
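Snapshots are controlled via launch options as well. Below is a sketch of how they could be scheduled and later re-imported; the flag names are the ones documented for this generation of MeiliSearch and the snapshot filename is an assumption, so check meilisearch --help for your build before relying on them.

$ meilisearch \
    --db-path ./meilifiles \
    --schedule-snapshot \
    --snapshot-dir ./snapshots

$ meilisearch \
    --db-path ./meilifiles \
    --import-snapshot ./snapshots/data.ms.snapshot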

The following will import a dump into MeiliSearch.

$ MEILI_NO_ANALYTICS=1 \
    meilisearch --import-dump dumps/20210813-193103820.dump

Limitations of MeiliSearch

As of this writing, the largest dataset MeiliSearch has been officially tested against has 120 million documents. The software could potentially support more but this is the largest instance of the software I could find any mention of.

MeiliSearch's database size is something to keep an eye on as, by default, it's limited to 100 GB. This can be changed by passing overriding parameters at launch.
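As a sketch, the cap could be raised at launch with the maximum database size option. The flag name below is the one used by 0.20-era builds and the value is given in bytes (roughly 200 GB here); verify it against meilisearch --help before relying on it.

$ MEILI_NO_ANALYTICS=1 \
    meilisearch \
        --db-path ./meilifiles \
        --max-mdb-size 214748364800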

There is also a hard-coded limit of 200 indices and only the first 1,000 words of any attribute will be indexed.

The following endpoint will report the database size and give statistics for each index hosted by the given instance of MeiliSearch.

$ curl http://127.0.0.1:7700/stats | jq
{
  "databaseSize": 4465295366,
  "lastUpdate": "2021-08-13T19:51:25.342231Z",
  "indexes": {
    "articles": {
      "numberOfDocuments": 27199,
      "isIndexing": false,
      "fieldsDistribution": {
        "id": 27199,
        "text": 27199,
        "title": 27199
      }
    }
  }
}

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.
