MeiliSearch is a fast, feature-rich full-text search engine. It's built on top of the LMDB key-value store and ships as a 35 MB binary when installed on Ubuntu or macOS. It comes with a built-in client, server and WebUI. Features such as stemming, stop words, synonyms, ranking, filters and faceting all work out of the box, use sensible defaults and can be easily customised.
The MeiliSearch project began in 2018 with much of the early code written by Clément Renault. Clément later went on to found Meili, a Paris-based company that provides services around this offering. They currently employ 18 staff according to LinkedIn.
MeiliSearch’s Source Code
The first thing about the MeiliSearch codebase that caught my eye was that, excluding unit tests, it's made up of 7,600 lines of Rust. Though competing offerings do often have more features,
Elasticsearch is made up of nearly 2 million lines of Java, Apache Solr, which uses Apache Lucene, comes in at around 1.3M lines of Java, Groonga is made up of 600K+ lines of C, Manticore Search is made up of 150K lines of C++, Sphinx is made up of 100K lines of C++, Typesense is made up of 50K lines of C++ including headers and Tantivy, which is also written in Rust, has 40K lines of code. Bleve, which is just a text indexing library for GoLang that doesn't include any client or server interfaces, is made up of 83K+ lines of code. That being said, there is a standalone search engine that can sit on top of it called SRCHX that comes in at a few hundred lines of GoLang.
One of the reasons behind this seemingly small codebase is that areas of concern have been broken up into separate repositories. The indexer milli sits at 17K lines of Rust, the tokenizer is 1,200 lines long and the WebUI dashboard is a 3.5K-line React app.
The developers have gone out of their way to not reinvent the wheel with their use of third-party libraries. These include heed, which they use to wrap LMDB, Tokio is used for networking, Actix is their Web framework, futures is used for streaming functionality and parking_lot handles locking and synchronisation.
Both Sled and RocksDB were candidates considered for the embedded database backend before the team settled on LMDB. They cited LMDB as having the best combination of performance and stability for this use case.
The choice of using Rust, with its rich ecosystem of libraries, non-verbose syntax and ability to produce performant binaries, appears to have paid off well. Rust started off as a personal project of Mozilla staffer Graydon Hoare back in 2006. In 2020, a StackOverflow survey found Rust to be the most loved programming language among its respondents.
MeiliSearch Up & Working
I ran the next on a 2020 MacBook Legit with a 1.4 GHz Quad-Core Intel Core i5, 8 GB of RAM and an exterior SSD linked by map of Thunderbolt. I may be the use of Homebrew as my package manager. The following will set up version 0.20.0 of MeiliSearch as effectively as Python, jq and curl which can perchance perhaps perhaps be feeble at some level of this post.
$ brew update
$ brew install curl jq meilisearch virtualenv
If you're running this on Ubuntu 20.04 LTS, the following will install the above.
$ echo "deb [trusted=yes] https://real.fury.io/meilisearch/ /" | sudo tee /and a lot of others/real/sources.list.d/fury.list $ sudo real update $ sudo real set up curl jq meilisearch-http python3-pip python3-virtualenv
I'll start MeiliSearch within a screen session so it will remain running in the background. I'll be launching it from a working folder on an external SSD to avoid any excess wear on my computer's main drive.
$ screen
$ MEILI_NO_ANALYTICS=1 \
    MEILI_HTTP_PAYLOAD_SIZE_LIMIT=1073741824 \
    meilisearch --db-path ./meilifiles \
                --http-addr '127.0.0.1:7700'
Type CTRL-A and then CTRL-D to detach the screen session.
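To check the server is reachable before moving on, you can call the version endpoint; the exact fields returned will depend on the build you've installed.

$ curl http://127.0.0.1:7700/version | jq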
Importing Data
MeiliSearch has indices that contain documents. Each index has its own settings. Documents are made up of fields which have a name and a value that can be either a string, integer, float, boolean, array, dictionary or NULL. Dates and times have no native representation. I've seen timestamps converted into integers to get around this limitation. So midnight on March 18th, 1995 would be expressed as the UNIX timestamp 795484800.
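As a quick illustration of that workaround, the snippet below converts that date into an integer before indexing. The release_date field name and the document itself are hypothetical, just for demonstration.

from datetime import datetime, timezone

# Midnight on March 18th, 1995 (UTC) expressed as a UNIX timestamp.
doc = {'id': '1',
       'title': 'Example',
       'release_date': int(datetime(1995, 3, 18,
                                    tzinfo=timezone.utc).timestamp())}

print(doc['release_date'])  # 795484800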
Wikipedia produces a dump of their site's contents every few days. I'll pull down one of the dump's 239 MB, bzip2-compressed segments.
$ wget -c https://dumps.wikimedia.org/enwiki/20210801/enwiki-20210801-pages-articles1.xml-p1p41242.bz2
The archive contains 879 MB of XML. I'll put together a Python script that will extract part of its contents and convert it into a JSON format that MeiliSearch is happy with.
The following will set up a virtual environment where I'll install lxml, an XML library for Python.
$ virtualenv ~/.meilisearch
$ source ~/.meilisearch/bin/activate
$ python3 -m pip install lxml
The following is the conversion script I've put together.
import bz2
import io
import json
import sys

from lxml import etree


def get_parser(bz2_data):
    prefix = '{http://www.mediawiki.org/xml/export-0.10/}'
    ns_token = prefix + 'ns'
    id_token = prefix + 'id'
    title_token = prefix + 'title'
    revision_token = prefix + 'revision'
    text_token = prefix + 'text'

    with bz2.BZ2File(io.BytesIO(bz2_data), 'r') as bz2_file:
        for event, element in etree.iterparse(bz2_file, events=('end',)):
            if element.tag.endswith('page'):
                namespace_tag = element.find(ns_token)

                # Namespace 0 contains the articles themselves.
                if namespace_tag.text == '0':
                    id_tag = element.find(id_token)
                    title_tag = element.find(title_token)
                    text_tag = element.find(revision_token).find(text_token)
                    yield id_tag.text, title_tag.text, text_tag.text

                element.clear()


parser = get_parser(sys.stdin.buffer.read())

print(json.dumps([{'id': id_, 'title': title, 'text': text}
                  for id_, title, text in parser],
                 ensure_ascii=False))
The following took 102 seconds to complete. It converted 27,199 documents into JSON at a rate of ~8.6 MB/s and 267 documents/s. The resulting JSON file is 842 MB in size.
$ cat enwiki-20210801-pages-articles1.xml-p1p41242.bz2 \
    | python3 convert.py \
    > enwiki-20210801-pages-articles1.json-p1p41242
Below is what the first dictionary, within the single list in the above JSON file, looks like.
$ jq 'nth(0)' enwiki-20210801-pages-articles1.json-p1p41242
{ "id": "10", "title": "AccessibleComputing", "textual exclaim": "#REDIRECT [[Computer accessibility]]nn{{rcat shell|n{{R from transfer}}n{{R from CamelCase}}n{{R unprintworthy}}n}}" }
The following POST took 3 seconds to complete. Indexing will happen in the background so this is purely the delivery time of the raw JSON to MeiliSearch.
$ curl -X POST 'http://127.0.0.1:7700/indexes/articles/documents' \
    --data @enwiki-20210801-pages-articles1.json-p1p41242 | jq
Here is MeiliSearch's response with the update identifier.
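The body should look something along these lines, with updateId counting up from zero as further updates are queued.

{
  "updateId": 0
}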
Searching and Ranking
Opening http://127.0.0.1:7700/ in a web browser will present a WebUI with a live search box. Results will appear as you type out and refine your query.
Querying via curl on the CLI is also supported. Below is an example where I've searched for "programming languages" and asked for 10 results, each containing the title field, to be returned. I've run the results through jq so they're more concisely formatted.
$ curl 'http://127.0.0.1:7700/indexes/articles/search' \
    --data '{"q": "programming languages",
             "attributesToRetrieve": ["title"],
             "limit": 10}' \
    | jq '.hits | map(.title)'
[ "ProgrammingLanguages", "Timeline of programming languages", "List of object-oriented programming languages", "Programming language/Timeline", "Fourth-generation programming language", "Logic programming", "Lynx (programming language)", "Procedural programming", "Class (computer programming)", "Structured programming" ]
There are six default ranking rules with MeiliSearch.
- typo prioritises documents matching your query terms with the fewest number of typos.
- words prioritises documents that contain all of your query terms ahead of documents only matching some of them.
- proximity prioritises documents where your search query terms are in closest proximity to one another.
- attribute prioritises which field your search query matched in. If you have a title, description and author attribute ranking order then matches in the title field will have a higher weight than those in the description or author fields.
- words position prioritises documents where your search terms appear closest to the beginning of the field.
- exactness prioritises documents that match your query the closest.
These can be further extended, removed and/or rearranged by posting to the ranking-rules endpoint. This setting is index-specific. Below is an example where hypothetical release_date and rank fields are taken into account when deciding how relevant any matching document is during any given query.
$ curl -X POST 'http://127.0.0.1:7700/indexes/articles/settings/ranking-rules' \
    --data '[
        "typo",
        "words",
        "proximity",
        "attribute",
        "wordsPosition",
        "exactness",
        "asc(release_date)",
        "desc(rank)"
    ]' | jq
The following is a response from the server with the update identifier.
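As with document imports, the settings change is applied asynchronously, so the body should resemble the following; the updateId value depends on how many updates the index has already seen.

{
  "updateId": 1
}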
Dumping MeiliSearch Indices
The following will instruct MeiliSearch to begin producing a dump of its contents.
$ curl -X POST 'http://127.0.0.1:7700/dumps' | jq
{ "uid": "20210813-193103820", "dwelling": "in_progress" }
This process can be monitored via a status call.
$ curl 'http://127.0.0.1:7700/dumps/20210813-193103820/status' | jq
{ "uid": "20210813-193103820", "dwelling": "carried out" }
Once done, a .dump file will appear within the dumps folder of MeiliSearch's working folder.
$ ls -lh dumps/20210813-193103820.dump
-rwxrwxrwx 1 mark staff 291M Aug 13 22:32 dumps/20210813-193103820.dump
The dump file is GZIP-compressed. It contains a header section with its metadata before line-delimited JSON is used to serialise documents.
$ gunzip -c dumps/20210813-193103820.dump | head -c1500 | hexdump -C
00000000  2e 2f 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |./..............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 30 30 34 30  37 35 35 00 30 30 30 30  |....0040755.0000|
00000070  37 36 35 00 30 30 30 30  30 32 34 00 30 30 30 30  |765.0000024.0000|
00000080  30 30 30 30 30 30 30 00  31 34 31 30 35 35 34 34  |0000000.14105544|
00000090  31 36 37 00 30 30 30 37  34 30 34 00 35 00 00 00  |167.0007404.5...|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000100  00 75 73 74 61 72 20 20  00 00 00 00 00 00 00 00  |.ustar  ........|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000140  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00000150  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00000160  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000200  6d 65 74 61 64 61 74 61  2e 6a 73 6f 6e 00 00 00  |metadata.json...|
00000210  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000260  00 00 00 00 30 31 30 30  36 34 34 00 30 30 30 30  |....0100644.0000|
00000270  37 36 35 00 30 30 30 30  30 32 34 00 30 30 30 30  |765.0000024.0000|
00000280  30 30 30 30 33 30 30 00  31 34 31 30 35 35 34 34  |0000300.14105544|
00000290  31 36 37 00 30 30 31 31  37 31 30 00 30 00 00 00  |167.0011710.0...|
000002a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000300  00 75 73 74 61 72 20 20  00 00 00 00 00 00 00 00  |.ustar  ........|
00000310  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000340  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00000350  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00000360  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000400  7b 22 69 6e 64 65 78 65  73 22 3a 5b 7b 22 6e 61  |{"indexes":[{"na|
00000410  6d 65 22 3a 22 61 72 74  69 63 6c 65 73 22 2c 22  |me":"articles","|
00000420  75 69 64 22 3a 22 61 72  74 69 63 6c 65 73 22 2c  |uid":"articles",|
00000430  22 63 72 65 61 74 65 64  41 74 22 3a 22 32 30 32  |"createdAt":"202|
00000440  31 2d 30 38 2d 31 33 54  31 32 3a 32 33 3a 32 32  |1-08-13T12:23:22|
00000450  2e 33 33 30 35 39 36 5a  22 2c 22 75 70 64 61 74  |.330596Z","updat|
00000460  65 64 41 74 22 3a 22 32  30 32 31 2d 30 38 2d 31  |edAt":"2021-08-1|
00000470  33 54 31 32 3a 32 33 3a  32 32 2e 33 33 33 32 34  |3T12:23:22.33324|
00000480  38 5a 22 2c 22 70 72 69  6d 61 72 79 4b 65 79 22  |8Z","primaryKey"|
00000490  3a 22 69 64 22 7d 5d 2c  22 64 62 56 65 72 73 69  |:"id"}],"dbVersi|
000004a0  6f 6e 22 3a 22 30 2e 32  30 2e 30 22 2c 22 64 75  |on":"0.20.0","du|
000004b0  6d 70 56 65 72 73 69 6f  6e 22 3a 22 56 31 22 7d  |mpVersion":"V1"}|
000004c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000005d0  00 00 00 00 00 00 00 00  00 00 00 00              |............|
000005dc
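The ustar markers above show that, underneath the GZIP compression, the dump is a tar archive. Assuming that layout, tar itself can list the archive's members, which should include the metadata.json entry seen above.

$ tar -tzf dumps/20210813-193103820.dump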
The following is the first JSON-based piece of metadata within the dump file.
$ gunzip -c dumps/20210813-193103820.dump | strings -50 | head -n1 | jq
{ "indexes": [ { "name": "articles", "uid": "articles", "createdAt": "2021-08-13T12:23:22.330596Z", "updatedAt": "2021-08-13T12:23:22.333248Z", "primaryKey": "id" } ], "dbVersion": "0.20.0", "dumpVersion": "V1" }
The following is the first document within the dump.
$ gunzip -c dumps/20210813-193103820.dump | strings -50 | grep '^{"id' | head -n1 | jq
{ "id": "10", "title": "AccessibleComputing", "textual exclaim": "#REDIRECT [[Computer accessibility]]nn{{rcat shell|n{{R from transfer}}n{{R from CamelCase}}n{{R unprintworthy}}n}}" }
MeiliSearch dumps are compatible between different versions of the software whereas snapshots, which are more performant to produce, are only compatible with the same version of the software they're produced with.
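As a sketch of the snapshot workflow, the first command below schedules periodic snapshots and the second imports one into a fresh database. The flag names come from this version's --help output and the snapshot filename is an assumption, so verify both against your build.

$ meilisearch --db-path ./meilifiles \
              --schedule-snapshot \
              --snapshot-dir ./snapshots
$ meilisearch --db-path ./meilifiles2 \
              --import-snapshot ./snapshots/data.ms.snapshot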
The following will import a dump into MeiliSearch.
$ MEILI_NO_ANALYTICS=1 meilisearch --import-dump dumps/20210813-193103820.dump
Limitations of MeiliSearch
As of this writing, the largest dataset MeiliSearch is officially being tested against has 120 million documents. The software could potentially support more but this is the largest instance of this software I could find any mention of.
MeiliSearch's database size is something to keep an eye on as, by default, it's limited to 100 GB. This can be changed by passing overriding parameters at launch.
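As a sketch, assuming this version's --max-mdb-size flag, the launch below raises the main database limit to an arbitrary 200 GiB, expressed in bytes.

$ meilisearch --db-path ./meilifiles --max-mdb-size 214748364800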
There is also a hard-coded limit of 200 indices and only the first 1,000 words of any attribute will be indexed.
The following endpoint will report on the database size and give statistics for each index hosted by a given instance of MeiliSearch.
$ curl http://127.0.0.1:7700/stats | jq
{ "databaseSize": 4465295366, "lastUpdate": "2021-08-13T19: 51: 25.342231Z", "indexes": { "articles": { "numberOfDocuments": 27199, "isIndexing": unsuitable, "fieldsDistribution": { "id": 27199, "textual exclaim": 27199, "title": 27199 } } } }
Thanks for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.