
How to Build a Helpful Search for Technical Documentation: The Laravel Example

You can offer the best technology in the world, but if you don’t document it properly, the only engineers who will use it are the ones who are paid to, or who enjoy adventure. Your goal, of course, is to attract every type of developer.

Building a great documentation search seems easy at first, but there are several pitfalls you need to be aware of and avoid. This blog post explains those pitfalls and describes the recipe we used to build the Laravel documentation search.

UPDATE: we have launched DocSearch, the easiest and fastest way to add search to your documentation. Take a look, it’s free!

Pitfall Nº1: the web page as the default entry

Developer documentation often consists of lengthy pages filled with a lot of content. Most people try to index each complete page as one entry in their search engine. They later discover a lot of edge cases and try to fix them through relevance tuning, but this quickly becomes an endless task because the issue comes from the indexing itself:

1. Relevant content only

For example, the query "composer upgrade" will match the QuickStart page because the menu contains "Upgrade Guide" and the first paragraph contains the word "composer". This is not the kind of match that provides a good user experience.


2. Pages contain overly long pieces of text

Developers don’t like to change web pages too often and they like to have long pages containing a lot of information. If such a page is indexed as one document, it will almost systematically trigger relevance issues. This is why we do not recommend using a standard web crawler, but rather a scraper that has access to the original content (most of the time available in Markdown).

For example, querying "cache incrementing value" will match the Query Builder page because it contains a paragraph with the word "cache" and another paragraph with the words "incrementing" and "value". This is a false positive because it is not relevant: the more text you have on a page, the more irrelevant results you will get.
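
The false positive above can be reproduced with a toy matcher. This is a minimal sketch and not how a real engine works: the two paragraphs are made-up stand-ins for the Query Builder page, and `matches_all` is a hypothetical helper that only checks that every query word appears in the text.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_all(query, text):
    """Return True if every query word appears somewhere in the text."""
    return tokenize(query) <= tokenize(text)

# Made-up paragraphs standing in for one long documentation page.
paragraphs = [
    "You may cache the results of a query for a given number of minutes.",
    "The query builder also provides methods for incrementing the value of a column.",
]

query = "cache incrementing value"

# Indexed as one document, the whole page matches: a false positive.
page_match = matches_all(query, " ".join(paragraphs))

# Indexed as one record per paragraph, no single record matches.
paragraph_matches = [p for p in paragraphs if matches_all(query, p)]

print(page_match)         # True
print(paragraph_matches)  # []
```

The more paragraphs are concatenated into one document, the more likely it is that unrelated words from different paragraphs combine into a match.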


3. The right anchored section

In order to deliver the best user experience, it is key to open the page at the exact position of the match. This is very difficult if you only index one document per page. That’s why so many documentation searches just open the page at the top, leaving users to scroll or use their browser’s search to jump to the right section. This is not always easy and is a waste of time.


Pitfall Nº2: indexing titles only

Indexing the titles of your documentation pages will probably answer common queries, but this is not enough. The underlying paragraphs contain most of the words your users will search for. To obtain a great level of relevance, it’s important to index the whole content, body text included.

In this example, the body text is required to correctly answer the "rememberForever" or "cache driver" queries.


Pitfall Nº3: poor relevance

With most search engines, relevance is the trickiest part of the configuration because it is often defined by a single complex formula that mixes so many signals that it is almost impossible to manage. Engineers often adjust the formula or add some bonus/penalty scoring to improve the results on one specific query. Since they don’t have any regression-testing tools, they cannot measure the real impact on all other queries. The consequences can be significant.

In order to keep the ranking under control, it is key to split the ranking formula into several pieces that you understand and can tune independently. In practice, we split the ranking formula with a Tie-Breaking algorithm.

Let’s imagine your ranking formula is split into two parts:

  • the first one defines the textual relevance of a matching hit,
  • the second one defines the importance of a matching hit (from a use-case/business POV).

You can then apply the textual relevance first and move to the use-case/business relevance (importance) only if two hits have the same textual value. This is the best way to ensure your end users always get relevant hits first (from a text POV, matching exactly their query words) and then, when textual relevance is equal, break the tie using the business relevance.

Since you’re not mixing the text and the business relevance together (but applying them one after the other), you can modify the business relevance without impacting how the text relevance works.
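
The two-part split can be sketched as a sort on a tuple. This is only an illustration: the `Hit` type, its two score fields and the sample hits are hypothetical, and both scores are assumed to be "lower is better".

```python
from typing import NamedTuple

class Hit(NamedTuple):
    title: str
    text_score: int      # hypothetical: lower means a better textual match
    business_score: int  # hypothetical: lower means more important

def tie_break_sort(hits):
    """Sort by textual relevance first; business relevance only breaks ties."""
    return sorted(hits, key=lambda h: (h.text_score, h.business_score))

hits = [
    Hit("Paragraph mentioning the query", text_score=0, business_score=6),
    Hit("Popular page, weak match", text_score=2, business_score=0),
    Hit("Page title matching the query", text_score=0, business_score=0),
]

ranked = [h.title for h in tie_break_sort(hits)]
print(ranked)
```

The popular page with a weak textual match stays last no matter how its business score is tuned, which is the property that makes each part safe to adjust on its own.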

Our recipe

1. Create small hierarchical records

To avoid all those pitfalls, we split each page into many smaller chunks, indexed as separate records, by using the HTML structure of the page (H1, H2, H3, H4, P).

Take the Validation page of Laravel’s documentation as an example.

The first record generated will be the Validation page title. It will be transformed into the following JSON object. The “link” attribute only contains the last part of the URL; the first part is easily rebuilt from the tag:

{
    "h1": "Validation", 
    "link": "validation", 
    "importance": 0, 
    "_tags": [
        "5.1"
    ], 
    "objectID": "master-validation-13148717f8faa9037f37d28971dfc219"
}

Then, the first section of the page (the Introduction) will be turned into the following record. The link now contains an anchor, and the record keeps the title of the page:

{
    "h1": "Validation", 
    "h2": "Introduction", 
    "link": "validation#introduction", 
    "importance": 1, 
    "_tags": [
        "5.1"
    ], 
    "objectID": "master-validation#introduction-eeafb566c2af34e739e2685efdb45524"
}

A paragraph of this page under an H3 section would be translated into the following record:

{
    "h1": "Validation",
    "h2": "Validation Quickstart",
    "h3": "Defining the Routes",
    "link": "validation#validation-quickstart",
    "content": "First, let's assume we have the following routes defined in our `app/Http/routes.php` file:",
    "importance": 6,
    "_tags": [
        "5.1"
    ],
    "objectID": "5.1-validation#validation-quickstart-380c9827712413dbe75b5db515cd3e59"
}

This approach fixes pitfalls #1 and #2. We have solved the problem by indexing each chunk of text as an independent record while keeping the title hierarchy in each record.
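
The splitting step could look like the following sketch, built on Python's standard `html.parser`. It is not the scraper we actually run: the `RecordBuilder` class, the sample HTML, the way the anchor is inherited from the closest heading with an `id`, and the objectID format (an MD5 digest of the record) are all assumptions made for illustration. The importance values follow the 0 to 7 scale used in this post.

```python
import hashlib
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4")

class RecordBuilder(HTMLParser):
    """Turn a page into one record per heading and one per paragraph."""

    def __init__(self, page, version):
        super().__init__()
        self.page, self.version = page, version
        self.titles = {}    # current heading hierarchy, e.g. {"h1": "Validation"}
        self.anchor = None  # id of the closest heading that has one
        self.tag = None     # heading or paragraph currently being read
        self.buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag not in HEADINGS and tag != "p":
            return
        self.tag, self.buffer = tag, []
        if tag in HEADINGS:
            # Keep the previous anchor when a sub-heading has no id of its own.
            inherited = None if tag == "h1" else self.anchor
            self.anchor = dict(attrs).get("id") or inherited

    def handle_data(self, data):
        if self.tag:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.tag:
            return
        text = " ".join("".join(self.buffer).split())
        self.tag = None
        if tag in HEADINGS:
            level = HEADINGS.index(tag)
            # A new heading discards every title at the same or a deeper level.
            self.titles = {h: t for h, t in self.titles.items()
                           if HEADINGS.index(h) < level}
            self.titles[tag] = text
            self.emit(importance=level)            # 0..3 for h1..h4
        elif self.titles:
            depth = max(HEADINGS.index(h) for h in self.titles)
            self.emit(importance=depth + 4, content=text)  # 4..7 for text

    def emit(self, importance, content=None):
        link = self.page + (f"#{self.anchor}" if self.anchor else "")
        record = dict(self.titles, link=link, importance=importance,
                      _tags=[self.version])
        if content is not None:
            record["content"] = content
        # Assumed objectID scheme: version, link and a digest of the record.
        digest = hashlib.md5(repr(sorted(record.items())).encode()).hexdigest()
        record["objectID"] = f"{self.version}-{link}-{digest}"
        self.records.append(record)

html = """
<h1>Validation</h1>
<h2 id="introduction">Introduction</h2>
<p>Laravel provides several approaches to validate incoming data.</p>
<h2 id="validation-quickstart">Validation Quickstart</h2>
<h3>Defining the Routes</h3>
<p>First, let's assume we have the following routes defined.</p>
"""

builder = RecordBuilder(page="validation", version="5.1")
builder.feed(html)
records = builder.records
print(len(records), records[-1]["link"], records[-1]["importance"])
```

Every record carries its full title hierarchy, so a hit can be displayed with its context and opened at the right anchor without a second lookup.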

2. Use a tie-breaking ranking algorithm

Algolia is natively designed around a Tie-Breaking algorithm so that everyone can understand and tune the ranking. Pitfall #3 can then be resolved by applying the ranking settings we recommend for a documentation search implementation.

Matching hits will now be sorted against those six ranking criteria: the first 5 are related to text relevance and the last one is the custom business relevance.
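
As a rough sketch, the settings described in this post could be written as the dictionary below. The criterion names come from the sections that follow; `attributesToIndex` (since renamed `searchableAttributes`), `customRanking` and `removeWordsIfNoResults` are Algolia setting names, but the exact values are assumptions pieced together from the text of this post, not a copy of the actual Laravel index settings.

```python
# Assumed reconstruction of the index settings discussed in this post.
settings = {
    # Tie-breaking order: five textual criteria, then the business criterion.
    "ranking": ["words", "typo", "proximity", "attribute", "exact", "custom"],
    # Searchable attributes, most important first. "unordered" means the
    # position of the match inside the attribute is ignored.
    "attributesToIndex": [
        "unordered(h1)",
        "unordered(h2)",
        "unordered(h3)",
        "unordered(h4)",
        "unordered(content)",
    ],
    # Business relevance: the smaller the importance, the better.
    "customRanking": ["asc(importance)"],
    # Run the query again with all words optional when the strict query fails.
    "removeWordsIfNoResults": "allOptional",
}

for name, value in settings.items():
    print(name, "=", value)
```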

Ranking criterion Nº1: number of matched words (words)

First, we sort by the number of query words found in the records. We process the query with all words as mandatory (AND between query terms). If there are not enough matching words, we run the query again with all words as optional (OR between query terms). This process is configured with a single index setting and gives you the best of both worlds: AND reduces the number of false positives, while OR still returns results when the query is too narrow.
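
The AND-then-OR behaviour can be illustrated with a small simulation. This is a simplified sketch, not the engine's implementation: the `search` helper and the sample records are hypothetical, and the fallback is triggered only when the strict query returns nothing.

```python
import re

def words(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query, records):
    """Try AND between query terms first; fall back to OR if nothing matches."""
    terms = words(query)
    strict = [r for r in records if terms <= words(r)]
    if strict:
        return strict
    # Fallback: any record sharing at least one term, most matched words first.
    loose = [r for r in records if terms & words(r)]
    return sorted(loose, key=lambda r: -len(terms & words(r)))

records = [
    "Cache configuration",
    "Obtaining a cache instance",
    "Database configuration",
]

strict_hits = search("cache configuration", records)
fallback_hits = search("cache configuration timeout", records)
print(strict_hits)
print(fallback_hits)
```

In the fallback case, sorting by the number of matched words is what the "words" criterion does: the record matching two of the three terms comes before the records matching only one.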

Ranking criterion Nº2: number of typos (typo)

If two records match with the same number of search terms, we use the number of typos as the differentiator (so we have exact matches first, then matches with 1 typo, then matches with 2 typos, …).

For example if the query is “validator”, the record that contains “validate” will match with some typos but will be retrieved after the record containing “validator”.
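
To make the typo count concrete, here is a sketch that approximates it with the classic Levenshtein edit distance. Treating the engine's notion of a typo as plain edit distance is an assumption made for illustration; `edit_distance` is a hypothetical helper.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

query = "validator"
candidates = ["validate", "validator"]

# Fewer typos first: the exact match comes before the approximate one.
ranked = sorted(candidates, key=lambda word: edit_distance(query, word))
print(ranked)  # ['validator', 'validate']
```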

Ranking criterion Nº3: proximity between query terms (proximity)

When two records are identical for the words and typos ranking criteria, we move to the next criterion, which compares the proximity of the query terms in the record. It basically counts the number of words in between them until a limit is reached (after a certain point they are considered as “too far”).

For example, the "cache configuration" query will have a proximity of 1 when it matches the sentence: "The cache configuration is ..." and will have a proximity of 2 when it matches the sentence "... in the config/cache.php configuration file". We sort this value in increasing order, as we prefer records that contain the query terms close together.
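
The two proximity values above can be reproduced with a simple position-based sketch. The tokenization on non-alphanumeric characters, the `proximity` helper and the limit of 8 are assumptions for illustration; the engine's real computation may differ.

```python
import re

def proximity(query, text, limit=8):
    """Sum of the smallest position gaps between consecutive query terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = re.findall(r"[a-z0-9]+", query.lower())
    positions = [[i for i, t in enumerate(tokens) if t == term] for term in terms]
    if not all(positions):
        return None  # at least one query term is missing from the text
    total = 0
    for left, right in zip(positions, positions[1:]):
        gap = min(abs(r - l) for l in left for r in right)
        total += min(gap, limit)  # beyond the limit, terms count as "too far"
    return total

close = proximity("cache configuration", "The cache configuration is ...")
far = proximity("cache configuration", "... in the config/cache.php configuration file")
print(close, far)  # 1 2
```

In the second sentence "config/cache.php" is split into three words, so "php" sits between "cache" and "configuration" and the gap becomes 2.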

Ranking criterion Nº4: the matched attribute (attribute)

If two records are identical for the first 3 ranking criteria, we use the name of the matched attribute to determine which hit needs to be retrieved first. In the index settings, just list the attributes you want to search in order of importance: h1 first, then h2, then h3, and so on.

That means a match inside h1 is better than a match in h2, which is better than a match in h3, etc. Each attribute also carries an “unordered” flag, which means that the position of the match inside the attribute is not considered in the ranking. That’s why the query "cache" gets the same attribute score for a record that contains "[Cache Configuration]" as for one that contains "[Obtaining a cache instance]" in the same attribute.
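
A minimal sketch of this criterion, assuming the attribute order h1, h2, h3, h4, content and "unordered" matching everywhere. The `attribute_rank` helper and the sample records are hypothetical.

```python
import re

# Assumed attribute order, most important first.
ATTRIBUTES = ["h1", "h2", "h3", "h4", "content"]

def attribute_rank(query_word, record):
    """Index of the best attribute containing the word; lower is better."""
    for rank, name in enumerate(ATTRIBUTES):
        tokens = re.findall(r"[a-z0-9]+", record.get(name, "").lower())
        # "Unordered": only presence matters, not the position in the attribute.
        if query_word.lower() in tokens:
            return rank
    return len(ATTRIBUTES)  # the word matched no attribute

first_word = {"h2": "Cache Configuration"}
third_word = {"h2": "Obtaining a cache instance"}
body_only = {"h2": "Introduction", "content": "Each cache driver is configured here."}

ranks = [attribute_rank("cache", r) for r in (first_word, third_word, body_only)]
print(ranks)  # [1, 1, 4]
```

The first two records tie on this criterion even though "cache" appears at different positions in the title, so the next criteria decide between them.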

Ranking criterion Nº5: the number of terms matching exactly (exact)

If two records are identical for the first 4 criteria, we use the number of query terms that match exactly in the record to determine which hit needs to be retrieved first. Because we’re returning results after each keystroke, the last query term will mostly match as a prefix (it will match the beginning of words). This criterion is used to rank an exact match before a prefix match.

For example the query “valid” will retrieve the records containing “valid” before the ones containing “validation”.
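
The exact-versus-prefix distinction can be sketched as follows. The helpers and the two sample titles are hypothetical; as described above, only the last query term is allowed to match as a prefix.

```python
import re

def tokens_of(text):
    """Lowercase a string and return its list of alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(query, text):
    """All terms must be present; the last one may match as a prefix."""
    terms, tokens = tokens_of(query), tokens_of(text)
    if not terms:
        return False
    *full, last = terms
    return all(t in tokens for t in full) and any(tok.startswith(last) for tok in tokens)

def exact_count(query, text):
    """Number of query terms that appear in the text as whole words."""
    tokens = set(tokens_of(text))
    return sum(term in tokens for term in tokens_of(query))

records = ["Validation Quickstart", "Checking for valid files"]

hits = [r for r in records if matches("valid", r)]
# More exact matches first, so "valid" beats the prefix match "validation".
ranked = sorted(hits, key=lambda r: -exact_count("valid", r))
print(ranked)  # ['Checking for valid files', 'Validation Quickstart']
```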

Ranking criterion Nº6: business ranking (custom)

There is still one important thing missing: your use-case/business criterion. If all previous criteria are identical for two records, we use the custom ranking which is defined by the user.

For example, searching for "Validation" will match the two following records using the most important “h1” attribute. That results in a tie on all previous criteria, but we want to retrieve the page title first because the other record is a paragraph. This is where the "importance" attribute added to the records comes into play.

{
    "h1": "Validation", 
    "link": "validation", 
    "importance": 0, 
    "_tags": [
        "5.1"
    ], 
    "objectID": "master-validation-13148717f8faa9037f37d28971dfc219"
}
{
    "h1": "Validation", 
    "h2": "Working With Error Messages", 
    "h3": "Custom Error Messages", 
    "link": "validation#custom-error-messages", 
    "content": "If needed, you may use custom error messages for validation instead of the defaults. There are several ways to specify custom messages. First, you may pass the custom messages as the third argument to the `Validator::make` method:", 
    "importance": 6, 
    "_tags": [
        "5.1"
    ], 
    "objectID": "5.1-validation#custom-error-messages-380c9827712413dbe75b5db515cd3e59"
}

The “importance” value is an integer that goes from 0 (page title) to 7 (text section under h4), which we use in the custom ranking in ascending order (the smaller, the better).

The complete scale of importance is the following:

  • 0 for h1,
  • 1 for h2,
  • 2 for h3,
  • 3 for h4,
  • 4 for text under h1,
  • 5 for text under h2,
  • 6 for text under h3,
  • and 7 for text under h4.
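
The scale above can be computed from the shape of a record. This is a minimal sketch: the `importance` helper is hypothetical, it assumes every record has at least one heading, and the paragraph content is shortened from the example record shown earlier.

```python
HEADINGS = ["h1", "h2", "h3", "h4"]

def importance(record):
    """Map a record to the 0-7 scale: headings score 0-3, text under them 4-7."""
    deepest = max(HEADINGS.index(h) for h in HEADINGS if h in record)
    return deepest + 4 if "content" in record else deepest

title = {"h1": "Validation"}
paragraph = {
    "h1": "Validation",
    "h2": "Working With Error Messages",
    "h3": "Custom Error Messages",
    "content": "If needed, you may use custom error messages for validation instead of the defaults.",
}

# Ascending custom ranking: the page title wins the tie against the paragraph.
ranked = sorted([paragraph, title], key=importance)
print([importance(r) for r in ranked])  # [0, 6]
```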

This is a generic recipe

We have successfully applied this recipe to several technical documentation sites, namely the Laravel documentation and the Bootstrap documentation search. The way results are displayed differs, but we use exactly the same approach and the same API.

Get DocSearch for your website

One of our missions is to help all developers better access and navigate technical documentation. If you are working on an open source project, we’d be happy to help you! We will provide you with a free Algolia account and with the support you need to make your implementation a best-in-class reference. Drop us a note!

  • Laravel’s documentation is a good example for others to learn from. Algolia is quite cool.
    Cheers!

  • Rhoit

    Algolia is really cool, no doubt why HN choose it. Thanks a ton Rohit

  • mlbrgl

    Hi, following your advice I am considering using this approach for indexing long and structured articles (e.g. http://devsante.org/base-documentaire/medecine/la-rage). I still cannot get my head around a couple of things though.

    Running the “validation” query on https://laravel.com/docs/5.2/validation, I am being returned, in that order:

    – Validation (h1)
    – Validation (h1) / Working with error messages (h2)
    – Validation (h1) / Validation quickstart (h2)

    All of these are headings, no content outside of h1 and h2 is being returned as indexed for those records.

    The way I understand it, both the first and second results match on the h1 tag but the second loses against the first because of the importance criterion (being an h2 record, it gets 1 point instead of 0, lower is better).

    Now why is the third record being returned last even though it has “validation” in both attributes? Is it because it has the misfortune of matching on an h2, which somehow does not positively compound with the fact that it matches on an h1 as well? Running that hypothesis on the next results does not make sense, as Validation (h1) / Introduction (h2) is returned in fifth position, after the presumably unfortunate Validation / Validation quickstart.

    So in short:

    – do attributes compound in the algorithm when matches are found in more than one (h1 and h2 in our example), and if so how?

    – how does this compare to not indexing the parent headings in the DOM, i.e. having h2 records not containing the parent h1 in our example:
    – Validation (h1)
    – Working with error messages (h2)
    – Validation quickstart (h2)
    Is the reason on the relevance side or on the contextualisation of results side when returning suggestions?

    Thanks!

    • The ranking algorithm keeps the best match (so in your case all are equal since they all contain “Validation” in h1); the only difference between the results is in the custom ranking part (in this case only the level, but it can also take popularity into account).

      Depending on your use case, you can indeed decide not to index the parents in the DOM and only keep them to display the result (this is something we have seen several times and it can make sense). You can do that by adding a separate attribute that contains the text to index and only having this one in your attributes to index (while keeping h1, h2, h3 for the display of your results).