In the search world, the foundational building block is the “term”.
Simplistically, a term is a single word that is stored in the “postings” file.
If you were to index the content "The quick brown fox jumps over the lazy dog",
you would expect to get the following terms: The, quick, brown, fox, jumps,
over, the, lazy, dog. This content is stored in a field of a document:
// the document
// a field
content: "The quick brown fox jumps over the lazy dog"
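Assuming this is document 1, it would get turned into something like:
brown: [1]
dog: [1]
fox: [1]
jumps: [1]
lazy: [1]
over: [1]
quick: [1]
The: [1]
the: [1]
Notice that The and the are both listed in the postings because they
are different words to a computer. Because of this we will want to “normalize”
the content before indexing it. In this case we are going to lowercase
everything first. The normalized index looks like:
brown: [1]
dog: [1]
fox: [1]
jumps: [1]
lazy: [1]
over: [1]
quick: [1]
the: [1]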
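To see why normalization matters, here is a minimal Python sketch that indexes document 1 twice, once as-is and once lowercased. The `build_index` helper and its whitespace tokenizer are assumptions made for illustration; a real search engine uses a much richer analysis chain.

```python
from collections import defaultdict

def build_index(docs, normalize=None):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():  # naive whitespace tokenizer (assumption)
            term = normalize(token) if normalize else token
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "The quick brown fox jumps over the lazy dog"}

raw_index = build_index(docs)                            # no normalization
normalized_index = build_index(docs, normalize=str.lower)

print(sorted(raw_index))         # "The" and "the" are separate terms here
print(sorted(normalized_index))  # only "the" remains after lowercasing
```

The raw index ends up with one more term than the normalized one, because the lowercaser folds The and the into a single entry.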
Much better. Now let's index another document:
content: "The dog chases the cat over the hill"
The updated index looks like:
brown: [1]
cat: [2]
chases: [2]
dog: [1, 2]
fox: [1]
hill: [2]
jumps: [1]
lazy: [1]
over: [1, 2]
quick: [1]
the: [1, 2]
With each new document we repeat this process, adding more terms and more
documents to those terms. This is the inverted index aspect of search.
It's worth pointing out that dog, over, and the were not duplicated in the
index; all we did was add another doc id to the list for each term. This is one
of the ways that full text search indexes can be smaller than the source system.
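The incremental process can be sketched in a few lines of Python. This is only an illustration: `add_document` is a hypothetical helper, the tokenizer is a plain whitespace split, and the index is an ordinary dictionary rather than a real postings file.

```python
def add_document(index, doc_id, text):
    """Add one document's terms to the index in place."""
    for token in text.split():      # whitespace tokenizer (assumption)
        term = token.lower()        # the lowercasing step described above
        ids = index.setdefault(term, [])
        if doc_id not in ids:       # record each doc id once per term
            ids.append(doc_id)

index = {}
add_document(index, 1, "The quick brown fox jumps over the lazy dog")
terms_after_first = len(index)
add_document(index, 2, "The dog chases the cat over the hill")

for term in sorted(index):
    print(f"{term}: {index[term]}")

# Terms that already existed only gained a doc id; only unseen words add entries.
new_terms = len(index) - terms_after_first
print(f"{new_terms} new terms; {len(index)} terms in total")
```

The second document has eight words but adds only three new terms to the index, since the, dog, and over already had entries.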
When you go to query the system, your query for dog runs through the
lowercaser. The system then looks up dog in the postings file, finds document
ids 1 and 2, pulls those documents out of “storage”, and returns them to you.
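The query path can be sketched the same way. In this Python example, `storage`, `postings`, and `search` are hypothetical in-memory stand-ins for the stored documents, the postings file, and the query handler. It handles single-term queries only.

```python
# Hypothetical in-memory stand-ins for document storage and the postings file.
storage = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "The dog chases the cat over the hill",
}
postings = {}
for doc_id, text in storage.items():
    for token in text.split():  # whitespace tokenizer (assumption)
        postings.setdefault(token.lower(), set()).add(doc_id)

def search(query):
    """Return (doc_id, content) pairs for a single-term query."""
    term = query.lower()                      # same lowercaser used at index time
    doc_ids = sorted(postings.get(term, ()))  # look up the term in the postings
    return [(doc_id, storage[doc_id]) for doc_id in doc_ids]  # fetch from storage

for doc_id, content in search("Dog"):
    print(doc_id, content)
```

Running the query through the same normalization used at index time is what lets "Dog", "DOG", and "dog" all find the same documents.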