Leveraging GPT-3 for Search Solutions Development

--

Search solutions exist in almost every modern application. They make existing content available to users on request, and it is hard to imagine retrieving that information through anything other than the familiar search box.

Custom-made search solution options

At the same time, even though the need to query and search is very common, different applications take drastically different approaches.

In most cases, companies remain at a very basic level of search by simply querying OLTP databases directly. A request may look like this: “SELECT id, title FROM entities WHERE description LIKE '%bow%'”. More frequently, though, such queries grow into complex, multi-level table joins that are hard to read, slow, and still primitive: they cannot comprehend context, require numerous customizations, and are challenging to implement properly. While it is possible to improve query execution time through materialized views, query caching, and other techniques, the added complexity results in a considerable lag between primary updates and consistent search results.

More efficient alternatives to primitive DB-based search are open-source search engines such as Apache Lucene, Apache Solr, Elasticsearch, Sphinx, MeiliSearch, Typesense, etc. These tend to be considerably faster and much better at dealing with complex queries and working with filters. But once these search engines are compared to counterparts like Google Search or DuckDuckGo, it becomes clear that open-source solutions fall short at building proper search context and handling query modalities: they are incapable of understanding a query if the user provides a vague search request.

Extracting the meaning from the query

Imagine you simply cannot remember what that yellow citrus fruit with a sour taste is called! But you want to search the app for an article on how to grow this mysterious fruit. How do you go about that search? Your query may be, “how to grow yellow sour citrus indoors”. Any of the aforementioned open-source search engines may struggle significantly to return relevant results for this query, even if the database does contain articles about growing “lemons”.

That is because the extraction of meaning from a query is a natural language task and is unlikely to be solved without AI components. GPT-3 is good at this task.

OpenAI offers a GPT-3-based embeddings API that converts natural language text into a vector of floating-point numbers. An embedding is essentially a low-dimensional space into which high-dimensional data can be translated; in this case, the high-dimensional input is text and the low-dimensional output is a vector of fixed size. The distance between two vectors represents how strongly they correlate: the smaller the distance, the higher the correlation. By representing text in vector form, the task is reduced to a classical ML search problem.
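As a toy illustration (made-up three-dimensional vectors, not real embeddings), here is how cosine similarity captures the idea that a smaller distance means a stronger correlation:

import numpy as np

def toy_cosine_similarity(a, b):
    # 1.0 = same direction (high correlation), 0.0 = unrelated, -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

lemon = np.array([0.9, 0.1, 0.3])     # pretend embedding of "lemon"
citrus = np.array([0.8, 0.2, 0.35])   # pretend embedding of "sour yellow citrus"
bicycle = np.array([0.1, 0.9, 0.2])   # pretend embedding of "bicycle"

print(toy_cosine_similarity(lemon, citrus))   # ~0.99: semantically close texts
print(toy_cosine_similarity(lemon, bicycle))  # ~0.27: unrelated texts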

Choosing the right model

The conversion of document text into a vector representation can happen in the background, while the vectorization of the search query must occur at runtime. OpenAI offers several GPT-3 model families for document search, each with a different output vector size:

text-search-ada-doc-001: 1024
text-search-babbage-doc-001: 2048
text-search-curie-doc-001: 4096
text-search-davinci-doc-001: 12288

Higher vector dimensions lead to more embedded information and, thus, also higher costs and slower searches.

Documents are usually long, while queries are generally short and incomplete. Therefore, the vectorization of a document differs significantly from that of a query in terms of content density and size. OpenAI knows this, so it offers paired ‘-doc’ and ‘-query’ models:

text-search-ada-query-001: 1024
text-search-babbage-query-001: 2048
text-search-curie-query-001: 4096
text-search-davinci-query-001: 12288

It is important to note that queries and documents must be embedded with models from the same family, so that their output vectors have the same length.
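To keep this constraint explicit in code, one option (a small sketch of my own, mirroring the lists above) is a lookup table of matching doc and query models:

# Paired '-doc' / '-query' models and their shared output vector size
MODEL_PAIRS = {
    "text-search-ada-doc-001": ("text-search-ada-query-001", 1024),
    "text-search-babbage-doc-001": ("text-search-babbage-query-001", 2048),
    "text-search-curie-doc-001": ("text-search-curie-query-001", 4096),
    "text-search-davinci-doc-001": ("text-search-davinci-query-001", 12288),
}

def query_model_for(doc_model):
    # Guarantees that queries are embedded with the model matching the documents
    return MODEL_PAIRS[doc_model][0]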

Example dataset

It may be easiest to observe and understand the power of this search solution through example. For this example, let us draw on the TMDB 5000 Movie Dataset which contains metadata about roughly 5,000 movies from TMDb. We will build a search solution relying only on movie documents, not reviews.

The dataset contains plenty of columns, but our vectorization process will be built around the title and overview columns only.

[Image: embeddings diagram]

An example record:

Title: Harry Potter and the Half-Blood Prince
Overview: As Harry begins his sixth year at Hogwarts, he discovers an old book marked as ‘Property of the Half-Blood Prince’, and begins to learn more about Lord Voldemort’s dark past.

Let’s map the dataset into ready-to-index text:

import pandas as pd

datafile_path = "./tmdb_5000_movies.csv"
df = pd.read_csv(datafile_path)

def combined_info(row):
    # Combine the selected columns into one block of text to be embedded
    columns = ['title', 'overview']
    columns_to_join = [f"{column.capitalize()}: {row[column]}" for column in columns]
    return '\n'.join(columns_to_join)

df['combined_info'] = df.apply(lambda row: combined_info(row), axis=1)

The embedding process is straightforward:

import openai
# Assumes openai.api_key is already set (e.g. via the OPENAI_API_KEY environment variable)

def get_embedding(text, model="text-search-babbage-doc-001"):
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

get_embedding(df['combined_info'][0])

This block of code outputs a list whose length equals the dimensionality of the model’s embedding space, which in the case of ‘text-search-babbage-doc-001’ is 2048.

A similar embedding process should be applied to all documents we would like to search on:

df['combined_info_search'] = df['combined_info'].apply(lambda x: get_embedding(x, model='text-search-babbage-doc-001'))
df.to_csv('./tmdb_5000_movies_search.csv', index=False)

The ‘combined_info_search’ column will hold the vector representation of combined_info.
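One practical note: pandas serializes the embedding lists as strings when writing to CSV, so if you reload the file later you need to parse them back into numeric arrays. A minimal sketch, assuming the file produced above:

import ast

import numpy as np
import pandas as pd

df = pd.read_csv('./tmdb_5000_movies_search.csv')
# The CSV stores vectors as strings like "[0.12, -0.07, ...]"; convert them back
df['combined_info_search'] = df['combined_info_search'].apply(
    lambda s: np.array(ast.literal_eval(s))
)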

And, surprisingly, that’s already it! Finally, we are ready to perform a sample search query:

from openai.embeddings_utils import get_embedding, cosine_similarity

def search_movies(df, query, n=3, pprint=True):
    embedding = get_embedding(
        query,
        engine="text-search-babbage-query-001"
    )
    df["similarities"] = df.combined_info_search.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(n)
        .combined_info
    )
    if pprint:
        for r in res:
            print(r[:200])
            print()
    return res


res = search_movies(df, "movie about the wizardry school", n=3)

These are the results we get:

Title: Harry Potter and the Philosopher’s Stone
Overview: Harry Potter has lived under the stairs at his aunt and uncle’s house his whole life. But on his 11th birthday, he learns he’s a powerful wizard — with a place waiting for him at the Hogwarts School of Witchcraft and Wizardry. As he learns to harness his newfound powers with the help of the school’s kindly headmaster, Harry uncovers the truth about his parents’ deaths — and about the villain who’s to blame.

Title: Harry Potter and the Goblet of Fire
Overview: Harry starts his fourth year at Hogwarts, competes in the treacherous Triwizard Tournament and faces the evil Lord Voldemort. Ron and Hermione help Harry manage the pressure - but Voldemort lurks, awaiting his chance to destroy Harry and all that he stands for.

Title: Harry Potter and the Prisoner of Azkaban
Overview: Harry, Ron and Hermione return to Hogwarts for another magic-filled year. Harry comes face to face with danger yet again, this time in the form of an escaped convict, Sirius Black - and turns to sympathetic Professor Lupin for help.

The overview for ‘Harry Potter and the Philosopher’s Stone’ contains the words ‘wizardry’ and ‘school’ and comes first in the search output. The second result no longer contains the word ‘school’ but still retains a word close to ‘wizardry’: ‘Triwizard’. The third result contains only a synonym of ‘wizardry’: magic.

There are, of course, a multitude of other movies within this database that feature schools or wizards (or both), but the above were the only ones returned to us. This is clear proof that the search solution works and actually understood the context of our query.

We used the babbage model, whose embeddings have only 2048 dimensions. Davinci’s embeddings have six times more (12,288) dimensions and can, thus, perform significantly better on highly complex queries.

The search solution may occasionally fail to produce output relevant to some queries. For instance, the query ‘movies about wizards in school’ produces:

Title: Harry Potter and the Philosopher’s Stone
Overview: Harry Potter has lived under the stairs at his aunt and uncle’s house his whole life. But on his 11th birthday, he learns he’s a powerful wizard — with a place waiting for him at the Hogwarts School of Witchcraft and Wizardry. As he learns to harness his newfound powers with the help of the school’s kindly headmaster, Harry uncovers the truth about his parents’ deaths — and about the villain who’s to blame.

Title: Dumb and Dumberer: When Harry Met Lloyd
Overview: This wacky prequel to the 1994 blockbuster goes back to the lame-brained Harry and Lloyd's days as classmates at a Rhode Island high school, where the unprincipled principal puts the pair in remedial courses as part of a scheme to fleece the school.

Title: Harry Potter and the Prisoner of Azkaban
Overview: Harry, Ron and Hermione return to Hogwarts for another magic-filled year. Harry comes face to face with danger yet again, this time in the form of an escaped convict, Sirius Black - and turns to sympathetic Professor Lupin for help.

What is ‘Dumb and Dumberer: When Harry Met Lloyd’ doing here, you may wonder? Thankfully, this issue was not reproduced with the larger models.

Calculating the distance between the query and documents

The search output should consist of documents sorted in descending order of relevance. To achieve this, we need to know the distance between the query and each document: the smaller the distance, the more relevant the document. Beyond a defined maximum distance, we can stop considering the remaining documents altogether.

[Image: relevance output]

In the example above, we used cosine similarity to calculate distance because of the high dimensionality of the vector space: with the babbage model, each vector has 2048 dimensions.

Distance calculation algorithms represent this similarity (or difference) between a query and a document as a single number. However, we cannot rely on Euclidean distance because of the curse of dimensionality: beyond roughly seven dimensions, distances start to fall into the same buckets and become almost identical, so they no longer discriminate between documents.
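A quick way to see this effect is to compare the spread of Euclidean distances between random points at low and high dimensionality. The sketch below uses random data rather than real embeddings, purely to illustrate how the max/min distance ratio collapses as dimensionality grows:

import numpy as np

rng = np.random.default_rng(42)

def distance_spread(dims, n_points=1000):
    # Ratio between the farthest and nearest neighbor of one reference point
    points = rng.random((n_points, dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return dists.max() / dists.min()

for dims in (2, 10, 100, 2048):
    print(dims, round(distance_spread(dims), 2))
# As dims grows, the ratio approaches 1: all distances look almost identical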

If you would like, you may check out the resulting repository here:

Alternatively, you can play with it in Google Colab here:

Time complexity

We used the brute-force approach: compute the distance to every document, then sort. Let’s define:

● n: number of points in the training dataset

● d: data dimensionality

The search time complexity for the brute-force solution is O(n * d + n * log(n)). Parameter d depends on the model (in the case of babbage, it is equal to 2048), while the O(n * log(n)) term comes from the sorting step.

It is critical to remind ourselves at this stage that smaller models are faster and cheaper. For instance, in the similarity-calculation step of the search, the Ada model is roughly two times faster than Babbage, while the Davinci model is about six times slower.

Cosine similarity calculations between the query and 4,803 documents of 2048 dimensions took 1,260 ms on my M1 Pro. In the current implementation, the time required grows linearly with the total number of documents. At the same time, this approach lends itself well to parallelization.
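The per-row pandas apply above is easy to read but leaves performance on the table. As a rough sketch of my own, assuming the document embeddings have already been stacked into a single NumPy matrix, the whole similarity pass can be done with one matrix-vector product, and the top k results can be extracted without sorting all n documents:

import numpy as np

def top_k_cosine(doc_matrix, query_vec, k=3):
    # doc_matrix: (n_docs, dims) array of document embeddings
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = docs @ q                          # all cosine similarities: O(n * d)
    top = np.argpartition(-sims, k)[:k]      # top-k candidates without a full sort
    return top[np.argsort(-sims[top])]       # order only the k winners

# Hypothetical usage, reusing the dataframe from earlier:
# doc_matrix = np.vstack(df['combined_info_search'].to_numpy())
# best = top_k_cosine(doc_matrix, np.array(query_embedding), k=3)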

Alternatives to the brute force solution

In search solutions, queries should complete as quickly as possible, and the price for that is usually paid in indexing and pre-computation time. We can use data structures like a k-d tree, r-tree, or ball tree. According to the Towards Data Science article on the computational complexity of these methods, they all lead to a search complexity close to O(k * log(n)), where k is the number of elements we would like to return in a single batch.

K-d trees, ball trees, and r-trees are data structures used to store and efficiently search for points in N-dimensional space, such as our meaning vectors.

K-d and ball trees are tree-based data structures that use an iterative, binary partitioning scheme to divide the space into regions, with each node in the tree representing a subregion. K-d trees are particularly efficient at searching for points within a specific range or finding the nearest neighbor to a given point.

Similarly, r-trees are also used to store points in N-dimensional space; however, they are much more efficient at searching for points within a specific region or finding all points within a certain distance of a given point. Importantly, r-trees use a different partitioning scheme from k-d trees and ball trees: they divide the space into ‘rectangles’ rather than binary partitions.
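As a quick illustration of the idea (not part of the original pipeline), scikit-learn can build a ball tree over the document vectors. These trees work with Euclidean distance, but L2-normalizing the vectors first makes the resulting ranking equivalent to ranking by cosine similarity:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_ball_tree_index(doc_matrix):
    # L2-normalize so that Euclidean nearest neighbors match cosine nearest neighbors
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return NearestNeighbors(algorithm='ball_tree').fit(docs)

def ball_tree_search(index, query_vec, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    distances, indices = index.kneighbors([q], n_neighbors=k)
    return indices[0]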

Full tree implementations fall outside the scope of this article, and different implementations will lead to different search outputs.

Query time

Perhaps the most significant disadvantage of the current search solution is that we must call an external OpenAI API to retrieve the query embedding vector. No matter how quickly our algorithm finds the nearest neighbors, this sequential, blocking step is always required.

text-search-babbage-query-001 (2048 dimensions), 100 queries: average 225 ms, median 207 ms, max 1301 ms, min 176 ms
text-embedding-ada-002 (1536 dimensions), 100 queries: average 264 ms, median 250 ms, max 859 ms, min 215 ms
text-search-davinci-query-001 (12288 dimensions), 100 queries: average 379 ms, median 364 ms, max 1161 ms, min 271 ms

If we take the medians as a reference point, ada-002 is about 21% slower and davinci-001 is about 76% slower than babbage.
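Since every query pays this 200-400 ms round trip, one partial mitigation (my own suggestion rather than something from the benchmark) is to cache query embeddings, so repeated or popular queries skip the API call entirely:

from functools import lru_cache

import openai

@lru_cache(maxsize=10_000)
def get_query_embedding_cached(query, model="text-search-babbage-query-001"):
    # Only repeated query strings benefit; a shared cache such as Redis would
    # also survive restarts and work across several worker processes.
    text = query.replace("\n", " ")
    response = openai.Embedding.create(input=[text], model=model)
    return tuple(response['data'][0]['embedding'])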

Limitations of GPT-3 Search Embedding

Referring to Nils Reimers’ article comparing dense text embedding models, we may conclude that GPT-3 does not provide exceptional performance or output quality, and it depends on an external API that is rather slow. GPT-3 supports inputs of up to 4096 tokens (approximately 3072 words); however, there is no truncation service available through the API, and attempting to encode text longer than 4096 tokens results in an error. Thus, it is your responsibility, as the user, to determine how much text can actually be encoded.
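To guard against that error, you can count tokens client-side and truncate before calling the API. A minimal sketch using tiktoken, under the assumption that its r50k_base encoding reasonably approximates the tokenizer of the first-generation search models:

import tiktoken

MAX_TOKENS = 4096  # the limit cited above; adjust for the model you actually use

def truncate_for_embedding(text, max_tokens=MAX_TOKENS, encoding_name="r50k_base"):
    # Encode, drop the excess tokens, and decode back into text
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])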

Also, training costs are relatively high with OpenAI!

[Image: training costs]

Alternatively, you may consider trying TAS-B or multi-qa-mpnet-base-dot-v1.

Approximate nearest neighbor search in Elasticsearch

Elasticsearch 8.0 supports efficient approximate nearest neighbor (ANN) search, which can solve our problem orders of magnitude faster than any linear kNN scan. Elasticsearch 8.0 uses an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW) to organize vectors into graphs based on similarity. On a dataset of 10 million embedded vectors, Elasticsearch’s own benchmarks show roughly 200 queries per second with ANN on a single compute-optimized machine, versus only about 2 queries per second with exact kNN.

According to Elasticsearch’s blog post about the ANN release and its benchmark for dense vector search, the cost of this performance is about 5% of recall. This means that roughly 1 in 20 of the vectors returned is not genuinely among the closest neighbors. We applied a hybrid approach to improve this result: request k+l results instead of k and apply filtering to remove outliers. Sadly, there are no pagination capabilities, so this approach will only work in some use cases.
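For illustration only, here is a rough sketch of what an HNSW-backed index and a k+l query could look like with the Python Elasticsearch client. The index name, field names, dimensionality, candidate counts, and score threshold are placeholders, and both the dense_vector dimension limits and the exact kNN search syntax vary across 8.x releases:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# dense_vector field indexed with HNSW; dims must match the embedding model used
es.indices.create(
    index="movies",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 1536,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

def knn_search(query_vector, k=3, extra=7, min_score=0.85):
    # Ask for k + extra approximate neighbors, then drop weak matches client-side
    response = es.search(
        index="movies",
        knn={
            "field": "embedding",
            "query_vector": query_vector,
            "k": k + extra,
            "num_candidates": 100,
        },
    )
    hits = response["hits"]["hits"]
    return [hit for hit in hits if hit["_score"] >= min_score][:k]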

As you have hopefully seen, GPT-3 embeddings are not the perfect solution for any and all search problems, given the indexing complexity, the cost, and the high computational cost of the search operation, even in its approximate form. Nevertheless, GPT-3 embeddings remain an excellent choice for anyone looking for a powerful backbone for a search solution with infrequent querying and modest indexing requirements.

Moreover, it is worth adding that Microsoft recently announced that the Bing search engine now uses an upgraded version of GPT-3.5, called ‘Prometheus’, developed initially for search. According to the announcement, the new Prometheus language model allows Bing to increase relevance, annotate snippets more accurately, provide fresher results, understand geolocation, and improve security. This may open up new possibilities for using autoregressive language models in search solutions, which we will definitely keep an eye on going forward.

About the author

Oleksandr Knyga is a Solution Architect and open-source believer with a big passion for software development.
Oleksandr lives the life of a digital nomad. When not tapping on the keyboard, he can be found hiking, snowboarding, fishing, and exploring cities. Oleksandr has an MBA and a Master’s degree in Computer Science.

Learn more about our company by following the links: WebLabTechnology, LinkedIn, DOU.
