
What vector databases are, how they work, and their potential market

A vector database is a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes. 

These vectors are usually generated by applying some sort of embedding function to raw data, such as text, images, audio, video, and others.

Vector databases can be defined as tools that index and store vector embeddings for fast retrieval and similarity search, with features such as metadata filtering and horizontal scaling.


Growing Investor Interest

In recent weeks, there has been an increase in investor interest in vector databases, a trend that has been building since the beginning of 2023.

Let's see in more detail what vector databases are.

Vectors as data representation

Vector databases rely heavily on vector embeddings, a type of data representation that carries the semantic information AI systems need to gain understanding and to maintain a long-term memory they can draw on when executing complex tasks.

Vector embeddings

Vector embeddings are like a map, but instead of showing us where things are in the world, they show us where things are in something called vector space. Vector space is a kind of big playground where everything has its own place.

Imagine a group of animals: a cat, a dog, a bird, and a fish. We can create a vector embedding for each one by giving it a particular position on the playground. The cat may be in one corner and the dog on the other side; the bird could be in the sky and the fish in the pond. This playground is a multidimensional space, and each dimension corresponds to a different aspect of the animals: fish have fins, birds have wings, cats and dogs have legs; fish live in water, birds mostly in the sky, and cats and dogs on the ground. Once we have these vectors, we can use mathematical techniques to group them based on their similarity: based on the information captured in the embeddings, the cat and the dog end up close together, while the fish sits far away from both.

So, vector embeddings are like a map that helps us find similarity between things in vector space. Just as a map helps us navigate the world, vector embeddings help us navigate the vector playground.

The key idea is that embeddings that are semantically similar to each other have a smaller distance between them. To measure how similar they are, we can use vector distance functions such as Euclidean distance or cosine distance.
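
To make this concrete, here is a minimal sketch of both distance functions using NumPy. The three-dimensional "animal" vectors are invented purely for illustration; real embeddings typically have hundreds or thousands of dimensions.

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two embeddings."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite ones."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings (purely illustrative values)
cat  = np.array([0.9, 0.1, 0.0])   # legs, lives on the ground
dog  = np.array([0.8, 0.2, 0.0])
fish = np.array([0.0, 0.1, 0.9])   # fins, lives in water

print(euclidean_distance(cat, dog))   # small: cat and dog are close
print(euclidean_distance(cat, fish))  # large: cat and fish are far apart
print(cosine_distance(cat, dog))      # close to 0
```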

Vector databases vs vector libraries

Vector libraries store vector embeddings in in-memory indexes in order to perform similarity searches. They have the following characteristics and limitations:

  1. Store vectors only: vector libraries store only the vector embeddings, not the associated objects from which they were generated. This means that when we run a query, a vector library responds with the relevant vectors and object IDs. This is limiting, since the actual information lives in the object, not in the ID. To work around this, we have to keep the objects in secondary storage and then match the IDs returned by the query back to the objects to interpret the results.
  2. Index data is immutable: indexes produced by vector libraries are immutable. Once we have imported our data and built the index, we cannot modify it (no new inserts, deletions, or updates). To change the index, we have to rebuild it from scratch.
  3. No querying during import: most vector libraries cannot be queried while data is being imported. All data objects must be imported first, and the index is built only after the import finishes. This can be a problem for applications that need to import millions or even billions of objects.

Many vector search libraries are available: FAISS from Meta (Facebook), Annoy from Spotify, and ScaNN from Google. FAISS uses clustering, Annoy uses trees, and ScaNN uses vector compression. Each comes with its own performance trade-offs, which we can weigh based on our application and performance requirements.
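
As a minimal sketch of what working with a vector library looks like (assuming the faiss-cpu package and random vectors in place of real embeddings), note that the query returns only distances and integer IDs, which the application must map back to the original objects itself:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                                      # embedding dimensionality
vectors = np.random.random((10_000, d)).astype("float32")    # stand-in embeddings

index = faiss.IndexFlatL2(d)       # exact L2 (Euclidean) index
index.add(vectors)                 # all vectors must be imported before querying

query = np.random.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)   # 5 nearest neighbors

# FAISS returns only distances and integer IDs; mapping IDs back to the
# original objects (documents, images, ...) is up to the application.
print(ids[0], distances[0])
```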

CRUD

The main feature that distinguishes vector databases from vector libraries is the ability to store, update, and delete data. Vector databases have full CRUD support (create, read, update, and delete), which resolves the limitations of a vector library.

  1. Store vectors and objects: vector databases can store both the data objects and their vectors. Since both are stored, we can combine vector search with structured filters, ensuring that the nearest neighbors returned also match the metadata filter.
  2. Mutability: since vector databases fully support CRUD, we can easily add, remove, or update entries in the index after it has been created. This is especially useful when working with constantly changing data.
  3. Real-time search: unlike vector libraries, vector databases allow us to query and modify data during the import process. As we load millions of objects, the data already imported remains fully accessible and operational, so we don't have to wait for the import to finish before working with what is already there.

In short, a vector database provides a superior solution for handling vector embeddings by addressing the limitations of stand-alone vector indexes discussed above.
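
As a toy illustration of these capabilities (not any particular product's API; every name here is invented for this example), here is a minimal in-memory sketch of a store that keeps objects, metadata, and vectors together, supports updates and deletes without rebuilding an index, and applies a metadata filter before the similarity search:

```python
import numpy as np

class ToyVectorStore:
    """Toy in-memory store: objects + metadata + vectors, with full CRUD."""

    def __init__(self):
        self.records = {}  # id -> {"vector": ..., "object": ..., "metadata": ...}

    def upsert(self, rec_id, vector, obj, metadata):
        # Create or update: no index rebuild needed.
        self.records[rec_id] = {"vector": np.asarray(vector, dtype=float),
                                "object": obj, "metadata": metadata}

    def delete(self, rec_id):
        self.records.pop(rec_id, None)

    def query(self, vector, k=3, where=None):
        # Optional metadata filter, then brute-force cosine similarity.
        q = np.asarray(vector, dtype=float)
        candidates = [(rid, r) for rid, r in self.records.items()
                      if where is None or all(r["metadata"].get(kk) == vv
                                              for kk, vv in where.items())]
        scored = sorted(candidates,
                        key=lambda item: -np.dot(item[1]["vector"], q) /
                        (np.linalg.norm(item[1]["vector"]) * np.linalg.norm(q)))
        return [(rid, r["object"]) for rid, r in scored[:k]]

store = ToyVectorStore()
store.upsert("a", [0.9, 0.1], "cat photo", {"type": "image"})
store.upsert("b", [0.8, 0.2], "dog photo", {"type": "image"})
store.upsert("c", [0.1, 0.9], "fish essay", {"type": "text"})
print(store.query([0.85, 0.15], k=2, where={"type": "image"}))
store.delete("b")  # mutability: remove an entry without rebuilding anything
```

A real vector database replaces the brute-force loop with an ANN index and persists the data, but the combination of stored objects, metadata filtering, and mutability is the same idea.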

But what makes vector databases superior to traditional databases?

Vector databases vs traditional databases

Traditional databases are designed to store and retrieve structured data using relational models, which means they are optimized for queries based on columns and rows of data. While it is possible to store vector embeddings in traditional databases, these databases are not optimized for vector operations and cannot perform similarity searches or other complex operations on large datasets efficiently.

This is because traditional databases use indexing techniques based on simple data types, such as strings or numbers. These indexing techniques are not suitable for vector data, which has high dimensionality and requires specialized indexing techniques such as inverted indexes or spatial trees.

Also, traditional databases aren't designed to handle the large amounts of unstructured or semi-structured data often associated with vector embeddings. For example, an image or sound file can contain millions of data points, which traditional databases cannot handle efficiently.

Vector databases, on the other hand, are specifically designed to store and retrieve vector data and are optimized for similarity searches and other complex operations on large datasets. They use specialized indexing techniques and algorithms designed to work with high-dimensional data, making them much more efficient than traditional databases for storing and retrieving vector embeddings.

Now that you've read so much about vector databases, you might be wondering, how do they work? Let's take a look.

How does a vector database work?

We all know how relational databases work: they store strings, numbers, and other types of scalar data in rows and columns. On the other hand, a vector database operates on vectors, so the way it's optimized and queried is quite different.

In traditional databases, we usually query for rows whose values exactly match our query. In vector databases, we instead apply a similarity metric to find the vectors that are most similar to our query.
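
A minimal sketch of the difference, using plain Python and NumPy with invented data: an exact-match lookup either finds the value or it doesn't, while a similarity query always returns the closest vectors, ranked by distance.

```python
import numpy as np

# Exact match, the traditional-database mindset: either it's there or it isn't.
prices = {"apple": 1.2, "banana": 0.5}
print(prices.get("aple"))   # None: a near-miss key matches nothing

# Similarity search, the vector-database mindset: rank by distance to the query.
embeddings = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])  # toy vectors
labels = ["cat", "dog", "fish"]
query = np.array([0.85, 0.15])

distances = np.linalg.norm(embeddings - query, axis=1)        # Euclidean distance
for i in np.argsort(distances):
    print(labels[i], round(float(distances[i]), 3))           # closest first
```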

A vector database uses a combination of algorithms that together perform approximate nearest neighbor (ANN) search. These algorithms speed up the search through hashing, quantization, or graph-based traversal.

These algorithms are assembled into a pipeline that provides fast and accurate retrieval of a queried vector's neighbors. Since the vector database provides approximate results, the main tradeoffs we consider are between accuracy and speed. The more precise the result, the slower the query will be. However, a good system can provide ultra-fast searching with near-perfect accuracy.

  • Indexing: the vector database indexes vectors using an algorithm such as PQ (product quantization), LSH (locality-sensitive hashing), or HNSW (hierarchical navigable small world graphs). This step maps the vectors to a data structure that enables faster searching.
  • Query: the vector database compares the query vector against the indexed vectors in the dataset to find the nearest neighbors, applying the similarity metric used by that index.
  • Post-processing: in some cases, the vector database retrieves the final nearest neighbors from the dataset and post-processes them before returning results. This step may include re-ranking the nearest neighbors using a different similarity measure.
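
As a minimal sketch of these three steps (assuming the hnswlib package and random vectors in place of real embeddings), note how the ef parameter exposes the accuracy/speed trade-off described above:

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, num_elements = 128, 10_000
data = np.random.random((num_elements, dim)).astype("float32")  # stand-in embeddings

# Indexing: build an HNSW graph over the vectors.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# Query: higher ef means more accurate but slower search.
index.set_ef(50)
query = np.random.random((1, dim)).astype("float32")
ids, distances = index.knn_query(query, k=5)

# Post-processing (optional): re-rank the candidates with exact Euclidean distance.
candidates = data[ids[0]]
exact = np.linalg.norm(candidates - query, axis=1)
print(ids[0][np.argsort(exact)])
```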

Benefits

Vector databases are a powerful tool for similarity searches and other complex operations on large datasets that cannot be performed effectively using traditional databases. Embeddings are essential to a functional vector database, as they capture the semantic meaning of the data and enable accurate similarity searches. Unlike vector libraries, vector databases can adapt to our use case, making them ideal for applications where performance and scalability are critical. With the rise of machine learning and artificial intelligence, vector databases are becoming increasingly important for a wide range of applications, including recommender systems, image search, semantic similarity, and more. As the field continues to evolve, we can expect to see even more innovative applications of vector databases in the future.

Ercole Palmeri
