Vectors as information storage


VECTOR DATABASES

5/3/2023 · 2 min read

The tech world is witnessing a surge in the popularity of vector databases, with significant investments being made in new startups in this space. Last month, Weaviate, a vector database firm, received $16 million in Series A funding. Following this, Pinecone raised $28 million at a $700 million valuation. Chroma, an open-source project with only 1.2k GitHub stars, recently raised $18 million for its embeddings database. The attraction of these databases is palpable, but what makes them so revolutionary?

Vectors are arrays of numbers, but their charm lies in their ability to represent more complex entities like words, sentences, images, or audio files as points in a high-dimensional space known as an embedding. This is akin to grouping similar people at a party - all the jocks cluster around the TV, the girls gravitate towards the dance floor, and the programmers huddle in a corner talking about vectors. Similarly, embeddings place words with similar meanings close to one another, and can group similar features in virtually any type of data.
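To make "similar meanings sit close together" concrete, here's a minimal sketch in plain JavaScript. The three-dimensional vectors are invented for illustration - real embedding models produce hundreds or thousands of dimensions - but the distance math is the same:

```javascript
// Toy 3-dimensional "embeddings". The numbers are made up for illustration;
// real models like OpenAI's embedding API output much longer vectors.
const embeddings = {
  cat: [0.9, 0.8, 0.1],
  dog: [0.85, 0.75, 0.2],
  car: [0.1, 0.2, 0.9],
};

// Cosine similarity: close to 1 means the vectors point the same way
// (similar meaning); close to 0 means they are unrelated.
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

console.log(cosineSimilarity(embeddings.cat, embeddings.dog)); // high
console.log(cosineSimilarity(embeddings.cat, embeddings.car)); // lower
```

Because "cat" and "dog" show up in similar contexts, a real embedding model places them near each other, while "car" lands somewhere else entirely - exactly the party-clustering effect described above.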

This fascinating property of vectors and their embedding representations can be harnessed for various applications like recommendation systems, search engines, and even text generation, such as ChatGPT.

The intriguing question then becomes, where do we store these rich embeddings and how do we query them swiftly? Enter vector databases. Unlike relational databases with rows and columns, or document databases with documents and collections, vector databases consist of arrays of numbers grouped based on similarity. They can be queried with ultra-low latency, making them an ideal choice for AI-driven applications.

Relational databases like Postgres and key-value databases like Redis support this type of functionality through extensions like pgvector and Redis's vector similarity search. However, a myriad of new, native vector databases have started emerging. Weaviate and Milvus are open-source options written in Go. Pinecone is another major player, though it isn't open-source. Then there's Chroma, which operates on ClickHouse under the hood, among many other options.

To illustrate how these work, consider a JavaScript code snippet using Chroma. First, the client is created and an embedding function is defined, which calls the OpenAI API to generate embeddings whenever a new data point is added. Each data point is a document with an ID and some text. When querying the database, a string of text is passed in, and the returned results include an array of distances, where a smaller number indicates a higher degree of similarity.
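The flow above can be sketched with a toy in-memory store. This is not the real Chroma client - that requires a running server and an OpenAI key - and the letter-frequency "embedding function" is a deliberately crude stand-in for a real model, but the add/query shape mirrors the one described:

```javascript
// Crude stand-in for an embedding function: map text to a 26-dimensional
// vector of letter frequencies. A real setup would call OpenAI's API here.
function embed(text) {
  const vector = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) vector[i] += 1;
  }
  return vector;
}

function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

class ToyCollection {
  constructor() { this.items = []; }

  // Each data point is a document with an ID and some text.
  add({ ids, documents }) {
    ids.forEach((id, i) =>
      this.items.push({ id, document: documents[i], embedding: embed(documents[i]) })
    );
  }

  // Embed the query text, then return the nResults closest documents.
  // A smaller distance means a higher degree of similarity.
  query({ queryText, nResults = 2 }) {
    const q = embed(queryText);
    return this.items
      .map((item) => ({ id: item.id, document: item.document, distance: euclidean(q, item.embedding) }))
      .sort((a, b) => a.distance - b.distance)
      .slice(0, nResults);
  }
}

const collection = new ToyCollection();
collection.add({
  ids: ["1", "2"],
  documents: ["vectors store meaning", "databases hold rows and columns"],
});
console.log(collection.query({ queryText: "vector meaning", nResults: 1 }));
```

A production database replaces the brute-force sort with an approximate nearest-neighbor index, which is what makes the ultra-low-latency queries possible at scale.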

What's making vector databases so enticing right now is their ability to extend LLMs with long-term memory. Typically, one starts with a general-purpose model like OpenAI's GPT-4, Meta's LLaMA, or Google's LaMDA, and then provides custom data in a vector database. When the user makes a prompt, relevant documents from the database are queried to update the context, thus customizing the final response. The same technique can retrieve historical conversations, giving the AI long-term memory.
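The retrieval step can be sketched like this. The "retrieved" documents are hard-coded here - in practice they would come back from a vector-database query like the one above - and the final string would be sent to a model such as GPT-4:

```javascript
// Splice retrieved documents into the prompt before it reaches the LLM.
// This is the core of the retrieval pattern: the model never sees your
// whole database, only the handful of documents most similar to the query.
function buildPrompt(question, retrievedDocs) {
  const context = retrievedDocs
    .map((doc, i) => `[${i + 1}] ${doc}`)
    .join("\n");
  return (
    "Answer the question using only the context below.\n\n" +
    `Context:\n${context}\n\n` +
    `Question: ${question}`
  );
}

// Pretend these came back from a similarity query against custom data.
const docs = [
  "Our refund window is 30 days from purchase.",
  "Refunds are issued to the original payment method.",
];

console.log(buildPrompt("How long do customers have to request a refund?", docs));
```

Storing each exchange back into the vector database and retrieving it on later prompts is what gives the model the appearance of long-term memory.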

Vector databases can also be integrated with tools like LangChain to combine multiple LLMs. As a result, an increasing number of top trending repos on GitHub today - like Microsoft's Jarvis, AutoGPT, and BabyAGI - are trying to create artificial general intelligence by leveraging vector databases and LLMs to prompt themselves.

With such massive potential and investment flowing into the vector database space, this technology is poised to revolutionize AI-driven applications, changing how we process, understand, and use data. As we witness the evolution of vector databases and their increasing accessibility, we're excited to see the new opportunities they will unlock in the AI world.