
Using kNN and LSH to Enhance Similarity Search in Vector Embeddings (Bonus Module)

Diving into the intricacies of kNN (k-Nearest-Neighbors) and LSH (Locality Sensitive Hashing), we find ourselves at the intersection of mathematics, data science, and algorithmic strategy.

kNN (k-Nearest-Neighbors) in the Context of Vector Embeddings

If you've been exploring this domain, you might have come across mentions of kNN (k-Nearest-Neighbors). So, what exactly is kNN? Why do we need it, especially when we already have tools like cosine similarity at our disposal? Let's break it down.

  • Role of kNN with Vector Embeddings: Once we have vector representations of data and a measure of similarity like cosine similarity, the next logical step is to leverage these vectors for practical applications. kNN is a powerful algorithm that does just that: given a query vector, it identifies the 'k' vectors in the dataset that are closest (most similar) to the query. A minimal brute-force sketch of this workflow follows this list.

  • Why kNN?: kNN hinges on the premise that similar data points in a vector space are likely to share attributes or classifications. For instance, in a text classification problem, if a majority of your 'k' nearest vectors correspond to articles about 'F1 Racing', it's likely that the query article is also related to F1 Racing.

  • The kNN approach owes much of its popularity to its straightforward nature. The basic version is not only easy to implement but also offers high accuracy. Unlike many alternative methods, kNN isn't a black box: it offers explainability, meaning its decision-making process is clear and understandable. This transparency builds user trust in the system, which in turn eases adoption within enterprises.
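
To make the idea concrete, here is a minimal brute-force kNN sketch over vector embeddings using cosine similarity. It is illustrative only: the names (`embeddings`, `query`, `knn`) and the toy data are assumptions, not part of any particular library.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    # Cosine similarity between one query vector and every row of `embeddings`.
    return (embeddings @ query) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    )

def knn(query: np.ndarray, embeddings: np.ndarray, k: int = 3) -> np.ndarray:
    # Indices of the k stored vectors most similar to the query.
    sims = cosine_similarity(query, embeddings)
    return np.argsort(-sims)[:k]

# Toy example: 5 "documents" embedded in 4 dimensions.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(5, 4))
query = rng.normal(size=4)
print(knn(query, embeddings, k=3))  # e.g. three indices, most similar first
```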

Challenges with kNN

  • Computational Intensity: The brute-force approach of kNN, where every query vector is compared with all vectors in the dataset, is computationally expensive.

  • For example, developers using Pathway often work with large datasets of high-dimensional data, and on such datasets this naive approach doesn't scale. As the dataset grows, the time taken for these pairwise comparisons becomes a bottleneck: the standard kNN algorithm compares each query vector to every vector in the dataset, which is computationally demanding. Its time complexity is $O(d \, n_t \, n_q)$, where $d$ is the number of dimensions and $n_t$ and $n_q$ are the numbers of training and query points, respectively. This becomes very inefficient with large datasets; the back-of-the-envelope calculation after this list makes the scale concrete.

  • Managing frequent updates is also costly and complex in scenarios where data changes regularly. Every new data point requires recalculating distances for all queries, which consumes significant resources and is a notable challenge for real-time or frequently updated data. Likewise, modifying or deleting data points forces all query responses to be reevaluated, adding to the complexity.
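
To get a feel for the numbers, here is a back-of-the-envelope calculation of the brute-force cost. The figures (embedding dimension, dataset size, query count) are illustrative assumptions, not taken from any benchmark.

```python
# Rough cost of brute-force kNN: O(d * n_t * n_q) multiply-adds.
d, n_t, n_q = 768, 1_000_000, 1_000   # dimensions, stored vectors, queries
ops = d * n_t * n_q
print(f"{ops:.2e} multiply-adds")      # 7.68e+11, repeated whenever the data changes
```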

To overcome these limitations, Pathway takes a more scalable approach.

Leveraging LSH to Optimize kNN

  • What is LSH?: LSH (Locality Sensitive Hashing) is a hashing technique in which similar data points are probabilistically mapped to the same bucket. Unlike hash functions used for security, which scatter similar items widely to prevent predictability, LSH deliberately clusters related items together to streamline similarity searches.

  • How LSH Complements kNN: LSH enhances kNN's effectiveness by grouping vectors into the same or adjacent buckets based on similarity, using hash functions of the form $h_{\mathbf{v},b,A}(\mathbf{p}) = \left\lfloor \frac{\mathbf{p} \cdot \mathbf{v} + b}{A} \right\rfloor$, where $\mathbf{v}$ is a randomly chosen projection vector, $b$ is a random bias used to offset the projection, and $A$ is the bucket width. This efficiently narrows down the pool of vectors kNN needs to analyze, boosting the overall speed and performance of the search. A toy bucketing sketch follows this list.
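
Below is a toy sketch of that random-projection bucket function and how it shrinks the candidate set for kNN. It is a minimal illustration of the formula above, not Pathway's implementation; names such as `A` and `lsh_bucket`, and the single-hash setup, are assumptions (real systems combine many hash functions and tables).

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d = 4                      # embedding dimension
A = 0.5                    # bucket width
v = rng.normal(size=d)     # random projection vector
b = rng.uniform(0, A)      # random bias / offset

def lsh_bucket(p: np.ndarray) -> int:
    # h(p) = floor((p . v + b) / A): nearby points tend to share a bucket.
    return int(np.floor((p @ v + b) / A))

# Index a small set of embeddings by bucket.
embeddings = rng.normal(size=(10, d))
buckets = defaultdict(list)
for idx, p in enumerate(embeddings):
    buckets[lsh_bucket(p)].append(idx)

# At query time, kNN only scans the query's bucket instead of all 10 vectors.
query = rng.normal(size=d)
candidates = buckets[lsh_bucket(query)]
print(candidates)
```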

If you are keen on diving deeper into the intricacies of how kNN and LSH work together, especially in the context of large-scale datasets, we recommend checking out the resource linked below by Olivier Ruas on kNN + LSH. It provides a more detailed exploration, complete with visual aids and practical examples.

Building on similarity search, vector embeddings, and RAG, the next piece offers insights into managing vector indexes in dynamic settings, like an e-commerce platform with constantly changing product data.

Incremental Indexing in LLM App

It delves into LSM (Log-Structured Merge-tree) indexes and how the indexing approach in Pathway's RAG framework, LLM App, adapts to streaming (live) data, balancing computational efficiency with user needs. The article includes practical scenarios, such as table joins and real-time alerts in Pathway, to deepen your understanding of indexing in fluid data environments.

Linked resources:
  • LLM App
  • Realtime Classification with Nearest Neighbors (2/2) | Pathway
  • Indexes in Pathway