Embedding

What are Embeddings, and why?

Embedding Examples

The language of modern neural networks is arrays of numbers, also called vectors or tensors. Embedding vectors are just "lists of numbers". Think $<0.21, 0.34, ...>$ .

They have interesting traits, though. Words with similar meanings map to similar vectors. "Cat" and "dog" are similar words, so their vectors are close to each other. Both are very different from "hamburger", so you can expect the embedding vector for "hamburger" to be far from them.
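"Similar vectors" is usually measured with cosine similarity. Below is a minimal sketch of that idea. The three vectors are made up for illustration and do not come from any real model. Real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, for illustration only.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
hamburger = [0.1, 0.2, 0.9]

cat_dog = cosine_similarity(cat, dog)
cat_hamburger = cosine_similarity(cat, hamburger)

print(f"cat vs dog:       {cat_dog:.2f}")
print(f"cat vs hamburger: {cat_hamburger:.2f}")
```

With these toy numbers, "cat" scores much closer to "dog" than to "hamburger", which is the behavior you would hope to see from real embeddings too.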

The reason we need them is that, just as English, Chinese, Japanese, ... are our languages, vectors are the language of neural networks, and thus of LLMs.

LLMs can do math on these embedding vectors, and this is the core reason they can output human-language words while only understanding tensors / vectors. The most classic example, from word embeddings, is $king - man + woman \approx queen$ .
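The analogy above can be sketched with a tiny hand-made vocabulary. The 2-dimensional vectors below are invented so that the arithmetic works out cleanly (roughly "royalty" and "gender" axes). Real embeddings only approximate this behavior. Excluding the query words from the candidates is a common convention when evaluating such analogies.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-made toy vectors: [royalty, gender]. Purely illustrative.
vocab = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.1,  1.0],
    "woman":  [0.1, -1.0],
    "prince": [0.8,  0.9],
}

# king - man + woman, computed element by element.
target = [k - m + w for k, m, w in zip(vocab["king"], vocab["man"], vocab["woman"])]

# Nearest remaining word by cosine similarity, skipping the three query words.
candidates = {w: v for w, v in vocab.items() if w not in ("king", "man", "woman")}
result = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

print(result)
```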

ChatGPT response to a Chinese celebrity example

Embeddings into LLM, how?

Embedding into LLM

When a word like "cat" is fed into an LLM, it first gets "tokenized", a concept we will talk about in another post. The word becomes a token (the mapping is not one-to-one: a single word can be split into several tokens, but for simplicity we assume one token per word). The token is then translated into an embedding vector via a lookup table. This embedding lookup table is learned during LLM training.

initial_embedding = embedding_table[token]
contextual_embedding = ...
next_token = ...
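The lookup step above can be sketched as plain row indexing. Everything here is hypothetical: the three-word vocabulary, the `token_to_id` mapping, and the table values are made up, and a real model would hold the table as a large learned matrix (for example in PyTorch) rather than a Python list.

```python
# Hypothetical toy vocabulary: token string -> row index in the table.
token_to_id = {"cat": 0, "dog": 1, "hamburger": 2}

# Hypothetical embedding table: one row of made-up numbers per token.
embedding_table = [
    [0.9, 0.8, 0.1],  # row 0: "cat"
    [0.8, 0.9, 0.2],  # row 1: "dog"
    [0.1, 0.2, 0.9],  # row 2: "hamburger"
]

def embed(tokens):
    """Translate each token into its initial embedding by table lookup."""
    return [embedding_table[token_to_id[t]] for t in tokens]

initial_embeddings = embed(["cat", "hamburger"])
print(initial_embeddings)
```

The lookup itself involves no computation beyond indexing. All the "meaning" lives in the numbers that training put into the table.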

How to get pre-trained embeddings

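[1] OpenAI has an online API:

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The dog is playing in the park."
)

# One embedding per input; here there is a single input string.
embedding = response.data[0].embedding
print(len(embedding))  # number of dimensions (1536 by default for this model)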

[2] The Sentence Transformers library is a standard Python solution that runs offline (after the model has been downloaded once)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = [
    "A dog is running outside.",
    "A puppy is playing in the park.",
    "I want to eat a hamburger."
]

embeddings = model.encode(texts)  # NumPy array, one row per input text

print(embeddings.shape)  # (3, 384): 3 texts, 384 dimensions for this model

[3] Hugging Face Transformers gives direct access to the model, which allows more flexibility

from transformers import AutoTokenizer, AutoModel
import torch

model_name = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "A dog is running outside."

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

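token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over real tokens only. The attention mask zeroes out padding,
# which matters once you encode a batch of texts with different lengths.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)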

Applications

Search (RAG)
- RAG: embed the documents, store the vectors in a vector DB, then embed the query and retrieve the nearest documents
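The retrieval half of RAG can be sketched as a nearest-neighbor search. This is a minimal sketch under loud assumptions: a Python list stands in for the vector DB, the vectors are made up rather than produced by a model, and `search` is a hypothetical helper, not any library's API.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Stand-in for a vector DB: (text, embedding) pairs with made-up vectors.
documents = [
    ("A dog is running outside.",       [0.8, 0.9, 0.2]),
    ("A puppy is playing in the park.", [0.7, 0.9, 0.3]),
    ("I want to eat a hamburger.",      [0.1, 0.2, 0.9]),
]

def search(query_embedding, k=2):
    """Return the k document texts most similar to the query embedding."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Pretend this vector is the embedding of a question about a dog.
query_embedding = [0.9, 0.8, 0.1]
results = search(query_embedding)
print(results)
```

In a real pipeline the retrieved texts would then be placed into the LLM prompt as context. A real vector DB replaces the linear scan with an approximate nearest-neighbor index.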

Recommendation Systems
- User embeddings and item embeddings: recommend the items whose vectors best match the user's vector
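A common way to use user and item embeddings is to score each item by its dot product with the user vector and recommend the top scorers. The sketch below assumes that scoring rule. The users, items, and 2-dimensional vectors (roughly "likes action" and "likes romance") are all invented for illustration.

```python
# Made-up embeddings: [likes_action, likes_romance].
user_embeddings = {
    "alice": [0.9, 0.1],
    "bob":   [0.2, 0.8],
}
item_embeddings = {
    "action_movie":          [1.0, 0.0],
    "romance_movie":         [0.0, 1.0],
    "romcom_with_car_chase": [0.5, 0.6],
}

def recommend(user, k=2):
    """Rank items by dot product with the user's embedding and keep the top k."""
    u = user_embeddings[user]
    scores = {
        item: sum(x * y for x, y in zip(u, v))
        for item, v in item_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

alice_recs = recommend("alice")
bob_recs = recommend("bob")
print(alice_recs)
print(bob_recs)
```

In practice both sets of vectors are learned from interaction data (clicks, purchases, ratings) so that a high score predicts a likely match.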

An interesting paper on Food Embeddings shows that each food ingredient can have an embedding. An embedding space is not objective: it encodes the relationship you choose to train on or retrieve by (for example, ingredients that are cooked together versus ingredients that substitute for each other).

More on how embeddings are used inside an LLM in a separate LLM note.
