Coding Collie

Embedding

Author: Kai Kang, Staff Software Engineer @ Meta · Solo App Builder

I created this embedding-based game :)

What are Embeddings, and why?

Embedding Examples

The language of modern neural networks is arrays of numbers, also called vectors or tensors. An embedding vector is just a list of numbers. Think <0.21, 0.34, ...>.

They have interesting traits though. Words with similar meanings map to similar vectors. "Cat" and "dog" are similar words, so their vectors are close to each other. Both are very different from "hamburger", so you can expect the embedding vector for "hamburger" to sit far away from them.
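One common way to measure how similar two vectors are is cosine similarity. Here is a minimal sketch using tiny 3-dimensional vectors that I made up for illustration; real embeddings have hundreds or thousands of dimensions, and the numbers below do not come from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, invented for this example.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
hamburger = [0.1, 0.2, 0.9]

cat_dog = cosine_similarity(cat, dog)
cat_hamburger = cosine_similarity(cat, hamburger)

print(f"cat vs dog:       {cat_dog:.3f}")
print(f"cat vs hamburger: {cat_hamburger:.3f}")
```

With these toy numbers, "cat" scores much higher against "dog" than against "hamburger", which is the behavior you would hope to see from real embeddings.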

We need them because, just as English, Chinese, and Japanese are our languages, vectors are the language of neural networks, and therefore of LLMs.

LLMs can do math on these embedding vectors, and this is the core reason they can output human-language words while only understanding tensors / vectors. The most classic example is king - man + woman ≈ queen.
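The analogy arithmetic can be sketched in a few lines. The vectors here are hypothetical and hand-picked so that the dimensions loosely read as (royalty, male, female); a trained model does not give you such tidy axes. As is common in analogy demos, the three input words are excluded from the candidates.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical toy embeddings, invented for this example.
vocab = {
    "king":      [0.90, 0.80, 0.10],
    "man":       [0.10, 0.80, 0.10],
    "woman":     [0.10, 0.10, 0.80],
    "queen":     [0.85, 0.15, 0.80],
    "hamburger": [-0.50, 0.30, -0.40],
}

# target = king - man + woman, computed element by element
target = [k - m + w for k, m, w in zip(vocab["king"], vocab["man"], vocab["woman"])]

# Pick the closest remaining word, skipping the words used in the arithmetic.
candidates = {w: v for w, v in vocab.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

print(best)
```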

ChatGPT response to a Chinese celebrity example


Embeddings into LLM, how?

Embedding into LLM

When a word like "cat" is fed into an LLM, it first gets "tokenized", a concept we will talk about in another post. The word becomes a token. (The mapping is not 1-to-1: a single word can become several tokens. For simplicity, we treat it as one token here.) The token is then translated into an embedding vector via a lookup table. This embedding lookup table is learned during LLM training.

initial_embedding = embedding_table[token]
contextual_embedding = ...
next_token = ...

More on this in an LLM note.
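The lookup step above can be sketched with plain Python lists. Everything here is a simplifying assumption: the vocabulary, the whole-word "tokenizer", and the table values are made up, whereas a real LLM uses a subword tokenizer and a learned table with tens of thousands of rows.

```python
# Hypothetical toy vocabulary: word -> token id.
vocab = {"the": 0, "cat": 1, "dog": 2}

# Hypothetical embedding table: row i is the vector for token id i.
embedding_table = [
    [0.1, 0.0, 0.3, 0.2],  # "the"
    [0.9, 0.8, 0.1, 0.4],  # "cat"
    [0.8, 0.9, 0.2, 0.3],  # "dog"
]

def embed(words):
    """Map each word to a token id, then look up that id's row in the table."""
    token_ids = [vocab[word] for word in words]
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed(["the", "cat"])
print(vectors)
```

The lookup itself is just indexing; all of the "knowledge" lives in the values of the table.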

How to get pre-trained embeddings

[1] OpenAI offers an online API:

from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The dog is playing in the park."
)

embedding = response.data[0].embedding  # a plain Python list of floats
print(len(embedding))  # number of dimensions in the vector

[2] The Sentence Transformers library is a standard Python solution that runs locally:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = [
    "A dog is running outside.",
    "A puppy is playing in the park.",
    "I want to eat a hamburger."
]

embeddings = model.encode(texts)

print(embeddings.shape)  # (number of texts, embedding dimension)

[3] Hugging Face Transformers gives direct access to the model, which allows more flexibility:

from transformers import AutoTokenizer, AutoModel
import torch

model_name = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "A dog is running outside."

inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

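token_embeddings = outputs.last_hidden_state  # shape: (batch, tokens, hidden)

# Mean-pool over real tokens only, so padding does not skew batched inputs
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)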

Applications

  • Search (RAG)
    • RAG: embedding vector -> vector DB
  • Recommendation Systems
    • User Embedding, Item Embedding
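To make the search use case concrete, here is a minimal sketch of the retrieval step: rank stored vectors by cosine similarity to a query vector and keep the top k. The dictionary stands in for a vector DB, and all vectors are hypothetical values I made up; in practice they would come from an embedding model, and a real vector DB would use an approximate index instead of a full scan.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical in-memory "vector DB": text -> toy embedding.
index = {
    "A dog is running outside.":       [0.8, 0.9, 0.2],
    "A puppy is playing in the park.": [0.7, 0.9, 0.3],
    "I want to eat a hamburger.":      [0.1, 0.2, 0.9],
}

def search(query_vector, index, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical embedding of a dog-related query.
query = [0.9, 0.8, 0.1]
results = search(query, index, k=2)
print(results)
```

A recommendation system can reuse the same ranking idea, with a user embedding as the query and item embeddings as the index.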

An interesting paper on Food Embeddings shows that each food ingredient can have an embedding. An embedding space is not objective: it encodes whichever relationship you choose to train or retrieve on (for example, ingredients that are cooked together versus ingredients that substitute for each other).
