DistiLLM 03
Here I am, again during my quick lunch break, bringing the third installment in our series where I curate some NLP topics/blogs/papers. I must confess, the task of keeping up with the fast-paced world of NLP while juggling my own schedule has been overwhelming lately. In fact, as I type this, there’s a pile of travel bags eyeing me, begging to be packed for my upcoming trip. So, I find myself wondering if, perhaps, a good old “copy and paste” might be the way to go, just for this month.
What happened in April 2024?
- LLM Task-Specific Evals that Do & Don’t Work
- April 2nd.
- April 7th.
- April 9th.
- Mistral tweeted a magnet link for mixtral-8x22b
- CodeGemma is a collection of powerful, lightweight models that can perform a variety of coding tasks like fill-in-the-middle code completion, code generation, natural language understanding, mathematical reasoning, and instruction following.
- April 12th.
- Andrej Karpathy’s tweet: twitter.com/karpathy/status/1778841713605525889
- Discusses the complexity and overhead involved in deep learning frameworks like PyTorch.
- Highlights the significant startup latency these frameworks incur.
- Suggests that as LLMs improve, they could act as compilers to generate highly optimized low-level code.
- Debunking Devin: “First AI Software Engineer” Upwork Lie Exposed [video] | Hacker News
- April 16th.
- April 17th.
- April 18th.
- April 19th.
- April 23rd.
- April 24th.
- April 25th.
- April 29th.
- Effort Engine
- Compare LLM API Pricing Instantly - Get the Best Deals at LLM Price Check
Things that didn’t happen in April, or aren’t about LLMs, but are still worth sharing:
- Framework for Easy Statistical Modeling, Visualization, and Reporting • easystats
- Astral: Next-gen Python tooling
- GitHub - astral-sh/ruff: An extremely fast Python linter and code formatter, written in Rust.
- GitHub - astral-sh/uv: An extremely fast Python package installer and resolver, written in Rust.
- Is UV the FUTURE of Python PACKAGING? 🐍📦 - YouTube. A pretty solid video, not only on the obvious new kid in town, but also on the abomination that is Python packaging.
- Gall’s Law: A complex system that works is invariably found to have evolved from a simple system that worked.
- At least we can all agree: Python packaging sucks.
- Resolving dependencies is tough.
- Virtual environments first: you cannot even install packages into the global Python with uv. ᕦ(òᴥó)ᕥ
- Not a one-stop solution yet, but I look forward to its future development.
- Chip Huyen’s ML interview book.
- Understanding Deep Learning
- VASA-1 - Microsoft Research. Paper: [2404.10667] VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
- The Illustrated Word2vec – Jay Alammar – Visualizing machine learning one concept at a time.
- The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.
- A Visual Guide to Vision Transformers | MDTURP
- This is a teenager. Let’s track hundreds of teens into adulthood using this huge dataset. - YouTube
- [2402.12354] LoRA+: Efficient Low Rank Adaptation of Large Models
My notes on hallucination
I’ve been studying the previously mentioned Representation Engineering, and recently spent some time on hallucination.
The current status quo of hallucination spotting is empirical: once you see it, you call it a hallucination. As time goes by, you may form an overall impression of how often a model hallucinates, even though you are not sure whether your prompts are controlled. Or, in a different scenario, if a prompt is altered, will the model hallucinate as before, or totally differently?
We don’t have a universal answer to these.
Plus, what is the ultimate goal of hallucination evaluation? Just to say one model is superior to the others? Is it possible that model A hallucinates in area X, while model B hallucinates more in area Y?
Stumbled upon this leaderboard (plus an associated model) from Vectara:
- Vectara’s Hughes Hallucination Evaluation Model (HHEM) leaderboard on HuggingFace.
- Methodology explained in a blog post Cut the Bull…. Detecting Hallucinations in Large Language Models (RIP, Simon.)
- Vectara trained a model to detect hallucinations in LLM outputs, using open-source datasets from factual-consistency research on summarization models. They fed the same prompt to multiple SOTA LLMs at temperature 0, asking each to summarize facts presented in open-source documents (the CNN/Daily Mail corpus).
- Determining hallucinations is impossible to do for any ad hoc question since it’s not known precisely what data every LLM is trained on. In addition, having a model that can determine whether any response was hallucinated without a reference source requires solving the hallucination problem and presumably training a model as large or larger than these evaluated LLMs.
- “Arguably the best approach for reducing hallucinations in LLM responses is to ground the responses in an existing knowledge source…”
- “Thus if we can measure how accurate an LLM is at summarizing data, i.e., acting as a reader model, we can estimate how accurate these systems are when provided with accurate search results.”
- vectara/hallucination_evaluation_model · Hugging Face (a minimal usage sketch follows at the end of these notes).
- When evaluating, consider accuracy, hallucination rate, average summary length, and answer rate.
The summarization prompt used on the leaderboard:
> You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question ‘Provide a concise summary of the following passage, covering the core pieces of information described.’ <PASSAGE>’
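To make the workflow concrete, here is a minimal sketch of how one might score (source, summary) pairs with the HHEM model and turn the scores into a hallucination rate. It assumes the checkpoint loads as a sentence-transformers CrossEncoder that returns a factual-consistency score in [0, 1] per pair, as its model card described; the 0.5 cutoff and the toy passages are my own illustration, not Vectara’s methodology.

```python
# Minimal sketch (assumption: the HHEM checkpoint loads as a
# sentence-transformers CrossEncoder and returns a factual-consistency
# score in [0, 1] per (source, summary) pair; the cutoff is illustrative).
from sentence_transformers import CrossEncoder

# Load Vectara's hallucination evaluation model from the Hugging Face Hub.
model = CrossEncoder("vectara/hallucination_evaluation_model")

# (source passage, model-generated summary) pairs -- toy examples,
# not the CNN/Daily Mail documents used for the leaderboard.
pairs = [
    ("The capital of France is Paris. The city hosted the Summer Olympics in 1900 and 1924.",
     "Paris, the capital of France, has hosted the Summer Olympics twice."),
    ("The capital of France is Paris. The city hosted the Summer Olympics in 1900 and 1924.",
     "Paris has hosted the Summer Olympics three times, most recently in 1988."),
]

# Higher score = more factually consistent with the source (less hallucinated).
scores = model.predict(pairs)

# Call a summary hallucinated if its consistency score falls below the cutoff.
CUTOFF = 0.5
flags = [score < CUTOFF for score in scores]
hallucination_rate = sum(flags) / len(flags)

for (source, summary), score in zip(pairs, scores):
    print(f"score={score:.3f}  summary={summary!r}")
print(f"Hallucination rate over {len(pairs)} summaries: {hallucination_rate:.0%}")
```

On the actual leaderboard this per-summary decision is aggregated over many documents per model, and reported next to the answer rate and average summary length mentioned above.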