DistiLLM 03
Here I am, again during my quick lunch break, bringing the third installment in our series where I curate some NLP topics/blogs/papers. I must confess, the task of keeping up with the fast-paced world of NLP while juggling my own schedule has been overwhelming lately. In fact, as I type this, there’s a pile of travel bags eyeing me, begging to be packed for my upcoming trip. So, I find myself wondering if, perhaps, a good old “copy and paste” might be the way to go, just for this month.
What happened in April 2024?
- LLM Task-Specific Evals that Do & Don’t Work
- April 2nd.
- April 7th.
- April 9th.
- Mistral tweeted a magnet link for mixtral-8x22b
- CodeGemma is a collection of powerful, lightweight models that can perform a variety of coding tasks like fill-in-the-middle code completion, code generation, natural language understanding, mathematical reasoning, and instruction following.
- April 12th.
- Andrej Karpathy’s tweet: twitter.com/karpathy/status/1778841713605525889
- Discusses the complexity and overhead involved in deep learning frameworks like PyTorch.
- Highlights the significant startup latency these frameworks incur.
- Suggests that as LLMs improve, they could act as compilers to generate highly optimized low-level code.
- Debunking Devin: “First AI Software Engineer” Upwork Lie Exposed [video] | Hacker News
- April 16th.
- April 17th.
- April 18th.
- April 19th.
- April 23rd.
- April 24th.
- April 25th.
- April 29th.
- Effort Engine
- Compare LLM API Pricing Instantly - Get the Best Deals at LLM Price Check
Things that didn’t happen in April, or aren’t about LLMs, but are still worth sharing:
- Framework for Easy Statistical Modeling, Visualization, and Reporting • easystats
- Astral: Next-gen Python tooling
- GitHub - astral-sh/ruff: An extremely fast Python linter and code formatter, written in Rust.
- GitHub - astral-sh/uv: An extremely fast Python package installer and resolver, written in Rust.
- Is UV the FUTURE of Python PACKAGING? 🐍📦 - YouTube. A pretty solid video, not only on the obvious new kid in town, but also on the abomination that is Python packaging.
- Gall’s Law: A complex system that works is invariably found to have evolved from a simple system that worked.
- At least we can all agree: Python packaging sucks.
- Resolving dependencies is tough.
- Virtual environments first: you cannot even install packages into the global Python with uv. ᕦ(òᴥó)ᕥ
- Not a one-stop solution yet, but I look forward to its future development.
- Chip Huyen’s ML interview book.
- Understanding Deep Learning
- VASA-1 - Microsoft Research. Paper: [2404.10667] VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
- The Illustrated Word2vec – Jay Alammar – Visualizing machine learning one concept at a time.
- The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.
- A Visual Guide to Vision Transformers | MDTURP
- This is a teenager. Let’s track hundreds of teens into adulthood using this huge dataset. - YouTube
- [2402.12354] LoRA+: Efficient Low Rank Adaptation of Large Models
My notes on hallucination
I’ve been studying the previously mentioned Representation Engineering, and recently spent some time on hallucination.
The current status quo of hallucination spotting is empirical: once you see it, you call it a hallucination. As time goes by, you may form an overall impression of how often a model hallucinates, even though you are not sure whether your prompts are controlled. Or, in a different scenario, if a prompt is altered, will the model hallucinate as before, or totally differently?
We don’t have a universal answer to these.
Plus, what is the ultimate goal of hallucination evaluation? Just to say one model is superior to the others? Is it possible that model A hallucinates in area X, while model B hallucinates more in area Y?
Stumbled upon this leaderboard (plus an associated model) from Vectara:
- Vectara’s Hughes Hallucination Evaluation Model (HHEM) leaderboard on HuggingFace.
- Methodology explained in a blog post Cut the Bull…. Detecting Hallucinations in Large Language Models (RIP, Simon.)
- Vectara trained a model to detect hallucinations in LLM outputs, using open-source datasets from factual-consistency research on summarization models. They fed the same prompt to multiple SOTA LLMs at temperature 0, asking each to summarize facts presented in open-source documents (the CNN/Daily Mail corpus).
- Determining hallucinations is impossible to do for any ad hoc question since it’s not known precisely what data every LLM is trained on. In addition, having a model that can determine whether any response was hallucinated without a reference source requires solving the hallucination problem and presumably training a model as large or larger than these evaluated LLMs.
- “Arguably the best approach for reducing hallucinations in LLM responses is to ground the responses in an existing knowledge source…”
- “Thus if we can measure how accurate an LLM is at summarizing data, i.e., acting as a reader model, we can estimate how accurate these systems are when provided with accurate search results.”
- vectara/hallucination_evaluation_model · Hugging Face (a minimal usage sketch follows at the end of these notes).
- When evaluating, consider accuracy, hallucination rate, average summary length, and answer rate.
The summarization prompt used on the leaderboard:
> You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question ‘Provide a concise summary of the following passage, covering the core pieces of information described.’ <PASSAGE>’
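To make the workflow concrete, here is a minimal sketch of how one might score (source, summary) pairs with the HHEM model and turn the scores into a hallucination rate. It assumes the checkpoint loads as a sentence-transformers CrossEncoder that returns a factual-consistency score in [0, 1] per pair, as its model card described; the 0.5 cutoff and the toy passages are my own illustration, not Vectara’s methodology.

```python
# Minimal sketch (assumption: the HHEM checkpoint loads as a
# sentence-transformers CrossEncoder and returns a factual-consistency
# score in [0, 1] per (source, summary) pair; the cutoff is illustrative).
from sentence_transformers import CrossEncoder

# Load Vectara's hallucination evaluation model from the Hugging Face Hub.
model = CrossEncoder("vectara/hallucination_evaluation_model")

# (source passage, model-generated summary) pairs -- toy examples,
# not the CNN/Daily Mail documents used for the leaderboard.
pairs = [
    ("The capital of France is Paris. The city hosted the Summer Olympics in 1900 and 1924.",
     "Paris, the capital of France, has hosted the Summer Olympics twice."),
    ("The capital of France is Paris. The city hosted the Summer Olympics in 1900 and 1924.",
     "Paris has hosted the Summer Olympics three times, most recently in 1988."),
]

# Higher score = more factually consistent with the source (less hallucinated).
scores = model.predict(pairs)

# Call a summary hallucinated if its consistency score falls below the cutoff.
CUTOFF = 0.5
flags = [score < CUTOFF for score in scores]
hallucination_rate = sum(flags) / len(flags)

for (source, summary), score in zip(pairs, scores):
    print(f"score={score:.3f}  summary={summary!r}")
print(f"Hallucination rate over {len(pairs)} summaries: {hallucination_rate:.0%}")
```

On the actual leaderboard this per-summary decision is aggregated over many documents per model, and reported next to the answer rate and average summary length mentioned above.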