May 2021 Gwern.net Newsletter
links on AI hardware, diffusion models, optogenetics, brain scanning.
May 2021’s Gwern.net newsletter is now out; previous, April 2021 (archives). This is a collation of links and summary of major changes, overlapping with my Changelog; brought to you by my donors on Patreon.
Note: I will be in Denver 12–13 June 2021 for a conference.
1 Writings
Proposal: “Choose Your Own Adventure AI Dungeon”; “Decision Transformers: Preference Learning As Simple As Possible”
2 Links
2.1 AI
Hardware:
“Podracer architectures for scalable Reinforcement Learning”, Hessel et al 2021 (highly-efficient TPU pod use: eg solving Pong in <1min at 43 million FPS on a TPUv3-2048); “Google details new TPUv4 AI accelerator chips” (2.7× TPUv3 chips; up to TPUv4-4096 pods, yielding >1 ExaFLOPS; public access later in 2021)
“ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning”, Rajbhandari et al 2021 (~1 trillion parameters per 16-GPU DGX-2 node, scaling to >512 GPUs at ~40% efficiency)
“GSPMD: General and Scalable Parallelization for ML Computation Graphs”, Xu et al 2021 (Google upgrade of GPipe/GShard arch to match MS DeepSpeed: “…50%–62% compute utilization on 128–2048 Cloud TPUv3 cores for models with up to one trillion parameters”)
“DLRM: High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models”, Mudigere et al 2021 (ZionEX software/hardware platform for training extremely large embeddings—while embeddings aren’t ‘real’ parameters & things like DynamicEmbedding will never learn tricks like GPT-3 no matter how big, they present similar challenges); “RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance”, Gupta et al 2021
“From Motor Control to Team Play in Simulated Humanoid Football”, Liu et al 2021 (curriculum training of a single NN from raw humanoid control to coordinated team-wide soccer strategy; neat to compare with Hill et al 2020 in terms of agent abilities)
“Wav2vec-U: Unsupervised Speech Recognition”, Baevski et al 2021
“Anthropic” public-benefit-corp/startup launched (founded by the Amodeis; $124M investment for scaling “reliable and steerable AI systems”); “Cooperative AI Foundation” (CAIF) launched
“MLP-Mixer: An all-MLP Architecture for Vision”, Tolstikhin et al 2021 (another FC paper removing even more inductive biases—ponies are all you need: “Mixer improves more rapidly with data than ResNets, or even ViT, and the gap between large scale Mixer and ViT models shrinks until the performance is matched on the entire dataset…” The Bitter Lesson truly is the single bitterest lesson in ML, isn’t it? The more people tweet about how MLP-Mixer is overhyped because it is −X% worse than the ultra-hand-optimized baseline or requires Y× more FLOPS, the more they demonstrate precisely why this sort of research is so important! And showing, incidentally, that Transformers are still under-researched if such a fundamental fact could have been missed for so long.)
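To make “all-MLP” concrete, here is a minimal PyTorch sketch of a single Mixer layer as described in the paper (an illustration, not the authors’ code; the hidden sizes are arbitrary): patches are mixed by an MLP applied across the token dimension, then features are mixed by an MLP applied across the channel dimension, each with a residual connection.

```python
import torch.nn as nn

class MixerBlock(nn.Module):
    """One Mixer layer: token-mixing MLP + channel-mixing MLP (hidden sizes illustrative)."""
    def __init__(self, n_tokens, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(      # mixes information across patches
            nn.Linear(n_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, n_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(    # mixes information across features
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x):                    # x: (batch, n_tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x
```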
“Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation”, Cheng et al 2021 (CLIP-like performance scaled down to n = 3m using soft labels generated by a Conceptual Captions-pretrained model)
“SR3: Image Super-Resolution via Iterative Refinement”, Saharia et al 2021; “Diffusion Models Beat GANs on Image Synthesis”, Dhariwal & Nichol 2021 (DDPMs^1^ finally surpass BigGAN-deep on ImageNet 512px images at similar compute-cost, as expected from their good scaling); “Cascaded Diffusion Models for High Fidelity Image Generation”, Ho et al 2021
“Learning to summarize from human feedback”, Stiennon et al 2020
“Grokking: Generalization Beyond Overfitting On Small Algorithmic Data Sets”, Power et al 2021 (discussion; new scaling effect, ‘grokking’: sudden perfect generalization emerging many epochs after training-set overfitting on algorithmic tasks when training in flat shallow loss landscapes); “Knowledge distillation: A good teacher is patient and consistent”, Beyer et al 2021 (training much smaller models merely requires hundreds of thousands or millions of epochs)
“Scaling End-to-End Models for Large-Scale Multilingual ASR”, Li et al 2021
“The Shape of Learning Curves: a Review”, Viering & Loog 2021
“Reward is enough”, Silver et al 2021 (a DRL manifesto: reward losses are enough, at sufficient scale of compute/parameters/tasks, to induce all important capabilities like memory/exploration/generalization/imitation/reasoning)
Scaling Down:
lazy: a tool for running processes in idle time (how to train on a GPU without destroying your GUI’s usability! lazy pauses runs briefly while you interact with your desktop, letting you do months-long runs without going crazy or resorting to Colab etc. This enables hobbyists to go after previously-infeasible model sizes); EleutherAI releases a 6b-parameter GPT-3 model, GPT-J (are you still using GPT-2/GPT-Neo? upgrade!); “Aggregating Nested Transformers”, Zhang et al 2021/“Less is More: Pay Less Attention in Vision Transformers”, Pan et al 2021
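As a sketch of the underlying idea (an assumed implementation for illustration, not necessarily how lazy itself is written; assumes a POSIX system running X11 with the xprintidle utility available): suspend the training process with SIGSTOP whenever the user has been active recently, and resume it with SIGCONT once the desktop has been idle long enough.

```python
import os
import signal
import subprocess
import time

IDLE_THRESHOLD_MS = 60_000                      # resume only after 1 minute of no input

job = subprocess.Popen(["python", "train.py"])  # the long-running GPU job (hypothetical script)
paused = False
while job.poll() is None:                       # loop until the job exits
    idle_ms = int(subprocess.check_output(["xprintidle"]))  # X11 idle time in milliseconds
    if idle_ms < IDLE_THRESHOLD_MS and not paused:
        os.kill(job.pid, signal.SIGSTOP)        # user is active: pause the run
        paused = True
    elif idle_ms >= IDLE_THRESHOLD_MS and paused:
        os.kill(job.pid, signal.SIGCONT)        # desktop idle again: resume the run
        paused = False
    time.sleep(5)
```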
“ByT5: Towards a token-free future with pre-trained byte-to-byte models”, Xue et al 2021 (character models—not just feasible but desirable; we’ll get our rhyming & pun-making language models yet!)
“Machine Learning Attacks Against the Asirra CAPTCHA”, Golle 2008 (a look back on a decade of CV progress: months of work for 80% cat vs dog with SVM ensembles in 2008; 5min in Fast.ai for 99% accuracy in 2018; for even more perspective, Cireşan 2012)
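For a sense of how little code the modern version of that cat-vs-dog task takes, a rough sketch along the lines of the standard fastai pets tutorial (written against the current fastai v2 API; the 2018 workflow used fastai v1, so treat the exact identifiers as approximate):

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)                   # Oxford-IIIT Pets dataset
def is_cat(f): return f[0].isupper()           # cat breeds are capitalized in the filenames

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path/"images"),
    valid_pct=0.2, label_func=is_cat, item_tfms=Resize(224))
learn = vision_learner(dls, resnet34, metrics=error_rate)  # cnn_learner in older releases
learn.fine_tune(1)                             # ~99% accuracy in minutes on a single GPU
```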
2.2 Genetics
Everything Is Heritable:
“Bi-ancestral depression GWAS in the Million Veteran Program and meta-analysis in >1.2 million individuals highlight new therapeutic directions”, Levey et al 2021
“The complete sequence of a human genome”, Nurk et al 2021 (media)
“Using DNA to predict intelligence”, von Stumm & Plomin 2021 (review)
“Long read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits”, Beyter et al 2021
“Rapid Sequencing–Based Diagnosis of Thiamine Metabolism Dysfunction Syndrome” (sequence everyone!)
Engineering:
“Sense codon reassignment enables viral resistance and encoded polymer synthesis”, Robertson et al 2021 (“ultra-safe cells”: synthesizing an entire E. coli genome with swapped codons for complete viral immunity)
“In vivo CRISPR base editing of PCSK9 durably lowers cholesterol in primates”, Musunuru et al 2021
Optogenetics: “Partial recovery of visual function in a blind patient after optogenetic therapy”, Sahel et al 2021 (media); “Wireless multilateral devices for optogenetic studies of individual and social behaviors”, Yang et al 2021 (media)
“Retron Library Recombineering (RLR): High-throughput functional variant screens via in vivo production of single-stranded DNA”, Schubert et al 2021
“First genetically modified Oxitec mosquitoes released in the United States”
“Genomic characterization of world’s longest selection experiment in mouse reveals the complexity of polygenic traits”, Palma-Vera et al 2021
“Surrogate broodstock to enhance biotechnology research and applications in aquaculture”, Jin et al 2021
“Utility of polygenic embryo screening for disease depends on the selection strategy”, Lencz et al 2021
2.3 Statistics/Meta-Science
“How a Publicity Blitz Created The Myth of Subliminal Advertising”, Rogers 1992 (the famous movie-theater/popcorn-sales experiment never happened)
2.4 Politics/Religion
“Clarifying the Structure and Nature of Left-Wing Authoritarianism (LWA)”, Costello et al 2021
“Book Review: The Decline and Fall of the Roman Empire” (excerpts)
2.5 Psychology/Biology
“A connectomic study of a petascale fragment of human cerebral cortex”, Shapson-Coe et al 2021 (“…This ‘digital tissue’ is a ~660,000× scale up of an earlier saturated reconstruction from a small region of mouse cortex, published in 2015 (Kasthuri et al 2015). Although this scaleup was difficult, it was not hundreds of thousands of times more difficult and took about the same amount of time as the previous data set (~4 years)…The rapid improvements over the past few years…argues that analyzing volumes that are even 3 orders of magnitude larger, such as an exascale whole mouse brain connectome, will likely be in reach within a decade.” See also “Accelerating progress in brain recording tech”.)
“Neuroimaging evidence for a network sampling theory of individual differences in human intelligence test performance”, Soreq et al 2021; “The neural basis of intelligence in fine-grained cortical topographies”, Feilong et al 2021; “Predicting intelligence from brain gray matter volume”, Hilger et al 2020 (towards the mechanistic reification of g: per P-FIT, it is global efficiency/total cognitive resources which can be spent on learning & orchestrating specialized capabilities); if we consider recent human brain imaging studies, cross-species comparisons, and deep learning as converging, I would offer as a speculation the following:
The Master Synthesis: intelligence is execution of small simplicity-weighted programs, best discovered by search over smooth loss landscapes like that of highly-overparameterized differentiable networks containing lottery-ticket subnetworks which are ensembled/averaged over, approaching Bayes-optimal reasoning in the limit (as nearest-neighbors-like high dimensional interpolation / memorization gives way to algorithmic generalization / interpolation on a more abstract level); this can be implemented by large numbers of similar neurons trained using any of the many approximations to backprop; human intelligence’s g is real but is the overall ‘pool’ of neural resources which derives from overall body integrity because the number of neurons, their density, their myelination, resistance to damage and infection etc, is causally downstream of all body and developmental systems, creating a huge mutational target; the brain regions specialize and differentiate, and their orchestration (or lack thereof) contributes to observed performance on tasks tapping into multiple specialized regions; as tasks rely on fewer regions or approach intrinsic ceiling, g ceases to be observable and task-specific influences matter most.
“MDMA-assisted therapy for severe PTSD: a randomized, double-blind, placebo-controlled phase 3 study”, Mitchell et al 2021 (d = 0.9 over therapy); “Effects of Psilocybin-Assisted Therapy on Major Depressive Disorder”, Davis et al 2021
“Why Animals Don’t Get Lost: Birds do it. Bees do it. Learning about the astounding navigational feats of wild creatures can teach us a lot about where we’re going” (on spectacular but still mysterious feats of animal navigation)
“In The Future Of Collecting, Is Anyone Having Fun?” (on Bobblehead collectors)
“Linking Brain Biology to Intellectual Endowment: A Review on the Associations of Human Intelligence With Neuroimaging Data”, Dizaji et al 2021
“The Best And The Rest: Revisiting The Norm Of Normality Of Individual Performance”, O’Boyle & Aguinis 2012 (performance is log-normal)
“A conserved strategy for inducing appendage regeneration”, Abrams et al 2021 (slight regrowth of damaged mouse limbs by drinking sugar+amino-acid-supplemented water)
“Know Your Amphetamines”, Scott Alexander
“Feeling Small: Exploring the Tactile Perception Limits [of Humans]”, Skedung et al 2013
“The Board Game of the Alpha Nerds: Before Risk, before Dungeons & Dragons, before Magic: The Gathering, there was Diplomacy” (WP; “I still don’t know whom I should have trusted, if anyone. All I know is that I felt stupid, stressed out, humiliated, and sad.”)
2.6 Technology
“I walk the (beta-stability) line: How counting neutrons explains nuclear waste”
“Making is Show Business now”, Alex Danco
“Shop Class as Soulcraft: The case for the manual trades”, Crawford 2006
“Spintronics: Build mechanical circuits”, Kickstarter (followup to Turing Tumble)
2.7 Economics
“RCTs to Scale: Comprehensive Evidence from 2 Nudge Units”, DellaVigna & Linos 2020 (nudge effects overestimated by 6.2× due to publication bias)
“No causal associations between childhood family income and subsequent psychiatric disorders, substance misuse and violent crime arrests: a nationwide Finnish study of >650,000 individuals and their siblings”, Sariaslan et al 2021; “Parental income and mental disorders in children and adolescents: prospective register-based study”, Kinge et al 2021
“Everything You Might Want to Know about Whaling”, Matt Lakeman
2.8 Fiction
2.9 Miscellaneous
“The Strange Story of Dagobert, the Duck Tales Bandit: In the ’90s, a frustrated artist in Berlin went on a crime spree—building bombs, extorting high-end stores, and styling his persona after Scrooge McDuck. He soon became a German folk hero.” (WP; another reminder for Americans—odd as it may seem, Donald Duck is extremely popular overseas; see also the unknown-in-the-USA character John D. Rockerduck or the beloved Scandinavian tradition From All of Us to All of You, whose 2020 airing set an all-time record of >4.5m viewers)
List of atmospheric optical phenomena (How many would you recognize from a distance or plane? How many have you even heard of?)
Baron Franz Nopcsa von Felső-Szilvás (noted geologist, paleontologist, anthropologist, homosexual, & skyjacker)
What is a diffusion model like DDPM? To try to explain it as simply as possible without the math:
DDPM is a neural net which is trained to fix noise in an image: it takes a noisy image and ‘sharpens’ it to produce a new image. You train it by adding dirt to a normal image, and teaching it to turn the dirty version into the original. As it gets better, it learns what the images all tend to look like so it can ‘see through’ ever more noise, to turn smudged hints of the original image into its best guess. Once it’s done training, what happens if you give it a completely dirty photo, which is pure static noise? Well, it produces a slightly less dirty ‘photo’. And if you do it again? It’s a little cleaner still. Now, what if you do this many times? It has to get cleaner each time. The end result: the static noise goes in, and a face pops out! The DDPM has hallucinated a face out of the noise. One little blob of static here turned into a nose, and another blob turned into an ear, and it went from there.
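In code, generation is just that clean-up step applied over and over. A minimal sketch of the standard DDPM ancestral-sampling loop (Ho et al 2020’s Algorithm 2), with a placeholder standing in for the trained denoising network:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def predict_noise(x, t):
    """Placeholder for the trained network eps_theta(x_t, t); a real model returns its noise estimate."""
    return torch.zeros_like(x)

x = torch.randn(1, 3, 64, 64)                # start from pure static
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    # subtract the estimated noise: one 'slightly cleaner' image per step
    mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + torch.sqrt(betas[t]) * noise
# after T steps, x is the model's hallucinated image
```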