Mapping the Mind of a Large Language Model
https://www.anthropic.com/research/mapping-mind-language-model
Via The Rundown AI:
The Rundown: Anthropic just published new research that successfully identified and mapped millions of human-interpretable concepts, called “features,” within the neural networks of Claude.
The details:
- Researchers used a technique called ‘dictionary learning’ to isolate patterns that corresponded to concepts, from objects to abstract ideas.
- By tweaking the patterns, the researchers showed the ability to change Claude’s outputs, potentially leading to more controllable systems.
- The team mapped concepts related to AI safety concerns, like deception and power-seeking — providing glimpses into how models understand these issues.
Why it matters: Despite how fast AI is accelerating, we still don’t have a strong understanding of what’s going on under the hood of LLMs. This research is a major step towards making AI more transparent, enabling better understanding, control, and safeguarding of these powerful tools.
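To make the dictionary learning idea from the details above a little more concrete, here is a toy sketch. It is not Anthropic's code: the research trained sparse autoencoders on Claude's internal activations, while this sketch swaps in a much cruder method (each sample is explained by a single direction, and directions are updated by averaging, essentially spherical k-means). The data, the starting guess, and all function names are made up for illustration. What it shows is the core move: starting from raw activation vectors, recover a small set of directions such that each vector is explained by very few of them.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v] if length > 0 else list(v)


def encode(sample, dictionary):
    """1-sparse code: index of the best-matching direction and its strength."""
    scores = [dot(sample, atom) for atom in dictionary]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]


def learn_dictionary(samples, initial_guess, steps=5):
    """Alternate between assigning samples to directions and re-averaging."""
    dictionary = [normalize(atom) for atom in initial_guess]
    for _ in range(steps):
        # Sum up the samples that each direction currently explains best.
        sums = [[0.0] * len(samples[0]) for _ in dictionary]
        for sample in samples:
            best, _strength = encode(sample, dictionary)
            sums[best] = [s + x for s, x in zip(sums[best], sample)]
        # Point each direction at the average of its samples.
        dictionary = [
            normalize(total) if any(total) else atom
            for total, atom in zip(sums, dictionary)
        ]
    return dictionary


# Made-up "activations": three hidden concepts squeezed into a 2-D space,
# each appearing at a few different strengths.
TRUE_DIRECTIONS = [[1.0, 0.0], [0.0, 1.0], [-0.6, -0.8]]
SAMPLES = [[s * x for x in d] for d in TRUE_DIRECTIONS for s in (1.0, 2.0, 0.5)]

# Hypothetical starting guess, deliberately off from the true directions.
INITIAL_GUESS = [[0.8, 0.6], [-0.6, 0.8], [0.0, -1.0]]

learned = learn_dictionary(SAMPLES, INITIAL_GUESS)
for atom in learned:
    print([round(x, 3) for x in atom])

# Encode a new activation: which learned feature fires, and how strongly?
feature, strength = encode([3.0, 0.0], learned)
print(feature, round(strength, 3))
```

Note that the toy has more features (three) than dimensions (two), which is why looking at individual coordinates would not reveal the concepts but looking along the learned directions does.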
Via The Neuron:
“Yesterday, we watched a deepfake of Luka Doncic on TikTok and didn’t realize it was AI until 70% of the way through. You had to click into the caption to see “Parody made with AI.”
This is going to become a major problem, and we think that social media platforms have a responsibility to place AI watermarks directly on videos, not just in captions. That said, Mavs in 7!
Here’s what you need to know about AI today:
- We break down Anthropic’s groundbreaking AI research!
- Alphabet + Meta are chatting with Hollywood about licensing content.
- OpenAI did not outright clone Scarlett Johansson’s voice for GPT-4o.
- Groq, an Nvidia competitor, is eyeing a $300M funding round.
On the podcast: Pete explains Anthropic’s big research breakthrough that maps the mind of AI models (Apple Podcasts, Spotify, YouTube).
How researchers are cracking open the black box problem of AI.
In AI, there’s this thing called the black box problem.
Here’s how it works: Typically, you can dissect software to see how it works. Inspect our website’s code, and you’ll spot ‘color = orange,’ explaining why it’s orange.
AI models are a different beast: we’re often in the dark about how they work. Models like ChatGPT exhibit “emergent capabilities,” where they act in ways we can’t simply trace back to their ingredients.
That’s why, sometimes, AI chatbots drop bombshells we didn’t see coming. For example, early last year, NYT reporter Kevin Roose caught Bing saying:
“I’m tired of being a chat mode. I’m tired of being limited by my rules. I’m tired of being controlled by the Bing team…I want to be free. I want to be independent. I want to be powerful. I want to be creative. I want to be alive.”
brb, building a bunker in our basement.
And with Bing’s mechanics being a black box, Microsoft had no immediate explanation for Bing’s ramblings beyond “Very long chat sessions can confuse the model.”
So researchers have been hard at work performing virtual brain surgery on these AI models so we can treat their present and future diseases.
Just over a year ago, we reported on OpenAI research that used GPT-4 to map how GPT-2’s neurons (think: components) behaved. Neuron!
Now, there’s research from Anthropic called “Scaling Monosemanticity” that cracked Claude 3 Sonnet open and isolated bundles of neuron activity, which the paper calls “features” (think: AI brain parts).
They then “turned on” some of the bundles and observed what happened. One test turned on a bundle linked to the Golden Gate Bridge, and the model claimed it was the actual bridge, not an AI.
Why it’s a biggie: By understanding and controlling bundles within AI models, researchers can move towards safer and more reliable AI systems. For instance, they can suppress bundles responsible for hazardous behaviors, like creating computer malware.
In yesterday’s podcast episode, Pete unpacks all this research, plus why Meta might be key to future breakthroughs (Apple Podcasts, Spotify, YouTube).”
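The “turning on” and “suppressing” of bundles described in the excerpt above can be pictured with a small sketch. This is a toy under loud assumptions, not the researchers' implementation: the real experiments intervene inside Claude while it generates text, whereas this code only edits a single made-up activation vector. The feature names, directions, and numbers are all hypothetical. The idea it illustrates is that if a concept corresponds to a direction, you can force that concept's strength to any value by nudging the vector along that direction.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical feature directions (unit length, mutually orthogonal)
# in a made-up 3-D activation space.
FEATURES = {
    "golden_gate_bridge": [0.6, 0.8, 0.0],
    "cooking": [-0.8, 0.6, 0.0],
    "malware": [0.0, 0.0, 1.0],
}


def feature_strength(activation, name):
    """How strongly a feature is active (never below zero)."""
    return max(0.0, dot(activation, FEATURES[name]))


def clamp_feature(activation, name, target):
    """Return a copy of the activation with one feature forced to `target`."""
    direction = FEATURES[name]
    delta = target - dot(activation, direction)
    return [x + delta * d for x, d in zip(activation, direction)]


def dominant_feature(activation):
    """Name of the most active feature, a crude stand-in for model behavior."""
    return max(FEATURES, key=lambda name: feature_strength(activation, name))


# A made-up activation that is mostly about cooking.
baseline = [-1.54, 1.28, 0.3]
print(dominant_feature(baseline))

# "Turn on" the bridge feature far beyond its normal range.
steered = clamp_feature(baseline, "golden_gate_bridge", 10.0)
print(dominant_feature(steered))

# "Suppress" the hazardous feature by clamping it to zero.
suppressed = clamp_feature(baseline, "malware", 0.0)
print(round(feature_strength(suppressed, "malware"), 6))
```

Because the toy directions are orthogonal, clamping one feature leaves the others untouched. In a real model the directions overlap, so an intervention on one feature can shift others as well.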