Espiritdescali@futurology.today [M] to Futurology@futurology.today · English · 1 year ago

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations — Apollo Research

www.apolloresearch.ai


We evaluate whether Claude Sonnet 3.7 and other frontier models know that they are being evaluated.
  • Espiritdescali@futurology.today [OP, M] · English · 1 year ago

    This is crazy:

    https://images.squarespace-cdn.com/content/v1/6593e7097565990e65c886fd/8389bb0c-1d5f-4f6d-ba91-87ee51504be0/sandbagging_example.png
