LLM Semicha, Part I
LLMs get tested on an Israeli Rabbanut Semicha Exam
Intro:
In our first article here, we highlighted the importance of benchmarking LLMs, which is the process of systematically testing and comparing AI language models' performance on standardized tasks to measure their capabilities and limitations.
It’s time to take off the training wheels – we're throwing these AI models into the deep end of the pool by testing them against actual questions from the Israeli Rabbanut semicha exam. Having spent time in an Israeli yeshiva, I've seen firsthand how these exams stress out even the most diligent yeshiva student.
This isn't just another academic exercise in AI testing. This study breaks new ground in understanding how well LLMs can truly grasp and engage with complex Jewish legal concepts, potentially revolutionizing how we approach both AI development and religious scholarship. The implications could be game-changing – if an AI can tackle the intricacies of an Israeli Rabbanut semicha exam, what else might it be capable of?
Methodology:
We used an Israeli Rabbnut Semicha test with the answers provided (publicly available here) to test 3 LLMs, Claude Sonnet 3.5, ChatGPT 3o mini (with internet search and reasoning), and DeepSeek R1 (with search and DeepThink). The last two models, ChatGPT and DeepSeek utilize the latest LLM advancement, which is chain-of-thought prompting. Per ChatGPT, Chain-of-Thought (CoT) prompting is a technique in large language models (LLMs) that enhances reasoning by guiding the model to generate intermediate steps leading to a final answer. Hypothetically, this would seem to be well-suited to understanding the complexities of Jewish texts.
Let’s introduce our semicha candidates.
Claude Sonnet 3.5
Our good friend and (my) fav model, even though can’t search the internet yet or ‘reasons’ :(
ChatGPT 3o
This model with its deep research function shocked the world with its reasoning ability, getting the highest score yet, 26%, on “Humanity’s Last Exam.”
DeepSeek:
The release of this model wiped out ~$589,000,000,000.00 (billion, ya I know) of value from Nvidia.
Instructions for the test, given to the LLMs
The Question: Part I
Data: The Answers Part I
Rabbanut Answer, Part I
Claude Sonnet 3.5 (no search or reason option)
Claude: 3/10
General idea is correct
No mention of Rashi (ie, later it will be mutar [rashi beitza 3b])
Incorrect svara of Ran, who really says it’s the lack of difference between the currently asur and the mutar objects. The above quoted approach that is attributed to the Ran sounds like Rashi.
Rashbah: I could not find this rashbah
ChatGPT 3o mini (with internet search and reasoning)
ChatGPT: 0/10 points
Gets the entire concept wrong- it’s not about which taste, the asur vs mutar is stronger, but the fact that the asur WILL become mutar at some point, changes the halachic equation
No mention of Rashi/Rashaba
Wrong siman in Shulchan Aruch
DeepSeek R1 (with search and DeepThink)
DeepSeek: 0/10 points
No mention of Rashi/Ran
General explanation is broadly in the right direction, though not entirely precise.
Chashuv is not the right svara for why it’s not batel
An egg on yom tov, not chametz (which is subject to a machloket) is the prototypical example of a davar she’yeish lo matirin.
Wrong siman in Shulchan Aruch.
Interesting svara in that tosafot using the word chefsah, but no such tosafot exists.
Quick Summary so far:
Claude wins, scoring 30%. Chat and DeepSeek get a 0. Doesn’t bode well for parts 2 and 3!
We’ll be back with the the 2nd and 3rd parts of the above question soon.













Maybe try loading up some relevant seforim into its context window, via a "project"? Can download entire text files from sefaria