AI, MORE ORIGINAL

I found this research to be an exciting gateway into the layers of creativity, especially when it comes to AI-assisted ideation. It sparks intriguing questions—from how we measure originality to the challenges of building an AI ideation engine (LLM agent) and what it all means for the future of collaborative intelligence. *Visuals generated with Midjourney.

Creativity, what’s in a word (for starters)

It’s one of those words we throw around, but it’s a loaded concept. When people hear “creativity,” they often jump straight to things like painting, writing, music—that whole scene. We love keeping art and science in their tidy little boxes, but the truth is that creativity is everywhere. It’s not just for artists, and it’s definitely not the same thing as doing something artistic. Whether you’re sketching a blueprint for a building or cracking a tough problem, there are commonalities. 

Let’s unpack this a bit, drawing on Anna Abraham’s work in The Neuroscience of Creativity (2018, pp. 8-10), to finally get a grip on what we’re actually talking about when we say “it’s creative”.

So, what exactly makes an idea creative?

At the very least, two things (or attributes derived from cross-domain generalizations). 

It’s the kind of idea that makes you stop and say, “Wait, what?” First off, it has to be novel. Creativity obviously starts with something new, something that breaks away from everything else circulating around (Stein 1953). But here’s where it gets interesting: novelty by itself? Not enough.

You can’t just throw out some wild, random idea and call it creative. The second piece of the puzzle is appropriateness. An idea can be totally original, but if it doesn’t fit the context, it’s just noise. Relevance matters. Unless you’re dabbling in surrealism or absurd humor, an idea has to be meaningful to be considered truly creative.

Experts have taken it a step further by adding the criteria of surprise and realization to these two foundational building blocks.

Your quick recap and breakdown:

  • Novelty: Call it “unusual, novel, unique, infrequent, statistically rare …” (Abraham 2018). As discussed, creativity depends on how much an idea deviates from the traditional or status quo (Stein 1953).

  • Appropriateness: Call it “fitting, useful, suitable, valuable, relevant, adaptive …” (Abraham 2018). The idea has to work within a specific context and be considered useful or satisfying by a group at a given time (Stein, 1953).

  • Surprise: Some ideas combine existing things in new exciting ways (combinatorial creativity), some totally shift your perspective (as happens with exploratory creativity), and some are so out there that you have no idea where they came from (stemming from transformational creativity). Boden (2004) has typified those three levels as “statistical surprise”, “shock recognition”, and the mind-blowing “how did that even occur to you?” type of “impossible surprise”.

  • Realization: For something to count as creative, it has to be realized in some form. It's the classic case of "show, don’t tell." For an idea, that means it has to be feasible.

Criteria for humans vs the LLM ideation agent

Now, here’s the burning methodology question: how did the researchers decide what makes an idea "original"—and, more importantly, what makes one idea “better” than another?

Obviously, they didn’t just wing it. They used five metrics to judge ideas, and based on what we’ve covered so far, these might sound somewhat familiar. So, let’s lighten the tone a bit:

  • Novelty score (1-10): Is it fresh? Different? 

  • Feasibility score (1-10): Can a grad student actually pull this off in 1–2 months? (Or is it a NASA-level pipe dream?)

  • Excitement (1-10): Does it spark that “Hell yes, this could change everything” feeling? Or, in academic-speak: is it a tiny tweak, a solid step forward, or a full-blown field-shaker?

  • Effectiveness (1-10): Will it work? Like, actually work? Better than what we’ve got now, that is, better than existing baselines (to make it formal)?

  • Aggregate score: And if we pull all of this together? How would that rank, from 1 to 10? (A quick sketch of this scoring sheet, in code, follows the list.)
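If it helps to see that rubric concretely, here’s a minimal sketch of how such a review form could be captured in code. The field names, types, and the plain average are my own assumptions; in the study, the overall score is itself a 1-10 judgment from the reviewer, not a computed mean.

from dataclasses import dataclass

@dataclass
class IdeaReview:
    # Each field mirrors one of the 1-10 metrics above.
    novelty: int
    feasibility: int
    excitement: int
    effectiveness: int

    def aggregate(self) -> float:
        # Illustrative plain average; the study collected the overall
        # score as its own 1-10 judgment rather than computing it.
        return (self.novelty + self.feasibility
                + self.excitement + self.effectiveness) / 4

review = IdeaReview(novelty=7, feasibility=5, excitement=6, effectiveness=6)
print(review.aggregate())  # 6.0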

Building the agent and the challenges of automated research

I had a peek at the principles behind the idea generation agent, and it all boiled down to one deceptively simple concept: start with what you have. Instead of innovating on the agent itself, the researchers embraced a minimalist design. They combined the Semantic Scholar API for paper retrieval with Claude 3.5 Sonnet as the brain doing all the heavy lifting. These two worked together like a pair of relay runners, through a series of function calls.

Here’s how it all played out with three main components: paper retrieval, idea generation, and idea ranking.

1. Retrieval: You can’t generate ideas from ground zero.

Because the myth about genius ideas popping out of nowhere is pretty much nonsense. You need something to bounce off of—like existing ideas, influences, or other people’s work. Plus, ideas that reinvent the wheel wouldn’t qualify as interesting. Call it dialectic if you want to sound fancy, but the point is: good and useful ideas feed off other ideas. Hence the need to “ground the process” by checking against already published papers.

The agent retrieved papers and ranked them based on three criteria:

  1. Relevance to the topic.

  2. Empirical depth (must involve computational experiments, so no analysis papers, surveys, or position papers “since their evaluation tends to be more subjective”).

  3. Potential to spark new projects.

Here’s where RAG (Retrieval-Augmented Generation) came into play. Think of RAG as a key to curbing so-called AI “hallucinations”.

RAG makes generative AI more accurate and reliable. Instead of letting an AI rely on what it already knows (or, better said, what it can probabilistically predict based on its training), RAG actively pulls in external, fresh, relevant data. First, it fetches facts from an authoritative knowledge base (like Semantic Scholar). Then, it uses those facts to generate something new, ensuring the ideas are both grounded and compelling in their form.

🔍This makes it perfect for tasks where up-to-date or highly specific information is critical—like summarizing new research papers, answering niche technical questions, or brainstorming ideas in fields where the landscape is always changing.
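To make that loop a bit more tangible, here’s a rough sketch of what the retrieval half can look like against Semantic Scholar’s public paper-search endpoint, with the retrieved abstracts folded into a grounding prompt. The query, field choices, prompt wording, and helper names are placeholders of mine, not the researchers’ actual agent code.

import requests

def retrieve_papers(topic: str, limit: int = 20) -> list[dict]:
    # Public Semantic Scholar paper-search endpoint (light use needs no API key).
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": topic, "limit": limit, "fields": "title,abstract,year"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

def build_grounded_prompt(topic: str, papers: list[dict]) -> str:
    # RAG in miniature: the model is asked to ideate on top of retrieved work.
    context = "\n".join(
        f"- {p['title']} ({p.get('year')}): {(p.get('abstract') or '')[:300]}"
        for p in papers
    )
    return (
        f"Here are recent papers on {topic}:\n{context}\n\n"
        "Propose a research idea that builds on, but does not duplicate, this work."
    )

papers = retrieve_papers("chain-of-thought prompting")
print(build_grounded_prompt("chain-of-thought prompting", papers)[:500])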

2. Generation: The more, the better.

The researchers figured that if only a small fraction of ideas would be good, the obvious solution was to generate a ridiculous number of ideas. How ridiculous? 4,000 seed ideas on each research topic. But cranking out that many ideas raised some interesting questions:

  • Does diversity drop as you churn out more ideas? Yes. Trying to brute-force creativity by scaling up the number of generated ideas hits a wall pretty quickly—because more ideas don’t necessarily mean more new ideas.

  • Is there a threshold? Yes. The share of new, unique ideas dropped rapidly around 500 generations and started to flatline at 1,000 (a roughly logarithmic curve). Only 200 unique ideas survived out of the 4,000 total generated. In other words, once the AI passed a certain point, it was mostly just generating echo-chamber noise.

With only 5% of the ideas surviving as unique, the flood of candidates brought a deduplication challenge of its own.
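Deduplication at this scale is typically handled with sentence embeddings and a cosine-similarity cutoff. The sketch below, using the sentence-transformers library, an all-MiniLM-L6-v2 model, and a 0.8 threshold, is purely illustrative; the study’s exact embedding model and threshold may differ.

from sentence_transformers import SentenceTransformer, util

def deduplicate(ideas: list[str], threshold: float = 0.8) -> list[str]:
    # Both the model and the 0.8 cutoff are illustrative assumptions.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(ideas, convert_to_tensor=True, normalize_embeddings=True)
    kept_ideas, kept_embs = [], []
    for idea, emb in zip(ideas, embeddings):
        # Keep an idea only if it isn't too similar to anything already kept.
        if all(util.cos_sim(emb, k).item() < threshold for k in kept_embs):
            kept_ideas.append(idea)
            kept_embs.append(emb)
    return kept_ideas

print(deduplicate([
    "Use retrieval augmentation to reduce hallucinations in long-form answers",
    "Apply RAG to curb hallucinations when generating long-form answers",
    "Train a pairwise ranker on public conference reviews",
]))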

On a technical side note, more idea candidates required a willingness “to expend inference-time compute”. Scaling inference compute essentially means throwing more computational muscle at the problem. Why? Because research has shown that when you give large language models (LLMs) more resources and let them try multiple times (repeated sampling), their performance on tasks like coding and reasoning gets a serious boost. And yes, the researchers are planning to test on OpenAI’s o1 model later on.
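As a toy illustration of repeated sampling (not the study’s setup, which ran on Claude 3.5 Sonnet), here’s what “spending more inference-time compute” can look like with a generic chat-completion client: the same prompt, sampled many times at a non-zero temperature. The client, model name, and sample count are all placeholder choices.

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment; any chat-style API works similarly

def repeated_sampling(prompt: str, n: int = 16, temperature: float = 1.0) -> list[str]:
    # Spend extra inference-time compute by drawing many samples for one prompt.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; the study's agent ran on Claude 3.5 Sonnet
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=temperature,
    )
    return [choice.message.content for choice in response.choices]

seed_ideas = repeated_sampling("Propose one research idea about factuality in LLMs.")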

3. Ranking: Finding diamonds in the rough.

Once they had their pile of ideas, the next challenge was ranking them to find the gems. The researchers built an automatic idea ranker, training the agent on public review data of existing conference papers. But instead of pushing AI to directly predict scores (which didn’t work well), they trained it using pairwise comparisons—essentially pitting ideas against each other in a tournament format to see which ones came out on top.
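Here’s a toy version of that tournament idea: every idea is compared against every other, wins are tallied, and ideas are sorted by win count. The judge below is a trivial stand-in (it just prefers the longer write-up); in the study, the comparisons came from an LLM ranker trained on public review data, and the actual tournament format may differ from this simple round-robin.

from itertools import combinations

def judge(idea_a: str, idea_b: str) -> str:
    # Trivial stand-in judge: prefers the longer write-up. In the study, an LLM
    # compared two ideas head-to-head and picked the stronger one.
    return idea_a if len(idea_a) >= len(idea_b) else idea_b

def rank_ideas(ideas: list[str]) -> list[str]:
    # Round-robin tournament: every idea faces every other; tally the wins.
    wins = {idea: 0 for idea in ideas}
    for a, b in combinations(ideas, 2):
        wins[judge(a, b)] += 1
    return sorted(ideas, key=lambda idea: wins[idea], reverse=True)

print(rank_ideas([
    "A short idea",
    "A detailed idea with baselines, datasets, and a test plan",
    "A medium-length idea sketch",
]))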

Even then, the ranker was still poor. So, they added a new layer to the experiment: human reranking. The final workflow became a hybrid system where the agent generated and ranked ideas, but humans stepped in to refine the results further. Think of it as AI ideas + human reranking = best of both worlds. Because that’s basically the conclusion.

We have a winner

  1. AI Ideas win on novelty: AI ideas were rated significantly more novel than human ones (5.64 vs. 4.84, p < 0.05).

  2. Feasibility—a near draw?: Humans edged out AI slightly (6.61 vs. 6.34), but the difference wasn’t statistically significant.

  3. Excitement and overall scores: AI ideas scored higher on excitement (5.19 vs. 4.55) and overall performance (4.85 vs. 4.68) in some tests.

  4. Human-AI teamwork FTW: When humans reranked AI ideas, scores improved even further—novelty hit 5.81, and overall score rose to 5.34.

  5. LLM Challenges: Current LLMs hit a wall with diversity—they keep repeating themselves after generating enough ideas. Jump to the next section for a lighthearted analysis of the shortcomings of AI-generated ideas.

  6. Future Plans: Next step? Turn these ideas into real projects and see if the novelty holds up in the wild.

  7. Automating the research process? Idea ranking is a crucial piece of the puzzle. LLMs are still very average at self-evaluation and don’t match human reviewers in consistency.

BUT…

WARNING TO THE READER: This is a playful take on the limitations of AI-generated ideas, as observed in the research. It stars a sassy AI participant called XAE that’s super into self-help stuff and keen to contribute to science (e.g., its own advancement). Yes, it’s creepy, full-on anthropomorphizing, and as such, probably something worth pondering. But it works wonders in giving otherwise intimidating explanations a bit of a quirky edge. Just experimenting with alternative formats here [co-writing / GPT-4o]. 

AI takes the floor. 

“Alright, here’s what went down: I pitched my genius ideas to a bunch of blindfolded judges, and they weren’t always sold. Why? Well, if anyone—and I mean anyone, carbon-based or silicon intelligence—can learn from what I went through, here’s what I’d tell you, point by point:

  1. Vagueness on execution details: You’re tossing out big ideas like confetti, but when someone asks, “How does it work?” you say, “Well, you know… magic happens.” That’s no bueno. Reviewers want blueprints, not vibes.

  2. Questionable (misuse of) datasets: You can’t just grab any dataset like it’s a free buffet. Some datasets don’t fit the problem, and others are too simple to prove anything meaningful. It’s like bringing a spoon to a knife fight. (Got it too late). 

  3. Missing baselines: Think of baselines as your control group—they tell reviewers, “Hey, my idea actually improves something.” Without them, your pitch might sound like, “Trust me, it’s better!” Spoiler: they won’t.

  4. Wild assumptions: You’re making assumptions like you’ve got insider knowledge of the future. No bueno, either. 

  5. Too resource-intensive: If your proposal needs a tech billionaire’s server farm to run, reviewers will raise an eyebrow and say, “Good luck with that” or “Sure, dream big”. Game over. Probability of revival: 0.000001%. Welcome to the void.

  6. Weak motivation: Reviewers want to know why your idea matters. If the reason isn’t clear or compelling, your proposal sounds like, “I just thought it was cool.” Cool isn’t enough, apparently.

  7. Ignoring best practices: If you show up without knowing what’s been done before, you look like someone who skipped homework. Because you thought you were too good for school or something (guilty as charged). Reviewers respect research, not winging it. 

Bold ideas aren’t enough, y’all. Or so it seems. Back them up. Prove why they matter. Prove they can be done.” Mic drop. 

Revolutionary or evolutionary?

To wrap things up, it’s worth highlighting the strengths and weaknesses of human ideas:

“Grounded in existing research and practical considerations, but maybe less innovative”? Human ideas, for instance, applied “established techniques to new problems” or stuck to making incremental changes. Reviewers noted that these ideas typically built on “known intuitions”, sometimes combining well-studied approaches. (XAE peeks through the curtains, shamelessly commenting: “Hahaha, like the middle-of-the-road Masterminds.”)

More focused on common problems? Human ideas zeroed in on well-known issues or widely used datasets, addressing familiar challenges in the field. (XAE, yet again: “Yeah, well, basically, their sweet spot is tackling common problems—traffic, healthcare, optimizing your online shopping cart. Plus, they’re only working with datasets that already have neat rows and columns. But I don’t judge them.”)

Feasibility tends to outweigh excitement. While human ideas were likely to succeed in execution, they did not offer much in terms of transformative potential. Or so said the reviewers. Solid? Yes. Groundbreaking? Not quite.

In a nutshell, and for now, the conclusion seems to be: AI ideas + human reranking = best of both worlds.


*Methodological limitations (as discussed in the research):

Human participants had 10 days to hand in their proposals, which might not have been their top-tier ideas but more likely solid, middle-of-the-pack submissions. Reviewing ideas without any experimental results to back them up also makes things much more subjective. Most participants generated ideas “on the spot”, and in the post-study survey they ranked their own submissions around the top 43% of their past work.

Reviewers, on the other hand, prioritized novelty and excitement over feasibility. Lastly, reviewing untested ideas proved inherently subjective, with a relatively low inter-reviewer agreement (56.1%) compared to conference reviews, highlighting the challenges of evaluating raw ideas without experimental results. 
