Sup AI | The team of AIs that scored No. 1 on Humanity's Last Exam.

by SkillAiNest

Hey Product Hunt. I’m Ken, a 20-year-old Stanford CS student. I created Sup AI.

I started working on this because no single AI model is correct all the time, but their errors are not strongly correlated. In other words, models often make mistakes that other models don't. So I run multiple models in parallel and synthesize the outputs, weighting each segment by confidence. Confidence comes from the entropy of the output-token probability distributions: low entropy means the model is sure of its next token, while high entropy is often where hallucination begins.
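As a minimal sketch of what this kind of entropy-weighted synthesis could look like (the inverse-entropy weighting and the function here are illustrative, not our production code):

```python
from collections import defaultdict

def synthesize(candidates):
    """Pick the answer with the highest total confidence weight.

    Each candidate is (answer, mean_token_entropy_bits) from one model.
    Lower entropy -> higher weight. The 1/(1+H) weighting is a toy
    illustration of the idea, not Sup's actual scheme.
    """
    scores = defaultdict(float)
    for answer, mean_entropy in candidates:
        scores[answer] += 1.0 / (1.0 + mean_entropy)
    return max(scores, key=scores.get)

# Two models agree on "B" but with high entropy; one very confident
# model says "A". The confident minority wins here:
print(synthesize([("B", 1.2), ("B", 1.5), ("A", 0.1)]))  # -> A
```

Note that this is deliberately not majority voting: a single low-entropy answer can outweigh two uncertain ones, which is the point of weighting by confidence rather than counting heads.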

My father Scott (AI Research Scientist at TRI, PhD from UCLA) is my research partner on this. He sends me papers all the time, we discuss whether they really apply and whether the modifications make sense, and then I build and test things. The entropy-weighting approach emerged from one of these discussions.

On Humanity's Last Exam (HLE), Sup scored 52.15%. The best individual model in the same evaluation run scored 44.74%. The difference is statistically significant (p < 0.001).

Methodology, eval code, data, and raw results:

Limitations:

  • We evaluated 1,369 of the 2,500 HLE questions (details in the links above).

  • Not all APIs expose token logprobs. We use several other methods to estimate confidence when they don't.
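One common fallback when logprobs aren't available is self-consistency: sample the model several times and use answer agreement as a confidence proxy. This sketch is my illustration of that general idea, not necessarily one of the exact methods Sup uses:

```python
from collections import Counter

def agreement_confidence(samples):
    """Estimate confidence without logprobs.

    `samples` holds the final answers from several independent samples
    of the same model on the same prompt. Returns the majority answer
    and the fraction of samples that agree with it.
    """
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

answer, conf = agreement_confidence(["42", "42", "41", "42", "42"])
# answer == "42", conf == 0.8
```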

We tried offering free access, but it was abused so heavily it nearly killed us. Now there is a $10 starter credit that never expires, with card verification (no auto-charge). If you don't want to sign up, leave a prompt in the comments and I'll run it myself and post the result.

Try it at https://sup.ai. My dad (@Scottum) is also in this thread. We'd love candid feedback, especially on where it works well for you and where it falls short. If Sup proves useful, we've added a Product Hunt offer that ends in a week: 20% off your first month with code PRODUCTHUNT.

If you're not sure what I mean by entropy and output-token probability distributions: whenever an LLM outputs a token, it is choosing that token from among all possible tokens. The model assigns each candidate token a probability, and these probabilities sum to 1, forming a probability distribution. APIs typically return them as logprobs (logarithms of probabilities), both because rare tokens can have raw probabilities so small they underflow in floating point, and because logprobs are a natural byproduct of how models actually compute their distributions. We use these directly to calculate entropy. Entropy is a measure of uncertainty: a token's distribution can be near-certain (one token has 99.9% probability and the rest share the remaining 0.1%) or uncertain (every token has roughly the same probability, so the choice is close to random). Low entropy is the former case; high entropy is the latter.
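In code, the calculation is short. This sketch assumes the API returns natural-log probabilities for the top candidate tokens (as, for example, OpenAI-style logprobs do), and renormalizes since you usually only see the top-k alternatives:

```python
import math

def entropy_from_logprobs(logprobs):
    """Shannon entropy (in bits) of a token's probability distribution,
    given natural-log probabilities for the candidate tokens."""
    probs = [math.exp(lp) for lp in logprobs]
    # Renormalize: APIs typically expose only the top-k alternatives.
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Near-certain distribution: one token at 99.9%
low = entropy_from_logprobs([math.log(0.999), math.log(0.001)])
# Near-uniform distribution over 4 tokens
high = entropy_from_logprobs([math.log(0.25)] * 4)
# low is ~0.01 bits; high is 2.0 bits
```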

There is interesting research on how entropy correlates with accuracy and hallucination:
