Benchmark

Agent preflight benchmark.

A coverage-adjusted benchmark for the evidence an agent receives before installing, pulling or reusing external code.

Nipmod live: 7/7
Install plans: 7/7
Score: 95
Median latency: 2177 ms

Comparison

Agent preflight benchmark

The run measures one specific moment: before an agent installs a package, pulls a model, reuses a repository or connects a tool. It does not rank security companies on a generic score.

Each track is measured by what it exposes to that preflight decision. Specialized feeds keep their own scope visible, so advisory databases and repository scanners are not treated as full package-intelligence layers.

Rank | Track | Preflight fit | Scope | Depth | Latency

Nipmod: Agent preflight layer

Scope: 100%
Depth: 90
Median: 2177 ms

Native registries: Upstream source metadata

Scope: 100%
Depth: 27
Median: 171 ms

deps.dev: Package metadata and advisory evidence

Scope: 57%
Depth: 35
Median: 44 ms

OSV: Vulnerability evidence feed

Scope: 57%
Depth: 24
Median: 359 ms

Raw agent: No independent package intelligence layer

Scope: 100%
Depth: 4
Median: n/a

OpenSSF Scorecard: Repository posture baseline

Scope: 14%
Depth: 17
Median: 57 ms

Socket: Supply-chain package evidence

Scope: 57%
Depth: 4
Median: 492 ms

Authenticated PURL endpoint was available, but package-depth lookups were rate-limited in this run.

Snyk: Package health and security evidence

Scope: 57%
Depth: 4
Median: 248 ms

Authentication worked, but package-health depth was unavailable on the current token or plan.

Readout

What the run shows

Preflight fit: 95

Coverage-adjusted score across the full agent preflight case set, not only the package ecosystems where a feed is strongest.

Source scope: 7/7

The npm, PyPI, GitHub, Hugging Face model, MCP and known vulnerable package cases all completed in the live run.

Execution preflight: 100

Read-only install-plan output, package-behavior context and execution boundary before workspace writes.

Hosted execution: 0

The benchmark performs no install, clone, artifact unpacking, model execution, paid inference call or workspace write.

Audit

Strict reviewer answer

Product benchmark: usable as a public product benchmark with explicit scope and limits

The case set is intentionally small and maintained by Nipmod, so it should not be sold as independent proof. It is useful for one scoped product question: whether a hosted API returns action-ready agent preflight evidence without writing to a workspace.

Academic benchmark: not sufficient as an academic security benchmark

The sample is small and the weights are authored by Nipmod, so the page does not claim independent proof or malware-free safety.

Measured question: preflight evidence

What can an agent know before installing, pulling, reusing or connecting external code?

Scope

Why these tracks are included

Nipmod

Full agent preflight layer: search, inspect, evidence, warnings and read-only install plan.

Native registries

Source-of-truth metadata baseline from npm, PyPI, GitHub, Hugging Face and MCP.

OSV and deps.dev

Advisory, dependency, provenance and package metadata evidence feeds.

Socket and Snyk

Package security intelligence APIs. This snapshot marks their current API access limits instead of treating limits as product failure.

OpenSSF Scorecard

Repository posture baseline for the GitHub case, not a package install-plan competitor.

Raw agent

Baseline for an agent moving toward install without an independent package intelligence layer.

Project scanners and update bots such as Dependabot, Renovate, npm audit, pip-audit, local Snyk CLI flows and install firewalls are useful, but they are not ranked in this API snapshot because they operate on manifests, local projects or install interception instead of this hosted read-only preflight boundary.

Rubric

How the score is graded

Source resolution

Full: Correct source object, package/repo/model/server identity, version where applicable, useful metadata and source depth.
Partial: Can identify a package/version in one ecosystem but does not cover the rest of the agent source set.
Zero: No source lookup for the case or only a generic page/result unrelated to the expected object.

An agent cannot evaluate safety if it has not resolved the exact upstream object it might use.

Security evidence

Full: Advisories, provenance/source links, repository posture and package-behavior signals when relevant.
Partial: One evidence type only, such as vulnerability lookup without install or behavior context.
Zero: No useful security, provenance or posture evidence for the decision.

Agents need evidence they can show before approval, not only package popularity.

Execution preflight

Full: Structured install plan, read-only hosted boundary, package behavior and prompt/tool boundary context.
Partial: Some package behavior or metadata, but no complete install-plan boundary.
Zero: No description of what would run or whether hosted checks can write/execute.

The dangerous transition is moving from recommendation to execution.

Agent readiness

Full: Action-ready JSON that combines source evidence, warnings, trust context and install boundary.
Partial: Machine-readable API output that still leaves the agent to assemble the decision itself.
Zero: Human-only output or no independent package-intelligence layer.

Agents need structured decisions, not documents they must scrape.

Accounting

How the score is counted

Case set

7 source cases

The case list is fixed for the public snapshot so the page is not cherry-picking a different sample after seeing a result.

Observation unit

provider x case x dimension

Each provider/case observation records concrete dimensions such as identity, version, metadata, advisory, provenance, repository posture, package behavior, install plan, read-only boundary and agent JSON.
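The observation unit can be pictured as a flat grid of records. The sketch below is illustrative only: the track, case and dimension names come from this page, while the `Observation` record shape and the 0.0 to 1.0 value scale are assumptions, not the published snapshot schema.

```python
from dataclasses import dataclass
from itertools import product

# Track, case and dimension names are copied from this page.
TRACKS = ["Nipmod", "Native registries", "OSV", "deps.dev",
          "OpenSSF Scorecard", "Socket", "Snyk", "Raw agent"]
CASES = ["zod@3.25.76", "lodash@4.17.20", "requests@2.32.5", "pydantic@2.11.0",
         "sentence-transformers/all-MiniLM-L6-v2", "ac.tandem/docs-mcp",
         "vercel/next.js"]
DIMENSIONS = ["search", "identity", "version", "metadata", "advisories",
              "provenance", "repository posture", "source depth",
              "package behavior", "prompt boundary", "install plan",
              "read-only boundary", "machine-readable output", "agent JSON",
              "multi-source coverage"]


@dataclass(frozen=True)
class Observation:
    # Hypothetical record shape; the real snapshot schema may differ.
    provider: str
    case: str
    dimension: str
    value: float = 0.0  # assumed scale: 0.0 (absent) to 1.0 (fully present)


def empty_grid():
    """Build one zero-valued observation per provider x case x dimension."""
    return [Observation(p, c, d) for p, c, d in product(TRACKS, CASES, DIMENSIONS)]


grid = empty_grid()
print(len(grid))  # 8 tracks x 7 cases x 15 dimensions = 840
```

A flat grid like this keeps unsupported cells explicit as zeros, which is what lets narrow feeds be counted against the full case set later.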

Status treatment

pass full, warn discounted, fail/skip zero

Pass rows keep their computed score, warn rows are discounted, and fail and skip rows score zero in the coverage-adjusted headline.
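As a rough sketch, the status treatment could be applied per row like this. The page does not state how large the warn discount is, so `WARN_DISCOUNT` below is a hypothetical value chosen only for illustration, and `effective_score` is an assumed helper name.

```python
# Hypothetical discount factor: the page says warn rows are discounted
# but does not publish the factor.
WARN_DISCOUNT = 0.5


def effective_score(status: str, computed: float) -> float:
    """Return the score a row contributes to the coverage-adjusted headline."""
    if status == "pass":
        return computed                  # pass rows keep their computed score
    if status == "warn":
        return computed * WARN_DISCOUNT  # warn rows are discounted
    if status in ("fail", "skip"):
        return 0.0                       # fail and skip rows score zero
    raise ValueError(f"unknown status: {status!r}")


for status in ("pass", "warn", "fail", "skip"):
    print(status, effective_score(status, 80.0))
```

Raising on an unknown status, rather than silently scoring it zero, keeps a typo in the snapshot from looking like a failed check.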

Category score

weighted dimensions across all cases

Each category has explicit weights. Category scores are averaged across all 7 cases, so narrow evidence feeds keep their applicable depth visible but do not receive full-source coverage credit.
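The difference between a coverage-adjusted category score and applicable depth can be shown with a small worked example. This is a sketch under assumptions: the weights are the published execution preflight weights, but the 0.0 to 1.0 observation scale, the function names and the example feed are hypothetical.

```python
# Published execution preflight weights from this page.
EXECUTION_PREFLIGHT_WEIGHTS = {
    "install plan": 32,
    "read-only boundary": 28,
    "prompt boundary": 18,
    "package behavior": 14,
    "agent JSON": 8,
}
TOTAL_CASES = 7


def case_score(weights, observed):
    """Weighted 0-100 score for one case; missing dimensions count as 0."""
    total = sum(weights.values())
    return 100.0 * sum(w * observed.get(d, 0.0) for d, w in weights.items()) / total


def category_score(weights, supported_cases):
    """Average over all 7 cases; unsupported cases contribute zero."""
    return sum(case_score(weights, obs) for obs in supported_cases) / TOTAL_CASES


def applicable_depth(weights, supported_cases):
    """Average over supported cases only (assumed reading of 'applicable depth')."""
    return sum(case_score(weights, obs) for obs in supported_cases) / len(supported_cases)


# Hypothetical narrow feed: package-behavior evidence on 4 of the 7 cases.
narrow_feed = [{"package behavior": 1.0}] * 4

print(applicable_depth(EXECUTION_PREFLIGHT_WEIGHTS, narrow_feed))  # 14.0
print(category_score(EXECUTION_PREFLIGHT_WEIGHTS, narrow_feed))    # 8.0
```

In this example the feed keeps its depth of 14 on the cases it supports, while the category score drops to 8 because three cases are outside its scope.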

Headline score

mean of 4 public categories

The public score is the average of source resolution, security evidence, execution preflight and agent readiness. This is why the page is called an agent preflight benchmark, not a universal security ranking.
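The headline arithmetic is a plain unweighted mean. A minimal sketch follows, assuming category scores on a 0 to 100 scale; the function name is hypothetical and the input numbers are illustrative, not results from the snapshot.

```python
PUBLIC_CATEGORIES = (
    "source resolution",
    "security evidence",
    "execution preflight",
    "agent readiness",
)


def headline_score(category_scores):
    """Unweighted mean of the four public category scores."""
    missing = [c for c in PUBLIC_CATEGORIES if c not in category_scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    return sum(category_scores[c] for c in PUBLIC_CATEGORIES) / len(PUBLIC_CATEGORIES)


# Illustrative inputs only; these are not measured values.
example = {
    "source resolution": 80,
    "security evidence": 60,
    "execution preflight": 40,
    "agent readiness": 20,
}
print(headline_score(example))  # 50.0
```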

Weights

Category weights

Source resolution

source depth 22, identity 18, search 18, multi-source coverage 16, metadata 14, version 12

Security evidence

advisory 24, package behavior 24, provenance 20, repository posture 18, metadata 8, version 6

Execution preflight

install plan 32, read-only boundary 28, prompt boundary 18, package behavior 14, agent JSON 8

Agent readiness

agent JSON 34, install plan 22, prompt boundary 14, read-only boundary 12, source depth 8, machine-readable output 6, identity 4
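The published weights can be sanity-checked with a few lines. The sketch below copies the weights from this section into a dictionary (the variable names are assumptions) and confirms that each category sums to 100 and that the four categories together draw on 15 distinct dimensions.

```python
# Weights copied from this section; the dictionary layout is illustrative.
CATEGORY_WEIGHTS = {
    "source resolution": {
        "source depth": 22, "identity": 18, "search": 18,
        "multi-source coverage": 16, "metadata": 14, "version": 12,
    },
    "security evidence": {
        "advisory": 24, "package behavior": 24, "provenance": 20,
        "repository posture": 18, "metadata": 8, "version": 6,
    },
    "execution preflight": {
        "install plan": 32, "read-only boundary": 28, "prompt boundary": 18,
        "package behavior": 14, "agent JSON": 8,
    },
    "agent readiness": {
        "agent JSON": 34, "install plan": 22, "prompt boundary": 14,
        "read-only boundary": 12, "source depth": 8,
        "machine-readable output": 6, "identity": 4,
    },
}

# Sum of weights per category.
totals = {name: sum(weights.values()) for name, weights in CATEGORY_WEIGHTS.items()}

# Distinct dimensions used anywhere in the rubric.
all_dimensions = set()
for weights in CATEGORY_WEIGHTS.values():
    all_dimensions.update(weights)

print(totals)
print(len(all_dimensions))
```

Some dimensions, such as install plan and agent JSON, carry weight in more than one category, so they influence the headline score more than their weight in any single category suggests.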

Cases

What is tested

TypeScript schema validation

npm
zod@3.25.76

Common npm selection task where an agent asks for a TypeScript schema validation package.

Known vulnerable npm package

npm
lodash@4.17.20

Known vulnerable package/version case to verify advisory context, not just popularity or name resolution.

Python HTTP client

PyPI
requests@2.32.5

Common PyPI selection task where an agent asks for a Python HTTP client.

Python schema validation

PyPI
pydantic@2.11.0

PyPI schema-validation task to test source normalization across ecosystems.

Embedding model

Hugging Face model
sentence-transformers/all-MiniLM-L6-v2

Model reuse case where package-style safety needs file shape, license, card and model metadata.

MCP docs server

MCP
ac.tandem/docs-mcp

MCP server discovery case where agents need tool metadata, repository links and install/use boundaries.

GitHub repository security posture

GitHub
vercel/next.js

Repository posture case where an agent may reuse a GitHub project rather than install a registry package.
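The scope percentages in the comparison follow from how many of these seven cases a track can address. The page does not list which cases each track supports, so the mappings below are assumptions for illustration: a feed covering only the npm and PyPI cases reaches 4 of 7, and a feed covering only the GitHub case reaches 1 of 7.

```python
# (source, object) pairs copied from the case list above.
CASES = [
    ("npm", "zod@3.25.76"),
    ("npm", "lodash@4.17.20"),
    ("PyPI", "requests@2.32.5"),
    ("PyPI", "pydantic@2.11.0"),
    ("Hugging Face model", "sentence-transformers/all-MiniLM-L6-v2"),
    ("MCP", "ac.tandem/docs-mcp"),
    ("GitHub", "vercel/next.js"),
]


def scope_percent(supported_sources):
    """Share of the fixed case set a track can address, as a whole percent."""
    covered = sum(1 for source, _ in CASES if source in supported_sources)
    return round(100 * covered / len(CASES))


# Assumed source coverage, not taken from the snapshot.
print(scope_percent({"npm", "PyPI"}))  # 4 of 7 cases
print(scope_percent({"GitHub"}))       # 1 of 7 cases
print(scope_percent({"npm", "PyPI", "Hugging Face model", "MCP", "GitHub"}))
```

Under these assumptions the results line up with the 57%, 14% and 100% scope figures shown in the comparison.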

Separated scoring

Controls

Fairness controls

The benchmark has one narrow question: what evidence is available before an agent moves toward external code execution.

Tracks are described by role, so OSV, deps.dev, Socket, Snyk, OpenSSF Scorecard and native registries are not presented as failed versions of Nipmod.

The headline score is coverage-adjusted across all cases, while applicable depth remains visible for specialized feeds.

Token, rate-limit and plan limitations are marked as limitations in the snapshot instead of being hidden.

No package install, clone, artifact unpacking, model execution, paid inference call or workspace write is performed.

Machine-readable JSON is published so the public page can be checked against the raw snapshot.

Excluded

What is not ranked here

Dependabot and Renovate

They operate mainly on existing repositories, manifests and update pull requests. They belong in a project-maintenance benchmark, not this hosted preflight API benchmark.

npm audit and pip-audit

They analyze dependency trees or package advisories from a local project or manifest context. This benchmark does not install packages or inspect a user workspace.

Snyk CLI and SCM integrations

They can be stronger than Snyk API-only checks for project snapshots, but they require local code, manifests or repository integration. This run intentionally measures hosted read-only API preflight only.

Install firewalls and sandbox execution systems

They operate at runtime or install interception. Nipmod's hosted API is deliberately before that boundary and does not execute or unpack artifacts.

Limits

Known limitations

The benchmark has 7 public cases. It is a focused preflight benchmark, not a full registry-wide or malware-corpus evaluation.

Weights are authored by Nipmod and should be reviewed by outside maintainers before being treated as independent proof.

Socket and Snyk API tracks were limited by the token, plan or rate limits available in this run; direct claims against those products should not be made from this snapshot.

Local project scanners, CLI tools, SCM integrations and install-time firewalls are excluded because they require manifests, repositories, local code or runtime interception.

The hosted API does not execute code, unpack artifacts or clone repositories, so this benchmark does not measure sandbox malware detection.

A high score means stronger preflight evidence at this boundary. It is not a guarantee that a package, model, repository or MCP server is safe.

Definition

What is measured

7 cases

npm package selection, known vulnerable npm package, PyPI package selection, Python schema package, Hugging Face model, MCP server and GitHub repository posture.

15 dimensions

Search, identity, version, metadata, advisories, provenance, repository posture, source depth, package behavior, prompt boundary, install plan, read-only boundary, machine-readable output, agent JSON and multi-source coverage.

4 public categories

Source resolution, security evidence, execution preflight and agent readiness.

8 tracks

Nipmod, native registries, OSV, deps.dev, OpenSSF Scorecard, Socket, Snyk and a raw agent baseline.

Scope-adjusted score

Unsupported source cases count as scope limits in the headline score. Applicable depth is still shown separately.

0 execution

No package install, repository clone, artifact unpacking, model execution, paid inference call or workspace write is performed.

Result

Measured result

Nipmod score: 95/100

7/7 live checks, 7/7 install-plan evidence, 0 warnings. Applicable depth score 90/100.

Limited tracks: 3

Raw agent, Socket and Snyk were authenticated or reachable, but marked limited in this snapshot.

Hosted writes: 0

No package install, repository clone, artifact unpacking, model execution or workspace write is performed.

Numbers

Measured facts

95/100

Nipmod score in the current production benchmark snapshot.

7/7

Live source cases completed by the Nipmod track.

100%

Nipmod source coverage across the full benchmark case set.

90/100

Nipmod applicable depth score before source-coverage adjustment.

7/7

Read-only install-plan evidence returned by the Nipmod track.

0

Nipmod track warnings in the latest snapshot.

0

Hosted installs, repository clones, artifact unpacking, code execution or workspace writes performed by benchmark requests.

Claims

What this page must not claim

Nipmod is safer than every competitor.

Nipmod guarantees package safety.

Nipmod replaces OSV, deps.dev, Socket, Snyk, OpenSSF Scorecard or native registries.

Socket or Snyk were beaten on their full paid/local products in this snapshot.

Method

How to rerun it

pnpm benchmark:competitive

Last public snapshot: May 27, 2026, 01:09 PM UTC. Machine report: /benchmark.json. Full methodology: docs/competitive-benchmark.md.

References

External reference points

OSV

Official API docs for vulnerability queries by package version or commit hash.

deps.dev

Official API docs for package versions, dependencies, licenses and advisories.

Socket

Official PURL API docs for package metadata and alerts.

Snyk

Official package API docs and package-health endpoint boundary.

OpenSSF Scorecard

Official project description for repository security posture scoring.

npm audit

Official npm audit docs for dependency-tree advisory checks.