Benchmark

Agent preflight benchmark.

A coverage-adjusted benchmark for the evidence an agent receives before installing, pulling or reusing external code.

Nipmod live: 7/7
Install plans: 7/7
Score: 95
Median latency: 2177 ms

Comparison

Agent preflight benchmark

The run measures one specific moment: before an agent installs a package, pulls a model, reuses a repository or connects a tool. It does not rank security companies on a generic score.

Each track is measured by what it exposes to that preflight decision. Specialized feeds keep their own scope visible, so advisory databases and repository scanners are not treated as full package-intelligence layers.

| Rank | Track | Preflight fit | Scope | Depth | Median latency |
|---|---|---|---|---|---|
| 1 | Nipmod (Agent preflight layer) | 95 | 100% | 90 | 2177 ms |
| 2 | Native registries (Upstream source metadata) | 22 | 100% | 27 | 171 ms |
| 3 | deps.dev (Package metadata and advisory evidence) | 16 | 57% | 35 | 44 ms |
| 4 | OSV (Vulnerability evidence feed) | 10 | 57% | 24 | 359 ms |
| 5 | Raw agent (No independent package intelligence layer) | 3 | 100% | 4 | n/a |
| 6 | OpenSSF Scorecard (Repository posture baseline) | 2 | 14% | 17 | 57 ms |
| 7 | Socket (Supply-chain package evidence) | 1 | 57% | 4 | 492 ms |
| 8 | Snyk (Package health and security evidence) | 1 | 57% | 4 | 248 ms |

Socket: Authenticated PURL endpoint was available, but package-depth lookups were rate-limited in this run.

Snyk: Authentication worked, but package-health depth was unavailable on the current token or plan.

Readout

What the run shows

Preflight fit: 95

Coverage-adjusted score across the full agent preflight case set, not only the package ecosystems where a feed is strongest.

Source scope: 7/7

The npm, PyPI, GitHub, Hugging Face model, MCP and known vulnerable package cases all completed in the live run.

Execution preflight: 100

Read-only install-plan output, package-behavior context and execution boundary before workspace writes.

Hosted execution: 0

The benchmark performs no install, clone, artifact unpacking, model execution, paid inference call or workspace write.

Audit

Strict reviewer answer

Product benchmark: usable as a public product benchmark with explicit scope and limits

The case set is intentionally small and maintained by Nipmod, so it should not be sold as independent proof. It is useful for one scoped product question: whether a hosted API returns action-ready agent preflight evidence without writing to a workspace.

Academic benchmark: not sufficient as an academic security benchmark

The sample is small and the weights are authored by Nipmod, so the page does not claim independent proof or malware-free safety.

Measured question: preflight evidence

What can an agent know before installing, pulling, reusing or connecting external code?

Scope

Why these tracks are included

Nipmod
Full agent preflight layer: search, inspect, evidence, warnings and read-only install plan.
Native registries
Source-of-truth metadata baseline from npm, PyPI, GitHub, Hugging Face and MCP.
OSV and deps.dev
Advisory, dependency, provenance and package metadata evidence feeds.
Socket and Snyk
Package security intelligence APIs. This snapshot marks their current API access limits instead of treating limits as product failure.
OpenSSF Scorecard
Repository posture baseline for the GitHub case, not a package install-plan competitor.
Raw agent
Baseline for an agent moving toward install without an independent package intelligence layer.

Project scanners and update bots such as Dependabot, Renovate, npm audit, pip-audit, local Snyk CLI flows and install firewalls are useful, but they are not ranked in this API snapshot because they operate on manifests, local projects or install interception instead of this hosted read-only preflight boundary.

Rubric

How the score is graded

Source resolution
Full: Correct source object, package/repo/model/server identity, version where applicable, useful metadata and source depth.
Partial: Can identify a package/version in one ecosystem but does not cover the rest of the agent source set.
Zero: No source lookup for the case or only a generic page/result unrelated to the expected object.
An agent cannot evaluate safety if it has not resolved the exact upstream object it might use.
Security evidence
Full: Advisories, provenance/source links, repository posture and package-behavior signals when relevant.
Partial: One evidence type only, such as vulnerability lookup without install or behavior context.
Zero: No useful security, provenance or posture evidence for the decision.
Agents need evidence they can show before approval, not only package popularity.
Execution preflight
Full: Structured install plan, read-only hosted boundary, package behavior and prompt/tool boundary context.
Partial: Some package behavior or metadata, but no complete install-plan boundary.
Zero: No description of what would run or whether hosted checks can write/execute.
The dangerous transition is moving from recommendation to execution.
Agent readiness
Full: Action-ready JSON that combines source evidence, warnings, trust context and install boundary.
Partial: Machine-readable API output that still leaves the agent to assemble the decision itself.
Zero: Human-only output or no independent package-intelligence layer.
Agents need structured decisions, not documents they must scrape.

Accounting

How the score is counted

Case set
7 source cases
The case list is fixed for the public snapshot so the page is not cherry-picking a different sample after seeing a result.
Observation unit
provider × case × dimension
Each provider/case observation records concrete dimensions such as identity, version, metadata, advisory, provenance, repository posture, package behavior, install plan, read-only boundary and agent JSON.
Status treatment
pass full, warn discounted, fail/skip zero
Pass rows keep their computed score, warn rows are discounted, and fail and skip rows score zero in the coverage-adjusted headline.
Category score
weighted dimensions across all cases
Each category has explicit weights. Category scores are averaged across all 7 cases, so narrow evidence feeds keep their applicable depth visible but do not receive full-source coverage credit.
Headline score
mean of 4 public categories
The public score is the average of source resolution, security evidence, execution preflight and agent readiness. This is why the page is called an agent preflight benchmark, not a universal security ranking.
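
The accounting rules above can be sketched in a few lines. This is a minimal illustration, not Nipmod's implementation: the warn discount of 0.5, the function names and the sample rows are assumptions. The page states only that warn rows are discounted, that fail and skip rows score zero, that category scores are averaged across all 7 cases, and that the headline is the mean of the four public categories.

```python
from statistics import mean

# Assumed multipliers: the page says warn rows are "discounted" but does not
# publish the factor, so 0.5 is a placeholder for illustration only.
STATUS_MULTIPLIER = {"pass": 1.0, "warn": 0.5, "fail": 0.0, "skip": 0.0}

TOTAL_CASES = 7  # fixed public case set


def adjusted_case_score(raw_score: float, status: str) -> float:
    """Apply the status treatment to one provider/case score (0-100)."""
    return raw_score * STATUS_MULTIPLIER[status]


def category_score(rows: list[tuple[float, str]]) -> float:
    """Average over all 7 cases; cases without a row contribute zero."""
    if len(rows) > TOTAL_CASES:
        raise ValueError("more rows than cases in the fixed case set")
    total = sum(adjusted_case_score(score, status) for score, status in rows)
    return total / TOTAL_CASES


def headline_score(categories: dict[str, float]) -> float:
    """Mean of the four public categories."""
    return mean(categories.values())


# Hypothetical narrow feed: rows for 4 of the 7 cases, none for the other 3.
narrow_feed_rows = [(80.0, "pass"), (80.0, "pass"), (80.0, "warn"), (80.0, "skip")]
print(round(category_score(narrow_feed_rows), 1))  # 28.6

# Published Nipmod category scores from this page.
nipmod = {
    "source_resolution": 97,
    "security_evidence": 83,
    "execution_preflight": 100,
    "agent_readiness": 100,
}
print(headline_score(nipmod))  # 95
```

Dividing by the full case count rather than by the number of rows is what keeps a narrow feed from receiving full-source coverage credit.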

Weights

Category weights

Source resolution
source depth 22, identity 18, search 18, multi-source coverage 16, metadata 14, version 12
Security evidence
advisory 24, package behavior 24, provenance 20, repository posture 18, metadata 8, version 6
Execution preflight
install plan 32, read-only boundary 28, prompt boundary 18, package behavior 14, agent JSON 8
Agent readiness
agent JSON 34, install plan 22, prompt boundary 14, read-only boundary 12, source depth 8, machine-readable output 6, identity 4
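
As a quick check, the published weights in each category sum to 100. The sketch below encodes them and computes a weighted category score for one observation. It assumes each dimension is graded as a fraction between 0 and 1 and that missing dimensions count as zero; the page does not specify the per-dimension scale, and the identifier names and the sample feed are hypothetical.

```python
# Weights copied from the list above; keys are illustrative identifiers.
CATEGORY_WEIGHTS = {
    "source_resolution": {
        "source_depth": 22, "identity": 18, "search": 18,
        "multi_source_coverage": 16, "metadata": 14, "version": 12,
    },
    "security_evidence": {
        "advisory": 24, "package_behavior": 24, "provenance": 20,
        "repository_posture": 18, "metadata": 8, "version": 6,
    },
    "execution_preflight": {
        "install_plan": 32, "read_only_boundary": 28, "prompt_boundary": 18,
        "package_behavior": 14, "agent_json": 8,
    },
    "agent_readiness": {
        "agent_json": 34, "install_plan": 22, "prompt_boundary": 14,
        "read_only_boundary": 12, "source_depth": 8,
        "machine_readable_output": 6, "identity": 4,
    },
}


def weighted_category_score(category: str, dimension_scores: dict[str, float]) -> float:
    """Sum weight * fraction achieved; dimensions not reported count as zero.

    Assumption: each dimension score is a fraction in [0, 1].
    """
    weights = CATEGORY_WEIGHTS[category]
    return sum(w * dimension_scores.get(dim, 0.0) for dim, w in weights.items())


# Every category's weights add up to 100 points.
for name, weights in CATEGORY_WEIGHTS.items():
    assert sum(weights.values()) == 100, name

# Hypothetical advisory-only feed: full advisory and version evidence,
# half credit for metadata, nothing else.
advisory_only = {"advisory": 1.0, "version": 1.0, "metadata": 0.5}
print(weighted_category_score("security_evidence", advisory_only))  # 34.0
```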

Cases

What is tested

TypeScript schema validation
npm
zod@3.25.76
Common npm selection task where an agent asks for a TypeScript schema validation package.
Known vulnerable npm package
npm
lodash@4.17.20
Known vulnerable package/version case to verify advisory context, not just popularity or name resolution.
Python HTTP client
PyPI
requests@2.32.5
Common PyPI selection task where an agent asks for a Python HTTP client.
Python schema validation
PyPI
pydantic@2.11.0
PyPI schema-validation task to test source normalization across ecosystems.
Embedding model
Hugging Face model
sentence-transformers/all-MiniLM-L6-v2
Model reuse case where package-style safety needs file shape, license, card and model metadata.
MCP docs server
MCP
ac.tandem/docs-mcp
MCP server discovery case where agents need tool metadata, repository links and install/use boundaries.
GitHub repository security posture
GitHub
vercel/next.js
Repository posture case where an agent may reuse a GitHub project rather than install a registry package.
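
One way to keep a case set fixed between runs is to declare it as immutable data. The sketch below is illustrative only: the `Case` type and its field names are assumptions, while the titles, sources and targets are copied from the list above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One benchmark case; field names are hypothetical."""
    title: str
    source: str
    target: str


# A tuple of frozen records cannot be edited in place after a run.
CASES = (
    Case("TypeScript schema validation", "npm", "zod@3.25.76"),
    Case("Known vulnerable npm package", "npm", "lodash@4.17.20"),
    Case("Python HTTP client", "PyPI", "requests@2.32.5"),
    Case("Python schema validation", "PyPI", "pydantic@2.11.0"),
    Case("Embedding model", "Hugging Face model",
         "sentence-transformers/all-MiniLM-L6-v2"),
    Case("MCP docs server", "MCP", "ac.tandem/docs-mcp"),
    Case("GitHub repository security posture", "GitHub", "vercel/next.js"),
)

by_source = Counter(case.source for case in CASES)
print(len(CASES))       # 7
print(dict(by_source))
```

Four of the seven cases are registry packages (npm and PyPI); the remaining three cover a model, an MCP server and a repository.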

Categories

Separated scoring

search, identity, version, metadata, source depth, multi-source scope

Source resolution

Can the system resolve the right upstream object and return enough source context before an agent moves toward install?

Leader: Nipmod
Top score: 97/100
Margin: +47
Nipmod: 97 (100% scope)
Native registries: 50 (100% scope)
deps.dev: 25 (57% scope)
OSV: 17 (57% scope)
Raw agent: 13 (100% scope)
OpenSSF Scorecard: 3 (14% scope)
Socket: 0 (57% scope)
Snyk: 0 (57% scope)
advisories, provenance, repository posture, package behavior

Security evidence

Can the system return security evidence beyond a name match: advisories, provenance, repository posture and package behavior?

Leader: Nipmod
Top score: 83/100
Margin: +50
Nipmod: 83 (100% scope)
deps.dev: 33 (57% scope)
Native registries: 26 (100% scope)
OSV: 17 (57% scope)
OpenSSF Scorecard: 3 (14% scope)
Raw agent: 0 (100% scope)
Socket: 0 (57% scope)
Snyk: 0 (57% scope)
install plan, read-only boundary, package behavior, prompt boundary

Execution preflight

Can the system describe what would run, keep hosted checks read-only and expose the execution boundary before workspace writes?

Leader: Nipmod
Top score: 100/100
Margin: +98
Nipmod: 100 (100% scope)
Native registries: 2 (100% scope)
Raw agent: 0 (100% scope)
deps.dev: 0 (57% scope)
OSV: 0 (57% scope)
Socket: 0 (57% scope)
Snyk: 0 (57% scope)
OpenSSF Scorecard: 0 (14% scope)
agent decision JSON, install boundary, source evidence, machine output

Agent readiness

Can an agent consume the result as an action-ready decision object, not just a generic API response or human page?

Leader: Nipmod
Top score: 100/100
Margin: +90
Nipmod: 100 (100% scope)
Native registries: 10 (100% scope)
deps.dev: 6 (57% scope)
OSV: 6 (57% scope)
Socket: 2 (57% scope)
Snyk: 2 (57% scope)
OpenSSF Scorecard: 1 (14% scope)
Raw agent: 0 (100% scope)
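
The four category tables can be cross-checked against the headline ranking. The sketch below recomputes each track's headline as the mean of its four published category scores and each category's margin as the leader minus the runner-up. Rounding half up is an assumption, and because the published category scores are themselves rounded, agreement with the ranking is a consistency check rather than a reproduction of the scoring pipeline.

```python
import math

# Published category scores, in the order:
# source resolution, security evidence, execution preflight, agent readiness.
CATEGORY_SCORES = {
    "Nipmod": (97, 83, 100, 100),
    "Native registries": (50, 26, 2, 10),
    "deps.dev": (25, 33, 0, 6),
    "OSV": (17, 17, 0, 6),
    "Raw agent": (13, 0, 0, 0),
    "OpenSSF Scorecard": (3, 3, 0, 1),
    "Socket": (0, 0, 0, 2),
    "Snyk": (0, 0, 0, 2),
}


def round_half_up(value: float) -> int:
    """Round to the nearest integer, with .5 going up (assumed rule)."""
    return math.floor(value + 0.5)


def headline(scores: tuple[int, ...]) -> int:
    """Mean of the four category scores, rounded for display."""
    return round_half_up(sum(scores) / len(scores))


def margin(category_index: int) -> int:
    """Leader's score minus the runner-up's score in one category."""
    ranked = sorted(
        (scores[category_index] for scores in CATEGORY_SCORES.values()),
        reverse=True,
    )
    return ranked[0] - ranked[1]


headlines = {track: headline(scores) for track, scores in CATEGORY_SCORES.items()}
margins = [margin(i) for i in range(4)]

print(headlines)
print(margins)  # [47, 50, 98, 90]
```

Under these assumptions the recomputed headlines are 95, 22, 16, 10, 3, 2, 1 and 1, which match the preflight fit column in the comparison.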

Controls

Fairness controls

1. The benchmark has one narrow question: what evidence is available before an agent moves toward external code execution.
2. Tracks are described by role, so OSV, deps.dev, Socket, Snyk, OpenSSF Scorecard and native registries are not presented as failed versions of Nipmod.
3. The headline score is coverage-adjusted across all cases, while applicable depth remains visible for specialized feeds.
4. Token, rate-limit and plan limitations are marked as limitations in the snapshot instead of being hidden.
5. No package install, clone, artifact unpacking, model execution, paid inference call or workspace write is performed.
6. Machine-readable JSON is published so the public page can be checked against the raw snapshot.

Excluded

What is not ranked here

Dependabot and Renovate
They operate mainly on existing repositories, manifests and update pull requests. They belong in a project-maintenance benchmark, not this hosted preflight API benchmark.
npm audit and pip-audit
They analyze dependency trees or package advisories from a local project or manifest context. This benchmark does not install packages or inspect a user workspace.
Snyk CLI and SCM integrations
They can be stronger than Snyk API-only checks for project snapshots, but they require local code, manifests or repository integration. This run intentionally measures hosted read-only API preflight only.
Install firewalls and sandbox execution systems
They operate at runtime or install interception. Nipmod's hosted API is deliberately before that boundary and does not execute or unpack artifacts.

Limits

Known limitations

1. The benchmark has 7 public cases. It is a focused preflight benchmark, not a full registry-wide or malware-corpus evaluation.
2. Weights are authored by Nipmod and should be reviewed by outside maintainers before being treated as independent proof.
3. Socket and Snyk API tracks were limited by the token, plan or rate limits available in this run; direct claims against those products should not be made from this snapshot.
4. Local project scanners, CLI tools, SCM integrations and install-time firewalls are excluded because they require manifests, repositories, local code or runtime interception.
5. The hosted API does not execute code, unpack artifacts or clone repositories, so this benchmark does not measure sandbox malware detection.
6. A high score means stronger preflight evidence at this boundary. It is not a guarantee that a package, model, repository or MCP server is safe.

Definition

What is measured

7 cases
npm package selection, known vulnerable npm package, PyPI package selection, Python schema package, Hugging Face model, MCP server and GitHub repository posture.
15 dimensions
Search, identity, version, metadata, advisories, provenance, repository posture, source depth, package behavior, prompt boundary, install plan, read-only boundary, machine-readable output, agent JSON and multi-source coverage.
4 public categories
Source resolution, security evidence, execution preflight and agent readiness.
8 tracks
Nipmod, native registries, OSV, deps.dev, OpenSSF Scorecard, Socket, Snyk and a raw agent baseline.
Scope-adjusted score
Unsupported source cases count as scope limits in the headline score. Applicable depth is still shown separately.
0 execution
No package install, repository clone, artifact unpacking, model execution, paid inference call or workspace write is performed.
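
The scope percentages shown throughout the page are consistent with simple case counts over the 7-case set. The sketch below shows that arithmetic. The supported-case counts of 4 and 1 are inferred from the displayed 57% and 14% (the page describes Scorecard as covering the GitHub case); they are not published per-track coverage lists, and the function name is hypothetical.

```python
TOTAL_CASES = 7


def scope_percent(supported_cases: int, total_cases: int = TOTAL_CASES) -> int:
    """Share of the fixed case set a track can address, as a whole percent."""
    if not 0 <= supported_cases <= total_cases:
        raise ValueError("supported_cases must be between 0 and total_cases")
    return round(100 * supported_cases / total_cases)


for supported in (7, 4, 1):
    print(f"{supported}/{TOTAL_CASES} cases -> {scope_percent(supported)}% scope")
# 7/7 cases -> 100% scope
# 4/7 cases -> 57% scope
# 1/7 cases -> 14% scope
```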

Result

Measured result

Nipmod score: 95/100

7/7 live checks, 7/7 install-plan evidence, 0 warnings. Applicable depth score 90/100.

Limited tracks: 3

The Raw agent, Socket and Snyk tracks were authenticated or reachable, but are marked limited in this snapshot.

Hosted writes: 0

No package install, repository clone, artifact unpacking, model execution or workspace write is performed.

Numbers

Measured facts

95/100
Nipmod score in the current production benchmark snapshot.
7/7
Live source cases completed by the Nipmod track.
100%
Nipmod source coverage across the full benchmark case set.
90/100
Nipmod applicable depth score before source-coverage adjustment.
7/7
Read-only install-plan evidence returned by the Nipmod track.
0
Nipmod track warnings in the latest snapshot.
0
Hosted installs, repository clones, artifact unpacking, code execution or workspace writes performed by benchmark requests.

Claims

What this page must not claim

1. Nipmod is safer than every competitor.
2. Nipmod guarantees package safety.
3. Nipmod replaces OSV, deps.dev, Socket, Snyk, OpenSSF Scorecard or native registries.
4. Socket or Snyk were beaten on their full paid/local products in this snapshot.

Method

How to rerun it

pnpm benchmark:competitive

Last public snapshot: . Machine report: /benchmark.json. Full methodology: docs/competitive-benchmark.md.

References

External reference points