Benchmark
Agent preflight benchmark.
A coverage-adjusted benchmark for the evidence an agent receives before installing, pulling or reusing external code.
- Nipmod live: 7/7
- Install plans: 7/7
- Score: 95
- Median latency: 2177 ms
Comparison
Agent preflight benchmark
The run measures one specific moment: before an agent installs a package, pulls a model, reuses a repository or connects a tool. It does not rank security companies on a generic score.
Each track is measured by what it exposes to that preflight decision. Specialized feeds keep their own scope visible, so advisory databases and repository scanners are not treated as full package-intelligence layers.
- Scope: 100%, Depth: 90, Median: 2177 ms
- Scope: 100%, Depth: 27, Median: 171 ms
- Scope: 57%, Depth: 35, Median: 44 ms
- Scope: 57%, Depth: 24, Median: 359 ms
- Scope: 100%, Depth: 4, Median: n/a
- Scope: 14%, Depth: 17, Median: 57 ms
- Scope: 57%, Depth: 4, Median: 492 ms
  Authenticated PURL endpoint was available, but package-depth lookups were rate-limited in this run.
- Scope: 57%, Depth: 4, Median: 248 ms
  Authentication worked, but package-health depth was unavailable on the current token or plan.
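The Scope percentages above are consistent with counting covered cases out of the seven-case set (4 of 7 is about 57%, 1 of 7 about 14%). The sketch below shows that arithmetic, plus one possible way to scale a depth score by coverage. Both the "covered cases" reading and the `coverage_adjusted_depth` function are assumptions for illustration, not the benchmark's published formula.

```python
# Illustrative sketch only: the "covered cases out of seven" reading of Scope
# is an assumption, not the benchmark's documented formula.
TOTAL_CASES = 7  # the case set lists seven targets


def scope_percent(covered_cases: int, total_cases: int = TOTAL_CASES) -> int:
    """Share of the case set a track can answer at all, as a rounded percent."""
    if not 0 <= covered_cases <= total_cases:
        raise ValueError("covered_cases must be between 0 and total_cases")
    return round(100 * covered_cases / total_cases)


def coverage_adjusted_depth(depth: float, covered_cases: int) -> float:
    """Hypothetical adjustment: scale depth by the fraction of cases covered."""
    return depth * covered_cases / TOTAL_CASES


print(scope_percent(7), scope_percent(4), scope_percent(1))  # 100 57 14
print(coverage_adjusted_depth(35, 4))  # 20.0
```

Under this reading, a track with a respectable depth on four cases still ends up well below a track that answers all seven, which is the point of reporting scope next to depth.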
Readout
What the run shows
Coverage-adjusted score across the full agent preflight case set, not only the package ecosystems where a feed is strongest.
The npm, PyPI, GitHub, Hugging Face model, MCP and known-vulnerable-package cases all completed in the live run.
Read-only install-plan output, package-behavior context and execution boundary before workspace writes.
The benchmark performs no install, clone, artifact unpacking, model execution, paid inference call or workspace write.
Audit
Strict reviewer answer
The case set is intentionally small and maintained by Nipmod, so it should not be sold as independent proof. It is useful for one scoped product question: whether a hosted API returns action-ready agent preflight evidence without writing to a workspace.
The weights are also authored by Nipmod, and the page makes no claim of malware-free safety.
What can an agent know before installing, pulling, reusing or connecting external code?
Scope
Why these tracks are included
Project scanners and update bots such as Dependabot, Renovate, npm audit, pip-audit, local Snyk CLI flows and install firewalls are useful. They are not ranked in this API snapshot because they operate on manifests, local projects or install interception rather than at this hosted read-only preflight boundary.
Rubric
How the score is graded
Source resolution
Partial: Can identify a package/version in one ecosystem but does not cover the rest of the agent source set.
Zero: No source lookup for the case, or only a generic page/result unrelated to the expected object.
Security evidence
Partial: One evidence type only, such as vulnerability lookup without install or behavior context.
Zero: No useful security, provenance or posture evidence for the decision.
Execution preflight
Partial: Some package behavior or metadata, but no complete install-plan boundary.
Zero: No description of what would run or whether hosted checks can write/execute.
Agent readiness
Partial: Machine-readable API output that still leaves the agent to assemble the decision itself.
Zero: Human-only output or no independent package-intelligence layer.
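The rubric levels above could be turned into a per-category score along the following lines. This is a minimal sketch: the `FULL` level, the point values and the averaging rule are all assumptions, since the page text only spells out the Partial and Zero levels and does not state how grades are converted to points.

```python
from enum import Enum


class Grade(Enum):
    # Point values are hypothetical; the page does not state them.
    FULL = 1.0      # assumed top level, not spelled out in the rubric text
    PARTIAL = 0.5
    ZERO = 0.0


def category_score(grades: list[Grade]) -> float:
    """Average the per-case grades for one category and scale to 0-100."""
    if not grades:
        raise ValueError("at least one graded case is required")
    return 100 * sum(g.value for g in grades) / len(grades)


# Seven cases: five full, one partial, one zero (made-up grades for illustration).
example = [Grade.FULL] * 5 + [Grade.PARTIAL, Grade.ZERO]
print(round(category_score(example), 1))  # 78.6
```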
Cases
What is tested
zod@3.25.76
lodash@4.17.20
requests@2.32.5
pydantic@2.11.0
sentence-transformers/all-MiniLM-L6-v2
ac.tandem/docs-mcp
vercel/next.js
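For readers who want to handle the case set programmatically, the seven targets above can be held in a small typed list. This is a sketch under assumptions: the `Case` type and `parse_case` helper are hypothetical, and the source label attached to each target is inferred from the ecosystems named in the readout, not stated per case on the page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Case:
    source: str             # ecosystem label; inferred, not stated per case
    name: str
    version: Optional[str]  # None when the target is not pinned to a version


def parse_case(source: str, target: str) -> Case:
    """Split 'name@version' targets; leave unversioned targets whole."""
    name, sep, version = target.rpartition("@")
    if sep and name:
        return Case(source, name, version)
    return Case(source, target, None)


CASES = [
    parse_case("npm", "zod@3.25.76"),
    parse_case("npm", "lodash@4.17.20"),
    parse_case("pypi", "requests@2.32.5"),
    parse_case("pypi", "pydantic@2.11.0"),
    parse_case("huggingface", "sentence-transformers/all-MiniLM-L6-v2"),
    parse_case("mcp", "ac.tandem/docs-mcp"),
    parse_case("github", "vercel/next.js"),
]

for case in CASES:
    print(case.source, case.name, case.version)
```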
Categories
Separated scoring
Source resolution
Can the system resolve the right upstream object and return enough source context before an agent moves toward install?
- Leader: Nipmod
- Top score: 97/100
- Margin: +47
Security evidence
Can the system return security evidence beyond a name match: advisories, provenance, repository posture and package behavior?
- Leader: Nipmod
- Top score: 83/100
- Margin: +50
Execution preflight
Can the system describe what would run, keep hosted checks read-only and expose the execution boundary before workspace writes?
- Leader: Nipmod
- Top score: 100/100
- Margin: +98
Agent readiness
Can an agent consume the result as an action-ready decision object, not just a generic API response or human page?
- Leader: Nipmod
- Top score: 100/100
- Margin: +90
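The four category results can be combined with a weighted average. The sketch below uses equal weights, which is an assumption because the category weights are not listed in the page text; with equal weights the four top scores average to 95, matching the headline score, though that agreement does not confirm the actual weighting. It also derives the implied next-best score per category, assuming "Margin" means the lead over the next-best track.

```python
# Top scores and margins as published in the four category cards.
CATEGORIES = {
    "Source resolution": (97, 47),
    "Security evidence": (83, 50),
    "Execution preflight": (100, 98),
    "Agent readiness": (100, 90),
}

# Equal weights are an assumption; the page does not list the category weights.
weights = {name: 0.25 for name in CATEGORIES}

overall = sum(weights[name] * top for name, (top, _) in CATEGORIES.items())
print(overall)  # 95.0

# If "Margin" is the lead over the next-best track, that track scored top - margin.
runner_up = {name: top - margin for name, (top, margin) in CATEGORIES.items()}
print(runner_up)
```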
Result
Measured result
7/7 live checks, 7/7 install-plan evidence, 0 warnings. Applicable depth score 90/100.
The raw agent, Socket and Snyk tracks were authenticated or reachable but are marked as limited in this snapshot.
No package install, repository clone, artifact unpacking, model execution or workspace write is performed.
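An agent consuming a preflight result could enforce the read-only guarantee described above before acting on it. The following is a minimal sketch: `PreflightDecision`, its fields and the action labels are hypothetical names chosen for illustration and do not describe Nipmod's actual response schema.

```python
from dataclasses import dataclass, field

# Actions the page says the benchmark never performs (labels are hypothetical).
FORBIDDEN_ACTIONS = frozenset({
    "install", "clone", "unpack_artifact", "execute_model", "workspace_write",
})


@dataclass(frozen=True)
class PreflightDecision:
    # Hypothetical shape; not Nipmod's actual response schema.
    target: str
    verdict: str                     # e.g. "proceed", "review", "block"
    install_plan: tuple[str, ...]    # what would run, described but not executed
    actions_performed: tuple[str, ...] = field(default=())


def assert_read_only(decision: PreflightDecision) -> None:
    """Raise if the check itself reports performing a forbidden action."""
    violations = FORBIDDEN_ACTIONS.intersection(decision.actions_performed)
    if violations:
        raise RuntimeError(f"preflight was not read-only: {sorted(violations)}")


decision = PreflightDecision(
    target="zod@3.25.76",
    verdict="proceed",
    install_plan=("npm install zod@3.25.76",),
)
assert_read_only(decision)  # passes: no actions were performed
```

Keeping the object frozen means the agent cannot mutate the evidence after the check, so the decision it logs is the decision it received.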
Method
How to rerun it
pnpm benchmark:competitive
Last public snapshot: . Machine report: /benchmark.json. Full methodology: docs/competitive-benchmark.md.