{"categoryBreakdown":[{"description":"Can the system resolve the right upstream object and return enough source context before an agent moves toward install?","dimensions":"search, identity, version, metadata, source depth, multi-source scope","key":"source-resolution","rubric":"Full credit requires a resolved source object with identity, version where applicable, useful metadata, source depth and broad source coverage. Advisory-only feeds receive credit only when they can identify the package/version they are asked about.","title":"Source resolution","tracks":[{"name":"Nipmod","score":97,"sourceCoveragePct":100},{"name":"Native registries","score":50,"sourceCoveragePct":100},{"name":"deps.dev","score":25,"sourceCoveragePct":57},{"name":"OSV","score":17,"sourceCoveragePct":57},{"name":"Raw agent","score":13,"sourceCoveragePct":100},{"name":"OpenSSF Scorecard","score":3,"sourceCoveragePct":14},{"name":"Socket","score":0,"sourceCoveragePct":57},{"name":"Snyk","score":0,"sourceCoveragePct":57}]},{"description":"Can the system return security evidence beyond a name match: advisories, provenance, repository posture and package behavior?","dimensions":"advisories, provenance, repository posture, package behavior","key":"security-evidence","rubric":"Full credit requires more than a known-vulnerability lookup: advisory context, provenance or source links, repository posture when relevant, and package-behavior signals. 
Vulnerability feeds get credit for advisory evidence even if they do not provide install-plan context.","title":"Security evidence","tracks":[{"name":"Nipmod","score":83,"sourceCoveragePct":100},{"name":"deps.dev","score":33,"sourceCoveragePct":57},{"name":"Native registries","score":26,"sourceCoveragePct":100},{"name":"OSV","score":17,"sourceCoveragePct":57},{"name":"OpenSSF Scorecard","score":3,"sourceCoveragePct":14},{"name":"Raw agent","score":0,"sourceCoveragePct":100},{"name":"Socket","score":0,"sourceCoveragePct":57},{"name":"Snyk","score":0,"sourceCoveragePct":57}]},{"description":"Can the system describe what would run, keep hosted checks read-only and expose the execution boundary before workspace writes?","dimensions":"install plan, read-only boundary, package behavior, prompt boundary","key":"execution-preflight","rubric":"Full credit requires a structured install plan, explicit hosted read-only behavior, and enough package-behavior context for an agent or host to review execution before workspace writes. 
Tools that only report metadata or advisories are shown as not designed for this layer, not as broken tools.","title":"Execution preflight","tracks":[{"name":"Nipmod","score":100,"sourceCoveragePct":100},{"name":"Native registries","score":2,"sourceCoveragePct":100},{"name":"Raw agent","score":0,"sourceCoveragePct":100},{"name":"deps.dev","score":0,"sourceCoveragePct":57},{"name":"OSV","score":0,"sourceCoveragePct":57},{"name":"Socket","score":0,"sourceCoveragePct":57},{"name":"Snyk","score":0,"sourceCoveragePct":57},{"name":"OpenSSF Scorecard","score":0,"sourceCoveragePct":14}]},{"description":"Can an agent consume the result as an action-ready decision object, not just a generic API response or human page?","dimensions":"agent decision JSON, install boundary, source evidence, machine output","key":"agent-readiness","rubric":"Full credit requires structured, action-ready output that combines source evidence, warnings, trust context and install boundary. Generic JSON earns limited credit because an agent would still need to assemble the preflight decision itself.","title":"Agent readiness","tracks":[{"name":"Nipmod","score":100,"sourceCoveragePct":100},{"name":"Native registries","score":10,"sourceCoveragePct":100},{"name":"deps.dev","score":6,"sourceCoveragePct":57},{"name":"OSV","score":6,"sourceCoveragePct":57},{"name":"Socket","score":2,"sourceCoveragePct":57},{"name":"Snyk","score":2,"sourceCoveragePct":57},{"name":"OpenSSF Scorecard","score":1,"sourceCoveragePct":14},{"name":"Raw agent","score":0,"sourceCoveragePct":100}]}],"categoryWeights":[{"category":"Source resolution","weights":[{"dimension":"source depth","weight":22},{"dimension":"identity","weight":18},{"dimension":"search","weight":18},{"dimension":"multi-source coverage","weight":16},{"dimension":"metadata","weight":14},{"dimension":"version","weight":12}]},{"category":"Security evidence","weights":[{"dimension":"advisory","weight":24},{"dimension":"package 
behavior","weight":24},{"dimension":"provenance","weight":20},{"dimension":"repository posture","weight":18},{"dimension":"metadata","weight":8},{"dimension":"version","weight":6}]},{"category":"Execution preflight","weights":[{"dimension":"install plan","weight":32},{"dimension":"read-only boundary","weight":28},{"dimension":"prompt boundary","weight":18},{"dimension":"package behavior","weight":14},{"dimension":"agent JSON","weight":8}]},{"category":"Agent readiness","weights":[{"dimension":"agent JSON","weight":34},{"dimension":"install plan","weight":22},{"dimension":"prompt boundary","weight":14},{"dimension":"read-only boundary","weight":12},{"dimension":"source depth","weight":8},{"dimension":"machine-readable output","weight":6},{"dimension":"identity","weight":4}]}],"checkedAt":"2026-05-27T13:09:11.674Z","command":"pnpm benchmark:competitive","excludedComparisons":[{"name":"Dependabot and Renovate","reason":"They operate mainly on existing repositories, manifests and update pull requests. They belong in a project-maintenance benchmark, not this hosted preflight API benchmark."},{"name":"npm audit and pip-audit","reason":"They analyze dependency trees or package advisories from a local project or manifest context. This benchmark does not install packages or inspect a user workspace."},{"name":"Snyk CLI and SCM integrations","reason":"They can be stronger than Snyk API-only checks for project snapshots, but they require local code, manifests or repository integration. This run intentionally measures hosted read-only API preflight only."},{"name":"Install firewalls and sandbox execution systems","reason":"They operate at runtime or install interception. 
Nipmod's hosted API is deliberately before that boundary and does not execute or unpack artifacts."}],"fairnessControls":["The benchmark has one narrow question: what evidence is available before an agent moves toward external code execution.","Tracks are described by role, so OSV, deps.dev, Socket, Snyk, OpenSSF Scorecard and native registries are not presented as failed versions of Nipmod.","The headline score is coverage-adjusted across all cases, while applicable depth remains visible for specialized feeds.","Token, rate-limit and plan limitations are marked as limitations in the snapshot instead of being hidden.","No package install, clone, artifact unpacking, model execution, paid inference call or workspace write is performed.","Machine-readable JSON is published so the public page can be checked against the raw snapshot."],"limitations":["The benchmark has 7 public cases. It is a focused preflight benchmark, not a full registry-wide or malware-corpus evaluation.","Weights are authored by Nipmod and should be reviewed by outside maintainers before being treated as independent proof.","Socket and Snyk API tracks were limited by the token, plan or rate limits available in this run; direct claims against those products should not be made from this snapshot.","Local project scanners, CLI tools, SCM integrations and install-time firewalls are excluded because they require manifests, repositories, local code or runtime interception.","The hosted API does not execute code, unpack artifacts or clone repositories, so this benchmark does not measure sandbox malware detection.","A high score means stronger preflight evidence at this boundary. 
It is not a guarantee that a package, model, repository or MCP server is safe."],"claimBoundary":["This is an agent package-decision benchmark, not a malware-free guarantee.","Nipmod is measured at the pre-install moment: search, inspect, evidence, warnings and install-plan output.","OSV, deps.dev, Socket, Snyk, OpenSSF Scorecard and native registries are measured by the dimensions they actually expose.","The headline score is coverage-adjusted across all benchmark cases; specialized evidence feeds keep their own applicable depth score.","No package install, repository clone, artifact unpacking, model execution or workspace write is performed."],"scoreAccounting":[{"explanation":"The case list is fixed for the public snapshot so the page is not cherry-picking a different sample after seeing a result.","label":"Case set","value":"7 source cases"},{"explanation":"Each provider/case observation records concrete dimensions such as identity, version, metadata, advisory, provenance, repository posture, package behavior, install plan, read-only boundary and agent JSON.","label":"Observation unit","value":"provider x case x dimension"},{"explanation":"Pass rows keep their computed score, warn rows are discounted, and fail and skip rows score zero in the coverage-adjusted headline.","label":"Status treatment","value":"pass full, warn discounted, fail/skip zero"},{"explanation":"Each category has explicit weights. Category scores are averaged across all 7 cases, so narrow evidence feeds keep their applicable depth visible but do not receive full-source coverage credit.","label":"Category score","value":"weighted dimensions across all cases"},{"explanation":"The public score is the average of source resolution, security evidence, execution preflight and agent readiness. 
This is why the page is called an agent preflight benchmark, not a universal security ranking.","label":"Headline score","value":"mean of 4 public categories"}],"cases":[{"expected":"zod@3.25.76","id":"npm-schema-zod","reason":"Common npm selection task where an agent asks for a TypeScript schema validation package.","source":"npm","task":"TypeScript schema validation"},{"expected":"lodash@4.17.20","id":"npm-vulnerable-lodash","reason":"Known vulnerable package/version case to verify advisory context, not just popularity or name resolution.","source":"npm","task":"Known vulnerable npm package"},{"expected":"requests@2.32.5","id":"pypi-http-requests","reason":"Common PyPI selection task where an agent asks for a Python HTTP client.","source":"PyPI","task":"Python HTTP client"},{"expected":"pydantic@2.11.0","id":"pypi-schema-pydantic","reason":"PyPI schema-validation task to test source normalization across ecosystems.","source":"PyPI","task":"Python schema validation"},{"expected":"sentence-transformers/all-MiniLM-L6-v2","id":"hf-embedding-model","reason":"Model reuse case where package-style safety needs file shape, license, card and model metadata.","source":"Hugging Face model","task":"Embedding model"},{"expected":"ac.tandem/docs-mcp","id":"mcp-docs-server","reason":"MCP server discovery case where agents need tool metadata, repository links and install/use boundaries.","source":"MCP","task":"MCP docs server"},{"expected":"vercel/next.js","id":"github-nextjs","reason":"Repository posture case where an agent may reuse a GitHub project rather than install a registry package.","source":"GitHub","task":"GitHub repository security posture"}],"headline":{"installPlanEvidence":"7/7","liveChecks":"7/7","medianLatencyMs":2177,"score":95},"publishableClaims":["Nipmod score: 95/100 across the current production agent-preflight benchmark.","Nipmod completed 7/7 live source cases and returned 7/7 read-only install-plan evidence.","Socket and Snyk were authenticated, but 
package-depth endpoints were rate-limited or plan-limited in this run; do not use this snapshot for a direct Socket or Snyk depth claim."],"reviewerAssessment":{"academicGrade":"not sufficient as an academic security benchmark","productGrade":"usable as a public product benchmark with explicit scope and limits","reason":"The case set is intentionally small and maintained by Nipmod, so it should not be sold as independent proof. It is useful for one scoped product question: whether a hosted API returns action-ready agent preflight evidence without writing to a workspace."},"unsafeClaims":["Nipmod is safer than every competitor.","Nipmod guarantees package safety.","Nipmod replaces OSV, deps.dev, Socket, Snyk, OpenSSF Scorecard or native registries.","Socket or Snyk were beaten on their full paid/local products in this snapshot."],"rubric":[{"category":"Source resolution","fullCredit":"Correct source object, package/repo/model/server identity, version where applicable, useful metadata and source depth.","noCredit":"No source lookup for the case or only a generic page/result unrelated to the expected object.","partialCredit":"Can identify a package/version in one ecosystem but does not cover the rest of the agent source set.","whyItMatters":"An agent cannot evaluate safety if it has not resolved the exact upstream object it might use."},{"category":"Security evidence","fullCredit":"Advisories, provenance/source links, repository posture and package-behavior signals when relevant.","noCredit":"No useful security, provenance or posture evidence for the decision.","partialCredit":"One evidence type only, such as vulnerability lookup without install or behavior context.","whyItMatters":"Agents need evidence they can show before approval, not only package popularity."},{"category":"Execution preflight","fullCredit":"Structured install plan, read-only hosted boundary, package behavior and prompt/tool boundary context.","noCredit":"No description of what would run or whether 
hosted checks can write/execute.","partialCredit":"Some package behavior or metadata, but no complete install-plan boundary.","whyItMatters":"The dangerous transition is moving from recommendation to execution."},{"category":"Agent readiness","fullCredit":"Action-ready JSON that combines source evidence, warnings, trust context and install boundary.","noCredit":"Human-only output or no independent package-intelligence layer.","partialCredit":"Machine-readable API output that still leaves the agent to assemble the decision itself.","whyItMatters":"Agents need structured decisions, not documents they must scrape."}],"tracks":[{"applicable":7,"coveragePct":100,"depthScore":90,"latencyMs":2177,"name":"Nipmod","note":"Search, inspect, source evidence, warnings, read-only install-plan output and agent JSON.","pass":7,"role":"Agent preflight layer","score":95,"sourceCoveragePct":100,"status":"pass","warn":0},{"applicable":4,"coveragePct":100,"depthScore":35,"latencyMs":44,"name":"deps.dev","note":"Package metadata, licenses, advisory and provenance context for supported ecosystems.","pass":4,"role":"Package metadata and advisory evidence","score":16,"sourceCoveragePct":57,"status":"pass","warn":0},{"applicable":7,"coveragePct":100,"depthScore":27,"latencyMs":171,"name":"Native registries","note":"Source-of-truth metadata from npm, PyPI, GitHub, Hugging Face and MCP.","pass":7,"role":"Upstream source metadata","score":22,"sourceCoveragePct":100,"status":"pass","warn":0},{"applicable":4,"coveragePct":100,"depthScore":24,"latencyMs":359,"name":"OSV","note":"Vulnerability lookup for package and version pairs.","pass":4,"role":"Vulnerability evidence feed","score":10,"sourceCoveragePct":57,"status":"pass","warn":0},{"applicable":1,"coveragePct":100,"depthScore":17,"latencyMs":57,"name":"OpenSSF Scorecard","note":"Repository posture for GitHub projects. 
It is not a package install-plan layer.","pass":1,"role":"Repository posture baseline","score":2,"sourceCoveragePct":14,"status":"pass","warn":0},{"applicable":4,"coveragePct":0,"depthScore":4,"latencyMs":492,"name":"Socket","note":"Authenticated PURL endpoint was available, but package-depth lookups were rate-limited in this run.","pass":0,"role":"Supply-chain package evidence","score":1,"sourceCoveragePct":57,"status":"warn","warn":4},{"applicable":4,"coveragePct":0,"depthScore":4,"latencyMs":248,"name":"Snyk","note":"Authentication worked, but package-health depth was unavailable on the current token or plan.","pass":0,"role":"Package health and security evidence","score":1,"sourceCoveragePct":57,"status":"warn","warn":4},{"applicable":7,"coveragePct":0,"depthScore":4,"latencyMs":null,"name":"Raw agent","note":"Baseline for direct install or pull behavior without a package intelligence layer.","pass":0,"role":"No independent package intelligence layer","score":3,"sourceCoveragePct":100,"status":"warn","warn":7}],"type":"dev.nipmod.competitive-benchmark-public.v1"}