Three supply chain incidents and a GitHub policy change with a hard deadline collided this week to make trust infrastructure the dominant developer story. At the same time, Claude Code shipped twice in 48 hours, ARC-AGI-3 introduced a benchmark that humans solve completely while frontier models score below 1%, and Ecorobotix committed $50 million to bring precision spraying to the U.S. market. The week covered a lot of ground.

GitHub Trains on Private Repos Unless You Opt Out by April 24

GitHub changed its default settings so that private repository content feeds into Copilot model training. If you take no action before April 24, your private code is in scope. The Hacker News thread (score 268) surfaced a direct opt-out link: github.com/settings/copilot/features. The comment consensus was disbelief that this defaults to opt-out rather than opt-in.

The practical concern runs deeper than one company’s policy. Private repositories hold internal APIs, business logic, proprietary algorithms, and patterns that sit close enough to credentials that developers have a reasonable expectation of privacy. Opt-out as a default requires users to actively monitor policy updates and respond on a deadline, which most do not. GitHub has taken the same approach before with Copilot telemetry settings.

If you use GitHub for any private projects, check your settings this week. For organizations, the relevant setting lives in the organization-level Copilot policy page, not individual user settings. The April 24 deadline is not soft.

The Supply Chain Attack Wave

Three incidents arrived this week with the same root cause: developers trusting package names and repository references more than the code those references point to at any given moment.

The Telnyx Python SDK was compromised through its teampcp dependency on PyPI. Attackers published a malicious version that exfiltrated environment variables on install. Anyone running pip install telnyx between March 23 and 25 without pinned dependencies pulled in the malicious build. The Telnyx team published a security notice and yanked the affected versions; the thread on HN (score 58) linked to Aikido’s analysis as well. Changelog News episode 184 covered a separate LiteLLM supply chain attack the same week: LiteLLM is widely deployed as an LLM API routing layer, and a compromised version reached users who had not pinned their dependencies.

Earlier this week, Aqua Security confirmed that its setup-trivy GitHub Action was compromised because the repository’s CI referenced a floating branch rather than a pinned SHA. Anyone with write access to the Aqua GitHub org could inject code into every downstream workflow using that action. GitHub’s documented guidance recommends pinning uses: references to full commit SHAs rather than tags or branches; the platform enforces none of this.
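GitHub's hardening guidance shows the shape of the fix. A sketch of the difference follows; both the version tag and the SHA below are placeholders for illustration, not a real setup-trivy release or commit:

```yaml
# Floating reference: runs whatever the tag or branch points to today.
- uses: aquasecurity/setup-trivy@v0.2.0

# Pinned reference: runs exactly this audited commit, with a comment
# recording the human-readable version it corresponds to.
- uses: aquasecurity/setup-trivy@0123456789abcdef0123456789abcdef01234567 # v0.2.0
```

Pinning does not mean freezing: tools like Dependabot can bump pinned SHAs automatically and keep the version comment in sync.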

The pattern across all three incidents: a floating reference is a standing delegation of trust to whoever controls the upstream at that moment. SHA pinning for GitHub Actions and hash-pinned dependencies in pip, npm, or composer are the concrete mitigations. Neither is complex. Both require discipline.
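For Python dependencies, pip's hash-checking mode refuses any artifact whose hash does not match the lockfile. A minimal sketch, with an illustrative placeholder version and hash rather than a real telnyx release:

```
# requirements.txt, generated with: pip-compile --generate-hashes requirements.in
telnyx==2.0.0 \
    --hash=sha256:0000000000000000000000000000000000000000000000000000000000000000

# Install with hash checking enforced; any mismatch aborts the install:
#   pip install --require-hashes -r requirements.txt
```

With this in place, a maliciously republished version fails the hash check instead of reaching your environment.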

Claude Code Ships Back-to-Back, and the .claude Folder Goes Viral

Claude Code v2.1.85 shipped March 26 and v2.1.86 followed on March 27. Together they represent the densest two-day release window in recent memory. The Anthropic team fixed 20+ bugs across both releases while landing new capabilities.

Version 2.1.85 delivered conditional if fields for hooks, letting you filter when a hook fires using the same permission rule syntax as allowlists (for example, Bash(git *) to run only on git commands). MCP OAuth now follows RFC 9728 Protected Resource Metadata discovery, which enables proper authorization server lookup in compliant environments. The CLAUDE_CODE_MCP_SERVER_NAME and CLAUDE_CODE_MCP_SERVER_URL environment variables now reach headersHelper scripts, so one helper can serve multiple MCP servers without duplication. The release also fixed /compact failing on sessions too large for the compact request itself, and fixed MCP step-up authorization for servers requesting elevated scopes via 403 insufficient_scope.
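Going by the release notes, a conditional hook might look something like the following sketch in settings.json. Only the if field and the Bash(git *) rule syntax come from the changelog; the surrounding structure and the script path are assumptions for illustration:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "if": "Bash(git *)",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/check-git.sh" }
        ]
      }
    ]
  }
}
```

The filter means the hook script fires only when the Bash tool runs a git command, rather than on every shell invocation.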

Version 2.1.86 added an X-Claude-Code-Session-Id header on all API requests so proxies can aggregate session traffic without parsing request bodies. .jj and .sl directories now get excluded from file search, which matters for Jujutsu and Sapling users who were seeing VCS metadata surface in autocomplete results. The Read tool switches to compact line-number format and deduplicates unchanged re-reads, reducing token usage on large codebases. Also fixed: scroll not following new messages, --resume failing on sessions created before v2.1.85, and memory growth in long sessions from markdown render caches.
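As a sketch of what the new header enables, a proxy can bucket traffic per session with nothing but header inspection. The aggregator below is our illustration, not Anthropic code; only the header name comes from the release notes:

```python
# Toy proxy-side aggregator: count requests per Claude Code session
# by reading the X-Claude-Code-Session-Id header, never the body.
from collections import defaultdict

SESSION_HEADER = "X-Claude-Code-Session-Id"

class SessionAggregator:
    def __init__(self):
        self.request_counts = defaultdict(int)

    def observe(self, headers: dict[str, str]) -> None:
        # Requests from older clients without the header fall into
        # a catch-all bucket instead of being dropped.
        session = headers.get(SESSION_HEADER, "<no-session>")
        self.request_counts[session] += 1

agg = SessionAggregator()
agg.observe({SESSION_HEADER: "abc", "Content-Type": "application/json"})
agg.observe({SESSION_HEADER: "abc"})
agg.observe({"Content-Type": "application/json"})
print(dict(agg.request_counts))  # → {'abc': 2, '<no-session>': 1}
```

Before this header, getting the same breakdown meant parsing request bodies in the proxy, which is both slower and riskier.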

Concurrent with these releases, a post titled “Anatomy of the .claude/ folder” hit the HN front page with 317 points and 160 comments. It walked through the .claude/ structure that Claude Code maintains: CLAUDE.md for project-specific instructions, settings.json for configuration, memory/ for persistent notes across sessions, and hook scripts that attach to tool use lifecycle events. The thread filled with developers sharing their own configurations. That organic interest indicates the tool has crossed from early adopter territory into the broader developer population.
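Condensed from the post's description, the layout looks roughly like this. The hooks/ directory name is our convention for the sketch; the post describes hook scripts without prescribing a location:

```
.claude/
├── CLAUDE.md       # project-specific instructions the model reads each session
├── settings.json   # configuration, including hook definitions
├── memory/         # persistent notes carried across sessions
└── hooks/          # scripts attached to tool-use lifecycle events
```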

ARC-AGI-3 Sets a New Bar, and AI Scores Below 1%

The ARC Prize Foundation released ARC-AGI-3 this week (arXiv:2603.24621), a new benchmark for evaluating agentic intelligence through novel, abstract, turn-based environments. Agents must explore, infer goals, build internal models of environment dynamics, and plan action sequences without explicit instructions. The benchmark uses only Core Knowledge priors: basic facts about objects, space, and counting. No language, no external knowledge, no pattern recall.

Humans solve 100% of the environments. Frontier AI systems, as of March 2026, score below 1%.

That gap requires some context. The benchmark is deliberately constructed to exclude the main advantage LLMs have: broad training data. What it measures is novel reasoning applied to genuinely new situations. Scoring below 1% on a task humans clear completely is a useful corrective to narratives about near-human AI capability.

A separate paper from this week reinforces the point from a different direction. AgentDS (arXiv:2603.19005) had 29 teams and 80 participants compete against AI agents on 17 domain-specific data science tasks spanning healthcare, manufacturing, and retail banking. AI-only agents performed near or below the median human participant. The strongest results came from human-AI collaboration, not full automation.

The consistent finding across both benchmarks: AI agents are good at pattern application within trained distributions, and they struggle at domain-specific reasoning that requires judgment calls sparse in training data. For teams designing AI-assisted workflows, the implication is that human-in-the-loop is not a temporary workaround while models improve; for problems that require genuine domain expertise, it is a design feature.

On-Device AI: The 400B Threshold Moves to Consumer Hardware

The weekly research discussion flagged a demonstration from @anemll showing an iPhone 17 Pro serving a 400-billion-parameter mixture-of-experts model by streaming expert weights from flash storage rather than holding them in RAM. The iPhone 17 Pro carries 12GB of unified memory and NVMe-class SSD bandwidth. The key insight: MoE architectures activate only a small subset of experts per token (often 8 of 512), so the OS filesystem cache keeps the frequently used experts warm while cold experts get read from SSD on demand.
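The arithmetic behind that insight is worth making concrete. The sketch below uses the cited figures (400B parameters, 8 of 512 experts active per token); the 4-bit quantization and the 95% expert-parameter share are our assumptions, not numbers from the demo:

```python
# Back-of-envelope memory math for serving a 400B-parameter MoE
# by streaming expert weights from flash instead of holding them in RAM.
TOTAL_PARAMS = 400e9
NUM_EXPERTS = 512
ACTIVE_EXPERTS = 8
BITS_PER_PARAM = 4      # assumption: 4-bit quantized weights
EXPERT_SHARE = 0.95     # assumption: fraction of params in expert FFN layers

expert_params = EXPERT_SHARE * TOTAL_PARAMS
shared_params = TOTAL_PARAMS - expert_params  # attention, embeddings, router

params_per_expert = expert_params / NUM_EXPERTS
active_expert_params = ACTIVE_EXPERTS * params_per_expert

bytes_per_param = BITS_PER_PARAM / 8
resident_gb = shared_params * bytes_per_param / 1e9
per_token_expert_read_gb = active_expert_params * bytes_per_param / 1e9

print(f"always-resident weights: {resident_gb:.1f} GB")
print(f"worst-case cold expert reads per token: {per_token_expert_read_gb:.2f} GB")
```

Under these assumptions the always-resident weights fit in the phone's 12GB, while a fully cold token would need roughly 3GB of expert reads, around a second at NVMe-class bandwidth. The filesystem cache keeping hot experts warm, not raw capacity, is what makes throughput workable.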

The practical implication for developers building on-device inference tools: the threshold for “capable local model” has moved considerably in 2026. Models that required GPU clusters two years ago now run on consumer hardware with the right architecture. For applications like offline agricultural decision support, where internet connectivity in fields is unreliable, local inference at this scale opens options that were not viable in 2024.

Precision Agriculture: $50 Million and the Data Layer Nobody Built

Ecorobotix announced a $50 million investment to bring its ARA precision spraying robot to the U.S. market. The Swiss company’s hardware uses computer vision and real-time targeting to spray weeds individually, reducing herbicide application by up to 90% compared to broadcast spraying. The investment covers U.S. manufacturing, distribution, and regulatory work.

The Precision Farming Dealer survey on strip-tillers shows continued heavy adoption of precision technology, and a video from Syngenta’s Global Head of IT and Digital Strategy at this week’s FEMA Supply Summit put it plainly: “There’s no going back” on AI in agriculture.

The most pointed piece this week is from Precision Farming Dealer: “The Machine Already Knows… Nobody Built the Layer to Use It.” Modern tractors and planters generate large volumes of operational data. That data sits on the machine or gets exported as a CSV by whoever takes the time. The integration layer that would normalize that data and expose it to decision-support systems and farm management platforms is largely unbuilt at the field level. Precision hardware vendors have proprietary data silos, and agricultural data standards like ADAPT and ISOXML see limited adoption in end-user tools.

The article frames this as an opportunity, not a criticism. The data exists. Compute is cheap. What is missing is the collection and accessibility layer at the field level, which is precisely the gap that offline-first field tools, such as flock management applications, are positioned to fill. The problem is not sensing or computing; it is getting the data out of the hardware and into a form that analysis tools can use.

Research Highlights

AVO: Agentic Variation Operators for Autonomous Evolutionary Search (arXiv:2603.24517)

AVO replaces fixed mutation and crossover operators in evolutionary search with autonomous coding agents. The agent consults the current code lineage, a domain-specific knowledge base, and execution feedback to propose, test, critique, and verify changes. Evaluated on attention kernel optimization for NVIDIA Blackwell B200 GPUs over 7 days of continuous autonomous search, AVO produced kernels outperforming cuDNN by 3.5% and FlashAttention-4 by 10.5%. Applying those optimizations to grouped-query attention took 30 additional minutes and yielded 7% and 9.3% gains respectively. The paper shows that agentic loops can find performance-critical optimizations in search spaces too large for manual exploration.
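The loop structure the paper describes (propose, test, critique, verify) can be sketched with a toy stand-in. Here the "agent" is a random perturbation and the "benchmark" a quadratic; in AVO proper the agent is a coding agent consulting code lineage, a knowledge base, and execution feedback:

```python
# Toy sketch of an agentic variation loop: propose a change, test it
# against a fitness function, and keep it only if verified better.
import random

def fitness(x: float) -> float:
    # Stand-in for a kernel benchmark time (lower is better).
    return (x - 3.0) ** 2

def agent_propose(parent: float, feedback: float) -> float:
    # In AVO this is an LLM coding agent; here, a random step whose
    # size shrinks as the crude "feedback" (current fitness) improves.
    return parent + random.uniform(-1, 1) * min(1.0, feedback)

def evolve(start: float, generations: int = 200, seed: int = 0) -> float:
    random.seed(seed)
    best, best_fit = start, fitness(start)
    for _ in range(generations):
        candidate = agent_propose(best, best_fit)   # propose
        candidate_fit = fitness(candidate)          # test
        if candidate_fit < best_fit:                # critique + verify
            best, best_fit = candidate, candidate_fit
    return best

print(f"best x: {evolve(0.0):.2f}")
```

The substantive claim of the paper is that replacing the random proposal step with an informed coding agent makes this loop productive in search spaces far too large for manual exploration.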

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models (arXiv:2603.24844, github.com/ishapuri/multi_answer_rl)

MIT researchers trained LLMs to produce multiple plausible answers in a single forward pass using multi-answer reinforcement learning. Standard post-training collapses a model’s internal distribution onto the dominant mode; this approach preserves distributional breadth. On medical diagnosis and coding benchmarks, multi-answer RL models show better diversity and coverage than single-answer baselines and use fewer tokens than repeated sampling. For coding tasks, accuracy also improves. This positions multi-answer RL as a more compute-efficient alternative to best-of-k sampling.

WAFT-Stereo: Warping-Alone Field Transforms for Stereo Matching (arXiv:2603.24836, github.com/princeton-vl/WAFT-Stereo, 15 GitHub stars)

Princeton’s Vision and Learning Lab demonstrates that cost volumes, a near-universal design choice in stereo matching, are not necessary for strong performance. WAFT-Stereo replaces them with warping and ranks first on ETH3D, KITTI, and Middlebury benchmarks, reducing zero-shot error by 81% on ETH3D while running 1.8 to 6.7 times faster than competitive methods. Relevant for anyone working on depth estimation pipelines for agricultural robotics or autonomous equipment.



Follow @zircote for weekly roundups and deep dives on AI development, developer tools, and agriculture tech.