Archives

35 posts

2026-06

Bifocal Diffusion LMs: Asymmetric Bidirectional Context for KV-Cache-Compatible Parallel Generation
Jun 30, 2026 6000 words 15 min read Papers Model Architecture

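arXiv 2026-06 | Bifocal dLLM: causal attention plus a reverse Mamba SSM sidecar, the first design to provide right-side context while staying KV-cache compatible; throughput is 2.4×-12.9× higher than bidirectional dLLMs.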
Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference
Jun 16, 2026 9777 words 24 min read Papers Sparse Attention

arXiv 2026-06 | Sketch&Walk uses Hadamard block sketches plus a cross-layer walk state to capture multi-hop dependencies, approaching dense attention quality at 20% density in long-context sparse attention.
AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
Jun 08, 2026 15654 words 39 min read Papers Kernel Agent

arXiv 2026-03 | AutoKernel uses an agent loop to optimize GPU kernels automatically, combining profiling, code generation, and correctness verification to speed up Triton/CUDA operators.
FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
Jun 08, 2026 18336 words 46 min read Papers Kernel

MLSys 2026 / arXiv 2026-01 | FlashInfer-Bench builds a closed-loop benchmark for AI-driven LLM systems, evaluating automated kernel and system optimization on real workloads.

2026-05

🤖 Agent Compression Paper Landscape (60+ Papers · 13 Directions)
May 29, 2026 30000 words 75 min read Papers Survey Agent

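Survey 2025-12 to 2026-05 | Surveys 60+ Agent Compression papers, grouped into 13 directions including trajectory compression, plan caching, memory compression, multi-agent communication, and KV cache optimization.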

2026-04

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
Apr 21, 2026 11155 words 28 min read Papers Serving

NeurIPS 2025 preprint | Prefill-as-a-Service examines the feasibility of reusing next-generation models' KV cache across datacenters, redrawing the service boundary between prefill and decode.
Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
Apr 15, 2026 18641 words 47 min read Papers Serving Agent

arXiv 2026-04 | TokenDance scales multi-agent LLM serving with collective KV cache sharing, reducing redundant prefill and KV movement between agents.
Scheduler Overlap: CPU-GPU Scheduling-Level Overlap
Apr 14, 2026 12776 words 32 min read Blog

Blog 2026-04-14 | Study notes on the SGLang Zero Overhead Scheduler, explaining scheduling mechanisms such as CPU-GPU Overlap Mode, CUDA Stream, FutureMap, and Token Pool.
AI Agent & Harness Design Patterns
Apr 10, 2026 15908 words 40 min read Blog

Blog 2026-04-10 | Collects three paradigms for designing AI agent and harness systems: Claude Intelligence Patterns, Ralph Wiggum Technique, and Three-Agent Harness.
SpecExit: Accelerating Large Reasoning Model via Speculative Exit
Apr 07, 2026 11458 words 29 min read Papers Spec Decoding

ICLR 2026 | SpecExit uses speculative exit to let large reasoning models exit early or verify at a suitable layer, reducing latency for long CoT reasoning.
Reinforcement Learning for Chain of Thought Compression with One-Domain-to-All Generalization
Apr 07, 2026 11414 words 29 min read Papers Agent

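arXiv 2026-04 | Compresses Chain-of-Thought with reinforcement learning and studies one-domain-to-all generalization, shortening reasoning tokens while preserving cross-domain performance.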
LIFT: A Novel Framework for Enhancing Long-Context Understanding of LLMs via Long Input Fine-Tuning
Apr 03, 2026 11547 words 29 min read Papers KV Cache

ICML 2026 | LIFT strengthens LLM long-context understanding through long input fine-tuning, focusing on training data construction and generalization to long inputs.
The Paradigm Shift in Thinking Brought by OpenClaw
Apr 03, 2026 816 words 2 min read Blog

Blog 2026-04-03 | Personal workflow notes: what OpenClaw changes is not just efficiency; IM channels and ample execution resources shorten the work chain.

2026-03

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Mar 30, 2026 12603 words 32 min read Papers KV Cache

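arXiv 2026-03 | TurboQuant proposes an online vector quantization method with a near-optimal distortion rate, reducing the storage cost of vectors such as KV and activations while preserving accuracy.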
Meta-Harness: End-to-End Optimization of Model Harnesses
Mar 28, 2026 15289 words 38 min read Papers Agent

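COLM 2026 | Meta-Harness treats the model harness as an end-to-end optimizable object, automatically searching over prompts, tools, and evaluation wrappers to improve task performance.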
SpargeAttention: Accurate and Training-Free Sparse Attention Accelerating Any Model Inference
Mar 24, 2026 11341 words 28 min read Papers Sparse Attention

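ICML 2025 | SpargeAttention proposes a training-free sparse attention acceleration scheme that uses query-aware, block-level selection to cut long-context inference cost while preserving accuracy.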
L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models
Mar 19, 2026 9955 words 25 min read Papers Spec Decoding

NeurIPS 2025 | L-MTP extends multi-token prediction from adjacent tokens to leap context, making parallel prediction and speculative decoding more practical.
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
Mar 19, 2026 7087 words 18 min read Papers KV Cache

NeurIPS 2025 | KVzip scores KV importance via context reconstruction to perform query-agnostic cache compression, reusing the compressed KV cache across multi-query scenarios.
AI-Researcher: Autonomous Scientific Innovation
Mar 16, 2026 19057 words 48 min read Papers Application

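NeurIPS 2024 | AI-Researcher builds an autonomous scientific innovation pipeline covering literature retrieval, experiment implementation, paper generation, and automated review, used to evaluate AI research capability.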
Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding
Mar 13, 2026 21711 words 54 min read Papers Serving

arXiv 2026-03 | Step-3 lowers large-model decoding cost through model-system co-design, trading off between large parameter scale and affordable serving.
Where Matters More Than What: A Reading of the DapQ Paper
Mar 12, 2026 12834 words 32 min read Papers KV Cache

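arXiv 2026-03 | DapQ argues that in KV cache compression position matters more than content, and proposes a decoding-aligned quantization/compression approach to reduce errors during generation.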
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Mar 12, 2026 12875 words 32 min read Papers Serving

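arXiv 2026-03 | Step 3.5 Flash presents the system design of an open model with 11B active parameters, focusing on the balance among model capability, inference efficiency, and serving cost.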
DefensiveKV: Taming the Fragility of KV Cache Eviction in LLM Inference
Mar 12, 2026 11105 words 28 min read Papers KV Cache

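ICLR 2026 / arXiv | DefensiveKV studies the fragility of KV cache eviction and proposes a more robust retention strategy, lowering the risk of sudden quality drops after compression.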
Query-Aware Sparsity for Efficient Long-Context LLM Inference
Mar 11, 2026 10371 words 26 min read Papers Sparse Attention

ICML 2024 | Quest proposes query-aware sparsity, using query-dependent sparse selection to accelerate long-context LLM inference and reduce irrelevant KV accesses.
ReAct: Synergizing Reasoning and Acting in Language Models
Mar 11, 2026 6078 words 15 min read Papers Agent

ICLR 2023 | ReAct interleaves reasoning traces with actions, letting language models reason and act together under tool/environment feedback; a classic agent paradigm.
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Mar 10, 2026 18275 words 46 min read Papers Kernel

arXiv 2026-03 | FlashAttention-4 co-designs the algorithm and kernel pipeline to optimize attention throughput and memory access under asymmetric hardware scaling.
XAttention: Block Sparse Attention with Antidiagonal Scoring
Mar 09, 2026 11712 words 29 min read Papers Sparse Attention

ICML 2025 | XAttention performs block sparse attention with antidiagonal scoring, selecting key blocks in long-context inference to reduce attention cost.
FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
Mar 09, 2026 12742 words 32 min read Papers Sparse Attention

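arXiv 2026-03 | FlashPrefill accelerates long-context prefill through instantaneous pattern discovery and thresholding, removing redundant tokens/blocks from full attention computation.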
DualSpec: Accelerating Deep Research Agents via Dual-Process Action Speculation
Mar 09, 2026 13512 words 34 min read Papers Agent Spec Decoding

ICML 2026 | DualSpec accelerates deep research agents with dual-process action speculation: a fast process predicts actions first, and a slow process then verifies and corrects them.
ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Mar 09, 2026 10258 words 26 min read Papers Agent

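arXiv 2026-03 | ARLArena provides a unified evaluation and training framework for stable agentic RL, focusing on environments, rewards, and training stability in long-horizon interactive tasks.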
Efficient Agent Training for Computer Use
Mar 09, 2026 10348 words 26 min read Papers Agent

arXiv 2026-03 | An efficient training method for computer-use agents, focusing on improving GUI/web operation ability with fewer interactions and more stable training signals.
Stop Wasting Your Tokens: Efficient Runtime Multi-Agent Systems
Mar 06, 2026 7032 words 18 min read Papers Agent

ICLR 2026 | Stop Wasting Your Tokens proposes SupervisorAgent, which reduces token consumption in multi-agent systems through runtime supervision and task allocation, a 29.68% reduction on GAIA.
RAPID: Retrieval-Augmented Speculative Decoding for Long-Context Inference
Mar 05, 2026 8948 words 22 min read Papers Spec Decoding

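arXiv 2026-03 | RAPID introduces a RAG drafter into long-context speculative decoding, improving long-context generation efficiency while preserving the target model's verification quality.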
Paper Report: Gated Attention for Large Language Models
Mar 04, 2026 5783 words 14 min read Papers Sparse Attention

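arXiv 2026-03 | Gated Attention regulates attention information flow through a gating structure, focusing on sparse/selective computation in large-model architectures and implementation details.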
RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
Mar 04, 2026 10058 words 25 min read Papers Spec Decoding

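arXiv 2026-03 | RAPID uses retrieval-augmented context as draft-side input, mitigating the high KV overhead and insufficient quality of small draft models in long-context speculative decoding (SD).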