What are your thoughts on DSA, DeepSeek's sparse attention?
DSA is clever in the sense that it isn't a fixed pattern like “the last 100 tokens” or “every 1000th token”. It's also somewhat better than NSA because of its indexer, which learns to assign an importance score to each token it could attend to and then keeps only the top-k tokens by that score. NSA, by contrast, just compresses blocks with an MLP, then does top-k selection on those, plus a sliding window (the last x tokens).
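
To make the "score, then top-k" idea concrete, here is a rough sketch in plain Python for a single query token. The scoring function (ReLU of a dot product), the function names, and the toy numbers are all assumptions for illustration, not DeepSeek's actual indexer, which is a trained component.

```python
import math

def indexer_scores(q_idx, k_idx):
    """Hypothetical indexer: one importance score per past token.

    Assumption: the score is ReLU(q_idx . k), a stand-in for a learned scorer.
    """
    return [max(0.0, sum(a * b for a, b in zip(q_idx, k))) for k in k_idx]

def topk_indices(scores, k):
    """Positions of the k highest scores, returned in ascending position order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def sparse_attention(q, keys, values, selected):
    """Softmax attention for one query, restricted to the selected positions."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w / total * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]

# Toy data: 5 past tokens, 2-dimensional vectors (made-up numbers).
q_idx = [1.0, 0.0]
k_idx = [[0.1, 0.0], [0.9, 0.0], [-0.5, 0.0], [0.7, 0.0], [0.2, 0.0]]
keys = k_idx                               # reused only to keep the toy small
values = [[float(i)] for i in range(5)]

selected = topk_indices(indexer_scores(q_idx, k_idx), k=2)  # -> [1, 3]
output = sparse_attention([1.0, 0.0], keys, values, selected)
```

The point is that which tokens get attended to depends on the scores, not on position, so tokens 1 and 3 are picked here even though they are not the most recent ones. An NSA-style setup would instead score compressed blocks and always add the last few tokens via the window.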