I recently came across an old paper (1991) by Wuu Yang. The paper describes a diff program for C by matching syntax trees. The *matching* between two trees is defined to be a set of pairs of nodes, one from each tree, such that

- two nodes in a pair contain identical symbols,
- a node can match at most one node in the other tree, and
- the parent-child relationship as well as the order between sibling nodes are respected.

A maximum matching is a matching with the maximum number of pairs. A straight forward dynamic algorithm is proposed to find the maximum matching with the constraints above.

It’s good to make the matching restrictions such that one could come up with a nice algorithm to maximize a parameter. But, respecting the order of siblings in the matching, limits the algorithm in describing changes which involve moving program elements around.

We should keep in mind that the goal of developing a tool for displaying changes, is to assist the developer in understanding the differences. I believe we should pay attention to the vocabulary that developers use in describing their changes to each other. Do they only tell each other which parts of their codes match in two versions? I say no. Programmers use various vocabulary to explain how they have modified their code by talking about things that have *moved*, *renamed, deleted, added, split, …*