Publication: Improving corpus reproducibility through modular text transformations and connected data set.