Corpora Cleaning Tools

Tools for filtering and cleaning parallel and monolingual corpora in order to train better (neural) machine translation systems. Inspired by the Data Filtering and Data Pre-processing sections of Tilde's WMT17 paper. This repository includes some of the more basic scripts that can help to get rid of the majority of junk from parallel corpora.

Tools included

- parallel - tools for parallel corpora
- mono - tools for monolingual corpora

Requirements

- Python with langid.py
- PHP
- Moses scripts
- Subword NMT

    pip install subword-nmt
    pip install langid

Publications

If you use this tool, please cite the following paper:

Matīss Rikters (2018). "Impact of Corpora Quality on Neural Machine Translation." In Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018).

    @inproceedings{Rikters2018BalticHLT,
      author = {Rikters, Matīss},
      booktitle = {In Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018)},
      title = {{Impact of Corpora Quality on Neural Machine Translation}},
      address = {Tartu, Estonia},
      year = {2018}
    }
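
Example (illustrative sketch)

To give a rough idea of the kind of basic parallel-corpus filtering described above, here is a minimal Python sketch. It is not taken from this repository's scripts: the function name `filter_parallel`, its parameters, and the `max_ratio` threshold of 3.0 are all assumptions chosen for illustration. It drops pairs with an empty side, pairs where source and target are identical, pairs with a very uneven length ratio, and exact duplicates. Language identification is optional and passed in as a callable, so the sketch runs without extra packages.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def filter_parallel(
    pairs: Iterable[Tuple[str, str]],
    src_lang: Optional[str] = None,
    tgt_lang: Optional[str] = None,
    classify: Optional[Callable[[str], str]] = None,
    max_ratio: float = 3.0,  # assumed threshold, tune per corpus
) -> Iterator[Tuple[str, str]]:
    """Yield sentence pairs that pass a few simple sanity checks (hypothetical helper)."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the "translation" is just a copy of the source.
        if src == tgt:
            continue
        # Drop pairs whose character lengths differ too much.
        longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
        if longer / shorter > max_ratio:
            continue
        # Drop exact duplicates of pairs already kept.
        if (src, tgt) in seen:
            continue
        # Optional language check, only if a classifier was supplied.
        if classify is not None:
            if src_lang is not None and classify(src) != src_lang:
                continue
            if tgt_lang is not None and classify(tgt) != tgt_lang:
                continue
        seen.add((src, tgt))
        yield src, tgt
```

With langid.py installed, one possible way to enable the language check is to pass `classify=lambda s: langid.classify(s)[0]` together with `src_lang` and `tgt_lang` codes such as `"en"` and `"lv"`; `langid.classify` returns a `(language, score)` tuple, so the lambda keeps only the language code.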