Machine translation services that support Chinese (zh) seem to provide content in "Mandarin Chinese". Similar non-Mandarin Chinese languages (NMCL) such as Cantonese (zh-yue) and Wu (wuu), for which there is no machine translation support, may benefit from having access to the Mandarin Chinese machine translation.
Languages that can benefit from surfacing "Mandarin Chinese" MT:
For yuewp and ganwp, in addition to exposing the Mandarin Chinese machine translation, it would be useful to convert the Simplified Chinese characters from the machine translation into Traditional Chinese characters.
This change is similar to the way English translations were exposed for Simple English (T196354), but we may need to have some additional considerations:
This was based on the feedback provided by a translator.
By reading that topic, I believe that there are zhwiki benefits affected
Thank you very much for creating this ticket, @Pginer-WMF. I have just created two community discussions on yuewp and wuuwp, still waiting for response. Activity on ganwp is virtually down to zero.
(Side note: Several others are still in the incubator.)
I am confused by this part. Does this refer to using CT to spam the target wikis? There are active sysops and other users patrolling yuewp and wuuwp. On yuewp, we have filters to block Mandarin content. So far not many people have actually used CT to translate content to yuewp, and spammers could already post untranslated content right now. Also, policies are such that pages left unfinished in Mandarin would be deleted just like pages written in other languages. As such, I don't think spamming with Mandarin content would be a big problem on yuewp.
this is tricky.... I can think of two solutions.
When support for yue, wuu and others becomes available, and the quality of the MT content is satisfactory, then CT should switch to using the new support. (I hope it would not involve too much work, but there might be other concerns. See below.) I suppose Cantonese is the most likely to be supported next, but that is still quite distant. Currently, Bing.com and Baidu.com support yue (though their mechanism is simply translating content into Mandarin and then doing a naive word-for-word conversion to yue). Google announced a few years ago that it was working on yue.
other concern
Thanks for all the details provided, @Roy17. Some comments below:
In T199523#4423046 , @Roy17 wrote: potential risk.... I am confused by this part. Does this refer to using CT to spam the target wikis? There are active sysops and other users patrolling yuewp and wuuwp.
I was not thinking of intentional spam, but more of increasing the chances of some content accidentally going without review. In any case, we are improving the system to measure how much content is reviewed and warn users accordingly. What I'd propose is that you monitor the articles created and provide us with feedback, so that the thresholds for the warning mechanisms can be adjusted if needed.
If there are filters to block Mandarin contents, we need to check how the errors are surfaced in Content Translation, to make sure we communicate clearly to the user what needs to be reviewed.
Notice/Warning: this is tricky.... I can think of two solutions.
1. Maybe CT can have some sort of banners/highlights, on the sidebar or anywhere, to warn the users: 'Default: Mandarin in place, please work on MT contents to make sure they conform to styles and standards on the target wiki.'
2. We could put up permanent banners on yuewp and wuuwp, to provide guidelines to users.
I was thinking along the lines of 1, that is, how to communicate it inside Content Translation. I think that we can convey the information in the "Automatic translation" card, for example by showing the language name next to the service used (e.g., "Using Yandex (Chinese)"). This does not seem a complex change, but my point is that we need to discuss design options and implement it as part of this work (unlike the case of Simple English which was, well, simpler).
yuewp and ganwp pages must be written in Traditional Chinese characters. wuuwp doesn't specify which to use, but most contents are written in Simplified Chinese characters. So, if possible, please consider including support for this minor difference. This is not a big issue as it can be easily resolved by users using browser extensions, JS or whatever.
So ideally, for yuewp and ganwp, in addition to exposing the Mandarin Chinese machine translation, it would be useful to convert the Simplified Chinese characters from the machine translation into Traditional Chinese characters. I'll add this to the ticket description.
If the source is zhwp, would this (translating Chinese to Chinese) be a potential bug and break the server? If so, CT should just copy the original content in this case.
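The behaviour described here can be sketched roughly as follows. This is only an illustration in Python, not how cxserver implements it: the names `MT_FALLBACK`, `NEEDS_TRADITIONAL`, `mt_language_for` and `postprocess` are hypothetical, the language codes `yue`, `wuu` and `gan` are assumed, and the character table is a tiny sample. A real Simplified-to-Traditional conversion needs a full table and context, because some Simplified characters map to more than one Traditional character.

```python
# Hypothetical mapping: target wiki language -> language requested from the MT service.
MT_FALLBACK = {"yue": "zh", "wuu": "zh", "gan": "zh"}

# Targets whose wikis expect Traditional Chinese characters (yuewp and ganwp in this ticket).
NEEDS_TRADITIONAL = {"yue", "gan"}

# Tiny illustrative sample only; a real conversion table has thousands of entries.
S2T_SAMPLE = {"机": "機", "译": "譯", "语": "語", "广": "廣", "东": "東"}


def mt_language_for(target: str) -> str:
    """Return the language to request from the MT service for a given target."""
    return MT_FALLBACK.get(target, target)


def to_traditional(text: str) -> str:
    """Naive character-by-character conversion using the sample table."""
    return "".join(S2T_SAMPLE.get(ch, ch) for ch in text)


def postprocess(target: str, mt_output: str) -> str:
    """Convert the MT output to Traditional characters only where the target needs it."""
    if target in NEEDS_TRADITIONAL:
        return to_traditional(mt_output)
    return mt_output


if __name__ == "__main__":
    print(mt_language_for("yue"), postprocess("yue", "机器翻译"))
    print(mt_language_for("wuu"), postprocess("wuu", "机器翻译"))
```

Keeping the fallback mapping separate from the script conversion means wuu can reuse the Mandarin MT without any character conversion, as suggested for wuuwp.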
Good consideration. I don't think it would break, but it is a good case to check, and definitely avoid unnecessary use of resources.
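A minimal sketch of that check, in Python for illustration only: `translate_or_copy`, `fallback` and the `translate` callback are hypothetical names, not part of Content Translation or cxserver. The idea is simply that when the source language is the same language the MT fallback would request, the MT call is skipped and the source text is copied.

```python
from typing import Callable, Dict


def translate_or_copy(
    source_lang: str,
    target_lang: str,
    text: str,
    translate: Callable[[str, str, str], str],
    fallback: Dict[str, str],
) -> str:
    """Skip the MT call when it would translate a language into itself."""
    # Language actually requested from the MT service (e.g. "zh" for a "yue" target).
    mt_lang = fallback.get(target_lang, target_lang)
    if source_lang == mt_lang:
        # Chinese-to-Chinese case: copy the source content, no MT request is made.
        return text
    return translate(source_lang, mt_lang, text)


if __name__ == "__main__":
    def fake_mt(src: str, dst: str, text: str) -> str:
        # Stand-in for a real MT client; it only tags the text.
        return f"[{src}->{dst}] {text}"

    fb = {"yue": "zh", "wuu": "zh"}
    print(translate_or_copy("zh", "yue", "香港", fake_mt, fb))
    print(translate_or_copy("en", "yue", "Hong Kong", fake_mt, fb))
```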
In the past few months I have used CT to convert zhwp articles around 100 times. I remember there were occasions when CT showed a banner in the top right-hand corner, warning me that the translation would trigger a certain filter on yuewp.
The two most important filters we put up on yuewp to detect Mandarin content are filters 4 and 5. They basically identify some fundamentally different Mandarin structural words, such as conjunctions and prepositions.
I am not a linguist, but I'll try summarising the differences between Mandarin and Cantonese. The two share a large vocabulary, especially proper nouns and concepts that come from the West: not 100% identical, but maybe 70%. The major differences lie in the grammar and in the words used for grammatical structures, e.g. be 是 係, at 在 喺, he 他 佢. In each case the first character is Mandarin and the other is Cantonese. You can see these in the filters.
That's why I brought forth this proposal, in the hope that users could save the time spent translating the large bulk of nouns, verbs, adjectives, adverbs, etc., and focus only on arranging content, grammar and style. Additionally, we prefer translating original sources from European languages, instead of the Mandarin translation we might find on zhwp, because zhwp content is often outdated and written in a not so encyclopaedic style.
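To make the idea of such a filter concrete, here is a rough Python sketch. It is not the actual yuewp filter 4 or 5: the marker list contains only the three Mandarin words mentioned above, and `mandarin_marker_count`, `looks_unreviewed` and the threshold are hypothetical. A real filter would need a longer word list and would have to account for these characters also appearing inside legitimate Cantonese words.

```python
# Mandarin grammatical words named in this comment: "be", "at", "he".
MANDARIN_MARKERS = ("是", "在", "他")


def mandarin_marker_count(text: str) -> int:
    """Count occurrences of the Mandarin marker characters in the text."""
    return sum(text.count(marker) for marker in MANDARIN_MARKERS)


def looks_unreviewed(text: str, threshold: int = 3) -> bool:
    """Flag text that still contains many Mandarin grammatical words.

    The threshold is an arbitrary value chosen for this sketch.
    """
    return mandarin_marker_count(text) >= threshold


if __name__ == "__main__":
    print(looks_unreviewed("他是在香港的学生"))  # Mandarin-style sentence
    print(looks_unreviewed("佢係學生"))          # Cantonese-style sentence
```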
I agree with the comments above that until a separate machine translation engine is available for Cantonese, it can be useful to display a machine translation from a non-Sinosphere language to Modern Standard Written Chinese (MSWC) when the target language is Cantonese. This is because Cantonese and Mandarin share the vast majority of technical vocabulary; the lexical similarity rises when one considers MSWC which actually incorporates significant amounts of Cantonese vocabulary. In addition, the fact that most existing machine translation engines don't distinguish between the different kinds of Chinese languages means that the machine translations themselves are somewhat influenced by Cantonese anyway (being treated as a subset of Chinese rather than a standalone language).
As discussed above, unedited machine translation is generally awful, and the Cantonese Wikipedia already has edit filters that block the posting of unedited Mandarin-grammar content. Together these mean that there is already some safeguard against CT being used to flood the Cantonese Wikipedia with machine-translated Mandarin content.
Change 618050 had a related patch set uploaded (by KartikMistry; owner: KartikMistry): [mediawiki/services/cxserver@master] Enable MT based on closely-related languages (wuu)
https://gerrit.wikimedia.org/r/618050
Change 618050 merged by KartikMistry: [mediawiki/services/cxserver@master] Enable MT based on closely-related languages (wuu)
Change 618525 had a related patch set uploaded (by KartikMistry; owner: KartikMistry): [operations/deployment-charts@master] Update cxserver to 2020-08-05-070016-production
https://gerrit.wikimedia.org/r/618525
Change 618525 merged by jenkins-bot: [operations/deployment-charts@master] Update cxserver to 2020-08-05-070016-production
Mentioned in SAL (#wikimedia-operations) [2020-08-06T12:06:33Z] <kart_> Updated cxserver to 2020-08-05-070016-production ( T258919 , T199523 , T257943 , T256194 )
As part of T258919 , we supported the following that is relevant to this ticket:
This is based on the input that was provided in this ticket, so feel free to give it a try and let me know if further adjustments are needed.