Wikipedia talk:Offline Content Generator

Book creator

This discussion up to and including 8 August is copied from User talk:Jimbo Wales/Archive_210#Book creator. Cheers, Steelpillow ( Talk ) 10:37, 10 August 2016 (UTC)

This is getting embarrassing:

Status last updated 23 August 2020.

Cheers, Steelpillow ( Talk ) 05:57, 1 August 2016 (UTC)

I wonder if we shouldn't just remove those pages. Or are you arguing that the Wikimedia Foundation should invest resources in fixing the problem? I'm not opposed to that in principle, but I don't believe the tool was ever used much. I might be wrong about that, though, so if I am, then let me know! -- Jimbo Wales ( talk ) 13:49, 1 August 2016 (UTC)

I'd really, really appreciate if Wikipedia's standard PDF export function (the clickable link in the left margin on each page - is that the same software as Book Creator?) would render tables. Only today I added this to the WP:CONTENTFORK guidance, partially based on the fact that tables are not exported. I'd rather not need a bypass for that guidance for such a reason (a kind of guidance that is prone to inadvertent shadiness). Also it's quite frustrating, e.g. I've been putting some energy in List of repertoire pieces by Ferruccio Busoni lately: click on the PDF export function and *poof* almost nothing remains apart from four pages of references referencing something that isn't there. Yeah, imho, would be money well spent to get that sorted. -- Francis Schonken ( talk ) 16:59, 1 August 2016 (UTC)

I hope someone has some usage statistics. I know I field a number of questions at OTRS about the tool (mostly bug reports, but they do substantiate some level of usage), so I know there is interest, but I don't have a clue about whether the usage is high enough to justify expense. -- S Philbrick (Talk) 17:10, 1 August 2016 (UTC)

Usage statistics OK, but that cuts both ways (talking about the PDF export function, still not sure whether that's the same as Book Creator): I don't use it any more while it doesn't do what it should do, i.e. not maim an article when converting it to PDF. How can one extract insightful usage statistics from something that is avoided for its cumbersome MO? Use PDFcreator or some similar tool on the weblayout is what everyone says when I bring up the issue of the discarded tables, so I assume that's what most people do when they want to create PDFs - but the result is considerably different from what one gets with the built-in PDF export function (which has a better readability afaics). Current usage as such doesn't teach us much... how many Wikipedia pages are sent to local software PDF generators? Wouldn't people prefer prints in "PDF export function" layout over "weblayout" generated by local software? -- Francis Schonken ( talk ) 17:27, 1 August 2016 (UTC)

Meaningful usage statistics would have to predate the 2014 "update", I wouldn't know where or how to look. All I can say is that there is still a fair trickle of complaints at Help:Books/Feedback and it pretty much borks most new submissions to pediapress. For any one editor making their presence felt there, a standard rule of thumb is that there are 100 to 1,000 silent editors who just walked, and ten times as many visitors left with the impression that the whole business sucks. If nobody's gonna fix it, then I think it needs to be killed. OTOH if the copyrighting battle against rip-off artists is worth the fight, then book creation needs fixing up properly so pediapress and the rest of us can leverage it again. Either way, doing nothing is bad. Cheers, Steelpillow ( Talk ) 19:00, 1 August 2016 (UTC)

How much does http://pediapress.com donate for the premium service of [1]? This press release from 2007 explains what happened. The reason is that people started selling PDFs on Amazon. EllenCT ( talk ) 18:41, 1 August 2016 (UTC)

It seems to me that this is a nice conclusion you've jumped to. From what I see, and I could be wrong, there are the following problems with your theory: 1) PediaPress seems to be focused on creation of physical paper books, not just files of Wikipedia content as anyone (should) be able to generate. 2) PediaPress seems to depend upon the same book creator that creation of files does. I don't know, but I wonder if that service now suffers from the same rendering problems that Book Creator does. And I wonder if you know by experience, or are you just speculating? But I actually am writing to say that I'm one of the silent masses who would really like the book creator to be fixed; it would be nice to have the ability to port collections of astronomy articles to a document file and be able to read them offline at our observatory. Laughing Vulcan 17:27, 2 August 2016 (UTC)

I'm completely sure I remember when Wikipedia articles and collections started showing up on Amazon. I recommend simply asking PediaPress if they can make the nice PDFs you want, but don't be surprised if they charge you a token amount and add certain strings. @CAnanian (WMF): do you know the answers? You seem to be the only staff assigned to [2]. EllenCT ( talk ) 19:14, 2 August 2016 (UTC)

OK, but my question to you was whether you actually have knowledge and/or proof that the reason for Book Creator not working properly is that people started selling PDFs on Amazon (and implying PediaPress in the process)? It appears to me that you do not, and are merely speculating / fishing in the dark. Especially since the PediaPress thing apparently began in 2007 and apparently the breaking of Book Creator occurred after that. As mentioned above by Steelpillow and as I speculated, the breaking of Book Creator ALSO breaks PediaPress, as Book Creator is HOW one submits files to PediaPress in addition to creating PDF files for download. But you didn't know that, did you? Anyway, it's clear to me that you do not seem to know what you're talking about, as fixing it so PediaPress would work would also fix it so I can just download a PDF. But you don't seem to get that. Anyway, as I said, mark me down as one who sees Book Creator as important and would like it to be fixed so that tables, etc. render properly. Whether for personal use, or to submit to PediaPress. Laughing Vulcan 19:31, 2 August 2016 (UTC)

Hold on there chaps, I get a sense of talking at cross-purposes. The reason for *what* is because books started appearing on Amazon? We are effectively trying to create Print on demand books and Amazon is a popular sales outlet for the printed volumes, whether published by PediaPress or anybody else. Cheers, Steelpillow ( Talk ) 14:05, 4 August 2016 (UTC)

All I remember from the time is that the works were poor quality (tables would render, but type would break across page breaks) and there was a substantial outcry that they would tend to bring the project into disrepute. The problem became substantially worse in the years following 2007. See e.g. OmniScriptum#Wikipedia content duplication, [3] and [4]. EllenCT ( talk ) 20:16, 4 August 2016 (UTC)

There are two obvious questions here. First, why did WMF make a bad update that broke features, then refuse to fix it? And how did we go from "This technology is of key strategic importance to the cause of free education world-wide," said Sue Gardner, Executive Director of the Wikimedia Foundation. (2007) to saying that it was not worth having a management process and intentionally breaking the feature seven years later? I mean, if strategic means "totally unwanted in seven years" then there is no strategy at all and donors shouldn't be paying for overpaid Brahmins to work it out. Wnt ( talk ) 00:07, 3 August 2016 (UTC)

As a relative outsider I looked into this a little. It seems that the old, relatively functional code was Wikipedia-specific, in an unfashionable programming language and (ironically) not easily maintainable. A more maintainable core engine was pulled in from somewhere and what I can only describe as alpha software wrapped around it and gifted to us in place of the "unmaintainable" that had basically worked. The idea was to iron out the bugs and add the missing features from here on in. But that never happened because at that point the developer walked. Maybe it had all been done for free up until then, I don't know, but the folks at WMF apparently decided to spend their money and effort elsewhere and just leave the mess hanging. Quite why they trashed Sue's strategic vision is unclear to me. Cheers, Steelpillow ( Talk ) 04:06, 3 August 2016 (UTC)

I think you've got it right as to what happened. I'm willing to advocate for investing in fixing it but only if we have some indication that it was actually being used by many people. It is entirely possible that upon release Sue thought it was going to be "of key strategic importance" but within a few months' time it may have become apparent that it wasn't important at all. These things happen, and no one can really be blamed for it. But if a decision was made to deprioritize it to the point that broken software has been left in place for years, well, that's not good - better to just remove it completely I would imagine. -- Jimbo Wales ( talk ) 13:52, 3 August 2016 (UTC)

The Wikipedia:Books page was created at the tail end of 2008 and Help:Books in 2009 around the time of Sue's vision statement. Browsing Category:Wikipedia books gives some idea of how much Book Creator had been used up until 2014. Another way is to browse the PediaPress website, although I don't know if any download/purchase stats exist. I can't imagine the usage stats could have been all that bad after say 2012, or a long-term maintainable rewrite would never have been kicked off in 2014. To me, the key question is whether WMF should care about the likes and ambitions of PediaPress any more, and if the answer is "yes" then the management process needs resurrecting if nothing else. Let that process decide whether to share or to shaft. Or, if "no", then can the whole thing. For my part, some of the moans on the feedback page give me the feeling that the 'press momentum was beginning to create a self-perpetuating marketplace in which academics were improving articles to publishable quality so they could provide better books in class. Is there a critical mass there to be sought for? As I say, does the WMF care? Cheers, Steelpillow ( Talk ) 16:02, 3 August 2016 (UTC)

I almost agree with that, except that I'd make the case the likes and ambitions of companies that monetize the use of it should take a distant second place behind those of us who would use it without commercial ambitions but rather for continuing learning for when we don't have internet connections. But the other thing I'd note is that I was frightened away by the warning above, only to find that while parts of it are broken, parts aren't. It still put a decent book together for me of Messier Objects, even as it borked the "List of Messier Objects" article/chapter because it is one big table. I think the creators of that announcement went a tiny amount Chicken Little - then again maybe it does just accurately describe the problem. The other question is, if the Foundation doesn't have the resources to create and maintain it, is it possible to crowdsource development of it? (Just whistling in the dark there.) Laughing Vulcan 01:49, 4 August 2016 (UTC)

Hi everyone, the WMDE's software development team is also currently looking into that issue, since adding tables to pdfs was one of the wishes of the German Community Wishlist. So far, our investigations have shown that it would take an enormous engineering effort (comparable to software companies that produce layout software) to add tables to the current latex layout in a way that 80%-90% of the tables display correctly. 10%-20% would always be off due to the different capabilities of the two media (printed, laid-out page versus HTML). Therefore we will probably add another option to the page that appears when you click on "download as pdf", which allows you to download a pdf that looks more or less like the web page you see. On the plus side, it will contain all tables, images etc. that are present in the article, on the down side, it will not be as concise and nicely laid out as the latex version. Therefore we would add this as a new option that you can choose depending on what you want. Wikibooks however would probably need a "print page" which includes all chapters for the new rendering service to work, which is not included in our initial plans. In general, the hope is that by moving towards a browser based rendering service (which takes the web page as its basis) we will get more people to join in improving the layout that comes out of there, making it a more maintainable solution to the pdf creation problem. -- Lea Voget (WMDE) ( talk ) 17:00, 4 August 2016 (UTC)

Offline Content Generator (OCG)

As the current default maintainer of the Collection extension, PDF export, plaintext export, and (soon) ePub and ZIM export, let me give a (short version of a) longish history. The Book Creator/Collection extension was originally created by Pediapress in 2008. Part of the service was hosted on WMF servers in our data center in Tampa, but if you actually ordered a printed book the request got bundled up and passed over to Pediapress' servers, which ran a similar version of the code but interfaced with their print-on-demand service. Pediapress made enough money from the print-on-demand service (apparently) to fund continued development of the service, which benefited all those who generated PDFs but did the printing themselves, and this mutually-beneficial arrangement persisted for a number of pleasant years.

However, the buglist grew over time. Pediapress did not invest much effort in internationalization, and support for non-roman-script languages was poor-to-nonexistent. Pediapress maintained their own bug tracking system, which grew to contain thousands of bugs. It *appears* that Pediapress was no longer making enough money from print-on-demand to fund their continued development and maintenance of the code base, and development stalled. No effort was made on the code base for a number of years, but the system "worked enough" (for European languages, at least) that things muddled on.

Unfortunately, the day came when WMF had to move out of its Tampa datacenter. The Pediapress code was literally the last thing running in Tampa, and it was costing the Foundation $1,000/day to keep that one server running ($30K/month). Worse, no one had written down how that server had been installed and there was no one who could recreate its configuration in our new datacenter. It looked like we were going to have to turn off Book Creator.

Matt Walker was passionate about Book Creator, however, and pulled in a skunkworks group of WMF folks to save the service, rewriting it in what was a state-of-the-art architecture at the time. We rebuilt it from scratch, documenting the process and installing it on modern server infrastructure, and were able to keep things going. The project had the support of Erik Moeller, and I was pulled in to provide support from the Parsoid side, eventually writing the PDF backend and a plaintext backend. As these things go, however, the new project had a different feature set -- it was much better at Indic and non-latin languages (thanks to XeLaTeX), had clickable hyperlinks, included enough license information to actually comply with our Creative Commons attribution requirements, etc -- but was missing some features. Tables and infoboxes are particularly hard, and those aren't particularly strong points for LaTeX either.

I don't need to recap the organizational struggles at the foundation in the following years. Suffice it to say that all the original participants in the skunkworks project, including Erik who had provided C-level support, have since left the foundation, leaving me as the last member of the original skunkworks. Further, the engineering reorganization which occurred toward the end of the Lila era left OCG homeless. OCG should rightly be part of the "reading" team, but its only remaining developer (me) is on the "editing" team. La la la. We don't generally let these sorts of things get in the way of actually doing good work, but they are relevant when deciding who to petition for additional resources...

We actually had a great Wikimania this year, with a lot of focus on the "Offline Content Generator" (as the architecture behind Book Creator, PDF export, the Collection extension, etc, is formally named). In fact, we had ZIM export and ePub export capabilities developed during the hackathon. Unfortunately, the code hasn't actually been submitted yet to me/the WMF, so we can't deploy it. :( But it exists, I've seen it running, and for the first time we had more-than-just-me working on OCG.

In addition, as the WMDE team above explained, the German Wikimedia chapter has adopted "tables in PDFs" as one of their feature development goals. The first part of this is https://gerrit.wikimedia.org/r/290417 . And I wrote basic support for tables a few years ago; see https://gerrit.wikimedia.org/r/107587 -- the problem is that my patch doesn't *always* work, and can in some cases cause the entire page to fail to render. At this level of support I judged it best to keep suppressing tables and get *some* output, rather than risk getting *no* output for many pages. (This is really a fault of LaTeX's limited table support, which prefers to fail when it sees something unexpected or unexpectedly wide, and requires semi-heroic measures to work around.) There are ways around the problem we can discuss. (Gabriel posted some phabricator links below.)

One final wrinkle is that the architecture which was state-of-the-art in 2014 is already looking a little dated in 2016. The "services" team here at WMF has standardized on a services architecture and the use of cassandra for storage, and in general we would like to use browser technologies to render the page more directly from the HTML DOM rather than use a LaTeX intermediary. In addition, we made some architecture compromises to maintain compatibility with the pediapress POD service, which are looking less wise (we still support the pediapress POD but we send a high-level description of the page to them now, so we don't need to maintain compatibility at lower levels in the stack). We could really use some help (a) modernizing the backend, and (b) working with modern CSS technologies to make browser output on par with the LaTeX output, so we can eventually remove the LaTeX backend. Sometimes discussions of OCG spiral off into tangents along these lines; some even suggesting that further investment in features on the LaTeX backend is a waste of time.

So. Yes, OCG is starved for resources. It is also sitting at an awkward place both in the org chart and in the overall services architecture of the foundation. As long as I am the only one working on OCG, it will continue to make slow progress, but there are in fact several useful improvements on the immediate horizon. The usage statistics are also available; the short version is that we generate about 10 PDFs a second currently. That's an order of magnitude less than the number of pageviews/second of our article web pages, but still quite a large number of users. C. Scott Ananian ( talk ) 21:55, 4 August 2016 (UTC)

Links to related tasks @Cscott mentioned: Table support in PDFs, Options for browser-based PDF rendering. To gauge quality of browser-based rendering, we have set up an instance of a Chrome-based third party render service ( Electron ) in labs. Example URL: https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=https://en.wikipedia.org/wiki/Barack_Obama

Wikimedia Germany is considering using this to improve table & other complex content support for the "This page as PDF" feature. -- GWicke ( talk ) 22:07, 4 August 2016 (UTC)
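For anyone who wants to try the labs instance from a script, the example URL above can be assembled as in the sketch below. This is only an illustration: the endpoint, the `accessKey` value and the `url` parameter are copied from the example URL, the function name `render_url` is made up here, and nothing in the snippet checks whether the service is still reachable. Note that `urlencode` percent-encodes the article address, which is the stricter form of the unencoded example.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in the comment above.
RENDER_ENDPOINT = "https://pdf-electron.wmflabs.org/pdf"


def render_url(article_url: str, access_key: str = "secret") -> str:
    """Build a request URL for the labs render service (hypothetical helper).

    Only constructs the string; it does not contact the service.
    """
    query = urlencode({"accessKey": access_key, "url": article_url})
    return f"{RENDER_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(render_url("https://en.wikipedia.org/wiki/Barack_Obama"))
```

The resulting string could then be fetched with any HTTP client and the response saved as a `.pdf` file, assuming the instance is still up.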

Thank you Cscott for the update (not to mention for hanging in there). Chicken Little has now updated the warning template accordingly. Cheers, Steelpillow ( Talk ) 09:30, 5 August 2016 (UTC)

I'm interested in hearing more about "missing math support"--OCG should actually be on par with or better than the previous service in this regard, as they both use the native math support of LaTeX. If someone could chase down more details on this I'd appreciate it. C. Scott Ananian ( talk ) 15:26, 5 August 2016 (UTC)

I think it's more to do with sensible layout. Some longer equations do not fit in a two-column layout. For example try downloading the Grassmannian article as pdf and check out section 6 on the Plücker embedding - one equation runs right across both columns. Worse, a long equation in the second column has nowhere to run off to. The no-brainer answer is to allow selection of single-column, full-width layout. More sophisticated solutions might be to split the equation across multiple lines or to shrink the font size to fit. Cheers, Steelpillow ( Talk ) 19:40, 5 August 2016 (UTC)

This already exists if you use the Book Creator. Single-column layout is one of the options available. What's needed is some way for an article to embed a hint that it looks better in a single-column layout, via a category or some such. C. Scott Ananian ( talk ) 14:28, 7 August 2016 (UTC)

How come Lea Voget (WMDE)'s prognoses seem bleaker than Cscott's? Or am I missing something? Lea's seem like "forget it", something that looks like steering for just taking the service off-line, while Cscott's rather looks like, "baby steps, but we're progressing and have prospects", and at least shows someone kinda managing the process (even from a somewhat awkward position that doesn't leave too much wiggle room). -- Francis Schonken ( talk ) 15:53, 5 August 2016 (UTC)

Hi, in my own mediawiki2latex compiler linked in the above template I can handle tables correctly, as you can easily check by just running the exe file on the examples of your choice. Still I must agree it was extremely hard for me to write that software and I was driven by an extremely passionate hatred of the economic system I happen to live in. If you want to pay someone to do it, it will be quite expensive I think, since people working for money never reach such a level of passion. I personally cannot help you with the development, since I got a permanent position at university now. Still I will try to keep my software available so that anyone in need of the LaTeX source of Wikipedia articles or their respective PDF version will have access to them. Also I must say the process I developed needs lots of computational resources, so that the above mentioned cost of 10000$/day might be realistic if you wanted to use my software as default renderer on Wikipedia. It's quite simple: you create 10 PDFs a second. My software needs 300s per PDF on a current i3 desktop. So that's 3000 i3s you need to run the software Wikipedia-wide, which is not affordable. And of course I will get myself a t-shirt: "Semi-Hero of LaTeX OCG table rendering" Yours -- Dirk Hunniger ( talk ) 16:53, 6 August 2016 (UTC)
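The back-of-envelope estimate above (10 PDFs a second, 300 s per PDF) is an instance of Little's law: the number of machines busy at once equals the arrival rate times the time per job. A minimal sketch of that arithmetic, assuming each machine renders one PDF at a time and the load is steady; `machines_needed` is a hypothetical helper name, not part of mediawiki2latex or OCG.

```python
import math


def machines_needed(pdfs_per_second: float, seconds_per_pdf: float) -> int:
    """Little's law: jobs in flight = arrival rate x time per job.

    Assumes one render at a time per machine and a steady request rate,
    so the result is a lower bound that ignores peaks and queueing.
    """
    return math.ceil(pdfs_per_second * seconds_per_pdf)


if __name__ == "__main__":
    # Figures quoted in the thread: about 10 PDFs/s, about 300 s per PDF.
    print(machines_needed(10, 300))  # prints 3000
```

Halving the render time or the request rate halves the machine count, which is why per-PDF render time matters so much at this request volume.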

Tx. Is there a place to continue this conversation somewhere centralized? Wikipedia:Offline Content Generator ( WP:OCG )? Or some place at meta? -- Francis Schonken ( talk ) 04:58, 7 August 2016 (UTC)

Also, is there a compelling reason why the computing power should be server-side? Can't the conversion to PDF be done client-side with a script? -- Francis Schonken ( talk ) 04:35, 8 August 2016 (UTC)

If you want to stick with LaTeX you need a quite exhaustive LaTeX installation which needs several gigabytes, which you cannot easily transfer to the client, just to create a PDF. I tried hard to reduce the amount of software that needs to be installed, but was not successful. Furthermore the LaTeX compiler and some other auxiliary tools are binaries compiled to run on a PC, and cannot easily be turned into JavaScript. I do indeed offer binary releases of mediawiki2latex for download on SourceForge which can run standalone on a client PC, but they are just standard binary executables and have nothing to do with scripts running in a web browser. Yours Dirk Hunniger ( talk ) 06:12, 8 August 2016 (UTC)

"Also, is there a compelling reason why the computing power should be server-side? Can't the conversion to PDF be done client-side with a script?" As I have stated before, if your only goal is to have a PDF "as you see it", then you simply don't need anything from WMF to achieve this. Almost every modern computer has a "Print to PDF" feature built right into the Print (center) functionality and it works just fine in 99% of the cases. If your computer does not come with something like that, you can easily download a free one online. TheDJ ( talk • contribs ) 11:32, 10 August 2016 (UTC)

Other issues are that server-side can distinguish useful infoboxes from gratuitous navboxes more reliably than client-side can; document publishing needs copyright collation and downstream management/availability, which make far more sense to keep 100% server-side; and server updates would need to be mirrored by script updates, giving the client side a maintenance problem. So even direct html-to-pdf is unworkable on the client side. Cheers, Steelpillow ( Talk ) 14:29, 10 August 2016 (UTC)

Usage statistics

As for statistics, there are some at stats:reportcard/booktool/BookTool.html. Nemo 23:43, 11 August 2016 (UTC)

Looking at those statistics, the absence of noticeable development does not seem surprising. -- Dirk Hunniger ( talk ) 19:15, 12 August 2016 (UTC)

Could the fall in PediaPress sales have anything to do with the rise of the Czech publisher e-Pedia?

"e-Pedia (an imprint of e-artnow) charges for the convenience service of formatting these e-books for your eReader. We donate a part of our net income after taxes to the Wikimedia Foundation from the sales of all books based on Wikipedia content." [5]

"e-artnow is an electronic information service distributing selected e-mail announcements related to contemporary visual arts. e-artnow is an artists' initiative founded in january 2008 in Prague, Czech Republic. We provide an independent and competitive alternative to the existing electronic e-mail art news distribution with a Do It Yourself philosophy - You create your own announcement online: we send it out. We send out your exclusive e-mail announcement on the date you wish + we include you in our weekly Opening Reminder e-mail announcement." [6]

Can anybody confirm e-artnow's claim of passing cash back to the WMF? Are/were they associated with PediaPress or a competitor or what? I would also invite folks to visit Amazon and note the number of books relating to Wikipedia on sale, which are directly extracted from Wikipedia articles (unfortunately mixed up with those which are about Wikipedia). Cheers, Steelpillow ( Talk ) 08:59, 13 August 2016 (UTC)