Alvaro @social.graves.cl · 11 months ago

Calibre is far from ideal, so I wonder if there is a better way to convert a PDF into EPUB? Maybe a new AI tool exists for that purpose? What do you use?

DonnieDarkmode@lemm.ee · 11 months ago

I had this exact question myself a little while ago, so I’ll share what I learned. I don’t know your level of knowledge with these things, so forgive me if I’m explaining things you already know. And spoiler alert: the answer is “technically, but not how you’d like.”

An EPUB “file” is really a zipped folder containing a bunch of individual HTML files, which hold the text for the book as well as things like the table of contents, plus photos (if your ebook has pictures) and CSS for styling. This is the same medium you’d work in if you were designing a web page, but with an ebook there are different best practices and considerations.
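Since an EPUB is packaged as a ZIP archive, you can peek inside one with nothing but the Python standard library. This is only a small sketch; the function name `list_epub_contents` is my own, not part of any tool mentioned here.

```python
import zipfile


def list_epub_contents(epub_file):
    """Return the names of the files packed inside an EPUB (a ZIP archive).

    epub_file may be a path or an open binary file object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()
```

Calling it on a path such as `"book.epub"` (a placeholder filename) should show the HTML/XHTML files, the CSS, any images, and the packaging files that tie them together.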

Now, assuming that your PDF has a good OCR (optical character recognition) layer, it will be possible for Calibre and other programs to grab the text of the PDF, and even to create an EPUB with it. But as you’ve noticed, they don’t do a good job of this. The fundamental problem is that creating an EPUB is something of an art, with best practices and personal choices about layout and file structure. When you “convert”, you’re not changing the file type from PDF to EPUB; you’re grabbing the text from the PDF and then sticking it into multiple different files, with HTML and CSS instructions throughout to tell the e-reader how to lay things out, which footnotes link to which annotations, where to display pictures, etc.

As far as I’m aware, this basically can’t be done (well) with dumb, automatic programs like what Calibre offers because there’s too much “thinking” involved. Perhaps an AI tool could be created that would handle this better, but I’m not aware of one, and it’s a pretty specialised application so it’s possible you’ll need to wait a while before someone gets around to that.

So I realised that if I wanted an EPUB version, I’d need to make it myself. I used Sigil, a free EPUB creation tool, to do it, which gave me some nice features to help speed up the process, but it’s a big time commitment (unless you’re working with a very short PDF), especially for your first EPUB where you’re still learning what to do while making it. You’ll also need to learn HTML and CSS if you haven’t already.

I did it as a sort of fun side project in my free time to learn a new skill, but unfortunately, other than that, I don’t think there’s such a thing as an “EPUBinator” that’s gonna take your PDF and create a well-made ebook.

Em Adespoton@lemmy.ca · 11 months ago

You’ve identified the main issue: PDF extraction. A PDF can lay out pages in an infinite number of ways.

My personal workflow is to take a PDF, run it through ClearType OCR, and save it as a web-friendly, accessibility-standard-compliant PDF, which extracts all the text and re-lays it out so a screen reader can read it in the correct order.

After that, it’s a matter of exporting the PDF to HTML, chunking it, zipping the results with a CSS file and a manifest, and you’ve got an ePub.
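For anyone curious what the zipping step looks like, here is a rough Python sketch using only the standard library. The function name `build_epub`, the `OEBPS/` folder, and the file names are my own choices for illustration. It is a skeleton only: a fully valid EPUB 3 also needs a navigation document and a modification date in the metadata, which I have left out.

```python
import uuid
import zipfile
from xml.sax.saxutils import escape

# Fixed pointer file that tells the reader where the manifest (OPF) lives.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(out, title, chapters, css="p { margin: 0 0 1em; }"):
    """Zip chapters into a skeleton EPUB.

    out: output path or writable binary file object.
    chapters: list of (filename, xhtml_string) pairs, in reading order.
    """
    manifest = ['<item id="css" href="style.css" media-type="text/css"/>']
    spine = []
    for i, (name, _) in enumerate(chapters):
        manifest.append(
            f'<item id="ch{i}" href="{name}" media-type="application/xhtml+xml"/>'
        )
        spine.append(f'<itemref idref="ch{i}"/>')
    manifest_xml = "\n    ".join(manifest)
    spine_xml = "\n    ".join(spine)
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:uuid:{uuid.uuid4()}</dc:identifier>
    <dc:title>{escape(title)}</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    {manifest_xml}
  </manifest>
  <spine>
    {spine_xml}
  </spine>
</package>
"""
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as z:
        # The mimetype entry must come first and must be stored uncompressed.
        z.writestr("mimetype", "application/epub+zip",
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", CONTAINER_XML)
        z.writestr("OEBPS/content.opf", opf)
        z.writestr("OEBPS/style.css", css)
        for name, xhtml in chapters:
            z.writestr(f"OEBPS/{name}", xhtml)
```

The one part that trips people up is the `mimetype` entry: it has to be the first file in the archive and stored without compression, which is why a plain "zip the folder" often produces a file that readers reject.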

And of course, there are Python libraries to do a lot of the conversion as well.

DonnieDarkmode@lemm.ee · 11 months ago

Oh yeah my actual workflow to create a book was horribly inefficient and time-consuming. How automated is that HTML export and chunking process? Are you still going through and manually adding in every last <p></p> and href?

I’m curious about your use case, because I was doing this with a book that was hundreds of pages long, full of photos and footnotes, which added lots of tedium.

Em Adespoton@lemmy.ca · 11 months ago

I just use a text editor and regex to add all the paras and hrefs. Done it for a few horribly mangled books I was converting. Whole process was “manually automated”.
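The same kind of search-and-replace can be scripted. Below is a rough Python sketch, not my exact patterns: it assumes paragraphs are separated by blank lines and footnote markers look like `[12]`, and the names `wrap_paragraphs`, `link_footnotes`, and `notes.xhtml` are made up for the example.

```python
import html
import re


def wrap_paragraphs(text):
    """Wrap blank-line-separated blocks of plain text in <p> tags."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Collapse hard line breaks left over from the PDF layout.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(f"<p>{html.escape(joined, quote=False)}</p>")
    return "\n".join(paragraphs)


def link_footnotes(markup, notes_file="notes.xhtml"):
    """Turn markers like [12] into links to anchors such as notes.xhtml#fn12."""
    return re.sub(r"\[(\d+)\]", rf'<a href="{notes_file}#fn\1">[\1]</a>', markup)
```

Real books always have exceptions (headings, block quotes, bracketed numbers that are not footnotes), so expect to check the output by hand.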

Wilker@lemmy.blahaj.zone · 11 months ago

Question: why is using OCR software more worthwhile than pulling the contents out with something like LibreOffice Draw?

JaymesRS@midwest.social · 11 months ago

The ideal would have to be some sort of AI translation. The problem is that PDF is a page layout format and EPUB is a reading format, and you can’t just extract the text without understanding which parts are affected by page layout. Think of reading by columns, for example. And you would need to train the AI on what’s unnecessary for reading comprehension.
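To make the column problem concrete, here is a toy Python sketch. It assumes the text has already been extracted as hypothetical `(x, y, text)` blocks, with `y` growing downward and equal-width columns; real PDFs are rarely this tidy, and `reading_order` is just a name I made up.

```python
def reading_order(blocks, page_width, columns=2):
    """Order text blocks column by column, top to bottom within each column.

    blocks: list of (x, y, text) tuples, where (x, y) is the top-left corner
    of the block and y grows downward (an assumption of this sketch).
    """
    column_width = page_width / columns

    def sort_key(block):
        x, y, _ = block
        # Clamp so a block at the far right edge still lands in the last column.
        column = min(int(x // column_width), columns - 1)
        return (column, y)

    return [text for _, _, text in sorted(blocks, key=sort_key)]
```

Sorting by `y` alone would interleave the left and right columns line by line, which is exactly the kind of jumbled text a naive extractor gives you.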

Empricorn@feddit.nl · 11 months ago

By “far from ideal”, I think you mean “not perfect”.

RoyalEngineering@lemmy.world · 11 months ago

And ugly!

jbrains@sh.itjust.works · 11 months ago

An ugly powerhouse Linux application? What will they think of next?!

RoyalEngineering@lemmy.world · 11 months ago

Yeah! Like Audacity!

prashanthvsdvn@lemmy.world · 11 months ago

Well, data extraction from PDF is always tricky, and there isn’t a defined way to translate PDF to EPUB 1:1, so I don’t think Calibre is the problem. It’s difficult to automate that kind of reverse engineering.

cooopsspace@infosec.pub · 11 months ago

This is the answer.

PDF was designed for fixed page layout, not for conversion, which makes it hard to convert. Mobi to EPUB is easy, though, because both are basically reflowable text with markup.

/thread
