[NTLK] html2newt news; font metrics? book.hints?

From: Vladimir Alexiev (Vladimir_at_worklogic.com)
Date: Wed Nov 20 2002 - 17:20:02 EST


Our html2newt project is getting some traction. It seems that three of us
will be joining forces: Vincent Lee (author of Unicode Press), Dakkar
and me (Vlad). The goal of the project is to make a converter from
html/text/[pdf] to Newton book, running on Win & Mac, supporting Unicode.

Decisions/news/knowledge [dev who did it or wants it]:
- we need a good name for this project, please help!!!
-- I like Vince's "UniPress", but it may lead to confusion because
unlike Newton Press, this will be fully automatic and won't be interactive
(does anyone really use Newton Press?)
-- html2newt is an imprecise name because we'll want to support more
input formats (it's lame anyway).
-- I kinda like "newtonberg". That's the name of an old dead project
to collect a library of newt books. Is it appropriate to reuse it?
- we'll support Mac [Vince, Dakkar] and Win [Vlad] from the start
(perhaps also Linux).
- we'll support Unicode from the start [Vince].
- we'll support HTML [Vlad: want it for computer books, eg from
ITKnowledge] and TXT initially.
- it seems we won't be using bookmaker because it:
-- has limitations (eg few supported fonts)
-- has bugs (eg Assertion Failed on some files)
-- uses "unclean" input (eg on Windows, RTF+dot, which is a mess)
-- doesn't support unicode (interprets \u1234\u as several chars).
-- can't be patched on Mac to support big screen size, or at least we
dunno how (we can patch Win Bookmaker, thanks to Marcus Koppenburg. I
released "bookmaker-patch.pl", a perl-based patcher, a while ago.)
- Vince can make pkg directly, at least for books with very simple
structure. If we can help it, we won't use NTK.
-- as a second test case of this "magic conversion to pkg", I'm trying
to convince Vince to make an mp3->pkg converter based on Padilla's
MP3Builder. If this would be useful to you, please reply with subject
"mp3topkg" to voice your support.

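A minimal sketch of what the HTML front end could look like, assuming Python
and its standard html.parser module. Nothing here is decided by the project:
the class name ParagraphExtractor, the function html_to_paragraphs and the
choice of block tags are all hypothetical. It flattens HTML into a list of
Unicode paragraphs, which a later stage would lay out into book pages.

```python
from html.parser import HTMLParser

# Illustrative assumptions: which tags end a paragraph, which are skipped.
BLOCK_TAGS = {"p", "div", "br", "li", "pre", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class ParagraphExtractor(HTMLParser):
    """Collects the text of an HTML document as a list of paragraphs."""

    def __init__(self):
        # convert_charrefs=True turns &eacute; or &#1078; into Unicode chars.
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buf = []
        self._skip = 0

    def flush(self):
        # Collapse runs of whitespace, as a browser would.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def html_to_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # text after the last block tag
    return parser.paragraphs
```

Calling html_to_paragraphs("<h1>Title</h1><p>some text</p>") gives one string
per block element. Inline markup (bold, links, tables) is simply dropped in
this sketch; a real converter would have to map it to book styles.
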
PDF support (this is important to Didier and Laurent):
- Vince currently does text (uncompressed only) from pdf but the
extraction is rudimentary.
- keep in mind that PDF page content is written in a postscript-derived
operator language (text can be split up, repositioned, re-encoded and
compressed), so trivial approaches to extracting text will fail in some cases.
- for example, try such an approach on an academic paper made with
latex->dvips->ps2pdf, eg
http://citeseer.nj.nec.com/rd/37652379%2C476095%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/25291/http:zSzzSzwww.rxrc.xerox.comzSzpeoplezSzandreolizSzpublicationszSzDocumentszSzP01563zSzcontentzSzdist.pdf/linear-objects-logical-processes.pdf
which is the "PDF" link on http://citeseer.nj.nec.com/476095.html. I bet
that it won't work well because latex emits postscript differently, compared
to Word+PDFWriter for example.
- the only "Right" approach is to use a real PDF/postscript interpreter, eg
xpdf or ghostscript.
- I think that a more practical approach is to use a free service like
www.gobcl.com (better) or
http://www.adobe.com/products/acrobat/access_onlinetools.html
(not too good).
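
To illustrate why the trivial approach breaks down, here is a minimal sketch
(in Python) of a naive extractor for uncompressed content streams. It is not
Vince's code; the function name naive_pdf_text and the sample streams are made
up. It only looks for the two text-showing operators, Tj (one string) and TJ
(an array of strings and positioning numbers), and ignores fonts, encodings,
nested parentheses and compression entirely.

```python
import re

# A PDF literal string "( ... )" with backslash escapes. Unescaped nested
# parentheses are not handled (a deliberate simplification).
_STR = r"\((?:\\.|[^\\()])*\)"

# Either "(string) Tj" or "[ (str) num (str) ... ] TJ".
_SHOW = re.compile(r"(%s)\s*Tj|\[((?:[^\]\\(]|%s)*)\]\s*TJ" % (_STR, _STR))


def _unescape(literal):
    body = literal[1:-1]  # drop the surrounding parentheses
    return re.sub(r"\\([()\\])", r"\1", body)


def naive_pdf_text(stream):
    """Return the string fragments shown by Tj/TJ, in stream order."""
    pieces = []
    for m in _SHOW.finditer(stream):
        if m.group(1) is not None:
            pieces.append(_unescape(m.group(1)))
        else:
            # Numbers between the strings are positioning offsets; dropped.
            pieces.extend(_unescape(s) for s in re.findall(_STR, m.group(2)))
    return pieces


# A word-processor style stream: whole phrases in one string.
simple = "BT /F1 12 Tf 72 700 Td (Hello world) Tj ET"
# A stream that positions fragments individually (made-up numbers). There is
# no space character here: the word gap is only a positioning offset.
kerned = "BT /F8 10 Tf 72 700 Td [(Lin)-1(e)5(ar)-333(ob)-55(jects)] TJ ET"

print(naive_pdf_text(simple))
print(naive_pdf_text(kerned))
```

On the second stream the extractor returns the fragments "Lin", "e", "ar",
"ob", "jects" with no way to tell which gaps are word spaces unless it also
interprets the offsets against the font metrics. That is the kind of work a
real interpreter such as xpdf or ghostscript already does.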

Questions:
- how to get the font metrics of builtin fonts, and of a PKG containing
font(s)?
-- Vladimir Kochergin's NBM_Hacks can display and change the font metrics
inside Win Bookmaker, so for the builtin fonts we can get them from there.
-- if such knowledge/code is not available for PKG, we'll have to hit the
books and Apple font tools, and figure it out.
- what is book.hints?
-- I think that we understand most of the rest about a book's structure, or
can figure it out relatively easily with some Bookmaker experiments.
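
On the font metrics question: presumably the metrics matter because, without
Bookmaker, the converter has to break lines and pages itself. A minimal sketch
in Python of that use, assuming the metrics boil down to a per-character
advance width in pixels. The widths below are invented for illustration, not
taken from any Newton font, and the function names are hypothetical.

```python
def text_width(text, widths, default=6):
    """Sum of per-character advance widths; unknown chars get `default`."""
    return sum(widths.get(ch, default) for ch in text)


def wrap(text, widths, max_width, default=6):
    """Greedy line breaking: add words while the line still fits."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and text_width(candidate, widths, default) > max_width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


# Invented metrics: narrow "i", wide "m", 3-pixel space.
widths = {"i": 2, "m": 10, " ": 3}

print(wrap("ii ii ii", widths, max_width=30))
print(wrap("mm ii mm", widths, max_width=30))
```

Both sample strings have eight characters, but only the first fits on one
30-pixel line. Counting characters is not enough for a proportional font,
which is why the real widths are needed, both for the builtin fonts and for
fonts shipped in a PKG.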



