Ebook File Creation from HTML

Solaris x86 FAQ on a Palm Pilot PDA

This article is about generating several ebook file formats from HTML. "Ebooks" are files in formats that are viewable on ebook readers, including PDAs and smart cell phones/mobile phones.

HTML files are fine for viewing on the web, and in fact many ebook readers support HTML. Plain text is the most universal format, supported by nearly all ebook readers, and is often more compact for small displays, but lacks images. The Palm OS Doc PDB file format is supported on Palm Pilots and other devices that run Palm OS. The zTXT file format is highly compressed (more than Palm Doc) and requires its own reader. The Plucker file format, which optionally contains images, also runs on Palm Pilot devices, among others. Plucker is my favorite format. It's open source, so it doesn't come pre-installed on PDAs and readers, but it is easy to use and more versatile than proprietary software. Plucker is free of DRM restrictions, so it is not widely used by commercial ebook publishers. Project Gutenberg, with thousands of ebooks, supports it though. PDF files, supported by some ebook readers, can also come with embedded images, although line wrapping is often a problem on small displays.

This article shows how to use four software programs: txt2pdbdoc to generate plain text and Palm Doc PDB files, Plucker to create Plucker PDB files, Jmakeztxt to create zTXT files, and Ebookconverter to create PDF files, all from a source HTML or text file. All of this software is open source.

Obtaining the Software

Txt2pdbdoc, available from http://homepage.mac.com/pauljlucas/software/txt2pdbdoc/, converts HTML and plain text to plain text and Palm OS Doc PDB format. Paul J. Lucas is the current maintainer of txt2pdbdoc. The software comes in source form, but is easy to compile on Solaris, Linux, and similar systems. Basically, you type this line in the source directory:
./configure; make; make check; make install
The files install under /usr/local/. For your convenience, I compiled Txt2pdbdoc for Solaris SPARC and x86, and Linux 2.6 (x86). The source and binaries are at txt2pdbdoc-1.4.4-bin.tar.gz and may be extracted with this command:
gzcat txt2pdbdoc-1.4.4-bin.tar.gz | tar xvf -
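
For the source build described above, a slightly more defensive variant chains the steps with && so that a failed step stops the build instead of going on to install. This is only a sketch: the function name build_from_source is made up for illustration, and the directory name in the usage comment is an assumption based on the tarball name.

```shell
# Hypothetical helper (not part of txt2pdbdoc): build and install from a
# source directory, stopping at the first step that fails.
build_from_source() {
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$1" && ./configure && make && make check && make install )
}

# Example (directory name is an assumption):
#   build_from_source ./txt2pdbdoc-1.4.4 || echo "build failed" >&2
```

With the semicolon form, make install still runs after a failed make check; with &&, the function returns the failing step's exit status.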

For Plucker, it's best to obtain pre-compiled binaries if you can. Solaris packages are available from http://www.blastwave.org/ and many Linux distributions come with Plucker. Otherwise, build from the source files available at http://www.plkr.org/dl. Be sure to get the Plucker "distiller", not just the Plucker Viewer (which manages and views Plucker PDB files, but doesn't create them). Plucker Viewer, also at http://www.plkr.org/dl, comes in Linux, Apple OS X, and Windows versions to view and upload Plucker files to your PDA or smartphone.

Jmakeztxt is available from http://jmakeztxt.sourceforge.net. It's a Java program by Karin Herm. To run in GUI mode, type
java -jar ./jmakeztxt-1.9.jar
To run in command line mode, type
java -cp ./jmakeztxt-1.9.jar net.sourceforge.jmakeztxt.MakeztxtCmd filename.txt
Either way creates a .pdb zTXT file.

Ebookconverter is available from http://www.kevinboone.com/ebookconverter.html as a zip file. It's Java software, so can run on any system with Java 1.5 or higher. Extract the software using unzip or similar software. I installed the software under /opt/ebookconverter/. Kevin Boone wrote ebookconverter.

Using the Software

I use the software above to generate ebooks automatically each week from selected HTML web pages. This is done with a shell script, generate_ebooks. I'll step through a simplified version of this shell script (with error handling and site-specific stuff removed for readability). You can download generate_ebooks here. You can use this to create ebooks automatically from your website, or you can generate HTML files yourself from other software (such as OCR software, OpenOffice, or StarOffice) to create custom ebooks. This last step is an exercise left to the reader :-).

The first part of the ksh shell script does initial housekeeping, such as getting the input HTML filename and creating output filenames. The script extracts the author's name automatically from the <meta name="author"> HTML tag.

#! /bin/ksh
# Set this for your website:
PARENTURL="http://sun.drydog.com/"
MYNAME=generate_ebooks
TMPOUT=/tmp/$MYNAME-out$$.tmp
parentDir=$(basename $(dirname $PWD))
shortDir=$(basename $PWD)
baseURL="$PARENTURL$shortDir"

usage () {
   echo "Usage: $MYNAME [-h] htmlfile"
   echo "$MYNAME generate ebook files from HTML"
   echo "Where: -h Displays this help"
   echo "Example: $MYNAME index.html"
   exit 1
}

# Setup
if [ "$1" = "-h" -o -z "$1" -o ! -f "$1" ] ; then
	usage
fi
inputHTML="$1"

# Create output filenames
shortTitle="$(echo "$shortDir" | sed 's/[-_]/ /g')"
outputTXT="$shortDir.txt"
outputPDF="$shortDir.pdf"
# Note that Palm Pilot DOC PDB and Plucker PDB use the same extension:
outputPDB="${shortDir}d.pdb"
outputZTXT="${shortDir}z.pdb"
outputPlucker="${shortDir}p.pdb"

# Extract author from HTML <head> element.  E.g.,
# <meta name="author" http-equiv="author" content="Dan Anderson" />
author=$(grep '"author"' $inputHTML | head -1 | sed 's/.*content="//' \
	| sed 's/".*//')
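
The author extraction at the end of the script above can be tried on its own. The following is a minimal sketch, not the author's code: extract_author is a hypothetical helper that does the same job with a single sed expression, and it assumes the name attribute appears before content on one line, as in the example tag.

```shell
# Hypothetical helper: print the content attribute of the first
# <meta name="author" ...> tag found in the HTML file "$1".
extract_author() {
    # -n plus the p flag prints only lines where the substitution matched;
    # head -1 keeps the first match, as the script above does.
    sed -n 's/.*name="author"[^>]*content="\([^"]*\)".*/\1/p' "$1" | head -1
}

# Try it on a throwaway file containing the example tag.
sample=/tmp/author-sample.$$
printf '%s\n' '<head>' \
    '<meta name="author" http-equiv="author" content="Dan Anderson" />' \
    '</head>' >"$sample"
extract_author "$sample"
rm -f "$sample"
```

If no matching tag is found, the function prints nothing, so a caller can test for an empty result before passing the value on.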

The first files we generate are plain text and PalmOS Doc PDB files. html2pdbtxt creates a plain text file for input into txt2pdbdoc, which creates the PalmOS Doc file. After creating the Doc file, the script cleans up the intermediate text file for use as plain text: it removes the (*) markers at the beginning of some lines (and removes the end-of-line before them, which falls in the middle of a paragraph), and removes the PalmOS end-of-file marker, <(*)>.

# Convert to text and Palm Doc Pdb Format
# Convert HTML to intermediate TEXT for use as Palm Pilot DOC PDB
# (Note: title is scanned from the <title> HTML tag and placed in line 1)
html2pdbtxt -u"$baseURL" $inputHTML $outputTXT >/dev/null

# Get title from first line of $outputTXT
longTitle="$(head -1 $outputTXT)"

# Convert intermediate TEXT file to Palm Pilot DOC PDB
txt2pdbdoc "$shortTitle" $outputTXT $outputPDB
echo "Generated Palm Pilot DOC PDB file $outputPDB"

# Remove Palm Pilot weirdness from intermediate TEXT file for use as a
# plain text file.  That is,
# (1) Remove all "(*)" at the beginning of the line
# and the preceding newline,
# (2) Remove the Palm Pilot end of file tag, "<(*)>".
# (3) Convert to DOS-like format, with \r\n characters.
cp $outputTXT $TMPOUT
cat $TMPOUT | grep -v '^<(\*)>$'  | sed 's/$/@EOL@/g' | tr -d '\n' | \
	sed 's/@EOL@(\*)//g' | sed 's/@EOL@/\r\n/g' >$outputTXT
rm -rf $TMPOUT
echo "Generated text output file $outputTXT"
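
The three cleanup steps listed in the comments above can also be written as one awk pass, which avoids relying on sed understanding \r and \n in a replacement. This is a sketch under that assumption, not the script's actual code; clean_doc_text is a made-up name.

```shell
# Hypothetical helper: print a cleaned-up copy of the intermediate text
# file "$1" with CRLF line endings.
clean_doc_text() {
    awk '
        $0 == "<(*)>" { next }          # (2) drop the end-of-file tag
        substr($0, 1, 3) == "(*)" {     # (1) join onto the previous line
            buf = buf substr($0, 4)
            have = 1
            next
        }
        {
            if (have) printf "%s\r\n", buf   # (3) flush with CRLF
            buf = $0
            have = 1
        }
        END { if (have) printf "%s\r\n", buf }
    ' "$1"
}

# Try it on a small sample resembling html2pdbtxt output.
sample=/tmp/clean-sample.$$
printf '%s\n' 'Title' 'First half of a paragraph,' \
    '(*) second half.' '<(*)>' >"$sample"
clean_doc_text "$sample"
rm -f "$sample"
```

Each line is held in a buffer until the next line shows whether it is a continuation, which is how the marker and the newline before it are removed together.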

Next, we generate a zTXT format file, which also has a .pdb extension, but is in a different format from Palm Doc files (and about half the size) and has its own reader. We use Jmakeztxt to create the file from our earlier-generated plain text file.

# Create zTXT format from .txt with JmakezTXT
cp $outputTXT ${shortDir}z.txt
java -cp /usr/local/jmakeztxt/jmakeztxt-1.9.jar \
	net.sourceforge.jmakeztxt.MakeztxtCmd ${shortDir}z.txt
rm -f ${shortDir}z.txt
echo "Generated zTXT output file $outputZTXT"
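
To see how much smaller the zTXT file is than the Palm Doc file for a given document, the two sizes can be compared directly. A minimal sketch, assuming both files already exist; size_percent is a hypothetical helper, not part of generate_ebooks.

```shell
# Hypothetical helper: print the size of file "$1" as a whole-number
# percentage of the size of file "$2".
size_percent() {
    # The $(( )) wrapper strips the padding some wc versions print.
    a=$(( $(wc -c <"$1") ))
    b=$(( $(wc -c <"$2") ))
    [ "$b" -gt 0 ] || return 1      # avoid dividing by zero
    echo $(( a * 100 / b ))
}

# Example, using the output names from the script:
#   size_percent "$outputZTXT" "$outputPDB"
```

The result is rounded down by integer division, which is enough for a rough comparison.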

Next, we generate a Plucker PDB file. Plucker files share the same extension as PalmOS Doc files, but the two formats are not interchangeable. To read Plucker files, you need to install the (free) Plucker software on your desktop computer and PDA. Plucker files, unlike PalmOS Doc files, can optionally include embedded images and have rich text capabilities, such as bold and italics, for a richer reading experience.

# Create Plucker format
# Note: You must run plucker-setup before using plucker-build.
# Note: zip compression is better, but can only be used in PalmOS 3.0+.
# Plucker supported with PalmOS 2.0+.
plucker-build --author="$author" \
	--title="$longTitle" --doc-name="$shortTitle" \
	--home-url="$baseURL/$inputHTML" --staybelow="$baseURL/" \
	--maxdepth=1 --noimages \
	--pluckerdir="$PWD" --doc-file="$(basename "$outputPlucker" .pdb)"
echo "Generated Plucker PDB file $outputPlucker"

Finally, we generate a PDF file from HTML, using ebookconverter. Images are embedded with ebookconverter, although not in a sophisticated way. Image centering and sizing are ignored by ebookconverter, and JPEG images tend to come out too large and PNG images too small. However, the ability to run this software unattended in the background is great. The text rendering is excellent and may be done sans-serif (default), or serif, as done here.

# Create PDF format with ebookconverter
java -jar /opt/ebookconverter/ebookconverter.jar \
	--format_options font-family=serif \
	--source_options encoding=iso-8859-1 $inputHTML $outputPDF
echo "Generated PDF file $outputPDF"
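
Because the simplified script has its error handling removed, a quick check after the last step can confirm that every output file was actually written. This is a sketch only: check_outputs is a hypothetical function, and the file names follow the naming scheme set up at the top of the script.

```shell
# Hypothetical helper: verify that all five output files for base name
# "$1" exist and are non-empty.  Returns non-zero if any are missing.
check_outputs() {
    base=$1
    missing=0
    for f in "$base.txt" "$base.pdf" \
             "${base}d.pdb" "${base}z.pdb" "${base}p.pdb"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty: $f" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example, at the end of the script:
#   check_outputs "$shortDir" || exit 1
```

The -s test treats an empty file as a failure, which catches a converter that created its output file and then aborted.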

Finally, let's run this shell script. It produces five output files from one input file, index.html:

$ generate_ebooks index.html
Generated Palm Pilot DOC PDB file faqd.pdb
Generated text output file faq.txt
Pluckerdir is '/usr/local/htdocs/sun/faq'...
---- 0 collected, 1 to do ----
Processing http://sun.drydog.com//index.html...
  Retrieved ok.
  Parsed ok.
---- all 1 pages retrieved and parsed ----
Writing out collected data...
Writing document 'faq' to file /usr/local/htdocs/sun/faq/faqp.pdb
Converting http://sun.drydog.com//index.html...
Wrote 1 <= plucker:/~special~/index
Wrote 2 <= http://sun.drydog.com//index.html
Wrote 3 <= plucker:/~special~/pluckerlinks
Wrote 5 <= plucker:/~special~/metadata
Wrote 95 <= plucker:/~special~/links1
Done!
Generated Plucker PDB file faqp.pdb
Generated zTXT file faqz.pdb
Generated PDF file faq.pdf
$ ls -l
-rw-r--r-- 1 dan dan 136328 Sep  1 18:59 faqd.pdb
-rw-r--r-- 1 dan dan 179972 Sep  1 19:00 faqi.pdb
-rw-r--r-- 1 dan dan 165153 Sep  1 19:00 faq.pdb
-rw-r--r-- 1 dan dan  89804 Sep  1 18:59 faqz.pdb
-rw-r--r-- 1 dan dan 221561 Sep  1 18:00 faq.pdf


Enjoying the Fruits of Labor (or Labour outside the U.S. :-)

Results of files generated by this software can be seen at my Solaris x86 FAQ website, http://sun.drydog.com/faq/, which has the Solaris x86 FAQ available in multiple formats. I also have dozens of ebooks available at the Yosemite Online Library, http://www.yosemite.ca.us/library/.

Ebook file formats come in many shapes and sizes, more than are necessary, in fact. If you know of other ebook file formats not covered here, please leave a comment. They must have freely-available converter software that runs on UNIX-class operating systems (such as Solaris and Linux).

—Dan Anderson

(Note: trademarks here are owned by their respective manufacturers.)
