[pdftex] Incredibly PDF filesize difference by pdflatex compared to ps2pdf

Reinhard Kotucha reinhard.kotucha at web.de
Wed Jun 16 01:13:09 CEST 2004


>>>>> "Roboco" == Roboco Sanchez <roboco2004 at yahoo.com> writes:

    > Hello, Anyone can tell me if this is a bug? I have 2 pdf files
    > created differently on my latest MiKTeX and the output file
    > sizes are incredibly different.

    > 1. pdflatex -> 1.3MB 
    > 2. latex+dvips+ps2pdf -> 378KB

    > I've tried \pdfcompresslevel9 but it doesn't help.

...as expected, 9 is the default on many systems.

    > From the log files:
    > pdflatex: 1011303 words of font info for 152 fonts 
    > latex: 650836 words of font info for 86 fonts

I'm wondering why the log files are so different.  I suppose that the
same tfm files are used.


First a few words about Type1 font internals:

An important feature of Type1 fonts is the so-called "hints".

The description of the outline of a glyph is not sufficient for
rendering on low-resolution devices.  Hints provide additional
information, for instance that the three vertical stems of an "m"
should have the same width.

Nowadays some people think that hints are not necessary because modern
printers have a reasonably high resolution and most screen devices do
anti-aliasing (smoothing).  But omitting hints is questionable,
especially if you want to create device-independent files.


A Type1 font consists of three parts:

  1. A header providing general information about the font.  The first
     part of the header is readable ASCII text, the second part is
     encrypted.  The header might contain an encoding vector or at
     least the name of a standardized encoding.

  2. A set of subroutines.  They can contain outlines which are needed
     by many glyphs, for instance dots are needed by glyphs like "i",
     "j", "ä" ...  Though the Type1 renderer provides an internal
     routine for placing accents over non-accented letters, this only
     works for fonts using Adobe StandardEncoding.  Other fonts like
     Vietnamese have to make use of subroutines.  Most subroutines
     contain hints.  Most hints *have* to be put into subroutines so
     that stupid programs can ignore them easily.

  3. Glyph descriptions.  A glyph is described by a set of straight
     lines, curves (3rd order Bezier curves) and calls to subroutines.


Glyphs have a name like /a, /egrave or /semicolon.  But subroutines
are put into an array and are accessed by an index to this array (a
number).  Some Adobe software cannot deal with sparse arrays,
i.e. arrays where some elements do not exist.

The reason to omit subroutines is that if you only need a subset of a
font, you only need the subroutines used by the glyph descriptions you
need.  But simply deleting the unused ones would leave a sparse array.
dvips solves the problem by replacing the content of an unused
subroutine with a "return" statement.  This is less efficient than
renumbering all needed subroutines and their calls, but quite safe.

pdftex does not modify subroutines, it just includes all of them up to
the one with the highest number and omits the rest.  I'm not 100%
sure, so I'm still interested in the output of dvips.  AFAIK dvips and
pdftex share the same source code but behave slightly differently.
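
To make the difference between the two strategies concrete, here is a
minimal sketch in Python.  It is a toy model only: the subroutine array
is a plain list of strings, the function names are made up for this
illustration, and nothing here is taken from the dvips or pdftex
sources.  The pdftex variant follows the description above, which is
itself not 100% certain.

```python
# Toy model: the subroutine array is a list of bodies (plain strings).
# Both functions are illustrations, not code from dvips or pdftex.

def stub_unused(subrs, used):
    """dvips-like: keep the array length, replace unused bodies by 'return'."""
    return [body if i in used else "return" for i, body in enumerate(subrs)]

def truncate_after_highest(subrs, used):
    """pdftex-like (as described above): keep all entries up to the
    highest used index, unmodified, and drop the rest."""
    if not used:
        return []
    return subrs[: max(used) + 1]

subrs = [
    "hstem-block return",
    "dot-outline return",
    "vstem-block return",
    "flex return",
    "serif return",
]
used = {1, 2}  # indices called by the glyphs we keep (hypothetical)

stubbed = stub_unused(subrs, used)
truncated = truncate_after_highest(subrs, used)

print(stubbed)    # same length as the original, no holes in the array
print(truncated)  # shorter array, but entry 0 is kept although unused
```

Both results are dense arrays, so neither approach produces the sparse
arrays mentioned above.  They differ only in how much unused material
survives.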


Coming back to your files, this is what I've done:

First I converted both files to PostScript using the program pdftops
(which is part of the xpdf distribution), extracted one font
(MinionPro-It) from each file and disassembled them using the program
t1disasm (part of t1utils).


pdftex seems to insert the font as it is.  


ps2pdf obviously resolves all subroutines which contain glyph
descriptions.  That means that each call to a subroutine which
contains lines and curves is replaced by the content of the
subroutine.  This is ok.

All subroutines containing hints are ignored.  IMHO, this is not a
good idea.  The fonts processed by pdftex contain all the information
the font designer put into them, the fonts processed by ps2pdf are of
much lower quality.
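
The behaviour observed for ps2pdf can be pictured with a small Python
sketch.  This is a simplified model of the description above, not
Ghostscript code: charstrings are lists of text commands, subroutines
are called directly as "N callsubr", and the names HINT_OPS,
is_hint_only and flatten are invented for this example.

```python
# Simplified model of "inline outline subroutines, drop hint subroutines".
# Not Ghostscript code; all names here are hypothetical.

HINT_OPS = {"hstem", "vstem", "hstem3", "vstem3"}

def is_hint_only(body):
    """True if every command in the body is a stem hint or 'return'."""
    return all(cmd.split()[-1] in HINT_OPS | {"return"} for cmd in body)

def flatten(glyph, subrs):
    """Replace each 'N callsubr' by the body of subroutine N.
    Subroutines that contain only hints are dropped entirely."""
    out = []
    for cmd in glyph:
        parts = cmd.split()
        if parts[-1] == "callsubr":
            body = subrs[int(parts[0])]
            if is_hint_only(body):
                continue  # the hint information is lost here
            out.extend(flatten([c for c in body if c != "return"], subrs))
        else:
            out.append(cmd)
    return out

subrs = {
    5: ["-12 29 hstem", "92 68 vstem", "return"],    # hints only
    6: ["10 0 rlineto", "0 10 rlineto", "return"],   # outline pieces
}
glyph = ["0 500 hsbw", "5 callsubr", "100 100 rmoveto",
         "6 callsubr", "closepath", "endchar"]

flat = flatten(glyph, subrs)
print(flat)
```

The flattened glyph draws the same outline, but the stem hints from
subroutine 5 are gone.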

BTW, ps2pdf is just a wrapper script which uses Ghostscript.

You can write a program "pdf2pdf" which sends the output of pdflatex
through ghostscript:

pdf2pdf.bat:
------------------------------------------------------------------
gswin32c -sOutputFile=%2 -dNOPAUSE -sDEVICE=pdfwrite %1 -c quit
------------------------------------------------------------------ 

You can then type on the command line:

pdf2pdf <infile> <outfile>

... and you'll see that the resulting file is as small as the one you
created by latex->dvips->ps2pdf.

The reason the file sizes are so different is the fonts you use or, to
be more precise, the stupid program which produced them.

Here is an example from MinionPro-It:

The definition of subroutine No. 11:

dup 11 {
	-12 29 hstem
	273 30 hstem
	605 30 hstem
	92 68 vstem
	312 67 vstem
	420 68 vstem
	640 67 vstem
	return
	} |

Now let's look at the glyph description where it is used:

/percent {
	0 751 hsbw
	-12 29 hstem	#
	273 30 hstem	#
	605 30 hstem	#
	92 68 vstem	#
	312 67 vstem	#
	420 68 vstem	#
	640 67 vstem	#
	9 4 callsubr
	10 4 callsubr
	660 631 rmoveto
	-28 23 rlineto
	-493 -657 rlineto
	28 -24 rlineto
	11 4 callsubr	<=====
	closepath
	212 525 rmoveto
	71 -32 66 -78 vhcurveto
....


Well, it is certainly a good idea to make use of the subroutine as
done in the line "11 4 callsubr", but I don't understand why they
didn't replace the lines I marked with "#" by a subroutine call as
well.  The content is the same.
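
The claim that the content is the same can be checked mechanically.
The following Python sketch uses only the lines quoted above, and the
helper names (ops, find_run) are invented for this illustration.  It
looks for the body of subroutine 11, without its "return", as a
contiguous run inside the beginning of /percent.

```python
def ops(text):
    """One normalized command per line; '#' markers are stripped."""
    return [" ".join(line.split("#")[0].split())
            for line in text.strip().splitlines()]

def find_run(haystack, needle):
    """Index of the first occurrence of needle as a contiguous run, or -1."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

subr11 = """
-12 29 hstem
273 30 hstem
605 30 hstem
92 68 vstem
312 67 vstem
420 68 vstem
640 67 vstem
return
"""

percent_start = """
0 751 hsbw
-12 29 hstem   #
273 30 hstem   #
605 30 hstem   #
92 68 vstem    #
312 67 vstem   #
420 68 vstem   #
640 67 vstem   #
9 4 callsubr
10 4 callsubr
"""

hints = ops(subr11)[:-1]          # drop the trailing "return"
glyph = ops(percent_start)
pos = find_run(glyph, hints)
print(pos)                        # position of the duplicated hint block
```

A result other than -1 means the glyph repeats the subroutine body
verbatim, here directly after the hsbw command.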

Subroutines were invented to make the font smaller.  A piece of
code which is needed everywhere should be placed into a subroutine.

The fonts you use are amazingly funny in this respect.

This is an infinitesimally small excerpt of the subroutine array of
MinionPro-It:

dup 149 {
	hhcurveto
	} |
dup 150 {
	hhcurveto
	} |
dup 151 {
	hhcurveto
	} |
dup 152 {
	hhcurveto
	} |
dup 153 {
	hhcurveto
	} |
dup 154 {
	hhcurveto
	} |


I doubt that it is useful to have subroutines which contain a single
token, but it absolutely doesn't make sense to have two subroutines
with the same content.

But the font you use has 2111 subroutines which only contain the token
"hhcurveto".  Yes, twothousandonehundredeleven!  A single one would be
sufficient.  And even this doesn't make much sense.
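
Counting such duplicates by hand is tedious.  A small Python sketch
like the following could do it on disassembled text.  It assumes that
the subroutines appear in the form "dup N { ... } |" as in the excerpts
above; other fonts may use a different terminator, and the function
names are invented for this example.

```python
import re
from collections import Counter

# Matches "dup <index> { <body> } |" as in the excerpts above (assumption).
SUBR_RE = re.compile(r"dup\s+(\d+)\s*\{(.*?)\}\s*\|", re.S)

def subr_bodies(disassembly):
    """Map subroutine index -> body with whitespace collapsed."""
    return {int(n): " ".join(body.split())
            for n, body in SUBR_RE.findall(disassembly)}

def duplicate_bodies(disassembly):
    """Map each body that occurs more than once to its number of copies."""
    counts = Counter(subr_bodies(disassembly).values())
    return {body: n for body, n in counts.items() if n > 1}

sample = """
dup 11 {
    -12 29 hstem
    return
    } |
dup 149 {
    hhcurveto
    } |
dup 150 {
    hhcurveto
    } |
dup 151 {
    hhcurveto
    } |
"""

dups = duplicate_bodies(sample)
print(dups)
```

Run on the complete output of t1disasm instead of this short sample,
the same function would list every body that is stored more than once.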

This is the reason the file created by pdftex is so much larger, but
the file produced by ps2pdf is of lower quality.

The difference in file size will certainly be much smaller if you use
other fonts.  Try \usepackage{lmodern} instead of Minion and compare
the file sizes.

Regards,
  Reinhard

-- 
----------------------------------------------------------------------------
Reinhard Kotucha			              Phone: +49-511-4592165
Marschnerstr. 25
D-30167 Hannover	                      mailto:reinhard.kotucha at web.de
----------------------------------------------------------------------------
Microsoft isn't the answer. Microsoft is the question, and the answer is NO.
----------------------------------------------------------------------------



