Decimal Floating Point Types

In a couple of weeks I'm off to a meeting of ISO/SC22/WG14, the C programming language committee.  One of the papers on the agenda (N1154) is a proposal for a Technical Report on adding decimal floating point types and arithmetic to the C programming language specification.  The proposal is based on a model of decimal arithmetic which is a formalization of the decimal system of numeration (Algorism), as further defined and constrained by IEEE-854, ANSI X3-274, and the proposed revision of IEEE-754 (known as IEEE-754R).

The proposal adds decimal floating point types to the type hierarchy as basic types, real types and arithmetic types.  The three types are called:
  • _Decimal32
  • _Decimal64
  • _Decimal128
There is a new macro an implementation must define to indicate conformance to this technical report:
  • __STDC_DEC_FP__
The proposal introduces the term generic floating types for the existing floating point types: float, double, and long double.  Together the generic floating types and the decimal floating types are known as the real floating types.
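
As a rough illustration of how the conformance macro might be used, the sketch below selects a decimal type only when __STDC_DEC_FP__ is defined and otherwise falls back to double.  The names money_t and have_decimal_fp are hypothetical, not part of the proposal, and a compiler that does not define the macro will simply compile the fallback branch.

```c
/* Sketch: choose a type for monetary amounts based on the TR's
   conformance macro. money_t and have_decimal_fp are hypothetical names. */
#ifdef __STDC_DEC_FP__
typedef _Decimal64 money_t;   /* decimal floating type from the proposal */
#else
typedef double money_t;       /* fallback: a generic floating type */
#endif

/* Returns 1 if the implementation claims conformance to the TR, else 0. */
int have_decimal_fp(void)
{
#ifdef __STDC_DEC_FP__
    return 1;
#else
    return 0;
#endif
}
```

Code written this way keeps building on implementations without decimal support, at the cost of binary rounding behavior in the fallback case.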

The three decimal encoding formats defined in IEEE-754R correspond to the three decimal floating types as follows:
  • _Decimal32 is a decimal32 number, which is encoded in four consecutive octets (32 bits)
  • _Decimal64 is a decimal64 number, which is encoded in eight consecutive octets (64 bits)
  • _Decimal128 is a decimal128 number, which is encoded in 16 consecutive octets (128 bits)
The details of the formats are given in IEEE-754R.

New macros similar to those of <float.h> are defined in a new header, <decfloat.h>: for example, DEC_EVAL_METHOD, DEC32_MANT_DIG, DEC64_MANT_DIG, and DEC128_MANT_DIG.  The prefixes DEC32_, DEC64_, and DEC128_ denote the types _Decimal32, _Decimal64, and _Decimal128 respectively.

Conversion from a decimal floating type to an integer type is as you would expect: the fractional part is discarded (the value is truncated towards zero).  If the value cannot be represented by the integer type, the result depends on whether the integer type is signed.  If it is unsigned, the result is the largest representable number when the value is positive, and 0 otherwise.  If it is signed, the result is the most negative or most positive representable number, according to the sign of the floating point number.
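
The rule described above can be modelled in plain C.  The sketch below is only an illustration: it uses double as a stand-in for a decimal floating type (so it compiles without decimal support), the function names are hypothetical, and the handling of NaN is an arbitrary choice because the text above does not cover it.

```c
#include <stdint.h>

/* Sketch of the described conversion to a signed 32-bit integer:
   truncate toward zero, and clamp values that do not fit. */
int32_t to_int32_sketch(double x)
{
    if (x != x)                      /* NaN: not covered by the text, 0 is arbitrary */
        return 0;
    if (x >= 2147483647.0)           /* too large: most positive value */
        return INT32_MAX;
    if (x <= -2147483648.0)          /* too small: most negative value */
        return INT32_MIN;
    return (int32_t)x;               /* in range: C truncates toward zero */
}

/* Sketch of the described conversion to an unsigned 32-bit integer. */
uint32_t to_uint32_sketch(double x)
{
    if (x != x)                      /* NaN: arbitrary choice, as above */
        return 0;
    if (x < 0.0)                     /* negative: result is 0 */
        return 0;
    if (x >= 4294967295.0)           /* too large: largest representable value */
        return UINT32_MAX;
    return (uint32_t)x;              /* in range: fractional part discarded */
}
```

For instance, to_int32_sketch(-2.9) yields -2, while to_uint32_sketch(-5.0) yields 0.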

For conversion from integer to decimal floating type, if the value being converted can be represented exactly, it is unchanged.  If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is correctly rounded.  If the value being converted is outside the range of values that can be represented, the result is positive or negative infinity depending on the sign of the value being converted, and the “overflow” floating-point exception will be raised.
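
To make "correctly rounded" concrete, here is a small sketch that rounds an unsigned integer to a given number of significant decimal digits, which is what has to happen when, say, a nine-digit integer is stored in a format with a seven-digit coefficient such as decimal32.  It assumes the rounding direction is round-half-to-even; rounded_t and round_to_digits_sketch are hypothetical names, and the digit count is assumed to be small (for example 7 or 16).

```c
#include <stdint.h>

/* Result of rounding: value == coeff * 10^exponent. */
typedef struct {
    uint64_t coeff;
    int exponent;
} rounded_t;

/* Round v to at most `digits` significant decimal digits (1 <= digits <= 18),
   with ties going to the even coefficient. */
rounded_t round_to_digits_sketch(uint64_t v, int digits)
{
    uint64_t limit = 1;                 /* 10^digits */
    for (int i = 0; i < digits; i++)
        limit *= 10;

    uint64_t div = 1;                   /* 10^(number of digits dropped) */
    int e = 0;
    while (v / div >= limit) {
        div *= 10;
        e++;
    }

    uint64_t q = v / div;               /* kept digits */
    uint64_t rem = v % div;             /* dropped digits */
    uint64_t half = div / 2;
    if (div > 1 && (rem > half || (rem == half && (q & 1))))
        q++;                            /* round up, ties to even */

    if (q == limit) {                   /* carry produced an extra digit */
        q /= 10;
        e++;
    }

    rounded_t r = { q, e };
    return r;
}
```

With seven digits, 123456789 becomes 1234568 x 10^2, while 1234567 fits exactly and is left unchanged.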

For conversion between generic floating types and decimal floating types, the TR's rules are similar to the existing ones for float, double and long double, except that when the result cannot be represented exactly, the behavior is tightened to require correct rounding.

The TR does not add complex or imaginary decimal floating types.  However, it does add rules for conversion between complex or imaginary types and decimal floating types, equivalent to those that exist for conversion to and from generic floating types.

Determining the common type for mixed operations between decimal and other real types is difficult because their ranges overlap.  Mixed mode operations are therefore not allowed, and explicit casts are required.  Implicit conversions are allowed only for simple assignment and in argument passing.

There is no default argument promotion specified for the decimal floating types.

The new suffixes to denote decimal floating constants are: DF for _Decimal32, DD for _Decimal64, and DL for _Decimal128.
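
A short sketch of how the suffixes and the explicit-cast rule might look in practice follows.  The decimal code is guarded by __STDC_DEC_FP__, so on an implementation without decimal support the function just returns -1; tax_cents_sketch and the values used are hypothetical.

```c
/* Returns a tax amount in cents computed with decimal arithmetic, or -1
   when the implementation does not define __STDC_DEC_FP__. */
int tax_cents_sketch(void)
{
#ifdef __STDC_DEC_FP__
    _Decimal64 price = 20.00DD;       /* DD suffix: constant of type _Decimal64 */
    double rate = 0.25;               /* generic floating type */

    /* price * rate would mix decimal and generic floating operands, which
       the proposal disallows; the cast makes the conversion explicit. */
    _Decimal64 tax = price * (_Decimal64)rate;

    return (int)(tax * 100.0DD);      /* 5.00 * 100, truncated to an integer */
#else
    return -1;                        /* no decimal floating point available */
#endif
}
```

The rate 0.25 was picked because it is exactly representable in binary, so the cast does not introduce a rounding error of its own.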

It would help usability if unsuffixed floating constants could be used to initialize decimal floating types.  For example, 0.1 has type double, and in implementations where FLT_EVAL_METHOD is not -1, the internal representation of 0.1 is not exact.  This defeats the purpose of decimal floating types.  So the proposal introduces a translation time data type (TTDT), which the translator uses as the type for unsuffixed floating constants.  An unsuffixed floating constant is kept as a TTDT until an operation requires it to be converted to an actual type.  The value of the constant remains exact for as long as possible during the translation process.
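
The inexactness of 0.1 as a double is easy to observe.  The sketch below adds 0.1 ten times in binary floating point and, for comparison, keeps the same sum as a count of tenths in an integer, which stands in here for exact decimal arithmetic.  Both function names are hypothetical, and the comparison assumes a typical IEEE 754 64-bit double.

```c
/* Adds 0.1 ten times using binary floating point. */
double sum_tenths_binary(void)
{
    volatile double sum = 0.0;   /* volatile: store as double after each add */
    for (int i = 0; i < 10; i++)
        sum += 0.1;              /* 0.1 is only approximated in binary */
    return sum;
}

/* The same sum held as a scaled integer, in units of one tenth. */
long sum_tenths_scaled(void)
{
    long tenths = 0;
    for (int i = 0; i < 10; i++)
        tenths += 1;             /* one tenth, represented exactly */
    return tenths;               /* 10 tenths is exactly 1 */
}
```

On such an implementation sum_tenths_binary() does not compare equal to 1.0, whereas the scaled sum is exactly ten tenths.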

The concept can be summarized as follows:
  • The implementation defines the type, including the limits on the constants that the type can represent exactly.  It will probably be the widest decimal floating type.
  • The range and precision of the type are implementation defined and are fixed throughout the program.
  • The type is an arithmetic type.  All arithmetic operations are defined for this type.
  • The usual arithmetic conversions are extended to handle mixed operations between the TTDT and other types.  If an operation involves both a TTDT operand and an operand of an actual type, the TTDT operand is converted to an actual type before the operation.

double f;
f = 0.1;

Suppose the implementation uses _Decimal128 as the TTDT. 0.1 is represented exactly after the constant is scanned. It is then converted to double in the assignment operator.

f = 0.1 * 0.3;

Here, both 0.1 and 0.3 are represented in TTDT.  If the compiler evaluates the expression at translation time, it would be done using TTDT, and the result would be TTDT.  This is then converted to double before the assignment.  If the compiler generates code to evaluate the expression at execution time, both 0.1 and 0.3 would be converted to double before the multiply.  The result of the former could differ from that of the latter, and would be the more precise of the two.

float g = 0.3f;
f = 0.1 * g;

When one operand is a TTDT and the other is one of float/double/long double, the TTDT is converted to double with an internal representation following the specification of FLT_EVAL_METHOD for constants of type double.  The usual arithmetic conversions are then applied to the resulting operands.

_Decimal32 h = 0.1;

If one operand is a TTDT and the other a decimal floating type, the TTDT is converted to _Decimal64 with an internal representation specified by DEC_EVAL_METHOD. Usual arithmetic conversion is then applied.

If one operand is a TTDT and the other a decimal floating type, the TTDT is converted to the decimal floating type.

The floating-point environment <fenv.h> specified in C99 also applies to decimal floating types.  The decimal floating-point arithmetic specified is more stringent.  All the rounding directions and flags are supported.

Certain algorithms stipulate a precision for the result of an operation, and this precision could be different from those of the three standard types.  The technical report adds a pragma directive to control this at translation time.


A host of new functions are added to <math.h> to support the new decimal floating types, along with the new macros HUGE_VAL_D32, HUGE_VAL_D64, HUGE_VAL_D128, DEC_INFINITY and DEC_NAN to help in using these functions.  The functions are equivalent to the existing generic floating type functions, with d32, d64, and d128 suffixes added for the decimal floating type versions.  Similarly, equivalent functions to support decimal floating types are added to <stdlib.h> and <wchar.h>, and macros to <tgmath.h>.

Finally, new quantize functions are added to <math.h>.  These functions set the exponent of argument x to the exponent of argument y, while attempting to keep the value the same.
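
The effect of quantize can be sketched on a simple model of a decimal number as an integer coefficient and a power-of-ten exponent.  This is not the library function itself: dec_t and quantize_sketch are hypothetical names, round-half-to-even is assumed when digits must be dropped, and the overflow and error cases a real implementation has to handle are ignored.

```c
#include <stdint.h>

/* Model of a decimal value: coeff * 10^exponent. */
typedef struct {
    int64_t coeff;
    int exponent;
} dec_t;

/* Give x the exponent of y, keeping the value the same where possible.
   Assumes the exponent difference is small enough not to overflow. */
dec_t quantize_sketch(dec_t x, dec_t y)
{
    dec_t r = { x.coeff, y.exponent };

    if (x.exponent > y.exponent) {
        /* Smaller target exponent: append zeros, value is unchanged. */
        for (int i = 0; i < x.exponent - y.exponent; i++)
            r.coeff *= 10;
    } else if (x.exponent < y.exponent) {
        /* Larger target exponent: drop digits, rounding ties to even. */
        int64_t div = 1;
        for (int i = 0; i < y.exponent - x.exponent; i++)
            div *= 10;
        int64_t mag = x.coeff < 0 ? -x.coeff : x.coeff;
        int64_t q = mag / div;
        int64_t rem = mag % div;
        if (rem * 2 > div || (rem * 2 == div && (q & 1)))
            q++;
        r.coeff = x.coeff < 0 ? -q : q;
    }
    return r;
}
```

Quantizing 2.175 (2175 x 10^-3) to the exponent of 0.01 gives 218 x 10^-2, that is 2.18, while quantizing 5 to the same exponent gives 500 x 10^-2, that is 5.00 with its value unchanged.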

For a look at the full document and the rationale see:



Douglas is a principal software engineer working as the C compiler project lead and the Oracle Solaris Studio technical lead.

