Overview


The floating point unit consists of a four-stage pipeline with a one-entry instruction queue. The purpose of the queue (F1) is to hold an instruction if the previous FP instruction has stalled. There is also a 32-entry register file.

All floating point instructions are some combination of FP add and FP multiply, often using the fmadd instruction. See section 7.3.4 of the PPC 601 user's manual for a more thorough discussion.

Implementation Overview


The FPU's functionality is divided among eight main classes, which are arranged hierarchically in three layers. The FPU class is the top-level class, serving as the shell for instances of the FPUnitLoadWB, FPUnitRegs, and FPUnitPipeline classes. The FPUnitPipeline class, in turn, instantiates the remaining classes, which are the heart of the pipeline: FPUnitDecode, FPUnitMultiply, FPUnitAdd, and FPUnitAddWB. The FPUnitDecode class handles the duties of the PPC601's FD stage, and contains both the F1 and FD portions of that stage. The FPUnitMultiply class performs functions similar to those of the PPC601's FPM stage. Similarly, the FPUnitAdd class handles the needs of the PPC601's FPA stage. Class FPUnitAddWB takes care of the same tasks as the PPC601's FWA stage, much as class FPUnitLoadWB corresponds to the FWL stage in the PPC601. Finally, the FPUnitRegs class is responsible for maintaining the floating point registers, as the PPC601's special register hardware does.

Class FPUnitRegs

The FPUnitRegs class maintains the contents of the FPU's 32 special floating point registers. These registers have been implemented as double precision. There are four main methods for reading and writing the FPU registers; they can only be called from within the FPU.

  double  ReadFloatReg(unsigned _reg);
  double  ReadReg(unsigned _reg);
  void WriteReg(unsigned _reg, double _data);
  void WriteFloatReg(unsigned _reg, double  _data);

WriteReg allocates an available shadow register and writes the given value to it; WriteFloatReg consists of a single call to WriteReg. Likewise, ReadReg returns the value of the specified register as a double, and ReadFloatReg consists of a single call to ReadReg.

Class FPUnitPipeline

The FPUnitPipeline class serves as a shell for the computational components of the FPU and creates the instances of the FPUnitDecode, FPUnitMultiply, FPUnitAdd, and FPUnitAddWB classes. FPUnitPipeline steps these modules in reverse order, last stage first. This is necessary to stay coherent with respect to the many dependencies between the pipeline stages: for example, FPUnitDecode depends on the instruction held in FPUnitMultiply, which must therefore be updated first. Note that FPUnitPipeline does not direct the FPUnitLoadWB module.
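
The effect of stepping the stages last-to-first can be illustrated with a minimal sketch. Everything here (PipelineSketch, slot, Cycle) is hypothetical and is not taken from the simulator; instructions are reduced to integer ids and stalls are ignored. The point is only that visiting the last stage first vacates its slot before its predecessor tries to advance.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical four-slot pipeline: index 0 = decode, 3 = add write-back.
// A slot holds an instruction id, or 0 when it is empty.
struct PipelineSketch {
    static constexpr std::size_t kStages = 4;
    int slot[kStages];

    PipelineSketch() : slot() {}          // all slots start empty

    // Advance one cycle, visiting stages from last to first so that each
    // stage sees its successor's slot already vacated for this cycle.
    void Cycle(int incoming) {
        assert(incoming >= 0);            // 0 means "no new instruction"
        slot[kStages - 1] = 0;            // write-back retires its instruction
        for (std::size_t i = kStages - 1; i > 0; --i) {
            if (slot[i] == 0) {           // successor is free: move forward
                slot[i] = slot[i - 1];
                slot[i - 1] = 0;
            }
        }
        if (slot[0] == 0) slot[0] = incoming;
    }
};
```

Had the loop run first-to-last instead, each stage would find its successor still occupied and nothing would advance without extra buffering.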

Common Sub-class Features

All of the sub-classes (i.e., the actual stages within the pipeline) have the following common member functions.

 
 bool IsStalled()
 void Load(Instruction* i ...)
 Instruction* CurInstruc()

All sub-classes also contain the data members

Instruction *latch, *current

The latch holds the loaded instruction field until the given sub-class updates the current instruction with the latch value.

Like all processing units and sub-units in the simulator, each of these classes has three member functions which, when called, perform a portion of the work for that cycle.

  void StartStage()
  void DoStage()
  void EndStage()

The intent of StartStage, DoStage, and EndStage is to support non-deterministic parallelism. These functions represent the start, do, and end phases of the clock cycle. What is actually done in each phase varies widely among the pipeline modules.
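
As a rough illustration of how the latch, the current instruction, and the three phases might fit together, here is a sketch of a generic stage. StageSketch and the placeholder Instruction struct are hypothetical. Passing the successor to the constructor mirrors the way the real stage classes receive pointers to later stages, but the hand-off logic in EndStage is an assumption, not the simulator's code.

```cpp
#include <cassert>

struct Instruction { int id; };   // placeholder for the simulator's type

// Hypothetical stage showing the latch/current hand-off.
class StageSketch {
public:
    explicit StageSketch(StageSketch* next_stage)
        : next(next_stage), latch(nullptr), current(nullptr), stalled(false) {}

    bool IsStalled() const { return stalled; }
    Instruction* CurInstruc() const { return current; }

    // Called by the predecessor; the instruction waits in the latch.
    void Load(Instruction* i) {
        assert(latch == nullptr);  // a pending instruction is never overwritten
        latch = i;
    }

    void StartStage() {            // pick up whatever was latched last cycle
        if (!stalled) { current = latch; latch = nullptr; }
    }
    void DoStage() { /* stage-specific work on `current` would go here */ }
    void EndStage() {              // hand off only if the successor can accept
        stalled = (next != nullptr && next->IsStalled());
        if (!stalled && next != nullptr && current != nullptr)
            next->Load(current);
    }

private:
    StageSketch* next;
    Instruction* latch;
    Instruction* current;
    bool stalled;
};
```

In this sketch the stages are expected to be stepped last-to-first within each phase, so a stage knows its successor's stall state before deciding whether to hand off.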

Class FPUnitDecode

Instantiated by

 FPUnitDecode(FPUnitMultiply *_multiply, FPUnitAdd *_add,
	       FPUnitAddWB *_add_WB, IU *_iu, FPUnitLoadWB *_load_WB)

the FPUnitDecode class performs many functions, chief among them read-after-write (RAW) data hazard checking. To this end, pointers to the FPUnit Multiply, Add, and AddWB stages are provided. Using the 'TargetReg()' functions from these stages, the FPUnitDecode class handles register conflicts that may occur if the current instruction is allowed to proceed. If the current target register matches the target register in any of the subsequent pipeline stages (except FPUnitLoadWB), then a stall occurs. Because the state of the decode stage must be determined early, all data hazard checking is done in the do phase. Further, since the PPC601's FPU decode stage contains the buffer F1 (which fills when the FD stage is stalled and empties when FD is not), the shuffling of sub-stage instruction contents, as well as the current 'stalled' state, is determined in the start phase to preclude needless processing in the do phase. Defined 'stalled' states are implemented as

enum {NO_STALL, HAZARD_STALL, FD_STALL, F1_STALL} stalled;

The end phase arbitrates the passing of the current instruction to the next stage as well as setting the 'stalled' state of FPUnitDecode as determined by IU::CompletedIC.
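
The target-register comparison described above could look roughly like the following sketch. DownstreamSketch, kNoReg, and CheckHazard are hypothetical names; in particular, using -1 to mark a stage with no pending target is an assumption of this sketch rather than a convention taken from the simulator.

```cpp
#include <cassert>

// Marker for "this stage has no pending target register" (an assumption).
const int kNoReg = -1;

// Hypothetical stand-in for a later stage that exposes TargetReg().
struct DownstreamSketch {
    int target;
    explicit DownstreamSketch(int t = kNoReg) : target(t) {}
    int TargetReg() const { return target; }
};

enum StallState { NO_STALL, HAZARD_STALL, FD_STALL, F1_STALL };

// Stall decode when its target register is still pending in a later stage.
// FPUnitLoadWB is deliberately not consulted, as in the text.
StallState CheckHazard(int decodeTarget,
                       const DownstreamSketch& multiply,
                       const DownstreamSketch& add,
                       const DownstreamSketch& addWB) {
    assert(decodeTarget >= kNoReg && decodeTarget < 32);  // 32 FP registers
    if (decodeTarget == kNoReg) return NO_STALL;
    if (decodeTarget == multiply.TargetReg() ||
        decodeTarget == add.TargetReg() ||
        decodeTarget == addWB.TargetReg())
        return HAZARD_STALL;
    return NO_STALL;
}
```

The FD_STALL and F1_STALL states are listed only to match the enum above; this sketch never produces them.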

Class FPUnitMultiply

Instantiated by

FPUnitMultiply(FPUnitAdd *_add, FPUnitRegs *_regs)

class FPUnitMultiply() is passed a pointer to the FPU registers, which are used for computation and register checking. The start phase assigns the latched instruction to the current instruction. The do phase does the lion's share of the work in this sub-class: here the multiplicative portion of an arithmetic instruction is computed using the PPC601's standard fmadd base algorithm, AC + B = D, where D is the computed value, A and C are the operands, and B is set to 0. The end phase arbitrates the passing of the current instruction to the FPUnitAdd module. An instance of the data structure

struct FPU::fmaddBASE
  {
    double A;
    double B;
    double C;
    double D;
  };

is passed here to the next stage to continue relevant arithmetic operations.

The procedure WORD FPMTargetReg() provides the current target register number to the FPUnitDecode module for data hazard checking.

Class FPUnitAdd

Instantiated by

FPUnitAdd(FPUnitAddWB *_writeback, FPUnitRegs *_regs)

Class FPUnitAdd contains features nearly identical to those of class FPUnitMultiply. The main difference is in how it uses the standard fmadd base algorithm, AC + B = D. Here, C is set to 1, and B is the additive value as dictated by the instruction.
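
To make the AC + B = D convention concrete, here is a small sketch of how one fmaddBASE record could carry a value through the multiply and add steps. The helper names (MultiplyStep, AddStep, MakeMul, MakeAdd, MakeMadd, Run) are hypothetical, and exactly how the real stages divide the work is an assumption. Note also that this sketch rounds after the multiply and again after the add, whereas a true fused multiply-add rounds only once.

```cpp
#include <cassert>

struct fmaddBASE { double A, B, C, D; };

// Multiply step: form the product A*C and leave B untouched for later.
void MultiplyStep(fmaddBASE& f) { f.D = f.A * f.C; }

// Add step: fold the addend B into the product carried in D.
void AddStep(fmaddBASE& f) { f.D = f.D + f.B; }

// Operand setup: a pure multiply uses B = 0, a pure add uses C = 1.
fmaddBASE MakeMul(double a, double c)            { fmaddBASE f = {a, 0.0, c, 0.0}; return f; }
fmaddBASE MakeAdd(double a, double b)            { fmaddBASE f = {a, b, 1.0, 0.0}; return f; }
fmaddBASE MakeMadd(double a, double c, double b) { fmaddBASE f = {a, b, c, 0.0}; return f; }

// Run both steps back to back and return D.
double Run(fmaddBASE f) { MultiplyStep(f); AddStep(f); return f.D; }

// Usage: all three instruction shapes go through the same two steps.
void DemoFmadd() {
    assert(Run(MakeMul(3.0, 4.0)) == 12.0);        // 3*4 + 0
    assert(Run(MakeAdd(3.0, 4.0)) == 7.0);         // 3*1 + 4
    assert(Run(MakeMadd(2.0, 5.0, 1.0)) == 11.0);  // 2*5 + 1
}
```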

Class FPUnitAddWB

Instantiated by

FPUnitAddWB(FPUnitRegs *_regs, Cache *_cache, IU *_iu)

Class FPUnitAddWB performs all register manipulations within class FPU, with the exception of those done by FPUnitLoadWB, which handles all load instructions. It also supplies store data to the cache. A pointer to the IU module is passed because it is needed for coordinating FPU and IU instructions that were dispatched out of order.

In the start phase, the first thing done is a call to IU::CompleteIC(). If the IC tag of the current instruction in FPUnitAddWB is less than or equal to the IC tag of the instruction in the IU, then processing may proceed. Otherwise, FPUnitAddWB stalls. Thus, the start phase determines the state of FPUnitAddWB for this cycle. Its other function is to pass store data to the cache, querying the cache using Cache::FloatStoreDataReady().

In the do phase nothing happens. In the end phase, all register manipulations, such as updating registers to reflect arithmetic instruction results, are accomplished.

Class FPUnitLoadWB

Instantiated by

 FPUnitLoadWB(FPUnitRegs *_regs, FPUnitMultiply *_multiply,
	       FPUnitAdd *_add, FPUnitAddWB *_add_writeback)

The only purpose of FPUnitLoadWB is to update the FPU registers with load data. This is not a trivial task, since premature updates can cause data hazards. To this end, pointers to all other pipeline stages are passed.

In the start phase, dependencies are checked by comparing the current instruction to all store instructions within the FPU pipeline. FPUnitLoadWB stalls if one of those instructions has priority over the current instruction in FPUnitLoadWB.

In the do phase, the registers are updated using the supplied load data. Nothing happens in the end phase.

Assumptions and Limitations

While not a perfect emulation of the PPC601's FPU, class FPU does provide a close approximation of that pipeline's performance. Since none of the references used in coding the FPU discusses the logic of the individual pipeline stages in great detail, the FPU may need to be modified in the future as more accurate information becomes available.

Also, communication between the BPU and FPU is lacking. Ideally, when predicting branches, the BPU should interact with the FPU much as it does with the integer unit: the FPU should have analogues of IU::SetPredict, IU::FlushPredict, and IU::ClearPredict. As it stands, any results gained from mixing branch and floating point instructions are suspect at best.

Certain FPU registers are privileged in the PPC 601. For simplicity, this constraint was ignored. This made it easier to create test cases with the gcc compiler, which is unaware of these restrictions.

Suggestions

In order to accurately represent FP doubles, a new data type, WORDS, may need to be created. This would be an abstract data type which could hold an arbitrary number of words (32 bit values). In addition, it can be designed to return a float, WORD, or double, or even an instruction.

// words.hh -- Defines WORDS datatype

// Note that this module makes some assumptions about size, namely that
// sizeof (float) == sizeof (long) && sizeof (double) == 2 * sizeof (long)

class WORDS
{
public:
  WORDS ();
  WORDS (const WORDS&);
  ~WORDS (void) { delete[] data; }  // assumes data was allocated with new[]

  WORDS& operator = (const WORDS &);

  WORDS& operator = (const double);
  WORDS& operator = (const float);
  WORDS& operator = (const int);
  WORDS& operator = (const long);
  WORDS& operator = (const unsigned long);

  operator unsigned long (void);
  operator long (void);
  operator int (void);
  operator float (void);
  operator double (void);

private:
  long *data;
  int   size;
};
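
One of the conversions WORDS would need is assembling a double from two 32-bit words and splitting it again. A minimal sketch follows, assuming the host uses 64-bit IEEE-754 doubles and that the words arrive most significant first. WordsToDouble and DoubleToWords are hypothetical names. Fixed-width types are used instead of long because sizeof (long) is 8 on many current platforms, which would break the size assumption noted in words.hh.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Combine two 32-bit words (most significant first) into a double.
// memcpy is used so that no aliasing rules are violated.
double WordsToDouble(std::uint32_t hi, std::uint32_t lo) {
    static_assert(sizeof(double) == 8, "sketch assumes a 64-bit double");
    std::uint64_t bits = (static_cast<std::uint64_t>(hi) << 32) | lo;
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Inverse: split a double into its high and low 32-bit words.
void DoubleToWords(double d, std::uint32_t& hi, std::uint32_t& lo) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    hi = static_cast<std::uint32_t>(bits >> 32);
    lo = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
}

// Usage: 1.0 is encoded as 0x3FF00000 00000000.
void DemoWords() {
    assert(WordsToDouble(0x3FF00000u, 0u) == 1.0);
}
```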

The cache/memory code already has hooks in it to return multiple bytes (e.g. see member function of class Cache). Some work needs to be done there to dump the values into a WORDS instead of a double. Likewise, functions that pass around doubles need to be modified to pass around WORDS. If WORDS has enough overloaded operator ='s, the transition should not be too difficult. As a result, implementing double precision FP loads and stores should be greatly simplified, since they become natural extensions of the single precision operations.

As mentioned above, I think we may need to design FP single and double classes to better simulate the behavior of the FPU. Such a class can simply have data members for all the components of the IEEE-754 FP representation. E.g.

class IEEE754Double
{
public:
    IEEE754Double (WORDS);
    IEEE754Double& operator = (const WORDS);
    bool isNAN ();
    bool isInfinity ();
    // etc...
private:
    int sign, exp;            // operations which modify these would
    unsigned long long frac;  // mask off the appropriate number of bits;
                              // the 52-bit fraction does not fit in an int
};
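
The field extraction such a class would perform can be sketched as follows, assuming the host double is a 64-bit IEEE-754 value (1 sign bit, 11 exponent bits, 52 fraction bits). DoubleFields and Decode are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical decoded view of an IEEE-754 double.
struct DoubleFields {
    unsigned      sign;   // 0 or 1
    unsigned      exp;    // biased exponent, 0..2047
    std::uint64_t frac;   // low 52 bits of the encoding

    // An all-ones exponent marks the special values.
    bool isNaN() const      { return exp == 0x7FF && frac != 0; }
    bool isInfinity() const { return exp == 0x7FF && frac == 0; }
};

// Copy the bit pattern out of the double and mask off each field.
DoubleFields Decode(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    DoubleFields f;
    f.sign = static_cast<unsigned>(bits >> 63);
    f.exp  = static_cast<unsigned>((bits >> 52) & 0x7FF);
    f.frac = bits & ((std::uint64_t(1) << 52) - 1);
    return f;
}

// Usage: 1.0 has sign 0, biased exponent 1023, and a zero fraction.
void DemoDecode() {
    DoubleFields one = Decode(1.0);
    assert(one.sign == 0 && one.exp == 1023 && one.frac == 0);
}
```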

Additionally, it is not hard to implement compare, convert to int, etc., since all of these procedures are documented in the manual, as are normalization, NaNs, etc.

Some thoughts on the floating point store buffer (FPSB): an examination of the FP store code in iu/execute.cc indicates that the FPSB functionality is subsumed into the FP store code. Whether it is necessary to separate that code out into a distinct FPSB object is unclear; James Maxwell's tests on the stfs instruction indicate that it is currently broken, taking far more cycles than predicted. According to table 7.4 in the User's guide, all FP store instructions pass through the FPSB. Accordingly, if the FPSB is to be created, the FP store instructions should be modified to pass the instruction to the store buffer instead of completing it, and the code that now resides in the store instructions should be moved to the FPSB. Furthermore, additional work may need to be done on the memory subsystem to get the numbers to come out right.

References

International Business Machines Corporation. 1993. PowerPC 601 RISC Microprocessor User's Manual.

Weiss, Shlomo. 1994. Power and PowerPC. San Francisco, CA: Morgan Kaufmann Publishers, Inc.

Patterson, David; Hennessy, John. 1996. Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc.
