OSG Multithreading Model: Design Philosophy

A New Processing Model for Multithreaded, Multidisplay Scene Graphs

Copyright © 2001 Don Burns

(DB - Apr 28, 2004) This article was written in 2001, before the initial implementation of Producer. The concepts of multi-display management that influenced the design of Producer are discussed. While many things may have been implemented differently than discussed here, the article serves as a backdrop for the concepts within Producer.

A New Proposal


The primary role of a scene graph is to improve performance through culling, state sorting and various other methods, which reduce the load on the graphics rendering engine, allowing a complex scene to be rendered in "real time". The goal of real-time rendering is to render a scene at frame rates high enough to meet expectations of human interaction. Sixty Hz or better is the goal of out-the-window image generation on flight simulators, where anything less produces visual anomalies. Thirty, twenty and even fifteen Hz are considered "interactive", that is, the view is being manipulated by the user and responds within a reasonable time of the user's input. For the purposes of this discussion, we shall refer to frame times for a 60 Hz simulation. Constant frame rates and the ability of the graphics subsystem to synchronize rendering buffer swap to vertical blanking time are assumed. Further, systems with multiple displays assume genlock, or at least frame lock, such that vertical retrace boundaries are synchronous across all graphics subsystems.
 

Multitask, Multidisplay, Single System Image the Traditional Way


The "traditional" method of using a scene graph for real-time rendering is to use separate phases: APP, CULL, DRAW. APP describes the phase within which all dynamic user data is updated, including the position of the camera(s) and positional and attitudinal updates of moving objects. CULL must follow APP and describes the phase within which the scene is sorted by objects visible in the viewing frustum, and secondarily by state to improve rendering performance. CULL updates camera position dependent information, and builds a "display list" for the DRAW phase. DRAW simply traverses the display list and issues OpenGL calls which are passed to the graphics subsystem for processing.
 
 


Figure 1 - The three phases of scene graph processing



In a system with multiple graphics subsystems, it becomes necessary to have a CULL and DRAW phase for each, because CULL will produce unique "display lists" for each subsystem, assuming differing viewing frustums. More than a single APP is not necessary since each view will share the same dynamic data updated by the one APP phase.

Herein is defined the first requirement for multi-tasking in a system with multiple graphics subsystems. A single processor model would need to process the phases serially (e.g. APP, CULL_0, DRAW_0, CULL_1, DRAW_1, CULL_2, DRAW_2), causing a frame to be as long as the aggregate time for all phases. Two task identities can then be defined: 1) a single APP task and 2) a CULL/DRAW task for each of the graphics subsystems.
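The cost of serial processing can be made concrete with a little arithmetic. The sketch below is only an illustration: the per-phase times are invented, the function names are hypothetical, and the parallel case assumes that the APP task and each CULL/DRAW task run on their own processor and overlap across frames.

```python
# Hypothetical per-phase costs in milliseconds; the values are illustrative only.
FRAME_BUDGET_MS = 1000.0 / 60.0  # about 16.667 ms per frame at 60 Hz


def serial_frame_time(app_ms, cull_ms, draw_ms, num_displays):
    """Frame time when one processor runs APP, then CULL_i and DRAW_i for each display."""
    return app_ms + num_displays * (cull_ms + draw_ms)


def parallel_frame_time(app_ms, cull_ms, draw_ms):
    """Frame time when APP and every CULL/DRAW task have their own processor.

    Assuming the tasks overlap across frames, the frame is bounded by the
    slowest task and no longer grows with the number of displays.
    """
    return max(app_ms, cull_ms + draw_ms)


serial = serial_frame_time(2.0, 4.0, 8.0, num_displays=3)
parallel = parallel_frame_time(2.0, 4.0, 8.0)
print(f"serial: {serial:.1f} ms, parallel: {parallel:.1f} ms, "
      f"budget: {FRAME_BUDGET_MS:.3f} ms")
```

With these made-up numbers the serial case overruns the 60 Hz budget on three displays, while the parallel case fits within it.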
 
 


Figure 2 - Breaking the phases up into parallel tasks for a multi-display system

On a multiprocessor system, each of these tasks can be assigned to run in parallel on a separate processor, given enough processors. Further, the CULL/DRAW task can be split into two tasks which can run in parallel as well.

Now, two types of goals take place in a parallel multiprocessing environment: 1) Divide a large task into multiple smaller tasks that can run in parallel and reduce the processing time. Refer to this as Task Division Parallelization. 2) Multiply a task N times and run each instance in parallel without increasing the processing time. Refer to this as Task Aggregation Parallelization. Splitting CULL/DRAW from APP, and then further splitting CULL and DRAW into separate parallel tasks, is an example of Task Division Parallelization. Adding CULL/DRAW paired tasks for each graphics subsystem is an example of Task Aggregation Parallelization.

Several issues arise when running these phases in parallel. First, the phases must process data serially. That is, APP must finish working on data before CULL can start using it. Likewise, DRAW cannot begin to process data generated by CULL until CULL has finished generating it. However, APP need not wait until both CULL and DRAW are done to begin working on the next frame of data, and thus the system is pipelined as demonstrated in the next figure.
 
 


Figure 3 - The "traditional" multidisplay, multiprocessing phase parallelization model
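The pipelining described above can be sketched as a small schedule table. This is an illustration only, assuming that each phase occupies one full frame slot and that a phase in stage k works on the frame that APP started k slots earlier. The function names are hypothetical.

```python
PHASES = ("APP", "CULL", "DRAW")


def pipeline_schedule(num_slots):
    """Return, for each frame slot, which frame number each phase works on.

    The phase in stage k works on frame (slot - k); None means the
    pipeline is still filling and that phase has nothing to do yet.
    """
    schedule = []
    for slot in range(num_slots):
        row = {}
        for stage, phase in enumerate(PHASES):
            frame = slot - stage
            row[phase] = frame if frame >= 0 else None
        schedule.append(row)
    return schedule


def latency_in_frames():
    """Frame slots between APP starting a frame and DRAW finishing it."""
    return len(PHASES)


for slot, row in enumerate(pipeline_schedule(4)):
    print(slot, row)
```

Once the pipeline is full, every slot has all three phases busy on three different frames, which is where the three-frame latency of this model comes from.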



Further, data that is shared between phases must be protected or buffered. Data that is being written to by an upstream phase cannot be read concurrently by a parallel phase. This introduces a large data management requirement for the scene graph software.

The above describes a framework introduced by SGI's Iris Performer in the early nineties. It was good for its time, but has become outdated for a few reasons.

Real time, out-the-window flight simulators that required 60 Hz frame rates held 16.667 milliseconds as the standard time slice each phase had for completing its task. In 1990, SGI was developing real time graphics on processors that ran at 1/60th of the speed of current processors. While graphics were scaled proportionate to graphics processor capabilities, the load requirements for the APP and CULL phases have not grown at the same rate. The graph in the figure above depicts system design based on the assumption that the APP and CULL phases may take an entire frame to process.

Further, increased system bandwidths reduce the load of host-based graphics dispatch, and the DRAW phase must be considered as two separate processing threads, one running on the host and the other running on the graphics subsystem. This subject will be expanded further on.

One last item of note is latency. Flight simulator requirements often allow for a visual response latency of just over three frames. While this time lapse may be founded in actual human behavior research data, it is hard not to assume that it must compromise the ideal to allow for the above processing model.
 

A New Approach


Typical applications running on current hardware that consider 60 Hz the goal for frame rate yield processing times for Pre-CULL (previously referred to as APP) and CULL in the range of less than 1 millisecond and 3-5 milliseconds respectively. The requirement, then, to dedicate an entire frame, or an entire CPU, to each of these phases is no longer pertinent. The diagrams that follow will reflect these representations.

One might argue that there are expensive tasks that can be assigned to Pre-CULL and CULL to increase their run time. However, most of the tasks that are typical of applications that do expensive operations are better run asynchronously to the frame driven part of the application.

Consider first a single processor system with a single graphics subsystem. With the decreased requirement on Pre-CULL and CULL, the phase diagram might look like the following.


Figure 4 - Single processor, single display phase model with pre-cull and cull processing times representative of recent computers



The good news is that all three phases are now executing in one frame, and latency is reduced to one frame. The bad news is that DRAW is allocated much less time than it had before and begins drawing in the middle of the frame. The application still benefits from the use of a scene graph, since CULL will remove all of the scene that is not visible, to alleviate the host to graphics subsystem bandwidth bottleneck, and will sort objects by state changes to optimize graphics pipeline performance.
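The time left for DRAW in this single-frame model follows directly from the figures quoted earlier (Pre-CULL under 1 millisecond, CULL 3-5 milliseconds, a 60 Hz frame). The helper below is a minimal sketch with a hypothetical name, assuming the three phases run back to back with no idle time between them.

```python
FRAME_MS = 1000.0 / 60.0  # frame time at 60 Hz


def draw_window_ms(pre_cull_ms, cull_ms, frame_ms=FRAME_MS):
    """Time left for DRAW when all three phases run back to back in one frame."""
    remaining = frame_ms - pre_cull_ms - cull_ms
    return max(remaining, 0.0)


best = draw_window_ms(1.0, 3.0)   # CULL at the low end of its range
worst = draw_window_ms(1.0, 5.0)  # CULL at the high end of its range
print(f"DRAW window: {worst:.2f} to {best:.2f} ms of a {FRAME_MS:.2f} ms frame")
```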

As system bandwidth and graphics performance continue to increase in capability, it may be that an application with a lifespan longer than the hardware it runs on may have no further requirements than the above. The time allocated for rendering is sufficient. In this case, no special requirements are put on the scene graph software for data protection or management.

Apply this model now to a system with multiple graphics subsystems and multiple processors. To take advantage of the multiprocessing system, we must spawn a main thread, which will run the Pre-CULL phase, and a CULL/DRAW thread for each of the graphics subsystems. To continue, we must assume two aspects about data management:

1) Data written to by Pre-CULL is public, and
2) Data generated by CULL is internal, and separate copies are made for each CULL/DRAW pair.

With these two assumptions, we can now safely thread the phases as follows.
 
 


Figure 5 - Multithreaded phase model for multidisplay system
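The two data-management assumptions can be sketched with ordinary threads. Everything in the example is a stand-in: `cull` is a toy visibility test on one coordinate, `draw` returns strings instead of issuing OpenGL calls, and all names are hypothetical. What it shows is that the public data is written once by the main thread and only read by the CULL/DRAW threads, while each thread keeps its own CULL output.

```python
import threading


def cull(public_state, frustum):
    """Toy CULL: keep objects whose x position lies inside this view's range."""
    lo, hi = frustum
    return [name for name, x in public_state["objects"].items() if lo <= x <= hi]


def draw(display_list):
    """Toy DRAW: stand-in for issuing OpenGL calls."""
    return ["draw " + name for name in display_list]


def cull_draw_thread(public_state, frustum, results, index):
    display_list = cull(public_state, frustum)  # internal, private to this thread
    results[index] = draw(display_list)


def run_frame(public_state, frustums):
    """Run one CULL/DRAW thread per graphics subsystem after Pre-CULL has finished."""
    results = [None] * len(frustums)
    threads = [
        threading.Thread(target=cull_draw_thread, args=(public_state, f, results, i))
        for i, f in enumerate(frustums)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Pre-CULL (main thread) writes the public data before the threads start.
state = {"objects": {"plane": -5.0, "tower": 0.0, "truck": 7.0}}
print(run_frame(state, [(-10.0, -1.0), (-1.0, 1.0), (1.0, 10.0)]))
```

Because no thread writes to the public data while the CULL/DRAW threads run, no locking is needed in this arrangement.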



We have solved the problem of Task Aggregation Parallelization, but have not solved the issue of having a DRAW phase that is significantly shorter than a full frame. To accomplish this, we must break CULL and DRAW up into their own process threads. This will require protecting or buffering data that is generated by CULL and read and processed by DRAW. This subject is discussed in following sections. The phase diagram follows.


Figure 6 - Multithreaded phase model for multidisplay system with CULL and DRAW as separate threads



If you are a hardware vendor, this is a beautiful diagram because it represents 7 CPUs being used to drive three graphics subsystems. A simple argument about reserving CPU 0 for operating system tasks and starting the simulation tasks with CPU 1 will sell an eighth CPU. However, as an engineer, it is hard not to notice a lot of empty space in the diagram. Note also that we've increased latency by a frame. This is still significantly better than the three frame latency of the old model, however.
 

Host DRAW vs. Graphics Subsystem DRAW

Up until now, we have referred to DRAW as a single phase, or a single thread or process. On older systems this was a reasonable working model, as the DRAW phase was affected by host to graphics subsystem bandwidth and slower graphics processing. Today, however, it is important to understand that the DRAW phase that is running on a dedicated CPU on the host is also interacting with another parallel processor running on the graphics subsystem. OpenGL programs do little more than package OpenGL protocol as a stream of tokens and data and pass it off to the graphics subsystem, which processes the stream and does the actual matrix transforms and rendering. Consequently, the host-based DRAW begins slightly ahead of the graphics subsystem DRAW and finishes before (sometimes long before) the graphics subsystem finishes processing. Anyone who has ever done graphics benchmarking with only host-based timing tools knows this issue well.

Look closely at the run-time lapse of host-based DRAW (also referred to as Dispatch), and graphics subsystem DRAW depicted in the following diagram.
 
 

 
Figure 7 - DRAW as represented by dispatch and subsystem processing


This diagram represents one frame of time, with host-based DRAW (Dispatch) starting on the frame boundary. The period of time that occurs between when host DRAW begins dispatching OpenGL calls and when the graphics subsystem begins processing them can be referred to as Dispatch latency. The yellow band represents the time it takes for the graphics subsystem to completely process the input stream, finish the transform phase and the render phase, and issue a rendering buffer swap. Since buffer swap does not occur until the next vertical retrace blanking time, the graphics subsystem waits.

Note, however, that DRAW Dispatch has finished well ahead of the graphics subsystem processing. In the interest of synchronizing the application with the graphics subsystem, most serious graphics software will also wait for a signal indicating that buffer swap has occurred before going on to the next frame. This presents an opportunity for processing time on the host.

With this knowledge in hand, considering the parallel nature of host and graphics, we can apply the following phase model.
 
 


Figure 8 - Phase shifted processing model


In this model we run the host-based frame scheduler at the precise rate of the graphics subsystem vertical retrace signal. However, we stagger it to be slightly out of phase, such that we can begin a frame on the host sometime before vertical retrace. We finish the Pre-CULL phase and the CULL phase and begin to issue OpenGL protocol from host-based DRAW such that when vertical retrace occurs and graphics subsystem processing resumes, it begins as close to graphics subsystem frame boundaries as possible. Note that CULL and DRAW (dispatch) reside in the same thread and are processed serially. This is a result of taking advantage of time wasted in the host waiting for vertical retrace.

This model simplifies memory management within the scene graph, takes better advantage of compute resources and allows maximum rendering time for graphics DRAW. Note also that latency is down to less than two frames.
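One way to picture the phase shift is as a fixed lead time subtracted from every vertical retrace. The sketch below is an assumption about how such a lead might be chosen, not a description of any actual scheduler: it takes the lead to be the sum of the Pre-CULL time, the CULL time and the dispatch latency, so that the graphics subsystem starts processing right at the retrace. All names and numbers are hypothetical.

```python
FRAME_MS = 1000.0 / 60.0  # retrace period at 60 Hz


def host_lead_ms(pre_cull_ms, cull_ms, dispatch_latency_ms):
    """How far ahead of vertical retrace the host frame should start."""
    return pre_cull_ms + cull_ms + dispatch_latency_ms


def host_frame_starts(retrace_times_ms, lead_ms):
    """Host frame start times: same rate as retrace, shifted earlier by lead_ms."""
    return [t - lead_ms for t in retrace_times_ms]


retraces = [n * FRAME_MS for n in range(1, 4)]
lead = host_lead_ms(1.0, 4.0, 0.5)
print(host_frame_starts(retraces, lead))
```

The host frames keep the retrace period exactly; only their phase moves, which is the staggering described above.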


Application to Design of Open Scene Graph MP Model


The Open Scene Graph Multi Processor model begins with the following diagram.








The blocks in the diagram represent abstract concepts and should not be bound too closely to hardware or implementation at first glance. The implementation will be derived as the discussion ensues. Letters in red represent designations to be used in a configuration document or implementation. Lines and arrows represent the flow of data as it goes through the system and ends up rendered on a display.
 

Main Thread


The Main Thread is the process or thread that runs Pre-CULL. Its declaration in the configuration will include a CPU upon which to run it. It will be assumed that the Main Thread will run from the host where it is invoked. A configuration manager may be implemented to start and initialize each block above, and the Main Thread will run from the same host the manager is run from.
 

Cull/Draw Pairs


Cull/Draw pairs may run as a single thread or as separate threads, depending on the processing model chosen from the previous section. These can be designated with a Host argument defining the hostname of the system they should run on, and a CPU argument defining numerically what CPU on the designated host to schedule them on. If more than one CPU is given (no more than two), it is assumed that Cull and Draw will run as separate threads.
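A designation of this kind could be modeled as a small record. The class below is a hypothetical sketch, not a format defined by the article: `CullDrawDesignation`, its field names and the validation rule are all assumptions that follow from the Host and CPU arguments described above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CullDrawDesignation:
    """Hypothetical designation of one Cull/Draw pair."""
    host: str              # hostname of the system the pair runs on
    cpus: Tuple[int, ...]  # one CPU: single thread; two CPUs: separate threads

    def __post_init__(self):
        if not 1 <= len(self.cpus) <= 2:
            raise ValueError("a Cull/Draw pair takes one or two CPU numbers")

    @property
    def separate_threads(self) -> bool:
        """True when Cull and Draw each get their own CPU and thread."""
        return len(self.cpus) == 2


single = CullDrawDesignation(host="localhost", cpus=(1,))
split = CullDrawDesignation(host="localhost", cpus=(2, 3))
print(single.separate_threads, split.separate_threads)
```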
 

Rendering Surface


The Rendering Surface represents the screen space where the final rendering will be displayed. Its designation defines:

  •  Host: Hostname of the system where the display will be seen.
  •  Display: Graphics subsystem. Display is used here as it is in the X Window System.
  •  Screen: Screen is used as it is in the X Window System.
  •  Window: Window is used as it is in the X Window System.
  •  Viewport: Viewport is the rectangle within the Window to which the final rendering will be confined.

The representation of each of the above in the configuration document is an implementation detail.

Configuration


Note that the above can be configured to run in three separate environments.

1) Single System Image

If the Host field in designations remains constant, the system will be initialized on the same host. Threads can then be configured to run where the CPU field is defined in their designations.

2) Graphics Cluster

If the Host field of a Cull/Draw pair differs from the Pre-cull host, then a Pre-cull agent will be started on the Cull/Draw pair host and used to synchronize, on the slave host, the dynamic data set within the Pre-cull phase. This agent will block the Cull phase until data is synchronized.

3) WireGL configurations

Note that Rendering Surfaces contain a "Host" field in their designation. This could be used to drive a WireGL implementation that handles the OpenGL protocol issued from the host-based DRAW phase.

The flexibility of this configuration scheme allows for a mix-and-match of the above configurations. For example, an application could run its out-the-window display on three local graphics subsystems, provide multiple clustered displays for an Instructor Operator Station, and implement a final composite of all displays on a WireGL cluster.
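The distinction between the first two environments comes down to comparing Host fields. The helpers below are a hypothetical sketch of that comparison: the function names, the host names and the "mixed" label are assumptions made for illustration, and WireGL is left out because it depends on the Rendering Surface rather than on the Cull/Draw host.

```python
def hosts_needing_agent(pre_cull_host, cull_draw_hosts):
    """Return the distinct Cull/Draw hosts that differ from the Pre-cull host.

    Each of these would need a Pre-cull agent to synchronize dynamic data.
    """
    return sorted({h for h in cull_draw_hosts if h != pre_cull_host})


def describe(pre_cull_host, cull_draw_hosts):
    """Label a configuration by where its Cull/Draw pairs run."""
    remote = hosts_needing_agent(pre_cull_host, cull_draw_hosts)
    if not remote:
        return "single system image"
    if len(remote) == len(set(cull_draw_hosts)):
        return "graphics cluster"
    return "mixed"


print(describe("sim", ["sim", "sim", "sim"]))
print(describe("sim", ["node1", "node2"]))
print(describe("sim", ["sim", "node1"]))
```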
 

MP Models

Two models fall out of the discussion in the previous section for doing a multi-task, multi-display implementation of Open Scene Graph. The difference boils down to a decision to thread Cull/Draw pairs separately or not. Considering the use of a phase shifted, host-based frame scheduler, the advantage of threading Cull/Draw pairs separately is dubious. Further, the implementation may introduce memory management overhead that could hamper performance, negating any real advantages.

Still, both methods are discussed here for consideration.
 
 

MP Model A - Data Flow


Consider the following MP model introduced in the previous section. We deal here with a single, host-based Pre-CULL/CULL/DRAW pipeline, with the understanding that multiple Cull/Draw pairs can be extrapolated from the information.



This model assumes a host-based staggered frame scheduler, and a single thread for CULL/DRAW. Time lapses A, B, and C represent the data flow stages in the next diagram.


As previously stated, the Pre-CULL phase updates dynamic data in the scene graph. This dynamic data includes camera position(s), position of moving objects within the scene, update of timestamps, frame counts, elapsed time, and other data management facilities. This data is assumed to be public, allocated and accessible by the application. Thus, CULL must wait until Pre-CULL has finished its phase. Once Pre-CULL is done, it signals CULL to run. CULL reads the updated dynamic data, and generates internal data, not accessible by the application, but intended only for the DRAW phase. This data is processed serially. DRAW traverses the generated data and issues OpenGL calls.

This model is simple and requires little more than implementing the run-time, including the phase shifted host-based frame scheduler. OpenSceneGraph already contains support for multiple rendering contexts on a multi-display system. No changes are required in Cull/Draw to run them as a single thread.
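The signalling in Model A can be sketched with two events: one that Pre-CULL sets when the public data is ready, and one that the CULL/DRAW thread sets when the frame is finished. This is a minimal illustration under stated assumptions: `ModelAPipeline` and its members are hypothetical names, the "display list" is a plain tuple, and the main thread simply waits for the end of the frame where a real run-time would wait on the frame scheduler.

```python
import threading


class ModelAPipeline:
    """Pre-CULL in the main thread; CULL and DRAW serially in one worker thread."""

    def __init__(self):
        self.public = {"frame": 0, "camera_x": 0.0}  # written by Pre-CULL only
        self.pre_cull_done = threading.Event()
        self.frame_done = threading.Event()
        self.drawn = []

    def pre_cull(self, frame):
        # Update the public dynamic data, then signal CULL to run.
        self.public["frame"] = frame
        self.public["camera_x"] = float(frame)
        self.pre_cull_done.set()

    def cull_draw(self, num_frames):
        for _ in range(num_frames):
            self.pre_cull_done.wait()   # CULL must wait for Pre-CULL
            self.pre_cull_done.clear()
            # CULL: build internal data from the public data.
            internal = ("display list", self.public["frame"], self.public["camera_x"])
            # DRAW: traverse the internal data (recorded here instead of drawn).
            self.drawn.append(internal)
            self.frame_done.set()

    def run(self, num_frames):
        worker = threading.Thread(target=self.cull_draw, args=(num_frames,))
        worker.start()
        for frame in range(num_frames):
            self.pre_cull(frame)
            self.frame_done.wait()      # stand-in for the frame scheduler
            self.frame_done.clear()
        worker.join()
        return self.drawn


print(ModelAPipeline().run(3))
```

Since CULL and DRAW share a thread, the internal data needs no buffering: it is produced and consumed in sequence.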

MP Model B - Data Flow

Consider an MP model with separate threads for CULL/DRAW. Note that the diagram does not include the graphics subsystem DRAW. This model assumes no phase shift, and host-based processes run synchronized to the graphics subsystem.

Data for this model is depicted in the following diagram.






This diagram differs from the single threaded CULL/DRAW diagram in that the internal data that is passed from CULL to DRAW must be double buffered. Data generated by CULL will be written to Buffer 0 while DRAW reads from Buffer 1. At a synchronization point between CULL and DRAW, pointers to the Buffers can be swapped.

This approach requires the implementation of the double buffered internal data, and the implementation of a synchronization point between CULL and DRAW.
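A minimal sketch of the double buffer and its synchronization point follows. It is not the OpenSceneGraph implementation: `DoubleBuffer` and `run` are hypothetical names, the internal data is a list of strings, and the synchronization point is modeled with a `threading.Barrier` whose action swaps the buffers while both threads are blocked.

```python
import threading


class DoubleBuffer:
    """Two buffers: CULL writes to one while DRAW reads from the other."""

    def __init__(self):
        self._buffers = [[], []]
        self._write = 0  # index of the buffer CULL currently writes to

    @property
    def write_buffer(self):
        return self._buffers[self._write]

    @property
    def read_buffer(self):
        return self._buffers[1 - self._write]

    def swap(self):
        """Exchange the roles of the buffers and empty the new write buffer."""
        self._write = 1 - self._write
        self._buffers[self._write].clear()


def run(num_frames):
    buf = DoubleBuffer()
    # The barrier action runs once per frame, while both threads are waiting.
    sync = threading.Barrier(2, action=buf.swap)
    drawn = []

    def cull():
        for frame in range(num_frames):
            buf.write_buffer.append(f"display list {frame}")
            sync.wait()

    def draw():
        for _ in range(num_frames):
            drawn.append(list(buf.read_buffer))  # what CULL wrote last frame
            sync.wait()

    threads = [threading.Thread(target=cull), threading.Thread(target=draw)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return drawn


print(run(3))
```

In the first frame DRAW finds an empty buffer, and from then on it always consumes what CULL produced one frame earlier, which is the extra frame of latency this model carries.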
 

Conclusion


OpenSceneGraph is architected for successful multi-tasking, multi-processing and multi-displays. The implementation approach is modern and takes advantage of current hardware. Open Scene Graph has been tested successfully under SGI's MPK, and has held up well. It is the desire of the Open Scene Graph authors to implement a solution that is cross-platform and allows the flexibility to run transparently on graphics clusters. Given the building blocks, the implementation of a multi-display, multi-processing run time for Open Scene Graph is only slightly more than trivial.



Reposted from blog.csdn.net/fatcat123/article/details/79310188