Wednesday Dec 12, 2012
By Darryl Gove-Oracle on Dec 12, 2012
I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.
- Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend. But the serious points are that (a) every release the tools get better, so you will be more effective using the latest release; (b) every release we improve the generated code, so you will see things get better; and (c) old releases cannot know about new hardware.
- Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.
- Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.
- The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on these architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.
- Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as favourite compiler optimisations, then this is mine!
- Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.
- The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives the best-performing binary, but with a few caveats: it assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.
- The SPARC64 processors, T3, and T4 implement floating point multiply accumulate instructions. These can substantially improve floating point performance. To generate them, the compiler needs the flag -fma=fused and also needs an architecture that supports the instructions (at least -xarch=sparcfmaf).
- The most critical advice is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.
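
To make the flags above concrete, here is a minimal sketch of how they might be combined on the command line. It is a dry run: it only prints the commands, so it can be read or executed on a machine without the compiler installed. The file names (app, main.c, util.c, train.in), the profile directory ./feedback, and the experiment name test.1.er are assumptions for illustration, not something from a real project, and the exact flag set you want will depend on your application.

```shell
#!/bin/sh
# Dry-run sketch: prints build commands rather than running them.
# Hypothetical names: app, main.c, util.c, train.in, ./feedback, test.1.er

CC="cc"
SRCS="main.c util.c"
BASE="-g -xO4 -xtarget=generic"   # debug info, optimisation, portable target
IPO="-xipo"                       # crossfile optimisation, at compile and link
PROFDIR="./feedback"              # assumed location for the profile data

# Three-step profile feedback build: collect, train, use.
profile_feedback_plan() {
    # 1. Instrumented build that records a profile when the binary runs
    echo "$CC $BASE $IPO -xprofile=collect:$PROFDIR -o app $SRCS"
    # 2. Training run on a representative workload
    echo "./app train.in"
    # 3. Rebuild with the same flags, this time using the recorded profile
    echo "$CC $BASE $IPO -xprofile=use:$PROFDIR -o app $SRCS"
}

# Build that permits fused multiply-accumulate instructions.
# The resulting binary only runs on chips that support them.
fma_plan() {
    echo "$CC -g -xO4 -xarch=sparcfmaf -fma=fused -o app $SRCS"
}

# Profile the application with the Performance Analyzer tools,
# assuming the default experiment name for a first run.
profiling_plan() {
    echo "collect ./app train.in"
    echo "er_print -functions test.1.er"
}

profile_feedback_plan
fma_plan
profiling_plan
```

The collect and use builds keep every other flag identical, since the profile is only meaningful if the compiler is optimising the same code in both phases. Building with -g throughout is what lets the profiler attribute time to source lines.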
I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.