FAQ for Windchill on Solaris
By Jeff Taylor-Oracle on May 30, 2007
Introduction
I work in Sun's "ISV Engineering" team. Our responsibilities include working with Sun's key ISV partners to port, tune and optimize industry-leading applications on Sun's hardware and software stack, and to ensure that the latest solutions from the ISVs are certified on the latest products from Sun. One of the applications that I focus on is Windchill from PTC. As such, I am in frequent contact with PTC's R&D, Global Services, QA and customers, all of whom bring varying degrees of Solaris expertise. This is a collection of their questions and my answers.
Section I: HARDWARE CONFIGURATION
1. What server configurations are recommended?
Windchill will run on hardware ranging from laptops to large servers. Solaris is typically used for installations where the user count ranges from 50 to several thousand engineers. While you can run a full Windchill installation on a single server, typical installations place the Oracle database and the Windchill application tier on separate servers.
2. Does Sun recommend horizontal or vertical scaling for Windchill?
"Vertical scaling" is typically used at the database tier. In this strategy, capacity is added by using more CPUs within a single server. For example, if four SPARC64 VI processors are required at the database tier, but you would like room for future expansion, a Sun SPARC Enterprise M5000 with eight CPU slots could be used. Four slots would be populated, and four additional CPU slots would be available for future expansion.
"Horizontal scaling" is often used at the application tier. In this strategy, multiple servers are combined to handle a larger workload. Often, one Sun Fire T2000 is sufficient to meet the current requirements at the application tier, but a single server leaves no room for future expansion. This is acceptable because the Windchill application tier scales horizontally, so more servers can be added later. A drawback is that a "Windchill cluster" is somewhat more difficult to administer, so it works best with an experienced IT staff. If you are using an ASP with limited Windchill experience, vertical scaling at the application tier may be a better approach.
3. Is a "highly available" solution recommended?
Yes. The cost of engineering talent is too high to allow for long downtimes. A "Windchill cluster" is highly available at the application tier. If one node fails, users who were logged into Windchill on the failing node will lose their Windchill sessions. When they attempt to reconnect, they will establish a session with one of the other nodes.
In addition to the Windchill cluster, we recommend an "active/active" Sun Cluster installation, as follows:
- Oracle standalone is a potential single point of failure, and therefore we recommend Sun Cluster HA Oracle for failover at the database tier. (Oracle RAC is an alternative, but it is quite expensive and is not expected to scale well beyond two nodes.) With Sun Cluster HA Oracle, one node is actively running Oracle while the other node is passive. Oracle will be launched on the passive node in the event of a failure.
- There are several Windchill components that are single points of failure, including the Windchill master cache server, Aphelion and the background method server. We recommend that these services be run on the passive Oracle node. On failure, a Sun Cluster agent can launch the Windchill services on the active Oracle node.
4. Does Sun publish server sizing recommendations for Windchill?
Yes, see https://www.sun.com/third-party/srsc/resources/ptc/PTCWindchill8.0T2000SizingGuide.pdf
5. What is the hardware configuration?
# prtconf -pv | more
6. How can the disk devices and disk partition table be viewed?
# prtvtoc /dev/dsk/c3t1d0s2
7. How much free disk space is available?
# df -h
8. How much disk space is a file/directory using?
# du -sh my_dir
9. Where did all of the space on this disk go?
# cd mountpoint (identify mount point with df -h)
# du -sk * | sort -n
# cd biggest_dir
(Repeat, working your way down to the offending directories.)
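The drill-down above can be wrapped in a small helper. This is a minimal sketch, assuming a POSIX shell (ksh or bash) and the standard du, sort and tail utilities. The function name "biggest" is hypothetical and is not a Solaris command.

```shell
# biggest DIR [N]: list the N largest entries directly under DIR.
# Sizes are in KB and the largest entry is printed last.
# Hypothetical helper; hidden files are not matched by the * glob.
biggest() {
    dir=${1:-.}
    count=${2:-10}
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -n "$count"
}
```

Run it on the mount point, then run it again on the last directory it prints, and repeat until the space is accounted for.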
10. What fibre channel devices are on line?
# luxadm probe
Section II: SOLARIS KERNEL SETTINGS FOR WINDCHILL
1. What version of Solaris is running?
# cat /etc/release
2. What kernel tuning is necessary for Windchill?
Solaris 10 out of the box is well tuned.
3. Any tweaks? Add this to /etc/system:
* slow down the fsflush daemon
4. Are there any T2000 specific settings? Add this to /etc/system:
* Sun recommended settings for T2000's running S10u3
5. Are there any Oracle kernel parameters? Add this to /etc/system:
* Oracle 10g Settings
6. SunCluster kernel parameters? Add this to /etc/system:
* Start of lines added by SUNWscr
* Disable task queues and send all packets up to Layer 3
* in interrupt context.
* Uncomment the appropriate line below to use the corresponding
* network interface as a Sun Cluster private interconnect. This
* change will affect all corresponding network-interface instances.
* For more information about performance tuning, see
* set ipge:ipge_taskq_disable=1
* End of lines added by SUNWscr
* When you use the ce Sun Ethernet driver for public network connections
7. What about the recommendation for Oracle /etc/system changes such as shmsys:shminfo_shmmax?
No longer required with Solaris 10. Most are obsolete or dynamically adjustable.
projadd -c "Oracle Project" -U oracle,root -K \
"project.max-shm-memory=(priv,6GB,deny)" -K \
Section III: PATCH LEVELS
1. What patches are on the system?
# showrev -p
2. How can the Solaris patch level be kept up to date using a GUI?
3. How can the Solaris patch level be kept up to date using a CLI?
Section IV: SAR
1. Setting up sar for one minute samples and long term logging:
# su - sys
# crontab -l > /tmp/crontab.txt
# vi /tmp/crontab.txt
0 * * * 0-6 /usr/lib/sa/sa1 60 60
# 20,40 8-17 * * 1-5 /usr/lib/sa/sa1
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A
0 1 * * 2 /opt/sar_bk/mk_sar_bk.sh
# crontab /tmp/crontab.txt
# vi /opt/sar_bk/mk_sar_bk.sh
mkdir /opt/sar_bk/sar_`date +%Y%m%d`
cp /var/adm/sa/* /opt/sar_bk/sar_`date +%Y%m%d`
Section V: MONITORING CPU ACTIVITY
1. What tools are used to monitor the current CPU activity level? All of these work:
# xcpustate -disk &
# vmstat 10 10
# mpstat 10 10
2. How busy was the CPU after lunch?
# sar -u -s 13:00 -e 13:30
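To pick the busy intervals out of a long "sar -u" report, the output can be filtered on the %idle column. This is a minimal sketch, assuming the usual five-column layout (time, %usr, %sys, %wio, %idle) and an awk that supports -v (on Solaris 10 that may mean nawk). The function name "busy_samples" is hypothetical.

```shell
# busy_samples [MAX_IDLE]: read "sar -u" output on stdin and print the
# timestamp and %idle of every sample whose %idle is at or below MAX_IDLE.
# Header, blank and "Average" lines are skipped because they do not
# start with an HH:MM:SS timestamp followed by numeric columns.
busy_samples() {
    awk -v max="${1:-20}" '
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ &&
        $5 ~ /^[0-9]+$/ &&
        $5 + 0 <= max { print $1, $5 }'
}
```

For example, "sar -u -s 13:00 -e 13:30 | busy_samples 10" lists only the samples where the CPUs were at most 10% idle.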
3. Which cores/CPUs are busy?
Section VI: PROCESSORS
1. Are the processors on line?
2. How can you take a processor off line for performance analysis?
# psradm -f 16 (where "16" is the id of the processor)
3. How can you put the processor back on line?
# psradm -n 16 (where "16" is the id of the processor)
Section VII: RUN QUEUE
If there are more requests for processing than available compute cycles, runnable processes wait in the run queue. A large run queue indicates that you need to find more compute cycles (i.e. buy more or faster CPUs) or reduce the workload (i.e. tune the application).
1. How can you see the current run queue length?
# vmstat 10 10 (Watch the "r" column.)
2. How can you see the historical run queue length?
# sar -q (specifically, look at runq-sz %runocc)
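Rather than watching the "r" column by eye, the peak run queue length over a sampling period can be extracted with awk. This is a minimal sketch, assuming that "r" is the first column of each vmstat data line. The function name "max_runq" is hypothetical.

```shell
# max_runq: read vmstat output on stdin and print the largest value
# seen in the first ("r") column. Header lines are skipped because
# their first field is not a number.
max_runq() {
    awk '$1 ~ /^[0-9]+$/ { if ($1 + 0 > max) max = $1 + 0 }
         END { print max + 0 }'
}
```

Usage: "vmstat 10 10 | max_runq". Keep in mind that the first data line of vmstat reports averages since boot, and that a peak well above the number of CPUs is the signal to investigate.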
Section VIII: DISK ACTIVITY
1. Which disk is currently busy?
# iostat -mxPzn 10 10
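On a server with many LUNs the iostat report gets long, so it can help to filter on the %b (percent busy) column. This is a minimal sketch, assuming the extended "-xn" layout in which %b is the tenth column and the device name is the eleventh, and an awk that supports -v (nawk on Solaris 10). The function name "busy_disks" is hypothetical.

```shell
# busy_disks [MIN_BUSY]: read "iostat -xn" style output on stdin and
# print the device name and %b for every line whose %b is at least
# MIN_BUSY. Header lines are skipped because their tenth field is
# not a number.
busy_disks() {
    awk -v min="${1:-50}" '
        $10 ~ /^[0-9]+$/ && $10 + 0 >= min { print $11, $10 }'
}
```

Usage: "iostat -xPzn 10 10 | busy_disks 80".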
2. Which disks were historically busy?
# sar -f /usr/adm/sa/sa13 -s 20:19 -e 21:19 -d -i 3400 | grep -v , | grep -v .fp | grep -v md | more
Section IX: PROCESSES, THREADS, SYSTEM CALLS AND LOCKS
1. Which processes are busy?
2. Which processes are making a lot of system calls?
# prstat -m (watch the "SCL" system call column)
3. Which threads inside of a process are busy?
# prstat -L -p 1332
4. Which processes can benefit from a multi-core server like the T2000?
Processes with a significant number of threads or processes may benefit. Windchill method servers and Tomcat both run well with a large number of cores and run well on the T2000. (In contrast, Pro/E is not a good match for the T2000.)
5. Which processes have many threads (LWPs)?
# ps -e -o"nlwp,pid,args" | sort -n
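The sorted list can also be cut down to only the heavily threaded processes. This is a minimal sketch, assuming the "nlwp,pid,args" column order used above and an awk that supports -v (nawk on Solaris 10). The function name "many_lwps" is hypothetical.

```shell
# many_lwps [MIN]: read "ps -e -o nlwp,pid,args" output on stdin and
# print only the processes that have at least MIN LWPs. The header
# line is skipped because its first field is not a number.
many_lwps() {
    awk -v min="${1:-100}" '$1 ~ /^[0-9]+$/ && $1 + 0 >= min'
}
```

Usage: ps -e -o"nlwp,pid,args" | many_lwps 200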
6. What system calls is a process making?
# truss -c -p 1332
7. The "ps" command truncates the Java arguments. How do you see the full list?
# pargs 1332
8. How can you see how much time each thread has taken?
# ps -o"lwp,time,args" -L -p 2282
9. Is there locking?
# plockstat -C -p 5992
# lockstat -CPD 5 sleep 10
10. If the process is locking on malloc, how can you use a threaded malloc?
11. How can you see the current stack trace of a process?
# pstack 1332
Section X: MEMORY, VIRTUAL MEMORY AND SWAP SPACE
1. What swap devices are mounted?
# swap -l
2. How much swap space is used/remaining?
# swap -s
# vmstat 10 10
3. Is there pressure on the virtual memory system, currently?
# vmstat 10 10 (watch the "sr" scan rate column.)
4. How much free memory and swap have been available, historically?
# sar -r (freemem freeswap)
5. Was there pressure on the virtual memory system, historically?
# sar -g (watch pgscan/s)
6. Which processes are using the most RAM? Sort processes by Resident Set Size
# ps -e -o"rss,pid,args" | sort -n
7. Which processes are using the swap space? Sort processes by Virtual size
# ps -e -o"vsz,pid,args" | sort -n
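When many method servers or Oracle shadow processes are running, a per-command total is often more useful than a per-process list. This is a minimal sketch, assuming the "rss,pid,args" column order used above and that ps reports RSS in kilobytes. The function name "rss_by_command" is hypothetical.

```shell
# rss_by_command: read "ps -e -o rss,pid,args" output on stdin, add up
# the resident set size per command name (the basename of the first
# word of args) and print "MB command" lines, largest last.
rss_by_command() {
    awk '$1 ~ /^[0-9]+$/ {
             n = split($3, parts, "/")
             kb[parts[n]] += $1
         }
         END { for (c in kb) printf "%d %s\n", kb[c] / 1024, c }' |
        sort -n
}
```

Usage: ps -e -o"rss,pid,args" | rss_by_command. Shared pages are counted once per process, so the totals overstate real memory use for processes that share large segments such as the Oracle SGA.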
Section XI: IO
1. What files does a process have open?
# pfiles 4514 | grep /
2. Which disks are busy?
# iostat -mxPzn 10 10
# xcpustate -disk &
Section XII: NETWORK STATUS
1. Overall network status?
# netstat -i
# netstat -sPtcp
2. What ports are processes listening on?
# netstat -a | grep LISTEN
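To reduce the listing to a plain list of port numbers, the local address column can be split on its last dot. This is a minimal sketch, assuming the Solaris "netstat -an" layout in which the local address is the first column and is written as address.port (for example *.80). The function name "listening_ports" is hypothetical.

```shell
# listening_ports: read "netstat -an" output on stdin and print the
# unique port numbers of all sockets in the LISTEN state, sorted
# numerically. The port is whatever follows the last dot of the
# local address.
listening_ports() {
    awk '/LISTEN/ { n = split($1, a, "[.]"); print a[n] }' | sort -n -u
}
```

Usage: "netstat -an | listening_ports".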
3. What sockets does a process have open? Here is an example that shows that Apache has port 80 open.
# pfiles 4514
3: S_IFSOCK mode:0666 dev:310,0 ino:59012 uid:0 gid:0 size:0
sockname: AF_INET6 :: port: 80
netstat -an | grep LIST | grep 1158
In my environment, running "netstat -s" on the Windchill
application tier reported thousands of tcpListenDrop events, and therefore
network tuning was required:
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q 2048
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q0 8192
/usr/sbin/ndd -set /dev/udp udp_smallest_anon_port 8192
/usr/sbin/ndd -set /dev/tcp tcp_smallest_anon_port 8192
# ln -s /etc/init.d/network-tuning /etc/rc2.d/S99network-tuning
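Because ndd settings do not survive a reboot, the four commands above need to live in the script that the symbolic link points to. The following is a minimal sketch of what /etc/init.d/network-tuning could contain. The function name and the NDD override variable are assumptions added for illustration and are not part of the original setup.

```shell
# Hypothetical body for /etc/init.d/network-tuning.
# NDD can be overridden with a stub command for a dry run.
NDD=${NDD:-/usr/sbin/ndd}

network_tuning() {
    case "$1" in
    start)
        "$NDD" -set /dev/tcp tcp_conn_req_max_q 2048
        "$NDD" -set /dev/tcp tcp_conn_req_max_q0 8192
        "$NDD" -set /dev/udp udp_smallest_anon_port 8192
        "$NDD" -set /dev/tcp tcp_smallest_anon_port 8192
        ;;
    stop)
        # The settings stay in effect until reboot; nothing to do.
        :
        ;;
    *)
        echo "Usage: network_tuning {start|stop}" >&2
        return 1
        ;;
    esac
}
```

A real init script would end with a call such as network_tuning "$1", so that the rc2.d link runs the start branch at boot.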
I also needed to increase maxSockets in wt.properties:
Section XIII: RANDOM CRASHES
1. How can you detect random application crashes?
Section XIV: ORACLE CONFIGURATION AND TUNING
The following settings worked well for the test database and usage patterns provided by PTC. Your mileage will vary.
a) A 5 GB SGA was required for the test database:
ALTER SYSTEM SET sga_max_size=5g SCOPE=spfile;
ALTER SYSTEM SET sga_target=5g SCOPE=spfile;
Total System Global Area 5368709120 bytes
Fixed Size 2037688 bytes
Variable Size 939526216 bytes
Database Buffers 4412407808 bytes
Redo Buffers 14737408 bytes
exec DBMS_STATS.GATHER_SCHEMA_STATS ( OWNNAME=>'DUBLIN80M010',
estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE, CASCADE=>TRUE );
d) Verify system statistics:
select pname, pval1 from sys.aux_stats$ where sname =
ALTER SYSTEM SET open_cursors = 2500 SCOPE=SPFILE;
f) Use push join union:
alter system set "_push_join_union_view"=true scope=spfile;
g)Add some indexes recommended by Oracle Enterprise Manager Advisors:
CREATE INDEX "DUBLIN80M010"."MILESTONE_IDX$$_012C000B"
CREATE INDEX "DUBLIN80M010"."DELIVERABLE_IDX$$_012C000C"
CREATE INDEX "DUBLIN80M010"."PROJECTACTIVITY_IDX$$_012C000D"
CREATE INDEX "DUBLIN80M010"."MANAGEDBASELINE_IDX$$_012C000E"
CREATE INDEX "DUBLIN80M010"."WTORGANIZATION_IDX$$_012C000F"
CREATE INDEX "DUBLIN80M010"."PROJECTPLAN_IDX$$_012C0010"
CREATE INDEX "DUBLIN80M010"."REPORTTEMPLATE_IDX$$_012C0011"
CREATE INDEX "DUBLIN80M010"."WTPRODUCT_IDX$$_00740001"
h) Analyze Oracle
emctl start dbconsole
i) AWR and ASH reports
sqlplus sys/manager as sysdba
Section XV: METHOD SERVER LOG
1. Any fancy Unix commands to summarize the Method Server log?
# cat M*log | grep Exception | cut -d: -f 5- | sed -e 's/[0-9]*_OutdoorProducts_Org1_Admin_ActionItem[-_0-9]*/XXXX/' | sort | uniq -c | sort -n
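The pipeline above keeps the lines that mention an exception, drops the first four colon-separated fields (timestamp and thread), masks the parts that differ from one occurrence to the next, and then counts identical messages. The sed pattern shown is specific to my test data. A more generic sketch, which is my substitution rather than the original command, masks every run of digits instead. The function name "summarize_exceptions" is hypothetical.

```shell
# summarize_exceptions: read method server log lines on stdin and print
# each distinct exception message with its count, most frequent last.
# Every run of digits is replaced with N so that messages differing
# only in an object id or counter are grouped together.
summarize_exceptions() {
    grep Exception |
        cut -d: -f5- |
        sed -e 's/[0-9][0-9]*/N/g' |
        sort | uniq -c | sort -n
}
```

Usage: "cat M*log | summarize_exceptions". If the log prefix has a different number of colon-separated fields, adjust the -f5- argument to match.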