There seem to be more and more posts on the forums about jobs ‘stuck’
in the Running state. I have been investigating this problem for a
client recently, so I thought I would summarise some of the
troubleshooting techniques I use. This posting expands on the article
I wrote a few years ago about agent_exec.
The problem is usually expressed in the form of ‘DA shows my job is
running but I know it’s not’. First of all, DA shows a job as ‘Running’
whenever it finds a job whose a_special_app attribute is set to
‘agentexec’. Since agent_exec sets this attribute when it starts a job
and clears it when the job has finished, under normal circumstances this
is quite an accurate reflection of whether a job is running or not.
However, if the agent_exec process is interrupted before clearing
the attribute (if the box is rebooted or the Content Server hangs, for
instance), the job object can be left with a_special_app =
‘agentexec’ and DA shows the job as running.
Of course, agent_exec attempts to deal with such a situation.
Every time it wakes up to perform some processing it first runs a
‘garbage_collect_jobs’ routine. You won’t see much evidence of this in
the logs unless you turn on agent_exec tracing (see my job scheduler
article for details on how to do this). You will get the following lines
when there is nothing to garbage collect:
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] garbage_collect_jobs
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_exec: execquery,s0,F,
SELECT ALL r_object_id, a_last_invocation, ...
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_get: getlastcoll,s0
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_next: next,s0,q0
Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] do_exec: close,s0,q0
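If you want to check how often the garbage collection pass runs, a small script can pick these lines out of agentexec.log. The following is a minimal sketch that assumes the trace lines always have the layout shown above; the function names are my own, and I make no claim about what the number after AGENTEXEC represents, so it is simply kept as an identifier.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Thu Jan 17 13:39:41 2008 [AGENTEXEC 283604] garbage_collect_jobs
# Assumption: every trace line follows this layout.
TRACE_LINE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"\[AGENTEXEC (?P<ident>\d+)\] (?P<msg>.*)$"
)


def parse_trace(lines):
    """Yield (timestamp, ident, message) for each recognised trace line."""
    for line in lines:
        match = TRACE_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # wrapped continuation lines (e.g. query text) are skipped
        # Collapse repeated spaces so single-digit days parse the same way.
        stamp = " ".join(match.group("ts").split())
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        yield when, int(match.group("ident")), match.group("msg")


def garbage_collect_passes(lines):
    """Return the timestamps at which garbage_collect_jobs was traced."""
    return [
        when
        for when, _ident, msg in parse_trace(lines)
        if msg.strip() == "garbage_collect_jobs"
    ]
```

Feeding it the lines of agentexec.log (for example via open(path) as the lines argument) gives one timestamp per garbage collection pass, which makes it easy to see whether agent_exec is waking up at all.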
Basically agent_exec runs the following query:
SELECT ALL
r_object_id, a_last_invocation,
a_last_completion, a_special_app
FROM dm_job
WHERE ( ( (a_last_invocation IS NOT NULLDATE)
AND (a_last_completion IS NULLDATE))
OR (a_special_app = 'agentexec'))
AND (i_is_reference = 0 OR i_is_reference is NULL)
AND (i_is_replica = 0 OR i_is_replica is NULL)
If jobs are returned from this query and agent_exec cannot match a
job with an existing running job, it will clean up the job object,
unsetting a_special_app and setting a_last_invocation to the current
time.
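To make the clean-up rule concrete, here is a small model of it in Python. It is only a sketch of the behaviour described above, not agent_exec's actual code: the JobRow class, the function names and the use of None to stand in for NULLDATE are all my own assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class JobRow:
    """Hypothetical stand-in for one row returned by the query above."""
    r_object_id: str
    a_last_invocation: Optional[datetime]  # None models NULLDATE
    a_last_completion: Optional[datetime]  # None models NULLDATE
    a_special_app: str = ""


def looks_in_progress(job):
    """Mirror the WHERE clause: invoked but not completed, or flagged."""
    started_not_finished = (
        job.a_last_invocation is not None and job.a_last_completion is None
    )
    return started_not_finished or job.a_special_app == "agentexec"


def garbage_collect(jobs, running_ids, now):
    """Clean up jobs that look in progress but have no running process.

    Returns the object ids that were cleaned up.
    """
    cleaned = []
    for job in jobs:
        if looks_in_progress(job) and job.r_object_id not in running_ids:
            job.a_special_app = ""        # unset the 'running' marker
            job.a_last_invocation = now   # as described in the text above
            cleaned.append(job.r_object_id)
    return cleaned
```

A job that is flagged and also appears in running_ids is left alone; only a flagged job with no matching running process is treated as dead.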
When I set the a_special_app attribute of dm_LogPurge to agentexec by
hand, the trace output in the agentexec.log file shows that this
clean-up is the source of the infamous message:
Detected while processing dead job dm_LogPurge: The job object
indicated the job was in progress, but the job was not actually running.
It is likely that the dm_agent_exec utility was stopped while the job
was in progress.
DMCL tracing dm_agent_exec
Examining the agentexec trace is usually enough to figure out where
the problem lies; however, in extreme cases it is useful to look at the
dmcl trace for the agentexec process to troubleshoot further. In
principle you can do this by setting the dmcl.ini trace_file parameter
to an existing directory on the Content Server. However, this has the
disadvantage of turning on tracing for all dmcl processes on the Content
Server, i.e. all jobs and methods.
What we really want to do is isolate the agentexec process from all
others, and in this section I tell you how. I present the steps along
with explanations for a typical Windows server. The same principle
applies to *nix servers, usually with a suitable change of folder paths.
First force the agent exec to stop. You can do this
by killing the main agent_exec process repeatedly. The Content Server
will detect that the agent exec has died and try to restart it; however,
there is a limit to the number of times this will happen (it seems to be
5 by default). Eventually you get the following message in the Content
Server log and the dm_agent_exec stays dead:
Thu Jan 17 13:35:37 2008 984000
[DM_SESSION_W_AGENT_EXEC_FAILURE_EXCEED]warning: "The failure limit of
the agent exec program has exceeded. It will not be restarted again.
Please correct the problem and restart the server."
Copy the agent_exec executable to a separate directory. Copy the program file %DM_HOME%\bin\dm_agent_exec.exe to a new directory, e.g. c:\Documentum\agentexec.
Copy the dmcl.ini. Copy the main dmcl.ini file in c:\windows to c:\Documentum\agentexec. Now edit the copy and add the following lines:
trace_level = 10
trace_file = c:\Documentum\agentexec
We are going to take advantage of the fact that the first place the
dmcl looks for the dmcl.ini is in the current working directory.
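The two copy steps can be scripted if you do this often. The sketch below is one way to do it in Python; prepare_trace_dir is a name I made up, and the script assumes the two trace settings belong under a [DMAPI_CONFIGURATION] section header in dmcl.ini, so check that against your own file before relying on it.

```python
import shutil
from pathlib import Path

# Assumption: the trace settings belong in this section of dmcl.ini.
SECTION = "[DMAPI_CONFIGURATION]"


def prepare_trace_dir(agent_exec_exe, dmcl_ini, target_dir):
    """Copy the executable and dmcl.ini into target_dir and enable tracing.

    The original dmcl.ini is left untouched; only the copy is edited.
    Returns the path of the edited copy.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy2(agent_exec_exe, target / Path(agent_exec_exe).name)

    lines = Path(dmcl_ini).read_text().splitlines()
    for index, line in enumerate(lines):
        if line.strip().upper() == SECTION:
            # Insert the two settings directly under the section header.
            lines[index + 1:index + 1] = [
                "trace_level = 10",
                f"trace_file = {target}",
            ]
            break
    else:
        raise ValueError(f"{SECTION} section not found in {dmcl_ini}")

    local_ini = target / "dmcl.ini"
    local_ini.write_text("\n".join(lines) + "\n")
    return local_ini
```

On the Windows server described here the call would look like prepare_trace_dir(exe_path, r"c:\windows\dmcl.ini", r"c:\Documentum\agentexec"), with exe_path pointing at dm_agent_exec.exe under %DM_HOME%\bin.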
Start the agent_exec from the command line. Working from c:\Documentum\agentexec, so that the edited dmcl.ini is the one picked up, use the following syntax:
dm_agent_exec -docbase_name docbase
-docbase_owner dmadmin -trace_level 1
Agent exec logging and trace output will continue to appear in
%DOCUMENTUM%\dba\log\agentexec\agentexec.log; however, a number of dmcl
trace files will also be created in the c:\Documentum\agentexec
directory. One of these (probably the largest) will be the dmcl trace
for the main agent_exec process. Remember that agent_exec works by
forking off a new dm_agent_exec process to manage each running job, and
each of these processes will have its own dmcl trace file.
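If you prefer to launch the copy from a script rather than a console window, the important detail is the working directory. The sketch below uses Python's subprocess module with cwd set to the trace directory; the helper names are my own, and trace_files_by_size simply applies the 'probably the largest' rule of thumb without assuming anything about how the trace files are named.

```python
import subprocess
from pathlib import Path


def agent_exec_command(exe, docbase, owner, trace_level=1):
    """Build the command line shown above as an argument list."""
    return [
        str(exe),
        "-docbase_name", docbase,
        "-docbase_owner", owner,
        "-trace_level", str(trace_level),
    ]


def start_agent_exec(trace_dir, docbase, owner):
    """Launch the copied executable with trace_dir as working directory,
    so that the dmcl.ini in that directory is the one that gets read."""
    trace_dir = Path(trace_dir)
    cmd = agent_exec_command(trace_dir / "dm_agent_exec.exe", docbase, owner)
    return subprocess.Popen(cmd, cwd=trace_dir)


def trace_files_by_size(trace_dir):
    """List files in trace_dir, largest first, skipping the files we copied."""
    skip = {"dm_agent_exec.exe", "dmcl.ini"}
    files = [
        path
        for path in Path(trace_dir).iterdir()
        if path.is_file() and path.name.lower() not in skip
    ]
    return sorted(files, key=lambda path: path.stat().st_size, reverse=True)
```

start_agent_exec returns the Popen object, so calling terminate() on it later is one way to kill the command line process once you have finished tracing.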
When you have finished tracing the agentexec you will need to kill
the command line process and restart the Content Server (if anyone knows
how to force the content server to restart the agentexec after the
failure limit has been reached I’d love to know).
Conclusion
With a clear understanding of how agent_exec works and with the trace
output available it should be possible to troubleshoot and resolve just
about any job scheduler related problem.
Reposted from: http://robineast.wordpress.com/2008/01/17/troubleshooting-agent_exec-garbage-collection/