Sudheesh N
Many people have asked what they should check to make sure SCCM is healthy. If I were in your place, these are the things I would check :) The list might not be complete, but I have tried to include everything I could think of.
Administrative Activities:
Daily Site Monitoring Tasks
To best maintain your system, perform the following monitoring tasks on a daily basis. If there is any indication of a problem, isolate and repair the problem to ensure that the site remains healthy. Daily site monitoring tasks include:
Checking site systems status.
Checking client status.
Checking the operating system event log.
Checking the SQL Server error log.
Checking system performance.
Checking SCCM system folders.
Site components and services: Check if any site server component or service is experiencing any problems.
Packages and advertisements: Check the status of packages and advertisements in your site. Check package and advertisement status messages to ensure that package source files reach distribution points, and that advertised programs reach clients. Check status messages that are returned from clients to see whether the clients run programs successfully or not.
Site-to-site communication: Check communication between the site and its parent and child sites (if they exist). Check status messages and, if necessary, check log files of the Replication Manager, Scheduler, and Senders on the site to determine whether the site is having communication problems.
Low level of available disk space. SCCM components that cannot connect with a site system.
Client components are experiencing problems.
Clients are failing to install.
Clients are not reporting software inventory or hardware inventory.
Clients are not reporting heartbeat discovery data regularly (or have not for the past x days).
Client count is unexpectedly increasing or decreasing at a fast rate.
You can monitor a client's status only if it creates status messages and these status messages reach the site server. To detect clients from which you are missing status messages, you need to run a query that returns all clients that have not reported a status message within the last <n> days. In this query, <n> is the length of time within which you would expect to receive a status message from that client, taking into account the frequency of hardware or software inventory and the time it normally takes for status messages to reach the site server.
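The "no status message within the last <n> days" check can be illustrated outside of SCCM. The following is only a sketch in Python, assuming you have already exported each client's most recent status message time into a dictionary; the function name `stale_clients` and the sample client names are hypothetical and are not part of any SCCM API.

```python
from datetime import datetime, timedelta


def stale_clients(last_status, now, max_age_days):
    """Return the sorted names of clients with no status message in the last max_age_days days.

    last_status maps a client name to the datetime of its most recent
    status message, or to None if no message was ever received.
    """
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        name
        for name, seen in last_status.items()
        # A client that never reported is treated as stale as well.
        if seen is None or seen < cutoff
    )


if __name__ == "__main__":
    # Hypothetical sample data, not real inventory.
    now = datetime(2024, 1, 15, 12, 0)
    sample = {
        "PC-001": datetime(2024, 1, 14, 9, 30),
        "PC-002": datetime(2024, 1, 1, 8, 0),
        "PC-003": None,
    }
    print(stale_clients(sample, now, max_age_days=7))
```

Pick the value of `max_age_days` the same way as <n> above: from your inventory schedule plus the usual delay before status messages arrive.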
Check the SQL Server Error Log
Check the SQL Server error log in SQL Enterprise Manager. Look for messages that indicate error conditions. Isolate and repair the conditions that generate error or warning messages.
Check Status Filter Rules
Check whether it is possible to reduce the amount of traffic generated by status messages being replicated throughout the hierarchy. If the site is currently healthy, it might be possible to add status filter rules that prevent replication of status messages that are not necessary.
Here is a list of the ConfigMgr 2007 inboxes that should be checked on a regular basis to ensure that your site(s) function as expected.
Auth\Dataldr.Box: A backlog of files can indicate problems accessing the site database.
Auth\Dataldr.Box\Process: A backlog of files can indicate problems accessing the site database.
Auth\Ddm.box\Bad_DDRs: A backlog of files can indicate a network corruption problem or a problem with the DDM.
Auth\Sinv.Box: A backlog of files can indicate that the Software Inventory Processor cannot connect to the site database or that too many files were received.
Auth\Sinv.Box\Orphans: A backlog of files can indicate problems with specific clients, with management points, or with the network that could cause data corruption.
Compsumm.Box: A backlog of files can indicate that the Component Status Summarizer cannot process the volume of messages.
Dataldr.Box: A backlog of files can indicate problems accessing the SCCM site database.
Dataldr.Box\Badmifs: A backlog of files can indicate a bad custom MIF file or that a client computer cannot transfer the file correctly.
Ddm.Box: A backlog of files can indicate that a bad DDR is preventing other DDRs from being processed.
Ddm.Box\Bad_DDRs: A backlog of files can indicate a network corruption problem or a problem with the DDM.
OfferSum.Box: A backlog of files can indicate a performance problem that is caused by a large number of messages.
Policypv.Box: A backlog of files in the policypv.box folder indicates that the policy provider component is not running.
Replmgr.Box\Ready: A backlog of files can indicate that the Scheduler is backlogged or is already processing files of the same priority.
Schedule.Box: A backlog of files can indicate that the Sender cannot connect to or cannot transfer data to another site.
Schedule.Box\Outboxes: A backlog of .srq files indicates that the sender cannot process the number of jobs scheduled for that sender or that the sender cannot connect to or transfer data to another site.
Schedule.Box\Tosend: A backlog of files can indicate that many send requests are not completed or that the Scheduler has not yet deleted the files.
Sinv.Box: A backlog of files can indicate that the Software Inventory Processor cannot connect to the site database or that too many files were received.
Sinv.Box\BadSinv: A backlog of files can indicate problems with specific clients, with management points, or with the network, causing data corruption.
SiteStat.Box: A backlog of files can indicate a performance problem. Examine status messages for the Site System Status Summarizer for possible problems.
Statmgr.Box\Futureq: A backlog of files can indicate that some site systems' clocks are not synchronized with the site server.
Statmgr.Box\Queue: A backlog of files can indicate a problem with the Status Manager or that the component is trying to process too many messages.
Statmgr.Box\Retry: A backlog of files can indicate problems with the connection to the computer that is running SQL Server.
Statmgr.Box\Statmsgs: A backlog of files can indicate a problem with the Status Manager or that the Status Manager is trying to process too many messages.
Swmproc.Box: A backlog of .sum and .sur files can indicate that the Software Metering Processor component cannot connect to the SCCM database.
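Checking inboxes for a backlog by hand gets tedious, so here is a rough sketch of how the file counts could be collected with a script. It is written in Python and makes several assumptions: the inbox root folder is passed in by you (the install location differs per site), the threshold is arbitrary, and the helper names `count_backlog` and `over_threshold` are made up for this example. It only counts files; deciding what a "backlog" means for each inbox is still up to you.

```python
import os
import tempfile


def count_backlog(inbox_root, relative_paths):
    """Count files directly inside each inbox folder.

    relative_paths use backslashes as in the list above,
    for example r"Auth\\Dataldr.Box". Missing folders map to None.
    """
    counts = {}
    for rel in relative_paths:
        folder = os.path.join(inbox_root, *rel.split("\\"))
        if not os.path.isdir(folder):
            counts[rel] = None
            continue
        with os.scandir(folder) as entries:
            counts[rel] = sum(1 for entry in entries if entry.is_file())
    return counts


def over_threshold(counts, threshold):
    """Return only the inboxes whose file count exceeds the threshold."""
    return {rel: n for rel, n in counts.items() if n is not None and n > threshold}


if __name__ == "__main__":
    # Demo against a throwaway folder tree, not a real site server.
    with tempfile.TemporaryDirectory() as root:
        box = os.path.join(root, "Auth", "Dataldr.Box")
        os.makedirs(box)
        os.makedirs(os.path.join(root, "Ddm.Box"))
        for i in range(3):
            open(os.path.join(box, f"file{i}.mif"), "w").close()

        counts = count_backlog(root, [r"Auth\Dataldr.Box", "Ddm.Box", "Sinv.Box"])
        print(counts)
        print(over_threshold(counts, threshold=2))
```

On a real site server you would pass the actual inboxes folder as `inbox_root` and tune the threshold per inbox, since some inboxes normally hold more files than others.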
Check that the daily maintenance tasks, if any are scheduled, have run. You can use smsdbmon.log for more details.
Weekly Site Monitoring Tasks
To best maintain your system, perform the following monitoring tasks on a weekly basis. If there is any indication of a problem, isolate and repair the problem to ensure that the site remains healthy. Weekly site monitoring tasks include:
Checking SCCM site database available space.
Checking available disk space.
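The disk space part of this check is easy to script. Below is a minimal sketch using Python's standard `shutil.disk_usage`; the 15% threshold and the function name `free_space_report` are my own assumptions, not SCCM recommendations. It does not look at free space inside the SCCM site database itself, which has to be checked in SQL Server.

```python
import shutil


def free_space_report(paths, min_free_fraction=0.15):
    """Return (path, free_fraction, is_low) for each path.

    min_free_fraction is a hypothetical threshold; choose one that
    fits your own servers.
    """
    report = []
    for path in paths:
        usage = shutil.disk_usage(path)
        # Guard against odd file systems that report a total of zero.
        free_fraction = usage.free / usage.total if usage.total else 0.0
        report.append((path, free_fraction, free_fraction < min_free_fraction))
    return report


if __name__ == "__main__":
    # "." is used so the demo runs anywhere; on a site server you would
    # list the drives that hold the SCCM installation and the database.
    for path, fraction, is_low in free_space_report(["."]):
        status = "LOW" if is_low else "ok"
        print(f"{path}: {fraction:.1%} free ({status})")
```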
To best maintain your system, perform the maintenance tasks in this section on a weekly basis. You can automate some tasks by scheduling predefined maintenance tasks or custom maintenance tasks, as appropriate, to run on a weekly basis.
Weekly automated tasks.
Delete unnecessary files.
Delete unnecessary SCCM objects.
Produce and distribute end-user reports.
Run disk defragmentation tools.
Back up application, security, and system event logs.
Weekly Automated Tasks
The following predefined maintenance tasks should be scheduled to run on a weekly basis. For more information about these tasks, see the "Predefined Site Maintenance Tasks" section earlier in this chapter.
Rebuild indexes
Monitor keys
Delete aged inventory history
Delete aged discovery data
Delete aged collected files
Delete aged software metering data
Delete aged software metering summary data
Summarize software metering data
Summarize software metering periodic usage data
Caution: When deleting a collection, any advertisements to that collection are also deleted.
Periodic Site Maintenance Tasks
To best maintain your system, perform the following tasks periodically. Use the predefined maintenance tasks when appropriate. Periodic site maintenance tasks include:
Backing up account data.
Changing accounts and passwords.
Checking network performance.
Reviewing the security plan.
Reviewing the maintenance plan.
Performing recovery tests.
Use Microsoft tools, such as the NTBackup.exe tool that comes with Windows Server, or third-party tools to back up account data as follows:
If there are multiple domain controllers in your infrastructure that contain the SCCM account database, you need to periodically back up the account database. (If Active Directory directory service is implemented in your organization, then such a task might be included in the Active Directory maintenance plan.)
If the account database is stored on a single domain controller, then back up the account database frequently. Depending on the frequency of changes to account data, you might need to add this task to the site's daily or weekly maintenance tasks.
If the account data is stored on member servers, then regularly back up the whole operating system that contains the account data, using software that backs up account lists and the account database.
Whenever there is a change to the password of the Client Push Installation account or to the site system connection accounts, you should note that change. For security reasons, SCCM encrypts the Client Push Installation account and the site system connection accounts. You need to be able to retrieve these accounts' passwords so that you can reenter them during a site recovery operation.
In between account database backups, document any changes to accounts. Write down and save any changes made to SCCM accounts and share rights so that you can apply those changes again after recovering the site.
Which accounts need to be changed, and for which accounts it is sufficient to change only the password.
How often to change passwords and accounts.
How to change passwords and accounts (such as by running SCCM site reset).
Which accounts cannot be configured by the administrator (either the account name cannot be changed, or the password cannot be manually modified).
Check the available bandwidth and error rates on the networks used by the SCCM hierarchy. Use Network Monitor to capture and analyze network frames so you can diagnose network problems and look for optimization opportunities.
Who has access to SQL Server and to the SCCM site database. Who can download from SCCM distribution points. Which accounts have permissions within SCCM security. Periodically, re-evaluate the risk assessment of your organization, and then review and update the security plan accordingly.
Update the maintenance plan document to reflect any changes to the maintenance plan, and then distribute it to all SCCM administrators that are using it.
The best way to be fully prepared for a site recovery operation is to ensure that the recovery plan is adequate and that administrators are familiar with the recovery process. After you develop a recovery plan for your site, it is recommended that you perform periodic recovery tests in a test lab.
A recovery test should follow the recovery plan developed for the production environment. Plan to perform a recovery test of the central site, and of any other systems deployed in your hierarchy. A recovery test should test all phases of recovery, including:
Backing up a site.
Archiving the backup snapshot.
Simulating a site failure, such as by turning a server off.
Recovering the failed site.
Verifying the success of the recovery operation.
You might schedule periodic recovery tests. Company policy might require that new administrators always perform a recovery test. It is strongly recommended that you always include a recovery test when testing major changes to the hierarchy. For example, before upgrading site server operating systems, you should probably first test the upgrade in the test lab. After completing the upgrade in the test lab, you should perform a recovery test to identify any issues or adjustments to the recovery plan associated with the operating system upgrade. This ensures that if you upgrade the servers in the production environment, you will still be able to successfully recover a failed site. Include a recovery test in every major deployment test, such as:
A major operating system upgrade (not service pack).
A major change to the networking infrastructure.
New equipment deployment or building relocation.
An SCCM major version site upgrade.
Periodic Site Monitoring Tasks
To best maintain your system, perform the following monitoring tasks periodically. If there is any indication of a problem, isolate and repair the problem to ensure that the site remains healthy. Periodic site monitoring tasks include:
Checking hardware.
Checking site's overall health.
Checking the backup snapshot.
Check Hardware
Even high-quality hardware occasionally fails. Sometimes, it fails gradually, so there might be early signs. Replacing hardware before it completely fails is a key step in preventing site failure. Both Windows and SCCM provide performance counters, which you can use to monitor the performance and state of the hardware used in the site. As soon as you notice any signs of hardware-related unreliable behavior of an SCCM server, replace the hardware. To properly replace server hardware, you must use the Recovery Expert. For more information about swapping the computer of SCCM servers, see the "Swapping the Computer of a Site Server" section later in this chapter.
Ensure that all SCCM services are running.
Review the Status Message System for Critical status.
Ensure that all the latest service packs are installed.
Ensure that the latest critical security patches are installed.
Examine the System and Application Event logs for errors.
Note: When SCCM is configured to write status messages to the system's event log, SCCM error status messages are written as information events, not error events.
Run a query to determine if discovery data is being updated correctly in the SCCM site database. The query should list all installed clients in which System Resource - Agent Time is not within the heartbeat interval. It is expected that some clients might be offline, but in other cases, it might indicate a problem.
Run a query to determine if software inventory data is being updated correctly in the SCCM site database. The query should list all installed clients in which Last Software Scan - Last Inventory Collection is not within the software inventory interval. It is expected that some clients might be offline, but in other cases, it might indicate a problem.
Run a query to determine if hardware inventory data is being updated correctly in the SCCM site database. The query should list all installed clients in which Workstation Status - Last Hardware Scan is not within the hardware inventory interval. It is expected that some clients might be offline, but in other cases, it might indicate a problem.
If any of these tests fail, you need to diagnose the problem and repair it.
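The three queries above all follow the same pattern: compare a "last reported" time with the interval configured for that data type. As an illustration only, here is a small Python sketch of that comparison, assuming the timestamps have already been exported from the site database. The dictionary keys, the seven-day intervals, and the function name `overdue_reports` are hypothetical; use the intervals configured for your own site.

```python
from datetime import datetime, timedelta

# Hypothetical intervals; replace with the values configured for your site.
INTERVALS = {
    "heartbeat": timedelta(days=7),
    "software_inventory": timedelta(days=7),
    "hardware_inventory": timedelta(days=7),
}


def overdue_reports(clients, now, intervals=INTERVALS):
    """Map each client name to the sorted data types it has not refreshed in time.

    clients maps a client name to a dict of {data_type: last_reported_datetime}.
    A missing or None timestamp counts as overdue. Clients with nothing
    overdue are left out of the result.
    """
    result = {}
    for name, timestamps in clients.items():
        late = [
            kind
            for kind, interval in intervals.items()
            if timestamps.get(kind) is None or now - timestamps[kind] > interval
        ]
        if late:
            result[name] = sorted(late)
    return result


if __name__ == "__main__":
    # Hypothetical sample data.
    now = datetime(2024, 3, 1)
    recent = datetime(2024, 2, 28)
    clients = {
        "PC-A": {kind: recent for kind in INTERVALS},
        "PC-B": {"heartbeat": recent, "software_inventory": datetime(2024, 1, 1)},
    }
    print(overdue_reports(clients, now))
```

A client that shows up here is not necessarily broken. As noted above, it may simply be offline.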
Restore a recent backup snapshot to a disk and examine file continuity, file size, and other file properties to ensure that they do not seem corrupted. Check critical files by restoring these files to their respective applications to ensure that the application can use the restored file.
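One way to make the file size and corruption check repeatable is to compare the restored files against a manifest recorded when the backup was taken. The sketch below is an assumption on my part rather than an SCCM feature: it uses Python's `hashlib` to record the size and SHA-256 hash of every file, and the helper names `build_manifest` and `compare_manifests` are made up for this example. It does not replace opening critical files in their own applications.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path relative to root to a (size, sha256) tuple."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            manifest[rel] = (os.path.getsize(full), file_digest(full))
    return manifest


def compare_manifests(expected, restored):
    """Return (missing, changed): files absent after restore, and files that differ."""
    missing = sorted(set(expected) - set(restored))
    changed = sorted(
        path for path in expected
        if path in restored and expected[path] != restored[path]
    )
    return missing, changed


if __name__ == "__main__":
    # Demo with throwaway folders standing in for a backup and its restore.
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "backup")
        os.makedirs(os.path.join(source, "sub"))
        with open(os.path.join(source, "a.dat"), "wb") as f:
            f.write(b"alpha")
        with open(os.path.join(source, "sub", "b.dat"), "wb") as f:
            f.write(b"beta")
        expected = build_manifest(source)

        restored_dir = os.path.join(tmp, "restored")
        shutil.copytree(source, restored_dir)
        # Simulate a damaged restore: one file lost, one file altered.
        os.remove(os.path.join(restored_dir, "a.dat"))
        with open(os.path.join(restored_dir, "sub", "b.dat"), "wb") as f:
            f.write(b"corrupt")

        print(compare_manifests(expected, build_manifest(restored_dir)))
```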