
A regular database management plan is essential to ensure the continued operation and efficiency of Network Security Platform (NSP) resources, such as the Sensors. This lesson focuses on McAfee Network Security Manager (Manager) database maintenance best practices, operational status, and troubleshooting common issues.

Upon successful completion you should be able to:


• Identify best practices for database and alert data maintenance.
• View the status of Network Security Platform system components.
• Research issues for troubleshooting solutions using resources such as KB articles, service web portals, and documentation.
• Use troubleshooting tools, including log files and Manager tools, to troubleshoot issues.
• Use the Sensor’s CLI commands to troubleshoot Sensor issues.
• Troubleshoot common Manager and Sensor issues.
Network security is an ongoing process that requires a long-term plan for archiving and
maintaining the database for the alerts and packet logs generated by the deployed NSP
Sensors. Archiving this information is necessary for historical analysis of alerts that may help customers better protect their networks in the future.

All sizing estimates are based on tests of various alert/log generation frequencies. Multiple
frequency and file size parameters are offered to help better prepare the database for
long-term maintenance.

As alerts and packet logs gradually accumulate in the database, the disk space allotted to
NSP processes will require thoughtful planning and maintenance to keep up with the
frequency and size of incoming data. Depending on the archiving needs, it is essential to
understand the database space required to maintain an efficient system.

One question to ask is: "If the Sensors generate one alert every ten seconds for a year, how much database space will be needed to maintain all of these alerts?" With that question in mind, the topics presented here will help you get the most out of Network Security Platform (Manager) and the database.
It’s important to cover capacity planning for NSP. Every network has slight architectural differences that make each deployment unique. When deploying a network IPS, you must take the following factors into consideration when planning the capacity of the database:

• Aggregate Alert and Packet Log Volume From All Sensors: What is the volume in the network? A higher volume will require additional storage capacity.

• Lifetime of Alert and Packet Log Data: How long should you archive an alert? Maintaining the data for a long period of time (for example, one year) will require additional storage capacity to accommodate both old and new data.

The following subsections provide useful information for determining the necessary
capacity for alerts and packet logs in the database.
The Maintenance page groups many database and alert management tasks in a central
location. This page is accessed by clicking the Manage icon on the menu bar and then
expanding the Maintenance group.

The database houses the alert and packet log data generated by the Sensors; as part of capacity planning, this data should therefore be archived on a regular basis. Intel Security recommends archiving alert data monthly and discarding alert and packet log information from the database every 90 days to help manage the database size.

So let’s start with reviewing the Data Archiving pages.


The Data Archiving page presents actions that enable you to save alerts and packet logs
from the database on demand or by a set schedule. You can also restore archived alerts
and packet logs on the client or another Manager.

The Archive Now action enables you to archive alerts and packet logs on demand into an
archival file for future restoration. This process reads alerts and packet logs for the given
time range from the database and writes them into a zip file. Archived files are saved
locally to the Manager, and can be exported to your client machine.

Use the Automated Archival page to schedule automatic alert archiving.

The Export and Restore Archives pages are available to export and restore archives,
respectively.

NOTE: It’s important to understand the difference between backup and archive: a backup replaces the current data when restored, while an archive adds data to the current set.
The Alert Statistics option in Manager displays information that helps you track the
historical trend of database space usage on a weekly and monthly basis, and also the rate
at which data is being inserted into the database. By analyzing the trend of the load factors
on the database and the hardware, you can set the threshold for the amount of historical
data that you want to store at any given time.

Alert frequency is the first factor to consider when planning database capacity.

A good reference point for determining the required database capacity based on the
volume of alerts and packet logs is to find the average alert rate for a week, then multiply
by a longer time frame such as 12 weeks, one year (52 weeks), and so forth.
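
As a rough worked example tying this back to the earlier question (the per-alert storage figure is an illustrative assumption, not a published sizing number):

    1 alert every 10 seconds               = 8,640 alerts per day
    8,640 alerts/day x 365 days            = ~3.15 million alerts per year
    3.15 million alerts x ~1 KB per alert  = ~3 GB of alert data per year, per Sensor, excluding packet logs

Scale this by your actual average alert rate and retention period to estimate the space the database will need.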
The Alert Pruning option enables you to manage the database space required for the
alerts generated by the Sensors. Use this page to limit the quantity of alerts and packet
logs stored in the Manager database. When Pruning is enabled, the Manager deletes all
alerts and packet logs that exceed the maximums specified on this page. Alert pruning is
an important, ongoing task that must be performed for optimal Manager and database
performance. If the database were to grow unchecked with millions of stored alerts, analysis using the Threat Analyzer or Reports would slow down considerably.

The Manager has a pre-defined alert capacity of 10,000,000 alerts. This means that the
Manager will generate system fault messages when the database is nearing this limit by
issuing warnings at 50%, 70%, 90%, and 100% of 10,000,000. This is a customizable limit
and is purely for capacity planning and not an actual constraining limit on the database.

To calculate disk space capacity, click Calculate. This calculator has specific fields related
to determining the database allocation space required to maintain the alerts and packet
logs.
The File and Database Pruning option enables you to set a schedule by which generated log data and files are deleted from the Manager/database. These data and files are created by administrators through various System Configuration actions, and each details a different aspect of system functionality. These system files grow larger as more data is added over time. File pruning allows you to delete the data in a log, or an entire static file, either at the next scheduled time or in a set number of days. Regular deletion saves disk space on the Manager server, thus improving overall performance.

The deletion scheduler works as follows:

1. First, you set a daily time when you want file pruning (that is, deletion) to take place; this is under the Maintenance Scheduler setting.
2. Next, for each file type, you set a number of days or a file size (Scheduled Deletion) after which you want a file that has reached the set age/size to be deleted. On the day a file is to be deleted, deletion takes place at the set daily time. For example, with a daily pruning time of 02:00 and a Scheduled Deletion of 30 days for a given file type, a file that reaches 30 days of age is deleted at the next 02:00 run.

NOTE: An alternative to pruning alerts and packet logs is to delete these files using
purge.bat. Purge.bat also has the option to remove records flagged for deletion. This can
significantly increase the amount of time it takes to finish, depending on the size of the
database.
Regularly backing up the NSP data (alerts, saved reports, logs) and configuration settings is
strongly recommended to maintain the integrity of the system. The configuration tables
are saved by default once a week on Saturday.

Use the Database Backup function to back up the Manager data to the Manager server or
other media connected to the Manager, such as a tape drive. The Back Up Now option backs up files on demand. You can back up all or specific tables.

The Automated Backups option lets you run backups on a schedule (daily or weekly). It also lets you back up all or specific tables.

By default, the backup file is saved to the Manager installation folder <Network Security
Manager install directory>\App\Backups.
After a backup is complete, you can export the backup files to a specified location.
Shown here are some guidelines on restoring backups.

If the Manager software version matches, the database type and version will be the same.

Be sure the Manager service is stopped.

Before attempting to restore a backup:


• Database Type and Version: MySQL database users can only import a backup from a
MySQL database. Also, a restore of a database backup is only permitted if the major
release version of the database and the database backup match. For example, a
backup from a MySQL version 8.x.x can only be restored on a Manager using a MySQL
version 8.x.x database.
• Manager Software Version: A restore is permitted only if the major and minor release versions of the current Manager and the Manager from which the backup was created match; that is, a backup from a Manager release 8.x can only be restored on a Manager version 8.x.

Restore means returning the Manager to a previous configuration. The Sensor or policy configuration may have changed since the backup. Reconfiguration may be required.

NOTE: During this process, the NSM database user (not root) is used.
When restoring the data, note that all related table information in the database is
overwritten. For example, restoring a Config Tables backup overwrites all current
information in the configuration table of the database. Therefore, any changes not backed
up are erased in favor of the restored backup.

The dbadmin.bat tool is available on the Manager server at <Network Security Manager install directory>\App\bin\dbadmin.bat. Note that you must execute the tool from that location as well.

NOTE: A DB restore will fail if the Manager is still running, so be sure to verify the Manager service has stopped before the restore.
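
A minimal command-prompt sketch of the restore preparation described above (the install path and the service name shown are assumptions; verify both on your Manager server before running anything):

    REM Stop the Manager service first; the restore fails if the Manager is still running
    net stop "McAfee Network Security Manager"

    REM dbadmin.bat must be executed from its own directory
    cd /d "C:\Program Files\McAfee\Network Security Manager\App\bin"
    dbadmin.bat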
Over time, a relational database can experience performance issues if the data is not re-
tuned on a recurring basis. By regularly diagnosing, repairing, and tuning the database
internals, you can ensure optimal database performance.

It is imperative that you tune the database after each purge operation; otherwise, the purge
process fragments the database, which can lead to significant performance degradation.

When scheduling tuning along with other Manager actions (backups, file maintenance, archives), set a unique time for each that is at least an hour before or after other scheduled actions. Do not run scheduled actions concurrently.
The Tuning Status page provides the current database tuning operation status for the
Manager. The page displays one or more of the following:
• Start Time: Displays the time in-progress tuning started.
• Status: Displays whether tuning has not yet been initiated, is in progress, or is idle.
• End Time of Latest Tuning: Displays the time when the database was last tuned.

Clicking Refresh updates the dialog to provide you with the latest status; for example, if
another user initiated tuning since you opened the dialog, you could see the status after
refreshing.
Use this page to tune this Manager's embedded database immediately, on demand. Use the Automated Tuning page to schedule the task. When scheduling database tuning, set a time when no other scheduled functions (archivals, backups, file maintenance) are running.

The following operations cannot be performed when database tuning is in progress:


• Viewing alert details from the Threat Analyzer.
• Modifying attack properties from the Threat Analyzer.
• Generating reports on alerts and endpoint events.
• Restoring All Tables and Event Table backups.
• Running All Tables and Event Table backups.
• Archiving alerts and packet logs.
The malware policy has configuration settings to archive downloaded files based on
various characteristics. These downloaded files are archived on the Manager server as
encrypted files. You can configure the location and maximum disk space that can be used
to store the archives. To prune the file storage, edit the available Automatic file pruning options.

The configuration for disk usage is defined at the Global Manager level. The Manager also
provides configuration to prune files that are stored for more than a specified period of
time.
The System Health monitor of the Dashboard page displays the current health of the
Manager and installed Sensors in the system. Based on the severity of issues, the System
Health displays messages, such as whether the Manager or Sensor is up or down; for
example:
Manager Status:
• Up (proper functioning) or Down (component not functioning).

Device (Sensor) Status:


• Active: all channels are up.
• Attention: One or two communication channels are down.
• Disconnected: All three communication channels are down.
• Standby: Command Channel is still being set up.
• Uninitialized: There is a failure in the initial setup.
• Unknown: Sensor has been added to the user interface, but the actual Sensor has not
been set up yet to communicate with the Manager.

This is a good starting point when troubleshooting issues. On this page, you should check to see if there are any messages (Critical, Error, Warning, and Informational) being recorded and, if so, correct them as necessary. Alert counts greater than zero are hyperlinked to the System Faults page. You can click the hyperlink to drill down to individual alert details.
Clicking any link in the System Health monitor opens the System Faults page. The System
Faults page displays a table listing the Manager server, database, and installed Sensors,
and provides health and fault information of each.
• Network Security Platform Manager: This is the Manager controlling the system.
There is only one instance, always named Manager.
• Status: This column shows the operational status of the component. Up indicates
proper functioning. Down indicates the component is not functioning.

The impact of system faults is displayed in the additional columns. The fields are populated with two numbers (n/n): the number of faults that are still unacknowledged by the user / the total faults (both acknowledged and unacknowledged).

• Critical: Major faults, such as component failure.
• Error: Medium faults, such as a stopped process or a session time-out (automatic logout).
• Warning: Minor faults, such as multiple bad logins.
• Informational: Faults that relay information.
• Total: Number of messages per component, per message category, or all messages in
the system.

NOTE: There can be more than one Manager when an MDR pair is present.
• Database Type: MySQL.
• Database URL: This shows the navigation used by Manager to find the database.
• Status: Displays the state of the database connection. Up means Manager-to-database
communication is good; Down means the communication has been broken or another
error exists.
• Device: User-given name of the Sensor or failover pair.
• Model: Sensor model type.
• Status: Displays the operational status of the component. The status is determined by the state of three communication channel parameters: Command Channel, Alert Channel, and Packet Log Channel.
 - Active: All channels are up.
 - Attention: One or two communication channels are down.
 - Disconnected: All three communication channels are down.
 - Standby: Command Channel is still being set up.
 - Uninitialized: Failure in the initial setup.
 - Unknown: Sensor has been added to the NSP user interface, but the actual Sensor has not been set up yet to communicate with the Manager.
• Faults: Critical, Error, Warning, Informational, Total.
 n/n: Number of faults still unacknowledged by the user / the total faults (both
acknowledged and unacknowledged).
From the System Faults page, click a link to access details for the field; for example, the Critical fault for the Manager. Based on this example, we see that the critical alert was triggered by a database connectivity issue. The Fault Type column provides a short description of the fault; this description is a link that can be clicked for a closer look at the fault details.

Let’s look at an example: You log in to Manager. The System Health status in the page
reads Critical (red). You drill down to detail on the System Faults page. After examining the
fault, you manually Acknowledge it. You close the Operational Status and return to the
Dashboard. After 30 seconds, the page refreshes and the status displays Up/Active. The problem may still exist, but because you acknowledged the fault, the Manager treats the system as otherwise healthy and assumes you are taking steps to fix the issue; therefore, you are not constantly reminded of the fault.

Some faults clear on their own and disappear from view; for example, when someone replaces a power supply. Removing the power supply raises a fault; when the power supply is re-inserted, another fault appears describing the new situation, along with a third indicating that there is no power. When power is detected on the device, the power supply is considered operational again, and the Manager clears all three fault messages.
On the Sensor side, you can run the status (shown here) and show commands (displayed
on the following page) that will provide you with administrative information on the
Sensors.

The show command provides configuration details, such as the IP Address, Software
Version, and Manager IP address, while the status command provides information on
communication, such as trust establishment, channel status and general health.

You will use these commands often when troubleshooting communication related issues.
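
As a quick reference, both commands are entered at the Sensor CLI prompt after logging in; the parenthetical notes below describe what each command reports and are not typed as part of the command:

    show      (configuration details: Sensor IP address, software version, Manager IP address)
    status    (runtime details: trust establishment, channel status, general health)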
You can also view Sensor status in the Threat Analyzer under the NSP Health tab. It shows an operational summary with the number of errors broken down by severity, as well as Sensor throughput and utilization. Double-clicking a section of the pie charts calls up a list of the Sensors that the area represents.
When it comes to troubleshooting NSP issues, be sure to think about things like:
• Making sure that you are running the latest supported versions. So, for example, if after
upgrading the Manager it is no longer communicating with the devices, then you
should check the version being used against the supported version list.
• Make sure that you have tried the latest patches available. Check the patch readme for
list of resolved issues.
• You will need to understand the cause of the problem. It’s important to identify the exact steps that lead to the issue.
• Once armed with information to begin troubleshooting, be familiar with the available troubleshooting resources, such as the KnowledgeBase.
One of the most difficult aspects of troubleshooting is that an issue may seem to be the Sensor but could involve other components. After all, the Sensor is placed in a network, and issues can occur with the Sensor, its software, and/or misconfiguration, but also with the connections to other equipment, the configuration of adjoining equipment, misunderstanding how the Sensor operates versus how you think it's supposed to operate, and much more.

The goal is to quickly eliminate the internetworking devices and components as possible causes of the problem and then pinpoint what on the Sensor must be fixed. A standardized, organized approach to troubleshooting will help.

1. Begin by getting a clear picture of the problem. Create a problem statement. “All the
HTTP traffic coming in on port 1A and going out port 1B to XYZ network segment is
being blocked.”
2. Gather facts about the issue. The basics of Who, What, When, Where, and How still
work. Who sees or experiences the problem? What are the symptoms being
experienced? Was the Sensor working previously? What changes have occurred
recently on the Sensor or in the network? When was the problem first noticed? How
often does it occur?



3. Develop a list of possible causes of the problem that is occurring.
4. Determine if there are any components on the possible-cause list that can be eliminated based on the facts learned.
5. Start with the most probable cause. Don't change more than one variable at a time, because you may not know what the real solution was, or you could end up causing additional problems.
6. Figure out how you can test whether the possible cause is the problem; generally, embedded in the test is a possible solution for the problem. What data is available to analyze? How can you replicate the environment?
7. Verify that the problem is fixed; if not, repeat the process with the next possible cause.
8. Record your findings.
Basic troubleshooting tasks involve understanding the problem and gathering information
regarding the problem. Ask the questions that are necessary to your understanding of the
issue. Understanding the problem is your first and most important goal whenever an issue
arises. In order to understand and isolate the problem the customer is reporting, you must
use your existing knowledge of how NSP works, combined with your general computer and
troubleshooting skills. Sometimes, to fully understand a problem, you must search extra
resources or use additional tools to gather information.

Here are the resources that you will use most often when troubleshooting the McAfee NSP solution: the KnowledgeBase, McAfee Communities, and the Help pages.
The KnowledgeBase should be your first resource when troubleshooting issues with Network Security Platform. The Knowledge Center, or KnowledgeBase, provides troubleshooting articles, various product documentation, and procedure documents. The KnowledgeBase is a collection of questions and answers about many McAfee products. The information contained in the KnowledgeBase includes step-by-step instructions, troubleshooting information, and various other articles. If you have an error or question not found within other sources, the KnowledgeBase has often already addressed it.

The KnowledgeBase provides a federated search feature that returns results from multiple locations. You can narrow searches further with the provided Match, Product, and Version fields.
In the course of every new NSP deployment, there comes a time when the customer, or the
team responsible for the deployment, takes a step back and says "now what?". This comes
after the appliances are racked, networked and configured, and initial events are flowing
serenely into the Manager. Dashboards begin to populate with event data, canned
response rules begin to fire, and the administrator sitting at the console becomes
immediately overwhelmed by the magnitude of the problem they have tackled. With
thousands, or even millions, of individual alerts flowing into the Manager every day, it's a
daunting task deciding what's urgent today, what trends are important to watch over time,
and what can be safely ignored.

The Community site provides a multitude of information that you should become familiar
with, and that your customers should become familiar with. While each customer's deployment may be unique, this site outlines basic concepts and tactics that can be applied to any NSP deployment.
If you look in the upper right-hand corner of the NSM menu bar, you’ll see a Help menu.
Click on Help and then again on the Help Contents option. The main NSM help displays an
introduction topic. From here there are several ways to find information.

The most common approach is to click the Search tab in the upper left-hand pane of the help. Use the search field to find any word in the NSM help.

Another way to find information in the help is to click the Contents tab in the upper left-
hand pane of the help.

Another way to find information in the help is to use the Index tab in the upper left-hand
pane of the help. The help index works exactly like an index in a book. Keywords are
organized alphabetically so you can scroll through the list until you find the keyword you’re
looking for.

Another way to find help is in the NSP product itself. For example, previously we learned that to create a new IPS policy, you need to fill out the Properties page. Now, you might need more information to fill in the fields on this screen. If you look in the upper right-hand corner, you'll see a question mark icon. Click that icon to display context-sensitive help that explains the options available on this screen. So, whenever you're within a particular area of the UI that you need more information on, be sure to leverage the Help pages.
This chart displays some common issues and the related log files to collect and review.
Let’s start with looking at the System Log (ems.log). The ems.log is the principal run-time
log file for the Manager. This is the number one log file to view for troubleshooting
common issues dealing with the Manager. This can include device configuration changes,
issues with policy updates, GTI communication, and communication with integrated
products.

Each log file is numbered incrementally for each megabyte of recorded data. The current log is seen in the Network Security Platform directory as ems.log. Previous logs increment with every one megabyte of data (ems.log.1, ems.log.2, etc.). By default, the ems.log file is located at <Network Security Manager install directory>\App\ems.log.
Whether you export the log and view it, or manually retrieve it from the root of the Manager installation directory, you can view the logs in a text editor, such as Notepad.

The EMS.log file provides information on the communications between the Manager and Sensor(s). The log identifies the date and time of messages, types of messages, services, and a description of the status of services and operations. You can tell from this log whether channels are operational, what errors occur during message transfer, the status of the Manager server, whether SQL is running, and much more. This is a rolling log; new data is continuously appended to the file. You will need to scroll to the bottom of the log to see the most recent events.

Generally, you will want to note the time of the Warning or Error and try to correlate to the
other faults in the Manager.
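
A minimal sketch for pulling the Warning and Error entries out of the current ems.log from a Windows command prompt (the install path shown is an assumption; adjust it to your environment):

    REM Change to the Manager application directory that contains ems.log
    cd /d "C:\Program Files\McAfee\Network Security Manager\App"

    REM Case-insensitive search for warning and error entries
    findstr /i "WARN ERROR" ems.log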
In this example log file, we can see that the Sensors named “new1” and “testNSP01” are
reported as DISCONNECTED.

When Sensor new1 is powered on, it requests connection to the Manager. Once the
communication channels are activated, the MIB data is downloaded from the Sensor and
trust gets re-established, the Sensor status transitions from DISCONNECTED to ACTIVE.
The System Log page, within the Manager, provides you with a graphical, filtered view of the ems.log file. As shown in the example, you can configure the view to show ERROR- or INFO-level (and above) messages from a specific time range, along with how many messages to display, to aid in isolating and identifying issues.

You can customize the log query to display only the data you want to see, such as DEBUG
data only or WARN level faults only, for example.
1. Select the level of messages to display. Level/Range Options:
• ALL: All actions performed/recorded by the system. This includes all of the topics that
follow.
• DEBUG: Only debug information for the system.
• INFO: Only configuration information, such as when an action is performed.
• WARN: Only system warning (high severity) information.
• ERROR: Only system error (medium severity) information.
• FATAL: Only crash/failure information.

2. Select the desired range of dates.


3. Type a value for the Number of Messages to Display to limit the log output.
4. Then, click View Messages to view the log.
Shown here is the window displayed when View Messages is clicked. The log query you created will be used to display only the data you wanted to see, such as DEBUG data only or WARN-level faults only. You can click Back on the new interface to return to the messages view.
When the Manager database or disk space becomes full, the Manager cannot process any
new alerts or packet logs. In addition, the Manager may not be able to process any
configuration changes, including policy changes and alert acknowledgment. There is also a
chance that the Manager may stop functioning completely.

Intel Security therefore recommends that customers monitor the disk space on a
continuous basis to prevent this from happening. Use the Disk Usage page to view details
such as the drive on which the Manager database is stored, its total capacity, and the
amount of disk space used.
The Running Tasks monitor of the Dashboard page displays the status of currently in-progress
activities on the system that NSP identifies as long-running processes. When a long
running process is taking place, the status displays as In progress on the Dashboard page
and on the Running Tasks page.

After the activity has completed, it is removed from the Dashboard page and displays on
the Running Tasks page. If a long running activity includes several sub-activities, then the
Manager provides an activity log for each of the sub-activities. For example, an activity like
signature update involves two long running sub-activities: downloading the signature set,
and updating the signature set on all Sensors that have the real-time update enabled.
These sub-activities are tracked separately and the status for each is displayed separately
as well.

McAfee Network Security logs the long running processes against the <Admin Domain>
and the user who performs the activity. The result for each activity is displayed as:
• Failure
• Success
• In Progress: still running
When you see an alert in the Real-Time Threat Analyzer, you are being shown the details
pertinent to an attack attempt. One of the factors that can help you make a decision, on
whether or not to act, is the result of the attack. Ultimately, it is the relevance of the attack
that can assist you with this decision.

Alert relevance, or relevance, is a numeric measure of the extent to which a generated alert is relevant to the target environment; it is calculated for signature-based attacks. Alert relevance is computed based on certain factors and is updated each time a signature set is updated. It is carried out within the Sensor and Manager in a preset order.

Scoring relies on matching each component of the Common Platform Enumeration (CPE) name of the attack signature with the CPE name of the target system.

Relevance is a score that is displayed in the Real-Time Threat Analyzer Alerts tab under a
separate column labeled Relevance.
Use the User Activity Log page to audit the actions of administrative users on the Manager.
An audit is kept to help determine what users have done, in order to determine mistakes,
overwriting, or other issues concerning user activity.

Here, you are able to specify all or specific users, audit categories, the number of messages,
and a time range to be displayed.

Only messages belonging to the selected category/categories will be displayed.


When you click the View Messages option, the selected User Activity Audit appears. You can also drill down into the messages by clicking the hyperlink under the Description (if available). Click Back to return to the messages view.
The MDR Failover Log action enables you to view previous Manager Disaster Recovery (MDR) activity, including the date on which the activity occurred, the users performing the activity, and the nature of the activity.
The Manager Policy Cache page allows you to clear the attack and policy caches without shutting down and restarting the Manager. This may be necessary if the policy cache gets out of sync with the database, possibly due to server errors, database errors, or client-to-server communication errors.
You should always request an InfoCollector Tool results file from the customer for all reported issues. InfoCollector is an information collection tool, bundled with the Manager, that allows the customer to easily provide you with all of the NSP-related log information. You can use this information to investigate and diagnose the issues being reported.

InfoCollector can collect information from the following sources within the NSP system:
• Log Files – Configurable logs containing information from various components of the
Manager.
• Configuration backup – A collection of database information containing all Network
Security Platform configuration information.
• Audit Log Backup - A file containing various information in relation to user activity.
• Configuration Files – XML and property files within the Network Security Platform
config directory.
• Faultlogs – A table in the Network Security Platform database that contains generated
fault log messages.
• Sensor Trace – A file containing various McAfee® Network Security Sensor (Sensor)-
related log files.
• Compiled Signature – A file containing signature information and policy configuration
for a given Sensor.

NOTE: While not enabled by default, it is recommended that you enable collection of the fault log in the InfoCollector export for all issues.
The output of the InfoCollector tool is a zip file that can be sent to Support for
troubleshooting and escalation. The resulting zip file will gather the information shown in
the image here. This includes configuration and system log files that can be used to
correctly understand the environment on the problem system, before we attempt to
reproduce the issue or perform debugging.

The results file, supplied from the customer, should always be attached to the support
case.

NOTE: If you select Sensor trace, the trace must already exist on the Manager. In other
words, you must first go to the Manager and run the Diagnostic trace on the Sensor or
Sensors as required. If you fail to do this, there will be no trace for the InfoCollector to
package.
Faults that appear in the Manager will be recorded to the ems.log and fault log, depending on the issue. Correlating faults to the log files can help you identify the area you need to troubleshoot by allowing you to see what actions occurred before and after the fault.

In this example, we see that the NS9200 device recorded a down fault on its Alert channel.
Further down, we see it’s because the device is not connected to the Manager.
NSP provides log files that you can use for troubleshooting. Important log file information
you should know includes:
• Name and location of log files.
• Typical issues requiring troubleshooting, and the log files likely to be helpful.

The EMS.log is the primary log of the Manager. Depending on the issue being investigated
other logs may be required in addition to the EMS log. This slide lists the logs available on
the Manager for troubleshooting. If additional logs are required they should be correlated
with the EMS log.
• The fault log contains fault entries.
• The audit log has entries for activities carried out by various users on the Manager.
• The crash log is useful for determining why the Manager shut down unexpectedly.
• The MDR log is useful for troubleshooting MDR issues.
• The installer_debug.txt log is useful for troubleshooting installation issues.
• The Watchdog.txt log is useful for viewing controlled service restarts.
• The Threat Analyzer log is particularly useful for troubleshooting issues with the Real-time and Historical Threat Analyzers. For example, if an issue is seen where clicking an alert in the Real-time Threat Analyzer displays an error stating that the "Alert is unavailable", then the Threat Analyzer log should be collected in addition to the EMS log.
A matrix providing you with the general files and tools used in troubleshooting is provided
here.
• ems.log files: Configurable logs containing information from various components of
the Manager.
• Configuration backup: Collection of database information containing all Network
Security Platform configuration information.
• Configuration files: XML and property files within the Network Security Platform config
directory. Contains the configuration backup for reproducing issues.
• Fault log: Table in NSP database that contains generated fault log messages.
• Sensor Trace: File containing various Sensor-related log files.
The following are the processes that pertain to the NSP Manager. You can, for example, verify that the MySQL database (mysqld.exe) and the HTTP daemon (httpd.exe) are running if you are having issues with database connectivity or web services.

The processes listed here should be checked in the Task Manager to ensure the health of
the Manager is in good standing.
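
A minimal check from a Windows command prompt on the Manager server; tasklist and findstr are standard Windows commands, and the process names are the ones listed above:

    REM Confirm the database and web server processes are running
    tasklist | findstr /i "mysqld httpd"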
At times, you may need to access a Sensor or device’s command line interface (CLI) to
perform actions for reported issues. When troubleshooting or product documentation
indicates that you must perform an operation "on the Sensor," it is signifying that you must
perform the operation from the command line of a console host connecting to the Sensor.

You can issue CLI commands locally, from the Sensor console, or remotely, via an SSH client such as PuTTY.

When you are successfully connected to the Sensor, you will see the login prompt. Before
you can enter CLI commands, you must first log on to the Sensor with a valid user name
(username is admin) and password (default is admin123).
The "commands" CLI command provides you with the full list of available commands.

You can refer to the McAfee Network Security Platform x.x CLI Guide for an extensive list of
the available Sensor CLI commands, CLI syntax and command sequence.

All available NSP documentation can be found on the https://mysupport.mcafee.com website.
The most used commands for troubleshooting are listed here.

The show and status commands can tell you the most about how the Sensor is
functioning.

In the event that you need to eliminate the Sensor as the problem area, the layer2 mode
assert/deassert commands are helpful. These commands will turn on and off Layer 2
switch mode on the Sensor and disable and enable the packet processing. When layer2
mode assert is issued, the Sensor simply passes traffic like a transparent switch.

NOTE: The deletesignatures and resetconfig commands require a Sensor reboot.


Other useful commands for troubleshooting are (see the short sequence sketched below):
• show netstat - Provides information on errors, dropped packets, and port state.
• ping <ip address> - Useful for verifying connectivity to upstream and downstream devices, along with communication to the Manager server.
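
A short connectivity-check sequence at the Sensor CLI using the commands above (replace <manager IP> with your Manager server's address; the parenthetical notes are descriptive and not typed):

    status                (is trust established, and are the channels up?)
    ping <manager IP>     (can the Sensor reach the Manager over the management network?)
    show netstat          (any errors, dropped packets, or unexpected port states?)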
There are a number of tools that can be used to identify and troubleshoot Sensor issues.

• The Manager's graphical interface can quickly show issues. From the Dashboard you can drill down to the fault messages.
• Under the Threat Analyzer you have access to Sensor statistics when you create a new
dashboard and select the Sensor statistics as the monitor. You can choose to monitor:
• Flows
• IP Spoofing Packets
• Port Packet Drops
• Rate Limiting
• TX/RX
• Sensor Packet Drop
• Port Throughput
• CPU Utilization
• TCP/UDP Flow
• Sensor Throughput

• There are also the physical Sensor LEDs that can be checked for problems, along with using the command line interface to check operational status.
This chart shows the communication ports that must be operational between the Manager
and Sensors.

For issues between the Manager and Sensor(s) begin with the obvious and work towards
the more complicated.

Is this a new installation, or has existing communication failed?

Is the Sensor powered on, and is the management port link LED green? If not, there may be a speed/duplex issue between the Sensor and the router/switch, or the cables may be bad. Also check that portfast (Cisco switches) is off.

NOTE: The second set of ports is for 2048-bit connections.



Can you ping between Sensor and Manager? If not, then it is most likely a network
connectivity issue.
• Verify the actual cable is good.
• Verify cabling is connected correctly and the peer device configurations are correct.
• If there is a firewall between the Sensor and Manager, the ports shown in this chart
must be allowed through the firewall.
• If Network Address Translation is required to reach the Sensor, then you need to set up a static NAT address on the Sensor. The command line interface command is set manager ip <public IP address>.

Are both the Sensor and Manager listening on the correct ports? You can run netstat on the Sensor and netstat -na on the Manager (a minimal check is sketched below).

You can also verify that the correct ports are being used with a Wireshark capture (set up trust between the Manager and Sensor while running the capture on the Manager).

Also check that, if enabled, the Windows firewall has the proper rules to allow the traffic.
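
A minimal sketch for confirming the Manager is listening on the required channel ports, run from a Windows command prompt on the Manager server (replace <port> with the port numbers from the chart above for your version):

    netstat -na | findstr "<port>"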
If the Sensor is having difficulty connecting to the Manager, begin by checking the ports, the cabling, port status, and so on. Log into the Manager and check the device's reported health. Also, log into the Sensor and review the status. Is trust established? Are the Alert and Log channels up?
Troubleshooting issues with Sensor and Manager communication includes checking for network connectivity issues between the Sensor and the Manager.
• Validate if there is a firewall in between, if so, ensure it is properly configured to allow
the communication.
• Check cabling and hardware status.
• Run the reconnectalertandpktlogchannels command to have the Sensor
reattempt to establish connectivity to the Manager.
When you need to diagnose Sensor and Manager-to-Sensor communication, it's helpful to obtain and review both the diagnostic trace and the ems.log file. The diagnostic trace can be downloaded and used by McAfee engineering or Tier III to identify issues. McAfee Tier II and Tier III Support Engineers can decrypt and analyze the trace files.

The ems.log file is an event log generated by the Manager and used for debugging. If there are communication issues between the Manager and the Sensor, the event and/or error will be logged to this file. Errors in the log such as snmpGetNext failed, Sensor unreachable, or request timeouts indicate that the Sensor is down or that the network connectivity between them is experiencing an issue.

The ems.log will save to any place the user selects.

The Diagnostic Trace and EMS.log function the same for all NSP Sensors.
To run a diagnostic trace on the device, select the Upload button. This instructs the device to begin logging certain internal statistics and to provide them to the Manager in an encrypted archive, which is then available to be provided to Support for review.
In the Manager you can also view the status of the Sensor ports.

To view the configuration of ports go to the Devices page/tab > [admin domain] > [Sensor]
> Setup > Physical Ports. This page shows the port configuration including speed and
duplex. Mismatched settings between the Sensor’s monitoring ports and peer devices can
show up as issues with dropped traffic or link failures. Also verify that for GE ports you are
using only authorized McAfee XFP GBICs.
Issues with the monitoring ports may show up as faults and status errors that can be seen from the Manager Dashboard. For issues such as suspected latency or over-utilization (oversubscription), you can use the Sensor performance output in the Threat Analyzer.

You can validate that the Sensor performance information is being collected by the
Manager: Devices > Troubleshooting > Performance Monitoring.

Again, begin with the basics and work towards the more complicated.

What are the symptoms being experienced?


• Dropped traffic? Check the performance statistics for packets received and sent.
 - It could be a fiber problem or a transceiver problem; this needs to be investigated.
 - It could be a configuration issue; verify port speed/duplex settings on the Sensor and peer devices.
 - It might be a bandwidth issue with the Sensor; what is the port's capacity?
 - It might be a bandwidth issue with a peer device, such as Quality of Service (QoS) settings on a peer switch or router that is over capacity.
• No traffic is being processed? Check the performance statistics; is the Sensor receiving traffic?
• Some types of traffic are being dropped/blocked? Verify the policy by putting a null policy in place and testing: select the interface of the Sensor, select the policy, select Null from the drop-down menu, and apply. Then update the Sensor configuration.
When an in-line device experiences problems, most people's instinct is to physically pull it out of the path: disconnect the cables and let traffic flow unimpeded while the device is examined elsewhere. The NSP Sensors, however, have a Layer2 Passthru feature. If you feel the Sensor is causing network disruption, before removing it from the network, simply issue the following command:
layer2 mode assert

This pushes the Sensor into Layer2 Passthru (L2) mode, causing traffic to flow through the
Sensor while bypassing the detection engine. After going into Layer 2 Passthru mode,
check to see whether the services are still affected; if they are, then you have eliminated
certain Sensor hardware issues; the problem could instead be a network issue or a
configuration issue.
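
A minimal sketch of that isolation test at the Sensor CLI (the parenthetical notes are descriptive and not typed):

    layer2 mode assert      (Sensor passes traffic like a transparent switch, bypassing the detection engine)
    (re-test the affected services while in passthru mode)
    layer2 mode deassert    (return the Sensor to normal inspection once testing is complete)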
The Fail-open operation provides a measure of network integrity when a Sensor fails.
When a Sensor with ports operating in In-line Fail-Open Mode experiences a critical fault,
the Sensor reboots; during the reboot, the Sensor goes into fail-open mode until it restarts.
If a critical fault occurs again, another reboot cycle is initiated. If this continues it can cause
a network bottleneck at the Sensor. This can continue until acted upon through human
intervention.

You can enable a failure threshold to automatically initiate fail-open, or passthru, mode by
configuring the Layer 2 Switch feature from the Manager interface. This feature enables
you to set a threshold on the number of critical failures within a configured period of time
that the Sensor can experience before being forced into passthru mode at Layer 2.
Port issues may involve issues with dropped traffic, and performance. Listed here are
questions to ask that will allow you to narrow down the issue.
If dropping traffic, validate if it’s occurring on all ports or just one. Use the Sensor CLI
commands to look at port status and performance information.
Check that the cabling and port interfaces are functioning properly. A quick test is to
enable Layer 2 mode and see if the issue persists.
If it is reported that the Sensor is dropping specific traffic, such as only FTP, start by validating when the issue began to occur and what happened prior to the issue. Did a new signature set get deployed to the Sensor? Did it recently receive an upgrade or configuration change? A quick test is to enable Layer 2 mode and see if the issue persists.
When troubleshooting issues with ports, you can look at the port statistics on the Manager and check for errors; you can also check the LEDs on the Sensor itself. The peer equipment's statistics and hardware LEDs can also provide insight. Most fiber cabling, once installed, is stable, but out-of-box failures do occur; cabling can get pulled during installation of other equipment, and in a poor lab environment dirty or broken fiber can also be an issue.

As a reminder, only McAfee-approved transceivers can be used; transceiver faults can result in errors similar to fiber problems.
Most performance issues are related to switch port configuration, duplex mismatches, link
up/down situations, and data link errors. Excessive data link errors usually indicate a
problem.

Half-duplex setting
When operating with a duplex setting of half-duplex, some data link errors such as frame
check sequence (FCS), alignment, runts, and collisions are normal. Generally, a 1% ratio of
errors to total traffic is acceptable for half-duplex connections. If the ratio of errors to input
packets is greater than 2% or 3%, performance degradation may be noticeable.

In half-duplex environments, it is possible for both the switch and the connected device to
sense the wire and transmit at exactly the same time, resulting in a collision. Collisions are
caused when the frame is not completely copied to the wire, resulting in fragmented
frames.

Full-duplex setting
When operating at full-duplex, FCS, cyclic redundancy checks (CRC), alignment errors, and
runt counters should be minimal. If the link is operating at full-duplex, the collision counter
is not active. If the link errors are incrementing, check for a duplex mismatch. Duplex
mismatch is a situation in which the switch is operating at full-duplex and the connected
device is operating at half-duplex, or vice versa. This results in extremely slow
performance, intermittent connectivity, and loss of connection.
Analysis of captured data packets can help monitor data communication and network
usage, along with performing forensic analysis to help in identifying network security
threats. The captured data packets can also be used for troubleshooting Sensor issues.
Data packets can be captured in port or file mode. In port mode, captured packets are forwarded to an external device, such as a sniffer. In file mode, the packets are sent to the Manager or an SCP server.

Packet Capturing
A packet log is created by a Sensor capturing the network traffic of and around an
offending transmission. An expert in protocol analysis can use the log information to
determine what caused the alert and what can be done to prevent future alerts of the same
nature.

Sensors save all packet logs in library packet capture (libpcap) format and store them in the Manager database. You can examine log files using Wireshark.

NOTE: The Sensor packet capture only captures ingress packets (packets coming in to the interface, not ones leaving), so it will miss half the conversation. Troubleshooting may be easier with a third-party capture device.
Before starting a capture, you must define a destination for the captured data. This can be a SPAN port (as packets), or the Manager or an SCP server (as a pcap file). Once the capture is finished, the pcap file is uploaded to the Manager or to an SCP server.

Capture Rules
The packet capture configuration is done using packet capture rules applied to the Sensor. After choosing a destination, you can apply capture rules, or filters, to the traffic, defining which monitoring ports, protocols, VLANs, and IP addresses you wish to capture. You can also create templates to save commonly used capture rules. These rules are cumulative; all are taken into account to determine which packets are captured.
While the capture is running, you can log into the Sensor and monitor the progress using
the show pktcapture status command.
The only way to view the captures is via the NSP Manager's Threat Analyzer, viewing the pcaps with Wireshark, or exporting the evidence report.

NOTE: The pcaps are stored in the file system of the Manager.
This next section goes over several example scenarios to help you gain a better understanding of how to approach troubleshooting commonly reported issues.
When unable to view or modify policies, check the policy cache. It may get out of sync with
the database due to server errors, database errors, or client/server communication errors.
Once you clear the caches, it may take a few minutes to open a policy in the Network
Security Policy Editor, because the applied policy must be re-cached.

The Manager Policy Cache panel, on the Manager page, displays the following information:
• Cached Attack Definitions: Number of attacks stored in Manager cache.
• Cached IPS Policies: Number of policies in Manager cache.
• IPS Policy Names: Names of policies in Manager cache.
• Cached Reconnaissance policies: Number of reconnaissance policies in Manager
cache.
• Reconnaissance Policy Names: Names of reconnaissance policies in Manager cache.
In this scenario, it is reported that the Sensor is dropping packets and that the link
indicator light is off. Dropping packets by itself is not always an issue. Some packet loss is
to be expected, but high packet loss, slowness and other accompanying issues should be
investigated.
From the information provided, it appears that the Sensor was working before; in this case, the cause is probably cabling or a change in the peer device configuration. You'll want to verify whether traffic is flowing. It is important to ensure proper configuration of the peer devices. You can try a different port pair. You can also test by placing the Sensor in Layer 2 Bypass mode in order to rule out the Sensor as the problem.

NOTE: The latency KB should be consulted if it is still an issue: KB70861.


In this scenario, you are unable to receive alerts from the Sensor. The first step here is to
verify the health of the Sensor. Is it up and communicating with the Manager? If so, then
you should check the status of the Sensor ports.
1. From the NSM UI, go to Devices page > <Device> > Setup > Monitoring Ports and verify the ports are up.
 - If the port is up (green), go to step 2.
 - If the port is down, troubleshoot that condition. Can the port be re-enabled?
   - Yes = resolved.
   - No = Start checking cabling, speed/duplex settings, GBICs, and connected devices (routers, switches). Determine why the link cannot be established.
2. PuTTY (SSH) to the Sensor CLI and log in (see the CLI sketch after these steps).
 - Verify there is a good trust to the NSM (show and status commands).
 - Run the following CLI commands:
   - show intfport 1A <enter> - run this several times to see if there is traffic on the Sensor.
   - show inlinepktdropstats 1A <enter> - look to see if the traffic is getting dropped. The output shows whether packets were dropped and the function/reason that dropped them.
3. Back on the NSM, get a trace upload from the Sensor and attach it to the Service Request (SR). There are many other tests and checks that can be done, but without more detail this is a good approach on what to start checking.
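
A sketch of the CLI portion of this scenario; port 1A is just the example port used in the steps above, so substitute the monitoring port in question (the parenthetical notes are descriptive and not typed):

    status                          (verify trust and channel status with the Manager)
    show intfport 1A                (run several times; is traffic arriving on the port?)
    show inlinepktdropstats 1A      (are packets being dropped, and by which function/reason?)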
This next section focuses on troubleshooting database related issues.

In the event that the Manager loses connectivity to the database (i.e., the database goes down), the alerts from Sensors will be stored in a flat file on the Manager server. When database connectivity is restored, the alerts are stored in the database.

We recommend that customers monitor the disk space on a continuous basis to prevent this from happening. If the Manager database or disk space is full, the Manager will be unable to process any new alerts or packet logs. In addition, the Manager may not be able to process any configuration changes, including policy changes and alert acknowledgement. In fact, the Manager may stop functioning completely. To rectify this situation, perform maintenance operations on the database, including deleting unnecessary alerts and packet logs.

Common symptoms that occur if the database tables become corrupt include:
• MySQL errors reported in the ems.log file.
• Inability to acknowledge or delete faults in Operational Status.
• An error message when trying to view a packet log in the Threat Analyzer.
• The message "No Packet log available for this alert at this time".
The Network Security Platform uses the MySQL database.

You may be getting errors that no packet log is available on the Alerts page. If an anti-virus application is installed, be sure to check that the appropriate directories are being excluded from scanning.

The customer may report that the database service is failing to start. While checking the status of running processes, be sure to validate that there is only one instance of MySQL running. Also look for any old my.ini files using the Windows search function.

You can try executing mysqld manually in the \mysql\bin directory of the install path.
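
A minimal sketch for starting the bundled database manually from a Windows command prompt (the install path is an assumption, and --console, which prints startup errors to the console, is a standard mysqld option on Windows rather than anything NSP-specific):

    REM Verify no other MySQL instance is already running
    tasklist | findstr /i mysqld

    REM Start mysqld from the Manager's mysql\bin directory and watch for startup errors
    cd /d "C:\Program Files\McAfee\Network Security Manager\mysql\bin"
    mysqld --console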
There are two log files specifically related to Manager/Central Manager installation and
upgrades, available for troubleshooting:
• mgrVersion.properties: Every fresh installation or upgrade of the Central Manager or
Manager is logged to this file. Each entry contains the version of the Central Manager
or Manager that you installed or upgraded to. It also contains the date and time of
when you performed this action. This can help you troubleshoot issues. For example,
you can go through this log to correlate an issue with a specific Manager upgrade. This
file is stored at <Central Manager or Manager install directory>\App\config.

• dbconsistency.log: When you upgrade the Central Manager or Manager, the installed
database schema is compared against the actual schema of the version you are
upgrading to. This comparison is to check for any inconsistencies. The details of this
comparison are logged to this file as error, warning, and informational messages. This
file is stored at <Central Manager or Manager install directory>\App. You can verify
this log to check if any database inconsistency is the cause of an issue. This file is
updated whenever you upgrade the Central Manager or Manager.
This next section covers troubleshooting Sensor upgrade issues.

If you experience Sensor upgrade issues, the symptoms may be that the software upload failed or that you are unable to push configuration or signature sets to the Sensor.

Use the Sensor CLI and check if the Sensor reports that trust is established with the
Manager.

Check the control and pklog channel status. Are they going up and down (flapping)?

For any error messages being reported, be sure to search the KnowledgeBase for a resolution.
Shown here are some questions that should be asked to help isolate the problem.
If trust is not established, you will need to re-establish the trust and then attempt to
upload the software again.

If necessary, you may need to TFTP the software to the Sensor and perform the upgrade manually.
If the control channels are flapping, restart the Manager. If this doesn’t resolve the issue,
get a network capture and check for TCP re-transmissions.

If necessary, you may need to deinstall and then reinstall the Sensor.

Also, check for any possible firewalls or other network constraints between the device and
Manager.
If you are unable to push new signature sets, re-initialize the Sensor as part of the troubleshooting.

Delete the current signature file using the deletesignatures command and then attempt the push again.

Use the resetconfig command to allow the Sensor to obtain a fresh configuration.
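
A sketch of that sequence at the Sensor CLI; as noted earlier, both commands require a Sensor reboot, so plan a maintenance window and run them only when re-initializing the Sensor is acceptable (the parenthetical notes are descriptive and not typed):

    deletesignatures    (removes the current signature file; push the signature set from the Manager again afterward)
    resetconfig         (clears the Sensor configuration so it pulls a fresh configuration from the Manager)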
When you have connectivity issues or issues surrounding the Sensors, it’s often best to
start with validating the status of the LEDs on the Sensors. The LEDs can help you
understand what the possible problem may be.

You can check the link LEDs on devices to determine whether or not they have an active link.

For information on LED purpose and meaning of indicator lights, refer to the available
Sensor Hardware documentation.
