
URL Categorization

Tech Note
PAN-OS 4.1

Revision C

2012, Palo Alto Networks, Inc. www.paloaltonetworks.com

Overview
This document describes the URL categorization components and resolution process used in PAN-OS.

URL Categorization Components


The components listed below are used in the URL lookup process on PAN-OS. These components may be located on the
Palo Alto Networks Management Plane (MP), Dataplane (DP), hard disk, or external to the device (Internet).
MP URL Database
20 million URLs stored on disk
Downloaded daily from BrightCloud servers
Cloud Servers
Servers on the Internet that contain the complete BrightCloud database
MP Dynamic URL Cache
URL cache stored on the Management Plane
Contains the last 1 million queries to the cloud
Cache is persistent and is written to disk every 20 minutes to avoid problems if a power failure occurs.
Entries in this cache expire after a certain period of time. In PAN-OS 4.1, that time period defaults to 24 hours
and is configurable. Prior to PAN-OS 4.0, that time period was 7 days and was not configurable.
Can be manually cleared using the CLI commands:
o delete dynamic-url host all
o delete dynamic-url host name
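The expiry behavior described above can be illustrated with a small time-to-live cache. This is only a sketch under stated assumptions, not the PAN-OS implementation: the class name `DynamicUrlCache`, its methods, and the injectable `clock` parameter are hypothetical, and the eviction of the oldest entry at capacity is an assumption based on the "last 1 million queries" description.

```python
import time
from collections import OrderedDict


class DynamicUrlCache:
    """Hypothetical sketch of a size-bounded cache whose entries expire."""

    def __init__(self, max_entries=1_000_000, ttl_seconds=24 * 3600, clock=time.time):
        self.max_entries = max_entries      # "last 1 million queries"
        self.ttl_seconds = ttl_seconds      # 24 hours by default in PAN-OS 4.1
        self._clock = clock                 # injectable so expiry can be tested
        self._entries = OrderedDict()       # host -> (category, expiry time)

    def store(self, host, category):
        self._entries[host] = (category, self._clock() + self.ttl_seconds)
        self._entries.move_to_end(host)
        # Assumption: once full, the oldest stored entry is dropped first.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def lookup(self, host):
        item = self._entries.get(host)
        if item is None:
            return None
        category, expires_at = item
        if self._clock() >= expires_at:
            del self._entries[host]         # expired entries are discarded
            return None
        return category

    def delete(self, host=None):
        """Loosely mirrors 'delete dynamic-url host all' and '... host name'."""
        if host is None:
            self._entries.clear()
        else:
            self._entries.pop(host, None)


# Usage with a fake clock so no real waiting is needed.
now = [0.0]
cache = DynamicUrlCache(max_entries=2, ttl_seconds=10, clock=lambda: now[0])
cache.store("a.example.com", "news")
fresh = cache.lookup("a.example.com")
now[0] = 11.0
expired = cache.lookup("a.example.com")
```

With the fake clock, `fresh` holds the stored category and `expired` is `None` because the entry aged out.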
DP URL Cache
URL cache stored on the Dataplane
Subset of the MP dynamic URL cache
Contains X number of the most recent queries to the cloud and MP URL database (top 20M on-device database),
where X is the following:

Platform    DP URL cache entries (X)
PA-5060     100,000
PA-5050     100,000
PA-5020     40,000
PA-4060     100,000
PA-4050     100,000
PA-4020     40,000
PA-2050     40,000
PA-2020     40,000
PA-500      10,000
PA-200      5,000

Entries in this cache never age out, but will get pushed out if not among the most recently accessed
Upon reboot, DP URL cache is cleared
Can be manually cleared using the CLI command clear url-cache
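A cache whose entries never age out but are pushed out when they are not among the most recently accessed behaves like a least-recently-used (LRU) cache. The following is a minimal sketch of that idea, assuming LRU eviction; the class `DataplaneUrlCache` and its method names are hypothetical, and only the per-platform sizes come from the figures listed above.

```python
from collections import OrderedDict

# Per-platform capacities as listed above.
DP_CACHE_CAPACITY = {
    "PA-5060": 100_000, "PA-5050": 100_000, "PA-5020": 40_000,
    "PA-4060": 100_000, "PA-4050": 100_000, "PA-4020": 40_000,
    "PA-2050": 40_000, "PA-2020": 40_000,
    "PA-500": 10_000, "PA-200": 5_000,
}


class DataplaneUrlCache:
    """Hypothetical LRU sketch: no time-based expiry, bounded size."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()   # least recently used entry first

    def lookup(self, url):
        category = self._entries.get(url)
        if category is not None:
            self._entries.move_to_end(url)      # mark as most recently used
        return category

    def store(self, url, category):
        self._entries[url] = category
        self._entries.move_to_end(url)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # push out the least recent

    def clear(self):
        """Models a reboot or the 'clear url-cache' command."""
        self._entries.clear()


# Usage: a tiny capacity makes the eviction visible.
cache = DataplaneUrlCache(capacity=2)
cache.store("a.example.com", "news")
cache.store("b.example.com", "games")
cache.lookup("a.example.com")             # "a" is now the most recent
cache.store("c.example.com", "sports")    # pushes out "b"
```

After these calls, `b.example.com` has been pushed out while `a.example.com` and `c.example.com` remain.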

Bloom Filter Hash Table (optional)


Hash of the 20 million on-device database
When enabled, allows the device to quickly check whether a particular URL is in the on-device database without actually
accessing the disk
Disabled by default
Stored in MP memory
Recalculated each time a new URL database is loaded
Written to disk every hour, in case of power failure
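For readers unfamiliar with the data structure, a generic Bloom filter can be sketched as follows. This is not the PAN-OS implementation: the class name, bit-array size, hash count, and the use of SHA-256 are all assumptions chosen for illustration. Note that a standard Bloom filter never reports a stored item as absent, but it can occasionally report an absent item as present.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch; all sizes here are illustrative."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self._bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not stored"; True means "probably stored".
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Usage: build the filter from the on-device URL list, then test membership
# in memory before deciding whether a disk lookup is worthwhile.
on_device = BloomFilter()
for url in ("domain.com", "shop.example.com"):
    on_device.add(url)
```

Rebuilding the filter whenever a new URL database is loaded, as described above, corresponds to creating a fresh `BloomFilter` and adding every URL again.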

MP URL Cache (optional)


Most recently accessed 1 million URLs of the 20 million on-device database
When enabled, allows the device to potentially determine a URL category without having to access the disk
Disabled by default (enable by running debug device-server bc-url-db cache-enable)
Stored in MP memory

Entries in this cache never age out, but will get pushed out if not among the most recently accessed
Cache is persistent: it is written to disk every hour, in case of power failure

URL Category Resolution Process

When a user attempts to access a URL and the URL category needs to be determined, the Palo Alto Networks device will
compare the URL with the following components and will stop when a match is found:
1. Block list of the matching URL profile
2. Allow list of the matching URL profile
3. Custom categories that have been defined
4. DP URL cache
5. MP URL database
6. MP dynamic URL cache (if dynamic URL filtering is enabled on the URL profile)
7. Cloud servers (if dynamic URL filtering is enabled on the URL profile)
If there is no response from the MP URL database within 5 seconds (configurable timeout), the URL will be categorized as
not-resolved.
If there is no response from the cloud servers within 5 seconds (configurable timeout), the URL will be categorized as
not-resolved.
If the URL is categorized as not-resolved, PAN-OS will take the action configured in the URL profile for not-resolved,
but will continue to attempt resolution of the category. When the final category match is resolved, the result will be entered
into the appropriate cache(s). If the action for the not-resolved category is allow or alert, the URL requests are
allowed and forwarded, but the response from the server will be discarded. Typically the client will retry the request. Since
PAN-OS continued with resolution for the original request, the category is likely to be resolved and entered into the
appropriate cache(s). In this case, the retry will typically match a cache entry.
Entries marked as unknown are truly unknown to BrightCloud; they are likely new sites that have never been classified.
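The lookup order and the not-resolved fallback described in this section can be sketched as a simple ordered chain. This is an illustration under assumptions, not PAN-OS code: the function `resolve_category`, the stage tuples, the sample URLs, and the use of a raised `TimeoutError` to stand in for the 5-second timeout are all hypothetical, and treating the block and allow lists as plain lookups is a simplification.

```python
NOT_RESOLVED = "not-resolved"


def resolve_category(url, stages, dynamic_filtering=False):
    """Walk the stages in order and stop at the first match.

    stages: ordered (name, lookup, needs_dynamic) tuples, where lookup(url)
    returns a category string or None, and may raise TimeoutError.
    Returns (category, stage_name), or (None, None) if nothing matched.
    """
    for name, lookup, needs_dynamic in stages:
        if needs_dynamic and not dynamic_filtering:
            continue    # steps 6 and 7 apply only with dynamic URL filtering
        try:
            category = lookup(url)
        except TimeoutError:
            return NOT_RESOLVED, name   # no response within the timeout
        if category is not None:
            return category, name
    return None, None


def cloud_lookup(url):
    # Hypothetical stand-in for the cloud servers.
    if url == "slow.example.com":
        raise TimeoutError("no response from cloud servers")
    return {"new.example.com": "games"}.get(url)


stages = [
    ("block list", {"bad.example.com": "blocked"}.get, False),
    ("allow list", {}.get, False),
    ("custom categories", {}.get, False),
    ("DP URL cache", {"cached.example.com": "news"}.get, False),
    ("MP URL database", {"shop.example.com": "shopping"}.get, False),
    ("MP dynamic URL cache", {}.get, True),
    ("cloud servers", cloud_lookup, True),
]

result = resolve_category("shop.example.com", stages)
```

Here `result` is `("shopping", "MP URL database")`, because the four earlier stages return no match. The sketch does not model the continued background resolution that follows a not-resolved result.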

URL Category Resolution Process: High Performance (PAN-OS 3.1.6 and higher)
In terms of lookup speed, URLs that match the DP cache are resolved more quickly than URLs that match the MP caches. A
match in memory will be quicker than a match on disk. Each of these methods will be faster than querying the cloud
servers. In high-performance environments, the default resolution methods may need to be improved upon.
There are two optional components that can be enabled in order to reduce the time it takes to resolve URLs: the Bloom
filter and MP URL cache. These two components should be enabled in environments that require a combination of
high/new session rates, high URL lookup rates and high logging rates (5,000+ logs/sec). When these two components are
enabled, URLs are compared in the following order:
1. Block list of the matching URL profile
2. Allow list of the matching URL profile
3. Custom categories that have been defined
4. DP URL cache
5. MP Bloom filter hash table:
   If there is a match, the URL is in the on-disk database. Check the following:
      6. MP URL cache
      7. MP URL database
   If there is NOT a match, the URL is not in the on-disk database. Check the following:
      6. MP dynamic URL cache (if dynamic URL filtering is enabled on the URL profile)
      7. Cloud servers (if dynamic URL filtering is enabled on the URL profile)
Note: Since the Bloom filter and MP URL cache use additional MP memory, it is recommended that you only implement
these features where high performance URL filtering is required.
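The branching order above can be sketched in a few lines. This is a hypothetical illustration, not PAN-OS code: the function names and sample data are assumptions, and a plain Python set stands in for the Bloom filter membership test.

```python
def first_match(url, stages):
    """Return (category, stage_name) for the first stage that matches."""
    for name, lookup in stages:
        category = lookup(url)
        if category is not None:
            return category, name
    return None, None


def resolve_high_performance(url, early, in_bloom_filter, on_device, dynamic,
                             dynamic_filtering=False):
    # Steps 1-4: block list, allow list, custom categories, DP URL cache.
    category, source = first_match(url, early)
    if category is not None:
        return category, source
    # Step 5: the Bloom filter decides which branch is consulted.
    if in_bloom_filter(url):
        # Steps 6-7 (match): MP URL cache, then MP URL database.
        return first_match(url, on_device)
    if dynamic_filtering:
        # Steps 6-7 (no match): MP dynamic URL cache, then cloud servers.
        return first_match(url, dynamic)
    return None, None


on_device_urls = {"shop.example.com": "shopping"}
early = [
    ("block list", {}.get),
    ("allow list", {}.get),
    ("custom categories", {}.get),
    ("DP URL cache", {"cached.example.com": "news"}.get),
]
on_device = [("MP URL cache", {}.get), ("MP URL database", on_device_urls.get)]
dynamic = [
    ("MP dynamic URL cache", {}.get),
    ("cloud servers", {"new.example.com": "games"}.get),
]
in_bloom_filter = on_device_urls.__contains__   # stand-in for the real filter

result = resolve_high_performance(
    "shop.example.com", early, in_bloom_filter, on_device, dynamic
)
```

Here `result` is `("shopping", "MP URL database")`. A URL that the filter reports as absent skips the disk entirely and goes straight to the dynamic branch, which is where the time saving comes from.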
The following diagram shows the sequence that includes the Bloom filter and MP URL cache:

The CLI commands to enable the Bloom filter and MP URL cache are:
admin@PAN(active)> set system setting url-filtering-feature filter true
admin@PAN(active)> set system setting url-filtering-feature cache true
To activate these settings a restart of the device-server is required:
admin@PAN(active)> debug software restart device-server
Confirm that these settings took effect:
admin@PAN(active)> show system setting url-filtering-feature
cfg.url-feature.basedb-cache: True
cfg.url-feature.bloom-filter: True

These two settings are persistent; they will survive a reboot. These commands must be executed on each device
in an HA pair.
You can examine the cache hit rate using the following command:
debug device-server bc-url-db show-stats
Example output of that command is shown below:

URL Lookup and Matching


To find a match, PAN-OS parses the categorization components against the requested URL. PAN-OS interprets URLs as
being composed of tokens. A token is any string located between two separators (refer to the Palo Alto
Networks Administrator's Guide Release 4.1 for detailed descriptions of tokens and separators). If there is NOT a match
for the left-most token, that token is chopped and the lists are parsed (in order) for entries that match the next token to the right.
This continues until there is a match for a token. When a token is matched, the entry that contained the match is parsed
for a match in the next token to the right. This continues until the remainder of the URL is completely matched.
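Splitting a URL into tokens and chopping the left-most token can be sketched as follows. The separator set used here (`. / ? & = ; +`) is an assumption made for illustration; the Administrator's Guide remains the authoritative list. The function names are hypothetical.

```python
import re

# Assumed separator characters; check the Administrator's Guide for the real set.
ASSUMED_SEPARATORS = r"[./?&=;+]"


def tokenize(url):
    """Split a URL into tokens, dropping empty strings between separators."""
    return [token for token in re.split(ASSUMED_SEPARATORS, url) if token]


def chop_leftmost(tokens):
    """Drop the left-most token, as done when it matches no entry."""
    return tokens[1:]


tokens = tokenize("sub2.sub1.domain.com/index.html")
after_chop = chop_leftmost(tokens)
```

For this input, `tokens` begins with `sub2` and `after_chop` begins with `sub1`, which is the state in which matching is retried.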

Wildcard Usage and Processing


Wildcards can be used in custom URL lists and categories. The use of a wildcard is only acceptable when it is the only
character in a token. In other words, a wildcard must be used as a complete token. It is very important to understand how
wildcards are handled by PAN-OS.
Example:
Requested URL: sub2.sub1.domain.com
Allow list entry: sub1.bogus.com
Allow list entry: *.domain.com
MP URL Database: domain.com
In the example above, we first look for a match in the left-most token for sub2. We find a single match against the
wildcard in *.domain.com. We then check the next token to the right in that entry. Since domain does not match sub1,
we disregard that entry for the remainder of parsing. The allow list entry *.domain.com is thereby
eliminated from further consideration, since we do not apply the wildcard across multiple tokens. In other words, a
wildcard can only match a single token, whereas it may seem intuitive to many administrators that the wildcard applies to
any number of tokens, especially when used in the left-most token position.
As parsing continues, we find no other matches for sub2 in the left-most token, so we chop it and are now matching
against the URL sub1.domain.com. We find a single match for sub1 (in sub1.bogus.com), but the next token to the right doesn't match.
Remember that sub1 does not match the wildcard in *.domain.com because that entry has been eliminated as a
possible match. We chop again to domain.com and again don't find a match in the allow list, since *.domain.com has
been eliminated as a possible match. The rest of the components are parsed in order and we end up matching the
domain.com entry in the pre-defined list (the MP URL database) rather than the *.domain.com entry in the allow list.
A use case where wildcard usage frequently becomes a point of confusion is where URLs can be requested with or
without the www. portion of URLs. For example, www.domain.com or domain.com can be used in an HTTP GET
request to the Domain main web site. Since wildcards only function for a single token, a custom list/category entry of
*.domain.com will NOT match a request for domain.com. In order to cover requests for both forms of the GET request,
the list/category must include entries for *.domain.com and domain.com. Of course, exact string matches can be used
instead of wildcards.
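The single-token wildcard behavior described in this section can be reproduced with a short sketch. It follows the parsing steps as the text describes them (match the left-most token, eliminate an entry once its remaining tokens fail, then chop and retry), handles dot-separated hostnames only, and is not the actual PAN-OS matcher. The function name `find_match` is hypothetical.

```python
def find_match(url, entries):
    """Return the first entry that matches url, or None (hostnames only)."""
    remaining = url.split(".")
    eliminated = set()      # entries disregarded for the rest of parsing
    while remaining:
        for entry in entries:
            if entry in eliminated:
                continue
            entry_tokens = entry.split(".")
            # The left-most tokens must match; "*" matches any single token.
            if entry_tokens[0] not in ("*", remaining[0]):
                continue
            # Every following token must match and the whole remainder
            # of the URL must be consumed.
            if len(entry_tokens) == len(remaining) and all(
                e in ("*", r) for e, r in zip(entry_tokens, remaining)
            ):
                return entry
            eliminated.add(entry)
        remaining = remaining[1:]   # chop the left-most token and retry
    return None


allow_list = ["sub1.bogus.com", "*.domain.com"]
mp_database = ["domain.com"]

allow_hit = find_match("sub2.sub1.domain.com", allow_list)
database_hit = find_match("sub2.sub1.domain.com", mp_database)
```

As in the worked example, `allow_hit` is `None` and `database_hit` is `"domain.com"`. Likewise, `find_match("domain.com", ["*.domain.com"])` returns `None`, while adding the exact entry `domain.com` to the list makes the request match.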

A second use case where wildcard usage frequently becomes a point of confusion is the coverage of subdomains. A
wildcard will only match a single subdomain string. Administrators often expect a wildcard in the left-most token position to
cover any number of subdomains. In other words, they expect that *.domain.com would cover sub1.domain.com,
sub2.sub1.domain.com, and so on. This is not the case; an entry must be included in the custom list/category for each
subdomain token. Typically, covering two levels of subdomains will suffice.

Revision History
Date        Revision    Comment
7/11/12     C           Added URL Lookup and Matching section and paragraph on handling of not-resolved.
1/31/12     B           Added command to check status of MP cache.
12/22/11    A           First release of this document.
