Websense Enterprise® URL Database FAQs
How big is your database?
The URL Database contains 6 million sites, covering 1.1 billion Web pages.
How many categories are in your database?
Websense categorizes the Web into 80+ categories, including shopping, adult content and stock trading.
How did you select these particular categories?
The categories used by Websense Inc. to sort out the millions of Web sites on the Internet have been designed
to collect in useful groupings the kinds of sites of interest and concern to its subscribing customers. They are not intended to characterize any site or group of sites or the persons or
interests who publish them, and they should not be construed as doing so. Likewise, the labels attached to Websense categories are convenient shorthand and are not intended to convey, nor
should they be construed as conveying, any opinion or attitude, approving or otherwise, toward the subject matter or the sites so classified.
Websense built its first filtering solution in 1996. Our version 3 product had 27 fixed categories developed
jointly with the company's first customers. Websense's version 4 database was released in late 1999 with Websense Enterprise 4.0. Twenty-seven additional categories were added, mainly
to address "cyberslacking" problems, such as employees shopping or gaming on company time. These categories were developed jointly from customer feedback and focus groups. Websense
Enterprise version 4.x software provides the ability to dynamically add new categories and product offerings for our customers. In May 2000, Websense added Productivity PG categories (instant
messaging, online trading, advertising and others) and 13 new liability, cyberslacking and bandwidth management categories, such as MP3, Internet auctions and lingerie and swimsuit, while
differentiating out sex education. In June 2001, the company added Bandwidth PG categories (Internet radio and TV, streaming media, peer-to-peer file sharing, personal network storage/backup
and Internet telephony) and five other categories: message boards and clubs into Productivity PG, and four categories in the core group: educational materials, marijuana, sport hunting/gun
clubs and personal Web sites.
In July 2002, Websense added the Security PG category to cover Malicious Web Sites.
If a Web site falls into multiple categories, how do you determine which
category to put it in?
Websense has developed a priority ranking for all of its categories. If a site's content falls into
multiple categories, the site is placed into the category with the highest priority. Highest priority is given to proxy avoidance sites and those categories generally characterized as the
"obscenities": sex, violence, criminal, hate/race, weapons, and tasteless. The next priority goes to other categories in the liability group, such as militancy/extremism and
gambling, and finally to the "cyberslacking" and bandwidth categories, such as entertainment, sports, shopping, vehicles, job search, MP3 and many others. Cyberslacking category
priorities are determined by how much each category reduces employee productivity. Thus, frequency of visit and duration of visit are considered in determining these
priorities.
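The priority rule above can be sketched in a few lines of Python. This is only an illustration: the category names come from the answer above, but the exact ordering within each tier, and the names PRIORITY_ORDER, RANK and pick_category, are assumptions rather than the actual Websense ranking.

```python
# Hypothetical priority list, highest priority first. The tiers follow the
# description above; the order inside each tier is an assumption.
PRIORITY_ORDER = [
    "proxy avoidance",
    "sex",
    "violence",
    "militancy/extremism",
    "gambling",
    "entertainment",
    "sports",
    "shopping",
]

# Map each category name to its rank (0 = highest priority).
RANK = {name: rank for rank, name in enumerate(PRIORITY_ORDER)}


def pick_category(candidates):
    """Return the candidate category with the highest priority."""
    if not candidates:
        raise ValueError("a site needs at least one candidate category")
    return min(candidates, key=RANK.__getitem__)
```

For example, a site whose content matches both "shopping" and "gambling" would be placed in "gambling" under this assumed ordering.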
How can you review all sites with human editors and expect to keep up with the
growth of the Web?
Whether a site merely exists on the Web is irrelevant; what is relevant is capturing those sites that
people actually do find and visit. Websense's WebCatcher does just that. Every day, hundreds of Websense customers send uncategorized site information to Websense. This information is
merged and sorted by frequency and then stored in the Websense Warehouse for processing. Our highly automated database factory then processes these sites daily. The result is extremely high
coverage of sites employees actually visit. Please see the WebCatcher white paper and the URL Database white paper for more detail.
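As a rough illustration of the "merged and sorted by frequency" step, the following Python sketch combines per-customer lists of uncategorized sites and orders them by how often they were reported. The function name merge_submissions and the input shape (one list of site names per customer) are assumptions; the actual WebCatcher data format is not described here.

```python
from collections import Counter


def merge_submissions(submissions):
    """Merge per-customer lists of uncategorized sites and sort by frequency.

    `submissions` is an iterable of lists, one list of site names per customer.
    Returns (site, count) pairs, most frequently reported first.
    """
    counts = Counter()
    for sites in submissions:
        counts.update(sites)
    # most_common() orders by count, descending; ties keep first-seen order.
    return counts.most_common()
```

Processing the most frequently reported sites first is what lets a limited review capacity cover the sites that employees actually visit.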
Our proprietary Websense software uses a variety of state-of-the-art technologies, including dynamic filters, to
build the database algorithmically. The secret is knowing where to look on the Web for new sites. With experience dating back to 1996 in searching and classifying the Web, Websense has
codified the experience of its analysts into proprietary, state-of-the-art, back-office systems that last year posted more than 25% of the database without human interaction.
What kind of dynamic filter are you using?
KILO II is Websense's second-generation dynamic filter that uses state-of-the-art, adaptive learning
technology. Unlike rule-based systems (such as keyword-based systems), adaptive learning systems are only as good as the training set that trains them. Websense has mined the decisions from
scores of man-years of classifying the Internet to create a highly refined training set for KILO II. That training set is first processed to extract the most salient document features,
and then those document features are processed to produce two mathematical equations. These mathematical equations are used at runtime to analyze any new document. These two equations output
a metric that determines whether the document is in a Websense category or not, and by how much. Currently, KILO II can be tuned to classify with 99.5+ percent precision in 50 Websense
categories.
When a new Web site is classified through KILO II, 50 numbers are produced. Taken collectively, these numbers
are graphically rendered as the Websense Digital Fingerprint. A Websense Fingerprint is a context vector that quite precisely describes the content of any Web page in Websense contextual
space.
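To make the idea of a fingerprint concrete, here is a minimal Python sketch that turns a document's feature values into one score per category. The KILO II equations are not published in this FAQ, so a plain linear score (weights times features, plus a bias) stands in for them purely as an assumption; the names score, fingerprint and matched_categories are hypothetical as well.

```python
def score(weights, bias, features):
    """Stand-in scoring function: a linear combination of document features."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def fingerprint(models, features):
    """Return one score per category; the list of scores is the fingerprint.

    `models` holds one (weights, bias) pair per category, so a system with
    50 categories would produce a list of 50 numbers.
    """
    return [score(weights, bias, features) for weights, bias in models]


def matched_categories(names, scores, threshold=0.0):
    """List the categories whose score exceeds the (assumed) threshold."""
    return [name for name, s in zip(names, scores) if s > threshold]
```

The sign of each score says whether the document falls in the category, and its magnitude says by how much, which mirrors the "in a category or not, and by how much" description above.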
So, let's assume you can keep up with the growth of the Web. How do you keep your
database "fresh"?
In short, with our automated database aging system, Websense Change Tracker. When a Web site is posted into our
database, a Websense Digital Fingerprint (as described above) is produced and stored with other metadata about the site. At regular intervals and using proprietary heuristics, the Websense
Database Factory takes a new Websense Digital Fingerprint and compares it with the last stored Websense Digital Fingerprint. By measuring the "distance" between these two vectors,
contextual change can be measured. If that contextual change exceeds a pre-selected threshold, the site is selected for review.
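The comparison described above can be sketched as follows. The FAQ does not say which distance measure or threshold Change Tracker uses, so Euclidean distance and the function names distance and needs_review are assumptions chosen only for illustration.

```python
import math


def distance(old, new):
    """Euclidean distance between two fingerprints of equal length."""
    if len(old) != len(new):
        raise ValueError("fingerprints must have the same number of entries")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))


def needs_review(old, new, threshold):
    """Flag a site for review when its content has drifted past the threshold."""
    return distance(old, new) > threshold
```

A site whose new fingerprint is identical to the stored one has distance zero and is never flagged; the larger the threshold, the more a site's content must change before it is sent for review.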
What about non-English language coverage in the URL Database?
The URL Database differentiates 52 languages. According to research, these
languages cover 95% of all known Web sites. With WebCatcher, Websense believes it covers at least 95 percent of sites actually visited in these languages. Covered languages include Japanese,
German, French, Chinese, Korean, Spanish, Portuguese, Swedish, Norwegian, Danish, Dutch, Italian, Russian, Turkish, Arabic, Hebrew, Greek, Finnish, Hungarian, Polish, Romanian, Thai, Czech,
Urdu, Vietnamese, Croatian, Slovenian and, of course, English.