Research Paper Doctorate 9,410 words

Clustering in Algorithms That Employ

Last reviewed: August 31, 2005 ~48 min read

¶ … clustering in algorithms that employ abstract categories for the pattern matching and pattern recognition procedures used in data mining searches of web documents. With the rapid advances in data mining software technology now taking place website managers and search engine designers have begun to struggle to maintain efficiency in "mining" for patterns of information and user behavior. Part of the problem is the enormous amount of data being generated, making the search of web document databases in real time difficult. Real-time searching is critical for real time problem solving, high-level document searches and prevention of database security breaches.

The analysis of this problem will be followed by a detailed description of weaknesses in data mining methods, with suggestions for a reduction of preprocessing to improve performance of search engine algorithms, and recommendation of an optimum algorithm for this task.

The first investigators who gave a serious thought to the problem of algorithm speed were persons conducting researching in the area of database searches. The field is still in its infancy; most of the tools and techniques used for data mining today come from other related fields such as pattern recognition, statistics and complexity theory. Only recently have the researchers of these various fields been interacting to solve mining and timing issues.

Significance of the Study

Data mining is a knowledge discovery process that uses algorithms and advanced statistical models to analyze data in accordance with a set or sets of rules, as determined by the particular current application. Data mining methods may be classified according to the functions they perform, the class of application they are used in, or the domain of the search (e.g., the Internet is now a common domain for data mining searches). On the whole, data mining models fall into three basic categories: classification, associations and sequencing, and data clustering. "Clustering" assembles documents into related groups, or clusters, without relying on predefined categories. It essentially provides a visual map of the documents, with links between related documents, making it easy to browse through a collection, extracting multiple documents, or one particular topic. The principal scholarly contribution of this study will be a refinement of the testing procedures for clustering algorithms, and a conclusion regarding the optimum algorithm for online real-time searches of web document databases.

The data mining process is inherently iterative: the output of one step may be sent as a feedback to a previous step, as well as to the next step in the process. These steps are categorized into data pre-processing and discovered knowledge post-processing groupings. But as we will see, there are various techniques available to perform these tasks: classification, association and clustering, the focus of the present study. Data classification and association are appropriate to many data mining projects where there are pre-defined rules or concept categories, but many abstract problems require creation of new abstract categories in order to provide the substructure for a more focused search. If the rule set is derived directly from the audit data, any slight deviations from the pattern scheme may go undetected; conversely, minor deviations from "normal behavior" can trigger false alarms. Abstract categories can mitigate this problem.

The following chart (Figure 1) presents the basic taxonomy of the alternative data mining techniques from the user viewpoint, and the purposes of the algorithm and performance analysis.

Figure 1 Data Mining Models: Categories, Clustering and Rule Based Analysis

From "Finding needles in the haystack: Mining meets the web" by Zorn, Emanoil, Marshall, and Panek. Copyright 1999. Online. Reprinted with permission.

Project Design

For this project design, the procedure calls for the algorithms to be selected on the basis of the common feature of abstract categories for clustering data, for the purpose of classifying data according to a generic category scheme.

The principal components of the research design are the following:

the operating system (Windows XP) the laboratory computer (Generic PC of 2.0 megahertz processing speed) the data structures: data clustering algorithms identifying abstract categories; data is parsed by the algorithm to yield search results

The experimental procedure: Testing of the algorithms for elapsed time

Evaluation of results of the speed trials: Time series analysis

Conclusions: Analysis of the optimum algorithm designs and recommendations for research applications

Definitions

Class description (Classification) -- summarization of a collection of data (class characterization). Class description includes summary properties such as count, sum and average as well as data dispersion such as variance and quartiles.

Association (Detection of Relations) -- association relationships or correlations among a set of items, expressed in the rule form showing frequently occurring attribute-value conditions within a given data set. Association analysis is researched with efficient algorithms, including level-wise A priori search, mining multiple-level, multi-dimensional associations, mining associations for numerical, categorical and interval data, meta-pattern directed or constraint-based mining and mining correlations.

Clustering -- identifies embedded clusters in the data, where the cluster is a collection of "similar" data objects, as expressed by distance functions. Data mining research has focused on high quality and scalable clustering methods for large databases and multi-dimensional data warehouses.

Time-series analysis -- analyses large sets of time-series data, searching for similar sequences or sub-sequences and mining sequential patterns, periodicities, trends and deviations.

Overview of the Methodology

The methodology employed in this project will be experimental analysis, with the objective of testing the feasibility of abstract category data clustering algorithms for a real world web application. In order to perform this test, a group of five linear time clustering algorithms will be applied to a sample group of online web documents, simulating the activities of a web search engine looking for similar words, phrases or sequences in a large database set of web articles, publications and records. The five techniques compared will be the K-Means, Single Pass, Fractionation, Buckshot and Suffix Tree clustering algorithms.

The procedure will be to measure the execution time of the test algorithms in clustering data sets consisting of whole documents, excerpts and key words of a fixed quantity and size. The tests will be performed on a standard desktop personal computer running the Microsoft Windows XP operating system at a processing speed of at least 1 gigahertz, so simulate the real world activities of a conventional office worker or librarian doing a document search. Times will be recorded for a series of 10 tests repeated for the same group size and numbers of files, in order to determine the optimum clustering algorithm for real time online web document searches.

Organisation of the Study

The proposed design method is to conduct a speed trial analysis of the various designs currently used in search engine algorithms. The initial evaluation will be followed up with an analysis of the weaknesses in current algorithms and some suggestions for improvement. The study will comprise five chapters: Introduction/Statement of Problem, Review of Literature, Alternative Solutions, Feasibility Tests, Evaluation and Implementation. Future testing will be performed through the implementation of the selected algorithm in a regular application of web-based document searches by a librarian's search engine.

Purpose of the Study

The purpose of this study is to conduct research that will analyze and improve the use of data clustering techniques in creating abstract categories in algorithms, allowing data analysts to conduct more efficient execution of large-scale searches. Increasing the efficiency of the search process requires a detailed knowledge of abstract categories, pattern matching techniques, and their relationship to search engine speed.

Data mining involves the use of search engine algorithms looking for hidden predictive information, patterns and correlations within large databases. The technique of data clustering divides datasets into mutually exclusive groups. The distance between groups is measured with respect to all the available variables, versus variables that are specific predictors, to produce "abstract categories" for analysis. Search engine algorithms and user audit trails are complex, leading to time-consuming quests for specific information. It is anticipated that the proposed study will identify the most efficient and effective data clustering algorithms for this purpose.

Conduct of the Project:

Ten tests, repeated for the same numbers of files and group size, are timed and recorded to find an optimal clustering algorithm for real time database document searches.

A standard desktop personal computer running OS Microsoft Windows XP at a processing speed of one gigahertz or more will simulate the real world activities of an office worker doing a document search. Databases for the time tests are in the public domain.

No additional software, equipment or travel costs are likely to arise. At present, no human resources are anticipated to be necessary to accomplish the research goals. If any interviews or surveys are conducted, appropriate human subjects protocols will be followed.

Statement of Deliverables:

The methodology employed is an empirical analysis of speed tests conducted on groups of similar algorithms. The execution time of the algorithms will be measured for clustering sets that consist of whole documents and excerpts and key words of a fixed number.

The distribution of all the activity types in the search will be determined. Common features are identified within the system and categories for clustering and classification are developed. At this stage, an abstract format or generic classification for the data can be developed. Thus we can see how data are organized and where improvements are possible. Structural relationships within data can be revealed by such detailed analysis.

The final deliverable will be the search time trial results, and the conclusions drawn with respect to the optimum algorithm designs. A definitive direction for the development of future design work is considered a desirable outcome.

Quality assurance will be implemented through systematic review of the experimental procedures and analysis of the test results. Achieving the goals stated at the delivery dates, performance of the tests, and successful completion of the project as determined by the committee members will provide quality assurance to the research outcomes.

CHAPTER 2

Background and a Review of Literature

Data Clustering is a technique employed for the purpose of analyzing statistical data sets. Clustering is the classification of objects with similarities into different groups. This is accomplished by partitioning data into different groups, known as clusters, so that the elements in each cluster share some common trait, usually proximity according to a defined distance measure. Essentially, the goal of clustering is to identify distinct groups within a dataset, and then place the data within those groups, according to their relationships with each other.

One of the main points of document clustering is that it isn't so much a matter of finding the documents off of the web- that's what search engines are for. A clustering algorithm is much more behind the idea of how that information is displayed; in what order they are displayed, what is considered relevant to the query, listing broad categories that can become narrower. Search engines do a fantastic job by themselves when the user has a specific query, and knows exactly what words will get the desired results. But the user gets a ranked list that has questionable relevance because it turns up anything that has the word in it, and the only metric is the number of times that word appears in the document. With a good clustering program, the user will instead be presented with multiple avenues of inquiry organized into broad groups that get more specific as selections are made.

Clustering exploits similarities between the documents to be clustered. The similarity of two documents is computed as a function of distance between the corresponding term vectors for those documents. Of the various measures used to compute this distance, the cosine-measure has proved the most reliable and accurate. [10]

Data clustering algorithms come in two basic types: hierarchical and partitional. In partitions, searches are matched against the designated clusters, and the documents in the highest scoring clusters are returned as a result.

When a hierarchy processes a query, it moves down the tree along the highest scoring branches until it achieves the predetermined stopping condition. The sub-tree where the stopping condition is satisfied is then returned as the result of the search.

Both strategies rely on variations of the near-neighbor search.

In this search, nearness is determined by the similarity measure used to generate the clusters. Cluster search techniques are comparable to direct near-neighbor searches.

Both are evaluated in terms of precision and recall.

The evidence suggests that cluster search techniques are only slightly better than direct near-neighbor searches, and the former can even be less effective than the latter in certain circumstances. Because of the quadratic running times necessary, clustering algorithms are often slow and unsuitable for extremely large volume tasks.

Hierarchical algorithms start with established clusters, then create new clusters based upon the relationships of the data within the set. Hierarchical algorithms can do this one of two ways, from the bottom up or from the top down. These two methods are known as agglomerative and divisive, respectively. Agglomerative algorithms start with the individual elements in the set as clusters, then merge them into successively larger clusters. Divisive algorithms start with the entire dataset in one cluster, and then break it up into successively smaller clusters. Because hierarchical algorithms must analyze all the relationships inherent in the dataset, they tend to be costly in terms of time and processing power.

Partitional algorithms determine the clusters at one time, in the beginning of the clustering process. Once the clusters have been created, each element of the dataset is then analyzed and placed within the cluster that it is the closest to. Partitional algorithms run much faster than hierarchical ones, which allows them to be used in analyzing large datasets, but they have their disadvantages as well. Generally, the initial choice of clusters is arbitrary, and does not necessarily comprise all of the actual groups that exist within a dataset. Therefore, if a particular group is missed in the initial clustering decision, the members of that group will be placed within the clusters that are closest to them, according to the predetermined parameters of the algorithm. In addition, partitional algorithms can yield inconsistent results- the clusters determined this time by the algorithm probably won't be the same as the clusters generated the next time it is used on the same dataset.

Of the five algorithms under scrutiny in this paper, two are hierarchical and three are partitional. The two hierarchical methods are suffix tree and single pass. The suffix tree is "a compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents."[1]

It is a divisive method; it begins with the dataset as a whole and divides it into progressively smaller clusters, each composed of a node with suffixes branching off of it like leaves. Single-pass clustering, on the other hand, is an agglomerative, or bottom-up, method. It begins with a single cluster, and then analyzes each element in turn to determine if it falls within a current cluster, or places it in a new cluster, depending on the similarity threshold set by the analyst.

The three partitional algorithms are k-means, buckshot, and fractionation. K-means derives its clusters based upon longest distance calculations of the elements in the dataset, then assigns each element to the closest centroid. Buckshot partitioning starts with a random sampling of the dataset, then derives the centers by placing the other elements within the randomly chosen clusters. Fractionation is a more careful clustering algorithm which divides the dataset into smaller and smaller groups through successive iterations of the clustering subroutine. [2] Fractionation requires more processing power, and therefore time.

Clustering is a methodology for more effective search and retrieval functions pertaining to datasets, and it has been investigated in great depth in the literature. The principle is simple enough- documents with a high degree of similarity will automatically be sought by the same query. By automatically placing the documents in groups based upon similarity (e.g. clusters), the search is effectively broadened.

The buckshot algorithm is simple in design and intent. It chooses a small random sampling of the documents in the database, and then apply the cluster routine to them. The centers of the clusters generated by the subroutine are returned. This use of a rectangular time clustering algorithms makes the buckshot method fast. The tradeoff is that buckshot is not deterministic, since it initially relies on a random sampling process. Repeated use of this algorithm can return different clusters than previous searches, although it is maintained that repeated trials generate clusters that are similar in quality to the previous set of clusters. [2]

Fractionation algorithms find centers by initially breaking the corpus of documents into a set number of buckets of predetermined size. The cluster subroutine then is applied to each bucket individually, breaking the contents of the bucket into yet smaller groups within the bucket. This process is repeated until a set number of groups is found, and these are the k centers. This method is like building a branching tree from bottom up, with leaves as individual documents and the centers (clusters) as the roots. The best fractionation methods sort the dataset based on a word index key (e.g. based on words in common between two documents).

It is important to note that both buckshot and fractionation rely on a clustering subroutine. Both algorithms are designed to find the initial centers, but rely on a separate algorithm to do the actual clustering of individual documents. This subroutine can itself be agglomerative, or divisive. However, in practice, the subroutine tends to be an agglomerative hierarchical algorithm.

Buckshot applies the cluster subroutine to a random sampling of the dataset to determine the centers, while the fractionation algorithm uses repeated applications of the subroutine over groups of fixed size in order to find the centers. Fractionation is considered more accurate, while buckshot is much faster, making it more suitable for searching in real time on the Web. [2]

After the centers have been determined, each algorithm proceeds to place the documents with their nearest center. After this step is completed, the resulting partitions are refined into better clusters, either by a simple move-to-nearest protocol, or by techniques known as split (dividing a poorly defined cluster into two distinct groups) and join (merging two similar clusters). This step suffers from the same time and accuracy constraints as the initial centering- the more accurate join and split functions are much more time-consuming than move-to-nearest, but are much more reliable in the results they deliver.

One recent study has described a method of parallelizing the buckshot algorithm for more efficient and speedy document clustering.[3]

Parallel processing has been proven as an effective method of maintaining acceptable clustering times.

Parallel processing has also been successfully applied to spherical k-means partitioning, as well as single-pass partitioning. [4, 5] In cases, parallelizing the partitioning algorithm resulted in a near linear speedup in processing time.

Fractionation has also been refined and improved since its creation. Another study used fractionation in another way- instead of establishing the desired number of clusters initially, the number of clusters was determined by refining the meta-observations provided by the initial partitioning into buckets of equal size. [6] The study also introduced fractionation as a model-based hierarchy, as well as applying a new derivative process called refractionation. The model-based fractionation goes beyond reliance on simple characterization of clusters through a mean, but by all important statistics, including mean, covariance, and observations per cluster. Refractionation is repeated applications of fractionation, thereby refining the placement of groups and allowing for more accurate and reliable clustering.

The original k-means algorithm presented by MacQueen (1967) has since been refined.[7] The new version can deal with ellipse shaped data clusters as well as ball shaped ones, and does not suffer from the dead unit problem that plagued the earlier k-means algorithm. In addition, the new k-means algorithm performs proper clustering without pre-determining the exact cluster number.

In experimental runs, it has proven to be efficient and accurate.

Suffix trees were introduced in a paper by Zamir and Etzioni (1998).[8] STC (suffix tree clustering) is an incremental algorithm that runs in linear time. Instead of treating a document as a collection of words, STC treats each document as a string. The string analogy allows suffix tree to analyze relationships between word proximities from document to document. Therefore, documents are grouped together by common phrases and words, and delivered in the form of a suffix tree for ease of browsing. This results in a highly relevant hierarchy of documents.

Since the primary focus of this paper is on clustering as a means of searching the web, then there are criteria to be considered when this is kept in mind. There are several important criteria to take into account when analyzing algorithms for use on the web.[8]

Relevance: the method ought to produce clusters that group documents relevant to the user's query separately from irrelevant ones.

Browsable summaries: The user needs to determine at a glance whether a cluster's contents are of interest. We do not want to replace sifting through ranked lists with sifting through clusters. Therefore the method must produce concise and accurate descriptions of the clusters

Overlap: Since documents have multiple topics, it is important that documents not be confined to a single cluster.

Snippet-tolerance: The method ought to produce high quality clusters even when it only has access to the snippets returned by the search engine, as most users are unwilling to wait for the original documents to be downloaded from the web by the system.

Speed: A patient user might sift through 100 ranked documents in a set of returned results. We want clustering to allow the user to browse through at least an order of magnitude more documents. Therefore the clustering method ought to be able to cluster up to one thousand snippets in a few seconds. For the impatient user, every second counts.

Incrementality: To save time, the method should start to process each snippet as it is received from the web.

Agglomerative Hierarchical Clustering algorithms are generally slow. Within this classification are several methods of approaching the problem. Single link and group average methods are faster than complete link methods, but in either case the slow running times preclude their practicality in such large scale searches as needed on the Web.

One of the hindrances to the use of AHC methods is the halting criteria of the algorithm, which are generally based upon predetermined constants. When searching the web, the variability and diversity of documents makes this sensitivity to halting criteria a hindrance, and often results in poor accuracy of retrieved information.

Algorithms that run in linear time are the ones most likely to meet the necessary speed problems inherent in searching the web. Since this requirement is core to the principle of the paper, all the algorithms chosen for this study run in linear time.

K-means algorithms have an important advantage- unlike AHC methods, K-means can place documents in multiple clusters, thereby allowing for overlap between the topics of the documents. The main drawback to K-means is that the best results are achieved when the desired clusters are spherical when considered in relation to the similarity measure in use. [8] The algorithm doesn't deal well with data in various other shapes, such as ellipticals.

Single Pass suffers from the same disadvantage concerning spherical data and the similarity measure of K-means. IN addition, it is order dependent, and prone to produce large clusters (which would in turn need to be clustered again, slowing down the processing speed of the algorithm). [8]

Buckshot and Fractionation are fast, linear-time operating algorithms first introduced by Cuttings, et. al. (1992) as part of a method that was described as Scatter/Gather. Fractionation is capable of approximating the method of the AHC- however, instead of the search being global, it is locally bounded. It suffers from the same downfall as the AHC- the predetermined halting criteria can corrupt the precision of the search. Also similar to the AHC, it doesn't perform as well in areas with a lot of outliers.[8]

Buckshot is a permutation of the k-means algorithm which draws out an initial sampling of the data and then builds clusters around the sampled documents. This method can miss small clusters, as the random sampling may not be representative of smaller groups within the dataset. Also, like the Fractionation method, Buckshot is not incremental. [8]

Of the five algorithms under scrutiny, the only one that treats a document as a string is the Suffix Tree. The other four treat each document as a collection of disjointed words- the order of the words has no importance. This causes a lot of important information to be lost in the search process. Phrases have been used to supplement word-based indexing in IR projects. Lexical atomas and syntactic phrases have been shown to improve accuracy without a corresponding cost in recall. Simple statistical approaches have also been used successfully to generate phrases. [8]

Suffix Tree Clustering is a linear time clustering algorithm that is based upon identifying documents and grouping them by phrases that the documents in the set have in common. The algorithm recognizes a phrase as an ordered sequence of one or more words. A base cluster is defined as a set of documents that contain the same phrase.[8]

STC is organized into three logical steps. The first step is document cleaning, where a light stemming algorithm is used to reduce prefixes and suffixes, turn plural form to singular form, and non-word tokens such as numbers and symbols are stripped. The original document strings are preserved so that the full document can be displayed on command. The second step is to identify the base clusters. This phase is equated with creating an inverted index of phrases for the document collection. This results in the Suffix Tree (first introduced by Zamir, et. al., 1997), a data structure that can be constructed in a time linear to the number of documents in the dataset, and can also be constructed incrementally, as the documents are being read. The third step of the process is to combine base clusters that have many similarities between their respective sets. Since documents can have multiple phrases in common, it is imperative for the algorithm to recognize nearly identical clusters and merge them to prevent a proliferation of redundant clusters. This step is accomplished by merging clusters that have a high level of overlap.

Five defining characteristics of Suffix trees: [8] suffix tree is a rooted, directed tree.

A each internal node has at least two children.

A each edge is labeled with a non-empty substring of the document collection. The label of a node is defined as the concatenation of the edge-labels on the path from the root to the node.

A no two edges out of the same node can have edge-labels that begin with the same word.

For each suffixs of the document collection, there exists a suffix node whose label equals s.

Each node of a suffix tree consists of the documents connected to that node and the phrase that they have in common. The label of the node is represented by the common phrase and the descendants of the suffix node are comprised of the documents that contain the labeling phrase. All base clusters are nodes, and are represented as such on the suffix tree.[8]

The Suffix Tree algorithm is incremental. Each document is read, cleaned, and then inserted into the appropriate base cluster, or placed in a newly created cluster if the similarity threshold is exceeded. Then each base cluster is recalculated and redistributed according to the same principles of similarity.

The final clusters created by the STC method are scored and sorted by the scores of their base clusters and the amount of overlap. Each cluster is then displayed according to the number of documents it contains and the phrases of the base clusters.

The STC algorithm does not require the user to specify the number of clusters before the search- the clusters are produced incrementally according to the dictates of the algorithm. It does require the specification of the similarity threshold. STC have a markedly less sensitive response to the similarity threshold, as opposed to AHC methods. IN this way, the STC is much more forgiving concerning the initial specifications of the search.

Data Clustering algorithms are basically designed to address one or more problematic factors, such as high dimensionality, efficiency, scalability with data size, sensitivity to noise in the data, and identification of clusters in various shapes. K-means (and other partitioning methods) result in the breakup of genuine clusters. Hierarchical approaches fail to handle convex and elongated shapes of clusters. Density-based algorithms are extremely sensitive to noise in the data. [9]

The need for input parameters is the most pressing problem in clustering methods. Many algorithms, especially hierarchical ones, need the number of clusters to be arbitrarily selected at the beginning of the process. even when this is not the case, the parameters of the algorithm can greatly influence the outcome of the process.

Parameters are considered beneficial in some cases, such as when they provide a means to incorporate domain knowledge into the clustering process. This is especially true when the number of clusters is preselected and fixed in nature. However, in many cases the optimal values for the parameters is far from clear, and must be arrived at by the tiresome process of trial and error. In cases where the number of dimensions is greater than three, this results in an inordinate amount of time on the part of the user, and thus invites the use of an automated process. Domain knowledge can be represented by the definition of similarity functions that can be used to measure the degree of similarity between data points for the purpose of clustering them. [9]

The basic idea of partitioning is intuitive in nature, and the process of partitioning is generally to achieve an optimal performance through iterative means. All partitioning methods have comparable clustering qualities, and the major difficulties of these methods are: the number of clusters to be found needs to be specified before the process, requiring at least some domain knowledge beforehand (which is often unavailable), difficulty in determining clusters with a large variance in size, and that the method is only suitable for conclave clusters.[9] hierarchical clustering algorithm produces a dendogram representing the nested grouping relationship of objects in the dataset.

The main distinguishing characteristic among hierarchical approaches is the measure of similarity between individual clusters and the underlying modeling of the clusters. These algorithms are costly in terms of time and computational power, so many are initiated with a sampling of the dataset that is then clustered. This leaves the end results as a direct function of the initial sampling. While this certainly helps with the problem of computational speed, if there are many small clusters within the dataset, the sampling is likely to miss them.[9]

Dennis et. al. [10] placed current web searching technology into four broad classifications:

Unassisted keyword search: One or more search times are entered and the search engine returns a ranked list of document summaries. (ex: Google or AltaVista)

Assisted keyword search: The search engine produces results based on the user's initial inquiry. (ex: Vivisimo)

Directory-based Search: Here, the information space is divided into a hierarchy of categories, where the user navigates from broad to specific classes. (ex: yahoo)

Query -by-example: The user selects an interesting document snippet, which is then used as the basis for a new query.

Meyer zu Eissen and Stein specify the major criteria for a successful search engine.[10] The ideal search engine process has three stages. first, an initialization phase according to the plain unassisted keyword search paradigm. Second, a categorization phase similar to the directory-based search paradigm. Third, a refinement phase that may combine aspects from assisted keyword search and query-by-example paradigm.

Documents are commonly represented as vector space models. In this model, each document is represented by a point in the space that roughly corresponds with the union of the primary words in the document.

The process includes filtering out common words (such as pronouns and conjunctions), ignoring words that are unique to the document, and words are stemmed in order to reduce them to canonical form. Once this is done, the document can be expressed in the form of vectors, and then those vectors can be used to plot a pint in virtual space that represents that specific document. The words of the document are also weighted; without this provision, the more commonly a word appears in a document, the more important it is considered (which is often not really the case). Conversely, words that appear infrequently in a document are considered discriminatory- they serve as a distinguishing feature for the document in question.[10]

As shown in the work of Zamir, et al. [8] the feasibility of searching web results is heavily influenced by the ability to preserve accuracy when clustering based only upon the snippets returned to a search engine. In their results, it was found that almost the same accuracy was present in the use of snippets only, as opposed to the whole document. This alleviates the computer from downloading the entire document for the express purpose of measuring similarities with other documents. In general, each algorithm was slightly more accurate when using the entire document, but the accuracy of the snippets was much higher than expected. In the case of the k-means algorithm, it actually performed better using the snippets instead of the whole document.

The facts revealed in the literature review can be summarized in a few key points. Data clustering is an essential step in making web- based searches more user friendly and productive. Data Clustering operates by attempting to measure how alike documents are, then sorting them into groups based upon content. The more accurate a Clustering algorithm is, the slower it works, because it must examine more relationships in greater detail. There are many different algorithms to choose from; this study encompasses a review of five of the most successful and popular clustering algorithms.

K-means clustering is a prtitional algorithm that works in linear time. It calculates the centers to be used in clustering through an analyzation of maximum distance relationships between documents in the dataset. Single pass clustering is an incremental algorithm that develops its clusters through calculations based upon the vector differences ascribed to the documents in the dataset. Buckshot is a clustering method that takes a random sampling of documents to use as the centers. Fractionation is an algorithm that divides the entire dataset into fixed size buckets that are then each clustered by the clustering subroutine, then the clusters are merged or split accoridng to similarity. Suffix Tree Clustering is a heirarchical algorithm that runs in linear time, and provides an intuitive platform in the form of the suffix tree.

CHAPTER 3

Alternative Solutions

The focus of this study is on using advanced data clustering for Web-based searches. As such, the primary defining characteristic of Web-based searches is the speed of the algorithm designated to do the job. The algorithms studied herein are all models that have performed well under limited circumstances, such as a large but finite dataset.

The natural tradeoff that exists is between the speed of the algorithm and its accuracy at returning highly relevant results. This is simply because more accurate clustering requires a more intensive analyzation of the data in the set, which translates into a higher number of calculations to be performed in order to accomplish the task. The more accurate algorithms require so much time to construct the clusters that they are rendered impractical for online real time searching. Fractionation, for instance, performs excellently when allowed to cluster data offline, but the high computational cost would make its usefulness in searching the web suspect.

The suffix tree clustering method provides an avenue for exploration. In previous tests, STC outperformed the other methods in terms of speed.[8] It also provides an intuitive visual for presenting the data, thereby allowing the user to sort through the documents for the most relevant ones pertaining to the search.

Another strong point is that it is incremental- this allows it to begin sorting documents into clusters as soon as they are received, then to add new documents as they come. In the tests from the study[8], the results were that the clustering program was finished clustering the last files received just.01 seconds after the search engine had delivered them.

Another interesting approach that could use more examination is the Scatter/gather method introduced by Cutting, et. al.[2] Fractionation and buckshot are not algorithms per se; they are more like methods of applying algorithms. Both rely on a clustering subroutine to do the actual clustering of the documents, and the subroutine is an agglomerative, hierarchical algorithm, which is known to be extremely slow in comparison to other algorithms (see [8]). This approach is interesting, because it combines the initial partitioning of the dataset with a hierarchical approach in an attempt to get the best of both types of clustering methods. What if another subroutine were employed with these techniques? Could the use of a faster subroutine boost the speed of such a method? Buckshot is already a fast algorithm, relying on the agglomerative subroutine. If that subroutine were replaced with a more efficient Suffix tree clustering algorithm, could that increase the speed more? However, considering the evidence presented on the performance by the suffix tree [8], it seems that the scatter/gather method would be superfluous.

The use of snippets seems to detract from the amount of "noise" a document can generate. These excerpts of web documents represent a search engine's attempt at extracting the distinguishing phrases of a web document. This reduction in noise could have a positive effect on low accuracy techniques, such as k-means and buckshot. In the case of k-means, this streamlines the calculations a great deal because of the reduced number of relational analyses that need to be performed. For the buckshot algorithm, the AHC subroutine can also benefit from the reduced noise, resulting in increased accuracy and speed. In both cases, excerpts provide concise summaries of the content of the document and provide clues to its context.

The four primary types of web-based search are the unassisted keyword search, the assisted keyword search, the directory search, and the query-by-example. The algorithms presented here fit some or all of these functions. The unassisted keyword search is straightforward, and presents the results returned from the search engine as is, in their ranked positions. This has, of course, proven to be a difficult and time consuming process for sifting through to find specific information on a given subject (hence the need for data clustering). The assisted keyword search is an advancement in terms of relevancy- it provides links based upon words with similar meanings. This increases the chance of finding the right kind of information, but still leaves the sorting and sifting to the user- there is very little categorization involved, and the search still results in ranked lists. The directory search is much more closely similar to the results of clustering at least. It is a hierarchical approach, allowing the user to begin in broad categories as yielded by the search, then to progressively narrow the search and find smaller groups of more specific documents. The query-by-example is another method that closely simulates the clustering approach. By selecting a snippet and deriving a new search around it, the search becomes iterative, allowing the user to further refine the search based upon the excerpt of a highly relevant site.

When considering these four points, the need for clustering algorithms becomes more clear. The often intimidating number of results that can be found via a search engine can frustrate many users. It also creates a large obstacle to efficient sifting and sorting of data. The major focus of the clustering algorithm is to display those search results in a sensible fashion. By grouping similar documents together, the user is given a much more friendly working environment in which to search.

Based upon the literature, it seems that the hierarchical algorithms have a much better platform from which to satisfy those four essential types of web- search. Certainly, the more favorable searches are the directory based and query-by-example, which are both functions that are facilitated by the suffix tree, as a hierarchical, string-based algorithm. While there will certainly be those times when the keyword search is desired, such as when you know the exact title of the document you're looking for, the more generalized search options are fitting for the average user's needs.

Since the focus of this paper is to measure the speeds of the various algorithms and produce a conclusion based upon the results, there is little attention paid to the functionality or accuracy of the algorithms. In general, it is known that there is a tradeoff between speed and accuracy when using algorithms. In that case, it should be expected that the fastest algorithm is not going to be the most accurate one. Due to the nature of web-based search, however, speed is essential. Therefore, the objective is to find the fastest algorithm, then to find ways to improve its accuracy. Since the introduction of the k-means algorithm, there have been several significant advances and permutations of it that have improved performance and speed. This can also be accomplished with other alogrithms as well.

CHAPTER 4

Feasibility Tests

The procedure of the experiment was to measure the execution time of the test algorithms in clustering data sets consisting of whole documents, excerpts and key words of a fixed quantity and size. The tests were performed on a standard desktop personal computer running the Microsoft Windows XP operating system at a processing speed of at least 1 gigahertz. Times were recorded for a series of 10 tests repeated for the same group size and numbers of files, in order to determine the optimum clustering algorithm for real time online web document searches. The dataset consisted of 1000 documents pertaining to classic literature. Each document contained on average 710 words, while the excerpts averaged 50 words. The documents were results of searches for the keywords "classic" and "literature."

There are three series of tests. Each series is a compilation of the time taken by each of the chosen clustering algorithms to sort the documents of the dataset. The first series consists of each algorithm analyzing the entire content of the document. The second series compares the speed of searches based upon excerpts, or snippets. The third and final series of tests measures the speed of searches based upon keywords only.

Throughout the experiment, the suffix tree consistently outperformed the other algorithms with respect to speed. This proved true for all three levels of the experiment. The second best performer was the buckshot algorithm. The worst performer overall was the fractionation algorithm. Although it still runs at acceptable speeds, it is by far the slowest of the algorithms that see common usage. K-means and Single Pass reside in the middle ground between the two extremes. The following charts show a complete breakdown of each algorithms performance in each series of tests.

Table 1: Full document comparison

Trial No.

A k-means single pass buckshot fractionation suffix tree

97.9s

99.4s

Table 1 illustrates the performance of each algorithm when using the entire content of the document for clustering. The suffix tree handled the document load efficiently and beat the nearest competitor by an average of 11.2 seconds. The slowest algorithm, fractionation is more than twice as slow as the suffix tree in finishing the clustering task. In each case, the mean value of the range was within 1 second of the actual calculated average. In this series of tests, the algorithms each performed in accordance with the expectations as presented in the literature.

Table 1a: Comparative results

Algorithm

Avg. Value

Range (low-high)

Mean Value k-means single pass buckshot fractionation suffix tree

Table 1a gives a breakdown of the comparative results of the first series of tests. The listing includes the average speed of each algorithm, the range of values from lowest to highest, and the mean value of that range. As expected the means stay close to the average values. The suffix tree came it at the lowest average, beating the competition by as much as 117.9 seconds. While k-means and single pass both perform well, they are outperformed by buckshot, which comes in at a close second.

You’re 81% through this paper. Sign up to read the full paper.

Sign Up Now — Instant Access Already a member? Log in
130,000+ paper examples AI writing assistant Citation generator Cancel anytime
Cite This Paper
PaperDue. (2005). Clustering in Algorithms That Employ. PaperDue. https://www.paperdue.com/essay/clustering-in-algorithms-that-employ-67454

Always verify citation format against your institution’s current style guide requirements.