This paper provides a comprehensive overview of text mining — the process of extracting meaningful information from unstructured textual data. It defines text mining in relation to data mining, outlines key technical components such as tagging, stemming, and categorization algorithms, and surveys notable approaches including Porter's Algorithm, Hash Tables, DISCOTEX, and APRIORI. The paper examines applications across literature review automation, medical research, business intelligence, customer relationship management, and law enforcement. It also discusses current limitations of available software tools and the challenges posed by linguistic complexity. The paper concludes that text mining is an essential, evolving complement to traditional data mining in an increasingly text-heavy information environment.
The concept of text mining originates from the idea that a relationship exists between the terms used in an unstructured text message or file. That relationship may extend to other similar files, and once established, it can provide information to businesses and researchers across many areas — changing the way they operate or enhancing collective knowledge.
The definition of text mining is broad. In simple terms, "text mining" refers to the process of retrieving information from text. At a deeper level, however, it involves establishing patterns within textual data: not only locating the appropriate text but also developing a theory for making that information useful. Many definitions of text mining assume the primary goal of extracting high-value information from a database or other unstructured text field and using it to arrive at some conclusion. There is also a connection between text analysis and the concept of data mining. The distinction is that data mining draws from structured databases, whereas the principal challenge in text mining is achieving the same result with unstructured data (Trujillo, 2010).
Text is the most widely used medium and data type. It is the primary means by which people exchange information, and the media involved include email, chat, digital libraries, reports, and books available on the internet and other communication channels. Beyond user-generated content, there are also vast volumes of journals, research materials, and valuable reports such as statistical documents and government publications. These databases grow at astronomical rates and are distributed on a global scale (Mitra & Acharya, 2003). Text mining has therefore become an important set of tools across many operations on the information spectrum.
Text mining is a complex, multi-step process, and its success depends on many factors. It begins with an algorithm that extracts facts from a textual source and converts them into a form that can be used to create "hypotheses that are further explored by traditional data mining and data analysis methods" (Maimon & Rokach, 2005). Text parsing runs into problems with hypernyms, that is, general terms that subsume more specific ones. A general term such as "Human" and a more specific associated position such as "corporate executive" may be incidental information in one context yet vital information in another. Information of a general nature is often ignored because the program's span and tokens do not account for it, even though it may be critical when viewed from a different perspective (Srivastava & Sahami, 2009).
To address this, the major operation in text mining is tagging. A text mining program can tag documents using statistical or semantic tagging, and the tags form the basis for arriving at new information. Managers often need to view information from new angles, and that information is frequently found in unstructured customer responses. A task-oriented preprocessing approach meets this need by creating structured documents from unstructured ones. Another method, called "Text Mining and Information Extraction," is used to summarize documents. In either case, text mining operations rest on tagging, which creates entities and relationships (Maimon & Rokach, 2005). Ongoing research continues to develop better algorithms, with one study demonstrating the possibilities of "implementation of information extraction and categorization in the text mining" (Mustafa, Akbar, & Sultan, 2009).
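As a rough illustration of how tagging can turn an unstructured customer response into a structured record, the following minimal sketch uses a small lookup list (a gazetteer) to assign tags to tokens. The gazetteer entries, the tag names, and the functions `tag_tokens` and `to_record` are hypothetical and chosen only for this example. A real statistical or semantic tagger would be considerably more sophisticated.

```python
import re

# Hypothetical gazetteer: lowercase token -> tag. Invented for illustration.
GAZETTEER = {
    "acme": "ORG",
    "london": "LOC",
    "refund": "ISSUE",
    "delay": "ISSUE",
}


def tag_tokens(text, gazetteer=GAZETTEER):
    """Return (token, tag) pairs; tokens not in the gazetteer get the tag 'O'."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [(tok, gazetteer.get(tok.lower(), "O")) for tok in tokens]


def to_record(text, gazetteer=GAZETTEER):
    """Group the tagged tokens by tag, giving a simple structured record."""
    record = {}
    for tok, tag in tag_tokens(text, gazetteer):
        if tag != "O":
            record.setdefault(tag, []).append(tok)
    return record


response = "Acme promised a refund after the delay in London"
record = to_record(response)
```

The resulting record groups the tagged tokens under their tags, which is one simple way an unstructured response could be made searchable alongside structured data.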
The aim of text mining is to provide a method for knowledge management, analysis, and decision-making. The numerous functions involved in text matter parsing combine to create a text mining algorithm. Mining activities include performing comprehensive searches that result in categorization, summarizing extracted datasets, and monitoring and answering questions based on specific needs. The fundamental objective of a text mining operation is to obtain an associative distribution for words and terms and to identify common significance that can be applied to research or business forecasting (Mustafa et al., 2009).
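One simple way to approximate an associative distribution for words and terms is to count how often pairs of terms appear in the same document. The sketch below assumes that reading. The sample documents and the function names `tokenize` and `cooccurrence_counts` are hypothetical and are not taken from Mustafa et al. (2009).

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence_counts(documents):
    """Count, for each unordered pair of distinct terms, the documents containing both."""
    pair_counts = Counter()
    for doc in documents:
        # Sorting gives every pair a canonical (alphabetical) order.
        terms = sorted(set(tokenize(doc)))
        pair_counts.update(combinations(terms, 2))
    return pair_counts


# Hypothetical customer comments used only for illustration.
docs = [
    "late delivery and damaged box",
    "delivery was late again",
    "friendly support and fast refund",
]
counts = cooccurrence_counts(docs)
```

Pairs with high counts, such as "delivery" and "late" in this toy data, are candidates for the kind of common significance the paragraph above describes.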
The most important part of the process is information extraction — identifying words or feature terms within a textual file and processing them through a layered model of the text mining application (Mustafa et al., 2009). Text mining and data mining share the same analytical functions but differ in their use of natural language (NL) and information retrieval (IR) techniques (Maimon & Rokach, 2005).
The processes differ slightly between data mining and text mining because text mining is designed for unordered data, which changes the basis of the search. A typical step in this process is stemming — identifying the root of a given word. Stemming techniques are of two types: inflectional and derivational. Stemming is a useful concept because root forms avoid singular, plural, and other grammatical nuances, reducing data to bare essentials. Keeping a dictionary to its minimum size, with stems and tokens maintaining accuracy, results in faster and shorter algorithms that extract data from random text. Documents are then classified according to their threads or common contents, and this grouping, combined with the use of identical roots or stems and tokens for related words, helps identify features (Weiss, 2005).
Derivational stemming creates a new word from an existing root. Inflectional stemming, however, has the most practical application. A widely used algorithm for this is Porter's algorithm, which strips suffixes based on elements of language and grammar such as plural, singular, present tense, and past tense (Mustafa et al., 2009). Data mining provides methods for extracting data, but because data does not always appear in structured form, text parsing methods have been developed to handle unstructured input. No algorithm can fully anticipate all human communication because of its complexity, so text mining faces challenges not found in data mining, including differences in language, usage, and individual expression. Words may mean different things in different contexts (Mitra & Acharya, 2003).
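The idea of inflectional stemming can be sketched with a handful of suffix rules. The code below is a deliberately simplified suffix stripper and is not Porter's algorithm, which applies several ordered steps with additional conditions on the stem. The rule list, the minimum stem length, and the function name `simple_inflectional_stem` are assumptions made for illustration.

```python
def simple_inflectional_stem(word):
    """Strip a few common English inflectional suffixes (simplified, not Porter)."""
    word = word.lower()
    # Rules are tried in order; the first matching suffix is the only one applied.
    rules = [
        ("sses", "ss"),  # classes -> class
        ("ies", "i"),    # ponies  -> poni
        ("ss", "ss"),    # class   -> class (protects a double s)
        ("ing", ""),     # connecting -> connect
        ("ed", ""),      # connected  -> connect
        ("s", ""),       # connects   -> connect
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # Keep the word unchanged if stripping leaves fewer than 3 letters.
            return stem if len(stem) >= 3 else word
    return word


stems = [simple_inflectional_stem(w) for w in ("connects", "connected", "connecting")]
```

Mapping the three inflected forms to the single stem "connect" is what allows a dictionary to stay small while still matching related words.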
The process of categorization pinpoints the domain category in use. Combined with a token, this results in allocating text to the most appropriate category using table-managing algorithms called Hash Tables (Mustafa et al., 2009). These procedures are unique to text mining because they work with unstructured data using a domain dictionary that must be exhaustive for the mining to be effective. Text data is typically stored in compressed form, and accessing it in the future will require decompression algorithms alongside search functions. Text databases are compressed using Lempel–Ziv type algorithms, which are similarly used in both data mining and text mining for efficient retrieval. The greatest source of text is the web, and text mining is therefore closely tied to web data (Mitra & Acharya, 2003).
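Categorization against a domain dictionary held in a hash table can be sketched as follows. A Python `dict` is itself a hash table, so each token lookup takes constant time on average. The dictionary contents, the category names, and the function `categorize` are hypothetical. The sketch also shows why the dictionary must be exhaustive: a text with no matching tokens cannot be assigned to any category.

```python
import re
from collections import Counter

# Hypothetical domain dictionary: token -> category, invented for illustration.
DOMAIN_DICTIONARY = {
    "refund": "billing",
    "invoice": "billing",
    "charge": "billing",
    "crash": "technical",
    "error": "technical",
    "install": "technical",
    "courier": "shipping",
    "parcel": "shipping",
    "delivery": "shipping",
}


def categorize(text, dictionary=DOMAIN_DICTIONARY):
    """Return the category with the most matching tokens, or None if none match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    votes = Counter(dictionary[tok] for tok in tokens if tok in dictionary)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


category = categorize("The invoice shows a double charge after the install")
```

Here two tokens vote for "billing" and one for "technical", so the text is allocated to "billing". In practice the tokens would usually be stemmed before lookup so that related word forms share a dictionary entry.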
One proposed text mining method, called DISCOTEX (Discovery from Text Extraction), used a standard rule induction module to extract information and create a well-structured, searchable database that makes online text more easily accessible. Another algorithm worth noting is APRIORI, a standard association rule mining algorithm. When combined, DISCOTEX and APRIORI have been claimed to identify interesting patterns from book descriptions (Daelemans, du Plessis, Snyman, & Teck, 2005).
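The level-wise search at the core of association rule mining with APRIORI can be illustrated with a minimal sketch that finds frequent itemsets. It is a generic textbook-style version, not the DISCOTEX pipeline, and the keyword sets standing in for extracted book descriptions are invented for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        count = support(frozenset([item]))
        if count >= min_support:
            current[frozenset([item])] = count
    frequent = dict(current)

    k = 2
    while current:
        previous = list(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            count = support(c)
            if count >= min_support:
                current[c] = count
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical keyword sets, one per book description.
books = [
    {"mystery", "detective", "london"},
    {"mystery", "detective"},
    {"romance", "london"},
    {"mystery", "detective", "paris"},
]
frequent = apriori(books, min_support=3)
```

With a minimum support of three descriptions, only "mystery", "detective", and the pair containing both survive, which is the kind of pattern an association rule such as "mystery implies detective" would be built from.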
Whole strings, not only single words, can be mined, and the analysis of similarities across entire strings also falls within the scope of text mining. The overall goal is information integration, which is achieved when an optimal correspondence between variables is established so that items can be associated by means of a similarity score. The heuristics involved include probabilistic machine learning approaches, such as the Alignment Conditional Random Fields model, which is designed for scoring sequences in undirected graphical models (Bilenko & Mooney, 2005). Demand for this type of software is growing, and text mining is beginning to take on a significant role in the analysis of literature and research reviews.
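A similarity score between two whole strings can be illustrated with classic edit distance. This is a much simpler stand-in for the learned, probabilistic models cited above and is not the Alignment Conditional Random Fields model. The function names `edit_distance` and `similarity` and the normalization by the longer string's length are assumptions made for this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale edit distance to a score in [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


score = similarity("Jon Smith", "John Smith")
```

Two records whose strings score above some chosen threshold could then be treated as candidates for the same entity, which is the integration step the paragraph describes.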
"Applications in literature reviews and medical research"
"Business, CRM, internet, and law enforcement uses"
The business and research communities face pressure to decode the information contained in large volumes of text documents, which hold relationships and pointers that can yield high-value information for decision-making. Text is the most widely used medium and data type, and while data mining from structured databases is used extensively, text mining must be used just as extensively to make sense of the much larger volume of text that has not been organized into databases. Text mining therefore supports knowledge management, analysis, and decision-making. Combined with data mining, it provides a method of analysis covering not only individual words and phrases but also whole strings from unstructured text.