Text mining is the process of extracting new information from existing information through the use of a computer system. This paper focuses on text mining methods and algorithms.
Text mining is the process by which a computer system extracts new information from existing information. It retrieves data from available information and establishes connections between the facts mentioned in that data; this is how new information is developed. Because the information is newly formed, it is validated through experimentation. Web search is often confused with text mining, though the two are entirely different processes. In web search, the computer matches keywords against a database and returns the relevant records; the information has already been written by somebody and uploaded to the internet to make it searchable. In text mining, by contrast, altogether new information is generated out of an existing body of knowledge (Berry, 2004).
Text mining has its roots in data mining, the process in which a computer system retrieves previously unknown information from an existing database. Hence text mining is also called Text Data Mining. Other names for it are Intelligent Text Analysis and Knowledge Discovery in Text (KDT). It extracts interesting information from unstructured text. Mining unstructured information has high value because unstructured data is readily available and large in volume. Text mining is perceived to have high commercial value, since more than 80% of information is stored as text and can be explored to generate new knowledge. In addition to data extraction, text mining draws on computational linguistics, statistics, and machine learning (Berry, 2004).
Knowledge Discovery in Databases (KDD) is gaining prominence in emerging applications such as Text Understanding. It works by extracting both implicit and explicit concepts from existing data and then forming semantic relations among those concepts, with the help of Natural Language Processing (NLP) techniques. Combined with NLP, KDD discovers useful information through knowledge management, information extraction, machine learning, statistics, and reasoning (Navathe et al., 2000).
As mentioned earlier, data mining and text mining are closely related concepts. The difference lies in the type of data explored and the tools used. Data mining works well only with highly structured data, while text mining also applies to semi-structured and unstructured data, such as HTML files, full-text documents, and emails. This makes text mining attractive to companies. One aspect, however, hinders its use: its dependence on NLP. Natural language was not designed for computer systems, nor has it developed for that purpose. Because of this, structured data and data mining practices remain more prevalent in research and development (Navathe et al., 2000).
The obstacles that natural language poses for computer systems do not exist for human beings. Humans easily comprehend linguistic patterns and can even distinguish between different patterns used in the same text; examples include contextual meanings, slang, and spelling variations. Computer systems are not yet able to identify such linguistic patterns quickly (Weiguo, 2005).
A collection of documents is provided to the text mining tool. After exploring the collection, the tool selects a particular document and identifies its character set and format. It then analyzes the text of the document, repeatedly applying various techniques to extract information. The example presented cites three text analysis techniques; however, there may be many others based on combinations of these. The choice depends on the organization's goals, which guide what data is to be extracted. The retrieved data is stored in the organization's management information systems so that end users can retrieve it for their own use (Weiguo, 2005).
Statement of the problem
There is a gap in the literature regarding the extraction of text information from very large databases.
Purpose of the study
The study investigates how to extract a specific phrase from a text. It employs survey techniques to interview experts in the field and assesses results using coding techniques.
Rationale of the study
Several research studies on text extraction have been carried out. However, no research has focused on evaluating text information extraction in large databases using survey interview techniques. This research will therefore fill this gap in the literature and investigate the extent to which text extraction can be made accurate and precise.
Lastly, this study offers a number of theoretical contributions. Common analytical and operational issues have become increasingly important as institutions move from comparatively simple methods and communication models to intricate multi-channel models. The collective forces of technology, demography, control, and globalization have also been pushing organizational information systems all over the world to change their strategies in order to keep pace. The extent to which text extraction from large databases can be made accurate and precise has been a neglected topic, and this study will shed light on it.
Research Questions
The main research question is as follows:
How can a specific phrase be extracted from text in large databases?
Literature Review
Technological foundations
The gap between computer and human languages, which arose from the numerous differences between them, is now narrowing as technology improves. Computers can now analyze, critique, and produce text because programs created by researchers in the field of natural language processing have taught them natural language. The capabilities these programs provide include tracking a topic, retrieving relevant information from a database, organizing data, summarizing it, forming links between topics, and answering questions. These developments, their roles, and their usefulness to the user are discussed in detail below (Sergio, 2002).
A. Extraction of Information
Information extraction (IE) identifies the key elements of a text by recognizing how the text is written, a technique known as pattern matching. The links between places, times, and people are identified so that the user receives useful information from the database. This is helpful when large quantities of data are being processed. Traditional data mining assumed that the information to be mined was already in relational form. That is often not the case: in many applications, electronic information is available only as free text rather than in a structured form. IE addresses this issue, since its task is to build structured data from raw text. To do so, the IE module works together with a KDD module. After the useful information has been extracted, DISCOTEX uses discovered rules to check whether any information has been missed (Sergio, 2002).
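As a rough illustration of pattern matching, the sketch below turns a free-text sentence into a structured record of people, dates, and times. The patterns and the `extract_records` helper are hypothetical and far simpler than the rules a real IE system such as DISCOTEX would use.

```python
import re

# Hypothetical, deliberately simple patterns; real IE systems rely on much
# richer hand-written or learned rules.
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
PATTERNS = {
    "person": re.compile(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+\b"),
    "date": re.compile(r"\b\d{1,2} " + MONTHS + r" \d{4}\b"),
    "time": re.compile(r"\b\d{1,2}:\d{2}\b"),
}

def extract_records(text):
    """Return a structured record: one list of matches per entity type."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

sample = "Dr. Smith met Ms. Jones on 3 March 2001 at 14:30."
record = extract_records(sample)
print(record)
```

The resulting dictionary is the kind of structured row that could then be stored in a database and passed on to a mining step.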
B. Topic Tracking
A free topic tracking tool is offered by Yahoo at www.alert.yahoo.com. It informs the user of any news available on a topic the user chooses. More generally, a topic tracking system maintains a user's profile and suggests documents related to those the user has viewed earlier. Despite its benefits, topic tracking has limitations. For instance, a user who has set an alert for 'text mining' may receive many news items about mining for minerals rather than about text mining. Topic tracking can notify a company when a competitor enters the market, so that the company stays up to date with market changes and can act accordingly. Students can use it for research on their subjects and to find articles related to their studies. Organizations can follow news about themselves. Topic tracking can also help doctors and individuals searching for treatments and the latest developments in the medical field. In more advanced text mining tools, users can select their interests explicitly, or the software can infer them from the articles they have previously selected (Sergio, 2002).
Keywords are a set of particular words in an article that give users a meaningful description of its content. Extracting keywords manually from a given database is very time-consuming and almost impossible, especially for news articles, which are published in huge quantities every day. As the World Wide Web has created a platform for online documents, keyword extraction has become a basis for several text mining applications, such as summarization, text categorization, topic detection, and search engines. Recognizing keywords thus makes it possible to obtain a summary of a vast online database, and an automated process is needed to extract keywords from news articles. After news articles in HTML are collected from an internet site, a keyword extraction module extracts candidate keywords, and a cross-domain comparison module then performs the final keyword extraction. In more detail, tables named 'Document', 'Dictionary', 'Term occur fact', and 'TFIDF weight' are created in a relational database. First, the downloaded news pages are stored in the 'Document' table, and nouns are then extracted from the documents in that table (Sergio, 2002).
Next, the 'Term occur fact' table is updated with the words appearing in each document. This table is then used to calculate the TFIDF weights of the words, and the results are stored in the 'TFIDF weight' table. Finally, a 'Candidate keyword list' that ranks the words is built from the 'TFIDF weight' table (Sergio, 2002).
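The table-based procedure described above can be sketched in a few lines, with in-memory dictionaries standing in for the relational tables. This is only an illustration under stated assumptions: plain lowercase word tokens replace the noun extraction step, the weight is the common term frequency times log inverse document frequency, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase word tokens stand in for the noun extraction step.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_weights(documents):
    """Return one {term: weight} dict per document."""
    # Counterpart of the 'Term occur fact' table: term counts per document.
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    # Number of documents in which each term occurs.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n_docs = len(documents)
    # Counterpart of the 'TFIDF weight' table.
    weights = []
    for counts in term_counts:
        total = sum(counts.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

def candidate_keywords(doc_weights, top_n=3):
    """Rank terms by descending weight (ties broken alphabetically)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]

docs = [
    "election results election campaign news",
    "football match results news",
    "stock market news",
]
weights = tfidf_weights(docs)
keywords = candidate_keywords(weights[0], top_n=2)
print(keywords)
```

A word such as "news", which occurs in every document, receives a weight of zero and therefore never ranks as a candidate keyword.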
Topic tracking follows a given news event across a database of news stories. Lexical chaining is the grouping of lexically related terms into so-called lexical chains. A multi-vector topic tracking system extracts significant locations, names, and ordinary terms into separate sub-vectors of the document representation, and the sub-vectors are compared to compute the similarity between two or more documents. A number of characteristics affect a topic tracking system. First, a particular feature type, such as words or phrases, must be chosen that is appropriate for describing a given event (Sergio, 2002).
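One way to picture the sub-vector comparison is the sketch below. It assumes that each story has already been split into 'locations', 'names', and 'terms' sub-vectors, that sub-vectors are compared with cosine similarity, and that the per-field scores are combined with hypothetical weights; the source does not specify any of these choices.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def document_similarity(doc_a, doc_b, field_weights):
    """Weighted sum of the per-sub-vector cosine similarities."""
    return sum(weight * cosine(doc_a[field], doc_b[field])
               for field, weight in field_weights.items())

# Hypothetical stories, already split into sub-vectors.
story_a = {
    "locations": Counter(["paris"]),
    "names": Counter(["dupont"]),
    "terms": Counter(["strike", "rail"]),
}
story_b = {
    "locations": Counter(["paris"]),
    "names": Counter(["martin"]),
    "terms": Counter(["strike", "union"]),
}
# Hypothetical weights; a real system would tune them on labeled stories.
FIELD_WEIGHTS = {"locations": 0.3, "names": 0.3, "terms": 0.4}

similarity = document_similarity(story_a, story_b, FIELD_WEIGHTS)
print(round(similarity, 3))
```

A new story would be assigned to a tracked event when its similarity to the event's stories exceeds a chosen threshold.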
C. Summarization
The value and significance of a lengthy document can be conveyed by a summary of its text. Text summarization software summarizes a lengthy document in about the time a human needs to read its first paragraph. Text summarization reduces the length of a document while retaining its sense and meaning. The difficulty for computers is that they do not process the semantics of words and deal only with identifying people, places, and times (Haralampos, 2001).
Humans usually create a summary by first reading the full text to get an idea of it and then writing a summary that focuses on the core points. Because computers lack human language capabilities, other methods must be devised. Text summarization tools use sentence extraction to identify, in a statistical manner, the sentences that describe the central idea of the text. Position information is also an important cue in text summarization (Haralampos, 2001).
Summarization tools may pick up sentences that follow key phrases such as 'in conclusion', because the key points are usually found there. Headings and subtopic markers are also used by summarization tools to select the main points. An example of a text summarization tool is Microsoft Word's AutoSummarize function. Most text summarization tools ask the user to specify the percentage of the text to be extracted as a summary. Topic tracking and categorization tools use summarization to condense the documents collected on a particular topic. When organizations, medical personnel, or researchers are given enormous numbers of documents in their areas of interest, summarization reduces the time needed to sort through them. Ultimately, individuals can access the relevant information according to their interests (Haralampos, 2001).
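A minimal sketch of statistical sentence extraction follows. It is not the method of any particular tool: it assumes that a sentence is scored by the average document-wide frequency of its content words, that the first and last sentences receive a fixed position bonus, and that the user supplies the fraction of sentences to keep. The stop-word list and the `summarize` function are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it"}

def content_words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, fraction=0.5):
    """Keep the highest-scoring fraction of sentences, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))
    last = len(sentences) - 1

    def score(index):
        tokens = content_words(sentences[index])
        if not tokens:
            return 0.0
        frequency_score = sum(freq[t] for t in tokens) / len(tokens)
        # Position information: favor the opening and closing sentences.
        position_bonus = 1.0 if index in (0, last) else 0.0
        return frequency_score + position_bonus

    keep = max(1, round(len(sentences) * fraction))
    best = sorted(range(len(sentences)), key=lambda i: (-score(i), i))[:keep]
    return " ".join(sentences[i] for i in sorted(best))

text = ("Text mining extracts knowledge from text. The weather was pleasant. "
        "Lunch was served at noon. Text mining tools turn text into knowledge.")
summary = summarize(text, fraction=0.5)
print(summary)
```

The `fraction` parameter plays the role of the percentage that summarization tools ask the user to specify.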
An automatic summarization process has three steps: (1) a preprocessing step, which produces a well-structured representation of the original text; (2) a processing step, which converts the structured text into a summary structure; and (3) a generation step, which produces the final summary from the summary structure. Summarization methods are categorized by the level of the linguistic space they work in and fall into two broad groups: (a) shallow approaches, which deal with the syntax and surface representation of the text and extract its salient parts in a simple way; and (b) deeper approaches, which deal with the semantics of the text and rely on linguistic processing (Liritano and Ruffolo, 2001).
The preprocessing step of the first approach aims to reduce the dimensionality of the document text and consists of: (i) stop-word elimination: common words that carry no significant meaning, such as "the" and "a", are removed from the text; (ii) case folding: all characters are converted to the same case, either upper or lower; (iii) stemming: syntactically similar words are grouped together in order to obtain the stem (radix) of each word, which emphasizes its semantics. The vector model is the most commonly used text model. Once preprocessing is done, each sentence is represented as an N-dimensional vector.
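The three preprocessing operations and the resulting sentence vectors can be sketched as follows. The stop-word list and the suffix-stripping `stem` function are crude, hypothetical stand-ins (a real system would typically use a full stop-word list and an established stemmer), and N is taken to be the size of the vocabulary that remains after preprocessing.

```python
import re

# Hypothetical, very short stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "for"}
# Crude suffix stripping, used only as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(sentence):
    tokens = re.findall(r"[a-z]+", sentence.lower())      # case folding
    kept = [t for t in tokens if t not in STOP_WORDS]     # stop-word elimination
    return [stem(t) for t in kept]                        # stemming

def sentence_vectors(sentences):
    """Map each sentence to an N-dimensional term-count vector."""
    processed = [preprocess(s) for s in sentences]
    vocabulary = sorted({t for tokens in processed for t in tokens})
    vectors = [[tokens.count(term) for term in vocabulary] for tokens in processed]
    return vocabulary, vectors

sentences = [
    "The tools extracted the keywords.",
    "Extracting keywords is a task for a tool.",
]
vocabulary, vectors = sentence_vectors(sentences)
print(vocabulary)
print(vectors)
```

After preprocessing, "extracted" and "Extracting" share one dimension, which is how stemming and case folding reduce the dimensionality of the vectors.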