Mining the Process of Extracting New Information Research Proposal

  • Length: 10 pages
  • Sources: 10
  • Subject: Education - Computers
  • Type: Research Proposal
  • Paper: #20728009

Excerpt from Research Proposal:


Text mining is the process of extracting new information from existing information with the help of a computer system. It retrieves data from the available information and establishes connections between the facts mentioned in that data; this is how new information is developed. Because the information is newly formed, it is validated through experimentation. Web search is often confused with text mining, but the two are entirely different processes. In web search, the computer matches keywords against a database and returns the relevant records; the information was written down by somebody and then uploaded to the internet to make it searchable. In text mining, by contrast, altogether new information is generated out of an existing body of knowledge (Berry, 2004).

Text mining has its roots in data mining, the process in which a computer system retrieves previously unknown information from an existing database. Hence text mining is also called Text Data Mining. Other names for it are Intelligent Text Analysis and Knowledge Discovery in Text (KDT). It extracts interesting information from unstructured text. Mining unstructured information is highly valued in this emerging field because unstructured data is readily available and large in volume. Text mining is perceived to have high commercial value because more than 80% of information is stored as text and can be explored to generate a new body of knowledge. In addition to data extraction, text mining draws on computational linguistics, statistics and machine learning (Berry, 2004).

Knowledge Discovery from Databases (KDD) is gaining prominence in emerging applications such as Text Understanding. It works by extracting both implicit and explicit concepts from existing data and then forming semantic relations among those concepts, with the help of Natural Language Processing (NLP) techniques. Combined with NLP, KDD discovers useful information through knowledge management, information extraction, machine learning, statistics and reasoning (Navathe et al., 2000).

As mentioned earlier, data mining and text mining are similar concepts. The main difference lies in the type of data explored and the tools used. Data mining works well only with highly structured data, while text mining also applies to semi-structured and unstructured data such as HTML files, full-text documents and emails. From this perspective, text mining is preferable for companies. One aspect, however, hinders its use: its dependence on NLP. Natural language was not designed for computer systems, nor has it developed for that purpose. Because of this, structured data and data mining practices remain more prevalent in research and development (Navathe et al., 2000).

The obstacles that NLP poses for computer systems do not exist for human beings. Humans can easily comprehend language patterns and can even distinguish between the various patterns applied in the same text, for example contextual meanings, slang and spelling variations. Computer systems are not yet able to identify such linguistic patterns quickly (Weiguo, 2005).

A collection of documents is provided to the text mining tool. After exploring them, the tool selects a particular document and identifies its character set and format. It then analyzes the text in the document, repeatedly applying various techniques to extract information. The presented example cites three techniques of text analysis; however, there may be many others based on combinations of these techniques. The choice depends on the organization's goals, which provide guidelines about the data to be extracted. The retrieved data is inserted into the organization's management information systems so that end users can retrieve it for their use (Weiguo, 2005).
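The steps described above (take a document, identify its character set and format, then analyze its text) can be sketched in a few lines of Python. This is only an illustrative outline and not the tool discussed by Weiguo (2005): the function names `load_document`, `analyze` and `mine` are hypothetical, the character-set check is a simple UTF-8/Latin-1 fallback, and the "analysis" is limited to counting terms.

```python
import re
from collections import Counter


def load_document(raw: bytes) -> dict:
    """Identify a character set and a coarse format for one document."""
    try:
        text, charset = raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as Latin-1.
        text, charset = raw.decode("latin-1"), "latin-1"
    is_html = re.search(r"<\s*(html|body|p)\b", text, re.IGNORECASE) is not None
    if is_html:
        # Crude tag stripping; a real tool would use an HTML parser.
        text = re.sub(r"<[^>]+>", " ", text)
    return {"charset": charset, "format": "html" if is_html else "plain", "text": text}


def analyze(text: str) -> dict:
    """A stand-in for text analysis: count tokens and the most frequent terms."""
    tokens = re.findall(r"\w+", text.lower())
    return {"token_count": len(tokens), "top_terms": Counter(tokens).most_common(3)}


def mine(collection: list) -> list:
    """Run every document in the collection through both phases."""
    results = []
    for raw in collection:
        doc = load_document(raw)
        doc.update(analyze(doc["text"]))
        results.append(doc)
    return results


if __name__ == "__main__":
    sample = [b"<html><body><p>Text mining finds text patterns</p></body></html>"]
    for record in mine(sample):
        print(record["charset"], record["format"], record["top_terms"])
```

In practice the analysis phase would be replaced by whichever extraction techniques match the organization's goals, and the resulting records would be written to its information system rather than printed.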

Statement of the problem

There is a gap in the literature regarding the extraction of text information from very large databases.

Purpose of the study

The study investigates how to extract a specific phrase from a text. It employs survey techniques to interview experts in the field and assesses results using coding techniques.

Rationale of the study

Several research studies related to text extraction have been carried out. However, no research has focused on evaluating text information extraction in large databases using survey interview techniques. This research will therefore fill this gap in the literature and investigate the extent to which text extraction can be made accurate and precise.

Lastly, this study also offers a number of theoretical contributions. Common analytical and operational issues have become increasingly important as institutions move from comparatively simple methods and communication models to intricate multi-channel models. In addition, the collective forces of technology, demography, control and globalization have been pushing organizational information systems all over the world to change their strategy in order to keep pace with an ever-changing world. The extent to which text extraction from large databases can be made accurate and precise has been a neglected topic, and this study will shed light on it.

Research Questions

The main research question is:

How can a specific phrase be extracted from text in large databases?
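To make the research question concrete, the following Python sketch shows one possible way of locating a specific phrase in a large set of text records. It is an assumption made for illustration and not a method proposed by the study: the function name `find_phrase`, the `(record_id, text)` record layout and the `window` parameter are all hypothetical. Records are consumed one at a time through a generator, so the whole database never needs to be held in memory.

```python
import re
from typing import Iterable, Iterator


def find_phrase(records: Iterable[tuple[int, str]], phrase: str,
                window: int = 20) -> Iterator[dict]:
    """Yield every occurrence of `phrase`, with a little surrounding context."""
    # Tolerate differences in case and in the whitespace between words.
    words = [re.escape(word) for word in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    for record_id, text in records:
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            yield {
                "record_id": record_id,
                "start": match.start(),
                "match": match.group(0),
                "context": text[start:end],
            }


if __name__ == "__main__":
    rows = [
        (1, "Text  mining differs from web search."),
        (2, "Mining for minerals."),
        (3, "We study text\nmining and TEXT MINING tools."),
    ]
    for hit in find_phrase(rows, "text mining"):
        print(hit["record_id"], repr(hit["match"]))
```

Exact matching of this kind finds only the literal phrase; it does not handle misspellings, synonyms or inflected forms, which is part of why the accuracy and precision of extraction remain open questions.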

Literature Review

Technological foundations

The gap between computer and human languages, which arose from the numerous differences between them, is now narrowing thanks to improvements in technology. Computers are now able to comprehend, analyze and produce text on their own because they have been taught natural language through programs created by people who work in the field of natural language processing. The capabilities developed in these programs include topic tracking, retrieval of relevant information from a database, organizing data, summarizing it, forming links between topics and answering questions. These developments, their roles and their usefulness to the user will be discussed in detail (Sergio, 2002).

A. Extraction of Information

Information extraction (IE) software identifies the key elements of a text by recognizing how the text is written, a technique known as pattern matching. The links among places, times and people are identified so that the user is given useful information from the database. This is helpful when large quantities of data are being processed. Previously, it was assumed that the information to be mined was already in structured form. However, that is not the case: in many applications the electronic information is available only as free text rather than in a structure. IE addresses this issue, since its task is to turn raw data into structured data. To do this, the IE module is paired with a KDD module. After the useful information has been taken out of everything provided, DISCOTEX uses discovery rules to check whether any information in the database has been missed (Sergio, 2002).
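Pattern matching of the kind described above can be illustrated with a small Python sketch that turns a free-text sentence into a structured record of people, places and dates. The patterns are deliberately simple assumptions chosen for the example (people are recognized only after a title such as "Dr.", places only after the word "in") and they do not reproduce DISCOTEX or any other system mentioned in the text.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical hand-written patterns; real IE systems use far richer rules.
PATTERNS = {
    # A title followed by a capitalized surname, e.g. "Dr. Berry".
    "person": re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\. [A-Z][a-z]+"),
    # One or more capitalized words after "in", e.g. "in New York".
    "place": re.compile(r"\bin ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    # Day, month name and four-digit year, e.g. "12 March 2001".
    "date": re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"),
}


def extract(text: str) -> dict[str, list[str]]:
    """Return a structured record built from unstructured text."""
    return {field: pattern.findall(text) for field, pattern in PATTERNS.items()}


if __name__ == "__main__":
    sentence = "Dr. Berry met Mr. Navathe in New York on 12 March 2001."
    print(extract(sentence))
```

Once text has been reduced to records of this shape, it can be stored in a table and handed to conventional data mining, which is the role the paragraph above assigns to the KDD module.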

B. Topic Tracking

A free topic tracking tool offered by Yahoo is available to users. It informs the user of any news on a topic the user has chosen. More generally, a topic tracking system maintains a user's profile and suggests documents related to those the user has viewed earlier. Despite its benefits, topic tracking has limitations. For instance, a user who has set an alert for 'text mining' may receive many news items on mining for minerals or the characteristics of minerals instead. Topic tracking can notify a company when a competitor enters the market, so the company stays up to date with market changes and can act accordingly. Students can use topic tracking to find research and articles related to their studies. Organizations can find news about themselves. Topic tracking can also help doctors and individuals searching for treatments and the latest developments in the medical field. Better text mining tools could let users select their interests themselves, or the software could infer a user's interests from the articles the user previously selected from the database (Sergio, 2002).
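The limitation mentioned above, where an alert for 'text mining' also returns news about mining for minerals, can be reproduced with a minimal Python sketch. It assumes a very simple tracker, which is not how the Yahoo tool is known to work: the user's profile is a set of alert terms, and an article is suggested when it shares at least `min_overlap` terms with that profile. The names `tokenize`, `track` and `min_overlap` are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case the text and return its distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def track(profile_terms: set[str], articles: list[str],
          min_overlap: int = 1) -> list[str]:
    """Suggest the articles that share enough terms with the user's profile."""
    return [
        article for article in articles
        if len(profile_terms & tokenize(article)) >= min_overlap
    ]


if __name__ == "__main__":
    profile = {"text", "mining"}
    news = [
        "New text mining toolkit released",
        "Mining for minerals expands in Chile",
        "Stock markets close higher",
    ]
    # Matching on any single term also returns the minerals story.
    print(track(profile, news))
    # Requiring both terms removes it, at the cost of missing looser matches.
    print(track(profile, news, min_overlap=2))
```

Raising the overlap threshold trades recall for precision, which is the same tension the paragraph describes between useful alerts and irrelevant ones.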

Keywords are a set of particular words in an article that give users a significant explanation of its substance. Extracting keywords manually from a given database is very time consuming and almost impossible, and it is even more difficult for news articles, which are published in huge quantities every day. Keyword extraction has developed into a source for several text mining applications…
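As a rough illustration of automatic keyword extraction, the Python sketch below ranks the words of an article by how often they occur after common function words have been discarded. Term frequency is only one of many possible approaches and is an assumption made here for the example; the stop-word list and the name `extract_keywords` are likewise hypothetical.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
             "on", "that", "it", "as", "are", "by", "with"}


def extract_keywords(text: str, k: int = 3) -> list[str]:
    """Return up to k of the most frequent non-trivial words in the text."""
    tokens = [
        token for token in re.findall(r"[a-z]+", text.lower())
        if token not in STOPWORDS and len(token) > 2
    ]
    return [term for term, _count in Counter(tokens).most_common(k)]


if __name__ == "__main__":
    article = ("The bank raised rates. Rates at the bank affect loans, "
               "and loans affect the bank.")
    print(extract_keywords(article, k=2))
```

A frequency count of this kind can be computed for thousands of news articles per day, which is the scale at which manual extraction becomes impractical.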

Cite This Research Proposal:

"Mining The Process Of Extracting New Information" (2012, January 09) Retrieved January 20, 2017, from