Extracting Information (Sentiment) from Blogs
Introduction and Theoretical Framework
So-called "Web logs" or "blogs" have become the medium of choice for many pundits who might not otherwise have a ready forum for their views (Flynn, 2006; Lang, 2005; Piper & Ramos, 2005). According to Flierl and Fowler (2007), "A blog (the term is a contraction of 'Web log') is a form of online communication most often described as an online journal. A blog is usually created and maintained by a single individual and does not allow visitors to change the original posted material. Instead, visitors add their comments to the original posting and to one another's comments" (p. 241). Anyone who has used a social networking site such as Facebook is familiar with the easy-to-use features that blogs provide, as well as the inanity of much of the commentary that is being made. This aspect of blogging makes close analysis difficult, particularly given the staggering amount of information that is involved (Chung, Kim, Trammell & Porter, 2007; Nemeth & Gropper, 2008). Some authorities suggest that the term "blog" can be expanded to include the comments made on these social networking sites, but must be differentiated from other online communication techniques such as email (Orr, 2004). In this regard, Finin, Joshi, Kolari, Java, Kale and Karandikar (2008) report that, "Web-based social media systems such as blogs, wikis, media-sharing sites, and message forums have become an important new way to transmit information, engage in discussions, and form communities on the Internet" (p. 77).
Today, tens of millions of "bloggers" are taking advantage of the last true bastion of free speech on the Internet by routinely posting their innermost feelings for the world to see and, in some cases, by posting responses that express their own personal reactions, thoughts, insights and sentiments (Sun, 2009; Schultz, 2005; Pikas, 2005; Trammell & Keshelashvili, 2005; Rector, 2008). In fact, at least 8% of American Internet users post blogs on a routine basis and 34% of Internet users in the United States routinely read blogs as an important source for breaking news and hot topics (Sweetser, Porter, Chung & Kim, 2008). In this environment, determining the preponderance of sentiments that are being expressed in these dynamic fora represents a timely and valuable enterprise for social researchers who are interested in gauging these aspects of society. According to Merriam-Webster's New World Dictionary (1991), a sentiment is "an attitude, thought or judgment permeated or prompted by feeling; a complex of emotion and idea" (p. 2069). This definition means that sentiments can capture the entire range of human emotions, but capturing this type of information in ways that can provide meaningful analyses remains problematic, an issue that directly relates to the problem to be considered herein and which is discussed further below.
Statement of the Problem
Given the enormity of the amount of blogging data that is already available online as well as the prodigious amounts that are being added every day, it would be impossible to read all of the blogs in a thousand lifetimes. Indeed, saying there is plenty of blog content is like saying the Pacific Ocean is "moist." Today, there are billions of words already published online and tens of millions more being added by millions of bloggers around the world every day (Catalino, 2006). These blogs pertain to an incredibly diverse range of personal interests, vocational and avocational pursuits, as well as the "gripe du jour" type of forum. Bloggers may spill their hearts out for the whole world to see, and be devastated or rejuvenated by the responses they receive -- or both. When tens of millions of people are engaging in this type of behavior on a regular basis, there is clearly some valuable information that can be gained if the right analytical techniques are used in a thoughtful fashion.
A consistent theme that emerges from these blogs is the fact that people from all walks of life are actively and regularly taking advantage of an open forum to communicate with others in ways that have never been possible in the past. Indeed, this mode of communication may accelerate the untimely death of print media entirely, with newspaper after newspaper folding across the country in the wake of the digital revolution. The current proliferation of Web sites devoted to blogs is firm testament to the growing popularity of this communication mode, but there remains a paucity of timely and relevant studies concerning how the content of these online posts can be mined for valuable social research purposes, a constraint that directly relates to the purpose of the proposed study which is set forth below.
Purpose of the Study
The purpose of the proposed study is four-fold as follows:
A. To deliver a comprehensive and critical review of the relevant peer-reviewed and scholarly literature concerning blogs and their impact on contemporary society;
B. To identify opportunities to exploit the billions of words that are routinely posted in blogs through an automated information extraction process by identifying how this information can be of value to marketers, policymakers, sociologists and so forth; and,
C. To determine what algorithmic approach is best suited to the content analysis of blogs to ascertain sentimental expressions and the context in which they are used; and,
Review of the Literature
In reality, blogs have been around longer than many observers might think (Quible, 2005). For instance, according to Leight (2008), "One of the newer trends on the Internet is the use of web logs, also referred to as blogs. The term may be new, but blogging has been around since 1999. Anyone with Internet access can view a blog, respond to it, and even subscribe to it. Blogs are so easy to create that anyone with basic computer skills can create their own" (p. 52). A growing number of blog sites exist, including those shown in Table 1 below.
Table 1
Representative List of Blog Sites (Top Three on Google Blog Search)
Site Name: The Big Picture
URL: http://www.ritholtz.com/blog/2010/12/lunar-eclipse-winter-solstice-awesome/
Sample Sentiment-Related Blog Entry as of 12/21/10: [Re lunar eclipse]: "How often do you get to witness an event that has not been seen since the year 1378, over half a millennium, 632 years ago?"

Site Name: Big Government
URL: http://biggovernment.com/publius/2010/12/20/fcc-poised-to-regulate-the-internet/
Sample Sentiment-Related Blog Entry as of 12/21/10: "The open Internet is a crucial American marketplace, and I believe that it is appropriate for the FCC to safeguard it by adopting an order that will establish clear rules to protect consumers' access," Commissioner Mignon Clyburn, a Democrat, said in a statement. Yet many supporters of network neutrality are disappointed. Clyburn and the other Democrat, Michael Copps, both said the rules are not as strong as they would like, even after Genachowski made some changes to address their concerns. That sentiment was echoed by some public interest groups.

Site Name: Daily Kos
URL: http://www.dailykos.com/
Sample Sentiment-Related Blog Entry as of 12/21/10: There's nothing like antagonizing a nuclear power in a fit of pique over gays being allowed to serve in the military. The Republican obstruction of and posturing over START Treaty ratification is not amusing Moscow:
Each of these representative blog entries has dozens and dozens of comments, and in some cases hundreds and even thousands of responses from other interested or, in many cases, outraged fellow bloggers who weigh in on every subject under the sun. Not surprisingly, blogs have also attracted the attention of people who are more interested in making money from the medium than they are in communicating their own innermost thoughts and feelings, and an increasing number of businesses are using blogs as a way of keeping in touch with their valued customers and attracting new ones (Marken, 2005). Likewise, politicians have actively embraced blogging as a way of keeping in touch with their constituents (Gordon-Mundane, 2006; Bichard, 2006). In this regard, Bielski reports that, "Citizen journalists and uppity political bloggers may have seeped into the public lexicon, yet chances are, you haven't thought all that much about blogging in a business context. But all user-generated websites of informal, semi-regular 'e' dispatches aren't strictly personal, radical, or unserious. A handful are kept by CEOs and other key execs at Fortune 500 firms" (p. 8). The robust nature of even business-related blogs makes their analysis an important element in the overall content analysis of blogs (Stepp, 2006). As Bielski points out, "These forums blend in commentary, opinion, and forecasting on corporate and industry undertakings with oblique marketing references designed to generate product or service buzz with a ring of authenticity" (p. 8).
In response to the growing popularity of business-related blogs, researchers at the University of Massachusetts Dartmouth Center for Marketing Research conducted a survey of bloggers that yielded the following key points in its executive summary:
1. Blogs take time and commitment (the worst blogs are updated infrequently).
2. Blogs must be part of a plan. Have a designated focus and key objectives outlined for the site in advance of that first post.
3. Blog posts should be, in effect, a form of conversation. That is, they should be an open, somewhat informal invitation for debate or exchange of opinions.
4. Transparency, authenticity, and focus are good. Bland is bad. Many people are looking for someone who is in authority to share their ideas, experiences, or suggestions (Bielski, 2007, p. 9).
Moreover, just as content analysis of other written and symbolic forms has provided new insights that might have otherwise gone unnoticed, the analysis of blog content may reveal some unexpected findings concerning hot topics and significant social trends that are shaping the users of this information. For example, a data infrastructure engineering team intern working at Facebook recently generated an eerily accurate global map based on Facebook friendship links. According to the developer, "I was interested in seeing how geography and political borders affected where people lived relative to their friends. I wanted a visualization that would show which cities had a lot of friendships between them" (Butler, 2010, para. 3). While Butler had some vague ideas about the types of clusters that would populate the map, he was surprised by how accurately the results mirrored the population densities of the world, with some noticeable absences (Cuba, North Korea, large parts of Africa and South America, the western half of the United States, etc.).
Based on his content analysis of 10 million Facebook friendship links, Butler plotted each individual's location by latitude and longitude and drew a connecting line between each friendship pair, with higher numbers of paired links shown as brighter lines in the map in Figure 1 below.
Figure 1. Butler's Facebook friendship links map: dark areas on the map represent where Facebook use is less prevalent
The map's striking similarity to geopolitical maps was also noted by Butler. According to Butler, "Not only were continents visible, certain international borders were apparent as well. What really struck me, though, was knowing that the lines didn't represent coasts or rivers or political borders, but real human relationships. Each line might represent a friendship made while travelling, a family member abroad, or an old college friend pulled away by the various forces of life" (2010, para. 4).
This link-based analytical approach is also used by Finin and his associates for sentiment-identification purposes. According to these authorities, "Our approach uses the link structure of a blog graph to associate sentiments with the links connecting blogs. Such links are manifested as a URL that blogger A uses in his blog post to refer to blogger B's post. We call this sentiment link polarity, and the sign and magnitude of this value is based on the sentiment of text surrounding the link" (2008, p. 78). Clearly, this type of online data can be used to reveal some valuable new information in ways that have never been possible in the past.
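The notion of link polarity can be illustrated with a minimal sketch. The word lists, the window size, the `<link>` placeholder and the function name `link_polarity` are all assumptions made for illustration only; they are not the lexicon or scoring scheme used by Finin et al. (2008). The sketch simply scores the words on either side of a link and returns a signed value, where the sign suggests the direction of the sentiment and the magnitude suggests its strength.

```python
import re

# Illustrative word lists only (an assumption, not the authors' lexicon).
POSITIVE = {"great", "excellent", "insightful", "agree", "love"}
NEGATIVE = {"wrong", "terrible", "misleading", "disagree", "hate"}

LINK_TOKEN = "<link>"  # hypothetical placeholder marking where the URL appeared


def link_polarity(post_text, window=5):
    """Return a signed score in [-1, 1] for the words surrounding LINK_TOKEN."""
    tokens = re.findall(r"<link>|[a-z']+", post_text.lower())
    if LINK_TOKEN not in tokens:
        return 0.0
    i = tokens.index(LINK_TOKEN)
    # Take up to `window` words on each side of the link.
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    pos = sum(1 for t in context if t in POSITIVE)
    neg = sum(1 for t in context if t in NEGATIVE)
    if pos + neg == 0:
        return 0.0  # no opinion words near the link
    return (pos - neg) / (pos + neg)


supportive = link_polarity("I think this excellent and insightful post <link> deserves a wide audience")
critical = link_polarity("The claims in <link> are misleading and simply wrong")
```

A production system would presumably replace the hand-written word lists with a trained classifier or a much larger lexicon; the sketch only shows how a sign and a magnitude can be derived from the text around a link.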
Such graphic representations are just some of the insights into written communication that content analysis can provide. Blogs (and this term can be expanded to include the idle chit-chat, back-and-forth, thoughts, ramblings, viewpoints and other posts shared on Facebook and other social networking fora every day) represent an incredibly accessible way to reach other people, and people who know those people and so forth, in an ever-widening network of social interaction. This accessibility may be fundamentally more significant in the long-term than other important innovations in communication such as the telephone. In this regard, a growing number of observers cite the increasing importance of the Internet in the business world and suggest that blogging has become the platform of choice for consumers and their favorite companies (Pikas, 2005). For instance, Bielski emphasizes that not all bloggers are created equally, at least with respect to their online posts. "Certainly, there is hype surrounding Web 2.0 with its dual message of the internet as application platform and internet as the ultimate participatory forum. And, blogging is viewed as a staple of this new internet" (2007, p. 8).
Identifying recurring themes and emerging trends in this type of dynamic environment is a challenging enterprise to be sure. As Bielski points out, "Yet out of the glare, the reality of user-generated content is a mixed bag. The writing can be freeform, to put it politely. Many blogs look horrible," she notes and adds that many are "boring, or 'safe' might be better adjectives" (2007, p. 8). Furthermore, this "mixed bag" of blog content makes identifying posts that may communicate certain sentiments even more challenging. According to Bielski, "Corporate creators don't make these blogs easy to subscribe to, search through, or otherwise interact with" (2007, p. 8).
Fortunately, Google provides a series of URL templates that can be "invoked via command M-x emacspeak-url-template-fetch normally bound to control e u. This command prompts for the name of the template, and completion is available via Emacs' minibuffer completion" (Google Blog Search, 2010, para. 2). The steps involved in conducting this analysis for each URL template are as follows:
A. Prompt for the relevant information.
B. Fetch the resulting URL using an appropriate fetcher.
C. Set up the resulting resource with appropriate customizations.
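The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the template string, the `example.com` address, and the function names `fill_template` and `run_template` are hypothetical and do not reproduce any actual Google or Emacs interface. The fetcher is passed in as a parameter so that the sketch runs without network access.

```python
from urllib.parse import quote_plus

# Hypothetical template; example.com is a placeholder, not a real search endpoint.
TEMPLATE = "https://example.com/blogsearch?q={query}&lang={lang}"


def fill_template(template, answers):
    """Substitute URL-encoded answers into the named fields of the template."""
    encoded = {name: quote_plus(value) for name, value in answers.items()}
    return template.format(**encoded)


def run_template(template, answers, fetcher, customize):
    """Run the three steps: use the prompted answers, fetch, then customize."""
    url = fill_template(template, answers)  # A. information already prompted for
    resource = fetcher(url)                 # B. fetch with a caller-supplied fetcher
    return customize(resource)              # C. apply customizations to the result


def fake_fetcher(url):
    """Stand-in fetcher (an assumption) that avoids any network traffic."""
    return "fetched:" + url


result = run_template(
    TEMPLATE,
    {"query": "love the pasta", "lang": "en"},
    fake_fetcher,
    str.upper,
)
```

In practice the stand-in fetcher would be replaced by a real one (for example, a function built on `urllib.request.urlopen`), and the customization step would parse the returned feed rather than change its case.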
Although "unblog-related," the template application used by Google Blog Search developers provides a useful example of how this procedure operates. According to Google Blog Search, "As an example, the URL templates that enable access to NPR media streams prompt for a program id and date, and automatically launch the realmedia player after fetching the resource" (2010, para. 3). As to their online application, the developers at Google Blog Search describe their efforts as follows: "Blog Search is Google search technology focused on blogs. Google is a strong believer in the self-publishing phenomenon represented by blogging, and we hope Blog Search will help our users to explore the blogging universe more effectively, and perhaps inspire many to join the revolution themselves" (2010, para. 2). As to the expected blog content that will be sentiment related, the developers make it clear that their coverage spans the entire range of human experience:
Whether you're looking for Harry Potter reviews, political commentary, summer salad recipes or anything else, Blog Search enables you to find out what people are saying on any subject of your choice. Your results include all blogs, not just those published through Blogger; our blog index is continually updated, so you'll always get the most accurate and up-to-date results; and you can search not just for blogs written in English, but in French, Italian, German, Spanish, Korean, Brazilian Portuguese, Dutch, Russian, Japanese, Swedish, Malay, Polish, Thai, Indonesian, Tagalog, Turkish, Vietnamese and other languages as well (Google Blog Search, 2010, para. 3).
Some of the other key features that make Google Blog Search useful for the purposes of the proposed study include the following:
A. The links allow users to browse Google Blog Search results by topic. For example, clicking the Technology link shows top stories in the tech world.
B. The goal of Blog Search is to include every blog that publishes a site feed (either RSS or Atom). It is not restricted to Blogger blogs, or blogs from any other service.
C. Google Blog Search uses a set of algorithms to try to determine the most popular stories in the blogosphere. The application takes into account factors such as a blog's title and content, as well as its popularity throughout the rest of the blogging community. The results are displayed based on groups of posts that are closely related.
An informal blog search using Google's "search blogs" feature provides the following raw sentiment-related search results:
Table 2
Blog Search Results of Sentiment-Related Terms (as of December 20, 2010)
Search Term     Number of Matches
Love            467,098,607
Hate            67,059,281
Awesome         79,550,156
Terrible        17,692,083
Angry           24,621,192
Like            821,870,100
Dislike         6,399,023
Enjoy           152,132,318
Clearly, there is a great deal of sentiment being expressed in blogs, but without knowing the specific context in which these sentiment-related terms are used, it is impossible to discern their true meanings. For instance, some bloggers might enthuse that they "just love the pasta at Joe's Spaghetti House," while others might state they "love the president's economic policies." Likewise, some bloggers might "hate the weather" while others "hate the president's economic policies." Given the enormous response to the search term "like," it is clear that some bloggers might "like Ike" while others use the term as a comparison as in, "Eating at this restaurant is like a trip to the dentist's office." The context of the sentiment-related posts will therefore require comparison to a corpus of various sentiments used in common practice to distinguish positive from negative sentiments (Ojala, 2009). For example, the word "like" or "love" when used immediately with or adjacent to descriptors such as "movie" or "restaurant" could be categorized as a review, while these words used with descriptors such as personal names or pronouns might indicate a romantic relationship. This corpus would be fine-tuned as the learning process proceeded through additional iterations of the supporting algorithms.
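The kind of context rule described above can be sketched as follows. The word sets, the category labels, the window size and the function name `categorize` are assumptions chosen for illustration; they stand in for the corpus the proposed study would build and are far too small for real use.

```python
import re

# Tiny illustrative word sets (assumptions, not a real corpus).
SENTIMENT_TERMS = {"love", "like", "hate", "enjoy", "dislike"}
REVIEW_NOUNS = {"movie", "restaurant", "pasta", "book"}
PERSON_WORDS = {"you", "him", "her"}
TO_BE = {"is", "was", "are", "were"}


def categorize(sentence, window=4):
    """Label the context of the first sentiment term found in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    for i, tok in enumerate(tokens):
        if tok not in SENTIMENT_TERMS:
            continue
        # "X is like Y" is treated as a comparison, not an expressed sentiment.
        if tok == "like" and i > 0 and tokens[i - 1] in TO_BE:
            return "comparison"
        following = tokens[i + 1:i + 1 + window]
        if any(w in REVIEW_NOUNS for w in following):
            return "review"
        if following and following[0] in PERSON_WORDS:
            return "personal"
        return "other"
    return "no sentiment term"


label = categorize("I just love the pasta at Joe's Spaghetti House")
```

Hand-written rules of this sort break down quickly on real text, which is why the proposed study looks to learned classifiers; the sketch is only meant to make the idea of adjacency-based context concrete.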
The results of a study by Manning (2009) that sought to identify effective ways to garner sentiment-related data from online reviews provide some useful insights into the steps involved in the blog-searching process. According to Manning, "A large and growing body of user-generated reviews is available on the Internet, from product reviews at sites like Amazon.com to restaurant reviews at sites like Yelp.com. For users making a purchasing or dining decision, the opinions of others can be an important factor" (p. 1). The need for a method by which blog posts can be mined to discern sentiment-related communications is made clear by Manning's observation that, "Although some aggregate information -- like average star ratings -- for multiple reviews is sometimes available, in general the only way to get a sense of the overall sentiment among users is by reading through many reviews. As the number of reviews for a single product or restaurant becomes large (on the order of hundreds or even thousands), it becomes increasingly impractical to read every review" (2009, p. 1).
While some blogs may in fact be reviews of restaurants, movies, books and so forth (Marken, 2006), it is more likely that the vast majority relate to other issues that are of immediate importance and relevance to the bloggers (Kelleher & Helkkula, 2010). It would therefore be useful to separate blog entries that are reviews from other, potentially more relevant sentimental posts, to help determine the context in which sentiment-related terms are used. In this regard, Finin et al. emphasize that, "An important component in understanding influence is to detect the sentiment and opinions expressed in blog posts. An aggregated opinion over many users is a predictor for an interesting trend in a community" (p. 78). Although off-the-shelf sentiment extraction software is available for this purpose (Brynko, 2007; Sharp, 2010), the approach used by Manning (2009) involved using customized natural language processing (NLP) machine-learning techniques to automatically categorize online content. According to Manning, "We view the goal of reading multiple reviews as finding widely-held opinions and weighing the positive against the negative, and we wish to automate this sort of task using [natural language processing] machine-learning techniques" (p. 2).
The analytical framework developed by Manning (2009) for discerning sentiment in online reviews categorized content according to three major components as follows:
A. Sentence-level sentiment classification;
B. Sentiment clustering and ranking; and,
C. Summarization.
This approach is congruent with the approach used in other studies to date that have focused on discerning new meanings that might otherwise go undetected through automated content analysis. For online analysis purposes, Manning used the following approach: "Sentiment-classification involves labeling every sentence in every review for a particular restaurant as either Subjective-Positive, Subjective-Negative, or Objective" (2009, p. 3). In support of this approach, Manning cites comparable research by Pang et al. that used traditional machine-learning techniques such as Naive Bayes to classify sentiment content contained in a complete online movie review. Likewise, Manning cites Hu and Liu's framework that expanded this analytical framework by providing a way to determine the context in which the sentiment-related word is used. According to Manning, "Instead of simple classification, they approach the problem by first extracting opinion words from each sentence and then predicting the polarity of the sentence by the dominant polarity of its constituents. They grow sets of positive and negative opinion words using seed words in WordNet" (2009, p. 3).
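The dominant-polarity idea attributed to Hu and Liu can be illustrated with a short sketch. The two word sets below are hand-picked assumptions, not lexicons grown from WordNet seeds, and the decisions to treat a sentence with no opinion words as Objective and to label ties as "Undetermined" are simplifications introduced here rather than part of either cited study.

```python
import re

# Tiny seed lexicons for illustration only (assumptions).
POSITIVE = {"good", "great", "awesome", "love", "enjoy", "delicious"}
NEGATIVE = {"bad", "terrible", "hate", "dislike", "slow", "rude"}


def sentence_polarity(sentence):
    """Label a sentence by the dominant polarity of its opinion words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos == 0 and neg == 0:
        return "Objective"      # simplification: no opinion words found
    if pos > neg:
        return "Subjective-Positive"
    if neg > pos:
        return "Subjective-Negative"
    return "Undetermined"       # tie between positive and negative words


labels = [
    sentence_polarity("The service was great and the food was delicious"),
    sentence_polarity("The waiter was rude and the kitchen was slow"),
    sentence_polarity("The restaurant opens at noon"),
]
```

A counting rule like this ignores negation ("not good") and sarcasm, which is one reason the studies cited here turn to trained classifiers and richer feature sets.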
Based on the effectiveness of these studies using straightforward classification methods, Manning (2009) applied comparable methods to his analysis of online reviews to identify sentiment-related words and terms based on the context in which they were used. In this regard, Manning reports that, "We [explore] various feature sets and classifiers. After isolating subjective sentences from objective sentences, we cluster those subjective sentences that are closely-related using a simple K-means algorithm and rank the resulting clusters using a cluster-quality metric that rewards large, cohesive clusters" (2009, p. 3).
Figure 2 below illustrates the steps used in the Manning algorithm to identify the context in which sentiment-related terms were used in online posts and to produce a ranked list of closely related opinions:
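The clustering and ranking step can be sketched in plain Python. Several details here are assumptions, because the passage quoted above does not specify them: sentences are represented as bag-of-words counts, similarity is measured by cosine, the initial centroids are chosen by the caller, and the quality score is taken to be cluster size multiplied by mean similarity to the centroid. The score is one plausible reading of a metric that "rewards large, cohesive clusters," not Manning's actual formula.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "was", "and", "a"}  # tiny illustrative stop list


def vectorize(sentence):
    """Bag-of-words vector with stop words removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans(vectors, seed_indices, iterations=10):
    """Cluster by cosine similarity, starting from the given seed vectors."""
    centroids = [Counter(vectors[i]) for i in seed_indices]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for idx, vec in enumerate(vectors):
            best = max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))
            clusters[best].append(idx)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = sum((vectors[i] for i in members), Counter())
    return clusters, centroids


def rank_clusters(vectors, clusters, centroids):
    """Score = size x mean similarity to centroid; highest score first."""
    scored = []
    for members, centroid in zip(clusters, centroids):
        if members:
            cohesion = sum(cosine(vectors[i], centroid) for i in members) / len(members)
            scored.append((len(members) * cohesion, members))
    return sorted(scored, reverse=True)


sentences = [
    "the service was slow",
    "the lamb was tender",
    "slow service again",
    "service was slow and rude",
    "tender lamb",
]
vectors = [vectorize(s) for s in sentences]
clusters, centroids = kmeans(vectors, seed_indices=[0, 1])
ranked = rank_clusters(vectors, clusters, centroids)
```

In this toy example the three sentences about slow service form the top-ranked cluster and the two about the lamb form the second. Real data would call for better initialization and a principled choice of the number of clusters.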
Figure 2. Extracting Common Sentiments from Online Reviews
Source: Manning, C. (2009), at http://nlp.stanford.edu/courses/cs224n/2009/fp/14.pdf
The various classifier filters used in the Manning algorithm to develop clusters of sentiment-related content provide a useful framework for the same type of analysis of blog content. For example, Manning reports that:
One common aspect of good clusters is that they seem to form around specific aspects of a restaurant, like atmosphere, lamb falling off the bone, service, or wok pho noodles. Poorer, but still high-scoring, clusters tend to form around high-frequency words that the stop-word list did not remove. This suggests a slightly different approach to building feature-vectors for each sentence. (p. 3)
Significantly, the learning aspects of this machine-learning technique proceed in an iterative fashion, with the application becoming more astute and effective in its ability to identify sentiment-related context over time based on a preponderance of usage. In this regard, Manning describes these steps as follows:
A. A pre-processing pass could build a list of words and phrases that appear frequently in the review of a particular restaurant but are uncommon in the wider corpus. This should find phrases like the name of a dish that many people are talking about.
B. Given the narrow domain of the problem, it should also be possible to hand-build a list of common ideas a reader might want to know about, like service, food, and price.
C. Extracting these combined, specific features should lead to purpose-built vectors that form clusters around relevant concepts (Manning, 2009, p. 13).
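The pre-processing pass described in step A can be sketched as a comparison of word rates in a target set of reviews against a wider background corpus. The scoring ratio, the add-one smoothing, the `min_count` threshold and the function name `distinctive_terms` are assumptions made for illustration; the source does not state how Manning would implement this pass.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def distinctive_terms(target_docs, background_docs, min_count=2, top_n=5):
    """Rank words that are frequent in the target but rare in the background."""
    target = Counter(w for d in target_docs for w in tokenize(d))
    background = Counter(w for d in background_docs for w in tokenize(d))
    t_total = sum(target.values())
    b_total = sum(background.values())
    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # ignore words seen too rarely to be reliable
        t_rate = count / t_total
        b_rate = (background[word] + 1) / (b_total + 1)  # add-one smoothing
        scores[word] = t_rate / b_rate
    # Highest ratio first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


restaurant_reviews = ["the lamb was amazing", "try the lamb", "the lamb is the best"]
wider_corpus = [
    "the service was fine",
    "the food was good",
    "the place is the best in town",
    "the menu is short",
]
terms = distinctive_terms(restaurant_reviews, wider_corpus)
```

Here "lamb" ranks well ahead of "the" because it is common in the target reviews and absent from the background. The same idea could be extended to multi-word phrases, as step A suggests, by counting n-grams instead of single words.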
The effectiveness of this analytical method is obviously tied to the forethought that goes into its design and how it is administered and refined over time. Therefore, it is also likely possible to develop extraction methods that can automatically delve deep into blog content to identify common themes, metaphors, trends and hot topics that can help inform the discussion concerning sentiment-related online communication today. In order to achieve these goals, the proposed study will use the research questions outlined below.
Research Questions
The proposed study will be guided by the following research questions:
A. What are the optimum classification techniques for identifying the context of sentiment-related terminology in blogs?
B. What classification techniques produce the most reliable categorization results?
C. How can the hand-built list of common ideas be amplified automatically?
D. How can sentiment-related blog content analysis help inform social researchers, policymakers, marketers and others who may benefit from the new insights that can emerge from the methodology proposed herein, which is described further below?
The Design -- Methods and Procedures
A. Data Collection. Just as the automated classification techniques become more adept at discerning sentiment-related content over time, so too will the data collection process evolve and improve as it is applied to blog content. Content will initially be gathered using Google's blog search feature to develop a growing list of reliable online blog content that can be searched using the extraction and analytical methods described further below. Although Manning (2009) found that the classification techniques used to analyze sentiment-related online reviews were effective, he was unable to transfer the movie review sentiment classifiers to other domains, making their wholesale application to blog content analysis inappropriate.
Nevertheless, the analytical framework identified in the Manning (2009) study provides a useful general departure point for further refinements for blog content analysis specifically, particularly with respect to distinguishing blog content that is related to reviews of movies, restaurants and books from other types of sentimental content. In this regard, Manning emphasizes the need for refinements in the process as quickly as possible to ensure the validity and reliability of the findings that emerge from sentiment-related online analyses. According to Manning:
Looking at the features with highest information gain gives some insight into why this is the case. Stems like 'movi' and 'film' are unlikely to be informative in a restaurant-review domain. One potential solution to this poor cross-domain performance is to do more work upfront to extract the particular types of features that are informative of opinion. Something like the approach of Hu and Liu -- extracting opinion words -- would seem to be appropriate. Alternatively, combining training data from multiple domains would also potentially be helpful. (2009, p. 12)