Sentiment Analysis – A Solution Overview

Originally published 29 April 2010

User-generated content on the web has grown manyfold in the last few years, and much of it is in the form of reviews, commentaries, ratings and now tweets. Users express their opinions through these forms. The tasks of identifying, monitoring, summarizing and organizing these opinions are collectively termed sentiment analysis or opinion mining.

Sentiment analysis is of real value to companies in managing their brands and reputation. Traditionally, brand and reputation management has been done via surveys, focus groups and user conferences. While these are not likely to go away in the near future, the ability to monitor brands in real time is a value-add that cannot be overestimated.

Sentiment analysis involves elements of natural language processing (NLP), text mining, machine learning and data analytics. Research in the field of opinion mining has been ongoing for several years, and many models and techniques have been proposed. The theory is well understood, and tools and solutions are available to implement a sentiment analysis system. Companies in the text analytics area are usually the first to come up with solutions, but there is an increasing presence of new start-up firms that are creating a buzz in this domain.

In this article, I don’t delve into the theory and the algorithms involved in sentiment analysis but I will take a look at the entire process, from identifying the opinion sources to the visualization of the results.

Solution Overview

Mining opinions is different from regular text mining because direct keywords cannot be used to search for opinions: sentiments or opinions are not usually directly evident. Though sentiments themselves are domain independent, the words used to describe them can vary from domain to domain. Sentiments are expressed in different ways: as overall scores (such as star ratings), as pros and cons on features/aspects of the object of opinion, as rants and raves, and so on. Opinions are subjective and often comparative in nature, and multiple opinions may be expressed in a single passage. So the process of identifying opinions, classifying them, extracting them and summarizing them is unique and demanding enough that specialized systems are needed.

The following steps describe the process of sentiment analysis. The process is independent of the methods employed for analyzing sentiments. These steps can be thought of as describing what the process is and not how the process is implemented. The steps are:

  1. Identifying and classifying data sources: Deals with the different structures of opinion-based text such as reviews, editorials and news based on sources of opinions.
  2. Retrieval and storage of data: Deals with the challenges of handling very large volumes of data and the extraction of sentiment phrases from text passages found in the identified data sources.
  3. Sentiment classification: This step involves categorization of sentiments as being positive, negative or neutral. 
  4. Sentiment summarization: The sentiments are summarized into aggregate scores for positive and negative orientation along with relevant snippets.
  5. Visualization: This final step involves building dashboards and trackers that will help the user segment and view the data in a useful manner to get better insights into the sentiments being expressed.
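The five steps can be sketched as a small pipeline. This is only an illustration of how the steps fit together, not a real implementation: the `Opinion` record, the `run_pipeline` function and the toy `is_opinion` and `classify` stand-ins are all hypothetical names introduced here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Opinion:
    source: str   # where the text came from (step 1)
    text: str     # the opinion passage kept after retrieval (step 2)
    label: str    # "positive", "negative" or "neutral" (step 3)


def run_pipeline(
    documents: Iterable[Tuple[str, str]],
    is_opinion: Callable[[str], bool],
    classify: Callable[[str], str],
) -> Tuple[List[Opinion], Dict[str, int]]:
    """Run steps 1-4 over (source, text) pairs; the summary feeds step 5."""
    opinions = [
        Opinion(source, text, classify(text))
        for source, text in documents
        if is_opinion(text)  # steps 1-2: keep only opinion-bearing text
    ]
    summary = Counter(o.label for o in opinions)  # step 4: aggregate by orientation
    return opinions, dict(summary)


# Toy stand-ins, purely illustrative; real systems use the techniques
# described in the sections that follow.
docs = [
    ("blog", "I love the battery life"),
    ("shop", "Price: 199 USD"),
    ("forum", "I hate the keyboard"),
]
opinions, summary = run_pipeline(
    docs,
    is_opinion=lambda t: t.lower().startswith("i "),
    classify=lambda t: "positive" if "love" in t else "negative",
)
print(summary)
```

The two callables are passed in as parameters to mirror the point made above: the process stays the same whichever method is plugged in at each step.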

Identifying and Classifying Data Sources

Opinions appear all over the web and some sites are more reputable than others. The formats differ, and opinions are often interspersed with commerce or other types of content. Separating opinions from other types of text is referred to as genre classification. At this stage the opinion orientation itself is not known, only that the passages contain opinions. Opinions are usually subjective statements, and techniques such as part-of-speech (POS) tagging are used for this classification. A study by Finn et al. [1] reports that POS-based methods performed better than the other methods it compared.

Data sources also can be classified by their reputation, relevance and popularity, and a weight can be ascribed to the source to be used when scoring opinions.
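As a rough sketch of how such a weight might be used, the example below averages three per-source ratings into one weight and then computes a weighted mean of opinion scores. The equal-weight average, the [0, 1] rating scale and the function names `source_weight` and `weighted_score` are assumptions made for illustration; the article does not prescribe a formula.

```python
def source_weight(reputation, relevance, popularity):
    """Combine three ratings in [0, 1] into one weight (assumed: plain average)."""
    return (reputation + relevance + popularity) / 3


def weighted_score(scored_opinions):
    """Weighted mean of (score, weight) pairs; scores are assumed to lie in [-1, 1]."""
    total_weight = sum(weight for _, weight in scored_opinions)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in scored_opinions) / total_weight


trusted = source_weight(1.0, 1.0, 1.0)   # a well-regarded review site
obscure = source_weight(0.5, 0.5, 0.5)   # a little-known forum

# One positive opinion from the trusted source, one negative from the obscure one.
overall = weighted_score([(1.0, trusted), (-1.0, obscure)])
print(round(overall, 3))
```

With these numbers the overall score stays positive, because the positive opinion comes from the more heavily weighted source.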

Retrieval and Storage of Data

Spiders and crawlers are the most common and scalable way of retrieving content. Focused or topical crawlers may be employed when you are looking for opinions on a certain set of topics. Once the data has been retrieved, it needs to be stored, and though disk space is cheap, you can end up with terabytes of data that increase storage costs very quickly. The entire retrieval and storage process could be managed on a cloud platform such as Amazon EC2. The preprocessing and cleansing of data can be done in the cloud, and only the relevant data (i.e., the opinions) need be stored locally.

Sentiment Classification

After extracting opinions from the crawled text, you come to a key step: determining the sentiment being expressed, that is, whether its semantic orientation is negative, positive or neutral. When it is not neutral, you also need to grade it to see how positive or negative it is.

In a typical opinion passage, several sentiments are expressed on several aspects/features of the opinion target. Most classifiers work at the phrase or sentence level rather than at the word or document level. NLP or machine learning-based approaches are used in this step.

NLP techniques involve using opinion words, detecting subjective parts of speech with a POS tagger and building sentiment lexicons. A sentiment lexicon differs from lexicons such as WordNet by including the semantic orientation of adjectives, making them opinion words. SentiWordNet [2] is one such sentiment lexicon, and it is publicly available. When detecting sentences with opinions, special patterns such as an adjective (JJ) followed by a noun (NN) are used for pattern matching.
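A minimal sketch of the adjective-noun pattern combined with a lexicon lookup is shown here. It assumes the text has already been POS-tagged with Penn Treebank style tags, and it uses a tiny hand-made `LEXICON` dictionary as a stand-in for a real resource such as SentiWordNet; the scores and function names are hypothetical.

```python
# Hypothetical miniature lexicon: adjective -> semantic orientation in [-1, 1].
LEXICON = {"great": 1.0, "sharp": 0.5, "poor": -1.0, "noisy": -0.5}


def adjective_noun_pairs(tagged):
    """Yield (adjective, noun) wherever a JJ tag is immediately followed by an NN* tag."""
    for (word1, tag1), (word2, tag2) in zip(tagged, tagged[1:]):
        if tag1 == "JJ" and tag2.startswith("NN"):
            yield word1.lower(), word2.lower()


def score_phrases(tagged):
    """Attach the lexicon orientation of the adjective to each matched phrase."""
    return [
        (adjective, noun, LEXICON.get(adjective, 0.0))  # unknown adjectives count as neutral
        for adjective, noun in adjective_noun_pairs(tagged)
    ]


# Pre-tagged input (assumed to come from a POS tagger).
tagged_sentence = [
    ("The", "DT"), ("camera", "NN"), ("has", "VBZ"), ("a", "DT"),
    ("great", "JJ"), ("lens", "NN"), ("but", "CC"),
    ("poor", "JJ"), ("battery", "NN"), ("life", "NN"),
]
phrases = score_phrases(tagged_sentence)
print(phrases)  # [('great', 'lens', 1.0), ('poor', 'battery', -1.0)]
```

Working on tagged tokens keeps the sketch independent of any particular tagger; in practice the tags would come from a toolkit such as those mentioned later in this article.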

While NLP methods are rules-based, machine learning methods use probabilistic classifiers such as Naïve Bayes and large margin classifiers such as Support Vector Machines. Sentiment classification is treated as a special type of topic classification, and applying it to features beyond single words (bi-grams, tri-grams and n-grams in general) makes the classification of sentiments possible.
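To illustrate why features beyond single words matter, here is a small multinomial Naïve Bayes classifier written from scratch over unigrams and bigrams. It is a sketch under simplifying assumptions (whitespace tokenization, add-one smoothing, a six-sentence toy training set), and the names `ngram_features` and `TinyNaiveBayes` are invented for this example.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text, max_n=2):
    """Return unigrams and bigrams (as strings) of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


class TinyNaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()                 # label -> number of training texts
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for feature in ngram_features(text):
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            total = sum(self.feature_counts[label].values())
            log_prob = math.log(n_docs / total_docs)  # class prior
            for feature in ngram_features(text):
                if feature in self.vocabulary:        # features never seen in training are ignored
                    count = self.feature_counts[label][feature]
                    log_prob += math.log((count + 1) / (total + vocab_size))  # add-one smoothing
            scores[label] = log_prob
        return max(scores, key=scores.get)


train_texts = [
    "great phone and great screen", "really good battery", "not bad at all",
    "not good at all", "really bad screen", "poor battery and poor sound",
]
train_labels = ["positive"] * 3 + ["negative"] * 3
model = TinyNaiveBayes().fit(train_texts, train_labels)

print(model.predict("not good"))  # negative
print(model.predict("not bad"))   # positive
```

In this toy data the words "not", "good" and "bad" each occur equally often in both classes, so the decision rests entirely on the bigrams "not good" and "not bad".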

Classification of product reviews further categorizes sentiments by the features on which the opinion is expressed. In such cases, an additional step wherein the main features themselves are discerned is introduced, and POS tagging (noun and noun phrases) can be used for this purpose.
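One simple way to discern candidate features, sketched below, is to collect the nouns from POS-tagged reviews and keep those that recur across reviews. The threshold of two reviews and the name `frequent_features` are assumptions for illustration, and the input is assumed to be tagged already.

```python
from collections import Counter


def frequent_features(tagged_reviews, min_reviews=2):
    """Return nouns (NN* tags) that appear in at least `min_reviews` reviews."""
    counts = Counter()
    for review in tagged_reviews:
        # Count each noun once per review so one long review cannot dominate.
        nouns = {word.lower() for word, tag in review if tag.startswith("NN")}
        counts.update(nouns)
    return sorted(noun for noun, count in counts.items() if count >= min_reviews)


tagged_reviews = [
    [("The", "DT"), ("battery", "NN"), ("is", "VBZ"), ("weak", "JJ")],
    [("Bright", "JJ"), ("screen", "NN"), ("and", "CC"), ("solid", "JJ"), ("battery", "NN")],
    [("The", "DT"), ("screen", "NN"), ("scratches", "VBZ"), ("and", "CC"),
     ("the", "DT"), ("strap", "NN"), ("broke", "VBD")],
]
features = frequent_features(tagged_reviews)
print(features)  # ['battery', 'screen']
```

Sentiments found near each feature can then be grouped under it, so that a product gets a per-feature breakdown rather than a single verdict.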

In addition to the orientation, you also need to note the level of the sentiment. This is useful for qualitative analysis as well as to attach a numeric grade to the sentiment.

Sentiment Summarization

In this step, the classified sentiments are aggregated by their orientation, i.e., all negative sentiments are aggregated into one summary and all positive sentiments into another. A representative set of sentences is also presented as a qualitative snapshot along with the aggregation. You can also attach a score such as a numeric rating to the summary if the individual sentiments carry grades.
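A possible shape for this aggregation is sketched below. It assumes each classified sentiment arrives as a (text, grade) pair with a grade in [-1, 1], and it picks the most strongly graded sentences as the representative snippets; both choices, and the name `summarize`, are illustrative assumptions.

```python
def summarize(sentiments, snippets_per_side=1):
    """Aggregate (text, grade) pairs by orientation; a grade of 0 is neutral and skipped."""
    groups = {"positive": [], "negative": []}
    for text, grade in sentiments:
        if grade > 0:
            groups["positive"].append((text, grade))
        elif grade < 0:
            groups["negative"].append((text, grade))

    summary = {}
    for orientation, items in groups.items():
        # Strongest opinions first, so the snippets are the most emphatic ones.
        items.sort(key=lambda item: abs(item[1]), reverse=True)
        summary[orientation] = {
            "count": len(items),
            "mean_grade": sum(g for _, g in items) / len(items) if items else 0.0,
            "snippets": [text for text, _ in items[:snippets_per_side]],
        }
    return summary


graded = [
    ("Battery lasts for days", 0.75),
    ("Screen is fine", 0.25),
    ("Keyboard is awful", -1.0),
    ("Box was plain", 0.0),
]
report = summarize(graded)
print(report["positive"])
print(report["negative"])
```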

Visualization

While summarization gives an overall score and sentiment, it can be short on detail, and certain angles and perspectives will not be evident in a simple summary. Visualization helps provide better insights into the sentiments. The views are very similar to traditional OLAP views. You can create dashboards and drill-down charts along various dimensions such as geography and customer segment. You can also incorporate tracking tools such as trend graphs and alerts (e.g., for a sudden spurt in negative publicity). To facilitate some of these views, you need to ensure that the sentiment object stored in the database is annotated with meta-information such as the source of the review, the topic/product that it belongs to and other properties that can be used in the business analysis.
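The sketch below shows why the meta-information matters: once each stored sentiment carries fields such as product, source and region, a drill-down is a group-by on one of those fields, and a simple alert is a count of negative records per day. The `SentimentRecord` fields, the integer day index and the alert threshold are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class SentimentRecord:
    product: str
    source: str
    region: str
    day: int       # day index, standing in for a real timestamp
    grade: float   # assumed range [-1, 1]; negative means negative sentiment


def drill_down(records, dimension):
    """Mean grade per value of one annotated dimension (e.g. 'region')."""
    grades = defaultdict(list)
    for record in records:
        grades[getattr(record, dimension)].append(record.grade)
    return {key: sum(values) / len(values) for key, values in grades.items()}


def negative_spike_days(records, threshold):
    """Days on which the number of negative records reaches the threshold."""
    per_day = Counter(r.day for r in records if r.grade < 0)
    return sorted(day for day, count in per_day.items() if count >= threshold)


records = [
    SentimentRecord("phone", "blog", "EU", 1, 0.5),
    SentimentRecord("phone", "forum", "US", 1, -0.5),
    SentimentRecord("phone", "forum", "US", 2, -1.0),
    SentimentRecord("phone", "blog", "US", 2, -0.75),
    SentimentRecord("phone", "shop", "EU", 2, 1.0),
]
by_region = drill_down(records, "region")
alerts = negative_spike_days(records, threshold=2)
print(by_region)
print(alerts)
```

A charting layer would sit on top of functions like these; the point here is only that every dimension you want to chart has to be captured when the sentiment is stored.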

The Technologies and Tools

As mentioned at the start of this article, sentiment analysis encompasses several areas and is a complex process, but there are tools available to speed up development. There are sentiment analysis services and products available in the market, and using them may be the optimal path for companies and services whose core business is not analytics or information technology.

For organizations that want to build their own sentiment analysis tools, APIs such as LingPipe, OpenNLP and the Evri Sentiment web APIs are available. The Apache Lucene ecosystem (Nutch, Hadoop and Mahout) would be a compelling choice for data retrieval and preprocessing as well as for running scalable machine learning algorithms. Flex is a well-proven technology for building the widgets needed for the business analysis views.

Sentiment analysis is a very hot topic in text analytics, and research has been ongoing for several years, but it is the advent of user-generated content, which has changed the dynamics of brand management, that has pushed this technology to the forefront. Though these systems add real value, they are not highly accurate at this time. However, the momentum in both academic research and commercial applications suggests that constant improvements can be expected.

The underlying concepts and technologies are complex and, even though there are tools and technologies for each of the components of such a system, building an integrated solution would require subject matter expertise in areas such as machine learning and NLP. Unless a company’s core competency lies in these areas, it may be better to buy than to build. While text analytics companies offer sentiment analysis systems, these are not off-the-shelf solutions and need to be customized. Given the complexity and the evolving nature of this field, partnering with integrators and solution developers could be a quick way of bringing up a solution for most companies.

References:

  1. A. Finn, N. Kushmerick, and B. Smyth. Genre classification and domain transfer for information filtering. Proc. 24th European Colloquium on Information Retrieval Research, Glasgow, 2002, 353-362.
  2. A. Esuli and F. Sebastiani. Sentiwordnet: A publicly available lexical resource for opinion mining. In Proceedings of LREC-06, the 5th Conference on Language Resources and Evaluation, Genova, Italy, 2006.
  3. Yi, J., Nasukawa, T., Bunescu, R., Niblack, W. Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques. In: Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM-2003).
  4. Pang, B., Lee, L., Vaithyanathan, S. Thumbs up? Sentiment Classification Using Machine Learning Techniques. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2002). (2002) 79-86.
  5. H. Cui, V. Mittal, and M. Datar. 2006. Comparative experiments on sentiment classification for online product reviews. In Proceedings of the 21st International Conference on Artificial Intelligence, Boston, MA.

Aravind Shenai

    Aravind K. Shenai is Senior Architect and Head of Software Product Engineering (SPE) Labs at MindTree Ltd. He started his professional career as a hardware developer and has more than 15 years of experience in software development. Prior to joining MindTree, Shenai worked in several Bay Area start-ups building products in areas such as database storage, network management, web portals, pricing and hosted infrastructure. You may contact him at aravind_shenai@mindtree.com

 

Comments


Posted 3 May 2010 by Shyam Kapur

This is an excellent and timely post. Sentiment analysis is one of the hottest areas at the moment. More and more consumers and businesses can now see more clearly how useful sentiment analysis when done well can be for them. One of the best embodiments of quality sentiment analysis today is TipTop, a semantic, social, real-time search engine at http://FeelTipTop.com
