I'm sorry Dave. I'm afraid I can't do that...
In the post Dashboards, scorecards and sentiment I wrote about why I don't think computers can accurately assess the emotional meaning of a sentence. This article from The New York Times, entitled Mining the Web for Feelings, Not Facts, touches on how performing exactly this function is a "growing business". What's interesting to me is how often I hear it repeated that these algorithms are "70 to 80 percent accurate", often with the addendum that people only agree on the meaning of something 70 to 80 percent of the time. You are being invited to commit a logical fallacy, inasmuch as the implicit suggestion is that the algorithm is about as accurate as human-assessed sentiment. This isn't the case: two people agreeing with each other 70 to 80 percent of the time tells you nothing about whether the algorithm's errors fall on the same items, or are of the same kind. The article does touch on this, highlighting the following:
"A quick search on Tweetfeel, for example, reveals that 77 percent of recent tweeters liked the movie 'Julie & Julia'. But the same search on Twitrratr reveals a few misfires. The site assigned a negative score to a tweet reading 'julie and julia was truly delightful!!' That same message ended with 'we all felt very hungry afterwards' — and the system took the word 'hungry' to indicate a negative sentiment."
In my experience, when the computer gets it wrong it gets it wrong in a way a human wouldn't. These monitoring companies are effectively saying that 20 to 30 percent of their data cannot be relied upon. Were the data to be assessed by people, the areas of disagreement would actually be highly useful, as there are probably interesting reasons why the disagreement was occurring, specifically to do with context that cannot be assessed by looking at one sentence.
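To make the 'hungry' misfire concrete, here is a minimal sketch of a word-list scorer. I don't know how Twitrratr actually works, so treat this as an assumption about the general approach rather than a description of their system: the word lists and the function name naive_sentiment are invented for illustration, and the tweet is simply the two fragments quoted above joined together.

```python
import re

# Hypothetical word lists, invented for this example only.
# Note that "delightful" is missing from the positive list and
# "hungry" has been filed under negative.
POSITIVE = {"good", "great", "love", "liked"}
NEGATIVE = {"bad", "awful", "hate", "hungry"}


def naive_sentiment(text):
    """Count positive and negative words and return a three-bucket label."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# The two fragments quoted in the article, joined into one string.
tweet = "julie and julia was truly delightful!! we all felt very hungry afterwards"
print(naive_sentiment(tweet))  # negative
```

No person reading that tweet would call it negative. The scorer does so because it looks at words in isolation, which is the kind of error I mean.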
Although this figure of 70 to 80 percent accuracy gets thrown around, I have yet to see a monitoring company supply a dataset that has been assessed by both its algorithm and a team of humans to prove that this measure of 'accuracy' can be relied on. A set of results like this would also allow us to see where the computer and the human assessors disagree which, given that it's people's opinions we are actually trying to quantify, is something worth testing.
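The side-by-side comparison I'm asking for could be as simple as the sketch below. The labels are made up purely to show the arithmetic and are not results from any real tool, and the function names are my own. The toy data is arranged so that the algorithm matches one human 75 percent of the time and the two humans match each other 75 percent of the time, yet the disagreements fall on different items.

```python
def agreement(a, b):
    """Fraction of items on which two equal-length label lists match."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def machine_only_errors(algo, human_a, human_b):
    """Indices where both humans agree with each other but the algorithm differs."""
    return [
        i
        for i, (m, a, b) in enumerate(zip(algo, human_a, human_b))
        if a == b and m != a
    ]


# Hypothetical labels for eight mentions (illustrative only, not real data).
human_a = ["pos", "pos", "neg", "neu", "pos", "neg", "neu", "pos"]
human_b = ["pos", "pos", "neg", "neu", "neu", "neg", "pos", "pos"]
algo = ["pos", "neg", "neg", "neu", "pos", "pos", "neu", "pos"]

print(agreement(human_a, human_b))  # 0.75
print(agreement(algo, human_a))  # 0.75
print(machine_only_errors(algo, human_a, human_b))  # [1, 5]
```

Items 1 and 5 are the ones to look at: both people gave the same label and the algorithm gave the opposite one. The two items the humans disagree on (4 and 6) are a different set, and those are the ones where the context is probably worth reading.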
Most sentiment analysis systems place opinion in one of three buckets, either positive, neutral or negative. This sounds superficially plausible, but if you've ever looked at hundreds of mentions around a keyword or topic you quickly realise that this doesn't really fit with how people express opinions or have conversations. Combined with the lack of accuracy, the lack of nuance in these assessments reduces the value of these tools.
Many dashboards I've seen expect you to take the figures they provide as a given. If you go behind the percentages into the data you start to realise that you do not have a system on which to make reliable judgements, which is after all where the supposed value of these tools lies.
Scout Labs have a good post entitled How does sentiment work? And how accurate is it, anyway? that is worth reading, as they try to address these issues, which few other companies offering these services have even touched on. They mention the use of Mechanical Turk as a way of assessing sentiment using humans and point to a good paper on the problems with this way of doing things. My feeling is that for most of these companies the issue is that they are offering a volume service, and the only realistic way to process the vast amounts of data generated is to use a computer. For now, that is an imperfect way of trying to provide what would be very useful information.