Project: Twitter Data Is Collected and Analysed on the Fly

Project Title: Sentiment Analysis

Abstract:
Sentiment Analysis is a variant of Opinion Mining. It deals with going through volumes of existing data collected from social networking websites such as Twitter and processing that data in order to derive conclusions from it. It goes a step further: it not only gathers and analyses the data, but also categorises it into three primary categories, namely positive, negative and, in some cases, neutral. The data from Twitter is collected and analysed on the fly to extract public sentiment about a particular brand. This feature of Sentiment Analysis can be used to recognize the market value of a business brand among its users. After comprehending the overall value of the brand in the eyes of its consumers, the brand owners can determine how their product is performing in the market and take corrective action, if the need arises, to improve their product and strategically take over the market. Thus, this paper proposes a smart method to campaign for a business brand, whereby business owners determine their position in the market and how well (or badly) their business is doing by mining data and deriving inferences from it. This gives them the capability to make insightful and well-informed decisions, providing a cost-effective and highly efficient method to review a business. It thereby gives business owners the ability to add value to their business and acquire a competitive edge.

Keywords: Sentiment Analysis, Opinion Mining, Social Brand Monitoring, Social Media Analytics, Business Analytics.

Introduction:-
Business Analytics has been booming for several decades. Many organisations have realised its importance and have invested significant amounts in this global phenomenon.
This has enabled organisations to take cognizance of the current market scenario and strategically steer their businesses to success, reaping substantial profits and unprecedented growth. Social Media Analytics is a branch of Business Analytics (BA) and has grown into a profound and widely used technical strategy in business. It can be concisely defined as the capability to analyse and break down huge volumes of data, both semi-structured and unstructured, from social media.

Social media is the "new big thing" that has happened to the world, and not without good reason. It is a revolution in itself, which has given organisations an alternative and unique medium of communication through which they have access to a huge amount of useful data. Since the advent of Web 2.0, the Internet has been redefined in every way: its capabilities have rapidly multiplied and its reach has grown substantially. Social media platforms form an integral component of this World Wide Web revolution. Social media has provided customers with a new and incomparable channel to interact with organisations and businesses, and an unprecedented opportunity to offer their opinions, suggestions and remarks on the products and services being offered. Social media also has an unparalleled ability to influence the perspective of customers and their inclination to purchase products or services. Thus, with the launch of social media, customers are equipped with the ability to give their opinions about any topic under the sun, and this ability extends further to discussions, public polls, debates etc. on a public platform.
Online social networks, along with micro-blogging websites, have therefore become the first choice for users to express their thoughts on a particular product, event or activity, and to do so in real time. Sentiment Analysis is used to derive inferences from diverse texts. This property can be used to extract reviews, to conduct election polls and to determine answers to trending questions. By studying and interpreting users' behaviour on online social networks, businesses can determine how customers receive their products and services, and can also work out ways to improve their brand reputation and grow their electronic commerce.

Literature Survey
The following are among the many challenges in the domain of Sentiment Analysis that need to be dealt with and resolved:-
   i) "Hidden Sentiment Identification" is to analyse and comprehend the actual emotion in the data rather than simply classifying it into one of the three polarities, i.e. positive, negative or neutral.
   ii) "Handling Polysemy" deals with a word having more than one meaning, which can lead to multiple sentiment polarities.
   iii) "Mapping Slangs" is to identify the slang in the data, determine its associated meaning and conclude its polarity.

Generally, in order to figure out the reputation of a business, tools or services provided by various agencies are called upon, in which several sentiment analysis algorithms are implemented to determine the sentiment in a sentence or extract the opinion from the text.
One class of algorithms used to determine the polarity of the text in question relies on lexical resources. Other popular approaches are based on Machine Learning, where algorithms such as Support Vector Machines or Naive Bayes classifiers are utilised. Along with extracting the sentiment in the text, another advantage of Sentiment Analysis is that it can evaluate the influence of users on the social networking portal or microblogging site. Various social media monitoring tools and services are available which evaluate how visible a particular brand is on social networks. BrandWatch and Sysomos are two prominent examples, used for business marketing and to understand how customers really feel about a brand.

Methodology
Hadoop Map-Reduce Framework:-
Hadoop is an open source software project written in Java. It is used to optimize the use of massive volumes of data. It is essentially a software framework for the distributed processing of large datasets across large clusters of commodity servers. Hadoop is based on a simple programming model called MapReduce, and it provides reliability through replication.

In the Hadoop ecosystem, there are two core components:-
   i) HDFS (Hadoop Distributed File System) for storage.
   ii) MapReduce for processing.

i) Hadoop Distributed File System:-
It is one of the primary components of a Hadoop cluster and is designed as a Master-Slave architecture.

Figure: Hadoop Master/Slave Architecture

The Master (NameNode) manages file system namespace operations such as opening, closing and renaming files and directories, and also determines the mapping of blocks to DataNodes. It also regulates access to files by clients.
Slaves (DataNodes) are responsible for serving read and write requests from clients, along with block creation, deletion and replication upon instruction from the Master (NameNode).

ii) Hadoop MapReduce Framework:-

Figure: HDFS Architecture

When a client submits a request to a Hadoop cluster, the request is managed by the JobTracker. The JobTracker, working with the NameNode, distributes work as closely as possible to the data on which it will operate. The NameNode is the master of the file system, providing the metadata services for data distribution and replication. The JobTracker schedules map and reduce tasks into available slots at one or more TaskTrackers. Each TaskTracker works with a DataNode (the slave portion of the distributed file system) to execute map and reduce tasks on data from that DataNode. When the map and reduce tasks are completed, the TaskTracker notifies the JobTracker, which keeps track of which tasks are complete and eventually notifies the client once the job has concluded.

Proposed System
This system enables organizations to gauge the feelings of customers about a product and hence understand their position in the market. By analysing the content produced by users, organizations can obtain a clear idea of what users think of their products. As a result, they can manage their reputation in the market and, having assessed the sentiment of their customers, take corrective action through ad-hoc marketing campaigns and digital marketing before users respond to a particular product. More importantly, the data available on social media platforms is free of cost, so there is no financial burden, and this freely available data can be used to build prediction models that predict sentiment.
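To connect the MapReduce model from the Methodology section to the proposed system, the following is a minimal single-machine sketch in plain Python, not Hadoop code. It assumes that each tweet has already been given a polarity label by an earlier step; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen only for illustration. On a real cluster the JobTracker would schedule many map and reduce tasks in parallel, and the grouping step would be performed by the framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical input record: (tweet text, polarity label). The labels are
# assumed to come from an earlier classification step.
LabelledTweet = Tuple[str, str]


def map_phase(record: LabelledTweet) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (label, 1) pair per tweet."""
    _text, label = record
    yield label, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one label."""
    return key, sum(values)


def run_job(records: Iterable[LabelledTweet]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    emitted = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `run_job` on a list of labelled tweets returns the number of tweets per label, which is the kind of summary a brand owner would read off the final chart.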
The objective of the system is therefore to obtain the recent tweets in the required time frame and to evaluate them in order to extract the sentiments of the users from the text. Once these tweets have been collected and collated, the overall image of the business can be generated.

System Design:

Figure: Process of Sentiment Analysis - The Flow

Tweet data is accumulated through Twitter's streaming API using the Twitter4j library, which provides tweet data for a particular topic. The collected Twitter data is analysed by gathering the adjectives in each tweet and categorising the data as positive, negative or neutral. The analysis is executed in parallel using Apache Spark, programmed in Scala, and its RDDs (Resilient Distributed Datasets).

Data is prepared using the following set of procedures:-
   i) Stop Word Removal:- Stop words are words that do not carry any sentiment and hence are dead weight. They are removed in order to optimise the process.
   ii) Tokenization:- The given text is broken down into its individual components (tokens) so that it is pre-processed for tagging the different parts of speech.
   iii) POS (Part of Speech) Tagging:- Parts of speech such as nouns, adjectives, verbs and more are identified in this phase. The objective of part-of-speech tagging is to separate out the adjectives from a phrase so that the underlying latent emotion can be identified with ease.

Thus, using the natural language processing tools from Stanford University, the emphasis is on breaking down the sentence and isolating the adjectives in it. Scala is used to stream the data from Twitter through the Twitter4j API, and the data is acquired and stored in JSON (JavaScript Object Notation) format, a lightweight format used for data interchange. Twitter4j gives us the ability to collect data from Twitter.
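The preparation steps listed above can be sketched in a few lines of Python, starting from a tweet stored as JSON. This is only an illustration under stated assumptions: the authors' system uses Scala and the Stanford tools, whereas here a tiny hand-written adjective set (`KNOWN_ADJECTIVES`) stands in for a real part-of-speech tagger, the stop-word list is deliberately short, and the helper names are hypothetical. The sketch tokenizes first and then filters stop words, since stop-word filtering operates on tokens.

```python
import json
import re
from typing import List, Set

# Assumption: a deliberately tiny stop-word list for illustration only.
STOP_WORDS: Set[str] = {"a", "an", "and", "but", "is", "it", "of", "the", "this", "to"}

# Assumption: stand-in for a part-of-speech tagger. A real system would tag
# every token and keep those tagged as adjectives.
KNOWN_ADJECTIVES: Set[str] = {"awful", "bad", "good", "great", "happy", "slow"}

TOKEN_PATTERN = re.compile(r"[a-z']+")


def tokenize(text: str) -> List[str]:
    """Break the text into lower-case word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that carry no sentiment."""
    return [token for token in tokens if token not in STOP_WORDS]


def extract_adjectives(tokens: List[str]) -> List[str]:
    """Keep only tokens recognised as adjectives (lexicon lookup here)."""
    return [token for token in tokens if token in KNOWN_ADJECTIVES]


def prepare(raw_json: str) -> List[str]:
    """Parse one stored tweet and return the adjectives found in its text."""
    tweet = json.loads(raw_json)
    tokens = tokenize(tweet["text"])
    return extract_adjectives(remove_stop_words(tokens))


if __name__ == "__main__":
    sample = '{"text": "The camera is great but the battery is awful"}'
    print(prepare(sample))  # prints ['great', 'awful']
```

The adjectives returned by `prepare` are what the classification stage would then score as positive, negative or neutral.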
Access to the Twitter API used by Twitter4j can be obtained simply by having a Twitter account and registering as a developer.

Once the data has been prepared, groomed and refined, the next and most vital stage is to extract and identify the sentiment hidden in the text, which is implemented through the Maximum Entropy algorithm. This enables us not only to determine the polarity of the sentence but also to comprehend the influence on Twitter of the user who wrote it. Ordinarily, the influence of a particular user is gauged from his followers, his mentions on Twitter and the reactions to his tweets.

The pre-classified data for training the model is provided by a dictionary known as SentiWordNet. The Maximum Entropy algorithm uses entropy as a criterion to classify the text into the classes Positive, Negative and Neutral with the help of the training data provided. It is a probabilistic model that excels in the classification of text, and it takes relatively less time to train than other algorithms. Moreover, Laplacian smoothing is used to deal with words that have not been encountered in the training model. Another noteworthy aspect of this system is that the Maximum Entropy algorithm is used in combination with part-of-speech tagging so as to achieve and maintain the best possible accuracy. Negation handling techniques are also employed to take care of "not" in sentences, so that the meaning of the sentence is not altered.

Emoticons:-
These are symbols used in sentences in order to convey a feeling or an emotion in a given text. They are most widely used in written communication. Over the last decades, they have dominated social networking sites.
Some examples of emoticons are as follows:-
   Emoticon for a positive feeling/emotion: 🙂
   Emoticon for a negative feeling/emotion: 🙁
Our application makes use of them in order to classify a post into the different polarity classes.

System Architecture:-
The entire application consists of three functional tiers.
   i) Presentation Layer:- This is what the end-user sees and interacts with. It is where the input is collected and the output is displayed. Input is taken from the user in the form of keywords to be searched for, or the name of the brand/product, along with the start date and end date of the search in the data streamed from Twitter.
   ii) Application Layer:- This layer executes all the logical operations. It is written in the Scala language. It accomplishes the task of Sentiment Analysis by seeking adjectives in the given tweets and classifying them as Positive, Negative or Neutral.
   iii) Database Layer:- This layer is used for storage. Data from Twitter is streamed into HDFS using the Twitter4j API. Using this interface, the content on Twitter regarding a particular feature can be pulled and stored in this layer.

Figure: System Architecture

Finally, the result is displayed in a graphical format such as a pie chart, donut chart or half-donut chart.
The overall sentiment is then derived and summarized into one of the following emotions:-
   i) Joy
   ii) Disappointment
   iii) Furious
   iv) Thrilled

The polarity of every tweet is assigned one of the following values:-
   i) 0
   ii) 1
   iii) -0.5
   iv) 0.5

Since the system analyses real-time data, the data is collected and analysed on the fly. The application therefore provides Sentiment Analysis over any topic in real time, which characterises it as a real-time application.

Results:-
A keyword, in the form of a string, is accepted from the user. The user can type any string relevant to a brand into the text box provided next to the Search button. A pre-decided number of tweets that are found to be relevant to the string entered are drawn from Twitter and then analysed to conclude the overall sentiment regarding the keyword. The results are then visualized graphically, using pie charts or donut charts, and in tables. The sentiment is shown in the charts, and the polarity of every tweet collected is displayed in the tables.

Conclusion:-
Sentiment Analysis is the need of the hour for any business, not only to determine its market value in the eyes of the customer, but also to gain a competitive advantage through deep insights into the market scenario. Our application demonstrates that accurate conclusions can be derived from data that is collected and scrutinised in real time, thereby providing results on the fly.
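As an illustration of the summary step described under Results, the sketch below turns a list of per-tweet polarity values into the shares that a pie or donut chart would display, together with a mean polarity. It is a minimal example under assumptions rather than the application's actual code: the function names are hypothetical, and the only values taken from the text are the four polarity values 0, 1, -0.5 and 0.5. How the overall result maps onto the four named emotions is not specified in the text, so that step is left out.

```python
from collections import Counter
from typing import Dict, List

# The four per-tweet polarity values named in the text.
ALLOWED_POLARITIES = (-0.5, 0.0, 0.5, 1.0)


def validate(polarities: List[float]) -> None:
    """Reject any value outside the four polarity values used by the system."""
    for value in polarities:
        if value not in ALLOWED_POLARITIES:
            raise ValueError(f"unexpected polarity value: {value}")


def polarity_shares(polarities: List[float]) -> Dict[float, float]:
    """Fraction of tweets per polarity value, e.g. as input for a pie chart."""
    validate(polarities)
    if not polarities:
        return {}
    counts = Counter(polarities)
    total = len(polarities)
    return {value: counts[value] / total for value in ALLOWED_POLARITIES}


def mean_polarity(polarities: List[float]) -> float:
    """Average polarity over all collected tweets."""
    validate(polarities)
    if not polarities:
        raise ValueError("no tweets to aggregate")
    return sum(polarities) / len(polarities)
```

For example, four tweets scored 1, 1, 0.5 and -0.5 give a mean polarity of 0.5, with half of the chart taken up by the value 1.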