Whether in large enterprises or SMEs, structured data represents only about 20% of the information available to an organization; the other 80% is unstructured. Mind-boggling, isn't it? If businesses are flourishing by analyzing only 20% of their data, imagine what they could do if they could make sense of the remaining 80%.
This is where Big Data comes into play: to streamline and exploit that 80% of unstructured data.
Did you know that about 50% of unstructured data is text? Text is the easiest form of data from which insights can be gleaned, using text analytics tools and algorithms. With such an enormous share of unstructured data being text, analyzing it to obtain business insights should be a priority.
Text analytics can help SMEs as well as enterprise-scale organizations use their unstructured data to understand the likes, dislikes and motivations of their customers. Knowing how your customers feel about your brand helps retain their loyalty, for example by aligning loyalty-program incentives with what customers actually want, which in turn drives increased sales and a growing customer base.
Structured Data vs Unstructured Data:
Structured data is data that is easily searchable by basic algorithms; spreadsheets and machine-sensor readings are typical examples.
Unstructured data, on the other hand, is more like human language. It does not fit neatly into relational databases, and searching it with traditional algorithms is extremely difficult, if not impossible.
Text Analytics Techniques
The first step is to extract the unstructured data from the systems and silos in which it is generated into a central repository, so that any application can easily access it for further processing.
This also imposes some structure on the data, at least in terms of the output formats, viz. JSON, XML, etc. In essence, the unstructured text acquires a degree of structure.
Data Warehousing Tools
MongoDB – An open-source NoSQL big data tool that can serve as a warehouse for large amounts of unstructured data. MongoDB can handle a variety of unstructured data: text (natural language), images, videos, audio, etc.
It is highly scalable in the amount of data it can handle and very flexible in the kinds of data structures it can store. A cluster of MongoDB machines can be set up when the data volume becomes huge, and separate collections (MongoDB's analogue of tables) can be created for the different sources or kinds of data. Output is returned in a semi-structured format (JSON), which downstream algorithms can readily consume.
The pipeline to ingest data into MongoDB and extract it again can be written in a scripting language like Python, and the database can be exposed via user-friendly APIs.
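As a minimal sketch of such an ingestion step, the snippet below normalizes raw text records into uniform JSON-style documents before loading. The field names (`source`, `text`, `ingested_at`) and the collection names in the commented PyMongo call are illustrative assumptions, not a fixed schema:

```python
import json
from datetime import datetime, timezone

def to_document(source, raw_text):
    """Wrap a raw text record in a uniform document structure for the repository."""
    return {
        "source": source,
        "text": raw_text.strip(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

records = [("support_email", "  The app crashes on login.  "),
           ("review_site", "Great value for money!")]
docs = [to_document(src, txt) for src, txt in records]

# With a running MongoDB instance, the documents could then be loaded via PyMongo:
# from pymongo import MongoClient
# MongoClient("mongodb://localhost:27017")["warehouse"]["texts"].insert_many(docs)

print(json.dumps(docs[0]))
```

Because every record passes through the same wrapper, downstream consumers can rely on a consistent document shape even though the source data is unstructured.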
ElasticSearch – An open-source text warehouse and search engine that can handle large volumes of data: a cluster of machines can be set up to process the data in a distributed manner.
ElasticSearch specializes in text data and provides powerful text-matching and text-search facilities using state-of-the-art algorithms. Once the text data is indexed (uploaded) into ES, it can be queried with a search query, and all relevant text documents are retrieved. If text search and text matching are key use cases, ES works better than other data-warehousing solutions.
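To illustrate what such a query looks like, the sketch below builds a standard ElasticSearch `match` query body as a plain Python dict. The index name `texts` and field name `text` in the commented client call are assumptions for illustration:

```python
import json

def match_query(field, text, size=10):
    """Build an ElasticSearch full-text 'match' query body for one field."""
    return {"size": size, "query": {"match": {field: text}}}

body = match_query("text", "login crash")
print(json.dumps(body))

# Against a running cluster, the query would be sent with the official client:
# from elasticsearch import Elasticsearch
# hits = Elasticsearch("http://localhost:9200").search(index="texts", body=body)
```

ES scores every indexed document against the query terms and returns the hits ranked by relevance, which is the match score mentioned below under text matching.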
Noah Data has developed a unique Big Data solution on ElasticSearch and MongoDB for Fortune 500 companies. Our solution supports diverse unstructured data sources and formats, designed around the schema-less architecture of both repositories. We have seen it improve operational efficiency up to fivefold.
NLP backend layer (for making sense of the data) – We have worked on a variety of problems whose goal is to make sense of text data, with use cases such as sentiment analysis, topic modeling and text summarization.
NLP algorithms are text-specific algorithms that we have used to solve complex problems like product classification, in addition to the usual text-mining use cases such as summarization and sentiment analysis. Our NLP algorithm expertise is listed below:
Text matching – This means matching similar strings to each other, e.g. the misspelling “Austrlia” to “Australia”. Algorithms like Levenshtein distance and longest common subsequence are used here. We also use ElasticSearch, whose powerful text-matching algorithm returns a match score for the relevant items in the indexed data for any incoming query.
We built an R- and ElasticSearch-based solution to match the various versions of company names (ebay.com, eBay India, etc. to eBay Inc.) against an indexed EDGAR database.
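A minimal sketch of the Levenshtein (edit) distance mentioned above, as the standard dynamic-programming recurrence, using the same misspelling example:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute (or match)
        prev = cur
    return prev[-1]

print(levenshtein("Austrlia", "Australia"))  # → 1 (one missing letter)
```

The lower the distance relative to the string lengths, the more likely two strings refer to the same entity, which is the basis of the name-matching work described above.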
Text classification – This pertains to classifying text, for example as carrying positive or negative sentiment, or as belonging to sports, politics, culture, etc. Classification algorithms such as Naive Bayes, SVMs, Random Forests and neural networks are used here. As these are supervised methods, labeled training data is required.
We have categorized more than 100 million products with an accuracy of 80%.
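As a self-contained illustration of the Naive Bayes approach named above (the four toy training sentences and the sports/politics labels are purely illustrative; a real classifier needs a large labeled corpus and proper tokenization):

```python
import math
from collections import Counter, defaultdict

# Toy labeled data: (text, label) pairs.
train = [
    ("the team won the match in overtime", "sports"),
    ("player scores a late goal to win the game", "sports"),
    ("parliament passed the new budget bill", "politics"),
    ("the minister announced an election campaign", "politics"),
]

# Per-class word counts for multinomial Naive Bayes.
class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the class maximizing log prior + summed log likelihoods
    (Laplace smoothing avoids zero probabilities for unseen words)."""
    best, best_score = None, float("-inf")
    for label in class_docs:
        score = math.log(class_docs[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("the team scores a goal"))  # → sports
```

With sentiment labels (positive/negative) instead of topic labels, the same machinery performs sentiment classification.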
Sentiment Analysis – Sentiment analysis requires training data labeled as positive, negative or neutral. Our approach uses pre-trained word-embedding models to classify text into positive and negative sentiment.
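As a far simpler stand-in for the embedding-based models described above, a lexicon-based scorer conveys the basic idea; the two toy word lists are illustrative only, not a real sentiment lexicon:

```python
# Toy sentiment lexicons (illustrative; real lexicons contain thousands of scored words).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "poor", "broken"}

def sentiment(text):
    """Label text by counting positive vs. negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("great service, love it"))
```

Embedding-based classifiers generalize much better, since semantically similar words (e.g. “superb” near “excellent”) land close together in the embedding space even when absent from any fixed word list.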
Topic modeling – These methods extract hidden topics from a given text corpus; Latent Dirichlet Allocation (LDA) is a popular algorithm that we use. A topic is a probability distribution over words, and the number of topics to be mined can be set by the user. Since LDA is an unsupervised method, the data does not need to be labeled; however, the accuracy and richness of the results depend on the richness of the data.
We have extracted hidden topics from 160 million documents scraped from the internet using LDA, and also performed entity recognition on the data.
Topic summarization – This consists of gleaning short, keyword-rich summaries from the text data. We use two approaches to text summarization: extractive and abstractive methods. In practice we use methods such as Gensim TextRank, PyTextRank, Sumy-Luhn and Sumy-LSA.
The accuracy of the summaries is validated and fine-tuned with metrics such as ROUGE-N and BLEU scores.
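As a crude stand-in for the TextRank-style extractive methods listed above (scoring by word frequency rather than by graph centrality, and with naive regex-based sentence splitting), extractive summarization can be sketched as:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Pick the n highest-scoring sentences, scored by average word frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Preserve the original sentence order in the output.
    return " ".join(s for s in sentences if s in top)

doc = ("Text analytics extracts insight from text. "
       "Text mining of text data uses text algorithms. "
       "The weather was pleasant.")
print(extractive_summary(doc, 1))
```

Real extractive methods (TextRank, Luhn, LSA) replace the frequency score with smarter sentence-importance estimates, but the select-and-reassemble structure is the same.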
These algorithms can be implemented in open-source languages like Python and R.
The NLP processing solutions are deployed using Python-based web frameworks like Django and Flask. After deployment, the solution is accessible via a URL, where the user can feed in text data of their choice to analyze and make sense of. It can also be set up so that all the data for a day (or week, or month) is analyzed together, with topic extraction, summarization, keyword extraction, etc. run over the whole batch.
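To make this concrete, here is a minimal Flask sketch of such a deployment; the `/keywords` route and the naive frequency-based keyword extractor are hypothetical stand-ins for a production NLP routine:

```python
from collections import Counter
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/keywords", methods=["POST"])
def keywords():
    """Accept JSON {"text": ...} and return the five most frequent longer words."""
    text = request.get_json(force=True).get("text", "")
    words = [w.lower().strip(".,!?") for w in text.split() if len(w) > 3]
    top = [w for w, _ in Counter(words).most_common(5)]
    return jsonify({"keywords": top})

# To serve locally: app.run()  (listens on http://127.0.0.1:5000 by default)
```

A client would then POST raw text to the URL and receive the extracted keywords as JSON; swapping the stub for a summarizer or topic model changes only the function body, not the deployment pattern.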
We have deployed several web solutions using Django and Flask.
We have also deployed a Django/Flask web solution that performs Principal Component Analysis and clustering on arbitrary data.
We can deploy all of the above both on the cloud and on on-premise machines.