Web Data Discovery and Sourcing Approaches for Text Analytics

Today, the World Wide Web has become the widest, deepest and most important information source. The content available on the web can be broadly categorized into two types: text and media (images, video and audio). Browsers and search engines make this content accessible to users doing ad hoc search and analysis. For consistent and automated use of this data source, however, the information on the web needs to be converted into a knowledgebase, and that is a complicated proposition. Text analytics, together with data discovery and sourcing, can be used to convert this vast body of text data into a knowledgebase.

Text analytics is the process of deriving information from text by employing principles and techniques of analysis methodologies like natural language processing (NLP). Text is an unstructured form of data and ordinarily requires human intelligence to derive meaning from reading it. The goal of text analytics is to discover relevant information that may not be directly stated, and instead is hidden and must be derived by understanding and relating the context and other peripheral information.

Natural language processing performs the linguistic analysis that helps a machine read text. It uses linguistic features, dictionaries, patterns/models and ontologies to derive structured data points, relations and interpretations.
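
As an illustration (not part of any specific pipeline discussed here), a minimal sketch of this kind of structured extraction using the open-source spaCy library might look like the following. The pipeline name en_core_web_sm and the sample sentence are illustrative assumptions; the model must be installed beforehand (python -m spacy download en_core_web_sm).

```python
import spacy

# Load a small pretrained English pipeline (tokenizer, tagger, parser, NER).
nlp = spacy.load("en_core_web_sm")

text = "BizKonnect, headquartered in Pune, builds sales intelligence solutions."
doc = nlp(text)

# Named entities: structured data points derived from unstructured text.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. a place name tagged as GPE

# Dependency relations: how words in the sentence relate to one another.
for token in doc:
    print(token.text, token.dep_, token.head.text)
```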

Most traditional text analytics systems provide data hooks and stop there. They assume that users have, and will provide, the data corpus from which the intelligence needs to be derived. In practice, however, we have to discover and source that data from the web ourselves before feeding it into the text analytics system to create the knowledgebase.

The web data discovery and sourcing problem is multifold, ranging from intellectual property ownership and control to volume, ethics, precision and authenticity. A knowledgebase should, at the very least, draw on select resources which assure a degree of credibility, authenticity, repeatability, reliability and quality. Just because something is on the web doesn’t necessarily mean it’s easy to find. Most of the time, we know what we want but don’t know where to find it or how we can use it.

Search engines come to the rescue of users doing a manual search of data on the web. However, preparing a knowledgebase of the web is a humongous task and requires a robust, highly controlled, programmed and automated approach to data discovery. This is where the problems start. Commercial search engine providers have, over the years, developed strategies which discourage automated and programmed use of their search services. Search engines offer only the URL, title and a two-line page snippet to their API users, and they do not permit using this data for further machine learning purposes. A knowledgebase essentially requires much more than this, namely the whole page content, which must be acquired explicitly and directly from the source.

Downloading the URLs discovered through search engines raises ethical considerations: content that search engines are allowed to download may not be allowed for the users of those search engines to download. Adhering to site policies while ethically downloading such content at high volume needs a sophisticated approach. Building and running such systems in their entirety carries huge cost, effort and resource considerations. We are also dealing with large datasets, typically categorized as Big Data, that cannot be processed using traditional computing techniques, and we require an infrastructure that can manage and process such huge volumes of data.
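
To make the site-policy point concrete, here is a minimal sketch, using only Python’s standard library, of checking a site’s robots.txt before fetching a page. The bot name, delay and URL are illustrative placeholders rather than production values.

```python
import time
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleKnowledgebaseBot/1.0"  # illustrative crawler identity

def polite_fetch(url, delay_seconds=2.0):
    """Fetch a URL only if the site's robots.txt allows our user agent."""
    parts = urllib.parse.urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    # Parse the site's robots.txt to learn what we may crawl.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(robots_url)
    robots.read()

    if not robots.can_fetch(USER_AGENT, url):
        return None  # disallowed by site policy; skip the page ethically

    time.sleep(delay_seconds)  # crude rate limiting so we don't overload the site
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return response.read()

page = polite_fetch("https://example.com/some-page")  # placeholder URL
```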

The following options can be used for data discovery and sourcing:

  • Open source crawled data: The open source community has shared petabytes of crawled data for such purposes. The data is periodically updated and is available either for free download or through pointers to its AWS-hosted copies. Such datasets carry a URL index and are hence searchable if you already know where the data is located. If we don’t know where the data is but know what we are looking for, we need to custom index the crawl data on its content and provide a facility to search that index by content keywords. These datasets have done quite a lot of the groundwork and can be leveraged for a quick start. However, we won’t have much control over what new content is discovered and when it is made available for use; if certain content is not available, we are left at the mercy of their crawlers to discover it for us. A minimal index-query sketch follows this list.

  • Custom crawlers: Custom crawlers give full control over how, what and when the data is acquired. What needs to be crawled depends on the business requirements: it can be as wide as the full web or as deep as concentrated verticals. Authenticity, accuracy and ranking of these sources improve as the historical references grow over time. Such crawlers should ethically adhere to site policies. A toy crawler skeleton also appears after this list.
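
To make the first option concrete: the Common Crawl project is one well-known example of openly shared crawl data with a public URL index. A minimal sketch of querying its CDX index API might look like the following; the crawl label and domain are illustrative, and the third-party requests library is assumed.

```python
import json
import requests

# Query the Common Crawl CDX index for captures matching a URL pattern.
# The crawl label below is illustrative; current labels are listed at
# https://index.commoncrawl.org/.
INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

params = {"url": "example.com/*", "output": "json"}
response = requests.get(INDEX_URL, params=params, timeout=30)
response.raise_for_status()

# The API returns one JSON record per line; each record tells us which
# WARC archive file holds the page and at what byte offset, so the raw
# content can be fetched from the public crawl data without re-crawling.
for line in response.text.splitlines():
    record = json.loads(line)
    print(record["url"], record["filename"], record["offset"], record["length"])
```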

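A custom crawler, at its core, is a frontier of discovered URLs plus a fetch-and-extract loop. The toy sketch below shows that skeleton; the seed URL, page limit and delay are illustrative, and requests plus BeautifulSoup are assumed. A production version would add the robots.txt check from the earlier sketch, proper deduplication and error handling.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlsplit

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=50, delay_seconds=2.0):
    """Breadth-first crawl from a seed URL, staying on the seed's domain."""
    domain = urlsplit(seed).netloc
    frontier = deque([seed])  # URLs discovered but not yet fetched
    pages = {}                # url -> extracted plain text

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in pages:
            continue  # already fetched via another path
        time.sleep(delay_seconds)  # crude politeness delay
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable pages in this toy version

        soup = BeautifulSoup(response.text, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)

        # Enqueue same-domain links for later fetching.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlsplit(link).netloc == domain and link not in pages:
                frontier.append(link)

    return pages

corpus = crawl("https://example.com/")  # placeholder seed URL
```
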
Feeding the discovered and sourced data into text analytics opens up endless possibilities for data, information and semantic extraction from the raw sources, and creates a knowledgebase. This knowledgebase of the web can then be used to gain valuable, actionable insights and business intelligence. We, at BizKonnect, leverage all these techniques to build our Actionable Sales Intelligence Solution. The challenges and solutions described above come from our own experience of building this product.