Screen scraping and web scraping techniques are powerful allies of e-commerce stores. They provide a quick way to collect and compare prices or content from other sites, whether suppliers or competitors. This becomes more important the more products a store carries and the more complete it wants its published information and its competitor analysis to be.
Data Gathering To Monitor E-commerce Competitors
In these stores, routine maintenance tasks often exceed what a small team can handle manually, as can happen in stores with large inventories (imagine a grocery or gadget store).
But it can also happen that stores with small or very specific inventories want to keep a large number of competitors “in check.” Visiting their stores and sales every day and comparing prices could be a near-impossible mission, so resorting to these techniques can be part of the solution, or at least a welcome extra help.
Both terms refer to the way in which information is collected. Screen scraping is the older form: it targets systems, in-house or external, that are usually old and generally run on terminal or “text” screens. Specially designed software can capture that data and translate it into more manageable formats, such as databases or spreadsheets.
Although that term is still used today, it is now more common to speak of web scraping, since all that information is on the web. The same programming techniques (requesting web pages and analyzing their content) can be used to obtain the data.
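The "request a page, then analyze its content" idea can be sketched with Python's standard library alone. This is a minimal illustration, not a production scraper: the HTML fragment, the class names (product, name, price) and the ProductParser helper are all hypothetical, and a real page would first have to be downloaded with an HTTP client.

```python
from html.parser import HTMLParser

# Hypothetical product listing; a real scraper would download the page first.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">USB cable</span><span class="price">4.99</span></li>
  <li class="product"><span class="name">Phone case</span><span class="price">12.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects (name, price) pairs from spans tagged with class attributes."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished (name, price) tuples
        self._field = None   # "name" or "price" while inside such a span
        self._current = {}   # fields gathered for the product in progress

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = {}  # a new product starts
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current.keys() >= {"name", "price"}:
            self.products.append(
                (self._current["name"], float(self._current["price"]))
            )


def extract_products(html):
    """Parse an HTML string and return the products found in it."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


print(extract_products(SAMPLE_HTML))
```

The parser depends entirely on the page's markup, so a change in the site's HTML would require a change in the code.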
But It’s Not Easy For Beginners
Good scraping is not a trivial task. In fact, it usually requires specialized programmers and tools to do it successfully. Even if the information is on public web pages that can be accessed almost automatically, it must be “scraped” item by item before it can be processed. And this is not always easy: although it looks simple visually, the internal HTML code of the web pages can facilitate the task … or do quite the opposite.
For this reason, many stores and aggregators rely on calls to APIs (application programming interfaces), which are basically a way for machines to talk to each other. For example, Amazon has APIs where any software or “bot” can check the names of books, toys and other products, their prices and descriptions, get the photos and more.
Many airlines also provide APIs to check the price of their tickets. Companies such as Amazon or the airlines do this because they supply the information to partner websites and stores that recommend or resell their products for a commission, so they want to make it easy for you. Generally, you only need to register as a developer to access these APIs; sometimes you have to pay if your daily query volume is high.
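To illustrate why an API is easier to consume than a web page, here is a minimal sketch that reads a structured response. The JSON layout and field names (items, title, price, amount, currency) are invented for the example and do not correspond to Amazon's or any airline's real API.

```python
import json

# Hypothetical API response body; real APIs define their own field names.
SAMPLE_RESPONSE = """
{"items": [
  {"title": "Example Book", "price": {"amount": 12.99, "currency": "USD"}},
  {"title": "Example Toy", "price": {"amount": 7.5, "currency": "USD"}}
]}
"""


def parse_catalog(body):
    """Turn a JSON response into a list of (title, amount, currency) tuples."""
    payload = json.loads(body)
    return [
        (item["title"], item["price"]["amount"], item["price"]["currency"])
        for item in payload["items"]
    ]


for title, amount, currency in parse_catalog(SAMPLE_RESPONSE):
    print(f"{title}: {amount} {currency}")
```

Because the data arrives already labeled, no HTML analysis is needed.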
When a store or website does not offer an API for any reason, specific applications can still be used to extract the data: software and web services such as Import.io, WebHose.io, Scrapinghub and others launch their bots to collect the requested information from the required sites and generate files in .CSV format (which spreadsheets can open). This usually requires selecting and marking the desired data, and it works best when the site to be analyzed is well structured.
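As a rough idea of what such a CSV export contains, the sketch below serializes a few scraped records with Python's csv module. The records, site names and the rows_to_csv helper are hypothetical, and this is not how any of the named services work internally.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical scraped records from two competitor sites.
scraped = [
    {"site": "shop-a.example", "product": "USB cable", "price": 4.99},
    {"site": "shop-b.example", "product": "USB cable", "price": 5.25},
]

csv_text = rows_to_csv(scraped, ["site", "product", "price"])
print(csv_text)
```

The resulting text can be saved with a .csv extension and opened in any spreadsheet application.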
Use Cases In Brief
At an experimental level, and for casual use, even Google Docs has a tool in its spreadsheets (the ImportXML function) that allows data to be extracted from any reasonably well-structured web page. (There is a tutorial on this on the Ben L. Collins website: Google Sheets as a basic Web Scraper).
There are also those who use these techniques to extract information from comments and product reviews or from social networks, for example to monitor mentions of a certain brand or to extract user profiles and analyze them demographically.
The Techniques Are Not Always Effective
These techniques cannot always be applied, as there are websites that, for various reasons, do not want data to be automatically extracted from their catalogs. They sometimes use code obfuscation techniques (making the code difficult for robots to read while the page stays visible to humans). More generally, they use the robots.txt file to tell outside bots which automatic or bulk requests are not welcome.
There is nothing to do in these cases: “Internet etiquette” obliges those who program bots to respect the wishes of sites that do not want their information accessed automatically or “aggressively”.
And legally, ignoring those wishes would in all probability mean getting into trouble, just as it would to use web scraping to collect profiles and personal information in bulk and later send emails or messages by any means without the authorization of the recipients.
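A well-behaved bot can check a site's robots.txt before requesting anything. The sketch below uses Python's standard urllib.robotparser on an inline example; the robots.txt content, the bot name price-bot and the shop.example URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks all bots to stay out of the catalog.
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalog/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "price-bot", "https://shop.example/catalog/item-1"))
print(is_allowed(ROBOTS_TXT, "price-bot", "https://shop.example/about"))
```

With these rules the first check prints False and the second prints True. Against a live site, RobotFileParser can instead be pointed at the real file with set_url() and read().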
Thanks to the bots that perform web scraping (in the absence of APIs), it is often possible to obtain the complete catalogs of different providers and unify them into one. It is also possible to compare the prices of those providers or of the competition, adjust the store's own prices almost in real time, and learn quickly of price drops, increases or specific offers, all of which helps keep the store's prices competitive.
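Once prices have been collected, the comparison itself is simple. The following sketch, with invented product names and prices and two hypothetical helpers, shows how price changes between two scraping runs and products where a competitor is cheaper might be detected.

```python
def price_changes(previous, current):
    """Return {product: (old_price, new_price)} for products whose price changed."""
    return {
        product: (previous[product], new_price)
        for product, new_price in current.items()
        if product in previous and previous[product] != new_price
    }


def undercut_products(own_prices, competitor_prices):
    """Return the sorted names of products the competitor sells for less."""
    return sorted(
        product
        for product, price in own_prices.items()
        if product in competitor_prices and competitor_prices[product] < price
    )


# Hypothetical data: two scraping runs of a competitor and the store's own prices.
yesterday = {"USB cable": 5.25, "Phone case": 12.50}
today = {"USB cable": 4.75, "Phone case": 12.50, "Charger": 19.00}
own = {"USB cable": 4.99, "Phone case": 11.90}

print(price_changes(yesterday, today))
print(undercut_products(own, today))
```

Here the competitor's USB cable dropped from 5.25 to 4.75 and now undercuts the store's 4.99, which is the kind of signal that would prompt a price adjustment.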