Web scraping has become a fairly routine task now that so many ready-made solutions exist, especially since many of them require almost no knowledge of web programming or data analysis. This tutorial will develop a browser-based app that crawls the web and produces word cloud charts.
You use custom extractors to integrate with a PostgreSQL database. You’re familiar with the concept. You understand what “site-specific” means.
But what does this mean in practice? What exactly is an extractor, and why is it a good way to structure your code?
A fast way to get to a MySQL database
Web applications often have databases scattered around the server. You want to get to a specific database for certain types of queries. For instance, a plain query for blog posts might be fine, but a query that selects blog posts based on certain data may not be. An extractor allows you to define one query that returns a specific data set.
Imagine you had a PostgreSQL table for people who had visited a particular blog. You may have a person.blog_visits table, but the database on your server is your own: a PostgreSQL database that you have access to on your local machine.
(Sometimes this is called “migrating data from MySQL to PostgreSQL,” but we’ll ignore that here.)
Now, your blog has a dashboard for people who have visited your site. And you also have a “People Who Visited This Blog” dashboard.
For the second dashboard, you could use plain SQL: a CREATE TABLE statement, followed by INSERT INTO table (first_name, last_name) SELECT first_name, last_name FROM blog_visits, where each visit is identified by (id, blog_id). But you want that table to hold just the information for people who visited your blog, not the whole thing.
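As a minimal sketch of the copy step described above, the following uses Python's built-in sqlite3 module. SQLite in memory is only a stand-in for the PostgreSQL or MySQL server, and the people_who_visited table name, the sample rows, and the copy_visitors helper are assumptions made for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE blog_visits (
        id INTEGER PRIMARY KEY,
        blog_id INTEGER NOT NULL,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL
    );
    -- Hypothetical target table holding only the visitor names.
    CREATE TABLE people_who_visited (
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO blog_visits (blog_id, first_name, last_name) VALUES (?, ?, ?)",
    [(1, "Ada", "Lovelace"), (2, "Alan", "Turing"), (1, "Ada", "Lovelace")],
)


def copy_visitors(conn, blog_id):
    """Refill people_who_visited with the distinct visitors of one blog."""
    conn.execute("DELETE FROM people_who_visited")
    conn.execute(
        """
        INSERT INTO people_who_visited (first_name, last_name)
        SELECT DISTINCT first_name, last_name
        FROM blog_visits
        WHERE blog_id = ?
        """,
        (blog_id,),
    )
    conn.commit()
    return conn.execute(
        "SELECT first_name, last_name FROM people_who_visited ORDER BY last_name"
    ).fetchall()


visitors = copy_visitors(conn, 1)
```

The WHERE clause is what keeps the new table down to the visitors of a single blog rather than the whole thing, and the parameter placeholder keeps the blog id out of the SQL string.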
1. High-level metadata: mass-scale crawls
Generic extractors are used for high-level metadata when you have an ongoing, large-scale crawling need. The low level of computation they require can be an order of magnitude more efficient when compared to, for instance, Google Books. In many cases, you will also need to expand or augment the data that you’ve collected, and the kind of collection and methodology you use to handle this will differ significantly from case to case.
If you haven’t done so already, consider changing your Content-Based Address Translation (CAT) strategy to a Personalized Address Translation (PAT) one – either because of the benefit of avoiding link rot or because you prefer that style of querying. But don’t let the final result affect your code decisions – write your code in a way that will allow you to upgrade your language resources in the future.
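To make the idea of a generic, low-computation extractor concrete, here is a rough sketch that pulls only the page title and meta tags out of a page with Python's standard html.parser module. The MetadataExtractor class, the extract_metadata helper, and the sample page are assumptions for illustration, not part of any particular crawling framework.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collects the <title> text and <meta> name/property pairs of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Accept both <meta name=...> and Open Graph style <meta property=...>.
            name = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return a flat dict of high-level metadata for one page."""
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


sample = (
    "<html><head><title> My Blog </title>"
    '<meta name="description" content="Posts about scraping">'
    '<meta property="og:type" content="article">'
    "</head><body><p>Hi</p></body></html>"
)
info = extract_metadata(sample)
```

Because the same extractor works on any page that follows common HTML conventions, it needs no site-specific rules, which is what makes it a reasonable fit for a mass-scale crawl.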
Serialization for object-oriented languages
If you are writing your application in an object-oriented language, or designing an MVP architecture with one of the many object-oriented frameworks, your choice of serialization strategy might not be based on your architecture’s functional requirements. Instead, you might want to include a serialization component in your application that follows directly from its object-oriented nature.
For instance, a web framework might be divided into multiple components that must be serialized and deserialized in order to work correctly – they might implement validators, validation filters, session-specific events etc. This may cause complications for serialization strategies that rely on only pure SQL.
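As a small illustration of a component that carries its own validator through serialization and deserialization, consider the sketch below. The BlogVisit class and its fields are assumptions loosely based on the blog_visits example earlier, and JSON is just one possible wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BlogVisit:
    """Hypothetical entity with a validator that plain SQL would not know about."""

    id: int
    blog_id: int
    first_name: str
    last_name: str

    def __post_init__(self):
        # Runs on every construction, so it also runs on every deserialization.
        if self.id < 0 or self.blog_id < 0:
            raise ValueError("ids must be non-negative")
        if not self.first_name or not self.last_name:
            raise ValueError("names must be non-empty")

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


visit = BlogVisit(id=1, blog_id=7, first_name="Ada", last_name="Lovelace")
payload = visit.to_json()
restored = BlogVisit.from_json(payload)
```

Since from_json goes through the normal constructor, invalid data is rejected at the boundary instead of leaking into the rest of the application.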
In the end, you will have to make a conscious decision on whether or not to serialize your entities at all. Sometimes, you will not be able to avoid it. Serialization has to happen somewhere, and you might as well take advantage of it where you can to move away from a “conventional” pattern.
In most object-oriented languages, serialization is handled by an adapter built around a special-purpose type of serialization. Unfortunately, this approach often has many problems: it can’t simply be imported into other projects because of its incompatibilities; its complexity, together with poor consistency and maintainability, makes it impractical to use on a day-to-day basis; and its performance can be abysmal.
An adapter’s side effects can also interfere with the application’s business logic, and they are difficult to change. As such, it is not a suitable serialization strategy for most applications.
However, there are cases where adapters may be a useful strategy, e.g., when software interfaces with other software and both sides require a high level of interoperability. Such adapters are common in some enterprise software packages.
For web applications and other “conventional” applications, the adapters’ design can be very tightly integrated into the application architecture. In other words, they are the application’s APIs. Sometimes they’re not so special – they might be an internal API, implemented within a single component or a plugin.
Where possible, you can still encapsulate the problems of serialization and deserialization in your app by specifying the adapter’s interface within the application itself. For deserialization, this is the same thing you would do with the application’s native interface. However, this level of encapsulation can become more complex.
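One way to picture this encapsulation is an adapter interface that the application declares itself, with concrete formats plugged in behind it. The sketch below assumes Python 3.9 or later; SerializerAdapter, JsonAdapter, KeyValueAdapter, and VisitStore are hypothetical names invented for this example.

```python
import json
from typing import Any, Protocol


class SerializerAdapter(Protocol):
    """Interface the application specifies; any format can implement it."""

    def dumps(self, record: dict[str, Any]) -> str: ...

    def loads(self, text: str) -> dict[str, Any]: ...


class JsonAdapter:
    def dumps(self, record):
        return json.dumps(record, sort_keys=True)

    def loads(self, text):
        return json.loads(text)


class KeyValueAdapter:
    """Line-based key=value format; note that values come back as strings."""

    def dumps(self, record):
        return "\n".join(f"{key}={value}" for key, value in sorted(record.items()))

    def loads(self, text):
        return dict(line.split("=", 1) for line in text.splitlines() if line)


class VisitStore:
    """Application code depends only on the interface, not on a format."""

    def __init__(self, adapter: SerializerAdapter):
        self._adapter = adapter
        self._rows: list[str] = []

    def save(self, record):
        self._rows.append(self._adapter.dumps(record))

    def load_all(self):
        return [self._adapter.loads(row) for row in self._rows]


record = {"first_name": "Ada", "last_name": "Lovelace"}
json_store = VisitStore(JsonAdapter())
json_store.save(record)
kv_store = VisitStore(KeyValueAdapter())
kv_store.save(record)
```

Swapping the adapter changes the stored format without touching VisitStore, but the key=value variant already shows the added complexity: it loses type information, so non-string values would need extra handling.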
2. Detailed datasets: low/mid-scale crawls
Anybody who depends on site-specific data extraction, even when it runs from a fully centralized setup, will be using the same kind of framework. There are many out there, each with its own strengths and weaknesses. But extracting the content of the data from the site is another question: several techniques are required.
Some are “ground up” techniques that involve developing a process to extract information from large collections of data in various formats. Others look for patterns, compare time series data, or combine data with behavioral patterns. Extracting the content of data from a web page in multiple formats is also useful.
The Common Access
Text processing (using a string processor or text editor) is the core of almost any content extraction technique. It’s relatively easy to write a text processing program that works on various data types, some of which are not commonly available.
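A short sketch of what such text processing can look like, using only regular expressions from Python's standard library. The to_text and find_dates helpers, the ISO-style date pattern, and the sample markup are assumptions for illustration, and stripping tags with a regular expression is a crude shortcut rather than a robust HTML parser.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # anything that looks like a tag
WS_RE = re.compile(r"\s+")               # runs of whitespace
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # dates such as 2021-03-04


def to_text(markup):
    """Replace tags with spaces, then collapse whitespace."""
    return WS_RE.sub(" ", TAG_RE.sub(" ", markup)).strip()


def find_dates(text):
    """Return every ISO-style date found in the text, in order."""
    return DATE_RE.findall(text)


sample = "<p>Posted on <b>2021-03-04</b>\n and updated 2021-04-01</p>"
text = to_text(sample)
dates = find_dates(text)
```

The same two steps, normalizing the input to plain text and then matching patterns against it, carry over to most other formats you might feed into an extractor.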
This is important to note when thinking about content extraction because the information that is actually extracted might not be that interesting. For example, the content of a web page containing a webmail login form isn’t going to be that interesting, even if you can extract it. After all, you’re probably just trying to extract the login form, not the contents of the email.
That said, even if you’re extracting a single piece of data, such as the body of an email, you might want to put some effort into making sure that you’re extracting the relevant information.
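For the email example, a minimal sketch with Python's standard email package might look like the following. It assumes you already have the raw message as a string; the extract_body helper and the sample message are made up for illustration.

```python
from email import message_from_string
from email.policy import default


def extract_body(raw):
    """Return the plain-text body of a raw email, or "" if there is none."""
    msg = message_from_string(raw, policy=default)
    # Ask only for the text/plain part and ignore the HTML alternative.
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    return part.get_content().strip()


raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "See you at the meetup.\n"
    "--XYZ\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    "<p>See you at the <b>meetup</b>.</p>\n"
    "--XYZ--\n"
)
body = extract_body(raw)
```

Choosing the plain-text part explicitly is the "effort" in question here: the HTML alternative carries the same message wrapped in markup you would otherwise have to strip again.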