Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping, it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:
- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.
- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.
- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).
- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:
- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.
- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.
- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:
- You create it once and it can more or less extract the data from any page within the content domain you're targeting.
- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:
- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.
- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:
- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.
- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:
- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.
- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Todd Wilson is the owner of a company which specializes in data extraction from web pages.
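
For readers who want to see what the raw regular expressions approach looks like in practice, here is a minimal sketch in Python that pulls URLs and link titles out of a page. It is only an illustration, not how any particular product works: the sample HTML, the pattern, and the function name extract_links are all assumptions made up for this example. The pattern is deliberately a little "fuzzy" (either quote style, extra attributes, mixed-case tags), but it would still need updating if the markup changed more substantially.

```python
import re

# Hypothetical sample page; a real scrape would first fetch this over HTTP.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><A HREF='http://example.com/news/2' class="x">Second   headline</A></li>
</ul>
"""

# Illustrative pattern: tolerates either quote style, extra attributes,
# and mixed-case tag names. Group 1 is the URL, group 2 is the link text.
LINK_PATTERN = re.compile(
    r"""<a\s+[^>]*?href\s*=\s*["']([^"']+)["'][^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (url, title) pairs, one per anchor tag matched."""
    links = []
    for url, title in LINK_PATTERN.findall(html):
        # Collapse runs of whitespace inside the link text.
        links.append((url, " ".join(title.split())))
    return links


if __name__ == "__main__":
    for url, title in extract_links(SAMPLE_HTML):
        print(url, "-", title)
```

Note that this handles only the extraction step. The data discovery step (getting to the right page, dealing with cookies and such) would still have to be written separately.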