Your ultimate packaging resource


Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping, it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data, how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Todd Wilson is the owner of ( a company which specializes in data extraction from web pages.
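
As a small illustration of the raw regular expression approach discussed in this article, here is a minimal sketch in Python (the article itself mentions Perl, Java, and Active Server Pages, so the language is a substitution). It pulls URLs and link titles out of a chunk of HTML. The sample markup, the pattern, and the `extract_links` helper are all hypothetical and made up for this example; a real job would need to fetch the page first and would likely need a pattern tuned to the site in question.

```python
import re

# Hypothetical sample page; a real script would fetch the HTML first.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><A HREF='http://example.com/news/2' class="x">Second
      headline</A></li>
</ul>
"""

# One pattern that tolerates either quote style, extra attributes,
# and upper- or lower-case tag names (the "fuzziness" described above).
LINK_PATTERN = re.compile(
    r"""<a\s+[^>]*?href\s*=\s*["']([^"']+)["'][^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return a list of (url, title) pairs for each anchor tag in html."""
    links = []
    for url, title in LINK_PATTERN.findall(html):
        # Collapse runs of whitespace so multi-line titles read cleanly.
        links.append((url, " ".join(title.split())))
    return links


if __name__ == "__main__":
    for url, title in extract_links(SAMPLE_HTML):
        print(url, "->", title)
```

Note that if a site wrapped its link text in a new "font" tag, that markup would show up inside the captured title, which is the kind of change that forces the pattern to be updated.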


