Probably the most common technique used traditionally to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping, it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:
- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.
- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.
- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).
- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:
- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.
- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.
- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:
- You create it once and it can more or less extract the data from any page within the content domain you're targeting.
- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:
- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.
- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:
- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application, the amount of time it requires to scrape sites vs. other methods is significantly lowered.
- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:
- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.
- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data, how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Todd Wilson is the owner of a company which specializes in data extraction from web pages.
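
To make the raw regular expression approach a bit more concrete, here is a minimal sketch in Python of pulling URLs and link titles out of a page. It is only an illustration under stated assumptions: the sample HTML, the `example.com` addresses, and the names `SAMPLE_HTML`, `LINK_PATTERN`, `TAG_PATTERN`, and `extract_links` are all hypothetical and do not come from any particular product. The pattern tolerates extra attributes and either quote style, which is the kind of "fuzziness" mentioned earlier, but a larger markup change would still require updating it.

```python
import re

# Hypothetical sample page; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><a class="story" href='http://example.com/news/2'>Second <b>headline</b></a></li>
</ul>
"""

# Match an anchor tag, capturing the href value and the text between the tags.
# Extra attributes and single or double quotes are tolerated.
LINK_PATTERN = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["'][^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

# Used to strip nested markup (e.g. <b>, <font>) out of the link title.
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return a list of (url, title) tuples found in the given HTML string."""
    links = []
    for url, raw_title in LINK_PATTERN.findall(html):
        # Remove nested tags, then collapse runs of whitespace.
        title = " ".join(TAG_PATTERN.sub("", raw_title).split())
        links.append((url, title))
    return links


if __name__ == "__main__":
    for url, title in extract_links(SAMPLE_HTML):
        print(url, "|", title)
```

Even this small example hints at the trade-off: it is quick to write if you already know regular expressions, yet the pattern is not easy to read at a glance, and it does nothing about data discovery (finding and fetching the pages in the first place).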