Probably the most common technique used traditionally to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:
- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.
- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.
- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).
- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:
- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.
- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.
- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.
- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:
- You create it once and it can more or less extract the data from any page within the content domain you're targeting.
- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).
- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:
- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.
- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.
- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:
- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.
- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.
- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:
- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.
- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.
- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Todd Wilson is the owner of ( a company which specializes in data extraction from web pages.
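To make the raw regular expression approach a bit more concrete, here is a minimal sketch in Python of pulling URLs and link titles out of a page. The pattern, the `extract_links` helper, and the sample HTML are all made up for illustration; this is not code from screen-scraper. The pattern is deliberately loose (any attribute order, either quote style, any case) so that a small change such as a new "font" tag inside the link doesn't break it.

```python
import re

# Loose pattern for anchor tags: tolerates extra attributes, single or
# double quotes around the href value, and upper- or lower-case tag names.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

# Used to strip any markup nested inside the link text (e.g. a "font" tag).
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return a list of (url, title) pairs found in the given HTML string."""
    links = []
    for url, inner in LINK_RE.findall(html):
        # Drop nested tags, then collapse runs of whitespace in the title.
        title = " ".join(TAG_RE.sub("", inner).split())
        links.append((url, title))
    return links


# Hypothetical page fragment; the second link has changed markup.
SAMPLE_HTML = """
<ul>
  <li><a href="/news/1">First headline</a></li>
  <li><A class="x" HREF='/news/2'><font color="red">Second
      headline</font></A></li>
</ul>
"""

if __name__ == "__main__":
    for url, title in extract_links(SAMPLE_HTML):
        print(url, title)
```

Even this small pattern hints at the downside mentioned above: it is already hard to read at a glance, and it still won't cope with every way a link can be written (an unquoted href value, for instance).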
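The "number of bedrooms" problem from the classified ads project can be illustrated with a toy example. The sketch below is not the ontology code described in the article; it is only a flat, hand-written vocabulary (the `BEDROOM_TERMS` and `NUMBER_WORDS` names and their contents are assumptions) that maps a handful of spellings onto one field. A real ontology would cover far more variants and relate many fields to each other.

```python
import re

# Hypothetical vocabulary: surface forms that all mean "bedrooms".
# Longer forms come first so the alternation prefers them.
BEDROOM_TERMS = ["bedrooms", "bedroom", "bdrms", "bdrm",
                 "beds", "bed", "brs", "br", "bd"]

# Hypothetical mapping for counts that are spelled out in words.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

_terms = "|".join(BEDROOM_TERMS)
_numbers = r"\d+|" + "|".join(NUMBER_WORDS)

# A count (digits or a number word), an optional hyphen or space,
# then one of the known bedroom terms.
BEDROOM_RE = re.compile(
    rf"\b({_numbers})\s*-?\s*(?:{_terms})\b",
    re.IGNORECASE,
)


def bedrooms(ad_text):
    """Return the bedroom count mentioned in an ad, or None if none is found."""
    match = BEDROOM_RE.search(ad_text)
    if match is None:
        return None
    raw = match.group(1).lower()
    return int(raw) if raw.isdigit() else NUMBER_WORDS[raw]


if __name__ == "__main__":
    for ad in ["Sunny 3BR apt near park",
               "2 bdrm house, fenced yard",
               "Charming three-bedroom home",
               "Garage for rent"]:
        print(ad, "->", bedrooms(ad))
```

The point of the sketch is the shape of the problem: the extraction step works on a raw chunk of ad text and doesn't care how that text was found, which is why the data discovery step can be handled by a separate tool.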