Probably the most common technique traditionally used to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated, as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping, it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching, such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those who don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag), you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites, the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change, you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application, the amount of time it takes to scrape sites is significantly lower than with other methods.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may mean learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree), you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data, how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page, you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently, you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract the individual pieces we're after. Once the data has been extracted, we then insert it into a database.

Todd Wilson is the owner of ( a company which specializes in data extraction from web pages.
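
As an illustration of the raw regular expression approach discussed in this article (matching URLs and link titles), here is a minimal sketch in Python. It isn't taken from any particular product. The sample HTML, the example.com URLs, and the names `LINK_PATTERN`, `TAG_PATTERN`, and `extract_links` are all assumptions made up for the example.

```python
import re

# Hypothetical sample page; in practice this would be HTML fetched from a site.
html = """
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><A HREF='http://example.com/news/2' class="story">Second <b>headline</b></A></li>
</ul>
"""

# Matches an anchor tag, capturing the quoted href value and the link body.
# IGNORECASE tolerates <A HREF=...>; DOTALL lets the body span several lines.
LINK_PATTERN = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["'][^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

# Matches any tag, used to strip markup such as <b> or <font> from a title.
TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_links(page):
    """Return a list of (url, title) pairs found in the given HTML text."""
    links = []
    for url, raw_title in LINK_PATTERN.findall(page):
        title = TAG_PATTERN.sub("", raw_title)  # drop nested tags
        title = " ".join(title.split())         # collapse runs of whitespace
        links.append((url, title))
    return links


for url, title in extract_links(html):
    print(url, "|", title)
```

The pattern shows the kind of "fuzziness" mentioned above: it tolerates upper-case tags, either quote style, extra attributes, and tags nested inside the link text. It also shows the fragility: an unquoted `href` value, for instance, would slip past it and the expression would need updating.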