Book Description
Despite all the advancements in web APIs and interoperability, it's inevitable that, at some point in your career, you will have to "scrape" content from a website that was not built with web services in mind. And, despite its sometimes less-than-stellar reputation, web scraping is usually an entirely legitimate activity: for example, capturing data from an old version of a website for insertion into a modern CMS.
This book, written by scraping expert Matthew Turland, covers web scraping techniques and topics that range from the simple to the exotic, using a variety of technologies and frameworks:
- Understanding HTTP requests
- The PHP HTTP streams wrapper
- cURL
- pecl_http
- PEAR::HTTP
- Zend_Http_Client
- Building your own scraping library
- Using Tidy
- Analyzing code with the DOM, SimpleXML and XMLReader extensions
- CSS selector libraries
- PCRE pattern matching
- Tips and Tricks
- Multiprocessing / parallel processing
Book Details
- Paperback: 192 pages
- Publisher: Marco Tabini & Associates, Inc. (August, 2010)
- Language: English
- ISBN-10: 0981034519
- ISBN-13: 978-0981034515
- File Size: 5.7 MiB