Web Scraping with Python
Ebook · 322 pages · 2 hours

About this ebook

This book is aimed at developers who want to build reliable solutions to scrape data from websites. It is assumed that the reader has prior programming experience with Python. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principles involved.
Language: English
Release date: Oct 28, 2015
ISBN: 9781782164371

    Book preview

    Web Scraping with Python - Richard Lawson

    Table of Contents

    Web Scraping with Python

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Errata

    Piracy

    Questions

    1. Introduction to Web Scraping

    When is web scraping useful?

    Is web scraping legal?

    Background research

    Checking robots.txt

    Examining the Sitemap

    Estimating the size of a website

    Identifying the technology used by a website

    Finding the owner of a website

    Crawling your first website

    Downloading a web page

    Retrying downloads

    Setting a user agent

    Sitemap crawler

    ID iteration crawler

    Link crawler

    Advanced features

    Parsing robots.txt

    Supporting proxies

    Throttling downloads

    Avoiding spider traps

    Final version

    Summary

    2. Scraping the Data

    Analyzing a web page

    Three approaches to scrape a web page

    Regular expressions

    Beautiful Soup

    Lxml

    CSS selectors

    Comparing performance

    Scraping results

    Overview

    Adding a scrape callback to the link crawler

    Summary

    3. Caching Downloads

    Adding cache support to the link crawler

    Disk cache

    Implementation

    Testing the cache

    Saving disk space

    Expiring stale data

    Drawbacks

    Database cache

    What is NoSQL?

    Installing MongoDB

    Overview of MongoDB

    MongoDB cache implementation

    Compression

    Testing the cache

    Summary

    4. Concurrent Downloading

    One million web pages

    Parsing the Alexa list

    Sequential crawler

    Threaded crawler

    How threads and processes work

    Implementation

    Cross-process crawler

    Performance

    Summary

    5. Dynamic Content

    An example dynamic web page

    Reverse engineering a dynamic web page

    Edge cases

    Rendering a dynamic web page

    PyQt or PySide

    Executing JavaScript

    Website interaction with WebKit

    Waiting for results

    The Render class

    Selenium

    Summary

    6. Interacting with Forms

    The Login form

    Loading cookies from the web browser

    Extending the login script to update content

    Automating forms with the Mechanize module

    Summary

    7. Solving CAPTCHA

    Registering an account

    Loading the CAPTCHA image

    Optical Character Recognition

    Further improvements

    Solving complex CAPTCHAs

    Using a CAPTCHA solving service

    Getting started with 9kw

    9kw CAPTCHA API

    Integrating with registration

    Summary

    8. Scrapy

    Installation

    Starting a project

    Defining a model

    Creating a spider

    Tuning settings

    Testing the spider

    Scraping with the shell command

    Checking results

    Interrupting and resuming a crawl

    Visual scraping with Portia

    Installation

    Annotation

    Tuning a spider

    Checking results

    Automated scraping with Scrapely

    Summary

    9. Overview

    Google search engine

    Facebook

    The website

    The API

    Gap

    BMW

    Summary

    Index

    Web Scraping with Python

    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: October 2015

    Production reference: 1231015

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78216-436-4

    www.packtpub.com

    Credits

    Author

    Richard Lawson

    Reviewers

    Martin Burch

    Christopher Davis

    William Sankey

    Ayush Tiwari

    Acquisition Editor

    Rebecca Youé

    Content Development Editor

    Akashdeep Kundu

    Technical Editors

    Novina Kewalramani

    Shruti Rawool

    Copy Editor

    Sonia Cheema

    Project Coordinator

    Milton Dsouza

    Proofreader

    Safis Editing

    Indexer

    Mariammal Chettiar

    Production Coordinator

    Nilesh R. Mohite

    Cover Work

    Nilesh R. Mohite

    About the Author

    Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he has built a business specializing in web scraping while traveling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational in Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones.

    I would like to thank Professor Timothy Baldwin for introducing me to this exciting field and Tharavy Douc for hosting me in Paris while I wrote this book.

    About the Reviewers

    Martin Burch is a data journalist based in New York City, where he makes interactive graphics for The Wall Street Journal. He holds a master of arts in journalism from the City University of New York's Graduate School of Journalism, and has a baccalaureate from New Mexico State University, where he studied journalism and information systems.

    I would like to thank my wife, Lisa, who encouraged me to assist with this book; my uncle, Michael, who has always patiently answered my programming questions; and my father, Richard, who inspired my love of journalism and writing.

    William Sankey is a data professional and hobbyist developer who lives in College Park, Maryland. He graduated in 2012 from Johns Hopkins University with a master's degree in public policy and specializes in quantitative analysis. He is currently a health services researcher at L&M Policy Research, LLC, working on projects for the Centers for Medicare and Medicaid Services (CMS). The scope of these projects ranges from evaluating Accountable Care Organizations to monitoring the Inpatient Psychiatric Facility Prospective Payment System.

    I would like to thank my devoted wife, Julia, and rambunctious puppy, Ruby, for all their love and support.

    Ayush Tiwari is a Python developer and undergraduate at IIT Roorkee. He has been working at Information Management Group, IIT Roorkee, since 2013, and has been actively working in the web development field. Reviewing this book has been a great experience for him. He did his part not only as a reviewer, but also as an avid learner of web scraping. He recommends this book to all Python enthusiasts so that they can enjoy the benefits of scraping.

    He is enthusiastic about Python web scraping and has worked on projects such as live sports feeds, as well as a generalized Python e-commerce web scraper (at Miranj).

    He has also been handling a placement portal with the help of a Django app to assist the placement process at IIT Roorkee.

    Besides backend development, he loves to work on computational Python and data analysis using Python libraries such as NumPy and SciPy, and is currently working in the CFD research field. You can visit his projects on GitHub. His username is tiwariayush.

    He loves trekking through Himalayan valleys and participates in several treks every year; his other interests include playing the guitar. Among his accomplishments, he is a part of the internationally acclaimed Super 30 group and has also been a rank holder in it. When he was in high school, he also qualified for the International Mathematical Olympiad.

    I have been provided a lot of help by my family members (my sister, Aditi, my parents, and Anand sir), my friends at VI and IMG, and my professors. I would like to thank all of them for the support they have given me.

    Last but not least, kudos to the respected author and the Packt Publishing team for publishing these fantastic tech books. I commend all the hard work involved in producing their books.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    Preface

    The Internet contains the most useful set of data ever assembled, which is largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be extracted to be useful. This process of extracting data from web pages is known as web scraping and is becoming increasingly useful as ever more information is available online.
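
    To make the idea of extracting data embedded in a page's structure concrete, here is a minimal sketch. It is not taken from the book: it assumes Python 3 and uses only the standard library's html.parser module, and the HTML snippet, the ValueExtractor class, and the extract_values function are hypothetical names and data invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this illustration.
SAMPLE_HTML = """
<table>
  <tr id="name"><td class="label">Name</td><td class="value">Exampleland</td></tr>
  <tr id="population"><td class="label">Population</td><td class="value">1,000</td></tr>
</table>
"""


class ValueExtractor(HTMLParser):
    """Collect the text of each <td class="value">, keyed by its row's id."""

    def __init__(self):
        super().__init__()
        self.current_row = None   # id of the <tr> currently being parsed
        self.in_value = False     # True while inside a <td class="value">
        self.values = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.current_row = attrs.get("id")
        elif tag == "td" and attrs.get("class") == "value":
            self.in_value = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_value = False
        elif tag == "tr":
            self.current_row = None

    def handle_data(self, data):
        # Only keep text that sits inside a value cell of an identified row.
        if self.in_value and self.current_row:
            previous = self.values.get(self.current_row, "")
            self.values[self.current_row] = previous + data.strip()


def extract_values(html):
    """Return a dict mapping row id to the text of its value cell."""
    parser = ValueExtractor()
    parser.feed(html)
    parser.close()
    return parser.values


if __name__ == "__main__":
    print(extract_values(SAMPLE_HTML))
```

    The data only becomes reusable once it is pulled out of the markup into a plain structure, here a dictionary. The book's own chapters describe other approaches to this step (regular expressions, Beautiful Soup, and lxml).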

    What this book covers

    Chapter 1, Introduction to Web Scraping, introduces web scraping and explains ways to crawl a website.

    Chapter 2, Scraping the Data, shows you how to extract data from web pages.

    Chapter 3, Caching Downloads, teaches you how to avoid redownloading by caching results.

    Chapter 4, Concurrent Downloading, helps you to scrape data faster by downloading in parallel.

    Chapter 5, Dynamic Content, shows you how to extract data from dynamic websites.

    Chapter 6, Interacting with Forms, shows you how to work with forms to access the data you are after.

    Chapter 7, Solving CAPTCHA, explains how to access data that is protected by CAPTCHA images.

    Chapter 8, Scrapy, teaches you how to use the popular high-level Scrapy framework.

    Chapter 9, Overview, is an overview of web scraping techniques that have been covered.

    What you need for this book

    All the code used in this book has been tested with Python 2.7, and is available for download at http://bitbucket.org/wswp/code. Ideally, in a future version of this book, the examples will be ported to Python 3. However, for now, many of the libraries required (such as Scrapy/Twisted, Mechanize, and Ghost) are only available for Python 2. To help illustrate the crawling examples, we created a sample website at http://example.webscraping.com. This website limits how fast you can download content, so if you prefer to host it yourself, the source code and installation instructions are available at http://bitbucket.org/wswp/places.
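
    Because the sample website limits download speed, a crawler pointed at it would usually pause between requests to the same domain. The following is a rough sketch of that idea, not the book's code: it assumes Python 3, the Throttle class is a hypothetical name, and the clock and sleep parameters are an illustrative design choice so the delay logic can be exercised without real waiting.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Enforce a minimum delay between requests to the same domain (sketch)."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay        # minimum seconds between hits on one domain
        self.clock = clock        # injectable so tests can fake the time
        self.sleep = sleep        # injectable so tests need not really sleep
        self.last_access = {}     # domain -> time of the most recent request

    def wait(self, url):
        """Sleep if this URL's domain was accessed less than `delay` seconds ago."""
        domain = urlparse(url).netloc
        last = self.last_access.get(domain)
        if self.delay > 0 and last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        # Record the access time after any pause.
        self.last_access[domain] = self.clock()


if __name__ == "__main__":
    throttle = Throttle(delay=1.0)
    for path in ("/a", "/b"):
        throttle.wait("http://example.webscraping.com" + path)
        print("ready to download", path)
```

    Calling wait() immediately before each download keeps requests to one domain spaced out, while requests to different domains are not delayed by each other.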

    We decided to build a custom website for many of the examples used in this book instead of scraping live websites, so that we have full control over the environment. This provides us with stability: live websites are updated more often than books, and by the time you try a scraping example, it may no longer work. Also, a custom website allows us to craft examples that illustrate specific skills and avoid distractions. Finally, a
