Python Web Scraping Cookbook

Untangle your web scraping complexities and access web data with ease using Python scripts

Michael Heydt

Book Details

ISBN 13: 9781787285217
Paperback: 364 pages

Book Description

Python Web Scraping Cookbook is a solution-focused book that will teach you techniques to develop high-performance scrapers and deal with crawlers, sitemaps, forms automation, Ajax-based sites, caches, and more. You'll explore a number of real-world scenarios in which every part of the development and product life cycle is covered. You will not only develop the skills to design and build reliable, performant data flows, but also deploy your codebase to AWS. If you are involved in software engineering, product development, or data mining (or are interested in building data-driven products), you will find this book useful, as each recipe has a clear purpose and objective.

From extracting data from websites to writing a sophisticated web crawler, the book's independent recipes will be a godsend on the job. This book covers Python libraries such as Requests and BeautifulSoup. You will learn about crawling, web spidering, working with AJAX websites, paginated items, and more. You will also learn to handle 403 errors, work with proxies, scrape images, use lxml, and more.

By the end of this book, you will be able to scrape websites more efficiently and to deploy and operate your scraper in the cloud.

Table of Contents

Chapter 1: Getting Started with Scraping
Introduction
Setting up a Python development environment 
Scraping Python.org with Requests and Beautiful Soup
Scraping Python.org in urllib3 and Beautiful Soup
Scraping Python.org with Scrapy
Scraping Python.org with Selenium and PhantomJS
Chapter 2: Data Acquisition and Extraction
Introduction
How to parse websites and navigate the DOM using BeautifulSoup
Searching the DOM with Beautiful Soup's find methods
Querying the DOM with XPath and lxml
Querying data with XPath and CSS selectors
Using Scrapy selectors
Loading data in unicode / UTF-8
Chapter 3: Processing Data
Introduction
Working with CSV and JSON data
Storing data using AWS S3
Storing data using MySQL
Storing data using PostgreSQL
Storing data in Elasticsearch
How to build robust ETL pipelines with AWS SQS
Chapter 4: Working with Images, Audio, and other Assets
Introduction
Downloading media content from the web
Parsing a URL with urllib to get the filename
Determining the type of content for a URL 
Determining the file extension from a content type
Downloading and saving images to the local file system
Downloading and saving images to S3
Generating thumbnails for images
Taking a screenshot of a website
Taking a screenshot of a website with an external service
Performing OCR on an image with pytesseract
Creating a Video Thumbnail
Ripping an MP4 video to an MP3
Chapter 5: Scraping - Code of Conduct
Introduction
Scraping legality and scraping politely
Respecting robots.txt
Crawling using the sitemap
Crawling with delays
Using identifiable user agents 
Setting the number of concurrent requests per domain
Using auto throttling
Using an HTTP cache for development
Chapter 6: Scraping Challenges and Solutions
Introduction
Retrying failed page downloads
Supporting page redirects
Waiting for content to be available in Selenium
Limiting crawling to a single domain
Processing infinitely scrolling pages
Controlling the depth of a crawl
Controlling the length of a crawl
Handling paginated websites
Handling forms and forms-based authorization
Handling basic authorization
Preventing bans by scraping via proxies
Randomizing user agents
Caching responses
Chapter 7: Text Wrangling and Analysis
Introduction
Installing NLTK
Performing sentence splitting
Performing tokenization
Performing stemming
Performing lemmatization
Determining and removing stop words
Calculating the frequency distributions of words
Identifying and removing rare words
Removing punctuation marks
Piecing together n-grams
Scraping a job listing from StackOverflow 
Reading and cleaning the description in the job listing
Chapter 8: Searching, Mining and Visualizing Data
Introduction
Geocoding an IP address
How to collect IP addresses of Wikipedia edits
Visualizing contributor location frequency on Wikipedia
Creating a word cloud from a StackOverflow job listing
Crawling links on Wikipedia
Visualizing page relationships on Wikipedia
Calculating degrees of separation
Chapter 9: Creating a Simple Data API
Introduction
Creating a REST API with Flask-RESTful
Integrating the REST API with scraping code
Adding an API to find the skills for a job listing
Storing data in Elasticsearch as the result of a scraping request
Checking Elasticsearch for a listing before scraping
Chapter 10: Creating Scraper Microservices with Docker
Introduction
Installing Docker
Installing a RabbitMQ container from Docker Hub
Running a Docker container (RabbitMQ)
Creating and running an Elasticsearch container
Stopping/restarting a container and removing the image
Creating a generic microservice with Nameko
Creating a scraping microservice
Creating a scraper container
Creating an API container
Composing and running the scraper locally with docker-compose
Chapter 11: Making the Scraper as a Service Real
Introduction
Creating and configuring an Elastic Cloud trial account
Accessing the Elastic Cloud cluster with curl
Connecting to the Elastic Cloud cluster with Python
Performing an Elasticsearch query with the Python API 
Using Elasticsearch to query for jobs with specific skills
Modifying the API to search for jobs by skill
Storing configuration in the environment 
Creating an AWS IAM user and a key pair for ECS
Configuring Docker to authenticate with ECR
Pushing containers into ECR
Creating an ECS cluster
Creating a task to run our containers
Starting and accessing the containers in AWS

What You Will Learn

  • Use a wide variety of tools to scrape any website and data—including BeautifulSoup, Scrapy, Selenium, and many more
  • Master expression languages such as XPath, CSS, and regular expressions to extract web data
  • Deal with scraping traps such as hidden form fields, throttling, pagination, and different status codes
  • Build robust scraping pipelines with SQS and RabbitMQ
  • Scrape assets such as images and other media, and know what to do when a scraper fails to run
  • Explore ETL techniques to build a customized crawler and parser, and to convert structured and unstructured data from websites
  • Deploy and run your scraper-as-a-service in AWS Elastic Container Service
