One of the awesome things about Python is how relatively simple it is to do pretty complex and impressive tasks. A great example of this is web scraping.
To simplify the process, we can also download data as raw text and format it ourselves, for instance content from a personal blog or the profile information of a GitHub user, without any registration. This guide explains how to make web requests in Python using the Requests package and its various features. Generally, Requests has two main use cases: making requests to an API and getting raw HTML content from websites (i.e., scraping). Whenever you send any type of request, you should always check the status code (especially when scraping) to make sure your request was served successfully.
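As a minimal sketch of the status-code check mentioned above, the snippet below wraps a GET request in a helper that only returns HTML when the server answered with 200 OK. The helper names was_successful and fetch_html are hypothetical, and the 10-second timeout is an arbitrary choice.

```python
import requests


def was_successful(response):
    """Return True only if the server answered with HTTP 200 OK."""
    return response.status_code == requests.codes.ok


def fetch_html(url):
    """Fetch a page and return its HTML, raising if the request was not served."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    if not was_successful(response):
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes
        response.raise_for_status()
        # Anything else that is not 200 is treated as unexpected in this sketch
        raise RuntimeError(f"Unexpected status code: {response.status_code}")
    return response.text
```

Calling fetch_html on a page that does not exist would then fail loudly instead of silently handing an error page to the parser.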
This is an article about web scraping with Python. In it we will look at the basics of web scraping using popular libraries such as requests and beautiful soup.
Topics covered:
- What is web scraping?
- What are requests and beautiful soup?
- Using CSS selectors to target data on a web-page
- Getting product data from a demo book site
- Storing scraped data in CSV and JSON formats
What is Web Scraping?
Some websites contain a large amount of valuable data. Web scraping means extracting data from websites, usually in an automated fashion using a bot or web crawler. The kinds of data available are as wide-ranging as the internet itself. Common tasks include:
- scraping stock prices to inform investment decisions
- automatically downloading files hosted on websites
- scraping data about company contacts
- scraping data from a store locator to create a list of business locations
- scraping product data from sites like Amazon or eBay
- scraping sports stats for betting
- collecting data to generate leads
- collating data available from multiple sources
Legality of Web Scraping
There has been some confusion in the past about the legality of scraping data from public websites. This has been cleared up somewhat recently (I’m writing in July 2020) by a court case where the US Court of Appeals denied LinkedIn’s requests to prevent HiQ, an analytics company, from scraping its data.
The decision was a historic moment in the data privacy and data regulation era. It showed that any data that is publicly available and not copyrighted is potentially fair game for web crawlers.
However, proceed with caution. You should always honour the terms and conditions of a site that you wish to scrape data from, as well as the contents of its robots.txt file. You also need to ensure that any data you scrape is used in a legal way. For example, you should consider copyright issues and data protection laws such as GDPR. Also, be aware that the court decision could be reversed and other laws may apply. This article is not intended to provide legal advice, so please do your own research on this topic. One place to start is Quora, which has some good and detailed questions and answers on the subject.
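Honouring robots.txt can be partly automated. The sketch below uses Python's standard urllib.robotparser module to test whether a URL may be fetched under a given set of rules. The sample rules and the example.com URLs are made up for illustration, and the helper name is_allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_RULES, "http://example.com/catalogue/page-1.html"))
print(is_allowed(SAMPLE_RULES, "http://example.com/private/data.html"))
```

For a live site, RobotFileParser also offers set_url() and read() to download the rules directly rather than parsing a string.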
One way you can avoid any potential legal snags while learning how to use Python to scrape websites for data is to use sites which either welcome or tolerate your activity. One great place to start is http://books.toscrape.com/, a web scraping sandbox which we will use in this article.
An example of Web Scraping in Python
You will need to install two common scraping libraries to use the following code. This can be done by running pip install requests and pip install beautifulsoup4 in a command prompt. For details on how to install packages in Python, check out Installing Python Packages with Pip.
The requests library handles connecting to and fetching data from your target web-page, while beautifulsoup enables you to parse and extract the parts of that data you are interested in.
Let’s look at an example:
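A minimal sketch of such a scraper, consistent with the walkthrough that follows, might look like this. The function names parse_books and get_data, the choice of the built-in html.parser, and the 10-second timeout are assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup

URL = "http://books.toscrape.com/"


def parse_books(html):
    """Extract (title, price) pairs from the HTML of a book listing page."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for product in soup.find_all(class_="product_pod"):
        title = product.h3.text
        price = float(
            product.find("div", class_="product_price")
            .find("p", class_="price_color")
            .text.strip("£")
        )
        books.append((title, price))
    return books


def get_data(url):
    """Download a page and return the book data it contains."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Pass the raw bytes so BeautifulSoup can work out the page encoding itself
    return parse_books(response.content)


if __name__ == '__main__':
    try:
        data = get_data(URL)
    except requests.RequestException as error:
        print(f"Request failed: {error}")
    else:
        print('### RESULTS ###')
        for title, price in data:
            print(title, price)
```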
So how does the code work?
In order to do web scraping with Python, you will need a basic understanding of HTML and CSS, so that you understand the territory you are working in. You don’t need to be an expert, but you do need to know how to navigate the elements on a web-page using an inspector such as Chrome DevTools. If you don’t have this basic knowledge, you can go off and get it (w3schools is a great place to start), or if you are feeling brave, just try to follow along and pick up what you need as you go.
To see what is happening in the code above, navigate to http://books.toscrape.com/. Place your cursor over a book price, right-click your mouse and select “inspect” (that’s the option on Chrome; it may be something slightly different like “inspect element” in other browsers). When you do this, a new area will appear showing you the HTML which created the page. You should take particular note of the “class” attributes of the elements you wish to target.
In our code, the first step is a lookup by class attribute, which returns a list of elements with the class product_pod.
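To illustrate a lookup by class, the sketch below runs on a tiny made-up HTML fragment rather than the live page. It shows two ways BeautifulSoup can target the product_pod class: the class_ keyword of find_all and the equivalent CSS selector passed to select. Which of the two the original program used is not shown in the text, so treat both as options.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the structure of the demo site
html = """
<article class="product_pod"><h3>Book one</h3></article>
<article class="product_pod"><h3>Book two</h3></article>
<article class="sidebar"><h3>Not a product</h3></article>
"""

soup = BeautifulSoup(html, "html.parser")

# Lookup by class attribute (class_ avoids clashing with Python's class keyword)
products = soup.find_all(class_="product_pod")

# The same lookup expressed as a CSS selector
selected = soup.select(".product_pod")

print([product.h3.text for product in products])
```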
Then, for each of these elements, two lines extract the title and the price. The first line is fairly straightforward and just selects the text of the h3 element for the current product. The next line does lots of things, and could be split into separate lines. Basically, it finds the p tag with class price_color within the div tag with class product_price, extracts the text, strips out the pound sign and finally converts the result to a float. This last step is not strictly necessary as we will be storing our data in text format, but I’ve included it in case you need an actual numeric data type in your own projects.
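The price extraction described above can be split into separate lines, as this sketch shows. It works on a single made-up product element, and the intermediate variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up element imitating one product on the demo site
html = (
    '<article class="product_pod">'
    '<h3><a href="x.html" title="A Light in the Attic">A Light in the ...</a></h3>'
    '<div class="product_price"><p class="price_color">£51.77</p></div>'
    '</article>'
)
product = BeautifulSoup(html, "html.parser").find(class_="product_pod")

# The text of the h3 element for the current product
title = product.h3.text

# One step per line: find the div, find the p inside it, clean up, convert
price_div = product.find("div", class_="product_price")
price_tag = price_div.find("p", class_="price_color")
price_text = price_tag.text.strip("£")  # remove the pound sign
price = float(price_text)

print(title, price)
```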
Storing Scraped Data in CSV Format
CSV (comma-separated values) is a very common and useful file format for storing data. It is lightweight and does not require a database.
Define a store_as_csv function above the if __name__ == '__main__': line and, just before the line print('### RESULTS ###'), add this:
store_as_csv(data, headings=['title', 'price'])
When you run the code now, a file will be created containing your book data in csv format. Pretty neat huh?
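A sketch of what store_as_csv might look like, using only the standard csv module, is shown below. The function name and the headings argument come from the call above; the filename parameter and its default value books.csv are assumptions.

```python
import csv


def store_as_csv(data, headings=None, filename="books.csv"):
    """Write rows of scraped data to a CSV file, with an optional header row."""
    # newline="" lets the csv module control line endings on every platform
    with open(filename, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        if headings:
            writer.writerow(headings)
        writer.writerows(data)
```

Each item in data is expected to be a sequence such as a (title, price) tuple, so every book ends up on its own line of the file.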
Storing Scraped Data in JSON Format
Another very common format for storing data is JSON (JavaScript Object Notation), which is basically a collection of lists and dictionaries (called arrays and objects in JavaScript).
Add a store_as_json function above if __name__ ..: and a call to store_as_json(data) above the print('### RESULTS ###') line.
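A sketch of store_as_json using the standard json module could look like the following. The function name comes from the call above; the filename parameter, its default value books.json and the indent setting are assumptions.

```python
import json


def store_as_json(data, filename="books.json"):
    """Write the scraped data to a JSON file."""
    with open(filename, "w", encoding="utf-8") as json_file:
        # indent=2 only makes the file easier to read; it is optional
        json.dump(data, json_file, indent=2)
```

Note that JSON has no tuple type, so (title, price) tuples are written as arrays and come back as lists when the file is loaded again.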
So there you have it – you now know how to scrape data from a web-page, and it didn’t take many lines of Python code to achieve!
Full Code Listing for Python Web Scraping Example
Here’s the full listing of our program for your convenience.
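One way the pieces described in this article fit together is sketched below. Apart from store_as_csv, store_as_json and the quoted print line, the names, file names and the timeout are assumptions, so treat this as a reconstruction under those assumptions.

```python
import csv
import json

import requests
from bs4 import BeautifulSoup

URL = "http://books.toscrape.com/"


def parse_books(html):
    """Extract (title, price) pairs from the HTML of a book listing page."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for product in soup.find_all(class_="product_pod"):
        title = product.h3.text
        price_tag = product.find("div", class_="product_price").find("p", class_="price_color")
        books.append((title, float(price_tag.text.strip("£"))))
    return books


def get_data(url):
    """Download a page and return the book data it contains."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return parse_books(response.content)


def store_as_csv(data, headings=None, filename="books.csv"):
    """Write rows of scraped data to a CSV file, with an optional header row."""
    with open(filename, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        if headings:
            writer.writerow(headings)
        writer.writerows(data)


def store_as_json(data, filename="books.json"):
    """Write the scraped data to a JSON file."""
    with open(filename, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=2)


if __name__ == '__main__':
    try:
        data = get_data(URL)
    except requests.RequestException as error:
        print(f"Request failed: {error}")
    else:
        store_as_csv(data, headings=['title', 'price'])
        store_as_json(data)
        print('### RESULTS ###')
        for title, price in data:
            print(title, price)
```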
One final note. We have used requests and beautifulsoup for our scraping, and a lot of the existing code on the internet in articles and repositories uses those libraries. However, there is a newer library which performs the task of both of these put together, and has some additional functionality which you may find useful later on. This newer library is requests-HTML and is well worth looking at once you have got a basic understanding of what you are trying to achieve with web scraping. Another library which is often used for more advanced projects spanning multiple pages is scrapy, but that is a more complex beast altogether, for a later article.
Working through the contents of this article will give you a firm grounding in the basics of web scraping in Python. I hope you find it helpful.
Happy computing.
In the previous chapter, we looked at scraping dynamic websites. In this chapter, we look at scraping websites that work on user input, that is, form-based websites.
Introduction
These days the WWW (World Wide Web) is moving towards social media and user-generated content. So the question arises: how can we access information that sits behind a login screen? For this we need to deal with forms and logins.
In previous chapters, we worked with the HTTP GET method to request information, but in this chapter we will work with the HTTP POST method, which pushes information to a web server for storage and analysis.
Interacting with Login forms
While working on the Internet, you must have interacted with login forms many times. They may be very simple, with only a few HTML fields, a submit button and an action page, or they may be more complicated and have additional fields such as email or a message box, along with a captcha for security reasons.
In this section, we are going to deal with a simple submit form with the help of the Python requests library.
First, we need to import the requests library. Next, we need to provide the information for the fields of the form, typically as a dictionary mapping field names to values. Then we need to provide the URL on which the action of the form happens, and send the data to it with requests.post(). After running the script, it will return the content of the page where the action has happened.
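The steps above can be sketched as follows. The field names, their values and the example.com URL are placeholders invented for illustration, and submit_form is a hypothetical helper. The last lines build the request without sending it, which is a convenient way to inspect what would be posted.

```python
import requests

# Placeholder form fields and action URL (assumptions for illustration)
form_fields = {"name": "Ada", "email": "ada@example.com", "message": "Hello"}
form_url = "https://example.com/contact"


def submit_form(url, fields):
    """POST the form fields and return the content of the resulting page."""
    response = requests.post(url, data=fields, timeout=10)
    response.raise_for_status()
    return response.text


# Prepare the request without sending it, to see what would go over the wire
prepared = requests.Request("POST", form_url, data=form_fields).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])
print(prepared.body)
```

Passing the dictionary through the data argument makes requests encode it the same way a browser encodes a submitted HTML form.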
If you want to submit an image with the form, this is very easy with requests.post(): the file is passed through its files argument instead of data.
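A minimal sketch of such an upload is shown below. The form field name upload, the file name and the URL are assumptions, and upload_image is a hypothetical helper. As before, the final lines only prepare a request, using in-memory bytes in place of a real image, so that nothing is sent.

```python
import requests


def upload_image(url, image_path, field_name="upload"):
    """POST an image file as a multipart form and return the resulting page."""
    with open(image_path, "rb") as image_file:
        response = requests.post(url, files={field_name: image_file}, timeout=10)
    response.raise_for_status()
    return response.text


# Prepare (but do not send) an upload, using fake bytes instead of a real file
prepared = requests.Request(
    "POST",
    "https://example.com/upload",
    files={"upload": ("photo.png", b"fake image bytes", "image/png")},
).prepare()
print(prepared.headers["Content-Type"])
```

Using files makes requests build a multipart/form-data body, which is the encoding HTML forms use for file inputs.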
Loading Cookies from the Web Server
A cookie, sometimes called a web cookie or internet cookie, is a small piece of data sent from a website, which our computer stores in a file inside our web browser.
In the context of login forms, there are two kinds of form to deal with. The first, which we dealt with in the previous section, simply allows us to submit information to a website; the second lets us remain in a persistent “logged-in” state throughout our visit to the website. For the second kind, websites use cookies to keep track of who is logged in and who is not.
What do cookies do?
These days most websites use cookies for tracking. We can understand how cookies work with the help of the following steps −
Step 1 − First, the site authenticates our login credentials and stores the result in our browser’s cookie. This cookie generally contains a server-generated token, a time-out and tracking information.
Step 2 − Next, the website uses the cookie as proof of authentication. This proof is presented whenever we visit the website.
Cookies are very problematic for web scrapers because, if a scraper does not keep track of them, the form is submitted but on the next page it appears as if the scraper never logged in. It is very easy to track cookies with the Python requests library, because the response to the login request exposes them through its cookies attribute.
The URL used for that request should be the page which acts as the processor for the login form.
After the request has run, we can retrieve the cookies from its result.
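A sketch of tracking cookies by hand, without a session, might look like this. The helper names login, cookies_from and fetch_with_cookies are hypothetical, and any URLs or field names passed to them would be placeholders.

```python
import requests


def login(login_url, fields):
    """POST the login form; the response carries any cookies the server set."""
    response = requests.post(login_url, data=fields, timeout=10)
    response.raise_for_status()
    return response


def cookies_from(response):
    """Return the cookies attached to a response as a plain dictionary."""
    return response.cookies.get_dict()


def fetch_with_cookies(url, cookies):
    """Request another page, sending the saved cookies along explicitly."""
    return requests.get(url, cookies=cookies, timeout=10)
```

A typical flow would be to call login, pass the result to cookies_from, and then hand the resulting dictionary to fetch_with_cookies for every later page.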
There is another issue with cookies: websites sometimes modify them without warning. This kind of situation can be dealt with using requests.Session(), which keeps track of cookies across requests.
As before, the URL for the login request would be the page which acts as the processor for the login form.
Comparing the two approaches makes it easy to see the difference between a script with a session and one without.
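With a session, the bookkeeping is done for us. In the sketch below, login_and_fetch is a hypothetical helper, and the URLs and field names passed to it would be placeholders. The lines at the end are an offline illustration: a cookie placed on the session by hand, standing in for one set by a server, is attached automatically to the next request the session prepares.

```python
import requests


def login_and_fetch(login_url, fields, protected_url):
    """Log in, then fetch a second page within the same session."""
    with requests.Session() as session:
        login_response = session.post(login_url, data=fields, timeout=10)
        login_response.raise_for_status()
        # Cookies set or changed by the server are stored on the session
        # and sent automatically with the next request.
        page = session.get(protected_url, timeout=10)
        page.raise_for_status()
        return page.text


# Offline illustration: a cookie held by the session is added to later requests
session = requests.Session()
session.cookies.set("token", "abc123")  # stands in for a server-set cookie
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)
print(prepared.headers.get("Cookie"))
```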
Automating forms with Python
In this section we are going to work with a Python module named Mechanize, which reduces our work by automating the process of filling in forms.
Mechanize module
The Mechanize module provides a high-level interface for interacting with forms. Before using it, we need to install it with the following command −
pip install mechanize
Note that older releases of Mechanize work only in Python 2.x; more recent releases also support Python 3.
Example
In this example, we automate the process of filling in a login form that has two fields, namely email and password. First, we import the mechanize module and create a Mechanize browser object. Then we navigate to the login URL and select the form. After that, names and values are passed directly to the browser object and the form is submitted.
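The steps just described can be sketched as follows. The function name login_with_mechanize is hypothetical, the form is assumed to be the first one on the page, and the field names email and password are taken from the description; a real login page may use different names.

```python
def login_with_mechanize(login_url, email, password):
    """Fill in and submit the first form on a login page using mechanize."""
    # Third-party package; imported here so it is only required when called
    import mechanize

    browser = mechanize.Browser()
    browser.open(login_url)           # navigate to the login URL
    browser.select_form(nr=0)         # select the first form on the page
    browser["email"] = email          # names and values go straight
    browser["password"] = password    # onto the browser object
    response = browser.submit()
    return response.read()
```

Calling login_with_mechanize with the URL of a login page and a pair of credentials returns the content of the page the form leads to.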