
This article is for all those data analysts out there. If you’ve been looking for a better way to download data from the Internet, check out what Python requests and cURL can do.

One of the first steps in any data analytics project is getting your hands on a dataset. It could be anything – measurements of physical phenomena, results from a simulation, images, text, even music. In fact, we have an article on How to Visualize Sound in Python to provide some inspiration. It’s even possible to generate fake data using a little-known Python library.

One common way to get data is to download it from the Internet. In this article, we’ll introduce you to several ways to do this, ranging from a command line tool to some Python libraries.

This article is aimed at data analysts who have a little more experience in Python. We’ll start slow, but we’ll end up covering some more advanced material. Even if you’re new to programming with Python, there’ll be something in here for you.

Downloading a bunch of stuff from the internet means handling large numbers of files. If you would like to learn how to do this, take a look at the course Working with Files and Directories in Python. It contains over 100 interactive exercises to get you on your feet. If you work specifically with JSON files, our How to Read and Write JSON Files in Python course could be what you’re looking for.

The GET Request in HTTP

The GET request forms the backbone of downloading stuff from the internet in Python. It’s the link between the source (you) and the target (the website you want to scrape). It’s a very common HTTP request method used to retrieve data without any other effect on the state of the target (i.e., GET doesn’t change a webpage; it just loads a representation of it into the browser). Each HTTP message (a request or a response) consists of a header and a body. The header can be thought of as containing meta-data, such as information about the type and size of the data in the body; the body contains the data itself. For a GET request, the body is usually empty; it’s the response body that carries the data you asked for. We’ll see some examples in the next section that will make this clear.
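To make this concrete, here’s a minimal sketch using Python’s built-in http.client module to send a GET request by hand and look at the two parts of the response (the URL is just an example):

>>> import http.client
>>> # open a connection and send a bare GET request
>>> conn = http.client.HTTPSConnection('learnpython.com')
>>> conn.request('GET', '/')
>>> resp = conn.getresponse()
>>> # the status line and headers are the meta-data...
>>> print(resp.status, resp.reason)
>>> print(resp.getheaders())
>>> # ...and the body is the data itself (here, the page's HTML)
>>> body = resp.read()
>>> conn.close()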

cURL and the Command Line

cURL (Client URL) is a tool for querying URLs from the command line. It can perform a GET request to download data from a website; it can also upload or delete data on a server, post a message to a message board, authenticate users, and do other useful things.

To test it out, open the command line (e.g. the Command Prompt in Windows), and execute the following command:

>>> curl https://learnpython.com/

The response from the target server is printed to the command line. In this case, it simply prints the HTML for the learnpython.com homepage as text. There are many options you can use to configure the request. To get only the header information (using the HEAD method), try sending the request with the -I option (or, alternatively, --head):

>>> curl -I https://learnpython.com/

The response this time is much shorter and contains data like the date and time of the request and information about cookies.
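The exact fields depend on the server, but the output will look something like this (the values here are purely illustrative):

HTTP/2 200
date: Mon, 02 Jan 2023 12:00:00 GMT
content-type: text/html; charset=utf-8
set-cookie: ...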

To use curl to download data, add the -o or -O option. Use the -o option to explicitly define where to save the result. For example, to download a photo of the cutest baby goat you’ve ever seen, just run the following command:

>>> curl -o goat.jpg https://learnpython.com/blog/python-pillow-module/1.jpg

This will save the photo to your current working directory as goat.jpg. To find the full path to the directory, just run the pwd command in the command line. There’s a huge range of options you can use to configure the request. Check out the cURL documentation for more information.
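As a quick aside, the capital -O variant saves the file under the name it has on the server instead. Reusing the same URL as above, the following command would save the image as 1.jpg in the current working directory:

>>> curl -O https://learnpython.com/blog/python-pillow-module/1.jpg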

The requests Library in Python

Using the curl command is a quick and convenient way to download a small number of files. But for most analytics projects, you will need a much larger dataset. So, we will now turn to some useful Python libraries that will allow you to write more complex programs and download much more data.

The requests library is a simple Python library for making HTTP requests. We’ve already covered some basics about this library in the article Web Scraping with Python Libraries, so give it a read for some background material. There’s a useful example of how to download text data from a web page and parse it into a readable format. We’ll build off that material to do something more complex and interesting.

For this example, we’ll write a script to download the Astronomy Picture of the Day from the NASA website. These are stunning images which are posted daily; the webpage has become one of the most visited American government sites. But before we get started, we’ll need to get a few things set up.

The first step is to generate an API key using this form. We’ll be using the requests library to get the response from the server; we need to pass our API key and the date to the get function as a dictionary of parameters. We’ll also be using the datetime module to work with dates from within our Python script (How to Work with Date and Time in Python has more information about this module). Finally, we’ll need some help from the json module to parse the response from the API.

Setting up our project looks like this:

>>> import requests
>>> import datetime as dt
>>> import json
>>> # define your api key here
>>> api_key = 'YourUniqueAPIKeyHere'
>>> # define the date as datetime object, then get string
>>> dte = dt.datetime(2023, 1, 2)
>>> dte_string = dt.datetime.strftime(dte, '%Y-%m-%d')
>>> # define parameters
>>> params = {'api_key': api_key, 'hd': 'True', 'date': dte_string}
>>> # first request
>>> r = requests.get('https://api.nasa.gov/planetary/apod', params=params)

We also passed the 'hd':'True' parameter to request the high-definition version of the image. The body of the response, r, is in JSON format. We need to check that the request was successful; then we can parse the response to get the URL of the high-definition image. Once we have the URL, we can send another request and save the image:

>>> # check response. 200 = good, 404 = bad
>>> if r.status_code == 200:
...     # parse json data
...     data = json.loads(r.content.decode('utf-8'))
...     # check if data is an image
...     if data['media_type'] == 'image':
...         # get the image
...         r = requests.get(data['hdurl'], stream=True)
...         # check response again
...         if r.status_code == 200:
...             # save the image
...             with open('apod_{}.jpg'.format(dte_string), 'wb') as f:
...                 for chunk in r.iter_content():
...                     f.write(chunk)

By setting stream=True, the data is sent in non-overlapping chunks, one after another. This avoids reading all the content into memory at once, which is useful for large amounts of data. We use the iter_content() method to iterate over the response chunks and save them as one JPG image in the current working directory, with the filename 'apod_2023-01-02.jpg'.
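One caveat worth knowing: by default, iter_content() yields the data one byte at a time, which is slow for large files. You can pass a chunk_size (in bytes) to read bigger pieces at once; the value below is just a reasonable guess, not a magic number:

>>> with open('apod_{}.jpg'.format(dte_string), 'wb') as f:
...     # read the response roughly 128 KB at a time
...     for chunk in r.iter_content(chunk_size=128 * 1024):
...         f.write(chunk)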

This script could be extended to download pictures from a range of dates and store them in a directory; to do this, use the 'start_date' and 'end_date' keys in the params dictionary, as sketched below. Using an image processing library (for example, PIL), you could then print the description onto each image and use the result as your new desktop picture.
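As a rough sketch of that extension (assuming the API behaves as documented: with start_date and end_date set, it returns a list of results rather than a single object), the loop might look like this:

>>> params = {'api_key': api_key, 'hd': 'True',
...           'start_date': '2023-01-01', 'end_date': '2023-01-07'}
>>> r = requests.get('https://api.nasa.gov/planetary/apod', params=params)
>>> # with a date range, the response body is a JSON list
>>> for item in json.loads(r.content.decode('utf-8')):
...     if item['media_type'] == 'image':
...         img = requests.get(item['hdurl'], stream=True)
...         with open('apod_{}.jpg'.format(item['date']), 'wb') as f:
...             for chunk in img.iter_content(chunk_size=128 * 1024):
...                 f.write(chunk)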

In short, to perform curl-like tasks in Python scripts, you can use the requests library to make HTTP requests, download files, and handle other web-related tasks programmatically.

PycURL

The next tool we’ll look at is PycURL, which can also be used to get data from a URL within your Python program. It is aimed more at advanced developers and has a lot of features. It is flexible, powerful, and very fast.

We’ll now take a look at sending a GET request to a webpage using pycurl. We’ll also need the help of the BytesIO class from the io module:

>>> import pycurl
>>> from io import BytesIO
>>> crl_obj = pycurl.Curl()
>>> b_obj = BytesIO()

We have created a curl object to transfer data over a URL and a bytes object to store the data. The curl object can be customized with many options using the setopt() method. For example, to define the URL and tell the curl object where to store the data, we do the following:

>>> url = 'https://en.wikipedia.org/wiki/Stephen_Hawking'
>>> crl_obj.setopt(crl_obj.URL, url)
>>> crl_obj.setopt(crl_obj.WRITEDATA, b_obj)

Now we want to perform a data transfer and end the session:

>>> crl_obj.perform()
>>> crl_obj.close()

To get our hands on the data stored in the bytes object, we use:

>>> data = b_obj.getvalue()
>>> print(data.decode('utf8'))

This will print out the HTML of the webpage. There are many other options to customize the request; you can, for example, request only the header information for a URL (sketched below). There’s more information on the functionality in the official PycURL documentation.
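As one example, a header-only request can be sketched with the NOBODY option (which tells libcurl to send a HEAD request) together with HEADERFUNCTION, a callback that receives each raw header line as bytes. The header_lines list is our own name, not part of the API:

>>> header_lines = []
>>> crl_obj = pycurl.Curl()
>>> crl_obj.setopt(crl_obj.URL, url)
>>> # NOBODY switches the method to HEAD, so no body is transferred
>>> crl_obj.setopt(crl_obj.NOBODY, True)
>>> # collect each raw header line as it arrives
>>> crl_obj.setopt(crl_obj.HEADERFUNCTION, header_lines.append)
>>> crl_obj.perform()
>>> crl_obj.close()
>>> print(b''.join(header_lines).decode('iso-8859-1'))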

Practice cURL and Other Python Request Tools

In this article, we introduced you to some tools to help you download stuff in Python. The first tool, cURL, can be used independently of Python straight from the command line. However, using Python libraries to download Internet data opens the door to more power and flexibility. With these tools, you could get access to Wikipedia text data, images, music, tabular data, and more.

To sharpen your skills in processing data in various formats, we have a Data Processing with Python track. This bundles together 5 of the best interactive courses to teach you everything you need to know about working with data in Python.

If you need some inspiration for fun projects to practice your Python skills, check out our articles on Python Coding Project Ideas for Beginners and Useful Python Libraries for Fun Hobby Projects. You will surely come up with a great Python project to practice your new skills on. Happy coding!