Learn How to Scrape Product or Price Data from Walmart

Can you scrape data from Walmart.com?

In this article, we will look at web scraping Walmart data using Python. Walmart is among the top players in the e-commerce business, not only because it has a large customer base but also because it is a great source of information about prices, products, and much more for those looking to invest in e-commerce. Read on to find out whether you can scrape such data from Walmart.

Although a little tricky, it is not impossible to scrape data from Walmart. It is tough because the platform is not very supportive of data scraping: the website has many anti-spam systems, along with IP address tracking and blocking that can cut off a web scraper's access to the site. Additionally, AI-based detection, captchas, and cookies prevent scrapers from accessing and collecting data. However, if you dodge the IP tracking system and use a captcha-solving service, you can scrape Walmart site data without getting detected or banned. One easy way to do this is to rotate residential proxies. All you need is a professional scraping tool that handles these requirements, or, as a professional coder, you can write your own code that meets them. An important thing to remember is to pay close attention to data usage: you should only scrape public data, or else you risk violating local laws.
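The proxy-rotation idea mentioned above can be sketched in a few lines. The proxy URLs below are placeholders standing in for real residential proxies from a provider; the round-robin logic is the part that matters:

```python
import itertools

# Placeholder proxy pool -- swap in residential proxies from your provider.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

# cycle() loops over the pool endlessly, giving round-robin rotation.
proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return a requests-style proxies dict, rotating through the pool."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

# Each request would then use a different exit IP, e.g.:
# requests.get(url, proxies=next_proxies())
```

Each call hands back the next proxy in the pool, so consecutive requests leave through different IP addresses.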

What data can you scrape from Walmart.com?

Here is the list of fields that you can extract from a Walmart product listing page:

  • Images
  • Product specifications
  • Name
  • Price
  • Description
  • Brand
  • Seller
  • Category
  • Model
  • Ratings
  • Reviews
  • Status of product availability
  • Walmart item number
  • Histogram of ratings
  • Product variations
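Taken together, one scraped listing could be collected into a single record; the sketch below shows one possible shape for it (the field names and all values are illustrative placeholders, not Walmart's own schema):

```python
# Illustrative record for one product listing; every value is a placeholder.
product_record = {
    "name": 'HP 11.6" Chromebook, AMD A4, 4GB RAM, 32GB Storage, Black',
    "price": "$98.00",
    "brand": "HP",
    "seller": "Walmart.com",
    "category": "Electronics/Laptops/Chromebooks",
    "model": "16W64UT#ABA",
    "item_number": "592161882",
    "rating": 4.1,            # average star rating
    "review_count": 1713,     # number of reviews behind the rating
    "in_stock": True,         # product availability status
    "images": ["https://i5.walmartimages.com/asr/example.jpeg"],
    "description": "AMD A4 processor, 4GB memory, 32GB eMMC storage.",
}
```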

How to scrape Walmart data using Python


We will now take a closer look at how to fetch a product page from Walmart.com using Python requests and how to extract the information using BeautifulSoup.

Let us first quickly create a new folder to keep all the required code and create a walmart_scraper.py file.

mkdir project
cd project
touch walmart_scraper.py

We will use two libraries: Requests and BeautifulSoup. With Requests, we will fetch the website from Python, and with BeautifulSoup we will parse Walmart's HTML and extract the data we require. We will begin by installing them: pip install beautifulsoup4 requests.

We know that a website consists of HTML. When the browser makes a request, the server sends back this HTML, which the browser then parses and renders on the screen. With the help of Requests, it is possible to fetch the HTML of a web page from Python. For this, open your file (walmart_scraper.py) and write the following code:

import requests
url = "https://www.walmart.com/ip/HP-11-6-Chromebook-AMD-A4-4GB-RAM-32GB-Storage-Black-16W64UT-ABA/592161882"
html = requests.get(url)
print(html.text)

Running this code should print a large amount of HTML in the console. Ideally, you get the HTML of an actual product page, but sometimes you can also end up with a captcha page. You can verify which one you got by checking the first lines of your output. If you see something like the following, you probably got a captcha:

<html lang="en">
<head>    
<title>Robot or human?</title>    
<meta name="viewport" content="width=device-width">
</head>    

This happens because Walmart detected that the request came from an automated program and blocked it. You can get around this by opening the link in a browser and solving the captcha, and by sending additional information to Walmart when accessing the website from Python. That additional step is setting a User-Agent string, which makes Walmart believe that a real browser is sending the request instead of an automated program.
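Before parsing, you can also guard against the captcha case in code by checking whether the returned HTML looks like the block page. A minimal sketch, using the "Robot or human?" title shown above as the marker:

```python
def is_captcha_page(html_text):
    """Heuristic check: Walmart's block page carries this title."""
    return "Robot or human?" in html_text

# Two tiny sample documents standing in for real responses.
captcha_html = "<html><head><title>Robot or human?</title></head></html>"
product_html = "<html><head><title>HP Chromebook - Walmart.com</title></head></html>"

print(is_captcha_page(captcha_html))  # True
print(is_captcha_page(product_html))  # False
```

If the check fires, you can retry the request (for example through a different proxy) instead of parsing a useless page.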

With everything going right, you will soon have a valid HTML displayed on your screen. Let’s now look at how to extract the data we need from this.

At this point, we need to be crystal clear about what data we want to extract: the name, price, images, and description. Look at the following image to understand where this data is located.

Now, coming to the extraction of this data, we will use BeautifulSoup, a Python library. It is easy to use, has a good API, and handles most use cases effortlessly. We already have the library installed. Next, we need to look closely at our HTML page, understand its structure, and zero in on the precise location of the data we need to extract. For this, we can open the page in the browser and use the inspect element tool to find the closest tag to use. Let's check the product name to see how it works. Right-click on the name of the product and click the inspect element option, as seen below:

You can see in the image how we can search for an h1 tag with the itemprop attribute set to 'name' to extract the product name. You can now open your walmart_scraper.py file and write code to parse the HTML with BeautifulSoup and extract the name of the product:

import requests
from bs4 import BeautifulSoup

url = "https://www.walmart.com/ip/HP-11-6-Chromebook-AMD-A4-4GB-RAM-32GB-Storage-Black-16W64UT-ABA/592161882"
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:104.0) Gecko/20100101 Firefox/104.0"}
html = requests.get(url, headers=headers)

soup = BeautifulSoup(html.text, 'html.parser')
product_name = soup.find('h1', attrs={'itemprop': 'name'}).text

print(product_name)

When you run the script, it will display the name in the console: 'HP 11.6" Chromebook, AMD A4, 4GB RAM, 32GB Storage, Black 16W64UT#ABA'.

Similarly, you can extract the rest of the data. However, the procedure for extracting the product description is a little different. For this, you need to extract the data from the divs with the class dangerous-html. The only two divs on the page with this class contain the information we require.

However, this dangerous-html div is missing from the HTML that Walmart sent back to our Python script. This is because Walmart uses a JavaScript framework called Next.js, which ships the page data to the browser as a big JSON blob. But fret not, there is a solution for this too. Open the console or terminal where we earlier ran the script and search the HTML returned by Walmart for the product description. You will find a script tag with the id __NEXT_DATA__; it contains the information we need in JSON format.

Now you can extract and parse the JSON format using a code like this:

import json

script = soup.find('script', {'id': '__NEXT_DATA__'})
parsed_json = json.loads(script.text)

You can now look at the parsed JSON and find the dictionary keys required to extract the data:

description_1 = parsed_json['props']['pageProps']['initialData']['data']['product']['shortDescription']
description_2 = parsed_json['props']['pageProps']['initialData']['data']['idml']['longDescription']

You now have the descriptions as HTML fragments, which can in turn be parsed with BeautifulSoup to extract the required text.

description_1_text = BeautifulSoup(description_1, 'html.parser').text
description_2_text = BeautifulSoup(description_2, 'html.parser').text

At this step, your walmart_scraper.py file will look something like this:

import json
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:104.0) Gecko/20100101 Firefox/104.0"}
url = "https://www.walmart.com/ip/HP-11-6-Chromebook-AMD-A4-4GB-RAM-32GB-Storage-Black-16W64UT-ABA/592161882"
html = requests.get(url, headers=headers)

soup = BeautifulSoup(html.text, 'html.parser')

product_name = soup.find('h1', attrs={'itemprop': 'name'}).text
image_divs = soup.find_all('div', attrs={'data-testid': 'media-thumbnail'})
all_image_urls = []

for div in image_divs:
    image = div.find('img', attrs={'loading': 'lazy'})
    if image:
        image_url = image['src']
        all_image_urls.append(image_url)

price = soup.find('span', attrs={'itemprop': 'price'}).text

next_data = soup.find('script', {'id': '__NEXT_DATA__'})
parsed_json = json.loads(next_data.text)

description_1 = parsed_json['props']['pageProps']['initialData']['data']['product']['shortDescription']
description_2 = parsed_json['props']['pageProps']['initialData']['data']['idml']['longDescription']

description_1_text = BeautifulSoup(description_1, 'html.parser').text
description_2_text = BeautifulSoup(description_2, 'html.parser').text

Conclusion

That's it: whenever you need product information, just point the script at whichever product page you want and run it. Keep in mind that if Walmart changes its page structure, you may need to update the tag and attribute selectors accordingly.

