{"id":2639,"date":"2023-08-30T10:25:45","date_gmt":"2023-08-30T10:25:45","guid":{"rendered":"https:\/\/dataprot.net\/?p=2639"},"modified":"2023-10-17T08:33:09","modified_gmt":"2023-10-17T08:33:09","slug":"how-to-scrap-amazon","status":"publish","type":"post","link":"https:\/\/dataprot.net\/articles\/how-to-scrap-amazon\/","title":{"rendered":"Scraping Amazon: A Step-by-Step Guide toCreating Your Own Scraper"},"content":{"rendered":"\n

As one of the largest online retailers, Amazon is packed with a wealth of product data, from product information to reviews. Gaining a deep understanding of product trends is paramount for those looking to succeed in e-commerce. Fortunately, web scraping offers a powerful way to gather massive amounts of data, making it an invaluable tool for e-commerce success.

In this tutorial, you'll learn how to create a web scraper from scratch using Python. You'll also learn about the caveats and challenges of scraping Amazon and how to get around them.

Let's get started!

### Setup

First, you'll need to install Python to follow this tutorial. Once you've done that, you can run the command below in your terminal. With this command, we're installing the necessary dependencies: requests, pandas, and Beautiful Soup 4.


```bash
pip install requests pandas bs4
```

## How to Fetch Amazon Product Pages

Now, let's begin writing the scraper. First, you'll need to import all three libraries that you've just installed. Then you'll use them to send a GET request to fetch the Amazon product page.

### Import Libraries

The code below imports the necessary libraries:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```

### Making HTTP Requests

Next, you'll use the requests module to make an HTTP GET request to Amazon's web server:

```python
page = 'https://www.amazon.com/iPhone-Pro-Max-128GB-Gold/dp/B0BGYDDWDF/'
response = requests.get(page)
print(response.status_code)
```

At this point, if the code executed successfully, you should see the output 200. If it prints 503 instead, something went wrong.
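As a quick aside, requests can also surface failures as exceptions: raise_for_status() throws an HTTPError for any 4xx or 5xx response, which makes a blocked request hard to miss.

```python
# raise_for_status() raises requests.exceptions.HTTPError for
# any 4xx or 5xx response instead of failing silently
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    print('Request failed:', err)
```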

### Amazon's Anti-Bot Challenge

In this case, Amazon's anti-bot protection blocked your request. Many websites have security measures in place to protect themselves against malicious bots. While our scraper is not malicious, Amazon's protection systems can mistake it for one. Let's see how to deal with this.

You can confirm this by checking the source of the response:

```python
print(response.text)
```

This will show the HTML source of the response as plain text. However, if you want to see how it renders, just save the HTML document locally and open it in a browser (a minimal sketch of this follows the screenshot). It will look similar to the one below:

\"\"<\/figure>\n\n\n\n

As mentioned above, we got a 503 error. Not to worry: this is a simple bot-protection challenge that you can bypass by using a User-Agent header.

### Adding a User-Agent

So, let's add a custom header. You'll have to create a headers dict with the User-Agent string and pass it to the GET method:

```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

response = requests.get(page, headers=headers)
print(response.status_code)
print(response.text)
```

This time, you get a status code of 200, which indicates everything worked well. However, once you check the HTML source code, you'll notice a CAPTCHA test, which Amazon employs as a more sophisticated anti-bot measure.

\"\"<\/figure>\n\n\n\n

Bypassing this CAPTCHA challenge is a lot more difficult. You'll have to use a good-quality proxy pool and rotate the proxies from time to time. Additionally, you should consider adding more headers to simulate a real web browser, as sketched below.
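As a rough illustration of both ideas, here's a minimal sketch; note that the proxy URLs below are placeholders, not real endpoints:

```python
import random

# Placeholder proxy endpoints -- substitute your own pool
proxy_pool = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
]

# Extra headers that make the request resemble a real browser session
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Rotate by picking a different proxy for each request
proxy = random.choice(proxy_pool)
response = requests.get(page, headers=headers,
                        proxies={'http': proxy, 'https': proxy})
```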

### Adding a Proxy

Manually handling multiple proxies, rotating them, and maintaining your code can become quite tedious, especially once you start scaling your scraper. Thankfully, there are many solutions out there that take away the hassle of doing all that.

For example, the Oxylabs Web Unblocker proxy solution is among the best-performing ones. It will automatically manage the proxy pool, headers, cookies, and other browser parameters for you, so you don't have to worry about getting blocked. You can also sign up and try it out for free before committing.

To integrate it, you need to configure the code as shown below. Note that you will need the Web Unblocker credentials that you get upon registering.

```python
proxy = 'http://{}:{}@unblock.oxylabs.io:60000'.format('USERNAME', 'PASSWORD')

proxies = {
    'http': proxy,
    'https': proxy
}

response = requests.get(page, proxies=proxies, verify=False)
```

You'll have to replace USERNAME and PASSWORD with your sub-user credentials. Also, when creating a network request, you'll have to pass the additional parameter verify=False, as shown above, which disables SSL certificate verification.
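One side effect worth knowing: with verify=False, the urllib3 library underneath requests emits an InsecureRequestWarning on every call. If the noise bothers you, it can be silenced:

```python
import urllib3

# verify=False disables TLS certificate checks, so urllib3 warns on
# every request; this suppresses that specific warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```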

## Extracting Product Information

If you run the code, the response object will now hold the HTML source code of the Amazon product page. Before you begin parsing the product information, you'll have to inspect the target elements using a web browser. To do that:

1. Open the product link in a web browser.
2. Right-click on the page.
3. Select **Inspect**.

That's the view you should get:

    \"\"<\/figure>\n\n\n\n

Now, let's use Beautiful Soup to parse this HTML content and extract the elements:

```python
data = []
soup = BeautifulSoup(response.content, 'html.parser')
```

### Product Title

[Screenshot: the product title element highlighted in developer tools]

Beautiful Soup will parse the HTML and create a soup object. Using this object, you can extract the product title. Carefully inspect the product page again:

Notice that the title has the property id="productTitle". Using this property, we can select it as shown below:

```python
title = soup.find('span', {'id': 'productTitle'}).text
```
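If you want the code to tolerate missing elements, a slightly more defensive variant might look like this; strip() also removes the whitespace that Amazon pads around the title:

```python
# find() returns None when the element is missing, so guard before
# accessing .text; strip() trims the surrounding whitespace
title_tag = soup.find('span', {'id': 'productTitle'})
title = title_tag.text.strip() if title_tag else None
```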

### Product Price

Next, let's grab the product price. Inspect the price element using the browser:

    \"\"<\/figure>\n\n\n\n

As you can see, the price is in a span element, wrapped in another span element with the class a-text-price:

```python
price = soup.find('span', {'class': 'a-text-price'}).find('span').text
```
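Keep in mind that the extracted value is a display string such as '$1,099.00' (assuming a USD listing). If you need a numeric value, you can normalize it under that assumption:

```python
# Assumes a USD-style price string such as '$1,099.00'; strips the
# currency symbol and thousands separators before converting
price_value = float(price.replace('$', '').replace(',', ''))
```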

### Product Rating

Similarly, you can extract the total number of product ratings:

```python
total_ratings = soup.find('span', {'id': 'acrCustomerReviewText'}).text
```
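This, too, comes back as a display string, typically in a form like '1,234 ratings'. Assuming that format, converting it to an integer is straightforward:

```python
# Assumes a string like '1,234 ratings': take the leading token and
# drop the thousands separators before converting
ratings_count = int(total_ratings.split()[0].replace(',', ''))
```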

Then, you can use the following line of code to extract the product rating score:

```python
rating = soup.find(
    'a', {'class': 'a-popover-trigger a-declarative'}
).find('span', {'class': 'a-size-base a-color-base'}).text
```

## Storing Data in a CSV File

Once you're done selecting the elements you want to extract, let's get all this information into a usable format.

Using pandas' DataFrame object, let's export the data to a CSV file with the following lines of code. Since you don't need an index, set index to False.

```python
data.append({
    'title': title,
    'price': price,
    'total ratings': total_ratings,
    'rating': rating
})

df = pd.DataFrame(data)
df.to_csv('amazon_product_data.csv', index=False)
```
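To sanity-check the export, you can read the file straight back:

```python
# Read the exported CSV back and print the first few rows
df = pd.read_csv('amazon_product_data.csv')
print(df.head())
```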

## Full Source Code

You can also modify the code to extract multiple products by using a list of product URLs and a simple for loop. Note that Web Unblocker manages headers automatically, so you don't need to pass additional HTTP headers. The full source code is given below:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

proxy = 'http://{}:{}@unblock.oxylabs.io:60000'.format('USERNAME', 'PASSWORD')

proxies = {
    'http': proxy,
    'https': proxy
}

pages = [
    'https://www.amazon.com/iPhone-Pro-Max-128GB-Gold/dp/B0BGYDDWDF/'
]

data = []
for page in pages:
    response = requests.get(page, proxies=proxies, verify=False)
    soup = BeautifulSoup(response.content, 'html.parser')

    title = soup.find('span', {'id': 'productTitle'}).text
    price = soup.find('span', {'class': 'a-text-price'}).find('span').text
    total_ratings = soup.find('span', {'id': 'acrCustomerReviewText'}).text
    rating = soup.find(
        'a', {'class': 'a-popover-trigger a-declarative'}
    ).find('span', {'class': 'a-size-base a-color-base'}).text

    data.append({
        'title': title,
        'price': price,
        'total ratings': total_ratings,
        'rating': rating
    })

df = pd.DataFrame(data)
df.to_csv('amazon_product_data.csv', index=False)
```

## Conclusion

Hopefully, this step-by-step guide has equipped you with the necessary skills to navigate the Amazon website, extract product data, and overcome anti-bot challenges. Having the ability to gather data from Amazon opens up a world of possibilities for market research, competitor analysis, pricing optimization, and much more!
