How to Scrape Amazon Influencer Links Using Python?

Lomanu4 (forum staff, administrator)

Introduction


When trying to scrape data from Amazon using an influencer account, many developers face challenges retrieving dynamic content such as banners. This article addresses the issue of extracting href attributes from the SiteStripe banner on Amazon product pages, helping you understand why certain information might be missing and how to overcome it.

Understanding the Problem


When you are logged in to an Amazon account designated for influencers, the SiteStripe functionality adds a banner at the top of product pages. This banner is often populated with data such as commission information and links relevant to the product. However, scraping these elements can be tricky: parts of the banner may be missing or inconsistent depending on your authentication state and on how much of the HTML is rendered dynamically.

The primary reasons for discrepancies between the scraped HTML and what’s rendered in your browser include:

  • JavaScript Rendering: Many modern web applications, including Amazon, use JavaScript to manipulate the DOM. If the content you’re trying to scrape is generated by JavaScript after the initial page load, an HTTP client such as requests never receives it, and a parser like BeautifulSoup can only see the HTML that was downloaded.
  • Session and Cookies: Amazon, like many websites, can respond differently depending on the user’s session and cookies. If the scraper isn’t authenticated correctly or the session has expired, it may receive a different response with data missing.

Step-by-Step Scraping Solution


To scrape the Amazon influencer link effectively, you’ll need to ensure that:

  1. You are authenticated and retain your session cookies throughout the scraping process.
  2. You account for possible dynamic content.

1. Authenticating Your Session


Start by collecting your session information as follows:

import requests
from bs4 import BeautifulSoup, SoupStrainer

product = '...'  # product page URL; hidden behind a login wall in the source post

cookies = {
'skin': 'noskin',
'csm-hit': 'tb:TR5FV79EYEZ70CT815Y6+s-TR5FV79EYEZ70CT815Y6|1747257991197&t:1747257991197&adb:adblk_no',
'session-id': '138-5859344-6659632',
'session-id-time': '2082787201l',
'ubid-main': '132-5344427-5619030',
'sid': 'F06u+aHuY2eGLM35scN8Mg==|UOluFodvJ/FBZsPpRrWVfL2nFQN1d+qggoOgSwB+ACo=',
'session-token': '9ITCA059K4E7BLPJEWLndXO+5zm725rJrZkWF1823CyhCdfwzEBbADzlly2lYfN2UI/Ui8KXxMZnD7oMw273f1nH06u1DtX9vAIfkaBEl7dDNaN8H6jP8D00pgWQJJVYPt5jxKUObAFCkELmhiuZ6ianvHwmrZQjqtPBSpOMbkauUqCQE1sVizDXw4HsTMZUqntmjiN3q7smtyZvkP1T73v7LddXIdSW/1sK4pzDZJpSgJHiQtd0RohbnJPdMf2kkR4ZHu6HTkI4ziluHmPEy+hTNdg8D3sIpVzjEAsIFzCZ3OO3xw2UVeFxBJYlJ3t/LXOSBcSmhiOeX6JhhIuZRX4ScY6j/9UW9jr4qZOOht50Qcr8b3WmhPptdHhLHZEt',
}

headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:138.0) Gecko/20100101 Firefox/138.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Referer': '...',  # referer URL; hidden behind a login wall in the source post
'Connection': 'keep-alive',
}

# Fetch the product page with the saved cookies and browser-like headers
response = requests.get(product, cookies=cookies, headers=headers, timeout=30)
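
If more than one page is fetched, a requests.Session can resend the same cookies and headers on every call and keep any cookies the server updates along the way. The sketch below assumes the cookies are copied from the browser's developer tools as a raw Cookie header string; cookies_from_header and make_session are hypothetical helper names, not part of requests.

```python
def cookies_from_header(cookie_header: str) -> dict[str, str]:
    """Turn a raw 'Cookie:' header value into a dict usable by requests."""
    cookies = {}
    for part in cookie_header.split(';'):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = part.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


def make_session(cookies: dict[str, str], headers: dict[str, str]):
    """Build a requests.Session that resends the same cookies and headers."""
    import requests  # third-party; imported here so the parser above has no dependencies

    session = requests.Session()
    session.headers.update(headers)
    session.cookies.update(cookies)
    return session
```

With these helpers, session = make_session(cookies_from_header(raw_header), headers) followed by session.get(product, timeout=30) would take the place of the single requests.get call above.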

2. Parsing the HTML


Once you have the correct response, use BeautifulSoup to extract relevant elements. To only get the target div, modify your parsing code as follows:

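# Parse only the SiteStripe campaign container instead of the whole page
only_campaign = SoupStrainer('div', {'id': 'amzn-ss-cc-campaign-container'})
html = BeautifulSoup(response.text, 'lxml', parse_only=only_campaign)
print(html)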
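
To see whether the static HTML actually carries the link, the container can be checked for anchors that have an href. The following is a minimal, dependency-free sketch built on the standard library's html.parser rather than BeautifulSoup; container_links is a hypothetical helper. With the BeautifulSoup object above, html.find_all('a', href=True) gives comparable information.

```python
from html.parser import HTMLParser


class _ContainerLinks(HTMLParser):
    """Collect href values of <a> tags nested inside one <div id=...>."""

    def __init__(self, container_id: str):
        super().__init__()
        self.container_id = container_id
        self.depth = 0       # > 0 while inside the target div
        self.found = False   # True once the target div has been seen
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag == 'div':
                self.depth += 1
            elif tag == 'a' and attrs.get('href'):
                self.hrefs.append(attrs['href'])
        elif tag == 'div' and attrs.get('id') == self.container_id:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1


def container_links(
    html: str, container_id: str = 'amzn-ss-cc-campaign-container'
) -> tuple[bool, list[str]]:
    """Return (container_found, hrefs_inside_container)."""
    parser = _ContainerLinks(container_id)
    parser.feed(html)
    return parser.found, parser.hrefs
```

A result of (False, []) suggests the banner was not served at all, which points to the session; (True, []) suggests the container arrived but its links are filled in later by JavaScript.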

3. Handling Missing Data


If details such as the campaign URL (the href) are missing from the parsed container, JavaScript may be filling them in after the page loads. In that case, consider using Selenium to drive a real browser: it executes the page's JavaScript, so you can read the rendered HTML:

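from selenium import webdriver

# Launch a real browser so the page's JavaScript runs before the HTML is read
browser = webdriver.Chrome()
browser.get(product)
html = browser.page_source  # HTML after rendering
browser.quit()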


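Because the browser executes the page's JavaScript, page_source reflects the rendered DOM rather than the initial response, and you can pass it to BeautifulSoup to extract the information. Content that loads asynchronously may not be present the instant the page opens.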
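
A freshly launched browser does not share the cookies used with requests, so it has to be authenticated as well. One possible approach, sketched below under the assumption that the same cookies dict is still valid, is to load the site's origin, add the cookies, then open the product page and wait for the container to appear. fetch_rendered_html and to_selenium_cookies are hypothetical helpers, and the 15-second timeout is an arbitrary choice.

```python
from urllib.parse import urlsplit


def to_selenium_cookies(cookies: dict[str, str]) -> list[dict[str, str]]:
    """Convert a requests-style cookie dict to the dicts add_cookie() expects."""
    return [{'name': name, 'value': value} for name, value in cookies.items()]


def fetch_rendered_html(
    product: str,
    cookies: dict[str, str],
    container_id: str = 'amzn-ss-cc-campaign-container',
    timeout: float = 15,
) -> str:
    """Open the page in Chrome with the given cookies and return rendered HTML."""
    # Third-party imports kept local so the converter above has no dependencies.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    parts = urlsplit(product)
    origin = f'{parts.scheme}://{parts.netloc}/'

    browser = webdriver.Chrome()
    try:
        # Cookies can only be added for the domain that is currently loaded.
        browser.get(origin)
        for cookie in to_selenium_cookies(cookies):
            browser.add_cookie(cookie)
        browser.get(product)
        # Raises TimeoutException if the container never shows up.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.ID, container_id))
        )
        return browser.page_source
    finally:
        browser.quit()
```

The try/finally ensures the browser is closed even when the wait times out, which is worth having once the script runs over many products.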

Frequently Asked Questions

Why does the scraped result differ from the browser view?


Dynamic elements loaded by JavaScript will not be present in the static HTML fetched via requests.

How can I avoid missing elements after being logged out?


Maintain your session cookies properly, and check the response to ensure you have a valid session.
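
One way to check the response is to look at where the request ended up and whether the expected container is in the body. The sketch below assumes that an expired session redirects to a URL containing /ap/signin and that a signed-in product page includes the amzn-ss-cc-campaign-container id. Both are assumptions to verify against your own responses, and session_looks_valid is a hypothetical helper.

```python
def session_looks_valid(
    final_url: str,
    html: str,
    container_id: str = 'amzn-ss-cc-campaign-container',
    signin_marker: str = '/ap/signin',  # assumed sign-in path, not confirmed by the source
) -> bool:
    """Heuristic check that a response came from a signed-in session."""
    if signin_marker in final_url:
        # The request was redirected to a login page.
        return False
    # The SiteStripe container is only expected for a signed-in influencer account.
    return container_id in html
```

With requests, this would be called as session_looks_valid(response.url, response.text).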

Should I switch to Selenium for scraping?


If you face issues with dynamically loaded content, then switching to Selenium may resolve many of these problems, albeit at the cost of performance.
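
To limit that cost, the browser can be reserved for pages where the plain request comes back without links. A minimal sketch of that fallback, with the fetching and extraction steps passed in as functions (all names here are illustrative):

```python
from typing import Callable


def links_with_fallback(
    fetch_static: Callable[[], str],
    fetch_rendered: Callable[[], str],
    extract: Callable[[str], list[str]],
) -> tuple[list[str], str]:
    """Try the cheap static fetch first; render in a browser only if needed."""
    links = extract(fetch_static())
    if links:
        return links, 'static'
    # The static HTML had no links, so pay for a real browser once.
    return extract(fetch_rendered()), 'rendered'
```

The second element of the result records which path produced the links, which helps when estimating how often the slower browser route is actually needed.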

Conclusion


In summary, scraping Amazon influencer links can be challenging due to dynamic content and session handling. By using proper session management and considering tools like Selenium where necessary, you can successfully access the data you need. This solution balances efficiency and functionality, allowing you to continue using Python for your web scraping projects.

