The Final Boss of Web Scraping: A Streamlit-Powered, Multi-Page Ethical Scraper

"Most scrapers stop at the first page. But this one doesn't know when to quit."
— Probably you, after building this.

Introduction
Web scraping is one of the most powerful tools in any developer’s toolkit — from collecting product prices and news articles to monitoring SEO tags or academic citations. But most beginner tutorials stop at scraping a single page.

Today, we go full Final Boss mode.
You’ll learn how to build a smart, ethical, and multi-page web scraper wrapped in a beautiful Streamlit app.

What this project does:

- Scrapes headings, paragraphs, images, and links.
- Crawls multiple internal pages recursively.
- Allows keyword filtering.
- Respects robots.txt.
- Saves everything to CSV.
- Features progress bars and live feedback.
- Has an intuitive Streamlit UI.

Let’s dive in.

Tech Stack
- Python
- Requests – for HTTP requests
- BeautifulSoup – for parsing HTML
- Streamlit – for the interactive UI
- csv module – for saving scraped data
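All three third-party packages are on PyPI, so they can usually be installed with pip install requests beautifulsoup4 streamlit.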

How It Works
Here's a high-level look at the logic:

1. You enter a URL in the Streamlit app.
2. The scraper:
   - Checks whether scraping is allowed via robots.txt
   - Fetches the HTML of the page
   - Extracts the key elements (headings, paragraphs, images, links)
3. Optionally, it:
   - Crawls internal links recursively (within the same domain)
   - Filters content based on a keyword
   (these two steps are not part of the core listing below; a sketch of them follows after it)
4. Results are displayed in the Streamlit app and saved to a CSV file.


import csv
import os
from itertools import zip_longest
from urllib.parse import urlparse

import requests
import streamlit as st
from bs4 import BeautifulSoup


# Function to check if scraping is allowed on a website
def is_scraping_allowed(url):
    try:
        # Build the robots.txt URL from the site root, not from the full page path
        parsed = urlparse(url)
        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
        response = requests.get(robots_url, timeout=10)

        if response.status_code == 200:
            # Simplified check: a blanket "Disallow: /" is treated as "no scraping"
            if "Disallow: /" in response.text:
                return False
            return True
        # No readable robots.txt; assume scraping is allowed
        return True
    except Exception:
        return True


# Function to scrape the website and extract content
def scrape_website(url):
    if not is_scraping_allowed(url):
        st.error("Scraping is disallowed on this site.")
        return None

    try:
        # Send GET request to the webpage
        response = requests.get(url, timeout=10)

        if response.status_code != 200:
            st.error(f"Failed to retrieve webpage. Status code: {response.status_code}")
            return None

        # Parse the content with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extracting headings
        headings = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
        headings_text = [heading.get_text(strip=True) for heading in headings]

        # Extracting paragraphs
        paragraphs = soup.find_all('p')
        paragraphs_text = [para.get_text(strip=True) for para in paragraphs]

        # Extracting links
        links = soup.find_all('a', href=True)
        links_list = [link['href'] for link in links]

        # Extracting image URLs
        images = soup.find_all('img', src=True)
        images_list = [image['src'] for image in images]

        # Save the data to a CSV file
        save_to_csv(headings_text, paragraphs_text, links_list, images_list)

        return headings_text, paragraphs_text, links_list, images_list

    except Exception as e:
        st.error(f"Error during scraping: {e}")
        return None


# Function to save the data into a CSV file
def save_to_csv(headings, paragraphs, links, images):
    filename = 'scraped_data.csv'

    # Check if the file exists; if so, append data
    file_exists = os.path.isfile(filename)

    with open(filename, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        if not file_exists:
            # Write the header row if it's a new file
            writer.writerow(['Heading', 'Paragraph', 'Link', 'Image URL'])

        # Pad the shorter lists so no scraped items are silently dropped
        for heading, paragraph, link, image in zip_longest(headings, paragraphs, links, images, fillvalue=''):
            writer.writerow([heading, paragraph, link, image])


# Streamlit UI setup
st.title("Web Scraper with Streamlit")

# Input for URL
url = st.text_input("Enter the URL to scrape:")

if url:
    # Scrape the website and get the results
    results = scrape_website(url)
    if results:
        headings, paragraphs, links, images = results
        if headings:
            st.subheader("Headings")
            st.write(headings)
        if paragraphs:
            st.subheader("Paragraphs")
            st.write(paragraphs)
        if links:
            st.subheader("Links")
            st.write(links)
        if images:
            st.subheader("Image URLs")
            st.write(images)
        st.write("Data saved to 'scraped_data.csv'")

⚠ Ethical Reminder
Scraping is powerful, but always respect robots.txt and never overload a server.
Use time delays, user-agent headers, and never scrape private or login-protected areas.
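
As a concrete illustration of the delay and user-agent advice, a polite request helper might look like this sketch (the User-Agent string, contact address, and 2-second delay are placeholder values, not requirements):

import time

import requests

# Identify the scraper honestly and pause between requests (values are illustrative)
HEADERS = {"User-Agent": "MyResearchScraper/1.0 (contact: you@example.com)"}
REQUEST_DELAY_SECONDS = 2


def polite_get(url):
    """Fetch a URL with a custom User-Agent and a fixed delay to avoid hammering the server."""
    time.sleep(REQUEST_DELAY_SECONDS)
    return requests.get(url, headers=HEADERS, timeout=10)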

Possible Improvements:
Want to take it even further?
- Add sitemap.xml support to find all internal pages.
- Integrate a headless browser like Selenium or Playwright for JavaScript-heavy sites.
- Store data in a MongoDB or SQLite database (see the sketch below).
- Add domain blocking and rate limiting.
- Deploy with Streamlit Cloud.
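
For the SQLite option, a drop-in replacement for the CSV step might look like this sketch (the save_to_sqlite name, the pages table, and its columns are illustrative):

import sqlite3
from itertools import zip_longest


def save_to_sqlite(headings, paragraphs, links, images, db_path="scraped_data.db"):
    """Store scraped items in a local SQLite database instead of a CSV file."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (heading TEXT, paragraph TEXT, link TEXT, image_url TEXT)"
        )
        # Pad the shorter lists so no scraped items are silently dropped
        rows = zip_longest(headings, paragraphs, links, images, fillvalue='')
        conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", rows)
        conn.commit()
    finally:
        conn.close()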

Conclusion
You’ve just built a Final Boss-level web scraper — no more toy examples.
With keyword filters, recursion, and a live UI, you’ve taken scraping to the next level.

Whether you're building a research tool, monitoring content, or just flexing your skills — this scraper gives you a powerful base to expand from.

Feedback?
Got ideas to improve it? Questions about deployment? Drop a comment or fork the code!
Follow me on Dev.to for more Python + AI + DevTool content!

