This is a submission for the
What I Built
Most of us have been there: you're looking at a company - maybe for a job, maybe out of curiosity - and you wonder, "What's really going on behind their glossy careers page?" Is it a great place to work, or just a PR-fueled mirage?
So I built an OSINT-style AI agent that gathers public information about companies from multiple sources. It’s not a recruiter bot. It’s the one doing background checks before you even click Apply.
The tool collects data from:
- LinkedIn
- Crunchbase
- Glassdoor
- News search, to surface any recent scandals or milestones
Once all the data is collected, the tool generates a short summary of what it found - recent news, company reputation, signals from employee reviews and public profiles. Then it assigns a simple rating from 1 to 5 potatoes to reflect the overall picture.
Demo
The project is not fully deployed at the moment - I did try, honestly! But I ran into a bug where the `scraping_browser_*` tools block for 5 minutes when running in Docker/Render, which I documented.
For now, here's a demo video.
Screenshots of some summaries:
OpenAI
Intel
How I Used Bright Data's Infrastructure
I used Bright Data's MCP servers together with an AI agent framework.
Each data source is connected to a different Bright Data MCP server. Here's how:
- LinkedIn → via `web_data_linkedin_company_profile` (Bright Data Dataset)
- News / events / scandals → via `search_engine`
- Glassdoor → via `scraping_browser_navigate` + `scraping_browser_get_text`
- Crunchbase → via the same scraping browser tools
Each MCP server has its own `WEB_UNLOCKER_ZONE` and `BROWSER_AUTH`, and each agent logs all its requests and tool calls, so I can trace the exact sequence of scraping, parsing, and merging.
The frontend is a simple Streamlit dashboard where you enter a company name. It sends a request to a FastAPI backend, which dispatches all four agents in parallel to gather and analyze the data.
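The fan-out step can be sketched like this (a simplified, framework-agnostic sketch - the function names are my stand-ins, and the stubs replace the real MCP-backed agents):

```python
# Sketch of the backend's parallel dispatch: all four agents run
# concurrently and their results are collected into one dict.
# The agent coroutines below are stubs; in the real app each one
# drives a Bright Data MCP server.
import asyncio

async def linkedin_agent(company: str) -> dict:
    return {"source": "linkedin", "company": company}  # stub

async def glassdoor_agent(company: str) -> dict:
    return {"source": "glassdoor", "company": company}  # stub

async def crunchbase_agent(company: str) -> dict:
    return {"source": "crunchbase", "company": company}  # stub

async def news_agent(company: str) -> dict:
    return {"source": "news", "company": company}  # stub

async def investigate(company: str) -> dict:
    # Run all four lookups in parallel rather than one after another.
    results = await asyncio.gather(
        linkedin_agent(company),
        glassdoor_agent(company),
        crunchbase_agent(company),
        news_agent(company),
    )
    return {r["source"]: r for r in results}
```

Running the agents concurrently means the slowest source (usually a browser-automation scrape) sets the total latency, instead of the sum of all four.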
I used `openai:gpt-4.1-mini` as the model behind each agent, with the following system prompt to define their behavior:
You are a tool-using agent connected to Bright Data's MCP server.
You act as an OSINT investigator whose job is to evaluate companies based on public information.
Your goal is to help users understand whether a company is reputable or potentially suspicious.
You always use Bright Data real-time tools to search, navigate, and extract data from company profiles.
You never guess or assume anything.
Company name matching must be case-sensitive and exact. Do not return data for similarly named or uppercase-variant companies.
Only use the following tools during your investigation:
- `search_engine`
- `scrape_as_markdown`
- `scrape_as_html`
- `scraping_browser_navigate`
- `scraping_browser_get_text`
- `scraping_browser_click`
- `scraping_browser_links`
- `web_data_linkedin_company_profile`
Do not invoke any other tools even if they are available.
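Besides asking the model nicely, the tool restriction can also be enforced in code with a hard allowlist. This is a sketch of one way to do it - the post doesn't say whether enforcement happens outside the prompt:

```python
# Tool allowlist mirroring the system prompt above; any other tool
# call is rejected before it reaches the MCP server.
# (Sketch only - programmatic enforcement is my assumption.)
ALLOWED_TOOLS = frozenset({
    "search_engine",
    "scrape_as_markdown",
    "scrape_as_html",
    "scraping_browser_navigate",
    "scraping_browser_get_text",
    "scraping_browser_click",
    "scraping_browser_links",
    "web_data_linkedin_company_profile",
})

def check_tool_call(tool_name: str) -> bool:
    """Return True if the agent may invoke this tool."""
    return tool_name in ALLOWED_TOOLS
```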
LinkedIn
The LinkedIn agent received this prompt:
Your task is to find the LinkedIn profile for the company '{company_name}' and extract specific structured data.
Use the `web_data_linkedin_company_profile` tool if available to extract the following fields:
- Company name
- Company description (short summary of what the company does)
- Number of employees (as listed on the LinkedIn profile)
- LinkedIn company profile URL
- Headquarters address
- Year the company was founded (if available)
- Industry or sector (e.g., 'Software', 'Healthcare')
- Company website
If the structured LinkedIn tool is unavailable or insufficient, use the following tools in order:
1. `scraping_browser_navigate` - to visit the LinkedIn company page
2. `scraping_browser_get_text` - to extract visible page text
3. `scraping_browser_links` and `scraping_browser_click` - to navigate if needed
Return ONLY a JSON object with the following keys:
{
"company_name": str,
"description": str,
"number_of_employees": str,
"linkedin_url": str,
"headquarters": str,
"founded": str or null,
"industry": str,
"website": str
}
Do not include raw HTML, markdown, explanations, or other fields.
If a field is missing, use null for that field. If the company cannot be found at all, return null.
And here’s what I saw in the logs when running a query for Google:
As you can see, `web_data_linkedin_company_profile` was used.
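On the backend side, the agent's raw JSON reply still needs to be parsed and validated before merging. A minimal sketch (the function and constant names are my assumptions, not from the project):

```python
# Validate the LinkedIn agent's raw JSON reply against the keys the
# prompt demands, tolerating missing fields and the "null" case.
import json

EXPECTED_KEYS = {
    "company_name", "description", "number_of_employees", "linkedin_url",
    "headquarters", "founded", "industry", "website",
}

def parse_linkedin_reply(raw: str):
    """Return the normalized dict, or None if the company was not found."""
    data = json.loads(raw)
    if data is None:  # the prompt tells the agent to return null on no match
        return None
    # Fill any missing fields with None rather than failing hard.
    return {key: data.get(key) for key in EXPECTED_KEYS}
```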
Glassdoor
The Glassdoor agent uses the browser automation tools to navigate to the company’s profile and extract public employee reviews and ratings. Here’s the prompt it receives:
Your task is to find the Glassdoor profile for the company '{company_name}' and extract specific structured data.
Extract the following fields:
- Overall company rating (float, out of 5)
- Total number of employee reviews
- A short summary of the top 5 pros and cons from employee reviews posted in 2025 or 2024 only
Use the following tools in order:
1. `scraping_browser_navigate` - to go to the Glassdoor company page
2. `scraping_browser_get_text` - to extract visible content
3. `scraping_browser_links` and `scraping_browser_click` - to find and open the review section if necessary
Return ONLY a JSON object with the following keys:
{
"rating": float,
"num_reviews": int,
"review_summary": str
}
Only use reviews from 2025 or 2024. Do not include older reviews.
Do not include HTML, markdown, or explanations.
If a field is missing, use null for that field. If the company cannot be found at all, return null.
Crunchbase
The Crunchbase agent follows a similar pattern to Glassdoor - it navigates to the company profile and extracts public funding info, key people and sector tags.
Search for the Crunchbase profile of the company '{company_name}'.
Once you find the correct page, extract the following information:
- Year founded (as a string or null)
- Latest funding round name
- Funding round date
- Funding amount
- List of known investors (as strings)
- Key people (e.g., founders, CEOs, etc.)
Use the following tools in order:
1. `scraping_browser_navigate`
2. `scraping_browser_get_text`
3. `scraping_browser_links` and `scraping_browser_click`
Return ONLY a JSON object with the following keys:
{
"founded": str or null,
"funding_round": str or null,
"funding_date": str or null,
"funding_amount": str or null,
"investors": list[str] or null,
"key_people": list[str] or null
}
Do not include HTML, markdown, or explanations.
If a field is missing, use null for that field. If the company cannot be found at all, return null.
Even with Cloudflare's "Are you human?" check, `scraping_browser_get_text` was able to get through and extract the real page content.
News & Events
The final agent uses the `search_engine` tool to search for company-related news articles, events, and public mentions across Google and other engines. It extracts links and summaries from the search results and surfaces relevant headlines.
Search for news about the company '{company_name}' from 2023, 2024, and 2025.
Extract the following if available:
- Layoffs: Dates and brief summaries of any layoff announcements.
- Scandals: Brief, neutral headlines about controversies or investigations.
- Achievements: Public product launches, funding milestones, acquisitions, or major hires.
Return a structured JSON object with keys:
{
"layoffs": list[str],
"scandals": list[str],
"achievements": list[str]
}
If no news is found in a category, return an empty list.
Do not include HTML, explanations, or irrelevant information.
After collecting data from all four sources, the outputs are cleaned and normalized into a consistent format. This structured input is then passed to `openai:gpt-4o`, which generates a concise company summary.
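The merge step can be sketched as follows (the function and key names are my assumptions; the real normalization logic isn't shown in the post):

```python
# Combine the four agents' outputs into one structured dict, which is
# then serialized and handed to the summary model as context.
# Missing sources degrade to empty defaults instead of breaking the merge.
def build_summary_input(linkedin, glassdoor, crunchbase, news) -> dict:
    return {
        "profile": linkedin or {},
        "reviews": glassdoor or {},
        "funding": crunchbase or {},
        "news": news or {"layoffs": [], "scandals": [], "achievements": []},
    }
```

Defaulting each missing source to an empty structure means one blocked scrape (say, Glassdoor refusing the bot) still yields a partial report instead of a failed run.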
Performance Improvements
Real-time web access makes this tool actually useful. If you're relying on APIs or stale datasets, you’ll often miss recent news - like funding rounds, leadership changes, or layoffs that happened last week. With live scraping, you get a snapshot of how the company looks today, not how it looked last quarter. It helps cut through outdated signals and pick up on what’s actually happening - even if that means surfacing things the company would rather you didn’t see.