Diffbot
Diffbot provides AI-powered tools to extract and structure data from web pages, transforming unstructured web content into structured, linked data.
Common use cases
- Extract competitor product specs and pricing
- Pull article metadata for content audits
- Scrape event details from venue pages
- Archive forum threads with nested replies
- Build datasets from bulk URL lists
Integration
- Vendor: Diffbot
- Category: other
- Auth: API_KEY
- Tools: 14
- Composio slug: diffbot
Tools
- Diffbot Analyze
Tool to automatically determine a page's content type and route it to the appropriate extraction API. Use when you have only a URL and need Diffbot to choose the right extractor.
- Diffbot Get Event
Tool to extract event details from web pages. Use when you need structured event data such as venue, date, and description.
- Diffbot Get Image
Tool to extract detailed information about images, including dimensions and recognition data. Use after confirming the image URL is publicly accessible.
- Diffbot Get Product
Tool to extract product information such as specifications, prices, availability, and reviews. Use when you need structured product data.
- Diffbot Search
Tool to search data extracted by crawl or bulk jobs using DQL queries. Use after data extraction jobs complete to retrieve search results.
- Get Article Data
Tool to extract information from articles, including authors, publication dates, and images. Use when you need structured metadata from a web article URL.
- Get Diffbot Account Details
Tool to retrieve account details, including plan information and usage statistics. Use after authenticating to verify subscription and daily quota status.
- Get Discussion Thread
Tool to extract threads of content from forums, comment sections, and review pages. Use when you need structured discussion data from a web page after identifying the discussion URL.
- Get Video Data
Tool to extract information from videos, including titles, descriptions, and embedded HTML. Use when you need structured video metadata from any web page.
- List Bulk Jobs
Tool to list all bulk jobs associated with a specific token. Use after authenticating to retrieve the status of every job on the account.
- Resolve Lost ID
Tool to resolve lost IDs in the Knowledge Graph. Use when you need to map a lost identifier to its canonical counterpart for data consistency.
- Start Bulk Job
Tool to start a bulk extract job. Use when processing large numbers of URLs asynchronously.
- Start Crawl Job
Tool to spider a site for links and process them with the Extract API into a single collection. Use when you have seed URLs and want to collect structured data across a site. Requires a Plus plan for Crawl API access.
- Stop Bulk Job
Tool to stop a running bulk job. Use when you need to halt further processing of URLs in a job in progress. Invoke only after confirming the job ID to avoid accidental stoppage.
Setup
Setup guide
1. In Switchy, open your workspace settings and navigate to the Integrations tab.
2. Click 'Add Integration' and select Diffbot from the list.
3. Go to diffbot.com, log in to your account, and copy your API token from the dashboard.
4. Paste the token into Switchy's Diffbot connection form and click 'Connect'.
5. Switchy will verify the token by fetching your account details and displaying your plan type and remaining quota.
6. Open any Space, type '@Diffbot analyze https://example.com/article' and send the message.
7. Diffbot will return structured data for that page: if it's an article you'll see headline, author, and publish date; if it's a product you'll see price and specs.
8. Check the response to confirm extraction worked and your quota decremented by one request.
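The token check in step 5 can be approximated outside Switchy. The following is a minimal sketch, not Switchy's actual implementation: the endpoint path `https://api.diffbot.com/v4/account` is an assumption to confirm against Diffbot's documentation, and `build_account_url` and `fetch_account` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Account API lives at this path; confirm in Diffbot's docs.
ACCOUNT_ENDPOINT = "https://api.diffbot.com/v4/account"


def build_account_url(token: str) -> str:
    """Return the account-details URL with the token URL-encoded."""
    return f"{ACCOUNT_ENDPOINT}?{urllib.parse.urlencode({'token': token})}"


def fetch_account(token: str, timeout: float = 10.0) -> dict:
    """Fetch and decode account details.

    Raises urllib.error.HTTPError if the server answers with a non-2xx status.
    """
    with urllib.request.urlopen(build_account_url(token), timeout=timeout) as resp:
        return json.load(resp)


# Building the URL needs no network access; fetch_account() does.
print(build_account_url("YOUR_TOKEN"))
```

Calling `fetch_account` with a real token before a large job is one way to confirm the plan and remaining quota by hand.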
What teammates see: by default, memories from Diffbot are scoped to the Space (PROJECT visibility). You can mark any memory PRIVATE or share it ORG-wide.
Works well with
Top models
Compatibility data appears once enough Spaces have used this MCP together with a given model.
How Switchy teams use it
Starter prompts
Analyze Unknown Page
@Diffbot analyze this page and tell me what type of content it is, then extract all key fields: https://example.com/page
Extract Product Details
@Diffbot get product data from this URL including specs, pricing, and any reviews: https://example.com/product
Scrape Article Metadata
@Diffbot get article data from this URL and return the headline, author, publish date, and main image: https://example.com/article
Capture Event Information
@Diffbot get event details from this page including venue, date, and description: https://example.com/event
Archive Discussion Thread
@Diffbot get discussion thread from this URL and return all posts with replies and timestamps: https://example.com/forum/thread
Example outputs
Illustrative - representative of the model's voice and quality, not literal recordings.
@diffbot extract product details from this Amazon listing: https://www.amazon.com/dp/B08N5WRWNW and summarize the key specs, current price, and average review score.
Diffbot's Product API returned structured data for the Echo Dot (4th Gen). Key specs include: Alexa-enabled smart speaker, 1.6" front-firing driver, 3.5mm audio output, Wi-Fi connectivity. Current price: $49.99 (list price $59.99, 17% discount). Average review score: 4.7/5 stars across 127,438 reviews. The extraction also captured 8 product images, availability status (in stock), and shipping options. This structured output can be used to compare pricing across retailers or track price changes over time.
This example shows Diffbot's strength in normalizing messy e-commerce HTML into clean JSON fields. Requires a valid product URL; results depend on the page's current state (prices/reviews update frequently). The Product API consumes one API call per extraction, so batch requests carefully if analyzing multiple listings.
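For readers who want to see roughly what one extraction call looks like underneath, here is a hedged sketch. It assumes Diffbot's Extract APIs follow the URL pattern `https://api.diffbot.com/v3/<api>` with `token` and `url` query parameters. `build_extract_url` and `discount_percent` are hypothetical helpers, and the prices are the illustrative figures from the example, not live data.

```python
from urllib.parse import urlencode

# Assumption: Extract APIs follow the pattern /v3/<api>; check Diffbot's docs.
API_ROOT = "https://api.diffbot.com/v3"


def build_extract_url(api: str, token: str, page_url: str) -> str:
    """Build the GET URL for one extraction call (api is e.g. 'product')."""
    return f"{API_ROOT}/{api}?{urlencode({'token': token, 'url': page_url})}"


def discount_percent(list_price: float, current_price: float) -> int:
    """Return the discount off list price, rounded to the nearest whole percent."""
    return round((list_price - current_price) / list_price * 100)


print(build_extract_url("product", "YOUR_TOKEN", "https://example.com/product"))
print(discount_percent(59.99, 49.99))  # 17
```

Each URL built this way corresponds to one extraction request, which is why the number of listings you analyze maps directly onto quota usage.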
@diffbot I need to track tech news about AI regulation. Set up a search query using DQL to find articles published in the last 30 days mentioning 'artificial intelligence' and 'policy' from major news outlets.
Diffbot's Search tool executed a DQL query against your previously crawled collection: `type:Article date>now-30d text:"artificial intelligence" AND text:policy site:(nytimes.com OR wsj.com OR reuters.com)`. The search returned 47 matching articles. Top results include: NYT piece on EU AI Act implementation (Jan 15), WSJ analysis of US federal AI guidelines (Jan 10), and Reuters coverage of China's AI governance framework (Jan 8). Each result includes extracted author, publish date, full text, and sentiment tags. You can refine the query to add more sources or narrow by sentiment.
This example demonstrates Diffbot's ability to search previously crawled data using its query language (DQL). Requires an active Crawl or Bulk Job to have already indexed the target domains—Search won't work on fresh URLs. DQL syntax is powerful but has a learning curve; consult Diffbot's query docs for advanced filtering.
@diffbot analyze this blog post URL and tell me if it's an article, product page, or discussion thread, then extract the main content: https://techcrunch.com/2024/01/20/startup-funding-round
Diffbot's Analyze API automatically classified the URL as type: Article. Extracted metadata: Title: 'Startup Raises $50M Series B to Expand AI Platform', Author: Sarah Chen, Publish Date: January 20, 2024. The tool pulled 1,200 words of body text, 3 embedded images, and identified 5 key entities (company names, funding amounts). Analyze also detected 2 related articles linked in the sidebar. Since this was classified as an article, Diffbot routed it to the Article API for deeper extraction—no manual tool selection needed.
This showcases Analyze's 'smart routing' capability: it inspects the page structure and picks the right extraction API (Article, Product, Event, etc.) automatically. Ideal when you're unsure of a URL's content type. Note that Analyze uses one API call for classification plus one for extraction, so it's slightly less efficient than calling a specific API directly if you already know the page type.
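Once a classified response comes back, the caller still has to branch on the detected type. The sketch below shows one way to do that client-side. It assumes the response carries a `type` string and an `objects` list, and the field names `title`, `author`, and `offerPrice` are assumptions rather than a confirmed schema. The handler functions are hypothetical.

```python
from typing import Callable, Dict


def summarize_article(obj: dict) -> str:
    return f"Article: {obj.get('title', '?')} by {obj.get('author', 'unknown')}"


def summarize_product(obj: dict) -> str:
    return f"Product: {obj.get('title', '?')} at {obj.get('offerPrice', 'n/a')}"


def summarize_other(obj: dict) -> str:
    return f"Unhandled page type: {obj.get('type', 'unknown')}"


# Map each detected page type to the function that knows its fields.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "article": summarize_article,
    "product": summarize_product,
}


def route(payload: dict) -> str:
    """Pick a handler from the detected type and apply it to the first object."""
    objects = payload.get("objects") or [{}]
    first = objects[0]
    page_type = payload.get("type") or first.get("type", "other")
    return HANDLERS.get(page_type, summarize_other)(first)


sample = {
    "type": "article",
    "objects": [{"type": "article", "title": "Example headline", "author": "Sarah Chen"}],
}
print(route(sample))  # Article: Example headline by Sarah Chen
```

Unknown types fall through to `summarize_other`, so a new page type does not raise an error in the calling code.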
Use-case deep-dives
When Diffbot beats manual scraping for product intel
A 6-person growth team launching a SaaS product needs to track competitor pricing, feature lists, and review sentiment across 40 vendor sites. Diffbot's Get Product tool extracts structured specs, availability, and reviews without writing custom scrapers for each site. The team runs a weekly crawl job, then uses Diffbot Search to query the dataset with DQL filters like 'price > 50 AND rating < 4'. This works until you hit Diffbot's daily quota (check Get Diffbot Account Details first) or need real-time updates, because Diffbot crawls are batch jobs, not live APIs. If your competitor set is stable and you can wait 24 hours for fresh data, Diffbot turns a 3-day scraping project into a 20-minute setup.
When Diffbot's article extraction scales better than manual tagging
A 4-person editorial team at a B2B publisher needs to audit 800 archived articles for author attribution, publish dates, and image metadata before a CMS migration. Get Article Data pulls structured fields (authors, dates, images) from each URL in one pass, feeding a spreadsheet the team uses to flag missing bylines or broken image links. This beats manual review when you have more than 100 articles and consistent URL patterns. Diffbot struggles with paywalled content or sites that heavily obfuscate their markup, so test 10 sample URLs first. If 8 out of 10 return clean metadata, batch the rest and save your team 40 hours of copy-paste work.
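The "8 out of 10" sample check can be written down as a small gate before batching. This is a sketch under stated assumptions: the field names `author`, `date`, and `images` are hypothetical stand-ins for whatever the extraction returns, and a record counts as clean only when all three are non-empty.

```python
# Hypothetical field names; adjust to the fields your extraction returns.
REQUIRED_FIELDS = ("author", "date", "images")


def is_clean(record: dict) -> bool:
    """A record is clean when every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def sample_passes(records: list, threshold: float = 0.8) -> bool:
    """Return True when the share of clean records meets the threshold."""
    if not records:
        return False
    clean = sum(1 for record in records if is_clean(record))
    return clean / len(records) >= threshold


good = {"author": "A. Writer", "date": "2024-01-20", "images": ["lead.png"]}
bad = {"author": "", "date": "2024-01-20", "images": []}

print(sample_passes([good] * 8 + [bad] * 2))  # True
print(sample_passes([good] * 7 + [bad] * 3))  # False
```

The records that fail `is_clean` are also the ones to flag for missing bylines or broken images during the audit.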
When Diffbot event extraction beats calendar scraping
A solo community manager at a regional nonprofit tracks 30 partner organizations' event pages to cross-promote workshops and fundraisers. Diffbot Get Event extracts venue, date, and description from each partner's site, even when the HTML structure varies wildly. The manager runs a weekly crawl, then filters results by date range to build a monthly calendar. This works if your event sources use standard schema markup or consistent page layouts—Diffbot's extraction accuracy drops below 70% on heavily customized event pages. If you're aggregating fewer than 10 sources, a manual RSS feed or shared Google Calendar is simpler. Above 20 sources, Diffbot's 14 tools justify the API key cost.
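Filtering extracted events by date range for a monthly calendar might look like the sketch below. It assumes each event has already been normalized to a dict with hypothetical `title` and `start_date` keys, where `start_date` is an ISO 8601 date string. Real extraction output may use a different date format that needs parsing first.

```python
from datetime import date


def events_in_range(events: list, start: date, end: date) -> list:
    """Return events whose start_date falls within [start, end], earliest first."""
    selected = [
        event
        for event in events
        if start <= date.fromisoformat(event["start_date"]) <= end
    ]
    # ISO 8601 date strings sort chronologically when compared as text.
    return sorted(selected, key=lambda event: event["start_date"])


events = [
    {"title": "Fundraiser gala", "start_date": "2024-03-22"},
    {"title": "Grant-writing workshop", "start_date": "2024-03-05"},
    {"title": "Volunteer day", "start_date": "2024-04-02"},
]

march = events_in_range(events, date(2024, 3, 1), date(2024, 3, 31))
print([event["title"] for event in march])
# ['Grant-writing workshop', 'Fundraiser gala']
```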
Frequently asked
What does the Diffbot MCP do in Switchy?
It lets your team extract structured data from web pages (articles, products, events, images, discussion threads) without writing scrapers. Point Diffbot at a URL and it returns clean JSON with authors, prices, dates, specs, or whatever the page type contains. You can also search previously extracted data using DQL queries if you've run bulk or crawl jobs.
Do I need a paid Diffbot account to use this MCP?
Yes. You authenticate with a Diffbot API key, which means you need an active Diffbot subscription. The MCP calls Get Diffbot Account Details on setup to verify your plan and daily quota. Free trials exist, but production use requires a paid plan because Diffbot charges per API call.
Can the Diffbot MCP scrape JavaScript-heavy sites or paywalled content?
It handles JavaScript-rendered pages well, because Diffbot's extraction runs in a browser context. Paywalled content is a different story: if the page requires login credentials or a subscription to view, Diffbot can't extract it unless you provide session cookies for an authenticated session. For most public pages, it works fine.
Why use this MCP instead of calling Diffbot's API directly?
The MCP saves you from writing API client code and handling rate limits yourself. Your team can extract product specs or article metadata in a single natural-language request, and Switchy handles the HTTP calls, error retries, and quota tracking. If you already have Diffbot API scripts, keep them; if you're starting fresh, the MCP is faster.
Who on the team should connect the Diffbot MCP?
Whoever owns your Diffbot account and has the API key. That's usually someone in data, product, or engineering. Once connected, any Switchy workspace member can use the tools, but the API quota and billing tie back to the connected account — so coordinate with your Diffbot admin before running bulk extraction jobs.