
Web Scraping Best Practices with Cloud Browsers

Production-ready patterns for web scraping using cloud browsers, covering data extraction, error handling, and rate limiting.

Introduction

Web scraping with cloud browsers combines the full rendering capability of a real browser with the scalability of cloud infrastructure. Unlike HTTP-based scraping, browser-based scraping handles JavaScript-rendered content, SPAs, and dynamic loading without custom parsing logic.

When to Use Browser-Based Scraping

Use a browser when:

  • Content is rendered by JavaScript (React, Vue, Angular apps)
  • The page requires interaction (clicking, scrolling, form submission)
  • You need to handle login flows
  • Anti-bot protection requires a real browser environment
  • Content loads dynamically (infinite scroll, lazy loading)

Use HTTP requests when:

  • Content is in the initial HTML
  • Speed is critical and volume is very high
  • No JavaScript rendering is needed
  • The API is available and documented
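A quick probe can settle the choice: fetch the URL once over plain HTTP and check whether the content you need is already in the markup. A minimal sketch — the marker string and the app-shell heuristic are illustrative assumptions, not a universal rule:

```javascript
// Hypothetical helper: decide between HTTP and browser-based scraping by
// probing the server-rendered HTML for content you expect to find.
// `markerText` is any string that should appear in server-rendered output
// (a known product title, a distinctive class name, etc.).
function needsBrowser(html, markerText) {
  const hasMarker = html.includes(markerText);
  // An empty app shell like <div id="root"></div> signals client-side rendering
  const looksLikeAppShell = /<div[^>]+id=["'](root|app)["'][^>]*>\s*<\/div>/.test(html);
  return !hasMarker || looksLikeAppShell;
}
```

If the probe says the content is already there, plain HTTP requests will be faster and cheaper; otherwise fall back to a cloud browser.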

Data Extraction Patterns

CSS Selectors

const data = await page.evaluate(() => {
  const items = document.querySelectorAll('.product-card');
  return Array.from(items).map(item => ({
    title: item.querySelector('.title')?.textContent?.trim(),
    price: item.querySelector('.price')?.textContent?.trim(),
    image: item.querySelector('img')?.src,
    link: item.querySelector('a')?.href,
  }));
});

XPath

For complex DOM traversal:

// page.$x is available in Puppeteer 21 and earlier; newer versions use
// page.$$('xpath/...') with the xpath/ query prefix instead
const elements = await page.$x('//div[@class="review"]//span[@class="rating"]');
const ratings = await Promise.all(
  elements.map(el => page.evaluate(e => e.textContent, el))
);

Structured Data

Many sites embed structured data (JSON-LD, microdata):

const structuredData = await page.evaluate(() => {
  const scripts = document.querySelectorAll('script[type="application/ld+json"]');
  return Array.from(scripts).flatMap(s => {
    try {
      return [JSON.parse(s.textContent)];
    } catch {
      return []; // skip malformed JSON-LD blocks instead of aborting the page
    }
  });
});
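JSON-LD blocks often wrap entities in an `@graph` array, so flatten before filtering by `@type`. A small post-processing sketch (the helper name is ours, not a standard API):

```javascript
// Flatten @graph wrappers, then pick out entities of the requested schema.org
// type (e.g. 'Product', 'Article', 'Review').
function findByType(blocks, type) {
  const entities = blocks.flatMap(b => (Array.isArray(b['@graph']) ? b['@graph'] : [b]));
  return entities.filter(e => e['@type'] === type);
}
```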

Handling Dynamic Content

Infinite Scroll

async function scrollToBottom(page, maxScrolls = 50) {
  let previousHeight = 0;

  // Cap iterations: feeds that grow indefinitely would otherwise never break
  for (let i = 0; i < maxScrolls; i++) {
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);

    if (currentHeight === previousHeight) break;
    previousHeight = currentHeight;

    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise(r => setTimeout(r, 2000)); // wait for content to load
  }
}

Click to Load More

async function loadAllItems(page, maxClicks = 100) {
  // Cap clicks so a button that never disappears cannot hang the task
  for (let i = 0; i < maxClicks; i++) {
    const loadMore = await page.$('.load-more-button');
    if (!loadMore) break;

    const isVisible = await page.evaluate(
      el => el.offsetParent !== null, loadMore
    );
    if (!isVisible) break;

    await loadMore.click();
    await new Promise(r => setTimeout(r, 1500)); // wait for new items to render
  }
}

Wait for AJAX

// Wait for a specific network request to complete (with an explicit timeout)
await page.waitForResponse(
  response => response.url().includes('/api/products') && response.status() === 200,
  { timeout: 30000 }
);
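When the request is triggered by your own click, start waiting before clicking so the response cannot slip past unobserved. Extracting the matching rule into a plain function also makes it testable without a browser (the endpoint path and `.next-page` selector are assumptions for illustration):

```javascript
// Predicate extracted so the matching rule can be unit-tested in isolation
const isProductsResponse = (url, status) =>
  url.includes('/api/products') && status === 200;

// Usage sketch (assumes `page` is a connected Puppeteer page):
// begin waiting *before* the click to avoid a race.
//
// const [response] = await Promise.all([
//   page.waitForResponse(r => isProductsResponse(r.url(), r.status())),
//   page.click('.next-page'),
// ]);
```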

Rate Limiting

Respect target sites by implementing rate limiting:

class RateLimiter {
  constructor(requestsPerMinute) {
    this.interval = 60000 / requestsPerMinute;
    this.lastRequest = 0;
  }

  async wait() {
    const now = Date.now();
    const elapsed = now - this.lastRequest;

    if (elapsed < this.interval) {
      await new Promise(r => setTimeout(r, this.interval - elapsed));
    }

    this.lastRequest = Date.now();
  }
}

const limiter = new RateLimiter(30); // 30 requests per minute

for (const url of urls) {
  await limiter.wait();
  await processPage(url);
}
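When many workers share the same schedule they tend to fire in lockstep. A small variation on the limiter above adds random jitter between requests — a sketch, with an arbitrary jitter range:

```javascript
// Rate limiter with random jitter so parallel workers desynchronize
class JitteredRateLimiter {
  constructor(requestsPerMinute, jitterMs = 500) {
    this.interval = 60000 / requestsPerMinute;
    this.jitterMs = jitterMs;
    this.lastRequest = 0;
  }

  async wait() {
    // Each wait targets the base interval plus a random 0..jitterMs extra
    const target = this.interval + Math.random() * this.jitterMs;
    const elapsed = Date.now() - this.lastRequest;
    if (elapsed < target) {
      await new Promise(r => setTimeout(r, target - elapsed));
    }
    this.lastRequest = Date.now();
  }
}
```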

Error Recovery

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const browser = await puppeteer.connect({
      browserWSEndpoint: 'wss://bots.win/ws?apiKey=YOUR_API_KEY',
    });

    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
      const data = await extractData(page);
      return data;
    } catch (error) {
      console.error(`Attempt ${attempt + 1} failed: ${error.message}`);

      if (attempt === maxRetries - 1) throw error;

      // Exponential backoff: 2 s, 4 s, 8 s, ...
      await new Promise(r => setTimeout(r, 2000 * 2 ** attempt));
    } finally {
      await browser.close();
    }
  }
}
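The same retry-with-backoff shape applies to any async step, not just full page loads. A generic wrapper (a hypothetical helper, not part of Puppeteer):

```javascript
// Retry `fn` with exponential backoff, rethrowing the last error once
// attempts are exhausted. `fn` receives the zero-based attempt number.
async function withRetry(fn, { maxRetries = 3, baseDelayMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries - 1) {
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

With this in place, `scrapeWithRetry` reduces to `withRetry(() => scrapeOnce(url))`, and the same wrapper can guard individual extractions or navigations.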

Data Validation

Always validate extracted data:

function validateProduct(product) {
  const errors = [];

  if (!product.title || product.title.length < 2) {
    errors.push('Missing or invalid title');
  }
  if (!product.price || isNaN(parseFloat(product.price.replace(/[^0-9.]/g, '')))) {
    errors.push('Missing or invalid price');
  }
  if (product.image && !product.image.startsWith('http')) {
    errors.push('Invalid image URL');
  }

  return { valid: errors.length === 0, errors };
}
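In practice you validate whole batches, not single records. A usage sketch that partitions a scraped batch into clean and rejected records, keeping the failure reasons for inspection (the validator is passed in, so this works with `validateProduct` above or any validator of the same shape):

```javascript
// Split records into those that pass validation and those that fail,
// preserving each failure's error list for logging or re-extraction.
function partitionRecords(records, validate) {
  const clean = [];
  const rejected = [];
  for (const record of records) {
    const { valid, errors } = validate(record);
    if (valid) clean.push(record);
    else rejected.push({ record, errors });
  }
  return { clean, rejected };
}
```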

Best Practices

  1. Respect robots.txt and rate limits
  2. Validate extracted data before storing
  3. Handle errors gracefully with retries and logging
  4. Use the simplest selector that reliably identifies the target element
  5. Check for structured data first since it is more reliable than DOM scraping
  6. Set timeouts on all operations to prevent hung tasks
  7. Store raw HTML alongside extracted data for debugging and re-extraction
#scraping #data-extraction #best-practices #automation