
Web Scraping Best Practices with Cloud Browsers

Production-ready patterns for web scraping using cloud browsers, covering data extraction, error handling, and rate limiting.

Introduction

Web scraping with cloud browsers combines the full rendering capability of a real browser with the scalability of cloud infrastructure. Unlike HTTP-based scraping, browser-based scraping handles JavaScript-rendered content, SPAs, and dynamic loading without custom parsing logic.

When to Use Browser-Based Scraping

Use a browser when:

  • Content is rendered by JavaScript (React, Vue, Angular apps)
  • The page requires interaction (clicking, scrolling, form submission)
  • You need to handle login flows
  • Anti-bot protection requires a real browser environment
  • Content loads dynamically (infinite scroll, lazy loading)

Use HTTP requests when:

  • Content is in the initial HTML
  • Speed is critical and volume is very high
  • No JavaScript rendering is needed
  • The API is available and documented
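A quick probe can settle the choice: fetch the URL once over plain HTTP and check whether the content you need is already in the markup. A minimal sketch — the marker string and the app-shell heuristic are illustrative assumptions, not a universal rule:

```javascript
// Hypothetical helper: decide between HTTP and browser-based scraping by
// probing the server-rendered HTML for content you expect to find.
// `markerText` is any string that should appear in server-rendered output
// (a known product title, a distinctive class name, etc.).
function needsBrowser(html, markerText) {
  const hasMarker = html.includes(markerText);
  // An empty app shell like <div id="root"></div> signals client-side rendering
  const looksLikeAppShell = /<div[^>]+id=["'](root|app)["'][^>]*>\s*<\/div>/.test(html);
  return !hasMarker || looksLikeAppShell;
}
```

If the probe says the content is already there, plain HTTP requests will be faster and cheaper; otherwise fall back to a cloud browser.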

Data Extraction Patterns

CSS Selectors

const data = await page.evaluate(() => {
  const items = document.querySelectorAll('.product-card');
  return Array.from(items).map(item => ({
    title: item.querySelector('.title')?.textContent?.trim(),
    price: item.querySelector('.price')?.textContent?.trim(),
    image: item.querySelector('img')?.src,
    link: item.querySelector('a')?.href,
  }));
});

XPath

For complex DOM traversal:

// page.$x is available in Puppeteer 21 and earlier; newer versions use
// page.$$('xpath/...') with the xpath/ query prefix instead
const elements = await page.$x('//div[@class="review"]//span[@class="rating"]');
const ratings = await Promise.all(
  elements.map(el => page.evaluate(e => e.textContent, el))
);

Structured Data

Many sites embed structured data (JSON-LD, microdata):

const structuredData = await page.evaluate(() => {
  const scripts = document.querySelectorAll('script[type="application/ld+json"]');
  return Array.from(scripts).flatMap(s => {
    try {
      return [JSON.parse(s.textContent)];
    } catch {
      return []; // skip malformed JSON-LD blocks instead of aborting the page
    }
  });
});
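JSON-LD blocks often wrap entities in an `@graph` array, so flatten before filtering by `@type`. A small post-processing sketch (the helper name is ours, not a standard API):

```javascript
// Flatten @graph wrappers, then pick out entities of the requested schema.org
// type (e.g. 'Product', 'Article', 'Review').
function findByType(blocks, type) {
  const entities = blocks.flatMap(b => (Array.isArray(b['@graph']) ? b['@graph'] : [b]));
  return entities.filter(e => e['@type'] === type);
}
```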

Handling Dynamic Content

Infinite Scroll

async function scrollToBottom(page, maxScrolls = 50) {
  let previousHeight = 0;

  // Cap iterations: feeds that grow indefinitely would otherwise never break
  for (let i = 0; i < maxScrolls; i++) {
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);

    if (currentHeight === previousHeight) break;
    previousHeight = currentHeight;

    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise(r => setTimeout(r, 2000)); // wait for content to load
  }
}

Click to Load More

async function loadAllItems(page, maxClicks = 100) {
  // Cap clicks so a button that never disappears cannot hang the task
  for (let i = 0; i < maxClicks; i++) {
    const loadMore = await page.$('.load-more-button');
    if (!loadMore) break;

    const isVisible = await page.evaluate(
      el => el.offsetParent !== null, loadMore
    );
    if (!isVisible) break;

    await loadMore.click();
    await new Promise(r => setTimeout(r, 1500)); // wait for new items to render
  }
}

Wait for AJAX

// Wait for a specific network request to complete (with an explicit timeout)
await page.waitForResponse(
  response => response.url().includes('/api/products') && response.status() === 200,
  { timeout: 30000 }
);
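When the request is triggered by your own click, start waiting before clicking so the response cannot slip past unobserved. Extracting the matching rule into a plain function also makes it testable without a browser (the endpoint path and `.next-page` selector are assumptions for illustration):

```javascript
// Predicate extracted so the matching rule can be unit-tested in isolation
const isProductsResponse = (url, status) =>
  url.includes('/api/products') && status === 200;

// Usage sketch (assumes `page` is a connected Puppeteer page):
// begin waiting *before* the click to avoid a race.
//
// const [response] = await Promise.all([
//   page.waitForResponse(r => isProductsResponse(r.url(), r.status())),
//   page.click('.next-page'),
// ]);
```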

Rate Limiting

Respect target sites by implementing rate limiting:

class RateLimiter {
  constructor(requestsPerMinute) {
    this.interval = 60000 / requestsPerMinute;
    this.lastRequest = 0;
  }

  async wait() {
    const now = Date.now();
    const elapsed = now - this.lastRequest;

    if (elapsed < this.interval) {
      await new Promise(r => setTimeout(r, this.interval - elapsed));
    }

    this.lastRequest = Date.now();
  }
}

const limiter = new RateLimiter(30); // 30 requests per minute

for (const url of urls) {
  await limiter.wait();
  await processPage(url);
}
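When many workers share the same schedule they tend to fire in lockstep. A small variation on the limiter above adds random jitter between requests — a sketch, with an arbitrary jitter range:

```javascript
// Rate limiter with random jitter so parallel workers desynchronize
class JitteredRateLimiter {
  constructor(requestsPerMinute, jitterMs = 500) {
    this.interval = 60000 / requestsPerMinute;
    this.jitterMs = jitterMs;
    this.lastRequest = 0;
  }

  async wait() {
    // Each wait targets the base interval plus a random 0..jitterMs extra
    const target = this.interval + Math.random() * this.jitterMs;
    const elapsed = Date.now() - this.lastRequest;
    if (elapsed < target) {
      await new Promise(r => setTimeout(r, target - elapsed));
    }
    this.lastRequest = Date.now();
  }
}
```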

Error Recovery

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const browser = await puppeteer.connect({
      browserWSEndpoint: 'wss://bots.win/ws?apiKey=YOUR_API_KEY',
    });

    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
      const data = await extractData(page);
      return data;
    } catch (error) {
      console.error(`Attempt ${attempt + 1} failed: ${error.message}`);

      if (attempt === maxRetries - 1) throw error;

      // Exponential backoff: 2 s, 4 s, 8 s, ...
      await new Promise(r => setTimeout(r, 2000 * 2 ** attempt));
    } finally {
      await browser.close();
    }
  }
}
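The same retry-with-backoff shape applies to any async step, not just full page loads. A generic wrapper (a hypothetical helper, not part of Puppeteer):

```javascript
// Retry `fn` with exponential backoff, rethrowing the last error once
// attempts are exhausted. `fn` receives the zero-based attempt number.
async function withRetry(fn, { maxRetries = 3, baseDelayMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries - 1) {
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

With this in place, `scrapeWithRetry` reduces to `withRetry(() => scrapeOnce(url))`, and the same wrapper can guard individual extractions or navigations.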

Data Validation

Always validate extracted data:

function validateProduct(product) {
  const errors = [];

  if (!product.title || product.title.length < 2) {
    errors.push('Missing or invalid title');
  }
  if (!product.price || isNaN(parseFloat(product.price.replace(/[^0-9.]/g, '')))) {
    errors.push('Missing or invalid price');
  }
  if (product.image && !product.image.startsWith('http')) {
    errors.push('Invalid image URL');
  }

  return { valid: errors.length === 0, errors };
}
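In practice you validate whole batches, not single records. A usage sketch that partitions a scraped batch into clean and rejected records, keeping the failure reasons for inspection (the validator is passed in, so this works with `validateProduct` above or any validator of the same shape):

```javascript
// Split records into those that pass validation and those that fail,
// preserving each failure's error list for logging or re-extraction.
function partitionRecords(records, validate) {
  const clean = [];
  const rejected = [];
  for (const record of records) {
    const { valid, errors } = validate(record);
    if (valid) clean.push(record);
    else rejected.push({ record, errors });
  }
  return { clean, rejected };
}
```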

Best Practices

  1. Respect robots.txt and rate limits
  2. Validate extracted data before storing
  3. Handle errors gracefully with retries and logging
  4. Use the simplest selector that reliably identifies the target element
  5. Check for structured data first since it is more reliable than DOM scraping
  6. Set timeouts on all operations to prevent hung tasks
  7. Store raw HTML alongside extracted data for debugging and re-extraction
#scraping #data-extraction #best-practices #automation