Web Scraping Best Practices with Cloud Browsers
Production-ready patterns for web scraping using cloud browsers, covering data extraction, error handling, and rate limiting.
Introduction
Web scraping with cloud browsers combines the full rendering capability of a real browser with the scalability of cloud infrastructure. Unlike HTTP-based scraping, browser-based scraping handles JavaScript-rendered content, SPAs, and dynamic loading without custom parsing logic.
When to Use Browser-Based Scraping
Use a browser when:
- Content is rendered by JavaScript (React, Vue, Angular apps)
- The page requires interaction (clicking, scrolling, form submission)
- You need to handle login flows
- Anti-bot protection requires a real browser environment
- Content loads dynamically (infinite scroll, lazy loading)
Use HTTP requests when:
- Content is in the initial HTML
- Speed is critical and volume is very high
- No JavaScript rendering is needed
- An API is available and documented (prefer the API over scraping)
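One quick way to decide between the two approaches is to fetch the raw HTML once and check whether it is an empty client-side shell. The heuristic below is a rough sketch, not a reliable detector: the `root`/`app` mount-point markers and the 50-character threshold are assumptions that will need tuning per site.

```javascript
// Rough heuristic: does this raw HTML look like an empty SPA shell?
// If so, content is probably rendered client-side and needs a browser.
function looksLikeSpaShell(html) {
  const body = (html.match(/<body[^>]*>([\s\S]*)<\/body>/i) || [null, ''])[1];
  // Strip scripts and tags to estimate the server-rendered visible text
  const visibleText = body
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim();
  // Common SPA mount points with no server-rendered content inside
  const hasEmptyMount = /<div[^>]*id=["'](root|app)["'][^>]*>\s*<\/div>/i.test(body);
  return hasEmptyMount || visibleText.length < 50;
}
```

Run it against a plain HTTP fetch of the target URL; if it returns true, plan on a browser from the start rather than discovering mid-project that your parsed HTML is empty.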
Data Extraction Patterns
CSS Selectors
const data = await page.evaluate(() => {
  const items = document.querySelectorAll('.product-card');
  return Array.from(items).map(item => ({
    title: item.querySelector('.title')?.textContent?.trim(),
    price: item.querySelector('.price')?.textContent?.trim(),
    image: item.querySelector('img')?.src,
    link: item.querySelector('a')?.href,
  }));
});
XPath
For complex DOM traversal:
// Note: page.$x is deprecated in recent Puppeteer versions;
// page.$$('xpath/…') selectors are the newer equivalent.
const elements = await page.$x('//div[@class="review"]//span[@class="rating"]');
const ratings = await Promise.all(
  elements.map(el => page.evaluate(e => e.textContent, el))
);
Structured Data
Many sites embed structured data (JSON-LD, microdata):
const structuredData = await page.evaluate(() => {
  const scripts = document.querySelectorAll('script[type="application/ld+json"]');
  return Array.from(scripts).map(s => {
    try {
      return JSON.parse(s.textContent);
    } catch {
      return null; // Skip malformed JSON-LD blocks
    }
  }).filter(Boolean);
});
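JSON-LD blocks often mix several schemas on one page (breadcrumbs, organization, product), and some sites nest everything under an `@graph` array. A small helper to keep only the entries of the type you want; the function name is illustrative, not a standard API:

```javascript
// Flatten @graph wrappers and keep only entries of a given schema.org type
function filterJsonLd(blocks, type) {
  const flat = blocks.flatMap(b => (b && b['@graph']) ? b['@graph'] : [b]);
  return flat.filter(entry => {
    if (!entry || !entry['@type']) return false;
    // @type may be a string or an array of strings
    const types = Array.isArray(entry['@type']) ? entry['@type'] : [entry['@type']];
    return types.includes(type);
  });
}
```

For example, `filterJsonLd(structuredData, 'Product')` would discard breadcrumb and organization entries and keep only product records.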
Handling Dynamic Content
Infinite Scroll
async function scrollToBottom(page) {
  let previousHeight = 0;
  while (true) {
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);
    if (currentHeight === previousHeight) break;
    previousHeight = currentHeight;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    // page.waitForTimeout was removed in newer Puppeteer; a plain delay works everywhere
    await new Promise(r => setTimeout(r, 2000)); // Wait for content to load
  }
}
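Infinite-scroll pages often re-render earlier items on each pass, so extracting after every scroll can yield duplicates. One way to handle this is to merge the batches keyed on a unique field; the sketch below assumes your extracted items carry a `link` property that identifies them:

```javascript
// Merge item batches collected across scroll passes, keyed on a unique field
function dedupeByKey(batches, key = 'link') {
  const seen = new Map();
  for (const batch of batches) {
    for (const item of batch) {
      // Keep the first occurrence of each key; skip items missing the key
      if (item[key] != null && !seen.has(item[key])) {
        seen.set(item[key], item);
      }
    }
  }
  return [...seen.values()];
}
```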
Click to Load More
async function loadAllItems(page) {
  while (true) {
    const loadMore = await page.$('.load-more-button');
    if (!loadMore) break;
    const isVisible = await page.evaluate(
      el => el.offsetParent !== null, loadMore
    );
    if (!isVisible) break;
    await loadMore.click();
    await new Promise(r => setTimeout(r, 1500));
  }
}
Wait for AJAX
// Wait for a specific network request to complete
await page.waitForResponse(
  response => response.url().includes('/api/products') && response.status() === 200
);
Rate Limiting
Respect target sites by implementing rate limiting:
class RateLimiter {
  constructor(requestsPerMinute) {
    this.interval = 60000 / requestsPerMinute;
    this.lastRequest = 0;
  }

  async wait() {
    const now = Date.now();
    const elapsed = now - this.lastRequest;
    if (elapsed < this.interval) {
      await new Promise(r => setTimeout(r, this.interval - elapsed));
    }
    this.lastRequest = Date.now();
  }
}

const limiter = new RateLimiter(30); // 30 requests per minute

for (const url of urls) {
  await limiter.wait();
  await processPage(url);
}
Error Recovery
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const browser = await puppeteer.connect({
      browserWSEndpoint: 'wss://bots.win/ws?apiKey=YOUR_API_KEY',
    });
    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
      return await extractData(page);
    } catch (error) {
      console.error(`Attempt ${attempt + 1} failed: ${error.message}`);
      if (attempt === maxRetries - 1) throw error;
      // Exponential backoff: 2s, 4s, 8s, ...
      await new Promise(r => setTimeout(r, 2000 * 2 ** attempt));
    } finally {
      await browser.close();
    }
  }
}
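A doubling delay works for a single worker, but many workers retrying on the same schedule can hit a recovering site in synchronized waves. A common refinement is exponential backoff with a ceiling and random "full jitter"; a sketch, with illustrative parameter defaults:

```javascript
// Exponential backoff with a ceiling and full jitter:
// delay is drawn uniformly from [0, min(cap, base * 2^attempt))
function backoffDelay(attempt, base = 2000, cap = 30000) {
  const ceiling = Math.min(cap, base * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}
```

Drop it into the retry loop as `await new Promise(r => setTimeout(r, backoffDelay(attempt)));` so concurrent workers spread their retries instead of colliding.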
Data Validation
Always validate extracted data:
function validateProduct(product) {
  const errors = [];
  if (!product.title || product.title.length < 2) {
    errors.push('Missing or invalid title');
  }
  if (!product.price || isNaN(parseFloat(product.price.replace(/[^0-9.]/g, '')))) {
    errors.push('Missing or invalid price');
  }
  if (product.image && !product.image.startsWith('http')) {
    errors.push('Invalid image URL');
  }
  return { valid: errors.length === 0, errors };
}
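The price check above only tests that the string is parseable; storing a normalized number avoids re-parsing downstream. A hedged sketch that strips currency symbols and thousands separators, assuming US-style formatting with `.` as the decimal separator:

```javascript
// Normalize a scraped price string like "$1,299.99" to a number, or null
function parsePrice(raw) {
  if (typeof raw !== 'string') return null;
  // Drop currency symbols, commas, and whitespace; keep digits and the decimal point
  const cleaned = raw.replace(/[^0-9.]/g, '');
  const value = parseFloat(cleaned);
  return Number.isFinite(value) ? value : null;
}
```

Locales that use `,` as the decimal separator (e.g. "1.299,99 €") would need a different cleaning step, so treat this as a starting point rather than a universal parser.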
Best Practices
- Respect robots.txt and rate limits
- Validate extracted data before storing
- Handle errors gracefully with retries and logging
- Use the simplest selector that reliably identifies the target element
- Check for structured data first since it is more reliable than DOM scraping
- Set timeouts on all operations to prevent hung tasks
- Store raw HTML alongside extracted data for debugging and re-extraction
#scraping #data-extraction #best-practices #automation