Case Study 2026: How to Scrape 1,000,000 Wildberries Product Pages Daily — Architecture, Proxies, Anti-Ban
Article contents
- Introduction: why this topic is relevant and what you'll learn
- Fundamentals: key concepts and terminology
- Dive deep: how today’s anti-bot measures work and what’s important to emulate
- Architecture and task queues: a structure that doesn’t “fall” at one million daily
- Proxies and rotation: mobile IPs as a factor of naturalness
- Method 1: headless browser Playwright, when you need a “human” profile
- Method 2: high-performance HTTP client with correct profile
- Method 3: anti-ban strategies and behavioral patterns
- Method 4: error handling, deduplication, and self-recovery
- Data storage: schemas, versions, and analytics speed
- Common mistakes that lower success rate
- Tools and resources: what to use in 2026
- Case studies and results: real performance and economics
- FAQ: in-depth questions and answers
- Conclusion: summary and implementation plan
Introduction: Why This Topic is Relevant and What You'll Learn
By 2026, Wildberries has firmly established itself as one of the largest e-commerce ecosystems in the region, making the quality and speed of product data a critical factor for manufacturers, distributors, analysts, arbitrage teams, pricing departments, and category managers. Daily monitoring of prices, stock levels, Buy Box, rankings, content, and reviews can no longer be done manually or with small scripts. An industrial architecture is needed to reliably gather a million product pages daily with clear cost-effectiveness and predictable SLA.
In this article, we'll explore how to design and operate such a pipeline: from designing task queues and orchestrating workers to rotating mobile proxies effectively and carefully handling Wildberries' anti-bot measures. We will present the storage structure, the data flow into analytical dashboards, error handling, and retries. We will also give figures for speed, cost, and success rates under load in 2026. You will get working frameworks, Python snippets, launch checklists, and principles designed to stay useful as anti-bot measures change.
We focus on the responsible and lawful collection of publicly available data. Adhere to the service rules, legal frameworks, and ethical standards. Our goal is engineering discipline: predictability, fault tolerance, security, and efficiency.
Fundamentals: Key Concepts and Terminology
Product Card — a page with SKU attributes (name, brand, photo, specifications), pricing (base, discounts, marketing promotions), availability and logistics (warehouse, delivery time), content (description, video), and social signals (ratings, reviews, questions). For stable monitoring of product cards, it's important to separate the components: the core (invariant fields), dynamics (prices, stock levels), and derived metrics (Buy Box, minimum prices by seller).
Types of Sources: 1) public HTML pages; 2) front-end JSON endpoints providing data for rendering; 3) images and static content (CDN), which are not always necessary but are useful for content quality control. In 2026, front-end APIs are heavily protected: they check behavioral and network traits, the HTTP/2 implementation, headers and cookies, TLS fingerprints, and whether the client presents a legitimate mobile or desktop profile.
Lawful Data Collection — do not violate rights, circumvent paid access, interfere with service functionality, overload the platform, or disrespect user privacy. Stick to publicly available product cards, avoiding closed sections.
Network Layer: Important aspects include ASN (operator), IP type (mobile NAT, residential), protocol (HTTP/2, sometimes HTTP/3 for static), support for TLS 1.3, correct JA3/JA4 fingerprints, and the content of ClientHello. For stable success on Wildberries, emulating real customers with behavioral patterns and a mobile profile is preferred, along with reasonable IP geography.
Sessions and Cookie Jar: Non-reproducible markers and behavioral cookies affect access. A stable cookie jar at the proxy session level, careful context propagation between requests, and respect for cookie TTLs are the basis for a high success rate.
Data Architecture: A pipeline consisting of task queues (with prioritization), workers (with adaptive speed and limits), storage (raw snapshots, normalized tables in Postgres/ClickHouse, a display layer), and a monitoring layer (metrics, logs, alerts). Key qualities include the idempotence of operations, traceability from task to result, and the possibility of partial recovery.
Dive Deep: How Today’s Anti-Bot Measures Work and What’s Important to Emulate
In 2026, the anti-bot measures of marketplaces (including Wildberries) evaluate not only IP history but also the composite profile of the client: TLS fingerprint, HTTP/2 priorities and window size, header order, security and caching headers, UA and platform semantics, consistency of Accept-* and Sec-* headers, the timing of clicks and scrolls in the browser, delays between navigations, network errors, and per-session behavioral statistics, including prefetching. Pace is crucial: spikes in parallel requests from one "client" raise suspicion, as does completely "sterile" navigation without images, service workers, or background requests.
From this, three implications arise. First: a bare HTTP client with a default library is often insufficient; you either need integration with a modern headless browser (Playwright with fine-tuning) or a carefully assembled HTTP/2 stack with the correct low-level characteristics. Second: mobile proxies significantly increase the naturalness of traffic due to NAT aggregation and real cellular profiles. Third: cookies must be handled delicately, not reset arbitrarily, and synchronized with the lifecycle of the IP.
Practically, this means: 1) separate pools of desktop and mobile profiles; 2) "sticky" sessions (10–30 minutes) if needed to maintain behavior, and short sessions for one-time accesses; 3) geographically sensible routing (country, region) that aligns with Wildberries users' logic; 4) soft speed (adaptive rate) that considers page types, time of day, and server responses.
Architecture and Task Queues: A Structure That Doesn’t “Fall” at One Million Daily
Section Objective: Build a pipeline of modules that are easy to scale: orchestrator, scheduler, queue, workers, proxy layer, control loops, storage.
High-Level Scheme
- Scheduler — defines priorities: new SKUs, modified, problematic, periodic checks, and resuming missed bundles.
- Dispatcher — places tasks in the queue with the necessary partitioning keys.
- Queue — Kafka or NATS for end-to-end throughput and redistribution; Redis Streams for rapid iteration.
- Workers — isolated processes (Python) with adaptive speeds and metrics.
- Proxy Layer — manager of mobile proxy pools and rotation rules.
- Storage — ClickHouse for events and snapshots, Postgres for transactional tables and metadata, S3-compatible storage for HTML.
- Observability — Prometheus/Grafana/ELK + alerts; profiling at the request/response level, proxy, worker, queue topic.
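The scheduler's priority classes can be sketched with a heap-based queue. This is only an illustration under assumptions: the `PRIORITY` mapping, its numeric values, and the `Scheduler` class are hypothetical names invented for this example, and a production scheduler would persist its state rather than keep it in memory.

```python
import heapq
import itertools

# Hypothetical priority classes (lower number = served first).
PRIORITY = {"new": 0, "modified": 1, "problematic": 2, "periodic": 3}


class Scheduler:
    """Orders SKU tasks by priority class, FIFO within a class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps insertion order

    def push(self, sku_id, kind):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), sku_id))

    def pop_batch(self, size):
        """Return up to `size` SKU ids, highest priority first."""
        batch = []
        while self._heap and len(batch) < size:
            _, _, sku_id = heapq.heappop(self._heap)
            batch.append(sku_id)
        return batch


sched = Scheduler()
sched.push(101, "periodic")
sched.push(202, "new")
sched.push(303, "modified")
sched.push(404, "new")
first_batch = sched.pop_batch(3)
print(first_batch)  # [202, 404, 303]
```

The dispatcher would then publish each batch to the queue; the periodic task 101 stays in the heap until higher-priority work is drained.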
Partitioning and Idempotence
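The key is SKU_ID or CARD_ID. Use consistent hashing for distribution across partitions. Idempotence is ensured by deterministic result keys: one SKU per interval should not create duplicates. In ClickHouse, use a MergeTree-family table partitioned by date with a sorting key of (sku_id, snapshot_ts). Note that plain MergeTree does not deduplicate rows; deduplication by version requires an engine such as ReplacingMergeTree with a version column, or deduplication at query time.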
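To make the partitioning idea concrete, here is a minimal sketch of a consistent hash ring with virtual nodes, plus a deterministic result key. The names `HashRing`, `partition_for`, and `result_key`, the partition labels, and the one-hour interval are assumptions made for this example, not part of any specific queue system.

```python
import bisect
import hashlib


def _h(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, partitions, vnodes=64):
        self._ring = sorted(
            (_h(f"{p}#{v}"), p) for p in partitions for v in range(vnodes)
        )
        self._keys = [k for k, _ in self._ring]

    def partition_for(self, sku_id: int) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, _h(str(sku_id))) % len(self._ring)
        return self._ring[idx][1]


def result_key(sku_id: int, ts: int, interval_s: int = 3600) -> str:
    """Deterministic key: one result per SKU per time bucket."""
    return f"{sku_id}:{ts // interval_s}"


ring = HashRing(["p0", "p1", "p2", "p3"])
bigger = HashRing(["p0", "p1", "p2", "p3", "p4"])
skus = range(10_000)
moved = sum(ring.partition_for(s) != bigger.partition_for(s) for s in skus)
print(f"share of SKUs moved after adding a partition: {moved / len(skus):.0%}")
print(result_key(123, 1_700_000_000) == result_key(123, 1_700_000_900))  # True
```

Adding a fifth partition only moves the keys that land on the new partition, which is the property that makes rebalancing cheap. Two fetches of the same SKU within the same interval produce the same result key, so a retry overwrites rather than duplicates.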
Backpressure and Limits
Workers take tasks in batches (batch size 10–50) and adapt RPS based on success rate and average latency. Limits are set by: 1) proxy endpoint; 2) page type; 3) region. In case of degradation (increased 5xx, captcha growth), intensity is reduced by 20–40% with exponential recovery.
Code Example: Basic Worker with Adaptive Throttling in Python
This example is illustrative, simplified, and without external dependencies.
import queue
import random
import time
from threading import Thread


class RateLimiter:
    """Adapts requests per second to the success rate of the last 100 calls."""

    def __init__(self, rps):
        self.rps = rps
        self.min_rps = 0.3 * rps
        self.max_rps = 2 * rps
        self.win = []

    def mark(self, ok):
        self.win.append(1 if ok else 0)
        self.win = self.win[-100:]
        suc = sum(self.win) / len(self.win)
        if suc < 0.8:
            self.rps = max(self.min_rps, self.rps * 0.8)  # slow down
        elif suc > 0.95:
            self.rps = min(self.max_rps, self.rps * 1.1)  # recover gradually
        return self.rps

    def sleep(self):
        time.sleep(1.0 / max(self.rps, 0.1))


class Worker(Thread):
    def __init__(self, q):
        super().__init__()
        self.q = q
        self.rl = RateLimiter(3.0)

    def run(self):
        while True:
            try:
                task = self.q.get(timeout=1)
            except queue.Empty:
                break  # queue drained, stop the worker
            ok = self.process(task)
            self.rl.mark(ok)
            self.rl.sleep()
            self.q.task_done()

    def process(self, task):
        # request stub: succeeds about 90% of the time
        return random.random() > 0.1


q = queue.Queue()
for i in range(1000):
    q.put(f"sku-{i}")
workers = [Worker(q) for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()
Proxies and Rotation: Mobile IPs as a Factor of Naturalness
Why Mobile Proxies: mobile ASNs, real radio access, NAT aggregation, natural TTLs, and IP distribution enhance anti-bot trust. By properly managing rotation and sessions, you significantly increase the success rate at a moderate cost for one million product cards.
Rotation Strategies
- Sticky Sessions 10–30 minutes for pages where consistent steps are crucial (HTML cards, adjacent requests to JSON sub-sections).
- Hard-Rotate on errors of type 403/429/captcha — change IP immediately and reset the session.
- Soft-Rotate by Timer — uniform IP change every 5–15 minutes for “freshness” in the pool.
- Proxy Scoring — a rating based on 1) share of 2xx; 2) average latency; 3) share of captcha; 4) response size. Low scores go to quarantine.
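The rotation rules above can be expressed as a small decision function and a score. This is a sketch under stated assumptions: the weights in `proxy_score`, the `QUARANTINE_BELOW` threshold, and the 600-second soft-rotate default are hypothetical values to be tuned on real metrics, and the response-size signal is left out for brevity.

```python
from dataclasses import dataclass

HARD_ROTATE_STATUSES = {403, 429}
QUARANTINE_BELOW = 0.5  # hypothetical threshold


@dataclass
class SessionState:
    started_at: float                 # when the current IP/session was acquired
    soft_rotate_after: float = 600.0  # seconds; the text suggests 5-15 minutes


def rotation_action(state, status, captcha, now):
    """Return 'hard', 'soft' or 'keep' for the next request."""
    if captcha or status in HARD_ROTATE_STATUSES:
        return "hard"  # change IP immediately and reset the session
    if now - state.started_at >= state.soft_rotate_after:
        return "soft"  # scheduled refresh by timer
    return "keep"


def proxy_score(ok_share, captcha_share, p95_latency_s):
    """Hypothetical weighting: reward the 2xx share, penalise captchas and latency."""
    latency_penalty = min(p95_latency_s / 10.0, 1.0)
    return ok_share - 2.0 * captcha_share - 0.3 * latency_penalty


state = SessionState(started_at=0.0)
print(rotation_action(state, 200, False, now=100.0))  # keep
print(rotation_action(state, 429, False, now=100.0))  # hard
print(rotation_action(state, 200, False, now=700.0))  # soft
print(proxy_score(0.6, 0.2, 8.0) < QUARANTINE_BELOW)  # True
```

An endpoint whose score falls below the threshold would be moved to quarantine and re-tested later with a few warm-up requests.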
Integration with Provider
In 2026, the market demands providers with wide real mobile coverage. Services like MobileProxy.Space offer 218+ million IPs, in over 53 countries, real SIM cards from carriers, simultaneous HTTP(S) and SOCKS5 protocols, rotation by timer, API, and link, 3 hours of free testing, and 24/7 support. For industrial scraping, this provides flexibility: you can build pools by country and scale quickly. Additionally, convenient auxiliary tools: IP verification, DNS Leak Test, Proxy Checker, proxy calculator, latency map, and browser fingerprint generator. Don’t forget the promo code YOUTUBE20 — that’s 20% off your first purchase.
Proxy Pool Manager: Design
- Endpoint Abstraction: address, country, TTL, supports_http2, sticky_token, health_score.
- Policy: soft-rotate, hard-rotate, warmup N requests, quarantine M minutes after failure.
- Metrics: per-endpoint success, latency p95, error taxonomy (403, 429, 5xx, timeout).
- Allocator: issues endpoint considering queue topic, SKU region, and limits.
Snippet: Simple Sticky Session Manager
import random
import time


class Proxy:
    def __init__(self, url):
        self.url = url
        self.sticky_token = None
        self.expire = 0
        self.score = 1.0

    def acquire(self):
        # Issue a new sticky token once the previous one has expired (15 minutes).
        now = time.time()
        if now > self.expire:
            self.sticky_token = str(random.randint(1, 10**9))
            self.expire = now + 900
        return {"server": self.url, "sticky": self.sticky_token}

    def report(self, ok, latency):
        # Multiplicative score update clamped to [0.1, 2.0]; latency is accepted but not used yet.
        self.score = max(0.1, min(2.0, self.score * (1.05 if ok else 0.9)))


class Pool:
    def __init__(self, urls):
        self.items = [Proxy(u) for u in urls]

    def get(self):
        # Hand out the highest-scoring proxy.
        return max(self.items, key=lambda p: p.score).acquire()
Method 1: Headless Browser Playwright — When You Need a “Human” Profile
When to Use: a complex front end, dynamic loading, dependence on navigation history, behavior checks, requirements for proper HTTP/2 priorities and header order, and debugging.
Fine-Tuning the Context
- Headful mode at low frequency or headless with sensible viewport and deviceScaleFactor.
- Mobile user-agent and media queries, correct Accept-Language, timezone, locale, geolocation.
- Mask automation indicators such as navigator.webdriver, keep careful delays between actions, and let images and some background requests load.
Warm-Up and Collection Pattern
- Create context with proxy and cookies.
- Open the list, perform 1–2 scrolls.
- Go to the product card, wait for domcontentloaded, then networkidle (careful, this doesn’t always happen).
- Extract HTML and key JSON endpoints (via page.route or page.expect_response).
- Pause for 300–900 ms; close the page.
Python Snippet: Simplified Product Card Collection
from playwright.sync_api import sync_playwright

UA = (
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/121.0 Mobile Safari/537.36"
)


def fetch_card(url, proxy_server):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy={"server": proxy_server})
        try:
            ctx = browser.new_context(
                user_agent=UA, locale="en-US", timezone_id="Europe/London"
            )
            page = ctx.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=30000)
            page.wait_for_timeout(600)  # short pause before reading the DOM
            return page.content()
        finally:
            browser.close()  # always release the browser, even on errors
Practical Tips
- Compile a dictionary of “good” delay patterns: short human-like pauses tend to improve the share of successful responses.
- Do not increase concurrency for pages in one context above 3–5 simultaneously.
- Handle captchas through manual confirmation or legal recognition services only where permitted and appropriate.
Method 2: High-Performance HTTP Client with Correct Profile
When to Use: the front end consistently returns JSON without complex behavioral ties, and for HTML pages where volume and speed are crucial.
Key Elements
- HTTP/2 transport with customizable header order, ALPN support, and H2 priorities.
- Correct headers: User-Agent, Accept, Accept-Language, Cache-Control, Sec-CH-UA (carefully and consistently).
- Cookie jar and sessions tied to proxies.
- Randomization not for the sake of randomization: profiles should not “jump” from request to request.
Snippet: aiohttp + Basic Retry Policy with Jitter
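import asyncio
import random

import aiohttp

# Note: aiohttp speaks HTTP/1.1 only. If HTTP/2 is required, use a client
# that supports it (for example httpx with http2=True).


async def get(url, proxy, headers, retries=3):
    backoff = 0.4
    timeout = aiohttp.ClientTimeout(total=20)
    # One session for all attempts, so connections and cookies are reused.
    async with aiohttp.ClientSession(timeout=timeout, headers=headers) as s:
        for i in range(retries):
            try:
                async with s.get(url, proxy=proxy) as r:
                    if r.status == 200:
                        return await r.text()
                    if r.status in (403, 429):
                        raise RuntimeError("blocked")
                    raise RuntimeError(f"bad:{r.status}")
            except (aiohttp.ClientError, asyncio.TimeoutError, RuntimeError):
                # Exponential backoff plus up to 200 ms of jitter.
                await asyncio.sleep(backoff * (2 ** i) + random.random() * 0.2)
    return None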
Practice
- Minimize the number of hosts in the connection; reuse sessions.
- Maintain a “passport” for each session: which IP, which UA, which cookies, how many successful requests, and when to change.
- Monitor response sizes and differentiate crawling: if the product card doesn't change, increase the revisit period.
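The last point, revisiting unchanged cards less often, can be sketched as a simple interval rule. The function name `next_revisit_hours` and the 6-hour and 72-hour bounds are assumptions chosen for illustration; real bounds depend on how fresh each category must be.

```python
def next_revisit_hours(current_hours, changed, min_hours=6.0, max_hours=72.0):
    """Double the interval while the card is unchanged; reset on change."""
    if changed:
        return min_hours
    return min(max_hours, current_hours * 2)


interval = 6.0
history = []
for changed in [False, False, False, False, True]:
    interval = next_revisit_hours(interval, changed)
    history.append(interval)
print(history)  # [12.0, 24.0, 48.0, 72.0, 6.0]
```

Stable cards drift toward the maximum interval and free up request budget for cards that change often.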
Method 3: Anti-Ban Strategies and Behavioral Patterns
The “Do No Harm” Strategy: distribute load evenly, keep peaks gentle, pause during hours when protection is strictest, maintain stable sessions, and respond to degradation signals. Avoid aggressive retry floods: they lower the success rate of the whole pool.
SAFE Framework
- Smooth — smooth out RPS and jitter.
- Adaptive — adapt based on success/error/latency metrics.
- Focused — prioritize important product cards, defer low-priority ones under degradation.
- Ethical — adhere to regulations, do not touch closed sections.
Protection Signals and Reactions
- Rise in 403/429 — reduce speed by 30–50%, gently rotate IPs, restart contexts.
- Unusually small responses — check for “cut-off” pages; change client profile.
- Massive timeouts — network or server issues; implement exponential backoff.
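One way to catch the "unusually small responses" signal is to compare each body size with the recent median for the same page type. The sketch below assumes a hypothetical `SizeMonitor` class, a 0.4 ratio, and a 20-sample warm-up; all three would need calibration on real traffic.

```python
from collections import deque
from statistics import median


class SizeMonitor:
    """Flags responses much smaller than the recent median for a page type."""

    def __init__(self, window=200, ratio=0.4, min_samples=20):
        self.sizes = deque(maxlen=window)
        self.ratio = ratio
        self.min_samples = min_samples

    def is_suspicious(self, size_bytes):
        suspicious = (
            len(self.sizes) >= self.min_samples
            and size_bytes < self.ratio * median(self.sizes)
        )
        if not suspicious:
            self.sizes.append(size_bytes)  # only learn from normal responses
        return suspicious


mon = SizeMonitor()
for size in range(180_000, 200_000, 1_000):  # 20 normal-looking pages
    mon.is_suspicious(size)
print(mon.is_suspicious(12_000))   # True
print(mon.is_suspicious(185_000))  # False
```

A flagged response is a candidate for a raw snapshot and a client profile change, since a 200 status with a truncated body often hides a soft block.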
Mixing Desktop and Mobile Profiles
Keep part of the traffic desktop (20–40%) for naturalness and the rest mobile. The aim is to resemble a real traffic mix rather than a single uniform profile.
Method 4: Error Handling, Deduplication, and Self-Recovery
Objective: replace a fragile crawler that crashes on the first failure with a self-recovering pipeline.
Error Classification
- Network: timeouts, connection reset, TLS.
- HTTP: 4xx (including 403/429), 5xx.
- Semantic: the parser did not find the field, changed JSON schema.
- System: memory shortages, slow storage, queue failure.
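The classification above can be turned into a small routing function so that each class gets its own retry policy. This is a simplified sketch: the `classify` function, its return labels, and the choice to treat any unrecognised exception as a system error are assumptions made for this example.

```python
import asyncio
import ssl

NETWORK_ERRORS = (asyncio.TimeoutError, TimeoutError, ConnectionError, ssl.SSLError)


def classify(status=None, exc=None, missing_fields=()):
    """Map one fetch outcome to an error class from the list above, or 'ok'."""
    if exc is not None:
        # Timeouts, resets and TLS failures are network errors;
        # anything else (for example MemoryError) is treated as a system error.
        return "network" if isinstance(exc, NETWORK_ERRORS) else "system"
    if status is not None and status >= 400:
        return "http_blocked" if status in (403, 429) else "http"
    if missing_fields:
        return "semantic"  # the page loaded but the parser found gaps
    return "ok"


print(classify(exc=ConnectionResetError()))               # network
print(classify(status=429))                               # http_blocked
print(classify(status=502))                               # http
print(classify(status=200, missing_fields=("price",)))    # semantic
```

Separating `http_blocked` from other HTTP errors lets the worker trigger a hard rotation for 403/429 while applying plain backoff to 5xx responses.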
Retry Policies
- Hedged requests — duplicate via another proxy after N seconds in case of a likely stall.
- Exponential backoff + jitter — standard for unstable segments.
- Poison queue — after 3–5 failures, the task goes to an isolated queue for analysis.
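Two of these policies, backoff with jitter and the poison queue, fit in a few lines. The sketch below uses the "full jitter" variant of exponential backoff, which is one common choice and not necessarily the one used in the original project; `backoff_delay`, `route_failure`, and the limit of 4 failures are hypothetical names and values, and plain lists stand in for real queues.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def route_failure(task, main_queue, poison_queue, max_failures=4):
    """Re-queue a failed task, or isolate it after too many failures."""
    task["failures"] = task.get("failures", 0) + 1
    if task["failures"] >= max_failures:
        poison_queue.append(task)
        return "poison"
    main_queue.append(task)
    return "retry"


rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [backoff_delay(i, rng=rng) for i in range(8)]
main_q, poison_q = [], []
task = {"sku_id": 123}
outcomes = [route_failure(task, main_q, poison_q) for _ in range(4)]
print(outcomes)  # ['retry', 'retry', 'retry', 'poison']
```

Tasks in the poison queue are not retried automatically; they wait for analysis, which keeps a single broken SKU from consuming proxy budget indefinitely.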
Snippet: The Simplest Circuit Breaker
import time


class Circuit:
    def __init__(self, fail_thr=5, cool=60):
        self.fail = 0
        self.open_until = 0
        self.fail_thr = fail_thr
        self.cool = cool

    def allow(self):
        # Requests are allowed once the cool-down has elapsed.
        return time.time() > self.open_until

    def report(self, ok):
        if ok:
            self.fail = 0
        else:
            self.fail += 1
            if self.fail >= self.fail_thr:
                # Open the circuit for `cool` seconds and reset the counter.
                self.open_until = time.time() + self.cool
                self.fail = 0
Data Storage: Schemas, Versions, and Analytics Speed
Layers: 1) Raw: HTML/JSON snapshots in object storage (S3-compatible) with Zstd compression; 2) Staging: parsed tables whose fields may be partially missing; 3) Core: normalized tables for products, prices, stock levels, ratings, and reviews; 4) Marts: aggregates for product analytics.
Choosing a DBMS
- ClickHouse — fast batch inserts and column queries: excellent for snapshots, logs, and version history.
- Postgres — transactional operations: task statuses, proxy configurations, SKU metadata, access rights.
- Object Storage — durability of snapshots, inexpensive HTML storage.
ClickHouse Schema for Price Dynamics
CREATE TABLE prices (
    sku_id      UInt64,
    ts          DateTime64(3),
    price       UInt32,
    promo_price UInt32,
    seller_id   UInt64,
    region      LowCardinality(String),
    source      LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toDate(ts)
ORDER BY (sku_id, ts)
SETTINGS index_granularity = 8192;
Versions and Deduplication
Store the hash of the page body; if the content hasn’t changed, only update the “pulse” (latest status) and aggregate versioning by key fields. This reduces storage costs and speeds up reporting.
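The hash-based versioning described here can be sketched with an in-memory stand-in for the raw store. `SnapshotStore` and its fields are hypothetical names; a real implementation would write bodies to object storage and keep hashes and timestamps in a database.

```python
import hashlib


class SnapshotStore:
    """In-memory stand-in for raw storage: keeps a body only when it changes."""

    def __init__(self):
        self.last_hash = {}  # sku_id -> hash of the latest stored body
        self.versions = {}   # sku_id -> list of (ts, body)
        self.last_seen = {}  # sku_id -> ts of the latest fetch (the "pulse")

    def save(self, sku_id, body: bytes, ts: int) -> bool:
        """Return True if a new version was stored, False if only the pulse moved."""
        digest = hashlib.sha256(body).hexdigest()
        self.last_seen[sku_id] = ts
        if self.last_hash.get(sku_id) == digest:
            return False
        self.last_hash[sku_id] = digest
        self.versions.setdefault(sku_id, []).append((ts, body))
        return True


store = SnapshotStore()
print(store.save(1, b"<html>price 100</html>", ts=100))  # True
print(store.save(1, b"<html>price 100</html>", ts=200))  # False
print(store.save(1, b"<html>price 90</html>", ts=300))   # True
print(len(store.versions[1]), store.last_seen[1])        # 2 300
```

In practice, hashing the raw HTML can report changes that are only noise, such as embedded tokens or timestamps, so it may be better to hash a normalized body or the extracted key fields.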
Flow into Analytical Dashboards
Form daily slices: minimum prices by brand/category, Buy Box dynamics, time-to-stock-out, average ratings, and distribution of ratings. ClickHouse materialized views or dbt pipelines help automate data publications for BI.
Common Mistakes That Lower Success Rate
- Rough IP Rotation — frequent address changes while keeping the same cookies and UA disrupt the session. Keep combinations of “IP+cookies+UA” reasonable over intervals.
- Identical Headers on All Requests: too sterile a profile across the whole pool. Vary profiles naturally between sessions (not within one session) and keep Accept-* headers correct.
- Concurrency Peaks — loading “all the power” at peak hours leads to mass 429 errors. Switch to smoothing RPS.
- Absence of Feedback — no metrics, no adaptation. Always track p95 latency and status distribution.
- Brittle Parser — changes in field order break everything. Keep parsing resilient to non-essential changes.
- Suboptimal Storage — huge HTML duplicates without compression. Enable Zstd and version control.
Tools and Resources: What to Use in 2026
- Python Stack: Playwright (browser automation), aiohttp (HTTP/1.1 client) or httpx (HTTP client with optional HTTP/2 support), pydantic (validation), orjson (fast JSON), uvloop (event loop acceleration), tenacity (retries), prometheus_client (metrics), structlog (logging).
- Queues: Kafka (high throughput), NATS (low latency), Redis Streams (simplicity and speed).
- Storage: ClickHouse, Postgres, MinIO, or S3-compatible.
- Observability: Prometheus+Grafana, OpenTelemetry, ELK.
- Mobile Proxy Service: MobileProxy.Space — a large pool of mobile IPs, convenient rotation by API/timer/link, 24/7 support, 3 hours of free testing. Useful free utilities: IP checks, DNS Leak Test, Proxy Checker, proxy calculator, latency map, browser fingerprint generator. The promo code YOUTUBE20 provides a 20% discount on the first purchase.
- Testing Environments: isolated setups emulating load and synthetic SKUs to avoid disrupting production.
Case Studies and Results: Real Performance and Economics
Configuration #1: “Balanced” (Recommended as Start)
- Goal: 1,000,000 product cards/day.
- Infrastructure: 12–16 worker nodes (4–8 vCPUs, 8–16 GB RAM), Kafka or Redis Streams, ClickHouse cluster of 3 nodes (8–16 vCPUs, NVMe), Postgres 2 vCPUs.
- Proxies: pool of 200–300 mobile endpoints, sticky 10–20 minutes, soft rotation every 10 minutes, hard-rotate on triggers.
- Speed: 120–160 product cards/sec peak; average daily 11–13 cards/sec.
- Success Rate: 92–95% for HTML; 88–92% for secured JSON endpoints; overall 91–94%.
- Cost: computing and storage $300–600 monthly (depends on region and provider), proxies $900–1500 under the described load with quality rotation. Total $1200–2100/month with good pool management and reasonable TTL. At 1,000,000 cards/day (about 30 million cards per month), that works out to roughly $0.00004–0.00007 per product card, or $40–70 per daily run of one million cards.
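The throughput and unit-cost figures for this configuration can be checked with a few lines of arithmetic. The only assumption added here is a 30-day month; the daily volume and the monthly total are taken from the list above.

```python
CARDS_PER_DAY = 1_000_000
DAYS_PER_MONTH = 30              # assumption for this estimate
MONTHLY_COST_USD = (1200, 2100)  # total range for the "balanced" setup

avg_cards_per_sec = CARDS_PER_DAY / 86_400
cards_per_month = CARDS_PER_DAY * DAYS_PER_MONTH
cost_per_card = [c / cards_per_month for c in MONTHLY_COST_USD]
cost_per_1k = [c * 1000 for c in cost_per_card]

print(f"average rate: {avg_cards_per_sec:.1f} cards/sec")
print("per card: " + " to ".join(f"${c:.5f}" for c in cost_per_card))
print("per 1,000 cards: " + " to ".join(f"${c:.2f}" for c in cost_per_1k))
```

The average rate comes out at about 11.6 cards/sec, which matches the 11–13 cards/sec daily average quoted for this configuration, and the monthly total divided by roughly 30 million cards gives a unit cost of a few thousandths of a cent per card.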
Configuration #2: “Maximum Economy”
- Goal: the same 1 million/day, emphasis on economy.
- Infrastructure: 8–10 nodes with aggressive CPU utilization, more batches, and fewer parallel browsers.
- Proxies: 120–180 mobile endpoints, longer sticky (20–30 minutes), cautious speed.
- Speed: 80–110 product cards/sec peak.
- Success Rate: 88–92% overall.
- Cost: $800–1500/month.
Configuration #3: “High-availability”
- Goal: SLA 99.5% on schedule and increased accuracy.
- Infrastructure: 20–24 nodes, redundant DBs, two pools of proxies from different regions.
- Speed: 150–220 product cards/sec peak.
- Success Rate: 94–97% overall (thanks to a soft behavioral strategy and quality monitoring).
- Cost: $2000–3500/month.
Practical Outcomes
- Well-managed rotation of mobile proxies and properly maintained sessions give +6–12% to success rate compared to static pools.
- Reducing RPS during “heavy hours” decreases the share of 429s by 30–50% and increases overall throughput over the day.
- Storing full HTML only for changes reduces volume by 45–70% over a weekly horizon.
FAQ: In-depth Questions and Answers
1. Is a browser always necessary?
No. A browser is helpful for complex front ends and debugging. In most stable zones, a proper HTTP/2 client with correct headers, sessions, and a mobile profile suffices.
2. What is more critical for anti-ban — IP or behavior?
Both factors are critical. Mobile IPs increase trust, but crude behavior will still lead to denials. Proper delays, stable headers, and correct cookies are essential.
3. How to distribute the proxy pool among tasks?
Split by regions and page types. For product cards and sensitive JSON, keep the best endpoints with sticky sessions; for static content, use less “expensive” addresses.
4. How to diagnose “hidden” blocks?
Look at body size, time to first byte, redirect codes, absence of expected keys in JSON. Take HTML snapshots of “bad” responses and compare.
5. How to measure success?
Share of 2xx, completeness of fields, proportion of modified product cards, time to update slices, cost per card, consistency by hours.
6. What to do about captchas?
Minimize their occurrence through behavior and rotation. If necessary, use permitted manual or service verification approaches where legal and appropriate.
7. How to deal with “schema drift”?
Loosely coupled parsing: search for fields by stable selectors and signatures, maintain several resolvers, release patches promptly.
8. What format should be stored in Raw?
Compressed HTML/JSON (Zstd), add metadata: UA, IP/ASN (without personal data), timings, success indicators, and parser version.
9. How to quickly restart failed zones?
Separate queues by zones, feature flag to disable a segment, automatic rollback to “soft” profiles, escalate alerts to on-call engineers.
10. Can we mix our proxies with public ones?
We do not recommend it. Mixing degrades pool reputation and predictability. Keep pools clean and track metrics and quality.
Conclusion: Summary and Implementation Plan
Stable collection of 1,000,000 Wildberries product cards daily in 2026 is an achievable task with engineering discipline. The foundation of success is architecture with queues and adaptive workers, proper rotation of mobile proxies, behavioral “hygiene” of the client, correct storage, and observability. Start with the “balanced” configuration, measure metrics, bring the success rate up to 92–95%, then optimize costs through HTML versioning, careful rotation, and load distribution by hours.
Mini-checklist for starting: 1) Deploy queue and workers with adaptive throttling; 2) Configure mobile proxy pool with sticky sessions and scoring; 3) Define retry policies, circuit breaker, and poison queue; 4) Implement end-to-end metrics and alerts; 5) Store raw snapshots with Zstd and normalized tables in ClickHouse/Postgres; 6) Create dashboards for product analytics; 7) Conduct load testing and calibrate RPS by hours.
If you don’t have a stable pool of mobile addresses, start with a reputable provider like MobileProxy.Space: real SIM cards, 218+ million IPs, 53+ countries, rotation by timer and API, and 3 hours of free testing to find the optimal pool and measure latency via their latency map and Proxy Checker. And remember the promo code YOUTUBE20 — it immediately lowers the entry threshold. After that, it's a matter of technique: careful code, transparent metrics, and respect for the platform. Then, scraping one million product cards daily will become routine, not a feat.