Ditch the Browser: Creating a Lightweight ASPX Scraper
Introduction
Scraping websites is often straightforward - fetch the content, parse it, and you’re done. But ASPX-based websites present a unique challenge: they rely on hidden fields like __VIEWSTATE to manage user interactions and maintain the state of the page. Without properly handling these state variables, your requests will fail or return incorrect results.
Adding to the complexity, ASPX pages often require persistent cookies to track sessions and include dynamic forms that need precise data to function correctly. For many developers, the go-to solution is to use browser-based tools like Puppeteer or Selenium, but these come with significant drawbacks - they’re heavy, slow, and overkill for these types of tasks.
In this post, I’ll share how I built a lightweight, browser-free TypeScript scraper specifically designed for ASPX pages. It’s efficient, resource-friendly, and eliminates the need for headless browsers while seamlessly handling cookies, ViewState, and form submissions.
Core Implementation
Below are the key components of the scraper, explaining how they work together to handle cookies, state variables, and form submissions on ASPX pages.
1. Cookie Management
Cookies play a crucial role in maintaining sessions. Here’s how the scraper manages cookies efficiently:
private updateCookies(newCookies: string[]): void {
  newCookies.forEach((cookie) => {
    const [keyValue] = cookie.split(";");
    // Split on the first "=" only: cookie values (e.g. base64) may themselves contain "=".
    const separatorIndex = keyValue.indexOf("=");
    if (separatorIndex > 0) {
      const key = keyValue.slice(0, separatorIndex).trim();
      const value = keyValue.slice(separatorIndex + 1).trim();
      this.cookieStore[key] = value;
    }
  });
}

private getCookieHeader(): string {
  return Object.entries(this.cookieStore)
    .map(([key, value]) => `${key}=${value}`)
    .join("; ");
}

2. ViewState Handling
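For context, the hidden state inputs in a raw ASPX response look roughly like the fragment below (the field values are shortened and made up). Purely as illustration, a regex can pull a single field out without any dependencies - though the scraper itself relies on cheerio, which copes better with real-world markup:

```typescript
// Illustrative fragment of an ASPX response; the values are made up.
const sampleHtml = `
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTc2Mz..." />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="CA0B0334" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEdAAN3bJ..." />
`;

// Naive, dependency-free extraction of one hidden field by its id attribute.
function extractHiddenField(html: string, fieldId: string): string {
  const match = html.match(new RegExp(`id="${fieldId}"[^>]*value="([^"]*)"`));
  return match ? match[1] : "";
}

console.log(extractHiddenField(sampleHtml, "__VIEWSTATEGENERATOR")); // "CA0B0334"
```

This works only because the id values here contain no regex metacharacters; a proper HTML parser is the safer default.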
ASPX pages use hidden fields like __VIEWSTATE to maintain page state. The scraper extracts and manages these variables using the following method:
private parseStateVariables(html: string): void {
  const $ = cheerio.load(html);
  this.viewState = this.normalizeValue($("input#__VIEWSTATE").val());
  this.viewStateGenerator = this.normalizeValue(
    $("input#__VIEWSTATEGENERATOR").val()
  );
  this.eventValidation = this.normalizeValue(
    $("input#__EVENTVALIDATION").val()
  );
}

private normalizeValue(value: string | string[] | undefined): string {
  if (Array.isArray(value)) {
    return value[0] || "";
  }
  return value || "";
}

3. Form Submission
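An ASPX postback is just an ordinary form-urlencoded POST in which the client echoes the server's hidden fields back alongside the user's input. A small sketch of what such a body looks like - the txtSearch/btnSearch field names and all values are hypothetical:

```typescript
// Sketch of an ASPX postback body; field names and values are made up.
const body = new URLSearchParams({
  __VIEWSTATE: "/wEPDwUKLTc2Mz...",
  __VIEWSTATEGENERATOR: "CA0B0334",
  __EVENTVALIDATION: "/wEdAAN3bJ...",
  txtSearch: "widgets", // hypothetical text input on the form
  btnSearch: "Search",  // hypothetical button that "triggered" the postback
}).toString();

console.log(body.startsWith("__VIEWSTATE=")); // true
```

URLSearchParams handles the percent-encoding (note that ViewState values are base64 and routinely contain "/", "+", and "="), which is why hand-concatenating the body is a common source of subtle failures.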
The scraper automates form submissions by including the necessary state variables:
async submitForm(
  endpoint: string,
  formData: Record<string, string | number | undefined>
): Promise<string> {
  // URLSearchParams expects string values, so drop undefined entries
  // and stringify numbers before appending the state fields.
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(formData)) {
    if (value !== undefined) {
      params.append(key, String(value));
    }
  }
  params.append("__VIEWSTATE", this.viewState);
  params.append("__VIEWSTATEGENERATOR", this.viewStateGenerator);
  params.append("__EVENTVALIDATION", this.eventValidation);

  const response = await this.fetch(`${this.baseUrl}${endpoint}`, {
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
      Cookie: this.getCookieHeader(),
    },
    data: params.toString(),
  });

  const html = response.data;
  // Re-parse the state fields: every postback returns fresh values.
  this.parseStateVariables(html);
  return html;
}

4. Fetching Results
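Result pages from ASPX applications are typically server-rendered HTML tables (often a GridView control), so once the markup is in hand, extraction is plain parsing work. The scraper hands the HTML to cheerio; purely as a dependency-free illustration, here is what pulling cell text out of such a table can look like (the markup is made up):

```typescript
// Made-up fragment of a rendered GridView-style results table.
const resultsHtml = `
  <table id="gvResults">
    <tr><td>Alpha</td><td>12</td></tr>
    <tr><td>Beta</td><td>34</td></tr>
  </table>
`;

// Naive cell extraction; real scraping should prefer a proper parser like cheerio.
function extractCells(html: string): string[] {
  const cells: string[] = [];
  const cellPattern = /<td>([^<]*)<\/td>/g;
  let match: RegExpExecArray | null;
  while ((match = cellPattern.exec(html)) !== null) {
    cells.push(match[1]);
  }
  return cells;
}

console.log(extractCells(resultsHtml)); // ["Alpha", "12", "Beta", "34"]
```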
After submitting forms or navigating pages, the scraper retrieves and parses results:
async fetchResults(endpoint: string): Promise<cheerio.Root> {
  const response = await this.fetch(`${this.baseUrl}${endpoint}`, {
    method: "GET",
    headers: {
      "Accept-Language": "en-IL,en;q=0.9",
      Cookie: this.getCookieHeader(),
    },
  });
  const html = response.data;
  return cheerio.load(html);
}

Conclusion
Scraping ASPX pages doesn’t have to involve heavy tools like Puppeteer or Selenium. A lightweight, TypeScript-based approach focusing on HTTP-level interactions can be significantly faster and more efficient for these scenarios. By automating state management and cookie handling, this scraper offers a practical solution for ASPX-specific challenges.
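To tie the pieces together, here is a hedged, self-contained sketch of how such a scraper is driven. MiniAspxScraper is a hypothetical, stripped-down re-assembly of the snippets in this post, with the HTTP layer stubbed out (canned response, regex instead of cheerio) so the sequence runs offline:

```typescript
// Hedged sketch: a minimal stand-in for the scraper class described above.
type FakeResponse = { data: string; headers: Record<string, string[]> };

class MiniAspxScraper {
  private cookieStore: Record<string, string> = {};
  private viewState = "";

  // Stub for the HTTP layer; a real scraper would issue a request here.
  private async fetch(endpoint: string): Promise<FakeResponse> {
    return {
      data: `<input id="__VIEWSTATE" value="state-for-${endpoint}" />`,
      headers: { "set-cookie": ["ASP.NET_SessionId=abc123; path=/"] },
    };
  }

  private updateCookies(newCookies: string[]): void {
    for (const cookie of newCookies) {
      const [pair] = cookie.split(";");
      const i = pair.indexOf("=");
      if (i > 0) this.cookieStore[pair.slice(0, i).trim()] = pair.slice(i + 1).trim();
    }
  }

  private parseStateVariables(html: string): void {
    const m = html.match(/id="__VIEWSTATE"[^>]*value="([^"]*)"/);
    this.viewState = m ? m[1] : "";
  }

  getCookieHeader(): string {
    return Object.entries(this.cookieStore).map(([k, v]) => `${k}=${v}`).join("; ");
  }

  getViewState(): string {
    return this.viewState;
  }

  // Typical first step: GET the page to seed cookies and ViewState,
  // so later POSTs can echo that state back.
  async loadPage(endpoint: string): Promise<void> {
    const response = await this.fetch(endpoint);
    this.updateCookies(response.headers["set-cookie"] ?? []);
    this.parseStateVariables(response.data);
  }
}

const scraper = new MiniAspxScraper();
scraper.loadPage("/Search.aspx").then(() => {
  console.log(scraper.getCookieHeader()); // "ASP.NET_SessionId=abc123"
  console.log(scraper.getViewState());    // "state-for-/Search.aspx"
});
```

The key rhythm - GET once to establish state, then alternate POST and re-parse - is what replaces everything a headless browser would otherwise do implicitly.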