Changelog History
v2.3.0 Changes
April 07, 2022

- feat: accept more social media patterns (#1286)
- feat: add multiple click support to `enqueueLinksByClickingElements` (#1295)
- feat: instance-scoped "global" configuration (#1315)
- feat: `RequestList` accepts `proxyConfiguration` for `requestsFromUrls` (#1317)
- feat: update `playwright` to v1.20.2
- feat: update `puppeteer` to v13.5.2
  > We noticed that with this version of puppeteer, an actor run could crash with a "We either navigate top level or have old version of the navigated frame" error (puppeteer issue here). It should not happen while running the browser in headless mode. In case you need to run the browser in headful mode (`headless: false`), we recommend pinning the puppeteer version to `10.4.0` in the actor's `package.json` file (a sketch follows this list).
- feat: stealth deprecation (#1314)
- feat: allow passing a stream to `KeyValueStore.setRecord` (#1325)
- fix: use correct apify-client instance for snapshotting (#1308)
- fix: automatically reset `RequestQueue` state after 5 minutes of inactivity, closes #997
- fix: improve guessing of chrome executable path on windows (#1294)
- fix: prune CPU snapshots locally (#1313)
- fix: improve browser launcher types (#1318)
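To illustrate the pinning recommendation from the puppeteer note above, here is a minimal `package.json` sketch; the `name`, `version`, and `apify` entries are placeholders, and only the exact `puppeteer` version matters:

```json
{
  "name": "my-actor",
  "version": "0.0.1",
  "dependencies": {
    "apify": "^2.3.0",
    "puppeteer": "10.4.0"
  }
}
```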
0 concurrency mitigation
This release should resolve the 0 concurrency bug by automatically resetting the internal `RequestQueue` state after 5 minutes of inactivity.

We now track the last activity done on a `RequestQueue` instance:

- added a new request
- started processing a request (added to the `inProgress` cache)
- marked a request as handled
- reclaimed a request

If we don't detect one of those actions in the last 5 minutes, and we have some requests in the `inProgress` cache, we try to reset the state. We can override this limit via the `CRAWLEE_INTERNAL_TIMEOUT` env var (see the sketch below).

This should finally resolve the 0 concurrency bug, as it was always about stuck requests in the `inProgress` cache.
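A minimal sketch of overriding that limit from an actor's own code; the SDK reads `CRAWLEE_INTERNAL_TIMEOUT` from the environment, but the unit assumed here (seconds) is an assumption, so verify it against the SDK source before relying on it:

```js
// Sketch: override the internal RequestQueue inactivity limit before
// the SDK reads it. The unit (seconds) is an assumption, not confirmed.
process.env.CRAWLEE_INTERNAL_TIMEOUT = '600';

const Apify = require('apify');

Apify.main(async () => {
    // The queue opened here should use the overridden limit when
    // deciding whether to reset its internal state.
    const requestQueue = await Apify.openRequestQueue();
    // ... set up a crawler with this queue as usual
});
```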
v2.2.2 Changes
February 14, 2022

- fix: ensure `request.headers` is set
- fix: lower `RequestQueue` API timeout to 30 seconds
- improve logging for fetching next request and timeouts
v2.2.1 Changes
January 03, 2022

- fix: ignore requests that are no longer in progress (#1258)
- fix: do not use `tryCancel()` from inside sync callback (#1265)
- fix: revert to puppeteer 10.x (#1276)
- fix: wait when `body` is not available in `infiniteScroll()` from Puppeteer utils (#1238)
- fix: expose logger classes on the `utils.log` instance (#1278)
v2.2.0 Changes
December 17, 2021

Proxy per page

Up until now, browser crawlers used the same session (and therefore the same proxy) for all requests from a single browser. Now, each session gets a new proxy. This means that with incognito pages, each page will get a new proxy, aligning the behaviour with `CheerioCrawler`.

This feature is not enabled by default. To use it, we need to enable the `useIncognitoPages` flag under `launchContext`:

```js
const crawler = new Apify.PlaywrightCrawler({
    launchContext: {
        useIncognitoPages: true,
    },
    // ...
});
```
> Note that currently there is a performance overhead for using `useIncognitoPages`. Use this flag at your own discretion.

We are planning to enable this feature by default in SDK v3.0.
Abortable timeouts
Previously, when a page function timed out, the task still kept running, which could lead to requests being processed multiple times. In v2.2 we now have abortable timeouts that will cancel the task as early as possible.
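A conceptual sketch of an abortable timeout, not the SDK's actual implementation; the helper name and signature here are illustrative assumptions:

```js
// Illustrative only: reject as soon as the timeout fires instead of
// waiting for the task to finish, so the caller can move on early.
function addTimeoutToPromise(taskFn, timeoutMillis, errorMessage) {
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => reject(new Error(errorMessage)), timeoutMillis);
        taskFn().then(
            (result) => { clearTimeout(timer); resolve(result); },
            (err) => { clearTimeout(timer); reject(err); },
        );
    });
}

// Usage sketch: abort waiting on a page function after 60 seconds.
// await addTimeoutToPromise(() => handlePage(page), 60_000, 'page function timed out');
```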
Mitigation of zero concurrency issue
Several new timeouts were added to the task function, which should help mitigate the zero concurrency bug. Namely, fetching of the next request information and reclaiming failed requests back to the queue are now executed with a timeout, with 3 additional retries before the task fails. The timeout is always at least 300 s (5 minutes), or `requestHandlerTimeoutSecs` if that value is higher (a conceptual sketch follows the list below).

Full list of changes
- fix `RequestError: URI malformed` in cheerio crawler (#1205)
- only provide Cookie header if cookies are present (#1218)
- handle extra cases for `diffCookie` (#1217)
- add timeout for task function (#1234)
- implement proxy per page in browser crawlers (#1228)
- add fingerprinting support (#1243)
- implement abortable timeouts (#1245)
- add timeouts with retries to `runTaskFunction()` (#1250)
- automatically convert google spreadsheet URLs to CSV exports (#1255)
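As referenced in the mitigation section above, a conceptual sketch of the timeout-with-retries pattern, not the SDK's actual code; `runWithRetries` and its parameters are illustrative assumptions:

```js
// Illustrative only: run an operation under a timeout and retry a few
// times before letting the whole task fail.
async function runWithRetries(operation, timeoutMillis, maxRetries = 3) {
    let lastError;
    for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
        try {
            return await Promise.race([
                operation(),
                new Promise((_, reject) => {
                    setTimeout(() => reject(new Error('Operation timed out')), timeoutMillis);
                }),
            ]);
        } catch (err) {
            lastError = err; // remember the failure and try again
        }
    }
    throw lastError;
}
```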
v2.1.0 Changes
October 07, 2021

- automatically convert google docs share urls to csv download ones in request list (#1174)
- use puppeteer emulating scrolls instead of `window.scrollBy` (#1170)
- warn if apify proxy is used in proxyUrls (#1173)
- fix `YOUTUBE_REGEX_STRING` being too greedy (#1171)
- add `purgeLocalStorage` utility method (#1187)
- catch errors inside request interceptors (#1188, #1190)
- add support for cgroups v2 (#1177)
- fix incorrect offset in `fixUrl` function (#1184)
- support channel and user links in YouTube regex (#1178)
- fix: allow passing `requestsFromUrl` to `RequestListOptions` in TS (#1191)
- allow passing `forceCloud` down to the KV store (#1186), closes #752
- merge cookies from session with user provided ones (#1201), closes #1197
- use `ApifyClient` v2 (full rewrite to TS)
v2.0.7 Changes
September 08, 2021

- Fix casting of int/bool environment variables (e.g. `APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE`), closes #956
- Fix incognito pages and user data dir (#1145)
- Add `@ts-ignore` comments to imports of optional peer dependencies (#1152)
- Use config instance in `sdk.openSessionPool()` (#1154)
- Add a breaking callback to `infiniteScroll` (#1140)
v2.0.6 Changes
August 27, 2021

- Fix deprecation messages logged from `ProxyConfiguration` and `CheerioCrawler`.
- Update `got-scraping` to receive multiple improvements.
v2.0.5 Changes
August 24, 2021

- Fix error handling in puppeteer crawler
v2.0.4 Changes
August 23, 2021

- Use `sessionToken` with `got-scraping`
v2.0.3 Changes
August 20, 2021

- BREAKING IN EDGE CASES: We removed `forceUrlEncoding` in `requestAsBrowser` because we found out that recent versions of the underlying HTTP client `got` already encode URLs, and `forceUrlEncoding` could lead to weird behavior. We think of this as fixing a bug, so we're not bumping the major version.
- Limit `handleRequestTimeoutMillis` to max valid value to prevent Node.js fallback to `1`.
- Use `got-scraping@^3.0.1`
- Disable SSL validation on MITM proxies