Changelog History
-
v3.1.2 Changes
November 15, 2022
Bug Fixes
- injectJQuery in context does not survive navs (#1661) (493a7cf)
- make router error message more helpful for undefined routes (#1678) (ab359d8)
- MemoryStorage: correctly respect the desc option (#1666) (b5f37f6)
- requestHandlerTimeout timing (#1660) (493ea0c)
- shallow clone browserPoolOptions before normalization (#1665) (22467ca)
- support headfull mode in playwright js project template (ea2e61b)
- support headfull mode in puppeteer js project template (e6aceb8)
Features
-
v3.1.1 Changes
November 07, 2022
Bug Fixes
- `utils.playwright.blockRequests` warning message (#1632) (76549eb)
- concurrency option override order (#1649) (7bbad03)
- handle non-error objects thrown gracefully (#1652) (c3a4e1a)
- mark session as bad on failed requests (#1647) (445ae43)
- support reloading of sessions with lots of retries (ebc89d2)
- fix type errors when `playwright` is not installed (#1637) (de9db0c)
- upgrade to [email protected] (#1623) (ce36d6b)
Features
-
v3.1.0 Changes
October 13, 2022
Bug Fixes
- add overload for `KeyValueStore.getValue` with defaultValue (#1541) (e3cb509)
- add retry attempts to methods in CLI (#1588) (9142e59)
- allow `label` in `enqueueLinksByClickingElements` options (#1525) (18b7c25)
- basic-crawler: handle `request.noRetry` after `errorHandler` (#1542) (2a2040e)
- build storage classes by using `this` instead of the class (#1596) (2b14eb7)
- correct some typing exports (#1527) (4a136e5)
- do not hide stack trace of (retried) Type/Syntax/ReferenceErrors (469b4b5)
- enqueueLinks: ensure the enqueue strategy is respected alongside user patterns (#1509) (2b0eeed)
- enqueueLinks: prevent useless request creations when filtering by user patterns (#1510) (cb8fe36)
- export `Cookie` from `crawlee` metapackage (7b02ceb)
- handle redirect cookies (#1521) (2f7fc7c)
- http-crawler: do not hang on POST without payload (#1546) (8c87390)
- remove undeclared dependency on core package from puppeteer utils (827ae60)
- support TypeScript 4.8 (#1507) (4c3a504)
- wait for persist state listeners to run when event manager closes (#1481) (aa550ed)
Features
- add `Dataset.exportToValue` (#1553) (acc6344)
- add `Dataset.getData()` shortcut (522ed6e)
- add `utils.downloadListOfUrls` to crawlee metapackage (7b33b0a)
- add `utils.parseOpenGraph()` (#1555) (059f85e)
- add `utils.playwright.compileScript` (#1559) (2e14162)
- add `utils.playwright.infiniteScroll` (#1543) (60c8289), closes #1528
- add `utils.playwright.saveSnapshot` (#1544) (a4ceef0)
- add global `useState` helper (#1551) (2b03177)
- add static `Dataset.exportToValue` (#1564) (a7c17d4)
- allow disabling storage persistence (#1539) (f65e3c6)
- bump puppeteer support to 17.x (#1519) (b97a852)
- core: add `forefront` option to `enqueueLinks` helper (f8755b6), closes #1595
- don't close page before calling errorHandler (#1548) (1c8cd82)
- enqueue links by clicking for Playwright (#1545) (3d25ade)
- error tracker (#1467) (6bfe1ce)
- make the CLI download directly from GitHub (#1540) (3ff398a)
- router: add userdata generic to addHandler (#1547) (19cdf13)
- use JSON5 for `INPUT.json` to support comments (#1538) (09133ff)
-
v3.0.4 Changes
August 22, 2022
-
v3.0.3 Changes
August 11, 2022
-
v3.0.2 Changes
July 28, 2022
Fixes
- regression in resolving the base url for enqueue link filtering (1422)
- improve file saving on memory storage (1421)
- add `UserData` type argument to `CheerioCrawlingContext` and related interfaces (1424)
- always limit `desiredConcurrency` to the value of `maxConcurrency` (bcb689d)
- wait for storage to finish before resolving `crawler.run()` (9d62d56)
- using explicitly typed router with `CheerioCrawler` (07b7e69)
- declare dependency on `ow` in `@crawlee/cheerio` package (be59f99)
- use `crawlee@^3.0.0` in the CLI templates (6426f22)
- fix building projects with TS when puppeteer and playwright are not installed (1404)
- enqueueLinks should respect full URL of the current request for relative link resolution (1427)
- use `desiredConcurrency: 10` as the default for `CheerioCrawler` (1428)
Features
-
v3.0.1 Changes
July 26, 2022
Fixes
- remove `JSONData` generic type arg from `CheerioCrawler` (#1402)
- rename default storage folder to just `storage` (#1403)
- remove trailing slash for proxyUrl (#1405)
- run browser crawlers in headless mode by default (#1409)
- rename interface `FailedRequestHandler` to `ErrorHandler` (#1410)
- ensure default route is not ignored in `CheerioCrawler` (#1411)
- add `headless` option to `BrowserCrawlerOptions` (#1412)
- processing custom cookies (#1414)
- enqueue link not finding relative links if the checked page is redirected (#1416)
- fix building projects with TS when puppeteer and playwright are not installed (#1404)
- calling `enqueueLinks` in browser crawler on page without any links (385ca27)
- improve error message when no default route provided (04c3b6a)
Features
- feat: add parseWithCheerio for puppeteer & playwright (#1418)
-
v3.0.0 Changes
July 13, 2022
This section summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3.
Crawlee vs Apify SDK
Up until version 3 of `apify`, the package contained both scraping-related tools and Apify platform-related helper methods. With v3 we are splitting the whole project into two main parts:
- Crawlee, the new web-scraping library, available as the `crawlee` package on NPM
- Apify SDK, helpers for the Apify platform, available as the `apify` package on NPM
Moreover, the Crawlee library is published as several packages under the `@crawlee` namespace:
- `@crawlee/core`: the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes
- `@crawlee/basic`: exports `BasicCrawler`
- `@crawlee/cheerio`: exports `CheerioCrawler`
- `@crawlee/browser`: exports `BrowserCrawler` (which is used for creating `@crawlee/playwright` and `@crawlee/puppeteer`)
- `@crawlee/playwright`: exports `PlaywrightCrawler`
- `@crawlee/puppeteer`: exports `PuppeteerCrawler`
- `@crawlee/memory-storage`: `@apify/storage-local` alternative
- `@crawlee/browser-pool`: previously `browser-pool` package
- `@crawlee/utils`: utility methods
- `@crawlee/types`: holds TS interfaces mainly about the `StorageClient`
Installing Crawlee
> As Crawlee is not yet released as `latest`, we need to install from the `next` distribution tag!
Most of the Crawlee packages extend and re-export each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`.
    npm install crawlee@next
Or if all we need is cheerio support, we can install only `@crawlee/cheerio`:
    npm install @crawlee/cheerio@next
When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used.
    npm install crawlee@next playwright
-
v2.3.2 Changes
May 05, 2022
- fix: use default user agent for playwright with chrome instead of the default "headless UA"
- fix: always hide webdriver of chrome browsers
-
v2.3.1 Changes
May 03, 2022
- fix: `utils.apifyClient` early instantiation (#1330)
- feat: `utils.playwright.injectJQuery()` (#1337)
- feat: add `keyValueStore` option to `Statistics` class (#1345)
- fix: ensure failed req count is correct when using `RequestList` (#1347)
- fix: random puppeteer crawler (running in headful mode) failure (#1348)
  > This should help with the `We either navigate top level or have old version of the navigated frame` bug in puppeteer.
- fix: allow returning falsy values in `RequestTransform`'s return type