v0.2.9
What's changed
New features
- Introducing the long-awaited async support for Scrapling! There is now an `AsyncFetcher` class version of `Fetcher`, and both `StealthyFetcher` and `PlayWrightFetcher` have a new `async_fetch` method with the same options.
  >> from scrapling import StealthyFetcher
  >> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection')  # the async version of fetch
  >> page.status == 200
  True
- The `StealthyFetcher` class now has a `geoip` argument in its fetch methods. When enabled, the class automatically uses the IP's longitude, latitude, timezone, country, and locale, then spoofs the WebRTC IP address. It also calculates and spoofs the browser's language based on the distribution of language speakers in the target region.
- Added the `retries` argument to the `Fetcher`/`AsyncFetcher` classes, so you can now set the number of retries for each request done by `httpx`.
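For readers unfamiliar with retry options, the general idea behind a `retries` argument can be sketched with a plain retry loop. This is a generic stdlib illustration, not Scrapling's or httpx's actual code; the function name and the "retries means extra attempts" semantics are assumptions for the sketch.

```python
import time

# Generic sketch of what a retries=N option does (not Scrapling's actual code):
# attempt the request up to N+1 times before giving up.
def fetch_with_retries(do_request, retries=3, delay=1.0):
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return do_request()
        except Exception as exc:  # real code would catch transport-specific errors
            last_exc = exc
            if attempt < retries:
                time.sleep(delay)
    raise last_exc

# Usage with a fake request that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("boom")
    return "ok"

print(fetch_with_retries(flaky, retries=3, delay=0))  # ok, after 3 attempts
```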
- Added the `url_join` method to `Adaptor` and the Fetchers, which takes a relative URL and joins it with the current URL to generate an absolute full URL!
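As a rough illustration of the joining behavior described above, standard URL resolution works like this (a stdlib sketch, not Scrapling's implementation; the example URLs are arbitrary):

```python
from urllib.parse import urljoin

# How a relative URL resolves against a page's current URL — the job a
# url_join-style helper performs.
current_url = "https://books.toscrape.com/catalogue/page-2.html"
print(urljoin(current_url, "page-3.html"))            # sibling path
print(urljoin(current_url, "/index.html"))            # root-relative path
print(urljoin(current_url, "https://example.com/x"))  # absolute URL wins
```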
- Added the `keep_cdata` method to `Adaptor` and the Fetchers to stop the parser from removing CDATA when needed.
- The `Adaptor`/`Response` `body` method now returns the raw HTML response when possible (without the library processing it).
- Added logging for the `Response` class, so when you use the Fetchers you now get a log line with info about the response you got.
  Example:
  >> from scrapling.defaults import Fetcher
  >> Fetcher.get('https://books.toscrape.com/index.html')
  [2024-12-16 13:33:36] INFO: Fetched (200) <GET https://books.toscrape.com/index.html> (referer: https://www.google.com/search?q=toscrape)
  >>
- Using any standard string method on a `TextHandler`, such as `.replace()`, now returns another `TextHandler`; it used to return a standard string.
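The pattern behind this change can be illustrated with a plain `str` subclass (an illustrative sketch, not Scrapling's actual `TextHandler` code):

```python
# Illustrative sketch: a str subclass whose string methods return the subclass
# instead of plain str, which is the behavior change described for TextHandler.
class MyText(str):
    def replace(self, old, new, count=-1):
        # Delegate to str.replace, then re-wrap so chaining keeps the subclass
        return MyText(super().replace(old, new, count))

result = MyText("hello world").replace("world", "there")
print(result)                 # hello there
print(type(result).__name__)  # MyText, not str
```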
- Big improvements to speed across the library, and improvements to stealth in the Fetcher classes overall.
- Added dummy functions like `extract_first` and `extract`, which return the same result as their parent methods. These functions were added only to make it easy to copy code from Scrapy/Parsel to Scrapling when needed, as they are used there!
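The alias idea can be sketched like this (a hypothetical minimal selector class for illustration only, not Scrapling's code):

```python
# Hypothetical class showing Parsel-style alias methods that simply delegate,
# so Scrapy/Parsel snippets can be pasted with minimal changes.
class Selector:
    def __init__(self, values):
        self._values = values

    def get(self):
        return self._values[0] if self._values else None

    def getall(self):
        return list(self._values)

    # Compatibility aliases: same results as their parents.
    extract_first = get
    extract = getall

s = Selector(["a", "b"])
print(s.extract_first())  # a, same as s.get()
print(s.extract())        # ['a', 'b'], same as s.getall()
```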
- Due to refactoring a lot of the code and using caching in the right places, doing requests in bulk now has a big speed increase.
Breaking changes
- Support for Python 3.8 has been dropped (mainly because Playwright stopped supporting it, but it was a problematic version anyway).
- The `debug` argument has been removed from the whole library. If you want to switch the library to debug logging now, do this after importing the library:
  >>> import logging
  >>> logging.getLogger("scrapling").setLevel(logging.DEBUG)
Bugs Squashed
- WebGL is now enabled by default, as a lot of protections now check whether it's enabled.
- Fixed some mistakes and typos in the docs/README.
Quality of life changes
- All logging is now unified under the logger name `scrapling` for easier and cleaner control; we were using the root logger before.
- Restructured the tests folder into a cleaner structure and added tests for the new features. All the tests were rewritten into a cleaner version, and more tests were added for higher coverage.
- Refactored a big part of the code to be cleaner and easier to maintain.
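Since all logging now goes through the `scrapling` logger, it can be configured independently of the root logger. A minimal sketch (the handler and format choices are just examples, not anything Scrapling requires):

```python
import logging

# Grab Scrapling's unified logger by name and configure it in isolation
log = logging.getLogger("scrapling")
log.setLevel(logging.DEBUG)

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s"))
log.addHandler(handler)
log.propagate = False  # keep these records out of the root logger's handlers

log.debug("debug logging enabled for scrapling only")
```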
All these changes were part of what I had planned for version 0.3 but decided to add here, because it will be some time until the next version. The next step is to finish the detailed documentation website and then work on version 0.3.
Note
A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort, which I have been happily doing for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.