Releases: D4Vinci/Scrapling

v0.2.1

15 Nov 16:27
773fcd5

What's changed

New features

  1. The Response object returned by every fetcher is now the same as the Adaptor object, except that it has these added attributes: status, reason, cookies, headers, and request_headers. The cookies, headers, and request_headers attributes are always dictionaries.
    So your code can now look like this:
    >>> from scrapling import Fetcher
    >>> page = Fetcher().get('https://example.com', stealthy_headers=True)
    >>> print(page.status)
    200
    >>> products = page.css('.product')
    Instead of the previous form:
    >>> from scrapling import Fetcher
    >>> fetcher = Fetcher().get('https://example.com', stealthy_headers=True)
    >>> print(fetcher.status)
    200
    >>> page = fetcher.adaptor
    >>> products = page.css('.product')
    I have left the .adaptor property working for backward compatibility.
  2. Both the StealthyFetcher and PlayWrightFetcher classes now accept a proxy argument in the fetch method; it can be a string or a dictionary.
  3. The StealthyFetcher class now has an os_randomize argument in the fetch method. If enabled, Scrapling randomizes the OS fingerprints it uses. By default, Scrapling matches the fingerprints to the current OS.
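
The design in item 1 (a response that is itself the parsed page, with .adaptor kept for old code) can be illustrated with a minimal sketch. PageSketch and ResponseSketch are hypothetical stand-ins written for this illustration, not Scrapling's classes, and the title method is a toy parsing operation; how Scrapling implements this internally is not stated in these notes.

```python
import re
from typing import Dict, Optional


class PageSketch:
    """Hypothetical stand-in for a parsed page (not Scrapling's Adaptor)."""

    def __init__(self, html: str) -> None:
        self.html = html

    def title(self) -> Optional[str]:
        # Toy parsing operation: return the contents of <title>, if present.
        match = re.search(r"<title>(.*?)</title>", self.html, re.S)
        return match.group(1) if match else None


class ResponseSketch(PageSketch):
    """A parsed page that also carries the response metadata listed above."""

    def __init__(
        self,
        html: str,
        status: int,
        reason: str,
        cookies: Optional[Dict[str, str]] = None,
        headers: Optional[Dict[str, str]] = None,
        request_headers: Optional[Dict[str, str]] = None,
    ) -> None:
        super().__init__(html)
        self.status = status
        self.reason = reason
        # Stored as plain dictionaries, whatever mapping type was passed in.
        self.cookies = dict(cookies or {})
        self.headers = dict(headers or {})
        self.request_headers = dict(request_headers or {})

    @property
    def adaptor(self) -> "ResponseSketch":
        # Backward compatibility (assumed approach): old code that still
        # writes `response.adaptor` receives the very same object.
        return self


page = ResponseSketch(
    "<html><title>Example</title></html>",
    status=200,
    reason="OK",
    headers={"Content-Type": "text/html"},
)
print(page.status, page.title())
print(page.adaptor is page)
```

With this shape, new code calls parsing methods directly on the response, while old code that goes through .adaptor keeps working unchanged.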

Bugs Squashed

  1. Fixed a bug that occurred when passing headers with the Fetcher class.
  2. Fixed a bug with parsing JSON responses returned by the fetcher-type classes.

Quality of life changes

  1. The text functionality used to try to remove HTML comments before returning the text, but that caused errors in some cases and made the code more complicated than needed. It has now reverted to the default lxml behavior, so you will notice a slight speed increase in all operations that rely on elements' text, such as selectors. If you want Scrapling to remove HTML comments from elements before returning the text, avoiding the odd text-splitting behavior found in lxml/parsel/scrapy, just leave the keep_comments argument set to False, as it is by default.
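
The text-splitting behavior mentioned above can be reproduced without Scrapling. The sketch below uses Python's standard html.parser as a stand-in (an assumption made for illustration; it is not lxml and not Scrapling's code) to show how a comment in the middle of an element splits its text into two chunks, and how joining the chunks recovers the full text. The TextChunks class and text_chunks function are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class TextChunks(HTMLParser):
    """Collects every text node the parser reports; comments are ignored."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        # Called once per run of text; a comment in between ends the run.
        self.chunks.append(data)


def text_chunks(html: str) -> List[str]:
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return parser.chunks


chunks = text_chunks('<span>CONDITION: <!-- -->Excellent</span>')
print(chunks)           # ['CONDITION: ', 'Excellent']
print(''.join(chunks))  # CONDITION: Excellent
```

The comment itself never appears in the output, yet it still breaks the text into two pieces, which is the effect described for lxml/parsel/scrapy.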

Note

A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort which I have been happily doing for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.

Version 0.2

11 Nov 17:31
4b5f9e1

What's changed

New features

  1. Introducing the Fetchers feature with 3 new main types to make Scrapling fetch pages for you with a LOT of options!
    • The Fetcher class, for basic HTTP requests.
    • The StealthyFetcher class, a completely stealthy fetcher that uses a modified, stealthy version of Firefox.
    • The PlayWrightFetcher class, which allows doing browser-based requests with Vanilla PlayWright, PlayWright with stealth mode made by me, Real browsers through CDP, and NSTBrowser's docker browserless!
  2. Added the completely new find_all/find methods to find elements easily on the page with dark magic!
  3. Added the methods filter and search to the Adaptors class for easier bulk operations on Adaptor object groups.
  4. Added the css_first and xpath_first methods for easier usage.
  5. Added the new class type TextHandlers, which is used for bulk operations on TextHandler objects, just as the Adaptors class is used for Adaptor objects.
  6. Added generate_full_css_selector and generate_full_xpath_selector methods.
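
To give a feel for the bulk operations in item 3, here is a minimal sketch of a list-like group with filter and search methods. GroupSketch is a hypothetical class operating on plain strings, and the semantics shown (filter keeps every item that satisfies a predicate, search returns the first one that does) are an assumption for illustration, not a statement of how the Adaptors class behaves.

```python
from typing import Callable, Optional


class GroupSketch(list):
    """Hypothetical list-like container standing in for a group of elements."""

    def filter(self, predicate: Callable[[str], bool]) -> "GroupSketch":
        # Keep every item that satisfies the predicate; returning the same
        # container type lets calls be chained.
        return GroupSketch(item for item in self if predicate(item))

    def search(self, predicate: Callable[[str], bool]) -> Optional[str]:
        # Return the first item that satisfies the predicate, or None.
        for item in self:
            if predicate(item):
                return item
        return None


prices = GroupSketch(["$10", "n/a", "$25", "$7"])
priced = prices.filter(lambda s: s.startswith("$"))
print(priced)                                     # ['$10', '$25', '$7']
print(priced.search(lambda s: int(s[1:]) > 20))   # $25
```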

Bugs Squashed

  1. The Adaptors class version of re_first now returns the first result that matches across all the Adaptor objects inside it, instead of the faulty logic of returning the re_first results of all the Adaptor objects.
  2. If you select text-type content to be returned from the selected elements (for example with the css ::text function) through any method such as .css or .xpath, the Adaptor object now returns the TextHandlers class instead of a list of strings like before. So now you can do page.css('something::text').re_first(r'regex_pattern').json() instead of page.css('something::text')[0].re_first(r'regex_pattern').json()
  3. The Adaptor/Adaptors re/re_first arguments are now consistent with the TextHandler ones, so you now have the clean_match and case_sensitive arguments.
  4. The auto_match argument is now enabled by default in the initialization of Adaptor, but you still have to enable it while selecting elements if you want to use it. (Not a bug but a design decision.)
  5. A lot of type-annotation corrections here and there for a better auto-completion experience while you are coding with Scrapling.
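
The difference described in item 1 can be sketched with plain strings and the standard re module. Both functions below are hypothetical illustrations, not Scrapling's code. The clean_match and case_sensitive parameters borrow the names from item 3, but their defaults and exact effect here (case-insensitive unless asked otherwise, whitespace collapsed before matching) are assumptions.

```python
import re
from typing import Iterable, List, Optional


def re_first_per_item(texts: Iterable[str], pattern: str) -> List[Optional[str]]:
    """The old, faulty shape: one first-match result per item."""
    results = []
    for text in texts:
        match = re.search(pattern, text)
        results.append(match.group(0) if match else None)
    return results


def re_first_across(
    texts: Iterable[str],
    pattern: str,
    clean_match: bool = False,
    case_sensitive: bool = False,
) -> Optional[str]:
    """The fixed shape: the first match found across all items, or None."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for text in texts:
        if clean_match:
            # Assumed meaning: collapse runs of whitespace before matching.
            text = " ".join(text.split())
        match = re.search(pattern, text, flags)
        if match:
            return match.group(0)
    return None


texts = ["no price here", "Price:   $25", "price: $7"]
print(re_first_per_item(texts, r"\$\d+"))   # [None, '$25', '$7']
print(re_first_across(texts, r"\$\d+"))     # $25
print(re_first_across(texts, r"price: \$\d+", case_sensitive=True))  # price: $7
print(re_first_across(texts, r"price: \$\d+", clean_match=True))     # Price: $25
```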

Quality of life changes

  1. Renamed both css_selector and xpath_selector methods to generate_css_selector and generate_xpath_selector for clarity and to not interrupt the auto-completion while coding.
  2. Restructured most of the old code into a core subpackage, along with other design decisions, for cleaner and easier maintenance in the future.
  3. Restructured the tests folder into a cleaner structure and added tests for the new features. Also now tox environments are cached on GitHub for faster automated tests with each commit.

v0.1.2

16 Oct 16:44

Changelog:

  • Fixed a bug where the keep_comments argument was not working as intended.
  • Adjusted the text function to automatically remove HTML comments from elements before extracting their text, to avoid the text splitting caused by lxml's default behavior. For example:
    >>> from scrapling import Adaptor
    >>> page = Adaptor('<span>CONDITION: <!-- -->Excellent</span>', keep_comments=True)
    >>> page.css('span::text')
    ['CONDITION: ', 'Excellent']
    The output above is what this previously returned because of lxml's default behavior; now it returns the full text 'CONDITION: Excellent'.
    This behavior is known in parsel/scrapy as well, so I wanted to handle it here.
  • Fixed a bug where the SQLite db file created by the library was not deleted when doing pip uninstall scrapling or similar.

v0.1.1

14 Oct 00:21

Minor fixes

v0.1

13 Oct 22:01