
Commit

Merge pull request #18 from D4Vinci/dev
v0.2.7
D4Vinci authored Nov 26, 2024
2 parents 468d9b8 + 06a47f9 commit 26aebba
Showing 10 changed files with 101 additions and 40 deletions.
20 changes: 14 additions & 6 deletions README.md
@@ -44,10 +44,11 @@ Scrapling is a high-performance, intelligent web scraping library for Python tha
* [Text Extraction Speed Test (5000 nested elements).](#text-extraction-speed-test-5000-nested-elements)
* [Extraction By Text Speed Test](#extraction-by-text-speed-test)
* [Installation](#installation)
* [Fetching Websites Features](#fetching-websites-features)
* [Fetcher](#fetcher)
* [StealthyFetcher](#stealthyfetcher)
* [PlayWrightFetcher](#playwrightfetcher)
* [Fetching Websites](#fetching-websites)
* [Features](#features)
* [Fetcher class](#fetcher)
* [StealthyFetcher class](#stealthyfetcher)
* [PlayWrightFetcher class](#playwrightfetcher)
* [Advanced Parsing Features](#advanced-parsing-features)
* [Smart Navigation](#smart-navigation)
* [Content-based Selection & Finding Similar Elements](#content-based-selection--finding-similar-elements)
@@ -210,7 +211,10 @@ playwright install chromium
python -m browserforge update
```
## Fetching Websites Features
## Fetching Websites
Fetchers are interfaces that make a single request to fetch a page for you and then return an `Adaptor` object. This feature was introduced because previously the only option was to fetch the page however you wanted, then pass it manually to the `Adaptor` class to create an `Adaptor` instance and start playing around with the page.
### Features
You might be a bit confused by now, so let me clear things up. All fetcher-type classes are imported the same way:
```python
from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
```
@@ -233,9 +237,11 @@ Also, the `Response` object returned from all fetchers is the same as `Adaptor`
This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options. Here you can make `GET`, `POST`, `PUT`, and `DELETE` requests.

For all methods, you have the `stealth_headers` argument, which makes `Fetcher` create and use a real browser's headers, then add a referer header as if the request came from a Google search for this URL's domain. It's enabled by default.
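The `generate_convincing_referer` helper imported by the engine code further down presumably builds that referer. Here is a minimal stdlib sketch of the idea — the exact URL shape and domain handling are assumptions, not Scrapling's actual implementation:

```python
from urllib.parse import urlparse

def fake_google_referer(url: str) -> str:
    """Hypothetical sketch: build a referer that looks like a Google
    search for the target URL's domain (not Scrapling's real code)."""
    # 'https://www.example.com/a' -> hostname 'example.com' -> 'example'
    hostname = (urlparse(url).hostname or "").removeprefix("www.")
    site_name = hostname.split(".")[0]
    return f"https://www.google.com/search?q={site_name}"

print(fake_google_referer("https://httpbin.org/get"))
# -> https://www.google.com/search?q=httpbin
```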

You can route all traffic (HTTP and HTTPS) through a proxy for any of these methods using this format: `http://username:password@localhost:8030`
```python
>> page = Fetcher().get('https://httpbin.org/get', stealth_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'})
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')
```
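The engine code below passes that proxy string through `construct_proxy_dict` before handing it to the browser. A rough stdlib sketch of what such a helper might do, assuming a Playwright-style `{server, username, password}` dict as the output (an assumption — not Scrapling's actual code):

```python
from urllib.parse import urlparse

def proxy_string_to_dict(proxy: str) -> dict:
    """Hypothetical sketch: split 'http://user:pass@host:port'
    into a Playwright-style proxy dict."""
    parsed = urlparse(proxy)
    return {
        "server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
        "username": parsed.username or "",
        "password": parsed.password or "",
    }

print(proxy_string_to_dict("http://username:password@localhost:8030"))
# -> {'server': 'http://localhost:8030', 'username': 'username', 'password': 'password'}
```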
@@ -263,6 +269,7 @@ True
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
| humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
| allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
| disable_ads | Enabled by default; when enabled, this installs the `uBlock Origin` addon in the browser to block ads. | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
@@ -317,6 +324,7 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
| stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
| real_chrome | If you have the Chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it. | ✔️ |
| locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
| nstbrowser_mode | Enables NSTBrowser mode, **it has to be used with the `cdp_url` argument or it will be completely ignored.** | ✔️ |
| nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | ✔️ |
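The interplay between the last three options can be captured in a tiny illustrative check — this is not Scrapling's code, just a restatement of the table's rule that `nstbrowser_mode` only takes effect alongside `cdp_url`:

```python
from typing import Optional

def nstbrowser_mode_takes_effect(cdp_url: Optional[str], nstbrowser_mode: bool) -> bool:
    """Illustrative only: `nstbrowser_mode` is completely
    ignored unless `cdp_url` is also given."""
    return bool(nstbrowser_mode and cdp_url)

print(nstbrowser_mode_takes_effect(None, True))
# -> False
```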
2 changes: 1 addition & 1 deletion scrapling/__init__.py
@@ -4,7 +4,7 @@
from scrapling.core.custom_types import TextHandler, AttributesHandler

__author__ = "Karim Shoair (karim.shoair@pm.me)"
__version__ = "0.2.6"
__version__ = "0.2.7"
__copyright__ = "Copyright (c) 2024 Karim Shoair"


13 changes: 12 additions & 1 deletion scrapling/engines/camo.py
@@ -12,6 +12,7 @@
generate_convincing_referer,
)

from camoufox import DefaultAddons
from camoufox.sync_api import Camoufox


@@ -21,7 +22,8 @@ def __init__(
block_webrtc: Optional[bool] = False, allow_webgl: Optional[bool] = False, network_idle: Optional[bool] = False, humanize: Optional[Union[bool, float]] = True,
timeout: Optional[float] = 30000, page_action: Callable = do_nothing, wait_selector: Optional[str] = None, addons: Optional[List[str]] = None,
wait_selector_state: str = 'attached', google_search: Optional[bool] = True, extra_headers: Optional[Dict[str, str]] = None,
proxy: Optional[Union[str, Dict[str, str]]] = None, os_randomize: Optional[bool] = None, adaptor_arguments: Dict = None
proxy: Optional[Union[str, Dict[str, str]]] = None, os_randomize: Optional[bool] = None, disable_ads: Optional[bool] = True,
adaptor_arguments: Dict = None,
):
"""An engine that utilizes Camoufox library, check the `StealthyFetcher` class for more documentation.
@@ -36,6 +38,7 @@ def __init__(
:param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window.
:param allow_webgl: Whether to allow WebGL. To prevent leaks, only use this for special cases.
:param network_idle: Wait for the page until there are no network connections for at least 500 ms.
:param disable_ads: Enabled by default; when enabled, this installs the `uBlock Origin` addon in the browser to block ads.
:param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
:param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
@@ -54,6 +57,7 @@ def __init__(
self.network_idle = bool(network_idle)
self.google_search = bool(google_search)
self.os_randomize = bool(os_randomize)
self.disable_ads = bool(disable_ads)
self.extra_headers = extra_headers or {}
self.proxy = construct_proxy_dict(proxy)
self.addons = addons or []
@@ -75,9 +79,11 @@ def fetch(self, url: str) -> Response:
:param url: Target url.
:return: A `Response` object that is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`
"""
addons = [] if self.disable_ads else [DefaultAddons.UBO]
with Camoufox(
proxy=self.proxy,
addons=self.addons,
exclude_addons=addons,
headless=self.headless,
humanize=self.humanize,
i_know_what_im_doing=True, # To turn warnings off with the user configurations
@@ -105,6 +111,11 @@ def fetch(self, url: str) -> Response:
if self.wait_selector and type(self.wait_selector) is str:
waiter = page.locator(self.wait_selector)
waiter.first.wait_for(state=self.wait_selector_state)
# Wait again after waiting for the selector, helpful with protections like Cloudflare
page.wait_for_load_state(state="load")
page.wait_for_load_state(state="domcontentloaded")
if self.network_idle:
page.wait_for_load_state('networkidle')

# This will be parsed inside `Response`
encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding
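The `fetch` method above stores the raw `Content-Type` header value and notes it "will be parsed inside `Response`". A minimal stdlib sketch of what extracting a charset from such a header could look like — an assumption about the parsing, not Scrapling's actual `Response` code:

```python
def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Hypothetical sketch: pull the charset out of a Content-Type
    header value, e.g. 'text/html; charset=ISO-8859-1' -> 'iso-8859-1'."""
    # Parameters follow the media type, separated by semicolons
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            return value.strip("\"'").lower()
    return default

print(charset_from_content_type("text/html; charset=ISO-8859-1"))
# -> iso-8859-1
```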
2 changes: 1 addition & 1 deletion scrapling/engines/constants.py
@@ -44,7 +44,7 @@
'--disable-default-apps',
'--disable-print-preview',
'--disable-dev-shm-usage',
'--disable-popup-blocking',
# '--disable-popup-blocking',
'--metrics-recording-only',
'--disable-crash-reporter',
'--disable-partial-raster',
24 changes: 21 additions & 3 deletions scrapling/engines/pw.py
@@ -26,6 +26,7 @@ def __init__(
timeout: Optional[float] = 30000,
page_action: Callable = do_nothing,
wait_selector: Optional[str] = None,
locale: Optional[str] = 'en-US',
wait_selector_state: Optional[str] = 'attached',
stealth: Optional[bool] = False,
real_chrome: Optional[bool] = False,
@@ -50,6 +51,7 @@ def __init__(
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
:param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
:param wait_selector: Wait for a specific css selector to be in a specific state.
:param locale: Set the locale for the browser if wanted. The default value is `en-US`.
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
:param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
:param real_chrome: If you have chrome browser installed on your device, enable this and the Fetcher will launch an instance of your browser and use it.
@@ -64,6 +66,7 @@ def __init__(
:param adaptor_arguments: The arguments that will be passed in the end while creating the final Adaptor's class.
"""
self.headless = headless
self.locale = check_type_validity(locale, [str], 'en-US', param_name='locale')
self.disable_resources = disable_resources
self.network_idle = bool(network_idle)
self.stealth = bool(stealth)
@@ -87,6 +90,14 @@ def __init__(
self.nstbrowser_mode = bool(nstbrowser_mode)
self.nstbrowser_config = nstbrowser_config
self.adaptor_arguments = adaptor_arguments if adaptor_arguments else {}
self.harmful_default_args = [
# These are ignored to reduce detection and to avoid abuse of the popup crash bug: https://issues.chromium.org/issues/340836884
'--enable-automation',
'--disable-popup-blocking',
# '--disable-component-update',
# '--disable-default-apps',
# '--disable-extensions',
]

def _cdp_url_logic(self, flags: Optional[List] = None) -> str:
"""Constructs new CDP URL if NSTBrowser is enabled otherwise return CDP URL as it is
@@ -151,15 +162,15 @@ def fetch(self, url: str) -> Response:
else:
if self.stealth:
browser = p.chromium.launch(
headless=self.headless, args=flags, ignore_default_args=['--enable-automation'], chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium'
headless=self.headless, args=flags, ignore_default_args=self.harmful_default_args, chromium_sandbox=True, channel='chrome' if self.real_chrome else 'chromium'
)
else:
browser = p.chromium.launch(headless=self.headless, ignore_default_args=['--enable-automation'], channel='chrome' if self.real_chrome else 'chromium')
browser = p.chromium.launch(headless=self.headless, ignore_default_args=self.harmful_default_args, channel='chrome' if self.real_chrome else 'chromium')

# Creating the context
if self.stealth:
context = browser.new_context(
locale='en-US',
locale=self.locale,
is_mobile=False,
has_touch=False,
proxy=self.proxy,
@@ -176,6 +187,8 @@ )
)
else:
context = browser.new_context(
locale=self.locale,
proxy=self.proxy,
color_scheme='dark',
user_agent=useragent,
device_scale_factor=2,
@@ -221,6 +234,11 @@ def fetch(self, url: str) -> Response:
if self.wait_selector and type(self.wait_selector) is str:
waiter = page.locator(self.wait_selector)
waiter.first.wait_for(state=self.wait_selector_state)
# Wait again after waiting for the selector, helpful with protections like Cloudflare
page.wait_for_load_state(state="load")
page.wait_for_load_state(state="domcontentloaded")
if self.network_idle:
page.wait_for_load_state('networkidle')

# This will be parsed inside `Response`
encoding = res.headers.get('content-type', '') or 'utf-8' # default encoding
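The new `locale` argument is validated with `check_type_validity(locale, [str], 'en-US', param_name='locale')`. A rough stand-in showing the fall-back-to-default behavior that call implies — the real helper lives elsewhere in Scrapling and may report the problem differently:

```python
from typing import Any, List

def check_type_validity(value: Any, allowed_types: List[type], default: Any, param_name: str = "") -> Any:
    """Hypothetical sketch: return `value` if its type is allowed,
    otherwise fall back to `default` (not Scrapling's real helper)."""
    if any(isinstance(value, t) for t in allowed_types):
        return value
    print(f"Invalid type for '{param_name}', falling back to {default!r}")
    return default

print(check_type_validity("de-DE", [str], "en-US", param_name="locale"))
# -> de-DE
```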
