add option to --no-check-certs use at own risk (#89)
* add option to --no-check-certs use at own risk
* support to target single file, and HEAD request
* bug: update selenium to use newer interface

The current failures are a result of an update to selenium,
so the instantiation of our driver fails, returns as None,
and then all the requests are done with only requests. As
the web matures (and sites do not want scraping) it is less
likely this approach will work - we need the driver. This
change will update the selenium UI to ensure the driver works
and restore functionality. I will follow up with any tweaks
needed for the CI (working locally for me).

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch authored Feb 3, 2024
1 parent 7dbd7ac commit d0e7560
Showing 19 changed files with 156 additions and 96 deletions.
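
The commit message attributes the failures to the driver failing to instantiate and coming back as None, after which every check silently falls back to plain requests. A minimal sketch of that pattern under the Selenium 4 interface, where the driver location is passed through a `Service` object, might look like the following. The function name `get_driver` and the default path are hypothetical and not taken from the urlchecker source; the actual driver code is in a file not shown in this part of the diff.

```python
from typing import Any, Optional


def get_driver(driver_path: str = "./chromedriver") -> Optional[Any]:
    """Return a headless Chrome driver, or None if one cannot be created.

    Hypothetical helper for illustration; not the urlchecker implementation.
    """
    try:
        # Imported lazily so a caller can still fall back to plain
        # HTTP requests when selenium is not installed.
        from selenium import webdriver
        from selenium.webdriver.chrome.service import Service

        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # Selenium 4 takes the driver location via a Service object.
        service = Service(executable_path=driver_path)
        return webdriver.Chrome(service=service, options=options)
    except Exception as exc:  # missing package, driver binary, or browser
        print("Could not start web driver: %s" % exc)
        return None
```

Printing the exception before returning None makes the fallback visible, which is the failure mode the message describes as having gone unnoticed.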
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -15,7 +15,7 @@ jobs:
name: Build Container
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4

- name: Build
run: |
13 changes: 7 additions & 6 deletions .github/workflows/test.yml
@@ -11,7 +11,7 @@ jobs:
formatting:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4

- name: Setup black environment
run: conda create --quiet --name black pyflakes
@@ -28,7 +28,7 @@ jobs:
needs: formatting
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4

- name: Setup mypy environment
run: conda create --quiet --name type_checking mypy
@@ -45,15 +45,16 @@ jobs:
needs: type_checking
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Setup testing environment
run: conda create --quiet --name testing pytest

- name: Download ChromeDriver
run: |
wget https://chromedriver.storage.googleapis.com/107.0.5304.18/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
rm chromedriver_linux64.zip
# Note if you use locally, must match
wget https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/121.0.6167.85/linux64/chromedriver-linux64.zip
unzip chromedriver-linux64.zip
rm chromedriver-linux64.zip
- name: Test
run: |
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ and **Merged pull requests**. Critical items to know are:
Referenced versions in headers are tagged on Github, in parentheses are for pypi.

## [vxx](https://github.com/urlstechie/urlschecker-python/tree/master) (master)
- allow variable to skip checking certificates (0.0.35)
- switch back to pypi release of fake-useragent (0.0.34)
- preparing to install from git for fake-useragent (0.0.33)
- serial option for debugging (0.0.32)
6 changes: 3 additions & 3 deletions Dockerfile
@@ -26,9 +26,9 @@ RUN /bin/bash -c "source activate urlchecker && \
pip install --upgrade certifi && \
pip install .[all]"
# Download chrome driver for selenium
RUN /bin/bash -c "wget https://chromedriver.storage.googleapis.com/107.0.5304.18/chromedriver_linux64.zip && \
unzip chromedriver_linux64.zip && \
rm chromedriver_linux64.zip"
RUN /bin/bash -c "wget https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/121.0.6167.85/linux64/chromedriver-linux64.zip && \
unzip -o chromedriver-linux64.zip && \
rm chromedriver-linux64.zip"
RUN echo "source activate urlchecker" > ~/.bashrc
ENV PATH /code:/opt/conda/envs/urlchecker/bin:${PATH}
ENTRYPOINT ["urlchecker"]
47 changes: 20 additions & 27 deletions README.md
@@ -56,49 +56,42 @@ for files. In this case, you can use urlchecker check:

```bash
$ urlchecker check --help
usage: urlchecker check [-h] [-b BRANCH] [--subfolder SUBFOLDER] [--cleanup]
[--force-pass] [--no-print] [--file-types FILE_TYPES]
[--files FILES] [--exclude-urls EXCLUDE_URLS]
[--exclude-patterns EXCLUDE_PATTERNS]
[--exclude-files EXCLUDE_FILES] [--save SAVE]
[--retry-count RETRY_COUNT] [--timeout TIMEOUT]
```
```console
usage: urlchecker check [-h] [-b BRANCH] [--subfolder SUBFOLDER] [--cleanup] [--serial] [--no-check-certs]
[--force-pass] [--no-print] [--verbose] [--file-types FILE_TYPES] [--files FILES]
[--exclude-urls EXCLUDE_URLS] [--exclude-patterns EXCLUDE_PATTERNS]
[--exclude-files EXCLUDE_FILES] [--save SAVE] [--retry-count RETRY_COUNT] [--timeout TIMEOUT]
path

positional arguments:
path the local path or GitHub repository to clone and check

optional arguments:
options:
-h, --help show this help message and exit
-b BRANCH, --branch BRANCH
if cloning, specify a branch to use (defaults to
master)
if cloning, specify a branch to use (defaults to main)
--subfolder SUBFOLDER
relative subfolder path within path (if not specified,
we use root)
--cleanup remove root folder after checking (defaults to False,
no cleanup)
--force-pass force successful pass (return code 0) regardless of
result
--no-print Skip printing results to the screen (defaults to
printing to console).
relative subfolder path within path (if not specified, we use root)
--cleanup remove root folder after checking (defaults to False, no cleanup)
--serial run checks in serial (no multiprocess)
--no-check-certs Allow urls to validate that fail certificate checks
--force-pass force successful pass (return code 0) regardless of result
--no-print Skip printing results to the screen (defaults to printing to console).
--verbose Print file names for failed urls in addition to the urls.
--file-types FILE_TYPES
comma separated list of file extensions to check
(defaults to .md,.py)
--files FILES comma separated list of exact files or patterns to
check.
comma separated list of file extensions to check (defaults to .md,.py)
--files FILES comma separated list of exact files or patterns to check.
--exclude-urls EXCLUDE_URLS
comma separated links to exclude (no spaces)
--exclude-patterns EXCLUDE_PATTERNS
comma separated list of patterns to exclude (no
spaces)
comma separated list of patterns to exclude (no spaces)
--exclude-files EXCLUDE_FILES
comma separated list of files and patterns to exclude
(no spaces)
comma separated list of files and patterns to exclude (no spaces)
--save SAVE Path to a csv file to save results to.
--retry-count RETRY_COUNT
retry count upon failure (defaults to 2, one retry).
--timeout TIMEOUT timeout (seconds) to provide to the requests library
(defaults to 5)
--timeout TIMEOUT timeout (seconds) to provide to the requests library (defaults to 5)
```

You have a lot of flexibility to define patterns of urls or files to skip,
2 changes: 0 additions & 2 deletions setup.py
@@ -67,7 +67,6 @@ def get_reqs(lookup=None, key="INSTALL_REQUIRES"):
INSTALL_REQUIRES = get_reqs(lookup)
TESTS_REQUIRES = get_reqs(lookup, "TESTS_REQUIRES")
INSTALL_REQUIRES_ALL = get_reqs(lookup, "INSTALL_REQUIRES_ALL")
SELENIUM_REQUIRES = get_reqs(lookup, "SELENIUM_REQUIRES")

setup(
name=NAME,
@@ -90,7 +89,6 @@ def get_reqs(lookup=None, key="INSTALL_REQUIRES"):
tests_require=TESTS_REQUIRES,
extras_require={
"all": INSTALL_REQUIRES_ALL,
"selenium": SELENIUM_REQUIRES,
},
classifiers=[
"Intended Audience :: Developers",
9 changes: 8 additions & 1 deletion urlchecker/client/__init__.py
@@ -2,7 +2,7 @@

"""
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
@@ -76,6 +76,13 @@ def get_parser():
default=False,
action="store_true",
)
check.add_argument(
"--no-check-certs",
dest="no_check_certs",
help="Allow urls to validate that fail certificate checks",
default=False,
action="store_true",
)
check.add_argument(
"--force-pass",
help="force successful pass (return code 0) regardless of result",
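
The new flag is a plain `store_true` option, so it is False unless passed. A reduced sketch of how it parses is below; the `--no-check-certs` argument is copied from the hunk above, while the surrounding parser is trimmed to a single positional `path` for illustration and is not the full urlchecker parser.

```python
import argparse


def get_parser() -> argparse.ArgumentParser:
    # Trimmed-down stand-in for the real parser: one subcommand, one flag.
    parser = argparse.ArgumentParser(prog="urlchecker")
    subparsers = parser.add_subparsers(dest="command")
    check = subparsers.add_parser("check")
    check.add_argument("path", help="the local path to check")
    check.add_argument(
        "--no-check-certs",
        dest="no_check_certs",
        help="Allow urls to validate that fail certificate checks",
        default=False,
        action="store_true",
    )
    return parser


parser = get_parser()
print(parser.parse_args(["check", "."]).no_check_certs)  # False
print(parser.parse_args(["check", "--no-check-certs", "."]).no_check_certs)  # True
```

Because the default is False, existing invocations keep verifying certificates; skipping verification is strictly opt-in.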
5 changes: 3 additions & 2 deletions urlchecker/client/check.py
@@ -1,7 +1,6 @@
"""
client/github.py: entrypoint for interaction with a GitHub repository.
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
@@ -73,6 +72,7 @@ def main(args, extra):
print(" urls excluded: %s" % exclude_urls)
print(" url patterns excluded: %s" % exclude_patterns)
print(" file patterns excluded: %s" % exclude_files)
print(" no check certs: %s" % args.no_check_certs)
print(" force pass: %s" % args.force_pass)
print(" retry count: %s" % args.retry_count)
print(" save: %s" % args.save)
@@ -90,6 +90,7 @@ def main(args, extra):
check_results = checker.run(
exclude_urls=exclude_urls,
exclude_patterns=exclude_patterns,
no_check_certs=args.no_check_certs,
retry_count=args.retry_count,
timeout=args.timeout,
)
40 changes: 24 additions & 16 deletions urlchecker/core/check.py
@@ -1,6 +1,6 @@
"""
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
@@ -12,7 +12,7 @@
import random
import re
import sys
from typing import Dict, List
from typing import Optional, Dict, List

from urlchecker.core import fileproc
from urlchecker.core.urlproc import UrlCheckResult
@@ -27,11 +27,11 @@ class UrlChecker:

def __init__(
self,
path: str = None,
file_types: List[str] = None,
exclude_files: List[str] = None,
path: Optional[str] = None,
file_types: Optional[List[str]] = None,
exclude_files: Optional[List[str]] = None,
print_all: bool = True,
include_patterns: List[str] = None,
include_patterns: Optional[List[str]] = None,
serial: bool = False,
):
"""
@@ -73,12 +73,16 @@ def __init__(
if not os.path.exists(path):
sys.exit("%s does not exist." % path)

self.file_paths = fileproc.get_file_paths(
base_path=path,
file_types=self.file_types,
exclude_files=self.exclude_files,
include_patterns=self.include_patterns,
)
# Case 1: a single file
if os.path.isfile(path):
self.file_paths = [os.path.abspath(path)]
else:
self.file_paths = fileproc.get_file_paths(
base_path=path,
file_types=self.file_types,
exclude_files=self.exclude_files,
include_patterns=self.include_patterns,
)

def __str__(self) -> str:
if self.path:
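
The hunk above adds a single-file branch ahead of the directory walk. A standalone sketch of that branching is given here. The helper name `resolve_file_paths` is hypothetical, and the directory branch is a simplified stand-in for `fileproc.get_file_paths` that filters only on extension and ignores the include and exclude patterns.

```python
import os
from typing import List


def resolve_file_paths(path: str, file_types: List[str]) -> List[str]:
    """Return the files to check for a path that is a file or a directory."""
    # Case 1: a single file is used as-is, without filtering on extension.
    if os.path.isfile(path):
        return [os.path.abspath(path)]

    # Case 2: a directory is walked and filtered by extension.
    found = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            if any(name.endswith(ext) for ext in file_types):
                found.append(os.path.abspath(os.path.join(root, name)))
    return sorted(found)
```

As in the diff, a file passed explicitly bypasses the `file_types` filter, so `urlchecker check notes.txt` would be checked even though `.txt` is not in the default extensions.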
@@ -92,7 +96,7 @@ def save_results(
self,
file_path: str,
sep: str = ",",
header: List[str] = None,
header: Optional[List[str]] = None,
relative_paths: bool = True,
) -> str:
"""
@@ -161,11 +165,12 @@ def save_results(

def run(
self,
file_paths: List[str] = None,
exclude_patterns: List[str] = None,
exclude_urls: List[str] = None,
file_paths: Optional[List[str]] = None,
exclude_patterns: Optional[List[str]] = None,
exclude_urls: Optional[List[str]] = None,
retry_count: int = 2,
timeout: int = 5,
no_check_certs: bool = False,
) -> Dict[str, set]:
"""
Run the url checker given a path, excluded patterns for urls/files
@@ -179,6 +184,7 @@ def run(
- exclude_patterns (list) : list of excluded patterns for urls.
- retry_count (int) : number of retries on failed first check. Default=2.
- timeout (int) : timeout to use when waiting on check feedback. Default=5.
- no_check_certs (bool) : do not check certificates
Returns:
dictionary with each of list of urls for "failed" and "passed."
@@ -210,6 +216,7 @@ def run(
kwargs = {
"file_name": file_name,
"exclude_patterns": exclude_patterns,
"no_check_certs": no_check_certs,
"exclude_urls": exclude_urls,
"print_all": self.print_all,
"retry_count": retry_count,
@@ -257,6 +264,7 @@ def check_task(*args, **kwargs):
retry_count=kwargs.get("retry_count", 2),
timeout=kwargs.get("timeout", 5),
port=kwargs.get("port"),
no_check_certs=kwargs.get("no_check_certs"),
)

# Update flattened results
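
The hunks above only thread `no_check_certs` from `run` down to the per-URL check; the code that consumes it is not visible here. As an illustration of what such a switch typically does at the request layer, the sketch below uses only the standard library. The helper `build_ssl_context` is hypothetical and is not how urlchecker necessarily implements the option; with the requests library, the analogous switch would be the `verify=False` argument.

```python
import ssl


def build_ssl_context(no_check_certs: bool = False) -> ssl.SSLContext:
    """Return a TLS context, optionally with certificate checks disabled."""
    context = ssl.create_default_context()
    if no_check_certs:
        # Order matters: hostname checking must be turned off before
        # verify_mode can be lowered to CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

A context built this way can be passed as `urllib.request.urlopen(url, context=build_ssl_context(True), timeout=5)`. Disabling verification removes protection against interception, which is why the commit title labels the option as use at own risk.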
8 changes: 5 additions & 3 deletions urlchecker/core/exclude.py
@@ -1,17 +1,19 @@
"""
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
"""

from typing import List
from typing import Optional, List


def excluded(
url: str, exclude_urls: List[str] = None, exclude_patterns: List[str] = None
url: str,
exclude_urls: Optional[List[str]] = None,
exclude_patterns: Optional[List[str]] = None,
) -> bool:
"""
Check if link is in the excluded URLs or patterns to ignore.
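
Most of the signature changes in this commit replace `List[str] = None` with `Optional[List[str]] = None`. The annotation `List[str]` does not admit None, and type checkers such as mypy can be configured to reject that implicit Optional, so the explicit form keeps the `type_checking` job passing. A sketch of the pattern follows; the signature is copied from the hunk above, while the body is a simplified assumption (exact match for URLs, substring match for patterns) and not the real `excluded` implementation.

```python
from typing import List, Optional


def excluded(
    url: str,
    exclude_urls: Optional[List[str]] = None,
    exclude_patterns: Optional[List[str]] = None,
) -> bool:
    """Return True if the url should be skipped (simplified illustration)."""
    # Normalize None to an empty list so the checks below need no branching.
    exclude_urls = exclude_urls or []
    exclude_patterns = exclude_patterns or []
    if url in exclude_urls:
        return True
    return any(pattern in url for pattern in exclude_patterns)
```

Using None as the default and normalizing inside the function also avoids sharing a mutable default list between calls.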
12 changes: 6 additions & 6 deletions urlchecker/core/fileproc.py
@@ -1,6 +1,6 @@
"""
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.
@@ -10,7 +10,7 @@
import fnmatch
import os
import re
from typing import List
from typing import Optional, List

from urlchecker.core import urlmarker

@@ -43,8 +43,8 @@ def check_file_type(file_path: str, file_types: List[str]) -> bool:

def include_file(
file_path: str,
exclude_patterns: List[str] = None,
include_patterns: List[str] = None,
exclude_patterns: Optional[List[str]] = None,
include_patterns: Optional[List[str]] = None,
) -> bool:
"""
Check a file path for inclusion based on an OR regular expression.
@@ -86,8 +86,8 @@ def include_file(
def get_file_paths(
base_path: str,
file_types: List[str],
exclude_files: List[str] = None,
include_patterns: List[str] = None,
exclude_files: Optional[List[str]] = None,
include_patterns: Optional[List[str]] = None,
) -> List[str]:
"""
Get paths to all files under a given directory and its subfolders.
2 changes: 1 addition & 1 deletion urlchecker/core/urlmarker.py
@@ -4,7 +4,7 @@
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
https://gist.github.com/gruber/8891611
Copyright (c) 2020-2022 Ayoub Malek and Vanessa Sochat
Copyright (c) 2020-2024 Ayoub Malek and Vanessa Sochat
This source code is licensed under the terms of the MIT license.
For a copy, see <https://opensource.org/licenses/MIT>.