List of smaller extraction bugs (text & metadata) #4
Extraction bugs in text and metadata can be listed here, as in adbar/htmldate#8, where issues specifically related to dates should be reported. For details see below.
Words are getting smashed together on this page: I looked into the extraction code a bit. The date here is inside a span, which gets stripped, and then the date becomes the tail of the header. All of the whitespace (which includes a newline) gets lost, and the tail is appended directly to the header. I'm not sure whether the best fix would be to insert a space between the header node's text and its tail when they get extracted, or to look for newlines in the text and somehow respect them. It looks like some of the lxml code just strips whitespace automatically when you access the "text" and "tail" attributes. I didn't dig into this one, but I'm guessing it's something similar to the first case: the webpage relies on whitespace that gets stripped by the extraction algorithm.
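For illustration, here is a minimal sketch of the smashing effect, using the stdlib ElementTree (which shares lxml's text/tail model); the markup is made up, and the "fix" shown is only one possible strategy:

```python
import xml.etree.ElementTree as ET

node = ET.fromstring('<h3><span>Posted on</span>\n2021-05-01</h3>')
span = node[0]

# naive approach: strip whitespace, then concatenate -> the words get smashed together
smashed = (span.text or '').strip() + (span.tail or '').strip()

# joining the stripped pieces with an explicit space keeps the word boundary
fixed = ' '.join(filter(None, [(span.text or '').strip(), (span.tail or '').strip()]))
```

Here `smashed` is `'Posted on2021-05-01'` while `fixed` keeps the separating space.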
Yes, I think the issues in the document you mention are related to deleted elements.
Hey, this is a great library. I was ready to subscribe to a service just to get what this does for me. For the 30th page I extracted, https://thehill.com/homenews/senate/594044-sen-lujan-to-return-to-senate-in-time-to-vote-for-supreme-court-nominee, Trafilatura 1.0.0 returned only 150 chars of text:
I downloaded the HTML source (lujan.txt) and confirmed it does have the article text in it (starting with "Sen. Ben Ray Luján"). I decided to try the external fallback "Readability". I started Python in my trafilatura container and ran this code:
But that just gave me a bunch of XML/JavaScript that didn't even have the main text in it. Perhaps a fallback could be added so that, when the extracted text is small and there are large continuous blocks of unextracted text, those blocks are grabbed instead?
Hi @karlkovaciny, the cutting-edge version from the repository is slightly better: it outputs the article but still includes garbled JavaScript. That's definitely a case to watch for. EDIT: for the archived version of the page I now get the same problem as you.
Suggested in #208:
Hey @adbar, I'm having a problem with a few publications, like HuffPost, where the metadata is not extracted correctly (see trafilatura/trafilatura/utils.py, line 177 at 168e660). Example: https://bit.ly/3PuvL26
Hi @felipehertzer, I don't think I can reproduce the bug, which metadata fields do you mean exactly? |
Hello, URL for testing: https://orientxxi.info/fa

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://orientxxi.info/fa")
trafilatura.extract(downloaded, output_format="json")
```

I am wondering why the title is not the one provided in the HTML title element:

```json
{"title": "به زبانهای دیگر Yémen. Une paix qui se fait attendre Laurent Bonnefoy · 21 septembre أوسلو، نموذج للفشل دانيال ليفي · 21 أيلول (سبتمبر) موقع “أوريان 21” يدعوكم للاحتفال بعيد ميلاده العاشر! · 20 أيلول (سبتمبر) Petroleum. Turkey vs. Iraq, but the Kurds are Collateral Victims Benoît Drevet · 20 September El doble estándar de Egipto para acoger a sus “huéspedes” sudaneses Séverine Evanno · 1ro de septiembre Khaled El Qaisi, colpevole di Palestina Cecilia Dalla Negra · 18 settembre", "author": null, ...}
```

Thanks!
Hi @kinoute, I think it could be because of a tag mismatch (malformed HTML) just before the text segments. Please note that extraction doesn't work as well on homepages in general.
Hi @adbar, I ran into extraction issues. URL: https://microsoft.github.io/autogen/docs/Use-Cases/enhanced_inference/ Output: `! d o c t y p e h t m l >` I also tested the html2txt feature and it didn't work any better; it gave me this output: `h t m l c l a s s = " d o c s - v e r s i o n - c u r r e n t " l a n g = " e n " d i r = " l t r " >` I run the scraping of the HTML outside of trafilatura and confirmed that we are getting all of the HTML, but it seems something in the HTML trips up the extraction. I used trafilatura.extract() and passed the HTML code as a string into the function. I tested different settings for the favor_recall and favor_precision arguments; they didn't change the output in any significant way. I also tested the trafilatura.baseline() function and it yielded similar results.
@adbar yes, I'm on an M1 MacBook
Did you try building lxml from source?
I can't seem to get it to work. I'm new to this level of system tweaking. The installation fails because of missing precompiled Cython files, and trying to run it with the --without-cython flag also doesn't work.
I think I'll just move the script into a Docker container and see if that helps. |
@adbar Thanks for your answer on my previous case. I have another one! Doing something like:

```python
trafi_extraction = trafilatura.extract(
    response.decode(errors='ignore'),
    output_format='json',
    include_images=False,
    date_extraction_params={
        'extensive_search': True,
        'original_date': True,
        'min_date': EARLIEST_VALID_DATE,
    },
    include_comments=False,
)
trafilatura_data = trafi_extraction and json.loads(trafi_extraction)
```

Returns:
For this given URL: http://sport.kurganobl.ru/8980.html, trafi_extraction contains (I have shortened the raw_text and text fields here, which go on in the same vein; note the invalid `\ǹ` escape in text):

```json
{"title": null, "author": null, "hostname": null, "date": "2016-12-12", "categories": "", "tags": "", "fingerprint": "6920faf8766bf202", "id": null, "license": null, "comments": null, "raw_text": ", 8 . 350 35 . 1000 1000 , . 21 8 . 7 300 , 01:08:25, . ̀ . 1000 1000 div> […]", "text": ", 8 . 350 35 .\n1000 1000 , . 21 8 .\n7 300 , 01:08:25, .\ǹ .\n1000 1000 div>\n[…]", "language": null, "image": null, "pagetype": null, "source": null, "source-hostname": null, "excerpt": null}
```

Edit: right now I am handling this with this method:

```python
def fix_invalid_escapes(self, s):
    # This regex matches a backslash not followed by a valid JSON escape
    return re.sub(r'\\(?![/bfnrt"\\u])', r'\\\\', s)
```

But I think maybe trafilatura could handle this natively? (I'm not even sure my fix is enough/good)
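As a quick sanity check, a standalone version of that workaround can be exercised on a hypothetical sample string containing an invalid escape:

```python
import json
import re

def fix_invalid_escapes(s):
    # double any backslash that is not followed by a valid JSON escape character
    return re.sub(r'\\(?![/bfnrt"\\u])', r'\\\\', s)

broken = '{"text": "bad \\x escape"}'   # \x is not a valid JSON escape
fixed = fix_invalid_escapes(broken)
data = json.loads(fixed)               # parses once the backslash is doubled
```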
Hi @kinoute, there must be something wrong in the way you encode or decode the HTML response; I cannot reproduce the bug:
@sepsi77 Please note that brew can now be used to install Trafilatura on macOS in a seamless way:
Thanks @adbar, using brew to install trafilatura fixed the problem.
Hi there, I'm not sure this is the right thread, but here's the problem I'm having: some sites have more than one `article` element, but the XPath that extracts the text only takes the first one.
Hi @hugoobauer, this problem is also mentioned in #432. The problem with taking all article elements is that sometimes they contain related content rather than the main content (e.g. a list of teasers at the end of a page).
Hi @adbar, I completely agree that this is a misuse of the `article` element. So I made a little POC to test a solution:
```python
for expr in BODY_XPATH:
    # select tree if the expression has been found
    try:
        subtrees = tree.xpath(expr)
        if len(subtrees) > 1:  # and favor_recall=True ?
            new_subtree = Element(subtrees[0].tag)
            for _subtree in subtrees:
                for child in _subtree:
                    # if len(' '.join(child.itertext()).strip()) > MIN_EXTRACTED_SIZE ?
                    new_subtree.append(child)
            subtree = new_subtree
        else:
            subtree = subtrees[0]
    except IndexError:
        continue
```

If there's only one item, the behavior is the same as before. Otherwise, I create a new node with the same tag (article in this case) and insert into it each child of each of the matched nodes. In addition, we could check whether the favor_recall option is enabled, so that this isn't done by default, and use the MIN_EXTRACTED_SIZE value to keep only those elements that are long enough.
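The merging step of the POC can be exercised in isolation with the stdlib ElementTree; the sample document is made up:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<body>'
    '<article><p>Part one.</p></article>'
    '<article><p>Part two.</p></article>'
    '</body>'
)
subtrees = doc.findall('.//article')

# build a new node with the same tag and append every child of every match
merged = ET.Element(subtrees[0].tag)
for subtree in subtrees:
    for child in subtree:
        merged.append(child)
```

With lxml, `append` would move each child out of its original parent, so iterating and appending at the same time deserves care there; the stdlib version simply shares the nodes.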
@hugoobauer Your idea looks good. The length heuristic would have to run on whole subtrees. In any case, feel free to draft a pull request for this or for another issue. You can add a test case somewhere in the unit tests.
Okay, great. I will work on a PR soon.
Hi @adbar, I am having an issue with this URL: https://www.energyvault.com/about#leaders. I am not able to extract the text from it. Here's the code I am using:
When I debugged it a little, I found that it throws an exception with the following traceback:
Let me know if you need any more info.
Hi @Sang12-2017-18, I cannot reproduce the bug as such, but something is odd with this webpage. Do you use the latest versions of the trafilatura and htmldate packages? If so, please file an issue on the htmldate repository.
Hi @adbar, is there an option to bypass metadata extraction?
@Sang12-2017-18 So far there is no such option. I still cannot reproduce the error; how did you get the traceback?
@Sang12-2017-18 the bug is now fixed in htmldate version 1.8.1. As for the option to bypass metadata extraction, I'm going to add it to the to-do list.
The web page I am trying to parse is not a modern one and uses the Russian language. I am not sure whether it is worth the effort to support parsing such pages, but I am reporting it here anyway. Here is a screenshot of the absent content (including content on other tabs) I am interested in. I tried with the CLI:
I have mostly tested trafilatura on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction doesn't work so far. Corresponding bug reports can either be filed as a list in an issue like this one, or in the code as XPath expressions in xpaths.py (see the BODY_XPATH and COMMENTS_XPATH lists). Thanks!