Metadata-Version: 2.1
Name: htmldate
Version: 1.9.3
Summary: Fast and robust extraction of original and updated publication dates from URLs and web pages.
Author-email: Adrien Barbaresi <barbaresi@bbaw.de>
License: Apache 2.0
Project-URL: Homepage, https://htmldate.readthedocs.io
Project-URL: Source, https://github.com/adbar/htmldate
Project-URL: Blog, https://adrien.barbaresi.eu/blog/
Project-URL: Tracker, https://github.com/adbar/htmldate/issues
Keywords: datetime,date-parser,entity-extraction,html-extraction,html-parsing,metadata-extraction,webarchives,web-scraping
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Text Processing :: Markup :: HTML
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: charset-normalizer>=3.4.0
Requires-Dist: dateparser>=1.1.2
Requires-Dist: python-dateutil>=2.9.0.post0
Requires-Dist: urllib3<3,>=1.26
Requires-Dist: lxml<6,>=5.3.0; platform_system != "Darwin" or python_version > "3.8"
Requires-Dist: lxml==4.9.2; platform_system == "Darwin" and python_version <= "3.8"
Provides-Extra: all
Requires-Dist: htmldate[dev]; extra == "all"
Requires-Dist: htmldate[speed]; extra == "all"
Provides-Extra: dev
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: types-dateparser; extra == "dev"
Requires-Dist: types-python-dateutil; extra == "dev"
Requires-Dist: types-lxml; extra == "dev"
Requires-Dist: types-urllib3; extra == "dev"
Provides-Extra: speed
Requires-Dist: faust-cchardet>=2.1.19; extra == "speed"
Requires-Dist: urllib3[brotli]; extra == "speed"
Requires-Dist: backports-datetime-fromisoformat; python_version < "3.11" and extra == "speed"

# Htmldate: Find the Publication Date of Web Pages

[![Python package](https://img.shields.io/pypi/v/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Python versions](https://img.shields.io/pypi/pyversions/htmldate.svg)](https://pypi.python.org/pypi/htmldate)
[![Documentation Status](https://readthedocs.org/projects/htmldate/badge/?version=latest)](https://htmldate.readthedocs.org/en/latest/?badge=latest)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/htmldate.svg)](https://codecov.io/gh/adbar/htmldate)
[![Downloads](https://img.shields.io/pypi/dm/htmldate?color=informational)](https://pepy.tech/project/htmldate)
[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)

<br/>

<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-logo.png" alt="Htmldate Logo" align="center" width="60%"/>

<br/>

Find **original and updated publication dates** of any web page.
This is often not possible using the URL or the server response alone.

**On the command-line or with Python**, all the steps needed from web page
download to HTML parsing, scraping, and text analysis are included.

The package is used in production on millions of documents and integrated into
[thousands of projects](https://github.com/adbar/htmldate/network/dependents).


## In a nutshell

<br/>

<img src="https://raw.githubusercontent.com/adbar/htmldate/master/docs/htmldate-demo.gif" alt="Demo as GIF image" align="center" width="80%"/>

<br/>

### With Python

``` python
>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
```

### On the command-line

``` bash
$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'
```

## Features

-   Flexible input: URLs, HTML files, or HTML trees can be used as input
    (including batch processing).
-   Customizable output: Any date format (defaults to [ISO 8601
    YMD](https://en.wikipedia.org/wiki/ISO_8601)).
-   Detection of both original and updated dates.
-   Multilingual.
-   Compatible with all recent versions of Python.

### How it works

Htmldate operates by sifting through HTML markup and, if necessary, text
elements. It features the following heuristics:

1.  **Markup in header**: Common patterns are used to identify relevant
    elements (e.g. `link` and `meta` elements) including [Open Graph
    protocol](http://ogp.me/) attributes.
2.  **HTML code**: The whole document is searched for structural markers
    like `abbr` or `time` elements and a series of attributes (e.g.
    `postmetadata`).
3.  **Bare HTML content**: Heuristics are run on text and markup:
    -   In `fast` mode the HTML page is cleaned and precise patterns are
        targeted.
    -   In `extensive` mode all potential dates are collected and a
        disambiguation algorithm determines the best one.
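
The three stages above can be pictured as a cascade that stops at the first
stage yielding a date. The following is a minimal sketch using only the
standard library. It is not htmldate's actual implementation: the class
`DateCollector`, the function `guess_date`, and the small set of `meta` names
are assumptions chosen for illustration, and picking the most frequent
candidate is only a stand-in for the real disambiguation algorithm.

``` python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately small set of header attributes to look for.
META_KEYS = {"article:published_time", "og:published_time", "date", "dc.date"}
# Matches YYYY-MM-DD, also when followed by a time part such as "T10:00:00".
ISO_DATE = re.compile(r"(?<!\d)(\d{4}-\d{2}-\d{2})(?!\d)")


class DateCollector(HTMLParser):
    """Collect date candidates from header markup, time/abbr elements and text."""

    def __init__(self):
        super().__init__()
        self.header, self.markup, self.text = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in META_KEYS:
                self.header += ISO_DATE.findall(attrs.get("content") or "")
        elif tag in ("time", "abbr"):
            value = attrs.get("datetime") or attrs.get("title") or ""
            self.markup += ISO_DATE.findall(value)

    def handle_data(self, data):
        self.text += ISO_DATE.findall(data)


def guess_date(html, extensive=False):
    """Return the first date found by the cascade, or None."""
    collector = DateCollector()
    collector.feed(html)
    # Stages 1 and 2: structured markup is trusted before free text.
    for stage in (collector.header, collector.markup):
        if stage:
            return stage[0]
    if not collector.text:
        return None
    if extensive:
        # Stand-in for disambiguation: pick the most frequent candidate.
        return Counter(collector.text).most_common(1)[0][0]
    # Stand-in for the fast mode: take the first matching pattern.
    return collector.text[0]
```

Calling `guess_date(html)` on a page with a matching `meta` element returns
that date without ever looking at the body text, which is the main reason a
cascade of this kind stays fast on well-formed pages.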

Finally, the output is validated and converted to the chosen format.
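
As a rough illustration of this last step, the sketch below checks that a
candidate is a real calendar date within plausible bounds and then renders it
with a `strftime` pattern. The function name `validate_and_format`, its
parameters, and the default bounds are assumptions made for this example and
do not reflect htmldate's internals.

``` python
from datetime import date, datetime


def validate_and_format(candidate, output_format="%Y-%m-%d",
                        earliest=date(1995, 1, 1), latest=None):
    """Return the candidate in the requested format, or None if implausible."""
    latest = latest or date.today()
    try:
        parsed = datetime.strptime(candidate, "%Y-%m-%d").date()
    except ValueError:
        # Not a real calendar date (e.g. month 13 or February 30).
        return None
    if not earliest <= parsed <= latest:
        # Outside the accepted time window.
        return None
    return parsed.strftime(output_format)
```

Rejecting out-of-range candidates before formatting keeps obviously wrong
matches, such as dates in the future, out of the final output.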

## Performance

Evaluated on 1000 web pages containing identifiable dates (as of 2023-11-13, on Python 3.10):

| Python Package | Precision | Recall | Accuracy | F-Score | Time |
| -------------- | --------- | ------ | -------- | ------- | ---- |
| articleDateExtractor 0.20 | 0.803 | 0.734 | 0.622 | 0.767 | 5x |
| date_guesser 2.1.4 | 0.781 | 0.600 | 0.514 | 0.679 | 18x |
| goose3 3.1.17 | 0.869 | 0.532 | 0.493 | 0.660 | 15x |
| htmldate\[all\] 1.6.0 (fast) | **0.883** | 0.924 | 0.823 | 0.903 | **1x** |
| htmldate\[all\] 1.6.0 (extensive) | 0.870 | **0.993** | **0.865** | **0.928** | 1.7x |
| newspaper3k 0.2.8 | 0.769 | 0.667 | 0.556 | 0.715 | 15x |
| news-please 1.5.35 | 0.801 | 0.768 | 0.645 | 0.784 | 34x |

For the complete results and explanations, see the [evaluation
page](https://htmldate.readthedocs.io/en/latest/evaluation.html).
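
Assuming the F-Score column is the balanced F1 measure (the harmonic mean of
precision and recall), its values can be recomputed from the first two
columns. The short check below is only a sketch under that assumption; the
helper name `f1_score` is illustrative.

``` python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above.
rows = {
    "htmldate[all] 1.6.0 (fast)": (0.883, 0.924),
    "htmldate[all] 1.6.0 (extensive)": (0.870, 0.993),
}
for name, (precision, recall) in rows.items():
    print(f"{name}: {f1_score(precision, recall):.3f}")
```

The recomputed values agree with the table to within 0.001; the remaining
difference comes from working with rounded precision and recall figures.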

## Installation

Htmldate is tested on Linux, macOS and Windows systems and is compatible
with Python 3.8 upwards. It can be installed with `pip` (`pip3`
where applicable) from the PyPI package repository:

-   `pip install htmldate`
-   (optionally) `pip install htmldate[speed]`

The last version to support Python 3.6 and 3.7 is `htmldate==1.8.1`.
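
For environments that must cover several interpreter versions, the
compatibility statement above can be turned into a small helper. This is a
sketch only: the function name `htmldate_requirement` is hypothetical, and
the version boundaries are taken directly from this section.

``` python
import sys


def htmldate_requirement(version_info=sys.version_info):
    """Return a pip requirement string suited to the given Python version."""
    major_minor = tuple(version_info[:2])
    if major_minor >= (3, 8):
        return "htmldate"
    if major_minor in ((3, 6), (3, 7)):
        # Last release stated to support Python 3.6 and 3.7.
        return "htmldate==1.8.1"
    raise RuntimeError("No htmldate release is documented for this Python version")


print(htmldate_requirement())
```

The returned string can be passed to `pip install` as is.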

## Documentation

For more details on installation and on Python and CLI usage, **please refer to
the documentation**:
[htmldate.readthedocs.io](https://htmldate.readthedocs.io/)

## License

This package is distributed under the [Apache 2.0
license](https://www.apache.org/licenses/LICENSE-2.0.html).

Versions prior to v1.8.0 are distributed under the GPLv3+ license.

## Context and contributions

Initially launched to create text databases for research purposes
at the Berlin-Brandenburg Academy of Sciences (DWDS and ZDL units),
this project continues to be maintained but its future development
depends on community support.

**If you value this software or depend on it for your product, consider
sponsoring it and contributing to its codebase**. Your support
will help maintain and enhance this package.
Visit the [Contributing page](https://github.com/adbar/htmldate/blob/master/CONTRIBUTING.md)
for more information.

Reach out via the software repository or the [contact page](https://adrien.barbaresi.eu/)
for inquiries, collaborations, or feedback.

[![JOSS article reference DOI: 10.21105/joss.02439](https://img.shields.io/badge/JOSS-10.21105%2Fjoss.02439-brightgreen)](https://doi.org/10.21105/joss.02439)
[![Zenodo archive DOI: 10.5281/zenodo.3459599](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.3459599-blue)](https://doi.org/10.5281/zenodo.3459599)


``` bibtex
@article{barbaresi-2020-htmldate,
  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {https://doi.org/10.21105/joss.02439},
  publisher = {The Open Journal},
  year = 2020,
}
```

-   Barbaresi, A. "[htmldate: A Python package to extract publication
    dates from web pages](https://doi.org/10.21105/joss.02439)",
    Journal of Open Source Software, 5(51), 2439, 2020. DOI:
    10.21105/joss.02439
-   Barbaresi, A. "[Generic Web Content Extraction with Open-Source
    Software](https://hal.archives-ouvertes.fr/hal-02447264/document)",
    Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
-   Barbaresi, A. "[Efficient construction of metadata-enhanced web
    corpora](https://hal.archives-ouvertes.fr/hal-01371704v2/document)",
    Proceedings of the [10th Web as Corpus Workshop
    (WAC-X)](https://www.sigwac.org.uk/wiki/WAC-X), 2016.


## Acknowledgements

Kudos to the following software libraries:

-   [lxml](http://lxml.de/),
    [dateparser](https://github.com/scrapinghub/dateparser)
-   A few patterns are derived from the
    [python-goose](https://github.com/grangier/python-goose),
    [metascraper](https://github.com/ianstormtaylor/metascraper),
    [newspaper](https://github.com/codelucas/newspaper) and
    [articleDateExtractor](https://github.com/Webhose/article-date-extractor)
    libraries. This module extends their coverage and robustness
    significantly.
