Metadata-Version: 2.1
Name: html2json
Version: 0.2.4.1
Summary: Parsing HTML to JSON
Home-page: https://github.com/chuanconggao/html2json
Download-URL: https://github.com/chuanconggao/html2json/tarball/0.2.4.1
Author: Chuancong Gao
Author-email: chuancong@gmail.com
License: MIT
Keywords: parser,html,json
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
License-File: LICENSE

[![PyPi version](https://img.shields.io/pypi/v/html2json.svg)](https://pypi.python.org/pypi/html2json/)
[![PyPi pyversions](https://img.shields.io/pypi/pyversions/html2json.svg)](https://pypi.python.org/pypi/html2json/)
[![PyPi license](https://img.shields.io/pypi/l/html2json.svg)](https://pypi.python.org/pypi/html2json/)

Convert a HTML webpage to JSON data using a template defined in JSON.

Installation
----

This package is available on PyPi. Just use `pip install -U html2json` to install it. Then you can import it using `from html2json import collect`.

API
----

The method is `collect(html, template)`. `html` is the HTML of page loaded as string, and `template` is the JSON of template loaded as Python objects.

Note that the HTML must contain the root node, like `<html>...</html>` or `<div>...</div>`.

Template Syntax
----

- The basic syntax is `keyName: [selector, attr, [listOfRegexes]]`.
    1. `selector` is a CSS selector (supported by [lxml](http://lxml.de/)).
        - When the selector is `null`, the root node itself is matched.
        - When the selector cannot be matched, `null` is returned.
    2. `attr` matches the attribute value. It can be `null` to match either the inner text or the outer text when the inner text is empty.
    3. The list of regexes `[listOfRegexes]` supports two forms of regex operations. The operations with in the list are executed sequentially.
        - Replacement: `s/regex/replacement/g`. `g` is optional for multiple replacements.
        - Extraction: `/regex/`.

For example:

```json
{
    "Color": ["head link:nth-of-type(1)", "href", ["/\\w+(?=\\.css)/"]],
}
```

- As JSON, nested structure can be easily constructed.

```json
{
    "Cover": {
        "URL": [".cover img", "src", []],
        "Number of Favorites": [".cover .favorites", "value", []]
    },
}
```

- An alternative simplified syntax `keyName: [subRoot, subTemplate]` can be used.
    1. `subRoot` a CSS selector of the new root for each sub entry.
    2. `subTemplate` is a sub-template for each entry, recursively.

For example, the previous example can be simplified as follow.

```json
{
    "Cover": [".cover", {
        "URL": ["img", "src", []],
        "Number of Favorites": [".favorites", "value", []]
    }],
}
```

- To extract a list of sub-entries following the same sub-template, the list syntax is `keyName: [[subRoot, subTemplate]]`. Please note the difference (surrounding `[` and `]`) from the previous syntax above.
    1. `subRoot` is the CSS selector of the new root for each sub entry.
    2. `subTemplate` is the sub-template for each entry, recursively.

For example:

```json
{
    "Comments": [[".comments", {
        "From": [".from", null, []],
        "Content": [".content", null, []],
        "Photos": [["img", {
            "URL": ["", "src", []]
        }]]
    }]]
}
```
