Introduction
************


Quick Start
===========

1. Run "urlwatch" once to migrate your old data or start fresh

2. Use "urlwatch --edit" to customize jobs and filters ("urls.yaml")

3. Use "urlwatch --edit-config" to customize settings and reporters
   ("urlwatch.yaml")

4. Add "urlwatch" to your crontab ("crontab -e") to monitor webpages
   periodically

The checking interval is defined by how often you run "urlwatch". You
can use e.g. crontab.guru to figure out the schedule expression for
the checking interval; we recommend checking no more often than every
30 minutes (this would be "*/30 * * * *"). If you have never used
cron before, check out the crontab command help.
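
For example, a crontab entry for a 30-minute interval might look like
this (a minimal sketch, assuming "urlwatch" is on your "PATH";
otherwise use the full path to the executable):

   # run urlwatch every 30 minutes
   */30 * * * * urlwatch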

On Windows, "cron" is not installed by default. Use the Windows Task
Scheduler instead, or see this StackOverflow question for
alternatives.
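
As one possible alternative on Windows, a scheduled task can be
created with "schtasks" (the task name and interval below are
illustrative assumptions; adjust them to your needs):

   :: create a task named "urlwatch" that runs every 30 minutes
   :: (assumes urlwatch is on PATH; otherwise give the full path)
   schtasks /Create /SC MINUTE /MO 30 /TN urlwatch /TR urlwatch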


How it works
============

Every time you run *urlwatch(1)*, it:

* retrieves the output of each job and filters it

* compares it with the version retrieved the previous time (“diffing”)

* if it finds any differences, it invokes enabled reporters (e.g. text
  reporter, e-mail reporter, …) to notify you of the changes


Jobs and Filters
================

Each website or shell command to be monitored constitutes a “job”.

The instructions for each such job are contained in a configuration
file in YAML format. If you have more than one job, separate the jobs
with a line containing only "---".
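
As a minimal sketch (the names and URLs below are placeholders), a
job file with two jobs could look like this:

   name: "Example homepage"
   url: "https://example.com/"
   ---
   name: "Example news page"
   url: "https://example.com/news"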

You can edit the job and filter configuration file using:

   urlwatch --edit

If you get an error, set your "$EDITOR" (or "$VISUAL") environment
variable in your shell, for example:

   export EDITOR=/bin/nano

While you can edit the YAML file manually, using "--edit" will do
sanity checks before activating the new configuration file.


Kinds of Jobs
-------------

Each job must have exactly one of the following keys, which also
defines the kind of job:

* "url" retrieves what is served by the web server (HTTP GET by
  default),

* "navigate" uses a headless browser to load web pages requiring
  JavaScript, and

* "command" runs a shell command.

Each job can have an optional "name" key to define a user-visible name
for the job.

You can then use optional keys to fine-tune each job's parameters.
For an example of each kind of job, see the sketch below.
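
For illustration (the URLs and the command below are placeholders,
not defaults), one job of each kind might be defined as follows:

   name: "Plain HTTP fetch"
   url: "https://example.com/status"
   ---
   name: "Page that needs JavaScript"
   navigate: "https://example.com/app"
   ---
   name: "Local disk usage"
   command: "df -h /"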


Filters
-------

You may use the "filter" key to select one or more Filters to apply to
the data after it is retrieved, for example to:

* select HTML: "css", "xpath", "element-by-class", "element-by-id",
  "element-by-style", "element-by-tag"

* make HTML more readable: "html2text", "beautify"

* make PDFs readable: "pdf2text"

* make JSON more readable: "format-json"

* make iCal more readable: "ical2text"

* make binary readable: "hexdump"

* just detect changes: "sha1sum"

* edit text: "grep", "grepi", "strip", "sort", "striplines"

These filters can be chained. As an example, after retrieving an HTML
document by using the "url" key, you can extract a selection with the
"xpath" filter, convert this to text with "html2text", use "grep" to
extract only lines matching a specific regular expression, and then
"sort" them:

   name: "Sample urlwatch job definition"
   url: "https://example.dummy/"
   https_proxy: "http://dummy.proxy/"
   max_tries: 2
   filter:
     - xpath: '//section[@role="main"]'
     - html2text:
         method: pyhtml2text
         unicode_snob: true
         body_width: 0
         inline_links: false
         ignore_links: true
         ignore_images: true
         pad_tables: false
         single_line_break: true
     - grep: "lines I care about"
     - sort:
   ---


Reporters
=========

*urlwatch* can be configured to do something with its report besides
(or in addition to) the default of displaying it on the console.

Reporters are configured in the global configuration file:

   urlwatch --edit-config

Examples of reporters:

* "email" (using SMTP)

* email using "mailgun"

* "slack"

* "discord"

* "pushbullet"

* "telegram"

* "matrix"

* "pushover"

* "stdout"

* "xmpp"

* "shell"
