From 8e8028b7860ebb09eae92dcd43b4b6916d26d4d6 Mon Sep 17 00:00:00 2001 From: hleskien <34342248+hleskien@users.noreply.github.com> Date: Sat, 10 Feb 2024 04:42:22 +0100 Subject: Adopt WebDriverAbstract as a solution for active (JavaScript) websites (#3971) * first working version --------- Co-authored-by: Dag --- docs/05_Bridge_API/04_WebDriverAbstract.md | 83 +++++++++++++++ docs/05_Bridge_API/04_XPathAbstract.md | 160 ----------------------------- docs/05_Bridge_API/05_XPathAbstract.md | 160 +++++++++++++++++++++++++++++ docs/05_Bridge_API/index.md | 3 +- 4 files changed, 245 insertions(+), 161 deletions(-) create mode 100644 docs/05_Bridge_API/04_WebDriverAbstract.md delete mode 100644 docs/05_Bridge_API/04_XPathAbstract.md create mode 100644 docs/05_Bridge_API/05_XPathAbstract.md (limited to 'docs/05_Bridge_API') diff --git a/docs/05_Bridge_API/04_WebDriverAbstract.md b/docs/05_Bridge_API/04_WebDriverAbstract.md new file mode 100644 index 00000000..60b5e99d --- /dev/null +++ b/docs/05_Bridge_API/04_WebDriverAbstract.md @@ -0,0 +1,83 @@ +`WebDriverAbstract` extends [`BridgeAbstract`](./02_BridgeAbstract.md) and adds functionality for generating feeds +from active websites that use XMLHttpRequest (XHR) to load content and / or JavaScript to +modify content. +It highly depends on the php-webdriver library which offers Selenium WebDriver bindings for PHP. + +- https://github.com/php-webdriver/php-webdriver (Project Repository) +- https://php-webdriver.github.io/php-webdriver/latest/ (API) + +Please note that this class is intended as a solution for websites _that cannot be covered +by the other classes_. The WebDriver starts a browser and is therefore very resource-intensive. + +# Configuration + +You need a running WebDriver to use bridges that depend on `WebDriverAbstract`. 
+The easiest way is to start the Selenium server from the project of the same name: +``` +docker run -d -p 4444:4444 --shm-size="2g" docker.io/selenium/standalone-chrome:latest +``` + +- https://github.com/SeleniumHQ/docker-selenium + +With these parameters, only one browser window can be started at a time. +On a multi-user site, Selenium Grid should be used +and the number of sessions should be adjusted to the number of processor cores. + +Finally, the `config.ini.php` file must be adjusted so that the WebDriver +can find the Selenium server: +``` +[webdriver] + +selenium_server_url = "http://localhost:4444" +``` + +# Development + +While you are programming a new bridge, it is easier to start a local WebDriver because then you can see what is happening and where the errors are. I've also had good results recording the session as a screen video to find any timing problems. + +``` +chromedriver --port=4444 +``` + +- https://chromedriver.chromium.org/ + +If you start rss-bridge from a container, ChromeDriver is only accessible +if you call it with the `--allowed-ips` option so that it binds to all network interfaces. + +``` +chromedriver --port=4444 --allowed-ips=192.168.1.42 +``` + +The **most important rule** is that after an event such as loading the web page +or pressing a button, you often have to explicitly wait for the desired elements to appear. + +A simple example is the bridge `ScalableCapitalBlogBridge.php`. +A more complex and relatively complete example is the bridge `GULPProjekteBridge.php`. + +# Template + +Use this template to create your own bridge. 
+ +```PHP +cleanUp(); + } + } +} + +``` \ No newline at end of file diff --git a/docs/05_Bridge_API/04_XPathAbstract.md b/docs/05_Bridge_API/04_XPathAbstract.md deleted file mode 100644 index fd697995..00000000 --- a/docs/05_Bridge_API/04_XPathAbstract.md +++ /dev/null @@ -1,160 +0,0 @@ -`XPathAbstract` extends [`BridgeAbstract`](./02_BridgeAbstract.md) and adds functionality for generating feeds based on _XPath expressions_. It makes creating new bridges easy, and if you're familiar with XPath expressions, this class is probably the right place to start. - -At the end of this document you'll find a complete [template](#template) based on these instructions. - -*** -# Required constants -To create a new Bridge based on `XPathAbstract` your inheriting class should specify a set of constants describing the feed and the XPath expressions. - -It is advised to override constants inherited from [`BridgeAbstract`](./02_BridgeAbstract.md#step-3---add-general-constants-to-the-class) as well. - -## Class constant `FEED_SOURCE_URL` -Source Web page URL (should provide either HTML or XML content). You can specify any website URL which serves data suited for display in RSS feeds. - -## Class constant `XPATH_EXPRESSION_FEED_TITLE` -XPath expression for extracting the feed title from the source page. If this is left blank or does not provide any data, `BridgeAbstract::getName()` is used instead as the feed's title. - -## Class constant `XPATH_EXPRESSION_FEED_ICON` -XPath expression for extracting the feed favicon URL from the source page. If this is left blank or does not provide any data, `BridgeAbstract::getIcon()` is used instead as the feed's favicon URL. - -## Class constant `XPATH_EXPRESSION_ITEM` -XPath expression for extracting the feed items from the source page. Enter an XPath expression matching a list of DOM nodes, each node containing one feed article item in total (usually a surrounding `
` or `` tag). These will be the context nodes for all of the following expressions. This expression usually starts with a single forward slash. - -## Class constant `XPATH_EXPRESSION_ITEM_TITLE` -XPath expression for extracting an item title from the item context. This expression should match a node contained within each article item node containing the article headline. It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. - -## Class constant `XPATH_EXPRESSION_ITEM_CONTENT` -XPath expression for extracting an item's content from the item context. This expression should match a node contained within each article item node containing the article content or description. It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. - -## Class constant `XPATH_EXPRESSION_ITEM_URI` -XPath expression for extracting an item link from the item context. This expression should match a node's attribute containing the article URL (usually the href attribute of an `<a>` tag). It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. Attributes can be selected by prefixing the attribute's name with an `@` character. - -## Class constant `XPATH_EXPRESSION_ITEM_AUTHOR` -XPath expression for extracting an item author from the item context. This expression should match a node contained within each article item node containing the article author's name. It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. - -## Class constant `XPATH_EXPRESSION_ITEM_TIMESTAMP` -XPath expression for extracting an item timestamp from the item context. This expression should match a node or node's attribute containing the article timestamp or date (parsable by PHP's `strtotime` function). 
It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. Attributes can be selected by prefixing the attribute's name with an `@` character. - -## Class constant `XPATH_EXPRESSION_ITEM_ENCLOSURES` -XPath expression for extracting item enclosures (media content like images or movies) from the item context. This expression should match a node's attribute containing an article image URL (usually the src attribute of an `<img>` tag or a style attribute). It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. Attributes can be selected by prefixing the attribute's name with an `@` character. - -## Class constant `XPATH_EXPRESSION_ITEM_CATEGORIES` -XPath expression for extracting an item category from the item context. This expression should match a node or node's attribute contained within each article item node containing the article category. This could be inside
or tags or sometimes be hidden in a data attribute. It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. Attributes can be selected by prefixing the attribute's name with an `@` character. - -## Class constant `SETTING_FIX_ENCODING` -Turns on automatic fixing of encoding errors. Set this to true for fixing feed encoding by invoking PHP's `utf8_decode` function on all extracted texts. Try this in case you see "broken" or "weird" characters in your feed where you'd normally expect umlauts or any other non-ASCII characters. - -# Optional methods -`XPathAbstract` offers a set of methods which can be overridden by derived classes for fine tuning and customization. This is optional. The methods provided for overriding can be grouped into three categories. - -## Methods for providing XPath expressions -Usually XPath expressions are defined in the class constants described above. By default, the following base methods just return the value of their corresponding class constant. However, deriving classes can override them if XPath expressions need to be formed dynamically or based on conditions. If any of these methods is defined, the method's return value is used instead of the corresponding constant for providing the value. - -### Method `getSourceUrl()` -Should return the source Web page URL used as a base for applying the XPath expressions. - -### Method `getExpressionTitle()` -Should return the XPath expression for extracting the feed title from the source page. - -### Method `getExpressionIcon()` -Should return the XPath expression for extracting the feed favicon from the source page. - -### Method `getExpressionItem()` -Should return the XPath expression for extracting the feed items from the source page. - -### Method `getExpressionItemTitle()` -Should return the XPath expression for extracting an item title from the item context. 
- -### Method `getExpressionItemContent()` -Should return the XPath expression for extracting an item's content from the item context. - -### Method `getSettingUseRawItemContent()` -Should return the 'Use raw item content' setting value (bool true or false). - -### Method `getExpressionItemUri()` -Should return the XPath expression for extracting an item link from the item context. - -### Method `getExpressionItemAuthor()` -Should return the XPath expression for extracting an item author from the item context. - -### Method `getExpressionItemTimestamp()` -Should return the XPath expression for extracting an item timestamp from the item context. - -### Method `getExpressionItemEnclosures()` -Should return the XPath expression for extracting item enclosures (media content like images or movies) from the item context. - -### Method `getExpressionItemCategories()` -Should return the XPath expression for extracting an item category from the item context. - -### Method `getSettingFixEncoding()` -Should return the Fix encoding setting value (bool true or false). - -## Methods for providing feed data -These methods are invoked for providing the HTML source as a base for applying the XPath expressions as well as feed metadata such as the title and icon. - -### Method `provideWebsiteContent()` -This method should return the HTML source as a base for the XPath expressions. Usually it merely returns the HTML content of the URL specified in the constant `FEED_SOURCE_URL` retrieved by curl. Some sites, however, require user authentication mechanisms or the use of special cookies and/or headers, where direct retrieval using standard curl would not suffice. In that case this method should be overridden and take care of the page retrieval. - -### Method `provideFeedTitle()` -This method should provide the feed title. Usually the XPath expression defined in `XPATH_EXPRESSION_FEED_TITLE` is used for extracting the title directly from the page source. 
- -### Method `provideFeedIcon()` -This method should provide the feed icon. Usually the XPath expression defined in `XPATH_EXPRESSION_FEED_ICON` is used for extracting the icon URL directly from the page source. - -### Method `provideFeedItems()` -This method should provide the feed items. Usually the XPath expression defined in `XPATH_EXPRESSION_ITEM` is used for extracting the items from the page source. All other XPath expressions are applied on a per-item basis, item by item, and only on the item's contents. - -## Methods for formatting and filtering feed item attributes -The following methods are invoked after extraction of the feed items from the source. Each of them expects one parameter, the value of the corresponding field, which can then be processed and transformed by the method. You can override these methods in order to format or filter parts of the feed output. - -### Method `formatItemTitle()` -Accepts the item's title as a parameter, processes it, and returns it. Should return a string. - -### Method `formatItemContent()` -Accepts the item's content as a parameter, processes it, and returns it. Should return a string. - -### Method `formatItemUri()` -Accepts the item's link URL as a parameter, processes it, and returns it. Should return a string. - -### Method `formatItemAuthor()` -Accepts the item's author as a parameter, processes it, and returns it. Should return a string. - -### Method `formatItemTimestamp()` -Accepts the item's creation timestamp as a parameter, processes it, and returns it. Should return a Unix timestamp as an integer. - -### Method `cleanImageUrl()` -Method invoked for cleaning feed icon and item image URLs. Extracts the image URL from the passed parameter, stripping any additional content. It also makes sure that relative image URLs are transformed into absolute ones. - -### Method `fixEncoding()` -Only invoked when class constant `SETTING_FIX_ENCODING` is set to true. It then passes all extracted string values through PHP's `utf8_decode` function. 
- -### Method `generateItemId()` -This method plays an important role in generating feed item IDs for all extracted items. Every feed item needs a unique identifier (Uid), so that your feed reader updates the original item instead of adding a duplicate in case an item's content is updated on the source site. Usually the item's link URL is a good candidate for the Uid. - -*** - -# Template - -Use this template to create your own bridge. Please remove any unnecessary comments and parameters. - -```PHP -` or `` tag). These will be the context nodes for all of the following expressions. This expression usually starts with a single forward slash. + +## Class constant `XPATH_EXPRESSION_ITEM_TITLE` +XPath expression for extracting an item title from the item context. This expression should match a node contained within each article item node containing the article headline. It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. + +## Class constant `XPATH_EXPRESSION_ITEM_CONTENT` +XPath expression for extracting an item's content from the item context. This expression should match a node contained within each article item node containing the article content or description. It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. + +## Class constant `XPATH_EXPRESSION_ITEM_URI` +XPath expression for extracting an item link from the item context. This expression should match a node's attribute containing the article URL (usually the href attribute of an `<a>` tag). It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. Attributes can be selected by prefixing the attribute's name with an `@` character. + +## Class constant `XPATH_EXPRESSION_ITEM_AUTHOR` +XPath expression for extracting an item author from the item context. 
This expression should match a node contained within each article item node containing the article author's name. It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. + +## Class constant `XPATH_EXPRESSION_ITEM_TIMESTAMP` +XPath expression for extracting an item timestamp from the item context. This expression should match a node or node's attribute containing the article timestamp or date (parsable by PHP's `strtotime` function). It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. Attributes can be selected by prefixing the attribute's name with an `@` character. + +## Class constant `XPATH_EXPRESSION_ITEM_ENCLOSURES` +XPath expression for extracting item enclosures (media content like images or movies) from the item context. This expression should match a node's attribute containing an article image URL (usually the src attribute of an `<img>` tag or a style attribute). It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. Attributes can be selected by prefixing the attribute's name with an `@` character. + +## Class constant `XPATH_EXPRESSION_ITEM_CATEGORIES` +XPath expression for extracting an item category from the item context. This expression should match a node or node's attribute contained within each article item node containing the article category. This could be inside
or tags or sometimes be hidden in a data attribute. It should start with a dot followed by two forward slashes, referring to any descendant nodes of the article item node. Attributes can be selected by prefixing the attribute's name with an `@` character. + +## Class constant `SETTING_FIX_ENCODING` +Turns on automatic fixing of encoding errors. Set this to true for fixing feed encoding by invoking PHP's `utf8_decode` function on all extracted texts. Try this in case you see "broken" or "weird" characters in your feed where you'd normally expect umlauts or any other non-ASCII characters. + +# Optional methods +`XPathAbstract` offers a set of methods which can be overridden by derived classes for fine tuning and customization. This is optional. The methods provided for overriding can be grouped into three categories. + +## Methods for providing XPath expressions +Usually XPath expressions are defined in the class constants described above. By default, the following base methods just return the value of their corresponding class constant. However, deriving classes can override them if XPath expressions need to be formed dynamically or based on conditions. If any of these methods is defined, the method's return value is used instead of the corresponding constant for providing the value. + +### Method `getSourceUrl()` +Should return the source Web page URL used as a base for applying the XPath expressions. + +### Method `getExpressionTitle()` +Should return the XPath expression for extracting the feed title from the source page. + +### Method `getExpressionIcon()` +Should return the XPath expression for extracting the feed favicon from the source page. + +### Method `getExpressionItem()` +Should return the XPath expression for extracting the feed items from the source page. + +### Method `getExpressionItemTitle()` +Should return the XPath expression for extracting an item title from the item context. 
+ +### Method `getExpressionItemContent()` +Should return the XPath expression for extracting an item's content from the item context. + +### Method `getSettingUseRawItemContent()` +Should return the 'Use raw item content' setting value (bool true or false). + +### Method `getExpressionItemUri()` +Should return the XPath expression for extracting an item link from the item context. + +### Method `getExpressionItemAuthor()` +Should return the XPath expression for extracting an item author from the item context. + +### Method `getExpressionItemTimestamp()` +Should return the XPath expression for extracting an item timestamp from the item context. + +### Method `getExpressionItemEnclosures()` +Should return the XPath expression for extracting item enclosures (media content like images or movies) from the item context. + +### Method `getExpressionItemCategories()` +Should return the XPath expression for extracting an item category from the item context. + +### Method `getSettingFixEncoding()` +Should return the Fix encoding setting value (bool true or false). + +## Methods for providing feed data +These methods are invoked for providing the HTML source as a base for applying the XPath expressions as well as feed metadata such as the title and icon. + +### Method `provideWebsiteContent()` +This method should return the HTML source as a base for the XPath expressions. Usually it merely returns the HTML content of the URL specified in the constant `FEED_SOURCE_URL` retrieved by curl. Some sites, however, require user authentication mechanisms or the use of special cookies and/or headers, where direct retrieval using standard curl would not suffice. In that case this method should be overridden and take care of the page retrieval. + +### Method `provideFeedTitle()` +This method should provide the feed title. Usually the XPath expression defined in `XPATH_EXPRESSION_FEED_TITLE` is used for extracting the title directly from the page source. 
+ +### Method `provideFeedIcon()` +This method should provide the feed icon. Usually the XPath expression defined in `XPATH_EXPRESSION_FEED_ICON` is used for extracting the icon URL directly from the page source. + +### Method `provideFeedItems()` +This method should provide the feed items. Usually the XPath expression defined in `XPATH_EXPRESSION_ITEM` is used for extracting the items from the page source. All other XPath expressions are applied on a per-item basis, item by item, and only on the item's contents. + +## Methods for formatting and filtering feed item attributes +The following methods are invoked after extraction of the feed items from the source. Each of them expects one parameter, the value of the corresponding field, which can then be processed and transformed by the method. You can override these methods in order to format or filter parts of the feed output. + +### Method `formatItemTitle()` +Accepts the item's title as a parameter, processes it, and returns it. Should return a string. + +### Method `formatItemContent()` +Accepts the item's content as a parameter, processes it, and returns it. Should return a string. + +### Method `formatItemUri()` +Accepts the item's link URL as a parameter, processes it, and returns it. Should return a string. + +### Method `formatItemAuthor()` +Accepts the item's author as a parameter, processes it, and returns it. Should return a string. + +### Method `formatItemTimestamp()` +Accepts the item's creation timestamp as a parameter, processes it, and returns it. Should return a Unix timestamp as an integer. + +### Method `cleanImageUrl()` +Method invoked for cleaning feed icon and item image URLs. Extracts the image URL from the passed parameter, stripping any additional content. It also makes sure that relative image URLs are transformed into absolute ones. + +### Method `fixEncoding()` +Only invoked when class constant `SETTING_FIX_ENCODING` is set to true. It then passes all extracted string values through PHP's `utf8_decode` function. 
+ +### Method `generateItemId()` +This method plays an important role in generating feed item IDs for all extracted items. Every feed item needs a unique identifier (Uid), so that your feed reader updates the original item instead of adding a duplicate in case an item's content is updated on the source site. Usually the item's link URL is a good candidate for the Uid. + +*** + +# Template + +Use this template to create your own bridge. Please remove any unnecessary comments and parameters. + +```PHP +