Node.js example: using the xpath-html module to get HTML page elements by XPath

nodejs 2023-08-28 15:58:11 小码哥的IT人生 shichen

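I. Requirement

Use Node.js to read the value of an element in an HTML page, locating it by an XPath expression.

II. Solution

1. First, fetch the HTML source of the page at the given URL

For details, see the earlier post on fetching page source by URL with the axios module: http://www.phpcodeweb.com/news/4961.html .

Example code: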

import iconv from 'iconv-lite';
import axios from 'axios';
// Fetch the page as raw bytes and decode it with the given charset.
async function getPage(url,charset='utf-8'){
    var data;
    try {
        const res = await axios({
            method:"get",
            url:url,
            // Keep the raw bytes so iconv-lite can decode non-UTF-8 pages.
            responseType:"arraybuffer",
            headers:{
                "User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.36'
            }
        })
        data = iconv.decode(res.data,charset);
    } catch (error) {
        // axios reports the HTTP status as a number, so compare with 404, not '404'.
        if(error.response && error.response.status === 404){
            console.log('====== page not found ==========');
        }else{
            console.log('Failed to fetch',url,':',error.message)
        }
        // On failure the function returns a string such as "error:404".
        data = "error:" + error.response?.status;
    }
    return data;
}
// Fetch the HTML source of the specified page:
let weburl = 'https://blog.csdn.net/ximen2012/article/details/132499292';
const cont = await getPage(weburl);
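
The charset argument of getPage matters for pages that are not served as UTF-8. One way to choose it, sketched here on the assumption that the server declares the encoding in its Content-Type response header, is to parse that header value. charsetFromContentType is a hypothetical helper name, not part of axios or iconv-lite, and pages that declare their encoding only in a meta tag are not covered.

```javascript
// Hypothetical helper: pull the charset out of a Content-Type header value.
// Falls back to 'utf-8' when the header is missing or declares no charset.
function charsetFromContentType(contentType, fallback = 'utf-8') {
    if (typeof contentType !== 'string') return fallback;
    const match = /charset\s*=\s*"?([^\s;"]+)"?/i.exec(contentType);
    return match ? match[1].toLowerCase() : fallback;
}

console.log(charsetFromContentType('text/html; charset=GBK'));   // gbk
console.log(charsetFromContentType('text/html'));                // utf-8
```

The returned name could then be passed as the second argument of getPage.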

2. Use the xpath-html module (tested by the author; it runs successfully)

① Install the xpath-html module:

npm install xpath-html --save

② Import the xpath-html module:

("type": "module" mode; the author tested with this mode):

import xpath from 'xpath-html';

("type": "commonjs" mode):

const xpath = require('xpath-html');

3. Parse the HTML source with the xpath-html module:

const node = xpath.fromPageSource(cont).findElement('//*[@id="articleContentId"]');
// console.log(node.toString());
console.log(node.getText());
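
Because getPage returns a string beginning with "error:" when the request fails, that string would otherwise be handed to the parser as if it were HTML. A minimal sketch of a guard follows; isFetchError is a hypothetical name, and the check assumes that no real page source begins with that prefix.

```javascript
// Hypothetical guard: getPage() returns "error:<status>" on failure,
// so a result with that prefix should not be parsed as HTML.
function isFetchError(cont) {
    return typeof cont !== 'string' || cont.startsWith('error:');
}

console.log(isFetchError('error:404'));                  // true
console.log(isFetchError('<html><body></body></html>')); // false
```

In the script, the xpath.fromPageSource(cont) call would then run only when isFetchError(cont) is false.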

Output:

node.js实例：axios模块根据URL获取网页源码

PS: there is a bug here. While the HTML page is being parsed, the following error is reported:
[xmldom error] element parse error: Error: invalid attribute::src
@#[line:1449,col:13]

No solution has been found so far, but the error does not affect the program's result, so it can be ignored for now.
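
One possible way to hide the message, offered as an untested workaround rather than a fix: since the program still produces its result, the [xmldom error] line appears to be a console message rather than a thrown exception, so it could be filtered while the parser runs. The sketch assumes the message is printed through console.log, console.warn, or console.error and begins with "[xmldom"; withoutXmldomNoise is a hypothetical helper name.

```javascript
// Hypothetical helper: run fn() while dropping console messages whose first
// argument starts with "[xmldom". Assumption: xmldom reports parse problems
// through console.log / console.warn / console.error.
function withoutXmldomNoise(fn, methods = ['log', 'warn', 'error']) {
    const saved = {};
    for (const name of methods) {
        saved[name] = console[name];
        console[name] = (...args) => {
            const first = args[0];
            if (typeof first === 'string' && first.startsWith('[xmldom')) return;
            saved[name].apply(console, args);
        };
    }
    try {
        return fn();
    } finally {
        // Always restore the original console methods, even if fn() throws.
        for (const name of methods) console[name] = saved[name];
    }
}
```

The parsing call would be wrapped as withoutXmldomNoise(() => xpath.fromPageSource(cont).findElement('//*[@id="articleContentId"]')). This only silences the output; the underlying parse problem with the page's markup remains.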

4. Complete example code:

import iconv from 'iconv-lite';
import axios from 'axios';
import xpath from 'xpath-html';
// Fetch the page as raw bytes and decode it with the given charset.
async function getPage(url,charset='utf-8'){
    var data;
    try {
        const res = await axios({
            method:"get",
            url:url,
            // Keep the raw bytes so iconv-lite can decode non-UTF-8 pages.
            responseType:"arraybuffer",
            headers:{
                "User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.36'
            }
        })
        data = iconv.decode(res.data,charset);
    } catch (error) {
        // axios reports the HTTP status as a number, so compare with 404, not '404'.
        if(error.response && error.response.status === 404){
            console.log('====== page not found ==========');
        }else{
            console.log('Failed to fetch',url,':',error.message)
        }
        // On failure the function returns a string such as "error:404".
        data = "error:" + error.response?.status;
    }
    return data;
}
// Fetch the HTML source of the specified page:
let weburl = 'https://blog.csdn.net/ximen2012/article/details/132499292';
const cont = await getPage(weburl);
// Select the element with id "articleContentId" and print its text.
const node = xpath.fromPageSource(cont).findElement('//*[@id="articleContentId"]');
// console.log(node.toString());
console.log(node.getText());
