
javascript - How to use XMLHttpRequest to download an HTML page in the background and extract a text element from it? - Stack Overflow


I want to make a Greasemonkey script that, while you are on URL_1, parses the whole HTML page of URL_2 in the background in order to extract a text element from it.

To be specific, I want to download the whole page's HTML code (a Rotten Tomatoes page) in the background, store it in a variable, and then use getElementsByClassName("critic_consensus")[0] to extract the text I want from the element with that class name.


I've found this in MDN: HTML in XMLHttpRequest so, I ended up in this unfortunately non-working code:

var xhr = new XMLHttpRequest();
xhr.onload = function() {
  alert(this.responseXML.getElementsByClassName("critic_consensus")[0].innerHTML);
};
xhr.open("GET", "http://www.rottentomatoes.com/m/godfather/", true);
xhr.responseType = "document";
xhr.send();

It shows this error message when I run it in Firefox Scratchpad:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://www.rottentomatoes.com/m/godfather/. This can be fixed by moving the resource to the same domain or enabling CORS.


PS. The reason why I don't use the Rotten Tomatoes API is that they've removed the critics consensus from it.

Edited Feb 6, 2018 at 22:01 by Brock Adams; asked Nov 5, 2014 at 19:19 by darkred.
  • 2 What is not-working? What error do you get? – Bergi Commented Nov 5, 2014 at 19:20
  • 2 No error message inside Firefox's Scratchpad. After seeing Igor Barinov's reply, I checked the Firefox Web Console and that's where appears the error message he mentioned. I added the error message to my question. – darkred Commented Nov 5, 2014 at 19:52
  • I edited my answer with new idea, give it a try! – Igor Barinov Commented Nov 5, 2014 at 20:38

3 Answers

For cross-origin requests, where the fetched site has not helpfully set a permissive CORS policy, Greasemonkey provides the GM_xmlhttpRequest() function. (Most other userscript engines also provide this function.)

GM_xmlhttpRequest is expressly designed to allow cross-origin requests.

To extract your target information, create a DOMParser on the result. Do not use jQuery methods, as they cause extraneous images, scripts and objects to load, slowing things down or crashing the page.

Here's a complete script that illustrates the process:

// ==UserScript==
// @name        _Parse Ajax Response for specific nodes
// @include     http://stackoverflow.com/questions/*
// @require     http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js
// @grant       GM_xmlhttpRequest
// ==/UserScript==

GM_xmlhttpRequest ( {
    method: "GET",
    url:    "http://www.rottentomatoes.com/m/godfather/",
    onload: function (response) {
        var parser  = new DOMParser ();
        /* IMPORTANT!
            1) For Chrome, see
            https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#DOMParser_HTML_extension_for_other_browsers
            for a work-around.

            2) jQuery.parseHTML() and similar are bad because they cause images, etc., to be loaded.
        */
        var doc         = parser.parseFromString (response.responseText, "text/html");
        var criticTxt   = doc.getElementsByClassName ("critic_consensus")[0].textContent;

        $("body").prepend ('<h1>' + criticTxt + '</h1>');
    },
    onerror: function (e) {
        console.error ('**** error ', e);
    },
    onabort: function (e) {
        console.error ('**** abort ', e);
    },
    ontimeout: function (e) {
        console.error ('**** timeout ', e);
    }
} );

The problem is: XMLHttpRequest cannot load http://www.rottentomatoes.com/m/godfather/. No 'Access-Control-Allow-Origin' header is present on the requested resource.

Because you are not the owner of the resource, you cannot set this header yourself.

What you can do is set up a proxy on Heroku which will forward all requests to the Rotten Tomatoes web site. Here is a small Node.js proxy: https://gist.github.com/igorbarinov/a970cdaf5fc9451f8d34

var https = require('https'),
    http  = require('http'),
    util  = require('util'),
    path  = require('path'),
    fs    = require('fs'),
    colors = require('colors'),
    url = require('url'),
    httpProxy = require('http-proxy'),
    dotenv = require('dotenv');

dotenv.load();

var proxy = httpProxy.createProxyServer({});
var host = "www.rottentomatoes.com";
var port = Number(process.env.PORT || 5000);

process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";

var server = require('http').createServer(function(req, res) {
    // You can define here your custom logic to handle the request
    // and then proxy the request.
    var path = url.parse(req.url, true).path;

    req.headers.host = host;
    res.setHeader("Access-Control-Allow-Origin", "*");
    proxy.web(req, res, {
        target: "http://" + host + path
    });

}).listen(port);

proxy.on('proxyRes', function (res) {
    console.log('RAW Response from the target', JSON.stringify(res.headers, true, 2));
});


util.puts('Proxying to '+ host +'. Server'.blue + ' started '.green.bold + 'on port '.blue + port);

I modified the code from https://github.com/massive/firebase-proxy/ for this.

I published the proxy at http://peaceful-cove-8072.herokuapp.com/ and you can test it at http://peaceful-cove-8072.herokuapp.com/m/godfather

Here is a fiddle to test it: http://jsfiddle.net/uuw8nryy/

var xhr = new XMLHttpRequest();
xhr.onload = function() {
  alert(this.responseXML.getElementsByClassName("critic_consensus")[0].textContent);
};
xhr.open("GET", "http://peaceful-cove-8072.herokuapp.com/m/godfather", true);
xhr.responseType = "document";
xhr.send();

The JavaScript same origin policy prevents you from accessing content that belongs to a different domain.

The above reference also gives you four techniques for relaxing this rule (CORS being one of them).
