
javascript - mongodb import xml into mongodb - Stack Overflow



I have a problem importing a big XML file (1.3 GB) into MongoDB in order to search for the most frequent words in a map & reduce manner.

http://dumps.wikimedia.org/plwiki/20141228/plwiki-20141228-pages-articles-multistream.xml.bz2

Here I enclose an XML cut (the first 10,000 lines) from this big file:

http://www.filedropper.com/text2

I know that I can't import XML directly into MongoDB. I used some tools to do so, and I tried some Python scripts, but all have failed.

Which tool or script should I use? What should be the key & value? I think the best solution to find the most frequent word would be this:

(_id : id, value: word )

then I would sum all the elements like in the docs example:

http://docs.mongodb.org/manual/core/map-reduce/
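For illustration, here is a minimal pymongo version of that counting step (the database and collection names are made up, and it assumes the words already sit in the collection as one {value: word} document each):

from bson.code import Code
from pymongo import MongoClient

words = MongoClient()['plwiki']['words']   # made-up db/collection names

# emit 1 per occurrence of a word; the reducer sums the emitted counts,
# as in the map-reduce docs example linked above
mapper = Code("function () { emit(this.value, 1); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

out = words.map_reduce(mapper, reducer, 'word_counts')
for doc in out.find().sort('value', -1).limit(10):
    print(doc['_id'], doc['value'])        # the ten most frequent words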

Any clues would be greatly appreciated, but how do I import this file into MongoDB to get collections like this?

(_id : id, value: word )

If you have any ideas, please share.

Edited: After research, I would use Python or JS to complete this task.

I would extract only the words in the <text></text> section, which is under <page><revision>, exclude entities like &lt;, &gt;, etc., then separate the words and upload them to MongoDB with pymongo or JS.

So there are several pages, each with a revision and text.
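For illustration, a rough sketch of that plan in Python, using xml.etree.ElementTree.iterparse so the whole 1.3 GB file is never held in memory at once (the database and collection names are made up):

import re
import xml.etree.ElementTree as ET
from pymongo import MongoClient

words = MongoClient()['plwiki']['words']      # made-up db/collection names

# iterparse streams the dump element by element instead of loading it whole
for event, elem in ET.iterparse('plwiki-20141228-pages-articles-multistream.xml'):
    tag = elem.tag.rsplit('}', 1)[-1]         # drop the MediaWiki namespace prefix
    if tag == 'text' and elem.text:
        # ElementTree already decodes &lt;/&gt; entities; keep word characters only
        docs = [{'value': w.lower()} for w in re.findall(r'\w+', elem.text)]
        if docs:
            words.insert_many(docs)
    elem.clear()                              # release the element's memory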


asked Jan 7, 2015 at 18:18 by user2980480, edited Jan 8, 2015 at 9:23
  • Does anyone know how to convert the text section of such a big file into CSV or JSON? – user2980480 Commented Jan 8, 2015 at 23:02
  • The problem of big files can be solved with fileinput, because you load only one line at a time instead of the whole file into memory; then you decide when to write to another file (CSV or JSON). A sketch of this pattern follows these comments. – Abdelouahab Commented Jan 9, 2015 at 0:38
  • Can you give me an example? – user2980480 Commented Jan 9, 2015 at 0:41
  • I made this; since the resulting file will be really big, using open would use all the memory: github.com/abdelouahabb/kouider-ezzadam/blob/master/… – Abdelouahab Commented Jan 9, 2015 at 1:15
  • I tried doing that and also stackoverflow.com/questions/19286118/… and got a memory error. – user2980480 Commented Jan 9, 2015 at 9:16
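A minimal sketch of the line-at-a-time pattern Abdelouahab describes in the comments above (the output file name and the filter are only illustrative):

import fileinput
import json

# fileinput yields one line at a time, so the 1.3 GB dump is never
# read into memory as a whole
with open('text_lines.json', 'w') as out:
    for line in fileinput.input(['plwiki-20141228-pages-articles-multistream.xml']):
        if '<text' in line:                  # illustrative filter
            out.write(json.dumps({'line': line.strip()}) + '\n')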

2 Answers


To store all this data, save it on GridFS.
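A minimal sketch of that with pymongo's gridfs module (the database name is made up):

import gridfs
from pymongo import MongoClient

db = MongoClient()['plwiki']     # assumes a local mongod; db name is made up
fs = gridfs.GridFS(db)

# GridFS chunks the file, so it can hold files far beyond
# MongoDB's 16 MB document limit
with open('plwiki-20141228-pages-articles-multistream.xml', 'rb') as f:
    file_id = fs.put(f, filename='plwiki-20141228-pages-articles-multistream.xml')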

And the easiest way to convert the XML is to use this tool to convert it to JSON and save it:

https://stackoverflow.com/a/10201405/861487

import xmltodict

doc = xmltodict.parse("""
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
""")

doc['mydocument']['@has']
# -> 'an attribute'
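Parsing a 1.3 GB document in one go will exhaust memory, though; xmltodict also has a streaming mode (item_depth plus item_callback) that hands over one item at a time. A sketch of using it for the word extraction (the collection names and the exact dump structure are assumptions):

import re
import xmltodict
from pymongo import MongoClient

words = MongoClient()['plwiki']['words']     # made-up db/collection names

def handle_page(path, page):
    # assumes one <revision> per <page>; <text> may carry attributes,
    # in which case xmltodict nests the body under '#text'
    text = page.get('revision', {}).get('text')
    if isinstance(text, dict):
        text = text.get('#text')
    if text:
        docs = [{'value': w.lower()} for w in re.findall(r'\w+', text)]
        if docs:
            words.insert_many(docs)
    return True                              # keep streaming

# item_depth=2 streams one <page> element at a time instead of
# building the whole document in memory
with open('plwiki-20141228-pages-articles-multistream.xml', 'rb') as f:
    xmltodict.parse(f, item_depth=2, item_callback=handle_page)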

The XML file I'm using goes this way:

<labels>
     <label>
          <name>Bobby Nice</name>
          <urls>
               <url>www.examplex.com</url>
               <url>www.exampley.com</url>
               <url>www.examplez.com</url>
          </urls>
     </label>
     ...
</labels>

and I can import it using xml-stream with MongoDB.

See: https://github.com/assistunion/xml-stream

Code:

var fs = require('fs');
var XmlStream = require('xml-stream');

// `db` is assumed to be an already-open MongoDB database handle
// (e.g. from MongoClient.connect)
// Pass the ReadStream object to xml-stream
var stream = fs.createReadStream('20080309_labels.xml');
var xml = new XmlStream(stream);

var i = 1;
xml.on('endElement: label', function(label) {
  // upsert each <label> element as its own document
  db.collection('labels').update(label, label, { upsert: true }, (err, doc) => {
    if (err) {
      process.stdout.write(err + "\r");
    } else {
      process.stdout.write(`Saved ${i} entries..\r`);
      i++;
    }
  });
});

xml.on('end', function() {
  console.log('end event received, done');
});