Trouble scraping NzbMegaSearch Indexer

rickatnight11 · September 28, 2014, 6:26pm

I believe this started after I updated Nzbdrone yesterday, but now it’s not able to parse any releases from my NzbMegaSearch API:

[Info] NzbSearchService: Searching 3 indexers for [Fargo : S01E10]
[Warn] FetchFeedService: NzbMegaSearch http://omitted:8050/api?t=tvsearch&cat=5030,5040&extended=1&apikey=omitted&limit=100&rid=35276&season=1&ep=10 Indexer responded with html content. Site is likely blocked or unavailable.
[Info] FetchFeedService: Finished searching NzbMegaSearch for [Fargo : S01E10]. Found 0
[Info] DownloadDecisionMaker: No reports found

I can load that exact link in my browser, and it appears to be a valid XML RSS page.

markus101 · September 28, 2014, 7:40pm

Megasearch is returning the content type as HTML instead of text/xml or another proper RSS format. If you use the dev tools of your browser you should be able to check the response headers, specifically what the Content-Type header is set to.
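
For anyone who would rather use a script than browser dev tools, here is a minimal Python 3 sketch that reports the Content-Type a URL responds with. The function name `content_type_of` is made up for illustration, and the URL in the usage comment reuses the placeholder host and key from the log above, so substitute your own.

```python
import urllib.request


def content_type_of(url, timeout=10):
    """Return the media type from the response's Content-Type header."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # resp.headers is an email.message.Message subclass; get_content_type()
        # returns the lowercased media type without parameters such as charset.
        return resp.headers.get_content_type()


# Usage (placeholder host and key, substitute your own):
# print(content_type_of("http://omitted:8050/api?t=tvsearch&apikey=omitted"))
```

If this reports `text/html` for the /api endpoint, the response header is what drone is rejecting.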

The change on drone’s end deals with providers that return HTML instead of XML content; it prevents drone from attempting to parse something that will fail. Unfortunately, megasearch has a few issues and this adds another to the list.

KiefDelicious · September 30, 2014, 6:34pm

So there is nothing we can do right now? Just hope that it will be fixed in NZBdrone (because it was working before the latest release)?

Statia · September 30, 2014, 7:17pm

The actual XML returned from NzbMegaSearch API shows the correct type of “application/rss+xml”:

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
<atom:link href="http://bogusurl.info/api?age=1500&amp;t=tvsearch&amp;cat=5040%2C5030" rel="self" type="application/rss+xml" />

… and the content looks to be correctly formed. Wouldn’t it be better to check the content rather than the headers for the correct type?

syndac · September 30, 2014, 7:53pm

I seem to be having this problem as well. A fix for this would be much appreciated.

markus101 · September 30, 2014, 9:10pm

@Statia I believe it’s the response header that’s the issue, definitely not the content of the XML itself. We made these changes in drone for better error handling, as it’s required in some cases. Since drone has no way to determine whether it’s connecting to a newznab indexer or megasearch, we can’t handle this on our end without reverting our intentional change. I’m not familiar with Megasearch, so I can’t say whether or not this can easily be solved in their code.

Can someone confirm what the response header is set to?

syndac · September 30, 2014, 9:29pm

@markus101 These are the response headers I got:

>     Content-Length	7905
>     Content-Type	text/html; charset=utf-8
>     Date	Tue, 30 Sep 2014 21:28:12 GMT
>     Server	Werkzeug/0.8.3 Python/2.7.3

rickatnight11 · September 30, 2014, 9:55pm

@markus101, what would be the lift in reverting the intentional change in favor of triggering based on detecting the text content versus the HTTP headers? This would avoid issues with any other indexers that do a bad job of sending the correct headers, but the content is valid.

Taloth · September 30, 2014, 11:04pm

xref:

Statia · October 1, 2014, 7:48am

@Taloth Yeah I raised an issue with NzbMegaSearch on GitHub, but there’s little by way of activity from the original developer so I don’t know if it will ever get looked into.

@markus101 Would it be better to check if the response headers OR the content contain the correct type rather than just the headers? I know in an ideal world, all APIs should return the correct type header, but it’s rarely an ideal world.
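
To make the "headers OR content" idea concrete, here is a rough Python sketch. It is not drone's code (drone is not written in Python), and the function names and the list of accepted media types are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Media types treated as XML here; this list is an assumption, not drone's.
XML_TYPES = {"text/xml", "application/xml", "application/rss+xml", "application/atom+xml"}


def header_says_xml(content_type):
    """True if the Content-Type header value names an XML media type."""
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return media_type in XML_TYPES or media_type.endswith("+xml")


def body_looks_like_rss(body):
    """True if the body parses as XML and its root element is <rss>."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    return root.tag == "rss"


def is_feed(content_type, body):
    """Accept the response if either the header or the content checks out."""
    return header_says_xml(content_type) or body_looks_like_rss(body)
```

With a `text/html; charset=utf-8` header and a well-formed RSS body, `is_feed` returns True through the content check, while an HTML error page that is not well-formed XML is still rejected.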

Taloth · October 1, 2014, 8:53am

@Statia, yes, I figured. I just added it as a reference.
Late last night I took a quick look at the megasearch python code; it looks to me like you need a grand total of 3 lines to fix it. But it was already like 2 am, so I didn’t feel like actually installing it to test it out.

diff https://github.com/pillone/usntssearch/blob/master/NZBmegasearch/mega2.py#L345
-		return apiresp.dosearch(request.args, urlparse(request.url))
+               resp = make_response(apiresp.dosearch(request.args, urlparse(request.url)))
+               resp.headers['Content-Type'] = 'application/rss+xml'
+               return resp

But that’s totally untested, might not even compile.
You can do it in one line too

return apiresp.dosearch(request.args, urlparse(request.url)), 200, {'Content-Type': 'application/rss+xml'}

but dunno if that’s pythonian. I’m not really a python developer.

Statia · October 1, 2014, 9:22am

@Taloth

Using the three additional lines in mega2.py, it compiles, but the Content-Type header unfortunately remains text/html:

HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 10075
Server: Werkzeug/0.8.3 Python/2.7.3
Date: Wed, 01 Oct 2014 09:19:16 GMT

Taloth · October 1, 2014, 9:50am

are you sure you’re hitting the /api endpoint?

Otherwise try resp.mimetype = 'application/rss+xml' instead of the Content-Type one.

Statia · October 1, 2014, 10:05am

Yes I’m hitting the API endpoint.

Still the same with mimetype.

Is Werkzeug creating the response and overriding the headers you’re adding?

Taloth · October 1, 2014, 11:22am

Aw that was just stupid. The line I had you change was for the case without an API key.

Here’s the required diff:

diff --git a/NZBmegasearch/mega2.py b/NZBmegasearch/mega2.py
index 145ff85..c1169d2 100644
--- a/NZBmegasearch/mega2.py
+++ b/NZBmegasearch/mega2.py
@@ -336,13 +336,13 @@ def api():
        if(len(cfgsets.cgen['general_apikey'])):
                if('apikey' in request.args):
                        if(request.args['apikey'] == cfgsets.cgen['general_apikey']):
-                               return apiresp.dosearch(request.args, urlparse(request.url))
+                               return apiresp.dosearch(request.args, urlparse(request.url)), 200, {'Content-Type':'application/rss+xml'}
                        else:   
                                return '[API key protection ACTIVE] Wrong key selected'
                else:   
                                return '[API key protection ACTIVE] API key required'
        else:
-               return apiresp.dosearch(request.args, urlparse(request.url))
+               return apiresp.dosearch(request.args, urlparse(request.url)), 200, {'Content-Type':'application/rss+xml'}

Not a perfect fix, since the ‘wrong api key’ response doesn’t conform to newznab, but it should be enough to get it working.

And it possibly belongs on the /rss endpoint response as well.

Statia · October 1, 2014, 12:14pm

Works a treat.

For anyone wanting to manually update the Mega2.py file, the updated file contents are here:

http://pastebin.com/FpmYFnck

@Taloth I have created a pull request for the fix in Github and credited yourself with the fix. Thanks for your efforts.

Edit: I also added the fix for RSS too.

syndac · October 1, 2014, 7:00pm

@Statia I copied and pasted your pastebin code into mega2.py and restarted NZBMegaSearch, but I’m getting the same error. Any idea why?

Statia · October 1, 2014, 7:29pm

It worked for me, not sure why yours didn’t work. Try a reboot.

syndac · October 1, 2014, 7:46pm

@Statia @Taloth Still nothing. Here’s what I did:

1. Shut down NZBMegasearch through the web GUI
2. Verify mega2.exe isn’t running
3. Open mega2.py in Notepad++
4. Copy everything from your pastebin and overwrite in mega2.py
5. Save mega2.py
6. Restart PC
7. Open mega2.py in Notepad++ to verify changes are still there
8. Run a search in NZB Drone for South Park S18E01
9. Doesn’t work, check log.
10. Log giving the same error and not returning any search results.

After viewing the Response Headers, they’re still coming back as text/html.

Maybe it’s the version of Python I’m running. Are you running 2.7 like me?

…Or maybe it’s because I’m running the Windows version? Are you? Do I need to do anything with mega2.exe?

Annihilist · October 2, 2014, 12:07am

Aside from the content type issue, I also ran into an issue where only the first result was being scraped out of the XML file. After doing a bit of searching I found that NzbDrone recently changed to using the GUID field in the results to identify releases. To fix this, I changed the following line in ApiModule.py:

niceResults_row['encodedurl'] = 'http://bogus.gu/bog'

to

niceResults_row['encodedurl'] = results[i]['url']
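
As an illustration of why a shared GUID leaves only one result, here is a small Python sketch. It is not NzbDrone's code; `dedupe_by_guid`, the titles and the example.invalid URLs are hypothetical, and it assumes the `encodedurl` value ends up as each item's GUID in the feed.

```python
def dedupe_by_guid(releases):
    """Keep the first release seen for each GUID, preserving order."""
    seen = set()
    unique = []
    for release in releases:
        if release["guid"] not in seen:
            seen.add(release["guid"])
            unique.append(release)
    return unique


# Every item carries the same hard-coded value, as with the original line.
bogus = [
    {"title": "Release A", "guid": "http://bogus.gu/bog"},
    {"title": "Release B", "guid": "http://bogus.gu/bog"},
    {"title": "Release C", "guid": "http://bogus.gu/bog"},
]

# Each item carries its own URL, as with the changed line (hypothetical URLs).
fixed = [
    {"title": "Release A", "guid": "http://example.invalid/nzb/a"},
    {"title": "Release B", "guid": "http://example.invalid/nzb/b"},
    {"title": "Release C", "guid": "http://example.invalid/nzb/c"},
]

print(len(dedupe_by_guid(bogus)), len(dedupe_by_guid(fixed)))
```

Any consumer that identifies releases by GUID in this way would collapse the first list to a single entry and keep all three entries of the second.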

I’ve only tested with CouchPotato and NzbDrone and it hasn’t caused any issues; however, you could also look at changing

if(self.typesearch != 0):

to

if(self.typesearch != 0 or self.typesearch != 1):

which should only affect results for NzbDrone.
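
One thing worth checking before applying that second change, shown as a small Python sketch: no value can equal both 0 and 1, so the `or` form is true for every input. If the intent is to skip both values, `and` would be the operator to use. Which values `typesearch` actually takes in ApiModule.py is an assumption here, and both function names below are made up for illustration.

```python
def or_condition(typesearch):
    """The condition as posted: true for every possible value."""
    return typesearch != 0 or typesearch != 1


def and_condition(typesearch):
    """Variant that is false for both 0 and 1."""
    return typesearch != 0 and typesearch != 1


for value in (0, 1, 2):
    print(value, or_condition(value), and_condition(value))
```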