Posts Tagged ‘sort’

Computer Sorting Algorithms – Visualized.

The way I explain learning to code is that it’s akin to trying to learn another language, one that helps you communicate with computers. It should come as little surprise that much of the terminology in linguistics and programming is shared (syntax, language, context, etc.), which can help ease people into it, especially if they’ve learnt another language in the past. However I’ve always found the best tools to be visual, as it’s one thing to explain a system you built to someone and another thing to show how it operates and links together. Visualizing algorithms can be quite tricky, but one intrepid person has taken it upon themselves to give sorting algorithms, one of the fundamental yet readily misunderstood programming concepts, a visual representation, and it’s glorious.

Whilst this isn’t a particularly scientific way of demonstrating which algorithms are the best (they’ve all been designed to run for roughly the same amount of time and you can see that the number of integers they have to sort varies wildly) it does give you a pretty amazing view of how each of the sorting algorithms operates. You can see the patterns of their sorting behavior quite easily and the iterations they go through at each step to sort the array. Probably the most fun I got out of it was reading up on Bogosort afterwards, as I couldn’t really tell what it was trying to accomplish by looking at it, and the subsequent explanation is a great testament to all the things you shouldn’t do while programming. As a programmer that kind of absurdity is hilarious, unless you see it in production code of course.
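For the curious, the whole of Bogosort fits in a few lines: shuffle the list, check whether it happens to be sorted, and repeat until it is. Here is a minimal Python sketch of the idea. It is my own illustration rather than code from the visualization, and the function names are made up for the example.

```python
import random


def is_sorted(items):
    """Return True when every element is <= the one after it."""
    return all(a <= b for a, b in zip(items, items[1:]))


def bogosort(items, rng=None):
    """Shuffle a copy of items until it happens to come out sorted.

    Returns the sorted list and the number of shuffles it took.
    rng is an optional random.Random so runs can be made repeatable.
    """
    rng = rng or random.Random()
    items = list(items)
    shuffles = 0
    while not is_sorted(items):
        rng.shuffle(items)
        shuffles += 1
    return items, shuffles


if __name__ == "__main__":
    result, shuffles = bogosort([3, 1, 2])
    print(result, "after", shuffles, "shuffles")
```

Keep the input tiny if you try it: the expected number of shuffles grows factorially with the length of the list, which is exactly why it makes such a good cautionary tale.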

Sortilio Update: It’s Just Better All Over.

So like most products that a developer creates with one purpose in mind my first iteration of Sortilio was pretty bare bones. Sure if you had a small media collection that was named semi-coherently it worked fine (like it did for my test data) but past that it started to fall apart rather rapidly. Case in point: I let it loose on my own media collection, you know for the purposes of eating my own dog food. It didn’t take long for it to fall flat on its face, querying The TVDB’s API so rapidly that the rate limiter kicked in almost instantaneously. There was also the issue of not being able to massage the data once it had done the automated matching portion as even the best automated tools can still make mistakes. With that in mind I set about improving Sortilio and put the finishing touches on it yesterday.

Now the first update you’ll notice is the slightly changed main screen with a new Options tab and two extra buttons down in the right hand corner. They all function pretty much as you’d expect: the options tab has a few options for you to configure (only one of them works currently, the extensions one), save will export the current selection to a file for use later and load will import said file back into Sortilio. The save/load functionality is quite handy if you’d like to manually go in there and sort out the data yourself, as it’s all plain XML that I’m sure anyone with half a coding mind about them would be able to figure out. I put it in mostly for debugging purposes (re-running the identification process is rather slow, more on that in a bit) but I can see it being quite useful, especially with larger collections.

As I mentioned earlier, whilst the automated matching does a pretty good job of getting things right, there are times when it either doesn’t find anything or it’s got it completely wrong. To alleviate this I added the ability to double click a row to bring up the following screen:

Shown in this dialog is the series drop down which allows you to select from a list of episodes that Sortilio has already downloaded. The list is populated by the cache that Sortilio creates from its queries to The TVDB so if it managed to match one file in the series correctly it will have it cached already so you can just select it and hit update. Sortilio will then identify other files that had the same search term and ask if you’d like to update them as well (since it will have probably got them wrong as well). Should the series you’re looking for not be available you can then hit the search button which brings up this dialog:

From here you can enter whatever term you want and hit search. This will then query The TVDB and then display the results in a list for you. Select the most appropriate one and then hit OK and you’ll have the new series assigned to that file.

Under the hood things have gotten quite a bit better as well. The season string matching algorithm has been improved so that it identifies seasons better than it previously did. For instance if you had a file named something like battlestar.galactica.2003.s01e20.avi Sortilio would (wrongly) identify that as season 20 because of the 2003 before the season/episode identifier. It now prefers the right kind of identifiers and is a little better overall at getting it right, although I still think that the way I’m going about it is slightly ass backwards. Chalk that up to still figuring out how to best do string splitting based on a regex.
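To make the "prefer the right kind of identifier" idea concrete, here is a rough Python sketch. It is not Sortilio's actual code (that is C# and isn't shown here); the patterns, their ordering and the function name are all assumptions for illustration. The trick is simply to try the unambiguous forms first and only fall back to a bare run of digits when nothing better is found.

```python
import re

# Hypothetical patterns, most trustworthy first. Sortilio's real
# expressions are not published in this post.
PATTERNS = [
    # S01E20 style
    re.compile(r"s(\d{1,2})e(\d{1,2})", re.IGNORECASE),
    # 2x01 style (also accepts the typographic multiplication sign)
    re.compile(r"(?<!\d)(\d{1,2})[x×](\d{1,2})(?!\d)", re.IGNORECASE),
    # Bare 3 or 4 digit run such as 1008: last two digits are the episode.
    # On its own this would read "2003" as season 20, episode 3, which is
    # why it is only used as a last resort.
    re.compile(r"(?<!\d)(\d{1,2})(\d{2})(?!\d)"),
]


def find_season_episode(file_name):
    """Return (season, episode) from the first pattern that matches, else None."""
    for pattern in PATTERNS:
        match = pattern.search(file_name)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


if __name__ == "__main__":
    print(find_season_episode("battlestar.galactica.2003.s01e20.avi"))
```

Because the S01E20 pattern is tried before the bare-digits fallback, the 2003 in the Battlestar Galactica example never gets a chance to be mistaken for a season number.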

Now on the surface if you were to compare this version to the previous it would appear to run quite a bit slower. There’s a good reason for this and it all comes down to the rate limit on The TVDB API. After playing around with various values I found that the sweet spot was somewhere around a 2 second delay between searches. Without any series cached this means that every request incurs a 2 second penalty, significantly increasing the amount of time required to get the initial sort done. I’ve alleviated this somewhat by having Sortilio search its local cache first before attempting to head out to the API, but that’s still noticeably slower than it was originally. I’ve reached out to the guys behind The TVDB in the hopes that I can get an excerpt of their database to include within Sortilio, which would make the process lightning fast, but I’ve yet to hear back from them.
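The cache-first, wait-between-searches behaviour can be sketched roughly like this in Python. Again this is an illustration under assumptions, not Sortilio's code: `ThrottledSeriesLookup` and `search_remote` are hypothetical names, and the remote search is passed in as a plain function rather than being a real TheTVDB client call.

```python
import time


class ThrottledSeriesLookup:
    """Look in a local cache first; space out remote searches otherwise."""

    def __init__(self, search_remote, min_interval=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        # search_remote is a hypothetical callable: search term -> list of matches
        self._search_remote = search_remote
        self._min_interval = min_interval  # the ~2 second sweet spot from the post
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_call = None

    def search(self, term):
        key = term.strip().lower()
        if key in self._cache:
            # Cache hit: no API call, so no delay either.
            return self._cache[key]
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        results = self._search_remote(term)
        self._last_call = self._clock()
        self._cache[key] = results
        return results
```

Only cache misses pay the delay, so a collection with many episodes of a few series should speed up considerably once each series has been looked up once.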

So as always feel free to grab it, have a play and then send me any feedback you have regarding it. I’ve already got a list of improvements to make on this version but I’d definitely call this usable and to prove a point I have indeed used it on my own media collection. It gets about 90% of the way there with the last 10% needing manual intervention, either within Sortilio or outside cleaning up after it has done its job. If you’ve used it and encountered problems please save the sort file and the debug log and send them to me at [email protected].

You can grab the latest version here.

[NOTE: There is no link currently because gmail barfed at the file attachment I sent myself to upload this morning. Follow me on Twitter to be notified of when it comes out!]

Sortilio: Because Sorting Media Isn’t Hard.

My post last week about the trials and tribulations of sorting one’s media collection struck a chord with a lot of my friends. Like me they’d been doing this sort of thing for decades and the fact that none of us had any kind of sense to our sorting systems (apart from the common thread of “just leave it where it lies”) came as something of a surprise. I mean just taking the desk I’m sitting at right now as an example, it’s clear of everything bar computer equipment and the stuff I bring in with me every day. The fact that this kind of organization doesn’t extend to our file systems means that we either simply don’t care enough or that it’s just too bothersome to get things sorted. Whilst I can’t change the former I decided I could do something about the latter.

Enter Sortilio.

So with my quest last week proving fruitless I set about developing a program that could sort media based on a couple of cues derived from the files themselves. Now for the most part media files have a few clues as to what they actually are. For the more organized of us the top level folder will contain the episode name but since mine was all over the place I figured it couldn’t be trusted. Instead I figured that the file name would be semi-reliable, based on a cursory glance at my media folder, and that most of them were single strings delimited with only a few characters. Additionally the identifier for season and episode number is usually pretty standard (S01E01, 2×01, 1008, etc.) so pulling the season out of them would be relatively easy. What I was missing was something to verify that I was looking in the right place and that’s where TheTVDB comes in.

The TV Database is like IMDB for TV shows except that it’s all community driven. Also unlike IMDB they have a really nice API that someone has wrapped up in a nice C# library that I could just import straight into my project. What I use this for is a kind of fuzzy matching filter for TV show names so that I can generate a folder with the correct name. At this point I could also probably rename the files with the right name (if I was so inclined) but for the sake of keeping the tool simple I opted not to do this for now. With that under my belt I started on the really hard stuff: figuring out how to sort the damn files.

Now I could have cracked open the source of some other renaming programs to see how they did it but I figured out a half decent process after pondering the idea for a short while. It’s a multi-stage process that makes a few assumptions but seems to work well for my test data. First I take the file name and split it up based on common delimiters used in media files. Then I build up a search string using those broken up names stopping when I hit a string that matches a season/episode identifier. I then add that into a list of search terms to query for later, checking first to see if it’s already added. If it’s already in there I then add the file path into another list for that specific search term, so that I know that all files under that search term belong to the same series. Finally I create the new file location string and then present this all to the user, which ends up looking like this:
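For those who think better in code, the steps above look roughly like the following Python sketch. It is not the C# that Sortilio actually runs; the delimiter set, the episode-token pattern and the function names are my assumptions for illustration.

```python
import os
import re

# Assumed delimiters: dots, whitespace, underscores and hyphens.
DELIMITERS = re.compile(r"[.\s_\-]+")
# Assumed season/episode tokens: S01E01, 2x01, or a bare 3-4 digit number.
EPISODE_TOKEN = re.compile(r"^(s\d{1,2}e\d{1,2}|\d{1,2}x\d{1,2}|\d{3,4})$",
                           re.IGNORECASE)


def search_term_for(file_name):
    """Join the words of a file name up to the first season/episode token."""
    stem = os.path.splitext(file_name)[0]
    words = []
    for token in DELIMITERS.split(stem):
        if EPISODE_TOKEN.match(token):
            break  # everything from here on is episode detail, not series name
        if token:
            words.append(token)
    return " ".join(words).lower()


def group_by_search_term(paths):
    """Map each distinct search term to the list of file paths that produced it."""
    groups = {}
    for path in paths:
        term = search_term_for(os.path.basename(path))
        groups.setdefault(term, []).append(path)
    return groups


if __name__ == "__main__":
    files = ["tv/Firefly.S01E01.avi", "tv/firefly_s01e02.mkv"]
    print(group_by_search_term(files))
```

Each key in the resulting dictionary only needs to be looked up once, and every file listed under it then gets the same series.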

The view you see here is just a straight up data table of the list of files that Sortilio has found and identified as media (basically anything with the extension .avi or .mkv currently) and the confidence level it has in its ability to sort said media. Green means that in the search for the series name it only found one match, so it’s a pretty good assumption that it’s got it right. Yellow means that when I was doing a search for that particular title I got multiple responses back from TheTVDB so the confidence in the result is a little lower. Right now all I do is take the first response and use that for verification which has served me well with the test data, but I can easily see how that could go wrong. Red means I couldn’t find any match at all (you can see what terms I was searching for in the debug log) and everything marked like that will end up in one giant “Unsorted” folder for manual processing. Once you hit the sort button it will perform the move operations, and suffice to say, it works pretty darn well:
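The colour coding boils down to counting how many results came back for a search term. A tiny Python sketch of that rule, with a made-up function name and plain strings standing in for whatever the API wrapper really returns:

```python
def classify_matches(matches):
    """Return (confidence colour, chosen series or None) for a list of results.

    green  -> exactly one match, so it is probably right
    yellow -> several matches; the first one is used, with less confidence
    red    -> nothing found; the file is destined for the "Unsorted" folder
    """
    if not matches:
        return "red", None
    if len(matches) == 1:
        return "green", matches[0]
    return "yellow", matches[0]


if __name__ == "__main__":
    print(classify_matches(["Firefly"]))
    print(classify_matches(["The Office (US)", "The Office (UK)"]))
    print(classify_matches([]))
```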

Of course it’s your standard hacked-together-over-the-weekend type deal with a lot of not quite necessary but really nice to have features left out. For starters there’s no way to tell it that a file belongs to a certain series (like if something is misspelled), nor to tell it to pick another series if it picks the wrong one. Eventually I’m planning to make it so you can click on the items and change the series, along with a nice dialog box to search for new ones should it not get it right. This means you might want to do this on a small subset of your media each time (another thing I can code in) as otherwise you might get files ending up in strange folders.

Also lacking is any kind of options page where you can specify things like other extensions, regular expressions for season/episode matching and a whole host of other preferences that are currently hard coded in. These things are nice to have but take forever to get right so they’ll eventually make their way into another revision, but for now you’re stuck with the way I think things should be done. Granted I believe they’ll work for the majority of people out there, but I won’t blame you if you wait for the next release.

Finally the code will eventually be open sourced once I get it to a point where I’m not so embarrassed by it. If you really want to know what I did in the ~400 odd lines that constitute this program then shoot me an email/twitter and I’ll send the source code to you. Realistically any half decent programmer could come up with this in half the amount of time I did so I can’t imagine anyone will need it yet, unless you really need to save 3 hours 😛

So without further ado, Sortilio can be had here. Download it, unleash it on your media files and let me know how it works for you. Comments, questions, bugs and feature requests can be left here as a comment, an @ message on Twitter or you can email me on [email protected].

Surely Someone Has Done This Before (or My Media Management Maladies).

I decided to take December off working on my side projects, mostly because all those little things that I used to get done on weekends were starting to fall by the wayside. They weren’t huge things but they’re those kinds of things that when you see them you always think “I should fix that” but never end up doing. My inner perfectionist hates this and will guilt me endlessly about them and I figure that was what was causing me to feel burnt out on my projects, even though I had made some really good progress with them. One of those tasks I had set myself was to organise my media collection into something more sensible, with the ultimate goal of hiding it all under Xbox Media Center (XBMC).

After more than a decade of collecting media from all over the place the organisation was, to say the least, non-existent. Everything was lumped into giant folders all helpfully labelled “downloads” or “recent downloads” or “unsorted”. No worries I thought, the first step would be pretty easy: just sort everything out into their respective categories. That’ll make the process of sorting everything out afterwards a lot easier. That process took a good few hours to complete but in the end I had around 5 top-level folders that had everything nicely categorized. For the most part I didn’t care too much about the organization of things like software ISOs and installers (realistically I should delete most of them since they’re woefully outdated) but I knew XBMC was a little picky about how media was sorted so I started looking at solutions to that problem.

Now my media folders were a total, undignified mess. Even after sorting everything into series folders the files contained therein had no rhyme or reason to their layout. I did know where I wanted to end up however, hopefully in the form of Series -> Season -> Episodes, and figured that this would have been a common enough problem that someone would have already coded up some brilliant solution to do it all automatically for me. From what I could read on the various forums indeed many people had done exactly that and all that was left to do was to find one and unleash it on my tangled mess of media.

From what I could tell the best one of the lot was Ember Media Manager Revisited which had the added benefit of not looking like it was coded in VB6 by someone’s cousin. After installing and configuring it I was presented with a massive list of all the stuff I had. Figuring I’d trial it on the movies before trying the TV shows (which it says it’s not particularly good at organizing) I sent it on its merry way, hoping it would start sorting my media. Unfortunately there doesn’t seem to be an option for “Go look at everything and find the best match possible and prompt me if you can’t find one”. The option of “prompt if no exact match” doesn’t work properly as it either gets it wrong or will prompt you for everything, as it seems no movie title is completely unique. Figuring that this was only the first of many options I engaged my Google-Fu to find some alternatives and gave them a shot one by one.

TVRename was one that I had stumbled across in the past (and heard good things about) and I tried it on my media. Trouble is TVRename expects the exact folder structure I wanted to be already created and can’t create it on its own. Once everything’s sorted like that it’s actually quite brilliant, but the amount of effort required to get it there is too large. Several other programs I tried like TheRenamer, Media Companion and Media Center Master fall into a similar category of being able to rename files but unable to move them into a folder structure. I also tried a multitude of other programs that either flat-out didn’t work (or crashed) or required just as much work as doing it manually would.

The simple fact is there’s really nothing out there that can take a disorganised media folder and then sort it and rename it at the same time. This boggles my mind, as if you’re capable of renaming something down to the level of the name of the episode you have enough information to sort it. It wouldn’t be particularly hard to add on either, as the process of creating folders and moving files into them is basic I/O stuff that any developer should be familiar with. I could be facetious and say what should I expect from people coding in VB.NET (most of the apps are open source so you can see what language they use) but honestly it’s got to be plain old-fashioned oversight.
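To show just how basic that I/O is, here is a minimal Python sketch of the "create the folder, move the file" step, assuming the series and season have already been worked out. The function name and the "Season 01" folder naming are my own choices for the example, not anything these tools or XBMC prescribe.

```python
import shutil
from pathlib import Path


def sort_into_library(source, library_root, series, season):
    """Move one file into <library_root>/<series>/Season NN/, creating folders as needed."""
    source = Path(source)
    target_dir = Path(library_root) / series / f"Season {season:02d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    if target.exists():
        # Refuse to clobber an existing file; a real tool would ask the user.
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(source), str(target))
    return target
```

Everything hard about the problem lives in deciding which series and season a file belongs to; once that is known, the move itself is a handful of standard library calls.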

In the end I didn’t end up getting my media organised and I’ve resigned myself to creating a simple program that will do exactly what I need it to do. It shouldn’t be too hard as all I’ll be doing is searching for season and episode numbers and moving them into appropriate folders. After that I’ll use one of the other programs to do all the funky metadata handling and whatnot as they seem much more refined at that than I would be during a weekend slog. If it works well enough I’ll even throw it up here for good measure, source code and all.