Crawling the Windows Phone Marketplace

I have been asked by a few people how sites like WP7AppList get their data.  The Windows Phone Marketplace, which you access on your PC via the Zune software, sends its data over the wire as XML.  I wanted to share a couple of code snippets which might help a fellow data junkie on their way.  This code works.  It may not be the most elegant solution, but it shows how to parse the XML and how to write LINQ queries against it.

Caveat: this is a geek enthusiast post.  I used Fiddler to figure out how to parse the XML.  This was something I did over Christmas break to give me a project I could be excited about, and to learn some more about parsing XML with LINQ.  I also wanted to do some large database stuff, and this crawler throws off a ton of data.  I did not use any proprietary knowledge about how our backend systems work.  This is all done against the public XML feeds.

First up, we are going to need to create some data structures to catch all of the inbound data.  You can use anonymous types with LINQ, but I liked having a measure of control, and having the ability to handle null values and potential errors in the feed.

public class ZestAppData
{
    public string Title { get; set; }
    public string Id { get; set; }
    public DateTime ReleaseDate { get; set; }
    public DateTime Updated { get; set; }
    public string Version { get; set; }
    public string ShortDescription { get; set; }
    public decimal AverageUserRating { get; set; }
    public int UserRatingCount { get; set; }
    public string ImageId { get; set; }
    
    public IList<ZestCategory> Categories = new List<ZestCategory>();
    public IList<ZestPublisher> Publisher = new List<ZestPublisher>();
    public IList<ZestOffer> Offers = new List<ZestOffer>();
}

public class ZestCategory
{
    public string Id { get; set; }
    public string IsRoot { get; set; }
    public string Title { get; set; }
}

public class ZestOffer
{
    public string OfferId { get; set; }
    public string MediaInstanceId { get; set; }
    public decimal Price { get; set; }
    public string PriceCurrencyCode { get; set; }
    public string LicenseRight { get; set; }
    public List<string> PaymentType = new List<string>();
}

public class ZestPublisher
{
    public string Id { get; set; }
    public string Name { get; set; }
}
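
To make the point about null values and feed errors concrete, here is a minimal sketch of two helper methods.  The class name FeedValue and both method names are hypothetical and not part of the crawler above.  The sketch relies on the explicit XElement-to-string conversion in LINQ to XML, which returns null for a missing element instead of throwing, and it parses numbers with the invariant culture so "4.5" reads the same on any machine.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

// Hypothetical helpers for reading optional values out of a feed entry.
public static class FeedValue
{
    // Returns the child element's text, or "" when the element is missing.
    public static string ReadText(XElement parent, XName name)
    {
        return (string)parent.Element(name) ?? "";
    }

    // Parses the child element as a decimal using the invariant culture.
    // Returns the fallback when the element is missing or not a number.
    public static decimal ReadDecimal(XElement parent, XName name, decimal fallback = 0m)
    {
        decimal value;
        bool ok = decimal.TryParse((string)parent.Element(name),
            NumberStyles.Number, CultureInfo.InvariantCulture, out value);
        return ok ? value : fallback;
    }
}
```

With helpers like these, a missing shortDescription or a malformed averageUserRating turns into a default value rather than an exception in the middle of a crawl.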

 

You are also going to want to have a bunch of variables defined for the URLs where the XML is coming from, the XML namespaces, etc:

const string BaseAppsUrl = "http://catalog.zune.net";
const string BaseImageUrl = "http://image.catalog.zune.net";

const string ZestVersion = "/v3.2/";
const string ZestImageVersion = "/v3.0/";

const string BaseApps = "apps/";
const string BaseImage = "image/";

const string BaseAppsResource = "?clientType=WinMobile%207.0&store=Zest&orderby=downloadRank";
const string BaseCommentsResource = "/reviews/?store=Zest&chunkSize=10";
const string BaseImageResource = "?width=240&height=240";

ZestCrawlEntities ZestCrawlContext;

XNamespace ns = "http://www.w3.org/2005/Atom";
XNamespace zestns = "http://schemas.zune.net/catalog/apps/2008/02";

public string LangCode = "en-us"; //setting the default value

public List<string> ValidLangCodes = new List<string>(
    new string[] {  "en-us", "en-gb", "de-de",
                    "fr-fr", "es-es", "it-it",
                    "en-au", "de-at", "fr-be",
                    "fr-ca", "en-ca", "en-hk",
                    "en-in", "en-ie", "es-mx",
                    "en-nz", "en-sg", "de-ch",
                    "fr-ch" });

public string AppAfterMarkerUrl { get; set; }
public bool HasMoreApps = true;
public string AppsResponseString { get; set; }
public XElement ReturnedAppsXml;

 

Have a look at the ValidLangCodes list.  That’s the coding we use in the URLs for country-specific data.  So if you want to get the data for Mexico, use “es-mx”.  The first two letters are the language code, and the last two are the country code.  If an app is listed in the feed, it is active.  The list returned is ordered, meaning the first app is ranked #1.  I am pulling the full apps list; the orderby=downloadRank clause on the BaseAppsResource sets the ordering.
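
As a quick illustration of how the market code ends up in the URL, here is a small sketch that builds the first-page URL for one market from the constants above.  The class name ZestUrls and the method AppsUrlFor are hypothetical; the concatenation order is the same one GetAppsResponse uses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: builds the first-page apps URL for one market.
public static class ZestUrls
{
    const string BaseAppsUrl = "http://catalog.zune.net";
    const string ZestVersion = "/v3.2/";
    const string BaseApps = "apps/";
    const string BaseAppsResource =
        "?clientType=WinMobile%207.0&store=Zest&orderby=downloadRank";

    // Same codes as the ValidLangCodes list in the post.
    static readonly HashSet<string> ValidLangCodes = new HashSet<string>
    {
        "en-us", "en-gb", "de-de", "fr-fr", "es-es", "it-it", "en-au",
        "de-at", "fr-be", "fr-ca", "en-ca", "en-hk", "en-in", "en-ie",
        "es-mx", "en-nz", "en-sg", "de-ch", "fr-ch"
    };

    public static string AppsUrlFor(string langCode)
    {
        // Reject codes that are not in the list rather than requesting a bad URL.
        if (!ValidLangCodes.Contains(langCode))
            throw new ArgumentException("Unknown market code: " + langCode, "langCode");

        return BaseAppsUrl + ZestVersion + langCode + "/" + BaseApps + BaseAppsResource;
    }
}
```

For Mexico, ZestUrls.AppsUrlFor("es-mx") gives http://catalog.zune.net/v3.2/es-mx/apps/?clientType=WinMobile%207.0&store=Zest&orderby=downloadRank.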

ZestCrawlContext is my ADO.NET DB model.  Create your own and stuff the data however you want.

Now that we have those pieces in place, you are going to need a way to get the XML from Microsoft's servers.

public void GetAppsResponse()
{
    string FullUrl;
    bool done = false;

    if (!String.IsNullOrEmpty(AppAfterMarkerUrl))
    {
        FullUrl = AppAfterMarkerUrl;
    }
    else
    {
        FullUrl = BaseAppsUrl + ZestVersion + LangCode + "/"
            + BaseApps + BaseAppsResource;
    }
            
    while (!done)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(FullUrl);
            request.KeepAlive = false;

            // dispose the response and reader so connections are not leaked
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                ReturnedAppsXml = XElement.Parse(reader.ReadToEnd());
                done = true;
            }
        }
        catch
        {
            // note: any failure (network or bad XML) is retried with no limit
            Console.WriteLine("yeah, your connection was likely aborted");
        }
    }
}
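
The loop above keeps retrying until a request succeeds, which can spin indefinitely if the service is down or a URL is bad.  One possible alternative is a bounded retry, sketched below.  The Retry class and its WithLimit method are hypothetical names, and the fetch delegate stands in for whatever does the actual download.

```csharp
using System;
using System.Threading;

// Hypothetical helper: retries an operation a limited number of times.
public static class Retry
{
    // Calls fetch until it succeeds or maxAttempts is used up,
    // sleeping for delay between failed attempts.
    public static T WithLimit<T>(Func<T> fetch, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException("maxAttempts");

        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return fetch();
            }
            catch (Exception ex)
            {
                last = ex;
                if (attempt < maxAttempts)
                    Thread.Sleep(delay);
            }
        }

        // Every attempt failed: surface the last error to the caller.
        throw new InvalidOperationException(
            "Gave up after " + maxAttempts + " attempts", last);
    }
}
```

The download in GetAppsResponse could be wrapped in a lambda and passed as fetch, with something like five attempts and a few seconds of delay; those numbers are guesses, not values tuned against the real service.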

 

Now comes the fun part.  Remember, the XML is coming over the wire, and it comes 100 elements at a time.  So you have to parse each chunk, stuff the results somewhere and then request the next chunk.  The XML returned includes the link to use for requesting the next chunk.  (Note: yes, I know I am using Regex where I could be using String.Replace.  Also, sorry about the wonky formatting, but my blog has width issues.)

public IEnumerable<ZestAppData> GetAppEntries()
{
    //first we have to parse the feed which came back
    IEnumerable<ZestAppData> entries =
        from e in ReturnedAppsXml.Elements(ns + "entry")
        select new ZestAppData
        {

            Title = e.Element(ns + "title").Value,

            Id = Regex.Replace(e.Element(ns + "id").Value, "(urn:uuid:)(.)", "$2"),

            ReleaseDate = DateTime.Parse(e.Element(zestns + "releaseDate").Value),

            Updated = DateTime.Parse(e.Element(ns + "updated").Value),

            ShortDescription = e.Element(zestns + "shortDescription") == null
                ? "" : e.Element(zestns + "shortDescription").Value,

            AverageUserRating = decimal.Parse(e.Element(zestns + "averageUserRating").Value),

            UserRatingCount = int.Parse(e.Element(zestns + "userRatingCount").Value),

            Version = e.Element(zestns + "version").Value,

            ImageId = Regex.Replace(e.Element(zestns + "image").Element(zestns + "id").Value, "(urn:uuid:)(.)", "$2"),

            Categories = (
                from category in e.Elements(zestns + "categories").Elements(zestns + "category")
                select new ZestCategory
                {
                    Id = category.Element(zestns + "id").Value,
                    Title = category.Element(zestns + "title").Value,
                    IsRoot = category.Element(zestns + "isRoot").Value
                }).ToList(),

            Publisher = (
                from publisher in e.Elements(zestns + "publisher")
                select new ZestPublisher
                {
                    Id = publisher.Element(zestns + "id").Value,
                    Name = publisher.Element(zestns + "name").Value
                }).ToList(),

            Offers = (
                from offer in e.Elements(zestns + "offers").Elements(zestns + "offer")
                select new ZestOffer
                {
                    OfferId = offer.Element(zestns + "offerId").Value,
                    MediaInstanceId = offer.Element(zestns + "mediaInstanceId").Value,
                    Price = decimal.Parse(offer.Element(zestns + "price").Value),
                    PriceCurrencyCode = offer.Element(zestns + "priceCurrencyCode").Value,
                    LicenseRight = offer.Element(zestns + "licenseRight").Value,
                    PaymentType = (
                        from paymenttype in offer.Elements(zestns + "paymentTypes").Elements()
                        select paymenttype.Value).ToList()
                }).ToList()
        };

    //now I need to get the AfterMarkerUrl from the XML feed
    //(the string casts return null instead of throwing when an attribute is missing)
    var afterMarker = (
        from e in ReturnedAppsXml.Elements(ns + "link")
        where (string)e.Attribute("rel") == "next"
        select (string)e.Attribute("href")).FirstOrDefault();

    if (afterMarker != null)
    {
        AppAfterMarkerUrl = BaseAppsUrl + afterMarker;
    }
    else
    {
        HasMoreApps = false;
    }

    return entries;
}
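
Two small pieces of the method above can be tried on their own: stripping the urn:uuid: prefix without Regex, and pulling the next-page link out of the feed.  The sketch below is only an illustration; the class name FeedParts is hypothetical.  Note that StripUrn only removes a leading prefix, while the Regex version removes the prefix wherever it appears.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helpers mirroring two steps of GetAppEntries.
public static class FeedParts
{
    static readonly XNamespace Atom = "http://www.w3.org/2005/Atom";
    const string UrnPrefix = "urn:uuid:";

    // Drops a leading "urn:uuid:"; other values pass through unchanged.
    public static string StripUrn(string id)
    {
        return id.StartsWith(UrnPrefix, StringComparison.Ordinal)
            ? id.Substring(UrnPrefix.Length)
            : id;
    }

    // Returns the href of the rel="next" link, or null on the last page.
    public static string NextLink(XElement feed)
    {
        return feed.Elements(Atom + "link")
            .Where(l => (string)l.Attribute("rel") == "next")
            .Select(l => (string)l.Attribute("href"))
            .FirstOrDefault();
    }
}
```

A null from NextLink plays the same role as HasMoreApps = false in the method above.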

Now you have all the data you need to crawl the marketplace whenever you want.  The LINQ stuff is really, really fast.  Crawling the marketplaces can be a bit slow.  I crawl each one individually when my code runs, and I store app lists for each of the markets.
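
The crawl for one market boils down to a loop: fetch a page, keep its entries, follow the next link, and stop when there is none.  Here is a rough sketch of that loop in isolation.  Pager, CrawlAll and the fetchPage delegate are hypothetical; in the real crawler, fetchPage would wrap GetAppsResponse and GetAppEntries.  The maxPages cap is an added safety net, not something the feed requires.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical paging loop, independent of how a page is downloaded.
public static class Pager
{
    // fetchPage takes a page URL and returns that page's items plus the
    // URL of the next page (null when there are no more pages).
    public static List<T> CrawlAll<T>(
        string firstUrl,
        Func<string, Tuple<List<T>, string>> fetchPage,
        int maxPages)
    {
        var all = new List<T>();
        string url = firstUrl;

        for (int page = 0; url != null && page < maxPages; page++)
        {
            Tuple<List<T>, string> result = fetchPage(url);
            all.AddRange(result.Item1);   // keep this page's items
            url = result.Item2;           // follow the next link
        }

        return all;
    }
}
```

Running this once per entry in ValidLangCodes gives one app list per market, which matches how I store them.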

One of the mistakes I made was having ZestAppData.Updated be a DateTime and not a Date.  I only crawl once per day, so I don’t need the extra time-of-day data.  The Zest feeds update at least daily; I think it is every couple of hours.
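
If the goal is one value per crawl day, one option is to drop the time of day at parse time.  The sketch below assumes the feed timestamps are ISO 8601 strings in UTC, which is an assumption about the feed rather than something shown above.  CrawlDates and DayOnly are hypothetical names.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: reduces a feed timestamp to its calendar day.
public static class CrawlDates
{
    // Parses the timestamp as UTC and returns midnight of that day.
    public static DateTime DayOnly(string feedTimestamp)
    {
        DateTime parsed = DateTime.Parse(
            feedTimestamp,
            CultureInfo.InvariantCulture,
            DateTimeStyles.AdjustToUniversal | DateTimeStyles.AssumeUniversal);

        return parsed.Date;
    }
}
```

Two timestamps from the same UTC day then compare as equal, which makes once-a-day snapshots easier to join and group.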

