TV Show Analysis

What makes a successful TV show?

This analysis and model attempts to determine the factors that influence high or low IMDb ratings for TV shows. All genres are examined, and while most shows originate in the United States, a few from the UK and elsewhere are included.

Two separate models are developed. In both, the top- and bottom-rated shows are classified as winners and losers, respectively, and an array of 12 classifiers is applied using cross-validation to identify the best-performing model. 25% of the data was reserved as a test set, and both cross-validation scores and test-set scores are shown in the tables below. The baseline score is 0.52.
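
As a rough sketch of this evaluation scheme (assuming scikit-learn; the synthetic features, labels, and the single classifier below are placeholders rather than the actual data or the full set of 12 models):

# Minimal sketch of the evaluation setup: hold out 25% as a test set, score the rest with cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 5)           # placeholder feature matrix
y = np.random.randint(0, 2, 200)     # placeholder winner (1) / loser (0) labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
clf = LogisticRegression()           # stands in for each of the 12 classifiers tried
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
clf.fit(X_train, y_train)
print(cv_scores.mean(), clf.score(X_test, y_test))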

The first utilizes natural language processing (NLP) on the IMDb summary descriptions of each show. Term Frequency-Inverse Document Frequency (TF-IDF) and count vectorization were applied to n-grams of size 2-4. With both vectorization techniques, Random Forest and Naive Bayes classifiers were the most successful, and the highest score of 0.642 was achieved using TF-IDF vectorization with a Multinomial Naive Bayes classifier. The n-grams with the highest cumulative scores are identified as the most significant factors in the model, giving a clue as to which words in a summary description foretell a show’s likelihood of being a winner or a loser.
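
A minimal sketch of this NLP pipeline, assuming scikit-learn; the two toy summaries and labels are illustrative only and not drawn from the actual data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# TF-IDF on 2-4 word n-grams feeding a Multinomial Naive Bayes classifier
summaries = ["a crime drama set in new york city",
             "a reality series about teens and pop culture"]
labels = [1, 0]   # 1 = winner, 0 = loser

nlp_model = make_pipeline(TfidfVectorizer(ngram_range=(2, 4)), MultinomialNB())
nlp_model.fit(summaries, labels)
print(nlp_model.predict(["a drama about a new york detective"]))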

A second model, using factors such as genre, length, schedule times, network and format, was also built, and the same set of 12 classifiers was applied. The AdaBoost classifier achieved the top score of 0.92 using these factors.
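
A minimal sketch of this second model, assuming scikit-learn and pandas; the tiny frame of factors below is illustrative, not the real feature set:

import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

# one-hot encode categorical factors such as genre, network and format, then fit AdaBoost
factors = pd.DataFrame({
    "genre":   ["Drama", "Reality", "Crime", "Game"],
    "network": ["HBO", "MTV", "BBC", "E!"],
    "format":  ["Scripted", "Reality", "Scripted", "Game"],
    "runtime": [60, 30, 60, 30],
})
labels = [1, 0, 1, 0]   # 1 = winner, 0 = loser

X = pd.get_dummies(factors, columns=["genre", "network", "format"])
ada = AdaBoostClassifier(n_estimators=100)
ada.fit(X, labels)
print(ada.feature_importances_)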

Data collection and cleanup were tedious, involving multiple rounds of web scraping IMDb pages for show ratings and then using the TVmaze API to return show details. Unless interested in these details, the reader is encouraged to skip to the section titled “Modeling Section” a bit more than halfway through this notebook.

Results Summary:

From the NLP models, it appears that shows featuring adult characters in crime and drama series, set in New York in times before or after the present, will fare better than reality or animated series featuring children or teens and highlighting pop culture.

The model on factors other than the summary showed similar tendencies. Reality formats were the strongest negative factor in predicting success, while the scripted format was the strongest positive predictor. Game and talk shows were negative predictors, while crime, science fiction, comedy, drama and documentaries were positive ones. Shows aired by HBO and the BBC predicted success, while the lower-rated shows were found predominantly on MTV, E!, Comedy Central and Lifetime.

Though interesting associations have been found, it must be said that nothing in the techniques used here can be interpreted as causality. For example, it cannot be said that reality shows featuring teenagers will always flop. This report is based on initial efforts to determine factors that may influence a show’s success, and it points to paths for future, more detailed modeling. Suggested future paths include textual analysis of critics’ reviews, analysis based on cast or producers, analysis of differences in rating based on audience demographics, and a more detailed look at the connection between genre and the type/style of show.

# Import libraries needed for scraping and saving results.  
# Additional libraries needed for modeling, analysis and display will be imported when needed.

import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle

Data Acquisition: List of top rated TV Shows

# Retrieve current top 250 TV shows webpage

url = "http://www.imdb.com/chart/toptv/"
r = requests.get(url)
html = r.text
html[0:200]
u'\n\n\n\n<!DOCTYPE html>\n<html\nxmlns:og="http://ogp.me/ns#"\nxmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=ed'
# Use Beautiful Soup to extract the IMDb numbers from the webpage
soup = BeautifulSoup(html, "lxml")
# Scrape the IMDb numbers for the 250 top rated shows

show_list = []
for tbody in soup.findAll('tbody', class_='lister-list'):
    for title in tbody.findAll('td', class_='titleColumn'):
        show_list.append(str(title.findAll('a')).split("/")[2])

show_list

['tt5491994',
 'tt0185906',
 'tt0795176',
 'tt0944947',
 'tt0903747',
 'tt0306414',
 'tt2861424',
 'tt2395695',
 'tt0081846',
 'tt0071075',
 'tt0141842',
 'tt1475582',
 'tt1533395',
 'tt0417299',
 'tt0098769',
 'tt1806234',
 'tt0303461',
 'tt0092337',
 'tt0052520',
 'tt3530232',
 'tt2356777',
 'tt1355642',
 'tt2802850',
 'tt0103359',
 'tt0296310',
 'tt0877057',
 'tt4508902',
 'tt0475784',
 'tt2092588',
 'tt0213338',
 'tt1856010',
 'tt0063929',
 'tt0112130',
 'tt2571774',
 'tt0081834',
 'tt0367279',
 'tt4742876',
 'tt4574334',
 'tt2085059',
 'tt0108778',
 'tt0098904',
 'tt3718778',
 'tt0081912',
 'tt0098936',
 'tt1518542',
 'tt0074006',
 'tt2707408',
 'tt0193676',
 'tt1865718',
 'tt0096548',
 'tt0072500',
 'tt0384766',
 'tt0118421',
 'tt0096697',
 'tt0090509',
 'tt0121955',
 'tt0386676',
 'tt4299972',
 'tt2560140',
 'tt0472954',
 'tt0412142',
 'tt0214341',
 'tt5555260',
 'tt2442560',
 'tt5712554',
 'tt0200276',
 'tt0353049',
 'tt1910272',
 'tt0086661',
 'tt0248654',
 'tt5189670',
 'tt0121220',
 'tt1486217',
 'tt0096639',
 'tt0120570',
 'tt4786824',
 'tt1628033',
 'tt0348914',
 'tt0403778',
 'tt5288312',
 'tt0459159',
 'tt3032476',
 'tt0407362',
 'tt4093826',
 'tt0773262',
 'tt0417349',
 'tt3322312',
 'tt0264235',
 'tt0106179',
 'tt0286486',
 'tt2297757',
 'tt0088484',
 'tt2098220',
 'tt5425186',
 'tt0318871',
 'tt0094517',
 'tt0436992',
 'tt1586680',
 'tt0092324',
 'tt0994314',
 'tt0203082',
 'tt1606375',
 'tt0380136',
 'tt0187664',
 'tt1513168',
 'tt0118273',
 'tt0421357',
 'tt1641384',
 'tt0314979',
 'tt5834204',
 'tt0092455',
 'tt0115147',
 'tt4295140',
 'tt0080306',
 'tt1266020',
 'tt1831164',
 'tt3920596',
 'tt0804503',
 'tt1492966',
 'tt0053488',
 'tt0086831',
 'tt0758745',
 'tt0995832',
 'tt0434706',
 'tt2401256',
 'tt0423731',
 'tt0111958',
 'tt0863046',
 'tt1733785',
 'tt2049116',
 'tt0275137',
 'tt1305826',
 'tt0472027',
 'tt2100976',
 'tt1489428',
 'tt0112159',
 'tt4158110',
 'tt1227926',
 'tt1870479',
 'tt0979432',
 'tt0106028',
 'tt0387764',
 'tt0237123',
 'tt0047708',
 'tt0088509',
 'tt0290978',
 'tt1984119',
 'tt0098825',
 'tt2306299',
 'tt0280249',
 'tt3647998',
 'tt0094525',
 'tt0163507',
 'tt0118266',
 'tt0182629',
 'tt0080297',
 'tt0061287',
 'tt1758429',
 'tt3671754',
 'tt0487831',
 'tt0388629',
 'tt2575988',
 'tt4189022',
 'tt0458254',
 'tt2788432',
 'tt0096657',
 'tt0346314',
 'tt1474684',
 'tt4288182',
 'tt0417373',
 'tt1298820',
 'tt0262150',
 'tt1695360',
 'tt1230180',
 'tt2243973',
 'tt0129690',
 'tt1632701',
 'tt2433738',
 'tt0149460',
 'tt1124373',
 'tt0075520',
 'tt1795096',
 'tt1442449',
 'tt5249462',
 'tt2937900',
 'tt1439629',
 'tt5071412',
 'tt0397150',
 'tt0083466',
 'tt2701582',
 'tt5114356',
 'tt4156586',
 'tt0319969',
 'tt0103584',
 'tt0302199',
 'tt0070644',
 'tt1883092',
 'tt2311418',
 'tt3428912',
 'tt1442437',
 'tt0362192',
 'tt0278238',
 'tt0387199',
 'tt2384811',
 'tt0098833',
 'tt0074028',
 'tt2303687',
 'tt0807832',
 'tt0056751',
 'tt0173528',
 'tt3358020',
 'tt0103466',
 'tt1526318',
 'tt0185133',
 'tt0075572',
 'tt0112084',
 'tt1837492',
 'tt2919910',
 'tt1299368',
 'tt0094535',
 'tt1520211',
 'tt0108906',
 'tt0988824',
 'tt5421602',
 'tt5853176',
 'tt0934320',
 'tt0337898',
 'tt0495212',
 'tt0460681',
 'tt2407574',
 'tt0290988',
 'tt1598754',
 'tt1119644',
 'tt1220617',
 'tt3398228',
 'tt0411008',
 'tt0163503',
 'tt2249364',
 'tt1409055',
 'tt4270492',
 'tt0060028',
 'tt0118480',
 'tt0925266',
 'tt3012698',
 'tt0402711',
 'tt0068098',
 'tt0442632',
 'tt1839578',
 'tt0043208',
 'tt5673782']
# This code has been executed, and the results pickled and stored locally, so there is no need to run these requests
# to the API again. The API endpoint to look up a show by its IMDb number is
# http://api.tvmaze.com/lookup/shows?imdb=<show imdb identifier>

DO_NOT_RUN = True     # Do not run when notebook is loaded to avoid unnecessary calls to the API

if not DO_NOT_RUN:
    shows = pd.DataFrame()
    for show_id in show_list:
            try:
                print show_id
                # Get the tv show info from the api
                url = "http://api.tvmaze.com/lookup/shows?imdb=" + show_id
                r = requests.get(url)

                # convert the return data to a dictionary
                json_data = r.json()

                # load a temp dataframe with the dictionary, then append to the composite dataframe
                temp_df = pd.DataFrame.from_dict(json_data, orient='index', dtype=None)
                ttemp_df = temp_df.T     # Was not able to load json in column orientation, so must transpose
                shows = shows.append(ttemp_df, ignore_index=True)
            except: 
                print show_id, " could not be retrieved from api"

    shows.head()    


# write the contents of an object to a file for later retrieval

DO_NOT_RUN = True   # Be sure to check the file name to write before enabling execution on this block

if not DO_NOT_RUN:
    pickle.dump( shows, open( "save_shows_df.p", "wb" ) )

Get list of bottom rated TV Series

# This code block was changed multiple times to pull HTML with different sets of low-rated shows.
# Ultimately about 1200 IMDb ids were scraped, and about 1/3 of those could be pulled from the TVmaze API.

url ="http://www.imdb.com/search/title?count=600&languages=en&title_type=tv_series&user_rating=3.4,5.0&sort=user_rating,asc"
r = requests.get(url)
html = r.text
html[0:200]
u'\n\n\n\n<!DOCTYPE html>\n<html\nxmlns:og="http://ogp.me/ns#"\nxmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=ed'
# Use Beautiful Soup to extract the IMDb numbers from the webpage
soup = BeautifulSoup(html, "lxml")

loser_list = []
for div in soup.findAll('div', class_='lister-list'):
    for h3 in div.findAll('h3', class_='lister-item-header'):
        loser_list.append(str(h3.findAll('a')).split("/")[2])

loser_list

['tt0773264',
 'tt1798695',
 'tt1307083',
 'tt4845734',
 'tt0046641',
 'tt1519575',
 'tt0853078',
 'tt0118423',
 'tt0284767',
 'tt4052124',
 'tt0878801',
 'tt3703500',
 'tt1105170',
 'tt4363582',
 'tt3155428',
 'tt0362350',
 'tt0287196',
 'tt2766052',
 'tt0405545',
 'tt0262975',
 'tt0367278',
 'tt7134262',
 'tt1695352',
 'tt0421470',
 'tt2466890',
 'tt0343305',
 'tt1002739',
 'tt1615697',
 'tt0274262',
 'tt0465320',
 'tt1388381',
 'tt0358889',
 'tt1085789',
 'tt1011591',
 'tt0364804',
 'tt1489335',
 'tt3612584',
 'tt0363377',
 'tt0111930',
 'tt0401913',
 'tt0808086',
 'tt0309212',
 'tt5464192',
 'tt0080250',
 'tt4533338',
 'tt4741696',
 'tt1922810',
 'tt1793868',
 'tt4789316',
 'tt0185054',
 'tt1079622',
 'tt1786048',
 'tt0790508',
 'tt1716372',
 'tt0295098',
 'tt3409706',
 'tt0222574',
 'tt2171325',
 'tt0442643',
 'tt2142117',
 'tt0371433',
 'tt0138244',
 'tt1002010',
 'tt0495557',
 'tt1811817',
 'tt5529996',
 'tt1352053',
 'tt0439346',
 'tt0940147',
 'tt3075138',
 'tt1974439',
 'tt2693842',
 'tt0092325',
 'tt6772826',
 'tt1563069',
 'tt0489598',
 'tt0142055',
 'tt1566154',
 'tt0338592',
 'tt0167515',
 'tt2330327',
 'tt1576464',
 'tt2389845',
 'tt0186747',
 'tt0355096',
 'tt1821877',
 'tt0112033',
 'tt1792654',
 'tt0472243',
 'tt6453018',
 'tt3648886',
 'tt1599374',
 'tt2946482',
 'tt4672020',
 'tt1016283',
 'tt2649480',
 'tt1229945',
 'tt2390606',
 'tt1876612',
 'tt0140732',
 'tt1176156',
 'tt0158522',
 'tt4922726',
 'tt0068104',
 'tt2798842',
 'tt1150627',
 'tt1545453',
 'tt3685566',
 'tt0287223',
 'tt4185510',
 'tt0329912',
 'tt0289808',
 'tt0358849',
 'tt2320439',
 'tt0906840',
 'tt0800281',
 'tt1103082',
 'tt2416362',
 'tt3493906',
 'tt0381827',
 'tt0817553',
 'tt0252172',
 'tt0799872',
 'tt0816224',
 'tt1077162',
 'tt1918005',
 'tt1240983',
 'tt1415000',
 'tt5039916',
 'tt0451467',
 'tt0296438',
 'tt1159990',
 'tt0144701',
 'tt4718304',
 'tt1095213',
 'tt1453090',
 'tt0168372',
 'tt0425725',
 'tt3300126',
 'tt1415098',
 'tt5459976',
 'tt4041694',
 'tt2322264',
 'tt1441005',
 'tt1117549',
 'tt0365991',
 'tt0364807',
 'tt1591375',
 'tt3562462',
 'tt6118186',
 'tt3587176',
 'tt1372127',
 'tt0445865',
 'tt2088493',
 'tt4658248',
 'tt0103444',
 'tt4956964',
 'tt1326185',
 'tt0406422',
 'tt1973659',
 'tt1578933',
 'tt0446621',
 'tt1850624',
 'tt0159177',
 'tt0490539',
 'tt0306398',
 'tt0288922',
 'tt0465336',
 'tt0176397',
 'tt1641939',
 'tt0498879',
 'tt0306296',
 'tt1394277',
 'tt0398416',
 'tt2849552',
 'tt1433566',
 'tt0806893',
 'tt3252890',
 'tt3774098',
 'tt0791275',
 'tt5690224',
 'tt0361181',
 'tt0486953',
 'tt1514319',
 'tt3697290',
 'tt1342752',
 'tt0478936',
 'tt0094448',
 'tt0795101',
 'tt1340759',
 'tt0840061',
 'tt1151434',
 'tt0281429',
 'tt0845745',
 'tt2993514',
 'tt0783634',
 'tt1650352',
 'tt1249256',
 'tt2135766',
 'tt3231114',
 'tt1702421',
 'tt2940494',
 'tt6664486',
 'tt0081857',
 'tt1319598',
 'tt0247094',
 'tt6392176',
 'tt0320969',
 'tt2720144',
 'tt0360266',
 'tt2287380',
 'tt1715368',
 'tt0282291',
 'tt2248736',
 'tt2010634',
 'tt1489432',
 'tt4855578',
 'tt1721484',
 'tt0380850',
 'tt3084090',
 'tt2392683',
 'tt1381004',
 'tt1628058',
 'tt2935638',
 'tt1837169',
 'tt2404111',
 'tt2364381',
 'tt0888095',
 'tt2352123',
 'tt1013862',
 'tt4295320',
 'tt1249227',
 'tt1879603',
 'tt0167566',
 'tt0924528',
 'tt0361144',
 'tt0133300',
 'tt5888698',
 'tt1468817',
 'tt4006060',
 'tt0106096',
 'tt0287243',
 'tt1287376',
 'tt0060032',
 'tt1535270',
 'tt4831262',
 'tt0416397',
 'tt1546138',
 'tt2203971',
 'tt0214353',
 'tt0368518',
 'tt0382506',
 'tt5317980',
 'tt2313839',
 'tt1202295',
 'tt4146118',
 'tt1226448',
 'tt0403748',
 'tt0415448',
 'tt4665932',
 'tt3016956',
 'tt1412249',
 'tt1829773',
 'tt0872053',
 'tt0481443',
 'tt0493098',
 'tt0039120',
 'tt1411598',
 'tt0106123',
 'tt1740718',
 'tt0362153',
 'tt1637756',
 'tt0120974',
 'tt2328067',
 'tt0057741',
 'tt1261356',
 'tt2559390',
 'tt0083433',
 'tt0380934',
 'tt4388486',
 'tt0108821',
 'tt0115338',
 'tt0167735',
 'tt0460630',
 'tt2330453',
 'tt0398429',
 'tt0294140',
 'tt0804423',
 'tt2191952',
 'tt1118131',
 'tt4016700',
 'tt5786580',
 'tt0950199',
 'tt1760165',
 'tt4896654',
 'tt0414719',
 'tt1675974',
 'tt0465343',
 'tt1477137',
 'tt0115171',
 'tt3565412',
 'tt0382458',
 'tt0945153',
 'tt0199278',
 'tt1353293',
 'tt1426343',
 'tt2180165',
 'tt5117094',
 'tt1191039',
 'tt0497857',
 'tt0780409',
 'tt2670950',
 'tt1385183',
 'tt3396736',
 'tt2563482',
 'tt4094138',
 'tt0295065',
 'tt1696268',
 'tt0891053',
 'tt0914267',
 'tt1786018',
 'tt1988479',
 'tt1707814',
 'tt1595853',
 'tt2310444',
 'tt5434894',
 'tt0267216',
 'tt0855313',
 'tt1832828',
 'tt0426685',
 'tt2309561',
 'tt2486556',
 'tt0284786',
 'tt3136814',
 'tt1989818',
 'tt1179310',
 'tt0424748',
 'tt1126298',
 'tt0944946',
 'tt1882639',
 'tt0439904',
 'tt0875887',
 'tt1624991',
 'tt2747670',
 'tt2324247',
 'tt0403810',
 'tt1724452',
 'tt2366252',
 'tt3752894',
 'tt0198211',
 'tt1491318',
 'tt1666205',
 'tt2460474',
 'tt0303435',
 'tt0453329',
 'tt0220938',
 'tt0299264',
 'tt0783341',
 'tt0850175',
 'tt1191056',
 'tt0235917',
 'tt0111892',
 'tt0166442',
 'tt2643770',
 'tt5633924',
 'tt0075485',
 'tt0423657',
 'tt5327970',
 'tt3326032',
 'tt5785658',
 'tt2190731',
 'tt0101041',
 'tt3317020',
 'tt4732076',
 'tt2305717',
 'tt3828162',
 'tt0890935',
 'tt0449460',
 'tt0126175',
 'tt3601886',
 'tt5062878',
 'tt1579911',
 'tt0407354',
 'tt6723012',
 'tt5819414',
 'tt4180738',
 'tt0300802',
 'tt2649738',
 'tt3181412',
 'tt0382400',
 'tt3189040',
 'tt0324919',
 'tt2168240',
 'tt2560966',
 'tt0168373',
 'tt0403824',
 'tt0375440',
 'tt3746054',
 'tt2488150',
 'tt4081326',
 'tt5011838',
 'tt2644204',
 'tt1210781',
 'tt0246359',
 'tt0048898',
 'tt3398108',
 'tt5701572',
 'tt0426827',
 'tt0425714',
 'tt1252620',
 'tt0800289',
 'tt0111991',
 'tt0479847',
 'tt2429392',
 'tt2901828',
 'tt4147072',
 'tt1442411',
 'tt2093677',
 'tt0498421',
 'tt3006666',
 'tt3017190',
 'tt0193680',
 'tt5952954',
 'tt0381759',
 'tt2539740',
 'tt0369176',
 'tt3016990',
 'tt0328787',
 'tt2197994',
 'tt0478753',
 'tt4530152',
 'tt0372643',
 'tt5693024',
 'tt0855669',
 'tt1263594',
 'tt5935350',
 'tt1589855',
 'tt0367444',
 'tt3384116',
 'tt3790338',
 'tt2007260',
 'tt0343300',
 'tt0813904',
 'tt0883849',
 'tt0433296',
 'tt1342705',
 'tt0444988',
 'tt1333495',
 'tt0969661',
 'tt0272967',
 'tt0283184',
 'tt0444577',
 'tt3064496',
 'tt0436996',
 'tt1796788',
 'tt1879997',
 'tt4800624',
 'tt0497079',
 'tt1755893',
 'tt0329824',
 'tt2245937',
 'tt2147632',
 'tt3218114',
 'tt1583417',
 'tt0367403',
 'tt1963853',
 'tt4854900',
 'tt6415490',
 'tt1520150',
 'tt0236907',
 'tt6672370',
 'tt1055136',
 'tt5865052',
 'tt1231448',
 'tt6315022',
 'tt4351710',
 'tt4346344',
 'tt6043450',
 'tt0096605',
 'tt1181712',
 'tt0182623',
 'tt0307719',
 'tt1056344',
 'tt0328795',
 'tt0098916',
 'tt1584617',
 'tt2354136',
 'tt4287478',
 'tt0426347',
 'tt1874006',
 'tt2006560',
 'tt1694893',
 'tt2338766',
 'tt0843808',
 'tt0115155',
 'tt4354068',
 'tt1134663',
 'tt0495787',
 'tt0088539',
 'tt5426274',
 'tt1797127',
 'tt5763656',
 'tt0360301',
 'tt4245504',
 'tt0318214',
 'tt0080254',
 'tt1430135',
 'tt0892562',
 'tt2603010',
 'tt1038918',
 'tt0390746',
 'tt3773682',
 'tt0969372',
 'tt1470839',
 'tt1477822',
 'tt1056446',
 'tt0340474',
 'tt5104198',
 'tt2815184',
 'tt0468998',
 'tt0772146',
 'tt3920816',
 'tt3654000',
 'tt1753229',
 'tt0865687',
 'tt0459631',
 'tt1314665',
 'tt4660152',
 'tt0086685',
 'tt0150323',
 'tt0338576',
 'tt2118185',
 'tt0198086',
 'tt0412184',
 'tt4420148',
 'tt0497853',
 'tt1240534',
 'tt2479832',
 'tt0174195',
 'tt1999642',
 'tt1155579',
 'tt1640376',
 'tt1227586',
 'tt3784176',
 'tt1958848',
 'tt2778982',
 'tt1273636',
 'tt0357357',
 'tt1287301',
 'tt0852784',
 'tt0482432',
 'tt1651941',
 'tt0043235',
 'tt2110603',
 'tt1178184',
 'tt0846757',
 'tt0170959',
 'tt0413617',
 'tt1726890',
 'tt0220874',
 'tt0859872',
 'tt4219276',
 'tt0327268',
 'tt0843319',
 'tt3131346',
 'tt0795072',
 'tt5650560',
 'tt0827847',
 'tt1525767',
 'tt1043913',
 'tt0266179',
 'tt0413558',
 'tt0307714',
 'tt4693416',
 'tt0409619',
 'tt5684430',
 'tt0134269',
 'tt5486088',
 'tt1252370',
 'tt6370626',
 'tt3824018',
 'tt2555880',
 'tt3310544',
 'tt2125758',
 'tt1973047',
 'tt6748366',
 'tt0106113',
 'tt0934701',
 'tt2059031',
 'tt0088598',
 'tt1056536',
 'tt1618950',
 'tt6987940',
 'tt5915978',
 'tt0106008',
 'tt0115206',
 'tt0120992',
 'tt4575056',
 'tt2889104',
 'tt0428169']
len(loser_list)
600
# first_loser_list = loser_list
# This code has been executed, and the results pickled and stored locally, so no need to run these requests
# to the API again

DO_NOT_RUN = True

if not DO_NOT_RUN:
    losers = pd.DataFrame()
    for loser_id in loser_list:
            try:
                print loser_id
                # Get the tv show info from the api
                url = "http://api.tvmaze.com/lookup/shows?imdb=" + loser_id
                r = requests.get(url)

                # convert the return data to a dictionary
                json_data = r.json()

                # load a temp dataframe with the dictionary, then append to the composite dataframe
                temp_df = pd.DataFrame.from_dict(json_data, orient='index', dtype=None)
                ttemp_df = temp_df.T     # Was not able to load json in column orientation, so must transpose
                losers = losers.append(ttemp_df, ignore_index=True)
            except: 
                print loser_id, " could not be retrieved from api"

    losers.head()    


tt0465347
tt0465347  could not be retrieved from api
tt4427122
tt4427122  could not be retrieved from api
tt1015682
tt1015682  could not be retrieved from api
tt2505738
tt2505738  could not be retrieved from api
tt2402465
tt2402465  could not be retrieved from api
tt0278236
tt0278236  could not be retrieved from api
tt0268066
tt0268066  could not be retrieved from api
tt4813760
tt4813760  could not be retrieved from api
tt1526001
tt1526001  could not be retrieved from api
tt1243976
tt1243976  could not be retrieved from api
tt2058498
tt3897284
tt3897284  could not be retrieved from api
tt3665690
tt3665690  could not be retrieved from api
tt4132180
tt4132180  could not be retrieved from api
tt0824229
tt0824229  could not be retrieved from api
tt0314990
tt0314990  could not be retrieved from api
tt5423750
tt5423750  could not be retrieved from api
tt5423664
tt5423664  could not be retrieved from api
tt2175125
tt2175125  could not be retrieved from api
tt0404593
tt0404593  could not be retrieved from api
tt4160422
tt4160422  could not be retrieved from api
tt4552562
tt4552562  could not be retrieved from api
tt5804854
tt5804854  could not be retrieved from api
tt0886666
tt0886666  could not be retrieved from api
tt5423824
tt5423824  could not be retrieved from api
tt3500210
tt3500210  could not be retrieved from api
tt0285357
tt0285357  could not be retrieved from api
tt0280234
tt0280234  could not be retrieved from api
tt1863530
tt1863530  could not be retrieved from api
tt0280349
tt0280349  could not be retrieved from api
tt2660922
tt2660922  could not be retrieved from api
tt0292776
tt0292776  could not be retrieved from api
tt4566242
tt0264230
tt0264230  could not be retrieved from api
tt1102523
tt1102523  could not be retrieved from api
tt3333790
tt3333790  could not be retrieved from api
tt0320863
tt0320863  could not be retrieved from api
tt0830848
tt0830848  could not be retrieved from api
tt0939270
tt0939270  could not be retrieved from api
tt1459294
tt1459294  could not be retrieved from api
tt6026132
tt6026132  could not be retrieved from api
tt1443593
tt1443593  could not be retrieved from api
tt0354267
tt0354267  could not be retrieved from api
tt0147749
tt0147749  could not be retrieved from api
tt0161180
tt0161180  could not be retrieved from api
tt4733812
tt4733812  could not be retrieved from api
tt0367362
tt0367362  could not be retrieved from api
tt5626868
tt5626868  could not be retrieved from api
tt7268752
tt7268752  could not be retrieved from api
tt1364951
tt2341819
tt0464767
tt0464767  could not be retrieved from api
tt3550770
tt3550770  could not be retrieved from api
tt6422012
tt6422012  could not be retrieved from api
tt3154248
tt3154248  could not be retrieved from api
tt5016274
tt5016274  could not be retrieved from api
tt1715229
tt1715229  could not be retrieved from api
tt0489426
tt0489426  could not be retrieved from api
tt5798754
tt5798754  could not be retrieved from api
tt2022182
tt2022182  could not be retrieved from api
tt0303564
tt0303564  could not be retrieved from api
tt3462252
tt3462252  could not be retrieved from api
tt0329849
tt0329849  could not be retrieved from api
tt5074180
tt5074180  could not be retrieved from api
tt3900878
tt3900878  could not be retrieved from api
tt3887402
tt3887402  could not be retrieved from api
tt1893088
tt0445890
tt0149408
tt0149408  could not be retrieved from api
tt1360544
tt1360544  could not be retrieved from api
tt1718355
tt1718355  could not be retrieved from api
tt2364950
tt2364950  could not be retrieved from api
tt2279571
tt0285374
tt0285374  could not be retrieved from api
tt5267590
tt5267590  could not be retrieved from api
tt0314993
tt0314993  could not be retrieved from api
tt0300870
tt0300870  could not be retrieved from api
tt7036530
tt7036530  could not be retrieved from api
tt5657014
tt5657014  could not be retrieved from api
tt0149488
tt0149488  could not be retrieved from api
tt1204865
tt1204865  could not be retrieved from api
tt1182860
tt1182860  could not be retrieved from api
tt0423626
tt0423626  could not be retrieved from api
tt4223864
tt4223864  could not be retrieved from api
tt1773440
tt1773440  could not be retrieved from api
tt0872067
tt0872067  could not be retrieved from api
tt0428172
tt0428172  could not be retrieved from api
tt0817379
tt0817379  could not be retrieved from api
tt1210720
tt1210720  could not be retrieved from api
tt3855028
tt3855028  could not be retrieved from api
tt1611594
tt1611594  could not be retrieved from api
tt5822004
tt5822004  could not be retrieved from api
tt6524930
tt6524930  could not be retrieved from api
tt1733734
tt1902032
tt1902032  could not be retrieved from api
tt0466201
tt0466201  could not be retrieved from api
tt1757293
tt1757293  could not be retrieved from api
tt1807575
tt1807575  could not be retrieved from api
tt0332896
tt0332896  could not be retrieved from api
tt3140278
tt3140278  could not be retrieved from api
tt1176297
tt1176297  could not be retrieved from api
tt0285406
tt0285406  could not be retrieved from api
tt6680212
tt6680212  could not be retrieved from api
tt0200336
tt0200336  could not be retrieved from api
tt0385483
tt0385483  could not be retrieved from api
tt3534894
tt3534894  could not be retrieved from api
tt1108281
tt1108281  could not be retrieved from api
tt3855016
tt3855016  could not be retrieved from api
tt0787948
tt0787948  could not be retrieved from api
tt1372153
tt1292967
tt1292967  could not be retrieved from api
tt1466565
tt1466565  could not be retrieved from api
tt0435565
tt0435565  could not be retrieved from api
tt1817054
tt2879822
tt1229266
tt1229266  could not be retrieved from api
tt0364837
tt0364837  could not be retrieved from api
tt0477409
tt0477409  could not be retrieved from api
tt0875097
tt0875097  could not be retrieved from api
tt1227542
tt1227542  could not be retrieved from api
tt1131289
tt1131289  could not be retrieved from api
tt0355135
tt0355135  could not be retrieved from api
tt1418598
tt0290970
tt0290970  could not be retrieved from api
tt0184124
tt0184124  could not be retrieved from api
tt0490736
tt0490736  could not be retrieved from api
tt0439354
tt0439354  could not be retrieved from api
tt1157935
tt1157935  could not be retrieved from api
tt1425641
tt1425641  could not be retrieved from api
tt2830404
tt2830404  could not be retrieved from api
tt0835397
tt0835397  could not be retrieved from api
tt0880581
tt0880581  could not be retrieved from api
tt1078463
tt1078463  could not be retrieved from api
tt0190177
tt1234506
tt1234506  could not be retrieved from api
tt0323463
tt0323463  could not be retrieved from api
tt5047510
tt5338860
tt5168468
tt5168468  could not be retrieved from api
tt0296322
tt0296322  could not be retrieved from api
tt3911254
tt3911254  could not be retrieved from api
tt3827516
tt3827516  could not be retrieved from api
tt0364899
tt0364899  could not be retrieved from api
tt4204032
tt4204032  could not be retrieved from api
tt0259768
tt0259768  could not be retrieved from api
tt0287880
tt0287880  could not be retrieved from api
tt0270763
tt0270763  could not be retrieved from api
tt0846349
tt0846349  could not be retrieved from api
tt2699648
tt2699648  could not be retrieved from api
tt3616368
tt3616368  could not be retrieved from api
tt2672920
tt2672920  could not be retrieved from api
tt1848281
tt0813074
tt0813074  could not be retrieved from api
tt1694422
tt1694422  could not be retrieved from api
tt0472241
tt0472241  could not be retrieved from api
tt0202186
tt0202186  could not be retrieved from api
tt1297366
tt1297366  could not be retrieved from api
tt3919918
tt3919918  could not be retrieved from api
tt1564985
tt1564985  could not be retrieved from api
tt3336800
tt3336800  could not be retrieved from api
tt6839504
tt2114184
tt2254454
tt2254454  could not be retrieved from api
tt1674023
tt0824737
tt0824737  could not be retrieved from api
tt1288431
tt1288431  could not be retrieved from api
tt1705811
tt1705811  could not be retrieved from api
tt0968726
tt0968726  could not be retrieved from api
tt2058840
tt2058840  could not be retrieved from api
tt1971860
tt3857708
tt3857708  could not be retrieved from api
tt0315030
tt0315030  could not be retrieved from api
tt2337185
tt2337185  could not be retrieved from api
tt0775356
tt0775356  could not be retrieved from api
tt0244356
tt0244356  could not be retrieved from api
tt2338400
tt2338400  could not be retrieved from api
tt0220047
tt0220047  could not be retrieved from api
tt0341789
tt0341789  could not be retrieved from api
tt0197151
tt0197151  could not be retrieved from api
tt0222529
tt0222529  could not be retrieved from api
tt6086050
tt6086050  could not be retrieved from api
tt3100634
tt1625263
tt1625263  could not be retrieved from api
tt2289244
tt2289244  could not be retrieved from api
tt1936732
tt0278229
tt0278229  could not be retrieved from api
tt0429438
tt0429438  could not be retrieved from api
tt1410490
tt1410490  could not be retrieved from api
tt5588910
tt5588910  could not be retrieved from api
tt3670858
tt3670858  could not be retrieved from api
tt1197582
tt0397182
tt0397182  could not be retrieved from api
tt1911975
tt1911975  could not be retrieved from api
tt0420366
tt0420366  could not be retrieved from api
tt3079034
tt3079034  could not be retrieved from api
tt0859270
tt0859270  could not be retrieved from api
tt0050070
tt0050070  could not be retrieved from api
tt0300798
tt0300798  could not be retrieved from api
tt5915502
tt5915502  could not be retrieved from api
tt6697244
tt6697244  could not be retrieved from api
tt1776388
tt1776388  could not be retrieved from api
tt0424639
tt0424639  could not be retrieved from api
tt1119204
tt1119204  could not be retrieved from api
tt1744868
tt1744868  could not be retrieved from api
tt1588824
tt1588824  could not be retrieved from api
tt1485389
tt3696798
tt3696798  could not be retrieved from api
tt0301123
tt0301123  could not be retrieved from api
tt1018436
tt1018436  could not be retrieved from api
tt0815776
tt0815776  could not be retrieved from api
tt0407462
tt0407462  could not be retrieved from api
tt0198147
tt0198147  could not be retrieved from api
tt0997412
tt0997412  could not be retrieved from api
tt2288050
tt1612920
tt0402701
tt5047494
tt5047494  could not be retrieved from api
tt5368216
tt5368216  could not be retrieved from api
tt3356610
tt3356610  could not be retrieved from api
tt0491735
tt1454750
tt1454750  could not be retrieved from api
tt5891726
tt5891726  could not be retrieved from api
tt2369946
tt4286824
tt4286824  could not be retrieved from api
tt0476926
tt0476926  could not be retrieved from api
tt5167034
tt5167034  could not be retrieved from api
tt0056759
tt0056759  could not be retrieved from api
tt3622818
tt3622818  could not be retrieved from api
tt0887788
tt0887788  could not be retrieved from api
tt4588620
tt4588620  could not be retrieved from api
tt0258341
tt0258341  could not be retrieved from api
tt0489430
tt0489430  could not be retrieved from api
tt2567210
tt2567210  could not be retrieved from api
tt0990403
tt4674178
tt4674178  could not be retrieved from api
tt0125638
tt0125638  could not be retrieved from api
tt5146640
tt5146640  could not be retrieved from api
tt0196284
tt0196284  could not be retrieved from api
tt3075154
tt3075154  could not be retrieved from api
tt0436003
tt0436003  could not be retrieved from api
tt1538090
tt1538090  could not be retrieved from api
tt1728226
tt1728226  could not be retrieved from api
tt3796070
tt3796070  could not be retrieved from api
tt1381395
tt1381395  could not be retrieved from api
tt0190199
tt0190199  could not be retrieved from api
tt0855213
tt0855213  could not be retrieved from api
tt0358890
tt0358890  could not be retrieved from api
tt3484986
tt3484986  could not be retrieved from api
tt2208507
tt2208507  could not be retrieved from api
tt4896052
tt4896052  could not be retrieved from api
tt6148376
tt0217211
tt0217211  could not be retrieved from api
tt0430836
tt0430836  could not be retrieved from api
tt1429551
tt1291098
tt1291098  could not be retrieved from api
tt0399968
tt0399968  could not be retrieved from api
tt2909920
tt2909920  could not be retrieved from api
tt3164276
tt3164276  could not be retrieved from api
tt1586637
tt4873032
tt0926012
tt0926012  could not be retrieved from api
tt1305560
tt1305560  could not be retrieved from api
tt1291488
tt1291488  could not be retrieved from api
tt0428088
tt0428088  could not be retrieved from api
tt1057469
tt1057469  could not be retrieved from api
tt3807326
tt3807326  could not be retrieved from api
tt3293566
tt0410964
tt1579186
tt0271931
tt6519752
tt1417358
tt4568130
tt1705611
tt2235190
tt0244328
tt0244328  could not be retrieved from api
tt0459155
tt0459155  could not be retrieved from api
tt1890984
tt1890984  could not be retrieved from api
tt0460381
tt0460381  could not be retrieved from api
tt0439069
tt0439069  could not be retrieved from api
tt0329817
tt0329817  could not be retrieved from api
tt1805082
tt1805082  could not be retrieved from api
tt0468985
tt0468985  could not be retrieved from api
tt1071166
tt1071166  could not be retrieved from api
tt1634699
tt1634699  could not be retrieved from api
tt1086761
tt4214468
tt0170930
tt0170930  could not be retrieved from api
tt5937940
tt0305056
tt1024887
tt1024887  could not be retrieved from api
tt1833558
tt7062438
tt7062438  could not be retrieved from api
tt4411548
tt4411548  could not be retrieved from api
tt0105970
tt0105970  could not be retrieved from api
tt0348949
tt0348949  could not be retrieved from api
tt2309197
tt2309197  could not be retrieved from api
tt0327271
tt0327271  could not be retrieved from api
tt1729597
tt1729597  could not be retrieved from api
tt0428108
tt0428108  could not be retrieved from api
tt3144026
tt3144026  could not be retrieved from api
tt0292770
tt0077041
tt1489024
tt0458269
tt1020924
tt0444578
tt0787980
tt0249275
tt1280868
tt0462121
tt3136086
tt1908157
tt0055714
tt0781991
tt0224517
tt0426804
tt0484508
tt0186742
tt0460081
tt0320809
tt0798631
tt3119834
tt3804586
tt0479614
tt0479614  could not be retrieved from api
tt0780447
tt0780447  could not be retrieved from api
tt0123366
tt3481544
tt3975956
tt3975956  could not be retrieved from api
tt5335110
tt0471990
tt0471990  could not be retrieved from api
tt1332074
tt6846846
tt6846846  could not be retrieved from api
tt1259798
tt0381741
tt0381741  could not be retrieved from api
tt2953706
tt1244881
tt6208480
tt6208480  could not be retrieved from api
tt1232190
tt0829040
tt0829040  could not be retrieved from api
tt3859844
tt1761662
tt1761662  could not be retrieved from api
tt2262354
tt0103411
tt0103411  could not be retrieved from api
tt0356281
tt0356281  could not be retrieved from api
tt4628798
tt4628798  could not be retrieved from api
tt0283714
tt1147702
tt1147702  could not be retrieved from api
tt0780444
tt0780444  could not be retrieved from api
tt1981147
tt0756524
tt0312095
tt0260645
tt1728958
tt4688354
tt1296242
tt1062211
tt1500453
tt0358320
tt1118205
tt0480781
tt0303490
tt0278256
tt0812148
tt0892683
tt1562042
tt0218767
tt2265901
tt1456074
tt1978967
tt0313038
tt5437800
tt5437800  could not be retrieved from api
tt2453016
tt5209238
tt5209238  could not be retrieved from api
tt7165310
tt7165310  could not be retrieved from api
tt1277979
tt0362379
tt0362379  could not be retrieved from api
tt0348512
tt0348512  could not be retrieved from api
tt1024814
tt0065343
tt0065343  could not be retrieved from api
tt3976016
tt3976016  could not be retrieved from api
tt1459376
tt1459376  could not be retrieved from api
tt4629950
tt4629950  could not be retrieved from api
tt0443361
tt0443361  could not be retrieved from api
tt1320317
tt1320317  could not be retrieved from api
tt1770959
tt6212410
tt6212410  could not be retrieved from api
tt3731648
tt5872774
tt5872774  could not be retrieved from api
tt4410468
tt0196232
tt0196232  could not be retrieved from api
tt3693866
tt3693866  could not be retrieved from api
tt6295148
tt6295148  could not be retrieved from api
tt0804424
tt0804424  could not be retrieved from api
tt0458252
tt0458252  could not be retrieved from api
tt2933730
tt2933730  could not be retrieved from api
tt5690306
tt5690306  could not be retrieved from api
tt3038492
tt0854912
tt0426740
tt0364787
tt1033281
tt0473416
tt5423592
tt2064427
tt1208634
tt0402660
tt1566044
tt0292845
tt2633208
tt1685317
tt0421158
tt1176154
tt3099832
tt0396337
tt0337790
tt0287847
tt0421343
tt0408364
tt0346300
tt0346300  could not be retrieved from api
tt2908564
tt2908564  could not be retrieved from api
tt0348894
tt6959064
tt6959064  could not be retrieved from api
tt1737565
tt1454730
tt0468999
tt1495163
tt2514488
tt2390003
tt0293725
tt0293725  could not be retrieved from api
tt0092362
tt0092362  could not be retrieved from api
tt0818895
tt0818895  could not be retrieved from api
tt1509653
tt1509653  could not be retrieved from api
tt1809909
tt1809909  could not be retrieved from api
tt1796975
tt1796975  could not be retrieved from api
tt6501522
tt6501522  could not be retrieved from api
tt0424611
tt0424611  could not be retrieved from api
tt0439932
tt0439932  could not be retrieved from api
tt4671004
tt0471048
tt0471048  could not be retrieved from api
tt1156526
tt1156526  could not be retrieved from api
tt0264226
tt0264226  could not be retrieved from api
tt1170222
tt1170222  could not be retrieved from api
tt2689384
tt0295081
tt0295081  could not be retrieved from api
tt4369244
tt4369244  could not be retrieved from api
tt2781594
tt2781594  could not be retrieved from api
tt4662374
tt1105316
tt1105316  could not be retrieved from api
tt3840030
tt3840030  could not be retrieved from api
tt2579722
tt0072546
tt4628790
tt0046590
tt2184509
tt0497854
tt0363323
tt1458207
tt0439356
tt0377146
tt0954318
tt2214505
tt2435530
tt0473419
tt0768151
tt0439365
tt0278177
tt1299440
tt2083701
tt1933836
tt6473824
tt6473824  could not be retrieved from api
tt0187632
tt0187632  could not be retrieved from api
tt4033696
tt0391666
tt0391666  could not be retrieved from api
tt0465344
tt0465344  could not be retrieved from api
tt2170392
tt4390084
tt2189892
tt2189892  could not be retrieved from api
tt6586510
tt6586510  could not be retrieved from api
tt3174316
tt2374870
tt2374870  could not be retrieved from api
tt2366111
tt2111994
tt2111994  could not be retrieved from api
tt4588734
tt4588734  could not be retrieved from api
tt0863047
tt0863047  could not be retrieved from api
tt1495648
tt1579108
tt1579108  could not be retrieved from api
tt1159610
tt0984168
tt0984168  could not be retrieved from api
tt6752226
tt6752226  could not be retrieved from api
tt0856723
tt0856723  could not be retrieved from api
tt0416347
tt0416347  could not be retrieved from api
tt5571740
tt5571740  could not be retrieved from api
tt1552185
tt1552185  could not be retrieved from api
tt3595870
tt1728864
tt1062185
tt0380949
tt1013861
tt0848174
tt0321000
tt1855738
tt0363335
tt0420381
tt1814550
tt1987353
tt0187654
tt1461569
tt1850160
tt0954661
tt0198095
tt4012388
tt0482028
tt0176381
tt0419307
tt1684732
tt5154762
tt3139774
tt0819708
tt0819708  could not be retrieved from api
tt0888280
tt0888280  could not be retrieved from api
tt6021260
tt6021260  could not be retrieved from api
tt0185065
tt0185065  could not be retrieved from api
tt4123482
tt1491299
tt1492090
tt6059298
tt6059298  could not be retrieved from api
tt1826951
tt0273025
tt0273025  could not be retrieved from api
tt1888795
tt1888795  could not be retrieved from api
tt1821879
tt1821879  could not be retrieved from api
tt2497788
tt0476038
tt0476038  could not be retrieved from api
tt1830924
tt1830924  could not be retrieved from api
tt1368470
tt1368470  could not be retrieved from api
tt1361721
tt1361721  could not be retrieved from api
tt2647792
tt2647792  could not be retrieved from api
tt3148194
tt0302163
tt0302163  could not be retrieved from api
tt5515342
tt0292859
tt0292859  could not be retrieved from api
tt0243082
tt0243082  could not be retrieved from api
tt4654650
tt4654650  could not be retrieved from api
tt0298682
tt0298682  could not be retrieved from api
tt1534856
tt1534856  could not be retrieved from api
tt3097134
tt3097134  could not be retrieved from api
tt2582840
tt2582840  could not be retrieved from api
tt4605154
tt1478217
tt1478217  could not be retrieved from api
tt0374366
tt1631948
tt0368494
tt1721347
tt5319670
tt1684855
tt5209280
tt6217260
tt6842890
tt5040090
tt3501210
tt0367323
tt0397012
tt0954837
tt1784056
tt3228548
tt0861753
tt0933898
tt0433705
tt0287845
tt0329816
tt0329816  could not be retrieved from api
tt2815342
tt3548386
tt3548386  could not be retrieved from api
tt0410958
tt0410958  could not be retrieved from api
tt0057740
tt0057740  could not be retrieved from api
tt5583124
tt5583124  could not be retrieved from api
tt1440045
tt1440045  could not be retrieved from api
tt0810737
tt0810737  could not be retrieved from api
tt0989753
tt0989753  could not be retrieved from api
tt1313075
tt1313075  could not be retrieved from api
tt1073528
tt1073528  could not be retrieved from api
tt0310516
tt0310516  could not be retrieved from api
tt1642103
tt1642103  could not be retrieved from api
tt0448973
tt0448973  could not be retrieved from api
tt0302098
tt0302098  could not be retrieved from api
tt0805368
tt0805368  could not be retrieved from api
tt1124662
tt1124662  could not be retrieved from api
tt0324891
tt0324891  could not be retrieved from api
tt0423631
tt0423631  could not be retrieved from api
tt2226096
tt2226096  could not be retrieved from api
tt0773264
tt1798695
tt1307083
tt4845734
tt0046641
tt0046641  could not be retrieved from api
tt1519575
tt1519575  could not be retrieved from api
tt0853078
tt0853078  could not be retrieved from api
tt0118423
tt0118423  could not be retrieved from api
tt0284767
tt4052124
tt4052124  could not be retrieved from api
tt0878801
tt3703500
# Oops, we've hit the API too hard. A second attempt to pull low-rated show information
# will be needed, with a time delay to stay within API limitations.

# This shape is misleading, as many of the rows simply contain a message that the API limit
# had been exceeded.

losers.shape
(229, 22)
# This count is accurate: 235 shows from the top show list were obtained
shows.shape
(235, 20)
DO_NOT_RUN = True  # Be sure to check the file name to write before enabling execution on this block

if not DO_NOT_RUN:
    pickle.dump( losers, open( "save_losers_df.p", "wb" ) )
# read data back in from the saved file
losers2 = pickle.load( open( "save_losers_df.p", "rb" ) )

This is the start of a second attempt to pull more TV shows with low ratings

This is needed because, after the first pull and subsequent cleanup, only 10 shows were left in the low-rating category with complete information. The cells below collect more data from the API for additional low-rated shows.
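
As a rough sketch of the kind of completeness check implied here (a simple dropna over all columns is an assumption and may not match the actual cleanup criteria):

# count how many rows of the losers dataframe have no missing values
complete_losers = losers.dropna()
print(len(complete_losers))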

losers.loc[0:9]['externals']
0    {u'thetvdb': 283995, u'tvrage': 40425, u'imdb'...
1    {u'thetvdb': 299234, u'tvrage': 50418, u'imdb'...
2    {u'thetvdb': 118021, u'tvrage': None, u'imdb':...
3    {u'thetvdb': 274705, u'tvrage': 31580, u'imdb'...
4    {u'thetvdb': 246161, u'tvrage': None, u'imdb':...
5    {u'thetvdb': 75638, u'tvrage': None, u'imdb': ...
6    {u'thetvdb': 260183, u'tvrage': 31024, u'imdb'...
7    {u'thetvdb': None, u'tvrage': None, u'imdb': u...
8    {u'thetvdb': 299688, u'tvrage': None, u'imdb':...
9    {u'thetvdb': 222481, u'tvrage': None, u'imdb':...
Name: externals, dtype: object
# In the first attempt, there were a number of shows where data was not returned because of too many API calls
# in quick succession. In order to re-submit those show ids, it is necessary to get a list of ids that were
# returned successfully, and then to remove them from the original list of ids before resubmitting.
# losers_pulled is a list of ids that were successful on the previous attempt.

losers_pulled = []
no_imdb_at_idx = []
for i in range(len(losers)):
    try:
        losers_pulled.append(losers.loc[i,'externals']['imdb'])
    except:
        no_imdb_at_idx.append(i)
print no_imdb_at_idx
print
print losers_pulled
print len(losers_pulled)
[11, 35, 36, 37, 38, 39, 40, 41, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 228]

[u'tt2058498', u'tt4566242', u'tt1364951', u'tt2341819', u'tt1893088', u'tt0445890', u'tt2279571', u'tt1733734', u'tt1372153', u'tt1817054', u'tt2879822', u'tt0190177', u'tt5047510', u'tt5338860', u'tt1848281', u'tt6839504', u'tt2114184', u'tt1674023', u'tt1971860', u'tt3100634', u'tt1936732', u'tt1197582', u'tt1485389', u'tt2288050', u'tt1612920', u'tt0402701', u'tt0491735', u'tt2369946', u'tt0990403', u'tt6148376', u'tt1429551', u'tt1586637', u'tt4873032', u'tt3293566', u'tt2235190', u'tt1086761', u'tt4214468', u'tt5937940', u'tt0305056', u'tt1833558', u'tt0123366', u'tt3481544', u'tt5335110', u'tt1332074', u'tt1259798', u'tt2953706', u'tt1244881', u'tt1232190', u'tt3859844', u'tt2262354', u'tt0283714', u'tt0313038', u'tt2453016', u'tt1277979', u'tt1024814', u'tt1770959', u'tt3731648', u'tt4410468', u'tt0348894', u'tt1737565', u'tt1454730', u'tt0468999', u'tt1495163', u'tt2514488', u'tt2390003', u'tt4671004', u'tt2689384', u'tt4662374', u'tt1299440', u'tt2083701', u'tt1933836', u'tt4033696', u'tt2170392', u'tt4390084', u'tt3174316', u'tt2366111', u'tt1495648', u'tt1159610', u'tt4123482', u'tt1491299', u'tt1492090', u'tt1826951', u'tt2497788', u'tt3148194', u'tt5515342', u'tt4605154', u'tt2815342', u'tt0773264', u'tt1798695', u'tt1307083', u'tt4845734', u'tt0284767', u'tt0878801']
93
# There were rows that do not even include their own IMDb number, an indicator that the pull was unsuccessful.
# While a few of these pulls might have succeeded but returned only limited data, most are unusable.
# Thus all will be re-requested at a slower rate, and any duplicates will be removed when the data is merged.

print len(no_imdb_at_idx)
136
# This generates a list of the original requests that were not successfully returned from the API.
# First, they will be requested again, using a time delay to avoid requesting more than the server
# will willingly return. They will also be batched in groups of 100 ids.

missing_losers = [x for x in loser_list if x not in losers_pulled]
missing_losers
['tt0465347',
 'tt4427122',
 'tt1015682',
 'tt2505738',
 'tt2402465',
 'tt0278236',
 'tt0268066',
 'tt4813760',
 'tt1526001',
 'tt1243976',
 'tt3897284',
 'tt3665690',
 'tt4132180',
 'tt0824229',
 'tt0314990',
 'tt5423750',
 'tt5423664',
 'tt2175125',
 'tt0404593',
 'tt4160422',
 'tt4552562',
 'tt5804854',
 'tt0886666',
 'tt5423824',
 'tt3500210',
 'tt0285357',
 'tt0280234',
 'tt1863530',
 'tt0280349',
 'tt2660922',
 'tt0292776',
 'tt0264230',
 'tt1102523',
 'tt3333790',
 'tt0320863',
 'tt0830848',
 'tt0939270',
 'tt1459294',
 'tt6026132',
 'tt1443593',
 'tt0354267',
 'tt0147749',
 'tt0161180',
 'tt4733812',
 'tt0367362',
 'tt5626868',
 'tt7268752',
 'tt0464767',
 'tt3550770',
 'tt6422012',
 'tt3154248',
 'tt5016274',
 'tt1715229',
 'tt0489426',
 'tt5798754',
 'tt2022182',
 'tt0303564',
 'tt3462252',
 'tt0329849',
 'tt5074180',
 'tt3900878',
 'tt3887402',
 'tt0149408',
 'tt1360544',
 'tt1718355',
 'tt2364950',
 'tt0285374',
 'tt5267590',
 'tt0314993',
 'tt0300870',
 'tt7036530',
 'tt5657014',
 'tt0149488',
 'tt1204865',
 'tt1182860',
 'tt0423626',
 'tt4223864',
 'tt1773440',
 'tt0872067',
 'tt0428172',
 'tt0817379',
 'tt1210720',
 'tt3855028',
 'tt1611594',
 'tt5822004',
 'tt6524930',
 'tt1902032',
 'tt0466201',
 'tt1757293',
 'tt1807575',
 'tt0332896',
 'tt3140278',
 'tt1176297',
 'tt0285406',
 'tt6680212',
 'tt0200336',
 'tt0385483',
 'tt3534894',
 'tt1108281',
 'tt3855016',
 'tt0787948',
 'tt1292967',
 'tt1466565',
 'tt0435565',
 'tt1229266',
 'tt0364837',
 'tt0477409',
 'tt0875097',
 'tt1227542',
 'tt1131289',
 'tt0355135',
 'tt1418598',
 'tt0290970',
 'tt0184124',
 'tt0490736',
 'tt0439354',
 'tt1157935',
 'tt1425641',
 'tt2830404',
 'tt0835397',
 'tt0880581',
 'tt1078463',
 'tt1234506',
 'tt0323463',
 'tt5168468',
 'tt0296322',
 'tt3911254',
 'tt3827516',
 'tt0364899',
 'tt4204032',
 'tt0259768',
 'tt0287880',
 'tt0270763',
 'tt0846349',
 'tt2699648',
 'tt3616368',
 'tt2672920',
 'tt0813074',
 'tt1694422',
 'tt0472241',
 'tt0202186',
 'tt1297366',
 'tt3919918',
 'tt1564985',
 'tt3336800',
 'tt2254454',
 'tt0824737',
 'tt1288431',
 'tt1705811',
 'tt0968726',
 'tt2058840',
 'tt3857708',
 'tt0315030',
 'tt2337185',
 'tt0775356',
 'tt0244356',
 'tt2338400',
 'tt0220047',
 'tt0341789',
 'tt0197151',
 'tt0222529',
 'tt6086050',
 'tt1625263',
 'tt2289244',
 'tt0278229',
 'tt0429438',
 'tt1410490',
 'tt5588910',
 'tt3670858',
 'tt0397182',
 'tt1911975',
 'tt0420366',
 'tt3079034',
 'tt0859270',
 'tt0050070',
 'tt0300798',
 'tt5915502',
 'tt6697244',
 'tt1776388',
 'tt0424639',
 'tt1119204',
 'tt1744868',
 'tt1588824',
 'tt3696798',
 'tt0301123',
 'tt1018436',
 'tt0815776',
 'tt0407462',
 'tt0198147',
 'tt0997412',
 'tt5047494',
 'tt5368216',
 'tt3356610',
 'tt1454750',
 'tt5891726',
 'tt4286824',
 'tt0476926',
 'tt5167034',
 'tt0056759',
 'tt3622818',
 'tt0887788',
 'tt4588620',
 'tt0258341',
 'tt0489430',
 'tt2567210',
 'tt4674178',
 'tt0125638',
 'tt5146640',
 'tt0196284',
 'tt3075154',
 'tt0436003',
 'tt1538090',
 'tt1728226',
 'tt3796070',
 'tt1381395',
 'tt0190199',
 'tt0855213',
 'tt0358890',
 'tt3484986',
 'tt2208507',
 'tt4896052',
 'tt0217211',
 'tt0430836',
 'tt1291098',
 'tt0399968',
 'tt2909920',
 'tt3164276',
 'tt0926012',
 'tt1305560',
 'tt1291488',
 'tt0428088',
 'tt1057469',
 'tt3807326',
 'tt0410964',
 'tt1579186',
 'tt0271931',
 'tt6519752',
 'tt1417358',
 'tt4568130',
 'tt1705611',
 'tt0244328',
 'tt0459155',
 'tt1890984',
 'tt0460381',
 'tt0439069',
 'tt0329817',
 'tt1805082',
 'tt0468985',
 'tt1071166',
 'tt1634699',
 'tt0170930',
 'tt1024887',
 'tt7062438',
 'tt4411548',
 'tt0105970',
 'tt0348949',
 'tt2309197',
 'tt0327271',
 'tt1729597',
 'tt0428108',
 'tt3144026',
 'tt0292770',
 'tt0077041',
 'tt1489024',
 'tt0458269',
 'tt1020924',
 'tt0444578',
 'tt0787980',
 'tt0249275',
 'tt1280868',
 'tt0462121',
 'tt3136086',
 'tt1908157',
 'tt0055714',
 'tt0781991',
 'tt0224517',
 'tt0426804',
 'tt0484508',
 'tt0186742',
 'tt0460081',
 'tt0320809',
 'tt0798631',
 'tt3119834',
 'tt3804586',
 'tt0479614',
 'tt0780447',
 'tt3975956',
 'tt0471990',
 'tt6846846',
 'tt0381741',
 'tt6208480',
 'tt0829040',
 'tt1761662',
 'tt0103411',
 'tt0356281',
 'tt4628798',
 'tt1147702',
 'tt0780444',
 'tt1981147',
 'tt0756524',
 'tt0312095',
 'tt0260645',
 'tt1728958',
 'tt4688354',
 'tt1296242',
 'tt1062211',
 'tt1500453',
 'tt0358320',
 'tt1118205',
 'tt0480781',
 'tt0303490',
 'tt0278256',
 'tt0812148',
 'tt0892683',
 'tt1562042',
 'tt0218767',
 'tt2265901',
 'tt1456074',
 'tt1978967',
 'tt5437800',
 'tt5209238',
 'tt7165310',
 'tt0362379',
 'tt0348512',
 'tt0065343',
 'tt3976016',
 'tt1459376',
 'tt4629950',
 'tt0443361',
 'tt1320317',
 'tt6212410',
 'tt5872774',
 'tt0196232',
 'tt3693866',
 'tt6295148',
 'tt0804424',
 'tt0458252',
 'tt2933730',
 'tt5690306',
 'tt3038492',
 'tt0854912',
 'tt0426740',
 'tt0364787',
 'tt1033281',
 'tt0473416',
 'tt5423592',
 'tt2064427',
 'tt1208634',
 'tt0402660',
 'tt1566044',
 'tt0292845',
 'tt2633208',
 'tt1685317',
 'tt0421158',
 'tt1176154',
 'tt3099832',
 'tt0396337',
 'tt0337790',
 'tt0287847',
 'tt0421343',
 'tt0408364',
 'tt0346300',
 'tt2908564',
 'tt6959064',
 'tt0293725',
 'tt0092362',
 'tt0818895',
 'tt1509653',
 'tt1809909',
 'tt1796975',
 'tt6501522',
 'tt0424611',
 'tt0439932',
 'tt0471048',
 'tt1156526',
 'tt0264226',
 'tt1170222',
 'tt0295081',
 'tt4369244',
 'tt2781594',
 'tt1105316',
 'tt3840030',
 'tt2579722',
 'tt0072546',
 'tt4628790',
 'tt0046590',
 'tt2184509',
 'tt0497854',
 'tt0363323',
 'tt1458207',
 'tt0439356',
 'tt0377146',
 'tt0954318',
 'tt2214505',
 'tt2435530',
 'tt0473419',
 'tt0768151',
 'tt0439365',
 'tt0278177',
 'tt6473824',
 'tt0187632',
 'tt0391666',
 'tt0465344',
 'tt2189892',
 'tt6586510',
 'tt2374870',
 'tt2111994',
 'tt4588734',
 'tt0863047',
 'tt1579108',
 'tt0984168',
 'tt6752226',
 'tt0856723',
 'tt0416347',
 'tt5571740',
 'tt1552185',
 'tt3595870',
 'tt1728864',
 'tt1062185',
 'tt0380949',
 'tt1013861',
 'tt0848174',
 'tt0321000',
 'tt1855738',
 'tt0363335',
 'tt0420381',
 'tt1814550',
 'tt1987353',
 'tt0187654',
 'tt1461569',
 'tt1850160',
 'tt0954661',
 'tt0198095',
 'tt4012388',
 'tt0482028',
 'tt0176381',
 'tt0419307',
 'tt1684732',
 'tt5154762',
 'tt3139774',
 'tt0819708',
 'tt0888280',
 'tt6021260',
 'tt0185065',
 'tt6059298',
 'tt0273025',
 'tt1888795',
 'tt1821879',
 'tt0476038',
 'tt1830924',
 'tt1368470',
 'tt1361721',
 'tt2647792',
 'tt0302163',
 'tt0292859',
 'tt0243082',
 'tt4654650',
 'tt0298682',
 'tt1534856',
 'tt3097134',
 'tt2582840',
 'tt1478217',
 'tt0374366',
 'tt1631948',
 'tt0368494',
 'tt1721347',
 'tt5319670',
 'tt1684855',
 'tt5209280',
 'tt6217260',
 'tt6842890',
 'tt5040090',
 'tt3501210',
 'tt0367323',
 'tt0397012',
 'tt0954837',
 'tt1784056',
 'tt3228548',
 'tt0861753',
 'tt0933898',
 'tt0433705',
 'tt0287845',
 'tt0329816',
 'tt3548386',
 'tt0410958',
 'tt0057740',
 'tt5583124',
 'tt1440045',
 'tt0810737',
 'tt0989753',
 'tt1313075',
 'tt1073528',
 'tt0310516',
 'tt1642103',
 'tt0448973',
 'tt0302098',
 'tt0805368',
 'tt1124662',
 'tt0324891',
 'tt0423631',
 'tt2226096',
 'tt0046641',
 'tt1519575',
 'tt0853078',
 'tt0118423',
 'tt4052124',
 'tt3703500']
# This processes the original list of 600 ids, minus the ones that were successfully pulled,
# into groups of 100, with the remaining 7 in a final list.
# Break up the missing list into groups of 100
subset_loser_list = []
print len(missing_losers)
for i in range(len(missing_losers)/100):
    temp_list = []
    for j in range(100):
        temp_list.append(missing_losers[i*100 + j])
    subset_loser_list.append(temp_list)    

# get the last 7 ids (those beyond the final full group of 100) and add them as the last sublist
temp_list = []
for j in range(500, len(missing_losers)):
    temp_list.append(missing_losers[j])
subset_loser_list.append(temp_list)
507
# After reprocessing the first list of ids a second time, there are still not enough samples of low-rated shows.
# A third list of 600 low-rated shows was scraped from IMDb, and this list is broken into subsets of 100 here.

subset_loser_list2 = []
print len(loser_list)
for i in range(len(loser_list)/100):
    temp_list = []
    for j in range(100):
        temp_list.append(loser_list[i*100 + j])
    subset_loser_list2.append(temp_list)    

600
subset_loser_list2[0]

['tt0773264',
 'tt1798695',
 'tt1307083',
 'tt4845734',
 'tt0046641',
 'tt1519575',
 'tt0853078',
 'tt0118423',
 'tt0284767',
 'tt4052124',
 'tt0878801',
 'tt3703500',
 'tt1105170',
 'tt4363582',
 'tt3155428',
 'tt0362350',
 'tt0287196',
 'tt2766052',
 'tt0405545',
 'tt0262975',
 'tt0367278',
 'tt7134262',
 'tt1695352',
 'tt0421470',
 'tt2466890',
 'tt0343305',
 'tt1002739',
 'tt1615697',
 'tt0274262',
 'tt0465320',
 'tt1388381',
 'tt0358889',
 'tt1085789',
 'tt1011591',
 'tt0364804',
 'tt1489335',
 'tt3612584',
 'tt0363377',
 'tt0111930',
 'tt0401913',
 'tt0808086',
 'tt0309212',
 'tt5464192',
 'tt0080250',
 'tt4533338',
 'tt4741696',
 'tt1922810',
 'tt1793868',
 'tt4789316',
 'tt0185054',
 'tt1079622',
 'tt1786048',
 'tt0790508',
 'tt1716372',
 'tt0295098',
 'tt3409706',
 'tt0222574',
 'tt2171325',
 'tt0442643',
 'tt2142117',
 'tt0371433',
 'tt0138244',
 'tt1002010',
 'tt0495557',
 'tt1811817',
 'tt5529996',
 'tt1352053',
 'tt0439346',
 'tt0940147',
 'tt3075138',
 'tt1974439',
 'tt2693842',
 'tt0092325',
 'tt6772826',
 'tt1563069',
 'tt0489598',
 'tt0142055',
 'tt1566154',
 'tt0338592',
 'tt0167515',
 'tt2330327',
 'tt1576464',
 'tt2389845',
 'tt0186747',
 'tt0355096',
 'tt1821877',
 'tt0112033',
 'tt1792654',
 'tt0472243',
 'tt6453018',
 'tt3648886',
 'tt1599374',
 'tt2946482',
 'tt4672020',
 'tt1016283',
 'tt2649480',
 'tt1229945',
 'tt2390606',
 'tt1876612',
 'tt0140732']

# This block calls the API. It is run repeatedly with each new sublist of 100 show ids, sleeping 10
# seconds between each request. There is a DO_NOT_RUN flag that prevents running this block if the
# notebook is restarted. The first time it was executed, a new dataframe called "more_losers" was initialized;
# that line was then commented out for subsequent executions so the data returned by each subsequent request
# is appended to the bottom of the dataframe.

# After collection is complete, set the flag to prevent running this block unnecessarily if the notebook is restarted.

import time
DO_NOT_RUN = True

if not DO_NOT_RUN:
#     responses = []
#     more_losers = pd.DataFrame()
    for loser_id in subset_loser_list2[0]:   # change the index and re-run to access each set of 100 ids
        time.sleep(10)    
        try: 
            # Get the tv show info from the api
            url = "http://api.tvmaze.com/lookup/shows?imdb=" + loser_id
            r = requests.get(url)

            # convert the return data to a dictionary
            json_data = r.json()

            # load a temp dataframe with the dictionary, then append to the composite dataframe
            temp_df = pd.DataFrame.from_dict(json_data, orient='index', dtype=None)
            ttemp_df = temp_df.T     # Could not load the json in column orientation, so transpose instead
            more_losers = more_losers.append(ttemp_df, ignore_index=True)
            stat = ''
        except: 
            stat = 'failed'

        print loser_id, stat, r.status_code
        res = [loser_id, stat, r.status_code]
        responses.append(res)
        
    more_losers.head()


tt0773264  200
tt1798695  200
tt1307083  200
tt4845734  200
tt0046641 failed 404
tt1519575 failed 404
tt0853078 failed 404
tt0118423 failed 404
tt0284767  200
tt4052124 failed 404
tt0878801  200
tt3703500  200
tt1105170 failed 404
tt4363582 failed 404
tt3155428  200
tt0362350 failed 404
tt0287196  200
tt2766052  200
tt0405545 failed 404
tt0262975  200
tt0367278 failed 404
tt7134262 failed 404
tt1695352 failed 404
tt0421470 failed 404
tt2466890 failed 404
tt0343305 failed 404
tt1002739 failed 404
tt1615697 failed 404
tt0274262 failed 404
tt0465320 failed 404
tt1388381  200
tt0358889  200
tt1085789 failed 404
tt1011591  200
tt0364804 failed 404
tt1489335 failed 404
tt3612584  200
tt0363377 failed 404
tt0111930 failed 404
tt0401913 failed 404
tt0808086 failed 404
tt0309212 failed 404
tt5464192  200
tt0080250 failed 404
tt4533338 failed 404
tt4741696  200
tt1922810 failed 404
tt1793868 failed 404
tt4789316 failed 404
tt0185054 failed 404
tt1079622 failed 404
tt1786048 failed 404
tt0790508 failed 404
tt1716372 failed 404
tt0295098 failed 404
tt3409706 failed 404
tt0222574 failed 404
tt2171325 failed 404
tt0442643 failed 404
tt2142117 failed 404
tt0371433 failed 404
tt0138244 failed 404
tt1002010 failed 404
tt0495557 failed 404
tt1811817 failed 404
tt5529996 failed 404
tt1352053 failed 404
tt0439346 failed 404
tt0940147 failed 404
tt3075138 failed 404
tt1974439  200
tt2693842 failed 404
tt0092325  200
tt6772826  200
tt1563069  200
tt0489598  200
tt0142055 failed 404
tt1566154  200
tt0338592  200
tt0167515  200
tt2330327  200
tt1576464 failed 404
tt2389845 failed 404
tt0186747  200
tt0355096 failed 404
tt1821877  200
tt0112033 failed 404
tt1792654 failed 404
tt0472243 failed 404
tt6453018 failed 404
tt3648886 failed 404
tt1599374  200
tt2946482  200
tt4672020 failed 404
tt1016283 failed 404
tt2649480  200
tt1229945  200
tt2390606 failed 404
tt1876612  200
tt0140732 failed 404
len(responses)
1000
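Since responses records [loser_id, stat, status_code] for every request, the ids that failed can be pulled out for a later retry pass; a minimal sketch (not part of the original run):

# Sketch: collect ids whose request did not return HTTP 200, for a possible retry
failed_ids = [entry[0] for entry in responses if entry[2] != 200]
print len(failed_ids)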
for i in range(len(more_losers)):
    print more_losers.loc[i, 'externals']
{u'thetvdb': 279947, u'tvrage': 37045, u'imdb': u'tt3595870'}
{u'thetvdb': None, u'tvrage': 13173, u'imdb': u'tt0848174'}
{u'thetvdb': 72157, u'tvrage': None, u'imdb': u'tt0374366'}
{u'thetvdb': 218241, u'tvrage': None, u'imdb': u'tt1684855'}
{u'thetvdb': 327908, u'tvrage': None, u'imdb': u'tt6842890'}
{u'thetvdb': 279810, u'tvrage': None, u'imdb': u'tt3501210'}
{u'thetvdb': 283658, u'tvrage': None, u'imdb': u'tt0367323'}
{u'thetvdb': 271341, u'tvrage': 33650, u'imdb': u'tt2633208'}
{u'thetvdb': 260677, u'tvrage': None, u'imdb': u'tt2579722'}
{u'thetvdb': 77616, u'tvrage': None, u'imdb': u'tt0072546'}
{u'thetvdb': 74419, u'tvrage': None, u'imdb': u'tt0458269'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt0249275'}
{u'thetvdb': 282527, u'tvrage': 42189, u'imdb': u'tt2815184'}
{u'thetvdb': 246631, u'tvrage': None, u'imdb': u'tt1753229'}
{u'thetvdb': 82500, u'tvrage': None, u'imdb': u'tt1240534'}
{u'thetvdb': 206381, u'tvrage': 26873, u'imdb': u'tt1999642'}
{u'thetvdb': 284259, u'tvrage': None, u'imdb': u'tt3784176'}
{u'thetvdb': 250186, u'tvrage': None, u'imdb': u'tt1958848'}
{u'thetvdb': 320679, u'tvrage': None, u'imdb': u'tt5684430'}
{u'thetvdb': 74181, u'tvrage': 6494, u'imdb': u'tt0134269'}
{u'thetvdb': 84159, u'tvrage': 19672, u'imdb': u'tt1252370'}
{u'thetvdb': 300105, u'tvrage': 48178, u'imdb': u'tt3824018'}
{u'thetvdb': 264850, u'tvrage': None, u'imdb': u'tt2555880'}
{u'thetvdb': 277020, u'tvrage': 35629, u'imdb': u'tt3310544'}
{u'thetvdb': 254524, u'tvrage': 31887, u'imdb': u'tt2125758'}
{u'thetvdb': 271916, u'tvrage': None, u'imdb': u'tt1973047'}
{u'thetvdb': 82005, u'tvrage': None, u'imdb': u'tt0934701'}
{u'thetvdb': 250472, u'tvrage': None, u'imdb': u'tt2059031'}
{u'thetvdb': 81491, u'tvrage': None, u'imdb': u'tt1056536'}
{u'thetvdb': 137691, u'tvrage': None, u'imdb': u'tt1618950'}
{u'thetvdb': 74395, u'tvrage': 3883, u'imdb': u'tt0115206'}
{u'thetvdb': 298860, u'tvrage': 50010, u'imdb': u'tt4575056'}
{u'thetvdb': 269115, u'tvrage': 33511, u'imdb': u'tt2889104'}
{u'thetvdb': 285008, u'tvrage': None, u'imdb': u'tt2644204'}
{u'thetvdb': 82237, u'tvrage': None, u'imdb': u'tt1210781'}
{u'thetvdb': 314998, u'tvrage': None, u'imdb': u'tt0048898'}
{u'thetvdb': 276337, u'tvrage': None, u'imdb': u'tt3398108'}
{u'thetvdb': 221621, u'tvrage': None, u'imdb': u'tt1252620'}
{u'thetvdb': 269059, u'tvrage': 35857, u'imdb': u'tt2901828'}
{u'thetvdb': 273303, u'tvrage': 35560, u'imdb': u'tt3006666'}
{u'thetvdb': 260473, u'tvrage': 30918, u'imdb': u'tt2197994'}
{u'thetvdb': 83313, u'tvrage': None, u'imdb': u'tt1263594'}
{u'thetvdb': 80117, u'tvrage': 7218, u'imdb': u'tt0497079'}
{u'thetvdb': 174991, u'tvrage': 25843, u'imdb': u'tt1755893'}
{u'thetvdb': 71424, u'tvrage': None, u'imdb': u'tt0329824'}
{u'thetvdb': 258632, u'tvrage': 31545, u'imdb': u'tt2245937'}
{u'thetvdb': 259235, u'tvrage': None, u'imdb': u'tt2147632'}
{u'thetvdb': 297209, u'tvrage': 38100, u'imdb': u'tt3218114'}
{u'thetvdb': 185651, u'tvrage': None, u'imdb': u'tt1583417'}
{u'thetvdb': 250370, u'tvrage': 28934, u'imdb': u'tt1963853'}
{u'thetvdb': 129051, u'tvrage': None, u'imdb': u'tt1520150'}
{u'thetvdb': 76370, u'tvrage': None, u'imdb': u'tt0236907'}
{u'thetvdb': 316174, u'tvrage': None, u'imdb': u'tt5865052'}
{u'thetvdb': 82304, u'tvrage': 19011, u'imdb': u'tt1231448'}
{u'thetvdb': 289640, u'tvrage': 46963, u'imdb': u'tt4287478'}
{u'thetvdb': 249750, u'tvrage': None, u'imdb': u'tt1874006'}
{u'thetvdb': 250959, u'tvrage': 28442, u'imdb': u'tt2006560'}
{u'thetvdb': 281375, u'tvrage': 38313, u'imdb': u'tt3565412'}
{u'thetvdb': 274414, u'tvrage': None, u'imdb': u'tt3396736'}
{u'thetvdb': 271820, u'tvrage': None, u'imdb': u'tt0855313'}
{u'thetvdb': 250955, u'tvrage': None, u'imdb': u'tt2309561'}
{u'thetvdb': 273130, u'tvrage': 36774, u'imdb': u'tt3136814'}
{u'thetvdb': 84669, u'tvrage': 18525, u'imdb': u'tt1191056'}
{u'thetvdb': 74697, u'tvrage': 3348, u'imdb': u'tt0235917'}
{u'thetvdb': 76708, u'tvrage': None, u'imdb': u'tt0111892'}
{u'thetvdb': 266934, u'tvrage': None, u'imdb': u'tt2643770'}
{u'thetvdb': 79896, u'tvrage': None, u'imdb': u'tt0423657'}
{u'thetvdb': 303252, u'tvrage': None, u'imdb': u'tt5327970'}
{u'thetvdb': 256806, u'tvrage': None, u'imdb': u'tt2190731'}
{u'thetvdb': 78409, u'tvrage': None, u'imdb': u'tt0101041'}
{u'thetvdb': 274820, u'tvrage': None, u'imdb': u'tt3317020'}
{u'thetvdb': 296474, u'tvrage': 45813, u'imdb': u'tt4732076'}
{u'thetvdb': 285651, u'tvrage': 41593, u'imdb': u'tt3828162'}
{u'thetvdb': 315767, u'tvrage': None, u'imdb': u'tt5819414'}
{u'thetvdb': 287534, u'tvrage': 42884, u'imdb': u'tt4180738'}
{u'thetvdb': 76621, u'tvrage': None, u'imdb': u'tt0300802'}
{u'thetvdb': 280683, u'tvrage': 34278, u'imdb': u'tt2649738'}
{u'thetvdb': 280256, u'tvrage': 41644, u'imdb': u'tt3181412'}
{u'thetvdb': 79496, u'tvrage': 2677, u'imdb': u'tt0382400'}
{u'thetvdb': 271514, u'tvrage': None, u'imdb': u'tt2168240'}
{u'thetvdb': 271826, u'tvrage': None, u'imdb': u'tt2560966'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt0375440'}
{u'thetvdb': 282253, u'tvrage': 44602, u'imdb': u'tt4081326'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt6664486'}
{u'thetvdb': 70734, u'tvrage': 14443, u'imdb': u'tt0247094'}
{u'thetvdb': 70852, u'tvrage': 5323, u'imdb': u'tt0320969'}
{u'thetvdb': 267185, u'tvrage': None, u'imdb': u'tt2720144'}
{u'thetvdb': 265320, u'tvrage': 33976, u'imdb': u'tt2287380'}
{u'thetvdb': 252485, u'tvrage': None, u'imdb': u'tt2010634'}
{u'thetvdb': 271722, u'tvrage': 36787, u'imdb': u'tt3084090'}
{u'thetvdb': 260126, u'tvrage': 30877, u'imdb': u'tt2392683'}
{u'thetvdb': 251033, u'tvrage': 28408, u'imdb': u'tt1628058'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt1837169'}
{u'thetvdb': 260341, u'tvrage': 31462, u'imdb': u'tt2404111'}
{u'thetvdb': 89831, u'tvrage': 22647, u'imdb': u'tt1411598'}
{u'thetvdb': 70609, u'tvrage': 5102, u'imdb': u'tt0106123'}
{u'thetvdb': 245071, u'tvrage': 26645, u'imdb': u'tt1740718'}
{u'thetvdb': 73230, u'tvrage': 6188, u'imdb': u'tt0362153'}
{u'thetvdb': 163671, u'tvrage': None, u'imdb': u'tt1637756'}
{u'thetvdb': 259478, u'tvrage': 31194, u'imdb': u'tt2328067'}
{u'thetvdb': 294774, u'tvrage': None, u'imdb': u'tt0057741'}
{u'thetvdb': 282993, u'tvrage': None, u'imdb': u'tt1261356'}
{u'thetvdb': 268795, u'tvrage': 36420, u'imdb': u'tt2559390'}
{u'thetvdb': 72048, u'tvrage': 4056, u'imdb': u'tt0083433'}
{u'thetvdb': 256513, u'tvrage': 31344, u'imdb': u'tt2330453'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt0804423'}
{u'thetvdb': 159351, u'tvrage': None, u'imdb': u'tt1118131'}
{u'thetvdb': 300384, u'tvrage': None, u'imdb': u'tt4016700'}
{u'thetvdb': 264239, u'tvrage': None, u'imdb': u'tt0950199'}
{u'thetvdb': 106801, u'tvrage': None, u'imdb': u'tt1477137'}
{u'thetvdb': 87131, u'tvrage': None, u'imdb': u'tt1176156'}
{u'thetvdb': 173981, u'tvrage': None, u'imdb': u'tt1545453'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt1240983'}
{u'thetvdb': 264762, u'tvrage': 31404, u'imdb': u'tt1415000'}
{u'thetvdb': 72180, u'tvrage': None, u'imdb': u'tt0144701'}
{u'thetvdb': 307473, u'tvrage': None, u'imdb': u'tt4718304'}
{u'thetvdb': 147701, u'tvrage': None, u'imdb': u'tt1095213'}
{u'thetvdb': 98371, u'tvrage': None, u'imdb': u'tt1453090'}
{u'thetvdb': 72141, u'tvrage': None, u'imdb': u'tt0168372'}
{u'thetvdb': 75567, u'tvrage': 12949, u'imdb': u'tt0425725'}
{u'thetvdb': 275787, u'tvrage': None, u'imdb': u'tt3300126'}
{u'thetvdb': 308457, u'tvrage': 51439, u'imdb': u'tt5459976'}
{u'thetvdb': 285286, u'tvrage': 44525, u'imdb': u'tt4041694'}
{u'thetvdb': 261287, u'tvrage': 32847, u'imdb': u'tt2322264'}
{u'thetvdb': 250325, u'tvrage': None, u'imdb': u'tt1441005'}
{u'thetvdb': 72133, u'tvrage': None, u'imdb': u'tt0365991'}
{u'thetvdb': 72488, u'tvrage': None, u'imdb': u'tt0364807'}
{u'thetvdb': 149371, u'tvrage': 25246, u'imdb': u'tt1591375'}
{u'thetvdb': 291820, u'tvrage': None, u'imdb': u'tt3562462'}
{u'thetvdb': 96071, u'tvrage': None, u'imdb': u'tt1372127'}
{u'thetvdb': 287516, u'tvrage': None, u'imdb': u'tt2088493'}
{u'thetvdb': 295059, u'tvrage': 48857, u'imdb': u'tt4658248'}
{u'thetvdb': 250280, u'tvrage': None, u'imdb': u'tt1973659'}
{u'thetvdb': 272357, u'tvrage': None, u'imdb': u'tt2849552'}
{u'thetvdb': 282130, u'tvrage': None, u'imdb': u'tt3774098'}
{u'thetvdb': None, u'tvrage': 18611, u'imdb': u'tt1151434'}
{u'thetvdb': 271067, u'tvrage': None, u'imdb': u'tt2993514'}
{u'thetvdb': 80311, u'tvrage': None, u'imdb': u'tt0773264'}
{u'thetvdb': 260189, u'tvrage': 32126, u'imdb': u'tt1798695'}
{u'thetvdb': 139481, u'tvrage': 20203, u'imdb': u'tt1307083'}
{u'thetvdb': 297960, u'tvrage': 49841, u'imdb': u'tt4845734'}
{u'thetvdb': 70656, u'tvrage': None, u'imdb': u'tt0284767'}
{u'thetvdb': 80694, u'tvrage': 15758, u'imdb': u'tt0878801'}
{u'thetvdb': 282654, u'tvrage': 39954, u'imdb': u'tt3703500'}
{u'thetvdb': 272737, u'tvrage': 37535, u'imdb': u'tt3155428'}
{u'thetvdb': 76237, u'tvrage': None, u'imdb': u'tt0287196'}
{u'thetvdb': 270469, u'tvrage': 34560, u'imdb': u'tt2766052'}
{u'thetvdb': 301235, u'tvrage': None, u'imdb': u'tt0262975'}
{u'thetvdb': 126811, u'tvrage': None, u'imdb': u'tt1388381'}
{u'thetvdb': 307480, u'tvrage': None, u'imdb': u'tt0358889'}
{u'thetvdb': 83326, u'tvrage': None, u'imdb': u'tt1011591'}
{u'thetvdb': 279772, u'tvrage': None, u'imdb': u'tt3612584'}
{u'thetvdb': 305936, u'tvrage': None, u'imdb': u'tt5464192'}
{u'thetvdb': 267921, u'tvrage': None, u'imdb': u'tt4741696'}
{u'thetvdb': 95351, u'tvrage': None, u'imdb': u'tt1974439'}
{u'thetvdb': 79838, u'tvrage': 5631, u'imdb': u'tt0092325'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt6772826'}
{u'thetvdb': 127351, u'tvrage': 24425, u'imdb': u'tt1563069'}
{u'thetvdb': 79550, u'tvrage': 6890, u'imdb': u'tt0489598'}
{u'thetvdb': 148561, u'tvrage': 24465, u'imdb': u'tt1566154'}
{u'thetvdb': 70905, u'tvrage': 3150, u'imdb': u'tt0338592'}
{u'thetvdb': 70829, u'tvrage': None, u'imdb': u'tt0167515'}
{u'thetvdb': 262883, u'tvrage': 31271, u'imdb': u'tt2330327'}
{u'thetvdb': 84208, u'tvrage': None, u'imdb': u'tt0186747'}
{u'thetvdb': 239961, u'tvrage': 27826, u'imdb': u'tt1821877'}
{u'thetvdb': 216741, u'tvrage': None, u'imdb': u'tt1599374'}
{u'thetvdb': 270465, u'tvrage': 35836, u'imdb': u'tt2946482'}
{u'thetvdb': 268600, u'tvrage': 35103, u'imdb': u'tt2649480'}
{u'thetvdb': 82550, u'tvrage': None, u'imdb': u'tt1229945'}
{u'thetvdb': 248039, u'tvrage': 23213, u'imdb': u'tt1876612'}
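The externals dictionaries above can also be used to work out which requested ids actually came back, rather than eyeballing the printout; a hedged sketch assuming every externals entry is a dict with an 'imdb' key:

# Sketch: set of imdb ids actually retrieved, compared against the sublist just requested
retrieved = set(d['imdb'] for d in more_losers['externals'] if isinstance(d, dict))
still_missing = [i for i in subset_loser_list2[0] if i not in retrieved]
print len(retrieved), len(still_missing)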
more_losers
status rating genres weight updated name language schedule url officialSite externals premiered summary _links image webChannel runtime type id network
0 Ended {u'average': None} [] 0 1449178946 Famous in 12 English {u'days': [u'Tuesday'], u'time': u'20:00'} http://www.tvmaze.com/shows/9024/famous-in-12 None {u'thetvdb': 279947, u'tvrage': 37045, u'imdb'... 2014-06-03 <p><i><b>"Famous in 12"</b></i>, the new unscr... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 9024 {u'country': {u'timezone': u'America/New_York'...
1 Ended {u'average': None} [Comedy, Family] 14 1497059695 The Sharon Osbourne Show English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/19004/the-sharon-o... None {u'thetvdb': None, u'tvrage': 13173, u'imdb': ... 2006-08-29 <p>Daily talk show hosted by Sharon Osbourne.</p> {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Talk Show 19004 {u'country': {u'timezone': u'Europe/London', u...
2 Ended {u'average': None} [Comedy] 0 1503083428 Steve Harvey's Big Time Challenge English {u'days': [u'Sunday'], u'time': u'21:00'} http://www.tvmaze.com/shows/29202/steve-harvey... None {u'thetvdb': 72157, u'tvrage': None, u'imdb': ... 2003-09-11 <p><b>Steve Harvey's Big Time Challenge</b>, a... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Talk Show 29202 {u'country': {u'timezone': u'America/New_York'...
3 Ended {u'average': None} [] 0 1475183910 The Spin Crowd English {u'days': [u'Sunday'], u'time': u'22:30'} http://www.tvmaze.com/shows/21619/the-spin-crowd None {u'thetvdb': 218241, u'tvrage': None, u'imdb':... 2010-08-22 <p>Nobody knows how to make stars shine bright... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 21619 {u'country': {u'timezone': u'America/New_York'...
4 Running {u'average': 1} [] 0 1495714601 Babushka English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/25450/babushka http://www.itv.com/beontv/shows/babushka {u'thetvdb': 327908, u'tvrage': None, u'imdb':... 2017-05-01 <p><b>Babushka</b> is a brand new game show wh... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Game Show 25450 {u'country': {u'timezone': u'Europe/London', u...
5 Ended {u'average': None} [] 0 1483745416 Chrome Underground English {u'days': [u'Tuesday'], u'time': u'22:00'} http://www.tvmaze.com/shows/24213/chrome-under... http://www.discovery.com/tv-shows/chrome-under... {u'thetvdb': 279810, u'tvrage': None, u'imdb':... 2014-05-23 <p>Two international classic car dealers searc... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 24213 {u'country': {u'timezone': u'America/New_York'...
6 Ended {u'average': None} [] 0 1495602919 Fear Factor English {u'days': [u'Sunday'], u'time': u''} http://www.tvmaze.com/shows/26838/fear-factor None {u'thetvdb': 283658, u'tvrage': None, u'imdb':... 2002-09-10 <p>This version has two teams of three contest... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Game Show 26838 {u'country': {u'timezone': u'Europe/London', u...
7 Ended {u'average': None} [] 0 1495254081 Owner's Manual English {u'days': [u'Thursday'], u'time': u'22:00'} http://www.tvmaze.com/shows/9261/owners-manual None {u'thetvdb': 271341, u'tvrage': 33650, u'imdb'... 2013-08-15 <p><b>Owner's Manual</b> will test one of the ... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 9261 {u'country': {u'timezone': u'America/New_York'...
8 Ended {u'average': None} [] 0 1487011574 The Shire English {u'days': [u'Monday'], u'time': u'21:45'} http://www.tvmaze.com/shows/25288/the-shire None {u'thetvdb': 260677, u'tvrage': None, u'imdb':... 2012-07-16 <p>The series follows the lives and love of a ... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 25 Reality 25288 {u'country': {u'timezone': u'Australia/Sydney'...
9 Ended {u'average': None} [Comedy] 0 1483143763 The Montefuscos English {u'days': [u'Thursday'], u'time': u'20:00'} http://www.tvmaze.com/shows/24079/the-montefuscos None {u'thetvdb': 77616, u'tvrage': None, u'imdb': ... 1975-09-04 <p>The trials and tribulations of three genera... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Scripted 24079 {u'country': {u'timezone': u'America/New_York'...
10 Ended {u'average': None} [] 0 1464030266 I Want to Be a Hilton English {u'days': [u'Tuesday'], u'time': u'21:00'} http://www.tvmaze.com/shows/17541/i-want-to-be... None {u'thetvdb': 74419, u'tvrage': None, u'imdb': ... 2005-06-21 <p>Kathy Hilton, onetime actress and mother of... {u'previousepisode': {u'href': u'http://api.tv... None None 60 Reality 17541 {u'country': {u'timezone': u'America/New_York'...
11 Ended {u'average': None} [] 20 1478379662 ABC's Nightlife English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/22597/abcs-nightlife None {u'thetvdb': None, u'tvrage': None, u'imdb': u... 1964-11-09 <p><b>ABC's Nightlife</b> is a late night dail... {u'previousepisode': {u'href': u'http://api.tv... None None 105 Talk Show 22597 {u'country': {u'timezone': u'America/New_York'...
12 Running {u'average': None} [] 0 1454050022 Untying the Knot English {u'days': [u'Monday'], u'time': u'22:00'} http://www.tvmaze.com/shows/6843/untying-the-knot http://www.bravotv.com/untying-the-knot {u'thetvdb': 282527, u'tvrage': 42189, u'imdb'... 2014-06-04 <p>Vikki Ziegler, known as the Divorce Diva, i... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 6843 {u'country': {u'timezone': u'America/New_York'...
13 Ended {u'average': None} [Action] 0 1495406329 Wipeout Canada English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/12998/wipeout-canada None {u'thetvdb': 246631, u'tvrage': None, u'imdb':... 2011-04-03 <p><b>Wipeout Canada</b> is a hilarious game s... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Game Show 12998 {u'country': {u'timezone': u'Canada/Atlantic',...
14 Ended {u'average': None} [] 0 1464363967 Hurl! English {u'days': [u'Tuesday'], u'time': u'21:00'} http://www.tvmaze.com/shows/17705/hurl None {u'thetvdb': 82500, u'tvrage': None, u'imdb': ... 2008-07-15 <p>Get ready to get grossed out with G4's off-... {u'previousepisode': {u'href': u'http://api.tv... None None 30 Reality 17705 {u'country': {u'timezone': u'America/New_York'...
15 Ended {u'average': None} [] 0 1457450255 Meet the Parents English {u'days': [u'Thursday'], u'time': u'21:30'} http://www.tvmaze.com/shows/13973/meet-the-par... http://www.channel4.com/programmes/meet-the-pa... {u'thetvdb': 206381, u'tvrage': 26873, u'imdb'... 2010-11-18 <p><i>Meet the Parents</i> is a reality series... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 13973 {u'country': {u'timezone': u'Europe/London', u...
16 Ended {u'average': None} [Drama, Action] 0 1481553637 4th and Loud English {u'days': [u'Tuesday'], u'time': u'22:00'} http://www.tvmaze.com/shows/11854/4th-and-loud http://www.amc.com/shows/4th-and-loud {u'thetvdb': 284259, u'tvrage': None, u'imdb':... 2014-08-12 <p><b>4th and Loud</b> will follow the LA KISS... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 11854 {u'country': {u'timezone': u'America/New_York'...
17 Ended {u'average': None} [] 0 1495496078 It's Worth What? English {u'days': [u'Tuesday'], u'time': u'20:00'} http://www.tvmaze.com/shows/17619/its-worth-what None {u'thetvdb': 250186, u'tvrage': None, u'imdb':... 2011-07-19 <p><b>It's Worth What? </b>stars Cedric the En... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Game Show 17619 {u'country': {u'timezone': u'America/New_York'...
18 To Be Determined {u'average': 6.6} [Drama, Thriller, Adult] 92 1497788418 The Deleted English {u'days': [], u'time': u''} http://www.tvmaze.com/shows/19884/the-deleted https://www.fullscreen.com/series/the-deleted {u'thetvdb': 320679, u'tvrage': None, u'imdb':... 2016-12-04 <p>When escapees from a mysterious cult start ... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... {u'country': {u'timezone': u'America/New_York'... 15 Scripted 19884 None
19 Ended {u'average': 7.3} [Comedy, Action, Crime] 14 1500877446 V.I.P. English {u'days': [u'Saturday'], u'time': u''} http://www.tvmaze.com/shows/1885/vip None {u'thetvdb': 74181, u'tvrage': 6494, u'imdb': ... 1998-09-26 <p>A campy syndicated series about Vallery Iro... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Scripted 1885 {u'country': {u'timezone': u'America/New_York'...
20 Running {u'average': 6} [Drama] 63 1496679327 The Real Housewives of Atlanta English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/597/the-real-house... None {u'thetvdb': 84159, u'tvrage': 19672, u'imdb':... 2008-10-07 <p>An up-close and personal look at life in Ho... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 597 {u'country': {u'timezone': u'America/New_York'...
21 Running {u'average': None} [Comedy, Children] 0 1475116665 Pickle and Peanut English {u'days': [u'Monday'], u'time': u'18:30'} http://www.tvmaze.com/shows/3019/pickle-and-pe... http://disneyxd.disney.com/pickle-and-peanut {u'thetvdb': 300105, u'tvrage': 48178, u'imdb'... 2015-09-07 <p><i><b>"Pickle &amp; Peanut"</b></i> is abou... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... {u'country': {u'timezone': u'America/New_York'... 15 Animation 3019 {u'country': {u'timezone': u'America/New_York'...
22 Ended {u'average': None} [Drama, Comedy, Romance] 0 1501880843 Buckwild English {u'days': [u'Thursday'], u'time': u'22:00'} http://www.tvmaze.com/shows/25036/buckwild http://www.mtv.com/shows/buckwild {u'thetvdb': 264850, u'tvrage': None, u'imdb':... 2013-01-03 <p>The show follows the lives of nine young ad... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 25036 {u'country': {u'timezone': u'America/New_York'...
23 Ended {u'average': 3} [] 0 1486506841 Mystery Girls English {u'days': [u'Wednesday'], u'time': u'20:30'} http://www.tvmaze.com/shows/3950/mystery-girls http://abcfamily.go.com/shows/mystery-girls {u'thetvdb': 277020, u'tvrage': 35629, u'imdb'... 2014-06-25 <p>Two former detective TV show starlets broug... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Scripted 3950 {u'country': {u'timezone': u'America/New_York'...
24 Running {u'average': None} [Family] 12 1450883412 Celebrity Wife Swap English {u'days': [u'Wednesday'], u'time': u'22:00'} http://www.tvmaze.com/shows/1783/celebrity-wif... http://abc.go.com/shows/celebrity-wife-swap/ab... {u'thetvdb': 254524, u'tvrage': 31887, u'imdb'... 2012-01-02 <p>The spouses in two celebrity families with ... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 1783 {u'country': {u'timezone': u'America/New_York'...
25 Running {u'average': 7} [Comedy] 0 1472855087 Dish Nation English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/9199/dish-nation http://www.reelz.com/dish-nation/ {u'thetvdb': 271916, u'tvrage': None, u'imdb':... 2011-07-25 <p><i>Dish Nation</i> is a nightly syndicated ... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Scripted 9199 {u'country': {u'timezone': u'America/New_York'...
26 Ended {u'average': None} [Children] 0 1502544202 Ni Hao, Kai-lan English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/13161/ni-hao-kai-lan None {u'thetvdb': 82005, u'tvrage': None, u'imdb': ... 2008-02-07 <p>Ni Hao, Kai-lan , which is Mandarin for "He... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Animation 13161 {u'country': {u'timezone': u'America/New_York'...
27 Running {u'average': None} [Comedy, Family] 0 1502948333 Scaredy Squirrel English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/20564/scaredy-squi... http://www.scaredysquirrel.com {u'thetvdb': 250472, u'tvrage': None, u'imdb':... 2011-04-01 <p><b>Scaredy Squirrel </b>follows the adventu... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 10 Animation 20564 {u'country': {u'timezone': u'Canada/Atlantic',...
28 Running {u'average': None} [] 76 1502312151 Big Brother After Dark English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/18240/big-brother-... http://poptv.com/big_brother_after_dark {u'thetvdb': 81491, u'tvrage': None, u'imdb': ... 2007-07-05 <p><b>Big Brother After Dark</b> is the live, ... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 180 Reality 18240 {u'country': {u'timezone': u'America/New_York'...
29 Ended {u'average': 1} [Action, Adventure] 0 1474827145 American Paranormal English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/19115/american-par... None {u'thetvdb': 137691, u'tvrage': None, u'imdb':... 2010-01-24 <p>Whether it is the existence of aliens, the ... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 19115 {u'country': {u'timezone': u'America/New_York'...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
107 To Be Determined {u'average': None} [] 0 1495420105 Who's Doing the Dishes? English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/8612/whos-doing-th... None {u'thetvdb': 300384, u'tvrage': None, u'imdb':... 2014-09-01 <p><b>Who's Doing the Dishes?</b> is a UK game... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Game Show 8612 {u'country': {u'timezone': u'Europe/London', u...
108 Ended {u'average': None} [] 0 1474499818 I'm a Celebrity, Get Me Out of Here! NOW! English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/8558/im-a-celebrit... http://www.itv.com/imacelebrity/itv2-now {u'thetvdb': 264239, u'tvrage': None, u'imdb':... 2011-11-13 <p><i>"I'm a Celebrity...Get Me Out of Here! N... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 8558 {u'country': {u'timezone': u'Europe/London', u...
109 Ended {u'average': None} [Romance] 0 1474764176 More to Love English {u'days': [u'Tuesday'], u'time': u'21:00'} http://www.tvmaze.com/shows/21467/more-to-love None {u'thetvdb': 106801, u'tvrage': None, u'imdb':... 2009-07-28 <p>Follows one regular guy's search for love a... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 21467 {u'country': {u'timezone': u'America/New_York'...
110 Ended {u'average': None} [] 0 1467307858 I Want to Work for Diddy English {u'days': [u'Monday'], u'time': u'22:00'} http://www.tvmaze.com/shows/18829/i-want-to-wo... None {u'thetvdb': 87131, u'tvrage': None, u'imdb': ... 2008-08-04 <p>Diddy. He only needs one name, but he needs... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 18829 {u'country': {u'timezone': u'America/New_York'...
111 Ended {u'average': None} [] 0 1490997318 Donald J. Trump Presents: The Ultimate Merger English {u'days': [u'Thursday'], u'time': u'22:00'} http://www.tvmaze.com/shows/26564/donald-j-tru... None {u'thetvdb': 173981, u'tvrage': None, u'imdb':... 2010-06-17 <p>Through a series of challenges, both relati... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 26564 {u'country': {u'timezone': u'America/New_York'...
112 Running {u'average': None} [] 0 1477193039 America's Election Headquarters English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/11837/americas-ele... http://www.foxnews.com/on-air/americas-news-hq... {u'thetvdb': None, u'tvrage': None, u'imdb': u... 2008-04-22 <p><b>America's Election Headquarters</b> is a... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Talk Show 11837 {u'country': {u'timezone': u'America/New_York'...
113 Running {u'average': None} [] 45 1502693229 BBC Weekend News English {u'days': [u'Saturday', u'Sunday'], u'time': u''} http://www.tvmaze.com/shows/7333/bbc-weekend-news http://www.bbc.co.uk/programmes/b009m51q {u'thetvdb': 264762, u'tvrage': 31404, u'imdb'... 1954-07-05 <p><b>BBC Weekend News</b> is the national new... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None None News 7333 {u'country': {u'timezone': u'Europe/London', u...
114 Ended {u'average': None} [Comedy, Children] 0 1477293529 Barney & Friends English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/15482/barney-friends None {u'thetvdb': 72180, u'tvrage': None, u'imdb': ... 1992-04-06 <p><b>Barney &amp; Friends</b> is an American ... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Scripted 15482 {u'country': {u'timezone': u'America/New_York'...
115 Running {u'average': 7} [Comedy] 89 1503147213 The Powerpuff Girls English {u'days': [u'Sunday'], u'time': u'17:30'} http://www.tvmaze.com/shows/6771/the-powerpuff... None {u'thetvdb': 307473, u'tvrage': None, u'imdb':... 2016-04-04 <p>The city of Townsville may be a beautiful, ... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 15 Animation 6771 {u'country': {u'timezone': u'America/New_York'...
116 Running {u'average': None} [] 0 1497251730 TMZ English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/24857/tmz http://www.tmz.com/when-its-on?adid=tmz_web_na... {u'thetvdb': 147701, u'tvrage': None, u'imdb':... 2011-11-02 <p><b>TMZ </b>(also known simply as <i>TMZ</i>... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Talk Show 24857 {u'country': {u'timezone': u'America/New_York'...
117 Ended {u'average': 6} [] 0 1476263385 Kendra English {u'days': [u'Sunday'], u'time': u'22:00'} http://www.tvmaze.com/shows/21952/kendra None {u'thetvdb': 98371, u'tvrage': None, u'imdb': ... 2009-06-07 <p>Kendra Wilkinson finds herself at a crossro... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 21952 {u'country': {u'timezone': u'America/New_York'...
118 Ended {u'average': None} [] 0 1465065870 The Roseanne Show English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/12252/the-roseanne... None {u'thetvdb': 72141, u'tvrage': None, u'imdb': ... 1998-09-14 <p><i><b>"The Roseanne Show"</b></i> is a synd... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Talk Show 12252 {u'country': {u'timezone': u'America/New_York'...
119 Running {u'average': 1} [Music] 0 1484368515 The Xtra Factor Live English {u'days': [u'Saturday', u'Sunday'], u'time': u''} http://www.tvmaze.com/shows/3764/the-xtra-fact... None {u'thetvdb': 75567, u'tvrage': 12949, u'imdb':... 2004-09-04 <p>Thousands audition. Only one can win. The s... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 3764 {u'country': {u'timezone': u'Europe/London', u...
120 Ended {u'average': 7} [Comedy] 0 1488031177 But I'm Chris Jericho! English {u'days': [u'Tuesday'], u'time': u'22:00'} http://www.tvmaze.com/shows/13150/but-im-chris... http://butimchrisjericho.com {u'thetvdb': 275787, u'tvrage': None, u'imdb':... 2013-10-29 <p><b>But I'm Chris Jericho!</b> is an interac... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... {u'country': {u'timezone': u'America/New_York'... 8 Scripted 13150 {u'country': {u'timezone': u'Canada/Atlantic',...
121 Ended {u'average': 6} [Comedy] 0 1466802381 Party Over Here English {u'days': [u'Saturday'], u'time': u'23:00'} http://www.tvmaze.com/shows/12662/party-over-here http://www.fox.com/party-over-here {u'thetvdb': 308457, u'tvrage': 51439, u'imdb'... 2016-03-12 <p>A new late-night half-hour sketch comedy se... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Variety 12662 {u'country': {u'timezone': u'America/New_York'...
122 Ended {u'average': None} [Action, Adventure, Horror] 0 1500043650 Alaska Monsters English {u'days': [u'Saturday'], u'time': u'22:00'} http://www.tvmaze.com/shows/3124/alaska-monsters http://www.destinationamerica.com/tv-shows/ala... {u'thetvdb': 285286, u'tvrage': 44525, u'imdb'... 2014-09-12 <p>Treacherous terrain and unforgiving natural... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 3124 {u'country': {u'timezone': u'America/New_York'...
123 Ended {u'average': None} [Drama, Children] 0 1495726406 Abby's Ultimate Dance Competition English {u'days': [u'Tuesday'], u'time': u'21:00'} http://www.tvmaze.com/shows/9420/abbys-ultimat... http://www.mylifetime.com/shows/abbys-ultimate... {u'thetvdb': 261287, u'tvrage': 32847, u'imdb'... 2012-10-09 <p>Lifetime has picked-up the reality series <... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Game Show 9420 {u'country': {u'timezone': u'America/New_York'...
124 Ended {u'average': None} [Children, Mystery, Supernatural] 0 1502934987 The Othersiders English {u'days': [u'Wednesday'], u'time': u'21:00'} http://www.tvmaze.com/shows/9593/the-othersiders None {u'thetvdb': 250325, u'tvrage': None, u'imdb':... 2009-06-17 <p><b>The Othersiders</b> was an American para... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 9593 {u'country': {u'timezone': u'America/New_York'...
125 Ended {u'average': None} [] 0 1449520834 Canadian Idol English {u'days': [], u'time': u'20:00'} http://www.tvmaze.com/shows/9674/canadian-idol None {u'thetvdb': 72133, u'tvrage': None, u'imdb': ... 2003-06-09 <p><i><b>"Canadian Idol"</b></i> is a Canadian... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 9674 {u'country': {u'timezone': u'Canada/Atlantic',...
126 Ended {u'average': None} [] 0 1474314323 Extreme Makeover English {u'days': [u'Thursday'], u'time': u'21:00'} http://www.tvmaze.com/shows/21134/extreme-make... None {u'thetvdb': 72488, u'tvrage': None, u'imdb': ... 2002-12-11 <p>Three people are chosen to receive the make... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 21134 {u'country': {u'timezone': u'America/New_York'...
127 Ended {u'average': None} [] 0 1469556547 Pretty Wild English {u'days': [u'Sunday'], u'time': u'22:30'} http://www.tvmaze.com/shows/19522/pretty-wild None {u'thetvdb': 149371, u'tvrage': 25246, u'imdb'... 2010-03-14 {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 19522 {u'country': {u'timezone': u'America/New_York'...
128 Running {u'average': None} [Comedy] 0 1502570683 Just for Laughs: All Access English {u'days': [u'Saturday'], u'time': u'22:00'} http://www.tvmaze.com/shows/18044/just-for-lau... http://www.thecomedynetwork.ca/Shows/JustForLa... {u'thetvdb': 291820, u'tvrage': None, u'imdb':... 2012-10-12 <p>Comedians celebrate the 30th anniversary of... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Variety 18044 {u'country': {u'timezone': u'Canada/Atlantic',...
129 Ended {u'average': None} [] 0 1455387941 Jesse James is a Dead Man English {u'days': [u'Sunday'], u'time': u'22:00'} http://www.tvmaze.com/shows/12951/jesse-james-... None {u'thetvdb': 96071, u'tvrage': None, u'imdb': ... 2009-05-31 <p>Jesse James takes on the role of a modern-d... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 12951 {u'country': {u'timezone': u'America/New_York'...
130 Ended {u'average': None} [] 0 1488221218 Secretly Pregnant English {u'days': [u'Thursday'], u'time': u'22:00'} http://www.tvmaze.com/shows/25580/secretly-pre... http://www.discoverylife.com/tv-shows/secretly... {u'thetvdb': 287516, u'tvrage': None, u'imdb':... 2011-10-13 <p>The stories of women who, for various reaso... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 25580 {u'country': {u'timezone': u'America/New_York'...
131 Running {u'average': None} [Family] 13 1455319657 The Briefcase English {u'days': [], u'time': u'20:00'} http://www.tvmaze.com/shows/1831/the-briefcase None {u'thetvdb': 295059, u'tvrage': 48857, u'imdb'... 2015-05-27 <p>The show features a social experiment eleme... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 1831 {u'country': {u'timezone': u'America/New_York'...
132 Ended {u'average': None} [Comedy] 0 1492370917 PrankStars English {u'days': [], u'time': u''} http://www.tvmaze.com/shows/27206/prankstars None {u'thetvdb': 250280, u'tvrage': None, u'imdb':... 2011-07-15 <p>A hidden-camera series where unsuspecting t... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 27206 {u'country': {u'timezone': u'America/New_York'...
133 Ended {u'average': None} [] 0 1485549026 Cash Dome English {u'days': [u'Tuesday'], u'time': u'21:30'} http://www.tvmaze.com/shows/24751/cash-dome None {u'thetvdb': 272357, u'tvrage': None, u'imdb':... 2013-08-13 <p>For a quarter century, <b>Cash Dome</b> Jew... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 24751 {u'country': {u'timezone': u'America/New_York'...
134 Ended {u'average': None} [Comedy] 0 1502593090 CeeLo Green's The Good Life English {u'days': [u'Monday'], u'time': u'22:30'} http://www.tvmaze.com/shows/25900/ceelo-greens... None {u'thetvdb': 282130, u'tvrage': None, u'imdb':... 2014-06-23 <p>Follow CeeLo as he tackles not only a packe... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 25900 {u'country': {u'timezone': u'America/New_York'...
135 Ended {u'average': None} [] 0 1477193480 America's Prom Queen English {u'days': [u'Monday'], u'time': u'21:00'} http://www.tvmaze.com/shows/16384/americas-pro... None {u'thetvdb': None, u'tvrage': 18611, u'imdb': ... 2008-03-17 <p><b>America's Prom Queen</b> is a reality TV... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 16384 {u'country': {u'timezone': u'America/New_York'...
136 Ended {u'average': None} [] 0 1461445299 Hollywood Me English {u'days': [u'Wednesday'], u'time': u'20:00'} http://www.tvmaze.com/shows/15972/hollywood-me None {u'thetvdb': 271067, u'tvrage': None, u'imdb':... 2013-06-19 <p>Martyn Lawrence Bullard's normal clients in... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 15972 {u'country': {u'timezone': u'Europe/London', u...

137 rows × 20 columns

# Create a backup occasionally, and pickle after we've pulled the data
more_losers_backup = more_losers.copy()

DO_NOT_RUN = True  # Be sure to check the file name to write before enabling execution on this block

if not DO_NOT_RUN:
    pickle.dump( more_losers, open( "save_more_losers_df.p", "wb" ) )

Add a column to both shows (good) and losers (bad) to classify the rows as winners or losers

# All the data pulled from the API and placed in dataframes was pickled and written to disk.
# Read it all back in and add a column indicating whether each show was a winner or loser,
# then clean up and begin the analysis.
# $ ls *.p
# save_losers_df.p    save_more_losers_df.p     save_shows_df.p
# read data back in from the saved file
winners = pickle.load( open( "save_shows_df.p", "rb" ) )
losers1 = pickle.load( open( "save_losers_df.p", "rb" ) )
losers2 = pickle.load( open( "save_more_losers_df.p", "rb" ) )
print " Winners:", winners.shape
print " Losers1:", losers1.shape
print " Losers2:", losers2.shape
 Winners: (235, 20)
 Losers1: (229, 22)
 Losers2: (170, 20)
# Investigate why Losers1 has 22 columns; it must have been pickled after a change.
losers1.columns
Index([u'_links', u'code', u'externals', u'genres', u'id', u'image',
       u'language', u'message', u'name', u'network', u'officialSite',
       u'premiered', u'rating', u'runtime', u'schedule', u'status', u'summary',
       u'type', u'updated', u'url', u'webChannel', u'weight'],
      dtype='object')
losers2.columns
Index([u'status', u'rating', u'genres', u'weight', u'updated', u'name',
       u'language', u'schedule', u'url', u'officialSite', u'externals',
       u'premiered', u'summary', u'_links', u'image', u'webChannel',
       u'runtime', u'type', u'id', u'network'],
      dtype='object')
winners.columns
Index([u'status', u'rating', u'genres', u'weight', u'updated', u'name',
       u'language', u'schedule', u'url', u'officialSite', u'externals',
       u'premiered', u'summary', u'_links', u'image', u'webChannel',
       u'runtime', u'type', u'id', u'network'],
      dtype='object')
# Correct the issue by selecting only the 20 shared columns from losers1 into new_losers1
# (.copy() avoids a SettingWithCopyWarning when the winner column is added below)
cols = losers2.columns
new_losers1 = losers1[cols].copy()
new_losers1.shape
(229, 20)
# check that all three dataframes have same data in same order
winners.head(2)
status rating genres weight updated name language schedule url officialSite externals premiered summary _links image webChannel runtime type id network
0 Ended {u'average': 9.4} [Nature] 87 1490631396 Planet Earth II English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/22036/planet-earth-ii http://www.bbc.co.uk/programmes/p02544td {u'thetvdb': 318408, u'tvrage': None, u'imdb':... 2016-11-06 <p>David Attenborough presents a documentary s... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Documentary 22036 {u'country': {u'timezone': u'Europe/London', u...
1 Ended {u'average': 9.4} [Drama, Action, War, History] 86 1492651730 Band of Brothers English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/465/band-of-brothers http://www.hbo.com/band-of-brothers {u'thetvdb': 74205, u'tvrage': 2708, u'imdb': ... 2001-09-09 <p>Drawn from interviews with survivors of Eas... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Scripted 465 {u'country': {u'timezone': u'America/New_York'...
new_losers1.head(2)
status rating genres weight updated name language schedule url officialSite externals premiered summary _links image webChannel runtime type id network
0 Running {u'average': None} [] 63 1463447317 The Bill Cunningham Show English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/6068/the-bill-cunn... http://www.thebillcunninghamshow.com/ {u'thetvdb': 283995, u'tvrage': 40425, u'imdb'... 2011-09-19 <p><i><b>"The Bill Cunningham Show"</b>,</i> T... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Talk Show 6068 {u'country': {u'timezone': u'America/New_York'...
1 To Be Determined {u'average': None} [Comedy, Music] 0 1477139892 Six Degrees of Everything English {u'days': [u'Tuesday'], u'time': u'23:00'} http://www.tvmaze.com/shows/2821/six-degrees-o... http://www.trutv.com/shows/six-degrees-of-ever... {u'thetvdb': 299234, u'tvrage': 50418, u'imdb'... 2015-08-18 <p><b>Six Degrees of Everything</b> is a fast-... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Variety 2821 {u'country': {u'timezone': u'America/New_York'...
losers2.head(2)
status rating genres weight updated name language schedule url officialSite externals premiered summary _links image webChannel runtime type id network
0 Ended {u'average': None} [] 0 1449178946 Famous in 12 English {u'days': [u'Tuesday'], u'time': u'20:00'} http://www.tvmaze.com/shows/9024/famous-in-12 None {u'thetvdb': 279947, u'tvrage': 37045, u'imdb'... 2014-06-03 <p><i><b>"Famous in 12"</b></i>, the new unscr... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Reality 9024 {u'country': {u'timezone': u'America/New_York'...
1 Ended {u'average': None} [Comedy, Family] 14 1497059695 The Sharon Osbourne Show English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/19004/the-sharon-o... None {u'thetvdb': None, u'tvrage': 13173, u'imdb': ... 2006-08-29 <p>Daily talk show hosted by Sharon Osbourne.</p> {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Talk Show 19004 {u'country': {u'timezone': u'Europe/London', u...
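In addition to the visual check above, a quick programmatic check (a sketch, not part of the original run) can confirm that all three dataframes now share identical column order:

# Sketch: verify the three dataframes have the same columns in the same order
print list(winners.columns) == list(new_losers1.columns) == list(losers2.columns)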
# Add a column to classify the shows as winners or losers (not winners)
winners['winner'] = 1
new_losers1['winner'] = 0
losers2['winner'] = 0


Merge into one dataframe called shows

# now concatenate the loser data to the winner data; the result is the dataframe shows
shows = winners.copy()
shows = shows.append(new_losers1, ignore_index=True)
shows = shows.append(losers2, ignore_index=True)
shows.shape
(634, 21)
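Before cleaning, the class balance of the new winner label can be checked; a minimal sketch:

# Sketch: how many winners (1) vs losers (0) are in the combined dataframe
print shows['winner'].value_counts()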
shows.head()
status rating genres weight updated name language schedule url officialSite ... premiered summary _links image webChannel runtime type id network winner
0 Ended {u'average': 9.4} [Nature] 87 1490631396 Planet Earth II English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/22036/planet-earth-ii http://www.bbc.co.uk/programmes/p02544td ... 2016-11-06 <p>David Attenborough presents a documentary s... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Documentary 22036 {u'country': {u'timezone': u'Europe/London', u... 1
1 Ended {u'average': 9.4} [Drama, Action, War, History] 86 1492651730 Band of Brothers English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/465/band-of-brothers http://www.hbo.com/band-of-brothers ... 2001-09-09 <p>Drawn from interviews with survivors of Eas... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Scripted 465 {u'country': {u'timezone': u'America/New_York'... 1
2 Ended {u'average': 9.2} [Nature] 82 1502854135 Planet Earth English {u'days': [u'Sunday'], u'time': u'21:00'} http://www.tvmaze.com/shows/768/planet-earth http://www.bbc.co.uk/programmes/b006mywy ... 2006-03-05 <p>David Attenborough celebrates the amazing v... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Documentary 768 {u'country': {u'timezone': u'Europe/London', u... 1
3 Running {u'average': 9.3} [Drama, Adventure, Fantasy] 100 1502955537 Game of Thrones English {u'days': [u'Sunday'], u'time': u'21:00'} http://www.tvmaze.com/shows/82/game-of-thrones http://www.hbo.com/game-of-thrones ... 2011-04-17 <p>Based on the bestselling book series <i>A S... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Scripted 82 {u'country': {u'timezone': u'America/New_York'... 1
4 Ended {u'average': 9.3} [Drama, Crime, Thriller] 97 1502331382 Breaking Bad English {u'days': [u'Sunday'], u'time': u'22:00'} http://www.tvmaze.com/shows/169/breaking-bad http://www.amc.com/shows/breaking-bad ... 2008-01-20 <p><b>Breaking Bad</b> follows protagonist Wal... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Scripted 169 {u'country': {u'timezone': u'America/New_York'... 1

5 rows × 21 columns

# Check the id column for any duplicates. There will be some among the losers for two reasons:
#    1. During the first pull, the API rate limits were not known, so some rows were returned with the message
#       "Too Many Requests" rather than data; these need to be removed.
#    2. Some shows did not contain their own imdb number in the returned data, so when the list of imdb ids to
#       recheck was generated, they had to be included in the second attempt because they could not be
#       identified as already present in the first pull.

shows = shows[shows['name'] != 'Too Many Requests']
print shows.shape

print "Duplicate show IDs", shows.duplicated('id').sum()

# Display the duplicates to visually examine before dropping
# shows[shows.isin(shows[shows.duplicated()])].sort("ID")
shows[shows.duplicated('id')]
(498, 21)
Duplicate show IDs 6
status rating genres weight updated name language schedule url officialSite ... premiered summary _links image webChannel runtime type id network winner
601 Ended {u'average': None} [] 0 1477683583 Tyler Perry's House of Payne English {u'days': [u'Friday'], u'time': u'20:00'} http://www.tvmaze.com/shows/14013/tyler-perrys... None ... 2007-06-06 <p><b>Tyler Perry's House of Payne</b> is a co... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Scripted 14013 {u'country': {u'timezone': u'America/New_York'... 0
602 Ended {u'average': 3.3} [Comedy] 4 1502774582 The Inbetweeners English {u'days': [u'Monday'], u'time': u'22:30'} http://www.tvmaze.com/shows/1138/the-inbetweeners None ... 2012-08-20 <p><b>The Inbetweeners</b> takes a comedic loo... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Scripted 1138 {u'country': {u'timezone': u'America/New_York'... 0
603 Ended {u'average': 6} [Family] 0 1497646938 19 Kids and Counting English {u'days': [u'Tuesday'], u'time': u'21:00'} http://www.tvmaze.com/shows/969/19-kids-and-co... http://www.tlc.com/tv-shows/19-kids-and-counting/ ... 2008-09-29 <p><b>19 Kids and Counting</b> follows Michell... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Reality 969 {u'country': {u'timezone': u'America/New_York'... 0
604 Ended {u'average': 9} [Comedy, Food, Family] 0 1463627692 Talia in the Kitchen English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/2369/talia-in-the-... http://www.nick.com/talia-in-the-kitchen/ ... 2015-07-06 <p>When 14-year-old Talia visits her grandmoth... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 30 Scripted 2369 {u'country': {u'timezone': u'America/New_York'... 0
605 Running {u'average': 3.8} [] 48 1497310190 The Factor English {u'days': [u'Monday', u'Tuesday', u'Wednesday'... http://www.tvmaze.com/shows/9066/the-factor http://www.foxnews.com/shows/the-oreilly-facto... ... 1996-10-07 <p><b>The Factor</b>, originally titled <i>The... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 News 9066 {u'country': {u'timezone': u'America/New_York'... 0
606 Ended {u'average': None} [Drama, Comedy, Music] 0 1462214107 Viva Laughlin English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/6924/viva-laughlin None ... 2007-10-18 <p>A remake of the British series <i>Blackpool... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Scripted 6924 {u'country': {u'timezone': u'America/New_York'... 0

6 rows × 21 columns

# validate that these are really dups by looking at both rows with the duplicate id
shows[shows['id'] == 6924]
status rating genres weight updated name language schedule url officialSite ... premiered summary _links image webChannel runtime type id network winner
462 Ended {u'average': None} [Drama, Comedy, Music] 0 1462214107 Viva Laughlin English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/6924/viva-laughlin None ... 2007-10-18 <p>A remake of the British series <i>Blackpool... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Scripted 6924 {u'country': {u'timezone': u'America/New_York'... 0
606 Ended {u'average': None} [Drama, Comedy, Music] 0 1462214107 Viva Laughlin English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/6924/viva-laughlin None ... 2007-10-18 <p>A remake of the British series <i>Blackpool... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60 Scripted 6924 {u'country': {u'timezone': u'America/New_York'... 0

2 rows × 21 columns

# All 6 of these check out as true duplicates, so remove the 2nd instance of each
shows = shows.drop_duplicates(subset='id')
shows.shape
(492, 21)
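A quick re-check (sketch) confirms that no duplicate ids remain after the drop:

# Sketch: should print 0 after dropping duplicates
print shows.duplicated('id').sum()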
# make a copy, so there's a backup without having to re-pull the show info from the API or re-read the pickles and recombine
df_shows = shows.copy()
# Subdivide the columns so we can fit sections of the dataframe in notebook windows to see what we have
first_cols = df_shows.columns[1:10]
second_cols = df_shows.columns[10:17]
third_cols = df_shows.columns[17:]
df_shows[first_cols].head()
rating genres weight updated name language schedule url officialSite
0 {u'average': 9.4} [Nature] 87 1490631396 Planet Earth II English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/22036/planet-earth-ii http://www.bbc.co.uk/programmes/p02544td
1 {u'average': 9.4} [Drama, Action, War, History] 86 1492651730 Band of Brothers English {u'days': [u'Sunday'], u'time': u'20:00'} http://www.tvmaze.com/shows/465/band-of-brothers http://www.hbo.com/band-of-brothers
2 {u'average': 9.2} [Nature] 82 1502854135 Planet Earth English {u'days': [u'Sunday'], u'time': u'21:00'} http://www.tvmaze.com/shows/768/planet-earth http://www.bbc.co.uk/programmes/b006mywy
3 {u'average': 9.3} [Drama, Adventure, Fantasy] 100 1502955537 Game of Thrones English {u'days': [u'Sunday'], u'time': u'21:00'} http://www.tvmaze.com/shows/82/game-of-thrones http://www.hbo.com/game-of-thrones
4 {u'average': 9.3} [Drama, Crime, Thriller] 97 1502331382 Breaking Bad English {u'days': [u'Sunday'], u'time': u'22:00'} http://www.tvmaze.com/shows/169/breaking-bad http://www.amc.com/shows/breaking-bad
df_shows[second_cols].head()
externals premiered summary _links image webChannel runtime
0 {u'thetvdb': 318408, u'tvrage': None, u'imdb':... 2016-11-06 <p>David Attenborough presents a documentary s... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60
1 {u'thetvdb': 74205, u'tvrage': 2708, u'imdb': ... 2001-09-09 <p>Drawn from interviews with survivors of Eas... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60
2 {u'thetvdb': 79257, u'tvrage': 8077, u'imdb': ... 2006-03-05 <p>David Attenborough celebrates the amazing v... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60
3 {u'thetvdb': 121361, u'tvrage': 24493, u'imdb'... 2011-04-17 <p>Based on the bestselling book series <i>A S... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60
4 {u'thetvdb': 81189, u'tvrage': 18164, u'imdb':... 2008-01-20 <p><b>Breaking Bad</b> follows protagonist Wal... {u'previousepisode': {u'href': u'http://api.tv... {u'medium': u'http://static.tvmaze.com/uploads... None 60
df_shows[third_cols].head()
type id network winner
0 Documentary 22036 {u'country': {u'timezone': u'Europe/London', u... 1
1 Scripted 465 {u'country': {u'timezone': u'America/New_York'... 1
2 Documentary 768 {u'country': {u'timezone': u'Europe/London', u... 1
3 Scripted 82 {u'country': {u'timezone': u'America/New_York'... 1
4 Scripted 169 {u'country': {u'timezone': u'America/New_York'... 1

Cleanup and Organization of the DataFrame

# Cleanup and Organization

# The genres column is generally a list of strings, but is missing some values, and has empty lists for others.
#   1. Change all NaN to []
#   2. Convert all to strings
#   3. Use Count Vectorizer to make new columns for each genre
#   4. Remove existing genres column

df_shows['genres'] = df_shows['genres'].fillna(0).map(lambda x: [] if x == 0 else x)
df_shows['genres'] = df_shows['genres'].map(lambda x: ','.join(x))
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(
    binary=True,
    tokenizer=(lambda x: x.split(','))
    )
cvfit = cv.fit_transform(df_shows['genres']).todense()
genre_cols = pd.DataFrame(cvfit, columns=cv.get_feature_names())
genre_cols.rename(columns={'' : 'unknown'}, inplace=True)
genre_cols.columns
Index([        u'unknown',          u'action',           u'adult',
             u'adventure',           u'anime',        u'children',
                u'comedy',           u'crime',           u'drama',
             u'espionage',          u'family',         u'fantasy',
                  u'food',         u'history',          u'horror',
                 u'legal',         u'medical',           u'music',
               u'mystery',          u'nature',         u'romance',
       u'science-fiction',          u'sports',    u'supernatural',
              u'thriller',          u'travel',             u'war',
               u'western'],
      dtype='object')
new_genre_columns = []
for item in genre_cols:
    new_genre_columns.append('gn_' + item)
genre_cols.columns = new_genre_columns
genre_cols.head()
gn_unknown gn_action gn_adult gn_adventure gn_anime gn_children gn_comedy gn_crime gn_drama gn_espionage ... gn_mystery gn_nature gn_romance gn_science-fiction gn_sports gn_supernatural gn_thriller gn_travel gn_war gn_western
0 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 1 0 ... 0 0 0 0 0 0 1 0 0 0

5 rows × 28 columns

# Add the new genre columns to the df_shows dataframe
df_shows = pd.concat([df_shows, genre_cols], axis=1, join_axes=[df_shows.index])
df_shows = df_shows.drop('genres', 1)
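One caveat worth noting (an observation and sketch, not something the original run does): drop_duplicates leaves gaps in the label index of df_shows, while genre_cols, and the day and time frames built the same way below, carry a fresh 0..n-1 positional index. A label-aligned concat can therefore leave NaN in the new columns for rows whose labels fall outside that range, which appears consistent with the NaN-filled schedule columns seen in the missing-data scan further below; join_axes has also been removed from pd.concat in newer pandas versions. A minimal alternative sketch, assuming the index is reset right after dropping duplicates:

# Hypothetical alternative (not what was run above): reset the index right after dropping
# duplicates so that positionally built frames like genre_cols line up with df_shows by label,
# then concatenate without the deprecated join_axes argument.
df_shows_aligned = shows.drop_duplicates(subset='id').reset_index(drop=True)
df_shows_aligned = pd.concat([df_shows_aligned, genre_cols], axis=1).drop('genres', 1)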
# Genre information is missing for 69 loser shows and 13 winner shows

df_shows[df_shows['gn_unknown'] ==1][['gn_unknown', 'winner']].groupby(['winner']).count()
gn_unknown
winner
0 69
1 13
df_shows.columns
Index([            u'status',             u'rating',             u'weight',
                  u'updated',               u'name',           u'language',
                 u'schedule',                u'url',       u'officialSite',
                u'externals',          u'premiered',            u'summary',
                   u'_links',              u'image',         u'webChannel',
                  u'runtime',               u'type',                 u'id',
                  u'network',             u'winner',         u'gn_unknown',
                u'gn_action',           u'gn_adult',       u'gn_adventure',
                 u'gn_anime',        u'gn_children',          u'gn_comedy',
                 u'gn_crime',           u'gn_drama',       u'gn_espionage',
                u'gn_family',         u'gn_fantasy',            u'gn_food',
               u'gn_history',          u'gn_horror',           u'gn_legal',
               u'gn_medical',           u'gn_music',         u'gn_mystery',
                u'gn_nature',         u'gn_romance', u'gn_science-fiction',
                u'gn_sports',    u'gn_supernatural',        u'gn_thriller',
                u'gn_travel',             u'gn_war',         u'gn_western'],
      dtype='object')
# Convert the rating to a number
# sometimes the rating column is NaN, and sometimes the value for 'average' in the dictionary is NaN
# so the NaNs must be handled twice, once for each case
# This code first fills the missing dictionaries with -1 (a value chosen to signify no rating)
# It then sets the column to the average value in the rating dictionary, and if that is NaN converts to -1

df_shows['rating'] = df_shows['rating'].fillna(-1).map(lambda x: -1 if x == -1 else x['average']).fillna(-1)
# Rating information is missing for 192 loser shows and 6 winner shows
df_shows[df_shows['rating'] == -1][['rating', 'winner']].groupby(['winner']).count()
rating
winner
0 192
1 6
# Unpack 'schedule' into days, treating NaN in a similar way
df_shows['sched_day'] = df_shows['schedule'].fillna(0).map(lambda x: [] if x == 0 else x)
df_shows['sched_day'] = df_shows['sched_day'].map(lambda x: x if x == [] else x['days'])
df_shows['sched_day'] = df_shows['sched_day'].map(lambda x: ','.join(x))
cv = CountVectorizer(
    binary=True,
    tokenizer=(lambda x: x.split(','))
    )
cvfit = cv.fit_transform(df_shows['sched_day']).todense()
day_cols = pd.DataFrame(cvfit, columns=cv.get_feature_names())
day_cols.rename(columns={'' : 'unknown'}, inplace=True)
day_cols.columns
Index([  u'unknown',    u'friday',    u'monday',  u'saturday',    u'sunday',
        u'thursday',   u'tuesday', u'wednesday'],
      dtype='object')
new_day_columns = []
for item in day_cols:
    new_day_columns.append('sched_' + item)
day_cols.columns = new_day_columns
day_cols.head()
sched_unknown sched_friday sched_monday sched_saturday sched_sunday sched_thursday sched_tuesday sched_wednesday
0 0 0 0 0 1 0 0 0
1 0 0 0 0 1 0 0 0
2 0 0 0 0 1 0 0 0
3 0 0 0 0 1 0 0 0
4 0 0 0 0 1 0 0 0
# Add the new schedule day columns to the df_shows dataframe
df_shows = pd.concat([df_shows, day_cols], axis=1, join_axes=[df_shows.index])

df_shows = df_shows.drop('sched_day', 1)
# Scheduled Day information is missing for 15 loser shows and 45 winner shows

df_shows[df_shows['sched_unknown'] ==1][['sched_unknown', 'winner']].groupby(['winner']).count()
sched_unknown
winner
0 15
1 45
# Unpack 'schedule' into times treating NaN in a similar way.
# Samples with a valid show time will be HH:MM and missing values will be :
df_shows['sched_time'] = df_shows['schedule'].fillna(':').map(lambda x: x if x == ':' else x['time'])
df_shows['sched_time'] = df_shows['sched_time'].map(lambda x: ':' if x == '' else x)
# Scheduled Time information is missing for 35 loser shows and 61 winner shows

print len(df_shows[df_shows['sched_time'] == ':'])
df_shows[df_shows['sched_time'] == ':'][['sched_time', 'winner']].groupby(['winner']).count()
96
sched_time
winner
0 35
1 61
# Sched time is in HH:MM format as a string. I will leave this as string, and count vectorize it
print type(df_shows.loc[0,'sched_time'])

cv = CountVectorizer(
    binary=True,
    tokenizer=(lambda x: x.split(','))
    )
cvfit = cv.fit_transform(df_shows['sched_time']).todense()
time_cols = pd.DataFrame(cvfit, columns=cv.get_feature_names())
time_cols.rename(columns={':' : 'unknown'}, inplace=True)
time_cols.columns
<type 'unicode'>





Index([  u'00:00',   u'00:30',   u'00:50',   u'00:55',   u'01:00',   u'01:05',
         u'01:30',   u'01:35',   u'02:00',   u'02:05',   u'08:00',   u'10:00',
         u'11:00',   u'12:00',   u'13:00',   u'13:30',   u'14:00',   u'14:30',
         u'15:00',   u'15:15',   u'16:00',   u'16:30',   u'17:00',   u'17:15',
         u'17:30',   u'18:00',   u'18:30',   u'19:00',   u'19:30',   u'19:45',
         u'20:00',   u'20:15',   u'20:30',   u'20:40',   u'20:45',   u'20:55',
         u'21:00',   u'21:10',   u'21:15',   u'21:30',   u'21:45',   u'22:00',
         u'22:10',   u'22:30',   u'22:35',   u'23:00',   u'23:02',   u'23:15',
         u'23:30', u'unknown'],
      dtype='object')
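As an aside, a hypothetical alternative to dummying every distinct start time (not used in this notebook) would be a single numeric hour-of-day feature derived from the same strings:

# Hypothetical alternative feature (not applied here): numeric hour of day, with -1 for the
# unknown ':' placeholder, instead of ~50 one-hot time columns.
def time_to_hour(t):
    if t == ':':
        return -1.0
    hh, mm = t.split(':')
    return float(hh) + float(mm) / 60.0

sched_hour = df_shows['sched_time'].map(time_to_hour)   # kept in a separate variable; not added to df_shows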
new_time_columns = []
for item in time_cols:
    new_time_columns.append('sched_time_' + item)
time_cols.columns = new_time_columns
time_cols.head()
sched_time_00:00 sched_time_00:30 sched_time_00:50 sched_time_00:55 sched_time_01:00 sched_time_01:05 sched_time_01:30 sched_time_01:35 sched_time_02:00 sched_time_02:05 ... sched_time_21:45 sched_time_22:00 sched_time_22:10 sched_time_22:30 sched_time_22:35 sched_time_23:00 sched_time_23:02 sched_time_23:15 sched_time_23:30 sched_time_unknown
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 50 columns

# Add the new schedule time columns to the df_shows dataframe
df_shows = pd.concat([df_shows, time_cols], axis=1, join_axes=[df_shows.index])

df_shows = df_shows.drop('schedule', 1)
df_shows = df_shows.drop('sched_time', 1)
print df_shows.columns
Index([            u'status',             u'rating',             u'weight',
                  u'updated',               u'name',           u'language',
                      u'url',       u'officialSite',          u'externals',
                u'premiered',
       ...
         u'sched_time_21:45',   u'sched_time_22:00',   u'sched_time_22:10',
         u'sched_time_22:30',   u'sched_time_22:35',   u'sched_time_23:00',
         u'sched_time_23:02',   u'sched_time_23:15',   u'sched_time_23:30',
       u'sched_time_unknown'],
      dtype='object', length=105)

# Print out a network dictionary to learn how to unpack the structure
df_shows.loc[0,'network']
{u'country': {u'code': u'GB',
  u'name': u'United Kingdom',
  u'timezone': u'Europe/London'},
 u'id': 12,
 u'name': u'BBC One'}
# 25 shows have no network info; these might need to be dropped later, but for now the missing values are coded as empty strings
df_shows['network'].isnull().sum()
25
# Unpack 'network' into country code, country name, timezone, network id and network name, treating NaN in a similar way
df_shows['country_code'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['country'])
df_shows['country_code'] = df_shows['country_code'].map(lambda x: x if x == '' else x['code'])

df_shows['country_name'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['country'])
df_shows['country_name'] = df_shows['country_name'].map(lambda x: x if x == '' else x['name'])

df_shows['country_tz'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['country'])
df_shows['country_tz'] = df_shows['country_tz'].map(lambda x: x if x == '' else x['timezone'])

df_shows['network_id'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['id'])
df_shows['network_name'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['name'])

df_shows = df_shows.drop(['network'], 1)
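The repeated fill-and-map pattern above could be wrapped in a small helper; the function below is a hypothetical refactoring sketch (its name and defaults are assumptions, not part of the original run), shown as it would be used before the 'network' column is dropped:

# Hypothetical helper (not in the original run): pull one key out of a dict-valued column,
# returning a default for missing dicts or missing keys.
def unpack(series, key, default=''):
    return series.map(lambda d: d.get(key, default) if isinstance(d, dict) else default)

# Example usage, before df_shows.drop(['network'], 1):
#   country = unpack(df_shows['network'], 'country', default={})
#   df_shows['country_code'] = unpack(country, 'code')
#   df_shows['network_name'] = unpack(df_shows['network'], 'name')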
# Country and network information is missing for 4 loser shows and 21 winner shows

df_shows[df_shows['country_code'] == ''] [['country_code', 'winner']].groupby(['winner']).count()
country_code
winner
0 4
1 21
df_shows[['country_code', 'country_name', 'country_tz', 'network_id', 'network_name']].head()
country_code country_name country_tz network_id network_name
0 GB United Kingdom Europe/London 12 BBC One
1 US United States America/New_York 8 HBO
2 GB United Kingdom Europe/London 12 BBC One
3 US United States America/New_York 8 HBO
4 US United States America/New_York 20 AMC
df_shows[['updated', 'premiered']].head()
updated premiered
0 1490631396 2016-11-06
1 1492651730 2001-09-09
2 1502854135 2006-03-05
3 1502955537 2011-04-17
4 1502331382 2008-01-20
# Updated date is complete, premiered date is missing 6 values

print df_shows['updated'].isnull().sum()
print df_shows['premiered'].isnull().sum()

0
6
# Represent 'updated' as a real datetime object; currently it is an integer of seconds since the epoch (1970)
# Convert the epoch seconds to a datetime with datetime.fromtimestamp
import datetime
print type(df_shows.loc[0,'updated'])

df_shows['updated'] = df_shows['updated'].fillna(0).apply(lambda x: x if x == 0 else datetime.datetime.fromtimestamp(x))

<type 'int'>
# Turn 'premiered' into a real datetime object; currently it is a string that needs to be converted to a date
print type(df_shows.loc[0,'premiered'])
df_shows['premiered'] = df_shows['premiered'].fillna(0).apply(lambda x: x if x == 0 else datetime.datetime.strptime(x, '%Y-%m-%d'))
<type 'unicode'>
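Equivalently (an alternative sketch, not what the notebook ran), pandas can do both conversions in a vectorized way on the raw values, with NaT rather than 0 marking the missing premiere dates. The lines are left commented because the columns above have already been converted at this point:

# Hypothetical alternative (would replace the two conversions above, applied to the raw values).
# Note pd.to_datetime(..., unit='s') yields UTC, while datetime.fromtimestamp above gave local time.
# df_shows['updated'] = pd.to_datetime(df_shows['updated'], unit='s')
# df_shows['premiered'] = pd.to_datetime(df_shows['premiered'], format='%Y-%m-%d', errors='coerce')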
df_shows[['updated', 'premiered']].head()
updated premiered
0 2017-03-27 12:16:36 2016-11-06 00:00:00
1 2017-04-19 21:28:50 2001-09-09 00:00:00
2 2017-08-15 23:28:55 2006-03-05 00:00:00
3 2017-08-17 03:38:57 2011-04-17 00:00:00
4 2017-08-09 22:16:22 2008-01-20 00:00:00
# Updated date is complete, premiered date is missing 6 values, all from loser shows

df_shows[df_shows['premiered'] == 0] [['premiered', 'winner']].groupby(['winner']).count()
premiered
winner
0 6
# Drop columns not useful for analysis

# webChannel has no or insufficient useful information, can drop
print "webChannel null count:", df_shows['webChannel'].isnull().sum()

# url, officialSite, externals, _links, image, webChannel
df_shows = df_shows.drop(['url', 'officialSite', 'externals', '_links', 'image', 'webChannel', ], 1)
webChannel null count: 464
# Looks like runtime is already an integer number of minutes
# runtime is missing 9 values, 5 winners and 4 losers
print type(df_shows.loc[0,'runtime'])
print df_shows['runtime'].isnull().sum(), " null values"
# df_shows['runtime'].value_counts()


<type 'int'>
9  null values
df_shows[df_shows['runtime'].isnull()][['runtime', 'winner']]
runtime winner
17 None 1
65 None 1
137 None 1
144 None 1
198 None 1
544 None 0
556 None 0
577 None 0
609 None 0
# The summary contains HTML tags but is otherwise a plain string; the tags will be removed in the text-processing steps below
print df_shows.loc[0,'summary']
print df_shows['summary'].isnull().sum(), " null values"
<p>David Attenborough presents a documentary series exploring how animals meet the challenges of surviving in the most iconic habitats on earth.</p>
1  null values
df_shows[df_shows['summary'].isnull()]
status rating weight updated name language premiered summary runtime type ... sched_time_23:00 sched_time_23:02 sched_time_23:15 sched_time_23:30 sched_time_unknown country_code country_name country_tz network_id network_name
570 Ended -1.0 75 2017-04-18 17:55:28 Chop Socky Chooks English 2008-03-07 00:00:00 None 11 Animation ... NaN NaN NaN NaN NaN US United States America/New_York 11 Cartoon Network

1 rows × 104 columns

# This one with the missing summary, Chop Socky Chooks, is missing other information also, and will be dropped.
# Too bad,  looks like a truly dreadful one that would be good for the very bottom of the losers list.
df_shows = df_shows[df_shows['summary'].notnull()]
df_shows.shape
(491, 104)
# Use textacy to clean the html tags, punctuation, etc. from the summary text
from textacy.preprocess import preprocess_text

df_shows['summary'] = df_shows['summary'].map(lambda x: preprocess_text(x, fix_unicode=True, lowercase=True, \
                              transliterate=False, no_contractions = True,
                              no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True))
print df_shows.loc[1,'summary']
print
print df_shows.loc[2,'summary']
<p>drawn from interviews with survivors of easy company as well as their journals and letters <b>band of brothers<b> chronicles the experiences of these men from paratrooper training in georgia through the end of the war as an elite rifle company parachuting into normandy early on dday morning participants in the battle of the bulge and witness to the horrors of war the men of easy knew extraordinary bravery and extraordinary fear and became the stuff of legend based on stephen e ambroses acclaimed book of the same name<p>

<p>david attenborough celebrates the amazing variety of the natural world in this epic documentary series filmed over four years across 64 different countries<p>
# Looks like all the summaries still have HTML paragraph <p> and bold <b> tags, and textacy hasn't removed them.
# These lambda functions strip them out

import string
df_shows['summary'] = df_shows['summary'].map(lambda x: x.replace('<p>',''))
df_shows['summary'] = df_shows['summary'].map(lambda x: x.replace('<b>',''))
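Since BeautifulSoup is already imported for the scraping step, a more general way to drop whatever tags remain is to extract the text directly; an alternative sketch, not what was run here:

# Hypothetical alternative (not applied above): strip all remaining HTML tags, not just <p> and <b>.
df_shows['summary'] = df_shows['summary'].map(lambda x: BeautifulSoup(x, 'html.parser').get_text())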
# This looks better for analysis
print df_shows.loc[1,'summary']
drawn from interviews with survivors of easy company as well as their journals and letters band of brothers chronicles the experiences of these men from paratrooper training in georgia through the end of the war as an elite rifle company parachuting into normandy early on dday morning participants in the battle of the bulge and witness to the horrors of war the men of easy knew extraordinary bravery and extraordinary fear and became the stuff of legend based on stephen e ambroses acclaimed book of the same name
df_shows[df_shows.isnull().any(axis=1)]
status rating weight updated name language premiered summary runtime type ... sched_time_23:00 sched_time_23:02 sched_time_23:15 sched_time_23:30 sched_time_unknown country_code country_name country_tz network_id network_name
17 Ended 9.0 0 1455913373 The Decalogue Polish 1989-12-10 00:00:00 <p>Ten television drama films, each one based ... None Variety ... 0.0 0.0 0.0 0.0 1.0 PL Poland Europe/Warsaw 336 TVP1
65 Ended 9.0 85 1501781828 Sherlock Holmes English 1984-04-24 00:00:00 <p><b>Sherlock Holmes</b> is one of the world'... None Scripted ... 0.0 0.0 0.0 0.0 0.0 GB United Kingdom Europe/London 35 ITV
137 Running 8.7 98 1489944935 Taboo English 2017-01-07 00:00:00 <p>1814: James Keziah Delaney returns to Londo... None Scripted ... 0.0 0.0 0.0 0.0 0.0 GB United Kingdom Europe/London 12 BBC One
144 Ended 8.6 42 1494693177 The New Batman Adventures English 1997-09-13 00:00:00 <p>The New Batman Adventures comes from the cr... None Animation ... 0.0 0.0 0.0 0.0 1.0 US United States America/New_York 71 The WB
198 Ended 9.0 0 1491564027 The Larry Sanders Show English 1992-08-15 00:00:00 <p>Comic Garry Shandling draws upon his own ta... None Scripted ... 0.0 0.0 0.0 0.0 1.0 US United States America/New_York 8 HBO
492 Running -1.0 76 1502312151 Big Brother After Dark English 2007-07-05 00:00:00 <p><b>Big Brother After Dark</b> is the live, ... 180 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 88 Pop
493 Ended 1.0 0 1474827145 American Paranormal English 2010-01-24 00:00:00 <p>Whether it is the existence of aliens, the ... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 42 National Geographic Channel
494 Ended -1.0 11 1469108505 Homeboys in Outer Space English 1996-08-27 00:00:00 <p>The plot centers around two astronauts, Tyb... 30 Scripted ... NaN NaN NaN NaN NaN US United States America/New_York 70 UPN
495 Ended -1.0 0 1485097253 Gainesville: Friends Are Family English 2015-08-20 00:00:00 <p><i><b>"Gainesville: Friends Are Family"</b>... 30 Documentary ... NaN NaN NaN NaN NaN US United States America/New_York 173 CMT
496 Ended -1.0 0 1449234102 The Show with Vinny English 2013-05-01 00:00:00 <p>Vinny Guadagnino invites musicians, TV star... 30 Talk Show ... NaN NaN NaN NaN NaN US United States America/New_York 22 MTV
497 Ended -1.0 0 1457985576 Gormiti Nature Unleashed French 2013-04-01 00:00:00 <p>Gormiti Nature Unleashed is an Italian CGI ... 25 Animation ... NaN NaN NaN NaN NaN FR France Europe/Paris 1050 Canal J
498 Ended -1.0 23 1483294279 Denise Richards: It's Complicated English 2008-05-26 00:00:00 <p><b>Denise Richards: It's Complicated</b> is... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 43 E!
499 Ended -1.0 0 1482875019 Stanley English 1956-09-24 00:00:00 <p><b>Stanley</b> revolved around the adventur... 30 Scripted ... NaN NaN NaN NaN NaN US United States America/New_York 1 NBC
500 Ended 1.0 0 1468782928 Uncovering Aliens English 2013-12-15 00:00:00 <p>Across America, there are more UFO sighting... 60 Documentary ... NaN NaN NaN NaN NaN US United States America/New_York 92 Animal Planet
501 Ended -1.0 0 1477142177 Bulging Brides English 2008-01-31 00:00:00 <p><b>Bulging Brides</b> is a television serie... 30 Reality ... NaN NaN NaN NaN NaN CA Canada Canada/Atlantic 472 Slice
502 Running 6.7 0 1502923678 Never Ever Do This at Home English 2013-05-06 00:00:00 <p><b>Never Ever Do This at Home</b> is a come... 30 Reality ... NaN NaN NaN NaN NaN CA Canada Canada/Atlantic 298 Discovery Channel
503 Ended -1.0 0 1465987779 Hello Ross English 2013-09-06 00:00:00 <p><i><b>"Hello Ross"</b></i> is the new weekl... 30 Talk Show ... NaN NaN NaN NaN NaN US United States America/New_York 43 E!
504 Ended -1.0 0 1499803314 3 English 2012-07-26 00:00:00 <p><b>3</b> is a new relationship series in wh... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 2 CBS
505 Ended -1.0 0 1495568447 Trexx and Flipside English 0 <p>Wannabe hip hop stars but their music label... 30 Scripted ... NaN NaN NaN NaN NaN GB United Kingdom Europe/London 49 BBC Three
506 Running 8.5 96 1503483430 The Real Housewives of Orange County English 2006-03-21 00:00:00 <p>These ladies show no signs of slowing down ... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 52 Bravo
507 Ended 5.3 16 1479782037 Skins English 2011-01-17 00:00:00 <p><b>Skins</b> is about the lives and loves o... 60 Scripted ... NaN NaN NaN NaN NaN US United States America/New_York 22 MTV
508 Running -1.0 73 1503490679 Dr. Phil English 2002-09-16 00:00:00 <p>The <b>Dr. Phil</b> show provides the most ... 60 Talk Show ... NaN NaN NaN NaN NaN US United States America/New_York 72 Syndication
509 Running 7.5 50 1497449904 My Big Fat American Gypsy Wedding English 2012-04-29 00:00:00 <p>Going inside the hidden world of American G... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 80 TLC
510 Running 1.0 0 1479731918 Mystery Diners English 2012-05-20 00:00:00 <p>When a restaurant owner suspects employees ... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 81 Food Network
511 Running -1.0 0 1498393231 Pig Goat Banana Cricket English 2015-07-18 00:00:00 <p><b>Pig Goat Banana Cricket</b> features a s... 30 Animation ... NaN NaN NaN NaN NaN US United States America/New_York 73 nicktoons
512 Ended -1.0 9 1460230772 Jerseylicious English 2010-03-21 00:00:00 <p>Jerseylicious is a reality show which takes... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 184 Esquire Network
513 Ended -1.0 38 1501384818 South Beach Tow English 2011-07-20 00:00:00 <p>The <b>South Beach Tow</b> crew returns to ... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 84 truTV
514 Ended -1.0 0 1466882679 Starhyke English 2009-11-30 00:00:00 <p>It's the year 3034. Everyone on Earth has b... 30 Scripted ... NaN NaN NaN NaN NaN GB United Kingdom Europe/London 324 Showcase TV
515 Ended -1.0 0 1496675604 Making the Band English 2000-03-24 00:00:00 <p><b>Making the Band</b> was the brainchild o... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 22 MTV
516 Running 4.5 68 1480821374 Second Jen English 2016-08-28 00:00:00 <p><b>Second Jen</b> is a ground-breaking scri... 30 Scripted ... NaN NaN NaN NaN NaN CA Canada Canada/Atlantic 151 City
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
598 Ended -1.0 0 1502593090 CeeLo Green's The Good Life English 2014-06-23 00:00:00 <p>Follow CeeLo as he tackles not only a packe... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 32 TBS
599 Ended -1.0 0 1477193480 America's Prom Queen English 2008-03-17 00:00:00 <p><b>America's Prom Queen</b> is a reality TV... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 26 FreeForm
600 Ended -1.0 0 1461445299 Hollywood Me English 2013-06-19 00:00:00 <p>Martyn Lawrence Bullard's normal clients in... 30 Reality ... NaN NaN NaN NaN NaN GB United Kingdom Europe/London 45 Channel 4
607 Ended -1.0 52 1490313454 Utopia English 2014-09-07 00:00:00 <p>Get ready to witness the birth of a brave n... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 4 FOX
608 Running -1.0 93 1499236738 Storage Wars: Canada English 2013-08-29 00:00:00 <p>On a daily basis, high-stakes buyers descen... 30 Reality ... NaN NaN NaN NaN NaN CA Canada Canada/Atlantic 350 OLN
609 Ended -1.0 0 1502217322 Big Brother English 2001-04-23 00:00:00 <p><b>Big Brother Australia</b> is based on th... None Reality ... NaN NaN NaN NaN NaN AU Australia Australia/Sydney 120 Nine Network
610 Ended -1.0 0 1497307824 The Vineyard English 2013-07-23 00:00:00 <p>ABC Family's newest original docu-series, <... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 26 FreeForm
611 Running -1.0 0 1503655502 Na dobre i na złe Polish 1999-11-07 00:00:00 60 Scripted ... NaN NaN NaN NaN NaN PL Poland Europe/Warsaw 333 TVP2
612 Ended -1.0 0 1477348482 Big Top English 2009-12-02 00:00:00 <p><b>Big Top</b> was a sit-com that aired on ... 30 Scripted ... NaN NaN NaN NaN NaN GB United Kingdom Europe/London 12 BBC One
613 Running 9.0 0 1468322551 MTV Suspect English 2016-02-23 00:00:00 <p>Across America, people are hiding deep secr... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 22 MTV
614 Ended -1.0 0 1497305713 Kimora Life in the Fab Lane English 2007-08-05 00:00:00 <p>A glimpse into the life of former model Kim... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 43 E!
615 Ended -1.0 0 1490293113 Celebrities Undercover English 2014-03-18 00:00:00 <p>Celebrities are used to transforming into o... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 79 Oxygen
616 Running -1.0 0 1458216770 Recipe for Deception English 2016-01-21 00:00:00 <p>Bravo Media cooks up a battle of secrets an... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 52 Bravo
617 Ended -1.0 0 1481538915 16 Kids and Counting English 2013-01-11 00:00:00 <p>What's life like when you have enough child... 60 Documentary ... NaN NaN NaN NaN NaN GB United Kingdom Europe/London 45 Channel 4
618 Ended -1.0 0 1484475919 A Poet's Guide to Britain English 2009-05-04 00:00:00 <p>Poet and author Owen Sheers presents a seri... 30 Documentary ... NaN NaN NaN NaN NaN GB United Kingdom Europe/London 51 BBC Four
619 Running -1.0 94 1502640953 The Bold and the Beautiful English 1987-03-23 00:00:00 <p>They created a dynasty where passion rules,... 30 Scripted ... NaN NaN NaN NaN NaN US United States America/New_York 2 CBS
620 Running -1.0 99 1502485797 Life of Kylie English 2017-08-06 00:00:00 <p><b>Life of Kylie</b> will follow Kylie Jenn... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 43 E!
621 Ended 6.0 0 1502487937 Jersey Shore English 2009-12-03 00:00:00 <p>Grab your hair gel, wax that Cadillac and g... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 43 E!
622 Ended -1.0 0 1485103110 The Hills English 2006-05-31 00:00:00 <p>In the final season of <b>The Hills</b> - K... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 22 MTV
623 Running 2.7 91 1500442171 Teen Mom English 2009-12-08 00:00:00 <p>In 16 and Pregnant, they were moms-to-be. N... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 22 MTV
624 Ended 5.7 66 1489774713 Coupling English 2003-09-25 00:00:00 <p><b>Coupling</b> is an American remake of th... 30 Scripted ... NaN NaN NaN NaN NaN US United States America/New_York 1 NBC
625 Running -1.0 0 1486846250 Access Hollywood Live English 1996-09-09 00:00:00 <p><b>Access Hollywood Live</b> is a weekday t... 60 Variety ... NaN NaN NaN NaN NaN US United States America/New_York 75 REELZ
626 To Be Determined -1.0 0 1462596807 The First Family English 2012-09-17 00:00:00 <p><i><b>"The First Family"</b></i> is an Amer... 30 Scripted ... NaN NaN NaN NaN NaN US United States America/New_York 5 The CW
627 Ended 10.0 0 1502461972 Garbage Pail Kids English 0 <p>From deep within the historic TV animation ... 25 Animation ... NaN NaN NaN NaN NaN US United States America/New_York 2 CBS
628 Ended -1.0 50 1497743151 Khloé & Lamar English 2011-04-10 00:00:00 <p>In <b>Khloé &amp; Lamar</b>, cameras will f... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 43 E!
629 Ended -1.0 0 1482948423 The Paul Reiser Show English 2011-04-14 00:00:00 <p>Paul Reiser plays a fictional version of hi... 30 Scripted ... NaN NaN NaN NaN NaN US United States America/New_York 1 NBC
630 Ended -1.0 0 1485719969 Pretty Wicked Moms English 2013-06-04 00:00:00 <p>Six Atlanta moms give a whole new meaning t... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 18 Lifetime
631 Ended -1.0 0 1502430474 The Wright Way English 2013-04-23 00:00:00 <p>Gerald Wright runs the Baselricky Council H... 30 Scripted ... NaN NaN NaN NaN NaN GB United Kingdom Europe/London 12 BBC One
632 Ended -1.0 0 1474119411 High School Musical: Get in the Picture English 2008-07-20 00:00:00 <p>A group of teenagers are invited to partici... 60 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 3 ABC
633 Ended -1.0 0 1477283569 Audrina English 2011-04-17 00:00:00 <p>Besides Audrina's blossoming career and tum... 30 Reality ... NaN NaN NaN NaN NaN US United States America/New_York 55 VH1

141 rows × 103 columns

# What do we have that is mostly complete
print df_shows[~df_shows.isnull().any(axis=1)]['winner'].value_counts()
df_shows_notnull = df_shows[~df_shows.isnull().any(axis=1)]
1    209
0    117
Name: winner, dtype: int64
# In the processing above, NaNs were replaced by other values for some columns.  This block creates a new
# dataframe where all rows with these coded values representing missing data have been removed.

df_shows_complete = df_shows_notnull[(df_shows_notnull['rating'] != -1) & \
                                     (df_shows_notnull['gn_unknown'] != 1) & \
                                     (df_shows_notnull['sched_unknown'] != 1) & \
                                     (df_shows_notnull['sched_time_unknown'] != 1) & \
                                     (df_shows_notnull['country_code'] != '') & \
                                     (df_shows_notnull['country_name'] != '') & \
                                     (df_shows_notnull['country_tz'] != '') & \
                                     (df_shows_notnull['network_id'] != '') & \
                                     (df_shows_notnull['network_name'] != '') & \
                                     (df_shows_notnull['premiered'] != 0)]
df_shows_complete.shape
(157, 103)
# Cool, at least not missing any summaries for samples that are otherwise complete
df_shows_complete['summary'].isnull().sum()
0
df_shows[['summary', 'winner']].groupby(['winner']).count()
summary
winner
0 256
1 235

Modeling Section

– Note: Cells in this section must be run sequentially to obtain correct results, as some variables are reused across the modeling subsections

Vectorize summary text in different ways

# I'll first try a model with just the summary text, which is available for 491 shows: 256 losers and 235 winners


# Use NLP techniques to create lots of factors
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from collections import Counter

# Use different Vectorizers to find ngrams for us
tfidf = TfidfVectorizer(ngram_range=(2,4), max_features=2000, stop_words='english')
cvec = CountVectorizer(ngram_range=(2,4), max_features=2000, stop_words='english')
hvec = HashingVectorizer(ngram_range=(2,4), n_features=2000, stop_words='english')

X_tfidf = tfidf.fit_transform(df_shows['summary']).todense()
X_cvec = cvec.fit_transform(df_shows['summary']).todense()
X_hvec = hvec.fit_transform(df_shows['summary']).todense()

y = df_shows['winner'].values

print '\ntfidf shape:', X_tfidf.shape
print '\ncvec shape:', X_cvec.shape
print '\nhvec shape:', X_hvec.shape
print len(y)

tfidf shape: (491, 2000)

cvec shape: (491, 2000)

hvec shape: (491, 2000)
491

Model on summary text using Count Vectorizer

  • results were best when Count Vectorizer scores were modeled with Gaussian Naive Bayes

Features: 2000
Train Set Accuracy: 0.905
CrossVal Accuracy: 0.644 +/- 0.028
Test Set Accuracy: 0.626

**n-grams with highest cumulative sum of Count Vectorizer scores for winners:** ‘drama series’, ‘david attenborough’, ‘tells story’, ‘young boy’, ‘anthology series’, ‘documentary series’, ‘years later’, ‘main character’, ‘trials tribulations’, ‘crime drama’, ‘serial killer’, ‘tv history’, ‘super hero’, ‘story starts goku’, ‘starts goku’, ‘story starts’, ‘american television’, ‘fictional town’, ‘television drama’, ‘american crime’

**n-grams with highest cumulative sum of Count Vectorizer scores for losers:** ‘real housewives’, ‘television series’, ‘reality series’, ‘follows lives’, ‘series produced’, ‘pop culture’, ‘reality television’, ‘reality television series’, ‘animated series’, ‘come true’, ‘aired abc’, ‘reality tv’, ‘series debuted’, ‘real housewives orange county’, ‘real housewives orange’, ‘housewives orange’, ‘housewives orange county’, ‘talk hosted’, ‘studio audience’, ‘cash prize’

# Baseline for training set
winner_avg = y.mean()
baseline = max(winner_avg, 1-winner_avg)
print baseline
0.521384928717
# Test Train Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_cvec, y, test_size=0.25)
print X_train.shape,  len(y_train)
print X_test.shape,  len(y_test)
(368, 2000) 368
(123, 2000) 123
# Standardize the count-vectorized features

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)


# Run lots of classifiers on this and see which perform the best
# Import all the modeling libraries

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, \
                                    KFold, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report

import matplotlib.pyplot as plt
# prepare configuration for cross validation test harness
seed = 42

# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFST', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))
models.append(('ADA', AdaBoostClassifier()))
models.append(('SVM', SVC()))
models.append(('GNB', GaussianNB()))
models.append(('MNB', MultinomialNB()))
models.append(('BNB', BernoulliNB()))


# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'

print "\n{}:   {:0.3} ".format('Baseline', baseline, cv_results.std())
print "\n{:5.5}:  {:10.8}  {:20.18}  {:20.17}  {:20.17}".format\
        ("Model", "Features", "Train Set Accuracy", "CrossVal Accuracy", "Test Set Accuracy")

for name, model in models:
    try:
        kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, Xs_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        this_model = model
        this_model.fit(Xs_train, y_train)  # fit on the standardized data used for cross-validation and scoring in this loop
        print "{:5.5}     {:}         {:0.3f}               {:0.3f} +/- {:0.3f}         {:0.3f} ".format\
                (name, X_train.shape[1], metrics.accuracy_score(y_train, this_model.predict(Xs_train)), \
                 cv_results.mean(), cv_results.std(), metrics.accuracy_score(y_test, this_model.predict(Xs_test)))
    except:
        print "    {:5.5}:   {} ".format(name, 'failed on this input dataset')

        
                
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
ax.axhline(y=baseline, color='grey', linestyle='--')
plt.show()
Baseline:   0.521 

Model:  Features    Train Set Accuracy    CrossVal Accuracy     Test Set Accuracy   
LR        2000         0.938               0.660 +/- 0.037         0.626 
LDA       2000         0.938               0.544 +/- 0.054         0.593 
QDA       2000         0.549               0.399 +/- 0.034         0.390 
KNN       2000         0.758               0.500 +/- 0.010         0.528 
CART      2000         0.943               0.576 +/- 0.028         0.585 
RFST      2000         0.940               0.636 +/- 0.046         0.626 
GB        2000         0.826               0.546 +/- 0.020         0.585 
ADA       2000         0.769               0.552 +/- 0.042         0.545 
SVM       2000         0.519               0.519 +/- 0.018         0.528 
GNB       2000         0.913               0.688 +/- 0.038         0.561 
    MNB  :   failed on this input dataset 
BNB       2000         0.902               0.625 +/- 0.023         0.602 

(figure: Algorithm Comparison boxplot of cross-validation accuracy per classifier, with the baseline as a dashed line)
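A variant worth noting (a sketch under the assumption that a fold-internal scaler is wanted; this is not the run shown above): the already-imported Pipeline can couple the StandardScaler to each classifier, so cross-validation, fitting and test scoring all see identically transformed data.

# Hypothetical variant (not the run above): scaler + classifier in one Pipeline per model.
kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
for name, model in models:
    pipe = Pipeline([('scale', StandardScaler()), ('clf', model)])
    try:
        scores = cross_val_score(pipe, X_train, y_train, cv=kfold, scoring='accuracy')
        pipe.fit(X_train, y_train)
        print "{:5.5}  cv {:0.3f} +/- {:0.3f}   test {:0.3f}".format(
            name, scores.mean(), scores.std(), pipe.score(X_test, y_test))
    except Exception as e:
        print "    {:5.5}:  failed ({})".format(name, e)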

# Which words are most common in the winner summaries ?
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

# We can use the CountVectorizer to find ngrams for us
vect = CountVectorizer(ngram_range=(2,4), stop_words='english')

# Concatenate all of the winner show summaries into one giant string
summaries = "".join(df_shows[df_shows['winner'] == 1]['summary'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)
[(u'new york', 11),
 (u'drama series', 8),
 (u'york city', 6),
 (u'high school', 6),
 (u'men women', 5),
 (u'tv series', 5),
 (u'series based', 5),
 (u'video game', 5),
 (u'bugs bunny', 5),
 (u'new york city', 5),
 (u'tells story', 4),
 (u'young boy', 4),
 (u'comedy series', 4),
 (u'main character', 4),
 (u'united states', 4),
 (u'life new', 4),
 (u'series follows', 4),
 (u'anthology series', 3),
 (u'mr bean', 3),
 (u'prisoner cell', 3)]
# Which words are most common in the loser summaries ?
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

# We can use the CountVectorizer to find ngrams for us
vect = CountVectorizer(ngram_range=(2,4), stop_words='english')

# Concatenate all of the loser show summaries into one giant string
summaries = "".join(df_shows[df_shows['winner'] == 0]['summary'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)
[(u'real housewives', 12),
 (u'television series', 12),
 (u'los angeles', 11),
 (u'pop culture', 10),
 (u'series follows', 9),
 (u'new york', 9),
 (u'animated series', 7),
 (u'cartoon network', 7),
 (u'big brother', 6),
 (u'dance moms', 6),
 (u'reality series', 6),
 (u'best friend', 6),
 (u'high school', 5),
 (u'late night', 5),
 (u'best friends', 5),
 (u'nick jr', 5),
 (u'reality television series', 5),
 (u'plastic surgery', 5),
 (u'access hollywood', 5),
 (u'comedy series', 5)]
# Sum matrix columns to see what has the most overall importance ?

print "Highest sum Count Vectoror score for n_grams in winner shows"

cvec_results = pd.DataFrame(Xs_train, columns=cvec.get_feature_names())
cvec_results['winners'] = y_train

winner_results = pd.DataFrame(cvec_results[cvec_results['winners'] ==1].sum(), columns=['cvec_sum'])


high = winner_results.drop(['winners']).sort_values('cvec_sum', axis=0, ascending=False).head(20).index
print  [str(r) for r in high]

winner_results.drop(['winners']).sort_values('cvec_sum', axis=0, ascending=False).head(20)

Highest sum Count Vectorizer score for n_grams in winner shows
['drama series', 'david attenborough', 'main character', 'tells story', 'years later', 'years ago', 'fictional town', 'anthology series', 'documentary series', 'provocative series', 'makes effort', 'standup comedian', 'set world', 'time 13yearold', 'based manga', 'highs lows', 'set fictional', 'sherlock holmes', 'series takes', 'seaside town']
cvec_sum
drama series 21.615324
david attenborough 20.022240
main character 20.022240
tells story 20.022240
years later 17.315999
years ago 17.315999
fictional town 17.315999
anthology series 17.315999
documentary series 17.315999
provocative series 14.119126
makes effort 14.119126
standup comedian 14.119126
set world 14.119126
time 13yearold 14.119126
based manga 14.119126
highs lows 14.119126
set fictional 14.119126
sherlock holmes 14.119126
series takes 14.119126
seaside town 14.119126
# Sum matrix columns to see what has the most overall importance ?

print "Highest sum Count Vectoror score for n_grams in loser shows"

cvec_results = pd.DataFrame(Xs_train, columns=cvec.get_feature_names())
cvec_results['winners'] = y_train

winner_results = pd.DataFrame(cvec_results[cvec_results['winners'] ==0].sum(), columns=['cvec_sum'])


high = winner_results.drop(['winners']).sort_values('cvec_sum', axis=0, ascending=False).head(20).index
print  [str(r) for r in high]

winner_results.drop(['winners']).sort_values('cvec_sum', axis=0, ascending=False).head(20)

Highest sum Count Vectorizer score for n_grams in loser shows
['reality series', 'television series', 'series produced', 'real housewives', 'series debuted', 'family friends', 'follows lives', 'series features', 'animated series', 'los angeles', 'reality television', 'reality television series', 'bros television distribution', 'bros television', 'warner bros television', 'warner bros television distribution', 'television series debuted', 'cash prize', 'new series', 'news channel']
cvec_sum
reality series 22.787391
television series 21.394240
series produced 20.773274
real housewives 19.851334
series debuted 18.554642
family friends 18.554642
follows lives 18.554642
series features 18.554642
animated series 18.554642
los angeles 18.313960
reality television 17.522176
reality television series 17.522176
bros television distribution 16.046764
bros television 16.046764
warner bros television 16.046764
warner bros television distribution 16.046764
television series debuted 16.046764
cash prize 16.046764
new series 16.046764
news channel 16.046764

Model on summary text using TF-IDF Vectorizer

  • results were best when tf-idf scores were modeled with Gaussian Naive Bayes

Features: 2000
Train Set Accuracy: 0.924
CrossVal Accuracy: 0.609 +/- 0.034
Test Set Accuracy: 0.609

**n-grams with highest cumulative sum of tf-idf scores for winners:** ‘david attenborough’, ‘drama series’, ‘men women’, ‘new york’, ‘documentary series’, ‘new york city’, ‘york city’, ‘quest save’, ‘tv series’, ‘world know’, ‘television drama’, ‘sitcom set’, ‘young boy’, ‘comedy series’, ‘series created’, ‘tells story’, ’21st century’, ‘super hero’, ‘cable news’, ‘best friends’

**n-grams with highest cumulative sum of tf-idf scores for losers:** ‘real housewives’, ‘series follows’, ‘television series’, ‘best friends’, ‘best friend’, ‘los angeles’, ‘things just’, ‘group teenagers’, ‘series features’, ‘restaurant industry’, ‘children ages’, ‘animated series’, ‘big brother’, ‘cartoon network’, ‘recent divorce’, ‘american women’, ‘high school’, ‘reality series’, ‘follows lives’, ‘lives loves’

# Baseline for training set
winner_avg = y.mean()
baseline = max(winner_avg, 1-winner_avg)
print baseline

0.521384928717
# Test Train Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25)
print X_train.shape,  len(y_train)
print X_test.shape,  len(y_test)
(368, 2000) 368
(123, 2000) 123
# Standardize the tf-idf features (note: the model comparison below is run on the unscaled X_train/X_test)

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)


# prepare configuration for cross validation test harness
seed = 42

# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFST', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))
models.append(('ADA', AdaBoostClassifier()))
models.append(('SVM', SVC()))
models.append(('GNB', GaussianNB()))
models.append(('MNB', MultinomialNB()))
models.append(('BNB', BernoulliNB()))


# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'

print "\n{}:   {:0.3} ".format('Baseline', baseline, cv_results.std())
print "\n{:5.5}:  {:10.8}  {:20.18}  {:20.17}  {:20.17}".format\
        ("Model", "Features", "Train Set Accuracy", "CrossVal Accuracy", "Test Set Accuracy")

for name, model in models:
    try:
        kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        this_model = model
        this_model.fit(X_train,y_train)
        print "{:5.5}     {:}         {:0.3f}               {:0.3f} +/- {:0.3f}         {:0.3f} ".format\
                (name, X_train.shape[1], metrics.accuracy_score(y_train, this_model.predict(X_train)), \
                 cv_results.mean(), cv_results.std(), metrics.accuracy_score(y_test, this_model.predict(X_test)))
    except:
        print "    {:5.5}:   {} ".format(name, 'failed on this input dataset')

        
                
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
ax.axhline(y=baseline, color='grey', linestyle='--')
plt.show()
Baseline:   0.521 

Model:  Features    Train Set Accuracy    CrossVal Accuracy     Test Set Accuracy   
LR        2000         0.957               0.620 +/- 0.020         0.634 
LDA       2000         0.959               0.658 +/- 0.035         0.610 
QDA       2000         0.671               0.437 +/- 0.013         0.431 
KNN       2000         0.647               0.519 +/- 0.031         0.488 
CART      2000         0.959               0.554 +/- 0.026         0.496 
RFST      2000         0.957               0.581 +/- 0.020         0.561 
GB        2000         0.872               0.598 +/- 0.034         0.545 
ADA       2000         0.772               0.541 +/- 0.047         0.504 
SVM       2000         0.505               0.492 +/- 0.009         0.569 
GNB       2000         0.943               0.668 +/- 0.028         0.634 
MNB       2000         0.927               0.658 +/- 0.029         0.642 
BNB       2000         0.932               0.641 +/- 0.048         0.593 

(figure: Algorithm Comparison boxplot of cross-validation accuracy per classifier, with the baseline as a dashed line)

# Which words are most common in the winner summaries ?
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4), stop_words='english')

# Concatenate all of the winner show summaries into one giant string
summaries = "".join(df_shows[df_shows['winner'] == 1]['summary'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)
[(u'new york', 11),
 (u'drama series', 8),
 (u'york city', 6),
 (u'high school', 6),
 (u'men women', 5),
 (u'tv series', 5),
 (u'series based', 5),
 (u'video game', 5),
 (u'bugs bunny', 5),
 (u'new york city', 5),
 (u'tells story', 4),
 (u'young boy', 4),
 (u'comedy series', 4),
 (u'main character', 4),
 (u'united states', 4),
 (u'life new', 4),
 (u'series follows', 4),
 (u'anthology series', 3),
 (u'mr bean', 3),
 (u'prisoner cell', 3)]
# Which words are most common in the loser summaries ?
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4), stop_words='english')

# Concatenate all of the loser show summaries into one giant string
summaries = "".join(df_shows[df_shows['winner'] == 0]['summary'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)
[(u'real housewives', 12),
 (u'television series', 12),
 (u'los angeles', 11),
 (u'pop culture', 10),
 (u'series follows', 9),
 (u'new york', 9),
 (u'animated series', 7),
 (u'cartoon network', 7),
 (u'big brother', 6),
 (u'dance moms', 6),
 (u'reality series', 6),
 (u'best friend', 6),
 (u'high school', 5),
 (u'late night', 5),
 (u'best friends', 5),
 (u'nick jr', 5),
 (u'reality television series', 5),
 (u'plastic surgery', 5),
 (u'access hollywood', 5),
 (u'comedy series', 5)]
# Sum matrix columns to see what has the most overall importance ?

print "Highest cumulative tfidf score for n_grams in winner shows"

tfidf_results = pd.DataFrame(X_train, columns= tfidf.get_feature_names())
tfidf_results['winners'] = y_train

winner_results = pd.DataFrame(tfidf_results[tfidf_results['winners'] ==1].sum(), columns=['tfidf_sum'])


high = winner_results.drop(['winners']).sort_values('tfidf_sum', axis=0, ascending=False).head(20).index
print  [str(r) for r in high]

winner_results.drop(['winners']).sort_values('tfidf_sum', axis=0, ascending=False).head(20)

Highest cumulative tfidf score for n_grams in winner shows
['new york', 'men women', 'documentary series', 'york city', 'new york city', 'high school', 'drama series', 'tells story', 'years ago', 'david attenborough', 'series created', 'years later', 'young man', 'comedy series', 'main character', '21st century', 'tv series', 'andrew davies', 'cable news', 'series based']
tfidf_sum
new york 2.835972
men women 2.484786
documentary series 2.171334
york city 1.897754
new york city 1.897754
high school 1.743989
drama series 1.716267
tells story 1.685216
years ago 1.634339
david attenborough 1.522240
series created 1.484294
years later 1.484223
young man 1.474759
comedy series 1.401982
main character 1.366767
21st century 1.358933
tv series 1.307707
andrew davies 1.304276
cable news 1.261358
series based 1.258897
# Sum matrix columns to see what has the most overall importance ?

print "Highest cumulative tfidf score for n_grams in loser shows"

tfidf_results = pd.DataFrame(X_train, columns= tfidf.get_feature_names())
tfidf_results['winners'] = y_train

winner_results = pd.DataFrame(tfidf_results[tfidf_results['winners'] == 0].sum(), columns=['tfidf_sum'])

low = winner_results.drop(['winners']).sort_values('tfidf_sum', axis=0, ascending=False).head(20).index

print  [str(r) for r in low]
winner_results.drop(['winners']).sort_values('tfidf_sum', axis=0, ascending=False).head(20)
Highest cumulative tfidf score for n_grams in loser shows
['reality series', 'television series', 'los angeles', 'real housewives', 'things just', 'best friend', 'group teenagers', 'series follows', 'restaurant industry', 'new york', 'high school', 'children ages', 'big brother', 'recent divorce', 'series features', 'cartoon network', 'football team', 'plastic surgery', 'bizarre adventures', 'nick jr']
tfidf_sum
reality series 3.535573
television series 2.715189
los angeles 2.582930
real housewives 2.160853
things just 2.000000
best friend 1.958176
group teenagers 1.791283
series follows 1.772954
restaurant industry 1.707107
new york 1.662150
high school 1.657534
children ages 1.648007
big brother 1.591352
recent divorce 1.473112
series features 1.443284
cartoon network 1.441719
football team 1.414214
plastic surgery 1.382887
bizarre adventures 1.368894
nick jr 1.366875

Model using data other than the TV show summary text

# Get the list of columns for the useful non-summary data.  Dropping the "unknown" columns will solve
# the collinearity issue with dummied columns, as these will be the dropped dummies.
# Dropping premiered as it is a datetime and the standardizer can't handle it.  Also dropping
# weight as its meaning is not understood, and rating and winner as they are the targets.

cols = [x for x in df_shows.columns if x not in ['rating', 'weight', 'updated', 'premiered', 'summary', 'id', \
                                                 'gn_unknown', 'sched_unknown', 'sched_time_unknown', \
                                                 'country_name', 'country_tz', 'network_name', 'name', 'winner']]
cols
[u'status',
 u'language',
 u'runtime',
 u'type',
 u'network',
 u'gn_action',
 u'gn_adult',
 u'gn_adventure',
 u'gn_anime',
 u'gn_children',
 u'gn_comedy',
 u'gn_crime',
 u'gn_drama',
 u'gn_espionage',
 u'gn_family',
 u'gn_fantasy',
 u'gn_food',
 u'gn_history',
 u'gn_horror',
 u'gn_legal',
 u'gn_medical',
 u'gn_music',
 u'gn_mystery',
 u'gn_nature',
 u'gn_romance',
 u'gn_science-fiction',
 u'gn_sports',
 u'gn_supernatural',
 u'gn_thriller',
 u'gn_travel',
 u'gn_war',
 u'gn_western',
 u'sched_friday',
 u'sched_monday',
 u'sched_saturday',
 u'sched_sunday',
 u'sched_thursday',
 u'sched_tuesday',
 u'sched_wednesday',
 u'sched_time_00:00',
 u'sched_time_00:30',
 u'sched_time_00:50',
 u'sched_time_00:55',
 u'sched_time_01:00',
 u'sched_time_01:05',
 u'sched_time_01:30',
 u'sched_time_01:35',
 u'sched_time_02:00',
 u'sched_time_02:05',
 u'sched_time_08:00',
 u'sched_time_10:00',
 u'sched_time_11:00',
 u'sched_time_12:00',
 u'sched_time_13:00',
 u'sched_time_13:30',
 u'sched_time_14:00',
 u'sched_time_14:30',
 u'sched_time_15:00',
 u'sched_time_15:15',
 u'sched_time_16:00',
 u'sched_time_16:30',
 u'sched_time_17:00',
 u'sched_time_17:15',
 u'sched_time_17:30',
 u'sched_time_18:00',
 u'sched_time_18:30',
 u'sched_time_19:00',
 u'sched_time_19:30',
 u'sched_time_19:45',
 u'sched_time_20:00',
 u'sched_time_20:15',
 u'sched_time_20:30',
 u'sched_time_20:40',
 u'sched_time_20:45',
 u'sched_time_20:55',
 u'sched_time_21:00',
 u'sched_time_21:10',
 u'sched_time_21:15',
 u'sched_time_21:30',
 u'sched_time_21:45',
 u'sched_time_22:00',
 u'sched_time_22:10',
 u'sched_time_22:30',
 u'sched_time_22:35',
 u'sched_time_23:00',
 u'sched_time_23:02',
 u'sched_time_23:15',
 u'sched_time_23:30',
 'country_code',
 'network_id']
# Dummy country code, network id, status, language, and type
df_showsd = pd.get_dummies(df_shows, columns=['network_id'], prefix='NW', prefix_sep='_')

df_showsd = df_showsd.drop('NW_', 1)
df_showsd = df_showsd.drop('network', 1)

df_showsd = pd.get_dummies(df_showsd, columns=['country_code'], prefix='C', prefix_sep='_', drop_first=True)
df_showsd = pd.get_dummies(df_showsd, columns=['status'], prefix='ST', prefix_sep='_', drop_first=True)
df_showsd = pd.get_dummies(df_showsd, columns=['language'], prefix='L', prefix_sep='_', drop_first=True)
df_showsd = pd.get_dummies(df_showsd, columns=['type'], prefix='T', prefix_sep='_', drop_first=True)

# Handle any NaN values that remain
shows_clean = df_showsd.dropna()
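As a quick check (a hypothetical addition, not in the original run), the number of rows lost to this final dropna can be printed:

# Hypothetical sanity check: how many rows were removed by the dropna above?
print len(df_showsd) - len(shows_clean), "rows dropped due to remaining NaN values"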
# We have 351 total samples left, about 1/3 loser and 2/3 winner
# Seems reasonable to proceed with a classification model

print "Number winner samples:", shows_clean['winner'].sum()
print "Number loser samples:", len(shows_clean[shows_clean['winner'] == 0])
Number winner samples: 230
Number loser samples: 121
cols = [x for x in shows_clean.columns if x not in ['rating', 'weight', 'updated', 'premiered', 'summary', 'id', \
                                                 'gn_unknown', 'sched_unknown', 'sched_time_unknown', \
                                                 'country_name', 'country_tz', 'network_name', 'name', 'winner']]
cols
[u'runtime',
 u'gn_action',
 u'gn_adult',
 u'gn_adventure',
 u'gn_anime',
 u'gn_children',
 u'gn_comedy',
 u'gn_crime',
 u'gn_drama',
 u'gn_espionage',
 u'gn_family',
 u'gn_fantasy',
 u'gn_food',
 u'gn_history',
 u'gn_horror',
 u'gn_legal',
 u'gn_medical',
 u'gn_music',
 u'gn_mystery',
 u'gn_nature',
 u'gn_romance',
 u'gn_science-fiction',
 u'gn_sports',
 u'gn_supernatural',
 u'gn_thriller',
 u'gn_travel',
 u'gn_war',
 u'gn_western',
 u'sched_friday',
 u'sched_monday',
 u'sched_saturday',
 u'sched_sunday',
 u'sched_thursday',
 u'sched_tuesday',
 u'sched_wednesday',
 u'sched_time_00:00',
 u'sched_time_00:30',
 u'sched_time_00:50',
 u'sched_time_00:55',
 u'sched_time_01:00',
 u'sched_time_01:05',
 u'sched_time_01:30',
 u'sched_time_01:35',
 u'sched_time_02:00',
 u'sched_time_02:05',
 u'sched_time_08:00',
 u'sched_time_10:00',
 u'sched_time_11:00',
 u'sched_time_12:00',
 u'sched_time_13:00',
 u'sched_time_13:30',
 u'sched_time_14:00',
 u'sched_time_14:30',
 u'sched_time_15:00',
 u'sched_time_15:15',
 u'sched_time_16:00',
 u'sched_time_16:30',
 u'sched_time_17:00',
 u'sched_time_17:15',
 u'sched_time_17:30',
 u'sched_time_18:00',
 u'sched_time_18:30',
 u'sched_time_19:00',
 u'sched_time_19:30',
 u'sched_time_19:45',
 u'sched_time_20:00',
 u'sched_time_20:15',
 u'sched_time_20:30',
 u'sched_time_20:40',
 u'sched_time_20:45',
 u'sched_time_20:55',
 u'sched_time_21:00',
 u'sched_time_21:10',
 u'sched_time_21:15',
 u'sched_time_21:30',
 u'sched_time_21:45',
 u'sched_time_22:00',
 u'sched_time_22:10',
 u'sched_time_22:30',
 u'sched_time_22:35',
 u'sched_time_23:00',
 u'sched_time_23:02',
 u'sched_time_23:15',
 u'sched_time_23:30',
 'NW_1',
 'NW_2',
 'NW_3',
 'NW_4',
 'NW_5',
 'NW_6',
 'NW_8',
 'NW_9',
 'NW_10',
 'NW_11',
 'NW_12',
 'NW_13',
 'NW_14',
 'NW_16',
 'NW_17',
 'NW_18',
 'NW_19',
 'NW_20',
 'NW_22',
 'NW_23',
 'NW_24',
 'NW_25',
 'NW_26',
 'NW_27',
 'NW_29',
 'NW_30',
 'NW_32',
 'NW_34',
 'NW_35',
 'NW_36',
 'NW_37',
 'NW_41',
 'NW_42',
 'NW_43',
 'NW_44',
 'NW_45',
 'NW_47',
 'NW_48',
 'NW_49',
 'NW_51',
 'NW_52',
 'NW_54',
 'NW_55',
 'NW_56',
 'NW_59',
 'NW_63',
 'NW_66',
 'NW_70',
 'NW_71',
 'NW_72',
 'NW_73',
 'NW_75',
 'NW_76',
 'NW_77',
 'NW_78',
 'NW_79',
 'NW_80',
 'NW_81',
 'NW_84',
 'NW_85',
 'NW_88',
 'NW_91',
 'NW_92',
 'NW_107',
 'NW_109',
 'NW_114',
 'NW_115',
 'NW_118',
 'NW_120',
 'NW_122',
 'NW_125',
 'NW_131',
 'NW_132',
 'NW_137',
 'NW_144',
 'NW_149',
 'NW_151',
 'NW_155',
 'NW_157',
 'NW_158',
 'NW_159',
 'NW_163',
 'NW_173',
 'NW_177',
 'NW_184',
 'NW_185',
 'NW_206',
 'NW_224',
 'NW_231',
 'NW_239',
 'NW_248',
 'NW_251',
 'NW_270',
 'NW_286',
 'NW_298',
 'NW_309',
 'NW_324',
 'NW_333',
 'NW_336',
 'NW_349',
 'NW_350',
 'NW_360',
 'NW_376',
 'NW_409',
 'NW_472',
 'NW_551',
 'NW_553',
 'NW_639',
 'NW_652',
 'NW_714',
 'NW_809',
 'NW_813',
 'NW_821',
 'NW_870',
 'NW_976',
 'NW_1027',
 'NW_1050',
 'NW_1485',
 u'C_AU',
 u'C_CA',
 u'C_DE',
 u'C_DK',
 u'C_FR',
 u'C_GB',
 u'C_IT',
 u'C_JP',
 u'C_KR',
 u'C_NO',
 u'C_NZ',
 u'C_PL',
 u'C_RU',
 u'C_SE',
 u'C_TR',
 u'C_US',
 u'ST_Running',
 u'ST_To Be Determined',
 u'L_English',
 u'L_French',
 u'L_German',
 u'L_Hindi',
 u'L_Italian',
 u'L_Japanese',
 u'L_Korean',
 u'L_Norwegian',
 u'L_Polish',
 u'L_Russian',
 u'L_Swedish',
 u'L_Turkish',
 u'T_Documentary',
 u'T_Game Show',
 u'T_News',
 u'T_Panel Show',
 u'T_Reality',
 u'T_Scripted',
 u'T_Talk Show',
 u'T_Variety']
# Generate X matrix and y target

X = shows_clean[cols]
y = shows_clean['winner'].values
# Baseline 
winner_avg = y.mean()
baseline = max(winner_avg, 1-winner_avg)
print baseline

0.655270655271
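This baseline is simply the majority-class share: always predicting "winner" is correct for 230 of the 351 remaining shows, which is where the 0.655 comes from. A quick arithmetic check against the counts printed earlier:

# Sanity check of the baseline against the winner/loser counts above
print 230.0 / (230 + 121)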
# Test Train Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print X_train.shape,  len(y_train)
print X_test.shape,  len(y_test)
(263, 240) 263
(88, 240) 88
# Standardize the features

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)
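Fitting the scaler on the training split only, and merely transforming the test split, keeps test-set statistics out of the model. The same idea can be pushed into cross-validation by wrapping the scaler and classifier in a Pipeline so the scaler is refit inside every fold. A minimal sketch of that alternative, not used in this notebook:

# Minimal sketch (alternative approach): refit the scaler within each CV fold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
print cross_val_score(pipe, X_train, y_train, cv=3).mean()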


# Gridsearch for best C and penalty
gs_params = {
    'penalty':['l1', 'l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,5,100)
}
from sklearn.model_selection import GridSearchCV
lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params, cv=3, verbose=1, n_jobs=-1)

lr_gridsearch.fit(Xs_train, y_train)
Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:    1.4s finished





GridSearchCV(cv=3, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.26186e-05, ...,   7.92483e+04,   1.00000e+05]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
# best score on the training data:
lr_gridsearch.best_score_
0.9125475285171103
# best parameters on the training data:
lr_gridsearch.best_params_
{'C': 0.068926121043496949, 'penalty': 'l2', 'solver': 'liblinear'}
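In scikit-learn, C is the inverse of the regularization strength, so C of roughly 0.07 with the l2 penalty corresponds to fairly strong ridge regularization. To see how accuracy varies across the whole grid rather than just at the best point, the search's cv_results_ can be inspected; a short sketch, assuming the lr_gridsearch fit above:

# Sketch: top of the grid, sorted by mean cross-validation accuracy
cv_df = pd.DataFrame(lr_gridsearch.cv_results_)
print cv_df[['param_C', 'param_penalty', 'mean_test_score']].sort_values('mean_test_score', ascending=False).head()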
# assign the best estimator to a variable:
best_lr = lr_gridsearch.best_estimator_
# Score it on the testing data:
best_lr.score(Xs_test, y_test)
0.88636363636363635
# Much better than baseline. Next, identify the most important factors and
# re-run all the classifiers using just those factors.
coef_df = pd.DataFrame({
        'features': X.columns,
        'log odds': best_lr.coef_[0],
        'percentage change in odds': np.round(np.exp(best_lr.coef_[0])*100-100,2)
    })
coef_df.sort_values(by='percentage change in odds', ascending=0)
features log odds percentage change in odds
6 gn_comedy 0.500129 64.89
237 T_Scripted 0.477513 61.21
232 T_Documentary 0.268693 30.83
8 gn_drama 0.254062 28.93
3 gn_adventure 0.246904 28.01
90 NW_8 0.242839 27.49
94 NW_12 0.207017 23.00
114 NW_37 0.201048 22.27
7 gn_crime 0.192888 21.27
83 sched_time_23:30 0.178175 19.50
21 gn_science-fiction 0.177248 19.39
23 gn_supernatural 0.177209 19.39
18 gn_mystery 0.167534 18.24
95 NW_13 0.139862 15.01
218 ST_Running 0.139006 14.91
0 runtime 0.138903 14.90
11 gn_fantasy 0.137532 14.74
64 sched_time_19:45 0.137084 14.69
176 NW_270 0.126593 13.50
87 NW_4 0.122341 13.01
143 NW_85 0.115952 12.29
13 gn_history 0.111519 11.80
24 gn_thriller 0.102413 10.78
30 sched_saturday 0.102229 10.76
207 C_GB 0.101888 10.73
1 gn_action 0.090291 9.45
14 gn_horror 0.088511 9.25
62 sched_time_19:00 0.086923 9.08
98 NW_17 0.085510 8.93
225 L_Japanese 0.085179 8.89
... ... ... ...
53 sched_time_15:00 -0.106270 -10.08
153 NW_122 -0.113304 -10.71
132 NW_71 -0.113592 -10.74
201 NW_1485 -0.115251 -10.89
186 NW_376 -0.115363 -10.90
169 NW_185 -0.116553 -11.00
106 NW_26 -0.117644 -11.10
220 L_English -0.119882 -11.30
105 NW_25 -0.126560 -11.89
55 sched_time_16:00 -0.128590 -12.07
29 sched_monday -0.134095 -12.55
192 NW_652 -0.143685 -13.38
217 C_US -0.154332 -14.30
117 NW_43 -0.155093 -14.37
110 NW_32 -0.157596 -14.58
158 NW_144 -0.158937 -14.69
49 sched_time_13:00 -0.159660 -14.76
140 NW_80 -0.160574 -14.83
33 sched_tuesday -0.174971 -16.05
203 C_CA -0.179060 -16.39
93 NW_11 -0.179608 -16.44
17 gn_music -0.189299 -17.25
179 NW_309 -0.216452 -19.46
124 NW_52 -0.224190 -20.08
238 T_Talk Show -0.233104 -20.79
233 T_Game Show -0.244726 -21.71
78 sched_time_22:30 -0.251273 -22.22
102 NW_22 -0.268289 -23.53
67 sched_time_20:30 -0.300379 -25.95
236 T_Reality -0.557917 -42.76

240 rows × 3 columns
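The "percentage change in odds" column is just exp(coefficient) * 100 - 100; because the regression was fit on the standardized matrix, each coefficient is the change in log odds per one standard deviation of its feature. As a worked example, the gn_comedy coefficient of 0.500129 corresponds to roughly a 65% increase in the odds of being a winner:

# Worked check of the conversion used in coef_df
import numpy as np
print np.round(np.exp(0.500129) * 100 - 100, 2)    # gn_comedy: 64.89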

# Create a subset of "coef_df" DataFrame with most important coefficients
imp_coefs = pd.concat([coef_df.sort_values(by='percentage change in odds', ascending=0).head(10),
                     coef_df.sort_values(by='percentage change in odds', ascending=0).tail(10)])
imp_coefs.set_index('features', inplace=True)
imp_coefs
log odds percentage change in odds
features
gn_comedy 0.500129 64.89
T_Scripted 0.477513 61.21
T_Documentary 0.268693 30.83
gn_drama 0.254062 28.93
gn_adventure 0.246904 28.01
NW_8 0.242839 27.49
NW_12 0.207017 23.00
NW_37 0.201048 22.27
gn_crime 0.192888 21.27
sched_time_23:30 0.178175 19.50
NW_11 -0.179608 -16.44
gn_music -0.189299 -17.25
NW_309 -0.216452 -19.46
NW_52 -0.224190 -20.08
T_Talk Show -0.233104 -20.79
T_Game Show -0.244726 -21.71
sched_time_22:30 -0.251273 -22.22
NW_22 -0.268289 -23.53
sched_time_20:30 -0.300379 -25.95
T_Reality -0.557917 -42.76
# Plot important coefficients
imp_coefs['percentage change in odds'].plot(kind = "barh")
plt.title("Percentage change in odds with Ridge regularization")
plt.show()

[Figure: "Percentage change in odds with Ridge regularization" -- horizontal bar chart of the 10 most positive and 10 most negative features]

df_shows[df_shows['network_id'] == 309]
     name                status            rating  weight  language  premiered   runtime  type      country_code  network_id  network_name
279  I Live with Models  Running           8.7     40      English   2015-02-23  30       Scripted  GB            309         Comedy Central
535  Brotherhood         To Be Determined  6.5     0       English   2015-06-02  30       Scripted  GB            309         Comedy Central

2 rows × 104 columns (selected columns shown)

# Get list of features and re-run model with just the 20 most important features
imp_features = imp_coefs.index
# Set up X and y
X = shows_clean[imp_features]
y = shows_clean['winner'].values
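One caveat: the 20 features were picked using coefficients from a model fit on the earlier 75% split, and a fresh split is drawn below, so some of the new test rows were already seen when the features were selected. A way to avoid that, sketched here but not used in this notebook, is to keep the selection step inside a Pipeline so it is refit on training folds only; the estimator and its settings below are illustrative.

# Minimal sketch (alternative, not what is done here): feature selection inside the pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectFromModel(LogisticRegression(penalty='l2', C=0.07))),  # C from the grid search above
    ('clf', AdaBoostClassifier()),
])
print cross_val_score(pipe, shows_clean[cols], y, cv=3).mean()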
# Baseline
winner_avg = y.mean()
baseline = max(winner_avg, 1-winner_avg)
print baseline
0.655270655271
# Test Train Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print X_train.shape,  len(y_train)
print X_test.shape,  len(y_test)
(263, 20) 263
(88, 20) 88
# Standardize the features

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

# prepare configuration for cross validation test harness
seed = 42

# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFST', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))
models.append(('ADA', AdaBoostClassifier()))
models.append(('SVM', SVC()))
models.append(('GNB', GaussianNB()))
models.append(('MNB', MultinomialNB()))
models.append(('BNB', BernoulliNB()))


# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'

print "\n{}:   {:0.3} ".format('Baseline', baseline, cv_results.std())
print "\n{:5.5}:  {:10.8}  {:20.18}  {:20.17}  {:20.17}".format\
        ("Model", "Features", "Train Set Accuracy", "CrossVal Accuracy", "Test Set Accuracy")

for name, model in models:
    try:
        kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        this_model = model
        this_model.fit(X_train,y_train)
        print "{:5.5}     {:}         {:0.3f}               {:0.3f} +/- {:0.3f}         {:0.3f} ".format\
                (name, X_train.shape[1], metrics.accuracy_score(y_train, this_model.predict(X_train)), \
                 cv_results.mean(), cv_results.std(), metrics.accuracy_score(y_test, this_model.predict(X_test)))
    except Exception:
        print "    {:5.5}:   {} ".format(name, 'failed on this input dataset')

        
                
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
ax.axhline(y=baseline, color='grey', linestyle='--')
plt.show()
Baseline:   0.655 

Model:  Features    Train Set Accuracy    CrossVal Accuracy     Test Set Accuracy   
LR        20         0.905               0.874 +/- 0.010         0.909 
LDA       20         0.901               0.890 +/- 0.020         0.875 
QDA       20         0.669               0.448 +/- 0.082         0.682 
KNN       20         0.909               0.905 +/- 0.005         0.886 
CART      20         0.943               0.894 +/- 0.019         0.898 
RFST      20         0.939               0.909 +/- 0.000         0.886 
GB        20         0.943               0.905 +/- 0.020         0.898 
ADA       20         0.916               0.897 +/- 0.032         0.920 
SVM       20         0.863               0.848 +/- 0.014         0.818 
GNB       20         0.616               0.673 +/- 0.057         0.614 
MNB       20         0.886               0.886 +/- 0.010         0.898 
BNB       20         0.905               0.901 +/- 0.010         0.920 

[Figure: "Algorithm Comparison" -- boxplots of cross-validation accuracy for each classifier, with the baseline drawn as a dashed grey line]
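Since AdaBoost and Bernoulli Naive Bayes give the best test-set scores here (both 0.920), a natural follow-up would be to tune the leading model's hyperparameters rather than relying on the defaults. A minimal sketch of such a search on the same 20 features; the grid values are illustrative, not tuned:

# Sketch of a possible follow-up: grid search over AdaBoost hyperparameters
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier

ada_params = {'n_estimators': [50, 100, 200],        # illustrative grid
              'learning_rate': [0.05, 0.1, 0.5, 1.0]}
ada_gs = GridSearchCV(AdaBoostClassifier(random_state=seed), ada_params, cv=3, n_jobs=-1)
ada_gs.fit(X_train, y_train)
print ada_gs.best_params_
print ada_gs.score(X_test, y_test)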


Written on September 26, 2017