TV Show Analysis

What makes a successful TV show?

This analysis and model attempt to determine the factors that influence high or low IMDb ratings for TV shows. All genres are examined, and while most shows originate in the United States, a few from the UK and elsewhere are included.

Two separate models are developed. In both, the top- and bottom-rated shows are classified as winners and losers, respectively, and an array of 12 classifiers is applied using cross-validation to identify the best-performing model. 25% of the data was reserved as a test set, and both cross-validation scores and test-set scores are shown in the tables below. The baseline score is 0.52.
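The evaluation procedure described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic (generated with scikit-learn's `make_classification`), only four classifiers are compared rather than the 12 used in the analysis, and the names `candidates` and `results` are made up for the example. The baseline is assumed to be the accuracy of always predicting the majority class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the show features; 1 = winner, 0 = loser (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Reserve 25% of the rows as a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Baseline: accuracy obtained by always predicting the majority class.
winner_share = y_train.mean()
baseline = max(winner_share, 1 - winner_share)

# A small, illustrative subset of classifiers (the notebook uses 12).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "ada_boost": AdaBoostClassifier(random_state=0),
    "gaussian_nb": GaussianNB(),
}

results = {}
for name, clf in candidates.items():
    # Mean 5-fold cross-validation accuracy on the training portion only.
    cv_score = cross_val_score(clf, X_train, y_train, cv=5).mean()
    # Refit on the full training portion and score once on the test set.
    test_score = clf.fit(X_train, y_train).score(X_test, y_test)
    results[name] = (cv_score, test_score)

# Report classifiers from best to worst cross-validation score.
for name, (cv_score, test_score) in sorted(
        results.items(), key=lambda item: -item[1][0]):
    print("%-20s cv=%.3f test=%.3f" % (name, cv_score, test_score))
print("baseline=%.3f" % baseline)
```

Keeping the test set out of the cross-validation loop means the test score is only used as a final check on the model chosen by cross-validation.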

The first model applies natural language processing (NLP) to the IMDb summary descriptions of each show. Term Frequency - Inverse Document Frequency (TF-IDF) and count vectorization were applied to n-grams of size 2-4. With both vectorization techniques, Random Forest and Naive Bayes classifiers were most successful; the highest score, 0.642, was achieved using TF-IDF vectorization and a Multinomial Naive Bayes classifier. The n-grams with the highest cumulative score are identified as the most significant factors in the model, giving a clue as to the words in a summary description that foretell a show’s likelihood of being a winner or loser.
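As a rough illustration of the NLP model, the sketch below chains scikit-learn's `TfidfVectorizer` (restricted to n-grams of 2 to 4 words) with `MultinomialNB`. The four summaries and their labels are invented for the example and are not taken from the scraped data; the vectorizer settings other than `ngram_range` are library defaults and may differ from those used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy summaries; 1 = winner, 0 = loser (assumption for illustration).
summaries = [
    "a detective investigates a murder in new york city",
    "a crime family fights for power in new york",
    "teens compete in a reality dating show",
    "a reality show follows teens at a beach house",
]
labels = [1, 1, 0, 0]

# TF-IDF over word n-grams of length 2 to 4, followed by Multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(ngram_range=(2, 4)), MultinomialNB())
model.fit(summaries, labels)

# Inspect the n-grams the vectorizer learned from the toy corpus.
vectorizer = model.named_steps["tfidfvectorizer"]
ngrams = sorted(vectorizer.vocabulary_)
print(ngrams[:5])

# Classify an unseen summary.
prediction = model.predict(["a detective in new york"])
print(prediction)
```

Because single words are excluded, the model can only react to phrases such as "new york" or "reality show", which is what makes the top-scoring n-grams readable as hints about summary wording.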

A second model, using factors such as genre, length, schedule times, network and format, was also built, and the same set of 12 classifiers was applied. The AdaBoost classifier achieved the top score of 0.92 using these factors.
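A minimal sketch of how categorical factors like these might be fed to an AdaBoost classifier is shown below. The rows, the column names (`format`, `network`, `runtime`, `winner`) and the one-hot encoding via `pd.get_dummies` are assumptions made for illustration; the notebook's actual feature preparation may differ.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

# Invented rows; column names and values are hypothetical.
toy_shows = pd.DataFrame({
    "format":  ["Scripted", "Scripted", "Reality", "Reality", "Scripted", "Reality"],
    "network": ["HBO", "BBC", "MTV", "E!", "HBO", "MTV"],
    "runtime": [60, 60, 30, 30, 30, 60],
    "winner":  [1, 1, 0, 0, 1, 0],
})

# One-hot encode the categorical columns and keep runtime as a numeric column.
features = pd.get_dummies(toy_shows[["format", "network"]])
features["runtime"] = toy_shows["runtime"]
features = features.astype(float)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(features, toy_shows["winner"])

# Rank the encoded columns by the importance the fitted ensemble assigns them.
importances = pd.Series(clf.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False).head())

predictions = clf.predict(features)
```

One-hot columns make it possible to read importances per category (for example a single network), but importances alone do not say whether a category pushes toward winner or loser.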

Data collection and cleanup were tedious, involving multiple runs of web scraping IMDb pages for show ratings, then using the TVmaze API to return show details. Readers not interested in these details are encouraged to skip to the section titled “Modeling Section”, a bit more than halfway through this notebook.

Results Summary:

From the NLP models, it seems that shows featuring adult characters in crime and drama series set in New York, in times before or after the present, will fare better than reality or animated series featuring children or teens and highlighting pop culture.

The model built on factors other than the summary showed similar tendencies. Reality formats were the strongest negative factor in predicting success, while the scripted format was the strongest positive predictor. Game and talk shows were negative predictors, while crime, science fiction, comedy, drama and documentaries were positive predictors. Shows aired by HBO and BBC predicted success, while the lower-rated shows were found predominantly on MTV, E!, Comedy Central and Lifetime.

Though interesting associations have been found, nothing in the techniques used here can be interpreted as establishing causality. For example, it cannot be said that reality shows featuring teenagers will always flop. This report is based on initial efforts to determine factors that may influence a show’s success, and it points to a path for future, more detailed modeling. Suggested future paths include textual analysis of critics’ reviews, analysis based on cast or producers, analysis of differences in rating based on audience demographics, and a more detailed look at the connection between genre and the type/style of show.

# Import libraries needed for scraping and saving results.  
# Additional libraries needed for modeling, analysis and display will be imported when needed.

import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle

Data Acquisition: List of top rated TV Shows

# Retrieve current top 250 TV shows webpage

url = "http://www.imdb.com/chart/toptv/"
r = requests.get(url)
html = r.text
html[0:200]

u'\n\n\n\n<!DOCTYPE html>\n<html\nxmlns:og="http://ogp.me/ns#"\nxmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=ed'

# Use Beautiful soup to extract the imdb numbers from the webpage
soup = BeautifulSoup(html, "lxml")

# Scrape the IMDb numbers for the 250 top rated shows

show_list = []
for tbody in soup.find_all('tbody', class_='lister-list'):
    for title in tbody.find_all('td', class_='titleColumn'):
        # The link looks like /title/<imdb id>/..., so the id is the third "/"-separated piece
        show_list.append(title.a['href'].split("/")[2])

show_list

['tt5491994',
 'tt0185906',
 'tt0795176',
 'tt0944947',
 'tt0903747',
 'tt0306414',
 'tt2861424',
 'tt2395695',
 'tt0081846',
 'tt0071075',
 'tt0141842',
 'tt1475582',
 'tt1533395',
 'tt0417299',
 'tt0098769',
 'tt1806234',
 'tt0303461',
 'tt0092337',
 'tt0052520',
 'tt3530232',
 'tt2356777',
 'tt1355642',
 'tt2802850',
 'tt0103359',
 'tt0296310',
 'tt0877057',
 'tt4508902',
 'tt0475784',
 'tt2092588',
 'tt0213338',
 'tt1856010',
 'tt0063929',
 'tt0112130',
 'tt2571774',
 'tt0081834',
 'tt0367279',
 'tt4742876',
 'tt4574334',
 'tt2085059',
 'tt0108778',
 'tt0098904',
 'tt3718778',
 'tt0081912',
 'tt0098936',
 'tt1518542',
 'tt0074006',
 'tt2707408',
 'tt0193676',
 'tt1865718',
 'tt0096548',
 'tt0072500',
 'tt0384766',
 'tt0118421',
 'tt0096697',
 'tt0090509',
 'tt0121955',
 'tt0386676',
 'tt4299972',
 'tt2560140',
 'tt0472954',
 'tt0412142',
 'tt0214341',
 'tt5555260',
 'tt2442560',
 'tt5712554',
 'tt0200276',
 'tt0353049',
 'tt1910272',
 'tt0086661',
 'tt0248654',
 'tt5189670',
 'tt0121220',
 'tt1486217',
 'tt0096639',
 'tt0120570',
 'tt4786824',
 'tt1628033',
 'tt0348914',
 'tt0403778',
 'tt5288312',
 'tt0459159',
 'tt3032476',
 'tt0407362',
 'tt4093826',
 'tt0773262',
 'tt0417349',
 'tt3322312',
 'tt0264235',
 'tt0106179',
 'tt0286486',
 'tt2297757',
 'tt0088484',
 'tt2098220',
 'tt5425186',
 'tt0318871',
 'tt0094517',
 'tt0436992',
 'tt1586680',
 'tt0092324',
 'tt0994314',
 'tt0203082',
 'tt1606375',
 'tt0380136',
 'tt0187664',
 'tt1513168',
 'tt0118273',
 'tt0421357',
 'tt1641384',
 'tt0314979',
 'tt5834204',
 'tt0092455',
 'tt0115147',
 'tt4295140',
 'tt0080306',
 'tt1266020',
 'tt1831164',
 'tt3920596',
 'tt0804503',
 'tt1492966',
 'tt0053488',
 'tt0086831',
 'tt0758745',
 'tt0995832',
 'tt0434706',
 'tt2401256',
 'tt0423731',
 'tt0111958',
 'tt0863046',
 'tt1733785',
 'tt2049116',
 'tt0275137',
 'tt1305826',
 'tt0472027',
 'tt2100976',
 'tt1489428',
 'tt0112159',
 'tt4158110',
 'tt1227926',
 'tt1870479',
 'tt0979432',
 'tt0106028',
 'tt0387764',
 'tt0237123',
 'tt0047708',
 'tt0088509',
 'tt0290978',
 'tt1984119',
 'tt0098825',
 'tt2306299',
 'tt0280249',
 'tt3647998',
 'tt0094525',
 'tt0163507',
 'tt0118266',
 'tt0182629',
 'tt0080297',
 'tt0061287',
 'tt1758429',
 'tt3671754',
 'tt0487831',
 'tt0388629',
 'tt2575988',
 'tt4189022',
 'tt0458254',
 'tt2788432',
 'tt0096657',
 'tt0346314',
 'tt1474684',
 'tt4288182',
 'tt0417373',
 'tt1298820',
 'tt0262150',
 'tt1695360',
 'tt1230180',
 'tt2243973',
 'tt0129690',
 'tt1632701',
 'tt2433738',
 'tt0149460',
 'tt1124373',
 'tt0075520',
 'tt1795096',
 'tt1442449',
 'tt5249462',
 'tt2937900',
 'tt1439629',
 'tt5071412',
 'tt0397150',
 'tt0083466',
 'tt2701582',
 'tt5114356',
 'tt4156586',
 'tt0319969',
 'tt0103584',
 'tt0302199',
 'tt0070644',
 'tt1883092',
 'tt2311418',
 'tt3428912',
 'tt1442437',
 'tt0362192',
 'tt0278238',
 'tt0387199',
 'tt2384811',
 'tt0098833',
 'tt0074028',
 'tt2303687',
 'tt0807832',
 'tt0056751',
 'tt0173528',
 'tt3358020',
 'tt0103466',
 'tt1526318',
 'tt0185133',
 'tt0075572',
 'tt0112084',
 'tt1837492',
 'tt2919910',
 'tt1299368',
 'tt0094535',
 'tt1520211',
 'tt0108906',
 'tt0988824',
 'tt5421602',
 'tt5853176',
 'tt0934320',
 'tt0337898',
 'tt0495212',
 'tt0460681',
 'tt2407574',
 'tt0290988',
 'tt1598754',
 'tt1119644',
 'tt1220617',
 'tt3398228',
 'tt0411008',
 'tt0163503',
 'tt2249364',
 'tt1409055',
 'tt4270492',
 'tt0060028',
 'tt0118480',
 'tt0925266',
 'tt3012698',
 'tt0402711',
 'tt0068098',
 'tt0442632',
 'tt1839578',
 'tt0043208',
 'tt5673782']
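The id-extraction step can be checked offline without requesting the IMDb page. The sketch below is a swapped-in technique, not the notebook's method: it uses a regular expression from the standard library instead of Beautiful Soup, and the HTML fragment is a hypothetical imitation of the chart markup (the `ref_` query parameter and show names are made up).

```python
import re

# Hypothetical fragment imitating the markup targeted by the scraper above.
sample_html = """
<tbody class="lister-list">
  <tr><td class="titleColumn"><a href="/title/tt5491994/?ref_=chart">Show A</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0185906/">Show B</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt5491994/">Show A again</a></td></tr>
</tbody>
"""

# An IMDb title link has the form /title/tt<digits>/...
TITLE_ID = re.compile(r'href="/title/(tt\d+)/')


def extract_imdb_ids(html):
    """Return the IMDb ids found in html, in page order, without repeats."""
    seen = set()
    ids = []
    for imdb_id in TITLE_ID.findall(html):
        if imdb_id not in seen:
            seen.add(imdb_id)
            ids.append(imdb_id)
    return ids


print(extract_imdb_ids(sample_html))  # ['tt5491994', 'tt0185906']
```

Dropping repeats matters when several scraping runs are merged, since the same show can appear on more than one results page.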

# This code has been executed, and the results pickled and stored locally, so no need to run these requests
# to the API again. The api address with key to look up show with imdb number is
# http://api.tvmaze.com/lookup/shows?imdb=<show imdb identifier>

DO_NOT_RUN = True     # Do not run when notebook is loaded to avoid unnecessary calls to the API

if not DO_NOT_RUN:
    shows = pd.DataFrame()
    for show_id in show_list:
        try:
            print(show_id)
            # Get the tv show info from the api
            url = "http://api.tvmaze.com/lookup/shows?imdb=" + show_id
            r = requests.get(url)

            # convert the return data to a dictionary
            json_data = r.json()

            # load a temp dataframe with the dictionary, then append to the composite dataframe
            temp_df = pd.DataFrame.from_dict(json_data, orient='index', dtype=None)
            ttemp_df = temp_df.T     # Was not able to load json in column orientation, so must transpose
            # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
            shows = pd.concat([shows, ttemp_df], ignore_index=True)
        except Exception:
            # Any failure (network error, no match, unparseable response) is logged and skipped
            print("%s  could not be retrieved from api" % show_id)

    shows.head()
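The retrieval loop above mixes network calls with data assembly, which makes it hard to test without hitting the API. The sketch below separates the two, under stated assumptions: `collect_shows`, `fetch_text` and `fake_fetch` are hypothetical names, the canned response is invented, and only the lookup address is taken from the notebook.

```python
import json

# Lookup address quoted in the notebook; the IMDb id is appended to it.
LOOKUP_URL = "http://api.tvmaze.com/lookup/shows?imdb="


def lookup_url(imdb_id):
    """Build the lookup address for one IMDb id."""
    return LOOKUP_URL + imdb_id


def collect_shows(imdb_ids, fetch_text):
    """Fetch each id and split the results into found records and missing ids.

    fetch_text is a hypothetical callable taking a URL and returning the
    response body as text; it may raise on failure.
    """
    found, missing = [], []
    for imdb_id in imdb_ids:
        try:
            record = json.loads(fetch_text(lookup_url(imdb_id)))
        except Exception:
            record = None
        if isinstance(record, dict):
            found.append(record)
        else:
            missing.append(imdb_id)
    return found, missing


# Offline stand-in for the network: one invented record, everything else fails.
canned = {lookup_url("tt0000001"): '{"id": 1, "name": "Example Show"}'}


def fake_fetch(url):
    return canned[url]  # raises KeyError for ids without a canned response


found, missing = collect_shows(["tt0000001", "tt0000002"], fake_fetch)
print(found)    # [{'id': 1, 'name': 'Example Show'}]
print(missing)  # ['tt0000002']
```

For a live run, `fetch_text` could be something like `lambda url: requests.get(url).text`; returning the missing ids as a list also makes it easy to retry them or to count how many lookups failed.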

# write the contents of an object to a file for later retrieval

DO_NOT_RUN = True   # Be sure to check the file name to write before enabling execution on this block

if not DO_NOT_RUN:
    # The with-block closes the file even if pickling fails
    with open("save_shows_df.p", "wb") as f:
        pickle.dump(shows, f)

Get list of bottom rated TV Series

# This code block was changed multiple times to pull HTML with different sets of low-rated shows.
# Ultimately about 1200 IMDb ids were scraped, and about 1/3 of those could be pulled from the TVmaze API.

url ="http://www.imdb.com/search/title?count=600&languages=en&title_type=tv_series&user_rating=3.4,5.0&sort=user_rating,asc"
r = requests.get(url)
html = r.text
html[0:200]

u'\n\n\n\n<!DOCTYPE html>\n<html\nxmlns:og="http://ogp.me/ns#"\nxmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=ed'

# Use Beautiful soup to extract the imdb numbers from the webpage
soup = BeautifulSoup(html, "lxml")

loser_list = []
for div in soup.find_all('div', class_='lister-list'):
    for h3 in div.find_all('h3', class_='lister-item-header'):
        # The link looks like /title/<imdb id>/..., so the id is the third "/"-separated piece
        loser_list.append(h3.a['href'].split("/")[2])

loser_list

['tt0773264',
 'tt1798695',
 'tt1307083',
 'tt4845734',
 'tt0046641',
 'tt1519575',
 'tt0853078',
 'tt0118423',
 'tt0284767',
 'tt4052124',
 'tt0878801',
 'tt3703500',
 'tt1105170',
 'tt4363582',
 'tt3155428',
 'tt0362350',
 'tt0287196',
 'tt2766052',
 'tt0405545',
 'tt0262975',
 'tt0367278',
 'tt7134262',
 'tt1695352',
 'tt0421470',
 'tt2466890',
 'tt0343305',
 'tt1002739',
 'tt1615697',
 'tt0274262',
 'tt0465320',
 'tt1388381',
 'tt0358889',
 'tt1085789',
 'tt1011591',
 'tt0364804',
 'tt1489335',
 'tt3612584',
 'tt0363377',
 'tt0111930',
 'tt0401913',
 'tt0808086',
 'tt0309212',
 'tt5464192',
 'tt0080250',
 'tt4533338',
 'tt4741696',
 'tt1922810',
 'tt1793868',
 'tt4789316',
 'tt0185054',
 'tt1079622',
 'tt1786048',
 'tt0790508',
 'tt1716372',
 'tt0295098',
 'tt3409706',
 'tt0222574',
 'tt2171325',
 'tt0442643',
 'tt2142117',
 'tt0371433',
 'tt0138244',
 'tt1002010',
 'tt0495557',
 'tt1811817',
 'tt5529996',
 'tt1352053',
 'tt0439346',
 'tt0940147',
 'tt3075138',
 'tt1974439',
 'tt2693842',
 'tt0092325',
 'tt6772826',
 'tt1563069',
 'tt0489598',
 'tt0142055',
 'tt1566154',
 'tt0338592',
 'tt0167515',
 'tt2330327',
 'tt1576464',
 'tt2389845',
 'tt0186747',
 'tt0355096',
 'tt1821877',
 'tt0112033',
 'tt1792654',
 'tt0472243',
 'tt6453018',
 'tt3648886',
 'tt1599374',
 'tt2946482',
 'tt4672020',
 'tt1016283',
 'tt2649480',
 'tt1229945',
 'tt2390606',
 'tt1876612',
 'tt0140732',
 'tt1176156',
 'tt0158522',
 'tt4922726',
 'tt0068104',
 'tt2798842',
 'tt1150627',
 'tt1545453',
 'tt3685566',
 'tt0287223',
 'tt4185510',
 'tt0329912',
 'tt0289808',
 'tt0358849',
 'tt2320439',
 'tt0906840',
 'tt0800281',
 'tt1103082',
 'tt2416362',
 'tt3493906',
 'tt0381827',
 'tt0817553',
 'tt0252172',
 'tt0799872',
 'tt0816224',
 'tt1077162',
 'tt1918005',
 'tt1240983',
 'tt1415000',
 'tt5039916',
 'tt0451467',
 'tt0296438',
 'tt1159990',
 'tt0144701',
 'tt4718304',
 'tt1095213',
 'tt1453090',
 'tt0168372',
 'tt0425725',
 'tt3300126',
 'tt1415098',
 'tt5459976',
 'tt4041694',
 'tt2322264',
 'tt1441005',
 'tt1117549',
 'tt0365991',
 'tt0364807',
 'tt1591375',
 'tt3562462',
 'tt6118186',
 'tt3587176',
 'tt1372127',
 'tt0445865',
 'tt2088493',
 'tt4658248',
 'tt0103444',
 'tt4956964',
 'tt1326185',
 'tt0406422',
 'tt1973659',
 'tt1578933',
 'tt0446621',
 'tt1850624',
 'tt0159177',
 'tt0490539',
 'tt0306398',
 'tt0288922',
 'tt0465336',
 'tt0176397',
 'tt1641939',
 'tt0498879',
 'tt0306296',
 'tt1394277',
 'tt0398416',
 'tt2849552',
 'tt1433566',
 'tt0806893',
 'tt3252890',
 'tt3774098',
 'tt0791275',
 'tt5690224',
 'tt0361181',
 'tt0486953',
 'tt1514319',
 'tt3697290',
 'tt1342752',
 'tt0478936',
 'tt0094448',
 'tt0795101',
 'tt1340759',
 'tt0840061',
 'tt1151434',
 'tt0281429',
 'tt0845745',
 'tt2993514',
 'tt0783634',
 'tt1650352',
 'tt1249256',
 'tt2135766',
 'tt3231114',
 'tt1702421',
 'tt2940494',
 'tt6664486',
 'tt0081857',
 'tt1319598',
 'tt0247094',
 'tt6392176',
 'tt0320969',
 'tt2720144',
 'tt0360266',
 'tt2287380',
 'tt1715368',
 'tt0282291',
 'tt2248736',
 'tt2010634',
 'tt1489432',
 'tt4855578',
 'tt1721484',
 'tt0380850',
 'tt3084090',
 'tt2392683',
 'tt1381004',
 'tt1628058',
 'tt2935638',
 'tt1837169',
 'tt2404111',
 'tt2364381',
 'tt0888095',
 'tt2352123',
 'tt1013862',
 'tt4295320',
 'tt1249227',
 'tt1879603',
 'tt0167566',
 'tt0924528',
 'tt0361144',
 'tt0133300',
 'tt5888698',
 'tt1468817',
 'tt4006060',
 'tt0106096',
 'tt0287243',
 'tt1287376',
 'tt0060032',
 'tt1535270',
 'tt4831262',
 'tt0416397',
 'tt1546138',
 'tt2203971',
 'tt0214353',
 'tt0368518',
 'tt0382506',
 'tt5317980',
 'tt2313839',
 'tt1202295',
 'tt4146118',
 'tt1226448',
 'tt0403748',
 'tt0415448',
 'tt4665932',
 'tt3016956',
 'tt1412249',
 'tt1829773',
 'tt0872053',
 'tt0481443',
 'tt0493098',
 'tt0039120',
 'tt1411598',
 'tt0106123',
 'tt1740718',
 'tt0362153',
 'tt1637756',
 'tt0120974',
 'tt2328067',
 'tt0057741',
 'tt1261356',
 'tt2559390',
 'tt0083433',
 'tt0380934',
 'tt4388486',
 'tt0108821',
 'tt0115338',
 'tt0167735',
 'tt0460630',
 'tt2330453',
 'tt0398429',
 'tt0294140',
 'tt0804423',
 'tt2191952',
 'tt1118131',
 'tt4016700',
 'tt5786580',
 'tt0950199',
 'tt1760165',
 'tt4896654',
 'tt0414719',
 'tt1675974',
 'tt0465343',
 'tt1477137',
 'tt0115171',
 'tt3565412',
 'tt0382458',
 'tt0945153',
 'tt0199278',
 'tt1353293',
 'tt1426343',
 'tt2180165',
 'tt5117094',
 'tt1191039',
 'tt0497857',
 'tt0780409',
 'tt2670950',
 'tt1385183',
 'tt3396736',
 'tt2563482',
 'tt4094138',
 'tt0295065',
 'tt1696268',
 'tt0891053',
 'tt0914267',
 'tt1786018',
 'tt1988479',
 'tt1707814',
 'tt1595853',
 'tt2310444',
 'tt5434894',
 'tt0267216',
 'tt0855313',
 'tt1832828',
 'tt0426685',
 'tt2309561',
 'tt2486556',
 'tt0284786',
 'tt3136814',
 'tt1989818',
 'tt1179310',
 'tt0424748',
 'tt1126298',
 'tt0944946',
 'tt1882639',
 'tt0439904',
 'tt0875887',
 'tt1624991',
 'tt2747670',
 'tt2324247',
 'tt0403810',
 'tt1724452',
 'tt2366252',
 'tt3752894',
 'tt0198211',
 'tt1491318',
 'tt1666205',
 'tt2460474',
 'tt0303435',
 'tt0453329',
 'tt0220938',
 'tt0299264',
 'tt0783341',
 'tt0850175',
 'tt1191056',
 'tt0235917',
 'tt0111892',
 'tt0166442',
 'tt2643770',
 'tt5633924',
 'tt0075485',
 'tt0423657',
 'tt5327970',
 'tt3326032',
 'tt5785658',
 'tt2190731',
 'tt0101041',
 'tt3317020',
 'tt4732076',
 'tt2305717',
 'tt3828162',
 'tt0890935',
 'tt0449460',
 'tt0126175',
 'tt3601886',
 'tt5062878',
 'tt1579911',
 'tt0407354',
 'tt6723012',
 'tt5819414',
 'tt4180738',
 'tt0300802',
 'tt2649738',
 'tt3181412',
 'tt0382400',
 'tt3189040',
 'tt0324919',
 'tt2168240',
 'tt2560966',
 'tt0168373',
 'tt0403824',
 'tt0375440',
 'tt3746054',
 'tt2488150',
 'tt4081326',
 'tt5011838',
 'tt2644204',
 'tt1210781',
 'tt0246359',
 'tt0048898',
 'tt3398108',
 'tt5701572',
 'tt0426827',
 'tt0425714',
 'tt1252620',
 'tt0800289',
 'tt0111991',
 'tt0479847',
 'tt2429392',
 'tt2901828',
 'tt4147072',
 'tt1442411',
 'tt2093677',
 'tt0498421',
 'tt3006666',
 'tt3017190',
 'tt0193680',
 'tt5952954',
 'tt0381759',
 'tt2539740',
 'tt0369176',
 'tt3016990',
 'tt0328787',
 'tt2197994',
 'tt0478753',
 'tt4530152',
 'tt0372643',
 'tt5693024',
 'tt0855669',
 'tt1263594',
 'tt5935350',
 'tt1589855',
 'tt0367444',
 'tt3384116',
 'tt3790338',
 'tt2007260',
 'tt0343300',
 'tt0813904',
 'tt0883849',
 'tt0433296',
 'tt1342705',
 'tt0444988',
 'tt1333495',
 'tt0969661',
 'tt0272967',
 'tt0283184',
 'tt0444577',
 'tt3064496',
 'tt0436996',
 'tt1796788',
 'tt1879997',
 'tt4800624',
 'tt0497079',
 'tt1755893',
 'tt0329824',
 'tt2245937',
 'tt2147632',
 'tt3218114',
 'tt1583417',
 'tt0367403',
 'tt1963853',
 'tt4854900',
 'tt6415490',
 'tt1520150',
 'tt0236907',
 'tt6672370',
 'tt1055136',
 'tt5865052',
 'tt1231448',
 'tt6315022',
 'tt4351710',
 'tt4346344',
 'tt6043450',
 'tt0096605',
 'tt1181712',
 'tt0182623',
 'tt0307719',
 'tt1056344',
 'tt0328795',
 'tt0098916',
 'tt1584617',
 'tt2354136',
 'tt4287478',
 'tt0426347',
 'tt1874006',
 'tt2006560',
 'tt1694893',
 'tt2338766',
 'tt0843808',
 'tt0115155',
 'tt4354068',
 'tt1134663',
 'tt0495787',
 'tt0088539',
 'tt5426274',
 'tt1797127',
 'tt5763656',
 'tt0360301',
 'tt4245504',
 'tt0318214',
 'tt0080254',
 'tt1430135',
 'tt0892562',
 'tt2603010',
 'tt1038918',
 'tt0390746',
 'tt3773682',
 'tt0969372',
 'tt1470839',
 'tt1477822',
 'tt1056446',
 'tt0340474',
 'tt5104198',
 'tt2815184',
 'tt0468998',
 'tt0772146',
 'tt3920816',
 'tt3654000',
 'tt1753229',
 'tt0865687',
 'tt0459631',
 'tt1314665',
 'tt4660152',
 'tt0086685',
 'tt0150323',
 'tt0338576',
 'tt2118185',
 'tt0198086',
 'tt0412184',
 'tt4420148',
 'tt0497853',
 'tt1240534',
 'tt2479832',
 'tt0174195',
 'tt1999642',
 'tt1155579',
 'tt1640376',
 'tt1227586',
 'tt3784176',
 'tt1958848',
 'tt2778982',
 'tt1273636',
 'tt0357357',
 'tt1287301',
 'tt0852784',
 'tt0482432',
 'tt1651941',
 'tt0043235',
 'tt2110603',
 'tt1178184',
 'tt0846757',
 'tt0170959',
 'tt0413617',
 'tt1726890',
 'tt0220874',
 'tt0859872',
 'tt4219276',
 'tt0327268',
 'tt0843319',
 'tt3131346',
 'tt0795072',
 'tt5650560',
 'tt0827847',
 'tt1525767',
 'tt1043913',
 'tt0266179',
 'tt0413558',
 'tt0307714',
 'tt4693416',
 'tt0409619',
 'tt5684430',
 'tt0134269',
 'tt5486088',
 'tt1252370',
 'tt6370626',
 'tt3824018',
 'tt2555880',
 'tt3310544',
 'tt2125758',
 'tt1973047',
 'tt6748366',
 'tt0106113',
 'tt0934701',
 'tt2059031',
 'tt0088598',
 'tt1056536',
 'tt1618950',
 'tt6987940',
 'tt5915978',
 'tt0106008',
 'tt0115206',
 'tt0120992',
 'tt4575056',
 'tt2889104',
 'tt0428169']

len(loser_list)

# first_loser_list = loser_list

# This code has been executed, and the results pickled and stored locally, so no need to run these requests
# to the API again

DO_NOT_RUN = True

if not DO_NOT_RUN:
    losers = pd.DataFrame()
    for loser_id in loser_list:
        try:
            print(loser_id)
            # Get the tv show info from the api
            url = "http://api.tvmaze.com/lookup/shows?imdb=" + loser_id
            r = requests.get(url)

            # convert the return data to a dictionary
            json_data = r.json()

            # load a temp dataframe with the dictionary, then append to the composite dataframe
            temp_df = pd.DataFrame.from_dict(json_data, orient='index', dtype=None)
            ttemp_df = temp_df.T     # Was not able to load json in column orientation, so must transpose
            # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
            losers = pd.concat([losers, ttemp_df], ignore_index=True)
        except Exception:
            # Any failure (network error, no match, unparseable response) is logged and skipped
            print("%s  could not be retrieved from api" % loser_id)

    losers.head()

tt0465347
tt0465347  could not be retrieved from api
tt4427122
tt4427122  could not be retrieved from api
tt1015682
tt1015682  could not be retrieved from api
tt2505738
tt2505738  could not be retrieved from api
tt2402465
tt2402465  could not be retrieved from api
tt0278236
tt0278236  could not be retrieved from api
tt0268066
tt0268066  could not be retrieved from api
tt4813760
tt4813760  could not be retrieved from api
tt1526001
tt1526001  could not be retrieved from api
tt1243976
tt1243976  could not be retrieved from api
tt2058498
tt3897284
tt3897284  could not be retrieved from api
tt3665690
tt3665690  could not be retrieved from api
tt4132180
tt4132180  could not be retrieved from api
tt0824229
tt0824229  could not be retrieved from api
tt0314990
tt0314990  could not be retrieved from api
tt5423750
tt5423750  could not be retrieved from api
tt5423664
tt5423664  could not be retrieved from api
tt2175125
tt2175125  could not be retrieved from api
tt0404593
tt0404593  could not be retrieved from api
tt4160422
tt4160422  could not be retrieved from api
tt4552562
tt4552562  could not be retrieved from api
tt5804854
tt5804854  could not be retrieved from api
tt0886666
tt0886666  could not be retrieved from api
tt5423824
tt5423824  could not be retrieved from api
tt3500210
tt3500210  could not be retrieved from api
tt0285357
tt0285357  could not be retrieved from api
tt0280234
tt0280234  could not be retrieved from api
tt1863530
tt1863530  could not be retrieved from api
tt0280349
tt0280349  could not be retrieved from api
tt2660922
tt2660922  could not be retrieved from api
tt0292776
tt0292776  could not be retrieved from api
tt4566242
tt0264230
tt0264230  could not be retrieved from api
tt1102523
tt1102523  could not be retrieved from api
tt3333790
tt3333790  could not be retrieved from api
tt0320863
tt0320863  could not be retrieved from api
tt0830848
tt0830848  could not be retrieved from api
tt0939270
tt0939270  could not be retrieved from api
tt1459294
tt1459294  could not be retrieved from api
tt6026132
tt6026132  could not be retrieved from api
tt1443593
tt1443593  could not be retrieved from api
tt0354267
tt0354267  could not be retrieved from api
tt0147749
tt0147749  could not be retrieved from api
tt0161180
tt0161180  could not be retrieved from api
tt4733812
tt4733812  could not be retrieved from api
tt0367362
tt0367362  could not be retrieved from api
tt5626868
tt5626868  could not be retrieved from api
tt7268752
tt7268752  could not be retrieved from api
tt1364951
tt2341819
tt0464767
tt0464767  could not be retrieved from api
tt3550770
tt3550770  could not be retrieved from api
tt6422012
tt6422012  could not be retrieved from api
tt3154248
tt3154248  could not be retrieved from api
tt5016274
tt5016274  could not be retrieved from api
tt1715229
tt1715229  could not be retrieved from api
tt0489426
tt0489426  could not be retrieved from api
tt5798754
tt5798754  could not be retrieved from api
tt2022182
tt2022182  could not be retrieved from api
tt0303564
tt0303564  could not be retrieved from api
tt3462252
tt3462252  could not be retrieved from api
tt0329849
tt0329849  could not be retrieved from api
tt5074180
tt5074180  could not be retrieved from api
tt3900878
tt3900878  could not be retrieved from api
tt3887402
tt3887402  could not be retrieved from api
tt1893088
tt0445890
tt0149408
tt0149408  could not be retrieved from api
tt1360544
tt1360544  could not be retrieved from api
tt1718355
tt1718355  could not be retrieved from api
tt2364950
tt2364950  could not be retrieved from api
tt2279571
tt0285374
tt0285374  could not be retrieved from api
tt5267590
tt5267590  could not be retrieved from api
tt0314993
tt0314993  could not be retrieved from api
tt0300870
tt0300870  could not be retrieved from api
tt7036530
tt7036530  could not be retrieved from api
tt5657014
tt5657014  could not be retrieved from api
tt0149488
tt0149488  could not be retrieved from api
tt1204865
tt1204865  could not be retrieved from api
tt1182860
tt1182860  could not be retrieved from api
tt0423626
tt0423626  could not be retrieved from api
tt4223864
tt4223864  could not be retrieved from api
tt1773440
tt1773440  could not be retrieved from api
tt0872067
tt0872067  could not be retrieved from api
tt0428172
tt0428172  could not be retrieved from api
tt0817379
tt0817379  could not be retrieved from api
tt1210720
tt1210720  could not be retrieved from api
tt3855028
tt3855028  could not be retrieved from api
tt1611594
tt1611594  could not be retrieved from api
tt5822004
tt5822004  could not be retrieved from api
tt6524930
tt6524930  could not be retrieved from api
tt1733734
tt1902032
tt1902032  could not be retrieved from api
tt0466201
tt0466201  could not be retrieved from api
tt1757293
tt1757293  could not be retrieved from api
tt1807575
tt1807575  could not be retrieved from api
tt0332896
tt0332896  could not be retrieved from api
tt3140278
tt3140278  could not be retrieved from api
tt1176297
tt1176297  could not be retrieved from api
tt0285406
tt0285406  could not be retrieved from api
tt6680212
tt6680212  could not be retrieved from api
tt0200336
tt0200336  could not be retrieved from api
tt0385483
tt0385483  could not be retrieved from api
tt3534894
tt3534894  could not be retrieved from api
tt1108281
tt1108281  could not be retrieved from api
tt3855016
tt3855016  could not be retrieved from api
tt0787948
tt0787948  could not be retrieved from api
tt1372153
tt1292967
tt1292967  could not be retrieved from api
tt1466565
tt1466565  could not be retrieved from api
tt0435565
tt0435565  could not be retrieved from api
tt1817054
tt2879822
tt1229266
tt1229266  could not be retrieved from api
tt0364837
tt0364837  could not be retrieved from api
tt0477409
tt0477409  could not be retrieved from api
tt0875097
tt0875097  could not be retrieved from api
tt1227542
tt1227542  could not be retrieved from api
tt1131289
tt1131289  could not be retrieved from api
tt0355135
tt0355135  could not be retrieved from api
tt1418598
tt0290970
tt0290970  could not be retrieved from api
tt0184124
tt0184124  could not be retrieved from api
tt0490736
tt0490736  could not be retrieved from api
tt0439354
tt0439354  could not be retrieved from api
tt1157935
tt1157935  could not be retrieved from api
tt1425641
tt1425641  could not be retrieved from api
tt2830404
tt2830404  could not be retrieved from api
tt0835397
tt0835397  could not be retrieved from api
tt0880581
tt0880581  could not be retrieved from api
tt1078463
tt1078463  could not be retrieved from api
tt0190177
tt1234506
tt1234506  could not be retrieved from api
tt0323463
tt0323463  could not be retrieved from api
tt5047510
tt5338860
tt5168468
tt5168468  could not be retrieved from api
tt0296322
tt0296322  could not be retrieved from api
tt3911254
tt3911254  could not be retrieved from api
tt3827516
tt3827516  could not be retrieved from api
tt0364899
tt0364899  could not be retrieved from api
tt4204032
tt4204032  could not be retrieved from api
tt0259768
tt0259768  could not be retrieved from api
tt0287880
tt0287880  could not be retrieved from api
tt0270763
tt0270763  could not be retrieved from api
tt0846349
tt0846349  could not be retrieved from api
tt2699648
tt2699648  could not be retrieved from api
tt3616368
tt3616368  could not be retrieved from api
tt2672920
tt2672920  could not be retrieved from api
tt1848281
tt0813074
tt0813074  could not be retrieved from api
tt1694422
tt1694422  could not be retrieved from api
tt0472241
tt0472241  could not be retrieved from api
tt0202186
tt0202186  could not be retrieved from api
tt1297366
tt1297366  could not be retrieved from api
tt3919918
tt3919918  could not be retrieved from api
tt1564985
tt1564985  could not be retrieved from api
tt3336800
tt3336800  could not be retrieved from api
tt6839504
tt2114184
tt2254454
tt2254454  could not be retrieved from api
tt1674023
tt0824737
tt0824737  could not be retrieved from api
tt1288431
tt1288431  could not be retrieved from api
tt1705811
tt1705811  could not be retrieved from api
tt0968726
tt0968726  could not be retrieved from api
tt2058840
tt2058840  could not be retrieved from api
tt1971860
tt3857708
tt3857708  could not be retrieved from api
tt0315030
tt0315030  could not be retrieved from api
tt2337185
tt2337185  could not be retrieved from api
tt0775356
tt0775356  could not be retrieved from api
tt0244356
tt0244356  could not be retrieved from api
tt2338400
tt2338400  could not be retrieved from api
tt0220047
tt0220047  could not be retrieved from api
tt0341789
tt0341789  could not be retrieved from api
tt0197151
tt0197151  could not be retrieved from api
tt0222529
tt0222529  could not be retrieved from api
tt6086050
tt6086050  could not be retrieved from api
tt3100634
tt1625263
tt1625263  could not be retrieved from api
tt2289244
tt2289244  could not be retrieved from api
tt1936732
tt0278229
tt0278229  could not be retrieved from api
tt0429438
tt0429438  could not be retrieved from api
tt1410490
tt1410490  could not be retrieved from api
tt5588910
tt5588910  could not be retrieved from api
tt3670858
tt3670858  could not be retrieved from api
tt1197582
tt0397182
tt0397182  could not be retrieved from api
tt1911975
tt1911975  could not be retrieved from api
tt0420366
tt0420366  could not be retrieved from api
tt3079034
tt3079034  could not be retrieved from api
tt0859270
tt0859270  could not be retrieved from api
tt0050070
tt0050070  could not be retrieved from api
tt0300798
tt0300798  could not be retrieved from api
tt5915502
tt5915502  could not be retrieved from api
tt6697244
tt6697244  could not be retrieved from api
tt1776388
tt1776388  could not be retrieved from api
tt0424639
tt0424639  could not be retrieved from api
tt1119204
tt1119204  could not be retrieved from api
tt1744868
tt1744868  could not be retrieved from api
tt1588824
tt1588824  could not be retrieved from api
tt1485389
tt3696798
tt3696798  could not be retrieved from api
tt0301123
tt0301123  could not be retrieved from api
tt1018436
tt1018436  could not be retrieved from api
tt0815776
tt0815776  could not be retrieved from api
tt0407462
tt0407462  could not be retrieved from api
tt0198147
tt0198147  could not be retrieved from api
tt0997412
tt0997412  could not be retrieved from api
tt2288050
tt1612920
tt0402701
tt5047494
tt5047494  could not be retrieved from api
tt5368216
tt5368216  could not be retrieved from api
tt3356610
tt3356610  could not be retrieved from api
tt0491735
tt1454750
tt1454750  could not be retrieved from api
tt5891726
tt5891726  could not be retrieved from api
tt2369946
tt4286824
tt4286824  could not be retrieved from api
tt0476926
tt0476926  could not be retrieved from api
tt5167034
tt5167034  could not be retrieved from api
tt0056759
tt0056759  could not be retrieved from api
tt3622818
tt3622818  could not be retrieved from api
tt0887788
tt0887788  could not be retrieved from api
tt4588620
tt4588620  could not be retrieved from api
tt0258341
tt0258341  could not be retrieved from api
tt0489430
tt0489430  could not be retrieved from api
tt2567210
tt2567210  could not be retrieved from api
tt0990403
tt4674178
tt4674178  could not be retrieved from api
tt0125638
tt0125638  could not be retrieved from api
tt5146640
tt5146640  could not be retrieved from api
tt0196284
tt0196284  could not be retrieved from api
tt3075154
tt3075154  could not be retrieved from api
tt0436003
tt0436003  could not be retrieved from api
tt1538090
tt1538090  could not be retrieved from api
tt1728226
tt1728226  could not be retrieved from api
tt3796070
tt3796070  could not be retrieved from api
tt1381395
tt1381395  could not be retrieved from api
tt0190199
tt0190199  could not be retrieved from api
tt0855213
tt0855213  could not be retrieved from api
tt0358890
tt0358890  could not be retrieved from api
tt3484986
tt3484986  could not be retrieved from api
tt2208507
tt2208507  could not be retrieved from api
tt4896052
tt4896052  could not be retrieved from api
tt6148376
tt0217211
tt0217211  could not be retrieved from api
tt0430836
tt0430836  could not be retrieved from api
tt1429551
tt1291098
tt1291098  could not be retrieved from api
tt0399968
tt0399968  could not be retrieved from api
tt2909920
tt2909920  could not be retrieved from api
tt3164276
tt3164276  could not be retrieved from api
tt1586637
tt4873032
tt0926012
tt0926012  could not be retrieved from api
tt1305560
tt1305560  could not be retrieved from api
tt1291488
tt1291488  could not be retrieved from api
tt0428088
tt0428088  could not be retrieved from api
tt1057469
tt1057469  could not be retrieved from api
tt3807326
tt3807326  could not be retrieved from api
tt3293566
tt0410964
tt1579186
tt0271931
tt6519752
tt1417358
tt4568130
tt1705611
tt2235190
tt0244328
tt0244328  could not be retrieved from api
tt0459155
tt0459155  could not be retrieved from api
tt1890984
tt1890984  could not be retrieved from api
tt0460381
tt0460381  could not be retrieved from api
tt0439069
tt0439069  could not be retrieved from api
tt0329817
tt0329817  could not be retrieved from api
tt1805082
tt1805082  could not be retrieved from api
tt0468985
tt0468985  could not be retrieved from api
tt1071166
tt1071166  could not be retrieved from api
tt1634699
tt1634699  could not be retrieved from api
tt1086761
tt4214468
tt0170930
tt0170930  could not be retrieved from api
tt5937940
tt0305056
tt1024887
tt1024887  could not be retrieved from api
tt1833558
tt7062438
tt7062438  could not be retrieved from api
tt4411548
tt4411548  could not be retrieved from api
tt0105970
tt0105970  could not be retrieved from api
tt0348949
tt0348949  could not be retrieved from api
tt2309197
tt2309197  could not be retrieved from api
tt0327271
tt0327271  could not be retrieved from api
tt1729597
tt1729597  could not be retrieved from api
tt0428108
tt0428108  could not be retrieved from api
tt3144026
tt3144026  could not be retrieved from api
tt0292770
tt0077041
tt1489024
tt0458269
tt1020924
tt0444578
tt0787980
tt0249275
tt1280868
tt0462121
tt3136086
tt1908157
tt0055714
tt0781991
tt0224517
tt0426804
tt0484508
tt0186742
tt0460081
tt0320809
tt0798631
tt3119834
tt3804586
tt0479614
tt0479614  could not be retrieved from api
tt0780447
tt0780447  could not be retrieved from api
tt0123366
tt3481544
tt3975956
tt3975956  could not be retrieved from api
tt5335110
tt0471990
tt0471990  could not be retrieved from api
tt1332074
tt6846846
tt6846846  could not be retrieved from api
tt1259798
tt0381741
tt0381741  could not be retrieved from api
tt2953706
tt1244881
tt6208480
tt6208480  could not be retrieved from api
tt1232190
tt0829040
tt0829040  could not be retrieved from api
tt3859844
tt1761662
tt1761662  could not be retrieved from api
tt2262354
tt0103411
tt0103411  could not be retrieved from api
tt0356281
tt0356281  could not be retrieved from api
tt4628798
tt4628798  could not be retrieved from api
tt0283714
tt1147702
tt1147702  could not be retrieved from api
tt0780444
tt0780444  could not be retrieved from api
tt1981147
tt0756524
tt0312095
tt0260645
tt1728958
tt4688354
tt1296242
tt1062211
tt1500453
tt0358320
tt1118205
tt0480781
tt0303490
tt0278256
tt0812148
tt0892683
tt1562042
tt0218767
tt2265901
tt1456074
tt1978967
tt0313038
tt5437800
tt5437800  could not be retrieved from api
tt2453016
tt5209238
tt5209238  could not be retrieved from api
tt7165310
tt7165310  could not be retrieved from api
tt1277979
tt0362379
tt0362379  could not be retrieved from api
tt0348512
tt0348512  could not be retrieved from api
tt1024814
tt0065343
tt0065343  could not be retrieved from api
tt3976016
tt3976016  could not be retrieved from api
tt1459376
tt1459376  could not be retrieved from api
tt4629950
tt4629950  could not be retrieved from api
tt0443361
tt0443361  could not be retrieved from api
tt1320317
tt1320317  could not be retrieved from api
tt1770959
tt6212410
tt6212410  could not be retrieved from api
tt3731648
tt5872774
tt5872774  could not be retrieved from api
tt4410468
tt0196232
tt0196232  could not be retrieved from api
tt3693866
tt3693866  could not be retrieved from api
tt6295148
tt6295148  could not be retrieved from api
tt0804424
tt0804424  could not be retrieved from api
tt0458252
tt0458252  could not be retrieved from api
tt2933730
tt2933730  could not be retrieved from api
tt5690306
tt5690306  could not be retrieved from api
tt3038492
tt0854912
tt0426740
tt0364787
tt1033281
tt0473416
tt5423592
tt2064427
tt1208634
tt0402660
tt1566044
tt0292845
tt2633208
tt1685317
tt0421158
tt1176154
tt3099832
tt0396337
tt0337790
tt0287847
tt0421343
tt0408364
tt0346300
tt0346300  could not be retrieved from api
tt2908564
tt2908564  could not be retrieved from api
tt0348894
tt6959064
tt6959064  could not be retrieved from api
tt1737565
tt1454730
tt0468999
tt1495163
tt2514488
tt2390003
tt0293725
tt0293725  could not be retrieved from api
tt0092362
tt0092362  could not be retrieved from api
tt0818895
tt0818895  could not be retrieved from api
tt1509653
tt1509653  could not be retrieved from api
tt1809909
tt1809909  could not be retrieved from api
tt1796975
tt1796975  could not be retrieved from api
tt6501522
tt6501522  could not be retrieved from api
tt0424611
tt0424611  could not be retrieved from api
tt0439932
tt0439932  could not be retrieved from api
tt4671004
tt0471048
tt0471048  could not be retrieved from api
tt1156526
tt1156526  could not be retrieved from api
tt0264226
tt0264226  could not be retrieved from api
tt1170222
tt1170222  could not be retrieved from api
tt2689384
tt0295081
tt0295081  could not be retrieved from api
tt4369244
tt4369244  could not be retrieved from api
tt2781594
tt2781594  could not be retrieved from api
tt4662374
tt1105316
tt1105316  could not be retrieved from api
tt3840030
tt3840030  could not be retrieved from api
tt2579722
tt0072546
tt4628790
tt0046590
tt2184509
tt0497854
tt0363323
tt1458207
tt0439356
tt0377146
tt0954318
tt2214505
tt2435530
tt0473419
tt0768151
tt0439365
tt0278177
tt1299440
tt2083701
tt1933836
tt6473824
tt6473824  could not be retrieved from api
tt0187632
tt0187632  could not be retrieved from api
tt4033696
tt0391666
tt0391666  could not be retrieved from api
tt0465344
tt0465344  could not be retrieved from api
tt2170392
tt4390084
tt2189892
tt2189892  could not be retrieved from api
tt6586510
tt6586510  could not be retrieved from api
tt3174316
tt2374870
tt2374870  could not be retrieved from api
tt2366111
tt2111994
tt2111994  could not be retrieved from api
tt4588734
tt4588734  could not be retrieved from api
tt0863047
tt0863047  could not be retrieved from api
tt1495648
tt1579108
tt1579108  could not be retrieved from api
tt1159610
tt0984168
tt0984168  could not be retrieved from api
tt6752226
tt6752226  could not be retrieved from api
tt0856723
tt0856723  could not be retrieved from api
tt0416347
tt0416347  could not be retrieved from api
tt5571740
tt5571740  could not be retrieved from api
tt1552185
tt1552185  could not be retrieved from api
tt3595870
tt1728864
tt1062185
tt0380949
tt1013861
tt0848174
tt0321000
tt1855738
tt0363335
tt0420381
tt1814550
tt1987353
tt0187654
tt1461569
tt1850160
tt0954661
tt0198095
tt4012388
tt0482028
tt0176381
tt0419307
tt1684732
tt5154762
tt3139774
tt0819708
tt0819708  could not be retrieved from api
tt0888280
tt0888280  could not be retrieved from api
tt6021260
tt6021260  could not be retrieved from api
tt0185065
tt0185065  could not be retrieved from api
tt4123482
tt1491299
tt1492090
tt6059298
tt6059298  could not be retrieved from api
tt1826951
tt0273025
tt0273025  could not be retrieved from api
tt1888795
tt1888795  could not be retrieved from api
tt1821879
tt1821879  could not be retrieved from api
tt2497788
tt0476038
tt0476038  could not be retrieved from api
tt1830924
tt1830924  could not be retrieved from api
tt1368470
tt1368470  could not be retrieved from api
tt1361721
tt1361721  could not be retrieved from api
tt2647792
tt2647792  could not be retrieved from api
tt3148194
tt0302163
tt0302163  could not be retrieved from api
tt5515342
tt0292859
tt0292859  could not be retrieved from api
tt0243082
tt0243082  could not be retrieved from api
tt4654650
tt4654650  could not be retrieved from api
tt0298682
tt0298682  could not be retrieved from api
tt1534856
tt1534856  could not be retrieved from api
tt3097134
tt3097134  could not be retrieved from api
tt2582840
tt2582840  could not be retrieved from api
tt4605154
tt1478217
tt1478217  could not be retrieved from api
tt0374366
tt1631948
tt0368494
tt1721347
tt5319670
tt1684855
tt5209280
tt6217260
tt6842890
tt5040090
tt3501210
tt0367323
tt0397012
tt0954837
tt1784056
tt3228548
tt0861753
tt0933898
tt0433705
tt0287845
tt0329816
tt0329816  could not be retrieved from api
tt2815342
tt3548386
tt3548386  could not be retrieved from api
tt0410958
tt0410958  could not be retrieved from api
tt0057740
tt0057740  could not be retrieved from api
tt5583124
tt5583124  could not be retrieved from api
tt1440045
tt1440045  could not be retrieved from api
tt0810737
tt0810737  could not be retrieved from api
tt0989753
tt0989753  could not be retrieved from api
tt1313075
tt1313075  could not be retrieved from api
tt1073528
tt1073528  could not be retrieved from api
tt0310516
tt0310516  could not be retrieved from api
tt1642103
tt1642103  could not be retrieved from api
tt0448973
tt0448973  could not be retrieved from api
tt0302098
tt0302098  could not be retrieved from api
tt0805368
tt0805368  could not be retrieved from api
tt1124662
tt1124662  could not be retrieved from api
tt0324891
tt0324891  could not be retrieved from api
tt0423631
tt0423631  could not be retrieved from api
tt2226096
tt2226096  could not be retrieved from api
tt0773264
tt1798695
tt1307083
tt4845734
tt0046641
tt0046641  could not be retrieved from api
tt1519575
tt1519575  could not be retrieved from api
tt0853078
tt0853078  could not be retrieved from api
tt0118423
tt0118423  could not be retrieved from api
tt0284767
tt4052124
tt4052124  could not be retrieved from api
tt0878801
tt3703500

# Oops, we've hit the API too hard. A second attempt to pull low-rated show information
# will be needed, with a time delay to stay within API limitations.

# This shape is misleading, as many of the rows simply contain a message that the API limit 
# had been exceeded

losers.shape

(229, 22)

# This shape is accurate: 235 shows from the top show list were obtained
shows.shape

(235, 20)

DO_NOT_RUN = True  # Be sure to check the file name to write before enabling execution on this block

if not DO_NOT_RUN:
    with open("save_losers_df.p", "wb") as f:
        pickle.dump(losers, f)

# read data back in from the saved file
with open("save_losers_df.p", "rb") as f:
    losers2 = pickle.load(f)
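
The DO_NOT_RUN flag above guards against overwriting a saved file by accident. A minimal sketch of the same idea follows, assuming two hypothetical helpers, `save_once` and `load_saved`, that are not part of the notebook. `save_once` refuses to overwrite an existing pickle unless asked. The sketch targets Python 3, unlike the Python 2 cells in this notebook, and the rows are made-up stand-ins for `losers`.

```python
import os
import pickle
import tempfile


def save_once(obj, path, overwrite=False):
    """Pickle obj to path; return False (and write nothing) if path exists."""
    if os.path.exists(path) and not overwrite:
        return False
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return True


def load_saved(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip in a throwaway directory; the row below is made up for illustration.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "save_losers_df.p")
first_write = save_once([{"imdb": "tt0000001"}], demo_path)
second_write = save_once([], demo_path)   # refused: the file already exists
restored = load_saved(demo_path)
```

Compared with a manual flag, the existence check cannot be left in the wrong state after a restart, at the cost of needing `overwrite=True` when a rewrite is intended.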

This is the start of a second attempt to pull more TV shows with low ratings.

A second attempt is needed because, after the first pull and cleanup, only 10 shows with complete information were left in the low-rating category. The cells below collect data from the API for additional low-rated shows.

losers.loc[0:9]['externals']

0    {u'thetvdb': 283995, u'tvrage': 40425, u'imdb'...
1    {u'thetvdb': 299234, u'tvrage': 50418, u'imdb'...
2    {u'thetvdb': 118021, u'tvrage': None, u'imdb':...
3    {u'thetvdb': 274705, u'tvrage': 31580, u'imdb'...
4    {u'thetvdb': 246161, u'tvrage': None, u'imdb':...
5    {u'thetvdb': 75638, u'tvrage': None, u'imdb': ...
6    {u'thetvdb': 260183, u'tvrage': 31024, u'imdb'...
7    {u'thetvdb': None, u'tvrage': None, u'imdb': u...
8    {u'thetvdb': 299688, u'tvrage': None, u'imdb':...
9    {u'thetvdb': 222481, u'tvrage': None, u'imdb':...
Name: externals, dtype: object

# In the first attempt, there were a number of shows where data was not returned because of too many API calls
# in quick succession. In order to re-submit those show ids, it is necessary to get a list of ids that were
# returned successfully, and then to remove them from the original list of ids before resubmitting.
# losers_pulled is a list of ids that were successful on the previous attempt.

losers_pulled = []
no_imdb_at_idx = []
for i in range(len(losers)):
    try:
        losers_pulled.append(losers.loc[i,'externals']['imdb'])
    except:
        no_imdb_at_idx.append(i)
print no_imdb_at_idx
print
print losers_pulled
print len(losers_pulled)

[11, 35, 36, 37, 38, 39, 40, 41, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 228]

[u'tt2058498', u'tt4566242', u'tt1364951', u'tt2341819', u'tt1893088', u'tt0445890', u'tt2279571', u'tt1733734', u'tt1372153', u'tt1817054', u'tt2879822', u'tt0190177', u'tt5047510', u'tt5338860', u'tt1848281', u'tt6839504', u'tt2114184', u'tt1674023', u'tt1971860', u'tt3100634', u'tt1936732', u'tt1197582', u'tt1485389', u'tt2288050', u'tt1612920', u'tt0402701', u'tt0491735', u'tt2369946', u'tt0990403', u'tt6148376', u'tt1429551', u'tt1586637', u'tt4873032', u'tt3293566', u'tt2235190', u'tt1086761', u'tt4214468', u'tt5937940', u'tt0305056', u'tt1833558', u'tt0123366', u'tt3481544', u'tt5335110', u'tt1332074', u'tt1259798', u'tt2953706', u'tt1244881', u'tt1232190', u'tt3859844', u'tt2262354', u'tt0283714', u'tt0313038', u'tt2453016', u'tt1277979', u'tt1024814', u'tt1770959', u'tt3731648', u'tt4410468', u'tt0348894', u'tt1737565', u'tt1454730', u'tt0468999', u'tt1495163', u'tt2514488', u'tt2390003', u'tt4671004', u'tt2689384', u'tt4662374', u'tt1299440', u'tt2083701', u'tt1933836', u'tt4033696', u'tt2170392', u'tt4390084', u'tt3174316', u'tt2366111', u'tt1495648', u'tt1159610', u'tt4123482', u'tt1491299', u'tt1492090', u'tt1826951', u'tt2497788', u'tt3148194', u'tt5515342', u'tt4605154', u'tt2815342', u'tt0773264', u'tt1798695', u'tt1307083', u'tt4845734', u'tt0284767', u'tt0878801']
93
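
The loop above separates rows with a usable IMDb id from rows without one. A minimal sketch of that split is shown here, assuming a hypothetical helper `split_by_imdb` (not in the notebook) that accepts any iterable of `externals` values, for example `losers['externals']`. Unlike the loop above, it also treats a `None` id as a failure. It is written for Python 3 and the sample rows are made up.

```python
def split_by_imdb(externals):
    """Return (pulled_ids, failed_positions) for an iterable of 'externals' values."""
    pulled, failed = [], []
    for position, value in enumerate(externals):
        # A successful lookup stores a dict; a rate-limited row holds something else.
        imdb_id = value.get("imdb") if isinstance(value, dict) else None
        if imdb_id:
            pulled.append(imdb_id)
        else:
            failed.append(position)
    return pulled, failed


# Made-up rows: two good lookups, one placeholder (NaN), one dict without an id.
sample = [
    {"thetvdb": 1, "tvrage": 2, "imdb": "tt0000001"},
    float("nan"),
    {"thetvdb": 3, "tvrage": None, "imdb": "tt0000002"},
    {"thetvdb": None, "tvrage": None, "imdb": None},
]
pulled_ids, failed_positions = split_by_imdb(sample)
```

Checking the type explicitly avoids a bare `except`, which would also hide unrelated errors such as a misspelled column name.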

# There were rows that do not even include their own IMDb number, an indicator that the pull was unsuccessful.
# While a few of these might have been successful but have only limited data, most are unusable.
# Thus all will be re-requested at a slower rate and any duplicates removed when the data is merged.

print len(no_imdb_at_idx)

# This generates a list of the original requests that were not successfully returned from the API.
# These will be requested again, using a time delay to avoid requesting more than the server
# will willingly return. They will also be batched in groups of 100 ids.

missing_losers = [x for x in loser_list if x not in losers_pulled]
missing_losers

['tt0465347',
 'tt4427122',
 'tt1015682',
 'tt2505738',
 'tt2402465',
 'tt0278236',
 'tt0268066',
 'tt4813760',
 'tt1526001',
 'tt1243976',
 'tt3897284',
 'tt3665690',
 'tt4132180',
 'tt0824229',
 'tt0314990',
 'tt5423750',
 'tt5423664',
 'tt2175125',
 'tt0404593',
 'tt4160422',
 'tt4552562',
 'tt5804854',
 'tt0886666',
 'tt5423824',
 'tt3500210',
 'tt0285357',
 'tt0280234',
 'tt1863530',
 'tt0280349',
 'tt2660922',
 'tt0292776',
 'tt0264230',
 'tt1102523',
 'tt3333790',
 'tt0320863',
 'tt0830848',
 'tt0939270',
 'tt1459294',
 'tt6026132',
 'tt1443593',
 'tt0354267',
 'tt0147749',
 'tt0161180',
 'tt4733812',
 'tt0367362',
 'tt5626868',
 'tt7268752',
 'tt0464767',
 'tt3550770',
 'tt6422012',
 'tt3154248',
 'tt5016274',
 'tt1715229',
 'tt0489426',
 'tt5798754',
 'tt2022182',
 'tt0303564',
 'tt3462252',
 'tt0329849',
 'tt5074180',
 'tt3900878',
 'tt3887402',
 'tt0149408',
 'tt1360544',
 'tt1718355',
 'tt2364950',
 'tt0285374',
 'tt5267590',
 'tt0314993',
 'tt0300870',
 'tt7036530',
 'tt5657014',
 'tt0149488',
 'tt1204865',
 'tt1182860',
 'tt0423626',
 'tt4223864',
 'tt1773440',
 'tt0872067',
 'tt0428172',
 'tt0817379',
 'tt1210720',
 'tt3855028',
 'tt1611594',
 'tt5822004',
 'tt6524930',
 'tt1902032',
 'tt0466201',
 'tt1757293',
 'tt1807575',
 'tt0332896',
 'tt3140278',
 'tt1176297',
 'tt0285406',
 'tt6680212',
 'tt0200336',
 'tt0385483',
 'tt3534894',
 'tt1108281',
 'tt3855016',
 'tt0787948',
 'tt1292967',
 'tt1466565',
 'tt0435565',
 'tt1229266',
 'tt0364837',
 'tt0477409',
 'tt0875097',
 'tt1227542',
 'tt1131289',
 'tt0355135',
 'tt1418598',
 'tt0290970',
 'tt0184124',
 'tt0490736',
 'tt0439354',
 'tt1157935',
 'tt1425641',
 'tt2830404',
 'tt0835397',
 'tt0880581',
 'tt1078463',
 'tt1234506',
 'tt0323463',
 'tt5168468',
 'tt0296322',
 'tt3911254',
 'tt3827516',
 'tt0364899',
 'tt4204032',
 'tt0259768',
 'tt0287880',
 'tt0270763',
 'tt0846349',
 'tt2699648',
 'tt3616368',
 'tt2672920',
 'tt0813074',
 'tt1694422',
 'tt0472241',
 'tt0202186',
 'tt1297366',
 'tt3919918',
 'tt1564985',
 'tt3336800',
 'tt2254454',
 'tt0824737',
 'tt1288431',
 'tt1705811',
 'tt0968726',
 'tt2058840',
 'tt3857708',
 'tt0315030',
 'tt2337185',
 'tt0775356',
 'tt0244356',
 'tt2338400',
 'tt0220047',
 'tt0341789',
 'tt0197151',
 'tt0222529',
 'tt6086050',
 'tt1625263',
 'tt2289244',
 'tt0278229',
 'tt0429438',
 'tt1410490',
 'tt5588910',
 'tt3670858',
 'tt0397182',
 'tt1911975',
 'tt0420366',
 'tt3079034',
 'tt0859270',
 'tt0050070',
 'tt0300798',
 'tt5915502',
 'tt6697244',
 'tt1776388',
 'tt0424639',
 'tt1119204',
 'tt1744868',
 'tt1588824',
 'tt3696798',
 'tt0301123',
 'tt1018436',
 'tt0815776',
 'tt0407462',
 'tt0198147',
 'tt0997412',
 'tt5047494',
 'tt5368216',
 'tt3356610',
 'tt1454750',
 'tt5891726',
 'tt4286824',
 'tt0476926',
 'tt5167034',
 'tt0056759',
 'tt3622818',
 'tt0887788',
 'tt4588620',
 'tt0258341',
 'tt0489430',
 'tt2567210',
 'tt4674178',
 'tt0125638',
 'tt5146640',
 'tt0196284',
 'tt3075154',
 'tt0436003',
 'tt1538090',
 'tt1728226',
 'tt3796070',
 'tt1381395',
 'tt0190199',
 'tt0855213',
 'tt0358890',
 'tt3484986',
 'tt2208507',
 'tt4896052',
 'tt0217211',
 'tt0430836',
 'tt1291098',
 'tt0399968',
 'tt2909920',
 'tt3164276',
 'tt0926012',
 'tt1305560',
 'tt1291488',
 'tt0428088',
 'tt1057469',
 'tt3807326',
 'tt0410964',
 'tt1579186',
 'tt0271931',
 'tt6519752',
 'tt1417358',
 'tt4568130',
 'tt1705611',
 'tt0244328',
 'tt0459155',
 'tt1890984',
 'tt0460381',
 'tt0439069',
 'tt0329817',
 'tt1805082',
 'tt0468985',
 'tt1071166',
 'tt1634699',
 'tt0170930',
 'tt1024887',
 'tt7062438',
 'tt4411548',
 'tt0105970',
 'tt0348949',
 'tt2309197',
 'tt0327271',
 'tt1729597',
 'tt0428108',
 'tt3144026',
 'tt0292770',
 'tt0077041',
 'tt1489024',
 'tt0458269',
 'tt1020924',
 'tt0444578',
 'tt0787980',
 'tt0249275',
 'tt1280868',
 'tt0462121',
 'tt3136086',
 'tt1908157',
 'tt0055714',
 'tt0781991',
 'tt0224517',
 'tt0426804',
 'tt0484508',
 'tt0186742',
 'tt0460081',
 'tt0320809',
 'tt0798631',
 'tt3119834',
 'tt3804586',
 'tt0479614',
 'tt0780447',
 'tt3975956',
 'tt0471990',
 'tt6846846',
 'tt0381741',
 'tt6208480',
 'tt0829040',
 'tt1761662',
 'tt0103411',
 'tt0356281',
 'tt4628798',
 'tt1147702',
 'tt0780444',
 'tt1981147',
 'tt0756524',
 'tt0312095',
 'tt0260645',
 'tt1728958',
 'tt4688354',
 'tt1296242',
 'tt1062211',
 'tt1500453',
 'tt0358320',
 'tt1118205',
 'tt0480781',
 'tt0303490',
 'tt0278256',
 'tt0812148',
 'tt0892683',
 'tt1562042',
 'tt0218767',
 'tt2265901',
 'tt1456074',
 'tt1978967',
 'tt5437800',
 'tt5209238',
 'tt7165310',
 'tt0362379',
 'tt0348512',
 'tt0065343',
 'tt3976016',
 'tt1459376',
 'tt4629950',
 'tt0443361',
 'tt1320317',
 'tt6212410',
 'tt5872774',
 'tt0196232',
 'tt3693866',
 'tt6295148',
 'tt0804424',
 'tt0458252',
 'tt2933730',
 'tt5690306',
 'tt3038492',
 'tt0854912',
 'tt0426740',
 'tt0364787',
 'tt1033281',
 'tt0473416',
 'tt5423592',
 'tt2064427',
 'tt1208634',
 'tt0402660',
 'tt1566044',
 'tt0292845',
 'tt2633208',
 'tt1685317',
 'tt0421158',
 'tt1176154',
 'tt3099832',
 'tt0396337',
 'tt0337790',
 'tt0287847',
 'tt0421343',
 'tt0408364',
 'tt0346300',
 'tt2908564',
 'tt6959064',
 'tt0293725',
 'tt0092362',
 'tt0818895',
 'tt1509653',
 'tt1809909',
 'tt1796975',
 'tt6501522',
 'tt0424611',
 'tt0439932',
 'tt0471048',
 'tt1156526',
 'tt0264226',
 'tt1170222',
 'tt0295081',
 'tt4369244',
 'tt2781594',
 'tt1105316',
 'tt3840030',
 'tt2579722',
 'tt0072546',
 'tt4628790',
 'tt0046590',
 'tt2184509',
 'tt0497854',
 'tt0363323',
 'tt1458207',
 'tt0439356',
 'tt0377146',
 'tt0954318',
 'tt2214505',
 'tt2435530',
 'tt0473419',
 'tt0768151',
 'tt0439365',
 'tt0278177',
 'tt6473824',
 'tt0187632',
 'tt0391666',
 'tt0465344',
 'tt2189892',
 'tt6586510',
 'tt2374870',
 'tt2111994',
 'tt4588734',
 'tt0863047',
 'tt1579108',
 'tt0984168',
 'tt6752226',
 'tt0856723',
 'tt0416347',
 'tt5571740',
 'tt1552185',
 'tt3595870',
 'tt1728864',
 'tt1062185',
 'tt0380949',
 'tt1013861',
 'tt0848174',
 'tt0321000',
 'tt1855738',
 'tt0363335',
 'tt0420381',
 'tt1814550',
 'tt1987353',
 'tt0187654',
 'tt1461569',
 'tt1850160',
 'tt0954661',
 'tt0198095',
 'tt4012388',
 'tt0482028',
 'tt0176381',
 'tt0419307',
 'tt1684732',
 'tt5154762',
 'tt3139774',
 'tt0819708',
 'tt0888280',
 'tt6021260',
 'tt0185065',
 'tt6059298',
 'tt0273025',
 'tt1888795',
 'tt1821879',
 'tt0476038',
 'tt1830924',
 'tt1368470',
 'tt1361721',
 'tt2647792',
 'tt0302163',
 'tt0292859',
 'tt0243082',
 'tt4654650',
 'tt0298682',
 'tt1534856',
 'tt3097134',
 'tt2582840',
 'tt1478217',
 'tt0374366',
 'tt1631948',
 'tt0368494',
 'tt1721347',
 'tt5319670',
 'tt1684855',
 'tt5209280',
 'tt6217260',
 'tt6842890',
 'tt5040090',
 'tt3501210',
 'tt0367323',
 'tt0397012',
 'tt0954837',
 'tt1784056',
 'tt3228548',
 'tt0861753',
 'tt0933898',
 'tt0433705',
 'tt0287845',
 'tt0329816',
 'tt3548386',
 'tt0410958',
 'tt0057740',
 'tt5583124',
 'tt1440045',
 'tt0810737',
 'tt0989753',
 'tt1313075',
 'tt1073528',
 'tt0310516',
 'tt1642103',
 'tt0448973',
 'tt0302098',
 'tt0805368',
 'tt1124662',
 'tt0324891',
 'tt0423631',
 'tt2226096',
 'tt0046641',
 'tt1519575',
 'tt0853078',
 'tt0118423',
 'tt4052124',
 'tt3703500']
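
The list above comes from filtering the requested ids against the ids already pulled. A small sketch of the same filter, assuming a hypothetical helper `ids_still_missing` (not in the notebook) and made-up ids; it converts the pulled ids to a set first so each membership test does not rescan a list. Written for Python 3.

```python
def ids_still_missing(requested, pulled):
    """Return the requested ids that are absent from pulled, in original order."""
    pulled_set = set(pulled)   # set membership is O(1); a list scan is O(n)
    return [show_id for show_id in requested if show_id not in pulled_set]


# Made-up ids for illustration only.
requested_ids = ["tt0000001", "tt0000002", "tt0000003", "tt0000004"]
already_pulled = ["tt0000003", "tt0000001"]
missing = ids_still_missing(requested_ids, already_pulled)
```

At 600 ids the speed difference is negligible, so this is mainly a habit worth having for larger scrapes.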

# This processes the original list of 600 ids, minus the ones that were successfully pulled,
# into groups of 100, plus a last group holding the remaining 7
subset_loser_list = []
print len(missing_losers)
n_full = len(missing_losers) // 100   # number of complete groups of 100
for i in range(n_full):
    temp_list = []
    for j in range(100):
        temp_list.append(missing_losers[i*100 + j])
    subset_loser_list.append(temp_list)

# get last 7: whatever is left after the complete groups, appended as the final group
temp_list = []
for j in range(n_full * 100, len(missing_losers)):
    temp_list.append(missing_losers[j])
if temp_list:
    subset_loser_list.append(temp_list)
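
The batching step can also be sketched with list slicing, which handles the short final group without a separate loop. The helper name `chunk` and the placeholder ids are assumptions for illustration, not part of the notebook. Written for Python 3.

```python
def chunk(items, size=100):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[start:start + size] for start in range(0, len(items), size)]


# 507 placeholder ids: five full groups of 100 plus a last group of 7.
placeholder_ids = ["id%03d" % n for n in range(507)]
batches = chunk(placeholder_ids)
batch_sizes = [len(batch) for batch in batches]
```

Because the slice end is clipped at the end of the list, the same call works for any list length, including one that is an exact multiple of 100.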

# After reprocessing the first list of ids a second time, there are still not enough samples of low-rated shows.
# A third list of 600 low-rated shows was scraped from IMDb, and this list is broken into subsets of 100 here.

subset_loser_list2 = []
print len(loser_list)
for i in range(len(loser_list)/100):
    temp_list = []
    for j in range(100):
        temp_list.append(loser_list[i*100 + j])
    subset_loser_list2.append(temp_list)

subset_loser_list2[0]

['tt0773264',
 'tt1798695',
 'tt1307083',
 'tt4845734',
 'tt0046641',
 'tt1519575',
 'tt0853078',
 'tt0118423',
 'tt0284767',
 'tt4052124',
 'tt0878801',
 'tt3703500',
 'tt1105170',
 'tt4363582',
 'tt3155428',
 'tt0362350',
 'tt0287196',
 'tt2766052',
 'tt0405545',
 'tt0262975',
 'tt0367278',
 'tt7134262',
 'tt1695352',
 'tt0421470',
 'tt2466890',
 'tt0343305',
 'tt1002739',
 'tt1615697',
 'tt0274262',
 'tt0465320',
 'tt1388381',
 'tt0358889',
 'tt1085789',
 'tt1011591',
 'tt0364804',
 'tt1489335',
 'tt3612584',
 'tt0363377',
 'tt0111930',
 'tt0401913',
 'tt0808086',
 'tt0309212',
 'tt5464192',
 'tt0080250',
 'tt4533338',
 'tt4741696',
 'tt1922810',
 'tt1793868',
 'tt4789316',
 'tt0185054',
 'tt1079622',
 'tt1786048',
 'tt0790508',
 'tt1716372',
 'tt0295098',
 'tt3409706',
 'tt0222574',
 'tt2171325',
 'tt0442643',
 'tt2142117',
 'tt0371433',
 'tt0138244',
 'tt1002010',
 'tt0495557',
 'tt1811817',
 'tt5529996',
 'tt1352053',
 'tt0439346',
 'tt0940147',
 'tt3075138',
 'tt1974439',
 'tt2693842',
 'tt0092325',
 'tt6772826',
 'tt1563069',
 'tt0489598',
 'tt0142055',
 'tt1566154',
 'tt0338592',
 'tt0167515',
 'tt2330327',
 'tt1576464',
 'tt2389845',
 'tt0186747',
 'tt0355096',
 'tt1821877',
 'tt0112033',
 'tt1792654',
 'tt0472243',
 'tt6453018',
 'tt3648886',
 'tt1599374',
 'tt2946482',
 'tt4672020',
 'tt1016283',
 'tt2649480',
 'tt1229945',
 'tt2390606',
 'tt1876612',
 'tt0140732']

# This block calls the API. It is run repeatedly, once per sublist of 100 show ids, sleeping 10
# seconds between requests. There is a do-not-run flag that will prevent running this block if the
# notebook is restarted. The first time it was executed, a new dataframe called "more_losers" was initialized;
# that initialization was then commented out for subsequent executions so the data returned by each subsequent
# request is appended to the bottom of the dataframe.

# After collection is complete, set flag to prevent running this block unnecessarily if notebook is restarted

import time
DO_NOT_RUN = True

if not DO_NOT_RUN:
#     responses = []
#     more_losers = pd.DataFrame()
    for loser_id in subset_loser_list2[0]:   # change the index and re-run to access each set of 100 ids
        time.sleep(10)
        status_code = None   # stays None if the request itself raises before a response arrives
        try:
            # Get the tv show info from the api
            url = "http://api.tvmaze.com/lookup/shows?imdb=" + loser_id
            r = requests.get(url)
            status_code = r.status_code

            # convert the returned data to a dictionary
            json_data = r.json()

            # load a temp dataframe with the dictionary, then append to the composite dataframe
            temp_df = pd.DataFrame.from_dict(json_data, orient='index', dtype=None)
            ttemp_df = temp_df.T     # Was not able to load json in column orientation, so must transpose
            more_losers = more_losers.append(ttemp_df, ignore_index=True)
            stat = ''
        except Exception:
            stat = 'failed'

        print loser_id, stat, status_code
        res = [loser_id, stat, status_code]
        responses.append(res)

    more_losers.head()   # inspect the dataframe being built (was losers.head())

tt0773264  200
tt1798695  200
tt1307083  200
tt4845734  200
tt0046641 failed 404
tt1519575 failed 404
tt0853078 failed 404
tt0118423 failed 404
tt0284767  200
tt4052124 failed 404
tt0878801  200
tt3703500  200
tt1105170 failed 404
tt4363582 failed 404
tt3155428  200
tt0362350 failed 404
tt0287196  200
tt2766052  200
tt0405545 failed 404
tt0262975  200
tt0367278 failed 404
tt7134262 failed 404
tt1695352 failed 404
tt0421470 failed 404
tt2466890 failed 404
tt0343305 failed 404
tt1002739 failed 404
tt1615697 failed 404
tt0274262 failed 404
tt0465320 failed 404
tt1388381  200
tt0358889  200
tt1085789 failed 404
tt1011591  200
tt0364804 failed 404
tt1489335 failed 404
tt3612584  200
tt0363377 failed 404
tt0111930 failed 404
tt0401913 failed 404
tt0808086 failed 404
tt0309212 failed 404
tt5464192  200
tt0080250 failed 404
tt4533338 failed 404
tt4741696  200
tt1922810 failed 404
tt1793868 failed 404
tt4789316 failed 404
tt0185054 failed 404
tt1079622 failed 404
tt1786048 failed 404
tt0790508 failed 404
tt1716372 failed 404
tt0295098 failed 404
tt3409706 failed 404
tt0222574 failed 404
tt2171325 failed 404
tt0442643 failed 404
tt2142117 failed 404
tt0371433 failed 404
tt0138244 failed 404
tt1002010 failed 404
tt0495557 failed 404
tt1811817 failed 404
tt5529996 failed 404
tt1352053 failed 404
tt0439346 failed 404
tt0940147 failed 404
tt3075138 failed 404
tt1974439  200
tt2693842 failed 404
tt0092325  200
tt6772826  200
tt1563069  200
tt0489598  200
tt0142055 failed 404
tt1566154  200
tt0338592  200
tt0167515  200
tt2330327  200
tt1576464 failed 404
tt2389845 failed 404
tt0186747  200
tt0355096 failed 404
tt1821877  200
tt0112033 failed 404
tt1792654 failed 404
tt0472243 failed 404
tt6453018 failed 404
tt3648886 failed 404
tt1599374  200
tt2946482  200
tt4672020 failed 404
tt1016283 failed 404
tt2649480  200
tt1229945  200
tt2390606 failed 404
tt1876612  200
tt0140732 failed 404
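
A fixed 10-second sleep is the simplest way to stay under a rate limit. A possible refinement is to retry only the requests that were throttled, waiting longer each time. The sketch below assumes the server signals throttling with HTTP 429; that is an assumption here, since the log above only records 200 and 404. The names `fetch_all` and `fake_fetch` are hypothetical, and the fetcher is passed in as a function so the logic can be exercised without network access. Written for Python 3.

```python
import time


def fetch_all(show_ids, fetch, delay=10.0, max_retries=3, sleep=time.sleep):
    """Call fetch(show_id) for each id, pausing between calls.

    fetch must return (status_code, payload). A 429 status is retried with a
    doubling wait; any other status is recorded as-is.
    """
    results = []
    for show_id in show_ids:
        wait = delay
        for attempt in range(max_retries + 1):
            status, payload = fetch(show_id)
            if status != 429 or attempt == max_retries:
                break
            sleep(wait)        # back off before retrying the same id
            wait *= 2
        results.append((show_id, status, payload))
        sleep(delay)           # steady pause between different ids
    return results


# Fake fetcher for illustration: the first call is throttled, later calls are 200 or 404.
calls = []


def fake_fetch(show_id):
    calls.append(show_id)
    if len(calls) == 1:
        return 429, None
    return (200, {"imdb": show_id}) if show_id == "tt0000001" else (404, None)


pauses = []
outcome = fetch_all(["tt0000001", "tt0000002"], fake_fetch, delay=1.0, sleep=pauses.append)
```

In the notebook's setting, `fetch` could wrap `requests.get(url)` and return `r.status_code` together with `r.json()` when the status is 200. A 404 is not retried, since repeating the request would not change a missing record.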

len(responses)
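
Beyond its length, the `responses` list can be summarized by status code to see how many ids were actually found. A minimal sketch, assuming a hypothetical helper `tally_statuses` (not in the notebook) and a few rows taken from the log above. Written for Python 3.

```python
from collections import Counter


def tally_statuses(responses):
    """Count rows of [show_id, stat, status_code] by status code."""
    return Counter(status_code for _show_id, _stat, status_code in responses)


# A few rows in the same [id, stat, status_code] shape that the API block records.
sample_responses = [
    ["tt0773264", "", 200],
    ["tt1798695", "", 200],
    ["tt0046641", "failed", 404],
    ["tt1519575", "failed", 404],
    ["tt0853078", "failed", 404],
]
counts = tally_statuses(sample_responses)
hit_rate = counts[200] / float(sum(counts.values()))
```

A low hit rate made up of 404 responses points to ids missing from the API's catalogue rather than to rate limiting, so slowing the requests further would not help for those ids.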

for i in range(len(more_losers)):
    print more_losers.loc[i, 'externals']

{u'thetvdb': 279947, u'tvrage': 37045, u'imdb': u'tt3595870'}
{u'thetvdb': None, u'tvrage': 13173, u'imdb': u'tt0848174'}
{u'thetvdb': 72157, u'tvrage': None, u'imdb': u'tt0374366'}
{u'thetvdb': 218241, u'tvrage': None, u'imdb': u'tt1684855'}
{u'thetvdb': 327908, u'tvrage': None, u'imdb': u'tt6842890'}
{u'thetvdb': 279810, u'tvrage': None, u'imdb': u'tt3501210'}
{u'thetvdb': 283658, u'tvrage': None, u'imdb': u'tt0367323'}
{u'thetvdb': 271341, u'tvrage': 33650, u'imdb': u'tt2633208'}
{u'thetvdb': 260677, u'tvrage': None, u'imdb': u'tt2579722'}
{u'thetvdb': 77616, u'tvrage': None, u'imdb': u'tt0072546'}
{u'thetvdb': 74419, u'tvrage': None, u'imdb': u'tt0458269'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt0249275'}
{u'thetvdb': 282527, u'tvrage': 42189, u'imdb': u'tt2815184'}
{u'thetvdb': 246631, u'tvrage': None, u'imdb': u'tt1753229'}
{u'thetvdb': 82500, u'tvrage': None, u'imdb': u'tt1240534'}
{u'thetvdb': 206381, u'tvrage': 26873, u'imdb': u'tt1999642'}
{u'thetvdb': 284259, u'tvrage': None, u'imdb': u'tt3784176'}
{u'thetvdb': 250186, u'tvrage': None, u'imdb': u'tt1958848'}
{u'thetvdb': 320679, u'tvrage': None, u'imdb': u'tt5684430'}
{u'thetvdb': 74181, u'tvrage': 6494, u'imdb': u'tt0134269'}
{u'thetvdb': 84159, u'tvrage': 19672, u'imdb': u'tt1252370'}
{u'thetvdb': 300105, u'tvrage': 48178, u'imdb': u'tt3824018'}
{u'thetvdb': 264850, u'tvrage': None, u'imdb': u'tt2555880'}
{u'thetvdb': 277020, u'tvrage': 35629, u'imdb': u'tt3310544'}
{u'thetvdb': 254524, u'tvrage': 31887, u'imdb': u'tt2125758'}
{u'thetvdb': 271916, u'tvrage': None, u'imdb': u'tt1973047'}
{u'thetvdb': 82005, u'tvrage': None, u'imdb': u'tt0934701'}
{u'thetvdb': 250472, u'tvrage': None, u'imdb': u'tt2059031'}
{u'thetvdb': 81491, u'tvrage': None, u'imdb': u'tt1056536'}
{u'thetvdb': 137691, u'tvrage': None, u'imdb': u'tt1618950'}
{u'thetvdb': 74395, u'tvrage': 3883, u'imdb': u'tt0115206'}
{u'thetvdb': 298860, u'tvrage': 50010, u'imdb': u'tt4575056'}
{u'thetvdb': 269115, u'tvrage': 33511, u'imdb': u'tt2889104'}
{u'thetvdb': 285008, u'tvrage': None, u'imdb': u'tt2644204'}
{u'thetvdb': 82237, u'tvrage': None, u'imdb': u'tt1210781'}
{u'thetvdb': 314998, u'tvrage': None, u'imdb': u'tt0048898'}
{u'thetvdb': 276337, u'tvrage': None, u'imdb': u'tt3398108'}
{u'thetvdb': 221621, u'tvrage': None, u'imdb': u'tt1252620'}
{u'thetvdb': 269059, u'tvrage': 35857, u'imdb': u'tt2901828'}
{u'thetvdb': 273303, u'tvrage': 35560, u'imdb': u'tt3006666'}
{u'thetvdb': 260473, u'tvrage': 30918, u'imdb': u'tt2197994'}
{u'thetvdb': 83313, u'tvrage': None, u'imdb': u'tt1263594'}
{u'thetvdb': 80117, u'tvrage': 7218, u'imdb': u'tt0497079'}
{u'thetvdb': 174991, u'tvrage': 25843, u'imdb': u'tt1755893'}
{u'thetvdb': 71424, u'tvrage': None, u'imdb': u'tt0329824'}
{u'thetvdb': 258632, u'tvrage': 31545, u'imdb': u'tt2245937'}
{u'thetvdb': 259235, u'tvrage': None, u'imdb': u'tt2147632'}
{u'thetvdb': 297209, u'tvrage': 38100, u'imdb': u'tt3218114'}
{u'thetvdb': 185651, u'tvrage': None, u'imdb': u'tt1583417'}
{u'thetvdb': 250370, u'tvrage': 28934, u'imdb': u'tt1963853'}
{u'thetvdb': 129051, u'tvrage': None, u'imdb': u'tt1520150'}
{u'thetvdb': 76370, u'tvrage': None, u'imdb': u'tt0236907'}
{u'thetvdb': 316174, u'tvrage': None, u'imdb': u'tt5865052'}
{u'thetvdb': 82304, u'tvrage': 19011, u'imdb': u'tt1231448'}
{u'thetvdb': 289640, u'tvrage': 46963, u'imdb': u'tt4287478'}
{u'thetvdb': 249750, u'tvrage': None, u'imdb': u'tt1874006'}
{u'thetvdb': 250959, u'tvrage': 28442, u'imdb': u'tt2006560'}
{u'thetvdb': 281375, u'tvrage': 38313, u'imdb': u'tt3565412'}
{u'thetvdb': 274414, u'tvrage': None, u'imdb': u'tt3396736'}
{u'thetvdb': 271820, u'tvrage': None, u'imdb': u'tt0855313'}
{u'thetvdb': 250955, u'tvrage': None, u'imdb': u'tt2309561'}
{u'thetvdb': 273130, u'tvrage': 36774, u'imdb': u'tt3136814'}
{u'thetvdb': 84669, u'tvrage': 18525, u'imdb': u'tt1191056'}
{u'thetvdb': 74697, u'tvrage': 3348, u'imdb': u'tt0235917'}
{u'thetvdb': 76708, u'tvrage': None, u'imdb': u'tt0111892'}
{u'thetvdb': 266934, u'tvrage': None, u'imdb': u'tt2643770'}
{u'thetvdb': 79896, u'tvrage': None, u'imdb': u'tt0423657'}
{u'thetvdb': 303252, u'tvrage': None, u'imdb': u'tt5327970'}
{u'thetvdb': 256806, u'tvrage': None, u'imdb': u'tt2190731'}
{u'thetvdb': 78409, u'tvrage': None, u'imdb': u'tt0101041'}
{u'thetvdb': 274820, u'tvrage': None, u'imdb': u'tt3317020'}
{u'thetvdb': 296474, u'tvrage': 45813, u'imdb': u'tt4732076'}
{u'thetvdb': 285651, u'tvrage': 41593, u'imdb': u'tt3828162'}
{u'thetvdb': 315767, u'tvrage': None, u'imdb': u'tt5819414'}
{u'thetvdb': 287534, u'tvrage': 42884, u'imdb': u'tt4180738'}
{u'thetvdb': 76621, u'tvrage': None, u'imdb': u'tt0300802'}
{u'thetvdb': 280683, u'tvrage': 34278, u'imdb': u'tt2649738'}
{u'thetvdb': 280256, u'tvrage': 41644, u'imdb': u'tt3181412'}
{u'thetvdb': 79496, u'tvrage': 2677, u'imdb': u'tt0382400'}
{u'thetvdb': 271514, u'tvrage': None, u'imdb': u'tt2168240'}
{u'thetvdb': 271826, u'tvrage': None, u'imdb': u'tt2560966'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt0375440'}
{u'thetvdb': 282253, u'tvrage': 44602, u'imdb': u'tt4081326'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt6664486'}
{u'thetvdb': 70734, u'tvrage': 14443, u'imdb': u'tt0247094'}
{u'thetvdb': 70852, u'tvrage': 5323, u'imdb': u'tt0320969'}
{u'thetvdb': 267185, u'tvrage': None, u'imdb': u'tt2720144'}
{u'thetvdb': 265320, u'tvrage': 33976, u'imdb': u'tt2287380'}
{u'thetvdb': 252485, u'tvrage': None, u'imdb': u'tt2010634'}
{u'thetvdb': 271722, u'tvrage': 36787, u'imdb': u'tt3084090'}
{u'thetvdb': 260126, u'tvrage': 30877, u'imdb': u'tt2392683'}
{u'thetvdb': 251033, u'tvrage': 28408, u'imdb': u'tt1628058'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt1837169'}
{u'thetvdb': 260341, u'tvrage': 31462, u'imdb': u'tt2404111'}
{u'thetvdb': 89831, u'tvrage': 22647, u'imdb': u'tt1411598'}
{u'thetvdb': 70609, u'tvrage': 5102, u'imdb': u'tt0106123'}
{u'thetvdb': 245071, u'tvrage': 26645, u'imdb': u'tt1740718'}
{u'thetvdb': 73230, u'tvrage': 6188, u'imdb': u'tt0362153'}
{u'thetvdb': 163671, u'tvrage': None, u'imdb': u'tt1637756'}
{u'thetvdb': 259478, u'tvrage': 31194, u'imdb': u'tt2328067'}
{u'thetvdb': 294774, u'tvrage': None, u'imdb': u'tt0057741'}
{u'thetvdb': 282993, u'tvrage': None, u'imdb': u'tt1261356'}
{u'thetvdb': 268795, u'tvrage': 36420, u'imdb': u'tt2559390'}
{u'thetvdb': 72048, u'tvrage': 4056, u'imdb': u'tt0083433'}
{u'thetvdb': 256513, u'tvrage': 31344, u'imdb': u'tt2330453'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt0804423'}
{u'thetvdb': 159351, u'tvrage': None, u'imdb': u'tt1118131'}
{u'thetvdb': 300384, u'tvrage': None, u'imdb': u'tt4016700'}
{u'thetvdb': 264239, u'tvrage': None, u'imdb': u'tt0950199'}
{u'thetvdb': 106801, u'tvrage': None, u'imdb': u'tt1477137'}
{u'thetvdb': 87131, u'tvrage': None, u'imdb': u'tt1176156'}
{u'thetvdb': 173981, u'tvrage': None, u'imdb': u'tt1545453'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt1240983'}
{u'thetvdb': 264762, u'tvrage': 31404, u'imdb': u'tt1415000'}
{u'thetvdb': 72180, u'tvrage': None, u'imdb': u'tt0144701'}
{u'thetvdb': 307473, u'tvrage': None, u'imdb': u'tt4718304'}
{u'thetvdb': 147701, u'tvrage': None, u'imdb': u'tt1095213'}
{u'thetvdb': 98371, u'tvrage': None, u'imdb': u'tt1453090'}
{u'thetvdb': 72141, u'tvrage': None, u'imdb': u'tt0168372'}
{u'thetvdb': 75567, u'tvrage': 12949, u'imdb': u'tt0425725'}
{u'thetvdb': 275787, u'tvrage': None, u'imdb': u'tt3300126'}
{u'thetvdb': 308457, u'tvrage': 51439, u'imdb': u'tt5459976'}
{u'thetvdb': 285286, u'tvrage': 44525, u'imdb': u'tt4041694'}
{u'thetvdb': 261287, u'tvrage': 32847, u'imdb': u'tt2322264'}
{u'thetvdb': 250325, u'tvrage': None, u'imdb': u'tt1441005'}
{u'thetvdb': 72133, u'tvrage': None, u'imdb': u'tt0365991'}
{u'thetvdb': 72488, u'tvrage': None, u'imdb': u'tt0364807'}
{u'thetvdb': 149371, u'tvrage': 25246, u'imdb': u'tt1591375'}
{u'thetvdb': 291820, u'tvrage': None, u'imdb': u'tt3562462'}
{u'thetvdb': 96071, u'tvrage': None, u'imdb': u'tt1372127'}
{u'thetvdb': 287516, u'tvrage': None, u'imdb': u'tt2088493'}
{u'thetvdb': 295059, u'tvrage': 48857, u'imdb': u'tt4658248'}
{u'thetvdb': 250280, u'tvrage': None, u'imdb': u'tt1973659'}
{u'thetvdb': 272357, u'tvrage': None, u'imdb': u'tt2849552'}
{u'thetvdb': 282130, u'tvrage': None, u'imdb': u'tt3774098'}
{u'thetvdb': None, u'tvrage': 18611, u'imdb': u'tt1151434'}
{u'thetvdb': 271067, u'tvrage': None, u'imdb': u'tt2993514'}
{u'thetvdb': 80311, u'tvrage': None, u'imdb': u'tt0773264'}
{u'thetvdb': 260189, u'tvrage': 32126, u'imdb': u'tt1798695'}
{u'thetvdb': 139481, u'tvrage': 20203, u'imdb': u'tt1307083'}
{u'thetvdb': 297960, u'tvrage': 49841, u'imdb': u'tt4845734'}
{u'thetvdb': 70656, u'tvrage': None, u'imdb': u'tt0284767'}
{u'thetvdb': 80694, u'tvrage': 15758, u'imdb': u'tt0878801'}
{u'thetvdb': 282654, u'tvrage': 39954, u'imdb': u'tt3703500'}
{u'thetvdb': 272737, u'tvrage': 37535, u'imdb': u'tt3155428'}
{u'thetvdb': 76237, u'tvrage': None, u'imdb': u'tt0287196'}
{u'thetvdb': 270469, u'tvrage': 34560, u'imdb': u'tt2766052'}
{u'thetvdb': 301235, u'tvrage': None, u'imdb': u'tt0262975'}
{u'thetvdb': 126811, u'tvrage': None, u'imdb': u'tt1388381'}
{u'thetvdb': 307480, u'tvrage': None, u'imdb': u'tt0358889'}
{u'thetvdb': 83326, u'tvrage': None, u'imdb': u'tt1011591'}
{u'thetvdb': 279772, u'tvrage': None, u'imdb': u'tt3612584'}
{u'thetvdb': 305936, u'tvrage': None, u'imdb': u'tt5464192'}
{u'thetvdb': 267921, u'tvrage': None, u'imdb': u'tt4741696'}
{u'thetvdb': 95351, u'tvrage': None, u'imdb': u'tt1974439'}
{u'thetvdb': 79838, u'tvrage': 5631, u'imdb': u'tt0092325'}
{u'thetvdb': None, u'tvrage': None, u'imdb': u'tt6772826'}
{u'thetvdb': 127351, u'tvrage': 24425, u'imdb': u'tt1563069'}
{u'thetvdb': 79550, u'tvrage': 6890, u'imdb': u'tt0489598'}
{u'thetvdb': 148561, u'tvrage': 24465, u'imdb': u'tt1566154'}
{u'thetvdb': 70905, u'tvrage': 3150, u'imdb': u'tt0338592'}
{u'thetvdb': 70829, u'tvrage': None, u'imdb': u'tt0167515'}
{u'thetvdb': 262883, u'tvrage': 31271, u'imdb': u'tt2330327'}
{u'thetvdb': 84208, u'tvrage': None, u'imdb': u'tt0186747'}
{u'thetvdb': 239961, u'tvrage': 27826, u'imdb': u'tt1821877'}
{u'thetvdb': 216741, u'tvrage': None, u'imdb': u'tt1599374'}
{u'thetvdb': 270465, u'tvrage': 35836, u'imdb': u'tt2946482'}
{u'thetvdb': 268600, u'tvrage': 35103, u'imdb': u'tt2649480'}
{u'thetvdb': 82550, u'tvrage': None, u'imdb': u'tt1229945'}
{u'thetvdb': 248039, u'tvrage': 23213, u'imdb': u'tt1876612'}
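
The dictionaries printed above are the `externals` entries returned for each show, and several have `None` for `thetvdb` or `tvrage`. As a minimal sketch of how these could be unpacked, the snippet below pulls out the IMDb id and counts missing ids. The helper names `split_externals` and `count_missing` are hypothetical and not part of the original notebook; the sample rows are copied from the output above.

```python
# Four of the `externals` dicts printed above, copied verbatim from the output.
externals_sample = [
    {'thetvdb': 260473, 'tvrage': 30918, 'imdb': 'tt2197994'},
    {'thetvdb': 83313, 'tvrage': None, 'imdb': 'tt1263594'},
    {'thetvdb': None, 'tvrage': None, 'imdb': 'tt0375440'},
    {'thetvdb': None, 'tvrage': 18611, 'imdb': 'tt1151434'},
]


def split_externals(ext):
    """Return (imdb_id, thetvdb_id, tvrage_id) from one externals dict.

    Hypothetical helper: missing keys and non-dict inputs give None
    instead of raising, since some ids are absent in the data.
    """
    if not isinstance(ext, dict):
        return (None, None, None)
    return (ext.get('imdb'), ext.get('thetvdb'), ext.get('tvrage'))


def count_missing(rows, key):
    """Count rows where `key` is absent or None (hypothetical helper)."""
    return sum(
        1 for ext in rows
        if not isinstance(ext, dict) or ext.get(key) is None
    )


# The IMDb id is the field that links a TVmaze record back to the IMDb rating.
imdb_ids = [split_externals(ext)[0] for ext in externals_sample]
missing_tvrage = count_missing(externals_sample, 'tvrage')
missing_thetvdb = count_missing(externals_sample, 'thetvdb')

print(imdb_ids)
print(missing_tvrage, missing_thetvdb)
```

If the records are held in a pandas DataFrame, as `more_losers` is below, the same function could be applied to the column with `more_losers['externals'].apply(split_externals)`. This assumes the column still holds dicts rather than their string representation.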

more_losers

	status	rating	genres	weight	updated	name	language	schedule	url	officialSite	externals	premiered	summary	_links	image	webChannel	runtime	type	id	network
0	Ended	{u'average': None}	[]	0	1449178946	Famous in 12	English	{u'days': [u'Tuesday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/9024/famous-in-12	None	{u'thetvdb': 279947, u'tvrage': 37045, u'imdb'...	2014-06-03	<p><i><b>"Famous in 12"</b></i>, the new unscr...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	9024	{u'country': {u'timezone': u'America/New_York'...
1	Ended	{u'average': None}	[Comedy, Family]	14	1497059695	The Sharon Osbourne Show	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/19004/the-sharon-o...	None	{u'thetvdb': None, u'tvrage': 13173, u'imdb': ...	2006-08-29	<p>Daily talk show hosted by Sharon Osbourne.</p>	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Talk Show	19004	{u'country': {u'timezone': u'Europe/London', u...
2	Ended	{u'average': None}	[Comedy]	0	1503083428	Steve Harvey's Big Time Challenge	English	{u'days': [u'Sunday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/29202/steve-harvey...	None	{u'thetvdb': 72157, u'tvrage': None, u'imdb': ...	2003-09-11	<p><b>Steve Harvey's Big Time Challenge</b>, a...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Talk Show	29202	{u'country': {u'timezone': u'America/New_York'...
3	Ended	{u'average': None}	[]	0	1475183910	The Spin Crowd	English	{u'days': [u'Sunday'], u'time': u'22:30'}	http://www.tvmaze.com/shows/21619/the-spin-crowd	None	{u'thetvdb': 218241, u'tvrage': None, u'imdb':...	2010-08-22	<p>Nobody knows how to make stars shine bright...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	21619	{u'country': {u'timezone': u'America/New_York'...
4	Running	{u'average': 1}	[]	0	1495714601	Babushka	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/25450/babushka	http://www.itv.com/beontv/shows/babushka	{u'thetvdb': 327908, u'tvrage': None, u'imdb':...	2017-05-01	<p><b>Babushka</b> is a brand new game show wh...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Game Show	25450	{u'country': {u'timezone': u'Europe/London', u...
5	Ended	{u'average': None}	[]	0	1483745416	Chrome Underground	English	{u'days': [u'Tuesday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/24213/chrome-under...	http://www.discovery.com/tv-shows/chrome-under...	{u'thetvdb': 279810, u'tvrage': None, u'imdb':...	2014-05-23	<p>Two international classic car dealers searc...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	24213	{u'country': {u'timezone': u'America/New_York'...
6	Ended	{u'average': None}	[]	0	1495602919	Fear Factor	English	{u'days': [u'Sunday'], u'time': u''}	http://www.tvmaze.com/shows/26838/fear-factor	None	{u'thetvdb': 283658, u'tvrage': None, u'imdb':...	2002-09-10	<p>This version has two teams of three contest...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Game Show	26838	{u'country': {u'timezone': u'Europe/London', u...
7	Ended	{u'average': None}	[]	0	1495254081	Owner's Manual	English	{u'days': [u'Thursday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/9261/owners-manual	None	{u'thetvdb': 271341, u'tvrage': 33650, u'imdb'...	2013-08-15	<p><b>Owner's Manual</b> will test one of the ...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	9261	{u'country': {u'timezone': u'America/New_York'...
8	Ended	{u'average': None}	[]	0	1487011574	The Shire	English	{u'days': [u'Monday'], u'time': u'21:45'}	http://www.tvmaze.com/shows/25288/the-shire	None	{u'thetvdb': 260677, u'tvrage': None, u'imdb':...	2012-07-16	<p>The series follows the lives and love of a ...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	25	Reality	25288	{u'country': {u'timezone': u'Australia/Sydney'...
9	Ended	{u'average': None}	[Comedy]	0	1483143763	The Montefuscos	English	{u'days': [u'Thursday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/24079/the-montefuscos	None	{u'thetvdb': 77616, u'tvrage': None, u'imdb': ...	1975-09-04	<p>The trials and tribulations of three genera...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Scripted	24079	{u'country': {u'timezone': u'America/New_York'...
10	Ended	{u'average': None}	[]	0	1464030266	I Want to Be a Hilton	English	{u'days': [u'Tuesday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/17541/i-want-to-be...	None	{u'thetvdb': 74419, u'tvrage': None, u'imdb': ...	2005-06-21	<p>Kathy Hilton, onetime actress and mother of...	{u'previousepisode': {u'href': u'http://api.tv...	None	None	60	Reality	17541	{u'country': {u'timezone': u'America/New_York'...
11	Ended	{u'average': None}	[]	20	1478379662	ABC's Nightlife	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/22597/abcs-nightlife	None	{u'thetvdb': None, u'tvrage': None, u'imdb': u...	1964-11-09	<p><b>ABC's Nightlife</b> is a late night dail...	{u'previousepisode': {u'href': u'http://api.tv...	None	None	105	Talk Show	22597	{u'country': {u'timezone': u'America/New_York'...
12	Running	{u'average': None}	[]	0	1454050022	Untying the Knot	English	{u'days': [u'Monday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/6843/untying-the-knot	http://www.bravotv.com/untying-the-knot	{u'thetvdb': 282527, u'tvrage': 42189, u'imdb'...	2014-06-04	<p>Vikki Ziegler, known as the Divorce Diva, i...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	6843	{u'country': {u'timezone': u'America/New_York'...
13	Ended	{u'average': None}	[Action]	0	1495406329	Wipeout Canada	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/12998/wipeout-canada	None	{u'thetvdb': 246631, u'tvrage': None, u'imdb':...	2011-04-03	<p><b>Wipeout Canada</b> is a hilarious game s...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Game Show	12998	{u'country': {u'timezone': u'Canada/Atlantic',...
14	Ended	{u'average': None}	[]	0	1464363967	Hurl!	English	{u'days': [u'Tuesday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/17705/hurl	None	{u'thetvdb': 82500, u'tvrage': None, u'imdb': ...	2008-07-15	<p>Get ready to get grossed out with G4's off-...	{u'previousepisode': {u'href': u'http://api.tv...	None	None	30	Reality	17705	{u'country': {u'timezone': u'America/New_York'...
15	Ended	{u'average': None}	[]	0	1457450255	Meet the Parents	English	{u'days': [u'Thursday'], u'time': u'21:30'}	http://www.tvmaze.com/shows/13973/meet-the-par...	http://www.channel4.com/programmes/meet-the-pa...	{u'thetvdb': 206381, u'tvrage': 26873, u'imdb'...	2010-11-18	<p><i>Meet the Parents</i> is a reality series...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	13973	{u'country': {u'timezone': u'Europe/London', u...
16	Ended	{u'average': None}	[Drama, Action]	0	1481553637	4th and Loud	English	{u'days': [u'Tuesday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/11854/4th-and-loud	http://www.amc.com/shows/4th-and-loud	{u'thetvdb': 284259, u'tvrage': None, u'imdb':...	2014-08-12	<p><b>4th and Loud</b> will follow the LA KISS...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	11854	{u'country': {u'timezone': u'America/New_York'...
17	Ended	{u'average': None}	[]	0	1495496078	It's Worth What?	English	{u'days': [u'Tuesday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/17619/its-worth-what	None	{u'thetvdb': 250186, u'tvrage': None, u'imdb':...	2011-07-19	<p><b>It's Worth What? </b>stars Cedric the En...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Game Show	17619	{u'country': {u'timezone': u'America/New_York'...
18	To Be Determined	{u'average': 6.6}	[Drama, Thriller, Adult]	92	1497788418	The Deleted	English	{u'days': [], u'time': u''}	http://www.tvmaze.com/shows/19884/the-deleted	https://www.fullscreen.com/series/the-deleted	{u'thetvdb': 320679, u'tvrage': None, u'imdb':...	2016-12-04	<p>When escapees from a mysterious cult start ...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	{u'country': {u'timezone': u'America/New_York'...	15	Scripted	19884	None
19	Ended	{u'average': 7.3}	[Comedy, Action, Crime]	14	1500877446	V.I.P.	English	{u'days': [u'Saturday'], u'time': u''}	http://www.tvmaze.com/shows/1885/vip	None	{u'thetvdb': 74181, u'tvrage': 6494, u'imdb': ...	1998-09-26	<p>A campy syndicated series about Vallery Iro...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Scripted	1885	{u'country': {u'timezone': u'America/New_York'...
20	Running	{u'average': 6}	[Drama]	63	1496679327	The Real Housewives of Atlanta	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/597/the-real-house...	None	{u'thetvdb': 84159, u'tvrage': 19672, u'imdb':...	2008-10-07	<p>An up-close and personal look at life in Ho...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	597	{u'country': {u'timezone': u'America/New_York'...
21	Running	{u'average': None}	[Comedy, Children]	0	1475116665	Pickle and Peanut	English	{u'days': [u'Monday'], u'time': u'18:30'}	http://www.tvmaze.com/shows/3019/pickle-and-pe...	http://disneyxd.disney.com/pickle-and-peanut	{u'thetvdb': 300105, u'tvrage': 48178, u'imdb'...	2015-09-07	<p><i><b>"Pickle & Peanut"</b></i> is abou...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	{u'country': {u'timezone': u'America/New_York'...	15	Animation	3019	{u'country': {u'timezone': u'America/New_York'...
22	Ended	{u'average': None}	[Drama, Comedy, Romance]	0	1501880843	Buckwild	English	{u'days': [u'Thursday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/25036/buckwild	http://www.mtv.com/shows/buckwild	{u'thetvdb': 264850, u'tvrage': None, u'imdb':...	2013-01-03	<p>The show follows the lives of nine young ad...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	25036	{u'country': {u'timezone': u'America/New_York'...
23	Ended	{u'average': 3}	[]	0	1486506841	Mystery Girls	English	{u'days': [u'Wednesday'], u'time': u'20:30'}	http://www.tvmaze.com/shows/3950/mystery-girls	http://abcfamily.go.com/shows/mystery-girls	{u'thetvdb': 277020, u'tvrage': 35629, u'imdb'...	2014-06-25	<p>Two former detective TV show starlets broug...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Scripted	3950	{u'country': {u'timezone': u'America/New_York'...
24	Running	{u'average': None}	[Family]	12	1450883412	Celebrity Wife Swap	English	{u'days': [u'Wednesday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/1783/celebrity-wif...	http://abc.go.com/shows/celebrity-wife-swap/ab...	{u'thetvdb': 254524, u'tvrage': 31887, u'imdb'...	2012-01-02	<p>The spouses in two celebrity families with ...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	1783	{u'country': {u'timezone': u'America/New_York'...
25	Running	{u'average': 7}	[Comedy]	0	1472855087	Dish Nation	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/9199/dish-nation	http://www.reelz.com/dish-nation/	{u'thetvdb': 271916, u'tvrage': None, u'imdb':...	2011-07-25	<p><i>Dish Nation</i> is a nightly syndicated ...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Scripted	9199	{u'country': {u'timezone': u'America/New_York'...
26	Ended	{u'average': None}	[Children]	0	1502544202	Ni Hao, Kai-lan	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/13161/ni-hao-kai-lan	None	{u'thetvdb': 82005, u'tvrage': None, u'imdb': ...	2008-02-07	<p>Ni Hao, Kai-lan , which is Mandarin for "He...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Animation	13161	{u'country': {u'timezone': u'America/New_York'...
27	Running	{u'average': None}	[Comedy, Family]	0	1502948333	Scaredy Squirrel	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/20564/scaredy-squi...	http://www.scaredysquirrel.com	{u'thetvdb': 250472, u'tvrage': None, u'imdb':...	2011-04-01	<p><b>Scaredy Squirrel </b>follows the adventu...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	10	Animation	20564	{u'country': {u'timezone': u'Canada/Atlantic',...
28	Running	{u'average': None}	[]	76	1502312151	Big Brother After Dark	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/18240/big-brother-...	http://poptv.com/big_brother_after_dark	{u'thetvdb': 81491, u'tvrage': None, u'imdb': ...	2007-07-05	<p><b>Big Brother After Dark</b> is the live, ...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	180	Reality	18240	{u'country': {u'timezone': u'America/New_York'...
29	Ended	{u'average': 1}	[Action, Adventure]	0	1474827145	American Paranormal	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/19115/american-par...	None	{u'thetvdb': 137691, u'tvrage': None, u'imdb':...	2010-01-24	<p>Whether it is the existence of aliens, the ...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	19115	{u'country': {u'timezone': u'America/New_York'...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
107	To Be Determined	{u'average': None}	[]	0	1495420105	Who's Doing the Dishes?	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/8612/whos-doing-th...	None	{u'thetvdb': 300384, u'tvrage': None, u'imdb':...	2014-09-01	<p><b>Who's Doing the Dishes?</b> is a UK game...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Game Show	8612	{u'country': {u'timezone': u'Europe/London', u...
108	Ended	{u'average': None}	[]	0	1474499818	I'm a Celebrity, Get Me Out of Here! NOW!	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/8558/im-a-celebrit...	http://www.itv.com/imacelebrity/itv2-now	{u'thetvdb': 264239, u'tvrage': None, u'imdb':...	2011-11-13	<p><i>"I'm a Celebrity...Get Me Out of Here! N...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	8558	{u'country': {u'timezone': u'Europe/London', u...
109	Ended	{u'average': None}	[Romance]	0	1474764176	More to Love	English	{u'days': [u'Tuesday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/21467/more-to-love	None	{u'thetvdb': 106801, u'tvrage': None, u'imdb':...	2009-07-28	<p>Follows one regular guy's search for love a...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	21467	{u'country': {u'timezone': u'America/New_York'...
110	Ended	{u'average': None}	[]	0	1467307858	I Want to Work for Diddy	English	{u'days': [u'Monday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/18829/i-want-to-wo...	None	{u'thetvdb': 87131, u'tvrage': None, u'imdb': ...	2008-08-04	<p>Diddy. He only needs one name, but he needs...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	18829	{u'country': {u'timezone': u'America/New_York'...
111	Ended	{u'average': None}	[]	0	1490997318	Donald J. Trump Presents: The Ultimate Merger	English	{u'days': [u'Thursday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/26564/donald-j-tru...	None	{u'thetvdb': 173981, u'tvrage': None, u'imdb':...	2010-06-17	<p>Through a series of challenges, both relati...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	26564	{u'country': {u'timezone': u'America/New_York'...
112	Running	{u'average': None}	[]	0	1477193039	America's Election Headquarters	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/11837/americas-ele...	http://www.foxnews.com/on-air/americas-news-hq...	{u'thetvdb': None, u'tvrage': None, u'imdb': u...	2008-04-22	<p><b>America's Election Headquarters</b> is a...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Talk Show	11837	{u'country': {u'timezone': u'America/New_York'...
113	Running	{u'average': None}	[]	45	1502693229	BBC Weekend News	English	{u'days': [u'Saturday', u'Sunday'], u'time': u''}	http://www.tvmaze.com/shows/7333/bbc-weekend-news	http://www.bbc.co.uk/programmes/b009m51q	{u'thetvdb': 264762, u'tvrage': 31404, u'imdb'...	1954-07-05	<p><b>BBC Weekend News</b> is the national new...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	None	News	7333	{u'country': {u'timezone': u'Europe/London', u...
114	Ended	{u'average': None}	[Comedy, Children]	0	1477293529	Barney & Friends	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/15482/barney-friends	None	{u'thetvdb': 72180, u'tvrage': None, u'imdb': ...	1992-04-06	<p><b>Barney & Friends</b> is an American ...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Scripted	15482	{u'country': {u'timezone': u'America/New_York'...
115	Running	{u'average': 7}	[Comedy]	89	1503147213	The Powerpuff Girls	English	{u'days': [u'Sunday'], u'time': u'17:30'}	http://www.tvmaze.com/shows/6771/the-powerpuff...	None	{u'thetvdb': 307473, u'tvrage': None, u'imdb':...	2016-04-04	<p>The city of Townsville may be a beautiful, ...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	15	Animation	6771	{u'country': {u'timezone': u'America/New_York'...
116	Running	{u'average': None}	[]	0	1497251730	TMZ	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/24857/tmz	http://www.tmz.com/when-its-on?adid=tmz_web_na...	{u'thetvdb': 147701, u'tvrage': None, u'imdb':...	2011-11-02	<p><b>TMZ </b>(also known simply as <i>TMZ</i>...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Talk Show	24857	{u'country': {u'timezone': u'America/New_York'...
117	Ended	{u'average': 6}	[]	0	1476263385	Kendra	English	{u'days': [u'Sunday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/21952/kendra	None	{u'thetvdb': 98371, u'tvrage': None, u'imdb': ...	2009-06-07	<p>Kendra Wilkinson finds herself at a crossro...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	21952	{u'country': {u'timezone': u'America/New_York'...
118	Ended	{u'average': None}	[]	0	1465065870	The Roseanne Show	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/12252/the-roseanne...	None	{u'thetvdb': 72141, u'tvrage': None, u'imdb': ...	1998-09-14	<p><i><b>"The Roseanne Show"</b></i> is a synd...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Talk Show	12252	{u'country': {u'timezone': u'America/New_York'...
119	Running	{u'average': 1}	[Music]	0	1484368515	The Xtra Factor Live	English	{u'days': [u'Saturday', u'Sunday'], u'time': u''}	http://www.tvmaze.com/shows/3764/the-xtra-fact...	None	{u'thetvdb': 75567, u'tvrage': 12949, u'imdb':...	2004-09-04	<p>Thousands audition. Only one can win. The s...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	3764	{u'country': {u'timezone': u'Europe/London', u...
120	Ended	{u'average': 7}	[Comedy]	0	1488031177	But I'm Chris Jericho!	English	{u'days': [u'Tuesday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/13150/but-im-chris...	http://butimchrisjericho.com	{u'thetvdb': 275787, u'tvrage': None, u'imdb':...	2013-10-29	<p><b>But I'm Chris Jericho!</b> is an interac...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	{u'country': {u'timezone': u'America/New_York'...	8	Scripted	13150	{u'country': {u'timezone': u'Canada/Atlantic',...
121	Ended	{u'average': 6}	[Comedy]	0	1466802381	Party Over Here	English	{u'days': [u'Saturday'], u'time': u'23:00'}	http://www.tvmaze.com/shows/12662/party-over-here	http://www.fox.com/party-over-here	{u'thetvdb': 308457, u'tvrage': 51439, u'imdb'...	2016-03-12	<p>A new late-night half-hour sketch comedy se...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Variety	12662	{u'country': {u'timezone': u'America/New_York'...
122	Ended	{u'average': None}	[Action, Adventure, Horror]	0	1500043650	Alaska Monsters	English	{u'days': [u'Saturday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/3124/alaska-monsters	http://www.destinationamerica.com/tv-shows/ala...	{u'thetvdb': 285286, u'tvrage': 44525, u'imdb'...	2014-09-12	<p>Treacherous terrain and unforgiving natural...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	3124	{u'country': {u'timezone': u'America/New_York'...
123	Ended	{u'average': None}	[Drama, Children]	0	1495726406	Abby's Ultimate Dance Competition	English	{u'days': [u'Tuesday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/9420/abbys-ultimat...	http://www.mylifetime.com/shows/abbys-ultimate...	{u'thetvdb': 261287, u'tvrage': 32847, u'imdb'...	2012-10-09	<p>Lifetime has picked-up the reality series <...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Game Show	9420	{u'country': {u'timezone': u'America/New_York'...
124	Ended	{u'average': None}	[Children, Mystery, Supernatural]	0	1502934987	The Othersiders	English	{u'days': [u'Wednesday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/9593/the-othersiders	None	{u'thetvdb': 250325, u'tvrage': None, u'imdb':...	2009-06-17	<p><b>The Othersiders</b> was an American para...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	9593	{u'country': {u'timezone': u'America/New_York'...
125	Ended	{u'average': None}	[]	0	1449520834	Canadian Idol	English	{u'days': [], u'time': u'20:00'}	http://www.tvmaze.com/shows/9674/canadian-idol	None	{u'thetvdb': 72133, u'tvrage': None, u'imdb': ...	2003-06-09	<p><i><b>"Canadian Idol"</b></i> is a Canadian...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	9674	{u'country': {u'timezone': u'Canada/Atlantic',...
126	Ended	{u'average': None}	[]	0	1474314323	Extreme Makeover	English	{u'days': [u'Thursday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/21134/extreme-make...	None	{u'thetvdb': 72488, u'tvrage': None, u'imdb': ...	2002-12-11	<p>Three people are chosen to receive the make...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	21134	{u'country': {u'timezone': u'America/New_York'...
127	Ended	{u'average': None}	[]	0	1469556547	Pretty Wild	English	{u'days': [u'Sunday'], u'time': u'22:30'}	http://www.tvmaze.com/shows/19522/pretty-wild	None	{u'thetvdb': 149371, u'tvrage': 25246, u'imdb'...	2010-03-14		{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	19522	{u'country': {u'timezone': u'America/New_York'...
128	Running	{u'average': None}	[Comedy]	0	1502570683	Just for Laughs: All Access	English	{u'days': [u'Saturday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/18044/just-for-lau...	http://www.thecomedynetwork.ca/Shows/JustForLa...	{u'thetvdb': 291820, u'tvrage': None, u'imdb':...	2012-10-12	<p>Comedians celebrate the 30th anniversary of...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Variety	18044	{u'country': {u'timezone': u'Canada/Atlantic',...
129	Ended	{u'average': None}	[]	0	1455387941	Jesse James is a Dead Man	English	{u'days': [u'Sunday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/12951/jesse-james-...	None	{u'thetvdb': 96071, u'tvrage': None, u'imdb': ...	2009-05-31	<p>Jesse James takes on the role of a modern-d...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	12951	{u'country': {u'timezone': u'America/New_York'...
130	Ended	{u'average': None}	[]	0	1488221218	Secretly Pregnant	English	{u'days': [u'Thursday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/25580/secretly-pre...	http://www.discoverylife.com/tv-shows/secretly...	{u'thetvdb': 287516, u'tvrage': None, u'imdb':...	2011-10-13	<p>The stories of women who, for various reaso...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	25580	{u'country': {u'timezone': u'America/New_York'...
131	Running	{u'average': None}	[Family]	13	1455319657	The Briefcase	English	{u'days': [], u'time': u'20:00'}	http://www.tvmaze.com/shows/1831/the-briefcase	None	{u'thetvdb': 295059, u'tvrage': 48857, u'imdb'...	2015-05-27	<p>The show features a social experiment eleme...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	1831	{u'country': {u'timezone': u'America/New_York'...
132	Ended	{u'average': None}	[Comedy]	0	1492370917	PrankStars	English	{u'days': [], u'time': u''}	http://www.tvmaze.com/shows/27206/prankstars	None	{u'thetvdb': 250280, u'tvrage': None, u'imdb':...	2011-07-15	<p>A hidden-camera series where unsuspecting t...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	27206	{u'country': {u'timezone': u'America/New_York'...
133	Ended	{u'average': None}	[]	0	1485549026	Cash Dome	English	{u'days': [u'Tuesday'], u'time': u'21:30'}	http://www.tvmaze.com/shows/24751/cash-dome	None	{u'thetvdb': 272357, u'tvrage': None, u'imdb':...	2013-08-13	<p>For a quarter century, <b>Cash Dome</b> Jew...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	24751	{u'country': {u'timezone': u'America/New_York'...
134	Ended	{u'average': None}	[Comedy]	0	1502593090	CeeLo Green's The Good Life	English	{u'days': [u'Monday'], u'time': u'22:30'}	http://www.tvmaze.com/shows/25900/ceelo-greens...	None	{u'thetvdb': 282130, u'tvrage': None, u'imdb':...	2014-06-23	<p>Follow CeeLo as he tackles not only a packe...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	25900	{u'country': {u'timezone': u'America/New_York'...
135	Ended	{u'average': None}	[]	0	1477193480	America's Prom Queen	English	{u'days': [u'Monday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/16384/americas-pro...	None	{u'thetvdb': None, u'tvrage': 18611, u'imdb': ...	2008-03-17	<p><b>America's Prom Queen</b> is a reality TV...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	16384	{u'country': {u'timezone': u'America/New_York'...
136	Ended	{u'average': None}	[]	0	1461445299	Hollywood Me	English	{u'days': [u'Wednesday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/15972/hollywood-me	None	{u'thetvdb': 271067, u'tvrage': None, u'imdb':...	2013-06-19	<p>Martyn Lawrence Bullard's normal clients in...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	15972	{u'country': {u'timezone': u'Europe/London', u...

137 rows × 20 columns

# Create a backup occasionally, and pickle after we've pulled the data
more_losers_backup = more_losers.copy()

DO_NOT_RUN = True  # Be sure to check the file name to write before enabling execution on this block

if not DO_NOT_RUN:
    # Use a context manager so the file handle is closed after writing
    with open("save_more_losers_df.p", "wb") as f:
        pickle.dump(more_losers, f)

Add a column to both shows (good) and losers (bad) to classify the rows as winners or losers

# All the data pulled from the API and placed in dataframes was pickled and written to disk.
# Read it all back in, add a column to indicate whether each show is a winner or loser,
# then clean up and begin the analysis.
# $ ls *.p
# save_losers_df.p    save_more_losers_df.p     save_shows_df.p

# read data back in from the saved file
winners = pickle.load( open( "save_shows_df.p", "rb" ) )
losers1 = pickle.load( open( "save_losers_df.p", "rb" ) )
losers2 = pickle.load( open( "save_more_losers_df.p", "rb" ) )

print " Winners:", winners.shape
print " Losers1:", losers1.shape
print " Losers2:", losers2.shape

 Winners: (235, 20)
 Losers1: (229, 22)
 Losers2: (170, 20)

# Investigate why losers1 has 22 columns; it must have been pickled after a change.
losers1.columns

Index([u'_links', u'code', u'externals', u'genres', u'id', u'image',
       u'language', u'message', u'name', u'network', u'officialSite',
       u'premiered', u'rating', u'runtime', u'schedule', u'status', u'summary',
       u'type', u'updated', u'url', u'webChannel', u'weight'],
      dtype='object')

losers2.columns

Index([u'status', u'rating', u'genres', u'weight', u'updated', u'name',
       u'language', u'schedule', u'url', u'officialSite', u'externals',
       u'premiered', u'summary', u'_links', u'image', u'webChannel',
       u'runtime', u'type', u'id', u'network'],
      dtype='object')

winners.columns

Index([u'status', u'rating', u'genres', u'weight', u'updated', u'name',
       u'language', u'schedule', u'url', u'officialSite', u'externals',
       u'premiered', u'summary', u'_links', u'image', u'webChannel',
       u'runtime', u'type', u'id', u'network'],
      dtype='object')

# losers1 has two extra columns, 'code' and 'message'. Correct the issue by selecting
# only the columns shared with losers2 into new_losers1
cols = losers2.columns
new_losers1 = losers1[cols]

new_losers1.shape

(229, 20)

# Check that all three dataframes have the same columns in the same order
winners.head(2)

	status	rating	genres	weight	updated	name	language	schedule	url	officialSite	externals	premiered	summary	_links	image	webChannel	runtime	type	id	network
0	Ended	{u'average': 9.4}	[Nature]	87	1490631396	Planet Earth II	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/22036/planet-earth-ii	http://www.bbc.co.uk/programmes/p02544td	{u'thetvdb': 318408, u'tvrage': None, u'imdb':...	2016-11-06	<p>David Attenborough presents a documentary s...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Documentary	22036	{u'country': {u'timezone': u'Europe/London', u...
1	Ended	{u'average': 9.4}	[Drama, Action, War, History]	86	1492651730	Band of Brothers	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/465/band-of-brothers	http://www.hbo.com/band-of-brothers	{u'thetvdb': 74205, u'tvrage': 2708, u'imdb': ...	2001-09-09	<p>Drawn from interviews with survivors of Eas...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Scripted	465	{u'country': {u'timezone': u'America/New_York'...

new_losers1.head(2)

	status	rating	genres	weight	updated	name	language	schedule	url	officialSite	externals	premiered	summary	_links	image	webChannel	runtime	type	id	network
0	Running	{u'average': None}	[]	63	1463447317	The Bill Cunningham Show	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/6068/the-bill-cunn...	http://www.thebillcunninghamshow.com/	{u'thetvdb': 283995, u'tvrage': 40425, u'imdb'...	2011-09-19	<p><i><b>"The Bill Cunningham Show"</b>,</i> T...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Talk Show	6068	{u'country': {u'timezone': u'America/New_York'...
1	To Be Determined	{u'average': None}	[Comedy, Music]	0	1477139892	Six Degrees of Everything	English	{u'days': [u'Tuesday'], u'time': u'23:00'}	http://www.tvmaze.com/shows/2821/six-degrees-o...	http://www.trutv.com/shows/six-degrees-of-ever...	{u'thetvdb': 299234, u'tvrage': 50418, u'imdb'...	2015-08-18	<p><b>Six Degrees of Everything</b> is a fast-...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Variety	2821	{u'country': {u'timezone': u'America/New_York'...

losers2.head(2)

	status	rating	genres	weight	updated	name	language	schedule	url	officialSite	externals	premiered	summary	_links	image	webChannel	runtime	type	id	network
0	Ended	{u'average': None}	[]	0	1449178946	Famous in 12	English	{u'days': [u'Tuesday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/9024/famous-in-12	None	{u'thetvdb': 279947, u'tvrage': 37045, u'imdb'...	2014-06-03	<p><i><b>"Famous in 12"</b></i>, the new unscr...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Reality	9024	{u'country': {u'timezone': u'America/New_York'...
1	Ended	{u'average': None}	[Comedy, Family]	14	1497059695	The Sharon Osbourne Show	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/19004/the-sharon-o...	None	{u'thetvdb': None, u'tvrage': 13173, u'imdb': ...	2006-08-29	<p>Daily talk show hosted by Sharon Osbourne.</p>	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Talk Show	19004	{u'country': {u'timezone': u'Europe/London', u...

# Add a column to classify the shows as winners or losers (not winners)
winners['winner'] = 1
new_losers1['winner'] = 0
losers2['winner'] = 0

/Users/erhepp/anaconda/lib/python2.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
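
The warning above is raised because new_losers1 was created by selecting columns from losers1, so pandas cannot tell whether the assignment is meant to reach the original dataframe. A minimal sketch of one common way to avoid it, taking an explicit copy before adding the column. The small frame here is made-up stand-in data, not the pickled show data, and the `_demo` names are hypothetical:

```python
import pandas as pd

# Illustrative stand-in for losers1, including the two extra columns
losers1_demo = pd.DataFrame({
    "name": ["Show A", "Show B"],
    "id": [1, 2],
    "code": [None, 0],
    "message": [None, "Too Many Requests"],
})

cols = ["name", "id"]

# .copy() makes the selection an independent dataframe, so the
# assignment below is unambiguous and the original is left untouched
new_losers1_demo = losers1_demo[cols].copy()
new_losers1_demo["winner"] = 0

print(new_losers1_demo.columns.tolist())
print("winner" in losers1_demo.columns)
```

The same `.copy()` call could be appended to the column selection that creates new_losers1.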

Merge into one dataframe called shows

# Now concatenate the loser data to the winner data; the result is the dataframe shows
shows = winners.copy()
shows = shows.append(new_losers1, ignore_index=True)
shows = shows.append(losers2, ignore_index=True)
shows.shape

(634, 21)
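
DataFrame.append was later deprecated and then removed in pandas 2.0, so on a current pandas the same merge would need pd.concat. A sketch with small stand-in frames (the data and the `_demo` names are made up for illustration):

```python
import pandas as pd

winners_demo = pd.DataFrame({"name": ["W1", "W2"], "winner": [1, 1]})
losers1_demo = pd.DataFrame({"name": ["L1"], "winner": [0]})
losers2_demo = pd.DataFrame({"name": ["L2", "L3"], "winner": [0, 0]})

# ignore_index=True renumbers the rows 0..n-1, matching the
# ignore_index=True passed to append in the cell above
shows_demo = pd.concat(
    [winners_demo, losers1_demo, losers2_demo], ignore_index=True
)

print(shows_demo.shape)
```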

shows.head()

	status	rating	genres	weight	updated	name	language	schedule	url	officialSite	...	premiered	summary	_links	image	webChannel	runtime	type	id	network	winner
0	Ended	{u'average': 9.4}	[Nature]	87	1490631396	Planet Earth II	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/22036/planet-earth-ii	http://www.bbc.co.uk/programmes/p02544td	...	2016-11-06	<p>David Attenborough presents a documentary s...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Documentary	22036	{u'country': {u'timezone': u'Europe/London', u...	1
1	Ended	{u'average': 9.4}	[Drama, Action, War, History]	86	1492651730	Band of Brothers	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/465/band-of-brothers	http://www.hbo.com/band-of-brothers	...	2001-09-09	<p>Drawn from interviews with survivors of Eas...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Scripted	465	{u'country': {u'timezone': u'America/New_York'...	1
2	Ended	{u'average': 9.2}	[Nature]	82	1502854135	Planet Earth	English	{u'days': [u'Sunday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/768/planet-earth	http://www.bbc.co.uk/programmes/b006mywy	...	2006-03-05	<p>David Attenborough celebrates the amazing v...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Documentary	768	{u'country': {u'timezone': u'Europe/London', u...	1
3	Running	{u'average': 9.3}	[Drama, Adventure, Fantasy]	100	1502955537	Game of Thrones	English	{u'days': [u'Sunday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/82/game-of-thrones	http://www.hbo.com/game-of-thrones	...	2011-04-17	<p>Based on the bestselling book series <i>A S...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Scripted	82	{u'country': {u'timezone': u'America/New_York'...	1
4	Ended	{u'average': 9.3}	[Drama, Crime, Thriller]	97	1502331382	Breaking Bad	English	{u'days': [u'Sunday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/169/breaking-bad	http://www.amc.com/shows/breaking-bad	...	2008-01-20	<p><b>Breaking Bad</b> follows protagonist Wal...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Scripted	169	{u'country': {u'timezone': u'America/New_York'...	1

5 rows × 21 columns

# Check id column for any duplicates. There will be some from the losers for two reasons:
#    During the first pull, the API limitations were not known, so some requests returned the message
#       "Too Many Requests" rather than data; these need to be removed.
#    Some did not contain their own imdb number in the data, so when the list of imdb #s to recheck was generated,
#        these had to be included in the 2nd attempt as they could not be identified as being in the first pull.

shows = shows[shows['name'] != 'Too Many Requests']
print shows.shape

print "Duplicate show IDs", shows.duplicated('id').sum()

# Display the duplicates to visually examine before dropping
# shows[shows.isin(shows[shows.duplicated()])].sort("ID")
shows[shows.duplicated('id')]

(498, 21)
Duplicate show IDs 6

	status	rating	genres	weight	updated	name	language	schedule	url	officialSite	...	premiered	summary	_links	image	webChannel	runtime	type	id	network
601	Ended	{u'average': None}	[]	0	1477683583	Tyler Perry's House of Payne	English	{u'days': [u'Friday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/14013/tyler-perrys...	None	...	2007-06-06	<p><b>Tyler Perry's House of Payne</b> is a co...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Scripted	14013	{u'country': {u'timezone': u'America/New_York'...
602	Ended	{u'average': 3.3}	[Comedy]	4	1502774582	The Inbetweeners	English	{u'days': [u'Monday'], u'time': u'22:30'}	http://www.tvmaze.com/shows/1138/the-inbetweeners	None	...	2012-08-20	<p><b>The Inbetweeners</b> takes a comedic loo...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Scripted	1138	{u'country': {u'timezone': u'America/New_York'...
603	Ended	{u'average': 6}	[Family]	0	1497646938	19 Kids and Counting	English	{u'days': [u'Tuesday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/969/19-kids-and-co...	http://www.tlc.com/tv-shows/19-kids-and-counting/	...	2008-09-29	<p><b>19 Kids and Counting</b> follows Michell...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Reality	969	{u'country': {u'timezone': u'America/New_York'...
604	Ended	{u'average': 9}	[Comedy, Food, Family]	0	1463627692	Talia in the Kitchen	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/2369/talia-in-the-...	http://www.nick.com/talia-in-the-kitchen/	...	2015-07-06	<p>When 14-year-old Talia visits her grandmoth...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	30	Scripted	2369	{u'country': {u'timezone': u'America/New_York'...
605	Running	{u'average': 3.8}	[]	48	1497310190	The Factor	English	{u'days': [u'Monday', u'Tuesday', u'Wednesday'...	http://www.tvmaze.com/shows/9066/the-factor	http://www.foxnews.com/shows/the-oreilly-facto...	...	1996-10-07	<p><b>The Factor</b>, originally titled <i>The...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	News	9066	{u'country': {u'timezone': u'America/New_York'...
606	Ended	{u'average': None}	[Drama, Comedy, Music]	0	1462214107	Viva Laughlin	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/6924/viva-laughlin	None	...	2007-10-18	<p>A remake of the British series <i>Blackpool...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Scripted	6924	{u'country': {u'timezone': u'America/New_York'...

6 rows × 21 columns

# Validate that these are really duplicates by looking at both rows with the duplicate id
shows[shows['id'] == 6924]

	status	rating	genres	weight	updated	name	language	schedule	url	officialSite	...	premiered	summary	_links	image	webChannel	runtime	type	id	network	winner
462	Ended	{u'average': None}	[Drama, Comedy, Music]	0	1462214107	Viva Laughlin	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/6924/viva-laughlin	None	...	2007-10-18	<p>A remake of the British series <i>Blackpool...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Scripted	6924	{u'country': {u'timezone': u'America/New_York'...	0
606	Ended	{u'average': None}	[Drama, Comedy, Music]	0	1462214107	Viva Laughlin	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/6924/viva-laughlin	None	...	2007-10-18	<p>A remake of the British series <i>Blackpool...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60	Scripted	6924	{u'country': {u'timezone': u'America/New_York'...	0

2 rows × 21 columns

# All 6 of these check out as true duplicates, so remove the 2nd instance of each
shows = shows.drop_duplicates(subset='id')

shows.shape

(492, 21)
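
To inspect every copy of a duplicated id at once, rather than looking up one id at a time, duplicated accepts keep=False, which marks all members of each duplicate group. A small sketch on made-up rows that reuse a few ids and names from the output above:

```python
import pandas as pd

demo = pd.DataFrame({
    "id": [6924, 82, 969, 6924, 969],
    "name": ["Viva Laughlin", "Game of Thrones", "19 Kids and Counting",
             "Viva Laughlin", "19 Kids and Counting"],
})

# keep=False flags both the first and the later occurrences
all_copies = demo[demo.duplicated("id", keep=False)].sort_values("id")
print(all_copies)

# drop_duplicates keeps the first occurrence of each id by default
deduped = demo.drop_duplicates(subset="id")
print(deduped.shape)
```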

# make a copy, so there's a backup without having to re-pull shows info from api or from pickle and recombine
df_shows = shows.copy()

# Subdivide the columns so we can fit sections of the dataframe in notebook windows to see what we have
first_cols = df_shows.columns[1:10]
second_cols = df_shows.columns[10:17]
third_cols = df_shows.columns[17:]

df_shows[first_cols].head()

	rating	genres	weight	updated	name	language	schedule	url	officialSite
0	{u'average': 9.4}	[Nature]	87	1490631396	Planet Earth II	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/22036/planet-earth-ii	http://www.bbc.co.uk/programmes/p02544td
1	{u'average': 9.4}	[Drama, Action, War, History]	86	1492651730	Band of Brothers	English	{u'days': [u'Sunday'], u'time': u'20:00'}	http://www.tvmaze.com/shows/465/band-of-brothers	http://www.hbo.com/band-of-brothers
2	{u'average': 9.2}	[Nature]	82	1502854135	Planet Earth	English	{u'days': [u'Sunday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/768/planet-earth	http://www.bbc.co.uk/programmes/b006mywy
3	{u'average': 9.3}	[Drama, Adventure, Fantasy]	100	1502955537	Game of Thrones	English	{u'days': [u'Sunday'], u'time': u'21:00'}	http://www.tvmaze.com/shows/82/game-of-thrones	http://www.hbo.com/game-of-thrones
4	{u'average': 9.3}	[Drama, Crime, Thriller]	97	1502331382	Breaking Bad	English	{u'days': [u'Sunday'], u'time': u'22:00'}	http://www.tvmaze.com/shows/169/breaking-bad	http://www.amc.com/shows/breaking-bad

df_shows[second_cols].head()

	externals	premiered	summary	_links	image	webChannel	runtime
0	{u'thetvdb': 318408, u'tvrage': None, u'imdb':...	2016-11-06	<p>David Attenborough presents a documentary s...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60
1	{u'thetvdb': 74205, u'tvrage': 2708, u'imdb': ...	2001-09-09	<p>Drawn from interviews with survivors of Eas...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60
2	{u'thetvdb': 79257, u'tvrage': 8077, u'imdb': ...	2006-03-05	<p>David Attenborough celebrates the amazing v...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60
3	{u'thetvdb': 121361, u'tvrage': 24493, u'imdb'...	2011-04-17	<p>Based on the bestselling book series <i>A S...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60
4	{u'thetvdb': 81189, u'tvrage': 18164, u'imdb':...	2008-01-20	<p><b>Breaking Bad</b> follows protagonist Wal...	{u'previousepisode': {u'href': u'http://api.tv...	{u'medium': u'http://static.tvmaze.com/uploads...	None	60

df_shows[third_cols].head()

	type	id	network	winner
0	Documentary	22036	{u'country': {u'timezone': u'Europe/London', u...	1
1	Scripted	465	{u'country': {u'timezone': u'America/New_York'...	1
2	Documentary	768	{u'country': {u'timezone': u'Europe/London', u...	1
3	Scripted	82	{u'country': {u'timezone': u'America/New_York'...	1
4	Scripted	169	{u'country': {u'timezone': u'America/New_York'...	1

Cleanup and Organization of the DataFrame

# Cleanup and Organization

# The genres column is generally a list of strings, but is missing some values, and has empty lists for others.
#   1. Change all NaN to []
#   2. Convert all to strings
#   3. Use Count Vectorizer to make new columns for each genre
#   4. Remove existing genres column

df_shows['genres'] = df_shows['genres'].fillna(0).map(lambda x: [] if x == 0 else x)
df_shows['genres'] = df_shows['genres'].map(lambda x: ','.join(x))

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(
    binary=True,
    tokenizer=(lambda x: x.split(','))
    )
cvfit = cv.fit_transform(df_shows['genres']).todense()
genre_cols = pd.DataFrame(cvfit, columns=cv.get_feature_names())
genre_cols.rename(columns={'' : 'unknown'}, inplace=True)
genre_cols.columns

Index([        u'unknown',          u'action',           u'adult',
             u'adventure',           u'anime',        u'children',
                u'comedy',           u'crime',           u'drama',
             u'espionage',          u'family',         u'fantasy',
                  u'food',         u'history',          u'horror',
                 u'legal',         u'medical',           u'music',
               u'mystery',          u'nature',         u'romance',
       u'science-fiction',          u'sports',    u'supernatural',
              u'thriller',          u'travel',             u'war',
               u'western'],
      dtype='object')

new_genre_columns = []
for item in genre_cols:
    new_genre_columns.append('gn_' + item)
genre_cols.columns = new_genre_columns
genre_cols.head()

	gn_action	gn_adventure	gn_crime	gn_drama	...	gn_nature	gn_thriller	gn_war
0	0	0	0	0	...	1	0	0
1	1	0	0	1	...	0	0	1
2	0	0	0	0	...	1	0	0
3	0	1	0	1	...	0	0	0
4	0	0	1	1	...	0	1	0

5 rows × 28 columns
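
A binary CountVectorizer with a comma-splitting tokenizer amounts to multi-hot encoding of the genre lists. A plain-Python sketch of the same idea follows. multi_hot is a hypothetical helper, not part of the notebook, and its columns come out in plain alphabetical order:

```python
def multi_hot(rows, prefix="gn_", unknown="unknown"):
    """Return (column_names, matrix) with one 0/1 column per distinct label."""
    # Lowercase to mirror CountVectorizer's default; an empty list maps to `unknown`
    cleaned = [[g.lower() for g in row] or [unknown] for row in rows]
    vocab = sorted({g for row in cleaned for g in row})
    columns = [prefix + g for g in vocab]
    matrix = [[1 if g in row else 0 for g in vocab] for row in cleaned]
    return columns, matrix


genres = [["Nature"], ["Drama", "Action", "War", "History"], []]
columns, matrix = multi_hot(genres)
print(columns)
for row in matrix:
    print(row)
```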

# Add the new genre columns to the df_shows dataframe
df_shows = pd.concat([df_shows, genre_cols], axis=1, join_axes=[df_shows.index])
df_shows = df_shows.drop('genres', 1)
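
One thing to watch for in the concat above: genre_cols is built from a dense matrix and so gets a fresh 0..n-1 index, while df_shows keeps its original row labels after the earlier row drops. Because a column-wise concat aligns on labels, rows may be paired with the wrong genre values or receive NaN. A sketch of the effect and one possible fix, using made-up data; it uses DataFrame.join because join_axes was removed from pd.concat in pandas 1.0:

```python
import pandas as pd

# Stand-in for df_shows after rows 1 and 2 were dropped: labels 0, 3, 4
left = pd.DataFrame({"name": ["A", "D", "E"]}, index=[0, 3, 4])

# Stand-in for genre_cols: built from a matrix, so labels are 0, 1, 2
right = pd.DataFrame({"gn_drama": [1, 0, 1]})

# Label-based alignment: only label 0 matches, the rest become NaN
misaligned = left.join(right)
print(misaligned)

# Giving the new columns the same labels restores row-by-row pairing
right_aligned = right.copy()
right_aligned.index = left.index
aligned = left.join(right_aligned)
print(aligned)
```

Resetting the index of df_shows right after the duplicates are dropped would be another way to keep the two frames in step.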

# Genre information is missing for 69 loser shows and 13 winner shows

df_shows[df_shows['gn_unknown'] ==1][['gn_unknown', 'winner']].groupby(['winner']).count()

	gn_unknown
winner
0	69
1	13

df_shows.columns

Index([            u'status',             u'rating',             u'weight',
                  u'updated',               u'name',           u'language',
                 u'schedule',                u'url',       u'officialSite',
                u'externals',          u'premiered',            u'summary',
                   u'_links',              u'image',         u'webChannel',
                  u'runtime',               u'type',                 u'id',
                  u'network',             u'winner',         u'gn_unknown',
                u'gn_action',           u'gn_adult',       u'gn_adventure',
                 u'gn_anime',        u'gn_children',          u'gn_comedy',
                 u'gn_crime',           u'gn_drama',       u'gn_espionage',
                u'gn_family',         u'gn_fantasy',            u'gn_food',
               u'gn_history',          u'gn_horror',           u'gn_legal',
               u'gn_medical',           u'gn_music',         u'gn_mystery',
                u'gn_nature',         u'gn_romance', u'gn_science-fiction',
                u'gn_sports',    u'gn_supernatural',        u'gn_thriller',
                u'gn_travel',             u'gn_war',         u'gn_western'],
      dtype='object')

# Convert the rating to a number
# Sometimes the rating column is NaN, and sometimes the value for 'average' in the dictionary is missing,
# so the missing values must be handled twice, once for each case.
# This code first fills the missing dictionaries with -1 (value chosen to signify no rating).
# It then sets the column to the average value in the rating dictionary, and if that is NaN converts it to -1.

df_shows['rating'] = df_shows['rating'].fillna(-1).map(lambda x: -1 if x == -1 else x['average']).fillna(-1)

# Rating information is missing for 192 loser shows and 6 winner shows
df_shows[df_shows['rating'] == -1][['rating', 'winner']].groupby(['winner']).count()

	rating
winner
0	192
1	6
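
The two-stage handling of missing ratings can be hard to read as a chained one-liner. A sketch of the same logic written out as a function; extract_average is a hypothetical helper name, and -1 is the sentinel the notebook already uses:

```python
import math


def extract_average(rating, missing=-1):
    """Return rating['average'] as a float, or `missing` if it is unavailable."""
    # Case 1: the whole rating entry is missing (None or NaN)
    if rating is None or (isinstance(rating, float) and math.isnan(rating)):
        return missing
    # Case 2: the dictionary exists but has no usable average
    average = rating.get("average")
    if average is None:
        return missing
    return float(average)


print(extract_average({"average": 9.4}))
print(extract_average({"average": None}))
print(extract_average(float("nan")))
```

It could be applied with df_shows['rating'].map(extract_average) in place of the chained fillna calls.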

# Unpack 'schedule' into days, treating NaN in a similar way
df_shows['sched_day'] = df_shows['schedule'].fillna(0).map(lambda x: [] if x == 0 else x)
df_shows['sched_day'] = df_shows['sched_day'].map(lambda x: x if x == [] else x['days'])
df_shows['sched_day'] = df_shows['sched_day'].map(lambda x: ','.join(x))

cv = CountVectorizer(
    binary=True,
    tokenizer=(lambda x: x.split(','))
    )
cvfit = cv.fit_transform(df_shows['sched_day']).todense()
day_cols = pd.DataFrame(cvfit, columns=cv.get_feature_names())
day_cols.rename(columns={'' : 'unknown'}, inplace=True)
day_cols.columns

Index([  u'unknown',    u'friday',    u'monday',  u'saturday',    u'sunday',
        u'thursday',   u'tuesday', u'wednesday'],
      dtype='object')

new_day_columns = []
for item in day_cols:
    new_day_columns.append('sched_' + item)
day_cols.columns = new_day_columns
day_cols.head()

	sched_unknown	sched_friday	sched_monday	sched_saturday	sched_sunday	sched_thursday	sched_tuesday	sched_wednesday
0	0	0	0	0	1	0	0	0
1	0	0	0	0	1	0	0	0
2	0	0	0	0	1	0	0	0
3	0	0	0	0	1	0	0	0
4	0	0	0	0	1	0	0	0

# Add the new day columns to the df_shows dataframe
df_shows = pd.concat([df_shows, day_cols], axis=1, join_axes=[df_shows.index])

df_shows = df_shows.drop('sched_day', 1)

# Scheduled Day information is missing for 15 loser shows and 45 winner shows

df_shows[df_shows['sched_unknown'] ==1][['sched_unknown', 'winner']].groupby(['winner']).count()

	sched_unknown
winner
0	15
1	45

# Unpack 'schedule' into times, treating NaN in a similar way.
# Samples with a valid show time will be 'HH:MM' and missing values will be ':'
df_shows['sched_time'] = df_shows['schedule'].fillna(':').map(lambda x: x if x == ':' else x['time'])
df_shows['sched_time'] = df_shows['sched_time'].map(lambda x: ':' if x == '' else x)

# Scheduled Time information is missing for 35 loser shows and 61 winner shows

print len(df_shows[df_shows['sched_time'] == ':'])
df_shows[df_shows['sched_time'] == ':'][['sched_time', 'winner']].groupby(['winner']).count()

	sched_time
winner
0	35
1	61
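
The day and time unpacking above both read from the same schedule dictionary. A sketch of a single helper that returns both pieces and applies the same placeholders (an empty string for no days, ':' for no time); unpack_schedule is a hypothetical name, not something defined in the notebook:

```python
def unpack_schedule(schedule, missing_time=":"):
    """Return (comma-joined days, time) from a TVmaze-style schedule dict."""
    if not isinstance(schedule, dict):
        # Missing entry (None or NaN): no days, placeholder time
        return "", missing_time
    days = ",".join(schedule.get("days") or [])
    time = schedule.get("time") or missing_time
    return days, time


print(unpack_schedule({"days": ["Sunday"], "time": "22:30"}))
print(unpack_schedule({"days": [], "time": ""}))
print(unpack_schedule(None))
```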

# Sched time is a string in HH:MM format. I will leave it as a string and count vectorize it;
# each value holds a single time, so the comma tokenizer yields exactly one token per show
print type(df_shows.loc[0,'sched_time'])

cv = CountVectorizer(
    binary=True,
    tokenizer=(lambda x: x.split(','))
    )
cvfit = cv.fit_transform(df_shows['sched_time']).todense()
time_cols = pd.DataFrame(cvfit, columns=cv.get_feature_names())
time_cols.rename(columns={':' : 'unknown'}, inplace=True)
time_cols.columns

<type 'unicode'>





Index([  u'00:00',   u'00:30',   u'00:50',   u'00:55',   u'01:00',   u'01:05',
         u'01:30',   u'01:35',   u'02:00',   u'02:05',   u'08:00',   u'10:00',
         u'11:00',   u'12:00',   u'13:00',   u'13:30',   u'14:00',   u'14:30',
         u'15:00',   u'15:15',   u'16:00',   u'16:30',   u'17:00',   u'17:15',
         u'17:30',   u'18:00',   u'18:30',   u'19:00',   u'19:30',   u'19:45',
         u'20:00',   u'20:15',   u'20:30',   u'20:40',   u'20:45',   u'20:55',
         u'21:00',   u'21:10',   u'21:15',   u'21:30',   u'21:45',   u'22:00',
         u'22:10',   u'22:30',   u'22:35',   u'23:00',   u'23:02',   u'23:15',
         u'23:30', u'unknown'],
      dtype='object')

new_time_columns = []
for item in time_cols:
    new_time_columns.append('sched_time_' + item)
time_cols.columns = new_time_columns
time_cols.head()

	...	sched_time_22:00
0	...	0
1	...	0
2	...	0
3	...	0
4	...	1

5 rows × 50 columns

# Add the new time columns to the df_shows dataframe
df_shows = pd.concat([df_shows, time_cols], axis=1, join_axes=[df_shows.index])

df_shows = df_shows.drop('schedule', 1)

df_shows = df_shows.drop('sched_time', 1)

print df_shows.columns

Index([            u'status',             u'rating',             u'weight',
                  u'updated',               u'name',           u'language',
                      u'url',       u'officialSite',          u'externals',
                u'premiered',
       ...
         u'sched_time_21:45',   u'sched_time_22:00',   u'sched_time_22:10',
         u'sched_time_22:30',   u'sched_time_22:35',   u'sched_time_23:00',
         u'sched_time_23:02',   u'sched_time_23:15',   u'sched_time_23:30',
       u'sched_time_unknown'],
      dtype='object', length=105)

# Print out a network dictionary to learn how to unpack the structure
df_shows.loc[0,'network']

{u'country': {u'code': u'GB',
  u'name': u'United Kingdom',
  u'timezone': u'Europe/London'},
 u'id': 12,
 u'name': u'BBC One'}

# 25 shows have no network info; these might need to be dropped, but are dummied for now
df_shows['network'].isnull().sum()

# Unpack 'network' into country code, country name, timezone, network id and network name, treating NaN in a similar way
df_shows['country_code'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['country'])
df_shows['country_code'] = df_shows['country_code'].map(lambda x: x if x == '' else x['code'])

df_shows['country_name'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['country'])
df_shows['country_name'] = df_shows['country_name'].map(lambda x: x if x == '' else x['name'])

df_shows['country_tz'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['country'])
df_shows['country_tz'] = df_shows['country_tz'].map(lambda x: x if x == '' else x['timezone'])

df_shows['network_id'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['id'])
df_shows['network_name'] = df_shows['network'].fillna('').map(lambda x: x if x == '' else x['name'])

df_shows = df_shows.drop(['network'], 1)
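
The repeated fillna/map pairs above all walk the same nested dictionary. A sketch of a single flattening function that produces the five fields in one pass, with '' standing in for missing values as in the cell above; unpack_network is a hypothetical helper name:

```python
def unpack_network(network):
    """Flatten a TVmaze-style network dict; missing values become ''."""
    if not isinstance(network, dict):
        network = {}
    country = network.get("country") or {}
    return {
        "country_code": country.get("code", ""),
        "country_name": country.get("name", ""),
        "country_tz": country.get("timezone", ""),
        "network_id": network.get("id", ""),
        "network_name": network.get("name", ""),
    }


bbc = {
    "country": {"code": "GB", "name": "United Kingdom", "timezone": "Europe/London"},
    "id": 12,
    "name": "BBC One",
}
print(unpack_network(bbc))
print(unpack_network(None))
```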

# Country and network information is missing for 4 loser shows and 21 winner shows

df_shows[df_shows['country_code'] == ''] [['country_code', 'winner']].groupby(['winner']).count()

	country_code
winner
0	4
1	21

df_shows[['country_code', 'country_name', 'country_tz', 'network_id', 'network_name']].head()

	country_code	country_name	country_tz	network_id	network_name
0	GB	United Kingdom	Europe/London	12	BBC One
1	US	United States	America/New_York	8	HBO
2	GB	United Kingdom	Europe/London	12	BBC One
3	US	United States	America/New_York	8	HBO
4	US	United States	America/New_York	20	AMC

df_shows[['updated', 'premiered']].head()

	updated	premiered
0	1490631396	2016-11-06
1	1492651730	2001-09-09
2	1502854135	2006-03-05
3	1502955537	2011-04-17
4	1502331382	2008-01-20

# Updated date is complete, premiered date is missing 6 values

print df_shows['updated'].isnull().sum()
print df_shows['premiered'].isnull().sum()

0
6

# Represent updated as a real datetime object; it is currently seconds since the epoch (1970).
# The values are already ints, so they can be converted to datetime directly.
import datetime
print type(df_shows.loc[0,'updated'])

df_shows['updated'] = df_shows['updated'].fillna(0).apply(lambda x: x if x == 0 else datetime.datetime.fromtimestamp(x))

<type 'int'>

# Turn premiered into a real datetime object; it is currently a 'YYYY-MM-DD' string. Missing values are left as 0.
print type(df_shows.loc[0,'premiered'])
df_shows['premiered'] = df_shows['premiered'].fillna(0).apply(lambda x: x if x == 0 else datetime.datetime.strptime(x, '%Y-%m-%d'))

<type 'unicode'>

df_shows[['updated', 'premiered']].head()

	updated	premiered
0	2017-03-27 12:16:36	2016-11-06 00:00:00
1	2017-04-19 21:28:50	2001-09-09 00:00:00
2	2017-08-15 23:28:55	2006-03-05 00:00:00
3	2017-08-17 03:38:57	2011-04-17 00:00:00
4	2017-08-09 22:16:22	2008-01-20 00:00:00
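
datetime.datetime.fromtimestamp without a time zone argument converts to the local time of the machine running the notebook, so the updated values shown above depend on where the code was run. A sketch of a UTC-based conversion, assuming Python 3 (the notebook itself runs Python 2, where datetime.timezone is not available); to_utc is a hypothetical helper name:

```python
import datetime


def to_utc(epoch_seconds):
    """Convert seconds since the 1970 epoch to a timezone-aware UTC datetime."""
    return datetime.datetime.fromtimestamp(epoch_seconds, tz=datetime.timezone.utc)


# 1490631396 is the 'updated' value of the first row
updated = to_utc(1490631396)
print(updated.isoformat())  # 2017-03-27T16:16:36+00:00
```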

# Updated date is complete, premiered date is missing 6 values, all from loser shows

df_shows[df_shows['premiered'] == 0] [['premiered', 'winner']].groupby(['winner']).count()

	premiered
winner
0	6
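
Filling missing premiere dates with 0 leaves a column that mixes datetime objects with integers, which is why the check above compares against 0. A sketch of an alternative that returns None for missing values, assuming Python 3 strings; parse_premiered is a hypothetical helper name:

```python
import datetime


def parse_premiered(value):
    """Parse a 'YYYY-MM-DD' string; return None when the value is missing."""
    if not isinstance(value, str) or not value:
        return None
    return datetime.datetime.strptime(value, "%Y-%m-%d")


print(parse_premiered("2016-11-06"))
print(parse_premiered(None))
```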

# Drop columns not useful for analysis

# webChannel is null for nearly every show, so it can be dropped
print "webChannel null count:", df_shows['webChannel'].isnull().sum()

# Drop url, officialSite, externals, _links, image and webChannel
df_shows = df_shows.drop(['url', 'officialSite', 'externals', '_links', 'image', 'webChannel'], 1)

webChannel null count: 464

# Looks like runtime is already an integer number of minutes
# runtime is missing 9 values, 5 winners and 4 losers
print type(df_shows.loc[0,'runtime'])
print df_shows['runtime'].isnull().sum(), " null values"
# df_shows['runtime'].value_counts()

<type 'int'>
9  null values

df_shows[df_shows['runtime'].isnull()][['runtime', 'winner']]

	runtime	winner
17	None	1
65	None	1
137	None	1
144	None	1
198	None	1
544	None	0
556	None	0
577	None	0
609	None	0

# summary is a string that contains HTML tags; these will be removed in the text processing steps during analysis
print df_shows.loc[0,'summary']
print df_shows['summary'].isnull().sum(), " null values"

<p>David Attenborough presents a documentary series exploring how animals meet the challenges of surviving in the most iconic habitats on earth.</p>
1  null values

df_shows[df_shows['summary'].isnull()]

	status	rating	weight	updated	name	language	premiered	summary	runtime	type	...	sched_time_23:00	sched_time_23:02	sched_time_23:15	sched_time_23:30	sched_time_unknown	country_code	country_name	country_tz	network_id	network_name
570	Ended	-1.0	75	2017-04-18 17:55:28	Chop Socky Chooks	English	2008-03-07 00:00:00	None	11	Animation	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	11	Cartoon Network

1 rows × 104 columns

# This one with the missing summary, Chop Socky Chooks, is missing other information also, and will be dropped.
# Too bad,  looks like a truly dreadful one that would be good for the very bottom of the losers list.
df_shows = df_shows[df_shows['summary'].notnull()]
df_shows.shape

(491, 104)

# Use textacy to clean the html tags, punctuation, etc. from the summary text
from textacy.preprocess import preprocess_text

df_shows['summary'] = df_shows['summary'].map(lambda x: preprocess_text(x, fix_unicode=True, lowercase=True, \
                              transliterate=False, no_contractions = True,
                              no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True))

print df_shows.loc[1,'summary']
print
print df_shows.loc[2,'summary']

<p>drawn from interviews with survivors of easy company as well as their journals and letters <b>band of brothers<b> chronicles the experiences of these men from paratrooper training in georgia through the end of the war as an elite rifle company parachuting into normandy early on dday morning participants in the battle of the bulge and witness to the horrors of war the men of easy knew extraordinary bravery and extraordinary fear and became the stuff of legend based on stephen e ambroses acclaimed book of the same name<p>

<p>david attenborough celebrates the amazing variety of the natural world in this epic documentary series filmed over four years across 64 different countries<p>

# Looks like all the summaries have html paragraph <p> and bold <b> tags, and textacy hasn't removed them.
# Punctuation removal apparently stripped the slash from the closing tags, so </p> and </b> now show up as <p> and <b>.
# These lambda functions knock them out

df_shows['summary'] = df_shows['summary'].map(lambda x: x.replace('<p>',''))
df_shows['summary'] = df_shows['summary'].map(lambda x: x.replace('<b>',''))

# This looks better for analysis
print df_shows.loc[1,'summary']

drawn from interviews with survivors of easy company as well as their journals and letters band of brothers chronicles the experiences of these men from paratrooper training in georgia through the end of the war as an elite rifle company parachuting into normandy early on dday morning participants in the battle of the bulge and witness to the horrors of war the men of easy knew extraordinary bravery and extraordinary fear and became the stuff of legend based on stephen e ambroses acclaimed book of the same name
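
The two `replace` calls above only cover `<p>` and `<b>`, while some summaries in the table further down also contain `<i>` tags. One possible alternative, sketched here as an assumption rather than what this notebook ran, is to strip every tag with a regular expression before the punctuation removal step, while the tags are still intact. The helper name `strip_html_tags` and the sample string are made up for illustration.

```python
import re

# Matches any opening or closing HTML tag, e.g. <p>, </p>, <b>, <i>
TAG_RE = re.compile(r'<[^>]+>')

def strip_html_tags(text):
    """Replace each HTML tag with a space, then collapse repeated whitespace."""
    no_tags = TAG_RE.sub(' ', text)
    return ' '.join(no_tags.split())

# Made-up sample in the same shape as the TVmaze summaries
raw = '<p><b>Some Show</b> is a <i>made-up</i> example.</p>'
cleaned = strip_html_tags(raw)
print(cleaned)
```

Applied to the column, this would look like `df_shows['summary'].map(strip_html_tags)`, run before the textacy preprocessing call.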

df_shows[df_shows.isnull().any(axis=1)]

	status	rating	weight	updated	name	language	premiered	summary	runtime	type	...	sched_time_23:00	sched_time_23:02	sched_time_23:15	sched_time_23:30	sched_time_unknown	country_code	country_name	country_tz	network_id	network_name
17	Ended	9.0	0	1455913373	The Decalogue	Polish	1989-12-10 00:00:00	<p>Ten television drama films, each one based ...	None	Variety	...	0.0	0.0	0.0	0.0	1.0	PL	Poland	Europe/Warsaw	336	TVP1
65	Ended	9.0	85	1501781828	Sherlock Holmes	English	1984-04-24 00:00:00	<p><b>Sherlock Holmes</b> is one of the world'...	None	Scripted	...	0.0	0.0	0.0	0.0	0.0	GB	United Kingdom	Europe/London	35	ITV
137	Running	8.7	98	1489944935	Taboo	English	2017-01-07 00:00:00	<p>1814: James Keziah Delaney returns to Londo...	None	Scripted	...	0.0	0.0	0.0	0.0	0.0	GB	United Kingdom	Europe/London	12	BBC One
144	Ended	8.6	42	1494693177	The New Batman Adventures	English	1997-09-13 00:00:00	<p>The New Batman Adventures comes from the cr...	None	Animation	...	0.0	0.0	0.0	0.0	1.0	US	United States	America/New_York	71	The WB
198	Ended	9.0	0	1491564027	The Larry Sanders Show	English	1992-08-15 00:00:00	<p>Comic Garry Shandling draws upon his own ta...	None	Scripted	...	0.0	0.0	0.0	0.0	1.0	US	United States	America/New_York	8	HBO
492	Running	-1.0	76	1502312151	Big Brother After Dark	English	2007-07-05 00:00:00	<p><b>Big Brother After Dark</b> is the live, ...	180	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	88	Pop
493	Ended	1.0	0	1474827145	American Paranormal	English	2010-01-24 00:00:00	<p>Whether it is the existence of aliens, the ...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	42	National Geographic Channel
494	Ended	-1.0	11	1469108505	Homeboys in Outer Space	English	1996-08-27 00:00:00	<p>The plot centers around two astronauts, Tyb...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	70	UPN
495	Ended	-1.0	0	1485097253	Gainesville: Friends Are Family	English	2015-08-20 00:00:00	<p><i><b>"Gainesville: Friends Are Family"</b>...	30	Documentary	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	173	CMT
496	Ended	-1.0	0	1449234102	The Show with Vinny	English	2013-05-01 00:00:00	<p>Vinny Guadagnino invites musicians, TV star...	30	Talk Show	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	22	MTV
497	Ended	-1.0	0	1457985576	Gormiti Nature Unleashed	French	2013-04-01 00:00:00	<p>Gormiti Nature Unleashed is an Italian CGI ...	25	Animation	...	NaN	NaN	NaN	NaN	NaN	FR	France	Europe/Paris	1050	Canal J
498	Ended	-1.0	23	1483294279	Denise Richards: It's Complicated	English	2008-05-26 00:00:00	<p><b>Denise Richards: It's Complicated</b> is...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	43	E!
499	Ended	-1.0	0	1482875019	Stanley	English	1956-09-24 00:00:00	<p><b>Stanley</b> revolved around the adventur...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	1	NBC
500	Ended	1.0	0	1468782928	Uncovering Aliens	English	2013-12-15 00:00:00	<p>Across America, there are more UFO sighting...	60	Documentary	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	92	Animal Planet
501	Ended	-1.0	0	1477142177	Bulging Brides	English	2008-01-31 00:00:00	<p><b>Bulging Brides</b> is a television serie...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	CA	Canada	Canada/Atlantic	472	Slice
502	Running	6.7	0	1502923678	Never Ever Do This at Home	English	2013-05-06 00:00:00	<p><b>Never Ever Do This at Home</b> is a come...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	CA	Canada	Canada/Atlantic	298	Discovery Channel
503	Ended	-1.0	0	1465987779	Hello Ross	English	2013-09-06 00:00:00	<p><i><b>"Hello Ross"</b></i> is the new weekl...	30	Talk Show	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	43	E!
504	Ended	-1.0	0	1499803314	3	English	2012-07-26 00:00:00	<p><b>3</b> is a new relationship series in wh...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	2	CBS
505	Ended	-1.0	0	1495568447	Trexx and Flipside	English	0	<p>Wannabe hip hop stars but their music label...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	GB	United Kingdom	Europe/London	49	BBC Three
506	Running	8.5	96	1503483430	The Real Housewives of Orange County	English	2006-03-21 00:00:00	<p>These ladies show no signs of slowing down ...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	52	Bravo
507	Ended	5.3	16	1479782037	Skins	English	2011-01-17 00:00:00	<p><b>Skins</b> is about the lives and loves o...	60	Scripted	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	22	MTV
508	Running	-1.0	73	1503490679	Dr. Phil	English	2002-09-16 00:00:00	<p>The <b>Dr. Phil</b> show provides the most ...	60	Talk Show	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	72	Syndication
509	Running	7.5	50	1497449904	My Big Fat American Gypsy Wedding	English	2012-04-29 00:00:00	<p>Going inside the hidden world of American G...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	80	TLC
510	Running	1.0	0	1479731918	Mystery Diners	English	2012-05-20 00:00:00	<p>When a restaurant owner suspects employees ...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	81	Food Network
511	Running	-1.0	0	1498393231	Pig Goat Banana Cricket	English	2015-07-18 00:00:00	<p><b>Pig Goat Banana Cricket</b> features a s...	30	Animation	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	73	nicktoons
512	Ended	-1.0	9	1460230772	Jerseylicious	English	2010-03-21 00:00:00	<p>Jerseylicious is a reality show which takes...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	184	Esquire Network
513	Ended	-1.0	38	1501384818	South Beach Tow	English	2011-07-20 00:00:00	<p>The <b>South Beach Tow</b> crew returns to ...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	84	truTV
514	Ended	-1.0	0	1466882679	Starhyke	English	2009-11-30 00:00:00	<p>It's the year 3034. Everyone on Earth has b...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	GB	United Kingdom	Europe/London	324	Showcase TV
515	Ended	-1.0	0	1496675604	Making the Band	English	2000-03-24 00:00:00	<p><b>Making the Band</b> was the brainchild o...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	22	MTV
516	Running	4.5	68	1480821374	Second Jen	English	2016-08-28 00:00:00	<p><b>Second Jen</b> is a ground-breaking scri...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	CA	Canada	Canada/Atlantic	151	City
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
598	Ended	-1.0	0	1502593090	CeeLo Green's The Good Life	English	2014-06-23 00:00:00	<p>Follow CeeLo as he tackles not only a packe...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	32	TBS
599	Ended	-1.0	0	1477193480	America's Prom Queen	English	2008-03-17 00:00:00	<p><b>America's Prom Queen</b> is a reality TV...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	26	FreeForm
600	Ended	-1.0	0	1461445299	Hollywood Me	English	2013-06-19 00:00:00	<p>Martyn Lawrence Bullard's normal clients in...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	GB	United Kingdom	Europe/London	45	Channel 4
607	Ended	-1.0	52	1490313454	Utopia	English	2014-09-07 00:00:00	<p>Get ready to witness the birth of a brave n...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	4	FOX
608	Running	-1.0	93	1499236738	Storage Wars: Canada	English	2013-08-29 00:00:00	<p>On a daily basis, high-stakes buyers descen...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	CA	Canada	Canada/Atlantic	350	OLN
609	Ended	-1.0	0	1502217322	Big Brother	English	2001-04-23 00:00:00	<p><b>Big Brother Australia</b> is based on th...	None	Reality	...	NaN	NaN	NaN	NaN	NaN	AU	Australia	Australia/Sydney	120	Nine Network
610	Ended	-1.0	0	1497307824	The Vineyard	English	2013-07-23 00:00:00	<p>ABC Family's newest original docu-series, <...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	26	FreeForm
611	Running	-1.0	0	1503655502	Na dobre i na złe	Polish	1999-11-07 00:00:00		60	Scripted	...	NaN	NaN	NaN	NaN	NaN	PL	Poland	Europe/Warsaw	333	TVP2
612	Ended	-1.0	0	1477348482	Big Top	English	2009-12-02 00:00:00	<p><b>Big Top</b> was a sit-com that aired on ...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	GB	United Kingdom	Europe/London	12	BBC One
613	Running	9.0	0	1468322551	MTV Suspect	English	2016-02-23 00:00:00	<p>Across America, people are hiding deep secr...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	22	MTV
614	Ended	-1.0	0	1497305713	Kimora Life in the Fab Lane	English	2007-08-05 00:00:00	<p>A glimpse into the life of former model Kim...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	43	E!
615	Ended	-1.0	0	1490293113	Celebrities Undercover	English	2014-03-18 00:00:00	<p>Celebrities are used to transforming into o...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	79	Oxygen
616	Running	-1.0	0	1458216770	Recipe for Deception	English	2016-01-21 00:00:00	<p>Bravo Media cooks up a battle of secrets an...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	52	Bravo
617	Ended	-1.0	0	1481538915	16 Kids and Counting	English	2013-01-11 00:00:00	<p>What's life like when you have enough child...	60	Documentary	...	NaN	NaN	NaN	NaN	NaN	GB	United Kingdom	Europe/London	45	Channel 4
618	Ended	-1.0	0	1484475919	A Poet's Guide to Britain	English	2009-05-04 00:00:00	<p>Poet and author Owen Sheers presents a seri...	30	Documentary	...	NaN	NaN	NaN	NaN	NaN	GB	United Kingdom	Europe/London	51	BBC Four
619	Running	-1.0	94	1502640953	The Bold and the Beautiful	English	1987-03-23 00:00:00	<p>They created a dynasty where passion rules,...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	2	CBS
620	Running	-1.0	99	1502485797	Life of Kylie	English	2017-08-06 00:00:00	<p><b>Life of Kylie</b> will follow Kylie Jenn...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	43	E!
621	Ended	6.0	0	1502487937	Jersey Shore	English	2009-12-03 00:00:00	<p>Grab your hair gel, wax that Cadillac and g...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	43	E!
622	Ended	-1.0	0	1485103110	The Hills	English	2006-05-31 00:00:00	<p>In the final season of <b>The Hills</b> - K...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	22	MTV
623	Running	2.7	91	1500442171	Teen Mom	English	2009-12-08 00:00:00	<p>In 16 and Pregnant, they were moms-to-be. N...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	22	MTV
624	Ended	5.7	66	1489774713	Coupling	English	2003-09-25 00:00:00	<p><b>Coupling</b> is an American remake of th...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	1	NBC
625	Running	-1.0	0	1486846250	Access Hollywood Live	English	1996-09-09 00:00:00	<p><b>Access Hollywood Live</b> is a weekday t...	60	Variety	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	75	REELZ
626	To Be Determined	-1.0	0	1462596807	The First Family	English	2012-09-17 00:00:00	<p><i><b>"The First Family"</b></i> is an Amer...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	5	The CW
627	Ended	10.0	0	1502461972	Garbage Pail Kids	English	0	<p>From deep within the historic TV animation ...	25	Animation	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	2	CBS
628	Ended	-1.0	50	1497743151	Khloé & Lamar	English	2011-04-10 00:00:00	<p>In <b>Khloé & Lamar</b>, cameras will f...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	43	E!
629	Ended	-1.0	0	1482948423	The Paul Reiser Show	English	2011-04-14 00:00:00	<p>Paul Reiser plays a fictional version of hi...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	1	NBC
630	Ended	-1.0	0	1485719969	Pretty Wicked Moms	English	2013-06-04 00:00:00	<p>Six Atlanta moms give a whole new meaning t...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	18	Lifetime
631	Ended	-1.0	0	1502430474	The Wright Way	English	2013-04-23 00:00:00	<p>Gerald Wright runs the Baselricky Council H...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	GB	United Kingdom	Europe/London	12	BBC One
632	Ended	-1.0	0	1474119411	High School Musical: Get in the Picture	English	2008-07-20 00:00:00	<p>A group of teenagers are invited to partici...	60	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	3	ABC
633	Ended	-1.0	0	1477283569	Audrina	English	2011-04-17 00:00:00	<p>Besides Audrina's blossoming career and tum...	30	Reality	...	NaN	NaN	NaN	NaN	NaN	US	United States	America/New_York	55	VH1

141 rows × 103 columns

# What do we have that is mostly complete
print df_shows[~df_shows.isnull().any(axis=1)]['winner'].value_counts()
df_shows_notnull = df_shows[~df_shows.isnull().any(axis=1)]

1    209
0    117
Name: winner, dtype: int64

# In the processing above, NaNs were replaced by other values for some columns.  This block creates a new
# dataframe where all rows with these coded values representing missing data have been removed.

df_shows_complete = df_shows_notnull[(df_shows_notnull['rating'] != -1) & \
                                     (df_shows_notnull['gn_unknown'] != 1) & \
                                     (df_shows_notnull['sched_unknown'] != 1) & \
                                     (df_shows_notnull['sched_time_unknown'] != 1) & \
                                     (df_shows_notnull['country_code'] != '') & \
                                     (df_shows_notnull['country_name'] != '') & \
                                     (df_shows_notnull['country_tz'] != '') & \
                                     (df_shows_notnull['network_id'] != '') & \
                                     (df_shows_notnull['network_name'] != '') & \
                                     (df_shows_notnull['premiered'] != 0)]

df_shows_complete.shape

(157, 103)

# Cool, at least not missing any summaries for samples that are otherwise complete
df_shows_complete['summary'].isnull().sum()

df_shows[['summary', 'winner']].groupby(['winner']).count()

	summary
winner
0	256
1	235

Modeling Section

– Note: Cells in this section must be run sequentially to obtain correct results as some variables are reused in the various modeling sections

Vectorize summary text in different ways

# I'll first try a model with just the summary text, which is available for 491 shows: 256 losers and 235 winners


# Use NLP techniques to create lots of factors
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from collections import Counter

# Use different Vectorizers to find ngrams for us
tfidf = TfidfVectorizer(ngram_range=(2,4), max_features=2000, stop_words='english')
cvec = CountVectorizer(ngram_range=(2,4), max_features=2000, stop_words='english')
hvec = HashingVectorizer(ngram_range=(2,4), n_features=2000, stop_words='english')

X_tfidf = tfidf.fit_transform(df_shows['summary']).todense()
X_cvec = cvec.fit_transform(df_shows['summary']).todense()
X_hvec = hvec.fit_transform(df_shows['summary']).todense()

y = df_shows['winner'].values

print '\ntfidf shape:', X_tfidf.shape
print '\ncvec shape:', X_cvec.shape
print '\nhvec shape:', X_hvec.shape
print len(y)

tfidf shape: (491, 2000)

cvec shape: (491, 2000)

hvec shape: (491, 2000)
491
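
To make the `ngram_range=(2,4)` and `stop_words='english'` settings concrete, here is a small sketch on two made-up sentences (the `docs` list is illustrative, not from the dataset). Stop words are dropped first, and n-grams of two to four consecutive remaining tokens become the features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up summaries, purely for illustration
docs = [
    'a gripping crime drama series set in new york city',
    'a reality series that follows the lives of new york socialites',
]

vec = CountVectorizer(ngram_range=(2, 4), stop_words='english')
counts = vec.fit_transform(docs)

# vocabulary_ maps each n-gram to its column index in the count matrix
vocab = sorted(vec.vocabulary_)
print(counts.shape)
print(vocab[:5])
```

Every feature is a phrase of two to four words, which is why the lists of important features later in this section contain entries such as 'drama series' and 'reality television series' but no single words.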

Model on summary text using Count Vectorizer

Results were best when Count Vectorizer scores were modeled with Gaussian Naive Bayes.

Features: 2000
Train Set Accuracy: 0.905
CrossVal Accuracy: 0.644 +/- 0.028
Test Set Accuracy: 0.626

**n-grams with highest cumulative sum of Count Vectorizer scores for winners:** ‘drama series’, ‘david attenborough’, ‘tells story’, ‘young boy’, ‘anthology series’, ‘documentary series’, ‘years later’, ‘main character’, ‘trials tribulations’, ‘crime drama’, ‘serial killer’, ‘tv history’, ‘super hero’, ‘story starts goku’, ‘starts goku’, ‘story starts’, ‘american television’, ‘fictional town’, ‘television drama’, ‘american crime’

**n-grams with highest cumulative sum of Count Vectorizer scores for losers:** ‘real housewives’, ‘television series’, ‘reality series’, ‘follows lives’, ‘series produced’, ‘pop culture’, ‘reality television’, ‘reality television series’, ‘animated series’, ‘come true’, ‘aired abc’, ‘reality tv’, ‘series debuted’, ‘real housewives orange county’, ‘real housewives orange’, ‘housewives orange’, ‘housewives orange county’, ‘talk hosted’, ‘studio audience’, ‘cash prize’

# Baseline accuracy: share of the majority class across all shows
winner_avg = y.mean()
baseline = max(winner_avg, 1-winner_avg)
print baseline

0.521384928717
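
The same baseline can be cross-checked with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. This is a sketch on stand-in labels built from the class counts reported above (235 winners, 256 losers); `y_toy` and `X_dummy` are placeholders, not the notebook's variables.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Stand-in labels with the class counts reported above
y_toy = np.array([1] * 235 + [0] * 256)
X_dummy = np.zeros((len(y_toy), 1))   # the dummy model ignores the features

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_dummy, y_toy)

# Accuracy of always predicting the majority class (losers)
baseline_toy = dummy.score(X_dummy, y_toy)
print(round(baseline_toy, 3))
```

Any classifier in the comparison below needs to beat this number to be doing better than always guessing "loser".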

# Test Train Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_cvec, y, test_size=0.25)

print X_train.shape,  len(y_train)
print X_test.shape,  len(y_test)

(368, 2000) 368
(123, 2000) 123
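
The split above is unseeded, so the rows that land in the test set change on every run, and the scores change with them. A possible variant, shown here as an assumption on placeholder data (`X_toy`, `y_toy`), fixes `random_state` and stratifies on the label so that both parts keep the same winner/loser ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_toy = rng.rand(491, 5)                   # placeholder feature matrix
y_toy = np.array([1] * 235 + [0] * 256)    # class counts reported above

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,
    stratify=y_toy,      # keep the winner/loser ratio the same in both parts
    random_state=42,     # fixed seed so the split is repeatable
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With only 123 test rows, a repeatable split makes it easier to compare runs of different models on equal footing.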

# Standardize the features: fit the scaler on the training set only, then apply it to the test set

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)
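
One caveat with scaling before cross-validation is that the scaler has already seen every training row, including those later held out in each fold. A common way around this, sketched below on random placeholder data (not the notebook's matrices), is to wrap the scaler and the model in a pipeline so that the scaler is refit inside each fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_toy = rng.rand(120, 10)                    # placeholder features
y_toy = (X_toy[:, 0] > 0.5).astype(int)      # label driven by the first feature

# The scaler is fit on the training part of each fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
kfold = KFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_toy, y_toy, cv=kfold, scoring='accuracy')

print("{:0.3f} +/- {:0.3f}".format(scores.mean(), scores.std()))
```

The same `pipe` object can be passed anywhere a plain classifier is accepted, so it would drop into a model-comparison loop without other changes.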

# Run lots of classifiers on this and see which perform the best
# Import all the modeling libraries

from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, \
                                    KFold, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report

import matplotlib.pyplot as plt

# prepare configuration for cross validation test harness
seed = 42

# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFST', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))
models.append(('ADA', AdaBoostClassifier()))
models.append(('SVM', SVC()))
models.append(('GNB', GaussianNB()))
models.append(('MNB', MultinomialNB()))
models.append(('BNB', BernoulliNB()))


# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'

print "\n{}:   {:0.3} ".format('Baseline', baseline)
print "\n{:5.5}:  {:10.8}  {:20.18}  {:20.17}  {:20.17}".format\
        ("Model", "Features", "Train Set Accuracy", "CrossVal Accuracy", "Test Set Accuracy")

for name, model in models:
    try:
        kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, Xs_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        this_model = model
        # fit on the same standardized features that are used for cross-validation and scoring
        this_model.fit(Xs_train,y_train)
        print "{:5.5}     {:}         {:0.3f}               {:0.3f} +/- {:0.3f}         {:0.3f} ".format\
                (name, X_train.shape[1], metrics.accuracy_score(y_train, this_model.predict(Xs_train)), \
                 cv_results.mean(), cv_results.std(), metrics.accuracy_score(y_test, this_model.predict(Xs_test)))
    except Exception:
        print "    {:5.5}:   {} ".format(name, 'failed on this input dataset')

        
                
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
ax.axhline(y=baseline, color='grey', linestyle='--')
plt.show()

Baseline:   0.521 

Model:  Features    Train Set Accuracy    CrossVal Accuracy     Test Set Accuracy   
LR        2000         0.938               0.660 +/- 0.037         0.626 
LDA       2000         0.938               0.544 +/- 0.054         0.593 
QDA       2000         0.549               0.399 +/- 0.034         0.390 
KNN       2000         0.758               0.500 +/- 0.010         0.528 
CART      2000         0.943               0.576 +/- 0.028         0.585 
RFST      2000         0.940               0.636 +/- 0.046         0.626 
GB        2000         0.826               0.546 +/- 0.020         0.585 
ADA       2000         0.769               0.552 +/- 0.042         0.545 
SVM       2000         0.519               0.519 +/- 0.018         0.528 
GNB       2000         0.913               0.688 +/- 0.038         0.561 
    MNB  :   failed on this input dataset 
BNB       2000         0.902               0.625 +/- 0.023         0.602

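[Figure: box plot "Algorithm Comparison" of the cross-validation accuracy for each classifier, with the baseline drawn as a dashed grey line]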
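
The MNB row most likely fails because cross-validation in the loop runs on `Xs_train`, and standardized features contain negative values, which `MultinomialNB` does not accept. The bare `except` hides the actual message. The sketch below reproduces the situation on a tiny made-up count matrix (`counts` and `labels` are illustrative only).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler

# Tiny made-up count matrix and labels
counts = np.array([[2, 0, 1],
                   [0, 3, 0],
                   [1, 0, 0],
                   [0, 1, 2]])
labels = np.array([1, 0, 1, 0])

# Centring each column around zero produces negative entries
scaled = StandardScaler().fit_transform(counts)

try:
    MultinomialNB().fit(scaled, labels)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("MultinomialNB rejected the scaled matrix: {}".format(err))

# The raw, non-negative counts are accepted
mnb = MultinomialNB().fit(counts, labels)
print(mnb.predict(counts))
```

Feeding `MultinomialNB` the unscaled counts (or tf-idf weights, which are also non-negative) avoids the error.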

# Which n-grams are most common in the winner summaries?
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

# We can use the CountVectorizer analyzer to find n-grams for us
vect = CountVectorizer(ngram_range=(2,4), stop_words='english')

# Join all of the winner summaries into one giant string
summaries = "".join(df_shows[df_shows['winner'] == 1]['summary'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'new york', 11),
 (u'drama series', 8),
 (u'york city', 6),
 (u'high school', 6),
 (u'men women', 5),
 (u'tv series', 5),
 (u'series based', 5),
 (u'video game', 5),
 (u'bugs bunny', 5),
 (u'new york city', 5),
 (u'tells story', 4),
 (u'young boy', 4),
 (u'comedy series', 4),
 (u'main character', 4),
 (u'united states', 4),
 (u'life new', 4),
 (u'series follows', 4),
 (u'anthology series', 3),
 (u'mr bean', 3),
 (u'prisoner cell', 3)]
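
Because the summaries are joined with an empty string, the last word of one summary is fused with the first word of the next, which can create tokens and n-grams that appear in no real summary. A possible alternative, sketched here on two made-up summaries (`summaries_toy` is illustrative), is to run the analyzer on each summary separately and accumulate the counts.

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up summaries, purely for illustration
summaries_toy = [
    'a crime drama series set in new york',
    'david attenborough presents a documentary series',
]

analyzer = CountVectorizer(ngram_range=(2, 4), stop_words='english').build_analyzer()

# Count n-grams one summary at a time so none can span two summaries
ngram_counts = Counter()
for text in summaries_toy:
    ngram_counts.update(analyzer(text))

print(ngram_counts.most_common(5))
```

On the real data this would amount to looping over `df_shows[df_shows['winner'] == 1]['summary']` instead of `summaries_toy`.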

# Which n-grams are most common in the loser summaries?
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

# We can use the CountVectorizer analyzer to find n-grams for us
vect = CountVectorizer(ngram_range=(2,4), stop_words='english')

# Join all of the loser summaries into one giant string
summaries = "".join(df_shows[df_shows['winner'] == 0]['summary'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'real housewives', 12),
 (u'television series', 12),
 (u'los angeles', 11),
 (u'pop culture', 10),
 (u'series follows', 9),
 (u'new york', 9),
 (u'animated series', 7),
 (u'cartoon network', 7),
 (u'big brother', 6),
 (u'dance moms', 6),
 (u'reality series', 6),
 (u'best friend', 6),
 (u'high school', 5),
 (u'late night', 5),
 (u'best friends', 5),
 (u'nick jr', 5),
 (u'reality television series', 5),
 (u'plastic surgery', 5),
 (u'access hollywood', 5),
 (u'comedy series', 5)]

# Sum matrix columns to see which n-grams have the most overall importance

print "Highest sum Count Vectorizer score for n_grams in winner shows"

cvec_results = pd.DataFrame(Xs_train, columns=cvec.get_feature_names())
cvec_results['winners'] = y_train

winner_results = pd.DataFrame(cvec_results[cvec_results['winners'] ==1].sum(), columns=['cvec_sum'])


high = winner_results.drop(['winners']).sort_values('cvec_sum', axis=0, ascending=False).head(20).index
print  [str(r) for r in high]

winner_results.drop(['winners']).sort_values('cvec_sum', axis=0, ascending=False).head(20)

Highest sum Count Vectorizer score for n_grams in winner shows
['drama series', 'david attenborough', 'main character', 'tells story', 'years later', 'years ago', 'fictional town', 'anthology series', 'documentary series', 'provocative series', 'makes effort', 'standup comedian', 'set world', 'time 13yearold', 'based manga', 'highs lows', 'set fictional', 'sherlock holmes', 'series takes', 'seaside town']

	cvec_sum
drama series	21.615324
david attenborough	20.022240
main character	20.022240
tells story	20.022240
years later	17.315999
years ago	17.315999
fictional town	17.315999
anthology series	17.315999
documentary series	17.315999
provocative series	14.119126
makes effort	14.119126
standup comedian	14.119126
set world	14.119126
time 13yearold	14.119126
based manga	14.119126
highs lows	14.119126
set fictional	14.119126
sherlock holmes	14.119126
series takes	14.119126
seaside town	14.119126

# Sum matrix columns to see which n-grams have the most overall importance

print "Highest sum Count Vectorizer score for n_grams in loser shows"

cvec_results = pd.DataFrame(Xs_train, columns=cvec.get_feature_names())
cvec_results['winners'] = y_train

winner_results = pd.DataFrame(cvec_results[cvec_results['winners'] ==0].sum(), columns=['cvec_sum'])


high = winner_results.drop(['winners']).sort_values('cvec_sum', axis=0, ascending=False).head(20).index
print  [str(r) for r in high]

winner_results.drop(['winners']).sort_values('cvec_sum', axis=0, ascending=False).head(20)

Highest sum Count Vectorizer score for n_grams in loser shows
['reality series', 'television series', 'series produced', 'real housewives', 'series debuted', 'family friends', 'follows lives', 'series features', 'animated series', 'los angeles', 'reality television', 'reality television series', 'bros television distribution', 'bros television', 'warner bros television', 'warner bros television distribution', 'television series debuted', 'cash prize', 'new series', 'news channel']

	cvec_sum
reality series	22.787391
television series	21.394240
series produced	20.773274
real housewives	19.851334
series debuted	18.554642
family friends	18.554642
follows lives	18.554642
series features	18.554642
animated series	18.554642
los angeles	18.313960
reality television	17.522176
reality television series	17.522176
bros television distribution	16.046764
bros television	16.046764
warner bros television	16.046764
warner bros television distribution	16.046764
television series debuted	16.046764
cash prize	16.046764
new series	16.046764
news channel	16.046764
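
The sums above are taken over standardized columns. Each such column has zero mean over the training set, so an n-gram's winner sum is the negative of its loser sum. A more direct way to read the numbers, sketched here on a tiny made-up count matrix (the n-grams and counts are hypothetical), is to sum the raw counts per class and look at the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical n-grams and raw counts, for illustration only
feature_names = ['drama series', 'reality series', 'new york']
X_counts = np.array([[1, 0, 1],
                     [2, 0, 0],
                     [0, 1, 1],
                     [0, 2, 0]])
y_toy = np.array([1, 1, 0, 0])   # 1 = winner, 0 = loser

counts_df = pd.DataFrame(X_counts, columns=feature_names)
per_class = counts_df.groupby(y_toy).sum()   # row 0 = losers, row 1 = winners
per_class.index = ['losers', 'winners']

# Positive values lean towards winners, negative values towards losers
diff = (per_class.loc['winners'] - per_class.loc['losers']).sort_values(ascending=False)
print(diff)
```

An n-gram that is equally common in both classes, such as 'new york' in this toy example, ends up near zero and drops out of both top lists.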

Model on summary text using TF-IDF Vectorizer

Results were best when tf-idf scores were modeled with Gaussian Naive Bayes.

Features: 2000
Train Set Accuracy: 0.924
CrossVal Accuracy: 0.609 +/- 0.034
Test Set Accuracy: 0.609 +/- 0.034

**n-grams with highest cumulative sum of tf-idf scores for winners:** ‘david attenborough’, ‘drama series’, ‘men women’, ‘new york’, ‘documentary series’, ‘new york city’, ‘york city’, ‘quest save’, ‘tv series’, ‘world know’, ‘television drama’, ‘sitcom set’, ‘young boy’, ‘comedy series’, ‘series created’, ‘tells story’, ‘21st century’, ‘super hero’, ‘cable news’, ‘best friends’

**n-grams with highest cumulative sum of tf-idf scores for losers:** ‘real housewives’, ‘series follows’, ‘television series’, ‘best friends’, ‘best friend’, ‘los angeles’, ‘things just’, ‘group teenagers’, ‘series features’, ‘restaurant industry’, ‘children ages’, ‘animated series’, ‘big brother’, ‘cartoon network’, ‘recent divorce’, ‘american women’, ‘high school’, ‘reality series’, ‘follows lives’, ‘lives loves’

# Baseline accuracy: share of the majority class across all shows
winner_avg = y.mean()
baseline = max(winner_avg, 1-winner_avg)
print baseline

0.521384928717

# Test Train Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25)

print X_train.shape,  len(y_train)
print X_test.shape,  len(y_test)

(368, 2000) 368
(123, 2000) 123

# Standardize the features (note: the model loop below uses the unscaled X_train / X_test)

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

# prepare configuration for cross validation test harness
seed = 42

# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFST', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))
models.append(('ADA', AdaBoostClassifier()))
models.append(('SVM', SVC()))
models.append(('GNB', GaussianNB()))
models.append(('MNB', MultinomialNB()))
models.append(('BNB', BernoulliNB()))


# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'

print "\n{}:   {:0.3} ".format('Baseline', baseline)
print "\n{:5.5}:  {:10.8}  {:20.18}  {:20.17}  {:20.17}".format\
        ("Model", "Features", "Train Set Accuracy", "CrossVal Accuracy", "Test Set Accuracy")

for name, model in models:
    try:
        kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        this_model = model
        this_model.fit(X_train,y_train)
        print "{:5.5}     {:}         {:0.3f}               {:0.3f} +/- {:0.3f}         {:0.3f} ".format\
                (name, X_train.shape[1], metrics.accuracy_score(y_train, this_model.predict(X_train)), \
                 cv_results.mean(), cv_results.std(), metrics.accuracy_score(y_test, this_model.predict(X_test)))
    except Exception:
        print "    {:5.5}:   {} ".format(name, 'failed on this input dataset')

        
                
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
ax.axhline(y=baseline, color='grey', linestyle='--')
plt.show()

Baseline:   0.521 

Model:  Features    Train Set Accuracy    CrossVal Accuracy     Test Set Accuracy   
LR        2000         0.957               0.620 +/- 0.020         0.634 
LDA       2000         0.959               0.658 +/- 0.035         0.610 
QDA       2000         0.671               0.437 +/- 0.013         0.431 
KNN       2000         0.647               0.519 +/- 0.031         0.488 
CART      2000         0.959               0.554 +/- 0.026         0.496 
RFST      2000         0.957               0.581 +/- 0.020         0.561 
GB        2000         0.872               0.598 +/- 0.034         0.545 
ADA       2000         0.772               0.541 +/- 0.047         0.504 
SVM       2000         0.505               0.492 +/- 0.009         0.569 
GNB       2000         0.943               0.668 +/- 0.028         0.634 
MNB       2000         0.927               0.658 +/- 0.029         0.642 
BNB       2000         0.932               0.641 +/- 0.048         0.593

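[Figure: "Algorithm Comparison" box plot of cross-validation accuracy per model, with the baseline drawn as a dashed grey line]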
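
With only 123 test rows, a single accuracy figure carries noticeable sampling error. As a rough check (a normal-approximation interval, which is an assumption added here and not part of the original analysis), the MNB test accuracy of 0.642 can be compared with the 0.521 baseline:

```python
import math

n_test = 123        # test-set size from the split above
accuracy = 0.642    # MNB test-set accuracy from the table
baseline = 0.521    # majority-class baseline

# Standard error of a proportion and an approximate 95% interval.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
lower = accuracy - 1.96 * std_err
upper = accuracy + 1.96 * std_err

print("std error: {:.3f}".format(std_err))
print("95% interval: {:.3f} to {:.3f}".format(lower, upper))
print("baseline below interval:", baseline < lower)
```

Under these assumptions the interval runs from roughly 0.56 to 0.73, so the best NLP model sits above the baseline, but not by a wide margin. Choosing the best of twelve models on the same test set also tends to flatter the winning score.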

# Which words are most common in the winner summaries ?
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4), stop_words='english')

# Join all of the winner summaries into one giant string
summaries = "".join(df_shows[df_shows['winner'] == 1]['summary'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'new york', 11),
 (u'drama series', 8),
 (u'york city', 6),
 (u'high school', 6),
 (u'men women', 5),
 (u'tv series', 5),
 (u'series based', 5),
 (u'video game', 5),
 (u'bugs bunny', 5),
 (u'new york city', 5),
 (u'tells story', 4),
 (u'young boy', 4),
 (u'comedy series', 4),
 (u'main character', 4),
 (u'united states', 4),
 (u'life new', 4),
 (u'series follows', 4),
 (u'anthology series', 3),
 (u'mr bean', 3),
 (u'prisoner cell', 3)]
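
One caveat about these counts: `"".join(...)` concatenates the summaries with no separator, so the last word of one summary fuses with the first word of the next, and n-grams can run across two shows. The sketch below counts n-grams one summary at a time instead. The two toy summaries are made up for illustration.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy summaries (hypothetical), standing in for df_shows['summary'].
summaries = ["drama series new york", "city life high school"]

analyzer = TfidfVectorizer(ngram_range=(2, 4),
                           stop_words='english').build_analyzer()

# Joining with "" fuses "york" and "city" into the single token "yorkcity".
fused = analyzer("".join(summaries))

# Counting per summary keeps every n-gram inside one show's description.
per_summary = Counter()
for text in summaries:
    per_summary.update(analyzer(text))

print("new yorkcity" in fused)
print("york city" in per_summary)
print(per_summary.most_common(3))
```

On real data the effect is probably small, since only the words at summary boundaries are affected, but counting per summary removes the ambiguity.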

# Which words are most common in the loser summaries ?
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4), stop_words='english')

# Join all of the loser summaries into one giant string
summaries = "".join(df_shows[df_shows['winner'] == 0]['summary'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'real housewives', 12),
 (u'television series', 12),
 (u'los angeles', 11),
 (u'pop culture', 10),
 (u'series follows', 9),
 (u'new york', 9),
 (u'animated series', 7),
 (u'cartoon network', 7),
 (u'big brother', 6),
 (u'dance moms', 6),
 (u'reality series', 6),
 (u'best friend', 6),
 (u'high school', 5),
 (u'late night', 5),
 (u'best friends', 5),
 (u'nick jr', 5),
 (u'reality television series', 5),
 (u'plastic surgery', 5),
 (u'access hollywood', 5),
 (u'comedy series', 5)]

# Sum matrix columns to see what has the most overall importance ?

print "Highest cumulative tfidf score for n_grams in winner shows"

tfidf_results = pd.DataFrame(X_train, columns= tfidf.get_feature_names())
tfidf_results['winners'] = y_train

winner_results = pd.DataFrame(tfidf_results[tfidf_results['winners'] ==1].sum(), columns=['tfidf_sum'])


high = winner_results.drop(['winners']).sort_values('tfidf_sum', axis=0, ascending=False).head(20).index
print  [str(r) for r in high]

winner_results.drop(['winners']).sort_values('tfidf_sum', axis=0, ascending=False).head(20)

Highest cumulative tfidf score for n_grams in winner shows
['new york', 'men women', 'documentary series', 'york city', 'new york city', 'high school', 'drama series', 'tells story', 'years ago', 'david attenborough', 'series created', 'years later', 'young man', 'comedy series', 'main character', '21st century', 'tv series', 'andrew davies', 'cable news', 'series based']

	tfidf_sum
new york	2.835972
men women	2.484786
documentary series	2.171334
york city	1.897754
new york city	1.897754
high school	1.743989
drama series	1.716267
tells story	1.685216
years ago	1.634339
david attenborough	1.522240
series created	1.484294
years later	1.484223
young man	1.474759
comedy series	1.401982
main character	1.366767
21st century	1.358933
tv series	1.307707
andrew davies	1.304276
cable news	1.261358
series based	1.258897

# Sum matrix columns to see what has the most overall importance ?

print "Highest cumulative tfidf score for n_grams in loser shows"

tfidf_results = pd.DataFrame(X_train, columns= tfidf.get_feature_names())
tfidf_results['winners'] = y_train

loser_results = pd.DataFrame(tfidf_results[tfidf_results['winners'] == 0].sum(), columns=['tfidf_sum'])

low = loser_results.drop(['winners']).sort_values('tfidf_sum', axis=0, ascending=False).head(20).index

print  [str(r) for r in low]
loser_results.drop(['winners']).sort_values('tfidf_sum', axis=0, ascending=False).head(20)

Highest cumulative tfidf score for n_grams in loser shows
['reality series', 'television series', 'los angeles', 'real housewives', 'things just', 'best friend', 'group teenagers', 'series follows', 'restaurant industry', 'new york', 'high school', 'children ages', 'big brother', 'recent divorce', 'series features', 'cartoon network', 'football team', 'plastic surgery', 'bizarre adventures', 'nick jr']

	tfidf_sum
reality series	3.535573
television series	2.715189
los angeles	2.582930
real housewives	2.160853
things just	2.000000
best friend	1.958176
group teenagers	1.791283
series follows	1.772954
restaurant industry	1.707107
new york	1.662150
high school	1.657534
children ages	1.648007
big brother	1.591352
recent divorce	1.473112
series features	1.443284
cartoon network	1.441719
football team	1.414214
plastic surgery	1.382887
bizarre adventures	1.368894
nick jr	1.366875

Model using data other than the TV show summary text

# Get list of columns for the useful non-summary data.  Dropping the "unknown" columns will solve
# the collinearity issue with dummied columns, as these will be the dropped dummies.
# Dropping premiered as it is a datetime and StandardScaler can't handle it.  Also dropping
# weight as it is not understood, and rating and winner as they are the targets

cols = [x for x in df_shows.columns if x not in ['rating', 'weight', 'updated', 'premiered', 'summary', 'id', \
                                                 'gn_unknown', 'sched_unknown', 'sched_time_unknown', \
                                                 'country_name', 'country_tz', 'network_name', 'name', 'winner']]
cols

[u'status',
 u'language',
 u'runtime',
 u'type',
 u'network',
 u'gn_action',
 u'gn_adult',
 u'gn_adventure',
 u'gn_anime',
 u'gn_children',
 u'gn_comedy',
 u'gn_crime',
 u'gn_drama',
 u'gn_espionage',
 u'gn_family',
 u'gn_fantasy',
 u'gn_food',
 u'gn_history',
 u'gn_horror',
 u'gn_legal',
 u'gn_medical',
 u'gn_music',
 u'gn_mystery',
 u'gn_nature',
 u'gn_romance',
 u'gn_science-fiction',
 u'gn_sports',
 u'gn_supernatural',
 u'gn_thriller',
 u'gn_travel',
 u'gn_war',
 u'gn_western',
 u'sched_friday',
 u'sched_monday',
 u'sched_saturday',
 u'sched_sunday',
 u'sched_thursday',
 u'sched_tuesday',
 u'sched_wednesday',
 u'sched_time_00:00',
 u'sched_time_00:30',
 u'sched_time_00:50',
 u'sched_time_00:55',
 u'sched_time_01:00',
 u'sched_time_01:05',
 u'sched_time_01:30',
 u'sched_time_01:35',
 u'sched_time_02:00',
 u'sched_time_02:05',
 u'sched_time_08:00',
 u'sched_time_10:00',
 u'sched_time_11:00',
 u'sched_time_12:00',
 u'sched_time_13:00',
 u'sched_time_13:30',
 u'sched_time_14:00',
 u'sched_time_14:30',
 u'sched_time_15:00',
 u'sched_time_15:15',
 u'sched_time_16:00',
 u'sched_time_16:30',
 u'sched_time_17:00',
 u'sched_time_17:15',
 u'sched_time_17:30',
 u'sched_time_18:00',
 u'sched_time_18:30',
 u'sched_time_19:00',
 u'sched_time_19:30',
 u'sched_time_19:45',
 u'sched_time_20:00',
 u'sched_time_20:15',
 u'sched_time_20:30',
 u'sched_time_20:40',
 u'sched_time_20:45',
 u'sched_time_20:55',
 u'sched_time_21:00',
 u'sched_time_21:10',
 u'sched_time_21:15',
 u'sched_time_21:30',
 u'sched_time_21:45',
 u'sched_time_22:00',
 u'sched_time_22:10',
 u'sched_time_22:30',
 u'sched_time_22:35',
 u'sched_time_23:00',
 u'sched_time_23:02',
 u'sched_time_23:15',
 u'sched_time_23:30',
 'country_code',
 'network_id']

# Create dummy variables for country code, network id, status, language, and type
df_showsd = pd.get_dummies(df_shows, columns=['network_id'], prefix='NW', prefix_sep='_')

# Drop the dummy for a blank network id, plus the raw network column
df_showsd = df_showsd.drop('NW_', axis=1)
df_showsd = df_showsd.drop('network', axis=1)

df_showsd = pd.get_dummies(df_showsd, columns=['country_code'], prefix='C', prefix_sep='_', drop_first=True)
df_showsd = pd.get_dummies(df_showsd, columns=['status'], prefix='ST', prefix_sep='_', drop_first=True)
df_showsd = pd.get_dummies(df_showsd, columns=['language'], prefix='L', prefix_sep='_', drop_first=True)
df_showsd = pd.get_dummies(df_showsd, columns=['type'], prefix='T', prefix_sep='_', drop_first=True)

# Handle any NaN values that remain
shows_clean = df_showsd.dropna()

# We have 351 total samples left, about 1/3 loser and 2/3 winner
# Seems reasonable to proceed with a classification model

print "Number winner samples:", shows_clean['winner'].sum()
print "Number loser samples:", len(shows_clean[shows_clean['winner'] == 0])

Number winner samples: 230
Number loser samples: 121
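
The dummy-variable step above drops one level per categorical column so that the remaining indicators are not perfectly collinear. A small sketch of what `drop_first=True` does, on a made-up `type` column:

```python
import pandas as pd

# Hypothetical three-row frame with one categorical column.
toy = pd.DataFrame({'type': ['Scripted', 'Reality', 'Animation']})

full = pd.get_dummies(toy, columns=['type'], prefix='T', prefix_sep='_')
reduced = pd.get_dummies(toy, columns=['type'], prefix='T', prefix_sep='_',
                         drop_first=True)

# With all levels kept, the indicator columns always sum to 1 per row,
# which duplicates the model's intercept (the "dummy variable trap").
print(full.sum(axis=1).tolist())
print(list(full.columns))
print(list(reduced.columns))
```

The level that sorts first alphabetically is the one dropped, and it becomes the reference category against which the other coefficients are read.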

cols = [x for x in shows_clean.columns if x not in ['rating', 'weight', 'updated', 'premiered', 'summary', 'id', \
                                                 'gn_unknown', 'sched_unknown', 'sched_time_unknown', \
                                                 'country_name', 'country_tz', 'network_name', 'name', 'winner']]
cols

[u'runtime',
 u'gn_action',
 u'gn_adult',
 u'gn_adventure',
 u'gn_anime',
 u'gn_children',
 u'gn_comedy',
 u'gn_crime',
 u'gn_drama',
 u'gn_espionage',
 u'gn_family',
 u'gn_fantasy',
 u'gn_food',
 u'gn_history',
 u'gn_horror',
 u'gn_legal',
 u'gn_medical',
 u'gn_music',
 u'gn_mystery',
 u'gn_nature',
 u'gn_romance',
 u'gn_science-fiction',
 u'gn_sports',
 u'gn_supernatural',
 u'gn_thriller',
 u'gn_travel',
 u'gn_war',
 u'gn_western',
 u'sched_friday',
 u'sched_monday',
 u'sched_saturday',
 u'sched_sunday',
 u'sched_thursday',
 u'sched_tuesday',
 u'sched_wednesday',
 u'sched_time_00:00',
 u'sched_time_00:30',
 u'sched_time_00:50',
 u'sched_time_00:55',
 u'sched_time_01:00',
 u'sched_time_01:05',
 u'sched_time_01:30',
 u'sched_time_01:35',
 u'sched_time_02:00',
 u'sched_time_02:05',
 u'sched_time_08:00',
 u'sched_time_10:00',
 u'sched_time_11:00',
 u'sched_time_12:00',
 u'sched_time_13:00',
 u'sched_time_13:30',
 u'sched_time_14:00',
 u'sched_time_14:30',
 u'sched_time_15:00',
 u'sched_time_15:15',
 u'sched_time_16:00',
 u'sched_time_16:30',
 u'sched_time_17:00',
 u'sched_time_17:15',
 u'sched_time_17:30',
 u'sched_time_18:00',
 u'sched_time_18:30',
 u'sched_time_19:00',
 u'sched_time_19:30',
 u'sched_time_19:45',
 u'sched_time_20:00',
 u'sched_time_20:15',
 u'sched_time_20:30',
 u'sched_time_20:40',
 u'sched_time_20:45',
 u'sched_time_20:55',
 u'sched_time_21:00',
 u'sched_time_21:10',
 u'sched_time_21:15',
 u'sched_time_21:30',
 u'sched_time_21:45',
 u'sched_time_22:00',
 u'sched_time_22:10',
 u'sched_time_22:30',
 u'sched_time_22:35',
 u'sched_time_23:00',
 u'sched_time_23:02',
 u'sched_time_23:15',
 u'sched_time_23:30',
 'NW_1',
 'NW_2',
 'NW_3',
 'NW_4',
 'NW_5',
 'NW_6',
 'NW_8',
 'NW_9',
 'NW_10',
 'NW_11',
 'NW_12',
 'NW_13',
 'NW_14',
 'NW_16',
 'NW_17',
 'NW_18',
 'NW_19',
 'NW_20',
 'NW_22',
 'NW_23',
 'NW_24',
 'NW_25',
 'NW_26',
 'NW_27',
 'NW_29',
 'NW_30',
 'NW_32',
 'NW_34',
 'NW_35',
 'NW_36',
 'NW_37',
 'NW_41',
 'NW_42',
 'NW_43',
 'NW_44',
 'NW_45',
 'NW_47',
 'NW_48',
 'NW_49',
 'NW_51',
 'NW_52',
 'NW_54',
 'NW_55',
 'NW_56',
 'NW_59',
 'NW_63',
 'NW_66',
 'NW_70',
 'NW_71',
 'NW_72',
 'NW_73',
 'NW_75',
 'NW_76',
 'NW_77',
 'NW_78',
 'NW_79',
 'NW_80',
 'NW_81',
 'NW_84',
 'NW_85',
 'NW_88',
 'NW_91',
 'NW_92',
 'NW_107',
 'NW_109',
 'NW_114',
 'NW_115',
 'NW_118',
 'NW_120',
 'NW_122',
 'NW_125',
 'NW_131',
 'NW_132',
 'NW_137',
 'NW_144',
 'NW_149',
 'NW_151',
 'NW_155',
 'NW_157',
 'NW_158',
 'NW_159',
 'NW_163',
 'NW_173',
 'NW_177',
 'NW_184',
 'NW_185',
 'NW_206',
 'NW_224',
 'NW_231',
 'NW_239',
 'NW_248',
 'NW_251',
 'NW_270',
 'NW_286',
 'NW_298',
 'NW_309',
 'NW_324',
 'NW_333',
 'NW_336',
 'NW_349',
 'NW_350',
 'NW_360',
 'NW_376',
 'NW_409',
 'NW_472',
 'NW_551',
 'NW_553',
 'NW_639',
 'NW_652',
 'NW_714',
 'NW_809',
 'NW_813',
 'NW_821',
 'NW_870',
 'NW_976',
 'NW_1027',
 'NW_1050',
 'NW_1485',
 u'C_AU',
 u'C_CA',
 u'C_DE',
 u'C_DK',
 u'C_FR',
 u'C_GB',
 u'C_IT',
 u'C_JP',
 u'C_KR',
 u'C_NO',
 u'C_NZ',
 u'C_PL',
 u'C_RU',
 u'C_SE',
 u'C_TR',
 u'C_US',
 u'ST_Running',
 u'ST_To Be Determined',
 u'L_English',
 u'L_French',
 u'L_German',
 u'L_Hindi',
 u'L_Italian',
 u'L_Japanese',
 u'L_Korean',
 u'L_Norwegian',
 u'L_Polish',
 u'L_Russian',
 u'L_Swedish',
 u'L_Turkish',
 u'T_Documentary',
 u'T_Game Show',
 u'T_News',
 u'T_Panel Show',
 u'T_Reality',
 u'T_Scripted',
 u'T_Talk Show',
 u'T_Variety']

# Generate X matrix and y target

X = shows_clean[cols]
y = shows_clean['winner'].values

# Baseline 
winner_avg = y.mean()
baseline = max(winner_avg, 1-winner_avg)
print baseline

0.655270655271
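
The baseline is simply the accuracy of always predicting the larger class. Using the class counts printed earlier (230 winners, 121 losers), the arithmetic can be checked directly:

```python
winners = 230
losers = 121
total = winners + losers          # 351 samples after dropna()

winner_avg = winners / float(total)
baseline = max(winner_avg, 1 - winner_avg)

print(total)
print(round(baseline, 6))
```

Any classifier worth keeping has to beat this figure, since guessing "winner" for every show already scores about 0.655.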

# Test Train Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print X_train.shape,  len(y_train)
print X_test.shape,  len(y_test)

(263, 240) 263
(88, 240) 88

#  Standardize - 

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

# Gridsearch for best C and penalty
gs_params = {
    'penalty':['l1', 'l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,5,100)
}
from sklearn.model_selection import GridSearchCV
lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params, cv=3, verbose=1, n_jobs=-1)

lr_gridsearch.fit(Xs_train, y_train)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:    1.4s finished





GridSearchCV(cv=3, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.26186e-05, ...,   7.92483e+04,   1.00000e+05]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

# best mean cross-validated score on the training data:
lr_gridsearch.best_score_

0.9125475285171103

# best parameters on the training data:
lr_gridsearch.best_params_

{'C': 0.068926121043496949, 'penalty': 'l2', 'solver': 'liblinear'}

# assign the best estimator to a variable:
best_lr = lr_gridsearch.best_estimator_

# Score it on the testing data:
best_lr.score(Xs_test, y_test)

0.88636363636363635
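
The grid search above runs on data that was scaled once, before cross-validation, so each validation fold has already influenced the scaler. A common alternative is to wrap the scaler and the classifier in a `Pipeline` so the scaling is re-fit inside every fold. The sketch below uses synthetic data from `make_classification` and a much smaller `C` grid than the notebook's; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the show features (not the real data).
X, y = make_classification(n_samples=351, n_features=40, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),
                 ('lr', LogisticRegression(solver='liblinear'))])

# Step parameters are addressed as <step name>__<parameter>.
params = {'lr__penalty': ['l1', 'l2'],
          'lr__C': np.logspace(-3, 3, 7)}

search = GridSearchCV(pipe, params, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))        # mean cross-validated accuracy
print(round(search.score(X_test, y_test), 3))
```

For standardization alone the difference is usually small, but the same pattern matters more once feature selection is added to the pipeline.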

# Much better than baseline, and we can find the most important factors and run all the classifiers using
# those factors.

coef_df = pd.DataFrame({
        'features': X.columns,
        'log odds': best_lr.coef_[0],
        'percentage change in odds': np.round(np.exp(best_lr.coef_[0])*100-100,2)
    })

coef_df.sort_values(by='percentage change in odds', ascending=False)

	features	log odds	percentage change in odds
6	gn_comedy	0.500129	64.89
237	T_Scripted	0.477513	61.21
232	T_Documentary	0.268693	30.83
8	gn_drama	0.254062	28.93
3	gn_adventure	0.246904	28.01
90	NW_8	0.242839	27.49
94	NW_12	0.207017	23.00
114	NW_37	0.201048	22.27
7	gn_crime	0.192888	21.27
83	sched_time_23:30	0.178175	19.50
21	gn_science-fiction	0.177248	19.39
23	gn_supernatural	0.177209	19.39
18	gn_mystery	0.167534	18.24
95	NW_13	0.139862	15.01
218	ST_Running	0.139006	14.91
0	runtime	0.138903	14.90
11	gn_fantasy	0.137532	14.74
64	sched_time_19:45	0.137084	14.69
176	NW_270	0.126593	13.50
87	NW_4	0.122341	13.01
143	NW_85	0.115952	12.29
13	gn_history	0.111519	11.80
24	gn_thriller	0.102413	10.78
30	sched_saturday	0.102229	10.76
207	C_GB	0.101888	10.73
1	gn_action	0.090291	9.45
14	gn_horror	0.088511	9.25
62	sched_time_19:00	0.086923	9.08
98	NW_17	0.085510	8.93
225	L_Japanese	0.085179	8.89
...	...	...	...
53	sched_time_15:00	-0.106270	-10.08
153	NW_122	-0.113304	-10.71
132	NW_71	-0.113592	-10.74
201	NW_1485	-0.115251	-10.89
186	NW_376	-0.115363	-10.90
169	NW_185	-0.116553	-11.00
106	NW_26	-0.117644	-11.10
220	L_English	-0.119882	-11.30
105	NW_25	-0.126560	-11.89
55	sched_time_16:00	-0.128590	-12.07
29	sched_monday	-0.134095	-12.55
192	NW_652	-0.143685	-13.38
217	C_US	-0.154332	-14.30
117	NW_43	-0.155093	-14.37
110	NW_32	-0.157596	-14.58
158	NW_144	-0.158937	-14.69
49	sched_time_13:00	-0.159660	-14.76
140	NW_80	-0.160574	-14.83
33	sched_tuesday	-0.174971	-16.05
203	C_CA	-0.179060	-16.39
93	NW_11	-0.179608	-16.44
17	gn_music	-0.189299	-17.25
179	NW_309	-0.216452	-19.46
124	NW_52	-0.224190	-20.08
238	T_Talk Show	-0.233104	-20.79
233	T_Game Show	-0.244726	-21.71
78	sched_time_22:30	-0.251273	-22.22
102	NW_22	-0.268289	-23.53
67	sched_time_20:30	-0.300379	-25.95
236	T_Reality	-0.557917	-42.76

240 rows × 3 columns
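
The "percentage change in odds" column is `exp(coefficient) * 100 - 100`. Because the model was fit on the standardized `Xs_train`, each coefficient describes a one-standard-deviation increase in that feature rather than a switch from 0 to 1. A quick check of two rows from the table:

```python
import numpy as np

# Log-odds coefficients copied from the table above.
coefs = {'gn_comedy': 0.500129, 'T_Reality': -0.557917}

# exp(log odds) is the odds ratio; express it as a percentage change.
pct = {feature: float(np.round(np.exp(log_odds) * 100 - 100, 2))
       for feature, log_odds in coefs.items()}

print(pct)
```

The values reproduce the 64.89 and -42.76 shown for `gn_comedy` and `T_Reality`.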

# Create a subset of "coef_df" DataFrame with most important coefficients
imp_coefs = pd.concat([coef_df.sort_values(by='percentage change in odds', ascending=False).head(10),
                     coef_df.sort_values(by='percentage change in odds', ascending=False).tail(10)])

imp_coefs.set_index('features', inplace=True)

imp_coefs

	log odds	percentage change in odds
features
gn_comedy	0.500129	64.89
T_Scripted	0.477513	61.21
T_Documentary	0.268693	30.83
gn_drama	0.254062	28.93
gn_adventure	0.246904	28.01
NW_8	0.242839	27.49
NW_12	0.207017	23.00
NW_37	0.201048	22.27
gn_crime	0.192888	21.27
sched_time_23:30	0.178175	19.50
NW_11	-0.179608	-16.44
gn_music	-0.189299	-17.25
NW_309	-0.216452	-19.46
NW_52	-0.224190	-20.08
T_Talk Show	-0.233104	-20.79
T_Game Show	-0.244726	-21.71
sched_time_22:30	-0.251273	-22.22
NW_22	-0.268289	-23.53
sched_time_20:30	-0.300379	-25.95
T_Reality	-0.557917	-42.76

# Plot important coefficients
imp_coefs['percentage change in odds'].plot(kind = "barh")
plt.title("Percentage change in odds with Ridge regularization")
plt.show()

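[Figure: horizontal bar chart, "Percentage change in odds with Ridge regularization", for the 20 selected features]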

df_shows[df_shows['network_id'] == 309]

	status	rating	weight	updated	name	language	premiered	summary	runtime	type	...	sched_time_23:00	sched_time_23:02	sched_time_23:15	sched_time_23:30	sched_time_unknown	country_code	country_name	country_tz	network_id	network_name
279	Running	8.7	40	2017-08-22 19:26:37	I Live with Models	English	2015-02-23 00:00:00	tommy heads to new york city with scarlet as t...	30	Scripted	...	0.0	0.0	0.0	0.0	0.0	GB	United Kingdom	Europe/London	309	Comedy Central
535	To Be Determined	6.5	0	2016-01-13 15:12:17	Brotherhood	English	2015-06-02 00:00:00	twentysomethings dan and toby are in over thei...	30	Scripted	...	NaN	NaN	NaN	NaN	NaN	GB	United Kingdom	Europe/London	309	Comedy Central

2 rows × 104 columns

# Get list of features and re-run model with just the 20 most important features
imp_features = imp_coefs.index

# Set up X and y
X = shows_clean[imp_features]
y = shows_clean['winner'].values

# Baseline
winner_avg = y.mean()
baseline = max(winner_avg, 1-winner_avg)
print baseline

0.655270655271

# Test Train Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print X_train.shape,  len(y_train)
print X_test.shape,  len(y_test)

(263, 20) 263
(88, 20) 88
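
The 20 features were chosen using coefficients from a model fit on the earlier training split, and the data is then re-split at random, so some rows in the new test set helped pick the features. One way to avoid this, sketched below on synthetic data, is to do the selection inside a `Pipeline` with `SelectFromModel`, so it is repeated on the training portion of every fold. The dataset, the `C` value, and the step names are assumptions for illustration. The sketch also keeps the 20 largest coefficients by absolute value, which differs slightly from the notebook's 10 most positive plus 10 most negative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 240-column feature matrix.
X, y = make_classification(n_samples=351, n_features=240, n_informative=10,
                           random_state=0)

# threshold=-np.inf with max_features=20 keeps the 20 features with the
# largest absolute coefficients, whatever their size.
selector = SelectFromModel(
    LogisticRegression(C=0.07, solver='liblinear'),
    threshold=-np.inf, max_features=20)

pipe = Pipeline([('scale', StandardScaler()),
                 ('select', selector),
                 ('clf', LogisticRegression(solver='liblinear'))])

# Scaling and selection are re-fit on the training part of each fold.
scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(scores.mean(), scores.std())

pipe.fit(X, y)
print(pipe.named_steps['select'].get_support().sum())
```

Scores obtained this way are likely to be somewhat lower than those from selecting features on the full dataset first, and are a fairer estimate of performance on unseen shows.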

#  Standardize - 

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

# prepare configuration for cross validation test harness
seed = 42

# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFST', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))
models.append(('ADA', AdaBoostClassifier()))
models.append(('SVM', SVC()))
models.append(('GNB', GaussianNB()))
models.append(('MNB', MultinomialNB()))
models.append(('BNB', BernoulliNB()))


# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'

print "\n{}:   {:0.3} ".format('Baseline', baseline)
print "\n{:5.5}:  {:10.8}  {:20.18}  {:20.17}  {:20.17}".format\
        ("Model", "Features", "Train Set Accuracy", "CrossVal Accuracy", "Test Set Accuracy")

for name, model in models:
    try:
        kfold = KFold(n_splits=3, shuffle=True, random_state=seed)
        cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        this_model = model
        this_model.fit(X_train,y_train)
        print "{:5.5}     {:}         {:0.3f}               {:0.3f} +/- {:0.3f}         {:0.3f} ".format\
                (name, X_train.shape[1], metrics.accuracy_score(y_train, this_model.predict(X_train)), \
                 cv_results.mean(), cv_results.std(), metrics.accuracy_score(y_test, this_model.predict(X_test)))
    except Exception:
        print "    {:5.5}:   {} ".format(name, 'failed on this input dataset')

        
                
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
ax.axhline(y=baseline, color='grey', linestyle='--')
plt.show()

Baseline:   0.655 

Model:  Features    Train Set Accuracy    CrossVal Accuracy     Test Set Accuracy   
LR        20         0.905               0.874 +/- 0.010         0.909 
LDA       20         0.901               0.890 +/- 0.020         0.875 
QDA       20         0.669               0.448 +/- 0.082         0.682 
KNN       20         0.909               0.905 +/- 0.005         0.886 
CART      20         0.943               0.894 +/- 0.019         0.898 
RFST      20         0.939               0.909 +/- 0.000         0.886 
GB        20         0.943               0.905 +/- 0.020         0.898 
ADA       20         0.916               0.897 +/- 0.032         0.920 
SVM       20         0.863               0.848 +/- 0.014         0.818 
GNB       20         0.616               0.673 +/- 0.057         0.614 
MNB       20         0.886               0.886 +/- 0.010         0.898 
BNB       20         0.905               0.901 +/- 0.010         0.920

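[Figure: "Algorithm Comparison" box plot of cross-validation accuracy per model, with the baseline drawn as a dashed grey line]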
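
For readers on Python 3 and a current scikit-learn, the comparison loop can be written more compactly. The sketch below is a rewrite under stated assumptions, not the code that produced the table above: it uses synthetic data, only three of the twelve classifiers, and `StratifiedKFold` in place of `KFold`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

# Synthetic stand-in for the 20 selected features.
X, y = make_classification(n_samples=351, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = [('LR', LogisticRegression(solver='liblinear')),
          ('RFST', RandomForestClassifier(random_state=0)),
          ('ADA', AdaBoostClassifier(random_state=0))]

kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
results = {}
for name, model in models:
    cv_scores = cross_val_score(model, X_train, y_train,
                                cv=kfold, scoring='accuracy')
    model.fit(X_train, y_train)
    # Store CV mean, CV std, and held-out test accuracy for each model.
    results[name] = (cv_scores.mean(), cv_scores.std(),
                     model.score(X_test, y_test))
    print("{:5} {:.3f} +/- {:.3f}  {:.3f}".format(name, *results[name]))
```

Seeding the split and the tree-based models makes the table repeatable, which helps when the top few classifiers are separated by only a point or two of accuracy.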

Written on September 26, 2017

	gn_action	gn_adventure	gn_crime	gn_drama	...	gn_nature	gn_thriller	gn_war
0	0	0	0	0	...	1	0	0
1	1	0	0	1	...	0	0	1
2	0	0	0	0	...	1	0	0
3	0	1	0	1	...	0	0	0
4	0	0	1	1	...	0	1	0

	...	sched_time_22:00
0	...	0
1	...	0
2	...	0
3	...	0
4	...	1
