Channel: Plotly – Modern Data
Analyzing Plotly’s Python package downloads

In this post, we will collect and analyze download statistics for Plotly’s Python package available on PyPI. We will also compare the downloads with other interactive charting tools like Bokeh, Vincent, and MPLD3.

Data Collection

PyPI used to show download statistics for packages, but that service has been discontinued while the next-generation Python package repository, Warehouse, is being developed.

Linehaul will act as a statistics collection daemon for incoming logs from the new PyPI (warehouse). Right now, the current activity log on PyPI is being stored in a BigQuery database. (source: [Distutils] Publicly Queryable Statistics)

import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot

init_notebook_mode(connected=True)

We will use the gbq.read_gbq function to read the BigQuery dataset into pandas DataFrame objects.

import pandas as pd
from pandas.io import gbq

import numpy as np

We will use the linregress function to fit a linear regression line to the scatter plots.

from scipy.stats import linregress

To create a new project, read the post Using Google BigQuery with Plotly and Pandas.

project_id = 'sixth-edition-678'

This query collects the day, month, package name, and total download count for a given package, grouped on a daily basis.

daily_download_query = """
SELECT
  DATE(timestamp) as day,
  MONTH(timestamp) as month,
  file.project,
  COUNT(*) as total_downloads,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    TIMESTAMP("20120701"),
    CURRENT_TIMESTAMP()
  )
WHERE
  file.project = '{0}'
GROUP BY
  day, file.project, month
ORDER BY
  day asc
"""

The following function runs the query and returns a DataFrame object if successful.

def package_df(package):
    """ Return the query result as a pandas.DataFrame object
    
    param: package(str): Name of the package on PyPI
    """
    
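    try:
        df = gbq.read_gbq(daily_download_query.format(package), project_id=project_id)
        return df
    except Exception as exc:
        # keep the underlying error message instead of raising a bare IOError
        raise IOError("Query for package '{0}' failed: {1}".format(package, exc))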

We will construct a separate DataFrame for each package.

plotly_df = package_df('plotly')
bokeh_df = package_df('bokeh')
matplotlib_df = package_df('matplotlib')
mpld3_df = package_df('mpld3')
vincent_df = package_df('vincent')

Inspecting for missing data

Using a simple timedelta calculation, we can check whether any rows are missing from the DataFrame.

from datetime import datetime, timedelta

# Number of rows in the DataFrame
actual_rows = len(plotly_df)

start_date = datetime.strptime(plotly_df.iloc[0]['day'], '%Y-%m-%d') # 2016-01-22
end_date = datetime.strptime(plotly_df.iloc[actual_rows - 1]['day'], '%Y-%m-%d') # 2016-08-29

# Expected rows if there was no missing data (day)
expected_rows = (end_date - start_date).days + 1

if actual_rows != expected_rows:
    print("{0} rows are missing in the DataFrame.".format(expected_rows - actual_rows))

We find that there are no rows from 2016-03-06 to 2016-05-21.
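The row-count check above only tells us how many days are missing, not which ones. A minimal sketch of one way to list the missing dates follows, using a small made-up DataFrame in place of plotly_df; the helper name find_missing_days and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for plotly_df (hypothetical data): daily rows with a three-day gap
toy_df = pd.DataFrame({
    'day': ['2016-03-04', '2016-03-05', '2016-03-09', '2016-03-10'],
    'total_downloads': [120, 135, 150, 160],
})

def find_missing_days(df, column='day'):
    """Return the days between the first and last row that have no entry."""
    present = pd.DatetimeIndex(pd.to_datetime(df[column]))
    # every calendar day between the first and last observed day
    full_range = pd.date_range(present.min(), present.max(), freq='D')
    # days in the full range that never appear in the data
    missing = full_range.difference(present)
    return [d.strftime('%Y-%m-%d') for d in missing]

missing_days = find_missing_days(toy_df)
print(missing_days)
```

Applied to the real data, the first and last entries of such a list would give the boundaries of the gap.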

Data Transformation

Here, we will append rows with zero downloads for the missing dates to each DataFrame.

missing_data_start_date = '2016-03-06'
missing_data_end_date = '2016-05-21'

# starting/ending date for missing data and time difference (1 day)
s = datetime.strptime(missing_data_start_date, '%Y-%m-%d')
e = datetime.strptime(missing_data_end_date, '%Y-%m-%d')
diff = timedelta(days=1)

# generate all the missing dates in the same format
missing_dates = []
missing_dates_month = []

while (s <= e):
    missing_dates.append(s.strftime('%Y-%m-%d'))
    missing_dates_month.append(s.month)  # integer month; also correct for two-digit months (10-12)
    s += diff
    
missing_row_count = len(missing_dates) # 77

We use the pandas.concat function to append a DataFrame holding the missing dates to the original DataFrame.

The following function returns the updated DataFrame after sorting it (sort_values) by the values in the column ‘day’.

def append_missing_data(dataframe, package):
    """Append the missing dates DataFrame to a given DataFrame
    
    param: dataframe(pandas.DataFrame): DataFrame to append
    param: package(str): Name of package on PyPI
    """
    
    missing_dates_df = pd.DataFrame({'day': missing_dates,
                                    'month': missing_dates_month,
                                    'file_project': [package for i in range(missing_row_count)],
                                    'total_downloads': [0 for i in range(missing_row_count)]}
                                   )
    
    # append the missing rows; sorting by day below restores chronological order
    new_df = pd.concat([dataframe, missing_dates_df])
    
    return new_df.sort_values('day')

The DataFrames are updated with zero-download rows for the missing dates.

bokeh_df = append_missing_data(bokeh_df, 'bokeh')
matplotlib_df = append_missing_data(matplotlib_df, 'matplotlib')
mpld3_df = append_missing_data(mpld3_df, 'mpld3')
plotly_df = append_missing_data(plotly_df, 'plotly')
vincent_df = append_missing_data(vincent_df, 'vincent')
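As an alternative to building the missing rows by hand, the same gap filling can be sketched with a reindex over a full date range. This is not the approach used in this post; the function name fill_missing_days and the toy data are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for a package DataFrame (hypothetical data) with two missing days
toy_df = pd.DataFrame({
    'day': ['2016-03-04', '2016-03-05', '2016-03-08'],
    'month': [3, 3, 3],
    'file_project': ['plotly'] * 3,
    'total_downloads': [120, 135, 150],
})

def fill_missing_days(df, package):
    """Return a copy of df with one row per calendar day; gaps get 0 downloads."""
    indexed = df.copy()
    indexed.index = pd.DatetimeIndex(pd.to_datetime(indexed['day']))
    full_range = pd.date_range(indexed.index.min(), indexed.index.max(), freq='D')
    # reindexing inserts NaN rows for every day that was absent
    filled = indexed.reindex(full_range)
    # rebuild the descriptive columns from the index so the new rows are complete
    filled['day'] = filled.index.strftime('%Y-%m-%d')
    filled['month'] = filled.index.month
    filled['file_project'] = package
    filled['total_downloads'] = filled['total_downloads'].fillna(0).astype(int)
    return filled.reset_index(drop=True)

filled_df = fill_missing_days(toy_df, 'plotly')
print(filled_df)
```

Because the date range is derived from the data itself, this variant does not need the gap boundaries to be hard-coded.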

Package Downloads Comparison (daily)

trace1 = go.Scatter(
    x=plotly_df['day'],
    y=plotly_df['total_downloads'],
    name='Plotly',
    mode='lines',
    line=dict(width=0.5,
              color='rgb(10, 240, 10)'),
    fill='tonexty'
)

trace2 = go.Scatter(
    x=bokeh_df['day'],
    y=bokeh_df['total_downloads'],
    name='Bokeh',
    mode='lines',
    line=dict(width=0.5,
              color='rgb(42, 77, 20)'),
    fill='tonexty'
)

trace3 = go.Scatter(
    x=mpld3_df['day'],
    y=mpld3_df['total_downloads'],
    name='MPLD3',
    mode='lines',
    line=dict(width=0.5,
              color='rgb(20, 33, 61)'),
    fill='tonexty'
)

trace4 = go.Scatter(
    x=vincent_df['day'],
    y=vincent_df['total_downloads'],
    name='Vincent',
    mode='lines',
    line=dict(width=0.5,
              color='rgb(0, 0, 0)'),
    fill='tonexty'
)

data = [trace1, trace2, trace3, trace4]

layout = go.Layout(
    title='Package Downloads Comparison (Daily)',
    showlegend=True,
    xaxis=dict(
        type='category',
        showgrid=False
    ),
    yaxis=dict(
        title='No. of downloads (daily)',
        type='linear',
        range=[1, 10000]
    ),
    plot_bgcolor='rgba(250, 250, 250, 1)',
    shapes=[
        dict(
            type='line',
            xref='x',
            yref='y',
            x0='45',
            y0='2000',
            x1='120',
            y1='2000'
        )
    ],
    annotations=[
        dict(
            x=75,
            y=2400,
            xref='x',
            yref='y',
            text="PyPI's stats collection service was down from March 6 to May 21",
            showarrow=False
        ),
        dict(
            x=115,
            y=9600,
            xref='x',
            yref='y',
            text='From Jan 22, 2016 To Aug 29, 2016',
            showarrow=False
        ),
        dict(
            x=121,
            y=2000,
            xref='x',
            yref='y',
            text="",
            showarrow=True,
            ay=0,
            ax=-5
        ),
        dict(
            x=45,
            y=2000,
            xref='x',
            yref='y',
            text="",
            showarrow=True,
            ay=0,
            ax=5
        )
    ]
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Package Downloads Comparison (Monthly)

The dataset was created on Jan 22, 2016. We will use these months on the x-axis.

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug']

We use pandas’ groupby method to group the rows by their month value and then sum the daily counts to get the total downloads for each month.

trace1 = go.Bar(x=months, y=plotly_df.groupby('month').sum()['total_downloads'], name='Plotly')
trace2 = go.Bar(x=months, y=vincent_df.groupby('month').sum()['total_downloads'], name='Vincent')
trace3 = go.Bar(x=months, y=bokeh_df.groupby('month').sum()['total_downloads'], name='Bokeh')
trace4 = go.Bar(x=months, y=mpld3_df.groupby('month').sum()['total_downloads'], name='MPLD3')

data = [trace1, trace2, trace3, trace4]

layout = go.Layout(
    barmode='group',
    title="Package Downloads Comparison (PyPI)",
    yaxis=dict(
        title='No. of downloads (monthly)'
    ),
    xaxis=dict(
        title='Month'
    ),
    annotations=[
        dict(
            x=3,
            y=0,
            xref='x',
            yref='y',
            text="PyPI's stats collection service<br>was down from March 6 to May 21",
            showarrow=True,
            arrowhead=2,
            ax=0,
            ay=-150
        ),
        dict(
            x=3.7,
            y=90000,
            xref='x',
            yref='y',
            text='From Jan 22, 2016 To Aug 29, 2016',
            showarrow=False
        )
    ]
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
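The monthly aggregation behind the bar traces above can be tried on a small frame. This is a minimal sketch with made-up numbers; toy_df and monthly_totals are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical daily rows spanning two months
toy_df = pd.DataFrame({
    'day': ['2016-01-30', '2016-01-31', '2016-02-01', '2016-02-02'],
    'month': [1, 1, 2, 2],
    'total_downloads': [100, 150, 200, 50],
})

# Group the daily rows by month, then add up the daily counts within each group
monthly_totals = toy_df.groupby('month')['total_downloads'].sum()
print(monthly_totals)
```

Selecting the total_downloads column before calling sum keeps the string columns (such as day) out of the aggregation, which gives the same totals as indexing the column after sum.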

Growth of Plotly package downloads

Following the tutorial Linear fit in Python, we will try to find an approximate regression line for the scatter graph of Plotly package’s downloads.

xvals = np.arange(0, len(plotly_df))

The following traces plot Plotly’s daily downloads as scatter points, split into the periods before and after the gap in the data.

trace1 = go.Scatter(
    x=xvals[:44], 
    y=plotly_df['total_downloads'].iloc[:44], 
    mode='markers',
    marker=go.Marker(color='rgb(255, 127, 14)',size=5,symbol='x'),
    name='Plotly Downloads'
)

trace2 = go.Scatter(
    x=xvals[121:], 
    y=plotly_df['total_downloads'].iloc[121:],
    mode='markers',
    marker=go.Marker(color='rgb(255, 127, 14)',size=5,symbol='x'),
    name='Plotly Downloads',
    showlegend=False
)

# linear regression line for Plotly package downloads
pslope, pintercept, pr_value, pp_value, pstd_err = linregress(xvals, plotly_df['total_downloads'])
plotly_line = pslope*xvals + pintercept

trace3 = go.Scatter(
    x=xvals, 
    y=plotly_line, 
    mode='lines',
    marker=go.Marker(color='rgb(10, 20, 30)'),
    name='Plotly Regression Line',
    line=dict(
        color='rgba(10, 10, 10, 1)',
        width=1,
        dash='longdashdot'
    )
)

layout = go.Layout(
    title='Linear Regression Line for Plotly\'s Package Downloads Growth',
    yaxis = dict(
        title='No. of downloads (daily)'
    ),
    xaxis = dict(
        title='# days'
    ),
    annotations=[
        dict(
            x=85,
            y=2000,
            xref='x',
            yref='y',
            text="<b>Y = 13.29X - 282.55</b>",
            showarrow=False
        )
    ]
)

data = [trace1, trace2, trace3]

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Similarly, we can find the approximate growth line for ‘Matplotlib’.

mslope, mintercept, mr_value, mp_value, mstd_err = linregress(xvals, matplotlib_df['total_downloads'])
matplotlib_line = mslope*xvals + mintercept

Daily download counts for ‘Matplotlib’ currently range around 7000-8000.

How much time will it take for Plotly to reach that level?

Using Plotly’s growth line equation Y = 13.29X − 282.55, we can estimate the number of days needed for daily downloads to reach 8000.

Setting Y = 8000 and solving for X with these rounded coefficients gives X ≈ 623.2, so the line reaches 8000 on day 624 (rounded up). The current day index is 220 as of Aug 29, 2016.

That means it will take about 404 days (from Aug 29, 2016) for Plotly to reach the current download range of Matplotlib.
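The arithmetic behind this estimate can be reproduced in a few lines. The sketch below uses the rounded slope and intercept quoted in the text rather than the full-precision values returned by linregress, so it is only an approximation of the notebook's result; the variable names are illustrative.

```python
import math

# Rounded coefficients of the fitted line Y = 13.29X - 282.55 (from the text)
slope = 13.29
intercept = -282.55

target_downloads = 8000   # upper end of Matplotlib's current daily range
current_day_index = 220   # Aug 29, 2016, counting Jan 22, 2016 as day 0

# Solve target = slope * X + intercept for X, rounding up to a whole day
days_to_target = int(math.ceil((target_downloads - intercept) / slope))
days_remaining = days_to_target - current_day_index

print(days_to_target)   # 624
print(days_remaining)   # 404
```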

# linear regression line for Plotly package downloads
pslope, pintercept, pr_value, pp_value, pstd_err = linregress(xvals, plotly_df['total_downloads'])
plotly_line = pslope*xvals + pintercept

trace1 = go.Scatter(
    x=xvals, 
    y=plotly_line, 
    mode='lines',
    marker=go.Marker(color='rgb(10, 20, 30)'),
    name='Plotly Regression (Actual)',
    line=dict(
        color='rgba(10, 10, 10, 1)',
        width=1,
        dash='longdashdot'
    )
)

future_xvals = np.arange(221, 221 + 404)

trace2 = go.Scatter(
    x=future_xvals, 
    y=pslope*future_xvals+pintercept, 
    mode='lines',
    marker=go.Marker(color='rgb(10, 20, 30)'),
    name='Plotly Regression (Prediction)',
    line=dict(
        color='rgba(10, 10, 10, 1)',
        width=1,
        dash='dot'
    )
)

layout = go.Layout(
    title='Prediction for Plotly\'s Package Downloads Growth',
    yaxis = dict(
        title='No. of downloads (daily)'
    ),
    xaxis = dict(
        title='# days'
    ),
    annotations=[
        dict(
            x=85,
            y=2000,
            xref='x',
            yref='y',
            text="<b>Y = 13.29X - 282.55</b>",
            showarrow=False
        ),
        dict(
            x=400,
            y=7800,
            xref='x',
            yref='y',
            text="Current download range for Matplotlib",
            showarrow=False
        )
    ],
    shapes=[
        dict(
            type='line',
            xref='x',
            yref='y',
            x0=0,
            y0=8000,
            x1=624,
            y1=8000,
            line=dict(
                color='rgba(10, 10, 10, 1)',
                width=1,
                dash='solid'
            )
        ),
        dict(
            type='line',
            xref='x',
            yref='y',
            x0=624,
            y0=0,
            x1=624,
            y1=8000,
            line=dict(
                color='rgba(10, 10, 10, 1)',
                width=1,
                dash='solid'
            )
        )
    ]
)

data = [trace1, trace2]

fig = go.Figure(data=data, layout=layout)
iplot(fig)

The IPython Notebook for this analysis is available here, Analyzing Plotly’s Python package downloads.
