Python & R for Excel Pros

Andy Goldberg, Centerfield Nine

CAP-COOP, June 2018

History of Python

  • "Scripting" language, released 1991
  • GitHub: #2 most popular programming language (JavaScript)
  • Capable of running web sites (Django, Flask)
  • Easy, readable syntax: indents, not braces
  • 2 → 3 was controversial, but now almost every major package is compatible with 3
  • Not originally designed for numeric computing
    • NumPy & Pandas made Python capable

History of R

  • Open-source implementation of 'S', always a numeric language
  • "Competes" with SAS, SPSS, MATLAB
  • Created in a university environment, extremely popular among academics, 12,000+ packages
  • Can't do much else, can't run a website
    • "Shiny" can deliver R as web reports
  • Very strong with complex graphics

Differences from Excel

  • Working primarily with tabular data
  • No visual interface!
  • Runs top-down or in predictable order
  • Testable, debuggable, replicable, auditable
  • Highly unusual to touch the data itself
  • Free & cross-platform
  • Libraries allow you to choose functionality
    • Sort of like options on a car
    • Example: "requests" needed to make web calls

If Excel had libraries...

Core application Included packages Desired packages
  • Spreadsheet interface
  • Common cell functions
  • File ops / printing
  • Basic charts
  • Sort/filter
  • Conditional Formatting
  • Statistical cell functions
  • Solver
  • Multiple tabs
  • "Fit to Page" printing
  • Panel charts
  • Cryptography
  • JSON read/write
  • customizable CSV
  • Write your own!

Yes, Excel has add-ins

  • Complex and very difficult to build
  • No centralized directory
  • No install manager
  • Mostly viable for high-end, niche functionality
    • Stock trading analysis
    • Currency exchange
    • Monte Carlo simulations

No, Excel is NOT obsolete

  • Visual presentation is still super important
  • Rapid building/prototyping
  • Speed is fine with < 100k cells/rows
  • Exploring with PivotTables
  • Total control over print layout
  • Lots of what-if capability
    • Data Table, Goal Seek, Solver, Scenarios

But Python/R "excels" at

  • Reading/writing data to/from other sources
  • Tabular data sets
  • Fast processing of thousands/millions of values or rows
  • Repeatable analytics
  • Sophisticated graphics & charts
  • Can be integrated into website/app
  • High CPU/memory needs? Run in cloud
    • ex: AWS m5.4xlarge (64gb RAM) = 15-39 cents/hr

Develop a toolset

  • Excel
  • SQL
  • Python or R or both
  • Tableau
  • Visualization / presentation
  • "Power" Excel (PowerQuery, PowerBI, DAX)
  • Cloud tools?
  • Big data: Hadoop, Spark, Kafka, Druid...
  • Specialty or company specific tools

Where can we take this?

  • Machine learning
    • Predict offer responders
  • Text analysis
    • Correlate comments with service ratings/NPS
  • Big data
    • Instead of daily results, spin-by-spin
    • Stream player activity in real-time
  • Real-time reports
    • Dashboards, web, mobile
    • Interactive, what-if & sensitivity

Getting set up

  • Official Python distro
    • Install specific libraries with 'pip'
    • IPython/Jupyter notebooks (web UI)
    • Spyder or Rodeo IDEs
  • RStudio IDE, Tinn-R, Deducer, Stagraph
    • can also run Jupyter notebooks
  • Anaconda or Tidyverse distros
    • Prepackaged ecosystems for data science
  • Jupyter via Cloud services
    • PythonAnywhere, Crestle, PaperSpace

Further reading

Thanks, Questions?

Let's do some live coding!