List of Skills Taught:



SAS


Interface

  • Introduction to the SAS interface

Basics

  • Comments in SAS
  • Permanent & temporary folders
  • Introduction to data steps
  • Introduction to procedures
  • Viewing and printing data
  • Basic functions & operators
  • If-then statements
  • Creating indicator variables
  • Keep & drop statements
  • Where statements
  • Missing data
  • Date basics
  • Time intervals
  • Sorting data
  • Duplicate observations
  • Creating lag and lead variables
  • Creating count variables
  • Creating sub-samples
  • Stacking data
  • Merging with a data step
  • Merging with Proc SQL
  • Merging and dates
  • Summarizing data using Proc Means
  • Summarizing data using Proc SQL
  • Calculating buy-and-hold raw returns
  • Ranking variables
  • Creating groups or portfolios
  • Regrouping across time
  • Summarizing data with groups or portfolios

Outliers

  • Winsorizing & trimming
  • Robust regression

MACRO Basics

  • Creating MACRO variables
  • Creating and running MACROS
  • MACRO repositories

WRDS

  • Remote connecting to WRDS
  • Running code at WRDS
  • Pulling Compustat data
  • Pulling CRSP data
  • Merging Compustat and CRSP
  • Uploading & downloading data

Common Statistics Procedures

  • Proc Means, Univariate, Corr, Ttest, and Npar1way
  • Proc reg (along with fixed effects, clustered errors, predicted and residual values)
  • Proc surveyreg, logistic, surveylogistic, genmod, qlim, and robustreg



Python


Basics

  1. Setting up Python
  2. Anaconda installation
  3. Common libraries (pandas, numpy)
  4. Basic commands
  5. Basic markdown
  6. Reading in, importing, exporting data
  7. Data types, dataframes and data configuration
  8. Merging data, stacking data
  9. Subsetting, dropping data
  10. Cleaning data, missing values, backfilling, winsorize/truncate
  11. Exploring data, descriptive statistics
  12. Variables: creating variables, dropping variables, leads and lags, dealing with dates, ranked variables
  13. Grouping, frequency tables
  14. Basic regression

WRDS

  1. Installing the WRDS library
  2. Connecting to WRDS via Python
  3. Employing SQL within Python
  4. Acquiring Compustat data
  5. Creating common financial statement variables
  6. Acquiring CRSP data
  7. Calculating cumulative and buy-and-hold abnormal returns
  8. Merging Compustat and CRSP

Web scraping

  1. Introduction to HTML structure
  2. Identifying HTML XPaths
  3. Scraping with Scrapy
  4. Splitting HTML as text
  5. Pandas read_html
  6. Interacting with a web browser using Selenium (sending text to search boxes, clicking buttons, extracting page source)
  7. Scraping EDGAR
  8. EDGAR file management and master IDX files
  9. Introduction to EDGAR XBRL data
  10. Scraping Yahoo! Finance

Robotic process automation

  1. Obtaining and validating user input
  2. Automating the keyboard (basic keyboard functions, special keys)
  3. Automating the mouse (position, movement, clicking, scrolling)
  4. Basics of operating system file management
  5. Deleting files
  6. Copying files
  7. Moving files
  8. Listing files in a directory
  9. Creating folders

Textual analysis

  1. Introduction to natural language processing
  2. Text functions (replace, strip, upper, lower, count, join, find, startswith, endswith)
  3. Introduction to regular expressions (match, search, split, findall, sub, groups)
  4. Tokenization (word_tokenize, sent_tokenize, regexp_tokenize)
  5. Text pre-processing (stemming, stop words, punctuation)
  6. Counting words
  7. Calculating disclosure tone
  8. Calculating the Fog Index

Advanced Python

  1. Machine learning
  2. Intro to machine learning
  3. What is machine learning?
  4. Types of machine learning models: supervised vs. unsupervised
  5. Applications of machine learning in practice
  6. Applications of machine learning in academic research
  7. The basic machine learning process
  8. Logistic regression
  9. Introduce machine learning concepts for predicting binary dependent variables
  10. Predict default using financial statement and stock price data
  11. True negatives, true positives, false negatives, false positives
  12. Accuracy vs. sensitivity vs. specificity
  13. statsmodels vs. scikit-learn in Python
  14. Exercise: predict employee attrition
  15. Linear regression
  16. Introduce machine learning concepts for continuous dependent variables
  17. Predict CDS spreads using financial statement and stock price data
  18. Evaluating the prediction
  19. Exercise: predict future revenue growth
  20. Random forest classification
  21. Introduction to decision trees
  22. From decision trees to random forest (including a discussion of overfitting)
  23. RandomForestClassifier from scikit-learn
  24. Feature importance
  25. Exercise: predict employee attrition
  26. Random forest regression
  27. Difference between random forest classification and random forest regression
  28. RandomForestRegressor from scikit-learn
  29. Exercise: predict future revenue growth
  30. Neural network classification
  31. What are neural networks? (e.g., input layer, hidden layers, output layer)
  32. MLPClassifier from scikit-learn
  33. Standardizing input variables
  34. Permutation feature importance
  35. Exercise: predict banking customer churn
  36. Neural network regression
  37. Difference between neural network classification and neural network regression
  38. MLPRegressor from scikit-learn
  39. Exercise: Predict housing prices
  40. Support vector classification
  41. The basics of support vector machines (SVM) (e.g., hyperplanes, data separation, margin, kernels, decision boundaries, cost parameter, support vectors)
  42. SVC from scikit-learn
  43. Standardizing input variables
  44. Exercise: predict banking customer churn
  45. Support vector regression
  46. Difference between support vector classification and support vector regression
  47. SVR from scikit-learn
  48. Cherkassky and Ma (2004) method for obtaining epsilon and C parameters
  49. Exercise: Predict housing prices
  50. Cross validation and grid search
  51. What is cross validation?
  52. Holdout cross validation
  53. K-fold cross validation
  54. Data independence in K-fold cross validation
  55. Rolling window cross validation for panel data
  56. Hyperparameter tuning with grid search
  57. Exercise: predict default using cross validation techniques
  58. Machine learning and textual analysis introduction
  59. Machine learning with a large number of independent variables (e.g., words or ngrams)
  60. CountVectorizer from scikit-learn
  61. Stop words
  62. Stemming using the Porter Stemmer algorithm
  63. Creating matrices for use in the machine learning models
  64. Feature importance
  65. Exercise: predict restaurant review ratings based on comments
  66. Machine learning and textual analysis – TF-IDF
  67. An introduction to term frequency-inverse document frequency (tf-idf)
  68. TfidfVectorizer from scikit-learn
  69. Exercise: predict restaurant review ratings based on comments using tf-idf
  70. Machine learning and textual analysis – Creating customized vectorizers
  71. Limitations with built-in vectorizers
  72. How to create your own customizable vectorizer
  73. Sparse data
  74. Running textual analysis machine learning models on documents stored in your file system
  75. Exercise: predict restaurant review ratings based on a customized vectorizer
  76. Latent Dirichlet Allocation
  77. Unsupervised machine learning model
  78. What is Latent Dirichlet Allocation (LDA) and how does it work?
  79. gensim
  80. Exercise: Implement LDA on a set of earnings conference call transcripts
  81. Machine learning project
  82. Use the machine learning models taught in the course to predict daily abnormal stock returns using conference call transcripts
  83. ChatGPT
  84. Introduction to ChatGPT
  85. What is ChatGPT and why you should use it in coding
  86. An overview with examples of how to use ChatGPT to enhance your coding (e.g., to write, optimize, and debug code)
  87. Use ChatGPT to write code
  88. Step by step guide to writing prompts and interacting with responses to write code
  89. Example: We ask ChatGPT to write code to extract Apple’s net income from its 2022 10-K on the SEC’s EDGAR site.
  90. Exercise: Use ChatGPT to write code to extract the tone of words spoken by analysts on earnings conference calls.
  91. Use ChatGPT to optimize code
  92. Multiple examples of inefficient code with prompts to ChatGPT to optimize and remove inefficiencies.
  93. Exercise: Write code to answer basic coding questions. Then use ChatGPT to optimize any inefficiencies in your code
  94. Use ChatGPT to debug code
  95. Multiple examples of coding errors with prompts to ChatGPT to debug and remove these errors.
  96. Integrated Development Environments
  97. Introduction to Integrated Development Environments (IDEs)
  98. Python IDLE
  99. Examples of IDEs for Python
  100. Reasons for using a more sophisticated IDE
  101. Visual Studio Code
  102. Overview of the Visual Studio Code environment
  103. Explorer Window
  104. Extensions
  105. Color Themes
  106. Command Palette
  107. Commands and features (e.g., IntelliSense, multiple cursors, clicking functions, warnings, errors)
  108. Debugging
  109. Jupyter Notebook Extension
  110. Markdown vs. code cells
  111. Debugging within cells
  112. Keyboard shortcuts and video tutorials
  113. PyCharm
  114. PyCharm Installation
  115. Overview of the PyCharm Environment
  116. Projects explorer
  117. Debugging
  118. Find Action
  119. Color Themes
  120. Commands and features (e.g., IntelliSense, multiple cursors, clicking functions, wrapping code, extract method)
  121. Local history to restore deleted code



Stata

Basics

  1. Understanding the Stata environment
  2. Reading files from SAS and Python
  3. Viewing a dataset
  4. Generating variables
  5. Replacing variables
  6. Using local variables
  7. Keeping and dropping variables
  8. Sorting data
  9. Using for loops

Model Output

  1. Running basic regression models
  2. Creating Excel output tables using outreg2
  3. Modifying Excel output table options
  4. Adding statistics to Excel output tables (Adj. R2, Pseudo R2)
  5. Appending additional columns
  6. Adding column headers
  7. Excluding variables from Excel output tables