Preface

As a new international graduate student without much work experience in North America, I know that starting a career is never easy. But I still strongly believe that I will receive my dream offer someday. I always believe.

The following notes are mainly based on the book Python for Data Analysis (2nd Edition) written by Wes McKinney, which is available online.

Chapter 1

Overview

An overview of Python and the common libraries, plus an outline of how the book's content is organized.

Chapter 2

Python Basics, IPython and Jupyter Notebooks

  • IPython is an enhanced Python interpreter
  • Type ipython on the command line to launch ipython
  • Jupyter Notebook was created within the IPython project
  • Type jupyter notebook on the command line to launch Jupyter Notebook
  • In IPython or Jupyter Notebooks, use <tab> key to see all variables sharing the same beginning
  • Typing ? after variable names will display some general info
  • Typing ?? will show the source code if possible
  • Try np.*load*? for yourself
  • Try %run and %load for yourself
  • Other magic commands in IPython include: %timeit, %debug, %pwd, refer to Page 29 of the book
  • Type %matplotlib in the IPython shell so that plot windows don't interfere with the console session; in Jupyter Notebook, type %matplotlib inline
  • Python uses indentation (spaces or tabs) to structure code instead of braces (as in R)
  • After b=a, both names refer to the same object, so an in-place change to a (such as append(), as opposed to assigning a new object) is also visible through b, and vice versa
  • isinstance(object, type(s)) returns True or False
  • Try object.<tab>
  • Page 37: Binary operators and comparisons
  • Strings and tuples(contained in ()) are immutable but most of the rest is mutable(can be modified)
  • Scalar types: None, str, bytes, float, bool, int (See Page 39 for explanation)
  • Use single quotes ' or double quotes " for strings, use triple quotes ''', """ for multiline strings
  • To put a literal backslash in a string, double it: 'pytho\\n!' contains a backslash followed by n, whereas 'pytho\n!' will output pytho <new line> !
  • Add r before a string if there are a lot of \
  • Strings can be added together to generate a new sentence
  • Try to import datetime package and use datetime, date and time
  • strftime('date time format') could format a datetime as a string
  • dt.replace(minute = 0) could replace the minute as 0
  • Subtracting one datetime from another gives a datetime.timedelta; adding a timedelta to a datetime gives a new, shifted datetime (two datetimes cannot be added together), see P 44 for more details regarding the datetime
  • Try if-elif-else
  • for loop could iterate over a collection (like a list or a tuple, even a string) or an iterator
  • Use continue to skip some specific values in a loop
  • Use break to exit a loop early when encountering some specific values; in nested loops it only exits the innermost loop
  • Try while loops
  • pass in Python is the no-op statement, which can be left as a placeholder
  • range is often used for iterating through sequences by index
  • Try True if condition else False (Ternary expression)
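
A few of the points above (shared references after b=a, datetime arithmetic, and break in nested loops) can be sketched as follows. The variable names and values are made up for illustration only.

```python
from datetime import datetime, timedelta

# b = a binds a second name to the same list; no copy is made.
a = [1, 2, 3]
b = a
a.append(4)              # in-place change, also visible through b
same_object = b is a

# Rebinding a to a new object leaves b pointing at the old list.
a = [0]

# datetime - datetime gives a timedelta; datetime + timedelta gives a datetime.
dt = datetime(2011, 10, 29, 20, 30, 21)
delta = datetime(2011, 11, 15, 22, 30) - dt
shifted = dt + delta
one_day_later = dt + timedelta(days=1)
on_the_hour = dt.replace(minute=0, second=0)
stamp = dt.strftime('%m/%d/%Y %H:%M')

# break leaves only the innermost loop; the outer loop keeps running.
pairs = []
for i in range(3):
    for j in range(3):
        if j > i:
            break
        pairs.append((i, j))
```

Note that replace() returns a new datetime rather than changing dt, since datetime objects are immutable.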

Chapter 3

Built-in Data Structures, Functions, Files

  • Python’s workhorse structures: tuples, lists, dicts and sets
  • Use comma-separated sequence to define a tuple or a nested tuple
  • Use tuple() to convert any sequence(such as lists and strings) or iterators to a tuple
  • A tuple is immutable: once it is created, which object sits in each slot cannot be changed, although an object in a slot can still be modified in place if it is itself mutable (eg, a list)
  • + could concatenate tuples/lists
  • Assign a tuple to a tuple-like expression, Python will unpack the tuple
  • Use a, b, *rest = tup to pluck a few elements from the beginning of a tuple and collect the remainder in rest; name it *_ to discard the unwanted values
  • tuple.count(object) counts the occurrences of object in the tuple
  • A list is like a tuple but mutable, use list() to convert
  • list.append() adds to the end; list.insert(index,object) inserts the object at the given index; list.pop(index) removes and returns the object at the given index
  • Use (not) in to check whether the list(or tuple) contains the object
  • list.extend(another list) used to extend one existing list, and it’s faster than using + to concatenate two lists and thus preferable
  • Use list.sort() to sort a list in place; tuples have no sort method because they are immutable
  • bisect.bisect() returns the location where an element should be inserted to keep the list sorted; bisect.insort() actually inserts the element into that location
  • Index [] is to slice the list, try list[1:], list[-1:], list[-3:-1] and list[-1:-3], list[::2], list[::-1] for yourself
  • Slicing method also works for tuples
  • enumerate(list/tuple) function returns the index and the corresponding value in a list/tuple at each time, usually used in a for loop and used to compute a dict
  • sorted(list/tuple) returns a new sorted list, whatever the type of the input sequence
  • zip pairs up the elements of a number of lists, tuples, or other sequences into tuples (wrap it in list() to get a list of tuples), and it can also be used to unzip a sequence with zip(*list), see Page 61
  • reversed iterates over a list/tuple in reverse order; it returns an iterator, so it has to be materialized (eg, with list())
  • A dict is composed of key-value pairs, and both are Python objects (strings, lists…), though keys must be hashable, so a list cannot be a key
  • Use (not) in to check whether a key is in the dict
  • del dict[key] to delete the specific key and its value
  • Use dict.pop(key) to delete the key and return its value
  • list(dict.keys()) shows all keys in dict, list(dict.values()) show all values
  • dict.update({another dict}) to update/add the dict
  • dict(zip(seq1, seq2)) can be used to generate a dict, try
  • dict.get(key, default) help you to get the corresponding key’s value, otherwise returns the default value if the key doesn’t exist
  • Check Page 64 for some practical functions regarding dict
  • Keys in a dict must be hashable, which generally means immutable (scalars, strings, or tuples of immutable objects); the hash() function tells you whether an object is hashable
  • Set is an unordered collection of unique elements, like {1,2,3}. It can be created via set(list/tuple) function or via a set literal with curly braces, just like a dict without values, only keys
  • The mathematical set operators are available here, see Page 66
  • Set elements are generally immutable as well
  • set1.issubset(set2) and set1.issuperset(set2) are used to check whether set1 is the subset/superset of set2
  • Try [expr for x in collection if condition] for yourself (list)
  • Try {key-expr : value-expr for value in collection if condition} for yourself (dict)
  • Try {value-expr for value in collection if condition} for yourself (set)
  • Check Page 68 for nested list comprehensions, which are just a more concise way to write nested for loops
  • Functions can contain multiple return statements and can return multiple values (packed as a tuple)
  • Functions are objects
  • Check re.sub() for yourself
  • str.strip() removes leading and trailing whitespace, str.title() converts a string to title case
  • Anonymous (lambda) functions consist of a single expression, such as f = lambda x: x**2
  • Using yield in a function is to create a generator (iterable object)
  • Another way is a generator expression, which looks like a list comprehension but within parentheses, (expr for x in collection); we can materialize it with list/tuple/set/dict as we want
  • Check itertools library in Python, see Page 77
  • Try try/except for yourself (like if/else); we can also name specific exception types after except, such as ValueError or (TypeError, ValueError)
  • Try file = open(path) for yourself, add file.close() after finishing our work
  • Check Page 82 for more details about dealing with files in standard Python libraries
  • The last subsection is about bytes and unicode
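
Here is a small sketch that pulls together several of the built-in data structure notes above: unpacking, sorted/reversed/zip, bisect, dicts, and comprehensions. All names and sample values are illustrative assumptions, not the book's exact code.

```python
import bisect

# Unpacking, with *rest capturing the remainder as a list.
a, b, *rest = (1, 2, 3, 4, 5)

# A tuple's slots are fixed, but a mutable object inside can still change.
tup = ('foo', [1, 2], True)
tup[1].append(3)

# sorted() always returns a new list; reversed() and zip() return iterators.
ordered = sorted((7, 1, 2, 6))
backwards = list(reversed(range(4)))
pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens')]
first_names, last_names = zip(*pitchers)   # "unzip"

# bisect finds the insertion point that keeps a sorted list sorted.
c = [1, 2, 2, 3, 4, 7]
position = bisect.bisect(c, 5)
bisect.insort(c, 6)

# dicts: build from two sequences, look up with a default, pop a key.
mapping = dict(zip(range(3), ['a', 'b', 'c']))
missing = mapping.get(10, 'n/a')
removed = mapping.pop(0)

# List, dict, and set comprehensions, plus a generator expression.
words = ['a', 'as', 'bat', 'car', 'dove', 'python']
upper = [w.upper() for w in words if len(w) > 2]
lengths = {w: len(w) for w in words}
unique_lengths = {len(w) for w in words}
total = sum(x ** 2 for x in range(5))
```

The generator expression passed to sum() never builds an intermediate list, which is the main reason to prefer it over a list comprehension there.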

Chapter 4

NumPy Basics

Reasons for using Numpy:

  • Written in the C language and uses much less memory than built-in Python sequences
  • Perform complex computation on entire arrays without the need for Python for loops

4.1

ndarray

  • Try data = np.random.randn(2,3)
  • Try data.dtype and data.shape
  • np.array() create arrays, or convert other types to arrays
  • np.ones(), np.zeros() and np.empty() return corresponding arrays, see Page 90 for more array types
  • Page 91 for Numpy data types
  • Use astype() function to convert data type in Numpy
  • Batch operations can be applied to whole arrays without for loops; we call this vectorization
  • The slicing array is the view on the original array, thus a change in the slicing array will lead to a change in the original array. To avoid this, add .copy() after slicing
  • For multi-dim arrays, use array[1,2,:] to slice/select. In a word, the slicing method is quite similar to that for Python lists
  • Use boolean indexing (an array of True/False values with the same length as the axis) to select from an array
  • Comparison operator != has the same effect as negating the condition ~(A == B)
  • and and or don’t work for boolean arrays, use & and | instead
  • With fancy indexing (indexing with integer arrays) we can pick rows/cols in whatever order we like, see Page 102
  • Try np.arange(30).reshape(5,6), np.arange(60).reshape(3,4,5) for yourself
  • array.T transposes a 2-d array, np.dot() computes the inner matrix product
  • For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes, such as arr.transpose((1,0,2))
  • check arr.swapaxes(0,1) for yourself
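
The view-versus-copy behaviour and the indexing styles above can be sketched like this. The arrays are small made-up examples, assuming NumPy is imported as np.

```python
import numpy as np

arr = np.arange(10)

# A slice is a view: writing through it changes the original array.
view = arr[5:8]
view[:] = 12

# .copy() gives independent data, so arr is not touched here.
independent = arr[5:8].copy()
independent[:] = 0

# Boolean indexing; combine masks with & and |, negate with ~.
names = np.array(['Bob', 'Joe', 'Will', 'Bob'])
values = np.array([10, 20, 30, 40])
mask = (names == 'Bob') | (names == 'Will')
picked = values[mask]
not_bob = values[~(names == 'Bob')]

# Fancy indexing selects rows in any order (and copies the data).
grid = np.arange(30).reshape(5, 6)
rows = grid[[4, 0]]

# transpose takes a tuple of axis numbers for higher-dimensional arrays.
cube = np.arange(60).reshape(3, 4, 5)
permuted = cube.transpose((1, 0, 2))
```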

4.2

Array Functions

  • Try np.sqrt(arr), np.exp(arr), np.maximum(arr1, arr2) (arr1 and arr2 must have the same shape or be broadcastable to a common one)
  • np.modf(arr) is a vectorized version of the built-in Python divmod; it returns the fractional and integral parts of an array as two arrays
  • Refer to Page 107 for more functions
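
A quick sketch of the unary and binary functions mentioned above, using made-up input values:

```python
import numpy as np

arr = np.array([1.5, -2.25, 4.0])

# modf splits each value into fractional and integral parts;
# both parts carry the sign of the input.
fractional, integral = np.modf(arr)

# maximum compares element by element; a scalar argument is broadcast.
larger = np.maximum(np.array([1, 5, 3]), np.array([4, 2, 3]))
clipped = np.maximum(arr, 0)

roots = np.sqrt(np.array([0.0, 4.0, 9.0]))
```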

4.3

Array-Oriented Programming with Arrays

  • Try np.meshgrid(arr1, arr1) for yourself
  • np.where is similar to x if condition else y in standard Python, used like np.where(cond, arr1, arr2). At each slot, if condition is true then pick arr1, otherwise take arr2. Note, the last two arguments don’t have to be arrays or same length, both or one of them could be scalar
  • Try arr.cumsum() and arr.cumprod() for yourself
  • Argument axis=1 means compute XXX across the columns; the number of answers should equal the number of rows
  • Argument axis=0 means compute XXX down the rows; the number of answers should equal the number of columns
  • Try basic statistical methods for yourself, see Page 112
  • any and all are especially useful for boolean arrays: any checks whether there is any True value; all checks whether every value in the array is True
  • Arrays can also be sorted. arr.sort() will modify arr itself, but the top-level method np.sort(array) returns a copy of sorted array instead of modifying it
  • We could use sorted array to find the quantile number, sorted_array[int(Quantile * len(sorted_array))]
  • np.unique(object) returns the sorted unique values, basically the same as sorted(set(object))
  • np.in1d(arr1, arr2) checks whether the value in arr1 exists in arr2, also works for other types, eg lists and tuples, returns a boolean array
  • Refer to Page 115 for more array set operations
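
The following sketch walks through np.where, the axis argument, sorting, and the set-like helpers from this section. The data is made up, and np.isin is used in place of np.in1d because newer NumPy releases deprecate the latter; that swap is my assumption about what runs most widely, not the book's code.

```python
import numpy as np

cond = np.array([True, False, True])
chosen = np.where(cond, np.array([1, 2, 3]), -1)   # scalar for the "else" side

arr = np.array([[1, 2, 3],
                [4, 5, 6]])
row_sums = arr.sum(axis=1)    # across the columns: one result per row
col_sums = arr.sum(axis=0)    # down the rows: one result per column
running = arr.cumsum(axis=0)

bools = np.array([False, False, True])
has_true, all_true = bools.any(), bools.all()

data = np.array([3, 1, 2, 3, 1])
sorted_copy = np.sort(data)   # data itself is left unchanged
uniques = np.unique(data)
member = np.isin(data, [2, 3])

# Crude quantile lookup on sorted values.
values = np.sort(np.arange(100, 0, -1))
q05 = values[int(0.05 * len(values))]
```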

4.4 - 4.8

  • Check Page 115 for saving and loading data with Numpy
  • * is the element-wise product, np.dot or x.dot(y) executes the matrix dot product
  • x @ y also executes a matrix dot product
  • Try inv(inverse) and qr(QR decomposition) in numpy.linalg for matrices, check Page 117 for more functions
  • Use np.random.seed(1234) to set NumPy's GLOBAL random generation seed; np.random.RandomState(1234) creates a separate generator that is isolated from the global one, see Page 119 for more np.random functions
  • Page 120 gives a simple application to simulate (multiple) random walks
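
A minimal sketch of the linear algebra and random number notes above. The matrices are made up, and the random walk is only a rough version of the idea on Page 120, not the book's exact code.

```python
import numpy as np
from numpy.linalg import inv, qr

x = np.array([[1., 2.], [3., 4.]])
y = np.array([[0., 1.], [1., 0.]])

elementwise = x * y
matrix_product = x @ y           # same as x.dot(y) or np.dot(x, y)

identity = x @ inv(x)
q, r = qr(x)

# A RandomState is a private generator, isolated from the global seed.
rng_a = np.random.RandomState(1234)
rng_b = np.random.RandomState(1234)
draws_a = rng_a.randn(5)
draws_b = rng_b.randn(5)

# One random walk: the cumulative sum of +1/-1 steps.
steps = np.where(rng_a.randint(0, 2, size=1000) > 0, 1, -1)
walk = steps.cumsum()
```

Two generators built from the same seed produce the same draws, which is what makes results reproducible without touching the global state.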

Chapter 5

Pandas Basics

About Pandas:

  • Often used with numerical computing tools like Numpy and Scipy, analytical libraries like statsmodels and scikit-learn, data visualization libraries like matplotlib
  • Designed for working with tabular or heterogeneous data, but Numpy is mainly for homogeneous numerical array data
  • Open-sourced in 2010, and the project has grown to over 800 distinct contributors

5.1

pandas Data Structures

  • Two workhorse data structures: Series and DataFrame
  • A Series is formed by a sequence of values (values) and an associated array of data labels (index); the index is optional, and a default integer index from 0 to N-1 is used when none is given
  • Try obj = pd.Series(arr), and then obj.values, obj.index by yourself
  • obj[ind] returns the corresponding value, try obj[[ind1,ind2,ind3]] by yourself
  • A Series can also be thought of as a fixed-length, ordered dict that maps each index label to a single value
  • Many NumPy functions also apply to pd.Series objects, and many dict operations also work for pd.Series objects
  • pd.Series(dict) uses the dict's keys as the index; pd.Series(dict, index) follows the order of the labels in index instead. A label in index with no matching key in the dict gets NaN; a key in the dict that is not in index is dropped
  • Check pd.isnull(obj) and pd.notnull(obj) by yourself
  • Try obj1 + obj2, suppose they are similar but not the same index
  • Each series object and its index have a name attribute, eg. obj.name = 'population', obj.index.name = 'state'
  • DataFrame is a rectangular table of data with an ordered collection of columns (NOTE: the book said the columns will be sorted automatically, but when I ran the same code on my PC, they were not, most likely because newer pandas keeps the dict's insertion order. The Python version is 3.7.4 and I am using Jupyter Notebook)
  • DataFrame has both a row and column index, and can be viewed as a dict of Series sharing the same index
  • You can use pd.DataFrame(data, columns=['col3','col2','col1']) to customize the column order
  • If you pass a column that isn’t contained in the dict, it will appear with missing values NaNs
  • Try df['col1'] and df.col1 yourself and see output
  • Try df.loc[row_index]
  • df['col1']=.. or df.col1=.. can be used to assign a list/array to an existing column
  • If we assign a Series to a column, then its labels will be realigned exactly to the DataFrame’s index, inserting NaN in any holes
  • New column cannot be created with df.col syntax
  • del df['col1'] delete col1 from df, also doesn’t work for df.col1
  • The column returned from indexing a DataFrame is a view on the underlying data, which means any in-place change to the series will be reflected in the DataFrame. Try copy() function
  • df.T to transpose
  • Compared with Series, DataFrame has one more df.columns.name = ... attribute
  • Check Page 134 for more possible data inputs to DataFrame
  • Index objects are immutable, which makes it safe to share Index objects among data structures
  • Pandas Index can contain duplicate labels, and there are also a number of related methods, see Page 136
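
The Series and DataFrame construction rules above can be sketched as follows. The state and population values are illustrative sample data rather than anything to rely on.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

# Labels in the index but not in the dict get NaN; dict keys that are
# not in the index ('Utah') are dropped.
obj = pd.Series(sdata, index=states)
obj.name = 'population'
obj.index.name = 'state'

data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}
# 'debt' is not in the dict, so it starts out as a column of NaN.
frame = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three'])

# Assigning a Series aligns on the index and leaves NaN in the holes.
frame['debt'] = pd.Series([-1.2, -1.5], index=['two', 'four'])

# New columns need the [] syntax, and so does del.
frame['eastern'] = frame['state'] == 'Ohio'
del frame['eastern']
```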

5.2

Essential Functionality

  • obj.reindex([new index]) re-orders the obj by new index and introduce missing values if there is no corresponding index
  • obj.reindex(columns = [new column index]) used to re-order columns, see Page 138 for more details
  • Try obj.drop('index1') and obj.drop(['index1', 'index2'])
  • To drop columns in a DataFrame object, pass columns=[...] instead of index labels, or add axis=1 or axis='columns'
  • obj.drop(index, inplace=True) will manipulate the object in-place without returning a new object
  • The slicing method is similar to NumPy arrays, except that we can also slice with index labels, not only integer positions
  • When slicing with labels, the result includes the end label, unlike slicing with integer positions
  • On a DataFrame, plain [] selects columns (or rows when given a slice or a boolean array), but it cannot select a subset of rows and columns at once
  • loc (labels) and iloc (integer positions) enable us to select a subset of rows and columns from a DataFrame, see Page 144 for more indexing options with loc and iloc
  • When the index holds integers, use loc and iloc explicitly, since plain [] is ambiguous between labels and positions
  • Try series1 + series2 yourself; index alignment works like an outer join in a database, so labels present on only one side come out as NaN (thus there are often a lot of NaNs in the output)
  • To avoid NaNs, try df1.add(df2, fill_value=0), fill_value argument can also be used in other places
  • All basic arithmetic methods (+,-,*,/) have a counterpart starting with the letter r, which reverses the arguments, eg. 1/df1 equals df1.rdiv(1), see Page 149 for more details
  • When an array +/- a row/column, the computation would be performed for every row/col in the array, which is called broadcasting, same for the computation between DataFrame and Series
  • df.apply(func, axis) applies function func to each column (the default) or, with axis='columns', to each row
  • Check % (format code) yourself
  • Use df.applymap(func) to apply func to every element in df
  • obj.sort_index(axis, ascending) returns a new object sorted by label; obj.sort_values() sorts a Series by its values, and any missing values are sorted to the end of the Series by default
  • Use df.sort_values(by=...) to sort a DataFrame by one or more columns
  • Try obj.rank() yourself, then add method='first' as an argument into it and see what happened
  • Check Page 156 for more methods
  • obj.index.is_unique returns whether obj’s labels are unique
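
Below is a short sketch of the essential functionality listed above: reindexing, dropping, label versus position selection, aligned arithmetic, apply, ranking, and sorting. All objects are small made-up examples.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=['d', 'b', 'a'])
reindexed = obj.reindex(['a', 'b', 'c', 'd'])      # 'c' becomes NaN
dropped = obj.drop(['d', 'b'])

letters = pd.Series(range(4), index=['a', 'b', 'c', 'd'])
label_slice = letters['b':'c']                     # end label is included

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['x', 'y', 'z'], columns=['one', 'two', 'three'])
without_two = frame.drop('two', axis='columns')
by_label = frame.loc['y', ['one', 'three']]
by_position = frame.iloc[1, [0, 2]]

# Arithmetic aligns on labels; fill_value stands in for a missing side.
s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])
plain = s1 + s2
filled = s1.add(s2, fill_value=0)
reciprocal = s1.rdiv(1)                            # same as 1 / s1

# Broadcasting a row across a DataFrame, then apply per column.
demeaned = frame - frame.iloc[0]
spread = frame.apply(lambda col: col.max() - col.min())

ranked = pd.Series([7, -5, 7, 4]).rank(method='first')
by_value = frame.sort_values(by='two', ascending=False)
```

With method='first', ties are broken by the order in which the values appear, instead of sharing an average rank.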

5.3

Descriptive Statistics

  • Most basic mathematical and statistical methods in Pandas are similar to Numpy, and they have built-in handling for missing value, see Page 159 & 160
  • df.corrwith(series1/df1) returns pairwise correlations between a DataFrame’s cols and another series/df
  • Try series.unique() yourself
  • obj.value_counts() returns the unique values and their counts, pd.value_counts(obj.values) also works for arrays
  • obj.isin(target_list) checks whether the value is in the target_list, returns a boolean series, is often used for filtering
  • Index.get_indexer gives you an index array from an array of possibly non-distinct values into another array of distinct values, refer to Page 164 for more details
  • Apply data.apply(pd.value_counts).fillna(0) to your data and see what will happen
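
A sketch of the unique-value and membership helpers above, on made-up data. I call the value_counts method on each column through a lambda rather than passing pd.value_counts, because the top-level function is deprecated in recent pandas; treat that as my substitution.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
counts = obj.value_counts()
mask = obj.isin(['b', 'c'])
subset = obj[mask]

# get_indexer maps possibly repeated values onto positions in an
# index of distinct values.
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
positions = pd.Index(['c', 'b', 'a']).get_indexer(to_match)

# Per-column value counts, with 0 where a value never occurs.
frame = pd.DataFrame({'Qu1': [1, 3, 4], 'Qu2': [2, 3, 1]})
histogram = frame.apply(lambda col: col.value_counts()).fillna(0)
```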

Chapter 6

Data Loading, Storage, and File Formats

  • Focus on using pandas
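  • Input and output are mainly categorized as: reading and writing files (text formats such as CSV and JSON, plus Excel), interacting with Web APIs, and loading data from databases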

6.1 - 6.2

  • Page 167 provides tons of methods (eg. read_csv() and read_table()) to read tabular data as a DataFrame
  • df = pd.read_csv('filepath') read the specific csv file
  • df = pd.read_csv('filepath', sep=',') specify the delimiter
  • pd.read_csv('examples/ex2.csv', header=None) will let pandas assign default column names {0, 1, 2, …}
  • pd.read_csv('examples/ex2.csv', names = ['a', 'b', 'c', 'd']) Specify the column names by yourself
  • result = pd.read_csv('examples/ex5.csv', na_values=['NULL']) treats the string NULL as a missing value
  • Page 172 lists some frequently used options in read_csv() and read_table()
  • Try pd.options.display.max_rows = 10 and then print a DataFrame, see what happens
  • data.to_csv('out.csv'), where data could be a DataFrame or a Series
  • JSON (short for JavaScript Object Notation) data is one of the standard formats for sending data by HTTP
  • Use json.loads(json_string) to convert a JSON string to Python form (json.load(file) reads from an open file instead)
  • pd.read_json() assumes each object in the JSON array is a row in the table
  • xlsx = pd.ExcelFile('filepath') and then pd.read_excel(xlsx, 'Sheet1') to load a sheet
  • writer = pd.ExcelWriter('examples/ex2.xlsx') and then frame.to_excel(writer, 'Sheet1')
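
To try the read_csv options above without the book's example files, here is a sketch that reads from an in-memory string. The CSV text, the column names, and the 'unknown' sentinel are all made up for illustration.

```python
import io
import json
import pandas as pd

# An in-memory stand-in for a CSV file that has no header row.
raw = "1,2,3,hello\n5,unknown,7,world\n"

# header=None makes pandas assign the default column names 0, 1, 2, ...
default_names = pd.read_csv(io.StringIO(raw), header=None)

# names sets the column names; na_values marks extra strings as missing.
named = pd.read_csv(io.StringIO(raw), names=['a', 'b', 'c', 'message'],
                    na_values=['unknown'])

# to_csv with no path returns the CSV text instead of writing a file.
text = named.to_csv(index=False)

# json.loads turns a JSON string into Python objects.
obj = json.loads('{"name": "Wes", "pets": null, "siblings": [{"age": 30}]}')
siblings = pd.DataFrame(obj['siblings'])
```

Wrapping the text in io.StringIO gives read_csv a file-like object, so the same calls work unchanged once a real file path is passed.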

6.3 - 6.4

  • Use the requests package to interact with Web APIs, see Page 187
  • Use sqlalchemy package to access SQL databases, see Page 190, then use pd.read_sql()

Since my main purpose is to learn how to do data analysis and I don't have much time right now, I will only skim this chapter for now.

Chapter 7

Data Cleaning and Preparation

7.1

Handling Missing Data

  • All descriptive statistics in Pandas exclude missing value by default
  • Missing value for numeric data in pandas is represented as NaN, a floating-point value
  • Try dropna, fillna, isnull and notnull yourself
  • dropna for a DataFrame will drop any row containing a missing value by default
  • Adding how='all' will only drop rows where every value is NA; thresh=n keeps only the rows with at least n non-NA values
  • df.fillna(0, inplace=True) fills NA with 0 and modifies the existing DataFrame instead of returning a new one
  • df.fillna(df.mean()) fills NA with each column's mean value, see Page 197 for more details
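
The dropna and fillna options above can be compared side by side on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])

any_na_dropped = df.dropna()             # only fully populated rows survive
all_na_dropped = df.dropna(how='all')    # drops only the all-NaN row
at_least_two = df.dropna(thresh=2)       # keeps rows with >= 2 non-NA values

zero_filled = df.fillna(0)               # returns a new object
mean_filled = df.fillna(df.mean())       # each column's own mean

# inplace=True modifies the object it is called on.
in_place = df.copy()
in_place.fillna(0, inplace=True)
```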

7.2

Data Transformation

  • data.duplicated() returns a boolean Series indicating whether each row is a duplicate (i.e. has already been seen in any earlier row)
  • data.drop_duplicates() drops the duplicate; data.drop_duplicates(['col1']) drops the duplicate according to the col1; add keep='last' will keep the last duplicate instead of the first, which is the default
  • dict could be viewed as a mapping in python
  • Using map is a convenient way to perform element-wise transformations and other data cleaning-related operations
  • data.replace(value1, value2) replace the value1 in data with value2, where value1 could be a list
  • We can also use map to modify a DataFrame's index, but rename is more convenient; rename returns a modified copy and leaves the original unchanged unless you add inplace=True
  • pd.cut(values, bins) returns a Categorical object; bins lists the bin edges used to divide your data (or, as an integer, the number of equal-width bins)
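
A sketch of the transformation tools above on made-up data. The names (food, meat_to_animal, readings, ages) are illustrative, and I replace the sentinels with 0.0 purely to keep the example easy to check.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two', 'one', 'two', 'one'],
                     'k2': [1, 1, 1, 3, 1]})
dup_flags = data.duplicated()
deduped = data.drop_duplicates()
last_per_k1 = data.drop_duplicates(['k1'], keep='last')

# map with a dict performs an element-wise lookup.
food = pd.Series(['bacon', 'pastrami', 'honey ham'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'honey ham': 'pig'}
animal = food.map(meat_to_animal)

# replace swaps sentinel values; the first argument may be a list.
readings = pd.Series([1.0, -999.0, 2.0, -1000.0])
cleaned = readings.replace([-999.0, -1000.0], 0.0)

# rename returns a modified copy unless inplace=True is passed.
renamed = data.rename(index=str, columns=str.upper)

# cut assigns each value to a bin; bins are right-closed by default.
ages = [20, 22, 25, 27, 61]
cats = pd.cut(ages, bins=[18, 25, 35, 60, 100])
```

Because the bins are right-closed, the value 25 falls in (18, 25] rather than (25, 35].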

TBC